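Abstract As Chinese search engine technology has developed, users have come to expect more accurate information retrieval. Search accuracy is therefore held to a higher standard, and the first step of a search is to segment the Chinese string entered by the user into words in a reasonable way.
The Chinese word segmentation system designed in this thesis consists of two modules: a module that builds the language model and a Viterbi algorithm module. The language model module applies a machine learning algorithm to an existing segmented training corpus to produce a dictionary and a language model. Existing language models include string-based, statistical, and understanding-based models. This thesis focuses on statistical language models and also compares the strengths and weaknesses of the different models. The Viterbi algorithm module uses the resulting language model to process the Chinese string to be segmented; it is a dynamic programming algorithm that can find the best segmentation relatively quickly.
Thesis keywords Chinese word segmentation Viterbi algorithm machine learning language model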
Graduation Project Report (Thesis): Foreign-Language Abstract
Title Chinese word segmentation techniques based on statistical language models and the Viterbi algorithm
Abstract With the development of Chinese search engine technology, users expect information retrieval to be increasingly accurate. Search accuracy is therefore held to a higher standard, and the first step of a search is to segment the string provided by the user into words in a reasonable way.
The Chinese word segmentation system designed here is divided into two modules: a module that builds the language model and a Viterbi algorithm module. The language model module trains on an existing segmented corpus with a machine learning algorithm to obtain a dictionary and a language model. Current language models include string-based, statistical, and understanding-based models. This thesis mainly studies statistical language models and also compares the pros and cons of each kind of model. The Viterbi algorithm module uses the resulting language model to process the Chinese string to be segmented; it is a dynamic programming algorithm that can find the best word sequence relatively quickly.
Keywords Chinese word segmentation Viterbi algorithm machine learning language model
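The two modules described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes a unigram model with add-one smoothing in place of the Katz-smoothed model the thesis builds, and the names `train_unigram`, `log_prob`, `segment`, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def train_unigram(segmented_sentences):
    """Count words in a pre-segmented corpus; return (counts, total)."""
    counts = Counter(w for sent in segmented_sentences for w in sent)
    return counts, sum(counts.values())

def log_prob(word, counts, total):
    # Add-one smoothing so unseen words keep a small nonzero probability.
    # This is an assumption for brevity; the thesis uses Katz smoothing.
    return math.log((counts.get(word, 0) + 1) / (total + len(counts) + 1))

def segment(text, counts, total, max_len=4):
    """Viterbi-style dynamic program over end positions.

    best[i] holds the highest log-probability of any segmentation
    of text[:i]; back[i] holds where the last word of it starts.
    """
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            # Candidates are dictionary words or single characters.
            if len(word) > 1 and word not in counts:
                continue
            score = best[j] + log_prob(word, counts, total)
            if score > best[i]:
                best[i], back[i] = score, j
    # Follow the back-pointers from the end to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

# Toy training corpus, invented for illustration only.
corpus = [["中文", "分词", "系统"], ["语言", "模型"], ["中文", "搜索", "引擎"]]
counts, total = train_unigram(corpus)
print(segment("中文分词", counts, total))
```

Because single characters are always allowed as fallback words, every position in the input is reachable and the back-pointer walk always terminates. Extending the sketch to a bigram model would require the dynamic program to track the previous word as part of its state.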
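1 Introduction
1.1 The concept of Chinese word segmentation
1.2 Applications of Chinese word segmentation
1.3 Current state of research on Chinese word segmentation
1.4 Main work of this thesis
2 Basic algorithms and language models for Chinese word segmentation
2.1 Segmentation algorithms based on string matching
2.2 Understanding-based segmentation
2.3 Statistical segmentation
2.4 Comparison of segmentation algorithms
2.5 Classical statistical language models
2.6 The Viterbi algorithm
3 System design and implementation
3.1 Building the language model, the dictionary, and Katz smoothing
3.2 Implementation of the language model
3.3 The dynamic programming process of the Viterbi algorithm
3.4 Implementation of the Viterbi algorithm
4 Experimental evaluation
Conclusion
Acknowledgments
References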
1 Introduction