A Lexicon-Corpus-based Unsupervised Chinese Word Segmentation Approach



International Journal on Smart Sensing and Intelligent Systems

Professor Subhas Chandra Mukhopadhyay

Exeley Inc. (New York)

Subject: Computational Science & Engineering, Engineering, Electrical & Electronic


eISSN: 1178-5608



Volume 7, Issue 1 (March 2014)


Lu Pengyu * / Pu Jingchuan / Du Mingming / Lou Xiaojuan / Jin Lijun

Keywords : Chinese word segmentation, lexicon-based, corpus-based, word frequency, natural language processing

Citation Information : International Journal on Smart Sensing and Intelligent Systems. Volume 7, Issue 1, Pages 263-282, DOI: https://doi.org/10.21307/ijssis-2017-655

License : (CC BY-NC-ND 4.0)

Published Online: 27-December-2017



This paper presents a Lexicon-Corpus-based Unsupervised (LCU) approach to Chinese word segmentation. It combines the advantages of the lexicon-based and corpus-based approaches to identify out-of-vocabulary (OOV) words while keeping the segmentation of actual words in texts consistent. A Forward Maximum Fixed-count Segmentation (FMFS) algorithm is developed to identify phrases in the texts first. The detailed rules of LCU and the experimental results are also presented. Compared with a purely lexicon-based or purely corpus-based approach, LCU substantially improves Chinese word segmentation, especially in identifying n-char words. Two evaluation indexes are also proposed to describe the effectiveness of phrase extraction: segmentation rate (S) and segmentation consistency degree (D).
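For readers unfamiliar with the lexicon-based side of this comparison, the following is a minimal sketch of classic forward maximum matching, the greedy dictionary lookup that lexicon-based segmenters commonly build on. It is not the paper's FMFS algorithm or the LCU rules, which are not given on this page; the function name, the `max_len` parameter, and the toy lexicon are all assumptions made for illustration.

```python
def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation (illustrative, not the paper's FMFS).

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, chosen only to show a typical greedy mis-split.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_maximum_match("研究生命起源", toy_lexicon))
# → ['研究生', '命', '起源']
```

The greedy match takes 研究生 ("graduate student") and leaves 命 stranded, although 研究 / 生命 / 起源 is the intended reading. Errors of this kind, together with words missing from the lexicon altogether, are the usual motivation for adding corpus frequency information to a lexicon-based segmenter.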



