A document exploring system on LDA topic model for Wikipedia articles
LE3 .A278 2016
Master of Science
Organizing and exploring millions of documents, papers and other text information has become a challenge for researchers and publishers. As machine learning techniques have developed quickly and come into wide use, a new text mining method called the topic model was proposed in 2003. The topic model is based on Latent Dirichlet Allocation (LDA) and has drawn much attention since it was introduced. The LDA topic model is a probabilistic model that processes text documents and reveals their hidden topics. Unlike other document processing methods, which work on content directly, the LDA topic model represents each document as a distribution over topics. The results are easier to understand, categorize and compare. Most importantly, topics make more sense to humans than structured machine formats. In this thesis, we briefly introduce the background of the LDA topic model and its working principles. We then explain in detail how to apply the LDA topic model to a text corpus through experiments on Simple Wikipedia documents. The experiments cover all necessary steps: data retrieval, pre-processing, model fitting and evaluation. The results show that the LDA topic model works effectively for document clustering and for finding similar documents. Based on the LDA topic model, we also propose a document exploring system that allows users to organize and explore documents by topic, so that related documents are easier to find and access.
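To illustrate why topic distributions are easy to compare, the following minimal sketch ranks documents by how close their topic distributions are. It is not taken from the thesis: the Hellinger distance, the function names and the four example distributions are all assumptions chosen for illustration, and a fitted LDA model would supply the real distributions.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def most_similar(query, doc_topics, k=2):
    """Return the names of the k documents whose topic distributions are closest to the query's."""
    target = doc_topics[query]
    scored = [
        (hellinger(target, dist), name)
        for name, dist in doc_topics.items()
        if name != query
    ]
    return [name for _, name in sorted(scored)[:k]]


# Hypothetical per-document topic distributions over three topics.
# In practice these would come from a fitted LDA model, not be written by hand.
doc_topics = {
    "Dog": [0.80, 0.15, 0.05],
    "Cat": [0.75, 0.20, 0.05],
    "Piano": [0.05, 0.10, 0.85],
    "Violin": [0.10, 0.05, 0.85],
}

print(most_similar("Dog", doc_topics, k=1))
```

With these made-up numbers, the closest document to "Dog" is "Cat", because both place most of their probability mass on the first topic. Any other distance between probability distributions could be substituted for the Hellinger distance.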
The author retains copyright in this thesis. Any substantial copying or any other actions that exceed fair dealing or other exceptions in the Copyright Act require the permission of the author.