Gesellschaft für Informatik e.V.

Lecture Notes in Informatics


Innovative Internet Community Systems I2CS 2009 P-148, 194-203 (2009).

Gesellschaft für Informatik, Bonn
2009


Copyright © Gesellschaft für Informatik, Bonn


Recommending related articles in Wikipedia via a topic-based model

Wongkot Sriurai, Phayung Meesad and Choochart Haruechaiyasak

Abstract


Wikipedia is currently the largest encyclopedia publicly available on the Web. In addition to keyword search and subject browsing, users can quickly reach articles by following the hyperlinks embedded within each article. The main drawback of this method is that some links to related articles may be missing from the current article. Moreover, a related article cannot be inserted as a hyperlink if no term describing it appears in the current article. In this paper, we propose an approach for recommending related articles based on the Latent Dirichlet Allocation (LDA) algorithm. By applying LDA to the anchor texts of each article, a set of diverse topics can be generated. An article can then be represented as a probability distribution over this topic set. Two articles with similar topic distributions are considered conceptually related. We performed an experiment on the Wikipedia Selection for Schools, a collection of 4,625 selected articles from Wikipedia. In an initial evaluation, the proposed method generated sets of recommended articles that were more relevant than the linked articles given on the test articles.

1 Introduction

Wikipedia is a well-known free-content encyclopedia written collaboratively by volunteers and sponsored by the non-profit Wikimedia Foundation [1]. The aim of the project is to develop a free encyclopedia in many different languages. At present, there are over 2,400,000 articles available in English and many more in other languages. The full Wikipedia, however, contains some articles that are unsuitable for children. In May 2007, SOS Children's Villages, the world's largest orphan charity, launched the Wikipedia Selection for Schools [2]. The collection contains 4,625 selected articles based on the UK National Curriculum and similar curricula elsewhere in the world. All articles in the collection have been cleaned up and checked for suitability for children.
The content of Wikipedia for Schools can be navigated by browsing a pictorial subject index or a title word index of all topics. Table 1 lists the first-level subject categories available in the collection. Organizing articles into subject categories gives users a convenient way to access articles on the same subject. Each article contains many hypertext links to other articles related to it. However, the links assigned by the authors of an article cannot fully cover all related articles. One reason is that the current article may contain no term describing a related article.

Table 1: The subject categories under the Wikipedia Selection for Schools.

Category                   Articles
Art                              74
Business Studies                 88
Citizenship                     224
Countries                       220
Design and Technology           250
Everyday life                   380
Geography                       650
History                         400
IT                               64
Language and literature         196
Mathematics                      45
Music                           140
People                          680
Religion                        146
Science                        1068

[1] Wikipedia. http://en.wikipedia.org/wiki/WikiPedia
[2] Wikipedia Selection for Schools. http://schools-wikipedia.org

Some previous works have identified this as the missing link problem and proposed methods for automatically generating links to related articles. J. Voss [Vo05] presented an analysis of a Wikipedia snapshot from March 2005. The study showed that Wikipedia links form a scale-free network and that the in-degree and out-degree distributions of Wikipedia pages follow a power law. S. Fissaha Adafre and M. de Rijke [FR05] presented an automated approach to finding related pages by exploring potential links in a wiki page. They proposed discovering missing links in Wikipedia pages via a clustering approach. The clustering process groups topically related pages using LTRank and then identifies link candidates by matching the anchor texts. Cosley et al.
[Co07] presented SuggestBot, software that performs intelligent task routing (matching people with tasks) in Wikipedia. SuggestBot uses broadly applicable strategies of text analysis, collaborative filtering, and hyperlink following to recommend tasks.

In this paper, we propose a method for recommending related articles in Wikipedia based on the Latent Dirichlet Allocation (LDA) algorithm. We use the dot product to compute the similarity between two topic distributions, each of which represents a different article. With this approach, we can find the relation between two articles and use it to recommend links for each article.

The rest of this paper is organized as follows. In Section 2, we describe the topic-based model for article recommendation. Section 3 presents experiments and discussion. Finally, we conclude our work and outline directions for future work in Section 4.

2 The Topic-Based Model for Article Recommendation

There have been many studies on discovering latent topics from text collections [SG06]. Latent Semantic Analysis (LSA) uses singular value decomposition (SVD) to map a high-dimensional term-by-document matrix to a lower-dimensional representation called the latent semantic space [De90]. However, SVD is designed for normally distributed data. Such a distribution is inappropriate for count data, which is what a term-by-document matrix consists of. LSA has been applied to a wide variety of learning tasks, such as search and retrieval [De90] and classification [Bi08]. Although LSA has achieved notable success, it has drawbacks such as overfitting and inappropriate generative semantics [BNJ03].

Due to the drawbacks of LSA, Latent Dirichlet Allocation (LDA) has been
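
The dot-product similarity between topic distributions mentioned in the introduction can be illustrated with a minimal sketch. It assumes each article already has a topic distribution (for example, produced by an LDA model). The article titles, the three-topic vectors, and the function names dot_product and recommend are hypothetical illustrations, not data or code from the paper.

```python
from typing import Dict, List, Tuple


def dot_product(p: List[float], q: List[float]) -> float:
    """Dot product of two topic distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("topic distributions must have the same length")
    return sum(a * b for a, b in zip(p, q))


def recommend(
    target: str,
    topic_dists: Dict[str, List[float]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rank all other articles by dot-product similarity to the target."""
    target_dist = topic_dists[target]
    scores = [
        (title, dot_product(target_dist, dist))
        for title, dist in topic_dists.items()
        if title != target
    ]
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Hypothetical toy distributions over three topics (assumed values,
# not taken from the paper's experiments).
toy_dists = {
    "Volcano": [0.8, 0.1, 0.1],
    "Earthquake": [0.7, 0.2, 0.1],
    "Mozart": [0.05, 0.05, 0.9],
    "Beethoven": [0.1, 0.1, 0.8],
}

recommendations = recommend("Volcano", toy_dists, top_n=2)
for title, score in recommendations:
    print(f"{title}: {score:.3f}")
```

In this toy setting, articles whose probability mass falls on the same topics as the target score highest, which matches the idea that similar topic distributions indicate conceptually related articles.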



ISBN 978-3-88579-242-9

