Gesellschaft für Informatik e.V.

Lecture Notes in Informatics


Natural Language Processing and Information Systems, 8th International Conference on Applications of Natural Language to Information Systems, June 2003, Burg (Spreewald), Germany. P-29, 120-126 (2003).

GI, Gesellschaft für Informatik, Bonn
2003


Editors

Antje Düsterhöft (ed.), Bernhard Thalheim (ed.)


Copyright © GI, Gesellschaft für Informatik, Bonn

Contents

Selection of representative documents for clusters in a document collection

Alexander Gelbukh, Mikhail Alexandrov, Ales Bourek and Pavel Makagonov

Abstract


An efficient way to explore a large document collection (e.g., the search results returned by a search engine) is to subdivide it into clusters of relatively similar documents, so that the user can get a general view of the collection and select the parts of particular interest. One way of presenting the clusters to the user is to select a representative document in each cluster. For different purposes this can be done in different ways. We consider three cases: selection of the average, the “most typical,” and the “least typical” document. We give algorithms that rely on a dictionary of keywords reflecting the topic of the user's interest. After clustering, we select a document in each cluster based on its closeness to the other documents. Different distance measures are discussed, and preliminary experimental results are presented. Our approach was implemented in the new version of the Document Classifier system.
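
The abstract does not specify the algorithms or the distance measures, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that each document is already represented as a vector of keyword frequencies over the dictionary, that cosine distance is used, that the "average" document is the one closest to the cluster centroid, and that the "most typical" and "least typical" documents are those with the smallest and largest total distance to the other documents in the cluster. The names cosine_distance and select_representatives are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two keyword-frequency vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A document with no dictionary keywords is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def select_representatives(cluster, distance=cosine_distance):
    """Return indices of three candidate representatives of one cluster.

    cluster: non-empty list of equal-length keyword-frequency vectors.
    The three definitions below are assumptions made for illustration.
    """
    n = len(cluster)
    dim = len(cluster[0])

    # Centroid: component-wise mean of all document vectors in the cluster.
    centroid = [sum(doc[k] for doc in cluster) / n for k in range(dim)]

    # Total distance from each document to all other documents in the cluster.
    totals = [
        sum(distance(cluster[i], cluster[j]) for j in range(n) if j != i)
        for i in range(n)
    ]

    return {
        # Assumed "average": the document closest to the centroid.
        "average": min(range(n), key=lambda i: distance(cluster[i], centroid)),
        # Assumed "most typical": smallest total distance to the others.
        "most_typical": min(range(n), key=lambda i: totals[i]),
        # Assumed "least typical": largest total distance to the others.
        "least_typical": max(range(n), key=lambda i: totals[i]),
    }


if __name__ == "__main__":
    # Four toy documents over a two-keyword dictionary (made-up counts).
    docs = [[4, 0], [3, 1], [2, 2], [0, 5]]
    print(select_representatives(docs))
```

The distance function is passed as a parameter because the abstract mentions that different distance measures are discussed; any other measure with the same two-argument signature could be substituted in this sketch.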



ISBN 3-88579-358-X

