Gesellschaft für Informatik e.V.

Lecture Notes in Informatics


Datenbanksysteme in Business, Technologie und Web (BTW 2007) P-103, 277-291 (2007).


2007


Editors

Alfons Kemper (ed.), Harald Schöning (ed.), Thomas Rose (ed.), Matthias Jarke (ed.), Thomas Seidl (ed.), Christoph Quix (ed.), Christoph Brochhaus (ed.)


Contents

YAWN: A semantically annotated Wikipedia XML corpus

Ralf Schenkel, Fabian Suchanek and Gjergji Kasneci

Abstract


The paper presents YAWN, a system to convert the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms to annotate pages and links with concepts from the WordNet thesaurus. This annotation process exploits categorical information in Wikipedia, which is a high-quality, manually assigned source of information, extracts additional information from lists, and utilizes the invocations of templates with named parameters. We give examples of how such annotations can be exploited for high-precision queries.
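
To make the idea of category-driven annotation concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a hypothetical hand-written table `CATEGORY_TO_CONCEPT` in place of a real WordNet lookup, and the function name `annotate_page`, the `article` and `body` element names, and the nesting of concept tags are all illustrative choices rather than details taken from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Wikipedia categories to concept names.
# Assumption: a real system would resolve these against WordNet instead.
CATEGORY_TO_CONCEPT = {
    "German physicists": "physicist",
    "Nobel laureates in Physics": "laureate",
}


def annotate_page(title, categories, text):
    """Wrap the page text in tags named after the concepts of its categories."""
    concepts = []
    for category in categories:
        concept = CATEGORY_TO_CONCEPT.get(category)
        # Skip categories without a known concept and avoid duplicate tags.
        if concept and concept not in concepts:
            concepts.append(concept)

    root = ET.Element("article", {"title": title})
    parent = root
    # Nest one element per concept (an illustrative layout, not YAWN's schema).
    for concept in concepts:
        parent = ET.SubElement(parent, concept)
    body = ET.SubElement(parent, "body")
    body.text = text
    return ET.tostring(root, encoding="unicode")


xml_output = annotate_page(
    "Albert Einstein", ["German physicists", "Unmapped category"], "Born in Ulm."
)
print(xml_output)
```

With tags of this kind in place, a structured query could in principle ask for pages enclosed in a `physicist` element rather than matching the word "physicist" in free text, which is the sort of high-precision query the abstract alludes to.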



ISBN 978-3-88579-197-3

