Gesellschaft für Informatik e.V.

Lecture Notes in Informatics

Datenbanksysteme in Business, Technologie und Web (BTW 2007) P-103, 277-291 (2007).



Alfons Kemper (ed.), Harald Schöning (ed.), Thomas Rose (ed.), Matthias Jarke (ed.), Thomas Seidl (ed.), Christoph Quix (ed.), Christoph Brochhaus (ed.)


YAWN: A Semantically Annotated Wikipedia XML Corpus

Ralf Schenkel, Fabian Suchanek and Gjergji Kasneci


This paper presents YAWN, a system that converts the well-known and widely used Wikipedia collection into an XML corpus with semantically rich, self-explaining tags. We introduce algorithms to annotate pages and links with concepts from the WordNet thesaurus. The annotation process exploits Wikipedia's category information, which is a high-quality, manually assigned source of information, extracts additional information from lists, and utilizes the invocations of templates with named parameters. We give examples of how such annotations can be exploited for high-precision queries.


ISBN 978-3-88579-197-3
