Experiences from developing the domain-specific entity search engine geneview
GeneView is a semantic search engine for the Life Sciences. Unlike traditional search engines, GeneView analyzes texts upon import to recognize and properly handle biomedical entities, relationships between those entities, and the structure of documents. This allows for a number of advanced features required to work effectively with scientific texts, such as entity disambiguation, ranking of documents by entity content, linking to structured knowledge about entities, userfriendly highlighting of entities etc. As of now, GeneView indexes approximately ~21,4M abstracts and ~358K full texts with more than 200M entities of 11 different types and more than 100K relationships. In this paper, we describe the architecture underlying the system with a focus on the complex pipeline of advanced NLP and information extraction tools necessary for achieving the above functionality. We also discuss open challenges in developing and maintaining a semantic search engine over a large (though not web-scale) corpus.
Full Text: PDF