Teaching in IR

The special interest group for IR has proposed a curriculum for teaching IR in Computer Science.

Software for Teaching IR

Remarks:

  • The textual description of the software projects is largely taken from the specific website in order to present a brief overview.
  • Comments and suggestions are welcome. Please send an e-mail to Daniel Blank or Andreas Henrich.

Search engines and Search engine libraries ...

  • TERRIER (TERabyte RetrIEveR): Terrier is a highly flexible, efficient, effective, and robust search engine, readily deployable on large-scale collections of documents. Terrier implements state-of-the-art indexing and retrieval functionalities. Terrier provides an ideal platform for the rapid development of large-scale retrieval applications.
  • Apache Lucene: Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
  • Xapian: Xapian is an open source search engine library, released under the GPL. Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications. It supports the probabilistic information retrieval model and also supports a rich set of Boolean query operators.
  • GALAGO: Galago is a toolkit for experimenting with text search. It is based on small, pluggable components that are easy to replace and change, both during indexing and during retrieval.
  • Lemur Toolkit: The Lemur Toolkit is an open-source toolkit designed to facilitate research in language modeling and information retrieval. Lemur supports a wide range of industrial and research language applications such as ad-hoc retrieval, site-search, and text mining.
  • MG4J (Managing Gigabytes for Java): MG4J is a free full-text search engine for large document collections written in Java. With release 1.1, MG4J becomes a highly customisable, high-performance, full-fledged search engine providing state-of-the-art features (such as BM25 scoring) and new research algorithms.

... based on Apache Lucene: 

  • Compass: Compass is an open source project built on top of Lucene aiming at simplifying the integration of search into any Java application.
  • Apache Nutch: Nutch is open source web-search software. It builds on Lucene Java, adding web-specifics, such as a crawler, a link-graph database, parsers for HTML and other document formats, etc.
  • Apache Solr: Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat.


Further links pointing to open source search engine libraries:
http://www.searchtools.com/tools/tools-opensource.html

Libraries for large-scale, distributed indexing and data processing

  • Apache Hadoop: Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
  • Apache Mahout: Mahout's goal is to build scalable, Apache-licensed machine learning libraries (using Hadoop).
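
For teaching purposes, the MapReduce idea described for Hadoop above (dividing a job into many small blocks of work, processing each block independently, then combining the results) can be illustrated with a small word count. This is only a single-process sketch in plain Python; it does not use the Hadoop API or HDFS, and all function names (split_into_blocks, map_block, shuffle, reduce_counts, word_count) are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(documents, block_size):
    """Divide the input into small blocks of work (assumed fixed size)."""
    return [documents[i:i + block_size]
            for i in range(0, len(documents), block_size)]


def map_block(block):
    """Map step: emit a (term, 1) pair for every token in one block."""
    return [(token.lower(), 1) for doc in block for token in doc.split()]


def shuffle(pairs):
    """Group all emitted values by their key (the term)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values collected for each term."""
    return {term: sum(values) for term, values in grouped.items()}


def word_count(documents, block_size=2):
    """Run the map, shuffle, and reduce steps in sequence, in one process."""
    blocks = split_into_blocks(documents, block_size)
    # In a real cluster each block would be mapped on a different node,
    # ideally the node that already stores the data.
    mapped = chain.from_iterable(map_block(block) for block in blocks)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    docs = ["a b a", "b c", "a"]
    print(word_count(docs))
```

Because each block is mapped without reference to any other block, the map calls could in principle run in parallel on separate machines; only the shuffle step needs to bring values for the same term together.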

Libraries for text mining (language analysis, data mining and machine learning)

  • Apache UIMA: Apache UIMA is an Apache-licensed open source implementation of the UIMA specification for unstructured information management.
  • RapidMiner: RapidMiner is an open-source data mining solution. Applications of RapidMiner cover a wide range of real-world data mining tasks. It is available under different licenses.
  • WEKA: Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
  • LingPipe: LingPipe is a suite of Java libraries for the linguistic analysis of human language. LingPipe offers information extraction and data mining tools. It is available under different licenses.

Libraries for document parsing:

  • Apache Tika: Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
  • Apache PDFBox: Apache PDFBox is an open source Java PDF library for working with PDF documents. This project allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents.
  • Apache POI: The POI project is the master project for developing pure Java ports of file formats based on Microsoft's OLE 2 Compound Document Format. OLE 2 Compound Document Format is used by Microsoft Office Documents.