Apache Tika reaches 1.0
Version 1.0 of Apache Tika, a toolkit for detecting and extracting metadata and structured text content, has been released. The project began as a sub-project of Apache Lucene in 2007 and became a top-level project in May last year.
Apache Tika is made up of a set of Java libraries and uses a number of existing parsers that allow it to extract metadata and structured text from HTML, XML, Microsoft Office documents (OLE2 and OOXML), OpenDocument formats, PDF, ePub, RTF, compressed and packaged files, generic text in different encodings, Outlook and mbox mailboxes, and text associated with audio, image and video files. This makes it a valuable tool for search engines and other applications that need to handle a variety of file types.
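To handle such a mix of formats, a file's type has to be identified before a matching parser can be chosen. The following is a minimal sketch of content-based detection by leading "magic" bytes, written with only the Java standard library. It is an illustration of the general idea, not Tika's own code or API: the class name `MagicSniffer`, the method `detect` and the small signature table are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not part of Apache Tika.
public class MagicSniffer {

    // One known byte prefix and the media type it suggests.
    private static final class Signature {
        final byte[] magic;
        final String mediaType;

        Signature(byte[] magic, String mediaType) {
            this.magic = magic;
            this.mediaType = mediaType;
        }
    }

    // A deliberately tiny, assumed signature table.
    private static final List<Signature> SIGNATURES = new ArrayList<>();
    static {
        SIGNATURES.add(new Signature(
                "%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        SIGNATURES.add(new Signature(
                "{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf"));
        // ZIP local file header; OOXML, OpenDocument and ePub files are ZIP containers.
        SIGNATURES.add(new Signature(
                new byte[] {'P', 'K', 3, 4}, "application/zip"));
    }

    // Returns the media type of the first matching signature,
    // or a generic binary type when nothing matches.
    public static String detect(byte[] data) {
        for (Signature s : SIGNATURES) {
            if (startsWith(data, s.magic)) {
                return s.mediaType;
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

A real detector would need a far larger table and would have to look inside container formats such as ZIP to tell an OOXML document from an ePub file; the sketch stops at the first matching prefix.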
Tika also has a graphical user interface (GUI) for exploring file content interactively. The updated version 1.0 removes all pre-1.0 API methods and drops the retrotranslated Java 1.4 support. It also improves OSGi integration so that it now automatically picks up and uses available Parser and Detector services.
The release notes list the changes in full. The Apache Tika source is available for download. A Getting Started guide shows how to use Tika with Maven or Ant and as a command-line utility. Apache Tika is licensed under the Apache License, Version 2.0.