Apache Tika is a free, open-source toolkit that detects and extracts metadata and structured text content from documents in many formats, using existing parser libraries.
What's New in This Release:
· Language identification is now dynamically configurable, managed via a config file loaded from the classpath. (TIKA-490)
· Tika now supports parsing feeds by wrapping the underlying Rome library. (TIKA-466)
· A quick-start guide for Tika parsing was contributed. (TIKA-464)
· An approach for plumbing through XHTML attributes was added. (TIKA-379)
· Media type hierarchy information is now taken into account when selecting the best parser for a given input document. (TIKA-298)
· Support for parsing common scientific data formats, including netCDF and HDF4/5, was added. (TIKA-400 and TIKA-399)
· Unit tests for Windows have been fixed, allowing TestParsers to complete. (TIKA-398)
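
To illustrate the media type hierarchy item above (TIKA-298), here is a minimal, self-contained Java sketch of the general idea: if no parser is registered for a document's exact media type, fall back to the nearest supertype that has one. This is not Tika's actual implementation or API. The class `ParserSelectionSketch`, its methods, the parser names, and the small supertype table are all hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ParserSelectionSketch {
    // Hypothetical supertype table: child media type -> parent media type.
    private final Map<String, String> supertypes = new HashMap<>();
    // Hypothetical registry: media type -> name of the parser that handles it.
    private final Map<String, String> parsers = new HashMap<>();

    public void addSupertype(String type, String parent) {
        supertypes.put(type, parent);
    }

    public void register(String type, String parserName) {
        parsers.put(type, parserName);
    }

    /**
     * Returns the parser registered for the given type or, failing that,
     * for its nearest ancestor in the hierarchy. Returns null if none matches.
     */
    public String select(String type) {
        String current = type;
        // Bound the walk so a cyclic supertype table cannot loop forever.
        int remaining = supertypes.size() + 1;
        while (current != null && remaining-- > 0) {
            String parser = parsers.get(current);
            if (parser != null) {
                return parser;
            }
            current = supertypes.get(current);
        }
        return null;
    }

    public static void main(String[] args) {
        ParserSelectionSketch selector = new ParserSelectionSketch();
        // Illustrative hierarchy: SVG is a kind of XML, XML is a kind of plain text.
        selector.addSupertype("image/svg+xml", "application/xml");
        selector.addSupertype("application/xml", "text/plain");
        // Illustrative parser names; no SVG-specific parser is registered.
        selector.register("application/xml", "XmlParser");
        selector.register("text/plain", "TextParser");

        System.out.println(selector.select("image/svg+xml"));   // prints "XmlParser"
        System.out.println(selector.select("text/plain"));      // prints "TextParser"
        System.out.println(selector.select("application/pdf")); // prints "null"
    }
}
```

In this sketch the most specific match wins: an SVG document is handled by the XML parser, not the plain-text one, because the walk stops at the first ancestor with a registered parser.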