Hi all,

I have committed a first version of an EnhancementEngine based on Apache Tika (see https://issues.apache.org/jira/browse/STANBOL-512).

Apache Tika™ is a framework that can be used to detect and extract metadata as well as structured text content from various document types. See http://tika.apache.org/ for details.

The current version of the engine includes the following features:

* Content-Type detection: If the Content-Type of a ContentItem is not set (null or "application/octet-stream"), then Apache Tika is used to automatically detect the correct type.
* Plain Text extraction: Apache Tika is used to extract the text from parsed content. The plain text version only includes the body part of the document (header information, such as the title, is skipped).
* XHTML content extraction: Apache Tika also supports the conversion of content to XHTML. This format is also added as a content part to the ContentItem, as a Blob with the content type "application/xhtml+xml". This serialization includes the whole content (header and body part).

Still missing:

* Metadata extracted by Apache Tika is currently not converted to RDF and added to the metadata.

### Tika and Metaxa:

Both engines are now included and activated in the current Stanbol Launchers. Note that because the Tika Engine and Metaxa provide very similar functionality, some users might want to use either Tika or Metaxa in their Enhancement Chains. However, it is also possible to use both engines within one Enhancement Chain. Currently this is the case for the default Enhancement Chain that gets used on requests to "/enhancer" and "/engines".

If you need to extract metadata from parsed content, then you will want to use Metaxa for now.

### Text Extraction and the Multipart Content Item RESTful API

The recently added extensions to the Stanbol Enhancer RESTful API now allow you to directly request transcoded content. The following example will return the extracted plain text from the parsed content.

    curl -v -X POST -H "Accept: text/plain" \
        -H "Content-type: application/pdf" \
        -T $file \
        "http://localhost:8080/enhancer?omitMetadata=true"

By specifying "application/xhtml+xml" as Accept header the request would return XHTML extracted by Apache Tika.

Happy Testing!

best
Rupert

--
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11             ++43-699-11108907
| A-5500 Bischofshofen
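
PS: For anyone who wants to try both the plain text and the XHTML variant, here is a minimal shell sketch that wraps the curl request from above. The function name `extract_content` and the variables `STANBOL_URL` and `DRY_RUN` are made up for this example and are not part of Stanbol; the endpoint, headers and the `omitMetadata=true` parameter are the ones used in the request above. The content type defaults to "application/pdf" as in that request.

```shell
#!/bin/sh
# Sketch only: STANBOL_URL, DRY_RUN and extract_content are hypothetical
# names chosen for this example, not something Stanbol provides.
STANBOL_URL="${STANBOL_URL:-http://localhost:8080/enhancer}"

# extract_content ACCEPT FILE [CONTENT_TYPE]
# Sends FILE to the enhancer and asks for the transcoded content in the
# ACCEPT media type ("text/plain" or "application/xhtml+xml").
# With DRY_RUN=1 the curl command is only printed, not executed.
extract_content() {
    accept="$1"
    file="$2"
    content_type="${3:-application/pdf}"

    # Assemble the same request as in the example above.
    set -- curl -s -X POST \
        -H "Accept: $accept" \
        -H "Content-type: $content_type" \
        -T "$file" \
        "$STANBOL_URL?omitMetadata=true"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

After sourcing the script, `extract_content text/plain "$file"` should behave like the request above, and `extract_content application/xhtml+xml "$file"` would ask for the XHTML version instead. Setting `DRY_RUN=1` first prints the command so it can be checked without a running Stanbol instance.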
