Apache Tika Enhancement Engine (STANBOL-512)

Rupert Westenthaler Tue, 28 Feb 2012 11:08:28 -0800

Hi all,

I have committed a first version of an EnhancementEngine based on
Apache Tika (see https://issues.apache.org/jira/browse/STANBOL-512).


Apache Tika™ is a framework that can be used to detect and extract
metadata as well as structured text content from various document
types. See http://tika.apache.org/ for details.

The current version of the engine includes the following features:

* Content-Type detection: If the Content-Type of a ContentItem is not
set (null or "application/octet-stream"), then Apache Tika is used to
automatically detect the correct type.
* Plain Text extraction: Apache Tika is used to extract the text from
parsed content. The plain text version only includes the body part of
the document (header information, such as the title, is skipped).
* XHTML content extraction: Apache Tika also supports the conversion
of content to XHTML. This format is also added as a content part to the
ContentItem, as a Blob with the content type "application/xhtml+xml".
This serialization includes the whole content (header and body part).
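
As a rough illustration of the Content-Type rule in the list above, the
following shell sketch mirrors the check described in the first item and
then sends a file with the generic binary type, so that detection is
left to Tika. The function names `needs_detection` and `post_untyped`
are made up for this example, and the URL assumes a local Stanbol
launcher on port 8080. This is not the engine's actual code.

```shell
#!/bin/sh
# Illustration only: mirrors the rule from the feature list.
# Returns 0 (true) if the type is unset or the generic binary type.
needs_detection() {
    case "$1" in
        ""|application/octet-stream) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: POST a file with the generic type so that
# the Tika engine has to detect the real one. $1 = file to upload.
# Assumes a Stanbol launcher listening on localhost:8080.
post_untyped() {
    curl -s -X POST -H "Accept: text/plain" \
        -H "Content-type: application/octet-stream" \
        -T "$1" \
        "http://localhost:8080/enhancer?omitMetadata=true"
}
```

Calling `post_untyped some-document.pdf` should then return the
extracted plain text, provided the Tika engine is part of the chain.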

Still missing:

* Metadata extracted by Apache Tika is currently neither converted to
RDF nor added to the metadata.

### Tika and Metaxa:

Both engines are now included and activated in the current Stanbol
Launchers. Because the Tika Engine and Metaxa provide very similar
functionality, some users might want to use either Tika or Metaxa in
their Enhancement Chains. However, it is also possible to use both
engines within one Enhancement Chain. Currently this is the case for
the default Enhancement Chain that is used for requests to
"/enhancer" and "/engines".
If you need to extract metadata from parsed content, then you will want
to use Metaxa for now.


### Text Extraction and the Multipart Content Item RESTful API

The recently added extensions to the Stanbol Enhancer RESTful API now
make it possible to request transcoded content directly.
The following example returns the plain text extracted from the
parsed content.

    curl -v -X POST -H "Accept: text/plain" \
        -H "Content-type: application/pdf" \
        -T "$file" \
        "http://localhost:8080/enhancer?omitMetadata=true"

By specifying "application/xhtml+xml" as the Accept header, the request
would return the XHTML extracted by Apache Tika.
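
To make switching between the two output formats easier while testing,
the curl call above could be wrapped roughly like this. The function
names `accept_for` and `extract` and the `STANBOL_URL` variable are
invented for this sketch; only the Accept values, the endpoint and the
`omitMetadata` parameter come from the example above.

```shell
#!/bin/sh
# Hypothetical helper: map a short format name to the Accept header
# value mentioned in this mail.
accept_for() {
    case "$1" in
        text)  echo "text/plain" ;;
        xhtml) echo "application/xhtml+xml" ;;
        *)     echo "unknown format: $1" >&2; return 1 ;;
    esac
}

# Hypothetical wrapper around the curl call shown above.
# $1 = format (text|xhtml), $2 = file, $3 = content type of the file.
# STANBOL_URL is an assumed variable; it defaults to the local launcher.
extract() {
    accept=$(accept_for "$1") || return 1
    curl -s -X POST -H "Accept: $accept" \
        -H "Content-type: $3" \
        -T "$2" \
        "${STANBOL_URL:-http://localhost:8080}/enhancer?omitMetadata=true"
}
```

For example, `extract xhtml "$file" application/pdf > out.xhtml` would
store the XHTML version, and `extract text "$file" application/pdf`
would print the plain text.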


Happy Testing!

best
Rupert

-- 
| Rupert Westenthaler             [email protected]
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen
