[RT] Tika framework usage scenario

Bertrand Delacretaz Wed, 13 Jun 2007 01:40:30 -0700

Hi,

Here are some Random Thoughts about how Tika could be used, mostly
based on (my recollection of) our discussion at ApacheCon.


See also:
http://code.google.com/p/tika/wiki/DesignDiscussion
http://code.google.com/p/tika/wiki/ArchitectureSketch

Comments/flames/etc. are welcome ;-)

Here's my proposed Tika Framework Usage Scenario:

A Pipeline takes an InputStream as input.
(not a Reader, as we might need to try different encodings).
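To illustrate why raw bytes are needed, here is a minimal sketch of
trying several encodings in turn using the standard java.nio.charset
classes. The class and method names (EncodingTrial,
decodeWithFirstMatch) are made up for this example and are not part of
any Tika API; a real detector would likely be smarter than "first
charset that decodes cleanly".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper: tries candidate charsets against the same raw bytes. */
final class EncodingTrial {

    /**
     * Returns the text decoded with the first candidate that accepts the
     * bytes without errors, or null if none of the candidates does.
     */
    static String decodeWithFirstMatch(byte[] raw, Charset... candidates) {
        for (Charset candidate : candidates) {
            try {
                return candidate.newDecoder()
                        // Fail loudly instead of silently inserting U+FFFD.
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(raw))
                        .toString();
            } catch (CharacterCodingException e) {
                // These bytes are not valid in this charset: try the next one.
            }
        }
        return null;
    }
}
```

With a Reader the bytes would already have been decoded once, so a
second attempt with another charset would not be possible.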

Internally, a Pipeline consists of a series of ContentFilters
connected in a chain.
(details to be defined: encoding and content-type detectors, file
format parsers, etc.).

A Pipeline is created by the PipelineFactory, based on a StreamInfo.

A StreamInfo contains all the relevant info that we have about the
input stream: filename, HTTP headers, encoding, expected language,
configured hints and preferences, etc. In short, everything that can
help the PipelineFactory decide how to set up the Pipeline.
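A rough sketch of how a StreamInfo could drive the PipelineFactory.
Everything below is an assumption made for illustration: the
key/value representation of StreamInfo, the create() signature, the
hint names and the parser names are all invented, and the Pipeline is
reduced to remembering which parser was selected.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical holder for everything known about the input stream. */
final class StreamInfo {
    private final Map<String, String> hints = new HashMap<>();

    StreamInfo with(String key, String value) {
        hints.put(key, value);
        return this;
    }

    String get(String key) {
        return hints.get(key);
    }
}

/** Hypothetical pipeline stub: it only records the parser that was chosen. */
final class Pipeline {
    final String parserName;
    final InputStream input;

    Pipeline(String parserName, InputStream input) {
        this.parserName = parserName;
        this.input = input;
    }
}

/** Hypothetical factory that selects a parser from the StreamInfo hints. */
final class PipelineFactory {

    static Pipeline create(StreamInfo info, InputStream input) {
        String contentType = info.get("Content-Type");
        String filename = info.get("filename");
        String lowerName = filename == null ? "" : filename.toLowerCase(Locale.ROOT);

        if ("application/pdf".equals(contentType) || lowerName.endsWith(".pdf")) {
            return new Pipeline("pdf", input);
        }
        if (lowerName.endsWith(".zip")) {
            return new Pipeline("zip", input);
        }
        // Nothing conclusive in the hints: fall back to content-based detection.
        return new Pipeline("autodetect", input);
    }
}
```

A client would then write something like
PipelineFactory.create(new StreamInfo().with("filename", "report.pdf"), in).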

Once its start() method is called, a Pipeline reads the InputStream
and produces ContentEvents.

A ContentEvent can be a MetadataEvent, a StreamEvent, a TextEvent or a
TikaInfoEvent.

A MetadataEvent contains extracted metadata (obviously ;-)

The names of metadata properties are standardized as far as possible
(Dublin Core, etc.).

A StreamEvent encapsulates an InputStream and a StreamInfo, for
example when the original input was a ZIP archive that contains
several binary components. If the client is interested in this event,
it will have to create another Pipeline to process its contents.
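For the ZIP case, the standard java.util.zip classes already expose
each entry as a readable stream. The sketch below shows one way
StreamEvents might be produced from an archive. The StreamEvent class
and its fields are invented for this example, and the StreamInfo is
reduced to just the entry name; each entry is buffered in memory,
which is a simplification and would not suit large archives.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical event: one embedded stream plus what we know about it. */
final class StreamEvent {
    final String entryName;    // stand-in for a full StreamInfo
    final InputStream stream;  // content of the embedded component

    StreamEvent(String entryName, InputStream stream) {
        this.entryName = entryName;
        this.stream = stream;
    }
}

/** Hypothetical helper that turns a ZIP archive into StreamEvents. */
final class ZipStreamEvents {

    /** Reads the archive and returns one StreamEvent per file entry. */
    static List<StreamEvent> extract(InputStream archive) throws IOException {
        List<StreamEvent> events = new ArrayList<>();
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content
                }
                // Copy the entry so it stays readable after the archive moves on.
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    content.write(buffer, 0, n);
                }
                events.add(new StreamEvent(entry.getName(),
                        new ByteArrayInputStream(content.toByteArray())));
            }
        }
        return events;
    }
}
```

A client interested in the embedded components would hand each
event's stream to a new Pipeline, which also covers archives nested
inside archives.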

A TextEvent contains extracted text, location information, etc.

A TikaInfoEvent provides information about the Pipeline execution:
progress, debugging messages, warnings, etc.

The order in which ContentEvents are produced by the Pipeline is not specified.
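Since the order is unspecified, a client should not assume, for
example, that all metadata arrives before the first text. A minimal
sketch of such an order-independent client follows. The class
hierarchy, the field names and the onEvent() callback are assumptions
made for this example only; StreamEvent is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical base type for everything a Pipeline emits. */
abstract class ContentEvent { }

final class MetadataEvent extends ContentEvent {
    final String name;   // e.g. a Dublin Core style property name
    final String value;

    MetadataEvent(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

final class TextEvent extends ContentEvent {
    final String text;

    TextEvent(String text) {
        this.text = text;
    }
}

final class TikaInfoEvent extends ContentEvent {
    final String message;

    TikaInfoEvent(String message) {
        this.message = message;
    }
}

/** Client-side collector that makes no assumption about event order. */
final class CollectingListener {
    final Map<String, String> metadata = new LinkedHashMap<>();
    final List<String> text = new ArrayList<>();
    final List<String> messages = new ArrayList<>();

    void onEvent(ContentEvent event) {
        if (event instanceof MetadataEvent) {
            MetadataEvent m = (MetadataEvent) event;
            metadata.put(m.name, m.value);
        } else if (event instanceof TextEvent) {
            text.add(((TextEvent) event).text);
        } else if (event instanceof TikaInfoEvent) {
            messages.add(((TikaInfoEvent) event).message);
        }
        // Unknown event types are ignored, so new ones can be added later.
    }
}
```

Each event type lands in its own bucket, so the collected result is
the same whichever way the Pipeline interleaves the different event
types.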

WDYT?

-Bertrand