Hi,

Here are some Random Thoughts about how Tika could be used, mostly based on (my recollection of) our discussion at ApacheCon.

See also:
http://code.google.com/p/tika/wiki/DesignDiscussion
http://code.google.com/p/tika/wiki/ArchitectureSketch

Comments/flames/etc. are welcome ;-)

Here's my proposed Tika Framework Usage Scenario:

A Pipeline takes an InputStream as input (not a Reader, as we might need to try different encodings).

Internally, a Pipeline consists of a series of ContentFilters connected in a chain (details to be defined: encoding and content-type detectors, file format parsers, etc.).

A Pipeline is created by the PipelineFactory, based on a StreamInfo.

A StreamInfo contains all the relevant info that we have about the input stream: filename, HTTP headers, encoding, expected language, configured hints and preferences, etc. In short, everything that can help the PipelineFactory decide how to set up the Pipeline.

Once its start() method is called, a Pipeline reads the InputStream and produces ContentEvents.

A ContentEvent can be a MetadataEvent, a StreamEvent, a TextEvent or a TikaInfoEvent.

A MetadataEvent contains extracted metadata (obviously ;-) The names of metadata properties are standardized as far as possible (Dublin Core, etc.).

A StreamEvent encapsulates an InputStream and a StreamInfo, for example when the original input was a ZIP archive that contains several binary components. If the client is interested in this event, it will have to create another Pipeline to process its contents.

A TextEvent contains extracted text, location information, etc.

A TikaInfoEvent provides information about the Pipeline execution: progress, debugging messages, warnings, etc.

The order in which ContentEvents are produced by the Pipeline is not specified.

WDYT?

-Bertrand
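
To make the scenario above a bit more concrete, here is a rough Java sketch of how the pieces might fit together. Everything in it is an assumption made for illustration: the class shapes, the with() helper, the hint keys "filename" and "encoding", the listener-style start() signature and the collect() helper are all hypothetical, and none of this exists in Tika. The ContentFilter chain is left out; the toy factory simply treats every input as plain text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of the proposed usage scenario; not an existing Tika API.
public class PipelineSketch {

    /** Everything known about the stream up front (filename, encoding, hints...). */
    static class StreamInfo {
        final Map<String, String> hints = new HashMap<>();
        StreamInfo with(String key, String value) { hints.put(key, value); return this; }
    }

    /** Common type for everything a Pipeline can produce. */
    interface ContentEvent {}

    static class MetadataEvent implements ContentEvent {
        final String name, value;
        MetadataEvent(String name, String value) { this.name = name; this.value = value; }
    }

    static class StreamEvent implements ContentEvent {
        final InputStream stream;
        final StreamInfo info;
        StreamEvent(InputStream stream, StreamInfo info) { this.stream = stream; this.info = info; }
    }

    static class TextEvent implements ContentEvent {
        final String text;
        TextEvent(String text) { this.text = text; }
    }

    static class TikaInfoEvent implements ContentEvent {
        final String message;
        TikaInfoEvent(String message) { this.message = message; }
    }

    /** Reads its InputStream once start() is called and pushes events to the listener. */
    interface Pipeline {
        void start(Consumer<ContentEvent> listener) throws IOException;
    }

    /** Decides how to set up a Pipeline from the StreamInfo (toy version: plain text only). */
    static class PipelineFactory {
        Pipeline create(InputStream in, StreamInfo info) {
            return listener -> {
                // Use the encoding hint if present, otherwise assume UTF-8.
                String encoding = info.hints.getOrDefault("encoding", "UTF-8");
                listener.accept(new TikaInfoEvent("decoding as " + encoding));

                // Toy metadata: report the filename hint under the name "title".
                String filename = info.hints.get("filename");
                if (filename != null) {
                    listener.accept(new MetadataEvent("title", filename));
                }

                // Read the whole stream and emit it as a single TextEvent.
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                listener.accept(new TextEvent(new String(buffer.toByteArray(), encoding)));
            };
        }
    }

    /** Convenience helper: runs a Pipeline over a byte array and collects its events. */
    static List<ContentEvent> collect(byte[] data, StreamInfo info) {
        List<ContentEvent> events = new ArrayList<>();
        try {
            new PipelineFactory().create(new ByteArrayInputStream(data), info).start(events::add);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        byte[] data = "hello tika".getBytes(StandardCharsets.UTF_8);
        StreamInfo info = new StreamInfo().with("filename", "hello.txt");
        for (ContentEvent event : collect(data, info)) {
            System.out.println(event.getClass().getSimpleName());
        }
    }
}
```

A client interested in StreamEvents would hand the event's InputStream and StreamInfo back to the PipelineFactory to get a second Pipeline, as described above. The push-style listener is only one possible choice; a pull-style iterator over ContentEvents would fit the scenario just as well.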
