Hi,
I was thinking about ways to best model the Tika interfaces, and it
seems to me that the only sane way to do that is to start with use
cases and how a client would most naturally use a toolkit like Tika.
Here are some of my initial ideas for review. Feel free to add more
cases or suggest alternatives.
1) Extract structured text content from a stream (default configuration):
InputStream stream = ...;
ContentHandler handler = ...; // SAX event handler
new SomeTikaClass().parse(stream, handler);
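To make case 1 concrete, here is a rough, self-contained sketch of what the client side could look like. Only the parse(stream, handler) shape comes from the proposal above. PlainTextSketchParser, TextCollector and TextExtractionDemo are hypothetical names, and the "one <p> element per line of UTF-8 text" behaviour is just an assumption so that the example has something to emit.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for SomeTikaClass: treats the stream as UTF-8
// plain text and reports each line as a <p> element via SAX events.
class PlainTextSketchParser {
    void parse(InputStream stream, ContentHandler handler)
            throws IOException, SAXException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8));
        handler.startDocument();
        String line;
        while ((line = reader.readLine()) != null) {
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] chars = line.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
        }
        handler.endDocument();
    }
}

// Client-side SAX handler that collects character data and ends
// each paragraph with a newline.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }

    String text() {
        return text.toString();
    }
}

// Usage: the same two-argument call as in case 1, wrapped in a helper
// that converts checked exceptions to an unchecked one.
class TextExtractionDemo {
    static String extractText(byte[] data) {
        TextCollector handler = new TextCollector();
        try {
            new PlainTextSketchParser()
                    .parse(new ByteArrayInputStream(data), handler);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.text();
    }
}
```

The point of the sketch is that the client only needs to know about InputStream and the standard SAX ContentHandler, so any existing SAX tooling can consume the output.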
2) Set configuration options:
SomeTikaClass tika = new SomeTikaClass();
tika.setConfigurationOption1(...);
tika.setConfigurationOption2(...);
// also composition, etc.
3) Extract metadata from a stream:
InputStream stream = ...;
Metadata metadata = new Metadata(); // Metadata container
new SomeTikaClass().parse(stream, metadata);
4) Provide external metadata as input for parsing:
InputStream stream = ...;
ContentHandler handler = ...;
Metadata metadata = new Metadata();
metadata.setFileName(...);
new SomeTikaClass().parse(stream, handler, metadata);
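Cases 3 and 4 could be sketched together along these lines, with the Metadata object acting as both an input hint and an output container. Everything here is hypothetical except the setFileName(...) idea from case 4: the class names, the "length" and "content-type" keys, and the extension-based type guess are assumptions made only for illustration. For brevity the sketch uses the two-argument form from case 3 and leaves out the content handler.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical metadata container; only setFileName(...) appears in
// the proposal, the generic get/set pair is an assumption.
class SketchMetadata {
    private final Map<String, String> values = new LinkedHashMap<>();

    void setFileName(String name) {
        values.put("filename", name);
    }

    void set(String key, String value) {
        values.put(key, value);
    }

    String get(String key) {
        return values.get(key);
    }
}

// Hypothetical stand-in for SomeTikaClass, metadata-only variant.
class MetadataSketchParser {
    void parse(InputStream stream, SketchMetadata metadata)
            throws IOException {
        // Output metadata: count the bytes in the stream.
        long length = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = stream.read(buffer)) != -1) {
            length += n;
        }
        metadata.set("length", Long.toString(length));
        // Input metadata: use the caller-supplied file name as a hint.
        metadata.set("content-type", guessType(metadata.get("filename")));
    }

    private static String guessType(String fileName) {
        if (fileName == null) {
            return "application/octet-stream";
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".txt")) {
            return "text/plain";
        }
        if (lower.endsWith(".html") || lower.endsWith(".htm")) {
            return "text/html";
        }
        return "application/octet-stream";
    }
}

// Usage: pass a file name to exercise case 4, or null for plain case 3.
class MetadataSketchDemo {
    static SketchMetadata describe(byte[] data, String fileNameOrNull) {
        SketchMetadata metadata = new SketchMetadata();
        if (fileNameOrNull != null) {
            metadata.setFileName(fileNameOrNull);
        }
        try {
            new MetadataSketchParser()
                    .parse(new ByteArrayInputStream(data), metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return metadata;
    }
}
```

One design question this raises is whether a single Metadata instance should carry both client-supplied hints and extracted values, or whether the two should be kept apart.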
BR,
Jukka Zitting