Re: Moving Functionality from CLI to ParseUtils

Jukka Zitting Sun, 12 Jul 2009 12:46:59 -0700

Hi,

2009/7/11 keithrbennett <keithrbenn...@gmail.com>:
> Having pluggable parts, as you suggest, is definitely the
> way to go for optimum power and flexibility.  However, IMHO,
> for the simplest use cases, and for beginning users,
> this approach may discourage and complicate Tika's use.
> I suggest an alternate simplified interface (see below)
> for these uses/users.


Agreed, the more I think about this the more I think having something
like this would be useful.

My proposal would be to add an org.apache.tika.Tika facade class with
static methods for the most important simple use cases.

> For the simple cases, I would suggest hiding things like parser
> implementations, metadata objects, and content handlers.  The simplest
> cases with document type autodetection could be handled by:
>
> parse(InputStream inputStream, OutputStream outputStream)

I guess the most important parsing use case is to produce a Reader for
use in Lucene indexing. Thus I would add a method like this:

    Reader parse(InputStream);

Some clients may prefer to have it all in a simple string. Large inputs
are a caveat here, so perhaps we should have some built-in output size
limit. We could also do:

    String parseToString(InputStream);
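
To make the output size limit idea concrete, here is a minimal sketch of
how the string variant could cap its result. It uses only the JDK; the
class name BoundedText, the method readToString and the maxLength
parameter are made up for illustration and are not existing Tika API.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Tika: shows one way to cap extracted text.
class BoundedText {

    /**
     * Reads at most maxLength characters from the reader and returns them.
     * Anything beyond the limit is left unread. The caller closes the reader.
     */
    static String readToString(Reader reader, int maxLength) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        while (text.length() < maxLength) {
            // Never ask for more characters than the remaining budget.
            int wanted = Math.min(buffer.length, maxLength - text.length());
            int read = reader.read(buffer, 0, wanted);
            if (read == -1) {
                break; // end of input reached before the limit
            }
            text.append(buffer, 0, read);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the Reader a parse method would return.
        try (Reader reader = new StringReader("extracted document text")) {
            System.out.println(readToString(reader, 9));
        }
    }
}
```

Silently truncating is only one option; throwing an exception once the
limit is hit would be another reasonable design.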

The XHTML output is probably only useful in more sophisticated use
cases, where the Parser interface and an appropriate ContentHandler
can be used directly.

> Then, to specify the document type, we could add a MimeType string
> argument:
>
> parse(InputStream inputStream, OutputStream outputStream,
>        String mimeType)

Tika is already pretty good at auto-detecting the document type, and
in my experience the file name is much more useful in helping type
detection than any externally provided type information. Tika likely
has a much more complete set of file name glob patterns than what
probably was used to produce the external type information.

Thus I'd rather give the proposed parse method information about the
file name when available. And instead of adding an explicit argument,
we could just as well add overloaded methods that also take care of
correctly opening and closing the file (or URL resource) as needed.
Something like this:

    Reader parse(File);
    Reader parse(URL);

Similarly for the parseToString method. In more complex cases (e.g. if
the file is inside a database field) one can always use the Parser
interface directly.
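
As a rough sketch of the overloads for the string variant, the File and
URL methods could open the resource, delegate to the stream method, and
close the stream again. Everything here is hypothetical: FacadeSketch is
not the proposed class, and the stream method just decodes UTF-8 bytes as
a stand-in for real parsing so that the example runs with the JDK alone.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the overload pattern, not actual Tika code.
class FacadeSketch {

    // Placeholder for real parsing (assumption): decode the bytes as UTF-8.
    // The caller owns the stream and is responsible for closing it.
    static String parseToString(InputStream stream) throws IOException {
        return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
    }

    // Opens the file, delegates to the stream method, and always closes the stream.
    static String parseToString(File file) throws IOException {
        try (InputStream stream = Files.newInputStream(file.toPath())) {
            return parseToString(stream);
        }
    }

    // Same pattern for a URL resource.
    static String parseToString(URL url) throws IOException {
        try (InputStream stream = url.openStream()) {
            return parseToString(stream);
        }
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("facade-sketch", ".txt");
        try {
            Files.write(temp, "sample text".getBytes(StandardCharsets.UTF_8));
            System.out.println(parseToString(temp.toFile()));
            System.out.println(parseToString(temp.toUri().toURL()));
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

The Reader-returning parse methods would need a different arrangement,
since the underlying stream has to stay open until the Reader is closed.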

And while we're at it, there are many cases where an application needs
to figure out the type of a given document. Instead of coming up with
its own glob patterns and the like, an application could use Tika
functionality through potential facade methods like the following that
would return the auto-detected media type of the given document:

    String detect(InputStream);
    String detect(File);
    String detect(URL);
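
To illustrate the file name side of detection, here is a toy sketch that
maps glob patterns to media types using the JDK's PathMatcher. It is not
how Tika does it: the class GlobDetector, the three patterns and the
octet-stream fallback are assumptions picked for the example, and real
detection would also look at the content.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical name-only detector; Tika's own detection is far more complete.
class GlobDetector {

    private static final String FALLBACK = "application/octet-stream";

    // Glob patterns in registration order; the first match wins.
    private static final Map<PathMatcher, String> GLOBS = new LinkedHashMap<>();

    static {
        register("*.pdf", "application/pdf");
        register("*.xls", "application/vnd.ms-excel");
        register("*.txt", "text/plain");
    }

    private static void register(String glob, String mediaType) {
        GLOBS.put(FileSystems.getDefault().getPathMatcher("glob:" + glob), mediaType);
    }

    /** Returns the media type guessed from the file name alone. */
    static String detect(String fileName) {
        // Lower-case the name so that "Report.PDF" matches "*.pdf".
        Path name = Paths.get(fileName.toLowerCase(Locale.ROOT)).getFileName();
        if (name == null) {
            return FALLBACK; // e.g. a bare root path has no file name
        }
        for (Map.Entry<PathMatcher, String> entry : GLOBS.entrySet()) {
            if (entry.getKey().matches(name)) {
                return entry.getValue();
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(detect("budget.xls"));
        System.out.println(detect("notes.unknown"));
    }
}
```
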

WDYT?

> Another question... I used Tika to parse an Excel spreadsheet, and it
> created an XML file.  How could I insert a handler for parsing
> documents with multiple records (such as an Excel spreadsheet), so
> that I could, for example, insert the records into a database instead
> of writing XML to a file?

That's a big can of worms as each document type comes with its own
structure and semantics. Tika avoids this problem by focusing on just
the contained text and some very generic structural information.

If you need more detailed structural information, you'll inevitably
hit type-specific features and my recommendation would be to directly
use the appropriate parser library. For example, I'd use POI directly
for pulling specific information out of Excel spreadsheets.

BR,

Jukka Zitting
