Hi,

2009/7/11 keithrbennett <keithrbenn...@gmail.com>:
> Having pluggable parts, as you suggest, is definitely the
> way to go for optimum power and flexibility. However, IMHO,
> for the simplest use cases, and for beginning users,
> this approach may discourage and complicate Tika's use.
> I suggest an alternate simplified interface (see below)
> for these uses/users.

Agreed, the more I think about this, the more I think something like
this would be useful. My proposal would be to add an
org.apache.tika.Tika facade class with static methods for the most
important simple use cases.

> For the simple cases, I would suggest hiding things like parser
> implementations, metadata objects, and content handlers. The simplest
> cases with document type autodetection could be handled by:
>
> parse(InputStream inputStream, OutputStream outputStream)

I guess the most important parsing use case is to produce a Reader for
use in Lucene indexing. Thus I would add a method like this:

    Reader parse(InputStream);

Some clients may prefer to have it all in a simple string (with all the
caveats of large inputs; perhaps we should have some built-in output
size limit), so we could also do:

    String parseToString(InputStream);

The XHTML output is probably only useful in more sophisticated use
cases, where the Parser interface and an appropriate ContentHandler can
be used directly.

> Then, to specify the document type, we could add a MimeType string
> argument:
>
> parse(InputStream inputStream, OutputStream outputStream,
> String mimeType)

Tika is already pretty good at auto-detecting the document type, and in
my experience the file name is much more useful in helping type
detection than any externally provided type information. Tika likely has
a much more complete set of file name glob patterns than whatever was
used to produce the external type information. Thus I'd rather give the
proposed parse method information about the file name when available.
And instead of adding an explicit argument, we could just as well add
overloaded methods that also take care of correctly opening and closing
the file (or URL resource) as needed. Something like this:

    Reader parse(File);
    Reader parse(URL);

Similarly for the parseToString method. In more complex cases (e.g. if
the file is inside a database field) one can always use the Parser
interface directly.

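To make the shape of these methods concrete, here is a rough sketch of
how they could fit together. It is not Tika code: FacadeSketch,
TextExtractor, and the MAX_CHARS limit are made-up names for
illustration, and plain UTF-8 decoding stands in for the real type
detection and parsing. The point is only the layering: the File and URL
overloads open and close the stream themselves, and parseToString caps
its output.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

final class FacadeSketch {
    /** Hypothetical stand-in for the parser: turns a byte stream into extracted text. */
    interface TextExtractor {
        Reader extract(InputStream in) throws IOException;
    }

    // Assumption: UTF-8 decoding replaces real detection and parsing in this sketch.
    static final TextExtractor EXTRACTOR =
            in -> new InputStreamReader(in, StandardCharsets.UTF_8);

    // Assumption: an arbitrary built-in cap on the size of the returned string.
    static final int MAX_CHARS = 100_000;

    private FacadeSketch() {}

    /** The caller owns the stream and is responsible for closing it. */
    public static Reader parse(InputStream in) throws IOException {
        return EXTRACTOR.extract(in);
    }

    /** Reads the extracted text into a string, stopping at MAX_CHARS characters. */
    public static String parseToString(InputStream in) throws IOException {
        Reader reader = parse(in);
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while (out.length() < MAX_CHARS
                && (n = reader.read(buf, 0, Math.min(buf.length, MAX_CHARS - out.length()))) != -1) {
            out.append(buf, 0, n);
        }
        return out.toString();
    }

    /** Opens the file, extracts its text, and closes the file again. */
    public static String parseToString(File file) throws IOException {
        try (InputStream in = Files.newInputStream(file.toPath())) {
            return parseToString(in);
        }
    }

    /** Opens the URL resource, extracts its text, and closes the connection stream. */
    public static String parseToString(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return parseToString(in);
        }
    }
}
```

A Reader-returning parse(File) is trickier than the string variant,
since the returned Reader would have to close the underlying file when
the caller closes it; the sketch leaves that out.
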
And while we're at it, there are many cases where an application needs
to figure out the type of a given document. Instead of coming up with
its own glob patterns and the like, an application could use Tika
functionality through potential facade methods like the following, which
would return the auto-detected media type of the given document:

    String detect(InputStream);
    String detect(File);
    String detect(URL);

WDYT?

> Another question...I used Tika to parse an Excel spreadsheet, and it
> created an XML file. How could I insert a handler for parsing
> documents with multiple records (such as an Excel spreadsheet), so
> that I could, for example, insert the record into a data base instead
> of writing XML to a file?

That's a big can of worms, as each document type comes with its own
structure and semantics. Tika avoids this problem by focusing on just
the contained text and some very generic structural information. If you
need more detailed structural information, you'll inevitably hit
type-specific features, and my recommendation would be to use the
appropriate parser library directly. For example, I'd use POI directly
for pulling specific information out of Excel spreadsheets.

BR,

Jukka Zitting
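
PS. For the detect idea, a minimal sketch of file-name-based detection
using nothing but the JDK's glob matcher might look like the following.
DetectSketch and its four-entry glob table are invented for
illustration and are not Tika's pattern set; the sketch also ignores
the document content entirely, so it only shows the glob half of
detection.

```java
import java.io.File;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class DetectSketch {
    // Returned when no glob pattern matches the file name.
    static final String FALLBACK = "application/octet-stream";

    // Assumption: a tiny hand-written glob table, checked in insertion order.
    private static final Map<PathMatcher, String> TYPES = new LinkedHashMap<>();

    static {
        register("*.xls", "application/vnd.ms-excel");
        register("*.pdf", "application/pdf");
        register("*.{htm,html}", "text/html");
        register("*.txt", "text/plain");
    }

    private DetectSketch() {}

    private static void register(String glob, String mediaType) {
        TYPES.put(FileSystems.getDefault().getPathMatcher("glob:" + glob), mediaType);
    }

    /** Guesses a media type from the last path segment of the given name. */
    public static String detect(String fileName) {
        // Lower-case first so that "INDEX.HTML" matches "*.{htm,html}".
        Path name = Paths.get(fileName.toLowerCase(Locale.ROOT)).getFileName();
        if (name != null) {
            for (Map.Entry<PathMatcher, String> entry : TYPES.entrySet()) {
                if (entry.getKey().matches(name)) {
                    return entry.getValue();
                }
            }
        }
        return FALLBACK;
    }

    /** Convenience overload that looks only at the file's name, not its content. */
    public static String detect(File file) {
        return detect(file.getName());
    }
}
```

With this table, detect("report.xls") gives the Excel media type, and
anything unrecognized falls back to application/octet-stream.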