Jukka - Having pluggable parts, as you suggest, is definitely the way to go for optimum power and flexibility. However, IMHO, for the simplest use cases, and for beginning users, this approach may discourage and complicate Tika's use. I suggest an alternate simplified interface (see below) for these uses/users.
Renovating the entrance gate to Tika-land in this way could result in an increase in the number of beginning users, who continue on to be advanced users, and hopefully developers. A larger installed base could then result in attracting more resources to the project, human and otherwise.

* * *

It's been a while since I worked on Tika, and it's evolved in the meantime, so I'm not very adept at it these days. As such, let me use this to the project's advantage, and let you know what I would value in Tika as a new user.

For the simple cases, I would suggest hiding things like parser implementations, metadata objects, and content handlers. The simplest cases with document type autodetection could be handled by:

    parse(InputStream inputStream, OutputStream outputStream)

Then, to specify the document type, we could add a MIME type string argument:

    parse(InputStream inputStream, OutputStream outputStream, String mimeType)

I realize that this approach is not very efficient with multiple documents, since there is setup work that needs to be done for each document, but it is probably not an issue for most casual users.

Another question... I used Tika to parse an Excel spreadsheet, and it created an XML file. How could I insert a handler for parsing documents with multiple records (such as an Excel spreadsheet), so that I could, for example, insert each record into a database instead of writing XML to a file? Rather than writing a full-blown XML content handler, I wonder if we could simplify it to something like this:

    public interface RecordProcessor {
        void processRecord(Object[] fields);  // or List
    }

... and then have a method like:

    parseSpreadsheet(InputStream inputStream, RecordProcessor recordProcessor)

For the above methods, we might also provide convenience methods for Files, URLs, Strings, etc.

IMHO, having extremely simple methods like these would make it more likely for new users to attempt to use Tika, and to succeed in using it.
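To make the RecordProcessor idea a bit more concrete, here is a rough, self-contained sketch of the callback pattern. It is only an illustration under stated assumptions: RecordParsing and parseRecords are hypothetical names, not Tika APIs, and plain comma-separated text stands in for spreadsheet rows so the example needs nothing beyond the JDK. A real parseSpreadsheet would presumably sit on top of a Tika parser instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Callback from the proposal above: invoked once per record.
interface RecordProcessor {
    void processRecord(Object[] fields);
}

// Hypothetical helper, NOT part of Tika. Comma-separated lines are an
// assumed stand-in for spreadsheet rows, with no quoting or escaping.
final class RecordParsing {
    private RecordParsing() {}

    // Reads the stream line by line, hands each non-empty line to the
    // processor as an array of fields, and returns the number of records.
    static int parseRecords(InputStream in, RecordProcessor processor) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank rows
                }
                // A limit of -1 keeps trailing empty cells in the record.
                Object[] fields = line.split(",", -1);
                processor.processRecord(fields);
                count++;
            }
        } catch (IOException e) {
            // Keep the casual-user signature free of checked exceptions.
            throw new UncheckedIOException(e);
        }
        return count;
    }
}
```

A caller who wants database inserts rather than XML would pass a processor that binds each fields array to a prepared statement; for a quick check, a lambda such as fields -> System.out.println(fields.length) is enough.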
I realize everyone's busy, and my time is limited too; this is just a wish list. Also, to the extent that these suggestions are based on a lack of understanding of how Tika works, I apologize for that and welcome any clarification.

Regards,
Keith

Jukka Zitting wrote:
>
> Instead of a fixed facade like ParseUtils I personally prefer a set of
> components that I can combine in different ways to solve all kinds of
> use cases. For example your case would be easy to solve like this:
>
>     InputStream input = ...; // Where your input is coming from
>     OutputStream output = ...; // Where your output is going to
>     new AutoDetectParser().parse(
>         input, new BodyContentHandler(output), new Metadata());
>
> Of course a static facade method like ParseUtils.parse(File input,
> File output) might be easier for occasional users.
>
> Did you have some specific method signatures in mind?
>
> BR,
>
> Jukka Zitting
>

--
View this message in context: http://www.nabble.com/Moving-Functionality-from-CLI-to-ParseUtils-tp24337541p24442544.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.