I know my following question and proposal are a bit heretical, but I think they need to be raised nonetheless...

On Sun, Dec 07, 2008, Mattmann, Chris A wrote about "Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)":
> I would be remiss to consider implementing parsers in Tika as it really
> defeats the purpose of the project: that is, to be a bridging middleware and
> standard interface to parsing libraries, metadata representation, mime
> extraction frameworks and content analysis mechanisms.

The more I think about this issue, the less I am convinced that Tika as a wrapper for a dozen other libraries is the right way to go in the long run. We are facing several problems:

1. The big libraries such as PDFBox, POI, etc. often do much more than Tika wants to do - they can understand the document fonts, layout, graphics and other things Tika doesn't care about, and can create new documents or modify existing documents. As a result, these libraries are very complex, and very large - heck, each of PDFBox and POI is larger than the whole of Lucene (Tika's parent project)! This is a problem for applications that need text extraction but wish to remain small or at least not huge (think: desktop search, mobile devices, etc.).

2. There are many formats that will need new code anyway, such as OpenOffice, MP3, and other parsers which Tika already has code for. Very soon (if it is not true already), most of the parsers will be in Tika's code anyway, and not in external projects.

3. It is hard to enforce behavior standards and policies on the individual libraries. Each library makes its own decision whether it works by streaming or allocating the whole file, how it handles memory constraints (to avoid OOM exceptions which are catastrophic to the application), what kind of metadata it knows how to extract, and so on. This makes it harder for Tika to give consistent behavior.

What I'd really love to see materialize is a relatively small stand-alone library (500K sounds a reasonable figure ;-)), which can extract text and a bit of metadata from all sorts of file formats, but nothing more. I would love to see Tika become this sort of library.

I don't think this is a pipe-dream - I believe (but correct me if I got the wrong impression...) that only a minority of the code in PDFBox and POI for example is relevant at all to Tika, and that Tika could do better by copying only the relevant parts of the code rather than using the whole code as a black box. I would be even happier if those projects made the separation themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am doubtful that this will happen on its own any time soon.

What I am suggesting may sound scary at first, but I think that in the long run, this is the right way to go forward.

Nadav.

-- 
Nadav Har'El                        |   Monday, Dec 8 2008, 11 Kislev 5769
[EMAIL PROTECTED]                   |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I have a great signature, but it won't
http://nadav.harel.org.il           |fit at the end of this message -- Fermat