On Mon, Dec 08, 2008, Jukka Zitting wrote about "Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)":
> As mentioned before, I don't think that's a good idea as we don't have
> the required format-specific expertise here and IMHO trying to gather
> such expertise into a single project and community would be a futile
> exercise.
>
> I really don't want to start dealing with detailed questions about why
> this specific PDF construct or Office XML feature is not supported by
> Tika. It's an ocean of trouble that we're in no way equipped to
> handle.

Yes, I admitted that it is a scary idea, but in the long run, what *will*
the Tika developers do if indeed there is a bug in a specific PDF
construct? Hope that the PDFBox developers fix it?

> > I would be even happier if those projects made the separation
> > themselves, e.g., PDFBox split into a small "PDFBOX-extract" package
> > which is exactly what Tika needs and "PDFBOX-etc" which is all the
> > rest, but I am doubtful that this will happen on its own any time soon.
>
> This sounds like a much more fertile approach and I don't think it's
> all that far fetched. Now with Tika we have a clear rationale why such
> a trimmed down component would be useful, and I think many parser
> libraries would be happy to consider such proposals.
>
> I'm currently mentoring the PDFBox project at the Apache Incubator,
> and I think the project would respond really well if someone came up
> there with a proposal of generating such an extra pdfbox-extract
> release artifact.

This sounds like a really interesting idea. In a desktop search engine I
once wrote, PDFBox was *half* of the entire installer size, as big as the
sum of Lucene, the search application, web server and servlet engine, and
a dozen other parsers and libraries. From looking around the PDFBox code I
got the impression that text extraction was only a small part of the code,
but it was almost impossible to separate it from the rest (I managed to
pull out some of the stuff, but not all of it). I would love to see such a
separation made easier.

But please note that I'm not merely suggesting separating PDF *reading*
and *writing* - it is more than that - even *reading* PDF involves more
than just extracting text, and we don't need any of that in Tika. The
biggest thing we don't need in Tika, for example, is the font handling
code.

> We can and should work together with the parser projects to address
> the requirements we see.
> That's a much better alternative than forking
> parts of those projects inside Tika.

If such a symbiosis is indeed possible, then it will be great, because
the size issue becomes moot (although others, like the "dll hell"
mentioned earlier, don't). I just wonder if we have any power to influence
these other projects.

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Monday, Dec 8 2008, 11 Kislev 5769
[EMAIL PROTECTED]                   |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Wear short sleeves! Support your right to
http://nadav.harel.org.il           |bare arms!