Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)

Nadav Har'El Mon, 08 Dec 2008 05:05:53 -0800

On Mon, Dec 08, 2008, Jukka Zitting wrote about "Re: XML formats vs. parser 
libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that 
emmits structured XHTML content.)":
> As mentioned before, I don't think that's a good idea as we don't have
> the required format-specific expertise here and IMHO trying to gather
> such expertise into a single project and community would be a futile
> exercise.
> 
> I really don't want to start dealing with detailed questions about why
> this specific PDF construct or Office XML feature is not supported by
> Tika. It's an ocean of trouble that we're in no way equipped to
> handle.


Yes, I admitted that it is a scary idea, but in the long run, what *will*
the Tika developers do if there is indeed a bug in how a specific PDF
construct is handled? Hope that the PDFBox developers fix it?

> > I would be even happier if those projects made the separation
> > themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which
> > is exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
> > doubtful that this will happen on its own any time soon.
> 
> This sounds like a much more fertile approach and I don't think it's
> all that far fetched. Now with Tika we have a clear rationale why such
> a trimmed down component would be useful, and I think many parser
> libraries would be happy to consider such proposals.
> 
> I'm currently mentoring the PDFBox project at the Apache Incubator,
> and I think the project would respond really well if someone came up
> there with a proposal of generating such an extra pdfbox-extract
> release artifact.

This sounds like a really interesting idea.

In a desktop search engine I once wrote, PDFBox was *half* of the entire
installer size, as big as the sum of Lucene, the search application, the
web server and servlet engine, and a dozen other parsers and libraries.
From looking around the PDFBox code I got the impression that
text extraction was only a small part of the code, but it was almost
impossible to separate it from the rest (I managed to pull out some of the
stuff, but not all of it). I would love to see such a separation made easier.
But please note that I'm not merely suggesting separating PDF *reading*
and *writing*. It is more than that: even *reading* PDF involves more than
just extracting text, and we don't need any of the rest in Tika. The biggest
thing we don't need in Tika, for example, is the font handling code.

> We can and should work together with the parser projects to address
> the requirements we see. That's a much better alternative than forking
> parts of those projects inside Tika.

If such a symbiosis is possible, then it will indeed be great because
the size issue becomes moot (although others, like the "dll hell" mentioned
earlier, don't). I just wonder if we have any power to influence these
other projects.

Thanks,
Nadav.

-- 
Nadav Har'El                        |      Monday, Dec  8 2008, 11 Kislev 5769
[EMAIL PROTECTED]             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Wear short sleeves! Support your right to
http://nadav.harel.org.il           |bare arms!
