Hi,

On Tue, Dec 9, 2008 at 12:19 PM, Stephane Bastian <[EMAIL PROTECTED]> wrote:
> Parsing goes through several fairly well defined steps and in the case of
> Tika it could be represented as follow:
> 1) Generate Sax events out of the stream
> 2) Extracts metadata and save them in an instance of the Metadata class
> 3) Generate Sax events about the structure of a document

For many document types steps 1 and 2 are reversed, and 1 and 3 are
actually just a single step. I'm not sure if there's much room for
generalization here.

> How about if we slightly modify Tika to hook custom code to 1) as well. We
> could do this by adding an extra ContentHandler to the parse method:
>
> public void parse (InputStream stream, ContentHandler rawHanlder,
> ContentHandler structuredHandler, Metadata metadata) ;

Most document types simply don't have a "raw" SAX stream, so I don't
think this is a good idea in the general case. The only SAX events you
have are the ones sent to the content handler we have now, so what
you're trying to do could just as well be achieved using a
TeeContentHandler on top of the existing Parser interface.

What I believe you are looking for is a mechanism that would map the
low-level details of all sorts of document types to XML. That might be
interesting, but I'm not sure if Tika is the best place to do that. It
might be a better idea to approach the parser libraries directly about a
potential SAX mapping, as they are in a much better position to evaluate
what such a mapping should look like and whether implementing it is
reasonable.

> 2) Ability to leverage the MatchingContentHandler which is also working in
> streaming mode. BTW, to me this part would probably deserve a project on its
> own

Thanks, I did think it was a good idea, but it's good to hear that
others like it too. :-)

BR,

Jukka Zitting
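
PS. To make the TeeContentHandler suggestion a bit more concrete, here is
a rough sketch of the idea using only the JDK's org.xml.sax classes, so it
runs without Tika on the classpath. TeeSketch, ElementLogger and
TextCollector are made-up names for illustration only, and the small XML
string just stands in for whatever SAX events a parser would emit. In real
code you would presumably hand Tika's own TeeContentHandler to the existing
parse method instead of writing your own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a tee handler: forwards SAX events to every delegate.
// Only three callbacks are forwarded here; a complete tee would forward them all.
class TeeSketch extends DefaultHandler {
    private final ContentHandler[] delegates;

    TeeSketch(ContentHandler... delegates) {
        this.delegates = delegates;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        for (ContentHandler h : delegates) h.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        for (ContentHandler h : delegates) h.endElement(uri, local, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler h : delegates) h.characters(ch, start, length);
    }

    // The "custom hook": records element names as the events stream past.
    static class ElementLogger extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            names.add(qName);
        }
    }

    // The regular consumer: collects character data from the same event stream.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Parses one small document and feeds both handlers through a single tee.
    static String demo() throws Exception {
        ElementLogger logger = new ElementLogger();
        TextCollector collector = new TextCollector();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new TeeSketch(logger, collector));
        reader.parse(new InputSource(
                new StringReader("<html><body><p>Hello</p></body></html>")));
        return logger.names + " | " + collector.text;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo());
    }
}
```

Both handlers see the same events in a single streaming pass, and the
parse method signature does not need to change.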