Hi Guys,

I am going to work on getting this in today. I had wanted to wait on this because, while doing the work for TIKA-6, I ran into a few issues with Tika that I think we need to close the loop on before we go too far down any path. I wanted to summarize these ideas more coherently in a longer email, but since there is some interest right now, I'll do my best to get some of them out there now and then talk through them in more detail as soon as I get a chance.

Ok, so right now the biggest thing that I see is that the parsing system we have is based on the code from Luis, which is great. It works and has a lot of capability; however, Luis introduced some things that we (as the "Tika" team) haven't yet thought through in detail.

Here's a practical example. We have code in Tika right now called "LuisConfig", which is being used as the main "configuration" representation for Tika parsers. This is fine; however, it imposes a data model, e.g., a "Parser" should have a "name", a "class", a "namespace", etc. One issue with this is that, as a developer, I wasn't exactly sure how to handle configuration while I was doing the mime database patch for TIKA-6. The mime database requires configuration parameters (e.g., "where are the mime database XML files located?"); however, we don't really have a place in Tika for an overall configuration right now. We have the LuisConfig (which should probably be renamed to "TikaParserConfig" or something like that), but as far as I can tell it is very parser-specific, and not about the configuration of the Tika toolkit as a whole.

So, the questions become the following:

1. How do we configure Tika? This somewhat relates to the prior discussion on the Tika parser interface, but it extends beyond that.

2. What are the right data attributes to configure a parser? Could we get some documentation on them? For instance, is the "mime" attribute in the config.xml file something that defines the "acceptable" mime types that a parser can parse? We need to do some simple data engineering/documentation here, so that people know what they are doing.

3. What are the entry points into Tika? As far as I can tell, there is a ParserFactory that can be used to get a Parser for a particular file or URL, etc.
This implies that the ParserFactory performs some sort of mime type resolution (which it does); however, mime type resolution (using the new mime framework) requires Tika to have a configuration.

So, here's my proposal to address these issues:

1. We define a TikaConfiguration object, and an XML file location/format within CM for Tika configuration properties. I'm fine with using the Configuration object class from Nutch/Hadoop, and their associated file format. This was included with the patch for TIKA-6 that I attached; however, it was incomplete, and was probably stored in the wrong place (org.apache.tika.utils). What do others think?

2. We sit down and baseline a set of properties, including documentation on them, for Tika parsers. We should also change everything in CM right now that says "Luis" to "Tika". Though Luis is really cool and a neat project, we are working on Tika here, not Luis.

3. Solving #1 and #2 above will help to address question #3. Then I can link the mime system from TIKA-6 into the ParserFactory, and we'll probably have enough capability and functionality to really start testing the Tika library, and maybe even be ready for a 0.1-alpha release.

What do you guys think about this?

Thanks!

Cheers,
  Chris

P.S. More to come later, and sorry about the stream-of-consciousness style of writing! ;)

On 9/20/07 7:46 AM, "Bertrand Delacretaz" <[EMAIL PROTECTED]> wrote:

> Hi Keith,
>
>> ...A few days ago I posted a patch that would have enabled the use of URL's
>> as input specifiers in Tika (see TIKA-17). However, given the code changes
>> since then, I expect that applying the patch would now fail....
>
> I reviewed your patch and it looks ok to me, with just minor comments
> (see JIRA).
>
> I haven't committed it because Chris has assigned the issue to
> himself. I don't know if he's working on it, but IMHO we could commit
> it as is.
>
>> ...Tika is
>> an extremely useful product, and it can be functional momentarily without a
>> great deal of change....
>
> Agreed, as long as we don't release it, there are no promises about
> API stability or whatever, so having more usable code is certainly a
> good thing.
>
> -Bertrand

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory Pasadena, CA
Office: 171-266B Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.
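
To make proposal #1 above a bit more concrete, here is a rough sketch of what a toolkit-wide configuration object might look like, assuming we adopt the Nutch/Hadoop-style property file (a <configuration> element holding <property> entries with <name> and <value> children). Everything here is hypothetical: the shape of the TikaConfiguration class, its load() and get() methods, the "tika.mime.repository" property name, and the example path are illustrations only, not existing Tika code and not the version from the TIKA-6 patch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical toolkit-wide configuration; not an existing Tika class. */
final class TikaConfiguration {

    // Flat name -> value map, mirroring the Hadoop-style property file.
    private final Map<String, String> properties = new HashMap<>();

    /** Reads every <property><name>..</name><value>..</value></property> entry. */
    static TikaConfiguration load(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        TikaConfiguration conf = new TikaConfiguration();
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            String name = childText(prop, "name");
            String value = childText(prop, "value");
            // Skip malformed entries that lack a name or a value.
            if (name != null && value != null) {
                conf.properties.put(name, value);
            }
        }
        return conf;
    }

    /** Returns the trimmed text of the first child with the given tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    /** Looks up a property, falling back to defaultValue when it is not set. */
    String get(String name, String defaultValue) {
        return properties.getOrDefault(name, defaultValue);
    }

    public static void main(String[] args) throws Exception {
        // Property name and path are made up for illustration.
        String xml =
              "<configuration>"
            + "  <property>"
            + "    <name>tika.mime.repository</name>"
            + "    <value>/etc/tika/mime</value>"
            + "  </property>"
            + "</configuration>";
        TikaConfiguration conf = TikaConfiguration.load(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(conf.get("tika.mime.repository", "mime-types.xml"));
    }
}
```

With something along these lines, the ParserFactory could ask the configuration where the mime database lives instead of hard-coding it, and the parser-specific settings could stay in the parser config.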
