Hi,

On 6/29/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
I was wondering if you had a todo list or something somewhere? I have been loosely following the discussions here and see the general outline of what the goals are here: http://www.mail-archive.com/tika- [EMAIL PROTECTED]/msg00024.html (Tika discussions in Amsterdam)
That's probably the closest thing to a complete todo list for now. There's some gradual progress going on, but we are still in a formative phase where even basic practices, such as svn usage, have not yet emerged, so I wouldn't put too much weight on any single message.
Here's where I am at: I am considering extracting the Nutch parsing plugins for a project I am undertaking and wrapping them for my own purposes, but knowing Tika is around, I would just as soon do this in the context of Tika, or at least try to help out that way and have it become a part of Tika. I have not looked at Lius yet. I guess I am wondering if you have some interfaces in mind that you want to hook into, or is the Nutch model (or Lius model) already going to serve as the main model? I pretty much think the Nutch model has everything I need at the moment, but I don't want to carry around the whole set of Nutch dependencies. I am not worried about content detection at this point so much as extraction. Is the plan to adopt a similar plugin approach as Nutch?
There seems to be a general consensus that the existing solutions like Nutch are a good starting point but need some modifications before they satisfy all the goals of Tika, but few specific decisions have yet been made.
So, I guess the question is what can I do at this point to help? Should I just go ahead with my needs and then give it back as a patch and you can decide what to do with it from there? I am in somewhat of a hurry to get the basics working in the next week or so.
I would recommend that you just go forward with your plan and don't wait for us. :-) One thing you may want to take a look at is "Lius Lite" in the Tika issue tracker, which contains a trimmed version of the Lius framework, but if you are already familiar with Nutch then it probably makes more sense to stick with that. I believe the eventual Tika framework will end up incorporating concepts from both Nutch and Lius (among others). It would certainly be interesting to see what you end up with and perhaps hear a brief summary of the main issues and concerns you encountered. This is exactly the sort of stuff that Tika should support, so your contributions would be very much welcome!

BR,

Jukka Zitting
