Re: [VOTE] TIKA 0.2 Release Candidate 2

2008-12-09 Thread Dave Meikle
This vote passes with 3 +1s from the Lucene PMC members: Grant Ingersoll, Sami Siren, and Chris Hostetter; and 4 +1s from: Jukka Zitting, Uwe Schindler, Chris Mattmann, and Rida Benjelloun. Thanks to everyone for their input; I'll complete the related tasks this evening. Cheers, Dave

Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emits structured XHTML content.)

2008-12-09 Thread Niall Pemberton
On Mon, Dec 8, 2008 at 9:58 PM, Christopher Corbell <[EMAIL PROTECTED]> wrote: > Just to add my lurker's thoughts to this thread, for what it's worth... > > Nearly all of the issues raised in this thread (and in the other one I've > been following on Dublin Core) are to me appropriate to a "middlew

Re: Aperture is available under the BSD

2008-12-09 Thread Jérôme Charron
> > Yep, the mime type detection system in Tika is based on the one developed > for Nutch primarily by Jerome Charron. Jerome worked on an update to this > mime system, with the freedesktop.org-style interface, and then I worked to > clean this up and get the functionality into Tika. The basic ide

Re: Aperture is available under the BSD

2008-12-09 Thread Mattmann, Chris A
Hi Stephane, Thanks for your email. > I didn't know Tika mime type detection was based on freedesktop.org. > I've also developed a mimeType detection system built on top of > freedesktop, leveraging the shared-mime-info database to be accurate. Is > this what you guys have done as well? Yep, the

Re: Aperture is available under the BSD

2008-12-09 Thread Stephane Bastian
Hi Chris, I didn't know Tika mime type detection was based on freedesktop.org. I've also developed a mimeType detection system built on top of freedesktop, leveraging the shared-mime-info database to be accurate. Is this what you guys have done as well? In any case, the point I was trying to m

Re: Aperture is available under the BSD

2008-12-09 Thread Mattmann, Chris A
Hi Stephane, > This is definitely good news. Besides very good parsers, Aperture also > has strong support for mime types. I know we also have support for > detecting mime types, but at some point in time we may consider using > theirs and focus solely on writing Parsers? I would be strongly ag

Re: Tika Wiki (Was: [VOTE] New TIKA 0.2 Release Candidate 1)

2008-12-09 Thread Grant Ingersoll
FWIW, Mahout uses Confluence, and I find the experience a whole lot more pleasurable, but +0 on either choice. On Dec 8, 2008, at 5:03 AM, Jukka Zitting wrote: Hi, On Mon, Dec 8, 2008 at 2:41 AM, Mattmann, Chris A <[EMAIL PROTECTED]> wrote: Since I didn't see an official Tika wiki yet, I we

Re: Aperture is available under the BSD

2008-12-09 Thread Grant Ingersoll
I think I would let things shake out a little bit with the change to a new license. IANAL, but I think I would at least wait for a release. It does seem to make sense, though. Personally, though, I really like Tika's SAX model for extraction and the, um, lack of RDF. 2 more cents... G

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

2008-12-09 Thread Stephane Bastian
Hi, You're definitely right that there would be a mapping between a given document and XML, via a ContentHandler, which is kind of what Tika does already. This also means that metadata would be extracted from the "raw" ContentHandler. In any case, as you pointed out, Tika might not be the best

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

2008-12-09 Thread Jukka Zitting
Hi, On Tue, Dec 9, 2008 at 12:19 PM, Stephane Bastian <[EMAIL PROTECTED]> wrote: > Parsing goes through several fairly well defined steps and in the case of > Tika it could be represented as follows: > 1) Generate SAX events out of the stream > 2) Extract metadata and save it in an instance of t

Re: Aperture is available under the BSD

2008-12-09 Thread Stephane Bastian
This is definitely good news. Besides very good parsers, Aperture also has strong support for mime types. I know we also have support for detecting mime types, but at some point in time we may consider using theirs and focus solely on writing Parsers? One problem though is that parsers return R

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

2008-12-09 Thread Stephane Bastian
Hi Jukka, This fix would definitely help me in the short run since I've got to extend the HTML parser for my specific needs. However, I'm thinking that I may run into the same problem with another parser in a month or two. Therefore I'm leaning toward finding a solution that would work for all

Aperture is available under the BSD

2008-12-09 Thread Jukka Zitting
Hi, The Aperture project (http://aperture.sourceforge.net/) has relicensed all their code to the BSD license, see http://sourceforge.net/forum/forum.php?forum_id=891966. They probably have some code that we could reuse, and perhaps we also have some valuable bits to contribute to them. The BSD li

Re: Extending existing Parsers - No easy to do right now, could we make it easier?

2008-12-09 Thread Jukka Zitting
Hi, On Tue, Dec 9, 2008 at 8:27 AM, Stephane Bastian <[EMAIL PROTECTED]> wrote: > So, I wanted to know 1) if other people had trouble extending existing > Parsers? and 2) if this is an issue we should tackle? We're of course open to contributions on issues like this, but I'm wondering if your use