Re: Thoughts on Parser design and dependencies

2006-08-19 Thread Andrzej Bialecki
Jukka Zitting wrote: Hi, On 8/19/06, Sami Siren wrote: So far Nutch has been built to deal mainly with text-type documents. However, there is also a need to deal with non-textual objects, e.g. images, movies, and sound, which will provide content only in the form of metadata (ok, perhaps

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to generalize this, as neither of these ideas apply

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Sami Siren
Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to generalize this, as

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Sami Siren wrote: Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like to look for ways to

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Sami Siren
Andrzej Bialecki wrote: Sami Siren wrote: Andrzej Bialecki wrote: Jukka Zitting wrote: The Parser interface is also bound to the ideas of fetching content from the network and indexing it using a standard content model through the Content and Parse dependencies. For the Tika project I'd like

Re: Thoughts on Parser design and dependencies

2006-08-18 Thread Andrzej Bialecki
Sami Siren wrote: The original motivation for this was HTTP headers and meta tags, which can have multiple values. Another case is language identification, where the same key may have multiple values coming from different sources. Additionally, MapWritable supports any Writable, which is

Thoughts on Parser design and dependencies

2006-08-16 Thread Jukka Zitting
Hi, I have some questions about the dependencies of the Parser interface, especially from the perspective of generalizing it to the potential Tika project. The current dependencies are: * Configurable - depends on the Hadoop configuration system * Pluggable - depends on the Nutch plugin