Jukka Zitting wrote:
Hi,
On 8/19/06, Sami Siren [EMAIL PROTECTED] wrote:
So far Nutch has been built to deal mainly with text-type documents.
There is, however, also a need to deal with non-textual objects, e.g. images,
movies, and sound, which will provide content only in the form of metadata (ok,
perhaps
Jukka Zitting wrote:
The Parser interface is also bound to the ideas of fetching content
from the network and indexing it using a standard content model
through the Content and Parse dependencies. For the Tika project I'd
like to look for ways to generalize this, as neither of these ideas
apply
Sami Siren wrote:
The original motivation for this was HTTP headers and meta tags, which
can have multiple values. Another case is language
identification, where the same key may have multiple values coming
from different sources. Additionally, MapWritable supports any
Writable, which is
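The multi-valued metadata idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual Nutch MapWritable/Metadata implementation; the class and method names here are hypothetical:

```java
import java.util.*;

// Hypothetical sketch: the same key (e.g. an HTTP header or a language
// guess) may carry several values coming from different sources.
public class MultiValuedMetadata {
    private final Map<String, List<String>> data = new HashMap<>();

    // Append a value under a key instead of overwriting it.
    public void add(String key, String value) {
        data.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Return all values recorded for a key (empty list if none).
    public List<String> getValues(String key) {
        return data.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        MultiValuedMetadata meta = new MultiValuedMetadata();
        meta.add("Content-Language", "en"); // e.g. from an HTTP header
        meta.add("Content-Language", "fi"); // e.g. from a meta tag
        System.out.println(meta.getValues("Content-Language")); // prints [en, fi]
    }
}
```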
Hi,
I have some questions about the dependencies of the Parser interface,
especially from the perspective of generalizing it to the potential
Tika project. The current dependencies are:
* Configurable - depends on the Hadoop configuration system
* Pluggable - depends on the Nutch plugin
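The dependency shape being questioned above can be sketched like this. The stub types below are simplified assumptions standing in for the real Nutch/Hadoop classes of that era, not the actual interfaces:

```java
// Simplified stubs (assumed shapes, not the real Nutch/Hadoop code):
interface Configurable { void setConf(Object conf); }   // Hadoop configuration hook
interface Pluggable { }                                  // Nutch plugin extension-point marker
class Content { /* fetched bytes plus protocol metadata */ }
class Parse   { /* extracted text plus parse metadata */ }

// A Parser in this design inherits both framework dependencies and is
// tied to the fetch-time Content model and the index-time Parse model --
// the coupling that generalizing for Tika would need to loosen.
interface Parser extends Configurable, Pluggable {
    Parse getParse(Content content);
}
```

A standalone Tika-style parser would presumably drop the Configurable and Pluggable supertypes and replace Content/Parse with framework-neutral input and output types.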