Hi Olivier,

On Wed, Jun 20, 2012 at 2:29 PM, Olivier LEVILLAIN
<[email protected]> wrote:
>
> Actually, I do not understand how nutch chooses the mime type. For instance,
> for an rtf file, sometimes, it takes text/rtf and sometimes
> application/rtf.... (in the same context exactly)

Do you have an example of this? I have not come across any Nutch
parsing plugins handling mimeTypes inconsistently, but if a given
page's mimeType is being interpreted inconsistently by our parsing
plugins, then that is not good. It would be great to see some
evidence of this.

>
> Is there a way to manually map file termination to mime types with Nutch?
> I saw that tika was providing a tika-mimetypes.xml file for this but I can't
> find it on my (standard) installation...

In Nutch we do things slightly differently from how they are done
over in Tika w.r.t. these config files.
As you correctly pointed out, parse-plugins.xml allows us to
explicitly order the preference of parsers for each mimeType. I am
confused that there is no dedicated Tika parser at
org.apache.tika.parser.csv. My initial guess is that it may be
implemented somewhere as an extension of plain text, but I would
really need someone to back me up on this one.
I suppose the other option is to implement a custom parse-csv plugin
using something like the Tika-CSV-Parser [0], which looks very easy
to use. Then simply replace the generated tika artifact with your
new library.
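
For illustration, here is a rough sketch of what the corresponding
entries in conf/parse-plugins.xml could look like. It assumes the
stock parse-tika plugin id, and the parse-csv id is purely
hypothetical (it stands for the custom plugin suggested above, which
does not exist yet). Treat it as a starting point rather than a
tested configuration.

```xml
<!-- Fragment only: these entries go inside the existing <parse-plugins> element. -->

<!-- Map both RTF mimeTypes to the same parser, so that either
     detection result ends up being handled the same way. -->
<mimeType name="text/rtf">
  <plugin id="parse-tika" />
</mimeType>
<mimeType name="application/rtf">
  <plugin id="parse-tika" />
</mimeType>

<!-- Hypothetical: "parse-csv" is an assumed id for a custom plugin.
     Plugins are tried in the order listed, so parse-tika is the fallback. -->
<mimeType name="text/csv">
  <plugin id="parse-csv" />
  <plugin id="parse-tika" />
</mimeType>
```

A new plugin id would presumably also need its own alias entry in the
same file, and the plugin would have to be enabled via plugin.includes
before Nutch picks it up.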

hth

[0] https://github.com/msalgado/Tika-CSV-Parser
