Hi Olivier,

I think if you need more info on the Tika parser implementation for
csv you should head over to [email protected] and see what folks there can
offer you.

On Thu, Jun 21, 2012 at 8:42 AM, Olivier LEVILLAIN
<[email protected]> wrote:
> Nevertheless, my parse-plugins.xml contains:
>       <mimeType name="text/rtf">
>                <plugin id="parse-tika" />
>        </mimeType>
>        <mimeType name="application/rtf">
>                <plugin id="parse-tika" />
>        </mimeType>

I thought we were talking about csv? The mime type definitions for csv
and rtf are different...
That being said, the above config looks fine. Even if it were not
there, parse-tika would still pick these types up automatically via the
wildcard default mapping.
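
For reference, the wildcard fallback I mean looks roughly like this in
the stock parse-plugins.xml (written from memory, so treat it as a sketch
and check it against your own conf/ copy):

       <!-- catch-all: any mime type without an explicit entry goes to parse-tika -->
       <mimeType name="*">
                <plugin id="parse-tika" />
       </mimeType>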

> because nutch complained that first it didn't find text/rtf and the second
> time it didn't find application/rtf

Strange. Tika certainly has an rtf parser implementation.

>
> I'll take a look at your solution with Tika-CSV-parser but it's a shame it
> does not come out of the box with tika/nutch, but it seems to imply
> recompiling things and so on

AFAIK it's a simple case of obtaining the Tika source, making the
necessary changes, compiling the jar and then putting it on your Nutch
classpath. I don't expect it to be too much hassle.

> In the meantime, what would be the simple trick to parse csv as plain text?
> Which *existing* parser should I use?

As you have highlighted, it appears parse-tika is not working out of
the box with this mimeType. I suggest you invest half an hour or so
and get the github csv parser working.

hth
