Hi,

Tika should parse those formats, so unless there is something peculiar
about your files or setup, have you checked the following?

- The size of the files, to see whether they exceed the configured
  content limits
- The nutch parsechecker command, to test individual files
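
A rough sketch of both checks, in case it helps. The over_limit and
remote_size helpers are my own illustration and not part of Nutch, and
65536 is only what I believe the default http.content.limit to be, so
compare it against your own nutch-default.xml and nutch-site.xml. The
commented commands assume you run them from the Nutch runtime directory
with the web server up.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): succeeds when SIZE exceeds LIMIT.
# A negative LIMIT means "no truncation", which is how http.content.limit
# is documented to behave.
over_limit() {
    size=$1
    limit=$2
    [ "$limit" -ge 0 ] && [ "$size" -gt "$limit" ]
}

# Hypothetical helper: print the Content-Length the server reports for a URL.
remote_size() {
    curl -sI "$1" | tr -d '\r' |
        awk 'tolower($1) == "content-length:" { print $2 }'
}

# Example usage (commented out: needs the server and a Nutch install).
# 65536 is an assumed limit; substitute the value from your configuration.
#   url='http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf'
#   if over_limit "$(remote_size "$url")" 65536; then
#       echo "content would be truncated before parsing"
#   fi
#   bin/nutch parsechecker "$url"
```

If a file turns out to be larger than the limit, the fetched content is
cut short, and a truncated PDF or ODT is a plausible reason for Tika to
fail on it.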

Cheers,
Dave

On 25 Dec 2012, at 01:34, Bayu Widyasanyata <[email protected]> wrote:

> Hi,
>
> ==Update==
>
> Checking hadoop.log, I found some interesting info: the parsing did
> not complete successfully.
>
> ...
> 2012-12-25 08:15:09,480 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> 2012-12-25 08:15:09,480 INFO  parse.ParserFactory - The parsing
> plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content
> type application/vnd.oasis.opendocument.text, but they are not mapped
> to it  in the parse-plugins.xml file
> 2012-12-25 08:15:09,517 WARN  parse.ParseUtil - Unable to successfully
> parse content 
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> of type application/vnd.oasis.opendocument.text
> 2012-12-25 08:15:09,520 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 2012-12-25 08:15:09,521 INFO  parse.ParserFactory - The parsing
> plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> plugin.includes system property, and all claim to support the content
> type application/pdf, but they are not mapped to it  in the
> parse-plugins.xml file
> 2012-12-25 08:15:09,545 WARN  parse.ParseUtil - Unable to successfully
> parse content 
> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> of type application/pdf
> 2012-12-25 08:15:09,551 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
> 2012-12-25 08:15:09,560 WARN  parse.ParseUtil - Unable to successfully
> parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
> of type application/vnd.oasis.opendocument.text
> 2012-12-25 08:15:09,563 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> 2012-12-25 08:15:09,590 WARN  parse.ParseUtil - Unable to successfully
> parse content 
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> of type application/pdf
> 2012-12-25 08:15:09,597 INFO  parse.ParserJob - Parsing
> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> 2012-12-25 08:15:09,652 WARN  parse.ParseUtil - Unable to successfully
> parse content 
> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> of type application/pdf
> ...
>
> I checked the parse-plugins.xml file and found no plugins mapped to
> the types application/pdf and application/vnd.oasis.opendocument.text.
> I know that parse-tika handles PDF files, so why do these errors still
> occur?
>
> Are there any documents or links that explain, in a simple way, how to
> install and activate the plugins that support the formats listed at [1]
> in the Nutch parser?
>
> [1] http://tika.apache.org/1.2/formats.html#Portable_Document_Format
>
> Thanks,
>
> --
> wassalam,
> [bayu]
