Hi,

Thank you for the suggestions. I was also trying to upgrade Tika to 1.2, as mentioned in https://issues.apache.org/jira/browse/NUTCH-1433
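
For reference, if I read the log message correctly, the mapping it says is missing from parse-plugins.xml would look roughly like the fragment below. This is an untested sketch, not a confirmed fix: the plugin id "parse-tika" and the alias line are my assumptions based on the plugin name and the class shown in the log, and the INFO message may be only informational rather than the cause of the "Unable to successfully parse content" warnings.

```xml
<!-- Sketch only: entries assumed to go inside the existing <parse-plugins> root element -->

<!-- Map the two content types from the log to the Tika parser plugin -->
<mimeType name="application/pdf">
  <plugin id="parse-tika" />
</mimeType>

<mimeType name="application/vnd.oasis.opendocument.text">
  <plugin id="parse-tika" />
</mimeType>

<!-- Assumed alias: ties the plugin id to the parser class named in the log -->
<aliases>
  <alias name="parse-tika"
         extension-id="org.apache.nutch.parse.tika.TikaParser" />
</aliases>
```

If the file already has an <aliases> section, the alias line would belong inside it rather than in a second section.
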
I will try your suggestions and/or upgrade Tika.

On Sun, Dec 30, 2012 at 6:07 AM, Dave Meikle <[email protected]> wrote:
> Hi,
>
> Tika should parse those formats, so unless there is something peculiar
> with all your files or setup, have you tried the:
>
> - Size of the files to see if they are over configured limits
> - used the nutch parsechecker command to test individual files
>
> Cheers,
> Dave
>
> On 25 Dec 2012, at 01:34, Bayu Widyasanyata <[email protected]> wrote:
>
>> Hi,
>>
>> ==Update==
>>
>> Checking hadoop.log found some interesting info that the parsing was
>> not completed successfully.
>>
>> ...
>> 2012-12-25 08:15:09,480 INFO parse.ParserJob - Parsing
>> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
>> 2012-12-25 08:15:09,480 INFO parse.ParserFactory - The parsing
>> plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> plugin.includes system property, and all claim to support the content
>> type application/vnd.oasis.opendocument.text, but they are not mapped
>> to it in the parse-plugins.xml file
>> 2012-12-25 08:15:09,517 WARN parse.ParseUtil - Unable to successfully
>> parse content
>> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
>> of type application/vnd.oasis.opendocument.text
>> 2012-12-25 08:15:09,520 INFO parse.ParserJob - Parsing
>> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> 2012-12-25 08:15:09,521 INFO parse.ParserFactory - The parsing
>> plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> plugin.includes system property, and all claim to support the content
>> type application/pdf, but they are not mapped to it in the
>> parse-plugins.xml file
>> 2012-12-25 08:15:09,545 WARN parse.ParseUtil - Unable to successfully
>> parse content
>> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> of type application/pdf
>> 2012-12-25 08:15:09,551 INFO parse.ParserJob - Parsing
>> http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>> 2012-12-25 08:15:09,560 WARN parse.ParseUtil - Unable to successfully
>> parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>> of type application/vnd.oasis.opendocument.text
>> 2012-12-25 08:15:09,563 INFO parse.ParserJob - Parsing
>> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
>> 2012-12-25 08:15:09,590 WARN parse.ParseUtil - Unable to successfully
>> parse content
>> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
>> of type application/pdf
>> 2012-12-25 08:15:09,597 INFO parse.ParserJob - Parsing
>> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> 2012-12-25 08:15:09,652 WARN parse.ParseUtil - Unable to successfully
>> parse content
>> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> of type application/pdf
>> ...
>>
>> I checked the parse-plugins.xml file and found no plugins handling
>> type of application/pdf and application/vnd.oasis.opendocument.text.
>> I knew that parse-tika handle PDF files but why those errors were still
>> occurs?
>>
>> Any documents/links could explain in easy way to install and activate
>> those supported plugins as mentioned at [1] on nutch parser?
>>
>> [1] http://tika.apache.org/1.2/formats.html#Portable_Document_Format
>>
>> Thanks,
>>
>> --
>> wassalam,
>> [bayu]

--
wassalam,
[bayu]

