Hi Dave,

Below are the nutch parsechecker results for Nutch 1.6 and 2.x (checked out from [0]):

++++++++++++++++++++++++++++++++++++++++++++++++++++++
VERSION 2.x
++++++++++++++++++++++++++++++++++++++++++++++++++++++

bayu@thinkpato:/opt/searchengine/nutch2x$ ./bin/nutch parsechecker http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
--------- Url ---------------
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
--------- Metadata ---------
xmp:CreatorTool : Writer
meta:author : Bayu Widyasanyata
xmpTPg:NPages : 1
dc:creator : Bayu Widyasanyata
Content-Type : application/pdf
created : Fri Dec 21 05:38:05 WIT 2012
Author : Bayu Widyasanyata
Creation-Date : 2012-12-20T22:38:05Z
date : 2012-12-20T22:38:05Z
producer : OpenOffice.org 3.2
meta:creation-date : 2012-12-20T22:38:05Z
creator : Bayu Widyasanyata
dcterms:created : 2012-12-20T22:38:05Z

++++++++++++++++++++++++++++++++++++++++++++++++++++++
VERSION 1.6
++++++++++++++++++++++++++++++++++++++++++++++++++++++

bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch parsechecker http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
fetching: http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
parsing: http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
contentType: application/pdf
signature: f992108356e0248635192bfe7c6d3efc
--------- Url ---------------
http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata:
  ETag="187478-a091-4d15067c794e6"
  Date=Tue, 15 Jan 2013 15:00:47 GMT
  Content-Length=41105
  Last-Modified=Thu, 20 Dec 2012 22:39:35 GMT
  Content-Type=application/pdf
  Connection=close
  Accept-Ranges=bytes
  Server=Apache/2.2.14 (Ubuntu)
Parse Metadata:
  xmpTPg:NPages=1
  Creation-Date=2012-12-20T22:38:05Z
  meta:author=Bayu Widyasanyata
  meta:creation-date=2012-12-20T22:38:05Z
  created=Fri Dec 21 05:38:05 WIT 2012
  dc:creator=Bayu Widyasanyata
  Author=Bayu Widyasanyata
  producer=OpenOffice.org 3.2
  dcterms:created=2012-12-20T22:38:05Z
  date=2012-12-20T22:38:05Z
  Content-Type=application/pdf
  xmp:CreatorTool=Writer
  creator=Bayu Widyasanyata

++++++++++++++++++++++++++++++++++++++++++++++++++++++

And below are the "indexchecker" results, which are available only in version 1.6:

bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch indexchecker http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
fetching: http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
parsing: http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
contentType: application/pdf
content : Akhirat Lebih Utama Daripada Dunia Keberhasilan yang dikejar secara serius oleh seorang muttaqin ial
host : localhost
tstamp : Tue Jan 15 22:05:50 WIT 2013

---

Since Nutch 2.x doesn't have the "indexchecker" command, how does Nutch 2.x know the content of a document (e.g. a PDF file)? I'm not sure about this, since my .odt file parsed successfully... Or might it be some kind of "mapping problem" between Tika's PDF parser and Nutch?

Anyway, has this issue [1] been solved? It looks the same as mine...

[0] http://svn.apache.org/repos/asf/nutch/branches/2.x/
[1] http://lucene.472066.n3.nabble.com/Nutch-2-x-ParseUtil-failing-for-some-pdf-files-td4014084.html

On Sun, Dec 30, 2012 at 6:07 AM, Dave Meikle wrote:

> Hi,
>
> Tika should parse those formats, so unless there is something peculiar
> with all your files or setup, have you tried:
>
> - checking the size of the files to see if they are over the configured limits
> - using the nutch parsechecker command to test individual files
>
> Cheers,
> Dave
>
> On 25 Dec 2012, at 01:34, Bayu Widyasanyata wrote:
>
> > Hi,
> >
> > ==Update==
> >
> > Checking hadoop.log, I found some interesting info: the parsing did
> > not complete successfully.
> >
> > ...
> > 2012-12-25 08:15:09,480 INFO parse.ParserJob - Parsing http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
> > 2012-12-25 08:15:09,480 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/vnd.oasis.opendocument.text, but they are not mapped to it in the parse-plugins.xml file
> > 2012-12-25 08:15:09,517 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt of type application/vnd.oasis.opendocument.text
> > 2012-12-25 08:15:09,520 INFO parse.ParserJob - Parsing http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> > 2012-12-25 08:15:09,521 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
> > 2012-12-25 08:15:09,545 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf of type application/pdf
> > 2012-12-25 08:15:09,551 INFO parse.ParserJob - Parsing http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
> > 2012-12-25 08:15:09,560 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt of type application/vnd.oasis.opendocument.text
> > 2012-12-25 08:15:09,563 INFO parse.ParserJob - Parsing http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> > 2012-12-25 08:15:09,590 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf of type application/pdf
> > 2012-12-25 08:15:09,597 INFO parse.ParserJob - Parsing http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
> > 2012-12-25 08:15:09,652 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf of type application/pdf
> > ...
> >
> > I checked the parse-plugins.xml file and found no plugins mapped to the
> > types application/pdf and application/vnd.oasis.opendocument.text.
> > I know that parse-tika handles PDF files, so why do those errors still occur?
> >
> > Are there any documents/links that explain, in an easy way, how to install
> > and activate the supported plugins mentioned at [1] for the nutch parser?
> >
> > [1] http://tika.apache.org/1.2/formats.html#Portable_Document_Format
> >
> > Thanks,
> >
> > --
> > wassalam,
> > [bayu]
>

--
wassalam,
[bayu]
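
P.S. In case it helps anyone compare results: the failing documents in the quoted hadoop.log excerpt can be grouped by content type with a small script. This is only a rough sketch. It assumes each log entry sits on a single line in hadoop.log and that the WARN lines keep the exact "Unable to successfully parse content <url> of type <type>" wording quoted in this thread. The function name failed_parses and the sample list are my own illustration, not part of Nutch.

```python
import re
from collections import defaultdict

# Matches the ParseUtil warning as it appears in the quoted hadoop.log excerpt.
# Assumption: the URL and the content type contain no whitespace.
FAILED = re.compile(
    r"WARN\s+parse\.ParseUtil - Unable to successfully parse content (\S+) of type (\S+)"
)


def failed_parses(lines):
    """Group URLs that failed to parse by their reported content type."""
    by_type = defaultdict(list)
    for line in lines:
        match = FAILED.search(line)
        if match:
            url, content_type = match.groups()
            by_type[content_type].append(url)
    return dict(by_type)


# Sample entries copied from the log excerpt in this thread (unwrapped to one line each).
sample = [
    "2012-12-25 08:15:09,551 INFO parse.ParserJob - Parsing http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt",
    "2012-12-25 08:15:09,560 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt of type application/vnd.oasis.opendocument.text",
    "2012-12-25 08:15:09,563 INFO parse.ParserJob - Parsing http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf",
    "2012-12-25 08:15:09,590 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf of type application/pdf",
]

for content_type, urls in failed_parses(sample).items():
    print(content_type)
    for url in urls:
        print("  " + url)
```

To run it against a real log, the sample list could be replaced by the lines of the log file, for example `failed_parses(open("logs/hadoop.log"))`, where the path is an assumption about the local setup.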

