Did you check the http.accept property in nutch-site.xml?

On Tuesday, January 15, 2013, Bayu Widyasanyata <[email protected]> wrote:
> Hi Dave,
> Below are nutch parsechecker between nutch 1.6 and 2.x (checkout from [0]):
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> VERSION 2.x
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> bayu@thinkpato:/opt/searchengine/nutch2x$ ./bin/nutch parsechecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> Url
> ---------------
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> Metadata
> ---------
> xmp:CreatorTool : Writer
> meta:author : Bayu Widyasanyata
> xmpTPg:NPages : 1
> dc:creator : Bayu Widyasanyata
> Content-Type : application/pdf
> created : Fri Dec 21 05:38:05 WIT 2012
> Author : Bayu Widyasanyata
> Creation-Date : 2012-12-20T22:38:05Z
> date : 2012-12-20T22:38:05Z
> producer : OpenOffice.org 3.2
> meta:creation-date : 2012-12-20T22:38:05Z
> creator : Bayu Widyasanyata
> dcterms:created : 2012-12-20T22:38:05Z
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> VERSION 1.6
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch parsechecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> fetching:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> parsing:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> contentType: application/pdf
> signature: f992108356e0248635192bfe7c6d3efc
> ---------
> Url
> ---------------
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: ETag="187478-a091-4d15067c794e6" Date=Tue, 15 Jan 2013
> 15:00:47 GMT Content-Length=41105 Last-Modified=Thu, 20 Dec 2012 22:39:35
> GMT Content-Type=application/pdf Connection=close Accept-Ranges=bytes
> Server=Apache/2.2.14 (Ubuntu)
> Parse Metadata: xmpTPg:NPages=1 Creation-Date=2012-12-20T22:38:05Z
> meta:author=Bayu Widyasanyata meta:creation-date=2012-12-20T22:38:05Z
> created=Fri Dec 21 05:38:05 WIT 2012 dc:creator=Bayu Widyasanyata
> Author=Bayu Widyasanyata producer=OpenOffice.org 3.2
> dcterms:created=2012-12-20T22:38:05Z date=2012-12-20T22:38:05Z
> Content-Type=application/pdf xmp:CreatorTool=Writer creator=Bayu
> Widyasanyata
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> And below are the "indexchecker" results which available only on version
> 1.6:
>
> bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch indexchecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> fetching:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> parsing:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> contentType: application/pdf
> content : Akhirat Lebih Utama Daripada Dunia Keberhasilan yang dikejar
> secara serius oleh seorang muttaqin ial
> host : localhost
> tstamp : Tue Jan 15 22:05:50 WIT 2013
>
> ---
>
> Since version 2.x of nutch doesn't have "indexchecker" command, how
> nutch2.x know the content of a document (i.e. PDF files)?
> I'm not sure with this since my .odt file parsed successfully...
>
> Or might be something "mapping problem in Tika's pdf" parser with nutch?
>
> Anyway,
> Does this issue [1] has been solved?
> This issue is same with me...
>
> [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/
> [1]
> http://lucene.472066.n3.nabble.com/Nutch-2-x-ParseUtil-failing-for-some-pdf-files-td4014084.html
>
> On Sun, Dec 30, 2012 at 6:07 AM, Dave Meikle <[email protected]> wrote:
>
>> Hi,
>>
>> Tika should parse those formats, so unless there is something peculiar
>> with all your files or setup, have you tried the:
>>
>> - Size of the files to see if they are over configured limits
>> - used the nutch parsechecker command to test individual files
>>
>> Cheers,
>> Dave
>>
>> On 25 Dec 2012, at 01:34, Bayu Widyasanyata <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > ==Update==
>> >
>> > Checking hadoop.log found some interesting info that the parsing was
>> > not completed successfully.
>> >
>> > ...
>> > 2012-12-25 08:15:09,480 INFO parse.ParserJob - Parsing
>> > http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
>> > 2012-12-25 08:15:09,480 INFO parse.ParserFactory - The parsing
>> > plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> > plugin.includes system property, and all claim to support the content
>> > type application/vnd.oasis.opendocument.text, but they are not mapped
>> > to it in the parse-plugins.xml file
>> > 2012-12-25 08:15:09,517 WARN parse.ParseUtil - Unable to successfully
>> > parse content
>> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
>> > of type application/vnd.oasis.opendocument.text
>> > 2012-12-25 08:15:09,520 INFO parse.ParserJob - Parsing
>> > http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> > 2012-12-25 08:15:09,521 INFO parse.ParserFactory - The parsing
>> > plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> > plugin.includes system property, and all claim to support the content
>> > type application/pdf, but they are not mapped to it in the
>> > parse-plugins.xml file
>> > 2012-12-25 08:15:09,545 WARN parse.ParseUtil - Unable to successfully
>> > parse content
>> http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> > of type application/pdf
>> > 2012-12-25 08:15:09,551 INFO parse.ParserJob - Parsing
>> > http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>> > 2012-12-25 08:15:09,560 WARN parse.ParseUtil - Unable to successfully
>> > parse content
>> http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>> > of type application/vnd.oasis.opendocument.text
>> > 2012-12-25 08:15:09,563 INFO parse.ParserJob - Parsing
>> > http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
>> > 2012-12-25 08:15:09,590 WARN parse.ParseUtil - Unable to successfully
>> > parse content
>> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
>> > of type application/pdf
>> > 2012-12-25 08:15:09,597 INFO parse.ParserJob - Parsing
>> >
>> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> > 2012-12-25 08:15:09,652 WARN parse.ParseUtil - Unable to successfully
>> > parse content
>> http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> > of type application/pdf
>> > ...
>> >
>> > I checked the parse-plugins.xml file and found no plugins handling
>> > type of application/pdf and application/vnd.oasis.opendocument.text.
>> > I knew that parse-tika handle PDF files but why those errors were
> --
> wassalam,
> [bayu]
>
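
For reference, a minimal sketch of what an http.accept override in conf/nutch-site.xml might look like. The value shown is only an illustration and an assumption, not a recommendation; compare it with the http.accept entry in your own nutch-default.xml before changing anything.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative value only (an assumption): check nutch-default.xml
       for the value your checkout actually ships with. -->
  <property>
    <name>http.accept</name>
    <value>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
  </property>
</configuration>
```

Regarding the "not mapped to it in the parse-plugins.xml file" lines in the quoted log: an explicit mapping in conf/parse-plugins.xml would look roughly like the sketch below. Note that this message is logged at INFO level, so a missing mapping may well not be the cause of the WARN lines; treat this as something to rule out rather than as a fix.

```xml
<parse-plugins>
  <!-- Map both content types from the log to the Tika parser. -->
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>
  <mimeType name="application/vnd.oasis.opendocument.text">
    <plugin id="parse-tika" />
  </mimeType>

  <aliases>
    <!-- The extension id is the class name reported in the log. -->
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

After either change, rerunning the same parsechecker command against the PDF URL should show whether anything is different.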
-- *Lewis*

