Did you check the http.accept property in nutch-site.xml?
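
For reference, here is a rough sketch of what such an override might look like in conf/nutch-site.xml. The value is an assumption: it is modelled on the stock Accept header value shipped in nutch-default.xml, with application/pdf added purely for illustration. Compare it against your own nutch-default.xml before copying anything.

```xml
<!-- conf/nutch-site.xml: hypothetical override, adjust to your setup -->
<configuration>
  <property>
    <name>http.accept</name>
    <!-- Assumed value: default-style Accept header plus application/pdf (illustrative) -->
    <value>text/html,application/xhtml+xml,application/pdf,application/xml;q=0.9,*/*;q=0.8</value>
    <description>Value of the "Accept" request header field.</description>
  </property>
</configuration>
```

Properties set in nutch-site.xml take precedence over nutch-default.xml, so this is the place to check if the header has been changed locally.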

On Tuesday, January 15, 2013, Bayu Widyasanyata <[email protected]>
wrote:
> Hi Dave,
> Below are the nutch parsechecker results for Nutch 1.6 and 2.x (checked
> out from [0]):
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> VERSION 2.x
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> bayu@thinkpato:/opt/searchengine/nutch2x$ ./bin/nutch parsechecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> Url
> ---------------
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> Metadata
> ---------
> xmp:CreatorTool :     Writer
> meta:author :     Bayu Widyasanyata
> xmpTPg:NPages :     1
> dc:creator :     Bayu Widyasanyata
> Content-Type :     application/pdf
> created :     Fri Dec 21 05:38:05 WIT 2012
> Author :     Bayu Widyasanyata
> Creation-Date :     2012-12-20T22:38:05Z
> date :     2012-12-20T22:38:05Z
> producer :     OpenOffice.org 3.2
> meta:creation-date :     2012-12-20T22:38:05Z
> creator :     Bayu Widyasanyata
> dcterms:created :     2012-12-20T22:38:05Z
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> VERSION 1.6
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch parsechecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> fetching:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> parsing:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> contentType: application/pdf
> signature: f992108356e0248635192bfe7c6d3efc
> ---------
> Url
> ---------------
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> ---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title:
> Outlinks: 0
> Content Metadata: ETag="187478-a091-4d15067c794e6" Date=Tue, 15 Jan 2013
> 15:00:47 GMT Content-Length=41105 Last-Modified=Thu, 20 Dec 2012 22:39:35
> GMT Content-Type=application/pdf Connection=close Accept-Ranges=bytes
> Server=Apache/2.2.14 (Ubuntu)
> Parse Metadata: xmpTPg:NPages=1 Creation-Date=2012-12-20T22:38:05Z
> meta:author=Bayu Widyasanyata meta:creation-date=2012-12-20T22:38:05Z
> created=Fri Dec 21 05:38:05 WIT 2012 dc:creator=Bayu Widyasanyata
> Author=Bayu Widyasanyata producer=OpenOffice.org 3.2
> dcterms:created=2012-12-20T22:38:05Z date=2012-12-20T22:38:05Z
> Content-Type=application/pdf xmp:CreatorTool=Writer creator=Bayu
> Widyasanyata
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> And below are the "indexchecker" results, which are available only in
> version 1.6:
>
> bayu@thinkpato:/opt/searchengine/nutch2x$ ../nutch/bin/nutch indexchecker
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> fetching:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> parsing:
> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
> contentType: application/pdf
> content :    Akhirat Lebih Utama Daripada Dunia Keberhasilan yang dikejar
> secara serius oleh seorang muttaqin ial
> host :    localhost
> tstamp :    Tue Jan 15 22:05:50 WIT 2013
>
> ---
>
> Since Nutch 2.x doesn't have an "indexchecker" command, how can I tell
> whether Nutch 2.x extracted the content of a document (e.g. PDF files)?
> I'm not sure about this, since my .odt file parsed successfully...
>
> Or might it be some kind of mapping problem between Tika's PDF parser and
> Nutch?
>
> Anyway,
> Has this issue [1] been solved?
> It is the same as mine...
>
> [0] http://svn.apache.org/repos/asf/nutch/branches/2.x/
> [1] http://lucene.472066.n3.nabble.com/Nutch-2-x-ParseUtil-failing-for-some-pdf-files-td4014084.html
>
> On Sun, Dec 30, 2012 at 6:07 AM, Dave Meikle <[email protected]> wrote:
>
>> Hi,
>>
>> Tika should parse those formats, so unless there is something peculiar
>> with all your files or setup, have you checked the following?
>>
>> - The size of the files, to see if they are over the configured limits
>> - The nutch parsechecker command, to test individual files
>>
>> Cheers,
>> Dave
>>
>> On 25 Dec 2012, at 01:34, Bayu Widyasanyata <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > ==Update==
>> >
>> > Checking hadoop.log, I found some interesting info: the parsing did
>> > not complete successfully.
>> >
>> > ...
>> > 2012-12-25 08:15:09,480 INFO  parse.ParserJob - Parsing
>> > http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
>> > 2012-12-25 08:15:09,480 INFO  parse.ParserFactory - The parsing
>> > plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> > plugin.includes system property, and all claim to support the content
>> > type application/vnd.oasis.opendocument.text, but they are not mapped
>> > to it  in the parse-plugins.xml file
>> > 2012-12-25 08:15:09,517 WARN  parse.ParseUtil - Unable to successfully
>> > parse content
>> > http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
>> > of type application/vnd.oasis.opendocument.text
>> > 2012-12-25 08:15:09,520 INFO  parse.ParserJob - Parsing
>> > http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> > 2012-12-25 08:15:09,521 INFO  parse.ParserFactory - The parsing
>> > plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the
>> > plugin.includes system property, and all claim to support the content
>> > type application/pdf, but they are not mapped to it  in the
>> > parse-plugins.xml file
>> > 2012-12-25 08:15:09,545 WARN  parse.ParseUtil - Unable to successfully
>> > parse content
>> > http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> > of type application/pdf
>> > 2012-12-25 08:15:09,551 INFO  parse.ParserJob - Parsing
>> > http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>> > 2012-12-25 08:15:09,560 WARN  parse.ParseUtil - Unable to successfully
>> > parse content
>> > http://localhost/sapi/Akhirat_Lebih_Utama_Daripada_Dunia.odt
>> > of type application/vnd.oasis.opendocument.text
>> > 2012-12-25 08:15:09,563 INFO  parse.ParserJob - Parsing
>> > http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
>> > 2012-12-25 08:15:09,590 WARN  parse.ParseUtil - Unable to successfully
>> > parse content
>> > http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf
>> > of type application/pdf
>> > 2012-12-25 08:15:09,597 INFO  parse.ParserJob - Parsing
>> > http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> > 2012-12-25 08:15:09,652 WARN  parse.ParseUtil - Unable to successfully
>> > parse content
>> > http://localhost/sapi/spasi%20Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
>> > of type application/pdf
>> > ...
>> >
>> > I checked the parse-plugins.xml file and found no plugins mapped to
>> > the types application/pdf and application/vnd.oasis.opendocument.text.
>> > I know that parse-tika handles PDF files, but why those errors were--
> wassalam,
> [bayu]
>

-- 
*Lewis*
