It looks like the parse process is working fine, even though the log says
"Unable to successfully parse content":

LOGS:
++++++++++++++++++++++++++
2013-01-16 08:13:44,887 INFO  parse.ParserJob - Parsing http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2013-01-16 08:13:44,911 WARN  parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf of type application/pdf
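
To see whether this PDF is the only document hitting that warning, the log can be scanned for the same message. The following is only a rough sketch: the function names are made up for illustration, and it assumes every hadoop.log record starts with a timestamp in the format shown above and that the warning wording matches the line quoted here.

```python
import re

# A log record is assumed to start with a timestamp such as
# "2013-01-16 08:13:44,911" (format taken from the excerpt above).
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} ")

# Wording copied from the parse.ParseUtil warning quoted above.
PARSE_WARNING = re.compile(
    r"WARN\s+parse\.ParseUtil - Unable to successfully parse content "
    r"(?P<url>\S+) of type (?P<type>\S+)"
)


def join_wrapped_records(lines):
    """Merge continuation lines (e.g. wrapped by a mail client) into one record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # No leading timestamp: treat it as the tail of the previous record.
            records[-1] += " " + line
    return records


def failed_parses(lines):
    """Return (url, content_type) pairs for every ParseUtil warning found."""
    hits = []
    for record in join_wrapped_records(lines):
        match = PARSE_WARNING.search(record)
        if match:
            hits.append((match.group("url"), match.group("type")))
    return hits
```

Passing an open log file to it, for example `failed_parses(open("logs/hadoop.log"))` (the path is an assumption), would list every URL and content type that triggered the warning.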


parsechecker -dumpText output
++++++++++++++++++++++++++
bayu@thinkpato:/opt/searchengine/nutch2x$ ./bin/nutch parsechecker -dumpText http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
---------
Url
---------------
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
---------
Metadata
---------
xmp:CreatorTool :     Writer
meta:author :     Bayu Widyasanyata
xmpTPg:NPages :     1
dc:creator :     Bayu Widyasanyata
Content-Type :     application/pdf
created :     Sun Dec 23 19:23:22 WIT 2012
Author :     Bayu Widyasanyata
Creation-Date :     2012-12-23T12:23:22Z
date :     2012-12-23T12:23:22Z
producer :     OpenOffice.org 3.2
meta:creation-date :     2012-12-23T12:23:22Z
creator :     Bayu Widyasanyata
dcterms:created :     2012-12-23T12:23:22Z
---------
ParseText
---------
Akhirat Lebih Utama Daripada Dunia Keberhasilan yang dikejar secara serius
oleh seorang muttaqin ialah keberhasilan di akhirat. Baginya keberhasilan
di dunia merupakan sesuatu yang bersifat supplementary (faktor pelengkap)
saja. Tetapi keberhasilan di akhirat adalah sesuatu yang tidak boleh
ditawar sedikitpun karena ia merupakan faktor utama. Ia tidak rela
mempertaruhkan keberhasilannya di akhirat demi keberhasilannya di dunia.
Namun sebaliknya, demi keberhasilannya di akhirat ia rela kehilangan
keberhasilannya di dunia. SpasiKosong.

====

However, the "text" value in my MySQL database is still empty for that file.

Thanks,

On Wed, Jan 16, 2013 at 7:41 AM, Bayu Widyasanyata
<[email protected]> wrote:

> On Tue, Jan 15, 2013 at 11:28 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Did you check the http.accept property in nutch-site.xml
>
>
> I copied it from nutch-default.xml, then added application/pdf:
>
> <property>
>   <name>http.accept</name>
>   <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
>   <description>Value of the "Accept" request header field.
>   </description>
> </property>
>
> It is also shown in hadoop.log:
> 2013-01-16 07:39:22,232 INFO  http.Http - http.accept =
> text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8
> --
> wassalam,
> [bayu]
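
For reference, the http.accept value quoted above can be split into its media ranges to confirm what is actually being sent. This is just a small sketch with a hypothetical helper name (`parse_accept`), not anything from Nutch itself; it relies only on the general HTTP rule that a `q` parameter applies to the single media range it is attached to and defaults to 1 when absent.

```python
def parse_accept(value):
    """Split an Accept header value into (media_range, q) pairs."""
    entries = []
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0]
        if not media_range:
            continue
        q = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, val = param.partition("=")
            if name.strip().lower() == "q":
                q = float(val)
        entries.append((media_range, q))
    return entries


# Value copied from the nutch-site.xml property quoted above.
accept = ("text/html,application/xhtml+xml,application/xml,"
          "application/pdf;q=0.9,*/*;q=0.8")
for media_range, q in parse_accept(accept):
    print(media_range, q)
```

With this value, application/pdf is listed explicitly with q=0.9, while the three types before it carry the default weight of 1.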




-- 
wassalam,
[bayu]
