It looks like the parse process is working fine, even though the log says it was "Unable to successfully parse" the content:

LOGS:
++++++++++++++++++++++++++
2013-01-16 08:13:44,887 INFO parse.ParserJob - Parsing http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
2013-01-16 08:13:44,911 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf of type application/pdf

parsechecker -dumpText output
++++++++++++++++++++++++++
bayu@thinkpato:/opt/searchengine/nutch2x$ ./bin/nutch parsechecker -dumpText http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
--------- Url ---------------
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
--------- Metadata ---------
xmp:CreatorTool : Writer
meta:author : Bayu Widyasanyata
xmpTPg:NPages : 1
dc:creator : Bayu Widyasanyata
Content-Type : application/pdf
created : Sun Dec 23 19:23:22 WIT 2012
Author : Bayu Widyasanyata
Creation-Date : 2012-12-23T12:23:22Z
date : 2012-12-23T12:23:22Z
producer : OpenOffice.org 3.2
meta:creation-date : 2012-12-23T12:23:22Z
creator : Bayu Widyasanyata
dcterms:created : 2012-12-23T12:23:22Z
--------- ParseText ---------
Akhirat Lebih Utama Daripada Dunia Keberhasilan yang dikejar secara serius oleh seorang muttaqin ialah keberhasilan di akhirat. Baginya keberhasilan di dunia merupakan sesuatu yang bersifat supplementary (faktor pelengkap) saja. Tetapi keberhasilan di akhirat adalah sesuatu yang tidak boleh ditawar sedikitpun karena ia merupakan faktor utama. Ia tidak rela mempertaruhkan keberhasilannya di akhirat demi keberhasilannya di dunia. Namun sebaliknya, demi keberhasilannya di akhirat ia rela kehilangan keberhasilannya di dunia. SpasiKosong.
====

However, the "text" value in my MySQL database is still empty for that file.

Thanks,

On Wed, Jan 16, 2013 at 7:41 AM, Bayu Widyasanyata <[email protected]> wrote:

> On Tue, Jan 15, 2013 at 11:28 PM, Lewis John Mcgibbney <[email protected]> wrote:
>
>> Did you check the http.accept property in nutch-site.xml
>
> I copied from nutch-default.xml, then add application/pdf:
>
> <property>
>   <name>http.accept</name>
>   <value>text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8</value>
>   <description>Value of the "Accept" request header field.
>   </description>
> </property>
>
> Also has shown on hadoop.log:
> 2013-01-16 07:39:22,232 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml,application/pdf;q=0.9,*/*;q=0.8
>
> --
> wassalam,
> [bayu]

--
wassalam,
[bayu]

