Hi Lewis,
Sorry for the late reply.

Please find the complete log here:
http://pastebin.com/EqeMtsb2

We can see that some of the parse processes did not complete successfully.
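
A quick way to list the affected URLs might look like the sketch below. It assumes the log is at logs/hadoop.log (adjust the path for your setup) and that each failure is logged on one line in the ParseUtil warning format quoted further down in this thread. The function name failed_parses is only an illustrative helper, not part of Nutch.

```shell
# Hypothetical helper (not part of Nutch): print the unique URLs that
# ParseUtil reported as "Unable to successfully parse content".
# $1 is the log file; logs/hadoop.log is an assumed default location.
failed_parses() {
  grep 'Unable to successfully parse content' "${1:-logs/hadoop.log}" \
    | sed 's/.*parse content \(.*\) of type .*/\1/' \
    | sort -u
}
```

Running `failed_parses logs/hadoop.log` from the Nutch directory would then print one failed URL per line.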

The following are the commands for the crawling and indexing steps.

*[Crawling step]*
bayu@thinkpato:/opt/searchengine/nutch$ ./bin/nutch crawl urls -depth 3 -topN 5

*[Indexing step]*
bayu@thinkpato:/opt/searchengine/nutch$ ./bin/nutch solrindex http://localhost:8080/solr -reindex
SolrIndexerJob: starting
Adding 1 documents
SolrIndexerJob: done.

Even though I have repeated the crawl many times, the indexing step always
adds only 1 document.

Below is the parsechecker output for one file that parsed successfully and one that failed:

*[success]* -- but the result is inconsistent, since another .odt file FAILED
to be parsed by Nutch. See the hadoop log.
bayu@thinkpato:/opt/searchengine/nutch$ ./bin/nutch parsechecker http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
---------
Url
---------------
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.odt
---------
Metadata
---------
Page-Count :     1
dc:creator :     Bayu Widyasanyata
meta:character-count :     532
Paragraph-Count :     2
nbWord :     69
meta:paragraph-count :     2
Character Count :     532
Last-Save-Date :     2012-12-21T05:37:30
dcterms:modified :     2012-12-21T05:37:30
Object-Count :     0
meta:object-count :     0
Author :     Bayu Widyasanyata
nbObject :     0
creator :     Bayu Widyasanyata
xmpTPg:NPages :     1
meta:image-count :     0
Table-Count :     0
nbCharacter :     532
Word-Count :     69
meta:table-count :     0
meta:initial-author :     Bayu Widyasanyata
Last-Modified :     2012-12-21T05:37:30
Creation-Date :     2012-12-21T05:33:12
generator :     OpenOffice.org/3.2$Linux OpenOffice.org_project/320m12$Build-9483
meta:creation-date :     2012-12-21T05:33:12
meta:word-count :     69
Image-Count :     0
nbImg :     0
meta:author :     Bayu Widyasanyata
nbTab :     0
nbPage :     1
editing-cycles :     2
Content-Type :     application/vnd.oasis.opendocument.text
meta:save-date :     2012-12-21T05:37:30
meta:page-count :     1
Edit-Time :     PT00H04M18S
initial-creator :     Bayu Widyasanyata
nbPara :     2
modified :     2012-12-21T05:37:30
date :     2012-12-21T05:33:12
dcterms:created :     2012-12-21T05:33:12

*[failed]*
bayu@thinkpato:/opt/searchengine/nutch$ ./bin/nutch parsechecker http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
---------
Url
---------------
http://localhost/sapi/Akhirat%20Lebih%20Utama%20Daripada%20Dunia.pdf
---------
Metadata
---------
xmp:CreatorTool :     Writer
meta:author :     Bayu Widyasanyata
xmpTPg:NPages :     1
dc:creator :     Bayu Widyasanyata
Content-Type :     application/pdf
created :     Sun Dec 23 19:23:22 WIT 2012
Author :     Bayu Widyasanyata
Creation-Date :     2012-12-23T12:23:22Z
date :     2012-12-23T12:23:22Z
producer :     OpenOffice.org 3.2
meta:creation-date :     2012-12-23T12:23:22Z
creator :     Bayu Widyasanyata
dcterms:created :     2012-12-23T12:23:22Z

Thanks.-

On Fri, Jan 11, 2013 at 11:09 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> I can't see any log output. Can you fetch and parse the pdfs with the
> parsechecker tool?
>
> On Thursday, January 10, 2013, Bayu Widyasanyata <[email protected]>
> wrote:
> > For clarity, the log below is about 4 of my 5 PDF docs that can't be
> > parsed by Nutch.
> >
> > On Fri, Jan 11, 2013 at 8:29 AM, Bayu Widyasanyata
> > <[email protected]>wrote:
> >
> >> Nutch parsing is still a problem for PDF files.
> >> Only 1 PDF can be parsed successfully.
> >>
> >> 2013-01-11 08:11:23,679 WARN  parse.ParseUtil - Unable to successfully
> >> parse content
> >> http://localhost/sapi/nospasi_Akhirat_Lebih_Utama_Daripada_Dunia.pdf of
> >> type application/pdf
> >>
> >> This happens even though I explicitly added the following to parse-plugins.xml:
> >>
> >>     <mimeType name="application/pdf">
> >>       <plugin id="parse-tika" />
> >>     </mimeType>
> >>
> >> What am I missing?
> >>
> >> On Fri, Jan 11, 2013 at 7:55 AM, Lewis John Mcgibbney <
> >> [email protected]> wrote:
> >>
> >>> No problem at all.
> >>>
> >>> Better safe than sorry.
> >>>
> >>> Lewis
> >>>
> >>> On Thu, Jan 10, 2013 at 4:43 PM, Bayu Widyasanyata
> >>> <[email protected]>wrote:
> >>>
> >>> > Yes, I forgot about that, even though I had already put it in my notes
> >>> > from a previous installation.
> >>> > I'm quite new to Nutch and also to Java development :)
> >>> >
> >>> > Thanks!
> >>> >
> >>> > On Fri, Jan 11, 2013 at 7:01 AM, Lewis John Mcgibbney <
> >>> > [email protected]> wrote:
> >>> >
> >>> > > Hi,
> >>> > >
> >>> > > java.io.IOException: java.lang.ClassNotFoundException:
> >>> > > > com.mysql.jdbc.Driver
> >>> > > >
> >>> > >
> >>> > > If you look at ivy.xml [0] you will see that the mysql-connector-java
> >>> > > dependency is commented out. Please uncomment it, then build Nutch 2.x
> >>> > > src again.
> >>> > >
> >>> > > This will download the dependency and make it available on your
> >>> > > classpath.
> >>> > >
> >>> > > Thank you
> >>> > >
> >>> > > Lewis
> >>> > >
> >>> > > [0] http://svn.apache.org/viewvc/nutch/branches/2.x/ivy/ivy.xml?view=markup
> >>> > >
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > wassalam,
> >>> > [bayu]
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> *Lewis*
> >>>
> >>
> >>
> >>
> >> --
> >> wassalam,
> >> [bayu]
> >
> >
> >
> >
> > --
> > wassalam,
> > [bayu]
> >
>
> --
> *Lewis*
>



-- 
wassalam,
[bayu]
