I tried your suggestion with the HSQLDB server and everything works fine. 
The issue I had was with MySQL, though. 
mysql  Ver 14.14 Distrib 5.5.18, for Linux (i686) using readline 5.1

After I restarted the MySQL server and added the MySQL root user to 
gora.properties, updatedb adds outlinks as new URLs, but I noticed that it 
did not remove the values of prsmrk, gnmrk and ftcmrk, as happens in HBase 
and as follows from the code 
Mark.GENERATE_MARK.removeMarkIfExist(page);... in DbUpdateReducer.java
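
For reference, this is roughly the behavior I expected (a toy sketch, not the
actual Nutch source: the metadata key strings and the plain Map stand in for
the real Mark enum and the Gora WebPage object, so treat all names here as
illustrative assumptions):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of Nutch's per-phase marks; the real Mark enum operates on a
// Gora WebPage, and the key strings below are illustrative, not verbatim.
enum Mark {
    GENERATE_MARK("gnmrk"), FETCH_MARK("ftcmrk"), PARSE_MARK("prsmrk");

    private final String key;
    Mark(String key) { this.key = key; }

    // Remove this mark from the page metadata if it is present (a no-op
    // when the mark is absent).
    public void removeMarkIfExist(Map<String, String> pageMetadata) {
        pageMetadata.remove(key);
    }

    public boolean isSet(Map<String, String> pageMetadata) {
        return pageMetadata.containsKey(key);
    }
}

public class MarkDemo {
    public static void main(String[] args) {
        Map<String, String> page = new HashMap<>();
        page.put("gnmrk", "batch-1343345626-1");
        page.put("ftcmrk", "batch-1343345626-1");
        page.put("prsmrk", "batch-1343345626-1");

        // What I expected the update reducer to do after a successful cycle:
        Mark.GENERATE_MARK.removeMarkIfExist(page);
        Mark.FETCH_MARK.removeMarkIfExist(page);
        Mark.PARSE_MARK.removeMarkIfExist(page);

        System.out.println(page.isEmpty()); // all marks cleared
    }
}
```

With HBase all three marks were cleared after updatedb; with MySQL the
columns kept their values.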

I also see, from time to time, an error that a text field has a size smaller 
than expected.

It seems to me that Nutch with MySQL is still buggy, so I gave up on using 
MySQL with it in favor of HBase.

Thanks for your help.
Alex.

-----Original Message-----
From: Ferdy Galema <[email protected]>
To: user <[email protected]>
Sent: Fri, Jul 27, 2012 2:03 am
Subject: Re: updatedb in nutch-2.0 with mysql


I've just run a crawl with the Nutch 2.0 tag using the SqlStore. Please try to
reproduce from a clean checkout/download.

nano conf/nutch-site.xml #set the http.agent.name and http.robots.agents properties
ant clean runtime
java -cp runtime/local/lib/hsqldb-2.2.8.jar org.hsqldb.Server -database.0 mem:0 -dbname.0 nutchtest #start the sql server

#open another terminal

cd runtime/local
bin/nutch inject ~/urlfolderWithOneUrl/
bin/nutch generate
bin/nutch fetch <batchIdFromGenerate>
bin/nutch parse <batchIdFromGenerate>
bin/nutch updatedb
bin/nutch readdb -stats #this will show multiple entries
bin/nutch readdb -dump out #this will dump a readable text file in the folder out/ (with multiple entries)

If this works as expected, it might be something with your SQL server.
(What server are you running exactly?)

Ferdy.

On Thu, Jul 26, 2012 at 8:15 PM, <[email protected]> wrote:

> I queried the webpage table and there are a few links in the outlinks
> column. As I noted in the original message, updatedb works with HBase. This
> is the counters output in the case of HBase.
>
>  bin/nutch updatedb
> DbUpdaterJob: starting
> counter name=Counters: 20
>         FileSystemCounters
>                 FILE_BYTES_READ=879085
>                 FILE_BYTES_WRITTEN=993668
>         Map-Reduce Framework
>                 Combine input records=0
>                 Combine output records=0
>                 Total committed heap usage (bytes)=341442560
>                 CPU time spent (ms)=0
>                 Map input records=1
>                 Map output bytes=1421
>                 Map output materialized bytes=1457
>                 Map output records=14
>                 Physical memory (bytes) snapshot=0
>                 Reduce input groups=13
>                 Reduce input records=14
>                 Reduce output records=13
>                 Reduce shuffle bytes=0
>                 Spilled Records=28
>                 SPLIT_RAW_BYTES=701
>                 Virtual memory (bytes) snapshot=0
>         File Input Format Counters
>                 Bytes Read=0
>         File Output Format Counters
>                 Bytes Written=0
> DbUpdaterJob: done
>
> I tried crawling http://www.yahoo.com. The same issue is present.
>
> Thanks.
> Alex.
>
>
>
> -----Original Message-----
> From: Ferdy Galema <[email protected]>
> To: user <[email protected]>
> Sent: Thu, Jul 26, 2012 6:26 am
> Subject: Re: updatedb in nutch-2.0 with mysql
>
>
> Yep, I meant those counters.
>
> Looking at the code, it seems just 1 record is passed from mapper to
> reducer. This can only mean that no outlinks are output in the mapper,
> which might indicate that the URL was not successfully parsed. (Did you
> parse at all?)
>
> Are you able to peek in (or dump) your database with an external tool to
> see if outlinks are present before running the updater? Or perhaps check
> some parser log?
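
[To make the counter arithmetic concrete: as I understand it (a hypothetical
simplification, not the actual DbUpdateMapper source; the method name and
shapes here are my own), the mapper emits one record for the page itself plus
one per outlink, so Map output records=14 under HBase means 13 outlinks,
while Map output records=1 under MySQL means none were read.]

```java
import java.util.ArrayList;
import java.util.List;

public class UpdateMapperSketch {
    // Simplified stand-in for one mapper call: emit the page's own key,
    // then one record per outlink (the real mapper also carries scores
    // and WebPage data, omitted here).
    static List<String> map(String pageUrl, List<String> outlinks) {
        List<String> emitted = new ArrayList<>();
        emitted.add(pageUrl);     // 1 record for the page itself
        emitted.addAll(outlinks); // +1 record per outlink
        return emitted;
    }

    public static void main(String[] args) {
        // HBase-like run: 1 page + 2 outlinks -> 3 map output records
        System.out.println(map("http://example.com/", List.of("a", "b")).size());
        // MySQL-like run: 1 page + 0 outlinks -> 1 map output record
        System.out.println(map("http://example.com/", List.of()).size());
    }
}
```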
>
> On Wed, Jul 25, 2012 at 10:02 PM, <[email protected]> wrote:
>
> > Not sure if I understood correctly.
> > I did
> > Counters c = currentJob.getCounters();
> > System.out.println(c.toString());
> >
> > With Mysql
> >
> > DbUpdaterJob: starting
> > counter name=Counters: 20
> >         FileSystemCounters
> >            FILE_BYTES_READ=878298
> >            FILE_BYTES_WRITTEN=992362
> >         Map-Reduce Framework
> >            Combine input records=0
> >            Combine output records=0
> >            Total committed heap usage (bytes)=260177920
> >            CPU time spent (ms)=0
> >            Map input records=1
> >            Map output bytes=193
> >            Map output materialized bytes=202
> >            Map output records=1
> >            Physical memory (bytes) snapshot=0
> >            Reduce input groups=1
> >            Reduce input records=1
> >            Reduce output records=1
> >            Reduce shuffle bytes=0
> >            Spilled Records=2
> >            SPLIT_RAW_BYTES=962
> >            Virtual memory (bytes) snapshot=0
> >         File Input Format Counters
> >            Bytes Read=0
> >         File Output Format Counters
> >            Bytes Written=0
> > DbUpdaterJob: done
> >
> >
> > Thanks.
> > Alex.
> >
> >
> >
> > -----Original Message-----
> > From: Ferdy Galema <[email protected]>
> > To: user <[email protected]>
> > Sent: Wed, Jul 25, 2012 12:13 am
> > Subject: Re: updatedb in nutch-2.0 with mysql
> >
> >
> > Could you post the job counters?
> >
> > On Tue, Jul 24, 2012 at 8:14 PM, <[email protected]> wrote:
> >
> > > Hello,
> > >
> > >
> > >
> > > I am testing nutch-2.0 with MySQL storage with 1 url. I see that the
> > > updatedb command does not do anything. It does not add outlinks to the
> > > table as new urls, and I do not see any error messages in hadoop.log.
> > > Here are the log entries, without the plugin load info:
> > >
> > >  INFO  crawl.DbUpdaterJob - DbUpdaterJob: starting
> > > 2012-07-24 10:53:46,142 WARN  util.NativeCodeLoader - Unable to load
> > > native-hadoop library for your platform... using builtin-java classes
> > where
> > > applicable
> > > 2012-07-24 10:53:46,979 INFO  mapreduce.GoraRecordReader -
> > > gora.buffer.read.limit = 10000
> > > 2012-07-24 10:53:49,801 INFO  mapreduce.GoraRecordWriter -
> > > gora.buffer.write.limit = 10000
> > > 2012-07-24 10:53:49,806 INFO  crawl.FetchScheduleFactory - Using
> > > FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > > 2012-07-24 10:53:49,807 INFO  crawl.AbstractFetchSchedule -
> > > defaultInterval=25920000
> > > 2012-07-24 10:53:49,807 INFO  crawl.AbstractFetchSchedule -
> > > maxInterval=25920000
> > > 2012-07-24 10:53:52,741 WARN  mapred.FileOutputCommitter - Output path
> is
> > > null in cleanup
> > > 2012-07-24 10:53:53,584 INFO  crawl.DbUpdaterJob - DbUpdaterJob: done
> > >
> > > Also, I noticed that there is a crawlId option to it. Where does its
> > > value come from?
> > >
> > > Btw, updatedb with no arguments works fine if HBase is chosen for
> > > storage.
> > >
> > > Thanks.
> > > Alex.
> > >
