I queried the webpage table and there are a few links in the outlinks column.
As I noted in my original message, updatedb works with HBase. Here is the
counters output for the HBase case:
bin/nutch updatedb
DbUpdaterJob: starting
counter name=Counters: 20
FileSystemCounters
FILE_BYTES_READ=879085
FILE_BYTES_WRITTEN=993668
Map-Reduce Framework
Combine input records=0
Combine output records=0
Total committed heap usage (bytes)=341442560
CPU time spent (ms)=0
Map input records=1
Map output bytes=1421
Map output materialized bytes=1457
Map output records=14
Physical memory (bytes) snapshot=0
Reduce input groups=13
Reduce input records=14
Reduce output records=13
Reduce shuffle bytes=0
Spilled Records=28
SPLIT_RAW_BYTES=701
Virtual memory (bytes) snapshot=0
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
DbUpdaterJob: done
I tried crawling http://www.yahoo.com . The same issue is present.
Thanks.
Alex.
-----Original Message-----
From: Ferdy Galema <[email protected]>
To: user <[email protected]>
Sent: Thu, Jul 26, 2012 6:26 am
Subject: Re: updatedb in nutch-2.0 with mysql
Yep, I meant those counters.
Looking at the code, it seems just 1 record is passed from the mapper to the
reducer. This can only mean that no outlinks are emitted by the mapper.
This might indicate that the URL was not successfully parsed. (Did you parse
at all?)
Are you able to peek in (or dump) your database with an external tool to
see if outlinks are present before running the updater? Or perhaps check
some parser log?
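
The counter reasoning above can be checked mechanically. Below is a minimal
sketch in plain Java (no Hadoop dependency) that parses a pasted counter dump
and compares "Map input records" with "Map output records". The class and
method names (CounterCheck, parse, emittedOutlinks) are made up for
illustration and are not part of Nutch or Hadoop. The check also assumes the
mapper emits one record for the row itself plus one per outlink, which is my
reading of the code and may not hold in every version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Hadoop.
class CounterCheck {

    // Parse "name=value" lines from a pasted counter dump.
    // Leading "> " quote markers are stripped; non-numeric lines are skipped.
    static Map<String, Long> parse(String dump) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : dump.split("\\R")) {
            String s = line.replaceFirst("^[>\\s]+", "").trim();
            int eq = s.lastIndexOf('=');
            if (eq < 0) {
                continue; // group headers such as "Map-Reduce Framework"
            }
            try {
                long value = Long.parseLong(s.substring(eq + 1).trim());
                counters.put(s.substring(0, eq).trim(), value);
            } catch (NumberFormatException e) {
                // e.g. "counter name=Counters: 20" has no numeric value; ignore
            }
        }
        return counters;
    }

    // Assumption: one map output per input row plus one per outlink, so
    // more outputs than inputs suggests that outlinks were emitted.
    static boolean emittedOutlinks(Map<String, Long> counters) {
        long in = counters.getOrDefault("Map input records", 0L);
        long out = counters.getOrDefault("Map output records", 0L);
        return out > in;
    }

    public static void main(String[] args) {
        // Values copied from the two dumps in this thread.
        String hbase = "Map input records=1\nMap output records=14\n";
        String mysql = "> Map input records=1\n> Map output records=1\n";
        System.out.println("HBase emitted outlinks: " + emittedOutlinks(parse(hbase)));
        System.out.println("MySQL emitted outlinks: " + emittedOutlinks(parse(mysql)));
    }
}
```

With the numbers posted in this thread, the check is true for the HBase run
(1 in, 14 out) and false for the MySQL run (1 in, 1 out).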
On Wed, Jul 25, 2012 at 10:02 PM, <[email protected]> wrote:
> Not sure if I understood correctly.
> I did
> Counters c = currentJob.getCounters();
> System.out.println(c.toString());
>
> With Mysql
>
> DbUpdaterJob: starting
> Counters: 20
> DbUpdaterJob: starting
> counter name=Counters: 20
> FileSystemCounters
> FILE_BYTES_READ=878298
> FILE_BYTES_WRITTEN=992362
> Map-Reduce Framework
> Combine input records=0
> Combine output records=0
> Total committed heap usage (bytes)=260177920
> CPU time spent (ms)=0
> Map input records=1
> Map output bytes=193
> Map output materialized bytes=202
> Map output records=1
> Physical memory (bytes) snapshot=0
> Reduce input groups=1
> Reduce input records=1
> Reduce output records=1
> Reduce shuffle bytes=0
> Spilled Records=2
> SPLIT_RAW_BYTES=962
> Virtual memory (bytes) snapshot=0
> File Input Format Counters
> Bytes Read=0
> File Output Format Counters
> Bytes Written=0
> DbUpdaterJob: done
>
>
> Thanks.
> Alex.
>
>
>
> -----Original Message-----
> From: Ferdy Galema <[email protected]>
> To: user <[email protected]>
> Sent: Wed, Jul 25, 2012 12:13 am
> Subject: Re: updatedb in nutch-2.0 with mysql
>
>
> Could you post the job counters?
>
> On Tue, Jul 24, 2012 at 8:14 PM, <[email protected]> wrote:
>
> >
> > Hello,
> >
> >
> >
> > I am testing nutch-2.0 with MySQL storage and a single URL. The updatedb
> > command does not appear to do anything: it does not add the outlinks to
> > the table as new URLs, and I do not see any error messages in hadoop.log.
> > Here are the log entries, without the plugin load info:
> >
> > INFO crawl.DbUpdaterJob - DbUpdaterJob: starting
> > 2012-07-24 10:53:46,142 WARN util.NativeCodeLoader - Unable to load
> > native-hadoop library for your platform... using builtin-java classes
> > where applicable
> > 2012-07-24 10:53:46,979 INFO mapreduce.GoraRecordReader -
> > gora.buffer.read.limit = 10000
> > 2012-07-24 10:53:49,801 INFO mapreduce.GoraRecordWriter -
> > gora.buffer.write.limit = 10000
> > 2012-07-24 10:53:49,806 INFO crawl.FetchScheduleFactory - Using
> > FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > 2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule -
> > defaultInterval=25920000
> > 2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule -
> > maxInterval=25920000
> > 2012-07-24 10:53:52,741 WARN mapred.FileOutputCommitter - Output path is
> > null in cleanup
> > 2012-07-24 10:53:53,584 INFO crawl.DbUpdaterJob - DbUpdaterJob: done
> >
> > Also, I noticed that the command has a crawlId option. Where does its
> > value come from?
> >
> > Btw, updatedb with no arguments works fine if HBase is chosen for
> > storage.
> >
> > Thanks.
> > Alex.
> >