I tried your suggestion with sql server and everything works fine. The issue that I had was with mysql though. mysql Ver 14.14 Distrib 5.5.18, for Linux (i686) using readline 5.1
After I restarted the mysql server and added the mysql root user to gora.properties, updatedb adds outlinks as new urls, but as I noticed it did not remove the values of prsmrk, gnmrk and ftcmrk, as happens in Hbase and as follows from the code
Mark.GENERATE_MARK.removeMarkIfExist(page);...
in DbUpdateReducer.java

I also see from time to time an error that a text field has a size less than expected. It seems to me that nutch with mysql is still buggy, so I gave up using mysql with it in favor of Hbase.

Thanks for your help.
Alex.

-----Original Message-----
From: Ferdy Galema <[email protected]>
To: user <[email protected]>
Sent: Fri, Jul 27, 2012 2:03 am
Subject: Re: updatedb in nutch-2.0 with mysql

I've just run a crawl with Nutch 2.0 tag using the SqlStore. Please try to reproduce from a clean checkout/download.

nano conf/nutch-site.xml #set http.agent.name and http.robots.agents properties
ant clean runtime
java -cp runtime/local/lib/hsqldb-2.2.8.jar org.hsqldb.Server -database.0 mem:0 -dbname.0 nutchtest #start sql server
#open another terminal
cd runtime/local
bin/nutch inject ~/urlfolderWithOneUrl/
bin/nutch generate
bin/nutch fetch <batchIdFromGenerate>
bin/nutch parse <batchIdFromGenerate>
bin/nutch updatedb
bin/nutch readdb -stats #this will show multiple entries
bin/nutch readdb -dump out #this will dump a readable text file in folder out/ (with multiple entries)

If this works as expected, it might be something with your sql server? (What server are you running exactly?)

Ferdy.

On Thu, Jul 26, 2012 at 8:15 PM, <[email protected]> wrote:

> I queried webpage table and there are a few links in outlinks column. As
> I noted in the original letter updatedb works with Hbase. This is the
> counters output in the case of Hbase.
>
> bin/nutch updatedb
> DbUpdaterJob: starting
> counter name=Counters: 20
>   FileSystemCounters
>     FILE_BYTES_READ=879085
>     FILE_BYTES_WRITTEN=993668
>   Map-Reduce Framework
>     Combine input records=0
>     Combine output records=0
>     Total committed heap usage (bytes)=341442560
>     CPU time spent (ms)=0
>     Map input records=1
>     Map output bytes=1421
>     Map output materialized bytes=1457
>     Map output records=14
>     Physical memory (bytes) snapshot=0
>     Reduce input groups=13
>     Reduce input records=14
>     Reduce output records=13
>     Reduce shuffle bytes=0
>     Spilled Records=28
>     SPLIT_RAW_BYTES=701
>     Virtual memory (bytes) snapshot=0
>   File Input Format Counters
>     Bytes Read=0
>   File Output Format Counters
>     Bytes Written=0
> DbUpdaterJob: done
>
> I tried crawling http://www.yahoo.com . The same issue is present.
>
> Thanks.
> Alex.
>
> -----Original Message-----
> From: Ferdy Galema <[email protected]>
> To: user <[email protected]>
> Sent: Thu, Jul 26, 2012 6:26 am
> Subject: Re: updatedb in nutch-2.0 with mysql
>
> Yep I meant those counters.
>
> Looking at the code it seems just 1 record is passed around from mapper to
> reducer: This can only mean that no outlinks are outputted in the mapper.
> This might indicate that the url is not successfully parsed. (Did you parse
> at all?)
>
> Are you able to peek in (or dump) your database with an external tool to
> see if outlinks are present before running the updater? Or perhaps check
> some parser log?
>
> On Wed, Jul 25, 2012 at 10:02 PM, <[email protected]> wrote:
>
> > Not sure if I understood correctly.
> > I did
> > Counters c = currentJob.getCounters();
> > System.out.println(c.toString());
> >
> > With Mysql
> >
> > DbUpdaterJob: starting
> > Counters: 20
> > DbUpdaterJob: starting
> > counter name=Counters: 20
> >   FileSystemCounters
> >     FILE_BYTES_READ=878298
> >     FILE_BYTES_WRITTEN=992362
> >   Map-Reduce Framework
> >     Combine input records=0
> >     Combine output records=0
> >     Total committed heap usage (bytes)=260177920
> >     CPU time spent (ms)=0
> >     Map input records=1
> >     Map output bytes=193
> >     Map output materialized bytes=202
> >     Map output records=1
> >     Physical memory (bytes) snapshot=0
> >     Reduce input groups=1
> >     Reduce input records=1
> >     Reduce output records=1
> >     Reduce shuffle bytes=0
> >     Spilled Records=2
> >     SPLIT_RAW_BYTES=962
> >     Virtual memory (bytes) snapshot=0
> >   File Input Format Counters
> >     Bytes Read=0
> >   File Output Format Counters
> >     Bytes Written=0
> > DbUpdaterJob: done
> >
> > Thanks.
> > Alex.
> >
> > -----Original Message-----
> > From: Ferdy Galema <[email protected]>
> > To: user <[email protected]>
> > Sent: Wed, Jul 25, 2012 12:13 am
> > Subject: Re: updatedb in nutch-2.0 with mysql
> >
> > Could you post the job counters?
> >
> > On Tue, Jul 24, 2012 at 8:14 PM, <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I am testing nutch-2.0 with mysql storage with 1 url. I see that the
> > > updatedb command does not do anything. It does not add outlinks to the
> > > table as new urls and I do not see any error messages in hadoop.log.
> > > Here are the log entries without plugin load info:
> > >
> > > INFO crawl.DbUpdaterJob - DbUpdaterJob: starting
> > > 2012-07-24 10:53:46,142 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > > 2012-07-24 10:53:46,979 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> > > 2012-07-24 10:53:49,801 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> > > 2012-07-24 10:53:49,806 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> > > 2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule - defaultInterval=25920000
> > > 2012-07-24 10:53:49,807 INFO crawl.AbstractFetchSchedule - maxInterval=25920000
> > > 2012-07-24 10:53:52,741 WARN mapred.FileOutputCommitter - Output path is null in cleanup
> > > 2012-07-24 10:53:53,584 INFO crawl.DbUpdaterJob - DbUpdaterJob: done
> > >
> > > Also, I noticed that there is a crawlId option to it. Where does its
> > > value come from?
> > >
> > > Btw, updatedb with no arguments works fine if Hbase is chosen for
> > > storage.
> > >
> > > Thanks.
> > > Alex.
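
For anyone comparing the two counter dumps in this thread: following Ferdy's reading, the telling pair is Map input records against Map output records (1 and 1 with MySQL, 1 and 14 with HBase). Below is a minimal shell sketch of that check, assuming the counters were saved to a file with one name=value entry per line. The helper counter_value and the file names mysql.counters and hbase.counters are hypothetical and not part of Nutch; the sample values are copied from the dumps quoted above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the value of one counter
# from a saved dump that has one "name=value" entry per line.
counter_value() {
  # $1 = counter name, $2 = dump file; leading whitespace is ignored
  sed -n "s/^[[:space:]]*$1=//p" "$2" | head -n 1
}

# Scratch directory for the sample dumps, removed when the script exits.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Sample values copied from the two updatedb runs quoted in this thread.
printf '%s\n' 'Map input records=1' 'Map output records=1' > "$tmp/mysql.counters"
printf '%s\n' 'Map input records=1' 'Map output records=14' > "$tmp/hbase.counters"

for store in mysql hbase; do
  n_in=$(counter_value 'Map input records' "$tmp/$store.counters")
  n_out=$(counter_value 'Map output records' "$tmp/$store.counters")
  # Assumption taken from Ferdy's diagnosis: if the mapper outputs no more
  # records than it read, no outlinks were passed on to the reducer.
  if [ "$n_out" -gt "$n_in" ]; then
    echo "$store: mapper emitted extra records (in=$n_in, out=$n_out)"
  else
    echo "$store: no extra records, so probably no outlinks (in=$n_in, out=$n_out)"
  fi
done
```

This only inspects the printed counters; it does not say why the outlinks are missing, which in this thread still required looking at the webpage table and the parse step.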

