Hi Alex,

It would be great if you have something you think we could replicate;
we could then add it to a Jira ticket for reference.
A bit of history maybe...
The gora-sql module used within Nutch 2.0 has now been deprecated and
dropped from the Apache Gora project due to licensing issues.
Luckily, though, the persistence API is being restructured, so we can
expect to see an updated gora-sql module in one of the forthcoming
releases.
Anything you can share regarding either compatibility problems when
using Gora with SQL-based datastores or bugs/improvements in general
would be greatly appreciated, and would really help us move in the
right direction when we come to implement this functionality in the
future. The plan is to use the JOOQ API to support Nutch + Gora with:
CUBRID 8.4.1
DB2 9.7
Derby 10.8
H2 1.3.161
HSQLDB 2.2.5
Ingres 10.1.0
MySQL 5.1.41 and 5.5.8
Oracle XE 10.2.0.1.0 and 11g
PostgreSQL 9.0
SQLite with unofficial JDBC driver v056
SQL Server 2008 R2
Sybase Adaptive Server Enterprise 15.5

Therefore, if we can identify the bug now, we can do our best to
prevent it from being carried into the next module.
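For anyone who wants to keep testing against the current (deprecated)
SQL backend in the meantime, the datastore is selected and configured
via gora.properties. A minimal MySQL sketch; the property names below
are taken from the old gora-sql module, so please double-check them
against the release you are actually running:

```properties
# Select the SQL-backed datastore (deprecated gora-sql module)
gora.datastore.default=org.apache.gora.sql.store.SqlStore
# JDBC settings -- adjust driver/url/user/password for your setup
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=secret
```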

Thanks, and apologies for jumping in on this thread with no immediate
solution but a lot of noise.
Have a great weekend
Lewis

On Fri, Jul 27, 2012 at 8:57 PM,  <[email protected]> wrote:
> I tried your suggestion with sql server and everything works fine.
> The issue that I had was with mysql though.
> mysql  Ver 14.14 Distrib 5.5.18, for Linux (i686) using readline 5.1
>
> After I restarted the mysql server and added the mysql root user to
> gora.properties, updatedb adds outlinks as new urls, but I noticed that it
> did not remove the values of prsmrk, gnmrk and ftcmrk, as happens in HBase
> and as the code
> Mark.GENERATE_MARK.removeMarkIfExist(page);... in DbUpdateReducer.java suggests it should.
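(For context: gnmrk, ftcmrk and prsmrk are the per-row generate, fetch and
parse markers that updatedb is supposed to clear. A stdlib-only toy model of
the removeMarkIfExist semantics the reducer relies on; an illustration only,
not the real Nutch Mark API, and the batch id value is made up.)

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the per-row markers map -- illustrates the
// removeMarkIfExist semantics only, NOT the real Nutch Mark API.
public class MarkModel {
    private final Map<String, String> markers = new HashMap<>();

    // e.g. the generate step stamps the row with its batch id
    void putMark(String mark, String batchId) {
        markers.put(mark, batchId);
    }

    // What DbUpdateReducer should do: clear the mark if it is present.
    boolean removeMarkIfExist(String mark) {
        return markers.remove(mark) != null;
    }

    boolean checkMark(String mark) {
        return markers.containsKey(mark);
    }

    public static void main(String[] args) {
        MarkModel page = new MarkModel();
        page.putMark("gnmrk", "1343");        // set during generate
        page.removeMarkIfExist("gnmrk");      // cleared during updatedb
        System.out.println(page.checkMark("gnmrk")); // prints false
    }
}
```

The bug Alex describes is that with mysql the marks survive updatedb, i.e.
behave as if removeMarkIfExist were never called.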
>
> I also see, from time to time, an error saying that a text field has a size
> smaller than expected.
>
> It seems to me that nutch with mysql is still buggy, so I gave up using mysql 
> with it in favor of Hbase.
>
> Thanks for your help.
> Alex.
>
>
>
>
> -----Original Message-----
> From: Ferdy Galema <[email protected]>
> To: user <[email protected]>
> Sent: Fri, Jul 27, 2012 2:03 am
> Subject: Re: updatedb in nutch-2.0 with mysql
>
>
> I've just run a crawl with the Nutch 2.0 tag using the SqlStore. Please try
> to reproduce from a clean checkout/download.
>
> nano conf/nutch-site.xml #set http.agent.name and http.robots.agents properties
> ant clean runtime
> java -cp runtime/local/lib/hsqldb-2.2.8.jar org.hsqldb.Server -database.0
> mem:0 -dbname.0 nutchtest #start sql server
>
> #open another terminal
>
> cd runtime/local
> bin/nutch inject ~/urlfolderWithOneUrl/
> bin/nutch generate
> bin/nutch fetch <batchIdFromGenerate>
> bin/nutch parse <batchIdFromGenerate>
> bin/nutch updatedb
> bin/nutch readdb -stats #this will show multiple entries
> bin/nutch readdb -dump out #this will dump a readable text file in folder
> out/ (with multiple entries)
>
> If this works as expected, it might be something with your sql server?
> (What server are you running exactly?)
>
> Ferdy.
>
> On Thu, Jul 26, 2012 at 8:15 PM, <[email protected]> wrote:
>
>> I queried the webpage table and there are a few links in the outlinks
>> column. As I noted in the original letter, updatedb works with HBase. This
>> is the counters output in the case of HBase:
>>
>>  bin/nutch updatedb
>> DbUpdaterJob: starting
>> counter name=Counters: 20
>>         FileSystemCounters
>>                 FILE_BYTES_READ=879085
>>                 FILE_BYTES_WRITTEN=993668
>>         Map-Reduce Framework
>>                 Combine input records=0
>>                 Combine output records=0
>>                 Total committed heap usage (bytes)=341442560
>>                 CPU time spent (ms)=0
>>                 Map input records=1
>>                 Map output bytes=1421
>>                 Map output materialized bytes=1457
>>                 Map output records=14
>>                 Physical memory (bytes) snapshot=0
>>                 Reduce input groups=13
>>                 Reduce input records=14
>>                 Reduce output records=13
>>                 Reduce shuffle bytes=0
>>                 Spilled Records=28
>>                 SPLIT_RAW_BYTES=701
>>                 Virtual memory (bytes) snapshot=0
>>         File Input Format Counters
>>                 Bytes Read=0
>>         File Output Format Counters
>>                 Bytes Written=0
>> DbUpdaterJob: done
>>
>> I tried crawling http://www.yahoo.com. The same issue is present.
>>
>> Thanks.
>> Alex.
>>
>>
>>
>> -----Original Message-----
>> From: Ferdy Galema <[email protected]>
>> To: user <[email protected]>
>> Sent: Thu, Jul 26, 2012 6:26 am
>> Subject: Re: updatedb in nutch-2.0 with mysql
>>
>>
>> Yep I meant those counters.
>>
>> Looking at the code, it seems just one record is passed from mapper to
>> reducer. This can only mean that no outlinks are emitted in the mapper,
>> which might indicate that the url is not successfully parsed. (Did you
>> parse at all?)
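That counter check can be scripted; a hypothetical stdlib-only helper that
reads a counter dump (in the format pasted elsewhere in this thread) and
flags the no-outlinks symptom:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parse a Hadoop counter dump and flag the
// "no outlinks emitted" symptom discussed in this thread.
public class CounterCheck {

    // Extract a value like "Map output records=14"; -1 if absent.
    static long counter(String dump, String name) {
        Matcher m = Pattern.compile(Pattern.quote(name) + "=(\\d+)").matcher(dump);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    // If the mapper emits no more records than it read, the parsed page
    // produced no outlink records for the reducer.
    static boolean outlinksEmitted(String dump) {
        return counter(dump, "Map output records") > counter(dump, "Map input records");
    }

    public static void main(String[] args) {
        String mysqlRun = "Map input records=1\nMap output records=1";
        String hbaseRun = "Map input records=1\nMap output records=14";
        System.out.println(outlinksEmitted(mysqlRun)); // prints false
        System.out.println(outlinksEmitted(hbaseRun)); // prints true
    }
}
```

With the mysql numbers quoted in this thread (1 in, 1 out) it reports that
no outlinks were emitted; with the HBase numbers (1 in, 14 out) it reports
that they were.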
>>
>> Are you able to peek in (or dump) your database with an external tool to
>> see if outlinks are present before running the updater? Or perhaps check
>> some parser log?
>>
>> On Wed, Jul 25, 2012 at 10:02 PM, <[email protected]> wrote:
>>
>> > Not sure if I understood correctly.
>> > I did
>> > Counters c = currentJob.getCounters();
>> > System.out.println(c.toString());
>> >
>> > With Mysql
>> >
>> > DbUpdaterJob: starting
>> > counter name=Counters: 20
>> >         FileSystemCounters
>> >            FILE_BYTES_READ=878298
>> >            FILE_BYTES_WRITTEN=992362
>> >         Map-Reduce Framework
>> >            Combine input records=0
>> >            Combine output records=0
>> >            Total committed heap usage (bytes)=260177920
>> >            CPU time spent (ms)=0
>> >            Map input records=1
>> >            Map output bytes=193
>> >            Map output materialized bytes=202
>> >            Map output records=1
>> >            Physical memory (bytes) snapshot=0
>> >            Reduce input groups=1
>> >            Reduce input records=1
>> >            Reduce output records=1
>> >            Reduce shuffle bytes=0
>> >            Spilled Records=2
>> >            SPLIT_RAW_BYTES=962
>> >            Virtual memory (bytes) snapshot=0
>> >         File Input Format Counters
>> >            Bytes Read=0
>> >         File Output Format Counters
>> >            Bytes Written=0
>> > DbUpdaterJob: done
>> >
>> >
>> > Thanks.
>> > Alex.
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Ferdy Galema <[email protected]>
>> > To: user <[email protected]>
>> > Sent: Wed, Jul 25, 2012 12:13 am
>> > Subject: Re: updatedb in nutch-2.0 with mysql
>> >
>> >
>> > Could you post the job counters?
>> >
>> > On Tue, Jul 24, 2012 at 8:14 PM, <[email protected]> wrote:
>> >
>> > > Hello,
>> > >
>> > >
>> > >
>> > > I am testing nutch-2.0 with mysql storage with 1 url. I see that the
>> > > updatedb command does not do anything: it does not add outlinks to the
>> > > table as new urls, and I do not see any error messages in hadoop.log.
>> > > Here are the log entries, without the plugin load info:
>> > >
>> > >  INFO  crawl.DbUpdaterJob - DbUpdaterJob: starting
>> > > 2012-07-24 10:53:46,142 WARN  util.NativeCodeLoader - Unable to load
>> > > native-hadoop library for your platform... using builtin-java classes
>> > where
>> > > applicable
>> > > 2012-07-24 10:53:46,979 INFO  mapreduce.GoraRecordReader -
>> > > gora.buffer.read.limit = 10000
>> > > 2012-07-24 10:53:49,801 INFO  mapreduce.GoraRecordWriter -
>> > > gora.buffer.write.limit = 10000
>> > > 2012-07-24 10:53:49,806 INFO  crawl.FetchScheduleFactory - Using
>> > > FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>> > > 2012-07-24 10:53:49,807 INFO  crawl.AbstractFetchSchedule -
>> > > defaultInterval=25920000
>> > > 2012-07-24 10:53:49,807 INFO  crawl.AbstractFetchSchedule -
>> > > maxInterval=25920000
>> > > 2012-07-24 10:53:52,741 WARN  mapred.FileOutputCommitter - Output path
>> > > is null in cleanup
>> > > 2012-07-24 10:53:53,584 INFO  crawl.DbUpdaterJob - DbUpdaterJob: done
>> > >
>> > > Also, I noticed that there is a crawlId option to it. Where does its
>> > > value come from?
>> > >
>> > > Btw, updatedb with no arguments works fine if Hbase is chosen for
>> > > storage.
>> > >
>> > > Thanks.
>> > > Alex.
>> > >
>> >
>> >
>> >
>>
>>
>>
>
>



-- 
Lewis
