Re: Nutch 2.0 updatedb and gora query

kiran chitturi Thu, 31 Jan 2013 11:31:44 -0800

Hi Lewis,

I am using Gora 0.2.1 and HBase 0.90.5.

I started from scratch and ran the crawl step by step (inject, generate,
fetch, parse, updatedb), starting from a single seed.

The first four phases went well: the metadata, outlinks, fetch and parse
fields were extracted and saved in HBase.

67 outlinks are present for the first seed. When I ran the updatedb
command, all 67 records were added with the fields (f:ts, f:st, f:fi, s:s,
mk:dist, mkdt:_csh_). Before starting the crawl, I set the internal links
property to true, but I could not see any inlinks in the 67 records.
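
The check described above can be made repeatable by grouping a row's column
names by family and looking for the inlink family. This is only a minimal
sketch: it assumes the default Nutch 2.x HBase mapping, in which inlinks are
stored under the "il" family and outlinks under "ol", and the helper names
are hypothetical.

```python
from collections import defaultdict

# Assumed family names from the default Nutch 2.x gora-hbase-mapping.xml;
# adjust them if your mapping differs.
INLINK_FAMILY = "il"
OUTLINK_FAMILY = "ol"


def group_by_family(columns):
    """Group 'family:qualifier' column names by their family."""
    families = defaultdict(list)
    for column in columns:
        # Split on the first ':' only, since qualifiers may be URLs.
        family, _, qualifier = column.partition(":")
        families[family].append(qualifier)
    return dict(families)


def has_inlinks(columns):
    """Return True if any column belongs to the assumed inlink family."""
    return INLINK_FAMILY in group_by_family(columns)


# Columns reported for each of the 67 new rows after updatedb.
row_columns = ["f:ts", "f:st", "f:fi", "s:s", "mk:dist", "mkdt:_csh_"]
print(sorted(group_by_family(row_columns)))
print(has_inlinks(row_columns))
```

Running the same check on the column list of every row after each phase would
show at which step, if any, an inlink column first appears.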

I then ran generate and fetch for the 67 records. HBase now has more fields
for each record, including one outlink field that is the same as the baseURL
for that record.

Once the records were parsed, more outlinks and fields were added.

After running the updatedb command again, 902 more records were added with
the fields (f:ts, f:st, f:fi, s:s, mk:dist, mkdt:_csh_).

I am not sure whether an inlink is being stored as an outlink, but as far as
I can see, inlinks are not saved in the records at any phase. They are
somehow lost during the dbUpdateReducer phase, whereas the other fields
are written.

When I debugged with Eclipse, I saw in the dbUpdateReducer job that inlinks
are added to the page, and I was able to print them to stdout.
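
To illustrate how a field can be visible in memory and still never reach the
store, here is a toy model of a persistence layer that writes only fields
flagged as changed. It is a sketch of the kind of state-tracking problem
described for gora-cassandra in GORA-182, not Gora's or Nutch's actual code;
the class and function names are hypothetical, and nothing here confirms that
the HBase module fails this way.

```python
class ToyPage:
    """Toy record with per-field dirty flags (not Gora's real classes)."""

    def __init__(self):
        self.fields = {"status": None, "inlinks": {}}
        self.dirty = set()

    def set_status(self, value):
        self.fields["status"] = value
        self.dirty.add("status")

    def put_inlink(self, url, anchor, mark_dirty=True):
        self.fields["inlinks"][url] = anchor
        # If the map change is not recorded, flush() will skip the field.
        if mark_dirty:
            self.dirty.add("inlinks")


def flush(page, store):
    """Write only the fields flagged dirty, then reset the flags."""
    for name in page.dirty:
        value = page.fields[name]
        store[name] = dict(value) if isinstance(value, dict) else value
    page.dirty.clear()


page = ToyPage()
page.set_status(2)
# Simulated bug: the map is updated but its state is not recorded.
page.put_inlink("http://example.com/", "home", mark_dirty=False)

store = {}
flush(page, store)
print(page.fields["inlinks"])  # the in-memory page still holds the inlink
print("inlinks" in store)      # whether the inlink field reached the store
```

In this toy model the scalar field is written while the map field is skipped,
which matches the symptom of seeing inlinks in the debugger but not in the
table.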

This looks like a bug; I am not sure whether it is in Gora or Nutch.

Kiran

On Wed, Jan 30, 2013 at 2:44 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Kiran,
>
> On Wed, Jan 30, 2013 at 11:10 AM, kiran chitturi
> <[email protected]> wrote:
>
> > I have checked the database after the dbupdate job was run, and I could
> > see only markers, signature and fetch fields.
> >
>
> Which Gora artifacts are you using?
> We've recently fixed a bug in gora-cassandra [0] as the state for map
> values was not being correctly recorded, this prevented us from writing the
> values during the dbUpdaterJob.
> I was not aware (and no-one flagged it up during either the Gora 0.2.1 or
> Nutch 2.1 RC testing) that there was a problem with similar fields being
> written to HBase.
>
>
> >
> > The initial seed, which was crawled and parsed, has only outlinks. I
> > notice that one of the outlinks is actually the inlink.
> >
>
> Can you reproduce? Is there any way of being more verbose here? This is
> starting to sound like a bug. Unfortunately, I am not 100% on the HBase
> module either!
>
>
> >
> > Aren't inlinks supposed to be saved during the dbUpdatedJob?
>
>
> Yes, specifically in the dbUpdaterReducerJob [1]
>
>
> > When I tried to debug, I could see in Eclipse and in the dbUpdateReducer
> > job that the inlinks are being saved to the page object along with fetch
> > fields and markers, but I did not understand where the data goes from
> > there.
> >
>
> We need to narrow this down and document it fully then.
> I cannot look into this for a couple hours Kiran,
> Lewis
>
> [0] https://issues.apache.org/jira/browse/GORA-182
> [1] http://wiki.apache.org/nutch/Nutch2Crawling#DbUpdate
> [2] http://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/util/WebPageWritable.java
>

-- 
Kiran Chitturi
