Re: error crawling

Christopher Gross Wed, 22 May 2013 12:25:33 -0700

Sure, I'll try.  I'm also confused about this -- I had it working at one
point, and it stopped working after migrating to a new box (copied
everything over but cleared out the HBase).

My hadoop.log for today has:
store.HBaseStore - Keyclass and nameclass match but mismatching table
names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,
assuming they are the same.

I have nothing in a config file for a "crawl_webpage".  I ran:
grep crawl_webpage *
and got nothing.
Running:
grep webpage *
gets me hits on gora mapping files for accumulo, hbase, cassandra and sql,
as well as the nutch-default.xml file.
nutch-default.xml has a "storage.schema.webpage" which has a value of
"webpage".

Now, what I'm thinking is that my CRAWL_ID is set to crawl, and for
whatever reason the table that nutch creates is named CRAWL_ID + _ +
"webpage".

I tried making the gora mapping file use crawl_webpage but then I ended up
with some crawl_crawl_webpage error messages, so I cleared out the HBase
(again) and rolled back the file.
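
To illustrate the naming rule suspected above, here is a minimal sketch. It
is not code taken from Nutch or Gora; `table_name` is a made-up helper, and
the rule itself (crawl id, underscore, schema name) is only the hypothesis
described in this message.

```python
def table_name(crawl_id: str, schema: str = "webpage") -> str:
    """Hypothetical helper: prefix the schema name with the crawl id, if any."""
    # Assumption: an empty crawl id leaves the schema name untouched.
    return f"{crawl_id}_{schema}" if crawl_id else schema


# With CRAWL_ID=crawl and the default schema name from nutch-default.xml:
print(table_name("crawl"))                   # crawl_webpage

# If the schema name is itself changed to "crawl_webpage", the prefix is
# applied a second time, which would explain the crawl_crawl_webpage errors:
print(table_name("crawl", "crawl_webpage"))  # crawl_crawl_webpage
```

If that is what is happening, the schema name should stay "webpage" and only
the crawl id should vary.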

Perhaps I'm running an older version. Can you point me in the right
direction for getting that "crawl" script that replaces the 1.x "runbot.sh"
script?

-- Chris

On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi Chris,
>
> On Mon, May 20, 2013 at 10:21 AM, Christopher Gross <[email protected]
> >wrote:
>
> > Lewis --
> > Is the DEBUG something set in the conf/log4j.properties file?  I have the
> > rootLogger set to INFO,DRFA and the threshold is ALL.  Everything else is
> > INFO or WARN (no DEBUGs to be found.)
> >
> >
> Well yes you can set it in the log4j.properties file, however if you are
> working with anything older than 2.x HEAD then by default the logging is
> hardcoded as INFO.
> The DEBUG logging was implemented as of NUTCH-1496 and is now built into
> 2.x HEAD. An example can be seen here
>
> http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?r1=1379438&r2=1408271
>
> BTW here is the HBase thread which I referred to before
> http://www.mail-archive.com/[email protected]/msg09245.html
>
>
> > I'm still a bit lost on what I need to do for the gora-hbase portion.  My
> > gora-hbase-mapping.xml is unchanged.  Also, from the nutch-default.xml
> > file:
> > <property>
> >   <name>storage.schema.webpage</name>
> >   <value>webpage</value>
> >   <description>This value holds the schema name used for Nutch web db.
> >   Note that Nutch ignores the value in the gora mapping files, and uses
> >   this as the webpage schema name.
> >   </description>
> > </property>
> >
> > So that would lead me to believe that the gora file is just ignored.
> > If I have the "crawlId" set to "crawlId" -- where do I need to tell nutch
> > to look in the hbase for the "crawlId_webpage"?
> >
> I am unsure as to what your problem is here, Chris. Can you please try to
> explain it in layman's terms for me so I understand what problem you are
> facing?
> Thanks
> Lewis
>
