Re: error crawling

alxsss Thu, 23 May 2013 13:16:41 -0700

I do not think that script works in nutch-2.x.
For example, I see this:
$bin/nutch generate $commonOptions $CRAWL_ID/crawldb $CRAWL_ID/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter


There are no crawldb or segments in nutch-2.x.

When you use a crawlid in the inject command, it creates a crawlid_webpage
table in HBase; when you then run generate, fetch, etc., they query the
webpage table, which does not exist.
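
As a rough illustration of the naming described above (a sketch only, not taken from the Nutch source): the helper webpage_table below is hypothetical, the seed directory "urls" and the -topN value are placeholders, and the -crawlId option should be checked against the usage output of your own Nutch 2.x build. It mirrors the log message in this thread, where crawl id "crawl" and schema "webpage" gave "crawl_webpage".

```shell
# Hypothetical helper, not part of Nutch: reproduces the table naming
# observed in this thread (crawl id + "_" + schema name).
webpage_table() {
  local crawl_id="$1" schema="${2:-webpage}"
  if [ -n "$crawl_id" ]; then
    printf '%s_%s\n' "$crawl_id" "$schema"
  else
    # No crawl id: the plain schema name is used.
    printf '%s\n' "$schema"
  fi
}

CRAWL_ID=crawl
echo "table with crawl id:    $(webpage_table "$CRAWL_ID")"
echo "table without crawl id: $(webpage_table "")"

# Dry run: print (do not execute) a cycle that passes the same crawl id to
# every step, so that all steps address the same table. The -crawlId flag
# is an assumption here; verify it for your version before relying on it.
for step in "inject urls" "generate -topN 50" "fetch -all"; do
  echo "bin/nutch $step -crawlId $CRAWL_ID"
done
```

The point of the sketch is only that every step has to be given the same crawl id (or none at all); mixing the two leaves later steps looking for a table that inject never created.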

Alex.

-----Original Message-----
From: Christopher Gross <[email protected]>
To: user <[email protected]>
Sent: Wed, May 22, 2013 6:23 pm
Subject: Re: error crawling


I'm trying to crawl. I'm just running the script that I pulled from the
nutch site, so I assumed that it would be good to go, like the old
runbot.sh script. I could try removing that part, but I still get the error
farther down in the main body of the loop.

-- Christopher Gross
Sent from my nexus 7
On May 22, 2013 4:40 PM, <[email protected]> wrote:

> What are you trying to achieve? What is the reason for running inject with a
> crawlId?
>
> -----Original Message-----
> From: Christopher Gross <[email protected]>
> To: user <[email protected]>
> Sent: Wed, May 22, 2013 12:25 pm
> Subject: Re: error crawling
>
>
> Sure, I'll try.  I'm also confused about this -- I had it working at one
> point, and it stopped working after migrating to a new box (copied
> everything over but cleared out the HBase).
>
> My hadoop.log for today has:
> store.HBaseStore - Keyclass and nameclass match but mismatching table
> names  mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,
> assuming they are the same.
>
> I have nothing in a config file for a "crawl_webpage".  I ran:
> grep crawl_webpage *
> and got nothing.
> Running:
> grep webpage *
> gets me hits on gora mapping files for accumulo, hbase, cassandra and sql,
> as well as the nutch-default.xml file.
> nutch-default.xml has a "storage.schema.webpage" which has a value of
> "webpage".
>
> Now, what I'm thinking is that my CRAWL_ID is set to crawl, and for
> whatever reason, the table that nutch is making is CRAWL_ID +
> _ + "webpage".
>
> I tried making the gora mapping file use crawl_webpage but then I ended up
> with some crawl_crawl_webpage error messages, so I cleared out the HBase
> (again) and rolled back the file.
>
> Perhaps I'm running an older one; can you point me in the right
> direction for getting that "crawl" script that replaces the 1.x "runbot.sh"
> script?
>
>
> -- Chris
>
>
> On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > Hi Chris,
> >
> > On Mon, May 20, 2013 at 10:21 AM, Christopher Gross <[email protected]
> > >wrote:
> >
> > > Lewis --
> > > Is the DEBUG something set in the conf/log4j.properties file?  I have the
> > > rootLogger set to INFO,DRFA and the threshold is ALL.  Everything else is
> > > INFO or WARN (no DEBUGs to be found.)
> > >
> > >
> > Well, yes, you can set it in the log4j.properties file; however, if you are
> > working with anything older than 2.x HEAD, then by default the logging is
> > hardcoded as INFO.
> > The DEBUG logging was implemented as of NUTCH-1496 and is now built into
> > 2.x HEAD. An example can be seen here:
> >
> > http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?r1=1379438&r2=1408271
> >
> > BTW here is the HBase thread which I referred to before
> > http://www.mail-archive.com/[email protected]/msg09245.html
> >
> >
> > > I'm still a bit lost on what I need to do for the gora-hbase portion.  My
> > > gora-hbase-mapping.xml is unchanged.  Also, from the nutch-default.xml
> > > file:
> > > <property>
> > >   <name>storage.schema.webpage</name>
> > >   <value>webpage</value>
> > >   <description>This value holds the schema name used for Nutch web db.
> > >   Note that Nutch ignores the value in the gora mapping files, and uses
> > >   this as the webpage schema name.
> > >   </description>
> > > </property>
> > >
> > > So that would lead me to believe that the gora file is just ignored.
> > > If I have the "crawlId" set to "crawlId" -- where do I need to tell nutch
> > > to look in the hbase for the "crawlId_webpage"?
> > >
> > I am unsure as to what your problem is here, Chris. Can you please try to
> > explain it in layman's terms for me so I understand what problem you are
> > facing?
> > Thanks
> > Lewis
> >
>
>
>
