I do not think that script works in nutch-2.x. For example, I see this:

  $bin/nutch generate $commonOptions $CRAWL_ID/crawldb $CRAWL_ID/segments -topN $sizeFetchlist -numFetchers $numSlaves -noFilter
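For contrast, a hedged sketch of what that step looks like in each branch. The paths, -topN value, and variable names are illustrative assumptions, not taken from a verified script, and the invocations are commented out because they require a Nutch install:

```shell
# Nutch 1.x style: generate reads and writes explicit crawldb/segments paths.
#   bin/nutch generate "$CRAWL_ID/crawldb" "$CRAWL_ID/segments" -topN 50000 -noFilter

# Nutch 2.x style: there are no crawldb or segments directories; crawl state
# lives in the Gora-backed webpage table, selected with -crawlId.
#   bin/nutch generate -topN 50000 -crawlId "$CRAWL_ID" -noFilter
```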
There are no crawldb or segments in nutch-2.x. When you use a crawlId in the inject command it creates a crawlid_webpage table in HBase, and when you then run generate, fetch, etc., they query the webpage table, which does not exist.

Alex.

-----Original Message-----
From: Christopher Gross <[email protected]>
To: user <[email protected]>
Sent: Wed, May 22, 2013 6:23 pm
Subject: Re: error crawling

I'm trying to crawl. I'm just running the script that I pulled from the nutch site, so I assumed that it would be good to go, like the old runbot.sh script. I could try removing that part, but I still get the error farther down in the main body of the loop.

--
Christopher Gross
Sent from my nexus 7

On May 22, 2013 4:40 PM, <[email protected]> wrote:
> what are you trying to achieve? What is the reason for running inject
> with a crawlId?
>
> -----Original Message-----
> From: Christopher Gross <[email protected]>
> To: user <[email protected]>
> Sent: Wed, May 22, 2013 12:25 pm
> Subject: Re: error crawling
>
> Sure, I'll try. I'm also confused about this -- I had it working at one
> point, and it stopped working after migrating to a new box (copied
> everything over but cleared out the HBase).
>
> My hadoop.log for today has:
>
>   store.HBaseStore - Keyclass and nameclass match but mismatching table
>   names mappingfile schema is 'webpage' vs actual schema 'crawl_webpage' ,
>   assuming they are the same.
>
> I have nothing in a config file for a "crawl_webpage". I ran:
>
>   grep crawl_webpage *
>
> and got nothing. Running:
>
>   grep webpage *
>
> gets me hits on the gora mapping files for accumulo, hbase, cassandra and
> sql, as well as the nutch-default.xml file. nutch-default.xml has a
> "storage.schema.webpage" property whose value is "webpage".
>
> Now, what I'm thinking is that my CRAWL_ID is set to "crawl", and for
> whatever reason, the table that nutch is creating is CRAWL_ID + "_" +
> "webpage".
>
> I tried making the gora mapping file use crawl_webpage, but then I ended
> up with some crawl_crawl_webpage error messages, so I cleared out the
> HBase (again) and rolled back the file.
>
> Perhaps I'm running on an older one; can you point me in the right
> direction for getting that "crawl" script that replaces the 1.x
> "runbot.sh" script?
>
> -- Chris
>
> On Mon, May 20, 2013 at 1:55 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
> > Hi Chris,
> >
> > On Mon, May 20, 2013 at 10:21 AM, Christopher Gross <[email protected]
> > >wrote:
> >
> > > Lewis --
> > > Is the DEBUG something set in the conf/log4j.properties file? I have
> > > the rootLogger set to INFO,DRFA and the threshold is ALL. Everything
> > > else is INFO or WARN (no DEBUGs to be found.)
> >
> > Well, yes, you can set it in the log4j.properties file; however, if you
> > are working with anything older than 2.x HEAD then by default the
> > logging is hardcoded as INFO. The DEBUG logging was implemented as of
> > NUTCH-1496 and is now built into 2.x HEAD. An example can be seen here:
> >
> > http://svn.apache.org/viewvc/nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserJob.java?r1=1379438&r2=1408271
> >
> > BTW here is the HBase thread which I referred to before:
> > http://www.mail-archive.com/[email protected]/msg09245.html
> >
> > > I'm still a bit lost on what I need to do for the gora-hbase portion.
> > > My gora-hbase-mapping.xml is unchanged. Also, from the
> > > nutch-default.xml file:
> > >
> > > <property>
> > >   <name>storage.schema.webpage</name>
> > >   <value>webpage</value>
> > >   <description>This value holds the schema name used for Nutch web db.
> > >   Note that Nutch ignores the value in the gora mapping files, and uses
> > >   this as the webpage schema name.
> > >   </description>
> > > </property>
> > >
> > > So that would lead me to believe that the gora file is just ignored.
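For reference, this is the shape an override of that property would take in conf/nutch-site.xml. This is a sketch only: it reuses the property name quoted from nutch-default.xml, and, per the behavior described earlier in this thread, the crawlId prefix is still applied on top of whatever value is set here:

```xml
<property>
  <name>storage.schema.webpage</name>
  <value>webpage</value>
  <description>Base schema name for the Nutch web db. With a crawlId of
  "crawl", the resulting HBase table is reported as "crawl_webpage".
  </description>
</property>
```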
> > > If I have the "crawlId" set to "crawlId" -- where do I need to tell
> > > nutch to look in the hbase for the "crawlId_webpage"?
> >
> > I am unsure as to what your problem is here, Chris. Can you please try
> > to explain it in layman's terms for me so I understand what problem
> > you are facing?
> > Thanks
> > Lewis
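The naming rule both posters describe can be sketched in a few lines of shell. This is an illustration of the observed behavior, not Nutch source code, and the crawl id shown is hypothetical:

```shell
# Observed Nutch 2.x behavior from this thread: the HBase table name is the
# crawlId, an underscore, then the storage.schema.webpage value.
schema="webpage"        # default value of storage.schema.webpage
crawl_id="crawl"        # the id passed via -crawlId / $CRAWL_ID
table="${crawl_id}_${schema}"
echo "$table"           # prints: crawl_webpage
```

This matches the log line quoted above, where the gora-hbase mapping file declares 'webpage' but the actual schema found is 'crawl_webpage'.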

