Hi,

I followed the instructions in that mail.
After the four steps, calling readdb -stats still shows only one URL in the DB...

During parsing, it printed:

Crawl delay for queue: http://www.apache.org is set to 4000 as per robots.txt. url: http://www.apache.org/

Best regards
Benjamin

On Sun, Jun 30, 2013 at 5:46 PM, Tejas Patil <[email protected]> wrote:

> I think that you are hitting something that one of the users faced a few days
> back. Can you try the things mentioned here:
>
> http://mail-archives.apache.org/mod_mbox/nutch-user/201306.mbox/%3CCAFKhtFwPozH3dokk%2B_bZKqVT81h86aCpQzbL4rR4U3wZ-%2BOmHg%40mail.gmail.com%3E
>
>
> On Sun, Jun 30, 2013 at 5:10 AM, Sznajder ForMailingList <[email protected]> wrote:
>
> > Thanks a lot for your help
> >
> > However, I still did not resolve this issue...
> >
> > I attach the logs after 2 rounds of
> > "generate/fetch/parse/updatedb"
> >
> > The DB still contains only the seed url, not more...
> >
> >
> > On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney <[email protected]> wrote:
> >
> >> Try each step with a crawlId and see if this provides you with better
> >> results.
> >>
> >> Unless you truncated all data between Nutch tasks then you should be
> >> seeing more data in HBase.
> >> As Tejas asked... what do the logs say?
> >>
> >>
> >> On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList <[email protected]> wrote:
> >>
> >> > Hi Lewis,
> >> >
> >> > Thanks for your reply
> >> >
> >> > I just set the values:
> >> >
> >> > gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
> >> >
> >> > I already removed the Hbase table in the past. Can it be a cause?
> >> >
> >> > Benjamin
> >> >
> >> >
> >> > On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >> >
> >> > > Have you changed from the default MemStore gora storage to something
> >> > > else?
> >> > >
> >> > > On Tuesday, June 25, 2013, Sznajder ForMailingList <[email protected]> wrote:
> >> > > > thanks Tejas
> >> > > >
> >> > > > Yes, I checked the logs and no error appears in them
> >> > > >
> >> > > > I left http.content.limit and parser.html.impl at their default
> >> > > > values...
> >> > > >
> >> > > > Benjamin
> >> > > >
> >> > > >
> >> > > > On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil <[email protected]> wrote:
> >> > > >
> >> > > >> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any
> >> > > >> exception or error messages ?
> >> > > >> Also you might have a look at these configs in nutch-site.xml
> >> > > >> (default values are in nutch-default.xml):
> >> > > >> http.content.limit and parser.html.impl
> >> > > >>
> >> > > >>
> >> > > >> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <[email protected]> wrote:
> >> > > >>
> >> > > >> > Hello
> >> > > >> >
> >> > > >> > I installed Nutch 2.2 on my linux machine.
> >> > > >> >
> >> > > >> > I defined the seed directory with one file containing:
> >> > > >> > http://en.wikipedia.org/
> >> > > >> > http://edition.cnn.com/
> >> > > >> >
> >> > > >> >
> >> > > >> > I ran the following:
> >> > > >> > sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
> >> > > >> >
> >> > > >> > After this step:
> >> > > >> > the call
> >> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
> >> > > >> >
> >> > > >> > returns
> >> > > >> > TOTAL urls: 2
> >> > > >> > status 0 (null): 2
> >> > > >> > avg score: 1.0
> >> > > >> >
> >> > > >> >
> >> > > >> > Then, I ran the following:
> >> > > >> > bin/nutch generate -topN 10
> >> > > >> > bin/nutch fetch -all
> >> > > >> > bin/nutch parse -all
> >> > > >> > bin/nutch updatedb
> >> > > >> > bin/nutch generate -topN 1000
> >> > > >> > bin/nutch fetch -all
> >> > > >> > bin/nutch parse -all
> >> > > >> > bin/nutch updatedb
> >> > > >> >
> >> > > >> >
> >> > > >> > However, the stats call after these steps is still:
> >> > > >> > the call
> >> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
> >> > > >> > status 5 (status_redir_perm): 1
> >> > > >> > max score: 2.0
> >> > > >> > TOTAL urls: 3
> >> > > >> > avg score: 1.3333334
> >> > > >> >
> >> > > >> >
> >> > > >> > Only 3 urls?!
> >> > > >> > What do I miss?
> >> > > >> >
> >> > > >> > thanks
> >> > > >> >
> >> > > >> > Benjamin
> >> > > >> >
> >> > > >>
> >> > > >
> >> > >
> >> > > --
> >> > > *Lewis*
> >> > >
> >> >
> >>
> >>
> >> --
> >> *Lewis*
> >>
> >
>

