Try each step with a crawlId and see if this provides you with better results.
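As a rough sketch of what "each step with a crawlId" could look like: the commands are the ones from the original message with `-crawlId` appended. The ID value `testcrawl` and the `DRY_RUN` switch are placeholders made up for illustration, and the seed path is the one quoted in the thread.

```shell
#!/bin/sh
# Sketch only: CRAWL_ID's value and the DRY_RUN switch are assumptions,
# not something taken from the original setup.
CRAWL_ID="${CRAWL_ID:-testcrawl}"
SEED_DIR="${SEED_DIR:-$HOME/DataExplorerCrawl_gpfs/seed/}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One inject + generate/fetch/parse/updatedb round, all with the same crawl ID.
nutch_cycle() {
  run bin/nutch inject "$SEED_DIR" -crawlId "$CRAWL_ID"
  run bin/nutch generate -topN 10 -crawlId "$CRAWL_ID"
  run bin/nutch fetch -all -crawlId "$CRAWL_ID"
  run bin/nutch parse -all -crawlId "$CRAWL_ID"
  run bin/nutch updatedb -crawlId "$CRAWL_ID"
  run bin/nutch readdb -stats -crawlId "$CRAWL_ID"
}

nutch_cycle
```

With `DRY_RUN=1` this only prints the commands; set `DRY_RUN=0` and run it from NUTCH_HOME to execute them. The point is to pass the same ID to every step, including `readdb`, so that all of them read and write the same data.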
Unless you truncated all data between Nutch tasks, you should be seeing more data in HBase. As Tejas asked... what do the logs say?

On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList <[email protected]> wrote:

> Hi Lewis,
>
> Thanks for your reply
>
> I just set the values:
>
> gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
>
> I already removed the HBase table in the past. Could that be a cause?
>
> Benjamin
>
> On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney <[email protected]> wrote:
>
> > Have you changed from the default MemStore gora storage to something else?
> >
> > On Tuesday, June 25, 2013, Sznajder ForMailingList <[email protected]> wrote:
> > > thanks Tejas
> > >
> > > Yes, I checked the logs and no error appears in them
> > >
> > > I left http.content.limit and parser.html.impl at their default
> > > values...
> > >
> > > Benjamin
> > >
> > > On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil <[email protected]> wrote:
> > >
> > >> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any exception or
> > >> error messages?
> > >> Also you might have a look at these configs in nutch-site.xml (default
> > >> values are in nutch-default.xml):
> > >> http.content.limit and parser.html.impl
> > >>
> > >> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <[email protected]> wrote:
> > >>
> > >> > Hello
> > >> >
> > >> > I installed Nutch 2.2 on my Linux machine.
> > >> >
> > >> > I defined the seed directory with one file containing:
> > >> > http://en.wikipedia.org/
> > >> > http://edition.cnn.com/
> > >> >
> > >> > I ran the following:
> > >> > sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
> > >> >
> > >> > After this step:
> > >> > the call
> > >> > -bash-4.1$ sh bin/nutch readdb -stats
> > >> >
> > >> > returns
> > >> > TOTAL urls: 2
> > >> > status 0 (null): 2
> > >> > avg score: 1.0
> > >> >
> > >> > Then, I ran the following:
> > >> > bin/nutch generate -topN 10
> > >> > bin/nutch fetch -all
> > >> > bin/nutch parse -all
> > >> > bin/nutch updatedb
> > >> > bin/nutch generate -topN 1000
> > >> > bin/nutch fetch -all
> > >> > bin/nutch parse -all
> > >> > bin/nutch updatedb
> > >> >
> > >> > However, the stats call after these steps is still:
> > >> > the call
> > >> > -bash-4.1$ sh bin/nutch readdb -stats
> > >> > status 5 (status_redir_perm): 1
> > >> > max score: 2.0
> > >> > TOTAL urls: 3
> > >> > avg score: 1.3333334
> > >> >
> > >> > Only 3 urls?!
> > >> > What am I missing?
> > >> >
> > >> > thanks
> > >> >
> > >> > Benjamin
> >
> > --
> > *Lewis*

--
*Lewis*

