What does your conf/regex-urlfilter.txt file contain? Did you change this file?

On Jun 30, 2013 5:10 AM, "Sznajder ForMailingList" <[email protected]> wrote:
> Thanks a lot for your help
>
> However, I still did not resolve this issue...
>
> I attach the logs after 2 rounds of "generate/fetch/parse/updatedb"
>
> The DB still contains only the seed url, not more...
>
> On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney <[email protected]> wrote:
>
>> Try each step with a crawlId and see if this provides you with better
>> results.
>>
>> Unless you truncated all data between Nutch tasks then you should be
>> seeing more data in HBase.
>> As Tejas asked... what do the logs say?
>>
>> On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList <[email protected]> wrote:
>>
>> > Hi Lewis,
>> >
>> > Thanks for your reply
>> >
>> > I just set the values:
>> >
>> > gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
>> >
>> > I already removed the HBase table in the past. Can it be a cause?
>> >
>> > Benjamin
>> >
>> > On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney <[email protected]> wrote:
>> >
>> > > Have you changed from the default MemStore gora storage to something
>> > > else?
>> > >
>> > > On Tuesday, June 25, 2013, Sznajder ForMailingList <[email protected]> wrote:
>> > > > Thanks Tejas
>> > > >
>> > > > Yes, I checked the logs and no Error appears in them
>> > > >
>> > > > I left http.content.limit and parser.html.impl at their default
>> > > > values...
>> > > >
>> > > > Benjamin
>> > > >
>> > > > On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil <[email protected]> wrote:
>> > > >
>> > > >> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any
>> > > >> exception or error messages?
>> > > >> Also you might have a look at these configs in nutch-site.xml
>> > > >> (default values are in nutch-default.xml):
>> > > >> http.content.limit and parser.html.impl
>> > > >>
>> > > >> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <[email protected]> wrote:
>> > > >>
>> > > >> > Hello
>> > > >> >
>> > > >> > I installed Nutch 2.2 on my linux machine.
>> > > >> >
>> > > >> > I defined the seed directory with one file containing:
>> > > >> > http://en.wikipedia.org/
>> > > >> > http://edition.cnn.com/
>> > > >> >
>> > > >> > I ran the following:
>> > > >> > sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
>> > > >> >
>> > > >> > After this step, the call
>> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
>> > > >> >
>> > > >> > returns
>> > > >> > TOTAL urls: 2
>> > > >> > status 0 (null): 2
>> > > >> > avg score: 1.0
>> > > >> >
>> > > >> > Then, I ran the following:
>> > > >> > bin/nutch generate -topN 10
>> > > >> > bin/nutch fetch -all
>> > > >> > bin/nutch parse -all
>> > > >> > bin/nutch updatedb
>> > > >> > bin/nutch generate -topN 1000
>> > > >> > bin/nutch fetch -all
>> > > >> > bin/nutch parse -all
>> > > >> > bin/nutch updatedb
>> > > >> >
>> > > >> > However, the stats call after these steps still returns:
>> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
>> > > >> > status 5 (status_redir_perm): 1
>> > > >> > max score: 2.0
>> > > >> > TOTAL urls: 3
>> > > >> > avg score: 1.3333334
>> > > >> >
>> > > >> > Only 3 urls?!
>> > > >> > What am I missing?
>> > > >> >
>> > > >> > Thanks
>> > > >> >
>> > > >> > Benjamin
>> > >
>> > > --
>> > > *Lewis*
>>
>> --
>> *Lewis*
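
For anyone following the question at the top of this message about the URL filter file (conf/regex-urlfilter.txt in a stock Nutch install): as far as I understand it, each non-comment line is a "+" (accept) or "-" (reject) sign followed by a regex, the first rule that matches a URL decides, and a URL that matches no rule is dropped. An over-restrictive file would therefore silently discard outlinks and leave the DB with little more than the seeds. Below is a rough Python sketch of that matching logic for checking a rule set offline. It is not Nutch code: the names SAMPLE_RULES, parse_rules and accepts are made up for illustration, the rules are hypothetical (the poster's actual file is never shown in this thread), and Python's regex dialect differs from Java's in places.

```python
import re

# Hypothetical rule set for illustration only; NOT the poster's actual file.
SAMPLE_RULES = r"""
# skip file:, ftp: and mailto: URLs
-^(file|ftp|mailto):
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# accept only the two seed hosts (an assumed, deliberately narrow restriction)
+^http://en\.wikipedia\.org/
+^http://edition\.cnn\.com/
"""


def parse_rules(text):
    """Turn filter-file text into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: %r" % line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):  # unanchored search, anchors come from the rule
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(SAMPLE_RULES)
    for url in (
        "http://en.wikipedia.org/wiki/Main_Page",
        "http://edition.cnn.com/?hpt=header",
        "http://www.example.com/",
    ):
        print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

Running a handful of outlinks from the seed pages through a check like this, using the rules copied from your own file, should show quickly whether the filter is what keeps new URLs out of the DB.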

