Hi,

I followed the instructions in that mail.

After the four steps, readdb -stats still reports only
one URL in the DB.
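
A minimal sketch of the URL-count check described above, assuming readdb -stats prints a line of the form "TOTAL urls: N" as in the output quoted further down in this thread. The function name total_urls is hypothetical, not part of Nutch:

```shell
#!/bin/sh
# Print the number that follows "TOTAL urls:" in `readdb -stats` output
# read from stdin. Assumes the line format shown in this thread.
total_urls() {
    awk -F: '/^TOTAL urls/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Hypothetical usage, run from the Nutch home directory:
#   before=$(bin/nutch readdb -stats | total_urls)
#   ... generate / fetch / parse / updatedb ...
#   after=$(bin/nutch readdb -stats | total_urls)
#   [ "$after" -gt "$before" ] || echo "no new URLs were added in this round"
```

Comparing the count before and after each round makes it obvious whether updatedb actually wrote new outlinks to the store.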


During parsing, it printed the following:

Crawl delay for queue: http://www.apache.org is set to 4000 as per
robots.txt. url: http://www.apache.org/



Best regards
Benjamin


On Sun, Jun 30, 2013 at 5:46 PM, Tejas Patil <[email protected]> wrote:

> I think you are hitting something that one of the users faced a few days
> back. Can you try the things mentioned here:
>
>
> http://mail-archives.apache.org/mod_mbox/nutch-user/201306.mbox/%3CCAFKhtFwPozH3dokk%2B_bZKqVT81h86aCpQzbL4rR4U3wZ-%2BOmHg%40mail.gmail.com%3E
>
>
> On Sun, Jun 30, 2013 at 5:10 AM, Sznajder ForMailingList <
> [email protected]> wrote:
>
> > Thanks a lot for your help
> >
> > However, I still have not resolved this issue.
> >
> >
> > I am attaching the logs after 2 rounds of
> > "generate/fetch/parse/updatedb".
> >
> > The DB still contains only the seed URL, nothing more.
> >
> >
> >
> >
> > On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> >> Try each step with a crawlId and see if this provides you with better
> >> results.
> >>
> >> Unless you truncated all data between Nutch tasks, you should be
> >> seeing more data in HBase.
> >> As Tejas asked... what do the logs say?
> >>
> >>
> >> On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList <
> >> [email protected]> wrote:
> >>
> >> > Hi Lewis,
> >> >
> >> > Thanks for your reply
> >> >
> >> > I just set the values:
> >> >
> >> >  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
> >> >
> >> >
> >> > I removed the HBase table at some point in the past. Could that be a cause?
> >> >
> >> > Benjamin
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney <
> >> > [email protected]> wrote:
> >> >
> >> > > Have you changed from the default MemStore gora storage to something
> >> > > else?
> >> > >
> >> > > On Tuesday, June 25, 2013, Sznajder ForMailingList <
> >> > > [email protected]>
> >> > > wrote:
> >> > > > thanks Tejas
> >> > > >
> >> > > > Yes, I checked the logs, and no errors appear in them.
> >> > > >
> >> > > > I left http.content.limit and parser.html.impl at their default
> >> > > > values.
> >> > > >
> >> > > > Benjamin
> >> > > >
> >> > > >
> >> > > > On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil <[email protected]>
> >> > > > wrote:
> >> > > >
> >> > > >> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any exception
> >> > > >> or error messages?
> >> > > >> Also you might have a look at these configs in nutch-site.xml
> >> > > >> (default values are in nutch-default.xml):
> >> > > >> http.content.limit and parser.html.impl
> >> > > >>
> >> > > >>
> >> > > >> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <
> >> > > >> [email protected]> wrote:
> >> > > >>
> >> > > >> > Hello
> >> > > >> >
> >> > > >> > I installed Nutch 2.2 on my linux machine.
> >> > > >> >
> >> > > >> > I defined the seed directory with one file containing:
> >> > > >> > http://en.wikipedia.org/
> >> > > >> > http://edition.cnn.com/
> >> > > >> >
> >> > > >> >
> >> > > >> > I ran the following:
> >> > > >> > sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
> >> > > >> >
> >> > > >> > After this step:
> >> > > >> > the call
> >> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
> >> > > >> >
> >> > > >> > returns
> >> > > >> > TOTAL urls:     2
> >> > > >> > status 0 (null):        2
> >> > > >> > avg score:      1.0
> >> > > >> >
> >> > > >> >
> >> > > >> > Then, I ran the following:
> >> > > >> > bin/nutch generate -topN 10
> >> > > >> > bin/nutch fetch -all
> >> > > >> > bin/nutch parse -all
> >> > > >> > bin/nutch updatedb
> >> > > >> > bin/nutch generate -topN 1000
> >> > > >> > bin/nutch fetch -all
> >> > > >> > bin/nutch parse -all
> >> > > >> > bin/nutch updatedb
> >> > > >> >
> >> > > >> >
> >> > > >> > However, after these steps, the call
> >> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
> >> > > >> > still returns only:
> >> > > >> > status 5 (status_redir_perm):   1
> >> > > >> > max score:      2.0
> >> > > >> > TOTAL urls:     3
> >> > > >> > avg score:      1.3333334
> >> > > >> >
> >> > > >> >
> >> > > >> >
> >> > > >> > Only 3 URLs?!
> >> > > >> > What am I missing?
> >> > > >> >
> >> > > >> > thanks
> >> > > >> >
> >> > > >> > Benjamin
> >> > > >> >
> >> > > >>
> >> > > >
> >> > >
> >> > > --
> >> > > *Lewis*
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> *Lewis*
> >>
> >
> >
>
