I think you are hitting something that one of the users faced a few days
back. Can you try the steps mentioned here:

http://mail-archives.apache.org/mod_mbox/nutch-user/201306.mbox/%3CCAFKhtFwPozH3dokk%2B_bZKqVT81h86aCpQzbL4rR4U3wZ-%2BOmHg%40mail.gmail.com%3E
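
In case it helps, here is a rough sketch of Lewis's earlier suggestion (quoted
below) to run each step with a crawlId. The crawl_round helper and the NUTCH
variable are names made up for illustration. As far as I know, the -crawlId
option is accepted by the Nutch 2.x generate, fetch, parse, updatedb and readdb
commands, but please check the usage output of your version before relying on it.

```shell
#!/bin/sh
# Sketch only: one generate/fetch/parse/updatedb round with an explicit crawlId.
# NUTCH and crawl_round are hypothetical helper names, not part of Nutch.
NUTCH="${NUTCH:-bin/nutch}"

crawl_round() {
    crawl_id="$1"   # identifier shared by every step of this crawl, e.g. "test1"
    top_n="$2"      # maximum number of URLs to generate in this round

    # Stop at the first failing step so a later step does not hide the error.
    "$NUTCH" generate -topN "$top_n" -crawlId "$crawl_id" || return 1
    "$NUTCH" fetch -all -crawlId "$crawl_id" || return 1
    "$NUTCH" parse -all -crawlId "$crawl_id" || return 1
    "$NUTCH" updatedb -crawlId "$crawl_id" || return 1

    # Print the stats for the same crawlId so rounds can be compared.
    "$NUTCH" readdb -stats -crawlId "$crawl_id"
}
```

You would inject the seeds with the same crawlId first, then call something
like "crawl_round test1 10" once per round and watch whether TOTAL urls grows.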


On Sun, Jun 30, 2013 at 5:10 AM, Sznajder ForMailingList <
[email protected]> wrote:

> Thanks a lot for your help
>
> However, I still have not resolved this issue...
>
>
> I am attaching the logs after 2 rounds of
> "generate/fetch/parse/updatedb".
>
> The DB still contains only the seed URL, nothing more...
>
>
>
>
> On Thu, Jun 27, 2013 at 12:37 AM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Try each step with a crawlId and see if this provides you with better
>> results.
>>
>> Unless you truncated all data between Nutch tasks then you should be
>> seeing more data in HBase.
>> As Tejas asked... what do the logs say?
>>
>>
>> On Wed, Jun 26, 2013 at 3:40 AM, Sznajder ForMailingList <
>> [email protected]> wrote:
>>
>> > Hi Lewis,
>> >
>> > Thanks for your reply
>> >
>> > I just set the values:
>> >
>> >  gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
>> >
>> >
>> > I already removed the HBase table in the past. Could that be the cause?
>> >
>> > Benjamin
>> >
>> >
>> >
>> >
>> > On Tue, Jun 25, 2013 at 7:34 PM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> > > Have you changed from the default MemStore gora storage to something
>> > > else?
>> > >
>> > > On Tuesday, June 25, 2013, Sznajder ForMailingList <
>> > > [email protected]>
>> > > wrote:
>> > > > Thanks Tejas
>> > > >
>> > > > Yes, I checked the logs and no error appears in them.
>> > > >
>> > > > I left http.content.limit and parser.html.impl at their default
>> > > > values...
>> > > >
>> > > > Benjamin
>> > > >
>> > > >
>> > > > On Tue, Jun 25, 2013 at 6:14 PM, Tejas Patil <
>> > > > [email protected]> wrote:
>> > > >
>> > > >> Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any
>> > > >> exception or error messages?
>> > > >> Also you might have a look at these configs in nutch-site.xml
>> > > >> (default values are in nutch-default.xml):
>> > > >> http.content.limit and parser.html.impl
>> > > >>
>> > > >>
>> > > >> On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <
>> > > >> [email protected]> wrote:
>> > > >>
>> > > >> > Hello
>> > > >> >
>> > > >> > I installed Nutch 2.2 on my linux machine.
>> > > >> >
>> > > >> > I defined the seed directory with one file containing:
>> > > >> > http://en.wikipedia.org/
>> > > >> > http://edition.cnn.com/
>> > > >> >
>> > > >> >
>> > > >> > I ran the following:
>> > > >> > sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
>> > > >> >
>> > > >> > After this step:
>> > > >> > the call
>> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
>> > > >> >
>> > > >> > returns
>> > > >> > TOTAL urls:     2
>> > > >> > status 0 (null):        2
>> > > >> > avg score:      1.0
>> > > >> >
>> > > >> >
>> > > >> > Then, I ran the following:
>> > > >> > bin/nutch generate -topN 10
>> > > >> > bin/nutch fetch -all
>> > > >> > bin/nutch parse -all
>> > > >> > bin/nutch updatedb
>> > > >> > bin/nutch generate -topN 1000
>> > > >> > bin/nutch fetch -all
>> > > >> > bin/nutch parse -all
>> > > >> > bin/nutch updatedb
>> > > >> >
>> > > >> >
>> > > >> > However, the stats call after these steps is still:
>> > > >> > the call
>> > > >> > -bash-4.1$ sh bin/nutch readdb -stats
>> > > >> > status 5 (status_redir_perm):   1
>> > > >> > max score:      2.0
>> > > >> > TOTAL urls:     3
>> > > >> > avg score:      1.3333334
>> > > >> >
>> > > >> >
>> > > >> >
>> > > >> > Only 3 URLs?!
>> > > >> > What am I missing?
>> > > >> >
>> > > >> > thanks
>> > > >> >
>> > > >> > Benjamin
>> > > >> >
>> > > >>
>> > > >
>> > >
>> > > --
>> > > *Lewis*
>> > >
>> >
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
