Found the problem: fetchInterval only gets set when the url is injected, so any changes in the config file will not affect existing pages.

Also, while AdaptiveFetchSchedule modifies the fetchInterval, it does so by multiplication, so any fetchInterval of 0 will continue to stay 0.
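To make that second point concrete, here is a quick standalone sketch. It is my own illustration, not the actual Nutch source; the rate constants are placeholders loosely modeled on the db.fetch.schedule.adaptive.* settings, so check nutch-default.xml for the real values. It just shows that a purely multiplicative update can never move an interval off zero:

  // Illustration only -- not Nutch code. Rate values are assumed placeholders.
  public class ZeroIntervalDemo {

      static final float INC_RATE = 0.4f; // assumed: grow interval when page is unmodified
      static final float DEC_RATE = 0.2f; // assumed: shrink interval when page changed

      // Adaptive-style adjustment: scale the current interval up or down.
      static int adjust(int fetchIntervalSecs, boolean modified) {
          float interval = fetchIntervalSecs;
          if (modified) {
              interval = interval * (1.0f - DEC_RATE);
          } else {
              interval = interval * (1.0f + INC_RATE);
          }
          return (int) interval;
      }

      public static void main(String[] args) {
          System.out.println(adjust(30 * 24 * 3600, false)); // a 30-day interval grows
          System.out.println(adjust(0, false));              // 0 stays 0
          System.out.println(adjust(0, true));               // 0 stays 0
      }
  }

So a page that ended up with fetchInterval = 0 never gets pushed onto a real schedule, no matter how many crawl cycles run.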
On Tue, Jun 4, 2013 at 4:31 PM, Bai Shen <[email protected]> wrote:

> I dropped the f family from HBase and readded it. Nutch filled in the
> columns and now I have sane fetch times.
>
> However, my fetchInterval is not being populated and every time I run a
> crawl I get the same urls. Here is my metadata after a crawl.
>
> status: 2 (status_fetched)
> fetchTime: 1370388764041
> prevFetchTime: 0
> fetchInterval: 0
> retriesSinceFetch: 0
> modifiedTime: 0
> prevModifiedTime: 0
> protocolStatus: SUCCESS
> parseStatus: success/ok
>
> Any ideas why?
>
> On Tue, Jun 4, 2013 at 8:16 AM, Bai Shen <[email protected]> wrote:
>
>> I'm looking at my base url (the root of the internal site that I first
>> injected to start the crawl). It shows a status of 2 (status_fetched).
>> The fetch time shows sometime in 2032.
>> The prev fetch time is May 20, 2013.
>> The fetch interval is the default 30 days.
>> 0 retries since fetch.
>> Modified time is in 2026.
>> Prev modified time is 0.
>> Protocol status is SUCCESS.
>> Parse status is success/ok.
>>
>> How do I convert the metadata field to readable text?
>>
>> On Mon, Jun 3, 2013 at 11:39 AM, Tejas Patil <[email protected]> wrote:
>>
>>> "It's not a freshly injected url."
>>> I am smelling that those urls were attempted to be fetched but that
>>> failed, and so their retry interval was incremented to a larger value.
>>> Can't say for sure though.
>>> Can you share the crawl datum? The status and meta fields can give some
>>> clue.
>>>
>>> On Mon, Jun 3, 2013 at 8:30 AM, Bai Shen <[email protected]> wrote:
>>>
>>> > The time on the machine is set correctly.
>>> >
>>> > It's an internal website. Is there anything in particular in the crawl
>>> > datum you're looking for?
>>> >
>>> > It's not a freshly injected url. And it seems like all of my urls have
>>> > the long fetch times. And it seems odd that the max interval wouldn't
>>> > cause a fetch.
>>> >
>>> > I was able to do generate -adddays 6000 and run a fetch. I'm waiting
>>> > for it to finish so I can check if this created long fetch times as
>>> > well.
>>> >
>>> > On Mon, Jun 3, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
>>> >
>>> > > On Mon, Jun 3, 2013 at 6:53 AM, feng lu <[email protected]> wrote:
>>> > >
>>> > > > I see that Nutch 2.x will use the underlying operating system time
>>> > > > to set the fetchTime, like this:
>>> > > >
>>> > > > fit.page.setFetchTime(System.currentTimeMillis());
>>> > > >
>>> > > > The granularity of the value depends on the underlying operating
>>> > > > system, so check your current OS time using the date command.
>>> > > >
>>> > > > On Mon, Jun 3, 2013 at 8:57 PM, Bai Shen <[email protected]> wrote:
>>> > > >
>>> > > > > I'm using the 2.x head and even with adding 30 days I'm not
>>> > > > > getting any refetches. I did a readdb on my injected url and it
>>> > > > > says that the fetch time is in 2027.
>>> > >
>>> > > Can you share the crawl datum for that url?
>>> > >
>>> > > > > Any idea why this would occur?
>>> > >
>>> > > If it was a freshly injected url, then I would go with Feng's advice.
>>> > >
>>> > > > > Will db.fetch.interval.max kick in and cause it to be fetched
>>> > > > > earlier?
>>> > >
>>> > > Nope.
>>> > >
>>> > > > > Or will I have to manually change the fetchTime using the hbase
>>> > > > > shell?
>>> > >
>>> > > I think so.
>>> > >
>>> > > > > Thanks.
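P.S. On the question further up the thread about converting the metadata fields to readable text: fetchTime, prevFetchTime and modifiedTime are plain epoch milliseconds (they come from System.currentTimeMillis(), as Feng's snippet shows), so anything that understands millisecond timestamps will do. A quick sketch using the value from the dump above:

  import java.util.Date;

  public class PrintFetchTime {
      public static void main(String[] args) {
          long fetchTime = 1370388764041L; // value taken from the readdb dump above
          // Prints the timestamp as a human-readable date in the local time zone.
          System.out.println(new Date(fetchTime));
      }
  }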

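And if you do end up correcting fetchTime by hand, writing a raw long through the hbase shell's put is awkward, so something along these lines from Java with the old HBase client API may be easier. Treat it as a rough sketch only: the table name, the reversed-url row key, and especially the assumption that fetchTime lives under f:ts as an 8-byte long are all guesses that need to be verified against your gora-hbase-mapping.xml before you trust it.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class FixFetchTime {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          // Assumptions: default 2.x table name "webpage", Nutch-style
          // reversed-url row key, and fetchTime stored under f:ts as a long.
          // Verify all three against gora-hbase-mapping.xml before running.
          HTable table = new HTable(conf, "webpage");
          byte[] rowKey = Bytes.toBytes("com.example.www:http/");
          Put put = new Put(rowKey);
          put.add(Bytes.toBytes("f"), Bytes.toBytes("ts"),
                  Bytes.toBytes(System.currentTimeMillis())); // make the page due now
          table.put(put);
          table.close();
      }
  }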
