Re: Extremely long fetch time

Bai Shen Tue, 04 Jun 2013 13:32:40 -0700

I dropped the f family from HBase and re-added it.  Nutch filled in the
columns and now I have sane fetch times.


However, my fetchInterval is not being populated and every time I run a
crawl I get the same urls.  Here is my metadata after a crawl.

status: 2 (status_fetched)
fetchTime: 1370388764041
prevFetchTime: 0
fetchInterval: 0
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: SUCCESS
parseStatus: success/ok

Any ideas why?
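
For anyone reading the values above: fetchTime is in milliseconds since the
epoch. A fetchInterval of 0 would explain the repeated urls if the next fetch
is scheduled at fetchTime plus the interval. Below is a minimal Java sketch of
that reasoning. The scheduling rule is a simplifying assumption, and the class
and method names are made up for illustration; they are not Nutch APIs.

```java
import java.time.Instant;

public class FetchTimeDemo {
    // Render an epoch-millisecond timestamp (as stored in fetchTime) as ISO-8601 UTC.
    public static String toIso(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    // Simplified model (an assumption, not Nutch's code): the next fetch is
    // scheduled fetchIntervalSeconds after the recorded fetch time.
    public static long nextFetchMillis(long fetchTimeMillis, int fetchIntervalSeconds) {
        return fetchTimeMillis + fetchIntervalSeconds * 1000L;
    }

    // A url is due when its scheduled time is not after "now".
    public static boolean isDue(long scheduledMillis, long nowMillis) {
        return scheduledMillis <= nowMillis;
    }

    public static void main(String[] args) {
        long fetchTime = 1370388764041L;          // fetchTime from the listing above
        long oneSecondLater = fetchTime + 1000L;  // pretend the next crawl starts 1s later

        System.out.println(toIso(fetchTime));
        // With an interval of 0 the url is due again immediately.
        System.out.println(isDue(nextFetchMillis(fetchTime, 0), oneSecondLater));
        // With a 30-day interval it is not due yet.
        System.out.println(isDue(nextFetchMillis(fetchTime, 30 * 24 * 3600), oneSecondLater));
    }
}
```

Under that assumption, every url with fetchInterval 0 is selected again on
the next generate, which matches what I am seeing.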


On Tue, Jun 4, 2013 at 8:16 AM, Bai Shen <[email protected]> wrote:

> I'm looking at my base url (the root of the internal site that I first
> injected to start the crawl).  It shows a status of 2 (status_fetched).
> The fetch time shows sometime in 2032.
> The prev fetch time is May 20, 2013.
> The fetch interval is the default 30 days.
> 0 retries since fetch.
> Modified time is in 2026.
> Prev modified time is 0.
> Protocol status is SUCCESS.
> Parse status is success/ok
>
> How do I convert the metadata field to readable text?
>
>
> On Mon, Jun 3, 2013 at 11:39 AM, Tejas Patil <[email protected]> wrote:
>
>> "It's not a freshly injected url."
>> I suspect that fetching those urls was attempted but failed, and so
>> their retry interval was increased to a larger value. Can't say for
>> sure though.
>> Can you share the crawl datum? The status and meta fields can give some
>> clue.
>>
>> On Mon, Jun 3, 2013 at 8:30 AM, Bai Shen <[email protected]> wrote:
>>
>> > The time on the machine is set correctly.
>> >
>> > It's an internal website.  Is there anything in particular in the crawl
>> > datum you're looking for?
>> >
>> > It's not a freshly injected url.  And it seems like all of my urls have
>> > the long fetch times.  And it seems odd that the max interval wouldn't
>> > cause a fetch.
>> >
>> > I was able to do generate -adddays 6000 and run a fetch.  I'm waiting
>> > for it to finish so I can check if this created long fetch times as well.
>> >
>> >
>> > On Mon, Jun 3, 2013 at 10:35 AM, Tejas Patil <[email protected]> wrote:
>> >
>> > > On Mon, Jun 3, 2013 at 6:53 AM, feng lu <[email protected]> wrote:
>> > >
>> > > > I see that Nutch 2.x uses the underlying operating system time to
>> > > > set the fetch time, like this:
>> > > >
>> > > > fit.page.setFetchTime(System.currentTimeMillis());
>> > > >
>> > > > The granularity of the value depends on the underlying operating
>> > > > system, so check your current OS time using the date command.
>> > > >
>> > > >
>> > > > On Mon, Jun 3, 2013 at 8:57 PM, Bai Shen <[email protected]> wrote:
>> > > >
>> > > > > I'm using the 2.x head and even with adding 30 days I'm not
>> > > > > getting any refetches.  I did a readdb on my injected url and it
>> > > > > says that the fetch time is in 2027.
>> > > >
>> > >
>> > > Can you share the crawl datum for that url?
>> > >
>> > >
>> > > > >
>> > > > > Any idea why this would occur?
>> > > >
>> > >
>> > > If it was a freshly injected url, then I would go with Feng's advice.
>> > >
>> > > > > Will db.fetch.interval.max kick in and
>> > > > > cause it to be fetched earlier?
>> > > >
>> > >
>> > > nope.
>> > >
>> > > > > Or will I have to manually change the
>> > > > > fetchTime using the hbase shell?
>> > > >
>> > >
>> > > I think so.
>> > >
>> > > >
>> > > > > Thanks.
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Don't Grow Old, Grow Up... :-)
>> > > >
>> > >
>> >
>>
>
>
