What is the adaptive fetch schedule as dictated by the property
*db.fetch.schedule.adaptive.sync_delta*? If this is set to true, how does
the property *db.fetch.interval.default* come into effect?
I guess the 'generate' phase checks for the modified timestamp of every
page in the crawldb. If a page does
On second thought, it doesn't seem that the 'generate' phase checks for
the modified timestamp of every page. It seems to be pre-calculated by a
previous generate-fetch-update cycle.
Experienced guys can comment on how a next fetch time is calculated. From
the crawldb output, it seems to have
Not experienced but this may help a bit...
The fetchTime field is used by Mapper to decide if it is time to fetch
this URL. For a well-written overview, see this link:
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
Also see the Nutch API documentation for AbstractFetchSchedule
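
To make the idea concrete, here is a rough Python sketch of how a stored fetch time can gate fetching and how an adaptive schedule might stretch or shrink the interval after each fetch. This is not Nutch's code: the names PageRecord, should_fetch and reschedule are made up for illustration, the rates and bounds are assumed values, and the sync_delta adjustment is left out entirely. It also assumes the default interval only serves as the starting value for a page.

```python
from dataclasses import dataclass

# All constants are illustrative assumptions, not values read from a real config.
DEFAULT_INTERVAL_S = 30 * 24 * 3600   # stands in for db.fetch.interval.default
MIN_INTERVAL_S = 60                   # assumed lower bound on the interval
MAX_INTERVAL_S = 365 * 24 * 3600      # assumed upper bound on the interval
INC_RATE = 0.4                        # assumed growth rate when a page is unchanged
DEC_RATE = 0.2                        # assumed shrink rate when a page has changed


@dataclass
class PageRecord:
    """Hypothetical stand-in for one crawldb entry."""
    url: str
    fetch_time: float                     # epoch seconds at which the page is due
    interval: float = DEFAULT_INTERVAL_S  # seconds between fetches


def should_fetch(page: PageRecord, now: float) -> bool:
    """A page is due once its stored fetch time has passed."""
    return page.fetch_time <= now


def reschedule(page: PageRecord, fetched_at: float, modified: bool) -> PageRecord:
    """After a fetch, shrink the interval if the page changed, grow it otherwise."""
    factor = (1.0 - DEC_RATE) if modified else (1.0 + INC_RATE)
    # Clamp so the interval never leaves the configured bounds.
    page.interval = min(max(page.interval * factor, MIN_INTERVAL_S), MAX_INTERVAL_S)
    # The next fetch time is computed here, at update time, not at generate time.
    page.fetch_time = fetched_at + page.interval
    return page
```

In this sketch the next fetch time is written back when a page is updated after fetching, so a later selection step only needs to compare it with the current time.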
We don't have a currency field. Please use one of the schemas shipped by Nutch:
http://svn.apache.org/viewvc/nutch/trunk/conf/
-Original message-
From: Sourajit Basak sourajit.ba...@gmail.com
Sent: Tue 14-Aug-2012 07:40
To: user@nutch.apache.org
Subject: Re: NUTCH-1443
was using
Hi,
I'm getting unexpected behavior from the Nutch parsing mechanism. Perhaps I
don't really understand Nutch well. Here is what I find weird. Could you
please advise?
I crawl a website of mimeType application/rss+xml. The fetched content is
parsed by Tika's
Thanks for the reply, Ferdy.
Variable 'db.max.outlinks.per.page' was set to 100. And I could parse HTML
fine.
Regards,
Ake Tangkananond
On 8/14/12 6:43 PM, Ferdy Galema ferdy.gal...@kalooga.com wrote:
Hi,
Judging by your logs, it might be that you have accidentally set
Do you have specific filtering/normalizing rules? Of all the URLs that are
logged, which URL is left over in the 'ol' field?
On Tue, Aug 14, 2012 at 1:49 PM, Ake Tangkananond iam...@gmail.com wrote:
Thanks for the reply, Ferdy.
Variable 'db.max.outlinks.per.page' was set to 100. And I could parse HTML
Hi Ryan,
Further to this I've raised this conversation over on user@gora[0] so
it might be of interest for you to keep up with it over there?
What is required is a determination of interoperability over HBase
versions and what Gora is currently capable of working with before we
require an API
Hi Ferdy,
Thanks for your advice. I don't have any special filtering/normalizing
rules except the standard ones. I even tried disabling all URL normalization
plugins, but the result is no different.
The URL left over in the ol field is
column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New
Yes, it's
Hi Jan,
opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
Thanks!
Beyond the can't retrieve parser error:
I've tried a couple of chm files (among them the test files from Tika)
but I wasn't able to get Tika to extract content.
% java -jar