adaptive fetches

2012-08-14 Thread Sourajit Basak
What is the adaptive fetch schedule as dictated by the property *db.fetch.schedule.adaptive.sync_delta*? If this is set to true, how does the property *db.fetch.interval.default* come into effect? I guess the 'generate' phase checks for the modified timestamp of every page in the crawldb. If a page does

Re: adaptive fetches

2012-08-14 Thread Sourajit Basak
On second thought, it doesn't seem that the 'generate' phase checks for the modified timestamp of every page. It seems to be pre-calculated by a previous generate-fetch-update cycle. Experienced guys can comment on how the next fetch time is calculated. From the crawldb output, it seems to have

RE: adaptive fetches

2012-08-14 Thread j.sullivan
Not experienced, but this may help a bit... The fetchTime field is used by the Mapper to decide whether it is time to fetch this URL. For a well-written overview, see this link: http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ Also see the Nutch API documentation for AbstractFetchSchedule
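
To make the fetchTime idea above concrete, here is a minimal sketch, not Nutch's actual code. It assumes a simplified adaptive scheme: a URL is due once its stored fetch time has passed, and after each fetch the interval shrinks if the page changed and grows if it did not. The names `CrawlEntry`, `is_due`, and `reschedule`, and the rate and bound values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CrawlEntry:
    """Hypothetical stand-in for a crawldb record (not a Nutch class)."""
    fetch_time: float  # next scheduled fetch, in epoch seconds
    interval: float    # current re-fetch interval, in seconds


def is_due(entry: CrawlEntry, now: float) -> bool:
    """A URL is selected for fetching once its scheduled time has passed."""
    return entry.fetch_time <= now


def reschedule(entry: CrawlEntry, fetched_at: float, modified: bool,
               inc_rate: float = 0.4, dec_rate: float = 0.2,
               min_interval: float = 60.0,
               max_interval: float = 365 * 24 * 3600.0) -> CrawlEntry:
    """Compute the next fetch time after a fetch (illustrative rates only).

    Changed pages are revisited sooner, unchanged pages later; the interval
    is clamped to [min_interval, max_interval].
    """
    if modified:
        interval = entry.interval * (1.0 - dec_rate)
    else:
        interval = entry.interval * (1.0 + inc_rate)
    interval = max(min_interval, min(max_interval, interval))
    return CrawlEntry(fetch_time=fetched_at + interval, interval=interval)


if __name__ == "__main__":
    entry = CrawlEntry(fetch_time=1000.0, interval=100.0)
    print(is_due(entry, now=999.0))   # not yet due
    print(is_due(entry, now=1000.0))  # due
    print(reschedule(entry, fetched_at=1000.0, modified=True))
    print(reschedule(entry, fetched_at=1000.0, modified=False))
```

The point of the sketch is only that the next fetch time is computed during the update step and stored with the entry, so a later generate step needs nothing more than a comparison against the current time.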

RE: NUTCH-1443

2012-08-14 Thread Markus Jelsma
We don't have a currency field. Please use one of the schemas shipped by Nutch: http://svn.apache.org/viewvc/nutch/trunk/conf/ -Original message- From: Sourajit Basak sourajit.ba...@gmail.com Sent: Tue 14-Aug-2012 07:40 To: user@nutch.apache.org Subject: Re: NUTCH-1443 was using

Tika's outlink is not as expected

2012-08-14 Thread Ake Tangkananond
Hi, I'm getting unexpected behavior from Nutch's parsing mechanism. Perhaps I don't really understand Nutch well. Here is what I find weird. Could you please advise? I crawl a website of mimeType application/rss+xml. The fetched content is parsed by Tika's

Re: Tika's outlink is not as expected

2012-08-14 Thread Ake Tangkananond
Thanks for the reply, Ferdy. Variable 'db.max.outlinks.per.page' was set to 100. And I could parse HTML fine. Regards, Ake Tangkananond On 8/14/12 6:43 PM, Ferdy Galema ferdy.gal...@kalooga.com wrote: Hi, Judging by your logs, it might be that you have accidentally set

Re: Tika's outlink is not as expected

2012-08-14 Thread Ferdy Galema
Do you have specific filtering/normalizing rules? Of all the URLs that are logged, which URL is left over in the 'ol' field? On Tue, Aug 14, 2012 at 1:49 PM, Ake Tangkananond iam...@gmail.com wrote: Thanks for the reply, Ferdy. Variable 'db.max.outlinks.per.page' was set to 100. And I could parse HTML

Re: nutch 2.0 with hbase 0.94.0

2012-08-14 Thread Lewis John Mcgibbney
Hi Ryan, Further to this I've raised this conversation over on user@gora[0] so it might be of interest for you to keep up with it over there? What is required is a determination of interoperability over HBase versions and what Gora is currently capable of working with before we require an API

Re: Tika's outlink is not as expected

2012-08-14 Thread Ake Tangkananond
Hi Ferdy, Thanks for your advice. I don't have any special filtering/normalizing rules except the standard ones. I even tried disabling all URL normalization plugins, but the result is no different. The URL left over in the ol is column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New Yes, it's

Re: CHM Files and Tika

2012-08-14 Thread Sebastian Nagel
Hi Jan, opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454 Thanks! Beyond the "can't retrieve parser" error: I've tried a couple of chm files (among them the test files from Tika) but I wasn't able to get Tika to extract content. % java -jar