Andrzej Bialecki wrote:
On 2009-12-22 16:07, Claudio Martella wrote:
Andrzej Bialecki wrote:
On 2009-12-22 13:16, Claudio Martella wrote:
Yes, I'm aware of that. The problem is that I have some fields of the
SolrDocument that I want to compute by text analysis (basically I want
to do some smart keywords extraction), so I have to get in the middle
between crawling and indexing.
From: claudio.marte...@tis.bz.it
To: nutch-user@lucene.apache.org
Subject: Re: Accessing crawled data
Hi,
actually I completely mis-explained myself. I'll try to make myself
clear: I'd like to extract the information in the segments by using the
parsers.
This means I can basically use
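The "smart keywords extraction" discussed in this thread would typically run between parsing and indexing, for example inside a custom Nutch IndexingFilter plugin. A minimal self-contained sketch of one possible extraction step is below; it only shows the analysis itself (picking the most frequent terms), and `KeywordSketch`, the token-length threshold, and the frequency heuristic are illustrative choices, not Nutch APIs:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of a keyword-extraction step that could run on parsed page text
// before indexing. The plugin wiring (IndexingFilter, NutchDocument) is
// omitted; this is only the text-analysis part.
public class KeywordSketch {

    // Return the n most frequent lower-cased terms, ignoring short tokens.
    // A real implementation would use a stop-word list and a smarter scorer.
    static List<String> topKeywords(String text, int n) {
        Map<String, Long> counts = Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> t.length() > 3)
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topKeywords("nutch crawls pages, nutch parses pages", 2));
    }
}
```

The extracted terms would then be added as an extra field on the document before it is sent to Solr.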
/ dump_folder
-nofetch -nogenerate -noparse -noparsedata -noparsetext
This command will return only the content (the source pages).
Hope it helps.
Date: Thu, 17 Dec 2009 15:32:33 +0100
From: claudio.marte...@tis.bz.it
To: nutch-user@lucene.apache.org
Subject: Re: Accessing crawled data
Hi,
if you don't want to re-fetch already fetched pages,
I can think of 3 possibilities:
a/ set a very high fetch interval
b/ use a customized fetch schedule class instead of DefaultFetchSchedule;
implement there a method
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
which returns false for pages that should not be fetched again.
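Option b/ boils down to a predicate over the stored crawl data. Below is a minimal self-contained sketch of that predicate: `NeverRefetchSchedule` is a hypothetical name, and Nutch's Text/CrawlDatum parameters are replaced by plain Java values so the sketch compiles on its own; a real implementation would extend Nutch's fetch-schedule base class and be selected through the db.fetch.schedule.class property. Option a/, by comparison, corresponds to raising the db.fetch.interval.default value (in seconds) in nutch-site.xml.

```java
// Sketch of option b/: a fetch schedule that never re-fetches a page that
// has already been fetched once. Nutch types are simplified to plain Java
// values here so the example is self-contained; in Nutch, shouldFetch
// receives the URL as Text and the crawl history as a CrawlDatum.
public class NeverRefetchSchedule {

    // Return true only for URLs that have never been fetched before.
    // "fetchedBefore" stands in for the state a real CrawlDatum carries.
    public boolean shouldFetch(String url, boolean fetchedBefore, long curTime) {
        return !fetchedBefore;
    }

    public static void main(String[] args) {
        NeverRefetchSchedule schedule = new NeverRefetchSchedule();
        System.out.println(schedule.shouldFetch("http://example.com/", false, 0L));
        System.out.println(schedule.shouldFetch("http://example.com/", true, 0L));
    }
}
```

With this in place the generator would simply never select already-fetched URLs, at the cost of never picking up page changes.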