Andrzej Bialecki wrote:
On 2009-12-22 16:07, Claudio Martella wrote:
Andrzej Bialecki wrote:
On 2009-12-22 13:16, Claudio Martella wrote:
Yes, I'm aware of that. The problem is that I have some fields of the
SolrDocument that I want to compute by text analysis (basically I want
to do some smart keyword extraction), so I have to get in the middle
between crawling and indexing!
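The usual hook for running custom analysis between parsing and Solr indexing is an indexing-filter plugin. The following is only a minimal sketch against the Nutch 1.0 indexer API; the package, class name, `keywords` field, and the `extractKeywords` helper are all assumptions, not part of Nutch:

```java
// Hypothetical plugin class; package, class, and field names are assumptions.
package org.example.nutch;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class KeywordsIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String text = parse.getText();           // plain text produced by the parser
    String keywords = extractKeywords(text); // your own analysis goes here
    doc.add("keywords", keywords);           // extra field that reaches Solr
    return doc;
  }

  // Placeholder for the "smart keyword extraction" step.
  private String extractKeywords(String text) {
    return text; // replace with real analysis
  }

  // Note: the Nutch 1.0 interface may require further methods
  // (e.g. addIndexBackendOptions); they are omitted in this sketch.
  public Configuration getConf() { return conf; }
  public void setConf(Configuration conf) { this.conf = conf; }
}
```

Registered through a plugin.xml and added to plugin.includes, such a filter runs on every document after parsing and before indexing, which is exactly "in the middle between crawling and indexing".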
From: claudio.marte...@tis.bz.it
To: nutch-user@lucene.apache.org
Subject: Re: Accessing crawled data
Hi,
actually I completely mis-explained myself. I'll try to make myself
clear: I'd like to extract the information in the segments by using the
parsers.
This means I can basically use
/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext
This command will return only the content (the source pages).
Hope it will help.
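For reference, a full invocation of the segment reader might look like the following; the segment directory name is an assumption (only the dump target and the flags come from the message above):

```
bin/nutch readseg -dump crawl/segments/20091217153233 dump_folder \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext
```

With everything but the content suppressed, the dump under dump_folder contains only the raw fetched pages.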
Date: Thu, 17 Dec 2009 15:32:33 +0100
From: claudio.marte...@tis.bz.it
To: nutch-user@lucene.apache.org
Subject: Re: Accessing crawled data
Hi
Hello list,
I'm using Nutch 1.0 to crawl some intranet sites and I want to later put
the crawled data into my Solr server. Though Nutch 1.0 comes with Solr
support out of the box, I think that solution doesn't fit me. First, I
need to run my own code on the crawled data (particularly what comes
If you don't want to refetch already fetched pages,
I think of 3 possibilities:
a/ set a very high fetch interval
b/ use a customized fetch schedule class instead of DefaultFetchSchedule;
implement there a method
public boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
which returns
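Option b/ could be sketched as follows; this assumes the Nutch 1.0 FetchSchedule API, and the package and class name are invented for illustration:

```java
// Hypothetical class; package and class name are assumptions.
package org.example.nutch;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.DefaultFetchSchedule;

/**
 * A fetch schedule that never re-fetches a page that has already
 * been fetched successfully (option b/ above).
 */
public class FetchOnceSchedule extends DefaultFetchSchedule {

  public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) {
    // Skip anything that was already fetched at least once.
    if (datum.getStatus() == CrawlDatum.STATUS_DB_FETCHED) {
      return false;
    }
    return super.shouldFetch(url, datum, curTime);
  }
}
```

To activate it, point the db.fetch.schedule.class property in nutch-site.xml at the class name.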
Hi,
I would really appreciate it if someone could provide pointers to doing the
following (via plugins or otherwise). I have gone through plugin central on
the Nutch wiki.
1.) Is it possible to have control over the 'policy' that decides how soon a
url is fetched? E.g., if a document does not change
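Nutch ships an AdaptiveFetchSchedule that lengthens the fetch interval for pages that don't change and shortens it for pages that do. A sketch of the relevant nutch-site.xml settings follows; the property names are from nutch-default.xml, but the values here are assumptions you would tune yourself:

```
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value> <!-- grow interval when page is unmodified -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value> <!-- shrink interval when page has changed -->
</property>
```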