Re: Accessing crawled data

2009-12-23 Thread Claudio Martella
Andrzej Bialecki wrote: On 2009-12-22 16:07, Claudio Martella wrote: Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically I want to do some

Re: Accessing crawled data

2009-12-22 Thread Claudio Martella
From: claudio.marte...@tis.bz.it To: nutch-user@lucene.apache.org Subject: Re: Accessing crawled data Hi, actually I completely mis-explained myself. I'll try to make myself clear: I'd like to extract the information in the segments by using the parsers. This means I can basically use

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki
On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically I want to do some smart keyword extraction), so I have to get in the middle between crawling and indexing! My actual

Re: Accessing crawled data

2009-12-22 Thread Claudio Martella
Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically I want to do some smart keyword extraction), so I have to get in the middle between

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki
On 2009-12-22 16:07, Claudio Martella wrote: Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically I want to do some smart keyword extraction)
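Getting "in the middle between crawling and indexing" is what Nutch indexing-filter plugins are for: they see each parsed document before it is handed to Solr. The class below is a minimal, self-contained sketch of the keyword-extraction logic one might wrap in such a plugin; the class name, the frequency-based heuristic, and the tiny stop-word list are illustrative assumptions, not something stated in the thread.

```java
import java.util.*;
import java.util.stream.*;

/**
 * Standalone sketch of a naive keyword extractor. In a real Nutch
 * indexing-filter plugin, logic like this would run inside the
 * filter's processing of each parsed document and write its result
 * into an extra field of the outgoing document. Class and method
 * names here are hypothetical.
 */
public class KeywordSketch {
    // Tiny illustrative stop-word list; a real filter would use a fuller one.
    private static final Set<String> STOP = Set.of("the", "a", "and", "of", "to", "is");

    /** Return the top-n most frequent non-stop-words of the parsed text. */
    public static List<String> topKeywords(String text, int n) {
        Map<String, Long> counts = Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP.contains(w))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Wiring this into a plugin (plugin.xml, build files) follows the examples under Plugin Central on the Nutch wiki.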

RE: Accessing crawled data

2009-12-17 Thread BELLINI ADAM
/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext. This command will return only the content (source pages). Hope it will help. Date: Thu, 17 Dec 2009 15:32:33 +0100 From: claudio.marte...@tis.bz.it To: nutch-user@lucene.apache.org Subject: Re: Accessing crawled data Hi
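The snippet above is cut off at the front; it appears to be the `readseg -dump` tool, which writes segment contents out as plain text. A full invocation might look like the following, where the segment path is a placeholder for one of your own crawl's segments:

```sh
# Dump only the raw fetched content of one segment to text files,
# suppressing the other segment parts with the -no* flags.
bin/nutch readseg -dump crawl/segments/20091216120000 dump_folder \
  -nofetch -nogenerate -noparse -noparsedata -noparsetext
```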

Accessing crawled data

2009-12-16 Thread Claudio Martella
Hello list, I'm using Nutch 1.0 to crawl some intranet sites and I want to later put the crawled data into my Solr server. Though Nutch 1.0 comes with Solr support out of the box, I think that solution doesn't fit me. First, I need to run my own code on the crawled data (particularly what comes

Re: Accessing crawled data

2009-12-16 Thread reinhard schwab
If you don't want to refetch already fetched pages, I can think of 3 possibilities: a/ set a very high fetch interval; b/ use a customized fetch schedule class instead of DefaultFetchSchedule and implement there a method public boolean shouldFetch(Text url, CrawlDatum datum, long curTime) which returns
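Option b/ can be sketched as follows. This is a simplified, self-contained illustration of the shouldFetch decision only: in real Nutch code the class would extend the framework's fetch-schedule base class, and the url/datum parameters would be Hadoop Text and CrawlDatum objects rather than the plain types used here.

```java
/**
 * Simplified stand-in for a custom fetch schedule: refuse to schedule
 * any page that has already been fetched at least once. In actual
 * Nutch, the equivalent logic would live in an override of
 * shouldFetch(Text url, CrawlDatum datum, long curTime) and inspect
 * the datum's status and fetch time instead of a bare long.
 */
public class FetchOnceSchedule {
    /** lastFetchTime == 0 here stands for "never fetched before". */
    public static boolean shouldFetch(String url, long lastFetchTime, long curTime) {
        // Fetch only pages the crawl has never seen before.
        return lastFetchTime == 0L;
    }
}
```

Returning false from shouldFetch keeps the page out of every future fetch list while leaving its existing data in the crawldb and segments untouched.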

Need pointers regarding accessing crawled data/plugin etc.

2008-01-15 Thread Manoj Bist
Hi, I would really appreciate it if someone could provide pointers to doing the following (via plugins or otherwise). I have gone through Plugin Central on the Nutch wiki. 1.) Is it possible to have control over the 'policy' deciding how soon a URL is fetched? E.g. if a document does not change
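For point 1.), Nutch ships an AdaptiveFetchSchedule that lengthens a page's fetch interval when the page is unchanged between fetches and shortens it when the page has changed. A sketch of enabling it in conf/nutch-site.xml (the property name and class are from the stock Nutch configuration; check nutch-default.xml in your release for the tuning knobs, such as the increment and decrement rates):

```xml
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
```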