Re: Large files - nutch failing to fetch

2009-12-22 Thread Sundara Kaku
Hi, thanks for the quick reply. Yes, I thought of using wget or HTTrack, but Nutch has several features for parsing and removing duplicates while fetching web pages. Moreover, it is written in Java, and I am using Java for the current project, so integration is much easier. I really appreciate
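
A likely cause when Nutch fails on large files is its per-protocol content size limit, which by default truncates fetched content at 64 KB. A minimal sketch of the nutch-site.xml override, assuming Nutch 1.x property names (a value of -1 removes the cap; file.content.limit and ftp.content.limit behave the same way for other protocols):

    <!-- nutch-site.xml (inside the <configuration> element) -->
    <property>
      <name>http.content.limit</name>
      <!-- maximum bytes to download per document; -1 disables the cap -->
      <value>-1</value>
    </property>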

Re: Accessing crawled data

2009-12-22 Thread Claudio Martella
Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically, I want to do some smart keyword extraction), so I have to get in the middle between crawling and indexing! My current solution is to dump the content in a file
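
For the dump-to-file step, Nutch 1.x already exposes the segment contents: bin/nutch readseg -dump <segment> <outdir> writes a segment out as flat text, and the same records can be read programmatically from the segment's SequenceFiles. A minimal sketch under that assumption, using the old Hadoop SequenceFile API; the segment path passed as the first argument is hypothetical:

    // Hedged sketch (Nutch 1.x / old Hadoop API assumed): read the fetched
    // Content records straight out of a segment instead of re-parsing dumps.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    public class SegmentContentReader {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] is a segment directory, e.g. crawl/segments/2009...
        // (hypothetical path); part-00000 assumes a single reducer.
        Path data = new Path(args[0], "content/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {
          // Hand the raw bytes to the keyword-extraction step here.
          System.out.println(url + "\t" + content.getContentType()
              + "\t" + content.getContent().length + " bytes");
        }
        reader.close();
      }
    }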

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki
On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically, I want to do some smart keyword extraction), so I have to get in the middle between crawling and indexing! My current
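
Rather than dumping and re-ingesting, the usual Nutch extension point for computing extra fields between parsing and indexing is a custom IndexingFilter plugin. A minimal sketch, assuming the Nutch 1.0-era interface (the exact filter() signature differs between versions, and extractKeywords() below is a hypothetical stand-in for the actual text analysis):

    // Hedged sketch: a Nutch 1.x indexing filter that adds a "keywords"
    // field computed from the parsed text before the document reaches Solr.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;

    public class KeywordIndexingFilter implements IndexingFilter {
      private Configuration conf;

      public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
          CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Run the custom analysis over the parsed page text and attach
        // the result as an extra field; Solr must define a matching field.
        doc.add("keywords", extractKeywords(parse.getText()));
        return doc;
      }

      // Hypothetical placeholder for the "smart keyword extraction" logic.
      private String extractKeywords(String text) {
        return text == null ? "" : text.trim();
      }

      // Required by some 1.x versions of the interface; no-op here.
      public void addIndexBackendOptions(Configuration conf) { }

      public Configuration getConf() { return conf; }
      public void setConf(Configuration conf) { this.conf = conf; }
    }

For the filter to run, the plugin also needs a plugin.xml descriptor and must be matched by the plugin.includes property in nutch-site.xml.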

Re: Accessing crawled data

2009-12-22 Thread Claudio Martella
Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically, I want to do some smart keyword extraction), so I have to get in the middle between

Re: Accessing crawled data

2009-12-22 Thread Andrzej Bialecki
On 2009-12-22 16:07, Claudio Martella wrote: Andrzej Bialecki wrote: On 2009-12-22 13:16, Claudio Martella wrote: Yes, I'm aware of that. The problem is that I have some fields of the SolrDocument that I want to compute by text analysis (basically, I want to do some smart keyword extraction)