Hi,

Thanks for the quick reply. Yes, I thought of using wget or HTTrack, but
Nutch has several features for parsing and removing duplicates while
fetching web pages. Moreover, it is written in Java, and I am using Java
for the current project, so integration is much easier.
I really appreciate it.
On 2009-12-22 13:16, Claudio Martella wrote:
Yes, I am aware of that. The problem is that I have some fields of the
SolrDocument that I want to compute by text analysis (basically I want
to do some smart keyword extraction), so I have to get in the middle
between crawling and indexing! My current solution is to dump the
content to a file.
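The analysis step described here, running between the crawl and the Solr indexing, could be sketched in plain Java as a simple frequency-based keyword extractor. This is only an illustrative sketch: the class name `KeywordExtractor`, the `topKeywords` method, and the stopword list are assumptions, not part of the Nutch or Solr APIs. The returned keywords would then be attached to the document before indexing, e.g. via SolrJ's `SolrInputDocument.addField`.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of a keyword-extraction step between crawling
// and indexing. Nothing here is a Nutch or Solr API; the output would
// be put into a SolrDocument field by the caller.
public class KeywordExtractor {

    // Tiny illustrative stopword list; a real setup would use a fuller one.
    private static final Set<String> STOPWORDS = Set.of(
        "the", "a", "an", "and", "or", "of", "to", "in", "is",
        "that", "before");

    // Return the topN most frequent non-stopword terms in the text,
    // breaking frequency ties alphabetically for deterministic output.
    public static List<String> topKeywords(String text, int topN) {
        Map<String, Long> counts = Arrays
            .stream(text.toLowerCase().split("\\W+"))
            .filter(w -> w.length() > 2 && !STOPWORDS.contains(w))
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        Comparator<Map.Entry<String, Long>> byCountDesc =
            Map.Entry.<String, Long>comparingByValue().reversed();
        Comparator<Map.Entry<String, Long>> order =
            byCountDesc.thenComparing(Map.Entry.comparingByKey());

        return counts.entrySet().stream()
            .sorted(order)
            .limit(topN)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String page = "Nutch crawls pages and Nutch parses pages before "
                    + "Solr indexes the parsed pages.";
        // Extracted keywords would be added to the index document,
        // e.g. doc.addField("keywords", kws) with a SolrInputDocument.
        List<String> kws = topKeywords(page, 3);
        System.out.println(kws);
    }
}
```

A more realistic variant would plug this logic into a custom indexing step rather than a standalone class, so the keywords are computed from the parsed page text and added to the document before it is sent to Solr.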