Hi,

As far as I know, this is not directly supported in Nutch 2. One way would
be to write a tool that deletes all fields not needed for general crawling,
while keeping the fields that indicate that a URL has already been fetched.
The big fields that can be deleted after indexing include 'content' and
'text'.
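
To make the idea concrete, here is a minimal sketch in plain Java. It does
not use the Nutch or Gora APIs: a row is modeled as a simple Map, and the
class name FieldStripper and the fields 'status' and 'fetchTime' are only
illustrative assumptions. A real tool would have to read and write rows
through your datastore.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class FieldStripper {
    // Large fields named in this thread as deletable after indexing.
    static final Set<String> BIG_FIELDS = Set.of("content", "text");

    // Returns a copy of the row without the big fields.
    // Every other field (e.g. fetch status) is kept untouched.
    static Map<String, Object> strip(Map<String, Object> row) {
        Map<String, Object> slim = new HashMap<>(row);
        slim.keySet().removeAll(BIG_FIELDS);
        return slim;
    }

    public static void main(String[] args) {
        // Hypothetical row; the field values are made up for illustration.
        Map<String, Object> row = new HashMap<>();
        row.put("status", 2);
        row.put("fetchTime", 1344000000000L);
        row.put("content", new byte[1024]);
        row.put("text", "page text");
        System.out.println(strip(row).keySet());
    }
}
```

The sketch copies the row instead of modifying it in place, so the same
logic could be tried out on a few rows before deleting anything for real.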

Delete support is currently not optimal in Gora, so you might want to
implement a workaround that uses your store-specific API directly. (Of
course, this would not be of any benefit to the other datastores.)

If you do not need inlinks (anchor texts), you could strip out the part of
the DbUpdateReducer that writes the inlinks for every row: skip the actual
writing of the inlinks, but keep the scoring functionality that depends on
them. This requires some coding too.
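
A rough sketch of that idea in plain Java follows. It is not the actual
DbUpdateReducer code: the classes Inlink and Row, the method update and the
flag storeInlinks are hypothetical stand-ins, and the score is just a sum
of contributions rather than whatever your scoring filter computes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InlinkSkippingUpdate {
    // One incoming link: source URL, anchor text, and its score contribution.
    static class Inlink {
        final String fromUrl;
        final String anchor;
        final float score;

        Inlink(String fromUrl, String anchor, float score) {
            this.fromUrl = fromUrl;
            this.anchor = anchor;
            this.score = score;
        }
    }

    // Stand-in for a stored row: a score plus an inlinks map (url -> anchor).
    static class Row {
        float score;
        final Map<String, String> inlinks = new HashMap<>();
    }

    // Applies the incoming links to the row. The score is always updated;
    // the inlinks are only stored when storeInlinks is true.
    static void update(Row row, List<Inlink> incoming, boolean storeInlinks) {
        for (Inlink link : incoming) {
            row.score += link.score;
            if (storeInlinks) {
                row.inlinks.put(link.fromUrl, link.anchor);
            }
        }
    }

    public static void main(String[] args) {
        Row row = new Row();
        List<Inlink> incoming = List.of(
            new Inlink("http://a.example/", "anchor a", 0.5f),
            new Inlink("http://b.example/", "anchor b", 0.25f));
        update(row, incoming, false);
        System.out.println("score=" + row.score + " inlinks=" + row.inlinks.size());
    }
}
```

With the flag set to false the row still receives its score, but no inlink
data is written, which is where the space saving would come from.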

Feel free to share other suggestions.

Ferdy.

On Fri, Aug 3, 2012 at 4:17 PM, Bai Shen <[email protected]> wrote:

> In Nutch 1.4, after I indexed a segment, I could delete it to save space.
> Is something like this possible with Nutch 2?
>
> Thanks.
>
