Cheers Lewis, perhaps I should attempt to rephrase the question. Clearly Nutch must download and store the contents of a page during a crawl. However, once you have indexed this content, does Nutch keep this data? Is it cleaned up automatically, or is there a command to do it?
Thanks

Chris

On 27 July 2011 17:14, lewis john mcgibbney <[email protected]> wrote:

> Hi Alexander,
>
> I don't want to state the obvious here, but this will depend directly on
> what type of loading your Nutch implementation deals with...
>
> You are correct in stating that we store data in segments, namely
> /crawl_fetch
> /content
> /crawl_parse
> /parse_data
> /crawl_generate
> /parse_text
>
> I understand that this doesn't add much value to answering your question,
> but as we are now indexing with Solr (and therefore not storing larger
> amounts of data with Nutch) I am struggling slightly to understand the
> issues you are trying to answer.
>
> On Mon, Jul 25, 2011 at 5:13 PM, Chris Alexander <[email protected]> wrote:
>
> > Hi all,
> >
> > I have been asked to look at doing some disk space estimates for our
> > Nutch usage. It looks like Nutch stores the content of the pages it
> > downloads and indexes in its data directory for the segment; is this
> > the case?
> >
> > Are there any other major storage requirements I should make note of
> > with Nutch specifically (not the Solr storage, we can handle that bit)?
> >
> > Cheers
> >
> > Chris
>
> --
> *Lewis*
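For anyone doing the same disk-space estimate: a minimal sketch of how you might measure per-segment usage and reclaim space afterwards. The crawl path and segment timestamp below are fabricated for illustration (a tiny mock segment layout is created so the commands actually run); substitute your own crawl directory. The assumption here is a local-filesystem Nutch 1.x crawl, where removing an already-indexed segment is a manual `rm` rather than a built-in command.

```shell
# Fabricate a tiny segment layout (hypothetical paths, for illustration only).
CRAWL=$(mktemp -d)/crawl
SEG="$CRAWL/segments/20110727101500"
for part in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
  mkdir -p "$SEG/$part"
  printf 'sample data' > "$SEG/$part/part-00000"
done

# Per-subdirectory usage in KB; on a real crawl, /content is usually dominant.
du -sk "$SEG"/* | sort -rn

# After a segment has been parsed and indexed to Solr, Nutch does not delete
# it for you; reclaiming the space is a manual step.
rm -r "$CRAWL/segments/20110727101500"
ls "$CRAWL/segments" | wc -l   # prints 0 once the segment is removed
```

The same idea applies on HDFS with `hadoop fs -du` and `hadoop fs -rm -r` in place of `du` and `rm`.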

