Cheers Lewis, perhaps I should attempt to rephrase the question. Clearly Nutch must download and store the contents of a page during a crawl. However, once you have indexed this content, does Nutch keep this data? Is it cleaned up automatically, or is there a command to do it?
Thanks

Chris

On 27 July 2011 17:14, lewis john mcgibbney <[email protected]> wrote:

> Hi Alexander,
>
> I don't want to state the obvious here, but this will depend directly on
> what type of loading your Nutch implementation deals with...
>
> You are correct in stating that we store data in segments, namely
> /crawl_fetch
> /content
> /crawl_parse
> /parse_data
> /crawl_generate
> /parse_text
>
> I understand that this doesn't add much value to answering your question,
> but as we are now indexing with Solr (and therefore not storing larger
> amounts of data with Nutch) I am struggling slightly to understand the
> issues you are trying to answer.
>
> On Mon, Jul 25, 2011 at 5:13 PM, Chris Alexander <[email protected]> wrote:
>
> > Hi all,
> >
> > I have been asked to look at doing some disk space estimates for our
> > Nutch usage. It looks like Nutch stores the content of the pages it
> > downloads and indexes in its data directory for the segment; is this
> > the case?
> >
> > Are there any other major storage requirements I should make note of
> > with Nutch specifically (not the Solr storage, we can handle that bit)?
> >
> > Cheers
> >
> > Chris
>
> --
> *Lewis*
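For anyone doing the same disk-space estimate: a minimal sketch of how you might measure per-segment usage and reclaim space afterwards. The crawl path and segment timestamp below are fabricated for illustration (a tiny mock segment layout is created so the commands actually run); substitute your own crawl directory. The assumption here is a local-filesystem Nutch 1.x crawl, where removing an already-indexed segment is a manual `rm` rather than a built-in command.

```shell
# Fabricate a tiny segment layout (hypothetical paths, for illustration only).
CRAWL=$(mktemp -d)/crawl
SEG="$CRAWL/segments/20110727101500"
for part in crawl_generate crawl_fetch content crawl_parse parse_data parse_text; do
  mkdir -p "$SEG/$part"
  printf 'sample data' > "$SEG/$part/part-00000"
done

# Per-subdirectory usage in KB; on a real crawl, /content is usually dominant.
du -sk "$SEG"/* | sort -rn

# After a segment has been parsed and indexed to Solr, Nutch does not delete
# it for you; reclaiming the space is a manual step.
rm -r "$CRAWL/segments/20110727101500"
ls "$CRAWL/segments" | wc -l   # prints 0 once the segment is removed
```

The same idea applies on HDFS with `hadoop fs -du` and `hadoop fs -rm -r` in place of `du` and `rm`.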

