Hello - it seems you only need some of the segment directories. I am sure you
can remove crawl_generate, but I'm not immediately sure about some of the
others. You would need to check FetcherOutputFormat and ParseOutputFormat to
see which directory contains the data structures you need. Or maybe there is a
page on the wiki that explains the precise contents of each dir.

Markus
 
-----Original message-----
> From:Tomasz <[email protected]>
> Sent: Wednesday 24th February 2016 15:46
> To: [email protected]
> Subject: Re: Nutch single instance
> 
> Markus, thanks for sharing. Changing the topic a bit. A few messages
> earlier I asked about storing only links between pages, without the content.
> With your great help I run Nutch with fetcher.store.content = false and
> fetcher.parse = true, and omit the parse step in the generate/fetch/update
> cycle. What's more, I remove parse_text from the segments directory after
> each cycle to save space, but the space used by segments is growing rapidly
> and I wonder if I really need all the data. Let me summarise my case - I
> crawl only to get the connections between pages (inverted links with
> anchors) and I don't need the content. I run the generate/fetch/update
> cycle continuously (I've set a time limit for the fetcher of max 90 min).
> Is there a way I can save more storage space? Thanks.
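> 
> (For reference, this is roughly how those two overrides sit in
> conf/nutch-site.xml - just a sketch of the relevant bits:)
> 
>   <property>
>     <name>fetcher.store.content</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>fetcher.parse</name>
>     <value>true</value>
>   </property>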
> 
> Tomasz
> 
> 2016-02-24 12:09 GMT+01:00 Markus Jelsma <[email protected]>:
> 
> > Hi - see inline.
> > Markus
> >
> > -----Original message-----
> > > From:Tomasz <[email protected]>
> > > Sent: Wednesday 24th February 2016 11:54
> > > To: [email protected]
> > > Subject: Nutch single instance
> > >
> > > Hello,
> > >
> > > After a few days of testing Nutch with Amazon EMR (1 master and 2 slaves)
> > > I had to give up. It was extremely slow (avg. fetching speed of 8 urls/sec
> > > across those 2 slaves) and, along with the map-reduce overhead, the whole
> > > solution didn't satisfy me at all. I moved the Nutch crawl databases and
> > > segments to a single EC2 instance and it works pretty fast now, reaching
> > > 35 fetched pages/sec with an avg. of 25/sec. I know that Nutch is designed
> > > to work in a Hadoop environment and regret it didn't work in my case.
> >
> > Setting up Nutch the correct way is a delicate matter and takes quite some
> > trial and error. In general, more machines are faster, but in some cases
> > one fast beast can easily outperform a few less powerful machines.
> >
> > >
> > > Anyway, I would like to know if I'm alone with this approach or whether
> > > everybody sets up Nutch with Hadoop. If not, and some of you run Nutch on
> > > a single instance, maybe you can share some best practices, e.g. do you
> > > use the crawl script or run generate/fetch/update continuously, perhaps
> > > with some cron jobs?
> >
> > Well, in both cases you need some script(s) to run the jobs. We have a lot
> > of complicated scripts that get stuff from everywhere. We have integrated
> > Nutch into our Sitesearch platform, so it has to be coupled to a lot of
> > different systems. We still rely on bash scripts, but Python is probably
> > easier if the scripts get complicated. Ideally, in a distributed
> > environment, you use Apache Oozie to run the crawls.
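> >
> > If it helps, the bare loop is roughly this - only a sketch, with made-up
> > paths, -topN and thread count, and no parse step since fetcher.parse = true
> > parses during the fetch:
> >
> >   CRAWLDB=crawl/crawldb; SEGMENTS=crawl/segments; LINKDB=crawl/linkdb
> >   while true; do
> >     bin/nutch generate $CRAWLDB $SEGMENTS -topN 50000
> >     SEGMENT=$SEGMENTS/$(ls -t $SEGMENTS | head -1)   # newest segment
> >     bin/nutch fetch $SEGMENT -threads 50             # parses inline
> >     bin/nutch updatedb $CRAWLDB $SEGMENT
> >     bin/nutch invertlinks $LINKDB $SEGMENT
> >   done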
> >
> > >
> > > Btw. I can see retry 0, retry 1, retry 2 and so on in the crawldb stats -
> > > what exactly do they mean?
> >
> > These count transient errors, e.g. connection time-outs and connection
> > resets, but also 5xx errors, which are usually transient too. Those records
> > are eligible for recrawl 24 hours later. By default, after retry 3, a
> > record goes from db_unfetched to db_gone.
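> >
> > (If memory serves, the retry threshold is db.fetch.retry.max, which you
> > can override in nutch-site.xml:)
> >
> >   <property>
> >     <name>db.fetch.retry.max</name>
> >     <value>3</value>   <!-- default; past this a record goes to db_gone -->
> >   </property>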
> >
> > >
> > > Regards,
> > > Tomasz
> > >
> > > Here are my current crawldb stats:
> > > TOTAL urls:     16347942
> > > retry 0:        16012503
> > > retry 1:        134346
> > > retry 2:        106037
> > > retry 3:        95056
> > > min score:      0.0
> > > avg score:      0.04090025
> > > max score:      331.052
> > > status 1 (db_unfetched):        14045806
> > > status 2 (db_fetched):  1769382
> > > status 3 (db_gone):     160768
> > > status 4 (db_redir_temp):       68104
> > > status 5 (db_redir_perm):       151944
> > > status 6 (db_notmodified):      151938
> > >
> >
> 
