Yes, GZip will certainly help a lot until you get compression sorted out.
GZip is not splittable, so you have to decompress a segment before loading
it again.
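
A rough sketch of that round trip (the paths and the segment name are just
examples):

  # pack and remove a segment that is not needed for a while
  tar -czf 20160301181100.tar.gz -C crawl/segments 20160301181100 \
    && rm -r crawl/segments/20160301181100
  # unpack it again before any job that has to read it
  tar -xzf 20160301181100.tar.gz -C crawl/segments
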
-----Original message-----
> From:Tomasz <[email protected]>
> Sent: Tuesday 1st March 2016 18:11
> To: [email protected]
> Subject: Re: Nutch single instance
>
> Since I didn't manage to enable the compression, I worked out another
> solution to save space, or at least to buy some time until I get it to
> work. After each generate/fetch/update/invertlinks cycle I gzip the most
> recent segment directory, since I won't need it for 30 days (the next
> fetch time); see the sketch below. I'm not giving up on setting up Nutch
> in pseudo-distributed mode to get the benefits, especially the compression.
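>
> The cycle now looks roughly like this (a sketch; the paths, topN, and
> segment handling are illustrations, not my exact commands):
>
>   bin/nutch generate crawl/crawldb crawl/segments -topN 50000
>   SEGMENT=$(ls -d crawl/segments/2* | tail -1)
>   bin/nutch fetch "$SEGMENT"
>   bin/nutch updatedb crawl/crawldb "$SEGMENT"
>   bin/nutch invertlinks crawl/linkdb "$SEGMENT"
>   # the segment is not needed for the next 30 days, so pack it away
>   tar -czf "$SEGMENT.tar.gz" -C crawl/segments "$(basename "$SEGMENT")"
>   rm -r "$SEGMENT"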
>
> 2016-02-26 13:07 GMT+01:00 Markus Jelsma <[email protected]>:
>
> > I am not sure it will work on a single node / local instance. But it would
> > be a good idea to run stuff on Yarn and HDFS anyway, even in local mode. It
> > has some benefits, and perhaps even compression that works.
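> >
> > A rough sketch of what I mean (assuming a stock Hadoop 2.x install and a
> > Nutch source build; paths are examples):
> >
> >   $HADOOP_HOME/sbin/start-dfs.sh
> >   $HADOOP_HOME/sbin/start-yarn.sh
> >   # the deploy wrapper submits the Nutch .job jar to the cluster
> >   runtime/deploy/bin/nutch generate crawl/crawldb crawl/segments
> >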
> > Markus
> >
> > -----Original message-----
> > > From:Tomasz <[email protected]>
> > > Sent: Thursday 25th February 2016 22:25
> > > To: [email protected]
> > > Subject: Re: Nutch single instance
> > >
> > > Thanks for the hint, but it still doesn't work. I ran the commands
> > > with the following arguments:
> > >
> > > -D mapreduce.map.output.compress=true
> > > -D mapreduce.output.fileoutputformat.compress=false
> > > and
> > > -D mapreduce.map.output.compress=true
> > > -D mapreduce.output.fileoutputformat.compress=true
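> > >
> > > For reference, one full invocation looked roughly like this (the
> > > segment path is just an example):
> > >
> > >   bin/nutch fetch -D mapreduce.map.output.compress=true \
> > >     -D mapreduce.output.fileoutputformat.compress=true \
> > >     crawl/segments/20160225120000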
> > >
> > > Used space didn't change regardless of the true/false value for the
> > > 2nd parameter, and it consumes about 1-1.5GB for each
> > > generate/fetch/update cycle, which means I will run out of disk space
> > > in a few days. I'm not even sure if compression is available on the
> > > machine, but on the other hand I didn't notice any errors/warnings. I
> > > don't use slaves, it's a single node instance; maybe the mapreduce
> > > arguments don't work in such an environment? Markus, what to do?
> > >
> > > Tomasz
> > >
> > > 2016-02-25 15:09 GMT+01:00 Markus Jelsma <[email protected]>:
> > >
> > > > Hi - no, not just that. My colleague tells me you also need
> > > > mapreduce.output.fileoutputformat.compress.
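> > > > E.g. both together on one job, with a quick size check afterwards
> > > > (a sketch; the segment path is an example):
> > > >
> > > >   bin/nutch updatedb -D mapreduce.map.output.compress=true \
> > > >     -D mapreduce.output.fileoutputformat.compress=true \
> > > >     crawl/crawldb crawl/segments/20160225120000
> > > >   du -sh crawl/crawldb
> > > >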
> > > > Markus
> > > >
> > > > -----Original message-----
> > > > > From:Tomasz <[email protected]>
> > > > > Sent: Thursday 25th February 2016 11:10
> > > > > To: [email protected]
> > > > > Subject: Re: Nutch single instance
> > > > >
> > > > > Great, I removed crawl_generate and it helps a bit to save space.
> > > > > I run nutch commands with -D mapreduce.map.output.compress=true but
> > > > > don't see any significant space drop. Is this enough to enable
> > > > > compression? Thanks.
> > > > >
> > > > > 2016-02-24 21:39 GMT+01:00 Markus Jelsma <[email protected]>:
> > > > >
> > > > > > Oh, I forgot the following: enable Hadoop's snappy compression on
> > > > > > in- and output files. It reduced our storage requirements to 10%
> > > > > > of the original file size. Apparently Nutch's data structures are
> > > > > > easily compressed. It also greatly reduces I/O, thus speeding up
> > > > > > all load times. CPU usage is negligible compared to I/O wait.
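> > > > > >
> > > > > > Concretely, something like this on each job (a sketch; these are
> > > > > > the Hadoop 2.x property names, and snappy has to be present in
> > > > > > your native libraries):
> > > > > >
> > > > > >   -D mapreduce.map.output.compress=true
> > > > > >   -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
> > > > > >   -D mapreduce.output.fileoutputformat.compress=true
> > > > > >   -D mapreduce.output.fileoutputformat.compress.type=BLOCK
> > > > > >   -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec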
> > > > > >
> > > > > > Markus
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From:Tomasz <[email protected]>
> > > > > > > Sent: Wednesday 24th February 2016 15:46
> > > > > > > To: [email protected]
> > > > > > > Subject: Re: Nutch single instance
> > > > > > >
> > > > > > > Markus, thanks for sharing. Changing the topic a bit: a few
> > > > > > > messages earlier I asked about storing only links between pages,
> > > > > > > without the content. With your great help I run Nutch with
> > > > > > > fetcher.store.content = false and fetcher.parse = true and omit
> > > > > > > the parse step in the generate/fetch/update cycle. What's more,
> > > > > > > I remove parse_text from the segments directory after each cycle
> > > > > > > to save space, but the space used by segments is growing rapidly
> > > > > > > and I wonder if I really need all the data. Let me summarise my
> > > > > > > case: I crawl only to get connections between pages (inverted
> > > > > > > links with anchors) and I don't need the content. I run the
> > > > > > > generate/fetch/update cycle continuously (I've set a time limit
> > > > > > > for the fetcher to run max 90 min). Is there a way I can save
> > > > > > > more storage space? Thanks.
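> > > > > > >
> > > > > > > For reference, the relevant settings and the cleanup step (the
> > > > > > > rm is my own shell step, not a Nutch command; the segment name
> > > > > > > is an example):
> > > > > > >
> > > > > > >   -D fetcher.store.content=false
> > > > > > >   -D fetcher.parse=true
> > > > > > >   -D fetcher.timelimit.mins=90
> > > > > > >   rm -r crawl/segments/20160224120000/parse_text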
> > > > > > >
> > > > > > > Tomasz
> > > > > > >
> > > > > > > 2016-02-24 12:09 GMT+01:00 Markus Jelsma <[email protected]>:
> > > > > > >
> > > > > > > > Hi - see inline.
> > > > > > > > Markus
> > > > > > > >
> > > > > > > > -----Original message-----
> > > > > > > > > From:Tomasz <[email protected]>
> > > > > > > > > Sent: Wednesday 24th February 2016 11:54
> > > > > > > > > To: [email protected]
> > > > > > > > > Subject: Nutch single instance
> > > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > After a few days of testing Nutch with Amazon EMR (1 master
> > > > > > > > > and 2 slaves) I had to give up. It was extremely slow (avg.
> > > > > > > > > fetching speed of 8 urls/sec counting those 2 slaves) and,
> > > > > > > > > along with the map-reduce overhead, the whole solution didn't
> > > > > > > > > satisfy me at all. I moved the Nutch crawl databases and
> > > > > > > > > segments to a single EC2 instance and it works pretty fast
> > > > > > > > > now, reaching 35 fetched pages/sec with an avg. of 25/sec. I
> > > > > > > > > know that Nutch is designed to work in a Hadoop environment
> > > > > > > > > and regret it didn't work in my case.
> > > > > > > >
> > > > > > > > Setting up Nutch the correct way is a delicate matter and
> > > > > > > > takes quite some trial and error. In general, more machines
> > > > > > > > are faster, but in some cases one fast beast can easily
> > > > > > > > outperform a few less powerful machines.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Anyway, I would like to know if I'm alone with this approach
> > > > > > > > > or whether everybody sets up Nutch with Hadoop. If not, and
> > > > > > > > > some of you run Nutch on a single instance, maybe you can
> > > > > > > > > share some best practices, e.g. do you use the crawl script
> > > > > > > > > or run generate/fetch/update continuously, perhaps using
> > > > > > > > > some cron jobs?
> > > > > > > >
> > > > > > > > Well, in both cases you need some script(s) to run the jobs.
> > > > > > > > We have a lot of complicated scripts that get stuff from
> > > > > > > > everywhere. We have integrated Nutch into our Sitesearch
> > > > > > > > platform, so it has to be coupled to a lot of different
> > > > > > > > systems. We still rely on bash scripts, but Python is probably
> > > > > > > > easier if the scripts are complicated. Ideally, in a
> > > > > > > > distributed environment, you use Apache Oozie to run the
> > > > > > > > crawls.
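> > > > > > > >
> > > > > > > > On a single node, a plain cron entry around a wrapper script
> > > > > > > > is often enough, e.g. (the script name and paths are made up):
> > > > > > > >
> > > > > > > >   # run one crawl cycle every two hours
> > > > > > > >   0 */2 * * * /opt/nutch/run-cycle.sh >> /var/log/nutch-cycle.log 2>&1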
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Btw. I can see retry 0, retry 1, retry 2 and so on in the
> > > > > > > > > crawldb stats - what exactly do they mean?
> > > > > > > >
> > > > > > > > These are transient errors, e.g. connection timeouts and
> > > > > > > > connection resets, but also 5xx errors, which are usually
> > > > > > > > transient. They are eligible for recrawl 24 hours later. By
> > > > > > > > default, after retry 3, the record goes from db_unfetched to
> > > > > > > > db_gone.
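> > > > > > > >
> > > > > > > > That threshold is the db.fetch.retry.max property (3 by
> > > > > > > > default), so it can be raised or lowered per job, e.g.:
> > > > > > > >
> > > > > > > >   -D db.fetch.retry.max=3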
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Tomasz
> > > > > > > > >
> > > > > > > > > Here are my current crawldb stats:
> > > > > > > > > TOTAL urls: 16347942
> > > > > > > > > retry 0: 16012503
> > > > > > > > > retry 1: 134346
> > > > > > > > > retry 2: 106037
> > > > > > > > > retry 3: 95056
> > > > > > > > > min score: 0.0
> > > > > > > > > avg score: 0.04090025
> > > > > > > > > max score: 331.052
> > > > > > > > > status 1 (db_unfetched): 14045806
> > > > > > > > > status 2 (db_fetched): 1769382
> > > > > > > > > status 3 (db_gone): 160768
> > > > > > > > > status 4 (db_redir_temp): 68104
> > > > > > > > > status 5 (db_redir_perm): 151944
> > > > > > > > > status 6 (db_notmodified): 151938
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>