Hello - if you have slow disks but plenty of CPU power, bzip2 would be a good
choice. Otherwise gzip is probably a more suitable candidate.
Markus
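The disk-versus-CPU trade-off above can be checked on your own data before changing any job settings. The following is only a minimal sketch using Python's standard gzip and bz2 modules, not Hadoop's codecs; the function name compare_codecs and the sample records are made up for illustration, and real CrawlDb data may behave differently.

```python
import bz2
import gzip
import time


def compare_codecs(payload: bytes) -> dict:
    """Compress payload with gzip and bzip2; report output size and elapsed time."""
    results = {}
    for name, compress, decompress in (
        ("gzip", gzip.compress, gzip.decompress),
        ("bzip2", bz2.compress, bz2.decompress),
    ):
        start = time.perf_counter()
        packed = compress(payload)
        elapsed = time.perf_counter() - start
        # Round-trip check so the numbers describe a lossless result.
        assert decompress(packed) == payload
        results[name] = {"bytes": len(packed), "seconds": elapsed}
    return results


if __name__ == "__main__":
    # Hypothetical sample: many small, repetitive URL-like records.
    sample = b"".join(
        b"http://example.com/page/%d\tdb_fetched\t0.5\n" % i for i in range(50000)
    )
    for codec, stats in compare_codecs(sample).items():
        print(f"{codec}: {stats['bytes']} bytes in {stats['seconds']:.3f} s")
```

Running it on a representative slice of your own output shows how much extra CPU time each codec costs for the bytes it saves, which is the comparison that matters when disks are the bottleneck.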
-Original message-
> From: Michael Coffey
> Sent: Wednesday 3rd May 2017 20:30
> To: User
> Subject: Re: crawlDb speed around deduplicat
Hi Michael,
> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody
> know a reason I shouldn't try that?
That's not a bad choice.
But more important: the records (CrawlDatum objects) in the CrawlDb are small,
so you should set mapreduce.output.fileoutputformat.compres
Hi Michael,
Good to hear my suggestion helped you!
Kind Regards,
Furkan KAMACI
On Wed, May 3, 2017 at 8:55 PM, Michael Coffey
wrote:
> Thank you, Furkan, for the excellent suggestion.
>
> Once I found the Solr logs, it was not too hard to discover that there
> were OutOfMemory exceptions neigh
Good tip!
The compression topic is interesting because we spend a lot of time reading and
writing files.
For the dedup-crawldb job, I have:
mapreduce.map.output.compress = true (from command line)
mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output
Thank you, Furkan, for the excellent suggestion.
Once I found the Solr logs, it was not too hard to discover that there were
OutOfMemory exceptions neighboring the "possible analysis" errors. I was able
to fix it by boosting the Java heap size (not easy to do using Docker). Blaming
Solr for the
Hi,
Setting the MapReduce framework to YARN solved this issue.
Yossi.
From: Yossi Tamari [mailto:yossi.tam...@pipl.com]
Sent: 30 April 2017 17:04
To: user@nutch.apache.org
Subject: Wrong FS exception in Fetcher
Hi,
I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pse
Hi Team,
I have Nutch and Solr set up and integrated with my website, and it is working well.
Now I need to update the search index since there are updates to the content of
my website. I have a few queries regarding this as follows:
1. Do I need to delete the contents of the crawl folder
(apache-