RE: crawlDb speed around deduplication

2017-05-03 Thread Markus Jelsma
Hello - if you have slow disks but plenty of CPU power, bzip2 would be a good choice. Otherwise gzip is probably a more suitable candidate.

Markus
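For reference, the codec can be chosen per job with Hadoop -D properties. A minimal sketch, assuming the 1.x dedup command accepts generic Hadoop options via ToolRunner and that crawl/crawldb is a hypothetical CrawlDb path:

# Hypothetical invocation; BZip2Codec trades CPU time for smaller output,
# GzipCodec (or DefaultCodec) is lighter on CPU.
bin/nutch dedup \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
  crawl/crawldb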

Re: crawlDb speed around deduplication

2017-05-03 Thread Sebastian Nagel
Hi Michael,

> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody know a reason I shouldn't try that?

That's not a bad choice. But more important: the records (CrawlDatum objects) in the CrawlDb are small, so you should set mapreduce.output.fileoutputformat.compress…
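A minimal sketch of the output-compression properties in question, assuming the advice is block-level SequenceFile compression (record-level compression gains little when each CrawlDatum is small) and the same hypothetical crawl/crawldb path:

# Assumed settings: compress whole SequenceFile blocks rather than
# individual (small) CrawlDatum records.
bin/nutch dedup \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.type=BLOCK \
  crawl/crawldb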

Re: indexer "possible analysis error"

2017-05-03 Thread Furkan KAMACI
Hi Michael,

Good to hear my suggestion helped you!

Kind Regards,
Furkan KAMACI

Re: crawlDb speed around deduplication

2017-05-03 Thread Michael Coffey
Good tip! The compression topic is interesting because we spend a lot of time reading and writing files. For the dedup-crawldb job, I have:

mapreduce.map.output.compress = true (from command line)
mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output…
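Putting the pieces together, a hedged sketch of a single invocation that compresses both the intermediate map output and the final CrawlDb output (codec choices are only illustrative, paths hypothetical):

bin/nutch dedup \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.DefaultCodec \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  crawl/crawldb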

Re: indexer "possible analysis error"

2017-05-03 Thread Michael Coffey
Thank you, Furkan, for the excellent suggestion.

Once I found the solr logs, it was not too hard to discover that there were OutOfMemory exceptions neighboring the "possible analysis" errors. I was able to fix it by boosting the Java heap size (not easy to do using docker). Blaming solr for the…
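One way to raise the Solr heap under Docker, sketched under the assumption that the official solr image is used (its start script honors the SOLR_HEAP environment variable); container name, port, and heap size are illustrative:

# Give the Solr JVM a 2 GB heap instead of the default.
docker run -d --name solr-nutch -p 8983:8983 -e SOLR_HEAP=2g solr:6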

RE: Wrong FS exception in Fetcher

2017-05-03 Thread Yossi Tamari
Hi,

Setting the MapReduce framework to YARN solved this issue.

Yossi.

> From: Yossi Tamari
> Sent: 30 April 2017 17:04
> To: user@nutch.apache.org
> Subject: Wrong FS exception in Fetcher
>
> Hi, I'm trying to run Nutch 1.13 on Hadoop 2.8.0 in pse…
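A minimal sketch of the setting referred to, assuming $HADOOP_CONF_DIR points at the active configuration directory and that no mapred-site.xml exists there yet (merge the property in by hand otherwise):

# Tell MapReduce job clients to submit to YARN instead of running locally.
cat > "$HADOOP_CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF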

Nutch and SOLR - Updating DB and indexes

2017-05-03 Thread Ajmal Rahman
Hi Team,

I have a Nutch and Solr setup integrated with my website, and it is working well. Now I need to update the search index since the content of my website has changed. I have a few queries regarding this:

1. Do I need to delete the contents of the crawl folder (apache-…
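For context, a minimal sketch of a typical incremental re-crawl round with the 1.x scripts, assuming hypothetical crawl/ paths and Solr URL, and reusing (not deleting) the existing CrawlDb:

# Generate a new segment from the existing CrawlDb, fetch and parse it,
# fold the results back into the CrawlDb/LinkDb, then re-index to Solr.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
bin/nutch invertlinks crawl/linkdb "$SEGMENT"
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch \
  crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"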