Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-04 Thread Sebastian Nagel
Hi Lewis, hi Markus,

> snappy compression, which is a massive improvement for large data shuffling jobs

Yes, I can confirm this. Also: it's worth considering zstd for any data kept for longer. We use it for a 25-billion CrawlDB: it's almost as fast (both compression and decompression) as
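Switching long-lived job output to zstd is only a configuration change. A minimal sketch for mapred-site.xml, assuming a Hadoop 3.x build with zstd native support (the property names are standard Hadoop properties; applying them cluster-wide vs. per-job is a deployment choice):

```xml
<!-- mapred-site.xml: compress final job output with zstd -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.ZStandardCodec</value>
</property>
```

The same codec class can also be passed per job via `-D` options instead of editing the site config.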

Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-03 Thread Nicholas Roberts
Does the Apache Bigtop project not meet the requirements of a free distribution? https://github.com/apache/bigtop What is the status of that project?

Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-03 Thread lewis john mcgibbney
Hi Sebastian,

If we did not know how long our crawl infrastructure would be required (i.e. the customer might revoke or extend the contract with very little notice), we always chose AWS EMR. Specifically, to reduce costs we made sure that all worker/task nodes were run on spot instances
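The spot-instance setup described above can be sketched with the AWS CLI; this is illustrative only — cluster name, release label, instance types, counts, and the bid price are all assumptions, not values from the thread:

```shell
# Sketch: master and core nodes on-demand (HDFS lives on core nodes),
# task nodes on spot instances to cut costs. Setting BidPrice on an
# instance group makes EMR provision it as spot capacity.
aws emr create-cluster \
  --name "nutch-crawl" \
  --release-label emr-6.3.0 \
  --applications Name=Hadoop Name=Spark \
  --use-default-roles \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceType=m5.xlarge,InstanceCount=1 \
    InstanceGroupType=CORE,InstanceType=m5.xlarge,InstanceCount=2 \
    InstanceGroupType=TASK,InstanceType=m5.xlarge,InstanceCount=8,BidPrice=0.10
```

Keeping HDFS off the spot (task) nodes matters: losing a spot task node only loses compute, not blocks.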

Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Markus Jelsma
Hello Sebastian,

We have always used vanilla Apache Hadoop on our own physical servers, running the latest Debian, which also runs on ARM. It will run HDFS and YARN and any other custom job you can think of. It has snappy compression, which is a massive improvement for large data
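The shuffle win from snappy comes from compressing intermediate map output. A minimal sketch for mapred-site.xml, assuming snappy native libraries are installed (standard Hadoop property names):

```xml
<!-- mapred-site.xml: compress intermediate map output with snappy -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

This compresses data between the map and reduce phases only; final output compression is configured separately.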

Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Sebastian Nagel
Hi,

does anybody have a recommendation for a free and production-ready Hadoop setup?

- HDFS + YARN
- run Nutch but also other MapReduce and Spark-on-YARN jobs
- with native library support: libhadoop.so and compression libs (bzip2, zstd, snappy)
- must run on AWS EC2 instances and read/write
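The native-library requirement listed above can be verified on any candidate setup with Hadoop's built-in check (run on a node with the distribution installed; exact output varies by build):

```shell
# Reports whether libhadoop.so was loaded and which compression
# libraries (zlib, snappy, zstd, bzip2, ...) have native bindings.
hadoop checknative -a
```

With `-a` the command exits non-zero if any native library is missing, which makes it usable in provisioning scripts.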