Hi Lewis, hi Markus,
> snappy compression, which is a massive improvement for large data shuffling
> jobs
Yes, I can confirm this. Also: it's worth considering zstd for all data kept
for longer. We use it for a 25-billion-record CrawlDB: it's almost as fast
(both compression and decompression) as snappy.
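A minimal sketch of what that looks like in practice, using the standard
Hadoop property names (it assumes libhadoop.so was built with zstd support,
otherwise the zstd codec is unavailable at runtime):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ZstdJobConfig {
      public static Job newZstdJob() throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate (shuffle) data; snappy is the usual choice here.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.set("mapreduce.map.output.compress.codec",
            "org.apache.hadoop.io.compress.SnappyCodec");
        // Compress the job output with zstd for data kept long-term (e.g. a CrawlDB).
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        conf.set("mapreduce.output.fileoutputformat.compress.codec",
            "org.apache.hadoop.io.compress.ZStandardCodec");
        return Job.getInstance(conf, "zstd-output-job");
      }
    }

This mirrors the trade-off above: fast snappy for throwaway shuffle data, zstd
for anything that stays on disk.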
Does the Apache Bigtop project not meet the requirements of a free
distribution?
https://github.com/apache/bigtop
What is the status of that project?
Hi Sebastian,
If we did not know how long our crawl infrastructure would be required (i.e.
the customer might revoke or extend the contract at very short notice), we
always chose AWS EMR. Specifically, to reduce costs we made sure that all
worker/task nodes ran on spot instances.
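A minimal sketch of that layout with the AWS SDK for Java (v1 EMR client); the
cluster name, release label, instance types, counts and bid price are all
illustrative, not a recommendation:

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.*;

    public class SpotEmrCluster {
      public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName("nutch-crawl")            // illustrative name
            .withReleaseLabel("emr-6.3.0")      // pick a current EMR release
            .withApplications(new Application().withName("Hadoop"),
                              new Application().withName("Spark"))
            .withServiceRole("EMR_DefaultRole")
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withInstances(new JobFlowInstancesConfig()
                .withKeepJobFlowAliveWhenNoSteps(true)
                .withInstanceGroups(
                    // Master stays on-demand: losing it loses the cluster.
                    new InstanceGroupConfig()
                        .withInstanceRole(InstanceRoleType.MASTER)
                        .withMarket(MarketType.ON_DEMAND)
                        .withInstanceType("m5.xlarge")
                        .withInstanceCount(1),
                    // Worker/task capacity on spot instances to cut costs.
                    new InstanceGroupConfig()
                        .withInstanceRole(InstanceRoleType.CORE)
                        .withMarket(MarketType.SPOT)
                        .withBidPrice("0.15")   // illustrative bid, USD/hour
                        .withInstanceType("m5.xlarge")
                        .withInstanceCount(4)));
        System.out.println("Started: " + emr.runJobFlow(request).getJobFlowId());
      }
    }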
Hello Sebastian,
We have always used vanilla Apache Hadoop on our own physical servers running
the latest Debian, which also runs on ARM. It will run HDFS and YARN and any
other custom job you can think of. It has snappy compression, which is a
massive improvement for large data shuffling jobs.
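On a self-built setup like this it is worth verifying that the native
libraries are actually picked up ("hadoop checknative -a" on the command line
does the same); a small sketch, assuming a Hadoop 3.x classpath:

    import org.apache.hadoop.io.compress.ZStandardCodec;
    import org.apache.hadoop.util.NativeCodeLoader;

    public class CheckNative {
      public static void main(String[] args) {
        // True only if libhadoop.so was found on java.library.path.
        System.out.println("libhadoop: " + NativeCodeLoader.isNativeCodeLoaded());
        // True only if libhadoop was additionally built against libzstd.
        System.out.println("zstd:      " + ZStandardCodec.isNativeCodeLoaded());
      }
    }

If either prints false, the native codecs are unavailable and jobs either fail
or fall back to slower implementations, depending on the codec and the Hadoop
version.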
Hi,
does anybody have a recommendation for a free and production-ready Hadoop setup?
- HDFS + YARN
- run Nutch, but also other MapReduce and Spark-on-YARN jobs
- with native library support: libhadoop.so and compression libs (bzip2, zstd,
  snappy)
- must run on AWS EC2 instances and read/write data from/to S3 (see the s3a
  sketch below)
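For the last point, a minimal smoke test of the s3a connector; it assumes
hadoop-aws (and its bundled AWS SDK) is on the classpath and that credentials
come from the EC2 instance profile; the bucket name is made up:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3aSmokeTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // s3a:// URIs resolve to S3AFileSystem when hadoop-aws is present.
        FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket/"), conf);
        for (FileStatus st : fs.listStatus(new Path("s3a://example-bucket/crawldb/"))) {
          System.out.println(st.getPath() + "\t" + st.getLen());
        }
      }
    }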