Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Markus Jelsma
Hello Sebastian, we have always used vanilla Apache Hadoop on our own physical servers running the latest Debian, which also runs on ARM. It runs HDFS and YARN and any other custom job you can think of. It has Snappy compression, which is a massive improvement for large data

Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-01 Thread Sebastian Nagel
Hi, does anybody have a recommendation for a free and production-ready Hadoop setup?
- HDFS + YARN
- run Nutch but also other MapReduce and Spark-on-Yarn jobs
- with native library support: libhadoop.so and compression libs (bzip2, zstd, snappy)
- must run on AWS EC2 instances and read/write

Re: Adding html field to NutchDocument

2021-06-01 Thread Sebastian Nagel
Hi Kieran, thanks for the feedback!
> I didn't realise that it is intended for users to edit the bin/crawl file.
Maybe we should add a comment to encourage users to adapt the shell scripts to their needs. Almost 10 years ago, the Java "Crawl" class was replaced by the scripts because a shell

Re: Adding html field to NutchDocument

2021-06-01 Thread Kieran Munday
Hi Sebastian, thank you for your response. It was a great help. I didn't realise that it is intended for users to edit the bin/crawl file, although looking at it now, it's clear. This makes it easier for me to access the HTML content within my plugin. Thanks again.
On Fri, May 28, 2021 at 8:36 PM