Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-03 Thread Nicholas Roberts
Does the Apache Bigtop project not meet the requirements of a free
distribution?
https://github.com/apache/bigtop

What is the status of that project?


DuplexWeb-Google - GoogleBot Crawler For Duplex / Google Assistant

2021-06-03 Thread lewis john mcgibbney
Some interesting content for a short read :)

https://www.seroundtable.com/duplexweb-google-bot-31522.html

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-03 Thread lewis john mcgibbney
Hi Sebastian,
If we did not know how long our crawl infrastructure would be required
(i.e. the customer might revoke or extend the contract with very little
notice), we always chose AWS EMR. To reduce costs we made sure that all
worker/task nodes ran on spot instances
(https://aws.amazon.com/ec2/spot/use-case/emr/), which yielded significant
savings on larger deployments. This also meant, however, that we needed to
put additional monitoring (Ganglia) in place, plus disaster-recovery and
data-backup logic (custom, via hadoop fs and the AWS CLI), but that is good
practice anyway, so the small investment was well worth it.
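
For illustration, the FileSystem-level equivalent of that backup copy looks
roughly like this (a minimal sketch only - the bucket and paths are
placeholders, it assumes the s3a connector is configured, e.g. via the EMR
instance profile, and our actual scripts drove the same thing with hadoop fs
and the AWS CLI):

  import java.net.URI;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FileUtil;
  import org.apache.hadoop.fs.Path;

  public class CrawlBackup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem hdfs = FileSystem.get(conf); // default FS, HDFS on EMR
      FileSystem s3 = FileSystem.get(URI.create("s3a://my-backup-bucket/"), conf);
      // copy the crawl directory out of HDFS so it survives the loss
      // of spot nodes; 'false' keeps the source in place
      FileUtil.copy(hdfs, new Path("/user/nutch/crawl"),
          s3, new Path("s3a://my-backup-bucket/backups/crawl"),
          false, conf);
    }
  }
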
I had contemplated working more on the configuration-management side of
things, e.g. using Terraform or AWS CloudFormation to drive efficiencies in
repeatable deployments, but I never got around to it.
ARM support was never a concern for us, so I can't help there, sorry.
lewismc

> From: Sebastian Nagel
> To: user@nutch.apache.org
> Date: Tue, 1 Jun 2021 16:35:22 +0200
> Subject: Recommendation for free and production-ready Hadoop setup to run Nutch
> Hi,
>
> does anybody have a recommendation for a free and production-ready Hadoop
> setup?
>
> - HDFS + YARN
> - run Nutch but also other MapReduce and Spark-on-Yarn jobs
> - with native library support: libhadoop.so and compression
>libs (bzip2, zstd, snappy)
> - must run on AWS EC2 instances and read/write to S3
> - including smaller ones (2 vCPUs, 16 GiB RAM)
> - ideally,
>- Hadoop 3.3.0
>- Java 11 and
>- support to run on ARM machines
>
> So far, Common Crawl has used Cloudera CDH, but since free updates are
> no longer available we are considering either switching to Amazon EMR,
> taking a Cloudera subscription, or using vanilla Hadoop (esp. since only
> HDFS and YARN are required).
>
> A dockerized setup is also an option (at least for development and
> testing). So far, I've looked at [1] - the upgrade to Hadoop 3.3.0
> was straightforward [2]. But native library support is still missing.
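>
> (For what it's worth, a minimal check along these lines - only a sketch -
> confirms whether libhadoop.so and the codec classes load, much like
> hadoop checknative does, without exercising the codecs:)
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.compress.CompressionCodec;
>   import org.apache.hadoop.io.compress.CompressionCodecFactory;
>   import org.apache.hadoop.util.NativeCodeLoader;
>
>   public class NativeCheck {
>     public static void main(String[] args) {
>       Configuration conf = new Configuration();
>       // true only if libhadoop.so was found on java.library.path
>       System.out.println("libhadoop: " + NativeCodeLoader.isNativeCodeLoaded());
>       // resolve codecs by file extension, as MapReduce input handling does
>       CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
>       for (String ext : new String[] {".bz2", ".zst", ".snappy"}) {
>         CompressionCodec codec = codecs.getCodec(new Path("dummy" + ext));
>         System.out.println(ext + " -> "
>             + (codec == null ? "not available" : codec.getClass().getName()));
>       }
>     }
>   }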
>
> Thanks,
> Sebastian
>
> [1] https://github.com/big-data-europe/docker-hadoop
> [2]
> https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
>
>
>
>
> -- Forwarded message --
> From: Markus Jelsma
> To: user@nutch.apache.org
> Date: Tue, 1 Jun 2021 16:57:46 +0200
> Subject: Re: Recommendation for free and production-ready Hadoop setup to run Nutch
> Hello Sebastian,
>
> We have always used vanilla Apache Hadoop on our own physical servers
> running the latest Debian, which also runs on ARM. It will run HDFS
> and YARN and any other custom job you can think of. It has snappy
> compression, which is a massive improvement for large data-shuffling jobs,
> it runs on Java 11, and if necessary even on AWS, though I dislike AWS.
>
> You can easily read/write large files between HDFS and S3 without storing
> them on the local filesystem, so it ticks that box too.
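>
> (For example, a straight stream copy between the two filesystems - a rough
> sketch with placeholder paths, assuming the s3a connector is configured:)
>
>   import java.net.URI;
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataInputStream;
>   import org.apache.hadoop.fs.FSDataOutputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.io.IOUtils;
>
>   public class HdfsToS3 {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       FileSystem hdfs = FileSystem.get(conf); // default FS, assumed HDFS
>       FileSystem s3 = FileSystem.get(URI.create("s3a://some-bucket/"), conf);
>       try (FSDataInputStream in = hdfs.open(new Path("/data/part-00000"));
>            FSDataOutputStream out = s3.create(new Path("s3a://some-bucket/part-00000"))) {
>         // bytes stream straight from HDFS to S3; nothing touches local disk
>         IOUtils.copyBytes(in, out, 64 * 1024, false);
>       }
>     }
>   }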
>
> I don't know much about Docker, except that I don't like it either, but
> that is personal. I do like vanilla Apache Hadoop.
>
> Regards,
> Markus


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc