Hi
We have HBase set up on an AWS cluster running CentOS 7. The setup was
done using Cloudera Manager. Nutch can then be run in standalone mode or
over YARN by running the deployment jar in the deploy folder.
I have not tested with S3 directly, but you can always back up the HBase
data daily to S3.
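
As a rough illustration of the daily backup idea, here is a minimal Python sketch that takes an HBase snapshot and copies it to S3 with HBase's ExportSnapshot tool. It is not what we run in production. The table name `webpage`, the bucket URI `s3a://example-bucket/hbase-backups`, and the mapper count are assumptions for illustration, and the cluster is assumed to already have S3A credentials configured.

```python
import datetime
import subprocess


def snapshot_name(table, day):
    """Build a dated snapshot name, e.g. webpage-20160501."""
    return f"{table}-{day:%Y%m%d}"


def export_command(snapshot, target_uri, mappers=4):
    """Return the argv list for HBase's ExportSnapshot MapReduce job."""
    return [
        "hbase", "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", target_uri,
        "-mappers", str(mappers),
    ]


def run_backup(table="webpage",  # assumed table name
               target_uri="s3a://example-bucket/hbase-backups"):  # hypothetical bucket
    """Snapshot the table through the HBase shell, then export it to S3."""
    name = snapshot_name(table, datetime.date.today())
    # 'hbase shell -n' runs non-interactively and reads commands from stdin.
    subprocess.run(["hbase", "shell", "-n"],
                   input=f"snapshot '{table}', '{name}'\n",
                   text=True, check=True)
    subprocess.run(export_command(name, target_uri), check=True)
```

Calling `run_backup()` from a daily cron job on a node with the HBase client installed would give one snapshot per day; old snapshots would still need to be cleaned up separately.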
Hey,
currently we are on Nutch 2.3.1 and use it to crawl our websites.
One of our main goals is to get all the PDFs on our websites crawled. Links on
different websites look like: https://assets0.mysite.com/asset /DB_product.pdf
I tried different things:
At the configurations I removed ever occur