Re: Best practice for Nutch 2.x on AWS?

2017-08-08 Thread Divjot Singh
Hi, we have HBase set up on an AWS cluster running CentOS 7. The setup was done using Cloudera Manager. Nutch can then be run in standalone mode or over YARN by running the deployment jar in the deploy folder. I have not tested with S3 directly, but you can always back up the HBase data to S3 daily.
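
A rough sketch of what such a daily backup could look like, assuming HBase snapshots plus the ExportSnapshot tool are used (untested on this cluster). The table name `webpage`, the bucket URI, and the helper `backup_commands` are placeholders for illustration, and the `s3a://` filesystem is assumed to be configured on the cluster already.

```python
import datetime


def backup_commands(table, s3_uri, day, mappers=4):
    """Build the two steps of a dated snapshot backup (hypothetical helper).

    Returns the statement to feed to `hbase shell` and the argv for the
    ExportSnapshot MapReduce job that copies the snapshot to S3.
    """
    # Dated snapshot name, e.g. webpage-20170808
    snapshot = f"{table}-{day:%Y%m%d}"

    # Step 1: statement for the HBase shell that takes the snapshot.
    shell_stmt = f"snapshot '{table}', '{snapshot}'"

    # Step 2: export the snapshot files to the S3 location.
    export_argv = [
        "hbase",
        "org.apache.hadoop.hbase.snapshot.ExportSnapshot",
        "-snapshot", snapshot,
        "-copy-to", s3_uri,
        "-mappers", str(mappers),
    ]
    return shell_stmt, export_argv


if __name__ == "__main__":
    # Placeholder table and bucket; replace with your own values.
    stmt, argv = backup_commands(
        "webpage", "s3a://example-bucket/hbase-backups", datetime.date.today()
    )
    print(stmt)
    print(" ".join(argv))
```

The printed statement can be piped into `hbase shell` and the second line run as-is, for example from a daily cron job.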

fetching pdfs from our website

2017-08-08 Thread d.ku...@technisat.de
Hey, we are currently on Nutch 2.3.1 and use it to crawl our websites. One of our main goals is to get all the PDFs on our website crawled. Links on the different websites look like: https://assets0.mysite.com/asset/DB_product.pdf I tried different things: At the configurations I removed ever occur