Re: Nutch in production

2016-09-29 Thread Karanjeet Singh
. Please have a look and let me know if you have any questions. [0]: https://github.com/karanjeets/PCF-Nutch-on-Wrangler P.S. - There can be many solutions to this. I am just giving one. :) Regards, Karanjeet Singh http://irds.usc.edu On Thu, Sep 29, 2016 at 1:33 AM, Sachin Shaju <sa

Re: Error unknown protocol

2016-06-07 Thread Karanjeet Singh
Hi Nana, Maybe you can use the URL regex filter to exclude these. The following regular expression will allow only http(s) links to be crawled: +^http(s){0,1}://* Thanks & Regards, Karanjeet Singh USC On Mon, Jun 6, 2016 at 7:13 PM, Nana Pandiawan < nana.pandia...@solusi247.com.invalid
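
As a rough illustration of what the rule in the message above accepts, here is a minimal Python sketch. It assumes the leading "+" is the filter file's "accept" marker rather than part of the regex, and that the pattern is applied as a search from the start of the URL. The helper name is_allowed and the sample URLs are hypothetical, and this is not Nutch's own filter code.

```python
import re

# Pattern copied from the message above, minus the leading "+"
# (assumed here to be the "accept" marker of the filter file, not regex syntax).
ALLOW_HTTP = re.compile(r"^http(s){0,1}://*")


def is_allowed(url: str) -> bool:
    """Hypothetical helper: True if the URL matches the accept pattern."""
    return ALLOW_HTTP.search(url) is not None


if __name__ == "__main__":
    # Sample URLs are made up for illustration only.
    samples = [
        "http://example.com/",
        "https://example.com/page",
        "ftp://example.com/file",
        "mailto:user@example.com",
        "javascript:void(0)",
    ]
    for url in samples:
        print(url, is_allowed(url))
```

Note that the trailing "/*" means "zero or more slashes", so a string such as http:/x would also match. If that matters, ^https?:// is a tighter pattern with the same intent.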

Re: [ANNOUNCE] New Nutch committer and PMC - Karanjeet Singh

2016-05-23 Thread Karanjeet Singh
. I am excited to be a part of this community!!! Regards, Karanjeet Singh USC On Sun, May 22, 2016 at 12:51 PM, Sebastian Nagel < wastl.na...@googlemail.com> wrote: > Dear all, > > on behalf of the Nutch PMC it is my pleasure to announce > that Karanjeet Singh has joi

Re: Nutch generating less URLs for fetcher to fetch (running in Hadoop mode)

2016-04-14 Thread Karanjeet Singh
interested to know the reason for this. Is it due to politeness? [0]: https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L141 Regards, Karanjeet Singh USC On Thu, Apr 14, 2016 at 1:40 AM, Sebastian Nagel <wastl.na...@googlemail.com > wrote: >

Nutch generating less URLs for fetcher to fetch (running in Hadoop mode)

2016-04-13 Thread Karanjeet Singh
/segments -topN 1000 -numFetchers 1 -noFilter Can anyone please look into this and let me know if I am missing something. Please find the crawl configuration here [0]. [0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf Thanks & Regards, Karanjeet Singh USC

Re: Crawl Script Don't Want To Use -topn

2015-12-21 Thread Karanjeet Singh
Hi Manish, If you are referring to the links retrieved from a page, I recommend looking at the Nutch configuration properties "db.max.outlinks.per.page" and "db.max.inlinks". Hope it helps. Thanks & Regards, Karanjeet Singh CS Graduate Student Universit

Re: How to deploy Selenium on Server?

2015-12-21 Thread Karanjeet Singh
Hi Byzen, I hope you have installed all the required libraries (Firefox, Xvfb) for Selenium on your remote server. Can you please share your logs (${NUTCH_HOME}/logs/hadoop.log) so we can get some insight into this issue? Thanks & Regards, Karanjeet Singh CS Graduate Student University of Southern Califo

Re: Configuring rotating agent in Nutch

2015-09-27 Thread Karanjeet Singh
I am facing the same problem here. I tried rebuilding it, but in the logs I can only see the agent name set in the http.agent.name property. By $NUTCH_HOME/conf, do you mean the runtime/local/conf directory? Also, can you please brief me on how the rotation works? Does the agent rotate after crawling