Hi Chris/Team,

I am using Nutch 2.3.1 with MongoDB configured. Site: flipart.com
Even though I have added the whitelist property in my nutch-site.xml, I am not able to crawl. Please find the attached log. Please help me fix this issue.

With Regards,
Jyoti Aditya

On Tue, Dec 6, 2016 at 11:02 AM, Mattmann, Chris A (3010) <[email protected]> wrote:

> Hi Jyoti,
>
> I need a lot more detail than “it didn’t work”. What didn’t work about it?
> Do you have a log file? What site were you trying to crawl? What command
> did you use? Where is your nutch config? Were you running in distributed
> or local mode?
>
> Onto Selenium – have you tried it, or are you simply reading the docs and
> think it’s old? What have you done? What have you tried?
>
> I need a LOT more detail before I (and I’m guessing anyone else on these
> lists) can help.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Monday, December 5, 2016 at 9:29 PM
> *To:* "Mattmann, Chris A (3010)" <[email protected]>
> *Cc:* "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris/Team,
>
> Whitelisting the domain name didn’t work.
> And when I was trying to configure Selenium,
> it needs a headless browser to be integrated with.
> The documentation for the protocol-selenium plugin looks old; Firefox 11 is
> no longer supported as a headless browser with Selenium.
> So please help me out with the Selenium plugin configuration.
>
> I am also still not sure what result configuring the above will fetch me.
>
> With Regards,
> Jyoti Aditya
>
> On Tue, Dec 6, 2016 at 12:00 AM, Mattmann, Chris A (3010) <[email protected]> wrote:
>
> Hi Jyoti,
>
> Again, please keep [email protected] CC’ed, and also you may consider looking
> at this page:
>
> https://wiki.apache.org/nutch/AdvancedAjaxInteraction
>
> Cheers,
> Chris
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Monday, December 5, 2016 at 1:42 AM
> *To:* Chris Mattmann <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris,
>
> The whitelist didn’t work.
> And I was trying to configure Selenium with Nutch,
> but I am not sure what result will come from doing so.
> And also, it looks very clumsy to configure Selenium with Firefox.
>
> Regards,
> Jyoti Aditya
>
> On Fri, Dec 2, 2016 at 8:43 PM, Chris Mattmann <[email protected]> wrote:
>
> Hmm, I’m a little confused here. You were first trying to use the whitelist
> for robots.txt, and now you are talking about Selenium.
>
> 1. Did the whitelist work?
> 2. Are you now asking how to use Nutch and Selenium?
>
> Cheers,
> Chris
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Thursday, December 1, 2016 at 10:26 PM
> *To:* "Mattmann, Chris A (3010)" <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris,
>
> Thanks for the response. I added the changes as you mentioned above.
>
> But I am still not able to get all the content from a webpage.
> Can you please tell me whether I need to add some Selenium plugin to crawl
> the dynamic content available on a web page?
>
> I have a concern that these kinds of wiki pages are not directly accessible;
> there is no way we can reach these kinds of useful pages. So please do the
> needful regarding this.
>
> With Regards,
> Jyoti Aditya
>
> On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010) <[email protected]> wrote:
>
> There is a robots.txt whitelist. You can find documentation here:
>
> https://wiki.apache.org/nutch/WhiteListRobots
>
> On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:
>
> Sure, you can remove the check from the code and recompile.
>
> Under what circumstances would you need to ignore robots.txt? Would
> something like allowing access by particular IPs or user agents be an
> alternative?
>
> Tom
>
> On 29/11/16 04:07, jyoti aditya wrote:
>
> Hi team,
>
> Can we use Nutch to do impolite crawling?
> Or is there any way by which we can disobey robots.txt?
>
> With Regards,
> Jyoti Aditya

--
With Regards
Jyoti Aditya
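For reference, the whitelist discussed in the thread lives in conf/nutch-site.xml. A minimal sketch follows, assuming Nutch 2.3.x: the whitelist property name is the one documented on the WhiteListRobots wiki page linked above, the hostnames are illustrative (not the actual site in question), and the plugin.includes value is only an example of swapping in protocol-selenium for JavaScript-heavy pages.

```xml
<!-- conf/nutch-site.xml (sketch; hostnames are illustrative) -->

<!-- Hosts for which robots.txt rules are skipped. The host must match
     exactly as it appears in the fetched URLs, so bare and www. forms
     may both be needed. Use only on sites you are permitted to crawl. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <value>example.com,www.example.com</value>
</property>

<!-- If using the protocol-selenium plugin for dynamic content, it
     replaces protocol-http in plugin.includes (illustrative value): -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

Changes to nutch-site.xml in a source checkout only take effect after rebuilding (ant runtime) or editing the copy under runtime/local/conf.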
2016-12-07 17:05:52,851 INFO crawl.InjectorJob - InjectorJob: starting at 2016-12-07 17:05:52
2016-12-07 17:05:52,852 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: seed.txt
2016-12-07 17:05:53,200 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-12-07 17:05:53,905 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2016-12-07 17:05:54,454 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/staging/jyoti.aditya1771781124/.staging/job_local1771781124_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-07 17:05:54,456 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/staging/jyoti.aditya1771781124/.staging/job_local1771781124_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-07 17:05:54,565 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/local/localRunner/jyoti.aditya/job_local1771781124_0001/job_local1771781124_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-07 17:05:54,570 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/local/localRunner/jyoti.aditya/job_local1771781124_0001/job_local1771781124_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-07 17:05:55,627 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 1
2016-12-07 17:05:55,627 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 0
2016-12-07 17:05:55,628 INFO crawl.InjectorJob - Injector: finished at 2016-12-07 17:05:55, elapsed: 00:00:02
2016-12-07 17:05:56,746 INFO crawl.GeneratorJob - GeneratorJob: starting at 2016-12-07 17:05:56
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: starting
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: filtering: false
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: normalizing: false
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: topN: 50000
2016-12-07 17:05:56,947 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-12-07 17:05:56,959 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-07 17:05:56,960 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-07 17:05:56,960 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-07 17:05:58,144 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/staging/jyoti.aditya1498315833/.staging/job_local1498315833_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-07 17:05:58,146 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/staging/jyoti.aditya1498315833/.staging/job_local1498315833_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-07 17:05:58,235 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/local/localRunner/jyoti.aditya/job_local1498315833_0001/job_local1498315833_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-07 17:05:58,238 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/local/localRunner/jyoti.aditya/job_local1498315833_0001/job_local1498315833_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-07 17:05:58,504 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-07 17:05:58,504 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-07 17:05:58,504 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-07 17:05:59,291 INFO crawl.GeneratorJob - GeneratorJob: finished at 2016-12-07 17:05:59, time elapsed: 00:00:02
2016-12-07 17:05:59,291 INFO crawl.GeneratorJob - GeneratorJob: generated batch id: 1481110555-21430 containing 0 URLs
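Note what the log actually shows: the single seed URL was rejected by the URL filters ("urls rejected by filters: 1", "injected after normalization and filtering: 0"), so the generator had nothing to fetch ("containing 0 URLs"). This happens before robots.txt is ever consulted, so no whitelist setting can fix it; the seed has to pass conf/regex-urlfilter.txt first (and the seed line itself must be a full URL such as http://www.example.com/, not a bare domain). A minimal sketch of an accept rule, with an illustrative domain:

```
# conf/regex-urlfilter.txt (sketch; domain is illustrative)

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# accept everything under the target domain (http or https, any subdomain)
+^https?://([a-z0-9-]+\.)*example\.com/

# reject everything else
-.
```

Rules are applied top to bottom and the first match wins, so the accept line must come before the final "-." catch-all.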

