Hi Jyoti, in this case the answer is simple: the robots.txt whitelisting was never ported from 1.x to 2.x. :(
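For reference, in 1.x the whitelist is switched on via a single property in conf/nutch-site.xml (property name per the WhiteListRobots wiki page); a minimal sketch, with placeholder host names only:

```xml
<!-- conf/nutch-site.xml (Nutch 1.x only; not available in 2.x) -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- Comma-separated list of hosts whose robots.txt is ignored.
       example.com is a placeholder: only ever list hosts you own or
       have explicit permission to crawl impolitely. -->
  <value>example.com,www.example.com</value>
</property>
```

With the property unset (the default), robots.txt is always honored.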
Best,
Sebastian

On 12/07/2016 12:44 PM, jyoti aditya wrote:
> Hi Chris/Team,
>
> I am using Nutch 2.3.1 with MongoDB configured.
> Site: flipart.com
>
> Even though I have added the whitelist property in my nutch-site.xml,
> I am not able to crawl.
>
> Please find the attached log, and please help me fix this issue.
>
> With Regards,
> Jyoti Aditya
>
> On Tue, Dec 6, 2016 at 11:02 AM, Mattmann, Chris A (3010) <[email protected]> wrote:
>> Hi Jyoti,
>>
>> I need a lot more detail than "it didn't work". What didn't work about it?
>> Do you have a log file? What site were you trying to crawl? What command did
>> you use? Where is your Nutch config? Were you running in distributed or
>> local mode?
>>
>> On to Selenium: have you tried it, or have you simply read the docs and
>> think it's old? What have you done? What have you tried?
>>
>> I need a LOT more detail before I (and I'm guessing anyone else on these
>> lists) can help.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Principal Data Scientist, Engineering Administrative Office (3010)
>> Manager, Open Source Projects Formulation and Development Office (8212)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 180-503E, Mailstop: 180-503
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Director, Information Retrieval and Data Science Group (IRDS)
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> WWW: http://irds.usc.edu/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> *From:* jyoti aditya <[email protected]>
>> *Date:* Monday, December 5, 2016 at 9:29 PM
>> *To:* "Mattmann, Chris A (3010)" <[email protected]>
>> *Cc:* [email protected], [email protected]
>> *Subject:* Re: Impolite crawling using NUTCH
>>
>>> Hi Chris/Team,
>>>
>>> Whitelisting the domain name didn't work.
>>>
>>> When I was trying to configure Selenium, it needed a headless browser to
>>> be integrated with it. The documentation for the protocol-selenium plugin
>>> looks old: Firefox 11 is no longer supported as a headless browser with
>>> Selenium. So please help me with the Selenium plugin configuration.
>>>
>>> I am also not yet sure what result configuring the above will get me.
>>>
>>> With Regards,
>>> Jyoti Aditya
>>>
>>> On Tue, Dec 6, 2016 at 12:00 AM, Mattmann, Chris A (3010) <[email protected]> wrote:
>>>> Hi Jyoti,
>>>>
>>>> Again, please keep [email protected] CC'ed, and you may also consider
>>>> looking at this page:
>>>>
>>>> https://wiki.apache.org/nutch/AdvancedAjaxInteraction
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> *From:* jyoti aditya <[email protected]>
>>>> *Date:* Monday, December 5, 2016 at 1:42 AM
>>>> *To:* Chris Mattmann <[email protected]>
>>>> *Subject:* Re: Impolite crawling using NUTCH
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> The whitelist didn't work.
>>>>>
>>>>> I was trying to configure Selenium with Nutch, but I am not sure what
>>>>> result doing so will give. It also looks very clumsy to configure
>>>>> Selenium with Firefox.
>>>>>
>>>>> Regards,
>>>>> Jyoti Aditya
>>>>>
>>>>> On Fri, Dec 2, 2016 at 8:43 PM, Chris Mattmann <[email protected]> wrote:
>>>>>> Hmm, I'm a little confused here. You were first trying to use the
>>>>>> robots.txt whitelist, and now you are talking about Selenium.
>>>>>>
>>>>>> 1. Did the whitelist work?
>>>>>> 2. Are you now asking how to use Nutch and Selenium?
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>> *From:* jyoti aditya <[email protected]>
>>>>>> *Date:* Thursday, December 1, 2016 at 10:26 PM
>>>>>> *To:* "Mattmann, Chris A (3010)" <[email protected]>
>>>>>> *Subject:* Re: Impolite crawling using NUTCH
>>>>>>
>>>>>>> Hi Chris,
>>>>>>>
>>>>>>> Thanks for the response. I added the changes you mentioned above,
>>>>>>> but I am still not able to get all the content from a web page.
>>>>>>> Can you please tell me whether I need to add some Selenium plugin
>>>>>>> to crawl the dynamic content available on a web page?
>>>>>>>
>>>>>>> My concern is that these kinds of wiki pages are not directly
>>>>>>> accessible, and there is no way to reach such useful pages.
>>>>>>> So please do the needful regarding this.
>>>>>>>
>>>>>>> With Regards,
>>>>>>> Jyoti Aditya
>>>>>>>
>>>>>>> On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010) <[email protected]> wrote:
>>>>>>>> There is a robots.txt whitelist. You can find documentation here:
>>>>>>>>
>>>>>>>> https://wiki.apache.org/nutch/WhiteListRobots
>>>>>>>>
>>>>>>>> On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:
>>>>>>>>> Sure, you can remove the check from the code and recompile.
>>>>>>>>>
>>>>>>>>> Under what circumstances would you need to ignore robots.txt?
>>>>>>>>> Would something like allowing access by particular IPs or user
>>>>>>>>> agents be an alternative?
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> On 29/11/16 04:07, jyoti aditya wrote:
>>>>>>>>>> Hi team,
>>>>>>>>>>
>>>>>>>>>> Can we use Nutch to do impolite crawling? Or is there any way by
>>>>>>>>>> which we can disobey robots.txt?
>>>>>>>>>>
>>>>>>>>>> With Regards,
>>>>>>>>>> Jyoti Aditya
>
> --
> With Regards,
> Jyoti Aditya
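On the Selenium sub-thread above: in Nutch 1.x the protocol-selenium plugin is enabled by swapping it in for protocol-http in plugin.includes and choosing a driver in conf/nutch-site.xml. A minimal sketch, assuming a 1.x build that ships the plugin; the exact plugin list and the supported selenium.driver values vary by Nutch version, so check the plugin's README before relying on this:

```xml
<!-- conf/nutch-site.xml: fetch pages with Selenium instead of protocol-http -->
<property>
  <name>plugin.includes</name>
  <!-- Same default plugin chain, with protocol-selenium substituted for
       protocol-http; trim to the plugins your crawl actually needs. -->
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
<property>
  <name>selenium.driver</name>
  <!-- Which browser Selenium drives; firefox is the plugin's default.
       Other accepted values depend on your Nutch/Selenium versions. -->
  <value>firefox</value>
</property>
```

Note that a driver binary for the chosen browser must be installed on every fetcher node, which is part of what makes the setup feel clumsy.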

