Hi Chris/Team,

Whitelisting the domain name didn't work. And when I was trying to configure Selenium, it needs a headless browser to be integrated with. The documentation for the protocol-selenium plugin looks old; Firefox is no longer supported as a headless browser with Selenium in that setup. So please help me out with the Selenium plugin configuration.
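[Editor's note: for readers landing on this thread, the configuration being asked about lives in conf/nutch-site.xml. The fragment below is a minimal sketch assuming Nutch 1.x with the protocol-selenium plugin; the property names should be verified against your version's nutch-default.xml, and the choice of a headless driver (phantomjs here) is an assumption, not the thread's confirmed answer.]

```xml
<!-- nutch-site.xml: sketch only; check property names against your
     Nutch version's nutch-default.xml and the protocol-selenium docs -->
<property>
  <name>plugin.includes</name>
  <!-- swap protocol-http for protocol-selenium so fetches go through
       a real browser and execute JavaScript -->
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>selenium.driver</name>
  <!-- a headless driver; phantomjs is one option the plugin supported
       at the time, avoiding the outdated headless-Firefox setup -->
  <value>phantomjs</value>
</property>
```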
I am also not yet sure what result configuring the above will fetch me.

With Regards,
Jyoti Aditya

On Tue, Dec 6, 2016 at 12:00 AM, Mattmann, Chris A (3010) <[email protected]> wrote:

> Hi Jyoti,
>
> Again, please keep [email protected] CC'ed, and you may also consider looking
> at this page:
>
> https://wiki.apache.org/nutch/AdvancedAjaxInteraction
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Monday, December 5, 2016 at 1:42 AM
> *To:* Chris Mattmann <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris,
>
> The whitelist didn't work.
> And I was trying to configure Selenium with Nutch,
> but I am not sure what result doing so will produce.
> It also looks very clumsy to configure Selenium with Firefox.
>
> Regards,
> Jyoti Aditya
>
> On Fri, Dec 2, 2016 at 8:43 PM, Chris Mattmann <[email protected]> wrote:
>
> Hmm, I'm a little confused here. You were first trying to use the
> robots.txt whitelist, and now you are talking about Selenium.
>
> 1. Did the whitelist work?
> 2. Are you now asking how to use Nutch and Selenium?
> Cheers,
> Chris
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Thursday, December 1, 2016 at 10:26 PM
> *To:* "Mattmann, Chris A (3010)" <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris,
>
> Thanks for the response. I added the changes you mentioned above,
> but I am still not able to get all the content from a webpage.
> Can you please tell me whether I need to add a Selenium plugin to crawl
> dynamic content available on a web page?
>
> I have a concern that wiki pages like these are not directly accessible;
> there is no way to reach these kinds of useful pages. So please do the
> needful regarding this.
>
> With Regards,
> Jyoti Aditya
>
> On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010) <[email protected]> wrote:
>
> There is a robots.txt whitelist. You can find documentation here:
>
> https://wiki.apache.org/nutch/WhiteListRobots
>
> On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:
>
> Sure, you can remove the check from the code and recompile.
>
> Under what circumstances would you need to ignore robots.txt?
> Would something like allowing access by particular IPs or user agents be
> an alternative?
>
> Tom
>
> On 29/11/16 04:07, jyoti aditya wrote:
>
> Hi team,
>
> Can we use Nutch to do impolite crawling?
> Or is there any way by which we can disobey robots.txt?
>
> With Regards,
> Jyoti Aditya

--
With Regards
Jyoti Aditya
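[Editor's note: the whitelist mechanism referenced in this thread (https://wiki.apache.org/nutch/WhiteListRobots) is also set in conf/nutch-site.xml. The fragment below is a sketch; the property name and exact matching behavior should be checked against your Nutch version, and example.com is a placeholder. One common reason a whitelist "didn't work" is that entries must match hosts exactly rather than whole domains.]

```xml
<!-- nutch-site.xml: sketch of the robots.txt whitelist; verify the
     property name against your Nutch version's nutch-default.xml -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- comma-separated hostnames/IPs for which robots.txt rules are
       ignored; list each exact host, e.g. both apex and www forms -->
  <value>example.com,www.example.com</value>
</property>
```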

