Hi Chris/Team,

Whitelisting the domain name didn't work. And when I was trying to configure Selenium, it needs a headless browser to be integrated with. The documentation for the protocol-selenium plugin looks old; Firefox is no longer supported as a headless browser with Selenium in that setup. So please help me out with the Selenium plugin configuration.
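[Editor's note: for readers landing on this thread, the configuration being asked about lives in conf/nutch-site.xml. The fragment below is a minimal sketch assuming Nutch 1.x with the protocol-selenium plugin; the property names should be verified against your version's nutch-default.xml, and the choice of a headless driver (phantomjs here) is an assumption, not the thread's confirmed answer.]

```xml
<!-- nutch-site.xml: sketch only; check property names against your
     Nutch version's nutch-default.xml and the protocol-selenium docs -->
<property>
  <name>plugin.includes</name>
  <!-- swap protocol-http for protocol-selenium so fetches go through
       a real browser and execute JavaScript -->
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <name>selenium.driver</name>
  <!-- a headless driver; phantomjs is one option the plugin supported
       at the time, avoiding the outdated headless-Firefox setup -->
  <value>phantomjs</value>
</property>
```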
I am also not yet sure what result configuring the above will fetch me.

With Regards,
Jyoti Aditya

On Tue, Dec 6, 2016 at 12:00 AM, Mattmann, Chris A (3010) <[email protected]> wrote:

> Hi Jyoti,
>
> Again, please keep [email protected] CC'ed, and you may also consider looking
> at this page:
>
> https://wiki.apache.org/nutch/AdvancedAjaxInteraction
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Monday, December 5, 2016 at 1:42 AM
> *To:* Chris Mattmann <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris,
>
> The whitelist didn't work.
> And I was trying to configure Selenium with Nutch,
> but I am not sure what result doing so will produce.
> It also looks very clumsy to configure Selenium with Firefox.
>
> Regards,
> Jyoti Aditya
>
> On Fri, Dec 2, 2016 at 8:43 PM, Chris Mattmann <[email protected]> wrote:
>
> Hmm, I'm a little confused here. You were first trying to use the
> robots.txt whitelist, and now you are talking about Selenium.
>
> 1. Did the whitelist work?
> 2. Are you now asking how to use Nutch and Selenium?
> Cheers,
> Chris
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Thursday, December 1, 2016 at 10:26 PM
> *To:* "Mattmann, Chris A (3010)" <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris,
>
> Thanks for the response. I added the changes you mentioned above,
> but I am still not able to get all the content from a webpage.
> Can you please tell me whether I need to add a Selenium plugin to crawl
> dynamic content available on a web page?
>
> I have a concern that wiki pages like these are not directly accessible;
> there is no way to reach these kinds of useful pages. So please do the
> needful regarding this.
>
> With Regards,
> Jyoti Aditya
>
> On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010) <[email protected]> wrote:
>
> There is a robots.txt whitelist. You can find documentation here:
>
> https://wiki.apache.org/nutch/WhiteListRobots
>
> On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:
>
> Sure, you can remove the check from the code and recompile.
>
> Under what circumstances would you need to ignore robots.txt?
> Would something like allowing access by particular IPs or user agents be
> an alternative?
>
> Tom
>
> On 29/11/16 04:07, jyoti aditya wrote:
>
> Hi team,
>
> Can we use Nutch to do impolite crawling?
> Or is there any way by which we can disobey robots.txt?
>
> With Regards,
> Jyoti Aditya

--
With Regards
Jyoti Aditya
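[Editor's note: the whitelist mechanism referenced in this thread (https://wiki.apache.org/nutch/WhiteListRobots) is also set in conf/nutch-site.xml. The fragment below is a sketch; the property name and exact matching behavior should be checked against your Nutch version, and example.com is a placeholder. One common reason a whitelist "didn't work" is that entries must match hosts exactly rather than whole domains.]

```xml
<!-- nutch-site.xml: sketch of the robots.txt whitelist; verify the
     property name against your Nutch version's nutch-default.xml -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- comma-separated hostnames/IPs for which robots.txt rules are
       ignored; list each exact host, e.g. both apex and www forms -->
  <value>example.com,www.example.com</value>
</property>
```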

