Hi Jyoti,

In this case the answer is simple: the robots.txt whitelisting
was never ported from 1.x to 2.x ;(

Best,
Sebastian
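P.S. For reference, in 1.x the whitelist is enabled through a single property in nutch-site.xml (see the WhiteListRobots wiki page linked further down in the thread). A minimal sketch; the hostname value is only a placeholder:

```xml
<!-- nutch-site.xml (Nutch 1.x only; as noted above, never ported to 2.x) -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- Comma-separated list of hostnames or IP addresses for which
       robots.txt parsing is skipped. Use only on servers you operate
       or have explicit permission to crawl this way. -->
  <value>example.com</value>
</property>
```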


On 12/07/2016 12:44 PM, jyoti aditya wrote:
> Hi Chris/Team,
> 
> I am using Nutch 2.3.1 with MongoDB configured.
> Site: flipart.com
> 
> Even though I have added the whitelist property in my nutch-site.xml,
> I am not able to crawl.
> 
> Please find attached log.
> Please help me to fix this issue.
> 
> With Regards,
> Jyoti Aditya
> 
> On Tue, Dec 6, 2016 at 11:02 AM, Mattmann, Chris A (3010)
> <[email protected]> wrote:
> 
>     Hi Jyoti,
> 
>     I need a lot more detail than “it didn’t work”. What didn’t work about
>     it? Do you have a log file? What site were you trying to crawl? What
>     command did you use? Where is your Nutch config? Were you running in
>     distributed or local mode?
> 
>     Onto Selenium – have you actually tried it, or have you simply read
>     the docs and decided it’s old? What have you done? What have you
>     tried?
> 
>     I need a LOT more detail before I (and I’m guessing anyone else on
>     these lists) can help.
> 
>     Cheers,
>     Chris
> 
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     Chris Mattmann, Ph.D.
>     Principal Data Scientist, Engineering Administrative Office (3010)
>     Manager, Open Source Projects Formulation and Development Office (8212)
>     NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>     Office: 180-503E, Mailstop: 180-503
>     Email: [email protected]
>     WWW: http://sunset.usc.edu/~mattmann/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>     Director, Information Retrieval and Data Science Group (IRDS)
>     Adjunct Associate Professor, Computer Science Department
>     University of Southern California, Los Angeles, CA 90089 USA
>     WWW: http://irds.usc.edu/
>     ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
>     *From: *jyoti aditya <[email protected]>
>     *Date: *Monday, December 5, 2016 at 9:29 PM
>     *To: *"Mattmann, Chris A (3010)" <[email protected]>
>     *Cc: *"[email protected]" <[email protected]>,
>     "[email protected]" <[email protected]>
>     *Subject: *Re: Impolite crawling using NUTCH
> 
>     Hi Chris/Team,
> 
>     Whitelisting the domain name didn’t work.
>     And when I was trying to configure Selenium, it needs a headless
>     browser to be integrated with it.
>     The documentation for the protocol-selenium plugin looks old; Firefox
>     11 is no longer supported as a headless browser with Selenium.
>     So please help me with the Selenium plugin configuration.
> 
>     I am also not yet sure what result configuring the above will fetch
>     me.
> 
>     With Regards,
>     Jyoti Aditya
> 
>     On Tue, Dec 6, 2016 at 12:00 AM, Mattmann, Chris A (3010)
>     <[email protected]> wrote:
> 
>         Hi Jyoti,
> 
>         Again, please keep [email protected] CC’ed, and you may also
>         consider looking at this page:
> 
>         https://wiki.apache.org/nutch/AdvancedAjaxInteraction
> 
>         Cheers,
>         Chris
> 
> 
>         *From: *jyoti aditya <[email protected]>
>         *Date: *Monday, December 5, 2016 at 1:42 AM
>         *To: *Chris Mattmann <[email protected]>
>         *Subject: *Re: Impolite crawling using NUTCH
> 
>         Hi Chris,
> 
>         The whitelist didn’t work.
>         And I was trying to configure Selenium with Nutch,
>         but I am not sure what result will come from doing so.
>         It also looks very clumsy to configure Selenium with Firefox.
> 
>         Regards,
>         Jyoti Aditya
> 
>         On Fri, Dec 2, 2016 at 8:43 PM, Chris Mattmann
>         <[email protected]> wrote:
> 
>             Hmm, I’m a little confused here. You were first trying to use
>             the robots.txt white list, and now you are talking about
>             Selenium.
> 
>             1.  Did the white list work?
>             2.  Are you now asking how to use Nutch and Selenium?
> 
>             Cheers,
>             Chris
> 
>             *From: *jyoti aditya <[email protected]>
>             *Date: *Thursday, December 1, 2016 at 10:26 PM
>             *To: *"Mattmann, Chris A (3010)" <[email protected]>
>             *Subject: *Re: Impolite crawling using NUTCH
> 
>             Hi Chris,
> 
>             Thanks for the response.
>             I added the changes as you mentioned above.
> 
>             But I am still not able to get all the content from a webpage.
>             Can you please tell me whether I need to add a Selenium plugin
>             to crawl the dynamic content available on a web page?
> 
>             My concern is that these kinds of wiki pages are not directly
>             accessible; there is no way we can reach these useful pages.
>             So please help with this.
> 
>             With Regards,
>             Jyoti Aditya
> 
>             On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010)
>             <[email protected]> wrote:
> 
>                 There is a robots.txt whitelist. You can find documentation
>                 here:
> 
>                 https://wiki.apache.org/nutch/WhiteListRobots
> 
> 
> 
> 
>                 On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:
> 
>                     Sure, you can remove the check from the code and
>                     recompile.
> 
>                     Under what circumstances would you need to ignore
>                     robots.txt? Would something like allowing access by
>                     particular IPs or user agents be an alternative?
> 
>                     Tom
> 
> 
>                     On 29/11/16 04:07, jyoti aditya wrote:
>                     > Hi team,
>                     >
>                     > Can we use NUTCH to do impolite crawling?
>                     > Or is there any way by which we can disobey robots.txt?
>                     >
>                     >
>                     > With Regards
>                     > Jyoti Aditya
>                     >
>                     >
> 
> 
> 
