Hi Chris/Team,

I am using Nutch 2.3.1 with MongoDB configured. Site: flipart.com
Even though I have added the whitelist property in my nutch-site.xml, I am not able to crawl. Please find the attached log. Please help me fix this issue.

With Regards,
Jyoti Aditya

On Tue, Dec 6, 2016 at 11:02 AM, Mattmann, Chris A (3010) <[email protected]> wrote:

> Hi Jyoti,
>
> I need a lot more detail than “it didn’t work”. What didn’t work about it?
> Do you have a log file? What site were you trying to crawl? What command
> did you use? Where is your nutch config? Were you running in distributed
> or local mode?
>
> Onto Selenium – have you tried it, or are you simply reading the docs and
> think it’s old? What have you done? What have you tried?
>
> I need a LOT more detail before I (and I’m guessing anyone else on these
> lists) can help.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Principal Data Scientist, Engineering Administrative Office (3010)
> Manager, Open Source Projects Formulation and Development Office (8212)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 180-503E, Mailstop: 180-503
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Monday, December 5, 2016 at 9:29 PM
> *To:* "Mattmann, Chris A (3010)" <[email protected]>
> *Cc:* "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris/Team,
>
> Whitelisting the domain name didn’t work.
> And when I was trying to configure Selenium,
> it needs a headless browser to be integrated with.
> The documentation for the protocol-selenium plugin looks old; Firefox 11 is
> no longer supported as a headless browser with Selenium.
> So please help me out with the Selenium plugin configuration.
>
> I am also still not sure what result configuring the above will fetch me.
>
> With Regards,
> Jyoti Aditya
>
> On Tue, Dec 6, 2016 at 12:00 AM, Mattmann, Chris A (3010) <[email protected]> wrote:
>
> Hi Jyoti,
>
> Again, please keep [email protected] CC’ed, and also you may consider looking
> at this page:
>
> https://wiki.apache.org/nutch/AdvancedAjaxInteraction
>
> Cheers,
> Chris
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Monday, December 5, 2016 at 1:42 AM
> *To:* Chris Mattmann <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris,
>
> The whitelist didn’t work.
> And I was trying to configure Selenium with Nutch,
> but I am not sure what result will come from doing so.
> And also, it looks very clumsy to configure Selenium with Firefox.
>
> Regards,
> Jyoti Aditya
>
> On Fri, Dec 2, 2016 at 8:43 PM, Chris Mattmann <[email protected]> wrote:
>
> Hmm, I’m a little confused here. You were first trying to use the whitelist
> for robots.txt, and now you are talking about Selenium.
>
> 1. Did the whitelist work?
> 2. Are you now asking how to use Nutch and Selenium?
>
> Cheers,
> Chris
>
> *From:* jyoti aditya <[email protected]>
> *Date:* Thursday, December 1, 2016 at 10:26 PM
> *To:* "Mattmann, Chris A (3010)" <[email protected]>
> *Subject:* Re: Impolite crawling using NUTCH
>
> Hi Chris,
>
> Thanks for the response. I added the changes as you mentioned above.
>
> But I am still not able to get all the content from a webpage.
> Can you please tell me whether I need to add some Selenium plugin to crawl
> the dynamic content available on a web page?
>
> I have a concern that these kinds of wiki pages are not directly accessible;
> there is no way we can reach these kinds of useful pages. So please do the
> needful regarding this.
>
> With Regards,
> Jyoti Aditya
>
> On Tue, Nov 29, 2016 at 7:29 PM, Mattmann, Chris A (3010) <[email protected]> wrote:
>
> There is a robots.txt whitelist. You can find documentation here:
>
> https://wiki.apache.org/nutch/WhiteListRobots
>
> On 11/29/16, 8:57 AM, "Tom Chiverton" <[email protected]> wrote:
>
> Sure, you can remove the check from the code and recompile.
>
> Under what circumstances would you need to ignore robots.txt? Would
> something like allowing access by particular IPs or user agents be an
> alternative?
>
> Tom
>
> On 29/11/16 04:07, jyoti aditya wrote:
>
> Hi team,
>
> Can we use Nutch to do impolite crawling?
> Or is there any way by which we can disobey robots.txt?
>
> With Regards,
> Jyoti Aditya

--
With Regards
Jyoti Aditya
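For reference, the whitelist discussed in the thread lives in conf/nutch-site.xml. A minimal sketch follows, assuming Nutch 2.3.x: the whitelist property name is the one documented on the WhiteListRobots wiki page linked above, the hostnames are illustrative (not the actual site in question), and the plugin.includes value is only an example of swapping in protocol-selenium for JavaScript-heavy pages.

```xml
<!-- conf/nutch-site.xml (sketch; hostnames are illustrative) -->

<!-- Hosts for which robots.txt rules are skipped. The host must match
     exactly as it appears in the fetched URLs, so bare and www. forms
     may both be needed. Use only on sites you are permitted to crawl. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <value>example.com,www.example.com</value>
</property>

<!-- If using the protocol-selenium plugin for dynamic content, it
     replaces protocol-http in plugin.includes (illustrative value): -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```

Changes to nutch-site.xml in a source checkout only take effect after rebuilding (ant runtime) or editing the copy under runtime/local/conf.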
2016-12-07 17:05:52,851 INFO crawl.InjectorJob - InjectorJob: starting at 2016-12-07 17:05:52
2016-12-07 17:05:52,852 INFO crawl.InjectorJob - InjectorJob: Injecting urlDir: seed.txt
2016-12-07 17:05:53,200 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-12-07 17:05:53,905 INFO crawl.InjectorJob - InjectorJob: Using class org.apache.gora.mongodb.store.MongoStore as the Gora storage class.
2016-12-07 17:05:54,454 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/staging/jyoti.aditya1771781124/.staging/job_local1771781124_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-07 17:05:54,456 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/staging/jyoti.aditya1771781124/.staging/job_local1771781124_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-07 17:05:54,565 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/local/localRunner/jyoti.aditya/job_local1771781124_0001/job_local1771781124_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-07 17:05:54,570 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/local/localRunner/jyoti.aditya/job_local1771781124_0001/job_local1771781124_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-07 17:05:55,627 INFO crawl.InjectorJob - InjectorJob: total number of urls rejected by filters: 1
2016-12-07 17:05:55,627 INFO crawl.InjectorJob - InjectorJob: total number of urls injected after normalization and filtering: 0
2016-12-07 17:05:55,628 INFO crawl.InjectorJob - Injector: finished at 2016-12-07 17:05:55, elapsed: 00:00:02
2016-12-07 17:05:56,746 INFO crawl.GeneratorJob - GeneratorJob: starting at 2016-12-07 17:05:56
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: Selecting best-scoring urls due for fetch.
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: starting
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: filtering: false
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: normalizing: false
2016-12-07 17:05:56,747 INFO crawl.GeneratorJob - GeneratorJob: topN: 50000
2016-12-07 17:05:56,947 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-12-07 17:05:56,959 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-07 17:05:56,960 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-07 17:05:56,960 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-07 17:05:58,144 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/staging/jyoti.aditya1498315833/.staging/job_local1498315833_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-07 17:05:58,146 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/staging/jyoti.aditya1498315833/.staging/job_local1498315833_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-07 17:05:58,235 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/local/localRunner/jyoti.aditya/job_local1498315833_0001/job_local1498315833_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-07 17:05:58,238 WARN conf.Configuration - file:/tmp/hadoop-jyoti.aditya/mapred/local/localRunner/jyoti.aditya/job_local1498315833_0001/job_local1498315833_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-07 17:05:58,504 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2016-12-07 17:05:58,504 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2016-12-07 17:05:58,504 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2016-12-07 17:05:59,291 INFO crawl.GeneratorJob - GeneratorJob: finished at 2016-12-07 17:05:59, time elapsed: 00:00:02
2016-12-07 17:05:59,291 INFO crawl.GeneratorJob - GeneratorJob: generated batch id: 1481110555-21430 containing 0 URLs
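Note what the log actually shows: the single seed URL was rejected by the URL filters ("urls rejected by filters: 1", "injected after normalization and filtering: 0"), so the generator had nothing to fetch ("containing 0 URLs"). This happens before robots.txt is ever consulted, so no whitelist setting can fix it; the seed has to pass conf/regex-urlfilter.txt first (and the seed line itself must be a full URL such as http://www.example.com/, not a bare domain). A minimal sketch of an accept rule, with an illustrative domain:

```
# conf/regex-urlfilter.txt (sketch; domain is illustrative)

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# accept everything under the target domain (http or https, any subdomain)
+^https?://([a-z0-9-]+\.)*example\.com/

# reject everything else
-.
```

Rules are applied top to bottom and the first match wins, so the accept line must come before the final "-." catch-all.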

