Hi,

> Nutch 2.4 with selenium

Nutch 2.4 does not include any plugin to use Selenium. In addition, 2.4 is for
now the last release on the 2.x branch, which is no longer maintained.
You should use 1.x (1.17 is the most recent release).

> standalone nutch crawling with selenium.

For 1.x there's a good README on how to set up protocol-selenium:
  
https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md

In general, the tutorial is the recommended way to start:
  https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
Please try to get it running without Selenium first. It's important to
understand how Nutch works before you start with the clearly more complex
Selenium-based crawling.
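
On the NullPointerException in the quoted report below: the trace ends in
java.io.Reader.<init>, called from the BufferedReader constructor, which is
what you get when a null Reader is passed to BufferedReader. Here is a minimal
Java sketch of that mechanism. It assumes (not confirmed in this thread) that
the lookup of regex-urlfilter.txt returned null because the file was not found.
NullReaderDemo, openRules and describe are hypothetical names for illustration,
not Nutch code.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;

public class NullReaderDemo {

    // Hypothetical stand-in for a config lookup: returns null when the
    // rules file is "missing" (contents == null), otherwise a Reader.
    static Reader openRules(String contents) {
        return contents == null ? null : new StringReader(contents);
    }

    // Wraps the reader the same way the stack trace shows:
    // BufferedReader.<init> -> Reader.<init>, which rejects null.
    static String describe(Reader reader) {
        try {
            new BufferedReader(reader);
            return "ok";
        } catch (NullPointerException e) {
            return "NullPointerException";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(openRules("+.")));  // rules found
        System.out.println(describe(openRules(null)));  // rules missing
    }
}
```

If this is what happens in your setup, it may be worth checking that the conf
directory is on the classpath of the job and not only present in the working
directory. As far as I know, Hadoop's Configuration resolves such files via the
classpath.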

Best,
Sebastian

On 10/7/20 2:49 PM, Gajalakshmi G wrote:
> Hi,
> 
> Thanks for the response, the 'conf/regex-urlfilter.txt' file was available 
> inside the current working directory.
> 
> Please guide me or share useful links on standalone Nutch crawling with
> Selenium.
> 
> 
> 
> Thanks & Regards,
> 
> Gajalakshmi.G
> 
> Assistant Consultant
> 
> Tata Consultancy Services
> Mailto: gajalakshm...@tcs.com
> 
> ________________________________
> From: Shashanka Balakuntala <shbalakunt...@gmail.com>
> Sent: Wednesday, October 7, 2020 5:49 PM
> To: user@nutch.apache.org <user@nutch.apache.org>
> Subject: Re: Nutch 2.4 with selenium
> 
> Hi Gajalakshmi,
> 
> The NPE can be thrown when the file is not found on disk. So check whether
> the file conf/regex-urlfilter.txt exists in the working (current) directory.
> 
> 
> *Regards*
>   Shashanka Balakuntala Srinivasa
> 
> 
> 
> On Wed, Oct 7, 2020 at 2:09 PM Gajalakshmi G <gajalakshm...@tcs.com.invalid>
> wrote:
> 
>> Hi all,
>>
>> I am trying to crawl a dynamic webpage using Nutch 2.4 with Selenium 3.6.0
>> and Firefox version 79. I am getting the error below in the injector job
>> itself.
>>
>> java.lang.Exception: java.lang.NullPointerException
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
>> Caused by: java.lang.NullPointerException
>>     at java.io.Reader.<init>(Reader.java:78)
>>     at java.io.BufferedReader.<init>(BufferedReader.java:101)
>>     at java.io.BufferedReader.<init>(BufferedReader.java:116)
>>     at org.apache.nutch.urlfilter.api.RegexURLFilterBase.readRules(RegexURLFilterBase.java:199)
>>     at org.apache.nutch.urlfilter.api.RegexURLFilterBase.setConf(RegexURLFilterBase.java:171)
>>     at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
>>     at org.apache.nutch.net.URLFilters.<init>(URLFilters.java:62)
>>     at org.apache.nutch.crawl.InjectorJob$UrlMapper.setup(InjectorJob.java:113)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>>
>> Please guide me on resolving this issue.
>>
>>
>>
>> Thanks & Regards,
>>
>> Gajalakshmi.G
>>
>> Assistant Consultant
>>
>> Tata Consultancy Services
>> Mailto: gajalakshm...@tcs.com
>>
>>
>>
> 
