Hi Sebastian,

Thanks, it works after removing "protocol-httpclient", but there are now some 
compatibility issues between Nutch 1.x, Selenium v3.4.0 and Geckodriver 0.23.0.
Here is the exception:
Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
        at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
        at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
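
A cast failure of this shape (an implementation class that cannot be cast to the interface it extends) usually suggests that javax.xml.parsers.DocumentBuilderFactory is visible through two different class loaders, for example once from the JDK and once from an XML API jar on a plugin classpath. The following is only a diagnostic sketch, not part of Nutch or Selenium; the class name XmlFactoryCheck and the helper describe are made up for illustration. It prints which class loader and code location provide the factory API and its implementation:

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical helper, not part of Nutch or Selenium.
class XmlFactoryCheck {

    // Describe which class loader defined a class and, when known, where it was loaded from.
    static String describe(Class<?> c) {
        ClassLoader loader = c.getClassLoader();
        CodeSource source = c.getProtectionDomain().getCodeSource();
        String location = (source == null || source.getLocation() == null)
                ? "no code source reported"
                : source.getLocation().toString();
        String loaderName = (loader == null) ? "bootstrap loader" : loader.toString();
        return c.getName() + " <- " + loaderName + " (" + location + ")";
    }

    public static void main(String[] args) {
        // The API class that the failing cast targets.
        System.out.println(describe(DocumentBuilderFactory.class));
        // The implementation that newInstance() selects on this classpath.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        System.out.println(describe(factory.getClass()));
    }
}
```

Running it with the same classpath as the failing component shows whether the two lines point at different loaders or jars, which would be consistent with the exception above.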

Thanks & Regards
Venkata MR
+91 98455 77125


-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID> 
Sent: 17 December 2018 14:57
To: user@nutch.apache.org
Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi,

> protocol-httpclient (as the websites are with https).

With Nutch 1.15, protocol-selenium supports https. If protocol-httpclient is 
also active, it may be used instead of protocol-selenium. There is no need to 
activate it; the description in nutch-default.xml needs to be fixed, see 
https://issues.apache.org/jira/browse/NUTCH-2678
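
To check which protocol plugins a given plugin.includes value would activate, a small sketch like the one below can help. It assumes that plugin.includes is a regular expression that must match a plugin id in full; the class name PluginIncludesCheck, the helper activeIds and the sample regex values are illustrative only and are not taken from nutch-site.xml or from the Nutch code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
class PluginIncludesCheck {

    // Return the candidate plugin ids that the includes regex matches in full.
    static List<String> activeIds(String includesRegex, List<String> candidateIds) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> active = new ArrayList<>();
        for (String id : candidateIds) {
            if (includes.matcher(id).matches()) {
                active.add(id);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        List<String> protocols = Arrays.asList(
                "protocol-http", "protocol-httpclient",
                "protocol-selenium", "protocol-interactiveselenium");

        // Sample value with both plugins listed: both protocol implementations are active.
        String both = "protocol-(httpclient|selenium)|parse-(html|tika)|indexer-solr";
        System.out.println(activeIds(both, protocols));

        // Sample value with only protocol-selenium listed.
        String seleniumOnly = "protocol-selenium|parse-(html|tika)|indexer-solr";
        System.out.println(activeIds(seleniumOnly, protocols));
    }
}
```

With the first sample value both protocol-httpclient and protocol-selenium match, which is the situation described above where protocol-httpclient may be picked for https URLs.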

Note that protocol-interactiveselenium will support https in 1.16.

Best,
Sebastian

On 12/16/18 1:40 PM, Venkata MR wrote:
> Hi Lewis,
> 
> Thanks for your email. I tried all the options described in the links you 
> provided, without success, before reaching out to you again.
> 
> I am trying to crawl websites whose content is rendered at runtime, and to 
> extract and parse that content.
> I downloaded the Nutch version mentioned in the email below and added 
> protocol-interactiveselenium, protocol-selenium along with 
> protocol-httpclient (as the websites are with https).
> Selenium with Firefox is configured and working properly, and Selenium is 
> running during the crawl.
> 
> Yet I am not able to get the rendered content. I have attached my 
> nutch-site.xml for reference, in case any configuration is missing.
> 
> I am wondering whether the issue is that the Tika parsers cannot parse the 
> extracted runtime-rendered content, or that Solr (I am using Apache Solr to 
> index the parsed data) has no indexed field to represent the data (the 
> schema has content, url, title and id).
> 
> Any input to resolve the issue would be really appreciated.
> 
> Environment: CentOS-7
> Firefox: 60.3.0esr (64 bit)
> Selenium : v3.4.0
> Geckodriver: 0.23.0 (2018-10-04)
> Apache Nutch: 1.x
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> -----Original Message-----
> From: Lewis John McGibbney <lewi...@apache.org>
> Sent: 09 December 2018 02:11
> To: user@nutch.apache.org
> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by 
> ajax
> 
> Hi Venkata,
> This functionality is not available in 2.X at the moment.
> The functionality is available in the 1.x primary branch. You can 
> learn about the implementation at both
> 
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium, and
> 
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
> 
> Lewis
> 
> On 2018/12/07 07:03:30, Venkata MR <venkata...@hcl.com> wrote: 
>> Hi,
>>
>> I was trying to fetch content rendered by an ajax call using Apache Nutch 
>> 2.3.1.
>> It seems it is not able to get the actual rendered content; it only gets the 
>> view-source page (as part of the protocol-js plugin).
>> Has anyone been able to fetch the rendered content from an ajax call using 
>> Nutch 2.3.1, or does anyone have any suggestions?
>>
>> Thanks & Regards
>> Venkata MR
>> +91 98455 77125