Hi All,

Any input here would be really appreciated. Thanks again.

Thanks & Regards
Venkata MR
+91 98455 77125

-----Original Message-----
From: Venkata MR 
Sent: 18 December 2018 16:40
To: 'Sebastian Nagel' <wastl.na...@googlemail.com>
Cc: user@nutch.apache.org
Subject: RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

+user@nutch.apache.org

Thanks & Regards
Venkata MR
+91 98455 77125


-----Original Message-----
From: Venkata MR
Sent: 18 December 2018 16:05
To: 'Sebastian Nagel' <wastl.na...@googlemail.com>
Subject: RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi Sebastian,

I switched to Selenium v2.48.2 and Firefox 31.4.0 as specified, but I get the
same ClassCastException.
Please find the log details below.

Caused by: org.openqa.selenium.WebDriverException: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
Build info: version: '2.48.2', revision: '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
Driver info: driver.version: FirefoxDriver
        at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
        at org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
        at org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
        at org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
        at org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
        at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
        ... 12 more
Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
        at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
        at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
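
A cast failure of this shape (a Xerces factory that "cannot be cast" to the
JAXP type it extends) typically means the JAXP API class is visible through
two different class loaders, for example once from the JDK and once from a
bundled xml-apis or Xerces jar. A minimal diagnostic sketch, assuming it is
compiled and run with the same classpath as the failing job, prints where the
API class and the selected implementation come from. The class name
XmlFactoryOrigin and the helper originOf are hypothetical and not part of
Nutch or Selenium.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic helper, not part of Nutch or Selenium.
final class XmlFactoryOrigin {

    // Returns the jar or directory a class was loaded from.
    // Classes loaded by the bootstrap loader have no CodeSource.
    static String originOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "bootstrap (JDK)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) {
        // The API type that the failing cast targets.
        System.out.println("API loaded from:  "
                + originOf(DocumentBuilderFactory.class));

        // The implementation that JAXP selects on this classpath.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        System.out.println("Implementation:   " + factory.getClass().getName());
        System.out.println("Impl loaded from: " + originOf(factory.getClass()));
    }
}
```

Nutch loads plugins through its own class loaders, so a standalone run may
not reproduce the failure; the output only helps to spot jars on the
classpath that ship their own copy of the XML parser classes.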

Thanks & Regards
Venkata MR
+91 98455 77125

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com>
Sent: 17 December 2018 19:53
To: Venkata MR <venkata...@hcl.com>
Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi,

what happens if you use the same version of Selenium as Nutch 1.15 does
(2.48.2), or at least a "close" version?

Alternatively, you can try to upgrade the Selenium version in Nutch, but that's 
not trivial and requires changes in multiple files.

Best,
Sebastian


On 12/17/18 12:07 PM, Venkata MR wrote:
> Hi Sebastian,
> 
> Thanks, it is working after removing "protocol-httpclient", but there are
> some compatibility issues between Nutch 1.x, Selenium v3.4.0 and
> Geckodriver 0.23.0.
> Here is the exception:
> Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
>       at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>       at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> 
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
> Sent: 17 December 2018 14:57
> To: user@nutch.apache.org
> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
> 
> Hi,
> 
>> protocol-httpclient (as the websites are with https).
> 
> With Nutch 1.15 protocol-selenium supports https. If 
> protocol-httpclient is also active, it may be used instead of 
> protocol-selenium. There is no need to activate it, the description in 
> nutch-default.xml needs to be fixed, see 
> https://issues.apache.org/jira/browse/NUTCH-2678
> 
> Note that protocol-interactiveselenium will support https in 1.16.
> 
> Best,
> Sebastian
> 
> On 12/16/18 1:40 PM, Venkata MR wrote:
>> Hi Lewis,
>>
>> Thanks for your email. I tried all the options described in the links you
>> provided, without success, before reaching out to you again.
>>
>> I am trying to crawl websites whose content is rendered at runtime, in
>> order to extract and parse it.
>> I downloaded the Nutch version mentioned in the email below and added
>> protocol-interactiveselenium and protocol-selenium along with
>> protocol-httpclient (as the websites use https).
>> Selenium with Firefox is configured and working properly, and Selenium is
>> running during the crawl.
>>
>> Yet I am not able to get the rendered content. I have attached the
>> nutch-site.xml for reference, in case any configuration is missing.
>>
>> I am wondering whether the issue is that the Tika parsers are not able to
>> parse the extracted runtime-rendered content, or that Solr (I am using
>> Apache Solr to index the parsed data) has no indexed field to represent
>> the data (the schema has content, url, title and id).
>>
>> Any input to help resolve the issue would be really appreciated.
>>
>> Environment: CentOS-7
>> Firefox: 60.3.0esr (64-bit)
>> Selenium: v3.4.0
>> Geckodriver: 0.23.0 (2018-10-04)
>> Apache Nutch: 1.x
>>
>> Thanks & Regards
>> Venkata MR
>> +91 98455 77125
>>
>> -----Original Message-----
>> From: Lewis John McGibbney <lewi...@apache.org>
>> Sent: 09 December 2018 02:11
>> To: user@nutch.apache.org
>> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
>>
>> Hi Venkata,
>> This functionality is not available in 2.X at the moment.
>> The functionality is available in the 1.x primary branch. You can 
>> learn about the implementation at both
>>
>> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
>> and
>> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
>>
>> Lewis
>>
>> On 2018/12/07 07:03:30, Venkata MR <venkata...@hcl.com> wrote: 
>>> Hi,
>>>
>>> I was trying to fetch content rendered by an Ajax call using Apache
>>> Nutch 2.3.1.
>>> It seems it is not able to get the actual rendered content; it only gets
>>> the page source (as part of the protocol-js plugin).
>>> Has anyone been able to fetch the rendered content from an Ajax call
>>> using Nutch 2.3.1, or does anyone have any suggestions?
>>>
>>> Thanks & Regards
>>> Venkata MR
>>> +91 98455 77125
>>>
>>>
> 
