Hi Sebastian,

Please find the link to the issue: https://issues.apache.org/jira/browse/NUTCH-2681

Thanks & Regards
Venkata MR
+91 98455 77125


-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com> 
Sent: 21 December 2018 19:19
To: user@nutch.apache.org
Cc: Venkata MR <venkata...@hcl.com>
Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi,

Sorry for the late reply. This looks like one of those really nasty dependency 
conflicts between incompatible class implementations or versions, which only 
show up at runtime.

These are the potentially conflicting candidates (from current master):

runtime/local/plugins/lib-selenium/xml-apis-1.4.01.jar
     3505  2009-12-09 13:02   javax/xml/parsers/DocumentBuilderFactory.class

runtime/local/plugins/lib-selenium/xercesImpl-2.11.0.jar
       51  2010-11-26 15:37   META-INF/services/javax.xml.parsers.DocumentBuilderFactory
     4546  2010-11-26 15:40   org/apache/xerces/jaxp/DocumentBuilderFactoryImpl.class

runtime/local/lib/xml-apis-1.4.01.jar
     3505  2009-12-09 13:02   javax/xml/parsers/DocumentBuilderFactory.class

runtime/local/lib/xercesImpl-2.11.0.jar
       51  2010-11-26 15:37   META-INF/services/javax.xml.parsers.DocumentBuilderFactory
     4546  2010-11-26 15:40   org/apache/xerces/jaxp/DocumentBuilderFactoryImpl.class

runtime/local/lib/xmlParserAPIs-2.6.2.jar
     2067  2003-11-18 15:19   javax/xml/parsers/DocumentBuilderFactory.class
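
A listing like the one above can be reproduced with a small scanner that reports every jar under a directory that ships a given class file. This is only a sketch: the class name FindClassInJars and the method jarsContaining are made up for illustration, and the default root runtime/local is assumed from the paths shown above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: lists jars that contain a given entry.
public class FindClassInJars {

    /** Returns every .jar file under root that contains entryName, sorted by path. */
    static List<Path> jarsContaining(Path root, String entryName) throws IOException {
        List<Path> jars;
        try (Stream<Path> paths = Files.walk(root)) {
            jars = paths.filter(Files::isRegularFile)
                        .filter(p -> p.toString().endsWith(".jar"))
                        .sorted()
                        .collect(Collectors.toList());
        }
        List<Path> hits = new ArrayList<>();
        for (Path jar : jars) {
            // JarFile.getEntry returns null when the entry is absent.
            try (JarFile jf = new JarFile(jar.toFile())) {
                if (jf.getEntry(entryName) != null) {
                    hits.add(jar);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Defaults are assumptions taken from the listing in this thread.
        Path root = Paths.get(args.length > 0 ? args[0] : "runtime/local");
        String entry = args.length > 1
                ? args[1]
                : "javax/xml/parsers/DocumentBuilderFactory.class";
        for (Path jar : jarsContaining(root, entry)) {
            System.out.println(jar);
        }
    }
}
```

Every jar printed for javax/xml/parsers/DocumentBuilderFactory.class carries its own copy of the JAXP API class and is therefore a candidate for the conflict.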

I have no better idea than trial and error, e.g., remove
   .../lib/xmlParserAPIs-2.6.2.jar
or, equivalently, delete this line in ivy/ivy.xml:
   <dependency org="xerces" name="xmlParserAPIs" rev="2.6.2" />
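
When trying this, it may help to check where the JVM actually loads the JAXP API class and its implementation from, before and after removing a jar. The following is a minimal sketch, assuming it is run with the same classpath as the failing code; WhereIsFactory and origin are hypothetical names. Nutch loads plugins through their own class loaders, so the result seen inside a plugin may differ from what a standalone run prints.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic, not part of Nutch or Selenium.
public class WhereIsFactory {

    /** Describes where a class was loaded from; JDK bootstrap classes have no CodeSource. */
    static String origin(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return src == null ? "(bootstrap / JDK runtime)" : String.valueOf(src.getLocation());
    }

    public static void main(String[] args) {
        // The abstract API class: ideally only one copy is visible.
        System.out.println("API class:  " + origin(DocumentBuilderFactory.class));

        // The concrete factory chosen via the JAXP lookup (system property, service file, default).
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        System.out.println("impl class: " + factory.getClass().getName()
                + " from " + origin(factory.getClass()));
    }
}
```

If the API class and the implementation come from different jars or different class loaders, that would be consistent with the ClassCastException reported in this thread.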

Sorry, but I have no time now or over the next two weeks to try to find a fix or 
workaround.

Please also open an issue to fix this on
    
https://issues.apache.org/jira/projects/NUTCH

Thanks,
Sebastian

On 12/19/18 5:50 AM, Venkata MR wrote:
> Hi All,
> 
> Any input here would be really appreciated. Thanks again.
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> -----Original Message-----
> From: Venkata MR
> Sent: 18 December 2018 16:40
> To: 'Sebastian Nagel' <wastl.na...@googlemail.com>
> Cc: user@nutch.apache.org
> Subject: RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
> 
> +user@nutch.apache.org
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> 
> -----Original Message-----
> From: Venkata MR
> Sent: 18 December 2018 16:05
> To: 'Sebastian Nagel' <wastl.na...@googlemail.com>
> Subject: RE: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
> 
> Hi Sebastian,
> 
> I went with Selenium v2.48.2 and Firefox 31.4.0 as specified. It is the same 
> class cast exception.
> Please find the log details below.
> 
> Caused by: org.openqa.selenium.WebDriverException: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
> Build info: version: '2.48.2', revision: '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
> System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
> Driver info: driver.version: FirefoxDriver
>       at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
>       at org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
>       at org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
>       at org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
>       at org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
>       at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
>       ... 12 more
> Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
>       at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>       at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> 
> Thanks & Regards
> Venkata MR
> +91 98455 77125
> 
> -----Original Message-----
> From: Sebastian Nagel <wastl.na...@googlemail.com>
> Sent: 17 December 2018 19:53
> To: Venkata MR <venkata...@hcl.com>
> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
> 
> Hi,
> 
> what happens if you use the same version of Selenium as Nutch 1.15 does (2.48.2)?
> Or at least a "close" version?
> 
> Alternatively, you can try to upgrade the Selenium version in Nutch, but 
> that's not trivial and requires changes in multiple files.
> 
> Best,
> Sebastian
> 
> 
> On 12/17/18 12:07 PM, Venkata MR wrote:
>> Hi Sebastian,
>>
>> Thanks, it works after removing "protocol-httpclient", but there are some 
>> compatibility issues between Nutch 1.x, Selenium v3.4.0 and Geckodriver 
>> 0.23.0.
>> Here is the exception:
>> Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
>>      at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>>      at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
>>
>> Thanks & Regards
>> Venkata MR
>> +91 98455 77125
>>
>>
>> -----Original Message-----
>> From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
>> Sent: 17 December 2018 14:57
>> To: user@nutch.apache.org
>> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
>>
>> Hi,
>>
>>> protocol-httpclient (as the websites are with https).
>>
>> With Nutch 1.15, protocol-selenium supports https. If 
>> protocol-httpclient is also active, it may be used instead of 
>> protocol-selenium. There is no need to activate it; the description 
>> in nutch-default.xml needs to be fixed, see 
>> https://issues.apache.org/jira/browse/NUTCH-2678
>>
>> Note that protocol-interactiveselenium will support https in 1.16.
>>
>> Best,
>> Sebastian
>>
>> On 12/16/18 1:40 PM, Venkata MR wrote:
>>> Hi Lewis,
>>>
>>> Thanks for your email. I tried all the options from the links you 
>>> provided, with no success, before reaching out to you again.
>>>
>>> I am trying to crawl websites whose content is rendered at runtime, and 
>>> to extract and parse that content.
>>> I downloaded the Nutch version referenced in the email below and added 
>>> protocol-interactiveselenium and protocol-selenium along with 
>>> protocol-httpclient (as the websites use https).
>>> Selenium with Firefox is configured and working properly, and Selenium 
>>> is running during the crawl.
>>>
>>> Yet I am not able to get the rendered content. I have attached the 
>>> nutch-site.xml for reference, in case any configuration is missing.
>>>
>>> I am wondering whether the issue is that the Tika parsers are not able to 
>>> parse the extracted runtime-rendered content, or that Solr (I am using 
>>> Apache Solr to index the parsed data) has no indexed field to represent 
>>> the data (the schema has content, url, title and id).
>>>
>>> Any input to resolve the issue would be really appreciated.
>>>
>>> Environment: CentOS-7
>>> Firefox: 60.3 oesr (64 bit)
>>> Selenium : v3.4.0
>>> Geckodriver: 0.23.0 ( 2018-10-04)
>>> Apache Nutch: 1.x
>>>
>>> Thanks & Regards
>>> Venkata MR
>>> +91 98455 77125
>>>
>>> -----Original Message-----
>>> From: Lewis John McGibbney <lewi...@apache.org>
>>> Sent: 09 December 2018 02:11
>>> To: user@nutch.apache.org
>>> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
>>>
>>> Hi Venkata,
>>> This functionality is not available in 2.X at the moment.
>>> The functionality is available in the 1.x primary branch. You can 
>>> learn about the implementation at both
>>>
>>> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
>>> and
>>> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
>>>
>>> Lewis
>>>
>>> On 2018/12/07 07:03:30, Venkata MR <venkata...@hcl.com> wrote: 
>>>> Hi,
>>>>
>>>> I was trying to fetch content rendered by an Ajax call using Apache Nutch 
>>>> 2.3.1.
>>>> It seems it is not able to get the actual rendered content; it only gets 
>>>> the page source (as part of the protocol-js plugin).
>>>> Has anyone been able to fetch the rendered content from an Ajax call using 
>>>> Nutch 2.3.1, or does anyone have any suggestions?
>>>>
>>>> Thanks & Regards
>>>> Venkata MR
>>>> +91 98455 77125
>>>>
>>>>
>>
> 
