Hi Sebastian,

Thanks, it works after removing "protocol-httpclient", but there now seem to be compatibility issues between Nutch 1.x, Selenium v3.4.0 and Geckodriver 0.23.0. Here is the exception:

Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
    at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
    at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
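
A ClassCastException of this shape (an implementation that "cannot be cast to" the very class it extends) often means that javax.xml.parsers.DocumentBuilderFactory is visible from two different class loaders, for example when an extra XML API jar sits next to a plugin. That is an assumption here, not something confirmed in this thread. A minimal Java sketch to check it follows; the class name XmlFactoryCheck and the helper describe are made up for illustration, and the output is only meaningful if the code runs with the same classpath and class loader as the failing plugin.

```java
import java.security.CodeSource;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic helper, not part of Nutch or Selenium.
class XmlFactoryCheck {

    // Describe which class loader and which jar (if any) a class came from.
    static String describe(Class<?> c) {
        ClassLoader loader = c.getClassLoader();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return c.getName()
            + " | loader=" + (loader == null ? "bootstrap" : loader.toString())
            + " | source=" + (src == null ? "(none, typically the JDK itself)"
                                          : String.valueOf(src.getLocation()));
    }

    public static void main(String[] args) {
        // The abstract API class that the cast in the stack trace targets.
        System.out.println(describe(DocumentBuilderFactory.class));

        // The concrete factory that the JAXP lookup selects on this classpath.
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        System.out.println(describe(factory.getClass()));
    }
}
```

If the two lines report different jars for the API class and for the Xerces implementation's parent class, a duplicate XML API jar on the plugin classpath would be a plausible suspect.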

Thanks & Regards
Venkata MR
+91 98455 77125

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
Sent: 17 December 2018 14:57
To: user@nutch.apache.org
Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax

Hi,

> protocol-httpclient (as the websites are with https).

With Nutch 1.15 protocol-selenium supports https. If protocol-httpclient is also active, it may be used instead of protocol-selenium. There is no need to activate it, the description in nutch-default.xml needs to be fixed, see
https://issues.apache.org/jira/browse/NUTCH-2678

Note that protocol-interactiveselenium will support https in 1.16.

Best,
Sebastian

On 12/16/18 1:40 PM, Venkata MR wrote:
> Hi Lewis,
>
> Thanks for your email, I tried all options with no success before reaching you again referring to the link you had provided.
>
> Here I am trying to crawl websites which are having the runtime rendered content to extract and parse.
> I downloaded the Nutch provided in the below email. Added protocol-interactiveselenium, protocol-selenium along with protocol-httpclient (as the websites are with https).
> Selenium - firefox is configured and it is working properly, and selenium is configured and running while doing the crawling.
>
> Yet, not able to get rendered content. Here I attached the nutch-site.xml for reference to see any input of missing configuration.
>
> Just wondering to guess if the issue with the tika parsers not able to parse the extracted runtime rendered content or the issue with the Solr (I am using Apache solr to index parsed data) for not having the indexed field to represent the data (schema has content, url, title and id).
>
> Any input really appreciated to resolve the issue.
>
> Environment: CentOS-7
> Firefox: 60.3 oesr (64 bit)
> Selenium: v3.4.0
> Geckodriver: 0.23.0 (2018-10-04)
> Apache Nutch: 1.x
>
> Thanks & Regards
> Venkata MR
> +91 98455 77125
>
> -----Original Message-----
> From: Lewis John McGibbney <lewi...@apache.org>
> Sent: 09 December 2018 02:11
> To: user@nutch.apache.org
> Subject: Re: Apache Nutch 2.3.1 not able to fetch content rendered by ajax
>
> Hi Venkata,
> This functionality is not available in 2.X at the moment.
> The functionality is available in the 1.x primary branch. You can learn about the implementation at both
>
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium, and
>
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
>
> Lewis
>
> On 2018/12/07 07:03:30, Venkata MR <venkata...@hcl.com> wrote:
>> Hi,
>>
>> Was trying to fetch the content rendered by ajax call using Apache Nutch 2.3.1.
>> Seems, it is not able to get the actual rendered content only getting the view source page ( as part of protocol-js plugin).
>> Has anyone able to fetch the rendered content from Ajax call using Nutch 2.3.1 or any suggestions?
>>
>> Thanks & Regards
>> Venkata MR
>> +91 98455 77125