Hi Sebastian,

I used the Firefox driver (headed and headless, both with the same output). I have now tried Chrome as well, but the Selenium driver does not match the browser's version:
Dec 21, 2024 11:29:50 AM org.openqa.selenium.devtools.CdpVersionFinder findNearestMatch
WARNING: Unable to find CDP implementation matching 131
Dec 21, 2024 11:29:50 AM org.openqa.selenium.chromium.ChromiumDriver lambda$new$5
WARNING: Unable to find version of CDP to use for 131.0.6778.204. You may need to include a dependency on a specific version of the CDP using something similar to `org.seleniumhq.selenium:selenium-devtools-v86:4.18.1` where the version ("v86") matches the version of the chromium-based browser you're using and the version number of the artifact is the same as Selenium's.

The environment:

- Debian 12

crawler@debian:~/apache-nutch-1.20$ dpkg -l|awk /'openjdk|chromium|firefox/'
ii chromium 131.0.6778.204-1~deb12u1 amd64 web browser
ii chromium-common 131.0.6778.204-1~deb12u1 amd64 web browser - common resources used by the chromium packages
ii chromium-sandbox 131.0.6778.204-1~deb12u1 amd64 web browser - setuid security sandbox for chromium
ii firefox-esr 128.5.0esr-1~deb12u1 amd64 Mozilla Firefox web browser - Extended Support Release (ESR)
ii openjdk-17-jdk:amd64 17.0.13+11-2~deb12u1 amd64 OpenJDK Development Kit (JDK)
ii openjdk-17-jdk-headless:amd64 17.0.13+11-2~deb12u1 amd64 OpenJDK Development Kit (JDK) (headless)
ii openjdk-17-jre:amd64 17.0.13+11-2~deb12u1 amd64 OpenJDK Java runtime, using Hotspot JIT
ii openjdk-17-jre-headless:amd64 17.0.13+11-2~deb12u1 amd64 OpenJDK Java runtime, using Hotspot JIT (headless)

- Nutch 1.20

crawler@debian:~/apache-nutch-1.20$ ls -la plugins/lib-selenium/|awk '/java|fire|chrom/'
-rw-rw-r-- 1 crawler crawler 15248 Apr 9 2024 selenium-chrome-driver-4.18.1.jar
-rw-rw-r-- 1 crawler crawler 36726 Apr 9 2024 selenium-chromium-driver-4.18.1.jar
-rw-rw-r-- 1 crawler crawler 83279 Apr 9 2024 selenium-firefox-driver-4.18.1.jar
-rw-rw-r-- 1 crawler crawler 545 Apr 9 2024 selenium-java-4.18.1.jar

On Thu, Dec 19, 2024 at 10:53 PM Sebastian Nagel <sna...@apache.org> wrote:

> Hi Peter,
>
> the best description for the Selenium plugin is the README.md [1].
>
> Otherwise, could you share which Selenium driver is used?
>
> Thanks,
> Sebastian
>
> [1] https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md
>
> On 12/17/24 21:07, Peter Viskup wrote:
> > Just not able to get it working...
> > At first I got selenium timeout exception even
> > with libselenium.page.load.delay set. The solution was to increase the
> > value of page.load.delay which was default of 3.
> >
> > Then I stucked with the output of Selenium which shows "You need to enable
> > JavaScript".
> >
> > Am running the nutch with command:
> > ./bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
> > -Dselenium.enable.headless=true \
> > -Dlibselenium.page.load.delay=120 \
> > -Dpage.load.delay=120 \
> > -followRedirects -dumpText https://metais.slovensko.sk
> >
> > Went through the source code of libselenium and selenium protocol plugins
> > with no success.
> >
> > What else to try to get such page crawled?
> >
> > Peter