Hi Sebastian,
I used the Firefox driver (both headed and headless, with the same output).
I have now tried Chrome, but the Selenium driver didn't match the browser's version:

Dec 21, 2024 11:29:50 AM org.openqa.selenium.devtools.CdpVersionFinder findNearestMatch
WARNING: Unable to find CDP implementation matching 131
Dec 21, 2024 11:29:50 AM org.openqa.selenium.chromium.ChromiumDriver lambda$new$5
WARNING: Unable to find version of CDP to use for 131.0.6778.204. You may need to include a dependency on a specific version of the CDP using something similar to `org.seleniumhq.selenium:selenium-devtools-v86:4.18.1` where the version ("v86") matches the version of the chromium-based browser you're using and the version number of the artifact is the same as Selenium's.



The environment:
 - Debian 12
crawler@debian:~/apache-nutch-1.20$ dpkg -l|awk /'openjdk|chromium|firefox/'
ii  chromium                              131.0.6778.204-1~deb12u1  amd64        web browser
ii  chromium-common                       131.0.6778.204-1~deb12u1  amd64        web browser - common resources used by the chromium packages
ii  chromium-sandbox                      131.0.6778.204-1~deb12u1  amd64        web browser - setuid security sandbox for chromium
ii  firefox-esr                           128.5.0esr-1~deb12u1  amd64        Mozilla Firefox web browser - Extended Support Release (ESR)
ii  openjdk-17-jdk:amd64                  17.0.13+11-2~deb12u1  amd64        OpenJDK Development Kit (JDK)
ii  openjdk-17-jdk-headless:amd64         17.0.13+11-2~deb12u1  amd64        OpenJDK Development Kit (JDK) (headless)
ii  openjdk-17-jre:amd64                  17.0.13+11-2~deb12u1  amd64        OpenJDK Java runtime, using Hotspot JIT
ii  openjdk-17-jre-headless:amd64         17.0.13+11-2~deb12u1  amd64        OpenJDK Java runtime, using Hotspot JIT (headless)

Nutch 1.20
crawler@debian:~/apache-nutch-1.20$ ls -la plugins/lib-selenium/|awk '/java|fire|chrom/'
-rw-rw-r--  1 crawler crawler   15248 Apr  9  2024 selenium-chrome-driver-4.18.1.jar
-rw-rw-r--  1 crawler crawler   36726 Apr  9  2024 selenium-chromium-driver-4.18.1.jar
-rw-rw-r--  1 crawler crawler   83279 Apr  9  2024 selenium-firefox-driver-4.18.1.jar
-rw-rw-r--  1 crawler crawler     545 Apr  9  2024 selenium-java-4.18.1.jar
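
For reference, the coordinate that the CDP warning describes can be assembled from the two versions listed above: the Chromium package version and the Selenium version in the jar names. This is only a sketch. devtools_coordinate is a hypothetical helper name, not something shipped with Nutch or Selenium, and whether an artifact with the resulting coordinate is actually published has not been verified here.

```shell
#!/bin/sh
# Build the Maven coordinate that the CDP warning asks for.
# devtools_coordinate is a hypothetical helper, not part of Nutch or Selenium.
devtools_coordinate() {
    browser_version="$1"    # e.g. 131.0.6778.204 (from dpkg or the warning)
    selenium_version="$2"   # e.g. 4.18.1 (from the jar names in plugins/lib-selenium)
    major="${browser_version%%.*}"   # keep only the text before the first dot
    printf 'org.seleniumhq.selenium:selenium-devtools-v%s:%s\n' "$major" "$selenium_version"
}

# Versions taken from the listings above.
devtools_coordinate 131.0.6778.204 4.18.1
```

The result only follows the naming pattern given in the warning; it would still have to be checked against the Maven repository before adding it as a dependency.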

On Thu, Dec 19, 2024 at 10:53 PM Sebastian Nagel <sna...@apache.org> wrote:

> Hi Peter,
>
> the best description for the Selenium plugin is the README.md [1].
>
> Otherwise, could you share which Selenium driver is used?
>
> Thanks,
> Sebastian
>
> [1] https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/README.md
>
> On 12/17/24 21:07, Peter Viskup wrote:
> > I am just not able to get it working...
> > At first I got a Selenium timeout exception even with
> > libselenium.page.load.delay set. The solution was to increase the value
> > of page.load.delay, whose default is 3.
> >
> > Then I got stuck on the Selenium output, which shows "You need to enable
> > JavaScript".
> >
> > I am running Nutch with the command:
> > ./bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
> >   -Dselenium.enable.headless=true \
> >   -Dlibselenium.page.load.delay=120 \
> >   -Dpage.load.delay=120 \
> >   -followRedirects -dumpText https://metais.slovensko.sk
> >
> > I went through the source code of the lib-selenium and protocol-selenium
> > plugins with no success.
> >
> > What else can I try to get such a page crawled?
> >
> > Peter
> >
>
>
