Hey Martin, I would encourage you to file issues in the spark-rapids repo for questions about that plugin: https://github.com/NVIDIA/spark-rapids/issues

I'm assuming the query ran, you looked at the SQL UI or the .explain() output, and it was on the CPU and not the GPU? I am also assuming you have the CUDA 11.0 runtime installed (look in /usr/local). You printed the driver version, which is 11.2, but the runtime version can be different. You are using the CUDA 11.0 build of the cudf library; if that didn't match the runtime, though, it would have failed rather than run anything.

The easiest way to tell why it didn't run on the GPU is to enable the config:

spark.rapids.sql.explain=NOT_ON_GPU

It will print logs to your console explaining why different operators don't run on the GPU. Again, feel free to open a question issue in the spark-rapids repo and we can discuss more there.

Tom

On Friday, April 9, 2021, 11:19:05 AM CDT, Martin Somers <sono...@gmail.com> wrote:

Hi Everyone !!
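For what it's worth, that config can be enabled either as a --conf flag when launching spark-shell or from inside a running session. A minimal sketch of the in-session form, assuming the plugin jars are already on the classpath (the `spark` session object is provided by the shell):

```scala
// Inside spark-shell: tell the RAPIDS plugin to log, for every operator
// it could not place on the GPU, the reason it fell back to the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

// Re-run the query after setting this; the reasons appear in the
// driver console output, not in the query result itself.
```

The equivalent launch-time form is `--conf spark.rapids.sql.explain=NOT_ON_GPU` on the spark-shell command line.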
I'm trying to get an on-premise GPU instance of Spark 3 running on my Ubuntu box, and I am following: https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#example-join-operation

Does anyone have any insight into why a Spark job isn't being run on the GPU? It appears to be all on the CPU. The Hadoop binary is installed and appears to be functioning fine:

export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

Here is my setup on Ubuntu 20.10:

▶ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:21:00.0  On |                  N/A |
|  0%   38C    P8    19W / 370W |    478MiB / 24265MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

/opt/sparkRapidsPlugin
▶ ls
cudf-0.18.1-cuda11.jar  getGpusResources.sh  rapids-4-spark_2.12-0.4.1.jar

▶ scalac --version
Scala compiler version 2.13.0 -- Copyright 2002-2019, LAMP/EPFL and Lightbend, Inc.
▶ spark-shell --version
2021-04-09 17:05:36,158 WARN util.Utils: Your hostname, studio resolves to a loopback address: 127.0.1.1; using 192.168.0.221 instead (on interface wlp71s0)
2021-04-09 17:05:36,159 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.10
Branch HEAD
Compiled by user ubuntu on 2021-02-22T01:04:02Z
Revision 1d550c4e90275ab418b9161925049239227f3dc9
Url https://github.com/apache/spark
Type --help for more information.
Here is how I'm calling spark-shell prior to adding the test job:

$SPARK_HOME/bin/spark-shell \
  --master local \
  --num-executors 1 \
  --conf spark.executor.cores=16 \
  --conf spark.rapids.sql.concurrentGpuTasks=1 \
  --driver-memory 10g \
  --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
  --conf spark.rapids.memory.pinnedPool.size=16G \
  --conf spark.locality.wait=0s \
  --conf spark.sql.files.maxPartitionBytes=512m \
  --conf spark.sql.shuffle.partitions=10 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --files $SPARK_RAPIDS_DIR/getGpusResources.sh \
  --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}

The test job is from the example join operation:

val df = sc.makeRDD(1 to 10000000, 6).toDF
val df2 = sc.makeRDD(1 to 10000000, 6).toDF
df.select($"value" as "a").join(df2.select($"value" as "b"), $"a" === $"b").count

I just noticed that the Scala versions are out of sync -- that shouldn't affect it? Is there anything else I can try in the --conf, or are there any logs to see what might be failing behind the scenes? Any suggestions?

Thanks
Martin

--
M
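One quick check that can be done from the same shell session is to print the physical plan of the join before counting: when the RAPIDS plugin has taken over, the plan shows Gpu-prefixed operators instead of the usual CPU ones. A sketch, assuming the spark-shell session above started with the plugin actually loaded (`sc` and `$` are provided by the shell):

```scala
// Build the same join as the test job, but inspect its plan instead of
// (or before) running .count on it.
val df  = sc.makeRDD(1 to 10000000, 6).toDF
val df2 = sc.makeRDD(1 to 10000000, 6).toDF
val joined = df.select($"value" as "a")
  .join(df2.select($"value" as "b"), $"a" === $"b")

// Print the physical plan. With the plugin active, operator names carry a
// "Gpu" prefix (e.g. GpuProject, GpuShuffledHashJoin); plain names such as
// Project / SortMergeJoin indicate the query stayed on the CPU.
joined.explain()
```

This is the same information the SQL tab of the Spark UI shows graphically, so either view works for confirming GPU placement.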