Good to hear it's working. Happy Spark usage. G
On Tue, 6 Apr 2021, 21:56 Mich Talebzadeh, <mich.talebza...@gmail.com> wrote:

> OK, we found the root cause of this issue.
>
> We were writing to Redis from Spark and had downloaded a recently compiled
> version of the Redis connector jar built against Scala 2.12:
>
>   spark-redis_2.12-2.4.1-SNAPSHOT-jar-with-dependencies.jar
>
> That jar was the one giving grief. With it removed, the job runs with either
>
>   spark-sql-kafka-0-10_2.12-3.1.0.jar
>
> or the same integration pulled in as a package through
>
>   spark-submit .. --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1
>
> We will follow the solution suggested in the deployment doc.
>
> Batch: 18
> -------------------------------------------
> +--------------------+------+-------------------+------+
> |              rowkey|ticker|         timeissued| price|
> +--------------------+------+-------------------+------+
> |b539cb54-3ddd-47c...|  ORCL|2021-04-06 20:53:37| 41.32|
> |2d4bae2d-649e-4b8...|   VOD|2021-04-06 20:53:37|317.48|
> |2f51f188-6da4-4bb...|   MKS|2021-04-06 20:53:37|376.63|
> |1a4c4645-8dc7-4ef...|    BP|2021-04-06 20:53:37| 571.5|
> |45c9e738-ead7-4e5...|  SBRY|2021-04-06 20:53:37|244.76|
> |48f93c13-43ad-422...|   SAP|2021-04-06 20:53:37| 58.71|
> |ed4d89b1-7fc1-420...|   IBM|2021-04-06 20:53:37|105.91|
> |44b3f0ce-27b8-4a9...|   MRW|2021-04-06 20:53:37|297.85|
> |4441b0b5-32c1-4cb...|  MSFT|2021-04-06 20:53:37| 27.83|
> |143398a4-13b5-494...|  TSCO|2021-04-06 20:53:37|183.42|
> +--------------------+------+-------------------+------+
>
> Now we need to go back to the drawing board and see how to integrate Redis.
>
> Thanks
>
> Mich
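For orientation, here is a minimal PySpark sketch of the kind of Structured Streaming job that produces console batches like the one above, reconstructed from the parsed_value schema printed later in this thread. The broker address and topic name are illustrative assumptions, not details taken from the original job:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (FloatType, StringType, StructField,
                                   StructType, TimestampType)

    spark = SparkSession.builder.appName("md_streaming").getOrCreate()

    # Matches the parsed_value struct shown further down the thread
    schema = StructType([
        StructField("rowkey", StringType(), True),
        StructField("ticker", StringType(), True),
        StructField("timeissued", TimestampType(), True),
        StructField("price", FloatType(), True),
    ])

    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
          .option("subscribe", "md")                            # assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema)
                  .alias("parsed_value"))
          .select("parsed_value.*"))

    # Console sink, producing output like the "Batch: 18" block above
    (df.writeStream
       .outputMode("append")
       .format("console")
       .option("truncate", "false")
       .start()
       .awaitTermination())

The kafka source used by readStream is what the spark-sql-kafka-0-10 package provides; without a matching version of that package on the classpath, either the format("kafka") lookup or, as later in this thread, its token-provider calls fail at runtime.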
> On Tue, 6 Apr 2021 at 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Fine.
>>
>> Just to clarify, please.
>>
>> With SBT assembly and Scala I would create an uber jar file and use that
>> one with spark-submit.
>>
>> As I understand it (and I stand to be corrected), with PySpark one can
>> only run spark-submit in client mode, by pointing it directly at a py file?
>>
>> Hence:
>>
>> spark-submit --master local[4] --packages
>> org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 <python_file>

>> On Tue, 6 Apr 2021 at 17:39, Sean Owen <sro...@gmail.com> wrote:
>>
>>> Gabor's point is that these are not libraries you typically install in
>>> your cluster itself. You package them with your app.
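As a hedged illustration of Sean's point for a PySpark job: the Kafka integration is declared at submit time with --packages (its version matching the Spark build exactly) rather than copied into $SPARK_HOME/jars, and the application's own pure-Python dependencies travel with the app. The zip file name below is hypothetical; note also that YARN does accept --deploy-mode cluster for Python applications, so client mode is not the only option there:

    spark-submit --master yarn --deploy-mode client \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
      --py-files app_dependencies.zip \
      xyz.py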
>>> On Tue, Apr 6, 2021 at 11:35 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi G,
>>>>
>>>> Thanks for the heads-up.
>>>>
>>>> In a thread on the 3rd of March I reported that 3.1.1 works in yarn mode:
>>>>
>>>> Spark 3.1.1 Preliminary results (mainly to do with Spark Structured
>>>> Streaming) (mail-archive.com)
>>>> <https://www.mail-archive.com/user@spark.apache.org/msg75979.html>
>>>>
>>>> From that mail:
>>>>
>>>> The jar files needed for version 3.1.1 to read from Kafka and write to
>>>> BigQuery are as follows, all under $SPARK_HOME/jars on all nodes. These
>>>> are the latest available jar files:
>>>>
>>>>    - commons-pool2-2.9.0.jar
>>>>    - spark-token-provider-kafka-0-10_2.12-3.1.0.jar
>>>>    - spark-sql-kafka-0-10_2.12-3.1.0.jar
>>>>    - kafka-clients-2.7.0.jar
>>>>    - spark-bigquery-latest_2.12.jar
>>>>
>>>> I just tested it, and in local mode (single JVM) it works fine without
>>>> the addition of the package --packages
>>>> org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1, BUT with all the above
>>>> jar files included.
>>>>
>>>> Batch: 17
>>>> -------------------------------------------
>>>> +--------------------+------+-------------------+------+
>>>> |              rowkey|ticker|         timeissued| price|
>>>> +--------------------+------+-------------------+------+
>>>> |54651f0d-1be0-4d7...|   IBM|2021-04-06 17:17:04| 91.92|
>>>> |8aa1ad79-4792-466...|   SAP|2021-04-06 17:17:04| 34.93|
>>>> |8567f327-cfec-43d...|  TSCO|2021-04-06 17:17:04| 324.5|
>>>> |138a1278-2f54-45b...|   VOD|2021-04-06 17:17:04| 241.4|
>>>> |e02793c3-8e78-47e...|  ORCL|2021-04-06 17:17:04|  17.6|
>>>> |0ab456fb-bd22-465...|  SBRY|2021-04-06 17:17:04|350.45|
>>>> |74588e92-a3e2-48c...|  MSFT|2021-04-06 17:17:04| 44.58|
>>>> |1e7203c6-6938-4ea...|    BP|2021-04-06 17:17:04| 588.0|
>>>> |1e55021a-148d-4aa...|   MRW|2021-04-06 17:17:04|171.21|
>>>> |229ad6f9-e4ed-475...|   MKS|2021-04-06 17:17:04|439.17|
>>>> +--------------------+------+-------------------+------+
>>>>
>>>> However, if I exclude the jar file spark-sql-kafka-0-10_2.12-3.1.0.jar
>>>> and include the package as suggested in the link:
>>>>
>>>> spark-submit --master local[4]
>>>> --conf spark.pyspark.virtualenv.enabled=true
>>>> --conf spark.pyspark.virtualenv.type=native
>>>> --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt
>>>> --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv
>>>> --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3
>>>> --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py
>>>>
>>>> it cannot fetch the data:
>>>>
>>>> root
>>>>  |-- parsed_value: struct (nullable = true)
>>>>  |    |-- rowkey: string (nullable = true)
>>>>  |    |-- ticker: string (nullable = true)
>>>>  |    |-- timeissued: timestamp (nullable = true)
>>>>  |    |-- price: float (nullable = true)
>>>>
>>>> {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
>>>> -------------------------------------------
>>>> Batch: 0
>>>> -------------------------------------------
>>>> +------+------+----------+-----+
>>>> |rowkey|ticker|timeissued|price|
>>>> +------+------+----------+-----+
>>>> +------+------+----------+-----+
>>>>
>>>> 2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
>>>> java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>>>>         at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:549)
>>>>         at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:291)
>>>>         at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>>>>         at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
>>>>         at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
>>>>         at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
>>>>         at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
>>>>         at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
>>>>         at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>>>>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>>>>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>>>>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>>>>         at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>>>>         at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>>>>         at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
>>>>         at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
>>>>         at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
>>>>         at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
>>>>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>>>>         at org.apache.spark.scheduler.Task.run(Task.scala:131)
>>>>         at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>>>>         at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>>>>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>>>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>         at java.lang.Thread.run(Thread.java:748)
>>>> 2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
>>>> java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>>>>
>>>> I then deleted the ~/.ivy2 directory and ran the job again:
>>>>
>>>> Ivy Default Cache set to: /home/hduser/.ivy2/cache
>>>> The jars for the packages stored in: /home/hduser/.ivy2/jars
>>>> org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
>>>> :: resolving dependencies :: org.apache.spark#spark-submit-parent-2bab6bd2-3136-4783-b044-810f0800ef0e;1.0
>>>>
>>>> Let us go and have a look at the directory .ivy2/jars:
>>>>
>>>> /home/hduser/.ivy2/jars> ltr
>>>> total 13108
>>>> -rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
>>>> -rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
>>>> -rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
>>>> -rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
>>>> -rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
>>>> -rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
>>>> -rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
>>>> -rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
>>>> -rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
>>>> drwxr-xr-x 4 hduser hadoop    4096 Apr  6 17:25 ..
>>>> drwxr-xr-x 2 hduser hadoop    4096 Apr  6 17:25 .
>>>>
>>>> Strangely, some of these jar files, such as org.apache.kafka_kafka-clients-2.6.0.jar
>>>> and org.apache.commons_commons-pool2-2.6.2.jar, seem to be out of date.
>>>>
>>>> Very confusing. It sounds like we have changed something in the cluster:
>>>> as reported on the 3rd of March it used to work with those jar files, and
>>>> now it is not working.
>>>>
>>>> So in summary, without those jar files added to $SPARK_HOME/jars it fails
>>>> totally, even with the packages added.
>>>>
>>>> Cheers
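It is worth pausing on the ivy listing above: ivy has resolved both spark-sql-kafka and spark-token-provider-kafka at 3.1.1, while $SPARK_HOME/jars (per the jar list at the top of that mail) still carries the 3.1.0 token provider, so two versions of the same classes are visible to the JVM. A quick, hedged way to spot that kind of split (directory paths assumed):

    # The versions of these two jars must match wherever they come from;
    # a 3.1.1 spark-sql-kafka paired with an older token provider is one
    # way to end up with the NoSuchMethodError on KafkaTokenUtil above.
    ls $SPARK_HOME/jars ~/.ivy2/jars 2>/dev/null | \
        grep -E 'spark-(sql|token-provider)-kafka'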
>>>> On Tue, 6 Apr 2021 at 15:44, Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>
>>>>> > Anyway I unzipped the tarball for Spark-3.1.1 and there is no
>>>>> > spark-sql-kafka-0-10_2.12-3.0.1.jar even
>>>>>
>>>>> Please see how a Structured Streaming app with Kafka needs to be
>>>>> deployed, here:
>>>>> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
>>>>> I don't see the --packages option...
>>>>>
>>>>> G

>>>>> On Tue, Apr 6, 2021 at 2:40 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> OK, thanks for that.
>>>>>>
>>>>>> I am using spark-submit with PySpark as follows:
>>>>>>
>>>>>> spark-submit --version
>>>>>> Welcome to
>>>>>>       ____              __
>>>>>>      / __/__  ___ _____/ /__
>>>>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>>>>    /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
>>>>>>       /_/
>>>>>>
>>>>>> Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_201
>>>>>> Branch HEAD
>>>>>> Compiled by user ubuntu on 2021-02-22T01:33:19Z
>>>>>>
>>>>>> spark-submit --master yarn --deploy-mode client
>>>>>> --conf spark.pyspark.virtualenv.enabled=true
>>>>>> --conf spark.pyspark.virtualenv.type=native
>>>>>> --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt
>>>>>> --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv
>>>>>> --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3
>>>>>> --driver-memory 16G --executor-memory 8G --num-executors 4 --executor-cores 2
>>>>>> xyz.py
>>>>>>
>>>>>> i.e. with a virtual environment enabled.
>>>>>>
>>>>>> That works fine, in client mode, with any job that does not do
>>>>>> structured streaming.
>>>>>> Running on the local node with
>>>>>>
>>>>>> spark-submit --master local[4]
>>>>>> --conf spark.pyspark.virtualenv.enabled=true
>>>>>> --conf spark.pyspark.virtualenv.type=native
>>>>>> --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt
>>>>>> --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv
>>>>>> --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3
>>>>>> xyz.py
>>>>>>
>>>>>> also works fine with the same Spark version and $SPARK_HOME/jars.
>>>>>>
>>>>>> Cheers

>>>>>> On Tue, 6 Apr 2021 at 13:20, Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1.
>>>>>>> You do not in general modify the Spark libs. You need to package libs
>>>>>>> like this with your app at the correct version.

>>>>>>> On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks Gabor.
>>>>>>>>
>>>>>>>> All nodes are running Spark /spark-3.1.1-bin-hadoop3.2.
>>>>>>>>
>>>>>>>> So $SPARK_HOME/jars contains all the required jars on all nodes,
>>>>>>>> including the jar file commons-pool2-2.9.0.jar as well. They are
>>>>>>>> installed identically on all nodes.
>>>>>>>>
>>>>>>>> I have looked at the Spark environment for the classpath. Still, I
>>>>>>>> don't see the reason why Spark 3.1.1 fails with
>>>>>>>> spark-sql-kafka-0-10_2.12-3.1.1.jar but works OK with
>>>>>>>> spark-sql-kafka-0-10_2.12-3.1.0.jar.
>>>>>>>>
>>>>>>>> Anyway, I unzipped the tarball for Spark-3.1.1 and there is no
>>>>>>>> spark-sql-kafka-0-10_2.12-3.0.1.jar even. I had to add
>>>>>>>> spark-sql-kafka-0-10_2.12-3.0.1.jar to make it work. Then I checked
>>>>>>>> Maven for the latest available version, which pointed to
>>>>>>>> spark-sql-kafka-0-10_2.12-3.1.1.jar.
>>>>>>>>
>>>>>>>> So, to confirm, Spark out of the tarball does not have any:
>>>>>>>>
>>>>>>>> ltr spark-sql-kafka-*
>>>>>>>> ls: cannot access spark-sql-kafka-*: No such file or directory
>>>>>>>>
>>>>>>>> For SSS, I had to add these:
>>>>>>>>
>>>>>>>>    - commons-pool2-2.9.0.jar (the one shipped is commons-pool-1.5.4.jar!)
>>>>>>>>    - kafka-clients-2.7.0.jar (the tarball did not have any)
>>>>>>>>    - spark-sql-kafka-0-10_2.12-3.0.1.jar (the tarball did not have any)
>>>>>>>>
>>>>>>>> I gather from your second mail that there seems to be an issue with
>>>>>>>> spark-sql-kafka-0-10_2.12-3.1.1.jar?
>>>>>>>>
>>>>>>>> HTH
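One hedged way to take the guesswork out of "I have looked at the Spark environment for the classpath" is to ask the running session which jars it actually distributed. listJars() is a public method on the Scala SparkContext; reaching it through _jsc as below is an internal but commonly used PySpark shortcut, so treat this as a debugging sketch rather than a stable API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Jars explicitly shipped with this application (--jars / --packages).
    # The returned Scala Seq prints as e.g. Vector(spark://.../xyz.jar, ...);
    # anything resolved into ~/.ivy2/jars shows up here, while the contents
    # of $SPARK_HOME/jars sit on the JVM classpath instead and do not.
    print(spark.sparkContext._jsc.sc().listJars())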
>>>>>>>> On Tue, 6 Apr 2021 at 11:54, Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Since you've not shared too many details, I presume you've updated
>>>>>>>>> the spark-sql-kafka jar only. KafkaTokenUtil is in the token provider
>>>>>>>>> jar.
>>>>>>>>>
>>>>>>>>> As a general note, if I'm right, please update Spark as a whole on
>>>>>>>>> all nodes, and not just jars independently.
>>>>>>>>>
>>>>>>>>> BR,
>>>>>>>>> G

>>>>>>>>> On Tue, Apr 6, 2021 at 10:21 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Any chance of someone testing the latest
>>>>>>>>>> spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark? It throws
>>>>>>>>>>
>>>>>>>>>> java.lang.NoSuchMethodError:
>>>>>>>>>> org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>>>>>>>>>>
>>>>>>>>>> However, the previous version, spark-sql-kafka-0-10_2.12-3.0.1.jar,
>>>>>>>>>> works fine.
>>>>>>>>>>
>>>>>>>>>> Thanks