Re: Spark 3.1 with spark AVRO

2022-03-10 Thread Yong Zhang
Thank you so much, you absolutely nailed it.

There is a stupid "SPARK_HOME" env variable pointing to Spark2.4 existed on 
zsh, which is the troublemaker.

I had totally forgotten about it and didn't realize this one environment 
variable could cause me days of frustration.
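In case anyone else hits this, the check that would have saved me days 
(assuming zsh; adjust the rc file names for your shell):

# See if a stale SPARK_HOME is set, find where it is exported, and clear it
echo $SPARK_HOME
grep -n SPARK_HOME ~/.zshenv ~/.zprofile ~/.zshrc
unset SPARK_HOME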

Yong


From: Artemis User 
Sent: Thursday, March 10, 2022 3:13 PM
To: user 
Subject: Re: Spark 3.1 with spark AVRO

It must be some misconfiguration in your environment.  Do you perhaps have a 
hardwired $SPARK_HOME env variable in your shell?  An easy test would be to 
place the spark-avro jar file you downloaded in the jars directory of Spark and 
run spark-shell again without the packages option.  This will guarantee that 
the jar file is on the classpath of the Spark driver and executors.
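Concretely, something along these lines (paths are illustrative; adjust to 
where you downloaded the jar and untarred Spark):

# Drop the connector jar into Spark's own jars directory, then launch
# spark-shell without --packages
cp spark-avro_2.12-3.1.3.jar spark-3.1.3-bin-hadoop3.2/jars/
spark-3.1.3-bin-hadoop3.2/bin/spark-shell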

On 3/10/22 1:24 PM, Yong Zhang wrote:
Hi,

I am puzzled by an issue reading avro files with Spark 3.1. Everything is done 
on my local Mac laptop so far, and I really don't know where the issue comes 
from; I googled a lot and cannot find any clue.

I have always used Spark 2.4, as it is really mature. But for a new 
project, I want to try out Spark 3.1, which needs to read AVRO files.

To my surprise, on my local machine, Spark 3.1.3 throws an error when trying to 
read the avro files.

  *   I downloaded Spark 3.1.2 and 3.1.3 (with Hadoop 2 or 3) from 
https://spark.apache.org/downloads.html
  *   Used JDK "1.8.0_321" on the Mac
  *   Untarred Spark 3.1.x locally
  *   Followed https://spark.apache.org/docs/3.1.3/sql-data-sources-avro.html

Started spark-shell with exactly the following command:

spark-3.1.3-bin-hadoop3.2/bin/spark-shell --packages 
org.apache.spark:spark-avro_2.12:3.1.3

And I always get the following error when reading the existing test AVRO files:

scala> val pageview = spark.read.format("avro").load("/Users/user/output/raw/")
java.lang.NoClassDefFoundError: 
org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2

I tried different versions of Spark 3.x, from Spark 3.1.2 -> 3.1.3 -> 3.2.1, and 
I believe they are all built with Scala 2.12. I start spark-shell with 
"--packages org.apache.spark:spark-avro_2.12:x.x.x", where x.x.x matches the 
Spark version, but I got the above weird "NoClassDefFoundError" in all cases.
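A sanity check worth running at this point, since the launcher scripts defer 
to an already-set SPARK_HOME and the shell you start may not be the version 
its path suggests (illustrative commands; both should agree with the install 
you think you are running):

echo $SPARK_HOME
spark-3.1.3-bin-hadoop3.2/bin/spark-shell --version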

Meanwhile, when I download Spark 2.4.8 and start spark-shell with "--packages 
org.apache.spark:spark-avro_2.11:2.4.3", I can read exactly the same AVRO file 
without any issue.

I keep thinking it must be something done wrongly on my end, but:

  *   I downloaded several versions of Spark and untarred them directly.
  *   I DIDN'T have any custom "spark-env.sh/spark-defaults.conf" file that 
could pull in stray jar files and mess things up.
  *   I went straight to creating a spark session under spark-shell with the 
correct package and tried to read the avro files. Nothing more.

I am tempted to suspect something is wrong with the Spark 3.x avro package 
releases, but I know that possibility is very low, especially across multiple 
different versions. And the class 
"org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2" does exist in 
"spark-sql_2.12-3.1.3.jar", as shown below:
jar tvf spark-sql_2.12-3.1.3.jar | grep FileDataSourceV2
15436 Sun Feb 06 22:54:00 EST 2022 
org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.class
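Another way to check is to ask the running shell which spark-sql jar it 
actually loaded (a hypothetical spark-shell session; the printed path depends 
on your install):

scala> classOf[org.apache.spark.sql.SparkSession].getProtectionDomain.getCodeSource.getLocation

If that points at a 2.4.x jar rather than spark-sql_2.12-3.1.3.jar, the 
NoClassDefFoundError follows, since the 2.4 jars do not contain 
FileDataSourceV2.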

So what could be wrong?

Thanks

Yong



