Re: File not found exceptions on S3 while running spark jobs
https://examples.javacodegeeks.com/java-io-filenotfoundexception-how-to-solve-file-not-found-exception/

Are you a programmer?

Regards,
Hulio

> Sent: Friday, July 17, 2020 at 2:41 AM
> From: "Nagendra Darla"
> To: user@spark.apache.org
> Subject: File not found exceptions on S3 while running spark jobs
> [...]
File not found exceptions on S3 while running spark jobs
Hello All,

I am converting an existing Parquet table (size: 50 GB) into Delta format. The conversion took around 1 hr 45 min, and I see that there are a lot of FileNotFoundExceptions in the logs:

Caused by: java.io.FileNotFoundException: No such file or directory:
s3a://old-data/delta-data/PL1/output/denorm_table/part-00031-183e54ef-50bc-46fc-83a3-7836baa28f86-c000.snappy.parquet

*How do I fix these errors?* I am using the options below in my spark-submit command:

spark-submit \
  --packages io.delta:delta-core_2.11:0.6.0,org.apache.hadoop:hadoop-aws:2.8.5 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --class Pipeline1 Pipeline.jar

Thank You,
Nagendra Darla
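One thing worth checking (an assumption, not confirmed by the logs in this thread): after a CONVERT TO DELTA, any reader that still lists the path as a plain Parquet directory can hit part files the Delta log has already replaced, and on S3 in 2020 the listing itself was only eventually consistent. A hedged sketch, keeping the thread's own packages and log store and only adding comments on how the converted table should be read back:

```
# Sketch only, not a confirmed fix. After conversion, make sure every reader
# goes through the Delta transaction log rather than a stale Parquet listing,
# e.g. in the job itself:
#   spark.read.format("delta").load("s3a://old-data/delta-data/PL1/output/denorm_table")
# and refresh any table that was cached under the old Parquet path:
#   spark.sql("REFRESH TABLE denorm_table")
# Also note S3SingleDriverLogStore only guarantees correctness for a single
# concurrent writer; a second job writing the same path can leave readers
# pointing at deleted files.
spark-submit \
  --packages io.delta:delta-core_2.11:0.6.0,org.apache.hadoop:hadoop-aws:2.8.5 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --class Pipeline1 Pipeline.jar
```

The table name in the REFRESH example is inferred from the path and is illustrative.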
Re: “Pyspark.zip does not exist” using Spark in cluster mode with Yarn
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10795
https://stackoverflow.com/questions/34632617/spark-python-submission-error-file-does-not-exist-pyspark-zip

> Sent: Thursday, July 16, 2020 at 6:54 PM
> From: "Davide Curcio"
> To: "user@spark.apache.org"
> Subject: “Pyspark.zip does not exist” using Spark in cluster mode with Yarn
> [...]
Re: Using spark.jars conf to override jars present in spark default classpath
That's what I'm saying you don't want to do :) If you have two versions of a library with different APIs, the safest approach is shading, and ordering probably can't be relied on. In my experience reflection will behave in ways you may not like, as will which classpath has priority when a class is loading. spark.jars will never be able to reorder, so you'll need to get those jars on the system class loader using the driver (and executor) extra classpath args (with userClassPathFirst). I will stress again that it would be my last choice for getting it working, and I would try shading first if I really have a conflict.

On Thu, Jul 16, 2020 at 2:17 PM Nupur Shukla wrote:
> Thank you Russell and Jeff,
>
> My bad, I wasn't clear before about the conflicting jars. By that, I meant
> my application needs to use updated versions of certain jars compared to
> what is present in the default classpath. [...]
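The shading route recommended above can be sketched with the Maven Shade plugin's relocation feature. This is a hedged illustration, not taken from the thread: the relocated package (jackson) and the shaded prefix are stand-ins for whatever library actually conflicts.

```
<!-- Sketch: relocate a conflicting library inside the application's fat jar so
     the copy on Spark's default classpath never collides with it. The pattern
     and shadedPattern below are illustrative assumptions. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.fasterxml.jackson</pattern>
            <shadedPattern>com.example.shaded.jackson</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

After shading, the application's bytecode references com.example.shaded.jackson, so whichever jackson ships with Spark no longer matters and no classpath ordering is needed.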
Re: Using spark.jars conf to override jars present in spark default classpath
Thank you Russell and Jeff,

My bad, I wasn't clear before about the conflicting jars. By that, I meant my application needs to use updated versions of certain jars compared to what is present in the default classpath. What would be the best way to use the confs spark.jars and spark.driver.extraClassPath together to do a classpath reordering so that the updated versions get picked up first? It looks like extraClassPath is the one conf to use here.

On Thu, 16 Jul 2020 at 12:05, Jeff Evans wrote:
> If you can't avoid it, you need to make use of the
> spark.driver.userClassPathFirst and/or spark.executor.userClassPathFirst
> properties.
> [...]
Re: Using spark.jars conf to override jars present in spark default classpath
If you can't avoid it, you need to make use of the spark.driver.userClassPathFirst and/or spark.executor.userClassPathFirst properties.

On Thu, Jul 16, 2020 at 2:03 PM Russell Spitzer wrote:
> I believe the main issue here is that spark.jars is a bit "too late" to
> actually prepend things to the class path. [...]
Re: Using spark.jars conf to override jars present in spark default classpath
I believe the main issue here is that spark.jars is a bit "too late" to actually prepend things to the class path. For most use cases this value is not read until after the JVM has already started and the system classloader has already loaded.

The jar argument gets added via the dynamic class loader, so it necessarily has to come afterwards :/ Driver extra classpath and its friends modify the actual launch command of the driver (or executors), so they can prepend whatever they want.

In general you do not want to have conflicting jars at all if possible, and I would recommend looking into shading if it's really important for your application to use a specific incompatible version of a library. spark.jars (and extraClassPath) are really just for adding additional jars, and I personally would try not to rely on classpath ordering to get the right libraries recognized.

On Thu, Jul 16, 2020 at 1:55 PM Nupur Shukla wrote:
> Hello,
>
> How can we use *spark.jars* to specify conflicting jars (that is, jars
> that are already present in Spark's default classpath)? [...]
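The approach described in the replies above can be sketched as a spark-submit invocation; the jar path and class name below are illustrative, not from the thread. extraClassPath entries end up on the JVM launch command, so they reach the system classloader, and userClassPathFirst makes user-supplied classes win where they conflict with Spark's bundled copies:

```
# Sketch only: paths and names are hypothetical.
spark-submit \
  --conf spark.driver.extraClassPath=/home/user/JarsConf/sample-project-3.0.0.jar \
  --conf spark.executor.extraClassPath=/home/user/JarsConf/sample-project-3.0.0.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --class SampleProject application.jar
```

Note that spark.jars cannot express any ordering at all, so a precedence chain involving it (spark.jars > extraClassPath > default) is not achievable; only the extraClassPath/userClassPathFirst pair can push jars ahead of the defaults.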
Using spark.jars conf to override jars present in spark default classpath
Hello,

How can we use *spark.jars* to specify conflicting jars (that is, jars that are already present in Spark's default classpath)? Jars specified in this conf get "appended" to the classpath, and thus get looked at after the default classpath. Is it not intended to be used to specify conflicting jars?

Meanwhile, when the *spark.driver.extraClassPath* conf is specified, this path is "prepended" to the classpath and thus takes precedence over the default classpath.

How can I use both to specify different jars and paths but achieve a precedence of spark.jars path > spark.driver.extraClassPath > spark default classpath (left-to-right precedence order)?

Experiment conducted:

I am using sample-project.jar, which has one class in it, SampleProject. This has a method which prints the version number of the jar. For this experiment I am using 3 versions of this sample-project.jar:
- sample-project-1.0.0.jar is present in the spark default classpath in my test cluster
- sample-project-2.0.0.jar is present in folder /home//ClassPathConf on the driver
- sample-project-3.0.0.jar is present in folder /home//JarsConf on the driver

(An empty cell in the image below means that conf was not specified.)

[image: image.png]

Thank you,
Nupur
“Pyspark.zip does not exist” using Spark in cluster mode with Yarn
I'm trying to run some Spark scripts in cluster mode using Yarn, but I always get this error. I read in other similar questions that the cause can be:

- "local" hard-coded as the master, but I don't have that
- a HADOOP_CONF_DIR environment variable that's wrong inside spark-env.sh, but it seems right

I've tried with every kind of code, even simple code, but it still doesn't work, even though in local mode it works. Here is my log when I try to execute the code:

spark/bin/spark-submit --deploy-mode cluster --master yarn ~/prova7.py
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/16 16:10:27 INFO Client: Requesting a new application from cluster with 2 NodeManagers
20/07/16 16:10:27 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (1536 MB per container)
20/07/16 16:10:27 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
20/07/16 16:10:27 INFO Client: Setting up container launch context for our AM
20/07/16 16:10:27 INFO Client: Setting up the launch environment for our AM container
20/07/16 16:10:27 INFO Client: Preparing resources for our AM container
20/07/16 16:10:27 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/07/16 16:10:31 INFO Client: Uploading resource file:/tmp/spark-750fb229-4166--9c69-eb90e9a2318d/__spark_libs__4588035472069967339.zip -> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/__spark_libs__4588035472069967339.zip
20/07/16 16:10:31 INFO Client: Uploading resource file:/home/ubuntu/prova7.py -> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/prova7.py
20/07/16 16:10:31 INFO Client: Uploading resource file:/home/ubuntu/spark/python/lib/pyspark.zip -> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip
20/07/16 16:10:31 INFO Client: Uploading resource file:/home/ubuntu/spark/python/lib/py4j-0.10.7-src.zip -> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/py4j-0.10.7-src.zip
20/07/16 16:10:32 INFO Client: Uploading resource file:/tmp/spark-750fb229-4166--9c69-eb90e9a2318d/__spark_conf__1291791519024875749.zip -> file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/__spark_conf__.zip
20/07/16 16:10:32 INFO SecurityManager: Changing view acls to: ubuntu
20/07/16 16:10:32 INFO SecurityManager: Changing modify acls to: ubuntu
20/07/16 16:10:32 INFO SecurityManager: Changing view acls groups to:
20/07/16 16:10:32 INFO SecurityManager: Changing modify acls groups to:
20/07/16 16:10:32 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
20/07/16 16:10:33 INFO Client: Submitting application application_1594914119543_0010 to ResourceManager
20/07/16 16:10:33 INFO YarnClientImpl: Submitted application application_1594914119543_0010
20/07/16 16:10:34 INFO Client: Application report for application_1594914119543_0010 (state: FAILED)
20/07/16 16:10:34 INFO Client:
     client token: N/A
     diagnostics: Application application_1594914119543_0010 failed 2 times due to AM Container for appattempt_1594914119543_0010_02 exited with exitCode: -1000
Failing this attempt. Diagnostics: [2020-07-16 16:10:34.391]File file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip does not exist
java.io.FileNotFoundException: File file:/home/ubuntu/.sparkStaging/application_1594914119543_0010/pyspark.zip does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:641)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at
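A likely clue in the log above, offered as an assumption rather than a confirmed diagnosis: every "Uploading resource" line targets file:/home/ubuntu/.sparkStaging/..., i.e. the submitting machine's local filesystem. In cluster mode the AM container can launch on any NodeManager, which cannot see that local directory, hence "pyspark.zip does not exist". This usually means fs.defaultFS is still the local-filesystem default instead of a shared filesystem. A sketch of the fix (the namenode host and port are placeholders):

```
# Sketch only: namenode:9000 is a placeholder for your actual HDFS address.
# With fs.defaultFS pointing at HDFS, .sparkStaging lands on a filesystem
# every YARN container can read, and the upload lines should show hdfs:/ paths.
spark-submit \
  --deploy-mode cluster --master yarn \
  --conf spark.hadoop.fs.defaultFS=hdfs://namenode:9000 \
  ~/prova7.py
# Alternatively, set fs.defaultFS in $HADOOP_CONF_DIR/core-site.xml and make
# sure HADOOP_CONF_DIR is exported in the environment running spark-submit.
```

If the cluster genuinely has no shared filesystem, the staging files would have to exist at the same local path on every node, which is fragile; a shared fs.defaultFS is the usual setup.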