Re: --packages Failed to load class for data source v1.4
I don't think this is the same issue, as it works just fine in pyspark 1.3.1. Are you aware of any workaround? I was hoping to start testing one of my apps in Spark 1.4, and I use the CSV exports as a safety valve to easily debug my data flow.

-Don

On Sun, Jun 14, 2015 at 7:18 PM, Burak Yavuz brk...@gmail.com wrote:

Hi Don,

This seems related to a known issue, where the classpath on the driver is missing the related classes. This is a bug in py4j, as py4j uses the System ClassLoader rather than Spark's Context ClassLoader. However, this problem existed in 1.3.0 as well, so I'm curious whether it's the same issue. Thanks for opening the JIRA; I'll take a look.

Best,
Burak

On Jun 14, 2015 2:40 PM, Don Drake dondr...@gmail.com wrote:

I looked at this again: when I use the Scala spark-shell and load a CSV using the same package, it works just fine, so this seems specific to pyspark. I've created the following JIRA: https://issues.apache.org/jira/browse/SPARK-8365

-Don

On Sat, Jun 13, 2015 at 11:46 AM, Don Drake dondr...@gmail.com wrote:

I downloaded the pre-compiled Spark 1.4.0 and attempted to run an existing Python Spark application against it, and got the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o90.save.
: java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv

I pass the following on the command line to my spark-submit:

--packages com.databricks:spark-csv_2.10:1.0.3

This worked fine in 1.3.1, but not in 1.4. I was able to replicate it with the following pyspark session:

a = {'a': 1.0, 'b': 'asdf'}
rdd = sc.parallelize([a])
df = sqlContext.createDataFrame(rdd)
df.save('/tmp/d.csv', 'com.databricks.spark.csv')

Even using the new writer API,

df.write.format('com.databricks.spark.csv').save('/tmp/d.csv')

gives the same error.
I see it was added in the web UI:

file:/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar   Added By User
file:/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar    Added By User
http://10.0.0.222:56871/jars/com.databricks_spark-csv_2.10-1.0.3.jar   Added By User
http://10.0.0.222:56871/jars/org.apache.commons_commons-csv-1.1.jar    Added By User

Thoughts?

-Don

Gory details:

$ pyspark --packages com.databricks:spark-csv_2.10:1.0.3
Python 2.7.6 (default, Sep 9 2014, 15:04:36)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Ivy Default Cache set to: /Users/drake/.ivy2/cache
The jars for the packages stored in: /Users/drake/.ivy2/jars
:: loading settings :: url = jar:file:/Users/drake/spark/spark-1.4.0-bin-hadoop2.6/lib/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.0.3 in central
	found org.apache.commons#commons-csv;1.1 in central
:: resolution report :: resolve 590ms :: artifacts dl 17ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.0.3 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 2 already retrieved (0kB/15ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/13 11:06:08 INFO SparkContext: Running Spark version 1.4.0
2015-06-13 11:06:08.921 java[19233:2145789] Unable to load realm info from SCDynamicStore
15/06/13 11:06:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
using builtin-java classes where applicable
15/06/13 11:06:09 WARN Utils: Your hostname, Dons-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 10.0.0.222 instead (on interface en0)
15/06/13 11:06:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/06/13 11:06:09 INFO SecurityManager: Changing view acls to: drake
15/06/13 11:06:09 INFO SecurityManager: Changing modify acls to: drake
15/06/13 11:06:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(drake); users with modify permissions: Set(drake)
15/06/13 11:06:10 INFO Slf4jLogger: Slf4jLogger started
15/06/13 11:06:10 INFO Remoting: Starting remoting
15/06/13 11:06:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.0.222:56870]
15/06/13 11:06:10 INFO Utils: Successfully started service 'sparkDriver' on port 56870.
15/06/13 11:06:10 INFO SparkEnv: Registering MapOutputTracker
15/06/13 11:06:10 INFO SparkEnv: Registering BlockManagerMaster
15/06/13 11:06:10 INFO DiskBlockManager: Created local directory at /private/var/folders/7_/k5h82ws97b95v5f5h8wf9j0hgn/T/spark-f36f39f5-7f82-42e0-b3e0-9eb1e1cc0816/blockmgr-a1412b71-fe56-429c-a193-ce3fb95d2ffd
15/06/13 11:06:10 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/06/13 11:06:10 INFO HttpFileServer: HTTP File server directory is /private/var/folders/7_/k5h82ws97b95v5f5h8wf9j0hgn/T/spark-f36f39f5-7f82-42e0-b3e0-9eb1e1cc0816/httpd-84d178da-7e60-4eed-8031-e6a0c465bd4c
15/06/13 11:06:10 INFO HttpServer: Starting HTTP Server
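Since Burak's diagnosis above points at the driver's classpath, one possible workaround (a sketch, untested against 1.4.0) is to bypass --packages and hand the already-downloaded Ivy-cache jars to spark-submit explicitly, putting them on the driver's classpath as well; `my_app.py` below is a placeholder for the real application:

```shell
# Jars that --packages already resolved into the local Ivy cache
# (paths taken from the web-UI listing above).
JARS=/Users/drake/.ivy2/jars/com.databricks_spark-csv_2.10-1.0.3.jar,/Users/drake/.ivy2/jars/org.apache.commons_commons-csv-1.1.jar

# --jars ships the jars to the executors; --driver-class-path puts them on
# the driver JVM's own classpath, which is what the System ClassLoader sees.
# Note: --jars takes a ','-separated list, --driver-class-path a ':'-separated
# one, hence the bash pattern substitution below.
spark-submit \
  --jars "$JARS" \
  --driver-class-path "${JARS//,/:}" \
  my_app.py
```

Whether this sidesteps the py4j classloader problem in 1.4.0 would need testing; it worked as a general pattern for getting extra jars onto the driver in that era.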
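To check whether the driver JVM can actually see the data-source class, one can ask it directly through the py4j gateway. This is a hypothetical diagnostic helper, not from the thread; it assumes a live pyspark shell where `sc._jvm` exposes the gateway's JVM view, and that spark-csv's entry point is `com.databricks.spark.csv.DefaultSource`:

```python
def class_is_loadable(jvm, class_name):
    """Ask the driver JVM, via the py4j gateway, to load class_name.

    Returns True if Class.forName succeeds, False otherwise. A False for
    the CSV data source would match the theory that py4j consults the
    System ClassLoader, which never saw the jars added by --packages.
    """
    try:
        jvm.java.lang.Class.forName(class_name)
        return True
    except Exception:
        return False

# In a pyspark shell (hypothetical usage):
#   class_is_loadable(sc._jvm, "com.databricks.spark.csv.DefaultSource")
```

If this returns False while the Scala spark-shell loads the same package fine, that localizes the bug to the Python-side classloading path rather than to --packages itself.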