Looks to me like it is a conflict between a Databricks library and Spark 2.1. That's an issue for Databricks to resolve or provide guidance on.
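
With both com.databricks.spark.csv and the built-in Spark 2.x CSV source on the classpath, the short name 'csv' is ambiguous, which is exactly what the lookup error is complaining about. A minimal sketch of what should work once the databricks package is off the classpath (assuming a SparkSession named `spark`; the path is illustrative):

    # Sketch only: with the com.databricks:spark-csv package removed from the
    # deploy, the short name "csv" resolves to the built-in Spark 2.x reader.
    df = spark.read.csv('out/df_in.csv')

Until then, the error's own suggestion of specifying the fully qualified class name is the way to pin the read to one source.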
On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:

> I'm a bit confused by that answer, I'm assuming it's spark deciding which
> lib to use.
>
> On 9 May 2017 at 14:30, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> This looks more like a matter for Databricks support than spark-user.
>>
>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <lucas.g...@gmail.com> wrote:
>>
>>> df = spark.sqlContext.read.csv('out/df_in.csv')
>>>
>>>> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
>>>> metastore. hive.metastore.schema.verification is not enabled so
>>>> recording the schema version 1.2.0
>>>> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
>>>> returning NoSuchObjectException
>>>> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database global_temp,
>>>> returning NoSuchObjectException
>>>
>>>> Py4JJavaError: An error occurred while calling o72.csv.
>>>> : java.lang.RuntimeException: Multiple sources found for csv
>>>> (*com.databricks.spark.csv.DefaultSource15,
>>>> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*), please
>>>> specify the fully qualified class name.
>>>>   at scala.sys.package$.error(package.scala:27)
>>>>   at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
>>>>   at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
>>>>   at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
>>>>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
>>>>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>>>>   at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>   at java.lang.reflect.Method.invoke(Method.java:606)
>>>>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>>>>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>>>   at py4j.Gateway.invoke(Gateway.java:280)
>>>>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>>>>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>>>>   at java.lang.Thread.run(Thread.java:745)
>>>
>>> When I change our call to:
>>>
>>> df = spark.hiveContext.read \
>>>     .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
>>>     .load('df_in.csv')
>>>
>>> No such issue. I was under the impression (obviously wrongly) that spark
>>> would automatically pick the local lib. We have the databricks library
>>> because other jobs still explicitly call it.
>>>
>>> Is the 'correct answer' to go through and modify so as to remove the
>>> databricks lib / remove it from our deploy? Or should this just work?
>>>
>>> One of the things I find less helpful in the spark docs is when there are
>>> multiple ways to do it but no clear guidance on what those methods are
>>> intended to accomplish.
>>>
>>> Thanks!
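
As for the jobs that still call the databricks source explicitly: in Spark 2.x the CSV reader is built in, so those calls can usually be migrated and the external package dropped from the deploy, which removes the ambiguity entirely. A rough sketch of the equivalent built-in call (the options shown are illustrative, not taken from your job):

    # Hypothetical migration from format('com.databricks.spark.csv'):
    # the built-in Spark 2.x reader supports the same common options.
    df = spark.read \
        .option('header', 'true') \
        .option('inferSchema', 'true') \
        .csv('out/df_in.csv')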