Hi,

We're facing a situation where simple queries against Parquet files stored in Swift and accessed through a Hive Metastore sometimes fail with this exception:
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 58.0 failed 4 times, most recent failure: Lost task 6.3 in stage 58.0 (TID 412, agent-1.mesos.private): org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing mandatory configuration option: fs.swift.service.######.auth.url
        at org.apache.hadoop.fs.swift.http.RestClientBindings.copy(RestClientBindings.java:219)
        (...)

Queries that require a full table scan, like SELECT COUNT(*), fail with the exception above, while smaller chunks of work like SELECT * FROM ... LIMIT 5 succeed.

The problem seems to be related to the number of tasks scheduled: if we force the number of tasks down to 1, the job succeeds. dataframe.rdd.coalesce(1).count() returns a correct result, while dataframe.count() fails with the exception above.

To me it looks like the credentials are lost somewhere along the serialization path when the tasks are submitted to the cluster, although I have not yet found an explanation for why a job that requires only one task succeeds.

We are running Apache Zeppelin against Swift and Spark Notebook against S3. Both show an equivalent exception within their specific Hadoop filesystem implementation when a task fails:

Zeppelin + Swift:

    org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Missing mandatory configuration option: fs.swift.service.######.auth.url

Spark Notebook + S3:

    java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)

In both cases, valid credentials are being set programmatically through sc.hadoopConfiguration.
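For completeness, this is roughly what we run in the notebook. It is a minimal sketch: the service name "myprovider", the auth URL, the credential values, and the table name "events" are placeholders, not our real values (the real service name is the one masked as ###### above), and sc / sqlContext are the instances provided by the notebook.

    // Swift (Zeppelin) -- keys follow the hadoop-openstack naming from the exception
    sc.hadoopConfiguration.set("fs.swift.service.myprovider.auth.url", "https://identity.example.com/auth/v1.0")
    sc.hadoopConfiguration.set("fs.swift.service.myprovider.username", "<user>")
    sc.hadoopConfiguration.set("fs.swift.service.myprovider.password", "<password>")
    sc.hadoopConfiguration.set("fs.swift.service.myprovider.tenant", "<tenant>")

    // S3 (Spark Notebook) -- keys taken from the s3n exception message
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")

    // Parquet-backed table registered in the Hive Metastore
    val dataframe = sqlContext.table("events")

    dataframe.rdd.coalesce(1).count()  // succeeds: single task
    dataframe.count()                  // fails with the exceptions shown above

The configuration is only set on the driver's sc.hadoopConfiguration, which is why we suspect it is not reaching the executors for multi-task jobs.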
Our setup: Zeppelin or Spark Notebook with Spark 1.5.1 running on Docker, Docker running on Mesos, Hadoop 2.4.0. One environment runs on SoftLayer (Swift) and the other on Amazon EC2 (S3), both of similar size.

Any ideas on how to address this issue or how to figure out what's going on?

Thanks,
Gerard.