Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-19 Thread Ryan Blue
I think this is expected behavior, though not what I think is reasonable in
the long term. To my knowledge, this is how the v1 sources behave, and v2
just reuses the same mechanism to instantiate sources and uses a new
interface for v2 features.

I think that the right approach is to use catalogs, which I've proposed in
#21306. A catalog would be
loaded by reflection just once and then configured. After that, the same
instance for a given Spark SQL session would be reused.

Because the catalog instantiates table instances that expose read and write
capabilities (ReadSupport, WriteSupport), it can choose how to manage the
life-cycle of those tables and can also cache instances to control how
table state changes after a table is loaded. (Iceberg does this to use a
fixed snapshot for all reads until the table is written to or is garbage
collected.)
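
As a rough illustration of that caching idea (the Table trait and
CachingCatalog below are hypothetical placeholders, not the actual API
proposed in #21306):

import scala.collection.concurrent.TrieMap

trait Table  // stands in for a table exposing ReadSupport/WriteSupport

class CachingCatalog(loadTable: String => Table) {
  // One catalog instance per Spark SQL session, so this cache decides when a
  // table (and its fixed view of state) is reused vs. reloaded.
  private val tables = TrieMap.empty[String, Table]

  def load(name: String): Table =
    tables.getOrElseUpdate(name, loadTable(name))

  // Called after a write (or on eviction) to pick up new table state.
  def invalidate(name: String): Unit = tables.remove(name)
}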

rb

On Tue, Oct 9, 2018 at 8:30 PM Hyukjin Kwon  wrote:

> I took a look at the code.
>
> val source = classOf[MyDataSource].getCanonicalName
> spark.read.format(source).load().collect()
>
> It looks like it indeed gets called twice.
>
> First call: it looks like the reader is created first to read the schema for the logical plan:
>
> test.org.apache.spark.sql.sources.v2.MyDataSourceReader.<init>(MyDataSourceReader.java:36)
> test.org.apache.spark.sql.sources.v2.MyDataSource.createReader(MyDataSource.java:35)
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$SourceHelpers.createReader(DataSourceV2Relation.scala:155)
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:172)
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:204)
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
>
> Second call: it creates another one for the actual partitions in the physical plan:
>
> test.org.apache.spark.sql.sources.v2.MyDataSourceReader.<init>(MyDataSourceReader.java:36)
> test.org.apache.spark.sql.sources.v2.MyDataSource.createReader(MyDataSource.java:35)
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$SourceHelpers.createReader(DataSourceV2Relation.scala:155)
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation.newReader(DataSourceV2Relation.scala:61)
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$.apply(DataSourceV2Strategy.scala:103)
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
> scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
> scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
> scala.collection.Iterator$class.foreach(Iterator.scala:891)
> scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
> scala.collection.AbstractIterator.foldLeft(Iterator.scala:1334)
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
> scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
> org.apache.spark.sql.Dataset.withAction(Dataset.scala:3360)
> org.apache.spark.sql.Dataset.collect(Dataset.scala:2783)
>
>
> Skimming the API doc for DataSourceReader on branch-2.4, I haven't found a
> guarantee that the readers are created only once. If that's documented
> somewhere, we should fix it in 2.4.0. If not, I think it's fine since both
> calls are on the driver side and it's something that can be worked around,
> for instance with a static class or a thread-local in this case.
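>
> A minimal sketch of that kind of workaround, assuming the reader's state can
> be keyed by something stable such as its options or path; the names below
> are illustrative only, not part of the DataSourceV2 API:
>
> import scala.collection.concurrent.TrieMap
>
> // Shared, driver-side state that survives both reader instantiations.
> object MyReaderState {
>   private val states = TrieMap.empty[String, MyReaderState]
>
>   // Both DataSourceReader instances created for the same path get the same
>   // state object, so schema resolution, snapshot ids, etc. happen once.
>   def forPath(path: String): MyReaderState =
>     states.getOrElseUpdate(path, new MyReaderState(path))
> }
>
> class MyReaderState(val path: String) {
>   // e.g. a resolved schema, a snapshot id, connection metadata, ...
> }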
>
> Forwarding to the dev mailing list in case this is something we haven't
> foreseen.
>
> On Tue, Oct 9, 2018 at 9:39 PM, Shubham Chaurasia wrote:
>
>> Alright, so it is a big project which uses a SQL store underneath.
>> I extracted out the minimal code and made a 

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter,

Thanks for the additional information - this is really helpful (I
definitely got more than I was looking for :-)

Cheers,

Peter


On Fri, Oct 19, 2018 at 12:53 PM Peter Rudenko 
wrote:

> Hi Peter, we're using a part of Crail - its core library, called disni (
> https://github.com/zrlio/disni/). We couldn't reproduce the results from
> that blog post; in any case, Crail is a more platform-like approach (it
> comes with its own file system), while SparkRDMA is a pluggable approach -
> it's just a plugin that you can enable/disable for a particular workload,
> and you can use any Hadoop vendor, etc.
>
> The best optimization for shuffle between local JVMs could be to use
> something like short-circuit local reads (
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html)
> to use a Unix socket for local communication, or to directly read a part
> of another JVM's shuffle file. But yes, it's not available in Spark out of
> the box.
>
> Thanks,
> Peter Rudenko
>
> Fri, 19 Oct 2018 at 16:54, Peter Liu wrote:
>
>> Hi Peter,
>>
>> thank you for the reply and detailed information! Would this be something
>> comparable to Crail? (
>> http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html)
>> I was more looking for something simple/quick to make the shuffle between
>> the local JVMs faster (like the idea of using a local RAM disk) for my
>> simple use case.
>>
>> Of course, a general and thorough implementation should cover the shuffle
>> between the nodes as its major focus. Hmm, looks like there is no such
>> implementation within Spark itself yet.
>>
>> very much appreciated!
>>
>> Peter
>>
>> On Fri, Oct 19, 2018 at 9:38 AM Peter Rudenko 
>> wrote:
>>
>>> Hey Peter, in the SparkRDMA shuffle plugin (
>>> https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle
>>> file to do Remote Direct Memory Access. If the shuffle data is bigger than
>>> RAM, Mellanox NICs support On-Demand Paging, where the OS invalidates
>>> translations which are no longer valid due to either non-present pages or
>>> mapping changes. So if you have an RDMA-capable NIC (or you can try on the
>>> Azure cloud:
>>> https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/
>>>  ), give it a try. For network-intensive apps you should get better
>>> performance.
>>>
>>> Thanks,
>>> Peter Rudenko
>>>
>>> Thu, 18 Oct 2018 at 18:07, Peter Liu wrote:
>>>
 I would be very interested in the initial question here:

 Is there a production-level, configurable implementation of a memory-only
 shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels), as
 mentioned in this ticket:
 https://github.com/apache/spark/pull/5403 ?

 It would be a quite practical and useful option/feature. Not sure what the
 status of this ticket's implementation is?

 Thanks!

 Peter

 On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair 
 wrote:

> Thanks..great info. Will try and let all know.
>
> Best
>
> On Thu, Oct 18, 2018, 3:12 AM onmstester onmstester <
> onmstes...@zoho.com> wrote:
>
>> Create the ramdisk:
>> mount tmpfs /mnt/spark -t tmpfs -o size=2G
>>
>> Then point spark.local.dir to the ramdisk; how to do that depends on your
>> deployment strategy - for me it was through the SparkConf object before
>> passing it to SparkContext:
>> conf.set("spark.local.dir","/mnt/spark")
>>
>> To validate that Spark is actually using your ramdisk (by default it
>> uses /tmp), ls the ramdisk after running some jobs and you should see
>> Spark directories (with the date in the directory name) on your ramdisk.
>>
>>
>> Sent using Zoho Mail 
>>
>>
>>  On Wed, 17 Oct 2018 18:57:14 +0330, ☼ R Nair wrote:
>>
>> What are the steps to configure this? Thanks
>>
>> On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester <
>> onmstes...@zoho.com.invalid> wrote:
>>
>>
>> Hi,
>> I failed to configure Spark for in-memory shuffle, so currently I'm just
>> using a Linux memory-mapped directory (tmpfs) as the working directory of
>> Spark, and everything is fast
>>
>> Sent using Zoho Mail 
>>
>>
>>
>>


Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hi Peter, we're using a part of Crail - its core library, called disni (
https://github.com/zrlio/disni/). We couldn't reproduce the results from
that blog post; in any case, Crail is a more platform-like approach (it
comes with its own file system), while SparkRDMA is a pluggable approach -
it's just a plugin that you can enable/disable for a particular workload,
and you can use any Hadoop vendor, etc.

The best optimization for shuffle between local JVMs could be to use
something like short-circuit local reads (
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html)
to use a Unix socket for local communication, or to directly read a part
of another JVM's shuffle file. But yes, it's not available in Spark out of
the box.
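
As a rough, illustrative sketch only: short-circuit local reads are an HDFS
client/DataNode feature rather than a Spark shuffle option, but the
client-side settings from the linked Hadoop doc can be applied through
Spark's Hadoop configuration (the domain socket path below is just an
example and must match the DataNode configuration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("short-circuit-read-demo").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Client-side switches for HDFS short-circuit local reads; they only take
// effect if the DataNodes expose the same domain socket.
hadoopConf.set("dfs.client.read.shortcircuit", "true")
hadoopConf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket")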

Thanks,
Peter Rudenko

Fri, 19 Oct 2018 at 16:54, Peter Liu wrote:

> Hi Peter,
>
> thank you for the reply and detailed information! Would this be something
> comparable to Crail? (
> http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html)
> I was more looking for something simple/quick to make the shuffle between
> the local JVMs faster (like the idea of using a local RAM disk) for my
> simple use case.
>
> Of course, a general and thorough implementation should cover the shuffle
> between the nodes as its major focus. Hmm, looks like there is no such
> implementation within Spark itself yet.
>
> very much appreciated!
>
> Peter
>
> On Fri, Oct 19, 2018 at 9:38 AM Peter Rudenko 
> wrote:
>
>> Hey Peter, in the SparkRDMA shuffle plugin (
>> https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file
>> to do Remote Direct Memory Access. If the shuffle data is bigger than RAM,
>> Mellanox NICs support On-Demand Paging, where the OS invalidates translations
>> which are no longer valid due to either non-present pages or mapping
>> changes. So if you have an RDMA-capable NIC (or you can try on the Azure cloud:
>>
>> https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/
>>  ), give it a try. For network-intensive apps you should get better
>> performance.
>>
>> Thanks,
>> Peter Rudenko
>>
>> Thu, 18 Oct 2018 at 18:07, Peter Liu wrote:
>>
>>> I would be very interested in the initial question here:
>>>
>>> Is there a production-level, configurable implementation of a memory-only
>>> shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels), as
>>> mentioned in this ticket:
>>> https://github.com/apache/spark/pull/5403 ?
>>>
>>> It would be a quite practical and useful option/feature. Not sure what the
>>> status of this ticket's implementation is?
>>>
>>> Thanks!
>>>
>>> Peter
>>>
>>> On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair 
>>> wrote:
>>>
 Thanks..great info. Will try and let all know.

 Best

 On Thu, Oct 18, 2018, 3:12 AM onmstester onmstester <
 onmstes...@zoho.com> wrote:

> Create the ramdisk:
> mount tmpfs /mnt/spark -t tmpfs -o size=2G
>
> Then point spark.local.dir to the ramdisk; how to do that depends on your
> deployment strategy - for me it was through the SparkConf object before
> passing it to SparkContext:
> conf.set("spark.local.dir","/mnt/spark")
>
> To validate that Spark is actually using your ramdisk (by default it
> uses /tmp), ls the ramdisk after running some jobs and you should see
> Spark directories (with the date in the directory name) on your ramdisk.
>
>
> Sent using Zoho Mail 
>
>
>  On Wed, 17 Oct 2018 18:57:14 +0330, ☼ R Nair wrote:
>
> What are the steps to configure this? Thanks
>
> On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester <
> onmstes...@zoho.com.invalid> wrote:
>
>
> Hi,
> I failed to configure Spark for in-memory shuffle, so currently I'm just
> using a Linux memory-mapped directory (tmpfs) as the working directory of
> Spark, and everything is fast
>
> Sent using Zoho Mail 
>
>
>
>


Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
 Hi Peter,

thank you for the reply and detailed information! Would this be something
comparable to Crail? (
http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html)
I was more looking for something simple/quick to make the shuffle between
the local JVMs faster (like the idea of using a local RAM disk) for my
simple use case.

Of course, a general and thorough implementation should cover the shuffle
between the nodes as its major focus. Hmm, looks like there is no such
implementation within Spark itself yet.

very much appreciated!

Peter

On Fri, Oct 19, 2018 at 9:38 AM Peter Rudenko 
wrote:

> Hey Peter, in the SparkRDMA shuffle plugin (
> https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file
> to do Remote Direct Memory Access. If the shuffle data is bigger than RAM,
> Mellanox NICs support On-Demand Paging, where the OS invalidates translations
> which are no longer valid due to either non-present pages or mapping
> changes. So if you have an RDMA-capable NIC (or you can try on the Azure cloud:
>
> https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/
>  ), give it a try. For network-intensive apps you should get better
> performance.
>
> Thanks,
> Peter Rudenko
>
> Thu, 18 Oct 2018 at 18:07, Peter Liu wrote:
>
>> I would be very interested in the initial question here:
>>
>> Is there a production-level, configurable implementation of a memory-only
>> shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels), as
>> mentioned in this ticket:
>> https://github.com/apache/spark/pull/5403 ?
>>
>> It would be a quite practical and useful option/feature. Not sure what the
>> status of this ticket's implementation is?
>>
>> Thanks!
>>
>> Peter
>>
>> On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair 
>> wrote:
>>
>>> Thanks..great info. Will try and let all know.
>>>
>>> Best
>>>
>>> On Thu, Oct 18, 2018, 3:12 AM onmstester onmstester 
>>> wrote:
>>>
 Create the ramdisk:
 mount tmpfs /mnt/spark -t tmpfs -o size=2G

 Then point spark.local.dir to the ramdisk; how to do that depends on your
 deployment strategy - for me it was through the SparkConf object before
 passing it to SparkContext:
 conf.set("spark.local.dir","/mnt/spark")

 To validate that Spark is actually using your ramdisk (by default it
 uses /tmp), ls the ramdisk after running some jobs and you should see
 Spark directories (with the date in the directory name) on your ramdisk.
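
A minimal end-to-end sketch of the setup described above, assuming the
/mnt/spark tmpfs mount from the earlier message (the application name is
illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Point Spark's scratch space (shuffle and spill files) at the tmpfs mount.
val conf = new SparkConf()
  .setAppName("ramdisk-shuffle-demo")
  .set("spark.local.dir", "/mnt/spark")

val sc = new SparkContext(conf)
// After running a job, `ls /mnt/spark` should show spark-* scratch directories.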


 Sent using Zoho Mail 


  On Wed, 17 Oct 2018 18:57:14 +0330, ☼ R Nair wrote:

 What are the steps to configure this? Thanks

 On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester <
 onmstes...@zoho.com.invalid> wrote:


 Hi,
 I failed to configure Spark for in-memory shuffle, so currently I'm just
 using a Linux memory-mapped directory (tmpfs) as the working directory of
 Spark, and everything is fast

 Sent using Zoho Mail 






Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hey Peter, in the SparkRDMA shuffle plugin (
https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file to
do Remote Direct Memory Access. If the shuffle data is bigger than RAM,
Mellanox NICs support On-Demand Paging, where the OS invalidates translations
which are no longer valid due to either non-present pages or mapping
changes. So if you have an RDMA-capable NIC (or you can try on the Azure cloud:
https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/
 ), give it a try. For network-intensive apps you should get better
performance.

Thanks,
Peter Rudenko

Thu, 18 Oct 2018 at 18:07, Peter Liu wrote:

> I would be very interested in the initial question here:
>
> Is there a production-level, configurable implementation of a memory-only
> shuffle (similar to the MEMORY_ONLY and MEMORY_AND_DISK storage levels), as
> mentioned in this ticket:
> https://github.com/apache/spark/pull/5403 ?
>
> It would be a quite practical and useful option/feature. Not sure what the
> status of this ticket's implementation is?
>
> Thanks!
>
> Peter
>
> On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair 
> wrote:
>
>> Thanks..great info. Will try and let all know.
>>
>> Best
>>
>> On Thu, Oct 18, 2018, 3:12 AM onmstester onmstester 
>> wrote:
>>
>>> Create the ramdisk:
>>> mount tmpfs /mnt/spark -t tmpfs -o size=2G
>>>
>>> Then point spark.local.dir to the ramdisk; how to do that depends on your
>>> deployment strategy - for me it was through the SparkConf object before
>>> passing it to SparkContext:
>>> conf.set("spark.local.dir","/mnt/spark")
>>>
>>> To validate that Spark is actually using your ramdisk (by default it
>>> uses /tmp), ls the ramdisk after running some jobs and you should see
>>> Spark directories (with the date in the directory name) on your ramdisk.
>>>
>>>
>>> Sent using Zoho Mail 
>>>
>>>
>>>  On Wed, 17 Oct 2018 18:57:14 +0330, ☼ R Nair wrote:
>>>
>>> What are the steps to configure this? Thanks
>>>
>>> On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester <
>>> onmstes...@zoho.com.invalid> wrote:
>>>
>>>
>>> Hi,
>>> I failed to configure Spark for in-memory shuffle, so currently I'm just
>>> using a Linux memory-mapped directory (tmpfs) as the working directory of
>>> Spark, and everything is fast
>>>
>>> Sent using Zoho Mail 
>>>
>>>
>>>
>>>


[Spark for kubernetes] Azure Blob Storage credentials issue

2018-10-19 Thread Oscar Bonilla
Hello,

I'm having the following issue while trying to run Spark for Kubernetes:

2018-10-18 08:48:54 INFO  DAGScheduler:54 - Job 0 failed: reduce at
SparkPi.scala:38, took 1.743177 s
Exception in thread "main" org.apache.spark.SparkException: Job
aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most
recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.244.1.11,
executor 2): org.apache.hadoop.fs.azure.AzureException:
org.apache.hadoop.fs.azure.AzureException: No credentials found for
account datasets83d858296fd0c49b.blob.core.windows.net in the
configuration, and its container datasets is not accessible using
anonymous credentials. Please check if the container exists first. If
it is not publicly available, you have to provide account credentials.
at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1086)
at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:538)
at 
org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1366)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3242)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1897)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:694)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:476)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:755)
at 
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:747)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org
$apache$spark$executor$Executor$$updateDependencies(Executor.scala:747)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:312)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.fs.azure.AzureException: No credentials
found for account datasets83d858296fd0c49b.blob.core.windows.net in
the configuration, and its container datasets is not accessible using
anonymous credentials. Please check if the container exists first. If
it is not publicly available, you have to provide account credentials.
at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.connectUsingAnonymousCredentials(AzureNativeFileSystemStore.java:863)
at 
org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1081)
... 24 more

The command I use to launch the job is:

/opt/spark/bin/spark-submit
--master k8s://
--deploy-mode cluster
--name spark-pi
--class org.apache.spark.examples.SparkPi
--conf spark.executor.instances=5
--conf spark.kubernetes.container.image=
--conf spark.kubernetes.namespace=
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
--conf spark.kubernetes.driver.secrets.spark=/opt/spark/conf
--conf spark.kubernetes.executor.secrets.spark=/opt/spark/conf
wasb://@.blob.core.windows.net/spark-examples_2.11-2.3.2.jar
1

I have a k8s secret named spark with the following content:

apiVersion: v1
kind: Secret
metadata:
  name: spark
  labels:
app: spark
stack: service
type: Opaque
data:
  core-site.xml: |-
{% filter b64encode %}
<configuration>
  <property>
    <name>fs.azure.account.key..blob.core.windows.net</name>
    <value></value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.wasb.Impl</name>
    <value>org.apache.hadoop.fs.azure.Wasb</value>
  </property>
</configuration>
{% endfilter %}
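
For reference, a minimal sketch of setting the same property via Spark's
spark.hadoop.* passthrough instead of a mounted core-site.xml; the account
name placeholder and the AZURE_STORAGE_KEY environment variable stand in for
the redacted values above:

import org.apache.spark.sql.SparkSession

// spark.hadoop.* entries are copied into the Hadoop Configuration that the
// wasb:// FileSystem reads its credentials from.
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.azure.account.key.<account>.blob.core.windows.net",
          sys.env("AZURE_STORAGE_KEY"))  // placeholder for the real storage key
  .getOrCreate()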

The driver pod manages to download the jar dependencies stored in a
container in Azure Blob Storage, as can be seen in this log snippet:

2018-10-18 08:48:16 INFO  Utils:54 - Fetching
wasb://@.blob.core.windows.net/spark-examples_2.11-2.3.2.jar
to