Re: 1 task per executor

2019-05-28 Thread Arnaud LARROQUE
Hi, How many files do you read? Are they splittable? If you have 4 non-splittable files, your dataset will have 4 partitions and you will see only one task per partition, each handled by one executor. Regards, Arnaud On Tue, May 28, 2019 at 10:06 AM Sachit Murarka wrote: > Hi All, > > I am using
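The rule Arnaud describes can be sketched as a small model (not Spark code): each non-splittable file yields exactly one partition, while a splittable file can be divided into roughly size/split-size chunks. The 128 MB split size below is an assumed default for illustration.

```python
def count_partitions(files, split_size=128 * 1024 * 1024):
    """Estimate input partitions for a list of (size_bytes, splittable)
    pairs. Non-splittable files (e.g. gzip) give one partition each;
    splittable files give ceil(size / split_size) partitions."""
    partitions = 0
    for size, splittable in files:
        if splittable:
            partitions += max(1, -(-size // split_size))  # ceil division
        else:
            partitions += 1
    return partitions

# Four 500 MB gzip files (non-splittable): only 4 partitions, hence
# 4 tasks, no matter how many executor cores are available.
print(count_partitions([(500 * 1024 * 1024, False)] * 4))  # 4
```

A repartition after reading is the usual way to spread such a dataset over more tasks, at the cost of a shuffle.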

Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-21 Thread Arnaud LARROQUE
Hi Shivam, In the end, the file takes only its own space regardless of the block size. So if your file is just a few KB, it will take only those few KB. But I've noticed that when the file is written, somehow a block is allocated and the Namenode considers that all the block size is
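The distinction in the reply can be sketched as follows: a small file occupies only its real size on disk, but the Namenode still tracks a whole block entry for it. The 128 MB block size is an assumed common default, not from the thread.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS default block size

def hdfs_usage(file_size):
    """Model of HDFS accounting: bytes on disk equal the file's real
    size, while the Namenode tracks ceil(size / block_size) blocks."""
    blocks = max(1, math.ceil(file_size / BLOCK_SIZE))
    return {"bytes_on_disk": file_size, "blocks_tracked": blocks}

# A 4 KB file consumes 4096 bytes of datanode storage but still
# registers as one full block entry in the Namenode's metadata.
print(hdfs_usage(4 * 1024))
```

This is why many tiny files strain the Namenode (one block entry each) even though they waste little actual disk space.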

Re: State of datasource api v2

2019-01-14 Thread Arnaud LARROQUE
Hi Vladimir, I've tried to do the same here when I attempted to write a Spark connector for remote files. From my point of view, there was a lot of change in the V2 API => better semantics at least! I understood that only continuous streaming uses DataSource V2 (not sure if I'm correct). But for file

Re: Performance Issue

2019-01-13 Thread Arnaud LARROQUE
Hi, Indeed Spark uses spark.sql.autoBroadcastJoinThreshold to choose whether it auto-broadcasts a dataset or not. The default value is 10 MB. You may execute an explain and check the different plans to see whether BroadcastHashJoins are being used, and change the threshold accordingly. There is no use to increase
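The planner's choice described here can be sketched as a simplified decision function (the real Spark planner considers more cases, such as broadcast hints and statistics quality; this is only a model):

```python
# Default for spark.sql.autoBroadcastJoinThreshold: 10 MB.
DEFAULT_THRESHOLD = 10 * 1024 * 1024

def pick_join_strategy(estimated_size, threshold=DEFAULT_THRESHOLD):
    """Simplified model: a join side whose estimated size fits under
    the threshold is broadcast; otherwise fall back to a sort-merge
    join. A threshold of -1 disables auto-broadcasting entirely."""
    if threshold >= 0 and estimated_size <= threshold:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

print(pick_join_strategy(2 * 1024 * 1024))    # small side -> BroadcastHashJoin
print(pick_join_strategy(500 * 1024 * 1024))  # large side -> SortMergeJoin
```

In practice, running `df.explain()` on the joined DataFrame shows which physical join operator was actually chosen.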