Re: [pyspark 2.4.3] small input csv ~3.4GB gets 40K tasks created

2019-08-30 Thread Chris Teoh
Look at your DAG. Are there lots of CSV files? Does your input CSV
dataframe have lots of partitions to start with? Bear in mind that a cross
join makes the dataset much larger, so expect more tasks.
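
A rough sketch of how to check this, assuming a dataframe read the same
way as in the question (the path, header option, and coalesce target are
all illustrative, not taken from the thread): a cartesian product ends up
with roughly left-side partitions times right-side partitions, so
shrinking each side first caps the task count.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("crossjoin-tasks").getOrCreate()

    # Illustrative path; substitute the real input.
    df = spark.read.csv("/data/input.csv", header=True)

    # How many partitions did the CSV scan produce?
    print(df.rdd.getNumPartitions())

    # A cartesian product has ~left_partitions * right_partitions
    # partitions, so reducing each side before the cross join caps
    # the task count.
    small = df.coalesce(20)              # 20 is an arbitrary example
    pairs = small.crossJoin(small)
    print(pairs.rdd.getNumPartitions())  # roughly 20 * 20 = 400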

On Fri, 30 Aug 2019 at 14:11, Rishi Shah wrote:

> Hi All,
>
> I am scratching my head over this odd behavior: a df (read from
> .csv) of ~3.4GB gets cross joined with itself and creates 50K tasks!
> How do I correlate input size with the number of tasks in this case?
>
> --
> Regards,
>
> Rishi Shah
>


-- 
Chris


[pyspark 2.4.3] small input csv ~3.4GB gets 40K tasks created

2019-08-29 Thread Rishi Shah
Hi All,

I am scratching my head over this odd behavior: a df (read from
.csv) of ~3.4GB gets cross joined with itself and creates 50K tasks!
How do I correlate input size with the number of tasks in this case?
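
One way the numbers could line up, sketched with Spark defaults (every
figure below is an assumed default, not a measurement from this job): a
splittable ~3.4GB scan produces on the order of 27 tasks at the default
128MB split size, while a cartesian product stage gets left times right
partitions, so two sides at the default spark.sql.shuffle.partitions of
200 alone account for 40,000 tasks.

    # Back-of-the-envelope arithmetic only; every figure is an assumed
    # Spark default, not a measurement from this job.
    input_bytes = 3.4 * 1024**3    # ~3.4GB input, from the thread
    split_bytes = 128 * 1024**2    # default spark.sql.files.maxPartitionBytes

    print(round(input_bytes / split_bytes))  # ~27 scan tasks

    # A cartesian product stage gets left_partitions * right_partitions
    # tasks; with both sides at the default spark.sql.shuffle.partitions
    # of 200, the cross join alone contributes 200 * 200 tasks.
    shuffle_partitions = 200
    print(shuffle_partitions ** 2)           # 40000

Checking df.rdd.getNumPartitions() on each side of the join would show
which of these regimes actually applies.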

-- 
Regards,

Rishi Shah