Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-05 Thread hemant singh
You can try repartitioning the data; if the data is skewed, you may need to
salt the keys for better partitioning.
Are you using coalesce or any other function that brings the data onto fewer
nodes? Window functions also incur shuffling, which could be an issue.
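
For reference, a minimal sketch of key salting before a skewed aggregation.
The column names, values and bucket count are made up for illustration
(Scala shown; the DataFrame API is the same in PySpark 2.4):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("salting-sketch").getOrCreate()
import spark.implicits._

// Hypothetical skewed input: a handful of account_id values dominate.
val df = Seq(("acct-1", 10.0), ("acct-1", 20.0), ("acct-2", 5.0))
  .toDF("account_id", "amount")

val saltBuckets = 16

// Pass 1: spread each hot key over saltBuckets sub-keys so no single task
// has to hold all rows for one account_id.
val partial = df
  .withColumn("salt", (rand() * saltBuckets).cast("int"))
  .groupBy($"account_id", $"salt")
  .agg(sum($"amount").as("partial_sum"))

// Pass 2: drop the salt and combine the partial sums per real key.
val result = partial
  .groupBy($"account_id")
  .agg(sum($"partial_sum").as("total_amount"))

result.show()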

On Mon, 6 Jan 2020 at 9:49 AM, Rishi Shah  wrote:

> Thanks Hemant. The underlying data volume increased from 550GB to 690GB,
> and now the same job doesn't succeed. I tried increasing executor memory
> to 20G as well; it still fails. I am running this on Databricks and start
> the cluster with 20G assigned to the spark.executor.memory property.
>
> Some more information on the job: I have about 4 window functions on this
> dataset before it gets written out.
>
> Any other ideas?
>
> Thanks,
> -Shraddha
>
> On Sun, Jan 5, 2020 at 11:06 PM hemant singh  wrote:
>
>> You can try increasing the executor memory; this error generally comes up
>> when there is not enough memory in individual executors.
>> The job may be completing because the re-scheduled tasks eventually go
>> through.
>>
>> Thanks.
>>
>> On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah 
>> wrote:
>>
>>> Hello All,
>>>
>>> One of my jobs keeps getting into a situation where hundreds of tasks
>>> fail with the error below, but the job eventually completes.
>>>
>>> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384
>>> bytes of memory
>>>
>>> Could someone advise?
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>
> --
> Regards,
>
> Rishi Shah
>


Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-05 Thread Rishi Shah
Thanks Hemant. The underlying data volume increased from 550GB to 690GB,
and now the same job doesn't succeed. I tried increasing executor memory to
20G as well; it still fails. I am running this on Databricks and start the
cluster with 20G assigned to the spark.executor.memory property.

Some more information on the job: I have about 4 window functions on this
dataset before it gets written out.

Any other ideas?

Thanks,
-Shraddha

On Sun, Jan 5, 2020 at 11:06 PM hemant singh  wrote:

> You can try increasing the executor memory; this error generally comes up
> when there is not enough memory in individual executors.
> The job may be completing because the re-scheduled tasks eventually go
> through.
>
> Thanks.
>
> On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah 
> wrote:
>
>> Hello All,
>>
>> One of my jobs keeps getting into a situation where hundreds of tasks
>> fail with the error below, but the job eventually completes.
>>
>> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384
>> bytes of memory
>>
>> Could someone advise?
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>

-- 
Regards,

Rishi Shah


Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-05 Thread hemant singh
You can try increasing the executor memory; this error generally comes up
when there is not enough memory in individual executors.
The job may be completing because the re-scheduled tasks eventually go
through.
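
For reference, a minimal sketch of where these settings go; the values are
illustrative only, not tuned for this particular job or cluster:

import org.apache.spark.sql.SparkSession

// Illustrative values only. Raising executor memory (and overhead) gives
// each task more room when it dies with "Unable to acquire N bytes of
// memory"; more shuffle partitions shrinks the per-task footprint of wide
// window operations.
val spark = SparkSession.builder()
  .appName("window-heavy-job")
  .config("spark.executor.memory", "20g")
  .config("spark.executor.memoryOverhead", "4g")
  .config("spark.sql.shuffle.partitions", "800")
  .getOrCreate()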

Thanks.

On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah  wrote:

> Hello All,
>
> One of my jobs keeps getting into a situation where hundreds of tasks
> fail with the error below, but the job eventually completes.
>
> org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384
> bytes of memory
>
> Could someone advise?
>
> --
> Regards,
>
> Rishi Shah
>


[pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-05 Thread Rishi Shah
Hello All,

One of my jobs keeps getting into a situation where hundreds of tasks fail
with the error below, but the job eventually completes.

org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384
bytes of memory

Could someone advise?

-- 
Regards,

Rishi Shah


OrderBy Year and Month is not displaying correctly

2020-01-05 Thread Mich Talebzadeh
Hi,

I am working out monthly outgoings etc. from an account, and I am using the
following code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val wSpec =
  Window.partitionBy(year(col("transactiondate")), month(col("transactiondate")))

joint_accounts.
  select(year(col("transactiondate")).as("Year")
       , month(col("transactiondate")).as("Month")
       , sum("moneyin").over(wSpec).cast("DECIMAL(10,2)").as("incoming Per Month")
       , sum("moneyout").over(wSpec).cast("DECIMAL(10,2)").as("outgoing Per Month")).
  orderBy(year(col("transactiondate")), month(col("transactiondate"))).
  distinct.
  show(1000, false)

This shows as follows:


+----+-----+------------------+------------------+
|Year|Month|incoming Per Month|outgoing Per Month|
+----+-----+------------------+------------------+
|2019|9    |13958.58          |17920.31          |
|2019|11   |4032.30           |4225.30           |
|2020|1    |1530.00           |1426.91           |
|2019|10   |10029.00          |10067.52          |
|2019|12   |742.00            |814.49            |
+----+-----+------------------+------------------+

However, the orderBy is not correct, as I expect to see the 2019 records and
the 2020 record in the order of year and month.

Any suggestions?
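
One thing worth checking, sketched below: distinct runs after the orderBy in
the pipeline above and adds its own shuffle, which does not preserve the
sort. Aggregating with groupBy instead of a window plus distinct, and
sorting last, keeps the rows in Year, Month order. This is a sketch only,
reusing the column names from the snippet above:

joint_accounts.
  groupBy(year(col("transactiondate")).as("Year")
        , month(col("transactiondate")).as("Month")).
  agg(sum("moneyin").cast("DECIMAL(10,2)").as("incoming Per Month")
    , sum("moneyout").cast("DECIMAL(10,2)").as("outgoing Per Month")).
  orderBy("Year", "Month").  // sort last so no later shuffle reorders rows
  show(1000, false)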

Thanks,

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: How more than one spark job can write to same partition in the parquet file

2020-01-05 Thread Iqbal Singh
Hey Chetan,

I have not quite got your question. Are you trying to write to a partition
from two actions, or are you looking to write from two jobs? Except for
maintaining state for dataset completeness in that case, I don't see any
issues.

We are writing data to a partition using two different actions in a single
Spark job; partition here means an HDFS directory, not a Hive partition.
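
For illustration, a minimal sketch of that single-job case; the DataFrames,
column names and HDFS path are made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("two-actions-one-dir").getOrCreate()
import spark.implicits._

// Hypothetical inputs and output path, for illustration only.
val dfA = Seq(("a", 1), ("b", 2)).toDF("key", "value")
val dfB = Seq(("c", 3)).toDF("key", "value")
val outputPath = "hdfs:///data/events/dt=2020-01-05"  // plain directory, not a Hive partition

// Two separate write actions in the same Spark application. Append mode lets
// the second action add its files alongside the first instead of overwriting.
dfA.write.mode("append").parquet(outputPath)
dfB.write.mode("append").parquet(outputPath)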



On Thu, Dec 12, 2019 at 1:37 AM ayan guha  wrote:

> We partitioned data logically for 2 different jobs...in our use case based
> on geography...
>
> On Thu, 12 Dec 2019 at 3:39 pm, Chetan Khatri 
> wrote:
>
>> Thanks. If you can share an alternative design change, I would love to
>> hear from you.
>>
>> On Wed, Dec 11, 2019 at 9:34 PM ayan guha  wrote:
>>
>>> No, we faced problems with that setup.
>>>
>>> On Thu, 12 Dec 2019 at 11:14 am, Chetan Khatri <
>>> chetan.opensou...@gmail.com> wrote:
>>>
 Hi Spark Users,
 would it be possible to write to the same partition of a Parquet file
 through two concurrent Spark jobs with different Spark sessions?

 thanks

>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>> --
> Best Regards,
> Ayan Guha
>


unsubscribe

2020-01-05 Thread Bruno S. de Barros


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2020-01-05 Thread Rishabh Pugalia
unsubscribe

-- 
Thanks and Best Regards,
Rishabh