Re: [pyspark2.4+] When to choose RDD over Dataset, was: A lot of tasks failed, but job eventually completes

2020-01-06 Thread Enrico Minack

Hi Rishi,

generally it is better to avoid RDDs if you can and to use the Dataset API. 
With Datasets (formerly DataFrames) Spark can optimize your query / tree 
of transformations, whereas RDDs are opaque to the optimizer. Datasets also 
have an optimized memory footprint. Pure Dataset operations provide 
helpful information on the SQL tab in the Spark UI; for large 
transformation pipelines it is then easier to identify the transformations 
that cause you trouble. Switching from Dataset to RDD at some point hides 
all operations that happen before accessing the RDD, so you lose the query 
debugging capability for that part.
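
For illustration, a minimal sketch (the input path and column names are assumptions, not from this thread): the pure Dataset pipeline is visible to Catalyst and the SQL tab, while everything that runs through the RDD stays opaque.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("dataset-vs-rdd").getOrCreate()

val df = spark.read.parquet("/path/to/input")  // hypothetical input

// Pure Dataset: filter and aggregation appear as an optimized plan on the SQL tab
val optimized = df.filter(col("amount") > 0)
  .groupBy("accountId")
  .agg(sum("amount").as("total"))

// Via the RDD API: Spark only sees opaque functions, so this job
// is not optimized and does not show up on the SQL tab
val opaque = df.rdd
  .map(row => (row.getAs[String]("accountId"), row.getAs[Double]("amount")))
  .filter(_._2 > 0)
  .reduceByKey(_ + _)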


That is my experience.

Enrico


On 06.01.20 at 14:35, Rishi Shah wrote:

Thank you Hemant and Enrico. Much appreciated.

Your input really got me closer to the issue: I realized that tasks 
didn't get enough memory and hence tasks with large partitions kept 
failing. I increased executor memory and at the same time increased the 
number of partitions as well. This made the job succeed with flying 
colors. Really appreciate the help here.


I do have one more question: when do you recommend using RDDs over 
data frames? At times using windows may get a bit complicated, but 
there is usually one way or another to use windows on data frames. I 
always get confused as to when to fall back on the RDD approach. 
Are there any use cases in your experience that warrant RDD use for 
better performance?


Thanks,
Rishi

On Mon, Jan 6, 2020 at 4:18 AM Enrico Minack wrote:


Note that repartitioning helps to increase the number of
partitions (and hence to reduce the size of partitions and
required executor memory), but subsequent transformations like
join will repartition data again with the configured number of
partitions (spark.sql.shuffle.partitions), virtually undoing the
repartitioning, e.g.:

data                  // may have any number of partitions
  .repartition(1000)  // has 1000 partitions
  .join(table)        // has spark.sql.shuffle.partitions partitions

If you use RDDs, you need to configure spark.default.parallelism
rather than spark.sql.shuffle.partitions.
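
For illustration, a minimal sketch of setting both properties up front (the values are only examples):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("partition-tuning")
  .config("spark.sql.shuffle.partitions", "1000")  // Dataset/DataFrame shuffles (joins, aggregations)
  .config("spark.default.parallelism", "1000")     // RDD shuffles (reduceByKey, RDD joins)
  .getOrCreate()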

Given you have 700GB of data, the default of 200 partitions means
that each partition is 3.5 GB (equivalent of input data) in size.
Since increasing executor memory is limited by the available
memory, executor memory does not scale for big data. Increasing
the number of partitions is the natural way of scaling in Spark land.

Having hundreds of tasks that fail is an indication that you do
not suffer from skewed data but from large partitions. Skewed data
usually has a few tasks that keep failing.

It is easy to check for skewed data in the Spark UI. Open a stage
that has failing tasks and look at the Summary Metrics, e.g.:
if the Max of Shuffle Read Size is way higher than the 75th
percentile, then this indicates a poor distribution of the data
(or more precisely, of the partitioning key) of this stage.

You can also sort the tasks by the "Shuffle Read Size / Records"
column and see if numbers are evenly distributed (ideally).
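
Besides the UI, a rough programmatic check is possible as well; this is just a sketch, with df standing for the DataFrame of the stage in question:

import org.apache.spark.sql.functions.{spark_partition_id, desc}

// Count rows per partition: a few partitions far larger than the rest point
// to skew, uniformly huge partitions point to too few partitions.
df.groupBy(spark_partition_id().as("partition"))
  .count()
  .orderBy(desc("count"))
  .show(20, false)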

I hope this helped.

Enrico



On 06.01.20 at 06:27, hemant singh wrote:

You can try repartitioning the data; if it is skewed data then
you may need to salt the keys for better partitioning.
Are you using coalesce or any other function which brings the data to
fewer nodes? Window functions also incur shuffling, which could be
an issue.
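
A minimal salting sketch (the key/value column names are assumptions), in case it helps to visualize the suggestion above:

import org.apache.spark.sql.functions._

// Spread a hot key over 10 salt buckets, aggregate per salted key first,
// then aggregate the partial results back on the original key.
val salted = df
  .withColumn("salt", (rand() * 10).cast("int"))
  .withColumn("saltedKey", concat_ws("_", col("key"), col("salt")))

val partial = salted.groupBy("saltedKey", "key").agg(sum("value").as("partialSum"))
val result  = partial.groupBy("key").agg(sum("partialSum").as("total"))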

On Mon, 6 Jan 2020 at 9:49 AM, Rishi Shah
<rishishah.s...@gmail.com> wrote:

Thanks Hemant, the underlying data volume increased from 550GB to
690GB and now the same job doesn't succeed. I tried
increasing executor memory to 20G as well, it still fails. I
am running this in Databricks and start the cluster with 20G
assigned to the spark.executor.memory property.

Also some more information on the job, I have about 4 window
functions on this dataset before it gets written out.

Any other ideas?

Thanks,
-Shraddha

On Sun, Jan 5, 2020 at 11:06 PM hemant singh
<hemant2...@gmail.com> wrote:

You can try increasing the executor memory; generally
this error comes when there is not enough memory in
individual executors.
The job is getting completed maybe because when tasks are
re-scheduled they go through.
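
As a side note (not from this thread), the number of attempts per task is governed by spark.task.maxFailures, which defaults to 4; that is why a job can still finish even though many individual task attempts fail. A minimal sketch of raising it:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.task.maxFailures", "8")  // default is 4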

Thanks.

On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah
<rishishah.s...@gmail.com> wrote:

Hello All,

One of my jobs keeps getting into this situation
where 100s of tasks keep failing with the below error, but
the job eventually completes.

org.apache.spark.memory.SparkOutOfMemoryError: Unable
to acquire 16384 bytes of memory




Fwd: [Spark Streaming]: Why my Spark Direct stream is sending multiple offset commits to Kafka?

2020-01-06 Thread Raghu B
Hi Spark Community.

I need help with the following issue. I have been researching it for the
last 2 weeks, and as a last and best resort I want to ask the Spark
community.

I am running the following code in Spark:

val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("KafkaTest")
  .set("spark.streaming.kafka.maxRatePerPartition", "10")
  .set("spark.default.parallelism", "10")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.scheduler.mode", "FAIR")

lazy val sparkContext = new SparkContext(sparkConf)
val sparkJob = new SparkLocal

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka-270894369.spark.google.com:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group1",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> "false",
  "heartbeat.interval.ms" -> "13", // 3000
  "request.timeout.ms" -> "15", // 4
  "session.timeout.ms" -> "14", // 3
  "max.poll.interval.ms" -> "14", // isn't a known config
  "max.poll.records" -> "100" // 2147483647
)

val streamingContext = new StreamingContext(sparkContext, Seconds(120))

val topics = Array("topicname")

val kafkaStream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

def messageTuple(tuple: ConsumerRecord[String, String]): (String) = {
  (null) // Removed the code
}

var offset: Array[OffsetRange] = null

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offset = offsetRanges

  rdd.map(row => messageTuple(row))
    .foreachPartition { partition =>
      partition.map(row => null)
        .foreach { record =>
          print("")
          Thread.sleep(5)
        }
    }

  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

streamingContext.start()
streamingContext.awaitTerminationOrTimeout(600)

sys.ShutdownHookThread {
  println("Gracefully shutting down App")
  streamingContext.stop(true, true)
  println("Application stopped")
}




With the above code I am observing that multiple commits are being sent to
Kafka, and I am not sure why.

(Got the below info from the Kafka __consumer_offsets topic)
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864006531, leaderEpoch=Optional.empty, metadata=, commitTimestamp=157773011, expireTimestamp=Some(1577816400011))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864006531, leaderEpoch=Optional.empty, metadata=, commitTimestamp=157773012, expireTimestamp=Some(1577816400012))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864005827, leaderEpoch=Optional.empty, metadata=, commitTimestamp=157773079, expireTimestamp=Some(1577816400079))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120008, expireTimestamp=Some(1577816520008))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120010, expireTimestamp=Some(1577816520010))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120077, expireTimestamp=Some(1577816520077))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240010, expireTimestamp=Some(1577816640010))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240015, expireTimestamp=Some(1577816640015))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240137, expireTimestamp=Some(1577816640137))

[stream_group1,topicname,59]::OffsetAndMetadata(offset=864006531, leaderEpoch=Optional.empty, metadata=, commitTimestamp=157773012, expireTimestamp=Some(1577816400012))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864005827, leaderEpoch=Optional.empty, metadata=, commitTimestamp=157773079, expireTimestamp=Some(1577816400079))
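
For illustration only (not a fix, and not code from this thread): one way to see from the driver exactly which offsets get committed and when is to pass an OffsetCommitCallback to commitAsync, replacing the plain commitAsync call inside foreachRDD:

import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.CanCommitOffsets

// Log every commit the driver performs, including failures
kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges, new OffsetCommitCallback {
  override def onComplete(offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
                          exception: Exception): Unit = {
    if (exception != null) println(s"Commit failed: $exception")
    else println(s"Committed: $offsets")
  }
})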




Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-06 Thread Rishi Shah
Thank you Hemant and Enrico. Much appreciated.

Your input really got me closer to the issue: I realized that tasks didn't
get enough memory and hence tasks with large partitions kept failing. I
increased executor memory and at the same time increased the number of
partitions as well. This made the job succeed with flying colors. Really
appreciate the help here.

I do have one more question: when do you recommend using RDDs over data
frames? At times using windows may get a bit complicated, but there is
usually one way or another to use windows on data frames. I always get
confused as to when to fall back on the RDD approach. Are there any use
cases in your experience that warrant RDD use for better performance?

Thanks,
Rishi

On Mon, Jan 6, 2020 at 4:18 AM Enrico Minack  wrote:

> Note that repartitioning helps to increase the number of partitions (and
> hence to reduce the size of partitions and required executor memory), but
> subsequent transformations like join will repartition data again with the
> configured number of partitions (spark.sql.shuffle.partitions), virtually
> undoing the repartitioning, e.g.:
>
> data                  // may have any number of partitions
>   .repartition(1000)  // has 1000 partitions
>   .join(table)        // has spark.sql.shuffle.partitions partitions
>
> If you use RDDs, you need to configure spark.default.parallelism rather
> than spark.sql.shuffle.partitions.
>
> Given you have 700GB of data, the default of 200 partitions means that each
> partition is 3.5 GB (equivalent of input data) in size. Since increasing
> executor memory is limited by the available memory, executor memory does
> not scale for big data. Increasing the number of partitions is the natural
> way of scaling in Spark land.
>
> Having hundreds of tasks that fail is an indication that you do not suffer
> from skewed data but from large partitions. Skewed data usually has a few
> tasks that keep failing.
>
> It is easy to check for skewed data in the Spark UI. Open a stage that has
> failing tasks and look at the Summary Metrics, e.g.:
> If the Max of Shuffle Read Size is way higher than the 75th
> percentile, then this indicates a poor distribution of the data (or more
> precisely, of the partitioning key) of this stage.
>
> You can also sort the tasks by the "Shuffle Read Size / Records" column
> and see if numbers are evenly distributed (ideally).
>
> I hope this helped.
>
> Enrico
>
>
>
> On 06.01.20 at 06:27, hemant singh wrote:
>
> You can try repartitioning the data; if it is skewed data then you may
> need to salt the keys for better partitioning.
> Are you using coalesce or any other function which brings the data to fewer
> nodes? Window functions also incur shuffling, which could be an issue.
>
> On Mon, 6 Jan 2020 at 9:49 AM, Rishi Shah 
> wrote:
>
>> Thanks Hemant, the underlying data volume increased from 550GB to 690GB and
>> now the same job doesn't succeed. I tried increasing executor memory to
>> 20G as well, it still fails. I am running this in Databricks and start the
>> cluster with 20G assigned to the spark.executor.memory property.
>>
>> Also some more information on the job, I have about 4 window functions on
>> this dataset before it gets written out.
>>
>> Any other ideas?
>>
>> Thanks,
>> -Shraddha
>>
>> On Sun, Jan 5, 2020 at 11:06 PM hemant singh 
>> wrote:
>>
>>> You can try increasing the executor memory; generally this error comes
>>> when there is not enough memory in individual executors.
>>> The job is getting completed maybe because when tasks are re-scheduled
>>> they go through.
>>>
>>> Thanks.
>>>
>>> On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah 
>>> wrote:
>>>
 Hello All,

 One of my jobs keeps getting into this situation where 100s of tasks
 keep failing with the below error, but the job eventually completes.

 org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384
 bytes of memory

 Could someone advice?

 --
 Regards,

 Rishi Shah

>>>
>>
>> --
>> Regards,
>>
>> Rishi Shah
>>
>
>

-- 
Regards,

Rishi Shah


Re: OrderBy Year and Month is not displaying correctly

2020-01-06 Thread Mich Talebzadeh
> The distinct transformation does not preserve order; you need to distinct
> first, then orderBy.

Thanks Enrico. You are correct. Worked fine!

joint_accounts.
  select(year(col("transactiondate")).as("Year")
    , month(col("transactiondate")).as("Month")
    , sum("moneyin").over(wSpec).cast("DECIMAL(10,2)").as("incoming Per Month")
    , sum("moneyout").over(wSpec).cast("DECIMAL(10,2)").as("outgoing Per Month")).
  withColumn(("incoming Per Month"), format_number(col("incoming Per Month"), 2)).
  withColumn(("outgoing Per Month"), format_number(col("outgoing Per Month"), 2)).
  distinct.
  orderBy(col("Year"), col("Month")).
  show(1000, false)

+----+-----+------------------+------------------+
|Year|Month|incoming Per Month|outgoing Per Month|
+----+-----+------------------+------------------+
|2019|9    |13,958.58         |17,920.31         |
|2019|10   |10,029.00         |10,067.52         |
|2019|11   |4,032.30          |4,225.30          |
|2019|12   |742.00            |814.49            |
|2020|1    |1,570.00          |1,540.86          |
+----+-----+------------------+------------------+

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 6 Jan 2020 at 09:35, Gourav Sengupta 
wrote:

> or just use SQL, which is less verbose, easily readable, and takes care of
> all such scenarios. But for some weird reason I have found that people
> using data frame APIs have a perception that using SQL is less
> intelligent. But I think that using less effort to get better output can be
> a measure of intelligence.
>
> Regards,
> Gourav Sengupta
>
> On Mon, Jan 6, 2020 at 9:23 AM Enrico Minack 
> wrote:
>
>> The distinct transformation does not preserve order; you need to distinct
>> first, then orderBy.
>>
>> Enrico
>>
>>
>> On 06.01.20 at 00:39, Mich Talebzadeh wrote:
>>
>> Hi,
>>
>> I am working out monthly outgoing etc from an account and I am using the
>> following code
>>
>> import org.apache.spark.sql.expressions.Window
>> val wSpec = Window.partitionBy(year(col("transactiondate")), month(col("transactiondate")))
>> joint_accounts.
>>   select(year(col("transactiondate")).as("Year")
>>     , month(col("transactiondate")).as("Month")
>>     , sum("moneyin").over(wSpec).cast("DECIMAL(10,2)").as("incoming Per Month")
>>     , sum("moneyout").over(wSpec).cast("DECIMAL(10,2)").as("outgoing Per Month")).
>>   orderBy(year(col("transactiondate")), month(col("transactiondate"))).
>>   distinct.
>>   show(1000,false)
>>
>> This shows as follows:
>>
>>
>> +----+-----+------------------+------------------+
>> |Year|Month|incoming Per Month|outgoing Per Month|
>> +----+-----+------------------+------------------+
>> |2019|9    |13958.58          |17920.31          |
>> |2019|11   |4032.30           |4225.30           |
>> |2020|1    |1530.00           |1426.91           |
>> |2019|10   |10029.00          |10067.52          |
>> |2019|12   |742.00            |814.49            |
>> +----+-----+------------------+------------------+
>>
>> however the orderBy is not correct, as I expect to see the 2020 record and
>> the 2019 records in the order of year and month.
>>
>> Any suggestions?
>>
>> Thanks,
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>>


Re: OrderBy Year and Month is not displaying correctly

2020-01-06 Thread Gourav Sengupta
or just use SQL, which is less verbose, easily readable, and takes care of
all such scenarios. But for some weird reason I have found that people
using data frame APIs have a perception that using SQL is less
intelligent. But I think that using less effort to get better output can be
a measure of intelligence.
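
For illustration, a sketch of roughly the same aggregation in SQL (assuming joint_accounts is available as a DataFrame and spark is the active session; a GROUP BY stands in for the window-plus-distinct construction used elsewhere in this thread):

joint_accounts.createOrReplaceTempView("joint_accounts")

spark.sql("""
  SELECT year(transactiondate)                AS Year,
         month(transactiondate)               AS Month,
         CAST(SUM(moneyin)  AS DECIMAL(10,2)) AS `incoming Per Month`,
         CAST(SUM(moneyout) AS DECIMAL(10,2)) AS `outgoing Per Month`
  FROM joint_accounts
  GROUP BY year(transactiondate), month(transactiondate)
  ORDER BY Year, Month
""").show(1000, false)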

Regards,
Gourav Sengupta

On Mon, Jan 6, 2020 at 9:23 AM Enrico Minack  wrote:

> The distinct transformation does not preserve order; you need to distinct
> first, then orderBy.
>
> Enrico
>
>
> On 06.01.20 at 00:39, Mich Talebzadeh wrote:
>
> Hi,
>
> I am working out monthly outgoing etc from an account and I am using the
> following code
>
> import org.apache.spark.sql.expressions.Window
> val wSpec = Window.partitionBy(year(col("transactiondate")), month(col("transactiondate")))
> joint_accounts.
>   select(year(col("transactiondate")).as("Year")
>     , month(col("transactiondate")).as("Month")
>     , sum("moneyin").over(wSpec).cast("DECIMAL(10,2)").as("incoming Per Month")
>     , sum("moneyout").over(wSpec).cast("DECIMAL(10,2)").as("outgoing Per Month")).
>   orderBy(year(col("transactiondate")), month(col("transactiondate"))).
>   distinct.
>   show(1000,false)
>
> This shows as follows:
>
>
> +----+-----+------------------+------------------+
> |Year|Month|incoming Per Month|outgoing Per Month|
> +----+-----+------------------+------------------+
> |2019|9    |13958.58          |17920.31          |
> |2019|11   |4032.30           |4225.30           |
> |2020|1    |1530.00           |1426.91           |
> |2019|10   |10029.00          |10067.52          |
> |2019|12   |742.00            |814.49            |
> +----+-----+------------------+------------------+
>
> however the orderBy is not correct, as I expect to see the 2020 record and
> the 2019 records in the order of year and month.
>
> Any suggestions?
>
> Thanks,
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>


Re: OrderBy Year and Month is not displaying correctly

2020-01-06 Thread Enrico Minack
The distinct transformation does not preserve order; you need to 
distinct first, then orderBy.
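
A minimal sketch of that order of operations (the column names are assumed, not from this thread):

import org.apache.spark.sql.functions.col

df.select("Year", "Month", "incoming Per Month", "outgoing Per Month")
  .distinct()                          // deduplicate first; this shuffle does not preserve order
  .orderBy(col("Year"), col("Month"))  // sort last, so the final result stays ordered
  .show()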


Enrico


On 06.01.20 at 00:39, Mich Talebzadeh wrote:

Hi,

I am working out monthly outgoing etc from an account and I am using 
the following code


import org.apache.spark.sql.expressions.Window
val wSpec = Window.partitionBy(year(col("transactiondate")), month(col("transactiondate")))

joint_accounts.
  select(year(col("transactiondate")).as("Year")
    , month(col("transactiondate")).as("Month")
    , sum("moneyin").over(wSpec).cast("DECIMAL(10,2)").as("incoming Per Month")
    , sum("moneyout").over(wSpec).cast("DECIMAL(10,2)").as("outgoing Per Month")).
  orderBy(year(col("transactiondate")), month(col("transactiondate"))).
  distinct.
  show(1000,false)

This shows as follows:


+----+-----+------------------+------------------+
|Year|Month|incoming Per Month|outgoing Per Month|
+----+-----+------------------+------------------+
|2019|9    |13958.58          |17920.31          |
|2019|11   |4032.30           |4225.30           |
|2020|1    |1530.00           |1426.91           |
|2019|10   |10029.00          |10067.52          |
|2019|12   |742.00            |814.49            |
+----+-----+------------------+------------------+

however the orderBy is not correct, as I expect to see the 2020 record and 
the 2019 records in the order of year and month.


Any suggestions?

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for 
any loss, damage or destruction of data or any other property which 
may arise from relying on this email's technical content is explicitly 
disclaimed. The author will in no case be liable for any monetary 
damages arising from such loss, damage or destruction.






Re: [pyspark2.4+] A lot of tasks failed, but job eventually completes

2020-01-06 Thread Enrico Minack
Note that repartitioning helps to increase the number of partitions (and 
hence to reduce the size of partitions and required executor memory), 
but subsequent transformations like join will repartition data again 
with the configured number of partitions 
(spark.sql.shuffle.partitions), virtually undoing the repartitioning, 
e.g.:

data                  // may have any number of partitions
  .repartition(1000)  // has 1000 partitions
  .join(table)        // has spark.sql.shuffle.partitions partitions

If you use RDDs, you need to configure spark.default.parallelism 
rather than spark.sql.shuffle.partitions.


Given you have 700GB of data, the default of 200 partitions means that 
each partition is 3.5 GB (equivalent of input data) in size. Since 
increasing executor memory is limited by the available memory, executor 
memory does not scale for big data. Increasing the number of partitions 
is the natural way of scaling in Spark land.


Having hundreds of tasks that fail is an indication that you do not 
suffer from skewed data but from large partitions. Skewed data usually 
has a few tasks that keep failing.


It is easy to check for skewed data in the Spark UI. Open a stage that 
has failing tasks and look at the Summary Metrics, e.g.:
If the Max of Shuffle Read Size is way higher than the 75th 
percentile, then this indicates a poor distribution of the data (or more 
precisely, of the partitioning key) of this stage.


You can also sort the tasks by the "Shuffle Read Size / Records" column 
and see if numbers are evenly distributed (ideally).


I hope this helped.

Enrico



On 06.01.20 at 06:27, hemant singh wrote:
You can try repartitioning the data; if it is skewed data then you 
may need to salt the keys for better partitioning.
Are you using coalesce or any other function which brings the data to 
fewer nodes? Window functions also incur shuffling, which could be an 
issue.


On Mon, 6 Jan 2020 at 9:49 AM, Rishi Shah wrote:


Thanks Hemant, the underlying data volume increased from 550GB to
690GB and now the same job doesn't succeed. I tried increasing
executor memory to 20G as well, it still fails. I am running this in
Databricks and start the cluster with 20G assigned to the
spark.executor.memory property.

Also some more information on the job, I have about 4 window
functions on this dataset before it gets written out.

Any other ideas?

Thanks,
-Shraddha

On Sun, Jan 5, 2020 at 11:06 PM hemant singh <hemant2...@gmail.com> wrote:

You can try increasing the executor memory; generally this
error comes when there is not enough memory in individual
executors.
The job is getting completed maybe because when tasks are
re-scheduled they go through.

Thanks.

On Mon, 6 Jan 2020 at 5:47 AM, Rishi Shah
<rishishah.s...@gmail.com>
wrote:

Hello All,

One of my jobs keeps getting into this situation where
100s of tasks keep failing with the below error, but the job
eventually completes.

org.apache.spark.memory.SparkOutOfMemoryError: Unable to
acquire 16384 bytes of memory

Could someone advice?

-- 
Regards,


Rishi Shah



-- 
Regards,


Rishi Shah