Re: PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-07 Thread ZHANG Wei
AFAICT, there might be data skew: some partitions got too many rows, which hit the memory limit. Trying .groupBy().count() or .aggregateByKey().count() may help check each partition's data size. If there is no data skew, increasing the .groupBy() parameter `numPartitions` is worth a try.
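A minimal PySpark sketch of both suggestions, assuming an active session `spark` and a DataFrame `df` with placeholder columns "group_col" and "value" (none of these names are from the thread); in the DataFrame API the shuffle parallelism comes from `spark.sql.shuffle.partitions` rather than a `numPartitions` argument:

```python
from pyspark.sql import functions as F

# `spark`, `df`, "group_col", and "value" are placeholders, not from the thread.

# 1. Check for skew: row count per distinct group key, largest first.
df.groupBy("group_col").count().orderBy(F.desc("count")).show(20)

# 2. If the distribution looks even, more shuffle partitions may keep each
#    task's slice of data small enough for the aggregation.
spark.conf.set("spark.sql.shuffle.partitions", "2000")
result = df.groupBy("group_col").agg(F.sum("value").alias("total"))
```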

RE: [E] Re: Pyspark Kafka Structured Stream not working.

2020-05-07 Thread Vijayant Kumar
Hi Jungtaek, Thanks for the response. It appears to be #1. I would appreciate it if you could share a sample command to submit the Spark application. From: Jungtaek Lim [mailto:kabhwan.opensou...@gmail.com] Sent: Wednesday, May 06, 2020 8:24 PM To: Vijayant Kumar Cc: user@spark.apache.org
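A hedged sketch of what such an application and its submission might look like; the app name, broker address, topic, and package version are assumptions, not taken from the thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka_stream_example").getOrCreate()

# Broker and topic are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load())

query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())
query.awaitTermination()

# The Kafka source is not bundled with Spark, so the job is typically
# submitted with the matching package, e.g. (version is an assumption):
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 app.py
```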

java.lang.OutOfMemoryError Spark Worker

2020-05-07 Thread Hrishikesh Mishra
Hi, I am getting an out-of-memory error in the worker log in streaming jobs every couple of hours, after which the worker dies. There is no shuffle, no aggregation, no caching in the job; it's just a transformation. I'm not able to identify where the problem is, driver or executor, and why the worker keeps dying

[Spark SQL][Beginner] Spark throw Catalyst error while writing the dataframe in ORC format

2020-05-07 Thread Deepak Garg
Hi, I am getting the following error while running a Spark job. The error occurs when Spark tries to write the dataframe in ORC format. I am pasting the error trace. Any help in resolving this would be appreciated. Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

Re: [Spark SQL][Beginner] Spark throw Catalyst error while writing the dataframe in ORC format

2020-05-07 Thread Jeff Evans
You appear to be hitting the broadcast timeout. See: https://stackoverflow.com/a/41126034/375670 On Thu, May 7, 2020 at 8:56 AM Deepak Garg wrote: > Hi, > > I am getting following error while running a spark job. Error > occurred when Spark is trying to write the dataframe in ORC format . I am
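Two common mitigations for a broadcast timeout, raising the timeout or disabling automatic broadcast joins, can be set on the active session; the values here are illustrative, not from the thread:

```python
# `spark` is the active SparkSession.
spark.conf.set("spark.sql.broadcastTimeout", "1200")          # default is 300 seconds
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  # disable automatic broadcast joins
```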

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-07 Thread Jeff Evans
You might want to double check your Hadoop config files. From the stack trace it looks like this is happening when simply trying to load configuration (XML files). Make sure they're well formed. On Thu, May 7, 2020 at 6:12 AM Hrishikesh Mishra wrote: > Hi > > I am getting out of memory error
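A quick way to check those XML files for well-formedness is a small script such as the sketch below; the config directory path is an assumption for wherever HADOOP_CONF_DIR points:

```python
import glob
import xml.etree.ElementTree as ET

# "/etc/hadoop/conf" is a placeholder path; point this at HADOOP_CONF_DIR.
for path in glob.glob("/etc/hadoop/conf/*.xml"):
    try:
        ET.parse(path)
        print(f"OK         {path}")
    except ET.ParseError as err:
        print(f"MALFORMED  {path}: {err}")
```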

RE: PyArrow Exception in Pandas UDF GROUPEDAGG()

2020-05-07 Thread Gautham Acharya
Thanks for the quick reply, Zhang. I don't think we have too much data skew, and if we do, there isn't much of a way around it: we need to group by this specific column in order to run aggregates. I'm running this with PySpark, and it doesn't look like the groupBy() function takes a
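The DataFrame groupBy() indeed takes no partition-count argument; a hedged sketch of the usual alternatives (raising the shuffle-partition config, or an explicit repartition on the grouping column before a grouped-agg pandas UDF), with placeholder names throughout:

```python
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

# `spark`, `df`, "group_col", and "value" are placeholders, not from the thread.

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_value(v):
    return v.mean()

# Option 1: raise the shuffle parallelism used by groupBy().agg().
spark.conf.set("spark.sql.shuffle.partitions", "2000")

# Option 2: repartition on the grouping column before aggregating.
result = (df.repartition(2000, "group_col")
            .groupBy("group_col")
            .agg(mean_value(F.col("value")).alias("mean_value")))
```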

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-07 Thread Hrishikesh Mishra
It's only happening for the Hadoop config. The exception traces are different each time it dies, and jobs run for a couple of hours before the worker dies. Another reason: 20/05/02 02:26:34 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[ExecutorRunner for

Re: How to populate all possible combination values in columns using Spark SQL

2020-05-07 Thread Aakash Basu
Hi, I've updated the SO question with masked data and added the year column and other requirements. Please take a look. Hope this helps in solving the problem. Thanks and regards, AB On Thu 7 May, 2020, 10:59 AM Sonal Goyal, wrote: > As mentioned in the comments on SO, can you provide a (masked)

No. of active states?

2020-05-07 Thread Something Something
Is there a way to get the total no. of active states in memory at any given point in a Stateful Spark Structured Streaming job? We are thinking of using this metric for 'Auto Scaling' our Spark cluster.

Dynamically changing maxOffsetsPerTrigger

2020-05-07 Thread Something Something
Is there a way to dynamically modify the value of 'maxOffsetsPerTrigger' while a Stateful Structured Streaming job is running? We are thinking of auto-scaling our Spark cluster, but if we don't modify the value of 'maxOffsetsPerTrigger' dynamically, would adding more VMs to the cluster help? I don't
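For context, maxOffsetsPerTrigger is a Kafka source option read when the query starts; a minimal sketch of where it is set (broker, topic, and the cap value are placeholders, and `spark` is the active session):

```python
# Broker, topic, and the cap value are placeholders.
stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "my_topic")
             .option("maxOffsetsPerTrigger", "100000")  # fixed at query start
             .load())

# Changing the cap generally means stopping the query and restarting it with
# a new value; the checkpoint lets it resume from the last committed offsets.
```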

Re: No. of active states?

2020-05-07 Thread Jungtaek Lim
Have you looked through the metrics for state operators? They provide the "total rows" of state, and starting from Spark 2.4 they also provide additional metrics specific to HDFSBackedStateStoreProvider, including overall estimated memory usage.
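In PySpark these show up in the query progress; a minimal sketch of reading them from an already started query `query` (the variable name is a placeholder):

```python
# `query` is a running StreamingQuery; lastProgress is a dict mirroring the
# StreamingQueryProgress JSON (or None before the first progress update).
for op in (query.lastProgress or {}).get("stateOperators", []):
    print(op["numRowsTotal"],       # rows currently held in state
          op["memoryUsedBytes"],    # estimated state memory usage
          op.get("customMetrics"))  # extra HDFSBackedStateStoreProvider metrics (Spark 2.4+)
```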

Re: [E] Re: Pyspark Kafka Structured Stream not working.

2020-05-07 Thread Jungtaek Lim
It's not either 1 or 2; both items apply. I haven't played with DStream + pyspark, but given the error message is clear, you'll probably want to change the client.id "Python Kafka streamer" to follow the naming convention described in the error message. On Thu, May 7, 2020 at 3:55 PM
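If the stream is built on the DStream Kafka API, the id is set in the Kafka params passed to createDirectStream; a hedged sketch with placeholder broker/topic values and an identifier-style client.id (whether this value satisfies the convention in the error message is an assumption):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # DStream Kafka API (Spark 2.x)

sc = SparkContext(appName="kafka_dstream_example")
ssc = StreamingContext(sc, 10)

# Broker and topic are placeholders; the point is the client.id value,
# renamed from "Python Kafka streamer" to an identifier-style string.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["my_topic"],
    kafkaParams={
        "metadata.broker.list": "localhost:9092",
        "client.id": "python_kafka_streamer",
    },
)
```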

Re: No. of active states?

2020-05-07 Thread Something Something
No. We are already capturing those metrics (e.g. numInputRows, inputRowsPerSecond). I am talking about the number of states in memory at any given time. On Thu, May 7, 2020 at 4:31 PM Jungtaek Lim wrote: > If you're referring total "entries" in all states in SS job, it's being > provided via

Re: No. of active states?

2020-05-07 Thread Jungtaek Lim
If you're referring to the total "entries" across all states in an SS job, it's provided via StreamingQueryListener. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries Hope this helps. On Fri, May 8, 2020 at 3:26 AM Something Something wrote: