Re: [pyspark 2.4+] BucketBy SortBy doesn't retain sort order

2020-03-03 Thread Rishi Shah
Hi All,
Just checking in to see if anyone has any advice on this.
Thanks,
Rishi
On Mon, Mar 2, 2020 at 9:21 PM Rishi Shah wrote:
> Hi All,
>
> I have 2 large tables (~1TB), I used the following to save both the
> tables. Then when I try to join both tables with join_column, it still does

Stateful Spark Streaming: Required attribute 'value' not found

2020-03-03 Thread Something Something
In a Stateful Spark Streaming application I am writing the 'OutputRow' in the 'updateAcrossEvents' but I keep getting this error (*Required attribute 'value' not found*) while it's trying to write to Kafka. I know from the documentation that 'value' attribute needs to be set but how do I do that
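For context on the error above: Spark's Kafka sink only looks at columns named `key`, `value`, and optionally `topic`, and `value` is mandatory, so a streaming DataFrame of plain 'OutputRow' fields has to be projected into that shape before the write. The following is a minimal PySpark sketch of one way to do that, not the poster's code. The `id` field, the function name `write_to_kafka`, and all parameter names are assumptions made for illustration. The same `selectExpr` call exists in the Scala API.

```python
# Columns the Kafka sink reads. `id` is a hypothetical field of OutputRow;
# replace it with whatever identifies a row in your schema.
KAFKA_COLUMNS = [
    "CAST(id AS STRING) AS key",    # optional message key
    "to_json(struct(*)) AS value",  # required: the whole row serialized as JSON
]


def write_to_kafka(output_rows, servers, topic, checkpoint_dir, mode="update"):
    """Start a streaming query that publishes `output_rows` to Kafka.

    `output_rows` is assumed to be the streaming DataFrame produced by the
    stateful step; `mode` must match the output mode used by that step.
    """
    return (
        output_rows.selectExpr(*KAFKA_COLUMNS)
        .writeStream.format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("topic", topic)
        .option("checkpointLocation", checkpoint_dir)  # the Kafka sink needs one
        .outputMode(mode)
        .start()
    )
```

Serializing the whole row with `to_json(struct(*))` keeps the sketch independent of the actual 'OutputRow' fields; a narrower `struct(...)` would work the same way if only some fields should be published.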

Example of Stateful Spark Structured Streaming with Kafka

2020-03-03 Thread Something Something
There are lots of examples of 'Stateful Structured Streaming' in 'The Definitive Guide' book BUT all of them read JSON from a 'path'. That's working for me. Now I need to read from Kafka. I Googled but I couldn't find any example. I am struggling to map the 'Value' of the Kafka message to my
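As a rough illustration of the mapping step asked about here: the Kafka source delivers each message with binary `key` and `value` columns, so the usual approach is to cast `value` to a string and parse it with `from_json` against an explicit schema. This is a hedged PySpark sketch under the assumption that the messages are JSON. The function name `read_kafka_events`, the intermediate column names `raw` and `event`, and the schema string in the usage note are all hypothetical.

```python
def read_kafka_events(spark, servers, topic, schema_ddl):
    """Return a streaming DataFrame with one column per field in `schema_ddl`.

    `schema_ddl` is a DDL-style schema string describing the JSON payload,
    for example "device_id STRING, reading DOUBLE" (a made-up schema).
    """
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", servers)
        .option("subscribe", topic)
        .load()
        # Kafka's `value` column is binary; cast it before parsing.
        .selectExpr("CAST(value AS STRING) AS raw")
        # Parse the JSON text into a struct column using the given schema.
        .selectExpr(f"from_json(raw, '{schema_ddl}') AS event")
        # Flatten the struct so each JSON field becomes a top-level column.
        .select("event.*")
    )
```

A call such as `read_kafka_events(spark, "localhost:9092", "events", "device_id STRING, reading DOUBLE")` would then yield a DataFrame that can replace the path-based JSON source in the book's stateful examples, assuming the field names match.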

Re: How to collect Spark dataframe write metrics

2020-03-03 Thread Zohar Stiro
Hi, to get DataFrame-level write metrics you can take a look at the following trait: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala and a basic implementation example: