Accumulators and other important metrics for your job

2021-05-27 Thread Hamish Whittal
Hi folks, I have a problematic dataset I'm working with and am trying to find ways of "debugging" the data. For example, the simplest thing I would like to do is to know how many rows of data I've read and compare that to a simple count of the lines in the file. I could do: df.count() but
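
A minimal sketch of the accumulator idea, assuming PySpark and a hypothetical input path (df.count() remains the reliable number, since an accumulator can over-count if a stage is retried; it is useful mainly as a debugging probe):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("row-count-check").getOrCreate()
    sc = spark.sparkContext

    path = "input.csv"  # hypothetical path

    # Accumulator incremented once per row Spark actually processes.
    rows_read = sc.accumulator(0)

    df = spark.read.option("header", "true").csv(path)
    df.foreach(lambda row: rows_read.add(1))  # forces a full pass over the data

    # Raw line count of the same file for comparison (header included).
    raw_lines = sc.textFile(path).count()
    print("rows read:", rows_read.value, "raw lines:", raw_lines)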

Re: Keeping track of how long something has been in a queue

2020-09-04 Thread Hamish Whittal
Sorry, I moved a paragraph. (2) If Ms green.th was first seen at 13:04:04, then at 13:04:05 and finally at 13:04:17, she's been in the queue for 13 seconds (ignoring the ms).

Keeping track of how long something has been in a queue

2020-09-04 Thread Hamish Whittal
Hi folks, I have a stream coming from Kafka. It has this schema: { "id": 4, "account_id": 1070998, "uid": "green.th", "last_activity_time": "2020-09-03 13:04:04.520129" } Another event arrives a few milliseconds/seconds later: { "id": 9, "account_id": 1070998, "uid":
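
A batch-mode sketch of the "time in queue" calculation, assuming the events land in a DataFrame named events with the schema above (a true streaming version on Spark 2.4 would need watermarking or arbitrary stateful processing, which is Scala/Java-only there):

    from pyspark.sql import functions as F

    events = events.withColumn("ts", F.to_timestamp("last_activity_time"))

    queue_time = (events.groupBy("uid")
                        .agg(F.min("ts").alias("first_seen"),
                             F.max("ts").alias("last_seen"))
                        .withColumn("seconds_in_queue",
                                    F.unix_timestamp("last_seen")
                                    - F.unix_timestamp("first_seen")))
    # green.th: 13:04:17 - 13:04:04 = 13 seconds, as in the example above.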

Stream to Stream joins

2020-08-24 Thread Hamish Whittal
Hi folks, I've got a stream coming from Kafka. It has the following schema: userdata : { id: INT, acctid: INT, uid: STRING, logintm: datetime } I'm trying to count the number of logins by acctid. I can do the count fine, but the table only has the acctid and the count. Now I wish to get all th
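
One common fix is to join the aggregated counts back onto the original data so every column survives the groupBy. A batch sketch assuming a DataFrame named logins with the columns above (for true stream-stream joins in Spark 2.4, both sides need watermarks):

    from pyspark.sql import functions as F

    counts = logins.groupBy("acctid").agg(F.count("*").alias("login_count"))

    # Join the counts back so id, uid and logintm are available again.
    enriched = logins.join(counts, on="acctid", how="inner")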

Spark Streaming with Kafka and Python

2020-08-12 Thread Hamish Whittal
Hi folks, Thought I would ask here because it's somewhat confusing. I'm using Spark 2.4.5 on EMR 5.30.1 with Amazon MSK. The version of Scala used is 2.11.12. I'm using this version of the library: spark-streaming-kafka-0-8_2.11-2.4.5.jar Now I want to read from Kafka topics using Python (
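
For Python on Spark 2.4 the usual route is Structured Streaming with the spark-sql-kafka-0-10 package rather than the 0-8 DStream artifact (whose Python support was deprecated in Spark 2.3). A minimal sketch with placeholder broker and topic names:

    # spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 app.py
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-read").getOrCreate()

    stream = (spark.readStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
                   .option("subscribe", "my-topic")                   # placeholder
                   .load())

    # Kafka values arrive as bytes; cast before parsing any JSON payload.
    query = (stream.selectExpr("CAST(value AS STRING) AS value")
                   .writeStream
                   .format("console")
                   .start())
    query.awaitTermination()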

Re: MySQL query continually adds IS NOT NULL onto a query even though I don't request it

2020-04-02 Thread Hamish Whittal
Tonic. Have fun everyone (even if you're in lockdown like me!) Hamish On Wed, Apr 1, 2020 at 7:47 AM Hamish Whittal wrote: Hi folks, 1) First Problem: I'm querying MySQL. I submit a query like this: out = wam.select('message_id', 'business_i

MySQL query continually adds IS NOT NULL onto a query even though I don't request it

2020-03-31 Thread Hamish Whittal
Hi folks, 1) First Problem: I'm querying MySQL. I submit a query like this: out = wam.select('message_id', 'business_id', 'info', 'entered_system_date', 'auto_update_time').filter("auto_update_time >= '2020-04-01 05:27'").dropDuplicates(['message_id', 'auto_update_time']) But what I see in the D
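
The extra IS NOT NULL is most likely Catalyst's predicate pushdown at work: a comparison filter on auto_update_time is null-intolerant, so Spark adds an IsNotNull filter on that column before pushing the query to the JDBC source. explain() makes this visible; a sketch reusing the query from the post:

    out = (wam.select('message_id', 'business_id', 'info',
                      'entered_system_date', 'auto_update_time')
              .filter("auto_update_time >= '2020-04-01 05:27'")
              .dropDuplicates(['message_id', 'auto_update_time']))

    # The physical plan lists what was pushed to MySQL, typically
    # something like: PushedFilters: [*IsNotNull(auto_update_time), ...]
    out.explain(True)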

Re: Still incompatible schemas

2020-03-09 Thread Hamish Whittal
s3 location as - val df = spark.read.parquet("s3://path/file") df.show(3, false) // this displays the results.

Still incompatible schemas

2020-03-09 Thread Hamish Whittal
Hi folks, Thanks for the help thus far. I'm trying to track down the source of this error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary when doing a message.show() Basically I'm reading in a single Parquet file (to try to narrow th
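
A quick way to narrow it down further is to read a known-good file and the suspect file separately and diff their schemas; a sketch with hypothetical paths:

    good = spark.read.parquet("hdfs:///data/known_good.parquet")  # hypothetical
    suspect = spark.read.parquet("hdfs:///data/suspect.parquet")  # hypothetical

    good.printSchema()
    suspect.printSchema()

    # Fields typed differently across files (e.g. INT32 in one, BINARY in
    # another) are the usual cause of dictionary-decoding errors like this.
    print(set(good.schema) ^ set(suspect.schema))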

Re:

2020-03-02 Thread Hamish Whittal
) } finally { reader.close() } } .toDF("schema name", "fields") .show(false) .binaryFiles provides you all filenames that match the given pattern as an RDD, so the following .map is executed on the Spark execu

[no subject]

2020-03-01 Thread Hamish Whittal
Hi there, I have an hdfs directory with thousands of files. It seems that some of them - and I don't know which ones - have a problem with their schema and it's causing my Spark application to fail with this error: Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column
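
One brute-force way to locate the bad files is to probe each file on its own and record the failures. A sketch that assumes the driver can list the directory through Hadoop's FileSystem API (spark._jvm and spark._jsc are internal handles, but commonly used for this):

    hadoop = spark._jvm.org.apache.hadoop
    fs = hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
    path = hadoop.fs.Path("hdfs:///data/events/")  # hypothetical directory

    bad_files = []
    for status in fs.listStatus(path):
        f = status.getPath().toString()
        try:
            # Force an actual read; schema-only access may not trigger the error.
            spark.read.parquet(f).limit(1).collect()
        except Exception as e:
            bad_files.append((f, str(e)[:200]))

    for f, err in bad_files:
        print(f, err)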