Hi folks,
I have a problematic dataset I'm working with and am trying to find ways of
"debugging" the data.
For example, the simplest thing I would like to do is to know how many
rows of data I've read and compare that to a simple count of the lines in
the file.
I could do:
df.count()
but
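The message is cut off here, but one low-tech way to make that comparison
(a sketch, assuming a line-oriented input such as CSV at a hypothetical
path) is to read the same file back as plain text and count the lines:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

path = "hdfs:///data/input.csv"   # hypothetical input path

# Rows Spark actually parsed into the DataFrame...
df = spark.read.option("header", "true").csv(path)
parsed = df.count()

# ...versus the raw number of lines in the same file (header included).
raw = spark.read.text(path).count()

# A gap bigger than the header line points at dropped/malformed records
# (e.g. with mode=DROPMALFORMED) or at multi-line quoted fields.
print("parsed rows:", parsed, " raw lines:", raw)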
Sorry, I moved a paragraph:
> (2) If Ms green.th was first seen at 13:04:04, then at 13:04:05 and finally
> at 13:04:17, she's been in the queue for 13 seconds (ignoring the ms).
>
Hi folks,
I have a stream coming from Kafka.
It has this schema:
{
"id": 4,
"account_id": 1070998,
"uid": "green.th",
"last_activity_time": "2020-09-03 13:04:04.520129"
}
Another event arrives a few milliseconds/seconds later:
{
"id": 9,
"account_id": 1070998,
"uid":
Hi folks,
I've got a stream coming from Kafka. It has the following schema:
userdata : { id: INT, acctid: INT, uid: STRING, logintm: datetime }
I'm trying to count the number of logins by acctid.
I can do the count fine, but the table only has the acctid and the count.
Now I wish to get all th
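The message truncates here, but a common way to keep every column next to
the per-account count is a window function (a sketch, with `userdata`
standing in for the parsed DataFrame):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Batch sketch: attach the per-account count to every row, keeping all
# the original columns instead of collapsing to (acctid, count).
w = Window.partitionBy("acctid")
enriched = userdata.withColumn("login_count", F.count("*").over(w))

# On a genuinely streaming DataFrame this window isn't supported; the
# usual fallback is to aggregate and join the counts back (subject to
# the streaming join restrictions):
# counts = userdata.groupBy("acctid").count()
# enriched = userdata.join(counts, "acctid")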
Hi folks,
Thought I would ask here because it's somewhat confusing. I'm using Spark
2.4.5 on EMR 5.30.1 with Amazon MSK.
The version of Scala used is 2.11.12, and I'm using this version of the
library: spark-streaming-kafka-0-8_2.11-2.4.5.jar
Now I'm wanting to read from Kafka topics using Python (
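The message is cut off, but for what it's worth: the 0-8 DStream
connector has been deprecated since Spark 2.3, and on Spark 2.4.5 the
usual Python route is Structured Streaming with the spark-sql-kafka-0-10
package instead. A minimal sketch (the broker and topic names below are
placeholders, not the poster's actual MSK endpoint):

# Submitted with:
#   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("msk-reader").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "b-1.msk.example:9092")  # assumed broker
       .option("subscribe", "my-topic")                            # assumed topic
       .option("startingOffsets", "earliest")
       .load())

# Kafka delivers key and value as binary; cast them before parsing.
msgs = raw.selectExpr("CAST(key AS STRING) AS key",
                      "CAST(value AS STRING) AS value")

query = msgs.writeStream.format("console").start()
query.awaitTermination()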
Tonic. Have fun everyone
(even if you're in lockdown like me!)
Hamish
On Wed, Apr 1, 2020 at 7:47 AM Hamish Whittal wrote:
Hi folks,
1) First Problem:
I'm querying MySQL. I submit a query like this:
out = wam.select('message_id', 'business_id', 'info',
                 'entered_system_date', 'auto_update_time') \
         .filter("auto_update_time >= '2020-04-01 05:27'") \
         .dropDuplicates(['message_id', 'auto_update_time'])
But what I see in the D
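The preview is cut off, but if the question is what actually reaches
MySQL, the physical plan shows the pushed-down predicates (a sketch,
assuming `wam` is the JDBC-backed DataFrame from the post):

# `wam` is assumed to be the JDBC-backed DataFrame from the post.
out = (wam.select('message_id', 'business_id', 'info',
                  'entered_system_date', 'auto_update_time')
          .filter("auto_update_time >= '2020-04-01 05:27'")
          .dropDuplicates(['message_id', 'auto_update_time']))

# The plan's JDBCRelation line lists PushedFilters, i.e. the predicates
# actually sent to MySQL; dropDuplicates always runs in Spark itself,
# never in the database.
out.explain(True)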
s3 location as -
> val df = spark.read.parquet("s3://path/file")
> df.show(3, false) // this displays the results.
>
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
Hi folks,
Thanks for the help thus far.
I'm trying to track down the source of this error:
java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary
when doing a message.show()
Basically I'm reading in a single Parquet file (to try to narrow th
> )
> } finally {
> reader.close()
> }
> }
> .toDF("schema name", "fields")
> .show(false)
>
> .binaryFiles provides you all filenames that match the given pattern as an
> RDD, so the following .map is executed on the Spark executors.
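A rough Python analogue of that footer-reading idea, for anyone following
along in PySpark: inspect each file's schema with pyarrow so the odd one
out stands out. This is a sketch; `paths` is an assumed list of the
individual Parquet file paths (gathered elsewhere, e.g. with
hdfs dfs -ls), and pyarrow needs to be able to reach them (locally or via
a configured filesystem).

import pyarrow.parquet as pq

# `paths` is an assumed list of the individual Parquet file paths.
for p in paths:
    try:
        # Only the footer is read, so this is cheap even for large files.
        print(p, pq.ParquetFile(p).schema_arrow)
    except Exception as e:
        print(p, "unreadable footer:", e)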
Hi there,
I have an hdfs directory with thousands of files. It seems that some of
them - and I don't know which ones - have a problem with their schema and
it's causing my Spark application to fail with this error:
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet
column
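The error is cut off, but one blunt way to find the offending files is to
read each one on its own and record which ones fail. A sketch, again with
`paths` standing in for a pre-collected list of the individual files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# `paths` is an assumed list of the individual Parquet files under the
# HDFS directory (gathered beforehand, e.g. with hdfs dfs -ls).
bad = []
for p in paths:
    try:
        # Force an actual read, not just schema resolution, so that
        # value-level decoding errors surface too.
        spark.read.parquet(p).limit(1).collect()
    except Exception as e:
        bad.append(p)
        print("FAILED:", p, "->", e)

print(len(bad), "problem files")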