Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-30 Thread Ranadip Chatterjee
Gzip files are not splittable. Hence using very large (i.e. non-partitioned) gzip files leads to contention when reading the files, as readers cannot scale beyond the number of gzip files to read. Better to use a splittable compression format instead to allow frameworks to scale up. Or manually
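
As a rough illustration of the point that readers cannot scale beyond the number of gzip files, the sketch below rewrites one large gzip file as several smaller ones using only the Python standard library, so that a framework has more files to read in parallel. The function name `split_gzip`, the `part-NNNNN.gz` naming, and the line-based split are assumptions made for illustration; none of them come from the thread.

```python
import gzip
import os
import tempfile


def split_gzip(src_path, out_dir, lines_per_part):
    """Rewrite one gzip text file as several smaller gzip files.

    Hypothetical helper: splits on line boundaries so every output
    file is a complete, independently readable gzip file.
    """
    part_paths = []
    out = None
    # A gzip stream has to be decompressed from its start, so the source
    # is read sequentially by a single reader here.
    with gzip.open(src_path, "rt", encoding="utf-8", newline="") as src:
        for i, line in enumerate(src):
            if i % lines_per_part == 0:
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(part_paths):05d}.gz")
                out = gzip.open(path, "wt", encoding="utf-8", newline="")
                part_paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return part_paths


if __name__ == "__main__":
    work_dir = tempfile.mkdtemp()
    big = os.path.join(work_dir, "big.gz")
    with gzip.open(big, "wt", encoding="utf-8", newline="") as f:
        for n in range(10):
            f.write(f"row {n}\n")
    for p in split_gzip(big, work_dir, lines_per_part=4):
        print(p)
```

The one-off split is still sequential, but afterwards each part can be handed to a separate reader. Writing the data in a splittable format, as the message suggests, avoids the extra step.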

Re: protobuf data as input to spark streaming

2022-05-30 Thread Kiran Biswal
Hello Stelios, friendly reminder: could you share any sample code/repo? Are you using a schema registry? Thanks, Kiran On Fri, Apr 8, 2022 at 4:37 PM Kiran Biswal wrote: > Hello Stelios > > Just a gentle follow up if you can share any sample code/repo > > Regards > Kiran > > On Wed, Apr 6,

Re: Unable to format timestamp values in pyspark

2022-05-30 Thread Sid
Yeah, Stelios. It worked. Could you please post it as an answer so that I can accept it on the post and it can be of help to other people? Thanks, Sid On Mon, May 30, 2022 at 4:42 PM Stelios Philippou wrote: > Sid, > > According to the error that i am seeing there, this is the Date Format > issue. > >

Re: Unable to format timestamp values in pyspark

2022-05-30 Thread Stelios Philippou
Sid, According to the error that I am seeing there, this is a date format issue: Text '5/1/2019 1:02:16' could not be parsed. But your time format is specified as 'M/dd/ H:mm:ss'. You can see that the day in the data is /1/, but your format uses dd, which expects two digits. Please try

Unable to format timestamp values in pyspark

2022-05-30 Thread Sid
Hi Team, I am able to convert to timestamp. However, when I try to filter out the records based on a specific value it gives an error as mentioned in the post. Could you please help me with this? https://stackoverflow.com/questions/72422897/unable-to-format-timestamp-in-pyspark/72423394#72423394

Re: Job migrated from EMR to Dataproc takes 20 hours instead of 90 minutes

2022-05-30 Thread Ori Popowski
Thanks. Eventually the problem was solved. I am still not 100% sure what caused it, but when I said the input was identical I simplified a bit, because it was not (sorry for misleading; I thought this information would just be noise). Explanation: the input to the EMR job was gzips created by