Re: Spark SQL

2022-09-14 Thread Gourav Sengupta
Hi,

Why Spark and why Scala?

Regards,
Gourav

On Wed, 7 Sept 2022, 21:42 Mayur Benodekar, wrote:
> I am new to Scala and Spark, both.
>
> I have code in Scala which executes queries in a while loop, one after the
> other.
>
> What we need to do is, if a particular query takes more than a certain t
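For context, a minimal Scala sketch of one way to bound the runtime of each query in such a loop, assuming the goal is to cancel a query that exceeds a time budget; the query list, the 300-second limit, and the job-group tag are illustrative assumptions, not details from the thread:

    import java.util.concurrent.TimeoutException
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("timed-queries").getOrCreate()

    // Hypothetical queries executed one after the other, as in the original loop.
    val queries = Seq("SELECT count(*) FROM table_a", "SELECT count(*) FROM table_b")

    queries.zipWithIndex.foreach { case (q, i) =>
      val groupId = s"query-$i"
      val job = Future {
        // Tag all jobs launched by this query so they can be cancelled as a group.
        spark.sparkContext.setJobGroup(groupId, q, interruptOnCancel = true)
        spark.sql(q).collect()
      }
      try {
        Await.result(job, 300.seconds) // assumed per-query time budget
      } catch {
        case _: TimeoutException =>
          // Stop the runaway query and move on to the next one.
          spark.sparkContext.cancelJobGroup(groupId)
      }
    }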

Big Data Contract Roles ?

2022-09-14 Thread sri hari kali charan Tummala
Hi Flink Users / Spark Users,

Is anyone hiring contract (corp-to-corp) big data Spark Scala or Flink Scala roles?

Thanks,
Sri

Re: Splittable or not?

2022-09-14 Thread Sid
Okay, so you mean to say that Parquet compresses the denormalized data using Snappy, so it won't affect the performance; only using Snappy on its own would affect the performance. Am I correct?

On Thu, 15 Sep 2022, 01:08 Amit Joshi, wrote:
> Hi Sid,
>
> Snappy itself is not splittable. But the format that co

Re: Jupyter notebook on Dataproc versus GKE

2022-09-14 Thread Bjørn Jørgensen
Mitch: Why I'm switching from Jupyter Notebooks to JupyterLab... Such a better experience! DegreeTutors.com

On Tue, 6 Sep 2022 at 20:28, Holden Karau wrote:
> I’ve used Argo for K8s scheduling; for a while it’s also what Kubeflow used
> underneath for scheduling.

Re: Splittable or not?

2022-09-14 Thread Amit Joshi
Hi Sid,

Snappy itself is not splittable. But a format that contains the actual data, such as Parquet (which is basically divided into row groups), can be compressed using Snappy. This works because the blocks (pages of the Parquet format) inside the file can be independently compressed using Snappy. Than
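To make the point above concrete, here is a small Scala sketch showing that a single Snappy-compressed Parquet file is still read back as several partitions, because its row groups can be split independently; the dataset size and output path are illustrative assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.rand

    val spark = SparkSession.builder().appName("splittable-check").getOrCreate()

    // Hypothetical data; rand() keeps the columns from compressing away to nothing.
    val df = spark.range(0, 20000000L)
      .withColumn("a", rand())
      .withColumn("b", rand())
      .withColumn("c", rand())

    // Write a single Parquet file compressed with Snappy (applied per page/row group).
    df.coalesce(1).write.option("compression", "snappy").parquet("/tmp/demo_snappy_parquet")

    // If the file is larger than spark.sql.files.maxPartitionBytes (128 MB by default),
    // it comes back as several partitions, i.e. the file is still splittable.
    val back = spark.read.parquet("/tmp/demo_snappy_parquet")
    println(back.rdd.getNumPartitions)

    // A single gzip-compressed CSV of comparable size would be read as one partition,
    // because the gzip stream itself cannot be split.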

Re: Long running task in spark

2022-09-14 Thread Sid
Try spark.driver.maxResultSize=0

On Mon, 12 Sep 2022, 09:46 rajat kumar, wrote:
> Hello Users,
>
> My 2 tasks are running forever. One of them gave a Java heap space error.
> I have 10 joins, all tables are big. I understand this is data skewness.
> Apart from changes at code level, any prop
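For reference, the properties being discussed, as a minimal Scala sketch assuming Spark 3.x with adaptive query execution available; the values are illustrative, not recommendations from the thread. Removing the result-size cap avoids the driver-side error but does not address the skew itself:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("skewed-joins")
      // 0 removes the default 1g cap on results collected to the driver.
      .config("spark.driver.maxResultSize", "0")
      // Let adaptive query execution detect and split skewed join partitions.
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.adaptive.skewJoin.enabled", "true")
      // A partition is treated as skewed when it is this many times the median size.
      .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
      .getOrCreate()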

Splittable or not?

2022-09-14 Thread Sid
Hello experts,

I know that Gzip and Snappy files are not splittable, i.e. the data won't be distributed into multiple blocks; rather, it would try to load the data in a single partition/block. So, my question is: when I write the Parquet data via Spark, it gets stored at the destination with something like
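By default Spark writes Parquet with Snappy compression (spark.sql.parquet.compression.codec defaults to snappy), and each task produces a part file whose name carries the codec, e.g. a .snappy.parquet suffix. Below is a sketch of how one might confirm that the compression lives inside the row groups of such a file, assuming the parquet-hadoop classes on Spark's classpath; the file path is hypothetical:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.parquet.hadoop.util.HadoopInputFile

    // Hypothetical part file produced by a Spark Parquet write.
    val file = HadoopInputFile.fromPath(
      new Path("/tmp/demo_snappy_parquet/part-00000-....snappy.parquet"),
      new Configuration())

    val reader = ParquetFileReader.open(file)
    try {
      // Each block is a row group; every column chunk records its own codec,
      // which is why the file can be split at row-group boundaries.
      reader.getFooter.getBlocks.forEach { block =>
        block.getColumns.forEach { col =>
          println(s"rows=${block.getRowCount} column=${col.getPath} codec=${col.getCodec}")
        }
      }
    } finally reader.close()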