Creating Custom Broadcast Join

2022-09-01 Thread Murali S
Hi, I want to broadcast a DataFrame to all executors and perform an operation similar to a join, except that it may return a different number of rows than each partition contains and may combine multiple rows into one output row. I am trying to create a custom join operator for this use case. It would be great i
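
One possible way to approximate the operation described above without writing a custom operator is a per-partition function over a broadcast lookup table. The sketch below is only an illustration under stated assumptions: the (key, value) row shape, the helper names `build_lookup` and `join_partition`, and the sum-then-multiply combine rule are all hypothetical and are not taken from the thread. The per-partition logic is plain Python so it can be checked without a cluster; the comments show where `SparkContext.broadcast` and `RDD.mapPartitions` would come in.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[str, int]  # hypothetical (key, value) row shape


def build_lookup(small_rows: Iterable[Row]) -> Dict[str, List[int]]:
    """Index the small (to-be-broadcast) side by key."""
    lookup: Dict[str, List[int]] = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    return dict(lookup)


def join_partition(rows: Iterable[Row],
                   lookup: Dict[str, List[int]]) -> Iterator[Row]:
    """Consume one partition and emit a variable number of output rows.

    Many-to-one: input rows sharing a key are summed into one total.
    One-to-many: one output row is emitted per matching lookup value.
    Keys without a match in the lookup produce no output.
    """
    totals: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    for key, total in totals.items():
        for small_value in lookup.get(key, []):
            yield (key, total * small_value)


# With Spark (not executed here; small_df and big_df are hypothetical):
#   bc = spark.sparkContext.broadcast(
#       build_lookup(small_df.rdd.map(tuple).collect()))
#   result = big_df.rdd.map(tuple).mapPartitions(
#       lambda it: join_partition(it, bc.value))

# Local check on a single in-memory "partition".
partition = [("a", 1), ("a", 2), ("b", 5), ("c", 7)]
lookup = build_lookup([("a", 10), ("a", 20), ("b", 3)])
print(list(join_partition(partition, lookup)))
```

Because the function receives a whole partition as an iterator, it is free to emit more or fewer rows than it reads. Rows with the same key that land in different partitions are not combined, so a repartition by key beforehand may be needed depending on the use case.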

Re: Spark 3.3.0/3.2.2: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15

2022-09-01 Thread FengYu Cao
I will open a JIRA, but since it's our production event log, I can't attach it. I'll try to set up a debugger to provide more information. On Thu, Sep 1, 2022 at 23:06, Chao Sun wrote: > Hi Fengyu, > > Do you still have the Parquet file that caused the error? could you > open a JIRA and attach the file to it? I

Re: running pyspark on kubernetes - no space left on device

2022-09-01 Thread Qian SUN
Hi, Spark provides the spark.local.dir configuration to specify the work folder on the pod. You can set spark.local.dir to your mount path. Best regards On Thu, Sep 1, 2022 at 21:16, Manoj GEORGE wrote: > CONFIDENTIAL & RESTRICTED > > Hi Team, > > > > I am new to spark, so please excuse my ignorance. > > > > Curre

Re: Spark 3.3.0/3.2.2: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15

2022-09-01 Thread Chao Sun
Hi Fengyu, Do you still have the Parquet file that caused the error? Could you open a JIRA and attach the file to it? I can take a look. Chao On Thu, Sep 1, 2022 at 4:03 AM FengYu Cao wrote: > > I'm trying to upgrade our spark (3.2.1 now) > > but with spark 3.3.0 and spark 3.2.2, we had error w

Re: running pyspark on kubernetes - no space left on device

2022-09-01 Thread Matt Proetsch
Hi George, You can try mounting a larger PersistentVolume to the work directory as described here instead of using localdir which might have site-specific size constraints: https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes -Matt > On Sep 1, 2022, at 09:1
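
As a rough sketch of the volume-mount suggestion, the snippet below builds `--conf` arguments for spark-submit from the `spark.kubernetes.executor.volumes.[VolumeType].[VolumeName]` property pattern described on the linked documentation page. The helper names, the volume suffix, the claim name `scratch-pvc`, and the mount path `/tmp/spark-scratch` are placeholders chosen for illustration, not values from this thread. The volume name is given the `spark-local-dir-` prefix on the assumption, based on that same page, that Spark then uses the mount as local scratch storage; please verify this against the documentation for your Spark version.

```python
from typing import Dict, List


def local_dir_volume_conf(volume_suffix: str,
                          claim_name: str,
                          mount_path: str) -> Dict[str, str]:
    """Build executor volume properties for a PersistentVolumeClaim mount."""
    # Assumption: a volume whose name starts with "spark-local-dir-"
    # is picked up by Spark as local (spill/shuffle) storage.
    volume_name = f"spark-local-dir-{volume_suffix}"
    prefix = ("spark.kubernetes.executor.volumes."
              f"persistentVolumeClaim.{volume_name}")
    return {
        f"{prefix}.options.claimName": claim_name,
        f"{prefix}.mount.path": mount_path,
        f"{prefix}.mount.readOnly": "false",
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a property dict into spark-submit '--conf key=value' pairs."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Placeholder values, for illustration only.
conf = local_dir_volume_conf("1", "scratch-pvc", "/tmp/spark-scratch")
print(" ".join(to_submit_args(conf)))
```

Whether a single claim can be shared by several executors depends on the access mode of the underlying volume, so this is a starting point to adapt rather than a drop-in configuration.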

running pyspark on kubernetes - no space left on device

2022-09-01 Thread Manoj GEORGE
CONFIDENTIAL & RESTRICTED Hi Team, I am new to Spark, so please excuse my ignorance. Currently we are trying to run PySpark on a Kubernetes cluster. The setup is working fine for some jobs, but when we are processing a large file (36 GB), we run into out-of-space issues. Based on what was fou

Re: Moving to Spark 3x from Spark2

2022-09-01 Thread Martin Andersson
You should check the release notes and upgrade instructions. From: rajat kumar Sent: Thursday, September 1, 2022 12:44 To: user @spark Subject: Moving to Spark 3x from Spark2

Re: Moving to Spark 3x from Spark2

2022-09-01 Thread Khalid Mammadov
Hi Rajat, There were a lot of changes between those versions, and unfortunately the only way to assess the impact is to run your own tests. Most probably you will have to make some changes to your codebase. Regards Khalid On Thu, 1 Sept 2022, 11:45 rajat kumar, wrote: > Hello Members, > > We

Spark 3.3.0/3.2.2: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 15

2022-09-01 Thread FengYu Cao
I'm trying to upgrade our Spark (currently 3.2.1), but with Spark 3.3.0 and Spark 3.2.2 we get an error with a specific Parquet file. Is anyone else having the same problem? Or do I need to provide any information to the devs? ``` org.apache.spark.SparkException: Job aborted due to stage failure: Ta

Moving to Spark 3x from Spark2

2022-09-01 Thread rajat kumar
Hello Members, We want to move from Spark 2.4 to Spark 3. Are there any changes at the code level that could break our existing code? Will it work if we simply change the Spark and Scala versions? Regards Rajat