Broadcast size increases with subsequent iterations

2020-12-04 Thread Kalin Stoyanov
Hi all, I have an iterative algorithm in Spark that uses the output of each iteration as the input for the following one, but the size of the data does not change. I am using localCheckpoint to cut the data's lineage (and also to facilitate some computations that reuse DataFrames). However, this runs slower and slower a

Re: Broadcast size increases with subsequent iterations

2020-12-08 Thread Kalin Stoyanov
v(df_clust, df_F), self.step_r( df_clust, df_F) df_clust = df_r.join(df_v, "id") return (df_clust, self.dt) Regards, Kalin On Fri, Dec 4, 2020 at 1:59 PM Kalin Stoyanov wrote: > Hi all, > > I have an iterative algorithm in spark that uses each iteration a

full SQL query graph not shown in monitoring when using cache

2021-04-15 Thread Kalin Stoyanov
Hi all, I noticed something a bit strange. When working with a cached DF, the SQL query details graph starts from the point where the cache takes place, and doesn't show the transformations before it. For example this code >>> df = sc.parallelize([[1,2,3],[1,4,5]]).toDF(['id','a','b']) >>> renameCols = [f"

Re: full SQL query graph not shown in monitoring when using cache

2021-04-15 Thread Kalin Stoyanov
transformations cached and spark run > only transformations that write after the cache. This is the meaning of the > cache in Spark. > > On Farvardin 26, 1400 AP, at 17:24, Kalin Stoyanov > wrote: > > Hi all, > > I noticed something a bit strange.. When working with a ca

Re: full SQL query graph not shown in monitoring when using cache

2021-04-15 Thread Kalin Stoyanov
put [3]: [id#131, a#132, b#133] > Arguments: 21 > > > HTH, > > > Mich

Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all, First of all let me say that I am pretty new to Spark so this could be entirely my fault somehow... I noticed this when I was running a job on an Amazon EMR cluster with Spark 2.4.4, and it finished slower than when I had run it locally (on Spark 2.4.1). I checked out the event logs, and t

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
eed to talk with them instead of posting questions > in the Apache Spark community. > > Cheers, > > Xiao > > On Wed, Jan 15, 2020 at 9:53 AM, Kalin Stoyanov wrote: > >> Hi all, >> >> First of all let me say that I am pretty new to Spark so this could be >> entirely

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
eries should hit such a > major performance regression. Also, please try the 3.0 preview releases. > > Thanks, > > Xiao > > On Wed, Jan 15, 2020 at 10:53 AM, Kalin Stoyanov wrote: > >> Hi Xiao, >> >> Thanks, I didn't know that. This >> https://aws.amazon.com/about-