[ https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hovell resolved SPARK-21595.
---------------------------------------
    Resolution: Fixed
      Assignee: Tejas Patil
 Fix Version/s: 2.3.0
                2.2.1

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21595
>                 URL: https://issues.apache.org/jira/browse/SPARK-21595
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, PySpark
>    Affects Versions: 2.2.0
>         Environment: pyspark on linux
>            Reporter: Stephan Reiling
>            Assignee: Tejas Patil
>            Priority: Minor
>              Labels: documentation, regression
>             Fix For: 2.2.1, 2.3.0
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
>     'association_idx',
>     sqlf.row_number().over(
>         Window.orderBy('uid1', 'uid2')
>     )
> )
> {code}
> where df is a long, skinny DataFrame (450M rows, 10 columns), so this creates one large window over which the whole DataFrame is sorted.
> In Spark 2.1 this works without problems; in Spark 2.2 it fails with either an out-of-memory exception or a too-many-open-files exception, depending on the memory settings (adjusting those is what I tried first to fix this).
> Monitoring the blockmgr directory, I see that Spark 2.1 creates 152 files, while Spark 2.2 creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 64.1 MB to disk (0 time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 64.1 MB to disk (1 time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> That message allowed me to track down the issue.
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold 2097152
> {code}
> I got it to work again, with the same performance as Spark 2.1.
> I also have workflows using window functions that do not fail, but that took a performance hit due to the excessive spilling at the default of 4096.
> To make these issues easier to track down, I think this config variable should be included in the configuration documentation. Maybe 4096 is also too small as a default value?
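For anyone hitting the same problem, here is a minimal, self-contained sketch of the workaround described above, assuming a PySpark 2.2 session. The toy DataFrame and the app name are illustrative stand-ins; only the config key, its value (2097152), and the window expression are taken from the report.

{code:python}
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as sqlf

spark = (
    SparkSession.builder
    .appName("window-spill-threshold-workaround")  # illustrative name
    # Raise the row count at which the window operator's buffer spills to
    # disk (default 4096 in Spark 2.2); 2097152 is the value from the report.
    .config("spark.sql.windowExec.buffer.spill.threshold", "2097152")
    .getOrCreate()
)

# Toy stand-in for the 450M-row, 10-column DataFrame from the report.
df = spark.createDataFrame(
    [(1, 30), (1, 10), (2, 20)],
    ["uid1", "uid2"],
)

# Same pattern as in the report: a single unpartitioned window, which
# forces one global sort over the whole DataFrame.
df = df.withColumn(
    "association_idx",
    sqlf.row_number().over(Window.orderBy("uid1", "uid2")),
)
df.show()
{code}

Since this is a SQL conf, it should also be settable on a running session via spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", "2097152"), or passed with --conf on spark-submit.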