Yeah I figured it's not something fundamental to the task or Spark. The error is very odd, never seen that. Do you have a theory on what's going on there? I don't!
On Fri, Apr 9, 2021 at 10:43 AM Attila Zsolt Piros <piros.attila.zs...@gmail.com> wrote:
> Hi!
>
> I looked into the code and found a way to improve it.
>
> With the improvement your test runs just fine:
>
>     Welcome to
>           ____              __
>          / __/__  ___ _____/ /__
>         _\ \/ _ \/ _ `/ __/ '_/
>        /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>           /_/
>
>     Using Python version 3.8.1 (default, Dec 30 2020 22:53:18)
>     Spark context Web UI available at http://192.168.0.199:4040
>     Spark context available as 'sc' (master = local, app id = local-1617982367872).
>     SparkSession available as 'spark'.
>
>     In [1]: import pyspark
>
>     In [2]: conf=pyspark.SparkConf().setMaster("local[64]").setAppName("Test1")
>
>     In [3]: sc=pyspark.SparkContext.getOrCreate(conf)
>
>     In [4]: rows=70000
>
>     In [5]: data=list(range(rows))
>
>     In [6]: rdd=sc.parallelize(data,rows)
>
>     In [7]: assert rdd.getNumPartitions()==rows
>
>     In [8]: rdd0=rdd.filter(lambda x:False)
>
>     In [9]: assert rdd0.getNumPartitions()==rows
>
>     In [10]: rdd00=rdd0.coalesce(1)
>
>     In [11]: data=rdd00.collect()
>     21/04/09 17:32:54 WARN TaskSetManager: Stage 0 contains a task of very large size (4729 KiB). The maximum recommended task size is 1000 KiB.
>
>     In [12]: assert data==[]
>
>     In [13]:
>
> I will create a JIRA ticket and add some unit tests before opening the PR.
>
> Best Regards,
> Attila