Shea Parkes created SPARK-17998:
-----------------------------------

             Summary: Reading Parquet files coalesces parts into too few in-memory partitions
                 Key: SPARK-17998
                 URL: https://issues.apache.org/jira/browse/SPARK-17998
             Project: Spark
          Issue Type: Bug
          Components: PySpark, SQL
    Affects Versions: 2.0.1, 2.0.0
         Environment: Spark Standalone Cluster (not "local mode")
Windows 10 and Windows 7
Python 3.x
            Reporter: Shea Parkes


Reading a parquet file into a DataFrame results in far too few in-memory partitions.  In prior versions of Spark, the resulting DataFrame would have a number of partitions often equal to the number of part files in the parquet folder.

Here's a minimal reproducible sample:

{quote}
from pyspark.sql import SparkSession

# SparkSession attached to the standalone cluster described above.
session = SparkSession.builder.getOrCreate()

# Write out a DataFrame with a known number of partitions (13 part files on disk).
df_first = session.range(start=1, end=100000000, numPartitions=13)
assert df_first.rdd.getNumPartitions() == 13
assert session._sc.defaultParallelism == 6  # specific to my cluster (6 cores)

path_scrap = r"c:\scratch\scrap.parquet"
df_first.write.parquet(path_scrap)

# Read it back; the in-memory partition count no longer matches the part-file count.
df_second = session.read.parquet(path_scrap)
print(df_second.rdd.getNumPartitions())
{quote}

On my cluster, the above prints only 7 partitions for the DataFrame created by reading the Parquet data back into memory.  Why is it no longer simply the number of part files in the Parquet folder (13 in the example above)?
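For reference, this is roughly how I confirm the mismatch; counting the part files with glob is just for illustration:

{quote}
import glob
import os

# Count the part files the write actually produced on disk.
n_part_files = len(glob.glob(os.path.join(path_scrap, "part-*")))
print(n_part_files)                      # 13 on my cluster

# Compare against the number of in-memory partitions after the read.
print(df_second.rdd.getNumPartitions())  # 7 on my cluster
{quote}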

I'm filing this as a bug because the coalescing is aggressive enough that we can't work with the underlying RDD without first repartitioning the DataFrame, which is costly and wasteful.  I really doubt this was the intended effect of moving to Spark 2.0.
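For completeness, the workaround we're using today looks roughly like the following (the target of 13 is just illustrative); it works, but the shuffle it triggers is exactly the cost I'd like to avoid:

{quote}
# Workaround: force the partition count back up before touching the RDD.
df_repartitioned = df_second.repartition(13)
assert df_repartitioned.rdd.getNumPartitions() == 13
{quote}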

I've tried to research where the number of in-memory partitions is determined, but my Scala skills have proven inadequate.  I'd be happy to dig further if someone could point me in the right direction...
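My best guess so far (and it is only a guess) is that the new file-based data sources in 2.0 pack multiple part files into a single partition based on byte size, in which case something like the following might influence the count.  I haven't verified that this is the right knob:

{quote}
# Untested guess: shrinking the max bytes packed into one partition might
# bring the partition count back up.  128 MB appears to be the default.
session.conf.set("spark.sql.files.maxPartitionBytes", 8 * 1024 * 1024)
df_third = session.read.parquet(path_scrap)
print(df_third.rdd.getNumPartitions())
{quote}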


