The performance issue might be caused by the partition count of the Parquet table,
which is only 3. The reader uses that partition count to parallelize extraction.
See the plan you provided:
> spark.sql("select * from db.table limit 100").explain(false)
> == Physical Plan ==
> CollectLimit 100
> +- FileScan parquet ...
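To confirm, you can check how many partitions the scan produces and repartition
before any heavy downstream work. A minimal sketch (the target of 200 is only an
example, not a recommendation):

// Assumes an active SparkSession named `spark`
val df = spark.table("db.table")
println(df.rdd.getNumPartitions)  // should print 3 in your case

// Force a shuffle to a higher partition count to restore parallelism
val repartitioned = df.repartition(200)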
Hi Zhang, thank you for your response.
While your answer clears up my confusion about `CollectLimit`, it still does
not explain the recommended way to extract a large amount of data (but not all
of the records) from a source while maintaining a high level of parallelism.
For example, https://github.com/apache/spark/pull/7334 may explain the
behavior I am seeing:
> This patch preserves this optimization by treating logical Limit operators
> specially when they appear as the terminal operator in a query plan: if a
> Limit is the final operator, then we will plan a special
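If I read that right, whether the Limit is the final operator changes the
physical plan. A quick sketch to compare the two cases (the column name
`some_col` is just a stand-in):

// Terminal limit: I would expect this to plan a CollectLimit
spark.sql("select * from db.table limit 100").explain()

// Non-terminal limit: I would expect GlobalLimit/LocalLimit here instead
spark.sql("select * from db.table limit 100").groupBy("some_col").count().explain()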
Looking at the results of explain, I can see a CollectLimit step. Does that
work the same way as a regular .collect() (where all records are sent to the
driver)?
spark.sql("select * from db.table limit 100").explain(false)
== Physical Plan ==
CollectLimit 100
+- FileScan parquet ...
When I use .limit(), the number of partitions of the returned DataFrame is 1,
which causes most of my jobs to fail.
val df = spark.sql("select * from table limit n")  // n is a placeholder for the row count
df.write.parquet("/path/to/output")                // .parquet() needs an output path; this one is a placeholder
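For reference, a fuller sketch of what I mean (1000000 stands in for n; the
partition target and output path are placeholders):

// The terminal limit collapses the result to a single partition
val limited = spark.sql("select * from table limit 1000000")
println(limited.rdd.getNumPartitions)  // prints 1

// Repartitioning afterwards parallelizes the write, but the limit itself
// still runs through one partition, so it does not feel like the right fix
limited.repartition(200).write.parquet("/path/to/output")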
Thanks!