https://github.com/apache/spark/pull/7334 may explain what you are seeing; quoting the PR description:
> This patch preserves this optimization by treating logical Limit operators
> specially when they appear as the terminal operator in a query plan: if a
> Limit is the final operator, then we will plan a special CollectLimit
> physical operator which implements the old take()-based logic.
For `spark.sql("select * from db.table limit 1000000").explain(false)`, `limit`
is the final operator;
for `spark.sql("select * from db.table limit
1000000").repartition(1000).explain(false)`, `repartition` is the final
operator. If you add a `.limit()` operation after `repartition`, such as
`spark.sql("select * from db.table limit
1000000").repartition(1000).limit(1000).explain(false)`, the `CollectLimit`
will show again.
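Side by side (a spark-shell sketch, with `db.table` standing in for your table):

```scala
// A limit is the terminal operator: planned as CollectLimit.
spark.sql("select * from db.table limit 1000000").explain(false)

// repartition() is the terminal operator: the limit is planned as
// LocalLimit/GlobalLimit with an Exchange SinglePartition in between.
spark.sql("select * from db.table limit 1000000").repartition(1000).explain(false)

// A trailing .limit() makes a limit terminal again: CollectLimit reappears.
spark.sql("select * from db.table limit 1000000").repartition(1000).limit(1000).explain(false)
```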
---
Cheers,
-z
________________________________________
From: Yeikel <[email protected]>
Sent: Wednesday, April 15, 2020 2:45
To: [email protected]
Subject: Re: What is the best way to take the top N entries from a hive
table/data source?
Looking at the results of explain, I can see a CollectLimit step. Does that work the same way as a regular `.collect()` (where all records are sent to the driver)?
spark.sql("select * from db.table limit 1000000").explain(false)
== Physical Plan ==
CollectLimit 1000000
+- FileScan parquet ... 806 more fields] Batched: false, Format: Parquet, Location: CatalogFileIndex[...], PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema: .....
The number of partitions is 1, so that makes sense:
spark.sql("select * from db.table limit 1000000").rdd.partitions.size = 1
As a follow-up, I tried to repartition the resulting dataframe, and while I can't see the CollectLimit step anymore, it did not make any difference in the job. I still saw a big task at the end that ended up failing.
spark.sql("select * from db.table limit
1000000").repartition(1000).explain(false)
Exchange RoundRobinPartitioning(1000)
+- GlobalLimit 1000000
+- Exchange SinglePartition
+- LocalLimit 1000000 -> Is this a collect?
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]