Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-22 Thread ZHANG Wei
The performance issue might be caused by the parquet table's partition count, which is only 3. The reader uses that partition count to parallelize the extraction. Refer to the log you provided: > spark.sql("select * from db.table limit 100").explain(false) > == Physical Plan == > CollectLimit 100 >
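A minimal Scala sketch of the point above (table name and values are placeholders, not from the thread): the scan parallelism follows the source layout, and, assuming splittable parquet files, spark.sql.files.maxPartitionBytes is one knob that changes how many input splits the reader creates.

    // Check how many partitions the scan of the base table produces.
    val base = spark.sql("select * from db.table")
    println(base.rdd.getNumPartitions)   // e.g. 3 if the data is laid out as 3 files

    // Assumption: splittable parquet files. Lowering the max bytes per
    // partition makes the reader create more input splits on the next scan.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)
    println(spark.sql("select * from db.table").rdd.getNumPartitions)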

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-21 Thread Yeikel
Hi Zhang, thank you for your response. While your answer clarifies my confusion about `CollectLimit`, it still does not explain the recommended way to extract large amounts of data (but not all the records) from a source while maintaining a high level of parallelism. For example, at some ...
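A hedged sketch of one possible workaround (Scala; table names, paths, and counts below are placeholders, not from the thread): repartition the limited result before the expensive write so that only the limit itself runs with low parallelism, or, if an approximate subset is acceptable, use TABLESAMPLE so the scan stays parallel.

    // Take the first N rows, then redistribute before the heavy write,
    // so only the limit step runs with low parallelism.
    val topN = spark.sql("select * from db.table limit 1000000")  // N is a placeholder
      .repartition(100)                                           // partition count is a placeholder
    topN.write.parquet("/tmp/top_n_output")

    // If "top N" can be an approximate subset instead of the first N rows,
    // TABLESAMPLE keeps the scan itself parallel.
    val sampled = spark.sql("select * from db.table TABLESAMPLE (10 PERCENT)")
    sampled.write.parquet("/tmp/sample_output")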

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-21 Thread ZHANG Wei
https://github.com/apache/spark/pull/7334 may answer the question; quoting its description: > This patch preserves this optimization by treating logical Limit operators > specially when they appear as the terminal operator in a query plan: if a > Limit is the final operator, then we will plan a special ...
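A hedged illustration of that special-casing (Scala; table name is a placeholder, and the exact plan text varies by Spark version): when the Limit is the final operator the planner emits a CollectLimit node, but placing any operator after it makes the plan fall back to LocalLimit/GlobalLimit.

    // Terminal limit: planned as CollectLimit.
    spark.sql("select * from db.table limit 100").explain()
    // expected to show something like:  CollectLimit 100  +- FileScan parquet ...

    // Non-terminal limit (an operator follows it): planned as LocalLimit/GlobalLimit
    // plus an Exchange, so the downstream work is no longer a single-task collect.
    spark.sql("select * from db.table limit 100").repartition(10).explain()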

Re: What is the best way to take the top N entries from a hive table/data source?

2020-04-14 Thread Yeikel
Looking at the output of explain, I can see a CollectLimit step. Does that work the same way as a regular .collect() (where all records are sent to the driver)? spark.sql("select * from db.table limit 100").explain(false) == Physical Plan == CollectLimit 100 +- FileScan parquet ...

What is the best way to take the top N entries from a hive table/data source?

2020-04-14 Thread yeikel valdes
When I use .limit(), the number of partitions in the resulting DataFrame is 1, which normally fails most of my jobs. val df = spark.sql("select * from table limit n") df.write.parquet("/path/to/output") Thanks!
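A minimal sketch of the symptom being described (Scala; the table name, limit value, and output path are placeholders): the limited DataFrame reports a single partition, so the entire write runs in one task.

    val df = spark.sql("select * from table limit 1000000")  // placeholder value for n
    println(df.rdd.getNumPartitions)                         // typically prints 1
    df.write.parquet("/path/to/output")                      // the whole write happens in one task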