[ https://issues.apache.org/jira/browse/SPARK-30185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-30185.
----------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 26809
[https://github.com/apache/spark/pull/26809]

> Implement Dataset.tail API
> --------------------------
>
>                 Key: SPARK-30185
>                 URL: https://issues.apache.org/jira/browse/SPARK-30185
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.0.0
>
>
> I would like to propose an API called DataFrame.tail.
>
> *Background & Motivation*
>
> Many other systems support taking data from the end, for instance
> pandas[1] and Python[2][3]. Scala collections APIs also have head and tail.
>
> In Spark, on the other hand, we only provide a way to take data from the
> start (e.g., DataFrame.head). A way to take data from the end has been
> requested multiple times on the Spark user mailing list[4], Stack
> Overflow[5][6], JIRA[7], and in third-party projects such as Koalas[8].
>
> It seems we are missing a non-trivial use case in Spark, and this motivated
> me to propose this API.
>
> *Proposal*
>
> I would like to propose an API on DataFrame called tail that collects rows
> from the end, in contrast with head.
>
> Namely, as below:
>
> {code:java}
> scala> spark.range(10).head(5)
> res1: Array[Long] = Array(0, 1, 2, 3, 4)
> scala> spark.range(10).tail(5)
> res2: Array[Long] = Array(5, 6, 7, 8, 9){code}
>
> Implementation details will be similar to head, but reversed:
>
> 1. Run the job against the last partition and collect rows. If this is
>    enough, return as is.
> 2. If this is not enough, calculate the number of additional partitions to
>    select based upon 'spark.sql.limit.scaleUpFactor'.
> 3. Run more jobs against more partitions (in reversed order compared to
>    head), as many as the number calculated in step 2.
> 4. Go to step 2.
>
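
The implementation steps above can be sketched without Spark as follows. This is only a rough model under stated assumptions, not the code from the pull request: partitions are plain in-memory sequences, TailSketch and its tail method are hypothetical names, and the scale-up rule (multiply the number of partitions read per round by the factor) is a simplification of whatever the real calculation based on 'spark.sql.limit.scaleUpFactor' does.

```scala
// Spark-free model of the proposed tail(n) strategy (illustrative only).
// Each loop iteration stands in for one job run against a batch of partitions.
object TailSketch {
  // Hypothetical helper, not a Spark API. `scaleUpFactor` plays the role of
  // spark.sql.limit.scaleUpFactor; the default of 4 is an assumption.
  def tail[T](partitions: IndexedSeq[Seq[T]], n: Int, scaleUpFactor: Int = 4): Seq[T] = {
    require(n >= 0, "n must be non-negative")
    require(scaleUpFactor > 1, "scaleUpFactor must be greater than 1")

    var collected: Vector[T] = Vector.empty // rows gathered so far, in original order
    var scanned = 0                         // partitions already read, counted from the end
    var toScan = 1                          // step 1: start with the last partition only

    while (collected.size < n && scanned < partitions.length) {
      val end = partitions.length - scanned
      val start = math.max(0, end - toScan)
      // Step 3: read the next batch of partitions, moving toward the front.
      val batch = partitions.slice(start, end).flatten.toVector
      // Older rows go in front; keep only the last n rows overall.
      collected = (batch ++ collected).takeRight(n)
      scanned += end - start
      // Step 2 (simplified): widen the next batch by the scale-up factor.
      toScan *= scaleUpFactor
    }
    collected
  }
}
```

With ten values spread over the four partitions (0, 1), (2, 3, 4), (5, 6), (7, 8, 9), TailSketch.tail(parts, 5) reads the last partition first, finds only three rows, then reads the remaining partitions in a second round and returns 5, 6, 7, 8, 9, matching the proposed spark.range(10).tail(5) output.
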
> [1] [https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail]
> [2] [https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line]
> [3] [https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python]
> [4] [http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html]
> [5] [https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index]
> [6] [https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe]
> [7] [https://issues.apache.org/jira/browse/SPARK-26433]
> [8] [https://github.com/databricks/koalas/issues/343]