Hi all,

We ingest our data into dataframes with multiple naturally co-sorted columns. The redundant sort performed during large SortMergeJoin operations takes substantial time that we'd like to eliminate; since the inputs are already ordered on the join key, a plain merge should be sufficient.
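To make "a plain merge" concrete, here is a minimal sketch in plain Python (not Spark code, and not how Spark implements its join) of the step we have in mind. It assumes both inputs are lists of (key, value) rows already sorted by key; the name `merge_join` is hypothetical and exists only for this illustration.

```python
from typing import Any, List, Tuple

Row = Tuple[Any, Any]  # (join key, payload); keys must be mutually comparable


def merge_join(left: List[Row], right: List[Row]) -> List[Tuple[Any, Any, Any]]:
    """Inner-join two inputs that are ALREADY sorted by key, without sorting."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1  # left key has no partner yet; advance left
        elif lk > rk:
            j += 1  # right key has no partner yet; advance right
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for a in range(i, i_end):
                for b in range(j, j_end):
                    out.append((lk, left[a][1], right[b][1]))
            i, j = i_end, j_end
    return out


if __name__ == "__main__":
    left = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    right = [(2, "x"), (3, "y"), (4, "z")]
    print(merge_join(left, right))
```

Each input is scanned once from front to back, so the cost is linear in the input sizes plus the size of the output. This only works if the sortedness assumption actually holds, which is exactly the property we would like the planner to know about.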
Is there a mechanism to avoid these sorts in general? Do we need to persist all our frames as tables with sortBy+bucketBy to get this optimisation? If we use sorted tables, does the "sorted by" metadata persist past the first join, or do we need to rewrite each intermediate result to a (possibly re-sorted?) table to maintain the metadata in the catalog?

Our data actually has multiple distinct monotonically increasing columns. The catalog doesn't seem to be able to capture this information, so a re-sort is required whenever we want to join along a different-but-still-sorted dimension.

We've prototyped hacking the external catalog to let us intercept cases where we want to assert that outputOrdering is equivalent to the requiredOrdering (see SortOrder.orderingSatisfies), but this feels like an abuse. Table information has been lost at these points too, so we need to infer sortedness by comparing raw column names extracted from SortOrder expressions. This breaks in cases where our processing has caused the data to *lose* its sortedness.

Have we missed something simple, or do we have an exotic use case unlike other users?

Thanks!
Tim

--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/