[jira] [Created] (SPARK-24949) pyspark.sql.Column breaks the iterable contract
Daniel Shields created SPARK-24949:
-----------------------------------

Summary: pyspark.sql.Column breaks the iterable contract
Key: SPARK-24949
URL: https://issues.apache.org/jira/browse/SPARK-24949
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.3.1
Reporter: Daniel Shields

pyspark.sql.Column implements __iter__ just to raise a TypeError:

{code:java}
def __iter__(self):
    raise TypeError("Column is not iterable")
{code}

This makes Column look iterable even though it isn't:

{code:java}
isinstance(mycolumn, collections.Iterable)  # Evaluates to True
{code}

This method should be removed from Column entirely so that it behaves like every other non-iterable class.

For further motivation of why this should be fixed, consider the example below, which currently has to list Column explicitly:

{code:java}
import collections
import pyspark.sql

def listlike(value):
    # Column unfortunately implements __iter__ just to raise a TypeError.
    # This breaks the iterable contract and should be fixed in Spark proper.
    return isinstance(value, collections.Iterable) and not isinstance(value, (str, bytes, pyspark.sql.Column))
{code}
[jira] [Created] (SPARK-24385) Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join
Daniel Shields created SPARK-24385:
-----------------------------------

Summary: Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join
Key: SPARK-24385
URL: https://issues.apache.org/jira/browse/SPARK-24385
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.3.0, 2.2.1
Reporter: Daniel Shields

Dataset.join(right: Dataset[_], joinExprs: Column, joinType: String) has special logic for resolving trivially-true predicates to both sides of the join. It currently handles regular equality but not null-safe equality; the code should be updated to handle null-safe equality as well.

PySpark example:

{code:java}
from pyspark.sql import functions as F

df = spark.range(10)
df.join(df, 'id').collect()                           # This works.
df.join(df, df['id'] == df['id']).collect()           # This works.
df.join(df, df['id'].eqNullSafe(df['id'])).collect()  # This fails!!!

# This is a workaround that works.
df2 = df.withColumn('id', F.col('id'))
df.join(df2, df['id'].eqNullSafe(df2['id'])).collect()
{code}

The relevant code in Dataset.join should look like this:

{code:java}
// Otherwise, find the trivially true predicates and automatically resolves them to both sides.
// By the time we get here, since we have already run analysis, all attributes should've been
// resolved and become AttributeReference.
val cond = plan.condition.map { _.transform {
  case catalyst.expressions.EqualTo(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    catalyst.expressions.EqualTo(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name))

  // This case is new!!!
  case catalyst.expressions.EqualNullSafe(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    catalyst.expressions.EqualNullSafe(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name))
}}
{code}
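The same condition can be expressed through the Scala API, which goes through the Dataset.join code shown above. A minimal Scala sketch of the reproduction and workaround, assuming a local SparkSession and the same resolution behavior as the PySpark calls (the <=> operator is Column's null-safe equality):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(10).toDF("id")

// Trivially-true EqualTo is resolved to both sides of the self-join, so this works.
df.join(df, df("id") === df("id")).collect()

// Trivially-true EqualNullSafe is not rewritten, so this is expected to fail just like
// the eqNullSafe call in the PySpark example.
df.join(df, df("id") <=> df("id")).collect()

// Workaround mirroring the PySpark one: rebuilding the column yields a distinct attribute
// reference, so the condition no longer depends on the trivially-true rewrite.
val df2 = df.withColumn("id", col("id"))
df.join(df2, df("id") <=> df2("id")).collect()
{code}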
[jira] [Created] (SPARK-17435) RowEncoder should be documented and publicly accessible
Daniel Shields created SPARK-17435:
-----------------------------------

Summary: RowEncoder should be documented and publicly accessible
Key: SPARK-17435
URL: https://issues.apache.org/jira/browse/SPARK-17435
Project: Spark
Issue Type: Improvement
Reporter: Daniel Shields

The object org.apache.spark.sql.catalyst.encoders.RowEncoder is essential when constructing a Dataset that is not Dataset[Row] but still has Row somewhere in its type. For example, consider the following:

df.map(row => (row.getInt(0), row))(Encoders.tuple(Encoders.scalaInt, RowEncoder(df.schema)))

Alternatively, there could be a way to get the Encoder from a Dataset directly; it's unclear why 'encoder' isn't a member of Dataset.
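For illustration, a self-contained sketch of the pattern above. It targets the Spark 2.x API referenced in this issue, where RowEncoder(schema) in the catalyst package returns an ExpressionEncoder[Row]; the method name is made up for the example:

{code:java}
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Keep each original Row alongside a derived integer key. The tuple encoder has to be
// assembled by hand because there is no implicit Encoder[Row] in scope.
def keyedByFirstColumn(df: DataFrame): Dataset[(Int, Row)] = {
  implicit val enc: Encoder[(Int, Row)] =
    Encoders.tuple(Encoders.scalaInt, RowEncoder(df.schema))
  df.map(row => (row.getInt(0), row))
}
{code}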
[jira] [Updated] (SPARK-17416) Add Dataset.groupByKey overload that takes a value selector function
[ https://issues.apache.org/jira/browse/SPARK-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Shields updated SPARK-17416:
-----------------------------------
Description:

I propose that the following overload be added to Dataset[T]:

def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: Encoder[K], arg1: Encoder[V])

This would simplify a number of use cases. For example, consider the following classic MapReduce query:

rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples

An idiomatic way to write this with Spark 2.0 would be:

dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)

Without the groupByKey overload suggested above, this must be written as:

dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => g(a._2, b._2))

was:

I propose that the following overload be added to Dataset[T]:

def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: Encoder[K], implicit arg1: Encoder[V])

This would simplify a number of use cases. For example, consider the following classic MapReduce query:

rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples

An idiomatic way to write this with Spark 2.0 would be:

dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)

Without the groupByKey overload suggested above, this must be written as:

dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => g(a._2, b._2))


> Add Dataset.groupByKey overload that takes a value selector function
> ---------------------------------------------------------------------
>
>                 Key: SPARK-17416
>                 URL: https://issues.apache.org/jira/browse/SPARK-17416
>             Project: Spark
>          Issue Type: New Feature
>            Reporter: Daniel Shields
>
> I propose that the following overload be added to Dataset[T]:
> def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: Encoder[K], arg1: Encoder[V])
> This would simplify a number of use cases. For example, consider the following classic MapReduce query:
> rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples
> An idiomatic way to write this with Spark 2.0 would be:
> dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)
> Without the groupByKey overload suggested above, this must be written as:
> dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => g(a._2, b._2))
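For reference, one possible shape of the proposed overload, sketched here as a hypothetical extension method rather than a change to Dataset itself; it simply composes the existing single-argument groupByKey with KeyValueGroupedDataset.mapValues (available since Spark 2.1), which gives the grouping the proposal describes:

{code:java}
import org.apache.spark.sql.{Dataset, Encoder, KeyValueGroupedDataset}

object GroupByKeySyntax {
  // Hypothetical helper; not part of the Spark API.
  implicit class RichDataset[T](ds: Dataset[T]) {
    def groupByKey[K: Encoder, V: Encoder](
        keyFunc: T => K, valueFunc: T => V): KeyValueGroupedDataset[K, V] =
      ds.groupByKey(keyFunc).mapValues(valueFunc)
  }
}

// With the implicit in scope, the MapReduce-style query reads as proposed:
//   dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)
{code}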
[jira] [Created] (SPARK-17416) Add Dataset.groupByKey overload that takes a value selector function
Daniel Shields created SPARK-17416:
-----------------------------------

Summary: Add Dataset.groupByKey overload that takes a value selector function
Key: SPARK-17416
URL: https://issues.apache.org/jira/browse/SPARK-17416
Project: Spark
Issue Type: New Feature
Reporter: Daniel Shields

I propose that the following overload be added to Dataset[T]:

def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: Encoder[K], implicit arg1: Encoder[V])

This would simplify a number of use cases. For example, consider the following classic MapReduce query:

rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples

An idiomatic way to write this with Spark 2.0 would be:

dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)

Without the groupByKey overload suggested above, this must be written as:

dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => g(a._2, b._2))