[jira] [Created] (SPARK-24949) pyspark.sql.Column breaks the iterable contract

2018-07-27 Thread Daniel Shields (JIRA)
Daniel Shields created SPARK-24949:
--

 Summary: pyspark.sql.Column breaks the iterable contract
 Key: SPARK-24949
 URL: https://issues.apache.org/jira/browse/SPARK-24949
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.1
Reporter: Daniel Shields


pyspark.sql.Column implements __iter__ just to raise a TypeError:
{code:java}
def __iter__(self):
    raise TypeError("Column is not iterable")
{code}
This makes Column look iterable even though it isn't:
{code:java}
isinstance(mycolumn, collections.Iterable)  # Evaluates to True
{code}
This function should be removed from Column completely so it behaves like every 
other non-iterable class.

For further motivation, consider the example below, which currently has to list Column 
explicitly:
{code:java}
import collections
import pyspark.sql

def listlike(value):
    # Column unfortunately implements __iter__ just to raise a TypeError.
    # This breaks the iterable contract and should be fixed in Spark proper.
    return isinstance(value, collections.Iterable) and \
        not isinstance(value, (str, bytes, pyspark.sql.Column))
{code}
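As a standalone illustration of why the isinstance check reports True (plain Python 3, no 
Spark required; the class names here are hypothetical): collections.abc.Iterable's 
__subclasshook__ only checks that __iter__ is defined, not that it actually returns an 
iterator.
{code:java}
import collections.abc

class FakeColumn:
    # Mirrors pyspark.sql.Column: defines __iter__ only to raise.
    def __iter__(self):
        raise TypeError("not iterable")

class PlainValue:
    # Defines no __iter__ at all.
    pass

# The ABC check only looks for the presence of __iter__, so the broken
# class still "passes" even though iterating over it always fails.
print(isinstance(FakeColumn(), collections.abc.Iterable))  # True (misleading)
print(isinstance(PlainValue(), collections.abc.Iterable))  # False (expected)
{code}
Removing __iter__ from Column would make it behave like PlainValue above, which is exactly 
what helpers like listlike expect.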






[jira] [Created] (SPARK-24385) Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join

2018-05-24 Thread Daniel Shields (JIRA)
Daniel Shields created SPARK-24385:
--

 Summary: Trivially-true EqualNullSafe should be handled like 
EqualTo in Dataset.join
 Key: SPARK-24385
 URL: https://issues.apache.org/jira/browse/SPARK-24385
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.1
Reporter: Daniel Shields


Dataset.join(right: Dataset[_], joinExprs: Column, joinType: String) has special logic that 
resolves trivially-true equality predicates (a column compared against itself, as in a self 
join) against both sides of the join. It currently handles regular equals but not null-safe 
equals; the code should be updated to handle null-safe equals as well.

PySpark example:
{code:java}
from pyspark.sql import functions as F

df = spark.range(10)
df.join(df, 'id').collect()                           # This works.
df.join(df, df['id'] == df['id']).collect()           # This works.
df.join(df, df['id'].eqNullSafe(df['id'])).collect()  # This fails!

# This is a workaround that works.
df2 = df.withColumn('id', F.col('id'))
df.join(df2, df['id'].eqNullSafe(df2['id'])).collect()
{code}
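For context on why eqNullSafe shows up in join conditions at all, here is a minimal sketch of 
its semantics (the column name k and the sample data are hypothetical, and the two DataFrames 
are distinct so this does not exercise the trivially-true code path): == treats NULL == NULL 
as NULL and drops the row, while eqNullSafe treats it as true.
{code:java}
from pyspark.sql.types import IntegerType, StructField, StructType

schema = StructType([StructField("k", IntegerType(), True)])
left = spark.createDataFrame([(None,), (1,)], schema)
right = spark.createDataFrame([(None,), (2,)], schema)

# NULL == NULL evaluates to NULL, so the row is dropped from the inner join.
left.join(right, left["k"] == right["k"]).count()           # 0
# NULL <=> NULL evaluates to true, so the NULL row is kept.
left.join(right, left["k"].eqNullSafe(right["k"])).count()  # 1
{code}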
The relevant code in Dataset.join should look like this:
{code:java}
// Otherwise, find the trivially true predicates and automatically resolves
// them to both sides.
// By the time we get here, since we have already run analysis, all attributes
// should've been resolved and become AttributeReference.
val cond = plan.condition.map { _.transform {
  case catalyst.expressions.EqualTo(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    catalyst.expressions.EqualTo(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name))
  // This case is new!!!
  case catalyst.expressions.EqualNullSafe(a: AttributeReference, b: AttributeReference)
      if a.sameRef(b) =>
    catalyst.expressions.EqualNullSafe(
      withPlan(plan.left).resolve(a.name),
      withPlan(plan.right).resolve(b.name))
}}
{code}






[jira] [Created] (SPARK-17435) RowEncoder should be documented and publicly accessible

2016-09-07 Thread Daniel Shields (JIRA)
Daniel Shields created SPARK-17435:
--

 Summary: RowEncoder should be documented and publicly accessible
 Key: SPARK-17435
 URL: https://issues.apache.org/jira/browse/SPARK-17435
 Project: Spark
  Issue Type: Improvement
Reporter: Daniel Shields


The object org.apache.spark.sql.catalyst.encoders.RowEncoder is critical when 
constructing a Dataset that is not Dataset[Row] but still has Row in its type.  
For example, consider the following:

{code:java}
df.map(row => (row.getInt(0), row))(Encoders.tuple(Encoders.scalaInt, RowEncoder(df.schema)))
{code}

Alternatively, there could be a way to get the Encoder from Dataset directly; 
it's unclear why 'encoder' isn't a member of Dataset.






[jira] [Updated] (SPARK-17416) Add Dataset.groupByKey overload that takes a value selector function

2016-09-06 Thread Daniel Shields (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-17416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Shields updated SPARK-17416:
---
Description: 
I propose that the following overload be added to Dataset[T]:

def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: 
Encoder[K], arg1: Encoder[V])

This would simplify a number of use cases.  For example, consider the following 
classic MapReduce query:

rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples


An idiomatic way to write this with Spark 2.0 would be:

dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)

Without the groupByKey overload suggested above, this must be written as:

dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => (a._1, g(a._2, b._2)))

  was:
I propose that the following overload be added to Dataset[T]:

def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: 
Encoder[K], implicit arg1: Encoder[V])

This would simplify a number of use cases.  For example, consider the following 
classic MapReduce query:

rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples


An idiomatic way to write this with Spark 2.0 would be:

dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)

Without the groupByKey overload suggested above, this must be written as:

dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => (a._1, g(a._2, b._2)))


> Add Dataset.groupByKey overload that takes a value selector function
> 
>
> Key: SPARK-17416
> URL: https://issues.apache.org/jira/browse/SPARK-17416
> Project: Spark
>  Issue Type: New Feature
>Reporter: Daniel Shields
>
> I propose that the following overload be added to Dataset[T]:
> def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: 
> Encoder[K], arg1: Encoder[V])
> This would simplify a number of use cases.  For example, consider the 
> following classic MapReduce query:
> rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples
> An idiomatic way to write this with Spark 2.0 would be:
> dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)
> Without the groupByKey overload suggested above, this must be written as:
> dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => (a._1, g(a._2, b._2)))






[jira] [Created] (SPARK-17416) Add Dataset.groupByKey overload that takes a value selector function

2016-09-06 Thread Daniel Shields (JIRA)
Daniel Shields created SPARK-17416:
--

 Summary: Add Dataset.groupByKey overload that takes a value 
selector function
 Key: SPARK-17416
 URL: https://issues.apache.org/jira/browse/SPARK-17416
 Project: Spark
  Issue Type: New Feature
Reporter: Daniel Shields


I propose that the following overload be added to Dataset[T]:

def groupByKey[K, V](keyFunc: T => K, valueFunc: T => V)(implicit arg0: 
Encoder[K], implicit arg1: Encoder[V])

This would simplify a number of use cases.  For example, consider the following 
classic MapReduce query:

rdd.flatMap(f).reduceByKey(g) // where f returns a list of tuples


An idiomatic way to write this with Spark 2.0 would be:

dataset.flatMap(f).groupByKey(_._1, _._2).reduceGroups(g)

Without the groupByKey overload suggested above, this must be written as:

dataset.flatMap(f).groupByKey(_._1).reduceGroups((a, b) => (a._1, g(a._2, b._2)))
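To make f and g in the classic query above concrete, here is a small hypothetical instance of 
the rdd.flatMap(f).reduceByKey(g) pattern (word count over the PySpark RDD API; the input data 
is made up for illustration):
{code:java}
lines = spark.sparkContext.parallelize(["a b a", "b c"])
counts = (lines
          .flatMap(lambda line: [(w, 1) for w in line.split()])  # f: emits (key, value) tuples
          .reduceByKey(lambda x, y: x + y))                      # g: combines values per key
counts.collect()  # [('a', 2), ('b', 2), ('c', 1)] in some order
{code}
The proposed groupByKey(keyFunc, valueFunc) overload would let the Dataset version mirror this 
shape directly instead of carrying the full tuple through reduceGroups.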


