[jira] [Commented] (SPARK-20077) Documentation for ml.stats.Correlation

2017-06-11 Thread Michael Patterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16046023#comment-16046023
 ] 

Michael Patterson commented on SPARK-20077:
---

Is this task still open? Are you thinking of documentation like the one that 
exists for MLlib 
(https://spark.apache.org/docs/latest/mllib-statistics.html#correlations)?
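
For reference, a minimal sketch of the kind of example such a page might show, assuming the Spark 2.2 `pyspark.ml.stat.Correlation` API and an active `spark` session:

{code}
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

# Each row holds a feature vector; Correlation.corr expects a vector column.
data = [(Vectors.dense([1.0, 0.0, 3.0]),),
        (Vectors.dense([2.0, 1.0, 1.0]),),
        (Vectors.dense([4.0, 3.0, 0.0]),)]
df = spark.createDataFrame(data, ["features"])

# Returns a one-row DataFrame whose single value is the correlation matrix.
pearson_matrix = Correlation.corr(df, "features").head()[0]
print(str(pearson_matrix))
{code}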

> Documentation for ml.stats.Correlation
> --
>
> Key: SPARK-20077
> URL: https://issues.apache.org/jira/browse/SPARK-20077
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> Now that (Pearson) correlations are available in spark.ml, we need to write 
> some documentation to go along with this feature. For now, it can simply be 
> based on the examples in the unit tests.






[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-05-04 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20456:
--
Description: 
Document sql.functions.py:

1. Add examples for the common string functions (upper, lower, and reverse)
2. Rename columns in datetime examples to be more informative (e.g. from 'd' to 
'date')
3. Add examples for unix_timestamp, from_unixtime, rand, randn, collect_list, 
collect_set, lit, 
4. Add note to all trigonometry functions that units are radians.
5. Add links between functions (e.g. add a link to radians from toRadians)

  was:
Document `sql.functions.py`:

1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
`count`, `collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Add example for `lit`


> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document sql.functions.py:
> 1. Add examples for the common string functions (upper, lower, and reverse)
> 2. Rename columns in datetime examples to be more informative (e.g. from 'd' 
> to 'date')
> 3. Add examples for unix_timestamp, from_unixtime, rand, randn, collect_list, 
> collect_set, lit, 
> 4. Add note to all trigonometry functions that units are radians.
> 5. Add links between functions (e.g. add a link to radians from toRadians)
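
The examples proposed in item 3 above might look roughly like the following sketch (assuming an active `spark` session; the column names are illustrative):

{code}
from pyspark.sql import functions as F

df = spark.createDataFrame([('2017-04-25 10:30:00',)], ['date'])

# unix_timestamp parses a timestamp string into seconds since the epoch;
# from_unixtime formats those seconds back into a timestamp string.
df.select(F.unix_timestamp('date').alias('unix_time'),
          F.from_unixtime(F.unix_timestamp('date')).alias('round_trip')).show()
{code}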






[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20456:
--
Summary: Add examples for functions collection for pyspark  (was: Document 
major aggregation functions for pyspark)

> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`






[jira] [Updated] (SPARK-20456) Add examples for functions collection for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20456:
--
Description: 
Document `sql.functions.py`:

1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
`count`, `collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Add example for `lit`

  was:
Document `sql.functions.py`:

1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
`collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Document `lit`


> Add examples for functions collection for pyspark
> -
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Add examples for the common aggregate functions (`min`, `max`, `mean`, 
> `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Add example for `lit`






[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983590#comment-15983590
 ] 

Michael Patterson commented on SPARK-20456:
---

I saw that there are short docstrings for the aggregate functions, but I think 
they can be unclear for people new to Spark or relational algebra. For example, 
some of my coworkers didn't know you could do `df.agg(mean(col))` without doing 
a `groupby`. There are also no links to `groupby` in any of the aggregate 
functions. I also didn't know about `collect_set` for a long time. I think 
adding examples would help with visibility and understanding.
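
To illustrate the point, a short sketch (assuming an active `spark` session; the data is made up):

{code}
from pyspark.sql import functions as F

df = spark.createDataFrame([('a', 1), ('a', 2), ('b', 2)], ['key', 'value'])

# Aggregating over the whole DataFrame works without a groupBy.
df.agg(F.mean('value').alias('mean_value')).show()

# collect_set gathers the distinct values of a column into an array.
df.groupBy('key').agg(F.collect_set('value').alias('values')).show()
{code}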

The same thing applies to `lit`. It took me a while to learn what it did.

For the datetime functions, this line, for example, has a column named 'd': 
https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926

I think it would be more informative to name it 'date' or 'time'.

Do these sound reasonable?

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`






[jira] [Created] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Michael Patterson (JIRA)
Michael Patterson created SPARK-20456:
-

 Summary: Document major aggregation functions for pyspark
 Key: SPARK-20456
 URL: https://issues.apache.org/jira/browse/SPARK-20456
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Michael Patterson


Document `sql.functions.py`:

1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
`collect_set`, `collect_list`, `stddev`, `variance`)
2. Rename columns in datetime examples.
3. Add examples for `unix_timestamp` and `from_unixtime`
4. Add note to all trigonometry functions that units are radians.
5. Document `lit`






[jira] [Issue Comment Deleted] (SPARK-20132) Add documentation for column string functions

2017-03-29 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-20132:
--
Comment: was deleted

(was: I have a commit with the documentation: 
https://github.com/map222/spark/commit/ac91b654555f9a07021222f2f1a162634d81be5b

I will make a more formal PR tonight.)

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.






[jira] [Commented] (SPARK-20132) Add documentation for column string functions

2017-03-28 Thread Michael Patterson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946130#comment-15946130
 ] 

Michael Patterson commented on SPARK-20132:
---

I have a commit with the documentation: 
https://github.com/map222/spark/commit/ac91b654555f9a07021222f2f1a162634d81be5b

I will make a more formal PR tonight.

> Add documentation for column string functions
> -
>
> Key: SPARK-20132
> URL: https://issues.apache.org/jira/browse/SPARK-20132
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>  Labels: documentation, newbie
>
> Four Column string functions do not have documentation for PySpark:
> rlike
> like
> startswith
> endswith
> These functions are called through the _bin_op interface, which allows the 
> passing of a docstring. I have added docstrings with examples to each of the 
> four functions.






[jira] [Created] (SPARK-20132) Add documentation for column string functions

2017-03-28 Thread Michael Patterson (JIRA)
Michael Patterson created SPARK-20132:
-

 Summary: Add documentation for column string functions
 Key: SPARK-20132
 URL: https://issues.apache.org/jira/browse/SPARK-20132
 Project: Spark
  Issue Type: Documentation
  Components: PySpark, SQL
Affects Versions: 2.1.0
Reporter: Michael Patterson
Priority: Minor


Four Column string functions do not have documentation for PySpark:
rlike
like
startswith
endswith

These functions are called through the _bin_op interface, which allows the 
passing of a docstring. I have added docstrings with examples to each of the 
four functions.
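
The docstring examples being added might look roughly like the following sketch (assuming an active `spark` session; the data is illustrative):

{code}
df = spark.createDataFrame([('Alice',), ('Bob',)], ['name'])

df.filter(df.name.startswith('Al')).show()   # names starting with 'Al'
df.filter(df.name.endswith('ce')).show()     # names ending with 'ce'
df.filter(df.name.like('Al%')).show()        # SQL LIKE pattern
df.filter(df.name.rlike('^A.*e$')).show()    # Java regular expression
{code}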






[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-28 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Environment: Pyspark 2.0.1, Ipython 4.2  (was: Pyspark 2.0.0, Ipython 4.2)

> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
> Execution plan:
> {code}
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
> ^s)
>+- LogicalRDD [input#0L]
> {code}
> Executing test_df.show() after the above code in pyspark 2.0.1 yields:
> KeyError: 0
> Executing test_df.show() in pyspark 1.6.2 yields:
> {code}
> +-----+------+
> |input|output|
> +-----+------+
> |    2|second|
> +-----+------+
> {code}
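
The snippet above assumes a pyspark shell where `sc` is already defined. A self-contained sketch of the same reproduction might look like this (the session setup is an assumption, not part of the original report):

{code}
from functools import partial

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Hypothetical standalone setup; the original report runs in a pyspark shell.
spark = SparkSession.builder.appName('spark-18014-repro').getOrCreate()
sc = spark.sparkContext

data = [{'input': 0}, {'input': 1}, {'input': 2}]
input_df = sc.parallelize(data).toDF()

my_dict = {1: 'first', 2: 'second'}

def apply_dict(input_dict, value):
    return input_dict[value]

test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())

# Filter, derive a column with a UDF, then filter on the derived column.
# As reported, Spark 2.0.x merges the two filters into one below the Project,
# so the UDF can be evaluated on input 0 and raise KeyError: 0.
test_df = (input_df
           .filter('input > 0')
           .withColumn('output', test_udf('input'))
           .filter(F.col('output').rlike('^s')))
test_df.explain(True)
test_df.show()
{code}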






[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-25 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Environment: Pyspark 2.0.0, Ipython 4.2  (was: Pyspark 2.0.1, Ipython 4.2)

> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.0, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
> Execution plan:
> {code}
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
> ^s)
>+- LogicalRDD [input#0L]
> {code}
> Executing test_df.show() after the above code in pyspark 2.0.1 yields:
> KeyError: 0
> Executing test_df.show() in pyspark 1.6.2 yields:
> {code}
> +-----+------+
> |input|output|
> +-----+------+
> |    2|second|
> +-----+------+
> {code}






[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:

```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> ```
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> ```
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && 

[jira] [Created] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)
Michael Patterson created SPARK-18014:
-

 Summary: Filters are incorrectly being grouped together when there 
is processing in between
 Key: SPARK-18014
 URL: https://issues.apache.org/jira/browse/SPARK-18014
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.1
 Environment: Pyspark 2.0.1, Ipython 4.2
Reporter: Michael Patterson
Priority: Minor


I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0},{'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing the above code in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+






[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0},{'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing the above code in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0},{'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing the above code in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> ```
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0},{'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> ```
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L 

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
{code}
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
{code}
Execution plan:
{code}
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]
{code}
Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
{code}
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+
{code}

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
{code}
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
{code}
Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
> Execution plan:
> {code}
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) 

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
{code}
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
{code}
Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:

import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> {code}
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> {code}
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && 

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:

import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:

```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && 

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
{{import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)}}

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && 

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict( input_dict, value):
>     return input_dict[value]
> test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
> test_df = input_df.filter('input > 0').withColumn('output', 
> test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>+- Filter (input#0L > cast(0 as bigint))
>   +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && 

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
{{import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)}}

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict( input_dict, value):
    return input_dict[value]
test_udf = F.udf( partial(apply_dict, my_dict ), T.StringType() )
test_df = input_df.filter('input > 0').withColumn('output', 
test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
  +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE 
^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> {{import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
> test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)}}
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>    +- Filter (input#0L > cast(0 as bigint))
>       +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && 

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0},{'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+
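
Another sketch of a workaround, not from the original report: it assumes the same sc and input_df as above, and mapping_df and joined_df are hypothetical names. It expresses the dictionary lookup as a join against a small mapping DataFrame instead of a Python UDF, so the optimizer is free to combine or reorder the filters without a Python function ever seeing an unexpected key:

import pyspark.sql.functions as F

my_dict = {1: 'first', 2: 'second'}
# Turn the lookup table into a DataFrame that can be joined on 'input'
mapping_df = sc.parallelize(
    [{'input': k, 'output': v} for k, v in my_dict.items()]
).toDF()

joined_df = (input_df
             .filter('input > 0')
             .join(mapping_df, on='input', how='inner')
             .filter(F.col('output').rlike('^s')))
joined_df.show()  # expected: only the (2, 'second') row

The inner join simply drops any input value that has no dictionary entry instead of failing on it.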

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
```
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0},{'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)
```

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing the above code in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0},{'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
> test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>    +- Filter (input#0L > cast(0 as bigint))
>       +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) && 

[jira] [Updated] (SPARK-18014) Filters are incorrectly being grouped together when there is processing in between

2016-10-19 Thread Michael Patterson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Patterson updated SPARK-18014:
--
Description: 
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0}, {'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+
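
A further sketch, not from the original report and with no guarantee that the behaviour holds across Spark versions: assuming the same input_df and test_udf as above, cache and materialize the intermediate result so that the later rlike filter is planned on top of the cached data rather than being combined with the earlier filter:

import pyspark.sql.functions as F

intermediate_df = (input_df
                   .filter('input > 0')
                   .withColumn('output', test_udf('input'))
                   .cache())
intermediate_df.count()  # force materialization; the UDF only runs on rows with input > 0

result_df = intermediate_df.filter(F.col('output').rlike('^s'))
result_df.show()  # expected: only the (2, 'second') row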

  was:
I created a dataframe that needed to filter the data on columnA, create a new 
columnB by applying a user defined function to columnA, and then filter on 
columnB. However, the two filters were being grouped together in the execution 
plan after the withColumn statement, which was causing errors due to unexpected 
input to the withColumn statement.

Example code to reproduce:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from functools import partial
data = [{'input':0},{'input':1}, {'input':2}]
input_df = sc.parallelize(data).toDF()
my_dict = {1:'first', 2:'second'}
def apply_dict(input_dict, value):
    return input_dict[value]
test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
test_df.explain(True)

Execution plan:
== Analyzed Logical Plan ==
input: bigint, output: string
Filter output#4 RLIKE ^s
+- Project [input#0L, partial(input#0L) AS output#4]
   +- Filter (input#0L > cast(0 as bigint))
      +- LogicalRDD [input#0L]

== Optimized Logical Plan ==
Project [input#0L, partial(input#0L) AS output#4]
+- Filter ((isnotnull(input#0L) && (input#0L > 0)) && partial(input#0L) RLIKE ^s)
   +- LogicalRDD [input#0L]

Executing test_df.show() after the above code in pyspark 2.0.1 yields:
KeyError: 0

Executing test_df.show() in pyspark 1.6.2 yields:
+-----+------+
|input|output|
+-----+------+
|    2|second|
+-----+------+


> Filters are incorrectly being grouped together when there is processing in 
> between
> --
>
> Key: SPARK-18014
> URL: https://issues.apache.org/jira/browse/SPARK-18014
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1
> Environment: Pyspark 2.0.1, Ipython 4.2
>Reporter: Michael Patterson
>Priority: Minor
>
> I created a dataframe that needed to filter the data on columnA, create a new 
> columnB by applying a user defined function to columnA, and then filter on 
> columnB. However, the two filters were being grouped together in the 
> execution plan after the withColumn statement, which was causing errors due 
> to unexpected input to the withColumn statement.
> Example code to reproduce:
> import pyspark.sql.functions as F
> import pyspark.sql.types as T
> from functools import partial
> data = [{'input':0}, {'input':1}, {'input':2}]
> input_df = sc.parallelize(data).toDF()
> my_dict = {1:'first', 2:'second'}
> def apply_dict(input_dict, value):
>     return input_dict[value]
> test_udf = F.udf(partial(apply_dict, my_dict), T.StringType())
> test_df = input_df.filter('input > 0').withColumn('output', test_udf('input')).filter(F.col('output').rlike('^s'))
> test_df.explain(True)
> Execution plan:
> == Analyzed Logical Plan ==
> input: bigint, output: string
> Filter output#4 RLIKE ^s
> +- Project [input#0L, partial(input#0L) AS output#4]
>    +- Filter (input#0L > cast(0 as bigint))
>       +- LogicalRDD [input#0L]
> == Optimized Logical Plan ==
> Project [input#0L, partial(input#0L) AS output#4]
> +- Filter ((isnotnull(input#0L) && (input#0L > 0)) &&