[jira] [Updated] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases

2016-11-03 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18218:

Shepherd: Yanbo Liang

> Optimize BlockMatrix multiplication, which may cause OOM and low parallelism 
> usage problem in several cases
> ---
>
> Key: SPARK-18218
> URL: https://issues.apache.org/jira/browse/SPARK-18218
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> After taking a deep look into the `BlockMatrix.multiply` implementation, I found 
> that the current implementation can cause problems in some special cases.
> Let me use an extreme case to illustrate it:
> Suppose we have two block matrices A and B whose shared (middle) dimension is 
> split into a large number of blocks, say N:
> A has N blocks in total, with numRowBlocks = 1 and numColBlocks = N
> B also has N blocks in total, with numRowBlocks = N and numColBlocks = 1
> Now if we call A.multiply(B), no matter how A and B are partitioned,
> the resultPartitioner will always contain only one partition,
> so this multiplication implementation shuffles every block of A and B into a 
> single reducer, which drops the parallelism to 1.
> What's worse, because `RDD.cogroup` loads the entire group into memory, all of 
> those blocks end up in memory on the reducer side, since they are all shuffled 
> into the same group. This can easily cause an executor OOM.
> The above case is a little extreme, but in other cases, such as an M*N-dimension 
> matrix A multiplied by an N*P-dimension matrix B where N is much larger than M and 
> P, we hit a similar problem.
> The multiplication implementation does not handle the task partitioning properly, 
> which causes:
> 1. when the middle dimension N is too large, reducer OOM;
> 2. even if OOM does not occur, parallelism that is still too low;
> 3. when N is much larger than M and P, and matrices A and B have many 
> partitions, too many partitions along the M and P dimensions, which makes 
> the shuffled data size much larger.
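
For concreteness, here is a minimal sketch of the extreme shape described above (a Spark-shell style example; `sc` is the shell's SparkContext, and the block count and block size are made-up illustrative values):

{code}
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

val n = 2000          // number of blocks along the shared (middle) dimension
val blockSize = 100   // rows/cols per block, illustrative only

// A is 1 x n blocks, B is n x 1 blocks.
val aBlocks = sc.parallelize(0 until n).map { j =>
  ((0, j), Matrices.rand(blockSize, blockSize, new java.util.Random(j)))
}
val bBlocks = sc.parallelize(0 until n).map { i =>
  ((i, 0), Matrices.rand(blockSize, blockSize, new java.util.Random(i)))
}
val A = new BlockMatrix(aBlocks, blockSize, blockSize)
val B = new BlockMatrix(bBlocks, blockSize, blockSize)

// The product consists of a single result block, so the cogroup inside multiply()
// sends all 2n input blocks to one reducer task, which holds them all in memory.
val C = A.multiply(B)
C.blocks.count()
{code}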






[jira] [Commented] (SPARK-18193) queueStream not updated if rddQueue.add after create queueStream in Java

2016-11-03 Thread Hubert Kang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635294#comment-15635294
 ] 

Hubert Kang commented on SPARK-18193:
-

Is it possible to do it the opposite way, that is, update 
JavaQueueStream.java to match QueueStream.scala?
That's exactly how I want to handle a live data stream with a Queue.


> queueStream not updated if rddQueue.add after create queueStream in Java
> 
>
> Key: SPARK-18193
> URL: https://issues.apache.org/jira/browse/SPARK-18193
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Hubert Kang
>
> Within 
> examples\src\main\java\org\apache\spark\examples\streaming\JavaQueueStream.java,
> no data is detected if the code below, which puts something into rddQueue, is 
> executed after the queueStream is created (line 65).
> for (int i = 0; i < 30; i++) {
>   rddQueue.add(ssc.sparkContext().parallelize(list));
> }
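
For comparison, here is a condensed Scala sketch of what QueueStream.scala does (it assumes an already-created StreamingContext `ssc` with a one-second batch interval): RDDs pushed into the queue after the context has started are still picked up, which is the behavior the Java example lacks when its loop runs after the stream is created.

{code}
import scala.collection.mutable
import org.apache.spark.rdd.RDD

val rddQueue = new mutable.Queue[RDD[Int]]()
val stream = ssc.queueStream(rddQueue)
stream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()
ssc.start()

// Keep feeding batches while the stream is already running.
for (_ <- 1 to 30) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  }
  Thread.sleep(1000)
}
ssc.stop()
{code}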






[jira] [Commented] (SPARK-18225) job will miss when driver removed by master in spark streaming

2016-11-03 Thread liujianhui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635261#comment-15635261
 ] 

liujianhui commented on SPARK-18225:


We provide a platform for users to submit their streaming apps. In that case, an app 
would have to set up an endpoint of its own to receive the stop message and then 
call StreamingContext#stop to shut down, but that is not convenient.
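
For illustration, a hypothetical sketch of that kind of shutdown hook without a dedicated RPC endpoint (the marker-file path and polling interval are made up; `ssc` is an already-started StreamingContext): the app polls for a marker file and stops the context gracefully when it appears.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

val stopMarker = new Path("/tmp/streaming-app.stop")  // created externally to request a stop
val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)

var stopped = false
while (!stopped) {
  // Returns true once the context has terminated; otherwise poll every 10 seconds.
  stopped = ssc.awaitTerminationOrTimeout(10000L)
  if (!stopped && fs.exists(stopMarker)) {
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    stopped = true
  }
}
{code}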

> job will miss when driver removed by master in spark streaming 
> ---
>
> Key: SPARK-18225
> URL: https://issues.apache.org/jira/browse/SPARK-18225
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Scheduler
>Affects Versions: 1.6.1, 1.6.2
>Reporter: liujianhui
>
> When the application is killed from the Spark UI, the master sends an ApplicationRemoved 
> message to the driver. The driver aborts all pending jobs, each job finishes with the 
> exception "Master removed our application: Killed", and the JobScheduler removes the 
> jobs from the job sets. However, the JobGenerator still runs doCheckpoint without the 
> jobs that were removed, and then the driver stops. When recovering from the checkpoint 
> file, all of the aborted jobs are missed.






[jira] [Resolved] (SPARK-18259) QueryExecution should not catch Throwable

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18259.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> QueryExecution should not catch Throwable
> -
>
> Key: SPARK-18259
> URL: https://issues.apache.org/jira/browse/SPARK-18259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.1.0
>
>
> QueryExecution currently catches Throwable. This is far from best 
> practice. It would be better if we caught only AnalysisException.
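
A minimal sketch of the distinction (an illustrative helper, not the actual QueryExecution code):

{code}
import org.apache.spark.sql.AnalysisException

def describePlan(buildPlan: => String): String =
  try {
    buildPlan
  } catch {
    // Narrow handler: only analysis problems become a readable message;
    // fatal errors (OutOfMemoryError, etc.) are no longer swallowed.
    case e: AnalysisException => s"Invalid plan: ${e.getMessage}"
  }
{code}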






[jira] [Commented] (SPARK-18225) job will miss when driver removed by master in spark streaming

2016-11-03 Thread liujianhui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635252#comment-15635252
 ] 

liujianhui commented on SPARK-18225:


It still runs doCheckpoint even when the app is killed from the UI, because the 
RecurringTimer does not stop and keeps posting DoCheckpoint events as before. But since 
the SparkContext is marked as stopped, all the jobs will be aborted.

> job will miss when driver removed by master in spark streaming 
> ---
>
> Key: SPARK-18225
> URL: https://issues.apache.org/jira/browse/SPARK-18225
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Scheduler
>Affects Versions: 1.6.1, 1.6.2
>Reporter: liujianhui
>
> When the application is killed from the Spark UI, the master sends an ApplicationRemoved 
> message to the driver. The driver aborts all pending jobs, each job finishes with the 
> exception "Master removed our application: Killed", and the JobScheduler removes the 
> jobs from the job sets. However, the JobGenerator still runs doCheckpoint without the 
> jobs that were removed, and then the driver stops. When recovering from the checkpoint 
> file, all of the aborted jobs are missed.






[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation

2016-11-03 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635204#comment-15635204
 ] 

Nattavut Sutyanyong commented on SPARK-17348:
-

[~hvanhovell], would you please review my PR, which returns an AnalysisException in 
the scenarios that could produce incorrect results? Thank you.

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when 
> T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation in the subquery. The aggregation yields 
> MAX(T2.C1) = 2. Therefore, evaluating the predicate 
> T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) yields false, so the result 
> should be an empty set.
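
To make the hand evaluation above concrete, here is a small cross-check (Spark-shell style, mirroring the snippet in the description) that computes the correlated aggregate with an explicit join and grouping instead of IN; it returns no rows, which is the expected result:

{code}
Seq((1, 1)).toDF("c1", "c2").createOrReplaceTempView("t1")
Seq((1, 1), (2, 0)).toDF("c1", "c2").createOrReplaceTempView("t2")

sql("""
  SELECT t1.c1
  FROM t1
  JOIN (SELECT t1.c2 AS k, max(t2.c1) AS m
        FROM t1 JOIN t2 ON t1.c2 >= t2.c2
        GROUP BY t1.c2) agg
    ON t1.c2 = agg.k AND t1.c1 = agg.m
""").show()
// Prints an empty table (no rows), matching the expected result of the original query.
{code}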






[jira] [Assigned] (SPARK-18217) Disallow creating permanent views based on temporary views or UDFs

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18217:


Assignee: Xiao Li  (was: Apache Spark)

> Disallow creating permanent views based on temporary views or UDFs
> --
>
> Key: SPARK-18217
> URL: https://issues.apache.org/jira/browse/SPARK-18217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Xiao Li
>
> See the discussion in the parent ticket SPARK-18209. It doesn't really make 
> sense to create permanent views based on temporary views or UDFs.
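
A small illustration (hypothetical view names) of the pattern this ticket disallows; the permanent view would silently depend on session-scoped state:

{code}
spark.range(10).createOrReplaceTempView("recent_events")

// After this change the statement below should be rejected: recent_events
// disappears when the session ends, while recent_summary would not.
spark.sql("CREATE VIEW recent_summary AS SELECT count(*) AS n FROM recent_events")
{code}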






[jira] [Assigned] (SPARK-18217) Disallow creating permanent views based on temporary views or UDFs

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18217:


Assignee: Apache Spark  (was: Xiao Li)

> Disallow creating permanent views based on temporary views or UDFs
> --
>
> Key: SPARK-18217
> URL: https://issues.apache.org/jira/browse/SPARK-18217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> See the discussion in the parent ticket SPARK-18209. It doesn't really make 
> sense to create permanent views based on temporary views or UDFs.






[jira] [Commented] (SPARK-18217) Disallow creating permanent views based on temporary views or UDFs

2016-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635202#comment-15635202
 ] 

Apache Spark commented on SPARK-18217:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15764

> Disallow creating permanent views based on temporary views or UDFs
> --
>
> Key: SPARK-18217
> URL: https://issues.apache.org/jira/browse/SPARK-18217
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Xiao Li
>
> See the discussion in the parent ticket SPARK-18209. It doesn't really make 
> sense to create permanent views based on temporary views or UDFs.






[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation

2016-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635199#comment-15635199
 ] 

Apache Spark commented on SPARK-17348:
--

User 'nsyca' has created a pull request for this issue:
https://github.com/apache/spark/pull/15763

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when 
> T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation in the subquery. The aggregation yields 
> MAX(T2.C1) = 2. Therefore, evaluating the predicate 
> T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) yields false, so the result 
> should be an empty set.






[jira] [Assigned] (SPARK-17348) Incorrect results from subquery transformation

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17348:


Assignee: (was: Apache Spark)

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when 
> T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation in the subquery. The aggregation yields 
> MAX(T2.C1) = 2. Therefore, evaluating the predicate 
> T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) yields false, so the result 
> should be an empty set.






[jira] [Assigned] (SPARK-17348) Incorrect results from subquery transformation

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17348:


Assignee: Apache Spark

> Incorrect results from subquery transformation
> --
>
> Key: SPARK-17348
> URL: https://issues.apache.org/jira/browse/SPARK-17348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>Assignee: Apache Spark
>  Labels: correctness
>
> {noformat}
> Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1")
> Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2")
> sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= 
> t2.c2)").show
> +---+
> | c1|
> +---+
> |  1|
> +---+
> {noformat}
> The correct result of the above query should be an empty set. Here is an 
> explanation:
> Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when 
> T1.C1 = 1, so both rows need to be processed in the same group of the 
> aggregation in the subquery. The aggregation yields 
> MAX(T2.C1) = 2. Therefore, evaluating the predicate 
> T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) yields false, so the result 
> should be an empty set.






[jira] [Created] (SPARK-18262) JSON.org license is now CatX

2016-11-03 Thread Sean Busbey (JIRA)
Sean Busbey created SPARK-18262:
---

 Summary: JSON.org license is now CatX
 Key: SPARK-18262
 URL: https://issues.apache.org/jira/browse/SPARK-18262
 Project: Spark
  Issue Type: Bug
Reporter: Sean Busbey
Priority: Blocker


per the [updated resolved legal page|http://www.apache.org/legal/resolved.html#json]:

{quote}
CAN APACHE PRODUCTS INCLUDE WORKS LICENSED UNDER THE JSON LICENSE?

No. As of 2016-11-03 this has been moved to the 'Category X' license list. 
Prior to this, use of the JSON Java library was allowed. See Debian's page for 
a list of alternatives.
{quote}

I'm not actually clear if Spark is using one of the JSON.org licensed 
libraries. As of current master (dc4c6009) the java library gets called out in 
the [NOTICE file for our source 
repo|https://github.com/apache/spark/blob/dc4c60098641cf64007e2f0e36378f000ad5f6b1/NOTICE#L424]
 but:

1) It doesn't say where in the source
2) the given url is 404 (http://www.json.org/java/index.html)
3) It doesn't actually say in the NOTICE what license the inclusion is under
4) the JSON.org license for the java {{org.json:json}} artifact (what the blurb 
in #2 is usually referring to) doesn't show up in our LICENSE file, nor in the 
{{licenses/}} directory
5) I don't see a direct reference to the {{org.json:json}} artifact in our poms.

So maybe it's just coming in transitively and we can exclude it / ping whoever 
is bringing it in?






[jira] [Commented] (SPARK-18261) Add statistics to MemorySink for joining

2016-11-03 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635014#comment-15635014
 ] 

Burak Yavuz commented on SPARK-18261:
-

Go for it!




> Add statistics to MemorySink for joining 
> -
>
> Key: SPARK-18261
> URL: https://issues.apache.org/jira/browse/SPARK-18261
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>
> Right now, there is no way to join the output of a memory sink with any table:
> {code}
> UnsupportedOperationException: LeafNode MemoryPlan must implement statistics
> {code}
> Being able to join snapshots of memory streams with tables would be nice.
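
A minimal sketch of the scenario (Spark-shell style; the socket source host/port and table names are made up for illustration): snapshotting a stream into a memory sink and then joining that snapshot with a static table.

{code}
import spark.implicits._

val words = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()
  .as[String]

val query = words.writeStream
  .format("memory")
  .queryName("word_snapshot")
  .outputMode("append")
  .start()

val lookup = Seq(("spark", 1), ("flink", 2)).toDF("value", "id")

// As of 2.0.2 this join fails with
// "UnsupportedOperationException: LeafNode MemoryPlan must implement statistics".
spark.table("word_snapshot").join(lookup, "value").show()
{code}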






[jira] [Commented] (SPARK-18261) Add statistics to MemorySink for joining

2016-11-03 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634979#comment-15634979
 ] 

Liwei Lin commented on SPARK-18261:
---

If no one's working on this, I'd like to take it.

> Add statistics to MemorySink for joining 
> -
>
> Key: SPARK-18261
> URL: https://issues.apache.org/jira/browse/SPARK-18261
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Burak Yavuz
>
> Right now, there is no way to join the output of a memory sink with any table:
> {code}
> UnsupportedOperationException: LeafNode MemoryPlan must implement statistics
> {code}
> Being able to join snapshots of memory streams with tables would be nice.






[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-11-03 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18185:
---
Description: 
As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
Datasource table will overwrite the entire table instead of only the updated 
partitions as in Hive. It also doesn't respect custom partition locations.

We should delete only the proper partitions, scan the metastore for affected 
partitions with custom locations, and ensure that deletes/writes go to the 
right locations for those as well.

  was:
As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
Datasource table will overwrite the entire table instead of only the updated 
partitions as in Hive.

This is non-trivial to fix in 2.1, so we should throw an exception here.


> Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
> --
>
> Key: SPARK-18185
> URL: https://issues.apache.org/jira/browse/SPARK-18185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
> Datasource table will overwrite the entire table instead of only the updated 
> partitions as in Hive. It also doesn't respect custom partition locations.
> We should delete only the proper partitions, scan the metastore for affected 
> partitions with custom locations, and ensure that deletes/writes go to the 
> right locations for those as well.
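
A hypothetical illustration of the scenario (table name and values are made up; a dynamic-partition INSERT OVERWRITE against a partitioned datasource table):

{code}
spark.sql("""
  CREATE TABLE events (id INT, value STRING, day STRING)
  USING parquet
  PARTITIONED BY (day)
""")

spark.sql("INSERT INTO TABLE events SELECT 1, 'a', '2016-11-01'")
spark.sql("INSERT INTO TABLE events SELECT 2, 'b', '2016-11-02'")

// Hive semantics: only the day='2016-11-02' partition should be replaced.
// The reported behavior is that the entire table (including day='2016-11-01')
// is overwritten, and custom partition locations are not respected.
spark.sql("""
  INSERT OVERWRITE TABLE events PARTITION (day)
  SELECT 3 AS id, 'c' AS value, '2016-11-02' AS day
""")
{code}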






[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions

2016-11-03 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18185:
---
Summary: Should fix INSERT OVERWRITE TABLE of Datasource tables with 
dynamic partitions  (was: Should disallow INSERT OVERWRITE TABLE of Datasource 
tables with dynamic partitions)

> Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
> --
>
> Key: SPARK-18185
> URL: https://issues.apache.org/jira/browse/SPARK-18185
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Eric Liang
>
> As of current 2.1, INSERT OVERWRITE with dynamic partitions against a 
> Datasource table will overwrite the entire table instead of only the updated 
> partitions as in Hive.
> This is non-trivial to fix in 2.1, so we should throw an exception here.






[jira] [Commented] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields

2016-11-03 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634816#comment-15634816
 ] 

Wenchen Fan commented on SPARK-18101:
-

Hi [~ekhliang], can this newly added 
test (https://github.com/apache/spark/pull/14750/files#diff-8c4108666a6639034f0ddbfa075bcb37R273)
resolve this ticket?

> ExternalCatalogSuite should test with mixed case fields
> ---
>
> Key: SPARK-18101
> URL: https://issues.apache.org/jira/browse/SPARK-18101
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Eric Liang
>
> Currently, it uses field names such as "a" and "b" which are not useful for 
> testing case preservation.






[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result

2016-11-03 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634741#comment-15634741
 ] 

Nattavut Sutyanyong commented on SPARK-17337:
-

As commented in the PR, the code is nicely done, but the bigger problem remains and is 
tracked by SPARK-17154. I have made a few attempts to fix the root cause of the 
problem, which surfaces as many different symptoms, but it has taken me a long time 
and is not yet as clean as I would like. Also, I see that [~kousuke] keeps refining 
his work through a series of PRs for that, so I hesitate to compete with his solution.

> Incomplete algorithm for name resolution in Catalyst parser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> While investigating SPARK-16951, I found a case of incorrect results from a NOT 
> IN subquery. I originally thought it was an edge case, but further investigation 
> found that this is a more general problem.






[jira] [Updated] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18260:
-
Component/s: SQL

> from_json can throw a better exception when it can't find the column or be 
> nullSafe
> ---
>
> Key: SPARK-18260
> URL: https://issues.apache.org/jira/browse/SPARK-18260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>
> I got this exception:
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 
> failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 
> 74170, 10.0.138.84, executor 2): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
> {code}
> This was because the column that I called `from_json` on didn't exist for all 
> of my rows. Either from_json should be null safe, or it should fail with a 
> better error message
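
A minimal reproduction sketch plus a guard (Spark-shell style; the column name and schema are made up): rows with a null input column hit the NPE above, so until from_json is null-safe the call can be gated with when/isNotNull.

{code}
import spark.implicits._
import org.apache.spark.sql.functions.{col, from_json, when}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("a", StringType)))
val df = Seq("""{"a":"x"}""", null).toDF("payload")

// As reported above, the null row makes JsonToStruct.eval throw an NPE:
// df.select(from_json(col("payload"), schema)).show()

// Guarding the call so null inputs simply yield null:
df.select(when(col("payload").isNotNull, from_json(col("payload"), schema))
  .alias("parsed")).show()
{code}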






[jira] [Created] (SPARK-18261) Add statistics to MemorySink for joining

2016-11-03 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18261:
---

 Summary: Add statistics to MemorySink for joining 
 Key: SPARK-18261
 URL: https://issues.apache.org/jira/browse/SPARK-18261
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Structured Streaming
Affects Versions: 2.0.2
Reporter: Burak Yavuz


Right now, there is no way to join the output of a memory sink with any table:
{code}
UnsupportedOperationException: LeafNode MemoryPlan must implement statistics
{code}

Being able to join snapshots of memory streams with tables would be nice.






[jira] [Updated] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18260:
-
Target Version/s: 2.1.0
Priority: Blocker  (was: Major)

> from_json can throw a better exception when it can't find the column or be 
> nullSafe
> ---
>
> Key: SPARK-18260
> URL: https://issues.apache.org/jira/browse/SPARK-18260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>Priority: Blocker
>
> I got this exception:
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 
> failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 
> 74170, 10.0.138.84, executor 2): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
> {code}
> This was because the column that I called `from_json` on didn't exist for all 
> of my rows. Either from_json should be null safe, or it should fail with a 
> better error message






[jira] [Resolved] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18138.
-
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.1.0

> More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
> 
>
> Key: SPARK-18138
> URL: https://issues.apache.org/jira/browse/SPARK-18138
> Project: Spark
>  Issue Type: Task
>Reporter: Reynold Xin
>Assignee: Sean Owen
>Priority: Blocker
> Fix For: 2.1.0
>
>
> Plan:
> - Mark it very explicitly in Spark 2.1.0 that support for the aforementioned 
> environments is deprecated.
> - Remove that support in Spark 2.2.0.
> Also see the mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html






[jira] [Updated] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18138:

Target Version/s: 2.1.0  (was: 2.2.0)

> More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
> 
>
> Key: SPARK-18138
> URL: https://issues.apache.org/jira/browse/SPARK-18138
> Project: Spark
>  Issue Type: Task
>Reporter: Reynold Xin
>Priority: Blocker
> Fix For: 2.1.0
>
>
> Plan:
> - Mark it very explicitly in Spark 2.1.0 that support for the aforementioned 
> environments is deprecated.
> - Remove that support in Spark 2.2.0.
> Also see the mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html






[jira] [Assigned] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18235:


Assignee: Apache Spark

> ml.ALSModel function parity: ALSModel should support recommendforAll
> 
>
> Key: SPARK-18235
> URL: https://issues.apache.org/jira/browse/SPARK-18235
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Assignee: Apache Spark
>
> For function parity with MatrixFactorizationModel, ALSModel should support the 
> APIs:
> recommendUsersForProducts
> recommendProductsForUsers
> There are another two APIs missing (lower priority):
> recommendProducts
> recommendUsers
> The feature request comes from the mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html
>  
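
For context, a sketch of the RDD-based API whose parity is requested (tiny made-up ratings, Spark-shell style with `sc`); ml.ALSModel currently has no equivalent of these calls:

{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 10, 4.0), Rating(1, 20, 1.0),
  Rating(2, 10, 5.0), Rating(2, 30, 3.0)))

val model = ALS.train(ratings, 5, 10)  // rank = 5, iterations = 10

// Top-2 recommendations for every user / every product in one call.
val topProductsPerUser = model.recommendProductsForUsers(2)
val topUsersPerProduct = model.recommendUsersForProducts(2)
topProductsPerUser.collect().foreach(println)
{code}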






[jira] [Commented] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634766#comment-15634766
 ] 

Michael Armbrust commented on SPARK-18260:
--

We should return null if the input is null.

> from_json can throw a better exception when it can't find the column or be 
> nullSafe
> ---
>
> Key: SPARK-18260
> URL: https://issues.apache.org/jira/browse/SPARK-18260
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>Priority: Blocker
>
> I got this exception:
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 
> failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 
> 74170, 10.0.138.84, executor 2): java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211)
>   at 
> org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
> {code}
> This was because the column that I called `from_json` on didn't exist for all 
> of my rows. Either from_json should be null safe, or it should fail with a 
> better error message






[jira] [Assigned] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18235:


Assignee: (was: Apache Spark)

> ml.ALSModel function parity: ALSModel should support recommendforAll
> 
>
> Key: SPARK-18235
> URL: https://issues.apache.org/jira/browse/SPARK-18235
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> For function parity with MatrixFactorizationModel, ALSModel should support the 
> APIs:
> recommendUsersForProducts
> recommendProductsForUsers
> There are another two APIs missing (lower priority):
> recommendProducts
> recommendUsers
> The feature request comes from the mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html
>  






[jira] [Updated] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18138:

Summary: More officially deprecate support for Python 2.6, Java 7, and 
Scala 2.10  (was: Remove support for Python 2.6, Hadoop 2.6-, Java 7, and Scala 
2.10)

> More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
> 
>
> Key: SPARK-18138
> URL: https://issues.apache.org/jira/browse/SPARK-18138
> Project: Spark
>  Issue Type: Task
>Reporter: Reynold Xin
>Priority: Blocker
> Fix For: 2.1.0
>
>
> Plan:
> - Mark it very explicitly in Spark 2.1.0 that support for the aforementioned 
> environments is deprecated.
> - Remove that support in Spark 2.2.0.
> Also see the mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html






[jira] [Commented] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll

2016-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634753#comment-15634753
 ] 

Apache Spark commented on SPARK-18235:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/15762

> ml.ALSModel function parity: ALSModel should support recommendforAll
> 
>
> Key: SPARK-18235
> URL: https://issues.apache.org/jira/browse/SPARK-18235
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> For function parity with MatrixFactorizationModel, ALSModel should support the 
> APIs:
> recommendUsersForProducts
> recommendProductsForUsers
> There are another two APIs missing (lower priority):
> recommendProducts
> recommendUsers
> The feature request comes from the mailing list: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html
>  






[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept

2016-11-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14657:
--
Target Version/s: 2.2.0  (was: 2.1.0)

> RFormula output wrong features when formula w/o intercept
> -
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> SparkR::glm outputs different features compared with R's glm when fitting without an 
> intercept on data with string/categorical features. Take the following example: 
> SparkR outputs three features, while native R outputs four.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
> Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269 -16.989  0
> Species_virginica   -1.4708   0.077397-19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> Sepal.Length0.3499 0.0463   7.557 4.19e-12 ***
> Speciessetosa   1.6765 0.2354   7.123 4.46e-11 ***
> Speciesversicolor   0.6931 0.2779   2.494   0.0137 *  
> Speciesvirginica0.6690 0.3078   2.174   0.0313 *  
> {quote}
> The encoding of string/categorical features is different: R does not drop any 
> category, but SparkR drops the last one.
> I searched online and tested some other cases, and found that when we fit an R glm 
> model (or another model powered by the R formula) without an intercept on a dataset 
> that includes string/categorical features, one of the categories of the first 
> categorical feature is used as the reference category, and no category is dropped 
> for that feature.
> I think we should keep the semantics of Spark RFormula consistent with the R 
> formula.
> cc [~mengxr] 
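
A Scala-side sketch of the same comparison (a few made-up rows standing in for iris; Spark-shell style): RFormula with "- 1" still drops the last category of Species, which is the inconsistency with R described above.

{code}
import spark.implicits._
import org.apache.spark.ml.feature.RFormula

val training = Seq(
  (5.1, 3.5, "setosa"),
  (7.0, 3.2, "versicolor"),
  (6.3, 3.3, "virginica")
).toDF("Sepal_Length", "Sepal_Width", "Species")

val formula = new RFormula()
  .setFormula("Sepal_Width ~ Sepal_Length + Species - 1")
  .setFeaturesCol("features")
  .setLabelCol("label")

// Inspect how the categorical Species column was encoded.
formula.fit(training).transform(training).select("features", "label").show(false)
{code}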






[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result

2016-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634694#comment-15634694
 ] 

Apache Spark commented on SPARK-17337:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15761

> Incomplete algorithm for name resolution in Catalyst parser may lead to 
> incorrect result
> ---
>
> Key: SPARK-17337
> URL: https://issues.apache.org/jira/browse/SPARK-17337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> While investigating SPARK-16951, I found a case of incorrect results from a NOT 
> IN subquery. I originally thought it was an edge case, but further investigation 
> found that this is a more general problem.






[jira] [Resolved] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2016-11-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12488.
---
  Resolution: Fixed
Assignee: Xiangrui Meng
   Fix Version/s: 1.6.1
  2.0.0
  1.5.3
  1.4.2
Target Version/s:   (was: 2.1.0)

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>Assignee: Xiangrui Meng
> Fix For: 1.4.2, 1.5.3, 2.0.0, 1.6.1
>
>
> When running the LDA model, and using the describeTopics function, invalid 
> values appear in the termID list that is returned:
> The below example generates 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}






[jira] [Created] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe

2016-11-03 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18260:
---

 Summary: from_json can throw a better exception when it can't find 
the column or be nullSafe
 Key: SPARK-18260
 URL: https://issues.apache.org/jira/browse/SPARK-18260
 Project: Spark
  Issue Type: Bug
Reporter: Burak Yavuz


I got this exception:

{code}
SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 
failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 74170, 
10.0.138.84, executor 2): java.lang.NullPointerException
at 
org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71)
at 
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211)
at 
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
{code}

This was because the column that I called `from_json` on didn't exist for all 
of my rows. Either from_json should be null safe, or it should fail with a 
better error message






[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs

2016-11-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634683#comment-15634683
 ] 

Joseph K. Bradley commented on SPARK-12488:
---

I'm going to close this since it seems like it has been fixed by [SPARK-13355]. 
If anyone sees this in versions which include that patch, please report!  
Thanks all.

> LDA describeTopics() Generates Invalid Term IDs
> ---
>
> Key: SPARK-12488
> URL: https://issues.apache.org/jira/browse/SPARK-12488
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Ilya Ganelin
>
> When running the LDA model, and using the describeTopics function, invalid 
> values appear in the termID list that is returned:
> The below example generates 10 topics on a data set with a vocabulary of 685.
> {code}
> // Set LDA parameters
> val numTopics = 10
> val lda = new LDA().setK(numTopics).setMaxIterations(10)
> val ldaModel = lda.run(docTermVector)
> val distModel = 
> ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel]
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted.reverse
> res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, 
> 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, 
> 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, 
> 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, 
> 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, 
> 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, 
> 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, 
> 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, 
> 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, 
> 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, 
> 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53...
> {code}
> {code}
> scala> ldaModel.describeTopics()(0)._1.sorted
> res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, 
> -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, 
> -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, 
> -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, 
> -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, 
> -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, 
> -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, 
> -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, 
> -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
> 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 
> 38, 39, 40, 41, 42, 43, 44, 45, 4...
> {code}






[jira] [Resolved] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-18254.

Resolution: Fixed
  Assignee: Eyal Farago  (was: Davies Liu)

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Eyal Farago
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
>     # The non-aliased names, FIRST and LAST, show up here, instead of
>     # first_name and last_name.
>     # print(full_name)
>     return len(full_name.first_name) + len(full_name.last_name)
> 
> if __name__ == '__main__':
>     spark = (
>         pyspark.sql.SparkSession.builder
>         .getOrCreate())
> 
>     length_udf = udf(length)
> 
>     names = spark.createDataFrame([
>         Row(FIRST='Nick', LAST='Chammas'),
>         Row(FIRST='Walter', LAST='Williams'),
>     ])
> 
>     names_cleaned = (
>         names
>         .select(
>             col('FIRST').alias('first_name'),
>             col('LAST').alias('last_name'),
>         )
>         .withColumn('full_name', struct('first_name', 'last_name'))
>         .select('full_name'))
> 
>     # We see the schema we expect here.
>     names_cleaned.printSchema()
> 
>     # However, here we get an AttributeError. length_udf() cannot
>     # find first_name or last_name.
>     (names_cleaned
>         .withColumn('length', length_udf('full_name'))
>         .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0

2016-11-03 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634605#comment-15634605
 ] 

Ryan Blue commented on SPARK-18086:
---

Yeah, I'll update the PR.

> Regression: Hive variables no longer work in Spark 2.0
> --
>
> Key: SPARK-18086
> URL: https://issues.apache.org/jira/browse/SPARK-18086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>
> The behavior of variables in the SQL shell has changed from 1.6 to 2.0. 
> Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer 
> work. Queries that worked correctly in 1.6 will either fail or produce 
> unexpected results in 2.0 so I think this is a regression that should be 
> addressed.
> Hive and Spark 1.6 work like this:
> 1. Command-line args --hiveconf and --hivevar can be used to set session 
> properties. --hiveconf properties are added to the Hadoop Configuration.
> 2. {{SET key=value}} adds a Hive Configuration property, and {{SET 
> hivevar:name=value}} adds a Hive var.
> 3. Hive vars can be substituted into queries by name, and Configuration 
> properties can be substituted using {{hiveconf:name}}.
> In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then 
> the value in SQLConf for the rest of the key is returned. SET adds properties 
> to the session config and (according to [a 
> comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28])
>  the Hadoop configuration "during I/O".
> {code:title=Hive and Spark 1.6.1 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> ${test.conf}
> spark-sql> select "${hivevar:test.var}";
> 2
> spark-sql> select "${test.var}";
> 2
> spark-sql> set test.set=3;
> SET test.set=3
> spark-sql> select "${test.set}"
> "${test.set}"
> spark-sql> select "${hivevar:test.set}"
> "${hivevar:test.set}"
> spark-sql> select "${hiveconf:test.set}"
> 3
> spark-sql> set hivevar:test.setvar=4;
> SET hivevar:test.setvar=4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> 4
> {code}
> {code:title=Spark 2.0.0 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> 1
> spark-sql> select "${hivevar:test.var}";
> ${hivevar:test.var}
> spark-sql> select "${test.var}";
> ${test.var}
> spark-sql> set test.set=3;
> test.set3
> spark-sql> select "${test.set}";
> 3
> spark-sql> set hivevar:test.setvar=4;
> hivevar:test.setvar  4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> ${test.setvar}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18259) QueryExecution should not catch Throwable

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18259:


Assignee: Apache Spark  (was: Herman van Hovell)

> QueryExecution should not catch Throwable
> -
>
> Key: SPARK-18259
> URL: https://issues.apache.org/jira/browse/SPARK-18259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>Priority: Minor
>
> QueryExecution currently captures Throwable. This is far from a best 
> practice. It would be better if we'd catch AnalysisExceptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18259) QueryExecution should not catch Throwable

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18259:


Assignee: Herman van Hovell  (was: Apache Spark)

> QueryExecution should not catch Throwable
> -
>
> Key: SPARK-18259
> URL: https://issues.apache.org/jira/browse/SPARK-18259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>
> QueryExecution currently captures Throwable. This is far from a best 
> practice. It would be better if we'd catch AnalysisExceptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18259) QueryExecution should not catch Throwable

2016-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634573#comment-15634573
 ] 

Apache Spark commented on SPARK-18259:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15760

> QueryExecution should not catch Throwable
> -
>
> Key: SPARK-18259
> URL: https://issues.apache.org/jira/browse/SPARK-18259
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
>
> QueryExecution currently captures Throwable. This is far from a best 
> practice. It would be better if we'd catch AnalysisExceptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-03 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634568#comment-15634568
 ] 

holdenk commented on SPARK-15581:
-

These sound like really good suggestions - I think some of the biggest 
challenges contributors face are on the review/committer side rather than on the 
actual code changes, so anything we can do to make that process simpler should 
be considered.

I have sort of a mixed view on the umbrella JIRAs; maybe just being clearer in 
umbrella JIRAs about which sub-features are nice-to-have versus must-have would 
let us keep this organization?

Splitting the roadmap into two parts also sounds reasonable and will hopefully 
lead to less bouncing of issues between roadmap versions.

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
> Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there exist no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence (SPARK-15573)
> h2. Python API
> * Standardize unit tests for Scala and Python to 

[jira] [Resolved] (SPARK-18257) Improve error reporting for FileStressSuite in streaming

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18257.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Improve error reporting for FileStressSuite in streaming
> 
>
> Key: SPARK-18257
> URL: https://issues.apache.org/jira/browse/SPARK-18257
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> FileStressSuite doesn't report errors when they occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634455#comment-15634455
 ] 

Nicholas Chammas commented on SPARK-18254:
--

  

So it was specifically some broken interaction between structs and aliases, I 
guess? Anyway, glad it's been fixed.

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634432#comment-15634432
 ] 

Herman van Hovell commented on SPARK-18254:
---

We 'accidentally' fixed this yesterday with commit: 
https://github.com/apache/spark/commit/f151bd1af8a05d4b6c901ebe6ac0b51a4a1a20df


> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391
 ] 

Nicholas Chammas edited comment on SPARK-18254 at 11/3/16 9:58 PM:
---

If I try branch-2.1 on commit {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the 
original repro case works. 

(And for the record, this is still broken on the latest commit in branch-2.0, 
{{dae1581d9461346511098dc83938939a0f930048}}, so the fix is 2.1+.)

So while we're waiting for 2.1 to be released, is there a workaround you'd 
recommend, apart from {{persist()}}?


was (Author: nchammas):
If I try branch-2.1 on commit {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the 
original repro case works. 

So while we're waiting for 2.1 to be released, is there a workaround you'd 
recommend, apart from {{persist()}}?

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634428#comment-15634428
 ] 

Nicholas Chammas commented on SPARK-18254:
--

Just tried it. Seems like the fix is only available in 2.1+.

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18259) QueryExecution should not catch Throwable

2016-11-03 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-18259:
-

 Summary: QueryExecution should not catch Throwable
 Key: SPARK-18259
 URL: https://issues.apache.org/jira/browse/SPARK-18259
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Herman van Hovell
Assignee: Herman van Hovell
Priority: Minor


QueryExecution currently captures Throwable. This is far from a best practice. 
It would be better if we'd catch AnalysisExceptions.
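
For readers following along from the Python side, a minimal sketch of the general 
idea (catch the specific analysis failure rather than everything) is below. This 
is only an illustration of the principle under discussion, not the internal Scala 
change itself; the session and the failing query are made up for the example.

{code}
# Illustration only: catch the narrow AnalysisException instead of a blanket catch.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.master("local[2]").getOrCreate()

try:
    # Analysis fails eagerly here because the column does not exist.
    spark.sql("SELECT no_such_column FROM range(10)")
except AnalysisException as e:
    # Narrow catch: genuine runtime faults (OOM, interrupts, ...) still propagate.
    print("analysis failed: %s" % e)

spark.stop()
{code}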



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634416#comment-15634416
 ] 

Davies Liu commented on SPARK-18254:


Could you also try 2.0.2?

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391
 ] 

Nicholas Chammas commented on SPARK-18254:
--

If I try branch-2.1 on {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the 
original repro case works. 

So while we're waiting for 2.1 to be released, is there a workaround you'd 
recommend, apart from {{persist()}}?
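
For anyone stuck on 2.0.x in the meantime, here is a minimal sketch of the 
{{persist()}} workaround mentioned above, reusing the names from the reproduction 
quoted below; this only illustrates the comment and is not a recommended fix.

{code}
# Sketch of the persist() workaround, reusing names_cleaned / length_udf from the repro.
# The idea is that caching materializes the aliased schema, so the Python UDF
# should see first_name / last_name rather than the original FIRST / LAST columns.
names_cleaned = names_cleaned.persist()
names_cleaned.count()  # force materialization before applying the UDF

(names_cleaned
    .withColumn('length', length_udf('full_name'))
    .show())
{code}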

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391
 ] 

Nicholas Chammas edited comment on SPARK-18254 at 11/3/16 9:46 PM:
---

If I try branch-2.1 on commit {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the 
original repro case works. 

So while we're waiting for 2.1 to be released, is there a workaround you'd 
recommend, apart from {{persist()}}?


was (Author: nchammas):
If I try branch-2.1 on {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the 
original repro case works. 

So while we're waiting for 2.1 to be released, is there a workaround you'd 
recommend, apart from {{persist()}}?

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets

2016-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18212:
-
Assignee: Cody Koeninger

> Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from 
> specific offsets
> ---
>
> Key: SPARK-18212
> URL: https://issues.apache.org/jira/browse/SPARK-18212
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Davies Liu
>Assignee: Cody Koeninger
> Fix For: 2.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets

2016-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-18212.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15737
[https://github.com/apache/spark/pull/15737]

> Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from 
> specific offsets
> ---
>
> Key: SPARK-18212
> URL: https://issues.apache.org/jira/browse/SPARK-18212
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Davies Liu
> Fix For: 2.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634354#comment-15634354
 ] 

Seth Hendrickson commented on SPARK-15581:
--

I think the points you mention are very important to get right moving forward. 
We can certainly debate about what should go on the roadmap, but regardless I 
think it would be helpful to maintain a specific subset of JIRAs that we expect 
to get done for the next release cycle. Particularly:

- We should maintain a list of items that we WILL get done for the next 
release, and we should deliver on nearly every one, barring unforeseen 
circumstances. If we don't get some of the items done, we should understand why 
and adjust accordingly until we can reach a list of items that we can 
consistently deliver on.
- The list of items should be small and targeted, and should take into account 
things like committer/reviewer bandwidth. MLlib does not have a ton of active 
committers right now, like SQL might have, and the roadmap should reflect that. 
We need to be realistic.
- We should make every effort to be as specific as possible. Linking to 
umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. 
Some of the umbrella tickets contain items that are longer term or have little 
interest (nice-to-haves), but realistically won't get implemented (in a timely 
manner). For example, I looked at the tree umbrellas and I see some items that 
are high priority and can be done in one release cycle, but also other items 
that have been around for a long time and seem to have little interest. The 
list should contain only the items that we expect to get done.
- As you say, every item should have a committer linked to it that is capable of 
merging it. They do not have to be the primary reviewer, but they should have 
sufficient expertise such that they feel comfortable merging it after it has 
been appropriately reviewed. One interesting example to be wary of is that 
there seem to be a LOT of tree related items on the roadmap, but Joseph has 
traditionally been the only (at least the main) committer involved in 
tree-related JIRAs. I don't think it's realistic to target all of these tree 
improvements when we have limited committers available to review/merge them. We 
can trim them down to a realistic subset.

I propose a revised roadmap that contains two classifications of items:

1. JIRAs that will be done by the next release
2. JIRAs that will be done at some point before the next major release (e.g. 3.0)

JIRAs that are still up for debate (e.g. adding a factorization machine) should 
not be on the roadmap. That does not mean they will not get done, but they are 
not necessarily "planned" for any particular timeframe. IMO this revised 
roadmap can/will provide a lot more transparency, and appropriately set review 
expectations. If it's on the list of "will do by next minor release," then 
contributors should expect it to be reviewed. What does everyone else think?

Also, I took a bit of time to aggregate lists of specific JIRAs that I think 
fit into the two categories I listed above 
[here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing]
 (note: does not contain SparkR items). I am not (necessarily) proposing to 
move the list to this google doc, and I understand this is still undergoing 
discussion. I just wanted to provide an example of what the above might look 
like.   

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
> Fix For: 2.1.0
>
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delay in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> 

[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap

2016-11-03 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634354#comment-15634354
 ] 

Seth Hendrickson edited comment on SPARK-15581 at 11/3/16 9:28 PM:
---

I think the points you mention are very important to get right moving forward. 
We can certainly debate about what should go on the roadmap, but regardless I 
think it would be helpful to maintain a specific subset of JIRAs that we expect 
to get done for the next release cycle. Particularly:

- We should maintain a list of items that we WILL get done for the next 
release, and we should deliver on nearly every one, barring unforeseen 
circumstances. If we don't get some of the items done, we should understand why 
and adjust accordingly until we can reach a list of items that we can 
consistently deliver on.
- The list of items should be small and targeted, and should take into account 
things like committer/reviewer bandwidth. MLlib does not have a ton of active 
committers right now, like SQL might have, and the roadmap should reflect that. 
We need to be realistic.
- We should make every effort to be as specific as possible. Linking to 
umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. 
Some of the umbrella tickets contain items that are longer term or have little 
interest (nice-to-haves), but realistically won't get implemented (in a timely 
manner). For example, I looked at the tree umbrellas and I see some items that 
are high priority and can be done in one release cycle, but also other items 
that have been around for a long time and seem to have little interest. The 
list should contain only the items that we expect to get done.
- As you say, every item should have a committer linked to it that is capable 
of merging it. They do not have to be the primary reviewer, but they should 
have sufficient expertise such that they feel comfortable merging it after it 
has been appropriately reviewed. One interesting example to be wary of is that 
there seem to be a LOT of tree related items on the roadmap, but Joseph has 
traditionally been the only (at least the main) committer involved in 
tree-related JIRAs. I don't think it's realistic to target all of these tree 
improvements when we have limited committers available to review/merge them. We 
can trim them down to a realistic subset.

I propose a revised roadmap that contains two classifications of items:

1. JIRAs that will be done by the next release
2. JIRAs that will be done at some point before the next major release (e.g. 
3.0)

JIRAs that are still up for debate (e.g. adding a factorization machine) should 
not be on the roadmap. That does not mean they will not get done, but they are 
not necessarily "planned" for any particular timeframe. IMO this revised 
roadmap can/will provide a lot more transparency, and appropriately set review 
expectations. If it's on the list of "will do by next minor release," then 
contributors should expect it to be reviewed. What does everyone else think?

Also, I took a bit of time to aggregate lists of specific JIRAs that I think 
fit into the two categories I listed above 
[here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing]
 (note: does not contain SparkR items). I am not (necessarily) proposing to 
move the list to this google doc, and I understand this is still undergoing 
discussion. I just wanted to provide an example of what the above might look 
like.   


was (Author: sethah):
I think the points you mention are very important to get right moving forward. 
We can certainly debate about what should go on the roadmap, but regardless I 
think it would be helpful to maintain a specific subset of JIRAs that we expect 
to get done for the next release cycle. Particularly:

- We should maintain a list of items that we WILL get done for the next 
release, and we should deliver on nearly every one, barring unforeseen 
circumstances. If we don't get some of the items done, we should understand why 
and adjust accordingly until we can reach a list of items that we can 
consistently deliver on.
- The list of items should be small and targeted, and should take into account 
things like committer/reviewer bandwidth. MLlib does not have a ton of active 
committers right now, like SQL might have, and the roadmap should reflect that. 
We need to be realistic.
- We should make every effort to be as specific as possible. Linking to 
umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. 
Some of the umbrella tickets contain items that are longer term or have little 
interest (nice-to-haves), but realistically won't get implemented (in a timely 
manner). For example, I looked at the tree umbrellas and I see some items that 
are high priority and can be done in one release cycle, but also other items 
that have been around for a 

[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame

2016-11-03 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634352#comment-15634352
 ] 

koert kuipers commented on SPARK-15798:
---

Looking at the code for the Window operators, it seems to me the basic operators 
for secondary sort must already be present for Dataset, since you need them to do 
Window operations efficiently. So this is good news; it just needs to be exposed 
in a more generic way than the highly specific Window operators.
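
Until such an API exists, one rough way to approximate a secondary-sort-style 
pass on a DataFrame today is {{repartition}} plus {{sortWithinPartitions}} 
followed by {{mapPartitions}} (rather than the Window machinery mentioned above). 
A minimal sketch with made-up column names:

{code}
# Rough sketch: co-locate each key's rows, sort within each partition by
# (key, secondary key), then stream ordered groups through mapPartitions.
from itertools import groupby

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

df = spark.createDataFrame(
    [("a", 3, "x"), ("a", 1, "y"), ("b", 2, "z")],
    ["k", "sort_col", "payload"])

grouped_sorted = (df
    .repartition("k")                        # all rows for a key land in one partition
    .sortWithinPartitions("k", "sort_col"))  # secondary sort inside each partition

def fold_groups(rows):
    # rows arrive ordered by (k, sort_col); emit (key, first payload in that order)
    for key, group in groupby(rows, key=lambda r: r.k):
        first = next(group)
        yield (key, first.payload)

print(grouped_sorted.rdd.mapPartitions(fold_groups).collect())
spark.stop()
{code}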

> Secondary sort in Dataset/DataFrame
> ---
>
> Key: SPARK-15798
> URL: https://issues.apache.org/jira/browse/SPARK-15798
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: koert kuipers
>
> Secondary sort for Spark RDDs was discussed in 
> https://issues.apache.org/jira/browse/SPARK-3655
> Since the RDD API allows for easy extensions outside the core library this 
> was implemented separately here:
> https://github.com/tresata/spark-sorted
> However it seems to me that with Dataset an implementation in a 3rd party 
> library of such a feature is not really an option.
> Dataset already has methods that suggest a secondary sort is present, such as 
> in KeyValueGroupedDataset:
> {noformat}
> def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): 
> Dataset[U]
> {noformat}
> This operation pushes all the data to the reducer, something you only would 
> want to do if you need the elements in a particular order.
> How about as an API sortBy methods in KeyValueGroupedDataset and 
> RelationalGroupedDataset?
> {noformat}
> dataFrame.groupBy("a").sortBy("b").fold(...)
> {noformat}
> (yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should 
> :))
> {noformat}
> dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-03 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634342#comment-15634342
 ] 

Saikat Kanjilal commented on SPARK-9487:


Added local[4] to the repl, Spark SQL, and streaming test suites; all tests pass. 
The pull request is here: https://github.com/apache/spark/compare/master...skanjila:spark-9487
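
For reference, a minimal PySpark sketch of the same idea (illustrative names 
only; the actual change referenced above is to the Scala-side test suites for 
the repl, SQL, and streaming):

{code}
# Pin the test master to the same local[4] the PySpark test harness uses,
# so results that depend on the number of partitions/threads line up.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")
         .appName("thread-count-parity-check")
         .getOrCreate())

# 4 local worker threads -> default parallelism of 4, matching the Python-side tests.
print(spark.sparkContext.defaultParallelism)  # expected: 4
spark.stop()
{code}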


> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634333#comment-15634333
 ] 

Davies Liu commented on SPARK-18254:


I tried the following in master (2.1), and it works:

{code}
from pyspark.sql.functions import udf, col, struct
from pyspark.sql.types import IntegerType  # needed for the UDF return type below

# Assumes an existing SparkSession (the original snippet used self.spark from a test suite).
myadd = udf(lambda s: s.a + s.b, IntegerType())
df = spark.range(10).selectExpr("id as a", "id as b")\
    .select(struct(col("a"), col("b")).alias('s'))
df = df.select(df.s, myadd(df.s).alias("a"))
df.explain(True)
rs = df.collect()
{code}

[~nchammas] Could you also try yours on master?

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives

2016-11-03 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-18099.
---
   Resolution: Fixed
 Assignee: Kishor Patil
Fix Version/s: 2.2.0
   2.1.0

> Spark distributed cache should throw exception if same file is specified to 
> dropped in --files --archives
> -
>
> Key: SPARK-18099
> URL: https://issues.apache.org/jira/browse/SPARK-18099
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Kishor Patil
>Assignee: Kishor Patil
> Fix For: 2.1.0, 2.2.0
>
>
> Following the recent changes for [SPARK-14423] (handle jar conflict issues when 
> uploading to the distributed cache), yarn#client by default uploads all the 
> --files and --archives in the assembly to the HDFS staging folder. It should 
> throw an exception if the same file appears in both --files and --archives, 
> since otherwise it is ambiguous whether to uncompress the file or leave it 
> compressed.
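
A hedged sketch of the kind of check being asked for (hypothetical helper, not the actual patch):

{code}
// Fail fast when the same resource is passed through both --files and --archives,
// since it is then ambiguous whether it should be uncompressed or left compressed.
def checkDistCacheConflicts(files: Seq[String], archives: Seq[String]): Unit = {
  val duplicated = files.toSet.intersect(archives.toSet)
  if (duplicated.nonEmpty) {
    throw new IllegalArgumentException(
      s"Resources listed in both --files and --archives: ${duplicated.mkString(", ")}")
  }
}
{code}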



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0

2016-11-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634172#comment-15634172
 ] 

Reynold Xin commented on SPARK-18086:
-

[~rdblue] Does my explanation make sense? Can you change the PR (or open a new 
one) to just do the command-line argument so we can get that into 2.1?


> Regression: Hive variables no longer work in Spark 2.0
> --
>
> Key: SPARK-18086
> URL: https://issues.apache.org/jira/browse/SPARK-18086
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ryan Blue
>
> The behavior of variables in the SQL shell has changed from 1.6 to 2.0. 
> Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer 
> work. Queries that worked correctly in 1.6 will either fail or produce 
> unexpected results in 2.0 so I think this is a regression that should be 
> addressed.
> Hive and Spark 1.6 work like this:
> 1. Command-line args --hiveconf and --hivevar can be used to set session 
> properties. --hiveconf properties are added to the Hadoop Configuration.
> 2. {{SET key=value}} adds a Hive Configuration property, {{SET hivevar:name=value}} 
> adds a Hive var.
> 3. Hive vars can be substituted into queries by name, and Configuration 
> properties can be substituted using {{hiveconf:name}}.
> In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then 
> the value in SQLConf for the rest of the key is returned. SET adds properties 
> to the session config and (according to [a 
> comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28])
>  the Hadoop configuration "during I/O".
> {code:title=Hive and Spark 1.6.1 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> ${test.conf}
> spark-sql> select "${hivevar:test.var}";
> 2
> spark-sql> select "${test.var}";
> 2
> spark-sql> set test.set=3;
> SET test.set=3
> spark-sql> select "${test.set}"
> "${test.set}"
> spark-sql> select "${hivevar:test.set}"
> "${hivevar:test.set}"
> spark-sql> select "${hiveconf:test.set}"
> 3
> spark-sql> set hivevar:test.setvar=4;
> SET hivevar:test.setvar=4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> 4
> {code}
> {code:title=Spark 2.0.0 behavior}
> [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
> spark-sql> select "${hiveconf:test.conf}";
> 1
> spark-sql> select "${test.conf}";
> 1
> spark-sql> select "${hivevar:test.var}";
> ${hivevar:test.var}
> spark-sql> select "${test.var}";
> ${test.var}
> spark-sql> set test.set=3;
> test.set3
> spark-sql> select "${test.set}";
> 3
> spark-sql> set hivevar:test.setvar=4;
> hivevar:test.setvar  4
> spark-sql> select "${hivevar:test.setvar}";
> 4
> spark-sql> select "${test.setvar}";
> ${test.setvar}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist

2016-11-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634257#comment-15634257
 ] 

yuhao yang commented on SPARK-18230:


Sorry, I got a little confused between the different recommend methods. You're 
right.

> MatrixFactorizationModel.recommendProducts throws NoSuchElement exception 
> when the user does not exist
> --
>
> Key: SPARK-18230
> URL: https://issues.apache.org/jira/browse/SPARK-18230
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.1
>Reporter: Mikael Ståldal
>Priority: Minor
>
> When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a 
> non-existing user, a {{java.util.NoSuchElementException}} is thrown:
> {code}
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35)
>   at 
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169)
> {code}
> It would be nice if it returned an empty array or threw a more specific 
> exception, and if that behavior were documented in the ScalaDoc for the method.
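
In the meantime, a caller-side guard is possible. A minimal sketch (my own, assuming a trained {{model}}; the helper name is made up):

{code}
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// Return an empty array instead of letting recommendProducts throw
// NoSuchElementException for a user that was never seen during training.
def safeRecommendProducts(model: MatrixFactorizationModel, user: Int, num: Int): Array[Rating] =
  if (model.userFeatures.lookup(user).nonEmpty) model.recommendProducts(user, num)
  else Array.empty[Rating]
{code}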



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634228#comment-15634228
 ] 

Davies Liu commented on SPARK-18254:


I suspect it's a bug in ExtractPythonUDFs rather than operator push down; I 
haven't verified yet.

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18210) Pipeline.copy does not create an instance with the same UID

2016-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634240#comment-15634240
 ] 

Apache Spark commented on SPARK-18210:
--

User 'wojtek-szymanski' has created a pull request for this issue:
https://github.com/apache/spark/pull/15759

> Pipeline.copy does not create an instance with the same UID
> ---
>
> Key: SPARK-18210
> URL: https://issues.apache.org/jira/browse/SPARK-18210
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Wojciech Szymanski
>Priority: Minor
>
> org.apache.spark.ml.Pipeline.copy(extra: ParamMap) does not create an 
> instance with the same UID.
> It does not conform to the method specification from its base class 
> org.apache.spark.ml.param.Params.copy(extra: ParamMap)
> The following commit contains:
> - fix for Pipeline UID
> - missing tests for Pipeline.copy
> - minor improvements in tests for PipelineModel.copy
> https://github.com/apache/spark/commit/8764e36f9764f89343a3e0fe6eff05cb41bb36cf
> Let me know if you are fine with these changes, so I will open a new PR.
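
For reference, a minimal sketch of the contract in question (an illustrative check, not the test added in the commit above):

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.param.ParamMap

val pipeline = new Pipeline().setStages(Array.empty[PipelineStage])
val copied = pipeline.copy(ParamMap.empty)
// Params.copy is specified to return an instance with the same UID.
assert(copied.uid == pipeline.uid, s"expected ${pipeline.uid}, got ${copied.uid}")
{code}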



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18210) Pipeline.copy does not create an instance with the same UID

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18210:


Assignee: (was: Apache Spark)

> Pipeline.copy does not create an instance with the same UID
> ---
>
> Key: SPARK-18210
> URL: https://issues.apache.org/jira/browse/SPARK-18210
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Wojciech Szymanski
>Priority: Minor
>
> org.apache.spark.ml.Pipeline.copy(extra: ParamMap) does not create an 
> instance with the same UID.
> It does not conform to the method specification from its base class 
> org.apache.spark.ml.param.Params.copy(extra: ParamMap)
> The following commit contains:
> - fix for Pipeline UID
> - missing tests for Pipeline.copy
> - minor improvements in tests for PipelineModel.copy
> https://github.com/apache/spark/commit/8764e36f9764f89343a3e0fe6eff05cb41bb36cf
> Let me know if you are fine with these changes, so I will open a new PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18210) Pipeline.copy does not create an instance with the same UID

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18210:


Assignee: Apache Spark

> Pipeline.copy does not create an instance with the same UID
> ---
>
> Key: SPARK-18210
> URL: https://issues.apache.org/jira/browse/SPARK-18210
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: Wojciech Szymanski
>Assignee: Apache Spark
>Priority: Minor
>
> org.apache.spark.ml.Pipeline.copy(extra: ParamMap) does not create an 
> instance with the same UID.
> It does not conform to the method specification from its base class 
> org.apache.spark.ml.param.Params.copy(extra: ParamMap)
> The following commit contains:
> - fix for Pipeline UID
> - missing tests for Pipeline.copy
> - minor improvements in tests for PipelineModel.copy
> https://github.com/apache/spark/commit/8764e36f9764f89343a3e0fe6eff05cb41bb36cf
> Let me know if you are fine with these changes, so I will open a new PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18258) Sinks need access to offset representation

2016-11-03 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-18258:
---
Description: 
Transactional "exactly-once" semantics for output require storing an offset 
identifier in the same transaction as results.

The Sink.addBatch method currently only has access to batchId and data, not the 
actual offset representation.

I want to store the actual offsets, so that they are recoverable as long as the 
results are and I'm not locked in to a particular streaming engine.

I could see this being accomplished by adding parameters to Sink.addBatch for 
the starting and ending offsets (either the offsets themselves, or the 
SPARK-17829 string/json representation).  That would be an API change, but if 
there's another way to map batch ids to offset representations without changing 
the Sink api that would work as well.  

I'm assuming we don't need the same level of access to offsets throughout a job 
as e.g. the Kafka dstream gives, because Sinks are the main place that should 
need them.

  was:
Transactional "exactly-once" semantics for output require storing an offset 
identifier in the same transaction as results.

The Sink.addBatch method currently only has access to batchId and data, not the 
actual offset representation.

I want to store the actual offsets, so that they are recoverable as long as the
results are and I'm not locked in to a particular streaming engine.

I could see this being accomplished by adding parameters to Sink.addBatch for 
the starting and ending offsets (either the offsets themselves, or the 
SPARK-17829 string/json representation).  That would be an API change, but if 
there's another way to map batch ids to offset representations without changing 
the Sink api that would work as well.  

I'm assuming we don't need the same level of access to offsets throughout a job 
as e.g. the Kafka dstream gives, because Sinks are the main place that should 
need them.


> Sinks need access to offset representation
> --
>
> Key: SPARK-18258
> URL: https://issues.apache.org/jira/browse/SPARK-18258
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Transactional "exactly-once" semantics for output require storing an offset 
> identifier in the same transaction as results.
> The Sink.addBatch method currently only has access to batchId and data, not 
> the actual offset representation.
> I want to store the actual offsets, so that they are recoverable as long as 
> the results are and I'm not locked in to a particular streaming engine.
> I could see this being accomplished by adding parameters to Sink.addBatch for 
> the starting and ending offsets (either the offsets themselves, or the 
> SPARK-17829 string/json representation).  That would be an API change, but if 
> there's another way to map batch ids to offset representations without 
> changing the Sink api that would work as well.  
> I'm assuming we don't need the same level of access to offsets throughout a 
> job as e.g. the Kafka dstream gives, because Sinks are the main place that 
> should need them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18258) Sinks need access to offset representation

2016-11-03 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-18258:
--

 Summary: Sinks need access to offset representation
 Key: SPARK-18258
 URL: https://issues.apache.org/jira/browse/SPARK-18258
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Reporter: Cody Koeninger


Transactional "exactly-once" semantics for output require storing an offset 
identifier in the same transaction as results.

The Sink.addBatch method currently only has access to batchId and data, not the 
actual offset representation.

I want to store the actual offsets, so that they are recoverable as long as the
results are and I'm not locked in to a particular streaming engine.

I could see this being accomplished by adding parameters to Sink.addBatch for 
the starting and ending offsets (either the offsets themselves, or the 
SPARK-17829 string/json representation).  That would be an API change, but if 
there's another way to map batch ids to offset representations without changing 
the Sink api that would work as well.  

I'm assuming we don't need the same level of access to offsets throughout a job 
as e.g. the Kafka dstream gives, because Sinks are the main place that should 
need them.
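
One possible shape for this, sketched as a hypothetical trait (names and signature are illustrative only, not the existing Sink API):

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical variant of Sink.addBatch that also receives the batch's offset range,
// so a transactional sink can store the offsets in the same transaction as the results.
// Offsets would be passed in their SPARK-17829 string/json representation.
trait OffsetAwareSink {
  def addBatch(batchId: Long, data: DataFrame, startOffsets: Option[String], endOffsets: String): Unit
}
{code}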



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18238) WARN Executor: 1 block locks were not released by TID

2016-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634156#comment-15634156
 ] 

Sean Owen commented on SPARK-18238:
---

Can you say any more about how you make this occur?

> WARN Executor: 1 block locks were not released by TID
> -
>
> Key: SPARK-18238
> URL: https://issues.apache.org/jira/browse/SPARK-18238
> Project: Spark
>  Issue Type: Bug
> Environment: 2.0.2 snapshot
>Reporter: Harish
>Priority: Minor
>  Labels: patch
>
> In Spark 2.0.2 / Hadoop 2.7, I am getting the message below. Not sure if this 
> is impacting my execution.
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30541:
> [rdd_511_104]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30542:
> [rdd_511_105]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30562:
> [rdd_511_127]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30571:
> [rdd_511_137]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30572:
> [rdd_511_138]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30588:
> [rdd_511_156]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30603:
> [rdd_511_171]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30600:
> [rdd_511_168]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30612:
> [rdd_511_180]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30622:
> [rdd_511_190]
> 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = 
> 30629:
> [rdd_511_197]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12

2016-11-03 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634143#comment-15634143
 ] 

Jakob Odersky edited comment on SPARK-14222 at 11/3/16 8:33 PM:


Thanks Sean, however I realized that the dependency is in fact not yet 
published for 2.12.0 final. The package I linked is from a different org.

There's a ticket for a release here 
https://github.com/FasterXML/jackson-module-scala/pull/294


was (Author: jodersky):
Thanks Sean, however I realized that the dependency is in fact not yet 
published for 2.12.0 final. The package I linked is from a different org, oops

> Cross-publish jackson-module-scala for Scala 2.12
> -
>
> Key: SPARK-14222
> URL: https://issues.apache.org/jira/browse/SPARK-14222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In order to build Spark against Scala 2.12, we need to either remove our 
> jackson-module-scala dependency or cross-publish Jackson for Scala 2.12. 
> Personally, I'd prefer to remove it because I don't think we make extensive 
> use of it and because I'm not a huge fan of the implicit mapping between case 
> classes and JSON wire formats (the extra verbosity required by other 
> approaches is a feature, IMO, rather than a bug because it makes it much 
> harder to accidentally break wire compatibility).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18193) queueStream not updated if rddQueue.add after create queueStream in Java

2016-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634149#comment-15634149
 ] 

Sean Owen commented on SPARK-18193:
---

Oh I see, it's the opposite. The QueueStream example should be updated to match 
the JavaQueueStream example. Go ahead, yes.
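
For reference, a minimal sketch of the pattern the Scala example should end up with (illustrative values): keep adding RDDs to the shared queue after the context has started, synchronizing on the queue.

{code}
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("QueueStreamSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val rddQueue = new mutable.Queue[RDD[Int]]()
val stream = ssc.queueStream(rddQueue)
stream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()
ssc.start()

// RDDs added after start() are still picked up, one per batch by default.
for (_ <- 1 to 30) {
  rddQueue.synchronized {
    rddQueue += ssc.sparkContext.makeRDD(1 to 1000, 10)
  }
  Thread.sleep(1000)
}
ssc.stop()
{code}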

> queueStream not updated if rddQueue.add after create queueStream in Java
> 
>
> Key: SPARK-18193
> URL: https://issues.apache.org/jira/browse/SPARK-18193
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Hubert Kang
>
> Within 
> examples\src\main\java\org\apache\spark\examples\streaming\JavaQueueStream.java,
>  no data is detected if the code below, which adds data to rddQueue, is 
> executed after queueStream is created (line 65).
> for (int i = 0; i < 30; i++) {
>   rddQueue.add(ssc.sparkContext().parallelize(list));
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12

2016-11-03 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634143#comment-15634143
 ] 

Jakob Odersky edited comment on SPARK-14222 at 11/3/16 8:30 PM:


Thanks Sean, however I realized that the dependency is in fact not yet 
published for 2.12.0 final. The package I linked is from a different org, oops


was (Author: jodersky):
Thanks Sean, however I realized that the dependency is in fact not yet 
published for 2.12.0 final. The package I linked is from a different org

> Cross-publish jackson-module-scala for Scala 2.12
> -
>
> Key: SPARK-14222
> URL: https://issues.apache.org/jira/browse/SPARK-14222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In order to build Spark against Scala 2.12, we need to either remove our 
> jackson-module-scala dependency or cross-publish Jackson for Scala 2.12. 
> Personally, I'd prefer to remove it because I don't think we make extensive 
> use of it and because I'm not a huge fan of the implicit mapping between case 
> classes and JSON wire formats (the extra verbosity required by other 
> approaches is a feature, IMO, rather than a bug because it makes it much 
> harder to accidentally break wire compatibility).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12

2016-11-03 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634143#comment-15634143
 ] 

Jakob Odersky commented on SPARK-14222:
---

Thanks Sean, however I realized that the dependency is in fact not yet 
published for 2.12.0 final. The package I linked is from a different org

> Cross-publish jackson-module-scala for Scala 2.12
> -
>
> Key: SPARK-14222
> URL: https://issues.apache.org/jira/browse/SPARK-14222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In order to build Spark against Scala 2.12, we need to either remove our 
> jackson-module-scala dependency or cross-publish Jackson for Scala 2.12. 
> Personally, I'd prefer to remove it because I don't think we make extensive 
> use of it and because I'm not a huge fan of the implicit mapping between case 
> classes and JSON wire formats (the extra verbosity required by other 
> approaches is a feature, IMO, rather than a bug because it makes it much 
> harder to accidentally break wire compatibility).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12

2016-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634128#comment-15634128
 ] 

Sean Owen commented on SPARK-14222:
---

Probably. The limiting factor is often run-time compatibility with whatever 
version Hadoop will put in the classpath. However I have been using Jackson 
2.8.x + Spark 2 + Hadoop 2.6 for a while without issue.

> Cross-publish jackson-module-scala for Scala 2.12
> -
>
> Key: SPARK-14222
> URL: https://issues.apache.org/jira/browse/SPARK-14222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In order to build Spark against Scala 2.12, we need to either remove our 
> jackson-module-scala dependency or cross-publish Jackson for Scala 2.12. 
> Personally, I'd prefer to remove it because I don't think we make extensive 
> use of it and because I'm not a huge fan of the implicit mapping between case 
> classes and JSON wire formats (the extra verbosity required by other 
> approaches is a feature, IMO, rather than a bug because it makes it much 
> harder to accidentally break wire compatibility).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18254:
-
Target Version/s: 2.1.0

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names

2016-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634122#comment-15634122
 ] 

Michael Armbrust commented on SPARK-18254:
--

Is this yet another bug caused by the generic operator push down?  Can we turn 
that off?

> UDFs don't see aliased column names
> ---
>
> Key: SPARK-18254
> URL: https://issues.apache.org/jira/browse/SPARK-18254
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
>Assignee: Davies Liu
>  Labels: correctness
>
> Dunno if I'm misinterpreting something here, but this seems like a bug in how 
> UDFs work, or in how they interface with the optimizer.
> Here's a basic reproduction. I'm using {{length_udf()}} just for 
> illustration; it could be any UDF that accesses fields that have been aliased.
> {code}
> import pyspark
> from pyspark.sql import Row
> from pyspark.sql.functions import udf, col, struct
> def length(full_name):
> # The non-aliased names, FIRST and LAST, show up here, instead of
> # first_name and last_name.
> # print(full_name)
> return len(full_name.first_name) + len(full_name.last_name)
> if __name__ == '__main__':
> spark = (
> pyspark.sql.SparkSession.builder
> .getOrCreate())
> length_udf = udf(length)
> names = spark.createDataFrame([
> Row(FIRST='Nick', LAST='Chammas'),
> Row(FIRST='Walter', LAST='Williams'),
> ])
> names_cleaned = (
> names
> .select(
> col('FIRST').alias('first_name'),
> col('LAST').alias('last_name'),
> )
> .withColumn('full_name', struct('first_name', 'last_name'))
> .select('full_name'))
> # We see the schema we expect here.
> names_cleaned.printSchema()
> # However, here we get an AttributeError. length_udf() cannot
> # find first_name or last_name.
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .show())
> {code}
> When I run this I get a long stack trace, but the relevant portions seem to 
> be:
> {code}
>   File ".../udf-alias.py", line 10, in length
> return len(full_name.first_name) + len(full_name.last_name)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1502, in __getattr__
> raise AttributeError(item)
> AttributeError: first_name
> {code}
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py",
>  line 1497, in __getattr__
> idx = self.__fields__.index(item)
> ValueError: 'first_name' is not in list
> {code}
> Here are the relevant execution plans:
> {code}
> names_cleaned.explain()
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10]
> +- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> {code}
> (names_cleaned
> .withColumn('length', length_udf('full_name'))
> .explain())
> == Physical Plan ==
> *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS 
> full_name#10, pythonUDF0#21 AS length#17]
> +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, 
> pythonUDF0#21]
>+- Scan ExistingRDD[FIRST#0,LAST#1]
> {code}
> It looks like from the second execution plan that {{BatchEvalPython}} somehow 
> gets the unaliased column names, whereas the {{Project}} right above it gets 
> the aliased names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12

2016-11-03 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634117#comment-15634117
 ] 

Jakob Odersky commented on SPARK-14222:
---

A newer version of the module (version 2.8.4) is available for Scala 2.12 now: 
http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22jackson-module-scala_2.12%22. 
Can we upgrade Spark's dependency (currently Spark uses 2.6.5)?
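
For reference, the coordinates in question as an sbt fragment (a sketch; the actual change would go through the Maven poms):

{code}
// jackson-module-scala 2.8.4 is published for Scala 2.12.
libraryDependencies += "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.8.4"
{code}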

> Cross-publish jackson-module-scala for Scala 2.12
> -
>
> Key: SPARK-14222
> URL: https://issues.apache.org/jira/browse/SPARK-14222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In order to build Spark against Scala 2.12, we need to either remove our 
> jackson-module-scala dependency or cross-publish Jackson for Scala 2.12. 
> Personally, I'd prefer to remove it because I don't think we make extensive 
> use of it and because I'm not a huge fan of the implicit mapping between case 
> classes and JSON wire formats (the extra verbosity required by other 
> approaches is a feature, IMO, rather than a bug because it makes it much 
> harder to accidentally break wire compatibility).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15377) Enabling SASL Spark 1.6.1

2016-11-03 Thread Shridhar Ramachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634088#comment-15634088
 ] 

Shridhar Ramachandran commented on SPARK-15377:
---

It is likely that you haven't enabled spark.authenticate=true in YARN. 
Excerpted from YarnShuffleService.java --
{noformat}
 * The service also optionally supports authentication. This ensures that 
executors from one
 * application cannot read the shuffle files written by those from another. 
This feature can be
 * enabled by setting `spark.authenticate` in the Yarn configuration before 
starting the NM.
 * Note that the Spark application must also set `spark.authenticate` manually 
and, unlike in
 * the case of the service port, will not inherit this setting from the Yarn 
configuration. This
 * is because an application running on the same Yarn cluster may choose to not 
use the external
 * shuffle service, in which case its setting of `spark.authenticate` should be 
independent of
 * the service's.
{noformat}
 You can do this by adding that property to core-site.xml.
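
Note the excerpt also says the Spark application must set {{spark.authenticate}} itself; a minimal sketch of the application side (keys taken from the report below):

{code}
import org.apache.spark.SparkConf

// spark.authenticate is not inherited from the YARN configuration used by the
// shuffle service, so the application sets it (and SASL encryption) explicitly.
val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.authenticate.enableSaslEncryption", "true")
{code}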

> Enabling SASL Spark 1.6.1
> -
>
> Key: SPARK-15377
> URL: https://issues.apache.org/jira/browse/SPARK-15377
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, YARN
>Affects Versions: 1.6.1
>Reporter: Fabian Tan
>
> Hi there,
> I wonder if anyone gotten SASL to work with Spark 1.6.1 on YARN?
> At this point in time, I can't confirm if this is a bug or not, but it's 
> definitely reproducible.
> Basically the Spark documentation states that you only need 3 parameters 
> enabled:
> spark.authenticate.enableSaslEncryption=true
> spark.network.sasl.serverAlwaysEncrypt=true
> spark.authenticate=true
> http://spark.apache.org/docs/latest/security.html
> However, upon launching my spark job with --master yarn and --deploy-mode 
> client, I see the following in my spark executors logs:
> 6/05/17 06:50:51 ERROR client.TransportClientFactory: Exception while 
> bootstrapping client after 29 ms
> java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown 
> message type: -22
> at 
> org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:67)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:71)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> 

[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17937:
-
Priority: Critical  (was: Major)

> Clarify Kafka offset semantics for Structured Streaming
> ---
>
> Key: SPARK-17937
> URL: https://issues.apache.org/jira/browse/SPARK-17937
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Critical
>
> Possible events for which offsets are needed:
> # New partition is discovered
> # Offset out of range (aka, data has been lost).   It's possible to separate 
> this into offset too small and offset too large, but I'm not sure it matters 
> for us.
> Possible sources of offsets:
> # *Earliest* position in log
> # *Latest* position in log
> # *Fail* and kill the query
> # *Checkpoint* position
> # *User specified* per topicpartition
> # *Kafka commit log*.  Currently unsupported.  This means users who want to 
> migrate from existing kafka jobs need to jump through hoops.  Even if we 
> never want to support it, as soon as we take on SPARK-17815 we need to make 
> sure Kafka commit log state is clearly documented and handled.
> # *Timestamp*.  Currently unsupported.  This could be supported with old, 
> inaccurate Kafka time api, or upcoming time index
> # *X offsets* before or after latest / earliest position.  Currently 
> unsupported.  I think the semantics of this are super unclear by comparison 
> with timestamp, given that Kafka doesn't have a single range of offsets.
> Currently allowed pre-query configuration, all "ORs" are exclusive:
> # startingOffsets: *earliest* OR *latest* OR *User specified* json per 
> topicpartition  (SPARK-17812)
> # failOnDataLoss: true (which implies *Fail* above) OR false (which implies 
> *Earliest* above)  In general, I see no reason this couldn't specify Latest 
> as an option.
> Possible lifecycle times in which an offset-related event may happen:
> # At initial query start
> #* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
> startingOffsets is *User specified* perTopicpartition, and the new partition 
> isn't in the map, *Fail*.  Note that this is effectively indistinguishable 
> from a new partition during the query, because partitions may have changed in 
> between pre-query configuration and query start, but we treat it differently, 
> and users in this case are SOL
> #* Offset out of range on driver: We don't technically have behavior for this 
> case yet.  Could use the value of failOnDataLoss, but it's possible people 
> may want to know at startup that something was wrong, even if they're ok with 
> earliest for a during-query out of range
> #* Offset out of range on executor: seems like it should be *Fail* or 
> *Earliest*, based on failOnDataLoss.  but it looks like this setting is 
> currently ignored, and the executor will just fail...
> # During query
> #* New partition:  *Earliest*, only.  This seems to be by fiat, I see no 
> reason this can't be configurable.
> #* Offset out of range on driver:  this _probably_ doesn't happen, because 
> we're doing explicit seeks to the latest position
> #* Offset out of range on executor:  ?
> # At query restart 
> #* New partition: *Checkpoint*, fall back to *Earliest*.  Again, no reason 
> this couldn't be configurable fall back to Latest
> #* Offset out of range on driver:   this _probably_ doesn't happen, because 
> we're doing explicit seeks to the specified position
> #* Offset out of range on executor:  ?
> I've probably missed something, chime in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17937:
-
Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-15406)

> Clarify Kafka offset semantics for Structured Streaming
> ---
>
> Key: SPARK-17937
> URL: https://issues.apache.org/jira/browse/SPARK-17937
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Possible events for which offsets are needed:
> # New partition is discovered
> # Offset out of range (aka, data has been lost).   It's possible to separate 
> this into offset too small and offset too large, but I'm not sure it matters 
> for us.
> Possible sources of offsets:
> # *Earliest* position in log
> # *Latest* position in log
> # *Fail* and kill the query
> # *Checkpoint* position
> # *User specified* per topicpartition
> # *Kafka commit log*.  Currently unsupported.  This means users who want to 
> migrate from existing kafka jobs need to jump through hoops.  Even if we 
> never want to support it, as soon as we take on SPARK-17815 we need to make 
> sure Kafka commit log state is clearly documented and handled.
> # *Timestamp*.  Currently unsupported.  This could be supported with old, 
> inaccurate Kafka time api, or upcoming time index
> # *X offsets* before or after latest / earliest position.  Currently 
> unsupported.  I think the semantics of this are super unclear by comparison 
> with timestamp, given that Kafka doesn't have a single range of offsets.
> Currently allowed pre-query configuration, all "ORs" are exclusive:
> # startingOffsets: *earliest* OR *latest* OR *User specified* json per 
> topicpartition  (SPARK-17812)
> # failOnDataLoss: true (which implies *Fail* above) OR false (which implies 
> *Earliest* above)  In general, I see no reason this couldn't specify Latest 
> as an option.
> Possible lifecycle times in which an offset-related event may happen:
> # At initial query start
> #* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
> startingOffsets is *User specified* perTopicpartition, and the new partition 
> isn't in the map, *Fail*.  Note that this is effectively indistinguishable 
> from a new partition during the query, because partitions may have changed in 
> between pre-query configuration and query start, but we treat it differently, 
> and users in this case are SOL
> #* Offset out of range on driver: We don't technically have behavior for this 
> case yet.  Could use the value of failOnDataLoss, but it's possible people 
> may want to know at startup that something was wrong, even if they're ok with 
> earliest for a during-query out of range
> #* Offset out of range on executor: seems like it should be *Fail* or 
> *Earliest*, based on failOnDataLoss.  but it looks like this setting is 
> currently ignored, and the executor will just fail...
> # During query
> #* New partition:  *Earliest*, only.  This seems to be by fiat, I see no 
> reason this can't be configurable.
> #* Offset out of range on driver:  this _probably_ doesn't happen, because 
> we're doing explicit seeks to the latest position
> #* Offset out of range on executor:  ?
> # At query restart 
> #* New partition: *Checkpoint*, fall back to *Earliest*.  Again, no reason 
> this couldn't be configurable fall back to Latest
> #* Offset out of range on driver:   this _probably_ doesn't happen, because 
> we're doing explicit seeks to the specified position
> #* Offset out of range on executor:  ?
> I've probably missed something, chime in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-03 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17937:
-
Target Version/s: 2.1.0

> Clarify Kafka offset semantics for Structured Streaming
> ---
>
> Key: SPARK-17937
> URL: https://issues.apache.org/jira/browse/SPARK-17937
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Possible events for which offsets are needed:
> # New partition is discovered
> # Offset out of range (aka, data has been lost).   It's possible to separate 
> this into offset too small and offset too large, but I'm not sure it matters 
> for us.
> Possible sources of offsets:
> # *Earliest* position in log
> # *Latest* position in log
> # *Fail* and kill the query
> # *Checkpoint* position
> # *User specified* per topicpartition
> # *Kafka commit log*.  Currently unsupported.  This means users who want to 
> migrate from existing kafka jobs need to jump through hoops.  Even if we 
> never want to support it, as soon as we take on SPARK-17815 we need to make 
> sure Kafka commit log state is clearly documented and handled.
> # *Timestamp*.  Currently unsupported.  This could be supported with old, 
> inaccurate Kafka time api, or upcoming time index
> # *X offsets* before or after latest / earliest position.  Currently 
> unsupported.  I think the semantics of this are super unclear by comparison 
> with timestamp, given that Kafka doesn't have a single range of offsets.
> Currently allowed pre-query configuration, all "ORs" are exclusive:
> # startingOffsets: *earliest* OR *latest* OR *User specified* json per 
> topicpartition  (SPARK-17812)
> # failOnDataLoss: true (which implies *Fail* above) OR false (which implies 
> *Earliest* above)  In general, I see no reason this couldn't specify Latest 
> as an option.
> Possible lifecycle times in which an offset-related event may happen:
> # At initial query start
> #* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
> startingOffsets is *User specified* perTopicpartition, and the new partition 
> isn't in the map, *Fail*.  Note that this is effectively indistinguishable 
> from a new partition during the query, because partitions may have changed in 
> between pre-query configuration and query start, but we treat it differently, 
> and users in this case are SOL
> #* Offset out of range on driver: We don't technically have behavior for this 
> case yet.  Could use the value of failOnDataLoss, but it's possible people 
> may want to know at startup that something was wrong, even if they're ok with 
> earliest for a during-query out of range
> #* Offset out of range on executor: seems like it should be *Fail* or 
> *Earliest*, based on failOnDataLoss.  but it looks like this setting is 
> currently ignored, and the executor will just fail...
> # During query
> #* New partition:  *Earliest*, only.  This seems to be by fiat, I see no 
> reason this can't be configurable.
> #* Offset out of range on driver:  this _probably_ doesn't happen, because 
> we're doing explicit seeks to the latest position
> #* Offset out of range on executor:  ?
> # At query restart 
> #* New partition: *Checkpoint*, fall back to *Earliest*.  Again, no reason 
> this couldn't be configurable fall back to Latest
> #* Offset out of range on driver:   this _probably_ doesn't happen, because 
> we're doing explicit seeks to the specified position
> #* Offset out of range on executor:  ?
> I've probably missed something, chime in.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming

2016-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634061#comment-15634061
 ] 

Michael Armbrust commented on SPARK-17937:
--

I'm going to pull this out from the parent JIRA as I don't think it blocks 
basic kafka usage, and there are enough things here to warrant several subtasks 
on their own.  I see a couple of possible concrete action items here:
 - make sure what we have is documented clearly, possibly even with a 
comparison to {{auto.offset.reset}} for those more familiar with Kafka (see the 
short sketch after this list)
 - what to do on data loss: seems we are missing the ability to do {{latest}}.  
I'd be fine with changing this to {{onDataLoss = fail,earliest,latest}} with 
fail being the default.  It would be nice to keep compatibility with the old 
option, but that is minor.
 - how to handle new partitions: if possible I'd like to lump this into the 
{{onDataLoss}} setting.  When a new partition appears mid-query, the default 
should be to process all of it (if that can be assured).  If it can't, because 
downtime means the data has already aged out, I'd like to error by default, but 
the user should be able to pick earliest or latest.
 - timestamps: sounds awesome, should probably be its own feature JIRA
 - integration with the kafka commit log: also could probably be its own 
feature JIRA.  I'd also like to hear requests from users on what they need 
here.  Is it monitoring?  Is it migrating queries to structured streaming?  My 
big concern is that it might be confusing, since we can't use the same 
transactional tricks we use for our own checkpoint commit log, and I don't want 
users to lose exactly-once semantics without understanding why.
 - X offsets - also its own feature JIRA.  I agree it only makes sense for 
topics that are uniformly hash partitioned (all of mine are).  Maybe we skip 
this if we get timestamps soon enough.
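
To make the {{auto.offset.reset}} comparison in the first item concrete, here is 
a minimal kafka-clients sketch (not part of this JIRA); the broker, group, and 
topic names are placeholders.

{code}
// Kafka's own consumer knob only distinguishes earliest / latest / none (fail),
// which maps loosely onto the proposed onDataLoss = fail | earliest | latest.
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object AutoOffsetResetSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("group.id", "example-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "none")  // "earliest", "latest", or "none" (error out)

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("topicA"))
    consumer.close()
  }
}
{code}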

> Clarify Kafka offset semantics for Structured Streaming
> ---
>
> Key: SPARK-17937
> URL: https://issues.apache.org/jira/browse/SPARK-17937
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> Possible events for which offsets are needed:
> # New partition is discovered
> # Offset out of range (aka, data has been lost).   It's possible to separate 
> this into offset too small and offset too large, but I'm not sure it matters 
> for us.
> Possible sources of offsets:
> # *Earliest* position in log
> # *Latest* position in log
> # *Fail* and kill the query
> # *Checkpoint* position
> # *User specified* per topicpartition
> # *Kafka commit log*.  Currently unsupported.  This means users who want to 
> migrate from existing kafka jobs need to jump through hoops.  Even if we 
> never want to support it, as soon as we take on SPARK-17815 we need to make 
> sure Kafka commit log state is clearly documented and handled.
> # *Timestamp*.  Currently unsupported.  This could be supported with old, 
> inaccurate Kafka time api, or upcoming time index
> # *X offsets* before or after latest / earliest position.  Currently 
> unsupported.  I think the semantics of this are super unclear by comparison 
> with timestamp, given that Kafka doesn't have a single range of offsets.
> Currently allowed pre-query configuration, all "ORs" are exclusive:
> # startingOffsets: *earliest* OR *latest* OR *User specified* json per 
> topicpartition  (SPARK-17812)
> # failOnDataLoss: true (which implies *Fail* above) OR false (which implies 
> *Earliest* above)  In general, I see no reason this couldn't specify Latest 
> as an option.
> Possible lifecycle times in which an offset-related event may happen:
> # At initial query start
> #* New partition: if startingOffsets is *Earliest* or *Latest*, use that.  If 
> startingOffsets is *User specified* per topicpartition, and the new partition 
> isn't in the map, *Fail*.  Note that this is effectively indistinguishable 
> from a new partition during the query, because partitions may have changed in 
> between pre-query configuration and query start, but we treat it differently, 
> and users in this case are SOL
> #* Offset out of range on driver: We don't technically have behavior for this 
> case yet.  Could use the value of failOnDataLoss, but it's possible people 
> may want to know at startup that something was wrong, even if they're ok with 
> earliest for a during-query out of range
> #* Offset out of range on executor: seems like it should be *Fail* or 
> *Earliest*, based on failOnDataLoss.  but it looks like this setting is 
> currently ignored, and the executor will just fail...
> # During query
> #* New partition:  *Earliest*, only.  This seems to be by fiat, I see no 
> reason this can't be configurable.
> #* Offset out of range on driver:  this _probably_ doesn't happen, because 
> we're doing explicit seeks to the latest position

[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2016-11-03 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634050#comment-15634050
 ] 

Josh Rosen commented on SPARK-14220:


SPARK-14643 is likely to be the hardest task.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2016-11-03 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634027#comment-15634027
 ] 

Jakob Odersky commented on SPARK-14220:
---

at least most dependencies will probably make 2.12 builds available, now that 
it is considered binary-stable

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14220) Build and test Spark against Scala 2.12

2016-11-03 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634027#comment-15634027
 ] 

Jakob Odersky edited comment on SPARK-14220 at 11/3/16 7:54 PM:


At least most dependencies will probably make 2.12 builds available, now that 
it is considered binary-stable. The closure cleaning and byte code manipulation 
stuff is a whole different story though...


was (Author: jodersky):
at least most dependencies will probably make 2.12 builds available, now that 
it is considered binary-stable

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11914) [SQL] Support coalesce and repartition in Dataset APIs

2016-11-03 Thread Ivan Gozali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633984#comment-15633984
 ] 

Ivan Gozali commented on SPARK-11914:
-

Hi, apologies for bringing this up in an old issue. I was wondering if there's 
any particular reason the {{shuffle}} argument that's present in 
[JavaRDD.coalesce()|https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDD.html#coalesce(int,%20boolean)]
 is not present in the Dataset API? Also, are there plans to bring it into 
the Dataset API?

Thank you!

> [SQL] Support coalesce and repartition in Dataset APIs
> --
>
> Key: SPARK-11914
> URL: https://issues.apache.org/jira/browse/SPARK-11914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 1.6.0
>
>
> repartition: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions.
> coalesce: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions. Similar to coalesce defined on an [[RDD]], this operation results 
> in a narrow dependency, e.g. if you go from 1000 partitions to 100 
> partitions, there will not be a shuffle, instead each of the 100 new 
> partitions will claim 10 of the current partitions.
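
A minimal sketch (not from this JIRA) contrasting the two Dataset operations 
described above; dropping to the RDD is one way to get the 
coalesce(numPartitions, shuffle) behaviour the comment asks about. The app name 
is a placeholder.

{code}
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()

    val ds = spark.range(0L, 1000000L)

    val narrowed   = ds.coalesce(10)       // narrow dependency, no shuffle
    val reshuffled = ds.repartition(100)   // full shuffle into exactly 100 partitions

    // The Dataset API exposes no coalesce(numPartitions, shuffle) overload, so a
    // workaround is to use the two-argument coalesce at the RDD level.
    val viaRdd = ds.rdd.coalesce(10, shuffle = true)

    println((narrowed.rdd.getNumPartitions, reshuffled.rdd.getNumPartitions, viaRdd.getNumPartitions))
    spark.stop()
  }
}
{code}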



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18257) Improve error reporting for FileStressSuite in streaming

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18257:


Assignee: Reynold Xin  (was: Apache Spark)

> Improve error reporting for FileStressSuite in streaming
> 
>
> Key: SPARK-18257
> URL: https://issues.apache.org/jira/browse/SPARK-18257
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> FileStressSuite doesn't report errors when they occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18257) Improve error reporting for FileStressSuite in streaming

2016-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633931#comment-15633931
 ] 

Apache Spark commented on SPARK-18257:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/15757

> Improve error reporting for FileStressSuite in streaming
> 
>
> Key: SPARK-18257
> URL: https://issues.apache.org/jira/browse/SPARK-18257
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> FileStressSuite doesn't report errors when they occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-11-03 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633932#comment-15633932
 ] 

Shivaram Venkataraman commented on SPARK-15799:
---

Yes - I think this is good to go. The only thing remaining IMHO is that the 
vignette needs to be packaged correctly. 
(https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-package-vignettes
 has some details).

I think it would be good if we could release it with 2.1.0? We have many new ML 
algorithms in it.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18257) Improve error reporting for FileStressSuite in streaming

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18257:


Assignee: Apache Spark  (was: Reynold Xin)

> Improve error reporting for FileStressSuite in streaming
> 
>
> Key: SPARK-18257
> URL: https://issues.apache.org/jira/browse/SPARK-18257
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> FileStressSuite doesn't report errors when they occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633923#comment-15633923
 ] 

Sean Owen commented on SPARK-9487:
--

Yes, keep going, why not?

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
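
A small sketch illustrating why the description above matters: an operation 
seeded by the partition id produces different results under local[2] and 
local[4], so Scala/Java and Python tests can silently diverge. The app name is 
a placeholder.

{code}
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

object PartitionSeedSketch {
  // Sums RNG output seeded by partition id; the result depends on how many
  // partitions `parallelize` creates, which defaults to the local[N] core count.
  def sumWithMaster(master: String): Double = {
    val sc = new SparkContext(new SparkConf().setMaster(master).setAppName("seed-sketch"))
    try {
      sc.parallelize(1 to 1000)
        .mapPartitionsWithIndex { (pid, it) =>
          val rng = new Random(pid)
          it.map(_ => rng.nextDouble())
        }
        .sum()
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    println(sumWithMaster("local[2]"))  // generally differs from the next line,
    println(sumWithMaster("local[4]"))  // even though the input data is identical
  }
}
{code}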



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist

2016-11-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633917#comment-15633917
 ] 

Sean Owen commented on SPARK-18230:
---

Agree, I can't see how you'd return anything in this case. A Rating with value 
NaN doesn't make sense. (Why just one?)
In fact, returning no recommendations is a valid response for an existing user, 
so the response for a non-existent user could, and probably should, be 
different. I think it's OK to view it as exceptional.

> MatrixFactorizationModel.recommendProducts throws NoSuchElement exception 
> when the user does not exist
> --
>
> Key: SPARK-18230
> URL: https://issues.apache.org/jira/browse/SPARK-18230
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.1
>Reporter: Mikael Ståldal
>Priority: Minor
>
> When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a 
> non-existing user, a {{java.util.NoSuchElementException}} is thrown:
> {code}
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at 
> scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
>   at scala.collection.IterableLike$class.head(IterableLike.scala:107)
>   at 
> scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35)
>   at 
> scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126)
>   at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35)
>   at 
> org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169)
> {code}
> It would be nice if it returned an empty array or threw a more specific 
> exception, and if that behavior were documented in the ScalaDoc for the method.
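
Until the behaviour is clarified or documented, here is a defensive sketch (not 
part of the JIRA) of how a caller can avoid the surprise; `recommendOrEmpty` and 
`recommendAsTry` are hypothetical helper names.

{code}
import scala.util.Try
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

object SafeRecommendSketch {
  // Return an empty array for unknown users. Note: lookup() launches a small
  // Spark job per call, so this is for illustration rather than hot paths.
  def recommendOrEmpty(model: MatrixFactorizationModel, user: Int, num: Int): Array[Rating] =
    if (model.userFeatures.lookup(user).nonEmpty) model.recommendProducts(user, num)
    else Array.empty[Rating]

  // Alternatively, keep the "exceptional" view suggested above but recover explicitly.
  def recommendAsTry(model: MatrixFactorizationModel, user: Int, num: Int): Try[Array[Rating]] =
    Try(model.recommendProducts(user, num))
}
{code}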



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18257) Improve error reporting for FileStressSuite in streaming

2016-11-03 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-18257:
---

 Summary: Improve error reporting for FileStressSuite in streaming
 Key: SPARK-18257
 URL: https://issues.apache.org/jira/browse/SPARK-18257
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Reporter: Reynold Xin
Assignee: Reynold Xin


FileStressSuite doesn't report errors when they occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results

2016-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633888#comment-15633888
 ] 

Apache Spark commented on SPARK-18256:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15756

> Improve performance of event log replay in HistoryServer based on profiler 
> results
> --
>
> Key: SPARK-18256
> URL: https://issues.apache.org/jira/browse/SPARK-18256
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Profiling event log replay in the HistoryServer reveals Json4S control flow 
> exceptions and `Utils.getFormattedClassName` calls as significant 
> bottlenecks. Eliminating these halves the time to replay long event logs.
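
As a general illustration of the class-name half of that finding (not the 
actual change in the pull request above): compute the formatted name once per 
event class and reuse it, instead of re-deriving it for every replayed event.

{code}
import scala.collection.concurrent.TrieMap

object EventClassNameCache {
  private val cache = TrieMap.empty[Class[_], String]

  // Memoize the human-readable name per class so long replays pay the
  // reflection/string cost once per event type, not once per event.
  def nameOf(event: AnyRef): String =
    cache.getOrElseUpdate(event.getClass, event.getClass.getSimpleName.stripSuffix("$"))
}
{code}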



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18256:


Assignee: Apache Spark  (was: Josh Rosen)

> Improve performance of event log replay in HistoryServer based on profiler 
> results
> --
>
> Key: SPARK-18256
> URL: https://issues.apache.org/jira/browse/SPARK-18256
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> Profiling event log replay in the HistoryServer reveals Json4S control flow 
> exceptions and `Utils.getFormattedClassName` calls as significant 
> bottlenecks. Eliminating these halves the time to replay long event logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results

2016-11-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18256:


Assignee: Josh Rosen  (was: Apache Spark)

> Improve performance of event log replay in HistoryServer based on profiler 
> results
> --
>
> Key: SPARK-18256
> URL: https://issues.apache.org/jira/browse/SPARK-18256
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Profiling event log replay in the HistoryServer reveals Json4S control flow 
> exceptions and `Utils.getFormattedClassName` calls as significant 
> bottlenecks. Eliminating these halves the time to replay long event logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results

2016-11-03 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-18256:
--

 Summary: Improve performance of event log replay in HistoryServer 
based on profiler results
 Key: SPARK-18256
 URL: https://issues.apache.org/jira/browse/SPARK-18256
 Project: Spark
  Issue Type: Bug
Reporter: Josh Rosen
Assignee: Josh Rosen


Profiling event log replay in the HistoryServer reveals Json4S control flow 
exceptions and `Utils.getFormattedClassName` calls as significant bottlenecks. 
Eliminating these halves the time to replay long event logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results

2016-11-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-18256:
---
Issue Type: Improvement  (was: Bug)

> Improve performance of event log replay in HistoryServer based on profiler 
> results
> --
>
> Key: SPARK-18256
> URL: https://issues.apache.org/jira/browse/SPARK-18256
> Project: Spark
>  Issue Type: Improvement
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> Profiling event log replay in the HistoryServer reveals Json4S control flow 
> exceptions and `Utils.getFormattedClassName` calls as significant 
> bottlenecks. Eliminating these halves the time to replay long event logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18237:

Fix Version/s: (was: 2.0.3)

> hive.exec.stagingdir have no effect in spark2.0.1
> -
>
> Key: SPARK-18237
> URL: https://issues.apache.org/jira/browse/SPARK-18237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: ClassNotFoundExp
>Assignee: ClassNotFoundExp
> Fix For: 2.1.0
>
>
> hive.exec.stagingdir has no effect in Spark 2.0.1;
> this is related to https://issues.apache.org/jira/browse/SPARK-11021



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18237.
-
   Resolution: Fixed
 Assignee: ClassNotFoundExp
Fix Version/s: 2.1.0
   2.0.3

> hive.exec.stagingdir have no effect in spark2.0.1
> -
>
> Key: SPARK-18237
> URL: https://issues.apache.org/jira/browse/SPARK-18237
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: ClassNotFoundExp
>Assignee: ClassNotFoundExp
> Fix For: 2.0.3, 2.1.0
>
>
> hive.exec.stagingdir has no effect in Spark 2.0.1;
> this is related to https://issues.apache.org/jira/browse/SPARK-11021



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18244) Rename partitionProviderIsHive -> tracksPartitionsInCatalog

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18244.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Rename partitionProviderIsHive -> tracksPartitionsInCatalog
> ---
>
> Key: SPARK-18244
> URL: https://issues.apache.org/jira/browse/SPARK-18244
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> partitionProviderIsHive is too specific to Hive. In reality we can track 
> partitions in any catalog, not just Hive's metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14220) Build and test Spark against Scala 2.12

2016-11-03 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14220:

Target Version/s:   (was: 2.2.0)

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2016-11-03 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633773#comment-15633773
 ] 

Reynold Xin commented on SPARK-14220:
-

Yea in reality it's going to be really painful to upgrade.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


