[jira] [Updated] (SPARK-18218) Optimize BlockMatrix multiplication, which may cause OOM and low parallelism usage problem in several cases
[ https://issues.apache.org/jira/browse/SPARK-18218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-18218: Shepherd: Yanbo Liang > Optimize BlockMatrix multiplication, which may cause OOM and low parallelism > usage problem in several cases > --- > > Key: SPARK-18218 > URL: https://issues.apache.org/jira/browse/SPARK-18218 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > After taking a deep look into the `BlockMatrix.multiply` implementation, I found > that the current implementation may cause problems in several cases. > Let me use an extreme case to illustrate: > Suppose we have two BlockMatrix instances A and B. > A has 1 blocks, numRowBlocks = 1, numColBlocks = 1. > B also has 1 blocks, numRowBlocks = 1, numColBlocks = 1. > Now if we call A.multiply(B), no matter how A and B are partitioned, > the resultPartitioner will always contain only one partition, so this > multiplication implementation will shuffle 1 * 1 blocks into one > reducer, dropping the parallelism to 1. > What's worse, because `RDD.cogroup` loads an entire group's elements into > memory, at the reducer side 1 * 1 blocks will be loaded into memory, > because they are all shuffled into the same group. This can easily cause > executor OOM. > The above case is a little extreme, but in other cases, such as an M*N-dimensional > matrix A multiplied by an N*P-dimensional matrix B where N is much larger than M and > P, we meet a similar problem. > The multiplication implementation does not handle the task partitioning properly, > which causes: > 1. when the middle dimension N is too large, reducer OOM; > 2. even if OOM does not occur, parallelism that is still too low; > 3. when N is much larger than M and P, and matrices A and B have many > partitions, too many partitions on the M and P dimensions, and hence a much > larger shuffled data size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
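The shuffle pattern the report describes can be modeled outside Spark. Below is a minimal pure-Python sketch (an illustration of the arithmetic, not Spark's actual GridPartitioner/cogroup code, and `blocks_per_reducer` is a made-up helper name): for C = A (M x N blocks) * B (N x P blocks), each output block (i, j) needs N block products, so N co-grouped pairs land on whichever reducer partition owns (i, j); when the result has few partitions, one reducer holds everything.

```python
# Pure-Python model of the shuffle described above (illustration only,
# not Spark's implementation). Output block (i, j) requires the N
# products A[i, k] * B[k, j]; all N pairs are co-grouped onto the
# reducer partition that owns (i, j).

def blocks_per_reducer(m, n, p, num_result_partitions):
    """Count co-grouped block pairs per reducer partition, assigning
    output block (i, j) round-robin by its linear index i * p + j."""
    load = [0] * num_result_partitions
    for i in range(m):
        for j in range(p):
            part = (i * p + j) % num_result_partitions
            load[part] += n  # N co-grouped pairs for this output block
    return load

# Extreme case: tall-thin times short-wide (M = P = 1, huge N). The
# result is a single block, so one partition does all the work and must
# materialize all N co-grouped pairs at once -> low parallelism + OOM risk.
print(blocks_per_reducer(m=1, n=10000, p=1, num_result_partitions=4))
# -> [10000, 0, 0, 0]
```

The skewed load vector makes the ticket's point concrete: no choice of input partitioning helps, because the result partitioner alone decides where the co-grouped blocks go.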
[jira] [Commented] (SPARK-18193) queueStream not updated if rddQueue.add after create queueStream in Java
[ https://issues.apache.org/jira/browse/SPARK-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635294#comment-15635294 ] Hubert Kang commented on SPARK-18193: - Is it possible to do that the opposite way, i.e. update JavaQueueStream.java to match QueueStream.scala? That's exactly how I want to handle a live data stream with a Queue. > queueStream not updated if rddQueue.add after create queueStream in Java > > > Key: SPARK-18193 > URL: https://issues.apache.org/jira/browse/SPARK-18193 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.1 >Reporter: Hubert Kang > > Within > examples\src\main\java\org\apache\spark\examples\streaming\JavaQueueStream.java, > no data is detected if the code below, which puts something into rddQueue, is > executed after the queueStream is created (line 65). > for (int i = 0; i < 30; i++) { > rddQueue.add(ssc.sparkContext().parallelize(list)); > }
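One plausible explanation for the Java/Scala difference (an assumption on my part, not verified against the Spark source) is that the stream ends up reading from a copy of the queue taken at creation time rather than from a live reference, so later `rddQueue.add(...)` calls are invisible. A language-agnostic pure-Python sketch of that difference, with made-up helper names:

```python
# Analogy only (not Spark code): a "stream" built from a snapshot of a
# queue versus one built from a live reference to it.

def snapshot_stream(rdd_queue):
    # Copies the queue's contents once, when the stream is created.
    batches = list(rdd_queue)
    return lambda: batches.pop(0) if batches else None

def live_stream(rdd_queue):
    # Keeps a reference; sees items added after creation.
    return lambda: rdd_queue.pop(0) if rdd_queue else None

q = []
snap = snapshot_stream(q)   # created while the queue is still empty
live = live_stream(q)
q.append("batch-1")         # added AFTER the streams were created
print(snap())               # -> None ("no data is detected")
print(live())               # -> batch-1
```

If the Java wrapper indeed snapshots the queue, matching QueueStream.scala would mean keeping the live-reference behavior.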
[jira] [Commented] (SPARK-18225) job will miss when driver removed by master in spark streaming
[ https://issues.apache.org/jira/browse/SPARK-18225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635261#comment-15635261 ] liujianhui commented on SPARK-18225: we provide a platform for users to submit their streaming apps; in that case, an app would have to set up an endpoint in its own code to receive the stopped message and then call StreamingContext#stop to shut down, but that is not convenient > job will miss when driver removed by master in spark streaming > --- > > Key: SPARK-18225 > URL: https://issues.apache.org/jira/browse/SPARK-18225 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler >Affects Versions: 1.6.1, 1.6.2 >Reporter: liujianhui > > Killing the application on the Spark UI makes the master send an ApplicationRemoved > to the driver; the driver aborts all pending jobs, and the jobs finish with the > exception "Master removed our application: Killed". JobScheduler then removes the > jobs from the jobsets, but JobGenerator still runs doCheckpoint without the jobs > removed before, and then the driver stops; when recovering from the > checkpoint file, all the aborted jobs are missed.
[jira] [Resolved] (SPARK-18259) QueryExecution should not catch Throwable
[ https://issues.apache.org/jira/browse/SPARK-18259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18259. - Resolution: Fixed Fix Version/s: 2.1.0 > QueryExecution should not catch Throwable > - > > Key: SPARK-18259 > URL: https://issues.apache.org/jira/browse/SPARK-18259 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > Fix For: 2.1.0 > > > QueryExecution currently captures Throwable. This is far from a best > practice. It would be better if we'd catch AnalysisExceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18225) job will miss when driver removed by master in spark streaming
[ https://issues.apache.org/jira/browse/SPARK-18225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635252#comment-15635252 ] liujianhui commented on SPARK-18225: it still runs doCheckpoint even when killed from the UI, because the recurringTimer does not stop and keeps posting DoCheckpoint events as before. But the SparkContext is marked stopped, so all the jobs will be aborted > job will miss when driver removed by master in spark streaming > --- > > Key: SPARK-18225 > URL: https://issues.apache.org/jira/browse/SPARK-18225 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler >Affects Versions: 1.6.1, 1.6.2 >Reporter: liujianhui > > Killing the application on the Spark UI makes the master send an ApplicationRemoved > to the driver; the driver aborts all pending jobs, and the jobs finish with the > exception "Master removed our application: Killed". JobScheduler then removes the > jobs from the jobsets, but JobGenerator still runs doCheckpoint without the jobs > removed before, and then the driver stops; when recovering from the > checkpoint file, all the aborted jobs are missed.
[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation
[ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635204#comment-15635204 ] Nattavut Sutyanyong commented on SPARK-17348: - [~hvanhovell], would you please review my PR to return an AnalysisException for the scenarios that could produce incorrect results? Thank you. > Incorrect results from subquery transformation > -- > > Key: SPARK-17348 > URL: https://issues.apache.org/jira/browse/SPARK-17348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > {noformat} > Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1") > Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= > t2.c2)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result of the above query should be an empty set. Here is an > explanation: > Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when > T1.C1 = 1, so both rows need to be processed in the same group of the > aggregation process in the subquery. The result of the aggregation yields > MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate > T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set.
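The expected (empty) result can be re-derived outside Spark. A small pure-Python sketch of the subquery's intended semantics, using the same rows as the repro above:

```python
# Re-deriving the expected answer for the repro query in plain Python:
#   select c1 from t1 where c1 in
#     (select max(t2.c1) from t2 where t1.c2 >= t2.c2)
t1 = [(1, 1)]          # (c1, c2)
t2 = [(1, 1), (2, 0)]  # (c1, c2)

result = []
for c1, c2 in t1:
    # Correlated subquery: one aggregate over ALL t2 rows passing the
    # predicate for THIS outer row -- a single group, as the ticket explains.
    group = [t2c1 for t2c1, t2c2 in t2 if c2 >= t2c2]
    if group and c1 in {max(group)}:
        result.append(c1)

print(result)  # -> [] : MAX(T2.C1) is 2, and 1 IN (2) is false
```

Since both t2 rows fall into the single group, the aggregate is 2 and the outer value 1 never matches, confirming the reported answer of {{1}} is wrong.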
[jira] [Assigned] (SPARK-18217) Disallow creating permanent views based on temporary views or UDFs
[ https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18217: Assignee: Xiao Li (was: Apache Spark) > Disallow creating permanent views based on temporary views or UDFs > -- > > Key: SPARK-18217 > URL: https://issues.apache.org/jira/browse/SPARK-18217 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Xiao Li > > See the discussion in the parent ticket SPARK-18209. It doesn't really make > sense to create permanent views based on temporary views or UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18217) Disallow creating permanent views based on temporary views or UDFs
[ https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18217: Assignee: Apache Spark (was: Xiao Li) > Disallow creating permanent views based on temporary views or UDFs > -- > > Key: SPARK-18217 > URL: https://issues.apache.org/jira/browse/SPARK-18217 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > See the discussion in the parent ticket SPARK-18209. It doesn't really make > sense to create permanent views based on temporary views or UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18217) Disallow creating permanent views based on temporary views or UDFs
[ https://issues.apache.org/jira/browse/SPARK-18217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635202#comment-15635202 ] Apache Spark commented on SPARK-18217: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15764 > Disallow creating permanent views based on temporary views or UDFs > -- > > Key: SPARK-18217 > URL: https://issues.apache.org/jira/browse/SPARK-18217 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Xiao Li > > See the discussion in the parent ticket SPARK-18209. It doesn't really make > sense to create permanent views based on temporary views or UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17348) Incorrect results from subquery transformation
[ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635199#comment-15635199 ] Apache Spark commented on SPARK-17348: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/15763 > Incorrect results from subquery transformation > -- > > Key: SPARK-17348 > URL: https://issues.apache.org/jira/browse/SPARK-17348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > {noformat} > Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1") > Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= > t2.c2)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result of the above query should be an empty set. Here is an > explanation: > Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when > T1.C1 = 1, so both rows need to be processed in the same group of the > aggregation process in the subquery. The result of the aggregation yields > MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate > T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set.
[jira] [Assigned] (SPARK-17348) Incorrect results from subquery transformation
[ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17348: Assignee: (was: Apache Spark) > Incorrect results from subquery transformation > -- > > Key: SPARK-17348 > URL: https://issues.apache.org/jira/browse/SPARK-17348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > Labels: correctness > > {noformat} > Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1") > Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= > t2.c2)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result of the above query should be an empty set. Here is an > explanation: > Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when > T1.C1 = 1, so both rows need to be processed in the same group of the > aggregation process in the subquery. The result of the aggregation yields > MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate > T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set.
[jira] [Assigned] (SPARK-17348) Incorrect results from subquery transformation
[ https://issues.apache.org/jira/browse/SPARK-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17348: Assignee: Apache Spark > Incorrect results from subquery transformation > -- > > Key: SPARK-17348 > URL: https://issues.apache.org/jira/browse/SPARK-17348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong >Assignee: Apache Spark > Labels: correctness > > {noformat} > Seq((1,1)).toDF("c1","c2").createOrReplaceTempView("t1") > Seq((1,1),(2,0)).toDF("c1","c2").createOrReplaceTempView("t2") > sql("select c1 from t1 where c1 in (select max(t2.c1) from t2 where t1.c2 >= > t2.c2)").show > +---+ > | c1| > +---+ > | 1| > +---+ > {noformat} > The correct result of the above query should be an empty set. Here is an > explanation: > Both rows from T2 satisfy the correlated predicate T1.C2 >= T2.C2 when > T1.C1 = 1, so both rows need to be processed in the same group of the > aggregation process in the subquery. The result of the aggregation yields > MAX(T2.C1) as 2. Therefore, the result of the evaluation of the predicate > T1.C1 (which is 1) IN MAX(T2.C1) (which is 2) should be an empty set.
[jira] [Created] (SPARK-18262) JSON.org license is now CatX
Sean Busbey created SPARK-18262: --- Summary: JSON.org license is now CatX Key: SPARK-18262 URL: https://issues.apache.org/jira/browse/SPARK-18262 Project: Spark Issue Type: Bug Reporter: Sean Busbey Priority: Blocker per [update resolved legal|http://www.apache.org/legal/resolved.html#json]: {quote} CAN APACHE PRODUCTS INCLUDE WORKS LICENSED UNDER THE JSON LICENSE? No. As of 2016-11-03 this has been moved to the 'Category X' license list. Prior to this, use of the JSON Java library was allowed. See Debian's page for a list of alternatives. {quote} I'm not actually clear if Spark is using one of the JSON.org licensed libraries. As of current master (dc4c6009) the java library gets called out in the [NOTICE file for our source repo|https://github.com/apache/spark/blob/dc4c60098641cf64007e2f0e36378f000ad5f6b1/NOTICE#L424] but: 1) It doesn't say where in the source 2) the given url is 404 (http://www.json.org/java/index.html) 3) It doesn't actually say in the NOTICE what license the inclusion is under 4) the JSON.org license for the java {{org.json:json}} artifact (what the blurb in #2 is usually referring to) doesn't show up in our LICENSE file, nor in the {{licenses/}} directory 5) I don't see a direct reference to the {{org.json:json}} artifact in our poms. So maybe it's just coming in transitively and we can exclude it / ping whoever is bringing it in? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18261) Add statistics to MemorySink for joining
[ https://issues.apache.org/jira/browse/SPARK-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15635014#comment-15635014 ] Burak Yavuz commented on SPARK-18261: - Go for it! > Add statistics to MemorySink for joining > - > > Key: SPARK-18261 > URL: https://issues.apache.org/jira/browse/SPARK-18261 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.0.2 >Reporter: Burak Yavuz > > Right now, there is no way to join the output of a memory sink with any table: > {code} > UnsupportedOperationException: LeafNode MemoryPlan must implement statistics > {code} > Being able to join snapshots of memory streams with tables would be nice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18261) Add statistics to MemorySink for joining
[ https://issues.apache.org/jira/browse/SPARK-18261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634979#comment-15634979 ] Liwei Lin commented on SPARK-18261: --- If no one's working on this, I'd like to take this > Add statistics to MemorySink for joining > - > > Key: SPARK-18261 > URL: https://issues.apache.org/jira/browse/SPARK-18261 > Project: Spark > Issue Type: New Feature > Components: SQL, Structured Streaming >Affects Versions: 2.0.2 >Reporter: Burak Yavuz > > Right now, there is no way to join the output of a memory sink with any table: > {code} > UnsupportedOperationException: LeafNode MemoryPlan must implement statistics > {code} > Being able to join snapshots of memory streams with tables would be nice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18185: --- Description: As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the updated partitions as in Hive. It also doesn't respect custom partition locations. We should delete only the proper partitions, scan the metastore for affected partitions with custom locations, and ensure that deletes/writes go to the right locations for those as well. was: As of current 2.1, INSERT OVERWRITE with dynamic partitions against a Datasource table will overwrite the entire table instead of only the updated partitions as in Hive. This is non-trivial to fix in 2.1, so we should throw an exception here. > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. It also doesn't respect custom partition locations. > We should delete only the proper partitions, scan the metastore for affected > partitions with custom locations, and ensure that deletes/writes go to the > right locations for those as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
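The behavioral difference in the updated description can be stated concretely. A small pure-Python model (a sketch of the semantics with made-up helper names, not Spark's implementation) of "overwrite the entire table" versus Hive's "overwrite only the touched partitions":

```python
# Modeling INSERT OVERWRITE with dynamic partitions: a table is a dict
# mapping partition value -> rows. (Semantics sketch, not Spark code.)

def overwrite_entire_table(table, new_rows, partition_of):
    table.clear()                      # reported behavior for Datasource
    for row in new_rows:               # tables: every partition is wiped
        table.setdefault(partition_of(row), []).append(row)

def overwrite_touched_partitions(table, new_rows, partition_of):
    touched = {partition_of(r) for r in new_rows}
    for part in touched:               # Hive behavior: delete only the
        table.pop(part, None)          # partitions this insert writes to
    for row in new_rows:
        table.setdefault(partition_of(row), []).append(row)

t = {"2016-11-01": ["a"], "2016-11-02": ["b"]}
overwrite_touched_partitions(t, ["c"], partition_of=lambda r: "2016-11-02")
print(sorted(t))  # -> ['2016-11-01', '2016-11-02']  (untouched day survives)
```

Custom partition locations add the extra wrinkle the description mentions: the delete and the rewrite must both resolve each touched partition's actual location via the metastore, not assume the default path.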
[jira] [Updated] (SPARK-18185) Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions
[ https://issues.apache.org/jira/browse/SPARK-18185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18185: --- Summary: Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions (was: Should disallow INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions) > Should fix INSERT OVERWRITE TABLE of Datasource tables with dynamic partitions > -- > > Key: SPARK-18185 > URL: https://issues.apache.org/jira/browse/SPARK-18185 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Eric Liang > > As of current 2.1, INSERT OVERWRITE with dynamic partitions against a > Datasource table will overwrite the entire table instead of only the updated > partitions as in Hive. > This is non-trivial to fix in 2.1, so we should throw an exception here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18101) ExternalCatalogSuite should test with mixed case fields
[ https://issues.apache.org/jira/browse/SPARK-18101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634816#comment-15634816 ] Wenchen Fan commented on SPARK-18101: - Hi [~ekhliang] , can this newly added test(https://github.com/apache/spark/pull/14750/files#diff-8c4108666a6639034f0ddbfa075bcb37R273) resolve this ticket? > ExternalCatalogSuite should test with mixed case fields > --- > > Key: SPARK-18101 > URL: https://issues.apache.org/jira/browse/SPARK-18101 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Eric Liang > > Currently, it uses field names such as "a" and "b" which are not useful for > testing case preservation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634741#comment-15634741 ] Nattavut Sutyanyong commented on SPARK-17337: - As commented in the PR, the code was nicely done, but the bigger problem remains, tracked by SPARK-17154. I have made a few attempts to fix the root cause of the problem, which surfaces in many symptoms, but it has taken me a while and is not yet as clean as I want. Also, I see [~kousuke] keeps refining his work through a series of PRs for that, so I hesitate to compete with his solution. > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > While investigating SPARK-16951, I found an incorrect-results case from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem.
[jira] [Updated] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18260: - Component/s: SQL > from_json can throw a better exception when it can't find the column or be > nullSafe > --- > > Key: SPARK-18260 > URL: https://issues.apache.org/jira/browse/SPARK-18260 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz > > I got this exception: > {code} > SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 > failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID > 74170, 10.0.138.84, executor 2): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > {code} > This was because the column that I called `from_json` on didn't exist for all > of my rows. 
Either from_json should be null safe, or it should fail with a > better error message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18261) Add statistics to MemorySink for joining
Burak Yavuz created SPARK-18261: --- Summary: Add statistics to MemorySink for joining Key: SPARK-18261 URL: https://issues.apache.org/jira/browse/SPARK-18261 Project: Spark Issue Type: New Feature Components: SQL, Structured Streaming Affects Versions: 2.0.2 Reporter: Burak Yavuz Right now, there is no way to join the output of a memory sink with any table: {code} UnsupportedOperationException: LeafNode MemoryPlan must implement statistics {code} Being able to join snapshots of memory streams with tables would be nice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
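The error message quoted in the ticket arises because join planning asks every leaf of the logical plan for size statistics. A rough pure-Python analogy (illustration only; these are not Catalyst's real classes or API, and the broadcast threshold is an invented parameter) of why a leaf without statistics breaks planning:

```python
# Analogy (not Catalyst): a planner that sizes a join by asking each
# leaf for statistics, and a leaf that doesn't provide any.

class LeafNode:
    def statistics(self):
        raise NotImplementedError(
            "LeafNode %s must implement statistics" % type(self).__name__)

class TableScan(LeafNode):
    def __init__(self, size_in_bytes):
        self.size_in_bytes = size_in_bytes
    def statistics(self):
        return {"sizeInBytes": self.size_in_bytes}

class MemoryPlan(LeafNode):
    pass  # no statistics() override -> joining against it fails

def plan_join(left, right, broadcast_threshold=10 * 1024 * 1024):
    small = min(left.statistics()["sizeInBytes"],
                right.statistics()["sizeInBytes"])
    return "broadcast" if small <= broadcast_threshold else "sort-merge"

print(plan_join(TableScan(1024), TableScan(10**9)))  # -> broadcast
try:
    plan_join(TableScan(1024), MemoryPlan())
except NotImplementedError as e:
    print(e)  # -> LeafNode MemoryPlan must implement statistics
```

Giving MemorySink's plan a size estimate (e.g. bytes held in the in-memory batches) would let the planner treat a memory-sink snapshot like any other joinable relation.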
[jira] [Updated] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18260: - Target Version/s: 2.1.0 Priority: Blocker (was: Major) > from_json can throw a better exception when it can't find the column or be > nullSafe > --- > > Key: SPARK-18260 > URL: https://issues.apache.org/jira/browse/SPARK-18260 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz >Priority: Blocker > > I got this exception: > {code} > SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 > failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID > 74170, 10.0.138.84, executor 2): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > {code} > This was because the column that I called 
`from_json` on didn't exist for all > of my rows. Either from_json should be null safe, or it should fail with a > better error message -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18138. - Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.1.0 > More officially deprecate support for Python 2.6, Java 7, and Scala 2.10 > > > Key: SPARK-18138 > URL: https://issues.apache.org/jira/browse/SPARK-18138 > Project: Spark > Issue Type: Task >Reporter: Reynold Xin >Assignee: Sean Owen >Priority: Blocker > Fix For: 2.1.0 > > > Plan: > - Make it very explicit in Spark 2.1.0 that support for the aforementioned > environments is deprecated. > - Remove support in Spark 2.2.0 > Also see mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html
[jira] [Updated] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18138: Target Version/s: 2.1.0 (was: 2.2.0) > More officially deprecate support for Python 2.6, Java 7, and Scala 2.10 > > > Key: SPARK-18138 > URL: https://issues.apache.org/jira/browse/SPARK-18138 > Project: Spark > Issue Type: Task >Reporter: Reynold Xin >Priority: Blocker > Fix For: 2.1.0 > > > Plan: > - Make it very explicit in Spark 2.1.0 that support for the aforementioned > environments is deprecated. > - Remove support in Spark 2.2.0 > Also see mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html
[jira] [Assigned] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll
[ https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18235: Assignee: Apache Spark > ml.ALSModel function parity: ALSModel should support recommendforAll > > > Key: SPARK-18235 > URL: https://issues.apache.org/jira/browse/SPARK-18235 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Assignee: Apache Spark > > For function parity with MatrixFactorizationModel, ALS model should support > the APIs: > recommendUsersForProducts > recommendProductsForUsers > There are two other APIs missing (lower priority): > recommendProducts: > recommendUsers: > The function requirement comes from the mailing list: > http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
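[Editor's note] For readers unfamiliar with the requested APIs: `recommendProductsForUsers` amounts to scoring every (user, product) pair by the dot product of their latent factors and keeping the top-k products per user. A minimal pure-Python sketch of that math only (the real MLlib implementation blocks, broadcasts, and shuffles the factor matrices; the function name and dict-based signature here are illustrative assumptions):

```python
def recommend_products_for_users(user_factors, product_factors, num):
    # Score each product for each user by the dot product of latent factors,
    # then keep the `num` highest-scoring products per user.
    recs = {}
    for user, uf in user_factors.items():
        scores = [(p, sum(a * b for a, b in zip(uf, pf)))
                  for p, pf in product_factors.items()]
        scores.sort(key=lambda t: -t[1])  # descending by score
        recs[user] = scores[:num]
    return recs
```

`recommendUsersForProducts` is the same computation with the roles of the two factor maps swapped.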
[jira] [Commented] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe
[ https://issues.apache.org/jira/browse/SPARK-18260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634766#comment-15634766 ] Michael Armbrust commented on SPARK-18260: -- We should return null if the input is null. > from_json can throw a better exception when it can't find the column or be > nullSafe > --- > > Key: SPARK-18260 > URL: https://issues.apache.org/jira/browse/SPARK-18260 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Burak Yavuz >Priority: Blocker > > I got this exception: > {code} > SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 > failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID > 74170, 10.0.138.84, executor 2): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211) > at > org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) > {code} > This was because the 
column that I called `from_json` on didn't exist for all > of my rows. Either from_json should be null-safe, or it should fail with a > better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
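[Editor's note] A minimal pure-Python sketch of the null-safe behavior Michael Armbrust suggests above: null input maps to null output instead of raising. This only illustrates the contract; the actual fix belongs in Catalyst's `JsonToStruct` expression, and the helper name and schema representation here are assumptions for illustration:

```python
import json

def json_to_struct(value, schema_fields):
    # Hypothetical stand-in for JsonToStruct.eval: parse a JSON string into
    # a struct (here, a dict restricted to the schema's fields).
    if value is None:
        return None  # null-safe: null in, null out, no NullPointerException
    try:
        parsed = json.loads(value)
    except ValueError:
        return None  # malformed input also maps to null rather than failing the task
    return {k: parsed.get(k) for k in schema_fields}
```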
[jira] [Assigned] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll
[ https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18235: Assignee: (was: Apache Spark) > ml.ALSModel function parity: ALSModel should support recommendforAll > > > Key: SPARK-18235 > URL: https://issues.apache.org/jira/browse/SPARK-18235 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang > > For function parity with MatrixFactorizationModel, ALS model should support > the APIs: > recommendUsersForProducts > recommendProductsForUsers > There are two other APIs missing (lower priority): > recommendProducts: > recommendUsers: > The function requirement comes from the mailing list: > http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18138: Summary: More officially deprecate support for Python 2.6, Java 7, and Scala 2.10 (was: Remove support for Python 2.6, Hadoop 2.6-, Java 7, and Scala 2.10) > More officially deprecate support for Python 2.6, Java 7, and Scala 2.10 > > > Key: SPARK-18138 > URL: https://issues.apache.org/jira/browse/SPARK-18138 > Project: Spark > Issue Type: Task >Reporter: Reynold Xin >Priority: Blocker > Fix For: 2.1.0 > > > Plan: > - Make it very explicit in Spark 2.1.0 that support for the aforementioned > environments is deprecated. > - Remove support in Spark 2.2.0 > Also see mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18235) ml.ALSModel function parity: ALSModel should support recommendforAll
[ https://issues.apache.org/jira/browse/SPARK-18235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634753#comment-15634753 ] Apache Spark commented on SPARK-18235: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/15762 > ml.ALSModel function parity: ALSModel should support recommendforAll > > > Key: SPARK-18235 > URL: https://issues.apache.org/jira/browse/SPARK-18235 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang > > For function parity with MatrixFactorizationModel, ALS model should support > the APIs: > recommendUsersForProducts > recommendProductsForUsers > There are two other APIs missing (lower priority): > recommendProducts: > recommendUsers: > The function requirement comes from the mailing list: > http://apache-spark-developers-list.1001551.n3.nabble.com/Question-about-using-collaborative-filtering-in-MLlib-td19677.html > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14657) RFormula output wrong features when formula w/o intercept
[ https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14657: -- Target Version/s: 2.2.0 (was: 2.1.0) > RFormula output wrong features when formula w/o intercept > - > > Key: SPARK-14657 > URL: https://issues.apache.org/jira/browse/SPARK-14657 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > SparkR::glm outputs different features compared with R glm when fitted w/o > intercept with string/category features. In the following example, > SparkR outputs three features while native R outputs four. > SparkR::glm > {quote} > training <- suppressWarnings(createDataFrame(sqlContext, iris)) > model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training) > summary(model) > Coefficients: > Estimate Std. Error t value Pr(>|t|) > Sepal_Length 0.67468 0.0093013 72.536 0 > Species_versicolor -1.2349 0.07269 -16.989 0 > Species_virginica -1.4708 0.077397 -19.003 0 > {quote} > stats::glm > {quote} > summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris)) > Coefficients: > Estimate Std. Error t value Pr(>|t|) > Sepal.Length 0.3499 0.0463 7.557 4.19e-12 *** > Speciessetosa 1.6765 0.2354 7.123 4.46e-11 *** > Speciesversicolor 0.6931 0.2779 2.494 0.0137 * > Speciesvirginica 0.6690 0.3078 2.174 0.0313 * > {quote} > The encoding of string/category features is different: R does not drop any > category, but SparkR drops the last one. > I searched online and tested some other cases, and found that when we fit an R glm model (or > other models powered by R formula) w/o intercept on a dataset that includes > string/category features, one of the categories of the first category feature > is used as the reference category, and no category is dropped for that > feature. > I think we should keep consistent semantics between Spark RFormula and R > formula. 
> cc [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
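[Editor's note] To make the encoding difference concrete, here is a pure-Python sketch of dummy coding with and without a dropped reference level (a hypothetical helper, not Spark's RFormula code). With the three-level Species feature from the example, dropping a reference level yields two indicator columns (SparkR's current output), while keeping all levels yields three (R's behavior for the first categorical feature in a no-intercept formula):

```python
def encode_category(value, levels, drop_reference):
    # One-hot encode `value` over the given category `levels`.
    # With drop_reference=True, the first level is treated as the reference
    # and omitted (the usual coding when an intercept is present); with
    # drop_reference=False, all levels get their own indicator column.
    used = levels[1:] if drop_reference else levels
    return [1 if value == level else 0 for level in used]
```

In the no-intercept case R keeps all levels of the first categorical feature, i.e. `drop_reference=False`, which is why it reports coefficients for setosa, versicolor, and virginica while SparkR reports only two.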
[jira] [Commented] (SPARK-17337) Incomplete algorithm for name resolution in Catalyst parser may lead to incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634694#comment-15634694 ] Apache Spark commented on SPARK-17337: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/15761 > Incomplete algorithm for name resolution in Catalyst parser may lead to > incorrect result > --- > > Key: SPARK-17337 > URL: https://issues.apache.org/jira/browse/SPARK-17337 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Nattavut Sutyanyong > > While investigating SPARK-16951, I found an incorrect-results case from a NOT > IN subquery. I originally thought it was an edge case. Further investigation > found this is a more general problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12488. --- Resolution: Fixed Assignee: Xiangrui Meng Fix Version/s: 1.6.1 2.0.0 1.5.3 1.4.2 Target Version/s: (was: 2.1.0) > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin >Assignee: Xiangrui Meng > Fix For: 1.4.2, 1.5.3, 2.0.0, 1.6.1 > > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
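[Editor's note] The reported term IDs fall far outside the valid range [0, vocabSize), and the mix of huge positive and large negative values is consistent with 32-bit integer overflow somewhere in the index arithmetic (the actual root cause was addressed by SPARK-13355; this is only an illustration of the symptom). A small sketch of how Scala/Java `Int` arithmetic silently wraps:

```python
def to_int32(n):
    # Wrap an unbounded Python int to a signed 32-bit value, the way
    # Scala/Java Int arithmetic overflows without raising an error.
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n
```

Any intermediate index computed past 2^31 - 1 wraps to a negative number, which would explain term IDs like -2087809139 next to legitimate small IDs like 617.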
[jira] [Created] (SPARK-18260) from_json can throw a better exception when it can't find the column or be nullSafe
Burak Yavuz created SPARK-18260: --- Summary: from_json can throw a better exception when it can't find the column or be nullSafe Key: SPARK-18260 URL: https://issues.apache.org/jira/browse/SPARK-18260 Project: Spark Issue Type: Bug Reporter: Burak Yavuz I got this exception: {code} SparkException: Job aborted due to stage failure: Task 0 in stage 13028.0 failed 4 times, most recent failure: Lost task 0.3 in stage 13028.0 (TID 74170, 10.0.138.84, executor 2): java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.JsonToStruct.eval(jsonExpressions.scala:490) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:71) at org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:211) at org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:210) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804) {code} This was because the column that I called `from_json` on didn't exist for all of my rows. 
Either from_json should be null-safe, or it should fail with a better error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12488) LDA describeTopics() Generates Invalid Term IDs
[ https://issues.apache.org/jira/browse/SPARK-12488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634683#comment-15634683 ] Joseph K. Bradley commented on SPARK-12488: --- I'm going to close this since it seems like it has been fixed by [SPARK-13355]. If anyone sees this in versions which include that patch, please report! Thanks all. > LDA describeTopics() Generates Invalid Term IDs > --- > > Key: SPARK-12488 > URL: https://issues.apache.org/jira/browse/SPARK-12488 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.5.2 >Reporter: Ilya Ganelin > > When running the LDA model, and using the describeTopics function, invalid > values appear in the termID list that is returned: > The below example generates 10 topics on a data set with a vocabulary of 685. > {code} > // Set LDA parameters > val numTopics = 10 > val lda = new LDA().setK(numTopics).setMaxIterations(10) > val ldaModel = lda.run(docTermVector) > val distModel = > ldaModel.asInstanceOf[org.apache.spark.mllib.clustering.DistributedLDAModel] > {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted.reverse > res40: Array[Int] = Array(2064860663, 2054149956, 1991041659, 1986948613, > 1962816105, 1858775243, 1842920256, 1799900935, 1792510791, 1792371944, > 1737877485, 1712816533, 1690397927, 1676379181, 1664181296, 1501782385, > 1274389076, 1260230987, 1226545007, 1213472080, 1068338788, 1050509279, > 714524034, 678227417, 678227086, 624763822, 624623852, 618552479, 616917682, > 551612860, 453929488, 371443786, 183302140, 58762039, 42599819, 9947563, 617, > 616, 615, 612, 603, 597, 596, 595, 594, 593, 592, 591, 590, 589, 588, 587, > 586, 585, 584, 583, 582, 581, 580, 579, 578, 577, 576, 575, 574, 573, 572, > 571, 570, 569, 568, 567, 566, 565, 564, 563, 562, 561, 560, 559, 558, 557, > 556, 555, 554, 553, 552, 551, 550, 549, 548, 547, 546, 545, 544, 543, 542, > 541, 540, 539, 538, 537, 536, 535, 534, 533, 532, 53... 
> {code} > {code} > scala> ldaModel.describeTopics()(0)._1.sorted > res41: Array[Int] = Array(-2087809139, -2001127319, -1979718998, -1833443915, > -1811530305, -1765302237, -1668096260, -1527422175, -1493838005, -1452770216, > -1452508395, -1452502074, -1452277147, -1451720206, -1450928740, -1450237612, > -1448730073, -1437852514, -1420883015, -1418557080, -1397997340, -1397995485, > -1397991169, -1374921919, -1360937376, -1360533511, -1320627329, -1314475604, > -1216400643, -1210734882, -1107065297, -1063529036, -1062984222, -1042985412, > -1009109620, -951707740, -894644371, -799531743, -627436045, -586317106, > -563544698, -326546674, -174108802, -155900771, -80887355, -78916591, > -26690004, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, > 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, > 38, 39, 40, 41, 42, 43, 44, 45, 4... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-18254. Resolution: Fixed Assignee: Eyal Farago (was: Davies Liu) > UDFs don't see aliased column names > --- > > Key: SPARK-18254 > URL: https://issues.apache.org/jira/browse/SPARK-18254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Eyal Farago > Labels: correctness > > Dunno if I'm misinterpreting something here, but this seems like a bug in how > UDFs work, or in how they interface with the optimizer. > Here's a basic reproduction. I'm using {{length_udf()}} just for > illustration; it could be any UDF that accesses fields that have been aliased. > {code} > import pyspark > from pyspark.sql import Row > from pyspark.sql.functions import udf, col, struct > def length(full_name): > # The non-aliased names, FIRST and LAST, show up here, instead of > # first_name and last_name. > # print(full_name) > return len(full_name.first_name) + len(full_name.last_name) > if __name__ == '__main__': > spark = ( > pyspark.sql.SparkSession.builder > .getOrCreate()) > length_udf = udf(length) > names = spark.createDataFrame([ > Row(FIRST='Nick', LAST='Chammas'), > Row(FIRST='Walter', LAST='Williams'), > ]) > names_cleaned = ( > names > .select( > col('FIRST').alias('first_name'), > col('LAST').alias('last_name'), > ) > .withColumn('full_name', struct('first_name', 'last_name')) > .select('full_name')) > # We see the schema we expect here. > names_cleaned.printSchema() > # However, here we get an AttributeError. length_udf() cannot > # find first_name or last_name. 
> (names_cleaned > .withColumn('length', length_udf('full_name')) > .show()) > {code} > When I run this I get a long stack trace, but the relevant portions seem to > be: > {code} > File ".../udf-alias.py", line 10, in length > return len(full_name.first_name) + len(full_name.last_name) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1502, in __getattr__ > raise AttributeError(item) > AttributeError: first_name > {code} > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1497, in __getattr__ > idx = self.__fields__.index(item) > ValueError: 'first_name' is not in list > {code} > Here are the relevant execution plans: > {code} > names_cleaned.explain() > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > {code} > (names_cleaned > .withColumn('length', length_udf('full_name')) > .explain()) > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10, pythonUDF0#21 AS length#17] > +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, > pythonUDF0#21] >+- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > It looks like from the second execution plan that {{BatchEvalPython}} somehow > gets the unaliased column names, whereas the {{Project}} right above it gets > the aliased names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
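[Editor's note] The mismatch in the report above can be mimicked without Spark: per the physical plan, `BatchEvalPython` hands the UDF a row built from the unaliased columns (FIRST, LAST), so attribute access via the aliased names fails. A hypothetical pure-Python stand-in (the `PhysicalRow` type is an illustration, not Spark's `Row` class):

```python
from collections import namedtuple

# Stand-in for the row the Python worker receives: field names come from
# the physical plan's unaliased columns, not the logical plan's aliases.
PhysicalRow = namedtuple("PhysicalRow", ["FIRST", "LAST"])

def length(full_name):
    # The reporter's UDF, which expects the aliased field names.
    return len(full_name.first_name) + len(full_name.last_name)
```

Calling `length(PhysicalRow("Nick", "Chammas"))` raises an `AttributeError` for `first_name`, matching the traceback in the issue.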
[jira] [Commented] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634605#comment-15634605 ] Ryan Blue commented on SPARK-18086: --- Yeah, I'll update the PR. > Regression: Hive variables no longer work in Spark 2.0 > -- > > Key: SPARK-18086 > URL: https://issues.apache.org/jira/browse/SPARK-18086 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue > > The behavior of variables in the SQL shell has changed from 1.6 to 2.0. > Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer > work. Queries that worked correctly in 1.6 will either fail or produce > unexpected results in 2.0 so I think this is a regression that should be > addressed. > Hive and Spark 1.6 work like this: > 1. Command-line args --hiveconf and --hivevar can be used to set session > properties. --hiveconf properties are added to the Hadoop Configuration. > 2. {{SET}} adds a Hive Configuration property, {{SET hivevar:=}} > adds a Hive var. > 3. Hive vars can be substituted into queries by name, and Configuration > properties can be substituted using {{hiveconf:name}}. > In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then > the value in SQLConf for the rest of the key is returned. SET adds properties > to the session config and (according to [a > comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28]) > the Hadoop configuration "during I/O". 
> {code:title=Hive and Spark 1.6.1 behavior} > [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2 > spark-sql> select "${hiveconf:test.conf}"; > 1 > spark-sql> select "${test.conf}"; > ${test.conf} > spark-sql> select "${hivevar:test.var}"; > 2 > spark-sql> select "${test.var}"; > 2 > spark-sql> set test.set=3; > SET test.set=3 > spark-sql> select "${test.set}" > "${test.set}" > spark-sql> select "${hivevar:test.set}" > "${hivevar:test.set}" > spark-sql> select "${hiveconf:test.set}" > 3 > spark-sql> set hivevar:test.setvar=4; > SET hivevar:test.setvar=4 > spark-sql> select "${hivevar:test.setvar}"; > 4 > spark-sql> select "${test.setvar}"; > 4 > {code} > {code:title=Spark 2.0.0 behavior} > [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2 > spark-sql> select "${hiveconf:test.conf}"; > 1 > spark-sql> select "${test.conf}"; > 1 > spark-sql> select "${hivevar:test.var}"; > ${hivevar:test.var} > spark-sql> select "${test.var}"; > ${test.var} > spark-sql> set test.set=3; > test.set3 > spark-sql> select "${test.set}"; > 3 > spark-sql> set hivevar:test.setvar=4; > hivevar:test.setvar 4 > spark-sql> select "${hivevar:test.setvar}"; > 4 > spark-sql> select "${test.setvar}"; > ${test.setvar} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
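[Editor's note] The 1.6-era resolution rules shown in the first transcript can be sketched in a few lines of Python: `${hiveconf:k}` reads the Hadoop conf, while `${hivevar:k}` and bare `${k}` read the Hive variables, and unresolved references are left verbatim. This is an approximation for illustration only, not the actual SQL shell code:

```python
import re

def substitute(sql, hiveconf, hivevar):
    # Resolve ${...} references against the two maps; leave unknown
    # references untouched, as the 1.6 shell transcript shows.
    def resolve(match):
        ref = match.group(1)
        if ref.startswith("hiveconf:"):
            return hiveconf.get(ref[len("hiveconf:"):], match.group(0))
        if ref.startswith("hivevar:"):
            return hivevar.get(ref[len("hivevar:"):], match.group(0))
        return hivevar.get(ref, match.group(0))  # bare names read hivevar
    return re.sub(r"\$\{([^}]+)\}", resolve, sql)
```

Spark 2.0's behavior differs exactly where this sketch and the second transcript disagree: bare names resolve against the merged SQLConf, and the `hivevar:` namespace is lost.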
[jira] [Assigned] (SPARK-18259) QueryExecution should not catch Throwable
[ https://issues.apache.org/jira/browse/SPARK-18259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18259: Assignee: Apache Spark (was: Herman van Hovell) > QueryExecution should not catch Throwable > - > > Key: SPARK-18259 > URL: https://issues.apache.org/jira/browse/SPARK-18259 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Herman van Hovell >Assignee: Apache Spark >Priority: Minor > > QueryExecution currently captures Throwable. This is far from a best > practice. It would be better if we'd catch AnalysisExceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18259) QueryExecution should not catch Throwable
[ https://issues.apache.org/jira/browse/SPARK-18259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18259: Assignee: Herman van Hovell (was: Apache Spark) > QueryExecution should not catch Throwable > - > > Key: SPARK-18259 > URL: https://issues.apache.org/jira/browse/SPARK-18259 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > > QueryExecution currently captures Throwable. This is far from a best > practice. It would be better if we'd catch AnalysisExceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18259) QueryExecution should not catch Throwable
[ https://issues.apache.org/jira/browse/SPARK-18259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634573#comment-15634573 ] Apache Spark commented on SPARK-18259: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/15760 > QueryExecution should not catch Throwable > - > > Key: SPARK-18259 > URL: https://issues.apache.org/jira/browse/SPARK-18259 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > > QueryExecution currently captures Throwable. This is far from a best > practice. It would be better if we'd catch AnalysisExceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634568#comment-15634568 ] holdenk commented on SPARK-15581: - These sound like really good suggestions - I think some of the biggest challenges contributors face are on the review/committer side rather than on the actual code changes, so anything we can do to make that process simpler should be considered. I have sort of a mixed view on the umbrella JIRAs; maybe just being clearer in umbrella JIRAs about which sub-features are nice-to-have versus must-have would let us keep this organization? Splitting the roadmap into two parts also sounds reasonable and will hopefully lead to less bouncing of issues between roadmap versions. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.1.0 > > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delays in code review. 
> * Never work silently. Let everyone know on the corresponding JIRA page when > you start working on some feature. This is to avoid duplicate work. For > small features, you don't need to wait to get the JIRA assigned. > * For medium/big features or features with dependencies, please get assigned > first before coding and keep the ETA updated on the JIRA. If there is no > activity on the JIRA page for a certain amount of time, the JIRA should be > released to other contributors. > * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one > after another. > * Remember to add the `@Since("VERSION")` annotation to new public APIs. > * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code > review greatly helps to improve others' code as well as yours. > h2. For committers: > * Try to break down big features into small and specific JIRA tasks and link > them properly. > * Add a "starter" label to starter tasks. > * Put a rough estimate for medium/big features and track the progress. > * If you start reviewing a PR, please add yourself to the Shepherd field on > JIRA. > * If the code looks good to you, please comment "LGTM". For non-trivial PRs, > please ping a maintainer to make a final pass. > * After merging a PR, create and link JIRAs for Python, example code, and > documentation if applicable. > h1. Roadmap (*WIP*) > This is NOT [a complete list of MLlib JIRAs for 2.1| > https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. > We only include umbrella JIRAs and high-level tasks. 
> Major efforts in this release: > * Feature parity for the DataFrames-based API (`spark.ml`), relative to the > RDD-based API > * ML persistence > * Python API feature parity and test coverage > * R API expansion and improvements > * Note about new features: As usual, we expect to expand the feature set of > MLlib. However, we will prioritize API parity, bug fixes, and improvements > over new features. > Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for > it, but new features, APIs, and improvements will only be added to `spark.ml`. > h2. Critical feature parity in DataFrame-based API > * Umbrella JIRA: [SPARK-4591] > h2. Persistence > * Complete persistence within MLlib > ** Python tuning (SPARK-13786) > * MLlib in R format: compatibility with other languages (SPARK-15572) > * Impose backwards compatibility for persistence (SPARK-15573) > h2. Python API > * Standardize unit tests for Scala and Python to
[jira] [Resolved] (SPARK-18257) Improve error reporting for FileStressSuite in streaming
[ https://issues.apache.org/jira/browse/SPARK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18257. - Resolution: Fixed Fix Version/s: 2.1.0 > Improve error reporting for FileStressSuite in streaming > > > Key: SPARK-18257 > URL: https://issues.apache.org/jira/browse/SPARK-18257 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > FileStressSuite doesn't report errors when they occur. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634455#comment-15634455 ] Nicholas Chammas commented on SPARK-18254: -- So it was specifically some broken interaction between structs and aliases, I guess? Anyway, glad it's been fixed. > UDFs don't see aliased column names > --- > > Key: SPARK-18254 > URL: https://issues.apache.org/jira/browse/SPARK-18254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Davies Liu > Labels: correctness > > Dunno if I'm misinterpreting something here, but this seems like a bug in how > UDFs work, or in how they interface with the optimizer. > Here's a basic reproduction. I'm using {{length_udf()}} just for > illustration; it could be any UDF that accesses fields that have been aliased. > {code} > import pyspark > from pyspark.sql import Row > from pyspark.sql.functions import udf, col, struct > def length(full_name): > # The non-aliased names, FIRST and LAST, show up here, instead of > # first_name and last_name. > # print(full_name) > return len(full_name.first_name) + len(full_name.last_name) > if __name__ == '__main__': > spark = ( > pyspark.sql.SparkSession.builder > .getOrCreate()) > length_udf = udf(length) > names = spark.createDataFrame([ > Row(FIRST='Nick', LAST='Chammas'), > Row(FIRST='Walter', LAST='Williams'), > ]) > names_cleaned = ( > names > .select( > col('FIRST').alias('first_name'), > col('LAST').alias('last_name'), > ) > .withColumn('full_name', struct('first_name', 'last_name')) > .select('full_name')) > # We see the schema we expect here. > names_cleaned.printSchema() > # However, here we get an AttributeError. length_udf() cannot > # find first_name or last_name. 
> (names_cleaned > .withColumn('length', length_udf('full_name')) > .show()) > {code} > When I run this I get a long stack trace, but the relevant portions seem to > be: > {code} > File ".../udf-alias.py", line 10, in length > return len(full_name.first_name) + len(full_name.last_name) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1502, in __getattr__ > raise AttributeError(item) > AttributeError: first_name > {code} > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1497, in __getattr__ > idx = self.__fields__.index(item) > ValueError: 'first_name' is not in list > {code} > Here are the relevant execution plans: > {code} > names_cleaned.explain() > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > {code} > (names_cleaned > .withColumn('length', length_udf('full_name')) > .explain()) > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10, pythonUDF0#21 AS length#17] > +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, > pythonUDF0#21] >+- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > It looks like from the second execution plan that {{BatchEvalPython}} somehow > gets the unaliased column names, whereas the {{Project}} right above it gets > the aliased names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
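The stack trace above shows the error surfacing in pyspark's `Row.__getattr__`, which resolves attribute names against the row's `__fields__` list. Because `BatchEvalPython` received the pre-alias names, `first_name` was simply not in that list. As a minimal, self-contained illustration of that failure mode (the `FakeRow` class below is a simplified stand-in written for this example, not pyspark's actual `Row` implementation):

```python
# Simplified stand-in for pyspark.sql.Row's name lookup (see
# pyspark/sql/types.py in the traceback above). NOT the real class;
# it only mimics the __fields__-based resolution that failed.

class FakeRow:
    def __init__(self, fields, values):
        # Bypass __getattr__/__setattr__ by writing to __dict__ directly.
        self.__dict__["__fields__"] = fields
        self.__dict__["_values"] = values

    def __getattr__(self, item):
        # Mirrors Row.__getattr__: look the name up in __fields__ and
        # raise AttributeError when it is absent.
        try:
            idx = self.__dict__["__fields__"].index(item)
        except ValueError:
            raise AttributeError(item)
        return self.__dict__["_values"][idx]

# Before the fix, the UDF received a row carrying the pre-alias names:
row = FakeRow(["FIRST", "LAST"], ["Nick", "Chammas"])
print(row.FIRST)        # the original name resolves fine
try:
    row.first_name      # the aliased name is not in __fields__
except AttributeError as e:
    print("AttributeError:", e)
```

This is why positional access (e.g. `full_name[0]`) kept working on affected versions while access by the aliased field name raised.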
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634432#comment-15634432 ] Herman van Hovell commented on SPARK-18254: --- We 'accidentally' fixed this yesterday with commit: https://github.com/apache/spark/commit/f151bd1af8a05d4b6c901ebe6ac0b51a4a1a20df
[jira] [Comment Edited] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391 ] Nicholas Chammas edited comment on SPARK-18254 at 11/3/16 9:58 PM: --- If I try branch-2.1 on commit {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the original repro case works. (And for the record, this is still broken on the latest commit in branch-2.0, {{dae1581d9461346511098dc83938939a0f930048}}, so the fix is 2.1+.) So while we're waiting for 2.1 to be released, is there a workaround you'd recommend, apart from {{persist()}}? was (Author: nchammas): If I try branch-2.1 on commit {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the original repro case works. So while we're waiting for 2.1 to be released, is there a workaround you'd recommend, apart from {{persist()}}?
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634428#comment-15634428 ] Nicholas Chammas commented on SPARK-18254: -- Just tried it. Seems like the fix is only available in 2.1+.
[jira] [Created] (SPARK-18259) QueryExecution should not catch Throwable
Herman van Hovell created SPARK-18259: - Summary: QueryExecution should not catch Throwable Key: SPARK-18259 URL: https://issues.apache.org/jira/browse/SPARK-18259 Project: Spark Issue Type: Bug Components: SQL Reporter: Herman van Hovell Assignee: Herman van Hovell Priority: Minor QueryExecution currently captures Throwable. This is far from a best practice. It would be better if we'd catch AnalysisExceptions.
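The concern in SPARK-18259 is a general one: catching the broadest throwable type repackages fatal conditions (out-of-memory, interrupts) as ordinary analysis failures. QueryExecution itself is Scala, but the same anti-pattern is easy to sketch in Python, where `BaseException` plays the role of `Throwable` (all names below are hypothetical stand-ins invented for this illustration, not Spark APIs):

```python
# Illustration of the broad-vs-narrow catch distinction behind SPARK-18259.
# AnalysisError, assert_analyzed_broad, assert_analyzed_narrow, and
# fatal_condition are all hypothetical names for this sketch.

class AnalysisError(Exception):
    """Stand-in for an expected, recoverable analysis failure."""

def assert_analyzed_broad(check):
    # Anti-pattern, analogous to catching Throwable in Scala:
    # BaseException also swallows SystemExit, KeyboardInterrupt, and
    # MemoryError, repackaging fatal conditions as "analysis" failures.
    try:
        check()
    except BaseException as e:
        raise AnalysisError(e) from e

def assert_analyzed_narrow(check):
    # Preferred: catch only the failure class we can meaningfully
    # handle; fatal errors propagate untouched.
    try:
        check()
    except AnalysisError as e:
        raise AnalysisError(f"failed to analyze plan: {e}") from e

def fatal_condition():
    # Simulates a fatal, non-analysis error occurring during analysis.
    raise SystemExit(1)
```

With the broad version, `fatal_condition` surfaces as an `AnalysisError`, hiding the fatal cause; with the narrow version, `SystemExit` propagates as it should. (In Scala the usual middle ground is `scala.util.control.NonFatal`, which excludes truly fatal `Throwable`s from a catch.)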
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634416#comment-15634416 ] Davies Liu commented on SPARK-18254: Could you also try 2.0.2?
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391 ] Nicholas Chammas commented on SPARK-18254: -- If I try branch-2.1 on {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the original repro case works. So while we're waiting for 2.1 to be released, is there a workaround you'd recommend, apart from {{persist()}}?
[jira] [Comment Edited] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634391#comment-15634391 ] Nicholas Chammas edited comment on SPARK-18254 at 11/3/16 9:46 PM: --- If I try branch-2.1 on commit {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the original repro case works. So while we're waiting for 2.1 to be released, is there a workaround you'd recommend, apart from {{persist()}}? was (Author: nchammas): If I try branch-2.1 on {{569f77a11819523bdf5dc2c6429fc3399cbb6519}}, the original repro case works. So while we're waiting for 2.1 to be released, is there a workaround you'd recommend, apart from {{persist()}}?
[jira] [Updated] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
[ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18212: - Assignee: Cody Koeninger > Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from > specific offsets > --- > > Key: SPARK-18212 > URL: https://issues.apache.org/jira/browse/SPARK-18212 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Davies Liu >Assignee: Cody Koeninger > Fix For: 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/
[jira] [Resolved] (SPARK-18212) Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from specific offsets
[ https://issues.apache.org/jira/browse/SPARK-18212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-18212. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15737 [https://github.com/apache/spark/pull/15737] > Flaky test: org.apache.spark.sql.kafka010.KafkaSourceSuite.assign from > specific offsets > --- > > Key: SPARK-18212 > URL: https://issues.apache.org/jira/browse/SPARK-18212 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Davies Liu > Fix For: 2.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1968/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets/
[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634354#comment-15634354 ] Seth Hendrickson commented on SPARK-15581: -- I think the points you mention are very important to get right moving forward. We can certainly debate about what should go on the roadmap, but regardless I think it would be helpful to maintain a specific subset of JIRAs that we expect to get done for the next release cycle. Particularly: - We should maintain a list of items that we WILL get done for the next release, and we should deliver on nearly every one, barring unforeseen circumstances. If we don't get some of the items done, we should understand why and adjust accordingly until we can reach a list of items that we can consistently deliver on. - The list of items should be small and targeted, and should take into account things like committer/reviewer bandwidth. MLlib does not have a ton of active committers right now, like SQL might have, and the roadmap should reflect that. We need to be realistic. - We should make every effort to be as specific as possible. Linking to umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. Some of the umbrella tickets contain items that are longer term or have little interest (nice-to-haves), but realistically won't get implemented (in a timely manner). For example, I looked at the tree umbrellas and I see some items that are high priority and can be done in one release cycle, but also other items that have been around for a long time and seem to have little interest. The list should contain only the items that we expect to get done. -As you say, every item should have a committer linked to it that is capable of merging it. They do not have to be the primary reviewer, but they should have sufficient expertise such that they feel comfortable merging it after it has been appropriately reviewed. 
One interesting example to be wary of is that there seem to be a LOT of tree-related items on the roadmap, but Joseph has traditionally been the only (at least the main) committer involved in tree-related JIRAs. I don't think it's realistic to target all of these tree improvements when we have limited committers available to review/merge them. We can trim them down to a realistic subset. I propose a revised roadmap that contains two classifications of items: 1. JIRAs that will be done by the next release 2. JIRAs that will be done at some point before the next major release (e.g. 3.0) JIRAs that are still up for debate (e.g. adding a factorization machine) should not be on the roadmap. That does not mean they will not get done, but they are not necessarily "planned" for any particular timeframe. IMO this revised roadmap can/will provide a lot more transparency, and appropriately set review expectations. If it's on the list of "will do by next minor release," then contributors should expect it to be reviewed. What does everyone else think? Also, I took a bit of time to aggregate lists of specific JIRAs that I think fit into the two categories I listed above [here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing] (note: does not contain SparkR items). I am not (necessarily) proposing to move the list to this google doc, and I understand this is still undergoing discussion. I just wanted to provide an example of what the above might look like. > MLlib 2.1 Roadmap > - > > Key: SPARK-15581 > URL: https://issues.apache.org/jira/browse/SPARK-15581 > Project: Spark > Issue Type: Umbrella > Components: ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > Labels: roadmap > Fix For: 2.1.0 > > > This is a master list for MLlib improvements we are working on for the next > release. Please view this as a wish list rather than a definite plan, for we > don't have an accurate estimate of available resources. 
Due to limited review > bandwidth, features appearing on this list will get higher priority during > code review. But feel free to suggest new items to the list in comments. We > are experimenting with this process. Your feedback would be greatly > appreciated. > h1. Instructions > h2. For contributors: > * Please read > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > carefully. Code style, documentation, and unit tests are important. > * If you are a first-time Spark contributor, please always start with a > [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather > than a medium/big feature. Based on our experience, mixing the development > process with a big feature usually causes long delay in code review. > * Never work silently. Let everyone know on the corresponding JIRA page when >
[jira] [Comment Edited] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634354#comment-15634354 ] Seth Hendrickson edited comment on SPARK-15581 at 11/3/16 9:28 PM: --- I think the points you mention are very important to get right moving forward. We can certainly debate about what should go on the roadmap, but regardless I think it would be helpful to maintain a specific subset of JIRAs that we expect to get done for the next release cycle. Particularly: - We should maintain a list of items that we WILL get done for the next release, and we should deliver on nearly every one, barring unforeseen circumstances. If we don't get some of the items done, we should understand why and adjust accordingly until we can reach a list of items that we can consistently deliver on. - The list of items should be small and targeted, and should take into account things like committer/reviewer bandwidth. MLlib does not have a ton of active committers right now, like SQL might have, and the roadmap should reflect that. We need to be realistic. - We should make every effort to be as specific as possible. Linking to umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. Some of the umbrella tickets contain items that are longer term or have little interest (nice-to-haves), but realistically won't get implemented (in a timely manner). For example, I looked at the tree umbrellas and I see some items that are high priority and can be done in one release cycle, but also other items that have been around for a long time and seem to have little interest. The list should contain only the items that we expect to get done. - As you say, every item should have a committer linked to it that is capable of merging it. They do not have to be the primary reviewer, but they should have sufficient expertise such that they feel comfortable merging it after it has been appropriately reviewed. 
One interesting example to be wary of is that there seem to be a LOT of tree related items on the roadmap, but Joseph has traditionally been the only (at least the main) committer involved in tree-related JIRAs. I don't think it's realistic to target all of these tree improvements when we have limited committers available to review/merge them. We can trim them down to a realistic subset. I propose a revised roadmap that contains two classifications of items: 1. JIRAs that will be done by the next release 2. JIRAs that will be done at some point before the next major release (e.g. 3.0) JIRAs that are still up for debate (e.g. adding a factorization machine) should not be on the roadmap. That does not mean they will not get done, but they are not necessarily "planned" for any particular timeframe. IMO this revised roadmap can/will provide a lot more transparency, and appropriately set review expectations. If it's on the list of "will do by next minor release," then contributors should expect it to be reviewed. What does everyone else think? Also, I took a bit of time to aggregate lists of specific JIRAs that I think fit into the two categories I listed above [here|https://docs.google.com/spreadsheets/d/1nNvbGoarRvhsMkYaFiU6midyHrndPBYQTcKKNOF5xcs/edit?usp=sharing] (note: does not contain SparkR items). I am not (necessarily) proposing to move the list to this google doc, and I understand this is still undergoing discussion. I just wanted to provide an example of what the above might look like. was (Author: sethah): I think the points you mention are very important to get right moving forward. We can certainly debate about what should go on the roadmap, but regardless I think it would be helpful to maintain a specific subset of JIRAs that we expect to get done for the next release cycle. Particularly: - We should maintain a list of items that we WILL get done for the next release, and we should deliver on nearly every one, barring unforeseen circumstances. 
If we don't get some of the items done, we should understand why and adjust accordingly until we can reach a list of items that we can consistently deliver on. - The list of items should be small and targeted, and should take into account things like committer/reviewer bandwidth. MLlib does not have a ton of active committers right now, like SQL might have, and the roadmap should reflect that. We need to be realistic. - We should make every effort to be as specific as possible. Linking to umbrella JIRAs hurts us IMO, and we'd be better off listing specific JIRAs. Some of the umbrella tickets contain items that are longer term or have little interest (nice-to-haves), but realistically won't get implemented (in a timely manner). For example, I looked at the tree umbrellas and I see some items that are high priority and can be done in one release cycle, but also other items that have been around for a
[jira] [Commented] (SPARK-15798) Secondary sort in Dataset/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634352#comment-15634352 ] koert kuipers commented on SPARK-15798: --- Looking at the code for the Window operators, it seems to me the basic operators for secondary sort must already be present for Dataset, since to do Window operations efficiently you need it. So this is good news. It just needs to be exposed in a more generic way than the highly specific Window operators. > Secondary sort in Dataset/DataFrame > --- > > Key: SPARK-15798 > URL: https://issues.apache.org/jira/browse/SPARK-15798 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: koert kuipers > > Secondary sort for Spark RDDs was discussed in > https://issues.apache.org/jira/browse/SPARK-3655 > Since the RDD API allows for easy extensions outside the core library, this > was implemented separately here: > https://github.com/tresata/spark-sorted > However, it seems to me that with Dataset an implementation of such a feature > in a 3rd-party library is not really an option. > Dataset already has methods that suggest a secondary sort is present, such as > in KeyValueGroupedDataset: > {noformat} > def flatMapGroups[U : Encoder](f: (K, Iterator[V]) => TraversableOnce[U]): > Dataset[U] > {noformat} > This operation pushes all the data to the reducer, something you would only > want to do if you need the elements in a particular order. > How about adding sortBy methods to KeyValueGroupedDataset and > RelationalGroupedDataset? > {noformat} > dataFrame.groupBy("a").sortBy("b").fold(...) > {noformat} > (yes, I know RelationalGroupedDataset doesn't have a fold yet... but it should > :)) > {noformat} > dataset.groupBy(_._1).sortBy(_._3).flatMapGroups(...) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
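The grouped-then-sorted semantics proposed above (a sortBy on grouped data before flatMapGroups) can be sketched without Spark. The function name and list-based API below are purely illustrative assumptions; the sketch only shows the guarantee a secondary sort would provide, namely that each key's values reach the group function already ordered:

```python
from itertools import groupby

def flat_map_groups_sorted(records, key, sort_key, f):
    """Group records by `key`, order each group by `sort_key`, then
    apply `f` to (key, iterator-of-records), mimicking the proposed
    groupBy(...).sortBy(...).flatMapGroups(...) chain."""
    out = []
    # Sorting by the composite (key, sort_key) mimics a shuffle that
    # also orders values within each key -- the "secondary sort".
    ordered = sorted(records, key=lambda r: (key(r), sort_key(r)))
    for k, group in groupby(ordered, key=key):
        out.extend(f(k, group))
    return out

rows = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]
# Emit each key's values in ascending order of the second field.
result = flat_map_groups_sorted(
    rows,
    key=lambda r: r[0],
    sort_key=lambda r: r[1],
    f=lambda k, vs: [(k, [v[1] for v in vs])],
)
```

A real implementation would get this ordering from the shuffle (sorting within partitions by a composite key) rather than from a global sort, which is what makes secondary sort cheap.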
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634342#comment-15634342 ] Saikat Kanjilal commented on SPARK-9487: Added local[4] to repl, sparksql, and streaming; all tests pass. Pull request is here: https://github.com/apache/spark/compare/master...skanjila:spark-9487 > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., a random number > generator, this will lead to different results in Python and Scala/Java. It > would be nice to use the same number in all unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
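The partition-ID dependence the issue describes can be sketched in plain Python. The round-robin split and per-partition seeding below are illustrative assumptions, standing in for mapPartitionsWithIndex-style code that seeds a random number generator with the partition index, which is what makes local[2] and local[4] produce different results on identical input:

```python
import random

def per_partition_values(data, num_partitions):
    """Split data round-robin into partitions, then seed an RNG with
    each partition's index (as partition-aware code often does).
    The output depends on num_partitions, which is the mismatch the
    issue describes between local[2] and local[4] test configs."""
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    results = []
    for idx, part in enumerate(partitions):
        rng = random.Random(idx)  # seed tied to partition ID
        results.extend(x + rng.randint(0, 9) for x in part)
    return sorted(results)

data = list(range(8))
two_workers = per_partition_values(data, 2)   # like Scala's local[2]
four_workers = per_partition_values(data, 4)  # like Python's local[4]
# The two runs generally disagree even though the input is identical.
differs = (two_workers != four_workers)
```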
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634333#comment-15634333 ] Davies Liu commented on SPARK-18254: I tried the following in master (2.1); it works {code} from pyspark.sql import SparkSession from pyspark.sql.functions import udf, col, struct from pyspark.sql.types import IntegerType spark = SparkSession.builder.getOrCreate() myadd = udf(lambda s: s.a + s.b, IntegerType()) df = spark.range(10).selectExpr("id as a", "id as b")\ .select(struct(col("a"), col("b")).alias('s')) df = df.select(df.s, myadd(df.s).alias("a")) df.explain(True) rs = df.collect() {code} [~nchammas] Could you also try yours on master? > UDFs don't see aliased column names > --- > > Key: SPARK-18254 > URL: https://issues.apache.org/jira/browse/SPARK-18254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Davies Liu > Labels: correctness > > Dunno if I'm misinterpreting something here, but this seems like a bug in how > UDFs work, or in how they interface with the optimizer. > Here's a basic reproduction. I'm using {{length_udf()}} just for > illustration; it could be any UDF that accesses fields that have been aliased. > {code} > import pyspark > from pyspark.sql import Row > from pyspark.sql.functions import udf, col, struct > def length(full_name): > # The non-aliased names, FIRST and LAST, show up here, instead of > # first_name and last_name. > # print(full_name) > return len(full_name.first_name) + len(full_name.last_name) > if __name__ == '__main__': > spark = ( > pyspark.sql.SparkSession.builder > .getOrCreate()) > length_udf = udf(length) > names = spark.createDataFrame([ > Row(FIRST='Nick', LAST='Chammas'), > Row(FIRST='Walter', LAST='Williams'), > ]) > names_cleaned = ( > names > .select( > col('FIRST').alias('first_name'), > col('LAST').alias('last_name'), > ) > .withColumn('full_name', struct('first_name', 'last_name')) > .select('full_name')) > # We see the schema we expect here.
> names_cleaned.printSchema() > # However, here we get an AttributeError. length_udf() cannot > # find first_name or last_name. > (names_cleaned > .withColumn('length', length_udf('full_name')) > .show()) > {code} > When I run this I get a long stack trace, but the relevant portions seem to > be: > {code} > File ".../udf-alias.py", line 10, in length > return len(full_name.first_name) + len(full_name.last_name) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1502, in __getattr__ > raise AttributeError(item) > AttributeError: first_name > {code} > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1497, in __getattr__ > idx = self.__fields__.index(item) > ValueError: 'first_name' is not in list > {code} > Here are the relevant execution plans: > {code} > names_cleaned.explain() > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > {code} > (names_cleaned > .withColumn('length', length_udf('full_name')) > .explain()) > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10, pythonUDF0#21 AS length#17] > +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, > pythonUDF0#21] >+- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > It looks like from the second execution plan that {{BatchEvalPython}} somehow > gets the unaliased column names, whereas the {{Project}} right above it gets > the aliased names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
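The AttributeError in the traceback above comes from PySpark's Row attribute lookup: a Row keeps its column names in a __fields__ list and resolves attribute access by looking the name up in that list. MiniRow below is not the real pyspark.sql.Row, just a minimal stand-in for that lookup, showing why a Row built with the pre-alias names FIRST/LAST cannot answer for first_name:

```python
class MiniRow(tuple):
    """Sketch of pyspark.sql.Row's lookup: values live in a tuple,
    names live in __fields__, and __getattr__ maps a name to its
    index in that list."""
    def __new__(cls, fields, values):
        row = super().__new__(cls, values)
        row.__fields__ = list(fields)
        return row

    def __getattr__(self, item):
        try:
            idx = self.__fields__.index(item)   # ValueError if absent
        except ValueError:
            raise AttributeError(item)          # what the traceback shows
        return self[idx]

# The physical plan passed struct(FIRST#0, LAST#1) to the UDF, so the
# Row the UDF receives carries the unaliased names:
row = MiniRow(["FIRST", "LAST"], ["Nick", "Chammas"])
ok = row.FIRST                # resolves: "FIRST" is in __fields__
try:
    row.first_name            # the aliased name is absent
    failed = False
except AttributeError:
    failed = True
```

Because BatchEvalPython received the unaliased struct, full_name.first_name inside the UDF fails exactly as in the stack trace, even though the Project above it shows the aliased names.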
[jira] [Resolved] (SPARK-18099) Spark distributed cache should throw exception if same file is specified to dropped in --files --archives
[ https://issues.apache.org/jira/browse/SPARK-18099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-18099. --- Resolution: Fixed Assignee: Kishor Patil Fix Version/s: 2.2.0 2.1.0 > Spark distributed cache should throw exception if same file is specified to > dropped in --files --archives > - > > Key: SPARK-18099 > URL: https://issues.apache.org/jira/browse/SPARK-18099 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1 >Reporter: Kishor Patil >Assignee: Kishor Patil > Fix For: 2.1.0, 2.2.0 > > > Recently, for the changes to [SPARK-14423] Handle jar conflict issue when > uploading to distributed cache > If by default yarn#client will upload all the --files and --archives in > assembly to HDFS staging folder. It should throw if file appears in both > --files and --archives exception to know whether uncompress or leave the file > compressed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-18086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634172#comment-15634172 ] Reynold Xin commented on SPARK-18086: - [~rdblue] Does my explanation make sense? Can you change the pr (or have a new pr) to just do the command line argument so we can get that into 2.1? > Regression: Hive variables no longer work in Spark 2.0 > -- > > Key: SPARK-18086 > URL: https://issues.apache.org/jira/browse/SPARK-18086 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ryan Blue > > The behavior of variables in the SQL shell has changed from 1.6 to 2.0. > Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer > work. Queries that worked correctly in 1.6 will either fail or produce > unexpected results in 2.0 so I think this is a regression that should be > addressed. > Hive and Spark 1.6 work like this: > 1. Command-line args --hiveconf and --hivevar can be used to set session > properties. --hiveconf properties are added to the Hadoop Configuration. > 2. {{SET}} adds a Hive Configuration property, {{SET hivevar:=}} > adds a Hive var. > 3. Hive vars can be substituted into queries by name, and Configuration > properties can be substituted using {{hiveconf:name}}. > In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then > the value in SQLConf for the rest of the key is returned. SET adds properties > to the session config and (according to [a > comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28]) > the Hadoop configuration "during I/O". 
> {code:title=Hive and Spark 1.6.1 behavior} > [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2 > spark-sql> select "${hiveconf:test.conf}"; > 1 > spark-sql> select "${test.conf}"; > ${test.conf} > spark-sql> select "${hivevar:test.var}"; > 2 > spark-sql> select "${test.var}"; > 2 > spark-sql> set test.set=3; > SET test.set=3 > spark-sql> select "${test.set}" > "${test.set}" > spark-sql> select "${hivevar:test.set}" > "${hivevar:test.set}" > spark-sql> select "${hiveconf:test.set}" > 3 > spark-sql> set hivevar:test.setvar=4; > SET hivevar:test.setvar=4 > spark-sql> select "${hivevar:test.setvar}"; > 4 > spark-sql> select "${test.setvar}"; > 4 > {code} > {code:title=Spark 2.0.0 behavior} > [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2 > spark-sql> select "${hiveconf:test.conf}"; > 1 > spark-sql> select "${test.conf}"; > 1 > spark-sql> select "${hivevar:test.var}"; > ${hivevar:test.var} > spark-sql> select "${test.var}"; > ${test.var} > spark-sql> set test.set=3; > test.set 3 > spark-sql> select "${test.set}"; > 3 > spark-sql> set hivevar:test.setvar=4; > hivevar:test.setvar 4 > spark-sql> select "${hivevar:test.setvar}"; > 4 > spark-sql> select "${test.setvar}"; > ${test.setvar} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
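The 1.6 transcript above can be summarized with a toy substituter (plain Python; this is not Hive's implementation, just a model of the resolution rules the transcript shows: ${hiveconf:x} reads the Hadoop conf, ${hivevar:x} reads the vars, and a bare ${x} falls back to the vars only):

```python
import re

def substitute(text, hiveconf, hivevar):
    """Toy model of Hive/Spark 1.6-style variable substitution:
    ${hiveconf:x} -> hiveconf namespace, ${hivevar:x} -> hivevar
    namespace, bare ${x} -> hivevar only; unknown references are
    left untouched, as in the transcript."""
    def repl(m):
        prefix, name = m.group(1), m.group(2)
        if prefix == "hiveconf:":
            src = hiveconf
        elif prefix == "hivevar:":
            src = hivevar
        else:
            src = hivevar            # bare names check hivevars only
        return src.get(name, m.group(0))
    return re.sub(r"\$\{(hiveconf:|hivevar:)?([\w.]+)\}", repl, text)

conf = {"test.conf": "1"}
var = {"test.var": "2"}
a = substitute('select "${hiveconf:test.conf}"', conf, var)  # resolves
b = substitute('select "${test.conf}"', conf, var)           # left as-is
c = substitute('select "${test.var}"', conf, var)            # resolves
```

Under this model "${test.conf}" stays unresolved while "${test.var}" resolves, matching the first transcript; 2.0's behavior corresponds roughly to collapsing both namespaces into the session conf, which is the regression being reported.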
[jira] [Commented] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist
[ https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634257#comment-15634257 ] yuhao yang commented on SPARK-18230: Sorry, I got a little confused between the different recommend methods. You're right. > MatrixFactorizationModel.recommendProducts throws NoSuchElement exception > when the user does not exist > -- > > Key: SPARK-18230 > URL: https://issues.apache.org/jira/browse/SPARK-18230 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.1 >Reporter: Mikael Ståldal >Priority: Minor > > When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a > non-existent user, a {{java.util.NoSuchElementException}} is thrown: > {code} > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35) > at > org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169) > {code} > It would be nice if it returned an empty array or threw a more specific > exception, and if that behavior were documented in the ScalaDoc for the method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634228#comment-15634228 ] Davies Liu commented on SPARK-18254: I suspect it's a bug in ExtractPythonUDFs rather than in operator push down; not verified yet. > UDFs don't see aliased column names > --- > > Key: SPARK-18254 > URL: https://issues.apache.org/jira/browse/SPARK-18254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Davies Liu > Labels: correctness > > Dunno if I'm misinterpreting something here, but this seems like a bug in how > UDFs work, or in how they interface with the optimizer. > Here's a basic reproduction. I'm using {{length_udf()}} just for > illustration; it could be any UDF that accesses fields that have been aliased. > {code} > import pyspark > from pyspark.sql import Row > from pyspark.sql.functions import udf, col, struct > def length(full_name): > # The non-aliased names, FIRST and LAST, show up here, instead of > # first_name and last_name. > # print(full_name) > return len(full_name.first_name) + len(full_name.last_name) > if __name__ == '__main__': > spark = ( > pyspark.sql.SparkSession.builder > .getOrCreate()) > length_udf = udf(length) > names = spark.createDataFrame([ > Row(FIRST='Nick', LAST='Chammas'), > Row(FIRST='Walter', LAST='Williams'), > ]) > names_cleaned = ( > names > .select( > col('FIRST').alias('first_name'), > col('LAST').alias('last_name'), > ) > .withColumn('full_name', struct('first_name', 'last_name')) > .select('full_name')) > # We see the schema we expect here. > names_cleaned.printSchema() > # However, here we get an AttributeError. length_udf() cannot > # find first_name or last_name.
> (names_cleaned > .withColumn('length', length_udf('full_name')) > .show()) > {code} > When I run this I get a long stack trace, but the relevant portions seem to > be: > {code} > File ".../udf-alias.py", line 10, in length > return len(full_name.first_name) + len(full_name.last_name) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1502, in __getattr__ > raise AttributeError(item) > AttributeError: first_name > {code} > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1497, in __getattr__ > idx = self.__fields__.index(item) > ValueError: 'first_name' is not in list > {code} > Here are the relevant execution plans: > {code} > names_cleaned.explain() > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > {code} > (names_cleaned > .withColumn('length', length_udf('full_name')) > .explain()) > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10, pythonUDF0#21 AS length#17] > +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, > pythonUDF0#21] >+- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > It looks like from the second execution plan that {{BatchEvalPython}} somehow > gets the unaliased column names, whereas the {{Project}} right above it gets > the aliased names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18210) Pipeline.copy does not create an instance with the same UID
[ https://issues.apache.org/jira/browse/SPARK-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634240#comment-15634240 ] Apache Spark commented on SPARK-18210: -- User 'wojtek-szymanski' has created a pull request for this issue: https://github.com/apache/spark/pull/15759 > Pipeline.copy does not create an instance with the same UID > --- > > Key: SPARK-18210 > URL: https://issues.apache.org/jira/browse/SPARK-18210 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: Wojciech Szymanski >Priority: Minor > > org.apache.spark.ml.Pipeline.copy(extra: ParamMap) does not create an > instance with the same UID. > It does not conform to the method specification from its base class > org.apache.spark.ml.param.Params.copy(extra: ParamMap) > The following commit contains: > - fix for Pipeline UID > - missing tests for Pipeline.copy > - minor improvements in tests for PipelineModel.copy > https://github.com/apache/spark/commit/8764e36f9764f89343a3e0fe6eff05cb41bb36cf > Let me know if you are fine with these changes, so I will open a new PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18210) Pipeline.copy does not create an instance with the same UID
[ https://issues.apache.org/jira/browse/SPARK-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18210: Assignee: (was: Apache Spark) > Pipeline.copy does not create an instance with the same UID > --- > > Key: SPARK-18210 > URL: https://issues.apache.org/jira/browse/SPARK-18210 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: Wojciech Szymanski >Priority: Minor > > org.apache.spark.ml.Pipeline.copy(extra: ParamMap) does not create an > instance with the same UID. > It does not conform to the method specification from its base class > org.apache.spark.ml.param.Params.copy(extra: ParamMap) > The following commit contains: > - fix for Pipeline UID > - missing tests for Pipeline.copy > - minor improvements in tests for PipelineModel.copy > https://github.com/apache/spark/commit/8764e36f9764f89343a3e0fe6eff05cb41bb36cf > Let me know if you are fine with these changes, so I will open a new PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18210) Pipeline.copy does not create an instance with the same UID
[ https://issues.apache.org/jira/browse/SPARK-18210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18210: Assignee: Apache Spark > Pipeline.copy does not create an instance with the same UID > --- > > Key: SPARK-18210 > URL: https://issues.apache.org/jira/browse/SPARK-18210 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.1 >Reporter: Wojciech Szymanski >Assignee: Apache Spark >Priority: Minor > > org.apache.spark.ml.Pipeline.copy(extra: ParamMap) does not create an > instance with the same UID. > It does not conform to the method specification from its base class > org.apache.spark.ml.param.Params.copy(extra: ParamMap) > The following commit contains: > - fix for Pipeline UID > - missing tests for Pipeline.copy > - minor improvements in tests for PipelineModel.copy > https://github.com/apache/spark/commit/8764e36f9764f89343a3e0fe6eff05cb41bb36cf > Let me know if you are fine with these changes, so I will open a new PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18258) Sinks need access to offset representation
[ https://issues.apache.org/jira/browse/SPARK-18258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-18258: --- Description: Transactional "exactly-once" semantics for output require storing an offset identifier in the same transaction as results. The Sink.addBatch method currently only has access to batchId and data, not the actual offset representation. I want to store the actual offsets, so that they are recoverable as long as the results are and I'm not locked in to a particular streaming engine. I could see this being accomplished by adding parameters to Sink.addBatch for the starting and ending offsets (either the offsets themselves, or the SPARK-17829 string/json representation). That would be an API change, but if there's another way to map batch ids to offset representations without changing the Sink api that would work as well. I'm assuming we don't need the same level of access to offsets throughout a job as e.g. the Kafka dstream gives, because Sinks are the main place that should need them. was: Transactional "exactly-once" semantics for output require storing an offset identifier in the same transaction as results. The Sink.addBatch method currently only has access to batchId and data, not the actual offset representation. I want to store the actual offsets, so that they are recoverable as long as the results are and I'm not locked in to a particular streaming engine. I could see this being accomplished by adding parameters to Sink.addBatch for the starting and ending offsets (either the offsets themselves, or the SPARK-17829 string/json representation). That would be an API change, but if there's another way to map batch ids to offset representations without changing the Sink api that would work as well. I'm assuming we don't need the same level of access to offsets throughout a job as e.g. the Kafka dstream gives, because Sinks are the main place that should need them. 
> Sinks need access to offset representation > -- > > Key: SPARK-18258 > URL: https://issues.apache.org/jira/browse/SPARK-18258 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > Transactional "exactly-once" semantics for output require storing an offset > identifier in the same transaction as results. > The Sink.addBatch method currently only has access to batchId and data, not > the actual offset representation. > I want to store the actual offsets, so that they are recoverable as long as > the results are and I'm not locked in to a particular streaming engine. > I could see this being accomplished by adding parameters to Sink.addBatch for > the starting and ending offsets (either the offsets themselves, or the > SPARK-17829 string/json representation). That would be an API change, but if > there's another way to map batch ids to offset representations without > changing the Sink api that would work as well. > I'm assuming we don't need the same level of access to offsets throughout a > job as e.g. the Kafka dstream gives, because Sinks are the main place that > should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18258) Sinks need access to offset representation
Cody Koeninger created SPARK-18258: -- Summary: Sinks need access to offset representation Key: SPARK-18258 URL: https://issues.apache.org/jira/browse/SPARK-18258 Project: Spark Issue Type: Improvement Components: Structured Streaming Reporter: Cody Koeninger Transactional "exactly-once" semantics for output require storing an offset identifier in the same transaction as results. The Sink.addBatch method currently only has access to batchId and data, not the actual offset representation. I want to store the actual offsets, so that they are recoverable as long as the results are and I'm not locked in to a particular streaming engine. I could see this being accomplished by adding parameters to Sink.addBatch for the starting and ending offsets (either the offsets themselves, or the SPARK-17829 string/json representation). That would be an API change, but if there's another way to map batch ids to offset representations without changing the Sink api that would work as well. I'm assuming we don't need the same level of access to offsets throughout a job as e.g. the Kafka dstream gives, because Sinks are the main place that should need them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
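The transactional pattern the issue asks to enable — writing a batch's results and its ending offset in the same transaction, so that recovery resumes exactly where committed output ends — can be sketched with SQLite standing in for the downstream store. The table names, the JSON-ish offset string, and the addBatch-like signature below are illustrative assumptions, not the Sink API:

```python
import sqlite3

def add_batch(conn, batch_id, end_offset, rows):
    """Write results and the batch's ending offset atomically.
    If this batch was already committed, skip it, which yields
    exactly-once output when the engine replays batches."""
    cur = conn.execute(
        "SELECT end_offset FROM offsets WHERE batch_id = ?", (batch_id,))
    if cur.fetchone() is not None:
        return False  # already committed; replay is a no-op
    with conn:  # one transaction covers data + offset
        conn.executemany("INSERT INTO results(value) VALUES (?)",
                         [(r,) for r in rows])
        conn.execute("INSERT INTO offsets(batch_id, end_offset) VALUES (?, ?)",
                     (batch_id, end_offset))
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results(value TEXT)")
conn.execute("CREATE TABLE offsets(batch_id INTEGER PRIMARY KEY, end_offset TEXT)")

first = add_batch(conn, 0, '{"topic-0": 42}', ["a", "b"])
# Simulated crash/recovery: the engine re-delivers the same batch.
replay = add_batch(conn, 0, '{"topic-0": 42}', ["a", "b"])
count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

This is exactly why the issue wants the real offset representation (not just batchId) passed to Sink.addBatch: storing the engine-independent offsets alongside results is what makes them recoverable without lock-in.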
[jira] [Commented] (SPARK-18238) WARN Executor: 1 block locks were not released by TID
[ https://issues.apache.org/jira/browse/SPARK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634156#comment-15634156 ] Sean Owen commented on SPARK-18238: --- Can you say any more about how you make this occur? > WARN Executor: 1 block locks were not released by TID > - > > Key: SPARK-18238 > URL: https://issues.apache.org/jira/browse/SPARK-18238 > Project: Spark > Issue Type: Bug > Environment: 2.0.2 snapshot >Reporter: Harish >Priority: Minor > Labels: patch > > In Spark 2.0.2 / Hadoop 2.7, I am getting the below message. Not sure if this > is impacting my execution. > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30541: > [rdd_511_104] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30542: > [rdd_511_105] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30562: > [rdd_511_127] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30571: > [rdd_511_137] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30572: > [rdd_511_138] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30588: > [rdd_511_156] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30603: > [rdd_511_171] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30600: > [rdd_511_168] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30612: > [rdd_511_180] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30622: > [rdd_511_190] > 16/11/03 01:10:23 WARN Executor: 1 block locks were not released by TID = > 30629: > [rdd_511_197] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634143#comment-15634143 ] Jakob Odersky edited comment on SPARK-14222 at 11/3/16 8:33 PM: Thanks Sean, however I realized that the dependency is in fact not yet published for 2.12.0 final. The package I linked is from a different org. There's a ticket for a release here https://github.com/FasterXML/jackson-module-scala/pull/294 was (Author: jodersky): Thanks Sean, however I realized that the dependency is in fact not yet published for 2.12.0 final. The package I linked is from a different org, oops > Cross-publish jackson-module-scala for Scala 2.12 > - > > Key: SPARK-14222 > URL: https://issues.apache.org/jira/browse/SPARK-14222 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Josh Rosen >Assignee: Josh Rosen > > In order to build Spark against Scala 2.12, we need to either remove our > jackson-module-scala dependency or cross-publish Jackson for Scala 2.12. > Personally, I'd prefer to remove it because I don't think we make extensive > use of it and because I'm not a huge fan of the implicit mapping between case > classes and JSON wire formats (the extra verbosity required by other > approaches is a feature, IMO, rather than a bug because it makes it much > harder to accidentally break wire compatibility).
[jira] [Commented] (SPARK-18193) queueStream not updated if rddQueue.add after create queueStream in Java
[ https://issues.apache.org/jira/browse/SPARK-18193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634149#comment-15634149 ] Sean Owen commented on SPARK-18193: --- Oh I see, it's the opposite. The QueueStream example should be updated to match the JavaQueueStream example. Go ahead, yes. > queueStream not updated if rddQueue.add after create queueStream in Java > > > Key: SPARK-18193 > URL: https://issues.apache.org/jira/browse/SPARK-18193 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.1 >Reporter: Hubert Kang > > Within > examples\src\main\java\org\apache\spark\examples\streaming\JavaQueueStream.java, > no data is detected if the code below, which adds data to rddQueue, is > executed after the queueStream is created (line 65). > for (int i = 0; i < 30; i++) { > rddQueue.add(ssc.sparkContext().parallelize(list)); > }
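The reported symptom can be illustrated with a toy that has nothing to do with Spark itself: this sketch assumes (hypothetically, for illustration only) that the stream takes a snapshot of the queue's contents at construction time, in which case anything added afterwards is invisible to it, matching the behavior described above.

```python
from queue import Queue

# Toy, non-Spark illustration of the reported symptom: if a stream
# snapshots the queue's contents when it is constructed, items added
# to the queue afterwards are never seen by the stream.
class SnapshotStream:
    def __init__(self, q):
        with q.mutex:                      # safe read of the internal deque
            self.batches = list(q.queue)   # snapshot at creation time

rdd_queue = Queue()
rdd_queue.put([1, 2, 3])                   # added BEFORE stream creation
stream = SnapshotStream(rdd_queue)
rdd_queue.put([4, 5, 6])                   # added AFTER: not visible
print(len(stream.batches))  # 1
```

The fix suggested in the comment is the same in spirit: in the examples, populate the queue before creating the stream so both examples behave consistently.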
[jira] [Comment Edited] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634143#comment-15634143 ] Jakob Odersky edited comment on SPARK-14222 at 11/3/16 8:30 PM: Thanks Sean, however I realized that the dependency is in fact not yet published for 2.12.0 final. The package I linked is from a different org, oops was (Author: jodersky): Thanks Sean, however I realized that the dependency is in fact not yet published for 2.12.0 final. The package I linked is from a different org
[jira] [Commented] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634143#comment-15634143 ] Jakob Odersky commented on SPARK-14222: --- Thanks Sean, however I realized that the dependency is in fact not yet published for 2.12.0 final. The package I linked is from a different org
[jira] [Commented] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634128#comment-15634128 ] Sean Owen commented on SPARK-14222: --- Probably. The limiting factor is often run-time compatibility with whatever version Hadoop will put in the classpath. However I have been using Jackson 2.8.x + Spark 2 + Hadoop 2.6 for a while without issue.
[jira] [Updated] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18254: - Target Version/s: 2.1.0 > UDFs don't see aliased column names > --- > > Key: SPARK-18254 > URL: https://issues.apache.org/jira/browse/SPARK-18254 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: Python 3.5, Java 8 >Reporter: Nicholas Chammas >Assignee: Davies Liu > Labels: correctness > > Dunno if I'm misinterpreting something here, but this seems like a bug in how > UDFs work, or in how they interface with the optimizer. > Here's a basic reproduction. I'm using {{length_udf()}} just for > illustration; it could be any UDF that accesses fields that have been aliased. > {code} > import pyspark > from pyspark.sql import Row > from pyspark.sql.functions import udf, col, struct > def length(full_name): > # The non-aliased names, FIRST and LAST, show up here, instead of > # first_name and last_name. > # print(full_name) > return len(full_name.first_name) + len(full_name.last_name) > if __name__ == '__main__': > spark = ( > pyspark.sql.SparkSession.builder > .getOrCreate()) > length_udf = udf(length) > names = spark.createDataFrame([ > Row(FIRST='Nick', LAST='Chammas'), > Row(FIRST='Walter', LAST='Williams'), > ]) > names_cleaned = ( > names > .select( > col('FIRST').alias('first_name'), > col('LAST').alias('last_name'), > ) > .withColumn('full_name', struct('first_name', 'last_name')) > .select('full_name')) > # We see the schema we expect here. > names_cleaned.printSchema() > # However, here we get an AttributeError. length_udf() cannot > # find first_name or last_name. 
> (names_cleaned > .withColumn('length', length_udf('full_name')) > .show()) > {code} > When I run this I get a long stack trace, but the relevant portions seem to > be: > {code} > File ".../udf-alias.py", line 10, in length > return len(full_name.first_name) + len(full_name.last_name) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1502, in __getattr__ > raise AttributeError(item) > AttributeError: first_name > {code} > {code} > Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/pyspark.zip/pyspark/sql/types.py", > line 1497, in __getattr__ > idx = self.__fields__.index(item) > ValueError: 'first_name' is not in list > {code} > Here are the relevant execution plans: > {code} > names_cleaned.explain() > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10] > +- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > {code} > (names_cleaned > .withColumn('length', length_udf('full_name')) > .explain()) > == Physical Plan == > *Project [struct(FIRST#0 AS first_name#5, LAST#1 AS last_name#6) AS > full_name#10, pythonUDF0#21 AS length#17] > +- BatchEvalPython [length(struct(FIRST#0, LAST#1))], [FIRST#0, LAST#1, > pythonUDF0#21] >+- Scan ExistingRDD[FIRST#0,LAST#1] > {code} > It looks like from the second execution plan that {{BatchEvalPython}} somehow > gets the unaliased column names, whereas the {{Project}} right above it gets > the aliased names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18254) UDFs don't see aliased column names
[ https://issues.apache.org/jira/browse/SPARK-18254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634122#comment-15634122 ] Michael Armbrust commented on SPARK-18254: -- Is this yet another bug caused by the generic operator push down? Can we turn that off?
[jira] [Commented] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634117#comment-15634117 ] Jakob Odersky commented on SPARK-14222: --- A newer version of the module (version 2.8.4) is available for Scala 2.12 now: http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22jackson-module-scala_2.12%22. Can we upgrade Spark's dependency (currently Spark uses 2.6.5)?
[jira] [Commented] (SPARK-15377) Enabling SASL Spark 1.6.1
[ https://issues.apache.org/jira/browse/SPARK-15377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634088#comment-15634088 ] Shridhar Ramachandran commented on SPARK-15377: --- It is likely that you haven't enabled spark.authenticate=true in YARN. Excerpted from YarnShuffleService.java -- {noformat} * The service also optionally supports authentication. This ensures that executors from one * application cannot read the shuffle files written by those from another. This feature can be * enabled by setting `spark.authenticate` in the Yarn configuration before starting the NM. * Note that the Spark application must also set `spark.authenticate` manually and, unlike in * the case of the service port, will not inherit this setting from the Yarn configuration. This * is because an application running on the same Yarn cluster may choose to not use the external * shuffle service, in which case its setting of `spark.authenticate` should be independent of * the service's. {noformat} You can do this by adding that flag to core-site.xml. > Enabling SASL Spark 1.6.1 > - > > Key: SPARK-15377 > URL: https://issues.apache.org/jira/browse/SPARK-15377 > Project: Spark > Issue Type: Question > Components: Spark Core, YARN >Affects Versions: 1.6.1 >Reporter: Fabian Tan > > Hi there, > I wonder if anyone has gotten SASL to work with Spark 1.6.1 on YARN? > At this point in time, I can't confirm if this is a bug or not, but it's > definitely reproducible.
> Basically Spark documentation states that you only require 3 parameters > enabled: > spark.authenticate.enableSaslEncryption=true > spark.network.sasl.serverAlwaysEncrypt=true > spark.authenticate=true > http://spark.apache.org/docs/latest/security.html > However, upon launching my spark job with --master yarn and --deploy-mode > client, I see the following in my spark executors logs: > 6/05/17 06:50:51 ERROR client.TransportClientFactory: Exception while > bootstrapping client after 29 ms > java.lang.RuntimeException: java.lang.IllegalArgumentException: Unknown > message type: -22 > at > org.apache.spark.network.shuffle.protocol.BlockTransferMessage$Decoder.fromByteBuffer(BlockTransferMessage.java:67) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:71) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:254) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) 
> at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at >
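Based on the YarnShuffleService comment quoted in the answer above, the YARN-side flag would be added to core-site.xml on the NodeManager hosts. A sketch only (exact file placement can vary by distribution, and remember the Spark application must also set `spark.authenticate` itself; it does not inherit it from the YARN configuration):

```xml
<!-- core-site.xml on the YARN NodeManager hosts; restart the NM afterwards.
     Sketch under the assumptions stated above, not a verified full config. -->
<property>
  <name>spark.authenticate</name>
  <value>true</value>
</property>
```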
[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17937: - Priority: Critical (was: Major) > Clarify Kafka offset semantics for Structured Streaming > --- > > Key: SPARK-17937 > URL: https://issues.apache.org/jira/browse/SPARK-17937 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Critical > > Possible events for which offsets are needed: > # New partition is discovered > # Offset out of range (aka, data has been lost). It's possible to separate > this into offset too small and offset too large, but I'm not sure it matters > for us. > Possible sources of offsets: > # *Earliest* position in log > # *Latest* position in log > # *Fail* and kill the query > # *Checkpoint* position > # *User specified* per topicpartition > # *Kafka commit log*. Currently unsupported. This means users who want to > migrate from existing kafka jobs need to jump through hoops. Even if we > never want to support it, as soon as we take on SPARK-17815 we need to make > sure Kafka commit log state is clearly documented and handled. > # *Timestamp*. Currently unsupported. This could be supported with old, > inaccurate Kafka time api, or upcoming time index > # *X offsets* before or after latest / earliest position. Currently > unsupported. I think the semantics of this are super unclear by comparison > with timestamp, given that Kafka doesn't have a single range of offsets. > Currently allowed pre-query configuration, all "ORs" are exclusive: > # startingOffsets: *earliest* OR *latest* OR *User specified* json per > topicpartition (SPARK-17812) > # failOnDataLoss: true (which implies *Fail* above) OR false (which implies > *Earliest* above) In general, I see no reason this couldn't specify Latest > as an option. 
> Possible lifecycle times in which an offset-related event may happen: > # At initial query start > #* New partition: if startingOffsets is *Earliest* or *Latest*, use that. If > startingOffsets is *User specified* perTopicpartition, and the new partition > isn't in the map, *Fail*. Note that this is effectively indistinguishable > from a new partition during query, because partitions may have changed in > between pre-query configuration and query start, but we treat it differently, > and users in this case are SOL > #* Offset out of range on driver: We don't technically have behavior for this > case yet. Could use the value of failOnDataLoss, but it's possible people > may want to know at startup that something was wrong, even if they're ok with > earliest for a during-query out of range > #* Offset out of range on executor: seems like it should be *Fail* or > *Earliest*, based on failOnDataLoss, but it looks like this setting is > currently ignored, and the executor will just fail... > # During query > #* New partition: *Earliest*, only. This seems to be by fiat, I see no > reason this can't be configurable. > #* Offset out of range on driver: this _probably_ doesn't happen, because > we're doing explicit seeks to the latest position > #* Offset out of range on executor: ? > # At query restart > #* New partition: *Checkpoint*, fall back to *Earliest*. Again, no reason > this couldn't be a configurable fall back to Latest > #* Offset out of range on driver: this _probably_ doesn't happen, because > we're doing explicit seeks to the specified position > #* Offset out of range on executor: ? > I've probably missed something, chime in.
[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17937: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-15406)
[jira] [Updated] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17937: - Target Version/s: 2.1.0
[jira] [Commented] (SPARK-17937) Clarify Kafka offset semantics for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634061#comment-15634061 ] Michael Armbrust commented on SPARK-17937: -- I'm going to pull this out from the parent JIRA as I don't think it blocks basic kafka usage, and there are enough things here to warrant several subtasks on their own. I see a couple of possible concrete action items here: - make sure what we have is documented clearly, possibly even with a comparison to {{auto.offset.reset}} for those more familiar with kafka - what to do on data loss: seems we are missing the ability to do {{latest}}. I'd be fine with changing this to {{onDataLoss = fail,earliest,latest}} with fail being the default. It would be nice to keep compatibility with the old option, but that is minor. - how to handle new partitions: if possible I'd like to lump this into the {{onDataLoss}} setting. When a new partition appears mid-query the default should be to process all of it (if that can be assured). If it can't because of downtime and data has already aged out, I'd like to error by default, but the user should be able to pick earliest or latest. - timestamps: sounds awesome, should probably be its own feature JIRA - integration with the kafka commit log: also could probably be its own feature JIRA. I'd also like to hear requests from users on what they need here. Is it monitoring? Is it moving queries to structured streaming? My big concern is that it might be confusing since we can't use the same transactional tricks we use for our own checkpoint commit log and I don't want users to lose exactly-once without understanding why. - X offsets - also its own feature JIRA. I agree it only makes sense for topics that are uniformly hash partitioned (all of mine are). Maybe we skip this if we get timestamps soon enough.
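The proposed {{onDataLoss = fail,earliest,latest}} option and its backwards-compatible mapping from the old boolean failOnDataLoss could be resolved like this. A hypothetical sketch of the proposal only; `resolve_on_data_loss` and its signature are invented here, and none of this is a real Spark option yet:

```python
# Hypothetical resolver for the proposed onDataLoss setting discussed
# above: a new three-valued option, with the old boolean failOnDataLoss
# mapped onto it for compatibility, and "fail" as the proposed default.
def resolve_on_data_loss(on_data_loss=None, fail_on_data_loss=None):
    if on_data_loss is not None:
        if on_data_loss not in ("fail", "earliest", "latest"):
            raise ValueError(f"invalid onDataLoss value: {on_data_loss!r}")
        return on_data_loss
    if fail_on_data_loss is not None:
        # Old semantics: true -> fail the query, false -> reset to earliest.
        return "fail" if fail_on_data_loss else "earliest"
    return "fail"  # proposed default

print(resolve_on_data_loss(fail_on_data_loss=False))  # earliest
```

Note the new option adds the previously unreachable "latest" behavior while keeping both legacy boolean settings meaningful.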
> Clarify Kafka offset semantics for Structured Streaming > --- > > Key: SPARK-17937 > URL: https://issues.apache.org/jira/browse/SPARK-17937 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Cody Koeninger > > Possible events for which offsets are needed: > # New partition is discovered > # Offset out of range (aka, data has been lost). It's possible to separate > this into offset too small and offset too large, but I'm not sure it matters > for us. > Possible sources of offsets: > # *Earliest* position in log > # *Latest* position in log > # *Fail* and kill the query > # *Checkpoint* position > # *User specified* per topicpartition > # *Kafka commit log*. Currently unsupported. This means users who want to > migrate from existing kafka jobs need to jump through hoops. Even if we > never want to support it, as soon as we take on SPARK-17815 we need to make > sure Kafka commit log state is clearly documented and handled. > # *Timestamp*. Currently unsupported. This could be supported with old, > inaccurate Kafka time api, or upcoming time index > # *X offsets* before or after latest / earliest position. Currently > unsupported. I think the semantics of this are super unclear by comparison > with timestamp, given that Kafka doesn't have a single range of offsets. > Currently allowed pre-query configuration, all "ORs" are exclusive: > # startingOffsets: *earliest* OR *latest* OR *User specified* json per > topicpartition (SPARK-17812) > # failOnDataLoss: true (which implies *Fail* above) OR false (which implies > *Earliest* above) In general, I see no reason this couldn't specify Latest > as an option. > Possible lifecycle times in which an offset-related event may happen: > # At initial query start > #* New partition: if startingOffsets is *Earliest* or *Latest*, use that. If > startingOffsets is *User specified* perTopicpartition, and the new partition > isn't in the map, *Fail*. 
Note that this is effectively indistinguishable > from a new partition during query, because partitions may have changed in > between pre-query configuration and query start, but we treat it differently, > and users in this case are SOL > #* Offset out of range on driver: We don't technically have behavior for this > case yet. Could use the value of failOnDataLoss, but it's possible people > may want to know at startup that something was wrong, even if they're ok with > earliest for a during-query out of range > #* Offset out of range on executor: seems like it should be *Fail* or > *Earliest*, based on failOnDataLoss, but it looks like this setting is > currently ignored, and the executor will just fail... > # During query > #* New partition: *Earliest*, only. This seems to be by fiat; I see no > reason this can't be configurable. > #* Offset out of range on driver: this _probably_ doesn't happen, because > we're doing explicit seeks to
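The startingOffsets and failOnDataLoss options discussed in this thread can be sketched as plain option strings. The per-TopicPartition JSON shape follows the SPARK-17812 discussion; the -1/-2 sentinels for latest/earliest are my reading of the Kafka integration docs, so treat them as assumptions to verify:

```python
import json

# Build the per-TopicPartition startingOffsets JSON discussed above
# (SPARK-17812). Sentinels: -1 is assumed to mean "latest" and -2
# "earliest" for an individual partition -- verify against the Kafka
# integration guide before relying on them.
def starting_offsets_json(offsets):
    """offsets: topic -> {partition (as a string) -> offset}."""
    return json.dumps(offsets, sort_keys=True)

options = {
    "subscribe": "topic1",  # hypothetical topic name
    "startingOffsets": starting_offsets_json({"topic1": {"0": 23, "1": -1, "2": -2}}),
    "failOnDataLoss": "false",  # i.e. fall back to earliest instead of failing
}
assert json.loads(options["startingOffsets"]) == {"topic1": {"0": 23, "1": -1, "2": -2}}
```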
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634050#comment-15634050 ] Josh Rosen commented on SPARK-14220: SPARK-14643 is likely to be the hardest task. > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone.
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634027#comment-15634027 ] Jakob Odersky commented on SPARK-14220: --- at least most dependencies will probably make 2.12 builds available, now that it is considered binary-stable > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone.
[jira] [Comment Edited] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634027#comment-15634027 ] Jakob Odersky edited comment on SPARK-14220 at 11/3/16 7:54 PM: At least most dependencies will probably make 2.12 builds available, now that it is considered binary-stable. The closure cleaning and byte code manipulation stuff is a whole different story though... was (Author: jodersky): at least most dependencies will probably make 2.12 builds available, now that it is considered binary-stable > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone.
[jira] [Commented] (SPARK-11914) [SQL] Support coalesce and repartition in Dataset APIs
[ https://issues.apache.org/jira/browse/SPARK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633984#comment-15633984 ] Ivan Gozali commented on SPARK-11914: - Hi, apologies for bringing this up in an old issue. I was wondering if there's any particular reason the {{shuffle}} argument that's present in [JavaRDD.coalesce()|https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDD.html#coalesce(int,%20boolean)] is not present in the Dataset API? Also, are there plans to bring it into the Dataset API? Thank you! > [SQL] Support coalesce and repartition in Dataset APIs > -- > > Key: SPARK-11914 > URL: https://issues.apache.org/jira/browse/SPARK-11914 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 1.6.0 > > > repartition: Returns a new [[Dataset]] that has exactly `numPartitions` > partitions. > coalesce: Returns a new [[Dataset]] that has exactly `numPartitions` > partitions. Similar to coalesce defined on an [[RDD]], this operation results > in a narrow dependency, e.g. if you go from 1000 partitions to 100 > partitions, there will not be a shuffle; instead each of the 100 new > partitions will claim 10 of the current partitions.
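To make the narrow-dependency behavior quoted above concrete, here is a toy model (plain Python, not Spark's actual partition coalescer) of how 100 new partitions can each claim 10 of 1000 parents without a shuffle. In the Dataset API, {{repartition(n)}} covers the shuffle=true case, which may be why the boolean flag was dropped:

```python
# Toy model of the narrow dependency described in the quoted docs: going
# from 1000 partitions to 100, each new partition claims 10 parents and
# no data moves between executors. Simplified stand-in; Spark's real
# coalescer also considers data locality.
def coalesce_groups(num_parents, num_target):
    groups = [[] for _ in range(num_target)]
    for parent in range(num_parents):
        # Round-robin assignment of parent partitions to new partitions.
        groups[parent % num_target].append(parent)
    return groups

groups = coalesce_groups(1000, 100)
assert len(groups) == 100
assert all(len(g) == 10 for g in groups)  # each new partition claims 10 parents
```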
[jira] [Assigned] (SPARK-18257) Improve error reporting for FileStressSuite in streaming
[ https://issues.apache.org/jira/browse/SPARK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18257: Assignee: Reynold Xin (was: Apache Spark) > Improve error reporting for FileStressSuite in streaming > > > Key: SPARK-18257 > URL: https://issues.apache.org/jira/browse/SPARK-18257 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Reynold Xin >Assignee: Reynold Xin > > FileStressSuite doesn't report errors when they occur.
[jira] [Commented] (SPARK-18257) Improve error reporting for FileStressSuite in streaming
[ https://issues.apache.org/jira/browse/SPARK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633931#comment-15633931 ] Apache Spark commented on SPARK-18257: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/15757 > Improve error reporting for FileStressSuite in streaming > > > Key: SPARK-18257 > URL: https://issues.apache.org/jira/browse/SPARK-18257 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Reynold Xin >Assignee: Reynold Xin > > FileStressSuite doesn't report errors when they occur.
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633932#comment-15633932 ] Shivaram Venkataraman commented on SPARK-15799: --- Yes - I think this is good to go. The only thing remaining IMHO is that the vignette needs to be packaged correctly (https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-package-vignettes has some details). I think it would be good to release it with 2.1.0, since we have many new ML algorithms in it. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process?
[jira] [Assigned] (SPARK-18257) Improve error reporting for FileStressSuite in streaming
[ https://issues.apache.org/jira/browse/SPARK-18257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18257: Assignee: Apache Spark (was: Reynold Xin) > Improve error reporting for FileStressSuite in streaming > > > Key: SPARK-18257 > URL: https://issues.apache.org/jira/browse/SPARK-18257 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: Reynold Xin >Assignee: Apache Spark > > FileStressSuite doesn't report errors when they occur.
[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests
[ https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633923#comment-15633923 ] Sean Owen commented on SPARK-9487: -- Yes, keep going, why not? > Use the same num. worker threads in Scala/Python unit tests > --- > > Key: SPARK-9487 > URL: https://issues.apache.org/jira/browse/SPARK-9487 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL, Tests >Affects Versions: 1.5.0 >Reporter: Xiangrui Meng > Labels: starter > Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults > > > In Python we use `local[4]` for unit tests, while in Scala/Java we use > `local[2]` and `local` for some unit tests in SQL, MLLib, and other > components. If the operation depends on partition IDs, e.g., random number > generator, this will lead to different results in Python and Scala/Java. It > would be nice to use the same number in all unit tests.
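The partition-ID dependence mentioned in the description can be demonstrated with a small sketch: if a random number generator is seeded per partition, the same data split across a local[2]-style versus a local[4]-style partitioning yields different values. This is a pure-Python illustration, not actual Spark code:

```python
import random

# Sketch of why results diverge when an operation seeds an RNG per
# partition: the same 8 elements split across 2 vs 4 partitions get
# different per-partition seeds, so the "random" values differ.
def partitioned_random(data, num_partitions, seed=42):
    size = len(data) // num_partitions
    out = []
    for pid in range(num_partitions):
        rng = random.Random(seed + pid)  # seed depends on partition ID
        for _ in data[pid * size:(pid + 1) * size]:
            out.append(rng.random())
    return out

a = partitioned_random(list(range(8)), 2)
b = partitioned_random(list(range(8)), 4)
assert a != b  # same data, different partitioning, different results
```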
[jira] [Commented] (SPARK-18230) MatrixFactorizationModel.recommendProducts throws NoSuchElement exception when the user does not exist
[ https://issues.apache.org/jira/browse/SPARK-18230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633917#comment-15633917 ] Sean Owen commented on SPARK-18230: --- Agree, I can't see how you'd return anything in this case. A Rating with value NaN doesn't make sense. (Why just one?) In fact, returning no recommendations is a valid response for an existing user, so the response for a non-existent user could, and probably should, be different. I think it's OK to view it as exceptional. > MatrixFactorizationModel.recommendProducts throws NoSuchElement exception > when the user does not exist > -- > > Key: SPARK-18230 > URL: https://issues.apache.org/jira/browse/SPARK-18230 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.1 >Reporter: Mikael Ståldal >Priority: Minor > > When invoking {{MatrixFactorizationModel.recommendProducts(Int, Int)}} with a > non-existing user, a {{java.util.NoSuchElementException}} is thrown: > {code} > java.util.NoSuchElementException: next on empty iterator > at scala.collection.Iterator$$anon$2.next(Iterator.scala:39) > at scala.collection.Iterator$$anon$2.next(Iterator.scala:37) > at > scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63) > at scala.collection.IterableLike$class.head(IterableLike.scala:107) > at > scala.collection.mutable.WrappedArray.scala$collection$IndexedSeqOptimized$$super$head(WrappedArray.scala:35) > at > scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:126) > at scala.collection.mutable.WrappedArray.head(WrappedArray.scala:35) > at > org.apache.spark.mllib.recommendation.MatrixFactorizationModel.recommendProducts(MatrixFactorizationModel.scala:169) > {code} > It would be nice if it returned an empty array, or threw a more specific > exception, and if that were documented in the ScalaDoc for the method. 
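A defensive pattern along the lines Sean describes (a specific, documented error for a missing user rather than a NoSuchElementException escaping from an internal lookup) might look like this pure-Python stand-in for the model; the factor values and dot-product scoring are invented for illustration:

```python
# Pure-Python stand-in for a factorization model: check that the user
# factor exists before scoring, instead of letting a lookup on an empty
# result throw an opaque exception. Factor values are made up.
user_factors = {1: [0.1, 0.9], 2: [0.8, 0.2]}
product_factors = {10: [0.5, 0.5], 20: [0.9, 0.1]}

def recommend_products(user, num):
    if user not in user_factors:                     # explicit existence check
        raise KeyError(f"user {user} not in model")  # specific, documented error
    uf = user_factors[user]
    scored = sorted(
        product_factors.items(),
        key=lambda kv: -sum(u * p for u, p in zip(uf, kv[1])),
    )
    return [pid for pid, _ in scored[:num]]

assert recommend_products(1, 1) == [10]
```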
[jira] [Created] (SPARK-18257) Improve error reporting for FileStressSuite in streaming
Reynold Xin created SPARK-18257: --- Summary: Improve error reporting for FileStressSuite in streaming Key: SPARK-18257 URL: https://issues.apache.org/jira/browse/SPARK-18257 Project: Spark Issue Type: Bug Components: Structured Streaming Reporter: Reynold Xin Assignee: Reynold Xin FileStressSuite doesn't report errors when they occur.
[jira] [Commented] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results
[ https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633888#comment-15633888 ] Apache Spark commented on SPARK-18256: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15756 > Improve performance of event log replay in HistoryServer based on profiler > results > -- > > Key: SPARK-18256 > URL: https://issues.apache.org/jira/browse/SPARK-18256 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen >Assignee: Josh Rosen > > Profiling event log replay in the HistoryServer reveals Json4S control flow > exceptions and `Utils.getFormattedClassName` calls as significant > bottlenecks. Eliminating these halves the time to replay long event logs.
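The `Utils.getFormattedClassName` bottleneck described above suggests a classic fix: compute the formatted name once per class and cache it, so replaying millions of events does not recompute the same string. A minimal Python sketch of the memoization pattern (the trailing-`$` cleanup mimics how Scala object names are tidied; this is not the actual Spark patch):

```python
from functools import lru_cache

# Memoize the per-class name formatting so the hot replay loop pays the
# cost once per class instead of once per event. Illustrative only.
@lru_cache(maxsize=None)
def formatted_class_name(cls):
    # Strip a trailing '$' the way Scala object names are cleaned up.
    return cls.__name__.rstrip("$")

class SparkListenerTaskEnd:  # stand-in event type
    pass

# Repeated calls hit the cache instead of recomputing the string.
names = [formatted_class_name(SparkListenerTaskEnd) for _ in range(1000)]
assert names[0] == "SparkListenerTaskEnd"
assert formatted_class_name.cache_info().misses == 1  # computed only once
```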
[jira] [Assigned] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results
[ https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18256: Assignee: Apache Spark (was: Josh Rosen) > Improve performance of event log replay in HistoryServer based on profiler > results > -- > > Key: SPARK-18256 > URL: https://issues.apache.org/jira/browse/SPARK-18256 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen >Assignee: Apache Spark > > Profiling event log replay in the HistoryServer reveals Json4S control flow > exceptions and `Utils.getFormattedClassName` calls as significant > bottlenecks. Eliminating these halves the time to replay long event logs.
[jira] [Assigned] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results
[ https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18256: Assignee: Josh Rosen (was: Apache Spark) > Improve performance of event log replay in HistoryServer based on profiler > results > -- > > Key: SPARK-18256 > URL: https://issues.apache.org/jira/browse/SPARK-18256 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen >Assignee: Josh Rosen > > Profiling event log replay in the HistoryServer reveals Json4S control flow > exceptions and `Utils.getFormattedClassName` calls as significant > bottlenecks. Eliminating these halves the time to replay long event logs.
[jira] [Created] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results
Josh Rosen created SPARK-18256: -- Summary: Improve performance of event log replay in HistoryServer based on profiler results Key: SPARK-18256 URL: https://issues.apache.org/jira/browse/SPARK-18256 Project: Spark Issue Type: Bug Reporter: Josh Rosen Assignee: Josh Rosen Profiling event log replay in the HistoryServer reveals Json4S control flow exceptions and `Utils.getFormattedClassName` calls as significant bottlenecks. Eliminating these halves the time to replay long event logs.
[jira] [Updated] (SPARK-18256) Improve performance of event log replay in HistoryServer based on profiler results
[ https://issues.apache.org/jira/browse/SPARK-18256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-18256: --- Issue Type: Improvement (was: Bug) > Improve performance of event log replay in HistoryServer based on profiler > results > -- > > Key: SPARK-18256 > URL: https://issues.apache.org/jira/browse/SPARK-18256 > Project: Spark > Issue Type: Improvement >Reporter: Josh Rosen >Assignee: Josh Rosen > > Profiling event log replay in the HistoryServer reveals Json4S control flow > exceptions and `Utils.getFormattedClassName` calls as significant > bottlenecks. Eliminating these halves the time to replay long event logs.
[jira] [Updated] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1
[ https://issues.apache.org/jira/browse/SPARK-18237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18237: Fix Version/s: (was: 2.0.3) > hive.exec.stagingdir have no effect in spark2.0.1 > - > > Key: SPARK-18237 > URL: https://issues.apache.org/jira/browse/SPARK-18237 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: ClassNotFoundExp >Assignee: ClassNotFoundExp > Fix For: 2.1.0 > > > hive.exec.stagingdir has no effect in spark2.0.1; > this is related to https://issues.apache.org/jira/browse/SPARK-11021
[jira] [Resolved] (SPARK-18237) hive.exec.stagingdir have no effect in spark2.0.1
[ https://issues.apache.org/jira/browse/SPARK-18237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18237. - Resolution: Fixed Assignee: ClassNotFoundExp Fix Version/s: 2.1.0 2.0.3 > hive.exec.stagingdir have no effect in spark2.0.1 > - > > Key: SPARK-18237 > URL: https://issues.apache.org/jira/browse/SPARK-18237 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: ClassNotFoundExp >Assignee: ClassNotFoundExp > Fix For: 2.0.3, 2.1.0 > > > hive.exec.stagingdir has no effect in spark2.0.1; > this is related to https://issues.apache.org/jira/browse/SPARK-11021
[jira] [Resolved] (SPARK-18244) Rename partitionProviderIsHive -> tracksPartitionsInCatalog
[ https://issues.apache.org/jira/browse/SPARK-18244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-18244. - Resolution: Fixed Fix Version/s: 2.1.0 > Rename partitionProviderIsHive -> tracksPartitionsInCatalog > --- > > Key: SPARK-18244 > URL: https://issues.apache.org/jira/browse/SPARK-18244 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > partitionProviderIsHive is too specific to Hive. In reality we can track > partitions in any catalog, not just Hive's metastore.
[jira] [Updated] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-14220: Target Version/s: (was: 2.2.0) > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone.
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633773#comment-15633773 ] Reynold Xin commented on SPARK-14220: - Yea in reality it's going to be really painful to upgrade. > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone.