[jira] [Commented] (SPARK-12066) spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join
[ https://issues.apache.org/jira/browse/SPARK-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067714#comment-15067714 ] Ricky Yang commented on SPARK-12066: The problem has been confirmed: there is a line of dirty data in the Hive table, but Spark SQL does not catch this exception. The exception is as follows:
java.lang.ArrayIndexOutOfBoundsException: 9731
 at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryUtils.byteArrayToLong(LazyBinaryUtils.java:81)
 at org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryDouble.init(LazyBinaryDouble.java:43)
 at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase$FieldInfo.uncheckedGetField(ColumnarStructBase.java:110)
 at org.apache.hadoop.hive.serde2.columnar.ColumnarStructBase.getField(ColumnarStructBase.java:171)
 at org.apache.hadoop.hive.serde2.objectinspector.ColumnarStructObjectInspector.getStructFieldData(ColumnarStructObjectInspector.java:166)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:386)
 at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:382)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
 at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
 at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1792)
 at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1792)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
> spark sql throw java.lang.ArrayIndexOutOfBoundsException when use table.* with join
> -
>
> Key: SPARK-12066
> URL: https://issues.apache.org/jira/browse/SPARK-12066
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0, 1.5.2
> Environment: linux
> Reporter: Ricky Yang
> Priority: Blocker
>
> Throws java.lang.ArrayIndexOutOfBoundsException when I use the following Spark SQL on Spark standalone or YARN.
> The SQL:
> select ta.*
> from bi_td.dm_price_seg_td tb
> join bi_sor.sor_ord_detail_tf ta
> on 1 = 1
> where ta.sale_dt = '20140514'
> and ta.sale_price >= tb.pri_from
> and ta.sale_price < tb.pri_to limit 10 ;
> But the result is correct when * is not used, as in the following:
> select ta.sale_dt
> from bi_td.dm_price_seg_td tb
> join bi_sor.sor_ord_detail_tf ta
> on 1 = 1
> where ta.sale_dt = '20140514'
> and ta.sale_price >= tb.pri_from
> and ta.sale_price < tb.pri_to limit 10 ;
> The standalone version is 1.4.0 and the Spark-on-YARN version is 1.5.2.
> Error log:
> 15/11/30 14:19:59 ERROR SparkSQLDriver: Failed in [select ta.*
> from bi_td.dm_price_seg_td tb
> join bi_sor.sor_ord_detail_tf ta
> on 1 = 1
> where ta.sale_dt = '20140514'
> and ta.sale_price >= tb.pri_
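Since the comment above traces the failure to a single corrupt row that only blows up once the extra columns selected by {{ta.*}} are deserialized, one rough way to narrow down where that row lives is to force deserialization partition by partition and catch the error instead of failing the job. The sketch below is not from the ticket; it assumes Spark 1.4/1.5-style APIs, an existing SparkContext {{sc}}, and reuses the table, partition, and column names quoted in the report.
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// Scan only the partition from the report and force deserialization of all columns;
// record where each partition's iterator blows up instead of failing the whole job.
val rows = hiveContext.sql(
  "SELECT ta.* FROM bi_sor.sor_ord_detail_tf ta WHERE ta.sale_dt = '20140514'")

val suspectOffsets = rows.rdd.mapPartitionsWithIndex { (partition, iter) =>
  var consumed = 0L
  val failed =
    try { while (iter.hasNext) { iter.next(); consumed += 1 }; false }
    catch { case _: Exception => true }  // e.g. the ArrayIndexOutOfBoundsException above
  if (failed) Iterator((partition, consumed)) else Iterator.empty
}

suspectOffsets.collect().foreach { case (p, n) =>
  println(s"partition $p fails after $n rows")
}
{code}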
[jira] [Assigned] (SPARK-12475) Upgrade Zinc from 0.3.5.3 to 0.3.9
[ https://issues.apache.org/jira/browse/SPARK-12475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12475: Assignee: Apache Spark (was: Josh Rosen) > Upgrade Zinc from 0.3.5.3 to 0.3.9 > -- > > Key: SPARK-12475 > URL: https://issues.apache.org/jira/browse/SPARK-12475 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Apache Spark > > We should update to the latest version of Zinc in order to match our SBT > version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12458: Assignee: (was: Apache Spark) > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12458: Assignee: Apache Spark > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
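For context, {{ExpressionDescription}} is the annotation in the catalyst expressions package whose {{usage}}/{{extended}} text is surfaced by {{DESCRIBE FUNCTION [EXTENDED]}}. A sketch of what this sub-task asks for, with the expression's body elided and the description wording invented purely for illustration:
{code}
import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, UnaryExpression}

// The annotation is added to the existing datetime expression case classes;
// the text below is illustrative, not the wording that was actually committed.
@ExpressionDescription(
  usage = "_FUNC_(date) - Returns the year component of date/timestamp.",
  extended = "> SELECT _FUNC_('2015-12-22');\n 2015")
case class Year(child: Expression) extends UnaryExpression {
  // ... existing implementation of Year, unchanged ...
}
{code}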
[jira] [Resolved] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11823. Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10425 [https://github.com/apache/spark/pull/10425] > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Josh Rosen > Fix For: 2.0.0, 1.6.1 > > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11807) Remove support for Hadoop < 2.2 (i.e. Hadoop 1 and 2.0)
[ https://issues.apache.org/jira/browse/SPARK-11807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067608#comment-15067608 ] Reynold Xin commented on SPARK-11807: - [~srowen] please submit prs that can simplify code (reflection) from the removal of hadoop 1.x. Maybe create a new ticket and link to this one when you do them? > Remove support for Hadoop < 2.2 (i.e. Hadoop 1 and 2.0) > --- > > Key: SPARK-11807 > URL: https://issues.apache.org/jira/browse/SPARK-11807 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
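The simplification being asked for above is of this general shape: shims that used reflection to paper over Hadoop 1.x/2.x API differences can become plain calls. The example below is hypothetical and does not point at a specific Spark code path.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.JobContext

object HadoopShimExample {
  // Before: reflective lookup, needed while Hadoop 1.x and 2.x exposed
  // incompatible JobContext types.
  def getConfigurationViaReflection(context: JobContext): Configuration =
    context.getClass.getMethod("getConfiguration")
      .invoke(context).asInstanceOf[Configuration]

  // After: a plain call is enough once only Hadoop 2.2+ is supported.
  def getConfigurationDirectly(context: JobContext): Configuration =
    context.getConfiguration
}
{code}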
[jira] [Resolved] (SPARK-11807) Remove support for Hadoop < 2.2 (i.e. Hadoop 1 and 2.0)
[ https://issues.apache.org/jira/browse/SPARK-11807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11807. - Resolution: Fixed Fix Version/s: 2.0.0 > Remove support for Hadoop < 2.2 (i.e. Hadoop 1 and 2.0) > --- > > Key: SPARK-11807 > URL: https://issues.apache.org/jira/browse/SPARK-11807 > Project: Spark > Issue Type: Sub-task > Components: Build >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12327) lint-r checks fail with commented code
[ https://issues.apache.org/jira/browse/SPARK-12327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067575#comment-15067575 ] Felix Cheung commented on SPARK-12327: -- https://github.com/apache/spark/pull/10408 > lint-r checks fail with commented code > -- > > Key: SPARK-12327 > URL: https://issues.apache.org/jira/browse/SPARK-12327 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shivaram Venkataraman > > We get this after our R version downgrade > {code} > R/RDD.R:183:68: style: Commented code should be removed. > rdd@env$jrdd_val <- callJMethod(rddRef, "asJavaRDD") # > rddRef$asJavaRDD() > > ^~ > R/RDD.R:228:63: style: Commented code should be removed. > #' http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. > ^~~~ > R/RDD.R:388:24: style: Commented code should be removed. > #' collectAsMap(rdd) # list(`1` = 2, `3` = 4) >^~ > R/RDD.R:603:61: style: Commented code should be removed. > #' unlist(collect(filterRDD(rdd, function (x) { x < 3 }))) # c(1, 2) > ^~~~ > R/RDD.R:762:20: style: Commented code should be removed. > #' take(rdd, 2L) # list(1, 2) >^~ > R/RDD.R:830:42: style: Commented code should be removed. > #' sort(unlist(collect(distinct(rdd # c(1, 2, 3) > ^~~ > R/RDD.R:980:47: style: Commented code should be removed. > #' collect(keyBy(rdd, function(x) { x*x })) # list(list(1, 1), list(4, 2), > list(9, 3)) > > ^~~~ > R/RDD.R:1194:27: style: Commented code should be removed. > #' takeOrdered(rdd, 6L) # list(1, 2, 3, 4, 5, 6) > ^~ > R/RDD.R:1215:19: style: Commented code should be removed. > #' top(rdd, 6L) # list(10, 9, 7, 6, 5, 4) > ^~~ > R/RDD.R:1270:50: style: Commented code should be removed. > #' aggregateRDD(rdd, zeroValue, seqOp, combOp) # list(10, 4) > ^~~ > R/RDD.R:1374:6: style: Commented code should be removed. > #' # list(list("a", 0), list("b", 3), list("c", 1), list("d", 4), list("e", > 2)) > > ^~ > R/RDD.R:1415:6: style: Commented code should be removed. > #' # list(list("a", 0), list("b", 1), list("c", 2), list("d", 3), list("e", > 4)) > > ^~ > R/RDD.R:1461:6: style: Commented code should be removed. > #' # list(list(1, 2), list(3, 4)) > ^~~~ > R/RDD.R:1527:6: style: Commented code should be removed. > #' # list(list(0, 1000), list(1, 1001), list(2, 1002), list(3, 1003), list(4, > 1004)) > > ^~~ > R/RDD.R:1564:6: style: Commented code should be removed. > #' # list(list(1, 1), list(1, 2), list(2, 1), list(2, 2)) > ^~~~ > R/RDD.R:1595:6: style: Commented code should be removed. > #' # list(1, 1, 3) > ^ > R/RDD.R:1627:6: style: Commented code should be removed. > #' # list(1, 2, 3) > ^ > R/RDD.R:1663:6: style: Commented code should be removed. > #' # list(list(1, c(1,2), c(1,2,3)), list(2, c(3,4), c(4,5,6))) > ^~ > R/deserialize.R:22:3: style: Commented code should be removed. > # void -> NULL > ^~~~ > R/deserialize.R:23:3: style: Commented code should be removed. > # Int -> integer > ^~ > R/deserialize.R:24:3: style: Commented code should be removed. > # String -> character > ^~~ > R/deserialize.R:25:3: style: Commented code should be removed. > # Boolean -> logical > ^~ > R/deserialize.R:26:3: style: Commented code should be removed. > # Float -> double > ^~~ > R/deserialize.R:27:3: style: Commented code should be removed. > # Double -> double > ^~~~ > R/deserialize.R:28:3: style: Commented code should be removed. > # Long -> double > ^~ > R/deserialize.R:29:3: style: Commented code should be removed. 
> # Array[Byte] -> raw > ^~ > R/deserialize.R:30:3: style: Commented code should be removed. > # Date -> Date > ^~~~ > R/deserialize.R:31:3: style: Commented code should be removed. > # Time -> POSIXct > ^~~ > R/deserialize.R
[jira] [Assigned] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
[ https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12476: Assignee: (was: Apache Spark) > Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter > - > > Key: SPARK-12476 > URL: https://issues.apache.org/jira/browse/SPARK-12476 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' > Current plan: > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} > This patch enables a plan below; > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
[ https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12476: Assignee: Apache Spark > Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter > - > > Key: SPARK-12476 > URL: https://issues.apache.org/jira/browse/SPARK-12476 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro >Assignee: Apache Spark > > Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' > Current plan: > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} > This patch enables a plan below; > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
[ https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067528#comment-15067528 ] Apache Spark commented on SPARK-12476: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/10427 > Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter > - > > Key: SPARK-12476 > URL: https://issues.apache.org/jira/browse/SPARK-12476 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' > Current plan: > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} > This patch enables a plan below; > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
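The mechanism behind this ticket is the {{BaseRelation.unhandledFilters}} hook added in Spark 1.6: filters a relation reports as handled no longer get a redundant Spark-side {{Filter}} in the physical plan. A minimal, self-contained sketch of the contract (not the actual patch in the PR above; the real change would live in {{JDBCRelation}} and reuse its filter-to-WHERE-clause compilation):
{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Toy relation illustrating the contract: whatever unhandledFilters returns is
// re-evaluated by a Spark-side Filter; everything else is trusted to the source.
class ToyJdbcLikeRelation(override val sqlContext: SQLContext) extends BaseRelation {
  override def schema: StructType = StructType(Seq(StructField("col0", StringType)))

  // Pretend only equality predicates can be compiled into the JDBC WHERE clause.
  private def compilable(f: Filter): Boolean = f.isInstanceOf[EqualTo]

  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(compilable)
}
{code}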
[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
[ https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-12476: - Description: Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' Current plan: {code} == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} This patch enables a plan below; {code} == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} was: Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' Current plan: {code} == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} This patch enables a plan below; {code} == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} > Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter > - > > Key: SPARK-12476 > URL: https://issues.apache.org/jira/browse/SPARK-12476 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' > Current plan: > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} > This patch enables a plan below; > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > Scan > 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter
[ https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-12476: - Description: Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' {code} Current plan: == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} This patch enables a plan below; {code} == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} was: Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' {code} Current plan: == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} This patch enables a plan below; ``` == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] ``` > Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter > -- > > Key: SPARK-12476 > URL: https://issues.apache.org/jira/browse/SPARK-12476 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' > {code} > Current plan: > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} > This patch enables a plan below; > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan 
== > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
[ https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-12476: - Description: Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' Current plan: {code} == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} This patch enables a plan below; {code} == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} was: Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' {code} Current plan: == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} This patch enables a plan below; {code} == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} > Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter > - > > Key: SPARK-12476 > URL: https://issues.apache.org/jira/browse/SPARK-12476 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' > Current plan: > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} > This patch enables a plan below; > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical 
Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
[ https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-12476: - Summary: Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter (was: Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter) > Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter > - > > Key: SPARK-12476 > URL: https://issues.apache.org/jira/browse/SPARK-12476 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' > {code} > Current plan: > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} > This patch enables a plan below; > {code} > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter
[ https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-12476: - Description: Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' {code} Current plan: == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] {code} This patch enables a plan below; ``` == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] ``` was: Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' ``` Current plan: == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] ``` This patch enables a plan below; ``` == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] ``` > Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter > -- > > Key: SPARK-12476 > URL: https://issues.apache.org/jira/browse/SPARK-12476 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takeshi Yamamuro > > Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' > {code} > Current plan: > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter (col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > {code} > This patch enables a plan below; > ``` > == Optimized Logical Plan == > Project [col0#0,col1#1] > +- Filter (col0#0 = xxx) >+- Relation[col0#0,col1#1] > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver}) > == Physical Plan == > +- Filter 
(col0#0 = xxx) >+- Scan > JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, > password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: > [EqualTo(col0,xxx)] > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter
Takeshi Yamamuro created SPARK-12476: Summary: Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter Key: SPARK-12476 URL: https://issues.apache.org/jira/browse/SPARK-12476 Project: Spark Issue Type: Improvement Components: SQL Reporter: Takeshi Yamamuro Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx' ``` Current plan: == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] ``` This patch enables a plan below; ``` == Optimized Logical Plan == Project [col0#0,col1#1] +- Filter (col0#0 = xxx) +- Relation[col0#0,col1#1] JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver}) == Physical Plan == +- Filter (col0#0 = xxx) +- Scan JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu, password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: [EqualTo(col0,xxx)] ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12475) Upgrade Zinc from 0.3.5.3 to 0.3.9
[ https://issues.apache.org/jira/browse/SPARK-12475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067509#comment-15067509 ] Apache Spark commented on SPARK-12475: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10426 > Upgrade Zinc from 0.3.5.3 to 0.3.9 > -- > > Key: SPARK-12475 > URL: https://issues.apache.org/jira/browse/SPARK-12475 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Reporter: Josh Rosen >Assignee: Josh Rosen > > We should update to the latest version of Zinc in order to match our SBT > version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12475) Upgrade Zinc from 0.3.5.3 to 0.3.9
Josh Rosen created SPARK-12475: -- Summary: Upgrade Zinc from 0.3.5.3 to 0.3.9 Key: SPARK-12475 URL: https://issues.apache.org/jira/browse/SPARK-12475 Project: Spark Issue Type: Improvement Components: Build, Project Infra Reporter: Josh Rosen Assignee: Josh Rosen We should update to the latest version of Zinc in order to match our SBT version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12474) support deserialization for physical plan and hive logical plan from JSON string
[ https://issues.apache.org/jira/browse/SPARK-12474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-12474: Priority: Minor (was: Major) > support deserialization for physical plan and hive logical plan from JSON > string > > > Key: SPARK-12474 > URL: https://issues.apache.org/jira/browse/SPARK-12474 > Project: Spark > Issue Type: New Feature >Reporter: Wenchen Fan >Priority: Minor > > SPARK-12321 added a framework based on reflection that can serialize > {{TreeNode}} to JSON and deserialize it back. However, it can't handle all > corner cases, and we bypass them in the tests; see > https://github.com/apache/spark/pull/10311/files#diff-238d584c15e16c24f49a40bcf163fe13R190 > and > https://github.com/apache/spark/pull/10311/files#diff-238d584c15e16c24f49a40bcf163fe13R212. > Known corner cases: > 1. ExpressionEncoder > 2. BaseRelation > 3. Hive logical plan > 4. physical plan > The framework lives in the catalyst module and may not be able to handle corner > cases from other modules; one idea is to define a {{JsonSerializable}} trait > and implement it for the corner cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12474) support deserialization for physical plan and hive logical plan from JSON string
Wenchen Fan created SPARK-12474: --- Summary: support deserialization for physical plan and hive logical plan from JSON string Key: SPARK-12474 URL: https://issues.apache.org/jira/browse/SPARK-12474 Project: Spark Issue Type: New Feature Reporter: Wenchen Fan SPARK-12321 added a framework based on reflection that can serialize {{TreeNode}} to JSON and deserialize it back. However, it can't handle all corner cases, and we bypass them in the tests; see https://github.com/apache/spark/pull/10311/files#diff-238d584c15e16c24f49a40bcf163fe13R190 and https://github.com/apache/spark/pull/10311/files#diff-238d584c15e16c24f49a40bcf163fe13R212. Known corner cases: 1. ExpressionEncoder 2. BaseRelation 3. Hive logical plan 4. physical plan The framework lives in the catalyst module and may not be able to handle corner cases from other modules; one idea is to define a {{JsonSerializable}} trait and implement it for the corner cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
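A minimal sketch of the {{JsonSerializable}} idea floated in the description; the trait name comes from the ticket, everything else here is hypothetical rather than an agreed design.
{code}
// Plans the reflection-based framework cannot handle could opt in to
// explicit JSON serialization through a dedicated trait.
trait JsonSerializable {
  /** Serialize this node to a JSON string the framework can round-trip. */
  def toJson: String
}

// A corner-case node (e.g. one wrapping a BaseRelation) would then implement
// the trait itself instead of relying on reflection.
case class ExampleRelationNode(tableName: String) extends JsonSerializable {
  override def toJson: String =
    s"""{"class":"ExampleRelationNode","table":"$tableName"}"""
}
{code}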
[jira] [Comment Edited] (SPARK-10873) can't sort columns on history page
[ https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067477#comment-15067477 ] Zhuo Liu edited comment on SPARK-10873 at 12/22/15 2:56 AM: Hi Alex, yes, that is exactly what I am going to do, both column sorting and search box should be fixed together with jQuery DataTables. [~ajbozarth] was (Author: zhuoliu): Hi Alex, yes, that is exactly what I am going to do, both column sorting and search box should be fixed together with jQuery DataTables. > can't sort columns on history page > -- > > Key: SPARK-10873 > URL: https://issues.apache.org/jira/browse/SPARK-10873 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Thomas Graves >Assignee: Zhuo Liu > > Starting with 1.5.1 the history server page isn't allowing sorting by column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10873) can't sort columns on history page
[ https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067477#comment-15067477 ] Zhuo Liu commented on SPARK-10873: -- Hi Alex, yes, that is exactly what I am going to do, both column sorting and search box should be fixed together with jQuery DataTables. > can't sort columns on history page > -- > > Key: SPARK-10873 > URL: https://issues.apache.org/jira/browse/SPARK-10873 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Thomas Graves >Assignee: Zhuo Liu > > Starting with 1.5.1 the history server page isn't allowing sorting by column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067430#comment-15067430 ] Apache Spark commented on SPARK-11823: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/10425 > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Josh Rosen > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12472) OOM when sort a table and save as parquet
[ https://issues.apache.org/jira/browse/SPARK-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067413#comment-15067413 ] Davies Liu commented on SPARK-12472: could be workarounded by decreasing the memory used by both Spark and Parquet using spark.memory.fraction (for example, 0.4) and parquet.memory.pool.ratio (for example, 0.3, in core-site.xml) > OOM when sort a table and save as parquet > - > > Key: SPARK-12472 > URL: https://issues.apache.org/jira/browse/SPARK-12472 > Project: Spark > Issue Type: Bug >Reporter: Davies Liu > > {code} > t = sqlContext.table('store_sales') > t.unionAll(t).coalesce(2).sortWithinPartitions(t[0]).write.partitionBy('ss_sold_date_sk').parquet("/tmp/ttt") > {code} > {code} > 15/12/21 14:35:52 WARN TaskSetManager: Lost task 1.0 in stage 25.0 (TID 96, > 192.168.0.143): java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:86) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32) > at > org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951) > at > org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699) > at > org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) > at > org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) > at > org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:226) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244) > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:327) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:342) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
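A sketch of how the suggested workaround might be applied, using the example values from the comment. {{spark.memory.fraction}} is a regular Spark setting; {{parquet.memory.pool.ratio}} normally belongs in {{core-site.xml}} as noted above and is set on {{hadoopConfiguration}} here purely for illustration.
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("sort-and-write-parquet")
  .set("spark.memory.fraction", "0.4")  // shrink Spark's unified memory region
val sc = new SparkContext(conf)

// Shrink Parquet's write-buffer pool as well (example value from the comment).
sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")
{code}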
[jira] [Resolved] (SPARK-11931) "org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test jdbc cancel" is sometimes very slow
[ https://issues.apache.org/jira/browse/SPARK-11931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11931. Resolution: Duplicate > "org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test jdbc > cancel" is sometimes very slow > > > Key: SPARK-11931 > URL: https://issues.apache.org/jira/browse/SPARK-11931 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > > The "org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test > jdbc cancel" test is sometimes very slow. It usually takes about 8 seconds to > run, but in some cases can take up to 5 minutes: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4874/testReport/org.apache.spark.sql.hive.thriftserver/HiveThriftBinaryServerSuite/test_jdbc_cancel/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11823: Assignee: Apache Spark (was: Josh Rosen) > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Apache Spark > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11823: Assignee: Josh Rosen (was: Apache Spark) > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Josh Rosen > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11823: Assignee: Apache Spark (was: Josh Rosen) > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Apache Spark > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11823: Assignee: Josh Rosen (was: Apache Spark) > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Josh Rosen > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance
[ https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12473: -- Description: After commit de02782 of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so we should do this the right way. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. was: After commit de02782 of page rank regressed from 242s to 260s, about 7%. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. > Reuse serializer instances for performance > -- > > Key: SPARK-12473 > URL: https://issues.apache.org/jira/browse/SPARK-12473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After commit de02782 of page rank regressed from 242s to 260s, about 7%. > Although currently it's only 7%, we will likely register more classes in the > future so we should do this the right way. > The commit added 26 types to register every time we create a Kryo serializer > instance. I ran a small microbenchmark to prove that this is noticeably > expensive: > {code} > import org.apache.spark.serializer._ > import org.apache.spark.SparkConf > def makeMany(num: Int): Long = { > val start = System.currentTimeMillis > (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } > System.currentTimeMillis - start > } > // before commit de02782, averaged over multiple runs > makeMany(5000) == 1500 > // after commit de02782, averaged over multiple runs > makeMany(5000) == 2750 > {code} > Since we create multiple serializer instances per partition, this means a > 5000-partition stage will unconditionally see an increase of > 1s for the > stage. In page rank, we may run many such stages. 
> We should explore the alternative of reusing thread-local serializer > instances, which would lead to much fewer calls to `kryo.register`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance
[ https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12473: -- Description: After commit de02782 of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so this will only increase. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. was: After commit de02782 of page rank regressed from 242s to 260s, about 7%. Although currently it's only 7%, we will likely register more classes in the future so we should do this the right way. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. > Reuse serializer instances for performance > -- > > Key: SPARK-12473 > URL: https://issues.apache.org/jira/browse/SPARK-12473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After commit de02782 of page rank regressed from 242s to 260s, about 7%. > Although currently it's only 7%, we will likely register more classes in the > future so this will only increase. > The commit added 26 types to register every time we create a Kryo serializer > instance. 
I ran a small microbenchmark to prove that this is noticeably > expensive: > {code} > import org.apache.spark.serializer._ > import org.apache.spark.SparkConf > def makeMany(num: Int): Long = { > val start = System.currentTimeMillis > (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } > System.currentTimeMillis - start > } > // before commit de02782, averaged over multiple runs > makeMany(5000) == 1500 > // after commit de02782, averaged over multiple runs > makeMany(5000) == 2750 > {code} > Since we create multiple serializer instances per partition, this means a > 5000-partition stage will unconditionally see an increase of > 1s for the > stage. In page rank, we may run many such stages. > We should explore the alternative of reusing thread-local serializer > instances, which would lead to much fewer calls to `kryo.register`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: Apache Spark > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12473) Reuse serializer instances for performance
[ https://issues.apache.org/jira/browse/SPARK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12473: -- Description: After commit de02782 of page rank regressed from 242s to 260s, about 7%. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. We should explore the alternative of reusing thread-local serializer instances, which would lead to much fewer calls to `kryo.register`. was: After commit de02782 of page rank regressed from 242s to 260s, about 7%. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. > Reuse serializer instances for performance > -- > > Key: SPARK-12473 > URL: https://issues.apache.org/jira/browse/SPARK-12473 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > > After commit de02782 of page rank regressed from 242s to 260s, about 7%. > The commit added 26 types to register every time we create a Kryo serializer > instance. I ran a small microbenchmark to prove that this is noticeably > expensive: > {code} > import org.apache.spark.serializer._ > import org.apache.spark.SparkConf > def makeMany(num: Int): Long = { > val start = System.currentTimeMillis > (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } > System.currentTimeMillis - start > } > // before commit de02782, averaged over multiple runs > makeMany(5000) == 1500 > // after commit de02782, averaged over multiple runs > makeMany(5000) == 2750 > {code} > Since we create multiple serializer instances per partition, this means a > 5000-partition stage will unconditionally see an increase of > 1s for the > stage. In page rank, we may run many such stages. > We should explore the alternative of reusing thread-local serializer > instances, which would lead to much fewer calls to `kryo.register`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12473) Reuse serializer instances for performance
Andrew Or created SPARK-12473: - Summary: Reuse serializer instances for performance Key: SPARK-12473 URL: https://issues.apache.org/jira/browse/SPARK-12473 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Andrew Or Assignee: Andrew Or After commit de02782, the running time of page rank regressed from 242s to 260s, about 7%. The commit added 26 types to register every time we create a Kryo serializer instance. I ran a small microbenchmark to prove that this is noticeably expensive: {code} import org.apache.spark.serializer._ import org.apache.spark.SparkConf def makeMany(num: Int): Long = { val start = System.currentTimeMillis (1 to num).foreach { _ => new KryoSerializer(new SparkConf).newKryo() } System.currentTimeMillis - start } // before commit de02782, averaged over multiple runs makeMany(5000) == 1500 // after commit de02782, averaged over multiple runs makeMany(5000) == 2750 {code} Since we create multiple serializer instances per partition, this means a 5000-partition stage will unconditionally see an increase of > 1s for the stage. In page rank, we may run many such stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
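A minimal sketch of the thread-local reuse idea proposed above, under the assumption that the registration cost should be paid once per thread rather than on every newKryo() call; the names used here (serializer, cachedKryo) are illustrative, and the real change would live inside Spark's Kryo serialization code rather than in user code:
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// One serializer per JVM; each thread lazily builds and then reuses a single
// Kryo instance, so the type registrations happen once per thread, not per call.
val serializer = new KryoSerializer(new SparkConf)
val cachedKryo = new ThreadLocal[Kryo] {
  override def initialValue(): Kryo = serializer.newKryo()
}

// Call sites would fetch the cached instance instead of constructing a new one:
val kryo = cachedKryo.get()
{code}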
[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067345#comment-15067345 ] Josh Rosen commented on SPARK-11823: I think I spotted the problem; it may be a bad use of Thread.sleep() in a test: https://github.com/apache/spark/pull/6207/files#r30935200 > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Josh Rosen > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
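The linked review comment is not quoted here; as a general, hedged illustration of the kind of fix implied (replacing a fixed Thread.sleep() in a test with a bounded poll), ScalaTest's Eventually can retry a condition until it holds or a timeout expires. This is a sketch only, with a hypothetical serverIsUp() standing in for whatever the real test waits on:
{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

def serverIsUp(): Boolean = true  // hypothetical stand-in for the real readiness check

// Poll until the condition holds instead of sleeping for a fixed duration,
// so the test neither races ahead nor burns the full wait time on success.
eventually(timeout(60.seconds), interval(1.second)) {
  assert(serverIsUp())
}
{code}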
[jira] [Assigned] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-11823: -- Assignee: Josh Rosen > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp >Assignee: Josh Rosen > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11931) "org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test jdbc cancel" is sometimes very slow
[ https://issues.apache.org/jira/browse/SPARK-11931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-11931: -- Assignee: Josh Rosen > "org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test jdbc > cancel" is sometimes very slow > > > Key: SPARK-11931 > URL: https://issues.apache.org/jira/browse/SPARK-11931 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > > The "org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.test > jdbc cancel" test is sometimes very slow. It usually takes about 8 seconds to > run, but in some cases can take up to 5 minutes: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4874/testReport/org.apache.spark.sql.hive.thriftserver/HiveThriftBinaryServerSuite/test_jdbc_cancel/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11823) HiveThriftBinaryServerSuite tests timing out, leaves hanging processes
[ https://issues.apache.org/jira/browse/SPARK-11823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067314#comment-15067314 ] Josh Rosen commented on SPARK-11823: It looks like this has caused a huge number of timeouts in the Master Maven Hadoop 2.4 builds this week: https://spark-tests.appspot.com/jobs/Spark-Master-Maven-with-YARN%20%C2%BB%20hadoop-2.4%2Cspark-test I'm going to pull some logs and take a look. > HiveThriftBinaryServerSuite tests timing out, leaves hanging processes > -- > > Key: SPARK-11823 > URL: https://issues.apache.org/jira/browse/SPARK-11823 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: shane knapp > Attachments: > spark-jenkins-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-amp-jenkins-worker-05.out, > stack.log > > > i've noticed on a few branches that the HiveThriftBinaryServerSuite tests > time out, and when that happens, the build is aborted but the tests leave > behind hanging processes that eat up cpu and ram. > most recently, i discovered this happening w/the 1.6 SBT build, specifically > w/the hadoop 2.0 profile: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.6-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/56/console > [~vanzin] grabbed the jstack log, which i've attached to this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12414) Remove closure serializer
[ https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12414: -- Issue Type: Sub-task (was: Bug) Parent: SPARK-11806 > Remove closure serializer > - > > Key: SPARK-12414 > URL: https://issues.apache.org/jira/browse/SPARK-12414 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > > There is a config `spark.closure.serializer` that accepts exactly one value: > the java serializer. This is because there are currently bugs in the Kryo > serializer that make it not a viable candidate. This was uncovered by an > unsuccessful attempt to make it work: SPARK-7708. > My high level point is that the Java serializer has worked well for at least > 6 Spark versions now, and it is an incredibly complicated task to get other > serializers (not just Kryo) to work with Spark's closures. IMO the effort is > not worth it and we should just remove this documentation and all the code > associated with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10873) can't sort columns on history page
[ https://issues.apache.org/jira/browse/SPARK-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067304#comment-15067304 ] Alex Bozarth commented on SPARK-10873: -- Thanks [~zhuoliu], is your fix for this going to include adding the search feature (like in jQuery DataTables)? Otherwise I might look at picking up SPARK-10874 > can't sort columns on history page > -- > > Key: SPARK-10873 > URL: https://issues.apache.org/jira/browse/SPARK-10873 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Thomas Graves >Assignee: Zhuo Liu > > Starting with 1.5.1 the history server page isn't allowing sorting by column -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067301#comment-15067301 ] Martin Schade commented on SPARK-12453: --- That would do the trick as well. I wasn't sure which you would prefer. My PR just changes the version to the one that matches the KCL. > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
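Spark pins the SDK version in its own Maven build, which is what the PR changes. As a hedged illustration for an application that depends on spark-streaming-kinesis-asl, an sbt override could force the SDK version that matches KCL 1.3.0 per the report above; the coordinates below are the standard AWS artifact and are shown as an assumption, not a recommendation from this issue:
{code}
// build.sbt
dependencyOverrides += "com.amazonaws" % "aws-java-sdk" % "1.9.37"
{code}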
[jira] [Assigned] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12471: Assignee: Apache Spark > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li >Assignee: Apache Spark > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12458) Add ExpressionDescription to datetime functions
[ https://issues.apache.org/jira/browse/SPARK-12458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067278#comment-15067278 ] Dilip Biswal commented on SPARK-12458: -- I would like to work on this one. > Add ExpressionDescription to datetime functions > --- > > Key: SPARK-12458 > URL: https://issues.apache.org/jira/browse/SPARK-12458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12472) OOM when sort a table and save as parquet
Davies Liu created SPARK-12472: -- Summary: OOM when sort a table and save as parquet Key: SPARK-12472 URL: https://issues.apache.org/jira/browse/SPARK-12472 Project: Spark Issue Type: Bug Reporter: Davies Liu {code} t = sqlContext.table('store_sales') t.unionAll(t).coalesce(2).sortWithinPartitions(t[0]).write.partitionBy('ss_sold_date_sk').parquet("/tmp/ttt") {code} {code} 15/12/21 14:35:52 WARN TaskSetManager: Lost task 1.0 in stage 25.0 (TID 96, 192.168.0.143): java.lang.OutOfMemoryError: Java heap space at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:86) at org.apache.spark.util.collection.unsafe.sort.UnsafeSortDataFormat.allocate(UnsafeSortDataFormat.java:32) at org.apache.spark.util.collection.TimSort$SortState.ensureCapacity(TimSort.java:951) at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:699) at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:226) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187) at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170) at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244) at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:327) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:342) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
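A hedged workaround sketch, assuming the OOM comes from forcing the doubled table into only two partitions so that each partition's in-memory sort grows very large: keep more partitions so each sort buffer stays small. The table, column, and path come from the reproduction above; the partition count of 200 is an arbitrary illustrative value, and whether this avoids the OOM depends on data volume and executor memory:
{code}
val t = sqlContext.table("store_sales")
t.unionAll(t)
  .repartition(200)                        // instead of coalesce(2): smaller per-partition sorts
  .sortWithinPartitions(t(t.columns.head)) // same sort key as the reproduction
  .write.partitionBy("ss_sold_date_sk")
  .parquet("/tmp/ttt")
{code}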
[jira] [Assigned] (SPARK-12456) Add ExpressionDescription to misc functions
[ https://issues.apache.org/jira/browse/SPARK-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12456: Assignee: Apache Spark > Add ExpressionDescription to misc functions > --- > > Key: SPARK-12456 > URL: https://issues.apache.org/jira/browse/SPARK-12456 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12456) Add ExpressionDescription to misc functions
[ https://issues.apache.org/jira/browse/SPARK-12456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12456: Assignee: (was: Apache Spark) > Add ExpressionDescription to misc functions > --- > > Key: SPARK-12456 > URL: https://issues.apache.org/jira/browse/SPARK-12456 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12471: Assignee: (was: Apache Spark) > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12471: Assignee: Apache Spark > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li >Assignee: Apache Spark > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12471) Spark daemons should log their pid in the log file
[ https://issues.apache.org/jira/browse/SPARK-12471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067218#comment-15067218 ] Apache Spark commented on SPARK-12471: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/10422 > Spark daemons should log their pid in the log file > -- > > Key: SPARK-12471 > URL: https://issues.apache.org/jira/browse/SPARK-12471 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Nong Li > > This is useful when debugging from the log files without the processes > running. This information makes it possible to combine the log files with > other system information (e.g. dmesg output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12470: Assignee: Apache Spark > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Assignee: Apache Spark >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12470: Assignee: (was: Apache Spark) > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12299) Remove history serving functionality from standalone Master
[ https://issues.apache.org/jira/browse/SPARK-12299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12299: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-11806 > Remove history serving functionality from standalone Master > --- > > Key: SPARK-12299 > URL: https://issues.apache.org/jira/browse/SPARK-12299 > Project: Spark > Issue Type: Sub-task > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Andrew Or > > The standalone Master currently continues to serve the historical UIs of > applications that have completed and enabled event logging. This poses > problems, however, if the event log is very large, e.g. SPARK-6270. The > Master might OOM or hang while it rebuilds the UI, rejecting applications in > the mean time. > Personally, I have had to make modifications in the code to disable this > myself, because I wanted to use event logging in standalone mode for > applications that produce a lot of logging. > Removing this from the Master would simplify the process significantly. This > issue supersedes SPARK-12062. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12299) Remove history serving functionality from standalone Master
[ https://issues.apache.org/jira/browse/SPARK-12299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-12299: -- Assignee: Josh Rosen > Remove history serving functionality from standalone Master > --- > > Key: SPARK-12299 > URL: https://issues.apache.org/jira/browse/SPARK-12299 > Project: Spark > Issue Type: Sub-task > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Josh Rosen > > The standalone Master currently continues to serve the historical UIs of > applications that have completed and enabled event logging. This poses > problems, however, if the event log is very large, e.g. SPARK-6270. The > Master might OOM or hang while it rebuilds the UI, rejecting applications in > the mean time. > Personally, I have had to make modifications in the code to disable this > myself, because I wanted to use event logging in standalone mode for > applications that produce a lot of logging. > Removing this from the Master would simplify the process significantly. This > issue supersedes SPARK-12062. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12471) Spark daemons should log their pid in the log file
Nong Li created SPARK-12471: --- Summary: Spark daemons should log their pid in the log file Key: SPARK-12471 URL: https://issues.apache.org/jira/browse/SPARK-12471 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Nong Li This is useful when debugging from the log files without the processes running. This information makes it possible to combine the log files with other system information (e.g. dmesg output) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
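A minimal sketch of one common JVM idiom for obtaining the pid (the portion of the runtime MXBean name before the '@'); this is shown as an assumption about how such logging could be done, not as the actual patch:
{code}
import java.lang.management.ManagementFactory

// RuntimeMXBean#getName typically returns "pid@hostname" on HotSpot JVMs.
val processName = ManagementFactory.getRuntimeMXBean.getName
val pid = processName.split("@")(0)

// A daemon could then emit the pid once at startup, e.g.:
println(s"Started daemon with process name: $processName (pid $pid)")
{code}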
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067200#comment-15067200 ] mustafa elbehery commented on SPARK-5226: - A better implementation, based on a research paper on parallel DBSCAN, can be found here: https://github.com/irvingc/dbscan-on-spark. The approach solves the bottleneck of the reduce step, in which discovered clusters are merged. Hope it helps. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067201#comment-15067201 ] Apache Spark commented on SPARK-12470: -- User 'robbinspg' has created a pull request for this issue: https://github.com/apache/spark/pull/10421 > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
[ https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pete Robbins updated SPARK-12470: - Component/s: SQL Summary: Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner (was: Incorrect calculation of row size in o.a.s.catalyst.expressions.codegen.GenerateUnsafeRowJoiner) > Incorrect calculation of row size in > o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner > --- > > Key: SPARK-12470 > URL: https://issues.apache.org/jira/browse/SPARK-12470 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Pete Robbins >Priority: Minor > > While looking into https://issues.apache.org/jira/browse/SPARK-12319 I > noticed that the row size is incorrectly calculated. > The "sizeReduction" value is calculated in words: >// The number of words we can reduce when we concat two rows together. > // The only reduction comes from merging the bitset portion of the two > rows, saving 1 word. > val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords > but then it is subtracted from the size of the row in bytes: >|out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - > $sizeReduction); > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12363) PowerIterationClustering test case failed if we deprecated KMeans.setRuns
[ https://issues.apache.org/jira/browse/SPARK-12363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12363: Assignee: Apache Spark > PowerIterationClustering test case failed if we deprecated KMeans.setRuns > - > > Key: SPARK-12363 > URL: https://issues.apache.org/jira/browse/SPARK-12363 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > We plan to deprecated `runs` of KMeans, PowerIterationClustering will > leverage KMeans to train model. > I removed `setRuns` used in PowerIterationClustering, but one of the test > cases failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12470) Incorrect calculation of row size in o.a.s.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
Pete Robbins created SPARK-12470: Summary: Incorrect calculation of row size in o.a.s.catalyst.expressions.codegen.GenerateUnsafeRowJoiner Key: SPARK-12470 URL: https://issues.apache.org/jira/browse/SPARK-12470 Project: Spark Issue Type: Bug Affects Versions: 1.5.2 Reporter: Pete Robbins Priority: Minor While looking into https://issues.apache.org/jira/browse/SPARK-12319 I noticed that the row size is incorrectly calculated. The "sizeReduction" value is calculated in words: // The number of words we can reduce when we concat two rows together. // The only reduction comes from merging the bitset portion of the two rows, saving 1 word. val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords but then it is subtracted from the size of the row in bytes: |out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - $sizeReduction); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
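A hedged sketch of the correction implied by the description: sizeReduction is measured in 8-byte words, so it has to be converted to bytes before it is subtracted from sizeInBytes. The values below are hypothetical, and this only illustrates the unit conversion, not the generated code itself:
{code}
// Hypothetical word counts for the two input bitsets and the merged output bitset.
val bitset1Words = 1
val bitset2Words = 1
val outputBitsetWords = 1

val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords // in 8-byte words
val sizeInBytes = 64                                                // hypothetical row size in bytes
val correctedSizeInBytes = sizeInBytes - sizeReduction * 8          // convert words to bytes first
{code}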
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067177#comment-15067177 ] Stephen Boesch commented on SPARK-5226: --- It seems that Aliaksei has not been able to add it to spark-packages. So I presume DBSCAN is still an open ticket? > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12388) Change default compressor to LZ4
[ https://issues.apache.org/jira/browse/SPARK-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-12388. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10342 [https://github.com/apache/spark/pull/10342] > Change default compressor to LZ4 > > > Key: SPARK-12388 > URL: https://issues.apache.org/jira/browse/SPARK-12388 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > Labels: releasenotes > Fix For: 2.0.0 > > > According the benchmark [1], LZ4-java could be 80% (or 30%) faster than > Snappy. > After changing the compressor to LZ4, I saw 20% improvement on end-to-end > time for a TPCDS query (Q4). > [1] https://github.com/ning/jvm-compressor-benchmark/wiki -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
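The issue does not name the configuration key whose default changes; as an assumption, the general I/O codec setting is shown below for anyone who prefers to pin the codec explicitly rather than rely on a default:
{code}
import org.apache.spark.SparkConf

// "lz4" and "snappy" are both accepted values for this key.
val conf = new SparkConf().set("spark.io.compression.codec", "lz4")
{code}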
[jira] [Commented] (SPARK-12339) NullPointerException on stage kill from web UI
[ https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067154#comment-15067154 ] Andrew Or commented on SPARK-12339: --- I've updated the affected version to 2.0 since SPARK-11206 was merged only there. Please let me know if this is not the case. > NullPointerException on stage kill from web UI > -- > > Key: SPARK-12339 > URL: https://issues.apache.org/jira/browse/SPARK-12339 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > The following message is in the logs after killing a stage: > {code} > scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33) > INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32) > WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): > TaskKilled (killed intentionally) > WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): > TaskKilled (killed intentionally) > INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, > from pool > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > {code} > To reproduce, start a job and kill the stage from web UI, e.g.: > {code} > val rdd = sc.parallelize(0 to 9, 2) > rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it > }.count > {code} > Go to web UI and in Stages tab click "kill" for the stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067156#comment-15067156 ] Sean Owen commented on SPARK-12453: --- Ah, I misread the PR; it already just removes aws.java.sdk.version and the manual management of the dependency. Just deleting the version and the dependencyManagement entry does the trick, right? > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5882) Add a test for GraphLoader.edgeListFile
[ https://issues.apache.org/jira/browse/SPARK-5882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-5882. -- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Add a test for GraphLoader.edgeListFile > --- > > Key: SPARK-5882 > URL: https://issues.apache.org/jira/browse/SPARK-5882 > Project: Spark > Issue Type: Test > Components: GraphX >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Trivial > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12392) Optimize a location order of broadcast blocks by considering preferred local hosts
[ https://issues.apache.org/jira/browse/SPARK-12392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12392. --- Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > Optimize a location order of broadcast blocks by considering preferred local > hosts > -- > > Key: SPARK-12392 > URL: https://issues.apache.org/jira/browse/SPARK-12392 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro > Fix For: 2.0.0 > > > When multiple workers exist in a host, we can bypass unnecessary remote > access for broadcasts; block managers fetch broadcast blocks from the same > host instead of remote hosts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12339) NullPointerException on stage kill from web UI
[ https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-12339: -- Affects Version/s: (was: 1.6.0) 2.0.0 > NullPointerException on stage kill from web UI > -- > > Key: SPARK-12339 > URL: https://issues.apache.org/jira/browse/SPARK-12339 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > The following message is in the logs after killing a stage: > {code} > scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33) > INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32) > WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): > TaskKilled (killed intentionally) > WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): > TaskKilled (killed intentionally) > INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, > from pool > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > {code} > To reproduce, start a job and kill the stage from web UI, e.g.: > {code} > val rdd = sc.parallelize(0 to 9, 2) > rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it > }.count > {code} > Go to web UI and in Stages tab click "kill" for the stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12339) NullPointerException on stage kill from web UI
[ https://issues.apache.org/jira/browse/SPARK-12339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12339. --- Resolution: Fixed Assignee: Alex Bozarth (was: Apache Spark) Fix Version/s: 2.0.0 Target Version/s: 2.0.0 > NullPointerException on stage kill from web UI > -- > > Key: SPARK-12339 > URL: https://issues.apache.org/jira/browse/SPARK-12339 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jacek Laskowski >Assignee: Alex Bozarth > Fix For: 2.0.0 > > > The following message is in the logs after killing a stage: > {code} > scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33) > INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32) > WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): > TaskKilled (killed intentionally) > WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): > TaskKilled (killed intentionally) > INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, > from pool > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > ERROR LiveListenerBus: Listener SQLListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at 
scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) > {code} > To reproduce, start a job and kill the stage from web UI, e.g.: > {code} > val rdd = sc.parallelize(0 to 9, 2) > rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it > }.count > {code} > Go to web UI and in Stages tab click "kill" for the stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
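The stack trace points at SQLListener.onTaskEnd, which suggests the listener assumes task metrics are always present even when a task ends as TaskKilled. As an illustration only (not Spark's actual fix, and the listener name is hypothetical), a user-side listener can guard against missing metrics like this:
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener: skip the update when metrics are absent,
// e.g. for tasks that ended with TaskKilled.
class NullSafeTaskEndListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      println(s"stage ${taskEnd.stageId}: task ran for ${metrics.executorRunTime} ms")
    }
  }
}

// Registration against a live SparkContext:
// sc.addSparkListener(new NullSafeTaskEndListener())
{code}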
[jira] [Resolved] (SPARK-12466) Harmless Master NPE in tests
[ https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-12466. --- Resolution: Fixed > Harmless Master NPE in tests > > > Key: SPARK-12466 > URL: https://issues.apache.org/jira/browse/SPARK-12466 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.6.1, 2.0.0 > > > {code} > [info] ReplayListenerSuite: > [info] - Simple replay (58 milliseconds) > java.lang.NullPointerException > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [info] - End-to-end replay (10 seconds, 755 milliseconds) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull > caused by https://github.com/apache/spark/pull/10284 > Thanks to [~ted_yu] for reporting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-2331. -- Resolution: Fixed Fix Version/s: 2.0.0 > SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T] > -- > > Key: SPARK-2331 > URL: https://issues.apache.org/jira/browse/SPARK-2331 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Ian Hummel >Assignee: Reynold Xin >Priority: Minor > Fix For: 2.0.0 > > > The return type for SparkContext.emptyRDD is EmptyRDD[T]. > It should be RDD[T]. That means you have to add extra type annotations on > code like the below (which creates a union of RDDs over some subset of paths > in a folder) > {code} > val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { > (rdd, path) ⇒ > rdd.union(sc.textFile(path)) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
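Until the return type changes, the extra annotation can also be avoided by ascribing the seed value; a minimal sketch of that workaround, assuming sc is a live SparkContext:
{code}
import org.apache.spark.rdd.RDD

val paths = Seq("a", "b", "c")
// Ascribing the seed to RDD[String] lets foldLeft infer the right type,
// so the foldLeft[RDD[String]] annotation from the example above is not needed.
val union = paths.foldLeft(sc.emptyRDD[String]: RDD[String]) { (rdd, path) =>
  rdd.union(sc.textFile(path))
}
{code}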
[jira] [Updated] (SPARK-12440) Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Priority: Trivial (was: Major) > Avoid setCheckpointDir warning when filesystem is not local > --- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans >Priority: Trivial > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
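A sketch of the scenario being described, assuming fs.defaultFS in the cluster's Hadoop configuration points at HDFS, so the relative path below actually resolves to a non-local filesystem despite looking local:
{code}
// Relative path, no scheme: resolved against the default (HDFS) filesystem,
// yet the current check can still emit the local-directory warning.
sc.setCheckpointDir("checkpoints/myjob")

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint()
rdd.count()  // the action materializes the checkpoint
{code}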
[jira] [Updated] (SPARK-12440) Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Summary: Avoid setCheckpointDir warning when filesystem is not local (was: [CORE] Avoid setCheckpointDir warning when filesystem is not local) > Avoid setCheckpointDir warning when filesystem is not local > --- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067125#comment-15067125 ] Martin Schade commented on SPARK-12453: --- Makes sense, thank you. Ideally it should be 1.9.37 instead of 1.9.40, though: both KCL 1.4.0 and KPL 0.10.1 reference 1.9.37. https://github.com/awslabs/amazon-kinesis-producer/blob/v0.10.1/java/amazon-kinesis-producer/pom.xml https://github.com/awslabs/amazon-kinesis-client/blob/v1.4.0/pom.xml In the latest version of KPL (v0.10.2) 1.10.34 is referenced, and in the latest KCL (1.6.1) it is 1.10.20, so the versions are not easy to keep in sync. It would take some testing to see which combination actually works. > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
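For applications hitting this on 1.5.2, one workaround is to pin the SDK at build time. A build.sbt sketch under the assumption that the discussion above is right about 1.9.37 being the version KCL 1.3.0/1.4.0 expect (re-verify the pin against your own KCL/KPL versions):
{code}
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.5.2"
)

// Force the SDK version that KCL 1.3.0 / 1.4.0 were built against.
dependencyOverrides += "com.amazonaws" % "aws-java-sdk" % "1.9.37"
{code}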
[jira] [Updated] (SPARK-12440) [CORE] Avoid setCheckpointDir warning when filesystem is not local
[ https://issues.apache.org/jira/browse/SPARK-12440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12440: -- Component/s: Spark Core > [CORE] Avoid setCheckpointDir warning when filesystem is not local > -- > > Key: SPARK-12440 > URL: https://issues.apache.org/jira/browse/SPARK-12440 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.2, 1.6.0, 1.6.1 >Reporter: Pierre Borckmans > > In SparkContext method `setCheckpointDir`, a warning is issued when spark > master is not local and the passed directory for the checkpoint dir appears > to be local. > In practice, when relying on hdfs configuration file and using relative path > (incomplete URI without hdfs scheme, ...), this warning should not be issued > and might be confusing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12463: -- Component/s: Mesos > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
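A sketch of what the unified configuration would look like once the Mesos-specific key is gone; the values below are the standard standalone recovery settings, shown for illustration rather than as the final key set:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.deploy.recoveryMode", "ZOOKEEPER")          // instead of spark.deploy.mesos.recoveryMode
  .set("spark.deploy.zookeeper.url", "zk1:2181,zk2:2181")
  .set("spark.deploy.zookeeper.dir", "/spark")
{code}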
[jira] [Resolved] (SPARK-12374) Improve performance of Range APIs via adding logical/physical operators
[ https://issues.apache.org/jira/browse/SPARK-12374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12374. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10335 [https://github.com/apache/spark/pull/10335] > Improve performance of Range APIs via adding logical/physical operators > --- > > Key: SPARK-12374 > URL: https://issues.apache.org/jira/browse/SPARK-12374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Critical > Fix For: 2.0.0 > > > Creating an actual logical/physical operator for range for matching the > performance of RDD Range APIs. > Compared with the old Range API, the new version is 3 times faster than the > old version. > {code} > scala> val startTime = System.currentTimeMillis; sqlContext.oldRange(0, > 10, 1, 15).count(); val endTime = System.currentTimeMillis; val start > = new Timestamp(startTime); val end = new Timestamp(endTime); val elapsed = > (endTime - startTime)/ 1000.0 > startTime: Long = 1450416394240 > > endTime: Long = 1450416421199 > start: java.sql.Timestamp = 2015-12-17 21:26:34.24 > end: java.sql.Timestamp = 2015-12-17 21:27:01.199 > elapsed: Double = 26.959 > {code} > {code} > scala> val startTime = System.currentTimeMillis; sqlContext.range(0, > 10, 1, 15).count(); val endTime = System.currentTimeMillis; val start > = new Timestamp(startTime); val end = new Timestamp(endTime); val elapsed = > (endTime - startTime)/ 1000.0 > startTime: Long = 1450416360107 > > endTime: Long = 1450416368590 > start: java.sql.Timestamp = 2015-12-17 21:26:00.107 > end: java.sql.Timestamp = 2015-12-17 21:26:08.59 > elapsed: Double = 8.483 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12150) numPartitions argument to sqlContext.range() should be optional
[ https://issues.apache.org/jira/browse/SPARK-12150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12150. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10335 [https://github.com/apache/spark/pull/10335] > numPartitions argument to sqlContext.range() should be optional > > > Key: SPARK-12150 > URL: https://issues.apache.org/jira/browse/SPARK-12150 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Henri DF >Priority: Minor > Fix For: 2.0.0 > > > It's a little inconsistent that the first two sqlContext.range() methods > don't take a numPartitions arg, while the third one does. > And more importantly, it's a little inconvenient that the numPartitions arg > is mandatory for the third range() method - it means that if you want to > specify a step, you suddenly have to think about partitioning - an orthogonal > concern. > My suggestion would be to make numPartitions optional, like it is on the > sparkContext.range(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
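A sketch of the asymmetry being described, assuming a SQLContext named sqlContext and the Spark 1.6 signatures:
{code}
// Fine today: numPartitions is not required when no step is given.
val a = sqlContext.range(100L)
val b = sqlContext.range(0L, 100L)

// As soon as a step is needed, numPartitions becomes mandatory:
val c = sqlContext.range(0L, 100L, 2L, 4)   // start, end, step, numPartitions

// The request: allow a step without forcing a partitioning decision, e.g.
// val d = sqlContext.range(0L, 100L, 2L)
{code}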
[jira] [Commented] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067115#comment-15067115 ] Sean Owen commented on SPARK-12453: --- OK, I see what happened here: https://github.com/apache/spark/commit/87f82a5fb9c4350a97c761411069245f07aad46f How about updating to 1.9.40 for consistency? really, it sounds like there's no point manually setting the SDK version here -- how about preemptively bringing those parts of SPARK-12269 back? Then really it should go into master first, and be backported, and then further updated by 12269. This is why I view it as sort of a duplicate, since it could as well come from back-porting just a subset of 12269. I don't know if a new 1.5.x release will happen. > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12466) Harmless Master NPE in tests
[ https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067112#comment-15067112 ] Apache Spark commented on SPARK-12466: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/10417 > Harmless Master NPE in tests > > > Key: SPARK-12466 > URL: https://issues.apache.org/jira/browse/SPARK-12466 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.6.1, 2.0.0 > > > {code} > [info] ReplayListenerSuite: > [info] - Simple replay (58 milliseconds) > java.lang.NullPointerException > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [info] - End-to-end replay (10 seconds, 755 milliseconds) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull > caused by https://github.com/apache/spark/pull/10284 > Thanks to [~ted_yu] for reporting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12466) Harmless Master NPE in tests
[ https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12466: Assignee: Apache Spark (was: Andrew Or) > Harmless Master NPE in tests > > > Key: SPARK-12466 > URL: https://issues.apache.org/jira/browse/SPARK-12466 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Apache Spark > Fix For: 1.6.1, 2.0.0 > > > {code} > [info] ReplayListenerSuite: > [info] - Simple replay (58 milliseconds) > java.lang.NullPointerException > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [info] - End-to-end replay (10 seconds, 755 milliseconds) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull > caused by https://github.com/apache/spark/pull/10284 > Thanks to [~ted_yu] for reporting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12466) Harmless Master NPE in tests
[ https://issues.apache.org/jira/browse/SPARK-12466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12466: Assignee: Andrew Or (was: Apache Spark) > Harmless Master NPE in tests > > > Key: SPARK-12466 > URL: https://issues.apache.org/jira/browse/SPARK-12466 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 1.6.0 >Reporter: Andrew Or >Assignee: Andrew Or > Fix For: 1.6.1, 2.0.0 > > > {code} > [info] ReplayListenerSuite: > [info] - Simple replay (58 milliseconds) > java.lang.NullPointerException > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982) > at > org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:117) > at scala.concurrent.Future$$anonfun$onSuccess$1.apply(Future.scala:115) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) > at > com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) > at scala.concurrent.Promise$class.complete(Promise.scala:55) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:23) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > [info] - End-to-end replay (10 seconds, 755 milliseconds) > {code} > https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull > caused by https://github.com/apache/spark/pull/10284 > Thanks to [~ted_yu] for reporting. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Schade reopened SPARK-12453: --- This is a bug specifically in 1.5.2 and is not fixed by updating the aws-java-sdk version on master. Hence I created a pull request for the 1.5.2 branch, not for master. The ticket SPARK-12269 won't fix this on the 1.5 branch from what I can see. Not sure if that is the right way to fix an issue on the branch, but it seemed the most reasonable to me. > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12453) Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version
[ https://issues.apache.org/jira/browse/SPARK-12453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12453. --- Resolution: Duplicate > Spark Streaming Kinesis Example broken due to wrong AWS Java SDK version > > > Key: SPARK-12453 > URL: https://issues.apache.org/jira/browse/SPARK-12453 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Martin Schade >Priority: Critical > Labels: easyfix > > The Spark Streaming Kinesis Example (kinesis-asl) is broken due to wrong AWS > Java SDK version (1.9.16) referenced with the AWS KCL version (1.3.0). > AWS KCL 1.3.0 references AWS Java SDK version 1.9.37. > Using 1.9.16 in combination with 1.3.0 does fail to get data out of the > stream. > I tested Spark Streaming with 1.9.37 and it works fine. > Testing a simple KCL client outside of Spark with 1.3.0 and 1.9.16 also > fails, so it is due to the specific versions used in 1.5.2 and not a Spark > related implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: (was: Apache Spark) > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12247: Assignee: (was: Apache Spark) > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12247: Assignee: Apache Spark > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 1.5.2 >Reporter: Timothy Hunter >Assignee: Apache Spark > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12468: Assignee: Apache Spark > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Assignee: Apache Spark >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12468) getParamMap in Pyspark ML API returns empty dictionary in example for Documentation
[ https://issues.apache.org/jira/browse/SPARK-12468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067080#comment-15067080 ] Zachary Brown commented on SPARK-12468: --- Found a possible fix for this by modifying the `_fit()` method of the JavaEstimator class in `python/pyspark/ml/wrapper.py` to update the paramMap of the returned model. Created a pull request for it here: https://github.com/apache/spark/pull/10419 > getParamMap in Pyspark ML API returns empty dictionary in example for > Documentation > --- > > Key: SPARK-12468 > URL: https://issues.apache.org/jira/browse/SPARK-12468 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2 >Reporter: Zachary Brown >Priority: Minor > > The `extractParamMap()` method for a model that has been fit returns an empty > dictionary, e.g. (from the [Pyspark ML API > Documentation](http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param)): > ```python > from pyspark.mllib.linalg import Vectors > from pyspark.ml.classification import LogisticRegression > from pyspark.ml.param import Param, Params > # Prepare training data from a list of (label, features) tuples. > training = sqlContext.createDataFrame([ > (1.0, Vectors.dense([0.0, 1.1, 0.1])), > (0.0, Vectors.dense([2.0, 1.0, -1.0])), > (0.0, Vectors.dense([2.0, 1.3, 1.0])), > (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"]) > # Create a LogisticRegression instance. This instance is an Estimator. > lr = LogisticRegression(maxIter=10, regParam=0.01) > # Print out the parameters, documentation, and any default values. > print "LogisticRegression parameters:\n" + lr.explainParams() + "\n" > # Learn a LogisticRegression model. This uses the parameters stored in lr. > model1 = lr.fit(training) > # Since model1 is a Model (i.e., a transformer produced by an Estimator), > # we can view the parameters it used during fit(). > # This prints the parameter (name: value) pairs, where names are unique IDs > for this > # LogisticRegression instance. > print "Model 1 was fit using parameters: " > print model1.extractParamMap() > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12469) Consistent Accumulators for Spark
holdenk created SPARK-12469: --- Summary: Consistent Accumulators for Spark Key: SPARK-12469 URL: https://issues.apache.org/jira/browse/SPARK-12469 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: holdenk Tasks executed on Spark workers are unable to modify values from the driver, and accumulators are the one exception to this. Accumulators in Spark are implemented in such a way that when a stage is recomputed (say for cache eviction) the accumulator will be updated a second time. This makes accumulators inside of transformations more difficult to use for things like counting invalid records (one of the primary potential use cases of collecting side information during a transformation). However, in some cases this counting during re-evaluation is exactly the behaviour we want (say in tracking total execution time for a particular function). Spark would benefit from a version of accumulators which did not double count even if stages were re-executed. Motivating example: {code} val parseTime = sc.accumulator(0L) val parseFailures = sc.accumulator(0L) val parsedData = sc.textFile(...).flatMap { line => val start = System.currentTimeMillis() val parsed = Try(parse(line)) if (parsed.isFailure) parseFailures += 1 parseTime += System.currentTimeMillis() - start parsed.toOption } parsedData.cache() val resultA = parsedData.map(...).filter(...).count() // some intervening code. Almost anything could happen here -- some of parsedData may // get kicked out of the cache, or an executor where data was cached might get lost val resultB = parsedData.filter(...).map(...).flatMap(...).count() // now we look at the accumulators {code} Here we would want parseFailures to have been incremented only once for every line that failed to parse. Unfortunately, the current Spark accumulator API doesn't support the parseFailures use case, since if some data has been evicted it's possible that it will be double counted. See the full design document at https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
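Until such a feature exists, one workaround under the current API is to carry the counts in the data itself, so a recomputed partition recomputes its count rather than adding to it. A sketch along the lines of the motivating example, where parse() and the input path are placeholders as in the example above:
{code}
import scala.util.Try

val lines = sc.textFile("hdfs:///path/to/input")   // placeholder path
val parsedWithFlag = lines.map { line =>
  val parsed = Try(parse(line))                    // parse() as in the example above
  (parsed.toOption, if (parsed.isFailure) 1L else 0L)
}
parsedWithFlag.cache()

// Recomputing a lost partition recomputes its flags too, so this count
// cannot be inflated by re-execution the way an accumulator can.
val parseFailures = parsedWithFlag.map(_._2).reduce(_ + _)
val parsedData    = parsedWithFlag.flatMap(_._1)
{code}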
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: Apache Spark > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12463: Assignee: Apache Spark > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: Apache Spark > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12463) Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode
[ https://issues.apache.org/jira/browse/SPARK-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067065#comment-15067065 ] Apache Spark commented on SPARK-12463: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > > > Key: SPARK-12463 > URL: https://issues.apache.org/jira/browse/SPARK-12463 > Project: Spark > Issue Type: Task >Reporter: Timothy Chen > > Remove spark.deploy.mesos.recoveryMode and use spark.deploy.recoveryMode > configuration for cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12464) Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url
[ https://issues.apache.org/jira/browse/SPARK-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067062#comment-15067062 ] Andrew Or commented on SPARK-12464: --- By the way for future reference you probably don't need a separate issue for each config. Just have an issue that says `Remove spark.deploy.mesos.* and use spark.deploy.* instead`. Since you already opened these we can just keep them. > Remove spark.deploy.mesos.zookeeper.url and use spark.deploy.zookeeper.url > -- > > Key: SPARK-12464 > URL: https://issues.apache.org/jira/browse/SPARK-12464 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen >Assignee: Apache Spark > > Remove spark.deploy.mesos.zookeeper.url and use existing configuration > spark.deploy.zookeeper.url for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15067059#comment-15067059 ] Apache Spark commented on SPARK-12465: -- User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/10057 > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-12321: - Assignee: Wenchen Fan > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12465) Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir
[ https://issues.apache.org/jira/browse/SPARK-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12465: Assignee: (was: Apache Spark) > Remove spark.deploy.mesos.zookeeper.dir and use spark.deploy.zookeeper.dir > -- > > Key: SPARK-12465 > URL: https://issues.apache.org/jira/browse/SPARK-12465 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Timothy Chen > > Remove spark.deploy.mesos.zookeeper.dir and use existing configuration > spark.deploy.zookeeper.dir for Mesos cluster mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12321) JSON format for logical/physical execution plans
[ https://issues.apache.org/jira/browse/SPARK-12321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12321. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10311 [https://github.com/apache/spark/pull/10311] > JSON format for logical/physical execution plans > > > Key: SPARK-12321 > URL: https://issues.apache.org/jira/browse/SPARK-12321 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org