[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095588#comment-15095588 ] Sun Rui commented on SPARK-6817: Attached the first draft design doc, please review and give comments > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095590#comment-15095590 ] Sun Rui commented on SPARK-6817: [~mpollock], this PR will support row-based UDF. UDF operating on columns may be supported after R UDAF is supported. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12373) Type coercion rule of dividing two decimal values may choose an intermediate precision that does not have enough number of digits at the left of decimal point
[ https://issues.apache.org/jira/browse/SPARK-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12373: - Target Version/s: 2.0.0 (was: 1.6.1, 2.0.0) > Type coercion rule of dividing two decimal values may choose an intermediate > precision that does not have enough number of digits at the left of decimal > point > --- > > Key: SPARK-12373 > URL: https://issues.apache.org/jira/browse/SPARK-12373 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai > > Looks like the {{widerDecimalType}} at > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L432 > can produce something like {{(38, 38)}} when we have two operand types > {{Decimal(38, 0)}} and {{Decimal(38, 38)}}. We should take a look at whether there > is a more reasonable way to handle precision/scale. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
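To make the precision problem concrete, the following is a minimal sketch of how a naive "wider decimal" rule can arrive at Decimal(38, 38); the {{DecimalType}} case class and {{widerDecimalType}} method below are simplified stand-ins, not the actual HiveTypeCoercion code.

{code}
// Simplified stand-in for the rule discussed above, not Spark's implementation.
case class DecimalType(precision: Int, scale: Int)

object WiderDecimalSketch extends App {
  val MaxPrecision = 38

  def widerDecimalType(d1: DecimalType, d2: DecimalType): DecimalType = {
    // A wider type should keep the larger scale and the larger number of
    // integral digits (precision - scale) of the two operands.
    val scale = math.max(d1.scale, d2.scale)
    val integralDigits = math.max(d1.precision - d1.scale, d2.precision - d2.scale)
    // Capping only the total precision at 38 silently drops integral digits.
    DecimalType(math.min(integralDigits + scale, MaxPrecision), scale)
  }

  // Decimal(38, 0) and Decimal(38, 38) need 38 integral plus 38 fractional
  // digits; the cap collapses that to Decimal(38, 38), which cannot hold
  // any value >= 1.
  println(widerDecimalType(DecimalType(38, 0), DecimalType(38, 38)))
}
{code}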
[jira] [Updated] (SPARK-10538) java.lang.NegativeArraySizeException during join
[ https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mayxine updated SPARK-10538: Attachment: java.lang.NegativeArraySizeException.png > java.lang.NegativeArraySizeException during join > > > Key: SPARK-10538 > URL: https://issues.apache.org/jira/browse/SPARK-10538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Maciej Bryński >Assignee: Davies Liu > Attachments: java.lang.NegativeArraySizeException.png, > screenshot-1.png > > > Hi, > I've got a problem during joining tables in PySpark. (in my example 20 of > them) > I can observe that during calculation of first partition (on one of > consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) > vs on others partitions (approx. 272.5 KB / 113 record) > I can also observe that just before the crash python process going up to few > gb of RAM. > After some time there is an exception: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90) > at > org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > I'm running this on 2 nodes cluster (12 cores, 64 GB RAM) > Config: > {code} > spark.driver.memory 10g > spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC > -Dfile.encoding=UTF8 > spark.executor.memory 60g > spark.storage.memoryFraction0.05 > spark.shuffle.memoryFraction0.75 > spark.driver.maxResultSize 10g > spark.cores.max 24 > spark.kryoserializer.buffer.max 1g > spark.default.parallelism 200 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095600#comment-15095600 ] Sun Rui commented on SPARK-6817: [~shivaram] I will first focus on the row-based UDF functionality. For high-level APIs like dapply(), I think that needs UDAF support, which is not included in this PR yet. I can create a new JIRA for supporting R UDAF. Any comments? > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12558) AnalysisException when multiple functions applied in GROUP BY clause
[ https://issues.apache.org/jira/browse/SPARK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12558: - Assignee: Dilip Biswal > AnalysisException when multiple functions applied in GROUP BY clause > > > Key: SPARK-12558 > URL: https://issues.apache.org/jira/browse/SPARK-12558 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Assignee: Dilip Biswal > > Hi, > I have following issue when trying to use functions in group by clause. > Example: > {code} > sqlCtx = HiveContext(sc) > rdd = sc.parallelize([{'test_date': 1451400761}]) > df = sqlCtx.createDataFrame(rdd) > df.registerTempTable("df") > {code} > Now, where I'm using single function it's OK. > {code} > sqlCtx.sql("select cast(test_date as timestamp) from df group by > cast(test_date as timestamp)").collect() > [Row(test_date=datetime.datetime(2015, 12, 29, 15, 52, 41))] > {code} > Where I'm using more than one function I'm getting AnalysisException > {code} > sqlCtx.sql("select date(cast(test_date as timestamp)) from df group by > date(cast(test_date as timestamp))").collect() > Py4JJavaError: An error occurred while calling o38.sql. > : org.apache.spark.sql.AnalysisException: expression 'test_date' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12796) initial prototype: projection/filter/range
[ https://issues.apache.org/jira/browse/SPARK-12796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12796: Assignee: Apache Spark (was: Davies Liu) > initial prototype: projection/filter/range > -- > > Key: SPARK-12796 > URL: https://issues.apache.org/jira/browse/SPARK-12796 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12796) initial prototype: projection/filter/range
[ https://issues.apache.org/jira/browse/SPARK-12796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095710#comment-15095710 ] Apache Spark commented on SPARK-12796: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/10735 > initial prototype: projection/filter/range > -- > > Key: SPARK-12796 > URL: https://issues.apache.org/jira/browse/SPARK-12796 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12796) initial prototype: projection/filter/range
[ https://issues.apache.org/jira/browse/SPARK-12796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12796: Assignee: Davies Liu (was: Apache Spark) > initial prototype: projection/filter/range > -- > > Key: SPARK-12796 > URL: https://issues.apache.org/jira/browse/SPARK-12796 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu >Assignee: Davies Liu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721 ] Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:57 AM: - [~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't think the UDF should depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented. was (Author: rxin): [~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't want the UDF to depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721 ] Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:58 AM: - [~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't think the UDF should depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented. In order to support the row-oriented API efficiently, we'd need to replicate all the infrastructure built for Python. I don't think that is maintainable in the long run. was (Author: rxin): [~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't think the UDF should depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12792) Refactor RRDD to support R UDF
Sun Rui created SPARK-12792: --- Summary: Refactor RRDD to support R UDF Key: SPARK-12792 URL: https://issues.apache.org/jira/browse/SPARK-12792 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.6.0 Reporter: Sun Rui Extract the logic in compute() to a new class named RRunner, similar to PythonRunner in the PythonRDD class. It can be used to run R UDFs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
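As a rough illustration of the refactoring described above, the extracted runner might take a shape like the following; the class name RRunner comes from the ticket, but the constructor parameters and method signature here are purely illustrative.

{code}
// Illustrative sketch only; the parameter list and types are assumptions, not
// the committed interface.
import org.apache.spark.TaskContext

class RRunner[IN, OUT](
    func: Array[Byte],            // serialized R closure
    deserializer: String,         // format of data sent to the R worker
    serializer: String,           // format of data read back from the R worker
    packageNames: Array[Byte],
    broadcastVars: Array[Byte]) {

  // The body of RRDD.compute() would move here: launch (or reuse) the R
  // worker process, stream the partition to it, and read results back.
  def compute(inputIterator: Iterator[IN],
              partitionIndex: Int,
              context: TaskContext): Iterator[OUT] = {
    // Placeholder: the real implementation returns an iterator over the
    // worker's output stream.
    Iterator.empty
  }
}
{code}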
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095594#comment-15095594 ] Sun Rui commented on SPARK-6817: [~piccolbo] I am not sure if I understand your meaning. This is to support UDFs written in R code. Spark already supports Scala/Python UDFs. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12797) Aggregation without grouping keys
Davies Liu created SPARK-12797: -- Summary: Aggregation without grouping keys Key: SPARK-12797 URL: https://issues.apache.org/jira/browse/SPARK-12797 Project: Spark Issue Type: New Feature Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12796) initial prototype: projection/filter/range
Davies Liu created SPARK-12796: -- Summary: initial prototype: projection/filter/range Key: SPARK-12796 URL: https://issues.apache.org/jira/browse/SPARK-12796 Project: Spark Issue Type: New Feature Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095734#comment-15095734 ] Jeff Zhang edited comment on SPARK-6817 at 1/13/16 7:09 AM: +1 on block based API, UDF would usually call other R packages and most of R packages are block based (R's dataframe), and this lead performance gain. was (Author: zjffdu): +1 on block based API, UDF would usually call other R packages and most of R packages are for block based (R's dataframe), and this lead performance gain. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095756#comment-15095756 ] Weiqiang Zhuang commented on SPARK-6817: We did see both apply use cases. But the block/group/column oriented apply is more important if we can have it earlier. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sun Rui updated SPARK-6817: --- Attachment: SparkR UDF Design Documentation v1.pdf > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12558) AnalysisException when multiple functions applied in GROUP BY clause
[ https://issues.apache.org/jira/browse/SPARK-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12558. -- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10520 [https://github.com/apache/spark/pull/10520] > AnalysisException when multiple functions applied in GROUP BY clause > > > Key: SPARK-12558 > URL: https://issues.apache.org/jira/browse/SPARK-12558 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Assignee: Dilip Biswal > Fix For: 2.0.0, 1.6.1 > > > Hi, > I have following issue when trying to use functions in group by clause. > Example: > {code} > sqlCtx = HiveContext(sc) > rdd = sc.parallelize([{'test_date': 1451400761}]) > df = sqlCtx.createDataFrame(rdd) > df.registerTempTable("df") > {code} > Now, where I'm using single function it's OK. > {code} > sqlCtx.sql("select cast(test_date as timestamp) from df group by > cast(test_date as timestamp)").collect() > [Row(test_date=datetime.datetime(2015, 12, 29, 15, 52, 41))] > {code} > Where I'm using more than one function I'm getting AnalysisException > {code} > sqlCtx.sql("select date(cast(test_date as timestamp)) from df group by > date(cast(test_date as timestamp))").collect() > Py4JJavaError: An error occurred while calling o38.sql. > : org.apache.spark.sql.AnalysisException: expression 'test_date' is neither > present in the group by, nor is it an aggregate function. Add to group by or > wrap in first() (or first_value) if you don't care which value you get.; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12798) Broadcast hash join
Davies Liu created SPARK-12798: -- Summary: Broadcast hash join Key: SPARK-12798 URL: https://issues.apache.org/jira/browse/SPARK-12798 Project: Spark Issue Type: New Feature Reporter: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12728) Integrate SQL generation feature with native view
[ https://issues.apache.org/jira/browse/SPARK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12728: Assignee: (was: Apache Spark) > Integrate SQL generation feature with native view > - > > Key: SPARK-12728 > URL: https://issues.apache.org/jira/browse/SPARK-12728 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12785) Implement columnar in memory representation
[ https://issues.apache.org/jira/browse/SPARK-12785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12785. - Resolution: Fixed Assignee: Nong Li Fix Version/s: 2.0.0 > Implement columnar in memory representation > --- > > Key: SPARK-12785 > URL: https://issues.apache.org/jira/browse/SPARK-12785 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > Fix For: 2.0.0 > > > Tungsten can benefit from having a columnar in memory representation which > can provide a few benefits: > - Enables vectorized execution > - Improves memory efficiency (memory is more tightly packed) > - Enables cheap serialization/zero-copy transfer with third party components > (e.g. numpy) > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
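The benefits listed above are easiest to see with a toy columnar batch: one primitive array per column instead of one object per row. The sketch below only illustrates the layout; it is not the ColumnVector/ColumnarBatch code added by this issue.

{code}
// Toy column: values are tightly packed in primitive arrays, so a scan is a
// simple loop over contiguous memory (vectorization-friendly, cheap to copy).
final class IntColumn(capacity: Int) {
  private val values = new Array[Int](capacity)
  private val isNull = new Array[Boolean](capacity)

  def put(i: Int, v: Int): Unit = { values(i) = v; isNull(i) = false }
  def putNull(i: Int): Unit = { isNull(i) = true }
  def get(i: Int): Int = values(i)
  def nullAt(i: Int): Boolean = isNull(i)
}

object ColumnarScanDemo extends App {
  val n = 1 << 20
  val col = new IntColumn(n)
  (0 until n).foreach(i => col.put(i, i % 100))

  // Vectorized-style aggregation: a tight loop over the packed array, with no
  // per-row object allocation or virtual calls.
  var i = 0
  var sum = 0L
  while (i < n) { if (!col.nullAt(i)) sum += col.get(i); i += 1 }
  println(sum)
}
{code}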
[jira] [Created] (SPARK-12790) Remove HistoryServer old multiple files format
Andrew Or created SPARK-12790: - Summary: Remove HistoryServer old multiple files format Key: SPARK-12790 URL: https://issues.apache.org/jira/browse/SPARK-12790 Project: Spark Issue Type: Sub-task Components: Deploy Reporter: Andrew Or HistoryServer has 2 formats. The old one makes a directory and puts multiple files in there (APPLICATION_COMPLETE, EVENT_LOG1 etc.). The new one has just 1 file called local_2593759238651.log or something. It's been a nightmare to maintain both code paths. We should just remove the old legacy format (which has been out of use for many versions now) when we still have the chance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12791) Simplify CaseWhen by breaking "branches" into "conditions" and "values"
[ https://issues.apache.org/jira/browse/SPARK-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095537#comment-15095537 ] Apache Spark commented on SPARK-12791: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/10734 > Simplify CaseWhen by breaking "branches" into "conditions" and "values" > --- > > Key: SPARK-12791 > URL: https://issues.apache.org/jira/browse/SPARK-12791 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12791) Simplify CaseWhen by breaking "branches" into "conditions" and "values"
[ https://issues.apache.org/jira/browse/SPARK-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12791: Assignee: Apache Spark (was: Reynold Xin) > Simplify CaseWhen by breaking "branches" into "conditions" and "values" > --- > > Key: SPARK-12791 > URL: https://issues.apache.org/jira/browse/SPARK-12791 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095605#comment-15095605 ] Sun Rui commented on SPARK-12172: - As Spark is migrating from RDD API to Dataset API, after Dataset API is supported in SparkR, we can remove RDD API > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12172) Consider removing SparkR internal RDD APIs
[ https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095605#comment-15095605 ] Sun Rui edited comment on SPARK-12172 at 1/13/16 4:50 AM: -- As Spark is migrating from RDD API to Dataset API, after Dataset API is supported in SparkR, we can remove RDD API. But I am not sure if Dataset API is mature enough in 2.0. was (Author: sunrui): As Spark is migrating from RDD API to Dataset API, after Dataset API is supported in SparkR, we can remove RDD API > Consider removing SparkR internal RDD APIs > -- > > Key: SPARK-12172 > URL: https://issues.apache.org/jira/browse/SPARK-12172 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Felix Cheung > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095734#comment-15095734 ] Jeff Zhang commented on SPARK-6817: --- +1 on a block-based API. UDFs would usually call other R packages, and most R packages are block-based (operating on R's data.frame), which leads to a performance gain. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12800) Subtle bug on Spark Yarn Client under Kerberos Security Mode
Chester created SPARK-12800: --- Summary: Subtle bug on Spark Yarn Client under Kerberos Security Mode Key: SPARK-12800 URL: https://issues.apache.org/jira/browse/SPARK-12800 Project: Spark Issue Type: Bug Affects Versions: 1.5.2, 1.5.1 Reporter: Chester

Version used: Spark 1.5.1 (1.5.2-SNAPSHOT). Deployment mode: Yarn-Cluster.

Problem observed: When running a Spark job directly from YarnClient (without using spark-submit; I did not verify whether spark-submit has the same issue) with Kerberos security enabled, the first run of the job always fails because Hadoop considers the job to be in SIMPLE mode rather than Kerberos mode. Without shutting down the JVM, running the same job again passes. If the JVM is restarted, the first run fails again.

The cause: Tracking down the source of the issue, the problem seems to lie in the Spark YARN client. In Client.prepareLocalResources() (around line 266 of Client.scala), the following line is called:

{code}
YarnSparkHadoopUtil.get.obtainTokensForNamenodes(nns, hadoopConf, credentials)
{code}

YarnSparkHadoopUtil.get is in turn initialized via reflection:

{code}
object SparkHadoopUtil {
  private val hadoop = {
    val yarnMode = java.lang.Boolean.valueOf(
      System.getProperty("SPARK_YARN_MODE", System.getenv("SPARK_YARN_MODE")))
    if (yarnMode) {
      try {
        Utils.classForName("org.apache.spark.deploy.yarn.YarnSparkHadoopUtil")
          .newInstance()
          .asInstanceOf[SparkHadoopUtil]
      } catch {
        case e: Exception => throw new SparkException("Unable to load YARN support", e)
      }
    } else {
      new SparkHadoopUtil
    }
  }

  def get: SparkHadoopUtil = {
    hadoop
  }
}

class SparkHadoopUtil extends Logging {
  private val sparkConf = new SparkConf()
  val conf: Configuration = newConfiguration(sparkConf)
  UserGroupInformation.setConfiguration(conf)
  // rest of the class
}
{code}

Here SparkHadoopUtil creates an empty SparkConf, builds a Hadoop Configuration from it, and passes that to UserGroupInformation.setConfiguration(conf). Because the UserGroupInformation authentication method is static, this wipes out the security settings: UserGroupInformation.isSecurityEnabled() changes from true to false, so the token call above fails. Since SparkHadoopUtil.hadoop is a static, immutable value, it is not created again on the next run, UserGroupInformation.setConfiguration(conf) is not called again, and subsequent Spark jobs pass.

The workaround:

{code}
// First initialize SparkHadoopUtil, which creates the static instance and
// sets UserGroupInformation to an empty Hadoop Configuration.
// We need to reset the UserGroupInformation after that.
val util = SparkHadoopUtil.get
UserGroupInformation.setConfiguration(hadoopConf)
{code}

Then call client.run().

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12771) Improve code generation for CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-12771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12771: Assignee: (was: Apache Spark) > Improve code generation for CaseWhen > > > Key: SPARK-12771 > URL: https://issues.apache.org/jira/browse/SPARK-12771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The generated code for CaseWhen uses a control variable "got" to make sure we > do not evaluate more branches once a branch is true. Changing that to > generate just simple "if / else" would be slightly more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12771) Improve code generation for CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-12771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12771: Assignee: Apache Spark > Improve code generation for CaseWhen > > > Key: SPARK-12771 > URL: https://issues.apache.org/jira/browse/SPARK-12771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > The generated code for CaseWhen uses a control variable "got" to make sure we > do not evaluate more branches once a branch is true. Changing that to > generate just simple "if / else" would be slightly more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12771) Improve code generation for CaseWhen
[ https://issues.apache.org/jira/browse/SPARK-12771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095790#comment-15095790 ] Apache Spark commented on SPARK-12771: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/10737 > Improve code generation for CaseWhen > > > Key: SPARK-12771 > URL: https://issues.apache.org/jira/browse/SPARK-12771 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > The generated code for CaseWhen uses a control variable "got" to make sure we > do not evaluate more branches once a branch is true. Changing that to > generate just simple "if / else" would be slightly more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
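For reference, the change proposed in SPARK-12771 boils down to the difference between the two shapes below. The actual codegen emits Java source; this is a hand-written Scala illustration of the control flow, not the generated code itself.

{code}
// Shape of evaluating CASE WHEN c1 THEN v1 WHEN c2 THEN v2 ELSE v3 END.
object CaseWhenCodegenShape {
  // Current style: a "got" flag guards every branch, so each branch pays an
  // extra flag test even after a match has been found.
  def withFlag(c1: Boolean, c2: Boolean, v1: Int, v2: Int, v3: Int): Int = {
    var got = false
    var result = 0
    if (!got && c1) { result = v1; got = true }
    if (!got && c2) { result = v2; got = true }
    if (!got) { result = v3 }
    result
  }

  // Proposed style: plain nested if / else, no flag to maintain or test.
  def withIfElse(c1: Boolean, c2: Boolean, v1: Int, v2: Int, v3: Int): Int =
    if (c1) v1
    else if (c2) v2
    else v3
}
{code}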
[jira] [Commented] (SPARK-12635) More efficient (column batch) serialization for Python/R
[ https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095494#comment-15095494 ] Sun Rui commented on SPARK-12635: - [~dselivanov] PySpark uses pickle and CloudPickle on python side and net.razorvine.pickle on JVM side for data serialization/deserialization between Python and JVM. While there lacks a library similar to net.razorvine.pickle which can deserialize from and serialize to R serialization format. So currently, SparkR depends on ReadBin()/writeBin() on R side and DataInputStream/DataOutputStream for serialization/deserialization between R and JVM, based on the fact that for simple types like integer, double, array byte, they shares the same format. For collect(), the serialization/deserialization happens along with the communication via socket. I suspect there are much communication overhead occurring during many socket reads/writes. Maybe we can change the behavior in batch way, that is, serialize part of the collection result into a buffer in memory and transfer it back. Would you interested in doing a prototype and see if there is any performance improvement? Another idea would be introduce something like net.razorvine.pickle, but that sounds a lot of effort. > More efficient (column batch) serialization for Python/R > > > Key: SPARK-12635 > URL: https://issues.apache.org/jira/browse/SPARK-12635 > Project: Spark > Issue Type: New Feature > Components: PySpark, SparkR, SQL >Reporter: Reynold Xin > > Serialization between Scala / Python / R is pretty slow. Python and R both > work pretty well with column batch interface (e.g. numpy arrays). Technically > we should be able to just pass column batches around with minimal > serialization (maybe even zero copy memory). > Note that this depends on some internal refactoring to use a column batch > interface in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12635) More efficient (column batch) serialization for Python/R
[ https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095494#comment-15095494 ] Sun Rui edited comment on SPARK-12635 at 1/13/16 2:35 AM: -- [~dselivanov] PySpark uses pickle and CloudPickle on python side and net.razorvine.pickle on JVM side for data serialization/deserialization between Python and JVM. While there lacks a library similar to net.razorvine.pickle which can deserialize from and serialize to R serialization format. So currently, SparkR depends on ReadBin()/writeBin() on R side and Java DataInputStream/DataOutputStream for serialization/deserialization between R and JVM, based on the fact that for simple types like integer, double, byte array, they share the same format. For collect(), the serialization/deserialization happens along with the communication via socket. I suspect there are much communication overhead occurring during many socket reads/writes. Maybe we can change the behavior in batch way, that is, serialize part of the collection result into a buffer in memory and transfer it back. Would you interested in doing a prototype and see if there is any performance improvement? Another idea would be introduce something like net.razorvine.pickle, but that sounds a lot of effort. was (Author: sunrui): [~dselivanov] PySpark uses pickle and CloudPickle on python side and net.razorvine.pickle on JVM side for data serialization/deserialization between Python and JVM. While there lacks a library similar to net.razorvine.pickle which can deserialize from and serialize to R serialization format. So currently, SparkR depends on ReadBin()/writeBin() on R side and DataInputStream/DataOutputStream for serialization/deserialization between R and JVM, based on the fact that for simple types like integer, double, array byte, they shares the same format. For collect(), the serialization/deserialization happens along with the communication via socket. I suspect there are much communication overhead occurring during many socket reads/writes. Maybe we can change the behavior in batch way, that is, serialize part of the collection result into a buffer in memory and transfer it back. Would you interested in doing a prototype and see if there is any performance improvement? Another idea would be introduce something like net.razorvine.pickle, but that sounds a lot of effort. > More efficient (column batch) serialization for Python/R > > > Key: SPARK-12635 > URL: https://issues.apache.org/jira/browse/SPARK-12635 > Project: Spark > Issue Type: New Feature > Components: PySpark, SparkR, SQL >Reporter: Reynold Xin > > Serialization between Scala / Python / R is pretty slow. Python and R both > work pretty well with column batch interface (e.g. numpy arrays). Technically > we should be able to just pass column batches around with minimal > serialization (maybe even zero copy memory). > Note that this depends on some internal refactoring to use a column batch > interface in Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
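A minimal sketch of the batching idea mentioned above, JVM side only: serialize a chunk of values into one in-memory buffer and hand it to the socket in a single write, instead of one small write per value. The method name is illustrative and not SparkR's actual SerDe API; it assumes the R side reads the buffer with readBin(..., endian = "big").

{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}

object BatchSerDeSketch {
  // Pack a whole batch of doubles into one byte buffer.
  def serializeDoubleBatch(values: Array[Double]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(values.length)        // element count for the reader
    values.foreach(out.writeDouble)    // big-endian doubles, readable via readBin() in R
    out.flush()
    bytes.toByteArray                  // one buffer -> one socket write
  }
}
{code}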
[jira] [Resolved] (SPARK-12788) Simplify BooleanEquality by using casts
[ https://issues.apache.org/jira/browse/SPARK-12788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12788. - Resolution: Fixed Fix Version/s: 2.0.0 > Simplify BooleanEquality by using casts > --- > > Key: SPARK-12788 > URL: https://issues.apache.org/jira/browse/SPARK-12788 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12793) Support R UDF Evaluation
Sun Rui created SPARK-12793: --- Summary: Support R UDF Evaluation Key: SPARK-12793 URL: https://issues.apache.org/jira/browse/SPARK-12793 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.6.0 Reporter: Sun Rui Basically follows the logic as that for Python UDF evaluation (org/apache/spark/sql/execution/python.scala). Will extract and reuse common logic between Python UDF and R UDF evaluation. Serialization/deserialization is different from Python UDF. R UDF will use R SerDe to directly serialize a batch of InternalRows into bytes (that is, in SerializationFormats.ROW format) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12794) Support Defining and Registration of R UDF
Sun Rui created SPARK-12794: --- Summary: Support Defining and Registration of R UDF Key: SPARK-12794 URL: https://issues.apache.org/jira/browse/SPARK-12794 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 1.6.0 Reporter: Sun Rui Create UserDefinedRFunction class in Scala similar to UserDefinedPythonFunction class. Support registering R UDF in UDFRegistration class. Implement udf() function in functions.R. Implement registerFunction() in SQLContext.R. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
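A hypothetical sketch of the Scala side described above, modeled on the UserDefinedPythonFunction analogy in the ticket; the field names and the registration call shown in the comment are assumptions, not a committed API.

{code}
import org.apache.spark.sql.types.DataType

// Illustrative only: carries everything the executor needs to run the R closure.
case class UserDefinedRFunction(
    name: String,
    func: Array[Byte],            // serialized R closure
    packageNames: Array[Byte],    // R packages to load on the worker
    broadcastVars: Array[Byte],
    dataType: DataType)           // declared return type of the UDF

// Registration would then be a thin wrapper, conceptually something like
//   sqlContext.udf.registerR("myUpper", serializedClosure, StringType)
// after which both SQL ("SELECT myUpper(name) FROM people") and the R-side
// registerFunction() in SQLContext.R resolve to this definition.
{code}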
[jira] [Commented] (SPARK-12692) Scala style: check no white space before comma and colon
[ https://issues.apache.org/jira/browse/SPARK-12692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095714#comment-15095714 ] Apache Spark commented on SPARK-12692: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/10736 > Scala style: check no white space before comma and colon > > > Key: SPARK-12692 > URL: https://issues.apache.org/jira/browse/SPARK-12692 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > > We should not put a white space before `,` and `:` so let's check it. > Because there are lots of style violation, first, I'd like to add a checker, > enable and let the level `warn`. > Then, I'd like to fix the style step by step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095747#comment-15095747 ] Reynold Xin commented on SPARK-6817: Please take a look at the original design doc for this: https://docs.google.com/document/d/1xa8gB705QFybQD7qEe-NcZZOtkfA1YY-eVhyaXtAtOM/edit > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095745#comment-15095745 ] Sun Rui commented on SPARK-6817: [~rxin] Row-oriented R UDFs are for SQL and are similar to Python UDFs. I am not making the R UDF depend on RRDD; rather, I am abstracting the reusable logic so that it can be shared by RRDD and R UDFs, which again is similar to Python UDFs. I don't know what "block" means in "block oriented API"; something like GroupedData? I think that depends on UDAF support, which will come after UDF support. Maybe I misunderstand something? > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12799) Simplify various string output for expressions
Reynold Xin created SPARK-12799: --- Summary: Simplify various string output for expressions Key: SPARK-12799 URL: https://issues.apache.org/jira/browse/SPARK-12799 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin We currently have "sql", "prettyString", "toString". The default implementation of "prettyString" is simply "toString" but replaced the AttributeReferences with PrettyAttributes. I think we can just remove the existing "sql" one, and rename "prettyString" to "sql". We might need to do a little bit cleanup to make the prettyString work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12728) Integrate SQL generation feature with native view
[ https://issues.apache.org/jira/browse/SPARK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095478#comment-15095478 ] Apache Spark commented on SPARK-12728: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/10733 > Integrate SQL generation feature with native view > - > > Key: SPARK-12728 > URL: https://issues.apache.org/jira/browse/SPARK-12728 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12791) Simplify CaseWhen by breaking "branches" into "conditions" and "values"
[ https://issues.apache.org/jira/browse/SPARK-12791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12791: Assignee: Reynold Xin (was: Apache Spark) > Simplify CaseWhen by breaking "branches" into "conditions" and "values" > --- > > Key: SPARK-12791 > URL: https://issues.apache.org/jira/browse/SPARK-12791 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12791) Simplify CaseWhen by breaking "branches" into "conditions" and "values"
Reynold Xin created SPARK-12791: --- Summary: Simplify CaseWhen by breaking "branches" into "conditions" and "values" Key: SPARK-12791 URL: https://issues.apache.org/jira/browse/SPARK-12791 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4226) SparkSQL - Add support for subqueries in predicates
[ https://issues.apache.org/jira/browse/SPARK-4226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095695#comment-15095695 ] Apache Spark commented on SPARK-4226: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/10706 > SparkSQL - Add support for subqueries in predicates > --- > > Key: SPARK-4226 > URL: https://issues.apache.org/jira/browse/SPARK-4226 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.2.0 > Environment: Spark 1.2 snapshot >Reporter: Terry Siu > > I have a test table defined in Hive as follows: > {code:sql} > CREATE TABLE sparkbug ( > id INT, > event STRING > ) STORED AS PARQUET; > {code} > and insert some sample data with ids 1, 2, 3. > In a Spark shell, I then create a HiveContext and then execute the following > HQL to test out subquery predicates: > {code} > val hc = HiveContext(hc) > hc.hql("select customerid from sparkbug where customerid in (select > customerid from sparkbug where customerid in (2,3))") > {code} > I get the following error: > {noformat} > java.lang.RuntimeException: Unsupported language features in query: select > customerid from sparkbug where customerid in (select customerid from sparkbug > where customerid in (2,3)) > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_SUBQUERY_EXPR > TOK_SUBQUERY_OP > in > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_FUNCTION > in > TOK_TABLE_OR_COL > customerid > 2 > 3 > TOK_TABLE_OR_COL > customerid > scala.NotImplementedError: No parse rules for ASTNode type: 817, text: > TOK_SUBQUERY_EXPR : > TOK_SUBQUERY_EXPR > TOK_SUBQUERY_OP > in > TOK_QUERY > TOK_FROM > TOK_TABREF > TOK_TABNAME > sparkbug > TOK_INSERT > TOK_DESTINATION > TOK_DIR > TOK_TMP_FILE > TOK_SELECT > TOK_SELEXPR > TOK_TABLE_OR_COL > customerid > TOK_WHERE > TOK_FUNCTION > in > TOK_TABLE_OR_COL > customerid > 2 > 3 > TOK_TABLE_OR_COL > customerid > " + > > org.apache.spark.sql.hive.HiveQl$.nodeToExpr(HiveQl.scala:1098) > > at scala.sys.package$.error(package.scala:27) > at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:252) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50) > at > org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49) > at > scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136) > {noformat} > [This > thread|http://apache-spark-user-list.1001560.n3.nabble.com/Subquery-in-having-clause-Spark-1-1-0-td17401.html] > also brings up lack of subquery support in SparkSQL. It would be nice to > have subquery predicate support in a near, future release (1.3, maybe?). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
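Until subquery predicates are supported, the IN (subquery) form can usually be rewritten as a LEFT SEMI JOIN, which the HiveContext parser does accept. The sketch below reuses the reporter's hc and sparkbug table from the example above, and assumes the plain IN-list semantics of that example (the rewrite is not equivalent for NOT IN with NULLs).

{code}
// Hypothetical workaround, reusing the reporter's HiveContext `hc`.
val rewritten = hc.sql(
  """SELECT a.customerid
    |FROM sparkbug a
    |LEFT SEMI JOIN (SELECT customerid FROM sparkbug WHERE customerid IN (2, 3)) b
    |  ON a.customerid = b.customerid""".stripMargin)
rewritten.collect()
{code}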
[jira] [Updated] (SPARK-12795) Whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-12795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12795: --- Description: Whole stage codegen is used by some modern MPP databases to achieve great performance. See http://www.vldb.org/pvldb/vol4/p539-neumann.pdf For Spark SQL, we can compile multiple operators into a single Java function to avoid the overhead from materializing rows and Scala iterators. was:Compile multiple operator into a single Java function to avoid the overhead from materialize rows and Scala iterator > Whole stage codegen > --- > > Key: SPARK-12795 > URL: https://issues.apache.org/jira/browse/SPARK-12795 > Project: Spark > Issue Type: Epic >Reporter: Davies Liu >Assignee: Davies Liu > > Whole stage codegen is used by some modern MPP databases to achieve great > performance. See http://www.vldb.org/pvldb/vol4/p539-neumann.pdf > For Spark SQL, we can compile multiple operators into a single Java function > to avoid the overhead from materializing rows and Scala iterators. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
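As a plain-Scala illustration of what the epic is after (not Spark's actual generated Java), compare a chain of iterators, where every operator adds a virtual call and an intermediate row per record, with the same filter and projection fused into one loop:

{code}
object WholeStageIdeaSketch extends App {
  val input: Array[Int] = (1 to 1000000).toArray

  // Volcano / iterator style: each operator wraps the previous iterator.
  def iteratorStyle: Long =
    input.iterator.filter(_ % 2 == 0).map(_ * 3L).sum

  // Fused ("whole stage") style: one tight loop, no per-record iterator overhead.
  def fusedStyle: Long = {
    var i = 0
    var acc = 0L
    while (i < input.length) {
      val v = input(i)
      if (v % 2 == 0) acc += v * 3L
      i += 1
    }
    acc
  }

  assert(iteratorStyle == fusedStyle)
  println(fusedStyle)
}
{code}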
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095776#comment-15095776 ] Antonio Piccolboni commented on SPARK-6817: --- My question made sense only wrt the block or vectorized design. If you are implementing plain-vanilla UDFs in R, my question is meaningless. The performance implications of calling an R function for each row are ominous, so I am not sure why you are going down this path. Imagine you want to add a column with random numbers from a distribution. You can use a regular UDF on each row or a block UDF on a block of a million rows. That means a single R call vs a million.

{code}
> system.time(rnorm(10^6))
   user  system elapsed
  0.089   0.002   0.092
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed
  4.272   0.317   4.588
{code}

That's 45 times slower. Plus R is chock-full of vectorized functions. There are no builtin scalar types in R. So there are plenty of examples of block UDFs that one can write in R efficiently (no interpreter loops of any sort). > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12795) Whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-12795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-12795: --- Summary: Whole stage codegen (was: Compile multiple operator into a single Java function to avoid the overhead from materialize rows and Scala iterator) > Whole stage codegen > --- > > Key: SPARK-12795 > URL: https://issues.apache.org/jira/browse/SPARK-12795 > Project: Spark > Issue Type: Epic >Reporter: Davies Liu >Assignee: Davies Liu > > Compile multiple operator into a single Java function to avoid the overhead > from materialize rows and Scala iterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12795) Compile multiple operator into a single Java function to avoid the overhead from materialize rows and Scala iterator
Davies Liu created SPARK-12795: -- Summary: Compile multiple operator into a single Java function to avoid the overhead from materialize rows and Scala iterator Key: SPARK-12795 URL: https://issues.apache.org/jira/browse/SPARK-12795 Project: Spark Issue Type: Epic Reporter: Davies Liu Assignee: Davies Liu Compile multiple operator into a single Java function to avoid the overhead from materialize rows and Scala iterator -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721 ] Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:57 AM: - [~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't want the UDF to depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented. was (Author: rxin): [~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes a lot more sense. I also don't want the UDF to depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721 ] Reynold Xin commented on SPARK-6817: [~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes a lot more sense. I also don't want the UDF to depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095776#comment-15095776 ] Antonio Piccolboni edited comment on SPARK-6817 at 1/13/16 7:41 AM: My question made sense only with respect to the block, or vectorized, design. If you are implementing plain-vanilla UDFs in R, my question is meaningless. The performance implications of calling an R function for each row are ominous, so I am not sure why you are going down this path. Imagine you want to add a column with random numbers from a distribution. You can use a regular UDF on each row or a block UDF on a block of a million rows. That means a single R call vs. a million:
{code}
> system.time(rnorm(10^6))
   user  system elapsed
  0.089   0.002   0.092
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed
  4.272   0.317   4.588
{code}
That's about 45 times slower. Plus, R is chock-full of vectorized functions and there are no built-in scalar types in R, so there are plenty of block UDFs that one can write efficiently in R (no interpreter loops of any sort). was (Author: piccolbo): My question made sense only wrt the block or vectorized design. If you are implementing plain-vanilla UDFs in R, my questions is meaningless. The performance implications of calling an R function for each row are ominous so I am not sure why you are going down this path. Imagine you want to add a column with random numbers from a distribution. You can use a regular UDF on each row or a block UDF on a block of a million rows. That means a single R call vs a million. system.time(rnorm(10^6)) user system elapsed 0.089 0.002 0.092 > z = rep_len(1, 10^6); system.time(sapply(z, rnorm)) user system elapsed 4.272 0.317 4.588 That's 45 times slower. Plus R is choke full of vectorized functions. There are no builtin scalar types in R. So there are plenty of examples of block UDF that one can write in R efficiently (no interpreter loops of any sort. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > Attachments: SparkR UDF Design Documentation v1.pdf > > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12770) Implement rules for branch elimination for CaseWhen in SimplifyConditionals
[ https://issues.apache.org/jira/browse/SPARK-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12770: Description: There are a few things we can do: 1. If a branch's condition is a true literal, remove the CaseWhen and use the value from that branch. 2. If a branch's condition is a false or null literal, remove that branch. 3. If only the else branch is left, remove the CaseWhen and use the value from the else branch. was: There are a few things we can do: 1. If a branch is a true literal, remove the CaseWhen and use the value from that branch. 2. If a branch is a literal that is false or null, remove that branch. 3. If only the else branch is left, remove the CaseWhen and use the value from the else branch. > Implement rules for branch elimination for CaseWhen in SimplifyConditionals > --- > > Key: SPARK-12770 > URL: https://issues.apache.org/jira/browse/SPARK-12770 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Reporter: Reynold Xin > > There are a few things we can do: > 1. If a branch's condition is a true literal, remove the CaseWhen and use the > value from that branch. > 2. If a branch's condition is a false or null literal, remove that branch. > 3. If only the else branch is left, remove the CaseWhen and use the value > from the else branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
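To make the three rewrites concrete, here is a small self-contained model of the intended logic; it deliberately does not use Catalyst's real CaseWhen or SimplifyConditionals classes, whose shapes differ:
{code}
// A branch condition is either statically known to be true (Some(true)),
// statically known to be false-or-null (Some(false)), or unknown (None).
case class Branch(knownCondition: Option[Boolean], value: String)
case class CaseWhenModel(branches: Seq[Branch], elseValue: String)

// Left(v) means the whole CaseWhen collapsed to the single value v.
def simplify(cw: CaseWhenModel): Either[String, CaseWhenModel] = {
  // Rule 2: a branch whose condition is a false or null literal can never fire.
  val reachable = cw.branches.filterNot(_.knownCondition == Some(false))
  reachable.headOption match {
    // Rule 1: if the first remaining condition is a true literal, the whole
    // expression is just that branch's value.
    case Some(first) if first.knownCondition == Some(true) => Left(first.value)
    // Rule 3: nothing is left but the else branch.
    case None => Left(cw.elseValue)
    case _ => Right(CaseWhenModel(reachable, cw.elseValue))
  }
}

// CASE WHEN false THEN 'a' WHEN true THEN 'b' ELSE 'c' END simplifies to 'b':
assert(simplify(CaseWhenModel(
  Seq(Branch(Some(false), "a"), Branch(Some(true), "b")), "c")) == Left("b"))
{code}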
[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources
[ https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093533#comment-15093533 ] Santiago M. Mola commented on SPARK-12449: -- Implementing this interface or an equivalent one would help standardize a lot of advanced features that data sources have been doing for some time. And while doing so, it would prevent them from creating their own SQLContext variants or patching the running SQLContext at runtime (using extraStrategies). Here's a list of data sources that are currently using this approach. It would also be good to take them into account for this JIRA; the proposed interface and strategy should probably support all of these use cases. Some of them also use their own catalog implementation, but that should be something for a separate JIRA. *spark-sql-on-hbase* Already mentioned by [~yzhou2001]. They are using HBaseContext with extraStrategies that inject HBaseStrategies doing aggregation push down: https://github.com/Huawei-Spark/Spark-SQL-on-HBase/blob/master/src/main/scala/org/apache/spark/sql/hbase/execution/HBaseStrategies.scala *memsql-spark-connector* They offer either their own SQLContext or injection of their MemSQL-specific push-down strategy at runtime. They do match Catalyst's LogicalPlan in the same way we're proposing, to push down filters, projects, aggregates, limits, sorts and joins: https://github.com/memsql/memsql-spark-connector/blob/master/connectorLib/src/main/scala/com/memsql/spark/pushdown/MemSQLPushdownStrategy.scala *spark-iqmulus* A strategy is injected to push down counts and some aggregates: https://github.com/IGNF/spark-iqmulus/blob/master/src/main/scala/fr/ign/spark/iqmulus/ExtraStrategies.scala *druid-olap* They use the SparkPlanner, Strategy and LogicalPlan APIs to do extensive push down. Their API usage could be limited to LogicalPlan only if this JIRA is implemented: https://github.com/SparklineData/spark-druid-olap/blob/master/src/main/scala/org/apache/spark/sql/sources/druid/ *magellan* _(probably out of scope)_ Does its own BroadcastJoin, although it seems to me that this usage would be out of scope for us. https://github.com/harsha2010/magellan/blob/master/src/main/scala/magellan/execution/MagellanStrategies.scala > Pushing down arbitrary logical plans to data sources > > > Key: SPARK-12449 > URL: https://issues.apache.org/jira/browse/SPARK-12449 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Stephan Kessler > Attachments: pushingDownLogicalPlans.pdf > > > With the help of the DataSource API we can pull data from external sources > for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows > pushing down filters and projects, pruning unnecessary fields and rows > directly in the data source. > However, data sources such as SQL engines are capable of doing even more > preprocessing, e.g., evaluating aggregates. This is beneficial because it > would reduce the amount of data transferred from the source to Spark. The > existing interfaces do not allow this kind of processing in the source. > We would propose to add a new interface {{CatalystSource}} that allows > deferring the processing of arbitrary logical plans to the data source. We have > already shown the details at the Spark Summit 2015 Europe > [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/] > I will add a design document explaining the details.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
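For readers unfamiliar with the extraStrategies mechanism the connectors above rely on, here is a minimal, hedged sketch of how a push-down strategy is injected at runtime. The strategy object and the commented-out physical node are hypothetical; a SQLContext named {{sqlContext}} is assumed to be in scope, and only the {{extraStrategies}} hook and the {{Strategy}} contract are Spark API.
{code}
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy: claim the parts of the plan the external engine can
// evaluate itself, and let Spark plan everything else.
object ExamplePushdownStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // case p if externalEngineCanRun(p) => ExternalScanExec(p) :: Nil  // hypothetical physical node
    case _ => Nil  // returning Nil falls back to Spark's built-in strategies
  }
}

// Injected at runtime, which is what the connectors listed above do today:
sqlContext.experimental.extraStrategies =
  ExamplePushdownStrategy +: sqlContext.experimental.extraStrategies
{code}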
[jira] [Created] (SPARK-12771) Improve code generation for CaseWhen
Reynold Xin created SPARK-12771: --- Summary: Improve code generation for CaseWhen Key: SPARK-12771 URL: https://issues.apache.org/jira/browse/SPARK-12771 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin The generated code for CaseWhen uses a control variable "got" to make sure we do not evaluate more branches once a branch is true. Changing that to generate just simple "if / else" would be slightly more efficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
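Purely as an illustration of the shape of the generated code (the identifiers below are made up and this is not what the code generator actually emits verbatim), the change proposed here is roughly:
{code}
// Today: a "got" flag guards every branch, so every branch is still visited.
val withFlag =
  """boolean got = false;
    |Object result = null;
    |if (!got && cond1) { result = value1; got = true; }
    |if (!got && cond2) { result = value2; got = true; }
    |if (!got) { result = elseValue; }""".stripMargin

// Proposed: plain chained if / else, which skips the remaining branches
// as soon as one condition matches.
val withIfElse =
  """Object result;
    |if (cond1) { result = value1; }
    |else if (cond2) { result = value2; }
    |else { result = elseValue; }""".stripMargin
{code}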
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093511#comment-15093511 ] Konstantin Shaposhnikov commented on SPARK-2984: I am seeing the same error message with Spark 1.6 and HDFS. This happens after an earlier job failure (ClassCastException) > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. I think the > error condition might manifest in this circumstance: > 1) task T starts on a executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) 
> at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {noformat} > -- Chen Song at > http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFiles-file-not-found-exception-td10686.html > {noformat} > I am running a Spark Streaming job that uses saveAsTextFiles to save results > into hdfs files. However, it has an exception after 20 batches > result-140631234/_temporary/0/task_201407251119__m_03 does not > exist. > {noformat} > and > {noformat} > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): > No lease on /apps/data/vddil/real-time/checkpoint/temp: File does not exist. > Holder DFSClient_NONMAPREDUCE_327993456_13 does not have any open files. > at >
[jira] [Created] (SPARK-12772) Better error message for parsing failure?
Reynold Xin created SPARK-12772: --- Summary: Better error message for parsing failure? Key: SPARK-12772 URL: https://issues.apache.org/jira/browse/SPARK-12772 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin {code} scala> sql("select case if(true, 'one', 'two')").explain(true) org.apache.spark.sql.AnalysisException: org.antlr.runtime.EarlyExitException line 1:34 required (...)+ loop did not match anything at input '' in case expression ; line 1 pos 34 at org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:140) at org.apache.spark.sql.catalyst.parser.ParseErrorReporter.throwError(ParseDriver.scala:129) at org.apache.spark.sql.catalyst.parser.ParseDriver$.parse(ParseDriver.scala:77) at org.apache.spark.sql.catalyst.CatalystQl.createPlan(CatalystQl.scala:53) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41) at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40) {code} Is there a way to say something better other than "required (...)+ loop did not match anything at input"? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12689) Migrate DDL parsing to the newly absorbed parser
[ https://issues.apache.org/jira/browse/SPARK-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12689: Assignee: (was: Apache Spark) > Migrate DDL parsing to the newly absorbed parser > > > Key: SPARK-12689 > URL: https://issues.apache.org/jira/browse/SPARK-12689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12770) Implement rules for branch elimination for CaseWhen in SimplifyConditionals
[ https://issues.apache.org/jira/browse/SPARK-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12770: Summary: Implement rules for branch elimination for CaseWhen in SimplifyConditionals (was: Implement rules for removing unnecessary branches for CaseWhen in SimplifyConditionals) > Implement rules for branch elimination for CaseWhen in SimplifyConditionals > --- > > Key: SPARK-12770 > URL: https://issues.apache.org/jira/browse/SPARK-12770 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Reporter: Reynold Xin > > There are a few things we can do: > 1. If a branch is a true literal, remove the CaseWhen and use the value from > that branch. > 2. If a branch is a literal that is false or null, remove that branch. > 3. If only the else branch is left, remove the CaseWhen and use the value > from the else branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12770) Implement rules for removing unnecessary branches for CaseWhen in SimplifyConditionals
[ https://issues.apache.org/jira/browse/SPARK-12770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12770: Description: There are a few things we can do: 1. If a branch is a true literal, remove the CaseWhen and use the value from that branch. 2. If a branch is a literal that is false or null, remove that branch. 3. If only the else branch is left, remove the CaseWhen and use the value from the else branch. > Implement rules for removing unnecessary branches for CaseWhen in > SimplifyConditionals > -- > > Key: SPARK-12770 > URL: https://issues.apache.org/jira/browse/SPARK-12770 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Reporter: Reynold Xin > > There are a few things we can do: > 1. If a branch is a true literal, remove the CaseWhen and use the value from > that branch. > 2. If a branch is a literal that is false or null, remove that branch. > 3. If only the else branch is left, remove the CaseWhen and use the value > from the else branch. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12768) Remove CaseKeyWhen expression
[ https://issues.apache.org/jira/browse/SPARK-12768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12768: Summary: Remove CaseKeyWhen expression (was: Remove CaseKeyWhen) > Remove CaseKeyWhen expression > - > > Key: SPARK-12768 > URL: https://issues.apache.org/jira/browse/SPARK-12768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > CaseKeyWhen was added to improve the performance of "case a when ..." when we > did not have common subexpression elimination. We now have that so we can > remove CaseKeyWhen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12762) Add unit test for simplifying if expression
[ https://issues.apache.org/jira/browse/SPARK-12762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12762: Issue Type: Sub-task (was: Improvement) Parent: SPARK-12767 > Add unit test for simplifying if expression > --- > > Key: SPARK-12762 > URL: https://issues.apache.org/jira/browse/SPARK-12762 > Project: Spark > Issue Type: Sub-task > Components: Optimizer, SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12773) Impurity and Sample details for each node of a decision tree
Rahul Tanwani created SPARK-12773: - Summary: Impurity and Sample details for each node of a decision tree Key: SPARK-12773 URL: https://issues.apache.org/jira/browse/SPARK-12773 Project: Spark Issue Type: Question Components: ML, MLlib Affects Versions: 1.5.2 Reporter: Rahul Tanwani I just want to understand whether each node in the decision tree calculates / stores information about the number of samples that satisfy the split criteria. Looking at the code, I find some information about the impurity statistics but did not find anything on the samples. scikit-learn exposes both of these metrics. The information may help in cases where there are multiple decision rules (multiple leaf nodes) yielding the same prediction and we want to do some relative comparisons of decision paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
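For context, this is roughly what the 1.5.x MLlib tree API does expose per node: the impurity, the prediction, and (on internal nodes) the information-gain statistics, but not the number of training samples that reached the node, which is what the question is about. A hedged sketch of walking a trained model:
{code}
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

// Recursively print what each node stores. `model` is assumed to be a trained
// DecisionTreeModel (e.g. from DecisionTree.trainClassifier).
def describe(node: Node, indent: String = ""): Unit = {
  println(s"${indent}node ${node.id}: impurity=${node.impurity}, " +
    s"predict=${node.predict.predict}, leaf=${node.isLeaf}")
  node.stats.foreach(s => println(s"$indent  gain=${s.gain}"))
  node.leftNode.foreach(describe(_, indent + "  "))
  node.rightNode.foreach(describe(_, indent + "  "))
}

// describe(model.topNode)
{code}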
[jira] [Updated] (SPARK-12774) DataFrame.mapPartitions apply function operates on Pandas DataFrame instead of a generator of rows
[ https://issues.apache.org/jira/browse/SPARK-12774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh updated SPARK-12774: - Description: Currently DataFrame.mapPatitions is analogous to DataFrame.rdd.mapPatitions in both Spark and pySpark. The function that is applied to each partition _f_ must operate on a list generator. This is however very inefficient in Python. It would be more logical and efficient if the apply function _f_ operated on Pandas DataFrames instead and also returned a DataFrame. This avoids unnecessary iteration in Python which is slow. Currently: {code} def apply_function(rows): df = pd.DataFrame(list(rows)) df = df % 100 # Do something on df return df.values.tolist() table = sqlContext.read.parquet("") table = table.mapPatitions(apply_function) {code} New apply function would accept a Pandas DataFrame and return a DataFrame: {code} def apply_function(df): df = df % 100 # Do something on df return df {code} was: Currently DataFrame.mapPatitions is analogous to DataFrame.rdd.mapPatitions in both Spark and pySpark. The function that is applied to each partition _f_ must operate on a list generator. This is however very inefficient in Python. It would be more logical and efficient if the apply function _f_ operated on Pandas DataFrames instead and also returned a DataFrame. This avoids unnecessary iteration in Python which is slow. Currently: {code:python} def apply_function(rows): df = pd.DataFrame(list(rows)) df = df % 100 # Do something on df return df.values.tolist() table = sqlContext.read.parquet("") table = table.mapPatitions(apply_function) {code} New apply function would accept a Pandas DataFrame and return a DataFrame: {code:python} def apply_function(df): df = df % 100 # Do something on df return df {code} > DataFrame.mapPartitions apply function operates on Pandas DataFrame instead > of a generator or rows > -- > > Key: SPARK-12774 > URL: https://issues.apache.org/jira/browse/SPARK-12774 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Josh > Labels: dataframe, pandas > > Currently DataFrame.mapPatitions is analogous to DataFrame.rdd.mapPatitions > in both Spark and pySpark. The function that is applied to each partition _f_ > must operate on a list generator. This is however very inefficient in Python. > It would be more logical and efficient if the apply function _f_ operated on > Pandas DataFrames instead and also returned a DataFrame. This avoids > unnecessary iteration in Python which is slow. > Currently: > {code} > def apply_function(rows): > df = pd.DataFrame(list(rows)) > df = df % 100 # Do something on df > return df.values.tolist() > table = sqlContext.read.parquet("") > table = table.mapPatitions(apply_function) > {code} > New apply function would accept a Pandas DataFrame and return a DataFrame: > {code} > def apply_function(df): > df = df % 100 # Do something on df > return df > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12774) DataFrame.mapPartitions apply function operates on Pandas DataFrame instead of a generator of rows
Josh created SPARK-12774: Summary: DataFrame.mapPartitions apply function operates on Pandas DataFrame instead of a generator of rows Key: SPARK-12774 URL: https://issues.apache.org/jira/browse/SPARK-12774 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Currently DataFrame.mapPartitions is analogous to DataFrame.rdd.mapPartitions in both Spark and PySpark. The function _f_ that is applied to each partition must operate on a generator of rows. This is, however, very inefficient in Python. It would be more logical and efficient if the apply function _f_ operated on Pandas DataFrames instead and also returned a DataFrame. This avoids unnecessary iteration in Python, which is slow. Currently: {code:python} def apply_function(rows): df = pd.DataFrame(list(rows)) df = df % 100 # Do something on df return df.values.tolist() table = sqlContext.read.parquet("") table = table.mapPartitions(apply_function) {code} The new apply function would accept a Pandas DataFrame and return a DataFrame: {code:python} def apply_function(df): df = df % 100 # Do something on df return df {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12689) Migrate DDL parsing to the newly absorbed parser
[ https://issues.apache.org/jira/browse/SPARK-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12689: Assignee: Apache Spark > Migrate DDL parsing to the newly absorbed parser > > > Key: SPARK-12689 > URL: https://issues.apache.org/jira/browse/SPARK-12689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12689) Migrate DDL parsing to the newly absorbed parser
[ https://issues.apache.org/jira/browse/SPARK-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093672#comment-15093672 ] Apache Spark commented on SPARK-12689: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/10723 > Migrate DDL parsing to the newly absorbed parser > > > Key: SPARK-12689 > URL: https://issues.apache.org/jira/browse/SPARK-12689 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12769) Remove If expression
[ https://issues.apache.org/jira/browse/SPARK-12769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12769: Description: If can be a simple factory method for CaseWhen, similar to CaseKeyWhen. We can then simplify the optimizer rules we implement for conditional expressions. was: If can be a simple factory method for CaseWhen. We can then simplify the optimizer rules we implement for conditional expressions. > Remove If expression > > > Key: SPARK-12769 > URL: https://issues.apache.org/jira/browse/SPARK-12769 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > If can be a simple factory method for CaseWhen, similar to CaseKeyWhen. > We can then simplify the optimizer rules we implement for conditional > expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
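A hedged sketch of what the proposal amounts to, written against the (branches, elseValue) representation of CaseWhen used elsewhere in this epic; the exact Catalyst constructor differs between versions, so treat the signature as an assumption rather than the committed change:
{code}
import org.apache.spark.sql.catalyst.expressions.{CaseWhen, Expression}

// `If(p, t, f)` becomes sugar for a one-branch CASE WHEN with an ELSE.
object IfFactory {
  def apply(predicate: Expression, trueValue: Expression, falseValue: Expression): CaseWhen =
    CaseWhen(Seq((predicate, trueValue)), Some(falseValue))
}
{code}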
[jira] [Assigned] (SPARK-12768) Remove CaseKeyWhen expression
[ https://issues.apache.org/jira/browse/SPARK-12768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12768: Assignee: Apache Spark (was: Reynold Xin) > Remove CaseKeyWhen expression > - > > Key: SPARK-12768 > URL: https://issues.apache.org/jira/browse/SPARK-12768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > CaseKeyWhen was added to improve the performance of "case a when ..." when we > did not have common subexpression elimination. We now have that so we can > remove CaseKeyWhen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12768) Remove CaseKeyWhen expression
[ https://issues.apache.org/jira/browse/SPARK-12768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12768: Assignee: Reynold Xin (was: Apache Spark) > Remove CaseKeyWhen expression > - > > Key: SPARK-12768 > URL: https://issues.apache.org/jira/browse/SPARK-12768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > CaseKeyWhen was added to improve the performance of "case a when ..." when we > did not have common subexpression elimination. We now have that so we can > remove CaseKeyWhen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12768) Remove CaseKeyWhen expression
[ https://issues.apache.org/jira/browse/SPARK-12768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093506#comment-15093506 ] Apache Spark commented on SPARK-12768: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/10722 > Remove CaseKeyWhen expression > - > > Key: SPARK-12768 > URL: https://issues.apache.org/jira/browse/SPARK-12768 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > CaseKeyWhen was added to improve the performance of "case a when ..." when we > did not have common subexpression elimination. We now have that so we can > remove CaseKeyWhen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12760: -- Priority: Minor (was: Trivial) Issue Type: Bug (was: Question) Summary: inaccurate description for difference between local vs cluster mode in closure handling (was: inaccurate description for difference between local vs cluster mode ) I think the example needs an update, but not for this reason. There's no separate "memory space" in local mode. It's one JVM. However it's undefined whether the copy of {{counter}} is the same or different in this case. Actually, I find a copy is serialized with the closure at this point so the result is still 0. I think the explanation should be changed to say the result is undefined here, and could be 0 or not, and explain why. Do you want to try a PR? > inaccurate description for difference between local vs cluster mode in > closure handling > --- > > Key: SPARK-12760 > URL: https://issues.apache.org/jira/browse/SPARK-12760 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Mortada Mehyar >Priority: Minor > > In the spark documentation there's an example for illustrating how `local` > and `cluster` mode can differ > http://spark.apache.org/docs/latest/programming-guide.html#example > " In local mode with a single JVM, the above code will sum the values within > the RDD and store it in counter. This is because both the RDD and the > variable counter are in the same memory space on the driver node." > However the above doesn't seem to be true. Even in `local` mode it seems like > the counter value should still be 0, because the variable will be summed up > in the executor memory space, but the final value in the driver memory space > is still 0. I tested this snippet and verified that in `local` mode the value > is indeed still 0. > Is the doc wrong or perhaps I'm missing something the doc is trying to say? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
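The snippet under discussion, sketched here after the programming guide's example so the point about undefined driver-side values is easy to see; the accumulator variant is the supported way to aggregate across tasks:
{code}
var counter = 0
val rdd = sc.parallelize(1 to 10)

// The closure is serialized with a copy of `counter`; each task mutates its own
// copy, so the driver-side value afterwards is undefined (often still 0),
// in local mode as well as on a cluster.
rdd.foreach(x => counter += x)
println(s"Counter value: $counter")

// Aggregating across tasks is what accumulators are for.
val acc = sc.accumulator(0)
rdd.foreach(x => acc += x)
println(s"Accumulator value: ${acc.value}")
{code}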
[jira] [Resolved] (SPARK-12766) Unshaded Google Guava classes in spark-network-common jar
[ https://issues.apache.org/jira/browse/SPARK-12766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12766. --- Resolution: Not A Problem This is on purpose. Some Guava classes are used in the public Java API (unfortunately). This was rectified for Spark 2.x, but you will find {{Optional}} and some dependent classes unshaded in 1.x. > Unshaded Google Guava classes in spark-network-common jar > - > > Key: SPARK-12766 > URL: https://issues.apache.org/jira/browse/SPARK-12766 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.1 >Reporter: Jake Yoon >Priority: Minor > Labels: build, sbt > > I found unshaded Google Guava classes used internally in > spark-network-common while working with Elasticsearch. > The following link discusses a duplicate-dependency conflict caused by Guava > classes and how I solved the build conflict issue. > https://discuss.elastic.co/t/exception-when-using-elasticsearch-spark-and-elasticsearch-core-together/38471/4 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12764) XML Column type is not supported
[ https://issues.apache.org/jira/browse/SPARK-12764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093836#comment-15093836 ] Ewan Leith edited comment on SPARK-12764 at 1/12/16 12:53 PM: -- What are you expecting it to do, output the XML as a string, or something else? I doubt this will work, but you might try adding this code before the initial references to JDBC: case object PostgresDialect extends JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql") override def getCatalystType( sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = { if (typeName.contains("xml")) { Some(StringType) } else None } } JdbcDialects.registerDialect(PostgresDialect) was (Author: ewanleith): What are you expecting it to do, output the XML as a string, or something else? > XML Column type is not supported > > > Key: SPARK-12764 > URL: https://issues.apache.org/jira/browse/SPARK-12764 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.0 > Environment: Mac Os X El Capitan >Reporter: Rajeshwar Gaini > > Hi All, > I am using PostgreSQL database. I am using the following jdbc call to access > a customer table (customer_id int, event text, country text, content xml) in > my database. > {code} > val dataframe1 = sqlContext.load("jdbc", Map("url" -> > "jdbc:postgresql://localhost/customerlogs?user=postgres=postgres", > "dbtable" -> "customer")) > {code} > When i run above command in spark-shell i receive the following error. > {code} > java.sql.SQLException: Unsupported type > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91) > at > org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119) > at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:25) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:30) > at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32) > at $iwC$$iwC$$iwC$$iwC$$iwC.(:34) > at $iwC$$iwC$$iwC$$iwC.(:36) > at $iwC$$iwC$$iwC.(:38) > at $iwC$$iwC.(:40) > at $iwC.(:42) > at (:44) > at .(:48) > at .() > at .(:7) > at .() > at $print() > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065) > at > org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346) > at > org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871) > at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819) > 
at > org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857) > at > org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902) > at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814) > at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657) > at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at >
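A self-contained version of the workaround suggested in the comment above, with the imports it needs; the dialect name is arbitrary and the mapping of PostgreSQL's {{xml}} type to {{StringType}} is untested here, so treat it as a sketch rather than a supported feature:
{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

// Map PostgreSQL's xml column type to a plain string so JDBCRDD can resolve the schema.
object PostgresXmlDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (typeName.contains("xml")) Some(StringType) else None
}

// Register before calling sqlContext.load(...) / sqlContext.read.jdbc(...).
JdbcDialects.registerDialect(PostgresXmlDialect)
{code}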
[jira] [Commented] (SPARK-12775) Couldn't find leader offsets exception when hostname can't be resolved
[ https://issues.apache.org/jira/browse/SPARK-12775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15093858#comment-15093858 ] Sean Owen commented on SPARK-12775: --- Hm, I don't think that's a spark problem though. > Couldn't find leader offsets exception when hostname can't be resolved > -- > > Key: SPARK-12775 > URL: https://issues.apache.org/jira/browse/SPARK-12775 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Sebastian Piu >Priority: Minor > > When hostname resolution fails for a broker an unclear/misleading error is > shown: > org.apache.spark.SparkException: java.nio.channels.ClosedChannelException > org.apache.spark.SparkException: Couldn't find leader offsets for > Set([mytopic,0], [mytopic,18], [mytopic,12], [mytopic,6]) > at > org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366) > at > org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366) > Error above ocurred when a broker was added to the cluster and my machine > could not resolve its hostname -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12582) IndexShuffleBlockResolverSuite fails in windows
[ https://issues.apache.org/jira/browse/SPARK-12582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12582. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10526 [https://github.com/apache/spark/pull/10526] > IndexShuffleBlockResolverSuite fails in windows > --- > > Key: SPARK-12582 > URL: https://issues.apache.org/jira/browse/SPARK-12582 > Project: Spark > Issue Type: Bug > Components: Tests, Windows >Reporter: yucai >Assignee: yucai > Fix For: 2.0.0, 1.6.1 > > > IndexShuffleBlockResolverSuite fails in my windows develop machine. > {code} > [info] IndexShuffleBlockResolverSuite: > [info] - commit shuffle files multiple times *** FAILED *** (388 milliseconds) > [info] Array(10, 0, 20) equaled Array(10, 0, 20) > (IndexShuffleBlockResolverSuite.scala:108) > [info] org.scalatest.exceptions.TestFailedException: > . > . > [info] Exception encountered when attempting to run a suite with class name: > org.apache.spark.shuffle.sort.IndexShuffleB > lockResolverSuite *** ABORTED *** (2 seconds, 234 milliseconds) > [info] java.io.IOException: Failed to delete: > C:\Users\yyu29\Documents\codes.next\spark\target\tmp\spark-0e81a15a-e712 > -4b1c-a089-f421db149e65 > [info] at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:940) > [info] at > org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala: > 60) > [info] at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:205) > [info] at > org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala: > 36) > [info] at > org.scalatest.BeforeAndAfterEach$class.afterEach(BeforeAndAfterEach.scala:220) > [info] at > org.apache.spark.shuffle.sort.IndexShuffleBlockResolverSuite.afterEach(IndexShuffleBlockResolverSuite.scala: > 36) > {code} > Root cause is when "afterEach" wants to clean up data, some files are still > open. For example: > {code} > // The dataFile should be the previous one > val in = new FileInputStream(dataFile) > val firstByte = new Array[Byte](1) > in.read(firstByte) > assert(firstByte(0) === 0) > {code} > Lack of "in.close()". > In Linux, it is not a problem, you can still delete a file even it is open, > but this does not work in windows, which will report "resource is busy". > Another issue is this IndexShuffleBlockResolverSuite.scala is a scala file > but it is placed in "test/java". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
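The fix described above boils down to closing the stream before the temporary directory is cleaned up; a minimal sketch (the method name and structure are illustrative, not the actual patch):
{code}
import java.io.{File, FileInputStream}

def firstByteOf(dataFile: File): Byte = {
  val in = new FileInputStream(dataFile)
  try {
    val buf = new Array[Byte](1)
    in.read(buf)
    buf(0)
  } finally {
    in.close() // without this, Windows refuses to delete the still-open file in afterEach()
  }
}
{code}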
[jira] [Resolved] (SPARK-7615) MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero
[ https://issues.apache.org/jira/browse/SPARK-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7615. -- Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.0.0 1.6.1 Resolved by https://github.com/apache/spark/pull/10696 > MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero > > > Key: SPARK-7615 > URL: https://issues.apache.org/jira/browse/SPARK-7615 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1 >Reporter: Eric Li >Assignee: Sean Owen >Priority: Minor > Fix For: 1.6.1, 2.0.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > In Word2VecModel, wordVecNorms may contains Euclidean Norm equals to zero. > This will cause incorrect calculation for cosine distance. when you do > cosineVec(ind) / wordVecNorms(ind). Cosine distance should be equal to 0 for > norm = 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
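The underlying arithmetic issue and the guard the fix applies, sketched with illustrative names rather than the actual Word2VecModel code:
{code}
// Cosine similarity is dot / (||a|| * ||b||); if either norm is 0 the word vector
// is the zero vector, and the similarity should be reported as 0 instead of
// producing NaN or Infinity from a division by zero.
def safeCosine(dot: Double, normA: Double, normB: Double): Double =
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
{code}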
[jira] [Resolved] (SPARK-12773) Impurity and Sample details for each node of a decision tree
[ https://issues.apache.org/jira/browse/SPARK-12773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12773. --- Resolution: Invalid Target Version/s: (was: 1.5.2) Please ask questions at u...@spark.apache.org > Impurity and Sample details for each node of a decision tree > > > Key: SPARK-12773 > URL: https://issues.apache.org/jira/browse/SPARK-12773 > Project: Spark > Issue Type: Question > Components: ML, MLlib >Affects Versions: 1.5.2 >Reporter: Rahul Tanwani > > I just want to understand whether each node in the decision tree calculates / > stores information about the number of samples that satisfy the split criteria. > Looking at the code, I find some information about the impurity statistics > but did not find anything on the samples. scikit-learn exposes both of these > metrics. The information may help in cases where there are multiple > decision rules (multiple leaf nodes) yielding the same prediction and we want > to do some relative comparisons of decision paths. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12759) Spark should fail fast if --executor-memory is too small for spark to start
[ https://issues.apache.org/jira/browse/SPARK-12759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12759: -- Component/s: Spark Submit Spark Core > Spark should fail fast if --executor-memory is too small for spark to start > --- > > Key: SPARK-12759 > URL: https://issues.apache.org/jira/browse/SPARK-12759 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit >Affects Versions: 1.6.0 >Reporter: Imran Rashid > > With the UnifiedMemoryManager, the minimum memory for executor and driver > JVMs was increased to 450MB. There is code in {{UnifiedMemoryManager}} to > provide a helpful warning if less than that much memory is provided. > However if you set {{--executor-memory}} to something less than that, from > the driver process you just see executor failures with no warning, since the > more meaningful errors are buried in the executor logs. Eg., on Yarn, you see > {noformat} > 16/01/11 13:59:32 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_1452548703600_0001_01_02 on host: > imran-adhoc-2.vpc.cloudera.com. Exit status: 1. Diagnostics: Exception from > container-launch. > Container id: container_1452548703600_0001_01_02 > Exit code: 1 > Stack trace: ExitCodeException exitCode=1: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:561) > at org.apache.hadoop.util.Shell.run(Shell.java:478) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Container exited with a non-zero exit code 1 > {noformat} > Though there is already a message from {{UnifiedMemoryManager}} if there > isn't enough memory for the driver, as long as this is being changed it would > be nice if the message more clearly indicated the {{--driver-memory}} > configuration as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12763) Spark gets stuck executing SSB query
[ https://issues.apache.org/jira/browse/SPARK-12763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12763: -- Component/s: SQL > Spark gets stuck executing SSB query > > > Key: SPARK-12763 > URL: https://issues.apache.org/jira/browse/SPARK-12763 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Standalone cluster >Reporter: Vadim Tkachenko > Attachments: Spark shell - Details for Stage 5 (Attempt 0).pdf > > > I am trying to emulate an SSB load. Data was generated with > https://github.com/Percona-Lab/ssb-dbgen > at a scale factor of 1000 and converted to Parquet format. > There is the following script > val pLineOrder = > sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/lineorder").cache() > val pDate = sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/date").cache() > val pPart = sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/part").cache() > val pSupplier = > sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/supplier").cache() > val pCustomer = > sqlContext.read.parquet("/mnt/i3600/spark/ssb-1000/customer").cache() > pLineOrder.registerTempTable("lineorder") > pDate.registerTempTable("date") > pPart.registerTempTable("part") > pSupplier.registerTempTable("supplier") > pCustomer.registerTempTable("customer") > and the query > val sql41=sqlContext.sql("select D_YEAR, C_NATION, sum(LO_REVENUE - > LO_SUPPLYCOST) as profit from date, customer, supplier, part, lineorder > where LO_CUSTKEY = C_CUSTKEY and LO_SUPPKEY = S_SUPPKEY and > LO_PARTKEY = P_PARTKEY and LO_ORDERDATE = D_DATEKEY and C_REGION = > 'AMERICA' and S_REGION = 'AMERICA' and (P_MFGR = 'MFGR#1' or P_MFGR = > 'MFGR#2') group by D_YEAR, C_NATION order by D_YEAR, C_NATION") > followed by > sql41.show() > gets stuck: at some point there is no progress and the server is fully idle, but the job stays at the same stage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2516) Bootstrapping
[ https://issues.apache.org/jira/browse/SPARK-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2516. -- Resolution: Won't Fix This is the only one left under this umbrella; I assume it's stale, or really better implemented in ML. Feel free to reopen as a stand-alone task but I didn't see any activity on this. > Bootstrapping > - > > Key: SPARK-2516 > URL: https://issues.apache.org/jira/browse/SPARK-2516 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa > > Support re-sampling and bootstrap estimators in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3669) Extract IndexedRDD interface
[ https://issues.apache.org/jira/browse/SPARK-3669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3669. -- Resolution: Won't Fix Resolved for now per parent discussion > Extract IndexedRDD interface > > > Key: SPARK-3669 > URL: https://issues.apache.org/jira/browse/SPARK-3669 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Ankur Dave >Assignee: Ankur Dave > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3668) Support for arbitrary key types in IndexedRDD
[ https://issues.apache.org/jira/browse/SPARK-3668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3668. -- Resolution: Won't Fix Resolved for now per parent discussion > Support for arbitrary key types in IndexedRDD > - > > Key: SPARK-3668 > URL: https://issues.apache.org/jira/browse/SPARK-3668 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4043) Add a flag for stopping threads of cancelled tasks if Thread.interrupt doesn't kill them
[ https://issues.apache.org/jira/browse/SPARK-4043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4043. -- Resolution: Won't Fix I think this never went anywhere specific, so closing it > Add a flag for stopping threads of cancelled tasks if Thread.interrupt > doesn't kill them > > > Key: SPARK-4043 > URL: https://issues.apache.org/jira/browse/SPARK-4043 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia > > While killing user code with Thread.stop can be risky, we might want to do it > for things like long-running SQL servers, where users have to be able to > cancel a query even if it's spinning in the CPU and they know the code > involved. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3818) Graph coarsening
[ https://issues.apache.org/jira/browse/SPARK-3818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3818. -- Resolution: Won't Fix > Graph coarsening > > > Key: SPARK-3818 > URL: https://issues.apache.org/jira/browse/SPARK-3818 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave > > Listing 7 in the [GraphX OSDI > paper|http://ankurdave.com/dl/graphx-osdi14.pdf] contains pseudocode for a > coarsening operator that allows merging edges that satisfy an edge predicate, > collapsing vertices connected by merged edges. GraphX should provide an > implementation of this operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3360) Add RowMatrix.multiply(Vector)
[ https://issues.apache.org/jira/browse/SPARK-3360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3360. -- Resolution: Won't Fix > Add RowMatrix.multiply(Vector) > -- > > Key: SPARK-3360 > URL: https://issues.apache.org/jira/browse/SPARK-3360 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Sandy Ryza > > RowMatrix currently has multiply(Matrix), but multiply(Vector) would be > useful as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12638) Parameter explanation not very accurate for RDD function "aggregate"
[ https://issues.apache.org/jira/browse/SPARK-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12638: -- Assignee: Tommy Yu > Parameter explanation not very accurate for RDD function "aggregate" > - > > Key: SPARK-12638 > URL: https://issues.apache.org/jira/browse/SPARK-12638 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 1.5.2 >Reporter: Tommy Yu >Assignee: Tommy Yu >Priority: Trivial > Fix For: 1.6.1, 2.0.0 > > > Currently, the parameters of the RDD function aggregate are not explained well, > especially the "zeroValue" parameter. > It's necessary to let junior Scala users know that "zeroValue" takes part in both > the "seqOp" and "combOp" phases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12638) Parameter explanation not very accurate for RDD function "aggregate"
[ https://issues.apache.org/jira/browse/SPARK-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12638. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10587 [https://github.com/apache/spark/pull/10587] > Parameter explanation not very accurate for RDD function "aggregate" > - > > Key: SPARK-12638 > URL: https://issues.apache.org/jira/browse/SPARK-12638 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 1.5.2 >Reporter: Tommy Yu >Priority: Trivial > Fix For: 2.0.0, 1.6.1 > > > Currently, the parameters of the RDD function aggregate are not explained well, > especially the "zeroValue" parameter. > It's necessary to let junior Scala users know that "zeroValue" takes part in both > the "seqOp" and "combOp" phases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
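A quick worked example of the point the documentation fix makes, namely that zeroValue seeds every partition's seqOp and also the final combOp, so a non-neutral zeroValue is applied more than once:
{code}
val rdd = sc.parallelize(1 to 4, numSlices = 2)   // partitions: [1, 2] and [3, 4]

// Neutral zeroValue: behaves like a plain sum.
val sum = rdd.aggregate(0)(_ + _, _ + _)          // 10

// Non-neutral zeroValue of 5: applied once per partition in seqOp and once more
// in combOp, so the result is 1+2+3+4 + 5*2 + 5 = 25, not 15.
val skewed = rdd.aggregate(5)(_ + _, _ + _)       // 25
{code}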
[jira] [Resolved] (SPARK-1521) Take character set size into account when compressing in-memory string columns
[ https://issues.apache.org/jira/browse/SPARK-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1521. -- Resolution: Won't Fix I assume this is obsolete or else already implemented in some sense by tungsten > Take character set size into account when compressing in-memory string columns > -- > > Key: SPARK-1521 > URL: https://issues.apache.org/jira/browse/SPARK-1521 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > Labels: compression > > Quoted from [a blog > post|https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/] > from Facebook: > bq. Strings dominate the largest tables in our warehouse and make up about > 80% of the columns across the warehouse, so optimizing compression for string > columns was important. By using a threshold on observed number of distinct > column values per stripe, we modified the ORCFile writer to apply dictionary > encoding to a stripe only when beneficial. Additionally, we sample the column > values and take the character set of the column into account, since a small > character set can be leveraged by codecs like Zlib for good compression and > dictionary encoding then becomes unnecessary or sometimes even detrimental if > applied. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-873) Add a way to specify rack topology in Mesos and standalone modes
[ https://issues.apache.org/jira/browse/SPARK-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-873. - Resolution: Won't Fix > Add a way to specify rack topology in Mesos and standalone modes > > > Key: SPARK-873 > URL: https://issues.apache.org/jira/browse/SPARK-873 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 0.8.0 >Reporter: Matei Zaharia > > Right now the YARN mode can look up rack information from YARN, but the > standalone and Mesos modes don't have any way of specifying rack topology. We > should have a pluggable script or config file that allows this. For the > standalone mode, we'd probably want the rack info to be known by the Master > rather than driver apps, and maybe the apps can get a cluster map when they > register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
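One possible shape for the pluggable hook the issue asks for, as a sketch only; RackResolver and the file format below are hypothetical names for illustration, not an existing Spark interface.
{code}
import scala.io.Source

// Hypothetical pluggable rack-topology hook the Master (or driver) could consult.
trait RackResolver {
  def rackOf(host: String): Option[String]
}

// Simple file-backed implementation: each non-empty line is "<host> <rack>".
class FileRackResolver(path: String) extends RackResolver {
  private val mapping: Map[String, String] =
    Source.fromFile(path).getLines()
      .map(_.trim)
      .filter(_.nonEmpty)
      .map { line =>
        val Array(host, rack) = line.split("\\s+", 2)
        host -> rack
      }
      .toMap

  override def rackOf(host: String): Option[String] = mapping.get(host)
}
{code}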
[jira] [Resolved] (SPARK-1515) Specialized ColumnTypes for Array, Map and Struct
[ https://issues.apache.org/jira/browse/SPARK-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1515. -- Resolution: Won't Fix Assuming this is obsolete > Specialized ColumnTypes for Array, Map and Struct > - > > Key: SPARK-1515 > URL: https://issues.apache.org/jira/browse/SPARK-1515 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.1.0 >Reporter: Cheng Lian > Labels: compression > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1614) Move Mesos protobufs out of TaskState
[ https://issues.apache.org/jira/browse/SPARK-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1614. -- Resolution: Won't Fix > Move Mesos protobufs out of TaskState > - > > Key: SPARK-1614 > URL: https://issues.apache.org/jira/browse/SPARK-1614 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 0.9.1 >Reporter: Shivaram Venkataraman >Priority: Minor > Labels: Starter > > To isolate usage of Mesos protobufs it would be good to move them out of > TaskState into either a new class (MesosUtils ?) or > CoarseGrainedMesos{Executor, Backend}. > This would allow applications to build Spark to run without including > protobuf from Mesos in their shaded jars. This is one way to avoid protobuf > conflicts between Mesos and Hadoop > (https://issues.apache.org/jira/browse/MESOS-1203) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
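A rough sketch of the isolation being proposed; MesosUtils is just the name floated in the issue, and the mapping below is illustrative rather than the actual Spark code. The point is that only Mesos-specific backends would then need the Mesos protobuf classes on their classpath.
{code}
import org.apache.mesos.Protos.{TaskState => MesosTaskState}
import org.apache.spark.TaskState

// Hypothetical helper gathering all Mesos protobuf conversions in one place,
// so TaskState itself no longer references org.apache.mesos.
private[spark] object MesosUtils {
  def toMesosState(state: TaskState.TaskState): MesosTaskState = state match {
    case TaskState.LAUNCHING => MesosTaskState.TASK_STARTING
    case TaskState.RUNNING   => MesosTaskState.TASK_RUNNING
    case TaskState.FINISHED  => MesosTaskState.TASK_FINISHED
    case TaskState.FAILED    => MesosTaskState.TASK_FAILED
    case TaskState.KILLED    => MesosTaskState.TASK_KILLED
    case TaskState.LOST      => MesosTaskState.TASK_LOST
  }
}
{code}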
[jira] [Resolved] (SPARK-3055) Stack trace logged in driver on job failure is usually uninformative
[ https://issues.apache.org/jira/browse/SPARK-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3055. -- Resolution: Won't Fix > Stack trace logged in driver on job failure is usually uninformative > > > Key: SPARK-3055 > URL: https://issues.apache.org/jira/browse/SPARK-3055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Sandy Ryza >Priority: Minor > > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:5 > failed 4 times, most recent failure: TID 24 on host hddn04.lsrc.duke.edu > failed for unknown reason > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code} > At a cursory glance, I would expect the stack trace to have something to do with > where the task error occurred. In fact it's where the driver became aware of > the error and decided to fail the job. This has been a common point of > confusion among our customers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2359) Supporting common statistical functions in MLlib
[ https://issues.apache.org/jira/browse/SPARK-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2359. -- Resolution: Done > Supporting common statistical functions in MLlib > > > Key: SPARK-2359 > URL: https://issues.apache.org/jira/browse/SPARK-2359 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Reynold Xin >Assignee: Doris Xin > > This was originally proposed by [~falaki]. > This is a proposal for a new package within the Spark distribution to support > common statistical estimators. We think consolidating statistics-related > functions in a separate package will help with the readability of the core source > code and encourage Spark users to contribute their functions back. > Please see the initial design document here: > https://docs.google.com/document/d/1Kju9kWSYMXMjEO6ggC9bF9eNbaM4MxcFs_KDqgAcH9c/pub -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3172) Distinguish between shuffle spill on the map and reduce side
[ https://issues.apache.org/jira/browse/SPARK-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3172. -- Resolution: Won't Fix > Distinguish between shuffle spill on the map and reduce side > > > Key: SPARK-3172 > URL: https://issues.apache.org/jira/browse/SPARK-3172 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Sandy Ryza >Assignee: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-809) Give newly registered apps a set of executors right away
[ https://issues.apache.org/jira/browse/SPARK-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-809. - Resolution: Won't Fix I'm assuming this is WontFix at this point. > Give newly registered apps a set of executors right away > > > Key: SPARK-809 > URL: https://issues.apache.org/jira/browse/SPARK-809 > Project: Spark > Issue Type: New Feature > Components: Deploy >Reporter: Matei Zaharia >Priority: Minor > > Right now, newly connected apps in the standalone cluster will not set a good > defaultParallelism value if they create RDDs right after creating a > SparkContext, because the executorAdded calls are asynchronous and happen > after. It would be nice to wait for a few such calls before returning from > the scheduler initializer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
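For illustration only, a small reproduction of the race described above, assuming a standalone master at a hypothetical URL; it uses no new API and simply shows why the default can be too low immediately after startup.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Executors register asynchronously, so right after construction
// defaultParallelism (derived from registered cores) may still be tiny.
val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .setMaster("spark://master:7077")  // hypothetical standalone master URL
val sc = new SparkContext(conf)

println(s"defaultParallelism right after startup: ${sc.defaultParallelism}")

// An RDD created immediately inherits that small slice count...
val eager = sc.parallelize(1 to 1000000)
println(s"partitions: ${eager.partitions.length}")

// ...whereas passing numSlices explicitly sidesteps the problem.
val explicit = sc.parallelize(1 to 1000000, numSlices = 48)
println(s"partitions: ${explicit.partitions.length}")
{code}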
[jira] [Resolved] (SPARK-5273) Improve documentation examples for LinearRegression
[ https://issues.apache.org/jira/browse/SPARK-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5273. -- Resolution: Fixed Assignee: Sean Owen Fix Version/s: 2.0.0 1.6.1 Resolved by https://github.com/apache/spark/pull/10675 > Improve documentation examples for LinearRegression > > > Key: SPARK-5273 > URL: https://issues.apache.org/jira/browse/SPARK-5273 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Dev Lakhani >Assignee: Sean Owen >Priority: Minor > Fix For: 1.6.1, 2.0.0 > > > In the document > https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html, > under > "Linear least squares, Lasso, and ridge regression", > the suggested usage of LinearRegressionWithSGD.train(), > // Building the model > val numIterations = 100 > val model = LinearRegressionWithSGD.train(parsedData, numIterations) > is not ideal even for simple examples such as y=x. This should be replaced > with more realistic parameters that include a step size: > val lr = new LinearRegressionWithSGD() > lr.optimizer.setStepSize(0.0001) > lr.optimizer.setNumIterations(100) > or > LinearRegressionWithSGD.train(input, 100, 0.0001) > to produce a reasonable MSE. It took me a while on the dev forum to learn > that the step size should be really small. This might save someone the same > effort when learning MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
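A self-contained version of the kind of example the reporter is asking for, assuming the Spark 1.x RDD-based MLlib API and a LIBSVM-format input file at a hypothetical path:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

val sc = new SparkContext(new SparkConf().setAppName("linear-regression-example"))

// Hypothetical path; the file is expected to contain LIBSVM-formatted labeled points.
val data = MLUtils.loadLibSVMFile(sc, "data/sample_linear_regression_data.txt").cache()

// The small step size is the point of the issue: with the default step size,
// SGD can diverge even on data as simple as y = x.
val lr = new LinearRegressionWithSGD()
lr.optimizer
  .setStepSize(0.0001)
  .setNumIterations(100)
val model = lr.run(data)

// Training mean squared error.
val mse = data.map { p =>
  val err = model.predict(p.features) - p.label
  err * err
}.mean()
println(s"training MSE = $mse")
{code}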
[jira] [Updated] (SPARK-12759) Spark should fail fast if --executor-memory is too small for spark to start
[ https://issues.apache.org/jira/browse/SPARK-12759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12759: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Spark should fail fast if --executor-memory is too small for spark to start > --- > > Key: SPARK-12759 > URL: https://issues.apache.org/jira/browse/SPARK-12759 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit >Affects Versions: 1.6.0 >Reporter: Imran Rashid >Priority: Minor > > With the UnifiedMemoryManager, the minimum memory for executor and driver > JVMs was increased to 450MB. There is code in {{UnifiedMemoryManager}} to > provide a helpful warning if less than that much memory is provided. > However if you set {{--executor-memory}} to something less than that, from > the driver process you just see executor failures with no warning, since the > more meaningful errors are buried in the executor logs. Eg., on Yarn, you see > {noformat} > 16/01/11 13:59:32 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: > Container marked as failed: container_1452548703600_0001_01_02 on host: > imran-adhoc-2.vpc.cloudera.com. Exit status: 1. Diagnostics: Exception from > container-launch. > Container id: container_1452548703600_0001_01_02 > Exit code: 1 > Stack trace: ExitCodeException exitCode=1: > at org.apache.hadoop.util.Shell.runCommand(Shell.java:561) > at org.apache.hadoop.util.Shell.run(Shell.java:478) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Container exited with a non-zero exit code 1 > {noformat} > Though there is already a message from {{UnifiedMemoryManager}} if there > isn't enough memory for the driver, as long as this is being changed it would > be nice if the message more clearly indicated the {{--driver-memory}} > configuration as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
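A sketch of the kind of fail-fast check the issue is asking for; the 450MB figure comes from the description, while the helper name and the place it would be called from are hypothetical.
{code}
import org.apache.spark.SparkConf

// Hypothetical client-side validation, run before any executors are requested,
// so the error surfaces in the driver output instead of buried executor logs.
def validateExecutorMemory(conf: SparkConf): Unit = {
  val minBytes = 450L * 1024 * 1024  // minimum executor/driver JVM size cited above
  val requested = conf.getSizeAsBytes("spark.executor.memory", "1g")
  require(requested >= minBytes,
    s"Executor memory ${requested / (1024 * 1024)}MB is below the " +
      s"${minBytes / (1024 * 1024)}MB minimum; increase --executor-memory.")
}
{code}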
[jira] [Updated] (SPARK-12765) CountVectorizerModel.transform lost the transformSchema
[ https://issues.apache.org/jira/browse/SPARK-12765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12765: -- Fix Version/s: (was: 1.6.1) (was: 1.6.0) [~sloth2012] don't set fix version; it doesn't make sense now. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > CountVectorizerModel.transform lost the transformSchema > --- > > Key: SPARK-12765 > URL: https://issues.apache.org/jira/browse/SPARK-12765 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.0, 1.6.1 >Reporter: sloth > Labels: patch > > In the ml package, CountVectorizerModel's transform function does not call > transformSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
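For context, a minimal custom transformer sketch showing the pattern the report says is missing: transform first calls transformSchema so that schema problems surface early and the output schema is validated. The UpperCaser class and its column names are hypothetical and only illustrate the pattern against the Spark 1.6 ML API; this is not the CountVectorizerModel source.
{code}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical transformer; the point is the transformSchema call in transform().
class UpperCaser(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("upperCaser"))

  override def transformSchema(schema: StructType): StructType = {
    require(schema("text").dataType == StringType, "column 'text' must be a string")
    StructType(schema.fields :+ StructField("text_upper", StringType, nullable = true))
  }

  override def transform(dataset: DataFrame): DataFrame = {
    // The call the issue reports as missing from CountVectorizerModel.transform:
    transformSchema(dataset.schema, logging = true)
    dataset.withColumn("text_upper", upper(col("text")))
  }

  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}
{code}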
[jira] [Resolved] (SPARK-2011) Eliminate duplicate join in Pregel
[ https://issues.apache.org/jira/browse/SPARK-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2011. -- Resolution: Won't Fix > Eliminate duplicate join in Pregel > -- > > Key: SPARK-2011 > URL: https://issues.apache.org/jira/browse/SPARK-2011 > Project: Spark > Issue Type: Improvement > Components: GraphX >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Minor > > In the iteration loop, Pregel currently performs an innerJoin to apply > messages to vertices followed by an outerJoinVertices to join the resulting > subset of vertices back to the graph. These two operations could be merged > into a single call to joinVertices, which should be reimplemented in a more > efficient manner. This would allow us to examine only the vertices that > received messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
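To make the two patterns concrete, a condensed sketch follows; the helper names and signatures are illustrative, not the actual Pregel implementation.
{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Graph, VertexId, VertexRDD}

// Current pattern: innerJoin applies the vertex program only to vertices that
// received a message, then outerJoinVertices splices the updated subset back in.
def twoStep[VD: ClassTag, ED: ClassTag, A: ClassTag](
    g: Graph[VD, ED],
    messages: VertexRDD[A],
    vprog: (VertexId, VD, A) => VD): Graph[VD, ED] = {
  val updated: VertexRDD[VD] = g.vertices.innerJoin(messages)(vprog)
  g.outerJoinVertices(updated) { (_, old, upd) => upd.getOrElse(old) }
}

// Proposed pattern: joinVertices already leaves unmatched vertices untouched,
// so only the vertices that received messages need to be examined.
def oneStep[VD: ClassTag, ED: ClassTag, A: ClassTag](
    g: Graph[VD, ED],
    messages: VertexRDD[A],
    vprog: (VertexId, VD, A) => VD): Graph[VD, ED] = {
  g.joinVertices(messages)(vprog)
}
{code}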