[jira] [Commented] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537900#comment-15537900
 ] 

Apache Spark commented on SPARK-17733:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15319

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png,
>  constraints.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table Creation|borderStyle=solid}
>  -- Query #0
> CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
> boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
> AS VALUES
>   (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
> TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
> TIMESTAMP('2007-01-13 00:00:00.0')),
>   (CAST(722.4906 AS FLOAT), true, 497.54BD, true, 
> TIMESTAMP('2015-12-14 00:00:00.0'), false, CAST(268 AS BIGINT), 
> TIMESTAMP('2021-04-19 00:00:00.0')),
>   (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
> TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
> TIMESTAMP('2019-10-16 00:00:00.0')),
>   (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
> TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
> TIMESTAMP)),
>   (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
> TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
> TIMESTAMP('2024-09-01 00:00:00.0')),
>   (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
> TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
> TIMESTAMP('2028-12-03 00:00:00.0')),
>   (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
> TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
> TIMESTAMP('1994-04-04 00:00:00.0')),
>   (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
> TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
> TIMESTAMP('2023-07-08 00:00:00.0')),
>   (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
> TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
> TIMESTAMP('2013-09-29 00:00:00.0')),
>   (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
> TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
> TIMESTAMP('1990-08-10 00:00:00.0'))
>  ;
>  -- Query #1
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS VALUES
>   (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
>   (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 00:00:00.0'), 
> CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
>   (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 00:00:00.0'), 
> CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 00:00:00.0'), 
> '211', -959, CAST(NULL AS STRING)),
>   (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
>   (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 00:00:00.0'), 
> CAST(-496 AS SMALLINT), CAST(NULL AS BOOLEAN), -925, TIMESTAMP('2028-06-27 
> 00:00:00.0'), '-657', 948, '18'),
>   (CAST(-306.138230875 AS DOUBLE), true, TIMESTAMP('1997-10-07 00:00:00.0'), 
> CAST(332 AS SMALLINT), false, 744, TIMESTAMP('1990-09-22 00:00:00.0'), 
> '-345', 566, '-574'),
>   (CAST(675.402140308 AS DOUBLE), false, TIMESTAMP('2017-06-26 00:00:00.0'), 
> CAST(972 AS SMALLINT), true, CAST(NULL AS INT), TIMESTAMP('2026-06-10 
> 00:00:00.0'), '518', 683, '-320'),
>   (CAST(734.839647174 AS DOUBLE), true, TIMESTAMP('1995-06-01 00:00:00.0'), 
> CAST(-792 AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('2021-07-11 00:00:00.0'), '-318', 564, '142'),
>   (CAST(-836.513475295 AS DOUBLE), 

[jira] [Assigned] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17733:


Assignee: (was: Apache Spark)


[jira] [Assigned] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17733:


Assignee: Apache Spark


[jira] [Updated] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-30 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17739:
--
Issue Type: Improvement  (was: Bug)

> Collapse adjacent similar Window operators
> --
>
> Key: SPARK-17739
> URL: https://issues.apache.org/jira/browse/SPARK-17739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Dongjoon Hyun
> Fix For: 2.1.0
>
>
> Spark currently does not collapse adjacent windows with the same partitioning 
> and (similar) sorting. For example:
> {noformat}
> val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
> "col1", rand() as "col2")
> // Add summary statistics for all columns
> import org.apache.spark.sql.expressions.Window
> val cols = Seq("id", "col1", "col2")
> val window = Window.partitionBy($"grp").orderBy($"id")
> val result = cols.foldLeft(df) { (base, name) =>
>   base.withColumn(s"${name}_avg", avg(col(name)).over(window))
>   .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
>   .withColumn(s"${name}_min", min(col(name)).over(window))
>   .withColumn(s"${name}_max", max(col(name)).over(window))
> }
> {noformat}
> Leads to following plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
> UnspecifiedFrame) AS col2_max#10313]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295, col2_min#10295]
>   +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_stddev#10270]
>   +- Window [stddev_samp(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
> +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
>+- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246, col2_avg#10246]
>   +- Window [avg(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
> +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
>+- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231, 

[jira] [Resolved] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-30 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17739.
---
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.1.0

Resolved per Dongjoon's PR.


[jira] [Commented] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-30 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537815#comment-15537815
 ] 

Jiang Xingbo commented on SPARK-17733:
--

[~sameer] Thank you for your help! Since `UnaryNode.getAliasedConstraints` doesn't 
generate recursive constraints, I think we'd better modify 
`QueryPlan.inferAdditionalConstraints` to avoid problems of this kind. I've 
finished a naive version of this fix and will submit a patch soon. Thank you!
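
For context, here is a minimal, self-contained sketch of the failure mode under 
discussion, using a toy string-based expression model (none of the names below 
are Spark's actual QueryPlan API): repeatedly inferring new predicates by 
substituting an attribute with an aliased expression that still contains it 
keeps producing strictly larger constraints, so the rule never reaches a fixed 
point.

{code:title=Constraint-inference non-termination sketch (toy model)|borderStyle=solid}
// Toy model only: constraints are plain strings, equalities are (attr, aliasedExpr) pairs.
object ConstraintLoopSketch {
  // One inference round: for every equality a = b, rewrite every known
  // constraint by substituting a -> b and b -> a, and keep the new forms.
  def inferRound(constraints: Set[String], equalities: Seq[(String, String)]): Set[String] =
    constraints ++ (for {
      (a, b) <- equalities
      c <- constraints
      rewritten <- Seq(c.replace(a, b), c.replace(b, a))
    } yield rewritten)

  def main(args: Array[String]): Unit = {
    // The aliased expression contains the attribute it equals, so every
    // substitution produces a strictly longer constraint.
    val equalities = Seq(("x", "coalesce(a_alias(x), y)"))
    var constraints = Set("isnotnull(x)")
    for (round <- 1 to 3) {
      constraints = inferRound(constraints, equalities)
      println(s"round $round: ${constraints.size} constraints, " +
        s"longest = ${constraints.maxBy(_.length)}")
    }
    // The constraint set never stabilizes, mirroring the optimizer hang reported above.
  }
}
{code}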


[jira] [Assigned] (SPARK-17750) Cannot create view which includes interval arithmetic

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17750:


Assignee: Apache Spark

> Cannot create view which includes interval arithmetic
> -
>
> Key: SPARK-17750
> URL: https://issues.apache.org/jira/browse/SPARK-17750
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andreas Damm
>Assignee: Apache Spark
>
> Given table
> create table dates (ts timestamp)
> the following view creation SQL fails with "Failed to analyze the 
> canonicalized SQL. It is possible there is a bug in Spark.":
> create view test_dates as select ts + interval 1 day from dates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17750) Cannot create view which includes interval arithmetic

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17750:


Assignee: (was: Apache Spark)

> Cannot create view which includes interval arithmetic
> -
>
> Key: SPARK-17750
> URL: https://issues.apache.org/jira/browse/SPARK-17750
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andreas Damm
>
> Given table
> create table dates (ts timestamp)
> the following view creation SQL fails with "Failed to analyze the 
> canonicalized SQL. It is possible there is a bug in Spark.":
> create view test_dates as select ts + interval 1 day from dates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17750) Cannot create view which includes interval arithmetic

2016-09-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537772#comment-15537772
 ] 

Apache Spark commented on SPARK-17750:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15318

> Cannot create view which includes interval arithmetic
> -
>
> Key: SPARK-17750
> URL: https://issues.apache.org/jira/browse/SPARK-17750
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andreas Damm
>
> Given table
> create table dates (ts timestamp)
> the following view creation SQL fails with "Failed to analyze the 
> canonicalized SQL. It is possible there is a bug in Spark.":
> create view test_dates as select ts + interval 1 day from dates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537740#comment-15537740
 ] 

Ashish Shrowty commented on SPARK-17709:


The join keys are both companyid and loyaltycardnumber. I wonder why you are not 
seeing it. I tried it on a few other tables I have, and it's the same behavior.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2
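
For convenience, the reported pattern restated as a runnable spark-shell 
snippet; "some_hive_table" is a hypothetical stand-in for the table name elided 
in the report.

{code:title=Reproduction sketch (spark-shell, hypothetical table name)|borderStyle=solid}
import org.apache.spark.sql.functions.avg

// Base DataFrame loaded through spark.sql(), as in the report.
val d1 = spark.sql("select * from some_hive_table")
val df1 = d1.groupBy("key1", "key2").agg(avg("totalprice").as("avgtotalprice"))
val df2 = d1.groupBy("key1", "key2").agg(avg("itemcount").as("avgqty"))
// Reported to fail on 2.0.0 with:
//   using columns ['key1,'key2] can not be resolved given input columns: ...
df1.join(df2, Seq("key1", "key2")).show()
{code}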



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15353) Making peer selection for block replication pluggable

2016-09-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15353.
-
   Resolution: Fixed
 Assignee: Shubham Chopra
Fix Version/s: 2.1.0

> Making peer selection for block replication pluggable
> -
>
> Key: SPARK-15353
> URL: https://issues.apache.org/jira/browse/SPARK-15353
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Shubham Chopra
>Assignee: Shubham Chopra
> Fix For: 2.1.0
>
> Attachments: BlockManagerSequenceDiagram.png
>
>
> BlockManagers running on executors provide all logistics around block 
> management. Before a BlockManager can be used, it has to be “initialized”. As 
> a part of the initialization, BlockManager asks the 
> BlockManagerMasterEndpoint to give it topology information. The 
> BlockManagerMasterEndpoint is provided a pluggable interface that can be used 
> to resolve a hostname to topology. This information is used to decorate the 
> BlockManagerId. This happens at cluster start and whenever a new executor is 
> added.
> During replication, the BlockManager gets the list of all its peers in the 
> form of a Seq[BlockManagerId]. We add a pluggable prioritizer that can be 
> used to prioritize this list of peers based on topology information. Peers 
> with higher priority occur first in the sequence and the BlockManager tries 
> to replicate blocks in that order.
> There would be default implementations for these pluggable interfaces that 
> replicate the existing behavior of randomly choosing a peer.
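
As a rough illustration of the pluggable prioritizer described above (the trait 
and type names below are invented for this sketch and are not Spark's actual 
BlockManager API):

{code:title=Peer prioritizer sketch (hypothetical interface)|borderStyle=solid}
import scala.util.Random

case class PeerId(host: String, rack: Option[String])

trait PeerPrioritizer {
  // Return the candidate peers ordered by replication preference.
  def prioritize(self: PeerId, peers: Seq[PeerId], rng: Random): Seq[PeerId]
}

// Default behavior mentioned above: a random order.
object RandomPrioritizer extends PeerPrioritizer {
  def prioritize(self: PeerId, peers: Seq[PeerId], rng: Random): Seq[PeerId] =
    rng.shuffle(peers)
}

// Topology-aware example: prefer peers on a different rack from this block manager.
object OffRackFirstPrioritizer extends PeerPrioritizer {
  def prioritize(self: PeerId, peers: Seq[PeerId], rng: Random): Seq[PeerId] = {
    val (offRack, sameRack) = peers.partition(p => p.rack.isDefined && p.rack != self.rack)
    rng.shuffle(offRack) ++ rng.shuffle(sameRack)
  }
}
{code}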



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-30 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537544#comment-15537544
 ] 

Liang-Chi Hsieh commented on SPARK-17556:
-

Updated the design document again to address some review comments. 

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
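
For context, a rough sketch of the driver-side path described above, which is 
where the extra latency comes from (spark-shell assumed; smallDf is a 
hypothetical small build-side DataFrame):

{code:title=Driver-side broadcast path (sketch)|borderStyle=solid}
// The small build side is first collected to the driver, then shipped back out
// to every executor as a broadcast variable; executor-side broadcast would skip
// the round trip through the driver.
val smallSide = smallDf.collect()                         // executors -> driver
val broadcasted = spark.sparkContext.broadcast(smallSide) // driver -> executors
{code}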



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-30 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: (was: executor-side-broadcast.pdf)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-30 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: executor-side-broadcast.pdf

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17703) Add unnamed version of addReferenceObj for minor objects.

2016-09-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17703.
-
   Resolution: Fixed
 Assignee: Takuya Ueshin
Fix Version/s: 2.1.0

> Add unnamed version of addReferenceObj for minor objects.
> -
>
> Key: SPARK-17703
> URL: https://issues.apache.org/jira/browse/SPARK-17703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
> Fix For: 2.1.0
>
>
> There are many `minor` objects in references, which are extracted to 
> generated class fields, e.g. {{errMsg}} in {{GetExternalRowField}} or 
> {{ValidateExternalType}}, but the number of fields in a class is limited, so 
> we should reduce it.
> I added an unnamed version of {{addReferenceObj}} for these minor objects, so 
> that the object is not stored in a field but is instead referred to from the 
> {{references}} field at the time of use.
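
A simplified sketch of the idea, using a toy codegen context (the class and 
method shapes here are illustrative, not Spark's actual CodegenContext):

{code:title=Reference-by-index sketch (toy model)|borderStyle=solid}
import scala.collection.mutable.ArrayBuffer

// Toy model of a codegen context that passes objects to the generated class
// through a single `references` array.
class SketchCodegenContext {
  val references = new ArrayBuffer[Any]()
  val memberFields = new ArrayBuffer[String]()

  // Named style: every object gets its own member field in the generated class,
  // which counts against the field limit.
  def addReferenceObjAsField(name: String, obj: Any, javaType: String): String = {
    val idx = references.length
    references += obj
    memberFields += s"private $javaType $name = ($javaType) references[$idx];"
    name
  }

  // Unnamed style described above: no field is declared; the generated code
  // indexes into `references` at the point of use.
  def addReferenceObj(obj: Any, javaType: String): String = {
    val idx = references.length
    references += obj
    s"(($javaType) references[$idx])"
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val ctx = new SketchCodegenContext
    val errMsg = ctx.addReferenceObj("cannot cast row field", "java.lang.String")
    // The expression is spliced directly into the generated Java source:
    println(s"throw new RuntimeException($errMsg);")
    println(s"member fields declared: ${ctx.memberFields.size}") // 0 with the unnamed style
  }
}
{code}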



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17707) Web UI prevents spark-submit application to be finished

2016-09-30 Thread Nick Orka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537498#comment-15537498
 ] 

Nick Orka commented on SPARK-17707:
---

I noticed this a couple of days ago. We've just switched to Spark 2.0. I'm on 
Mac OS using start-all to spin up standalone Spark. While I was developing just 
the Spark part in Scala, I was using the Scala console for testing and 
debugging, and it looks fine there because it holds all the variables during one 
session. 
We are using the Luigi workflow manager for our data pipeline. This is where 
I've noticed that the same Spark task has unstable behavior: sometimes it passes 
through, sometimes it gets stuck. Luigi opens a separate thread to run 
spark-submit as a shell command. I intercepted the exact command line and 
started it directly in a shell. I've noticed that if you open the running 
application's details in the Web UI, the application opens a socket, in a 
separate thread, for piping the details to the web listener. I can see the very 
last statement of my app execute (println("I'm done")), but the shell is still 
waiting. Ctrl-C is the only way to finish the process. If I don't open the app 
details in the Web UI, it prints "I'm done", closes all accumulator and shuffle 
processes, and returns to the shell without any problem.

> Web UI prevents spark-submit application to be finished
> ---
>
> Key: SPARK-17707
> URL: https://issues.apache.org/jira/browse/SPARK-17707
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Nick Orka
>
> Here are re-production steps:
> 1. create any scala spark application which will work long enough to open the 
> application details in Web UI
> 2. run the spark-submit command for a standalone cluster, like: --master 
> spark://localhost:7077
> 3. open running application details in Web UI, like: localhost:4040
> 4. spark-submit will never finish, you will have to kill the process
> Cause: The application creates a thread with an infinite loop for web UI 
> communication and never stops it. The application waits for that thread to 
> finish instead, even if you close the web page.
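
A minimal, Spark-free sketch of that hang, assuming (as the cause above 
suggests) the UI-communication thread is a non-daemon thread that is never 
stopped:

{code:title=Non-daemon thread hang (sketch)|borderStyle=solid}
object HangSketch {
  def main(args: Array[String]): Unit = {
    val uiThread = new Thread(new Runnable {
      // Stands in for the web UI communication loop that is never stopped.
      def run(): Unit = while (true) Thread.sleep(1000)
    })
    uiThread.setDaemon(false) // a daemon thread would let the JVM exit normally
    uiThread.start()
    println("I'm done") // prints, yet the process never returns to the shell
  }
}
{code}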



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17669) Strange behavior using Datasets

2016-09-30 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537420#comment-15537420
 ] 

Miles Crawford commented on SPARK-17669:


I was not able to reproduce this using a variety of combinations of ML libraries 
and RDD-to-Dataset-and-back-again transformations. It's probably something about 
our classpath or configuration that I haven't yet isolated. Hopefully I can get 
back to this soon.

> Strange behavior using Datasets
> ---
>
> Key: SPARK-17669
> URL: https://issues.apache.org/jira/browse/SPARK-17669
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Miles Crawford
>
> I recently migrated my application to Spark 2.0, and everything worked well, 
> except for one function that uses "toDS" and the ML libraries.
> This stage used to complete in 15 minutes or so on 1.6.2, and now takes 
> almost two hours.
> The UI shows very strange behavior - completed stages still being worked on, 
> concurrent work on tons of stages, including ones from downstream jobs:
> https://dl.dropboxusercontent.com/u/231152/spark.png
> The only source change I made was changing "toDF" to "toDS()" before handing 
> my RDDs to the ML libraries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17755) Master may ask a worker to launch an executor before

2016-09-30 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537411#comment-15537411
 ] 

Yin Huai commented on SPARK-17755:
--

cc [~joshrosen]

> Master may ask a worker to launch an executor before 
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that Spark master asked the 
> worker to launch an executor before the worker actually got the response of 
> registration. So, the master knew that the worker had been registered. But, 
> the worker did not know whether it itself had been registered. 
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17755) Master may ask a worker to launch an executor before

2016-09-30 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17755:
-
Description: 
I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
memory, serialized, replicated}}. Its log shows that Spark master asked the 
worker to launch an executor before the worker actually got the response of 
registration. So, the master knew that the worker had been registered. But, the 
worker did not know whether it itself had been registered. 

{code}
16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
localhost:38262 with 1 cores, 1024.0 MB RAM
16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO StandaloneSchedulerBackend: 
Granted executor ID app-20160930145353-/1 on hostPort localhost:38262 with 
1 cores, 1024.0 MB RAM
16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
(spark://localhost:46460) attempted to launch executor.
16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
Successfully registered with master spark://localhost:46460
{code}

Then, it seems the worker did not launch any executor. 

  was:
I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
memory, serialized, replicated}}. Its log shows that Spark master asked the 
worker to launch an executor before the worker actually got the response of 
registration. So, the master knew that the worker had been registered. But, the 
worker did not know whether it itself had been registered. 

{code}
16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
localhost:38262 with 1 cores, 1024.0 MB RAM
16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO StandaloneSchedulerBackend: 
Granted executor ID app-20160930145353-/1 on hostPort localhost:38262 with 
1 cores, 1024.0 MB RAM
16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
(spark://localhost:46460) attempted to launch executor.
16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
Successfully registered with master spark://localhost:46460
{code}



> Master may ask a worker to launch an executor before 
> -
>
> Key: SPARK-17755
> URL: https://issues.apache.org/jira/browse/SPARK-17755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Yin Huai
>
> I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
> memory, serialized, replicated}}. Its log shows that Spark master asked the 
> worker to launch an executor before the worker actually got the response of 
> registration. So, the master knew that the worker had been registered. But, 
> the worker did not know whether it itself had been registered. 
> {code}
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
> localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
> app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
> 16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
> StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
> on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
> 16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO 
> StandaloneSchedulerBackend: Granted executor ID app-20160930145353-/1 on 
> hostPort localhost:38262 with 1 cores, 1024.0 MB RAM
> 16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
> (spark://localhost:46460) attempted to launch executor.
> 16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
> Successfully registered with master spark://localhost:46460
> {code}
> Then, it seems the worker did not launch any executor. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17755) Master may ask a worker to launch an executor before

2016-09-30 Thread Yin Huai (JIRA)
Yin Huai created SPARK-17755:


 Summary: Master may ask a worker to launch an executor before 
 Key: SPARK-17755
 URL: https://issues.apache.org/jira/browse/SPARK-17755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Yin Huai


I somehow saw a failed test {{org.apache.spark.DistributedSuite.caching in 
memory, serialized, replicated}}. Its log shows that Spark master asked the 
worker to launch an executor before the worker actually got the response of 
registration. So, the master knew that the worker had been registered. But, the 
worker did not know whether it itself had been registered. 

{code}
16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Registering worker 
localhost:38262 with 1 cores, 1024.0 MB RAM
16/09/30 14:53:53.681 dispatcher-event-loop-0 INFO Master: Launching executor 
app-20160930145353-/1 on worker worker-20160930145353-localhost-38262
16/09/30 14:53:53.682 dispatcher-event-loop-3 INFO 
StandaloneAppClient$ClientEndpoint: Executor added: app-20160930145353-/1 
on worker-20160930145353-localhost-38262 (localhost:38262) with 1 cores
16/09/30 14:53:53.683 dispatcher-event-loop-3 INFO StandaloneSchedulerBackend: 
Granted executor ID app-20160930145353-/1 on hostPort localhost:38262 with 
1 cores, 1024.0 MB RAM
16/09/30 14:53:53.683 dispatcher-event-loop-0 WARN Worker: Invalid Master 
(spark://localhost:46460) attempted to launch executor.
16/09/30 14:53:53.687 worker-register-master-threadpool-0 INFO Worker: 
Successfully registered with master spark://localhost:46460
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17750) Cannot create view which includes interval arithmetic

2016-09-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537389#comment-15537389
 ] 

Dongjoon Hyun commented on SPARK-17750:
---

Hi, [~andreasdamm].
It really does. I'll make a PR for this.
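
For reference, a minimal reproduction sketch in spark-shell, based only on the 
SQL quoted in the issue below:

{code:title=Reproduction sketch (spark-shell)|borderStyle=solid}
// Table from the report.
spark.sql("CREATE TABLE dates (ts TIMESTAMP)")
// Reported to fail on 2.0.0 with:
//   "Failed to analyze the canonicalized SQL. It is possible there is a bug in Spark."
spark.sql("CREATE VIEW test_dates AS SELECT ts + INTERVAL 1 DAY FROM dates")
{code}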

> Cannot create view which includes interval arithmetic
> -
>
> Key: SPARK-17750
> URL: https://issues.apache.org/jira/browse/SPARK-17750
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andreas Damm
>
> Given table
> create table dates (ts timestamp)
> the following view creation SQL fails with "Failed to analyze the 
> canonicalized SQL. It is possible there is a bug in Spark.":
> create view test_dates as select ts + interval 1 day from dates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537336#comment-15537336
 ] 

Dilip Biswal commented on SPARK-17709:
--

[~ashrowty] Hmm.. and are your join keys companyid, loyaltycardnumber, or both? 
If so, I have the exact same scenario but am not seeing the error you are 
seeing.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17754) DataFrame reader and writer don't show Input/Output metrics in Spark UI

2016-09-30 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-17754:

Component/s: Web UI
 SQL

> DataFrame reader and writer don't show Input/Output metrics in Spark UI
> ---
>
> Key: SPARK-17754
> URL: https://issues.apache.org/jira/browse/SPARK-17754
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>
> The Spark UI shows metrics such as Input Size, Output Size, Shuffle 
> Read/Write, etc.
> The input and output fields are blank for DataFrames even though I'm reading 
> or writing data. The input field was not empty back in 1.6.2, so this is a 
> regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17754) DataFrame reader and writer don't show Input/Output metrics in Spark UI

2016-09-30 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17754:
---

 Summary: DataFrame reader and writer don't show Input/Output 
metrics in Spark UI
 Key: SPARK-17754
 URL: https://issues.apache.org/jira/browse/SPARK-17754
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Burak Yavuz


The Spark UI shows metrics such as Input Size, Output Size, Shuffle Read/Write, 
etc.

The input and output fields are blank for DataFrames even though I'm reading or
writing data. The input field was not empty back in 1.6.2, so this is a
regression.
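
For reference, a minimal sketch of the kind of job where these metrics would be
checked; the paths are placeholders and nothing here is taken from the report
itself.

{code}
// Sketch only (placeholder paths): after running this, the read stage would be
// expected to show Input Size and the write stage Output Size in the Spark UI,
// but per this report both columns stay blank on 2.0.0.
val df = spark.read.parquet("/tmp/some-input")
df.write.mode("overwrite").parquet("/tmp/some-output")
{code}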



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17074) generate histogram information for column

2016-09-30 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537295#comment-15537295
 ] 

Timothy Hunter commented on SPARK-17074:


We have discussed this over email and either approach is fine. Regarding the
second one, even if the result is approximate, you can still get some reasonable
bounds on the error.

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the
> equi-height histogram when the number of distinct values is equal to or greater
> than 254.
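
As a tiny illustration of the selection rule described above, here is a
hypothetical sketch; the 254 threshold comes from the description, while the
types and function below are illustrative only and not Spark APIs.

{code}
// Hypothetical sketch of the histogram choice described in the issue; not a Spark API.
sealed trait HistogramKind
case object EquiWidth  extends HistogramKind
case object EquiHeight extends HistogramKind

def chooseHistogram(ndv: Long): HistogramKind =
  if (ndv < 254) EquiWidth else EquiHeight
{code}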



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537264#comment-15537264
 ] 

Ashish Shrowty commented on SPARK-17709:


[~dkbiswal] Attached are the explain() outputs -

df1.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], functions=[avg(cast(itemcount#3372 as bigint))])
+- Exchange hashpartitioning(companyid#3364, loyaltycardnumber#3370, 200)
   +- *HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], functions=[partial_avg(cast(itemcount#3372 as bigint))])
      +- *Project [loyaltycardnumber#3370, itemcount#3372, companyid#3364]
         +- *BatchedScan parquet facts.storetransaction[loyaltycardnumber#3370,itemcount#3372,year#3362,month#3363,companyid#3364] Format: ParquetFormat, InputPaths: s3://com.birdzi.datalake.test/basedatasets/facts/storetransaction/2016-09-15-2012/year=2002/month..., PushedFilters: [], ReadSchema: struct

df2.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], functions=[avg(totalprice#3373)])
+- Exchange hashpartitioning(companyid#3364, loyaltycardnumber#3370, 200)
   +- *HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], functions=[partial_avg(totalprice#3373)])
      +- *Project [loyaltycardnumber#3370, totalprice#3373, companyid#3364]
         +- *BatchedScan parquet facts.storetransaction[loyaltycardnumber#3370,totalprice#3373,year#3362,month#3363,companyid#3364] Format: ParquetFormat, InputPaths: s3://com.birdzi.datalake.test/basedatasets/facts/storetransaction/2016-09-15-2012/year=2002/month..., PushedFilters: [], ReadSchema: struct
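
Until the resolution issue is sorted out, here is a hedged sketch of one possible
workaround; it is untested against this exact setup and simply avoids the
USING-style column resolution by renaming the keys on one side. d1 and the column
names come from the report.

{code}
// Sketch only: rename the grouping keys on one leg, join with an explicit
// condition, then drop the duplicated key columns.
import org.apache.spark.sql.functions.avg

val df1 = d1.groupBy("key1", "key2").agg(avg("totalprice").as("avgtotalprice"))
val df2 = d1.groupBy("key1", "key2").agg(avg("itemcount").as("avgqty"))
  .withColumnRenamed("key1", "k1")
  .withColumnRenamed("key2", "k2")

val joined = df1
  .join(df2, df1("key1") === df2("k1") && df1("key2") === df2("k2"))
  .drop("k1")
  .drop("k2")
{code}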


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17753) Simple case in spark sql throws ParseException

2016-09-30 Thread kanika dhuria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria updated SPARK-17753:
--
Description: 
Simple case in sql throws parser exception in spark 2.0.
The following query as well as similar queries fail in spark 2.0 
scala> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 
FROM hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 
LTE LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
INT) END))")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'FROM' expecting {, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 
'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', 'SORT', 
'CLUSTER', 'DISTRIBUTE'}(line 1, pos 60)

== SQL ==
SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM hadoop_tbl_all 
alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 LTE 
LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
INT) END))
^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  ... 48 elided


  was:
Simple case in sql throws parser exception in spark 2.0.
The following query fails in spark 2.0 
scala> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 
FROM hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 
LTE LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
INT) END))")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'FROM' expecting {, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 
'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', 'SORT', 
'CLUSTER', 'DISTRIBUTE'}(line 1, pos 60)

== SQL ==
SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM hadoop_tbl_all 
alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 LTE 
LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
INT) END))
^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  ... 48 elided



> Simple case in spark sql throws ParseException
> --
>
> Key: SPARK-17753
> URL: https://issues.apache.org/jira/browse/SPARK-17753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
>
> Simple case in sql throws parser exception in spark 2.0.
> The following query as well as similar queries fail in spark 2.0 
> scala> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 
> FROM hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR 
> (8 LTE LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE 
> CAST(NULL AS INT) END))")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'FROM' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', 
> 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 60)
> == SQL ==
> SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM 
> hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 
> LTE LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL 
> AS INT) END))
> ^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17753) Simple case in spark sql throws ParseException

2016-09-30 Thread kanika dhuria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria updated SPARK-17753:
--
Description: 
Simple case in sql throws parser exception in spark 2.0.
The following query fails in spark 2.0 
scala> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 
FROM hadoop_tbl_all alias WHERE  CASE 'ab' = alias.p_text  WHEN TRUE 
THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS INT) END")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'EQ' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 
'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 
'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', 
'-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 
1, pos 111)

== SQL ==
SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM hadoop_tbl_all 
alias WHERE  CASE 'ab' EQ alias.p_text  WHEN TRUE THEN 1  WHEN FALSE 
THEN 0 ELSE CAST(NULL AS INT) END
---^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  ... 48 elided



  was:
Simple case in sql throws parser exception in spark 2.0.
The following query fails in spark 2.0 

spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM 
hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 <= 
LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
INT) END))")

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'EQ' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 
'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 
'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', 
'-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 
1, pos 111)

== SQL ==
SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM hadoop_tbl_all 
alias WHERE  CASE 'ab' EQ alias.p_text  WHEN TRUE=TRUE THEN 1  WHEN 
TRUE=FALSE THEN 0 ELSE CAST(NULL AS INT) END
---^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  ... 48 elided



> Simple case in spark sql throws ParseException
> --
>
> Key: SPARK-17753
> URL: https://issues.apache.org/jira/browse/SPARK-17753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
>
> Simple case in sql throws parser exception in spark 2.0.
> The following query fails in spark 2.0 
> scala> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 
> FROM hadoop_tbl_all alias WHERE  CASE 'ab' = alias.p_text  WHEN TRUE 
> THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS INT) END")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'EQ' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 
> 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 
> 'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, 
> '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 
> 'DISTRIBUTE'}(line 1, pos 111)
> == SQL ==
> SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM 
> hadoop_tbl_all alias WHERE  CASE 'ab' EQ alias.p_text  WHEN TRUE THEN 
> 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS INT) END
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at 

[jira] [Commented] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537252#comment-15537252
 ] 

Apache Spark commented on SPARK-17739:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/15317

> Collapse adjacent similar Window operators
> --
>
> Key: SPARK-17739
> URL: https://issues.apache.org/jira/browse/SPARK-17739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>
> Spark currently does not collapse adjacent windows with the same partitioning 
> and (similar) sorting. For example:
> {noformat}
> val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
> "col1", rand() as "col2")
> // Add summary statistics for all columns
> import org.apache.spark.sql.expressions.Window
> val cols = Seq("id", "col1", "col2")
> val window = Window.partitionBy($"grp").orderBy($"id")
> val result = cols.foldLeft(df) { (base, name) =>
>   base.withColumn(s"${name}_avg", avg(col(name)).over(window))
>   .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
>   .withColumn(s"${name}_min", min(col(name)).over(window))
>   .withColumn(s"${name}_max", max(col(name)).over(window))
> }
> {noformat}
> Leads to following plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
> UnspecifiedFrame) AS col2_max#10313]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295, col2_min#10295]
>   +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_stddev#10270]
>   +- Window [stddev_samp(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
> +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
>+- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246, col2_avg#10246]
>   +- Window [avg(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
> +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
>+- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> 
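
A hedged sketch of what the proposed collapse amounts to in user code: putting all
the window expressions into a single select over the same window spec yields one
Window operator, instead of one per withColumn as in the foldLeft example quoted
above. The df here is the same range-based DataFrame from that example.

{code}
// Sketch only: all summary statistics in one select over the same window.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, max, min, stddev}

val window = Window.partitionBy(col("grp")).orderBy(col("id"))
val summaryCols = Seq("id", "col1", "col2").flatMap { name =>
  Seq(avg(col(name)).over(window).as(s"${name}_avg"),
      stddev(col(name)).over(window).as(s"${name}_stddev"),
      min(col(name)).over(window).as(s"${name}_min"),
      max(col(name)).over(window).as(s"${name}_max"))
}
val collapsed = df.select(col("*") +: summaryCols: _*)
{code}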

[jira] [Assigned] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17739:


Assignee: Apache Spark

> Collapse adjacent similar Window operators
> --
>
> Key: SPARK-17739
> URL: https://issues.apache.org/jira/browse/SPARK-17739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> Spark currently does not collapse adjacent windows with the same partitioning 
> and (similar) sorting. For example:
> {noformat}
> val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
> "col1", rand() as "col2")
> // Add summary statistics for all columns
> import org.apache.spark.sql.expressions.Window
> val cols = Seq("id", "col1", "col2")
> val window = Window.partitionBy($"grp").orderBy($"id")
> val result = cols.foldLeft(df) { (base, name) =>
>   base.withColumn(s"${name}_avg", avg(col(name)).over(window))
>   .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
>   .withColumn(s"${name}_min", min(col(name)).over(window))
>   .withColumn(s"${name}_max", max(col(name)).over(window))
> }
> {noformat}
> Leads to following plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
> UnspecifiedFrame) AS col2_max#10313]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295, col2_min#10295]
>   +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_stddev#10270]
>   +- Window [stddev_samp(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
> +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
>+- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246, col2_avg#10246]
>   +- Window [avg(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
> +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
>+- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231, col1_max#10231]
>   +- 

[jira] [Assigned] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17739:


Assignee: (was: Apache Spark)

> Collapse adjacent similar Window operators
> --
>
> Key: SPARK-17739
> URL: https://issues.apache.org/jira/browse/SPARK-17739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>
> Spark currently does not collapse adjacent windows with the same partitioning 
> and (similar) sorting. For example:
> {noformat}
> val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
> "col1", rand() as "col2")
> // Add summary statistics for all columns
> import org.apache.spark.sql.expressions.Window
> val cols = Seq("id", "col1", "col2")
> val window = Window.partitionBy($"grp").orderBy($"id")
> val result = cols.foldLeft(df) { (base, name) =>
>   base.withColumn(s"${name}_avg", avg(col(name)).over(window))
>   .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
>   .withColumn(s"${name}_min", min(col(name)).over(window))
>   .withColumn(s"${name}_max", max(col(name)).over(window))
> }
> {noformat}
> Leads to following plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
> UnspecifiedFrame) AS col2_max#10313]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295, col2_min#10295]
>   +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_stddev#10270]
>   +- Window [stddev_samp(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
> +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
>+- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246, col2_avg#10246]
>   +- Window [avg(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
> +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
>+- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231, col1_max#10231]
>   +- Window [max(col1#10097) 
> 

[jira] [Updated] (SPARK-17753) Simple case in spark sql throws ParseException

2016-09-30 Thread kanika dhuria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria updated SPARK-17753:
--
Description: 
Simple case in sql throws parser exception in spark 2.0.
The following query fails in spark 2.0 
scala> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 
FROM hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 
LTE LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
INT) END))")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'FROM' expecting {, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 
'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', 'SORT', 
'CLUSTER', 'DISTRIBUTE'}(line 1, pos 60)

== SQL ==
SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM hadoop_tbl_all 
alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 LTE 
LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
INT) END))
^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  ... 48 elided


  was:
Simple case in sql throws parser exception in spark 2.0.
The following query fails in spark 2.0 
scala> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 
FROM hadoop_tbl_all alias WHERE  CASE 'ab' = alias.p_text  WHEN TRUE 
THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS INT) END")
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'EQ' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 
'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 
'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', 
'-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 
1, pos 111)

== SQL ==
SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM hadoop_tbl_all 
alias WHERE  CASE 'ab' EQ alias.p_text  WHEN TRUE THEN 1  WHEN FALSE 
THEN 0 ELSE CAST(NULL AS INT) END
---^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  ... 48 elided




> Simple case in spark sql throws ParseException
> --
>
> Key: SPARK-17753
> URL: https://issues.apache.org/jira/browse/SPARK-17753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
>
> Simple case in sql throws parser exception in spark 2.0.
> The following query fails in spark 2.0 
> scala> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 
> FROM hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR 
> (8 LTE LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE 
> CAST(NULL AS INT) END))")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'FROM' expecting {, 'WHERE', 'GROUP', 'ORDER', 
> 'HAVING', 'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'INTERSECT', 
> 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 1, pos 60)
> == SQL ==
> SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM 
> hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 
> LTE LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL 
> AS INT) END))
> ^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-17074) generate histogram information for column

2016-09-30 Thread Zhenhua Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537248#comment-15537248
 ] 

Zhenhua Wang commented on SPARK-17074:
--

Hi, there's something I want to discuss here. In order to generate equi-height
histograms, we need to get the ndv (number of distinct values) for each bin in the
histogram (this information is important in estimation).
I think we have two ways to get it:
1. Use percentile_approx to get the percentiles (the equi-height bin boundaries),
and use a new aggregate function to count the ndv in each of these intervals. This
takes two table scans (a rough sketch of this option follows below).
2. Modify QuantileSummaries so that it can count distinct values at the same time
as it computes the percentiles. This takes only one table scan, but I'm not sure
about the accuracy of the ndv results.
So there's a performance vs. accuracy trade-off here. I tend to prefer the second
method. What do you think? [~rxin] [~hvanhovell] [~vssrinath] [~thunterdb] [~ron8hu]
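
A rough sketch of option 1 (two scans) using existing DataFrame APIs, assuming a
DataFrame df with a numeric column c and 4 equi-height bins; it uses countDistinct
for the per-bin ndv purely for illustration, rather than the new aggregate
function proposed above.

{code}
import org.apache.spark.sql.functions.{col, countDistinct, when}

// Scan 1: approximate quartile boundaries (3 cut points -> 4 equi-height bins).
val Array(q1, q2, q3) = df.stat.approxQuantile("c", Array(0.25, 0.5, 0.75), 0.01)

// Scan 2: assign each row to a bin, then count distinct values per bin.
val bin = when(col("c") <= q1, 0)
  .when(col("c") <= q2, 1)
  .when(col("c") <= q3, 2)
  .otherwise(3)
val ndvPerBin = df.groupBy(bin.as("bin")).agg(countDistinct(col("c")).as("ndv"))
{code}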


> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the
> equi-height histogram when the number of distinct values is equal to or greater
> than 254.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17753) Simple case in spark sql throws ParseException

2016-09-30 Thread kanika dhuria (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kanika dhuria updated SPARK-17753:
--
Summary: Simple case in spark sql throws ParseException  (was: Simple case 
in spark sql throws ParserException)

> Simple case in spark sql throws ParseException
> --
>
> Key: SPARK-17753
> URL: https://issues.apache.org/jira/browse/SPARK-17753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: kanika dhuria
>
> Simple case in sql throws parser exception in spark 2.0.
> The following query fails in spark 2.0 
> spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM 
> hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 <= 
> LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
> INT) END))")
> org.apache.spark.sql.catalyst.parser.ParseException:
> mismatched input 'EQ' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 
> 'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 
> 'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, 
> '+', '-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 
> 'DISTRIBUTE'}(line 1, pos 111)
> == SQL ==
> SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM 
> hadoop_tbl_all alias WHERE  CASE 'ab' EQ alias.p_text  WHEN TRUE=TRUE 
> THEN 1  WHEN TRUE=FALSE THEN 0 ELSE CAST(NULL AS INT) END
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   ... 48 elided



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17753) Simple case in spark sql throws ParserException

2016-09-30 Thread kanika dhuria (JIRA)
kanika dhuria created SPARK-17753:
-

 Summary: Simple case in spark sql throws ParserException
 Key: SPARK-17753
 URL: https://issues.apache.org/jira/browse/SPARK-17753
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: kanika dhuria


Simple case in sql throws parser exception in spark 2.0.
The following query fails in spark 2.0 

spark.sql("SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM 
hadoop_tbl_all alias WHERE  (1 = (CASE ('ab' = alias.p_text) OR (8 <= 
LENGTH(alias.p_text)) WHEN TRUE THEN 1  WHEN FALSE THEN 0 ELSE CAST(NULL AS 
INT) END))")

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'EQ' expecting {, '.', '[', 'GROUP', 'ORDER', 'HAVING', 
'LIMIT', 'OR', 'AND', 'IN', NOT, 'BETWEEN', 'LIKE', RLIKE, 'IS', 'WINDOW', 
'UNION', 'EXCEPT', 'INTERSECT', EQ, '<=>', '<>', '!=', '<', LTE, '>', GTE, '+', 
'-', '*', '/', '%', 'DIV', '&', '|', '^', 'SORT', 'CLUSTER', 'DISTRIBUTE'}(line 
1, pos 111)

== SQL ==
SELECT alias.p_double as a0, alias.p_text as a1, NULL as a2 FROM hadoop_tbl_all 
alias WHERE  CASE 'ab' EQ alias.p_text  WHEN TRUE=TRUE THEN 1  WHEN 
TRUE=FALSE THEN 0 ELSE CAST(NULL AS INT) END
---^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  ... 48 elided
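
For what it's worth, a hedged sketch of a rewrite that could be tried while this
is open: express the condition as a searched CASE (CASE WHEN ... THEN ...) instead
of a simple CASE whose operand is itself a boolean expression. Table and column
names are the reporter's placeholders; the rewrite is untested here.

{code}
// Sketch only: an equivalent searched CASE, avoiding the simple-CASE operand that
// fails to parse above.
spark.sql("""
  SELECT alias.p_double AS a0, alias.p_text AS a1, NULL AS a2
  FROM hadoop_tbl_all alias
  WHERE 1 = (CASE
               WHEN ('ab' = alias.p_text) OR (8 <= LENGTH(alias.p_text)) THEN 1
               WHEN NOT (('ab' = alias.p_text) OR (8 <= LENGTH(alias.p_text))) THEN 0
               ELSE CAST(NULL AS INT)
             END)
""")
{code}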




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-09-30 Thread Kevin Ushey (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Ushey updated SPARK-17752:

Description: 
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
installation as necessary):

{code:r}
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
"2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl) # works fine without this
cl <- collect(tbl)

identical(df, cl) # FALSE
{code}

Although this is reproducible with SparkR, it seems more likely that this is an 
error in the Java / Scala Spark sources.

For posterity:

> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)


  was:
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
installation as necessary):

{{
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
"2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl) # works fine without this
cl <- collect(tbl)

identical(df, cl) # FALSE
}}

Although this is reproducible with SparkR, it seems more likely that this is an 
error in the Java / Scala Spark sources.

For posterity:

> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)



> Spark returns incorrect result when 'collect()'ing a cached Dataset with many 
> columns
> -
>
> Key: SPARK-17752
> URL: https://issues.apache.org/jira/browse/SPARK-17752
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kevin Ushey
>Priority: Critical
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> {code:r}
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> path <- tempfile()
> write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
> quote = FALSE)
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> {code}
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-09-30 Thread Kevin Ushey (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Ushey updated SPARK-17752:

Description: 
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
installation as necessary):

{{
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
"2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl) # works fine without this
cl <- collect(tbl)

identical(df, cl) # FALSE
}}

Although this is reproducible with SparkR, it seems more likely that this is an 
error in the Java / Scala Spark sources.

For posterity:

> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)


  was:
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
installation as necessary):

```
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
"2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl) # works fine without this
cl <- collect(tbl)

identical(df, cl) # FALSE
```

Although this is reproducible with SparkR, it seems more likely that this is an 
error in the Java / Scala Spark sources.

For posterity:

> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)



> Spark returns incorrect result when 'collect()'ing a cached Dataset with many 
> columns
> -
>
> Key: SPARK-17752
> URL: https://issues.apache.org/jira/browse/SPARK-17752
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kevin Ushey
>Priority: Critical
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> {{
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> path <- tempfile()
> write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
> quote = FALSE)
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> }}
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-09-30 Thread Kevin Ushey (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Ushey updated SPARK-17752:

Description: 
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
installation as necessary):

```
SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
"2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl) # works fine without this
cl <- collect(tbl)

identical(df, cl) # FALSE
```

Although this is reproducible with SparkR, it seems more likely that this is an 
error in the Java / Scala Spark sources.

For posterity:

> sessionInfo()
R version 3.3.1 Patched (2016-07-30 r71015)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra (10.12)


  was:
Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
installation as necessary):

---

SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
"2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl) # works fine without this
cl <- collect(tbl)

identical(df, cl) # FALSE

---

Although this is reproducible with SparkR, it seems more likely that this is an 
error in the Java / Scala Spark sources.


> Spark returns incorrect result when 'collect()'ing a cached Dataset with many 
> columns
> -
>
> Key: SPARK-17752
> URL: https://issues.apache.org/jira/browse/SPARK-17752
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kevin Ushey
>Priority: Critical
>
> Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
> installation as necessary):
> ```
> SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
> Sys.setenv(SPARK_HOME = SPARK_HOME)
> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
> sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
> "2g"))
> n <- 1E3
> df <- as.data.frame(replicate(n, 1L, FALSE))
> names(df) <- paste("X", 1:n, sep = "")
> path <- tempfile()
> write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
> quote = FALSE)
> tbl <- as.DataFrame(df)
> cache(tbl) # works fine without this
> cl <- collect(tbl)
> identical(df, cl) # FALSE
> ```
> Although this is reproducible with SparkR, it seems more likely that this is 
> an error in the Java / Scala Spark sources.
> For posterity:
> > sessionInfo()
> R version 3.3.1 Patched (2016-07-30 r71015)
> Platform: x86_64-apple-darwin13.4.0 (64-bit)
> Running under: macOS Sierra (10.12)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17752) Spark returns incorrect result when 'collect()'ing a cached Dataset with many columns

2016-09-30 Thread Kevin Ushey (JIRA)
Kevin Ushey created SPARK-17752:
---

 Summary: Spark returns incorrect result when 'collect()'ing a 
cached Dataset with many columns
 Key: SPARK-17752
 URL: https://issues.apache.org/jira/browse/SPARK-17752
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Kevin Ushey
Priority: Critical


Run the following code (modify SPARK_HOME to point to a Spark 2.0.0 
installation as necessary):

---

SPARK_HOME <- path.expand("~/Library/Caches/spark/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(SPARK_HOME = SPARK_HOME)

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", sparkConfig = list(spark.driver.memory = 
"2g"))

n <- 1E3
df <- as.data.frame(replicate(n, 1L, FALSE))
names(df) <- paste("X", 1:n, sep = "")

path <- tempfile()
write.table(df, file = path, row.names = FALSE, col.names = TRUE, sep = ",", 
quote = FALSE)

tbl <- as.DataFrame(df)
cache(tbl) # works fine without this
cl <- collect(tbl)

identical(df, cl) # FALSE

---

Although this is reproducible with SparkR, it seems more likely that this is an 
error in the Java / Scala Spark sources.
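
Since the suspicion is that the problem sits below SparkR, here is a hedged Scala
sketch of an equivalent check; the column and row counts mirror the R
reproduction, and nothing here is taken from the report itself.

{code}
// Build a single-row, 1000-column DataFrame of 1s, then compare collect() results
// before and after cache(). Per the report, the cached collect may differ.
val n = 1000
val wide = spark.range(1).selectExpr((1 to n).map(i => s"1 AS X$i"): _*)

val before = wide.collect()
wide.cache()
val after = wide.collect()
println(before.sameElements(after))  // expected true
{code}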



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537205#comment-15537205
 ] 

Dilip Biswal edited comment on SPARK-17709 at 9/30/16 10:07 PM:


@ashrowty Hi Ashish, is it possible for you to post the explain output for both
legs of the join? So if we are joining two dataframes df1 and df2, can we get the
output of
df1.explain(true)
df2.explain(true)

From the error, it seems like key1 and key2 are not present in one leg of the
join output attribute set.

So if I were to change your test program to the following:

{code}
val df1 = d1.groupBy("key1", "key2")
  .agg(avg("totalprice").as("avgtotalprice"))
df1.explain(true)
val df2 = d1.agg(avg("itemcount").as("avgqty"))
df2.explain(true)
df1.join(df2, Seq("key1", "key2"))
{code}
I am able to see the same error you are seeing.


was (Author: dkbiswal):
@ashrowty Hi Ashish, is it possible for you to post the explain output for both
legs of the join? So if we are joining two dataframes df1 and df2, can we get the
output of
df1.explain(true)
df2.explain(true)

From the error, it seems like key1 and key2 are not present in one leg of the
join output attribute set.

So if I were to change your test program to the following:

 val df1 = d1.groupBy("key1", "key2")
  .agg(avg("totalprice").as("avgtotalprice"))
  df1.explain(true)
  val df2 = d1.agg(avg("itemcount").as("avgqty"))
  df2.explain(true)
df1.join(df2, Seq("key1", "key2"))

I am able to see the same error you are seeing.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537205#comment-15537205
 ] 

Dilip Biswal commented on SPARK-17709:
--

@ashrowty Hi Ashish, is it possible for you to post the explain output for both
legs of the join? So if we are joining two dataframes df1 and df2, can we get the
output of
df1.explain(true)
df2.explain(true)

From the error, it seems like key1 and key2 are not present in one leg of the
join output attribute set.

So if I were to change your test program to the following:

val df1 = d1.groupBy("key1", "key2")
  .agg(avg("totalprice").as("avgtotalprice"))
df1.explain(true)
val df2 = d1.agg(avg("itemcount").as("avgqty"))
df2.explain(true)
df1.join(df2, Seq("key1", "key2"))

I am able to see the same error you are seeing.
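
One hedged way to confirm that from the shell is to print the attribute names each
leg exposes after analysis; queryExecution is a developer-facing field on Dataset,
and df1/df2 follow the snippet above.

{code}
// Sketch: list the output attributes of each leg of the join.
println(df1.queryExecution.analyzed.output.map(_.name))
println(df2.queryExecution.analyzed.output.map(_.name))
{code}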

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17751) Remove spark.sql.eagerAnalysis

2016-09-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537197#comment-15537197
 ] 

Apache Spark commented on SPARK-17751:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15316

> Remove spark.sql.eagerAnalysis
> --
>
> Key: SPARK-17751
> URL: https://issues.apache.org/jira/browse/SPARK-17751
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>
> Dataset always does eager analysis now. Thus, spark.sql.eagerAnalysis is not 
> used any more. Thus, we need to remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17751) Remove spark.sql.eagerAnalysis

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17751:


Assignee: Apache Spark

> Remove spark.sql.eagerAnalysis
> --
>
> Key: SPARK-17751
> URL: https://issues.apache.org/jira/browse/SPARK-17751
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Dataset always does eager analysis now. Thus, spark.sql.eagerAnalysis is not 
> used any more. Thus, we need to remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17751) Remove spark.sql.eagerAnalysis

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17751:


Assignee: (was: Apache Spark)

> Remove spark.sql.eagerAnalysis
> --
>
> Key: SPARK-17751
> URL: https://issues.apache.org/jira/browse/SPARK-17751
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>
> Dataset always does eager analysis now. Thus, spark.sql.eagerAnalysis is not 
> used any more. Thus, we need to remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17751) Remove spark.sql.eagerAnalysis

2016-09-30 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17751:
---

 Summary: Remove spark.sql.eagerAnalysis
 Key: SPARK-17751
 URL: https://issues.apache.org/jira/browse/SPARK-17751
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0, 2.1.0
Reporter: Xiao Li


Dataset always does eager analysis now, so spark.sql.eagerAnalysis is no longer 
used and we should remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17737) cannot import name accumulators error

2016-09-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17737.
---
Resolution: Duplicate

> cannot import name accumulators error
> -
>
> Key: SPARK-17737
> URL: https://issues.apache.org/jira/browse/SPARK-17737
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
> Environment: unix
> python 2.7
>Reporter: Pruthveej Reddy Kasarla
>
> Hi I am trying to setup my sparkcontext using the below code
> import sys
> sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
> sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
> from pyspark import SparkConf, SparkContext
> sconf = SparkConf()
> sc = SparkContext(conf=sconf)
> print sc
> got below error
> ImportError   Traceback (most recent call last)
>  in ()
>   2 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
>   3 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
> > 4 from pyspark import SparkConf, SparkContext
>   5 sconf = SparkConf()
>   6 sc = SparkContext(conf=sconf)
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/__init__.py in ()
>  39 
>  40 from pyspark.conf import SparkConf
> ---> 41 from pyspark.context import SparkContext
>  42 from pyspark.rdd import RDD
>  43 from pyspark.files import SparkFiles
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in ()
>  26 from tempfile import NamedTemporaryFile
>  27 
> ---> 28 from pyspark import accumulators
>  29 from pyspark.accumulators import Accumulator
>  30 from pyspark.broadcast import Broadcast
> ImportError: cannot import name accumulators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17746) Code duplication to compute the path to spark-defaults.conf

2016-09-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537130#comment-15537130
 ] 

Sean Owen commented on SPARK-17746:
---

I'm not sure it can be, because core and launcher don't (and can't) depend on 
each other?

> Code duplication to compute the path to spark-defaults.conf
> ---
>
> Key: SPARK-17746
> URL: https://issues.apache.org/jira/browse/SPARK-17746
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> [CommandBuilderUtils.DEFAULT_PROPERTIES_FILE|https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java#L31],
>  
> [AbstractCommandBuilder.getConfDir|https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L312]
>  and 
> [Utils.getDefaultPropertiesFile|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L2006]
>  are dealing with the default properties file, i.e. {{spark-defaults.conf}} 
> (or the Spark configuration directory where the file is).
> The code duplication could be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17750) Cannot create view which includes interval arithmetic

2016-09-30 Thread Andreas Damm (JIRA)
Andreas Damm created SPARK-17750:


 Summary: Cannot create view which includes interval arithmetic
 Key: SPARK-17750
 URL: https://issues.apache.org/jira/browse/SPARK-17750
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andreas Damm


Given table

create table dates (ts timestamp)

the following view creation SQL fails with "Failed to analyze the canonicalized 
SQL. It is possible there is a bug in Spark.":

create view test_dates as select ts + interval 1 day from dates




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17749) Unresolved columns when nesting SQL join clauses

2016-09-30 Thread Andreas Damm (JIRA)
Andreas Damm created SPARK-17749:


 Summary: Unresolved columns when nesting SQL join clauses
 Key: SPARK-17749
 URL: https://issues.apache.org/jira/browse/SPARK-17749
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Andreas Damm


Given tables

CREATE TABLE `sf_datedconversionrate2`(`isocode` string)
CREATE TABLE `sf_opportunity2`(`currencyisocode` string, `accountid` string)
CREATE TABLE `sf_account2`(`id` string)

the following SQL will cause an analysis exception (cannot resolve 
'`sf_opportunity.currencyisocode`' given input columns: [isocode, id])

SELECT0 
FROM  `sf_datedconversionrate2` AS `sf_datedconversionrate` 
LEFT JOIN `sf_account2` AS `sf_account` 
LEFT JOIN `sf_opportunity2` AS `sf_opportunity` 
ON`sf_account`.`id` = `sf_opportunity`.`accountid` 
ON`sf_datedconversionrate`.`isocode` = 
`sf_opportunity`.`currencyisocode` 

even though all columns referred to in the conditions should be in scope.

Re-ordering the JOIN and ON clauses makes it work:

SELECT0 
FROM  `sf_datedconversionrate2` AS `sf_datedconversionrate` 
LEFT JOIN `sf_opportunity2` AS `sf_opportunity` 
LEFT JOIN `sf_account2` AS `sf_account` 
ON`sf_account`.`id` = `sf_opportunity`.`accountid` 
ON`sf_datedconversionrate`.`isocode` = 
`sf_opportunity`.`currencyisocode` 

but the original should also work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536993#comment-15536993
 ] 

Dongjoon Hyun commented on SPARK-17739:
---

Thank you!

> Collapse adjacent similar Window operators
> --
>
> Key: SPARK-17739
> URL: https://issues.apache.org/jira/browse/SPARK-17739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>
> Spark currently does not collapse adjacent windows with the same partitioning 
> and (similar) sorting. For example:
> {noformat}
> val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
> "col1", rand() as "col2")
> // Add summary statistics for all columns
> import org.apache.spark.sql.expressions.Window
> val cols = Seq("id", "col1", "col2")
> val window = Window.partitionBy($"grp").orderBy($"id")
> val result = cols.foldLeft(df) { (base, name) =>
>   base.withColumn(s"${name}_avg", avg(col(name)).over(window))
>   .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
>   .withColumn(s"${name}_min", min(col(name)).over(window))
>   .withColumn(s"${name}_max", max(col(name)).over(window))
> }
> {noformat}
> Leads to following plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
> UnspecifiedFrame) AS col2_max#10313]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295, col2_min#10295]
>   +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_stddev#10270]
>   +- Window [stddev_samp(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
> +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
>+- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246, col2_avg#10246]
>   +- Window [avg(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
> +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
>+- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231, col1_max#10231]
>   +- Window 

[jira] [Commented] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-30 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536957#comment-15536957
 ] 

Herman van Hovell commented on SPARK-17739:
---

Go ahead!

> Collapse adjacent similar Window operators
> --
>
> Key: SPARK-17739
> URL: https://issues.apache.org/jira/browse/SPARK-17739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>
> Spark currently does not collapse adjacent windows with the same partitioning 
> and (similar) sorting. For example:
> {noformat}
> val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
> "col1", rand() as "col2")
> // Add summary statistics for all columns
> import org.apache.spark.sql.expressions.Window
> val cols = Seq("id", "col1", "col2")
> val window = Window.partitionBy($"grp").orderBy($"id")
> val result = cols.foldLeft(df) { (base, name) =>
>   base.withColumn(s"${name}_avg", avg(col(name)).over(window))
>   .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
>   .withColumn(s"${name}_min", min(col(name)).over(window))
>   .withColumn(s"${name}_max", max(col(name)).over(window))
> }
> {noformat}
> Leads to following plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
> UnspecifiedFrame) AS col2_max#10313]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295, col2_min#10295]
>   +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_stddev#10270]
>   +- Window [stddev_samp(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
> +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
>+- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246, col2_avg#10246]
>   +- Window [avg(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
> +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
>+- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231, col1_max#10231]
>   +- Window 

[jira] [Commented] (SPARK-17739) Collapse adjacent similar Window operators

2016-09-30 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536936#comment-15536936
 ] 

Dongjoon Hyun commented on SPARK-17739:
---

Hi, [~hvanhovell].
This issue looks important and attractive. If you haven't started on this yet, 
may I work on it?

> Collapse adjacent similar Window operators
> --
>
> Key: SPARK-17739
> URL: https://issues.apache.org/jira/browse/SPARK-17739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Herman van Hovell
>
> Spark currently does not collapse adjacent windows with the same partitioning 
> and (similar) sorting. For example:
> {noformat}
> val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as 
> "col1", rand() as "col2")
> // Add summary statistics for all columns
> import org.apache.spark.sql.expressions.Window
> val cols = Seq("id", "col1", "col2")
> val window = Window.partitionBy($"grp").orderBy($"id")
> val result = cols.foldLeft(df) { (base, name) =>
>   base.withColumn(s"${name}_avg", avg(col(name)).over(window))
>   .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
>   .withColumn(s"${name}_min", min(col(name)).over(window))
>   .withColumn(s"${name}_max", max(col(name)).over(window))
> }
> {noformat}
> Leads to following plan:
> {noformat}
> == Parsed Logical Plan ==
> 'Project [*, max('col2) windowspecdefinition('grp, 'id ASC NULLS FIRST, 
> UnspecifiedFrame) AS col2_max#10313]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, id_avg#10105, 
> id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_min#10295, col2_min#10295]
>   +- Window [min(col2#10098) windowspecdefinition(grp#10096L, id#10093L 
> ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS 
> col2_min#10295], [grp#10096L], [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
> +- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270]
>+- Project [grp#10096L, id#10093L, col1#10097, col2#10098, 
> id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, col1_avg#10176, 
> col1_stddev#10196, col1_min#10217, col1_max#10231, col2_avg#10246, 
> col2_stddev#10270, col2_stddev#10270]
>   +- Window [stddev_samp(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#10270], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
> +- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246]
>+- Project [grp#10096L, id#10093L, col1#10097, 
> col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, id_max#10165L, 
> col1_avg#10176, col1_stddev#10196, col1_min#10217, col1_max#10231, 
> col2_avg#10246, col2_avg#10246]
>   +- Window [avg(col2#10098) 
> windowspecdefinition(grp#10096L, id#10093L ASC NULLS FIRST, RANGE BETWEEN 
> UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#10246], [grp#10096L], 
> [id#10093L ASC NULLS FIRST]
>  +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
> +- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, col1_min#10217, 
> col1_max#10231]
>+- Project [grp#10096L, id#10093L, 
> col1#10097, col2#10098, id_avg#10105, id_stddev#10121, id_min#10155L, 
> id_max#10165L, col1_avg#10176, col1_stddev#10196, 

[jira] [Updated] (SPARK-17716) Hidden Markov Model (HMM)

2016-09-30 Thread Runxin Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Runxin Li updated SPARK-17716:
--
Description: 
Had an offline chat with [~Lil'Rex], who implemented HMM on Spark at 
https://github.com/apache/spark/compare/master...lilrex:sequence. I asked him 
to list popular HMM applications, describe the public API (params, input/output 
schemas), and compare it with existing HMM implementations.

h1. Hidden Markov Model (HMM) Design Doc
h2. Overview
h3. Introduction to HMM
A Hidden Markov Model is a statistical machine learning model that assumes a 
sequence of observations is generated by a Markov process with hidden states. 
There are 3 (or 2, depending on the implementation) main components of the model:
* *Transition Probability*: the probability distribution of transitions from 
each state to the other states (including itself) in the Markov process
* *Emission Probability*: the probability distribution of an observation given 
the hidden state
* *Initial/Start Probability* (optional): the prior probability of each state 
at the beginning of the observation sequence

_Note: some implementations merge the Initial Probability into Transition 
Probability by adding an arbitrary Start state before the first observation 
point._

h3. HMM Models and Algorithms
Given a limited number of states, most HMM models use the same form of 
Transition Probability: a matrix where each element _(i, j)_ represents the 
probability of transitioning from state _i_ to state _j_. The Initial Probability 
usually takes the simple form of a probability vector.

The Emission Probability, on the other hand, can be represented in many 
different ways, depending on the nature of the observations (continuous vs. 
discrete) and on the model assumptions (e.g. a single Gaussian vs. Gaussian 
mixtures).

There are three main problems associated with HMM models, each with a canonical 
algorithm:

# *Evaluation*: What is the probability of a given observation sequence under 
the model? Usually solved with the *Forward* or *Backward* algorithm
# *Decoding*: What is the most likely state sequence, given the observation 
sequence and the model? Usually solved with *Viterbi* decoding (a minimal 
sketch follows this list)
# *Learning*: How do we train the parameters of the model from the observation 
sequences? *Baum-Welch* (Forward-Backward) is usually used as part of the *EM* 
algorithm in unsupervised training
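
A minimal, self-contained Viterbi sketch in plain Scala (not the proposed 
spark.ml API); the object name and the toy parameters below are illustrative 
assumptions:

{code:title=ViterbiSketch.scala}
object ViterbiSketch {
  // obs: observation symbol indices; startP(i): initial probability of state i;
  // transP(i)(j): P(next state = j | state = i); emitP(i)(o): P(observation o | state i)
  def viterbi(
      obs: Seq[Int],
      startP: Array[Double],
      transP: Array[Array[Double]],
      emitP: Array[Array[Double]]): Seq[Int] = {
    val nStates = startP.length
    // delta(t)(i): best log-probability of any state path ending in state i at time t
    val delta = Array.ofDim[Double](obs.length, nStates)
    val backPtr = Array.ofDim[Int](obs.length, nStates)

    for (i <- 0 until nStates)
      delta(0)(i) = math.log(startP(i)) + math.log(emitP(i)(obs.head))

    for (t <- 1 until obs.length; j <- 0 until nStates) {
      val (bestScore, bestPrev) = (0 until nStates)
        .map(i => (delta(t - 1)(i) + math.log(transP(i)(j)), i))
        .maxBy(_._1)
      delta(t)(j) = bestScore + math.log(emitP(j)(obs(t)))
      backPtr(t)(j) = bestPrev
    }

    // Backtrack from the best final state to recover the most likely path.
    var state = (0 until nStates).maxBy(i => delta(obs.length - 1)(i))
    val path = Array.fill(obs.length)(0)
    path(obs.length - 1) = state
    for (t <- obs.length - 1 to 1 by -1) {
      state = backPtr(t)(state)
      path(t - 1) = state
    }
    path.toList
  }

  def main(args: Array[String]): Unit = {
    // Two hidden states, three observation symbols; made-up numbers.
    val start = Array(0.6, 0.4)
    val trans = Array(Array(0.7, 0.3), Array(0.4, 0.6))
    val emit  = Array(Array(0.5, 0.4, 0.1), Array(0.1, 0.3, 0.6))
    println(viterbi(Seq(0, 1, 2), start, trans, emit))  // prints List(0, 0, 1)
  }
}
{code}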

h3. Popular Applications of HMM
* Speech Recognition
* Part-of-speech Tagging
* Named Entity Recognition
* Machine Translation
* Gene Prediction

h2. Alternate Libraries
[Mallet|http://mallet.cs.umass.edu/api/cc/mallet/fst/HMM.html]
* Treats HMM as a Finite State Transducer (FST)
* Can theoretically go beyond the first-order Markov assumption by setting an 
arbitrary order
* Limited to text data, i.e. discrete observation sequences with a Multinomial 
emission model assumption
* Supervised training only
* API:
** Training:
{{HMM hmm = new HMM(pipe, null);}}
{{hmm.addStatesForLabelsConnectedAsIn(trainingInstances);}}
{{HMMTrainerByLikelihood trainer = new HMMTrainerByLikelihood(hmm);}}
{{trainer.train(trainingInstances, 10);}}
** Testing:
{{evaluator.evaluate(trainer);}}

[HMMLearn|https://github.com/hmmlearn/hmmlearn]
* Previously part of scikit-learn
* Algo:
** Standard HMM unsupervised training algorithm
** Three types of emission models: GMM, Gaussian and Multinomial
* API:
** Training: 
{{model = hmm.GaussianHMM(n_components=3, covariance_type="full")}}
{{model.fit(X)}}
** Testing: 
{{hidden_states = model.predict(X)}}

h2. API
h3. Design Goals
* Build the foundation for general Sequential Tagging models (HMM, CRF, etc.)
* Support multiple Emission Probability models such as “Multinomial” and 
“Gaussian Mixture”
* Keep both supervised and unsupervised learning for HMM in mind

h3. Proposed API
_Note: This is written for the spark.ml API._

Decoder API
{code:title=Decoder.scala}
trait DecoderParams extends Params {
  def featureCol: DataType // column of sequential features, e.g. MatrixUDT
  def predictionCol: DoubleType // column of prediction
  def labelCol: DataType // column of sequential labels, e.g. VectorUDT
}

abstract class Decoder extends Estimator with DecoderParams {
  def extractLabeledSequences(dataset: Dataset[_]): RDD[LabeledSequence]
}

abstract class DecodingModel extends Model with DecoderParams {
  def numFeatures: Int
  def decode(features: FeatureType): Vector
}
{code}

Tagger API
{code:title=Tagger.scala}
trait TaggerParams extends DecoderParams with HasRawPredictionCol {
  def rawPredictionCol: MatrixUDT // column for all predicted label sequences
}

abstract class Tagger extends Decoder with TaggerParams

abstract class TaggingModel extends DecodingModel with TaggerParams {
  def numClasses: Int
  def decodeRaw(features: FeaturesType): Array[(Double, Vector)]
  def raw2prediction(rawPrediction: 

[jira] [Updated] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-17748:

Assignee: Seth Hendrickson

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-17748:

Issue Type: New Feature  (was: Bug)

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536781#comment-15536781
 ] 

Seth Hendrickson edited comment on SPARK-17748 at 9/30/16 7:16 PM:
---

I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

cc [~srowen] [~yanboliang] [~dbtsai]


was (Author: sethah):
I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

cc [~srowen] [~yanboliang]

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536781#comment-15536781
 ] 

Seth Hendrickson edited comment on SPARK-17748 at 9/30/16 7:16 PM:
---

I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

cc [~srowen] [~yanboliang]


was (Author: sethah):
I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536781#comment-15536781
 ] 

Seth Hendrickson commented on SPARK-17748:
--

I am working on this currently. The basic plan is to refactor WLS so that it 
has a pluggable solver for the normal equations. We can implement a new 
interface like 

{code:java}
trait NormalEquationSolver {
  def solve(
  bBar: Double,
  bbBar: Double,
  abBar: DenseVector,
  aaBar: DenseVector,
  aBar: DenseVector): NormalEquationSolution
}
class CholeskySolver extends NormalEquationSolver
class QuasiNewtonSolver extends NormalEquationSolver
{code}

If others have thoughts on the design please comment, otherwise I will continue 
working on this and submit a PR reasonably soon.

> One-pass algorithm for linear regression with L1 and elastic-net penalties
> --
>
> Key: SPARK-17748
> URL: https://issues.apache.org/jira/browse/SPARK-17748
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Seth Hendrickson
>
> Currently linear regression uses weighted least squares to solve the normal 
> equations locally on the driver when the dimensionality is small (<4096). 
> Weighted least squares uses a Cholesky decomposition to solve the problem 
> with L2 regularization (which has a closed-form solution). We can support 
> L1/elasticnet penalties by solving the equations locally using OWL-QN solver.
> Also note that Cholesky does not handle singular covariance matrices, but 
> L-BFGS and OWL-QN are capable of providing reasonable solutions. This patch 
> can also add support for solving singular covariance matrices by also adding 
> L-BFGS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17748) One-pass algorithm for linear regression with L1 and elastic-net penalties

2016-09-30 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17748:


 Summary: One-pass algorithm for linear regression with L1 and 
elastic-net penalties
 Key: SPARK-17748
 URL: https://issues.apache.org/jira/browse/SPARK-17748
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Seth Hendrickson


Currently linear regression uses weighted least squares to solve the normal 
equations locally on the driver when the dimensionality is small (<4096). 
Weighted least squares uses a Cholesky decomposition to solve the problem with 
L2 regularization (which has a closed-form solution). We can support 
L1/elastic-net penalties by solving the equations locally with the OWL-QN solver.

Also note that Cholesky does not handle singular covariance matrices, but 
L-BFGS and OWL-QN are capable of providing reasonable solutions. By adding 
L-BFGS, this patch can also add support for solving singular covariance matrices.
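
A small self-contained sketch (assuming breeze on the classpath, as in MLlib) of 
the closed-form L2 path described above: with ridge penalty lambda, the normal 
equations become (A^T A + lambda*I) x = A^T b, which a direct solver can handle 
in one pass; the data and lambda below are made-up illustrations:

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

object RidgeNormalEquationsSketch {
  def main(args: Array[String]): Unit = {
    val a = DenseMatrix((1.0, 2.0), (2.0, 1.0), (3.0, 4.0))   // toy design matrix
    val b = DenseVector(1.0, 2.0, 3.0)                        // toy targets
    val lambda = 0.1

    val gram = a.t * a + DenseMatrix.eye[Double](2) * lambda  // A^T A + lambda*I
    val atb  = a.t * b                                        // A^T b
    println(gram \ atb)                                       // coefficients from a direct solve
  }
}
{code}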



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536694#comment-15536694
 ] 

Ashish Shrowty commented on SPARK-17709:


Sorry ... it's not really col1, it's another column .. I edited it to col8.

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535957#comment-15535957
 ] 

Ashish Shrowty edited comment on SPARK-17709 at 9/30/16 6:37 PM:
-

Sure .. the data is brought over into the EMR (5.0.0) HDFS cluster via sqoop. 
Once there, I issue the following commands in Hive (2.1.0) to store it in S3 -

CREATE EXTERNAL TABLE  (
   col1 bigint,
   col2 int,
   col3 string,
   
)
PARTITIONED BY (col8 int)
STORED AS PARQUET
LOCATION 's3_table_dir'

INSERT into 
SELECT col1,col2, FROM 



was (Author: ashrowty):
Sure .. the data is brought over into the EMR (5.0.0) HDFS cluster via sqoop. 
Once there, I issue the following commands in Hive (2.1.0) to store it in S3 -

CREATE EXTERNAL TABLE  (
   col1 bigint,
   col2 int,
   col3 string,
   
)
PARTITIONED BY (col1 int)
STORED AS PARQUET
LOCATION 's3_table_dir'

INSERT into 
SELECT col1,col2, FROM 


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-09-30 Thread Matthew Seal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536646#comment-15536646
 ] 

Matthew Seal commented on SPARK-4105:
-

Backing executor memory off from the physical memory boundary did not solve 
the problem in my crash report above.

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   

[jira] [Commented] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-30 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536636#comment-15536636
 ] 

Sameer Agarwal commented on SPARK-17733:


Thanks [~jiangxb], we came to the same conclusion yesterday. More generally, 
the issue here is that the `QueryPlan.inferAdditionalConstraints` and 
`UnaryNode.getAliasedConstraints` can produce a non-converging set of 
constraints for recursive functions. So if we have 2 constraints of the form 
(where a is an alias):

{code}
a = b, a = f(b, c)
{code}

Applying both these rules in the next iteration would infer:

{code}
b = f(f(b, c), c)
{code}

and next would infer: 

{code}
b = f(f(f(b, c), c), c)
{code}

and so on...

These rules aren't incorrect per se, but they are obviously useless and cause 
problems of this kind. I think the right fix here would be to modify 
`UnaryNode.getAliasedConstraints` so that it does not produce these recursive 
constraints. Would you like to submit a patch? (A small standalone sketch of 
the divergence follows.)
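
A minimal, self-contained sketch (plain Scala, not Catalyst code) of the 
divergence: here `b` plays the role of the aliased attribute and `f` is any 
expression that mentions `b` itself, so substituting the alias back into its 
own definition keeps growing the constraint instead of reaching a fixed point:

{code}
sealed trait Expr
case class Attr(name: String) extends Expr
case class Func(name: String, args: Seq[Expr]) extends Expr

// Replace every occurrence of `target` inside `e` with `replacement`.
def substitute(e: Expr, target: Expr, replacement: Expr): Expr = e match {
  case a: Attr if a == target => replacement
  case Func(n, args)          => Func(n, args.map(substitute(_, target, replacement)))
  case other                  => other
}

val b = Attr("b")
val c = Attr("c")
val fOfBC: Expr = Func("f", Seq(b, c))   // the constraint a = f(b, c), with the alias a = b

// Each "iteration" infers a new equality for b by substituting b -> f(b, c).
Iterator.iterate(fOfBC)(substitute(_, b, fOfBC)).take(4).foreach(println)
// Func(f,List(Attr(b), Attr(c)))
// Func(f,List(Func(f,List(Attr(b), Attr(c))), Attr(c)))
// ... each step nests one level deeper, so the constraint set never converges.
{code}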

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png,
>  constraints.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table Creation|borderStyle=solid}
>  -- Query #0
> CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
> boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
> AS VALUES
>   (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
> TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
> TIMESTAMP('2007-01-13 00:00:00.0')),
>   (CAST(722.4906 AS FLOAT), true, 497.54BD, true, 
> TIMESTAMP('2015-12-14 00:00:00.0'), false, CAST(268 AS BIGINT), 
> TIMESTAMP('2021-04-19 00:00:00.0')),
>   (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
> TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
> TIMESTAMP('2019-10-16 00:00:00.0')),
>   (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
> TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
> TIMESTAMP)),
>   (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
> TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
> TIMESTAMP('2024-09-01 00:00:00.0')),
>   (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
> TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
> TIMESTAMP('2028-12-03 00:00:00.0')),
>   (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
> TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
> TIMESTAMP('1994-04-04 00:00:00.0')),
>   (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
> TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
> TIMESTAMP('2023-07-08 00:00:00.0')),
>   (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
> TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
> TIMESTAMP('2013-09-29 00:00:00.0')),
>   (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
> TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
> TIMESTAMP('1990-08-10 00:00:00.0'))
>  ;
>  -- Query #1
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> int_col_9, string_col_10) AS VALUES
>   (CAST(-147.818640624 AS DOUBLE), CAST(NULL AS BOOLEAN), 
> TIMESTAMP('2012-10-19 00:00:00.0'), CAST(9 AS SMALLINT), false, 77, 
> TIMESTAMP('2014-07-01 00:00:00.0'), '-945', -646, '722'),
>   (CAST(594.195125271 AS DOUBLE), false, TIMESTAMP('2016-12-04 00:00:00.0'), 
> CAST(NULL AS SMALLINT), CAST(NULL AS BOOLEAN), CAST(NULL AS INT), 
> TIMESTAMP('1999-12-26 00:00:00.0'), '250', -861, '55'),
>   (CAST(-454.171126363 AS DOUBLE), false, TIMESTAMP('2008-12-13 00:00:00.0'), 
> CAST(NULL AS SMALLINT), false, -783, TIMESTAMP('2010-05-28 00:00:00.0'), 
> '211', -959, CAST(NULL AS STRING)),
>   (CAST(437.670945524 AS DOUBLE), true, TIMESTAMP('2011-10-16 00:00:00.0'), 
> CAST(952 AS SMALLINT), true, 297, TIMESTAMP('2013-01-13 00:00:00.0'), '262', 
> CAST(NULL AS INT), '936'),
>   (CAST(-387.226759334 AS DOUBLE), false, TIMESTAMP('2019-10-03 00:00:00.0'), 
> CAST(-496 AS SMALLINT), CAST(NULL AS 

[jira] [Commented] (SPARK-17737) cannot import name accumulators error

2016-09-30 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536516#comment-15536516
 ] 

Bryan Cutler commented on SPARK-17737:
--

Ah, sorry, I didn't read your stacktrace properly.  I've seen this before a 
couple of times; it could be because of a failed previous import of PySpark.  
In an interactive session, if an import fails, the modules imported up to that 
point will still be in the environment.  Make sure you start from a clean 
environment and try this again.  See SPARK-16665 and the discussion in its PR, 
which sounds very similar to this.

> cannot import name accumulators error
> -
>
> Key: SPARK-17737
> URL: https://issues.apache.org/jira/browse/SPARK-17737
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
> Environment: unix
> python 2.7
>Reporter: Pruthveej Reddy Kasarla
>
> Hi I am trying to setup my sparkcontext using the below code
> import sys
> sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
> sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
> from pyspark import SparkConf, SparkContext
> sconf = SparkConf()
> sc = SparkContext(conf=sconf)
> print sc
> got below error
> ImportError   Traceback (most recent call last)
>  in ()
>   2 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python/build')
>   3 sys.path.append('/opt/cloudera/parcels/CDH/lib/spark/python')
> > 4 from pyspark import SparkConf, SparkContext
>   5 sconf = SparkConf()
>   6 sc = SparkContext(conf=sconf)
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/__init__.py in ()
>  39 
>  40 from pyspark.conf import SparkConf
> ---> 41 from pyspark.context import SparkContext
>  42 from pyspark.rdd import RDD
>  43 from pyspark.files import SparkFiles
> /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py in ()
>  26 from tempfile import NamedTemporaryFile
>  27 
> ---> 28 from pyspark import accumulators
>  29 from pyspark.accumulators import Accumulator
>  30 from pyspark.broadcast import Broadcast
> ImportError: cannot import name accumulators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Dilip Biswal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536464#comment-15536464
 ] 

Dilip Biswal commented on SPARK-17709:
--

[~ashrowty] Ashish, do you have the same column name as both a regular and a 
partitioning column? I thought Hive didn't allow that?

> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two dataframes which originated from the same initial 
> dataframe that was loaded using spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same Dataframe is initialized via spark.read.parquet(), the above code 
> works. This same code above worked with Spark 1.6.2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract

2016-09-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-17738:
---
Fix Version/s: (was: 2.2.0)
   2.1.0

> Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP 
> append/extract
> --
>
> Key: SPARK-17738
> URL: https://issues.apache.org/jira/browse/SPARK-17738
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17738) Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP append/extract

2016-09-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-17738.

   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15305
[https://github.com/apache/spark/pull/15305]

> Flaky test: org.apache.spark.sql.execution.columnar.ColumnTypeSuite MAP 
> append/extract
> --
>
> Key: SPARK-17738
> URL: https://issues.apache.org/jira/browse/SPARK-17738
> Project: Spark
>  Issue Type: Bug
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.2.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.3/1786/testReport/junit/org.apache.spark.sql.execution.columnar/ColumnTypeSuite/MAP_append_extract/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate

2016-09-30 Thread Seth Bromberger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536356#comment-15536356
 ] 

Seth Bromberger commented on SPARK-17097:
-

Great explanation. Thank you very much!

> Pregel does not keep vertex state properly; fails to terminate 
> ---
>
> Key: SPARK-17097
> URL: https://issues.apache.org/jira/browse/SPARK-17097
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
> Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel
>Reporter: Seth Bromberger
>
> Consider the following minimum example:
> {code:title=PregelBug.scala|borderStyle=solid}
> package testGraph
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _}
> object PregelBug {
>   def main(args: Array[String]) = {
> //FIXME breaks if TestVertex is a case class; works if not case class
> case class TestVertex(inId: VertexId,
>  inData: String,
>  inLabels: collection.mutable.HashSet[String]) extends 
> Serializable {
>   val id = inId
>   val value = inData
>   val labels = inLabels
> }
> class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends 
> Serializable  {
>   val src = inSrc
>   val dst = inDst
>   val data = inData
> }
> val startString = "XXXSTARTXXX"
> val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val vertexes = Vector(
>   new TestVertex(0, "label0", collection.mutable.HashSet[String]()),
>   new TestVertex(1, "label1", collection.mutable.HashSet[String]())
> )
> val links = Vector(
>   new TestLink(0, 1, "linkData01")
> )
> val vertexes_packaged = vertexes.map(v => (v.id, v))
> val links_packaged = links.map(e => Edge(e.src, e.dst, e))
> val graph = Graph[TestVertex, 
> TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged))
> def vertexProgram (vertexId: VertexId, vdata: TestVertex, message: 
> Vector[String]): TestVertex = {
>   message.foreach {
> case `startString` =>
>   if (vdata.id == 0L)
> vdata.labels.add(vdata.value)
> case m =>
>   if (!vdata.labels.contains(m))
> vdata.labels.add(m)
>   }
>   new TestVertex(vdata.id, vdata.value, vdata.labels)
> }
> def sendMessage (triplet: EdgeTriplet[TestVertex, TestLink]): 
> Iterator[(VertexId, Vector[String])] = {
>   val srcLabels = triplet.srcAttr.labels
>   val dstLabels = triplet.dstAttr.labels
>   val msgsSrcDst = srcLabels.diff(dstLabels)
> .map(label => (triplet.dstAttr.id, Vector[String](label)))
>   val msgsDstSrc = dstLabels.diff(dstLabels)
> .map(label => (triplet.srcAttr.id, Vector[String](label)))
>   msgsSrcDst.toIterator ++ msgsDstSrc.toIterator
> }
> def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] 
> = m1.union(m2).distinct
> val g = graph.pregel(Vector[String](startString))(vertexProgram, 
> sendMessage, mergeMessage)
> println("---pregel done---")
> println("vertex info:")
> g.vertices.foreach(
>   v => {
> val labels = v._2.labels
> println(
>   "vertex " + v._1 +
> ": name = " + v._2.id +
> ", labels = " + labels)
>   }
> )
>   }
> }
> {code}
> This code never terminates even though we expect it to. To fix, we simply 
> remove the "case" designation for the TestVertex class (see FIXME comment), 
> and then it behaves as expected.
> (Apologies if this has been fixed in later versions; we're unfortunately 
> pegged to 2.10.5 / 1.6.0 for now.)
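For reference, a minimal sketch of the change the reporter describes: the same 
TestVertex definition with the `case` keyword removed, everything else in the 
example above unchanged.
{code}
// Plain class instead of a case class, per the FIXME note in the example.
class TestVertex(inId: VertexId,
                 inData: String,
                 inLabels: collection.mutable.HashSet[String]) extends Serializable {
  val id = inId
  val value = inData
  val labels = inLabels
}
{code}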



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2016-09-30 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536321#comment-15536321
 ] 

Alexander Ulanov commented on SPARK-5575:
-

I recently released a package to handle new features that are not yet merged in 
Spark: https://spark-packages.org/package/avulanov/scalable-deeplearning

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> *Goal:* Implement various types of artificial neural networks
> *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581)
> Having deep learning within Spark's ML library is a question of convenience. 
> Spark has broad analytic capabilities and it is useful to have deep learning 
> as one of these tools at hand. Deep learning is a model of choice for several 
> important modern use-cases, and Spark ML might want to cover them. 
> Ultimately, it is hard to explain why we have PCA in ML but do not provide an 
> autoencoder. To summarize, Spark should have at least the most widely used 
> deep learning models, such as the fully connected artificial neural network, 
> the convolutional network, and the autoencoder. Advanced and experimental deep 
> learning features might reside within packages or as pluggable external 
> tools. These three would provide a comprehensive deep learning set for Spark 
> ML. We might include recurrent networks as well.
> *Requirements:*
> # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, 
> Layer, Error, Regularization, Forward and Backpropagation etc. should be 
> implemented as traits or interfaces, so they can be easily extended or 
> reused. Define the Spark ML API for deep learning. This interface is similar 
> to the other analytics tools in Spark and supports ML pipelines. This makes 
> deep learning easy to use and plug into analytics workloads for Spark 
> users. 
> # Efficiency. The current implementation of multilayer perceptron in Spark is 
> less than 2x slower than Caffe, both measured on CPU. The main overhead 
> sources are JVM and Spark's communication layer. For more details, please 
> refer to https://github.com/avulanov/ann-benchmark. Having said that, an 
> efficient implementation of deep learning in Spark should be only a few times 
> slower than in a specialized tool. This is very reasonable for a platform that 
> does much more than deep learning, and I believe it is understood by the 
> community.
> # Scalability. Implement efficient distributed training. It relies heavily on 
> the efficient communication and scheduling mechanisms. The default 
> implementation is based on Spark. More efficient implementations might 
> include some external libraries but use the same interface defined.
> *Main features:* 
> # Multilayer perceptron classifier (MLP)
> # Autoencoder
> # Convolutional neural networks for computer vision. The interface has to 
> provide a few architectures for deep learning that are widely used in 
> practice, such as AlexNet
> *Additional features:*
> # Other architectures, such as Recurrent neural network (RNN), Long-short 
> term memory (LSTM), Restricted boltzmann machine (RBM), deep belief network 
> (DBN), MLP multivariate regression
> # Regularizers, such as L1, L2, drop-out
> # Normalizers
> # Network customization. The internal API of Spark ANN is designed to be 
> flexible and can handle different types of layers. However, only a part of 
> the API is made public. We have to limit the number of public classes in 
> order to make it simpler to support other languages. This forces us to use 
> (String or Number) parameters instead of introducing new public classes. 
> One of the options to specify the architecture of ANN is to use text 
> configuration with layer-wise description. We have considered using Caffe 
> format for this. It gives the benefit of compatibility with a well-known deep 
> learning tool and simplifies the support of other languages in Spark. 
> Implementation of a parser for the subset of Caffe format might be the first 
> step towards the support of general ANN architectures in Spark. 
> # Hardware-specific optimization. One can wrap other deep learning 
> implementations with this interface, allowing users to pick a particular 
> back-end, e.g. Caffe or TensorFlow, along with the default one. The interface 
> has to provide a few architectures for deep learning that are widely used in 
> practice, such as AlexNet. The main motivation for using specialized 
> libraries for deep learning would be to fully take advantage of the hardware 
> where Spark runs, in particular GPUs. Having the default interface in Spark, 
> we 

[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-09-30 Thread Luca Menichetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535884#comment-15535884
 ] 

Luca Menichetti edited comment on SPARK-4105 at 9/30/16 2:25 PM:
-

I have exactly the same problem. It occurs when I process more than one TB of 
data.

Splitting the computation into halves works perfectly, but I need to process 
the whole thing together.

{noformat}
java.io.IOException: failed to read chunk
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:347)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
at 
org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

As a consequence, since I am using Spark SQL, the job fails, reporting:

{noformat}
16/09/30 14:40:51 ERROR LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 

[jira] [Commented] (SPARK-17747) WeightCol support non-double datetypes

2016-09-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536113#comment-15536113
 ] 

Apache Spark commented on SPARK-17747:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15314

> WeightCol support non-double datetypes
> --
>
> Key: SPARK-17747
> URL: https://issues.apache.org/jira/browse/SPARK-17747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> WeightCol only supports the double type now; it should also accept other 
> numeric types, such as Int.
> {code}
> scala> df3.show(5)
> +-++--+
> |label|features|weight|
> +-++--+
> |  0.0|(692,[127,128,129...| 1|
> |  1.0|(692,[158,159,160...| 1|
> |  1.0|(692,[124,125,126...| 1|
> |  1.0|(692,[152,153,154...| 1|
> |  1.0|(692,[151,152,153...| 1|
> +-++--+
> only showing top 5 rows
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ee0308a72919
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training 
> finished but the result is not converged because: max iterations reached
> lrm: org.apache.spark.ml.classification.LogisticRegressionModel = 
> logreg_ee0308a72919
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight")
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ced7579d5680
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed
> 16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92)
> scala.MatchError: 
> [0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])]
>  (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Assigned] (SPARK-17747) WeightCol support non-double datetypes

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17747:


Assignee: (was: Apache Spark)

> WeightCol support non-double datetypes
> --
>
> Key: SPARK-17747
> URL: https://issues.apache.org/jira/browse/SPARK-17747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> WeightCol only supports the double type now; it should also accept other 
> numeric types, such as Int.
> {code}
> scala> df3.show(5)
> +-++--+
> |label|features|weight|
> +-++--+
> |  0.0|(692,[127,128,129...| 1|
> |  1.0|(692,[158,159,160...| 1|
> |  1.0|(692,[124,125,126...| 1|
> |  1.0|(692,[152,153,154...| 1|
> |  1.0|(692,[151,152,153...| 1|
> +-++--+
> only showing top 5 rows
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ee0308a72919
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training 
> finished but the result is not converged because: max iterations reached
> lrm: org.apache.spark.ml.classification.LogisticRegressionModel = 
> logreg_ee0308a72919
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight")
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ced7579d5680
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed
> 16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92)
> scala.MatchError: 
> [0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])]
>  (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 

[jira] [Assigned] (SPARK-17747) WeightCol support non-double datetypes

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17747:


Assignee: Apache Spark

> WeightCol support non-double datetypes
> --
>
> Key: SPARK-17747
> URL: https://issues.apache.org/jira/browse/SPARK-17747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
>
> WeightCol only supports the double type now; it should also accept other 
> numeric types, such as Int.
> {code}
> scala> df3.show(5)
> +-++--+
> |label|features|weight|
> +-++--+
> |  0.0|(692,[127,128,129...| 1|
> |  1.0|(692,[158,159,160...| 1|
> |  1.0|(692,[124,125,126...| 1|
> |  1.0|(692,[152,153,154...| 1|
> |  1.0|(692,[151,152,153...| 1|
> +-++--+
> only showing top 5 rows
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ee0308a72919
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training 
> finished but the result is not converged because: max iterations reached
> lrm: org.apache.spark.ml.classification.LogisticRegressionModel = 
> logreg_ee0308a72919
> scala> val lr = new 
> LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight")
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_ced7579d5680
> scala> val lrm = lr.fit(df3)
> 16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed
> 16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92)
> scala.MatchError: 
> [0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])]
>  (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at 
> org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 

[jira] [Comment Edited] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2016-09-30 Thread Vishal Donderia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536110#comment-15536110
 ] 

Vishal Donderia edited comment on SPARK-12334 at 9/30/16 2:15 PM:
--

Is the above patch working fine? Are we planning to deliver this change in the 
next version?



was (Author: vishaldonde...@gmail.com):
Is above patch is working fine ? 


> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2016-09-30 Thread Vishal Donderia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536110#comment-15536110
 ] 

Vishal Donderia commented on SPARK-12334:
-

Is the above patch working fine? 


> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17747) WeightCol support non-double datetypes

2016-09-30 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-17747:
-
Description: 
WeightCol only supports the double type now; it should also accept other numeric 
types, such as Int.
{code}
scala> df3.show(5)
+-++--+
|label|features|weight|
+-++--+
|  0.0|(692,[127,128,129...| 1|
|  1.0|(692,[158,159,160...| 1|
|  1.0|(692,[124,125,126...| 1|
|  1.0|(692,[152,153,154...| 1|
|  1.0|(692,[151,152,153...| 1|
+-++--+
only showing top 5 rows


scala> val lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_ee0308a72919

scala> val lrm = lr.fit(df3)
16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training finished 
but the result is not converged because: max iterations reached
lrm: org.apache.spark.ml.classification.LogisticRegressionModel = 
logreg_ee0308a72919

scala> val lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight")
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_ced7579d5680

scala> val lrm = lr.fit(df3)
16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed
16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92)
scala.MatchError: 
[0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])]
 (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at 
org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at 
org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
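Until such support lands, one workaround sketch (added for illustration; it 
assumes the integer weights are exact counts that fit in a double) is to cast the 
weight column explicitly before fitting:
{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.functions.col

// Cast the Int weight column to double so setWeightCol sees the type
// LogisticRegression currently expects.
val df3d = df3.withColumn("weight", col("weight").cast("double"))
val lr = new LogisticRegression()
  .setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
  .setWeightCol("weight")
val lrm = lr.fit(df3d)
{code}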

 

[jira] [Created] (SPARK-17747) WeightCol support non-double datetypes

2016-09-30 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-17747:


 Summary: WeightCol support non-double datetypes
 Key: SPARK-17747
 URL: https://issues.apache.org/jira/browse/SPARK-17747
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: zhengruifeng


WeightCol only supports the double type now; it should also accept other numeric 
types, such as Int.
{code}
scala> df3.show(5)
+-++--+
|label|features|weight|
+-++--+
|  0.0|(692,[127,128,129...| 1|
|  1.0|(692,[158,159,160...| 1|
|  1.0|(692,[124,125,126...| 1|
|  1.0|(692,[152,153,154...| 1|
|  1.0|(692,[151,152,153...| 1|
+-++--+
only showing top 5 rows


scala> val lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_ee0308a72919

scala> val lrm = lr.fit(df3)
16/09/20 15:46:12 WARN LogisticRegression: LogisticRegression training finished 
but the result is not converged because: max iterations reached
lrm: org.apache.spark.ml.classification.LogisticRegressionModel = 
logreg_ee0308a72919

scala> val lr = new 
LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setWeightCol("weight")
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_ced7579d5680

scala> val lrm = lr.fit(df3)
16/09/20 15:46:27 WARN BlockManager: Putting block rdd_211_0 failed
16/09/20 15:46:27 ERROR Executor: Exception in task 0.0 in stage 89.0 (TID 92)
scala.MatchError: 
[0.0,1,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])]
 (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at 
org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at 
org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-09-30 Thread Luca Menichetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535884#comment-15535884
 ] 

Luca Menichetti edited comment on SPARK-4105 at 9/30/16 2:04 PM:
-

I have exactly the same problem. It occurs when I process more than one TB of 
data.

Splitting the computation into halves works perfectly, but I need to process 
the whole thing together.

{noformat}
java.io.IOException: failed to read chunk
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:347)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
at 
org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

As a consequence, since I am using Spark SQL, the job fails, reporting:

{noformat}
16/09/30 14:40:51 ERROR LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 

[jira] [Commented] (SPARK-14077) Support weighted instances in naive Bayes

2016-09-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536076#comment-15536076
 ] 

Apache Spark commented on SPARK-14077:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15313

> Support weighted instances in naive Bayes
> -
>
> Key: SPARK-14077
> URL: https://issues.apache.org/jira/browse/SPARK-14077
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: zhengruifeng
>  Labels: naive-bayes
> Fix For: 2.1.0
>
>
> In naive Bayes, we expect inputs to be individual observations. In practice, 
> people may have a frequency table instead. It is useful for us to support 
> instance weights to handle this case.
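For illustration, a minimal sketch of how a frequency table could be fed to the 
ml estimator via the weight column added by this issue; "trainingDF" and the 
column name "weight" are assumptions, and the setter follows the usual ml Param 
naming (setWeightCol).
{code}
import org.apache.spark.ml.classification.NaiveBayes

// Each row's "weight" is treated as the number of times the observation occurs,
// instead of materializing repeated rows.
val nb = new NaiveBayes()
  .setSmoothing(1.0)
  .setWeightCol("weight")
val model = nb.fit(trainingDF)
{code}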



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-09-30 Thread Luca Menichetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535884#comment-15535884
 ] 

Luca Menichetti edited comment on SPARK-4105 at 9/30/16 1:54 PM:
-

I have exactly the same problem. It occurs when I process more than one TB of 
data.

Splitting the computation into halves works perfectly, but I need to process 
the whole thing together.

{noformat}
java.io.IOException: failed to read chunk
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:347)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
at 
org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

As a consequence, since I am using Spark SQL, the job fails, reporting:

{noformat}
16/09/30 14:40:51 ERROR LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 

[jira] [Created] (SPARK-17746) Code duplication to compute the path to spark-defaults.conf

2016-09-30 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-17746:
---

 Summary: Code duplication to compute the path to 
spark-defaults.conf
 Key: SPARK-17746
 URL: https://issues.apache.org/jira/browse/SPARK-17746
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Jacek Laskowski
Priority: Minor


[CommandBuilderUtils.DEFAULT_PROPERTIES_FILE|https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/CommandBuilderUtils.java#L31],
 
[AbstractCommandBuilder.getConfDir|https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L312]
 and 
[Utils.getDefaultPropertiesFile|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L2006]
 are dealing with the default properties file, i.e. {{spark-defaults.conf}} (or 
the Spark configuration directory where the file is).

The code duplication could be removed.
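For illustration only, a sketch of the kind of shared helper that could replace 
the duplicated logic; the object name and placement are hypothetical, not an 
existing Spark API, and it only approximates what the linked methods do 
(SPARK_CONF_DIR, falling back to SPARK_HOME/conf).
{code}
// Hypothetical consolidation sketch: one place that resolves the conf directory
// and the spark-defaults.conf path.
object DefaultPropertiesFile {
  private val sep = java.io.File.separator

  def confDir(env: Map[String, String] = sys.env): Option[String] =
    env.get("SPARK_CONF_DIR")
      .orElse(env.get("SPARK_HOME").map(home => s"$home${sep}conf"))

  def path(env: Map[String, String] = sys.env): Option[String] =
    confDir(env).map(dir => s"$dir${sep}spark-defaults.conf")
}
{code}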



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17744) Parity check between the ml and mllib test suites for NB

2016-09-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536032#comment-15536032
 ] 

Apache Spark commented on SPARK-17744:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15312

> Parity check between the ml and mllib test suites for NB
> 
>
> Key: SPARK-17744
> URL: https://issues.apache.org/jira/browse/SPARK-17744
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> We moved the {{NaiveBayes}} implementation from the mllib to the ml package in 
> SPARK-14077. After that, we should do a parity check between the ml and mllib 
> test suites and complement the missing test cases for ml, since we may delete 
> the spark.mllib packages in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17744) Parity check between the ml and mllib test suites for NB

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17744:


Assignee: Apache Spark

> Parity check between the ml and mllib test suites for NB
> 
>
> Key: SPARK-17744
> URL: https://issues.apache.org/jira/browse/SPARK-17744
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> We moved the {{NaiveBayes}} implementation from the mllib to the ml package in 
> SPARK-14077. After that, we should do a parity check between the ml and mllib 
> test suites and complement the missing test cases for ml, since we may delete 
> the spark.mllib packages in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17744) Parity check between the ml and mllib test suites for NB

2016-09-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17744:


Assignee: (was: Apache Spark)

> Parity check between the ml and mllib test suites for NB
> 
>
> Key: SPARK-17744
> URL: https://issues.apache.org/jira/browse/SPARK-17744
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> We moved the {{NaiveBayes}} implementation from the mllib to the ml package in 
> SPARK-14077. After that, we should do a parity check between the ml and mllib 
> test suites and complement the missing test cases for ml, since we may delete 
> the spark.mllib packages in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17745) Update Python API for NB to support weighted instances

2016-09-30 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535983#comment-15535983
 ] 

zhengruifeng commented on SPARK-17745:
--

Go ahead!

> Update Python API for NB to support weighted instances
> --
>
> Key: SPARK-17745
> URL: https://issues.apache.org/jira/browse/SPARK-17745
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> Update the Python wrapper of NB to support weighted instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-09-30 Thread Bezruchko Vadim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535979#comment-15535979
 ] 

Bezruchko Vadim commented on SPARK-4105:


We had the same behavior, but the problem was insufficient physical memory for 
the executors. We reduced the executors' memory and the problem was fixed.
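
For reference, the kind of adjustment described above might look like the 
following (the values are placeholders, not a recommendation):
{code:scala}
import org.apache.spark.SparkConf

// Illustrative only: cap executor memory (and account for off-heap overhead) so
// the executors' total footprint stays within each node's physical memory.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")                  // reduced from a larger value
  .set("spark.yarn.executor.memoryOverhead", "1024")   // per-executor overhead, in MB
{code}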

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:56)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> 

[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error

2016-09-30 Thread Ashish Shrowty (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535957#comment-15535957
 ] 

Ashish Shrowty commented on SPARK-17709:


Sure .. the data is brought into the EMR (5.0.0) HDFS cluster via Sqoop. 
Once there, I issue the following commands in Hive (2.1.0) to store it in S3 -

CREATE EXTERNAL TABLE  (
   col1 bigint,
   col2 int,
   col3 string,
   
)
PARTITIONED BY (col1 int)
STORED AS PARQUET
LOCATION 's3_table_dir'

INSERT into 
SELECT col1,col2, FROM 


> spark 2.0 join - column resolution error
> 
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Ashish Shrowty
>  Labels: easyfix
>
> If I try to inner-join two DataFrames that originated from the same initial 
> DataFrame, which was loaded using a spark.sql() call, it results in an error -
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")  
> val df1 = d1.groupBy("key1","key2")
>   .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>   .agg(avg("itemcount").as("avgqty")) 
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can 
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
> If the same DataFrame is initialized via spark.read.parquet(), the above code 
> works. The same code also worked with Spark 1.6.2.
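
A self-contained variant of the repro above (the table name and the assumption of 
a spark-shell session are hypothetical stand-ins, not the reporter's actual setup):
{code:scala}
// Assumes a spark-shell (Spark 2.0.0) session, so `spark` is already defined,
// and a Hive-backed Parquet table; `some_hive_parquet_table` is a placeholder.
import org.apache.spark.sql.functions.avg

val d1  = spark.sql("select * from some_hive_parquet_table")
val df1 = d1.groupBy("key1", "key2").agg(avg("totalprice").as("avgtotalprice"))
val df2 = d1.groupBy("key1", "key2").agg(avg("itemcount").as("avgqty"))

// Reported to fail on Spark 2.0.0 with:
//   org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can not be
//   resolved given input columns: [key1, key2, avgtotalprice, avgqty]
df1.join(df2, Seq("key1", "key2"))
{code}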



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-09-30 Thread Luca Menichetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535884#comment-15535884
 ] 

Luca Menichetti edited comment on SPARK-4105 at 9/30/16 12:43 PM:
--

I have exactly the same problem. It occurs when I process more than one TB of 
data.

Splitting the computation in halves works perfectly, but I need to process 
the whole thing together.

{noformat}
java.io.IOException: failed to read chunk
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:347)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
at 
org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

As a consequence, since I am using Spark SQL, the job fails, reporting:

{noformat}
16/09/30 14:40:51 ERROR LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 

[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-09-30 Thread Luca Menichetti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535884#comment-15535884
 ] 

Luca Menichetti commented on SPARK-4105:


I have exactly the same problem. It occurs when I process more than one TB of 
data.

Splitting the computation in halves will work perfectly, but I need to 
process the whole thing together.

```
java.io.IOException: failed to read chunk
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:347)
at 
org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
at 
org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
```

As a consequence, since I am using Spark SQL, the job fails, reporting:

```
16/09/30 14:40:51 ERROR LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 

[jira] [Updated] (SPARK-17745) Update Python API for NB to support weighted instances

2016-09-30 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-17745:
---
Component/s: PySpark

> Update Python API for NB to support weighted instances
> --
>
> Key: SPARK-17745
> URL: https://issues.apache.org/jira/browse/SPARK-17745
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: zhengruifeng
>Priority: Minor
>
> Update the Python wrapper of NB to support weighted instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17744) Parity check between the ml and mllib test suites for NB

2016-09-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17744:

Description: We have moved the {{NaiveBayes}} implementation from mllib to the ml 
package in SPARK-14077. After that, we should do a parity check between the ml 
and mllib test suites and add the missing test cases for ml, since we may 
delete the spark.mllib packages in the future.  (was: We have moved the {{NaiveBayes}} 
implementation from mllib to the ml package in SPARK-14077. After that, we should 
do a parity check between the ml and mllib test suites and add the missing 
test cases for ml, since we may delete the mllib packages in the future.)

> Parity check between the ml and mllib test suites for NB
> 
>
> Key: SPARK-17744
> URL: https://issues.apache.org/jira/browse/SPARK-17744
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> We have moved the {{NaiveBayes}} implementation from mllib to the ml package in 
> SPARK-14077. After that, we should do a parity check between the ml and mllib 
> test suites and add the missing test cases for ml, since we may delete the 
> spark.mllib packages in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17744) Parity check between the ml and mllib test suites for NB

2016-09-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17744:

Description: We have moved the {{NaiveBayes}} implementation from mllib to the ml 
package in SPARK-14077. After that, we should do a parity check between the ml 
and mllib test suites and add the missing test cases for ml, since we may 
delete the mllib packages in the future.  (was: We have moved the {{NaiveBayes}} 
implementation from mllib to the ml package in SPARK-14077. After that, we should 
do a parity check between the ml and mllib test suites and add the missing 
test cases for ml.)

> Parity check between the ml and mllib test suites for NB
> 
>
> Key: SPARK-17744
> URL: https://issues.apache.org/jira/browse/SPARK-17744
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> We have moved the {{NaiveBayes}} implementation from mllib to the ml package in 
> SPARK-14077. After that, we should do a parity check between the ml and mllib 
> test suites and add the missing test cases for ml, since we may delete the 
> mllib packages in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17744) Parity check between the ml and mllib test suites for NB

2016-09-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17744:

Description: We have moved the {{NaiveBayes}} implementation from mllib to the ml 
package in SPARK-14077. After that, we should do a parity check between the ml 
and mllib test suites and add the missing test cases for ml.  (was: Parity 
check between the ml and mllib test suites, and add the missing test cases 
for ml.)

> Parity check between the ml and mllib test suites for NB
> 
>
> Key: SPARK-17744
> URL: https://issues.apache.org/jira/browse/SPARK-17744
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> We have moved the {{NaiveBayes}} implementation from mllib to the ml package in 
> SPARK-14077. After that, we should do a parity check between the ml and mllib 
> test suites and add the missing test cases for ml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17744) Parity check between the ml and mllib test suites for NB

2016-09-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17744:

Priority: Minor  (was: Major)

> Parity check between the ml and mllib test suites for NB
> 
>
> Key: SPARK-17744
> URL: https://issues.apache.org/jira/browse/SPARK-17744
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Parity check between the ml and mllib test suites, and add the missing 
> test cases for ml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17744) Parity check between the ml and mllib test suites for NB

2016-09-30 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535775#comment-15535775
 ] 

Yanbo Liang commented on SPARK-17744:
-

[~srowen] We are working to copy algorithm implementations from mllib to ml 
and leave mllib as a wrapper that calls ml. {{NaiveBayes}} is the first one we 
are working on, so we have split the move into several steps: copying the 
implementation, doing the parity check and adding missing test suites for ml, 
investigating mllib wrapper performance, etc. This is one of those sub-tasks. 
Splitting the work up makes it clearer how much effort moving an algorithm 
implementation from mllib to ml involves. This was suggested by [~mengxr]. We 
should certainly address all related issues in a single PR when moving other 
algorithms, once we have a clear picture. Thanks.
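
For readers unfamiliar with the wrapper pattern mentioned here, a minimal sketch 
(hypothetical object and method names, not the actual Spark source) of a 
spark.mllib-style entry point delegating to the spark.ml {{NaiveBayes}} 
implementation could look like this:
{code:scala}
import org.apache.spark.ml.classification.{NaiveBayes => MLNaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper: keep the old RDD-based API surface, but delegate the
// actual training to the spark.ml implementation.
object NaiveBayesWrapperSketch {
  def train(data: RDD[LabeledPoint], lambda: Double): NaiveBayesModel = {
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    // Convert RDD[LabeledPoint] (mllib vectors) into the DataFrame shape spark.ml expects.
    val df = data.map(lp => (lp.label, lp.features.asML)).toDF("label", "features")
    new MLNaiveBayes().setSmoothing(lambda).fit(df)
  }
}
{code}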

> Parity check between the ml and mllib test suites for NB
> 
>
> Key: SPARK-17744
> URL: https://issues.apache.org/jira/browse/SPARK-17744
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>
> Parity check between the ml and mllib test suites, and add the missing 
> test cases for ml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17745) Update Python API for NB to support weighted instances

2016-09-30 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535750#comment-15535750
 ] 

Weichen Xu commented on SPARK-17745:


I will work on it and create a PR ASAP, thanks!

> Update Python API for NB to support weighted instances
> --
>
> Key: SPARK-17745
> URL: https://issues.apache.org/jira/browse/SPARK-17745
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Update the Python wrapper of NB to support weighted instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17745) Update Python API for NB to support weighted instances

2016-09-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17745:

Priority: Minor  (was: Major)

> Update Python API for NB to support weighted instances
> --
>
> Key: SPARK-17745
> URL: https://issues.apache.org/jira/browse/SPARK-17745
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> Update the Python wrapper of NB to support weighted instances



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-30 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535727#comment-15535727
 ] 

Jiang Xingbo edited comment on SPARK-17733 at 9/30/16 11:26 AM:


The problem lies in the function `QueryPlan.inferAdditionalConstraints`. This 
function infers a set of additional constraints from the given equality 
constraints, and it can cause expressions to propagate. For instance:
{code:sql}
Set(a = b, a = c) => b = c
(a, b, c are all `Attribute`s.)
{code}
When c is created from an `Alias`, like `Alias(Coalesce(a, b))`, we deduce 
{code}b = Alias(Coalesce(a, b)){code}; after `PushPredicateThroughJoin` pushes 
this predicate through the `Join` operator, it appears in the constraints again. 
As this process repeats, the set of constraints grows larger and larger.

It would be complicated to adapt the `PushPredicateThroughJoin` rule to this 
case, so maybe we should remove the wrongly propagated constraints in the rule 
`InferFiltersFromConstraints`. I'll submit a PR to resolve this problem soon. 
Thank you!
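
To make the blow-up concrete, here is a toy Scala sketch (illustrative only, not 
Catalyst code; treating each optimizer round as simply re-applying the alias to 
the previously inferred constraint is a simplification of the push-through-join 
step described above):
{code:scala}
// Toy model of the runaway inference: c is really Alias(Coalesce(a, b)), so the
// constraint inferred from {a = b, a = c} is b = coalesce(a, b); re-applying the
// alias on every round keeps producing a strictly larger constraint.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Coalesce(left: Expr, right: Expr) extends Expr

def show(e: Expr): String = e match {
  case Attr(n)        => n
  case Coalesce(l, r) => s"coalesce(${show(l)}, ${show(r)})"
}

val a = Attr("a")
val b = Attr("b")
// The alias that defines c on one side of the join.
def aliasOfC(x: Expr): Expr = Coalesce(a, x)

// Constraint inferred from Set(a = b, a = c): b = c, i.e. b = coalesce(a, b).
var rhs: Expr = aliasOfC(b)
for (round <- 1 to 4) {
  println(s"round $round: b = ${show(rhs)}")
  rhs = aliasOfC(rhs)   // simplified stand-in for push-through-join + re-inference
}
// round 1: b = coalesce(a, b)
// round 2: b = coalesce(a, coalesce(a, b))
// round 3: b = coalesce(a, coalesce(a, coalesce(a, b)))
// ...
{code}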


was (Author: jiangxb1987):
The problem lies in the function `QueryPlan.inferAdditionalConstraints`. This 
function infers a set of additional constraints from the given equality 
constraints, and it can cause expressions to propagate. For instance:
{code:sql}
Set(a = b, a = c) => b = c
(a, b, c are all `Attribute`s.)
{code}
When c is created from an `Alias`, like `Alias(Coalesce(a, b))`, we deduce 
{code}b = Alias(Coalesce(a, b)){code}; after `PushPredicateThroughJoin` pushes 
this predicate through the `Join` operator, it appears in the constraints again. 
As this process repeats, the set of constraints grows larger and larger.

It would be complicated to adapt the `PushPredicateThroughJoin` rule to this 
case, so maybe we should remove the propagated constraints in the rule 
`InferFiltersFromConstraints`. I'll submit a PR to resolve this problem soon. 
Thank you!

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png,
>  constraints.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table Creation|borderStyle=solid}
>  -- Query #0
> CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
> boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
> AS VALUES
>   (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
> TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
> TIMESTAMP('2007-01-13 00:00:00.0')),
>   (CAST(722.4906 AS FLOAT), true, 497.54BD, true, 
> TIMESTAMP('2015-12-14 00:00:00.0'), false, CAST(268 AS BIGINT), 
> TIMESTAMP('2021-04-19 00:00:00.0')),
>   (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
> TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
> TIMESTAMP('2019-10-16 00:00:00.0')),
>   (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
> TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
> TIMESTAMP)),
>   (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
> TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
> TIMESTAMP('2024-09-01 00:00:00.0')),
>   (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
> TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
> TIMESTAMP('2028-12-03 00:00:00.0')),
>   (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
> TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
> TIMESTAMP('1994-04-04 00:00:00.0')),
>   (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
> TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
> TIMESTAMP('2023-07-08 00:00:00.0')),
>   (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
> TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
> TIMESTAMP('2013-09-29 00:00:00.0')),
>   (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
> TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
> TIMESTAMP('1990-08-10 00:00:00.0'))
>  ;
>  -- Query #1
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, 

[jira] [Comment Edited] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-30 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535727#comment-15535727
 ] 

Jiang Xingbo edited comment on SPARK-17733 at 9/30/16 11:21 AM:


The problem lies in the function `QueryPlan.inferAdditionalConstraints`. This 
function infers a set of additional constraints from the given equality 
constraints, and it can cause expressions to propagate. For instance:
{code:sql}
Set(a = b, a = c) => b = c
(a, b, c are all `Attribute`s.)
{code}
When c is created from an `Alias`, like `Alias(Coalesce(a, b))`, we deduce 
{code}b = Alias(Coalesce(a, b)){code}; after `PushPredicateThroughJoin` pushes 
this predicate through the `Join` operator, it appears in the constraints again. 
As this process repeats, the set of constraints grows larger and larger.

It would be complicated to adapt the `PushPredicateThroughJoin` rule to this 
case, so maybe we should remove the propagated constraints in the rule 
`InferFiltersFromConstraints`. I'll submit a PR to resolve this problem soon. 
Thank you!


was (Author: jiangxb1987):
The problem lies in the function `QueryPlan.inferAdditionalConstraints`. This 
function infers a set of additional constraints from the given equality 
constraints, and it can cause expressions to propagate. For instance:
{code:sql}
Set(a = b, a = c) => b = c
(a, b, c are all `Attribute`s.)
{code}
When c is an `Alias`, like `Alias(Coalesce(a, b))`, we deduce 
{code}b = Alias(Coalesce(a, b)){code}; after `PushPredicateThroughJoin` pushes 
this predicate through the `Join` operator, it appears in the constraints again. 
As this process repeats, the set of constraints grows larger and larger.

It would be complicated to adapt the `PushPredicateThroughJoin` rule to this 
case, so maybe we should remove the propagated constraints in the rule 
`InferFiltersFromConstraints`. I'll submit a PR to resolve this problem soon. 
Thank you!

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png,
>  constraints.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table Creation|borderStyle=solid}
>  -- Query #0
> CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
> boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
> AS VALUES
>   (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
> TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
> TIMESTAMP('2007-01-13 00:00:00.0')),
>   (CAST(722.4906 AS FLOAT), true, 497.54BD, true, 
> TIMESTAMP('2015-12-14 00:00:00.0'), false, CAST(268 AS BIGINT), 
> TIMESTAMP('2021-04-19 00:00:00.0')),
>   (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
> TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
> TIMESTAMP('2019-10-16 00:00:00.0')),
>   (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
> TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
> TIMESTAMP)),
>   (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
> TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
> TIMESTAMP('2024-09-01 00:00:00.0')),
>   (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
> TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
> TIMESTAMP('2028-12-03 00:00:00.0')),
>   (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
> TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
> TIMESTAMP('1994-04-04 00:00:00.0')),
>   (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
> TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
> TIMESTAMP('2023-07-08 00:00:00.0')),
>   (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
> TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
> TIMESTAMP('2013-09-29 00:00:00.0')),
>   (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
> TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
> TIMESTAMP('1990-08-10 00:00:00.0'))
>  ;
>  -- Query #1
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, 

[jira] [Comment Edited] (SPARK-17733) InferFiltersFromConstraints rule never terminates for query

2016-09-30 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15535727#comment-15535727
 ] 

Jiang Xingbo edited comment on SPARK-17733 at 9/30/16 11:20 AM:


The problem lies in the function `QueryPlan.inferAdditionalConstraints`. This 
function infers a set of additional constraints from the given equality 
constraints, and it can cause expressions to propagate. For instance:
{code:sql}
Set(a = b, a = c) => b = c
(a, b, c are all `Attribute`s.)
{code}
When c is an `Alias`, like `Alias(Coalesce(a, b))`, we deduce 
{code}b = Alias(Coalesce(a, b)){code}; after `PushPredicateThroughJoin` pushes 
this predicate through the `Join` operator, it appears in the constraints again. 
As this process repeats, the set of constraints grows larger and larger.

It would be complicated to adapt the `PushPredicateThroughJoin` rule to this 
case, so maybe we should remove the propagated constraints in the rule 
`InferFiltersFromConstraints`. I'll submit a PR to resolve this problem soon. 
Thank you!


was (Author: jiangxb1987):
The problem lies in the function `QueryPlan.inferAdditionalConstraints`. This 
function infers a set of additional constraints from the given equality 
constraints, and it can cause expressions to propagate. For instance:
{code:sql}
Set(a = b, a = c) => b = c (a, b, c are all `Attribute`s.)
{code}
When c is an `Alias`, like `Alias(Coalesce(a, b))`, we deduce 
{code}b = Alias(Coalesce(a, b)){code}; after `PushPredicateThroughJoin` pushes 
this predicate through the `Join` operator, it appears in the constraints again. 
As this process repeats, the set of constraints grows larger and larger.

It would be complicated to adapt the `PushPredicateThroughJoin` rule to this 
case, so maybe we should remove the propagated constraints in the rule 
`InferFiltersFromConstraints`. I'll submit a PR to resolve this problem soon. 
Thank you!

> InferFiltersFromConstraints rule never terminates for query
> ---
>
> Key: SPARK-17733
> URL: https://issues.apache.org/jira/browse/SPARK-17733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Critical
> Attachments: 
> SparkSubmit-2016-09-29-1_snapshot___Users_joshrosen_Snapshots__-_YourKit_Java_Profiler_2013_build_13088_-_64-bit.png,
>  constraints.png
>
>
> The following (complicated) example becomes stuck in the 
> {{InferFiltersFromConstraints}} rule and never runs. However, it doesn't fail 
> with a stack overflow and doesn't hit the limit on optimization passes, so I 
> think there's some sort of non-obvious infinite loop within the rule itself.
> {code:title=Table Creation|borderStyle=solid}
>  -- Query #0
> CREATE TEMPORARY VIEW table_4(float_col_1, boolean_col_2, decimal2610_col_3, 
> boolean_col_4, timestamp_col_5, boolean_col_6, bigint_col_7, timestamp_col_8) 
> AS VALUES
>   (CAST(21.920416 AS FLOAT), false, -182.07BD, true, 
> TIMESTAMP('1996-10-24 00:00:00.0'), true, CAST(-993 AS BIGINT), 
> TIMESTAMP('2007-01-13 00:00:00.0')),
>   (CAST(722.4906 AS FLOAT), true, 497.54BD, true, 
> TIMESTAMP('2015-12-14 00:00:00.0'), false, CAST(268 AS BIGINT), 
> TIMESTAMP('2021-04-19 00:00:00.0')),
>   (CAST(534.9996 AS FLOAT), true, -470.83BD, true, 
> TIMESTAMP('1996-01-31 00:00:00.0'), false, CAST(-910 AS BIGINT), 
> TIMESTAMP('2019-10-16 00:00:00.0')),
>   (CAST(-289.6454 AS FLOAT), false, 892.25BD, false, 
> TIMESTAMP('2014-03-14 00:00:00.0'), false, CAST(-462 AS BIGINT), CAST(NULL AS 
> TIMESTAMP)),
>   (CAST(46.395535 AS FLOAT), true, -662.89BD, true, 
> TIMESTAMP('2000-10-16 00:00:00.0'), false, CAST(-656 AS BIGINT), 
> TIMESTAMP('2024-09-01 00:00:00.0')),
>   (CAST(-555.36285 AS FLOAT), true, -938.93BD, true, 
> TIMESTAMP('2007-04-10 00:00:00.0'), true, CAST(252 AS BIGINT), 
> TIMESTAMP('2028-12-03 00:00:00.0')),
>   (CAST(826.29004 AS FLOAT), true, 53.18BD, false, 
> TIMESTAMP('2004-06-11 00:00:00.0'), false, CAST(437 AS BIGINT), 
> TIMESTAMP('1994-04-04 00:00:00.0')),
>   (CAST(-15.276999 AS FLOAT), CAST(NULL AS BOOLEAN), -889.31BD, true, 
> TIMESTAMP('1991-05-23 00:00:00.0'), true, CAST(226 AS BIGINT), 
> TIMESTAMP('2023-07-08 00:00:00.0')),
>   (CAST(385.27386 AS FLOAT), CAST(NULL AS BOOLEAN), -9.95BD, false, 
> TIMESTAMP('2022-10-22 00:00:00.0'), true, CAST(430 AS BIGINT), 
> TIMESTAMP('2013-09-29 00:00:00.0')),
>   (CAST(988.7868 AS FLOAT), CAST(NULL AS BOOLEAN), 715.17BD, false, 
> TIMESTAMP('2026-10-03 00:00:00.0'), true, CAST(-696 AS BIGINT), 
> TIMESTAMP('1990-08-10 00:00:00.0'))
>  ;
>  -- Query #1
> CREATE TEMPORARY VIEW table_1(double_col_1, boolean_col_2, timestamp_col_3, 
> smallint_col_4, boolean_col_5, int_col_6, timestamp_col_7, varchar0008_col_8, 
> 
