[jira] [Assigned] (SPARK-20907) Use testQuietly for test suites that generate long log output
[ https://issues.apache.org/jira/browse/SPARK-20907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20907: Assignee: Apache Spark > Use testQuietly for test suites that generate long log output > - > > Key: SPARK-20907 > URL: https://issues.apache.org/jira/browse/SPARK-20907 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.2.0, 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Apache Spark > > Use `testQuietly` instead of `test` for test cases that generate long output -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20907) Use testQuietly for test suites that generate long log output
[ https://issues.apache.org/jira/browse/SPARK-20907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027707#comment-16027707 ] Apache Spark commented on SPARK-20907: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/18135 > Use testQuietly for test suites that generate long log output > - > > Key: SPARK-20907 > URL: https://issues.apache.org/jira/browse/SPARK-20907 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.2.0, 2.3.0 >Reporter: Kazuaki Ishizaki > > Use `testQuietly` instead of `test` for test cases that generate long output -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20907) Use testQuietly for test suites that generate long log output
[ https://issues.apache.org/jira/browse/SPARK-20907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20907: Assignee: (was: Apache Spark) > Use testQuietly for test suites that generate long log output > - > > Key: SPARK-20907 > URL: https://issues.apache.org/jira/browse/SPARK-20907 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.2.0, 2.3.0 >Reporter: Kazuaki Ishizaki > > Use `testQuietly` instead of `test` for test cases that generate long output -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20909) Built-in SQL Function Support - DAYOFWEEK
[ https://issues.apache.org/jira/browse/SPARK-20909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20909: Assignee: Apache Spark > Built-in SQL Function Support - DAYOFWEEK > - > > Key: SPARK-20909 > URL: https://issues.apache.org/jira/browse/SPARK-20909 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang >Assignee: Apache Spark > Labels: starter > > {noformat} > DAYOFWEEK(date) > {noformat} > Returns the weekday index of the argument. > Ref: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_dayofweek -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20909) Built-in SQL Function Support - DAYOFWEEK
[ https://issues.apache.org/jira/browse/SPARK-20909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20909: Assignee: (was: Apache Spark) > Built-in SQL Function Support - DAYOFWEEK > - > > Key: SPARK-20909 > URL: https://issues.apache.org/jira/browse/SPARK-20909 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Labels: starter > > {noformat} > DAYOFWEEK(date) > {noformat} > Returns the weekday index of the argument. > Ref: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_dayofweek -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20909) Built-in SQL Function Support - DAYOFWEEK
[ https://issues.apache.org/jira/browse/SPARK-20909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027705#comment-16027705 ] Apache Spark commented on SPARK-20909: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/18134 > Built-in SQL Function Support - DAYOFWEEK > - > > Key: SPARK-20909 > URL: https://issues.apache.org/jira/browse/SPARK-20909 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Labels: starter > > {noformat} > DAYOFWEEK(date) > {noformat} > Returns the weekday index of the argument. > Ref: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_dayofweek -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20908) Cache Manager: Hint should be ignored in plan matching
[ https://issues.apache.org/jira/browse/SPARK-20908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20908. - Resolution: Fixed Fix Version/s: 2.2.0 > Cache Manager: Hint should be ignored in plan matching > -- > > Key: SPARK-20908 > URL: https://issues.apache.org/jira/browse/SPARK-20908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > In Cache manager, the plan matching should ignore Hint. > {noformat} > val df1 = spark.range(10).join(broadcast(spark.range(10))) > df1.cache() > spark.range(10).join(spark.range(10)).explain() > {noformat} > The output plan of the above query shows that the second query is not using > the cached data of the first query. > {noformat} > BroadcastNestedLoopJoin BuildRight, Inner > :- *Range (0, 10, step=1, splits=2) > +- BroadcastExchange IdentityBroadcastMode >+- *Range (0, 10, step=1, splits=2) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20909) Built-in SQL Function Support - DAYOFWEEK
[ https://issues.apache.org/jira/browse/SPARK-20909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027694#comment-16027694 ] Yuming Wang commented on SPARK-20909: - I'm working on this. > Built-in SQL Function Support - DAYOFWEEK > - > > Key: SPARK-20909 > URL: https://issues.apache.org/jira/browse/SPARK-20909 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Yuming Wang > Labels: starter > > {noformat} > DAYOFWEEK(date) > {noformat} > Returns the weekday index of the argument. > Ref: > https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_dayofweek -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20909) Built-in SQL Function Support - DAYOFWEEK
Yuming Wang created SPARK-20909: --- Summary: Built-in SQL Function Support - DAYOFWEEK Key: SPARK-20909 URL: https://issues.apache.org/jira/browse/SPARK-20909 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Yuming Wang {noformat} DAYOFWEEK(date) {noformat} Returns the weekday index of the argument. Ref: https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html#function_dayofweek -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
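Until a built-in DAYOFWEEK lands, the MySQL semantics quoted above can be approximated with functions that already exist in Spark 2.2. The sketch below is only an illustration of the requested behaviour, not the proposed implementation; the sample date is made up, and the (u % 7) + 1 shift converts SimpleDateFormat's 1 = Monday .. 7 = Sunday numbering into MySQL's 1 = Sunday .. 7 = Saturday.
{code}
// Hedged workaround sketch, assuming a SparkSession named `spark`.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("2017-05-27").toDF("d")  // 2017-05-27 is a Saturday
df.select(((date_format($"d", "u").cast("int") % 7) + 1).as("dayofweek")).show()
// expected output: 7, matching MySQL's DAYOFWEEK('2017-05-27')
{code}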
[jira] [Commented] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027685#comment-16027685 ] kant kodali commented on SPARK-20894: - Not a Bug. > Error while checkpointing to HDFS (similar to JIRA SPARK-19268) > --- > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kant kodali closed SPARK-20894. --- Resolution: Not A Problem > Error while checkpointing to HDFS (similar to JIRA SPARK-19268) > --- > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
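The ticket was closed without a confirmed root cause. One thing worth ruling out in setups like this (purely an assumption here, not the resolution) is path ambiguity: a scheme-less checkpoint location can resolve against different filesystems depending on each node's Hadoop configuration, so a fully qualified HDFS URI removes that variable.
{code}
// Hedged Scala sketch (the report's snippet is Java); the namenode host/port is a placeholder
// and KafkaSink is the reporter's own class.
val query = df2.writeStream
  .foreach(new KafkaSink())
  .option("checkpointLocation", "hdfs://namenode:8020/usr/local/hadoop/checkpoint")
  .outputMode("update")
  .start()
{code}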
[jira] [Commented] (SPARK-8184) date/time function: weekofyear
[ https://issues.apache.org/jira/browse/SPARK-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027680#comment-16027680 ] Apache Spark commented on SPARK-8184: - User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/18132 > date/time function: weekofyear > -- > > Key: SPARK-8184 > URL: https://issues.apache.org/jira/browse/SPARK-8184 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Tarek Auel > Fix For: 1.5.0 > > > weekofyear(string|date|timestamp): int > Returns the week number of a timestamp string: weekofyear("1970-11-01 > 00:00:00") = 44, weekofyear("1970-11-01") = 44. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
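As a quick reference for the semantics quoted above, the function is already callable from SQL; the expected values are the ones given in the issue description.
{code}
// Hedged check, assuming a SparkSession named `spark`.
spark.sql("SELECT weekofyear('1970-11-01 00:00:00'), weekofyear('1970-11-01')").show()
// both columns are expected to show 44
{code}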
[jira] [Assigned] (SPARK-20908) Cache Manager: Hint should be ignored in plan matching
[ https://issues.apache.org/jira/browse/SPARK-20908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20908: Assignee: Apache Spark (was: Xiao Li) > Cache Manager: Hint should be ignored in plan matching > -- > > Key: SPARK-20908 > URL: https://issues.apache.org/jira/browse/SPARK-20908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > In Cache manager, the plan matching should ignore Hint. > {noformat} > val df1 = spark.range(10).join(broadcast(spark.range(10))) > df1.cache() > spark.range(10).join(spark.range(10)).explain() > {noformat} > The output plan of the above query shows that the second query is not using > the cached data of the first query. > {noformat} > BroadcastNestedLoopJoin BuildRight, Inner > :- *Range (0, 10, step=1, splits=2) > +- BroadcastExchange IdentityBroadcastMode >+- *Range (0, 10, step=1, splits=2) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20908) Cache Manager: Hint should be ignored in plan matching
[ https://issues.apache.org/jira/browse/SPARK-20908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20908: Assignee: Xiao Li (was: Apache Spark) > Cache Manager: Hint should be ignored in plan matching > -- > > Key: SPARK-20908 > URL: https://issues.apache.org/jira/browse/SPARK-20908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > In Cache manager, the plan matching should ignore Hint. > {noformat} > val df1 = spark.range(10).join(broadcast(spark.range(10))) > df1.cache() > spark.range(10).join(spark.range(10)).explain() > {noformat} > The output plan of the above query shows that the second query is not using > the cached data of the first query. > {noformat} > BroadcastNestedLoopJoin BuildRight, Inner > :- *Range (0, 10, step=1, splits=2) > +- BroadcastExchange IdentityBroadcastMode >+- *Range (0, 10, step=1, splits=2) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20908) Cache Manager: Hint should be ignored in plan matching
[ https://issues.apache.org/jira/browse/SPARK-20908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027660#comment-16027660 ] Apache Spark commented on SPARK-20908: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18131 > Cache Manager: Hint should be ignored in plan matching > -- > > Key: SPARK-20908 > URL: https://issues.apache.org/jira/browse/SPARK-20908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > In Cache manager, the plan matching should ignore Hint. > {noformat} > val df1 = spark.range(10).join(broadcast(spark.range(10))) > df1.cache() > spark.range(10).join(spark.range(10)).explain() > {noformat} > The output plan of the above query shows that the second query is not using > the cached data of the first query. > {noformat} > BroadcastNestedLoopJoin BuildRight, Inner > :- *Range (0, 10, step=1, splits=2) > +- BroadcastExchange IdentityBroadcastMode >+- *Range (0, 10, step=1, splits=2) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20908) Cache Manager: Hint should be ignored in plan matching
[ https://issues.apache.org/jira/browse/SPARK-20908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20908: Description: In Cache manager, the plan matching should ignore Hint. {noformat} val df1 = spark.range(10).join(broadcast(spark.range(10))) df1.cache() spark.range(10).join(spark.range(10)).explain() {noformat} The output plan of the above query shows that the second query is not using the cached data of the first query. {noformat} BroadcastNestedLoopJoin BuildRight, Inner :- *Range (0, 10, step=1, splits=2) +- BroadcastExchange IdentityBroadcastMode +- *Range (0, 10, step=1, splits=2) {noformat} was: In Cache manager, the plan matching should ignore Hint. {noformat} val df1 = spark.range(10).join(broadcast(spark.range(10))) df1.cache() spark.range(10).join(spark.range(10)).explain() {noformat} The above query shows the plan that does not use the cached data {noformat} BroadcastNestedLoopJoin BuildRight, Inner :- *Range (0, 10, step=1, splits=2) +- BroadcastExchange IdentityBroadcastMode +- *Range (0, 10, step=1, splits=2) {noformat} > Cache Manager: Hint should be ignored in plan matching > -- > > Key: SPARK-20908 > URL: https://issues.apache.org/jira/browse/SPARK-20908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > In Cache manager, the plan matching should ignore Hint. > {noformat} > val df1 = spark.range(10).join(broadcast(spark.range(10))) > df1.cache() > spark.range(10).join(spark.range(10)).explain() > {noformat} > The output plan of the above query shows that the second query is not using > the cached data of the first query. > {noformat} > BroadcastNestedLoopJoin BuildRight, Inner > :- *Range (0, 10, step=1, splits=2) > +- BroadcastExchange IdentityBroadcastMode >+- *Range (0, 10, step=1, splits=2) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20908) Cache Manager: Hint should be ignored in plan matching
Xiao Li created SPARK-20908: --- Summary: Cache Manager: Hint should be ignored in plan matching Key: SPARK-20908 URL: https://issues.apache.org/jira/browse/SPARK-20908 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.2.0 Reporter: Xiao Li Assignee: Xiao Li In Cache manager, the plan matching should ignore Hint. {noformat} val df1 = spark.range(10).join(broadcast(spark.range(10))) df1.cache() spark.range(10).join(spark.range(10)).explain() {noformat} The above query shows the plan that does not use the cached data {noformat} BroadcastNestedLoopJoin BuildRight, Inner :- *Range (0, 10, step=1, splits=2) +- BroadcastExchange IdentityBroadcastMode +- *Range (0, 10, step=1, splits=2) {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
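One hedged way to observe the behaviour being fixed (not part of the ticket itself) is to look for an InMemoryTableScan node in the second query's physical plan: while the hint difference defeats plan matching the node is absent, and after the fix it should appear.
{code}
// Hedged sketch, assuming a SparkSession named `spark`.
import org.apache.spark.sql.functions.broadcast

val df1 = spark.range(10).join(broadcast(spark.range(10)))
df1.cache()
val plan = spark.range(10).join(spark.range(10)).queryExecution.executedPlan
// true once the cached plan is picked up, false while the hint blocks the match
println(plan.collect { case n if n.nodeName.contains("InMemoryTableScan") => n }.nonEmpty)
{code}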
[jira] [Assigned] (SPARK-20876) If the input parameter is float type for ceil or floor, the result is not what we expected
[ https://issues.apache.org/jira/browse/SPARK-20876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-20876: --- Assignee: liuxian > If the input parameter is float type for ceil or floor, the result is not what we > expected > -- > > Key: SPARK-20876 > URL: https://issues.apache.org/jira/browse/SPARK-20876 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: liuxian >Assignee: liuxian > Fix For: 2.3.0 > > > spark-sql>SELECT ceil(cast(12345.1233 as float)); > spark-sql>12345 > For this case, the expected result is 12346 > spark-sql>SELECT floor(cast(-12345.1233 as float)); > spark-sql>-12345 > For this case, the expected result is -12346 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20876) If the input parameter is float type for ceil or floor, the result is not what we expected
[ https://issues.apache.org/jira/browse/SPARK-20876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20876. - Resolution: Fixed Fix Version/s: 2.3.0 > If the input parameter is float type for ceil or floor, the result is not what we > expected > -- > > Key: SPARK-20876 > URL: https://issues.apache.org/jira/browse/SPARK-20876 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: liuxian >Assignee: liuxian > Fix For: 2.3.0 > > > spark-sql>SELECT ceil(cast(12345.1233 as float)); > spark-sql>12345 > For this case, the expected result is 12346 > spark-sql>SELECT floor(cast(-12345.1233 as float)); > spark-sql>-12345 > For this case, the expected result is -12346 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
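For comparison, routing the same literals through double rather than float already produces the values the reporter expects, which is a hedged way to confirm the problem is specific to the float path.
{code}
// Hedged check, assuming a SparkSession named `spark`; expected output is 12346 and -12346.
spark.sql("SELECT ceil(CAST(12345.1233 AS double)), floor(CAST(-12345.1233 AS double))").show()
{code}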
[jira] [Resolved] (SPARK-20897) cached self-join should not fail
[ https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20897. - Resolution: Fixed Fix Version/s: 2.2.0 > cached self-join should not fail > > > Key: SPARK-20897 > URL: https://issues.apache.org/jira/browse/SPARK-20897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > > code to reproduce this bug: > {code} > // force to plan sort merge join > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") > val df = Seq(1 -> "a").toDF("i", "j") > val df1 = df.as("t1") > val df2 = df.as("t2") > assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20907) Use testQuietly for test suites that generate long log output
Kazuaki Ishizaki created SPARK-20907: Summary: Use testQuietly for test suites that generate long log output Key: SPARK-20907 URL: https://issues.apache.org/jira/browse/SPARK-20907 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.2.0, 2.3.0 Reporter: Kazuaki Ishizaki Use `testQuietly` instead of `test` for test cases that generate long output -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
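A minimal sketch of the proposed substitution is shown below; the suite and test names are invented and it is not one of the suites the pull request actually touches. testQuietly is provided by SQLTestUtils (mixed in here through SharedSQLContext) and runs the same body as test(), but with log output suppressed so deliberately noisy cases stop flooding target/unit-tests.log.
{code}
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSQLContext

// Hedged sketch only; class and test names are made up.
class LongLogOutputSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  testQuietly("query that used to flood the unit-test log") {
    val df = Seq(1, 2, 3).toDF("i")
    checkAnswer(df.filter($"i" > 1), Seq(Row(2), Row(3)))
  }
}
{code}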
[jira] [Comment Edited] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027476#comment-16027476 ] Dongjoon Hyun edited comment on SPARK-19809 at 5/27/17 4:18 PM: [~hyukjin.kwon]. I don't think so. Parquet file does not need `spark.sql.files.ignoreCorruptFiles` option. {code} scala> sql("create table empty_parquet(a int) stored as parquet location '/tmp/empty_parquet'").show ++ || ++ ++ $ touch /tmp/empty_parquet/zero.parquet scala> sql("select * from empty_parquet").show +---+ | a| +---+ +---+ {code} You can test this in Spark with SPARK-20728. {code} scala> sql("create table empty_orc2(a int) using orc location '/tmp/empty_orc'").show ++ || ++ ++ scala> sql("select * from empty_orc2").show +---+ | a| +---+ +---+ {code} I think this is a part of SPARK-20901. And ORC community will handle this. What we need is just to use latest ORC. One thing I'm wondering is this is tracked in https://issues.apache.org/jira/browse/ORC-162 (Open). was (Author: dongjoon): [~hyukjin.kwon]. I don't think so. Parquet file does not need `spark.sql.files.ignoreCorruptFiles` option. {code} scala> sql("create table empty_parquet(a int) stored as parquet location '/tmp/empty_parquet'").show ++ || ++ ++ $ touch /tmp/empty_parquet/zero.parquet scala> sql("select * from empty_parquet").show +---+ | a| +---+ +---+ {code} Also latest ORC file does not, too. It's fixed in https://issues.apache.org/jira/browse/ORC-162 . You can test this in Spark with SPARK-20728. {code} scala> sql("create table empty_orc2(a int) using orc location '/tmp/empty_orc'").show ++ || ++ ++ scala> sql("select * from empty_orc2").show +---+ | a| +---+ +---+ {code} I think this is a part of SPARK-20901. And ORC community already resolved this. What we need is just to use latest ORC. 
> NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027476#comment-16027476 ] Dongjoon Hyun commented on SPARK-19809: --- [~hyukjin.kwon]. I don't think so. Parquet file does not need `spark.sql.files.ignoreCorruptFiles` option. {code} scala> sql("create table empty_parquet(a int) stored as parquet location '/tmp/empty_parquet'").show ++ || ++ ++ $ touch /tmp/empty_parquet/zero.parquet scala> sql("select * from empty_parquet").show +---+ | a| +---+ +---+ {code} Also latest ORC file does not, too. It's fixed in https://issues.apache.org/jira/browse/ORC-162 . You can test this in Spark with SPARK-20728. {code} scala> sql("create table empty_orc2(a int) using orc location '/tmp/empty_orc'").show ++ || ++ ++ scala> sql("select * from empty_orc2").show +---+ | a| +---+ +---+ {code} I think this is a part of SPARK-20901. And ORC community already resolved this. What we need is just to use latest ORC. > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at >
[jira] [Assigned] (SPARK-20875) Spark should print the log when the directory has been deleted
[ https://issues.apache.org/jira/browse/SPARK-20875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-20875: - Assignee: liuzhaokun Priority: Trivial (was: Major) > Spark should print the log when the directory has been deleted > -- > > Key: SPARK-20875 > URL: https://issues.apache.org/jira/browse/SPARK-20875 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: liuzhaokun >Assignee: liuzhaokun >Priority: Trivial > Fix For: 2.3.0 > > > When the "deleteRecursively" method is invoked, Spark doesn't print any log message confirming that > the path was deleted. For example, Spark only prints "Removing directory" when > the worker begins cleaning spark.work.dir, but doesn't print anything like "the > path has been deleted". So I can't tell from the worker's log file whether the path was > actually deleted if anything goes wrong on the Linux side. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20875) Spark should print the log when the directory has been deleted
[ https://issues.apache.org/jira/browse/SPARK-20875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20875. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 18102 [https://github.com/apache/spark/pull/18102] > Spark should print the log when the directory has been deleted > -- > > Key: SPARK-20875 > URL: https://issues.apache.org/jira/browse/SPARK-20875 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: liuzhaokun > Fix For: 2.3.0 > > > When the "deleteRecursively" method is invoked, Spark doesn't print any log message confirming that > the path was deleted. For example, Spark only prints "Removing directory" when > the worker begins cleaning spark.work.dir, but doesn't print anything like "the > path has been deleted". So I can't tell from the worker's log file whether the path was > actually deleted if anything goes wrong on the Linux side. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
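A hedged sketch of the kind of confirmation message the reporter asks for is below; it is not the patch from pull request 18102, and the wrapper object is invented. It is written as if it lived inside Spark's util package so it can reach the private[spark] Utils and Logging.
{code}
package org.apache.spark.util

import java.io.File

import org.apache.spark.internal.Logging

// Hedged sketch only; object and method names are made up.
private[spark] object DeletionLogging extends Logging {
  def deleteRecursivelyAndLog(dir: File): Unit = {
    logInfo(s"Removing directory ${dir.getAbsolutePath}")
    Utils.deleteRecursively(dir)
    if (!dir.exists()) {
      logInfo(s"Directory ${dir.getAbsolutePath} has been deleted")
    }
  }
}
{code}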
[jira] [Updated] (SPARK-20393) Strengthen Spark to prevent XSS vulnerabilities
[ https://issues.apache.org/jira/browse/SPARK-20393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20393: -- Priority: Major (was: Minor) Fix Version/s: 2.1.2 > Strengthen Spark to prevent XSS vulnerabilities > --- > > Key: SPARK-20393 > URL: https://issues.apache.org/jira/browse/SPARK-20393 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.2, 2.0.2, 2.1.0 >Reporter: Nicholas Marion >Assignee: Nicholas Marion > Labels: security > Fix For: 2.1.2, 2.2.0 > > > Using IBM Security AppScan Standard, we discovered several easy to recreate > MHTML cross site scripting vulnerabilities in the Apache Spark Web GUI > application and these vulnerabilities were found to exist in Spark version > 1.5.2 and 2.0.2, the two levels we initially tested. Cross-site scripting > attack is not really an attack on the Spark server as much as an attack on > the end user, taking advantage of their trust in the Spark server to get them > to click on a URL like the ones in the examples below. So whether the user > could or could not change lots of stuff on the Spark server is not the key > point. It is an attack on the user themselves. If they click the link the > script could run in their browser and comprise their device. Once the > browser is compromised it could submit Spark requests but it also might not. > https://blogs.technet.microsoft.com/srd/2011/01/28/more-information-about-the-mhtml-script-injection-vulnerability/ > {quote} > Request: GET > /app/?appId=Content-Type:%20multipart/related;%20boundary=_AppScan%0d%0a-- > _AppScan%0d%0aContent-Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a > HTTP/1.1 > Excerpt from response: No running application with ID > Content-Type: multipart/related; > boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > Request: GET > /history/app-20161012202114-0038/stages/stage?id=1=0=Content- > Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent- > Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a > k.pageSize=100 HTTP/1.1 > Excerpt from response: Content-Type: multipart/related; > boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > Request: GET /log?appId=app-20170113131903-=0=Content- > Type:%20multipart/related;%20boundary=_AppScan%0d%0a--_AppScan%0d%0aContent- > Location:foo%0d%0aContent-Transfer- > Encoding:base64%0d%0a%0d%0aPGh0bWw%2bPHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw%2b%0d%0a > eLength=0 HTTP/1.1 > Excerpt from response: Bytes 0-0 of 0 of > /u/nmarion/Spark_2.0.2.0/Spark-DK/work/app-20170113131903-/0/Content- > Type: multipart/related; boundary=_AppScan > --_AppScan > Content-Location:foo > Content-Transfer-Encoding:base64 > PGh0bWw+PHNjcmlwdD5hbGVydCgiWFNTIik8L3NjcmlwdD48L2h0bWw+ > Result: In the above payload the BASE64 data decodes as: > alert("XSS") > {quote} > security@apache was notified and recommended a PR. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20896) spark executor get java.lang.ClassCastException when trigger two job at same time
[ https://issues.apache.org/jira/browse/SPARK-20896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027389#comment-16027389 ] Sean Owen commented on SPARK-20896: --- I don't think it has anything to do with running two jobs at the same time. You show some errors in your code above, is that related? If you're saying it's not a problem in spark-shell or spark-submit, then it's something to do with how your code interacts with Zeppelin, maybe. > spark executor get java.lang.ClassCastException when trigger two job at same > time > - > > Key: SPARK-20896 > URL: https://issues.apache.org/jira/browse/SPARK-20896 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.1 >Reporter: poseidon > > 1、zeppelin 0.6.2 in *SCOPE* mode > 2、spark 1.6.2 > 3、HDP 2.4 for HDFS YARN > trigger scala code like : > {quote} > var tmpDataFrame = sql(" select b1,b2,b3 from xxx.x") > val vectorDf = assembler.transform(tmpDataFrame) > val vectRdd = vectorDf.select("features").map{x:Row => x.getAs[Vector](0)} > val correlMatrix: Matrix = Statistics.corr(vectRdd, "spearman") > val columns = correlMatrix.toArray.grouped(correlMatrix.numRows) > val rows = columns.toSeq.transpose > val vectors = rows.map(row => new DenseVector(row.toArray)) > val vRdd = sc.parallelize(vectors) > import sqlContext.implicits._ > val dfV = vRdd.map(_.toArray).map{ case Array(b1,b2,b3) => (b1,b2,b3) }.toDF() > val rows = dfV.rdd.zipWithIndex.map(_.swap) > > .join(sc.parallelize(Array("b1","b2","b3")).zipWithIndex.map(_.swap)) > .values.map{case (row: Row, x: String) => Row.fromSeq(row.toSeq > :+ x)} > {quote} > --- > and code : > {quote} > var df = sql("select b1,b2 from .x") > var i = 0 > var threshold = Array(2.0,3.0) > var inputCols = Array("b1","b2") > var tmpDataFrame = df > for (col <- inputCols){ > val binarizer: Binarizer = new Binarizer().setInputCol(col) > .setOutputCol(inputCols(i)+"_binary") > .setThreshold(threshold(i)) > tmpDataFrame = binarizer.transform(tmpDataFrame).drop(inputCols(i)) > i = i+1 > } > var saveDFBin = tmpDataFrame > val dfAppendBin = sql("select b3 from poseidon.corelatdemo") > val rows = saveDFBin.rdd.zipWithIndex.map(_.swap) > .join(dfAppendBin.rdd.zipWithIndex.map(_.swap)) > .values.map{case (row1: Row, row2: Row) => Row.fromSeq(row1.toSeq > ++ row2.toSeq)} > import org.apache.spark.sql.types.StructType > val rowSchema = StructType(saveDFBin.schema.fields ++ > dfAppendBin.schema.fields) > saveDFBin = sqlContext.createDataFrame(rows, rowSchema) > //save result to table > import org.apache.spark.sql.SaveMode > saveDFBin.write.mode(SaveMode.Overwrite).saveAsTable(".") > sql("alter table . set lifecycle 1") > {quote} > on zeppelin with two different notebook at same time. 
> Found this exception log in executor : > {quote} > l1.dtdream.com): java.lang.ClassCastException: > org.apache.spark.mllib.linalg.DenseVector cannot be cast to scala.Tuple2 > at > $line127359816836.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:34) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597) > at > org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52) > at > org.apache.spark.rdd.ZippedWithIndexRDD$$anonfun$2.apply(ZippedWithIndexRDD.scala:52) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1875) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > OR > {quote} > java.lang.ClassCastException: scala.Tuple2 cannot be cast to > org.apache.spark.mllib.linalg.DenseVector > at > $line34684895436.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:57) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at >
[jira] [Commented] (SPARK-20320) AnalysisException: Columns of grouping_id (count(value#17L)) does not match grouping columns (count(value#17L))
[ https://issues.apache.org/jira/browse/SPARK-20320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027384#comment-16027384 ] lyc commented on SPARK-20320: - It seems `count("value")` should not be passed to `cube`; only column names belong there. Likewise with `groupBy`, it is invalid to write `group by count("value")`. > AnalysisException: Columns of grouping_id (count(value#17L)) does not match > grouping columns (count(value#17L)) > --- > > Key: SPARK-20320 > URL: https://issues.apache.org/jira/browse/SPARK-20320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Jacek Laskowski >Priority: Minor > > I'm not questioning the {{AnalysisException}} (which I don't know whether > should be reported or not), but the exception message that tells...nothing > helpful. > {code} > val records = spark.range(5).flatMap(n => Seq.fill(n.toInt)(n)) > scala> > records.cube(count("value")).agg(grouping_id(count("value"))).queryExecution.logical > org.apache.spark.sql.AnalysisException: Columns of grouping_id > (count(value#17L)) does not match grouping columns (count(value#17L)); > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:313) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.applyOrElse(Analyzer.scala:308) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
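A hedged illustration of the valid form lyc describes is below: column references go inside cube, while aggregates and grouping_id go inside agg. It reuses the records Dataset from the report and assumes a SparkSession named spark.
{code}
// Hedged sketch of the supported usage.
import spark.implicits._
import org.apache.spark.sql.functions._

val records = spark.range(5).flatMap(n => Seq.fill(n.toInt)(n))
records.cube($"value").agg(count("value"), grouping_id()).show()
{code}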
[jira] [Commented] (SPARK-20891) Reduce duplicate code in typedaggregators.scala
[ https://issues.apache.org/jira/browse/SPARK-20891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027381#comment-16027381 ] Ruben Janssen commented on SPARK-20891: --- OK, I will take that approach next time, thanks for the suggestion :) I have submitted the change and will continue with 20890 once it's merged in. > Reduce duplicate code in typedaggregators.scala > --- > > Key: SPARK-20891 > URL: https://issues.apache.org/jira/browse/SPARK-20891 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ruben Janssen > > With SPARK-20411, a significant number of functions will be added to > typedaggregators.scala, resulting in a large amount of duplicate code -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20905) When running spark with yarn-client, large executor-cores will lead to bad performance.
[ https://issues.apache.org/jira/browse/SPARK-20905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20905. --- Resolution: Invalid Questions should go to the mailing list, not JIRA > When running spark with yarn-client, large executor-cores will lead to bad > performance. > > > Key: SPARK-20905 > URL: https://issues.apache.org/jira/browse/SPARK-20905 > Project: Spark > Issue Type: Question > Components: Examples >Affects Versions: 2.0.0 >Reporter: Cherry Zhang > > Hi, all: > When I run a training job in Spark with yarn-client and set > executor-cores=20 (less than vcores=24) and executor-num=4 (my cluster has 4 > slaves), there is always one node whose computing time is larger than the > others. > I checked some blogs, and they say executor-cores should be set to less than 5 > when there are many concurrent threads. I tried executor-cores=4 > and executor-num=20, and then it worked. > But I don't know why; can you give some explanation? Thank you very much. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kant kodali updated SPARK-20894: Attachment: driver_info_log > Error while checkpointing to HDFS (similar to JIRA SPARK-19268) > --- > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kant kodali updated SPARK-20894: Attachment: (was: driver_log) > Error while checkpointing to HDFS (similar to JIRA SPARK-19268) > --- > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_info_log, executor1_log, executor2_log > > > Dataset df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 > hours", "24 hours"), df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new > KafkaSink()).option("checkpointLocation","/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > This for some reason fails with the Error > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I did clear all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka from all brokers prior to running and yet this > error still persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20897) cached self-join should not fail
[ https://issues.apache.org/jira/browse/SPARK-20897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-20897: Target Version/s: 2.2.0 > cached self-join should not fail > > > Key: SPARK-20897 > URL: https://issues.apache.org/jira/browse/SPARK-20897 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > > code to reproduce this bug: > {code} > // force to plan sort merge join > spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0") > val df = Seq(1 -> "a").toDF("i", "j") > val df1 = df.as("t1") > val df2 = df.as("t2") > assert(df1.join(df2, $"t1.i" === $"t2.i").cache().count() == 1) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027361#comment-16027361 ] Hyukjin Kwon commented on SPARK-19809: -- I think this is then rather about handling malformed files (e.g., {{spark.sql.files.ignoreCorruptFiles}}). > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2, 2.1.1 >Reporter: Michał Dawid > > When reading from hive ORC table if there are some 0 byte files we get > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > 
org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at >
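For readers following the thread, this is the option being referred to in the comment above; the sketch below only shows how it is switched on (the path is a placeholder), and whether zero-byte ORC files should require it at all is exactly the point under discussion.
{code}
// Hedged illustration, assuming a SparkSession named `spark`.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.read.orc("/tmp/empty_orc").show()
{code}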
[jira] [Assigned] (SPARK-20365) Not so accurate classpath format for AM and Containers
[ https://issues.apache.org/jira/browse/SPARK-20365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20365: Assignee: (was: Apache Spark) > Not so accurate classpath format for AM and Containers > -- > > Key: SPARK-20365 > URL: https://issues.apache.org/jira/browse/SPARK-20365 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > In Spark on YARN, when configuring "spark.yarn.jars" with local jars (jars > started with "local" scheme), we will get inaccurate classpath for AM and > containers. This is because we don't remove "local" scheme when concatenating > classpath. It is OK to run because classpath is separated with ":" and java > treat "local" as a separate jar. But we could improve it to remove the scheme. > {code} > java.class.path = >
[jira] [Assigned] (SPARK-20365) Not so accurate classpath format for AM and Containers
[ https://issues.apache.org/jira/browse/SPARK-20365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20365: Assignee: Apache Spark > Not so accurate classpath format for AM and Containers > -- > > Key: SPARK-20365 > URL: https://issues.apache.org/jira/browse/SPARK-20365 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > In Spark on YARN, when configuring "spark.yarn.jars" with local jars (jars > using the "local" scheme), we get an inaccurate classpath for the AM and > containers, because we do not remove the "local" scheme when concatenating the > classpath. Jobs still run, since the classpath is separated with ":" and Java > treats "local" as a separate (nonexistent) entry, but we could improve this by > removing the scheme. > {code} > java.class.path = >
[jira] [Commented] (SPARK-20365) Not so accurate classpath format for AM and Containers
[ https://issues.apache.org/jira/browse/SPARK-20365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027350#comment-16027350 ] Apache Spark commented on SPARK-20365: -- User 'liyichao' has created a pull request for this issue: https://github.com/apache/spark/pull/18129 > Not so accurate classpath format for AM and Containers > -- > > Key: SPARK-20365 > URL: https://issues.apache.org/jira/browse/SPARK-20365 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: Saisai Shao >Priority: Minor > > In Spark on YARN, when configuring "spark.yarn.jars" with local jars (jars > using the "local" scheme), we get an inaccurate classpath for the AM and > containers, because we do not remove the "local" scheme when concatenating the > classpath. Jobs still run, since the classpath is separated with ":" and Java > treats "local" as a separate (nonexistent) entry, but we could improve this by > removing the scheme. > {code} > java.class.path = >
[jira] [Assigned] (SPARK-20906) Constrained Logistic Regression for SparkR
[ https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20906: Assignee: Apache Spark > Constrained Logistic Regression for SparkR > -- > > Key: SPARK-20906 > URL: https://issues.apache.org/jira/browse/SPARK-20906 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0, 2.2.1 >Reporter: Miao Wang >Assignee: Apache Spark > > PR https://github.com/apache/spark/pull/17715 added constrained logistic > regression to Spark ML. We should add it to SparkR as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
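For reference, a hedged sketch of the Scala ML API that the SparkR wrapper would expose, assuming the bound-constraint setters added by the linked PR for Spark 2.2; the matrix/vector shapes and the training DataFrame are illustrative:

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Binomial case with 3 features: constrain all coefficients to be non-negative
// and cap the intercept at 1.0. Shapes are (numClasses x numFeatures) / (numClasses).
val lr = new LogisticRegression()
  .setMaxIter(50)
  .setLowerBoundsOnCoefficients(Matrices.dense(1, 3, Array(0.0, 0.0, 0.0)))
  .setUpperBoundsOnIntercepts(Vectors.dense(1.0))

// val model = lr.fit(trainingDF)  // trainingDF: DataFrame with "label" and "features" columns
{code}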
[jira] [Assigned] (SPARK-20906) Constrained Logistic Regression for SparkR
[ https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20906: Assignee: (was: Apache Spark) > Constrained Logistic Regression for SparkR > -- > > Key: SPARK-20906 > URL: https://issues.apache.org/jira/browse/SPARK-20906 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0, 2.2.1 >Reporter: Miao Wang > > PR https://github.com/apache/spark/pull/17715 added constrained logistic > regression to Spark ML. We should add it to SparkR as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20906) Constrained Logistic Regression for SparkR
[ https://issues.apache.org/jira/browse/SPARK-20906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027320#comment-16027320 ] Apache Spark commented on SPARK-20906: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/18128 > Constrained Logistic Regression for SparkR > -- > > Key: SPARK-20906 > URL: https://issues.apache.org/jira/browse/SPARK-20906 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0, 2.2.1 >Reporter: Miao Wang > > PR https://github.com/apache/spark/pull/17715 added constrained logistic > regression to Spark ML. We should add it to SparkR as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20906) Constrained Logistic Regression for SparkR
Miao Wang created SPARK-20906: - Summary: Constrained Logistic Regression for SparkR Key: SPARK-20906 URL: https://issues.apache.org/jira/browse/SPARK-20906 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.2.0, 2.2.1 Reporter: Miao Wang PR https://github.com/apache/spark/pull/17715 added constrained logistic regression to Spark ML. We should add it to SparkR as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20894) Error while checkpointing to HDFS (similar to JIRA SPARK-19268)
[ https://issues.apache.org/jira/browse/SPARK-20894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kant kodali updated SPARK-20894: Attachment: executor2_log executor1_log driver_log Attached driver and executor logs > Error while checkpointing to HDFS (similar to JIRA SPARK-19268) > --- > > Key: SPARK-20894 > URL: https://issues.apache.org/jira/browse/SPARK-20894 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.1 > Environment: Ubuntu, Spark 2.1.1, hadoop 2.7 >Reporter: kant kodali > Attachments: driver_log, executor1_log, executor2_log > > > Dataset<Row> df2 = df1.groupBy(functions.window(df1.col("Timestamp5"), "24 hours", "24 hours"), > df1.col("AppName")).count(); > StreamingQuery query = df2.writeStream().foreach(new KafkaSink()) > .option("checkpointLocation", "/usr/local/hadoop/checkpoint").outputMode("update").start(); > query.awaitTermination(); > For some reason this fails with the error: > ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.IllegalStateException: Error reading delta file > /usr/local/hadoop/checkpoint/state/0/0/1.delta of HDFSStateStoreProvider[id = > (op=0, part=0), dir = /usr/local/hadoop/checkpoint/state/0/0]: > /usr/local/hadoop/checkpoint/state/0/0/1.delta does not exist > I cleared all the checkpoint data in /usr/local/hadoop/checkpoint/ and all > consumer offsets in Kafka on all brokers before running, yet this error still > persists. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
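Not a confirmed fix for this ticket, just a hedged sketch (in Scala, for brevity) of the same query shape with a fully-qualified HDFS URI for {{checkpointLocation}}, so the driver and every executor resolve the state directory against the same filesystem. The namenode address and path are placeholders; {{df2}} and {{KafkaSink}} refer to the reporter's code above.

{code}
// Sketch: same streaming query, but with an explicit hdfs:// checkpoint URI.
val query = df2.writeStream
  .foreach(new KafkaSink())
  .option("checkpointLocation", "hdfs://namenode:8020/usr/local/hadoop/checkpoint")
  .outputMode("update")
  .start()

query.awaitTermination()
{code}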
[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit
[ https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16027270#comment-16027270 ] Dongjoon Hyun commented on SPARK-19372: --- Thank you so much all! > Code generation for Filter predicate including many OR conditions exceeds JVM > method size limit > > > Key: SPARK-19372 > URL: https://issues.apache.org/jira/browse/SPARK-19372 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi >Assignee: Kazuaki Ishizaki > Fix For: 2.2.0, 2.3.0 > > Attachments: wide400cols.csv > > > For the attached CSV file, the code below causes the exception > "org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" > grows beyond 64 KB". > Code: > {code:borderStyle=solid} > val conf = new SparkConf().setMaster("local[1]") > val sqlContext = > SparkSession.builder().config(conf).getOrCreate().sqlContext > val dataframe = > sqlContext > .read > .format("com.databricks.spark.csv") > .load("wide400cols.csv") > val filter = (0 to 399) > .foldLeft(lit(false))((e, index) => > e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}")) > val filtered = dataframe.filter(filter) > filtered.show(100) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
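The merged fix itself is not shown here; below is only one possible workaround sketch for queries like the reproducer above: express the "any column differs from its expected value" check as a single UDF over an array of the columns, so no 400-way OR expression has to be code-generated. Column count and value pattern follow the reproducer; note that null handling differs slightly from the original {{=!=}} semantics.

{code}
import org.apache.spark.sql.functions.{array, col, udf}

// Check all 400 columns inside one UDF instead of building a giant OR predicate.
val cols = dataframe.columns.take(400)
val anyMismatch = udf { (values: Seq[String]) =>
  values.zipWithIndex.exists { case (v, i) => v != s"column${i + 1}" }
}

val filtered = dataframe.filter(anyMismatch(array(cols.map(col): _*)))
filtered.show(100)
{code}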