[jira] [Commented] (SPARK-29683) Job failed due to executor failures all available nodes are blacklisted
[ https://issues.apache.org/jira/browse/SPARK-29683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967188#comment-16967188 ] Steven Rand commented on SPARK-29683:
-------------------------------------

We're experiencing this as well during HA YARN failover.

> Job failed due to executor failures all available nodes are blacklisted
> -----------------------------------------------------------------------
>
> Key: SPARK-29683
> URL: https://issues.apache.org/jira/browse/SPARK-29683
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 3.0.0
> Reporter: Genmao Yu
> Priority: Major
>
> My streaming job fails with *due to executor failures all available nodes are blacklisted*. This exception is thrown only when all nodes are blacklisted:
> {code:java}
> def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
>
> val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ allocatorBlacklist.keySet
> {code}
> After diving into the code, I found some critical conditions that are not handled properly:
> - unchecked `excludeNodes`: this comes from user config. If not set properly, it can lead to "currentBlacklistedYarnNodes.size >= numClusterNodes". For example, we may list nodes that are not in the YARN cluster at all:
> {code:java}
> excludeNodes = (invalid1, invalid2, invalid3)
> clusterNodes = (valid1, valid2)
> {code}
> - `numClusterNodes` may equal 0: during HA YARN failover, it takes some time for all NodeManagers to register with the ResourceManager again. In that window, `numClusterNodes` may be 0 or some other incorrect number, and the Spark driver fails.
> - too strong a condition check: the Spark driver fails as soon as "currentBlacklistedYarnNodes.size >= numClusterNodes". This condition does not necessarily indicate an unrecoverable failure; for example, some NodeManagers may simply be restarting. We could allow some waiting time before failing the job.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
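The third point above (allowing some waiting time before failing the job) can be sketched roughly as follows. This is a minimal illustration, not Spark's actual allocator code; the class name, method shape, and grace-period mechanism are all assumptions:

```java
// Sketch of a grace period before treating "all nodes blacklisted" as fatal.
// Names here (BlacklistGraceSketch, shouldFail) are illustrative, not Spark's.
import java.util.Set;

class BlacklistGraceSketch {
    private final long gracePeriodMs;
    private long firstAllBlacklistedAt = -1;

    BlacklistGraceSketch(long gracePeriodMs) {
        this.gracePeriodMs = gracePeriodMs;
    }

    // Fail only if every known node has stayed blacklisted past the grace
    // period, and never while numClusterNodes is 0 (e.g. while NodeManagers
    // are still re-registering after an RM failover).
    boolean shouldFail(Set<String> blacklisted, int numClusterNodes, long nowMs) {
        boolean allBlacklisted = numClusterNodes > 0 && blacklisted.size() >= numClusterNodes;
        if (!allBlacklisted) {
            firstAllBlacklistedAt = -1;  // condition cleared; reset the timer
            return false;
        }
        if (firstAllBlacklistedAt < 0) {
            firstAllBlacklistedAt = nowMs;  // condition just became true; start timing
        }
        return nowMs - firstAllBlacklistedAt >= gracePeriodMs;
    }

    public static void main(String[] args) {
        BlacklistGraceSketch t = new BlacklistGraceSketch(60_000L);
        System.out.println(t.shouldFail(Set.of("nm1", "nm2"), 0, 0L));      // false: 0 known nodes
        System.out.println(t.shouldFail(Set.of("nm1", "nm2"), 2, 0L));      // false: timer just started
        System.out.println(t.shouldFail(Set.of("nm1", "nm2"), 2, 90_000L)); // true: grace period elapsed
    }
}
```

The key property is that the timer resets whenever the all-blacklisted condition clears, so a transient NodeManager restart does not kill the job.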
[jira] [Updated] (SPARK-27773) Add shuffle service metric for number of exceptions caught in ExternalShuffleBlockHandler
[ https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated SPARK-27773:
--------------------------------

Description:
The health of the external shuffle service is currently difficult to monitor. At least for the YARN shuffle service, the only current indication of health is whether or not the shuffle service threads are running in the NodeManager. However, we've seen that clients can sometimes experience elevated failure rates on requests to the shuffle service even when those threads are running. It would be helpful to have some indication of how often requests to the shuffle service are failing, as this could be monitored, alerted on, etc.

One suggestion (implemented in the PR I'll attach to this ticket) is to add a metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of how many times we caught an exception in the shuffle service's RPC handler. I think that this gives us the insight into request failure rates that we're currently missing, but obviously I'm open to alternatives as well if people have other ideas.

was:
The health of the external shuffle service is currently difficult to monitor. At least for the YARN shuffle service, the only current indication of health is whether or not the shuffle service threads are running in the NodeManager. However, we've seen that clients can sometimes experience elevated failure rates on requests to the shuffle service even when those threads are running. It would be helpful to have some indication of how often requests to the shuffle service are failing, as this could be monitored, alerted on, etc.

One suggestion (implemented in the PR I'll attach to this ticket) is to add a metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of how many times we called {{TransportChannelHandler#exceptionCaught}}. I think that this gives us the insight into request failure rates that we're currently missing, but obviously I'm open to alternatives as well if people have other ideas.

Summary: Add shuffle service metric for number of exceptions caught in ExternalShuffleBlockHandler (was: Add shuffle service metric for number of exceptions caught in TransportChannelHandler)

> Add shuffle service metric for number of exceptions caught in ExternalShuffleBlockHandler
> -----------------------------------------------------------------------------------------
>
> Key: SPARK-27773
> URL: https://issues.apache.org/jira/browse/SPARK-27773
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.4.3
> Reporter: Steven Rand
> Priority: Minor

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
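The proposed metric can be illustrated with a small, self-contained sketch. Spark's real {{ExternalShuffleBlockHandler.ShuffleMetrics}} uses Dropwizard counters; the hand-rolled {{AtomicLong}} map and the names {{RpcHandlerMetricsSketch}} / {{onExceptionCaught}} below are assumptions for illustration only:

```java
// Sketch of an exception-count metric registered by name, as proposed above.
// Hand-rolled stand-in for Dropwizard metrics; names are illustrative.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class RpcHandlerMetricsSketch {
    private final AtomicLong numCaughtExceptions = new AtomicLong();
    private final Map<String, AtomicLong> allMetrics = new HashMap<>();

    RpcHandlerMetricsSketch() {
        // Register the counter by name so a metrics reporter can poll it.
        allMetrics.put("numCaughtExceptions", numCaughtExceptions);
    }

    // Called from the RPC handler's catch block each time a request fails.
    void onExceptionCaught(Throwable t) {
        numCaughtExceptions.incrementAndGet();
    }

    long value(String name) {
        return allMetrics.get(name).get();
    }

    public static void main(String[] args) {
        RpcHandlerMetricsSketch metrics = new RpcHandlerMetricsSketch();
        metrics.onExceptionCaught(new RuntimeException("block fetch failed"));
        metrics.onExceptionCaught(new RuntimeException("block fetch failed"));
        System.out.println(metrics.value("numCaughtExceptions")); // prints 2
    }
}
```

A monitoring system polling this value over time yields the request failure rate the ticket asks for.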
[jira] [Commented] (SPARK-27797) Shuffle service metric "registeredConnections" not tracked correctly
[ https://issues.apache.org/jira/browse/SPARK-27797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845535#comment-16845535 ] Steven Rand commented on SPARK-27797:
-------------------------------------

I think it might be okay -- ExternalShuffleService.start() does:
{code}
blockHandler.getAllMetrics.getMetrics.put("numRegisteredConnections", server.getRegisteredConnections)
{code}
which I think makes the result correct (though in any case, agreed that the structure is confusing).

> Shuffle service metric "registeredConnections" not tracked correctly
> --------------------------------------------------------------------
>
> Key: SPARK-27797
> URL: https://issues.apache.org/jira/browse/SPARK-27797
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 3.0.0
> Reporter: Marcelo Vanzin
> Priority: Minor
>
> In {{ExternalShuffleBlockHandler}}:
> {code}
> // Number of registered connections to the shuffle service
> private Counter registeredConnections = new Counter();
>
> public ShuffleMetrics() {
>   ...
>   allMetrics.put("numRegisteredConnections", registeredConnections);
> }
> {code}
> But the counter that's actually updated is in {{TransportContext}}. The call site is in {{TransportChannelHandler}}:
> {code}
> @Override
> public void channelRegistered(ChannelHandlerContext ctx) throws Exception {
>   transportContext.getRegisteredConnections().inc();
>   super.channelRegistered(ctx);
> }
>
> @Override
> public void channelUnregistered(ChannelHandlerContext ctx) throws Exception {
>   transportContext.getRegisteredConnections().dec();
>   super.channelUnregistered(ctx);
> }
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
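The situation described in this ticket, and the start()-time fix referenced in the comment, can be modeled with a small sketch. All names below are simplified stand-ins for the Spark classes involved, and {{AtomicLong}} replaces the Dropwizard {{Counter}}:

```java
// Sketch of the bug: the counter registered in the metrics map is not the
// instance being incremented, and a later put() repoints the map at the live
// counter. Names are illustrative, not Spark's actual API.
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class RegisteredConnectionsSketch {
    // Counter created by the handler's metrics (never incremented).
    final AtomicLong handlerCounter = new AtomicLong();
    // Counter the transport layer actually increments per channel.
    final AtomicLong transportCounter = new AtomicLong();
    final Map<String, AtomicLong> allMetrics = new HashMap<>();

    RegisteredConnectionsSketch() {
        // Bug: the map entry points at the counter that never changes.
        allMetrics.put("numRegisteredConnections", handlerCounter);
    }

    void channelRegistered() {
        transportCounter.incrementAndGet();
    }

    // Equivalent of the put() in ExternalShuffleService.start(): overwrite the
    // entry so the reported metric tracks the live counter.
    void repointMetric() {
        allMetrics.put("numRegisteredConnections", transportCounter);
    }

    public static void main(String[] args) {
        RegisteredConnectionsSketch s = new RegisteredConnectionsSketch();
        s.channelRegistered();
        System.out.println(s.allMetrics.get("numRegisteredConnections").get()); // prints 0: stale entry
        s.repointMetric();
        s.channelRegistered();
        System.out.println(s.allMetrics.get("numRegisteredConnections").get()); // prints 2: live counter
    }
}
```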
[jira] [Created] (SPARK-27773) Add shuffle service metric for number of exceptions caught in TransportChannelHandler
Steven Rand created SPARK-27773:
-----------------------------------

Summary: Add shuffle service metric for number of exceptions caught in TransportChannelHandler
Key: SPARK-27773
URL: https://issues.apache.org/jira/browse/SPARK-27773
Project: Spark
Issue Type: Improvement
Components: Shuffle
Affects Versions: 2.4.3
Reporter: Steven Rand

The health of the external shuffle service is currently difficult to monitor. At least for the YARN shuffle service, the only current indication of health is whether or not the shuffle service threads are running in the NodeManager. However, we've seen that clients can sometimes experience elevated failure rates on requests to the shuffle service even when those threads are running. It would be helpful to have some indication of how often requests to the shuffle service are failing, as this could be monitored, alerted on, etc.

One suggestion (implemented in the PR I'll attach to this ticket) is to add a metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of how many times we called {{TransportChannelHandler#exceptionCaught}}. I think that this gives us the insight into request failure rates that we're currently missing, but obviously I'm open to alternatives as well if people have other ideas.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637630#comment-16637630 ] Steven Rand commented on SPARK-25538:
-------------------------------------

Thanks all!

> incorrect row counts after distinct()
> -------------------------------------
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij on OS X.
> Reporter: Steven Rand
> Assignee: Marco Gaido
> Priority: Blocker
> Labels: correctness
> Fix For: 2.4.0
>
> Attachments: SPARK-25538-repro.tgz
>
> It appears that {{df.distinct.count}} can return incorrect values after SPARK-23713. It's possible that other operations are affected as well; {{distinct}} just happens to be the one that we noticed. I believe that this issue was introduced by SPARK-23713 because I can't reproduce it until that commit, and I've been able to reproduce it after that commit as well as with {{tags/v2.4.0-rc1}}.
> Below are example spark-shell sessions to illustrate the problem. Unfortunately the data used in these examples can't be uploaded to this Jira ticket. I'll try to create test data which also reproduces the issue, and will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
>
> scala> df.count
> res0: Long = 123
>
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
>
> scala> df.count
> res0: Long = 123
>
> scala> df.distinct.count
> res1: Long = 116
>
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
>
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633254#comment-16633254 ] Steven Rand commented on SPARK-25538:
-------------------------------------

[~kiszk] I've uploaded a tarball containing parquet files that reproduce the issue but don't contain any of the values in the original dataset. Specifically, some columns have been dropped, all strings have been changed to "test_string", all values in col_50 have been changed to 0.0043, and the values in col_14 have all been mapped from their original values to values between 0.001 and 0.0044. This new DataFrame still reproduces issues similar to those in the description:
{code:java}
scala> df.distinct.count
res3: Long = 64

scala> df.sort("col_0").distinct.count
res4: Long = 73

scala> df.withColumnRenamed("col_0", "new").distinct.count
res5: Long = 63
{code}
I get those inconsistent/wrong results on {{2.4.0-rc2}} and if I check out commit {{a7c19d9c21d59fd0109a7078c80b33d3da03fafd}}, which is SPARK-23713. If I check out the commit immediately before, which is {{fe2b7a4568d65a62da6e6eb00fff05f248b4332c}}, then all three commands return 63.

cc [~cloud_fan] – IMO this should block the 2.4.0 release.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Rand updated SPARK-25538:
--------------------------------

Attachment: SPARK-25538-repro.tgz

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632729#comment-16632729 ] Steven Rand commented on SPARK-25538:
-------------------------------------

[~kiszk] that makes sense, I'll try to do so. The issue I've been having so far is that when I run the UDF I've written to change the data (while preserving the number of duplicate rows), the resulting DataFrame doesn't reproduce the issue.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629561#comment-16629561 ] Steven Rand commented on SPARK-25538:
-------------------------------------

[~kiszk], yes, the schema is:
{code}
scala> spark.read.parquet("hdfs:///data").printSchema
root
 |-- col_0: string (nullable = true)
 |-- col_1: timestamp (nullable = true)
 |-- col_2: string (nullable = true)
 |-- col_3: timestamp (nullable = true)
 |-- col_4: string (nullable = true)
 |-- col_5: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_6: string (nullable = true)
 |-- col_7: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
 |-- col_8: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_9: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
 |-- col_10: string (nullable = true)
 |-- col_11: timestamp (nullable = true)
 |-- col_12: integer (nullable = true)
 |-- col_13: boolean (nullable = true)
 |-- col_14: decimal(38,18) (nullable = true)
 |-- col_15: long (nullable = true)
 |-- col_16: string (nullable = true)
 |-- col_17: integer (nullable = true)
 |-- col_18: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_19: string (nullable = true)
 |-- col_20: string (nullable = true)
 |-- col_21: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_22: string (nullable = true)
 |-- col_23: array (nullable = true)
 |    |-- element: timestamp (containsNull = true)
 |-- col_24: string (nullable = true)
 |-- col_25: string (nullable = true)
 |-- col_26: string (nullable = true)
 |-- col_27: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_28: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_29: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_30: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_31: decimal(38,18) (nullable = true)
 |-- col_32: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_33: string (nullable = true)
 |-- col_34: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
 |-- col_35: decimal(38,18) (nullable = true)
 |-- col_36: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_37: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
 |-- col_38: decimal(38,18) (nullable = true)
 |-- col_39: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_40: string (nullable = true)
 |-- col_41: string (nullable = true)
 |-- col_42: string (nullable = true)
 |-- col_43: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_44: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_45: string (nullable = true)
 |-- col_46: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_47: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
 |-- col_48: string (nullable = true)
 |-- col_49: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_50: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
 |-- col_51: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col_52: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
 |-- col_53: string (nullable = true)
 |-- col_54: decimal(38,18) (nullable = true)
 |-- col_55: decimal(38,18) (nullable = true)
 |-- col_56: decimal(38,18) (nullable = true)
 |-- col_57: array (nullable = true)
 |    |-- element: decimal(38,18) (containsNull = true)
{code}
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628449#comment-16628449 ] Steven Rand commented on SPARK-25538:
-------------------------------------

Hi [~cloud_fan], I'm still trying to create a scrubbed version of the file. It's proving to be difficult since mutating the original DataFrame often prevents the issue from reproducing (e.g., the example in the description where calling {{withColumnRenamed}} before {{distinct.count}} leads to the correct result being printed).

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SPARK-25538) incorrect row counts after distinct()
Steven Rand created SPARK-25538:
-----------------------------------

Summary: incorrect row counts after distinct()
Key: SPARK-25538
URL: https://issues.apache.org/jira/browse/SPARK-25538
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Environment: Reproduced on a Centos7 VM and from source in Intellij on OS X.
Reporter: Steven Rand

It appears that {{df.distinct.count}} can return incorrect values after SPARK-23713. It's possible that other operations are affected as well; {{distinct}} just happens to be the one that we noticed. I believe that this issue was introduced by SPARK-23713 because I can't reproduce it until that commit, and I've been able to reproduce it after that commit as well as with {{tags/v2.4.0-rc1}}.

Below are example spark-shell sessions to illustrate the problem. Unfortunately the data used in these examples can't be uploaded to this Jira ticket. I'll try to create test data which also reproduces the issue, and will upload that if I'm able to do so.

Example from Spark 2.3.1, which behaves correctly:
{code}
scala> val df = spark.read.parquet("hdfs:///data")
df: org.apache.spark.sql.DataFrame = []

scala> df.count
res0: Long = 123

scala> df.distinct.count
res1: Long = 115
{code}
Example from Spark 2.4.0-rc1, which returns different output:
{code}
scala> val df = spark.read.parquet("hdfs:///data")
df: org.apache.spark.sql.DataFrame = []

scala> df.count
res0: Long = 123

scala> df.distinct.count
res1: Long = 116

scala> df.sort("col_0").distinct.count
res2: Long = 123

scala> df.withColumnRenamed("col_0", "newName").distinct.count
res3: Long = 115
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-25294) Add integration test for Kerberos
[ https://issues.apache.org/jira/browse/SPARK-25294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598366#comment-16598366 ] Steven Rand commented on SPARK-25294:
-------------------------------------

+1 -- another example of how easy it is to break the krb integration w/o noticing is https://issues.apache.org/jira/browse/SPARK-22319

> Add integration test for Kerberos
> ---------------------------------
>
> Key: SPARK-25294
> URL: https://issues.apache.org/jira/browse/SPARK-25294
> Project: Spark
> Issue Type: Test
> Components: Tests
> Affects Versions: 2.4.0
> Reporter: Yuming Wang
> Priority: Major
>
> Some changes may cause Kerberos issues, such as {{Yarn}}, {{Hive}}, {{HDFS}}. We should add tests.
> https://issues.apache.org/jira/browse/SPARK-23789
> https://github.com/apache/spark/pull/21987#issuecomment-417560077

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab
Steven Rand created SPARK-22319:
-----------------------------------

Summary: SparkSubmit calls getFileStatus before calling loginUserFromKeytab
Key: SPARK-22319
URL: https://issues.apache.org/jira/browse/SPARK-22319
Project: Spark
Issue Type: Bug
Components: Deploy, Spark Core
Affects Versions: 2.3.0
Reporter: Steven Rand

In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls {{getFileStatus}}, which for HDFS is an RPC call to the NameNode: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346. We do this before we call {{loginUserFromKeytab}}, which is further down in the same method: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.

The result is that the call to {{resolveGlobPaths}} fails in secure clusters with:
{code}
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
{code}
A workaround is to {{kinit}} on the host before using spark-submit. However, it's better if this workaround isn't necessary. A simple fix is to call {{loginUserFromKeytab}} before attempting to interact with HDFS.

At least for cluster mode, this would appear to be a regression caused by SPARK-21012.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
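The proposed fix is purely an ordering change, which a toy model can illustrate. This is a hedged sketch with made-up names, not SparkSubmit's actual code: the login flag stands in for acquiring Kerberos credentials, and the thrown exception mimics the GSS failure above.

```java
// Toy model of the ordering bug: filesystem RPCs before keytab login fail on
// a secure cluster; logging in first succeeds. All names are illustrative.
import java.util.List;

class SubmitOrderSketch {
    private boolean loggedIn = false;

    void loginUserFromKeytab() {
        loggedIn = true; // stands in for acquiring a Kerberos TGT from the keytab
    }

    List<String> resolveGlobPaths(String pattern) {
        if (!loggedIn) {
            // Mirrors the SaslException/GSSException seen without credentials.
            throw new IllegalStateException("GSS initiate failed: no valid credentials provided");
        }
        return List.of(pattern); // pretend the glob matched itself
    }

    public static void main(String[] args) {
        SubmitOrderSketch submit = new SubmitOrderSketch();
        try {
            submit.resolveGlobPaths("hdfs:///apps/jars/*.jar"); // buggy order: FS call first
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
        submit.loginUserFromKeytab();                           // fixed order: login first
        System.out.println(submit.resolveGlobPaths("hdfs:///apps/jars/*.jar"));
    }
}
```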
[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support
[ https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982587#comment-15982587 ] Steven Rand commented on SPARK-7481:
------------------------------------

What happened to https://github.com/apache/spark/pull/12004? It doesn't look like there were any concrete objections to the changes made there -- was it just closed for lack of a reviewer?

As someone who has spent several tens of hours (and counting!) debugging classpath issues for Spark applications that read from and write to an object store, I think this change is hugely valuable. I suspect that the large number of votes and watchers indicates that others think this as well, so it'd be pretty depressing if it didn't happen just because no one will review the patch. Unfortunately I'm not qualified to review it myself, but I'd be quite grateful if someone more competent were to do so.

> Add spark-hadoop-cloud module to pull in object store support
> -------------------------------------------------------------
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
> Issue Type: Improvement
> Components: Build
> Affects Versions: 2.1.0
> Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies of spark in a 2.6+ profile need to add the relevant object store packages (hadoop-aws, hadoop-openstack, hadoop-azure)

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (SPARK-18364) expose metrics for YarnShuffleService
[ https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649506#comment-15649506 ]

Steven Rand commented on SPARK-18364:
-------------------------------------

The solution I'd initially had in mind, simply creating a {{MetricsSystem}} in {{YarnShuffleService}}, isn't ideal, since then {{network-yarn}} has to depend on {{core}}. I tried splitting the metrics code currently in {{core}} into its own module, but this is pretty tough given how much other code from {{core}} the metrics code depends on. And the {{metrics}} module can't depend on {{core}}, since that would create a circular dependency ({{core}} has to depend on {{metrics}}).

Another idea might be to get the metrics out of the executors instead of from {{ExternalShuffleBlockHandler}}, though this also seems tricky. Open to ideas on how to proceed here if anyone has them.

> expose metrics for YarnShuffleService
> -------------------------------------
>
>                 Key: SPARK-18364
>                 URL: https://issues.apache.org/jira/browse/SPARK-18364
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 2.0.1
>            Reporter: Steven Rand
>            Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in SPARK-16405, so this JIRA is for creating a MetricsSystem in YarnShuffleService similarly to how ExternalShuffleService already does it.
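For reference, the "simply creating a MetricsSystem" approach being weighed above would look roughly like what ExternalShuffleService does after SPARK-16405. This is an approximate sketch, not the actual code: the registration call is indicative only, and the core classes used here ({{MetricsSystem}}, {{SecurityManager}}) are exactly the dependency on {{core}} that makes this awkward for {{network-yarn}}:

```scala
import org.apache.spark.{SecurityManager, SparkConf}
import org.apache.spark.metrics.MetricsSystem

// Approximate sketch of the straightforward approach: create a MetricsSystem
// the way ExternalShuffleService does, start it, and register the shuffle
// block handler's metrics as a source. Pulling in MetricsSystem and
// SecurityManager is what forces network-yarn to depend on core.
val sparkConf = new SparkConf()
val securityManager = new SecurityManager(sparkConf)
val metricsSystem = MetricsSystem.createMetricsSystem(
  "shuffleService", sparkConf, securityManager)
// Hypothetical registration of the handler's metrics (the instrumentation
// added to ExternalShuffleBlockHandler in SPARK-16405):
// metricsSystem.registerSource(blockHandler.getAllMetrics)
metricsSystem.start()
```

Whatever the eventual design, the open question in this comment is where such a system can live without either duplicating {{core}}'s metrics code or creating the circular {{core}} <-> {{metrics}} dependency.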
[jira] [Commented] (SPARK-18364) expose metrics for YarnShuffleService
[ https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648785#comment-15648785 ]

Steven Rand commented on SPARK-18364:
-------------------------------------

I can implement this if people think it makes sense.
[jira] [Created] (SPARK-18364) expose metrics for YarnShuffleService
Steven Rand created SPARK-18364:
-----------------------------------

             Summary: expose metrics for YarnShuffleService
                 Key: SPARK-18364
                 URL: https://issues.apache.org/jira/browse/SPARK-18364
             Project: Spark
          Issue Type: Improvement
          Components: YARN
    Affects Versions: 2.0.1
            Reporter: Steven Rand
            Priority: Minor


ExternalShuffleService exposes metrics as of SPARK-16405. However, YarnShuffleService does not.

The work of instrumenting ExternalShuffleBlockHandler was already done in SPARK-16405, so this JIRA is for creating a MetricsSystem in YarnShuffleService similarly to how ExternalShuffleService already does it.