[jira] [Commented] (SPARK-29683) Job failed due to executor failures all available nodes are blacklisted

2019-11-04 Thread Steven Rand (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967188#comment-16967188
 ] 

Steven Rand commented on SPARK-29683:
-

We're experiencing this as well during HA YARN failover.

> Job failed due to executor failures all available nodes are blacklisted
> ---
>
> Key: SPARK-29683
> URL: https://issues.apache.org/jira/browse/SPARK-29683
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Genmao Yu
>Priority: Major
>
> My streaming job fails with the error *due to executor failures all available 
> nodes are blacklisted*. This exception is thrown only when all nodes are 
> blacklisted:
> {code:java}
> def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
> val allBlacklistedNodes = excludeNodes ++ schedulerBlacklist ++ allocatorBlacklist.keySet
> {code}
> After diving into the code, I found some critical conditions that are not 
> handled properly:
>  - unchecked `excludeNodes`: this comes from user configuration. If set 
> incorrectly, it can lead to "currentBlacklistedYarnNodes.size >= 
> numClusterNodes". For example, the excluded nodes may not even exist in the 
> Yarn cluster:
> {code:java}
> excludeNodes = (invalid1, invalid2, invalid3)
> clusterNodes = (valid1, valid2)
> {code}
>  - `numClusterNodes` may equal 0: during HA Yarn failover, it takes some time 
> for all NodeManagers to register with the ResourceManager again. During this 
> window, `numClusterNodes` may be 0 or some other small number, and the Spark 
> driver fails.
>  - overly strict condition check: the Spark driver fails as soon as 
> "currentBlacklistedYarnNodes.size >= numClusterNodes". This condition does not 
> necessarily indicate an unrecoverable failure; for example, some NodeManagers 
> may simply be restarting. We could allow some waiting time before failing the 
> job.
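
As a rough illustration of the waiting-time idea above (a sketch only, not the 
actual YarnAllocatorBlacklistTracker change; the timeout value and helper names 
are hypothetical):

{code}
// Sketch only: tolerate the "all nodes blacklisted" condition for a grace
// period instead of failing immediately, since numClusterNodes can temporarily
// drop during an RM failover while NodeManagers re-register.
private val allBlacklistedTimeoutMs = 120000L          // hypothetical config value
private var allBlacklistedSince: Option[Long] = None   // when the condition first held

def shouldFailApplication(now: Long = System.currentTimeMillis()): Boolean = {
  if (isAllNodeBlacklisted && numClusterNodes > 0) {
    if (allBlacklistedSince.isEmpty) {
      allBlacklistedSince = Some(now)
    }
    // Only fail once the condition has persisted for the full grace period.
    now - allBlacklistedSince.get >= allBlacklistedTimeoutMs
  } else {
    // Condition cleared (or the RM has not yet reported any nodes); reset.
    allBlacklistedSince = None
    false
  }
}
{code}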






[jira] [Updated] (SPARK-27773) Add shuffle service metric for number of exceptions caught in ExternalShuffleBlockHandler

2019-05-22 Thread Steven Rand (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated SPARK-27773:

Description: 
The health of the external shuffle service is currently difficult to monitor. 
At least for the YARN shuffle service, the only current indication of health is 
whether or not the shuffle service threads are running in the NodeManager. 
However, we've seen that clients can sometimes experience elevated failure 
rates on requests to the shuffle service even when those threads are running. 
It would be helpful to have some indication of how often requests to the 
shuffle service are failing, as this could be monitored, alerted on, etc.

One suggestion (implemented in the PR I'll attach to this ticket) is to add a 
metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of 
how many times we caught an exception in the shuffle service's RPC handler. I 
think that this gives us the insight into request failure rates that we're 
currently missing, but obviously I'm open to alternatives as well if people 
have other ideas.

  was:
The health of the external shuffle service is currently difficult to monitor. 
At least for the YARN shuffle service, the only current indication of health is 
whether or not the shuffle service threads are running in the NodeManager. 
However, we've seen that clients can sometimes experience elevated failure 
rates on requests to the shuffle service even when those threads are running. 
It would be helpful to have some indication of how often requests to the 
shuffle service are failing, as this could be monitored, alerted on, etc.

One suggestion (implemented in the PR I'll attach to this ticket) is to add a 
metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of 
how many times we called {{TransportChannelHandler#exceptionCaught}}. I think 
that this gives us the insight into request failure rates that we're currently 
missing, but obviously I'm open to alternatives as well if people have other 
ideas.

Summary: Add shuffle service metric for number of exceptions caught in 
ExternalShuffleBlockHandler  (was: Add shuffle service metric for number of 
exceptions caught in TransportChannelHandler)

> Add shuffle service metric for number of exceptions caught in 
> ExternalShuffleBlockHandler
> -
>
> Key: SPARK-27773
> URL: https://issues.apache.org/jira/browse/SPARK-27773
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.3
>Reporter: Steven Rand
>Priority: Minor
>
> The health of the external shuffle service is currently difficult to monitor. 
> At least for the YARN shuffle service, the only current indication of health 
> is whether or not the shuffle service threads are running in the NodeManager. 
> However, we've seen that clients can sometimes experience elevated failure 
> rates on requests to the shuffle service even when those threads are running. 
> It would be helpful to have some indication of how often requests to the 
> shuffle service are failing, as this could be monitored, alerted on, etc.
> One suggestion (implemented in the PR I'll attach to this ticket) is to add a 
> metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of 
> how many times we caught an exception in the shuffle service's RPC handler. I 
> think that this gives us the insight into request failure rates that we're 
> currently missing, but obviously I'm open to alternatives as well if people 
> have other ideas.
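
As a rough sketch of the proposed counter (illustrative only; the concrete 
change is in the PR attached to this ticket, and the metric name and wiring 
below are assumptions):

{code}
// Illustrative Scala sketch of the idea (the real handler class is Java and the
// actual change is in the attached PR); the metric name is an assumption.
import java.util.{HashMap => JHashMap, Map => JMap}

import com.codahale.metrics.{Counter, Metric, MetricSet}

class ShuffleMetricsSketch extends MetricSet {
  // Incremented whenever the shuffle service's RPC handler catches an exception
  // while processing a request.
  val caughtExceptions = new Counter()

  private val allMetrics = new JHashMap[String, Metric]()
  allMetrics.put("caughtExceptions", caughtExceptions)

  override def getMetrics: JMap[String, Metric] = allMetrics
}

// In the RPC handler's receive path, something along the lines of:
//   try { handleMessage(msgObj, client, callback) }
//   catch { case e: Exception => metrics.caughtExceptions.inc(); throw e }
{code}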






[jira] [Commented] (SPARK-27797) Shuffle service metric "registeredConnections" not tracked correctly

2019-05-21 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845535#comment-16845535
 ] 

Steven Rand commented on SPARK-27797:
-

I think it might be okay -- ExternalShuffleService.start() does:

{code}
blockHandler.getAllMetrics.getMetrics.put("numRegisteredConnections",
  server.getRegisteredConnections)
{code}

Which I think makes the result correct (though in any case, agreed that the 
structure is confusing).
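
For reference, a minimal sketch of that wiring, mirroring the 
{{ExternalShuffleService.start()}} snippet above (illustrative only, not the 
actual Spark source; {{transportConf}}, {{registeredExecutorFile}}, {{port}}, 
and {{bootstraps}} are assumed to be defined by the surrounding service):

{code}
// Sketch only: after the TransportServer is created, replace the handler's own
// (never-incremented) "numRegisteredConnections" counter with the counter that
// TransportChannelHandler actually increments and decrements, so the reported
// metric tracks real channel registrations.
val blockHandler = new ExternalShuffleBlockHandler(transportConf, registeredExecutorFile)
val transportContext = new TransportContext(transportConf, blockHandler)
val server = transportContext.createServer(port, bootstraps)

blockHandler.getAllMetrics.getMetrics
  .put("numRegisteredConnections", server.getRegisteredConnections)
{code}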

> Shuffle service metric "registeredConnections" not tracked correctly
> 
>
> Key: SPARK-27797
> URL: https://issues.apache.org/jira/browse/SPARK-27797
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> In {{ExternalShuffleBlockHandler}}:
> {code}
> // Number of registered connections to the shuffle service
> private Counter registeredConnections = new Counter();
> public ShuffleMetrics() {
>   ...
>   allMetrics.put("numRegisteredConnections", registeredConnections);
> }
> {code}
> But the counter that's actually updated is in {{TransportContext}}. The call 
> site is in {{TransportChannelHandler}}:
> {code}
>   @Override
>   public void channelRegistered(ChannelHandlerContext ctx) throws Exception {
> transportContext.getRegisteredConnections().inc();
> super.channelRegistered(ctx);
>   }
>   @Override
>   public void channelUnregistered(ChannelHandlerContext ctx) throws Exception {
> transportContext.getRegisteredConnections().dec();
> super.channelUnregistered(ctx);
>   }
> {code}






[jira] [Created] (SPARK-27773) Add shuffle service metric for number of exceptions caught in TransportChannelHandler

2019-05-19 Thread Steven Rand (JIRA)
Steven Rand created SPARK-27773:
---

 Summary: Add shuffle service metric for number of exceptions 
caught in TransportChannelHandler
 Key: SPARK-27773
 URL: https://issues.apache.org/jira/browse/SPARK-27773
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.4.3
Reporter: Steven Rand


The health of the external shuffle service is currently difficult to monitor. 
At least for the YARN shuffle service, the only current indication of health is 
whether or not the shuffle service threads are running in the NodeManager. 
However, we've seen that clients can sometimes experience elevated failure 
rates on requests to the shuffle service even when those threads are running. 
It would be helpful to have some indication of how often requests to the 
shuffle service are failing, as this could be monitored, alerted on, etc.

One suggestion (implemented in the PR I'll attach to this ticket) is to add a 
metric to {{ExternalShuffleBlockHandler.ShuffleMetrics}} which keeps track of 
how many times we called {{TransportChannelHandler#exceptionCaught}}. I think 
that this gives us the insight into request failure rates that we're currently 
missing, but obviously I'm open to alternatives as well if people have other 
ideas.






[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-03 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637630#comment-16637630
 ] 

Steven Rand commented on SPARK-25538:
-

Thanks all!

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Assignee: Marco Gaido
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.4.0
>
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}






[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-29 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633254#comment-16633254
 ] 

Steven Rand commented on SPARK-25538:
-

[~kiszk] I've uploaded a tarball containing parquet files that reproduce the 
issue but don't contain any of the values in the original dataset. 
Specifically, some columns have been dropped, all strings have been changed to 
"test_string", all values in col_50 have been changed to 0.0043, and the values 
in col_14 have all been mapped from their original values to values between 
0.001 and 0.0044.
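
As a rough sketch of the kind of scrubbing described above (not the exact code 
used; only a few representative columns are shown, the dropped columns are 
hypothetical, and the col_14 remapping here is illustrative rather than the 
actual mapping):

{code}
import org.apache.spark.sql.functions._

val scrubbed = df
  .drop("col_19", "col_20")                    // hypothetical dropped columns
  .withColumn("col_0", lit("test_string"))     // string columns replaced with a constant
  // Replace every element of the col_50 array with the same constant decimal.
  .withColumn("col_50", expr("transform(col_50, x -> cast(0.0043 as decimal(38,18)))"))
  // Deterministic remapping of col_14 so equal inputs stay equal (preserving
  // duplicate rows), squeezed into a small range of values.
  .withColumn("col_14",
    ((abs(hash(col("col_14"))) % 44) / lit(10000.0) + lit(0.0001)).cast("decimal(38,18)"))
{code}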

This new DataFrame still reproduces issues similar to those in the description:
{code:java}
scala> df.distinct.count
res3: Long = 64

scala> df.sort("col_0").distinct.count
res4: Long = 73

scala> df.withColumnRenamed("col_0", "new").distinct.count
res5: Long = 63
{code}
I get those inconsistent/wrong results on {{2.4.0-rc2}} and if I check out 
commit {{a7c19d9c21d59fd0109a7078c80b33d3da03fafd}}, which is SPARK-23713. If I 
check out the commit immediately before, which is 
{{fe2b7a4568d65a62da6e6eb00fff05f248b4332c}}, then all three commands return 63.

cc [~cloud_fan] – IMO this should block the 2.4.0 release.

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}






[jira] [Updated] (SPARK-25538) incorrect row counts after distinct()

2018-09-29 Thread Steven Rand (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Rand updated SPARK-25538:

Attachment: SPARK-25538-repro.tgz

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}






[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-28 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16632729#comment-16632729
 ] 

Steven Rand commented on SPARK-25538:
-

[~kiszk] that makes sense, I'll try to do so. The issue I've been having so far 
is that when I run the UDF I've written to change the data (while preserving 
number of duplicate rows), the resulting DataFrame doesn't reproduce the issue.

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}






[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629561#comment-16629561
 ] 

Steven Rand commented on SPARK-25538:
-

[~kiszk], yes, the schema is:

 
{code}
scala> spark.read.parquet("hdfs:///data").printSchema
root
 |-- col_0: string (nullable = true)
 |-- col_1: timestamp (nullable = true)
 |-- col_2: string (nullable = true)
 |-- col_3: timestamp (nullable = true)
 |-- col_4: string (nullable = true)
 |-- col_5: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_6: string (nullable = true)
 |-- col_7: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_8: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_9: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_10: string (nullable = true)
 |-- col_11: timestamp (nullable = true)
 |-- col_12: integer (nullable = true)
 |-- col_13: boolean (nullable = true)
 |-- col_14: decimal(38,18) (nullable = true)
 |-- col_15: long (nullable = true)
 |-- col_16: string (nullable = true)
 |-- col_17: integer (nullable = true)
 |-- col_18: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_19: string (nullable = true)
 |-- col_20: string (nullable = true)
 |-- col_21: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_22: string (nullable = true)
 |-- col_23: array (nullable = true)
 ||-- element: timestamp (containsNull = true)
 |-- col_24: string (nullable = true)
 |-- col_25: string (nullable = true)
 |-- col_26: string (nullable = true)
 |-- col_27: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_28: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_29: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_30: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_31: decimal(38,18) (nullable = true)
 |-- col_32: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_33: string (nullable = true)
 |-- col_34: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_35: decimal(38,18) (nullable = true)
 |-- col_36: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_37: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_38: decimal(38,18) (nullable = true)
 |-- col_39: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_40: string (nullable = true)
 |-- col_41: string (nullable = true)
 |-- col_42: string (nullable = true)
 |-- col_43: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_44: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_45: string (nullable = true)
 |-- col_46: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_47: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_48: string (nullable = true)
 |-- col_49: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_50: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_51: array (nullable = true)
 ||-- element: string (containsNull = true)
 |-- col_52: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
 |-- col_53: string (nullable = true)
 |-- col_54: decimal(38,18) (nullable = true)
 |-- col_55: decimal(38,18) (nullable = true)
 |-- col_56: decimal(38,18) (nullable = true)
 |-- col_57: array (nullable = true)
 ||-- element: decimal(38,18) (containsNull = true)
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.

[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16628449#comment-16628449
 ] 

Steven Rand commented on SPARK-25538:
-

Hi [~cloud_fan], I'm still trying to create a scrubbed version of the file. 
It's proving to be difficult since mutating the original DataFrame often 
prevents the issue from reproducing (e.g., the example in the description where 
calling {{withColumnRenamed}} before {{distinct.count}} leads to the correct 
result being printed).

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}






[jira] [Created] (SPARK-25538) incorrect row counts after distinct()

2018-09-25 Thread Steven Rand (JIRA)
Steven Rand created SPARK-25538:
---

 Summary: incorrect row counts after distinct()
 Key: SPARK-25538
 URL: https://issues.apache.org/jira/browse/SPARK-25538
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: Reproduced on a Centos7 VM and from source in Intellij on 
OS X.
Reporter: Steven Rand


It appears that {{df.distinct.count}} can return incorrect values after 
SPARK-23713. It's possible that other operations are affected as well; 
{{distinct}} just happens to be the one that we noticed. I believe that this 
issue was introduced by SPARK-23713 because I can't reproduce it until that 
commit, and I've been able to reproduce it after that commit as well as with 
{{tags/v2.4.0-rc1}}. 

Below are example spark-shell sessions to illustrate the problem. Unfortunately 
the data used in these examples can't be uploaded to this Jira ticket. I'll try 
to create test data which also reproduces the issue, and will upload that if 
I'm able to do so.

Example from Spark 2.3.1, which behaves correctly:

{code}
scala> val df = spark.read.parquet("hdfs:///data")
df: org.apache.spark.sql.DataFrame = []

scala> df.count
res0: Long = 123

scala> df.distinct.count
res1: Long = 115
{code}

Example from Spark 2.4.0-rc1, which returns different output:

{code}
scala> val df = spark.read.parquet("hdfs:///data")
df: org.apache.spark.sql.DataFrame = []

scala> df.count
res0: Long = 123

scala> df.distinct.count
res1: Long = 116

scala> df.sort("col_0").distinct.count
res2: Long = 123

scala> df.withColumnRenamed("col_0", "newName").distinct.count
res3: Long = 115
{code}






[jira] [Commented] (SPARK-25294) Add integration test for Kerberos

2018-08-31 Thread Steven Rand (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16598366#comment-16598366
 ] 

Steven Rand commented on SPARK-25294:
-

+1 – another example of how easy it is to break the Kerberos integration 
without noticing is https://issues.apache.org/jira/browse/SPARK-22319

> Add integration test for Kerberos 
> --
>
> Key: SPARK-25294
> URL: https://issues.apache.org/jira/browse/SPARK-25294
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> Changes to some components, such as {{Yarn}}, {{Hive}}, and {{HDFS}}, may 
> cause Kerberos issues. We should add tests.
> https://issues.apache.org/jira/browse/SPARK-23789
> https://github.com/apache/spark/pull/21987#issuecomment-417560077






[jira] [Created] (SPARK-22319) SparkSubmit calls getFileStatus before calling loginUserFromKeytab

2017-10-19 Thread Steven Rand (JIRA)
Steven Rand created SPARK-22319:
---

 Summary: SparkSubmit calls getFileStatus before calling 
loginUserFromKeytab
 Key: SPARK-22319
 URL: https://issues.apache.org/jira/browse/SPARK-22319
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 2.3.0
Reporter: Steven Rand


In the SparkSubmit code, we call {{resolveGlobPaths}}, which eventually calls 
{{getFileStatus}}, which for HDFS is an RPC call to the NameNode: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L346.

We do this before we call {{loginUserFromKeytab}}, which is further down in the 
same method: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L655.

The result is that the call to {{resolveGlobPaths}} fails in secure clusters 
with:

{code}
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
{code}

A workaround is to {{kinit}} on the host before using spark-submit. However, 
it's better if this workaround isn't necessary. A simple fix is to call 
loginUserFromKeytab before attempting to interact with HDFS.

At least for cluster mode, this would appear to be a regression caused by 
SPARK-21012.
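
A rough sketch of the proposed ordering (illustrative only, not the actual 
SparkSubmit code; {{args}} here stands for the parsed submit arguments):

{code}
import org.apache.hadoop.security.UserGroupInformation

// Sketch only: authenticate from the keytab *before* any HDFS access, so that
// resolveGlobPaths (which ends up calling getFileStatus on the NameNode) runs
// with valid Kerberos credentials.
if (args.principal != null && args.keytab != null) {
  UserGroupInformation.loginUserFromKeytab(args.principal, args.keytab)
}

// Only after logging in, resolve glob paths against HDFS, e.g.:
//   args.jars = Option(args.jars).map(resolveGlobPaths(_, hadoopConf)).orNull
{code}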






[jira] [Commented] (SPARK-7481) Add spark-hadoop-cloud module to pull in object store support

2017-04-25 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982587#comment-15982587
 ] 

Steven Rand commented on SPARK-7481:


What happened to https://github.com/apache/spark/pull/12004? It doesn't look 
like there were any concrete objections to the changes made there -- was it 
just closed for lack of a reviewer?

As someone who has spent several tens of hours (and counting!) debugging 
classpath issues for Spark applications that read from and write to an object 
store, I think this change is hugely valuable. I suspect that the large number 
of votes and watchers indicates that others think this as well, so it'd be 
pretty depressing if it didn't happen just because no one will review the 
patch. Unfortunately I'm not qualified to review it myself, but I'd be quite 
grateful if someone more competent were to do so.

> Add spark-hadoop-cloud module to pull in object store support
> -
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, and to add s3a, swift, and azure support, the 
> Spark dependencies in a Hadoop 2.6+ profile need to include the relevant 
> object store packages (hadoop-aws, hadoop-openstack, hadoop-azure).






[jira] [Commented] (SPARK-18364) expose metrics for YarnShuffleService

2016-11-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15649506#comment-15649506
 ] 

Steven Rand commented on SPARK-18364:
-

The solution I'd initially had in mind, simply creating a {{MetricsSystem}} in 
{{YarnShuffleService}}, isn't ideal, since then {{network-yarn}} has to depend 
on {{core}}.

I tried splitting the metrics code currently in {{core}} into its own module, 
but this is pretty tough given how much other code from {{core}} the metrics 
code depends on. And the {{metrics}} module can't depend on {{core}}, since 
that creates a circular dependency (core has to depend on metrics).

Another idea might be to try to get metrics out of the executors instead of 
from ExternalShuffleBlockHandler, though this also seems tricky.

Open to ideas on how to proceed here if anyone has them.


> expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.






[jira] [Commented] (SPARK-18364) expose metrics for YarnShuffleService

2016-11-08 Thread Steven Rand (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15648785#comment-15648785
 ] 

Steven Rand commented on SPARK-18364:
-

I can implement this if people think it makes sense.

> expose metrics for YarnShuffleService
> -
>
> Key: SPARK-18364
> URL: https://issues.apache.org/jira/browse/SPARK-18364
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.1
>Reporter: Steven Rand
>Priority: Minor
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> ExternalShuffleService exposes metrics as of SPARK-16405. However, 
> YarnShuffleService does not.
> The work of instrumenting ExternalShuffleBlockHandler was already done in 
> SPARK-16405, so this JIRA is for creating a MetricsSystem in 
> YarnShuffleService similarly to how ExternalShuffleService already does it.






[jira] [Created] (SPARK-18364) expose metrics for YarnShuffleService

2016-11-08 Thread Steven Rand (JIRA)
Steven Rand created SPARK-18364:
---

 Summary: expose metrics for YarnShuffleService
 Key: SPARK-18364
 URL: https://issues.apache.org/jira/browse/SPARK-18364
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.0.1
Reporter: Steven Rand
Priority: Minor


ExternalShuffleService exposes metrics as of SPARK-16405. However, 
YarnShuffleService does not.

The work of instrumenting ExternalShuffleBlockHandler was already done in 
SPARK-16405, so this JIRA is for creating a MetricsSystem in YarnShuffleService 
similarly to how ExternalShuffleService already does it.
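
For context, a rough sketch of the ExternalShuffleService pattern referenced 
above (illustrative only; the Source wrapper and the surrounding variables 
blockHandler, sparkConf, and securityManager are assumptions, not the actual 
Spark code):

{code}
import com.codahale.metrics.MetricRegistry
import org.apache.spark.metrics.MetricsSystem
import org.apache.spark.metrics.source.Source

// Sketch only: create a MetricsSystem and expose the existing ShuffleMetrics
// (a codahale MetricSet) through it.
val metricsSystem = MetricsSystem.createMetricsSystem("shuffleService", sparkConf, securityManager)
metricsSystem.registerSource(new Source {
  override val sourceName = "shuffleService"
  override val metricRegistry = new MetricRegistry()
  // Register all of the handler's shuffle metrics under this source.
  metricRegistry.registerAll(blockHandler.getAllMetrics)
})
metricsSystem.start()
{code}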


