Re: A new external catalog
On 14 Feb 2018, at 19:56, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Newbie question: I want to add system/integration tests for the new functionality. There is a set of existing tests around the Spark Catalog that I can leverage. Great. The provider I'm writing is backed by a web service, though, which is part of an AWS account. I can write the tests using a mocked client that somehow clones the behavior of the web service, but I'll get the most value if I actually run the tests against a real AWS Glue account.

Benefits/costs of both:

mock: fast, supports fault injection, and Jenkins can run it without credentials, but you need to maintain the mock. A failing mock may just be a false alarm, and success doesn't mean that much.

live: needs credentials for runs, runs up bills, and is slow at distance or scale. Harder to inject faults, and a way more complex test setup. But it gives real answers about whether things work.

Personally, I prefer live, and I try to set tests up to cost a few cents and run fast, even though that keeps them out of Jenkins-based test runs for submitted patches. After all: you do need the real test cases somewhere.

How do you guys deal with external dependencies for system tests?

That's credentials. For the Hadoop core FS tests, all the tests which need (AWS, Azure) secrets:

* are tagged as integration tests and run in a different Maven phase
* are only executed if the credentials are set in a non-SCM-managed file, test/resources/auth-keys.xml, where it's safest to actually pull them in from elsewhere via XInclude, so your secrets are never in the git-managed dirs
* are designed to run in parallel so you can do a full test run in < 10 minutes, even remotely
* apart from the scale tests, which work with multiple MB and let you turn that scale up to many GB and 1000s of files/directories if you are running in-infra
* have really long timeouts, again configurable for the long-haul & scale tests

You can look at the code there, and note the testing docs, which have a strict "declare the specific infra endpoint you ran against or nobody will even look at your code" policy. That forces honesty in all submissions, even from ourselves.

https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/site/markdown/testing_azure.md

The hadoop-aws module now has a fault-injecting client built into the release JAR, something you can turn on from a Hadoop config option. This is designed to let integration tests verify that the layers above are resilient to the failures being injected (retryable throttle exceptions, repeatedly observable listing inconsistencies and, soon, broken connections on read() calls). Anything downstream can turn it on to see what breaks.

Spark 2.3 has a hadoop-cloud module which pulls in those FS clients and their transitive dependencies, but not any tests. I keep my core set here:

https://github.com/hortonworks-spark/cloud-integration

with the test suite trait and base class:

https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/main/scala/com/hortonworks/spark/cloud/CloudSuite.scala

These pick up a path to a config file which you keep out of the source tree, again for security reasons:

mvn test -Dcloud.test.configuration.file=/Users/stevel/Projects/sparkwork/cloud-test-configs/s3a.xml ...

I'd look at the POM there for how things propagate, and at how that suite trait has a ctest() method which only registers a test case for execution if the conditions are met, where conditions include the credentials being provided:

https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/main/scala/com/hortonworks/spark/cloud/CloudSuiteTrait.scala#L62

Is there an AWS account that is used for this purpose by any chance?
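The "only run if the credentials file is there" gating can be sketched in plain Java as well. This is a minimal illustration, not the real hadoop-aws or CloudSuiteTrait code; the class name, file path, and method are all made up for the example:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of credential-gated integration tests:
// run the live suite only when a non-SCM-managed auth-keys.xml exists.
public class CredentialGatedTests {

    // True only when the credentials file exists and is non-empty.
    static boolean credentialsAvailable(String path) {
        try {
            Path p = Paths.get(path);
            return Files.exists(p) && Files.size(p) > 0;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A real harness would register or skip whole suites here,
        // the way ctest() only registers a case when conditions hold.
        if (!credentialsAvailable("src/test/resources/auth-keys.xml")) {
            System.out.println("auth-keys.xml missing: integration tests skipped");
            return;
        }
        System.out.println("credentials found: running live tests");
    }
}
```

The point is that a checkout with no secrets still compiles and "passes": the live tests simply never register.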
Sadly, no. Microsoft gives Apache committers Azure credits if you ask nicely, but there's nothing similar from your colleagues. Now, if you were able to change that policy, things would be nice.

Now that I know how AWS STS works, with the AssumeRole API allowing you to create tokens which are only valid for a few minutes, I think it could actually be possible to have a setup where Jenkins creates some temporary AWS credentials for the duration of a test run and revokes them after, for a role restricted to specific AWS resources, so that even patches from untrusted sources could be run with the credentials. You'd need to give Jenkins the full keys though, and keep them locked down somehow, and maybe refresh them nightly for bonus security (and restrict those to very few rights too: AssumeRole for that restricted role would be enough).
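The lifecycle being proposed is: mint a token with a short lifetime, use it for one test run, and rely on expiry rather than trusting the patch author. A toy sketch of just that expiry bookkeeping, with no real AWS SDK involved (the real call would go through the STS AssumeRole API; the class here is invented for illustration):

```java
import java.time.Instant;

// Sketch of short-lived test credentials, modelled on STS AssumeRole:
// a token minted for the duration of one Jenkins run, dead afterwards.
public class ShortLivedToken {
    final String sessionName;
    final Instant expiry;

    ShortLivedToken(String sessionName, long lifetimeSeconds) {
        this.sessionName = sessionName;
        this.expiry = Instant.now().plusSeconds(lifetimeSeconds);
    }

    // Jenkins would check this before handing credentials to a test run;
    // an expired token means a fresh AssumeRole call is needed.
    boolean isValid() {
        return Instant.now().isBefore(expiry);
    }
}
```

Even if a malicious patch logs the credentials, they are useless minutes later, which is what makes running untrusted submissions tolerable.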
Re: A new external catalog
On 14 Feb 2018, at 13:51, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Thanks a lot Steve. I'll go through the JIRAs you linked in detail. I took a quick look and am sufficiently scared for now. I had run into that warning from the S3 stream before. Sigh.

Things like that are trouble, as they don't get picked up in automated test runs -- or they do, but unless you look at the console output, you don't see them. It's why, generally, for a Hadoop release we fix the SDK version at least 4-6 weeks before the release and play with it through the command line, downstream tests, etc.

The other troublespot is changes in performance & scale which don't show up in the smaller tests that look for functionality ("can I distcp 5 directories?"), but kick in when the problem is "can I use distcp to back up 4 PB of data without too many DELETE calls being throttled?". If you look at HADOOP-15209/HADOOP-15191, the tests instrument the S3A client to count the number of DELETE calls made, so I can make assertions that an LRU cache of recently deleted paths actually works, using the test mechanisms covered in http://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html

Other than that: sitting in front of the screen watching a test do Spark ORC/Parquet jobs using S3 as a source & destination, and noticing when it takes much, much longer than you'd expect, is a cue that there's a regression. Or that I'm just accidentally using an object store on a different continent from normal.

It'd be interesting to consider what you can do with scalatest/JUnit test runners to actually catch performance regressions here: have something take the Ant-format XML reports & convert them to something where you can diff performance over time & so build up your own local model of how long things should take. An interesting project for someone.

FWIW, Hadoop 3.1 is down for v 1.11.271 (shaded), which does have your stuff in, so far all is good.

-Steve
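The "instrument the client, then assert on the counters" pattern from those JIRAs can be shown in miniature. This is a made-up sketch, not the S3A instrumentation itself: it only demonstrates counting remote DELETE calls behind a recently-deleted-paths cache so a test can assert the cache saved work:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical instrumented store client: tests assert on the call
// counters instead of eyeballing logs, in the spirit of the
// HADOOP-15209/HADOOP-15191 tests.
public class CountingStoreClient {
    private final AtomicLong deleteCalls = new AtomicLong();
    private final Set<String> recentlyDeleted = new HashSet<>();

    // Issue a remote delete only for paths not already deleted recently,
    // mimicking a cache of recently deleted paths. (A real cache would
    // evict entries; this sketch keeps them forever.)
    void delete(String path) {
        if (recentlyDeleted.add(path)) {
            deleteCalls.incrementAndGet(); // the real client calls S3 here
        }
    }

    long deleteCallCount() {
        return deleteCalls.get();
    }
}
```

A test then deletes the same path twice and asserts the remote call count stayed at one: a functional check that doubles as a throttling-scale regression guard.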
Re: A new external catalog
Newbie question: I want to add system/integration tests for the new functionality. There is a set of existing tests around the Spark Catalog that I can leverage. Great. The provider I'm writing is backed by a web service, though, which is part of an AWS account. I can write the tests using a mocked client that somehow clones the behavior of the web service, but I'll get the most value if I actually run the tests against a real AWS Glue account. How do you guys deal with external dependencies for system tests? Is there an AWS account that is used for this purpose by any chance?

Thanks,
-Ameen

From: Steve Loughran
Date: Tuesday, February 13, 2018 at 5:01 PM
To: "Tayyebi, Ameen"
Cc: Apache Spark Dev
Subject: Re: A new external catalog

On 13 Feb 2018, at 21:20, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Yes, I'm thinking about upgrading to these: 1.9.0 / 1.11.272, from: 1.7.3 / 1.11.76. 272 is the earliest that has Glue. How about I let the build system run the tests and if things start breaking I fall back to shading Glue's specific SDK?

FWIW, some of the other troublespots are not functional, they're log overflow:

https://issues.apache.org/jira/browse/HADOOP-15040
https://issues.apache.org/jira/browse/HADOOP-14596

Myself and Cloudera collaborators are testing the shaded 1.11.271 JAR & will go with that into Hadoop 3.1 if we're happy, but that's not so much for new features as for "stack traces throughout the log", which seems to be a recurrent issue with the JARs, and one which often slips by CI build runs.
If it wasn't for that, we'd have stuck with 1.11.199, because it didn't have any issues that we hadn't already got under control (https://github.com/aws/aws-sdk-java/issues/1211). Like I said: upgrades bring fear.

From: Steve Loughran <ste...@hortonworks.com>
Date: Tuesday, February 13, 2018 at 3:34 PM
To: "Tayyebi, Ameen" <tayye...@amazon.com>
Cc: Apache Spark Dev <dev@spark.apache.org>
Subject: Re: A new external catalog

On 13 Feb 2018, at 19:50, Tayyebi, Ameen <tayye...@amazon.com> wrote:

The biggest challenge is that I had to upgrade the AWS SDK to a newer version so that it includes the Glue client, since Glue is a new service. So far, I haven't seen any jar-hell issues, but that's the main drawback I can see. I've made sure the version is in sync with the Kinesis client used by the spark-streaming module.

Funnily enough, I'm currently updating the s3a troubleshooting doc, the latest version up front saying "Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see."

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-15076-trouble-and-perf/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Upgrading AWS SDKs is, sadly, often viewed with almost the same fear as Guava, especially if it's the unshaded version, which forces in a version of Jackson. Which SDK version are you proposing? 1.11.x?
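Ameen's fallback of "shading Glue's specific SDK" would look roughly like this maven-shade-plugin fragment. This is a hedged sketch: the relocation pattern, package names, and placement are assumptions for illustration, not something agreed in the thread.

```xml
<!-- Sketch: relocate just the Glue client classes so a newer Glue SDK
     can coexist with the AWS SDK version Spark already pins.
     Patterns below are illustrative, not verified against the SDK layout. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.amazonaws.services.glue</pattern>
            <shadedPattern>shaded.com.amazonaws.services.glue</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

The trade-off is the one the thread keeps circling: shading dodges jar hell but means maintaining your own artifact, and shared transitive dependencies (Jackson, the SDK core) may still need relocating too.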
Re: A new external catalog
Thanks a lot Steve. I'll go through the JIRAs you linked in detail. I took a quick look and am sufficiently scared for now. I had run into that warning from the S3 stream before. Sigh.

From: Steve Loughran
Date: Tuesday, February 13, 2018 at 5:01 PM
To: "Tayyebi, Ameen"
Cc: Apache Spark Dev
Subject: Re: A new external catalog

On 13 Feb 2018, at 21:20, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Yes, I'm thinking about upgrading to these: 1.9.0 / 1.11.272, from: 1.7.3 / 1.11.76. 272 is the earliest that has Glue. How about I let the build system run the tests and if things start breaking I fall back to shading Glue's specific SDK?

FWIW, some of the other troublespots are not functional, they're log overflow:

https://issues.apache.org/jira/browse/HADOOP-15040
https://issues.apache.org/jira/browse/HADOOP-14596

Myself and Cloudera collaborators are testing the shaded 1.11.271 JAR & will go with that into Hadoop 3.1 if we're happy, but that's not so much for new features as for "stack traces throughout the log", which seems to be a recurrent issue with the JARs, and one which often slips by CI build runs. If it wasn't for that, we'd have stuck with 1.11.199, because it didn't have any issues that we hadn't already got under control (https://github.com/aws/aws-sdk-java/issues/1211). Like I said: upgrades bring fear.

From: Steve Loughran <ste...@hortonworks.com>
Date: Tuesday, February 13, 2018 at 3:34 PM
To: "Tayyebi, Ameen" <tayye...@amazon.com>
Cc: Apache Spark Dev <dev@spark.apache.org>
Subject: Re: A new external catalog

On 13 Feb 2018, at 19:50, Tayyebi, Ameen <tayye...@amazon.com> wrote:

The biggest challenge is that I had to upgrade the AWS SDK to a newer version so that it includes the Glue client, since Glue is a new service. So far, I haven't seen any jar-hell issues, but that's the main drawback I can see. I've made sure the version is in sync with the Kinesis client used by the spark-streaming module.

Funnily enough, I'm currently updating the s3a troubleshooting doc, the latest version up front saying "Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see."

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-15076-trouble-and-perf/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Upgrading AWS SDKs is, sadly, often viewed with almost the same fear as Guava, especially if it's the unshaded version, which forces in a version of Jackson. Which SDK version are you proposing? 1.11.x?
Re: A new external catalog
On 13 Feb 2018, at 21:20, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Yes, I'm thinking about upgrading to these: 1.9.0 / 1.11.272, from: 1.7.3 / 1.11.76. 272 is the earliest that has Glue. How about I let the build system run the tests and if things start breaking I fall back to shading Glue's specific SDK?

FWIW, some of the other troublespots are not functional, they're log overflow:

https://issues.apache.org/jira/browse/HADOOP-15040
https://issues.apache.org/jira/browse/HADOOP-14596

Myself and Cloudera collaborators are testing the shaded 1.11.271 JAR & will go with that into Hadoop 3.1 if we're happy, but that's not so much for new features as for "stack traces throughout the log", which seems to be a recurrent issue with the JARs, and one which often slips by CI build runs. If it wasn't for that, we'd have stuck with 1.11.199, because it didn't have any issues that we hadn't already got under control (https://github.com/aws/aws-sdk-java/issues/1211). Like I said: upgrades bring fear.

From: Steve Loughran <ste...@hortonworks.com>
Date: Tuesday, February 13, 2018 at 3:34 PM
To: "Tayyebi, Ameen" <tayye...@amazon.com>
Cc: Apache Spark Dev <dev@spark.apache.org>
Subject: Re: A new external catalog

On 13 Feb 2018, at 19:50, Tayyebi, Ameen <tayye...@amazon.com> wrote:

The biggest challenge is that I had to upgrade the AWS SDK to a newer version so that it includes the Glue client, since Glue is a new service. So far, I haven't seen any jar-hell issues, but that's the main drawback I can see. I've made sure the version is in sync with the Kinesis client used by the spark-streaming module.

Funnily enough, I'm currently updating the s3a troubleshooting doc, the latest version up front saying "Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see."

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-15076-trouble-and-perf/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Upgrading AWS SDKs is, sadly, often viewed with almost the same fear as Guava, especially if it's the unshaded version, which forces in a version of Jackson. Which SDK version are you proposing? 1.11.x?
Re: A new external catalog
Yes, I'm thinking about upgrading to these: 1.9.0 / 1.11.272, from: 1.7.3 / 1.11.76. 272 is the earliest that has Glue. How about I let the build system run the tests and if things start breaking I fall back to shading Glue's specific SDK?

From: Steve Loughran
Date: Tuesday, February 13, 2018 at 3:34 PM
To: "Tayyebi, Ameen"
Cc: Apache Spark Dev
Subject: Re: A new external catalog

On 13 Feb 2018, at 19:50, Tayyebi, Ameen <tayye...@amazon.com> wrote:

The biggest challenge is that I had to upgrade the AWS SDK to a newer version so that it includes the Glue client, since Glue is a new service. So far, I haven't seen any jar-hell issues, but that's the main drawback I can see. I've made sure the version is in sync with the Kinesis client used by the spark-streaming module.

Funnily enough, I'm currently updating the s3a troubleshooting doc, the latest version up front saying "Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see."

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-15076-trouble-and-perf/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Upgrading AWS SDKs is, sadly, often viewed with almost the same fear as Guava, especially if it's the unshaded version, which forces in a version of Jackson. Which SDK version are you proposing? 1.11.x?
Re: A new external catalog
On 13 Feb 2018, at 19:50, Tayyebi, Ameen <tayye...@amazon.com> wrote:

The biggest challenge is that I had to upgrade the AWS SDK to a newer version so that it includes the Glue client, since Glue is a new service. So far, I haven't seen any jar-hell issues, but that's the main drawback I can see. I've made sure the version is in sync with the Kinesis client used by the spark-streaming module.

Funnily enough, I'm currently updating the s3a troubleshooting doc, the latest version up front saying "Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see."

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-15076-trouble-and-perf/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Upgrading AWS SDKs is, sadly, often viewed with almost the same fear as Guava, especially if it's the unshaded version, which forces in a version of Jackson. Which SDK version are you proposing? 1.11.x?