Re: A new external catalog

2018-02-16 Thread Steve Loughran


On 14 Feb 2018, at 19:56, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Newbie question:

I want to add system/integration tests for the new functionality. There are a 
set of existing tests around Spark Catalog that I can leverage. Great. The 
provider I’m writing is backed by a web service though which is part of an AWS 
account. I can write the tests using a mocked client that somehow clones the 
behavior of the webservice, but I’ll get the most value if I actually run the 
tests against a real AWS Glue account.



Benefits/cost of both

mock: fast, supports fault injection, and jenkins can run it without
credentials; but you need to maintain the mock, a failing mock may just be a
false alarm, and success doesn't mean that much.

live: needs credentials for runs, runs up bills, and is slow at distance or
scale; harder to inject faults, and the test setup is way more complex. But it
gives real answers about whether things work.

Personally, I prefer live, and try to set tests up to cost a few cents and run
fast, even though that keeps them out of Jenkins-based test runs for submitted
patches. After all: you do need the real test cases somewhere.

How do you guys deal with external dependencies for system tests?


That's credentials.


For the hadoop core FS tests, all the tests which need (aws, azure) secrets are

* tagged as integration tests and run in a different maven phase
* only executed if the credentials are set in a non-SCM-managed file,
test/resources/auth-keys.xml; it's safest to actually pull them in from
elsewhere via XInclude, so your secrets are never in the git-managed dirs
* designed to run in parallel, so you can do a full test run in < 10
minutes, even remotely (apart from the scale tests, which work with multiple
MB by default and let you scale that up to many GB and 1000s of
files/directories if you are running in-infra)
* given really long timeouts, again configurable for long-haul & scale tests
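The credentials gate in the list above can be sketched in a few lines; this is a hypothetical illustration with invented names, not the actual Hadoop test code:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class IntegrationTestGate {
    /** True only when the untracked credentials file is present. */
    static boolean credentialsAvailable(Path authKeys) {
        return Files.isRegularFile(authKeys);
    }

    public static void main(String[] args) {
        Path authKeys = Path.of("src/test/resources/auth-keys.xml");
        if (!credentialsAvailable(authKeys)) {
            // A real runner (maven failsafe / ScalaTest) would mark the suite skipped.
            System.out.println("skipping live tests: no auth-keys.xml");
            return;
        }
        System.out.println("running live integration tests");
    }
}
```

The point is that a checkout without secrets still builds and "passes": the live suites simply never register.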

You can look at the code there, and note the testing docs, which have a strict
rule: "declare the specific infra endpoint you ran against or nobody will even
look at your code". That forces honesty in all submissions, even from ourselves.

https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/site/markdown/testing_azure.md
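The XInclude indirection for auth-keys.xml mentioned above might look something like this; a hypothetical sketch, with an invented secrets path, rather than the exact layout the testing docs describe:

```xml
<?xml version="1.0"?>
<!-- test/resources/auth-keys.xml: kept out of SCM. The real secrets live
     outside the source tree and are pulled in via XInclude. -->
<configuration xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="file:///home/me/.secrets/aws-auth-keys.xml"/>
</configuration>
```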

The hadoop-aws module now has a fault-injecting client built into the release
JAR, something you can turn on from a hadoop config option. This is designed to
let integration tests verify that the layers above are resilient to the
failures being injected (retryable throttle exceptions, repeatedly observable
listing inconsistencies and, soon, broken connections on read() calls).
Anything downstream can turn it on to see what breaks.
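Turning it on is just configuration, along these lines; the option name and factory class here are quoted from memory of the s3a testing docs, so treat them as approximate and check the docs linked above for the release you are on:

```xml
<!-- Swap in the fault-injecting S3 client for test runs only. -->
<property>
  <name>fs.s3a.s3.client.factory.impl</name>
  <value>org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory</value>
</property>
```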



Spark 2.3 has a hadoop-cloud module which pulls in those FS clients and their
transitive dependencies, but not any tests.

I keep my core set here  https://github.com/hortonworks-spark/cloud-integration

with the test suite trait and base class
https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/main/scala/com/hortonworks/spark/cloud/CloudSuite.scala

These pick up a path to a config file which you keep out of the source tree,
again for security reasons:

mvn test   
-Dcloud.test.configuration.file=/Users/stevel/Projects/sparkwork/cloud-test-configs/s3a.xml
 ...


I'd look at the POM there for how things propagate, and at how that suite trait
has a ctest() method which only registers a test case for execution if the
conditions are met, where conditions include the credentials being provided:

https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/main/scala/com/hortonworks/spark/cloud/CloudSuiteTrait.scala#L62
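The ctest() idea can be sketched roughly as below; this is plain Java with invented names, where the real thing is a ScalaTest trait at the link above:

```java
import java.util.ArrayList;
import java.util.List;

public class ConditionalSuite {
    interface TestCase { void run(); }

    private final List<String> registered = new ArrayList<>();

    /** Register the named test only if its enabling condition holds. */
    void ctest(String name, boolean enabled, TestCase body) {
        if (enabled) {
            registered.add(name);
        }
    }

    List<String> registeredTests() { return registered; }

    public static void main(String[] args) {
        ConditionalSuite suite = new ConditionalSuite();
        // Condition: a path to an out-of-tree config file was supplied.
        boolean haveCredentials = System.getenv("CLOUD_TEST_CONF") != null;
        suite.ctest("uploadLargeFile", haveCredentials, () -> { /* live test body */ });
        suite.ctest("localOnlyCheck", true, () -> { /* runs everywhere */ });
        System.out.println("registered: " + suite.registeredTests());
    }
}
```

Unregistered tests don't show up as failures or even skips; the suite just shrinks when the credentials aren't there.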


Is there an AWS account that is used for this purpose by any chance?


Sadly, no. Microsoft gives Apache committers Azure credits if you ask nicely,
but there's nothing similar from your colleagues. Now, if you were able to
change that policy, things would be nice.

Now that I know how AWS STS works, with the AssumeRole API allowing you to
create tokens which are only valid for a few minutes, I think it could actually
be possible to have a setup where Jenkins creates some temporary AWS
credentials for the duration of a test run and revokes them after, for a role
restricted to the test's AWS resources, so that even patches from untrusted
sources could be run with the credentials. You'd need to give Jenkins the full
keys, though, and keep them locked down somehow, and maybe refresh them nightly
for bonus security (and restrict those to very few rights too: assume-role for
that restricted role would be enough)
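The locked-down Jenkins user could carry nothing but the right to assume the restricted role. A hypothetical IAM policy sketch of that, with the account ID and role name invented:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::123456789012:role/spark-integration-tests"
  }]
}
```

Even if the long-lived keys leaked, an attacker would only get short-lived tokens scoped to whatever that one role can touch.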




Re: A new external catalog

2018-02-16 Thread Steve Loughran


On 14 Feb 2018, at 13:51, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Thanks a lot Steve. I’ll go through the Jira’s you linked in detail. I took a 
quick look and am sufficiently scared for now. I had run into that warning from 
the S3 stream before. Sigh.


Things like that are trouble, as they don't get picked up in automated test
runs; or they do, but unless you look at the console output, you don't see
them. It's why, for a Hadoop release, we generally fix the SDK version at least
4-6 weeks before the release and play with it through the command line,
downstream tests, etc.

The other troublespot is changes in perf & scale which don't show up in the
smaller tests which look for functionality ("can I distcp 5 directories?"), but
kick off when the problem is "can I use distcp to back up 4 PB of data without
too many DELETE calls being throttled?". If you look at
HADOOP-15209/HADOOP-15191, the tests instrument the S3A client to count the
number of DELETE calls made, so I can make assertions that an LRU cache of
recently deleted paths actually works, using the test mechanisms covered in
http://steveloughran.blogspot.co.uk/2016/04/distributed-testing-making-use-of.html
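The instrumentation idea is simple to sketch; hypothetical names below, not the actual S3A statistics classes:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CountingStoreClient {
    private final AtomicLong deleteCount = new AtomicLong();

    /** Stand-in for the real store call; here it just counts invocations. */
    void delete(String path) {
        deleteCount.incrementAndGet();
    }

    long deletes() { return deleteCount.get(); }

    public static void main(String[] args) {
        CountingStoreClient client = new CountingStoreClient();
        client.delete("/table/_temporary");
        client.delete("/table/old-partition");
        // Assert the higher-level operation stayed within its DELETE budget.
        if (client.deletes() > 2) {
            throw new AssertionError("too many DELETE calls: " + client.deletes());
        }
        System.out.println("DELETE calls: " + client.deletes());
    }
}
```

Counting calls turns "is the cache working?" into an assertion a functional test can make, instead of waiting for throttling at PB scale.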

Other than that, sitting in front of the screen watching a test run Spark
ORC/Parquet jobs using S3 as src & dest, and noticing when it takes much, much
longer than you'd expect, is a cue of a regression. Or that I'm just
accidentally using an object store on a different continent from normal.

It'd be interesting to consider what you can do with scalatest/junit test
runners to actually catch performance regressions here: have something take the
Ant-format XML reports & convert them to something where you can diff
performance over time & so build up your own local model of how long things
should take. An interesting project for someone.
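A minimal sketch of the diffing step, assuming the XML reports have already been parsed into test-name-to-seconds maps (all names and numbers here are invented):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PerfDiff {
    /** Names of tests whose duration grew by more than `factor`. */
    static List<String> regressions(Map<String, Double> base,
                                    Map<String, Double> now,
                                    double factor) {
        List<String> slower = new ArrayList<>();
        now.forEach((test, secs) -> {
            Double before = base.get(test);
            if (before != null && secs > before * factor) {
                slower.add(test);
            }
        });
        return slower;
    }

    public static void main(String[] args) {
        Map<String, Double> base = Map.of("orcRead", 30.0, "parquetWrite", 45.0);
        Map<String, Double> now = Map.of("orcRead", 95.0, "parquetWrite", 46.0);
        // orcRead more than doubled; parquetWrite barely moved.
        System.out.println("regressions: " + regressions(base, now, 2.0));
        // prints "regressions: [orcRead]"
    }
}
```

The hard part is the model, not the diff: durations vary with network distance and store load, so a single threshold factor is only a starting point.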

FWIW, Hadoop 3.1 is down for v 1.11.271 (shaded), which does have your stuff
in; so far all is good.

-Steve


Re: A new external catalog

2018-02-14 Thread Tayyebi, Ameen
Newbie question:

I want to add system/integration tests for the new functionality. There are a 
set of existing tests around Spark Catalog that I can leverage. Great. The 
provider I’m writing is backed by a web service though which is part of an AWS 
account. I can write the tests using a mocked client that somehow clones the 
behavior of the webservice, but I’ll get the most value if I actually run the 
tests against a real AWS Glue account.

How do you guys deal with external dependencies for system tests? Is there an 
AWS account that is used for this purpose by any chance?

Thanks,
-Ameen

From: Steve Loughran 
Date: Tuesday, February 13, 2018 at 5:01 PM
To: "Tayyebi, Ameen" 
Cc: Apache Spark Dev 
Subject: Re: A new external catalog




On 13 Feb 2018, at 21:20, Tayyebi, Ameen <tayye...@amazon.com> wrote:

Yes, I’m thinking about upgrading to these: 1.9.0 and 1.11.272, from 1.7.3 and
1.11.76 respectively.

272 is the earliest that has Glue.

How about I let the build system run the tests and if things start breaking I 
fall back to shading Glue’s specific SDK?


FWIW, some of the other troublespots are not functional; they're log overflow:

https://issues.apache.org/jira/browse/HADOOP-15040
https://issues.apache.org/jira/browse/HADOOP-14596

Cloudera collaborators and I are testing the shaded 1.11.271 JAR & will go with
that into Hadoop 3.1 if we're happy, but that's not so much for new features as
for "stack traces throughout the log", which seems to be a recurrent issue with
the JARs, and one which often slips by CI build runs. If it wasn't for that,
we'd have stuck with 1.11.199, because it didn't have any issues that we hadn't
already got under control
(https://github.com/aws/aws-sdk-java/issues/1211)

Like I said: upgrades bring fear


From: Steve Loughran <ste...@hortonworks.com>
Date: Tuesday, February 13, 2018 at 3:34 PM
To: "Tayyebi, Ameen" <tayye...@amazon.com>
Cc: Apache Spark Dev <dev@spark.apache.org>
Subject: Re: A new external catalog





On 13 Feb 2018, at 19:50, Tayyebi, Ameen <tayye...@amazon.com> wrote:


The biggest challenge is that I had to upgrade the AWS SDK to a newer version 
so that it includes the Glue client since Glue is a new service. So far, I 
haven’t seen any jar hell issues, but that’s the main drawback I can see. I’ve 
made sure the version is in sync with the Kinesis client used by 
spark-streaming module.

Funnily enough, I'm currently updating the s3a troubleshooting doc; the latest
version says up front:

"Whatever problem you have, changing the AWS SDK version will not fix things, 
only change the stack traces you see."

https://github.com/steveloughran/hadoop/blob/s3/HADOOP-15076-trouble-and-perf/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md

Upgrading AWS SDKs is, sadly, often viewed with almost the same fear as guava, 
especially if it's the unshaded version which forces in a version of jackson.

Which SDK version are you proposing? 1.11.x ?






