Stephen Sisk created BEAM-2659:
----------------------------------

             Summary: JdbcIOIT flaky when run using io-it-suite-local
                 Key: BEAM-2659
                 URL: https://issues.apache.org/jira/browse/BEAM-2659
             Project: Beam
          Issue Type: Bug
          Components: testing
            Reporter: Stephen Sisk
            Assignee: Chamikara Jayalath


Note: the problem below *should not* affect io-it-suite, and thus the jdbc 
jenkins job that's currently in PR. I haven't tested that exact configuration, 
so I'm not 100% certain, but I don't have any indication that there'll be a 
problem.
---

 I've been running the postgres kubernetes scripts locally for a while and 
haven't seen any problems.

However, now that I'm running it via io-it-suite-local, I'm starting to see 
flakiness - clients attempting to connect to the postgres server get a 
"connection attempt failed" error. The difference between what was working 
before and now is that the load balancer and the pod are now getting set up at 
the same time. Before, I was using a pre-existing load balancer - that is, I 
wasn't tearing down/starting up the load balancer+pod at run time.

So I think the problem is in the interaction between the two, or potentially 
just in the LoadBalancer service (it may take a little longer to get fully 
hooked up even after it reports an IP).

Possible causes:
* the load balancer is reporting it's ready before it can actually serve 
traffic to the postgres instance
* the load balancer has another status field that I'm not looking at - today we 
only check the IP address. kubectl get/describe might be able to help (see the 
kubectl sketch after this list), although a cursory examination didn't show 
anything helpful.
* the postgres instance isn't actually ready when it says it is. I don't think 
that's the issue, since I was working with postgres pods before and they seemed 
fine then
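
For reference, here's roughly how to poke at the service from kubectl (the 
service name "postgres" is a placeholder for whatever the kubernetes script 
actually creates):

    # show everything kubernetes reports about the service, including events
    kubectl describe service postgres

    # the only load balancer status exposed under .status is the ingress list,
    # which is just the IP/hostname we already poll for
    kubectl get service postgres -o jsonpath='{.status.loadBalancer.ingress}'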

Potential solutions: 
* if the cause is a slow postgres pod start (unlikely): determine postgres pod 
health by reading from sql (pg_ctl?), and then have pkb wait for that by adding 
a dynamic_pipeline_option that waits for the kubernetes status to be okay and 
sends to a non-existent pipeline option
* file a bug about the loadbalancer not being ready when it says it is? 
(investigate that more :)
* have some way for pkb to actually connect to and validate the connection to 
postgres (that seems complicated - see the sketch after this list)
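
As a sketch of what that connection validation could look like from a shell 
(this assumes the postgres client tools are installed locally, and $SERVICE_IP 
is a placeholder for the IP the load balancer reports):

    # pg_isready exits 0 once the server is accepting connections
    until pg_isready -h "$SERVICE_IP" -p 5432 -t 5; do
      echo "postgres not accepting connections yet, retrying..."
      sleep 5
    done

pkb would need the equivalent built in, which is the complicated part.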

If the problem is that the loadbalancer is not ready when it says it is, then 
while we are waiting for kubernetes to fix the issue, one workaround would be to:
1) modify io-it-suite-local to not load any kubernetes scripts (set 
--beam_kubernetes_scripts to a blank value or skip it altogether - 
https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/pom.xml#L137)
2) have the user run the kubernetes scripts manually beforehand, wait for the 
service to be healthy, and then run io-it-suite-local (see the sketch below)
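
Step 2 would look roughly like this (the script path and service name are 
placeholders for whatever the suite normally passes to 
--beam_kubernetes_scripts):

    kubectl create -f path/to/postgres.yml

    # wait until the load balancer reports an external IP
    while [ -z "$(kubectl get service postgres \
        -o jsonpath='{.status.loadBalancer.ingress[0].ip}')" ]; do
      sleep 10
    done

and then give the server a little settle time before kicking off 
io-it-suite-local.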

To repro the problem:

    mvn verify -Dio-it-suite-local -pl sdks/java/io/jdbc \
      -DpkbLocation="your-copy-of-PerfKitBenchmarker/pkb.py" \
      -DintegrationTestPipelineOptions='["--tempRoot=gs://sisk-test/staging"]' \
      -DforceDirectRunner=true

This should fail when run repeatedly.
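
e.g. run it in a loop until it breaks:

    for i in 1 2 3 4 5; do
      mvn verify -Dio-it-suite-local -pl sdks/java/io/jdbc \
        -DpkbLocation="your-copy-of-PerfKitBenchmarker/pkb.py" \
        -DintegrationTestPipelineOptions='["--tempRoot=gs://sisk-test/staging"]' \
        -DforceDirectRunner=true || break
    done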


