Re: A mini postmortem on snapshot failures

2017-01-04 Thread Erb, Stephan
I am not aware of tickets or work on the _exact_ problem described in this 
thread.

However, there is ongoing work to improve scheduler performance 
and availability during snapshots on large clusters:

• https://reviews.apache.org/r/54883/  (still pending)
• https://reviews.apache.org/r/54774/ with associated ticket 
https://issues.apache.org/jira/browse/AURORA-1861 (already merged)

Maybe those are also interesting for you.


On 31/12/2016, 00:03, "meghdoot bhattacharya" wrote:

Is there a ticket for this? In our large test cluster, we ran into this 
under heavy load.
Thx

  From: John Sirois 
 To: dev@aurora.apache.org 
 Sent: Wednesday, October 5, 2016 11:04 AM
 Subject: Re: A mini postmortem on snapshot failures
   
On Tue, Oct 4, 2016 at 11:55 AM, Joshua Cohen  wrote:

> Hi Zameer,
>
> Thanks for this writeup!
>
> I think one other option to consider would be using a connection for
> writing the snapshots that's not bound by the pool's maximum checkout time.
> I'm not sure if this is feasible or not, but I worry that there's
> potentially no upper bound on raising the maximum checkout time as the size
> of a cluster grows or its read traffic grows. It feels a bit heavyweight
> to up the max checkout time when potentially the only connection exceeding
> the limits is the one writing the snapshot.
>

+1 to the above, it would be great if the snapshotter thread could grab its
own dedicated connection.  In fact, since the operation is relatively rare
(compared to all other reads and writes), it could even grab and dispose of
a new connection per-snapshot to simplify things (ie: no need to check for
a stale connection like you would have to if you grabbed one and tried to
hold it for the lifetime of the scheduler process).
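
As a rough illustration of the per-snapshot idea above, here is a minimal Java sketch. It is not Aurora code: `PerSnapshotConnection`, `withFreshConnection`, `SnapshotWork` and `FakeConnection` are hypothetical names made up for this example, and the fake connection merely stands in for whatever real connection the snapshotter would open outside the pool. The only point shown is that a connection opened and closed around each snapshot never needs a staleness check.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public final class PerSnapshotConnection {
  /** Hypothetical callback: work that needs a connection and may throw. */
  interface SnapshotWork<C, R> {
    R run(C connection) throws Exception;
  }

  /**
   * Opens a brand-new connection, runs the work, and always closes the
   * connection afterwards (even if the work throws). Nothing is cached
   * between calls, so there is no stale connection to detect.
   */
  static <C extends AutoCloseable, R> R withFreshConnection(
      Callable<C> opener, SnapshotWork<C, R> work) throws Exception {
    try (C connection = opener.call()) {
      return work.run(connection);
    }
  }

  /** Stand-in for a real connection; only counts opens and closes. */
  static final class FakeConnection implements AutoCloseable {
    static final AtomicInteger OPENED = new AtomicInteger();
    static final AtomicInteger CLOSED = new AtomicInteger();

    FakeConnection() {
      OPENED.incrementAndGet();
    }

    @Override
    public void close() {
      CLOSED.incrementAndGet();
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulate three snapshot intervals, each with its own connection.
    for (int i = 0; i < 3; i++) {
      String result = withFreshConnection(FakeConnection::new, c -> "snapshot written");
      System.out.println(result);
    }
    System.out.println(
        FakeConnection.OPENED.get() + " opened, " + FakeConnection.CLOSED.get() + " closed");
  }
}
```

The trade-off is the cost of opening a connection for every snapshot, which is probably acceptable given how rarely snapshots run compared to other reads and writes.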


> I'd definitely be in favor of adding a flag to tune the maximum connection
> checkout.
>
> I'm neutral to negative on having Snapshot creation failures be fatal, I
> don't necessarily think that one failed snapshot should take the scheduler
> down, but I agree that some number of consecutive failures is a Bad Thing
> that is worthy of investigation. My concern with having failures be fatal
> is the pathological case where snapshots always fail, causing your scheduler
> to fail over once every SNAPSHOT_INTERVAL. Do you think it would be
> sufficient to add `scheduler_log_snapshots` to the list of important
> stats[1]?
>
> I'm also neutral on changing the defaults. I'm not sure if it's warranted,
> as the behavior will vary based on cluster. It seems like you guys got bit
> by this due to a comparatively heavy read load? Our cluster, on the other
> hand, is probably significantly larger, but is not queried as much, and we
> haven't run into issues with the defaults. However, as long as there is
> no adverse impact to bumping the default values, I've got no objections.
>
> Cheers,
>
> Joshua
>
> [1]
> https://github.com/apache/aurora/blob/master/docs/operations/monitoring.md#important-stats
>
> On Fri, Sep 30, 2016 at 7:34 PM, Zameer Manji  wrote:
>
> > Aurora Developers and Users,
> >
> > I would like to share a failure case I experienced recently. In a modestly
> > sized production cluster with high read load, snapshot creation began to
> > fail. The logs showed the following:
> >
> > 
> > W0923 00:23:55.528 [LogStorage-0, LogStorage:473]
> > ### Error rolling back transaction.  Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
> > ### Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
> > org.apache.ibatis.exceptions.PersistenceException:
> > ### Error rolling back transaction.  Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
> > ### Cause: java.sql.SQLException: Error accessing PooledConnection. Connection is invalid.
> > at org.apache.ibatis.exceptions.ExceptionFactory.wrapException(ExceptionFactory.java:30)
> > at org.apache.ibatis.session.defaults.DefaultSqlSession.rollback(DefaultSqlSession.java:216)
> > at org.apache.ibatis.session.SqlSessionManager.rollback(SqlSessionManager.java:299)
> > at org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:116)
> > at org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$0(DbStorage.java:175)
> > at org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
> > at

Re: Build failed in Jenkins: aurora-packaging-nightly #535

2017-01-04 Thread Erb, Stephan
I tried to investigate the repeated build failures of the Aurora packaging 
build job but was not able to find the cause. Repeated builds on my local 
machine did not trigger the same error. 

If anyone else has an idea what could be wrong with  
https://builds.apache.org/view/All/job/aurora-packaging-nightly/  please step 
forward.



On 04/01/2017, 01:38, "Apache Jenkins Server"  wrote:

See 

--
[...truncated 18959 lines...]
failure
failure_limit
hello_world
ordering
ports
sleep60


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh:
org


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org:
apache


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org/apache:
aurora


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org/apache/aurora:
e2e


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org/apache/aurora/e2e:
Dockerfile.netcat
Dockerfile.python
ephemeral_daemon_with_final.aurora
http
http_example.py
run-server.sh
test_bypass_leader_redirect_end_to_end.sh
test_daemonizing_process.aurora
test_end_to_end.sh
test_kerberos_end_to_end.sh
validate_serverset.py


artifacts/aurora-centos-7/dist/rpmbuild/BUILD/apache-aurora-0.17.0-SNAPSHOT/src/test/sh/org/apache/aurora/e2e/http:
http_example.aurora
http_example_bad_healthcheck.aurora
http_example_updated.aurora

artifacts/aurora-centos-7/dist/rpmbuild/BUILDROOT:

artifacts/aurora-centos-7/dist/rpmbuild/RPMS:
x86_64

artifacts/aurora-centos-7/dist/rpmbuild/RPMS/x86_64:
aurora-executor-0.17.0_SNAPSHOT.2016.12.12-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.14-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.15-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.16-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.26-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2016.12.29-1.el7.centos.aurora.x86_64.rpm
aurora-executor-0.17.0_SNAPSHOT.2017.01.03-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.12-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.14-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.15-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.16-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.26-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2016.12.29-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-0.17.0_SNAPSHOT.2017.01.03-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.12-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.14-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.15-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.16-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.26-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2016.12.29-1.el7.centos.aurora.x86_64.rpm
aurora-scheduler-debuginfo-0.17.0_SNAPSHOT.2017.01.03-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.12-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.14-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.15-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.16-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.26-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2016.12.29-1.el7.centos.aurora.x86_64.rpm
aurora-tools-0.17.0_SNAPSHOT.2017.01.03-1.el7.centos.aurora.x86_64.rpm
repodata

artifacts/aurora-centos-7/dist/rpmbuild/RPMS/x86_64/repodata:
077b3c5c12e5bc075694b4fc884b503e4167dcabba4502bdd72715cc1a464d7a-other.sqlite.bz2
162ac872fa2d3fde308a44e8617eb9a74a9649a0f4a09db1a8103033c4bf8dc9-primary.xml.gz
1bba22db0bdc1c51d165a9562d8c2a369ec0361b494e80ee51231e64283be84a-primary.sqlite.bz2
20ebec8dede455208f21767f897788be4cb2ff4465d63c62276d578e3497c337-other.xml.gz
22e71dace89c0e45d4fcb3afc6f805973d8b6a4e8400217f3e2d099cad2e4742-primary.xml.gz
247cc5d9118a98da934750f93ac1a03c39f9ee5e21ffb1f76cfea1ff6c01844b-filelists.xml.gz
262601e2b003888bdc9849b10e5aeac215981e64a4bedf3e07fb19d32a0e4f11-filelists.xml.gz
2c8f782d7e0a79c7971e4c0bcbe14260c298cfad7e84a963d598855b06cbace9-filelists.xml.gz