Hi,

We using Postgres 9.3.10 on Amazon RDS and running into some strange
behavior that has been tough to track down and debug (partially due to the
limited admin access from RDS).

We're running a read-only query that normally takes ~10-15 min., but also
runs concurrently with several other intensive queries (these queries
themselves, finish).

On one particular day, this query hung for many hours and even while we
killed pids for running queries and any locks granted, the query would
never return.  Also no hints of blocking processes.  After some digging
through some I/O metrics, we didn't see any memory issues or unusual spikes
that would lead us to believe that we're running low on resources.

There is 1 caveat, however:  there was a different schema that contained a
day-old copy of data that isn't normally present when the hang started to
occur.  However, since these are completely different schema namespaces
with no crossovers in the queries themselves, I don't see how this is
relevant.


 1) We ended up doing a full reboot of the RDS instance and ran the query
again, this time, no other queries are running off of a fresh boot-up (no
competing locks or transactions).  The query continued to hang.

 2) We then ran pg_dump to snapshot the current data and did a full
pg_restore (after dropping all schemas) of an older dataset where we knew
this query would run successfully.  As expected, the query ran fine.

 3) We then dropped all schemas again and pg_restored the previous dataset
that was causing the query to hang, and then to my surprise, the query ran
just fine.  No hangs.

We thought this might be possibly due to some internal vacuuming, but this
is unlikely since there are no real concurrent reads or updates happening.
Auto-vacuum is also on with default settings.

What is the most confusing part in all of this is why a DROP SCHEMA CASCADE
and a fresh pg_restore would somehow fix the problem.  Even a fresh reboot
didn't fix it.

Any ideas??

Reply via email to