On Mon, Aug 31, 2015 at 12:09 PM, twoflower <standa.ku...@gmail.com> wrote:

> I have the following three tables:
>
> DOCUMENT
>   id (indexed)
>   documenttype
>   date_last_updated: timestamp(6) (indexed)
>
> EXTERNAL_TRANSLATION_UNIT
>   id (indexed)
>   fk_id_document (indexed)
>
> EXTERNAL_TRANSLATION
>   id (indexed)
>   fk_id_translation_unit (indexed)
>
> Table sizes:
>  DOCUMENT: 381 000
>  EXTERNAL_TRANSLATION_UNIT: 76 000 000
>  EXTERNAL_TRANSLATION: 76 000 000
>
> Now the following query takes about 36 minutes to finish:
>
>  SELECT u.id AS id_external_translation_unit,
>     r.id AS id_external_translation,
>     u.fk_id_language AS fk_id_source_language,
>     r.fk_id_language AS fk_id_target_language,
>     doc.fk_id_job
>  FROM "EXTERNAL_TRANSLATION_UNIT" u
>      JOIN "DOCUMENT" doc ON u.fk_id_document = doc.id
>      JOIN "EXTERNAL_TRANSLATION" r ON u.id = r.fk_id_translation_unit
>  WHERE doc.date_last_updated >= date(now() - '171:00:00'::interval)
>  ORDER BY r.id LIMIT 1000
>
> This is the query plan:
>
> <http://postgresql.nabble.com/file/n5864045/qp1.png>
>
> If I remove the WHERE condition, it returns immediately.
>
>
So does "SELECT 1;", but since that doesn't give the same answer it is
not very relevant.


> Am I doing something obviously wrong?
>

Not obviously...


> Thank you for any ideas.
>

Consider updating the translation tables at the same time the document
table is updated.  That way you can apply the WHERE and ORDER BY clauses
against the same table.
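
A rough sketch of what I mean (the column, function, and trigger names
here are made up, and a sync trigger is only one way to propagate the
timestamp; note it adds write cost to every document update):

    ALTER TABLE "EXTERNAL_TRANSLATION"
        ADD COLUMN doc_last_updated timestamp(6);
    CREATE INDEX ON "EXTERNAL_TRANSLATION" (doc_last_updated);

    CREATE FUNCTION sync_doc_last_updated() RETURNS trigger AS $$
    BEGIN
        -- push the new document timestamp down to its translations
        UPDATE "EXTERNAL_TRANSLATION" r
           SET doc_last_updated = NEW.date_last_updated
          FROM "EXTERNAL_TRANSLATION_UNIT" u
         WHERE u.fk_id_document = NEW.id
           AND r.fk_id_translation_unit = u.id;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER document_last_updated_sync
        AFTER UPDATE OF date_last_updated ON "DOCUMENT"
        FOR EACH ROW EXECUTE PROCEDURE sync_doc_last_updated();

With that in place the WHERE and ORDER BY both target
"EXTERNAL_TRANSLATION", a single-table question the planner is much
better at estimating.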

I presume you've run ANALYZE on the data.
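
If not, it is cheap to do explicitly before re-checking the plan:

    ANALYZE "DOCUMENT";
    ANALYZE "EXTERNAL_TRANSLATION_UNIT";
    ANALYZE "EXTERNAL_TRANSLATION";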

I would probably try something like:

WITH docs AS (
   SELECT id FROM "DOCUMENT"
   WHERE date_last_updated >= date(now() - '171:00:00'::interval))
SELECT u.id, r.id  -- plus the other output columns from the original query
FROM "EXTERNAL_TRANSLATION_UNIT" u
   JOIN "EXTERNAL_TRANSLATION" r ON u.id = r.fk_id_translation_unit
WHERE EXISTS (SELECT 1 FROM docs WHERE u.fk_id_document = docs.id)
ORDER BY r.id LIMIT 1000

You are trying to avoid the NESTED LOOP, and the above has a decent chance
of materializing docs and then building either a bitmap or a hash for both
docs and translations, thus performing a single sequential scan over each
instead of performing 70+ million index lookups.
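
A quick way to test whether a non-nested-loop plan would actually win is
to discourage the nested loop for one session and compare (enable_nestloop
is a session-local planner setting, not something to leave off permanently):

    SET enable_nestloop = off;
    EXPLAIN (ANALYZE, BUFFERS) <the query above>;
    RESET enable_nestloop;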

Take this with a grain of salt as my fluency in this area is limited; I
tend to work by trial and error, and without the data that is difficult.

I'm not sure the planner could be smarter here, because you are asking it
a question it is not well suited to estimating: cross-table correlations.
Rethinking the model is likely to give you a better outcome long-term,
though it does seem like there should be room for improvement within the
stated query and model.

As Tomas said, you will likely benefit from increased work_mem in order
to make materializing and hashing/bitmapping favorable compared to a
nested loop.
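
For example, for just the session running this query (the right value
depends on your RAM and concurrency; 256MB is only a guess):

    SET work_mem = '256MB';
    -- then re-run EXPLAIN (ANALYZE) and see whether the plan changes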

David J.
