Re: Array of integer indexed nested-loop semi join

2022-05-23 Thread Mickael van der Beek
Hello Jeff,

Thank you again for your advice.

I did indeed think of the ARRAY_AGG() version of the query.
Although this method is very fast (and does use indexes) for smallish array
sizes, this is sadly not practical in my case because the arrays of
matching rows can reach multiple hundreds of thousands of rows.
I thought of maybe "batching" the ARRAY_AGG() in batches of max n rows in a
subquery and then calculating intersection on that but it doesn't seem
practical or faster in the end.
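
Roughly what I had in mind, just to illustrate the idea (the table and
column names are the real ones from earlier in this thread, the batch size
is arbitrary):

SELECT fu.*
FROM
  fact_users AS fu
WHERE
  EXISTS (
    SELECT
    FROM
      (
        SELECT array_agg(m.idx) AS idx_batch
        FROM
          (
            SELECT idx, row_number() OVER () AS rn
            FROM fact_pages
            WHERE attribute_idxs && ARRAY[201]
          ) AS m
        GROUP BY (m.rn - 1) / 10000
      ) AS b
    WHERE fu.w2_page_idxs && b.idx_batch
  );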

> You could just add a DISTINCT to get rid of the duplicates.  Of course
that will also take some time on a large returned data set, but probably
less time than scanning a giant table.  I think this is probably cleaner
than the alternatives.

Yes, and a GROUP BY will do the trick as well.
The fact that the current solution runs as a plain "nested loop" instead of
a "nested loop semi join" means that the query is much slower, because the
duplicated rows then have to be collapsed with GROUP BY.
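
For illustration, the de-duplicated join I am referring to looks roughly
like this (same table and column names as before; the DISTINCT could
equally be a GROUP BY):

SELECT DISTINCT fu.*
FROM
  fact_users AS fu
INNER JOIN
  fact_pages AS fp
  ON fu.w2_page_idxs && ARRAY[fp.idx]
WHERE
  fp.attribute_idxs && ARRAY[201];
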
This is why I tried various versions using EXISTS, ANY, ARRAY_AGG(), etc.,
to no avail.
Would you have an idea why PostgreSQL doesn't use the existing indexes for
this type of subquery?

> I don't know if you misunderstood.  I meant specifically the intarray
extension.  You can use integer arrays with built-in GIN indexes without
help from the intarray extension.  Maybe you know that already and are just
saying that the extension is even faster than the built-in indexed
operators are and you need that extra speed.

Are there specific advantages to not using the intarray extension and its
indexes in this case?
I was under the impression that it supported more operation types and was
generally faster for this niche use case.
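
For what it's worth, this is the difference I had in mind between the two
index flavours (the index names are made up, the column is the real one):

-- Built-in GIN support for integer[] (default array_ops operator class):
CREATE INDEX fact_users_w2_page_idxs_gin
  ON fact_users USING gin (w2_page_idxs);

-- Same column indexed through the intarray extension's operator class:
CREATE EXTENSION IF NOT EXISTS intarray;
CREATE INDEX fact_users_w2_page_idxs_gin_intarray
  ON fact_users USING gin (w2_page_idxs gin__int_ops);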

Thank you again for your help,

Mickael



On Mon, May 23, 2022 at 4:11 PM Jeff Janes  wrote:

> On Mon, May 23, 2022 at 3:57 AM Mickael van der Beek <
> mickael.van.der.b...@gmail.com> wrote:
>
>> Hello Jeff,
>>
>> Sadly, the query you suggested won't work because you are only returning
>> the first row of the matching inner query rows.
>>
>
> Sure, but the query I replaced did the same thing.  (I thought that was
> what you wanted, but I guess that was just to get it to run fast enough to
> ever finish--in that case it is probably better to use EXPLAIN without the
> ANALYZE so that we can see the plan of the correct query).  To get around
> that one-row limit you have to write it somewhat differently, getting rid
> of the ARRAY and adding an array_agg():
>
> SELECT fu.*
> FROM
>   fact_users AS fu
> WHERE
>   fu.w2_page_idxs && (select array_agg(idx) from fact_pages where
> attribute_idxs && ARRAY[201]);
>
> This way of writing it is better, as it still works with the LIMIT 1 but
> also works without it.  This still uses the indexes for me, at least when
> enable_seqscan is off.
>
>
>> The INNER JOIN version of the query will return all matching rows but
>> also include duplicates:
>>
>
> You could just add a DISTINCT to get rid of the duplicates.  Of course
> that will also take some time on a large returned data set, but probably
> less time than scanning a giant table.  I think this is probably cleaner
> than the alternatives.
>
>
>>
>> The reason I'm using integer arrays is because it is the only way I have
>> found in PostgreSQL to get fast inclusion / exclusion checks on large
>> datasets (hundreds of millions of values).
>> Did I misunderstand your response?
>>
>
> I don't know if you misunderstood.  I meant specifically the intarray
> extension.  You can use integer arrays with built-in GIN indexes without
> help from the intarray extension.  Maybe you know that already and are just
> saying that the extension is even faster than the built-in indexed
> operators are and you need that extra speed.
>
> Cheers,
>
> Jeff
>
>>

-- 
Mickael van der Beek
Web developer & Security analyst

mickael.van.der.b...@gmail.com


Re: Array of integer indexed nested-loop semi join

2022-05-23 Thread Jeff Janes
On Mon, May 23, 2022 at 3:57 AM Mickael van der Beek <
mickael.van.der.b...@gmail.com> wrote:

> Hello Jeff,
>
> Sadly, the query you suggested won't work because you are only returning
> the first row of the matching inner query rows.
>

Sure, but the query I replaced did the same thing.  (I thought that was
what you wanted, but I guess that was just to get it to run fast enough to
ever finish--in that case it is probably better to use EXPLAIN without the
ANALYZE so that we can see the plan of the correct query).  To get around
that one-row limit you have to write it somewhat differently, getting rid
of the ARRAY and adding an array_agg():

SELECT fu.*
FROM
  fact_users AS fu
WHERE
  fu.w2_page_idxs && (select array_agg(idx) from fact_pages where
attribute_idxs && ARRAY[201]);

This way of writing it is better, as it still works with the LIMIT 1 but
also works without it.  This still uses the indexes for me, at least when
enable_seqscan is off.
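
For reference, a quick way to check that (session-local setting, just for
looking at the plan):

SET enable_seqscan = off;
EXPLAIN
SELECT fu.*
FROM
  fact_users AS fu
WHERE
  fu.w2_page_idxs && (select array_agg(idx) from fact_pages where
attribute_idxs && ARRAY[201]);
RESET enable_seqscan;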


> The INNER JOIN version of the query will return all matching rows but also
> include duplicates:
>

You could just add a DISTINCT to get rid of the duplicates.  Of course that
will also take some time on a large returned data set, but probably less
time than scanning a giant table.  I think this is probably cleaner than
the alternatives.


>
> The reason I'm using integer arrays is because it is the only way I have
> found in PostgreSQL to get fast inclusion / exclusion checks on large
> datasets (hundreds of millions of values).
> Did I misunderstand your response?
>

I don't know if you misunderstood.  I meant specifically the intarray
extension.  You can use integer arrays with built-in GIN indexes without
help from the intarray extension.  Maybe you know that already and are just
saying that the extension is even faster than the built-in indexed
operators are and you need that extra speed.
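
If it helps, a catalog query along these lines shows which operators each
of the two GIN operator classes can serve, so you can see whether the &&
in your query is one your index actually supports:

SELECT opc.opcname, amop.amopopr::regoperator AS indexed_operator
FROM pg_opclass AS opc
JOIN pg_am   AS am   ON am.oid = opc.opcmethod
JOIN pg_amop AS amop ON amop.amopfamily = opc.opcfamily
WHERE am.amname = 'gin'
  AND opc.opcname IN ('array_ops', 'gin__int_ops')
ORDER BY opc.opcname;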

Cheers,

Jeff

>


Re: Array of integer indexed nested-loop semi join

2022-05-23 Thread Mickael van der Beek
Hello Jeff,

Sadly, the query you suggested won't work because you are only returning
the first row of the matching inner query rows.
Example:

SELECT
>   u.idx,
>   u.page_idxs
> FROM
>   (
> VALUES
>   (1, ARRAY[11, 21, 31]),
>   (2, ARRAY[12, 21, 32]),
>   (3, ARRAY[13, 23, 31])
>   )
> AS u(idx, page_idxs)
> WHERE
>   u.page_idxs && ARRAY[(
> SELECT
>   p.idx
> FROM
>   (
> VALUES
>   (11, ARRAY[101, 201, 301]),
>   (21, ARRAY[102, 201, 302]),
>   (13, ARRAY[103, 203, 301])
>   )
> AS p(idx, attribute_idxs)
> WHERE
>   p.attribute_idxs && ARRAY[201]
> FETCH FIRST 1 ROWS ONLY
>   )]
> ;


This query only returns one row while it should actually return two:

1 {11,21,31}


The INNER JOIN version of the query will return all matching rows but also
include duplicates:

SELECT
>   u.idx,
>   u.page_idxs
> FROM
>   (
> VALUES
>   (1, ARRAY[11, 21, 31]),
>   (2, ARRAY[12, 21, 32]),
>   (3, ARRAY[13, 23, 31])
>   )
> AS u(idx, page_idxs)
> INNER JOIN
>   (
> SELECT
>   p.idx
> FROM
>   (
> VALUES
>   (11, ARRAY[101, 201, 301]),
>   (21, ARRAY[102, 201, 302]),
>   (13, ARRAY[103, 203, 301])
>   )
> AS p(idx, attribute_idxs)
> WHERE
>   p.attribute_idxs && ARRAY[201]
>   )
>   AS p2
>   ON u.page_idxs && ARRAY[p2.idx]
> ;


Results:

1 {11,21,31}
> 1 {11,21,31}
> 2 {12,21,32}


As far as I know, the IN + sub-expression query can't work since the left
side of the operation is an array of integers and the right side a set of
rows with a single integer column.
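
A minimal illustration of the mismatch, in the style of the examples above
(it fails with an "operator does not exist: integer[] = integer" error):

SELECT
  u.idx,
  u.page_idxs
FROM
  (
    VALUES
      (1, ARRAY[11, 21, 31])
  )
AS u(idx, page_idxs)
WHERE
  u.page_idxs IN (
    SELECT
      p.idx
    FROM
      (
        VALUES
          (11, ARRAY[101, 201, 301])
      )
    AS p(idx, attribute_idxs)
    WHERE
      p.attribute_idxs && ARRAY[201]
  );
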
The reason I'm using integer arrays is because it is the only way I have
found in PostgreSQL to get fast inclusion / exclusion checks on large
datasets (hundreds of millions of values).
Did I misunderstand your response?
Thank you for the ongoing help,

Mickael

On Mon, May 23, 2022 at 4:45 AM Jeff Janes  wrote:

>
>
> On Fri, May 20, 2022 at 6:42 AM Mickael van der Beek <
> mickael.van.der.b...@gmail.com> wrote:
>
>>
>> Query:
>>
>> EXPLAIN (
>>>   ANALYZE,
>>>   VERBOSE,
>>>   COSTS,
>>>   BUFFERS,
>>>   TIMING
>>> )
>>> SELECT
>>>   fu.w2_page_idxs
>>> FROM
>>>   fact_users
>>> AS fu
>>> WHERE
>>>   EXISTS (
>>> SELECT
>>> FROM
>>>   (
>>> SELECT
>>>   ARRAY[idx] AS page_idx
>>> FROM
>>>   fact_pages
>>> WHERE
>>>   attribute_idxs && ARRAY[30160]
>>> FETCH FIRST 1 ROWS ONLY
>>>   )
>>> AS fp
>>> WHERE
>>>   fu.w2_page_idxs && fp.page_idx
>>>   )
>>> ;
>>
>>
>> Without any surprises, the planner is using a sequential scan on the
>> "fact_users" table which is very large instead of using the GIN index set
>> on the "w2_page_idxs" column.
>>
>
> For me, using the subquery in an expression, instead of the EXISTS, does
> get it to use the gin index.  And I think it must give the same results.
>
> SELECT
>   fu.w2_page_idxs
> FROM  fact_users AS fu
> WHERE
>   fu.w2_page_idxs && ARRAY[(select idx from fact_pages where
> attribute_idxs && ARRAY[3003] FETCH FIRST 1 ROWS ONLY)];
>
> But why are you using intarray?  That is unnecessary here, and by creating
> ambiguity about the array operators it might be harmful.
>
> Cheers,
>
> Jeff
>
>>

-- 
Mickael van der Beek
Web developer & Security analyst

mickael.van.der.b...@gmail.com