Re: Shared detoast Datum proposal

2024-05-23 Thread Andy Fan


Robert Haas  writes:

> On Wed, May 22, 2024 at 9:46 PM Andy Fan  wrote:
>> Please give me one more chance to explain this. What I mean is:
>>
>> Take SELECT f(a) FROM t1 join t2...; for example,
>>
>> When we read the Datum of a Var, we read it from tts->tts_values[*], no
>> matter what kind of TupleTableSlot it is.  So if we can put the "detoast
>> datum" back into the "original" tts_values, then nothing needs to be
>> changed.
>
> Yeah, I don't think you can do that.
>
>> - Saving the "detoast datum" version into tts_values[*] doesn't break
>>   the above semantics, and the ExprEval engine just gets a detoasted version,
>>   so it doesn't need to detoast it again.
>
> I don't think this is true. If it is true, it needs to be extremely
> well-justified. Even if this seems to work in simple cases, I suspect
> there will be cases where it breaks badly, resulting in memory leaks
> or server crashes or some other kind of horrible problem that I can't
> quite imagine. Unfortunately, my knowledge of this code isn't
> fantastic, so I can't say exactly what bad thing will happen, and I
> can't even say with 100% certainty that anything bad will happen, but
> I bet it will. It seems like it goes against everything I understand
> about how TupleTableSlots are supposed to be used. (Andres would be
> the best person to give a definitive answer here.)
>
>> - The key point is memory management and efficiency. For now I think
>>   all the places where "slot->tts_nvalid" is set to 0 mean the
>>   tts_values[*] are no longer valid, so those are the places where we should
>>   release the memory for the "detoast datum".  All the other places like
>>   ExecMaterializeSlot or ExecCopySlot also need to take the
>>   "detoast datum" into account.
>
> This might be a way of addressing some of the issues with this idea,
> but I doubt it will be acceptable. I don't think we want to complicate
> the slot API for this feature. There's too much downside to doing
> that, in terms of performance and understandability.

Thanks for sharing your doubts! Let's see if Andres has time to look at this. I
feel good about the current state.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-05-23 Thread Robert Haas
On Wed, May 22, 2024 at 9:46 PM Andy Fan  wrote:
> Please give me one more chance to explain this. What I mean is:
>
> Take SELECT f(a) FROM t1 join t2...; for example,
>
> When we read the Datum of a Var, we read it from tts->tts_values[*], no
> matter what kind of TupleTableSlot it is.  So if we can put the "detoast
> datum" back into the "original" tts_values, then nothing needs to be
> changed.

Yeah, I don't think you can do that.

> - Saving the "detoast datum" version into tts_values[*] doesn't break
>   the above semantics, and the ExprEval engine just gets a detoasted version,
>   so it doesn't need to detoast it again.

I don't think this is true. If it is true, it needs to be extremely
well-justified. Even if this seems to work in simple cases, I suspect
there will be cases where it breaks badly, resulting in memory leaks
or server crashes or some other kind of horrible problem that I can't
quite imagine. Unfortunately, my knowledge of this code isn't
fantastic, so I can't say exactly what bad thing will happen, and I
can't even say with 100% certainty that anything bad will happen, but
I bet it will. It seems like it goes against everything I understand
about how TupleTableSlots are supposed to be used. (Andres would be
the best person to give a definitive answer here.)

> - The key point is memory management and efficiency. For now I think
>   all the places where "slot->tts_nvalid" is set to 0 mean the
>   tts_values[*] are no longer valid, so those are the places where we should
>   release the memory for the "detoast datum".  All the other places like
>   ExecMaterializeSlot or ExecCopySlot also need to take the
>   "detoast datum" into account.

This might be a way of addressing some of the issues with this idea,
but I doubt it will be acceptable. I don't think we want to complicate
the slot API for this feature. There's too much downside to doing
that, in terms of performance and understandability.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Shared detoast Datum proposal

2024-05-22 Thread Andy Fan


Nikita Malakhov  writes:

> Hi,

> Andy, glad you've not lost interest in this work, I'm looking
> forward to your improvements!

Thanks for your kind words! I've adjusted to the rhythm of the community and
welcome more feedback :)

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-05-22 Thread Andy Fan

Robert Haas  writes:

> On Tue, May 21, 2024 at 10:02 PM Andy Fan  wrote:
>> One more thing I want to highlight: "syscache" is used for metadata
>> and the *detoast cache* is used for user data. User data is more
>> likely to be bigger than metadata, so cache size control (hence eviction logic
>> running more often) is more necessary in this case. The first graph in [1]
>> (including the quota text) highlights this idea.
>
> Yes.
>
>> I think that is the same idea as design a.
>
> Sounds similar.
>

This is really great to know!

>> I think the key points include:
>>
>> 1. Where to put the *detoast value* so that the ExprEval system can find
>> it cheaply. The solution in A is slot->tts_values[*].
>>
>> 2. How to manage the memory for holding the detoast value.
>>
>> The solution in A is:
>>
>> - use a lazily created MemoryContext in the slot.
>> - let the detoast system detoast the datum into the designated
>> MemoryContext.
>>
>> 3. How to let expression evaluation run the extra cycles when needed.
>
> I don't understand (3) but I agree that (1) and (2) are key points. In
> many - nearly all - circumstances just overwriting slot->tts_values is
> unsafe and will cause problems. To make this work, we've got to find
> some way of doing this that is compatible with general design of
> TupleTableSlot, and I think in that API, just about the only time you
> can touch slot->tts_values is when you're trying to populate a virtual
> slot. But often we'll have some other kind of slot. And even if we do
> have a virtual slot, I think you're only allowed to manipulate
> slot->tts_values up until you call ExecStoreVirtualTuple.

Please give me one more chance to explain this. What I mean is:

Take SELECT f(a) FROM t1 join t2...; for example,

When we read the Datum of a Var, we read it from tts->tts_values[*], no
matter what kind of TupleTableSlot it is.  So if we can put the "detoast
datum" back into the "original" tts_values, then nothing needs to be
changed.

Aside from efficiency considerations, safety and soundness are also
important considerations. So what I think now is:

- tts_values[*] holds the Datums of the slot, and it has its operations
  like slot_getsomeattrs, ExecMaterializeSlot, ExecCopySlot,
  ExecClearTuple and so on. All of these characteristics have nothing to do
  with what kind of slot it is.

- Saving the "detoast datum" version into tts_values[*] doesn't break
  the above semantics, and the ExprEval engine just gets a detoasted version,
  so it doesn't need to detoast it again.

- The key point is memory management and efficiency. For now I think
  all the places where "slot->tts_nvalid" is set to 0 mean the
  tts_values[*] are no longer valid, so those are the places where we should
  release the memory for the "detoast datum".  All the other places like
  ExecMaterializeSlot or ExecCopySlot also need to take the
  "detoast datum" into account.

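To illustrate that last point, here is a rough sketch of the idea; the struct
and function names are made up for illustration only and are not what the
patch uses:

#include "postgres.h"
#include "executor/tuptable.h"
#include "utils/memutils.h"

/* Hypothetical names; the point is only *where* the reset happens. */
typedef struct SlotDetoastState
{
    TupleTableSlot base;            /* the ordinary slot */
    MemoryContext  detoast_cxt;     /* lazily created; owns every detoasted
                                     * datum written into tts_values[*] */
} SlotDetoastState;

static void
slot_reset_detoast(SlotDetoastState *state)
{
    /*
     * Called from exactly the places that set slot->tts_nvalid = 0
     * (ExecClearTuple() and friends), so the detoasted copies never
     * outlive the tts_values[*] entries that point at them.
     */
    if (state->detoast_cxt != NULL)
        MemoryContextReset(state->detoast_cxt);
}
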
> That's why I mentioned using something temporary. You could legally
> (a) materialize the existing slot, (b) create a new virtual slot, (c)
> copy over the tts_values and tts_isnull arrays, (c) detoast any datums
> you like from tts_values and store the new datums back into the array,
> (d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
> of the original one. That would avoid repeated detoasting, but it
> would also have a bunch of overhead.

Yes, I agree about the overhead, so I'd prefer to find out first whether my
current strategy has any fundamental issue.

> Having to fully materialize the
> existing slot is a cost that you wouldn't always need to pay if you
> didn't do this, but you have to do it to make this work. Creating the
> new virtual slot costs something. Copying the arrays costs something.
> Detoasting costs quite a lot potentially, if you don't end up using
> the values. If you end up needing a heap tuple representation of the
> slot, you're going to now have to make a new one rather than just
> using the one that came straight from the buffer.

> But I do agree with you
> that *if* we could make it work, it would probably be better than a
> longer-lived cache.


I noticed the "if", and it's great to know that :), especially for the
efficiency and memory-usage-control purposes.

Another big challenge here is how to decide whether a Var needs this
pre-detoasting strategy; we have to consider:

- The in-place tts_values[*] update makes the slot bigger, which is
  harmful for the CP_SMALL_TLIST (createplan.c) purpose.
- Having ExprEval check at run time whether the Var is toastable is an
  overhead, so it must be decided at ExecInitExpr time.

The code complexity caused by the above two factors makes people (including
me) unhappy.

===
Yesterday I did some worst-case testing of the current strategy, and
the first case showed me a ~1% *unexpected* performance regression; I
tried to track it down with perf record/report, with no luck. That's the
reason why I didn't send a complete post yesterday.

For now, since we are talking about the high-level design, I think
it is OK to post the improved design 

Re: Shared detoast Datum proposal

2024-05-22 Thread Robert Haas
On Tue, May 21, 2024 at 10:02 PM Andy Fan  wrote:
> One more thing I want to highlight: "syscache" is used for metadata
> and the *detoast cache* is used for user data. User data is more
> likely to be bigger than metadata, so cache size control (hence eviction logic
> running more often) is more necessary in this case. The first graph in [1]
> (including the quota text) highlights this idea.

Yes.

> I think that is the same idea as design a.

Sounds similar.

> I think the key points include:
>
> 1. Where to put the *detoast value* so that the ExprEval system can find
> it cheaply. The solution in A is slot->tts_values[*].
>
> 2. How to manage the memory for holding the detoast value.
>
> The solution in A is:
>
> - use a lazily created MemoryContext in the slot.
> - let the detoast system detoast the datum into the designated
> MemoryContext.
>
> 3. How to let expression evaluation run the extra cycles when needed.

I don't understand (3) but I agree that (1) and (2) are key points. In
many - nearly all - circumstances just overwriting slot->tts_values is
unsafe and will cause problems. To make this work, we've got to find
some way of doing this that is compatible with general design of
TupleTableSlot, and I think in that API, just about the only time you
can touch slot->tts_values is when you're trying to populate a virtual
slot. But often we'll have some other kind of slot. And even if we do
have a virtual slot, I think you're only allowed to manipulate
slot->tts_values up until you call ExecStoreVirtualTuple.

That's why I mentioned using something temporary. You could legally
(a) materialize the existing slot, (b) create a new virtual slot, (c)
copy over the tts_values and tts_isnull arrays, (c) detoast any datums
you like from tts_values and store the new datums back into the array,
(d) call ExecStoreVirtualTuple, and then (e) use the new slot in place
of the original one. That would avoid repeated detoasting, but it
would also have a bunch of overhead. Having to fully materialize the
existing slot is a cost that you wouldn't always need to pay if you
didn't do this, but you have to do it to make this work. Creating the
new virtual slot costs something. Copying the arrays costs something.
Detoasting costs quite a lot potentially, if you don't end up using
the values. If you end up needing a heap tuple representation of the
slot, you're going to now have to make a new one rather than just
using the one that came straight from the buffer.
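
In rough pseudo-C, that sequence might look something like the sketch below;
the helper name is invented, "virtslot" is assumed to be a virtual slot created
with the same tuple descriptor, and freeing the detoasted copies is glossed
over:

#include "postgres.h"
#include "access/detoast.h"
#include "executor/tuptable.h"

static TupleTableSlot *
detoast_into_virtual_slot(TupleTableSlot *srcslot, TupleTableSlot *virtslot)
{
    TupleDesc   tupdesc = srcslot->tts_tupleDescriptor;

    ExecMaterializeSlot(srcslot);       /* (a) own the source tuple */
    slot_getallattrs(srcslot);          /* deform all attributes */

    ExecClearTuple(virtslot);           /* (b) start with an empty virtual slot */
    memcpy(virtslot->tts_values, srcslot->tts_values,   /* (c) copy the arrays */
           tupdesc->natts * sizeof(Datum));
    memcpy(virtslot->tts_isnull, srcslot->tts_isnull,
           tupdesc->natts * sizeof(bool));

    for (int i = 0; i < tupdesc->natts; i++)    /* (c') detoast what we want */
    {
        Form_pg_attribute att = TupleDescAttr(tupdesc, i);
        Datum       val = virtslot->tts_values[i];

        if (!virtslot->tts_isnull[i] && att->attlen == -1 &&
            VARATT_IS_EXTENDED(DatumGetPointer(val)))
            virtslot->tts_values[i] = PointerGetDatum(
                detoast_attr((struct varlena *) DatumGetPointer(val)));
    }

    ExecStoreVirtualTuple(virtslot);    /* (d) mark the virtual slot as valid */
    return virtslot;                    /* (e) use it in place of srcslot */
}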

So I don't think we could do something like this unconditionally.
We'd have to do it only when there was a reasonable expectation that
it would work out, which means we'd have to be able to predict fairly
accurately whether it was going to work out. But I do agree with you
that *if* we could make it work, it would probably be better than a
longer-lived cache.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Shared detoast Datum proposal

2024-05-22 Thread Nikita Malakhov
Hi,

Robert, thank you very much for your response to this thread.
I agree with most of things you've mentioned, but this proposal
is a PoC, and anyway has a long way to go to be presented
(if it ever would) as something to be committed.

Andy, glad you've not lost interest in this work, I'm looking
forward to your improvements!

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Shared detoast Datum proposal

2024-05-21 Thread Andy Fan


Andy Fan  writes:

> Hi Robert, 
>
>> Andy Fan asked me off-list for some feedback about this proposal. I
>> have hesitated to comment on it for lack of having studied the matter
>> in any detail, but since I've been asked for my input, here goes:
>
> Thanks for doing this! We have two totally different designs
> from Tomas and me, and I think we both understand the two sides of
> each design; it's just that the following things really puzzle me.

My emacs got stuck and somehow this incomplete message was sent out.
Please ignore this post for now. Sorry about that.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-05-21 Thread Andy Fan


Hi Robert, 

> Andy Fan asked me off-list for some feedback about this proposal. I
> have hesitated to comment on it for lack of having studied the matter
> in any detail, but since I've been asked for my input, here goes:

Thanks for doing this! We have two totally different designs
from Tomas and me, and I think we both understand the two sides of
each design; it's just that the following things really puzzle me.

- which factors we should care about more.
- the possibility of improving design A/B with respect to its weaknesses.

But for point 2, we probably need some preferred design first.

For now let's highlight that there are two totally different methods to
handle this problem. Let's call them:

design a:  detoast the datum into slot->tts_values (and manage the
memory with the lifespan of the *slot*). From me.

design b:  detoast the datum into a CACHE (and manage the memory with
CACHE eviction, released at the end of the query). From Tomas.

Currently I prefer design a; I'm not sure if I want design a because it is
really good or just because it is my own design. So I will
summarize the current state for easier discussion. I summarize it
so that you don't need to go through the long discussion, and I summarize
each point with a link to the original post, since I may have
misunderstood some points and the original post can be used as a double
check.

> I agree that there's a problem here, but I suspect this is not the
> right way to solve it. Let's compare this to something like the
> syscache. Specifically, let's think about the syscache on
> pg_class.relname.

OK.

> In the case of the syscache on pg_class.relname, there are two reasons
> why we might repeatedly look up the same values in the cache. One is
> that the code might do multiple name lookups when it really ought to
> do only one. Doing multiple lookups is bad for security and bad for
> performance, but we have code like that in some places and some of it
> is hard to fix. The other reason is that it is likely that the user
> will mention the same table names repeatedly; each time they do, they
> will trigger a lookup on the same name. By caching the result of the
> lookup, we can make it much faster. An important point to note here is
> that the queries that the user will issue in the future are unknown to
> us. In a certain sense, we cannot even know whether the same table
> name will be mentioned more than once in the same query: when we reach
> the first instance of the table name and look it up, we do not have
> any good way of knowing whether it will be mentioned again later, say
> due to a self-join. Hence, the pattern of cache lookups is
> fundamentally unknowable.

True. 

> But that's not the case in the motivating example that started this
> thread. In that case, the target list includes three separate
> references to the same column. We can therefore predict that if the
> column value is toasted, we're going to de-TOAST it exactly three
> times.

True.

> If, on the other hand, the column value were mentioned only
> once, it would be detoasted just once. In that latter case, which is
> probably quite common, this whole cache idea seems like it ought to
> just waste cycles, which makes me suspect that no matter what is done
> to this patch, there will always be cases where it causes significant
> regressions.

I agree.

> In the former case, where the column reference is
> repeated, it will win, but it will also hold onto the detoasted value
> after the third instance of detoasting, even though there will never
> be any more cache hits.

True.

> Here, unlike the syscache case, the cache is
> thrown away at the end of the query, so future queries can't benefit.
> Even if we could find a way to preserve the cache in some cases, it's
> not very likely to pay off. People are much more likely to hit the
> same table more than once than to hit the same values in the same
> table more than once in the same session.

I agree.

One more thing I want to highlight: "syscache" is used for metadata
and the *detoast cache* is used for user data. User data is more
likely to be bigger than metadata, so cache size control (hence eviction logic
running more often) is more necessary in this case. The first graph in [1]
(including the quota text) highlights this idea.

> Suppose we had a design where, when a value got detoasted, the
> detoasted value went into the place where the toasted value had come
> from. Then this problem would be handled pretty much perfectly: the
> detoasted value would last until the scan advanced to the next row,
> and then it would be thrown out. So there would be no query-lifespan
> memory leak.

I think that is the same idea as design a.

> Now, I don't really know how this would work, because the
> TOAST pointer is coming out of a slot and we can't necessarily write
> one attribute of any arbitrary slot type without a bunch of extra
> overhead, but maybe there's some way.

I think the key points include:

1. Where to put the *detoast 

Re: Shared detoast Datum proposal

2024-05-20 Thread Robert Haas
Hi,

Andy Fan asked me off-list for some feedback about this proposal. I
have hesitated to comment on it for lack of having studied the matter
in any detail, but since I've been asked for my input, here goes:

I agree that there's a problem here, but I suspect this is not the
right way to solve it. Let's compare this to something like the
syscache. Specifically, let's think about the syscache on
pg_class.relname.

In the case of the syscache on pg_class.relname, there are two reasons
why we might repeatedly look up the same values in the cache. One is
that the code might do multiple name lookups when it really ought to
do only one. Doing multiple lookups is bad for security and bad for
performance, but we have code like that in some places and some of it
is hard to fix. The other reason is that it is likely that the user
will mention the same table names repeatedly; each time they do, they
will trigger a lookup on the same name. By caching the result of the
lookup, we can make it much faster. An important point to note here is
that the queries that the user will issue in the future are unknown to
us. In a certain sense, we cannot even know whether the same table
name will be mentioned more than once in the same query: when we reach
the first instance of the table name and look it up, we do not have
any good way of knowing whether it will be mentioned again later, say
due to a self-join. Hence, the pattern of cache lookups is
fundamentally unknowable.

But that's not the case in the motivating example that started this
thread. In that case, the target list includes three separate
references to the same column. We can therefore predict that if the
column value is toasted, we're going to de-TOAST it exactly three
times. If, on the other hand, the column value were mentioned only
once, it would be detoasted just once. In that latter case, which is
probably quite common, this whole cache idea seems like it ought to
just waste cycles, which makes me suspect that no matter what is done
to this patch, there will always be cases where it causes significant
regressions. In the former case, where the column reference is
repeated, it will win, but it will also hold onto the detoasted value
after the third instance of detoasting, even though there will never
be any more cache hits. Here, unlike the syscache case, the cache is
thrown away at the end of the query, so future queries can't benefit.
Even if we could find a way to preserve the cache in some cases, it's
not very likely to pay off. People are much more likely to hit the
same table more than once than to hit the same values in the same
table more than once in the same session.

Suppose we had a design where, when a value got detoasted, the
detoasted value went into the place where the toasted value had come
from. Then this problem would be handled pretty much perfectly: the
detoasted value would last until the scan advanced to the next row,
and then it would be thrown out. So there would be no query-lifespan
memory leak. Now, I don't really know how this would work, because the
TOAST pointer is coming out of a slot and we can't necessarily write
one attribute of any arbitrary slot type without a bunch of extra
overhead, but maybe there's some way. If it could be done cheaply
enough, it would gain all the benefits of this proposal a lot more
cheaply and with fewer downsides. Alternatively, maybe there's a way
to notice the multiple references and introduce some kind of
intermediate slot or other holding area where the detoasted value can
be stored.

In other words, instead of computing

big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'

Maybe we ought to be trying to compute this:

big_jsonb_col=detoast(big_jsonb_col)
big_jsonb_col->'a',  big_jsonb_col->'b',  big_jsonb_col->'c'

Or perhaps this:

tmp=detoast(big_jsonb_col)
tmp->'a', tmp->'b', tmp->'c'

In still other words, a query-lifespan cache for this information
feels like it's tackling the problem at the wrong level. I do suspect
there may be query shapes where the same value gets detoasted multiple
times in ways that the proposals I just made wouldn't necessarily
catch, such as in the case of a self-join. But I think those cases are
rare. In most cases, repeated detoasting is probably very localized,
with the same exact TOAST pointer being de-TOASTed a couple of times
in quick succession, so a query-lifespan cache seems like the wrong
thing. Also, in most cases, repeated detoasting is probably quite
predictable: we could infer it from static analysis of the query. So
throwing ANY kind of cache at the problem seems like the wrong
approach unless the cache is super-cheap, which doesn't seem to be the
case with this proposal, because a cache caters to cases where we
can't know whether we're going to recompute the same result, and here,
that seems like something we could probably figure out.

Because I don't see any way to avoid regressions in the common case
where detoasting isn't repeated, I 

Re: Shared detoast Datum proposal

2024-03-15 Thread Nikita Malakhov
Hi!

Here's a slightly improved version of the patch Tomas provided above (v2),
with cache invalidation and slice caching added, still as a PoC.

The main issue I've encountered during tests is that when the same query
retrieves both slices and the full value - slices like substring(t,...) -
the order of the values is not predictable: with text fields the substrings
are retrieved before the full value, so we cannot benefit from caching the
full value first and serving the slices from the cached value.

Yet the cache code is still very compact and affects only sources related
to detoasting.

Tomas, about HASH_ENTER - according to the comments it can throw
an OOM error, so I've changed it to HASH_ENTER_NULL to avoid
new errors. In this case we would just leave the value uncached,
without an error.
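
Roughly, the pattern looks like the sketch below; the entry layout and the
function are only illustrative, not the exact code in the attached patch, and
it assumes the hash table's allocator can actually report allocation failure:

typedef struct ToastCacheEntry      /* illustrative layout only */
{
    struct
    {
        Oid     toastrelid;
        Oid     valueid;
    }           key;                /* hash key, must be first */
    struct varlena *data;           /* detoasted copy owned by the cache */
} ToastCacheEntry;

static void
toast_cache_store(HTAB *toast_cache, const void *key, struct varlena *datum)
{
    bool        found;
    ToastCacheEntry *entry;

    /*
     * HASH_ENTER would raise an "out of memory" error if a new entry cannot
     * be allocated; HASH_ENTER_NULL returns NULL instead, so on OOM we just
     * skip caching this value and carry on.
     */
    entry = (ToastCacheEntry *) hash_search(toast_cache, key,
                                            HASH_ENTER_NULL, &found);
    if (entry == NULL || found)
        return;

    entry->data = datum;
}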

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


0001-toast-cache-v3.patch
Description: Binary data


Re: Shared detoast Datum proposal

2024-03-07 Thread Tomas Vondra
On 3/7/24 08:33, Nikita Malakhov wrote:
> Hi!
> 
> Tomas, I thought about this issue -
>> What if you only need the first slice? In that case decoding everything
>> will be a clear regression.
> And completely agree with you, I'm currently working on it and will post
> a patch a bit later.

Cool. I don't plan to work on improving my patch - it was merely a PoC,
so if you're interested in working on that, that's good.

> Another issue we have here - if we have multiple slices requested -
> then we have to update cached slice from both sides, the beginning
> and the end.
> 

No opinion. I haven't thought about how to handle slices too much.

> On update, yep, you're right
>> Surely the update creates a new TOAST value, with a completely new TOAST
>> pointer, new rows in the TOAST table etc. It's not updated in-place. So
>> it'd end up with two independent entries in the TOAST cache.
> 
>> Or are you interested just in evicting the old value simply to free the
>> memory, because we're not going to need that (now invisible) value? That
>> seems interesting, but if it's too invasive or complex I'd just leave it
>> up to the LRU eviction (at least for v1).
> Again, yes, we do not need the old value after it was updated and
> it is better to take care of it explicitly. It's a simple and not invasive
> addition to your patch.
> 

OK

> I can't agree with you about large values - it makes sense
> to cache large values because the larger the value is, the more benefit
> we will have from not detoasting it again and again. One way
> is to keep them compressed, but what if we have data with a very
> low compression rate?
> 

I'm not against caching large values, but I also think it's not
difficult to construct cases where it's a losing strategy.

> I also changed the HASHACTION field value from HASH_ENTER to
> HASH_ENTER_NULL to deal gracefully with the case when we do not
> have enough memory, and added checks for whether the value was stored
> or not.
> 

I'm not sure about this. HASH_ENTER_NULL is only being used in three
very specific places (two are lock management, one is shmeminitstruct).
This hash table is not unique in any way, so why not use HASH_ENTER
like 99% of other hash tables?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-06 Thread Nikita Malakhov
Hi!

Tomas, I thought about this issue -
>What if you only need the first slice? In that case decoding everything
>will be a clear regression.
And I completely agree with you; I'm currently working on it and will post
a patch a bit later. Another issue we have here - if we have multiple
slices requested - then we have to update the cached slice from both sides,
the beginning and the end.

On update, yep, you're right
>Surely the update creates a new TOAST value, with a completely new TOAST
>pointer, new rows in the TOAST table etc. It's not updated in-place. So
>it'd end up with two independent entries in the TOAST cache.

>Or are you interested just in evicting the old value simply to free the
>memory, because we're not going to need that (now invisible) value? That
>seems interesting, but if it's too invasive or complex I'd just leave it
>up to the LRU eviction (at least for v1).
Again, yes, we do not need the old value after it was updated and
it is better to take care of it explicitly. It's a simple and not invasive
addition to your patch.

I can't agree with you about large values - it makes sense
to cache large values because the larger the value is, the more benefit
we will have from not detoasting it again and again. One way
is to keep them compressed, but what if we have data with a very
low compression rate?

I also changed the HASHACTION field value from HASH_ENTER to
HASH_ENTER_NULL to deal gracefully with the case when we do not
have enough memory, and added checks for whether the value was stored
or not.

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Shared detoast Datum proposal

2024-03-06 Thread Tomas Vondra



On 3/5/24 10:03, Nikita Malakhov wrote:
> Hi,
> 
> Tomas, sorry for confusion, in my previous message I meant exactly
> the same approach you've posted above, and came up with almost
> the same implementation.
> 
> Thank you very much for your attention to this thread!
> 
> I've asked Andy about this approach because of the same reasons you
> mentioned - it keeps cache code small, localized and easy to maintain.
> 
> The question that worries me is the memory limit, and I think that cache
> lookup and entry invalidation should be done in toast_tuple_externalize
> code, for the case where a previously detoasted value is updated
> in the same query, like
> 

I haven't thought very much about when we need to invalidate entries and
where exactly that should happen. My patch is a very rough PoC that I
wrote in ~1h, and it's quite possible it will have to move or should be
refactored in some way.

But I don't quite understand the problem with this query:

> UPDATE test SET value = value || '...';

Surely the update creates a new TOAST value, with a completely new TOAST
pointer, new rows in the TOAST table etc. It's not updated in-place. So
it'd end up with two independent entries in the TOAST cache.

Or are you interested just in evicting the old value simply to free the
memory, because we're not going to need that (now invisible) value? That
seems interesting, but if it's too invasive or complex I'd just leave it
up to the LRU eviction (at least for v1).

I'm not sure why anything should happen in toast_tuple_externalize; that
seems like a very low-level piece of code. But I realized there's also
toast_delete_external, and maybe that's a good place to invalidate the
deleted TOAST values (which would also cover the UPDATE).

> I've added cache entry invalidation and data removal on delete and update
> of the toasted values and am currently experimenting with large values.
> 

I haven't done anything about large values in the PoC patch, but I think
it might be a good idea to either ignore values that are too large (so
that it does not push everything else from the cache) or store them in
compressed form (assuming it's small enough, to at least save the I/O).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-06 Thread Tomas Vondra
On 3/6/24 07:09, Nikita Malakhov wrote:
> Hi,
> 
> In addition to the previous message - for the toast_fetch_datum_slice
> the first (seems obvious) way is to detoast the whole value, save it
> to cache and get slices from it on demand. I have another one on my
> mind, but have to play with it first.
> 

What if you only need the first slice? In that case decoding everything
will be a clear regression.

IMHO this needs to work like this:

1) check if the cache already has sufficiently large slice detoasted
(and use the cached value if yes)

2) if there's no slice detoasted, or if it's too small, detoast a
sufficiently large slice and store it in the cache (possibly evicting
the already detoasted slice)


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-05 Thread Nikita Malakhov
Hi,

In addition to the previous message - for the toast_fetch_datum_slice
the first (seems obvious) way is to detoast the whole value, save it
to cache and get slices from it on demand. I have another one on my
mind, but have to play with it first.

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Shared detoast Datum proposal

2024-03-05 Thread Nikita Malakhov
Hi,

Tomas, sorry for confusion, in my previous message I meant exactly
the same approach you've posted above, and came up with almost
the same implementation.

Thank you very much for your attention to this thread!

I've asked Andy about this approach because of the same reasons you
mentioned - it keeps cache code small, localized and easy to maintain.

The question that worries me is the memory limit, and I think that cache
lookup and entry invalidation should be done in toast_tuple_externalize
code, for the case where a previously detoasted value is updated
in the same query, like
UPDATE test SET value = value || '...';

I've added cache entry invalidation and data removal on delete and update
of the toasted values and am currently experimenting with large values.

On Tue, Mar 5, 2024 at 5:59 AM Andy Fan  wrote:

>
> >
> >> 2. It is more likely to use up all the memory which is allowed. For example,
> >> if we set the limit to 1MB, then we keep more data which will not be
> >> used and end up consuming all of the 1MB.
> >>
> >> My method resolves this with some help from other modules (kind of
> >> invasive) but can control the eviction well and use as little memory
> >> as possible.
> >>
> >
> > Is the memory usage really an issue? Sure, it'd be nice to evict entries
> > as soon as we can prove they are not needed anymore, but if the cache
> > limit is set to 1MB it's not really a problem to use 1MB.
>
> This might be a key point which leads us to some different directions, so
> I want to explain more about this, to see if we can get some agreement
> here.
>
> It is a bit hard to decide which memory limit to set, 1MB, 10MB or 40MB,
> 100MB. In my current case it is 40MB at least. Less memory limit
> causes cache ineffecitve, high memory limit cause potential memory
> use-up issue in the TOAST cache design. But in my method, even we set a
> higher value, it just limits the user case it really (nearly) needed,
> and it would not cache more values util the limit is hit. This would
> make a noticable difference when we want to set a high limit and we have
> some high active sessions, like 100 * 40MB = 4GB.
>
> > On 3/4/24 18:08, Andy Fan wrote:
> >> ...
> >>>
>  I assumed that releasing all of the memory at the end of executor once
>  is not an option since it may consumed too many memory. Then, when and
>  which entry to release becomes a trouble for me. For example:
> 
>    QUERY PLAN
>  --
>   Nested Loop
> Join Filter: (t1.a = t2.a)
> ->  Seq Scan on t1
> ->  Seq Scan on t2
>  (4 rows)
> 
>  In this case t1.a needs a longer lifespan than t2.a since it is
>  in outer relation. Without the help from slot's life-cycle system, I
>  can't think out a answer for the above question.
> 
> >>>
> >>> This is true, but how likely such plans are? I mean, surely no one
> would
> >>> do nested loop with sequential scans on reasonably large tables, so how
> >>> representative this example is?
> >>
> >> Actually this is the simplest join case; we still have the same problem with
> >> Nested Loop + Index Scan, which will be pretty common.
> >>
> >
> > Yes, I understand there are cases where LRU eviction may not be the best
> > choice - like here, where the "t1" should stay in the cache. But there
> > are also cases where this is the wrong choice, and LRU would be better.
> >
> > For example a couple paragraphs down you suggest to enforce the memory
> > limit by disabling detoasting if the memory limit is reached. That means
> > the detoasting can get disabled because there's a single access to the
> > attribute somewhere "up the plan tree". But what if the other attributes
> > (which now won't be detoasted) are accessed many times until then?
>
> I am not sure I follow here, but I want to explain more about
> the disable-detoast-sharing logic when the memory limit is hit. When
> this happens, the detoast sharing is disabled, but since the detoast
> datum will be released very soon when the slot->tts_values[*] is
> discarded, the 'disable' turns back to 'enable' quickly. So it is not
> true that once it gets disabled, it can't be enabled any more for the
> given query.
>
> > I think LRU is a pretty good "default" algorithm if we don't have a very
> > good idea of the exact life span of the values, etc. Which is why other
> > nodes (e.g. Memoize) use LRU too.
>
> > But I wonder if there's a way to count how many times an attribute is
> > accessed (or is likely to be). That might be used to inform a better
> > eviction strategy.
>
> Yes, but in the current issue we can get a better estimation with the help
> of the plan shape, and Memoize depends on some planner information as well.
> If we bypass the planner information and try to resolve it at the
> cache level, the code may become too complex as well, and all the cost is
> run-time overhead while the other way is a planning-time overhead.
>
> > Also, we don't 

Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan


>
>> 2. It is more likely to use up all the memory which is allowed. For example,
>> if we set the limit to 1MB, then we keep more data which will not be
>> used and end up consuming all of the 1MB.
>> 
>> My method resolves this with some help from other modules (kind of
>> invasive) but can control the eviction well and use as little memory
>> as possible.
>> 
>
> Is the memory usage really an issue? Sure, it'd be nice to evict entries
> as soon as we can prove they are not needed anymore, but if the cache
> limit is set to 1MB it's not really a problem to use 1MB.

This might be a key point which leads us in different directions, so
I want to explain more about this, to see if we can get some agreement
here.

It is a bit hard to decide which memory limit to set: 1MB, 10MB, 40MB or
100MB. In my current case it is 40MB at least. A lower memory limit
makes the cache ineffective, while a high memory limit causes a potential
memory use-up issue in the TOAST cache design. But in my method, even if we
set a higher value, it is just a limit on what is really (nearly) needed,
and it would not cache more values until the limit is hit. This would
make a noticeable difference when we want to set a high limit and we have
some highly active sessions, like 100 * 40MB = 4GB.

> On 3/4/24 18:08, Andy Fan wrote:
>> ...
>>>
 I assumed that releasing all of the memory at the end of executor once
 is not an option since it may consumed too many memory. Then, when and
 which entry to release becomes a trouble for me. For example:

   QUERY PLAN
 --
  Nested Loop
Join Filter: (t1.a = t2.a)
->  Seq Scan on t1
->  Seq Scan on t2
 (4 rows)

 In this case t1.a needs a longer lifespan than t2.a since it is
 in outer relation. Without the help from slot's life-cycle system, I
 can't think out a answer for the above question.

>>>
>>> This is true, but how likely such plans are? I mean, surely no one would
>>> do nested loop with sequential scans on reasonably large tables, so how
>>> representative this example is?
>> 
>> Actually this is the simplest join case; we still have the same problem with
>> Nested Loop + Index Scan, which will be pretty common.
>> 
>
> Yes, I understand there are cases where LRU eviction may not be the best
> choice - like here, where the "t1" should stay in the cache. But there
> are also cases where this is the wrong choice, and LRU would be better.
>
> For example a couple paragraphs down you suggest to enforce the memory
> limit by disabling detoasting if the memory limit is reached. That means
> the detoasting can get disabled because there's a single access to the
> attribute somewhere "up the plan tree". But what if the other attributes
> (which now won't be detoasted) are accessed many times until then?

I am not sure I follow here, but I want to explain more about
the disable-detoast-sharing logic when the memory limit is hit. When
this happens, the detoast sharing is disabled, but since the detoast
datum will be released very soon when the slot->tts_values[*] is
discarded, the 'disable' turns back to 'enable' quickly. So it is not
true that once it gets disabled, it can't be enabled any more for the
given query.

> I think LRU is a pretty good "default" algorithm if we don't have a very
> good idea of the exact life span of the values, etc. Which is why other
> nodes (e.g. Memoize) use LRU too.

> But I wonder if there's a way to count how many times an attribute is
> accessed (or is likely to be). That might be used to inform a better
> eviction strategy.

Yes, but in the current issue we can get a better estimation with the help
of the plan shape, and Memoize depends on some planner information as well.
If we bypass the planner information and try to resolve it at the
cache level, the code may become too complex as well, and all the cost is
run-time overhead while the other way is a planning-time overhead.

> Also, we don't need to evict the whole entry - we could evict just the
> data part (guaranteed to be fairly large), but keep the header, and keep
> the counts, expected number of hits, and other info. And use this to
> e.g. release entries that reached the expected number of hits. But I'd
> probably start with LRU and only do this as an improvement later.

A great lesson learnt here, thanks for sharing this!

As for the current use case, what I want to highlight is that we are
"caching" "user data" "locally".

USER DATA indicates it might be very large; it is not common to have
1M tables, but it is quite common to have 1M tuples in one scan, so
keeping the headers might add extra memory usage as well, like 10M * 24 bytes
= 240MB. LOCALLY means it is not friendly to many active sessions. CACHE
indicates it is hard to evict correctly. My method also has the USER
DATA and LOCALLY attributes, but it would be better at eviction. Eviction
then leads to the memory usage issue which is 

Re: Shared detoast Datum proposal

2024-03-04 Thread Tomas Vondra
On 3/4/24 18:08, Andy Fan wrote:
> ...
>>
>>> I assumed that releasing all of the memory at the end of the executor once
>>> is not an option since it may consume too much memory. Then, when and
>>> which entry to release becomes a problem for me. For example:
>>>
>>>   QUERY PLAN
>>> --
>>>  Nested Loop
>>>Join Filter: (t1.a = t2.a)
>>>->  Seq Scan on t1
>>>->  Seq Scan on t2
>>> (4 rows)
>>>
>>> In this case t1.a needs a longer lifespan than t2.a since it is
>>> in the outer relation. Without the help of the slot's life-cycle system, I
>>> can't think of an answer to the above question.
>>>
>>
>> This is true, but how likely such plans are? I mean, surely no one would
>> do nested loop with sequential scans on reasonably large tables, so how
>> representative this example is?
> 
> Actually this is the simplest join case; we still have the same problem with
> Nested Loop + Index Scan, which will be pretty common.
> 

Yes, I understand there are cases where LRU eviction may not be the best
choice - like here, where the "t1" should stay in the cache. But there
are also cases where this is the wrong choice, and LRU would be better.

For example a couple paragraphs down you suggest to enforce the memory
limit by disabling detoasting if the memory limit is reached. That means
the detoasting can get disabled because there's a single access to the
attribute somewhere "up the plan tree". But what if the other attributes
(which now won't be detoasted) are accessed many times until then?

I think LRU is a pretty good "default" algorithm if we don't have a very
good idea of the exact life span of the values, etc. Which is why other
nodes (e.g. Memoize) use LRU too.

But I wonder if there's a way to count how many times an attribute is
accessed (or is likely to be). That might be used to inform a better
eviction strategy.

Also, we don't need to evict the whole entry - we could evict just the
data part (guaranteed to be fairly large), but keep the header, and keep
the counts, expected number of hits, and other info. And use this to
e.g. release entries that reached the expected number of hits. But I'd
probably start with LRU and only do this as an improvement later.

>> Also, this leads me to the need of having some sort of memory limit. If
>> we may be keeping entries for extended periods of time, and we don't
>> have any way to limit the amount of memory, that does not seem great.
>>
>> AFAIK if we detoast everything into tts_values[] there's no way to
>> implement and enforce such limit. What happens if there's a row with
>> multiple large-ish TOAST values? What happens if those rows are in
>> different (and distant) parts of the plan?
> 
> I think this can be done by tracking the memory usage at the EState level
> or in a global variable, disabling it if it exceeds the limit, and
> resuming it when we free the detoast datums we no longer need. I think
> no other changes need to be done.
> 

That seems like a fair amount of additional complexity. And what if the
toasted values are accessed in a context without an EState (I haven't checked
how common / important that is)?

And I'm not sure about just disabling the detoast as a way to enforce a memory
limit, as explained earlier.

>> It seems far easier to limit the memory with the toast cache.
> 
> I think the memory limit and entry eviction are the key issues now. IMO,
> there are still some differences even when both methods can support a memory
> limit. The reason is my patch can guarantee the cached memory will be
> reused, so if we set the limit to 10MB, we know all of the 10MB is
> useful, but the TOAST cache method probably can't guarantee that, so
> when we want to make it effective, we have to set a higher limit for
> this.
> 

Can it actually guarantee that? It can guarantee the slot may be used,
but I don't see how it could guarantee the detoasted value will be used.
We may be keeping the slot for other reasons. And even if it could
guarantee the detoasted value will be used, does that actually prove
it's better to keep that value? What if it's only used once, but it's
blocking detoasting of values that will be used 10x that?

If we detoast a 10MB value on the outer side of the Nested Loop, what if
the inner path has multiple accesses to another 10MB value that now
can't be detoasted (as a shared value)?

> 
>>> Another difference between the 2 methods is that my method has many
>>> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>>> it can save some run-time effort, like the hash_search find / enter at run
>>> time in method 2, since I put the values directly into tts_values[*].
>>>
>>> I'm not sure factor 2 makes a real measurable difference in real
>>> cases, so my current concern mainly comes from factor 1.
>>> """
>>>
>>
>> This seems a bit dismissive of the (possible) issue.
> 
> Hmm, sorry about this, that is not my intention:(
> 

No need to apologize.

>> It'd be good to do
>> some measurements, especially on simple queries 

Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan


>
> I decided to whip up a PoC patch implementing this approach, to get a
> better idea of how it compares to the original proposal, both in terms
> of performance and code complexity.
>
> Attached are two patches that both add a simple cache in detoast.c, but
> differ in when exactly the caching happens.
>
> toast-cache-v1 - caching happens in toast_fetch_datum, which means this
> happens before decompression
>
> toast-cache-v2 - caching happens in detoast_attr, after decompression
...
>
> I think implementation-wise this is pretty non-invasive.
..
>
> Performance-wise, these patches are slower than with Andy's patch. For
> example for the jsonb Q1, I see master ~500ms, Andy's patch ~100ms and
> v2 at ~150ms. I'm sure there's a number of optimization opportunities
> and that could get v2 closer to the 100ms.
>
> v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
> as master, because the data set is small enough to be fully cached, so
> there's no I/O and the decompression is the actual bottleneck.
>
> That being said, I'm not convinced v1 would be a bad choice for some
> cases. If there's memory pressure enough to evict TOAST, it's quite
> possible the I/O would become the bottleneck. At which point it might be
> a good trade off to cache compressed data, because then we'd cache more
> detoasted values.
>
> OTOH it's likely the access to TOAST values is localized (in temporal
> sense), i.e. we read it from disk, detoast it a couple times in a short
> time interval, and then move to the next row. That'd mean v2 is probably
> the correct trade off.

I don't have any different thoughts about the above statements, and thanks for
the PoC code, which will make later testing much easier.

>> 
>> """
>> An alternative design: toast cache
>> -
>> 
>> This method is provided by Tomas during the review process. IIUC, this
>> method would maintain a local HTAB which maps a toast datum to a detoast
>> datum, and the entry is maintained / used in the detoast_attr
>> function. With this method, the overall design is pretty clear and the
>> code modification can be confined to the toasting system only.
>> 
>
> Right.
>
>> I assumed that releasing all of the memory at the end of the executor once
>> is not an option since it may consume too much memory. Then, when and
>> which entry to release becomes a problem for me. For example:
>> 
>>   QUERY PLAN
>> --
>>  Nested Loop
>>Join Filter: (t1.a = t2.a)
>>->  Seq Scan on t1
>>->  Seq Scan on t2
>> (4 rows)
>> 
>> In this case t1.a needs a longer lifespan than t2.a since it is
>> in the outer relation. Without the help of the slot's life-cycle system, I
>> can't think of an answer to the above question.
>> 
>
> This is true, but how likely such plans are? I mean, surely no one would
> do nested loop with sequential scans on reasonably large tables, so how
> representative this example is?

Actually this is the simplest join case; we still have the same problem with
Nested Loop + Index Scan, which will be pretty common.

> Also, this leads me to the need of having some sort of memory limit. If
> we may be keeping entries for extended periods of time, and we don't
> have any way to limit the amount of memory, that does not seem great.
>
> AFAIK if we detoast everything into tts_values[] there's no way to
> implement and enforce such limit. What happens if there's a row with
> multiple large-ish TOAST values? What happens if those rows are in
> different (and distant) parts of the plan?

I think this can be done by tracking the memory usage at the EState level
or in a global variable, disabling it if it exceeds the limit, and
resuming it when we free the detoast datums we no longer need. I think
no other changes need to be done.
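
A minimal sketch of what I mean (the names and the 40MB budget below are
hypothetical, just to show the shape of it):

/* bytes currently held as pre-detoasted datums in tts_values[*] */
static Size pre_detoast_bytes = 0;
static const Size pre_detoast_limit = 40 * 1024 * 1024;    /* hypothetical */

static bool
pre_detoast_allowed(Size add_bytes)
{
    /* once over budget, stop sharing until enough datums have been freed */
    return pre_detoast_bytes + add_bytes <= pre_detoast_limit;
}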

> It seems far easier to limit the memory with the toast cache.

I think the memory limit and entry eviction are the key issues now. IMO,
there are still some differences even when both methods can support a memory
limit. The reason is my patch can guarantee the cached memory will be
reused, so if we set the limit to 10MB, we know all of the 10MB is
useful, but the TOAST cache method probably can't guarantee that, so
when we want to make it effective, we have to set a higher limit for
this.


>> Another difference between the 2 methods is that my method has many
>> modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
>> it can save some run-time effort, like the hash_search find / enter at run
>> time in method 2, since I put the values directly into tts_values[*].
>> 
>> I'm not sure factor 2 makes a real measurable difference in real
>> cases, so my current concern mainly comes from factor 1.
>> """
>> 
>
> This seems a bit dismissive of the (possible) issue.

Hmm, sorry about this, that is not my intention:(

> It'd be good to do
> some measurements, especially on simple queries that can't benefit from
> the detoasting (e.g. because there's no varlena attribute).

This testing 

Re: Shared detoast Datum proposal

2024-03-04 Thread Tomas Vondra


On 3/4/24 02:23, Andy Fan wrote:
> 
> Tomas Vondra  writes:
> 
>> On 2/26/24 16:29, Andy Fan wrote:
>>>
>>> ...>
>>> I can understand the benefits of the TOAST cache, but the following
>>> issues look not good to me IIUC:
>>>
>>> 1). we can't put the result into tts_values[*] since without the planner
>>> decision, we don't know if this will break the small_tlist logic. But if we
>>> put it into the cache only, that means a cache lookup as an overhead is
>>> unavoidable.
>>
>> True - if you're comparing having the detoasted value in tts_values[*]
>> directly with having to do a lookup in a cache, then yes, there's a bit
>> of an overhead.
>>
>> But I think from the discussion it's clear having to detoast the value
>> into tts_values[*] has some weaknesses too, in particular:
>>
>> - It requires decisions which attributes to detoast eagerly, which is
>> quite invasive (having to walk the plan, ...).
>>
>> - I'm sure there will be cases where we choose to not detoast, but it
>> would be beneficial to detoast.
>>
>> - Detoasting just the initial slices does not seem compatible with this.
>>
>> IMHO the overhead of the cache lookup would be negligible compared to
>> the repeated detoasting of the value (which is the current baseline). I
>> somewhat doubt the difference compared to tts_values[*] will be even
>> measurable.
>>
>>>
>>> 2). It is hard to decide which entry should be evicted without attaching
>>> it to the TupleTableSlot's life-cycle, so we can't guarantee that the entry
>>> we keep is the entry we will reuse soon.
>>>
>>
>> True. But is that really a problem? I imagined we'd set some sort of
>> memory limit on the cache (work_mem?), and evict oldest entries. So the
>> entries would eventually get evicted, and the memory limit would ensure
>> we don't consume arbitrary amounts of memory.
>>
> 
> Here is a copy from the shared_detoast_datum.org in the patch. My
> feeling is that when / which entry to free is the key problem, while the
> run-time (detoast_attr) overhead vs the createplan.c overhead is a small
> difference. The overhead I paid for createplan.c/setref.c doesn't look
> huge either.

I decided to whip up a PoC patch implementing this approach, to get a
better idea of how it compares to the original proposal, both in terms
of performance and code complexity.

Attached are two patches that both add a simple cache in detoast.c, but
differ in when exactly the caching happens.

toast-cache-v1 - caching happens in toast_fetch_datum, which means this
happens before decompression

toast-cache-v2 - caching happens in detoast_attr, after decompression

I started with v1, but that did not help with the example workloads
(from the original post) at all. Only after I changed this to cache
decompressed data (in v2) it became competitive.

There's GUC to enable the cache (enable_toast_cache, off by default), to
make experimentation easier.

The cache is invalidated at the end of a transaction - I think this
should be OK, because the rows may be deleted but can't be cleaned up
(or replaced with a new TOAST value). This means the cache could cover
multiple queries - not sure if that's a good thing or bad thing.

I haven't implemented any sort of cache eviction. I agree that's a hard
problem in general, but I doubt we can do better than LRU. I also didn't
implement any memory limit.

FWIW I think it's a fairly serious issue Andy's approach does not have
any way to limit the memory. It will detoast the values eagerly, but
what if the rows have multiple large values? What if we end up keeping
multiple such rows (from different parts of the plan) at once?

I also haven't implemented caching for sliced data. I think modifying
the code to support this should not be difficult - it'd require tracking
how much data we have in the cache, and considering that during lookup
and when adding entries to cache.

I've implemented the cache as "global". Maybe it should be tied to query
or executor state, but then it'd be harder to access it from detoast.c
(we'd have to pass the cache pointer in some way, and it wouldn't work
for use cases that don't go through executor).

I think implementation-wise this is pretty non-invasive.
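
To sketch what the lookup side could look like inside detoast.c (the struct
and function names here are illustrative; the attached patch may differ):

typedef struct ToastCacheKey
{
    Oid         toastrelid;     /* va_toastrelid from the TOAST pointer */
    Oid         valueid;        /* va_valueid from the TOAST pointer */
} ToastCacheKey;

typedef struct ToastCacheEntry
{
    ToastCacheKey key;          /* hash key, must be first */
    struct varlena *data;       /* detoasted (and decompressed) copy */
} ToastCacheEntry;

static HTAB *toast_cache = NULL;    /* backend-local, dropped at end of xact */

/* Return the cached detoasted copy of an on-disk TOAST value, or NULL. */
static struct varlena *
toast_cache_lookup(struct varlena *attr)
{
    struct varatt_external toast_pointer;
    ToastCacheKey key;
    ToastCacheEntry *entry;

    if (toast_cache == NULL || !VARATT_IS_EXTERNAL_ONDISK(attr))
        return NULL;

    VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
    key.toastrelid = toast_pointer.va_toastrelid;
    key.valueid = toast_pointer.va_valueid;

    entry = (ToastCacheEntry *) hash_search(toast_cache, &key,
                                            HASH_FIND, NULL);
    return entry != NULL ? entry->data : NULL;
}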

Performance-wise, these patches are slower than Andy's patch. For
example for the jsonb Q1, I see master at ~500ms, Andy's patch at ~100ms
and v2 at ~150ms. I'm sure there's a number of optimization
opportunities, and that could get v2 closer to the 100ms.

v1 is not competitive at all in this jsonb/Q1 test - it's just as slow
as master, because the data set is small enough to be fully cached, so
there's no I/O and decompression is the actual bottleneck.

That being said, I'm not convinced v1 would be a bad choice for some
cases. If there's enough memory pressure to evict TOAST data, it's quite
possible the I/O would become the bottleneck. At that point it might be
a good trade-off to cache compressed data, because then we could keep
more values in the cache.

OTOH it's likely the access to TOAST values is 

Re: Shared detoast Datum proposal

2024-03-04 Thread Andy Fan


Tomas Vondra  writes:

>>> But I'm a bit surprised the patch needs to pass a
>>> MemoryContext to so many places as a function argument - that seems to
>>> go against how we work with memory contexts. Doubly so because it seems
>>> to only ever pass CurrentMemoryContext anyway. So what's that about?
>> 
>> I think you are talking about the argument like this:
>>  
>>  /* --
>> - * detoast_attr -
>> + * detoast_attr_ext -
>>   *
>>   *  Public entry point to get back a toasted value from compression
>>   *  or external storage.  The result is always non-extended varlena form.
>>   *
>> + * ctx: The memory context which the final value belongs to.
>> + *
>>   * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
>>   * datum, the result will be a pfree'able chunk.
>>   * --
>>   */
>> 
>> +extern struct varlena *
>> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx)
>> 
>> This is mainly because 'detoast_attr' will apply more memory than the
>> "final detoast datum" , for example the code to scan toast relation
>> needs some memory as well, and what I want is just keeping the memory
>> for the final detoast datum so that other memory can be released sooner,
>> so I added the function argument for that. 
>> 
>
> What exactly does detoast_attr allocate in order to scan toast relation?
> Does that happen in master, or just with the patch?

It is in the current master; for example, the TupleTableSlot creation
needed for scanning the toast relation requires a memory allocation.

> If with master, I
> suggest to ignore that / treat that as a separate issue and leave it for
> a different patch.

OK, I can make that a separate commit in the next version.

> In any case, the custom is to allocate results in the context that is
> set in CurrentMemoryContext at the moment of the call, and if there's
> substantial amount of allocations that we want to free soon, we either
> do that by explicit pfree() calls, or create a small temporary context
> in the function (but that's more expensive).
>
> I don't think we should invent a new convention where the context is
> passed as an argument, unless absolutely necessary.

Hmm, in this case, if we don't add the new argument, we have to get the
detoasted datum in one context and copy it into the desired memory
context, which is the thing I want to avoid.  I think we have the same
decision to make for the TOAST cache method as well.
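
To spell out the trade-off, here is a rough sketch of the two variants
being discussed. detoast_attr_ext() is the extra-argument function from
my patch and detoast_attr() is the existing entry point; the helper
itself is hypothetical, just for illustration:

#include "postgres.h"
#include "access/detoast.h"
#include "utils/memutils.h"
#include "varatt.h"

/* Sketch only: place the final detoasted value into "target". */
static Datum
detoast_into_context(Datum value, MemoryContext target, bool have_ext_api)
{
	struct varlena *attr = (struct varlena *) DatumGetPointer(value);
	struct varlena *result;

	if (have_ext_api)
	{
		/* allocate the final result directly in "target" (patch's way) */
		result = detoast_attr_ext(attr, target);
	}
	else
	{
		/* detoast in CurrentMemoryContext, then copy and free */
		struct varlena *tmp = detoast_attr(attr);

		result = (struct varlena *) MemoryContextAlloc(target,
													   VARSIZE_ANY(tmp));
		memcpy(result, tmp, VARSIZE_ANY(tmp));
		pfree(tmp);
	}

	return PointerGetDatum(result);
}

The second branch is exactly the extra allocation + copy + free that the
ctx argument is meant to avoid.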

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-04 Thread Tomas Vondra



On 3/4/24 02:29, Andy Fan wrote:
> 
> Tomas Vondra  writes:
> 
>> On 3/3/24 07:10, Andy Fan wrote:
>>>
>>> Hi,
>>>
>>> Here is a updated version, the main changes are:
>>>
>>> 1. an shared_detoast_datum.org file which shows the latest desgin and
>>> pending items during discussion. 
>>> 2. I removed the slot->pre_detoast_attrs totally.
>>> 3. handle some pg_detoast_datum_slice use case.
>>> 4. Some implementation improvement.
>>>
>>
>> I only very briefly skimmed the patch, and I guess most of my earlier
>> comments still apply.
> 
> Yes, the overall design is not changed.
> 
>> But I'm a bit surprised the patch needs to pass a
>> MemoryContext to so many places as a function argument - that seems to
>> go against how we work with memory contexts. Doubly so because it seems
>> to only ever pass CurrentMemoryContext anyway. So what's that about?
> 
> I think you are talking about the argument like this:
>  
>  /* --
> - * detoast_attr -
> + * detoast_attr_ext -
>   *
>   *   Public entry point to get back a toasted value from compression
>   *   or external storage.  The result is always non-extended varlena form.
>   *
> + * ctx: The memory context which the final value belongs to.
> + *
>   * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
>   * datum, the result will be a pfree'able chunk.
>   * --
>   */
> 
> +extern struct varlena *
> +detoast_attr_ext(struct varlena *attr, MemoryContext ctx)
> 
> This is mainly because 'detoast_attr' will apply more memory than the
> "final detoast datum" , for example the code to scan toast relation
> needs some memory as well, and what I want is just keeping the memory
> for the final detoast datum so that other memory can be released sooner,
> so I added the function argument for that. 
> 

What exactly does detoast_attr allocate in order to scan the toast
relation? Does that happen in master, or just with the patch? If with
master, I suggest ignoring that / treating it as a separate issue and
leaving it for a different patch.

In any case, the custom is to allocate results in the context that is
set in CurrentMemoryContext at the moment of the call, and if there's a
substantial amount of allocations that we want to free soon, we either
do that by explicit pfree() calls, or create a small temporary context
in the function (but that's more expensive).

I don't think we should invent a new convention where the context is
passed as an argument, unless absolutely necessary.
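
For illustration, the "small temporary context" pattern mentioned above
might look roughly like this (a hypothetical helper, not something the
patch contains):

#include "postgres.h"
#include "access/detoast.h"
#include "utils/memutils.h"
#include "varatt.h"

/*
 * Sketch only: do the detoasting work in a short-lived context so that
 * any scratch allocations are freed in one go, and copy just the final
 * value back into the caller's context.
 */
static struct varlena *
detoast_with_temp_context(struct varlena *attr)
{
	MemoryContext tmpcxt;
	MemoryContext oldcxt;
	struct varlena *tmp;
	struct varlena *result;

	tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
								   "detoast temp context",
								   ALLOCSET_SMALL_SIZES);
	oldcxt = MemoryContextSwitchTo(tmpcxt);

	/* all scratch allocations (toast scan, slots, ...) land in tmpcxt */
	tmp = detoast_attr(attr);

	MemoryContextSwitchTo(oldcxt);

	/* copy only the final value into the caller's current context */
	result = (struct varlena *) palloc(VARSIZE_ANY(tmp));
	memcpy(result, tmp, VARSIZE_ANY(tmp));

	MemoryContextDelete(tmpcxt);

	return result;
}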


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-03 Thread Andy Fan


Tomas Vondra  writes:

> On 3/3/24 07:10, Andy Fan wrote:
>> 
>> Hi,
>> 
>> Here is a updated version, the main changes are:
>> 
>> 1. an shared_detoast_datum.org file which shows the latest desgin and
>> pending items during discussion. 
>> 2. I removed the slot->pre_detoast_attrs totally.
>> 3. handle some pg_detoast_datum_slice use case.
>> 4. Some implementation improvement.
>> 
>
> I only very briefly skimmed the patch, and I guess most of my earlier
> comments still apply.

Yes, the overall design is not changed.

> But I'm a bit surprised the patch needs to pass a
> MemoryContext to so many places as a function argument - that seems to
> go against how we work with memory contexts. Doubly so because it seems
> to only ever pass CurrentMemoryContext anyway. So what's that about?

I think you are talking about the argument like this:
 
 /* --
- * detoast_attr -
+ * detoast_attr_ext -
  *
  * Public entry point to get back a toasted value from compression
  * or external storage.  The result is always non-extended varlena form.
  *
+ * ctx: The memory context which the final value belongs to.
+ *
  * Note some callers assume that if the input is an EXTERNAL or COMPRESSED
  * datum, the result will be a pfree'able chunk.
  * --
  */

+extern struct varlena *
+detoast_attr_ext(struct varlena *attr, MemoryContext ctx)

This is mainly because 'detoast_attr' will allocate more memory than
just the "final detoast datum"; for example the code that scans the
toast relation needs some memory as well, and what I want is to keep
only the memory for the final detoasted datum so that the other memory
can be released sooner, so I added the function argument for that.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-03 Thread Andy Fan


Tomas Vondra  writes:

> On 2/26/24 16:29, Andy Fan wrote:
>>
>> ...>
>> I can understand the benefits of the TOAST cache, but the following
>> issues looks not good to me IIUC: 
>> 
>> 1). we can't put the result to tts_values[*] since without the planner
>> decision, we don't know if this will break small_tlist logic. But if we
>> put it into the cache only, which means a cache-lookup as a overhead is
>> unavoidable.
>
> True - if you're comparing having the detoasted value in tts_values[*]
> directly with having to do a lookup in a cache, then yes, there's a bit
> of an overhead.
>
> But I think from the discussion it's clear having to detoast the value
> into tts_values[*] has some weaknesses too, in particular:
>
> - It requires decisions which attributes to detoast eagerly, which is
> quite invasive (having to walk the plan, ...).
>
> - I'm sure there will be cases where we choose to not detoast, but it
> would be beneficial to detoast.
>
> - Detoasting just the initial slices does not seem compatible with this.
>
> IMHO the overhead of the cache lookup would be negligible compared to
> the repeated detoasting of the value (which is the current baseline). I
> somewhat doubt the difference compared to tts_values[*] will be even
> measurable.
>
>> 
>> 2). It is hard to decide which entry should be evicted without attaching
>> it to the TupleTableSlot's life-cycle. then we can't grantee the entry
>> we keep is the entry we will reuse soon?
>> 
>
> True. But is that really a problem? I imagined we'd set some sort of
> memory limit on the cache (work_mem?), and evict oldest entries. So the
> entries would eventually get evicted, and the memory limit would ensure
> we don't consume arbitrary amounts of memory.
>

Here is a copy from the shared_detoast_datum.org in the patch. My
feeling is that when / which entry to free is the key problem, while the
run-time (detoast_attr) overhead vs. the createplan.c overhead is a
small difference; the overhead I pay in createplan.c/setref.c doesn't
look huge either.

"""
An alternative design: toast cache
-

This method was suggested by Tomas during the review process. IIUC, this
method would maintain a local HTAB which maps a toast datum to a
detoasted datum, and the entry is maintained / used in the detoast_attr
function. With this method, the overall design is pretty clear and the
code modification can be confined to the toasting system only.

I assumed that releasing all of the memory at the end of the executor is
not an option since it may consume too much memory. Then, when and which
entry to release becomes the hard part for me. For example:

  QUERY PLAN
--
 Nested Loop
   Join Filter: (t1.a = t2.a)
   ->  Seq Scan on t1
   ->  Seq Scan on t2
(4 rows)

In this case t1.a needs a longer lifespan than t2.a since it is in the
outer relation. Without help from the slot's life-cycle system, I can't
come up with an answer to the above question.

Another difference between the two methods is that my method has many
modifications in createplan.c/setref.c/execExpr.c/execExprInterp.c, but
it can save some run-time effort, like the hash_search find / enter
calls in method 2, since I put the detoasted datums directly into
tts_values[*].

I'm not sure factor 2 makes a real, measurable difference in practice,
so my current concern mainly comes from factor 1.
"""


-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-03 Thread Andy Fan


Tomas Vondra  writes:

> On 3/3/24 02:52, Andy Fan wrote:
>> 
>> Hi Nikita,
>> 
>>>
>>> Have you considered another one - to alter pg_detoast_datum
>>> (actually, it would be detoast_attr function) and save
>>> detoasted datums in the detoast context derived
>>> from the query context? 
>>>
>>> We have just enough information at this step to identify
>>> the datum - toast relation id and value id, and could
>>> keep links to these detoasted values in a, say, linked list
>>>  or hash table. Thus we would avoid altering the executor
>>> code and all detoast-related code would reside within
>>> the detoast source files?
>> 
>> I think you are talking about the way Tomas provided.  I am really
>> afraid that I was thought of too self-opinionated, but I do have some
>> concerns about this approch as I stated here [1], looks my concerns is
>> still not addressed, or the concerns itself are too absurd which is
>> really possible I think? 
>> 
>
> I can't speak for others, but I did not consider you and your
> proposals too self-opinionated. 

That is really great to hear :)

> You did
> propose a solution that you consider the right one. That's fine. People
> will question that and suggest possible alternatives. That's fine too,
> it's why we have this list in the first place.

I agree.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-03 Thread Tomas Vondra
On 3/3/24 07:10, Andy Fan wrote:
> 
> Hi,
> 
> Here is a updated version, the main changes are:
> 
> 1. an shared_detoast_datum.org file which shows the latest desgin and
> pending items during discussion. 
> 2. I removed the slot->pre_detoast_attrs totally.
> 3. handle some pg_detoast_datum_slice use case.
> 4. Some implementation improvement.
> 

I only very briefly skimmed the patch, and I guess most of my earlier
comments still apply. But I'm a bit surprised the patch needs to pass a
MemoryContext to so many places as a function argument - that seems to
go against how we work with memory contexts. Doubly so because it seems
to only ever pass CurrentMemoryContext anyway. So what's that about?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-03 Thread Tomas Vondra



On 2/26/24 16:29, Andy Fan wrote:
>
> ...>
> I can understand the benefits of the TOAST cache, but the following
> issues looks not good to me IIUC: 
> 
> 1). we can't put the result to tts_values[*] since without the planner
> decision, we don't know if this will break small_tlist logic. But if we
> put it into the cache only, which means a cache-lookup as a overhead is
> unavoidable.

True - if you're comparing having the detoasted value in tts_values[*]
directly with having to do a lookup in a cache, then yes, there's a bit
of an overhead.

But I think from the discussion it's clear having to detoast the value
into tts_values[*] has some weaknesses too, in particular:

- It requires decisions which attributes to detoast eagerly, which is
quite invasive (having to walk the plan, ...).

- I'm sure there will be cases where we choose to not detoast, but it
would be beneficial to detoast.

- Detoasting just the initial slices does not seem compatible with this.

IMHO the overhead of the cache lookup would be negligible compared to
the repeated detoasting of the value (which is the current baseline). I
somewhat doubt the difference compared to tts_values[*] will be even
measurable.

> 
> 2). It is hard to decide which entry should be evicted without attaching
> it to the TupleTableSlot's life-cycle. then we can't grantee the entry
> we keep is the entry we will reuse soon?
> 

True. But is that really a problem? I imagined we'd set some sort of
memory limit on the cache (work_mem?), and evict oldest entries. So the
entries would eventually get evicted, and the memory limit would ensure
we don't consume arbitrary amounts of memory.

We could also add some "slot callback" but that seems unnecessary.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-03 Thread Tomas Vondra
On 3/3/24 02:52, Andy Fan wrote:
> 
> Hi Nikita,
> 
>>
>> Have you considered another one - to alter pg_detoast_datum
>> (actually, it would be detoast_attr function) and save
>> detoasted datums in the detoast context derived
>> from the query context? 
>>
>> We have just enough information at this step to identify
>> the datum - toast relation id and value id, and could
>> keep links to these detoasted values in a, say, linked list
>>  or hash table. Thus we would avoid altering the executor
>> code and all detoast-related code would reside within
>> the detoast source files?
> 
> I think you are talking about the way Tomas provided.  I am really
> afraid that I was thought of too self-opinionated, but I do have some
> concerns about this approch as I stated here [1], looks my concerns is
> still not addressed, or the concerns itself are too absurd which is
> really possible I think? 
> 

I'm not sure I understand your concerns. I can't speak for others, but I
did not consider you and your proposals too self-opinionated. You did
propose a solution that you consider the right one. That's fine. People
will question that and suggest possible alternatives. That's fine too,
it's why we have this list in the first place.

FWIW I'm by no means sure the approach I suggested is guaranteed to be
better than "your" approach. Maybe there's some issue that I missed, for
example.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-03-02 Thread Andy Fan

Hi,

Here is an updated version; the main changes are:

1. a shared_detoast_datum.org file which shows the latest design and
pending items during the discussion.
2. I removed the slot->pre_detoast_attrs totally.
3. handle some pg_detoast_datum_slice use cases.
4. Some implementation improvements.

commit 66c64c197a5dab97a563be5a291127e4c5d6841d (HEAD -> shared_detoast_value)
Author: yizhi.fzh 
Date:   Sun Mar 3 13:48:25 2024 +0800

shared detoast datum

See the overall design & alternative design & testing in
shared_detoast_datum.org

In the shared_detoast_datum.org, I added the alternative design part for
the idea of TOAST cache.

-- 
Best Regards
Andy Fan

>From 66c64c197a5dab97a563be5a291127e4c5d6841d Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Sun, 3 Mar 2024 13:48:25 +0800
Subject: [PATCH v9 1/1] shared detoast datum

See the overall design & alternative design & testing in
shared_detoast_datum.org
---
 src/backend/access/common/detoast.c   |  68 +-
 src/backend/access/common/toast_compression.c |  10 +-
 src/backend/executor/execExpr.c   |  60 +-
 src/backend/executor/execExprInterp.c | 179 ++
 src/backend/executor/execTuples.c | 127 
 src/backend/executor/execUtils.c  |   2 +
 src/backend/executor/nodeHashjoin.c   |   2 +
 src/backend/executor/nodeMergejoin.c  |   2 +
 src/backend/executor/nodeNestloop.c   |   1 +
 src/backend/executor/shared_detoast_datum.org | 203 ++
 src/backend/jit/llvm/llvmjit_expr.c   |  26 +-
 src/backend/jit/llvm/llvmjit_types.c  |   1 +
 src/backend/optimizer/plan/createplan.c   | 107 +++-
 src/backend/optimizer/plan/setrefs.c  | 590 +++---
 src/include/access/detoast.h  |   3 +
 src/include/access/toast_compression.h|   4 +-
 src/include/executor/execExpr.h   |  12 +
 src/include/executor/tuptable.h   |  14 +
 src/include/nodes/execnodes.h |  14 +
 src/include/nodes/plannodes.h |  53 ++
 src/tools/pgindent/typedefs.list  |   2 +
 21 files changed, 1342 insertions(+), 138 deletions(-)
 create mode 100644 src/backend/executor/shared_detoast_datum.org

diff --git a/src/backend/access/common/detoast.c b/src/backend/access/common/detoast.c
index 3547cdba56..acc9644689 100644
--- a/src/backend/access/common/detoast.c
+++ b/src/backend/access/common/detoast.c
@@ -22,11 +22,11 @@
 #include "utils/expandeddatum.h"
 #include "utils/rel.h"
 
-static struct varlena *toast_fetch_datum(struct varlena *attr);
+static struct varlena *toast_fetch_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_fetch_datum_slice(struct varlena *attr,
 			   int32 sliceoffset,
 			   int32 slicelength);
-static struct varlena *toast_decompress_datum(struct varlena *attr);
+static struct varlena *toast_decompress_datum(struct varlena *attr, MemoryContext ctx);
 static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32 slicelength);
 
 /* --
@@ -42,7 +42,7 @@ static struct varlena *toast_decompress_datum_slice(struct varlena *attr, int32
  * --
  */
 struct varlena *
-detoast_external_attr(struct varlena *attr)
+detoast_external_attr_ext(struct varlena *attr, MemoryContext ctx)
 {
 	struct varlena *result;
 
@@ -51,7 +51,7 @@ detoast_external_attr(struct varlena *attr)
 		/*
 		 * This is an external stored plain value
 		 */
-		result = toast_fetch_datum(attr);
+		result = toast_fetch_datum(attr, ctx);
 	}
 	else if (VARATT_IS_EXTERNAL_INDIRECT(attr))
 	{
@@ -68,13 +68,13 @@ detoast_external_attr(struct varlena *attr)
 
 		/* recurse if value is still external in some other way */
 		if (VARATT_IS_EXTERNAL(attr))
-			return detoast_external_attr(attr);
+			return detoast_external_attr_ext(attr, ctx);
 
 		/*
 		 * Copy into the caller's memory context, in case caller tries to
 		 * pfree the result.
 		 */
-		result = (struct varlena *) palloc(VARSIZE_ANY(attr));
+		result = (struct varlena *) MemoryContextAlloc(ctx, VARSIZE_ANY(attr));
 		memcpy(result, attr, VARSIZE_ANY(attr));
 	}
 	else if (VARATT_IS_EXTERNAL_EXPANDED(attr))
@@ -87,7 +87,7 @@ detoast_external_attr(struct varlena *attr)
 
 		eoh = DatumGetEOHP(PointerGetDatum(attr));
 		resultsize = EOH_get_flat_size(eoh);
-		result = (struct varlena *) palloc(resultsize);
+		result = (struct varlena *) MemoryContextAlloc(ctx, resultsize);
 		EOH_flatten_into(eoh, (void *) result, resultsize);
 	}
 	else
@@ -101,32 +101,45 @@ detoast_external_attr(struct varlena *attr)
 	return result;
 }
 
+struct varlena *
+detoast_external_attr(struct varlena *attr)
+{
+	return detoast_external_attr_ext(attr, CurrentMemoryContext);
+}
+
 
 /* --
- * detoast_attr -
+ * detoast_attr_ext -
  *
  *	Public entry point to get back a toasted value from compression
  *	or external storage.  The result is always non-extended varlena form.
  *
+ * ctx: 

Re: Shared detoast Datum proposal

2024-03-02 Thread Andy Fan


Hi Nikita,

>
> Have you considered another one - to alter pg_detoast_datum
> (actually, it would be detoast_attr function) and save
> detoasted datums in the detoast context derived
> from the query context? 
>
> We have just enough information at this step to identify
> the datum - toast relation id and value id, and could
> keep links to these detoasted values in a, say, linked list
>  or hash table. Thus we would avoid altering the executor
> code and all detoast-related code would reside within
> the detoast source files?

I think you are talking about the approach Tomas suggested.  I am really
afraid of coming across as too self-opinionated, but I do have some
concerns about this approach, as I stated here [1]; it looks like my
concerns are still not addressed, or perhaps the concerns themselves are
too absurd, which is quite possible I think?

[1] https://www.postgresql.org/message-id/875xyb1a6q.fsf%40163.com

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-03-02 Thread Nikita Malakhov
Hi, Andy!

Sorry for the delay, I have had long flights this week.
I've reviewed the patch set, thank you for your efforts.
I have several notes about the patch set code, but first of all
I'm not sure the overall approach is the best for the task.

As Tomas wrote above, the approach is very invasive
and spreads code related to detoasting among many
parts of code.

Have you considered another one - to alter pg_detoast_datum
(actually, it would be detoast_attr function) and save
detoasted datums in the detoast context derived
from the query context?

We have just enough information at this step to identify
the datum - toast relation id and value id, and could
keep links to these detoasted values in a, say, linked list
 or hash table. Thus we would avoid altering the executor
code and all detoast-related code would reside within
the detoast source files?

I'd check this approach in several days and would
report on the result here.

There are also comments on the code itself, I'd write them
a bit later.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Shared detoast Datum proposal

2024-02-26 Thread Andy Fan


Nikita Malakhov  writes:

> Hi,
>
> Tomas, we already have a working jsonb partial detoast prototype,
> and currently I'm porting it to the recent master.

This is really awesome! Actually, when I talked to some MySQL folks,
they said MySQL already does this, and I admit it can resolve some
issues that are different from the ones the current patch targets. It's
just that I have many implementation questions in my mind.

> Andy, thank you! I'll check the last patch set out and reply in a day
> or two.

Thank you for your attention!


-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-02-26 Thread Andy Fan


> On 2/26/24 14:22, Andy Fan wrote:
>> 
>>...
>>
>>> Also, toasted values
>>> are not always being used immediately and as a whole, i.e. jsonb values are 
>>> fully
>>> detoasted (we're working on this right now) to extract the smallest value 
>>> from
>>> big json, and these values are not worth keeping in memory. For text values 
>>> too,
>>> we often do not need the whole value to be detoasted.
>> 
>> I'm not sure how often this is true, espeically you metioned text data
>> type. I can't image why people acquire a piece of text and how can we
>> design a toasting system to fulfill such case without detoast the whole
>> as the first step. But for the jsonb or array data type, I think it is
>> more often. However if we design toasting like that, I'm not sure if it
>> would slow down other user case. for example detoasting the whole piece
>> use case. I am justing feeling both way has its user case, kind of heap
>> store and columna store.  
>> 
>
> Any substr/starts_with call benefits from this optimization, and such
> calls don't seem that uncommon. I certainly can imagine people doing
> this fairly often.

This leads me to pay more attention to the pg_detoast_datum_slice use
case, which I had overlooked and which caused some regression here.
Thanks!

As you said later:

> Is there a way to identify cases that are likely to benefit from this
> slicing, and disable the detoasting for them? We already disable it for
> other cases, so maybe we can do this here too?

I think the answer is yes, if our goal is to disable the whole detoast
for some special calls like substr/starts_with. In the current patch, we have:

/*
 * increase_level_for_pre_detoast
 *  Check if the given Expr could detoast a Var directly, if yes,
 * increase the level and return true. otherwise return false;
 */
static inline void
increase_level_for_pre_detoast(Node *node, intermediate_level_context *ctx)
{
	/* The following nodes can't detoast a Var directly. */
	if (IsA(node, List) || IsA(node, TargetEntry) || IsA(node, NullTest))
	{
		ctx->level_added = false;
	}
	else if (IsA(node, FuncExpr) &&
			 castNode(FuncExpr, node)->funcid == F_PG_COLUMN_COMPRESSION)
	{
		/* let's not detoast first so that pg_column_compression works. */
		ctx->level_added = false;
	}

While it works, the blacklist approach doesn't look great.

> In any case, it's a long-standing behavior /
> optimization, and I don't think we can just dismiss it this quickly.

I agree. So I said both have their use cases, and when I say *both*, the
other method I mean is the "partial detoast prototype" which Nikita has
told me about before. While I am sure it has many use cases, I also feel
it is complex and I have many questions in my mind, but I'd like to see
the design before asking them.

> Or maybe there's a way to do the detoasting lazily, and only keep the
> detoasted value when it's requesting the whole value. Or perhaps even
> better, remember what part we detoasted, and maybe it'll be enough for
> future requests?
>
> I'm not sure how difficult would this be with the proposed approach,
> which eagerly detoasts the value into tts_values. I think it'd be
> easier to implement with the TOAST cache I briefly described ... 

I can understand the benefits of the TOAST cache, but the following
issues don't look good to me, IIUC:

1). We can't put the result into tts_values[*], since without the
planner's decision we don't know if this would break the small_tlist
logic. But if we put it into the cache only, a cache lookup becomes an
unavoidable overhead.

2). It is hard to decide which entry should be evicted without attaching
it to the TupleTableSlot's life-cycle; otherwise we can't guarantee that
the entry we keep is the entry we will reuse soon.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-02-26 Thread Nikita Malakhov
Hi,

Tomas, we already have a working jsonb partial detoast prototype,
and currently I'm porting it to the recent master. Due to the size
 of the changes and very invasive nature it takes a lot of effort,
but it is already done. I'm also trying to make the core patch
less invasive. Actually, it is a subject for a separate topic, as soon
as I make it work on the current master we'll propose it
to the community.

Andy, thank you! I'll check the last patch set out and reply in a day or
two.

--
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Shared detoast Datum proposal

2024-02-26 Thread Tomas Vondra


On 2/26/24 14:22, Andy Fan wrote:
> 
>...
>
>> Also, toasted values
>> are not always being used immediately and as a whole, i.e. jsonb values are 
>> fully
>> detoasted (we're working on this right now) to extract the smallest value 
>> from
>> big json, and these values are not worth keeping in memory. For text values 
>> too,
>> we often do not need the whole value to be detoasted.
> 
> I'm not sure how often this is true, espeically you metioned text data
> type. I can't image why people acquire a piece of text and how can we
> design a toasting system to fulfill such case without detoast the whole
> as the first step. But for the jsonb or array data type, I think it is
> more often. However if we design toasting like that, I'm not sure if it
> would slow down other user case. for example detoasting the whole piece
> use case. I am justing feeling both way has its user case, kind of heap
> store and columna store.  
> 

Any substr/starts_with call benefits from this optimization, and such
calls don't seem that uncommon. I certainly can imagine people doing
this fairly often. In any case, it's a long-standing behavior /
optimization, and I don't think we can just dismiss it this quickly.

Is there a way to identify cases that are likely to benefit from this
slicing, and disable the detoasting for them? We already disable it for
other cases, so maybe we can do this here too?

Or maybe there's a way to do the detoasting lazily, and only keep the
detoasted value when it's requesting the whole value. Or perhaps even
better, remember what part we detoasted, and maybe it'll be enough for
future requests?

I'm not sure how difficult would this be with the proposed approach,
which eagerly detoasts the value into tts_values. I think it'd be easier
to implement with the TOAST cache I briefly described ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-02-26 Thread Andy Fan


Hi, 


> When we store and detoast large values, say, 1Gb - that's a very likely 
> scenario,
> we have such cases from prod systems - we would end up in using a lot of 
> shared
> memory to keep these values alive, only to discard them later.

First, the feature makes sure that if the original expression would not
detoast a value, this feature will not detoast it either. Based on this,
if there is a 1GB datum, it took 1GB of memory in the past as well, but
it may have been discarded more quickly than in the patched version.
However, the patched version will *not* keep it for too long, since once
slot->tts_values[*] is invalidated, the memory is released immediately.

For example:

create table t(a text, b int);
insert into t select '1-gb-length-text' from generate_series(1, 100);

select * from t where a like '%gb%';

The patched version will take only 1GB of extra memory.

Are you worried about this case or some other cases? 

> Also, toasted values
> are not always being used immediately and as a whole, i.e. jsonb values are 
> fully
> detoasted (we're working on this right now) to extract the smallest value from
> big json, and these values are not worth keeping in memory. For text values 
> too,
> we often do not need the whole value to be detoasted.

I'm not sure how often this is true, especially since you mentioned the
text data type. I can't imagine why people would fetch just a piece of a
text value, or how we could design a toasting system to fulfill such a
case without detoasting the whole value as the first step. For the jsonb
or array data types, I think it is more common. However, if we design
toasting like that, I'm not sure whether it would slow down other use
cases, for example the detoast-the-whole-value use case. I just feel
both ways have their use cases, kind of like heap store vs. column store.

FWIW, as I showed in the test case, this feature is not only helpful for
big datums, it is also helpful for small toasted datums.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-02-25 Thread Nikita Malakhov
Hi!

I see this as a very promising initiative, but some issues come to mind.
When we store and detoast large values, say 1GB - that's a very likely
scenario, and we have such cases from prod systems - we would end up
using a lot of shared memory to keep these values alive, only to discard
them later. Also, toasted values are not always used immediately and as
a whole, i.e. jsonb values are fully detoasted (we're working on this
right now) to extract the smallest value from a big json, and these
values are not worth keeping in memory. For text values too, we often do
not need the whole value to be detoasted and kept in memory.
What do you think?

-- 
Regards,
Nikita Malakhov
Postgres Professional
The Russian Postgres Company
https://postgrespro.ru/


Re: Shared detoast Datum proposal

2024-02-25 Thread Andy Fan


Hi,

>>
>> Good idea. Either that (a separate README), or a comment in a header of
>> some suitable .c/.h file (I prefer that, because that's kinda obvious
>> when reading the code, I often not notice a README exists next to it).
>
> Great, I'd try this from tomorrow. 

I have done that.  Currently I chose a README because this feature
changes createplan.c, setrefs.c, execExpr.c and execExprInterp.c, so
putting the high-level design into any one of them looks inappropriate.
So the high-level design is here, and the detailed design for each step
is in the comments around the code.  Hope this is helpful!

The problem:
-

In the current expression engine, a toasted datum is detoasted when
required, but the result is discarded immediately, either by pfree'ing
it right away or leaving it for ResetExprContext. Arguments about which
one to use come up sometimes. A more serious problem is that detoasting
is expensive, especially for data types like jsonb or array, where the
value might be very large. In the example below, the detoasting happens
twice.

SELECT jb_col->'a', jb_col->'b' FROM t;

With shared-detoast-datum, we just need to detoast once for each tuple,
and the result is discarded as soon as the tuple is no longer needed.
FWIW this issue may exist for small numeric and text values as well,
because of the SHORT_TOAST feature where the varlena length header uses
1 byte rather than 4 bytes.

Current Design
--

The high-level design is to let createplan.c and setrefs.c decide which
Vars can use this feature, and then have the executor save the detoasted
datum back into slot->tts_values[*] during the ExprEvalStep of
EEOP_{INNER|OUTER|SCAN}_VAR_TOAST. The reasons include:

- The existing expression engine reads datums from tts_values[*], so no
  extra work needs to be done.
- Reuse the lifespan of the TupleTableSlot system to manage memory. It
  is natural to think of the detoasted datum as a tts_value that just
  happens to be in detoasted format. Since we already have a clear
  lifespan for TupleTableSlot, via ExecClearTuple, ExecCopySlot etc., it
  is easy to reuse them for managing the detoasted datum's memory.
- The existing projection method automatically copies the detoasted
  datum (as a Datum/int64) to the next node's slot while keeping the
  ownership unchanged, so only the slot where the detoast really
  happened is in charge of its lifespan.

So the final data structure change is adding the field below to TupleTableSlot.

typedef struct TupleTableSlot
{
..

	/*
	 * The attributes whose values in tts_values[*] are the detoasted
	 * versions; that memory needs some extra clean-up. It can't be put
	 * into ecxt_per_tuple_memory since much of it needs a longer life
	 * span.
	 *
	 * The memory is put into TupleTableSlot.tts_mcxt and is cleared
	 * whenever tts_values[*] is invalidated.
	 */
	Bitmapset   *pre_detoast_attrs;
};

Assume that which Vars should use this feature has already been decided
in createplan.c and setrefs.c. The 3 new ExprEvalSteps
EEOP_{INNER,OUTER,SCAN}_VAR_TOAST are used for them. When evaluating
these steps, the code below is used.

static inline void
ExecSlotDetoastDatum(TupleTableSlot *slot, int attnum)
{
	if (!slot->tts_isnull[attnum] &&
		VARATT_IS_EXTENDED(slot->tts_values[attnum]))
	{
		Datum		oldDatum;
		MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);

		oldDatum = slot->tts_values[attnum];
		slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
											(struct varlena *) oldDatum));
		Assert(slot->tts_nvalid > attnum);
		Assert(oldDatum != slot->tts_values[attnum]);
		slot->pre_detoasted_attrs = bms_add_member(slot->pre_detoasted_attrs,
												   attnum);
		MemoryContextSwitchTo(old);
	}
}

Since I don't want an extra run-time check to see whether a detoast
should happen, I introduced 3 new steps.

When to free the detoasted datum? It depends on when the slot's
tts_values[*] is invalidated. ExecClearTuple is the obvious place, but
any TupleTableSlotOps callback that sets tts_nvalid = 0 tells us that no
one will use the datums in tts_values[*] any more, so that is also the
time to release them based on slot->pre_detoast_attrs.
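
For example, a minimal sketch of that clean-up (the actual patch's
ExecFreePreDetoastDatum may differ in detail) could look like:

static inline void
ExecFreePreDetoastDatum(TupleTableSlot *slot)
{
	int			attnum = -1;

	/* free every detoasted copy recorded in pre_detoast_attrs */
	while ((attnum = bms_next_member(slot->pre_detoast_attrs, attnum)) >= 0)
		pfree(DatumGetPointer(slot->tts_values[attnum]));

	bms_free(slot->pre_detoast_attrs);
	slot->pre_detoast_attrs = NULL;
}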

Now we come to the createplan.c/setrefs.c part, which decides which Vars
should use the shared detoast feature. The guidelines for this are:

1. A detoast is needed for the given expression.
2. It should not break the CP_SMALL_TLIST design. Since we save the
   detoasted datum back into tts_values[*], the tuple becomes bigger; if
   we did this blindly, it would be harmful to the SORT / HASH style
   nodes.

A high level data flow is:

1. In createplan.c, we walk the plan tree to gather the CP_SMALL_TLIST
   information caused by SORT/HASH style nodes, and save it to
   Plan.forbid_pre_detoast_vars via the function
   set_plan_forbid_pre_detoast_vars_recurse.

2. In setrefs.c, fix_{scan|join}_expr will recurse to each Var in each
   expression, so it is a good time to track the 

Re: Shared detoast Datum proposal

2024-02-21 Thread Andy Fan



I saw your reply as I started to write a more high-level document.
Thanks for the step-by-step help!

Tomas Vondra  writes:

> On 2/20/24 19:38, Andy Fan wrote:
>> 
>> ...
>>
>>> I think it should be done the other
>>> way around, i.e. the patch should introduce the main feature first
>>> (using the traditional Bitmapset), and then add Bitset on top of that.
>>> That way we could easily measure the impact and see if it's useful.
>> 
>> Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
>> indicate it is too expensive. and after talk with David at [2], I
>> introduced bitset and use it here. the test case I used comes from [1].
>> IRCC, there were 5% performance difference because of this.
>> 
>> create table w(a int, b numeric);
>> insert into w select i, i from generate_series(1, 100)i;
>> select b from w where b > 0;
>> 
>> To reproduce the difference, we can replace the bitset_clear() with
>> 
>> bitset_free(slot->pre_detoasted_attrs);
>> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
>> 
>> in ExecFreePreDetoastDatum. then it works same as Bitmapset.
>> 
>
> I understand the bitset was not introduced until v5, after noticing the
> bitmapset is not quite efficient. But I still think the patches should
> be the other way around, i.e. the main feature first, then the bitset as
> an optimization.
>
> That allows everyone to observe the improvement on their own (without
> having to tweak the code), and it also doesn't require commit of the
> bitset part before it gets actually used by anything.

I'm starting to think this is the better way rather than the opposite.
The next version will be:

0001: the shared detoast datum feature, with a high-level introduction.
0002: introduce Bitset and use it in the shared-detoast-datum feature,
with a test case to show the improvement.

>> I will do more test on the memory leak stuff, since there are so many
>> operation aginst slot like ExecCopySlot etc, I don't know how to test it
>> fully. the method in my mind now is use TPCH with 10GB data size, and
>> monitor the query runtime memory usage.
>> 
>
> I think this is exactly the "high level design" description that should
> be in a comment, somewhere.

Got it.

>>> Of course, expanded datums are
>>> not meant to be long-lived, while "shared detoasted values" are meant to
>>> exist (potentially) for the query duration.
>> 
>> hmm, acutally the "shared detoast value" just live in the
>> TupleTableSlot->tts_values[*], rather than the whole query duration. The
>> simple case is:
>> 
>> SELECT * FROM t WHERE a_text LIKE 'abc%';
>> 
>> when we scan to the next tuple, the detoast value for the previous tuple
>> will be relased.
>> 
>
> But if the (detoasted) value is passed to the next executor node, it'll
> be kept, right?

Yes, and there is only one copy shared by all the executor nodes.

> Unrelated question - could this actually end up being too aggressive?
> That is, could it detoast attributes that we end up not needing? For
> example, what if the tuple gets eliminated for some reason (e.g. because
> of a restriction on the table, or not having a match in a join)? Won't
> we detoast the tuple only to throw it away?

The detoasted datum has exactly the same lifespan as the other
tts_values[*]. If the tuple gets eliminated for any reason, those
detoasted datums still exist until the slot is cleared to store the next
tuple.

>> typedef struct Join
>> {
>> ..
>>  /*
>>   * Records of var's varattno - 1 where the Var is accessed indirectly by
>>   * any expression, like a > 3.  However a IS [NOT] NULL is not included
>>   * since it doesn't access the tts_values[*] at all.
>>   *
>>   * This is a essential information to figure out which attrs should use
>>   * the pre-detoast-attrs logic.
>>   */
>>  Bitmapset  *outer_reference_attrs;
>>  Bitmapset  *inner_reference_attrs;
>> } Join;
>> 
>
> Is it actually necessary to add new fields to these nodes? Also, the
> names are not particularly descriptive of the purpose - it'd be better
> to have "detoast" in the name, instead of generic "reference".

Because of the way the data is transformed, I think we must add the
fields to keep such information. This information is then used to
initialize the necessary information in PlanState. Maybe I have a fixed
mindset; I can't think of a way to avoid that right now.

I used 'reference' rather than 'detoast' because of some implementation
issues. In createplan.c and setrefs.c, I can't check atttyplen
effectively, so even a Var with int type is still held here, which may
have nothing to do with detoasting.

>> 
>> When I worked with the UniqueKey feature, I maintained a
>> UniqueKey.README to summaried all the dicussed topics in threads, the
>> README is designed to save the effort for more reviewer, I think I
>> should apply the same logic for this feature.
>> 
>
> Good idea. Either that (a separate README), or a comment in a header of
> some suitable .c/.h file (I 

Re: Shared detoast Datum proposal

2024-02-21 Thread Tomas Vondra
On 2/20/24 19:38, Andy Fan wrote:
> 
> ...
>
>> I think it should be done the other
>> way around, i.e. the patch should introduce the main feature first
>> (using the traditional Bitmapset), and then add Bitset on top of that.
>> That way we could easily measure the impact and see if it's useful.
> 
> Acutally v4 used the Bitmapset, and then both perf and pgbench's tps
> indicate it is too expensive. and after talk with David at [2], I
> introduced bitset and use it here. the test case I used comes from [1].
> IRCC, there were 5% performance difference because of this.
> 
> create table w(a int, b numeric);
> insert into w select i, i from generate_series(1, 100)i;
> select b from w where b > 0;
> 
> To reproduce the difference, we can replace the bitset_clear() with
> 
> bitset_free(slot->pre_detoasted_attrs);
> slot->pre_detoasted_attrs = bitset_init(slot->tts_tupleDescriptor->natts);
> 
> in ExecFreePreDetoastDatum. then it works same as Bitmapset.
> 

I understand the bitset was not introduced until v5, after noticing the
bitmapset is not quite efficient. But I still think the patches should
be the other way around, i.e. the main feature first, then the bitset as
an optimization.

That allows everyone to observe the improvement on their own (without
having to tweak the code), and it also doesn't require commit of the
bitset part before it gets actually used by anything.

> 
>> On the whole, my biggest concern is memory usage & leaks. It's not
>> difficult to already have problems with large detoasted values, and if
>> we start keeping more of them, that may get worse. Or at least that's my
>> intuition - it can't really get better by keeping the values longer, right?
>>
>> The other thing is the risk of leaks (in the sense of keeping detoasted
>> values longer than expected). I see the values are allocated in
>> tts_mcxt, and maybe that's the right solution - not sure.
> 
> about the memory usage, first it is kept as the same lifesplan as the
> tts_values[*] which can be released pretty quickly, only if the certain
> values of the tuples is not needed. it is true that we keep the detoast
> version longer than before,  but that's something we have to pay I
> think. 
> 
> Leaks may happen since tts_mcxt is reset at the end of *executor*. So if
> we forget to release the memory when the tts_values[*] is invalidated
> somehow, the memory will be leaked until the end of executor. I think
> that will be enough to cause an issue. Currently besides I release such
> memory at the ExecClearTuple, I also relase such memory whenever we set
> tts_nvalid to 0, the theory used here is:
> 
> /*
>  * tts_values is treated invalidated since tts_nvalid is set to 0, so
>  * let's free the pre-detoast datum.
>  */
> ExecFreePreDetoastDatum(slot);
> 
> I will do more test on the memory leak stuff, since there are so many
> operation aginst slot like ExecCopySlot etc, I don't know how to test it
> fully. the method in my mind now is use TPCH with 10GB data size, and
> monitor the query runtime memory usage.
> 

I think this is exactly the "high level design" description that should
be in a comment, somewhere.

> 
>> FWIW while looking at the patch, I couldn't help but to think about
>> expanded datums. There's similarity in what these two features do - keep
>> detoasted values for a while, so that we don't need to do the expensive
>> processing if we access them repeatedly.
> 
> Could you provide some keyword or function names for the expanded datum
> here, I probably miss this.
> 

see src/include/utils/expandeddatum.h

>> Of course, expanded datums are
>> not meant to be long-lived, while "shared detoasted values" are meant to
>> exist (potentially) for the query duration.
> 
> hmm, acutally the "shared detoast value" just live in the
> TupleTableSlot->tts_values[*], rather than the whole query duration. The
> simple case is:
> 
> SELECT * FROM t WHERE a_text LIKE 'abc%';
> 
> when we scan to the next tuple, the detoast value for the previous tuple
> will be relased.
> 

But if the (detoasted) value is passed to the next executor node, it'll
be kept, right?

>> But maybe there's something
>> we could learn from expanded datums? For example how the varlena pointer
>> is leveraged to point to the expanded object.
> 
> maybe. currently I just use detoast_attr to get the desired version. I'm
> pleasure if we have more effective way.
> 
> if (!slot->tts_isnull[attnum] &&
> VARATT_IS_EXTENDED(slot->tts_values[attnum]))
> {
>   Datum   oldDatum;
>   MemoryContext old = MemoryContextSwitchTo(slot->tts_mcxt);
> 
>   oldDatum = slot->tts_values[attnum];
>   slot->tts_values[attnum] = PointerGetDatum(detoast_attr(
>   
>   (struct varlena *) oldDatum));
>   Assert(slot->tts_nvalid > attnum);
>   Assert(oldDatum != slot->tts_values[attnum]);
>   

Re: Shared detoast Datum proposal

2024-02-20 Thread Andy Fan

Tomas Vondra  writes:

Hi Tomas, 
>
> I took a quick look on this thread/patch today, so let me share a couple
> initial thoughts. I may not have a particularly coherent/consistent
> opinion on the patch or what would be a better way to do this yet, but
> perhaps it'll start a discussion ...

Thank you for this!

>
> The goal of the patch (as I understand it) is essentially to cache
> detoasted values, so that the value does not need to be detoasted
> repeatedly in different parts of the plan. I think that's perfectly
> sensible and worthwhile goal - detoasting is not cheap, and complex
> plans may easily spend a lot of time on it.

exactly.

>
> That being said, the approach seems somewhat invasive, and touching
> parts I wouldn't have expected to need a change to implement this. For
> example, I certainly would not have guessed the patch to need changes in
> createplan.c or setrefs.c.
>
> Perhaps it really needs to do these things, but neither the thread nor
> the comments are very enlightening as for why it's needed :-( In many
> cases I can guess, but I'm not sure my guess is correct. And comments in
> code generally describe what's happening locally / next line, not the
> bigger picture & why it's happening.

There was an explanation at [1], but it is probably too high level.
Writing proper comments is challenging for me, but I am pretty happy to
try harder. At the end of this message I explain the data workflow; I
feel that would be useful for reviewers.

> IIUC we walk the plan to decide which Vars should be detoasted (and
> cached) once, and which cases should not do that because it'd inflate
> the amount of data we need to keep in a Sort, Hash etc.

Exactly.

> Not sure if
> there's a better way to do this - it depends on what happens in the
> upper parts of the plan, so we can't decide while building the paths.

I'd say I did this intentionally. Deciding such things at the path stage
would be more expensive than at the create_plan stage, IMO.

> But maybe we could decide this while transforming the paths into a plan?
> (I realize the JIT thread nearby needs to do something like that in
> create_plan, and in that one I suggested maybe walking the plan would be
> a better approach, so I may be contradicting myself a little bit.).

I think that's pretty similar to what I'm doing now, just that I did it
*just after* create_plan. This is because create_plan doesn't always
transform the path into a plan in a top-down manner; the known exception
is create_mergejoin_plan. So I have to walk the tree just after
create_plan is done.

In create_mergejoin_plan, the Sort node is created *after* the subplan
for the Sort is created.

/* Recursively process the path tree, demanding the correct tlist result */
plan = create_plan_recurse(root, best_path, CP_EXACT_TLIST);

+ /*
+  * After the plan tree is completely built, we start to walk it to decide
+  * which expressions should not use the shared-detoast feature.
+  */
+ set_plan_forbid_pre_detoast_vars_recurse(plan, NIL);

>
> In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
> explain the overall strategy / reasoning in a bit more detail. Maybe
> it's somewhere in this thread, but that's not great for reviewers.

A lesson learned, thanks.

A revised version of the comments from the latest patch:

/*
 * set_plan_forbid_pre_detoast_vars_recurse
 *   Walk the Plan tree in a top-down manner to gather the Vars which
 * should stay as small as possible and record them in
 * Plan.forbid_pre_detoast_vars.
 *
 * plan: the plan node to walk right now.
 * small_tlist: a list of nodes whose subplan should provide them as
 * small as possible.
 */
static void
set_plan_forbid_pre_detoast_vars_recurse(Plan *plan, List *small_tlist)

>
> Similar for the setrefs.c changes. It seems a bit suspicious to piggy
> back the new code into fix_scan_expr/fix_scan_list and similar code.
> Those functions have a pretty clearly defined purpose, not sure we want
> to also extend them to also deal with this new thing. (FWIW I'd 100%%
> did it this way if I hacked on a PoC of this, to make it work. But I'm
> not sure it's the right solution.)

The main reason for doing so is that I want to share the same walking
effort as fix_scan_expr; otherwise I would have to walk the plan again
for every expression. I thought of this as a best practice in the past,
and thought we could treat the pre_detoast_attrs as a valuable side
effect :(

> I don't know what to thing about the Bitset - maybe it's necessary, but
> how would I know? I don't have any way to measure the benefits, because
> the 0002 patch uses it right away. 

A revised version of the comments from the latest patch; paragraph 2
explains this decision.

/*
 * The attributes whose values in tts_values[*] are the detoasted
 * versions; that memory needs some extra clean-up. It can't be put
 * into ecxt_per_tuple_memory since much of it needs a longer life span,
 * for example the 

Re: Shared detoast Datum proposal

2024-02-20 Thread Tomas Vondra
Hi,

I took a quick look on this thread/patch today, so let me share a couple
initial thoughts. I may not have a particularly coherent/consistent
opinion on the patch or what would be a better way to do this yet, but
perhaps it'll start a discussion ...


The goal of the patch (as I understand it) is essentially to cache
detoasted values, so that the value does not need to be detoasted
repeatedly in different parts of the plan. I think that's perfectly
sensible and worthwhile goal - detoasting is not cheap, and complex
plans may easily spend a lot of time on it.

That being said, the approach seems somewhat invasive, and touching
parts I wouldn't have expected to need a change to implement this. For
example, I certainly would not have guessed the patch to need changes in
createplan.c or setrefs.c.

Perhaps it really needs to do these things, but neither the thread nor
the comments are very enlightening as for why it's needed :-( In many
cases I can guess, but I'm not sure my guess is correct. And comments in
code generally describe what's happening locally / next line, not the
bigger picture & why it's happening.

IIUC we walk the plan to decide which Vars should be detoasted (and
cached) once, and which cases should not do that because it'd inflate
the amount of data we need to keep in a Sort, Hash etc. Not sure if
there's a better way to do this - it depends on what happens in the
upper parts of the plan, so we can't decide while building the paths.

But maybe we could decide this while transforming the paths into a plan?
(I realize the JIT thread nearby needs to do something like that in
create_plan, and in that one I suggested maybe walking the plan would be
a better approach, so I may be contradicting myself a little bit.).

In any case, set_plan_forbid_pre_detoast_vars_recurse should probably
explain the overall strategy / reasoning in a bit more detail. Maybe
it's somewhere in this thread, but that's not great for reviewers.

Similar for the setrefs.c changes. It seems a bit suspicious to piggy
back the new code into fix_scan_expr/fix_scan_list and similar code.
Those functions have a pretty clearly defined purpose, and I'm not sure
we want to extend them to also deal with this new thing. (FWIW I'd 100%
have done it this way if I hacked on a PoC of this, to make it work. But
I'm not sure it's the right solution.)

I don't know what to thing about the Bitset - maybe it's necessary, but
how would I know? I don't have any way to measure the benefits, because
the 0002 patch uses it right away. I think it should be done the other
way around, i.e. the patch should introduce the main feature first
(using the traditional Bitmapset), and then add Bitset on top of that.
That way we could easily measure the impact and see if it's useful.

On the whole, my biggest concern is memory usage & leaks. It's not
difficult to already have problems with large detoasted values, and if
we start keeping more of them, that may get worse. Or at least that's my
intuition - it can't really get better by keeping the values longer, right?

The other thing is the risk of leaks (in the sense of keeping detoasted
values longer than expected). I see the values are allocated in
tts_mcxt, and maybe that's the right solution - not sure.


FWIW while looking at the patch, I couldn't help but to think about
expanded datums. There's similarity in what these two features do - keep
detoasted values for a while, so that we don't need to do the expensive
processing if we access them repeatedly. Of course, expanded datums are
not meant to be long-lived, while "shared detoasted values" are meant to
exist (potentially) for the query duration. But maybe there's something
we could learn from expanded datums? For example how the varlena pointer
is leveraged to point to the expanded object.

For example, what if we add a "TOAST cache" as a query-level hash table,
and modify the detoasting to first check the hash table (with the TOAST
pointer as a key)? It'd be fairly trivial to enforce a memory limit on
the hash table, evict values from it, etc. And it wouldn't require any
of the createplan/setrefs changes, I think ...
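
To make that idea a bit more tangible, here is a very rough sketch assuming
a dynahash table keyed by the on-disk TOAST pointer; ToastCacheEntry and
toast_cache_detoast are made-up names, and the memory-limit/eviction and
end-of-query cleanup that would actually make this usable are omitted:

#include "postgres.h"
#include "access/detoast.h"
#include "utils/hsearch.h"

typedef struct ToastCacheEntry
{
	struct varatt_external key;	/* the on-disk TOAST pointer, used as key */
	struct varlena *value;		/* cached detoasted copy */
} ToastCacheEntry;

static HTAB *toast_cache = NULL;

/*
 * Return a detoasted copy of 'attr', detoasting at most once per distinct
 * TOAST pointer; 'attr' is assumed to pass VARATT_IS_EXTERNAL_ONDISK().
 */
static struct varlena *
toast_cache_detoast(struct varlena *attr)
{
	struct varatt_external toast_pointer;
	ToastCacheEntry *entry;
	bool		found;

	if (toast_cache == NULL)
	{
		HASHCTL		ctl;

		memset(&ctl, 0, sizeof(ctl));
		ctl.keysize = sizeof(struct varatt_external);
		ctl.entrysize = sizeof(ToastCacheEntry);
		toast_cache = hash_create("query-level toast cache", 64, &ctl,
								  HASH_ELEM | HASH_BLOBS);
	}

	VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);

	entry = (ToastCacheEntry *) hash_search(toast_cache, &toast_pointer,
											HASH_ENTER, &found);
	if (!found)
		entry->value = detoast_attr(attr);	/* the only place we detoast */

	return entry->value;
}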


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Shared detoast Datum proposal

2024-02-19 Thread Andy Fan

Hi,

I did another round of self-review.  Comments, variable names, and the
order of function definitions have been improved so that the patch reads
as smoothly as possible, so v6 is attached.

-- 
Best Regards
Andy Fan
>From f2e7772228e8a18027b9c29f10caba9c6570d934 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Tue, 20 Feb 2024 11:11:53 +0800
Subject: [PATCH v6 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length set of bits; the fixed length must be
specified at bitset_init time and stays unchanged for the whole lifespan.
Because of this, some operations on a Bitset are simpler than on a
Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it is
the main reason to add this data struct.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c | 200 +-
 src/backend/nodes/outfuncs.c  |  51 +
 src/include/nodes/bitmapset.h |  28 +++
 src/include/nodes/nodes.h |   4 +
 src/test/modules/test_misc/Makefile   |  11 +
 src/test/modules/test_misc/README |   4 +-
 .../test_misc/expected/test_bitset.out|   7 +
 src/test/modules/test_misc/meson.build|  17 ++
 .../modules/test_misc/sql/test_bitset.sql |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c| 118 +++
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list  |   1 +
 13 files changed, 441 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..40cfea2308 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1351,6 +1346,19 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +1466,177 @@ bitmap_match(const void *key1, const void *key2, Size keysize)
 	return !bms_equal(*((const Bitmapset *const *) key1),
 	  *((const Bitmapset *const *) key2));
 }
+
+/*
+ * bitset_init - create a Bitset.  The size is rounded up to whole bitmapwords.
+ */
+Bitset *
+bitset_init(size_t size)
+{
+	int			nword = (size + BITS_PER_BITMAPWORD - 1) / BITS_PER_BITMAPWORD;
+	Bitset	   *result;
+
+	if (size == 0)
+		return NULL;
+
+	result = (Bitset *) palloc0(sizeof(Bitset) + nword * sizeof(bitmapword));
+	result->nwords = nword;
+
+	return result;
+}
+
+/*
+ * bitset_clear - clear the bits only, but the memory is still there.
+ */
+void
+bitset_clear(Bitset *a)
+{
+	if (a != NULL)
+		memset(a->words, 0, sizeof(bitmapword) * a->nwords);
+}
+
+void
+bitset_free(Bitset *a)
+{
+	if (a != NULL)
+		pfree(a);
+}
+
+bool
+bitset_is_empty(Bitset *a)
+{
+	int			i;
+
+	if (a == NULL)
+		return true;
+
+	for (i = 0; i < a->nwords; i++)
+	{
+		bitmapword	w = a->words[i];
+
+		if (w != 0)
+			return false;
+	}
+
+	return true;
+}
+
+Bitset *
+bitset_copy(Bitset *a)
+{
+	Bitset	   *result;
+
+	if (a == NULL)
+		return NULL;
+
+	result = bitset_init(a->nwords * BITS_PER_BITMAPWORD);
+
+	memcpy(result->words, a->words, sizeof(bitmapword) * a->nwords);
+	return result;
+}
+
+void
+bitset_add_member(Bitset *a, int x)
+{
+	int			wordnum,
+bitnum;
+
+	Assert(x >= 0);
+
+	wordnum = WORDNUM(x);
+	bitnum = BITNUM(x);
+
+	Assert(wordnum < a->nwords);
+
+	a->words[wordnum] |= ((bitmapword) 1 << bitnum);
+}
+
+void
+bitset_del_member(Bitset *a, int x)
+{
+	int			

Re: Shared detoast Datum proposal

2024-01-23 Thread Andy Fan

Michael Zhilin  writes:

> Hi Andy,
>
> It looks like v5 is missing in your mail. Could you please check and resend 
> it?

ha, yes.. v5 is really attached this time.

commit eee0b2058f912d0d56282711c5d88bc0b1b75c2f (HEAD -> 
shared_detoast_value_v3)
Author: yizhi.fzh 
Date:   Tue Jan 23 13:38:34 2024 +0800

shared detoast feature.

details at https://postgr.es/m/87il4jrk1l.fsf%40163.com

commit eeca405f5ae87e7d4e5496de989ac7b5173bcaa9
Author: yizhi.fzh 
Date:   Mon Jan 22 15:48:33 2024 +0800

Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length set of bits; the fixed length must be
specified at bitset_init time and stays unchanged for the whole lifespan.
Because of this, some operations on a Bitset are simpler than on a
Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it is
the main reason to add this data struct.

Also, for performance, the Bitset functions drop some unlikely-to-be-needed
checks and use Asserts instead.

[1] 
https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com


As for the commit "Introduce a Bitset data struct.", the test coverage
is 100% now, so it would be great if people could review this part first.
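
For reviewers, here is a minimal usage sketch of the API added by that
commit, using only the functions visible in the patch (bitset_usage_example
is just an illustrative name):

#include "postgres.h"
#include "nodes/bitmapset.h"	/* where the patch declares Bitset */

static void
bitset_usage_example(int natts)
{
	Bitset	   *attrs = bitset_init(natts);	/* fixed size, natts > 0 assumed */

	bitset_add_member(attrs, 0);	/* remember attribute 0 */
	bitset_add_member(attrs, 2);
	Assert(!bitset_is_empty(attrs));

	bitset_del_member(attrs, 2);

	/* reuse for the next tuple: bits cleared, palloc'd memory kept */
	bitset_clear(attrs);
	Assert(bitset_is_empty(attrs));

	bitset_free(attrs);
}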

-- 
Best Regards
Andy Fan

>From eeca405f5ae87e7d4e5496de989ac7b5173bcaa9 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Mon, 22 Jan 2024 15:48:33 +0800
Subject: [PATCH v5 1/2] Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length set of bits; the fixed length must be
specified at bitset_init time and stays unchanged for the whole lifespan.
Because of this, some operations on a Bitset are simpler than on a
Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it is
the main reason to add this data struct.

Also, for performance, the Bitset functions drop some unlikely-to-be-needed
checks and use Asserts instead.

[1] https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com
---
 src/backend/nodes/bitmapset.c | 201 +-
 src/backend/nodes/outfuncs.c  |  51 +
 src/include/nodes/bitmapset.h |  28 +++
 src/include/nodes/nodes.h |   4 +
 src/test/modules/test_misc/Makefile   |  11 +
 src/test/modules/test_misc/README |   4 +-
 .../test_misc/expected/test_bitset.out|   7 +
 src/test/modules/test_misc/meson.build|  17 ++
 .../modules/test_misc/sql/test_bitset.sql |   3 +
 src/test/modules/test_misc/test_misc--1.0.sql |   5 +
 src/test/modules/test_misc/test_misc.c| 132 
 src/test/modules/test_misc/test_misc.control  |   4 +
 src/tools/pgindent/typedefs.list  |   1 +
 13 files changed, 456 insertions(+), 12 deletions(-)
 create mode 100644 src/test/modules/test_misc/expected/test_bitset.out
 create mode 100644 src/test/modules/test_misc/sql/test_bitset.sql
 create mode 100644 src/test/modules/test_misc/test_misc--1.0.sql
 create mode 100644 src/test/modules/test_misc/test_misc.c
 create mode 100644 src/test/modules/test_misc/test_misc.control

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 65805d4527..eb38834e21 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1315,23 +1315,18 @@ bms_join(Bitmapset *a, Bitmapset *b)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-int
-bms_next_member(const Bitmapset *a, int prevbit)
+
+static int
+bms_next_member_internal(int nwords, const bitmapword *words, int prevbit)
 {
-	int			nwords;
 	int			wordnum;
 	bitmapword	mask;
 
-	Assert(bms_is_valid_set(a));
-
-	if (a == NULL)
-		return -2;
-	nwords = a->nwords;
 	prevbit++;
 	mask = (~(bitmapword) 0) << BITNUM(prevbit);
 	for (wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
 	{
-		bitmapword	w = a->words[wordnum];
+		bitmapword	w = words[wordnum];
 
 		/* ignore bits before prevbit */
 		w &= mask;
@@ -1348,9 +1343,23 @@ bms_next_member(const Bitmapset *a, int prevbit)
 		/* in subsequent words, consider all bits */
 		mask = (~(bitmapword) 0);
 	}
+
 	return -2;
 }
 
+int
+bms_next_member(const Bitmapset *a, int prevbit)
+{
+	Assert(a == NULL || IsA(a, Bitmapset));
+
+	Assert(bms_is_valid_set(a));
+
+	if (a == NULL)
+		return -2;
+
+	return bms_next_member_internal(a->nwords, a->words, prevbit);
+}
+
 /*
  * bms_prev_member - find prev member of a set
  *
@@ -1458,3 +1467,177 @@ bitmap_match(const void *key1, const void *key2, Size 

Re: Shared detoast Datum proposal

2024-01-23 Thread Michael Zhilin

Hi Andy,

It looks like v5 is missing in your mail. Could you please check and 
resend it?


Thanks,
 Michael.

On 1/23/24 08:44, Andy Fan wrote:

Hi,

Peter Smith  writes:


2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

==
[1]https://commitfest.postgresql.org/46/4759/
[2]https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

v5 attached; it should fix the above issue.  This version also introduces
a data struct called Bitset, which has an API similar to Bitmapset but
with the ability to reset all bits without recycling its allocated memory;
this is important for this feature.

commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> 
shared_detoast_value_v2)
Author: yizhi.fzh
Date:   Tue Jan 23 13:38:34 2024 +0800

 shared detoast feature.

commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh
Date:   Mon Jan 22 15:48:33 2024 +0800

 Introduce a Bitset data struct.
 
 While Bitmapset is designed for a variable-length set of bits, Bitset is
 designed for a fixed-length set of bits; the fixed length must be
 specified at bitset_init time and stays unchanged for the whole lifespan.
 Because of this, some operations on a Bitset are simpler than on a
 Bitmapset.

 bitset_clear unsets all the bits but keeps the allocated memory; this
 capability is impossible for Bitmapset for some solid reasons, and it is
 the main reason to add this data struct.

 Also, for performance, the Bitset functions drop some unlikely-to-be-needed
 checks and use Asserts instead.
 
 [1]https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com

 [2]https://postgr.es/m/875xzqxbv5.fsf%40163.com


I didn't write a good commit message for commit 2; people who are
interested in this can see the first message in this thread for an
explanation.

I think anyone whose customers use lots of jsonb can probably benefit
from this. The precondition is that the toast value is accessed 1+ times,
including by the jsonb_out function. I think that would not be rare.



--
Michael Zhilin
Postgres Professional
+7(925)3366270
https://www.postgrespro.ru


Re: Shared detoast Datum proposal

2024-01-22 Thread Andy Fan


Hi,

Peter Smith  writes:

> 2024-01 Commitfest.
>
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
>
> ==
> [1] https://commitfest.postgresql.org/46/4759/
> [2] 
> https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

v5 attached; it should fix the above issue.  This version also introduces
a data struct called Bitset, which has an API similar to Bitmapset but
with the ability to reset all bits without recycling its allocated memory;
this is important for this feature.

commit 44754fb03accb0dec9710a962a334ee73eba3c49 (HEAD -> 
shared_detoast_value_v2)
Author: yizhi.fzh 
Date:   Tue Jan 23 13:38:34 2024 +0800

shared detoast feature.

commit 14a6eafef9ff4926b8b877d694de476657deee8a
Author: yizhi.fzh 
Date:   Mon Jan 22 15:48:33 2024 +0800

Introduce a Bitset data struct.

While Bitmapset is designed for a variable-length set of bits, Bitset is
designed for a fixed-length set of bits; the fixed length must be
specified at bitset_init time and stays unchanged for the whole lifespan.
Because of this, some operations on a Bitset are simpler than on a
Bitmapset.

bitset_clear unsets all the bits but keeps the allocated memory; this
capability is impossible for Bitmapset for some solid reasons, and it is
the main reason to add this data struct.

Also, for performance, the Bitset functions drop some unlikely-to-be-needed
checks and use Asserts instead.

[1] 
https://postgr.es/m/CAApHDvpdp9LyAoMXvS7iCX-t3VonQM3fTWCmhconEvORrQ%2BZYA%40mail.gmail.com
[2] https://postgr.es/m/875xzqxbv5.fsf%40163.com


I didn't write a good commit message for commit 2; people who are
interested in this can see the first message in this thread for an
explanation.

I think anyone whose customers use lots of jsonb can probably benefit
from this. The precondition is that the toast value is accessed 1+ times,
including by the jsonb_out function. I think that would not be rare.

-- 
Best Regards
Andy Fan





Re: Shared detoast Datum proposal

2024-01-21 Thread Peter Smith
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/4759/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

Kind Regards,
Peter Smith.




Re: Shared detoast Datum proposal

2024-01-08 Thread Andy Fan

Andy Fan  writes:

>>
>> One of the tests was aborted at CFBOT [1] with:
>> [09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
>> /tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
>> [09:47:01.035] [New LWP 28182]
>
> There was a bug in the JIT part; here is the fix.  Thanks for taking care
> of this!

Fixed a GCC warning in cirrus-ci; hope everything is fine now.

-- 
Best Regards
Andy Fan

>From 6dc858fae29486bea9125e7a3fb43a7081e62097 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Wed, 27 Dec 2023 18:43:56 +0800
Subject: [PATCH v4 1/1] shared detoast feature.

---
 src/backend/executor/execExpr.c  |  65 ++-
 src/backend/executor/execExprInterp.c| 181 +++
 src/backend/executor/execTuples.c| 130 +
 src/backend/executor/execUtils.c |   5 +
 src/backend/executor/nodeHashjoin.c  |   2 +
 src/backend/executor/nodeMergejoin.c |   2 +
 src/backend/executor/nodeNestloop.c  |   1 +
 src/backend/jit/llvm/llvmjit_expr.c  |  26 +-
 src/backend/jit/llvm/llvmjit_types.c |   1 +
 src/backend/nodes/bitmapset.c|  13 +
 src/backend/optimizer/plan/createplan.c  |  73 ++-
 src/backend/optimizer/plan/setrefs.c | 529 +++
 src/include/executor/execExpr.h  |   6 +
 src/include/executor/tuptable.h  |  60 +++
 src/include/nodes/bitmapset.h|   1 +
 src/include/nodes/execnodes.h|   5 +
 src/include/nodes/plannodes.h|  50 ++
 src/test/regress/sql/shared_detoast_slow.sql |  70 +++
 18 files changed, 1114 insertions(+), 106 deletions(-)
 create mode 100644 src/test/regress/sql/shared_detoast_slow.sql

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 91df2009be..4aeec8419f 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,81 @@ ExecInitExprRec(Expr *node, ExprState *state,
 }
 else
 {
+	int			attnum;
+	Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 	/* regular user column */
 	scratch.d.var.attnum = variable->varattno - 1;
 	scratch.d.var.vartype = variable->vartype;
+	attnum = scratch.d.var.attnum;
+
 	switch (variable->varno)
 	{
 		case INNER_VAR:
-			scratch.opcode = EEOP_INNER_VAR;
+
+			if (is_join_plan(plan) &&
+bms_is_member(attnum,
+			  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+			{
+/* debug purpose. */
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 attnum);
+}
+scratch.opcode = EEOP_INNER_VAR_TOAST;
+			}
+			else
+			{
+scratch.opcode = EEOP_INNER_VAR;
+			}
 			break;
 		case OUTER_VAR:
-			scratch.opcode = EEOP_OUTER_VAR;
+			if (is_join_plan(plan) &&
+bms_is_member(attnum,
+			  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+			{
+/* debug purpose. */
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 attnum);
+}
+scratch.opcode = EEOP_OUTER_VAR_TOAST;
+			}
+			else
+scratch.opcode = EEOP_OUTER_VAR;
 			break;
 
 			/* INDEX_VAR is handled by default case */
 
 		default:
-			scratch.opcode = EEOP_SCAN_VAR;
+			if (is_scan_plan(plan) && bms_is_member(
+	attnum,
+	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+			{
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 ((Scan *) plan)->scanrelid,
+		 attnum);
+}
+scratch.opcode = EEOP_SCAN_VAR_TOAST;
+			}
+			else
+scratch.opcode = EEOP_SCAN_VAR;
 			break;
 	}
 }
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3c17cc6b1e..bd769fdeb6 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum 

Re: Shared detoast Datum proposal

2024-01-06 Thread Andy Fan

Hi,

vignesh C  writes:

> On Mon, 1 Jan 2024 at 19:26, Andy Fan  wrote:
>>
>>
>> Andy Fan  writes:
>>
>> >
>> > Some Known issues:
>> > --
>> >
>> > 1. Currently only Scan & Join nodes are considered for this feature.
>> > 2. JIT is not adapted for this purpose yet.
>>
>> JIT is adapted for this feature in v2. Any feedback is welcome.
>
> One of the tests was aborted at CFBOT [1] with:
> [09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
> /tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
> [09:47:01.035] [New LWP 28182]

There was a bug in the JIT part; here is the fix.  Thanks for taking care
of this!

-- 
Best Regards
Andy Fan

>From 06a212ad6b6254cbd38dbacc409165300ff7c677 Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Wed, 27 Dec 2023 18:43:56 +0800
Subject: [PATCH v3 1/1] shared detoast feature.

---
 src/backend/executor/execExpr.c  |  65 ++-
 src/backend/executor/execExprInterp.c| 181 +++
 src/backend/executor/execTuples.c| 130 +
 src/backend/executor/execUtils.c |   5 +
 src/backend/executor/nodeHashjoin.c  |   2 +
 src/backend/executor/nodeMergejoin.c |   2 +
 src/backend/executor/nodeNestloop.c  |   1 +
 src/backend/jit/llvm/llvmjit_expr.c  |  26 +-
 src/backend/jit/llvm/llvmjit_types.c |   1 +
 src/backend/nodes/bitmapset.c|  13 +
 src/backend/optimizer/plan/createplan.c  |  73 ++-
 src/backend/optimizer/plan/setrefs.c | 536 +++
 src/include/executor/execExpr.h  |   6 +
 src/include/executor/tuptable.h  |  60 +++
 src/include/nodes/bitmapset.h|   1 +
 src/include/nodes/execnodes.h|   5 +
 src/include/nodes/plannodes.h|  50 ++
 src/test/regress/sql/shared_detoast_slow.sql |  70 +++
 18 files changed, 1119 insertions(+), 108 deletions(-)
 create mode 100644 src/test/regress/sql/shared_detoast_slow.sql

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 91df2009be..4aeec8419f 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -932,22 +932,81 @@ ExecInitExprRec(Expr *node, ExprState *state,
 }
 else
 {
+	int			attnum;
+	Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 	/* regular user column */
 	scratch.d.var.attnum = variable->varattno - 1;
 	scratch.d.var.vartype = variable->vartype;
+	attnum = scratch.d.var.attnum;
+
 	switch (variable->varno)
 	{
 		case INNER_VAR:
-			scratch.opcode = EEOP_INNER_VAR;
+
+			if (is_join_plan(plan) &&
+bms_is_member(attnum,
+			  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+			{
+/* debug purpose. */
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 attnum);
+}
+scratch.opcode = EEOP_INNER_VAR_TOAST;
+			}
+			else
+			{
+scratch.opcode = EEOP_INNER_VAR;
+			}
 			break;
 		case OUTER_VAR:
-			scratch.opcode = EEOP_OUTER_VAR;
+			if (is_join_plan(plan) &&
+bms_is_member(attnum,
+			  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+			{
+/* debug purpose. */
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 attnum);
+}
+scratch.opcode = EEOP_OUTER_VAR_TOAST;
+			}
+			else
+scratch.opcode = EEOP_OUTER_VAR;
 			break;
 
 			/* INDEX_VAR is handled by default case */
 
 		default:
-			scratch.opcode = EEOP_SCAN_VAR;
+			if (is_scan_plan(plan) && bms_is_member(
+	attnum,
+	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+			{
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 ((Scan *) plan)->scanrelid,
+		 attnum);
+}
+scratch.opcode = EEOP_SCAN_VAR_TOAST;
+			}
+			else
+scratch.opcode = EEOP_SCAN_VAR;
 			break;
 	}
 }
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3c17cc6b1e..bd769fdeb6 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ 

Re: Shared detoast Datum proposal

2024-01-06 Thread vignesh C
On Mon, 1 Jan 2024 at 19:26, Andy Fan  wrote:
>
>
> Andy Fan  writes:
>
> >
> > Some Known issues:
> > --
> >
> > 1. Currently only Scan & Join nodes are considered for this feature.
> > 2. JIT is not adapted for this purpose yet.
>
> JIT is adapted for this feature in v2. Any feedback is welcome.

One of the tests was aborted at CFBOT [1] with:
[09:47:00.735] dumping /tmp/cores/postgres-11-28182.core for
/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin/postgres
[09:47:01.035] [New LWP 28182]
[09:47:01.748] [Thread debugging using libthread_db enabled]
[09:47:01.748] Using host libthread_db library
"/lib/x86_64-linux-gnu/libthread_db.so.1".
[09:47:09.392] Core was generated by `postgres: postgres regression
[local] SELECT  '.
[09:47:09.392] Program terminated with signal SIGSEGV, Segmentation fault.
[09:47:09.392] #0  0x7fa4eed4a5a1 in ?? ()
[09:47:11.123]
[09:47:11.123] Thread 1 (Thread 0x7fa4f8050a40 (LWP 28182)):
[09:47:11.123] #0  0x7fa4eed4a5a1 in ?? ()
[09:47:11.123] No symbol table info available.
...
...
[09:47:11.123] #4  0x7fa4ebc7a186 in LLVMOrcGetSymbolAddress () at
/build/llvm-toolchain-11-HMpQvg/llvm-toolchain-11-11.0.1/llvm/lib/ExecutionEngine/Orc/OrcCBindings.cpp:124
[09:47:11.123] No locals.
[09:47:11.123] #5  0x7fa4eed6fc7a in llvm_get_function
(context=0x564b1813a8a0, funcname=0x7fa4eed4a570 "AWAVATSH\201",
) at
../src/backend/jit/llvm/llvmjit.c:460
[09:47:11.123] addr = 94880527996960
[09:47:11.123] __func__ = "llvm_get_function"
[09:47:11.123] #6  0x7fa4eed902e1 in ExecRunCompiledExpr
(state=0x0, econtext=0x564b18269d20, isNull=0x7ffc11054d5f) at
../src/backend/jit/llvm/llvmjit_expr.c:2577
[09:47:11.123] cstate = 
[09:47:11.123] func = 0x564b18269d20
[09:47:11.123] #7  0x564b1698e614 in ExecEvalExprSwitchContext
(isNull=0x7ffc11054d5f, econtext=0x564b18269d20, state=0x564b182ad820)
at ../src/include/executor/executor.h:355
[09:47:11.123] retDatum = 
[09:47:11.123] oldContext = 0x564b182680d0
[09:47:11.123] retDatum = 
[09:47:11.123] oldContext = 
[09:47:11.123] #8  ExecProject (projInfo=0x564b182ad818) at
../src/include/executor/executor.h:389
[09:47:11.123] econtext = 0x564b18269d20
[09:47:11.123] state = 0x564b182ad820
[09:47:11.123] slot = 0x564b182ad788
[09:47:11.123] isnull = false
[09:47:11.123] econtext = 
[09:47:11.123] state = 
[09:47:11.123] slot = 
[09:47:11.123] isnull = 
[09:47:11.123] #9  ExecMergeJoin (pstate=) at
../src/backend/executor/nodeMergejoin.c:836
[09:47:11.123] node = 
[09:47:11.123] joinqual = 0x0
[09:47:11.123] otherqual = 0x0
[09:47:11.123] qualResult = 
[09:47:11.123] compareResult = 
[09:47:11.123] innerPlan = 
[09:47:11.123] innerTupleSlot = 
[09:47:11.123] outerPlan = 
[09:47:11.123] outerTupleSlot = 
[09:47:11.123] econtext = 0x564b18269d20
[09:47:11.123] doFillOuter = false
[09:47:11.123] doFillInner = false
[09:47:11.123] __func__ = "ExecMergeJoin"
[09:47:11.123] #10 0x564b169275b9 in ExecProcNodeFirst
(node=0x564b18269db0) at ../src/backend/executor/execProcnode.c:464
[09:47:11.123] No locals.
[09:47:11.123] #11 0x564b169a2675 in ExecProcNode
(node=0x564b18269db0) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #12 ExecRecursiveUnion (pstate=0x564b182684a0) at
../src/backend/executor/nodeRecursiveunion.c:115
[09:47:11.123] node = 0x564b182684a0
[09:47:11.123] outerPlan = 0x564b18268d00
[09:47:11.123] innerPlan = 0x564b18269db0
[09:47:11.123] plan = 0x564b182ddc78
[09:47:11.123] slot = 
[09:47:11.123] isnew = false
[09:47:11.123] #13 0x564b1695a421 in ExecProcNode
(node=0x564b182684a0) at ../src/include/executor/executor.h:273
[09:47:11.123] No locals.
[09:47:11.123] #14 CteScanNext (node=0x564b183a6830) at
../src/backend/executor/nodeCtescan.c:103
[09:47:11.123] cteslot = 
[09:47:11.123] estate = 
[09:47:11.123] dir = ForwardScanDirection
[09:47:11.123] forward = true
[09:47:11.123] tuplestorestate = 0x564b183a5cd0
[09:47:11.123] eof_tuplestore = 
[09:47:11.123] slot = 0x564b183a6be0
[09:47:11.123] #15 0x564b1692e22b in ExecScanFetch
(node=node@entry=0x564b183a6830,
accessMtd=accessMtd@entry=0x564b1695a183 ,
recheckMtd=recheckMtd@entry=0x564b16959db3 ) at
../src/backend/executor/execScan.c:132
[09:47:11.123] estate = 
[09:47:11.123] #16 0x564b1692e332 in ExecScan
(node=0x564b183a6830, accessMtd=accessMtd@entry=0x564b1695a183
, recheckMtd=recheckMtd@entry=0x564b16959db3
) at ../src/backend/executor/execScan.c:181
[09:47:11.123] econtext = 0x564b183a6b50
[09:47:11.123] qual = 0x0
[09:47:11.123] projInfo = 0x0
[09:47:11.123] #17 0x564b1695a5ea in 

Re: Shared detoast Datum proposal

2024-01-01 Thread Andy Fan

Andy Fan  writes:

>
> Some Known issues:
> --
>
> 1. Currently only Scan & Join nodes are considered for this feature.
> 2. JIT is not adapted for this purpose yet.

JIT is adapted for this feature in v2. Any feedback is welcome.

-- 
Best Regards
Andy Fan

>From ee44d4721147589dbba8366382a18adeee05419b Mon Sep 17 00:00:00 2001
From: "yizhi.fzh" 
Date: Wed, 27 Dec 2023 18:43:56 +0800
Subject: [PATCH v2 1/1] shared detoast feature.

---
 src/backend/executor/execExpr.c  |  65 ++-
 src/backend/executor/execExprInterp.c| 181 +++
 src/backend/executor/execTuples.c| 130 +
 src/backend/executor/execUtils.c |   5 +
 src/backend/executor/nodeHashjoin.c  |   2 +
 src/backend/executor/nodeMergejoin.c |   2 +
 src/backend/executor/nodeNestloop.c  |   1 +
 src/backend/jit/llvm/llvmjit_expr.c  |  22 +
 src/backend/jit/llvm/llvmjit_types.c |   1 +
 src/backend/nodes/bitmapset.c|  13 +
 src/backend/optimizer/plan/createplan.c  |  73 ++-
 src/backend/optimizer/plan/setrefs.c | 536 +++
 src/include/executor/execExpr.h  |   6 +
 src/include/executor/tuptable.h  |  60 +++
 src/include/nodes/bitmapset.h|   1 +
 src/include/nodes/execnodes.h|   5 +
 src/include/nodes/plannodes.h|  50 ++
 src/test/regress/sql/shared_detoast_slow.sql |  70 +++
 18 files changed, 1117 insertions(+), 106 deletions(-)
 create mode 100644 src/test/regress/sql/shared_detoast_slow.sql

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 2c62b0c9c8..749bcc9023 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -935,22 +935,81 @@ ExecInitExprRec(Expr *node, ExprState *state,
 }
 else
 {
+	int			attnum;
+	Plan	   *plan = state->parent ? state->parent->plan : NULL;
+
 	/* regular user column */
 	scratch.d.var.attnum = variable->varattno - 1;
 	scratch.d.var.vartype = variable->vartype;
+	attnum = scratch.d.var.attnum;
+
 	switch (variable->varno)
 	{
 		case INNER_VAR:
-			scratch.opcode = EEOP_INNER_VAR;
+
+			if (is_join_plan(plan) &&
+bms_is_member(attnum,
+			  ((JoinState *) state->parent)->inner_pre_detoast_attrs))
+			{
+/* debug purpose. */
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_INNER_VAR_TOAST: flags = %d costs=%.2f..%.2f, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 attnum);
+}
+scratch.opcode = EEOP_INNER_VAR_TOAST;
+			}
+			else
+			{
+scratch.opcode = EEOP_INNER_VAR;
+			}
 			break;
 		case OUTER_VAR:
-			scratch.opcode = EEOP_OUTER_VAR;
+			if (is_join_plan(plan) &&
+bms_is_member(attnum,
+			  ((JoinState *) state->parent)->outer_pre_detoast_attrs))
+			{
+/* debug purpose. */
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_OUTER_VAR_TOAST: flags = %u costs=%.2f..%.2f, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 attnum);
+}
+scratch.opcode = EEOP_OUTER_VAR_TOAST;
+			}
+			else
+scratch.opcode = EEOP_OUTER_VAR;
 			break;
 
 			/* INDEX_VAR is handled by default case */
 
 		default:
-			scratch.opcode = EEOP_SCAN_VAR;
+			if (is_scan_plan(plan) && bms_is_member(
+	attnum,
+	((ScanState *) state->parent)->scan_pre_detoast_attrs))
+			{
+if (!jit_enabled)
+{
+	elog(INFO,
+		 "EEOP_SCAN_VAR_TOAST: flags = %u costs=%.2f..%.2f, scanId: %d, attnum: %d",
+		 state->flags,
+		 plan->startup_cost,
+		 plan->total_cost,
+		 ((Scan *) plan)->scanrelid,
+		 attnum);
+}
+scratch.opcode = EEOP_SCAN_VAR_TOAST;
+			}
+			else
+scratch.opcode = EEOP_SCAN_VAR;
 			break;
 	}
 }
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 24c2b60c62..b5a464bf80 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -57,6 +57,7 @@
 #include "postgres.h"
 
 #include "access/heaptoast.h"
+#include "access/detoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
 #include "executor/execExpr.h"
@@ -157,6 +158,9 @@ static void ExecEvalRowNullInt(ExprState *state, ExprEvalStep *op,
 static Datum ExecJustInnerVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustOuterVar(ExprState *state, ExprContext *econtext, bool *isnull);
 static Datum ExecJustScanVar(ExprState *state, ExprContext *econtext, bool *isnull);
+static Datum ExecJustInnerVarToast(ExprState *state, ExprContext *econtext,