RE: Best way to keep track of a sliced TOAST

2019-03-20 Thread Bruno Hass
I would like to optimize the jsonb key access operations. I could not find the 
discussion you've mentioned, but I am giving some thought to the idea.

Instead of storing lengths, could we dedicate the first chunk of the TOASTed 
jsonb to store where each key is located? Would it be a good idea?

You've mentioned that the current jsonb format is byte-oriented. Does that 
imply that a single jsonb key value might be split between multiple chunks?
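
To make the question concrete, here is a minimal sketch of the arithmetic involved, assuming the datum is cut into fixed-size chunks purely by byte position. The function name, the key directory, and the 2000-byte chunk size are all hypothetical and only illustrate the idea; none of this is PostgreSQL code.

```python
def chunks_for_range(offset, length, chunk_size):
    """Return the chunk numbers touched by bytes [offset, offset + length)."""
    if length <= 0:
        return range(0)
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return range(first, last + 1)


# Hypothetical key directory: key -> (byte offset of value, byte length).
key_directory = {"name": (0, 40), "bio": (40, 5000)}
CHUNK_SIZE = 2000  # assumed chunk size, for illustration only

for key, (offset, length) in key_directory.items():
    print(key, list(chunks_for_range(offset, length, CHUNK_SIZE)))
```

Under these assumptions a small value stays inside one chunk, while a large one straddles several, so a per-key directory would have to record a chunk range rather than a single chunk.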


Bruno Hass


From: Robert Haas
Sent: Friday, March 15, 2019 12:22
To: Bruno Hass
Cc: pgsql-hackers
Subject: Re: Best way to keep track of a sliced TOAST

On Fri, Mar 15, 2019 at 7:37 AM Bruno Hass wrote:
> This idea is what I was hoping to achieve. Would we be able to make
> optimizations on deTOASTing just by storing the chunk lengths in chunk 0?

I don't know. I guess we could also NOT store the chunk lengths and
just say that if you don't know which chunk you want by chunk number,
your only other alternative is to read the chunks in order.  The
problem with that is that you can no longer index by byte position
without fetching every chunk prior to that byte position, but maybe
that's not important enough to justify the overhead of a list of chunk
lengths.  Or maybe it depends on what you want to do with it.
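
As a rough illustration of what a list of chunk lengths buys, here is a minimal sketch, assuming the lengths of the variable-sized chunks are available up front. The function name and the sample lengths are hypothetical; this is not how PostgreSQL stores or reads TOAST data today.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_byte(chunk_lengths, byte_pos):
    """Return (chunk_number, offset_within_chunk) for a byte position."""
    # ends[i] is the byte position just past the end of chunk i.
    ends = list(accumulate(chunk_lengths))
    if not ends or byte_pos < 0 or byte_pos >= ends[-1]:
        raise IndexError("byte position outside the datum")
    # The first chunk whose end lies beyond byte_pos contains it.
    chunk_no = bisect_right(ends, byte_pos)
    start = ends[chunk_no] - chunk_lengths[chunk_no]
    return chunk_no, byte_pos - start


lengths = [1200, 300, 1996, 50]  # hypothetical variable-sized chunks
print(locate_byte(lengths, 1500))
```

With the lengths in hand, the chunk holding a given byte is found by a binary search over running totals. Without them, the only way to learn where chunk N starts is to have read chunks 0 through N-1.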

Again, stuff like what you are suggesting here has been suggested
before.  I think the problem is if someone did the work to invent such
an infrastructure, that wouldn't actually do anything by itself.  We'd
then need to find an application of it where it brought us some clear
advantage.  As I said in my previous email, jsonb seems like a
promising candidate, but I don't think it's a slam dunk.  What would
the design look like, exactly?  Which operations would get faster, and
could we really make them work?  The existing format is, I think,
designed with byte-oriented access in mind, and a chunk-oriented
format might have different design constraints.  It seems like an idea
with potential, but there's a lot of daylight between a directional
idea with potential and a specific idea accompanied by a high-quality
implementation thereof.

> Also, wouldn't it break existing functions by dedicating a whole chunk 
> (possibly more) to such metadata?

Anybody writing such a patch would have to be prepared to fix any such
breakage that occurred, at least as regards core code.  I would guess
that this could be done without breaking too much third-party code,
but I guess it depends on exactly what the author of this hypothetical
patch ends up changing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: Best way to keep track of a sliced TOAST

2019-03-15 Thread Bruno Hass
> It seems to me that making this overly pluggable is likely to be a net
> negative, because there probably aren't really that many different
> ways of doing this that are useful, and because having to store more
> identifying information will make the toasted datum larger.  One idea
> is to let the datatype divide the datum up into variable-sized chunks
> and then have the on-disk format store a list of chunk lengths in
> chunk 0 (and following, if there are lots of chunks?) followed by the
> chunks themselves.  The data would all go into the TOAST table as it
> does today, and the TOASTed data could be read without knowing
> anything about the data type.  However, code that knows how the data
> was chunked at TOAST time could try to speed things up by operating
> directly on the compressed data if it can figure out which chunk it
> needs without fetching everything.

This idea is what I was hoping to achieve. Would we be able to make
optimizations on deTOASTing just by storing the chunk lengths in chunk 0?
Also, wouldn't it break existing functions by dedicating a whole chunk
(possibly more) to such metadata?


From: Robert Haas
Sent: Tuesday, March 12, 2019 14:34
To: Bruno Hass
Cc: pgsql-hackers
Subject: Re: Best way to keep track of a sliced TOAST

On Mon, Mar 11, 2019 at 9:27 AM Bruno Hass wrote:
> I've been reading about TOASTing and would like to modify how the slicing 
> works by taking into consideration the type of the varlena field. These 
> changes would support future implementations of type specific optimized 
> TOAST'ing functions. The first step would be to add information to the TOAST 
> so we know if it is sliced or not and by which function it was sliced and 
> TOASTed. This information should not break the current on disk format of 
> TOASTs. I had the idea of putting this information on the varattrib struct 
> va_header, perhaps adding more bit layouts to represent sliced TOASTs. This 
> idea, however, was pointed out to me as a rather naive approach. What would
> be the best way to do this?

Well, you can't really use va_header, because every possible bit
pattern for va_header means something already.  The first byte tells
us what kind of varlena we have:

 * Bit layouts for varlena headers on big-endian machines:
 *
 * 00xx 4-byte length word, aligned, uncompressed data (up to 1G)
 * 01xx 4-byte length word, aligned, *compressed* data (up to 1G)
 * 1000 1-byte length word, unaligned, TOAST pointer
 * 1xxx 1-byte length word, unaligned, uncompressed data (up to 126b)
 *
 * Bit layouts for varlena headers on little-endian machines:
 *
 * xx00 4-byte length word, aligned, uncompressed data (up to 1G)
 * xx10 4-byte length word, aligned, *compressed* data (up to 1G)
 * 0001 1-byte length word, unaligned, TOAST pointer
 * xxx1 1-byte length word, unaligned, uncompressed data (up to 126b)

All of the bits other than the ones that tell us what kind of varlena
we've got are part of the length word itself; you couldn't use any bit
pattern for some other purpose without breaking on-disk compatibility
with existing releases.  What you could possibly do is add a new
possible value of vartag_external, which tells us what "kind" of
toasted datum we've got.  Currently, toasted datums stored on disk are
always type 18, but there's no reason that I know of why we couldn't
have more than one possibility there.
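
To see why no bit pattern is free, here is a minimal sketch that classifies the first header byte according to the little-endian layout quoted above. The function name and the returned labels are made up for illustration; only the bit tests come from the table.

```python
def classify_varlena_le(first_byte):
    """Classify a varlena by its first header byte (little-endian layout)."""
    if first_byte == 0x01:
        # 0000 0001: 1-byte header, TOAST pointer follows
        return "toast-pointer"
    if first_byte & 0x01:
        # xxx1: 1-byte length word, short uncompressed datum
        return "short-uncompressed"
    if (first_byte & 0x03) == 0x00:
        # xx00: 4-byte length word, uncompressed
        return "4-byte-uncompressed"
    # The only remaining case is xx10: 4-byte length word, compressed
    return "4-byte-compressed"


# Every one of the 256 possible first bytes falls into one of the four kinds.
kinds = {classify_varlena_le(b) for b in range(256)}
print(sorted(kinds))
```

The four tests together cover all 256 values of the first byte, which is the point being made: there is no spare pattern left over to mean "sliced".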

However, I think you might want to discuss on this mailing list a bit
more about what you are hoping to achieve before you do too much
development, at least if you aspire to get something committed.  A
project like the one you are proposing sounds like something not for
the faint of heart, and it's not really clear what benefits you
anticipate.  I think there has been previous discussion of this topic
at least for jsonb, so you might also want to search the archives for
those discussions.  I wouldn't go so far as to say that this idea
can't work or wouldn't have any value, but it does seem like the kind
of thing where you could spend a lot of time going down a dead end,
and discussion on the list might help you avoid some of those dead
ends.

It seems to me that making this overly pluggable is likely to be a net
negative, because there probably aren't really that many different
ways of doing this that are useful, and because having to store more
identifying information will make the toasted datum larger.  One idea
is to let the datatype divide the datum up into variable-sized chunks
and then have the on-disk format store a list of chunk lengths in
chunk 0 (and following, if there are lots of chunks?) followed by the
chunks themselves.  The data would all go into the TOAST table as it
does today, and the TOASTed data could be read without knowing
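anything about the data type.  However, code that knows how the data
was chunked at TOAST time could try to speed things up by operating
directly on the compressed data if it can figure out which chunk it
needs without fetching everything.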
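
A minimal sketch of that layout, assuming chunk 0 holds a count followed by one 4-byte little-endian length per chunk. The encoding and all function names here are hypothetical, chosen only to illustrate the idea, and storage is modelled as a plain Python list rather than a TOAST table.

```python
import struct


def toast_with_directory(pieces):
    """Return [chunk 0 with the length list] + the data chunks."""
    header = struct.pack("<I", len(pieces))
    header += struct.pack(f"<{len(pieces)}I", *(len(p) for p in pieces))
    return [header] + [bytes(p) for p in pieces]


def read_directory(chunk0):
    """Decode the list of chunk lengths stored in chunk 0."""
    (count,) = struct.unpack_from("<I", chunk0, 0)
    return list(struct.unpack_from(f"<{count}I", chunk0, 4))


def detoast_all(chunks):
    """Type-agnostic read: skip the directory and concatenate the rest."""
    lengths = read_directory(chunks[0])
    return b"".join(chunks[1:1 + len(lengths)])


chunks = toast_with_directory([b"abc", b"", b"hello"])
print(read_directory(chunks[0]), detoast_all(chunks))
```

A generic reader only needs `detoast_all`, so it does not have to know how the datatype chose its chunk boundaries. A type-aware reader could fetch chunk 0, pick a chunk number, and fetch just that one chunk.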

Best way to keep track of a sliced TOAST

2019-03-11 Thread Bruno Hass
Hi,

I've been reading about TOASTing and would like to modify how the slicing works 
by taking into consideration the type of the varlena field. These changes would 
support future implementations of type specific optimized TOAST'ing functions. 
The first step would be to add information to the TOAST so we know if it is 
sliced or not and by which function it was sliced and TOASTed. This information 
should not break the current on disk format of TOASTs. I had the idea of 
putting this information on the varattrib struct va_header, perhaps adding more 
bit layouts to represent sliced TOASTs. This idea, however, was pointed out to 
me as a rather naive approach. What would be the best way to do this?

Bruno Hass


GSoC 2019 - TOAST'ing in slices idea

2019-03-04 Thread Bruno Hass

Hello everyone,

I am currently writing my proposal for GSoC 2019 for the TOAST'ing in slices
idea. I already have a sketch of the description and approach outline, which I
am sending in this e-mail. I would be happy to receive some feedback on it.
I've been reading the PostgreSQL source code and documentation in order to
better understand how to implement this idea and to write a detailed proposal.
I would also be glad to receive recommendations for documentation or material
on TOAST'ing internals, as well as hints on where to look further in the
source code.


I am looking forward to your feedback!


Attachment: PostgreSQL Proposal.pdf


[Proposal] TOAST'ing in slices

2019-03-04 Thread Bruno Hass
Hello Hackers,

I intend to optimize some varlena data types as my GSoC proposal. That would be
done by splitting the TOAST table chunks in a smarter way, depending on the
data type. A JSONB would be split considering its internal tree structure,
keeping track of which keys are in each chunk. Arrays and text 
fields are candidates for optimization as well, as I outlined in my proposal 
draft (attached). Not being very familiar with the source code, I am writing to 
this list seeking some guidance on where to look in the source code in order to 
detail my implementation. Furthermore, I have a couple of questions:

  *   Would it be a good idea to modify the TOAST table row to keep metadata on 
the data it stores?
  *   This proposal will modify how JSONB, array and text fields are TOASTed. 
Where can I find the code related to that?

Kind regards,

Bruno Hass



Attachment: PostgreSQL Proposal.pdf


GSoC 2019 - TOAST'ing in slices idea

2019-03-02 Thread Bruno Hass
Hello everyone,

I am currently writing my proposal for GSoC 2019 for the TOAST'ing in slices
idea. I already have a sketch of the description and approach outline, which I
am sending in this e-mail. I would be happy to receive some feedback on it.
I've been reading the PostgreSQL source code and documentation in order to
better understand how to implement this idea and to write a detailed proposal.
I would also be glad to receive recommendations for documentation or material
on TOAST'ing internals, as well as hints on where to look further in the
source code.


I am looking forward to your feedback!


Attachment: PostgreSQL Proposal.pdf