Re: [HACKERS] Reducing data type space usage

2006-09-18 Thread Hannu Krosing
On a fine day, Fri, 2006-09-15 at 19:34, Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Oh, OK, I had high byte meaning no header, but clear is better, so
  00000001 is 0x01, and 00000000 is ''.  But I see now that bytea does
  store nulls, so yea, we would be better using 10000001, and it is the
  same size as 00000000.
 
 I'm liking this idea more the more I think about it, because it'd
 actually be far less painful to put into the system structure than the
 other idea of fooling with varlena headers.  To review: Bruce is
 proposing a var-length type structure with the properties
 
   first byte 0xxxxxxx   field length 1 byte, exactly that value
   first byte 1xxxxxxx   xxxxxxx data bytes follow

would adding this -

first byte 0xxxxxxx   field length 1 byte, exactly that value
first byte 10xxxxxx   xxxxxx data bytes follow
first byte 110xxxxx  -- xxxxx xxxxxxxx data bytes to follow
first byte 111xxxxx  -- xxxxx xxxxxxxx xxxxxxxx xxxxxxxx data bytes to follow

be too expensive ?

it seems that for strings up to 63 bytes it would be as expensive as it is 
with your proposal, but that it would scale up to 536870912 (2^29) bytes nicely.

this would be extra good for datasets that are mostly below 63 (or 127) bytes
with only a small percentage above
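
For concreteness, here is a minimal decoding sketch of the scheme above (an
illustration only, not code from any patch; the function name and layout are
invented):

    #include <stddef.h>
    #include <stdint.h>

    /* Returns the number of header bytes consumed and sets *data_len to the
     * number of data bytes.  In the 0xxxxxxx case the byte itself is the
     * single data byte, so the header length is zero. */
    static size_t
    tiered_header_decode(const uint8_t *p, size_t *data_len)
    {
        uint8_t b = p[0];

        if ((b & 0x80) == 0)            /* 0xxxxxxx: the byte is the datum */
        {
            *data_len = 1;
            return 0;
        }
        if ((b & 0xC0) == 0x80)         /* 10xxxxxx: up to 63 data bytes */
        {
            *data_len = b & 0x3F;
            return 1;
        }
        if ((b & 0xE0) == 0xC0)         /* 110xxxxx xxxxxxxx: 13-bit length */
        {
            *data_len = ((size_t) (b & 0x1F) << 8) | p[1];
            return 2;
        }
        /* 111xxxxx + 3 more bytes: 29-bit length, up to 536870911 bytes */
        *data_len = ((size_t) (b & 0x1F) << 24) | ((size_t) p[1] << 16) |
                    ((size_t) p[2] << 8) | (size_t) p[3];
        return 4;
    }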


 This can support *any* stored value from zero to 127 bytes long.
 We can imagine creating new datatypes short varchar and short char,
 and then having the parser silently substitute these types for varchar(N)
 or char(N) whenever N <= 127 / max_encoding_length.  Add some
 appropriate implicit casts to convert these to the normal varlena types
 for computation, and away you go.  No breakage of any existing
 datatype-specific code, just a few additions in places like
 heap_form_tuple.
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 9: In versions below 8.0, the planner will ignore your desire to
choose an index scan if your joining column's datatypes do not
match
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Reducing data type space usage

2006-09-18 Thread Tom Lane
Hannu Krosing [EMAIL PROTECTED] writes:
 would adding this -

 first byte 0xxxxxxx   field length 1 byte, exactly that value
 first byte 10xxxxxx   xxxxxx data bytes follow
 first byte 110xxxxx  -- xxxxx xxxxxxxx data bytes to follow
 first byte 111xxxxx  -- xxxxx xxxxxxxx xxxxxxxx xxxxxxxx data bytes to follow

 be too expensive ?

What's the point?  Unless there's provision for TOAST, you can't put
arbitrarily long strings on disk anyway, so I don't see what this
datatype can do that is much better than the simpler version.  If you
try to fit TOAST in, then you are back to exactly the var-length-varlena
proposals upthread.

To me the only really compelling argument for worrying about the size
of the header is for fields that are certain to be short, such as
poor-mans-enumerated-type strings, part numbers, etc.  The
zero-or-one-byte header approach seems to handle that use-case just
fine.  I think more complicated approaches are going to add lots of
code and cycles for only very marginal gains.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Reducing data type space usage

2006-09-17 Thread Martijn van Oosterhout
On Sat, Sep 16, 2006 at 08:56:11PM +0100, Gregory Stark wrote:

[Re inet and cidr]

 Why are these varlena? Just for ipv6 addresses? Is the network mask length not
 stored if it's not present? This gives us a strange corner case in that ipv4
 addresses will *always* fit in the smallfoo data type and ipv6 *never* fit.
 Ie, we'll essentially end up with an ipv4inet and an ipv6inet. Sad in a way.

Eh? Either will always fit. ipv6 is 128 *bits* and the new varlena
header goes to 128 *bytes*. That should fit easily, no?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Reducing data type space usage

2006-09-17 Thread Gregory Stark
Martijn van Oosterhout kleptog@svana.org writes:

 On Sat, Sep 16, 2006 at 08:56:11PM +0100, Gregory Stark wrote:

 [Re inet and cidr]

 Why are these varlena? Just for ipv6 addresses? Is the network mask length 
 not
 stored if it's not present? This gives us a strange corner case in that ipv4
 addresses will *always* fit in the smallfoo data type and ipv6 *never* fit.
 Ie, we'll essentially end up with an ipv4inet and an ipv6inet. Sad in a way.

 Eh? Either will always fit. ipv6 is 128 *bits* and the new varlena
 header goes to 128 *bytes*. That should fit easily, no?

Er, oops.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 The user would have to decide that he'll never need a value over 127 bytes
 long ever in order to get the benefit.

Weren't you the one that's been going on at great length about how
wastefully we store CHAR(1) ?  Sure, this has a somewhat restricted
use case, but it's about as efficient as we could possibly get within
that use case.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Gregory Stark [EMAIL PROTECTED] writes:
 The user would have to decide that he'll never need a value over 127 bytes
 long ever in order to get the benefit.

 Weren't you the one that's been going on at great length about how
 wastefully we store CHAR(1) ?  Sure, this has a somewhat restricted
 use case, but it's about as efficient as we could possibly get within
 that use case.

Sure, but are you saying you would have this in addition to do variable sized
varlena headers?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Tom Lane [EMAIL PROTECTED] writes:
 Weren't you the one that's been going on at great length about how
 wastefully we store CHAR(1) ?  Sure, this has a somewhat restricted
 use case, but it's about as efficient as we could possibly get within
 that use case.

 Sure, but are you saying you would have this in addition to do variable sized
 varlena headers?

We could ... but right at the moment I'm thinking this would solve 80%
of the problem with about 1% of the work needed for the varlena change.
So my thought is to do this for 8.3 and then wait a couple releases to
see if the more extensive hack is really needed.  Even if we did
eventually install variable-sized varlena headers, this layout would
still be useful for types like inet/cidr.

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Andrew Dunstan



Tom Lane wrote:

To review: Bruce is
proposing a var-length type structure with the properties

first byte 0xxxxxxx   field length 1 byte, exactly that value
first byte 1xxxxxxx   xxxxxxx data bytes follow

This can support *any* stored value from zero to 127 bytes long.
We can imagine creating new datatypes short varchar and short char,
and then having the parser silently substitute these types for varchar(N)
or char(N) whenever N <= 127 / max_encoding_length.  Add some
appropriate implicit casts to convert these to the normal varlena types
for computation, and away you go.  No breakage of any existing
datatype-specific code, just a few additions in places like
heap_form_tuple.


  


I like this scheme a lot - maximum bang for buck.

Is there any chance we can do it transparently, without exposing new 
types? It is in effect an implementation detail ISTM, and ideally the 
user would not need to have any knowledge of it.


cheers

andrew

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 I like this scheme a lot - maximum bang for buck.

 Is there any chance we can do it transparently, without exposing new 
 types? It is in effect an implementation detail ISTM, and ideally the 
 user would not need to have any knowledge of it.

Well, they'd have to be separate types, but the parser handling of them
would be reasonably transparent I think.  It would work pretty much
exactly like the way that CHAR(N) maps to bpchar now --- is that
sufficiently well hidden for your taste?

regards, tom lane

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Andrew Dunstan



Tom Lane wrote:

Andrew Dunstan [EMAIL PROTECTED] writes:
  

I like this scheme a lot - maximum bang for buck.



  
Is there any chance we can do it transparently, without exposing new 
types? It is in effect an implementation detail ISTM, and ideally the 
user would not need to have any knowledge of it.



Well, they'd have to be separate types, but the parser handling of them
would be reasonably transparent I think.  It would work pretty much
exactly like the way that CHAR(N) maps to bpchar now --- is that
sufficiently well hidden for your taste?


  


Yeah, probably. At least to the stage where it's not worth a herculean 
effort to overcome.


cheers

andrew

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Hannu Krosing
On a fine day, Fri, 2006-09-15 at 19:18, Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  No, it'll be a 1-byte header with length indicating that no bytes
  follow,
 
  Well, in my idea, 10000001 would be 0x01.  I was going to use the
  remaining 7 bits for the 7-bit ascii value.
 
 Huh?  I thought you said 00000001 would be 0x01, that is, high bit
 clear means a single byte containing an ASCII character. 

why not go all the way, and do utf-7 encoded header if hi bit is set ?

or just always have an utf-8 encoded header.

 You could
 reverse that but it just seems to make things harder --- the byte
 isn't a correct data byte by itself, as it would be with the other
 convention.
 
   regards, tom lane
 
 ---(end of broadcast)---
 TIP 3: Have you checked our extensive FAQ?
 
http://www.postgresql.org/docs/faq
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Bruce Momjian
Hannu Krosing wrote:
 On a fine day, Fri, 2006-09-15 at 19:18, Tom Lane wrote:
  Bruce Momjian [EMAIL PROTECTED] writes:
   Tom Lane wrote:
   No, it'll be a 1-byte header with length indicating that no bytes
   follow,
  
   Well, in my idea, 10000001 would be 0x01.  I was going to use the
   remaining 7 bits for the 7-bit ascii value.
  
  Huh?  I thought you said 00000001 would be 0x01, that is, high bit
  clear means a single byte containing an ASCII character. 
 
 why not go all the way, and do utf-7 encoded header if hi bit is set ?
 
 or just always have an utf-8 encoded header.

This is a special zero-length header case.  We only have one byte. 
Doing a utf8 length call to span to the next column is too expensive.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Bruce Momjian
Tom Lane wrote:
 Gregory Stark [EMAIL PROTECTED] writes:
  The user would have to decide that he'll never need a value over 127 bytes
  long ever in order to get the benefit.
 
 Weren't you the one that's been going on at great length about how
 wastefully we store CHAR(1) ?  Sure, this has a somewhat restricted
 use case, but it's about as efficient as we could possibly get within
 that use case.

To summarize what we are now considering:

Originally, there was the idea of doing 1-, 2-, and 4-byte headers.  The
2-byte case is probably not worth the extra complexity (saving 2 bytes
on a 128-byte length isn't very useful).

What has come about is the idea of 0, 1, and 4-byte headers.  0-byte
headers store only one 7-bit ASCII byte, 1-byte headers can store 127
bytes or 127 / max_encoding_len characters.  4-byte headers store what
we have now.

The system is split into two types of headers, 0/1 headers which are
identified by a special data type (or mapped to a data type that can't
exceed that length, like inet), and 4-byte headers.  The code that deals
with 0/1 headers is independent of the 4-byte header code we have now.
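
As a sketch of the 0/1-byte side of that split (names invented, not actual
heap_form_tuple code; a short type is assumed never to exceed 127 bytes):

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Write a short-type datum at "dest" and return the bytes used. */
    static size_t
    store_short_datum(uint8_t *dest, const uint8_t *data, size_t len)
    {
        assert(len <= 127);             /* guaranteed by the short type itself */

        if (len == 1 && (data[0] & 0x80) == 0)
        {
            dest[0] = data[0];          /* 0xxxxxxx: single 7-bit byte, no header */
            return 1;
        }

        dest[0] = 0x80 | (uint8_t) len; /* 1xxxxxxx: length in the low 7 bits */
        memcpy(dest + 1, data, len);
        return 1 + len;
    }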

I am slightly worried about having short version of many of our types. 
Not only char, varchar, and text, but also numeric.  I see these varlena
types in the system:

test=> SELECT typname FROM pg_type WHERE typlen = -1 AND typtype = 'b'
   AND typelem = 0;
  typname
---
 bytea
 text
 path
 polygon
 inet
 cidr
 bpchar
 varchar
 bit
 varbit
 numeric
 refcursor
(12 rows)

Are these shorter headers going to have the same alignment requirements
as the 4-byte headers?  I am thinking not, meaning we will not have as
much padding overhead we have now.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Gregory Stark
Bruce Momjian [EMAIL PROTECTED] writes:

 Tom Lane wrote:
 Gregory Stark [EMAIL PROTECTED] writes:
  The user would have to decide that he'll never need a value over 127 bytes
  long ever in order to get the benefit.
 
 Weren't you the one that's been going on at great length about how
 wastefully we store CHAR(1) ?  Sure, this has a somewhat restricted
 use case, but it's about as efficient as we could possibly get within
 that use case.

Sure, this helps with CHAR(1) but there were plen


 To summarize what we are now considering:

 Originally, there was the idea of doing 1,2, and 4-byte headers.  The
 2-byte case is probably not worth the extra complexity (saving 2 bytes
 on a 128-byte length isn't very useful).

Well don't forget we virtually *never* use more than 2 bytes out of the 4 byte
headers for on-disk data. The only way we ever store a datum larger than 16k
is you compile with 32k blocks *and* you explicitly disable toasting on the
column.

Worse, if we don't do anything about fields like text it's not true that this
only occurs on 128-byte columns and larger. It occurs on any column that
*could* contain 128 bytes or more. Ie, any column declared as varchar(128)
even if it contains only Bruce or any column declared as text or numeric.

I'm not sure myself whether the smallfoo data types are a bad idea in
themselves though. I just think it probably doesn't replace trying to shorten
the largefoo varlena headers as well.

Part of the reason I think the smallfoo data types may be a bright idea in
their own right is that the datatypes might be able to do clever things about
their internal storage. For instance, smallnumeric could use base 100 where
largenumeric uses base 10000.

 I am slightly worried about having short version of many of our types. 
 Not only char, varchar, and text, but also numeric.  I see these varlena
 types in the system:

I think only the following ones make sense for smallfoo types:

bpchar
varchar
bit
varbit
numeric

These don't currently take typmods so we'll never know when they could use a
smallfoo representation, it might be useful if they did though:

bytea
text
path
polygon


Why are these varlena? Just for ipv6 addresses? Is the network mask length not
stored if it's not present? This gives us a strange corner case in that ipv4
addresses will *always* fit in the smallfoo data type and ipv6 *never* fit.
Ie, we'll essentially end up with an ipv4inet and an ipv6inet. Sad in a way.

inet
cidr

I have to read up on what this is.

refcursor


 Are these shorter headers going to have the same alignment requirements
 as the 4-byte headers?  I am thinking not, meaning we will not have as
 much padding overhead we have now.

Well a 1-byte length header doesn't need any alignment so they would have only
the alignment that the data type itself declares. I'm not sure how it interacts
with heap_deform_tuple but it's probably simpler than finding out only once
you parse the length header what alignment you need.
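
A back-of-the-envelope illustration of that padding difference, assuming int
alignment for today's 4-byte header, no alignment for a 1-byte header, and a
made-up 3-byte value:

    #include <stdio.h>

    static unsigned align_up(unsigned offset, unsigned alignment)
    {
        return (offset + alignment - 1) & ~(alignment - 1);
    }

    int main(void)
    {
        unsigned data_len = 3;

        /* today: 4-byte header, and the next column's varlena header must be
         * int-aligned, so the 3-byte value effectively occupies 8 bytes */
        unsigned end_now = align_up(4 + data_len, 4);

        /* proposed: 1-byte header, no alignment, so it occupies 4 bytes */
        unsigned end_short = 1 + data_len;

        printf("4-byte aligned header: %u bytes, 1-byte header: %u bytes\n",
               end_now, end_short);
        return 0;
    }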

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Mark Dilger

Mark Dilger wrote:

Wouldn't a 4-byte numeric be a float4 and an 8-byte numeric be a 
float8.  I'm not sure I see the difference.


Nevermind.  I don't normally think about numeric as anything other than 
an arbitrarily large floating point type.  But it does differ in that 
you can specify the range you want it to cover.


mark

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread mark
On Sat, Sep 16, 2006 at 02:13:49PM -0700, Mark Dilger wrote:
 Mark Dilger wrote:
 Wouldn't a 4-byte numeric be a float4 and an 8-byte numeric be a 
 float8.  I'm not sure I see the difference.
 Nevermind.  I don't normally think about numeric as anything other than 
 an arbitrarily large floating point type.  But it does differ in that 
 you can specify the range you want it to cover.

Range and the base, both being important.

Cheers,
mark

-- 
[EMAIL PROTECTED] / [EMAIL PROTECTED] / [EMAIL PROTECTED] 
__
.  .  _  ._  . .   .__.  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/|_ |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
   and in the darkness bind them...

   http://mark.mielke.cc/


---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Bruce Momjian
Gregory Stark wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 
  Tom Lane wrote:
  Gregory Stark [EMAIL PROTECTED] writes:
   The user would have to decide that he'll never need a value over 127 
   bytes
   long ever in order to get the benefit.
  
  Weren't you the one that's been going on at great length about how
  wastefully we store CHAR(1) ?  Sure, this has a somewhat restricted
  use case, but it's about as efficient as we could possibly get within
  that use case.
 
 Sure, this helps with CHAR(1) but there were plen

OK.  One thing that we have to remember is that the goal isn't to
squeeze every byte out of the storage format.  That would be
inefficient, performance-wise.  We need just a reasonable storage layout.

  To summarize what we are now considering:
 
  Originally, there was the idea of doing 1,2, and 4-byte headers.  The
  2-byte case is probably not worth the extra complexity (saving 2 bytes
  on a 128-byte length isn't very useful).
 
 Well don't forget we virtually *never* use more than 2 bytes out of the 4 byte
 headers for on-disk data. The only way we ever store a datum larger than 16k
 is you compile with 32k blocks *and* you explicitly disable toasting on the
 column.

Well, if we went with 2-byte, then we are saying we are not going to
store the TOAST length in the heap header, but store it somewhere else,
probably in TOAST.  I can see how that could be done.  This would leave
us with 0, 1, and 2-byte headers, and 4-byte headers in TOAST.  Is that
something to consider?  I think one complexity is that we are going to
need 4-byte headers in the backend to move around values, so there is
going to need to be a 2-byte to 4-byte mapping for all data types, not
just the short ones.  If this only applies to TEXT, bytea, and a few
other types, it is uncertain whether it is worth it.

(We do store the TOAST length in heap, right, just before the TOAST
pointer?)

 Worse, if we don't do anything about fields like text it's not true that this
 only occurs on 128-byte columns and larger. It occurs on any column that
 *could* contain 128 bytes or more. Ie, any column declared as varchar(128)
 even if it contains only Bruce or any column declared as text or numeric.

Well, if you are using TEXT, it is hard to say you are worried about
storage size.  I can't imagine many one-byte values are stored in TEXT.

 I'm not sure myself whether the smallfoo data types are a bad idea in
 themselves though. I just think it probably doesn't replace trying to shorten
 the largefoo varlena headers as well.

See above.  Using just 2-byte headers in heap is a possibility.  I am
just not sure if the overhead is worth it.  With the 0-1 header, we
don't have any backend changes as data is passed around from the disk to
memory.  Doing the 2-byte header would require that.

 Part of the reason I think the smallfoo data types may be a bright idea in
 their own right is that the datatypes might be able to do clever things about
 their internal storage. For instance, smallnumeric could use base 100 where
 largenumeric uses base 10000.

I hardly think modifying the numeric routines to do two different
bases is worth it.  

  I am slightly worried about having short version of many of our types. 
  Not only char, varchar, and text, but also numeric.  I see these varlena
  types in the system:
 
 I think only the following ones make sense for smallfoo types:
 
   bpchar
   varchar
   bit
   varbit
   numeric

OK, bit and numeric are ones we didn't talk about yet.

 These don't currently take typmods so we'll never know when they could use a
 smallfoo representation, it might be useful if they did though:
 
   bytea
   text
   path
   polygon

Good point.

 
 
 Why are these varlena? Just for ipv6 addresses? Is the network mask length not
 stored if it's not present? This gives us a strange corner case in that ipv4
 addresses will *always* fit in the smallfoo data type and ipv6 *never* fit.
 Ie, we'll essentially end up with an ipv4inet and an ipv6inet. Sad in a way.
 
   inet
   cidr

Yes, I think so.

 
 I have to read up on what this is.
 
   refcursor
 
 
  Are these shorter headers going to have the same alignment requirements
  as the 4-byte headers?  I am thinking not, meaning we will not have as
  much padding overhead we have now.
 
 Well a 1-byte length header doesn't need any alignment so they would have only
 the alignment that the data type itself declares. I'm not sure how it interacts
 with heap_deform_tuple but it's probably simpler than finding out only once
 you parse the length header what alignment you need.

That is as big a win as the shorter header.  Doing a variable length
header with big-endian encoding and stuff would be a mess, for sure.
With 0-1 header, your alignment doesn't need to change from the disk to
memory.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Heikki Linnakangas

Tom Lane wrote:

Gregory Stark [EMAIL PROTECTED] writes:
The user would have to decide that he'll never need a value over 127 
bytes

long ever in order to get the benefit.


Weren't you the one that's been going on at great length about how
wastefully we store CHAR(1) ? Sure, this has a somewhat restricted
use case, but it's about as efficient as we could possibly get within
that use case.


I like the idea of having variable length headers much more than a new 
short character type. It solves a more general problem, and it 
compresses VARCHAR(255) or TEXT fields nicely when the actual data in the 
field is small.


I'd like to propose one more encoding scheme, based on Tom's earlier 
proposals. The use cases I care about are:


* support uncompressed data up to 1G, like we do now
* 1 byte length word for short data.
* store typical CHAR(1) values in just 1 byte.

Tom wrote:
 * 0xxxxxxx uncompressed 4-byte length word as stated above
 * 10xxxxxx 1-byte length word, up to 62 bytes of data
 * 110xxxxx 2-byte length word, uncompressed inline data
 * 1110xxxx 2-byte length word, compressed inline data
 * 1111xxxx 1-byte length word, out-of-line TOAST pointer

My proposal is:

00xxxxxx uncompressed, aligned 4-byte length word
010xxxxx 1-byte length word, uncompressed inline data (up to 32 bytes)
011xxxxx 2-byte length word, uncompressed inline data (up to 8k)
1xxxxxxx 1 byte data in range 0x20-0x7E
1000xxxx 2-byte length word, compressed inline data (up to 4k)
11111111 TOAST pointer

The decoding algorithm is similar to Tom's proposal, and relies on using 
0x00 for padding.
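
Spelled out as a first-byte dispatch, the proposal might look like this (a
sketch only; the names are invented, and the exact bit patterns for the
compressed-inline and TOAST-pointer cases are assumptions):

    #include <stdint.h>

    enum vl_case
    {
        VL_PAD,           /* 0x00 padding byte, skipped by the decoder      */
        VL_4BYTE,         /* 00xxxxxx: uncompressed, aligned 4-byte length  */
        VL_1BYTE,         /* 010xxxxx: 1-byte length word, up to 32 bytes   */
        VL_2BYTE,         /* 011xxxxx: 2-byte length word, up to 8k         */
        VL_CHAR,          /* 1xxxxxxx: one data byte in the range 0x20-0x7E */
        VL_COMPRESSED,    /* 1000xxxx: 2-byte length word, compressed, 4k   */
        VL_TOAST          /* assumed marker for an out-of-line pointer      */
    };

    static enum vl_case
    classify_first_byte(uint8_t b)
    {
        if (b == 0x00)              return VL_PAD;
        if ((b & 0xC0) == 0x00)     return VL_4BYTE;
        if ((b & 0xE0) == 0x40)     return VL_1BYTE;
        if ((b & 0xE0) == 0x60)     return VL_2BYTE;
        if (b == 0xFF)              return VL_TOAST;
        if ((b & 0xF0) == 0x80)     return VL_COMPRESSED;
        return VL_CHAR;             /* stored as the character with the high bit set */
    }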


--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Gregory Stark
Bruce Momjian [EMAIL PROTECTED] writes:

 Gregory Stark wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 
 Sure, this helps with CHAR(1) but there were plen

 OK.  

Ooops, sorry, I guess I sent that before I was finished editing it. I'm glad
you could divine what I meant because I'm not entirely sure myself :)

 Well, if you are using TEXT, it is hard to say you are worried about
 storage size.  I can't imagine many one-byte values are stored in TEXT.

Sure, what about Middle name or initial. Or Apartment Number. Or for that
matter Drive Name on a windows box. Just because the user doesn't want to
enforce a limit on the field doesn't mean the data will always be so large.

 Part of the reason I think the smallfoo data types may be a bright idea in
 their own right is that the datatypes might be able to do clever things about
 their internal storage. For instance, smallnumeric could use base 100 where
 largenumeric uses base 10000.

 I hardly think modifying the numeric routines to do two different
 bases is worth it.  

It doesn't actually require any modification, it's already a #define. It may
be worth doing the work to make it a run-time parameter so we don't need to
recompile the functions twice.

I'm pretty sure it's worthwhile as far as space conservation goes. a datum
holding a value like 10 currently takes 10 bytes including the length
header:

postgres=# select sizeof('10'::numeric);
 sizeof 
--------
     10
(1 row)


That would go down to 7 bytes with a 1-byte length header. And down to 4 bytes
with base 100. Ie, reading a table full of small numeric values would be 75%
faster.

With some clever hacking I think we could get it to go down to a single byte
with no length header just like ascii characters for integers under 128. But
that's a separate little side project.
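
The arithmetic behind those numbers, under the assumptions in this message
(4-byte varlena header plus a 4-byte weight/sign/dscale header and 2 bytes per
base-10000 digit today; the base-100 figure additionally assumes the numeric
header can be squeezed to 2 bytes, which is a guess):

    #include <stdio.h>

    int main(void)
    {
        /* the value 10 is a single digit in base 10000 and in base 100 */
        int today     = 4 + 4 + 2 * 1;   /* 4-byte varlena header -> 10 bytes */
        int short_hdr = 1 + 4 + 2 * 1;   /* 1-byte length header  ->  7 bytes */
        int base100   = 1 + 2 + 1 * 1;   /* hypothetical smallnumeric -> 4 bytes */

        printf("today=%d, 1-byte header=%d, base-100 smallnumeric=%d\n",
               today, short_hdr, base100);
        return 0;
    }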


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Reducing data type space usage

2006-09-16 Thread Tom Lane
Hannu Krosing [EMAIL PROTECTED] writes:
 why not go all the way, and do utf-7 encoded header if hi bit is set ?
 or just always have an utf-8 encoded header.

That definition is (a) very expensive to scan, and (b) useless for
anything except utf-8 encoded text.  Whatever mechanism we select should
be more flexible than that (eg, should work for inet values).

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] Reducing data type space usage

2006-09-15 Thread Gregory Stark


Following up on the recent discussion on list about wasted space in data
representations I want to summarise what we found and make some proposals:

As I see it there are two cases:

Case 1) Data types that are variable length but often quite small. This includes
   things like NUMERIC which in common use will rarely be larger than 12-20
   bytes and often things like text.

   In cases like these we really only need 1 or sometimes 2 byte varlena
   header overhead, not 4 as we currently do. In fact we *never* need more
   than 2 bytes of varlena header on disk anyways with the standard
   configuration.

Case 2) Data types that are different sizes depending on the typmod but are 
always
   the same size that can be determined statically for a given typmod. In the
   case of an ASCII encoded database CHAR(n) fits this category and in any case
   we'll eventually have per-column encoding. NUMERIC(a,b) could also be made
   to fit this as well.
   
   In cases like these we don't need *any* varlena header. If we could arrange
   for the functions to have enough information to know how large the data
   must be.

Solutions proposed:

Case 1) We've discussed the variable sized varlena headers and I think it's 
clear
   that that's the most realistic way to approach it.

   I don't think any other approaches were even suggested. Tom said he wanted
   a second varlena format for numeric that would have 2-byte alignment. But I
   think we could always just say that we always use the 2-byte varlena header
   on data types with 2-byte alignment and the 4-byte header on data types
   with 4-byte alignment needs. Or heap_form_tuple could be even cleverer
   about it but I'm not sure it's worth it.

   This limits the wasted space to 1-2% for most variable sized data that are
   50 bytes long or more. But for very small data such as the quite common
   cases where those are often only 1-4 bytes it still means a 25-100%
   performance drain.

Case 2) Solving this is quite difficult without introducing major performance
   problems or security holes. The one approach we have that's practical right
   now is introducing special data types such as the oft-mentioned char data
   type. char doesn't have quite the right semantics to use as a transparent
   substitute for CHAR but we could define a CHAR(1) with exactly the right
   semantics and substitute it transparently in parser/analyze.c (btw having
   two files named analyze.c is pretty annoying). We could do the same with
   NUMERIC(a,b) for sufficiently small values of a and b with something like
   D'Arcy's CASH data type (which uses an integer internally).

   The problem with defining lots of data types is that the number of casts
   and cross-data-type comparisons grows quadratically as the number of data
   types grows. In theory we would save space by defining a CHAR(n) for
   whatever size n the user needs but I can't really see anything other than
   CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC substitute like CASH
   (with full NUMERIC semantics though) and maybe a 2-byte and 8-byte
   substitute might be reasonable but anything else would be pointless.

I see these two solutions as complementary. The variable varlena headers take
care of the larger data and the special-purpose data types take care of the
extremely small data. And it's pretty important to cover both cases, since data that fits
in 1-4 bytes is quite common. You often see databases with dozens of CHAR(1)
flag columns or NUMERIC(10,2) currency columns.

With a CHAR(1) and CASH-style numeric substitute we won't have the 25-100%
performance drain on the things that would fit in 1-4 bytes. And with the
variable sized varlena header we'll limit the drain due to wasted space on
larger data to 25% at worst and usually 1-2%.

Doing better would require a complete solution to data types that can
understand how large they are based on their typmod. That would imply more
dramatic solutions like I mused about involving passing around structures that
contain the Datum as well as the attlen or atttypmod. The more I think about
these ideas the more I think they may have merit but they would be awfully
invasive and require more thought.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Martijn van Oosterhout
On Fri, Sep 15, 2006 at 06:50:37PM +0100, Gregory Stark wrote:
 With a CHAR(1) and CASH style numeric substitute we won't have 25-100%
 performance lost on the things that would fit in 1-4 bytes. And with the
 variable sized varlena header we'll limit to 25% at worst and 1-2% usually the
 performance drain due to wasted space on larger data.

I wonder how much of the benefit will be eaten by alignment. I think
it'd be great if we rearranged the fields in a tuple to minimize
alignment padding, but that logical field order patch has been and gone and the
issues haven't changed.

There's also slack at the end of pages.

 Doing better would require a complete solution to data types that can
 understand how large they are based on their typmod. That would imply more
 dramatic solutions like I mused about involving passing around structures that
 contain the Datum as well as the attlen or atttypmod. The more I think about
 these ideas the more I think they may have merit but they would be awfully
 invasive and require more thought.

Whatever the solution is here, the same logic will have to apply to
extracting Datums out of tuples. If you want the 7th column in a tuple,
you have to find the lengths of all the previous datums first.
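
That sequential dependence looks roughly like this (a toy sketch of today's
layout, not heap_deform_tuple; alignment padding and nulls are ignored, and
the descriptor type is made up):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct attr_desc { int attlen; };   /* > 0: fixed width, -1: varlena */

    /* Offset of column "target" within the tuple data: every preceding
     * varlena column's length word has to be read along the way. */
    static size_t
    attribute_offset(const uint8_t *tup, const struct attr_desc *atts, int target)
    {
        size_t off = 0;

        for (int i = 0; i < target; i++)
        {
            if (atts[i].attlen > 0)
                off += (size_t) atts[i].attlen;
            else
            {
                uint32_t len;
                memcpy(&len, tup + off, sizeof(len));   /* 4-byte length word,
                                                           which counts itself */
                off += len;
            }
        }
        return off;
    }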

Good summary though, probably worth putting on the wiki so next time we
don't have to search the archives.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Mark Dilger

Gregory Stark wrote:



snip


Case 2) Solving this is quite difficult without introducing major performance
   problems or security holes. The one approach we have that's practical right
   now is introducing special data types such as the oft-mentioned char data
   type. char doesn't have quite the right semantics to use as a transparent
   substitute for CHAR but we could define a CHAR(1) with exactly the right
   semantics and substitute it transparently in parser/analyze.c (btw having
   two files named analyze.c is pretty annoying). We could do the same with
   NUMERIC(a,b) for sufficiently small values of a and b with something like
   D'Arcy's CASH data type (which uses an integer internally).


Didn't we discuss a problem with using CHAR(n), specifically that the 
number of bytes required to store n characters is variable?  I had 
suggested making an ascii1 type, ascii2 type, etc.  Someone else seemed 
to be saying that should be called bytea1, bytea2, or perhaps with the 
parenthesis bytea(1), bytea(2).  The point being that it is a fixed 
number of bytes.



   The problem with defining lots of data types is that the number of casts
   and cross-data-type comparisons grows quadratically as the number of data
   types grows. In theory we would save space by defining a CHAR(n) for
   whatever size n the user needs but I can't really see anything other than
   CHAR(1) being worthwhile. Similarly a 4-byte NUMERIC substitute like CASH
   (with full NUMERIC semantics though) and maybe a 2-byte and 8-byte
   substitute might be reasonable but anything else would be pointless.


Wouldn't a 4-byte numeric be a float4 and an 8-byte numeric be a 
float8.  I'm not sure I see the difference.  As for a 2-byte floating 
point number, I like the idea and will look for an ieee specification 
for how the bits are arranged, if any such ieee spec exists.


mark

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Bruce Momjian
Gregory Stark wrote:
 Case 2) Data types that are different sizes depending on the typmod but are 
 always
the same size that can be determined statically for a given typmod. In the
 case of an ASCII encoded database CHAR(n) fits this category and in any case
 we'll eventually have per-column encoding. NUMERIC(a,b) could also be made
to fit this as well.

In cases like these we don't need *any* varlena header. If we could arrange
for the functions to have enough information to know how large the data
must be.

I thought about the CHAR(1) case some more.  Rather than restrict
single-byte storage to ASCII-encoded databases, I think there is a more
general solution.

First, I don't think any solution that assumes typmod will be around to
help determine the meaning of the column is going to work.

I think what will work is to store a 1-character, 7-bit ASCII value in
one byte, by setting the high bit.  This will work for any database
encoding.  This is the zero-length header case.

If the single character has its high bit set, it will require a one-byte length header
and then the high-bit byte, and if it is multi-byte, perhaps more bytes.

Zero-length header will even work for a VARCHAR(8) field that stores one
7-bit ASCII character, because it isn't relying on the typmod.

FYI, we also need to figure out how to store a zero-length string.  That
will probably be high-bit, and then all zero bits.  We don't store a
zero-byte in strings, so that should be unique for ''.
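
A tiny sketch of that zero-header case as described in this message, where
the high bit marks a lone 7-bit character (names invented):

    #include <stdint.h>

    /* A single 7-bit ASCII character is stored as itself with the high bit
     * set, so no length header is needed at all. */
    static uint8_t
    encode_single_ascii(char c)
    {
        return (uint8_t) c | 0x80;      /* caller guarantees c is 7-bit ASCII */
    }

    /* High bit set on the first byte means "one character, no header";
     * otherwise a length header follows. */
    static int
    decode_single_ascii(uint8_t first_byte, char *out)
    {
        if (first_byte & 0x80)
        {
            *out = (char) (first_byte & 0x7F);
            return 1;
        }
        return 0;
    }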

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 FYI, we also need to figure out how to store a zero-length string.  That
 will probably be high-bit, and then all zero bits.  We don't store a
 zero-byte in strings, so that should be unique for ''.

No, it'll be a 1-byte header with length indicating that no bytes
follow, which likely will be 10000001 rather than 10000000 ... but
in either case there is no ambiguity involved.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  FYI, we also need to figure out how to store a zero-length string.  That
  will probably be high-bit, and then all zero bits.  We don't store a
  zero-byte in strings, so that should be unique for ''.
 
 No, it'll be a 1-byte header with length indicating that no bytes
 follow, which likely will be 10000001 rather than 10000000 ... but
 in either case there is no ambiguity involved.

Well, in my idea, 10000001 would be 0x01.  I was going to use the
remaining 7 bits for the 7-bit ascii value.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 No, it'll be a 1-byte header with length indicating that no bytes
 follow,

 Well, in my idea, 10000001 would be 0x01.  I was going to use the
 remaining 7 bits for the 7-bit ascii value.

Huh?  I thought you said 00000001 would be 0x01, that is, high bit
clear means a single byte containing an ASCII character.  You could
reverse that but it just seems to make things harder --- the byte
isn't a correct data byte by itself, as it would be with the other
convention.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Bruce Momjian
Tom Lane wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  No, it'll be a 1-byte header with length indicating that no bytes
  follow,
 
  Well, in my idea, 10000001 would be 0x01.  I was going to use the
  remaining 7 bits for the 7-bit ascii value.
 
 Huh?  I thought you said 00000001 would be 0x01, that is, high bit
 clear means a single byte containing an ASCII character.  You could
 reverse that but it just seems to make things harder --- the byte
 isn't a correct data byte by itself, as it would be with the other
 convention.

Oh, OK, I had high byte meaning no header, but clear is better, so
00000001 is 0x01, and 00000000 is ''.  But I see now that bytea does
store nulls, so yea, we would be better using 10000001, and it is the
same size as 00000000.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Gregory Stark
Bruce Momjian [EMAIL PROTECTED] writes:

 Oh, OK, I had high byte meaning no header

Just how annoying would it be if I pointed out I suggested precisely this a
few days ago?

Tom said he didn't think there was enough code space and my own
experimentation was slowly leading me to agree, sadly. It would be neat to get
it to work though.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Tom Lane
Bruce Momjian [EMAIL PROTECTED] writes:
 Oh, OK, I had high byte meaning no header, but clear is better, so
 00000001 is 0x01, and 00000000 is ''.  But I see now that bytea does
 store nulls, so yea, we would be better using 10000001, and it is the
 same size as 00000000.

I'm liking this idea more the more I think about it, because it'd
actually be far less painful to put into the system structure than the
other idea of fooling with varlena headers.  To review: Bruce is
proposing a var-length type structure with the properties

first byte 0xxxxxxx   field length 1 byte, exactly that value
first byte 1xxxxxxx   xxxxxxx data bytes follow

This can support *any* stored value from zero to 127 bytes long.
We can imagine creating new datatypes short varchar and short char,
and then having the parser silently substitute these types for varchar(N)
or char(N) whenever N <= 127 / max_encoding_length.  Add some
appropriate implicit casts to convert these to the normal varlena types
for computation, and away you go.  No breakage of any existing
datatype-specific code, just a few additions in places like
heap_form_tuple.

regards, tom lane

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Tom said he didn't think there was enough code space and my own
 experimentation was slowly leading me to agree, sadly.

There isn't if you want the type to also handle long strings.
But what if we restrict it to short strings?  See my message
just now.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Bruce Momjian
Gregory Stark wrote:
 Bruce Momjian [EMAIL PROTECTED] writes:
 
  Oh, OK, I had high byte meaning no header
 
 Just how annoying would it be if I pointed out I suggested precisely this a
 few days ago?
 
 Tom said he didn't think there was enough code space and my own
 experimentation was slowly leading me to agree, sadly. It would be neat to get
 it to work though.

I didn't see the 7-bit thing.  Guess I missed it.  I was worried we
would not have enough bits to track a 1-gig field.

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDBhttp://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Gregory Stark [EMAIL PROTECTED] writes:
 Tom said he didn't think there was enough code space and my own
 experimentation was slowly leading me to agree, sadly.

 There isn't if you want the type to also handle long strings.
 But what if we restrict it to short strings?  See my message
 just now.

Then it seems like it imposes a pretty hefty burden on the user. 

text columns, for example, can never take advantage of it. And there are
plenty of instances where 127 bytes would be just short enough to be annoying
even though 99% of the data would in fact be shorter. Things like address
and product name for example.

The user would have to decide that he'll never need a value over 127 bytes
long ever in order to get the benefit.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Reducing data type space usage

2006-09-15 Thread Bort, Paul
Gregory Stark writes:
 Tom Lane [EMAIL PROTECTED] writes:
 
  There isn't if you want the type to also handle long strings.
  But what if we restrict it to short strings?  See my 
 message just now.
 
 Then it seems like it imposes a pretty hefty burden on the user. 
 

But there are a lot of places where it wins: 
- single byte for a multi-state flag
- hex representation of a hash (like SHA-1)
- part numbers
- lots of fields imported from legacy systems
- ZIP/Postal codes

And for all of those you can decisively say at design time that 127
characters is an OK limit.

+1 for Bruce/Tom's idea.

Regards,
Paul Bort

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq