Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Seamus Abshere
On Fri, Oct 13, 2017, at 03:16 PM, David G. Johnston wrote:
> implement a "system-managed-enum" type with many of the same properties
[...]
> TOAST does involved compression but the input to
> the compression algorithm is a single cell (row and column) in a table.​
> As noted above I consider the TOAST table and main table to be a single
> logical table.

See this sounds like _exactly_ what I want. Except with a content hash
instead of an id.

Seems to me like all of the machinery that allows you to look things up
by TOASTed columns and subsequently return TOASTed values as if they
resided in the same physical table is what is needed.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Thomas Kellerer

Seamus Abshere schrieb am 13.10.2017 um 18:43:

On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote:

Theoretically / blue sky, could there be a table or column type that
transparently handles "shared strings" like this, reducing size on disk
at the cost of lookup overhead for all queries?
(I guess maybe it's like TOAST, but content-hashed and de-duped and not
only for large objects?)


On Fri, Oct 13, 2017, at 01:29 PM, Melvin Davidson wrote:

What was described is exactly what relations and Foreign Keys are for.


hi Melvin, appreciate the reminder. Our issue is that we have 300+
columns and frequently include them in the SELECT or WHERE clauses... so
traditional normalization would involve hundreds of joins.

That's why I ask about a new table or column type that handles basic
translation and de-duping transparently, keeping the coded values
in-table.



As those are all strings: why not put them into a hstore (or jsonb) column?

Then all of them could compressed (TOASTed) as a single value instead of 300 
independent values.


 





--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread David G. Johnston
On Fri, Oct 13, 2017 at 9:29 AM, Seamus Abshere  wrote:

> > On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote
> > > Theoretically / blue sky, could there be a table or column type that
> > > transparently handles "shared strings" like this, reducing size on disk
> > > at the cost of lookup overhead for all queries?
> > > (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> > > only for large objects?)
>
> On Fri, Oct 13, 2017, at 01:12 PM, David G. Johnston wrote:
> > Row-independence is baked into PostgreSQL pretty deeply...
>
> Could you say more about that?
>

​Not intelligibly...basically as far as PostgreSQL is concerned all the
data to reconstruct a row from a given table is present in that table.
>From a practical perspective the "TOAST table" for a table IS part of the
main table since it has no practical independent use.

As an aside I was thinking along the lines of an actual compression routine
which is what a spreadsheet file is able to do since a spreadsheet contains
the data from every row and column in a single file and is able to compress
the entire file by finding commonalities across rows and columns.  A
database generally cannot do that.

As for "transparent lookup tables for text columns"...I suppose one could
implement a "system-managed-enum" type with many of the same properties of
an actual enum but avoiding many of its problems by not exposing the
enum-ness to the user and instead just exposing the text labels...I suspect
faced with prospect of doing something that complex most users would just
setup a FK relationship.
​

> What about the comparison to TOAST, which stores values off-table?
>

TOAST solves a technical problem related to the fact that records "on the
table" have a very small size limitation (kb) while stored values can be at
least as large as a GB.  TOAST does involved compression but the input to
the compression algorithm is a single cell (row and column) in a table.​
As noted above I consider the TOAST table and main table to be a single
logical table.

Like I said the enum type has similar properties to what you want - but
Melvin is right that using it requires careful consideration of how your
data might change in the future.

David J.


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Peter J. Holzer
On 2017-10-13 12:49:21 -0300, Seamus Abshere wrote:
> In the spreadsheet world, there is this concept of "shared strings," a
> simple way of compressing spreadsheets when the data is duplicated in
> many cells.
> 
> In my database, I have a table with >200 million rows and >300 columns
> (all the households in the United States). For clarity of development
> and debugging, I have not made any effort to normalize its contents, so
> millions of rows have, for example, "SINGLE FAMILY RESIDENCE /
> TOWNHOUSE" (yes, that whole string!) instead of some code representing
> it.
> 
> Theoretically / blue sky, could there be a table or column type that
> transparently handles "shared strings" like this, reducing size on disk
> at the cost of lookup overhead for all queries?

Theoretically it's certainly possible and I think some column-oriented
databases store data that way.

> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> only for large objects?)

Yes, but if you want to autmatically delete entries which are no longer
needed you need to keep track of that. So either a reference count or an
index lookup on the parent table. This is starting to look a lot like a
foreign key - just hidden from the user. Performance would probably be
similar, too.

We have done something similar (although for different reasons). We
ended up doing the "join" in the application. For most purposes we don't
need the descriptive strings and when we need them we can do a
dictionary lookup just before sending them to the client (of course the
dictionary has to be read from the database, too, but it doesn't change
that often, so it can be cached). And from a software maintainability
POV I think a dictionary lookup in Perl is a lot nicer than 50 joins
(or 300 in your case).

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


signature.asc
Description: PGP signature


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Melvin Davidson
On Fri, Oct 13, 2017 at 12:52 PM, Melvin Davidson 
wrote:

>
>
> On Fri, Oct 13, 2017 at 12:43 PM, Seamus Abshere 
> wrote:
>
>> > > On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote:
>> > >> Theoretically / blue sky, could there be a table or column type that
>> > >> transparently handles "shared strings" like this, reducing size on
>> disk
>> > >> at the cost of lookup overhead for all queries?
>> > >> (I guess maybe it's like TOAST, but content-hashed and de-duped and
>> not
>> > >> only for large objects?)
>>
>> On Fri, Oct 13, 2017, at 01:29 PM, Melvin Davidson wrote:
>> > What was described is exactly what relations and Foreign Keys are for.
>>
>> hi Melvin, appreciate the reminder. Our issue is that we have 300+
>> columns and frequently include them in the SELECT or WHERE clauses... so
>> traditional normalization would involve hundreds of joins.
>>
>> That's why I ask about a new table or column type that handles basic
>> translation and de-duping transparently, keeping the coded values
>> in-table.
>>
>
> >I ask about a new table or column type that handles basic translation
>
> AFAIK, there is no such thing currently available.Your initial post
> indicated you were working with spreadsheets and were
> looking to translate to PostgreSQL database. There is no short cut to
> normalizing, but the time you spend doing so in initial
> design will well be worthwhile once it is implemented.
>
> --
> *Melvin Davidson*
> I reserve the right to fantasize.  Whether or not you
> wish to share my fantasy is entirely up to you.
>


*Seamus,*




*Just a thought. As I mentioned previously, there is no shortcut to
optimizing your database. However, you can do it in increments.First,
create all your foreign key / repetative data tables.*


*Next, add additional FK columns to you current tables to reference the fk
/ repetative data tables.*


*Modify your application / queries to utilize the new columns.*






*Do extensive testing to make sure your modifications work properly.VERY
IMPORTANT: Before the next step, make a backup of the existing database and
verify you have a good copy.Finally, drop all the old repetative data
columns.*
-- 
*Melvin Davidson*
I reserve the right to fantasize.  Whether or not you
wish to share my fantasy is entirely up to you.


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Melvin Davidson
On Fri, Oct 13, 2017 at 12:43 PM, Seamus Abshere  wrote:

> > > On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote:
> > >> Theoretically / blue sky, could there be a table or column type that
> > >> transparently handles "shared strings" like this, reducing size on
> disk
> > >> at the cost of lookup overhead for all queries?
> > >> (I guess maybe it's like TOAST, but content-hashed and de-duped and
> not
> > >> only for large objects?)
>
> On Fri, Oct 13, 2017, at 01:29 PM, Melvin Davidson wrote:
> > What was described is exactly what relations and Foreign Keys are for.
>
> hi Melvin, appreciate the reminder. Our issue is that we have 300+
> columns and frequently include them in the SELECT or WHERE clauses... so
> traditional normalization would involve hundreds of joins.
>
> That's why I ask about a new table or column type that handles basic
> translation and de-duping transparently, keeping the coded values
> in-table.
>

>I ask about a new table or column type that handles basic translation

AFAIK, there is no such thing currently available.Your initial post
indicated you were working with spreadsheets and were
looking to translate to PostgreSQL database. There is no short cut to
normalizing, but the time you spend doing so in initial
design will well be worthwhile once it is implemented.

-- 
*Melvin Davidson*
I reserve the right to fantasize.  Whether or not you
wish to share my fantasy is entirely up to you.


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Seamus Abshere
> > On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote:
> >> Theoretically / blue sky, could there be a table or column type that
> >> transparently handles "shared strings" like this, reducing size on disk
> >> at the cost of lookup overhead for all queries?
> >> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> >> only for large objects?)

On Fri, Oct 13, 2017, at 01:29 PM, Melvin Davidson wrote:
> What was described is exactly what relations and Foreign Keys are for.

hi Melvin, appreciate the reminder. Our issue is that we have 300+
columns and frequently include them in the SELECT or WHERE clauses... so
traditional normalization would involve hundreds of joins.

That's why I ask about a new table or column type that handles basic
translation and de-duping transparently, keeping the coded values
in-table.


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Melvin Davidson
On Fri, Oct 13, 2017 at 12:12 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere 
> wrote:
>
>> Theoretically / blue sky, could there be a table or column type that
>> transparently handles "shared strings" like this, reducing size on disk
>> at the cost of lookup overhead for all queries?
>>
>> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
>> only for large objects?)
>>
>
> Row-independence is baked into PostgreSQL pretty deeply...
>
> I think an enum type is about as close are you are likely to get if you
> don't wish to setup your own foreign-key relationships with surrogate keys.
>
> David J.
>

I STRONGLY advise againt the use of ENUMS.

What was described is exactly what relations and Foreign Keys are for.

Example:
CREATE TABLE residence_type
(
  residence_type_id   INTEGER NOT NULL,
  residence_type_desc TEXT NOT NULL,
  CONSTRAINT residence_type_pk PRIMARY KEY (residence_type_id)
);

CREATE TABLE state
(
  state_id   CHAR(02) NOT NULL,
  state_name TEXT NOT NULL,
  CONSTRAINT state_pk PRIMARY KEY (state_id)
);

CREATE TABLE residence
(
  residence_idBIGINT NOT NULL,
  residence_type_id INTEGER,
  street_numCHARACTER(10),
  street_nameCHARACTER(20),
  cityCHARACTER(40),
  state_idCHAR(02),
  CONSTRAINT residence_pk PRIMARY KEY (residence_id),
  CONSTRAINT fk_residence_state FOREIGN KEY (state_id)
REFERENCES state (state_id),
  CONSTRAINT fk_residence_type FOREIGN KEY (residence_type_id)
REFERENCES residence_type (residence_type_id)
);

SELECT t.residence_type_desc,
   r.street_num,
   r.street_name,
   r.city,
   s.state_name
  FROM residence r
  JOIN residence_type t ON t.residence_id = r.residence_id
  JOIN state s ON s.state_id = r.state_id
 WHERE residence_id = 12345;

-- 
*Melvin Davidson*
I reserve the right to fantasize.  Whether or not you
wish to share my fantasy is entirely up to you.


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Seamus Abshere
> On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere wrote
> > Theoretically / blue sky, could there be a table or column type that
> > transparently handles "shared strings" like this, reducing size on disk
> > at the cost of lookup overhead for all queries?
> > (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> > only for large objects?)

On Fri, Oct 13, 2017, at 01:12 PM, David G. Johnston wrote:
> Row-independence is baked into PostgreSQL pretty deeply...

Could you say more about that?

What about the comparison to TOAST, which stores values off-table?

Thanks!


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread David G. Johnston
On Fri, Oct 13, 2017 at 8:49 AM, Seamus Abshere  wrote:

> Theoretically / blue sky, could there be a table or column type that
> transparently handles "shared strings" like this, reducing size on disk
> at the cost of lookup overhead for all queries?
>
> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> only for large objects?)
>

Row-independence is baked into PostgreSQL pretty deeply...

I think an enum type is about as close are you are likely to get if you
don't wish to setup your own foreign-key relationships with surrogate keys.

David J.


Re: [GENERAL] "Shared strings"-style table

2017-10-13 Thread Rob Sargent



On 10/13/2017 09:49 AM, Seamus Abshere wrote:

hey,

In the spreadsheet world, there is this concept of "shared strings," a
simple way of compressing spreadsheets when the data is duplicated in
many cells.

In my database, I have a table with >200 million rows and >300 columns
(all the households in the United States). For clarity of development
and debugging, I have not made any effort to normalize its contents, so
millions of rows have, for example, "SINGLE FAMILY RESIDENCE /
TOWNHOUSE" (yes, that whole string!) instead of some code representing
it.

Theoretically / blue sky, could there be a table or column type that
transparently handles "shared strings" like this, reducing size on disk
at the cost of lookup overhead for all queries?

(I guess maybe it's like TOAST, but content-hashed and de-duped and not
only for large objects?)

Thanks,
Seamus

--
Seamus Abshere, SCEA
https://www.faraday.io
https://github.com/seamusabshere
https://linkedin.com/in/seamusabshere


What data type are these columns now?  I would be tempted to map the 
full strings to an abbreviation just so I didn't have to alter all the 
columns to an "id";  Optional to place any RI on the columns to the 
abbreviation dictionary table.  Just use the translation as a last step 
in user facing reports.  If you can map/abbreviate to 4 characters, 
you've approximated the disk size of an integer.



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] "Shared strings"-style table

2017-10-13 Thread Seamus Abshere
hey,

In the spreadsheet world, there is this concept of "shared strings," a
simple way of compressing spreadsheets when the data is duplicated in
many cells.

In my database, I have a table with >200 million rows and >300 columns
(all the households in the United States). For clarity of development
and debugging, I have not made any effort to normalize its contents, so
millions of rows have, for example, "SINGLE FAMILY RESIDENCE /
TOWNHOUSE" (yes, that whole string!) instead of some code representing
it.

Theoretically / blue sky, could there be a table or column type that
transparently handles "shared strings" like this, reducing size on disk
at the cost of lookup overhead for all queries?

(I guess maybe it's like TOAST, but content-hashed and de-duped and not
only for large objects?)

Thanks,
Seamus

--
Seamus Abshere, SCEA
https://www.faraday.io
https://github.com/seamusabshere
https://linkedin.com/in/seamusabshere


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general