Re: [PERFORM] Optimizing a VIEW

2008-08-22 Thread Decibel!

On Aug 20, 2008, at 1:18 PM, Tom Lane wrote:

If you're worried about storage space, I wouldn't go for arrays of
composite :-(.  The tuple header overhead is horrendous, almost
certainly a lot worse than hstore.



Oh holy cow, I didn't realize we had a big header in there. Is that  
to allow for changing the definition of the composite type?

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Optimizing a VIEW

2008-08-20 Thread Decibel!

On Aug 16, 2008, at 9:19 PM, Gurjeet Singh wrote:

For your very specific case, I recommend you check out contrib/hstore:
http://www.postgresql.org/docs/current/static/hstore.html


Awesome! Any comments on the performance of hstore?


I've looked at it but haven't actually used it. One thing I wish it 
did was keep a catalog somewhere of the names it has seen, so that it 
didn't have to store them as in-line text. If you have even 
moderate-length names and are storing small values, you quickly end up 
wasting a ton of space.


BTW, now that you can build arrays of composite types, that might be  
an easy way to deal with this stuff. Create a composite type of  
(name_id, value) and store that in an array.
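
A minimal sketch of that composite-array idea (all names here are 
hypothetical, not the thread's actual schema):

    -- Dictionary of attribute names, so each pair stores an int, not text
    CREATE TABLE attr_name (
        name_id serial PRIMARY KEY,
        name    text NOT NULL UNIQUE
    );

    -- Composite of (name_id, value), stored as an array per customer
    CREATE TYPE attr_pair AS (name_id int, value text);

    CREATE TABLE customer_attrs (
        cust_id int PRIMARY KEY,
        attrs   attr_pair[]
    );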

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Optimizing a VIEW

2008-08-20 Thread Decibel!

On Aug 17, 2008, at 10:21 AM, Madison Kelly wrote:
Truth be told, I sort of expected this would be what I had to do. I 
think I asked this more in the hope that there might be some magic 
I didn't know about, but I see now that's not the case. :)


As my data points grew to 500,000+, the time it took to return 
these results grew to well over 10 minutes on a decent server, and 
the DB size was growing rapidly, as you spoke of.


So I did just as you suggested: I took the variable names I knew 
about specifically (the ones that are updated most often, hourly per 
customer), created a table for them, made each column an 'int' or 
'real' where possible, and ditched the tracking of the 
adding/modifying user and time stamp. I added those out of habit 
more than anything; this data will always come from a system app 
anyway, so...


Given that my DB is still in development, and how long and intensive 
it would have been to pull out the existing data, I have started 
over and am now gathering new data. In a week or so I should have 
the same amount of data as I had before, and I will be able to do a 
closer comparison test.


However, I already suspect the growth of the database will be  
substantially slower and the queries will return substantially faster.



I strongly recommend you also re-think using EAV at all for this. It  
plain and simple does not scale well. I won't go so far as to say it  
can never be used (we're actually working on one right now, but it  
will only be used to occasionally pull up single entities), but you  
have to be really careful with it. I don't see it working very well  
for what it sounds like you're trying to do.

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Optimizing a VIEW

2008-08-20 Thread Tom Lane
Decibel! [EMAIL PROTECTED] writes:
 On Aug 16, 2008, at 9:19 PM, Gurjeet Singh wrote:
 Awesome! Any comments on the performance of hstore?

 I've looked at it but haven't actually used it. One thing I wish it 
 did was keep a catalog somewhere of the names it has seen, so that it 
 didn't have to store them as in-line text. If you have even 
 moderate-length names and are storing small values, you quickly end up 
 wasting a ton of space.

 BTW, now that you can build arrays of composite types, that might be  
 an easy way to deal with this stuff. Create a composite type of  
 (name_id, value) and store that in an array.

If you're worried about storage space, I wouldn't go for arrays of
composite :-(.  The tuple header overhead is horrendous, almost
certainly a lot worse than hstore.
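
One way to check the overhead for yourself (a sketch; it assumes 
contrib/hstore is installed, and re-declares a hypothetical attr_pair 
composite so the snippet stands alone):

    -- Each array element carries a full ~23-byte tuple header plus padding,
    -- so the composite-array form comes out much larger than the hstore form
    CREATE TYPE attr_pair AS (name_id int, value text);

    SELECT pg_column_size('a=>1, b=>2'::hstore) AS hstore_bytes,
           pg_column_size(ARRAY[ROW(1,'1')::attr_pair,
                                ROW(2,'2')::attr_pair]) AS array_bytes;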

regards, tom lane



Re: [PERFORM] Optimizing a VIEW

2008-08-18 Thread Matthew Wakeling

On Fri, 15 Aug 2008, Madison Kelly wrote:
 Below I will post the VIEW and a sample of the query's EXPLAIN ANALYZE. 
Thanks for any tips/help/clue-stick-beating you may be able to share!


This query looks incredibly expensive:


   SELECT

...

   FROM
   customer a,
   history.customer_data b,
   history.customer_data c,
   history.customer_data d,
   history.customer_data e,
   history.customer_data f,
   history.customer_data g,
   history.customer_data h,
   history.customer_data i,
   history.customer_data j,
   history.customer_data k,
   history.customer_data l
   WHERE
   a.cust_id=b.cd_cust_id AND
   a.cust_id=c.cd_cust_id AND
   a.cust_id=d.cd_cust_id AND
   a.cust_id=e.cd_cust_id AND
   a.cust_id=f.cd_cust_id AND
   a.cust_id=g.cd_cust_id AND
   a.cust_id=h.cd_cust_id AND
   a.cust_id=i.cd_cust_id AND
   a.cust_id=j.cd_cust_id AND
   a.cust_id=k.cd_cust_id AND
   a.cust_id=l.cd_cust_id AND

...

I would refactor this significantly, so that instead of returning a wide 
result, it would return more than one row per customer. Just do a single 
join between customer and history.customer_data - it will run much faster.
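
Something along these lines, reusing the column names quoted above (a 
sketch of the shape, not a drop-in replacement for the view):

    -- One row per (customer, variable) instead of one very wide row per customer
    SELECT a.cust_id, b.cd_variable, b.cd_value
    FROM customer a
    JOIN history.customer_data b ON b.cd_cust_id = a.cust_id;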


Matthew

--
Here we go - the Fairy Godmother redundancy proof.
   -- Computer Science Lecturer



Re: [PERFORM] Optimizing a VIEW

2008-08-17 Thread Madison Kelly

Decibel! wrote:

On Aug 15, 2008, at 1:36 PM, Madison Kelly wrote:
The 'cust_id' references the customer that the given data belongs to. 
The reason for this data bucket (does this structure have a proper 
name?) is that the data I need to store on a given customer is quite 
variable and outside of my control. As it is, there are about 400 
different variable/value pairs I need to store per customer.



It's called Entity-Attribute-Value, and its performance is pretty much 
guaranteed to suck for any kind of a large dataset. The problem is that 
you're storing a MASSIVE amount of extra information for every single 
value. Consider:


If each data point was just a field in a table, then even if we left 
cd_value as text, each data point would consume 4 bytes* + 1 byte per 
character (I'm assuming you don't need extra UTF8 chars or anything). Of 
course if you know you're only storing numbers or the like then you can 
make that even more efficient.


* In 8.3, the text field overhead could be as low as 1 byte if the field 
is small enough.


OTOH, your table is going to use 32+24 bytes per row just for the per-row 
overhead, ints and timestamps. Each text field will have 1 or 4 bytes in 
overhead, then you have to store the actual data. Realistically, you're 
looking at 60+ bytes per data point, as opposed to maybe 15, or even 
down to 4 if you know you're storing an int.


Now figure out what that turns into if you have 100 data points per 
minute. It doesn't take very long until you have a huge pile of data 
you're trying to deal with. (As an aside, I once consulted with a 
company that wanted to do this... they wanted to store about 400 data 
points from about 1000 devices on a 5 minute interval. That worked out 
to something like 5GB per day, just for the EAV table. Just wasn't going 
to scale...)


So, back to your situation... there are several things you can do that 
will greatly improve things.


Identify data points that are very common and don't use EAV to store 
them. Instead, store them as regular fields in a table (and don't use 
text if at all possible).


You need to trim down your EAV table. Throw out the added/modified info; 
there's almost certainly no reason to store that *per data point*. Get 
rid of cd_id; there should be a natural PK you can use, and you 
certainly don't want anything else referring to this table (being 
referenced elsewhere is the big reason to use a surrogate key in the 
first place).


cd_variable and cd_tag need to be ints that point at other tables. For 
that matter, do you really need to tag each *data point*? Probably not...


Finally, if you have a defined set of points that you need to report on, 
create a materialized view that has that information.


BTW, it would probably be better to store data either in the main table, 
or the history table, but not both places.


This is a very long and thoughtful reply, thank you very kindly.

Truth be told, I sort of expected this would be what I had to do. I 
think I asked this more in the hope that there might be some magic I 
didn't know about, but I see now that's not the case. :)


As my data points grew to 500,000+, the time it took to return these 
results grew to well over 10 minutes on a decent server, and the DB size 
was growing rapidly, as you spoke of.


So I did just as you suggested: I took the variable names I knew about 
specifically (the ones that are updated most often, hourly per customer), 
created a table for them, made each column an 'int' or 'real' where 
possible, and ditched the tracking of the adding/modifying user and time 
stamp. I added those out of habit more than anything; this data will 
always come from a system app anyway, so...


Given that my DB is still in development, and how long and intensive it 
would have been to pull out the existing data, I have started over and 
am now gathering new data. In a week or so I should have the same amount 
of data as I had before, and I will be able to do a closer comparison test.


However, I already suspect the growth of the database will be 
substantially slower and the queries will return substantially faster.


Thank you again!

Madi



Re: [PERFORM] Optimizing a VIEW

2008-08-16 Thread Decibel!

On Aug 15, 2008, at 1:36 PM, Madison Kelly wrote:
The 'cust_id' references the customer that the given data belongs  
to. The reason for this data bucket (does this structure have a  
proper name?) is that the data I need to store on a given customer 
is quite variable and outside of my control. As it is, there are 
about 400 different variable/value pairs I need to store per customer.



It's called Entity-Attribute-Value, and its performance is pretty 
much guaranteed to suck for any kind of a large dataset. The problem  
is that you're storing a MASSIVE amount of extra information for  
every single value. Consider:


If each data point was just a field in a table, then even if we left  
cd_value as text, each data point would consume 4 bytes* + 1 byte per  
character (I'm assuming you don't need extra UTF8 chars or anything).  
Of course if you know you're only storing numbers or the like then  
you can make that even more efficient.


* In 8.3, the text field overhead could be as low as 1 byte if the  
field is small enough.


OTOH, your table is going to use 32+24 bytes per row just for the per-row 
overhead, ints and timestamps. Each text field will have 1 or 4 bytes  
in overhead, then you have to store the actual data. Realistically,  
you're looking at 60+ bytes per data point, as opposed to maybe 15,  
or even down to 4 if you know you're storing an int.


Now figure out what that turns into if you have 100 data points per  
minute. It doesn't take very long until you have a huge pile of data  
you're trying to deal with. (As an aside, I once consulted with a  
company that wanted to do this... they wanted to store about 400 data  
points from about 1000 devices on a 5 minute interval. That worked  
out to something like 5GB per day, just for the EAV table. Just  
wasn't going to scale...)
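
(Checking that arithmetic: 400 points x 1,000 devices x 288 five-minute 
intervals per day is about 115 million rows per day; at the 45-60 bytes 
per point estimated above, that is indeed on the order of 5GB per day.)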


So, back to your situation... there are several things you can do that 
will greatly improve things.


Identify data points that are very common and don't use EAV to store  
them. Instead, store them as regular fields in a table (and don't use  
text if at all possible).
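
For example, something like this (hypothetical column names; the real 
ones would come from whichever variables are most common):

    -- Common, frequently updated points as properly typed columns
    CREATE TABLE customer_metrics (
        cust_id     int         NOT NULL,
        sample_time timestamptz NOT NULL,
        cpu_load    real,
        ram_used_mb int,
        PRIMARY KEY (cust_id, sample_time)
    );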


You need to trim down your EAV table. Throw out the added/modified  
info; there's almost certainly no reason to store that *per data  
point*. Get rid of cd_id; there should be a natural PK you can use, 
and you certainly don't want anything else referring to this table 
(being referenced elsewhere is the big reason to use a surrogate key 
in the first place).


cd_variable and cd_tag need to be ints that point at other tables.  
For that matter, do you really need to tag each *data point*?  
Probably not...
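
Put together, the trimmed EAV table might look something like this (a 
sketch; the names are guesses based on the columns mentioned in the thread):

    -- Dictionary table so each data point stores an int, not repeated text
    CREATE TABLE variable_name (
        var_id serial PRIMARY KEY,
        name   text NOT NULL UNIQUE
    );

    -- No surrogate cd_id and no per-point audit columns; the natural key
    -- (customer, variable) serves as the primary key
    CREATE TABLE customer_data (
        cd_cust_id int NOT NULL,
        cd_var_id  int NOT NULL REFERENCES variable_name (var_id),
        cd_value   text,
        PRIMARY KEY (cd_cust_id, cd_var_id)
    );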


Finally, if you have a defined set of points that you need to report  
on, create a materialized view that has that information.
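
Note that PostgreSQL of this era has no built-in CREATE MATERIALIZED 
VIEW (that arrives in 9.3), so this means a manually refreshed table, 
e.g. (a sketch built on the hypothetical customer_metrics table above):

    -- Hand-rolled materialized view: build it once...
    CREATE TABLE customer_report AS
        SELECT cust_id, max(sample_time) AS last_seen, avg(cpu_load) AS avg_load
        FROM customer_metrics
        GROUP BY cust_id;

    -- ...then refresh it periodically (e.g. from cron)
    BEGIN;
    TRUNCATE customer_report;
    INSERT INTO customer_report
        SELECT cust_id, max(sample_time), avg(cpu_load)
        FROM customer_metrics
        GROUP BY cust_id;
    COMMIT;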


BTW, it would probably be better to store data either in the main  
table, or the history table, but not both places.

--
Decibel!, aka Jim C. Nasby, Database Architect  [EMAIL PROTECTED]
Give your computer some brain candy! www.distributed.net Team #1828






Re: [PERFORM] Optimizing a VIEW

2008-08-16 Thread Rodrigo E. De León Plicet
On Sat, Aug 16, 2008 at 2:19 PM, Decibel! [EMAIL PROTECTED] wrote:
 You need to trim down your EAV table.

Egads! I'd say completely get rid of this beast and redesign it
according to valid relational concepts.

This post pretty much explains the whole issue with EAV:
http://groups.google.com/group/microsoft.public.sqlserver.programming/browse_thread/thread/df5bb99b3eaadfa9/6a160e5027ce3a80?lnk=st&q=eav#6a160e5027ce3a80

EAV is evil. Period.



Re: [PERFORM] Optimizing a VIEW

2008-08-16 Thread Rodrigo E. De León Plicet
On Fri, Aug 15, 2008 at 1:36 PM, Madison Kelly [EMAIL PROTECTED] wrote:
  The 'cust_id' references the customer that the given data belongs to. The
 reason for this data bucket (does this structure have a proper name?) is
 that the data I need to store on a given customer is quite variable and
 outside of my control. As it is, there are about 400 different variable/value
 pairs I need to store per customer.

For your very specific case, I recommend you check out contrib/hstore:
http://www.postgresql.org/docs/current/static/hstore.html
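
Basic usage looks like this (a sketch with hypothetical names; hstore 
ships as a contrib module, and on modern versions is installed with 
CREATE EXTENSION hstore):

    -- All variable/value pairs for a customer in a single hstore column
    CREATE TABLE customer_hstore (
        cust_id int PRIMARY KEY,
        attrs   hstore
    );

    INSERT INTO customer_hstore
    VALUES (1, 'os=>linux, ram_mb=>2048'::hstore);

    -- Pull out a single attribute
    SELECT attrs -> 'ram_mb' FROM customer_hstore WHERE cust_id = 1;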



Re: [PERFORM] Optimizing a VIEW

2008-08-16 Thread Gurjeet Singh
On Sun, Aug 17, 2008 at 7:06 AM, Rodrigo E. De León Plicet 
[EMAIL PROTECTED] wrote:

 On Fri, Aug 15, 2008 at 1:36 PM, Madison Kelly [EMAIL PROTECTED] wrote:
   The 'cust_id' references the customer that the given data belongs to.
   The reason for this data bucket (does this structure have a proper name?)
   is that the data I need to store on a given customer is quite variable
   and outside of my control. As it is, there are about 400 different
   variable/value pairs I need to store per customer.

 For your very specific case, I recommend you check out contrib/hstore:
 http://www.postgresql.org/docs/current/static/hstore.html


Awesome! Any comments on the performance of hstore?

Best regards,
-- 
[EMAIL PROTECTED]
[EMAIL PROTECTED] { gmail | hotmail | indiatimes | yahoo }.com

EnterpriseDB http://www.enterprisedb.com

Mail sent from my BlackLaptop device