[HACKERS] [feature] cached index to speed up specific queries on extremely large data sets
hi folks, please cc me direct on responses as i am subscribed on digest.

i've been asked to look at how to deal with around 7 billion records (appx 30 columns, appx total record size 1k) and this might have to be in a single system (i will need to Have Words with the client about that). the data is read-only, and an arbitrary number of additional tables may be created to manage it. records come in at a rate of around 25 million per day; the 7 billion records is based on the assumption of keeping one month's worth of data around.

analysis of this data needs to be done across the entire set: i.e. it may not be subdivided into isolated tables (by day for example). i am therefore um rather concerned about efficiency, even just from the perspective of using 2nd normal form and not doing JOINs against other tables.

so i had an idea. there already exists the concept of indexes. there already exists the concept of cached queries. question: would it be practical to *merge* those two concepts, such that specific queries could be *updated* as new records are added, so that when the query is called again it answers basically immediately? let us assume that performance degradation on update (given that indexes already exist and are required to be updated) is acceptable.

the only practical way (without digging into postgresql's c code) to do this at a higher level would in effect be to abandon the advantages of the postgresql query optimisation engine and *reimplement* it in a high-level language: subdividing the data into smaller (more manageable) tables, using yet more tables to store intermediate results of a previous query, then somehow stitching together a new response as newer records arrive. it would be a complete nightmare to both implement and maintain.

second question then, based on whether the first is practical: is there anyone who would be willing (assuming it can be arranged) to engage in a contract to implement the required functionality?
thanks, l.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [feature] cached index to speed up specific queries on extremely large data sets
On 04/11/2014 03:20 PM, lkcl . wrote:
> so i had an idea. there already exists the concept of indexes. there
> already exists the concept of cached queries. question: would it be
> practical to *merge* those two concepts such that specific queries could
> be *updated* as new records are added, such that when the query is called
> again it answers basically pretty much immediately? let us assume that
> performance degradation on update (given that indexes already exist and
> are required to be updated) is acceptable.

I think you just described materialized views. The built-in materialized views in PostgreSQL are not updated immediately as the tables are modified, but it's entirely possible to roll your own using views and triggers. There are a few links on the PostgreSQL wiki, in the "Versions before 9.3" section: https://wiki.postgresql.org/wiki/Materialized_Views

- Heikki
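The built-in 9.3 feature can be sketched roughly as follows. Table, view, and column names here are hypothetical, purely for illustration:

```sql
-- A source table receiving ~25 million rows per day (schema is made up).
CREATE TABLE readings (
    device_id  bigint,
    taken_at   timestamptz,
    value      numeric
);

-- Precompute an aggregate once; queries against it then answer
-- immediately instead of scanning billions of rows.
CREATE MATERIALIZED VIEW daily_totals AS
SELECT device_id,
       date_trunc('day', taken_at) AS day,
       sum(value)                  AS total
FROM readings
GROUP BY device_id, date_trunc('day', taken_at);

-- The view goes stale as new rows arrive; it must be refreshed
-- explicitly (recomputing it in full).
REFRESH MATERIALIZED VIEW daily_totals;
```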
Re: [HACKERS] [feature] cached index to speed up specific queries on extremely large data sets
On Fri, Apr 11, 2014 at 1:26 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> On 04/11/2014 03:20 PM, lkcl . wrote:
>> so i had an idea. there already exists the concept of indexes. there
>> already exists the concept of cached queries. question: would it be
>> practical to *merge* those two concepts such that specific queries
>> could be *updated* as new records are added, such that when the query
>> is called again it answers basically pretty much immediately? let us
>> assume that performance degradation on update (given that indexes
>> already exist and are required to be updated) is acceptable.
>
> I think you just described materialized views.

... well... dang :)

http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views

ok so definitely not the snapshot materialised views, but yes! the eager materialised views, definitely.

> The built-in materialized views in PostgreSQL are not updated
> immediately as the tables are modified,

... but that would probably be enough.

> but it's entirely possible to roll your own using views and triggers.
> There are a few links on the PostgreSQL wiki, in the "Versions before
> 9.3" section: https://wiki.postgresql.org/wiki/Materialized_Views

awesome. uhhh, well that was easy *lol*. once i am paid, whom do i send the payment to for the fast response and incredibly valuable information? :) [this is a serious question!]

l.
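A rough sketch of the "eager" roll-your-own approach from the linked material, assuming a hypothetical readings(device_id, taken_at, value) table. The summary is a plain table kept current row-by-row from a trigger; the update-then-insert dance stands in for ON CONFLICT (which did not exist yet in 9.3) and is racy under concurrent inserts, so this is illustrative only:

```sql
-- Eagerly maintained summary: always up to date, no refresh needed.
CREATE TABLE daily_totals_eager (
    device_id bigint,
    day       timestamptz,
    total     numeric,
    PRIMARY KEY (device_id, day)
);

CREATE FUNCTION readings_maintain() RETURNS trigger AS $$
BEGIN
    -- Try to fold the new row into an existing summary row...
    UPDATE daily_totals_eager
       SET total = total + NEW.value
     WHERE device_id = NEW.device_id
       AND day = date_trunc('day', NEW.taken_at);
    -- ...and create the summary row if it did not exist yet.
    IF NOT FOUND THEN
        INSERT INTO daily_totals_eager (device_id, day, total)
        VALUES (NEW.device_id, date_trunc('day', NEW.taken_at), NEW.value);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER readings_eager
AFTER INSERT ON readings
FOR EACH ROW EXECUTE PROCEDURE readings_maintain();
```

This is where the "performance degradation on update is acceptable" assumption matters: every insert into the base table pays for one extra update or insert on the summary.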
Re: [HACKERS] [feature] cached index to speed up specific queries on extremely large data sets
On Fri, Apr 11, 2014 at 9:53 PM, lkcl . luke.leigh...@gmail.com wrote:
> On Fri, Apr 11, 2014 at 1:26 PM, Heikki Linnakangas
> hlinnakan...@vmware.com wrote:
>> On 04/11/2014 03:20 PM, lkcl . wrote:
>>> so i had an idea. there already exists the concept of indexes. there
>>> already exists the concept of cached queries. question: would it be
>>> practical to *merge* those two concepts such that specific queries
>>> could be *updated* as new records are added, such that when the query
>>> is called again it answers basically pretty much immediately? let us
>>> assume that performance degradation on update (given that indexes
>>> already exist and are required to be updated) is acceptable.
>>
>> I think you just described materialized views.
>
> ... well... dang :)
>
> http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
>
> ok so definitely not the snapshot materialised views, but yes! the
> eager materialised views, definitely.
>
>> The built-in materialized views in PostgreSQL are not updated
>> immediately as the tables are modified,
>
> ... but that would probably be enough.
>
>> but it's entirely possible to roll your own using views and triggers.
>> There are a few links on the PostgreSQL wiki, in the "Versions before
>> 9.3" section: https://wiki.postgresql.org/wiki/Materialized_Views

When refreshing a materialized view, you also need to be aware that in 9.3 an exclusive lock is taken on it for the duration of the refresh, so the materialized view cannot be accessed by read queries in the meantime.

> awesome. uhhh, well that was easy *lol*. once i am paid, whom do i
> send the payment to for the fast response and incredibly valuable
> information? :) [this is a serious question!]

This can be helpful: http://www.postgresql.org/about/donate/
--
Michael
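For illustration, assuming a hypothetical materialized view named daily_totals: 9.4, which was in development at the time of this thread, adds a CONCURRENTLY variant of REFRESH that does not block readers, at the cost of requiring a unique index on the view and doing more work overall:

```sql
-- 9.3: takes an ACCESS EXCLUSIVE lock, blocking reads for the duration.
REFRESH MATERIALIZED VIEW daily_totals;

-- 9.4: refreshes without blocking concurrent reads; requires a unique
-- index covering all rows of the view.
CREATE UNIQUE INDEX daily_totals_key ON daily_totals (device_id, day);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_totals;
```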
Re: [HACKERS] [feature] cached index to speed up specific queries on extremely large data sets
On Fri, Apr 11, 2014 at 2:12 PM, Michael Paquier michael.paqu...@gmail.com wrote:
> On Fri, Apr 11, 2014 at 9:53 PM, lkcl . luke.leigh...@gmail.com wrote:
>> section: https://wiki.postgresql.org/wiki/Materialized_Views
>
> When updating a materialized view, or refreshing it, you need as well
> to be aware that an exclusive lock is taken on it during the refresh in
> 9.3, so the materialized view cannot be accessed for read queries.

ok. as long as the storage of data (in the underlying table) is not adversely affected, that would be fine. as this is the hackers list and this turns out to be more a user question, i'll leave it for now.

>> awesome. uhhh, well that was easy *lol*. once i am paid, whom do i
>> send the payment to for the fast response and incredibly valuable
>> information? :) [this is a serious question!]
>
> This can be helpful: http://www.postgresql.org/about/donate/

thank you michael.