[PERFORM] Storing sensor data
Hi, I need to store data about sensor readings. There is a known (but configurable) number of sensors which can send update data at any time. The current state needs to be kept but also all historical records. I'm trying to decide between these two designs: 1) create a table for current data, one record for each sensor, update this table when a sensor reading arrives, create a trigger that would transfer old record data to a history table (of basically the same structure) 2) write only to the history table, use relatively complex queries or outside-the-database magic to determine what the current values of the sensors are. The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. I think the second design would be easiest on the database but as the current sensor state can potentially be queried often, it might be too slow to read. Any recommendations? signature.asc Description: OpenPGP digital signature
Re: [PERFORM] Storing sensor data
Ivan Voras wrote: I need to store data about sensor readings. There is a known (but configurable) number of sensors which can send update data at any time. The current state needs to be kept but also all historical records. I'm trying to decide between these two designs: 1) create a table for current data, one record for each sensor, update this table when a sensor reading arrives, create a trigger that would transfer old record data to a history table (of basically the same structure) 2) write only to the history table, use relatively complex queries or outside-the-database magic to determine what the current values of the sensors are. 3) write only to the history table, but have an INSERT trigger to update the table with current data. This has the same performance characteristics as 1, but let's you design your application like 2. I think I'd choose this approach (or 2), since it can handle out-of-order or delayed arrival of sensor readings gracefully (assuming they are timestamped at source). If you go with 2, I'd recommend to still create a view to encapsulate the complex query for the current values, to make the application development simpler. And if it gets slow, you can easily swap the view with a table, updated with triggers or periodically, without changing the application. The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. Whichever design you choose, you should also consider partitioning the data. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
Option 1 is about somewhere between 2 and 3 times more work for the database than option 2. Do you need every sensor update to hit the database? In a situation like this I'd be tempted to keep the current values in the application itself and then sweep them all into the database periodically. If some of the sensor updates should hit the database faster, you could push those in as you get them rather than wait for your sweeper. This setup has the advantage that you c0an scale up the number of sensors and the frequency the sensors report without having to scale up the disks. You can also do the sweeping all in one transaction or even in one batch update. On Thu, May 28, 2009 at 9:31 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Ivan Voras wrote: The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. Whichever design you choose, you should also consider partitioning the data. Amen. Do that.
Re: [PERFORM] Storing sensor data
On Thu, May 28, 2009 at 2:54 PM, Ivan Voras ivo...@freebsd.org wrote: The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. The size of that dataset, combined with the apparent simplicity of your schema and the apparent requirement for most-sequential access (I'm guessing about the latter two), all lead me to suspect you would be happier with something other than a traditional relational database. I don't know how exact your historical data has to be. Could you get by with something like RRDTool? RRdTool is a round-robin database that stores multiple levels of historical values aggregated by function. So you typically create an average database, a max database and so on, with the appropriate functions to transform the data, and you subdivide these into day, month, year and so on, by the granularity of your choice. When you store a value, the historical data is aggregated appropriately -- at appropriate levels of granularity, so the current day database is more precise than the monthly one, and so on -- and you always have access to the exact current data. RRDTool is used by software such as Munin and Cacti that track a huge number of readings over time for graphing. If you require precise data with the ability to filter, aggregate and correlate over multiple dimensions, something like Hadoop -- or one of the Hadoop-based column database implementations, such as HBase or Hypertable -- might be a better option, combined with MapReduce/Pig to execute analysis jobs A. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
2009/5/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com: Ivan Voras wrote: I need to store data about sensor readings. There is a known (but configurable) number of sensors which can send update data at any time. The current state needs to be kept but also all historical records. I'm trying to decide between these two designs: 1) create a table for current data, one record for each sensor, update this table when a sensor reading arrives, create a trigger that would transfer old record data to a history table (of basically the same structure) 2) write only to the history table, use relatively complex queries or outside-the-database magic to determine what the current values of the sensors are. 3) write only to the history table, but have an INSERT trigger to update the table with current data. This has the same performance characteristics as 1, but let's you design your application like 2. Excellent idea! I think I'd choose this approach (or 2), since it can handle out-of-order or delayed arrival of sensor readings gracefully (assuming they are timestamped at source). It seems like your approach is currently the winner. If you go with 2, I'd recommend to still create a view to encapsulate the complex query for the current values, to make the application development simpler. And if it gets slow, you can easily swap the view with a table, updated with triggers or periodically, without changing the application. The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. Whichever design you choose, you should also consider partitioning the data. I'll look into it, but we'll first see if we can get away with limiting the time the data needs to be available. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
2009/5/28 Nikolas Everett nik9...@gmail.com: Option 1 is about somewhere between 2 and 3 times more work for the database than option 2. Yes, for writes. Do you need every sensor update to hit the database? In a situation like We can't miss an update - they can be delayed but they all need to be recorded. this I'd be tempted to keep the current values in the application itself and then sweep them all into the database periodically. If some of the sensor updates should hit the database faster, you could push those in as you get them rather than wait for your sweeper. This setup has the advantage that you can scale up the number of sensors and the frequency the sensors report without having to scale up the disks. You can also do the sweeping all in one transaction or even in one batch update. It would be nice, but then we need to invest more effort in making the front-end buffering resilient. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
On Thu, May 28, 2009 at 04:55:34PM +0200, Ivan Voras wrote: 2009/5/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com: Ivan Voras wrote: I need to store data about sensor readings. There is a known (but configurable) number of sensors which can send update data at any time. The current state needs to be kept but also all historical records. I'm trying to decide between these two designs: 1) create a table for current data, one record for each sensor, update this table when a sensor reading arrives, create a trigger that would transfer old record data to a history table (of basically the same structure) 2) write only to the history table, use relatively complex queries or outside-the-database magic to determine what the current values of the sensors are. 3) write only to the history table, but have an INSERT trigger to update the table with current data. This has the same performance characteristics as 1, but let's you design your application like 2. Excellent idea! I think I'd choose this approach (or 2), since it can handle out-of-order or delayed arrival of sensor readings gracefully (assuming they are timestamped at source). It seems like your approach is currently the winner. If you go with 2, I'd recommend to still create a view to encapsulate the complex query for the current values, to make the application development simpler. And if it gets slow, you can easily swap the view with a table, updated with triggers or periodically, without changing the application. The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. Whichever design you choose, you should also consider partitioning the data. I'll look into it, but we'll first see if we can get away with limiting the time the data needs to be available. Mr. Voras, One big benefit of partitioning is that you can prune old data with minimal impact to the running system. Doing a large bulk delete would be extremely I/O impacting without partion support. We use this for a DB log system and it allows us to simply truncate a day table instead of a delete -- much, much faster. Regards, Ken -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
2009/5/28 Alexander Staubo a...@bengler.no: On Thu, May 28, 2009 at 2:54 PM, Ivan Voras ivo...@freebsd.org wrote: The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. The size of that dataset, combined with the apparent simplicity of your schema and the apparent requirement for most-sequential access (I'm guessing about the latter two), Your guesses are correct, except every now and then a random value indexed on a timestamp needs to be retrieved. all lead me to suspect you would be happier with something other than a traditional relational database. I don't know how exact your historical data has to be. Could you get No lossy compression is allowed. Exact data is needed for the whole dataset- If you require precise data with the ability to filter, aggregate and correlate over multiple dimensions, something like Hadoop -- or one of the Hadoop-based column database implementations, such as HBase or Hypertable -- might be a better option, combined with MapReduce/Pig to execute analysis jobs This looks like an interesting idea to investigate. Do you have more experience with such databases? How do they fare with the following requirements: * Storing large datasets (do they pack data well in the database? No wasted space like in e.g. hash tables?) * Retrieving specific random records based on a timestamp or record ID? * Storing inifinite datasets (i.e. whose size is not known in advance - cf. e.g. hash tables) On the other hand, we could periodically transfer data from PostgreSQL into a simpler database (e.g. BDB) for archival purposes (at the expense of more code). Would they be better suited? -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
2009/5/28 Kenneth Marshall k...@rice.edu: One big benefit of partitioning is that you can prune old data with minimal impact to the running system. Doing a large bulk delete would be extremely I/O impacting without partion support. We use this for a DB log system and it allows us to simply truncate a day table instead of a delete -- much, much faster. Thanks. I'll need to investigate how much administrative overhead and fragility partitioning will introduce since the data will also be replicated between 2 servers (I'm thinking of using Slony). Any experience with this combination? -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
On Thu, May 28, 2009 at 5:06 PM, Ivan Voras ivo...@freebsd.org wrote: If you require precise data with the ability to filter, aggregate and correlate over multiple dimensions, something like Hadoop -- or one of the Hadoop-based column database implementations, such as HBase or Hypertable -- might be a better option, combined with MapReduce/Pig to execute analysis jobs This looks like an interesting idea to investigate. Do you have more experience with such databases? How do they fare with the following requirements: We might want to take this discussion off-list, since this list is about PostgreSQL. Feel free to reply privately. * Storing large datasets (do they pack data well in the database? No wasted space like in e.g. hash tables?) Columns databases like Hypertable and HBase are designed to store data quite efficiently. Each column is grouped in a unit called a column family and stored together in chunks usually called SSTables, after the Google Bigtable paper. (When you design your database you must determine which columns are usually accessed together, in other to avoid incurring the I/O cost of loading non-pertinent columns.) Each SSTable is like a partition. When storing a chunk to disk, the column is compressed, each column being stored sequentially for optimal compression. I have used HBase, but I don't have any feel for how much space it wastes. In theory, though, space usage should be more optimal than with PostgreSQL. I have used Cassandra, another column database I would also recommend, which is very efficient. In many ways I prefer Cassandra to HBase -- it's leaner, completely decentralized (no single point of failure) and independent of the rather huge, monolithic Hadoop project -- but it does not currently support MapReduce. If you want to implement some kind of distributed analysis system, you will need to write yourself. All three column stores support mapping information by a time dimension. Each time you write a key, you also provide a timestamp. In theory you can retain the entire history of a single key. HBase lets you specify how many revisions to retain; not sure what Cassandra does. However, Cassandra introduces the notion of a supercolumn family, another grouping level which lets you use the timestamp as a column key. To explain how this works, consider the following inserts: # insert(table_name, key, column, value, timestamp) db.insert(readings, temperature_sensor, value:1, 23, 200905281725023) db.insert(readings, temperature_sensor, value:2, 27, 200905281725023) db.insert(readings, temperature_sensor, value:3, 21, 200905281725023) The resulting temperature_sensor row will have three column values: value:1 value:2 value:3 23 27 21 You can keep adding values and the row will get bigger. Because columns are dynamic, only that row will grow; all other rows will stay the same size. Cassandra users usually use the column name as a kind of value -- image it's like subindexing an array. As you can see, I also passed a timestamp (the 2009.. bit), which is used for versioning. Since anyone can write to any node in a cluster, Cassandra needs to be able to resolve conflicts. Note that these databases are inherently distributed. You can run them on a single node just fine -- and that might be appropriate in your case -- but they really shine when you run a whole cluster. Cassandra is multi-master, so you can just boot up a number of nodes and read from/write to any of them. * Retrieving specific random records based on a timestamp or record ID? Absolutely. * Storing inifinite datasets (i.e. whose size is not known in advance - cf. e.g. hash tables) This is one area where column databases are better than relational ones. The schema is completely dynamic, and you can treat it as a hash table. On the other hand, we could periodically transfer data from PostgreSQL into a simpler database (e.g. BDB) for archival purposes (at the expense of more code). Would they be better suited? Considering the size and sequential nature of the data, I think they would be better match than a simple key-value store like BDB. A. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
depends on how soon do you need to access that data after it's being created, the way I do it in my systems, I get data from 8 points, bit less than you - but I dump it to csv, and import it on database host (separate server). now, you could go to BDB or whatever, but that's not the solution. So, I would try dumping it to a file, and have separate process, maybe separete server that would import it, store it, as database. Whatever you do, as guys said - partition. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
On Thu, May 28, 2009 at 05:24:33PM +0200, Ivan Voras wrote: 2009/5/28 Kenneth Marshall k...@rice.edu: One big benefit of partitioning is that you can prune old data with minimal impact to the running system. Doing a large bulk delete would be extremely I/O impacting without partion support. We use this for a DB log system and it allows us to simply truncate a day table instead of a delete -- much, much faster. Thanks. I'll need to investigate how much administrative overhead and fragility partitioning will introduce since the data will also be replicated between 2 servers (I'm thinking of using Slony). Any experience with this combination? We use Slony1 on a number of databases, but none yet on which we use data partitioning. Cheers, Ken -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
I currently have a database doing something very similar. I setup partition tables with predictable names based on the the data's timestamp week number eg: (Data__WI). I have a tigger on the parent partition table to redirect data to the correct partition( tablename:='Data_' || to_char('$NEW(ts)'::timestamptz, 'IYYY_IW') ) . then I use dynamic sql to do the insert. I did some optimization by writting it in pl/TCL and using global variables to store prepared insert statements. Most queries for me are based on the date and we have decent performance with our current setup. For last/current sensor data we just store the last dataID in the sensor record. I haven't thought of a better way yet. After batch inserts we caculate the last reading for each participating sensorID inserted. With partition tables we struggled with the query to get the lastest data : select * from Data where sensorID=x order by ts limit 1 -- for parition tables. See ( http://archives.postgresql.org/pgsql-performance/2008-11/msg00284.php) On Thu, May 28, 2009 at 7:55 AM, Ivan Voras ivo...@freebsd.org wrote: 2009/5/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com: Ivan Voras wrote: I need to store data about sensor readings. There is a known (but configurable) number of sensors which can send update data at any time. The current state needs to be kept but also all historical records. I'm trying to decide between these two designs: 1) create a table for current data, one record for each sensor, update this table when a sensor reading arrives, create a trigger that would transfer old record data to a history table (of basically the same structure) 2) write only to the history table, use relatively complex queries or outside-the-database magic to determine what the current values of the sensors are. 3) write only to the history table, but have an INSERT trigger to update the table with current data. This has the same performance characteristics as 1, but let's you design your application like 2. Excellent idea! I think I'd choose this approach (or 2), since it can handle out-of-order or delayed arrival of sensor readings gracefully (assuming they are timestamped at source). It seems like your approach is currently the winner. If you go with 2, I'd recommend to still create a view to encapsulate the complex query for the current values, to make the application development simpler. And if it gets slow, you can easily swap the view with a table, updated with triggers or periodically, without changing the application. The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. Whichever design you choose, you should also consider partitioning the data. I'll look into it, but we'll first see if we can get away with limiting the time the data needs to be available. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Storing sensor data
I also forgot to note that I had no problems setting up replication via londiste (skytools). The cronjob that creates the partition each week for me also adds the table to the replication set. As simple as: londiste.py londiste.ini provider add 'public.Data__WI' londiste.py londiste.ini subscriber add 'public.Data__WI' On Thu, May 28, 2009 at 11:56 AM, Greg Jaman gja...@gmail.com wrote: I currently have a database doing something very similar. I setup partition tables with predictable names based on the the data's timestamp week number eg: (Data__WI). I have a tigger on the parent partition table to redirect data to the correct partition( tablename:='Data_' || to_char('$NEW(ts)'::timestamptz, 'IYYY_IW') ) . then I use dynamic sql to do the insert. I did some optimization by writting it in pl/TCL and using global variables to store prepared insert statements. Most queries for me are based on the date and we have decent performance with our current setup. For last/current sensor data we just store the last dataID in the sensor record. I haven't thought of a better way yet. After batch inserts we caculate the last reading for each participating sensorID inserted. With partition tables we struggled with the query to get the lastest data : select * from Data where sensorID=x order by ts limit 1 -- for parition tables. See ( http://archives.postgresql.org/pgsql-performance/2008-11/msg00284.php) On Thu, May 28, 2009 at 7:55 AM, Ivan Voras ivo...@freebsd.org wrote: 2009/5/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com: Ivan Voras wrote: I need to store data about sensor readings. There is a known (but configurable) number of sensors which can send update data at any time. The current state needs to be kept but also all historical records. I'm trying to decide between these two designs: 1) create a table for current data, one record for each sensor, update this table when a sensor reading arrives, create a trigger that would transfer old record data to a history table (of basically the same structure) 2) write only to the history table, use relatively complex queries or outside-the-database magic to determine what the current values of the sensors are. 3) write only to the history table, but have an INSERT trigger to update the table with current data. This has the same performance characteristics as 1, but let's you design your application like 2. Excellent idea! I think I'd choose this approach (or 2), since it can handle out-of-order or delayed arrival of sensor readings gracefully (assuming they are timestamped at source). It seems like your approach is currently the winner. If you go with 2, I'd recommend to still create a view to encapsulate the complex query for the current values, to make the application development simpler. And if it gets slow, you can easily swap the view with a table, updated with triggers or periodically, without changing the application. The volume of sensor data is potentially huge, on the order of 500,000 updates per hour. Sensor data is few numeric(15,5) numbers. Whichever design you choose, you should also consider partitioning the data. I'll look into it, but we'll first see if we can get away with limiting the time the data needs to be available. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org ) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance