subject:"\[PERFORM\] Storing sensor data"

[PERFORM] Storing sensor data

2009-05-28 Thread Ivan Voras

Hi,

I need to store data about sensor readings. There is a known (but
configurable) number of sensors which can send update data at any time.
The current state needs to be kept but also all historical records.
I'm trying to decide between these two designs:

1) create a table for current data, one record for each sensor, update
this table when a sensor reading arrives, create a trigger that would
transfer old record data to a history table (of basically the same
structure)
2) write only to the history table, use relatively complex queries or
outside-the-database magic to determine what the current values of the
sensors are.

The volume of sensor data is potentially huge, on the order of 500,000
updates per hour. Sensor data is few numeric(15,5) numbers.

I think the second design would be easiest on the database but as the
current sensor state can potentially be queried often, it might be too
slow to read.

Any recommendations?



signature.asc
Description: OpenPGP digital signature

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Heikki Linnakangas


Ivan Voras wrote:

I need to store data about sensor readings. There is a known (but
configurable) number of sensors which can send update data at any time.
The current state needs to be kept but also all historical records.
I'm trying to decide between these two designs:

1) create a table for current data, one record for each sensor, update
this table when a sensor reading arrives, create a trigger that would
transfer old record data to a history table (of basically the same
structure)
2) write only to the history table, use relatively complex queries or
outside-the-database magic to determine what the current values of the
sensors are.


3) write only to the history table, but have an INSERT trigger to update 
the table with current data. This has the same performance 
characteristics as 1, but let's you design your application like 2.


I think I'd choose this approach (or 2), since it can handle 
out-of-order or delayed arrival of sensor readings gracefully (assuming 
they are timestamped at source).


If you go with 2, I'd recommend to still create a view to encapsulate 
the complex query for the current values, to make the application 
development simpler. And if it gets slow, you can easily swap the view 
with a table, updated with triggers or periodically, without changing 
the application.



The volume of sensor data is potentially huge, on the order of 500,000
updates per hour. Sensor data is few numeric(15,5) numbers.


Whichever design you choose, you should also consider partitioning the data.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Nikolas Everett

Option 1 is about somewhere between 2 and 3 times more work for the database
than option 2.

Do you need every sensor update to hit the database?  In a situation like
this I'd be tempted to keep the current values in the application itself and
then sweep them all into the database periodically.  If some of the sensor
updates should hit the database faster, you could push those in as you get
them rather than wait for your sweeper.  This setup has the advantage that
you c0an scale up the number of sensors and the frequency the sensors report
without having to scale up the disks.  You can also do the sweeping all in
one transaction or even in one batch update.


On Thu, May 28, 2009 at 9:31 AM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 Ivan Voras wrote:

 The volume of sensor data is potentially huge, on the order of 500,000
 updates per hour. Sensor data is few numeric(15,5) numbers.


 Whichever design you choose, you should also consider partitioning the
 data.



Amen.  Do that.

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Alexander Staubo

On Thu, May 28, 2009 at 2:54 PM, Ivan Voras ivo...@freebsd.org wrote:
 The volume of sensor data is potentially huge, on the order of 500,000
 updates per hour. Sensor data is few numeric(15,5) numbers.

The size of that dataset, combined with the apparent simplicity of
your schema and the apparent requirement for most-sequential access
(I'm guessing about the latter two), all lead me to suspect you would
be happier with something other than a traditional relational
database.

I don't know how exact your historical data has to be. Could you get
by with something like RRDTool? RRdTool is a round-robin database that
stores multiple levels of historical values aggregated by function. So
you typically create an average database, a max database and so
on, with the appropriate functions to transform the data, and you
subdivide these into day, month, year and so on, by the granularity of
your choice.

When you store a value, the historical data is aggregated
appropriately -- at appropriate levels of granularity, so the current
day database is more precise than the monthly one, and so on -- and
you always have access to the exact current data. RRDTool is used by
software such as Munin and Cacti that track a huge number of readings
over time for graphing.

If you require precise data with the ability to filter, aggregate and
correlate over multiple dimensions, something like Hadoop -- or one of
the Hadoop-based column database implementations, such as HBase or
Hypertable -- might be a better option, combined with MapReduce/Pig to
execute analysis jobs

A.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Ivan Voras

2009/5/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
 Ivan Voras wrote:

 I need to store data about sensor readings. There is a known (but
 configurable) number of sensors which can send update data at any time.
 The current state needs to be kept but also all historical records.
 I'm trying to decide between these two designs:

 1) create a table for current data, one record for each sensor, update
 this table when a sensor reading arrives, create a trigger that would
 transfer old record data to a history table (of basically the same
 structure)
 2) write only to the history table, use relatively complex queries or
 outside-the-database magic to determine what the current values of the
 sensors are.

 3) write only to the history table, but have an INSERT trigger to update the
 table with current data. This has the same performance characteristics as
 1, but let's you design your application like 2.

Excellent idea!

 I think I'd choose this approach (or 2), since it can handle out-of-order or
 delayed arrival of sensor readings gracefully (assuming they are timestamped
 at source).

It seems like your approach is currently the winner.

 If you go with 2, I'd recommend to still create a view to encapsulate the
 complex query for the current values, to make the application development
 simpler. And if it gets slow, you can easily swap the view with a table,
 updated with triggers or periodically, without changing the application.

 The volume of sensor data is potentially huge, on the order of 500,000
 updates per hour. Sensor data is few numeric(15,5) numbers.

 Whichever design you choose, you should also consider partitioning the data.

I'll look into it, but we'll first see if we can get away with
limiting the time the data needs to be available.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Ivan Voras

2009/5/28 Nikolas Everett nik9...@gmail.com:
 Option 1 is about somewhere between 2 and 3 times more work for the database
 than option 2.

Yes, for writes.

 Do you need every sensor update to hit the database?  In a situation like

We can't miss an update - they can be delayed but they all need to be recorded.

 this I'd be tempted to keep the current values in the application itself and
 then sweep them all into the database periodically.  If some of the sensor
 updates should hit the database faster, you could push those in as you get
 them rather than wait for your sweeper.  This setup has the advantage that
 you can scale up the number of sensors and the frequency the sensors report
 without having to scale up the disks.  You can also do the sweeping all in
 one transaction or even in one batch update.

It would be nice, but then we need to invest more effort in making the
front-end buffering resilient.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Kenneth Marshall

On Thu, May 28, 2009 at 04:55:34PM +0200, Ivan Voras wrote:
 2009/5/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
  Ivan Voras wrote:
 
  I need to store data about sensor readings. There is a known (but
  configurable) number of sensors which can send update data at any time.
  The current state needs to be kept but also all historical records.
  I'm trying to decide between these two designs:
 
  1) create a table for current data, one record for each sensor, update
  this table when a sensor reading arrives, create a trigger that would
  transfer old record data to a history table (of basically the same
  structure)
  2) write only to the history table, use relatively complex queries or
  outside-the-database magic to determine what the current values of the
  sensors are.
 
  3) write only to the history table, but have an INSERT trigger to update the
  table with current data. This has the same performance characteristics as
  1, but let's you design your application like 2.
 
 Excellent idea!
 
  I think I'd choose this approach (or 2), since it can handle out-of-order or
  delayed arrival of sensor readings gracefully (assuming they are timestamped
  at source).
 
 It seems like your approach is currently the winner.
 
  If you go with 2, I'd recommend to still create a view to encapsulate the
  complex query for the current values, to make the application development
  simpler. And if it gets slow, you can easily swap the view with a table,
  updated with triggers or periodically, without changing the application.
 
  The volume of sensor data is potentially huge, on the order of 500,000
  updates per hour. Sensor data is few numeric(15,5) numbers.
 
  Whichever design you choose, you should also consider partitioning the data.
 
 I'll look into it, but we'll first see if we can get away with
 limiting the time the data needs to be available.
 

Mr. Voras,

One big benefit of partitioning is that you can prune old data with
minimal impact to the running system. Doing a large bulk delete would
be extremely I/O impacting without partion support. We use this for
a DB log system and it allows us to simply truncate a day table instead
of a delete -- much, much faster.

Regards,
Ken

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Ivan Voras

2009/5/28 Alexander Staubo a...@bengler.no:
 On Thu, May 28, 2009 at 2:54 PM, Ivan Voras ivo...@freebsd.org wrote:
 The volume of sensor data is potentially huge, on the order of 500,000
 updates per hour. Sensor data is few numeric(15,5) numbers.

 The size of that dataset, combined with the apparent simplicity of
 your schema and the apparent requirement for most-sequential access
 (I'm guessing about the latter two),

Your guesses are correct, except every now and then a random value
indexed on a timestamp needs to be retrieved.

 all lead me to suspect you would
 be happier with something other than a traditional relational
 database.

 I don't know how exact your historical data has to be. Could you get

No lossy compression is allowed. Exact data is needed for the whole dataset-

 If you require precise data with the ability to filter, aggregate and
 correlate over multiple dimensions, something like Hadoop -- or one of
 the Hadoop-based column database implementations, such as HBase or
 Hypertable -- might be a better option, combined with MapReduce/Pig to
 execute analysis jobs

This looks like an interesting idea to investigate. Do you have more
experience with such databases? How do they fare with the following
requirements:

* Storing large datasets (do they pack data well in the database? No
wasted space like in e.g. hash tables?)
* Retrieving specific random records based on a timestamp or record ID?
* Storing inifinite datasets (i.e. whose size is not known in
advance - cf. e.g. hash tables)

On the other hand, we could periodically transfer data from PostgreSQL
into a simpler database (e.g. BDB) for archival purposes (at the
expense of more code). Would they be better suited?

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Ivan Voras

2009/5/28 Kenneth Marshall k...@rice.edu:


 One big benefit of partitioning is that you can prune old data with
 minimal impact to the running system. Doing a large bulk delete would
 be extremely I/O impacting without partion support. We use this for
 a DB log system and it allows us to simply truncate a day table instead
 of a delete -- much, much faster.

Thanks. I'll need to investigate how much administrative overhead and
fragility partitioning will introduce since the data will also be
replicated between 2 servers (I'm thinking of using Slony). Any
experience with this combination?

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Alexander Staubo

On Thu, May 28, 2009 at 5:06 PM, Ivan Voras ivo...@freebsd.org wrote:
 If you require precise data with the ability to filter, aggregate and
 correlate over multiple dimensions, something like Hadoop -- or one of
 the Hadoop-based column database implementations, such as HBase or
 Hypertable -- might be a better option, combined with MapReduce/Pig to
 execute analysis jobs

 This looks like an interesting idea to investigate. Do you have more
 experience with such databases? How do they fare with the following
 requirements:

We might want to take this discussion off-list, since this list is
about PostgreSQL. Feel free to reply privately.

 * Storing large datasets (do they pack data well in the database? No
 wasted space like in e.g. hash tables?)

Columns databases like Hypertable and HBase are designed to store data
quite efficiently. Each column is grouped in a unit called a column
family and stored together in chunks usually called SSTables, after
the Google Bigtable paper. (When you design your database you must
determine which columns are usually accessed together, in other to
avoid incurring the I/O cost of loading non-pertinent columns.) Each
SSTable is like a partition. When storing a chunk to disk, the column
is compressed, each column being stored sequentially for optimal
compression.

I have used HBase, but I don't have any feel for how much space it
wastes. In theory, though, space usage should be more optimal than
with PostgreSQL. I have used Cassandra, another column database I
would also recommend, which is very efficient. In many ways I prefer
Cassandra to HBase -- it's leaner, completely decentralized (no single
point of failure) and independent of the rather huge, monolithic
Hadoop project -- but it does not currently support MapReduce. If you
want to implement some kind of distributed analysis system, you will
need to write yourself.

All three column stores support mapping information by a time
dimension. Each time you write a key, you also provide a timestamp. In
theory you can retain the entire history of a single key. HBase lets
you specify how many revisions to retain; not sure what Cassandra
does. However, Cassandra introduces the notion of a supercolumn
family, another grouping level which lets you use the timestamp as a
column key. To explain how this works, consider the following inserts:

  # insert(table_name, key, column, value, timestamp)
  db.insert(readings, temperature_sensor, value:1, 23, 200905281725023)
  db.insert(readings, temperature_sensor, value:2, 27, 200905281725023)
  db.insert(readings, temperature_sensor, value:3, 21, 200905281725023)

The resulting temperature_sensor row will have three column values:

  value:1   value:2   value:3
  23  27  21

You can keep adding values and the row will get bigger. Because
columns are dynamic, only that row will grow; all other rows will stay
the same size. Cassandra users usually use the column name as a kind
of value -- image it's like subindexing an array.

As you can see, I also passed a timestamp (the 2009.. bit), which is
used for versioning. Since anyone can write to any node in a cluster,
Cassandra needs to be able to resolve conflicts.

Note that these databases are inherently distributed. You can run them
on a single node just fine -- and that might be appropriate in your
case -- but they really shine when you run a whole cluster. Cassandra
is multi-master, so you can just boot up a number of nodes and read
from/write to any of them.

 * Retrieving specific random records based on a timestamp or record ID?

Absolutely.

 * Storing inifinite datasets (i.e. whose size is not known in
 advance - cf. e.g. hash tables)

This is one area where column databases are better than relational
ones. The schema is completely dynamic, and you can treat it as a hash
table.

 On the other hand, we could periodically transfer data from PostgreSQL
 into a simpler database (e.g. BDB) for archival purposes (at the
 expense of more code). Would they be better suited?

Considering the size and sequential nature of the data, I think they
would be better match than a simple key-value store like BDB.

A.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Grzegorz Jaśkiewicz

depends on how soon do you need to access that data after it's being
created, the way I do it in my systems, I get data from 8 points, bit
less than you - but I dump it to csv, and import it on database host
(separate server).
now, you could go to BDB or whatever, but that's not the solution.

So, I would try dumping it to a file, and have separate process, maybe
separete server that would import it, store it, as database. Whatever
you do, as guys said - partition.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Kenneth Marshall

On Thu, May 28, 2009 at 05:24:33PM +0200, Ivan Voras wrote:
 2009/5/28 Kenneth Marshall k...@rice.edu:
 
 
  One big benefit of partitioning is that you can prune old data with
  minimal impact to the running system. Doing a large bulk delete would
  be extremely I/O impacting without partion support. We use this for
  a DB log system and it allows us to simply truncate a day table instead
  of a delete -- much, much faster.
 
 Thanks. I'll need to investigate how much administrative overhead and
 fragility partitioning will introduce since the data will also be
 replicated between 2 servers (I'm thinking of using Slony). Any
 experience with this combination?
 

We use Slony1 on a number of databases, but none yet on which we
use data partitioning.

Cheers,
Ken

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Greg Jaman

I currently have a database doing something very similar.  I setup partition
tables with predictable names based on the the data's timestamp week number
eg:  (Data__WI).

I have a tigger on the parent partition table to redirect data to the
correct partition( tablename:='Data_' || to_char('$NEW(ts)'::timestamptz,
'IYYY_IW') ) .  then I use dynamic sql to do the insert.  I did some
optimization by writting it in pl/TCL and using global variables to store
prepared insert statements.

Most queries for me are based  on the date and we have decent performance
with our current setup.  For last/current sensor data we just store the last
dataID in the sensor record.  I haven't thought of a better way yet.  After
batch inserts we caculate the last reading for each participating sensorID
inserted.

With partition tables we struggled with the query to get the lastest data :
select * from Data where sensorID=x order by ts limit 1  -- for parition
tables.   See (
http://archives.postgresql.org/pgsql-performance/2008-11/msg00284.php)





On Thu, May 28, 2009 at 7:55 AM, Ivan Voras ivo...@freebsd.org wrote:

 2009/5/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
  Ivan Voras wrote:
 
  I need to store data about sensor readings. There is a known (but
  configurable) number of sensors which can send update data at any time.
  The current state needs to be kept but also all historical records.
  I'm trying to decide between these two designs:
 
  1) create a table for current data, one record for each sensor, update
  this table when a sensor reading arrives, create a trigger that would
  transfer old record data to a history table (of basically the same
  structure)
  2) write only to the history table, use relatively complex queries or
  outside-the-database magic to determine what the current values of the
  sensors are.
 
  3) write only to the history table, but have an INSERT trigger to update
 the
  table with current data. This has the same performance characteristics
 as
  1, but let's you design your application like 2.

 Excellent idea!

  I think I'd choose this approach (or 2), since it can handle out-of-order
 or
  delayed arrival of sensor readings gracefully (assuming they are
 timestamped
  at source).

 It seems like your approach is currently the winner.

  If you go with 2, I'd recommend to still create a view to encapsulate the
  complex query for the current values, to make the application development
  simpler. And if it gets slow, you can easily swap the view with a table,
  updated with triggers or periodically, without changing the application.
 
  The volume of sensor data is potentially huge, on the order of 500,000
  updates per hour. Sensor data is few numeric(15,5) numbers.
 
  Whichever design you choose, you should also consider partitioning the
 data.

 I'll look into it, but we'll first see if we can get away with
 limiting the time the data needs to be available.

 --
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance

Re: [PERFORM] Storing sensor data

2009-05-28 Thread Greg Jaman

I also forgot to note that I had no problems setting up replication via
londiste (skytools).   The cronjob that creates the partition each week for
me also adds the table to the replication set.   As simple as:
   londiste.py londiste.ini provider add 'public.Data__WI'
   londiste.py londiste.ini subscriber add 'public.Data__WI'






On Thu, May 28, 2009 at 11:56 AM, Greg Jaman gja...@gmail.com wrote:

 I currently have a database doing something very similar.  I setup
 partition tables with predictable names based on the the data's timestamp
 week number eg:  (Data__WI).

 I have a tigger on the parent partition table to redirect data to the
 correct partition( tablename:='Data_' || to_char('$NEW(ts)'::timestamptz,
 'IYYY_IW') ) .  then I use dynamic sql to do the insert.  I did some
 optimization by writting it in pl/TCL and using global variables to store
 prepared insert statements.

 Most queries for me are based  on the date and we have decent performance
 with our current setup.  For last/current sensor data we just store the last
 dataID in the sensor record.  I haven't thought of a better way yet.  After
 batch inserts we caculate the last reading for each participating sensorID
 inserted.

 With partition tables we struggled with the query to get the lastest data
 :  select * from Data where sensorID=x order by ts limit 1  -- for
 parition tables.   See (
 http://archives.postgresql.org/pgsql-performance/2008-11/msg00284.php)





 On Thu, May 28, 2009 at 7:55 AM, Ivan Voras ivo...@freebsd.org wrote:

 2009/5/28 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:
  Ivan Voras wrote:
 
  I need to store data about sensor readings. There is a known (but
  configurable) number of sensors which can send update data at any time.
  The current state needs to be kept but also all historical records.
  I'm trying to decide between these two designs:
 
  1) create a table for current data, one record for each sensor,
 update
  this table when a sensor reading arrives, create a trigger that would
  transfer old record data to a history table (of basically the same
  structure)
  2) write only to the history table, use relatively complex queries or
  outside-the-database magic to determine what the current values of
 the
  sensors are.
 
  3) write only to the history table, but have an INSERT trigger to update
 the
  table with current data. This has the same performance characteristics
 as
  1, but let's you design your application like 2.

 Excellent idea!

  I think I'd choose this approach (or 2), since it can handle
 out-of-order or
  delayed arrival of sensor readings gracefully (assuming they are
 timestamped
  at source).

 It seems like your approach is currently the winner.

  If you go with 2, I'd recommend to still create a view to encapsulate
 the
  complex query for the current values, to make the application
 development
  simpler. And if it gets slow, you can easily swap the view with a table,
  updated with triggers or periodically, without changing the application.
 
  The volume of sensor data is potentially huge, on the order of 500,000
  updates per hour. Sensor data is few numeric(15,5) numbers.
 
  Whichever design you choose, you should also consider partitioning the
 data.

 I'll look into it, but we'll first see if we can get away with
 limiting the time the data needs to be available.

 --
 Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org
 )
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-performance

[PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

Re: [PERFORM] Storing sensor data

14 matches

Site Navigation

Mail list logo

Footer information