Re: [PERFORM] [GENERAL] Approach to Data Summary and Analysis

2014-04-15 Thread Chris Curvey
On Mon, Apr 14, 2014 at 12:27 PM, Robert DiFalco
robert.difa...@gmail.comwrote:

 I have several related tables that represent a call state. Let's think of
 these as phone calls to simplify things. Sometimes I need to determine the
 last time a user was called, the last time a user answered a call, or the
 last time a user completed a call.

 The basic schema is something like this:

 CREATE TABLE calls (
   id  BIGINT NOT NULL, // sequence generator
   user_id BIGINT NOT NULL,
   called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

   PRIMARY KEY (id),
   FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
 );

 CREATE TABLE calls_answered (
   idBIGINT NOT NULL,
   answered  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

   PRIMARY KEY (id),
   FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
 );


 And so on for calls_connected, calls_completed, call_errors, etc.

 Occasionally I will want to know things like When was the last time a
 user answered a call or How many times has a user been called.

 I can do these queries using a combination of MAX or COUNT. But I'm
 concerned about the performance.

 SELECT MAX(a.id)
 FROM calls_answered a JOIN calls c ON c.id = a.id
 WHERE c.user_id = ?;


 Or the number of answered calls:

 SELECT MAX(a.id)
 FROM calls_answered a JOIN calls c ON c.id = a.id
 WHERE c.user_id = ?;


 Sometimes I might want to get this data for a whole bunch of users. For
 example, give me all users whose have not answered a call in the last 5
 days. Or even what percentage of users called actually answered a call.
 This approach could become a performance issue. So the other option is to
 create a call_summary table that is updated with triggers.

 The summary table would need fields like user_id, last_call_id,
 call_count, last_answered_id, answered_count, last_completed_id,
 last_completed_count, etc.

 My only issue with a summary table is that I don't want a bunch of null
 fields. For example, if the user was *called* but they have never
 *answered* at call then the last_call_id and call_count fields on the
 summary table would be non-NULL but the last_answer_id and answer_count
 fields WOULD be NULL. But over time all fields would eventually become
 non-NULL.

 So that leads me to a summary table for EACH call state. Each summary
 table would have a user id, a ref_id, and a count -- one summary table for
 each state e.g. call_summary, call_answered_summary, etc.

 This approach has the down side that it creates a lot of tables and
 triggers. It has the upside of being pretty efficient without having to
 deal with NULL values.  It's also pretty easy to reason about.

 So for my question -- is the choice between these a personal preference
 sort of thing or is there a right or wrong approach? Am I missing another
 approach that would be better?  I'm okay with SQL but I'm not expert so I'm
 not sure if there is an accepted DESIGN PATTERN for this that I am missing.

 Thanks!



My initial thought is:  that design is over-normalized. The thing you are
trying to model is the call, and it has severl attributes, some of which
may be unknown or not applicable (which is what NULL is for).  So my
thought would be to do something like this:

CREATE TABLE calls (
  id  BIGINT NOT NULL, // sequence generator
  user_id BIGINT NOT NULL,
  called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
answered  TIMESTAMPTZ

  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);


-- 
I asked the Internet how to train my cat, and the Internet told me to get a
dog.


Re: [PERFORM] [GENERAL] Approach to Data Summary and Analysis

2014-04-15 Thread Chris Curvey
On Tue, Apr 15, 2014 at 10:56 AM, Chris Curvey ch...@chriscurvey.comwrote:

 On Mon, Apr 14, 2014 at 12:27 PM, Robert DiFalco robert.difa...@gmail.com
  wrote:

 I have several related tables that represent a call state. Let's think of
 these as phone calls to simplify things. Sometimes I need to determine the
 last time a user was called, the last time a user answered a call, or the
 last time a user completed a call.

 The basic schema is something like this:

 CREATE TABLE calls (
   id  BIGINT NOT NULL, // sequence generator
   user_id BIGINT NOT NULL,
   called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

   PRIMARY KEY (id),
   FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
 );

 CREATE TABLE calls_answered (
   idBIGINT NOT NULL,
   answered  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

   PRIMARY KEY (id),
   FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
 );


 And so on for calls_connected, calls_completed, call_errors, etc.

 Occasionally I will want to know things like When was the last time a
 user answered a call or How many times has a user been called.

 I can do these queries using a combination of MAX or COUNT. But I'm
 concerned about the performance.

 SELECT MAX(a.id)
 FROM calls_answered a JOIN calls c ON c.id = a.id
 WHERE c.user_id = ?;


 Or the number of answered calls:

 SELECT MAX(a.id)
 FROM calls_answered a JOIN calls c ON c.id = a.id
 WHERE c.user_id = ?;


 Sometimes I might want to get this data for a whole bunch of users. For
 example, give me all users whose have not answered a call in the last 5
 days. Or even what percentage of users called actually answered a call.
 This approach could become a performance issue. So the other option is to
 create a call_summary table that is updated with triggers.

 The summary table would need fields like user_id, last_call_id,
 call_count, last_answered_id, answered_count, last_completed_id,
 last_completed_count, etc.

 My only issue with a summary table is that I don't want a bunch of null
 fields. For example, if the user was *called* but they have never
 *answered* at call then the last_call_id and call_count fields on the
 summary table would be non-NULL but the last_answer_id and answer_count
 fields WOULD be NULL. But over time all fields would eventually become
 non-NULL.

 So that leads me to a summary table for EACH call state. Each summary
 table would have a user id, a ref_id, and a count -- one summary table for
 each state e.g. call_summary, call_answered_summary, etc.

 This approach has the down side that it creates a lot of tables and
 triggers. It has the upside of being pretty efficient without having to
 deal with NULL values.  It's also pretty easy to reason about.

 So for my question -- is the choice between these a personal preference
 sort of thing or is there a right or wrong approach? Am I missing another
 approach that would be better?  I'm okay with SQL but I'm not expert so I'm
 not sure if there is an accepted DESIGN PATTERN for this that I am missing.

 Thanks!



 (Sorry, fat-fingered and hit send too early...)

CREATE TABLE calls (
  id  BIGINT NOT NULL, // sequence generator
  user_id BIGINT NOT NULL,
  called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
  answered  TIMESTAMPTZ NULL,
  connected  TIMESTAMPTZ NULL,
  completed  TIMESTAMPTZ NULL,


  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

Then  your queries end up looking like this:

--last time john answered
SELECT MAX(a.id)
FROM calls
where answered is not null
and user_id = ?

-- users that have not answered a call in the last five days (I can think
of a few ways to interpret that phrase)
select myusers.*
from myusers
where not exists
(  select *
  from calls
  where calls.user_id = myusers.user_id
  and answered = five days ago)

-- average ring time
select avg(extract ('seconds' from called - answered))
where answered is not null



-- 
I asked the Internet how to train my cat, and the Internet told me to get a
dog.


Re: [PERFORM] [GENERAL] Approach to Data Summary and Analysis

2014-04-15 Thread Robert DiFalco
Actually that was exactly the initial table design. There were more fields
because for my use case there were a lot more states and certain states
have additional data (for example when a call goes from answered to
connected it also gets the user_id of the person being connected to). So
that one table started getting a LOT of columns which starting making it
hard to reason about.

The more normalized version has a couple of things going for it. COUNT,
MIN, MAX, etc are very fast because I don't have to conditionally add null
checks. Everything is inserted so for the millions of calls that get made
the normalized schema was much more efficient for writing. It was also
easier to understand. The answer table only has calls that were answered,
the error table only has calls the resulted in an error after being
connected, etc.

I know this kind of gets into a religious area when discussing NULLs and
what level of normalization is appropriate so I don't want to spark any of
that on this thread. But only doing inserts and never doing updates or
deletes performed very well for large data sets.

That said, I could explore a compromise between the monolithic table
approach and the completely normalized set of tables approach. Thanks for
your input!


On Tue, Apr 15, 2014 at 8:12 AM, Chris Curvey ch...@chriscurvey.com wrote:




 On Tue, Apr 15, 2014 at 10:56 AM, Chris Curvey ch...@chriscurvey.comwrote:

 On Mon, Apr 14, 2014 at 12:27 PM, Robert DiFalco 
 robert.difa...@gmail.com wrote:

 I have several related tables that represent a call state. Let's think
 of these as phone calls to simplify things. Sometimes I need to determine
 the last time a user was called, the last time a user answered a call, or
 the last time a user completed a call.

 The basic schema is something like this:

 CREATE TABLE calls (
   id  BIGINT NOT NULL, // sequence generator
   user_id BIGINT NOT NULL,
   called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

   PRIMARY KEY (id),
   FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
 );

 CREATE TABLE calls_answered (
   idBIGINT NOT NULL,
   answered  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

   PRIMARY KEY (id),
   FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
 );


 And so on for calls_connected, calls_completed, call_errors, etc.

 Occasionally I will want to know things like When was the last time a
 user answered a call or How many times has a user been called.

 I can do these queries using a combination of MAX or COUNT. But I'm
 concerned about the performance.

 SELECT MAX(a.id)
 FROM calls_answered a JOIN calls c ON c.id = a.id
 WHERE c.user_id = ?;


 Or the number of answered calls:

 SELECT MAX(a.id)
 FROM calls_answered a JOIN calls c ON c.id = a.id
 WHERE c.user_id = ?;


 Sometimes I might want to get this data for a whole bunch of users. For
 example, give me all users whose have not answered a call in the last 5
 days. Or even what percentage of users called actually answered a call.
 This approach could become a performance issue. So the other option is to
 create a call_summary table that is updated with triggers.

 The summary table would need fields like user_id, last_call_id,
 call_count, last_answered_id, answered_count, last_completed_id,
 last_completed_count, etc.

 My only issue with a summary table is that I don't want a bunch of null
 fields. For example, if the user was *called* but they have never
 *answered* at call then the last_call_id and call_count fields on the
 summary table would be non-NULL but the last_answer_id and answer_count
 fields WOULD be NULL. But over time all fields would eventually become
 non-NULL.

 So that leads me to a summary table for EACH call state. Each summary
 table would have a user id, a ref_id, and a count -- one summary table for
 each state e.g. call_summary, call_answered_summary, etc.

 This approach has the down side that it creates a lot of tables and
 triggers. It has the upside of being pretty efficient without having to
 deal with NULL values.  It's also pretty easy to reason about.

 So for my question -- is the choice between these a personal preference
 sort of thing or is there a right or wrong approach? Am I missing another
 approach that would be better?  I'm okay with SQL but I'm not expert so I'm
 not sure if there is an accepted DESIGN PATTERN for this that I am missing.

 Thanks!



 (Sorry, fat-fingered and hit send too early...)

 CREATE TABLE calls (
   id  BIGINT NOT NULL, // sequence generator
   user_id BIGINT NOT NULL,
   called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
   answered  TIMESTAMPTZ NULL,
   connected  TIMESTAMPTZ NULL,
   completed  TIMESTAMPTZ NULL,


   PRIMARY KEY (id),
   FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
 );

 Then  your queries end up looking like this:

 --last time john answered
 SELECT MAX(a.id)
 FROM calls
 where answered is not null
 and user_id = ?

 -- users that have not answered a