Re: [GENERAL] Running Total with a Set Maximum

2010-09-01 Thread Peter Hunsberger
On Wed, Sep 1, 2010 at 5:20 AM, Paul Newman pnew...@prohire.co.uk wrote:
 Hi,
 Could someone out there help me.
 Given the following set of values
 +3
 +3
 0
 +3
 +3
 -3
 0
 +3

 I want to have a maximum sum of 6 at any one point , in other words I want
 the following


 +3  --  3
 +3  --  6
 0   --  6
 +3  --  6
 +3  --  6
 -3  --  3
 0  -- 3
 +3  -- 6

 How can I do this ?


Looks like what you really want is a running total where the max is no
more than 6 at any time.  Except it's not really a running total since
the current total is dependent on the last max?

You might be able to pull this off with some combination of window
functions and max(), but this is a rather odd mathematical operation,
so it's not easy to code up directly in a single query.  If you can add
another column that holds the running total to date, and if you can use
a sequence for another column (the primary key, perhaps?), then the
problem becomes pretty trivial: at insert time you simply pick the
running total from the most recent row (found via the max sequence
value) and create the new (pseudo) running total with a case statement.
If you can't do that, then I think it would be easiest to code this up
in a procedure, but before anyone jumps on that you might want to let
us know whether you are free to add columns to the schema?
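
If you can add the columns, a minimal sketch of that insert-time
approach (assuming a hypothetical table "totals" with a serial primary
key id, the raw delta, and the capped running column):

    INSERT INTO totals (delta, running)
    SELECT 3,
           CASE WHEN prev + 3 > 6 THEN 6 ELSE prev + 3 END
    FROM (SELECT coalesce((SELECT running
                           FROM totals
                           ORDER BY id DESC
                           LIMIT 1), 0) AS prev) AS t;

Each new row caps its own total at 6, so the stored column ends up
being exactly the sequence you listed.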

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ORM integration?

2010-08-13 Thread Peter Hunsberger
On Fri, Aug 13, 2010 at 1:52 AM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 Hi folks

 I know many people here loathe ORM systems. I'm one of them, but I still
 use them when they appear to be appropriate, despite their problems.

 In the process I've come to realize that ORMs in general have a couple
 of issues that could be avoided with some help from the database.
 Specifically:

 - They need to fetch graphs of records that keep records associated
  with peers in relationships. Essentially they want nested sets.
  Ideally they need to be able to do this with a WHERE or
  LIMIT/OFFSET on the root relation, so they can do batch
  fetching of blocks of records in contexts where holding a
  transaction open (permitting cursor use) isn't appropriate.

  Currently ORMs do this by doing multiple LEFT OUTER JOINs and
  post-processing the results to eliminate duplication of data.
  Needless to say this is incredibly inefficient. It also makes
  using LIMIT/OFFSET nigh impossible, so they often fetch the whole
  data set into local memory (!!) even if the app only asks for a
  small fragment of it.

  A native way to fetch a query's results as a nested set, as
  (say) XML or JSON, would potentially be a huge bonus if ORM
  systems could be convinced to use it. It's potentially possible
  already with use of nested subqueries and array_agg.

  I've even been idly playing with the idea of using
  PL/Java to build a Java object graph in memory and send that
  to the client!

  I'm wondering  if anyone's been down  the path of building a
  more ORM-friendly relation graph representation database-side
  and sending it to the client.


 - ORMs tend to lack cache coherence. They generally maintain a cache
  of records fetched from the database - partly because their fetching
  is so expensive (as noted above) and partly just as a general
  performance optimisation. The problem is that the database has no
  way to invalidate the ORM's cache of a particular record when changes
  are made to it in the database, so the ORM's cache tends to get out
  of sync with the database.

<snip/>

We do not use an ORM, but we do build very generalized graphs of
metadata in the database, with end results very similar to what you
describe.  I've written about this a bit on this list and won't repeat
myself, other than to say once more that we use the set / subset model
popularized by Joe Celko.  We decompose the graphs into trees and use
in-order tree traversal, with individual tree nodes sharing a common FK
to identify them in the forest.  A single tree is a single flat query.
Our largest individual trees run to 60,000 nodes or so.  Graphs are
built and managed in memory.  We do cache because, at the moment, the
construction of the metadata that is pointed to by these trees is
expensive (long story, unrelated to the issues you're talking about).
However, there is a task to optimize that and I suspect we won't need
caching once we're done (though we may keep it, since it is in place
and working).  All tree management goes through the same code that
creates the in-memory graphs in the first place, so cache invalidation
is not an issue.
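
A minimal sketch of the kind of structure I mean (names are invented);
each row carries the set / subset bounds plus the common FK saying
which tree in the forest it belongs to:

    CREATE TABLE tree_node (
        node_id     int PRIMARY KEY,
        tree_id     int NOT NULL,   -- common FK: which tree in the forest
        lft         int NOT NULL,   -- set / subset bounds from the traversal
        rgt         int NOT NULL,
        metadata_id int             -- whatever the node points at
    );

    -- one whole tree comes back as a single flat query, already in
    -- traversal order
    SELECT node_id, lft, rgt, metadata_id
      FROM tree_node
     WHERE tree_id = 42
     ORDER BY lft;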

In the end we have a sort of ORM; however, we can add new tables to
our schema (with some fairly strict limitations) and define the
metadata to manage the CRUD operations on it all at run time.  Is this
applicable in general?  Probably, but the code is not trivial...
Could we benefit from hooks into the database for this?  Perhaps:
datatypes that explicitly knew they had the set / subset relationship
to each other could allow moving big chunks (if not all) of the tree
management out of the Java code and into some form of operation within
the database.  Thinking out loud here, I guess that could be generally
useful; it seems you'd end up with a sort of ltree, but I think moving
the generalization this far into the database means that the tree
nodes are managing (possibly poorly defined) FK relationships.  That's
probably OK for this kind of thing; there are ways to get the
equivalent of strong typing back, either on the DB side or at run time.
You'd essentially end up hacking a relational database to support
network-database-style operations, so to the extent that people need
that, you've got something useful...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] ORM integration?

2010-08-13 Thread Peter Hunsberger
On Fri, Aug 13, 2010 at 1:52 AM, Craig Ringer
cr...@postnewspapers.com.au wrote:
  I'm currently thinking that the upgraded LISTEN/NOTIFY mechanism
  in 9.0 might be a good channel for sending cache invalidation
  messages with. Thoughts? Ideas?


Forgot to comment on this.  Since we've already got our cache
management in place we don't need this, but yeah, it could have
simplified things in many ways.  But what's the mechanism / transport
for the notification?  MQ-type stuff?  Eek...
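
For what it's worth, a rough sketch of how the 9.0 payload form could
carry invalidation messages (channel name and payload format are made
up); the transport question still stands, since every cache process has
to hold a connection open and poll for notifications:

    -- fired from application code or a trigger (pg_notify() is the
    -- function form usable inside triggers) after a row changes:
    NOTIFY cache_invalidation, 'users:1234';

    -- in each ORM / cache process:
    LISTEN cache_invalidation;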

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Dynamic data model, locks and performance

2010-07-28 Thread Peter Hunsberger
On Wed, Jul 28, 2010 at 10:15 PM, Pierre Thibault
pierre.thibau...@gmail.com wrote:


 What I would like to do is enable users to create their own data model. 
 Enable them to create a model and make  it evolve. For example, it would be 
 cool to create a model to represent car adds. Then, the buyers can search 
 adds  using the previous model with specific criteria like the color of the 
 car they want.


Sounds like a standard parts explosion or component / subcomponent
hierarchy?  The most general form has two pieces:

1) a tree model of some form. I like the set / subset model that Joe
Celko popularized (search the archives for other references to it from
me);

2) an extensible set of containers to hang off the tree.  Here I like
to use a strongly typed version of the Entity Attribute Value (EAV)
model;

Those two pieces on their own are so general that you can't really
scale the results very far.  However, if you've got some specific
goals in mind I might be able to provide some hints.  If you're an
IEEE member I can point you at a presentation I did on the basics (I
think)...
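
Very roughly, piece 2) above might look something like this (all names
invented); it hangs off the tree nodes from piece 1), with the value
kept strongly typed by giving each base type its own column:

    CREATE TABLE attribute_def (
        attr_id   int PRIMARY KEY,
        attr_name text NOT NULL,
        attr_type text NOT NULL     -- 'text', 'numeric', 'date', ...
    );

    CREATE TABLE attribute_value (
        node_id  int NOT NULL,      -- FK to the tree node it hangs off
        attr_id  int NOT NULL REFERENCES attribute_def,
        text_val text,
        num_val  numeric,
        date_val date               -- only the column matching attr_type is used
    );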

--
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-22 Thread Peter Hunsberger
On Wed, Jul 21, 2010 at 11:13 PM, Howard Rogers h...@diznix.com wrote:

 On Thu, Jul 22, 2010 at 1:02 PM, Scott Marlowe scott.marl...@gmail.com 
 wrote:


 
  Why on Earth would I want to store this sort of stuff in a bit string?!
 
  Because you are manipulating bits and not integers?  I guess there are
  10 kinds of people, those who like think in binary and those who
  don't.
 
  I don't know about you, but I find looking at 21205 a darn'd site
  easier than staring blankly at 101001011010101!!

<snip>lots of stuff</snip>

 
  Note you can cast integer to bitstring, but there may be some odd
  behaviour for sign bits and such.  Which is again why I'd use the
  right type for the job, bit string.  But it's your project.
 

 Quoting...

  Because you are manipulating bits and not integers?  I guess there are
  10 kinds of people, those who like think in binary and those who
  don't.

 Er, no. 21205 is not an integer. It's an encoded bit of magic.


In that case your database design is fundamentally broken.  A database
should have content fields that map to the needs of the application.
As you describe your application requirements, that is a bit string
and not an integer.  Use bit strings and your application logic is
transparent, obvious and easy to maintain.  Use integers and you have
to resort to magic.  As you say, it's your choice, but you came here
looking for advice and the advice you were given is very good

--
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Bitmask trickiness

2010-07-22 Thread Peter Hunsberger
On Thu, Jul 22, 2010 at 4:09 PM, Howard Rogers h...@diznix.com wrote:

 On Fri, Jul 23, 2010 at 12:35 AM, Peter Hunsberger
 peter.hunsber...@gmail.com wrote:
  On Wed, Jul 21, 2010 at 11:13 PM, Howard Rogers h...@diznix.com wrote:
 
 there's a room-full of users who can look
 at code '4097' and know precisely what it means and would be mortified
 if I suddenly started displaying exactly the same meanings in what, to
 them, would look like utter gibberish.


In that case, you shouldn't be describing the column as some encoded
bit of magic here.  It clearly has some semantic meaning which gives
you a reason to want to keep it that way.  Though why your users are
dealing with the raw values as stored in the database may be another
issue to deal with: personally, I'd say store it in the way that is
easiest for your application logic to deal with, and display it in the
form that is easiest for your users to deal with.  They are often two
completely different things...
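
For example, PostgreSQL will cast between the two representations, so
you could store bit strings for the application logic and still show
users their familiar numeric codes:

    SELECT 4097::bit(16);             -- 0001000000000001
    SELECT B'0001000000000001'::int;  -- 4097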

--
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] \COPY ... CSV with hex escapes

2010-07-07 Thread Peter Hunsberger
On Wed, Jul 7, 2010 at 9:21 PM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 Hi folks

 I have an odd csv input format to deal with. I'm about to put some
 Python together to reprocess it, but I thought I'd check in and see if
 I'm missing something obvious in \copy's capabilities.

 The input is fairly conventional comma-delimeted text with quoted
 fields, as output by Sybase SQL Anywhere 5.5's isql 'OUTPUT TO' filter.
 Yes, that's ancient. It is handled quite happily by \copy in csv mode,
 except that when csv mode is active, \xnn escapes do not seem to be
 processed. So I can have *either* \xnn escape processing *or* csv-style
 input processing.

 Anyone know of a way to get escape processing in csv mode?

Don't know if you can do it directly, but this seems like one of those
cases where an ETL tool like Pentaho's (Kettle / Spoon) might be in
order?  One step to handle the escape chars and one to load the actual
CSV...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Partial indexes instead of partitions

2010-06-14 Thread Peter Hunsberger
On Mon, Jun 14, 2010 at 7:27 AM, David Wilson david.t.wil...@gmail.com wrote:


 On Mon, Jun 14, 2010 at 5:24 AM, Leonardo F m_li...@yahoo.it wrote:

  For inserts I do not see the reason
  why
  it would be better to use index partitioning because AFAIK
  b-tree
  would behave exactly the same in both cases.

 no, when the index gets very big inserting random values gets
 very slow.

 Do you have any empirical evidence for this being a real problem, or are you
 simply guessing? I have tables with 500m+ rows, on commodity hardware (4
 SATA disks in raid 10), and inserts to the indexes on those tables remain
 quite acceptable from a performance standpoint.


Can you define acceptable?  IIRC the OP is looking for 20,000+ inserts / sec.


-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Cognitive dissonance

2010-06-08 Thread Peter Hunsberger
On Tue, Jun 8, 2010 at 4:04 AM, John Gage jsmg...@numericable.fr wrote:
 Unix is a text-based operating system with unbelievably helpful text
 manipulation tools.

 Postgres is a creature of Unix which happens to have unbelievable text
 searching and manipulation tools.

 Yet, the only one file edition of the Postgres documentation is in...pdf
 format.  Huh?


I suppose the next thing you'll be suggesting is that, because
Postgres is a database, the documentation should be stored as some
form of searchable table within the database itself?

<runs and hides/>

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Queues Problem

2010-06-08 Thread Peter Hunsberger
On Tue, Jun 8, 2010 at 1:00 PM, uaca man uaca...@gmail.com wrote:
 This would work except for one thing, the building may affect another
 buildings, Consider this:

 the user starts one construction that will finish in 10 minutes and the
 building will give a bonus of +5 gold each seconds for the user. This has to
 be available in the seconds that the build is done and not one hour late
 because the user will lose the bonus for one hour.


Sounds like you need to take a big step back and figure out what
overall systems architecture makes sense for a real-time gaming
platform.  A couple of suggestions:

1) If you need real-time events, do them where you need them: on the
front end, where the user interaction is happening.  Do not attempt to
code gaming rules and logic into database triggers and back-end
relationships (though you might store gaming rules etc. in a table).

2) Think of the front end as changing states as the user interacts
with it, then figure out what queries need to be made to correspond to
the changes in state.  For example, it is unlikely the user needs the
amount of gold updated every 5 seconds.  Rather, they need to know how
much they have on hand when they go to use it.  At that point, you
query for the old balance, find the last-updated time and how many
buildings have been completed since then (and for how long), and figure
out what the new gold balance is.  Update the new balance at that point
(with a timestamp), and the front end goes on its merry way...
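
Very roughly, something like the following (table and column names are
pure guesses at your schema):

    UPDATE player p
       SET gold = p.gold + coalesce((
               SELECT sum(b.gold_per_second *
                          extract(epoch FROM now() -
                                  greatest(b.completed_at, p.gold_updated_at)))
                 FROM building b
                WHERE b.player_id = p.id
                  AND b.completed_at <= now()), 0),
           gold_updated_at = now()
     WHERE p.id = 42;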

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Queues Problem

2010-06-08 Thread Peter Hunsberger
On Tue, Jun 8, 2010 at 1:26 PM, uaca man uaca...@gmail.com wrote:
 2) Think of the front end as changing states as the user interacts
 with it, then figure out what queries need to be made to correspond to
 the changes in state.  For example, it is unlikely the user needs the
 amount of gold updated every 5 seconds.  Rather, they need to know
 how much they have on hand when they go to use it.  At that point, you
 query for the old balance, find the last updated time, how many
 buildings have been completed since then and for how long and figure
 out what the new gold balance is.  Update the new balance at that
 point (with a timestamp), and the front end goes on it's merry way...

 That is exactly what we are doing for the most part and was our first bet
 with the buildings, however since building can affect pretty much anything,
 anywhere on the game changing states as the user interacts is getting beyond
 comprehension of a human mind(al least for my mind) and that was when I had
 the super idea, lest put the queue on the crontab!

Then each thing the building interacts with has its own unique set of
states.  The only ones you need to worry about are the ones a _user_ is
actually interacting with at any given point.


 Looks like we are going to cut off a few options of the game.
 ps: do i top post or bottom post here?


Bottom post.
-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Hiding data in postgresql

2010-05-24 Thread Peter Hunsberger
On Mon, May 24, 2010 at 2:16 PM, Hector Beyers hqbey...@gmail.com wrote:

 Hi guys,
 does ANYONE have any tips on hiding data on a database server? This means
 that data is stored in places that is not necessarily picked up in the
 schema of the database. I am doing some research on databases and need some
 direction.
 Any help or direction will be highly appreciated.

Let me guess: an attempt at security by obscurity?

I suppose you could always create a couple of columns such that some
function applied over them produced the real result.  You could even
store this in an index, so although you could never see the result
directly (except in a dump), queries to get at it might perform half
reasonably.

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Hiding data in postgresql

2010-05-24 Thread Peter Hunsberger
On Mon, May 24, 2010 at 3:10 PM, Hector Beyers hqbey...@gmail.com wrote:
 Wow, this is really an idea I can work with.
 I know this is getting really brainstorming, but do you think it is possible
 to apply steganography (hiding data in pictures) tactics to the columns of a
 database?

If by that you mean a variation on my original reply, then yes, but
you'd need at least one column that did not necessarily have contents
that looked like normal text.  This column would work as some form of
offset pointer into the other columns.  The result wouldn't necessarily
require much in the way of a reconstruction algorithm, but in that case
it would likely also be fairly easy to figure out if someone stumbled
across your scheme.

If you want the truly secure variation on this then use some public
key / private key scheme in conjunction with this (in particular, ECC
could be pretty fast and have a small key size), so that some portion
of this is encrypted.  However, if you're doing that, you might as
well just encrypt the data directly...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Hiding data in postgresql

2010-05-24 Thread Peter Hunsberger
On Mon, May 24, 2010 at 3:24 PM, Hector Beyers hqbey...@gmail.com wrote:
 Dear Peter,
 can you elaborate on what you mean by storing 'this' in the index. Are you
 referring to the function that is applied over the data?
 How would you be able to see the result with a dump?

Yes, you would use the function to create the index, something like:

create index gunk on tbl1 ( hidefunc( a, b, c ) )

The data created in this index might be viewable in a database dump by
simple inspection with something like an editor.  How meaningful it
might be is another question...
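
Note that a function used in an index expression has to be IMMUTABLE.
Purely as an illustration (hidefunc is whatever mangling you choose;
md5 is used here just to have something concrete):

    CREATE FUNCTION hidefunc(text, text, int) RETURNS text
        IMMUTABLE LANGUAGE sql
        AS $$ SELECT md5($1 || $2 || $3::text) $$;

    CREATE INDEX gunk ON tbl1 (hidefunc(a, b, c));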

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to lose transaction history (xmin values, WAL, etc.)?

2010-05-21 Thread Peter Hunsberger
On Fri, May 21, 2010 at 6:40 AM, Sam Mason s...@samason.me.uk wrote:
 On Thu, May 20, 2010 at 09:33:23PM -0500, Peter Hunsberger wrote:
 On Thu, May 20, 2010 at 8:03 PM, Richard Walker rich...@softimp.com.au 
 wrote:
  If the hacker gets root access so they can read
  the raw database files, they most likely also
  have access to the means to decrypt any
  encrypted data. This is particularly so if
  the database updates are being done by stored
  procedures.

 Only if they also get at the keys.

 It's very difficult (with a conventional OS) to remove the *entirety* of
 the server hardware and software from the TCB.  Hence you should assume
 that if PG ever sees a key it's also possible for an attacker to see the
 same key.

Fair enough.  The scenario the OP was describing seemed to be a concern
with browsing the database and not at the level of inspecting running
code, but who knows; he hasn't said what he is really trying to
achieve.  There are some hardware-level workarounds for parts of this,
but if you're running everything on the same server I doubt that you're
also looking at that kind of scenario.

 The options are pretty much do all crypto away from the database server
 (either client side, or on another server that you can trust) or you
 have to trust (also in the technical sense) the database server itself
 and things become greatly simplified.

Absolutely the best idea; the OP seems to have ruled that out, however.
As such, you can only make things reasonably safe from direct
inspection...


  If encryption/decryption happens
  on a separate (not-also-compromised) client,
  then OK.  Do you know of a way to deal with
  this if the application is on the same computer
  as the database?

 Can you use an external key store?

 How does this help? if the database has been compromised, what would
 stop the attacker from inserting some code that records the responses
 from this external key store?


Again, I had the impression that code injection did not seem to be the
issue here.  The issue seemed to be inspection of the data in the
database after the fact.  Like I said, it would be good to know what
the real requirements are...

However, that aside, to answer your question: among other things, the
key request includes a timestamped hash of internal memory to ensure a
non-compromised server, and the key store returns functions with
embedded one-time keys to do the actual work.  The keys are composite,
with portions that must match a compile-time stored hash (you can't
have a compromised server requesting the key), and at run time the same
hash must be yielded (impossible to know without a previously
compromised server), or you will get garbage.  Replay attacks won't
work since the server will check the timestamp on the original request
(which we already know can't be compromised) before yielding up the
decryption function.  Much of the key exchange process is essentially
standard Kerberos, with the client and the server authenticating
themselves to each other as usual, but you do need some extensions to
manage the extra integrity checks and to create and manage the
additional message contents.

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to lose transaction history (xmin values, WAL, etc.)?

2010-05-20 Thread Peter Hunsberger
On Thu, May 20, 2010 at 1:19 AM, Richard Walker rich...@softimp.com.au wrote:

 I have an application in which I want it
 _not_ to be possible to work out which
 rows of a table are being/have been updated.


Why?

If you really need security of some form at the physical database
level then don't screw around with convoluted hacks.  Encrypt the
critical data in the database and be done with it.

--
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to lose transaction history (xmin values, WAL, etc.)?

2010-05-20 Thread Peter Hunsberger
On Thu, May 20, 2010 at 8:03 PM, Richard Walker rich...@softimp.com.au wrote:
 Peter Hunsberger wrote:

 If you really need security of some form at the physical database
 level then don't screw around with convoluted hacks.  Encrypt the
 critical data in the database and be done with it.

 If the hacker gets root access so they can read
 the raw database files, they most likely also
 have access to the means to decrypt any
 encrypted data.  This is particularly so if
 the database updates are being done by stored
 procedures.

Only if they also get at the keys.


 If encryption/decryption happens
 on a separate (not-also-compromised) client,
 then OK.  Do you know of a way to deal with
 this if the application is on the same computer
 as the database?

Can you use an external key store?  If not, I can't see this as being
a serious attempt at security, but playing along, you could try
something like the following:

1) Symmetrically encrypt a randomly generated string with something
based on the user's credentials (user name and password);

2) If the user can authenticate (there are many ways of checking this),
then you decrypt the string from 1) and it becomes the basis for the
encryption and decryption of the user's data.

You can have the same string encrypted by multiple users as needed for
shared access and you can have a single user manage multiple strings
as needed.
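
A rough sketch of the idea using the pgcrypto contrib module (the table
and passphrase are made up):

    CREATE TABLE user_keys (user_name text PRIMARY KEY, wrapped_key bytea);

    -- wrap a random data key with a passphrase derived from the
    -- user's credentials
    INSERT INTO user_keys (user_name, wrapped_key)
    VALUES ('alice',
            pgp_sym_encrypt(encode(gen_random_bytes(32), 'hex'),
                            'alice passphrase'));

    -- after the user authenticates, unwrap the key and use it to
    -- encrypt or decrypt that user's data
    SELECT pgp_sym_encrypt('sensitive value',
                           pgp_sym_decrypt(wrapped_key, 'alice passphrase'))
      FROM user_keys
     WHERE user_name = 'alice';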

If I recall correctly, there's a version of DB2 with this already
baked into the product.  No idea where it sits on the free-to-expensive
scale...

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] How to lose transaction history (xmin values, WAL, etc.)?

2010-05-20 Thread Peter Hunsberger
On Thu, May 20, 2010 at 11:29 PM, Richard Walker rich...@softimp.com.au wrote:

 If I understand your scheme, this works in a scenario
 where the row being updated is in some sense owned by a
 particular user, who must provide some other data (a password)
 not otherwise stored in the database in order
 to decrypt it and then encrypt it again after the update.

Well, a pool of users, but essentially yes.

 That's not the case in my scenario.  The row being
 updated is not specifically owned by an individual
 application user.

So who does own it?

 But you've given me food for thought, by translating
 the original problem into an encryption problem.

 There is _other_ data in the database which
 is sensitive and could usefully be encrypted.  But
 the data in this particular problem is not
 sensitive per se, I just don't want it to be
 possible to recreate a history of updates.


Like I said originally: why?  What's the real point of all this?

If it's anonymous records of some kind, you could build FK strings that
are essentially variations on the scheme I've given you, but I'm not
sure I'd want to deal with the performance implications...

However, I will point out that even if you can't read the data you may
be able to tell who created a given row, but so what?  All the
variations on your scenario that I can think of at the moment seem to
boil down to a security issue of some form...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Sorting with materialized paths

2010-05-10 Thread Peter Hunsberger
On Sun, May 9, 2010 at 8:33 AM, Ovid curtis_ovid_...@yahoo.com wrote:
 My apologies. This isn't PG-specific, but since this is running on PostgreSQL 
 8.4, maybe there are specific features which might help.

 I have a tree structure in a table and it uses materialized paths to allow me 
 to find children quickly. However, I also need to sort the results 
 depth-first, as one would expect with threaded forum replies.

  id | parent_id | matpath |          created
 ----+-----------+---------+----------------------------
  2 |         1 | 1       | 2010-05-08 15:18:37.987544
  3 |         1 | 1       | 2010-05-08 17:38:14.125377
  4 |         1 | 1       | 2010-05-08 17:38:57.26743
  5 |         1 | 1       | 2010-05-08 17:43:28.211708
  7 |         1 | 1       | 2010-05-08 18:18:11.849735
  6 |         2 | 1.2     | 2010-05-08 17:50:43.288759
  9 |         5 | 1.5     | 2010-05-09 14:02:43.818646
  8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695

 So the final results should actually be sorted like this:

  id | parent_id | matpath |          created
 ----+-----------+---------+----------------------------
  2 |         1 | 1       | 2010-05-08 15:18:37.987544
  6 |         2 | 1.2     | 2010-05-08 17:50:43.288759
  8 |         6 | 1.2.6   | 2010-05-09 14:01:17.632695
  3 |         1 | 1       | 2010-05-08 17:38:14.125377
  4 |         1 | 1       | 2010-05-08 17:38:57.26743
  5 |         1 | 1       | 2010-05-08 17:43:28.211708
  9 |         5 | 1.5     | 2010-05-09 14:02:43.818646
  7 |         1 | 1       | 2010-05-08 18:18:11.849735

 Rationale:  this is for a threaded forum and id 6 is a reply to id 2, so it 
 needs to show up after that one.  Here's the rough structure of what the 
 output would look like (imagine an HTML forum):

 * id 1 (root post)
    * id 2
        * id 6
            * id 8
    * id 3
    * id 4
    * id 5
        * id 9
    * id 7

 How would I work that out? Can I do that in straight SQL or should additional 
 information be added to this table?


This is (once more) a flat query if you use a set / subset tree
implementation.  Joe Celko's book "Trees and Hierarchies in SQL for
Smarties" might be the fastest way to get up to speed on this, but you
can also figure it out if you spend a bit of time with Google.
Basically, every node in the tree is a table row with two columns, say
left and right.  All children are contained within the left and right
of the parent.  Pre-order tree traversal gives the algorithm for
assigning left and right.  Once done, your problem is solved by
ordering on left.
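
A rough sketch against your table (calling it "posts" here, and
assuming hypothetical lft and rgt columns populated by that traversal,
since left/right themselves are awkward identifiers):

    -- depth-first (threaded) order for the whole tree:
    SELECT id, parent_id, matpath, created
      FROM posts
     ORDER BY lft;

    -- a single subtree is a simple containment test, e.g. everything
    -- under post 2:
    SELECT c.id
      FROM posts p
      JOIN posts c ON c.lft BETWEEN p.lft AND p.rgt
     WHERE p.id = 2
     ORDER BY c.lft;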



-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] alter table alter type CASCADE

2010-05-05 Thread Peter Hunsberger
On Wed, May 5, 2010 at 7:31 AM, Thom Brown thombr...@gmail.com wrote:
 On 5 May 2010 13:14, Chris Barnes compuguruchrisbar...@hotmail.com wrote:

 It has been some years since I worked with Oracle, doesn't Oracle
 recompile the view when the object it references changes in structure?
 Send

 What does Oracle do when you've got a view like:

 CREATE OR REPLACE VIEW test_view AS
  SELECT test.test_id, test.test_value, date_is_future(test.test_date) as
 upcoming
    FROM test;

 Where date_is_future expects a text field, but you change the test table so
 that test_date is now varchar(20)?  The function no longer matches the
 signature.  Does it just prevent it?  And what about when conditions are
 provided in a WHERE clause which become invalid when the column type
 changes?


If Oracle can find a function that will make the view valid it will
continue to allow usage of the view; though it is still marked as
needing to be rebuilt from a user perspective any calls to it will
succeed if there is a possible way for the view to still be valid.  If
there is no possible way for any use of the view to succeed then the
calls fail.


-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Indexing queries with bit masks

2010-04-30 Thread Peter Hunsberger
On Fri, Apr 30, 2010 at 10:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Mike Christensen m...@kitchenpc.com writes:
 When a certain event happens, let's say event 4, I need to query for which
 users to notify.  So I'll be doing something like:

 SELECT UserId FROM Users WHERE Subscriptions & 8;

 My question is say there's a million rows in the Users table.  If I have an
 index on Subscriptions, will this index be used in the above query?

 No.  At least not with a standard btree index.

 I'm not exactly sure that an index would be helpful at all --- it seems
 like the selectivity of this condition won't be very good anyway, will
 it?  The more popular notifications will be subscribed to by a large
 fraction of the user base.  Maybe it'd be useful to index unpopular
 notifications, but how often will you be searching for those?


We've got some similar columns (though nothing with any major number
of rows), so this is interesting...

If all subscriptions are roughly equal in popularity then any single
select should give ~10% of the data.  That would seem to be selective
enough that you'd really want an index?  If so, any answers to the
OP's main question: what would be the most efficient way to handle
this type of thing?

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Indexing queries with bit masks

2010-04-30 Thread Peter Hunsberger
On Fri, Apr 30, 2010 at 11:29 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Hunsberger peter.hunsber...@gmail.com writes:
 If all subscriptions are roughly equal in popularity then any single
 select should give ~ 10% of the data.  That would seem to be selective
 enough that you'd really want an index?

 My personal rule of thumb is that 10% is around the threshold where
 indexes stop being very helpful.  At that selectivity, you're going
 to be having to read every page of the table anyway, and it's not
 clear that the extra I/O to read the index is going to get repaid in
 CPU savings.  (Now if the table+index are fully cached in RAM, the
 threshold's probably a bit higher, but there still is not reason to
 think that an index is going to make for a huge improvement.)

 If so, any answers to the OP's main question; what would be the most
 efficient way to handle this type of thing?

Ok, that makes sense, which immediately makes me wonder if partitions
might make sense for this use case?  In particular if there really are
only 10 different types?

[...]

 The best idea that comes to mind offhand is to not use an integer, but a
 boolean array, such that the queries look like

        select ... where subscriptions[4];


Interesting idea.  That might be worth testing for some of my use cases

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Ltree - how to sort nodes on parent node

2010-04-20 Thread Peter Hunsberger
On Tue, Apr 20, 2010 at 1:58 PM, Alban Hertroys
dal...@solfertje.student.utwente.nl wrote:
 On 20 Apr 2010, at 11:59, cojack wrote:


 I am not interested about recursive queries, i think this kill ltree idea.


 And IMHO it should. ltree is from a time when we didn't have any other means 
 to describe data organised as a tree in Postgres. Navigating a tree is 
 inherently recursive, so recursion is most likely the proper way to go about 
 it.

 A solution omitting recursion (like ltree) can be faster, but you will run 
 into limitations like the one you're currently struggling with.

 A solution with recursive queries will probably be more flexible and allows 
 for referential integrity without having to write your own triggers and stuff 
 - for example, what happens if you decide that Archeology isn't a Science but 
 a Colour? What makes sure it's child-nodes get moved into Colors as well?


I've only been peripherally following this thread, so the following
may be overkill for the requirements, but the non-recursive / flat
query solution is usually the set / subset pattern.  It's been
popularized by Joe Celko, who has gone as far as writing a book on the
topic, "Trees and Hierarchies in SQL for Smarties".  If you don't have
many requirements for reordering the tree, this solution works well.
It can be more of a pain if you need a GUI for tree management (but it
can be done).  We use this type of solution to manage trees up to about
100,000 nodes in size with good performance.  Other non-recursive
solutions include Vadim Tropashko's (now with Oracle) Nested Interval
Tree Encoding methods, which map directly to the dotted-path (1.1.3)
type tree notations in the examples in this thread and are a variation
on the set / subset models.

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Null vs. Empty String in Postgres 8.3.8

2010-04-04 Thread Peter Hunsberger
On Sun, Apr 4, 2010 at 10:23 PM, Scott Marlowe scott.marl...@gmail.com wrote:
 On Sun, Apr 4, 2010 at 9:03 PM, Wang, Mary Y mary.y.w...@boeing.com wrote:
 I still don't get it.  I do want a zero for the subversion_flags to be 
 stored in the table.  But it returned an error because it didn't like 
 subversion_flags='' in the UPDATE SQL statement.

 subversion_flags | integer       | not null default 0

 Right. '' is not 0.  the old version of pgsql converted '' to 0 for
 you, incorrectly.  Now if you want 0 you need to say 0.


Or, since you have the default, set it to DEFAULT or just leave the
column out of the statement (which may be what you thought you were
doing?)

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] The REAL cost of joins

2010-03-03 Thread Peter Hunsberger
On Wed, Mar 3, 2010 at 3:59 PM, Marcin Krol mrk...@gmail.com wrote:

 Hello everyone,

 I have inadvertently set off a furball on an unrelated ng on what is the 
 actual cost of SQL joins. But there's no reliable conclusion. I would like to 
 actually know that, that is, are JOINs truly expensive?

 As they say, one measurement is worth a thousand opinions, so I've done 
 measurement on my PG app:

 $ time echo \c hrs;
 SELECT hosts.ip, reservation.start_date, architecture.architecture, 
 os_kind.os_kind, os_rel.os_rel, os_version.os_version, project.project, 
 email.email FROM hosts
  INNER JOIN project ON project.id = hosts.project_id
  INNER JOIN architecture ON hosts.architecture_id = architecture.id
  INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
  INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
  INNER JOIN os_version ON hosts.os_version_id = os_version.id
  INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
  INNER JOIN reservation on reservation.id = reservation_hosts.reservation_id
  INNER JOIN email ON reservation.email_id = email.id

 ; | psql > /dev/null

 real    0m0.099s
 user    0m0.015s
 sys     0m0.005s



 $ time echo \c hrs;
  SELECT hosts.ip FROM hosts;
  SELECT reservation.start_date FROM reservation;
  SELECT architecture.architecture FROM architecture;
  SELECT os_rel.os_rel FROM os_rel;
  SELECT os_version.os_version FROM os_version;
  SELECT project.project FROM project;
  SELECT email.email FROM email;
   | psql > /dev/null

 real    0m0.046s
 user    0m0.008s
 sys     0m0.004s

 Note: I've created indexes on those tables, both on data columns like 
 hosts.ip and on .id columns.

 What do you think of this? And in general: when (if?) should one denormalize 
 data?

I don't get what these two examples are supposed to show?  They do
not, in general, create equivalent output data. If you had to write
procedural logic to do the same correlation between the individual
result sets as the joins are performing then you'd be back to at least
the same performance and probably worse, so what exactly is being
compared here?

As to your second question: well, beyond not doing premature
optimization, I'll observe that for operational databases
denormalization rarely gives me enough savings to justify the resultant
extra data management required to maintain it.  If you're talking about
a data warehouse (and given the way you've worded the question I
suspect you are not?) then that's a different question; in that case I
prefer to generate the data from the operational side of the house, so
you're no longer paying an operational cost to maintain the
denormalized data (you've got a whole different set of costs instead).

--
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Array columns vs normalized table

2010-03-02 Thread Peter Hunsberger
On Tue, Mar 2, 2010 at 11:21 AM, Lee Hachadoorian
lee.hachadoor...@gmail.com wrote:
 I work with state labor data which is reported to us in the form

        industry, year, quarter1, quarter2, quarter3, quarter4

 where each quarter represents an employment count. Obviously, this can
 be normalized to

        industry, year, quarter, employment

 Can anyone comment on, or point to me to an article or discussion
 regarding, why one would use an array column instead of normalizing
 the data? That is, would there be any benefit to storing it as

        industry int, year smallint, employment int[ ]

 where the last column would be a four element array with data for the
 four quarters.

I think you might want to step back and ask yourself why you'd want to
normalize this data at all.  Unless you are trying to operate on all 4
columns at once (comparisons or aggregation, etc.) I see no reason to.
It will take more space and give you no new capabilities.  If you do
need to operate on all 4 quarters simultaneously then you may very well
want to normalize, but in that case there's certainly no advantage in
going to an array type.  Personally, the only reason I'd see for ever
using an array type is when you have many very closely related values
that would cause some huge number of rows if the data were stored
normalized...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] more than 2GB data string save

2010-02-10 Thread Peter Hunsberger
On Wed, Feb 10, 2010 at 1:21 AM, Scott Marlowe scott.marl...@gmail.com wrote:

 On Wed, Feb 10, 2010 at 12:11 AM, Steve Atkins st...@blighty.com wrote:
  A database isn't really the right way to do full text search for single 
  files that big. Even if they'd fit in the database it's way bigger than the 
  underlying index types tsquery uses are designed for.
 
  Are you sure that the documents are that big? A single document of that 
  size would be 400 times the size of the bible. That's a ridiculously large 
  amount of text, most of a small library.
 
  If the answer is yes, it's really that big and it's really text then look 
  at clucene or, better, hiring a specialist.

 I'm betting it's something like gene sequences or geological samples,
 or something other than straight text.  But even those bear breaking
 down into some kind of simple normalization scheme don't they?


A single genome is ~1.3GB as chars, or half that size if you use 4 bits
/ nucleotide (which should work for at least 90% of the use cases).
The simplest design is to store a single reference and then, for
everything else, store deltas from it.  On average that should require
about 3-5% of your reference sequence per comparative sample (not
counting FKs and indexes).
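
Just to illustrate the delta idea (the schema is hypothetical and
grossly simplified):

    CREATE TABLE reference_base (
        chromosome text,
        pos        bigint,
        base       char(1),
        PRIMARY KEY (chromosome, pos)
    );

    CREATE TABLE sample_delta (
        sample_id  int,
        chromosome text,
        pos        bigint,
        base       char(1),   -- the sample's base wherever it differs
        PRIMARY KEY (sample_id, chromosome, pos)
    );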

As I mentioned on the list a couple of months ago we are in the middle
of stuffing a bunch of molecular data (including entire genomes) into
Postgres.   If anyone else is doing this I would welcome the
opportunity to discuss the issues off list...

--
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Totally fresh uninstall/reinstall of Postgres on MacBook Pro running Snow Leopard?

2009-11-18 Thread Peter Hunsberger
On Wed, Nov 18, 2009 at 10:37 AM, doug livesey biot...@gmail.com wrote:
 Hi -- I'm having a world of trouble getting Postgres to work with a number
 of ruby libraries, and was thinking that, to be sure that I'm eliminating
 all that could be wrong, I need to ensure that my Postgres install is fresh,
 correct, and as canonical as possible.
 Could someone advise me on the best way to ensure that I have first
 *totally* uninstalled Postgres from my machine, then secondly the best way
 to install it so that everything is most likely to be set up the way a
 third-party developer might expect (ie, what is the most vanilla Postgres
 install)?

Not sure what you're trying to do, but on a Windows machine I've never
been able to get the Ruby Gems to recognize / find / work with a
Postgres install that was not integrated with them from the get go.
I've always had to use the integrated Postgres / Ruby packages to have
Ruby work...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Experience with many schemas vs many databases

2009-11-16 Thread Peter Hunsberger
On Sun, Nov 15, 2009 at 3:45 PM, undisclosed user
lovetodrinkpe...@gmail.com wrote:
 Backing up data by user is required for my solution. A lot of times, users
 screw up and they want to rollback to a previous state.

In that case, being able to revert the state of an application should
be part of the application (and database) design rather than relying on
database backups: history tables, etc...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [pgeu-general] pgday.eu

2009-11-12 Thread Peter Hunsberger
On Thu, Nov 12, 2009 at 4:26 PM, Hans-Juergen Schoenig h...@cybertec.at wrote:

 I had no idea that European beer was uhmm, twice as strong as US
 counterparts. I knew I was in trouble when people (Dave Page) poured
 Scotch into my water. It was time to go before I got into a LOT of
 trouble :P

 Joshua D. Drake



 btw, the normal beer is twice as strong as the US one (around 5.4% here
 usually).
 you can also get bock bier (as we call it in austria), however.
 it is in the area of 12% :).
 THIS is beer *g*. you need just the fingers of one hand to count if you had
 more than enough *g*.

Not that I'd want to discourage anyone sending beer to anyone, but...
Though perhaps well outside the mainstream US beer drinker's normal
experience, there is a good variety of barley wines, Trappist-style
ales, bocks and doppelbocks available in the US.  The whole category of
high-test beer is a fav of mine, and I'd agree you shouldn't need more
than a couple of fingers to know when you've had enough.  Personally,
I'd say it's more like:

select limit from beer where type in ( 'Barley wine', 'Trappist',
'Bock', 'Doppelbock' )

limit
--
  1

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] still on joining array/inline values was and is: design, ref integrity and performance

2009-10-28 Thread Peter Hunsberger
On Wed, Oct 28, 2009 at 4:50 AM, Ivan Sergio Borgonovo
m...@webthatworks.it wrote:

 To have the 3rd constraint I'd have a table:
 create table pw_res(
  password varchar(16) primary key,
  res int references resources (res) on delete cascade
 );

 This comes handy for 2 reasons:
 - it helps me to enforce the third constraint
 - it makes it easier to find which resource is associated with a
  password that will be a common search

 But this introduces one more problem if I decide to delete a
 password. This could happen for shared and non shared passwords.
 I don't think it is something that may happen frequently... but it
 may happen... and once you start to introduce mess in your data it
 is hard to come back.

Not sure why this is a mess?  Cascaded deletes can clean up the data.
If you need it for some business reason add an active flag.


 So a fk on pw.password may help... but... once I've that fk I can't
 insert res,pw,email without pw baing in pw_res.

 If I do the other way around inserting password(s) in pw_res I may
 end up with passwords with no destination associated.

 Surely I could wrap everything in a transaction so that if I can't
 succede to insert email/pw records everything will be rolled back.

 But I can't see how to make it cleaner.

At some point you need business logic to enforce your business
requirements.  There is little point in trying to enforce business
constraints in the database.  Stick to pure low level data integrity
for the database and keep the business constraints at some higher
level.


 I can get emails and associate them with a resource and a password
 in one sql statement if I can defer the constraint check on password.
 The next statement would be used to fill pw_res with distinct values
 from pw.

 If I do the other way around in case I need different passwords for
 different emails I'll have to:
 - count the emails and fill pw_res with as many password as needed.
  I think I'll need generate_series.
 - find a way to associate these passwords with emails

 I don't know how to do the later.
 Still supposing I knew how to write the sql statement that will take
 values('email1', 'email2'...)
 as suggested by Merlin Moncure in a different thread I can see some
 pros and cons of these approaches.

 The first approach requires a distinct/group by that may be
 expensive.
 The second one requires I keep in memory all the emails while the
 first statement run.

Unless you're dealing with 100,000's of these things I think you're
engaging in a process of premature optimization.  Group by can work
efficiently over millions of rows.


 I'd think that if memory consumption start to be an issue, that
 means that the number of emails start to be too large to be loaded
 from an HTTP request, so I could have it on file and load it in a
 temp table and so on...

 Still before I make any test I need to find how to associate emails
 with passwords considering that I'd like to pass email inline the
 SQL statement and I'll have the passwords in pw_res.


Do the simplest thing possible.  Get it working, then see if you have
any new problems you need to solve.  Every issue you've described so
far is database design 101 and should present no real problem.  I
think you're agonizing over nothing...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Research and EAV models

2009-10-23 Thread Peter Hunsberger
On Fri, Oct 23, 2009 at 5:04 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Leif B. Kristensen l...@solumslekt.org writes:
 It seems like there are two camps considering EAV models. On the one
 hand, there are researchers who think that EAV is a great way to meet
 their objectives. On the other hand, there are the business guys who
 thnk that EAV is crap.

 Well, no, it's not that EAV is crap.  It's that EAV is strong evidence
 that you're using the wrong tool for the job.  If a SQL database is
 actually a good fit for your application, then it should be possible to
 extract a stronger schema for your data.  If you cannot, then you
 probably should be storing your data in something else.  Otherwise
 you'll spend untold man-hours beating your head against assorted walls
 while you try to persuade the SQL database to do things it was never
 meant for, and coping with performance issues because the cases you need
 are not optimized nor indeed optimizable.  (I can just see that other
 guy trying to search on one of his polymorphic columns :-(.)

I can certainly see where Tom is coming from on this and I tend to
agree in general.  However, the reality of the research world is never
that clear cut. In particular, what you find is that you have some
split where some percentage of the domain is well understood and can
be placed in a conventional schema and some portion is not so well
understood and requires something more flexible.  You could try and
split the data between two databases but that itself is very
problematic.

My take on this, for the research world, is to not go pure EAV, but
rather normalize by some more generic concepts within the domain.  Eg.
measurement, or evaluation, etc. You might ultimately end up with
a sort of EAV model, but the V portion is strongly typed within the
database and you're not trying to cast a string into 20 conventional
data types. This still requires rigorous metadata management on the EA
side of the EAV model, but you can tackle that in many ways.

 SQL isn't the be-all and end-all of data storage.  It does relational
 stuff well, and other stuff poorly.

You can build variations on EAV that are closer to a regular
relational schema.  These don't necessarily work well or poorly, but
often, at least in the research world, the middle ground is good
enough.  You are, after all, talking about people who spit out MySQL
databases at the drop of a hat...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Research and EAV models

2009-10-23 Thread Peter Hunsberger
On Fri, Oct 23, 2009 at 5:30 PM, Greg Smith gsm...@gregsmith.com wrote:

 Your thinking this is a business/research distinction isn't right, it comes
 down to the size of the data set and how fast/expressive the queries against
 them are.  In the research realm, you can watch this struggle play in things
 like bioinformatics, where there's giant stacks of very similar data they
 need to sort through.  What often ends up happening is you have scientists
 start with a simple EAV system, then watch it completely fall down under
 real-world load and complicated queries once it gets past prototype.  Then
 they try to patch it for a while by reinventing SQL query and storage
 concepts (badly).

I've got little idea what the OP is really dealing with, but there is
a fundamental difference between the research world and the more
traditional business world.  On the research side the data models are
not well known in advance, there are few established best practices,
and even data types are not always well known.  In a conventional
schema we would have about 2,000+ tables with 20,000+ columns.  This is
a 6-year-old system, with 75% of our metadata having been added over
the last 3 years.  I expect growth to slow over the next 3 years, but
25% growth would not be unreasonable.  The current physical schema is
some 40 tables with maybe some 300 columns (we also render about 6000
customized presentations of the data and manage workflow and validation
with this schema).  Performance tuning can be a pain, but we've mostly
got that mastered at this stage in the game: it's a trade-off, but
using a conventional schema would have consumed far more resources than
we've spent over the last 6 years.  The first two versions of the
system were conventional schema, and new research protocols were taking
6 months to a year to add.  We can now add simple ones in a day or two
and complex ones in a couple of months (the largest having some 60,000+
pieces of metadata, including validation rules and workflow).


 Eventually someone with traditional database background comes along, notes
 that the data you really need is for the most part predetermined, rewrites
 that into something more amenable to standard SQL, and then the whole thing
 performs better for that subset.  But now you've got stuff like ALTER TABLE
 to add a column every time you want to track something new, and people don't
 like that when the schema is still fluid.  So they start putting stuff into
 EAV stores for their next project...and the cycle begins anew.

 Ultimately you can be really flexible in how your store your data, or you
 can get good query performance, but it's quite hard to do both.

Not much argument there!  However, it's a three-way trade-off: such a
schema can get good performance for a small number of users and/or a
small amount of data (we're at about 300 users and 45 GB of total data).

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Free Tool to design Postgres Databases

2009-10-19 Thread Peter Hunsberger
On Mon, Oct 19, 2009 at 2:18 PM, Andre Lopes lopes80an...@gmail.com wrote:
 Hi,

 I'am searching for a free tool to design a Postgre Database. There is
 something similar to MySQL Workbench?

Search the archives; this came up within the last couple of months.  I
currently use Power Architect; it's a beta product and still fairly
buggy, but it works.

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Partitioned tables as a poor man's columnar index?

2009-10-16 Thread Peter Hunsberger
I just realized that my replies to my previous question on sparse
arrays went off-list due to the way this list server is set up
(sigh).  It has occurred to me that for my problem, one possible
solution is columnar indexes and that, in a way, partitioned tables in
Postgres might give me somewhat the same capability.

The basic problem I have is that I have some tables that are
potentially very long (100,000's to millions of rows) and very skinny,
essentially just a FK to a parent table and about 12 bits, maybe less,
of actual data.  Now if I use a traditional table, an int FK is as
large as or even larger than the data itself, which is why I had been
looking at a sparse array (in which case I don't even need 12 bits for
the data, since part of the information is positional in nature).
However, building the routines to manipulate the arrays could be
painful and it's been suggested that their performance will be
non-optimal.

One alternative might be to use partitioned tables.  If I partition
the tables in such a way that the table name can be associated
directly with the FK, then I no longer need to actually store the FK
in the table,
and I end up with maybe a total of 12 bits of data in each row.
Normally, I'd shy away from such a scheme since it strikes me as
potentially problematic:

1) you've got to join through DB metadata back to the actual data in question;

2) some process has to have create table auths if any of this is to be automated

My bias against 1) might just be because I come from an Oracle
background and it seems that in Postgres this is not as cumbersome or
as frowned on as it might be elsewhere?  The second issue seems
problematic no matter how I look at it.  However, I might be willing
to live with it for this particular problem, particularly if I can talk
myself into believing that I'm building a proxy for columnar indexes
:-)
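To make that concrete, a rough sketch of the scheme I have in mind
(names made up; per point 2 above, some process would have to generate
the per-parent tables):

create table skinny (
    bits  smallint not null    -- the ~12 bits of actual data per row
);

-- one child table per parent row; the parent FK is implied by the
-- child table's name rather than stored in the rows themselves
create table skinny_p42 () inherits (skinny);
create table skinny_p43 () inherits (skinny);

-- fetching one parent's data goes straight to the right table by name:
select bits from skinny_p42;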

So I guess two sets of questions:

1) In general, how comfortable are members of the Postgres community
with using table names for partitioned tables where the table name
itself conveys some indirect join relationship?  It seems there is no
extra performance penalty in joining back to the system tables to do a
lookup with Postgres, so this is perhaps more of a best-practices
question (or even a religious issue for relational purists) than
anything else...

2) If you're comfortable with the idea, how far would you go in
embracing it?  In particular, would you add code to Postgres to hide
the fact that you are joining via table name? Would you go as far as
to do it at the system level or would you stick to just wrapping it in
some functions (in which case the create table privilege is still
needed)?

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Partitioned tables as a poor man's columnar index?

2009-10-16 Thread Peter Hunsberger
On Fri, Oct 16, 2009 at 3:31 PM, marcin mank marcin.m...@gmail.com wrote:
 On Fri, Oct 16, 2009 at 9:19 PM, Peter Hunsberger
 peter.hunsber...@gmail.com wrote:

 The basic problem I have is that I have some tables that are
 potentially very long (100,000's to millions of rows) and very skinny,

 and I end up with maybe a total of 12 bits of data in each row.

 Are You aware that there are some 20-ish bytes of metadata for each
 row? saving 4 bytes buys You nothing. Give it up.

No, the metadata is a whole 'nother problem.  I'm just talking about
FK relationships here.  This isn't an isolated issue within this
particular domain.  If you were to use a conventional table design,
then once the rest of the associated tables get built along with their
associated indexes, you'd be looking at on the order of a terabyte for
this half of the DB...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Creating sparse arrays

2009-10-15 Thread Peter Hunsberger
Is there any easy way to create sparse arrays with Postgres?
Specifically, when I construct, add or aggregate data to an array, I
want to be able to specify the position within the array where the
data is placed and have any intervening positions that have not yet
been populated just marked as nulls, e.g. something like

insert into foo (bar[3],[7]) values ( 'a', 'b')

would build an array

bar = { null, null, 'a', null, null, null, 'b' }

or some such thing.  I suspect I'm going to have to write a function
to just find the length and append nulls until I reach the desired
position?  Given that some of the arrays I will be dealing with could
potentially be 1000s of elements long that seems a bit perverse.
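For illustration, a minimal sketch of the sort of padding function I
mean (the function name is made up; using text[] just as an example):

create or replace function array_set(arr text[], pos int, val text)
returns text[] as $$
begin
  if arr is null then
    arr := '{}'::text[];
  end if;
  -- pad with nulls until the array is long enough to hold position pos
  while coalesce(array_upper(arr, 1), 0) < pos loop
    arr := array_append(arr, null::text);
  end loop;
  arr[pos] := val;
  return arr;
end;
$$ language plpgsql;

-- e.g. select array_set(array_set(null, 3, 'a'), 7, 'b')
-- should yield {NULL,NULL,a,NULL,NULL,NULL,b}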

I'm currently using 8.3 but 8.4 solutions are also welcome.  C code
not considered out of the question if it isn't a lot of work and will
make the rest of the process close to trivial...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Query to find contiguous ranges on a column

2009-10-14 Thread Peter Hunsberger
On Tue, Oct 13, 2009 at 5:12 PM, Tim Landscheidt t...@tim-landscheidt.de 
wrote:
 Peter Hunsberger peter.hunsber...@gmail.com wrote:

 You can either use a PL/pgSQL function (SETOF TEXT just
 for the convenience of the example):

That works well, takes about 20 seconds to do the 6M+ rows


 or a recursive query (which I always find very hard to com-
 prehend):

 | WITH RECURSIVE RecCols (LeftBoundary, Value) AS
 |   (SELECT col, col FROM t WHERE (col - 1) NOT IN (SELECT col FROM t)
 |    UNION ALL SELECT p.LeftBoundary, c.col FROM RecCols AS p, t AS c WHERE 
 c.col = p.Value + 1)
 |   SELECT LeftBoundary, MAX(Value) AS RightBoundary FROM RecCols
 |     GROUP BY LeftBoundary
 |     ORDER BY LeftBoundary;

 Could you run both against your data set and find out which
 one is faster for your six million rows?


Turns out the server is v8.3; looks like I need to get them to
upgrade it so I get recursive queries and windowing :-(.  If this
happens any time soon I'll let you know the results.

Many thanks.


-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Query to find contiguous ranges on a column

2009-10-14 Thread Peter Hunsberger
On Wed, Oct 14, 2009 at 4:50 PM, Tim Landscheidt t...@tim-landscheidt.de 
wrote:
 Peter Hunsberger peter.hunsber...@gmail.com wrote:

 After some tests with a data set of 7983 rows (and 1638 ran-
 ges): Don't! :-) The recursive solution seems to be more
 than double as slow as the iterative. I'll take it to -per-
 formance.


Interesting; I've never liked recursive queries on Oracle, but
performance is usually reasonable...  Thanks for the heads-up...


-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Query to find contiguous ranges on a column

2009-10-13 Thread Peter Hunsberger
Given a column of data resembling the following:

col
2
3
4
5
11
12
13
14
15
16
17
18
19
23
32
33
34
35
36
37

I need a query to find the contiguous ranges within this column, in
this case returning the result set:

start, end
2, 5
11, 19
23, 23
32, 37

I have one solution that joins the table against itself and does
(among other things) a subselect looking for not exists col + 1 and
not exists col - 1 on the two instances of the table to find the start
and end.  This, as you might guess, is not very efficient (my actual
data is some 6 million+ rows) and I'm guessing there has to be
something more efficient with windowing or possibly grouping on min
and max (though I can't see how to make sure they are part of a
contiguous set).  Anyone have any ideas?
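For reference, the rough shape of windowing solution I'm imagining
(would need 8.4; assuming the column above lives in a table t) is
something like:

select min(col) as range_start, max(col) as range_end
  from (select col,
               col - row_number() over (order by col) as grp
          from t) s
 group by grp
 order by range_start;

-- consecutive values share the same (col - row_number()) difference,
-- so grouping on it collapses each contiguous run into one row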

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] table full scan or index full scan?

2009-10-11 Thread Peter Hunsberger
2009/10/11 Scott Marlowe scott.marl...@gmail.com:
 The postgresql database uses the table full scan.but in oracle, the similar 
 SQL uses the index full scanning,speed quickly many than postgresql.

 Yep, PostgreSQL isn't Oracle.  It's a trade off.  In pgsql indexes
 don't contain visibility info, so all index lookups have to eventually
 hit the table itself.  So you either do indexlookup - table lookup,
 repeat as many times as you have index lookups or you just hit the
 table since you gotta go there anyway.

 On the  bright side, this makes updates faster since you don't have to
 lock both table and index and write to both at the same time anymore.

 postgresql's optimizer whether to have the necessity to make the adjustment?

 Sorry, it's an architectural difference.  Are you testing in a
 realistic scenario including both reads and writes to the database to
 see if postgresql is faster overall and identify problem areas that
 pop up there?


This is interesting; I just ran into a similar issue the other day.
Clearly there is a wide range of read/write scenarios that Postgres
should be able to cover.  These days, I have a lot of designs leaning
more toward the data warehouse side of the operational spectrum as
opposed to the high-transaction scenario, and I specifically design DB
management strategies with the knowledge that writes will happen far
less often than reads in our applications.  Is this an area where
optimizations are considered hard in Postgres or, hopefully, just
something that is on the to-do list that no one has gotten around
to yet?  Similarly, are accurate table summary stats possible someday
or are they considered close to impossible in order to eliminate race
conditions and lock contention scenarios?

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] generic modelling of data models; enforcing constraints dynamically...

2009-09-27 Thread Peter Hunsberger
On Sun, Sep 27, 2009 at 2:22 PM, David Fetter da...@fetter.org wrote:
 On Sun, Sep 27, 2009 at 08:26:27PM +0200, InterRob wrote:
 Dear David, dear all,
 I very well understand what you are saying...

 Clearly you do not.  What you are proposing has been tried many, many
 times before, and universally fails.

I've been refraining from jumping on this due to time constraints, but
this statement is silly.  We have a system that does almost exactly
what the OP wants, although the implementation is slightly different:
we use an EAV-like model with strong typing and build set/subset
forests to maintain arbitrary hierarchies of relationships.  Our
reasons for doing this are similar to the OP's; it's for research (in
our case medical research).  We maintain over 200,000 pieces of
end-user-generated metadata, describing what would be, in a
conventional relational model, over 20,000 columns and some 1,000s of
tables, but the actual physical model is some 40 tables.  Yes, the
flip side is that such a system won't support 1,000,000s of
transactions per day, but that's not why you build them.


 That your people are failing to get together and agree to a data model
 is not a reason for you to prop up their failure with a technological
 fix that you know from the outset can't be made to work.


Spoken like someone who has always had the luxury of working in areas
with well-defined problem domains...  I can't tell you the number of
people that told us exactly the same thing when we started on it.
That was 8 years ago.  Not only can such systems be built, they can be
made to scale reasonably well.  You do need to understand what you are
doing and why: the costs can be high, but when it comes to research,
the benefits can far outweigh the costs.

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] creation of foreign key without checking prior data?

2009-09-17 Thread Peter Hunsberger
On Thu, Sep 17, 2009 at 11:40 AM, hubert depesz lubaczewski
dep...@depesz.com wrote:

 So, since (as we know) foreign keys are not fault-proof, wouldn't it be
 good to provide a way to create them without all this time-consuming
 check?

No.

If you don't want the behavior of a foreign key then just don't define
a foreign key.  Load the data, clean it up, then create the foreign key.
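Something along these lines (table and file names are made up for
illustration):

-- bulk load with no constraint in place
copy orders from '/tmp/orders.csv' with csv;

-- clean up anything that would violate the constraint
delete from orders
 where not exists (select 1 from customers
                    where customers.id = orders.customer_id);

-- then pay the validation cost once, at a convenient time
alter table orders
  add constraint orders_customer_fk
  foreign key (customer_id) references customers (id);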

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] creation of foreign key without checking prior data?

2009-09-17 Thread Peter Hunsberger
On Thu, Sep 17, 2009 at 12:44 PM, hubert depesz lubaczewski
dep...@depesz.com wrote:
 On Thu, Sep 17, 2009 at 12:31:14PM -0500, Peter Hunsberger wrote:
 On Thu, Sep 17, 2009 at 11:40 AM, hubert depesz lubaczewski
 dep...@depesz.com wrote:
 
  So, since (as we know) foreign keys are not fault-proof, wouldn't it be
  good to provide a way to create them without all this time-consuming
  check?

 No.

 If you don't want the behavior of a foreign key then just don't define
 a foreign key. Load the data, clean it up, then create the foreign key

 I think you are missing the point. Data is clean. It's dump. creation of
 fkey takes time, and i'd like to avoid this delay.

You can't have a foreign key that doesn't have relational integrity;
it is no longer a foreign key.  If you don't want the delay then don't
define the key, at least until some point at which you can take the
delay.  If there is never such a time then your operational scenario
needs changing, not Postgres...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Open source database design tool , alternative to MicroOLDAP

2009-09-10 Thread Peter Hunsberger
Check out Power Architect.  It's pre-major-release and has some bugs,
but it generally works well.  It can also forward- and reverse-engineer
schemas.

2009/9/10 NTPT n...@centrum.cz:
 Hi all. is there available some freeware and/or opensource visual database
 design tool for postgresql ? Something like commercial microOLAP ?

 thanx for help

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general




-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Viable alternatives to SQL?

2009-08-27 Thread Peter Hunsberger
On Thu, Aug 27, 2009 at 8:43 AM, Kelly Joneskelly.terry.jo...@gmail.com wrote:
 Many sites let you search databases of information, but the search
 queries are very limited.

 I'm creating a site that'll allow arbitrary SQL queries to my data (I
 realize I'll need to handle injection attacks).

 Are there other viable ways to query data? I read a little on
 Business System 12 (BS12), Tutorial D, and even something called
 T-SQL (I think), but they all seem theoretical and not fully
 implemented.

 I want a query language that non-techies can use easily, but also
 supports arbitrarily complex queries. Does such a language exist?


Yeah, it's called SQL.  Seriously, your two requirements are
rather contradictory.  We have many end users who use SQL, learning
just enough to get the job done when we have this kind of problem.

Having said that, is there a reason why you are not looking at
end-user query interfaces such as Hyperion or Cognos?  You might also
look at natural language or controlled vocabulary interfaces; if the
queries are domain specific (e.g. medical) then you can often pick off
some low-hanging fruit fairly quickly.  If the problem domain is
non-specific then I don't think you'll be able to meet these
requirements without throwing a lot of money at the problem (the Cyc
guys come to mind ;-)...

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Find difference between two Text fields

2009-07-24 Thread Peter Hunsberger
Can anyone give me a way to find the difference between two text
fields on a character-by-character basis?  Essentially, I'd like to
logically AND the two together and, for any position that has a
non-zero result, show whatever character is in that position in the
second string.  The solution can be Postgres-specific, but something
approaching ANSI SQL would also be helpful (if possible).
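To show the kind of output I'm after, a rough sketch (assuming a
hypothetical table t(id, a text, b text)):

select t.id, g.pos, substr(t.b, g.pos, 1) as char_in_b
  from t,
       generate_series(1, (select max(greatest(length(a), length(b)))
                             from t)) as g(pos)
 where g.pos <= greatest(length(t.a), length(t.b))
   and substr(t.a, g.pos, 1) is distinct from substr(t.b, g.pos, 1)
 order by t.id, g.pos;

-- lists every position where the two columns differ, along with the
-- character from the second string at that position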

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Using Postgres to store genetic data

2009-07-10 Thread Peter Hunsberger
We're looking at potentially using Postgres to store a variety of molecular
and genetic data.  At this point I have a bunch of general questions which I
can take to other lists if someone can tell me where
they would be most appropriate:
1) are there groups or individuals already doing this that have things they
can share?  I've seen the BLASTgres work; it needs rework to compile
against 8.3 but might be potentially useful for some of our forms of data;

2) if we do decided to implement our own data types where's the best place
for discussion about such efforts?

3) Can anyone talk about what it would take to add columnar indexes to
Postgres?  Where would be the best place for discussion about what this
would take?

I can go into more details here if appropriate and if that will help...

-- 
Peter Hunsberger


Re: [GENERAL] Using Postgres to store genetic data

2009-07-10 Thread Peter Hunsberger
On Fri, Jul 10, 2009 at 4:02 PM, Steve Crawford
scrawf...@pinpointresearch.com wrote:

 Peter Hunsberger wrote:

 We're looking at potentially using Postgres to store a variety of molecular 
 and genetic data.  At this point I have a bunch of general questions...

 I don't know enough about your area of expertise to know if this is useful, 
 but I'd look at the Unison project to see what they are doing. There are also 
 videos/slides up from a recent San Francisco PUG meeting at UCSF that might 
 be of interest:

 http://www.vimeo.com/3732938
 http://www.slideshare.net/linuxpoet/unison-ucsf-sfpug


Looks like it may be somewhat applicable to my first question. Thanks.

My area of expertise is application architecture and design, focusing
on clinical systems for medical research these last 7 years or so and
medical systems in general for the last 15.  Adding the genomic and
molecular world to the mix is a new initiative.  At the moment I've
got enough
domain knowledge to be dangerous...

--
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general