[GENERAL] select all rows where any column is NULL

2011-07-26 Thread Nathan Boley
Does anyone have a concise way of doing $SUBJECT?
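
The most concise thing I've come up with so far is the whole-row IS NOT NULL
test (just a sketch, with a made-up table name); is there anything shorter?

   -- a row value is "not null" only when every column is non-null,
   -- so negating it keeps the rows that have at least one NULL column
   select *
   from some_table t
   where not (t is not null);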

Best,
Nathan



Re: [GENERAL] a question about row estimation in postgres

2011-03-20 Thread Nathan Boley
> Following the formula outlined
> in http://www.postgresql.org/docs/8.3/static/row-estimation-examples.html
> Both gradrate 10 and gradrate 11 would fall in the first bucket.
> Shouldn't the row estimation be:
> (11 - 8) / (33 - 8) / 10 * 1302 = 15.624
> and
> (10 - 8) / (33 - 8) / 10 * 1302 = 10.416
> instead of 11 and 7?
> Perhaps I am missing something. I'd appreciate if you can point it out.

IIRC, you need to subtract the rows that are MCVs from the row count;
i.e., you're not working with the full 1302 rows, only the ones that are
not MCVs.
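
Roughly, my reading of the docs' formula is (just a sketch, ignoring nulls,
with sum_mcv_freqs standing in for the total frequency of the MCV list):

   estimated_rows = n_rows * ( mcv_selectivity
                               + histogram_selectivity * (1 - sum_mcv_freqs) )

where histogram_selectivity is the bucket fraction you computed, e.g.
(11 - 8) / (33 - 8) / 10, mcv_selectivity covers the MCVs that satisfy the
condition, and sum_mcv_freqs is the sum of pg_stats.most_common_freqs for
the column. Scaling the histogram term down by the MCV mass is what would
pull your 15.6 and 10.4 down toward the 11 and 7 you're seeing.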

Best,
Nathan



Re: [GENERAL] SFPUG: Video from "Statistics and Postgres -- How the Planner Sees Your Data" Now on Vimeo

2009-10-15 Thread Nathan Boley
>
> There is reference to slides with some
> frequency though, are they available somewhere?
>

Ya.

http://encodestatistics.org/publications/statistics_and_postgres.pdf

Is there a better place for this?

-Nathan



Re: [GENERAL] Q: data modeling with inheritance

2009-07-02 Thread Nathan Boley
>
>  variant               association                   phenotype
>  -------               -----------                   ---------
>  variant_id ---------- variant_id           +------- phenotype_id
>  genome_id             phenotype_id --------+        short_descr
>  strand                origin_id (i.e., who)         long_descr
>  start_coord           ts (timestamp)
>  stop_coord
>

Is an association, for example, an experiment that establishes a
dependent relationship? So could there be multiple associations
between variant and phenotype?

> The problem that arises is the combinatorial nature of the schema design
> coupled with the lack of inherited primary keys.  In the current state
> of PG, one must (I think) make joining tables (association subclasses)
> for every combination of referenced foreign keys (variant and phenotype
> subclasses).
>

Is your concern that the number of joining tables will grow combinatorially
with the number of variant and phenotype subclasses?

> So, how would you model this data?  Do I ditch inheritance?

I've put some thought into representing an ontology via table
inheritance, and I've never been able to figure out a good way (of
course, that's not to say one doesn't exist...).

If I understand your problem correctly, I would use composite keys
(i.e., (variant type, id)) and then join to separate ontology tables to
restrict searches to specific branches. So all variants would be stored
in the variants table, all phenotypes in the phenotypes table, and you
would join through association. It's not as elegant as inheritance, but
it will be much more flexible, both in the types of queries you can
write and in case your requirements change.
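
Concretely, something like this (only a sketch, with made-up table and
column names, not a worked-out design):

   create table variants (
       variant_type text    not null,   -- e.g. 'snp', 'indel', ...
       variant_id   integer not null,
       genome_id    integer not null,
       strand       char(1),
       start_coord  integer,
       stop_coord   integer,
       primary key (variant_type, variant_id)
   );

   create table phenotypes (
       phenotype_type text    not null,
       phenotype_id   integer not null,
       short_descr    text,
       long_descr     text,
       primary key (phenotype_type, phenotype_id)
   );

   create table associations (
       variant_type   text    not null,
       variant_id     integer not null,
       phenotype_type text    not null,
       phenotype_id   integer not null,
       origin_id      integer,                  -- who established it
       ts             timestamp default now(),
       foreign key (variant_type, variant_id)
           references variants (variant_type, variant_id),
       foreign key (phenotype_type, phenotype_id)
           references phenotypes (phenotype_type, phenotype_id)
   );

   -- restricting a search to one branch of the ontology is then just an
   -- extra predicate (or a join to an ontology table keyed on the type)
   select v.*
   from variants v
   join associations a using (variant_type, variant_id)
   where v.variant_type = 'snp';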

-Nathan



Re: [GENERAL] Streaming PUG meeting: PostgreSQL Genetics!

2009-03-14 Thread Nathan Boley
>> Will there be a saved version of this (SFPUG talk) available for later
>> viewing? Don't make me choose between steak and beer and postgres. 8(
>
> Yes!  I'll announce it here when it's available.
>

Is there any update on this?

-Nathan



Re: [GENERAL] full vacuum really slows down query

2008-06-08 Thread Nathan Boley
> I have a query that takes 2 sec if I run it from a freshly restored dump.
>  If I run a full vacuum on the database it then takes 30 seconds.

What do the two plans look like? Can you post the explains?
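
i.e., something like this, run once against the freshly restored database and
once after the vacuum full (substitute your actual statement):

   explain analyze
   select ...;   -- the query that goes from 2 sec to 30 sec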

-Nathan



Re: [GENERAL] inheritance. more.

2008-05-01 Thread Nathan Boley
Because people can be smarter about the data partitioning.

Consider a table of users. Some are active; most are not. The active
users account for nearly all of the access to the users table, but I still
(occasionally) want to access info about the inactive users.
Partitioning users into active_users and inactive_users lets me
tell the database (indirectly) that the active users' index should stay
in memory, while the inactive users' index can be relegated to disk,
roughly as in the sketch below.
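
Roughly like this (a sketch with made-up names, using plain inheritance plus
check constraints in the style of the partitioning docs):

   create table users (
       user_id   integer primary key,
       name      text,
       is_active boolean not null
   );

   -- children inherit users' columns; the check constraints let
   -- constraint_exclusion skip the partition that can't match
   create table active_users   ( check ( is_active )     ) inherits (users);
   create table inactive_users ( check ( not is_active ) ) inherits (users);

   -- a small index on the hot partition that can stay cached
   create index active_users_name_idx on active_users (name);

   select * from active_users where name = 'alice';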

-Nathan

On Thu, May 1, 2008 at 6:02 AM, Jeremy Harris <[EMAIL PROTECTED]> wrote:
> Gurjeet Singh wrote:
>
> >  One of the advantages
> > of breaking up your data into partitions, as professed by Simon (I think)
> > (and I agree), is that you have smaller indexes, which improve
> performance.
> > And maybe having one huge index managing the uniqueness across partitioned
> > data just defeats the idea of data partitioning!
> >
>
>  Isn't "large indexes are a performance problem" just saying
>  "we don't implement indexes very well"?   And why are they
>  a problem - surely a tree-structured index is giving you
>  range-partitioned subsets as you traverse it?  Why is this
>  different from manual partitioning into (inherited) tables?
>
>  Thanks,
> Jeremy



[GENERAL] Update performance with nested loop vs. merge join

2006-12-06 Thread Nathan Boley

Hi,

I just created a new column on a relatively large table (~2 million rows, over
400 MB in table size) and am trying to populate it from another, much smaller
table using an update query.
The purpose of the query is to set a bool flag if the user for the action on
table1 is the last user, as determined from table2.

The large table has a two-column primary key, and one of those columns is the
primary key of the smaller table.

My original update query is as follows:

update table1
   set lastuser = (table1.user = table2.lastuser)::bool
   from table2
   where table1.pkeycolumn1 = table2.pkey;

which has an explain output of

Merge Join  (cost=883182.60..1024622.03 rows=6873573 width=89)
  Merge Cond: ("outer".pkeycolumn1 = "inner".pkey)
  ->  Sort  (cost=93387.50..94711.01 rows=529404 width=18)
        Sort Key: table2.pkey
        ->  Seq Scan on table2  (cost=0.00..15037.04 rows=529404 width=18)
  ->  Sort  (cost=789795.10..795605.39 rows=2324116 width=81)
        Sort Key: table1.pkeycolumn1
        ->  Seq Scan on table1  (cost=0.00..71900.16 rows=2324116 width=81)

This seems like it *should* be the best query to me, but when I try to run it,
it takes forever.
However, when I add a subselect clause at the end to force the planner to use
a nested loop against table2, the query runs MUCH faster (about 1 hour instead
of more than 10; I never let the first one finish).

Here is the query:

update table1
   set lastuser = (table1.user = table2.lastuser)::bool
   from table2
   where table1.pkeycolumn1 = table2.pkey
     and table1.pkeycolumn1 in ( select pkey from table3 limit 100 );

where table1.pkeycolumn1 is a foreign key referencing table3.pkey. Also, note
that the number of rows in table3 is *much* less than 100.
Also, the limit clause is necessary to force the planner into using the
nested loop. (Aside: is there a better way to do this within the query?)
Here is the explain for the above query:

Hash Join  (cost=13863.09..109298.79 rows=51388 width=89)
  Hash Cond: ("outer".pkeycolumn1 = "inner".pkey)
  ->  Seq Scan on table1  (cost=0.00..71900.16 rows=2324116 width=81)
  ->  Hash  (cost=13854.99..13854.99 rows=1621 width=26)
        ->  Nested Loop  (cost=8205.72..13854.99 rows=1621 width=26)
              ->  HashAggregate  (cost=8205.72..8207.72 rows=200 width=8)
                    ->  Limit  (cost=0.00..5891.43 rows=185143 width=8)
                          ->  Seq Scan on table3  (cost=0.00..5891.43 rows=185143 width=8)
              ->  Index Scan using table2-pkey-index on table2  (cost=0.00..28.14 rows=8 width=18)
                    Index Cond: ("outer".pkey = table2.pkey)


My only theory up to this point is that it has something to do with how the
table data is being cached. I notice that when I use plan 1, my computer goes
through long periods of I/O with bursts of processor activity every minute or
so. When I use plan 2, I/O sits at about 90% of my total resource usage while
processor usage sits at about 10%. Maybe it keeps trying to cache and re-sort
table2? I don't really have any idea, but that is my only guess.
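
Related to my aside above: instead of the LIMIT trick, I've wondered whether I
could just steer the planner for this one statement with the enable_* settings.
Only a sketch; I haven't verified that it helps here:

   begin;
   set local enable_mergejoin = off;  -- discourage the merge join for this transaction only
   update table1
      set lastuser = (table1.user = table2.lastuser)::bool
      from table2
      where table1.pkeycolumn1 = table2.pkey;
   commit;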

If anyone knows why this may be happening, I would really appreciate it.

Thanks,
Nathan