[GENERAL] index duplicates primary key, but is used more?
i'm doing a performance audit and noticed something odd. we tested a table a while back by creating lots of indexes that match different queries (30+). for simplicity, here's a two-column table:

CREATE TABLE foo (
    id INT PRIMARY KEY,
    value INT NOT NULL DEFAULT 0
);

The indexes were generated by a script, so we had things like:

CREATE INDEX idx_test_foo_id_asc ON foo(id ASC);
CREATE INDEX idx_test_foo_id_desc ON foo(id DESC);
CREATE INDEX idx_test_foo_value_asc ON foo(value ASC);
CREATE INDEX idx_test_foo_value_desc ON foo(value DESC);

What I noticed when checking stats earlier is that although `idx_test_foo_id_asc` is the same as the PKEY... it was used about 10x more than the pkey. Does anyone know if this is just random (perhaps due to the name sorting earlier) or if there is some other reason that index would be selected? my concern in deleting it is that it might be preferred for queries due to hinting from the explicit 'order by' (even though the contents are the same) and I may lose an index being leveraged in that query. It's on a GIANT table, so it would be hard to recreate. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
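For anyone comparing duplicate indexes like this, a quick way to see relative usage is the statistics collector view (a sketch using the example table above; counters are cumulative since the last stats reset):

```sql
-- Compare how often each index on foo has been scanned.
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'foo'
ORDER BY idx_scan DESC;
```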
Re: [GENERAL] improvements/feedback sought for a working query that looks a bit ugly and might be inefficient
On May 16, 2017, at 10:20 PM, David G. Johnston wrote: > Unless you can discard the 5 and 1000 limits you are going to be stuck > computing rank three times in order to compute and filter them. Thanks a ton for your insight. I'm stuck using them (5 is required for throttling; 1000 is required for this to run in a reasonable amount of time). The overhead of computing things is indeed super small. I'm not really worried much about the performance of this query (it runs around 3ms now, down from 20+s). I'm more worried about this code being referenced and a (possibly improper) idiom being used on queries where it will have a noticeable effect.
[GENERAL] improvements/feedback sought for a working query that looks a bit ugly and might be inefficient
Everything here works fine - but after a handful of product iterations & production adjustments, a query that handles a "task queue" across a few tables looks a bit ugly. I'm wondering if anyone can see obvious improvements. There are 3 tables:

    upstream_provider
    task
    task_queue

Originally we needed to select 50 items off the top of the queue at a time. Then we needed to set a max of 5 tasks per upstream provider (it's a cheap way to handle throttling). The table is quite large, so a limit of the last 1000 items drastically improved performance. The query got ugly when we needed to add a "priority" toggle to the queue -- basically to mark things as "process ASAP". The only way I could figure out how to do that was to add a sort -- on "is_priority DESC NULLS LAST". My concern is that the sort needs to happen 3x:

    - in the subselect for 1000 items
    - in the partition for row numbering
    - in the final sort

If anyone has a moment to look this over and suggest anything, I'd be very thankful. A working reproduction is below.
-- SQL ---

CREATE TABLE upstream_provider (
    id SERIAL PRIMARY KEY,
    name VARCHAR(32),
    is_paused BOOLEAN DEFAULT NULL
);
CREATE TABLE task (
    id SERIAL PRIMARY KEY,
    upstream_provider_id INT NOT NULL REFERENCES upstream_provider(id),
    name VARCHAR(32)
);
CREATE TABLE task_queue (
    id SERIAL PRIMARY KEY,
    task_id INT NOT NULL REFERENCES task(id),
    -- only here because it eliminates expensive joins elsewhere
    upstream_provider_id INT NOT NULL REFERENCES upstream_provider(id),
    processing_status BOOLEAN DEFAULT NULL,
    is_priority BOOLEAN DEFAULT NULL
);

SELECT partition1.*, task.*
FROM (
    SELECT window1.*,
           row_number() OVER (PARTITION BY window1.upstream_provider_id
                              ORDER BY window1.is_priority DESC NULLS LAST,
                                       window1.task_queue_id
                             ) AS rownum
    FROM (
        SELECT qu.id AS task_queue_id,
               qu.upstream_provider_id,
               qu.task_id,
               qu.is_priority
        FROM task_queue qu
        JOIN upstream_provider
          ON qu.upstream_provider_id = upstream_provider.id
        WHERE (qu.processing_status IS NULL)
          AND (upstream_provider.is_paused IS NOT TRUE)
        ORDER BY is_priority DESC NULLS LAST, qu.id DESC
        LIMIT 1000
    ) window1
) partition1
JOIN task ON partition1.task_id = task.id
WHERE partition1.rownum < 5
ORDER BY is_priority DESC NULLS LAST, task_queue_id DESC
LIMIT 50;
Re: [GENERAL] why isn't this subquery wrong?
thanks all! On Apr 20, 2017, at 6:42 PM, David G. Johnston wrote: > Subqueries can see all columns of the parent. When the subquery actually > uses one of them it is called a "correlated subquery". i thought a correlated subquery had to note that table/alias, not a raw column. I guess i've just been adhering to good form. On Apr 20, 2017, at 6:43 PM, Tom Lane wrote: > Cautious SQL programmers qualify all references inside sub-selects to avoid > getting caught by this accidentally. is there a syntax to qualify a reference to lock a subquery to the current scope (disable looking at the parents)? that's how I got caught on this by accident.
[GENERAL] why isn't this subquery wrong?
I ran into an issue while changing a database schema around. Some queries still worked, even though I didn't expect them to. Can anyone explain to me why the following is valid (running 9.6)?

schema:

CREATE TEMPORARY TABLE example_a__data (
    foo_id INT,
    bar_id INT
);
CREATE TEMPORARY TABLE example_a__rollup_source (
    id int primary key,
    name varchar(64),
    foo_id INT,
    check_bool BOOLEAN
);
CREATE TEMPORARY TABLE example_a__rollup AS
    SELECT id, name, foo_id
    FROM example_a__rollup_source
    WHERE check_bool IS TRUE
;

query:

SELECT foo_id
FROM example_a__data
WHERE foo_id IN (SELECT bar_id FROM example_a__rollup)
;

a raw select of `SELECT bar_id FROM example_a__rollup;` will cause an error because bar_id doesn't exist. postgres doesn't raise an error on the query above because example_a__data does have a bar_id -- but example_a__rollup doesn't, and there's no explicit correlation in the query. can someone explain why this happens? i'm guessing there is a good reason -- but I'm unfamiliar with the type of implicit join/queries this behavior is enabling.
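For reference, the cautious rewrite that qualifies every reference inside the subquery makes the mistake visible instead of silently correlating (a sketch against the tables above):

```sql
-- Qualifying bar_id with the subquery's own alias disables the implicit
-- fallback to the outer query's columns, so this now fails with
-- ERROR: column r.bar_id does not exist
SELECT d.foo_id
FROM example_a__data d
WHERE d.foo_id IN (SELECT r.bar_id FROM example_a__rollup r);
```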
Re: [GENERAL] disk writes within a transaction
On Feb 17, 2017, at 4:05 PM, Jeff Janes wrote: > It will probably be easier to refactor the code than to quantify just how > much damage it does. Thanks for all the info. It looks like this is something worth prioritizing because of the effects on indexes. We had discussed a fix and pointed it; rewriting the code that causes this is pretty massive, and will require blocking out a resource full-time for 2 weeks on rewrites and testing. We don't really have time to spare any of those devs, so time to make product tradeoffs ;(
[GENERAL] appropriate column for storing ipv4 address
I have to store/search some IP data in Postgres 9.6 and am second-guessing my storage options. Would anyone mind giving this a quick look for me? Right now I have two tables, and am just using cidr for both:

create table tracked_ip_address (
    id SERIAL primary key,
    ip_address CIDR not null
);
create table tracked_ip_block (
    id SERIAL primary key,
    block_cidr CIDR not null,
    ownership_data TEXT
);

The types of searching I'm doing:

1. on tracked_ip_address, I'll search for neighboring ips, e.g.

select * from tracked_ip_address where ip_address << '192.168'::CIDR;
select * from tracked_ip_address where ip_address << '192.168.1'::CIDR;

2. on tracked_ip_block, i search/join against tracked_ip_address to show known ips in a block, or a known block for an ip.

i used cidr instead of inet for the ip_address because it saved me a cast on joins and appears to work the same. was that the right move? is there a better option? thanks in advance. / jonathan
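One thing worth checking (assuming the containment searches above need index support at your row counts): since 9.4, inet/cidr columns can be GiST-indexed via the built-in inet_ops operator class, which serves the << operator directly:

```sql
-- inet_ops lets a GiST index answer the network-containment operators
-- (<<, <<=, >>, >>=, &&) instead of seq-scanning tracked_ip_address.
CREATE INDEX idx_tracked_ip_address_inet
    ON tracked_ip_address USING gist (ip_address inet_ops);
```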
Re: [GENERAL] recursive query too big to complete. are there any strategies to limit/partition?
On Jan 26, 2017, at 7:07 PM, David G. Johnston wrote: > Thinking aloud - why doesn't just finding every record with 5 descendants > not work? Any chain longer than 5 would have at least 5 items. Oh it works. This is why I ask these questions -- new perspectives! > Even without recursion you could build out a five-way self-join and any > records that make it that far are guilty. I suppose this assumes your setup > is non-cyclic. There could be cyclic records, but that's easy to filter out. A first approach took 40 seconds to run. A little tweaking is necessary, but this is a great start. THANK YOU! You saved me!
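For the archives, a sketch of the non-recursive five-way self-join David suggests. The column names here (id, parent_id) and table name t are my assumptions, since the real schema wasn't posted:

```sql
-- Any row that anchors a chain of at least 5 links is over the limit;
-- cyclic chains can be filtered out of the result afterward.
SELECT DISTINCT t1.id
FROM t t1
JOIN t t2 ON t2.parent_id = t1.id
JOIN t t3 ON t3.parent_id = t2.id
JOIN t t4 ON t4.parent_id = t3.id
JOIN t t5 ON t5.parent_id = t4.id;
```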
[GENERAL] recursive query too big to complete. are there any strategies to limit/partition?
There are over 20 million records in a self-referential database table, where one record may point to another record as a descendant. Because of a bug in application code, there was no limit on recursion. The max was supposed to be 4. A few outlier records have between 5 and 5000 descendants (there could be more; I manually found one chain of 5000). I need to find all the chains of 5+ and mark them for update/deletion. While the database is about 10GB, the recursive search is maxing out on diskspace and causing a failure (there was over 100GB of workspace free). Is there any way to make a recursive query work, or will I have to use another means and just iterate over the entire dataset (either in postgres or an external service)?
Re: [GENERAL] efficiently migrating 'old' data from one table to another
On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote: > On Thu, Jan 12, 2017 at 2:19 PM, bto...@computer.org > wrote: >> >> Review manual section 7.8.2. Data-Modifying Statements in WITH >> >> >> https://www.postgresql.org/docs/9.6/static/queries-with.html > > this. > > with data as (delete from foo where ... returning * ) insert into > foo_backup select * from data; Thanks, btober and merlin. that's exactly what i want. On Jan 12, 2017, at 4:45 PM, Adrian Klaver wrote: > Maybe I am missing something, but why do the UPDATE? > Why not?: > ... > With an index on record_timestamp. That's actually the production deployment that we're trying to optimize. Depending on the size of the table (rows, width) it performs "less than great", even with the index on record_timestamp. The UPDATE actually worked faster in most situations. I honestly don't know why (the only thing that makes sense to me is server-load)... but the update + bool test ended up being (much) faster than the timestamp comparison.
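Applied to the thread's table_a__live/table_a__archive schema, the single-statement version would look something like this (a sketch, assuming the archive columns match the live table):

```sql
-- Move stale rows in one statement: the DELETE and the INSERT see the
-- same snapshot, so no is_migrate hint column or UPDATE pass is needed.
WITH moved AS (
    DELETE FROM table_a__live
    WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC'
                             - INTERVAL '1 month'
    RETURNING column_1, column_2, record_timestamp
)
INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
SELECT column_1, column_2, record_timestamp FROM moved;
```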
[GENERAL] efficiently migrating 'old' data from one table to another
I'm just wondering if there's a more efficient way of handling a certain periodic data migration. We have a pair of tables with this structure:

table_a__live
    column_1 INT
    column_2 INT
    record_timestamp TIMESTAMP

table_a__archive
    column_1 INT
    column_2 INT
    record_timestamp TIMESTAMP

periodically, we must migrate items that are 'stale' from `table_a__live` to `table_a__archive`. The entries are copied over to the archive, then deleted. The staleness is calculated based on age -- so we need to use INTERVAL. the "live" table can have anywhere from 100k to 20MM records. the primary key on `table_a__live` is a composite of column_1 & column_2. In order to minimize scanning the table, we opted to hint migrations with a dedicated column:

ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE is_migrate IS NOT NULL;

so our migration is then based on that `is_migrate` column:

BEGIN;
UPDATE table_a__live SET is_migrate = TRUE
    WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
    SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
DELETE FROM table_a__live WHERE is_migrate IS TRUE;
COMMIT;

The inserts & deletes are blazing fast, but the UPDATE is a bit slow from postgres re-writing all the rows. can anyone suggest a better approach? I considered copying everything to a tmp table then inserting/deleting based on that table -- but there's a lot of disk-io on that approach too. fwiw we're on postgres 9.6.1
Re: [GENERAL] temporarily disable autovacuum on a database or server ?
On Jan 11, 2017, at 8:19 PM, Melvin Davidson wrote: > > Yes, you're right about ALTER SYSTEM. Unfortunately, the op provided neither > PostgreSQL version nor O/S, so we can't even be sure that is > an option. That is why I stated "I cannot confirm". I didn't think that would matter, but postgres 9.6.1 and ubuntu 16.04 anyways, thanks. i'll test that approach.
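For the archives, the ALTER SYSTEM route doesn't need a restart, since autovacuum is a reload-time setting (a sketch; requires superuser, doesn't cancel anti-wraparound vacuums, and any worker already running finishes its current table):

```sql
-- Turn the autovacuum launcher off for the maintenance window,
-- with only a configuration reload.
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();

-- ... run the maintenance/backup scripts ...

ALTER SYSTEM RESET autovacuum;
SELECT pg_reload_conf();
```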
[GENERAL] temporarily disable autovacuum on a database or server ?
I've run into a performance issue, and I think autovacuum may be involved. does anyone know if it's possible to temporarily stop autovacuum without a server restart? It seems that it either requires a server restart, or specific tables to be configured. Several times a day/week, I run a handful of scripts to handle database maintenance and backups:

* refreshing materialized views
* calculating analytics/derived/summary tables and columns
* backing up the database (pg_dumpall > bz2 > archiving)

These activities have occasionally overlapped with autovacuum, and the performance seems to be affected.
Re: [GENERAL] Improve PostGIS performance with 62 million rows?
On Jan 9, 2017, at 12:49 PM, Israel Brewster wrote: > Planning time: 4.554 ms > Execution time: 225998.839 ms > (20 rows) > > So a little less than four minutes. Not bad (given the size of the database), > or so I thought. > > This morning (so a couple of days later) I ran the query again without the > explain analyze to check the results, and noticed that it didn't take > anywhere near four minutes to execute. So I ran the explain analyze again, > and got this: ... > Planning time: 0.941 ms > Execution time: 9636.285 ms > (20 rows) > > So from four minutes on the first run to around 9 1/2 seconds on the second. > Presumably this difference is due to caching? I would have expected any > caches to have expired by the time I made the second run, but the data *is* > static, so I guess not. Otherwise, I don't know how to explain the > improvement on the second run - the query plans appear identical (at least to > me). *IS* there something else (for example, auto vacuum running over the > weekend) that could explain the performance difference? This may sound crazy, but I suggest running each of these scenarios 3+ times:

# cold explain
stop postgres
start postgres
explain analyze SELECT

# cold select
stop postgres
start postgres
enable \timing for query timing
SELECT

# cold explain to select
stop postgres
start postgres
explain analyze SELECT
enable \timing for query timing
SELECT

# cold select to explain
stop postgres
start postgres
enable \timing for query timing
SELECT
explain analyze SELECT

# cold select to select
stop postgres
start postgres
enable \timing for query timing
SELECT
SELECT

I've found the timing for "Explain Analyze" to be incredibly different from an actual SELECT on complex/large dataset queries... and the differences don't seem to correlate to possible speedups from index/table caching.
[GENERAL] does postgres log the create/refresh of a materialized view anywhere?
Is there a way to find out when a materialized view was created/refreshed? I couldn't find this information anywhere in the docs. the use-case is that I wish to update a materialized view a few times a day in a clustered environment. i'd like to make sure one of the redundant nodes doesn't refresh the view if it isn't needed. I can log this manually in postgresql if needed, but was hoping there was some "timestamp" on the view in a system table.
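Since the catalogs don't track refresh time, a minimal sketch of the manual-logging fallback mentioned above (the table and function names are my invention):

```sql
-- A tiny log table plus a wrapper; nodes call refresh_and_log('my_matview')
-- instead of REFRESH MATERIALIZED VIEW directly, and can check the log
-- before refreshing.
CREATE TABLE matview_refresh_log (
    matview_name TEXT NOT NULL,
    refreshed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE FUNCTION refresh_and_log(viewname TEXT) RETURNS void AS $$
BEGIN
    EXECUTE format('REFRESH MATERIALIZED VIEW %I', viewname);
    INSERT INTO matview_refresh_log (matview_name) VALUES (viewname);
END;
$$ LANGUAGE plpgsql;
```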
Re: [GENERAL] Determining server load
On Sep 27, 2016, at 2:46 PM, Israel Brewster wrote: > I do have those on, and I could write a parser that scans through the logs > counting connections and disconnections to give a number of current > connections at any given time. Trying to make it operate "in real time" would > be interesting, though, as PG logs into different files by day-of-the-week > (at least, with the settings I have), rather than into a single file that > gets rotated out. I was kind of hoping such a tool, such as pgbadger (which, > unfortunately, only seems to track connections per second and not consecutive > connections), already existed, or that there was some way to have the > database itself track this metric. If not, well, I guess that's another > project :) There are a lot of postgres configs and server specific tools... but on the application side and for general debugging, have you looked at statsd ? https://github.com/etsy/statsd it's a lightweight node.js app that runs on your server and listens for UDP signals, which your apps can emit for counting or timing. We have a ton of Python apps logging to it, including every postgres connection open/close and error. The overhead of clients and server is negligible. When combined with the graphite app for browsing data via charts, it becomes really useful at detecting issues with load or errors stemming from a deployment -- you just look for spikes and cliffs. We even use it to log the volume of INSERTS vs SELECTS vs UPDATES being sent to postgres. The more services/apps you run, the more useful it gets, as you can figure out which apps/deployments are screwing up postgres and the exact moment things went wrong.
Re: [GENERAL] bitwise storage and operations
On Sep 27, 2016, at 10:54 AM, Brian Dunavant wrote: > db=# select 'foo' where (9 & 1) > 0; A HA Thank you Brian and David -- I didn't realize that you needed to do the comparison to the result. (or convert the result, as these work):

select 'foo' where (9 & 1)::bool;
select 'foo' where bool(9 & 1);

I kept trying to figure out how to run operators on "9" and "1" independently to create a boolean result. I either needed more coffee or less yesterday. As a followup question... Some searches suggested that Postgres can't use indexes of INTs for these comparisons, but could on bitwise string columns. One of these tables has over 30MM rows, so I'm trying to avoid a seq scan as much as possible. I thought of creating a function index that casts my column to a bitstring, and then tailoring searches onto that. For example:

CREATE TEMPORARY TABLE example_toggle (
    id int primary key,
    toggle int default null
);
INSERT INTO example_toggle (id, toggle) VALUES (1, 1), (2, 2), (3, 3), (4, 5), (5, 8);
CREATE INDEX idx_example_toggle_toggle_bit ON example_toggle(cast(toggle as bit(4)));

While these selects work...

select * from example_toggle where (toggle & 1)::bool AND (toggle & 4)::bool;
select * from example_toggle where (toggle::bit(4) & 1::bit(4) <> 0::bit(4)) AND (toggle::bit(4) & 4::bit(4) <> 0::bit(4));

Only about 200k items have a flag right now (out of 30MM), so I thought of using a partial index on the set flags. The only way I've been able to get an index on the not null/0 rows used is to do the following:

CREATE INDEX idx_example_toggle_toggle_bit ON example_toggle(cast(toggle as bit(4))) WHERE toggle <> 0;

then tweak the query with:

select * from example_toggle where (toggle & 1)::bool AND (toggle & 4)::bool AND (toggle > 0);
select * from example_toggle where (toggle::bit(4) & 1::bit(4) <> 0::bit(4)) AND (toggle::bit(4) & 4::bit(4) <> 0::bit(4)) AND (toggle > 0);

obviously, the sample above is far too small for an index to be considered... but in general...
is a partial index of "toggle <> 0" and then hinting with "toggle > 0" the best way to only index the values that are not null or 0?
[GENERAL] bitwise storage and operations
We've been storing some "enumerated"/"set" data in postgresql as INT or BIT(32) for several years for some flags/toggles on records. This was preferable for storage to the ENUM type (or multiple columns), as we often changed the number of enumerated options or their labels -- and computing everything in the application saved the trouble of database migrations. This has worked out perfectly -- until today. For the first time ever, we need to run some queries that filter on these columns at the PostgreSQL level -- and I can't figure out how. The documentation doesn't have any examples for SELECT for the bitwise operators, and everything I've found on various threads/forums has addressed inserts or converting on a select -- but never a comparison. I've tried numerous forms and have gotten as far as CASTing everything to BIT(n), but I can't seem to construct a valid query that can filter what I want. Can anyone share a sample WHERE clause or two that does a bitwise comparison against an INT or BIT column? Thanks!
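For anyone landing here from a search: the shape that ended up working in this thread is comparing the bitwise-AND result to zero (a sketch; example_toggle is the sample table from the follow-up above, and the BIT(32) variant assumes a hypothetical flags column):

```sql
-- Rows where bit 0x4 is set in an INT column:
SELECT * FROM example_toggle WHERE (toggle & 4) <> 0;

-- The same test on a BIT(32) column needs both operands as bit strings:
-- SELECT * FROM some_table WHERE (flags & 4::bit(32)) <> 0::bit(32);
```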
Re: [GENERAL] optimizing a query
On Jun 22, 2016, at 2:38 PM, David G. Johnston wrote: > What query? A self-contained email would be nice. This was the same query as in the previous email in the thread. I didn't think to repeat it. I did include it below. > https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html > > Note especially: > > "Visibility information is not stored in index entries, only in heap > entries; ..." > > The check against the heap isn't for the truthiness of the predicate but the > visibility of the row. Thanks for this link. The table I worked on hasn't had any writes since a server restart, and according to those docs the queries should have been answered from the visibility map, not the heap. However, the amount of time spent searching is not in line with expectations for the visibility map. After reading the last paragraph about some index optimizations in 9.6 that looked related, I installed the RC on another machine and dumped 2 tables from production to see if I would qualify for any improvements. >>> But there's a problem: the WHERE clause refers to success which is not >>> available as a result column of the index. Nonetheless, an index-only scan >>> is possible because the plan does not need to recheck that part of the >>> WHERE clause at runtime: all entries found in the index necessarily have >>> success = true so this need not be explicitly checked in the plan. >>> PostgreSQL versions 9.6 and later will recognize such cases and allow >>> index-only scans to be generated, but older versions will not. The 9.6 branch planner optimizes for my query and realizes that it doesn't need to check the table. So while this index is necessary on 9.5:

CREATE INDEX idx__9_5 ON table_a(column_1, id, column_2) WHERE column_2 IS NOT FALSE;

this index works on 9.6:

CREATE INDEX idx__9_6 ON table_a(column_1, id) WHERE column_2 IS NOT FALSE;

Considering I have several million rows, this has a noticeable effect.
Combined with the various improvements on 9.6, there is a huge difference in query speed:

9.6 runs the query with the smaller index in an average of 1200ms
9.5 runs the query with the larger index in an average of 2700ms

> This one requires knowledge of the query; but I am not surprised that > reversing the order of columns in a b-tree index has an impact. I expected this to impact the decision on which index to use when multiple ones are available, or to offer poor performance -- but not to discount using the index entirely. > All at once? No. I dropped all indexes to test, then for each column combination did:

CREATE INDEX foo_idx;
ANALYZE foo;
EXPLAIN ANALYZE;
DROP INDEX foo_idx;

I call EXPLAIN ANALYZE manually once for the plan, then via script 25x to average out execution times and account for cold-start vs having loaded all the indexes. I shut down all other user processes on the machine as well.
Re: [GENERAL] optimizing a query
On Jun 22, 2016, at 4:25 AM, Erik Gustafson wrote: > don't you want an index on t_a2b.col_a, maybe partial where col_a=1 ? that table has indexes on all columns. they're never referenced because the rows are so short. this was just an example query too; col_a has 200k variations. After a lot of testing, I think I found a not-bug but possible area-for-improvement in the planner when joining against a table for filtering (using my production 9.5.2 box). I checked a query against multiple possible indexes using the related columns. only one of the indexes was on the table for each series of tests, and I analyzed the table after the drop/create of indexes.

Note 1: The only time an index-only scan is used is on this form:

CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;

Omitting the col_partial from being indexed will trigger a Bitmap Heap Scan on the full table with a recheck condition:

CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

This shouldn't be necessary. the planner knew that `col_partial` fulfilled the WHERE clause when it used the index, but scanned the table to check it anyway. On most tables the heap scan was negligible, but on a few larger tables it accounted for a 20% increase in execution time.

Note 2: This is odd, but this index is used by the planner:

CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

but this index is never used:

CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;

I honestly don't know why the second index would not be used. The query time doubled without it when run on a table with 6 million rows and about 20 columns.
--- The indexes I tested on:

CREATE INDEX idx_fkey_1 ON table_a(fkey_1);
CREATE INDEX idx_partial_fkey ON table_a(fkey_1) WHERE col_partial IS NOT FALSE;
CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;
CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;
CREATE INDEX idx_partial_fkey_partial ON table_a(fkey_1, col_partial) WHERE col_partial IS NOT FALSE;
CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;
Re: [GENERAL] optimizing a query
On Jun 21, 2016, at 6:55 PM, David G. Johnston wrote: > Aside from the name these indexes are identical... sorry. tired eyes copy/pasting between windows and trying to 'average' out 40 similar queries. > These two items combined reduce the desirability of diagnosing this...it > doesn't seem like you've faithfully recreated the scenario for us to evaluate. > > Your post is also not self-contained and you haven't provided the actual > EXPLAINs you are getting. I played around with some more indexes, creating and disabling them on one specific query. Eventually i found some index formats that didn't pull in the whole table. They gave approximately the same results as the other selects, with some differences in reporting. the heap scan on the table was negligible. the big hit was off the outer hash join. the formatting in explain made a negligible check look like it was the root issue.

CREATE TABLE t_a (
    id SERIAL PRIMARY KEY,
    col_1 INT NOT NULL,
    col_2 BOOLEAN DEFAULT NULL
);
CREATE INDEX test_idx__t_a_col1_col2__v1 ON t_a (col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v2 ON t_a (col_1, id) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v3 ON t_a (id, col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v4 ON t_a (id, col_1, col_2) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v5 ON t_a (col_1, col_2) WHERE col_2 IS NOT FALSE;

CREATE TABLE t_b (
    id SERIAL PRIMARY KEY,
    col_1 INT NOT NULL,
    col_2 BOOLEAN DEFAULT NULL
);
CREATE TABLE t_a2b (
    a_id INT NOT NULL REFERENCES t_a(id),
    b_id INT NOT NULL REFERENCES t_b(id),
    col_a INT NOT NULL,
    PRIMARY KEY (a_id, b_id)
);

EXPLAIN ANALYZE
SELECT t_a2b.b_id AS t_a2b_b_id, count(t_a2b.b_id) AS counted
FROM t_a2b
JOIN t_a ON t_a2b.a_id = t_a.id
WHERE t_a.col_1 = 730
  AND t_a2b.col_a = 1
  AND (t_a.col_2 IS NOT False)
GROUP BY t_a2b.b_id
ORDER BY counted DESC, t_a2b.b_id ASC
LIMIT 25 OFFSET 0
;

QUERY PLAN
Limit (cost=270851.55..270851.62 rows=25 width=4) (actual time=1259.950..1259.953 rows=25 loops=1)
  -> Sort (cost=270851.55..270863.43 rows=4750 width=4) (actual time=1259.945..1259.945 rows=25 loops=1)
     Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
     Sort Method: top-N heapsort Memory: 26kB
     -> HashAggregate (cost=270670.01..270717.51 rows=4750 width=4) (actual time=1259.430..1259.769 rows=1231 loops=1)
        Group Key: t_a2b.b_id
        -> Hash Join (cost=171148.45..270516.71 rows=30660 width=4) (actual time=107.662..1230.481 rows=124871 loops=1)
           Hash Cond: (t_a2b.a_id = t_a.id)
           -> Seq Scan on t_a2b (cost=0.00..89741.18 rows=2485464 width=8) (actual time=0.011..661.978 rows=2492783 loops=1)
              Filter: (col_a = 1)
              Rows Removed by Filter: 2260712
           -> Hash (cost=170446.87..170446.87 rows=56126 width=4) (actual time=107.409..107.409 rows=48909 loops=1)
              Buckets: 65536 Batches: 1 Memory Usage: 2232kB
              -> Bitmap Heap Scan on t_a (cost=1055.41..170446.87 rows=56126 width=4) (actual time=18.243..94.470 rows=48909 loops=1)
                 Recheck Cond: ((col_1 = 730) AND (col_2 IS NOT FALSE))
                 Heap Blocks: exact=43972
                 -> Bitmap Index Scan on test_idx__t_a_col1_col2__v2 (cost=0.00..1041.38 rows=56126 width=0) (actual time=8.661..8.661 rows=48909 loops=1)
                    Index Cond: (col_1 = 730)
Planning time: 0.796 ms
Execution time: 1260.092 ms

QUERY PLAN
--
Limit (cost=208239.59..208239.65 rows=25 width=4) (actual time=1337.739..1337.743 rows=25 loops=1)
  -> Sort (cost=208239.59..208251.47 rows=4750 width=4) (actual time=1337.737..1337.739 rows=25 loops=1)
     Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
     Sort Method: top-N heapsort Memory: 26kB
     -> HashAggregate (cost=208058.05..208105.55 rows=4750 width=4) (
[GENERAL] optimizing a query
I have a handful of queries in the following general form that I can't seem to optimize any further (same results on 9.3, 9.4, 9.5). I'm wondering if anyone might have a suggestion, or if they're done. The relevant table structure:

t_a2b
    a_id INT references t_a(id)
    b_id INT references t_b(id)
    col_a

t_a
    id INT
    col_1 INT
    col_2 BOOL

The selects query the association table (t_a2b) and join in a related table (t_a) for some filtering. In effort of simplifying the work, I've created indexes on t_a that have all the related columns.

CREATE INDEX test_idx ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__a ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;

postgres will query test_idx__a first (yay!) but then does a bitmap heap scan on t_a, and uses the raw t_a for the hash join. I don't actually need any information from t_a - it's just there for the filtering, and ideally postgres would just use the index. I thought this might have been from using a partial index, but the same results happen with a full index. I just can't seem to avoid this hash join against the full table. anyone have a suggestion?

example query:

SELECT t_a2b.b_id AS b_id, count(t_a2b.b_id) AS counted
FROM t_a2b
JOIN t_a ON t_a2b.a_id = t_a.id
WHERE t_a2b.col_a = 1
  AND t_a.col_1 = 730
  AND t_a.col_2 IS NOT False
GROUP BY t_a2b.b_id
ORDER BY counted DESC, t_a2b.b_id ASC
Re: [GENERAL] does timestamp precision affect storage size?
On Jun 21, 2016, at 4:50 PM, Tom Lane wrote:
> Storage-wise, no. If you have a resolution spec on your columns now,
> I think dropping the resolution spec would save you a few nanoseconds per
> row insertion due to not having to apply the roundoff function. Adding
> one would certainly not improve speed.

On Jun 21, 2016, at 4:47 PM, Vik Fearing wrote:
> No, there are no space savings here.
>
> =# select pg_column_size('now'::timestamptz(0)),
>           pg_column_size('now'::timestamptz);

Thanks. I thought that was happening, but wanted to make sure. The allure of shaving a byte or two off some rows couldn't be ignored ;)
[GENERAL] does timestamp precision affect storage size?
I'm cleaning up some queries for performance, and noticed that we never use precision beyond the second (i.e., `timestamp(0)`) in our business logic. Would there be any savings in storage or performance improvements from losing the resolution on fractional seconds, or is `timestamp(precision)` effectively the same for storage as `timestamp`? (Based on the docs, I assume the latter, but wanted to check.)
[GENERAL] disable ipv6?
I'm running PostgreSQL on Ubuntu (the 9.4 branch from postgresql.org). I think the only way to disable IPv6 is to edit postgresql.conf and explicitly state localhost as an IPv4 address:

- listen_addresses = 'localhost'
+ listen_addresses = '127.0.0.1'

Can anyone confirm?
Re: [GENERAL] Is it possible to select index values ?
On Feb 1, 2016, at 6:58 PM, David G. Johnston wrote:
> You can query the statistics portion of the database to get some basic
> statistics of the form mentioned.

Yeah, I didn't think there would be support. The stats collector doesn't have the info that I want... it's focused on how the data is used. I'm more interested in what the data is. Basically I want to compare the distribution of index "keys" -- in the case of a substring index, comparing the distribution at 3, 4, 5, 6, 7, 8 characters. Based on that, I can run some server tests on different lengths, and that's where the stats collector comes into play. I'm taking a blind stab on some index contents, and want to be a bit more educated. Anyway, I eventually realized that I may be better off just pulling the columns and running some offline analytics (it could be done in SQL, but the results would be nicer as a graph).
[GENERAL] Is it possible to select index values ?
Is it possible to select index values? I haven't found any documentation that says "No", but I haven't found anything that says "Yes" either. The reason: I have a few function indexes that are working as partial indexes. I'd like to run some analytics on them (to determine uniqueness of values, decide if I should change the function, etc.). It would be easier if I could somehow access the index contents than re-create the index data into a temporary table.
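For what it's worth, the contrib module pageinspect can read raw B-tree index pages, which gets close to "selecting index values"; a sketch (the index name is hypothetical, and the output is low-level page data rather than typed column values):

```sql
-- pageinspect exposes raw index page contents (superuser only).
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- bt_page_items() returns one row per index tuple on the given page,
-- with the key bytes in the "data" column. Page 1 is typically the
-- first leaf page of a small B-tree index.
SELECT itemoffset, ctid, data
FROM bt_page_items('my_partial_function_idx', 1);
```

The key values come back as raw bytes, so for analytics on the actual values, re-running the index expression over the table (as the poster eventually decided) is usually easier.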
Re: [GENERAL] controlling memory management with regard to a specific query (or groups of connections)
Thanks. Unfortunately, this is in a clustered environment. NFS and other shared drive systems won't scale well. I'd need to run a service that can serve/delete the local files, which is why I'm just stashing it in Postgres for now.

> On Nov 19, 2015, at 2:26 AM, Roxanne Reid-Bennett wrote:
>
> We have a system that loads a bunch of files up to be processed - we queue
> them for processing behind the scenes. We don't load them into Postgres
> before processing. We put them in a temp directory and just save the
> location of the file to the database. This configuration does have
> limitations. Post-processing can not be load balanced across servers unless
> the temp directory is shared.
>
> I'm sure you'll get more DB centric answers from others on the list.
>
> Roxanne
[GENERAL] controlling memory management with regard to a specific query (or groups of connections)
As a temporary fix I need to write some uploaded image files to PostgreSQL until a task server can read/process/delete them. The problem I've run into (via server load tests that model our production environment) is that these reads/writes end up pushing the indexes used by other queries out of memory, causing them to be re-read from disk. These files can be anywhere from 200kB to 5MB. Has anyone dealt with situations like this before and have any suggestions? I could use a dedicated db connection if that would introduce any options.
Re: [GENERAL] temporary indexes?
On Oct 22, 2015, at 5:04 PM, Jim Nasby wrote:
>
> What % of execution time is spent creating those indexes? Or is that factored
> into the 1000%? Also, could your analysis queries be run in a REPEATABLE READ
> transaction (meaning that once the transaction starts it doesn't get any new
> data)? If it could then the temp indexes could be static, which would mean no
> update overhead.

Running without the indexes would take over an hour to execute the scripts, and totally jams the machine (we got 30 minutes in once, and had to kill it). That's because of the millions of rows used in joins and sequential scans. Building all the indexes takes 30 seconds; most SQL commands then run only against the indexes (some of which are partial) and the entire suite finishes in about 3 minutes. If the indexes stay active during the day, there seems to be a 2-3% drop in write performance. This is on a webapp, so we're just happier shifting the index work from peak hours to off-peak hours. It means we can delay spinning up another application server a bit longer. I'll definitely look into your suggestions the next time I hit this code.
Re: [GENERAL] temporary indexes?
On Oct 22, 2015, at 2:08 PM, Tom Lane wrote:
> FWIW, I don't find much attraction in the idea of building an index for
> use by a single query. There basically isn't any scenario where that's
> going to beat running a plan that doesn't require the index. The value of
> an index is generally to avoid a whole-table scan and/or a sort, but
> you'll necessarily pay those costs to make the index.

On Oct 22, 2015, at 8:17 AM, vincent elschot wrote:
> Do you mean creating a temporary index on a non-temporary table to speed up
> the queries that fills the temporary table?

One of the use-cases is speeding up inserts on create, but another is for periodic analytics routines (which we handle with explicit create/drop index commands). In one example of our analytics routines, we end up needing to create/drop about 15 indexes to optimize 45 queries. This speeds up the execution by 1000% and minimizes RAM usage. We don't keep the indexes active, because we only need them for analytics, and the overhead of managing them during high-write periods during the day is noticeable. Creating and dropping these indexes on demand gives us all the benefit with none of the drawbacks.
Re: [GENERAL] temporary indexes?
On Oct 21, 2015, at 3:42 PM, Adrian Klaver wrote:
> I misunderstood then. The only thing I can think of is to wrap in a
> transaction, though that presents other issues with open transactions and/or
> errors in the transaction.

I just explicitly drop. The convenience of an auto-drop would be a nice backup. Transactions and table-locking issues are probably why temporary indexes don't exist.
Re: [GENERAL] temporary indexes?
On Oct 21, 2015, at 2:59 PM, Jeff Janes wrote:
> I think he means more like:
>
> create temporary table temp_test(id int, fld_1 varchar);
> create temporary index on permanent_table (fld_1);
>
> select something from temp_test join permanent_table using (fld_1) where a=b;
> select something_else from temp_test join permanent_table using (fld_1) where c=d;

Yes, that's exactly what I mean: a "temporary index" on a "permanent table", which expires in the same manner as a "temporary table".
[GENERAL] temporary indexes?
I couldn't find any mention of this in the archives... Have the project maintainers ever considered extending CREATE INDEX to support "temporary" indexes, like CREATE TEMPORARY TABLE? When creating temporary tables for analytics/reporting, I've noticed that I often need to create (then drop) indexes on regular tables. Temporary indexes seemed like a natural fit here, so I was wondering if there was any reason why they're not supported (other than no one wanting it!).
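Absent real temporary indexes, the explicit create/drop workaround discussed in the replies can be sketched as follows (table and index names hypothetical):

```sql
-- Build the index just for the reporting run, then drop it.
-- CONCURRENTLY avoids blocking writers on the permanent table, at the
-- cost of not being usable inside a transaction block.
CREATE INDEX CONCURRENTLY tmp_idx_orders_customer
    ON orders (customer_id);

-- ... run the analytics/reporting queries here ...

DROP INDEX CONCURRENTLY tmp_idx_orders_customer;
```

If the surrounding workload is quiet (an off-peak batch window, as in this thread), plain CREATE INDEX / DROP INDEX is simpler and faster to build.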
Re: [GENERAL] using postgresql for session
On Oct 7, 2015, at 11:58 AM, john.tiger wrote:
> has anyone used postgres jsonb for holding session ? Since server side
> session is really just a piece of data, why bother with special "session"
> plugins and just use postgres to hold the data and retrieve it with psycopg2
> ? Maybe use some trigger if session changes? We are using python Bottle
> with psycopg2 (super simple, powerful combo) - are we missing something
> magical about session plugins ?

I previously used TEXT or BLOB for holding session data, and pickled the data -- I can't remember which. If you're going to use PostgreSQL for the session, the big performance tip is to use a partial index on the session key (assuming it's an md5-like hash). So you'd want a table that is something like this:

CREATE TABLE session(
    session_id VARCHAR(32) PRIMARY KEY,
    session_data TEXT
);
CREATE INDEX idx_session_partial ON session(substr(session_id, 0, 5));

Then query like this:

SELECT * FROM session
WHERE session_id = :session_id
  AND substr(session_id, 0, 5) = :session_id_substring;

SELECT * FROM session
WHERE session_id = :session_id
  AND substr(session_id, 0, 5) = substr(:session_id, 0, 5);

That will get the planner to use the partial index first, before using the session_id index. Depending on how many items are in your table, it can make your SELECTs several orders of magnitude faster.

As for session plugins -- a lot of people in the web frameworks community are abandoning server-side sessions for client-side sessions. They generally make it easier to handle state across clusters and data centers. Some server-side session-like data is still needed, but it's often assembled from data on the client side. Most of the Python session plugins I've used have some sort of status check coupled with a cleanup function/middleware component to see if the object has changed at all. This way UPDATEs only occur when needed. FWIW, I ended up migrating our sessions into redis.
We already had redis running on the cluster, and offloading the sessions freed up a lot of Postgres performance without scaling our hardware. There just isn't much of a reason to have pg manage a simple KV store.
Re: [GENERAL] "global" & shared sequences
Thanks for the reply.

On Oct 2, 2015, at 3:26 PM, Jim Nasby wrote:
> I'm not really following here... the size of an index is determined by the
> number of tuples in it and the average width of each tuple. So as long as
> you're using the same size of data type, 18 vs 1 sequence won't change the
> size of your indexes.

I'm pretty much concerned with exactly that -- the general distribution of numbers, which affects the average size/length of each key. Using an even distribution as an example, the average width of the keys can increase by 2 digits: since we have ~18 object types, the primary keys in each might range from 1 to 9,999,999; using a shared sequence, the keys for the same dataset would range from 1 to 189,999,999. Each table is highly related, and may fkey onto 2-4 other tables... so I'm a bit wary of this change. But if it works for others... I'm fine with that!

> Sequences are designed to be extremely fast to assign. If you ever did find a
> single sequence being a bottleneck, you could always start caching values in
> each backend. I think it'd be hard (if not impossible) to turn a single
> global sequence into a real bottleneck.

I don't think so either, but everything I've read has been theoretical -- so I was hoping that someone here could give the "yeah, no issue!" from experience. The closest production stuff I found was via the BDR plugin (the only relevant thing that came up during a search), and there seemed to be anecdotal accounts of issues with sequences becoming bottlenecks -- but that was from their code that pre-generated allowable sequence ids on each node.
[GENERAL] "global" & shared sequences
Hoping to glean some advice from the more experienced.

The major component of our application currently tracks a few dozen object types, and the total number of objects is in the hundreds-of-millions range. Postgres will potentially be tracking billions of objects. Right now the primary key for our "core" objects is based on a per-table sequence, but each object has a secondary id based on a global/shared sequence. We expose everything via a connected object graph, and basically needed a global sequence. We are currently scaled vertically (1x writer, 2x reader).

I'd like to avoid assuming any more technical debt, and am not thrilled with the current setup. Our internal relations are all by the table's primary key, but the external (API, web) queries use the global id. Every table has 2 indexes, and we need to convert a 'global' id to a 'table' id before doing a query. If we're able to replace the per-table primary key with the global id, we'd be freeing up some disk space from the indexes and tables -- and not have to keep our performance cache that maps table-to-global ids.

The concerns that I have before moving ahead are:

1. General performance at different stages of DB size. With 18 sequences, our keys/indexes are simply smaller than they'd be with 1 key; I wonder how this will impact lookups and joins.
2. Managing this sequence when next scaling the db (which would probably have to be sharding, unless others have a suggestion).

If anyone has insights, they would be greatly appreciated.
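For concreteness, the shared-sequence layout under discussion can be sketched as follows (table names hypothetical):

```sql
-- One sequence shared by every object table, so ids are globally unique
-- across object types.
CREATE SEQUENCE global_object_id_seq;

CREATE TABLE widget (
    id BIGINT PRIMARY KEY DEFAULT nextval('global_object_id_seq'),
    name TEXT
);

CREATE TABLE gadget (
    id BIGINT PRIMARY KEY DEFAULT nextval('global_object_id_seq'),
    name TEXT
);
```

BIGINT is the safe choice here, since a single shared sequence consumes id space roughly 18x faster than the per-table sequences would.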
[GENERAL] trouble converting several serial queries into a parallel query
I have a very simple query that is giving me some issues due to the size of the database and the number of requests I make to it in order to compile the report I need. A dumbed-down version of the table and query:

CREATE TABLE a_to_b (
    id_a INT NOT NULL REFERENCES table_a(id),
    id_b INT NOT NULL REFERENCES table_b(id),
    PRIMARY KEY (id_a, id_b)
);

SELECT id_a, id_b FROM a_to_b WHERE id_a = 1 LIMIT 5;

The problem is that the table has a few million records and I need to query it 30+ times in a row. I'd like to improve this with a parallel search using `IN()`:

SELECT id_a, id_b FROM a_to_b
WHERE id_a IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30);

That technique has generally fixed a lot of bottlenecks for us. However, I can't wrap my head around structuring it so that I can apply a limit based on the column -- so that I only get 5 records per id_a. The table has columns that I would use for ordering in the future, but I'm fine with just getting random values right now. Can anyone offer some suggestions? Thanks in advance.
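The per-id_a limit being asked about is commonly expressed with a window function; a sketch against the simplified table above (not from the thread, and on 9.3+ a LATERAL join against a LIMIT 5 subquery is the other common formulation):

```sql
-- Rank rows within each id_a, then keep at most 5 per group.
-- Without an ORDER BY in the OVER clause, the 5 rows are arbitrary,
-- matching the "random values are fine" requirement.
SELECT id_a, id_b
FROM (
    SELECT id_a, id_b,
           row_number() OVER (PARTITION BY id_a) AS rn
    FROM a_to_b
    WHERE id_a IN (1, 2, 3, 4, 5)
) ranked
WHERE rn <= 5;
```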
Re: [GENERAL] newsfeed type query
On Apr 29, 2015, at 6:50 PM, Jim Nasby wrote:
> Only because you're using UNION. Use UNION ALL instead.

The difference between "union" and "union all" was negligible. The problem was in the subselect and the sheer size of the tables, even when we could handle it as an index-only scan.

On Apr 29, 2015, at 1:18 PM, Ladislav Lenart wrote:
> I would expect the overall query to return only 60F and 55F as the most recent
> data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
> that the overall query should be also ordered by ts and limited to 2.

You're right. Total mistake on my part and confusion with that -- I got this query confused with the specifics of a similar one.
Re: [GENERAL] newsfeed type query
On Apr 29, 2015, at 12:25 PM, Ladislav Lenart wrote:
> Could you please explain to me the error(s) in my reasoning?

Let me just flip your list in reverse... and add in some elements (marked with a *):

posting  ts  context
p60      60  friend
p55      55  friend*
p54      54  friend*
p50      50  group
p49      49  group*
p48      48  group*
p40      40  friend
p30      30  group
p20      20  friend
p10      10  group
p0        0  friend

With the 2-limited subqueries, the results would be: 60F, 55F, 50G, 49G. But the "most recent" data is 60F, 55F, 54F, 50G. So we end up showing 49, which is less relevant than 54. In some situations this isn't much of an issue, but in others it is detrimental. For example, one of my "feeds" contains a distribution of events according-to-type that is very uneven. While "friend" and "group" might be relatively close in time to one another, "system" or other events may be months old -- and that older content gets pulled in with this style of query. If you need to paginate the data and select the next 10 overall items, it gets even more complicated.

IIRC, the best mix of performance and "product" that I've found is to do something like this:

SELECT * FROM (
    (SELECT a, b, c, ts FROM table_a ORDER BY ts DESC LIMIT 1)
    UNION
    (SELECT a, b, c, ts FROM table_b ORDER BY ts DESC LIMIT 1)
) AS unioned
ORDER BY ts DESC
LIMIT 100 OFFSET 0;

By creating an artificial limit on the inner queries, you can save Postgres from doing a lot of intensive I/O work and memory usage (like, a lot). Then joining a few lists and sorting 20k (or even 100k) items is really cheap. The downside is that you effectively limit the 'relevancy' of the query to whatever the inner limit is (i.e., 1 -- not the combined total of 2), but that number can be arbitrarily high enough that it is irrelevant while still showing the right amount of content for people.
Re: [GENERAL] newsfeed type query
Thanks all! These point me in much better directions!

Jim Nasby's approach to selecting an expression addressed some things (SELECT f.useraccount_id_b IS NOT NULL AS in_friends). Ladislav Lenart's usage of the CTE is also of a different format than I've used in the past. I think I'll be able to patch together some performance improvements now that will last until the database structure changes.

On Apr 29, 2015, at 6:54 AM, Ladislav Lenart wrote:
> I think you can propagate ORDER BY and LIMIT also to the subqueries of the
> UNION, i.e.:

It behaves a lot better, but doesn't give me the resultset I need. Older data from one subquery is favored over newer data from another. I use a similar approach on another part of this application -- where the effect on the resultset isn't as pronounced. On that query there are over 100 million total stream events. Not using an inner limit runs the query in 7 minutes; limiting the inner subquery to 1MM rows runs in 70 seconds... and limiting to 10k is around 100ms.

On Apr 29, 2015, at 10:16 AM, Melvin Davidson wrote:
> I see others have responded with suggestions to improve query performance,
> but one thing I noticed when you gave the data structure is there are
> no primary keys defined for friends or posting, neither are there any
> indexes.
> Was that an omission?

This was a quick functional example to illustrate. The real tables are slightly different, but do have pkeys ('id' is a bigserial; relationship tables (friends, memberships) use a composite key). They are aggressively indexed and reindexed on various columns for query performance. Sometimes we create an extra index that has multiple columns or partial columns to make scans index-only.
Re: [GENERAL] newsfeed type query
Sorry, I was trying to ask something very abstract as I have similar situations on multiple groups of queries/tables (and they're all much more complex). I'm on pg 9.3.

The relevant structure is:

posting: id, timestamp_publish, group_id__in, user_id__author
friends: user_id__a, user_id__b
memberships: user_id, group_id, role_id

-- working sql
CREATE TABLE groups( id SERIAL NOT NULL PRIMARY KEY );
CREATE TABLE users( id SERIAL NOT NULL PRIMARY KEY );
CREATE TABLE friends (
    user_id__a INT NOT NULL REFERENCES users( id ),
    user_id__b INT NOT NULL REFERENCES users( id )
);
CREATE TABLE memberships (
    user_id INT NOT NULL REFERENCES users( id ),
    group_id INT NOT NULL REFERENCES groups( id ),
    role_id INT NOT NULL
);
CREATE TABLE posting (
    id SERIAL NOT NULL,
    timestamp_publish TIMESTAMP NOT NULL,
    group_id__in INT NOT NULL REFERENCES groups(id),
    user_id__author INT NOT NULL REFERENCES users(id),
    is_published BOOL
);

The output that I'm trying to get is:

    posting.id
    {the context of the select}
    posting.timestamp_publish (this may need to get correlated into other queries)

These approaches had bad performance:

-- huge selects / memory -- it needs to load everything from 2 tables before it limits
EXPLAIN ANALYZE
SELECT id, feed_context FROM (
    SELECT id, timestamp_publish, 'in-group' AS feed_context
    FROM posting
    WHERE (
        group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
        AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
    )
    UNION
    SELECT id, timestamp_publish, 'by-user' AS feed_context
    FROM posting
    WHERE (
        user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
        AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
    )
) AS feed
ORDER BY timestamp_publish DESC
LIMIT 10;

-- selects minimized, but repetitive subqueries
SELECT id,
    CASE WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
         THEN True ELSE NULL END AS feed_context_group,
    CASE WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
         THEN True ELSE NULL END AS feed_context_user
FROM posting
WHERE (
    group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
    OR user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
)
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
ORDER BY timestamp_publish DESC
LIMIT 10;

On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:
> Since you very nicely DID NOT provide the pg version, O/S or table
> structure(s), which is what you should do REGARDLESS of the type of question
> (it's just the smart and polite thing to do when asking for help)
> The best I can suggest is:
> SELECT
>   CASE WHEN context = 'friend' THEN p.junka
>        WHEN context = 'group' THEN p.junkb
>        WHEN context = 'both' THEN p.junka || ' ' || p.junkb
>   END
> FROM posting p
> WHERE p.author_id IN (SELECT f.friend_id
>                       FROM friends f
>                       WHERE f.user_id = ?)
>    OR p.group_id IN (SELECT m.group_id
>                      FROM memberships m
>                      WHERE m.user_id = ?);
[GENERAL] newsfeed type query
I'm trying to upgrade some code that powers a newsfeed type stream, and hoping someone can offer some insight on better ways to structure some parts of the query. The part that has me stumped right now... there are several criteria for why something could appear in a stream. For example, here are 2 for handling a posting:

* a posting by a friend
* a posting in a group

The general way I've handled this so far has been simple:

SELECT * FROM posting
WHERE author_id IN (SELECT friend_id FROM friends WHERE user_id = ?)
   OR group_id IN (SELECT group_id FROM memberships WHERE user_id = ?);

Now I need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly.

1. If I just add 'case' statements to the select to note the origin, those subselects run again (i.e., the same subquery is executed twice).
2. If I structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort the 2 separate selects (e.g. "select id, timestamp, 'by-friend'" unioned with "in-group").

Does anyone have ideas on other approaches to structuring this?
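One shape the replies point toward is computing each context test once as a boolean column and filtering on the flags in a wrapper query, so neither subselect has to be written twice; a sketch against the simplified schema above (not a verbatim solution from the thread; 57 is just an example user id):

```sql
-- Each membership test is written once and exposed as a flag; the
-- wrapper's WHERE keeps only rows that matched at least one context.
SELECT id, via_friend, via_group
FROM (
    SELECT p.id,
           p.author_id IN (SELECT friend_id FROM friends
                           WHERE user_id = 57) AS via_friend,
           p.group_id  IN (SELECT group_id FROM memberships
                           WHERE user_id = 57) AS via_group
    FROM posting p
) flagged
WHERE via_friend OR via_group;
```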
Re: [GENERAL] splitting up tables based on read/write frequency of columns
On Jan 19, 2015, at 5:07 PM, Stefan Keller wrote:
> Hi
>
> I'm pretty sure PostgreSQL can handle this.
> But since you asked with a theoretic background,
> it's probably worthwhile to look at column stores (like [1]).

Wow. I didn't know there was a column store extension for PG -- this would come in handy for some analytic stuff we run! I know that PG can handle my current system "at scale". I'm really just wondering what the possible slowdowns/improvements will be. Doing a rewrite of the entire row + updating the various indexes seems to be a lot of unnecessary IO. At some point it will make sense to minimize that and isolate the heavy-write columns from impacting the rest of the table's performance.
[GENERAL] splitting up tables based on read/write frequency of columns
This is really a theoretical/anecdotal question, as I'm not at a scale yet where this would be measurable. I want to investigate while this is fresh in my mind...

I recall reading that unless a row has columns that are TOASTed, an `UPDATE` is essentially an `INSERT + DELETE`, with the previous row marked for vacuuming.

A few of my tables have the following characteristics:
- The primary key has many other tables/columns that FKEY onto it.
- Many columns (30+) of small data size.
- Most columns (90%) see 1 WRITE (UPDATE) per 1000 READS.
- Some columns (10%) do a bit of internal bookkeeping and see 1 WRITE (UPDATE) per 50 READS.

Has anyone done testing/benchmarking on the potential efficiency/savings of consolidating the frequent-UPDATE columns into their own table?
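The split being considered would look roughly like this (table and column names hypothetical): move the hot bookkeeping columns into a narrow side table sharing the primary key, so the frequent UPDATEs rewrite only a small row.

```sql
-- Wide, mostly-read table keeps the stable columns.
CREATE TABLE account (
    id BIGINT PRIMARY KEY,
    name TEXT,
    email TEXT
    -- ... the other ~30 rarely-updated columns ...
);

-- Narrow side table holds the frequently-updated bookkeeping columns;
-- an UPDATE here rewrites a small row and touches fewer indexes.
CREATE TABLE account_stats (
    account_id BIGINT PRIMARY KEY REFERENCES account(id),
    login_count INT NOT NULL DEFAULT 0,
    last_seen TIMESTAMP
);
```

One thing that shrinks the expected win: PostgreSQL's heap-only tuple (HOT) optimization already skips index maintenance when an UPDATE changes no indexed columns, so the remaining cost is mostly the rewrite of the wide heap row itself.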
Re: [GENERAL] postgresql versus riak for a global exchange
A very popular design I see is often this:
- PostgreSQL for account, inventory, and transactional data, and all writes
- NoSQL (Redis, Riak, Mongo, etc.) for read-only indexes and assembled documents derived from postgres (almost like a read-through cache)

On Jan 5, 2015, at 5:46 PM, Raymond Cote wrote:
> I'm familiar with both PostgreSQL and Riak (1.4, not 2.0).
> I know that Riak 2.0 now offers strong consistency. Have not yet seen what
> that does to performance.
> Big plusses for PostgreSQL:
> - you can do both relational and NOSQL tasks (the Binary JSON in the latest
> PostgreSQL).
> - well-tested consistency, ACID, etc.
> - lots of adapters and support.
> - big community
>
> Big plusses for Riak:
> - multi-master replication
> - multi-data center replication
> - easy to scale up
>
> We use PostgreSQL in combination with Riak for data storage (we have a
> tokenization service).
> We're currently using the EnterpriseDB multi-master PostgreSQL replication
> and are quite happy with it.
> The replication runs periodically, not streaming, so there is at least a 1
> second delay for replication to occur.
> Riak replicates quicker — but then you don't have the strong relational
> structure on top.
>
> As mentioned earlier, 'exchange…trade…asset' is a bit vague.
> In addition to just storing things, you'll need to keep track of all sorts of
> log-in and contact info — perhaps not ideal for Riak.
> Probably best to consider precisely what traits your planned application has
> and then look to match against the database storage.
> May even end up with a mix of the two just as we have.
>
> Your decision may also depend on which development language/framework you
> chose for the implementation.
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
On Dec 29, 2014, at 5:36 PM, Mike Cardwell wrote:
> So the system I've settled with is storing both the originally supplied
> representation, *and* the lower cased punycode encoded version in a separate
> column for indexing/search. This seems really hackish to me though.

I do the same exact thing and don't think it's hackish. I really like being able to see the punycode next to the human representation on simple db pulls. It's barely more disk space and really useful.
Re: [GENERAL] Storing Video's or vedio file in DB.
I wouldn't even store it on the filesystem if I could avoid that. Most people I know will assign the video a unique identifier (which is stored in the database) and then store the video file with a 3rd party (e.g. Amazon S3).

1. This is often cheaper. Videos take up a lot of disk space, and having to ensure 2-3 copies of a file as a failover is not fun.
2. It offloads work from internal servers. Why deal with connections that are serving a static file if you can avoid it?

In terms of FS vs DB (aside from the open vs streaming issue which was already brought up), I think the big issue with storing large files in the database is the input/output connection. Postgres has a specified number of max connections available, and each one has some overhead to operate. Meanwhile, a server like nginx can handle 10k connections easily, with little or no overhead. Even if the raw serving speed is comparable to the filesystem, you end up using a resource from a limited database connection pool, and you run the risk of a slow/dropped client tying up the connection. Why allocate a resource to these operations when there are more lightweight alternatives that won't tie up a database connection?
Re: [GENERAL] function indexes, index only scan and sorting
On Dec 12, 2014, at 4:58 PM, Tom Lane wrote:

> regression=# create table tt (f1 int, f2 text);
> CREATE TABLE
> regression=# create index on tt (lower(f2));
> CREATE INDEX
> regression=# explain select * from tt order by lower(f2);
>                               QUERY PLAN
>  Index Scan using tt_lower_idx on tt  (cost=0.15..65.68 rows=1230 width=36)
> (1 row)

Thank you so much for posting this test. I got a seq scan on my local machine, so I checked the version... still running 9.2.4. I tried it on production (which is 9.3.x) and got the same result as you. Looking at the 9.3 release notes, I'm guessing this behavior comes from one of the optimizer fixes.
Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)
On Dec 8, 2014, at 9:35 PM, Scott Marlowe wrote:

> select a,b,c into newtable from oldtable group by a,b,c;
>
> One pass, done.

This is a bit naive, but couldn't this approach potentially be faster (depending on the system)?

SELECT a, b, c
INTO duplicate_records
FROM (
    SELECT a, b, c, count(*) AS counted
    FROM source_table
    GROUP BY a, b, c
) q_inner
WHERE q_inner.counted > 1;

DELETE FROM source_table
USING duplicate_records
WHERE source_table.a = duplicate_records.a
  AND source_table.b = duplicate_records.b
  AND source_table.c = duplicate_records.c;

It would require multiple full-table scans, but it would minimize writing to disk -- and isn't a 'read' operation usually much more efficient than a 'write' operation? If the duplicate checking is only done on a small subset of columns, indexes could speed things up too.
[GENERAL] function indexes, index only scan and sorting
Can someone confirm a suspicion for me? I have a moderately sized table (20+ columns, 3MM rows) that tracks "tags". I have a lower(column) function index that is used to simplify case-insensitive lookups:

CREATE INDEX idx_tag_name_lower ON tag(lower(name));

I have a few complex queries that need to join back to this table (via the `id` primary key) and sort on `lower(name)`. I'm not selecting `lower(name)`, just using it in an ORDER BY. The only way I seem to be able to avoid a sequential scan and get an index-only scan is with another index -- this one specifically (and I've run queries against 8 index permutations):

CREATE INDEX idx_tag_joins ON tag(id, name_display);

Am I correct in observing that the value of a function index can't be used for sorting?
Re: [GENERAL] deferring ForeignKey checks when you didn't set a deferrable constraint ?
On Nov 20, 2014, at 6:00 PM, Melvin Davidson wrote:

> Try the following queries. It will give you two .sql files (create_fkeys.sql
> & drop_fkeys.sql).

Thanks! I tried a variation of that to create DEFERRABLE constraints, and that was a mess. It appears all the checks ran individually at the end of the transaction -- the process consumed 100% cpu overnight and was still stuck on the 'commit' after 16 hours.

So I crossed my fingers and tried your code like this:

BEGIN;
DROP CONSTRAINT . x24;
DELETE;
ADD CONSTRAINT . x24;
COMMIT;

And that took just over 24 seconds.
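Spelled out for one constraint, the pattern above would look something like this (table and constraint names are hypothetical; the poster had 24 such constraints):

```sql
BEGIN;

-- Drop each foreign key that references the table being purged.
ALTER TABLE child_table DROP CONSTRAINT child_parent_fkey;
-- ... repeat for each referencing constraint ...

-- With the checks gone, the bulk delete runs without per-row FK lookups.
DELETE FROM parent_table WHERE id IN (SELECT id FROM rows_to_delete);

-- Re-adding the constraint validates all existing rows in a single pass,
-- which is far cheaper than millions of individual deferred checks.
ALTER TABLE child_table
    ADD CONSTRAINT child_parent_fkey
    FOREIGN KEY (parent_id) REFERENCES parent_table (id);
-- ... repeat for each constraint ...

COMMIT;
```

Note that the ALTER TABLE statements take strong locks on the involved tables for the duration of the transaction, so this is best done in a maintenance window.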
[GENERAL] deferring ForeignKey checks when you didn't set a deferrable constraint ?
I have a core table with tens of millions of rows, and I need to delete about a million records. There are 21 foreign key checks against this table. Based on the current performance, it would take a few days to make my deletions.

None of the constraints were defined as `DEFERRABLE INITIALLY IMMEDIATE`, so I'm out of luck on deferring them. Dropping/redefining constraints looks to be an ordeal -- and something I'm scared to make a mistake on. I looked into disabling triggers on a table, but I couldn't find any info on how to run the checks at the end of the transaction so I can ensure integrity.

Does anyone have suggestions on things that might work?
Re: [GENERAL] issue with double ordering in a wrapped distinct
I re-ran the query in multiple forms and included it below (I regexed it to become 'foo2bar' so it's more generic to others). I also uploaded it as a public spreadsheet to google, because I think that is a bit easier to look at: https://docs.google.com/spreadsheets/d/1w9HM8w9YUpul5Bmy-uvzA4I6x5OFwWzR6K5atFG2ubw/edit?usp=sharing

The most interesting thing to me was how drastically the planner was affected by the interplay of DISTINCT and ORDER BY in a subselect (times in ms):

   293 - SELECT ( SELECT DISTINCT ) ORDER LIMIT OFFSET
   293 - SELECT DISTINCT ( SELECT DISTINCT ORDER ) LIMIT OFFSET
   300 - SELECT DISTINCT ( SELECT ORDER ) LIMIT OFFSET
  6400 - SELECT ( SELECT DISTINCT ORDER ) LIMIT OFFSET
  7631 - SELECT DISTINCT ( SELECT ) ORDER LIMIT OFFSET

And you can also see how the planner completely changed strategy when LIMIT/OFFSET was introduced to the first query:

     394 - SELECT ORDER BY
  446501 - SELECT ORDER BY LIMIT OFFSET

Query A

EXPLAIN ANALYZE
SELECT DISTINCT foo_2_bar.bar_id
FROM foo_2_bar
JOIN foo ON foo_2_bar.foo_id = foo.id
WHERE foo.attribute_id_a = 582
  AND (foo.is_a IS NOT TRUE)
  AND (foo.is_b IS NOT TRUE)
  AND (foo.is_c IS NOT TRUE)
  AND (foo.is_d IS NOT TRUE)
ORDER BY foo_2_bar.bar_id ASC;

QUERY PLAN
Unique  (cost=48810.15..48842.34 rows=6437 width=4) (actual time=283.850..389.587 rows=3468 loops=1)
  ->  Sort  (cost=48810.15..48826.25 rows=6437 width=4) (actual time=283.846..335.532 rows=44985 loops=1)
        Sort Key: foo_2_bar.bar_id
        Sort Method: quicksort  Memory: 3645kB
        ->  Nested Loop  (cost=322.52..48402.94 rows=6437 width=4) (actual time=2.734..221.878 rows=44985 loops=1)
              ->  Bitmap Heap Scan on foo  (cost=322.09..27984.32 rows=7679 width=4) (actual time=2.718..27.531 rows=7885 loops=1)
                    Recheck Cond: (attribute_id_a = 582)
                    Filter: ((is_a IS NOT TRUE) AND (is_b IS NOT TRUE) AND (is_c IS NOT TRUE) AND (is_d IS NOT TRUE))
                    Rows Removed by Filter: 7
                    ->  Bitmap Index Scan on idx__foo__attribute_id_a  (cost=0.00..320.17 rows=7966 width=0) (actual time=1.335..1.335 rows=8161 loops=1)
                          Index Cond: (attribute_id_a = 582)
              ->  Index Only Scan using idx__foo2bar__test on foo_2_bar  (cost=0.43..2.61 rows=5 width=8) (actual time=0.004..0.010 rows=6 loops=7885)
                    Index Cond: (foo_id = foo.id)
                    Heap Fetches: 0
Total runtime: 394.606 ms

Query A-LIMITED -- same as above, just adds a LIMIT/OFFSET

EXPLAIN ANALYZE
SELECT DISTINCT foo_2_bar.bar_id
FROM foo_2_bar
JOIN foo ON foo_2_bar.foo_id = foo.id
WHERE foo.attribute_id_a = 582
  AND (foo.is_a IS NOT TRUE)
  AND (foo.is_b IS NOT TRUE)
  AND (foo.is_c IS NOT TRUE)
  AND (foo.is_d IS NOT TRUE)
ORDER BY foo_2_bar.bar_id ASC
LIMIT 50 OFFSET 0;

QUERY PLAN
Limit  (cost=0.85..15386.21 rows=50 width=4) (actual time=57698.794..446500.933 rows=50 loops=1)
  ->  Unique  (cost=0.85..1980710.86 rows=6437 width=4) (actual time=57698.789..446500.787 rows=50 loops=1)
        ->  Nested Loop  (cost=0.85..1980694.77 rows=6437 width=4) (actual time=57698.784..446498.319 rows=2011 loops=1)
              ->  Index Scan using idx__foo2bar__bar_id on foo_2_bar  (cost=0.43..75725.91 rows=1517741 width=8) (actual time=0.017..10373.409 rows=364
Re: [GENERAL] issue with double ordering in a wrapped distinct
On Nov 18, 2014, at 6:43 PM, Tom Lane wrote:

> but as for why it gets a much worse plan after
> flattening --- insufficient data.

Thanks. I'll run some test cases in the morning and post the full queries matched with EXPLAIN ANALYZE. This is just puzzling to me. I was hoping there might be a more general planner issue that someone would have noticed.

On Nov 18, 2014, at 6:55 PM, David G Johnston wrote:

> I presume you have a reason for not simply doing away with the subquery
> altogether...

When not using the subquery, the query ran in 1s -- if I didn't have a LIMIT/OFFSET. Adding a LIMIT/OFFSET to that query made it run for nearly 6 minutes. The only way I could trick it into using the better query plan was to wrap the "good" query as a subquery and then apply the LIMIT/OFFSET in the outer query.
[GENERAL] issue with double ordering in a wrapped distinct
I have a particular query that returns a resultset of 45k rows out of a large table (pg 9.3 and 9.1). It's a many-to-many query, where I'm trying to search for Bar based on attributes in a linked Foo. I tweaked the indexes, optimized the query, and got it down to an acceptable speed of around 1,100ms.

The second I added a LIMIT/OFFSET though, the query plan completely changed and the time ballooned up to 297,340 ms. Yes, I waited that long to see what was going on in the query planner.

I did a lot of playing around, and managed to get this form of the query to work in 305ms with a LIMIT/OFFSET:

SELECT DISTINCT qinner.bar_id
FROM (
    SELECT foo_2_bar.bar_id AS bar_id
    FROM foo_2_bar
    JOIN foo ON foo_2_bar.foo_id = foo.id
    WHERE foo.biz_id = 1
      AND (foo.is_hidden IS NOT TRUE)
    ORDER BY foo_2_bar.bar_id ASC
) AS qinner
ORDER BY qinner.bar_id ASC
LIMIT 100 OFFSET 0;

This is what I don't understand -- notice the two ORDER BY clauses:

If I run this with an inner and outer ORDER BY, I get ~305ms. (I don't think I need both, but I wasn't sure if ordering is kept from a subselect.)
If I run this with only the inner, I get ~304ms.
If I run this with only the outer, it's pushing over 10 minutes again.

I'm wondering if anyone might know why that performance hit would be happening.
Re: [GENERAL] String searching
On Nov 18, 2014, at 11:49 AM, Robert DiFalco wrote:

> As far as I can tell, the trigram extension would be the easiest way to
> implement this. It looks like I wouldn't need to mess with vectors, etc. It
> would just look like a standard index and query, right? It seems that if I
> need something more powerful in the future that I could always move to
> ElasticSearch, Sphinx, or something similar.

I just followed the instructions in the docs to create the index, and ran ANALYZE on the table before running a standard LIKE SELECT.
Re: [GENERAL] String searching
On Nov 18, 2014, at 7:38 AM, Albe Laurenz wrote:

> That index wouldn't help with the query at all.
>
> If you really need a full substring search (i.e., you want to find
> "howardjohnson"), the only thing that could help are trigram indexes.

I stand corrected. I ran a sample query on my test database of 100k names.

Using a function index on `lower(name)`, this runs an index scan in 0.2ms:

... where lower(name) = lower('bob');

but this runs a sequential scan in 90ms:

... where lower(name) like lower('%bob%');

I didn't know that a LIKE with a leading wildcard can't use a btree index!

Using a trigram index, this runs a bitmap index scan on the trigram index, then a bitmap heap scan on the table, in 13ms:

... where name ilike '%bob%';
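For anyone following along, the trigram setup behind those numbers looks roughly like this (table and index names are mine, not from the thread):

```sql
-- pg_trgm ships with Postgres as a contrib extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index on the searched column.
CREATE INDEX names_name_trgm_idx ON names USING gin (name gin_trgm_ops);

-- Leading-wildcard, case-insensitive searches can now use the index:
SELECT * FROM names WHERE name ILIKE '%bob%';
```

Run ANALYZE on the table afterwards so the planner has fresh statistics when deciding whether to use the new index.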
Re: [GENERAL] String searching
On Nov 17, 2014, at 12:55 PM, Robert DiFalco wrote:

> SELECT * FROM MyTable WHERE upper(FullName) LIKE upper('%John%');
>
> That said, which would be the best extension module to use? A "gist" index on
> the uppercased column? Or something else? Thanks!

Performance-wise, I think a function index would probably be the best:

CREATE INDEX mytable_lower_fullname_idx ON mytable(lower(fullname));

SELECT * FROM mytable WHERE lower(fullname) LIKE lower('%john%');

The only reason I use `lower` and not `upper` is that it's easier to look at when debugging and writing sample queries.

I'd bench against GIN and GiST, but I think this will work the best. The reason is that full-text search normalizes words using language rules to shrink the index, so it works great on "words":

select plainto_tsquery('doing watching reading programming');
 'watch' & 'read' & 'program'

but not so great on "names":

select plainto_tsquery('john doe');
 'john' & 'doe'

select plainto_tsquery('jon doe');
 'jon' & 'doe'

So you'll get a bit more overhead on the match and you won't get a smaller index (which is why they're great for fulltext). The search execution might turn out to be much faster -- if so, I'd love to know. But doing a lower() search against a lower() function index has always been ridiculously fast for me.

This only goes for names, though. If you're searching other fields, another search method might be considerably better.
[GENERAL] pg_upgrade and ubuntu
I ran into an issue migrating from 9.1 to 9.3 on ubuntu using pg_upgrade.

The default ubuntu package, and the one from postgresql.org, both store `postgresql.conf` under /etc as `/etc/postgresql/VERSION/main/postgresql.conf`; however, the pg_upgrade script expects it in the `datadir`. The simple solution seems to be just symlinking the /etc files into the data dirs.

It took me a while to realize this was the error. It might make sense to update the docs with a note about what should be in the data dir to enable an upgrade.
Re: [GENERAL] Are there any downsides to using postgres' data directory on a dedicated drive/partition / filesystem?
Thanks, everyone! For now this will be provisioning a physical drive for a box -- and "everything" will be there for now. So OS on one drive, and DB on another. I've run into programs before (mostly on Mac/Win) that are exceedingly unhappy if they're run on a drive other than the OS drive. Since many people partition data and services under pg, I figured it would be okay -- but I couldn't find anything in the docs and wanted to check.
[GENERAL] Are there any downsides to using postgres' data directory on a dedicated drive/partition / filesystem?
I'm running postgres on a virtual server. I was wondering if there were any known issues with moving the data directory to another mounted partition / filesystem.
[GENERAL] troubleshooting a database that keeps locking up
I have a database that has started to constantly hang after a brief period of activity.

Looking at `select * from pg_stat_activity;` I roughly see the following each time:

process 1 |
process 2 | in transaction
process 3 | in transaction
process 4 |
process 5 |
process 6 |
process 7 | INSERT INTO table_a RETURNING id

Occasionally I'll see:

process 8 | UPDATE table_b

Does anyone have tips on how I can troubleshoot this? I was hoping there would be some way to show the history of the "in transaction" processes, but I couldn't find one. I was also wondering if the "RETURNING id" might have something to do with this. I'd appreciate any pointers in trying to figure out what is causing this.
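A pattern like this (idle-in-transaction sessions plus a stuck INSERT) usually means one backend holds a lock the others are waiting on. A lock-monitoring query along these lines can show who is blocking whom; this is a sketch, and it assumes 9.2+ where pg_stat_activity exposes `query` (earlier versions call it `current_query`):

```sql
-- For each ungranted lock, find another session holding a conflicting
-- lock on the same object, and show both queries.
SELECT blocked.pid     AS blocked_pid,
       blocked.query   AS blocked_query,
       blocking.pid    AS blocking_pid,
       blocking.query  AS blocking_query
FROM pg_locks blocked_lock
JOIN pg_stat_activity blocked ON blocked.pid = blocked_lock.pid
JOIN pg_locks blocking_lock
     ON  blocking_lock.locktype = blocked_lock.locktype
     AND blocking_lock.database      IS NOT DISTINCT FROM blocked_lock.database
     AND blocking_lock.relation      IS NOT DISTINCT FROM blocked_lock.relation
     AND blocking_lock.transactionid IS NOT DISTINCT FROM blocked_lock.transactionid
     AND blocking_lock.pid <> blocked_lock.pid
JOIN pg_stat_activity blocking ON blocking.pid = blocking_lock.pid
WHERE NOT blocked_lock.granted;
```

The "in transaction" sessions that show up as blockers are typically application code that opened a transaction and never committed.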
Re: [GENERAL] Modeling Friendship Relationships
On Nov 11, 2014, at 5:38 PM, Robert DiFalco wrote:

> Thoughts? Do I just choose one or is there a clear winner? TIA!

I prefer this model:

    user_id__a    INT NOT NULL REFERENCES user(id),
    user_id__b    INT NOT NULL REFERENCES user(id),
    is_reciprocal BOOLEAN,
    PRIMARY KEY (user_id__a, user_id__b)

If a relationship is confirmed (or dropped), I toggle is_reciprocal. Having that value saves a lot of work when joining or analyzing friendship sets.

If you have multiple relationship types, things get tricky. You can either:

- treat the row as a triplet (user_id__a, user_id__b, relationship_type_id) [I still recommend the reciprocal bool]
- if you have a finite set of relationship types, use each one as a bool column within the a2b row

I've tried the "one row per relationship" approach and didn't like it. The time savings on simple searches were marginal, and the SQL grew increasingly complex and slower to execute as we leveraged the table into other queries.
[GENERAL] index behavior question - multicolumn not consulted ?
I have a table with over 1MM records and 15 columns. I had created a unique index on a mix of two columns to enforce a constraint:

    (resource_type_id, lower(archive_pathname))

I've noticed that searches never use this. No matter what I query, even if it's only the columns in the index, I'm seeing a 550ms sequential scan on everything. If I create an index only on the text expression, lower(archive_pathname), all the queries use that and complete in 1.4ms.

Does anyone know why this happens?
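One common explanation is that a multicolumn btree index is generally only useful when the query constrains its leading column. A sketch using hypothetical names modeled on the post:

```sql
-- The constraint-enforcing index, with resource_type_id leading:
CREATE UNIQUE INDEX files_type_path_idx
    ON files (resource_type_id, lower(archive_pathname));

-- A query that constrains the leading column can use it:
SELECT * FROM files
 WHERE resource_type_id = 3
   AND lower(archive_pathname) = lower('/some/path');

-- A query on the second column alone generally cannot use it efficiently,
-- which would explain the fallback to a sequential scan:
SELECT * FROM files
 WHERE lower(archive_pathname) = lower('/some/path');
```

If the queries never filter on resource_type_id, a dedicated index on lower(archive_pathname) alone (as observed) is the right fit, and the multicolumn one can remain purely for the uniqueness constraint.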
Re: [GENERAL] faster way to calculate top "tags" for a "resource" based on a column
On Oct 7, 2014, at 10:02 AM, Marc Mamin wrote:

> Hi,
> it seems to me that your subquery may deliver duplicate ids.
> And with the selectivity of your example, I would expect an index usage
> instead of a table scan. You may check how up to date your statistics are
> and try to raise the statistic target on the column resource_2_tag.tag_id.
> Also try a CTE form for your query:

It shouldn't be able to deliver duplicate ids:

=> SELECT COUNT(*) FROM (SELECT DISTINCT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

=> SELECT COUNT(*) FROM (SELECT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

However, adding in the DISTINCT drastically changed the query plan and did give a speedup.

Your comment made me focus on the notion of a table scan. I had assumed the seq scan was unavoidable -- and that there would not be much savings otherwise -- because the table is just 2 ids. I was wrong. I noticed that I never put a PRIMARY KEY constraint on that table. So I tried adding a PRIMARY KEY constraint, then running VACUUM ANALYZE... and that solved all my problems. The original query ended up being the fastest, at 260ms (down from 1760ms):

  Join                   - 260ms
  Subquery w/DISTINCT    - 300ms
  CTE                    - 330ms
  CTE w/DISTINCT         - 345ms
  Subquery (no DISTINCT) - 1500ms
Re: [GENERAL] faster way to calculate top "tags" for a "resource" based on a column
On Oct 6, 2014, at 5:56 PM, Jim Nasby wrote:

> Don't join to the resource table; there's no reason to because you're not
> pulling anything from it.

Thanks for the reply! I'm not pulling anything from the resource table, but the join is necessary because I'm filtering on it (see the WHERE clause). I'm not trying to find the most-used tags overall, but the ones used by resources matching a (variable) id on a column in the resource table.
[GENERAL] faster way to calculate top "tags" for a "resource" based on a column
I've been able to get most of my slow queries down to something more acceptable, but I haven't been able to shave any time off this one. I'm hoping someone has another strategy.

I have 2 tables:

    resource
    resource_2_tag

I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources that match a given attribute on the "resource" table. Both tables have around 1.6 million records.

If the database needs to warm up and read into cache, this can take 60 seconds to read the data off disk. If the database doesn't need to warm up, it averages 1.76 seconds. The 1.76s time is troubling me; searching for the discrete elements of this is pretty lightweight.

Here's an explain -- http://explain.depesz.com/s/PndC

I tried a subquery instead of a join, and the planner optimized it to the same plan. I'm hoping someone will see something that I just don't see.

       Table "public.resource_2_tag"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 resource_id | integer |
 tag_id      | integer |
Indexes:
    "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
    "_idx_speed_resource_2_tag__tag_id" btree (tag_id)

              Table "public.resource"
         Column         |  Type   | Modifiers
------------------------+---------+-------------------------------------------------------
 id                     | integer | not null default nextval('resource_id_seq'::regclass)
 resource_attribute1_id | integer |
 lots of other columns  |         |
Indexes:
    "resource_attribute1_idx" btree (resource_attribute1_id)

select count(*) from resource;                                    -- 1669729
select count(*) from resource_2_tag;                              -- 1676594
select count(*) from resource where resource_attribute1_id = 614; -- 5184 -- 4.386ms
select id from resource where resource_attribute1_id = 614;       -- 5184 -- 87.303ms

Popping the 5k elements into an IN clause runs the query in around 100ms.

EXPLAIN ANALYZE
SELECT resource_2_tag.tag_id AS resource_2_tag_tag_id,
       count(resource_2_tag.tag_id) AS counted
FROM resource_2_tag
JOIN resource ON resource.id = resource_2_tag.resource_id
WHERE resource.resource_attribute1_id = 614
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25 OFFSET 0;

Limit  (cost=76659.61..76659.68 rows=25 width=4) (actual time=1502.902..1502.913 rows=25 loops=1)
  ->  Sort  (cost=76659.61..76672.47 rows=5141 width=4) (actual time=1502.900..1502.906 rows=25 loops=1)
        Sort Key: (count(resource_2_tag.tag_id))
        Sort Method: top-N heapsort  Memory: 26kB
        ->  HashAggregate  (cost=76463.13..76514.54 rows=5141 width=4) (actual time=1487.016..1495.206 rows=13887 loops=1)
              ->  Hash Join  (cost=35867.88..76437.42 rows=5141 width=4) (actual time=97.654..1453.337 rows=27068 loops=1)
                    Hash Cond: (resource_2_tag.resource_id = resource.id)
                    ->  Seq Scan on resource_2_tag  (cost=0.00..25847.94 rows=1676594 width=8) (actual time=0.032..513.046 rows=1676594 loops=1)
                    ->  Hash  (cost=35803.88..35803.88 rows=5120 width=4) (actual time=97.576..97.576 rows=5184 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 183kB
                          ->  Bitmap Heap Scan on resource  (cost=272.68..35803.88 rows=5120 width=4) (actual time=5.911..90.264 rows=5184 loops=1)
                                Recheck Cond: (resource_attribute1_id = 614)
                                ->  Bitmap Index Scan on resource_attribute1_idx  (cost=0.00..271.40 rows=5120 width=0) (actual time=3.575..3.575 rows=5184 loops=1)
                                      Index Cond: (resource_attribute1_id = 614)
Total runtime: 1503.146 ms
Re: [GENERAL] installing on mac air development machine
On Oct 2, 2014, at 7:30 PM, john gale wrote:

> The GUI installer for Mac OS X downloaded from postgresql.org works fine.

Unless you NEED to build from source, use the GUI installer. Unless you are already on a system where installing from Fink/MacPorts/source is commonplace, you're going to spend more time installing and configuring the environment than you will using the application.
[GENERAL] Benching Queries
Does anyone have a good solution for benching queries under various conditions and collecting the EXPLAIN data? I looked at pgbench, but it doesn't seem to be what I want.

My situation is this:

- For a given query, there are 3-5 different ways that I can run it.
- Each form of the query has a completely different execution plan and query time, often using different indexes.
- The same query runs differently on the first execution vs subsequent ones (when the indexes/tables are already in memory).

My goal is to find an overall balance of query time (cold-start vs in-memory) and number of indexes.
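For what it's worth, a fair amount of this can be done from psql alone; the query here is only illustrative:

```sql
\timing on

-- BUFFERS distinguishes warm runs from cold ones: "shared hit" blocks came
-- from Postgres' buffer cache, while "read" blocks came from the OS/disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM resource WHERE resource_attribute1_id = 614;
```

Running each query form a few times and comparing the first run's buffer reads against later runs' buffer hits gives a rough cold-vs-warm picture without restarting the server, though true cold-cache numbers still require dropping the OS cache between runs.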
Re: [GENERAL] improving speed of query that uses a multi-column "filter" ?
On Oct 1, 2014, at 12:34 AM, Misa Simic wrote:

> Have you considered maybe partial indexes?
>
> http://www.postgresql.org/docs/9.3/static/indexes-partial.html
>
> I.e idx1 on pk column of the table with where inside index exactly the same
> as your first where
>
> Idx2 on pk column with where inside index as second where

That was actually my first attempt, and I was hoping it would work. Unfortunately, there is always something in the queries that keeps Postgres using other (slower) indexes or jumping to a sequential scan. I haven't been able to trick the planner into using the partial index, and most online resources suggested it wasn't possible.
Re: [GENERAL] improving speed of query that uses a multi-column "filter" ?
On Sep 30, 2014, at 8:04 PM, John R Pierce wrote:

> if col_1 IS NULL, then that OR condition doesn't make much sense. just
> saying...

I was just making a quick example. There are two commonly used "filter sets", each mostly on nullable bool columns -- but one checks whether the row references itself in a particular column.

> these 4 columns are all nullable booleans, so they can be TRUE, FALSE, or
> NULL ?

Most of them, yes.

> with 4 columns, there's 3^4 = 81 possible combinations of these values...
> you might get better speeds encoding this as a single SHORT INTEGER, and
> enumerating those 81 states, then just do equals or IN (set of values)
> conditions... of course, this might make a lot of OTHER code more
> complicated. It might be easier to make each col_X 2 bits of this integer,
> such that one bit indicates the value was 'NULL', and the other bit is the
> true/false state if that first bit isn't set; this would make testing
> individual bits somewhat better.

That's interesting. I never thought about how Postgres processes the data. For legacy reasons I can't change the data types, but I can add additional columns. So I could use a trigger/function to maintain an integer filter_test column, give each filter a bit value, and then just scan on that. It wouldn't be much more work to test that against dedicated bool columns for each filter.
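The two-bits-per-column encoding suggested above could be backfilled with something like this (table and column names are hypothetical; a trigger would keep the column current afterwards):

```sql
-- Two bits per tri-state column: NULL = 0, FALSE = 1, TRUE = 2.
-- col_1 occupies bits 0-1, col_2 bits 2-3, col_3 bits 4-5.
ALTER TABLE wide_table ADD COLUMN filter_bits SMALLINT;

UPDATE wide_table SET filter_bits =
      (CASE WHEN col_1 IS NULL THEN 0 WHEN col_1 THEN 2 ELSE 1 END)
    + (CASE WHEN col_2 IS NULL THEN 0 WHEN col_2 THEN 2 ELSE 1 END) * 4
    + (CASE WHEN col_3 IS NULL THEN 0 WHEN col_3 THEN 2 ELSE 1 END) * 16;

CREATE INDEX wide_table_filter_bits_idx ON wide_table (filter_bits);

-- A filter then collapses to an equality or IN() test on one indexed column,
-- e.g. "all three columns NULL" is simply:
SELECT * FROM wide_table WHERE filter_bits = 0;
```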
[GENERAL] improving speed of query that uses a multi-column "filter" ?
I'm trying to improve the speed of a suite of queries that go across a few million rows. They use 2 main "filters" across a variety of columns:

WHERE (col_1 IS NULL) AND (col_2 IS NULL) AND ((col_3 IS NULL) OR (col_3 = col_1))

WHERE (col_1 IS TRUE) AND (col_2 IS TRUE) AND (col_3 IS TRUE) OR (col_4 IS NULL)

I created a dedicated multi-column index for each query to speed them up. That was great, but I still don't have the performance where I want it to be -- the size of the index seems to be an issue. If the index were on one column instead of 4, I think the scans would complete in time.

I looked online and in the archives, and couldn't find much information on good strategies to deal with this. It looks like my best option is to somehow index the "interpretation" of this criteria, not the criteria itself. The two ways that come to mind are:

1. alter the table: add a boolean column for each filter-test, index that, then query on that field
2. leave the table as-is: write a custom function for each filter, and then use a function index

Has anyone else encountered a need like this? Are there any tips / tricks / things I should look out for? Are there better ways to handle this?
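Option 2 might be sketched like this, with a simplified filter and hypothetical names (the function must be IMMUTABLE to be usable in an index, and the query has to invoke the identical expression for the index to match):

```sql
-- One named function per filter set (simplified example condition).
CREATE FUNCTION passes_filter_a(c1 BOOLEAN, c2 BOOLEAN, c3 BOOLEAN)
RETURNS BOOLEAN AS $$
    SELECT c1 IS NULL AND c2 IS NULL AND c3 IS NULL;
$$ LANGUAGE sql IMMUTABLE;

-- A single-expression index over the function's result.
CREATE INDEX big_table_filter_a_idx
    ON big_table (passes_filter_a(col_1, col_2, col_3));

-- Queries must filter through the same expression:
SELECT * FROM big_table WHERE passes_filter_a(col_1, col_2, col_3);
```

Whether the planner actually chooses the index still depends on how selective the filter is; if most rows pass it, a sequential scan can legitimately win.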
Re: [GENERAL] Postgres as key/value store
On Sep 27, 2014, at 7:48 PM, snacktime wrote:

> The schema is that a key is a string, and the value is a string or binary. I
> am actually storing protocol buffer messages, but the library gives me the
> ability to serialize to native protobuf or to json. Json is useful at times
> especially for debugging.

I don't know if this will apply to you, but I got significant speed improvements on Postgres key searches by using substring indexes. If your keys are just random hashes, this will probably work well for you. If your keys are human-readable, it probably won't work as well, because the distribution of prefix characters will be too uniform.

The general idea is twofold:

1. Create an additional expression index on a prefix of the key field:

   CREATE INDEX _entities_id__subst_7 ON entities(substr(id,1,7));

2. Update your SELECTs to search for both the full string AND the substring:

   - WHERE id = :id
   + WHERE (id = :id) AND (substr(id,1,7) = substr(:id, 1, 7))

By adding the substring comparison, the planner will (usually) do a first pass on the substring index, then search that limited set for the rest of the matching criteria. On a table with 4MM+ records, introducing a substring index/query improved my searches by a few orders of magnitude. Before trying this indexing strategy, we were actively looking to migrate this particular query service off of Postgres -- it was such a bottleneck and was not scalable. Now there is no reason to leave in the foreseeable future.

On Sep 27, 2014, at 8:33 PM, Gavin Flower wrote:

>> This works well because keys are left prefixed with a scope, a delimiter,
>> and then the actual key for the data.
> Then I noticed that your id is actually a compound key, and probably would be
> better modelled as:

If you're able to factor the scope out, an index on (scope, substr(key,1,7)) might work well. I only used 1,7 as my substr arguments because that was the optimal speed/space mix on my dataset; depending on yours, a shorter or longer prefix might be more appropriate.
Re: [GENERAL] table versioning approach (not auditing)
On Sep 29, 2014, at 4:06 PM, Nick Guenther wrote: > A newbie tangent question: how do you access the transaction serial? Is it > txid_current() as listed in > http://www.postgresql.org/docs/9.3/static/functions-info.html? My implementations were ridiculously simple/naive in design, and existed entirely with under defined serials. i'd just create a new record + id on a write operation, and then use it when logging all operations. I had read up on a lot of (possibly better) ways to handle this using pg internals. They all seemed more advanced than I needed. > And does your implementation worry about multiple timelines? Not sure I understand this... but every object is given a revision id. edits between consecutive revisions are allowed, edits spanning multiple revisions are rejected. On Sep 29, 2014, at 5:25 PM, Abelard Hoffman wrote: > Felix & Jonathan: both of you mention just storing deltas. But if you do > that, how do you associate the delta record with the original row? Where's > the PK stored, if it wasn't part of the delta? The logic I decided on, is this: Revision 0 Only the original record is stored Revision 1 • Copy the original record into revision store Revision 1+ • Update the original record, store the deltas in the revision store The reason why I chose this path, is that in my system: • most records are not edited • the records that are edited, are heavily edited We use an ORM and it was simple to implement this pattern with it, and then write some functions in postgres to ensure it is adhered to. When I need to pull data out: • I can pull exact revisions out of the htstore for a given table/row using the revision ids as a key • the revisions all contain the transaction id • if i need to get more info about a given transaction, i can query the transactions table and get a list of all the objects that were edited within that transaction if i wanted to ensure referential integrity, i could have used a table instead of an hstore (or json). 
If the application grows much larger, it will probably be migrated to a model like that. This approach just gave a lot of flexibility, minimized tables in the database, and was very easy to pull off. I went with hstore because json didn't allow in-place updates at the time (I think it does now). -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
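The transaction-log-plus-delta scheme described above might be sketched roughly like this; all table and column names are invented for illustration, not taken from the original application:

```sql
-- Sketch only; names are hypothetical.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE transaction_log (
    txn_serial  SERIAL PRIMARY KEY,
    logged_at   TIMESTAMP NOT NULL DEFAULT now(),
    user_id     INT NOT NULL
);

CREATE TABLE revision_store (
    object_id   INT NOT NULL,
    revision_id INT NOT NULL,
    txn_serial  INT NOT NULL REFERENCES transaction_log(txn_serial),
    delta       HSTORE NOT NULL,   -- changed columns only (full copy at revision 1)
    PRIMARY KEY (object_id, revision_id)
);

-- pull an exact revision's delta for a given row:
SELECT delta
FROM revision_store
WHERE object_id = 849 AND revision_id = 3;

-- find everything edited within one transaction:
SELECT object_id, revision_id
FROM revision_store
WHERE txn_serial = 1001;
```

The primary key (object_id, revision_id) answers the "where's the PK stored?" question from the thread: it lives beside the delta, not inside it.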
Re: [GENERAL] table versioning approach (not auditing)
In the past, to accomplish the same thing, I've done this:

- store the data in hstore/json. Instead of storing snapshots, I store deltas. I've been using a second table, though, because it improved performance on reads and writes.
- use a "transaction" log. Every write session gets logged into the transaction table (serial, timestamp, user_id). All updates to the recorded tables include the transaction's serial. Then there is a "transactions" table that is just (transaction_serial, object_id, object_action).

Whenever I have needs for auditing or versioning, I can just query the transaction table for the records I want... then use that to grab the data out of hstore.

On Sep 28, 2014, at 10:00 PM, Abelard Hoffman wrote:

> Hi. I need to maintain a record of all changes to certain tables to assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).
>
> I have studied these two audit trigger examples:
> https://wiki.postgresql.org/wiki/Audit_trigger
> https://wiki.postgresql.org/wiki/Audit_trigger_91plus
>
> I've also read about two other approaches to versioning:
> 1. maintain all versions in one table, with a flag to indicate which is the current version
> 2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.
>
> My current implementation is based on the wiki trigger examples, using a single table and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.
>
> But there are two types of queries I need to run.
> 1. Find all changes made by a specific user
> 2. Find all changes related to a specific record
>
> #1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
>
> #2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).
>
> So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?
>
> Thanks
[GENERAL] advice sought - general approaches to optimizing queries around "event streams"
I have a growing database with millions of rows that track resources against an event stream. I have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans, and offer .05s partial scans. No matter how I restructure queries, I can't seem to get around a few bottlenecks, and I wanted to know if there were any tips/tricks from the community on how to approach them.

A simple form of my database would be:

-- 1k of
create table stream (
    id int not null primary key
);

-- 1MM of
create table resource (
    id int not null primary key,
    col_a bool,
    col_b bool,
    col_c text
);

-- 10MM of
create table streamevent (
    id int not null,
    event_timestamp timestamp not null,
    stream_id int not null references stream(id)
);

-- 10MM of
create table resource_2_stream_event (
    resource_id int not null references resource(id),
    streamevent_id int not null references streamevent(id)
);

Everything is running off of indexes; there are no seq scans. I've managed to optimize my queries by avoiding joins against tables, and turning the stream interaction into a subquery or CTE. Better performance has come from limiting the number of "stream events" (which are only the timestamp and resource_id off a joined table).

The bottlenecks I've encountered have primarily been:

1. When interacting with a stream, the ordering of event_timestamp and deduplicating of resources becomes an issue. I've figured out a novel way to work with the most recent events, but distant events are troublesome:

    using no limit, the query takes 3500 ms
    using a limit of 1, the query takes 320 ms
    using a limit of 1000, the query takes 20 ms

There is a dedicated index on event_timestamp (desc), and it is being used according to the planner... finding all the records is fine; merging-into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max + group by).

2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin based search). I thought about limiting the query by finding matching resources first, then locking it to an event stream, but scanning the entire table for a term takes about 10 seconds on an initial hit. Subsequent queries for the same terms end up using the cache, and complete within 20 ms. I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events.

I didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than I am with raw SQL right now. I'm hoping someone can enlighten me into looking at new ways to solve these problems. I think I've learned more about postgres/sql in the past 48 hours than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data, rather than optimizing the current queries.
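Not part of the original post, but the "most recent event per resource" deduplication described in bottleneck 1 is often expressed with DISTINCT ON rather than max + group by; a sketch against the schema given above (the stream id 42 is an invented example value):

```sql
-- Latest event per resource within one stream, using the tables
-- defined in the post. DISTINCT ON keeps the first row per
-- resource_id in the ORDER BY order.
SELECT DISTINCT ON (r2se.resource_id)
       r2se.resource_id,
       se.event_timestamp
FROM streamevent se
JOIN resource_2_stream_event r2se
  ON r2se.streamevent_id = se.id
WHERE se.stream_id = 42
ORDER BY r2se.resource_id, se.event_timestamp DESC;
```

Whether this beats the aggregate form depends on cardinalities and available indexes; it is one alternative to benchmark, not a guaranteed win.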
[GENERAL] how does full text searching tokenize words ? can it be altered?
I'm getting a handful of 'can not index words longer than 2047 characters' errors on my `gin` indexes.

1. does this 2047 character count correspond to tokens / indexed words?
2. if so, is there a way to lower this number?
3. is there a way to profile the index for the frequency of tokens?

(apologies in advance if this looks familiar; I posted this as part of a larger question last month -- everything but this was answered by the list, and I can't find answers to this online)
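On question 3, PostgreSQL ships a statistics function for exactly this; a sketch, where `docs` and `tsv` are placeholder names for a table and its tsvector column:

```sql
-- Most frequent tokens in an existing tsvector column.
-- "docs" and "tsv" are invented names; substitute your own.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT tsv FROM docs')
ORDER BY nentry DESC
LIMIT 20;
```

ts_stat() returns one row per distinct lexeme, with the number of documents (ndoc) and total occurrences (nentry), which also makes it easy to spot the abnormally long tokens that trigger the indexing error.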
Re: [GENERAL] How to store fixed size images?
On Jun 19, 2014, at 11:21 AM, Andy Colson wrote:

> I think it depends on how you are going to use them. I, for example, have lots of images that are served on a web page; after benchmarks I found it was faster to store them on the filesystem and let apache serve them directly.

I rarely store images like that locally now; I just toss them onto Amazon S3. When I did have to store lots of images locally, I found this to be the best method:

1. The Postgres record for the image is given a unique and random hash as a hexdigest
2. The image is saved onto a filesystem, into a directory mapped by the hexdigest

For example, there might be something like this:

Postgres:
    id  | filename  | hash
    001 | image.jpg | abcdef123

Filesystem:
    abc/def/123/abcdef123-image.jpg

nginx/apache rewrite rule:
    abcdef123-image.jpg -> abc/def/123/abcdef123-image.jpg

The reason for this has to do with the performance of various filesystems and issues with the distribution of digits in a sequence. It ties into Benford's Law ( http://en.wikipedia.org/wiki/Benford's_law ) as well. A handful of filesystems exhibit decreased performance as the number of items in a directory increases. A few years ago, 1k-4k items was a safe max -- but at 10x that, some filesystems really slowed. I think most modern filesystems are still quick in the 5-10k range.

A hash has more characters and a more even distribution than a series of numbers or natural-language filenames. And if you group a hexdigest into triplets, you get a max of 4096 files/folders in a directory, which is a decent sweet spot: 16 * 16 * 16 = 4096.

I haven't had to deal with this sort of stuff in almost 10 years now, but archiving content like this back then was a considerable improvement to filesystem performance and web serving.
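The triplet mapping described above can be computed directly in SQL; a sketch, assuming a hypothetical table `images(hash, filename)` matching the example rows in the post:

```sql
-- Shard a hexdigest into abc/def/123/abcdef123-image.jpg style paths.
-- "images", "hash", and "filename" are placeholder names.
SELECT substr(hash, 1, 3) || '/'
    || substr(hash, 4, 3) || '/'
    || substr(hash, 7, 3) || '/'
    || hash || '-' || filename AS fs_path
FROM images;
```

For the row (hash = 'abcdef123', filename = 'image.jpg') this yields 'abc/def/123/abcdef123-image.jpg', matching the rewrite rule shown above.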
Re: [GENERAL] two questions about fulltext searching / tsvector indexes
On Jun 10, 2014, at 8:26 AM, Vick Khera wrote:

Thanks so much for this. We do a lot of searching on this column, so pre-computing seems to be the way. I'm not worried about disk space for now, and can revisit that later if there is a problem.

Just for clarification on this:

Option A (less fast): create a gin index on to_tsvector(searchable_column)
Option B (faster): create a tsvector column computed from `searchable_column`, and create a gin index on that tsvector column

> On Mon, Jun 9, 2014 at 8:55 PM, Jonathan Vanasco wrote:
>> I can't figure out which one to use. This is on a steadily growing table of around 20MM rows that gets 20-80k new records a day, but existing records are rarely updated.
>
> The question as always is a time-space trade-off. How frequently do you make the full text search? If you do it frequently, then with a pre-computed tsv column you save all that time per row of computing the tsvector on every search. If you do it infrequently, the space savings (and not needing to maintain that column) may benefit you.
>
> Personally, in these days of cheap disks I'd go with the dedicated column. Given that, you want to just have a GIN index on that one column, and the query you want, given some plain text string like "fluffy dog", is this:
>
> select plainto_tsquery('fluffy dog') @@ my_tsv_column;
>
> I always use a trigger on insert and update to maintain the ts_vector column, so there is no doubt of how it was computed by various programs.

// Jonathan Vanasco c. 646.729.6436 | 415.501.9815 e. jonat...@2xlp.com w. http://findmeon.com/user/jvanasco linkedin. http://linkedin.com/in/jonathanvanasco blog. http://destructuring.net
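Option B is usually wired up with PostgreSQL's built-in trigger function; a sketch, assuming a table `mytable` with a text column `searchable_column` (the column and index names here are invented):

```sql
-- Dedicated tsvector column maintained by the built-in trigger.
ALTER TABLE mytable ADD COLUMN searchable_tsv tsvector;

-- backfill existing rows, then index the new column:
UPDATE mytable SET searchable_tsv = to_tsvector('english', searchable_column);
CREATE INDEX mytable_tsv_idx ON mytable USING gin(searchable_tsv);

-- keep the column current on every insert/update:
CREATE TRIGGER mytable_tsv_update
BEFORE INSERT OR UPDATE ON mytable
FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(searchable_tsv, 'pg_catalog.english', searchable_column);
```

Searches then test the column directly, as in the quoted reply: `plainto_tsquery('fluffy dog') @@ searchable_tsv`.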
[GENERAL] two questions about fulltext searching / tsvector indexes
I'm having some issues with fulltext searching. I've gone through the list archives and Stack Overflow, but can't seem to get exact answers; hoping someone can help. Thanks in advance, and apologies for these questions being rather basic -- I just felt the docs and some online posts were leading me into possibly making the wrong decision, and I want to make sure I'm doing this right.

1. I need to make both 'title' and 'description' searchable. What is the current proper way to index multiple columns of a table (ie, not one)? I've essentially seen the following in the docs, mailing list, and various websites:

A unified index:
    CREATE INDEX CONCURRENTLY unified_tsvector_idx ON mytable USING gin(to_tsvector('english', title || ' ' || description));

Individual indexes:
    CREATE INDEX CONCURRENTLY title_tsvector_idx ON mytable USING gin(to_tsvector('english', title));
    CREATE INDEX CONCURRENTLY description_tsvector_idx ON mytable USING gin(to_tsvector('english', description));

Using dedicated columns (one or more):
    ALTER TABLE / create trigger

I can't figure out which one to use. This is on a steadily growing table of around 20MM rows that gets 20-80k new records a day, but existing records are rarely updated.

2. I've been getting a handful of 'can not index words longer than 2047 characters' in my tests. If this 2047 character max is on tokens, is there a way to lower it? Or to profile the index for the distribution of tokens? I don't think we have to support any tokens larger than 20 chars or so.

3a. What should EXPLAIN ANALYZE show if it is using the index? I couldn't find an example.

3b. Depending on how I index the column, what do I need to pass into the query so that it uses the index?

    1. if the index is created like gin(to_tsvector('english', title)), do I have to search in this format?
       to_tsvector('english', title) @@ to_tsquery('english', 'dog');
    2. if I use an index like gin(to_tsvector('english', title || ' ' || description)), what is the correct way to query the database and let the planner know I want to use the index?
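On question 3b.2: the planner can only match an expression index when the query repeats the indexed expression essentially verbatim. A sketch for the unified index above:

```sql
-- To use gin(to_tsvector('english', title || ' ' || description)),
-- the WHERE clause must contain the same expression:
SELECT *
FROM mytable
WHERE to_tsvector('english', title || ' ' || description)
      @@ to_tsquery('english', 'dog');
```

If the expression in the query differs (say, a different concatenation order or text search configuration), the index is silently ignored and the planner falls back to other access paths.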
Re: [GENERAL] how can i bugfix "idle in transaction" lockups ?
begin w/o commit or rollback? and thanks. you've been very helpful! On Nov 30, 2010, at 2:21 PM, Merlin Moncure wrote: > Begin w/o commit is a grave application error and you should > consider reworking your code base so that it doesn't happen (ever).
[GENERAL] how can i bugfix "idle in transaction" lockups ?
On a project, I find myself continually finding the database locked up with "idle in transaction" connections. Are there any commands that will allow me to check exactly what was going on in that transaction? I couldn't find anything in the docs, and the project has decent traffic, so it's going to be an issue to log all statements so I can sift through the data by PID.
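For readers finding this thread today: on modern PostgreSQL (9.2+), pg_stat_activity exposes a `state` column that makes these sessions easy to spot; a sketch (on the releases current when this was posted, the information appeared as '<IDLE> in transaction' in the `current_query` column instead):

```sql
-- Find idle-in-transaction sessions, how long the transaction has
-- been open, and the last statement each session ran (9.2+).
SELECT pid, usename, xact_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```

The `query` column shows only the most recent statement, not the whole transaction; reconstructing the full statement history still requires statement logging.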
Re: [GENERAL] question about unique indexes
On May 10, 2010, at 6:29 AM, Alban Hertroys wrote: As the docs state and as others already mentioned, "Null values are not considered equal". Ah. I interpreted that wrong. I thought it applied to indexes differently. I'll have to experiment now...
[GENERAL] question about unique indexes
-- running pg 8.4

I have a table defining geographic locations:

    id
    lat
    long
    country_id not null
    state_id
    city_id
    postal_code_id

It was given a unique index on (country_id, state_id, city_id, postal_code_id). The unique index isn't working as I'd expect it to; I was hoping someone could explain why. In the two records below, only country_id and state_id are assigned (aside from the serial):

 geographic_location_id | coordinates_latitude | coordinates_longitude | country_id | state_id | city_id | postal_code_id
------------------------+----------------------+-----------------------+------------+----------+---------+----------------
                    312 |                      |                       |        233 |       65 |         |
                    443 |                      |                       |        233 |       65 |         |

I was under the expectation that the unique constraint would apply in this case. From the docs: "When an index is declared unique, multiple table rows with equal indexed values are not allowed. Null values are not considered equal. A multicolumn unique index will only reject cases where all indexed columns are equal in multiple rows."
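The behaviour in question can be reproduced with a minimal table (names invented to mirror the post):

```sql
CREATE TABLE loc (
    id             SERIAL PRIMARY KEY,
    country_id     INT NOT NULL,
    state_id       INT,
    city_id        INT,
    postal_code_id INT,
    UNIQUE (country_id, state_id, city_id, postal_code_id)
);

-- Both inserts succeed: NULLs are never considered equal to each
-- other, so these rows do not conflict under the unique constraint.
INSERT INTO loc (country_id, state_id) VALUES (233, 65);
INSERT INTO loc (country_id, state_id) VALUES (233, 65);

-- A common workaround is a unique expression index that maps NULL
-- to a sentinel, e.g.:
-- CREATE UNIQUE INDEX loc_uniq ON loc
--     (country_id, coalesce(state_id, 0), coalesce(city_id, 0),
--      coalesce(postal_code_id, 0));
```

The coalesce() variant makes two "all-NULL-tail" rows collide, at the cost of reserving a sentinel value that must never be a real id.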
Re: [GENERAL] trying to write a bit of logic as one query, can't seem to do it under 2
On Apr 21, 2010, at 9:38 PM, Glen Parker wrote: Not if qty_requested_available needs to be <= qty_available... indeed, i'm an idiot this week. thanks a ton. this really helped me out!
Re: [GENERAL] trying to write a bit of logic as one query, can't seem to do it under 2
it would be that, but with greatest. thank you -- that's the exact query I was failing to write! On Apr 21, 2010, at 8:51 PM, Glen Parker wrote:

UPDATE cart_item
SET qty_requested_available = least(cart_item.qty_requested, stock.qty_available)
FROM stock
WHERE cart_item.stock_id = stock.stock_id
  AND qty_requested_available <> least(cart_item.qty_requested, stock.qty_available);
[GENERAL] trying to write a bit of logic as one query, can't seem to do it under 2
I'm trying to write a bit of logic as 1 query, but I can't seem to do it under 2 queries; I'm hoping someone can help.

The basic premise is that I have an inventory management system, and am trying to update the quantity available in the "shopping cart" (which is different than the independently tracked quantity requested). The logic is fairly simple: cart items should show the quantity_requested as available if that number is <= the number of items in stock; otherwise they should show the max number of items available.

The solution I ended up with is to just update the cart_items with the entire quantity_available per product, and then fix that in a second pass. I'm wondering if this can be *efficiently* done within a single update statement. I couldn't figure out how to do this in a single update without making multiple queries to find the actual qty_available:

UPDATE cart_item
SET qty_requested_available = (SELECT qty_available FROM stock WHERE stock.id = stock_id);

UPDATE cart_item
SET qty_requested_available = CASE
    WHEN qty_requested_available > qty_requested THEN qty_requested
    ELSE qty_requested_available
END;
[GENERAL] is it possible to do an update with a nested select that references the outer update ?
A typo in a webapp left ~150 records damaged overnight. I was hoping to automate the fix, but may just use regex to make update statements for this.

Basically, I have this situation:

table a (main record): id, id_field, fullname
table b (extended profiles): id_field, last_name, first_name, middle_name, age, etc

id_field on table a was left null due to a typo with the ORM. I've tried many variations to automate it; none seem to work. I think this attempt most clearly expresses what I was trying to do:

UPDATE table_a a
SET id_field = (
    SELECT id_field
    FROM table_b b
    WHERE b.first_name || ' ' || b.last_name = a.fullname
)
WHERE id_field IS NULL;

I'd be grateful if anyone has a pointer.
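Not from the original thread, but this kind of backfill is often written with UPDATE ... FROM instead of a correlated subquery; a sketch using the (hypothetical) table and column names above, assuming the concatenated name matches fullname exactly and is unique per person:

```sql
-- Join-style update: fill the missing id_field from the profile
-- table by matching on the reconstructed full name.
UPDATE table_a AS a
SET id_field = b.id_field
FROM table_b AS b
WHERE b.first_name || ' ' || b.last_name = a.fullname
  AND a.id_field IS NULL;
```

One caveat worth checking first: if two profiles produce the same concatenated name, the FROM form silently picks one of them, whereas the correlated-subquery form errors out on more than one row, which may be the safer failure mode for a repair job.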
[GENERAL] concatenation issue ( 8.4 )
I have a table with name_first, name_middle, name_last. If I try concatenating as such:

SELECT name_first || ' ' || name_middle || ' ' || name_last FROM mytable;

I end up with NULL as the concatenated string whenever any of the referenced fields contain a NULL value. I tried some text conversion and explicit casting, but that didn't work. What am I doing wrong?
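This is expected SQL behaviour: || yields NULL if any operand is NULL. The usual fix is coalesce() (on 9.1+ there is also concat_ws(), but the subject mentions 8.4); a sketch:

```sql
-- coalesce() turns NULL parts into empty strings before concatenating.
-- Note: a NULL middle name still leaves a doubled space, which you
-- may want to clean up with trim()/regexp_replace().
SELECT coalesce(name_first, '') || ' '
    || coalesce(name_middle, '') || ' '
    || coalesce(name_last, '') AS full_name
FROM mytable;
```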
[GENERAL] does anyone know what to use in pg_hba.conf that will allow me to run cronjobs with pg_dump?
i think i just need a METHOD for localhost only. thanks.
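Not from the thread, but two common patterns for unattended pg_dump: a `local` line using `peer`/`ident` authentication (so a cron job running as the matching OS user needs no password at all), or password auth plus a `~/.pgpass` file. An illustrative pg_hba.conf line -- the database and user names are placeholders, and the method must fit your security policy:

```
# TYPE  DATABASE   USER       ADDRESS   METHOD
local   mydb       backupusr            peer
```

On older releases the equivalent method is spelled `ident sameuser`. Remember that pg_hba.conf lines are matched top to bottom, so this entry must appear before any broader rule that would catch the same connection.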
Re: [GENERAL] list all columns in db
Thank you Jon -- that's the exact sort of trick I was hoping for. Cheers! On Jun 7, 2007, at 6:36 PM, Jon Sime wrote: Jonathan Vanasco wrote: Does anyone have a trick to list all columns in a db? No trickery, just exploit the availability of the SQL standard information_schema views:

select table_schema, table_name, column_name
from information_schema.columns
where table_schema not in ('pg_catalog', 'information_schema')
order by 1, 2, 3

If you want an equivalent that uses pg_catalog (non-portable outside of PostgreSQL) you could instead do:

select n.nspname as table_schema, c.relname as table_name, a.attname as column_name
from pg_catalog.pg_attribute a
join pg_catalog.pg_class c on (a.attrelid = c.oid)
join pg_catalog.pg_namespace n on (c.relnamespace = n.oid)
where c.relkind in ('r', 'v')
  and a.attnum > 0
  and n.nspname not in ('pg_catalog', 'information_schema')
order by 1, 2, 3

-Jon -- Senior Systems Developer Media Matters for America http://mediamatters.org/ // Jonathan Vanasco | CEO/Founder SyndiClick Networks | FindMeOn.com - The cure for Multiple Web Personality Disorder | Web Identity Management and 3D Social Networking | RoadSound.com - Tools For Bands, Stuff For Fans | Collaborative Online Management And Syndication Tools ---(end of broadcast)--- TIP 6: explain analyze is your friend
[GENERAL] list all columns in db
Does anyone have a trick to list all columns in a db? I need to audit a few dbs to make sure column & table names are adhering to our standard semantic syntax. I figure there has to be an old pg-admin trick out there to display a db like %(tname)s . %(cname) or some similar format.
Re: [GENERAL] Schema as versioning strategy
On Apr 25, 2007, at 2:05 PM, Richard Huxton wrote:

> Owen Hartnett wrote:
>> I want to "freeze" a snapshot of the database every year (think of end of year tax records). However, I want this frozen version (and all the previous frozen versions) available to the database user as read-only. My thinking is to copy the entire public schema (which is where all the current data lives) into a new schema, named 2007 (2008, etc.)
>
> Sounds perfectly reasonable. You could either do it as a series of:
>
> CREATE TABLE archive2007.foo AS SELECT * FROM public.foo;
>
> or do a pg_dump of schema "public", tweak the file to change the schema names, and restore it.

The CREATE TABLE ... AS method won't copy the constraints + fkeys. I think you're best off with a pg_dump.
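A third option sometimes used for this kind of yearly freeze (not from the thread; sketch only): rename the live schema to the archive name, which keeps every constraint and foreign key within it intact, then rebuild public from a dump:

```sql
-- Move the live schema aside under its archive name...
ALTER SCHEMA public RENAME TO archive2007;

-- ...then recreate an empty public schema and repopulate it,
-- e.g. by restoring a pg_dump taken just before the rename.
CREATE SCHEMA public;
```

The trade-off: the rename is instant and preserves everything, but the database briefly has no live public schema until the restore completes, so it suits a maintenance window rather than a hot cutover.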
[GENERAL] conditional joins and views
Is it possible at all to use conditional joins in views? ie:

select a.*, b.*
from a
inner join b ON ((a.id = b.id) AND (a.otherfield <> ?))

I have a few 15-20 table joins that I'd like to push into views. I've never wanted to push something with a conditional join into a view before, so am at a loss on this being a possibility. Seeing little documentation on this, I'm thinking it's not possible and I'll have to use a function.
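For what it's worth: extra predicates in an ON clause are perfectly legal inside a view; what a view cannot hold is the `?` placeholder -- the comparison value must be fixed (a parameterized version needs a set-returning function instead). A sketch, where 'archived' is an invented value standing in for the placeholder:

```sql
-- Conditional join inside a view; columns are aliased because
-- "a.*, b.*" would produce duplicate column names in a view.
CREATE VIEW a_b_joined AS
SELECT a.id AS a_id,
       a.otherfield,
       b.id AS b_id
FROM a
INNER JOIN b
    ON (a.id = b.id AND a.otherfield <> 'archived');
```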
Re: [GENERAL] unique constraint on 2 columns
On Apr 20, 2007, at 6:13 PM, Jeff Davis wrote:

> This is more correct structure, and yes, it would involve a join.

I know that's the 'more correct' way -- but I can't do the join, which is why I posted about a 2 column unique index. I tested with a join before posting -- I have an already large table that is growing quickly. In order to use the join and keep current performance, I'd need to scale out in hardware, which is just not an option right now. Searching 100M records vs searching 100M records + a join is a huge difference. When you try to do analytics, it's just not appropriate in my situation.

> No, it does not mean you need to rewrite anything. Use a view; that's the great benefit you get from using a relational database like PostgreSQL.

I'd have to rewrite everything that reads from that table to use the view instead of the current query, and then worry about inserts.
Re: [GENERAL] unique constraint on 2 columns
On Apr 20, 2007, at 5:43 PM, Vladimir Zelinski wrote:

> This looks like more of a table design problem than a database limitation. One column should accommodate values from both columns, with a unique index built on this column. Your requirements tell me that these values are of the same nature and should be placed in the same column. To distinguish between them, use another column to put an attribute.

No, both values can be present at once. They're both external-facing guids that point to the same resource and serve as keys for the table data. Some rows have one, some have two.

In regards to a table design solution, if I redid anything it would be something like:

table_main
    main_id

table_main_2_guid
    main_id
    guid_id unique
    context_id

But then I'm using 2 tables and have to join -- which means I need to rewrite everything that queries this table, which is both the core table for my application and supposed to be free of any multi-table queries for simple reads. So I'm going to try the trigger route.
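The trigger route mentioned above might look roughly like this; a sketch only -- as written it is not safe under concurrent inserts (two sessions can pass the check simultaneously), so a real version needs explicit locking or serialization on top:

```sql
-- Reject any new name that already appears in either column of test_a.
-- Sketch: an UPDATE-safe version would also exclude the row's own id.
CREATE FUNCTION check_names_unique() RETURNS trigger AS $$
BEGIN
    IF EXISTS (
        SELECT 1 FROM test_a
        WHERE NEW.name_1 IN (name_1, name_2)
           OR NEW.name_2 IN (name_1, name_2)
    ) THEN
        RAISE EXCEPTION 'name already in use';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_a_names_unique
BEFORE INSERT OR UPDATE ON test_a
FOR EACH ROW EXECUTE PROCEDURE check_names_unique();
```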
[GENERAL] unique constraint on 2 columns
I need a certain unique constraint in pg that I can't figure out. Given:

create table test_a (
    id serial,
    name_1 varchar(32),
    name_2 varchar(32)
);

I need name_1 and name_2 to both be unique, so that:

    name_1 never appears in name_1 or name_2
    name_2 never appears in name_2 or name_1

A standard 2 column unique index / constraint will not accomplish this.
Re: [GENERAL] making a pg store of 'multiple checkboxes' efficient
On Apr 18, 2007, at 4:28 AM, Alban Hertroys wrote:

> I got some good results using bitwise operations on an integer column. An index on such a column helps a great deal, of course. What type of integer you need depends on how many booleans you (expect to) have. My operations were like "WHERE (value & 80) = 80" to match against the 7th and 5th bits, with value an int4 column. Query times are in the tens of milliseconds range. Admittedly I only got about that many records through joins with other tables I needed (meaning that my result set may at some point in time have been around as big as your data set), so the planner may have reduced the number of bitwise operations significantly for me. The actual number of bitwise values to compare was around 40,000 integers * 25 mask values, but that got joined with over 1M records from another result set. A bitwise operator on a single column should (theoretically) have less overhead than integer/boolean operators on multiple columns. Computers are good at bitwise operations, after all.

Computers are good at bitwise operations, but software often has scary implementations :) Thanks for the input. I'll definitely go this route. It was my first thought, but there is almost no documentation out there for this type of storage. 
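A sketch of the bitwise approach discussed above; the table, column, and flag assignments are invented for illustration:

```sql
-- One int column holds up to 32 checkbox bits.
ALTER TABLE account ADD COLUMN flags INT NOT NULL DEFAULT 0;

-- Assign each checkbox a power of two: option_1 = 1, option_2 = 2,
-- option_3 = 4, ... Setting option_2 on a row:
UPDATE account SET flags = flags | 2 WHERE id = 1;

-- Rows with both bit 16 and bit 64 set (80 = 16 | 64, matching the
-- "(value & 80) = 80" example quoted above):
SELECT id FROM account WHERE (flags & 80) = 80;

-- A plain btree index cannot serve "&" predicates directly; one
-- option is a partial index per frequently queried mask:
CREATE INDEX account_flag80_idx ON account (id) WHERE (flags & 80) = 80;
```

The partial-index trick trades flexibility for speed: each hot mask needs its own index, but lookups on that mask become ordinary index scans.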
[GENERAL] making a pg store of 'multiple checkboxes' efficient
I have a large table (5M items currently) that is projected to grow at the rate of 2.5M a month, looking at current usage trends. The table represents some core standardized user account attributes, while text-heavy / unstandardized info lives in other tables.

My issue is this: I'm adding in a 'multiple checkboxes' style field, and trying to weigh the options for db representation against one another. My main concern is speed -- this is read-heavy -- but I am worried to some degree about disk space (not sure where disk space fits in with pg; when I used to use mysql, the simplest schema change could drastically affect the disk size).

That said, these are my current choices:

option a: bitwise operations
    and/or operations to condense checkboxes into a searchable field
    pro: super small, fits in 1 table
    con: could not find any docs on the speed of bitwise searches in pg

option b: secondary table with bools
    create table extends (account_id, option_1_bool, option_2_bool)
    pro: 1 join, fast search on bools
    con: PITA to maintain/extend

option c: mapping table
    create table mapping (account_id, option_id)
    pro: extensible
    con: slow speed -- needs multiple joins, records all over

I'd personally lean towards option a or b. Anyone have suggestions? Thanks. 