[GENERAL] index duplicates primary key, but is used more?
i'm doing a performance audit and noticed something odd. we tested a table a while back by creating lots of indexes that match different queries (30+). for simplicity, here's a two-column table:

CREATE TABLE foo (
    id INT PRIMARY KEY,
    value INT NOT NULL DEFAULT 0
);

The indexes were generated by a script, so we had things like:

CREATE INDEX idx_test_foo_id_asc ON foo(id ASC);
CREATE INDEX idx_test_foo_id_desc ON foo(id DESC);
CREATE INDEX idx_test_foo_value_asc ON foo(value ASC);
CREATE INDEX idx_test_foo_value_desc ON foo(value DESC);

What I noticed when checking stats earlier is that although `idx_test_foo_id_asc` is the same as the PKEY... it was used about 10x more than the pkey. Does anyone know if this is just random (perhaps due to the name sorting earlier) or if there is some other reason that index would be selected? my concern in deleting it is that it might be preferred for queries due to hinting from the explicit 'order by' (even though the contents are the same) and I may lose an index being leveraged in that query. It's on a GIANT table, so it would be hard to recreate. -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
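For anyone comparing duplicate indexes like this, a quick way to see relative usage is the statistics collector view (a sketch using the example table above; counters are cumulative since the last stats reset):

```sql
-- Compare how often each index on foo has been scanned.
SELECT indexrelname, idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE relname = 'foo'
ORDER BY idx_scan DESC;
```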
Re: [GENERAL] improvements/feedback sought for a working query that looks a bit ugly and might be inefficient
On May 16, 2017, at 10:20 PM, David G. Johnston wrote: > Unless you can discard the 5 and 1000 limits you are going to be stuck > computing rank three times in order to compute and filter them. Thanks a ton for your insight. I'm stuck using them (5 is required for throttling; 1000 is required for this to run in a reasonable amount of time). The overhead of computing things is indeed super small. I'm not really worried much about the performance of this query (it runs around 3ms now, down from 20+s). I'm more worried about this code being referenced and a (possibly improper) idiom being used on queries where it will have a noticeable effect.
[GENERAL] improvements/feedback sought for a working query that looks a bit ugly and might be inefficient
Everything here works fine - but after a handful of product iterations & production adjustments, a query that handles a "task queue" across a few tables looks a bit ugly. I'm wondering if anyone can see obvious improvements. There are 3 tables:

    upstream_provider
    task
    task_queue

Originally we needed to select 50 items off the top of the queue at a time. Then we needed to set a max of 5 tasks per upstream provider (it's a cheap way to handle throttling). The table is quite large, so a limit of the last 1000 items drastically improved performance. The query got ugly when we needed to add a "priority" toggle to the queue -- basically to mark things as "process ASAP". The only way I could figure out how to do that was to add a sort -- on "is_priority DESC NULLS LAST". My concern is that the sort needs to happen 3x:

    - in the subselect for 1000 items
    - in the partition for row numbering
    - in the final sort

If anyone has a moment to look this over and suggest anything, I'd be very thankful. A working reproduction is below.
-- SQL ---

CREATE TABLE upstream_provider (
    id SERIAL PRIMARY KEY,
    name VARCHAR(32),
    is_paused BOOLEAN DEFAULT NULL
);
CREATE TABLE task (
    id SERIAL PRIMARY KEY,
    upstream_provider_id INT NOT NULL REFERENCES upstream_provider(id),
    name VARCHAR(32)
);
CREATE TABLE task_queue (
    id SERIAL PRIMARY KEY,
    task_id INT NOT NULL REFERENCES task(id),
    -- only here because it eliminates expensive joins elsewhere
    upstream_provider_id INT NOT NULL REFERENCES upstream_provider(id),
    processing_status BOOLEAN DEFAULT NULL,
    is_priority BOOLEAN DEFAULT NULL
);

SELECT partition1.*, task.*
FROM (
    SELECT window1.*,
           row_number() OVER (PARTITION BY window1.upstream_provider_id
                              ORDER BY window1.is_priority DESC NULLS LAST,
                                       window1.task_queue_id
                             ) AS rownum
    FROM (
        SELECT qu.id AS task_queue_id,
               qu.upstream_provider_id,
               qu.task_id,
               qu.is_priority
        FROM task_queue qu
        JOIN upstream_provider
          ON qu.upstream_provider_id = upstream_provider.id
        WHERE (qu.processing_status IS NULL)
          AND (upstream_provider.is_paused IS NOT TRUE)
        ORDER BY is_priority DESC NULLS LAST, qu.id DESC
        LIMIT 1000
    ) window1
) partition1
JOIN task ON partition1.task_id = task.id
WHERE partition1.rownum < 5
ORDER BY is_priority DESC NULLS LAST, task_queue_id DESC
LIMIT 50;
Re: [GENERAL] why isn't this subquery wrong?
thanks all! On Apr 20, 2017, at 6:42 PM, David G. Johnston wrote: > Subqueries can see all columns of the parent. When the subquery actually > uses one of them it is called a "correlated subquery". i thought a correlated subquery had to note that table/alias, not a raw column. I guess i've just been adhering to good form. On Apr 20, 2017, at 6:43 PM, Tom Lane wrote: > Cautious SQL programmers qualify all references inside sub-selects to avoid > getting caught by this accidentally. is there a syntax to qualify a reference to lock a subquery to the current scope (disable looking at the parents)? that's how I got caught on this by accident.
[GENERAL] why isn't this subquery wrong?
I ran into an issue while changing a database schema around. Some queries still worked, even though I didn't expect them to. Can anyone explain to me why the following is valid (running 9.6)?

schema:

CREATE TEMPORARY TABLE example_a__data (
    foo_id INT,
    bar_id INT
);
CREATE TEMPORARY TABLE example_a__rollup_source (
    id int primary key,
    name varchar(64),
    foo_id INT,
    check_bool BOOLEAN
);
CREATE TEMPORARY TABLE example_a__rollup AS
    SELECT id, name, foo_id
    FROM example_a__rollup_source
    WHERE check_bool IS TRUE
;

query:

SELECT foo_id
FROM example_a__data
WHERE foo_id IN (SELECT bar_id FROM example_a__rollup)
;

a raw select of `SELECT bar_id FROM example_a__rollup;` will cause an error because bar_id doesn't exist. postgres doesn't raise an error on the query above because example_a__data does have a bar_id -- but example_a__rollup doesn't, and there's no explicit correlation in the query. can someone explain why this happens? i'm guessing there is a good reason -- but I'm unfamiliar with the type of implicit join/queries this behavior is enabling.
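For reference, the cautious rewrite that qualifies every reference inside the subquery makes the mistake visible instead of silently correlating (a sketch against the tables above):

```sql
-- Qualifying bar_id with the subquery's own alias disables the implicit
-- fallback to the outer query's columns, so this now fails with
-- ERROR: column r.bar_id does not exist
SELECT d.foo_id
FROM example_a__data d
WHERE d.foo_id IN (SELECT r.bar_id FROM example_a__rollup r);
```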
Re: [GENERAL] disk writes within a transaction
On Feb 17, 2017, at 4:05 PM, Jeff Janes wrote: > It will probably be easier to refactor the code than to quantify just how > much damage it does. Thanks for all the info. It looks like this is something worth prioritizing because of the effects on indexes. We had discussed a fix and pointed it; rewriting the code that causes this is pretty massive, and will require blocking out a resource full-time for 2 weeks on rewrites and testing. We don't really have time to spare any of those devs, so time to make product tradeoffs ;(
[GENERAL] appropriate column for storing ipv4 address
I have to store/search some IP data in Postgres 9.6 and am second-guessing my storage options. Would anyone mind giving this a quick look for me? Right now I have two tables, and am just using cidr for both:

create table tracked_ip_address (
    id SERIAL primary key,
    ip_address CIDR not null
);
create table tracked_ip_block (
    id SERIAL primary key,
    block_cidr CIDR not null,
    ownership_data TEXT
);

The types of searching I'm doing:

1. on tracked_ip_address, I'll search for neighboring ips, e.g.

select * from tracked_ip_address where ip_address << '192.168'::CIDR;
select * from tracked_ip_address where ip_address << '192.168.1'::CIDR;

2. on tracked_ip_block, i search/join against tracked_ip_address to show known ips in a block, or a known block for an ip.

i used cidr instead of inet for the ip_address because it saved me a cast on joins and appears to work the same. was that the right move? is there a better option? thanks in advance. / jonathan
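One thing worth checking (assuming the containment searches above need index support at your row counts): since 9.4, inet/cidr columns can be GiST-indexed via the built-in inet_ops operator class, which serves the << operator directly:

```sql
-- inet_ops lets a GiST index answer the network-containment operators
-- (<<, <<=, >>, >>=, &&) instead of seq-scanning tracked_ip_address.
CREATE INDEX idx_tracked_ip_address_inet
    ON tracked_ip_address USING gist (ip_address inet_ops);
```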
Re: [GENERAL] recursive query too big to complete. are there any strategies to limit/partition?
On Jan 26, 2017, at 7:07 PM, David G. Johnston wrote: > Thinking aloud - why doesn't just finding every record with 5 descendants > not work? Any chain longer than 5 would have at least 5 items. Oh it works. This is why I ask these questions -- new perspectives! > Even without recursion you could build out a five-way self-join and any > records that make it that far are guilty. I suppose this assumes your setup > is non-cyclic. There could be cyclic records, but that's easy to filter out. A first approach took 40 seconds to run. A little tweaking is necessary, but this is a great start. THANK YOU! You saved me!
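For the archives, a sketch of the non-recursive five-way self-join David suggests. The column names here (id, parent_id) and table name t are my assumptions, since the real schema wasn't posted:

```sql
-- Any row that anchors a chain of at least 5 links is over the limit;
-- cyclic chains can be filtered out of the result afterward.
SELECT DISTINCT t1.id
FROM t t1
JOIN t t2 ON t2.parent_id = t1.id
JOIN t t3 ON t3.parent_id = t2.id
JOIN t t4 ON t4.parent_id = t3.id
JOIN t t5 ON t5.parent_id = t4.id;
```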
[GENERAL] recursive query too big to complete. are there any strategies to limit/partition?
There are over 20 million records in a self-referential database table, where one record may point to another record as a descendant. Because of a bug in application code, there was no limit on recursion. The max was supposed to be 4. A few outlier records have between 5 and 5000 descendants (there could be more; I manually found one chain of 5000). I need to find all the chains of 5+ and mark them for update/deletion. While the database is about 10GB, the recursive search is maxing out on diskspace and causing a failure (there was over 100GB of workspace free). Is there any way to make a recursive query work, or will I have to use another means and just iterate over the entire dataset (either in postgres or an external service)?
Re: [GENERAL] efficiently migrating 'old' data from one table to another
On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote: > On Thu, Jan 12, 2017 at 2:19 PM, bto...@computer.org > wrote: >> >> Review manual section 7.8.2. Data-Modifying Statements in WITH >> >> >> https://www.postgresql.org/docs/9.6/static/queries-with.html > > this. > > with data as (delete from foo where ... returning * ) insert into > foo_backup select * from data; Thanks, btober and merlin. that's exactly what i want. On Jan 12, 2017, at 4:45 PM, Adrian Klaver wrote: > Maybe I am missing something, but why do the UPDATE? > Why not?: > ... > With an index on record_timestamp. That's actually the production deployment that we're trying to optimize. Depending on the size of the table (rows, width) it performs "less than great", even with the index on record_timestamp. The UPDATE actually worked faster in most situations. I honestly don't know why (the only thing that makes sense to me is server-load)... but the update + bool test ended up being (much) faster than the timestamp comparison.
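Applied to the thread's table_a__live/table_a__archive schema, the single-statement version would look something like this (a sketch, assuming the archive columns match the live table):

```sql
-- Move stale rows in one statement: the DELETE and the INSERT see the
-- same snapshot, so no is_migrate hint column or UPDATE pass is needed.
WITH moved AS (
    DELETE FROM table_a__live
    WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC'
                             - INTERVAL '1 month'
    RETURNING column_1, column_2, record_timestamp
)
INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
SELECT column_1, column_2, record_timestamp FROM moved;
```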
[GENERAL] efficiently migrating 'old' data from one table to another
I'm just wondering if there's a more efficient way of handling a certain periodic data migration. We have a pair of tables with this structure:

table_a__live
    column_1 INT
    column_2 INT
    record_timestamp TIMESTAMP

table_a__archive
    column_1 INT
    column_2 INT
    record_timestamp TIMESTAMP

periodically, we must migrate items that are 'stale' from `table_a__live` to `table_a__archive`. The entries are copied over to the archive, then deleted. The staleness is calculated based on age -- so we need to use INTERVAL. the "live" table can have anywhere from 100k to 20MM records. the primary key on `table_a__live` is a composite of column_1 & column_2. In order to minimize scanning the table, we opted to hint migrations with a dedicated column:

ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE is_migrate IS NOT NULL;

so our migration is then based on that `is_migrate` column:

BEGIN;
UPDATE table_a__live SET is_migrate = TRUE
    WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
    SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
DELETE FROM table_a__live WHERE is_migrate IS TRUE;
COMMIT;

The inserts & deletes are blazing fast, but the UPDATE is a bit slow from postgres re-writing all the rows. can anyone suggest a better approach? I considered copying everything to a tmp table then inserting/deleting based on that table -- but there's a lot of disk-io on that approach too. fwiw we're on postgres 9.6.1
Re: [GENERAL] temporarily disable autovacuum on a database or server ?
On Jan 11, 2017, at 8:19 PM, Melvin Davidson wrote: > > Yes, you're right about ALTER SYSTEM. Unfortunately, the op provided neither > PostgreSQL version nor O/S, so we can't even be sure that is > an option. That is why I stated "I cannot confirm". I didn't think that would matter, but postgres 9.6.1 and ubuntu 16.04 anyways, thanks. i'll test that approach.
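For the archives, the ALTER SYSTEM route doesn't need a restart, since autovacuum is a reload-time setting (a sketch; requires superuser, doesn't cancel anti-wraparound vacuums, and any worker already running finishes its current table):

```sql
-- Turn the autovacuum launcher off for the maintenance window,
-- with only a configuration reload.
ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();

-- ... run the maintenance/backup scripts ...

ALTER SYSTEM RESET autovacuum;
SELECT pg_reload_conf();
```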
[GENERAL] temporarily disable autovacuum on a database or server ?
I've run into a performance issue, and I think autovacuum may be involved. does anyone know if it's possible to temporarily stop autovacuum without a server restart? It seems that it either requires a server restart, or specific tables to be configured. Several times a day/week, I run a handful of scripts to handle database maintenance and backups:

* refreshing materialized views
* calculating analytics/derived/summary tables and columns
* backing up the database (pg_dumpall > bz2 > archiving)

These activities have occasionally overlapped with autovacuum, and the performance seems to be affected.
Re: [GENERAL] Improve PostGIS performance with 62 million rows?
On Jan 9, 2017, at 12:49 PM, Israel Brewster wrote: > Planning time: 4.554 ms > Execution time: 225998.839 ms > (20 rows) > > So a little less than four minutes. Not bad (given the size of the database), > or so I thought. > > This morning (so a couple of days later) I ran the query again without the > explain analyze to check the results, and noticed that it didn't take > anywhere near four minutes to execute. So I ran the explain analyze again, > and got this: ... > Planning time: 0.941 ms > Execution time: 9636.285 ms > (20 rows) > > So from four minutes on the first run to around 9 1/2 seconds on the second. > Presumably this difference is due to caching? I would have expected any > caches to have expired by the time I made the second run, but the data *is* > static, so I guess not. Otherwise, I don't know how to explain the > improvement on the second run - the query plans appear identical (at least to > me). *IS* there something else (for example, auto vacuum running over the > weekend) that could explain the performance difference? This may sound crazy, but I suggest running each of these scenarios 3+ times:

# cold explain
stop postgres
start postgres
explain analyze SELECT

# cold select
stop postgres
start postgres
enable \timing for query timing
SELECT

# cold explain to select
stop postgres
start postgres
explain analyze SELECT
enable \timing for query timing
SELECT

# cold select to explain
stop postgres
start postgres
enable \timing for query timing
SELECT
explain analyze SELECT

# cold select to select
stop postgres
start postgres
enable \timing for query timing
SELECT
SELECT

I've found the timing for "Explain Analyze" to be incredibly different from an actual SELECT on complex/large dataset queries... and the differences don't seem to correlate to possible speedups from index/table caching.
[GENERAL] does postgres log the create/refresh of a materialized view anywhere?
Is there a way to find out when a materialized view was created/refreshed? I couldn't find this information anywhere in the docs. the use-case is that I wish to update a materialized view a few times a day in a clustered environment. i'd like to make sure one of the redundant nodes doesn't refresh the view if it isn't needed. I can log this manually in postgresql if needed, but was hoping there was some "timestamp" on the view in a system table.
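Since the catalogs don't track refresh time, a minimal sketch of the manual-logging fallback mentioned above (the table and function names are my invention):

```sql
-- A tiny log table plus a wrapper; nodes call refresh_and_log('my_matview')
-- instead of REFRESH MATERIALIZED VIEW directly, and can check the log
-- before refreshing.
CREATE TABLE matview_refresh_log (
    matview_name TEXT NOT NULL,
    refreshed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE FUNCTION refresh_and_log(viewname TEXT) RETURNS void AS $$
BEGIN
    EXECUTE format('REFRESH MATERIALIZED VIEW %I', viewname);
    INSERT INTO matview_refresh_log (matview_name) VALUES (viewname);
END;
$$ LANGUAGE plpgsql;
```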
Re: [GENERAL] Determining server load
On Sep 27, 2016, at 2:46 PM, Israel Brewster wrote: > I do have those on, and I could write a parser that scans through the logs > counting connections and disconnections to give a number of current > connections at any given time. Trying to make it operate "in real time" would > be interesting, though, as PG logs into different files by day-of-the-week > (at least, with the settings I have), rather than into a single file that > gets rotated out. I was kind of hoping such a tool, such as pgbadger (which, > unfortunately, only seems to track connections per second and not consecutive > connections), already existed, or that there was some way to have the > database itself track this metric. If not, well, I guess that's another > project :) There are a lot of postgres configs and server specific tools... but on the application side and for general debugging, have you looked at statsd ? https://github.com/etsy/statsd it's a lightweight node.js app that runs on your server and listens for UDP signals, which your apps can emit for counting or timing. We have a ton of Python apps logging to it, including every postgres connection open/close and error. The overhead of clients and server is negligible. When combined with the graphite app for browsing data via charts, it becomes really useful at detecting issues with load or errors stemming from a deployment -- you just look for spikes and cliffs. We even use it to log the volume of INSERTS vs SELECTS vs UPDATES being sent to postgres. The more services/apps you run, the more useful it gets, as you can figure out which apps/deployments are screwing up postgres and the exact moment things went wrong.
Re: [GENERAL] bitwise storage and operations
On Sep 27, 2016, at 10:54 AM, Brian Dunavant wrote: > db=# select 'foo' where (9 & 1) > 0; A HA Thank you Brian and David -- I didn't realize that you needed to do the comparison to the result. (or convert the result, as these work):

select 'foo' where (9 & 1)::bool;
select 'foo' where bool(9 & 1);

I kept trying to figure out how to run operators on "9" and "1" independently to create a boolean result. I either needed more coffee or less yesterday. As a followup question... Some searches suggested that Postgres can't use indexes of INTs for these comparisons, but could on bitwise string columns. One of these tables has over 30MM rows, so I'm trying to avoid a seq scan as much as possible. I thought of creating a function index that casts my column to a bitstring, and then tailoring searches onto that. For example:

CREATE TEMPORARY TABLE example_toggle (
    id int primary key,
    toggle int default null
);
INSERT INTO example_toggle (id, toggle) VALUES (1, 1), (2, 2), (3, 3), (4, 5), (5, 8);
CREATE INDEX idx_example_toggle_toggle_bit ON example_toggle(cast(toggle as bit(4)));

While these selects work...

select * from example_toggle where (toggle & 1)::bool AND (toggle & 4)::bool;
select * from example_toggle where (toggle::bit(4) & 1::bit(4) <> 0::bit(4)) AND (toggle::bit(4) & 4::bit(4) <> 0::bit(4));

Only about 200k items have a flag right now (out of 30MM), so I thought of using a partial index on the set flags. The only way I've been able to get an index on the not null/0 rows used is to do the following:

CREATE INDEX idx_example_toggle_toggle_bit ON example_toggle(cast(toggle as bit(4))) WHERE toggle <> 0;

then tweak the query with:

select * from example_toggle where (toggle & 1)::bool AND (toggle & 4)::bool AND (toggle > 0);
select * from example_toggle where (toggle::bit(4) & 1::bit(4) <> 0::bit(4)) AND (toggle::bit(4) & 4::bit(4) <> 0::bit(4)) AND (toggle > 0);

obviously, the sample above is far too small for an index to be considered... but in general...
is a partial index of "toggle <> 0" and then hinting with "toggle > 0" the best way to only index the values that are not null or 0?
[GENERAL] bitwise storage and operations
We've been storing some "enumerated"/"set" data in postgresql as INT or BIT(32) for several years for some flags/toggles on records. This was preferable for storage to the ENUM type (or multiple columns), as we often changed the number of enumerated options or their labels -- and computing everything in the application saved the trouble of database migrations. This has worked out perfectly -- until today. For the first time ever, we need to run some queries that filter on these columns at the PostgreSQL level -- and I can't figure out how. The documentation doesn't have any examples for SELECT for the bitwise operators, and everything I've found on various threads/forums has addressed inserts or converting on a select -- but never a comparison. I've tried numerous forms and have gotten as far as CASTing everything to BIT(n), but I can't seem to construct a valid query that can filter what I want. Can anyone share a sample WHERE clause or two that does a bitwise comparison against an INT or BIT column? Thanks!
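For anyone landing here from a search: the shape that ended up working in this thread is comparing the bitwise-AND result to zero (a sketch; example_toggle is the sample table from the follow-up above, and the BIT(32) variant assumes a hypothetical flags column):

```sql
-- Rows where bit 0x4 is set in an INT column:
SELECT * FROM example_toggle WHERE (toggle & 4) <> 0;

-- The same test on a BIT(32) column needs both operands as bit strings:
-- SELECT * FROM some_table WHERE (flags & 4::bit(32)) <> 0::bit(32);
```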
Re: [GENERAL] optimizing a query
On Jun 22, 2016, at 2:38 PM, David G. Johnston wrote: > What query? A self-contained email would be nice. This was the same query as in the previous email in the thread. I didn't think to repeat it. I did include it below. > https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html > > Note especially: > > "Visibility information is not stored in index entries, only in heap > entries; ..." > > The check against the heap isn't for the truthiness of the predicate but the > visibility of the row. Thanks for this link. The table I worked on hasn't had any writes since a server restart, and according to those docs the queries should have been answered from the visibility map, not the heap. However, the amount of time spent searching is not in line with expectations for the visibility map. After reading the last paragraph about some index optimizations in 9.6 that looked related, I installed the RC on another machine and dumped 2 tables from production to see if I would qualify for any improvements. >>> But there's a problem: the WHERE clause refers to success which is not >>> available as a result column of the index. Nonetheless, an index-only scan >>> is possible because the plan does not need to recheck that part of the >>> WHERE clause at runtime: all entries found in the index necessarily have >>> success = true so this need not be explicitly checked in the plan. >>> PostgreSQL versions 9.6 and later will recognize such cases and allow >>> index-only scans to be generated, but older versions will not. The 9.6 branch planner optimizes for my query and realizes that it doesn't need to check the table. So while this index is necessary on 9.5:

CREATE INDEX idx__9_5 ON table_a(column_1, id, column_2) WHERE column_2 IS NOT FALSE;

this index works on 9.6:

CREATE INDEX idx__9_6 ON table_a(column_1, id) WHERE column_2 IS NOT FALSE;

Considering I have several million rows, this has a noticeable effect.
Combined with the various improvements on 9.6, there is a huge difference in query speed:

9.6 runs the query with the smaller index in an average of 1200ms
9.5 runs the query with the larger index in an average of 2700ms

> This one requires knowledge of the query; but I am not surprised that > reversing the order of columns in a b-tree index has an impact. I expected this to impact the decision on which index to use when multiple ones are available, or to offer poor performance -- but not to discount using the index entirely. > All at once? No. I dropped all indexes to test, then for each column combination did:

CREATE INDEX foo_idx;
ANALYZE foo;
EXPLAIN ANALYZE;
DROP INDEX foo_idx;

I call EXPLAIN ANALYZE manually once for the plan, then via script 25x to average out execution times and account for cold-start vs having loaded all the indexes. I shut down all other user processes on the machine as well.
Re: [GENERAL] optimizing a query
On Jun 22, 2016, at 4:25 AM, Erik Gustafson wrote: > don't you want an index on t_a2b.col_a, maybe partial where col_a=1 ? that table has indexes on all columns. they're never referenced because the rows are so short. this was just an example query too; col_a has 200k variations. After a lot of testing, I think I found a not-bug but possible area-for-improvement in the planner when joining against a table for filtering (using my production 9.5.2 box). I checked a query against multiple possible indexes using the related columns. only one of the indexes was on the table for each series of tests, and I analyzed the table after the drop/create of indexes.

Note 1: The only time an index-only scan is used is on this form:

CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;

Omitting the col_partial from being indexed will trigger a Bitmap Heap Scan on the full table with a recheck condition:

CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

This shouldn't be necessary. the planner knew that `col_partial` fulfilled the WHERE clause when it used the index, but scanned the table to check it anyway. On most tables the heap scan was negligible, but on a few larger tables it accounted for a 20% increase in execution time.

Note 2: This is odd, but this index is used by the planner:

CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

but this index is never used:

CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;

I honestly don't know why the second index would not be used. The query time doubled without it when run on a table with 6 million rows and about 20 columns.
--- The indexes I tested on:

CREATE INDEX idx_fkey_1 ON table_a(fkey_1);
CREATE INDEX idx_partial_fkey ON table_a(fkey_1) WHERE col_partial IS NOT FALSE;
CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;
CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;
CREATE INDEX idx_partial_fkey_partial ON table_a(fkey_1, col_partial) WHERE col_partial IS NOT FALSE;
CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;
Re: [GENERAL] optimizing a query
On Jun 21, 2016, at 6:55 PM, David G. Johnston wrote: > Aside from the name these indexes are identical... sorry. tired eyes copy/pasting between windows and trying to 'average' out 40 similar queries. > These two items combined reduce the desirability of diagnosing this...it > doesn't seem like you've faithfully recreated the scenario for us to evaluate. > > Your post is also not self-contained and you haven't provided the actual > EXPLAINs you are getting. I played around with some more indexes, creating and disabling them on one specific query. Eventually i found some index formats that didn't pull in the whole table. They gave approximately the same results as the other selects, with some differences in reporting. the heap scan on the table was negligible. the big hit was off the outer hash join. the formatting in explain made a negligible check look like it was the root issue.

CREATE TABLE t_a (
    id SERIAL PRIMARY KEY,
    col_1 INT NOT NULL,
    col_2 BOOLEAN DEFAULT NULL
);
CREATE INDEX test_idx__t_a_col1_col2__v1 ON t_a (col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v2 ON t_a (col_1, id) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v3 ON t_a (id, col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v4 ON t_a (id, col_1, col_2) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v5 ON t_a (col_1, col_2) WHERE col_2 IS NOT FALSE;

CREATE TABLE t_b (
    id SERIAL PRIMARY KEY,
    col_1 INT NOT NULL,
    col_2 BOOLEAN DEFAULT NULL
);
CREATE TABLE t_a2b (
    a_id INT NOT NULL REFERENCES t_a(id),
    b_id INT NOT NULL REFERENCES t_b(id),
    col_a INT NOT NULL,
    PRIMARY KEY (a_id, b_id)
);

EXPLAIN ANALYZE
SELECT t_a2b.b_id AS t_a2b_b_id, count(t_a2b.b_id) AS counted
FROM t_a2b
JOIN t_a ON t_a2b.a_id = t_a.id
WHERE t_a.col_1 = 730
  AND t_a2b.col_a = 1
  AND (t_a.col_2 IS NOT False)
GROUP BY t_a2b.b_id
ORDER BY counted DESC, t_a2b.b_id ASC
LIMIT 25 OFFSET 0
;

QUERY PLAN
Limit (cost=270851.55..270851.62 rows=25 width=4) (actual time=1259.950..1259.953 rows=25 loops=1)
  -> Sort (cost=270851.55..270863.43 rows=4750 width=4) (actual time=1259.945..1259.945 rows=25 loops=1)
     Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
     Sort Method: top-N heapsort Memory: 26kB
     -> HashAggregate (cost=270670.01..270717.51 rows=4750 width=4) (actual time=1259.430..1259.769 rows=1231 loops=1)
        Group Key: t_a2b.b_id
        -> Hash Join (cost=171148.45..270516.71 rows=30660 width=4) (actual time=107.662..1230.481 rows=124871 loops=1)
           Hash Cond: (t_a2b.a_id = t_a.id)
           -> Seq Scan on t_a2b (cost=0.00..89741.18 rows=2485464 width=8) (actual time=0.011..661.978 rows=2492783 loops=1)
              Filter: (col_a = 1)
              Rows Removed by Filter: 2260712
           -> Hash (cost=170446.87..170446.87 rows=56126 width=4) (actual time=107.409..107.409 rows=48909 loops=1)
              Buckets: 65536 Batches: 1 Memory Usage: 2232kB
              -> Bitmap Heap Scan on t_a (cost=1055.41..170446.87 rows=56126 width=4) (actual time=18.243..94.470 rows=48909 loops=1)
                 Recheck Cond: ((col_1 = 730) AND (col_2 IS NOT FALSE))
                 Heap Blocks: exact=43972
                 -> Bitmap Index Scan on test_idx__t_a_col1_col2__v2 (cost=0.00..1041.38 rows=56126 width=0) (actual time=8.661..8.661 rows=48909 loops=1)
                    Index Cond: (col_1 = 730)
Planning time: 0.796 ms
Execution time: 1260.092 ms

QUERY PLAN
--
Limit (cost=208239.59..208239.65 rows=25 width=4) (actual time=1337.739..1337.743 rows=25 loops=1)
  -> Sort (cost=208239.59..208251.47 rows=4750 width=4) (actual time=1337.737..1337.739 rows=25 loops=1)
     Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
     Sort Method: top-N heapsort Memory: 26kB
     -> HashAggregate (cost=208058.05..208105.55 rows=4750 width=4) (
[GENERAL] optimizing a query
I have a handful of queries in the following general form that I can't seem to optimize any further (same results on 9.3, 9.4, 9.5). I'm wondering if anyone might have a suggestion, or if they're done. The relevant table structure:

t_a2b
    a_id INT references t_a(id)
    b_id INT references t_b(id)
    col_a

t_a
    id INT
    col_1 INT
    col_2 BOOL

The selects query the association table (t_a2b) and join in a related table (t_a) for some filtering. In effort of simplifying the work, I've created indexes on t_a that have all the related columns.

CREATE INDEX test_idx ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__a ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;

postgres will query test_idx__a first (yay!) but then does a bitmap heap scan on t_a, and uses the raw t_a for the hash join. I don't actually need any information from t_a - it's just there for the filtering, and ideally postgres would just use the index. I thought this might have been from using a partial index, but the same results happen with a full index. I just can't seem to avoid this hash join against the full table. anyone have a suggestion?

example query:

SELECT t_a2b.b_id AS b_id, count(t_a2b.b_id) AS counted
FROM t_a2b
JOIN t_a ON t_a2b.a_id = t_a.id
WHERE t_a2b.col_a = 1
  AND t_a.col_1 = 730
  AND t_a.col_2 IS NOT False
GROUP BY t_a2b.b_id
ORDER BY counted DESC, t_a2b.b_id ASC
Re: [GENERAL] does timestamp precision affect storage size?
On Jun 21, 2016, at 4:50 PM, Tom Lane wrote:
> Storage-wise, no. If you have a resolution spec on your columns now,
> I think dropping the resolution spec would save you a few nanoseconds per
> row insertion due to not having to apply the roundoff function. Adding
> one would certainly not improve speed.

On Jun 21, 2016, at 4:47 PM, Vik Fearing wrote:
> No, there are no space savings here.
>
> =# select pg_column_size('now'::timestamptz(0)),
>           pg_column_size('now'::timestamptz);

Thanks. I thought that was happening, but wanted to make sure. The allure of shaving a byte or two off some rows couldn't be ignored ;)
[GENERAL] does timestamp precision affect storage size?
I'm cleaning up some queries for performance, and noticed that we never use precision beyond the second (i.e., `timestamp(0)`) in our business logic. Would there be any savings in storage or performance improvements from losing the resolution on fractional seconds, or is `timestamp(precision)` effectively the same for storage as `timestamp`? (Based on the docs, I assume the latter, but wanted to check.)
[GENERAL] disable ipv6?
I'm running PostgreSQL on Ubuntu (the 9.4 branch from postgresql.org). I think the only way to disable IPv6 is to edit postgresql.conf and explicitly state localhost as an IPv4 address:

- listen_addresses = 'localhost'
+ listen_addresses = '127.0.0.1'

Can anyone confirm?
Re: [GENERAL] Is it possible to select index values ?
On Feb 1, 2016, at 6:58 PM, David G. Johnston wrote:
> You can query the statistics portion of the database to get some basic
> statistics of the form mentioned.

Yeah, I didn't think there would be support. The stats collector doesn't have the info that I want... it's focused on how the data is used. I'm more interested in what the data is. Basically I want to compare the distribution of index "keys" -- in the case of a substring index, comparing the distribution at 3, 4, 5, 6, 7, 8 characters. Based on that, I can run some server tests on different lengths, and that's where the stats collector comes into play. I'm taking a blind stab on some index contents, and want to be a bit more educated. Anyway, I eventually realized that I may be better off just pulling the columns and running some offline analytics (it could be done in SQL, but the results would be nicer as a graph).
[GENERAL] Is it possible to select index values ?
Is it possible to select index values? I haven't found any documentation that says "No", but I haven't found anything that says "Yes" either. The reason: I have a few function indexes that are working as partial indexes. I'd like to run some analytics on them (to determine uniqueness of values, decide if I should change the function, etc.). It would be easier if I could somehow access the index contents than re-create the index data into a temporary table.
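For what it's worth, the contrib module pageinspect can read raw B-tree index pages, which gets close to "selecting index values"; a sketch (the index name is hypothetical, and the output is low-level page data rather than typed column values):

```sql
-- pageinspect exposes raw index page contents (superuser only).
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- bt_page_items() returns one row per index tuple on the given page,
-- with the key bytes in the "data" column. Page 1 is typically the
-- first leaf page of a small B-tree index.
SELECT itemoffset, ctid, data
FROM bt_page_items('my_partial_function_idx', 1);
```

The key values come back as raw bytes, so for analytics on the actual values, re-running the index expression over the table (as the poster eventually decided) is usually easier.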
Re: [GENERAL] controlling memory management with regard to a specific query (or groups of connections)
Thanks. Unfortunately, this is in a clustered environment. NFS and other shared drive systems won't scale well. I'd need to run a service that can serve/delete the local files, which is why I'm just stashing it in Postgres for now.

> On Nov 19, 2015, at 2:26 AM, Roxanne Reid-Bennett wrote:
>
> We have a system that loads a bunch of files up to be processed - we queue
> them for processing behind the scenes. We don't load them into Postgres
> before processing. We put them in a temp directory and just save the
> location of the file to the database. This configuration does have
> limitations. Post-processing can not be load balanced across servers unless
> the temp directory is shared.
>
> I'm sure you'll get more DB centric answers from others on the list.
>
> Roxanne
[GENERAL] controlling memory management with regard to a specific query (or groups of connections)
As a temporary fix I need to write some uploaded image files to PostgreSQL until a task server can read/process/delete them. The problem I've run into (via server load tests that model our production environment) is that these reads/writes end up pushing the indexes used by other queries out of memory, causing them to be re-read from disk. These files can be anywhere from 200kB to 5MB. Has anyone dealt with situations like this before and have any suggestions? I could use a dedicated db connection if that would introduce any options.
Re: [GENERAL] temporary indexes?
On Oct 22, 2015, at 5:04 PM, Jim Nasby wrote:
>
> What % of execution time is spent creating those indexes? Or is that factored
> into the 1000%? Also, could your analysis queries be run in a REPEATABLE READ
> transaction (meaning that once the transaction starts it doesn't get any new
> data)? If it could then the temp indexes could be static, which would mean no
> update overhead.

Running without the indexes would take over an hour to execute the scripts, and totally jams the machine (we got 30 minutes in once, and had to kill it). That's because of the millions of rows used in joins and sequential scans. Building all the indexes takes 30 seconds; most SQL commands then run only against the indexes (some of which are partial) and the entire suite finishes in about 3 minutes. If the indexes stay active during the day, there seems to be a 2-3% drop in write performance. This is on a webapp, so we're just happier shifting the index work from peak hours to off-peak hours. It means we can delay spinning up another application server a bit longer. I'll definitely look into your suggestions the next time I hit this code.
Re: [GENERAL] temporary indexes?
On Oct 22, 2015, at 2:08 PM, Tom Lane wrote:
> FWIW, I don't find much attraction in the idea of building an index for
> use by a single query. There basically isn't any scenario where that's
> going to beat running a plan that doesn't require the index. The value of
> an index is generally to avoid a whole-table scan and/or a sort, but
> you'll necessarily pay those costs to make the index.

On Oct 22, 2015, at 8:17 AM, vincent elschot wrote:
> Do you mean creating a temporary index on a non-temporary table to speed up
> the queries that fills the temporary table?

One of the use-cases is speeding up inserts on create, but another is for periodic analytics routines (which we handle with explicit create/drop index commands). In one example of our analytics routines, we end up needing to create/drop about 15 indexes to optimize 45 queries. This speeds up the execution by 1000% and minimizes RAM usage. We don't keep the indexes active, because we only need them for analytics, and the overhead of managing them during high-write periods during the day is noticeable. Creating and dropping these indexes on demand gives us all the benefit with none of the drawbacks.
Re: [GENERAL] temporary indexes?
On Oct 21, 2015, at 3:42 PM, Adrian Klaver wrote:
> I misunderstood then. The only thing I can think of is to wrap in a
> transaction, though that presents other issues with open transactions and/or
> errors in the transaction.

I just explicitly drop. The convenience of an auto-drop would be a nice backup. Transactions and table-locking issues are probably why temporary indexes don't exist.
Re: [GENERAL] temporary indexes?
On Oct 21, 2015, at 2:59 PM, Jeff Janes wrote:
> I think he means more like:
>
> create temporary table temp_test(id int, fld_1 varchar);
> create temporary index on permanent_table (fld_1);
>
> select something from temp_test join permanent_table using (fld_1) where a=b;
> select something_else from temp_test join permanent_table using (fld_1) where c=d;

Yes, that's exactly what I mean: a "temporary index" on a "permanent table", which expires in the same manner as a "temporary table".
[GENERAL] temporary indexes?
I couldn't find any mention of this in the archives... Have the project maintainers ever considered extending CREATE INDEX to support "temporary" indexes, like CREATE TEMPORARY TABLE? When creating temporary tables for analytics/reporting, I've noticed that I often need to create (then drop) indexes on regular tables. Temporary indexes seemed like a natural fit here, so I was wondering if there was any reason why they're not supported (other than no one wanting it!).
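Absent real temporary indexes, the explicit create/drop workaround discussed in the replies can be sketched as follows (table and index names hypothetical):

```sql
-- Build the index just for the reporting run, then drop it.
-- CONCURRENTLY avoids blocking writers on the permanent table, at the
-- cost of not being usable inside a transaction block.
CREATE INDEX CONCURRENTLY tmp_idx_orders_customer
    ON orders (customer_id);

-- ... run the analytics/reporting queries here ...

DROP INDEX CONCURRENTLY tmp_idx_orders_customer;
```

If the surrounding workload is quiet (an off-peak batch window, as in this thread), plain CREATE INDEX / DROP INDEX is simpler and faster to build.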
Re: [GENERAL] using postgresql for session
On Oct 7, 2015, at 11:58 AM, john.tiger wrote:
> has anyone used postgres jsonb for holding session ? Since server side
> session is really just a piece of data, why bother with special "session"
> plugins and just use postgres to hold the data and retrieve it with psycopg2
> ? Maybe use some trigger if session changes? We are using python Bottle
> with psycopg2 (super simple, powerful combo) - are we missing something
> magical about session plugins ?

I previously used TEXT or BLOB for holding session data, and pickled the data -- I can't remember which. If you're going to use PostgreSQL for the session, the big performance tip is to use a partial index on the session key (assuming it's an md5-like hash). So you'd want a table that is something like this:

CREATE TABLE session(
    session_id VARCHAR(32) PRIMARY KEY,
    session_data TEXT
);
CREATE INDEX idx_session_partial ON session(substr(session_id, 0, 5));

Then query like this:

SELECT * FROM session
WHERE session_id = :session_id
  AND substr(session_id, 0, 5) = :session_id_substring;

SELECT * FROM session
WHERE session_id = :session_id
  AND substr(session_id, 0, 5) = substr(:session_id, 0, 5);

That will get the planner to use the partial index first, before using the session_id index. Depending on how many items are in your table, it can make your SELECTs several orders of magnitude faster.

As for session plugins -- a lot of people in the web frameworks community are abandoning server-side sessions for client-side sessions. They generally make it easier to handle state across clusters and data centers. Some server-side session-like data is still needed, but it's often assembled from data on the client side. Most of the Python session plugins I've used have some sort of status check coupled with a cleanup function/middleware component to see if the object has changed at all. This way UPDATEs only occur when needed. FWIW, I ended up migrating our sessions into redis.
We already had redis running on the cluster, and offloading the sessions freed up a lot of Postgres performance without scaling our hardware. There just isn't much of a reason to have pg manage a simple KV store.
Re: [GENERAL] "global" & shared sequences
Thanks for the reply.

On Oct 2, 2015, at 3:26 PM, Jim Nasby wrote:
> I'm not really following here... the size of an index is determined by the
> number of tuples in it and the average width of each tuple. So as long as
> you're using the same size of data type, 18 vs 1 sequence won't change the
> size of your indexes.

I'm pretty much concerned with exactly that -- the general distribution of numbers, which affects the average size/length of each key. Using an even distribution as an example, the average width of the keys can increase by 2 digits: since we have ~18 object types, the primary keys in each might range from 1 to 9,999,999; using a shared sequence, the keys for the same dataset would range from 1 to 189,999,999. Each table is highly related, and may fkey onto 2-4 other tables... so I'm a bit wary of this change. But if it works for others... I'm fine with that!

> Sequences are designed to be extremely fast to assign. If you ever did find a
> single sequence being a bottleneck, you could always start caching values in
> each backend. I think it'd be hard (if not impossible) to turn a single
> global sequence into a real bottleneck.

I don't think so either, but everything I've read has been theoretical -- so I was hoping that someone here could give the "yeah, no issue!" from experience. The closest production stuff I found was via the BDR plugin (the only relevant thing that came up during a search), and there seemed to be anecdotal accounts of issues with sequences becoming bottlenecks -- but that was from their code that pre-generated allowable sequence ids on each node.
[GENERAL] "global" & shared sequences
Hoping to glean some advice from the more experienced.

The major component of our application currently tracks a few dozen object types, and the total number of objects is in the hundreds-of-millions range. Postgres will potentially be tracking billions of objects. Right now the primary key for our "core" objects is based on a per-table sequence, but each object has a secondary id based on a global/shared sequence. We expose everything via a connected object graph, and basically needed a global sequence. We are currently scaled vertically (1x writer, 2x reader).

I'd like to avoid assuming any more technical debt, and am not thrilled with the current setup. Our internal relations are all by the table's primary key, but the external (API, web) queries use the global id. Every table has 2 indexes, and we need to convert a 'global' id to a 'table' id before doing a query. If we're able to replace the per-table primary key with the global id, we'd be freeing up some disk space from the indexes and tables -- and not have to keep our performance cache that maps table-to-global ids.

The concerns that I have before moving ahead are:

1. General performance at different stages of DB size. With 18 sequences, our keys/indexes are simply smaller than they'd be with 1 key; I wonder how this will impact lookups and joins.
2. Managing this sequence when next scaling the db (which would probably have to be sharding, unless others have a suggestion).

If anyone has insights, they would be greatly appreciated.
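For concreteness, the shared-sequence layout under discussion can be sketched as follows (table names hypothetical):

```sql
-- One sequence shared by every object table, so ids are globally unique
-- across object types.
CREATE SEQUENCE global_object_id_seq;

CREATE TABLE widget (
    id BIGINT PRIMARY KEY DEFAULT nextval('global_object_id_seq'),
    name TEXT
);

CREATE TABLE gadget (
    id BIGINT PRIMARY KEY DEFAULT nextval('global_object_id_seq'),
    name TEXT
);
```

BIGINT is the safe choice here, since a single shared sequence consumes id space roughly 18x faster than the per-table sequences would.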
[GENERAL] trouble converting several serial queries into a parallel query
I have a very simple query that is giving me some issues due to the size of the database and the number of requests I make to it in order to compile the report I need. A dumbed-down version of the table and query:

CREATE TABLE a_to_b (
    id_a INT NOT NULL REFERENCES table_a(id),
    id_b INT NOT NULL REFERENCES table_b(id),
    PRIMARY KEY (id_a, id_b)
);

SELECT id_a, id_b FROM a_to_b WHERE id_a = 1 LIMIT 5;

The problem is that the table has a few million records and I need to query it 30+ times in a row. I'd like to improve this with a parallel search using `IN()`:

SELECT id_a, id_b FROM a_to_b
WHERE id_a IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30);

That technique has generally fixed a lot of bottlenecks for us. However, I can't wrap my head around structuring it so that I can apply a limit based on the column -- so that I only get 5 records per id_a. The table has columns that I would use for ordering in the future, but I'm fine with just getting random values right now. Can anyone offer some suggestions? Thanks in advance.
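The per-id_a limit being asked about is commonly expressed with a window function; a sketch against the simplified table above (not from the thread, and on 9.3+ a LATERAL join against a LIMIT 5 subquery is the other common formulation):

```sql
-- Rank rows within each id_a, then keep at most 5 per group.
-- Without an ORDER BY in the OVER clause, the 5 rows are arbitrary,
-- matching the "random values are fine" requirement.
SELECT id_a, id_b
FROM (
    SELECT id_a, id_b,
           row_number() OVER (PARTITION BY id_a) AS rn
    FROM a_to_b
    WHERE id_a IN (1, 2, 3, 4, 5)
) ranked
WHERE rn <= 5;
```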
Re: [GENERAL] newsfeed type query
On Apr 29, 2015, at 6:50 PM, Jim Nasby wrote:
> Only because you're using UNION. Use UNION ALL instead.

The difference between "union" and "union all" was negligible. The problem was in the subselect and the sheer size of the tables, even when we could handle it as an index-only scan.

On Apr 29, 2015, at 1:18 PM, Ladislav Lenart wrote:
> I would expect the overall query to return only 60F and 55F as the most recent
> data. No? You expect it to return 4 items when the LIMIT is only 2. Remember
> that the overall query should be also ordered by ts and limited to 2.

You're right. Total mistake on my part and confusion with that -- I got this query confused with the specifics of a similar one.
Re: [GENERAL] newsfeed type query
On Apr 29, 2015, at 12:25 PM, Ladislav Lenart wrote:
> Could you please explain to me the error(s) in my reasoning?

Let me just flip your list in reverse... and add in some elements (marked with a *):

posting  ts  context
p60      60  friend
p55      55  friend*
p54      54  friend*
p50      50  group
p49      49  group*
p48      48  group*
p40      40  friend
p30      30  group
p20      20  friend
p10      10  group
p0        0  friend

With the 2-limited subqueries, the results would be: 60F, 55F, 50G, 49G. But the "most recent" data is 60F, 55F, 54F, 50G. So we end up showing 49, which is less relevant than 54. In some situations this isn't much of an issue, but in others it is detrimental. For example, one of my "feeds" contains a distribution of events according-to-type that is very uneven. While "friend" and "group" might be relatively close in time to one another, "system" or other events may be months old -- and that older content gets pulled in with this style of query. If you need to paginate the data and select the next 10 overall items, it gets even more complicated.

IIRC, the best mix of performance and "product" that I've found is to do something like this:

SELECT * FROM (
    (SELECT a, b, c, ts FROM table_a ORDER BY ts DESC LIMIT 1)
    UNION
    (SELECT a, b, c, ts FROM table_b ORDER BY ts DESC LIMIT 1)
) AS unioned
ORDER BY ts DESC
LIMIT 100 OFFSET 0;

By creating an artificial limit on the inner queries, you can save Postgres from doing a lot of intensive I/O work and memory usage (like, a lot). Then joining a few lists and sorting 20k (or even 100k) items is really cheap. The downside is that you effectively limit the 'relevancy' of the query to whatever the inner limit is (i.e., 1 -- not the combined total of 2), but that number can be arbitrarily high enough that it is irrelevant while still showing the right amount of content for people.
Re: [GENERAL] newsfeed type query
Thanks all! These point me in much better directions!

Jim Nasby's approach to selecting an expression addressed some things (SELECT f.useraccount_id_b IS NOT NULL AS in_friends). Ladislav Lenart's usage of the CTE is also of a different format than I've used in the past. I think I'll be able to patch together some performance improvements now that will last until the database structure changes.

On Apr 29, 2015, at 6:54 AM, Ladislav Lenart wrote:
> I think you can propagate ORDER BY and LIMIT also to the subqueries of the
> UNION, i.e.:

It behaves a lot better, but doesn't give me the resultset I need. Older data from one subquery is favored over newer data from another. I use a similar approach on another part of this application -- where the effect on the resultset isn't as pronounced. On that query there are over 100 million total stream events. Not using an inner limit runs the query in 7 minutes; limiting the inner subquery to 1MM rows runs in 70 seconds... and limiting to 10k is around 100ms.

On Apr 29, 2015, at 10:16 AM, Melvin Davidson wrote:
> I see others have responded with suggestions to improve query performance,
> but one thing I noticed when you gave the data structure is there are
> no primary keys defined for friends or posting, neither are there any
> indexes.
> Was that an omission?

This was a quick functional example to illustrate. The real tables are slightly different, but do have pkeys ('id' is a bigserial; relationship tables (friends, memberships) use a composite key). They are aggressively indexed and reindexed on various columns for query performance. Sometimes we create an extra index that has multiple columns or partial columns to make scans index-only.
Re: [GENERAL] newsfeed type query
Sorry, I was trying to ask something very abstract as I have similar situations on multiple groups of queries/tables (and they're all much more complex). I'm on pg 9.3.

The relevant structure is:

posting: id, timestamp_publish, group_id__in, user_id__author
friends: user_id__a, user_id__b
memberships: user_id, group_id, role_id

-- working sql
CREATE TABLE groups( id SERIAL NOT NULL PRIMARY KEY );
CREATE TABLE users( id SERIAL NOT NULL PRIMARY KEY );
CREATE TABLE friends (
    user_id__a INT NOT NULL REFERENCES users( id ),
    user_id__b INT NOT NULL REFERENCES users( id )
);
CREATE TABLE memberships (
    user_id INT NOT NULL REFERENCES users( id ),
    group_id INT NOT NULL REFERENCES groups( id ),
    role_id INT NOT NULL
);
CREATE TABLE posting (
    id SERIAL NOT NULL,
    timestamp_publish TIMESTAMP NOT NULL,
    group_id__in INT NOT NULL REFERENCES groups(id),
    user_id__author INT NOT NULL REFERENCES users(id),
    is_published BOOL
);

The output that I'm trying to get is:

    posting.id
    {the context of the select}
    posting.timestamp_publish (this may need to get correlated into other queries)

These approaches had bad performance:

-- huge selects / memory -- it needs to load everything from 2 tables before it limits
EXPLAIN ANALYZE
SELECT id, feed_context FROM (
    SELECT id, timestamp_publish, 'in-group' AS feed_context
    FROM posting
    WHERE (
        group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
        AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
    )
    UNION
    SELECT id, timestamp_publish, 'by-user' AS feed_context
    FROM posting
    WHERE (
        user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
        AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
    )
) AS feed
ORDER BY timestamp_publish DESC
LIMIT 10;

-- selects minimized, but repetitive subqueries
SELECT id,
    CASE WHEN group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
         THEN True ELSE NULL END AS feed_context_group,
    CASE WHEN user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
         THEN True ELSE NULL END AS feed_context_user
FROM posting
WHERE (
    group_id__in IN (SELECT group_id FROM memberships WHERE user_id = 57 AND role_id IN (1,2,3))
    OR user_id__author IN (SELECT user_id__b FROM friends WHERE user_id__a = 57)
)
AND (is_published = True AND timestamp_publish <= CURRENT_TIMESTAMP AT TIME ZONE 'UTC')
ORDER BY timestamp_publish DESC
LIMIT 10;

On Apr 28, 2015, at 6:56 PM, Melvin Davidson wrote:
> Since you very nicely DID NOT provide the pg version, O/S or table
> structure(s), which is what you should do REGARDLESS of the type of question
> (it's just the smart and polite thing to do when asking for help)
> The best I can suggest is:
> SELECT
>   CASE WHEN context = 'friend' THEN p.junka
>        WHEN context = 'group' THEN p.junkb
>        WHEN context = 'both' THEN p.junka || ' ' || p.junkb
>   END
> FROM posting p
> WHERE p.author_id IN (SELECT f.friend_id
>                       FROM friends f
>                       WHERE f.user_id = ?)
>    OR p.group_id IN (SELECT m.group_id
>                      FROM memberships m
>                      WHERE m.user_id = ?);
[GENERAL] newsfeed type query
I'm trying to upgrade some code that powers a newsfeed type stream, and hoping someone can offer some insight on better ways to structure some parts of the query. The part that has me stumped right now... there are several criteria for why something could appear in a stream. For example, here are 2 for handling a posting:

* a posting by a friend
* a posting in a group

The general way I've handled this so far has been simple:

SELECT * FROM posting
WHERE author_id IN (SELECT friend_id FROM friends WHERE user_id = ?)
   OR group_id IN (SELECT group_id FROM memberships WHERE user_id = ?);

Now I need to pull in the context of the match (friend, group, both), but I can't figure out how to do this cleanly.

1. If I just add 'case' statements to the select to note the origin, those subselects run again (i.e., the same subquery is executed twice).
2. If I structure this as a union (and note the origin with a string), it takes a lot more work to integrate and sort the 2 separate selects (e.g. "select id, timestamp, 'by-friend'" unioned with "in-group").

Does anyone have ideas on other approaches to structuring this?
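One shape the replies point toward is computing each context test once as a boolean column and filtering on the flags in a wrapper query, so neither subselect has to be written twice; a sketch against the simplified schema above (not a verbatim solution from the thread; 57 is just an example user id):

```sql
-- Each membership test is written once and exposed as a flag; the
-- wrapper's WHERE keeps only rows that matched at least one context.
SELECT id, via_friend, via_group
FROM (
    SELECT p.id,
           p.author_id IN (SELECT friend_id FROM friends
                           WHERE user_id = 57) AS via_friend,
           p.group_id  IN (SELECT group_id FROM memberships
                           WHERE user_id = 57) AS via_group
    FROM posting p
) flagged
WHERE via_friend OR via_group;
```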
Re: [GENERAL] splitting up tables based on read/write frequency of columns
On Jan 19, 2015, at 5:07 PM, Stefan Keller wrote:
> Hi
>
> I'm pretty sure PostgreSQL can handle this.
> But since you asked with a theoretic background,
> it's probably worthwhile to look at column stores (like [1]).

Wow. I didn't know there was a column store extension for PG -- this would come in handy for some analytic stuff we run! I know that PG can handle my current system "at scale". I'm really just wondering what the possible slowdowns/improvements will be. Doing a rewrite of the entire row + updating the various indexes seems to be a lot of unnecessary IO. At some point it will make sense to minimize that and isolate the heavy-write columns from impacting the rest of the table's performance.
[GENERAL] splitting up tables based on read/write frequency of columns
This is really a theoretical/anecdotal question, as I'm not at a scale yet where this would be measurable. I want to investigate while this is fresh in my mind...

I recall reading that unless a row has columns that are TOASTed, an `UPDATE` is essentially an `INSERT + DELETE`, with the previous row marked for vacuuming.

A few of my tables have the following characteristics:
- The primary key has many other tables/columns that FKEY onto it.
- Many columns (30+) of small data size.
- Most columns (90%) see 1 WRITE (UPDATE) per 1000 READS.
- Some columns (10%) do a bit of internal bookkeeping and see 1 WRITE (UPDATE) per 50 READS.

Has anyone done testing/benchmarking on the potential efficiency/savings of consolidating the frequent-UPDATE columns into their own table?
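The split being considered would look roughly like this (table and column names hypothetical): move the hot bookkeeping columns into a narrow side table sharing the primary key, so the frequent UPDATEs rewrite only a small row.

```sql
-- Wide, mostly-read table keeps the stable columns.
CREATE TABLE account (
    id BIGINT PRIMARY KEY,
    name TEXT,
    email TEXT
    -- ... the other ~30 rarely-updated columns ...
);

-- Narrow side table holds the frequently-updated bookkeeping columns;
-- an UPDATE here rewrites a small row and touches fewer indexes.
CREATE TABLE account_stats (
    account_id BIGINT PRIMARY KEY REFERENCES account(id),
    login_count INT NOT NULL DEFAULT 0,
    last_seen TIMESTAMP
);
```

One thing that shrinks the expected win: PostgreSQL's heap-only tuple (HOT) optimization already skips index maintenance when an UPDATE changes no indexed columns, so the remaining cost is mostly the rewrite of the wide heap row itself.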
Re: [GENERAL] postgresql versus riak for a global exchange
A very popular design I see is often this:
- PostgreSQL for account, inventory, and transactional data, and all writes
- NoSQL (Redis, Riak, Mongo, etc.) for read-only indexes and assembled documents derived from postgres (almost like a read-through cache)

On Jan 5, 2015, at 5:46 PM, Raymond Cote wrote:
> I'm familiar with both PostgreSQL and Riak (1.4, not 2.0).
> I know that Riak 2.0 now offers strong consistency. Have not yet seen what
> that does to performance.
> Big plusses for PostgreSQL:
> - you can do both relational and NOSQL tasks (the Binary JSON in the latest
> PostgreSQL).
> - well-tested consistency, ACID, etc.
> - lots of adapters and support.
> - big community
>
> Big plusses for Riak:
> - multi-master replication
> - multi-data center replication
> - easy to scale up
>
> We use PostgreSQL in combination with Riak for data storage (we have a
> tokenization service).
> We're currently using the EnterpriseDB multi-master PostgreSQL replication
> and are quite happy with it.
> The replication runs periodically, not streaming, so there is at least a 1
> second delay for replication to occur.
> Riak replicates quicker — but then you don't have the strong relational
> structure on top.
>
> As mentioned earlier, 'exchange…trade…asset' is a bit vague.
> In addition to just storing things, you'll need to keep track of all sorts of
> log-in and contact info — perhaps not ideal for Riak.
> Probably best to consider precisely what traits your planned application has
> and then look to match against the database storage.
> May even end up with a mix of the two just as we have.
>
> Your decision may also depend on which development language/framework you
> chose for the implementation.
Re: [GENERAL] Hostnames, IDNs, Punycode and Unicode Case Folding
On Dec 29, 2014, at 5:36 PM, Mike Cardwell wrote:
> So the system I've settled with is storing both the originally supplied
> representation, *and* the lower cased punycode encoded version in a separate
> column for indexing/search. This seems really hackish to me though.

I do the same exact thing and don't think it's hackish. I really like being able to see the punycode next to the human representation on simple db pulls. It's barely more disk space and really useful.
Re: [GENERAL] Storing Video's or vedio file in DB.
I wouldn't even store it on the filesystem if I could avoid that. Most people I know will assign the video a unique identifier (which is stored in the database) and then store the video file with a 3rd party (e.g. Amazon S3).

1. This is often cheaper. Videos take up a lot of disk space, and having to ensure 2-3 copies of a file as a failover is not fun.
2. It offloads work from internal servers. Why deal with connections that are serving a static file if you can avoid it?

In terms of FS vs DB (aside from the open vs streaming issue which was already brought up), I think the big issue with storing large files in the database is the input/output connection. Postgres has a specified number of max connections available, and each one has some overhead to operate. Meanwhile, a server like nginx can handle 10k connections easily, with little or no overhead. Even if the raw serving speed is comparable to the filesystem, you end up using a resource from a limited database connection pool, and you run the risk of a slow/dropped client tying up the connection. Why allocate a resource to these operations when there are more lightweight alternatives that won't tie up a database connection?
Re: [GENERAL] function indexes, index only scan and sorting
On Dec 12, 2014, at 4:58 PM, Tom Lane wrote:

> regression=# create table tt (f1 int, f2 text);
> CREATE TABLE
> regression=# create index on tt (lower(f2));
> CREATE INDEX
> regression=# explain select * from tt order by lower(f2);
>                               QUERY PLAN
>  Index Scan using tt_lower_idx on tt  (cost=0.15..65.68 rows=1230 width=36)
> (1 row)

Thank you so much for posting this test. I got a seq scan on my local machine, so I checked the version... still running 9.2.4. I tried it on production (which is 9.3.x) and got the same result as you. Looking at the 9.3 release notes, I'm guessing this behavior comes from one of the optimizer fixes.
Re: [GENERAL] Removing duplicate records from a bulk upload (rationale behind selecting a method)
On Dec 8, 2014, at 9:35 PM, Scott Marlowe wrote:

> select a,b,c into newtable from oldtable group by a,b,c;
>
> One pass, done.

This is a bit naive, but couldn't this approach potentially be faster (depending on the system)?

SELECT a, b, c
INTO duplicate_records
FROM (
    SELECT a, b, c, count(*) AS counted
    FROM source_table
    GROUP BY a, b, c
) q_inner
WHERE q_inner.counted > 1;

DELETE FROM source_table
USING duplicate_records
WHERE source_table.a = duplicate_records.a
  AND source_table.b = duplicate_records.b
  AND source_table.c = duplicate_records.c;

It would require multiple full-table scans, but it would minimize writing to disk -- and isn't a 'read' operation usually much more efficient than a 'write' operation? If the duplicate checking is only done on a small subset of columns, indexes could speed things up too.
[GENERAL] function indexes, index only scan and sorting
Can someone confirm a suspicion for me? I have a moderately sized table (20+ columns, 3MM rows) that tracks "tags". I have a lower(column) function index that is used to simplify case-insensitive lookups:

CREATE INDEX idx_tag_name_lower ON tag(lower(name));

I have a few complex queries that need to join back to this table (via the `id` primary key) and sort on `lower(name)`. I'm not selecting `lower(name)`, just using it in an ORDER BY. The only way I seem to be able to avoid a sequential scan and get an index-only scan is with another index -- this one specifically (and I've run queries against 8 index permutations):

CREATE INDEX idx_tag_joins ON tag(id, name_display);

Am I correct in observing that the value of a function index can't be used for sorting?
Re: [GENERAL] deferring ForeignKey checks when you didn't set a deferrable constraint ?
On Nov 20, 2014, at 6:00 PM, Melvin Davidson wrote:

> Try the following queries. It will give you two .sql files (create_fkeys.sql
> & drop_fkeys.sql).

Thanks! I tried a variation of that to create DEFERRABLE constraints, and that was a mess. It appears all the checks ran individually at the end of the transaction -- the process consumed 100% cpu overnight and was still stuck on the 'commit' after 16 hours.

So I crossed my fingers and tried your code like this:

BEGIN;
DROP CONSTRAINT . x24;
DELETE;
ADD CONSTRAINT . x24;
COMMIT;

And that took just over 24 seconds.
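Spelled out for one constraint, the pattern above would look something like this (table and constraint names are hypothetical; the poster had 24 such constraints):

```sql
BEGIN;

-- Drop each foreign key that references the table being purged.
ALTER TABLE child_table DROP CONSTRAINT child_parent_fkey;
-- ... repeat for each referencing constraint ...

-- With the checks gone, the bulk delete runs without per-row FK lookups.
DELETE FROM parent_table WHERE id IN (SELECT id FROM rows_to_delete);

-- Re-adding the constraint validates all existing rows in a single pass,
-- which is far cheaper than millions of individual deferred checks.
ALTER TABLE child_table
    ADD CONSTRAINT child_parent_fkey
    FOREIGN KEY (parent_id) REFERENCES parent_table (id);
-- ... repeat for each constraint ...

COMMIT;
```

Note that the ALTER TABLE statements take strong locks on the involved tables for the duration of the transaction, so this is best done in a maintenance window.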
[GENERAL] deferring ForeignKey checks when you didn't set a deferrable constraint ?
I have a core table with tens of millions of rows, and I need to delete about a million records. There are 21 foreign key checks against this table. Based on the current performance, it would take a few days to make my deletions.

None of the constraints were defined as `DEFERRABLE INITIALLY IMMEDIATE`, so I'm out of luck on deferring them. Dropping/redefining constraints looks to be an ordeal -- and something I'm scared to make a mistake on. I looked into disabling triggers on a table, but I couldn't find any info on how to run the checks at the end of the transaction so I can ensure integrity.

Does anyone have suggestions on things that might work?
Re: [GENERAL] issue with double ordering in a wrapped distinct
I re-ran the query in multiple forms and included it below (I regexed it to become 'foo2bar' so it's more generic to others). I also uploaded it as a public spreadsheet to google, because I think that is a bit easier to look at: https://docs.google.com/spreadsheets/d/1w9HM8w9YUpul5Bmy-uvzA4I6x5OFwWzR6K5atFG2ubw/edit?usp=sharing

The most interesting thing to me was how drastically the planner was affected by the interplay of DISTINCT and ORDER BY in a subselect (times in ms):

   293 - SELECT ( SELECT DISTINCT ) ORDER LIMIT OFFSET
   293 - SELECT DISTINCT ( SELECT DISTINCT ORDER ) LIMIT OFFSET
   300 - SELECT DISTINCT ( SELECT ORDER ) LIMIT OFFSET
  6400 - SELECT ( SELECT DISTINCT ORDER ) LIMIT OFFSET
  7631 - SELECT DISTINCT ( SELECT ) ORDER LIMIT OFFSET

And you can also see how the planner completely changed strategy when LIMIT/OFFSET was introduced to the first query:

     394 - SELECT ORDER BY
  446501 - SELECT ORDER BY LIMIT OFFSET

Query A

EXPLAIN ANALYZE
SELECT DISTINCT foo_2_bar.bar_id
FROM foo_2_bar
JOIN foo ON foo_2_bar.foo_id = foo.id
WHERE foo.attribute_id_a = 582
  AND (foo.is_a IS NOT TRUE)
  AND (foo.is_b IS NOT TRUE)
  AND (foo.is_c IS NOT TRUE)
  AND (foo.is_d IS NOT TRUE)
ORDER BY foo_2_bar.bar_id ASC;

QUERY PLAN
Unique  (cost=48810.15..48842.34 rows=6437 width=4) (actual time=283.850..389.587 rows=3468 loops=1)
  ->  Sort  (cost=48810.15..48826.25 rows=6437 width=4) (actual time=283.846..335.532 rows=44985 loops=1)
        Sort Key: foo_2_bar.bar_id
        Sort Method: quicksort  Memory: 3645kB
        ->  Nested Loop  (cost=322.52..48402.94 rows=6437 width=4) (actual time=2.734..221.878 rows=44985 loops=1)
              ->  Bitmap Heap Scan on foo  (cost=322.09..27984.32 rows=7679 width=4) (actual time=2.718..27.531 rows=7885 loops=1)
                    Recheck Cond: (attribute_id_a = 582)
                    Filter: ((is_a IS NOT TRUE) AND (is_b IS NOT TRUE) AND (is_c IS NOT TRUE) AND (is_d IS NOT TRUE))
                    Rows Removed by Filter: 7
                    ->  Bitmap Index Scan on idx__foo__attribute_id_a  (cost=0.00..320.17 rows=7966 width=0) (actual time=1.335..1.335 rows=8161 loops=1)
                          Index Cond: (attribute_id_a = 582)
              ->  Index Only Scan using idx__foo2bar__test on foo_2_bar  (cost=0.43..2.61 rows=5 width=8) (actual time=0.004..0.010 rows=6 loops=7885)
                    Index Cond: (foo_id = foo.id)
                    Heap Fetches: 0
Total runtime: 394.606 ms

Query A-LIMITED -- same as above, just adds a LIMIT/OFFSET

EXPLAIN ANALYZE
SELECT DISTINCT foo_2_bar.bar_id
FROM foo_2_bar
JOIN foo ON foo_2_bar.foo_id = foo.id
WHERE foo.attribute_id_a = 582
  AND (foo.is_a IS NOT TRUE)
  AND (foo.is_b IS NOT TRUE)
  AND (foo.is_c IS NOT TRUE)
  AND (foo.is_d IS NOT TRUE)
ORDER BY foo_2_bar.bar_id ASC
LIMIT 50 OFFSET 0;

QUERY PLAN
Limit  (cost=0.85..15386.21 rows=50 width=4) (actual time=57698.794..446500.933 rows=50 loops=1)
  ->  Unique  (cost=0.85..1980710.86 rows=6437 width=4) (actual time=57698.789..446500.787 rows=50 loops=1)
        ->  Nested Loop  (cost=0.85..1980694.77 rows=6437 width=4) (actual time=57698.784..446498.319 rows=2011 loops=1)
              ->  Index Scan using idx__foo2bar__bar_id on foo_2_bar  (cost=0.43..75725.91 rows=1517741 width=8) (actual time=0.017..10373.409 rows=364
Re: [GENERAL] issue with double ordering in a wrapped distinct
On Nov 18, 2014, at 6:43 PM, Tom Lane wrote:

> but as for why it gets a much worse plan after
> flattening --- insufficient data.

Thanks. I'll run some test cases in the morning and post the full queries matched with EXPLAIN ANALYZE. This is just puzzling to me. I was hoping there might be a more general planner issue that someone would have noticed.

On Nov 18, 2014, at 6:55 PM, David G Johnston wrote:

> I presume you have a reason for not simply doing away with the subquery
> altogether...

When not using the subquery, the query ran in 1s -- if I didn't have a LIMIT/OFFSET. Adding a LIMIT/OFFSET to that query made it run for nearly 6 minutes. The only way I could trick it into using the better query plan was to wrap the "good" query as a subquery and then apply the LIMIT/OFFSET in the outer query.
[GENERAL] issue with double ordering in a wrapped distinct
I have a particular query that returns a resultset of 45k rows out of a large table (pg 9.3 and 9.1). It's a many-to-many query, where I'm trying to search for Bar based on attributes in a linked Foo. I tweaked the indexes, optimized the query, and got it down to an acceptable speed of around 1,100ms.

The second I added a LIMIT/OFFSET though, the query plan completely changed and the time ballooned up to 297,340 ms. Yes, I waited that long to see what was going on in the query planner.

I did a lot of playing around, and managed to get this form of the query to work in 305ms with a LIMIT/OFFSET:

SELECT DISTINCT qinner.bar_id
FROM (
    SELECT foo_2_bar.bar_id AS bar_id
    FROM foo_2_bar
    JOIN foo ON foo_2_bar.foo_id = foo.id
    WHERE foo.biz_id = 1
      AND (foo.is_hidden IS NOT TRUE)
    ORDER BY foo_2_bar.bar_id ASC
) AS qinner
ORDER BY qinner.bar_id ASC
LIMIT 100 OFFSET 0;

This is what I don't understand -- notice the two ORDER BY clauses:

If I run this with an inner and outer ORDER BY, I get ~305ms. (I don't think I need both, but I wasn't sure if ordering is kept from a subselect.)
If I run this with only the inner, I get ~304ms.
If I run this with only the outer, it's pushing over 10 minutes again.

I'm wondering if anyone might know why that performance hit would be happening.
Re: [GENERAL] String searching
On Nov 18, 2014, at 11:49 AM, Robert DiFalco wrote:

> As far as I can tell, the trigram extension would be the easiest way to
> implement this. It looks like I wouldn't need to mess with vectors, etc. It
> would just look like a standard index and query, right? It seems that if I
> need something more powerful in the future that I could always move to
> ElasticSearch, Sphinx, or something similar.

I just followed the instructions in the docs to create the index, and ran ANALYZE on the table before running a standard LIKE SELECT.
Re: [GENERAL] String searching
On Nov 18, 2014, at 7:38 AM, Albe Laurenz wrote:

> That index wouldn't help with the query at all.
>
> If you really need a full substring search (i.e., you want to find
> "howardjohnson"), the only thing that could help are trigram indexes.

I stand corrected. I ran a sample query on my test database of 100k names.

Using a function index on `lower(name)`, this runs an index scan in 0.2ms:

... where lower(name) = lower('bob');

but this runs a sequential scan in 90ms:

... where lower(name) like lower('%bob%');

I didn't know that a LIKE with a leading wildcard can't use a btree index!

Using a trigram index, this runs a bitmap index scan on the trigram index, then a bitmap heap scan on the table, in 13ms:

... where name ilike '%bob%';
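For anyone following along, the trigram setup behind those numbers looks roughly like this (table and index names are mine, not from the thread):

```sql
-- pg_trgm ships with Postgres as a contrib extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index on the searched column.
CREATE INDEX names_name_trgm_idx ON names USING gin (name gin_trgm_ops);

-- Leading-wildcard, case-insensitive searches can now use the index:
SELECT * FROM names WHERE name ILIKE '%bob%';
```

Run ANALYZE on the table afterwards so the planner has fresh statistics when deciding whether to use the new index.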
Re: [GENERAL] String searching
On Nov 17, 2014, at 12:55 PM, Robert DiFalco wrote:

> SELECT * FROM MyTable WHERE upper(FullName) LIKE upper('%John%');
>
> That said, which would be the best extension module to use? A "gist" index on
> the uppercased column? Or something else? Thanks!

Performance-wise, I think a function index would probably be the best:

CREATE INDEX mytable_lower_fullname_idx ON mytable(lower(fullname));

SELECT * FROM mytable WHERE lower(fullname) LIKE lower('%john%');

The only reason I use `lower` and not `upper` is that it's easier to look at when debugging and writing sample queries.

I'd bench against GIN and GiST, but I think this will work the best. The reason is that full-text search normalizes words using language rules to shrink the index, so it works great on "words":

select plainto_tsquery('doing watching reading programming');
 'watch' & 'read' & 'program'

but not so great on "names":

select plainto_tsquery('john doe');
 'john' & 'doe'

select plainto_tsquery('jon doe');
 'jon' & 'doe'

So you'll get a bit more overhead on the match and you won't get a smaller index (which is why they're great for fulltext). The search execution might turn out to be much faster -- if so, I'd love to know. But doing a lower() search against a lower() function index has always been ridiculously fast for me.

This only goes for names, though. If you're searching other fields, another search method might be considerably better.
[GENERAL] pg_upgrade and ubuntu
I ran into an issue migrating from 9.1 to 9.3 on ubuntu using pg_upgrade.

The default ubuntu package, and the one from postgresql.org, both store `postgresql.conf` under /etc as `/etc/postgresql/VERSION/main/postgresql.conf`; however, the pg_upgrade script expects it in the `datadir`. The simple solution seems to be just symlinking the /etc files into the data dirs.

It took me a while to realize this was the error. It might make sense to update the docs with a note about what should be in the data dir to enable an upgrade.
Re: [GENERAL] Are there any downsides to using postgres' data directory on a dedicated drive/partition / filesystem?
Thanks, everyone! For now this will be provisioning a physical drive for a box -- and "everything" will be there for now. So OS on one drive, and DB on another. I've run into programs before (mostly on Mac/Win) that are exceedingly unhappy if they're run on a drive other than the OS drive. Since many people partition data and services under pg, I figured it would be okay -- but I couldn't find anything in the docs and wanted to check.
[GENERAL] Are there any downsides to using postgres' data directory on a dedicated drive/partition / filesystem?
I'm running postgres on a virtual server. I was wondering if there were any known issues with moving the data directory to another mounted partition / filesystem.
[GENERAL] troubleshooting a database that keeps locking up
I have a database that has started to constantly hang after a brief period of activity.

Looking at `select * from pg_stat_activity;` I roughly see the following each time:

process 1 |
process 2 | in transaction
process 3 | in transaction
process 4 |
process 5 |
process 6 |
process 7 | INSERT INTO table_a RETURNING id

Occasionally I'll see:

process 8 | UPDATE table_b

Does anyone have tips on how I can troubleshoot this? I was hoping there would be some way to show the history of the "in transaction" processes, but I couldn't find one. I was also wondering if the "RETURNING id" might have something to do with this. I'd appreciate any pointers in trying to figure out what is causing this.
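A pattern like this (idle-in-transaction sessions plus a stuck INSERT) usually means one backend holds a lock the others are waiting on. A lock-monitoring query along these lines can show who is blocking whom; this is a sketch, and it assumes 9.2+ where pg_stat_activity exposes `query` (earlier versions call it `current_query`):

```sql
-- For each ungranted lock, find another session holding a conflicting
-- lock on the same object, and show both queries.
SELECT blocked.pid     AS blocked_pid,
       blocked.query   AS blocked_query,
       blocking.pid    AS blocking_pid,
       blocking.query  AS blocking_query
FROM pg_locks blocked_lock
JOIN pg_stat_activity blocked ON blocked.pid = blocked_lock.pid
JOIN pg_locks blocking_lock
     ON  blocking_lock.locktype = blocked_lock.locktype
     AND blocking_lock.database      IS NOT DISTINCT FROM blocked_lock.database
     AND blocking_lock.relation      IS NOT DISTINCT FROM blocked_lock.relation
     AND blocking_lock.transactionid IS NOT DISTINCT FROM blocked_lock.transactionid
     AND blocking_lock.pid <> blocked_lock.pid
JOIN pg_stat_activity blocking ON blocking.pid = blocking_lock.pid
WHERE NOT blocked_lock.granted;
```

The "in transaction" sessions that show up as blockers are typically application code that opened a transaction and never committed.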
Re: [GENERAL] Modeling Friendship Relationships
On Nov 11, 2014, at 5:38 PM, Robert DiFalco wrote:

> Thoughts? Do I just choose one or is there a clear winner? TIA!

I prefer this model:

    user_id__a    INT NOT NULL REFERENCES user(id),
    user_id__b    INT NOT NULL REFERENCES user(id),
    is_reciprocal BOOLEAN,
    PRIMARY KEY (user_id__a, user_id__b)

If a relationship is confirmed (or dropped), I toggle is_reciprocal. Having that value saves a lot of work when joining or analyzing friendship sets.

If you have multiple relationship types, things get tricky. You can either:

- treat the row as a triplet (user_id__a, user_id__b, relationship_type_id) [I still recommend the reciprocal bool]
- if you have a finite set of relationship types, use each one as a bool column within the a2b row

I've tried the "one row per relationship" approach and didn't like it. The time savings on simple searches were marginal, and the SQL grew increasingly complex and slower to execute as we leveraged the table into other queries.
[GENERAL] index behavior question - multicolumn not consulted ?
I have a table with over 1MM records and 15 columns. I had created a unique index on a mix of two columns to enforce a constraint:

    (resource_type_id, lower(archive_pathname))

I've noticed that searches never use this. No matter what I query, even if it's only the columns in the index, I'm seeing a 550ms sequential scan on everything. If I create an index only on the text expression, lower(archive_pathname), all the queries use that and complete in 1.4ms.

Does anyone know why this happens?
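One common explanation is that a multicolumn btree index is generally only useful when the query constrains its leading column. A sketch using hypothetical names modeled on the post:

```sql
-- The constraint-enforcing index, with resource_type_id leading:
CREATE UNIQUE INDEX files_type_path_idx
    ON files (resource_type_id, lower(archive_pathname));

-- A query that constrains the leading column can use it:
SELECT * FROM files
 WHERE resource_type_id = 3
   AND lower(archive_pathname) = lower('/some/path');

-- A query on the second column alone generally cannot use it efficiently,
-- which would explain the fallback to a sequential scan:
SELECT * FROM files
 WHERE lower(archive_pathname) = lower('/some/path');
```

If the queries never filter on resource_type_id, a dedicated index on lower(archive_pathname) alone (as observed) is the right fit, and the multicolumn one can remain purely for the uniqueness constraint.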
Re: [GENERAL] faster way to calculate top "tags" for a "resource" based on a column
On Oct 7, 2014, at 10:02 AM, Marc Mamin wrote:

> Hi,
> it seems to me that your subquery may deliver duplicate ids.
> And with the selectivity of your example, I would expect an index usage
> instead of a table scan. You may check how up to date your statistics are
> and try to raise the statistic target on the column resource_2_tag.tag_id.
> Also try a CTE form for your query:

It shouldn't be able to deliver duplicate ids:

=> SELECT COUNT(*) FROM (SELECT DISTINCT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

=> SELECT COUNT(*) FROM (SELECT id FROM resource WHERE resource_attribute1_id = 614) AS foo;
 count
-------
  5184

However, adding in the DISTINCT drastically changed the query plan and did give a speedup.

Your comment made me focus on the notion of a table scan. I had assumed the seq scan was unavoidable -- and that there would not be much savings otherwise -- because the table is just 2 ids. I was wrong. I noticed that I never put a PRIMARY KEY constraint on that table. So I tried adding a PRIMARY KEY constraint, then running VACUUM ANALYZE... and that solved all my problems. The original query ended up being the fastest, at 260ms (down from 1760ms):

  Join                   - 260ms
  Subquery w/DISTINCT    - 300ms
  CTE                    - 330ms
  CTE w/DISTINCT         - 345ms
  Subquery (no DISTINCT) - 1500ms
Re: [GENERAL] faster way to calculate top "tags" for a "resource" based on a column
On Oct 6, 2014, at 5:56 PM, Jim Nasby wrote:

> Don't join to the resource table; there's no reason to because you're not
> pulling anything from it.

Thanks for the reply! I'm not pulling anything from the resource table, but the join is necessary because I'm filtering on it (see the WHERE clause). I'm not trying to find the most-used tags overall, but the ones used by resources matching a (variable) id on a column in the resource table.
[GENERAL] faster way to calculate top "tags" for a "resource" based on a column
I've been able to get most of my slow queries down to something more acceptable, but I haven't been able to shave any time off this one. I'm hoping someone has another strategy.

I have 2 tables:

    resource
    resource_2_tag

I want to calculate the top 25 "tag_ids" in "resource_2_tag" for resources that match a given attribute on the "resource" table. Both tables have around 1.6 million records.

If the database needs to warm up and read into cache, this can take 60 seconds to read the data off disk. If the database doesn't need to warm up, it averages 1.76 seconds. The 1.76s time is troubling me; searching for the discrete elements of this is pretty lightweight.

Here's an explain -- http://explain.depesz.com/s/PndC

I tried a subquery instead of a join, and the planner optimized it to the same plan. I'm hoping someone will see something that I just don't see.

       Table "public.resource_2_tag"
   Column    |  Type   | Modifiers
-------------+---------+-----------
 resource_id | integer |
 tag_id      | integer |
Indexes:
    "_idx_speed_resource_2_tag__resource_id" btree (resource_id)
    "_idx_speed_resource_2_tag__tag_id" btree (tag_id)

              Table "public.resource"
         Column         |  Type   | Modifiers
------------------------+---------+-------------------------------------------------------
 id                     | integer | not null default nextval('resource_id_seq'::regclass)
 resource_attribute1_id | integer |
 lots of other columns  |         |
Indexes:
    "resource_attribute1_idx" btree (resource_attribute1_id)

select count(*) from resource;                                    -- 1669729
select count(*) from resource_2_tag;                              -- 1676594
select count(*) from resource where resource_attribute1_id = 614; -- 5184 -- 4.386ms
select id from resource where resource_attribute1_id = 614;       -- 5184 -- 87.303ms

Popping the 5k elements into an IN clause runs the query in around 100ms.

EXPLAIN ANALYZE
SELECT resource_2_tag.tag_id AS resource_2_tag_tag_id,
       count(resource_2_tag.tag_id) AS counted
FROM resource_2_tag
JOIN resource ON resource.id = resource_2_tag.resource_id
WHERE resource.resource_attribute1_id = 614
GROUP BY resource_2_tag.tag_id
ORDER BY counted DESC
LIMIT 25 OFFSET 0;

Limit  (cost=76659.61..76659.68 rows=25 width=4) (actual time=1502.902..1502.913 rows=25 loops=1)
  ->  Sort  (cost=76659.61..76672.47 rows=5141 width=4) (actual time=1502.900..1502.906 rows=25 loops=1)
        Sort Key: (count(resource_2_tag.tag_id))
        Sort Method: top-N heapsort  Memory: 26kB
        ->  HashAggregate  (cost=76463.13..76514.54 rows=5141 width=4) (actual time=1487.016..1495.206 rows=13887 loops=1)
              ->  Hash Join  (cost=35867.88..76437.42 rows=5141 width=4) (actual time=97.654..1453.337 rows=27068 loops=1)
                    Hash Cond: (resource_2_tag.resource_id = resource.id)
                    ->  Seq Scan on resource_2_tag  (cost=0.00..25847.94 rows=1676594 width=8) (actual time=0.032..513.046 rows=1676594 loops=1)
                    ->  Hash  (cost=35803.88..35803.88 rows=5120 width=4) (actual time=97.576..97.576 rows=5184 loops=1)
                          Buckets: 1024  Batches: 1  Memory Usage: 183kB
                          ->  Bitmap Heap Scan on resource  (cost=272.68..35803.88 rows=5120 width=4) (actual time=5.911..90.264 rows=5184 loops=1)
                                Recheck Cond: (resource_attribute1_id = 614)
                                ->  Bitmap Index Scan on resource_attribute1_idx  (cost=0.00..271.40 rows=5120 width=0) (actual time=3.575..3.575 rows=5184 loops=1)
                                      Index Cond: (resource_attribute1_id = 614)
Total runtime: 1503.146 ms
Re: [GENERAL] installing on mac air development machine
On Oct 2, 2014, at 7:30 PM, john gale wrote:

> The GUI installer for Mac OS X downloaded from postgresql.org works fine.

Unless you NEED to build from source, use the GUI installer. Unless you are already on a system where installing from Fink/MacPorts/source is commonplace, you're going to spend more time installing and configuring the environment than you will using the application.
[GENERAL] Benching Queries
Does anyone have a good solution for benching queries under various conditions and collecting the EXPLAIN data? I looked at pgbench, but it doesn't seem to be what I want.

My situation is this:

- For a given query, there are 3-5 different ways that I can run it.
- Each form of the query has a completely different execution plan and query time, often using different indexes.
- The same query runs differently on the first execution vs subsequent ones (when the indexes/tables are already in memory).

My goal is to find an overall balance of query time (cold-start vs in-memory) and number of indexes.
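For what it's worth, a fair amount of this can be done from psql alone; the query here is only illustrative:

```sql
\timing on

-- BUFFERS distinguishes warm runs from cold ones: "shared hit" blocks came
-- from Postgres' buffer cache, while "read" blocks came from the OS/disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM resource WHERE resource_attribute1_id = 614;
```

Running each query form a few times and comparing the first run's buffer reads against later runs' buffer hits gives a rough cold-vs-warm picture without restarting the server, though true cold-cache numbers still require dropping the OS cache between runs.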
Re: [GENERAL] improving speed of query that uses a multi-column "filter" ?
On Oct 1, 2014, at 12:34 AM, Misa Simic wrote:

> Have you considered maybe partial indexes?
>
> http://www.postgresql.org/docs/9.3/static/indexes-partial.html
>
> I.e idx1 on pk column of the table with where inside index exactly the same
> as your first where
>
> Idx2 on pk column with where inside index as second where

That was actually my first attempt, and I was hoping it would work. Unfortunately, there is always something in the queries that keeps Postgres using other (slower) indexes or jumping to a sequential scan. I haven't been able to trick the planner into using the partial index, and most online resources suggested it wasn't possible.
Re: [GENERAL] improving speed of query that uses a multi-column "filter" ?
On Sep 30, 2014, at 8:04 PM, John R Pierce wrote:

> if col_1 IS NULL, then that OR condition doesn't make much sense. just
> saying...

I was just making a quick example. There are two commonly used "filter sets", each mostly on nullable bool columns -- but one checks whether the row references itself in a particular column.

> these 4 columns are all nullable booleans, so they can be TRUE, FALSE, or
> NULL ?

Most of them, yes.

> with 4 columns, there's 3^4 = 81 possible combinations of these values...
> you might get better speeds encoding this as a single SHORT INTEGER, and
> enumerating those 81 states, then just do equals or IN (set of values)
> conditions... of course, this might make a lot of OTHER code more
> complicated. It might be easier to make each col_X 2 bits of this integer,
> such that one bit indicates the value was 'NULL', and the other bit is the
> true/false state if that first bit isn't set; this would make testing
> individual bits somewhat better.

That's interesting. I never thought about how Postgres processes the data. For legacy reasons I can't change the data types, but I can add additional columns. So I could use a trigger/function to maintain an integer filter_test column, give each filter a bit value, and then just scan on that. It wouldn't be much more work to test that against dedicated bool columns for each filter.
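The two-bits-per-column encoding suggested above could be backfilled with something like this (table and column names are hypothetical; a trigger would keep the column current afterwards):

```sql
-- Two bits per tri-state column: NULL = 0, FALSE = 1, TRUE = 2.
-- col_1 occupies bits 0-1, col_2 bits 2-3, col_3 bits 4-5.
ALTER TABLE wide_table ADD COLUMN filter_bits SMALLINT;

UPDATE wide_table SET filter_bits =
      (CASE WHEN col_1 IS NULL THEN 0 WHEN col_1 THEN 2 ELSE 1 END)
    + (CASE WHEN col_2 IS NULL THEN 0 WHEN col_2 THEN 2 ELSE 1 END) * 4
    + (CASE WHEN col_3 IS NULL THEN 0 WHEN col_3 THEN 2 ELSE 1 END) * 16;

CREATE INDEX wide_table_filter_bits_idx ON wide_table (filter_bits);

-- A filter then collapses to an equality or IN() test on one indexed column,
-- e.g. "all three columns NULL" is simply:
SELECT * FROM wide_table WHERE filter_bits = 0;
```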
[GENERAL] improving speed of query that uses a multi-column "filter" ?
I'm trying to improve the speed of a suite of queries that go across a few million rows. They use 2 main "filters" across a variety of columns:

WHERE (col_1 IS NULL) AND (col_2 IS NULL) AND ((col_3 IS NULL) OR (col_3 = col_1))

WHERE (col_1 IS TRUE) AND (col_2 IS TRUE) AND (col_3 IS TRUE) OR (col_4 IS NULL)

I created a dedicated multi-column index for each query to speed them up. That was great, but I still don't have the performance where I want it to be -- the size of the index seems to be an issue. If the index were on one column instead of 4, I think the scans would complete in time.

I looked online and in the archives, and couldn't find much information on good strategies to deal with this. It looks like my best option is to somehow index the "interpretation" of this criteria, not the criteria itself. The two ways that come to mind are:

1. alter the table: add a boolean column for each filter-test, index that, then query on that field
2. leave the table as-is: write a custom function for each filter, and then use a function index

Has anyone else encountered a need like this? Are there any tips / tricks / things I should look out for? Are there better ways to handle this?
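Option 2 might be sketched like this, with a simplified filter and hypothetical names (the function must be IMMUTABLE to be usable in an index, and the query has to invoke the identical expression for the index to match):

```sql
-- One named function per filter set (simplified example condition).
CREATE FUNCTION passes_filter_a(c1 BOOLEAN, c2 BOOLEAN, c3 BOOLEAN)
RETURNS BOOLEAN AS $$
    SELECT c1 IS NULL AND c2 IS NULL AND c3 IS NULL;
$$ LANGUAGE sql IMMUTABLE;

-- A single-expression index over the function's result.
CREATE INDEX big_table_filter_a_idx
    ON big_table (passes_filter_a(col_1, col_2, col_3));

-- Queries must filter through the same expression:
SELECT * FROM big_table WHERE passes_filter_a(col_1, col_2, col_3);
```

Whether the planner actually chooses the index still depends on how selective the filter is; if most rows pass it, a sequential scan can legitimately win.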
Re: [GENERAL] Postgres as key/value store
On Sep 27, 2014, at 7:48 PM, snacktime wrote:

> The schema is that a key is a string, and the value is a string or binary. I
> am actually storing protocol buffer messages, but the library gives me the
> ability to serialize to native protobuf or to json. Json is useful at times
> especially for debugging.

I don't know if this will apply to you, but I got significant speed improvements on Postgres key searches by using substring indexes. If your keys are just random hashes, this will probably work well for you. If your keys are human-readable, it probably won't work as well, because the distribution of prefix characters will be too uniform.

The general idea is twofold:

1. Create an additional expression index on a prefix of the key field:

   CREATE INDEX _entities_id__subst_7 ON entities(substr(id,1,7));

2. Update your SELECTs to search for both the full string AND the substring:

   - WHERE id = :id
   + WHERE (id = :id) AND (substr(id,1,7) = substr(:id, 1, 7))

By adding the substring comparison, the planner will (usually) do a first pass on the substring index, then search that limited set for the rest of the matching criteria. On a table with 4MM+ records, introducing a substring index/query improved my searches by a few orders of magnitude. Before trying this indexing strategy, we were actively looking to migrate this particular query service off of Postgres -- it was such a bottleneck and was not scalable. Now there is no reason to leave in the foreseeable future.

On Sep 27, 2014, at 8:33 PM, Gavin Flower wrote:

>> This works well because keys are left prefixed with a scope, a delimiter,
>> and then the actual key for the data.
> Then I noticed that your id is actually a compound key, and probably would be
> better modelled as:

If you're able to factor the scope out, an index on (scope, substr(key,1,7)) might work well. I only used 1,7 as my substr arguments because that was the optimal speed/space mix on my dataset; depending on yours, a shorter or longer prefix might be more appropriate.
Re: [GENERAL] table versioning approach (not auditing)
On Sep 29, 2014, at 4:06 PM, Nick Guenther wrote: > A newbie tangent question: how do you access the transaction serial? Is it > txid_current() as listed in > http://www.postgresql.org/docs/9.3/static/functions-info.html? My implementations were ridiculously simple/naive in design, and existed entirely with under defined serials. i'd just create a new record + id on a write operation, and then use it when logging all operations. I had read up on a lot of (possibly better) ways to handle this using pg internals. They all seemed more advanced than I needed. > And does your implementation worry about multiple timelines? Not sure I understand this... but every object is given a revision id. edits between consecutive revisions are allowed, edits spanning multiple revisions are rejected. On Sep 29, 2014, at 5:25 PM, Abelard Hoffman wrote: > Felix & Jonathan: both of you mention just storing deltas. But if you do > that, how do you associate the delta record with the original row? Where's > the PK stored, if it wasn't part of the delta? The logic I decided on, is this: Revision 0 Only the original record is stored Revision 1 • Copy the original record into revision store Revision 1+ • Update the original record, store the deltas in the revision store The reason why I chose this path, is that in my system: • most records are not edited • the records that are edited, are heavily edited We use an ORM and it was simple to implement this pattern with it, and then write some functions in postgres to ensure it is adhered to. When I need to pull data out: • I can pull exact revisions out of the htstore for a given table/row using the revision ids as a key • the revisions all contain the transaction id • if i need to get more info about a given transaction, i can query the transactions table and get a list of all the objects that were edited within that transaction if i wanted to ensure referential integrity, i could have used a table instead of an hstore (or json). 
If the application grows much larger, it will probably be migrated to a model like that. This approach just gave a lot of flexibility, minimized tables in the database, and was very easy to pull off. I went with hstore because json didn't allow in-place updates at the time (I think it does now). -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
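The transaction-log-plus-delta scheme described above might be sketched roughly like this; all table and column names are invented for illustration, not taken from the original application:

```sql
-- Sketch only; names are hypothetical.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE transaction_log (
    txn_serial  SERIAL PRIMARY KEY,
    logged_at   TIMESTAMP NOT NULL DEFAULT now(),
    user_id     INT NOT NULL
);

CREATE TABLE revision_store (
    object_id   INT NOT NULL,
    revision_id INT NOT NULL,
    txn_serial  INT NOT NULL REFERENCES transaction_log(txn_serial),
    delta       HSTORE NOT NULL,   -- changed columns only (full copy at revision 1)
    PRIMARY KEY (object_id, revision_id)
);

-- pull an exact revision's delta for a given row:
SELECT delta
FROM revision_store
WHERE object_id = 849 AND revision_id = 3;

-- find everything edited within one transaction:
SELECT object_id, revision_id
FROM revision_store
WHERE txn_serial = 1001;
```

The primary key (object_id, revision_id) answers the "where's the PK stored?" question from the thread: it lives beside the delta, not inside it.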
Re: [GENERAL] table versioning approach (not auditing)
In the past, to accomplish the same thing, I've done this:

- store the data in hstore/json. Instead of storing snapshots, I store deltas. I've been using a second table, though, because it improved performance on reads and writes.
- use a "transaction" log. Every write session gets logged into the transaction table (serial, timestamp, user_id). All updates to the recorded tables include the transaction's serial. Then there is a "transactions" table that is just (transaction_serial, object_id, object_action).

Whenever I have needs for auditing or versioning, I can just query the transaction table for the records I want... then use that to grab the data out of hstore.

On Sep 28, 2014, at 10:00 PM, Abelard Hoffman wrote:

> Hi. I need to maintain a record of all changes to certain tables to assist in viewing history and reverting changes when necessary (customer service makes an incorrect edit, etc.).
>
> I have studied these two audit trigger examples:
> https://wiki.postgresql.org/wiki/Audit_trigger
> https://wiki.postgresql.org/wiki/Audit_trigger_91plus
>
> I've also read about two other approaches to versioning:
> 1. maintain all versions in one table, with a flag to indicate which is the current version
> 2. have a separate versions table for each real table, and insert into the associated version table whenever an update or insert is done.
>
> My current implementation is based on the wiki trigger examples, using a single table and a json column to record the row changes (rather than hstore). What I like about that, in particular, is I can have a "global," chronological view of all versioned changes very easily.
>
> But there are two types of queries I need to run.
> 1. Find all changes made by a specific user
> 2. Find all changes related to a specific record
>
> #1 is simple to do. The versioning table has a user_id column of who made the change, so I can query on that.
>
> #2 is more difficult. I may want to fetch all changes to a group of tables that are all related by foreign keys (e.g., find all changes to "user" record 849, along with any changes to their "articles," "photos," etc.). All of the data is in the json column, of course, but it seems like a pain to try and build a query on the json column that can fetch all those relationships (and if I mess it up, I probably won't generate any errors, since the json is so free-form).
>
> So my question is, do you think using the json approach is wrong for this case? Does it seem better to have separate versioning tables associated with each real table? Or another approach?
>
> Thanks
[GENERAL] advice sought - general approaches to optimizing queries around "event streams"
I have a growing database with millions of rows that track resources against an event stream. I have a few handfuls of queries that interact with this stream in a variety of ways, and I have managed to drop things down from 70s to 3.5s on full scans, and offer .05s partial scans. No matter how I restructure queries, I can't seem to get around a few bottlenecks, and I wanted to know if there were any tips/tricks from the community on how to approach them.

A simple form of my database would be:

-- 1k of
create table stream (
    id int not null primary key
);

-- 1MM of
create table resource (
    id int not null primary key,
    col_a bool,
    col_b bool,
    col_c text
);

-- 10MM of
create table streamevent (
    id int not null,
    event_timestamp timestamp not null,
    stream_id int not null references stream(id)
);

-- 10MM of
create table resource_2_stream_event (
    resource_id int not null references resource(id),
    streamevent_id int not null references streamevent(id)
);

Everything is running off of indexes; there are no seq scans. I've managed to optimize my queries by avoiding joins against tables, and turning the stream interaction into a subquery or CTE. Better performance has come from limiting the number of "stream events" (which are only the timestamp and resource_id off a joined table).

The bottlenecks I've encountered have primarily been:

1. When interacting with a stream, the ordering of event_timestamp and deduplicating of resources becomes an issue. I've figured out a novel way to work with the most recent events, but distant events are troublesome:

    using no limit, the query takes 3500 ms
    using a limit of 1, the query takes 320 ms
    using a limit of 1000, the query takes 20 ms

There is a dedicated index on event_timestamp (desc), and it is being used according to the planner... finding all the records is fine; merging-into and sorting the aggregate to handle the deduplication of records in a stream seems to be the issue (either with DISTINCT or max + group by).

2. I can't figure out an effective way to search for a term against an entire stream (using a tsquery/gin based search). I thought about limiting the query by finding matching resources first, then locking it to an event stream, but scanning the entire table for a term takes about 10 seconds on an initial hit. Subsequent queries for the same terms end up using the cache, and complete within 20 ms. I get better search performance by calculating the event stream, then searching it for matching documents, but I still have the performance issues related to limiting the window of events.

I didn't include example queries, because I'm more concerned with the general approaches and ideas behind dealing with large data sets than I am with raw SQL right now. I'm hoping someone can enlighten me into looking at new ways to solve these problems. I think I've learned more about postgres/sql in the past 48 hours than I have in the past 15 years, and I'm pretty sure that the improvements I need will come from new ways of querying data, rather than optimizing the current queries.
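Not part of the original post, but the "most recent event per resource" deduplication described in bottleneck 1 is often expressed with DISTINCT ON rather than max + group by; a sketch against the schema given above (the stream id 42 is an invented example value):

```sql
-- Latest event per resource within one stream, using the tables
-- defined in the post. DISTINCT ON keeps the first row per
-- resource_id in the ORDER BY order.
SELECT DISTINCT ON (r2se.resource_id)
       r2se.resource_id,
       se.event_timestamp
FROM streamevent se
JOIN resource_2_stream_event r2se
  ON r2se.streamevent_id = se.id
WHERE se.stream_id = 42
ORDER BY r2se.resource_id, se.event_timestamp DESC;
```

Whether this beats the aggregate form depends on cardinalities and available indexes; it is one alternative to benchmark, not a guaranteed win.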
[GENERAL] how does full text searching tokenize words ? can it be altered?
I'm getting a handful of 'can not index words longer than 2047 characters' errors on my `gin` indexes.

1. does this 2047 character count correspond to tokens / indexed words?
2. if so, is there a way to lower this number?
3. is there a way to profile the index for the frequency of tokens?

(apologies in advance if this looks familiar; I posted this as part of a larger question last month -- everything but this was answered by the list, and I can't find answers to this online)
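On question 3, PostgreSQL ships a statistics function for exactly this; a sketch, where `docs` and `tsv` are placeholder names for a table and its tsvector column:

```sql
-- Most frequent tokens in an existing tsvector column.
-- "docs" and "tsv" are invented names; substitute your own.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT tsv FROM docs')
ORDER BY nentry DESC
LIMIT 20;
```

ts_stat() returns one row per distinct lexeme, with the number of documents (ndoc) and total occurrences (nentry), which also makes it easy to spot the abnormally long tokens that trigger the indexing error.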
Re: [GENERAL] How to store fixed size images?
On Jun 19, 2014, at 11:21 AM, Andy Colson wrote:

> I think it depends on how you are going to use them. I, for example, have lots of images that are served on a web page; after benchmarks I found it was faster to store them on the filesystem and let apache serve them directly.

I rarely store images like that locally now; I just toss them onto Amazon S3. When I did have to store lots of images locally, I found this to be the best method:

1. The Postgres record for the image is given a unique and random hash as a hexdigest
2. The image is saved onto a filesystem, into a directory mapped by the hexdigest

For example, there might be something like this:

Postgres:
    id  | filename  | hash
    001 | image.jpg | abcdef123

Filesystem:
    abc/def/123/abcdef123-image.jpg

nginx/apache rewrite rule:
    abcdef123-image.jpg -> abc/def/123/abcdef123-image.jpg

The reason for this has to do with the performance of various filesystems and issues with the distribution of digits in a sequence. It ties into Benford's Law ( http://en.wikipedia.org/wiki/Benford's_law ) as well. A handful of filesystems exhibit decreased performance as the number of items in a directory increases. A few years ago, 1k-4k items was a safe max -- but at 10x that, some filesystems really slowed. I think most modern filesystems are still quick in the 5-10k range.

A hash has more characters and a more even distribution than a series of numbers or natural-language filenames. And if you group a hexdigest into triplets, you get a max of 4096 files/folders in a directory, which is a decent sweet spot: 16 * 16 * 16 = 4096.

I haven't had to deal with this sort of stuff in almost 10 years now, but archiving content like this back then was a considerable improvement to filesystem performance and web serving.
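The triplet mapping described above can be computed directly in SQL; a sketch, assuming a hypothetical table `images(hash, filename)` matching the example rows in the post:

```sql
-- Shard a hexdigest into abc/def/123/abcdef123-image.jpg style paths.
-- "images", "hash", and "filename" are placeholder names.
SELECT substr(hash, 1, 3) || '/'
    || substr(hash, 4, 3) || '/'
    || substr(hash, 7, 3) || '/'
    || hash || '-' || filename AS fs_path
FROM images;
```

For the row (hash = 'abcdef123', filename = 'image.jpg') this yields 'abc/def/123/abcdef123-image.jpg', matching the rewrite rule shown above.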
Re: [GENERAL] two questions about fulltext searching / tsvector indexes
On Jun 10, 2014, at 8:26 AM, Vick Khera wrote:

Thanks so much for this. We do a lot of searching on this column, so pre-computing seems to be the way. I'm not worried about disk space for now, and can revisit that later if there is a problem.

Just for clarification on this:

Option A (less fast): create a gin index on to_tsvector(searchable_column)
Option B (faster): create a tsvector column computed from `searchable_column`, and create a gin index on that tsvector column

> On Mon, Jun 9, 2014 at 8:55 PM, Jonathan Vanasco wrote:
>> I can't figure out which one to use. This is on a steadily growing table of around 20MM rows that gets 20-80k new records a day, but existing records are rarely updated.
>
> The question as always is a time-space trade-off. How frequently do you make the full text search? If you do it frequently, then with a pre-computed tsv column you save all that time per row of computing the tsvector on every search. If you do it infrequently, the space savings (and not needing to maintain that column) may benefit you.
>
> Personally, in these days of cheap disks I'd go with the dedicated column. Given that, you want to just have a GIN index on that one column, and the query you want, given some plain text string like "fluffy dog", is this:
>
> select plainto_tsquery('fluffy dog') @@ my_tsv_column;
>
> I always use a trigger on insert and update to maintain the ts_vector column, so there is no doubt of how it was computed by various programs.

// Jonathan Vanasco c. 646.729.6436 | 415.501.9815 e. jonat...@2xlp.com w. http://findmeon.com/user/jvanasco linkedin. http://linkedin.com/in/jonathanvanasco blog. http://destructuring.net
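Option B is usually wired up with PostgreSQL's built-in trigger function; a sketch, assuming a table `mytable` with a text column `searchable_column` (the column and index names here are invented):

```sql
-- Dedicated tsvector column maintained by the built-in trigger.
ALTER TABLE mytable ADD COLUMN searchable_tsv tsvector;

-- backfill existing rows, then index the new column:
UPDATE mytable SET searchable_tsv = to_tsvector('english', searchable_column);
CREATE INDEX mytable_tsv_idx ON mytable USING gin(searchable_tsv);

-- keep the column current on every insert/update:
CREATE TRIGGER mytable_tsv_update
BEFORE INSERT OR UPDATE ON mytable
FOR EACH ROW EXECUTE PROCEDURE
    tsvector_update_trigger(searchable_tsv, 'pg_catalog.english', searchable_column);
```

Searches then test the column directly, as in the quoted reply: `plainto_tsquery('fluffy dog') @@ searchable_tsv`.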
[GENERAL] two questions about fulltext searching / tsvector indexes
I'm having some issues with fulltext searching. I've gone through the list archives and Stack Overflow, but can't seem to get exact answers; hoping someone can help. Thanks in advance, and apologies for these questions being rather basic -- I just felt the docs and some online posts were leading me into possibly making the wrong decision, and I want to make sure I'm doing this right.

1. I need to make both 'title' and 'description' searchable. What is the current proper way to index multiple columns of a table (ie, not one)? I've essentially seen the following in the docs, mailing list, and various websites:

A unified index:
    CREATE INDEX CONCURRENTLY unified_tsvector_idx ON mytable USING gin(to_tsvector('english', title || ' ' || description));

Individual indexes:
    CREATE INDEX CONCURRENTLY title_tsvector_idx ON mytable USING gin(to_tsvector('english', title));
    CREATE INDEX CONCURRENTLY description_tsvector_idx ON mytable USING gin(to_tsvector('english', description));

Using dedicated columns (one or more):
    ALTER TABLE / create trigger

I can't figure out which one to use. This is on a steadily growing table of around 20MM rows that gets 20-80k new records a day, but existing records are rarely updated.

2. I've been getting a handful of 'can not index words longer than 2047 characters' in my tests. If this 2047 character max is on tokens, is there a way to lower it? Or to profile the index for the distribution of tokens? I don't think we have to support any tokens larger than 20 chars or so.

3a. What should EXPLAIN ANALYZE show if it is using the index? I couldn't find an example.

3b. Depending on how I index the column, what do I need to pass into the query so that it uses the index?

    1. if the index is created like gin(to_tsvector('english', title)), do I have to search in this format?
       to_tsvector('english', title) @@ to_tsquery('english', 'dog');
    2. if I use an index like gin(to_tsvector('english', title || ' ' || description)), what is the correct way to query the database and let the planner know I want to use the index?
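On question 3b.2: the planner can only match an expression index when the query repeats the indexed expression essentially verbatim. A sketch for the unified index above:

```sql
-- To use gin(to_tsvector('english', title || ' ' || description)),
-- the WHERE clause must contain the same expression:
SELECT *
FROM mytable
WHERE to_tsvector('english', title || ' ' || description)
      @@ to_tsquery('english', 'dog');
```

If the expression in the query differs (say, a different concatenation order or text search configuration), the index is silently ignored and the planner falls back to other access paths.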
Re: [GENERAL] how can i bugfix "idle in transaction" lockups ?
begin w/o commit or rollback? and thanks. you've been very helpful! On Nov 30, 2010, at 2:21 PM, Merlin Moncure wrote: > Begin w/o commit is a grave application error and you should > consider reworking your code base so that it doesn't happen (ever).
[GENERAL] how can i bugfix "idle in transaction" lockups ?
On a project, I find myself continually finding the database locked up with "idle in transaction" connections. Are there any commands that will allow me to check exactly what was going on in that transaction? I couldn't find anything in the docs, and the project has decent traffic, so it's going to be an issue to log all statements so I can sift through the data by PID.
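For readers finding this thread today: on modern PostgreSQL (9.2+), pg_stat_activity exposes a `state` column that makes these sessions easy to spot; a sketch (on the releases current when this was posted, the information appeared as '<IDLE> in transaction' in the `current_query` column instead):

```sql
-- Find idle-in-transaction sessions, how long the transaction has
-- been open, and the last statement each session ran (9.2+).
SELECT pid, usename, xact_start, state, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```

The `query` column shows only the most recent statement, not the whole transaction; reconstructing the full statement history still requires statement logging.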
Re: [GENERAL] question about unique indexes
On May 10, 2010, at 6:29 AM, Alban Hertroys wrote: As the docs state and as others already mentioned, "Null values are not considered equal". Ah. I interpreted that wrong. I thought it applied to indexes differently. I'll have to experiment now...
[GENERAL] question about unique indexes
-- running pg 8.4

I have a table defining geographic locations:

    id
    lat
    long
    country_id not null
    state_id
    city_id
    postal_code_id

It was given a unique index on (country_id, state_id, city_id, postal_code_id). The unique index isn't working as I'd expect it to; I was hoping someone could explain why. In the two records below, only country_id and state_id are assigned (aside from the serial):

 geographic_location_id | coordinates_latitude | coordinates_longitude | country_id | state_id | city_id | postal_code_id
------------------------+----------------------+-----------------------+------------+----------+---------+----------------
                    312 |                      |                       |        233 |       65 |         |
                    443 |                      |                       |        233 |       65 |         |

I was under the expectation that the unique constraint would apply in this case. From the docs: "When an index is declared unique, multiple table rows with equal indexed values are not allowed. Null values are not considered equal. A multicolumn unique index will only reject cases where all indexed columns are equal in multiple rows."
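The behaviour in question can be reproduced with a minimal table (names invented to mirror the post):

```sql
CREATE TABLE loc (
    id             SERIAL PRIMARY KEY,
    country_id     INT NOT NULL,
    state_id       INT,
    city_id        INT,
    postal_code_id INT,
    UNIQUE (country_id, state_id, city_id, postal_code_id)
);

-- Both inserts succeed: NULLs are never considered equal to each
-- other, so these rows do not conflict under the unique constraint.
INSERT INTO loc (country_id, state_id) VALUES (233, 65);
INSERT INTO loc (country_id, state_id) VALUES (233, 65);

-- A common workaround is a unique expression index that maps NULL
-- to a sentinel, e.g.:
-- CREATE UNIQUE INDEX loc_uniq ON loc
--     (country_id, coalesce(state_id, 0), coalesce(city_id, 0),
--      coalesce(postal_code_id, 0));
```

The coalesce() variant makes two "all-NULL-tail" rows collide, at the cost of reserving a sentinel value that must never be a real id.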
Re: [GENERAL] trying to write a bit of logic as one query, can't seem to do it under 2
On Apr 21, 2010, at 9:38 PM, Glen Parker wrote: Not if qty_requested_available needs to be <= qty_available... indeed, i'm an idiot this week. thanks a ton. this really helped me out!
Re: [GENERAL] trying to write a bit of logic as one query, can't seem to do it under 2
it would be that, but with greatest. thank you -- that's the exact query I was failing to write! On Apr 21, 2010, at 8:51 PM, Glen Parker wrote:

UPDATE cart_item
SET qty_requested_available = least(cart_item.qty_requested, stock.qty_available)
FROM stock
WHERE cart_item.stock_id = stock.stock_id
  AND qty_requested_available <> least(cart_item.qty_requested, stock.qty_available);
[GENERAL] trying to write a bit of logic as one query, can't seem to do it under 2
I'm trying to write a bit of logic as 1 query, but I can't seem to do it under 2 queries; I'm hoping someone can help.

The basic premise is that I have an inventory management system, and am trying to update the quantity available in the "shopping cart" (which is different than the independently tracked quantity requested). The logic is fairly simple: cart items should show the quantity_requested as available if that number is <= the number of items in stock; otherwise they should show the max number of items available.

The solution I ended up with is to just update the cart_items with the entire quantity_available per product, and then fix that in a second pass. I'm wondering if this can be *efficiently* done within a single update statement. I couldn't figure out how to do this in a single update without making multiple queries to find the actual qty_available:

UPDATE cart_item
SET qty_requested_available = (SELECT qty_available FROM stock WHERE stock.id = stock_id);

UPDATE cart_item
SET qty_requested_available = CASE
    WHEN qty_requested_available > qty_requested THEN qty_requested
    ELSE qty_requested_available
END;
[GENERAL] is it possible to do an update with a nested select that references the outer update ?
A typo in a webapp left ~150 records damaged overnight. I was hoping to automate the fix, but may just use regex to make update statements for this.

Basically, I have this situation:

table a (main record): id, id_field, fullname
table b (extended profiles): id_field, last_name, first_name, middle_name, age, etc

id_field on table a was left null due to a typo with the ORM. I've tried many variations to automate it; none seem to work. I think this attempt most clearly expresses what I was trying to do:

UPDATE table_a a
SET id_field = (
    SELECT id_field
    FROM table_b b
    WHERE b.first_name || ' ' || b.last_name = a.fullname
)
WHERE id_field IS NULL;

I'd be grateful if anyone has a pointer.
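Not from the original thread, but this kind of backfill is often written with UPDATE ... FROM instead of a correlated subquery; a sketch using the (hypothetical) table and column names above, assuming the concatenated name matches fullname exactly and is unique per person:

```sql
-- Join-style update: fill the missing id_field from the profile
-- table by matching on the reconstructed full name.
UPDATE table_a AS a
SET id_field = b.id_field
FROM table_b AS b
WHERE b.first_name || ' ' || b.last_name = a.fullname
  AND a.id_field IS NULL;
```

One caveat worth checking first: if two profiles produce the same concatenated name, the FROM form silently picks one of them, whereas the correlated-subquery form errors out on more than one row, which may be the safer failure mode for a repair job.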
[GENERAL] concatenation issue ( 8.4 )
I have a table with name_first, name_middle, name_last. If I try concatenating as such:

SELECT name_first || ' ' || name_middle || ' ' || name_last FROM mytable;

I end up with NULL as the concatenated string whenever any of the referenced fields contain a NULL value. I tried some text conversion and explicit casting, but that didn't work. What am I doing wrong?
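This is expected SQL behaviour: || yields NULL if any operand is NULL. The usual fix is coalesce() (on 9.1+ there is also concat_ws(), but the subject mentions 8.4); a sketch:

```sql
-- coalesce() turns NULL parts into empty strings before concatenating.
-- Note: a NULL middle name still leaves a doubled space, which you
-- may want to clean up with trim()/regexp_replace().
SELECT coalesce(name_first, '') || ' '
    || coalesce(name_middle, '') || ' '
    || coalesce(name_last, '') AS full_name
FROM mytable;
```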
[GENERAL] does anyone know what to use in pg_hba.conf that will allow me to run cronjobs with pg_dump?
i think i just need a METHOD for localhost only. thanks.
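Not from the thread, but two common patterns for unattended pg_dump: a `local` line using `peer`/`ident` authentication (so a cron job running as the matching OS user needs no password at all), or password auth plus a `~/.pgpass` file. An illustrative pg_hba.conf line -- the database and user names are placeholders, and the method must fit your security policy:

```
# TYPE  DATABASE   USER       ADDRESS   METHOD
local   mydb       backupusr            peer
```

On older releases the equivalent method is spelled `ident sameuser`. Remember that pg_hba.conf lines are matched top to bottom, so this entry must appear before any broader rule that would catch the same connection.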
Re: [GENERAL] list all columns in db
Thank you Jon -- that's the exact sort of trick I was hoping for. Cheers! On Jun 7, 2007, at 6:36 PM, Jon Sime wrote: Jonathan Vanasco wrote: Does anyone have a trick to list all columns in a db? No trickery, just exploit the availability of the SQL standard information_schema views:

select table_schema, table_name, column_name
from information_schema.columns
where table_schema not in ('pg_catalog', 'information_schema')
order by 1, 2, 3

If you want an equivalent that uses pg_catalog (non-portable outside of PostgreSQL) you could instead do:

select n.nspname as table_schema, c.relname as table_name, a.attname as column_name
from pg_catalog.pg_attribute a
join pg_catalog.pg_class c on (a.attrelid = c.oid)
join pg_catalog.pg_namespace n on (c.relnamespace = n.oid)
where c.relkind in ('r', 'v')
  and a.attnum > 0
  and n.nspname not in ('pg_catalog', 'information_schema')
order by 1, 2, 3

-Jon -- Senior Systems Developer Media Matters for America http://mediamatters.org/ // Jonathan Vanasco | CEO/Founder SyndiClick Networks | FindMeOn.com - The cure for Multiple Web Personality Disorder | Web Identity Management and 3D Social Networking | RoadSound.com - Tools For Bands, Stuff For Fans | Collaborative Online Management And Syndication Tools ---(end of broadcast)--- TIP 6: explain analyze is your friend
[GENERAL] list all columns in db
Does anyone have a trick to list all columns in a db? I need to audit a few dbs to make sure column & table names are adhering to our standard semantic syntax. I figure there has to be an old pg-admin trick out there to display a db like %(tname)s . %(cname) or some similar format.
Re: [GENERAL] Schema as versioning strategy
On Apr 25, 2007, at 2:05 PM, Richard Huxton wrote:

> Owen Hartnett wrote:
>> I want to "freeze" a snapshot of the database every year (think of end of year tax records). However, I want this frozen version (and all the previous frozen versions) available to the database user as read-only. My thinking is to copy the entire public schema (which is where all the current data lives) into a new schema, named 2007 (2008, etc.)
>
> Sounds perfectly reasonable. You could either do it as a series of:
>
> CREATE TABLE archive2007.foo AS SELECT * FROM public.foo;
>
> or do a pg_dump of schema "public", tweak the file to change the schema names, and restore it.

The CREATE TABLE ... AS method won't copy the constraints + fkeys. I think you're best off with a pg_dump.
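A third option sometimes used for this kind of yearly freeze (not from the thread; sketch only): rename the live schema to the archive name, which keeps every constraint and foreign key within it intact, then rebuild public from a dump:

```sql
-- Move the live schema aside under its archive name...
ALTER SCHEMA public RENAME TO archive2007;

-- ...then recreate an empty public schema and repopulate it,
-- e.g. by restoring a pg_dump taken just before the rename.
CREATE SCHEMA public;
```

The trade-off: the rename is instant and preserves everything, but the database briefly has no live public schema until the restore completes, so it suits a maintenance window rather than a hot cutover.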
[GENERAL] conditional joins and views
Is it possible at all to use conditional joins in views? ie:

select a.*, b.*
from a
inner join b ON ((a.id = b.id) AND (a.otherfield <> ?))

I have a few 15-20 table joins that I'd like to push into views. I've never wanted to push something with a conditional join into a view before, so am at a loss on this being a possibility. Seeing little documentation on this, I'm thinking it's not possible and I'll have to use a function.
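For what it's worth: extra predicates in an ON clause are perfectly legal inside a view; what a view cannot hold is the `?` placeholder -- the comparison value must be fixed (a parameterized version needs a set-returning function instead). A sketch, where 'archived' is an invented value standing in for the placeholder:

```sql
-- Conditional join inside a view; columns are aliased because
-- "a.*, b.*" would produce duplicate column names in a view.
CREATE VIEW a_b_joined AS
SELECT a.id AS a_id,
       a.otherfield,
       b.id AS b_id
FROM a
INNER JOIN b
    ON (a.id = b.id AND a.otherfield <> 'archived');
```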
Re: [GENERAL] unique constraint on 2 columns
On Apr 20, 2007, at 6:13 PM, Jeff Davis wrote:

> This is more correct structure, and yes, it would involve a join.

I know that's the 'more correct' way -- but I can't do the join, which is why I posted about a 2 column unique index. I tested with a join before posting -- I have an already large table that is growing quickly. In order to use the join and keep current performance, I'd need to scale out in hardware, which is just not an option right now. Searching 100M records vs searching 100M records + a join is a huge difference. When you try to do analytics, it's just not appropriate in my situation.

> No, it does not mean you need to rewrite anything. Use a view; that's the great benefit you get from using a relational database like PostgreSQL.

I'd have to rewrite everything that reads from that table to use the view instead of the current query, and then worry about inserts.
Re: [GENERAL] unique constraint on 2 columns
On Apr 20, 2007, at 5:43 PM, Vladimir Zelinski wrote:

> This looks like more of a table design problem than a database limitation. One column should accommodate values from both columns, with a unique index built on this column. Your requirements tell me that these values are of the same nature and should be placed in the same column. To distinguish between them, use another column to put an attribute.

No, both values can be present at once. They're both external-facing guids that point to the same resource and serve as keys for the table data. Some rows have one, some have two.

In regards to a table design solution, if I redid anything it would be something like:

table_main
    main_id

table_main_2_guid
    main_id
    guid_id unique
    context_id

But then I'm using 2 tables and have to join -- which means I need to rewrite everything that queries this table, which is both the core table for my application and supposed to be free of any multi-table queries for simple reads. So I'm going to try the trigger route.
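The trigger route mentioned above might look roughly like this; a sketch only -- as written it is not safe under concurrent inserts (two sessions can pass the check simultaneously), so a real version needs explicit locking or serialization on top:

```sql
-- Reject any new name that already appears in either column of test_a.
-- Sketch: an UPDATE-safe version would also exclude the row's own id.
CREATE FUNCTION check_names_unique() RETURNS trigger AS $$
BEGIN
    IF EXISTS (
        SELECT 1 FROM test_a
        WHERE NEW.name_1 IN (name_1, name_2)
           OR NEW.name_2 IN (name_1, name_2)
    ) THEN
        RAISE EXCEPTION 'name already in use';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_a_names_unique
BEFORE INSERT OR UPDATE ON test_a
FOR EACH ROW EXECUTE PROCEDURE check_names_unique();
```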
[GENERAL] unique constraint on 2 columns
I need a certain unique constraint in pg that I can't figure out. Given:

create table test_a (
    id serial,
    name_1 varchar(32),
    name_2 varchar(32)
);

I need name_1 and name_2 to both be unique, so that:

    name_1 never appears in name_1 or name_2
    name_2 never appears in name_2 or name_1

A standard 2 column unique index / constraint will not accomplish this.
Re: [GENERAL] making a pg store of 'multiple checkboxes' efficient
On Apr 18, 2007, at 4:28 AM, Alban Hertroys wrote:

> I got some good results using bitwise operations on an integer column. An index on such a column helps a great deal, of course. What type of integer you need depends on how many booleans you (expect to) have. My operations were like "WHERE (value & 80) = 80" to match against the 7th and 5th bits, with value an int4 column. Query times are in the tens of milliseconds range. Admittedly I only got about that many records through joins with other tables I needed (meaning that my result set may at some point in time have been around as big as your data set), so the planner may have reduced the number of bitwise operations significantly for me. The actual number of bitwise values to compare was around 40,000 integers * 25 mask values, but that got joined with over 1M records from another result set. A bitwise operator on a single column should (theoretically) have less overhead than integer/boolean operators on multiple columns. Computers are good at bitwise operations, after all.

Computers are good at bitwise operations, but software often has scary implementations :) Thanks for the input. I'll definitely go this route. It was my first thought, but there is almost no documentation out there for this type of storage. 
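A sketch of the bitwise approach discussed above; the table, column, and flag assignments are invented for illustration:

```sql
-- One int column holds up to 32 checkbox bits.
ALTER TABLE account ADD COLUMN flags INT NOT NULL DEFAULT 0;

-- Assign each checkbox a power of two: option_1 = 1, option_2 = 2,
-- option_3 = 4, ... Setting option_2 on a row:
UPDATE account SET flags = flags | 2 WHERE id = 1;

-- Rows with both bit 16 and bit 64 set (80 = 16 | 64, matching the
-- "(value & 80) = 80" example quoted above):
SELECT id FROM account WHERE (flags & 80) = 80;

-- A plain btree index cannot serve "&" predicates directly; one
-- option is a partial index per frequently queried mask:
CREATE INDEX account_flag80_idx ON account (id) WHERE (flags & 80) = 80;
```

The partial-index trick trades flexibility for speed: each hot mask needs its own index, but lookups on that mask become ordinary index scans.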
[GENERAL] making a pg store of 'multiple checkboxes' efficient
I have a large table (5M items currently) that is projected to grow at the rate of 2.5M a month, looking at current usage trends. The table represents some core standardized user account attributes, while text-heavy / unstandardized info lives in other tables.

My issue is this: I'm adding in a 'multiple checkboxes' style field, and trying to weigh the options for db representation against one another. My main concern is speed -- this is read-heavy -- but I am worried to some degree about disk space (not sure where disk space fits in with pg; when I used to use mysql, the simplest schema change could drastically affect the disk size).

That said, these are my current choices:

option a: bitwise operations
    and/or operations to condense checkboxes into a searchable field
    pro: super small, fits in 1 table
    con: could not find any docs on the speed of bitwise searches in pg

option b: secondary table with bools
    create table extends (account_id, option_1_bool, option_2_bool)
    pro: 1 join, fast search on bools
    con: PITA to maintain/extend

option c: mapping table
    create table mapping (account_id, option_id)
    pro: extensible
    con: slow speed -- needs multiple joins, records all over

I'd personally lean towards option a or b. Anyone have suggestions? Thanks. 