[HACKERS] semi-PoC: kNN-gist for cubes
I have a rough proof-of-concept for getting nearest-neighbor searches working with cubes. When I say "rough", I mean "I have no idea what I'm doing and I haven't written C for 15 years, but I hear it got standardized, please don't hurt me". It seems to be about 400x faster for a 3D cube with 1 million rows, and more like 10-30x for a 6D cube with 10 million rows.

The patch adds an operator <-> (which is just the existing cube_distance function) and support function 8, distance (which is just g_cube_distance, a wrapper around cube_distance). The code is in no way production-quality; it is in fact right around "look! it compiles!", complete with pasted-in, commented-out code from something I was mimicking.

I thought I'd share at this early stage in the hopes I might get some pointers, such as:

- What unintended consequences should I be looking for?
- What benchmarks should I do?
- What kind of edge cases might I consider?
- I'm just wrapping cube_distance and calling it through DirectFunctionCall; it's probably more proper to extract out the "real" function and call it from both cube_distance and g_cube_distance. Right?
- What else don't I know? (Besides C, funny man.)

The patch, such as it is, is at:

https://github.com/jaylevitt/postgres/commit/9cae4ea6bd4b2e582b95d7e1452de0a7aec12857

with an even-messier test at:

https://github.com/jaylevitt/postgres/commit/daa33e30acaa2c99fe554d88a99dd7d78ff6c784

I initially thought this patch made inserting and indexing slower, but then I realized the fast version was doing 1 million rows and the slow one did 10 million rows. Which means: dinnertime.

Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
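For readers unfamiliar with contrib/cube: the distance the patch exposes is plain Euclidean distance between axis-aligned boxes, with no contribution from an axis on which the boxes overlap. A minimal Python sketch of that logic (hypothetical representation; the real code operates on the cube C type):

```python
def cube_distance(a, b):
    """Euclidean distance between two axis-aligned boxes.

    a and b are lists of (lower, upper) pairs, one per dimension;
    a point is a box whose lower == upper.  Per-axis distance is 0
    when the intervals overlap, mirroring contrib/cube's behavior.
    """
    total = 0.0
    for (al, au), (bl, bu) in zip(a, b):
        if au < bl:          # a lies entirely below b on this axis
            d = bl - au
        elif bu < al:        # b lies entirely below a on this axis
            d = al - bu
        else:                # intervals overlap: no contribution
            d = 0.0
        total += d * d
    return total ** 0.5
```

A KNN-GiST distance support function just needs to return this value for both leaf entries (points) and internal-page bounding boxes, which is why wrapping cube_distance works at all.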
[HACKERS] Bugs/slowness inserting and indexing cubes
[Posted at Andres's request]

TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in various builds.

NOTABLE PROBLEMS

1. In 9.1.2, inserting 10x rows takes 19x the time.
   - 9.1-HEAD and 9.2 "fix" this; it now slows down linearly
   - but: 10s > 8s > 5s!
   - but: comparing Ubuntu binary w/vanilla source build on virtual disks, might not be significant
2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes it can't work on an unlogged table.
3. In 9.2, creating the 10-million-row index always fails.
4. 9.1-HEAD never successfully indexes 10 million rows ("never" = at least 20 minutes on two runs; I will follow up in a few hours).

DETAILS

Times are in seconds, single run.

+-------------------+---------+---------+----------+----------+
| Platform          | 1m rows | 1m rows | 10m rows | 10m rows |
|                   | INSERT  | CR NDX  | INSERT   | CR NDX   |
+-------------------+---------+---------+----------+----------+
| 9.1.2 logged      |       5 |      35 |       98 |      434 |
| 9.1.2 unlogged    |       2 |  34[**] |       22 |  374[**] |
| 9.1-HEAD logged   |      10 |      65 |       89 |    [***] |
| 9.1-HEAD unlogged |       2 |      39 |       20 |  690[**] |
| 9.2 logged        |       8 |      57 |       87 |   509[*] |
| 9.2 unlogged      |       2 |  33[**] |       21 |   327[*] |
+-------------------+---------+---------+----------+----------+

[*]   psql:slowcube.sql:20: ERROR: node buffer of page being split (121550) does not exist
[**]  psql:slowcube.sql:21: ERROR: unlogged GiST indexes are not supported
[***] never completed after 10-20 minutes; nothing in server.log at default logging levels, postgres process consuming about 1 CPU in IOWAIT, checkpoints every 7-8 seconds

VARIABILITY

A few runs in a row on 9.1-HEAD, 1 million rows, logged:

+--------+--------------+
| INSERT | CREATE INDEX |
+--------+--------------+
|     10 |           65 |
|      8 |           61 |
|      7 |           59 |
|      8 |           61 |
|      7 |           55 |
+--------+--------------+

SYSTEM SPECS

Amazon EC2, EBS-backed, m1.large
7.5GB RAM, 2 cores
Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
shared_buffers = 1867MB
checkpoint_segments = 32
effective_cache_size = 3734MB

9.1.2: installed binaries from Ubuntu's oneiric repo
9.1-HEAD: REL9_1_STABLE, ef19c9dfaa99a2b78ed0f78aa4a44ed31636fdc4, built with simple configure/make/make install
9.2: master, 1631598ea204a3b05104f25d008b510ff5a5c94a, built with simple configure/make/make install

9.1.2 and 9.1-HEAD were run on different (but identically configured) instances. 9.1-HEAD and 9.2 were run on the same instance, but EBS performance is unpredictable. YMMV.
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Jay Levitt wrote:
> [Posted at Andres's request] TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in various builds.

And I bet you'll want the test script... sigh. Attached.

\c postgres
drop database if exists slowcube;
create database slowcube;
\c slowcube
\timing
create schema slowcube;
set search_path to slowcube;
create extension cube;
set work_mem to '1GB';
set maintenance_work_mem to '1GB';
create table cubetest (
  position cube
);
insert into cubetest (position)
  select cube(array[random() * 1000, random() * 1000, random() * 1000])
  from generate_series(1,100);
select now();
create index q on cubetest using gist(position);
Re: [HACKERS] Progress on fast path sorting, btree index creation time
Jim "Decibel!" Nasby wrote: I agree that it's probably pretty unusual to index floats. FWIW: Cubes and points are floats, right? So would spatial indexes benefit from this optimization, or is it only raw floats? Jay Levitt -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Tom Lane wrote:
> Jay Levitt writes:
>> [Posted at Andres's request] TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in various builds.
>> 1. In 9.1.2, inserting 10x rows takes 19x the time.
>>    - 9.1-HEAD and 9.2 "fix" this; it now slows down linearly
>>    - but: 10s > 8s > 5s!
>>    - but: comparing Ubuntu binary w/vanilla source build on virtual disks, might not be significant
>
> FWIW, I find it really hard to believe that there is any real difference between 9.1.2 and 9.1 branch tip on this. There have been no significant changes in either the gist or contrib/cube code in that branch. I suspect you have a measurement issue there.

I suspect you're right, given that five runs in a row produced times from 7s to 10s. I just wanted to include it for completeness and in case it triggered any "a-ha" moments.

>> 4. 9.1-HEAD never successfully indexes 10 million rows ("never" = at least 20 minutes on two runs; I will follow up in a few hours)
>
> Works for me (see above), though it's slower than you might've expected.

So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and (probably both of our) 9.1-HEAD takes 1918s... is that something to worry about, and if so, are there any tests I can run to assist? That bug doesn't affect me personally, but y'know, community and all that.

Also, I wonder if it's something like "9.2 got way faster doing X, but meanwhile, HEAD got way slower doing Y", and this is a canary in the coal mine.

Jay
Re: [HACKERS] Bugs/slowness inserting and indexing cubes
Robert Haas wrote:
> On Mon, Feb 13, 2012 at 7:45 AM, Robert Haas wrote:
>> On Thu, Feb 9, 2012 at 3:37 PM, Jay Levitt wrote:
>>> So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and (probably both of our) 9.1-HEAD takes 1918s... is that something to worry about, and if so, are there any tests I can run to assist? That bug doesn't affect me personally, but y'know, community and all that. Also, I wonder if it's something like "9.2 got way faster doing X, but meanwhile, HEAD got way slower doing Y", and this is a canary in the coal mine.
>>
>> This might be a lame hypothesis, but... is it possible that you built your 9.1-tip binaries with --enable-cassert? Or with different optimization options?

No, I think I/O just varies more than my repeated tests on 1M rows indicated. I ran the 10M-row test four times on the same server, alternating between packaged 9.1.2 and source-built 9.1.2 (default configure options), and saw these times:

         INSERT  INDEX
apt          76    578
source      163    636
apt          73    546
source       80    473

EBS has no performance guarantees at all; you share your disks with an arbitrary number of other users, so if someone "in the neighborhood" decides to do some heavy disk I/O, you lose. Let this be a lesson to me: run benchmarks locally!

> So I tested. On my MacBook Pro, your test script builds the index in just over 25 s on both REL9_1_2 and this morning's REL9_1_STABLE.

I think that's the 1-million version I emailed; try adding a zero and see if it doesn't take a little longer.

Jay
[HACKERS] Designing an extension for feature-space similarity search
[Preamble: I've been told that the hackers list is appropriate for extension-related topics like this, even if it's not about contributing to core. If I'm misappropriating, please let me know.]

Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think "OKCupid for product recommendations" or "Pinterest for people with your tastes". We use psych research to measure and predict your personality and traits along a number of scales (dimensions), and then we connect you with people, products and content we think you'll like. I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice it to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
- The distance between two vectors is a function that aggregates the distances of each dimension, along with arbitrary other arguments - for instance, it might take the weighted average of the dimensions

That aggregation (which may not literally be an aggregate; I'm not sure yet) needs to happen in a SELECT list, which means it needs to be fast, which means all this (or at least much of it) has to be C.

The "simplest thing that works" is probably to hack up the cube extension, declare that everything (except inner pages) must be a zero-volume cube (cube_is_point()), map our non-float features onto floats somehow, and hard-code all the distance functions and the aggregation function. But I think this sort of similarity-search engine has general utility, and I also want to make it easy for us to add and subtract dimensions without too much pain; that should be DDL, not code. So, thinking about how this might evolve:

- I'm not sure how to represent arbitrary column-like features without reinventing the wheel and putting a database in the database. hstore only stores text, probably for this reason; I took a look at the earlier json patch and saw that it handled only a few core data types. Have there been any other PoCs that involved collections of heterogeneous data? I almost want an actual instance of an "anyarray".
- Alternatively, is there a way to index an entire, arbitrary row, rather than a column on that row? I'm fine with this extension requiring its own table, so I leave the data where it is in the row and only worry about indexing it. I can't just use functional indexes, because I'll need to provide operators and support functions to GiST. Maybe I have a fake sentinel column, where all the operators use SPI to introspect the row, treat each column as a feature dimension, call the underlying operators on each column's data type, etc.
- Can domains have operators, or are operators defined on types?
- Does KNN-GiST run into problems when <-> returns values that don't "make sense" in the physical world? For instance, let's say NULL <-> NULL returns a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <-> NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that could even describe a triangle - but my spidey sense is tingling on this.
- Are there previous discussions, patches, abandoned projects, etc. that this reminds you of and that I should go research?

Thanks for any thoughts, and I'd love collaborators or even mentors - we plan to open source whatever we produce here, and I don't have quite the theoretical background it takes to do this properly.
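The per-dimension-distance-plus-aggregation idea above can be sketched in a few lines. This is only an illustration of the intended semantics (all names are hypothetical, and treating a NULL on either side as a fixed distance is just one possible NULL policy):

```python
def weighted_distance(a, b, dist_fns, weights, null_distance=1.0):
    """Aggregate per-dimension distances into one overall distance.

    a, b          : equal-length feature vectors (entries may be None)
    dist_fns      : one domain-specific distance function per dimension
    weights       : relative importance of each dimension
    null_distance : distance assumed when either value is missing
    Returns the weighted average of the per-dimension distances.
    """
    total = 0.0
    wsum = 0.0
    for x, y, fn, w in zip(a, b, dist_fns, weights):
        d = null_distance if x is None or y is None else fn(x, y)
        total += w * d
        wsum += w
    return total / wsum
```

In the extension this would live in C and be driven by per-domain distance functions declared in DDL, but the shape of the computation is the same.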
Jay Levitt -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Designing an extension for feature-space similarity search
Alexander Korotkov wrote:
> On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt <jay.lev...@gmail.com> wrote:
>> - But a dimension might be in any domain, not just floats
>> - The distance along each dimension is a domain-specific function
>
> What exact domains do you expect? Some domains could appear to be quite hard for index-based similarity search using GiST (for example, sets, strings etc.).

Oh, nothing nearly so complex, and (to Tom's point) no composite types, either. Right now we have demographics like gender, geolocation, and birthdate; I think any domain will be a type that's easily expressible in linear terms.

I was thinking in domains rather than types because there isn't one distance function for "date" or "float": me.birthdate <-> you.birthdate is normalized to a different curve than now() <-> posting_date, and raw_score <-> raw_score would differ from z_score <-> z_score. It would have been elegant to express that distance with <->, but since domains can't have operators, I can create distance(this, other) functions for each domain. It just won't look as pretty!

Jay
Re: [HACKERS] Designing an extension for feature-space similarity search
Tom Lane wrote:
> Jay Levitt writes:
>> - I'm not sure how to represent arbitrary column-like features without reinventing the wheel and putting a database in the database.
>
> ISTM you could define a composite type and then create operators and an operator class over that type. If you were trying to make a btree opclass there might be a conflict with the built-in record_ops opclass, but since you're only interested in GIST I don't see any real roadblocks.

Perfect. Composite types are exactly what I need here; the application can declare its composite type and provide distance functions for each member, and the extension can use those to calculate similarity.

How do I introspect the composite type's pg_class to see what it contains? I assume there's a better way than SPI on system catalogs :) Should I be using systable_* functions from genam, or is there an in-memory tree? I feel like funcapi gets me partway there, but there's magic in the middle. Can you think of any code that would serve as a sample - maybe whatever creates the output for psql's \d?

> The main potential disadvantage of this is that you'd have the standard tuple header as overhead in index entries --- but maybe the entries are large enough that that doesn't matter, and in any case you could probably make use of the GIST "compress" method to get rid of most of the header. Maybe convert to MinimalTuple, for instance, if you want to still be able to leverage existing support code for field extraction.

Probably not worth it to save the 8 bytes; we're starting out at about 20 floats per row. But good to know for later optimization...

>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but the system will only consider them for exact type matches, so you might need more explicit casts than you ordinarily would. However, we only support domains over base types, not composites, so this isn't really going to be a profitable direction for you anyway.

Actually, as mentioned to Alexander, I'm thinking of domains per feature, not for the overall tuple, so birthdate <-> birthdate differs from now() <-> posting_date. Sounds like that might work - I'll play.

>> - Does KNN-GiST run into problems when <-> returns values that don't "make sense" in the physical world?
>
> Wouldn't surprise me. In general, non-strict index operators are a bad idea. However, if the indexed entities are records, it would be entirely your own business how you handled individual fields being NULL.

Yeah, that example conflated NULLs in the feature fields (we don't know your birthdate) with <-> on the whole tuple. Oops. I guess I can just test this by verifying that KNN-GiST ordered by distance returns the same results as without the index.

Thanks for your help here.

Jay
Re: [HACKERS] Designing an extension for feature-space similarity search
Tom Lane wrote:
>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but the system will only consider them for exact type matches, so you might need more explicit casts than you ordinarily would.

Turns out it's even smarter than that; it seems to coerce when it's unambiguous:

create domain birthdate as date;
create function date_dist(birthdate, birthdate) returns integer as $$
  select 123;
$$ language sql;
create operator <-> (
  procedure = date_dist,
  leftarg = birthdate,
  rightarg = birthdate);

select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate;  -- 123
select '2012-01-01'::date <-> '2012-01-01'::date;            -- 123

create domain activity_date as date;
create function date_dist(activity_date, activity_date) returns integer as $$
  select 432;
$$ language sql;
create operator <-> (
  procedure = date_dist,
  leftarg = activity_date,
  rightarg = activity_date);

select '2012-01-01'::activity_date <-> '2012-01-01'::activity_date;  -- 432
select '2012-01-01'::birthdate <-> '2012-01-01'::birthdate;          -- 123
select '2012-01-01'::date <-> '2012-01-01'::date;
-- ERROR: operator is not unique: date <-> date
[HACKERS] Copyright notice for contrib/cube?
I'm basing an extension off contrib/cube. I'm going to open-source it under the existing PostgreSQL license, but I'm not sure how the copyright notice should look - there isn't one at the moment. (In fact, there's no LICENSE or COPYRIGHT file at all.) Should it be something like

  Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
  Portions Copyright (c) 2012, TipTap Inc.

?

Jay
Re: [HACKERS] Copyright notice for contrib/cube?
Marti Raudsepp wrote:
> On Fri, Feb 17, 2012 at 17:42, Jay Levitt wrote:
>> Should it be something like
>>   Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
>>   Portions Copyright (c) 2012, TipTap Inc.
>
> Please don't add that, just change 2011 to 2012. This is what the wiki says:
> Q: May I add my own copyright notice where appropriate?

To clarify, this is for an extension to be distributed separately on PGXN and GitHub, not for a contribution to the PostgreSQL distribution. It will differ greatly from contrib/cube when it's done, but cube is the scaffolding I'm starting with. That said:

> Q: Doesn't the PostgreSQL license itself require to keep the copyright notice intact?
> A: Yes, it does. And it is, because the PostgreSQL Global Development Group covers all copyright holders.

Is that true for separately-distributed extensions as well - if I push this to GitHub, is my company part of the PGDG? Where is the PGDG defined? If not (and perhaps even if so), I think I could still add an additional copyright notice without violating the license, since the copyright notice and following two paragraphs would still appear in all copies. But perhaps it's not necessary. I think the edge case is something stupid like "In five years, there is no remaining contrib code, and we get bought by MonsantoOracleHalliburton, and they want to close-source the code in a way that's somehow incompatible with the PostgreSQL license... can they?"

But that does raise two other points:

- cube seems to post-date any work at UC. Should I still include the "Portions Copyright (c) 1994, The Regents of the University of California"?
- Technically, the license could be read such that "the above copyright notice" (singular) refers to the UC copyright notice but not the PGDG notice; next time the lawyers run through it, you might want to add an "s" to "notices".

Jay
Jay -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Designing an extension for feature-space similarity search
Tom Lane wrote:
> Jay Levitt writes:
>> - Does KNN-GiST run into problems when <-> returns values that don't "make sense" in the physical world?
>
> If the indexed entities are records, it would be entirely your own business how you handled individual fields being NULL.

This turns out to be a bit challenging. Let's say I'm building a nullable_point type that allows the Y axis to be NULL (or any sentinel value for "missing data"), where the semantics are "NULL is infinitely far from the query". I'll need my GiST functions to return useful results with NULL - not just correct results, but results that help partition the tree nicely.

At first I thought this posed a challenge for union; if I have these points:

  (1,2)
  (2,1)
  (1,NULL)

what's the union? I think the answer is to treat NULL box coordinates like LL = -infinity, UR = infinity, or (equivalently, I think) to store a saw_nulls bit in addition to LL and UR.

The real challenge is probably in picksplit and penalty - where in the tree should I stick (1,NULL)? - at which point you say "Yes, algorithms for efficient indexes are hard work and computer-science-y" and point me at surrogate splitters. Just thinking out loud, I guess; if other GiST types have addressed this problem, I'd love to hear about it.

Jay
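The NULL-as-infinite-bound union idea above can be sketched concretely (hypothetical representation; a real GiST union method would operate on the opclass's key type):

```python
import math

def union(boxes):
    """Union of bounding boxes whose coordinates may be None (NULL).

    Each box is a list of (lower, upper) pairs, one per dimension.
    A None bound is treated as unbounded, so any NULL in a dimension
    widens the union to (-inf, +inf) on that axis - the "LL = -infinity,
    UR = infinity" interpretation.
    """
    ndims = len(boxes[0])
    out = []
    for i in range(ndims):
        lo = min(-math.inf if b[i][0] is None else b[i][0] for b in boxes)
        hi = max(math.inf if b[i][1] is None else b[i][1] for b in boxes)
        out.append((lo, hi))
    return out
```

Note the trade-off this makes visible: a single NULL destroys all pruning information for that dimension of the subtree, which is exactly why a separate saw_nulls bit (keeping the tight bounds over non-NULL values) can partition the tree better.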
Re: [HACKERS] Designing an extension for feature-space similarity search
Alexander Korotkov wrote:
> On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt <jay.lev...@gmail.com> wrote:
>> At first I thought this posed a challenge for union; if I have these points: (1,2) (2,1) (1,NULL) what's the union? I think the answer is to treat NULL box coordinates like LL = -infinity, UR = infinity, or (equivalently, I think) to store a saw_nulls bit in addition to LL and UR.
>
> A similar problem appears in GiST indexing of ranges, because a range can be empty. There an additional "contain empty" flag was introduced. This flag indicates that an underlying value can be empty, so it is set on union with an empty range or with another range whose flag is set. You likely need a similar flag for each dimension.

Ah, yes, exactly the same problem. So what led you to add a flag instead of using the range NULL..NULL? I'm on the fence about choosing.

Jay
Re: [HACKERS] Designing an extension for feature-space similarity search
Alexander Korotkov wrote:
> On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt wrote:
>> Ah, yes, exactly the same problem. So what led you to add a flag instead of using the range NULL..NULL? I'm on the fence about choosing.
>
> First, range bounds can't be NULL :) Second, if we have range (a;b) + "contain empty" in an internal page, both facts:
> 1) all normal underlying ranges are contained in (a;b), and
> 2) there can be empty underlying ranges
> are useful for search.

That makes sense; you're essentially keeping one bit of stats about the values present in the range.

I wonder: if I'm indexing a rowtype, then for each column in the row I need to store a lower-left and an upper-right bound, plus a might-have-nulls flag. Sounds a lot like a range. Should I just use ranges for that? See a downside (overhead)? See an upside (seems less duplicative somehow)? I'm fine depending on 9.2.

Jay
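Alexander's design can be modeled in miniature: an internal-page key is a pair of bounds plus one "contain empty" bit, and union keeps the bounds tight over the non-empty entries while OR-ing the bits. A hypothetical sketch (not the actual rangetypes GiST code):

```python
def union_keys(keys):
    """Union of internal-page keys of the form (lo, hi, contain_empty).

    An empty entry is represented as (None, None, True).  The union
    keeps tight bounds over the non-empty entries and ORs the
    contain_empty bits, so searches can still prune on (lo, hi)
    while knowing whether empty ranges may lurk in the subtree.
    """
    lo = hi = None
    contain_empty = False
    for k_lo, k_hi, k_empty in keys:
        contain_empty = contain_empty or k_empty
        if k_lo is not None:
            lo = k_lo if lo is None else min(lo, k_lo)
            hi = k_hi if hi is None else max(hi, k_hi)
    return (lo, hi, contain_empty)
```

Contrast with folding the empty entry into the bounds themselves: the flag version preserves the tight (lo, hi) for pruning, which is the point Alexander makes above.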
Re: [HACKERS] Future of our regular expression code
Stephen Frost wrote:
> Alright, I'll bite... Which existing regexp implementation that's well written, well maintained, and which is well protected against malicious regexes should we be considering then?

FWIW, there's a benchmark here that compares a number of regexp engines, including PCRE, TRE and Russ Cox's RE2:

http://lh3lh3.users.sourceforge.net/reb.shtml

The fastest backtracking-style engine seems to be Oniguruma, which is native to Ruby 1.9 and thus not only supports Unicode but, I'd bet, performs pretty well on it, on account of being developed in Japan. But it goes pathological on regexen containing '|'; the only safe choice among PCRE-style engines is RE2, but of course that doesn't support backreferences.

Russ's page on RE2 (http://code.google.com/p/re2/) says: "If you absolutely need backreferences and generalized assertions, then RE2 is not for you, but you might be interested in irregexp, Google Chrome's regular expression engine." That's here:

http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html

Sadly, it's in JavaScript. Seems like if you need a safe, performant regexp implementation, your choice is (a) finish PLv8 and support it on all platforms, or (b) add backreferences to RE2 and precompile it to C with Comeau (if that's still around), or...

Jay
Re: [HACKERS] pg_test_timing tool for EXPLAIN ANALYZE overhead
Greg Smith wrote:
> Anyway, the patch now includes several examples and a short primer on PC clock hardware, to help guide what good results look like and why they've been impossible to obtain in the past. That's a bit Linux-centric, but the hardware described covers almost all systems using Intel or AMD processors. The only difference with most other operating systems is how aggressively they have adopted newer timer hardware. At least this gives a way to measure all of them.

N.B.: Windows has at least two clock APIs, timeGetTime and QueryPerformanceCounter (and probably more, these days). They rely on different hardware clocks and can get out of sync with each other; meanwhile, QueryPerformanceCounter can get out of sync with itself on (older?) multi-CPU boards. So if you're doing high-res timing, it's good to make sure you aren't relying on two different clocks in different places...

I ran into this with MIDI drivers years ago, and wrote a doc:

http://www.ultimatemetal.com/forum/equipment/315910-midi-latency-problem-nuendo.html#post6315034

and a clock-testing utility:

https://github.com/jaylevitt/miditime
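The practical upshot of the mixed-clock pitfall - read the same monotonic counter at both ends of an interval, and never mix clock sources - can be sketched like this (Python's monotonic clock standing in for whichever platform API applies):

```python
import time

def measure(fn, *args):
    """Time a call using a single monotonic clock source.

    Mixing clock sources (e.g. a wall-clock start with a monotonic
    end, or two hardware counters that drift apart) can yield
    negative or skewed intervals; reading one monotonic counter on
    both ends avoids that class of bug entirely.
    """
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start
    return result, elapsed
```

This is the same discipline EXPLAIN ANALYZE needs internally, and what pg_test_timing measures the overhead of.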
Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed
Tom Lane wrote:
> Ants Aasma writes:
>> A user complained on pgsql-performance that SELECT col FROM table GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe to return tuples from hash-aggregate as they are found when no aggregate functions are in use. Attached is a first shot at that.
>
> As I commented in the other thread, the user would be a lot better off if he'd had an index on the column in question. I'm not sure it's worth complicating the hashagg logic when an indexscan + groupagg would address the case better.

Would this patch help in the case where "table" is actually a set-returning function, and thus can't have an index? (I don't yet know enough about the tree to know when hashaggs get used.) I'm wondering if this is a useful exception to the "restrictions can't get pushed down through GROUP BYs" rule.

Jay
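The optimization under discussion boils down to this: when a hash GROUP BY computes no aggregates, each key can be emitted the first time it is seen, and the scan can stop as soon as the LIMIT is satisfied. A sketch (illustrative only, not the executor code):

```python
def lazy_distinct(rows, limit):
    """Emit distinct values as they are encountered, stopping at `limit`.

    With no aggregate functions to finish computing, there is no need
    to complete the scan before returning groups: a query like
    SELECT col FROM t GROUP BY col LIMIT 2 can stop as soon as two
    distinct values have appeared in the input.
    """
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
            if len(out) >= limit:
                break   # the lazy part: abandon the rest of the scan
    return out
```

This also shows why the set-returning-function case is interesting: with no index available, early exit from the scan is the only way to avoid consuming the whole input.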
[HACKERS] Switching to Homebrew as recommended Mac install?
The Mac installation docs currently recommend the EDB one-click installer as the first choice. While this does install pgadmin and some other refinements, it also is fairly confusing to troubleshoot: - By default, it installs to /Library/PostgreSQL, which is also (I think) where the Apple-supplied Lion install is - The uninstaller is hidden in /Library/PostgreSQL, which (since Finder hides /Library by default) you're likely to go to via Terminal. But the uninstaller is a Mac app, so even if you find it you have to know to use "open" to run it, because Mac apps are really directories that the Finder abstracts away from you. - The EDB docs are written against 8.4. - There are, as @lluad points out, no fewer than eight ways to install Postgres on a Mac (fink, macports, homebrew, Lion default, build from source, EDB, and two other binary installers) - We have few Mac experts hanging out in #postgresql. - We just had two folks within an hour, BOTH with conflicting installs of Postgres. So with all respect and thanks to EDB for maintaining those installers, I'd like to propose that homebrew become the recommended install method on Mac, and I will update the Mac formula to overcome any current objections. The nice thing about homebrew is that (a) formulas can contain arbitrary Ruby and command-line options, so we can easily deal with things like detecting existing installs, handling shared memory, etc. if we want to, and (b) pull requests are accepted freely and frequently, so it can always be the current, security-patched version. What do folks think of this idea? When I mention homebrew in #postgresql, there's always an "ick", but I believe that's true of any package manager (and possibly any Mac anything, because we're all Rails-loving ORM-using SQL-not-understanding fanbois, and I say that with love.) The current homebrew installer is already 9.1.3, and does a make-world, so you get all of contrib built. POSSIBLE OBJECTIONS/PREREQUISITES 1. 
homebrew installs everything under /usr/local and makes that user-writeable. Sorry. It does because most Mac users don't know how to edit PATH for GUI apps (it's in a .plist in a hidden directory in your home dir), and /usr/local is already in PATH by default. 2. The current formula installs Postgres as the desktop user, not as the _postgres role account. I'm personally of the strong opinion that user-to-user privilege escalation attacks are NOT an issue on desktops; all important files are already owned by the desktop user. The attack vector is *maybe* root escalation attacks, but if you want root, it's so common for installers to ask permission that your malware could just ask. The real attack vector is "I'm in your browser", and that has nothing to do with root, permissions, or users at all. Meanwhile, the EDB installer by default installs both app and data to a directory that requires root - so I assume it runs as root too - and nobody's complained. However, if this is a sticking point, I'd have no problem adding a --user option that would default to _postgres (underscore-prefixed usernames are the Apple standard). 3. The current formula (TCF) spits out instructions telling you how to initdb, but they're easy to overlook. I'm happy to add an option if necessary, and might do it anyway. 4. TCF also spits out instructions for adding Postgres to launchctl (Mac's version of /etc/init.d or Windows Services), rather than doing it for you, but again, I'd happily add the option. (I'm checking with Homebrew folks to see if there's some dictum against that; it's a common pattern to put launchctl in the instructions, but IME the usability is poor.) 5. TCF doesn't update your shared memory settings. Again, happy to add that. 6. TCF doesn't look for existing installs. 
This is especially a problem on Lion, since Apple bundles PG 8.4, either client or server IIUC, and although /usr/local/bin is in your PATH, it comes *after* /usr/bin (grumble), so you'll either have the homebrew server fail to launch (since port 5432 is in use), or you'll have an outdated client version. In IRC, both users had actually installed the EDB version months ago and forgotten about it, but over time, Lion users will grow, since all new Macs come with only Lion. There are several ways to address this; my preference is to have homebrew warn about existing installs but take care of any magic to make them go away, a la http://nextmarvel.net/blog/2011/09/brew-install-postgresql-on-os-x-lion/.

7. There's no homebrew formula for pgadmin. I've never built it, and might be able to add that, but probably not right away.

8. There might be other popular things that EDB's StackBuilder does.

9. EDB is an important contributor to the PG core community, and maybe th
Re: [HACKERS] Switching to Homebrew as recommended Mac install?
Jay Levitt wrote:
> POSSIBLE OBJECTIONS/PREREQUISITES

10. There is no homebrew support for multiple versions, and no current plans to add it (though it's on the wishlist). This means homebrew is only useful if "I want to install a PostgreSQL thingie" is the common Mac use case. If people often need to use specific older versions, to mirror their server configs, it's a problem. It *might* be possible to hack this into our formula, but I'm not sure it's either doable or acceptable.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching to Homebrew as recommended Mac install?
Dave Page wrote:
> It seems to me that most of your arguments against the installers are
> based on incorrect understanding or information, and most of your
> arguments for Homebrew actually come across as arguments against!

You're right about the former - and as to the latter, they *were* arguments against ("potential objections"). I try to pre-argue against my own proposals to save everyone time; if I can still prevail, I must have a damn good idea :)

At this point I agree with you, but I'm still going to go into detail, because I think there are two markets for Postgres, and the database community has been so focused around enterprise for so long that you're missing opportunities with web startups. I'd love to help bridge the gap, having jumped straight from big-iron PL/I to ooh-Ruby-is-shiny. And web startups develop on Mac laptops. They just do. So if it helps you to imagine me as a 20something "I'm a Mac" hipster, working on some hot Facebook/mobile app with funding from Spark Capital, do that. Lord knows it helps me.

>> - We have few Mac experts hanging out in #postgresql.
> Not sure how this is relevant to the proposal.

The impetus for the idea was that there seems to be a steady stream of novice PG users on Mac who come into #postgresql with installation problems, which is bad enough as an out-of-box experience - but worse is that there are rarely folks around who can help. (Of course, I'm extrapolating; every time *I'm* in IRC and see this, there's someone who can help. But you know what I mean.) And (although my proposal started with documentation) I'm of the firm opinion that "there's no such thing as a documentation error"; a user problem is a software problem. Humans will click buttons before they'll read, developers are humans, and no amount of RTFM will ever fix that. If we can make installers smarter, that's way better than troubleshooting guides, IRC, mailing lists, etc. So that's where I was coming from.
I didn't realize that you were actively maintaining the EDB installer (see below for the 8.4 doc explanation); obviously, if you can improve that, it's the best solution and we should, if anything, recommend it MORE vigorously. Still, there's a growing community of developers who expect "brew install" to work, and I do want to fix it for them. The EDB installer will always be a one-off experience; most of the other servers you install will be through a package manager, and homebrew's popularity (despite its youth) is impressive. Both of my n=2 data points had run across PG a while back, installed it with the one-click to try it out, forgotten about it, done "brew install postgresql" today, and naturally ran into problems.

>> - By default, it installs to /Library/PostgreSQL, which is also (I think)
>> where the Apple-supplied Lion install is
> No, Apple's version is installed in /usr on mine.

Ah hah. I suppose only the Apple .plist is stored under /Library, then. Let me amend that to "this made everyone in IRC, and probably many other non-Mac-expert troubleshooters, assume that this is an Apple-installed package." It'd be great for this to go somewhere that feels like "Oh, this was installed by you"; /Library feels kinda weird for a server, though I can understand your reasoning. Maybe even /Library/EnterpriseDB/PostgreSQL to make it obvious?

>> - The uninstaller is hidden in /Library/PostgreSQL, which (since Finder
>> hides /Library by default) you're likely to go to via Terminal. But the
>> uninstaller is a Mac app, so even if you find it you have to know to use
>> "open" to run it, because Mac apps are really directories that the Finder
>> abstracts away from you.
> Yes.

How about a one-liner bash script "uninstall-postgresql" that does nothing but "open uninstall-postgresql.app"?

>> - The EDB docs are written against 8.4.
> Only if you install 8.4. If you install 8.3 you get the 8.3 docs, 9.0
> the 9.0 docs and so on.
No, I meant on the web: http://www.enterprisedb.com/resources-community/pginst-guide

That's what made me assume that the installer wasn't maintained (except as to repackaging new PG versions, obviously). It's obviously not hard to replace "8.3" with "9.1" when you read it, but it still leaves an impression akin to "This web site works best with IE7 and above." Allow me to now replace most of this thread with "hey, you might wanna update that page."

>> - There are eight ways to install Postgres on a Mac
> That isn't any more of a reason to discount the EDB installer than any other.

Nope, just an argument that the recommended installer should handle that nicely.

>> - We just had two folks within an hour, BOTH with conflicting installs of
>> Postgres.
> Not sure how that is relevant either. You can have conflicting
> installation using any of the installation methods, including a
> home-built source tree.

Right, but I suspect this is a common problem - not only have I seen it in IRC but 3 or 4 times in
Re: [HACKERS] Switching to Homebrew as recommended Mac install?
Tom Lane wrote:
> While you might not like the EDB installer, at least those folks are active in the lists and accountable for whatever problems their code has. Who in heck is responsible for the "homebrew" packaging, and do they answer questions in the PG lists?

Just for general knowledge... Who's responsible is whoever wants to be; homebrew is open source, and with a github-based workflow, it's trivial for them to accept pull requests. Across the 1967 formulas (packages) in the repo, there have been 1759 contributors. I was volunteering to be the maintainer and liaison if we did this; I'm pretty good at email and IRC.

It's actually pretty clever and elegant - homebrew itself uses git and github for formula updates and distribution, and is written in ruby1.8 which ships with all Macs. /usr/local is a git repo, "brew update" is mostly "git pull", and "brew search" checks for new pull requests if it doesn't find a matching formula. The whole thing's all of 1500 lines of code, and you saw what formulas look like.

Jay

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching to Homebrew as recommended Mac install?
Dave Page wrote:
> On Mon, Apr 2, 2012 at 12:29 AM, Jay Levitt wrote:
> Just as an FYI, a large percentage of the PostgreSQL developers are Mac users, including myself. They're also the company standard at EnterpriseDB - so we're not entirely unfamiliar with software development on them.

Good to know; now I realize that even on Macs there are two worlds. I think it's important to understand the new world, because although very little infrastructure spans two generations, I believe that PostgreSQL is uniquely positioned to do exactly that, just as POSIX has. It's a performant ACID relational data store for the old guard, but it's also a web-scale cloud-friendly polyglot-extensible preinstalled XML/JSON/hstore/tree/graph/schemaless store for the new, and like Java HotSpot, it gets smarter and faster every day. It is an operating system for your data, it's free, and Oracle hasn't bought it (except yesterday). This goes back to the "marketing challenge, not technical challenge" threads.

>>>> - We have few Mac experts hanging out in #postgresql.
>>> Not sure how this is relevant to the proposal.
>> The impetus for the idea was that there seems to be a steady stream of novice PG users on Mac who come into #postgresql with installation problems,
> If you see someone report a bug with the installers, please have them report it on the EnterpriseDB forums: http://forums.enterprisedb.com/forums/show/9.page

It's not a bug with the EDB installer, except insofar as the bug is "the EDB installer is not prominent and widely recommended enough and so they used homebrew or macports". The EDB installer is what they used once upon a time; today they are following a blog post that walked them through homebrew (probably as one line of a longer process involving memcached, sphinx, nginx, node.js, and seven other things).
>> The EDB installer will always be a one-off experience; most of the other servers you install will be through a package manager, and homebrew's popularity (despite its youth) is impressive.
> I would disagree with that. Most users I know do not use things like homebrew (particularly those coming from Windows who have no familiarity with such package managers at all).

Yep; two worlds.

>> [docs reference 8.4]
> That hasn't been updated because the installation steps haven't changed and I'd rather spend time writing software than updating screenshots.

Fair. I wonder if there's an easy way to automate the generation and insertion of those screen shots. You could always write an AppleScript/Automator/Keyboard Maestro script, but I feel like I've seen a better way specifically for tech docs; I'll keep an eye out.

> A couple of points of note:
> - The introduction says: "This document is based on the 8.4.x one-click installers but applies equally to later versions."
> - The doc also explains where to find the uninstaller.

Sure, and if humans read docs, instead of just glancing at them, that'd be all you needed. In any case, I could counter myself that nobody reads the doc period, so it doesn't matter what version is listed; that's just the source of my own misunderstanding about maintenance.

>>>> - There are eight ways to install Postgres on a Mac
>>> That isn't any more of a reason to discount the EDB installer than any other.
>> Nope, just an argument that the recommended installer should handle that nicely.
> It does. It'll detect that the port is in use and suggest a different one. I don't know of any other of those installation methods that'll do that.

That's terrific, and you're right - it's the only one.

>>>> 2. The current formula installs Postgres as the desktop user, not as the
>>>> _postgres role account.
>>> That's not very helpful on shared machines - and whilst it may be fine
>>> for developers etc, it's not the recommended way to setup PostgreSQL
>>> for any kind of production use.
>> Oh, of course. Homebrew does not target the three remaining people who run
>> production XServes. It's purely for Mac developer workstations. At
>> startups. Which are MacBooks! :)
> "Production" doesn't necessarily mean "server". All those thousands of
> Poker Tracker users that run with PostgreSQL on Windows on their home
> machines are production users for example.

Excellent point, and one I hadn't thought of.

>> 1. Rubyists in general are sick of sudo'ing on their laptops, because It Doesn't Matter (as I'll fail to argue later). Homebrew puts itself into a directory that is user-writable so it does not require sudo for basic installation.
> Nice. You just turned me off ever wanting anything related to Ruby on my Mac either!

Well, for "Rubyists", I should properly substitute "younger
Re: [HACKERS] Switching to Homebrew as recommended Mac install? / apology
David Johnston wrote:
> Just trying to bridge an apparent gap since the original e-mail seems to
> have come across as so adversarial that the underlying thoughts have
> been overlooked. Trying to contribute in my own way with my current
> resources.

Thanks, but it's my own fault for basing a half-baked "let's rewrite everything" idea on a few wrong assumptions without asking some simple questions first. (Sorry, David.) And you guys don't know me very well yet, so you don't know how to interpret my tone, or that I spend the *first* half of each day making the exact opposite arguments to all the young turks in the hopes we'll all meet in the middle. I plan to hang around, so by way of belated introduction, and you can imagine this in the style of Beetlejuice:

Hi. I wrote AOL's OLTP-style mail server in 1994 and scaled it (with an eventual team of 25) from 27 to 4000 TPS on minicomputers as powerful as an iPhone. It had multi-site replication, zero-downtime installs (without blocking writes), and served 1.5 million simultaneous users. I had to write nearly a whole SVM-based OS in the app, because nothing at the time - from the message bus to the disk cache to poll() itself - could handle our load, and our growth outpaced the hardware manufacturers' ability to build it. I did this with no CS background beyond public school (I started as a 6502 hacker), no search engine, and no access to research papers or computer scientists. I have no idea how. The architecture survived five underlying infrastructure transitions (Stratus filesystem, Sybase, Informix, Tandem, and the move from four- to eight-byte IDs that PG has successfully staved off) while migrating live. It never lost or misrouted a message, and never had a security breach in the nine years I led it (or, AFAIK, since), despite the fact that we were a big enough target for hackers to respond to our changed defenses within hours. I do actually know this stuff, or did.
I spent 10 years taking a break, studying music, successfully sleeping through Java; now I'm back in technology, leading development in a Rails/web/JS startup, and luxuriating in the ability to actually store data in a database this time - because you guys have spent the past 20 years studying and improving the hard parts of abstracting performant, reliable, scalable data storage. I have a tendency to see the possible endgame and insist on starting now, and if I get too idealistic, ambitious, and "MongoDB is webscale", please just drag me into a conversation about lock contention and distributed 2PC and I'll forget the whole thing. But I really do think PG can be the makes-everything-possible, does-most-things-well data store - really, data platform - for the next decade or two, and I want to contribute. I'm provocative, playful and grandiose, I apologize except not really, and it's because in my n=1 experience, the way life works is (a) you decide to change the world and then (b) you do.

> You do not need permission to contribute to the community
> in the way you seek so what is it that you are really asking for?

Nothing at this point. I was thinking out loud, and at the time was temporarily insa^h^h^hconvinced that the homebrew formula should be the community standard, and thus that I'd have to bring it up to some level of acceptability/review. I've contributed to the formula in the past, and will continue to do so based on the thoughts everyone's shared here. It doesn't need to be official to be useful, and as Dave Page said, it's not gonna be listed in the docs no matter what, given the one decision that homebrew makes (/usr/local) that I can't override.

> When brew is replaced by something more popular do you think you will
> continue to maintain the recipe or is it going to end up stuck showing
> us how to install version 9.3 or earlier?

Like anything, I'll maintain it until it becomes useless to me or vice versa, and someone will pick it up or they won't.
But just to be clear, Homebrew's a source-based repo (so there's no cross-compiler issues), pulling from the upstream source repository, using only the stock compiler toolchain, Intel-only, on a platform where the only hardware manufacturer has themselves severely constrained the number of possible configurations. For the most part, updating the formula to "package" new versions is a matter of changing the following two lines:

  url 'http://ftp.postgresql.org/pub/source/v9.1.3/postgresql-9.1.3.tar.bz2'
  md5 '641e1915f7ebfdc9f138e4c55b6aec0e'

Unless the instructions for "How to build postgres from source" change, nothing else in the formula *needs* to. The current formula is fairly simple; aside from user instructions, the code is 75 lines and mostly consists of default arguments to ./configure. (Formula attached for the curious.) Pull requests are freely and quickly accepted after a quick review; the homebrew repo is operated more in the "fail early and often" s
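To give a feel for those two lines in context, here's a minimal sketch of a formula's shape; the `Formula` stub below merely stands in for Homebrew's real Ruby DSL so the snippet runs standalone (Homebrew supplies the actual base class, with an install method and much more):

```ruby
# Stand-in for Homebrew's Formula DSL so this sketch runs under plain
# Ruby; the real base class ships with Homebrew itself.
class Formula
  class << self
    # Each DSL word acts as a setter with an argument, getter without.
    def url(u = nil)
      u ? @url = u : @url
    end

    def md5(m = nil)
      m ? @md5 = m : @md5
    end
  end
end

# A release bump usually touches only the url and md5 lines; the rest
# of a real formula is mostly an install method wrapping the usual
# ./configure && make && make install dance.
class Postgresql < Formula
  url 'http://ftp.postgresql.org/pub/source/v9.1.3/postgresql-9.1.3.tar.bz2'
  md5 '641e1915f7ebfdc9f138e4c55b6aec0e'
end
```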
Re: [HACKERS] Switching to Homebrew as recommended Mac install?
Robert Haas wrote:
> On Mon, Apr 2, 2012 at 5:23 AM, Dave Page wrote:
>> If homebrew intentionally creates a hole like that, then for as long as I'm one of the PostgreSQL webmasters it will *never* be listed on our download pages.
> I think that's a bit harsh. It's not as if the PostgreSQL package creates the security hole; it's something that the packaging system does itself, independent of whether or not you try to install PostgreSQL with it. So it seems to me that refusing to list it is making life difficult for people who have already made the decision to use brew, without any compensating advantage.

In fairness to Dave, I think he was still reacting to my initial proposal that homebrew be the *recommended* install. In that case, people might install homebrew specifically because postgresql.org recommended it. If the consensus is that /usr/local/* user-writable is insecure, I certainly wouldn't object to a little note that said "If you're using homebrew, do 'brew install postgresql', but we don't recommend homebrew for security reasons"; a little pressure might provide the impetus for homebrew to allow a better way.

That said, about 8 months ago Homebrew's defaults changed. It no longer requires /usr/local to be directly writable; it will sudo if necessary during the initial installation to create its subdirectories. Those directories are mostly user-writable, though:

% ls -l /usr/local
total 8
drwxr-xr-x   37 jay   admin  1.2K Mar 31 16:39 Cellar/
drwxr-xr-x    7 jay   admin  238B Feb 29 10:51 Library/
-rw-r--r--    1 jay   admin  789B Feb 29 10:57 README.md
drwxr-xr-x  499 jay   admin   17K Apr  1 15:29 bin/
drwxr-xr-x    9 jay   admin  306B Mar  7 16:23 etc/
drwxr-xr-x   69 jay   admin  2.3K Mar 16 16:48 include/
drwxr-xr-x  178 jay   admin  5.9K Mar 16 16:48 lib/
drwxr-xr-x    3 root  admin  102B Mar 14 13:20 man/
drwxr-xr-x   20 jay   admin  680B Mar 31 16:40 share/
drwx------    3 jay   admin  102B Feb 29 11:43 var/

At no point was anything in /usr/local *world*-writable, FWIW.
> That doesn't mean that I approve of brew's approach to this problem, though. Even if you think that it's unimportant to keep the desktop user from usurping root privileges, having some things installed in /usr/local as root and others as the desktop user (multiple different desktop users?) seems like a recipe for chaos.

I think the brew designers expect most folks to either not have anything in /usr/local from outside homebrew, to not install anything there as root, to understand the security consequences, or to use homebrew as root even though it's unsupported, and deal with their own bugs.

> I can't help but wonder if this isn't just the natural way a packaging system evolves - you start with something very simple (like what brew is now) and then gradually you realize that there are some annoyances, so you file those down by adding some more complexity, and eventually you end up with a system that's just as complex as the ones that you originally thought were too complex.

Packaging systems? I thought that's called "all software ever"! brew's lucky in that the Mac by definition is not a heterogeneous environment, and so Mac users don't expect it to be. (Last I checked, it's either difficult, impossible or unsupported to boot from a volume other than your filesystem root.) By being mostly source-fetch-only, sitting on top of git, and not packaging either system-provided features (many of which would require root) or repackaging gems/eggs/nodes, they've avoided a lot of the hard problems. But sure, it's only two years old and it will get more complex over time.

Jay

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Switching to Homebrew as recommended Mac install?
Dave Page wrote:
> Exactly - which is why I was objecting to recommending a distribution of PostgreSQL that came in a packaging system that we were told changed /usr/local to be world writeable to avoid the use/annoyance of the standard security measures on the platform.

We... that's not exactly what happened: I originally wrote:

> POSSIBLE OBJECTIONS/PREREQUISITES
> 1. homebrew installs everything under /usr/local and makes that user-writeable.

So. :)

Jay

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Pushing restrictions down into GROUP BYs?
A while back, I posted a pathological minimal-case query where, in order to select one row from a users table, Postgres needed to scan the whole users table, because the restriction was not visible to the GROUP BY. At the time, Tom wrote:

> Don't hold your breath waiting for that to change. To do what you're
> wishing for, we'd have to treat the GROUP BY subquery as if it were an
> inner indexscan, and push a join condition into it. That's not even
> possible today. It might be possible after I get done with the
> parameterized-path stuff I've been speculating about for a couple of
> years now; but I suspect that even if it is possible, we won't do it
> for subqueries because of the planner-performance hit we'd take from
> repeatedly replanning the same subquery.

http://archives.postgresql.org/pgsql-performance/2011-11/msg00131.php

Given all the work that's been done on parameterized paths and LATERAL, is this something to revisit yet? In 9.3, it's at least possible to manually tweak the SRF, thanks to LATERAL. But it'd be great to allow set-returning functions to remain blissfully unaware of their surroundings. Modular code, Single Responsibility Principle, all that.

I guess a more general question is: Are there cases where the planner can *use* LATERAL functionality to push down restrictions like this? (Do LATERAL and pushdown conceptually overlap? I think maybe they do.)
Example code below - and before you say "but you could just use 'where exists'", trust me that the original queries were much more involved :)

----

drop schema if exists jaytest cascade;
create schema jaytest;
set search_path to jaytest;

create table questions (
  id int not null primary key,
  user_id int not null
);
insert into questions
select generate_series(1,1100), (random()*200)::int;

create table users (
  id int not null primary key
);
insert into users select generate_series(1, 200);

vacuum freeze analyze;

create function srf()
returns table (user_id int) as $$
  select u.id from users as u group by u.id
$$ language sql stable;

-- Option 1: Use the set-returning function
explain analyze
select questions.id
from questions
join srf() on srf.user_id = questions.user_id
where questions.id = 1;

-- Option 2: Use the equivalent of the set-returning function
-- (remove any doubts about function call overhead)
explain analyze
select questions.id
from questions
join (
  select u.id as user_id from users as u group by u.id
) as srf1 on srf1.user_id = questions.user_id
where questions.id = 1;

-- Option 3: Now that we have LATERAL, manually
-- push the join down into the set-returning function
create function srf_lateral(questions_user_id int)
returns table (user_id int) as $$
  select u.id from users as u
  where u.id = questions_user_id
  group by u.id
$$ language sql stable;

explain analyze
select questions.id
from questions, lateral srf_lateral(questions.user_id)
where questions.id = 1;

drop schema jaytest cascade;

----

On my machine, options 1 and 2 usually have runtimes of 100-400 ms, and option 3 is about 0.04ms.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Last gasp
Alvaro Herrera wrote:
> Now what would be sort of neat is if we had a way to keep all the versions of patch X plus author and reviewer information, links to reviews and discussion, etc. in some sort of centralized place.

FWIW: y'all might have discussed this to death during the git migration, so *please* do not let me derail you if so... github does a great job of exactly this. You open an issue, you reference it from commits, all the related commits are listed in (and browseable from) the issue, you can comment on specific lines of the commit, it integrates w/email, it has an API to write tools (both workflow and archival) against, etc. Rather than extend the CF app into a trivial-patch workflow app, it might be worth looking at integrating it with github.

Jay Levitt

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Last gasp
Christopher Browne wrote:
> On Thu, Apr 12, 2012 at 6:11 PM, Jay Levitt wrote:
>> Rather than extend the CF app into a trivial-patch workflow app, it might be worth looking at integrating it with github.
> There's a reluctance to require a proprietary component that could disappear on us without notice.

Excellent point. I was thinking that GitHub's API would allow archival exporting to counter that, along the lines of "let's take advantage of it for the next five years until it goes south, and THEN we could write our own". But I can see how that might not be the best choice for a project that expects to preserve history for a few decades. GitHub does offer an "enterprise version" that you can self-host, but it seems to be priced per-user and intended for solely intranet use. If the feature set is desirable, though, I wonder if Postgres is big/high profile enough for them to figure out some sort of better arrangement. They *love* it when big open-source projects use GitHub as their public repo - they'll email and blog announcements about it - and if there's interest I'd be happy to open a conversation with them.

> The existence of git itself is a result of *exactly* that circumstance, as Linux kernel developers had gotten dependent on BitKeeper, whereupon the owner decided to take his toys home, at which point they were left bereft of "their" SCM tool. <http://kerneltrap.org/node/4966>

Good history lesson there, with a great outcome.

> I expect that it would be more worthwhile to look into enhancements to git workflow such as <http://code.google.com/p/gerrit/> Gerrit. I don't know that Gerrit is THE answer, but there are certainly projects that have found it of value, and it doesn't have the "oops, it's proprietary" problem.

I've looked at it in conjunction with Jenkins CI; it looked nice but was way too heavy-weight for a four-person startup (what's code review?). It's probably much more suitable for this sized project.
Gerrit's a full-featured code review app with a tolerable UI; I was thinking of GitHub more as a great lightweight UI for doc patches and other trivial patches where you might have lots of casual review and comments but no need for, say, recording regression tests against each patch version. e.g.: https://github.com/rails/rails/pull/5730 Also, for doc patches, GitHub has the great advantage of in-place editing right from the web UI. Peter mentioned the desire to bring more eyes and hands onto these type of patches - I think the phrase was "enthusiast power users who wouldn't really consider themselves hackers." The advantage of GitHub here would be its (current) widespread adoption; the perceived barrier to entry is far lower and the workflow is far more obvious than a mailing list, formatting patches by email, not quite knowing what the process is. I mention all this not to try to push GitHub specifically, but to say "these are the types of features that modern, open-source collaborative systems offer, and could improve community involvement". From a human- rather than technology-oriented perspective: I was shocked to find that you folks *WANT* reviews from non-contributors. It was my assumption as a newcomer that if I don't feel well-versed enough to submit patches yet, the last thing you'd want me to do was to look over someone else's patch and say "Yeah, that looks good", any more than I care if my mom thinks my latest web app is "very nice". I see now that the "Reviewing a Patch" wiki page explains this, but maybe this info should be pushed higher into the docs and web site; a "How can I contribute" page, open calls for reviewers on the non-hackers mailing lists, things like that. Or maybe just make the wiki page bright red and blink a lot. Jay -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Last gasp
Alex wrote:
> I didn't follow this whole thread, but have we considered Redmine[1]?

As the resident "Ruby is shiny, let's do everything in Rails on my MacBook" guy, I'd like to make a statement against interest: I've tried Redmine a few times and it's been painful. Much of the codebase is deprecated, it's slow, it has no meaningful search (in 2012?!), I've seen wiki edits disappear, and at the moment pulling up its own FAQ page at redmine.org times out. Maybe you've had better luck with it, but whenever I've Googled for Redmine questions, the collective Internet has sighed and said "Yeah, it was a really good idea, though."

Jay Levitt

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Last gasp
Alex wrote:
> Jay Levitt writes:
>> Alex wrote:
>>> I didn't follow this whole thread, but have we considered Redmine[1]?
>> As the resident "Ruby is shiny, let's do everything in Rails on my MacBook" guy, I'd like to make a statement against interest: I've tried Redmine a few times and it's been painful. Much of the codebase is deprecated, it's slow, it has no meaningful search (in 2012?!), I've seen wiki edits disappear, and at the moment pulling up its own FAQ page at redmine.org times out.
> Yay, that's totally FUD to me.

You're right, it was. My bad. Someday I will find the balance between precision and concision.

> Could you please elaborate a bit on your points? Deprecated codebase? Let me guess... It runs on an outdated version of Rails (2.3) but only because Rails is changing so rapidly, I believe. There is work in progress[1] to move to the supported branch Rails-3.x.

I wasn't even thinking of that; I know many production systems still run on Rails 2.3, and in fact it probably even performs better for some workloads. 3.x is a mixed bag. I don't hold that against Redmine. But it's still FUD, because I can't remember where I saw this information. So: withdrawn.

> Slow? Do you have any data to back this point up?

No measurable data; just a sigh of relief when switching from Redmine to Github - and GitHub ain't a speed demon. In general, I've seen multi-second page load times on crazy-simple things like wiki edits; this was on a hosted provider (sourcerepo.com), but they also hosted our git repo and we had no speed problems there.

> No meaningful search, eh? Works for me.

Redmine searches return partial-word matches, and there's no way to disable that. Searching for "test" finds "latest". To me, that's broken. Also, the UI is very 5 years ago; e.g., "compare revisions" uses the same columns-of-radio-buttons approach as MediaWiki. If the goal is a tool to reduce friction and increase involvement, you want a smoother UX.
Jay
Re: [HACKERS] Bug tracker tool we need
Alex wrote:
> I still fail to see how Redmine doesn't fit into requirements summarized at that wiki page[1], so that must be something other than formal requirement of being free/open software and running postgres behind (some sort of "feeling" maybe?)

Well, if those requirements are in fact requirements, Redmine could suit the purpose (perhaps with some custom extensions). But yes, Redmine "feels" wrong to me, though I have never been particularly happy with *any* self-hosted bug tracker. Of those I've used - Redmine, Mantis, JIRA, Trac, Bugzilla, GForge, RT - Redmine feels the least-worst to me; it's about equal to Trac, and it's in Ruby so I could theoretically(!) improve it.

I think the biggest missing pieces in Redmine aside from custom CF stuff are: better search, single sign-on (it requires Yet Another Login), a better UX (AJAX, syntax highlighting) and better git integration (a la pull requests, where private git commits = patches). Those are some pretty big pieces. I don't think Redmine out-of-the-box would improve either CFs or community involvement.

> Jay, Alvaro, Dimitri (and whoever else wants to speak up) could you please describe your ideal tool for the task?

My opinion isn't all that important, since I currently have an infinite opinion-to-contribution ratio, but in my unicorniverse: We'd accept that open source hasn't always produced great UX, we'd use GitHub's issue tracker, allow volunteers to do "bug wrangling" triage via tags, use GitHub hooks to integrate with the existing CF app, and write archiving tools that would let us easily export everything off of GitHub for when (a) something better comes along or (b) GitHub pops out of existence or adds egregious licensing terms like BitKeeper.
Reasons:

- Familiarity: Many developers already have a GitHub account and use it
- Discoverability: GitHub has great SEO
- Tight integration of git with patch and issue management (pull requests, fork networks, etc.); eliminates ceremony rather than adding to it
- Readable UI with syntax highlighting, etc.
- Patch commenting and git integration encourage actual review-resubmit cycles instead of "Here, look, I fixed it for you" reviews
- Two-way email/web integration
- Meets Tom's "would be sort of neat" criteria[1]
- Could easily implement Simon's "pony" criteria[2] with tags and API
- Easily extensible with API and hooks
- Subjectively: Its design encourages better community and core interactions than any I've seen in 25 years.

GitHub could well be a non-starter, but if third-party-dependence is really the holdup, I'd volunteer to write the tools - in fact, a google of [export issues from github] shows a few that might already suffice.

> Given that every other existing tool likely have pissed off someone already, I guess our best bet is writing one from scratch.

ISTR there's a great "Writing your own bug tracker is an anti-pattern" blog, but I can't find it anymore.

> Or maybe there isn't really a need for a tracker? The core team have managed to live without one for so long after all...

As an end-user, I've reported exactly one bug in a release version of Postgres, and it was fixed (and back-ported!) the next day. So I really can't complain about the tracking of actual bugs. Sounds like we do need something better for CF/patch workflow, though.

Jay

[1] Tom wrote:
> Now what would be sort of neat is if we had a way to keep all the versions of patch X plus author and reviewer information, links to reviews and discussion, etc. in some sort of centralized place

[2] Simon wrote:
> My I-Want-a-Pony idea is some kind of rating system that allows us all to judge patches in terms of importance/popularity, complexity and maturity.
> I guess a Balanced Scorecard for the development process. So we can all see what's going on.
Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join
Simon Riggs wrote:
> I'd like to see something along the lines of demand-created optional indexes, that we reclaim space/maintenance overhead on according to some cache management scheme. More space you have, the more of the important ones hang around. The rough same idea applies to materialised views.

+10; this sort of demand-driven optimization could be the database equivalent of Java's HotSpot (which accomplished the amazing task of making Java kinda fastish sometimes).

Jay
Re: [HACKERS] Bug tracker tool we need
Magnus Hagander wrote:
> On Mon, Apr 16, 2012 at 23:48, Jay Levitt wrote:
>> - Familiarity: Many developers already have a GitHub account and use it
>
> Most of the more senior developers don't use github. Other than possibly as a place to store a plain git repository. So that's not really relevant.

I meant outside developers - the folks you'd like to see more involved in the process.

>> - Patch commenting and git integration encourage actual review-resubmit cycles instead of "Here, look, I fixed it for you" reviews
>
> The amount of spam coming through that system, and the inability/unwillingness of github to even care about it is a killer argument *against* github. We have working antispam for email. The github antispam is somewhere around where email antispam was in 1994.

Interesting; I haven't run into this but you're the second person to mention it here. Antispam is (in the large) a technically unsolvable problem; even in the '90s, we'd see hackers start poking at our newest countermeasures within the hour. GitHub is a giant target, and PG probably benefits here from NOT being one. (A quick Google shows redmine and especially Trac having spam issues of their own.)

Pedantic note/fun fact: There was no email antispam in 1994; Canter & Siegel posted their infamous USENET Green Card spam that year, but it didn't really spread to email for another year or two. Once it did, there were fervent debates about whether it should be called "velveeta" to distinguish it from the USENET variety.

>> GitHub could well be a non-starter, but if third-party-dependence is really the holdup, I'd volunteer to write the tools - in fact, a google of [export issues from github] shows a few that might already suffice.
>
> It *is* a non-starter, because (a) it's a third party dependency, and (b) AFAIK they don't provide *data access* to the issue trackers.
Sure they do: http://developer.github.com/v3/issues/

Jay
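To make the point concrete, here is a minimal sketch of pulling issue data back out through that v3 endpoint. Only the endpoint path and its `state`/`per_page`/`page` parameters come from the linked docs; the helper names and the summary fields chosen for an archive are my own illustration, not anything the community has settled on.

```python
import json

# The v3 endpoint documented at http://developer.github.com/v3/issues/ is:
#   GET https://api.github.com/repos/{owner}/{repo}/issues
def export_url(owner, repo, page=1):
    """Build the paginated export URL for one page of a repo's issues."""
    return ("https://api.github.com/repos/%s/%s/issues"
            "?state=all&per_page=100&page=%d" % (owner, repo, page))

def issue_summaries(raw_json):
    """Reduce the API's issue objects to the few fields an archive needs."""
    return [{"number": i["number"], "title": i["title"], "state": i["state"]}
            for i in json.loads(raw_json)]

# Offline demonstration with a response shaped like the API's
# (a real export would fetch export_url(...) page by page):
sample = json.dumps([
    {"number": 1, "title": "example exported issue", "state": "open"},
])
print(issue_summaries(sample))
```

An archiving tool along these lines would loop over pages until the API returns an empty list, which is the part that addresses the "can we get our data back out" objection.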
Re: [HACKERS] Bug tracker tool we need
Alex Shulgin wrote:
> Jay Levitt writes:
>> (A quick Google shows redmine and especially Trac having spam issues of their own.)
>
> Ugh, redmine (or trac for that matters) has nothing to with handling spam. I believe a typical bug tracker doesn't handle spam itself, it lets the mailing system do that. Surely you can throw in some captcha plugins to try to reduce the spam posted from the web UI.

Maybe I'm confused - Magnus et al, are we talking spammy issues/issue comments/etc., or are we talking more about exposed email addresses? I assumed we meant spammy issues, like blog comments - spammers post issues and comments with links, to get PageRank for their sites. Email defenses wouldn't help here.

Captchas are fairly pointless nowadays, assuming you have someone dedicated enough to write a spambot against your bug tracker. Most of them (even reCAPTCHA!) can be >80% defeated by software - many 99% - and there are millions of humans hanging out on Mechanical Turk who'll solve them for you 100%. Modern anti-spam ends up being a machine learning and systems exercise... but that's another mailing list :) I think Google gets more use out of reCAPTCHA for OCR tweaking than for anti-spam.

Jay
Re: [HACKERS] Bug tracker tool we need
Greg Smith wrote:
> On 04/17/2012 09:20 AM, Jay Levitt wrote:
>> Antispam is (in the large) a technically unsolvable problem; even in the '90s, we'd see hackers start poking at our newest countermeasures within the hour. GitHub is a giant target, and PG probably benefits here from NOT being one.
>
> Everyone who deals with list moderation and spam issues around PostgreSQL just got a belly laugh from that comment. Hint: the PostgreSQL lists had already been around and therefore were being targeted by spammers for over ten years before GitHub even existed.

Hehe. OK, we will have to battle this out over drinks if I ever make it to PGCon... but teaser: I've bankrupted Sanford Wallace and taught the DOJ what spam was.

>> Pedantic note/fun fact: There was no email antispam in 1994
>
> I like it when Magnus really gets the details perfect when making a deadpan joke.

Dammit. I *fail*.

> Anyway, back to serious talk, I believe GitHub is a dead end here because the "primary key" as it were for issues is a repo. A bug tracker for PostgreSQL would need to have issues broken down per branch and include information similar to the release notes for each minor point release. Tracking when and how a bug is backported to older versions is one hard part of the problem here.

That's a great point. Both GitHub and git itself have no real concept of releases, and can't tell you when a commit made it in. Although... there's some sort of new release-note functionality. Maybe I'll play and see if it'd be applicable here.

Jay
Re: [HACKERS] 9.2 release notes, beta time?
Magnus Hagander wrote:
> 2012/4/28 Josh Berkus:
>> Ugh. Maybe the whole idea of getting a beta out before PGCon is doomed. Still, if we don't try for this schedule, we're looking at at least two more weeks' slip, because we're surely not going to wrap during PGCon.
>
> We could do it in person! We could have Berkus handing out CDs!

Or, if you guys want to mail out CDs to the whole US, I know some people.

Jay
Re: [HACKERS] Request to add options to tools/git_changelog
Bruce Momjian wrote:
> I am again requesting the addition of options to tools/git_changelog so I can more easily produce the release notes. I asked for this during 9.1 development and it was rejected. I am currently using my own custom version of the tool, but have to merge community improvements into the tool every year before I use it.

FYI, in the general case of "I have to maintain a patch set": now that PG is on git, there's a tool called Stacked Git that lets you use git's excellent merge capabilities to maintain patches.

http://www.procode.org/stgit/

Jay
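The gist of StGit is that local modifications live as a named stack of patches on top of the upstream branch, and the whole stack gets replayed with git's merge machinery whenever upstream moves. A rough command sketch (the `stg` subcommands are real; the patch name is hypothetical, and this assumes a clone tracking the community repo on origin/master):

```
# one-time setup on the branch that tracks upstream
stg init

# capture the local change as a named, managed patch
stg new git-changelog-options -m "Add options to tools/git_changelog"
# ...edit tools/git_changelog...
stg refresh          # fold working-tree changes into the patch

# after fetching community improvements, replay the patch stack
# onto the new upstream tip instead of hand-merging every year
stg rebase origin/master
```

That last step is the part that replaces the annual "merge community improvements into my custom tool" chore.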