[HACKERS] semi-PoC: kNN-gist for cubes

2012-02-06 Thread Jay Levitt
I have a rough proof-of-concept for getting nearest-neighbor searches 
working with cubes.  When I say rough, I mean I have no idea what I'm 
doing and I haven't written C for 15 years but I hear it got standardized 
please don't hurt me.  It seems to be about 400x faster for a 3D cube with 
1 million rows, more like 10-30x for a 6D cube with 10 million rows.


The patch adds operator - (which is just the existing cube_distance 
function) and support function 8, distance (which is just g_cube_distance, a 
wrapper around cube_distance).


The code is in no way production-quality; it is in fact right around look! 
it compiles!, complete with pasted-in, commented-out code from something I 
was mimicking.  I thought I'd share at this early stage in the hopes I might 
get some pointers, such as:


- What unintended consequences should I be looking for?
- What benchmarks should I do?
- What kind of edge cases might I consider?
- I'm just wrapping cube_distance and calling it through DirectFunctionCall; 
it's probably more proper to extract out the real function and call it 
from both cube_distance and g_cube_distance. Right?

- What else don't I know?  (Besides C, funny man.)

The patch, such as it is, is at:

https://github.com/jaylevitt/postgres/commit/9cae4ea6bd4b2e582b95d7e1452de0a7aec12857

with an even-messier test at

https://github.com/jaylevitt/postgres/commit/daa33e30acaa2c99fe554d88a99dd7d78ff6c784

I initially thought this patch made inserting and indexing slower, but then 
I realized the fast version was doing 1 million rows, and the slow one did 
10 million rows.  Which means: dinnertime.


Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-07 Thread Jay Levitt

[Posted at Andres's request]

TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in 
various builds.


NOTABLE PROBLEMS

1. In 9.1.2, inserting 10x rows takes 19x the time.
   - 9.1-HEAD and 9.2 fix this; it now slows down linearly
   - but: 10s  8s  5s!
   - but: comparing Ubuntu binary w/vanilla source build on virtual disks, 
might not be significant


2. In both 9.1 and 9.2, there is a long delay before CREATE INDEX realizes 
it can't work on an unlogged table

3. In 9.2, creating the 10-million-row index always fails
4. 9.1-HEAD never successfully indexes 10 million rows (never = at least 
20 minutes on two runs; I will follow up in a few hours)


DETAILS

Times are in seconds, single run.

+---+-+-+--+--+
| Platform  | 1m rows | 1m rows | 10m rows | 10m rows |
|   | INSERT  | CR NDX  | INSERT   | CR NDX   |
+---+-+-+--+--+
| 9.1.2 logged  | 5   | 35  | 98   | 434  |
| 9.1.2 unlogged| 2   | 34[**]  | 22   | 374[**]  |
| 9.1-HEAD logged   | 10  | 65  | 89   | [***]|
| 9.1-HEAD unlogged | 2   | 39  | 20   | 690[**]  |
| 9.2 logged| 8   | 57  | 87   | 509[*]   |
| 9.2 unlogged  | 2   | 33[**]  | 21   | 327[*]   |
+---+-+-+--+--+

[*] psql:slowcube.sql:20: ERROR:  node buffer of page being split (121550) 
does not exist

[**] psql:slowcube.sql:21: ERROR:  unlogged GiST indexes are not supported
[***] never completed after 10-20 minutes; nothing in server.log at default 
logging levels, postgres process consuming about 1 CPU in IOWAIT, 
checkpoints every 7-8 seconds


VARIABILITY

A few runs in a row on 9.1-HEAD, 1 million rows, logged:

++--+
| INSERT | CREATE INDEX |
++--+
| 10 |   65 |
|  8 |   61 |
|  7 |   59 |
|  8 |   61 |
|  7 |   55 |
++--+

SYSTEM SPECS

Amazon EC2, EBS-backed, m1.large
7.5GB RAM, 2 cores
Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz

shared_buffers = 1867MB
checkpoint_segments = 32
effective_cache_size = 3734MB

9.1.2: installed binaries from Ubuntu's oneiric repo
9.1-HEAD: REL9_1_STABLE, ef19c9dfaa99a2b78ed0f78aa4a44ed31636fdc4, built 
with simple configure/make/make install
9.2: master, 1631598ea204a3b05104f25d008b510ff5a5c94a, built with simple 
configure/make/make install


9.1.2 and 9.1-HEAD were run on different (but identically configured) 
instances.  9.1-HEAD and 9.2 were run on the same instance, but EBS 
performance is unpredictable. YMMV.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-07 Thread Jay Levitt

Jay Levitt wrote:

[Posted at Andres's request]

TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in
various builds.


And I bet you'll want the test script... sigh.  attached.
\c postgres
drop database if exists slowcube;
create database slowcube;
\c slowcube
\timing
create schema slowcube;
set search_path to slowcube;

create extension cube;

set work_mem to '1GB';
set maintenance_work_mem to '1GB';

create table cubetest (
  position cube
  );

insert into cubetest (position)
  select cube(array[random() * 1000, random() * 1000, random() * 1000]) from 
generate_series(1,100);
select now();
create index q on cubetest using gist(position);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-07 Thread Jay Levitt

Jim Decibel! Nasby wrote:

I agree that it's probably pretty unusual to index floats.


FWIW: Cubes and points are floats, right? So would spatial indexes benefit 
from this optimization, or is it only raw floats?


Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-09 Thread Jay Levitt

Tom Lane wrote:

Jay Levittjay.lev...@gmail.com  writes:

[Posted at Andres's request]
TL;DR: Inserting and indexing cubes is slow and/or broken in various ways in
various builds.



1. In 9.1.2, inserting 10x rows takes 19x the time.
 - 9.1-HEAD and 9.2 fix this; it now slows down linearly
 - but: 10s  8s  5s!
 - but: comparing Ubuntu binary w/vanilla source build on virtual disks,
might not be significant


FWIW, I find it really hard to believe that there is any real difference
between 9.1.2 and 9.1 branch tip on this.  There have been no
significant changes in either the gist or contrib/cube code in that
branch.  I suspect you have a measurement issue there.


I suspect you're right, given that five runs in a row produced times from 7s 
to 10s.  I just wanted to include it for completeness and in case it 
triggered any a-ha moments.



4. 9.1-HEAD never successfully indexes 10 million rows (never = at least
20 minutes on two runs; I will follow up in a few hours)


Works for me (see above), though it's slower than you might've expected.


So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and 
(probably both of our) 9.1-HEAD takes 1918s... is that something to worry 
about, and if so, are there any tests I can run to assist? That bug doesn't 
affect me personally, but y'know, community and all that.  Also, I wonder if 
it's something like 9.2 got way faster doing X, but meanwhile, HEAD got way 
slower doing Y., and this is a canary in the coal mine.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-13 Thread Jay Levitt

Robert Haas wrote:

On Mon, Feb 13, 2012 at 7:45 AM, Robert Haasrobertmh...@gmail.com  wrote:

On Thu, Feb 9, 2012 at 3:37 PM, Jay Levittjay.lev...@gmail.com  wrote:

So my pre-built 9.1.2 takes 434s, my source-built 9.2 takes 509s, and
(probably both of our) 9.1-HEAD takes 1918s... is that something to worry
about, and if so, are there any tests I can run to assist? That bug doesn't
affect me personally, but y'know, community and all that.  Also, I wonder if
it's something like 9.2 got way faster doing X, but meanwhile, HEAD got way
slower doing Y., and this is a canary in the coal mine.

This might be a lame hypothesis, but... is it possible that you built
your 9.1-tip binaries with --enable-cassert?  Or with different
optimization options?


No, I think I/O just varies more than my repeated tests on 1M rows 
indicated.  I ran the 10M-row test four times on the same server, 
alternating between packaged 9.1.2 and source-built 9.1.2 (default configure 
options), and saw these times:


INSERT  INDEX
apt 76  578
source  163 636
apt 73  546
source  80  473

EBS has no performance guarantees at all; you share your disks with an 
arbitrary number of other users, so if someone in the neighborhood decides 
to do some heavy disk I/O, you lose. Let this be a lesson to me: run 
benchmarks locally!



So I tested.  On my MacBook Pro, your test script builds the index in
just over 25 s on both REL9_1_2 and this morning's REL9_1_STABLE.


I think that's the 1-million version I emailed; try adding a zero and see if 
it doesn't take a little longer.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Designing an extension for feature-space similarity search

2012-02-15 Thread Jay Levitt
[Preamble: I've been told that the hackers list is appropriate for 
extension-related topics like this, even if it's not about contributing to 
core. If I'm misappropriating, please let me know.]


Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think OKCupid for product 
recommendations or Pinterest for people with your tastes. We use psych 
research to measure and predict your personality and traits along a number 
of scales (dimensions), and then we connect you with people, products and 
content we think you'll like.


I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
- The distance between two vectors is a function that aggregates the 
distances of each dimension, along with arbitrary other arguments - for 
instances, it might take the weighted average of the dimensions


That aggregation (which may not literally be an aggregate; I'm not sure yet) 
needs to happen in a SELECT list, which means it needs to be fast, which 
means all this (or at least much of it) has to be C.


The simplest thing that works is probably to hack up the cube extension, 
declare that everything (except inner pages) must be a zero-volume cube 
(cube_is_point()), map our non-float features onto floats somehow, and 
hard-code all the distance functions and the aggregation function.


But I think this sort of similarity-search engine has general utility, and I 
also want to make it easy for us to add and subtract dimensions without too 
much pain; that should be DDL, not code. So thinking about how this might 
evolve...


- I'm not sure how to represent arbitrary column-like features without 
reinventing the wheel and putting a database in the database.  hstore only 
stores text, probably for this reason; I took a look at the earlier json 
patch and saw that it handled only a few core data types.  Have there been 
any other PoCs that involved collections of hetereogenous data? I almost 
want an actual instance of an anyarray.


- Alternatively, is there a way to index an entire, arbitrary row, rather 
than on a column on that row? I'm fine with this extension requiring its own 
table, so I leave the data where it is in the row, and only worry about 
indexing it. I can't just use functional indexes, because I'll need to 
provide operators and support functions to GiST. Maybe I have a fake 
sentinel column, where all the operators use SPI to introspect the row, 
treat each column as a feature dimension, call the underlying operators on 
each column's data type, etc.


- Can domains have operators, or are operators defined on types?

- Does KNN-GiST run into problems when - returns values that don't make 
sense in the physical world? For instance, let's say NULL - NULL returns 
a distance of 1.0. That means that NULL1 - NULL2 = 1.0, and NULL2 - 
NULL3 = 1.0, but NULL1 - NULL3 = 1.0 as well. I think that's fine - that 
could even describe a triangle - but my spidey sense is tingling on this.


- Are there previous discussions, patches, abandoned projects, etc. that 
this reminds you of and that I should go research?


Thanks for any thoughts, and I'd love collaborators or even mentors - we 
plan to open source whatever we produce here, and I don't have quite the 
theoretical background it takes to do this properly.


Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Jay Levitt

Alexander Korotkov wrote:

On Thu, Feb 16, 2012 at 12:34 AM, Jay Levitt jay.lev...@gmail.com
mailto:jay.lev...@gmail.com wrote:

- But a dimension might be in any domain, not just floats
- The distance along each dimension is a domain-specific function

What exact domains do you expect? Some domains could appear to be quite hard
for index-based similarity search using GiST (for example, sets, strings etc.).


Oh, nothing nearly so complex, and (to Tom's point) no composite types, 
either. Right now we have demographics like gender, geolocation, and 
birthdate; I think any domain will be a type that's easily expressible in 
linear terms.  I was thinking in domains rather than types because there 
isn't one distance function for date or float; me.birthdate - 
you.birthdate birthdate is normalized to a different curve than now() - 
posting_date, and raw_score - raw_score would differ from z_score - z_score.


It would have been elegant to express that distance with -, but since 
domains can't have operators, I can create distance(this, other) functions 
for each domain. It just won't look as pretty!


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Jay Levitt

Tom Lane wrote:

Jay Levittjay.lev...@gmail.com  writes:

- I'm not sure how to represent arbitrary column-like features without
reinventing the wheel and putting a database in the database.


ISTM you could define a composite type and then create operators and an
operator class over that type.  If you were trying to make a btree
opclass there might be a conflict with the built-in record_ops opclass,
but since you're only interested in GIST I don't see any real
roadblocks.


Perfect. Composite types are exactly what I need here; the application can 
declare its composite type and provide distance functions for each member, 
and the extension can use those to calculate similarity. How do I introspect 
the composite type's pg_class to see what it contains? I assume there's a 
better way than SPI on system catalogs :) Should I be using systable_* 
functions from genam, or is there an in-memory tree? I feel like funcapi 
gets me partway there but there's magic in the middle.


Can you think of any code that would serve as a sample, maybe whatever 
creates the output for psql's \d?




The main potential disadvantage of this is that you'd have
the standard tuple header as overhead in index entries --- but maybe the
entries are large enough that that doesn't matter, and in any case you
could probably make use of the GIST compress method to get rid of most
of the header.  Maybe convert to MinimalTuple, for instance, if you want
to still be able to leverage existing support code for field extraction.


Probably not worth it to save the 8 bytes; we're starting out at about 20 
floats per row. But good to know for later optimization...





- Can domains have operators, or are operators defined on types?


I think the current state of play is that you can have such things but
the system will only consider them for exact type matches, so you might
need more explicit casts than you ordinarily would.  However, we only
support domains over base types not composites, so this isn't really
going to be a profitable direction for you anyway.


Actually, as mentioned to Alexander, I'm thinking of domains per feature, 
not for the overall tuple, so birthdate-birthdate differs from 
now()-posting_date.  Sounds like that might work - I'll play.



- Does KNN-GiST run into problems when-  returns values that don't make
sense in the physical world?


Wouldn't surprise me.  In general, non-strict index operators are a bad
idea.  However, if the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.


Yeah, that example conflated NULLs in the feature fields (we don't know your 
birthdate) with - on the whole tuple.  Oops.


I guess I can just test this by verifying that KNN-GiST ordered by distance 
returns the same results as without the index.


Thanks for your help here.

Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-16 Thread Jay Levitt

Tom Lane wrote:

- Can domains have operators, or are operators defined on types?


I think the current state of play is that you can have such things but
the system will only consider them for exact type matches, so you might
need more explicit casts than you ordinarily would.


Turns out it's even smarter than that; it seems to coerce when it's unambiguous:

create domain birthdate as date;
create function date_dist(birthdate, birthdate) returns integer as $$
select 123;
$$ language sql;
create operator - (
procedure = date_dist,
leftarg = birthdate,
rightarg = birthdate);

select '2012-01-01'::birthdate - '2012-01-01'::birthdate;
-- 123

select '2012-01-01'::date - '2012-01-01'::date ;
-- 123


create domain activity_date as date;
create function date_dist(activity_date, activity_date)
returns integer as $$
  select 432;
$$ language sql;
create operator - (
procedure = date_dist,
leftarg = activity_date,
rightarg = activity_date);

select '2012-01-01'::activity_date - '2012-01-01'::activity_date;
-- 432

select '2012-01-01'::birthdate - '2012-01-01'::birthdate;
-- 123

select '2012-01-01'::date - '2012-01-01'::date ;
-- ERROR:  operator is not unique: date - date

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Copyright notice for contrib/cube?

2012-02-17 Thread Jay Levitt
I'm basing an extension off contrib/cube. I'm going to open-source it under 
the existing PostgreSQL license, but I'm not sure how the copyright notice 
should look - there isn't one at the moment. (In fact, there's no LICENSE or 
COPYRIGHT file at all.)


Should it be something like

Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group

Portions Copyright (c) 2012, TipTap Inc.

?

Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Copyright notice for contrib/cube?

2012-02-17 Thread Jay Levitt

Marti Raudsepp wrote:

On Fri, Feb 17, 2012 at 17:42, Jay Levittjay.lev...@gmail.com  wrote:

Should it be something like

Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
Portions Copyright (c) 2012, TipTap Inc.


Please don't add that, just change 2011 to 2012. This is what the wiki says:

Q: May I add my own copyright notice where appropriate?


To clarify, this is for an extension to be distributed separately on PGXN 
and GitHub, not for a contribution to the PostgreSQL distribution. It will 
differ greatly from contrib/cube when it's done, but cube is the scaffolding 
I'm starting with.


That said:


Q: Doesn't the PostgreSQL license itself require to keep the copyright
notice intact?
A: Yes, it does. And it is, because the PostgreSQL Global Development
Group covers all copyright holders.


Is that true for separately-distributed extensions as well - if I push this 
to GitHub, my company is part of the PGDG? Where is the PGDG defined?


If not (and perhaps even if so), I think I could still add an additional 
copyright notice without violating the license, since the copyright notice 
and following two paragraphs still appear in all copies. But perhaps it's 
not necessary.


I think the edge case is something stupid like In five years, there is no 
remaining contrib code, and we get bought by MonsantoOracleHalliburton, and 
they want to close-source the code in a way that's somehow incompatible with 
the PostgreSQL license.. can they?


But that does raise two other points:

- cube seems to post-date any work at UC. Should I still include the 
Portions Copyright (c) 1994, The Regents of the University of California?


- Technically, the license could be read such that the above copyright 
notice (singular) refers to the UC copyright notice but not the PGDG 
notice; next time the lawyers run through it, you might want to add an s 
to notices..


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Jay Levitt

Tom Lane wrote:

Jay Levittjay.lev...@gmail.com  writes:

- Does KNN-GiST run into problems when-  returns values that don't make
sense in the physical world?


If the indexed entities are records, it would be
entirely your own business how you handled individual fields being NULL.


This turns out to be a bit challenging. Let's say I'm building a 
nullable_point type that allows the Y axis to be NULL (or any sentinel value 
for missing data), where the semantics are NULL is infinitely far from 
the query.   I'll need my GiST functions to return useful results with NULL 
- not just correct results, but results that help partition the tree nicely.


At first I thought this posed a challenge for union; if I have these points:

(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates like 
LL = -infinity, UR = infinity, or (equivalently, I think) to store a 
saw_nulls bit in addition to LL and UR.


The real challenge is probably in picksplit and penalty - where in the tree 
should I stick (1,NULL)? - at which point you say Yes, algorithms for 
efficient indexes are hard work and computer-science-y and point me at 
surrogate splitters.


Just thinking out loud, I guess; if other GiST types have addressed this 
problem, I'd love to hear about it.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Jay Levitt

Alexander Korotkov wrote:

On Fri, Feb 17, 2012 at 11:00 PM, Jay Levitt jay.lev...@gmail.com
mailto:jay.lev...@gmail.com wrote:

At first I thought this posed a challenge for union; if I have these 
points:


(1,2)
(2,1)
(1,NULL)

what's the union? I think the answer is to treat NULL box coordinates
like LL = -infinity, UR = infinity, or (equivalently, I think) to store
a saw_nulls bit in addition to LL and UR.

Similar problem appears at GiST indexing of ranges, because range can be
empty. There additional contain empty flag was introduced. This contain
empty flag indicates that underlying value can be empty. So, this flag is
set when union with empty range or other range with this flag set. It's
likely you need similar flag for each dimension.


Ah, yes, exactly the same problem. So what led you to add a flag instead of 
using the range NULL..NULL? I'm on the fence about choosing.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Designing an extension for feature-space similarity search

2012-02-17 Thread Jay Levitt

Alexander Korotkov wrote:

On Fri, Feb 17, 2012 at 11:32 PM, Jay Levitt jay.lev...@gmail.com



Ah, yes, exactly the same problem. So what led you to add a flag instead
of using the range NULL..NULL? I'm on the fence about choosing.


At first, range bounds can't be NULL :) At second, if we have range
(a;b)+contain empty in internal page, both facts:
1) All normal underlying ranges are contained in (a;b).
2) There can be empty underlying ranges.
are useful for search.


That makes sense; you're essentially keeping one bit of stats about the 
values present in the range.


I wonder: if I'm indexing a rowtype, then for each column in the row I need 
to store a lower-left and an upper-right bound, plus a might-have-nulls 
flag.  Sounds a lot like a range. Should I just use ranges for that? See a 
downside (overhead)? See an upside (seems less duplicative somehow)? I'm 
fine depending on 9.2.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Future of our regular expression code

2012-02-19 Thread Jay Levitt

Stephen Frost wrote:

Alright, I'll bite..  Which existing regexp implementation that's well
written, well maintained, and which is well protected against malicious
regexes should we be considering then?


FWIW, there's a benchmark here that compares a number of regexp engines, 
including PCRE, TRE and Russ Cox's RE2:


http://lh3lh3.users.sourceforge.net/reb.shtml

The fastest backtracking-style engine seems to be oniguruma, which is native 
to Ruby 1.9 and thus not only supports Unicode but I'd bet performs pretty 
well on it, on account of it's developed in Japan.  But it goes pathological 
on regexen containing '|'; the only safe choice among PCRE-style engines is 
RE2, but of course that doesn't support backreferences.


Russ's page on re2 (http://code.google.com/p/re2/) says:

If you absolutely need backreferences and generalized assertions, then RE2 
is not for you, but you might be interested in irregexp, Google Chrome's 
regular expression engine.


That's here:

http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html

Sadly, it's in Javascript.  Seems like if you need a safe, performant regexp 
implementation, your choice is (a) finish PLv8 and support it on all 
platforms, or (b) add backreferences to RE2 and precompile it to C with 
Comeau (if that's still around), or...


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_test_timing tool for EXPLAIN ANALYZE overhead

2012-02-22 Thread Jay Levitt

Greg Smith wrote:


Anyway, the patch does now includes several examples and a short primer on
PC clock hardware, to help guide what good results look like and why they've
been impossible to obtain in the past.  That's a bit Linux-centric, but the
hardware described covers almost all systems using Intel or AMD processors.
Only difference with most other operating systems is how aggressively they
have adopted newer timer hardware.  At least this gives a way to measure all
of them.


N.B.: Windows has at least two clock APIs, timeGetTime and 
QueryPerformanceCounters (and probably more, these days). They rely on 
different hardware clocks, and can get out of sync with each other; 
meanwhile, QueryPerformanceCounters can get out of sync with itself on 
(older?) multi-CPU boards.


So if you're doing high-res timing, it's good to make sure you aren't 
relying on two different clocks in different places... I ran into this with 
MIDI drivers years ago; and wrote a doc:


http://www.ultimatemetal.com/forum/equipment/315910-midi-latency-problem-nuendo.html#post6315034

and a clock-testing utility:

https://github.com/jaylevitt/miditime


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Pushing restrictions down into GROUP BYs?

2012-09-22 Thread Jay Levitt
A while back, I posted a pathological minimal-case query where, in order 
to select one row from a users table, Postgres needed to scan the whole 
users table, because the restriction was not visible to the GROUP BY.


At the time, Tom wrote:

 Don't hold your breath waiting for that to change.  To do what you're
 wishing for, we'd have to treat the GROUP BY subquery as if it were an
 inner indexscan, and push a join condition into it.  That's not even
 possible today.  It might be possible after I get done with the
 parameterized-path stuff I've been speculating about for a couple of
 years now; but I suspect that even if it is possible, we won't do it
 for subqueries because of the planner-performance hit we'd take from
 repeatedly replanning the same subquery.

http://archives.postgresql.org/pgsql-performance/2011-11/msg00131.php

Given all the work that's been done on parameterized paths and LATERAL, 
is this something to revisit yet?  In 9.3, it's at least possible to 
manually tweak the SRF, thanks to LATERAL. But it'd be great to allow 
set-returning functions to remain blissfully unaware of their 
surroundings. Modular code, Single Responsibility Principle, all that.


I guess a more general question is: Are there cases where the planner 
can *use* LATERAL functionality to push down restrictions like this? (Do 
LATERAL and pushdown conceptually overlap? I think maybe they do.)


Example code below - and before you say but you could just use 'where 
exists', trust me that the original queries were much more involved :)


-


drop schema if exists jaytest cascade;
create schema jaytest;
set search_path to jaytest;

create table questions (
  id int not null primary key,
  user_id int not null
);
insert into questions
  select generate_series(1,1100), (random()*200)::int;

create table users (
  id int not null primary key
);
insert into users select generate_series(1, 200);

vacuum freeze analyze;

create function srf() returns table (user_id int) as $$
  select u.id
  from users as u
  group by u.id
$$ language sql stable;

-- Option 1: Use the set-returning function

explain analyze
select questions.id
from questions
join srf()
on srf.user_id = questions.user_id
where questions.id = 1;

-- Option 2: Use the equivalent of the set-returning function
-- (remove any doubts about function call overhead)

explain analyze
select questions.id
from questions
join (
  select u.id as user_id
  from users as u
  group by u.id
) as srf1
on srf1.user_id = questions.user_id
where questions.id = 1;

-- Option 3: Now that we have LATERAL, manually
-- push the join down into the set-returning function

create function srf_lateral(questions_user_id int) returns table 
(user_id int) as $$

  select u.id
  from users as u
  where u.id = questions_user_id
  group by u.id
$$ language sql stable;

explain analyze
select questions.id
from questions, lateral srf_lateral(questions.user_id)
where questions.id = 1;

drop schema jaytest cascade;

-

On my machine, options 1 and 2 usually have runtimes of 100-400 ms, and 
option 3 is about 0.04ms.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Lazy hashaggregate when no aggregation is needed

2012-03-30 Thread Jay Levitt

Tom Lane wrote:

Ants Aasmaa...@cybertec.at writes:

A user complained on pgsql-performance that SELECT col FROM table
GROUP BY col LIMIT 2; performs a full table scan. ISTM that it's safe
to return tuples from hash-aggregate as they are found when no
aggregate functions are in use. Attached is a first shot at that.


As I commented in the other thread, the user would be a lot better off
if he'd had an index on the column in question. I'm not sure it's worth
complicating the hashagg logic when an indexscan + groupagg would
address the case better.


Would this patch help in the case where table is actually a set-returning 
function, and thus can't have an index? (I don't yet know enough about the 
tree to know when hashaggs get used). I'm wondering if this is a useful 
exception to the restrictions can't get pushed down through GROUP BYs rule.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Switching to Homebrew as recommended Mac install?

2012-04-01 Thread Jay Levitt
The Mac installation docs currently recommend the EDB one-click installer as 
the first choice. While this does install pgadmin and some other 
refinements, it also is fairly confusing to troubleshoot:


- By default, it installs to /Library/PostgreSQL, which is also (I think) 
where the Apple-supplied Lion install is


- The uninstaller is hidden in /Library/PostgreSQL, which (since Finder 
hides /Library by default) you're likely to go to via Terminal. But the 
uninstaller is a Mac app, so even if you find it you have to know to use 
open to run it, because Mac apps are really directories that the Finder 
abstracts away from you.


- The EDB docs are written against 8.4.

- There are, as @lluad points out, no fewer than eight ways to install 
Postgres on a Mac (fink, macports, homebrew, Lion default, build from 
source, EDB, and two other binary installers)


- We have few Mac experts hanging out in #postgresql.

- We just had two folks within an hour, BOTH with conflicting installs of 
Postgres.


So with all respect and thanks to EDB for maintaining those installers, I'd 
like to propose that homebrew become the recommended install method on Mac, 
and I will update the Mac formula to overcome any current objections. The 
nice thing about homebrew is that (a) formulas can contain arbitrary Ruby 
and command-line options, so we can easily deal with things like detecting 
existing installs, handling shared memory, etc. if we want to, and (b) pull 
requests are accepted freely and frequently, so it can always be the 
current, security-patched version.


What do folks think of this idea?  When I mention homebrew in #postgresql, 
there's always an ick, but I believe that's true of any package manager 
(and possibly any Mac anything, because we're all Rails-loving ORM-using 
SQL-not-understanding fanbois, and I say that with love.)


The current homebrew installer is already 9.1.3, and does a make-world, so 
you get all of contrib built.


POSSIBLE OBJECTIONS/PREREQUISITES

1. homebrew installs everything under /usr/local and makes that 
user-writeable.  Sorry.  It does because most Mac users don't know how to 
edit PATH for GUI apps (it's in a .plist in a hidden directory in your home 
dir), and /usr/local is already in PATH by default.


2. The current formula installs Postgres as the desktop user, not as the 
_postgres role account.


I'm personally of the strong opinion that user-to-user privilege escalation 
attacks are NOT an issue on desktops; all important files are already owned 
by the desktop user. The attack vector is *maybe* root escalation attacks, 
but if you want root, it's so common for installers to ask permission that 
your malware could just ask.  The real attack vector is I'm in your 
browser, and that has nothing to do with root, permissions, or users at all.


Meanwhile, the EDB installer by default installs both app and data to a 
directory that requires root - so I assume it runs as root too - and 
nobody's complained.


However, if this is a sticking point, I'd have no problem adding a --user 
option that would default to _postgres (underscore-prefixed usernames are 
the Apple standard).


3. The current formula (TCF) spits out instructions telling you how to 
initdb, but they're easy to overlook.  I'm happy to add an option if 
necessary, and might do it anyway.


4. TCF also spits out instructions for adding Postgres to launchctl (Mac's 
version of /etc/init.d or Windows Services), rather than doing it for you, 
but again, I'd happily add the option.  (I'm checking with Homebrew folks to 
see if there's some dictum against that; it's a common pattern to put 
launchctl in the instructions, but IME the usability is poor.)


5. TCF doesn't update your shared memory settings. Again, happy to add that.

6. TCF doesn't look for existing installs. This is especially a problem on 
Lion, since Apple bundles PG 8.4, either client or server IIUC, and although 
/usr/local/bin is in your PATH, it comes *after* /usr/bin (grumble), so 
you'll either have the homebrew server fail to launch (since port 5432 is in 
use), or you'll have an outdated client version.


In IRC, both users had actually installed the EDB version months ago and 
forgotten about it, but over time, Lion users will grow, since all new Macs 
come with only Lion.  There are several ways to address this; my preference 
is to have homebrew warn about existing installs but take care of any magic 
to make them go away, a la 
http://nextmarvel.net/blog/2011/09/brew-install-postgresql-on-os-x-lion/.


7. There's no homebrew formula for pgadmin. I've never built it, and might 
be able to add that, but probably not right away.


8. There might be other popular things that EDB's StackBuilder does.

9. EDB is an important contributor to the PG core community, and maybe the 
link juice/publicity is politically important.  Lemme know.


That's all I can think of... thoughts? Objections? Which do you think are 
prerequisites?


Jay Levitt

Re: [HACKERS] Switching to Homebrew as recommended Mac install?

2012-04-01 Thread Jay Levitt

Jay Levitt wrote:

POSSIBLE OBJECTIONS/PREREQUISITES


10. There is no homebrew support for multiple versions, and no current plans 
to add it (though it's on the wishlist). This means homebrew is only useful 
if I want to install a PostgreSQL thingie is the common Mac use case. If 
people often need to use specific older versions, to mirror their server 
configs, it's a problem.  It *might* be possible to hack this into our 
formula, but I'm not sure it's either doable or acceptable.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switching to Homebrew as recommended Mac install?

2012-04-01 Thread Jay Levitt

Dave Page wrote:
 It seems to me that most of your arguments against the installers are
 based on incorrect understanding or information, and most of your
 arguments for Homebrew actually come across as arguments against!

You're right about the former - and as to the latter, they *were* arguments 
against (potential objections).  I try to pre-argue against my own 
proposals to save everyone time; if I can still prevail, I must have a damn 
good idea :)


At this point I agree with you, but I'm still going to go into detail, 
because I think there are two markets for Postgres, and the database 
community has been so focused around enterprise for so long that you're 
missing opportunities with web startups. I'd love to help bridge the gap, 
having jumped straight from big-iron PL/I to ooh-Ruby-is-shiny. And web 
startups develop on Mac laptops. They just do. So if it helps you to imagine 
me as a 20something I'm a Mac hipster, working on some hot Facebook/mobile 
app with funding from Spark Capital, do that. Lord knows it helps me.


 - We have few Mac experts hanging out in #postgresql.
 Not sure how this is relevant to the proposal.

The impetus for the idea was that there seems to be a steady stream of 
novice PG users on Mac who come into #postgresql with installation problems, 
which is bad enough as an out-of-box experience - but worse is that there 
are rarely folks around who can help. (Of course, I'm extrapolating; every 
time *I'm* in IRC and see this, there's someone who can help. But you know 
what I mean.)


And (although my proposal started with documentation) I'm of the firm 
opinion that there's no such thing as a documentation error; a user 
problem is a software problem.  Humans will click buttons before they'll 
read, developers are humans, and no amount of RTFM will ever fix that. If we 
can make installers smarter, that's way better than troubleshooting guides, 
IRC, mailing lists, etc. So that's where I was coming from.


I didn't realize that you were actively maintaining the EDB installer (see 
below for the 8.4 doc explanation); obviously, if you can improve that, it's 
the best solution and we should, if anything, recommend it MORE vigorously. 
Still, there's a growing community of developers who expect brew install 
to work, and I do want to fix it for them.  The EDB installer will always be 
a one-off experience; most of the other servers you install will be through 
a package manager, and homebrew's popularity (despite its youth) is 
impressive.  Both of my n=2 data points had run across PG a while back, 
installed it with the one-click to try it out, forgotten about it, done 
brew install postgresql today, and naturally ran into problems.


 - By default, it installs to /Library/PostgreSQL, which is also (I think)
 where the Apple-supplied Lion install is
 No, Apple's version is installed in /usr on mine.

Ah hah. I suppose only the Apple .plist is stored under /Library, then. Let 
me amend that to this made everyone in IRC, and probably many other 
non-Mac-expert troubleshooters, assume that this is an Apple-installed 
package. It'd be great for this to go somewhere that feels like Oh, this 
was installed by you; /Library feels kinda weird for a server, though I can 
understand your reasoning. Maybe even /Library/EnterpriseDB/PostgreSQL to 
make it obvious?


 - The uninstaller is hidden in /Library/PostgreSQL, which (since Finder
 hides /Library by default) you're likely to go to via Terminal. But the
 uninstaller is a Mac app, so even if you find it you have to know to use
 open to run it, because Mac apps are really directories that the Finder
 abstracts away from you.
 Yes.

How about a one-liner bash script uninstall-postgresql that does nothing 
but open uninstall-postgresql.app?


 - The EDB docs are written against 8.4.
 Only if you install 8.4. If you install 8.3 you get the 8.3 docs, 9.0
 the 9.0 docs and so on.

No, I meant on the web:

http://www.enterprisedb.com/resources-community/pginst-guide

That's what made me assume that the installer wasn't maintained (except as 
to repackaging new PG versions, obviously). It's obviously not hard to 
replace 8.3 with 9.1 when you read it, but it still leaves an impression 
akin to This web site works best with IE7 and above. Allow me to now 
replace most of this thread with hey, you might wanna update that page.


 - There are eight ways to install Postgres on a Mac
 That isn't any more of a reason to discount the EDB installer than any other.

Nope, just an argument that the recommended installer should handle that nicely.

 - We just had two folks within an hour, BOTH with conflicting installs of
 Postgres.
 Not sure how that is relevant either. You can have conflicting
 installation using any of the installation methods, including a
 home-built source tree.

Right, but I suspect this is a common problem - not only have I seen it in 
IRC but 3 or 4 times in my 12-person startup, which is kinda amazing given 
that we've 

Re: [HACKERS] Switching to Homebrew as recommended Mac install?

2012-04-01 Thread Jay Levitt

Tom Lane wrote:

While you might not like the EDB installer, at least those
folks are active in the lists and accountable for whatever problems
their code has.  Who in heck is responsible for the homebrew
packaging, and do they answer questions in the PG lists?


Just for general knowledge... Who's responsible is whoever wants to be; 
homebrew is open source, and with a github-based workflow, it's trivial for 
them to accept pull requests.  On the 1967 formulas (packages) in the repo, 
there have been 1759 contributors.  I was volunteering to be the maintainer 
and liaison if we did this; I'm pretty good at email and IRC.


It's actually pretty clever and elegant - homebrew itself uses git and 
github for formula updates and distribution, and is written in ruby1.8 which 
ships with all Macs. /usr/local is a git repo, brew update is mostly git 
pull, and brew search checks for new pull requests if it doesn't find a 
matching formula. The whole thing's all of 1500 lines of code, and you saw 
what formulas look like.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switching to Homebrew as recommended Mac install?

2012-04-02 Thread Jay Levitt

Dave Page wrote:

On Mon, Apr 2, 2012 at 12:29 AM, Jay Levittjay.lev...@gmail.com  wrote:
Just as an FYI, a large percentage of the PostgreSQL developers are
Mac users, including myself. They're also the company standard at
EnterpriseDB - so we're not entirely unfamiliar with software
development on them.


Good to know; now I realize that even on Macs there are two worlds. I think 
it's important to understand the new world, because although very little 
infrastructure spans two generations, I believe that PostgreSQL is uniquely 
positioned to do exactly that, just as POSIX has. It's a performant ACID 
relational data store for the old guard, but it's also a web-scale 
cloud-friendly polyglot-extensible preinstalled 
XML/JSON/hstore/tree/graph/schemaless store for the new, and like Java 
HotSpot, it gets smarter and faster every day. It is an operating system for 
your data, it's free, and Oracle hasn't bought it (except yesterday).


This goes back to the marketing challenge, not technical challenge threads.


- We have few Mac experts hanging out in #postgresql.

Not sure how this is relevant to the proposal.

The impetus for the idea was that there seems to be a steady stream of
novice PG users on Mac who come into #postgresql with installation problems,

If you see someone report a bug with the installers, please have them
report it on the EnterpriseDB forums:
http://forums.enterprisedb.com/forums/show/9.page


It's not a bug with the EDB installer, except insofar as the bug is the EDB 
installer is not prominent and widely recommended enough and so they used 
homebrew or macports. The EDB installer is what they used once upon a time; 
today they are following a blog post that walked them through homebrew 
(probably as one line of a longer process involving memcached, sphinx, 
nginx, node.js, and seven other things).



The EDB installer will always be
a one-off experience; most of the other servers you install will be through
a package manager, and homebrew's popularity (despite its youth) is
impressive.


I would disagree with that. Most users I know do not use things like
homebrew (particularly those coming from Windows who have no
familiarity with such package managers at all).


Yep; two worlds.

 [docs reference 8.4]

That hasn't been updated because the installation steps haven't
changed and I'd rather spend time writing software than updating
screenshots.


Fair. I wonder if there's an easy way to automate the generation and 
insertion of those screen shots. You could always write an 
AppleScript/Automator/Keyboard Maestro script, but I feel like I've seen a 
better way specifically for tech docs; I'll keep an eye out.



A couple of points of note:

- The introduction says: This document is based on the 8.4.x
one-click installers but applies equally to later versions.

- The doc also explains where to find the uninstaller.


Sure, and if humans read docs, instead of just glancing at them, that'd be 
all you needed. In any case, I could counter myself that nobody reads the 
doc period, so it doesn't matter what version is listed; that's just the 
source of my own misunderstanding about maintenance.



- There are eight ways to install Postgres on a Mac

That isn't any more of a reason to discount the EDB installer than any
other.

Nope, just an argument that the recommended installer should handle that
nicely.


It does. It'll detect that the port is in use and suggest a different
one. I don't know of any other of those installation methods that'll
do that.


That's terrific, and you're right - it's the only one.

 2. The current formula installs Postgres as the desktop user, not as the
 _postgres role account.
 That's not very helpful on shared machines - and whilst it may be fine
 for developers etc, it's not the recommended way to setup PostgreSQL
 for any kind of production use.
 Oh, of course. Homebrew does not target the three remaining people who run
 production XServes.  It's purely for Mac developer workstations.  At
 startups. Which are MacBooks! :)
 Production doesn't necessarily mean server. All those thousands of
 Poker Tracker users that run with PostgreSQL on Windows on their home
 machines are production users for example.

Excellent point, and one I hadn't thought of.


1. Rubyists in general are sick of sudo'ing on their laptops, because It
Doesn't Matter (as I'll fail to argue later).  Homebrew puts itself into a
directory that is user-writable so it does not require sudo for basic
installation.


Nice. You just turned me off ever wanting anything related to Ruby on
my Mac either!


Well, for Rubyists, I should properly substitute younger developers, aka 
all developers eventually.


As more toolchains work without sudo (Ruby, Python, JavaScript), using sudo 
actually breaks stuff, because now you've written some files as root and 
you'll later fail to overwrite them as user. Heck, I've had that happen with 
ccache; if you do sudo make install without make, you're 

Re: [HACKERS] Switching to Homebrew as recommended Mac install? / apology

2012-04-02 Thread Jay Levitt

David Johnston wrote:
 Just trying to bridge an apparent gap since the original e-mail seems to
 have come across as too adversarial that the underlying thoughts have
 been overlooked.  Trying to contribute in my own way with my current
 resources.

Thanks, but it's my own fault for basing a half-baked let's rewrite 
everything idea on a few wrong assumptions without asking some simple 
questions first. (Sorry, David.) And you guys don't know me very well yet, 
so you don't how to interpret my tone, or that I spend the *first* half of 
each day making the exact opposite arguments to all the young turks in the 
hopes we'll all meet in the middle.  I plan to hang around, so by way of 
belated introduction, and you can imagine this in the style of Beetlejuice:


Hi. I wrote AOL's OLTP-style mail server in 1994 and scaled it (with an 
eventual team of 25) from 27 to 4000 TPS on minicomputers as powerful as an 
iPhone. It had multi-site replication, zero-downtime installs (without 
blocking writes), and served 1.5 million simultaneous users. I had to write 
nearly a whole SVM-based OS in the app, because nothing at the time - from 
the message bus to the disk cache to poll() itself - could handle our load, 
and our growth outpaced the hardware manufacturers' ability to build it. I 
did this with no CS background beyond public school (I started as a 6502 
hacker), no search engine, and no access to research papers or computer 
scientists. I have no idea how.


The architecture survived five underlying infrastructure transitions 
(Stratus filesystem, Sybase, Informix, Tandem, and the move from four- to 
eight-byte IDs that PG has successfully staved off) while migrating live. It 
never lost or misrouted a message, and never had a security breach in the 
nine years I led it (or, AFAIK, since), despite the fact that we were a big 
enough target for hackers to respond to our changed defenses within hours. I 
do actually know this stuff, or did.


I spent 10 years taking a break, studying music, successfully sleeping 
through Java; now I'm back in technology, leading development in a 
Rails/web/JS startup, and luxuriating in the ability to actually store data 
in a database this time - because you guys have spent the past 20 years 
studying and improving the hard parts of abstracting performant, reliable, 
scalable data storage.


I have a tendency to see the possible endgame and insist on starting now, 
and if I get too idealistic, ambitious, and MongoDB is webscale, please 
just drag me into a conversation about lock contention and distributed 2PC 
and I'll forget the whole thing. But I really do think PG can be the 
makes-everything-possible, does-most-things-well data store - really, data 
platform - for the next decade or two, and I want to contribute.


I'm provocative, playful and grandiose, I apologize except not really, and 
it's because in my n=1 experience, the way life works is (a) you decide to 
change the world and then (b) you do.


 You do not need permission to contribute to the community
 in the way you seek so what is it that you are really asking for?

Nothing at this point. I was thinking out loud, and at the time was 
temporarily insa^h^h^hconvinced that the homebrew formula should be the 
community standard, and thus that I'd have to bring it up to some level of 
acceptability/review.  I've contributed to the formula in the past, and will 
continue to do so based on the thoughts everyone's shared here. It doesn't 
need to be official to be useful, and as David Page said, it's not gonna be 
listed in the docs no matter what, given the one decision that homebrew 
makes (/usr/local) that I can't override.



When brew is replaced by something more popular do you
think you will continue to maintain the recipe or is it going to end
up stuck showing us how to install version 9.3 or earlier.


Like anything, I'll maintain it until it becomes useless to me or vice 
versa, and someone will pick it up or they won't.  But just to be clear, 
Homebrew's a source-based repo (so there's no cross-compiler issues), 
pulling from the upstream source repository, using only the stock compiler 
toolchain, Intel-only, on a platform where the only hardware manufacturer 
has themselves severely constrained the number of possible configurations. 
For the most part, updating the formula to package new versions is a 
matter of changing the following two lines:


  url 'http://ftp.postgresql.org/pub/source/v9.1.3/postgresql-9.1.3.tar.bz2'
  md5 '641e1915f7ebfdc9f138e4c55b6aec0e'

Unless the instructions for How to build postgres from source change, 
nothing else in the formula *needs* to.  The current formula is fairly 
simple; aside from user instructions, the code is 75 lines and mostly 
consists of default arguments to ./configure.  (Formula attached for the 
curious.)  Pull requests are freely and quickly accepted after a quick 
review; the homebrew repo is operated more in the fail early and often 
spirit than in 

Re: [HACKERS] Switching to Homebrew as recommended Mac install?

2012-04-03 Thread Jay Levitt

Robert Haas wrote:

On Mon, Apr 2, 2012 at 5:23 AM, Dave Pagedp...@pgadmin.org  wrote:

If homebrew intentionally creates a hole like that, then for as long
as I'm one of the PostgreSQL webmasters it will *never* be listed on
our download pages.


I think that's a bit harsh.  It's not as if the PostgreSQL package
creates the security hole; it's something that the packaging system
does itself, independent of whether or not you try to install
PostgreSQL with it.  So it seems to me that refusing to list it is
making life difficult for people who have already made the decision to
use brew, without any compensating advantage.


In fairness to Dave, I think he was still reacting to my initial proposal 
that homebrew be the *recommended* install. In that case, people might 
install homebrew specifically because postgresql.org recommended it.


If the consensus is that /usr/local/* user-writable is insecure, I certainly 
wouldn't object to a little note that said If you're using homebrew, do 
'brew install postgresql', but we don't recommend homebrew for security 
reasons; a little pressure might provide the impetus for homebrew to allow a 
better way.


That said, about 8 months ago Homebrew's defaults changed. It no longer 
requires /usr/local to be directly writable; it will sudo if necessary 
during the initial installation to create its subdirectories.  Those 
directories are mostly user-writable, though:


% ls -l /usr/local
total 8
drwxr-xr-x   37 jay   admin   1.2K Mar 31 16:39 Cellar/
drwxr-xr-x7 jay   admin   238B Feb 29 10:51 Library/
-rw-r--r--1 jay   admin   789B Feb 29 10:57 README.md
drwxr-xr-x  499 jay   admin17K Apr  1 15:29 bin/
drwxr-xr-x9 jay   admin   306B Mar  7 16:23 etc/
drwxr-xr-x   69 jay   admin   2.3K Mar 16 16:48 include/
drwxr-xr-x  178 jay   admin   5.9K Mar 16 16:48 lib/
drwxr-xr-x3 root  admin   102B Mar 14 13:20 man/
drwxr-xr-x   20 jay   admin   680B Mar 31 16:40 share/
drwx--3 jay   admin   102B Feb 29 11:43 var/

At no point was anything in /usr/local *world*-writable, FWIW.


That doesn't mean that I approve of brew's approach to this problem,
though.  Even if you think that it's unimportant to keep the desktop
user from usurping root privileges, having some things installed in
/usr/local as root and others as the desktop user (multiple different
desktop users?) seems like a recipe for chaos.


I think the brew designers expect most folks to either not have anything in 
/usr/local from outside homebrew, to not install anything there as root, to 
understand the security consequences, or to use homebrew as root even though 
it's unsupported, and deal with their own bugs.



I can't help but wonder if this isn't just the natural way a packaging system
evolves - you start with something very simple (like what brew is now)
and then gradually you realize that there are some annoyances, so you
file those down by adding some more complexity, and eventually you end
up with a system that's just as complex as the ones that you
originally thought were too complex.


Packaging systems? I thought that's called all software ever!

brew's lucky in that the Mac by definition is not a heterogeneous 
environment, and so Mac users don't expect it to be. (Last I checked, it's 
either difficult, impossible or unsupported to boot from a volume other than 
your filesystem root.) By being mostly source-fetch-only, sitting on top of 
git, and not packaging either system-provided features (many of which would 
require root) or repackaging gems/eggs/nodes, they've avoided a lot of the 
hard problems.  But sure, it's only two years old and it will get more 
complex over time.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Switching to Homebrew as recommended Mac install?

2012-04-04 Thread Jay Levitt

Dave Page wrote:

Exactly - which is why I was objecting to recommending a distribution
of PostgreSQL that came in a packaging system that we were told
changed /usr/local to be world writeable to avoid the use/annoyance of
the standard security measures on the platform.


We... that's not exactly what happened:

I originally wrote:

POSSIBLE OBJECTIONS/PREREQUISITES

1. homebrew installs everything under /usr/local and makes that user-writeable.


So. :)

Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Last gasp

2012-04-12 Thread Jay Levitt

Alvaro Herrera wrote:

Now what would be sort of neat is if we had a way to keep all the
versions of patch X plus author and reviewer information, links to
reviews and discussion, etc. in some sort of centralized place.


FWIW: y'all might have discussed to death during the git migration, so 
*please* do not let me derail you if so...


github does a great job of exactly this.  You open an issue, you reference 
it from commits, all the related commits are listed in (and browseable from) 
the issue, you can comment on specific lines of the commit, it integrates 
w/email, it has an API to write tools (both workflow and archival) against, 
etc.


Rather than extend the CF app into a trivial-patch workflow app, it might be 
worth looking at integrating it with github.


Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Last gasp

2012-04-14 Thread Jay Levitt

Christopher Browne wrote:

On Thu, Apr 12, 2012 at 6:11 PM, Jay Levittjay.lev...@gmail.com  wrote:

Rather than extend the CF app into a trivial-patch workflow app, it might be
worth looking at integrating it with github.


There's a reluctance to require a proprietary component that could
disappear on us without notice.


Excellent point. I was thinking that GitHub's API would allow archival 
exporting to counter that, along the lines of let's take advantage of it 
for the next five years until it goes south, and THEN we could write our 
own. But I can see how that might not be the best choice for a project that 
expects to preserve history for a few decades.


GitHub does offer an enterprise version that you can self-host, but it 
seems to be priced per-user and intended for solely intranet use.


If the feature set is desirable, though, I wonder if Postgres is big/high 
profile enough for them to figure out some sort of better arrangement. They 
*love* it when big open-source projects use GitHub as their public repo - 
they'll email and blog announcements about it - and if there's interest I'd 
be happy to open a conversation with them.



The existence of git itself is a result of *exactly* that
circumstance, as Linux kernel developers had gotten dependent on
BitKeeper, whereupon the owner decided to take his toys home, at which
point they were left bereft of their SCM tool.
http://kerneltrap.org/node/4966


Good history lesson there, with a great outcome.


I expect that it would be more worthwhile to look into enhancements to
git workflow such ashttp://code.google.com/p/gerrit/  Gerrit.  I
don't know that Gerrit is THE answer, but there are certainly projects
that have found it of value, and it doesn't have the oops, it's
proprietary problem.


I've looked at it in conjunction with Jenkins CI; it looked nice but was way 
too heavy-weight for a four-person startup (what's code review?). It's 
probably much more suitable for this sized project.


Gerrit's a full-featured code review app with a tolerable UI; I was thinking 
of GitHub more as a great lightweight UI for doc patches and other trivial 
patches where you might have lots of casual review and comments but no need 
for, say, recording regression tests against each patch version.  e.g.:


https://github.com/rails/rails/pull/5730

Also, for doc patches, GitHub has the great advantage of in-place editing 
right from the web UI.


Peter mentioned the desire to bring more eyes and hands onto these type of 
patches - I think the phrase was enthusiast power users who wouldn't really 
consider themselves hackers. The advantage of GitHub here would be its 
(current) widespread adoption; the perceived barrier to entry is far lower 
and the workflow is far more obvious than a mailing list, formatting patches 
by email, not quite knowing what the process is.


I mention all this not to try to push GitHub specifically, but to say these 
are the types of features that modern, open-source collaborative systems 
offer, and could improve community involvement.


From a human- rather than technology-oriented perspective: I was shocked to 
find that you folks *WANT* reviews from non-contributors. It was my 
assumption as a newcomer that if I don't feel well-versed enough to submit 
patches yet, the last thing you'd want me to do was to look over someone 
else's patch and say Yeah, that looks good, any more than I care if my mom 
thinks my latest web app is very nice.


I see now that the Reviewing a Patch wiki page explains this, but maybe 
this info should be pushed higher into the docs and web site; a How can I 
contribute page, open calls for reviewers on the non-hackers mailing lists, 
things like that.  Or maybe just make the wiki page bright red and blink a lot.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Last gasp

2012-04-14 Thread Jay Levitt

Alex wrote:

I didn't follow this whole thread, but have we considered Redmine[1]?


As the resident Ruby is shiny, let's do everything in Rails on my MacBook 
guy, I'd like to make a statement against interest: I've tried Redmine a few 
times and it's been painful.  Much of the codebase is deprecated, it's slow, 
it has no meaningful search (in 2012?!), I've seen wiki edits disappear, and 
at the moment pulling up its own FAQ page at redmine.org times out.


Maybe you've had better luck with it, but whenever I've Googled for Redmine 
questions, the collective Internet has sighed and said Yeah, it was a 
really good idea, though.


Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Last gasp

2012-04-16 Thread Jay Levitt

Alex wrote:

Jay Levittjay.lev...@gmail.com  writes:


Alex wrote:

I didn't follow this whole thread, but have we considered Redmine[1]?

As the resident Ruby is shiny, let's do everything in Rails on my
MacBook guy, I'd like to make a statement against interest: I've
tried Redmine a few times and it's been painful.  Much of the codebase
is deprecated, it's slow, it has no meaningful search (in 2012?!),
I've seen wiki edits disappear, and at the moment pulling up its own
FAQ page at redmine.org times out.


Yay, that's totally FUD to me.


You're right, it was. My bad. Someday I will find the balance between 
precision and concision.



Could you please elaborate a bit on your points?

Deprecated codebase?  Let me guess...

It runs on an outdated version of Rails (2.3) but only because Rails is
changing so rapidly, I believe.  There is work in progress[1] to move to
the supported branch Rails-3.x.


I wasn't even thinking of that; I know many production systems still run on 
Rails 2.3, and in fact it probably even performs better for some workloads. 
3.x is a mixed bag. I don't hold that against Redmine.


But it's still FUD, because I can't remember where I saw this information. 
So: withdrawn.




Slow?  Do you have any data to back this point up?


No measurable data; just a sigh of relief when switching from Redmine to 
Github - and GitHub ain't a speed demon. In general, I've seen multi-second 
page load times on crazy-simple things like wiki edits; this was on a hosted 
provider (sourcerepo.com), but they also hosted our git repo and we had no 
speed problems there.



No meaningful search, eh?  Works for me.


Redmine searches return partial-word matches, and there's no way to disable 
that.  Searching for test finds latest. To me, that's broken.


Also, the UI is very 5 years ago; e.g., compare revisions uses the same 
columns-of-radio-buttons approach as MediaWiki. If the goal is a tool to 
reduce friction and increase involvement, you want a smoother UX.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug tracker tool we need

2012-04-16 Thread Jay Levitt

Alex wrote:

I still fail to see how Redmine doesn't fit into requirements summarized
at that wiki page[1], so that must be something other than formal
requirement of being free/open software and running postgres behind
(some sort of feeling maybe?)


Well, if those requirements are in fact requirements, Redmine could suit the 
purpose (perhaps with some custom extensions).  But yes, Redmine feels 
wrong to me, though I have never been particularly happy with *any* 
self-hosted bug tracker.


Of those I've used - Redmine, Mantis, JIRA, Trac, Bugzilla, GForge, RT - 
Redmine feels the least-worst to me; it's about equal to Trac, and it's in 
Ruby so I could theoretically(!) improve it.


I think the biggest missing pieces in Redmine aside from custom CF stuff 
are: better search, single sign-on (it requires Yet Another Login), a better 
UX (AJAX, syntax highlighting) and better git integration (a la pull 
requests, where private git commits = patches). Those are some pretty big 
pieces. I don't think Redmine out-of-the-box would improve either CFs or 
community involvement.



Jay, Alvaro, Dimitri (and whoever else wants to speak up) could you
please describe your ideal tool for the task?


My opinion isn't all that important, since I currently have an infinite 
opinion-to-contribution ratio, but in my unicorniverse: We'd accept that 
open source hasn't always produced great UX, we'd use GitHub's issue 
tracker, allow volunteers to do bug wrangling triage via tags, use GitHub 
hooks to integrate with the existing CF app, and write archiving tools that 
would let us easily export everything off of GitHub for when (a) something 
better comes along or (b) GitHub pops out of existence or adds egregious 
licensing terms like BitKeeper.


Reasons:

- Familiarity: Many developers already have a GitHub account and use it
- Discoverability: GitHub has great SEO
- Tight integration of git with patch and issue management (pull requests, 
fork networks, etc); eliminates ceremony rather than adding to it

- Readable UI with syntax highlighting, etc
- Patch commenting and git integration encourage actual review-resubmit 
cycles instead of Here, look, I fixed it for you reviews

- Two-way email/web integration
- Meets Tom's would be sort of neat criteria[1]
- Could easily implement Simon's pony criteria[2] with tags and API
- Easily extensible with API and hooks
- Subjectively: Its design encourages better community and core interactions 
than any I've seen in 25 years.


GitHub could well be a non-starter, but if third-party-dependence is really 
the holdup, I'd volunteer to write the tools - in fact, a google of [export 
issues from github] shows a few that might already suffice.



Given that every other existing tool likely have pissed off someone
already, I guess our best bet is writing one from scratch.


ISTR there's a great Writing your own bug tracker is an anti-pattern blog, 
but I can't find it anymore.



Or maybe there isn't really a need for a tracker?  The core team have
managed to live without one for so long after all...


As an end-user, I've reported exactly one bug in a release version of 
Postgres, and it was fixed (and back-ported!) the next day.  So I really 
can't complain about the tracking of actual bugs.


Sounds like we do need something better for CF/patch workflow, tho.

Jay

[1] Tom wrote:


Now what would be sort of neat is if we had a way to keep all the versions
of patch X plus author and reviewer information, links to reviews and
discussion, etc. in some sort of centralized place


[2] Simon wrote:


My I-Want-a-Pony idea is some kind of rating system that allows us all
to judge patches in terms of importance/popularity, complexity and
maturity. I guess a Balanced Scorecard for the development process. So
we can all see whats going on.





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.3 Pre-proposal: Range Merge Join

2012-04-16 Thread Jay Levitt

Simon Riggs wrote:

I'd like to see something along the lines of demand-created optional
indexes, that we reclaim space/maintenance overhead on according to
some cache management scheme. More space you have, the more of the
important ones hang around. The rough same idea applies to
materialised views.


+10; this sort of demand-driven optimization could be the database 
equivalent of Java's HotSpot (which accomplished the amazing task of making 
Java kinda fastish sometimes).


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug tracker tool we need

2012-04-17 Thread Jay Levitt

Magnus Hagander wrote:

On Mon, Apr 16, 2012 at 23:48, Jay Levittjay.lev...@gmail.com  wrote:

- Familiarity: Many developers already have a GitHub account and use it

Most of the more senior developers don't use github. Other than
possibly as a place to store a plain git repository. So that's not
really relevant.


I meant outside developers - the folks you'd like to see more involved in 
the process.



- Patch commenting and git integration encourage actual review-resubmit
cycles instead of Here, look, I fixed it for you reviews


The amount of spam coming through that system, and the
inability/unwillingness of github to even care about it is a killer
argument *against* github.

We have working antispam for email. The github antispam is somewhere
around where email antispam was in 1994.


Interesting; I haven't run into this but you're the second person to mention 
it here.  Antispam is (in the large) a technically unsolvable problem; even 
in the '90s, we'd see hackers start poking at our newest countermeasures 
within the hour.  GitHub is a giant target, and PG probably benefits here 
from NOT being one. (A quick Google shows redmine and especially Trac having 
spam issues of their own.)


Pedantic note/fun fact: There was no email antispam in 1994; Canter  Siegel 
posted their infamous USENET Green Card spam that year, but it didn't really 
spread to email for another year or two. Once it did, there were fervent 
debates about whether it should be called velveeta to distinguish from the 
USENET variety.



GitHub could well be a non-starter, but if third-party-dependence is really
the holdup, I'd volunteer to write the tools - in fact, a google of [export
issues from github] shows a few that might already suffice.


It *is* a non-starter, because (a) it's a third party dependency, and
(b) AFAIK they don't provide *data access* to the issue trackers.


Sure they do:

http://developer.github.com/v3/issues/

Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug tracker tool we need

2012-04-17 Thread Jay Levitt

Alex Shulgin wrote:

Jay Levittjay.lev...@gmail.com  writes:

(A quick Google shows redmine and especially Trac having spam issues
of their own.)


Ugh, redmine (or trac for that matters) has nothing to with handling
spam.  I believe a typical bug tracker doesn't handle spam itself, it
lets the mailing system do that.

Surely you can throw in some captcha plugins to try to reduce the spam
posted from the web UI.


Maybe I'm confused - Magnus et al, are we talking spammy issues/issue 
comments/etc, or are we talking more about exposed email addresses?


I assumed we meant spammy issues, like blog comments - spammers post issues 
and comments with links, to get PageRank to their sites.  Email defenses 
wouldn't help here.


Captchas are fairly pointless nowadays, assuming you have someone dedicated 
enough to write a spambot against your bug tracker.  Most of them (even 
reCAPTCHA!) can be 80% defeated by software - many 99% - and there are 
millions of humans hanging out on Mechanical Turk who'll solve them for you 
100%.  Modern anti-spam ends up being a machine learning and systems 
exercise.. but that's another mailing list :) I think Google gets more use 
out of reCAPTCHA for OCR tweaking than for anti-spam.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug tracker tool we need

2012-04-17 Thread Jay Levitt

Greg Smith wrote:

On 04/17/2012 09:20 AM, Jay Levitt wrote:

Antispam is (in the large) a technically unsolvable
problem; even in the '90s, we'd see hackers start poking at our newest
countermeasures within the hour. GitHub is a giant target, and PG
probably benefits here from NOT being one.

Everyone who deals with list moderation and spam issues around PostgreSQL
just got a belly laugh from that comment. Hint: the PostgreSQL lists had
already been around and therefore were being targeted by spammers for over
ten years before GitHub even existed.


Hehe.  OK, we will have to battle this out over drinks if I ever make it to 
PGCon.. but teaser: I've bankrupted Sanford Wallace and taught the DOJ what 
spam was.



Pedantic note/fun fact: There was no email antispam in 1994

I like it when Magnus really gets the details perfect when making a deadpan
joke.


Dammit.  I *fail*.


Anyway, back to serious talk, I believe GitHub is a dead end here because
the primary key as it were for issues is a repo. A bug tracker for
PostgreSQL would need to have issues broken down per branch and include
information similar to the release notes for each minor point release.
Tracking when and how a bug is backported to older versions is one hard part
of the problem here.


That's a great point. Both GitHub and git itself have no real concept of 
releases, and can't tell you when a commit made it in.


Although.. there's some sort of new release-note functionality. Maybe I'll 
play and see if it'd be applicable here.


Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2 release notes, beta time?

2012-04-29 Thread Jay Levitt

Magnus Hagander wrote:

2012/4/28 Josh Berkusj...@agliodbs.com:

Ugh.  Maybe the whole idea of getting a beta out before PGCon is doomed.
Still, if we don't try for this schedule, we're looking at at least two
more weeks' slip, because we're surely not going to wrap during PGCon.

We could do it in person!


We could have Berkus handing out CDs!


Or, if you guys want to mail out CDs to the whole US, I know some people.

Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Request to add options to tools/git_changelog

2012-04-29 Thread Jay Levitt

Bruce Momjian wrote:

I am again requesting the addition of options to tools/git_changelog so
I can more easily produce the release notes.  I asked for this during
9.1 development and it was rejected.  I am currently using my own
custom version of the tool, but have to merge community improvements
into the tool every year before I use it.


FYI in the general case of I have to maintain a patch set: Now that PG is 
on git, there's a tool called Stacked Git that lets you use git's excellent 
merge capabilities to maintain patches.


http://www.procode.org/stgit/

Jay

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers