Text search selectivity improvements (was Re: [HACKERS] Google Summer of Code 2008)

2008-03-18 Thread Jan Urbański
OK, here's a more detailed description of the FTS selectivity improvement idea: === Write a typanalyze function for column type tsvector The function would go through the tuples returned by the BlockSampler and compute the number of times each distinct lexeme appears inside the

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Jan Urbański
Oleg Bartunov wrote: Jan, the problem is known and well requested. From your promotion it's not clear what's an idea ? Tom Lane wrote: Jan Urbański writes: 2. Implement better selectivity estimates for FTS. OK, after reading through some of the

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Oleg Bartunov
On Sat, 8 Mar 2008, Jan Urbański wrote: Oleg Bartunov wrote: Jan, the problem is known and well requested. From your promotion it's not clear what's an idea ? Tom Lane wrote: Jan Urbański writes: 2. Implement better selectivity estimates for FTS. OK,

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Tom Lane
Oleg Bartunov writes: On Sat, 8 Mar 2008, Jan Urbański wrote: I have a feeling that in many cases identifying the top 50 to 300 lexemes would be enough to talk about text search selectivity with a degree of confidence. At least we wouldn't give overly low estimates for

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Jan Urbański
Oleg Bartunov wrote: On Sat, 8 Mar 2008, Jan Urbański wrote: OK, after reading through some of the code the idea is to write a custom typanalyze function for tsvector columns. It could look inside such function already exists, it's ts_stat(). The problem with ts_stat() is its

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Oleg Bartunov
On Sat, 8 Mar 2008, Tom Lane wrote: Oleg Bartunov writes: On Sat, 8 Mar 2008, Jan Urbański wrote: I have a feeling that in many cases identifying the top 50 to 300 lexemes would be enough to talk about text search selectivity with a degree of confidence. At least we wouldn't

Re: [HACKERS] Google Summer of Code 2008

2008-03-08 Thread Oleg Bartunov
On Sat, 8 Mar 2008, Jan Urbański wrote: Unfortunately, selectivity estimation for query is much difficult than just estimate frequency of individual word. Sure, given something like 'cats dogs'::tsquery the frequency of 'cat' and 'dog' won't suffice. But at least it's a starting point and

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Jan Urbański
Tom Lane wrote: Jan Urbański writes: 2. Implement better selectivity estimates for FTS. +1 for that one ... OK, this one might very well be the one that'd be more useful. And I can always reuse the other idea for my thesis, after expanding it a bit.

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Dave Page
On Tue, Mar 4, 2008 at 4:47 PM, Jan Urbański wrote: Tom Lane wrote: Jan Urbański writes: 2. Implement better selectivity estimates for FTS. +1 for that one ... OK, this one might very well be the one that'd be more useful. And

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Josh Berkus
Jan, OK, this one might very well be the one that'd be more useful. Well, you should submit *both* once SoC opens for applications. The mentors will decide which. -- Josh Berkus PostgreSQL @ Sun San Francisco -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Oleg Bartunov
Jan, the problem is known and well requested. From your promotion it's not clear what's an idea ? Oleg On Tue, 4 Mar 2008, Jan Urbański wrote: Tom Lane wrote: Jan Urbański writes: 2. Implement better selectivity estimates for FTS. +1 for that one ...

Re: [HACKERS] Google Summer of Code 2008

2008-03-04 Thread Jan Urbański
Oleg Bartunov wrote: Jan, the problem is known and well requested. From your promotion it's not clear what's an idea ? I guess the first approach could be to populate some more columns in pg_statistic for tables with tsvectors. I see there are some statistics already being gathered

[HACKERS] Google Summer of Code 2008

2008-03-03 Thread Jan Urbański
Hi PostgreSQL! Although this year's GSoC is just starting, I thought getting in touch a bit earlier would only be of benefit. I study Computer Science at the Faculty of Mathematics, Informatics and Mechanics of Warsaw University. I'm currently in my fourth year of studies. Having chosen Databases

Re: [HACKERS] Google Summer of Code 2008

2008-03-03 Thread Tom Lane
Jan Urbański writes: 2. Implement better selectivity estimates for FTS. +1 for that one ... regards, tom lane