Re: [PERFORM] What is the best way to do attribute/values?

2004-08-25 Thread Mark Kirkwood
Josh Berkus wrote:
Mark, Tim,
 

select
	pav1.person_id
from
	person_attributes_vertical pav1
where
	   (pav1.attribute_id = 1
	and pav1.value_id in (2,3))
	or (pav1.attribute_id = 2
	and pav1.value_id in (2,3))
   

Not the same query, sorry.   Daniel's query yields all the person_id's which 
have criteria A AND criteria B.   Yours gives all the person_id's which have 
criteria A OR criteria B.

 

Apologies, not thinking clearly enough there...
Maybe try out intersection :
select
   pav1.person_id
from
   person_attributes_vertical pav1
where
  (pav1.attribute_id = 1
   and pav1.value_id in (2,3))
intersect
select
   pav1.person_id
from
   person_attributes_vertical pav1
where (pav1.attribute_id = 2
   and pav1.value_id in (2,3))
In the advent that is unhelpful, I wonder about simplifying the 
situation and investigating how

select
   pav1.person_id
from
   person_attributes_vertical pav1
where
  pav1.attribute_id = 1
performs, compared to
select
   pav1.person_id
from
   person_attributes_vertical pav1
where
  (pav1.attribute_id = 1
   and pav1.value_id in (2,3))
If the first performs ok and the second does not, It may be possible to 
get better times by doing some horrible re-writes :e.g:

select
   pav1.person_id
from
   person_attributes_vertical pav1
where
  (pav1.attribute_id = 1
   and pav1.value_id||null in (2,3))
etc.
regards
Mark

   

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] What is the best way to do attribute/values?

2004-08-25 Thread Simon Riggs

Two more unusual suggestions:

1. Drop all the indexes and just do sequential scans (full table scans),
aiming as hard as possible to get the whole people table to fit in memory
(size says that should be easy - various ways) - and make sure you're using
8.0 so you have the best cache manager. This will at least give you
consistent performance on whatever attribute values searched on in user
queries. Dropping all the indexes will allow the query to optimize faster,
since it has only one path choice. Work out how many attributes it takes to
reduce the list of candidates to a manageable number, and include only those
factors into the table, effectively vertically partitioning the table,
thereby reducing the volume and increasing scan speed. Then redesign the
user interface so that they see a two-stage process, first stage is top N
common attributes, second stage is to further reduce that down using rarer
attributes, as well as using the output from the first table to index into
the second. Users then won't mind additional wait time.
(Experiment with: Horizontally partition the table into N pieces. Issue N
simultaneous queries to simulate a parallel query. Try N == 2 on your box)

2. You can try forcing the use of a Star Join optimization here:
Concatenate the attribute values into a single column, then index it. This
will be nearly unique. Cluster the table.
Permute the values of the attributes, to give you a list of concatenated
keys that would match, then join that list to the main table, using a join
via the index.
You can do this permutation by using a reference table per attribute, then
doing an unconstrained product join between all of the attribute tables
(avoid any indexes on them) and assembling the candidate keys into a single
temporary table. Then join the temp table to the main people table. This
will only work effectively if people's attributes are selected with some
discrimination, otherwise this optimisation will fail. You'd need to
constrain the user interface to "pick 20 out of the following 100 attribute
values" or some other heuristic to ensure a relatively low count, or use a
LIMIT on the query into the temp table.
This sounds long winded, but is essentially the route the Oracle optimizer
takes in performing a star joinyou clearly know you're Oracle, so look
that up to confirm what I'm saying. (May not work as well if you use a
sub-select on PostgreSQL)

Also, I'd categorise the Age, Height, Weight and Salary attributes and
everything else based upon most common ranges, so it will be just an
equality search on an integer assigned to that category, rather than a >
search. Order by the distance, don't search on it, it'll be quicker since
you'll only need to calculate it for the records that match...even if you do
get a few too many, it would be a shame to avoid somebody because they lived
1 mile outside of the stated radius.

The database sounds < 1 Gb in total logical volume, so 4Gb of RAM should be
easily sufficient.

Best Regards, Simon Riggs

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Daniel
> Ceregatti
> Sent: 19 August 2004 19:03
> To: [EMAIL PROTECTED]
> Subject: [PERFORM] What is the best way to do attribute/values?
>
>
> Hi list,
>
> I have a database with 1M "people" in it. Each person has about 20
> attributes, such as height, weight, eye color, etc. I need to be able to
> search for people based on these attributes. A search can be conducted
> on one attribute, all attributes, or any number in between. How would
> _you_ do this?
>
> I have already attempted to answer this. My attempts are detailed here:
>
> http://sh.nu/email.txt
>
> This is the email I was originally going to send to this list. Since
> it's so large, I decided to link to it instead. If you feel that it
> belongs in a post to the list, let me know, and I'll post again.
>
> I've discussed these attempts with people in #postgresql on
> irc.freenode.net. Agliodbs (I presume you know who this is) was very
> helpful, but in end was at a loss. I find myself in the same postition
> at this time. He suggested I contact this list.
>
> My ultimate goal is performance. This _must_ be fast. And by fast, I
> mean, < 1 second, for every permutation of the number of attributes
> searched for. Flexibility would be a bonus, but at this point I'll
> settle for something that's harder to maintain if I can get the speed
> gain I need.
>
> Thanks,
>
> Daniel Ceregatti
>
> ---(end of broadcast)---
> TIP 4: Don't 'kill -9' the postmaster


---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [PERFORM] What is the best way to do attribute/values?

2004-08-25 Thread Josh Berkus
Mark, Tim,

> select
>   pav1.person_id
> from
>   person_attributes_vertical pav1
> where
>  (pav1.attribute_id = 1
>   and pav1.value_id in (2,3))
>   or (pav1.attribute_id = 2
>   and pav1.value_id in (2,3))

Not the same query, sorry.   Daniel's query yields all the person_id's which 
have criteria A AND criteria B.   Yours gives all the person_id's which have 
criteria A OR criteria B.

> There will be a lot of attributes with the same ID; there will also be a
> lot of attributes with the same value. However, there should be much less
> attributes with a specific combination of (ID/Value). Right now I think it
> will be very hard to determine which field has a better selectivity:
> attribute_id or value_id.

Given that there is already an index on ( attribute_id, value_id ) I don't 
quite see what difference this makes.   Unless you're suggesting this as a 
workaround for the PG Planner's poor use of the index?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] What is the best way to do attribute/values?

2004-08-25 Thread Leeuw van der, Tim
Hi,

On Aug 25, 2004, at 4:22 AM, Mark Kirkwood wrote:

> select
>   pav1.person_id
> from
>   person_attributes_vertical pav1
> where
>  (pav1.attribute_id = 1
>   and pav1.value_id in (2,3))
>   or (pav1.attribute_id = 2
>   and pav1.value_id in (2,3))
>

[...]

Why not combine attribute_id and value_id? Then you have nothing but an OR (or IN).

It should, AFAICS, give you much better selectivity on your indexes:

There will be a lot of attributes with the same ID; there will also be a lot of 
attributes with the same value. However, there should be much less attributes with a 
specific combination of (ID/Value).
Right now I think it will be very hard to determine which field has a better 
selectivity: attribute_id or value_id.


The combined attribute/value field could be an int8 or so, where the upper 4 bytes are 
for attribute_id and the lower 4 bytes for value_id.
Depending on the number of attributes and possible values a smaller datatype and / or 
a different split can be made. A smaller datatype will result in faster access.

What difference does that make?

regards,

--Tim

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] What is the best way to do attribute/values?

2004-08-25 Thread Jeff
On Aug 25, 2004, at 4:22 AM, Mark Kirkwood wrote:
select
pav1.person_id
from
person_attributes_vertical pav1
where
   (pav1.attribute_id = 1
and pav1.value_id in (2,3))
or (pav1.attribute_id = 2
and pav1.value_id in (2,3))
You know..
It may help if you toss in a group by
ie
select pav1.person_id, count(*) from person_attributes_vertical pav1
where (pav1.attribute_id = 1 and pav1.value_id in (2,3)) or ( ... ) or 
(...)
group by pav1.person_id
order by count(*) desc

that should give you the person_id's that matched the most 
criteria
I've used similar things before now that I've thought about it.

If you want an exact match you could put
"having count(*) = $myNumAttributes" in there too.. By definition an 
exact match would match that definition..

it has an added side effect of producing "closest matches" when an 
exact match cannot be found... granted you may not want that for a 
dating site : )

"You asked for a blond female, blue eyes.. but I couldn't find any... 
but I *DID* find a brown haired male with brown eyes! Is that good 
enough?"

--
Jeff Trout <[EMAIL PROTECTED]>
http://www.jefftrout.com/
http://www.stuarthamm.net/
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] What is the best way to do attribute/values?

2004-08-25 Thread Mark Kirkwood

Josh Berkus wrote:
Things we've already tried to avoid going over old ground:
1) increasing statistics;
2) increasing sort_mem (to 256MB, which is overkill)
3) testing on 8.0 beta, which does not affect the issue.
At this point I'm looking for ideas.   Suggestions, anyone?
 

with respect to query design:
consider instead of:
select
pav1.person_id
from
person_attributes_vertical pav1,
person_attributes_vertical pav2
where
pav1.attribute_id = 1
and pav1.value_id in (2,3)
and pav2.attribute_id = 2
and pav2.value_id in (2,3)
and pav1.person_id = pav2.person_id
try:
select
pav1.person_id
from
person_attributes_vertical pav1
where
   (pav1.attribute_id = 1
and pav1.value_id in (2,3))
or (pav1.attribute_id = 2
and pav1.value_id in (2,3))

I am gambling that the 'or's' might be less expensive than the multiple self joins 
(particularly in the more general cases!).
To make access work well you might want to have *several* concatenated indexes of 2 -> 
4 attributes - to work around Pg inability to use more than 1 in a given query.
For this query indexing (attribute_id, value_id) is probably good.
Consider playing with 'random_page_cost' and maybe 'effective_cache_size' to encourage 
the planner to use 'em.
regards
Mark


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] What is the best way to do attribute/values?

2004-08-24 Thread Richard Huxton
Daniel Ceregatti wrote:
Hi list,
I have a database with 1M "people" in it. Each person has about 20
attributes, such as height, weight, eye color, etc. I need to be able to
search for people based on these attributes. A search can be conducted
on one attribute, all attributes, or any number in between. How would
_you_ do this?
I have already attempted to answer this. My attempts are detailed here:
http://sh.nu/email.txt
Hmm... interesting.
Shot in the dark - try a tsearch2 full-text index. Your problem could be 
translated into searching strings of the form
  "hair=black eyes=blue age=117"

Not pretty, but might give you the speed you want.
--
  Richard Huxton
  Archonet Ltd
---(end of broadcast)---
TIP 3: if posting/reading through Usenet, please send an appropriate
 subscribe-nomail command to [EMAIL PROTECTED] so that your
 message can get through to the mailing list cleanly


Re: [PERFORM] What is the best way to do attribute/values?

2004-08-24 Thread Josh Berkus
Folks,

> I've discussed these attempts with people in #postgresql on
> irc.freenode.net. Agliodbs (I presume you know who this is) was very
> helpful, but in end was at a loss. I find myself in the same postition
> at this time. He suggested I contact this list.

There's a couple of issues here to attack:

1) PostgreSQL is not using the most optimal plan.First, it's ignoring the 
fact that all referenced columns are indexed and only using the first column, 
then filtering based on the other criteria.   Second, testing has shown that 
a hash join would actually be faster.   We've tried upping the statistics, 
but it doesn't seem to have an effect on the planner's erroneous estimates.

2) Even were it using the most optimal plan, it's still to slow.   As you can 
see from the plan, each merge join takes about 1.5 to 2 seconds.(hash 
joins are only about 0.5 seconds slower).  Mysteriously, a big chunk of this 
time is spent *in bewtween* planner steps, as if there was some hold-up in 
retrieving the index or table pages.   There may be, but Daniel and I have 
not been able to diagnose the cause.   It's particularly mysterious since a 
filter-and-sort on a *single* criteria set, without join, takes < 400ms.

Things we've already tried to avoid going over old ground:
1) increasing statistics;
2) increasing sort_mem (to 256MB, which is overkill)
3) testing on 8.0 beta, which does not affect the issue.

At this point I'm looking for ideas.   Suggestions, anyone?

-- 
Josh Berkus
Aglio Database Solutions
San Francisco

---(end of broadcast)---
TIP 8: explain analyze is your friend


[PERFORM] What is the best way to do attribute/values?

2004-08-24 Thread Daniel Ceregatti
Hi list,

I have a database with 1M "people" in it. Each person has about 20
attributes, such as height, weight, eye color, etc. I need to be able to
search for people based on these attributes. A search can be conducted
on one attribute, all attributes, or any number in between. How would
_you_ do this?

I have already attempted to answer this. My attempts are detailed here:

http://sh.nu/email.txt

This is the email I was originally going to send to this list. Since
it's so large, I decided to link to it instead. If you feel that it
belongs in a post to the list, let me know, and I'll post again.

I've discussed these attempts with people in #postgresql on
irc.freenode.net. Agliodbs (I presume you know who this is) was very
helpful, but in end was at a loss. I find myself in the same postition
at this time. He suggested I contact this list.

My ultimate goal is performance. This _must_ be fast. And by fast, I
mean, < 1 second, for every permutation of the number of attributes
searched for. Flexibility would be a bonus, but at this point I'll
settle for something that's harder to maintain if I can get the speed
gain I need.

Thanks,

Daniel Ceregatti

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster