Re: [PERFORM] Suspending SELECTs

2006-01-18 Thread Harry Jackson
Your experiment made far too many assumptions and the data does not
stand up to scrutiny.

On 1/18/06, Alessandro Baretta [EMAIL PROTECTED] wrote:
 Results: I'll omit the numerical data, which everyone can easily obtain in
 only a few minutes by repeating the experiment. I used several query strings
 containing very common words (linux debian, linux kernel, linux tux), each
 yielding millions of results. I set Google to retrieve 100 results per page.
 Then I ran the query and paged through the data set. The obvious result is
 that execution time is a monotonically growing function of the page number.
 This clearly indicates that Google does not use any algorithm of the proposed
 kind, but rather an OFFSET/LIMIT strategy, thus disproving the hypothesis.

I just ran the same test and got a different outcome from yours. The
last page came back twice as fast as page 4, and I noticed no trend in
the speed of the results from page to page.

Of course it is probably in cache because it's such a common thing to
be searched on, so the experiment is pointless.

You cannot jump to conclusions based on a few searches on Google.

 It must also be noted that Google refuses to return more than 1000 results per
 query, thus indicating that the strategy they adopted quite apparently cannot
 scale indefinitely, for on a query returning a potentially flooding dataset, a
 user paging through the data would experience a linear slowdown on the number
 of pages already fetched, and the DBMS workload would also be linear on the
 number of fetched pages.

There are various reasons why Google might want to limit the search
results returned, e.g. to encourage people to narrow their search, or
to prevent screen scrapers from hitting them really hard, blah blah.
Perhaps fewer than 0.0001% of real users (not scrapers) actually dig
down to the 10th page, so what's the point?

There are numerous methods you can use to serve successive result
pages, some of which involve going back to the database and some of
which don't. I prefer not to go back to the database if I can avoid
it; if all you want to do is provide a few links to further pages of
results, going back to the database and using offsets is a waste of I/O.
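
If you do have to go back to the database, one alternative to OFFSET is to
remember the last key shown and start the next page from there, so each page
is a short index scan rather than a scan that grows with the page number. A
minimal sketch, assuming a hypothetical search_hit table keyed by (rank,
item_id); the names and values are made up:

-- first page of 100 hits
select item_id, rank
  from search_hit
 where query_id = 42
 order by rank DESC, item_id DESC
 limit 100;

-- next page: start just below the last (rank, item_id) shown previously
select item_id, rank
  from search_hit
 where query_id = 42
   and (rank < 0.87 or (rank = 0.87 and item_id < 20101015))
 order by rank DESC, item_id DESC
 limit 100;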

--
Harry
http://www.hjackson.org
http://www.uklug.co.uk


Re: [PERFORM] help tuning queries on large database

2006-01-09 Thread Harry Jackson
On 1/9/06, Kelly Burkhart [EMAIL PROTECTED] wrote:
 On 1/8/06, Ron [EMAIL PROTECTED] wrote:
  snip
  Among the other tricks having lots of RAM allows:
  If some of your tables are Read Only or VERY rarely written to, you
  can preload them at boot time and make them RAM resident using the
  /etc/tmpfs trick.

  What is the /etc/tmpfs trick?

I think he means you can create a directory that mounts an area of
RAM. If you put the tables on it then access will be very fast. I would
not recommend it for anything you cannot afford to lose.

I have also tried it and found that it did not perform as well as I expected.
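
For what it's worth, a rough sketch of the kind of setup I understand this to
mean (the mount point, tablespace and table names are only examples):

-- assumes a tmpfs filesystem is already mounted at /mnt/pgram, e.g. via
-- "mount -t tmpfs -o size=2g tmpfs /mnt/pgram", and is owned by the postgres user
=# CREATE TABLESPACE ramspace LOCATION '/mnt/pgram';
=# ALTER TABLE my_readonly_table SET TABLESPACE ramspace;

Anything placed there disappears on reboot, hence the warning above.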

--
Harry
http://www.hjackson.org
http://www.uklug.co.uk


Re: [PERFORM] help tuning queries on large database

2006-01-06 Thread Harry Jackson
On 1/6/06, peter royal [EMAIL PROTECTED] wrote:
 PostgreSQL 8.1.1

 shared_buffers = 1  # (It was higher, 50k, but didn't help any,
 so brought down to free ram for disk cache)
 work_mem = 8196
 random_page_cost = 3
 effective_cache_size = 25

I have played with both the disk cache settings and shared buffers, and
I found that if I increased shared_buffers above a certain value,
performance increased dramatically. Playing with effective_cache_size
did not have the same impact. I am currently running with

shared_buffers = 254288 # approx 2.1Gb

and this is on a smaller dataset than yours.
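
If it helps, one rough way to gauge whether the larger shared_buffers is
paying off (a sketch only; it relies on the standard pg_statio views and the
stats collector being enabled) is to compare heap block hits against reads:

select sum(heap_blks_hit) as hits,
       sum(heap_blks_read) as reads,
       round(sum(heap_blks_hit) * 100.0 /
             nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2) as hit_pct
  from pg_statio_user_tables;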

--
Harry
http://www.hjackson.org
http://www.uklug.co.uk


Re: [PERFORM] CPU and RAM

2005-12-30 Thread Harry Jackson
On 24 Dec 2005 10:25:09 -0500, Greg Stark [EMAIL PROTECTED] wrote:

 Harry Jackson [EMAIL PROTECTED] writes:

  I always look at the explain plans.
 
  =# explain select item_id, term_frequency from reverse_index where
  term_id = 22781;
QUERY PLAN
  ----------------------------------------------------------------------------------
   Bitmap Heap Scan on reverse_index  (cost=884.57..84443.35 rows=150448 width=8)
     Recheck Cond: (term_id = 22781)
     ->  Bitmap Index Scan on reverse_index_term_id_idx  (cost=0.00..884.57 rows=150448 width=0)
           Index Cond: (term_id = 22781)
  (4 rows)

 Can you send EXPLAIN ANALYZE for this query for a problematic term_id? Are you
 really retrieving 150k records like it expects? In an OLTP environment that's
 an awful lot of records to be retrieving and might explain your high CPU usage
 all on its own.

The above is with the problematic term_id ;)

The above comes in at around 1/4 of a second, which is fine for now but
will cause me severe problems in a few months when the size of the
database swells.

 250ms might be as good as you'll get for 150k records. I'm not sure precaching
 that many records will help you. You're still going to have to read them from
 somewhere.

This is what I am thinking. I have tried various methods to reduce the
time. I even tried using ORDER BY and then cutting the result down to
50K records to see if that would help, but it came in at around the
same time. It is faster if I use the following, though not by much.

=# explain select * from reverse_index where term_id = 22781 order by
term_frequency DESC limit 3;
                                    QUERY PLAN
 ----------------------------------------------------------------------------------
 Limit  (cost=67337.87..67412.87 rows=3 width=12)
   ->  Sort  (cost=67337.87..67565.26 rows=90956 width=12)
         Sort Key: term_frequency
         ->  Index Scan using reverse_index_term_id_idx on reverse_index  (cost=0.00..59846.33 rows=90956 width=12)
               Index Cond: (term_id = 22781)
 (5 rows)

I was actually surprised by this; it shows that whatever routines
PostgreSQL is using to sort the data, they are pretty bloody fast. The
total sort time for 110K records is about 193ms, and the rest is
retrieval after that. What also surprised me is that without the sort

select * from reverse_index where term_id = 22781;

is slower than

select item_id, term_frequency from reverse_index where term_id = 22781;

but with the sort and limit added

select * from reverse_index where term_id = 22781 order by
term_frequency DESC limit 3;

is faster than

select item_id, term_frequency from reverse_index where term_id =
22781 order by term_frequency DESC limit 3;
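
One thing that might avoid the sort entirely (a sketch only; whether it helps
depends on the planner choosing a backward index scan) is a composite index on
the filter column plus the sort column:

create index reverse_index_term_freq_idx on reverse_index (term_id, term_frequency);

explain select item_id, term_frequency from reverse_index
 where term_id = 22781 order by term_frequency DESC limit 3;

With term_id pinned by the equality condition, the index order within that
term is term_frequency, so the limit could be satisfied without a separate
sort step.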

 I guess clustering on term_id might speed this up by putting all the records
 being retrieved together. It might also let the planner use a plain index scan
 instead of a bitmap scan and get the same benefit.

Yep. I clustered on the term_id index again before running the above
explain and this time we have a plain index scan.
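
For reference, clustering on that index looks like this (pre-8.3 syntax;
running ANALYZE afterwards to refresh the stats is just my own habit):

=# cluster reverse_index_term_id_idx on reverse_index;
=# analyze reverse_index;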

  The next query absolutely flies but it would have been the one I would
  have expected to be a lot slower.
  ...
  This comes in under 10.6ms which is astounding and I am more than
  happy with the performance to be had from it.

 Out of curiosity it would be interesting to see the EXPLAIN ANALYZE from this
 too.


                              QUERY PLAN
 -----------------------------------------------------------------------------
 Bitmap Heap Scan on item i  (cost=8.01..16.18 rows=4 width=478)
   Recheck Cond: ((item_id = 20006293) OR (item_id = 20097065) OR (item_id = 20101014) OR (item_id = 20101015))
   ->  BitmapOr  (cost=8.01..8.01 rows=4 width=0)
         ->  Bitmap Index Scan on item_item_id_pk  (cost=0.00..2.00 rows=1 width=0)
               Index Cond: (item_id = 20006293)
         ->  Bitmap Index Scan on item_item_id_pk  (cost=0.00..2.00 rows=1 width=0)
               Index Cond: (item_id = 20097065)

 [snip lots of single item_id bitmap index scans]

         ->  Bitmap Index Scan on item_item_id_pk  (cost=0.00..2.00 rows=1 width=0)
               Index Cond: (item_id = 20101014)
         ->  Bitmap Index Scan on item_item_id_pk  (cost=0.00..2.00 rows=1 width=0)
               Index Cond: (item_id = 20101015)


Another interesting thing I noticed was the size of the tables and
indexes after the cluster operation:

BEFORE:
          relname          |   bytes   | kbytes | relkind | mb
---------------------------+-----------+--------+---------+-----
 reverse_index             | 884293632 | 863568 | r       | 843
 reverse_index_pk          | 548126720 | 535280 | i       | 522
 reverse_index_term_id_idx | 415260672 | 405528 | i       | 396

AFTER:
 reverse_index             | 635944960 | 621040 | r       | 606
 reverse_index_pk

[PERFORM] CPU and RAM

2005-12-21 Thread Harry Jackson
I am currently using a dual Opteron (248) single core system (RAM
PC3200) and for a change I am finding that the bottleneck is not disk
I/O but CPU/RAM (not sure which). The reason for this is that the most
frequently accessed tables/indexes are all held in RAM and when
querying the database there is almost no disk activity which is great,
most of the time. However, the database is growing and this database
is supporting an OLTP system where the retrieval of the data is an
order of magnitude more important than the insertion and general
upkeep of the data. It supports a search engine[0] and contains a
reverse index, lexicon and the actual data table (currently just under
2Gb for the three tables and associated indexes).

At the moment everything is working OK but I am noticing an almost
linear increase in time to retrieve data from the database as the data
set increases in size. Clustering knocks the access times down by 25%
but it also knocks users off the website and can take up to 30 minutes
which is hardly an ideal scenario. I have also considered partitioning
the tables using extendible hashing and tries, to allocate the terms
in the index to the correct table, but after some testing I saw no
noticeable gain from this approach, which surprised me a bit.
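
For anyone curious, a much simplified sketch of that kind of term-based
partitioning (not the exact scheme I tested; the child table names and
boundaries here are made up) using inheritance, CHECK constraints and
constraint_exclusion:

-- child tables each hold a slice of the term_id space
create table reverse_index_p0 (check (term_id < 100000)) inherits (reverse_index);
create table reverse_index_p1 (check (term_id >= 100000 and term_id < 200000)) inherits (reverse_index);
create index reverse_index_p0_term_id_idx on reverse_index_p0 (term_id);
create index reverse_index_p1_term_id_idx on reverse_index_p1 (term_id);

-- with constraint_exclusion = on in postgresql.conf, a query such as
--   select * from reverse_index where term_id = 22781
-- should only scan the child whose CHECK constraint matches.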

The actual size of the database is not that big (4Gb) but I am
expecting this to increase to at least 20Gb over the next year or so.
This means that search times are going to jump dramatically which also
means the site becomes completely unusable. This also means that
although disk access is currently low I am eventually going to run out
of RAM and require a decent disk subsystem.

Do people have any recommendations as to what hardware would alleviate
my current CPU/RAM problem but, with a mind to the future, would still
be able to cope with heavy disk access? My budget is about £2300/$4000,
which is not a lot of money when talking databases, so suggestions of a
Sun Fire T2000 or similar systems will be treated with the utmost
disdain ;) unless you are about to give me one to keep.

--
Harry
http://www.hjackson.org
http://www.uklug.co.uk


Before anyone asks, I have considered using tsearch2.


Re: [PERFORM] Crashing DB or Server?

2005-12-16 Thread Harry Jackson
On 12/16/05, Moritz Bayer [EMAIL PROTECTED] wrote:
  This is really weird, just a few hours ago the machine run very smooth
 serving the data for a big portal.

Can you log the statements that are taking a long time and post them
to the list, along with the table structures and indexes for the tables
being used?

To do this, turn on logging for statements that take a long time: edit
the postgresql.conf file and set the following parameter.

log_min_duration_statement = 2000 # 2 seconds

Your log should now be catching the statements that are slow. Then use
those statements to get the explain plan, i.e.

dbname=# explain [sql that's taking a long time]

We would also need to see the table structures.

dbname=# \d [table name of each table in above explain plan]

  Has anybody an idea what might have happened here?
  I need a quick solution, since I'm talking about an live server that should
 be running 24 hours a day.

It may be that the planner has started to pick a bad plan. This can
happen if the database is changing regularly and the stats are not up
to date. I believe it can happen even when the stats are up to date,
but that is much less likely.

It might also be an idea to vacuum the database.

dbname=# VACUUM ANALYZE;

This will load the server up for a while though.
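
If running it against the whole database is too heavy for a live server, you
can vacuum just the tables that show up in the slow plans (the table name here
is only an example):

dbname=# VACUUM ANALYZE my_big_table;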

--
http://www.hjackson.org
http://www.uklug.co.uk
