Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres

2004-09-12 Thread Gaetano Mendola
Mark Cotner wrote:
Requirements:
Merge table definition equivalent.  We use these
extensively.
What do you mean by "merge table definition equivalent"?

Regards
Gaetano Mendola





Re: [PERFORM] TSearch2 and optimisation ...

2004-09-12 Thread George Essig
--- Hervé wrote:

> George,
> 
> I have read many pages on this subject, but so far I have not found
> anything that really helps me.  What can I do to optimize my
> PostgreSQL configuration for this particular use of Tsearch2?  I'm a
> little disappointed: the PostgreSQL Russian search engine built on
> Tsearch2 is really quick, so why can't I get the same result on a
> dual Pentium III 933 with 1GB of RAM, with a full-text index over
> 1,500,000 records?
> 
> Regards,
> -- 
> Hervé
> Elma Ingé, 6 rue du Faubourg Saint-Honoré, F-75008 Paris, France
> Pho. 33-144949901
> Fax. 33-144949902
> 

Tsearch does not scale indefinitely.  It was designed for fast online
updates and for tight integration into PostgreSQL.  My understanding
is that it uses a bloom filter together with bit string signatures.
Full text search engines typically use inverted indexes instead,
which scale better but are slower to update.

My understanding is that tsearch has a practical limit of 100,000
distinct word stems, or lexemes.  Note that word stems are not words.
Word stems are what actually gets stored in a tsvector after parsing
and dictionary processing.
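
For instance, the English stemmer folds many surface forms of a word
into one stem.  A quick illustration using tsearch2's two-argument
to_tsvector ('default' is tsearch2's stock configuration name; the
exact output is from memory, so treat this as a sketch):

    SELECT to_tsvector('default',
                       'Connection connected, connecting connections');
    -- roughly: 'connect':1,2,3,4  -- four words, one stored stem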

The key to making tsearch fast is to keep the number of word stems
low.  You decrease the number of word stems by using stop words,
various dictionaries, synonyms, and by preprocessing text before it
gets to tsearch.  You can find which word stems are stored in a
tsvector column by using the stat function.  For examples of how to
use the stat function, see:

http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stat.html
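
As a sketch of the idea (the table docs and tsvector column vector
are hypothetical names), counting stems and listing the most frequent
ones looks something like:

    -- total number of distinct word stems in the column
    SELECT count(*) FROM stat('SELECT vector FROM docs');

    -- the 20 most frequent stems: good stop word candidates
    SELECT word, ndoc, nentry
      FROM stat('SELECT vector FROM docs')
     ORDER BY ndoc DESC, nentry DESC
     LIMIT 20;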

Note that the stat function will take a long time to run on large tables.

Performance tuning must be done on a case-by-case basis.  It can take
some time to try different things and observe the change in
performance.  Each time you try something new, use the stat function
to see how the number of word stems has changed.

The largest project I used tsearch2 on contained 900,000 records.
Without performance tuning, there were 275,000 distinct word stems.
After performance tuning, I got it down to 14,000 distinct word
stems.

By using the stat function, I noticed some obvious stop words that
occurred very frequently but that nobody would ever search for.  For
how to use stop words, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/stop_words.html
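
If I remember the tsearch2 setup correctly, stop words live in a
per-dictionary file, and you point the stemmer dictionary at your
edited copy through the pg_ts_dict table (the file path here is
hypothetical):

    -- make the en_stem dictionary read a custom stop word list
    UPDATE pg_ts_dict
       SET dict_initoption = '/usr/local/pgsql/share/english.stop'
     WHERE dict_name = 'en_stem';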

I also noticed some strange patterns while looking through all of the
word stems.

In one case, strings of 3 to 7 words were joined together with
hyphens to indicate category nesting.  Tsearch would store these long
hyphenated words intact, and also store the stem of each individual
word.  I made a judgment call that no one would ever search for the
long hyphenated words, so I preprocessed the text to remove the
hyphens.

I also noticed that many of the word stems were alphanumeric IDs that
were designed to be unique.  There were many of these IDs in the
tsvector column, although each ID occurred only once or twice.  I
again preprocessed the text to remove these IDs, but created a btree
index on a varchar column representing the IDs.  My search form
allows users either to search the full text using tsearch2 or to
search the IDs using 'LIKE' queries, which use the btree index.  For
'LIKE' queries, it was another matter to get postgres to use the
btree index rather than a sequential scan.  For this, see:

http://www.postgresql.org/docs/7.4/static/indexes-opclass.html
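
In short, the fix described there is to create the index with a
pattern-matching operator class; a sketch, with a hypothetical doc_id
column:

    -- without varchar_pattern_ops, LIKE 'ABC%' won't use the index
    -- under a non-C locale
    CREATE INDEX docs_doc_id_idx ON docs (doc_id varchar_pattern_ops);

    SELECT * FROM docs WHERE doc_id LIKE 'ABC123%';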

Last, I noticed that most users wanted to restrict the full text
search to a subset determined by another column in the table.  As a
result, I created a multicolumn GiST index on an integer column and a
tsvector column.  For how to set up a multicolumn GiST index, see:

http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/oscon_tsearch2/multi_column_index.html
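
The resulting index and query look roughly like this (assuming the
btree_gist contrib module is installed so the integer column can take
part in a GiST index; all names hypothetical):

    CREATE INDEX docs_cat_vector_idx
        ON docs USING gist (category_id, vector);

    -- restrict the full text search to one category
    SELECT * FROM docs
     WHERE category_id = 42
       AND vector @@ to_tsquery('search & term');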

There are no easy answers.  Like I said, performance tuning must be
done on a case-by-case basis.

Hope this helps,

George Essig



Re: [PERFORM] Data Warehouse Reevaluation - MySQL vs Postgres -- merge tables

2004-09-12 Thread Christopher Browne
The world rejoiced as Mischa Sandberg <[EMAIL PROTECTED]> wrote:
> Mark Cotner wrote:
>> Requirements:
>> Merge table definition equivalent.  We use these
>> extensively.

> Looked all over mysql.com etc., and AFAICS a merge table is indeed
> exactly a view over a UNION ALL.  Is that right?

> PG supports views, of course, as well as (now) tablespaces, allowing
> you to split tables/tablesets across multiple disk systems.  PG is
> also pretty efficient in query plans on such views, where (say) you
> make one column a constant (identifier, sort of) per input table.

The thing that _doesn't_ work well with these sorts of UNION views is
self-joins.  Supposing the view has 10 member tables, a self-join
expands to a 100-way join, which is not particularly pretty.

I'm quite curious how MySQL(tm) copes with this, though the situation
may never arise; they may not support self-joins on merge tables at
all...

>> Um, gonna sound silly, but the web interface has to remain "snappy"
>> under load.  I don't see this as a major concern since you don't
>> require table locking.

> Agreed. It's more in your warehouse design, and intelligent bounding
> of queries. I'd say PG's query analyzer is a few years ahead of
> MySQL for large and complex queries.

The challenge comes in if the application has had enormous amounts of
effort put into it to attune it exactly to MySQL(tm)'s feature set.

The guys working on RT/3 have found this a challenge; they had rather
a lot of dependencies on its case-insensitive string comparisons,
which caused considerable grief.

> On the other hand, if you do warehouse-style loading (Insert, or PG
> COPY, into a temp table; and then 'upsert' into the perm table), I
> can guarantee 2500 inserts/sec is no problem.

The big wins are thus:

 1.  Group plenty of INSERTs into a single transaction.

 2.  Better still, use COPY to cut parsing costs plenty more.

 3.  Adding indexes _after_ the COPY is a further win.

Another possibility is to do clever things with stored procs: load
incoming data using the above optimizations, then run stored
procedures with more or less fancy logic to put the data where it's
ultimately supposed to be.  Having the logic run inside the engine is
the big optimization.
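
A minimal sketch of that loading pattern (all table and column names
hypothetical):

    BEGIN;
    CREATE TEMP TABLE staging (id integer, val numeric);
    COPY staging FROM '/tmp/batch.dat';

    -- 'upsert': update rows that already exist, insert the rest
    UPDATE facts f
       SET val = s.val
      FROM staging s
     WHERE f.id = s.id;

    INSERT INTO facts (id, val)
        SELECT s.id, s.val
          FROM staging s
         WHERE NOT EXISTS (SELECT 1 FROM facts f WHERE f.id = s.id);
    COMMIT;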
-- 
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','linuxfinances.info').
http://linuxfinances.info/info/spreadsheets.html
Rules  of  the  Evil  Overlord   #198.   "I  will  remember  that  any
vulnerabilities I have  are to be revealed strictly  on a need-to-know
basis.  I will also remember that no one needs to know."




[PERFORM] Determine optimal fdatasync/fsync, O_SYNC/O_DSYNC options

2004-09-12 Thread mudfoot
Hi, I'd like to help with the topic in the Subject: line.  It seems
to be a TODO item.  I've reviewed some threads discussing the matter,
so I hope I've acquired enough history concerning it.  I've taken an
initial swipe at figuring out how to optimize syncing methods, based
largely on recommendations I've read in previous threads about
fsync/O_SYNC and so on.  After reviewing, if anybody has
recommendations on how to proceed, I'd love to hear them.
   
Attached is a little program that basically does a bunch of
sequential writes to a file.  All of the syncing methods supported by
PostgreSQL WAL can be used.  Results are printed in microseconds.
The size and quantity of writes are configurable.  The documentation
is in the code (how to configure, build, run, etc.).  I realize that
this program doesn't reflect all of the possible activities of a
production database system, but I hope it's a step in the right
direction for this task.  I've used it to see differences in behavior
between the various syncing methods on various platforms.
   
Here's what I've found running the benchmark on some systems to which
I have access.  The differences in behavior between platforms are
quite vast.

Summary first...


PostgreSQL should be run on an old Apple Macintosh attached to its
own Hitachi disk array with 2GB of cache or so.  Use any sync method
except for fsync().


Anyway, there is *a lot* of variance in file syncing behavior across
different hardware and OS platforms.  It's probably not safe to
conclude much.  That said, here are some findings so far, based on
tests I've run:
  
1.  Under no circumstances do fsync() or fdatasync() seem to perform
better than opening files with O_SYNC or O_DSYNC.
2.  Where there are differences, opening files with O_SYNC or O_DSYNC
tends to be quite a bit faster.
3.  fsync() seems to be the slowest where there are differences, and
O_DSYNC seems to be the fastest where results differ.
4.  The safest thing to assert at this point is that Solaris systems
ought to use the O_DSYNC method for WAL (see the configuration note
below).
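
For reference, PostgreSQL picks among exactly these methods via the
wal_sync_method parameter, so the winner of a benchmark like this
maps onto a one-line postgresql.conf setting:

    # postgresql.conf -- valid values include fsync, fdatasync,
    # open_sync, and open_datasync (where the platform supports them)
    wal_sync_method = open_datasync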
  
---   
   
Test system(s)   
   
Athlon Linux:   
AMD Athlon XP2000, 512MB RAM, single (5400 or 7200?) RPM 20GB IDE disk,
reiserfs filesystem (3 something I think)   
SuSE Linux kernel 2.4.21-99   
   
Mac Linux:   
I don't know the specific model.  400MHz G3, 512MB, single IDE disk,   
ext2 filesystem   
Debian GNU/Linux 2.4.16-powerpc   
   
HP Intel Linux:   
ProLiant DL380 G3, 2 x 3GHz Xeon, 2GB RAM, SmartArray 5i 64MB cache,
2 x 15,000RPM 36GB U320 SCSI drives mirrored.  I'm not sure if   
writes are cached or not.  There's no battery backup.   
ext3 filesystem.   
Redhat Enterprise Linux 3.0 kernel based on 2.4.21   
   
Dell Intel OpenBSD:   
Poweredge ?, single 1GHz PIII, 128MB RAM, single 7200RPM 80GB IDE disk,   
ffs filesystem   
OpenBSD 3.2 GENERIC kernel   
   
SUN Ultra2:   
Ultra2, 2 x 296MHz UltraSPARC II, 2GB RAM, 2 x 10,000RPM 18GB U160   
SCSI drives mirrored with Solstice DiskSuite.  UFS filesystem.   
Solaris 8.   
   
SUN E4500 + HDS Thunder 9570v   
E4500, 8 x 400MHz UltraSPARC II, 3GB RAM,   
HDS Thunder 9570v, 2GB mirrored battery-backed cache, RAID5 with a   
bunch of 146GB 10,000RPM FC drives.  LUN is on single 2GB FC fabric   
connection.   
Veritas filesystem (VxFS)   
Solaris 8.   
   
Test methodology:   
   
All test runs were done with CHUNKSIZE 8 * 1024, CHUNKS 2 * 1024,   
FILESIZE_MULTIPLIER 2, and SLEEP 5.  So a total of 16MB was sequentially  
written for each benchmark.  
   
Results are in microseconds.   
   
PLATFORM:   Athlon Linux   
buffered:   48220   
fsync:  74854397   
fdatasync:  75061357   
open_sync:  73869239   
open_datasync:  74748145   
Notes:  System mostly idle.  Even during tests, top showed about 95%   
idle.  Something's not right on this box.  All sync methods similarly   
horrible on this system.   
   
PLATFORM:   Mac Linux   
buffered:   58912   
fsync:  1539079   
fdatasync:  769058   
open_sync:  767094   
open_datasync:  763074   
Notes: system mostly idle.  fsync seems worst.  Otherwise, they seem   
pretty equivalent.  This is the fastest system tested.  
   
PLATFORM:   HP Intel Linux   
buffered:   33026   
fsync:  29330067   
fdatasync:  28673880   
open_sync:  8783417   
open_datasync:  8747971   
Notes: system idle.  O_SYNC and O_DSYNC methods seem to be a lot   
better on this platform than fsync & fdatasync.  
  
PLATFORM:   Dell Intel OpenBSD  
buffered:   511890  
fsync:  1769190  
fdatasync:    
open_sync:  1748764  
open_datasync:  1747433  
Notes: system idle.  I couldn't locate fdatasync() on this box, so I
couldn't test it.  All sync methods seem equivalent and are very
fast -- though they still trail the old Mac.
  
PLATFORM:   SUN Ultra2  
buffered:   1814824  
fsync: