Heikki Linnakangas wrote:
Here's results from a batch of test runs with LDC. This patch only
spreads out the writes, fsyncs work as before. This patch also includes
the optimization that we don't write buffers that were dirtied after
starting the checkpoint.
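A minimal sketch of the two ideas in that description — pace the buffer writes evenly across an interval, and skip buffers dirtied after the checkpoint started (illustrative Python, not the patch's C code; all names here are made up):

```python
import time

def spread_checkpoint_writes(dirty_buffers, checkpoint_start, duration_s, write):
    """Write out buffers that were dirty *before* the checkpoint started,
    pacing the writes evenly over duration_s (the LDC write-spreading idea).

    dirty_buffers: list of (buffer_id, dirtied_at) pairs
    checkpoint_start: logical timestamp when the checkpoint began
    write: callback that performs the actual buffer write
    """
    # Optimization from the patch description: buffers dirtied after the
    # checkpoint started will be covered by the *next* checkpoint anyway.
    todo = [b for b, dirtied_at in dirty_buffers if dirtied_at < checkpoint_start]
    if not todo:
        return 0
    delay = duration_s / len(todo)  # even pacing, instead of one big I/O burst
    for buf in todo:
        write(buf)
        time.sleep(delay)
    return len(todo)
```

The real patch paces against WAL progress rather than wall-clock time alone, but the shape of the loop is the same.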
On Wed, 2007-06-13 at 18:06 -0400, Bruce Momjian wrote:
You bring up a very good point. There are fifteen new commands being
added for full text indexing:
alter-fulltext-config.sgml alter-fulltext-owner.sgml
create-fulltext-dict.sgml drop-fulltext-dict.sgml
tests                     | pgbench | DBT-2 response time (avg/90%/max)
--------------------------+---------+----------------------------------
LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
+ BM_CHECKPOINT_NEEDED(*) | 187 tps |
Simon Riggs [EMAIL PROTECTED] wrote:
tests                     | pgbench | DBT-2 response time (avg/90%/max)
--------------------------+---------+----------------------------------
LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
+ BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83
I've done some more work on this point. After looking at the Snowball
code in more detail, I'm thinking it'd be a good idea to keep it at
arm's length in a loadable shared library, instead of incorporating it
I split the stemmers into two sets because of the regression tests. As I remember, there
was
Probably, having a default text search configuration is not a good idea,
and we could just require it as a mandatory parameter, which could
eliminate much confusion with selecting a text search configuration.
Ugh. Having a default configuration (by locale, by postgresql.conf, or some other
way)
On Fri, 2007-06-15 at 18:33 +0900, ITAGAKI Takahiro wrote:
Simon Riggs [EMAIL PROTECTED] wrote:
tests                     | pgbench | DBT-2 response time (avg/90%/max)
--------------------------+---------+----------------------------------
LDC only                  | 181 tps | 1.12
Heikki Linnakangas wrote:
Here's an updated WIP version of the LDC patch. It just spreads the
writes, which achieves the goal of smoothing the checkpoint I/O spikes. I
think sorting the writes etc. is interesting but falls in the category
of further development and should be pushed to 8.4.
Why
Heikki Linnakangas [EMAIL PROTECTED] writes:
I ran another series of tests, with a less aggressive bgwriter_delay setting,
which also affects the minimum rate of the writes in the WIP patch I used.
Now that the checkpoints are spread out more, the response times are very
smooth.
So
Teodor Sigaev [EMAIL PROTECTED] writes:
I split the stemmers into two sets because of the regression tests. As I
remember, there were some problems with loadable conversions and
configure's flag --disable-shared
I'm not worried about supporting --disable-shared installations very
much. They didn't have
danish, dutch, finnish, french, german, hungarian, italian, norwegian,
portuguese, spanish, swedish, russian and english
Albe Laurenz wrote:
Tom Lane wrote:
Teodor Sigaev [EMAIL PROTECTED] writes:
So we need to change the dictinitoption format of snowball
dictionaries to point to both
Tom Lane wrote:
Teodor Sigaev [EMAIL PROTECTED] writes:
So we need to change the dictinitoption format of snowball
dictionaries to point to both the stop-word file and the language's name.
Right.
Is there any chance to get support for other languages than English and
Russian into the tsearch2
Tom Lane wrote:
Bruce Momjian [EMAIL PROTECTED] writes:
First, why are we specifying the server locale here since it never
changes:
It's poorly described. What it should really say is the language
that the text-to-be-searched is in. We can actually support multiple
languages here
Bruce Momjian wrote:
My guess right now is that we use a GUC that will default if a
pg_catalog configuration name matches the lc_ctype locale name, and we
have to throw an error if an accessed index creation GUC doesn't match
the current GUC.
So we create a pg_catalog full text
Simon Riggs [EMAIL PROTECTED] writes:
Although I'm happy to see tsearch finally hit the big time, I'm a bit
disappointed to see so many new datatype-specific SQL commands created.
Per subsequent discussion we are down to just one new set of commands,
CREATE/ALTER/DROP TEXT SEARCH CONFIGURATION,
On Jun 14, 2007, at 19:04 , [EMAIL PROTECTED] wrote:
For UUID, I
would value random access before sequential performance. Why would
anybody scan UUID through the index in sequential order?
AIUI, to allow UUID columns to be indexed using BTREE, there needs to
be some ordering defined. So
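For reference, a btree opclass only needs some total order — it need not be semantically meaningful. PostgreSQL's uuid comparison is effectively a memcmp() of the 16 raw bytes. A sketch of that ordering (illustrative Python):

```python
import uuid

def uuid_btree_cmp(a: uuid.UUID, b: uuid.UUID) -> int:
    """Total order over UUIDs by comparing the 16 raw bytes,
    equivalent to memcmp() on the binary representation."""
    if a.bytes < b.bytes:
        return -1
    if a.bytes > b.bytes:
        return 1
    return 0
```

Any such order satisfies the btree; the whole debate here is only about whether the resulting insert pattern is friendly to the index.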
Go ahead and make the changes you want, and then I'll work on this.
So, I'm planning for this weekend:
1) rename FULLTEXT to TEXT SEARCH in SQL command
2) rework Snowball stemmer's as Tom suggested
3) ALTER FULLTEXT CONFIGURATION cfgname ADD/ALTER/DROP MAPPING
4) remove support of default
When done that way, you're going to see a lot of index B-tree
fragmentation with even DCE 1.1 (ISO/IEC 11578:1996) time based
UUIDs,
as described above. With random (version 4) or hashed based (version
3
or 5) UUIDs there's nothing that can be done to improve the
situation,
obviously.
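The fragmentation point can be seen directly from the version-1 byte layout: time_low, the fastest-changing 32 bits of the timestamp, is stored first, so byte order does not follow time order. A small demonstration with hand-picked fields (the fields tuple is (time_low, time_mid, time_hi_version, clock_seq_hi_variant, clock_seq_low, node)):

```python
import uuid

# A UUID whose timestamp is 2**32 (time_mid = 1, time_low = 0) ...
later = uuid.UUID(fields=(0x00000000, 0x0001, 0x1000, 0x80, 0x00, 0))
# ... and one whose timestamp is only 255 (time_low = 0xFF).
earlier = uuid.UUID(fields=(0x000000FF, 0x0000, 0x1000, 0x80, 0x00, 0))

# 'later' really is later in time, yet it sorts *before* 'earlier'
# under a byte-wise (memcmp-style) comparison, because time_low
# occupies the most significant byte positions.
assert later.time > earlier.time
assert later.bytes < earlier.bytes
```

So even perfectly sequential v1 UUID generation produces scattered btree insert points, which is the fragmentation described above.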
1) Require the configuration to be always specified. The problem with
this is that casting (::tsquery) and operators (@@) have no way to
specify a configuration.
it's not convenient for the most common cases
2) Use a GUC that you can set for the configuration, and perhaps
default it if
Teodor Sigaev [EMAIL PROTECTED] writes:
My guess right now is that we use a GUC that will default if a
pg_catalog configuration name matches the lc_ctype locale name, and we
have to throw an error if an accessed index creation GUC doesn't match
the current GUC.
Where will index store index
I'd suggest allowing either full names (swedish) or the standard
two-letter abbreviations (sv). But let's stay away from locale names.
We can use database's encoding name (the same names used in initdb -E)
--
Teodor Sigaev E-mail: [EMAIL PROTECTED]
Bruce Momjian [EMAIL PROTECTED] writes:
Do locale names vary across operating systems?
Yes, which is the fatal flaw in the whole thing. The ru_RU part is
reasonably well standardized, but the encoding part is not. Considering
that encoding is exactly the part of it we don't care about for this
Teodor Sigaev [EMAIL PROTECTED] writes:
I'd suggest allowing either full names (swedish) or the standard
two-letter abbreviations (sv). But let's stay away from locale names.
We can use database's encoding name (the same names used in initdb -E)
AFAICS the encoding name shouldn't be anywhere
Tom Lane [EMAIL PROTECTED] writes:
It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific. It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit. We need some way of annotating the
Gregory Stark [EMAIL PROTECTED] writes:
Tom Lane [EMAIL PROTECTED] writes:
It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific. It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
On Friday 15 June 2007 00:46, Oleg Bartunov wrote:
On Thu, 14 Jun 2007, Tom Lane wrote:
[ thinks some more... ] If we revived the GENERATED AS patch,
you could imagine computing tsvector columns via GENERATED AS
to_tsvector('english'::regconfig, big_text_col) instead of a
trigger, and
On Thursday 14 June 2007 15:10, Teodor Sigaev wrote:
Those changes are doable in a few days. I'd like to make the changes together
with replacing the FULLTEXT keyword with TEXT SEARCH as you suggested.
AIUI the discussion on this change took place off list? Can we get a preview
of what the
I suggest that treating the UUID as anything other than a unique
random value is a mistake. There should be no assumptions by users
with regard to how the order is displayed.
You can always use random UUIDs -- that's a choice in UUID generation.
When dealing with random UUIDs you also (by the
The only reason the TS stuff needs an encoding spec is to figure out how
to read an external stop word file. I think my suggestion upthread is a
lot better: have just one stop word file per language, store them all in
UTF8, and convert to database encoding when loading them. The database
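A sketch of that loading scheme — one UTF-8 stop word file per language, converted (or at least verified convertible) to the database encoding at load time. Illustrative Python; the function name and file format (one word per line) are assumptions:

```python
def load_stopwords(utf8_bytes: bytes, db_encoding: str) -> list[str]:
    """Decode a UTF-8 stop word file and verify every word can be
    represented in the database encoding; one word per line,
    blank lines ignored."""
    text = utf8_bytes.decode("utf-8")
    words = [w.strip() for w in text.splitlines() if w.strip()]
    for w in words:
        w.encode(db_encoding)  # raises if the word can't be converted
    return words
```

With this approach a single swedish stop word file serves LATIN1 and UTF8 databases alike, and a database whose encoding can't represent the words fails loudly instead of silently mangling them.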
Hmm.
It's not really the index's problem; IIUC the behavior of the gist and
gin index opclasses is not locale-specific.
Right
It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit. We need some way of annotating the heap column about this.
It
Teodor Sigaev [EMAIL PROTECTED] writes:
It's the to_tsvector calls
that built the tsvector heap column that have a locale specified or
implicit. We need some way of annotating the heap column about this.
It seems too restrictive to advanced users.
Hm, are you trying to say that it's sane to
Teodor Sigaev [EMAIL PROTECTED] writes:
Hmm. You mean to use the language name in the configuration, use the current
encoding to decide which dictionary should be used (stemmers for the same
language differ between encodings), and recode the dictionary files from UTF8
to the current locale. Did I
On Fri, 2007-06-15 at 10:36 -0400, Tom Lane wrote:
Simon Riggs [EMAIL PROTECTED] writes:
Although I'm happy to see tsearch finally hit the big time, I'm a bit
disappointed to see so many new datatype-specific SQL commands created.
Per subsequent discussion we are down to just one new set
Hm, are you trying to say that it's sane to have different tsvectors in
a column computed under different language settings? Maybe we're all
Yes, I think so.
That might make sense for closely related languages. Norwegian has two dialects,
and one of them has advanced rules for compound words,
The current discussion about the tsearch-in-core patch has convinced me
that there are plausible use-cases for typmod values that aren't simple
integers. For instance it could be sane for a type to want a locale or
language selection as a typmod, eg tsvector('ru') or tsvector('sv').
(I'm not
On Fri, Jun 15, 2007 at 12:14:45PM -0400, Tom Lane wrote:
[snip]
I propose changing the typmodin signature to typmodin(cstring[])
returns int4, that is, the typmods will be passed as strings not
integers. This will incur a bit of extra conversion overhead for
the normal uses where the
Hello All,

Recently, I have been involved in some work that requires me to
monitor low-level performance counters for pgsql. Specifically, when I execute
a particular query I want to be able to tell how many system calls get executed
on behalf of that query and the time of each syscall. The idea is
So, added to my plan
(http://archives.postgresql.org/pgsql-hackers/2007-06/msg00618.php)
n) single encoded files. That will touch snowball, ispell, synonym, thesaurus
and simple dictionaries
n+1) use encoding names instead of locale names in configuration
Tom Lane wrote:
Teodor Sigaev
One possibility is that the user-visible specification is just a name
(eg, english), but the actual filename out on the filesystem is,
say, name.encoding.stop (eg, english.utf8.stop) where we use PG's
names for the encodings. We could just fail if there's not a file
matching the database
Teodor Sigaev [EMAIL PROTECTED] writes:
But configurations for different languages might differ; for example, a
russian (or any cyrillic-based) configuration differs from a west-european
configuration based on a different character set.
Sure. I'm just assuming that the set of stopwords doesn't
Sure. I'm just assuming that the set of stopwords doesn't need to vary
depending on the encoding you're using for a language --- that is, if
you're willing to convert the encoding then the same stopword list file
should serve for all encodings of a given language. Do you think this
might be
* Tom Lane ([EMAIL PROTECTED]) wrote:
I propose changing the typmodin signature to typmodin(cstring[]) returns
int4, that is, the typmods will be passed as strings not integers. This
will incur a bit of extra conversion overhead for the normal uses where
the typmods are integers, but I think
Stephen Frost [EMAIL PROTECTED] writes:
Would this allow for 'multi-value' typmods for user-defined types?
If you can squeeze them into 31 bits of stored typmod, yes. That
may mean that you still need the side table (with stored typmod being a
lookup key for the table). But this gets you out
Teodor Sigaev [EMAIL PROTECTED] writes:
I propose changing the typmodin signature to typmodin(cstring[]) returns
int4, that is, the typmods will be passed as strings not integers.
And modify ArrayGetTypmods() to ArrayGetIntegerTypmods()
Right --- the decoding work will only have to happen in
On Fri, 15 Jun 2007, Umar Farooq wrote:
Surprisingly, no matter what type of query I execute, when I use strace
to monitor the system calls generated they turn out to be the same for
ALL sorts of queries.
How are you calling strace? The master postgres process forks off new
processes for
On Fri, 15 Jun 2007, Gregory Stark wrote:
If I understand it right Greg Smith's concern is that in a busier system
where even *with* the load distributed checkpoint the i/o bandwidth
demand during the checkpoint was *still* being pushed over 100% then
spreading out the load would only
Is it worth providing an ArrayGetStringTypmods in core, when it won't
be used by any existing core datatypes?
I don't think so - cstring[] is a set of strings itself. I don't believe that we
could suggest something commonly useful without some real-world examples.
--
Teodor Sigaev
* Tom Lane ([EMAIL PROTECTED]) wrote:
Stephen Frost [EMAIL PROTECTED] writes:
Would this allow for 'multi-value' typmods for user-defined types?
If you can squeeze them into 31 bits of stored typmod, yes. That
may mean that you still need the side table (with stored typmod being a
lookup
Am Freitag, 15. Juni 2007 18:14 schrieb Tom Lane:
The current discussion about the tsearch-in-core patch has convinced me
that there are plausible use-cases for typmod values that aren't simple
integers. For instance it could be sane for a type to want a locale or
language selection as a
Gregory Stark wrote:
Heikki Linnakangas [EMAIL PROTECTED] writes:
Now that the checkpoints are spread out more, the response times are very
smooth.
So obviously the reason the results are so dramatic is that the checkpoints
used to push the i/o bandwidth demand up over 100%. By spreading it
Michael Paesold wrote:
Heikki Linnakangas wrote:
Here's an updated WIP version of the LDC patch. It just spreads the
writes, which achieves the goal of smoothing the checkpoint I/O spikes.
I think sorting the writes etc. is interesting but falls in the
category of further development and should
I propose changing the typmodin signature to typmodin(cstring[]) returns
int4, that is, the typmods will be passed as strings not integers. This
will incur a bit of extra conversion overhead for the normal uses where
the typmods are integers, but I think the gain in flexibility is worth
agree
Teodor Sigaev [EMAIL PROTECTED] writes:
Hm, are you trying to say that it's sane to have different tsvectors in
a column computed under different language settings? Maybe we're all
Yes, I think so.
That might make sense for closely related languages. Norwegian has two
dialects,
and one
To support this sanely though wouldn't you need to know which language rule a
tsvector was generated with? Like, have a byte in the tsvector tagging it with
the language rule forever more?
No. As a corner case, a dictionary might return just a number or a hash value.
What I'm wondering about is
Greg Smith [EMAIL PROTECTED] writes:
On Fri, 15 Jun 2007, Gregory Stark wrote:
If I understand it right Greg Smith's concern is that in a busier system
where even *with* the load distributed checkpoint the i/o bandwidth demand
during the checkpoint was *still* being pushed over 100% then
On 6/15/07, Gregory Stark [EMAIL PROTECTED] wrote:
While in theory spreading out the writes could have a detrimental effect I
think we should wait until we see actual numbers. I have a pretty strong
suspicion that the effect would be pretty minimal. We're still doing the same
amount of i/o
Stephen Frost [EMAIL PROTECTED] writes:
Any chance of this being increased?
No. Changing typmod to something other than int32 would require many
thousands of lines of diffs just in the core distro. I don't even want
to think about how much outside code would break.
On Fri, 15 Jun 2007 22:28:34 +0200, Gregory Maxwell [EMAIL PROTECTED]
wrote:
On 6/15/07, Gregory Stark [EMAIL PROTECTED] wrote:
While in theory spreading out the writes could have a detrimental
effect I
think we should wait until we see actual numbers. I have a pretty strong
suspicion that
On Fri, Jun 15, 2007 at 09:40:29AM -0500, Michael Glaesemann wrote:
On Jun 14, 2007, at 19:04 , [EMAIL PROTECTED] wrote:
For UUID, I
would value random access before sequential performance. Why would
anybody scan UUID through the index in sequential order?
AIUI, to allow UUID columns to be
I had the same problem, so I tried building with increasingly older versions of
the MinGW runtime. It turns out version 3.9 is the most recent version without
the conflict in sys/time.h.
Looking for a
On Fri, Jun 15, 2007 at 11:05:01AM -0400, Robert Wojciechowski wrote:
Also, treating UUIDs as time based is completely valid -- that is the
point of version 1 UUIDs. They have quite a few advantages over random UUIDs.
It's a leap from extracting the UUID as time, to sorting by UUID for
results,
In utils/adt/tid.c, there are two mysterious functions with no comments,
and no one calling them from backend code AFAICT: currtid_byreloid and
currtid_byrelname. What do they do/did?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Jun 14, 2007, at 7:21 AM, Heikki Linnakangas wrote:
We have these GUC variables that define a fraction of something:
#autovacuum_vacuum_scale_factor = 0.2 # fraction of rel size before
# vacuum
#autovacuum_analyze_scale_factor = 0.1 # fraction of
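For context, these scale factors feed the documented autovacuum trigger formula: a table is vacuumed once its dead tuples exceed threshold + scale_factor * reltuples. A one-line sketch with the default settings shown above:

```python
def autovacuum_vacuum_threshold(reltuples, base_threshold=50, scale_factor=0.2):
    """Dead-tuple count at which autovacuum triggers a VACUUM of a table:
    autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples."""
    return base_threshold + scale_factor * reltuples
```

So a million-row table is vacuumed after roughly 200,050 dead tuples accumulate, which is why "fraction of rel size" is the natural way to describe these GUCs.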
If index lookup speed or packing truly was the primary concern, people
would use a suitably sized SEQUENCE. They would not use UUID.
I believe the last time I calculated this, the result was that you
could fit 50% more entries in the index if you use a 32-bit sequence
number instead of a 128-bit
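That 50% figure is consistent with back-of-the-envelope index tuple sizes on a typical 64-bit build — an 8-byte IndexTuple header plus the key, rounded up to MAXALIGN. The header and alignment values here are assumptions for illustration, not measurements:

```python
def index_tuples_ratio(key_a: int, key_b: int, header: int = 8, align: int = 8) -> float:
    """How many times larger a btree entry with key size key_b is than one
    with key size key_a, assuming an 8-byte header and MAXALIGN padding."""
    def tuple_size(key: int) -> int:
        raw = header + key
        return (raw + align - 1) // align * align  # round up to alignment
    return tuple_size(key_b) / tuple_size(key_a)

# int32 key: 8 + 4 -> 16 bytes; UUID key: 8 + 16 -> 24 bytes.
```

24/16 = 1.5, i.e. 50% more int32 entries fit per page than 128-bit UUID entries, matching the figure quoted above.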
Heikki Linnakangas wrote:
In utils/adt/tid.c, there's two mysterious functions with no comments,
and no-one calling them inside backend code AFAICT: currtid_byreloid
and currtid_byrelname. What do they do/did?
If you have a look at the CVS annotations (
Heikki Linnakangas [EMAIL PROTECTED] writes:
In utils/adt/tid.c, there's two mysterious functions with no comments, and
no-one calling them inside backend code AFAICT: currtid_byreloid and
currtid_byrelname. What do they do/did?
The comments for heap_get_latest_tid() seem to apply. They're
Heikki Linnakangas [EMAIL PROTECTED] writes:
In utils/adt/tid.c, there's two mysterious functions with no comments,
and no-one calling them inside backend code AFAICT: currtid_byreloid and
currtid_byrelname. What do they do/did?
IIRC, the ODBC driver uses them, or once did, to deal with
Tom Lane [EMAIL PROTECTED] writes:
Heikki Linnakangas [EMAIL PROTECTED] writes:
In utils/adt/tid.c, there's two mysterious functions with no comments,
and no-one calling them inside backend code AFAICT: currtid_byreloid and
currtid_byrelname. What do they do/did?
IIRC, the ODBC driver
Now that we've fixed the partial/interleaved log line issue, I have
returned to trying to get the CSV log patch into shape. Sadly, it still
needs lots of work, even after Greg Smith and I both attacked it, so I
am now going through it with a fine tooth comb.
One issue I notice is that it
Andrew Dunstan wrote:
Now that we've fixed the partial/interleaved log line issue, I have
returned to trying to get the CSV log patch into shape. Sadly, it still
needs lots of work, even after Greg Smith and I both attacked it, so I
am now going through it with a fine tooth comb.
One
From: [EMAIL PROTECTED]
To: pgsql-hackers@postgresql.org

Tasneem,
The margins to op2, i.e. m1 and m2, are added dynamically on both
sides, considering the value it contains. Keeping this margin big
is important for a certain reason discussed later. The NEAR operator
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] To all the pgsql developers..Have a look at the operators proposed by me in my research

On Sat, Jun 02, 2007 at 01:37:19PM +, Tasneem Memon wrote: We can make
the system ask the
Andrew Dunstan [EMAIL PROTECTED] writes:
One issue I notice is that it mangles the log message to add a tab
character before each newline. We do this in standard text logs to make
them more readable for humans. but the whole point of having CSV logs is
to make them machine readable, and I'm