Re: [HACKERS] Searching for substring with tsearch(1/2)
Tsearch was never intended for prefix search, and the index structure doesn't support any kind of prefix or suffix matching. You can write an extension to tsearch that searches by prefix, but such a solution will not use the index, only a sequential scan.

> How efficient would tsearch be for really big expressions, where 'hu%' would be expanded (using a btree word index on a one-column word table) to the tsearch equivalent of ( human or humanity or humming or huge or ..1000 words here... ) before passing the expression to tsearch?

The GiST index of tsearch doesn't support prefix search, so it works only by seqscan, as we know :) Disk is much slower than the processor, so speed will be limited by disk. Prefix searches are easily implemented with an inverted index, but that requires a lot of programming. The simplest way is:

    create table invidx (
        lexeme text not null primary key,
        ids int[]
    );

where ids is an array of identifiers of the documents that contain this word.

> How hard (or sensible ;) would it be to create such an index using GiST? As proved by tsearch, GiST can cope well with many-to-many indexes.

Sorry, I don't understand. Do you mean that GiST supports one heap tuple in several index tuples? If yes, then no :). GiST doesn't support this feature. I don't think that GiST can help in this situation.

>     create table invidx (
>         lexeme text not null,
>         textdate date not null,
>         ids int[],
>         primary key (lexeme, textdate)
>     );
>
> which would partition the invidx table on textdate (or some other suitable datum)

2. If a word is frequent, then a query with 'IN (select * from func())' may work slowly...

> if it is often too slow then creating a temp table and doing a plain join may be faster.

A table structure like invidx reduces this problem.

-- Teodor Sigaev E-mail: [EMAIL PROTECTED]

---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
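To make the inverted-index idea above concrete, here is a minimal illustrative sketch in Python rather than SQL (the helper names `build_inverted_index` and `prefix_search` are hypothetical, not part of tsearch): the index maps each lexeme to the ids of the documents containing it, and a prefix query is answered by unioning the id lists of every lexeme that starts with the prefix.

```python
# Sketch of the inverted-index approach described above:
# lexeme -> sorted list of document ids containing that lexeme.

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {lexeme: sorted list of doc ids}."""
    index = {}
    for doc_id, text in docs.items():
        for lexeme in set(text.lower().split()):
            index.setdefault(lexeme, set()).add(doc_id)
    return {lex: sorted(ids) for lex, ids in index.items()}

def prefix_search(index, prefix):
    """Union the id lists of every lexeme starting with prefix."""
    result = set()
    for lexeme, ids in index.items():
        if lexeme.startswith(prefix):
            result.update(ids)
    return sorted(result)

docs = {1: "the human condition", 2: "a huge humming sound", 3: "no match here"}
idx = build_inverted_index(docs)
print(prefix_search(idx, "hu"))   # -> [1, 2]
```

This also shows why the frequent-word case is the painful one: a common lexeme carries a long id list, which is exactly the 'IN (select * from func())' slowness mentioned above.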
Re: [HACKERS] Searching for substring with tsearch(1/2)
Teodor Sigaev wrote on Wed, 10.12.2003 at 11:20:

> Tsearch was never intended for prefix search, and the index structure doesn't support any kind of prefix or suffix matching. You can write an extension to tsearch that searches by prefix, but such a solution will not use the index, only a sequential scan.
>
> > How efficient would tsearch be for really big expressions, where 'hu%' would be expanded (using a btree word index on a one-column word table) to the tsearch equivalent of ( human or humanity or humming or huge or ..1000 words here... ) before passing the expression to tsearch?
>
> The GiST index of tsearch doesn't support prefix search, so it works only by seqscan, as we know :) Disk is much slower than the processor, so speed will be limited by disk.

I meant that the expansion of 'hu%' is done before and outside of tsearch, so the question is how efficient tsearch will be when searching for hundreds or thousands of words in one expression.

> > How hard (or sensible ;) would it be to create such an index using GiST? As proved by tsearch, GiST can cope well with many-to-many indexes.
>
> Sorry, I don't understand. Do you mean that GiST supports one heap tuple in several index tuples? If yes, then no :). GiST doesn't support this feature. I don't think that GiST can help in this situation.

but tsearch seems to support this, and tsearch uses GiST. Is this functionality added entirely by tsearch?

-- Hannu
Re: [HACKERS] Searching for substring with tsearch(1/2)
> I meant that the expansion of 'hu%' is done before and outside of tsearch, so the question is how efficient tsearch will be when searching for hundreds or thousands of words in one expression.

Ok, I see. The answer: badly. The index structure is a signature tree with a constant signature length, 2016 bits by default. A signature is made by hashing each word and setting bit number HASHVAL % 2016 to 1. So, if a query has many terms and all terms are ORed, then a lot of signatures are matched by the query. This means a lot of pages in the index will be read.

> > How hard (or sensible ;) would it be to create such an index using GiST? As proved by tsearch, GiST can cope well with many-to-many indexes.
>
> Sorry, I don't understand. Do you mean that GiST supports one heap tuple in several index tuples? If yes, then no :). GiST doesn't support this feature. I don't think that GiST can help in this situation.

> but tsearch seems to support this, and tsearch uses GiST. Is this functionality added entirely by tsearch?

No: one heap tuple, one index tuple. I'll try to explain the index structure used by tsearch (three levels just for example):

    Root page
        internal tuple 1 -> second level page 1
            internal tuple 1.1
            internal tuple 1.2
        internal tuple 2 -> second level page 2
            internal tuple 2.1
            internal tuple 2.2 -> third level (leaf) page 2.2
                leaf tuple 2.2.1 -> heap tuple
                leaf tuple 2.2.2 -> heap tuple

A leaf tuple contains one of two types of predicates:
1. just the lexemes (without position information);
2. if the stored size of the first type would be too big, the tuple stores a signature as described above.

An internal tuple contains the ORed (superimposed) signatures of its children.

-- Teodor Sigaev E-mail: [EMAIL PROTECTED]
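The signature scheme described above can be sketched in a few lines of Python (an illustrative sketch only: the real tsearch code uses its own internal hash function, for which `zlib.crc32` merely stands in here; 2016 is the default signature length mentioned in the thread):

```python
# Fixed-length superimposed signatures, as described above: each word
# sets bit (hash(word) % 2016), and an internal tuple's signature is
# the bitwise OR of its children's signatures.  A subtree can only be
# skipped when a query word's bit is NOT set, so with hundreds of
# OR'ed query terms nearly every subtree "may match" and must be read.

import zlib

SIG_BITS = 2016  # default signature length

def signature(words):
    sig = 0
    for w in words:
        # zlib.crc32 stands in for tsearch's internal hash (assumption)
        sig |= 1 << (zlib.crc32(w.encode()) % SIG_BITS)
    return sig

def may_match(node_sig, word):
    """False means the subtree definitely lacks the word and is skipped."""
    return (node_sig & signature([word])) != 0

doc_sig = signature(["human", "condition"])
parent_sig = doc_sig | signature(["huge"])   # internal tuple: OR of children
print(may_match(parent_sig, "human"))        # -> True
```

Because bits are only ever superimposed, the filter can produce false positives (hash collisions) but never false negatives, which is why an ORed many-term query degenerates toward reading the whole index.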
Re: [HACKERS] ERROR: Index pg_toast_8443892_index is not a btree
I agree about keeping it simple for the users. Anyway, if this turns out to be a bad problem with either the implementation or the users' operating system, it would be nice to know how to inspect it further. In my case this could also help debugging a postgres extension (postgis) which is involved in text-internal conversion and is showing heap corruption problems. The question now is: what does that message mean? Did a routine try to create an index and abandon its work before finishing?

--strk;

Jan Wieck wrote:
> Christopher Kings-Lynne wrote:
> > I couldn't agree more.
>
> Look at this very instance. He now found the right reindex command and the corrupted file is gone. We don't have the slightest clue what happened to that file. Was it truncated? Did some other process scribble around in the shared memory? How do you tell now?
>
> > The end user just could not care less. They want their machine running again as soon as is humanly possible, without going through a back-and-forth process of subscribing to some lists they don't care about, etc.
>
> I know, that's (unfortunately) true. Although it's not very farsighted, because better bug reports usually lead to better software in the next release.
>
> Jan
> --
> #==#
> # It's easier to get forgiveness for being wrong than for being right. #
> # Let's break this rule - forgive me.                                  #
> #== [EMAIL PROTECTED] #
Re: [HACKERS] 73.5 and uw 713
Hi Tom, At last I have a much better trace for the vacuum full bug. Can someone help me with this one?

    Core image of postmaster (process p1) created
    CORE IMAGE FILE [swapn in qsort.c]
    11 (segv code[SEGV_MAPERR] address[0x842])
    SIGNALED in p1 0xbffae03f (swapn+47:) movl (%esi),%eax
    debug> Stack trace for p1, program postmaster
    *[0] swapn(0x2, 0x831b758, 0x831b770) [0xbffae03f]
     [1] qst(0x80448cc, 0x831b758, 0x831b788) [0xbffadca2]
     [2] qsort(0x831b758, 0x18, 0x2, 0x80eb9f8) [0xbffae17f]
     [3] repair_frag(vacrelstats=0x83122bc, onerel=0x82cf56c, vacuum_pages=0x8046a64, fraged_pages=0x8046a54, nindexes=1, Irel=0x83672e0) [EMAIL PROTECTED]
     [4] full_vacuum_rel(onerel=0x82cf56c, vacstmt=0x83104b4) [EMAIL PROTECTED]
     [5] vacuum_rel(relid=16408, vacstmt=0x83104b4, expected_relkind=114 (or 'r')) [EMAIL PROTECTED]
     [6] vacuum(vacstmt=0x83104b4) [EMAIL PROTECTED]
     [7] ProcessUtility(parsetree=0x83104b4, dest=Remote, completionTag=) [EMAIL PROTECTED]@713]
     [8] pg_exec_query_string(query_string=0x831020c, dest=Remote, parse_context=0x830e204) [EMAIL PROTECTED]@789]
     [9] PostgresMain(argc=4, argv=0x8046d78, username=ohp) [EMAIL PROTECTED]@2013]
    [10] DoBackend(port=0x829e500) [EMAIL PROTECTED]
    [11] BackendStartup(port=0x829e500) [EMAIL PROTECTED]
    [12] ServerLoop( presumed: 0x1, 0x8297af8, 0x1) [EMAIL PROTECTED]
    [13] PostmasterMain(argc=1, argv=0x8297af8) [EMAIL PROTECTED]
    [14] main(argc=1, argv=0x8047c44, 0x8047c4c) [EMAIL PROTECTED]
    [15] _start() [0x806ad1c]
    debug>

On Mon, 8 Dec 2003, Tom Lane wrote:

> Date: Mon, 08 Dec 2003 14:03:42 -0500
> From: Tom Lane [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Cc: pgsql-hackers list [EMAIL PROTECTED]
> Subject: Re: [HACKERS] 73.5 and uw 713
>
> [EMAIL PROTECTED] writes:
> > I've upgraded my system from 7.3.4 to 7.3.5 yesterday and have already experienced a crash during vacuum full.
> > I haven't recompiled with debug yet, but it's a SIGSEGV in function repair_frag in vacuum.c.
>
> Considering that vacuum.c hasn't changed in that branch since 7.3beta4, it's highly unlikely that this represents a regression between 7.3.4 and 7.3.5. A pre-existing bug, maybe ...
>
> regards, tom lane

-- Olivier PRENANT            Tel: +33-5-61-50-97-00 (Work)
6, Chemin d'Harraud Turrou        +33-5-61-50-97-01 (Fax)
31190 AUTERIVE                    +33-6-07-63-80-64 (GSM)
FRANCE                        Email: [EMAIL PROTECTED]
--
Make your life a dream, make your dream a reality. (St Exupery)
[HACKERS] pljava revisited
Hi, I'm working on a new pl/java prototype that I hope will become production quality some time in the future. Before my project gets too far, I'd like to gather some input from other users. I've taken a slightly different approach than what seems to be the case for the other attempts that I've managed to dig up. Here are some highlights of my approach:

1. A new Java VM is spawned for each connection. I know that this will give a performance hit when a new connection is created. The alternative, however, implies that all calls become inter-process calls, which I think is a much worse scenario, especially since most modern environments today have some kind of connection pooling. Another reason is that the connections represent sessions, and those sessions get a very natural isolation using separate VMs. A third reason is that the current connection would become unavailable in a remote process (see #5).

2. There's no actual Java code in the body of a function, simply a reference to a static method. My reasoning is that when writing (and debugging) Java, you want to use your favorite IDE. Mixing Java with SQL just gets messy.

3. As opposed to Tcl, Python, and Perl, which for obvious reasons use strings, my pl/java will use native types wherever possible. A flag can be added to the function definition if real objects are preferred instead of primitives (motivated by the fact that the primitives cannot reflect NULL values).

4. The code is actually written using JNI and C++, but without any templates, no -style object references, no operator overloads, no external class libraries, etc. I use C++ simply to get better quality, readability and structure in the code.

5. I plan to write a JDBC layer using JNI on top of the SPI calls, to enable JDBC functionality on the current connection. Some things will be limited (begin/commit etc. will not be possible to do here, for instance).

Current status is that my first calls from Postgres to Java have been made. Lots of work remains.
What are your thoughts and ideas?

Thomas Hallgren
Re: [HACKERS] PostgreSQL port to pure Java?
Robert Treat wrote:

Someone did it, but it didn't catch fire. I think what will catch fire in a big way is plphp. Managers will like an all-PHP platform that is extremely capable and productive. Developers will enjoy PHP's natural syntax and agnostic approach to programming. PHP5, when it becomes production ready, will offer high-level language features that compete with Java, C#, and Object Pascal.

Merlin
Re: [HACKERS] ERROR: Index pg_toast_8443892_index is not a btree
strk [EMAIL PROTECTED] writes:
> The question now is: what does that message mean?

It means that the magic number that should be on the first page of the btree index isn't right. We can deduce that something has clobbered the first page of the index, but guessing what and how requires much more information.

regards, tom lane
Re: [HACKERS] pljava revisited
Thomas Hallgren wrote:
> 1. A new Java VM is spawned for each connection. I know that this will give a performance hit when a new connection is created. The alternative, however, implies that all calls become inter-process calls, which I think is a much worse scenario, especially since most modern environments today have some kind of connection pooling. Another reason is that the connections represent sessions, and those sessions get a very natural isolation using separate VMs. A third reason is that the current connection would become unavailable in a remote process (see #5).

Maybe on-demand might be better - if the particular backend doesn't need it, why incur the overhead?

> 2. There's no actual Java code in the body of a function, simply a reference to a static method. My reasoning is that when writing (and debugging) Java, you want to use your favorite IDE. Mixing Java with SQL just gets messy.

Perhaps an example or two might help me understand better how this would work.

> 3. As opposed to Tcl, Python, and Perl, which for obvious reasons use strings, my pl/java will use native types wherever possible. A flag can be added to the function definition if real objects are preferred instead of primitives (motivated by the fact that the primitives cannot reflect NULL values).
>
> 4. The code is actually written using JNI and C++, but without any templates, no -style object references, no operator overloads, no external class libraries, etc. I use C++ simply to get better quality, readability and structure in the code.

Other pl* (perl, python, tcl) languages have vanilla C glue code. It might be better to stick to this.
If you aren't using advanced C++ features, that shouldn't be too hard - well-structured C can be just as readable as well-structured C++. At the very lowest level, about the only things C++ buys you are the ability to declare variables in arbitrary places, and // style comments.

> 5. I plan to write a JDBC layer using JNI on top of the SPI calls, to enable JDBC functionality on the current connection. Some things will be limited (begin/commit etc. will not be possible to do here, for instance).

Again, examples would help me understand better. Is there a web page for your project?

cheers andrew
Re: [HACKERS] ERROR: Index pg_toast_8443892_index is not a btree
Tom Lane wrote:
> strk [EMAIL PROTECTED] writes:
> > The question now is: what does that message mean?
>
> It means that the magic number that should be on the first page of the btree index isn't right. We can deduce that something has clobbered the first page of the index, but guessing what and how requires much more information.

Clobbered or truncated. A zero-size index file causes the same message.

Jan
Re: [HACKERS] pljava revisited
On Dec 10, 2003, at 11:23 AM, Andrew Dunstan wrote:
> Thomas Hallgren wrote:
> > 1. A new Java VM is spawned for each connection. [...] A third reason is that the current connection would become unavailable in a remote process (see #5).
>
> Maybe on-demand might be better - if the particular backend doesn't need it, why incur the overhead?

I think a JVM per connection is going to add too much overhead, even if it's on-demand. Some platforms handle multiple JVMs better than others, but still: 25 or so individual JVMs is going to be a mess in terms of resource consumption. Start time/connect time will be an issue. Saying 'people use pools', while generally accurate, kind of sweeps the problem under the carpet instead of into the dust bin.

> > 4. The code is actually written using JNI and C++ [...]
>
> Other pl* (perl, python, tcl) languages have vanilla C glue code. It might be better to stick to this. If you aren't using advanced C++ features, that shouldn't be too hard - well-structured C can be just as readable as well-structured C++. At the very lowest level, about the only things C++ buys you are the ability to declare variables in arbitrary places, and // style comments.

Agreed. Given that the rest of the code base is C, I would imagine that the Powers that Be would frown a bit on merging C++ code in, and relegate it to contrib for eternity... Not knocking the idea, mind you - I think it would be great if it can be pulled off. I was thinking about it myself as a way to learn more of the backend code and scrape the thick layer of rust off of my C skills. I would like to see where you are with it.

Andrew Rawnsley
President
The Ravensfield Digital Resource Group, Ltd.
(740) 587-0114
www.ravensfield.com
Re: [HACKERS] pljava revisited
The JVM will be started on demand. Although I realize that one JVM per connection will consume a fair amount of resources, I still think it is the best solution. The description of this system must of course make it very clear that this is what happens, and ultimately provide the means to tune the JVMs as much as possible. I advocate this solution because I think that the people who have the primary interest in a pl/java will be those who write enterprise systems using Java, and J2EE systems are always equipped with connection pools.

But I'm of course open to other alternatives. Let's say that there's a JVM with a thread pool that the Postgres sessions will connect to using some kind of RPC. This implies that each call will have an overhead of at least 2 OS context switches. Compared to in-process calls, this will severely cripple the performance. How do you suggest that we circumvent this problem?

Another problem is that we would immediately lose the ability to use the current connection provided by the SPI interfaces. We can of course establish a back-channel to the original process, but that will incur even more performance hits. A third alternative is to establish brand new connections in the remote JVM. The problem then is to propagate the transaction context correctly. Albeit solvable, the performance using distributed transactions will be much worse than in-process. How do we solve this?

C++ or C is not a big issue. I might rewrite it into pure C. The main reason for C++ is to be able to use objects with virtual methods. I know how to do that in C too, but I don't quite agree that it's just as clean :-)

- thomas

> I think a JVM per connection is going to add too much overhead, even if it's on-demand. Some platforms handle multiple JVMs better than others, but still: 25 or so individual JVMs is going to be a mess in terms of resource consumption. Start time/connect time will be an issue. Saying 'people use pools', while generally accurate, kind of sweeps the problem under the carpet instead of into the dust bin.
Re: [HACKERS] pljava revisited
Thomas Hallgren wrote:
> The JVM will be started on demand. Although I realize that one JVM per connection will consume a fair amount of resources, I still think it is the best solution. [...] I advocate this solution because I think that the people who have the primary interest in a pl/java will be those who write enterprise systems using Java. J2EE systems are always equipped with connection pools.

Yes, but as was pointed out, even if I use connection pooling I would rather not have, say, 25 JVMs loaded if I can help it.

> But I'm of course open to other alternatives. Let's say that there's a JVM with a thread pool that the Postgres sessions will connect to using some kind of RPC. This implies that each call will have an overhead of at least 2 OS context switches. Compared to in-process calls, this will severely cripple the performance. How do you suggest that we circumvent this problem?

Context switches are not likely to be more expensive than loading an extra JVM, I suspect. Depending on your OS/hw they can be incredibly cheap, in fact.

> Another problem is that we would immediately lose the ability to use the current connection provided by the SPI interfaces. We can of course establish a back-channel to the original process, but that will incur even more performance hits. A third alternative is to establish brand new connections in the remote JVM. The problem then is to propagate the transaction context correctly. Albeit solvable, the performance using distributed transactions will be much worse than in-process. How do we solve this?

We are theorising ahead of data, somewhat. My suggestion would be to continue in the direction you are going and, later, when you can, stress test it. Ideally, if you then need to move to a shared JVM, this would be transparent to upper levels of the code.
> C++ or C is not a big issue. I might rewrite it into pure C. The main reason for C++ is to be able to use objects with virtual methods. I know how to do that in C too, but I don't quite agree that it's just as clean :-)

Maybe not, but it's what is used in the core Pg distribution. Go with the flow :-)

cheers andrew
Re: [HACKERS] pljava revisited
On Dec 10, 2003, at 1:51 PM, Andrew Dunstan wrote:
> Thomas Hallgren wrote:
> > The JVM will be started on demand. [...] J2EE systems are always equipped with connection pools.
>
> Yes, but as was pointed out, even if I use connection pooling I would rather not have, say, 25 JVMs loaded if I can help it.

It's also a bit of a solution by circumstance, rather than a solution by design.

> > But I'm of course open to other alternatives. Let's say that there's a JVM with a thread pool that the Postgres sessions will connect to using some kind of RPC. [...] How do you suggest that we circumvent this problem?

My comments here are pretty off the cuff. You've thought about this far more than I have.

> Context switches are not likely to be more expensive than loading an extra JVM, I suspect. Depending on your OS/hw they can be incredibly cheap, in fact.
>
> We are theorising ahead of data, somewhat. My suggestion would be to continue in the direction you are going and, later, when you can, stress test it. Ideally, if you then need to move to a shared JVM, this would be transparent to upper levels of the code.

Agreed - it sounds like you've done a fair amount of groundwork. I at least am interested in where you're going with it.

Andrew Rawnsley
President
The Ravensfield Digital Resource Group, Ltd.
(740) 587-0114
www.ravensfield.com
Re: [HACKERS] pljava revisited
Andrew Rawnsley wrote:
> > Other pl* (perl, python, tcl) languages have vanilla C glue code. It might be better to stick to this. If you aren't using advanced C++ features, that shouldn't be too hard - well-structured C can be just as readable as well-structured C++. At the very lowest level, about the only things C++ buys you are the ability to declare variables in arbitrary places, and // style comments.
>
> Agreed. Given that the rest of the code base is C, I would imagine that the Powers that Be would frown a bit on merging C++ code in, and relegate it to contrib for eternity...

It will probably have to live on GBorg right from the beginning anyway, so the Powers might not care at all. Thus far _all_ procedural languages are loadable modules. VM or not, I don't see why this one would be any different. That also answers the on-demand question to some extent, doesn't it?

Jan
Re: [HACKERS] pljava revisited
Andrew Dunstan [EMAIL PROTECTED] writes:
> Thomas Hallgren wrote:
> > C++ or C is not a big issue. I might rewrite it into pure C. The main reason for C++ is to be able to use objects with virtual methods. I know how to do that in C too, but I don't quite agree that it's just as clean :-)
>
> Maybe not, but it's what is used in the core Pg distribution. Go with the flow :-)

If you have any hope of someday seeing pljava merged into the main PG distribution, you had better stick to C. IMHO there would be essentially no chance of adopting a module that requires C++, simply because the additional configuration and portability work would be too much of a pain in the neck. libpq++ got heaved overboard largely because the autoconf burden for it was too high, and we're unlikely to look favorably on something that would make us put that back in. Of course, if you don't think pljava will ever become mainstream, this argument won't have much force for you ...

regards, tom lane
Re: [HACKERS] pljava revisited
On Wed, 2003-12-10 at 13:04, Jan Wieck wrote: Andrew Rawnsley wrote: Other pl* (perl, python, tcl) languages have vanilla C glue code. Might be better to stick to this. If you aren't using advanced C++ features that shouldn't be too hard - well structured C can be just as readable as well structured C++. At the very lowest level, about the only things C++ buys you are the ability to declare variables in arbitrary places, and // style comments. Agreed. Given that the rest of the code base is C, I would imagine that the Powers that Be would frown a bit on merging C++ code in, and relegate it to contrib for eternity... It will probably have to live on GBorg right from the beginning anyway, so the Powers might not care at all. Thus far _all_ procedural languages are loadable modules. VM or not, I don't see why this one would be any different. That also answers the on-demand question to some extent, doesn't it? Maybe I'm mixing concepts here, but didn't Joe Conway create the ability to do pl module loading on demand or on connection creation via GUC? ISTR he needed this due to R's overhead. If so, it seems this could be implemented both ways, with a recommendation on which is best to follow. Speaking of pl/R, I'd recommend anyone interested in developing pl's, whether enhancing old ones or creating new ones, check out the pl/R code on GBorg; it was written recently and is pretty advanced. Robert Treat -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] pljava revisited
--- Thomas Hallgren [EMAIL PROTECTED] wrote: The JVM will be started on-demand. Although I realize that one JVM per connection will consume a fair amount of resources, I still think it is the best solution. The description of this system must of course make it very clear that this is what happens and ultimately provide the means of tuning the JVMs as much as possible. I think the new 1.5 JDK Tiger (to be released soon) will feature the shared VM option, i.e. one JVM could be used to run multiple and independent apps. Maybe it is worth looking into this. I advocate this solution because I think that the people who have the primary interest in a pl/java will be those who write enterprise systems using Java. J2EE systems are always equipped with connection pools. IMHO, pl/java would be a great feature for PostgreSQL to have. It would increase PostgreSQL's chances of being considered as an enterprise RDBMS, since most enterprise apps are written in Java nowadays. Regards, ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] ERROR: Index pg_toast_8443892_index is not a btree
On Tue, 2003-12-09 at 20:19, Jan Wieck wrote: Christopher Kings-Lynne wrote: I couldn't agree more. Look at this very instance. He now found the right reindex command and the corrupted file is gone. We don't have the slightest clue what happened to that file. Was it truncated? Did some other process scribble around in the shared memory? How do you tell now? The end user just could not care less. They want their machine running again as soon as is humanly possible without going through a back and forth process of subscribing to some lists they don't care about, etc. I know, that's (unfortunately) true. Although it's not very farsighted, because better bug reports usually lead to better software in the next release. HINT: You might be able to solve this problem by running the REINDEX command. Of course if you do that you'll destroy all evidence of what caused the problem, possibly forcing this problem on other users in the future because you were unwilling to help us to improve the software. But we understand, it's not like we wrote an entire database system for you... oh wait, we did. :-) Robert Treat -- Build A Brighter Lamp :: Linux Apache {middleware} PostgreSQL ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] pljava revisited
Two comments. Context switches are of course much cheaper than loading a JVM. No argument there. The point is that the JVM is loaded once for each connection (when the connection makes the first call to a Java function). Millions of calls may follow that reuse the same JVM. Each of those calls will suffer from context switches if the JVM is remote. A 1 to a million (or more) ratio is in fact very likely when function calls are used in predicates and/or projections of selects on larger tables. Regarding C++, as I said, no big deal. I'll change it for the reasons mentioned before I release my first cut. Thanks, - thomas ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
[HACKERS] Canonicalization of WHERE clauses considered harmful
I've been thinking about Josh's recent complaint about poor planning of queries like SELECT t1.a, t2.b FROM t1, t2 WHERE t1.a = t2.a AND ( ( t1.c = x AND t1.f IN (m, n, o) AND t2.d = v AND t2.e BETWEEN j AND k ) OR ( t1.c = y AND t1.f IN (n, o, p) AND t2.d = v AND t2.e BETWEEN k AND h ) OR ( t1.c = z AND t1.f IN (p, q) AND t2.d = w AND t2.e BETWEEN k AND h ) ) (which is an abstraction of Q19 in the TPC-R benchmark). I believe that most of the blame for this problem can be laid at the feet of optimizer/prep/prepqual.c, which attempts to transform the WHERE clause into a canonical form (either CNF AND-of-ORs, or DNF OR-of-ANDs, according to some heuristics that are not very compelling). I am not certain whether prepqual will actually fire on this query (still waiting for those EXPLAIN results, Josh), but if it doesn't then it'd really be disastrous, and if it does then it's still not going to improve matters much, because forcing this WHERE into pure AND-of-ORs form isn't very helpful. I've worked on prepqual.c some over the years, but it never occurred to me to question the fundamental premise of creating a normal-form WHERE clause before. Yet as I look at this example, I'm wondering why we should try to. Josh lied a little bit about the TPC-R query: the form actually given in the TPC spec is WHERE ( t1.a = t2.a AND t1.c = x AND t1.f IN (m, n, o) AND t2.d = v AND t2.e BETWEEN j AND k ) OR ( t1.a = t2.a AND t1.c = y AND t1.f IN (n, o, p) AND t2.d = v AND t2.e BETWEEN k AND h ) etc It is absolutely critical that the optimizer recognize that the t1.a = t2.a clause is common to all the OR branches and can be factored out to give the form Josh showed. Without that, you don't have any equijoin clause and you're going to end up with a horrid nested-loop join. 
In the TPC query per spec, some of the other single-table restrictions are also common to all the OR branches (ie, x, y, etc are not all different constants) and should be pulled out so that they can be applied as restriction clauses to the input tables before the join. In other words, we'd like the optimizer to transform (a AND b) OR (a AND c) to a AND (b OR c) Currently, this is accomplished by the roundabout method of converting the WHERE clause to CNF (AND-of-ORs) and then simplifying duplicate sub-clauses within an OR: (a AND b) OR (a AND c) expands by repeated application of the distributive law to (a OR a) AND (a OR c) AND (b OR a) AND (b OR c) and then qual_cleanup notices that (a OR a) is redundant, leaving a AND (a OR c) AND (b OR a) AND (b OR c) So we manage to pull out a all right, but we've left the query cluttered with additional, redundant clauses --- there is no logic that will notice that this could be simplified to a AND (b OR c) The extra clauses make for useless work during planning and during execution; they also screw up selectivity estimates (since the selectivity estimator doesn't realize they are redundant). This is bad. So as a mechanism for pulling out duplicate clauses, canonicalization sucks. Is there anything else that it does for us? After some thought the only possible benefit I can see is that it can sometimes manufacture restriction clauses from what otherwise would be join clauses. In the TPC-R example, once we've pulled out duplicate clauses we'd still be left with an OR of ANDed conditions, some on t1 and some on t2. If we leave this as-is, there's not much to be done with it except apply it as a join filter after the t1/t2 join is made. However, if we force to CNF form then some of the resulting sub-clauses will refer only to t1, some only to t2, and some to both tables. 
The ones that mention only t1 or only t2 would be usable as restriction clauses during the table scans, which is a big help in cutting the size of the required join. Also they could possibly be used as indexscan qualifications (I'm not sure whether TPC-R expects relevant indexes to be in place). So it seems to me that a reasonable heuristic for deciding whether to CNF-ify an OR clause is to check whether any of the resulting subclauses will refer to fewer relations than the original OR clause does. If not, we may as well leave it as an OR clause and avoid generating redundant clauses. There is no such test at present. Pulling out duplicate clauses from the OR arms would probably be better done directly than by relying on CNF-ification. Another consideration is that the planner already has logic for generating multi-indexscans by pulling relevant conditions out of OR-of-AND trees. So, at least for the case where there's a relevant index, we do not need any CNF conversion to be
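The factor-out-the-common-conjuncts step described above can be done directly on the OR-of-ANDs form, without a CNF round trip. Here is a minimal sketch of that idea (plain illustrative Python, not PostgreSQL source; the clause strings simply mirror the TPC-R style example from this thread):

```python
# Sketch: pull conjuncts common to every arm of an OR-of-ANDs out in front,
# i.e. (a AND b) OR (a AND c)  ->  a AND (b OR c).
# Each OR arm is modelled as a set of atomic clause strings.

def factor_common(or_arms):
    arms = [set(arm) for arm in or_arms]
    common = set.intersection(*arms)          # clauses shared by every arm
    residue = [arm - common for arm in arms]  # what stays under the OR
    return common, residue

# The three OR arms of the TPC-R style WHERE clause (abbreviated):
or_arms = [
    {"t1.a = t2.a", "t1.c = x", "t2.d = v"},
    {"t1.a = t2.a", "t1.c = y", "t2.d = v"},
    {"t1.a = t2.a", "t1.c = z", "t2.d = w"},
]
common, residue = factor_common(or_arms)
print(sorted(common))  # ['t1.a = t2.a'] -- the equijoin clause is recovered
```

Note that t2.d = v appears in only two of the three arms, so a plain intersection leaves it under the OR; only clauses present in every branch can be hoisted out as top-level restrictions.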
Re: [HACKERS] Canonicalization of WHERE clauses considered harmful
On Wed, Dec 10, 2003 at 16:54:54 -0500, Tom Lane [EMAIL PROTECTED] wrote: In other words, we'd like the optimizer to transform (a AND b) OR (a AND c) to a AND (b OR c) Currently, this is accomplished by the roundabout method of converting the WHERE clause to CNF (AND-of-ORs) and then simplifying duplicate sub-clauses within an OR: (a AND b) OR (a AND c) expands by repeated application of the distributive law to (a OR a) AND (a OR c) AND (b OR a) AND (b OR c) and then qual_cleanup notices that (a OR a) is redundant, leaving a AND (a OR c) AND (b OR a) AND (b OR c) So we manage to pull out a all right, but we've left the query cluttered with additional, redundant clauses --- there is no logic that will notice that this could be simplified to a AND (b OR c) The extra clauses make for useless work during planning and during execution; they also screw up selectivity estimates (since the selectivity estimator doesn't realize they are redundant). This is bad. Comments? Shouldn't it be possible to simplify a AND (a OR c) AND (b OR a) AND (b OR c) to a AND (b OR c) using a AND (a OR x) == a ? ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Canonicalization of WHERE clauses considered harmful
Bruno Wolff III [EMAIL PROTECTED] writes: Shouldn't it be possible to simplify a AND (a OR c) AND (b OR a) AND (b OR c) to a AND (b or c) using a AND (a OR x) == a That would be one possible response, but it strikes me as a band-aid fix. It would add quite a bit of overhead (looking to see if any OR subclauses match any top-level clauses) on top of the rather expensive equality checks qual_cleanup() is already making. Another problem is that sometimes prepqual decides not to attempt CNF-ification at all, which means that the ability to pull out duplicate subclauses is lost altogether. I think it'd be cleaner to expend the cycles on a direct check for duplicate subclauses, which we could apply independently of any decision to convert what remains to CNF form. We've had lots of problems with prepqual before, which is why those heuristics about whether to try for CNF or DNF are in there; but I've never been very happy about them. What I'm on about now is the idea that maybe the whole problem should be thrown overboard ... regards, tom lane ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Canonicalization of WHERE clauses considered harmful
On Wed, Dec 10, 2003 at 04:54:54PM -0500, Tom Lane wrote: Currently, this is accomplished by the roundabout method of converting the WHERE clause to CNF (AND-of-ORs) and then simplifying duplicate sub-clauses within an OR: (a AND b) OR (a AND c) expands by repeated application of the distributive law to (a OR a) AND (a OR c) AND (b OR a) AND (b OR c) This is wrong. What would be true however is: (a OR b) AND (b OR c) = (a AND b) OR (a AND c) OR (b AND b) OR (b AND c) (Replace AND by * and OR by +) Kurt ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] Canonicalization of WHERE clauses considered harmful
On Wed, Dec 10, 2003 at 05:35:11PM -0500, Tom Lane wrote: Kurt Roeckx [EMAIL PROTECTED] writes: On Wed, Dec 10, 2003 at 04:54:54PM -0500, Tom Lane wrote: (a AND b) OR (a AND c) expands by repeated application of the distributive law to (a OR a) AND (a OR c) AND (b OR a) AND (b OR c) This is wrong. I don't think so. The distributive law is a OR (b AND c) == (a OR b) AND (a OR c) Oh, I was reading it wrong. So basically it makes a + bc = (a+b)*(a+c) = a + ab + ac + bc And ab + ac = a * (a+b) * (a+c) * (b+c) = ab + abc + ac (This is so much easier to read than using AND and OR.) Kurt ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
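For anyone who wants to convince themselves, the identities being tossed around in this thread are easy to brute-force over all truth assignments. A throwaway check (not part of any patch):

```python
from itertools import product

# Exhaustively check the boolean identities from this thread over all
# assignments of a, b, c (writing OR as "or", AND as "and").
for a, b, c in product([False, True], repeat=3):
    # distributive law: a OR (b AND c) == (a OR b) AND (a OR c)
    assert (a or (b and c)) == ((a or b) and (a or c))
    # absorption: a AND (a OR c) == a
    assert (a and (a or c)) == a
    # the net simplification proposed upthread:
    # a AND (a OR c) AND (b OR a) AND (b OR c) == a AND (b OR c)
    assert (a and (a or c) and (b or a) and (b or c)) == (a and (b or c))
print("all identities hold on every assignment")
```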
[HACKERS] CSV hack
Kind people, I've come up with yet another little hack, this time for turning 1-d arrays into CSV format. It's very handy in conjunction with the array_accum aggregate (can this be made a standard aggregate?) in http://developer.postgresql.org/docs/postgres/xaggr.html. Here 'tis... CREATE OR REPLACE FUNCTION csv(anyarray) RETURNS TEXT AS 'DECLARE in_array ALIAS FOR $1; temp_string TEXT; quoted_string TEXT; i INTEGER; BEGIN FOR i IN array_lower(in_array, 1)..array_upper(in_array, 1) LOOP IF in_array[i]::TEXT ~ ''"'' THEN temp_string := ''"'' || replace(in_array[i]::TEXT, ''"'', ''""'') || ''"''; ELSE temp_string := in_array[i]::TEXT; END IF; IF i = array_lower(in_array, 1) THEN quoted_string := temp_string; ELSE quoted_string := quoted_string || '','' || temp_string; END IF; END LOOP; RETURN quoted_string; END; ' LANGUAGE 'plpgsql'; Those DBD::Pg users among us who'd like to be able to bind_columns to postgresql arrays may have a leg up with Text::CSV_XS. Other middleware should be able to handle such things, too. :) Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 510 893 6100cell: +1 415 235 3778 ---(end of broadcast)--- TIP 4: Don't 'kill -9' the postmaster
Re: [HACKERS] [pgadmin-support] Cannot add an column of type serial
David Sigeti wrote: At 12:21 PM 12/10/2003 +0100, Andreas Pflug wrote: David Sigeti wrote: I am using pgAdmin III 1.0.2 with PostgreSQL 7.4 under W2K SP4 and Cygwin (current as of 2 or 3 weeks ago). The server is running locally. If I attempt to add a column of type serial or bigserial to a table, the operation fails with a message telling me that adding a column with a default value is not implemented. What is really weird is that one can, in fact, add a column with a default value, like an int4 with a default value of 0. pgAdmin breaks this into two steps for you; have a look at the SQL page. The problem does not arise if the serial (or bigserial) column is created when the table is created (as a part of CREATE TABLE). serial isn't a true datatype but a keyword that can be used only at creation time. Still, pgAdmin3 should create the commands necessary for this to make life easier; I added this as a TODO item. Adding this functionality would be great, but I suggest that the error message be changed right away as it is seriously misleading. The error message originates from the backend. I agree that this might be confusing to newbies, who don't know that serial actually is int4 default nextval('...') (currently pgAdmin3 doesn't either :-) I'm forwarding this to pgsql-hackers. Regards, Andreas ---(end of broadcast)--- TIP 6: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] postgres panic error
Yurgis, On Tue, Dec 09, 2003 at 04:18:06PM -0800, Yurgis Baykshtis wrote: I tried to raise the question on pg-hackers forum and cygwin forum (regarding readdir() misbehavior) but could not get any help so far :( If you can produce a minimal test case that reproduces the problem, then one of the core Cygwin developers might be more willing to attempt to fix it. Jason -- PGP/GPG Key: http://www.tishler.net/jason/pubkey.asc or key servers Fingerprint: 7A73 1405 7F2B E669 C19D 8784 1AFD E4CC ECF4 8EF6 ---(end of broadcast)--- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faqs/FAQ.html
Re: [HACKERS] CSV hack
David Fetter wrote: I've come up with yet another little hack, this time for turning 1-d arrays into CSV format. You mean like this (which is new in 7.4)? regression=# select array_to_string (array[1,2,3], ','); array_to_string - 1,2,3 (1 row) See: http://www.postgresql.org/docs/current/static/functions-array.html It's very handy in conjunction with the array_accum aggregate (can this be made a standard aggregate?) in http://developer.postgresql.org/docs/postgres/xaggr.html. Early in the 7.4 dev cycle array_accum() was actually in cvs as a built-in C function (and it still does exist in PL/R as such). But toward the end of the cycle an objection was raised and it was removed. Search the archives in the May/June 2003 timeframe. Joe ---(end of broadcast)--- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [Dbdpg-general] Re: [HACKERS] CSV hack
On Wed, Dec 10, 2003 at 05:08:19PM -0800, Joe Conway wrote: David Fetter wrote: I've come up with yet another little hack, this time for turning 1-d arrays into CSV format. You mean like this (which is new in 7.4)? regression=# select array_to_string (array[1,2,3], ','); array_to_string - 1,2,3 (1 row) Not quite. The CSV thing quotes the way you'd see in CSV files, as in SELECT csv(array['1'::text, '2'::text, '3'::text]); csv --- 1,2,3 See: http://www.postgresql.org/docs/current/static/functions-array.html It's great, and I use it :) It's very handy in conjunction with the array_accum aggregate (can this be made a standard aggregate?) in http://developer.postgresql.org/docs/postgres/xaggr.html. Early in the 7.4 dev cycle array_accum() was actually in cvs as a built-in C function (and it still does exist in PL/R as such). But toward the end of the cycle an objection was raised and it was removed. Search the archives in the May/June 2003 timeframe. Um, OK. It would be kinda handy, tho. Cheers, D -- David Fetter [EMAIL PROTECTED] http://fetter.org/ phone: +1 510 893 6100cell: +1 415 235 3778 ---(end of broadcast)--- TIP 2: you can get off all lists at once with the unregister command (send unregister YourEmailAddressHere to [EMAIL PROTECTED])
Re: [HACKERS] CSV hack
You also need to quote values containing the separator. cheers andrew (who used to set creating CSV as a programming exercise - students almost never get it right) David Fetter wrote: Kind people, I've come up with yet another little hack, this time for turning 1-d arrays into CSV format. It's very handy in conjunction with the array_accum aggregate (can this be made a standard aggregate?) in http://developer.postgresql.org/docs/postgres/xaggr.html. Here 'tis... CREATE OR REPLACE FUNCTION csv(anyarray) RETURNS TEXT AS 'DECLARE in_array ALIAS FOR $1; temp_string TEXT; quoted_string TEXT; i INTEGER; BEGIN FOR i IN array_lower(in_array, 1)..array_upper(in_array, 1) LOOP IF in_array[i]::TEXT ~ ''"'' THEN temp_string := ''"'' || replace(in_array[i]::TEXT, ''"'', ''""'') || ''"''; ELSE temp_string := in_array[i]::TEXT; END IF; IF i = array_lower(in_array, 1) THEN quoted_string := temp_string; ELSE quoted_string := quoted_string || '','' || temp_string; END IF; END LOOP; RETURN quoted_string; END; ' LANGUAGE 'plpgsql'; Those DBD::Pg users among us who'd like to be able to bind_columns to postgresql arrays may have a leg up with Text::CSV_XS. Other middleware should be able to handle such things, too. :) Cheers, D ---(end of broadcast)--- TIP 9: the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
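Andrew's rule of thumb can be made concrete: a field needs quoting if it contains the separator, a quote character, or a line break, and any embedded quotes are doubled. A small sketch of that rule (Python here rather than plpgsql, cross-checked against the stdlib csv writer):

```python
import csv
import io

# Quote a CSV field only when needed: it contains the separator, a double
# quote, or a newline. Embedded double quotes are doubled.
def csv_field(value, sep=","):
    if any(ch in value for ch in (sep, '"', "\n", "\r")):
        return '"' + value.replace('"', '""') + '"'
    return value

row = ["plain", 'say "hi"', "a,b"]
line = ",".join(csv_field(v) for v in row)
print(line)  # plain,"say ""hi""","a,b"

# The stdlib writer with QUOTE_MINIMAL applies the same quoting rule.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="").writerow(row)
assert buf.getvalue() == line
```

This is exactly the case students miss: a comma inside a field must force quoting even when no quote character is present, or the consumer splits the record in the wrong place.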