Re: [HACKERS] Questions about indexes with text_pattern_ops
Kaare Rasmussen [EMAIL PROTECTED] writes:

> Hi
>
> The database is initialized with utf8, so in order for LIKE to use the
> index on a text field, I used text_pattern_ops when I created it. So far
> so good. It's in the documentation, but there's no explanation of why
> this index will only work for LIKE searches. How come I have to have two
> different indexes if I want to give Postgres the ability to choose index
> scan over seq scan on LIKE and non-LIKE searches?

Because in non-C locales (which you're almost certainly using if you're
using UTF8) the ordering which the normal text operators use can be quite
complex. Just as an example, most locales treat spaces as almost entirely
insignificant, so no range of that ordering can reliably match a prefix
LIKE pattern. The text_pattern_ops operators use simple
character-by-character ordering, which is useful for LIKE but not for
regular < and > comparisons. They're just two different orderings.

> Also, when I tried to create the index as a partial one (avoiding the
> 95% of entries with empty strings), PostgreSQL chooses to use seq scan.
> This sounds counterintuitive to me.
>
>   CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b <> '';
>
> This is 8.2.6.

Hm, for a simple = or <> I think it doesn't matter which operator class
you use. For < or > it would produce different answers. Postgres isn't
clever enough to notice that this is equivalent though, so I think you
would have to do something like (untested):

  CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ '';

That uses the same operator that the LIKE clause will use for the index
range.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org
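To make the two-orderings point concrete, here is a sketch (untested; the
index names are made up, not from Kaare's schema):

```sql
-- One column, two indexes, one per ordering (hypothetical names):
CREATE INDEX a_b_pattern ON a (b text_pattern_ops);  -- char-by-char ordering
CREATE INDEX a_b_locale  ON a (b);                   -- locale collation ordering

-- The pattern-ops index can satisfy prefix LIKE searches:
--   SELECT * FROM a WHERE b LIKE 'foo%';
-- while ordinary comparisons and sorts need the locale index:
--   SELECT * FROM a WHERE b < 'foo' ORDER BY b;
```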
Re: [HACKERS] Questions about indexes with text_pattern_ops
Tom Lane [EMAIL PROTECTED] writes:

> Gregory Stark [EMAIL PROTECTED] writes:
>> Hm, for a simple = or <> I think it doesn't matter which operator class
>> you use. For < or > it would produce different answers. Postgres isn't
>> clever enough to notice that this is equivalent though, so I think you
>> would have to do something like (untested):
>>
>>   CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ '';
>>
>> That uses the same operator that the LIKE clause will use for the index
>> range.
>
> I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any
> reason why those slots in the pattern_ops classes can't be filled by the
> plain = and <> operators. (There *was* a reason when they were first
> invented --- but now that texteq will only return true for exact bitwise
> match, I think it's OK to assume these are equivalent.)

The only question is whether we'll keep that forever. I thought it was a
good idea at the time, but I'm starting to wonder about the implications
for multi-key indexes.

> In the meantime, though, I think the only way that Kaare's query can use
> that index is if he writes
>
>   WHERE b LIKE 'whatever' AND b <> '';
>
> (with whatever spelling the index predicate has). There is not anything
> in the predicate proving machinery that knows enough about LIKE to be
> able to show that b LIKE 'whatever' implies b <> ''.

I was thinking that the inequalities that the LIKE index scan introduces
would imply the inequality. I take it we generate those inequalities too
late in the planning process to use them for other planning?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org
Re: [HACKERS] Questions about indexes with text_pattern_ops
Gregory Stark [EMAIL PROTECTED] writes:

> Hm, for a simple = or <> I think it doesn't matter which operator class
> you use. For < or > it would produce different answers. Postgres isn't
> clever enough to notice that this is equivalent though, so I think you
> would have to do something like (untested):
>
>   CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ '';
>
> That uses the same operator that the LIKE clause will use for the index
> range.

I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any
reason why those slots in the pattern_ops classes can't be filled by the
plain = and <> operators. (There *was* a reason when they were first
invented --- but now that texteq will only return true for exact bitwise
match, I think it's OK to assume these are equivalent.)

In the meantime, though, I think the only way that Kaare's query can use
that index is if he writes

  WHERE b LIKE 'whatever' AND b <> '';

(with whatever spelling the index predicate has). There is not anything
in the predicate proving machinery that knows enough about LIKE to be
able to show that b LIKE 'whatever' implies b <> ''.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq
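Put together, Tom's workaround would look something like this (untested;
it assumes the partial index was created with the pattern-ops spelling of
the predicate, as in Gregory's suggestion):

```sql
-- Partial index whose predicate uses the pattern-ops operator:
CREATE INDEX new_index ON a (b text_pattern_ops) WHERE b ~<>~ '';

-- The query must repeat the predicate explicitly, since the planner
-- cannot prove that the LIKE alone implies it:
SELECT * FROM a WHERE b LIKE 'whatever%' AND b ~<>~ '';
```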
Re: [HACKERS] Questions about indexes with text_pattern_ops
Gregory Stark [EMAIL PROTECTED] writes:

> Tom Lane [EMAIL PROTECTED] writes:
>> I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer any
>> reason why those slots in the pattern_ops classes can't be filled by
>> the plain = and <> operators. (There *was* a reason when they were
>> first invented --- but now that texteq will only return true for exact
>> bitwise match, I think it's OK to assume these are equivalent.)
>
> The only question is whether we'll keep that forever. I thought it was a
> good idea at the time, but I'm starting to wonder about the implications
> for multi-key indexes.

How so? If you think this change is a bad idea you'd better speak up PDQ.

>> In the meantime, though, I think the only way that Kaare's query can
>> use that index is if he writes WHERE b LIKE 'whatever' AND b <> '';
>> (with whatever spelling the index predicate has). There is not anything
>> in the predicate proving machinery that knows enough about LIKE to be
>> able to show that b LIKE 'whatever' implies b <> ''.
>
> I was thinking that the inequalities that the LIKE index scan introduces
> would imply the inequality. I take it we generate those inequalities too
> late in the planning process to use them for other planning?

Hmm, good point. [ experiments... ] Yeah, it seems we don't reconsider
partial indexes after those clauses have been generated. Not sure how
expensive it'd be to change that.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Questions about indexes with text_pattern_ops
Tom Lane [EMAIL PROTECTED] writes:

> Gregory Stark [EMAIL PROTECTED] writes:
>> Tom Lane [EMAIL PROTECTED] writes:
>>> I'm intending to get rid of ~=~ and ~<>~ for 8.4; there's no longer
>>> any reason why those slots in the pattern_ops classes can't be filled
>>> by the plain = and <> operators. (There *was* a reason when they were
>>> first invented --- but now that texteq will only return true for exact
>>> bitwise match, I think it's OK to assume these are equivalent.)
>>
>> The only question is whether we'll keep that forever. I thought it was
>> a good idea at the time, but I'm starting to wonder about the
>> implications for multi-key indexes.
>
> How so? If you think this change is a bad idea you'd better speak up PDQ.

Well, I think it's fine for 'foo ' != 'foo' even if they sort similarly.
But I'm not sure it makes sense for 'foo ','a' to sort after 'foo','b' if
the locale says that 'foo ' should compare equal to 'foo' and 'a' before
'b'. I'm starting to think that equality and "sorts as the same" are not
the same operator at all. On the other hand, it may not be worth the
complexity that might bring.

>> I was thinking that the inequalities that the LIKE index scan
>> introduces would imply the inequality. I take it we generate those
>> inequalities too late in the planning process to use them for other
>> planning?
>
> Hmm, good point. [ experiments... ] Yeah, it seems we don't reconsider
> partial indexes after those clauses have been generated. Not sure how
> expensive it'd be to change that.

Perhaps we should always generate those inequalities even if there's no
index that can use them. Then calculate the regexp selectivity based only
on the part of the regexp after the prefix. That might also help
constraint exclusion. Maybe merge joins?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org
Re: [HACKERS] Questions about indexes with text_pattern_ops
Gregory Stark [EMAIL PROTECTED] writes:

> Tom Lane [EMAIL PROTECTED] writes:
>> How so? If you think this change is a bad idea you'd better speak up
>> PDQ.
>
> Well, I think it's fine for 'foo ' != 'foo' even if they sort similarly.
> But I'm not sure it makes sense for 'foo ','a' to sort after 'foo','b'
> if the locale says that 'foo ' should compare equal to 'foo' and 'a'
> before 'b'.

I don't think we can concern ourselves with that; it would require
allowing different columns of an index to interact, which would be
impossibly messy. What's more, it'd destroy the property that a btree
index is sorted by its leading column(s) as well as by all its columns.

> Perhaps we should always generate those inequalities even if there's no
> index that can use them.

Hmmm ... we intentionally don't do that, but the constraint exclusion
code might be a sufficient reason to reconsider.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend
Re: [HACKERS] Questions about indexes with text_pattern_ops
Tom Lane [EMAIL PROTECTED] writes:

> Gregory Stark [EMAIL PROTECTED] writes:
>> Tom Lane [EMAIL PROTECTED] writes:
>>> How so? If you think this change is a bad idea you'd better speak up
>>> PDQ.
>>
>> Well, I think it's fine for 'foo ' != 'foo' even if they sort
>> similarly. But I'm not sure it makes sense for 'foo ','a' to sort after
>> 'foo','b' if the locale says that 'foo ' should compare equal to 'foo'
>> and 'a' before 'b'.
>
> I don't think we can concern ourselves with that; it would require
> allowing different columns of an index to interact, which would be
> impossibly messy. What's more, it'd destroy the property that a btree
> index is sorted by its leading column(s) as well as by all its columns.

Well, I was thinking we might have to separate the equality operators
from the btree opclass. Equals would be a stricter property than "sorts
the same". It would be mighty strange to have values which compare
unequal but are neither < nor >, though. Or which compare equal but also
compare < or >. It might be a little less surprising if we invent a new
operator === for "actually the same" and have == report whether two
objects sort as equals. But I'm not sure our experience with Turkish
doesn't show that that will still surprise people.

It may be more right in an abstract ideal world -- the reality is that
text collation is annoyingly complex. But this may be a case where we can
get away with just eliding this hassle.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL
  training!

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Questions about indexes with text_pattern_ops
Gregory Stark [EMAIL PROTECTED] writes:

> It may be more right in an abstract ideal world -- the reality is that
> text collation is annoyingly complex. But this may be a case where we
> can get away with just eliding this hassle.

If anyone actually complains about it, I think we can point to the SQL
spec, which unambiguously says that a multicolumn sort key is considered
one column at a time.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
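The one-column-at-a-time rule is easy to see with ordinary lexicographic
tuple comparison. Here is a small Python sketch of Gregory's
'foo ','a' vs 'foo','b' example, where stripping spaces stands in for a
locale that treats spaces as insignificant (the key function is an
illustrative assumption, not how any real locale is implemented):

```python
rows = [('foo ', 'a'), ('foo', 'b')]

# Bytewise, column-at-a-time (the SQL-spec behaviour): the first column
# decides, and 'foo' < 'foo ' since it is a proper prefix.
bytewise = sorted(rows)
print(bytewise)            # [('foo', 'b'), ('foo ', 'a')]

# Under a space-insensitive collation the first columns rank equal, so
# the second column decides and the order flips.
def space_insensitive_key(row):
    return tuple(col.replace(' ', '') for col in row)

collated = sorted(rows, key=space_insensitive_key)
print(collated)            # [('foo ', 'a'), ('foo', 'b')]
```

The two sorts disagree about which row comes first, which is exactly the
multi-key-index hazard Gregory was worried about.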
Re: [HACKERS] Questions about indexes?
Ryan Bradetich said:

> the table would look like:
>
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
> 2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
>
> etc...
>
> So I do not need the anomaly to be part of the index, I only need it to
> guarantee that the rows are unique. I agree with you, that I would not
> normally add the anomaly to the index, except for the unique row
> requirement. Thinking about it now, maybe I should guarantee unique rows
> via a check constraint...
>
> Thanks for making me think about this in a different way!

(sorry this is a bit long)

Ryan, I use somewhat similarly structured data (archived records of
various events), and when the database was set up (back when this baby
was called postgres95), I too used indexes on all possible fields. My
database consists of an 'operations' table, which holds data for the last
x days (for example), and several tables with archived records (per month
or per year -- see later). The operations table can have frequent
updates, which add new data. Data is never modified, but lookups are
frequent. The archived tables are generated once and forever from the
operations table (possibly merging in the future, but I haven't yet made
up my mind on this) -- after that, access is read-only, although
sufficiently frequent.

What I found over the many years of operating this database on different
PostgreSQL versions and hardware is that indexes have considerable cost.
:) So does the need to not miss anything from the operations table (that
is, collect data from many places and have it all in there). I ended up
with only a few indexes on the operations table, because the processes
that fill it up do minimal lookups to see if data is already in the
table, and if not, do inserts. Then at regular intervals the table is
cleaned up -- that is, a process to remove the duplicates is run.

This unfortunately costs OIDs, but I found no other reasonable way to do
the fast inserts. Perhaps the best way is to create the table WITHOUT
OIDS (but wouldn't this still waste OIDs?), use COPY, and then clean
afterwards?

The archived tables are generated, then cleaned up. Then, as Tom
suggested, indexes are put on the archived tables, only for the fields
that are used in queries. Once such a table is created, no duplicated
data will exist, as nothing more is inserted into it. Therefore there is
no need for UNIQUE index enforcement.

If you need to have one large 'history' table, then perhaps you will just
have to do (slow :) selects for each record before each insert, or just
insert the data and then run the cleanup process.

Daniel

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Questions about indexes?
> I ended up with only a few indexes on the operations table, because the
> processes that fill it up do minimal lookups to see if data is already
> in the table, and if not, do inserts. Then at regular intervals the
> table is cleaned up -- that is, a process to remove the duplicates is
> run. This unfortunately costs OIDs, but I found no other reasonable way
> to do the fast inserts. Perhaps the best way is to create the table
> without OIDs (but wouldn't this still waste OIDs?), use COPY, and then
> clean afterwards?

No, WITHOUT OIDS is implemented specifically to not waste OIDs.

Chris

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org
Re: [HACKERS] Questions about indexes?
Ryan Bradetich [EMAIL PROTECTED] writes:

> the table would look like:
>
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
> 1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.

Ah, I see your point now.

(Thinks: what about separating the anomaly column into an identifier and
a complaint column:

1 | Mon Feb 17 00:34:24 MST 2003 | p101 | x    | user has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | y    | user has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | y    | user has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | /foo | file has improper owner.

No, that doesn't quite work either, unless you are willing to make the
categories more specific -- at which point the category and the anomaly
text become equivalent. Actually, I'm wondering why you bother with the
category at all; isn't it implied by the anomaly text?)

> I agree with you, that I would not normally add the anomaly to the
> index, except for the unique row requirement. Thinking about it now,
> maybe I should guarantee unique rows via a check constraint...

A check constraint won't be efficient either, at least not without a
supporting index. Possibly you could index just the host and timestamp
columns, which would not be unique, but it would cut the number of rows
the constraint would need to examine to something manageable.

But I'm still thinking that enforcing uniqueness is a waste of time. What
exactly is so harmful about it if

1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.

appears twice? How likely is that anyway (especially if you don't
truncate the timestamp precision)?

regards, tom lane

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
Re: [HACKERS] Questions about indexes?
On Mon, 16 Feb 2003, Ryan Bradetich wrote:

> I am not sure why all the data is duplicated in the index ...

Well, you have to have the full key in the index, or how would you know,
when you look at a particular index item, whether it actually matches
what you're searching for?

MS SQL Server does have an interesting option that would help you a lot
in this case: clustered indexes. A table may have a single clustered
index, and each leaf node of the index stores not just the key but
actually the entire row. Thus, in a case like yours, you'd store the row
only once, not twice. Without thinking too hard about it (my usual mode
of operation on this list :-)), this could probably be implemented in
postgresql. But I don't think it would be entirely trivial, and your case
is unusual enough that I very much doubt whether it would be worth
implementing to fix that alone. It would also offer the advantage that
any lookup using the clustered index would save fetching the heap page
after that as well, but it's hard to say if the savings would be worth
the work.

> Since my only requirement is that the rows be unique, I have developed a
> custom MD5 function in C, and created an index on the MD5 hash of the
> concatenation of all the fields.

Well, that won't guarantee uniqueness, since it's perfectly possible to
have two different rows hash to the same value. (If that weren't
possible, your hash would have to contain as much information as the row
itself, and your space savings wouldn't be nearly so dramatic.)

cjs
--
Curt Sampson [EMAIL PROTECTED] +81 90 7737 2974 http://www.netbsd.org
  Don't you know, in this new Dark Age, we're all light. --XTC

---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org
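One way two different rows can hash identically without any MD5 weakness
at all: if the fields are concatenated with no delimiter before hashing,
the field boundaries are lost. A Python sketch of this failure mode (the
row values and the `row_hash` helper are invented for illustration, not
Ryan's actual C function):

```python
import hashlib

def row_hash(fields):
    # Naive: concatenate fields with no delimiter before hashing.
    return hashlib.md5(''.join(fields).encode()).hexdigest()

row1 = ('p10', '1user x has an invalid shell.')
row2 = ('p101', 'user x has an invalid shell.')

# Different rows, identical concatenation, identical hash:
print(row_hash(row1) == row_hash(row2))   # True

# Joining with a separator that cannot occur in the data restores the
# field boundaries and distinguishes the rows:
def row_hash_delim(fields):
    return hashlib.md5('\x00'.join(fields).encode()).hexdigest()

print(row_hash_delim(row1) == row_hash_delim(row2))   # False
```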
Re: [HACKERS] Questions about indexes?
Curt Sampson wrote:

> On Mon, 16 Feb 2003, Ryan Bradetich wrote:
>> Since my only requirement is that the rows be unique, I have developed
>> a custom MD5 function in C, and created an index on the MD5 hash of the
>> concatenation of all the fields.
>
> Well, that won't guarantee uniqueness, since it's perfectly possible to
> have two different rows hash to the same value. (If that weren't
> possible, your hash would have to contain as much information as the
> row itself, and your space savings wouldn't be nearly so dramatic.)

That's true, but even if he has 4 billion rows it drops the probability
of a duplicate down to something like one in 4 billion, so it's probably
a safe enough bet. His application doesn't require absolute uniqueness,
fortunately, so MD5 works well enough in this case. Otherwise MD5
wouldn't be a terribly good hash...

--
Kevin Brown [EMAIL PROTECTED]

---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster
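Kevin's ballpark can be checked with the birthday bound: assuming MD5
output behaves like a uniform 128-bit value for non-adversarial input,
the chance that any two of n rows collide is roughly n(n-1)/2 divided by
2^128 (the n = 4 billion figure below follows the thread's example):

```python
def collision_probability(n, bits=128):
    # Birthday approximation: number of distinct pairs over the size of
    # the hash space. Valid while the result is much smaller than 1.
    return n * (n - 1) / 2 / 2 ** bits

# With 4 billion rows the chance of even one accidental collision is
# around 2.4e-20 -- far smaller still than "one in 4 billion".
print(collision_probability(4_000_000_000))
```

So the hash-index scheme is, if anything, even safer against accidental
duplicates than the thread suggests; the practical risks lie elsewhere
(e.g. in how the fields are concatenated before hashing).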
[HACKERS] Questions about indexes?
Hello postgres hackers,

Been a while since I have participated on this list ... but I have a new
itch to scratch.

Although the table schema is immaterial, I will provide it so we have a
common framework for this discussion:

  host_id    integer   (not null)
  timestamp  datetime  (not null)
  category   text      (not null) [<= 5 chars]
  anomaly    text      (not null) [<= 1024 chars]

This table is used to store archived data, so each row in the table must
be unique. Currently I am using a primary key across every column to
enforce this uniqueness. The table currently has ~86 million rows and is
16+ GB in size. The primary key index is also 16+ GB in size, because it
appears all the data is duplicated in the index. (I have only done some
preliminary looking at the database file with strings, etc., so this
assumption is purely based on those observations.)

I am not sure why all the data is duplicated in the index ... but I bet
it has to do with performance, since it would save a lookup in the main
table. Are there any benchmarks or papers related to this topic I should
locate and read? I am curious about this, because it seems the only
advantage gained is searching the index for the specified values. Once
the entry is found, the full entry needs to be pulled from the main table
anyhow, since the index does not contain all the data. Also, with the
increased size, it seems additional pressure would be put on the shared
memory caches (no idea how this really works, just guessing! :))

Since my only requirement is that the rows be unique, I have developed a
custom MD5 function in C, and created an index on the MD5 hash of the
concatenation of all the fields. This has reduced the disk space usage
considerably, as shown below against my test database, ~6 million rows at
1+ GB.
All this data is based off the test database running 7.3.2:

  Type              Size
  ----------------  ----------
  Database table    1188642816
  All-columns pkey  1510252544
  MD5-column pkey    370999296

Just using the MD5 hash data instead of all the columns is a considerable
disk space win, going from 1.5 GB to 370 MB.

Has anyone else solved this problem? Has anyone else looked into
something like this and mind sharing, so I do not have to re-invent the
wheel? :)

Also (assuming there are no papers / benchmarks proving data in the index
is a good idea), how difficult would it be to implement an index type
that extracts the data from the main table?

Thanks for reading. I will be happy to field any questions that I can, or
read any papers, research, etc. that relate to this topic.

- Ryan

P.S. The production database is running 7.2.4, if that makes a
difference.

--
Ryan Bradetich [EMAIL PROTECTED]

---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]
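For reference, on releases with expression indexes and the built-in
md5() function (7.4 and later -- hence Ryan's custom C function on
7.2/7.3), the hash-based unique index could be written roughly as below.
The table name and the '|' delimiter are assumptions; the delimiter
matters, since concatenating fields without one lets different rows
produce the same hash input:

```sql
CREATE UNIQUE INDEX anomalies_md5_key
    ON anomalies (md5(host_id::text || '|' || "timestamp"::text
                      || '|' || category || '|' || anomaly));
```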
Re: [HACKERS] Questions about indexes?
Ryan Bradetich [EMAIL PROTECTED] writes:

> Although the table schema is immaterial, I will provide it so we have a
> common framework for this discussion:
>
>   host_id    integer   (not null)
>   timestamp  datetime  (not null)
>   category   text      (not null) [<= 5 chars]
>   anomaly    text      (not null) [<= 1024 chars]
>
> This table is used to store archived data, so each row in the table
> must be unique. Currently I am using a primary key across every column
> to enforce this uniqueness.

It's not real clear to me why you bother enforcing a constraint that the
complete row be unique. Wouldn't a useful constraint be that the first
three columns be unique? Even if that's not correct, what's wrong with
tolerating a few duplicates? You can't tell me it's to save on storage
;-)

> I am not sure why all the data is duplicated in the index ... but I bet
> it has to do with performance, since it would save a lookup in the main
> table.

An index that can't prevent looking into the main table wouldn't be
worth anything AFAICS ...

regards, tom lane

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/users-lounge/docs/faq.html
Re: [HACKERS] Questions about indexes?
Ryan Bradetich [EMAIL PROTECTED] writes:

> On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
>> It's not real clear to me why you bother enforcing a constraint that
>> the complete row be unique. Wouldn't a useful constraint be that the
>> first three columns be unique?
>
> The table holds system policy compliance data. The category is
> basically the policy, and the anomaly is the detailed text explaining
> why the system is out of compliance. So the anomaly data is important
> (and often the reason why the key is unique).

Well, sure the anomaly is important: it's the payload, the reason why you
bother to have the table in the first place. But that doesn't mean it's
part of the key. Generally the key would be the info you use to look up a
particular anomaly text. In this example, it's not clear to me why you'd
need/want two different anomaly texts entered for the same host_id and
the same category at the same instant of time. ISTM there's something
inadequate about your category column if you need that.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Re: [HACKERS] Questions about indexes?
On Mon, 2003-02-17 at 00:15, Tom Lane wrote:

> Ryan Bradetich [EMAIL PROTECTED] writes:
>> The table holds system policy compliance data. The category is
>> basically the policy, and the anomaly is the detailed text explaining
>> why the system is out of compliance. So the anomaly data is important
>> (and often the reason why the key is unique).
>
> Well, sure the anomaly is important: it's the payload, the reason why
> you bother to have the table in the first place. But that doesn't mean
> it's part of the key. Generally the key would be the info you use to
> look up a particular anomaly text. In this example, it's not clear to
> me why you'd need/want two different anomaly texts entered for the same
> host_id and the same category at the same instant of time. ISTM there's
> something inadequate about your category column if you need that.

Ok, I understand what you are asking now :) Let me make up a contrived
example to show how the table is used.

  host_id 1 = hosta.somewhere.com
  host_id 2 = hostb.somewhere.com

The categories are coded like so (made-up examples):

  cat p101 = /etc/passwd check
  cat f101 = filesystem check

The table would look like:

  1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
  1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
  1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
  2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.

etc...

So I do not need the anomaly to be part of the index, I only need it to
guarantee that the rows are unique. I agree with you that I would not
normally add the anomaly to the index, except for the unique row
requirement. Thinking about it now, maybe I should guarantee unique rows
via a check constraint...

Thanks for making me think about this in a different way!

- Ryan

--
Ryan Bradetich [EMAIL PROTECTED]

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])