[PERFORM] Indexes for hashes
Hi,

I have an application which stores a large amount of hex-encoded hash strings (nearly 100 GB of them), which means:

- The number of distinct characters (alphabet) is limited to 16
- Each string is of the same length, 64 characters
- The strings are essentially random

Creating a B-Tree index on this results in the index size being larger than the table itself, and there are disk space constraints.

I've found the SP-GiST radix tree index and thought it could be a good match for the data because of the above constraints. An attempt to create it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more than 12 hours (while a similar B-tree index takes a few hours at most), so I've interrupted it because it probably is not going to finish in a reasonable time. Some slides I found on the SP-GiST index suggest that both build time and size are not really suitable for this purpose.

My question is: what would be the most size-efficient index for this situation?
Re: [PERFORM] Clarification on using pg_upgrade
----- Original Message -----
> From: Tory M Blue
> To: Jim Nasby
> Cc: "pgsql-performance@postgresql.org"
> Sent: Tuesday, 14 June 2016, 22:08
> Subject: Re: [PERFORM] Clarification on using pg_upgrade
>
> Right, that's what we do, but then to upgrade, we have to drop/add the
> node, because it's being upgraded. If I'm updating the underlying OS,
> I have to kill it all. If I'm doing a postgres upgrade, using an old
> version of slon, without using pg_upgrade, I have to drop the db,
> recreate it, which requires a drop/add.
>
> I'm trying to figure out how to best do it using pg_upgrade instead
> of the entire drop/add for postgres upgrades (which are needed if you
> are using slon as an upgrade engine for your db).

I've just skimmed through this thread, but I can't quite gather what it is you're trying to achieve. Are you looking to move away from Slony? Upgrade by any means, with or without Slony? Or just find a "fast" way of doing a major upgrade whilst keeping Slony in place as your replication method?

If it's the latter, the easiest way is to have 2 or more subscribers subscribed to the same sets and then, one at a time, drop a subscriber node, upgrade and re-initdb it, and use clone node to recreate it from another subscriber.

If you're intent on using pg_upgrade, you might be able to fudge it as long as you can bump the current txid to be greater than what it was before the upgrade; in fact, I've done something similar before with a Slony subscriber, but only as a test on a small database.
Re: [PERFORM] Indexes for hashes
Hello Ivan,

> My question is: what would be the most size-efficient index for this
> situation?

It depends on what you want to query. What about a BRIN index:

https://www.postgresql.org/docs/9.5/static/brin-intro.html

This will result in a very small size, but depending on what you want to query it may or may not fit your needs.

Greetings,
Torsten
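For concreteness, a minimal sketch of the BRIN approach (assuming the table and column names from the original post, t and field_name). Note that with essentially random values the per-block-range min/max summaries will cover almost the whole key space, so BRIN mainly pays off if the values correlate with the physical row order:

  -- very small index: one summary entry per range of table blocks
  CREATE INDEX t_field_brin_idx ON t USING brin (field_name);

  -- optionally trade a bit of size for precision by shrinking the range:
  -- CREATE INDEX t_field_brin_idx ON t USING brin (field_name) WITH (pages_per_range = 16);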
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> Hi,
>
> I have an application which stores a large amount of hex-encoded hash
> strings (nearly 100 GB of them), which means:
>
>    - The number of distinct characters (alphabet) is limited to 16
>    - Each string is of the same length, 64 characters
>    - The strings are essentially random
>
> Creating a B-Tree index on this results in the index size being larger than
> the table itself, and there are disk space constraints.
>
> I've found the SP-GiST radix tree index and thought it could be a good
> match for the data because of the above constraints. An attempt to create
> it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more
> than 12 hours (while a similar B-tree index takes a few hours at most), so
> I've interrupted it because "it probably is not going to finish in a
> reasonable time". Some slides I found on the SP-GiST index suggest that
> both build time and size are not really suitable for this purpose.
>
> My question is: what would be the most size-efficient index for this
> situation?

Hi Ivan,

If the strings are really random, then maybe a function index on the first 4, 8, or 16 characters could be used to narrow the search space without indexing all 64. If they are not "good" random numbers, you could use a hash index on the strings. It will be much smaller since it currently uses a 32-bit hash. It has a number of caveats and is not currently crash-safe, but it seems like it might work in your environment. You can also use a functional index on a hash function applied to your values with a btree to give you crash safety.

Regards,
Ken
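For concreteness, a rough sketch of the prefix idea (table and column names taken from the original post; the 16-character cutoff and index name are arbitrary):

  -- index only the first 16 of the 64 hex characters
  CREATE INDEX t_field_prefix_idx ON t (substr(field_name, 1, 16));

  -- queries must repeat the expression so the planner can use the index,
  -- and still recheck the full value:
  SELECT *
    FROM t
   WHERE substr(field_name, 1, 16) = substr('<64-char hex hash>', 1, 16)
     AND field_name = '<64-char hex hash>';

With essentially random data, a 16-hex-digit prefix (64 bits) makes false matches on the prefix vanishingly rare, so the recheck on the full column costs next to nothing.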
Re: [PERFORM] Indexes for hashes
On 15 June 2016 at 15:03, k...@rice.edu wrote:
> If the strings are really random, then maybe a function index on the first
> 4, 8, or 16 characters could be used to narrow the search space without
> indexing all 64. If they are not "good" random numbers, you could use a hash
> index on the strings. It will be much smaller since it currently uses a
> 32-bit hash. It has a number of caveats and is not currently crash-safe, but
> it seems like it might work in your environment. You can also use a
> functional index on a hash function applied to your values with a btree to
> give you crash safety.

Hi,

I figured the hash index might be helpful and I've tried it in the meantime: on one of the smaller tables (which is 51 GB in size), a btree index is 32 GB, while the hash index is 22 GB (so btree is around 45% larger).

I don't suppose there's an effort in progress to make hash indexes use WAL? :D
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 03:09:04PM +0200, Ivan Voras wrote:
> I figured the hash index might be helpful and I've tried it in the
> meantime: on one of the smaller tables (which is 51 GB in size), a btree
> index is 32 GB, while the hash index is 22 GB (so btree is around 45%
> larger).
>
> I don't suppose there's an effort in progress to make hash indexes use WAL?
> :D

Hi Ivan,

Several people have looked at it, but it has not made it to the top of anyone's to-do list. So if you need WAL and crash safety, a functional index on a hash of your values is currently your best bet.

Regards,
Ken
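For illustration, one way that "functional index on a hash of the values" could look (a sketch only; the index name is made up, and hashtext() is an internal, undocumented function whose output is not guaranteed to stay stable across major versions or architectures):

  -- ordinary (WAL-logged) btree over a 32-bit hash of the text
  CREATE INDEX t_field_hashtext_idx ON t (hashtext(field_name));

  -- look up by hash first, then recheck the real value to filter out collisions:
  SELECT *
    FROM t
   WHERE hashtext(field_name) = hashtext('<64-char hex hash>')
     AND field_name = '<64-char hex hash>';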
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> I have an application which stores a large amount of hex-encoded hash
> strings (nearly 100 GB of them), which means:

Why do you keep them hex encoded, and not use bytea?

I made a sample table with 1 million rows, looking like this:

      Table "public.new"
  Column  | Type  | Modifiers
 ---------+-------+-----------
  texthex | text  |
  a_bytea | bytea |

values are like:

$ select * from new limit 10;
                             texthex                             |                              a_bytea
------------------------------------------------------------------+--------------------------------------------------------------------
 c968f64426b941bc9a8f6d4e87fc151c7a7192679837618570b7989c67c31e2f | \xc968f64426b941bc9a8f6d4e87fc151c7a7192679837618570b7989c67c31e2f
 61dffbf002d7fc3db9df5953aca7f2e434a78d4c5fdd0db6f90f43ee8c4371db | \x61dffbf002d7fc3db9df5953aca7f2e434a78d4c5fdd0db6f90f43ee8c4371db
 757acf228adf2357356fd38d03b80529771f211e0ad3b35b66a23d6de53e5033 | \x757acf228adf2357356fd38d03b80529771f211e0ad3b35b66a23d6de53e5033
 fba35b8b33a7fccc2ac7d96389d43be509ff17636fe3c0f8a33af6d009f84f15 | \xfba35b8b33a7fccc2ac7d96389d43be509ff17636fe3c0f8a33af6d009f84f15
 ecd8587a8b9acae650760cea8683f8e1c131c4054c0d64b1d7de0ff269ccc61a | \xecd8587a8b9acae650760cea8683f8e1c131c4054c0d64b1d7de0ff269ccc61a
 11782c73bb3fc9f281b41d3eff8c1e7907b3494b3abe7b6982c1e88f49dad2ea | \x11782c73bb3fc9f281b41d3eff8c1e7907b3494b3abe7b6982c1e88f49dad2ea
 5862bd8d645e4d44997a485c616bc18f1acabeaec5df3c3b09b9d4c08643e852 | \x5862bd8d645e4d44997a485c616bc18f1acabeaec5df3c3b09b9d4c08643e852
 2d09a5cca2c03153a55faa3aff13df0f0593f4355a1b2cfcf9237c2931b4918c | \x2d09a5cca2c03153a55faa3aff13df0f0593f4355a1b2cfcf9237c2931b4918c
 2186eb2bcc12319ee6e00f7e08a1d61e379a75c01c579c29d0338693bc31c7c7 | \x2186eb2bcc12319ee6e00f7e08a1d61e379a75c01c579c29d0338693bc31c7c7
 2061bd05049c51bd1162e4d77f72a37f06d2397fc522ef587ed172a5ad8d57aa | \x2061bd05049c51bd1162e4d77f72a37f06d2397fc522ef587ed172a5ad8d57aa
(10 rows)

created two indexes:

create index i1 on new (texthex);
create index i2 on new (a_bytea);

i1 is 91MB, and i2 is 56MB.

Index creation was also much faster - best out of 3 runs for i1 was 4928.982 ms, best out of 3 runs for i2 was 2047.648 ms.

Best regards,

depesz
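For anyone wanting to reproduce a similar comparison, a hedged sketch of how such sample data could be generated (not necessarily how the table above was built):

  -- 1M random 64-char hex strings plus their bytea equivalents
  CREATE TABLE new AS
  SELECT md5(random()::text) || md5(random()::text) AS texthex
    FROM generate_series(1, 1000000);

  ALTER TABLE new ADD COLUMN a_bytea bytea;
  UPDATE new SET a_bytea = decode(texthex, 'hex');

  create index i1 on new (texthex);
  create index i2 on new (a_bytea);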
Re: [PERFORM] Indexes for hashes
Hi,

I understand your idea, and have also been thinking about it. Basically, existing applications would need to be modified, however slightly, and that wouldn't be good.

On 15 June 2016 at 15:38, hubert depesz lubaczewski wrote:
> Why do you keep them hex encoded, and not use bytea?
>
> [...]
>
> i1 is 91MB, and i2 is 56MB.
>
> Index creation was also much faster - best out of 3 runs for i1 was
> 4928.982 ms, best out of 3 runs for i2 was 2047.648 ms.
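If rewriting the stored data is the sticking point, a possible compromise (a sketch only, using the names from the original post) is a functional index on the bytea decoding of the existing text column; the data stays hex-encoded, but queries do have to be phrased against the expression for the index to be used:

  CREATE INDEX t_field_decode_idx ON t (decode(field_name, 'hex'));

  SELECT *
    FROM t
   WHERE decode(field_name, 'hex') = decode('<64-char hex hash>', 'hex');

The indexed keys are then 32 bytes instead of 64 characters, which should land close to the bytea index size shown above.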
Re: [PERFORM] Indexes for hashes
Hi Ivan,

How about using crc32, and indexing the integer result of the crc32 function? You could split the hash into two parts and run crc32 twice, then create a composite index on both integers (the crc32 results). Instead of indexing 64 characters, you would only employ two integers as the index key.

Regards,

Jul

On Wed, Jun 15, 2016 at 8:54 PM, Ivan Voras wrote:
> I understand your idea, and have also been thinking about it. Basically,
> existing applications would need to be modified, however slightly, and that
> wouldn't be good.

--
Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source an Open Mind Company)

Pusat Niaga ITC Roxy Mas Blok C2/42. Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 2282 F: +62216315281 M: +628164858028
Re: [PERFORM] Indexes for hashes
Hi,

This idea is similar to the substring one, and while it does give excellent performance and small size, it requires application code modifications, so it's out.

On 15 June 2016 at 15:58, julyanto SUTANDANG wrote:
> How about using crc32, and indexing the integer result of the crc32
> function? You could split the hash into two parts and run crc32 twice, then
> create a composite index on both integers (the crc32 results).
Re: [PERFORM] Indexes for hashes
Hi,

Just for testing... is there a fast (i.e. written in C) crc32 or a similar small hash function for PostgreSQL?

On 15 June 2016 at 16:00, Ivan Voras wrote:
> This idea is similar to the substring one, and while it does give excellent
> performance and small size, it requires application code modifications, so
> it's out.
Re: [PERFORM] Indexes for hashes
On 06/15/2016 07:20 AM, Ivan Voras wrote:
> Hi,
>
> Just for testing... is there a fast (i.e. written in C) crc32 or a similar
> small hash function for PostgreSQL?

https://www.postgresql.org/docs/9.5/static/pgcrypto.html

We also have a built-in md5().

JD

--
Command Prompt, Inc.  http://the.postgres.company/  +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
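For illustration, a hedged sketch of how the built-in md5() could serve as such a hash (index name made up; table and column names from the original post). md5() returns 32 hex characters, so the indexed keys are half the size of the raw 64-character values, and the full value still has to be rechecked since distinct inputs can share an md5:

  CREATE INDEX t_field_md5_idx ON t (md5(field_name));

  SELECT *
    FROM t
   WHERE md5(field_name) = md5('<64-char hex hash>')
     AND field_name = '<64-char hex hash>';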
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 04:20:46PM +0200, Ivan Voras wrote:
> Hi,
>
> Just for testing... is there a fast (i.e. written in C) crc32 or a similar
> small hash function for PostgreSQL?

Hi Ivan,

Here is an extension that provides a number of different hash functions, including a version of the hash function used internally:

https://github.com/markokr/pghashlib

Regards,
Ken
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 6:16 AM, k...@rice.edu wrote:
> On Wed, Jun 15, 2016 at 03:09:04PM +0200, Ivan Voras wrote:
>> I don't suppose there's an effort in progress to make hash indexes use
>> WAL? :D
>
> Hi Ivan,
>
> Several people have looked at it but it has not made it to the top of
> anyone's to-do list.

I don't know if it is at the top of his to-do list, but Amit seems pretty serious about it:

https://www.postgresql.org/message-id/caa4ek1lfzczyxloxs874ad0+s-zm60u9bwcyiuzx9mhz-kc...@mail.gmail.com

I hope to give him some help if I get a chance.

Cheers,
Jeff
[PERFORM] pg_restore seems very slow
This is my first post to the mailing list, so I apologize for any etiquette issues.

I have a few databases that I am trying to move from one system to another. Both systems are running Windows 7 and Postgres 8.4, and they are pretty powerful machines (40-core Xeon workstations with decent hardware across the board). While the DBs vary in size, I'm working right now with one that is roughly 50 tables and probably 75M rows, and is about 300MB on disk when exported via pg_dump.

I am exporting and restoring using these commands (on separate systems):

pg_dump -F c mydb > mydb.dump
pg_restore -C -j 10 mydb.dump

The dump process runs in about a minute and seems fine. The restore process has already been running for around 7 hours.

Yesterday, I tried restoring a larger DB that is roughly triple the dimensions listed above, and it ran for over 16 hours without completing.

I followed the advice given at http://www.databasesoup.com/2014/09/settings-for-fast-pgrestore.html and set the conf settings as directed and restarted the server.

You can see in the command line that I am trying to use the -j parameter for parallelism, but I don't see much evidence of that in Task Manager. CPU load is consistently 1 or 2% and only a couple of cores seem to be doing anything; there certainly aren't 10 CPU-bound cores. I'm not sure where to look for pg_restore's disk I/O, but there is an entry for pg_restore in Task Manager/Processes which shows almost no I/O Read Bytes and 0 I/O Write Bytes. Since that's just the parent process that might make sense, but I don't see much activity elsewhere either.

Is there something simple that I am missing here? Does the -j flag not work in 8.4, and should I use --jobs instead? It just seems like none of the CPU or RAM usage I'd expect from this process is evident, it's taking many times longer than I would expect, and I don't know how to verify whether the things I'm trying are working or not.

Any insight would be appreciated!

Thanks,
Adrian
Re: [PERFORM] pg_restore seems very slow
On Wed, Jun 15, 2016 at 6:00 PM, Adrian Myers wrote:
> I am exporting and restoring using these commands (on separate systems):
>
> pg_dump -F c mydb > mydb.dump
> pg_restore -C -j 10 mydb.dump
>
> The dump process runs in about a minute and seems fine. The restore
> process has already been running for around 7 hours.
>
> [...]
>
> Is there something simple that I am missing here? Does the -j flag not
> work in 8.4, and should I use --jobs instead?

Did any databases restore properly?

Are there any messages in the logs or on the terminal? You should add the "--verbose" option to your pg_restore command to help provoke this.

-C can be problematic at times. Consider manually ensuring the desired target database exists and is set up correctly (matches the original) and then do a non-create restoration to it specifically.

-j should work fine in 8.4 (according to the docs).

You need to get to a point where you are seeing feedback from the pg_restore process. Once you get it telling you what it is doing (or trying to do), then diagnosing can begin.

David J.
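As a sketch of what that suggestion might look like in practice (database name taken from the original post; the created database may need matching encoding/owner options). Note that without a -d/--dbname argument pg_restore only writes an SQL script to standard output rather than restoring into a server, and the -j jobs only do parallel work when restoring directly into a database:

  createdb mydb
  pg_restore --verbose -j 10 -d mydb mydb.dump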
Re: [PERFORM] pg_restore seems very slow
Hi David,

Thank you for your reply. Yes, there is quite a lot of feedback in the terminal. I can see a small flurry of table operations followed by hours of table contents being printed, presumably as they are inserted. I didn't use the --verbose option, but it seems to be echoing everything it is doing.

I haven't seen any errors, and I was able to restore a couple of very small tables successfully, so it seems like the process is valid. The problem is that pg_restore is running for extremely long periods of time on even modestly large tables, and I can't tell if the optimizations I am trying, such as the -j concurrency option, are having any effect.

Thanks,
Adrian

On Wed, Jun 15, 2016 at 6:08 PM, David G. Johnston <david.g.johns...@gmail.com> wrote:
> Did any databases restore properly?
>
> Are there any messages in the logs or on the terminal? You should add the
> "--verbose" option to your pg_restore command to help provoke this.
>
> -C can be problematic at times. Consider manually ensuring the desired
> target database exists and is set up correctly (matches the original) and
> then do a non-create restoration to it specifically.
>
> -j should work fine in 8.4 (according to the docs).
>
> You need to get to a point where you are seeing feedback from the
> pg_restore process. Once you get it telling you what it is doing (or
> trying to do), then diagnosing can begin.
Re: [PERFORM] pg_restore seems very slow
The "simple" case may be anti-virus or firewall software blocking feeding into the database. Be sure to check the Windows system logs for any unusual messages.

Check the postgres log (usually in PGDATA/pg_log).

For seeing disk I/O on Win7, check out http://www.digitalcitizen.life/how-use-resource-monitor-windows-7

Try also to restore without any -j or --jobs to see if you get more activity on CPU or disk.

Can you view any data in the tables, to at least know it's loading?

Thanks,

Adam C. Scott

On Wed, Jun 15, 2016 at 4:41 PM, Adrian Myers wrote:
> Thank you for your reply. Yes, there is quite a lot of feedback in the
> terminal. I can see a small flurry of table operations followed by hours of
> table contents being printed, presumably as they are inserted. I didn't use
> the --verbose option, but it seems to be echoing everything it is doing.
>
> [...]