[PERFORM] Indexes for hashes
Hi,

I have an application which stores a large amount of hex-encoded hash strings (nearly 100 GB of them), which means:

- The number of distinct characters (alphabet) is limited to 16
- Each string is of the same length, 64 characters
- The strings are essentially random

Creating a B-Tree index on this results in the index size being larger than the table itself, and there are disk space constraints.

I've found the SP-GiST radix tree index and thought it could be a good match for the data because of the above constraints. An attempt to create it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more than 12 hours (while a similar B-tree index takes a few hours at most), so I've interrupted it because it probably is not going to finish in a reasonable time. Some slides I found on the SP-GiST index suggest that both build time and size are not really suitable for this purpose.

My question is: what would be the most size-efficient index for this situation?
Re: [PERFORM] Clarification on using pg_upgrade
----- Original Message -----
> From: Tory M Blue
> To: Jim Nasby
> Cc: "pgsql-performance@postgresql.org"
> Sent: Tuesday, 14 June 2016, 22:08
> Subject: Re: [PERFORM] Clarification on using pg_upgrade
>
> Right, that's what we do, but then to upgrade, we have to drop/add the
> node, because it's being upgraded. If I'm updating the underlying OS,
> I have to kill it all. If I'm doing a postgres upgrade, using an old
> version of slon, without using pg_upgrade, I have to drop the db,
> recreate it, which requires a drop/add.
>
> I'm trying to figure out how to best do it using pg_upgrade instead
> of the entire drop/add for postgres upgrades (which are needed if you
> are using slon as an upgrade engine for your db).

I've just skimmed through this thread, but I can't quite gather what it is you're trying to achieve. Are you looking to move away from Slony? Upgrade by any means, with or without Slony? Or just find a "fast" way of doing a major upgrade whilst keeping Slony in place as your replication method?

If it's the latter, the easiest way is to have 2 or more subscribers subscribed to the same sets and then, one at a time, drop a subscriber node, upgrade and re-initdb it, and use clone node to recreate it from another subscriber.

If you're intent on using pg_upgrade, you might be able to fudge it as long as you can bump the current txid to be greater than what it was before the upgrade; in fact, I've done something similar before with a Slony subscriber, but only as a test on a small database.
Re: [PERFORM] Indexes for hashes
Hello Ivan,

> My question is: what would be the most size-efficient index for this
> situation?

It depends on what you want to query. What about a BRIN index:

https://www.postgresql.org/docs/9.5/static/brin-intro.html

This will result in a very small size, but depending on what you want to query it may or may not fit your needs.

Greetings,
Torsten
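For concreteness, a minimal sketch of the BRIN approach (assuming the table and column names from the original post, t and field_name). Note that with essentially random values the per-block-range min/max summaries will cover almost the whole key space, so BRIN mainly pays off if the values correlate with the physical row order:

  -- very small index: one summary entry per range of table blocks
  CREATE INDEX t_field_brin_idx ON t USING brin (field_name);

  -- optionally trade a bit of size for precision by shrinking the range:
  -- CREATE INDEX t_field_brin_idx ON t USING brin (field_name) WITH (pages_per_range = 16);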
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> Hi,
>
> I have an application which stores a large amount of hex-encoded hash
> strings (nearly 100 GB of them), which means:
>
>    - The number of distinct characters (alphabet) is limited to 16
>    - Each string is of the same length, 64 characters
>    - The strings are essentially random
>
> Creating a B-Tree index on this results in the index size being larger than
> the table itself, and there are disk space constraints.
>
> I've found the SP-GiST radix tree index and thought it could be a good
> match for the data because of the above constraints. An attempt to create
> it (as in CREATE INDEX ON t USING spgist(field_name)) apparently takes more
> than 12 hours (while a similar B-tree index takes a few hours at most), so
> I've interrupted it because "it probably is not going to finish in a
> reasonable time". Some slides I found on the SP-GiST index suggest that
> both build time and size are not really suitable for this purpose.
>
> My question is: what would be the most size-efficient index for this
> situation?

Hi Ivan,

If the strings are really random, then maybe a function index on the first 4, 8, or 16 characters could be used to narrow the search space without indexing all 64. If they are not "good" random numbers, you could use a hash index on the strings. It will be much smaller since it currently uses a 32-bit hash. It has a number of caveats and is not currently crash-safe, but it seems like it might work in your environment. You can also use a functional index on a hash function applied to your values with a btree to give you crash safety.

Regards,
Ken
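For concreteness, a rough sketch of the prefix idea (table and column names taken from the original post; the 16-character cutoff and index name are arbitrary):

  -- index only the first 16 of the 64 hex characters
  CREATE INDEX t_field_prefix_idx ON t (substr(field_name, 1, 16));

  -- queries must repeat the expression so the planner can use the index,
  -- and still recheck the full value:
  SELECT *
    FROM t
   WHERE substr(field_name, 1, 16) = substr('<64-char hex hash>', 1, 16)
     AND field_name = '<64-char hex hash>';

With essentially random data, a 16-hex-digit prefix (64 bits) makes false matches on the prefix vanishingly rare, so the recheck on the full column costs next to nothing.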
Re: [PERFORM] Indexes for hashes
On 15 June 2016 at 15:03, k...@rice.edu wrote:
> If the strings are really random, then maybe a function index on the first
> 4, 8, or 16 characters could be used to narrow the search space without
> indexing all 64. If they are not "good" random numbers, you could use a hash
> index on the strings. It will be much smaller since it currently uses a
> 32-bit hash. It has a number of caveats and is not currently crash-safe, but
> it seems like it might work in your environment. You can also use a
> functional index on a hash function applied to your values with a btree to
> give you crash safety.

Hi,

I figured the hash index might be helpful and I've tried it in the meantime: on one of the smaller tables (which is 51 GB in size), a btree index is 32 GB, while the hash index is 22 GB (so btree is around 45% larger).

I don't suppose there's an effort in progress to make hash indexes use WAL? :D
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 03:09:04PM +0200, Ivan Voras wrote:
> I figured the hash index might be helpful and I've tried it in the
> meantime: on one of the smaller tables (which is 51 GB in size), a btree
> index is 32 GB, while the hash index is 22 GB (so btree is around 45%
> larger).
>
> I don't suppose there's an effort in progress to make hash indexes use WAL?
> :D

Hi Ivan,

Several people have looked at it, but it has not made it to the top of anyone's to-do list. So if you need WAL and crash safety, a functional index on a hash of your values is currently your best bet.

Regards,
Ken
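For illustration, one way that "functional index on a hash of the values" could look (a sketch only; the index name is made up, and hashtext() is an internal, undocumented function whose output is not guaranteed to stay stable across major versions or architectures):

  -- ordinary (WAL-logged) btree over a 32-bit hash of the text
  CREATE INDEX t_field_hashtext_idx ON t (hashtext(field_name));

  -- look up by hash first, then recheck the real value to filter out collisions:
  SELECT *
    FROM t
   WHERE hashtext(field_name) = hashtext('<64-char hex hash>')
     AND field_name = '<64-char hex hash>';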
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 11:34:18AM +0200, Ivan Voras wrote:
> I have an application which stores a large amount of hex-encoded hash
> strings (nearly 100 GB of them), which means:

Why do you keep them hex encoded, and not use bytea?

I made a sample table with 1 million rows, looking like this:

      Table "public.new"
  Column  | Type  | Modifiers
 ---------+-------+-----------
  texthex | text  |
  a_bytea | bytea |

values are like:

$ select * from new limit 10;
                             texthex                             |                              a_bytea
------------------------------------------------------------------+--------------------------------------------------------------------
 c968f64426b941bc9a8f6d4e87fc151c7a7192679837618570b7989c67c31e2f | \xc968f64426b941bc9a8f6d4e87fc151c7a7192679837618570b7989c67c31e2f
 61dffbf002d7fc3db9df5953aca7f2e434a78d4c5fdd0db6f90f43ee8c4371db | \x61dffbf002d7fc3db9df5953aca7f2e434a78d4c5fdd0db6f90f43ee8c4371db
 757acf228adf2357356fd38d03b80529771f211e0ad3b35b66a23d6de53e5033 | \x757acf228adf2357356fd38d03b80529771f211e0ad3b35b66a23d6de53e5033
 fba35b8b33a7fccc2ac7d96389d43be509ff17636fe3c0f8a33af6d009f84f15 | \xfba35b8b33a7fccc2ac7d96389d43be509ff17636fe3c0f8a33af6d009f84f15
 ecd8587a8b9acae650760cea8683f8e1c131c4054c0d64b1d7de0ff269ccc61a | \xecd8587a8b9acae650760cea8683f8e1c131c4054c0d64b1d7de0ff269ccc61a
 11782c73bb3fc9f281b41d3eff8c1e7907b3494b3abe7b6982c1e88f49dad2ea | \x11782c73bb3fc9f281b41d3eff8c1e7907b3494b3abe7b6982c1e88f49dad2ea
 5862bd8d645e4d44997a485c616bc18f1acabeaec5df3c3b09b9d4c08643e852 | \x5862bd8d645e4d44997a485c616bc18f1acabeaec5df3c3b09b9d4c08643e852
 2d09a5cca2c03153a55faa3aff13df0f0593f4355a1b2cfcf9237c2931b4918c | \x2d09a5cca2c03153a55faa3aff13df0f0593f4355a1b2cfcf9237c2931b4918c
 2186eb2bcc12319ee6e00f7e08a1d61e379a75c01c579c29d0338693bc31c7c7 | \x2186eb2bcc12319ee6e00f7e08a1d61e379a75c01c579c29d0338693bc31c7c7
 2061bd05049c51bd1162e4d77f72a37f06d2397fc522ef587ed172a5ad8d57aa | \x2061bd05049c51bd1162e4d77f72a37f06d2397fc522ef587ed172a5ad8d57aa
(10 rows)

created two indexes:

create index i1 on new (texthex);
create index i2 on new (a_bytea);

i1 is 91MB, and i2 is 56MB.

Index creation was also much faster - best out of 3 runs for i1 was 4928.982 ms, best out of 3 runs for i2 was 2047.648 ms.

Best regards,

depesz
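For anyone wanting to reproduce a similar comparison, a hedged sketch of how such sample data could be generated (not necessarily how the table above was built):

  -- 1M random 64-char hex strings plus their bytea equivalents
  CREATE TABLE new AS
  SELECT md5(random()::text) || md5(random()::text) AS texthex
    FROM generate_series(1, 1000000);

  ALTER TABLE new ADD COLUMN a_bytea bytea;
  UPDATE new SET a_bytea = decode(texthex, 'hex');

  create index i1 on new (texthex);
  create index i2 on new (a_bytea);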
Re: [PERFORM] Indexes for hashes
Hi,

I understand your idea, and have also been thinking about it. Basically, existing applications would need to be modified, however slightly, and that wouldn't be good.

On 15 June 2016 at 15:38, hubert depesz lubaczewski wrote:
> Why do you keep them hex encoded, and not use bytea?
>
> [...]
>
> i1 is 91MB, and i2 is 56MB.
>
> Index creation was also much faster - best out of 3 runs for i1 was
> 4928.982 ms, best out of 3 runs for i2 was 2047.648 ms.
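If rewriting the stored data is the sticking point, a possible compromise (a sketch only, using the names from the original post) is a functional index on the bytea decoding of the existing text column; the data stays hex-encoded, but queries do have to be phrased against the expression for the index to be used:

  CREATE INDEX t_field_decode_idx ON t (decode(field_name, 'hex'));

  SELECT *
    FROM t
   WHERE decode(field_name, 'hex') = decode('<64-char hex hash>', 'hex');

The indexed keys are then 32 bytes instead of 64 characters, which should land close to the bytea index size shown above.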
Re: [PERFORM] Indexes for hashes
Hi Ivan,

How about using crc32, and indexing the integer result of the crc32 function? You could split the hash into two parts and run crc32 twice, then create a composite index on both integers (the crc32 results). Instead of indexing 64 characters, you would only employ two integers as the index key.

Regards,

Jul

On Wed, Jun 15, 2016 at 8:54 PM, Ivan Voras wrote:
> I understand your idea, and have also been thinking about it. Basically,
> existing applications would need to be modified, however slightly, and that
> wouldn't be good.

--
Julyanto SUTANDANG

Equnix Business Solutions, PT
(An Open Source an Open Mind Company)

Pusat Niaga ITC Roxy Mas Blok C2/42. Jl. KH Hasyim Ashari 125, Jakarta Pusat
T: +6221 2282 F: +62216315281 M: +628164858028
Re: [PERFORM] Indexes for hashes
Hi,

This idea is similar to the substring one, and while it does give excellent performance and small size, it requires application code modifications, so it's out.

On 15 June 2016 at 15:58, julyanto SUTANDANG wrote:
> How about using crc32, and indexing the integer result of the crc32
> function? You could split the hash into two parts and run crc32 twice, then
> create a composite index on both integers (the crc32 results).
Re: [PERFORM] Indexes for hashes
Hi,

Just for testing... is there a fast (i.e. written in C) crc32 or a similar small hash function for PostgreSQL?

On 15 June 2016 at 16:00, Ivan Voras wrote:
> This idea is similar to the substring one, and while it does give excellent
> performance and small size, it requires application code modifications, so
> it's out.
Re: [PERFORM] Indexes for hashes
On 06/15/2016 07:20 AM, Ivan Voras wrote:
> Hi,
>
> Just for testing... is there a fast (i.e. written in C) crc32 or a similar
> small hash function for PostgreSQL?

https://www.postgresql.org/docs/9.5/static/pgcrypto.html

We also have a built-in md5().

JD

--
Command Prompt, Inc.  http://the.postgres.company/  +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
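For illustration, a hedged sketch of how the built-in md5() could serve as such a hash (index name made up; table and column names from the original post). md5() returns 32 hex characters, so the indexed keys are half the size of the raw 64-character values, and the full value still has to be rechecked since distinct inputs can share an md5:

  CREATE INDEX t_field_md5_idx ON t (md5(field_name));

  SELECT *
    FROM t
   WHERE md5(field_name) = md5('<64-char hex hash>')
     AND field_name = '<64-char hex hash>';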
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 04:20:46PM +0200, Ivan Voras wrote:
> Hi,
>
> Just for testing... is there a fast (i.e. written in C) crc32 or a similar
> small hash function for PostgreSQL?

Hi Ivan,

Here is an extension that provides a number of different hash functions, including a version of the hash function used internally:

https://github.com/markokr/pghashlib

Regards,
Ken
Re: [PERFORM] Indexes for hashes
On Wed, Jun 15, 2016 at 6:16 AM, k...@rice.edu wrote:
> On Wed, Jun 15, 2016 at 03:09:04PM +0200, Ivan Voras wrote:
>> I don't suppose there's an effort in progress to make hash indexes use
>> WAL? :D
>
> Hi Ivan,
>
> Several people have looked at it but it has not made it to the top of
> anyone's to-do list.

I don't know if it is at the top of his to-do list, but Amit seems pretty serious about it:

https://www.postgresql.org/message-id/caa4ek1lfzczyxloxs874ad0+s-zm60u9bwcyiuzx9mhz-kc...@mail.gmail.com

I hope to give him some help if I get a chance.

Cheers,
Jeff
[PERFORM] pg_restore seems very slow
This is my first post to the mailing list, so I apologize for any etiquette issues.

I have a few databases that I am trying to move from one system to another. Both systems are running Windows 7 and Postgres 8.4, and they are pretty powerful machines (40-core Xeon workstations with decent hardware across the board). While the DBs vary in size, I'm working right now with one that is roughly 50 tables and probably 75M rows, and is about 300MB on disk when exported via pg_dump.

I am exporting and restoring using these commands (on separate systems):

pg_dump -F c mydb > mydb.dump
pg_restore -C -j 10 mydb.dump

The dump process runs in about a minute and seems fine. The restore process has already been running for around 7 hours.

Yesterday, I tried restoring a larger DB that is roughly triple the dimensions listed above, and it ran for over 16 hours without completing.

I followed the advice given at http://www.databasesoup.com/2014/09/settings-for-fast-pgrestore.html and set the conf settings as directed and restarted the server.

You can see in the command line that I am trying to use the -j parameter for parallelism, but I don't see much evidence of that in Task Manager. CPU load is consistently 1 or 2% and only a couple of cores seem to be doing anything; there certainly aren't 10 CPU-bound cores. I'm not sure where to look for pg_restore's disk I/O, but there is an entry for pg_restore in Task Manager/Processes which shows almost no I/O Read Bytes and 0 I/O Write Bytes. Since that's just the parent process that might make sense, but I don't see much activity elsewhere either.

Is there something simple that I am missing here? Does the -j flag not work in 8.4, and should I use --jobs instead? It just seems like none of the CPU or RAM usage I'd expect from this process is evident, it's taking many times longer than I would expect, and I don't know how to verify whether the things I'm trying are working or not.

Any insight would be appreciated!

Thanks,
Adrian
Re: [PERFORM] pg_restore seems very slow
On Wed, Jun 15, 2016 at 6:00 PM, Adrian Myers wrote:
> I am exporting and restoring using these commands (on separate systems):
>
> pg_dump -F c mydb > mydb.dump
> pg_restore -C -j 10 mydb.dump
>
> The dump process runs in about a minute and seems fine. The restore
> process has already been running for around 7 hours.
>
> [...]
>
> Is there something simple that I am missing here? Does the -j flag not
> work in 8.4, and should I use --jobs instead?

Did any databases restore properly?

Are there any messages in the logs or on the terminal? You should add the "--verbose" option to your pg_restore command to help provoke this.

-C can be problematic at times. Consider manually ensuring the desired target database exists and is set up correctly (matches the original) and then do a non-create restoration to it specifically.

-j should work fine in 8.4 (according to the docs).

You need to get to a point where you are seeing feedback from the pg_restore process. Once you get it telling you what it is doing (or trying to do), then diagnosing can begin.

David J.
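As a sketch of what that suggestion might look like in practice (database name taken from the original post; the created database may need matching encoding/owner options). Note that without a -d/--dbname argument pg_restore only writes an SQL script to standard output rather than restoring into a server, and the -j jobs only do parallel work when restoring directly into a database:

  createdb mydb
  pg_restore --verbose -j 10 -d mydb mydb.dump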
Re: [PERFORM] pg_restore seems very slow
Hi David,

Thank you for your reply. Yes, there is quite a lot of feedback in the terminal. I can see a small flurry of table operations followed by hours of table contents being printed, presumably as they are inserted. I didn't use the --verbose option, but it seems to be echoing everything it is doing.

I haven't seen any errors, and I was able to restore a couple of very small tables successfully, so it seems like the process is valid. The problem is that pg_restore is running for extremely long periods of time on even modestly large tables, and I can't tell if the optimizations I am trying, such as the -j concurrency option, are having any effect.

Thanks,
Adrian

On Wed, Jun 15, 2016 at 6:08 PM, David G. Johnston <david.g.johns...@gmail.com> wrote:
> Did any databases restore properly?
>
> Are there any messages in the logs or on the terminal? You should add the
> "--verbose" option to your pg_restore command to help provoke this.
>
> -C can be problematic at times. Consider manually ensuring the desired
> target database exists and is set up correctly (matches the original) and
> then do a non-create restoration to it specifically.
>
> -j should work fine in 8.4 (according to the docs).
>
> You need to get to a point where you are seeing feedback from the
> pg_restore process. Once you get it telling you what it is doing (or
> trying to do), then diagnosing can begin.
Re: [PERFORM] pg_restore seems very slow
The "simple" case may be anti-virus or firewall software blocking feeding into the database. Be sure to check the Windows system logs for any unusual messages.

Check the postgres log (usually in PGDATA/pg_log).

For seeing disk I/O on Win7, check out http://www.digitalcitizen.life/how-use-resource-monitor-windows-7

Try also to restore without any -j or --jobs to see if you get more activity on CPU or disk.

Can you view any data in the tables, to at least know it's loading?

Thanks,

Adam C. Scott

On Wed, Jun 15, 2016 at 4:41 PM, Adrian Myers wrote:
> Thank you for your reply. Yes, there is quite a lot of feedback in the
> terminal. I can see a small flurry of table operations followed by hours of
> table contents being printed, presumably as they are inserted. I didn't use
> the --verbose option, but it seems to be echoing everything it is doing.
>
> [...]