Re: [PERFORM] Regex performance issue

2006-12-03 Thread Alexandru Coseru

Hello..

I presumed CPU intensive  because:
   1) I have no hdd lights turn on during a series of queries (about 50 of 
them)

   2) vmstat doesn't give me blocks in and just a couple of blocks out.
   3) top  reports between 95 and 100 user cpu.sometimes ,i can see 
some  hi and si work  (max of 2%).



I'm pretty sure that the whole table is in cache..
The table size is about 50 Megs  , and I have 1 Gb of RAM..
I have no other RAM eating procesess.

Could it be related to  CPU - RAM speed ?

Regards
   Alex

- Original Message - 
From: Tom Lane [EMAIL PROTECTED]

To: Alexandru Coseru [EMAIL PROTECTED]
Cc: Heikki Linnakangas [EMAIL PROTECTED]; 
pgsql-performance@postgresql.org

Sent: Sunday, December 03, 2006 6:05 AM
Subject: Re: [PERFORM] Regex performance issue



Alexandru Coseru [EMAIL PROTECTED] writes:

Anyway , anybody has a clue why this regex is that CPU intensive ?


The EXPLAIN result you posted offers *no* evidence that the regexp is
CPU intensive.  All you know is that it took 850+ msec to fetch 5200
rows from disk and apply the regexp filter to them.  There's no evidence
here that that was CPU time and not I/O time.

regards, tom lane



--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006





---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


[PERFORM] Regex performance issue

2006-12-02 Thread Alexandru Coseru

Hello..

I have a low performance problem with regexp.

Here are the details:

asterisk= explain analyze SELECT * FROM destlist WHERE '0039051248787' ~ 
prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;

QUERY PLAN

Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual 
time=857.715..857.716 rows=2 loops=1)

  Sort Key: length((prefix)::text)
  -  Bitmap Heap Scan on destlist  (cost=60.16..7924.30 rows=31 width=67) 
(actual time=2.156..857.686 rows=2 loops=1)

Recheck Cond: ((id_ent = -2) AND (dir = 0))
Filter: ('0039051248787'::text ~ (prefix)::text)
-  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16 
rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)

  Index Cond: ((id_ent = -2) AND (dir = 0))
Total runtime: 857.804 ms
(8 rows)


The main problem is the query time.
As you can see , the use of index destlist_indx2  is pretty quick  (1.9 ms) 
, but the regexp operation takes a lot of time  (857 ms).


How can i improve that ?

Regards
   Alex


PS:  Additional info:

[EMAIL PROTECTED] billing]# uname -a
Linux voce1 2.6.11-1.1369_FC4smp #1 SMP Thu Jun 2 23:16:33 EDT 2005 x86_64 
x86_64 x86_64 GNU/Linux


[EMAIL PROTECTED] billing]# free
total   used   free sharedbuffers cached
Mem:   1023912 977788  46124  0  80900 523868
-/+ buffers/cache: 373020 650892
Swap:  3172728   84883164240


Welcome to psql 8.1.2, the PostgreSQL interactive terminal.

asterisk= \d destlist;
  Table public.destlist
Column  |  Type  | Modifiers
-++---
id  | bigint | not null default 
nextval(('destlist_id'::text)::regclass)

id_ent  | integer|
dir | integer|
prefix  | character varying(255) |
country | character varying(255) |
network | character varying(255) |
tip | integer|


Indexes:
   destlist_unique UNIQUE, btree (id_ent, dir, prefix)
   destlist_indx2 btree (id_ent, dir)
   destlist_indx3 btree (id_ent, dir, prefix)
   mmumu btree (prefix varchar_pattern_ops)


asterisk= select count(*) from destlist;
count

576424
(1 row)






[EMAIL PROTECTED] billing]# cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 15
model   : 4
model name  :   Intel(R) Xeon(TM) CPU 3.00GHz
stepping: 3
cpu MHz : 2992.658
cache size  : 2048 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm 
constant_tsc pni monitor ds_cpl cid cx16 xtpr

bogomips: 5914.62
clflush size: 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 15
model   : 4
model name  :   Intel(R) Xeon(TM) CPU 3.00GHz
stepping: 3
cpu MHz : 2992.658
cache size  : 2048 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 1
fpu : yes
fpu_exception   : yes
cpuid level : 5
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm 
constant_tsc pni monitor ds_cpl cid cx16 xtpr

bogomips: 5980.16
clflush size: 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Dave Dutcher
 -Original Message-
 From: [EMAIL PROTECTED] On Behalf Of Alexandru Coseru
 asterisk= explain analyze SELECT * FROM destlist WHERE 
 '0039051248787' ~ 
 prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;

  
 QUERY PLAN
 --
 --
  Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual 
 time=857.715..857.716 rows=2 loops=1)
Sort Key: length((prefix)::text)
-  Bitmap Heap Scan on destlist  (cost=60.16..7924.30 
 rows=31 width=67) 
 (actual time=2.156..857.686 rows=2 loops=1)
  Recheck Cond: ((id_ent = -2) AND (dir = 0))
  Filter: ('0039051248787'::text ~ (prefix)::text)
  -  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16 
 rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
Index Cond: ((id_ent = -2) AND (dir = 0))
  Total runtime: 857.804 ms
 (8 rows)
 

 mmumu btree (prefix varchar_pattern_ops)
 

I'm surpised Postgres isn't using the index on prefix seeing as the index
uses the varchar_pattern_ops operator class.  It could be that the index
isn't selective enough, or is Postgres not able to use an index with Posix
regular expressions?  The docs seem to say that it can, but I'd be curious
to see what happens if you use LIKE instead of ~. 

Dave
 


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Alexandru Coseru

Hello...

I cannot use LIKE , because the order of the match is reversed.
The prefix column is containing telephone destinations.
IE:^001  - US  , ^0039 Italy , etc..

Here is a small sample:

asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
 1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
 2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
 3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
 4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
 5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
 6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
 7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
 8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
 9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5


Now , I have to match a dialednumber   (let's say   00213618833) and find 
it's destination...(It's algeria mobile).
I tried to make with a query of using LIKE , but i was not able to get 
something..



Regards
   Alex





- Original Message - 
From: Dave Dutcher [EMAIL PROTECTED]
To: 'Alexandru Coseru' [EMAIL PROTECTED]; 
pgsql-performance@postgresql.org

Sent: Saturday, December 02, 2006 10:36 PM
Subject: RE: [PERFORM] Regex performance issue



-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Alexandru Coseru
asterisk= explain analyze SELECT * FROM destlist WHERE
'0039051248787' ~
prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;


QUERY PLAN
--
--
 Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
time=857.715..857.716 rows=2 loops=1)
   Sort Key: length((prefix)::text)
   -  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
rows=31 width=67)
(actual time=2.156..857.686 rows=2 loops=1)
 Recheck Cond: ((id_ent = -2) AND (dir = 0))
 Filter: ('0039051248787'::text ~ (prefix)::text)
 -  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
   Index Cond: ((id_ent = -2) AND (dir = 0))
 Total runtime: 857.804 ms
(8 rows)


mmumu btree (prefix varchar_pattern_ops)



I'm surpised Postgres isn't using the index on prefix seeing as the index
uses the varchar_pattern_ops operator class.  It could be that the index
isn't selective enough, or is Postgres not able to use an index with Posix
regular expressions?  The docs seem to say that it can, but I'd be curious
to see what happens if you use LIKE instead of ~.

Dave





--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Oleg Bartunov

I may miss something but I'd use tsearch2. Check
intdict dictionary for basic idea -  http://www.sai.msu.su/~megera/wiki/Gendict

Oleg
On Sat, 2 Dec 2006, Alexandru Coseru wrote:


Hello...

I cannot use LIKE , because the order of the match is reversed.
The prefix column is containing telephone destinations.
IE:^001  - US  , ^0039 Italy , etc..

Here is a small sample:

asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5


Now , I have to match a dialednumber   (let's say   00213618833) and find 
it's destination...(It's algeria mobile).
I tried to make with a query of using LIKE , but i was not able to get 
something..



Regards
  Alex





- Original Message - From: Dave Dutcher [EMAIL PROTECTED]
To: 'Alexandru Coseru' [EMAIL PROTECTED]; 
pgsql-performance@postgresql.org

Sent: Saturday, December 02, 2006 10:36 PM
Subject: RE: [PERFORM] Regex performance issue



-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Alexandru Coseru
asterisk= explain analyze SELECT * FROM destlist WHERE
'0039051248787' ~
prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;


QUERY PLAN
--
--
 Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
time=857.715..857.716 rows=2 loops=1)
   Sort Key: length((prefix)::text)
   -  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
rows=31 width=67)
(actual time=2.156..857.686 rows=2 loops=1)
 Recheck Cond: ((id_ent = -2) AND (dir = 0))
 Filter: ('0039051248787'::text ~ (prefix)::text)
 -  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
   Index Cond: ((id_ent = -2) AND (dir = 0))
 Total runtime: 857.804 ms
(8 rows)


mmumu btree (prefix varchar_pattern_ops)



I'm surpised Postgres isn't using the index on prefix seeing as the index
uses the varchar_pattern_ops operator class.  It could be that the index
isn't selective enough, or is Postgres not able to use an index with Posix
regular expressions?  The docs seem to say that it can, but I'd be curious
to see what happens if you use LIKE instead of ~.

Dave








Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Heikki Linnakangas

Alexandru Coseru wrote:

I cannot use LIKE , because the order of the match is reversed.
The prefix column is containing telephone destinations.
IE:^001  - US  , ^0039 Italy , etc..


Maybe you could create a functional index on substr(minimum length of 
prefix)? It might restrict the result set prior to applying the regex 
just enough to make the performance acceptable.



asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
 1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
 2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
 3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
 4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
 5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
 6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
 7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
 8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
 9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5

Now , I have to match a dialednumber   (let's say   00213618833) and find it's 
destination...(It's algeria mobile).
I tried to make with a query of using LIKE , but i was not able to get 
something..


Another idea would be to add some extra rows so that you could use 
normal inequality searches. For example, let's take the Albanian rows:


  3 | -1 |   0 | 00355
  4 | -1 |   0 | 0035538
* 3 | -1 |   0 | 0035539
  5 | -1 |   0 | 0035568
  6 | -1 |   0 | 0035569
* 3 | -1 |   0 | 0035570

Now you can do SELECT * FROM destlist WHERE ? = prefix ORDER BY prefix 
LIMIT 1.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Alexandru Coseru

Hello..

I cannot use the first advice , because i'm not aware of the prefix length 
in the database...

This is why i'm ordering after length(prefix)..


On the 2nd one , i'm not sure that i can follow you..

Regards
   Alex
- Original Message - 
From: Heikki Linnakangas [EMAIL PROTECTED]

To: Alexandru Coseru [EMAIL PROTECTED]
Cc: Dave Dutcher [EMAIL PROTECTED]; pgsql-performance@postgresql.org
Sent: Sunday, December 03, 2006 12:04 AM
Subject: Re: [PERFORM] Regex performance issue



Alexandru Coseru wrote:

I cannot use LIKE , because the order of the match is reversed.
The prefix column is containing telephone destinations.
IE:^001  - US  , ^0039 Italy , etc..


Maybe you could create a functional index on substr(minimum length of 
prefix)? It might restrict the result set prior to applying the regex 
just enough to make the performance acceptable.



asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
 1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
 2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
 3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
 4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
 5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
 6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
 7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
 8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
 9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5

Now , I have to match a dialednumber   (let's say   00213618833) and find 
it's destination...(It's algeria mobile).
I tried to make with a query of using LIKE , but i was not able to get 
something..


Another idea would be to add some extra rows so that you could use normal 
inequality searches. For example, let's take the Albanian rows:


  3 | -1 |   0 | 00355
  4 | -1 |   0 | 0035538
* 3 | -1 |   0 | 0035539
  5 | -1 |   0 | 0035568
  6 | -1 |   0 | 0035569
* 3 | -1 |   0 | 0035570

Now you can do SELECT * FROM destlist WHERE ? = prefix ORDER BY prefix 
LIMIT 1.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006





---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Heikki Linnakangas

Alexandru Coseru wrote:

Hello..

I cannot use the first advice , because i'm not aware of the prefix 
length in the database...

This is why i'm ordering after length(prefix)..


On the 2nd one , i'm not sure that i can follow you..


Ok, let me try again :)


asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
 1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
 2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
 3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
 4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
 5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
 6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
 7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
 8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
 9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5 


Store the prefix in a character column, without the regex stuff. Like 
below. I've removed the columns that are not relevant, in fact it would 
make sense to store them in another table, and have just the id and 
prefix in this table.


id | prefix  | network
---+-+
 1 | 0093| AFGHANISTAN
 2 | 00937   | AFGHANISTAN Mobile
 3 | 00355   | ALBANIA
 4 | 0035538 | ALBANIA Mobile
 5 | 0035568 | ALBANIA Mobile
 6 | 0035569 | ALBANIA Mobile
 7 | 00213   | ALGERIA
 8 | 0021361 | ALGERIA Mobile
 9 | 0021362 | ALGERIA Mobile
10 | 0021363 | ALGERIA Mobile

Now, add the rows marked with start below:

 id | prefix  | network
+-+
  1 | 0093| AFGHANISTAN
  2 | 00937   | AFGHANISTAN Mobile
* 1 | 00938   | AFGHANISTAN
  3 | 00355   | ALBANIA
  4 | 0035538 | ALBANIA Mobile
* 3 | 0035539 | ALBANIA
  5 | 0035568 | ALBANIA Mobile
  6 | 0035569 | ALBANIA Mobile
* 3 | 003557  | ALBANIA
  7 | 00213   | ALGERIA
  8 | 0021361 | ALGERIA Mobile
  9 | 0021362 | ALGERIA Mobile
 10 | 0021363 | ALGERIA Mobile
* 7 | 0021364 | ALGERIA

The added rows make it unnecessary to use regex for the searches. You 
can use just the = operator like this: (using the example number you gave)


SELECT id FROM destlist WHERE '00213618833' = prefix ORDER BY prefix 
LIMIT 1


Which would return id 7. As another example, a query of 00213654321 
would match the last row, which again has id 7.


I'm too tired to figure out the exact algorithm for adding the rows, but 
I'm pretty sure it can be automated... The basic idea is that when 
there's a row with id A and prefix , and another row with id B and 
prefix Y, we add another row with id A and prefix (Y+1).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Alexandru Coseru

Hello..

Thanks for the tip , i think i have got the ideea..

I'm too tired too , and i will try it tommorow.


Anyway , anybody has a clue why this regex is that CPU intensive ?  I did 
not saw the light on my drives blinking , and also vmstat doesn't yeld any 
blocks in or out...

And how can it be optimized ?

Is there a way to trace the system calls ?
strace doesn't give me anything else but some lseeks and reads...


PS: Tried it with a 8.2 snaphsot and the result is the same..

Regards
   Alex
- Original Message - 
From: Heikki Linnakangas [EMAIL PROTECTED]

To: Alexandru Coseru [EMAIL PROTECTED]
Cc: pgsql-performance@postgresql.org
Sent: Sunday, December 03, 2006 12:35 AM
Subject: Re: [PERFORM] Regex performance issue



Alexandru Coseru wrote:

Hello..

I cannot use the first advice , because i'm not aware of the prefix 
length in the database...

This is why i'm ordering after length(prefix)..


On the 2nd one , i'm not sure that i can follow you..


Ok, let me try again :)


asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
 1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
 2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
 3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
 4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
 5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
 6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
 7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
 8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
 9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5


Store the prefix in a character column, without the regex stuff. Like 
below. I've removed the columns that are not relevant, in fact it would 
make sense to store them in another table, and have just the id and prefix 
in this table.


id | prefix  | network
---+-+
 1 | 0093| AFGHANISTAN
 2 | 00937   | AFGHANISTAN Mobile
 3 | 00355   | ALBANIA
 4 | 0035538 | ALBANIA Mobile
 5 | 0035568 | ALBANIA Mobile
 6 | 0035569 | ALBANIA Mobile
 7 | 00213   | ALGERIA
 8 | 0021361 | ALGERIA Mobile
 9 | 0021362 | ALGERIA Mobile
10 | 0021363 | ALGERIA Mobile

Now, add the rows marked with start below:

 id | prefix  | network
+-+
  1 | 0093| AFGHANISTAN
  2 | 00937   | AFGHANISTAN Mobile
* 1 | 00938   | AFGHANISTAN
  3 | 00355   | ALBANIA
  4 | 0035538 | ALBANIA Mobile
* 3 | 0035539 | ALBANIA
  5 | 0035568 | ALBANIA Mobile
  6 | 0035569 | ALBANIA Mobile
* 3 | 003557  | ALBANIA
  7 | 00213   | ALGERIA
  8 | 0021361 | ALGERIA Mobile
  9 | 0021362 | ALGERIA Mobile
 10 | 0021363 | ALGERIA Mobile
* 7 | 0021364 | ALGERIA

The added rows make it unnecessary to use regex for the searches. You can 
use just the = operator like this: (using the example number you gave)


SELECT id FROM destlist WHERE '00213618833' = prefix ORDER BY prefix 
LIMIT 1


Which would return id 7. As another example, a query of 00213654321 
would match the last row, which again has id 7.


I'm too tired to figure out the exact algorithm for adding the rows, but 
I'm pretty sure it can be automated... The basic idea is that when there's 
a row with id A and prefix , and another row with id B and prefix 
Y, we add another row with id A and prefix (Y+1).


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006




---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Alexandru Coseru

Hello..

I have never used tsearch2  , but at a first glance , i would not see any 
major improvement , because the main advantage of tsearch is the splitting 
in words of a phrase..

But here , i only have one word  (no spaces).


Regards
   Alex
- Original Message - 
From: Oleg Bartunov oleg@sai.msu.su

To: Alexandru Coseru [EMAIL PROTECTED]
Cc: Dave Dutcher [EMAIL PROTECTED]; pgsql-performance@postgresql.org
Sent: Saturday, December 02, 2006 10:54 PM
Subject: Re: [PERFORM] Regex performance issue



I may miss something but I'd use tsearch2. Check
intdict dictionary for basic idea - 
http://www.sai.msu.su/~megera/wiki/Gendict


Oleg
On Sat, 2 Dec 2006, Alexandru Coseru wrote:


Hello...

I cannot use LIKE , because the order of the match is reversed.
The prefix column is containing telephone destinations.
IE:^001  - US  , ^0039 Italy , etc..

Here is a small sample:

asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5


Now , I have to match a dialednumber   (let's say   00213618833) and find 
it's destination...(It's algeria mobile).
I tried to make with a query of using LIKE , but i was not able to get 
something..



Regards
  Alex





- Original Message - From: Dave Dutcher [EMAIL PROTECTED]
To: 'Alexandru Coseru' [EMAIL PROTECTED]; 
pgsql-performance@postgresql.org

Sent: Saturday, December 02, 2006 10:36 PM
Subject: RE: [PERFORM] Regex performance issue



-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Alexandru 
Coseru

asterisk= explain analyze SELECT * FROM destlist WHERE
'0039051248787' ~
prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;


QUERY PLAN
--
--
 Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
time=857.715..857.716 rows=2 loops=1)
   Sort Key: length((prefix)::text)
   -  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
rows=31 width=67)
(actual time=2.156..857.686 rows=2 loops=1)
 Recheck Cond: ((id_ent = -2) AND (dir = 0))
 Filter: ('0039051248787'::text ~ (prefix)::text)
 -  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
   Index Cond: ((id_ent = -2) AND (dir = 0))
 Total runtime: 857.804 ms
(8 rows)

mmumu btree (prefix varchar_pattern_ops)



I'm surpised Postgres isn't using the index on prefix seeing as the index
uses the varchar_pattern_ops operator class.  It could be that the index
isn't selective enough, or is Postgres not able to use an index with 
Posix
regular expressions?  The docs seem to say that it can, but I'd be 
curious

to see what happens if you use LIKE instead of ~.

Dave








 Regards,
 Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006




---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Scott Marlowe
On Sun, 2006-12-03 at 02:53 +0200, Alexandru Coseru wrote:
 Hello..
 
 Thanks for the tip , i think i have got the ideea..
 
 I'm too tired too , and i will try it tommorow.
 
 
 Anyway , anybody has a clue why this regex is that CPU intensive ?  I did 
 not saw the light on my drives blinking , and also vmstat doesn't yeld any 
 blocks in or out...
 And how can it be optimized ?

I think it's just that you're running the regex across 5400 or so
elements.  at 850 or so milliseconds, that comes out to 150uS for each
comparison.  I'd say it's just the number of comparisons that have to be
made that's killing you here.

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Tom Lane
Alexandru Coseru [EMAIL PROTECTED] writes:
 Anyway , anybody has a clue why this regex is that CPU intensive ?

The EXPLAIN result you posted offers *no* evidence that the regexp is
CPU intensive.  All you know is that it took 850+ msec to fetch 5200
rows from disk and apply the regexp filter to them.  There's no evidence
here that that was CPU time and not I/O time.

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [PERFORM] Regex performance issue

2006-12-02 Thread Oleg Bartunov

On Sun, 3 Dec 2006, Alexandru Coseru wrote:


Hello..

I have never used tsearch2  , but at a first glance , i would not see any 
major improvement , because the main advantage of tsearch is the splitting in 
words of a phrase..

But here , i only have one word  (no spaces).


Oh, yes, I was confused :) What if you consider you prefix as 
1.2.3.4.5.6, then you could try our contrib/ltree module.



Oleg





Regards
  Alex
- Original Message - From: Oleg Bartunov oleg@sai.msu.su
To: Alexandru Coseru [EMAIL PROTECTED]
Cc: Dave Dutcher [EMAIL PROTECTED]; pgsql-performance@postgresql.org
Sent: Saturday, December 02, 2006 10:54 PM
Subject: Re: [PERFORM] Regex performance issue



I may miss something but I'd use tsearch2. Check
intdict dictionary for basic idea - 
http://www.sai.msu.su/~megera/wiki/Gendict


Oleg
On Sat, 2 Dec 2006, Alexandru Coseru wrote:


Hello...

I cannot use LIKE , because the order of the match is reversed.
The prefix column is containing telephone destinations.
IE:^001  - US  , ^0039 Italy , etc..

Here is a small sample:

asterisk= select * from destlist LIMIT 10;
id | id_ent | dir |   prefix   |   country   |  network   | tip
++-++-++-
1 | -1 |   0 | (^0093)| AFGHANISTAN | AFGHANISTAN|   6
2 | -1 |   0 | (^00937)   | AFGHANISTAN | AFGHANISTAN Mobile |   5
3 | -1 |   0 | (^00355)   | ALBANIA | ALBANIA|   6
4 | -1 |   0 | (^0035538) | ALBANIA | ALBANIA Mobile |   5
5 | -1 |   0 | (^0035568) | ALBANIA | ALBANIA Mobile |   5
6 | -1 |   0 | (^0035569) | ALBANIA | ALBANIA Mobile |   5
7 | -1 |   0 | (^00213)   | ALGERIA | ALGERIA|   6
8 | -1 |   0 | (^0021361) | ALGERIA | ALGERIA Mobile |   5
9 | -1 |   0 | (^0021362) | ALGERIA | ALGERIA Mobile |   5
10 | -1 |   0 | (^0021363) | ALGERIA | ALGERIA Mobile |   5


Now , I have to match a dialednumber   (let's say   00213618833) and find 
it's destination...(It's algeria mobile).
I tried to make with a query of using LIKE , but i was not able to get 
something..



Regards
  Alex





- Original Message - From: Dave Dutcher [EMAIL PROTECTED]
To: 'Alexandru Coseru' [EMAIL PROTECTED]; 
pgsql-performance@postgresql.org

Sent: Saturday, December 02, 2006 10:36 PM
Subject: RE: [PERFORM] Regex performance issue



-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Alexandru 
Coseru

asterisk= explain analyze SELECT * FROM destlist WHERE
'0039051248787' ~
prefix AND id_ent='-2' AND dir=0 ORDER by length(prefix) DESC;


QUERY PLAN
--
--
 Sort  (cost=7925.07..7925.15 rows=31 width=67) (actual
time=857.715..857.716 rows=2 loops=1)
   Sort Key: length((prefix)::text)
   -  Bitmap Heap Scan on destlist  (cost=60.16..7924.30
rows=31 width=67)
(actual time=2.156..857.686 rows=2 loops=1)
 Recheck Cond: ((id_ent = -2) AND (dir = 0))
 Filter: ('0039051248787'::text ~ (prefix)::text)
 -  Bitmap Index Scan on destlist_indx2  (cost=0.00..60.16
rows=6193 width=0) (actual time=1.961..1.961 rows=5205 loops=1)
   Index Cond: ((id_ent = -2) AND (dir = 0))
 Total runtime: 857.804 ms
(8 rows)

mmumu btree (prefix varchar_pattern_ops)



I'm surpised Postgres isn't using the index on prefix seeing as the index
uses the varchar_pattern_ops operator class.  It could be that the index
isn't selective enough, or is Postgres not able to use an index with Posix
regular expressions?  The docs seem to say that it can, but I'd be curious
to see what happens if you use LIKE instead of ~.

Dave








 Regards,
 Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



--
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.1.409 / Virus Database: 268.15.4/563 - Release Date: 12/2/2006




---(end of broadcast)---
TIP 6: explain analyze is your friend



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings