Re: UdmSearch: Webboard: PostgreSQL performance

2001-01-11 Thread The Hermit Hacker


actually, PgSQL works quite effectively if you stay away from search.c and
move towards something like the Perl front-end ... we switched, and the
difference in response times was like going from night-day ... right now,
our index database is running:

%sbin/indexer -S etc/udmsearch.conf

  UdmSearch statistics

StatusExpired  Total
   -
 0  0  6 Not indexed yet
 1  0  2 Unknown status
   200  0  13446 OK
   301  0160 Moved Permanently
   302  0 18 Moved Temporarily
   304  0  89691 Not Modified
   400  0  1 Bad Request
   401  0  6 Unauthorized
   403  0  6 Forbidden
   404  0908 Not found
   500  0 36 Internal Server Error
   503  0 16 Service Unavailable
   -
 Total  0 104296

udmsearch=# select count(1) from ndict;
  count
--
 12949315
(1 row)


On Thu, 11 Jan 2001, gluke wrote:

 Author: gluke
 Email: [EMAIL PROTECTED]
 Message:
 If you want to use SQL backends for your search, try to use MySQL server. Our tests 
show that this is the fastest solution for mnoGoSearch with SQL databases.

 Reply: http://search.mnogo.ru/board/message.php?id=1068

 __
 If you want to unsubscribe send "unsubscribe udmsearch"
 to [EMAIL PROTECTED]



Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org



__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: Mnogosearch-perl: $search-work_time in mod_perl ...

2000-11-21 Thread The Hermit Hacker

On Tue, 21 Nov 2000, Dubun Guillaume wrote:

 Hi,
 
 The Hermit Hacker wrote:
  
  what does it represent?  I've set my restop to be:
  
  !--restop--
  Search results:
  small$W/smallHR
  Displaying documents $f-$l of total B$t/B found. ($SearchTime secs)
  !--/restop--
  
  and the resutls are coming out as:
  
  Displaying documents 1-10 of total 458 found. (1705334.945 secs)
 
 Have you the same results without modperl or with search.cgi ?
 What is your mnogosearch / mnogosearch-perl version ?

latest version ... and the number of secs keeps increasing, as if its time
since the modperl script was loaded:

Displaying documents 1-10 of total 512 found. (1733825.195 secs)

~7hrs since the last time I tested, which is about right:

1733825 - 1705334
28491
28491 / 60
474
474 / 60
7

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: PgSQL: DELETE INDEX url_url;

2000-11-18 Thread The Hermit Hacker


That is so obviously not going to work ... about six or seven lines past
the 'UdmGetToken()' call, there is an 'if' that sets query_words, which
has to be passed to the LoadTemplate when that while() loop is finished :(

Something I'd like to recommend is to take sql.c and split it off into
db_mysql.c, db_pgsql.c, db_oracle.c, etc ... so that each of us on our
different platforms can spend time to optimzie for that platform ... I'm
loath to do this right now cause if I do, I'm bound to screw up something
that another platform is dependant on ;(

On Mon, 13 Nov 2000, Alexander Barkov wrote:

 The Hermit Hacker wrote:
  
  On Sat, 11 Nov 2000, Alexander Barkov wrote:
  
   The Hermit Hacker wrote:
   
okay, can someone make the following changes to the source code, so that
the search avoids using the index ... this will at least give a temporary
fix until our LIKE optimizer is fixed:
   
SELECT ndict.url_id,ndict.intag
  FROM ndict,url
 WHERE ndict.word_id=1971739852
   AND url.rec_id=ndict.url_id
   AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
  
  
   I don't think that this is the best solution to fix search for buggy
   LIKE optimizer then to fix search back for fixed optimizer.
  
  After sending this out, it looks like there might be a bug in udmsearch
  itself, as I went through the code itself, in 3.1.7, and it is technically
  coded to do this, but it isn't sql.c:1894:
  
  if(c-DBType==UDM_DB_PGSQL)
  sprintf(UDM_STREND(c-urlstr),"(url.url || '') LIKE '%s')",URL);
  else
  sprintf(UDM_STREND(c-urlstr),"url.url LIKE '%s')",URL);
  return(0);
  
  Any idea why this isn't, in fact, working?
 
 
 I found  that the bug is in search.c
 
 Template is loaded after QUERY_STRING parsing. So, when URL limit are
 added,
 DBType is not known yet. You may just  move this:
 
 if(LoadTemplate(Indexer,template,query_words,0)){  
 printf("htmlbodyCan't open template file
 '%s'!/body/html

 return(0);  
 }
 
 before these lines:
 
 /* Parse Query String */   

 token=UdmGetToken(query_string,"",lasttok);
 while(token){
 
 
 
 
 It should work OK, but I'm not sure. Maybe something in LoadTemplate()
 requires
 QUERY_STRING to be already parsed.
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: PgSQL: DELETE INDEX url_url;

2000-11-11 Thread The Hermit Hacker

On Sat, 11 Nov 2000, Alexander Barkov wrote:

 The Hermit Hacker wrote:
  
  okay, can someone make the following changes to the source code, so that
  the search avoids using the index ... this will at least give a temporary
  fix until our LIKE optimizer is fixed:
  
  SELECT ndict.url_id,ndict.intag
FROM ndict,url
   WHERE ndict.word_id=1971739852
 AND url.rec_id=ndict.url_id
 AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');
 
 
 I don't think that this is the best solution to fix search for buggy 
 LIKE optimizer then to fix search back for fixed optimizer.

After sending this out, it looks like there might be a bug in udmsearch
itself, as I went through the code itself, in 3.1.7, and it is technically
coded to do this, but it isn't sql.c:1894:

if(c-DBType==UDM_DB_PGSQL)
sprintf(UDM_STREND(c-urlstr),"(url.url || '') LIKE '%s')",URL);
else
sprintf(UDM_STREND(c-urlstr),"url.url LIKE '%s')",URL);
return(0);

Any idea why this isn't, in fact, working?

My queries are coming out as the second of the two conditions, even though
my 'connect string' looks like:

DBAddr  pgsql:[EMAIL PROTECTED]/udmsearch/

Is my 'connect string' wrong?

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: Re[2]: UdmSearch: PgSQL: DELETE INDEX url_url;

2000-11-10 Thread The Hermit Hacker

On Fri, 10 Nov 2000, Sergey Kartashoff wrote:

 Hi!
 
 Thursday, November 09, 2000, 9:55:11 PM, you wrote:
 
 THH On Thu, 9 Nov 2000, Alexander Barkov wrote:
 
  Don't forget to recreate this index before starting indexer.
  As far as url_url index is UNIQUE this does not allow indexer to add
  the same link several time. If you remove index, the same documents
  might be added several times.
 
 THH can this unique index not be based on the crc32 value instead?  that might
 THH explain why I'm up to 140K docs when I was only expecting 91k :)
 
 no, index on crc32 cannot be unique, because of it will block adding
 site mirrors into url table.

okay, can someone make the following changes to the source code, so that
the search avoids using the index ... this will at least give a temporary
fix until our LIKE optimizer is fixed:

SELECT ndict.url_id,ndict.intag 
  FROM ndict,url 
 WHERE ndict.word_id=1971739852 
   AND url.rec_id=ndict.url_id  
   AND ( (url.url || ' ') LIKE 'http://www.postgresql.org/% ');

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: PgSQL: DELETE INDEX url_url;

2000-11-09 Thread The Hermit Hacker

On Thu, 9 Nov 2000, Alexander Barkov wrote:

 Don't forget to recreate this index before starting indexer.
 As far as url_url index is UNIQUE this does not allow indexer to add
 the same link several time. If you remove index, the same documents
 might be added several times.

can this unique index not be based on the crc32 value instead?  that might
explain why I'm up to 140K docs when I was only expecting 91k :)


 
 
 
 The Hermit Hacker wrote:
  
  Morning all ...
  
  Just a hint for anyone using udmsearch with PostgreSQL ... if you
  find performance is atrociuos, there is a relatively simple fix:
  
  DELETE INDEX url_url;
  
  I run the PostgreSQL.org site, and indexing it, with its mailing
  lists, has so far generated a database that currently looks like, and is
  growing:
  
  sbin/indexer -S etc/udmsearch.conf
  
UdmSearch statistics
  
  StatusExpired  Total
 -
   0  77516  77722 Not indexed yet
 200503  60508 OK
 301  0104 Moved Permanently
 302  0 27 Moved Temporarily
 304  0   4179 Not Modified
 401  0  2 Unauthorized
 403  0 17 Forbidden
 404  0666 Not found
 500  0 26 Internal Server Error
 503 10 50 Service Unavailable
 -
   Total  78029 143301
  
  Due to a limitation in current PgSQL with its LIKE query
  optimizer, it "mis-optimizes" the query so that an EXPLAIN looks good, but
  the query itself will take several minutes to run on a Dual-PIII 450Mhz
  with fast drives ...
  
  Getting rid of the INDEX helps keep the numbers lower, but am
  looking forward to us getting the optimizer fixed :)
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: Re[2]: UdmSearch: New message on the WebBoard #1: File HTTPsearch / FTP search CONF file Questions

2000-11-08 Thread The Hermit Hacker


thank you very much ... so much easier to follow threads when you can read
them :)


On Wed, 8 Nov 2000, Sergey Kartashoff wrote:

 Hi!
 
 Wednesday, November 08, 2000, 2:44:15 AM, you wrote:
 
 
 THH is it possible to get rid of the 'UdmSearch: new message on the webboard'
 THH part of these subjects?  
 
 Yes, i shorten this subject already.
 
 -- 
 Regards, Sergey aka gluke.
 
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: PgSQL: DELETE INDEX url_url;

2000-11-08 Thread The Hermit Hacker


Morning all ...

Just a hint for anyone using udmsearch with PostgreSQL ... if you
find performance is atrociuos, there is a relatively simple fix:

DELETE INDEX url_url;

I run the PostgreSQL.org site, and indexing it, with its mailing 
lists, has so far generated a database that currently looks like, and is
growing:

sbin/indexer -S etc/udmsearch.conf

  UdmSearch statistics

StatusExpired  Total
   -
 0  77516  77722 Not indexed yet
   200503  60508 OK
   301  0104 Moved Permanently
   302  0 27 Moved Temporarily
   304  0   4179 Not Modified
   401  0  2 Unauthorized
   403  0 17 Forbidden
   404  0666 Not found
   500  0 26 Internal Server Error
   503 10 50 Service Unavailable
   -
 Total  78029 143301

Due to a limitation in current PgSQL with its LIKE query
optimizer, it "mis-optimizes" the query so that an EXPLAIN looks good, but
the query itself will take several minutes to run on a Dual-PIII 450Mhz
with fast drives ...

Getting rid of the INDEX helps keep the numbers lower, but am
looking forward to us getting the optimizer fixed :)


Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Re: threads not threading?

2000-11-06 Thread The Hermit Hacker


ah, might want to add that tothe man page ... it was the first place I
checked :(


On Mon, 6 Nov 2000, Alexander Barkov wrote:

 Use "indexer -Nxxx"  where xxx is a number of threads to start.
 
 
 The Hermit Hacker wrote:
  
  Why is it, with freebsd pthreads enabled, on a dual-CPU PIII, I never see
  more then one URL indexed at once?
  
  Indexer[20399]: [1] 
http://www.postgresql.org/mhonarc/pgsql-ports/1998-11/threads.html
  Indexer[20399]: [1] 
http://www.postgresql.org/mhonarc/pgsql-ports/1998-12/index.html
  Indexer[20399]: [1] 
http://www.postgresql.org/mhonarc/pgsql-ports/1998-12/threads.html
  Indexer[20399]: [1] 
http://www.postgresql.org/mhonarc/pgsql-ports/1999-01/index.html
  Indexer[20399]: [1] 
http://www.postgresql.org/mhonarc/pgsql-ports/1999-01/threads.html
  Indexer[20399]: [1] 
http://www.postgresql.org/mhonarc/pgsql-ports/1999-02/index.html
  Indexer[20399]: [1] 
http://www.postgresql.org/mhonarc/pgsql-ports/1999-02/threads.html
  Indexer[20399]: [1] 
http://www.postgresql.org/mhonarc/pgsql-ports/1999-03/index.html
  
  Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
  Systems Administrator @ hub.org
  primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Can someone explain this to me?

2000-11-05 Thread The Hermit Hacker


Why, in 3.1.7, when I do a search, do I see:

[14960] SQL 0.26s: SELECT word,lang FROM stopword
[14960] SQL 0.14s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=-1274543809 AND url.rec_id=ndict.url_id  AND (url.url LIKE 
'http://www.postgresql.org/%')
[14960] SQL 0.01s: INSERT INTO qtrack (qwords,qtime,found) VALUES ('rpm',973462820,11)

[14960] SQL 0.05s: SELECT 
rec_id,url,content_type,last_mod_time,title,txt,docsize,last_index_time,next_index_time,referrer,keywords,description,crc32,category
 FROM url WHERE rec_id IN (6,8,15,26,79,119,147,150,158,166,404) ORDER BY rec_id

[14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-956346810 AND (status=200 OR status=304)
[14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-1089368843 AND (status=200 OR status=304)
[14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=690675342 AND (status=200 OR status=304)
[14960] SQL 0.00s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=967505276 AND (status=200 OR status=304)
[14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=1672365359 AND (status=200 OR status=304)
[14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-762182779 AND (status=200 OR status=304)
[14960] SQL 0.00s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=268438658 AND (status=200 OR status=304)
[14960] SQL 0.03s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-348977251 AND (status=200 OR status=304)
[14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-140427905 AND (status=200 OR status=304)
[14960] SQL 0.00s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=962713810 AND (status=200 OR status=304)
[14960] SQL 0.00s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-1930052569 AND (status=200 OR status=304)

Why are there those extra 11 SELECT calls after we've already gotten the
information above?  Why not do:

SELECT 
rec_id,url,content_type,last_mod_time,title,txt,docsize,last_index_time,next_index_time,referrer,keywords,description,crc32,category
 
  FROM url 
 WHERE rec_id IN (6,8,15,26,79,119,147,150,158,166,404) 
   AND (status = 200 OR status = 304) ORDER BY rec_id

And do it all in one query instead of 11?

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Re: Can someone explain this to me?

2000-11-05 Thread The Hermit Hacker


then again, if I'm only searching on one word, why the different crc32
searches? *raised eyebrow*

On Sun, 5 Nov 2000, The Hermit Hacker wrote:

 
 Why, in 3.1.7, when I do a search, do I see:
 
 [14960] SQL 0.26s: SELECT word,lang FROM stopword
 [14960] SQL 0.14s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=-1274543809 AND url.rec_id=ndict.url_id  AND (url.url LIKE 
'http://www.postgresql.org/%')
 [14960] SQL 0.01s: INSERT INTO qtrack (qwords,qtime,found) VALUES 
('rpm',973462820,11)
 
 [14960] SQL 0.05s: SELECT 
rec_id,url,content_type,last_mod_time,title,txt,docsize,last_index_time,next_index_time,referrer,keywords,description,crc32,category
 FROM url WHERE rec_id IN (6,8,15,26,79,119,147,150,158,166,404) ORDER BY rec_id
 
 [14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-956346810 AND (status=200 OR status=304)
 [14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-1089368843 AND (status=200 OR status=304)
 [14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=690675342 AND (status=200 OR status=304)
 [14960] SQL 0.00s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=967505276 AND (status=200 OR status=304)
 [14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=1672365359 AND (status=200 OR status=304)
 [14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-762182779 AND (status=200 OR status=304)
 [14960] SQL 0.00s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=268438658 AND (status=200 OR status=304)
 [14960] SQL 0.03s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-348977251 AND (status=200 OR status=304)
 [14960] SQL 0.01s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-140427905 AND (status=200 OR status=304)
 [14960] SQL 0.00s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=962713810 AND (status=200 OR status=304)
 [14960] SQL 0.00s: SELECT rec_id,url,content_type,last_mod_time FROM url WHERE 
crc32=-1930052569 AND (status=200 OR status=304)
 
 Why are there those extra 11 SELECT calls after we've already gotten the
 information above?  Why not do:
 
 SELECT 
rec_id,url,content_type,last_mod_time,title,txt,docsize,last_index_time,next_index_time,referrer,keywords,description,crc32,category
 
   FROM url 
  WHERE rec_id IN (6,8,15,26,79,119,147,150,158,166,404) 
AND (status = 200 OR status = 304) ORDER BY rec_id
 
 And do it all in one query instead of 11?
 
 Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
 Systems Administrator @ hub.org 
 primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: threads not threading?

2000-11-05 Thread The Hermit Hacker


Why is it, with freebsd pthreads enabled, on a dual-CPU PIII, I never see
more then one URL indexed at once?

Indexer[20399]: [1] http://www.postgresql.org/mhonarc/pgsql-ports/1998-11/threads.html
Indexer[20399]: [1] http://www.postgresql.org/mhonarc/pgsql-ports/1998-12/index.html
Indexer[20399]: [1] http://www.postgresql.org/mhonarc/pgsql-ports/1998-12/threads.html
Indexer[20399]: [1] http://www.postgresql.org/mhonarc/pgsql-ports/1999-01/index.html
Indexer[20399]: [1] http://www.postgresql.org/mhonarc/pgsql-ports/1999-01/threads.html
Indexer[20399]: [1] http://www.postgresql.org/mhonarc/pgsql-ports/1999-02/index.html
Indexer[20399]: [1] http://www.postgresql.org/mhonarc/pgsql-ports/1999-02/threads.html
Indexer[20399]: [1] http://www.postgresql.org/mhonarc/pgsql-ports/1999-03/index.html


Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: MySQL compared to Oracle8?

2000-09-15 Thread The Hermit Hacker

On Fri, 15 Sep 2000, Alain TESIO wrote:

  I am interested in some performance differency measurments
  of MySQL compared to Oracle8 when using UdmSearch.
  
  I know, the Oracle should be faster, but how much could we gain?
 
 How do you know ?
 You assume it's faster because you have to pay for it ?
 
 There are some benchmarks on the mysql site :
 
 http://www.mysql.com/information/benchmarks.html
 
 You may want to run your own benchmarks anyway, first because it will
 be yours, and then because most results in a benchmark are not
 significative for a search engine.

Always take benchmarks with a grain of salt until those running the tests
are working hand in hand with the other vendors ... they will almost
always be biased towards the vendor, as they will rarely spend the time to
get the tests to use features of the other vendors ...

 It's essentially read-only, and MySQL is very good for such an use,
 even if it has some drawbacks for other uses where you need concurrent
 read/write accesses.

actually, there are features that MySQL doesn't support that would greatly
help for search engines, main one being teh ability to use subselects so
that you run one query to find an answer instead of one query per word
being searched ...


__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: using subselects, anyone looking at this?

2000-08-21 Thread The Hermit Hacker


a couple of weeks back, I sent in a suggestion on how to reduce query time
by what worked out to be ~1/3 when searching multiple words ... including
the exact query required ...

has/is anyone looking at how to implement this?

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Over 9million records and slower then molasses ...

2000-08-05 Thread The Hermit Hacker


I'm running udmsearch 3.1.3 against a PostgreSQL v7.0.2 database which is
on a Dual-PIII with 512Meg of RAM ... and queries are dog slow ...

Looking things over, and I'm wondering if it might be an idea to reduce
the string-length of the url field in the url table so that there is less
of a string to do a LIKE query against ...

Basically, create a new table called 'server' that contains two fields:

server ( http://www.postgresql.org )
server_id ( 0 ) ...

then, in the url table, have a field 'server_id' that refers to that, and
strip off the 'http://www.postgresql.org' component from the 'url' field
in the url table ... out of 9 million records, cutting off ~24bytes per
record would greatly reduce that file, and, I think, improve the
performance of the like search, such that this:

SELECT ndict.url_id,ndict.intag 
  FROM ndict,url 
 WHERE ndict.word_id=-667920895 
   AND url.rec_id=ndict.url_id  
   AND ((url.url || '') LIKE 'http://www.postgresql.org/mhonarc/pgsql-sql/%')

would become:

SELECT ndict.url_id,ndict.intag 
  FROM ndict,url 
 WHERE ndict.word_id=-667920895 
   AND url.rec_id=ndict.url_id  
   AND url.url LIKE '/mhonarc/pgsql-sql/%')
   AND url.server = server.id
   AND server.server = 'http://www.postgresql.org';

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Potential speed up for DBs supporting subselects ...

2000-08-05 Thread The Hermit Hacker


I'm finding that the more conditions you search on, the slower things
become, to the point of unusable ... looking at the logs with DEBUG
turned on, for instance, if I search on '4 words', it generates four
quesries ...

I'm wondering if using a subselect like the following would/could be used
to speed up this process:

SELECT url_id,intag
  FROM ndict
 WHERE word_id=572517542
   AND EXISTS ( SELECT ndict.url_id
  FROM ndict,url
 WHERE ndict.word_id=-747320868
   AND url.rec_id=ndict.url_id
   AND ((url.url || '') LIKE 
'http://www.postgresql.org/mhonarc/pgsql-hackers/%'));

Basically, the idea is that the first subquery reduces the number of url's
that the top level has to scan to just those that satisfy the search for
URLs with a word_id of -747320868 ...

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: Over 9million records and slower then molasses ...

2000-08-05 Thread The Hermit Hacker

On Sun, 6 Aug 2000, Alain TESIO wrote:

 The Hermit Hacker [EMAIL PROTECTED] wrote:
 
  A simple subselect would eliminate the extra join ...
 
 Adding a subselect or the join is even regarding the
 performance loss, subselects can prevent the db engine
 from using the best indexes.
 
 As you have a big database, what about using it to
 test your suggestions ? I think there are bad but you
 may want to check.
 Try my suggestion with hashed url path parts,
 it's probably the best way to speed up a filter on url.

whether you store the server url in the url table or a seperate table
isn't going to matter much, I agree ... I don't agree that storing three
different crc values is a good idea, but if even the url table could be
modified to include a 'server' field that is searched via = vs LIKE, and
remove that part of the substring from the url, that would help ...

The problem is that something like PostgreSQL's LIKE condition is only
significant to something like 15 characters for using an INDEX ... by
removing, for example, the http://www.postgresql.org part of the url,
which is the "Fixed" component for all URLs on that site, you've removed
24 characters from the LIKE search and moved the search to being *just*
the part that is significant ...


__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: ANNOUNCE 3.1.1

2000-06-30 Thread The Hermit Hacker


This one appears to be running much better then 3.1.0 ... its actually
indexing this time :)

On Fri, 30 Jun 2000, Alexander Barkov wrote:

 Hello!
 
 UdmSearch-3.1.1  is available from our site http://search.mnogo.ru/
 
 From ChangeLog:
 
 * Nested categories support has been added (docs coming soon).
 * Tag type has been changed to CHAR instead of INT. 
The above changes require updating database structure. (Table "url").
 "Category" indexer.conf command has been added 
 * $CP template variable to display category path has been added 
 * $CS template variable to display current category subtree has been
 added 
 * "cat" search.cgi parameter has been added to pass the category to
 search through 
 * $cat template variable to display current category ID. 
 * A bug in cp1250 and iso-8859-2 support has been fixed 
 * Some functions moved from log.c to udmutils.c 
 * udm_snprintf for Solaris has been added 
 * Some bugs have been fixed 
 
 
 -- 
 Alexander Barkov
 IZHCOM, Izhevsk
 email:[EMAIL PROTECTED]  | http://www.izhcom.ru
 Phone:+7 (3412) 51-23-76 | Fax: +7 (3412) 78-70-10
 ICQ:  7748759
 __
 If you want to unsubscribe send "unsubscribe udmsearch"
 to [EMAIL PROTECTED]
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Re: your mail

2000-06-22 Thread The Hermit Hacker


First suggestion: upgraded to v7.0.2 of PostgreSQL ... major difference in
performance and stablity ...

On Thu, 22 Jun 2000, Stanislav Kuba wrote:

 Hello!
 I have problem with udmsearch and  postgresql.
 1) If I use "DBMode crc-multi" or "DBMode multi" the indexer sometimes fails
 with error :
 -
 Indexer[7443]: [1] Error: 'pqReadData() -- backend closed the channel
 unexpectedly.
 This probably means the backend terminated abnormally
 before or while processing the request.
 ---
 The DBMode single is OK.
 
 2) The PHP search sometimes fails with errors :
 --
 An error occured!
 
 Query error: SELECT flag,lang,mask,find,repl FROM affix WHERE type='s'
 pqReadData() -- read() failed: errno=32 Broken pipe
 
 or
 
  An error occured!
 
 Query error: SELECT flag,lang,mask,find,repl FROM affix WHERE type='s'
  pqFlush() -- backend closed the channel unexpectedly. This probably means
 the backend terminated abnormally
 before or while processing the  request.
 
 This happens with DBMode single.
 
 3) The db connection is opened after search. Is it OK?
 
 Can you help me?
 
 I have Linux , Postgresql 6.5.3 and udmsearch 3.0.18 and other PHP+posgtres
 application work OK.
 
 
 Stnislav Kuba
 
 
 __
 If you want to unsubscribe send "unsubscribe udmsearch"
 to [EMAIL PROTECTED]
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: Indexer performance limits (MySQL locking?)

2000-06-22 Thread The Hermit Hacker

On Thu, 22 Jun 2000, J C Lawrence wrote:

 On Thu, 22 Jun 2000 15:46:20 -0300 (ADT) 
 The Hermit Hacker [EMAIL PROTECTED] wrote:
 
  Try setting your Period higher and using the -n option to restrict
  the number of pages it does in an invocation ...
 
  for instance, set Period to 1week, and -n option to 20k ...
 
  this way it only processes 20k expired pages, and they will only
  expire again in a week ...
 
 This creates a different problem, and is why I have the Period set
 low:
 
   There are certain pages that must be spidered every index.
 
 Specifically, most of my site consists of mailing list archives.
 The various index pages for the messages in those archives need to
 be spidered every time to pick up the new messages.
 
 Summary:
 
   I need a long period to prevent everything being indexed every
 time, but I also need certain pages to be spidered for new URLs
 every single time the indexer runs.  How to do?

that one I'm interested in as well ... I've been suffering with things so
far, as I didn't/don't think there is currently a way of treating
'subpages' seperately if the toplevel page is already being indexed ...

Someway of doing:

Period 604800
Server http://www.postgresql.org/%
Period 86400
Server http://www.postgresql.org/mhonarc/pgsql-hackers

Would be great ...


__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Re: Damn, upgraded to 3.1.0 and now it won't index :(

2000-06-21 Thread The Hermit Hacker

On Wed, 21 Jun 2000, Alexander Barkov wrote:

 I've fixed this yesterday. This will be available in 3.1.1 soon.

Some way of getting critical patches out like this needs to be implemented
:(  Anyway of getting anon-cvs or something going?


  
 
 
 - Original Message -
 From: The Hermit Hacker [EMAIL PROTECTED]
 To: Alexander Barkov [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Sent: Wednesday, June 21, 2000 6:38 AM
 Subject: Damn, upgraded to 3.1.0 and now it won't index :(
 
 
 
  %sbin/indexer -n 10 etc/udmsearch.conf
  Indexer[35853]: indexer from UdmSearch v.3.1.0/PgSQL started with
 'etc/udmsearch.conf'
  Indexer[35853]: [1] Done
 
  I've replaced my .conf file with the one that comes with the distribution,
  just in case I was missing something ...
 
  Marc G. Fournier   ICQ#7615664   IRC Nick:
 Scrappy
  Systems Administrator @ hub.org
  primary: [EMAIL PROTECTED]   secondary:
 scrappy@{freebsd|postgresql}.org
 
 
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Damn, upgraded to 3.1.0 and now it won't index :(

2000-06-20 Thread The Hermit Hacker


%sbin/indexer -n 10 etc/udmsearch.conf
Indexer[35853]: indexer from UdmSearch v.3.1.0/PgSQL started with 'etc/udmsearch.conf'
Indexer[35853]: [1] Done

I've replaced my .conf file with the one that comes with the distribution,
just in case I was missing something ...

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: Somewhat off-topic: bad performance using Postgres

2000-06-04 Thread The Hermit Hacker


okay, several questions here, first one being what version of PostgreSQL
are you using?  What hardware are you running on?

I'm running UDMSearch 3.0.17 with PostgreSQL v7.0 on a Dual-PIII with
512Meg of RAM, with the databases sitting on an 18gig SCSI-3 hard drive
...

I'm running in crc-multi mode, with a database currently taking up 1.8gig,
and stats showing:

%sbin/indexer -S etc/udmsearch.conf 

  UdmSearch statistics

StatusExpired  Total
   -
   200  17935  17935 OK
   301 62 62 Moved Permanently
   302  2  2 Moved Temporarily
   304  51108  51108 Not Modified
   401  1  1 Unauthorized
   403 14 14 Forbidden
   404344344 Not found
   500  2  2 Internal Server Error
   503  4  4 Service Unavailable
   -
 Total  69472  69472

And it definitely didn't take me a week to load it ... 

From my experience, UDMSearch isn't the most efficient in its queries,
because MySQL doesn't support features that would *probably* improve it,
namely the use of subqueries, but I haven't had the time to sit down and
try and optimize the code, so am not 100% certain that that would help ...

One thing that you might find helps is to do something like:

while(1)
  indexer -n 5000 indexer.conf
  psql -c "vacuum analyze;" database
end

(or similar) ... so that it does periodic vacuums ... in 'near future'
released of PostgreSQL, there are plans for an 'overwrite storage
manager', but, for now, if you do an UPDATE on a record, it has to INSERT
a new one and DELETE the old, so tables grow a little faster then you'd
hope ...

On Mon, 5 Jun 2000, Alain TESIO wrote:

  I've noticed that indexing becomes teriibly slow when using postgres as
  the backend (regardless of version), versus MySQL... anybody know why?
  Is it because PG supports
  transactions and more stuff than MySQL?
 
 I have noticed the same for another similar job. PostgreSQL has nice
 features,
 especially transactions (should be soon available in MySQL)
 
  With MySQL I was able to index 6 urls within a 24Hrs, using Postgres
  this will take significant more time - nearly a week!
 
  Anybody got suggestions?
 
 I don't know what kind of queries are executed during the indexing, if the
 insertions don't need indexes you could try to drop the index and recreate
 them when the indexing is done.
 
 Alain
 
 
 __
 If you want to unsubscribe send "unsubscribe udmsearch"
 to [EMAIL PROTECTED]
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: 3.0.10 - 3.0.14: udmsearch.conf

2000-05-06 Thread The Hermit Hacker


Morning all ...

Just re-installed .10 to test out whether or not I've gone crazy ... .10
works fine on my server, indexes everything ... .14 fails to instead
http://www.postgresql.org at all ...

Below is the conf file that I'm using on my server ... and have been
pretty much since day one ... 

-
DBPort  5432
DBHost  db.hub.org
DBName  udmsearch
DBUser  pgsql
DBPass  

DBMode crc

Disallow /cgi-bin/ \.cgi /nph \?

Disallow \.b$\.sh$   \.md5$   \.rpm$
Disallow \.arj$  \.tar$  \.zip$  \.tgz$  \.gz$
Disallow \.lha$  \.lzh$  \.tar\.Z$  \.rar$  \.zoo$
Disallow \.gif$  \.jpg$  \.jpeg$ \.bmp$  \.tiff$ \.xpm$ \.xbm$ 
Disallow \.vdo$  \.mpeg$ \.mpe$  \.mpg$  \.avi$  \.movie$
Disallow \.mid$  \.mp3$  \.rm$   \.ram$  \.wav$  \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$  \.png$
Disallow \.exe$  \.cab$  \.dll$  \.bin$  \.class$
Disallow \.tex$  \.texi$ \.xls$  \.doc$  \.texinfo$
Disallow \.rtf$  \.pdf$  \.cdf$  \.ps$
Disallow \.ai$   \.eps$  \.ppt$  \.hqx$
Disallow \.cpt$  \.bms$  \.oda$  \.tcl$
Disallow \.o$ \.a$ \.la$ \.so$ \.so\.[0-9]$ 
Disallow \.pat$ \.pm$ \.m4$ \.am$

Disallow \?D=A$ \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
Disallow /[.]{1,2} /\%2e /\%2f
Disallow [^:]//

AddType text/plain  \.pl$ \.js$ \.txt$ \.h$ \.c$ \.pm$ \.e$
AddType text/html   \.html$ \.htm$
AddType image/x-xpixmap \.xpm$
AddType image/x-xbitmap \.xbm$
AddType image/gif   \.gif$
AddType application/unknown .*

# Hub's stuff
Server http://www.postgresql.org/
Server http://www.hub.org/

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: udmsearch .14 ...

2000-05-05 Thread The Hermit Hacker



Just upgraded to .14, and due to some work I had to do on the database
server, cleared out the database and was goign to rebuild it ...

Using the same config from my .10 indexer, I can't index the site anymore
... just comes back with the following.  Has something changed that would
affect this?

pgsql sbin/indexer etc/udmsearch.conf
Indexer[29711]: indexer from UdmSearch v.3.0.14/PgSQL started with 'etc/udmsearch.conf'
Indexer[29711]: [1] http://www.postgresql.org/robots.txt
Indexer[29711]: [1] http://www.postgresql.org/
Indexer[29711]: [1] Done

Actually, something has changed, as I used to just have:

Server http://www.postgresql.org

which will now generate:

Indexer[38871]: indexer from UdmSearch v.3.0.14/PgSQL started with
'etc/udmsearch.conf'
Indexer[38871]: [1] http:///robots.txt
Indexer[38871]: [1] http://www.postgresql.org
Indexer[38871]: [1] Unknown host
Indexer[38871]: [1] Done

Where it didn't before ... but what has changed to prevent me from being
able to index that site? :(





__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: udmsearch .14 ...

2000-05-05 Thread The Hermit Hacker

On Fri, 5 May 2000, J C Lawrence wrote:

 On Fri, 5 May 2000 21:11:28 -0300 (ADT) 
 The Hermit Hacker [EMAIL PROTECTED] wrote:
 
  Just upgraded to .14, and due to some work I had to do on the
  database server, cleared out the database and was goign to rebuild
  it ...
 
 Is your robots.txt refusing access to the Indexer?  Have you edited
 indexer.conf to tell it to ignore robots.txt?

I don't have any robots.txt :(


__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: udmsearch on a hosted area (ex. hway.net)

2000-05-04 Thread The Hermit Hacker


for those in this sort of situation, PostgreSQL, Inc does offer a
third-party hosting service, where we'll run the indexing for you and
provide you with binaries, as applicable ...

check out http://www.pgsql.com/hosting.html for more details ...

On Thu, 4 May 2000, Lone Directory wrote:

 Is it possible to install udmsearch on a web hosting
 site without having telnet access? is there a special
 procedure?
 
 thanks
 blackdir
 
 
 
 __
 Do You Yahoo!?
 Send instant messages  get email alerts with Yahoo! Messenger.
 http://im.yahoo.com/
 __
 If you want to unsubscribe send "unsubscribe udmsearch"
 to [EMAIL PROTECTED]
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: Prob with 3.0.13

2000-05-02 Thread The Hermit Hacker

On Tue, 2 May 2000, Alexander Barkov wrote:

 Hi!
 
 
 - Original Message -
 From: Martin Ebert [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: Tuesday, May 02, 2000 2:38 AM
 Subject: UdmSearch: Prob with 3.0.13
 
  User ist udm, database is udmsearch, no password.
 
  indexer says:
  # ./indexer
  Indexer[558]: indexer from UdmSearch v.3.0.13/PgSQL started with
  '/usr/local/udmsearch/etc/indexer.conf'
  Indexer[558]: [1] Error: 'ERROR:  robots: Permission denied.
 
  Part of indexer.conf:
  DBAddr  pgsql://udm:@localhost/udmsearch/
 
  What's wrong?
 
 I tested again with PgSQL-6.5.3 and it does work fine on my RH-6.1
 My DBAddr string was psql://test@localhost/test/
 
 
  Another question:
 
  I want to play with "DBMode crc-multi".
  Should I create other tables?
  If so, how can I find a script?
 
 It is in /create/pgsql directory of UdmSearch sources. You have to use
 "ndictXX" tables definitions with related indexes. But the
 problem is that there is a strange bug in PostgreSQL-6.5.3 (previous
 versions are known to work fine). Postmaster crashes after a sequence
 of INSERTS performed by indexer. It starts work fine when "hash" index type
 is used instead of  default "btree" with ndictXX tables. This is thing to be
 reported to PgSQL maling lists: strange behavour of  "int4" field type
 indexes,
 but I had no time to do that yet :-(

Just an FYI ... v7.0 is to be released any day now (we're just finishing
cleaning up docs and that's about it) ... I'm currently running 3.0.10 on:

pgsql /usr/local/udmsearch/sbin/indexer -S
/usr/local/udmsearch/etc/udmsearch.conf 

  UdmSearch statistics

StatusExpired  Total
   -
   200  0  85151 OK
   301  0120 Moved Permanently
   401  0  1 Unauthorized
   403  0  4 Forbidden
   404  0581 Not found
   500  0  2 Internal Server Error
   503  0  6 Service Unavailable
   -
 Total  0  85865

With an ndict that has grown to:


And a data/base directory of:

udmsearch= select count(word_id) from ndict;
   count

10495461
(1 row)

pgsql% du data/base/udmsearch
2637066 data/base/udmsearch

And results that come back in an acceptable period of time using just the
crc method ...

Check her out at:

http://www.postgresql.org/mhonarc/pgsql-hackers

Under PgSQL, there are a few things you *have* to do to get any sort of
performance though ... first one is disable-auto-wildcard ... its
evil.  The problem is that it creates a LIKE search of '%url%', so if you
want to find all urls that *start with* 'http://www.' something, you
pretty much screw up how PgSQL uses its indices ... All my 'ul' values
state 'http://www.../%' ...

Other then that, hoping Alexander got the LIKE query patch I submit'd
awhile back into the tree, she pretty much flies ...

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Just some numbers ...

2000-04-18 Thread The Hermit Hacker


Morning all ...

Well, UDMSearch is now fully loaded with the PostgreSQL Web Site,
and a couple of other ones.  

The database reports itself as:

udmsearch= select version();
version
---
PostgreSQL 7.0.0 on i386-unknown-freebsdelf3.4, compiled by gcc 2.7.2.3
(1 row)

Its actually beta5, so doesn't have Tom's LIKE fixes, only the
QUERY work around for now.

The dictionary (ndict) table contains:

udmsearch= select count(word_id) from ndict;
  count
---
9695376
(1 row)

And the url table contains:

udmsearch= select count(rec_id) from url;
count
-
78943
(1 row)

So not a particularly small database :)  Disk space consumption,
after a vacuum, is at:

pgsql% du udmsearch
2238858 udmsearch

The server that it is running on is a Dual-PIII 450 with 512Meg of
RAM, and the database itself is sitting on an 18gig 10k RPM U2W SCSI hard
drive ... and PostgreSQL is the only server running on this machine ...

If I go to http://www.postgresql.org/search.cgi and search *all
docs* for MVCC, it returns 565 URLs, in ~60secs, with the bulk of the
search being spent in:

[14180] START SQL 16535819.64s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=572517542 AND url.rec_id=ndict.url_id  AND ((url.url || '') LIKE 
'http://www.postgresql.org/%')
[14180]   END SQL 52.47s

All in all, not too bad ... :)

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Re: [HACKERS] Just some numbers ...

2000-04-18 Thread The Hermit Hacker


d'oh ... I *swore* I did a VACUUM ANALYZE before sending these in, but it
turns out I only did a VACUUM :(  Now, the query is down to:

[40552] START SQL 16545848.04s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=572517542 AND url.rec_id=ndict.url_id  AND ((url.url || '') LIKE 
'http://www.postgresql.org/%')
[40552]   END SQL 7.66s

Same thing being searched (mvcc off of the URL below) ...

explain:

NOTICE:  QUERY PLAN:

Hash Join  (cost=6672.12..8.80 rows=239 width=10)
  -  Index Scan using n_word_id on ndict  (cost=0.00..79566.29 rows=23916 width=6)
  -  Hash  (cost=6670.15..6670.15 rows=789 width=4)
-  Seq Scan on url  (cost=0.00..6670.15 rows=789 width=4)

Sweet :)


 The Hermit Hacker [EMAIL PROTECTED] writes:
  If I go to http://www.postgresql.org/search.cgi and search *all
  docs* for MVCC, it returns 565 URLs, in ~60secs, with the bulk of the
  search being spent in:
 
  [14180] START SQL 16535819.64s: SELECT ndict.url_id,ndict.intag FROM ndict,url 
WHERE ndict.word_id=572517542 AND url.rec_id=ndict.url_id  AND ((url.url || '') LIKE 
'http://www.postgresql.org/%')
  [14180]   END SQL 52.47s
 
 What was the plan currently being generated for those, again?
 
   regards, tom lane
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 


__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: Suggested patch for speed improvements under PostgreSQL ...

2000-04-14 Thread The Hermit Hacker


The following patch is meant to force PostgreSQL (including v7.0) to *not*
try and use the index on a LIKE query.  There is an overestimation of the
speed of doing a LIKE search in PostgreSQL that values doing it before
doing an equality one ... end result on my current search is that a search
has to process ~60k*5million records, vs 11k*60k records to find the
results ... slightly slower :)

Only been tested manually so far, but is going in full bore in 5 minutes
or so to test it u nder load ...

*** sql.c   Tue Apr 11 19:18:22 2000
--- sql2.c  Fri Apr 14 11:59:09 2000
***
*** 1710,1716 
  __INDLIB__ int AddURLLimit(char *URL){
if(*urlstr)strcpy(UDM_STREND(urlstr)-1," OR ");
elsestrcat(urlstr," AND (");
!   sprintf(UDM_STREND(urlstr),"url.url LIKE '%s')",URL);
return(0);
  }
  __INDLIB__ int ClearLimits(){
--- 1710,1716 
  __INDLIB__ int AddURLLimit(char *URL){
if(*urlstr)strcpy(UDM_STREND(urlstr)-1," OR ");
elsestrcat(urlstr," AND (");
!   sprintf(UDM_STREND(urlstr),"(url.url || '') LIKE '%s')",URL);
return(0);
  }
  __INDLIB__ int ClearLimits(){


Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: [v7.0beta5] LIKE timings ...

2000-04-14 Thread The Hermit Hacker


Well, with that simple change of (url.url || ''), the results go to being
much much more acceptable for a database whose ndict now contains
6million+ tuples :)

[46327] START SQL 16213186.97s: SELECT word,lang FROM stopword
[46327]   END SQL 0.07s
[46327] START SQL 16213187.04s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=572517542 AND url.rec_id=ndict.url_id  AND ((url.url || '') LIKE 
'http://www.postgresql.org/docs/%')
[46327]   END SQL 1.55s
[46327] START SQL 16213188.59s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=13119092 AND url.rec_id=ndict.url_id  AND ((url.url || '') LIKE 
'http://www.postgresql.org/docs/%')
[46327]   END SQL 1.71s
[46327] START SQL 16213190.30s: SELECT 
rec_id,url,content_type,last_modified,title,txt, 
docsize,last_index_time,next_index_time, referrer,keywords,description,crc FROM url 
WHERE rec_id=14
[46327]   END SQL 0.03s
[46327] START SQL 16213190.33s: SELECT 
rec_id,url,content_type,last_modified,title,txt, 
docsize,last_index_time,next_index_time, referrer,keywords,description,crc FROM url 
WHERE rec_id=1444
[46327]   END SQL 0.02s
[46327] START SQL 16213190.35s: SELECT 
rec_id,url,content_type,last_modified,title,txt, 
docsize,last_index_time,next_index_time, referrer,keywords,description,crc FROM url 
WHERE rec_id=1598
[46327]   END SQL 0.04s
[46327] START SQL 16213190.39s: SELECT 
rec_id,url,content_type,last_modified,title,txt, 
docsize,last_index_time,next_index_time, referrer,keywords,description,crc FROM url 
WHERE rec_id=1723
[46327]   END SQL 0.15s
[46327] START SQL 16213190.55s: SELECT rec_id,url,content_type,last_modified FROM url 
WHERE crc='2aa2ad0362d064152caaed1010e6e3f9'
[46327]   END SQL 0.83s
[46327] START SQL 16213191.38s: SELECT rec_id,url,content_type,last_modified FROM url 
WHERE crc='b8d5691766f8ef8c59ac42f795a0a084'
[46327]   END SQL 0.36s
[46327] START SQL 16213191.73s: SELECT rec_id,url,content_type,last_modified FROM url 
WHERE crc='82f5c87cd45da6c97a25804f71e370a5'
[46327]   END SQL 0.35s
[46327] START SQL 16213192.09s: SELECT rec_id,url,content_type,last_modified FROM url 
WHERE crc='80494d661646144b02fd8aed89353797'
[46327]   END SQL 0.35s



Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




Re: UdmSearch: Re: [HACKERS] [v7.0beta5] LIKE timings ...

2000-04-14 Thread The Hermit Hacker

On Fri, 14 Apr 2000, Tom Lane wrote:

 The Hermit Hacker [EMAIL PROTECTED] writes:
  Well, with that simple change of (url.url || ''), the results go to being
  much much more acceptable for a database whose ndict now contains
  6million+ tuples :)
 
 So are we going to stand down from panic mode, or should I still plan on
 squeezing in a first-cut LIKE estimator tomorrow?

I say stand down ... things appear to be working great as they are, but a
more permanent fix should still be at the top of the list for v7.1, since
I don't imagine that we're always going to be able to come up with that
easy of a fix :)


__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]




UdmSearch: A URL that isn't indexable ... ?

2000-03-13 Thread The Hermit Hacker


http://greatcircle.com/lists/majordomo-users/archive.html

it indexes taht page, but none of the sub-links ...

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]



Re: UdmSearch: A URL that isn't indexable ... ?

2000-03-13 Thread The Hermit Hacker


d'oh :(

thanks ...


On Mon, 13 Mar 2000, andy butz wrote:

 hi Marc,
 
 it indexes that page, but none of the sub-links ...
 
 no wonder:
 
 http://greatcircle.com/robots.txt
 
  FILE robots.txt ++
 
 # robots.txt for http://www.greatcircle.com/
 # contact [EMAIL PROTECTED] for more information
 
 User-agent: *
 Disallow: /firewalls/archive/
 Disallow: /firewalls/mhonarc/
 Disallow: /list-managers/archive/
 Disallow: /list-managers/mhonarc/
 Disallow: /lists/firewalls/archive/
 Disallow: /lists/firewalls/mhonarc/
 Disallow: /lists/list-managers/archive/
 Disallow: /lists/list-managers/mhonarc/
 Disallow: /lists/majordomo-announce/archive/
 Disallow: /lists/majordomo-announce/mhonarc/
 Disallow: /lists/majordomo-docs/archive/
 Disallow: /lists/majordomo-docs/mhonarc/
 Disallow: /lists/majordomo-users/archive/
 Disallow: /lists/majordomo-users/mhonarc/
 Disallow: /lists/majordomo-workers/archive/
 Disallow: /lists/majordomo-workers/mhonarc/
 
 The Hermit Hacker wrote:
 
  http://greatcircle.com/lists/majordomo-users/archive.html
 
  it indexes taht page, but none of the sub-links ...
 
  Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
  Systems Administrator @ hub.org
  primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org
 
  __
  If you want to unsubscribe send "unsubscribe udmsearch"
  to [EMAIL PROTECTED]
 
 __
 If you want to unsubscribe send "unsubscribe udmsearch"
 to [EMAIL PROTECTED]
 
 

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]



UdmSearch: Its funny ...

2000-03-08 Thread The Hermit Hacker


I'm trying to figure out why udmsearch w/ PostgreSQL hurts so much ... I
just tried to start a search with an ndict of ~5million tuples, and a
total of about 50k documents, and I just got a 'zero sized reply' from the
backend for something as simple as 'SQL' ...

Looking at hte backend/server, I see tis process:

81779  ??  S 72:39.47 /home/database/v6.5.3/bin/postgres pgsql 216.126.84.1 
pgsearch LOCK TABLE waiting

Why is udmsearch doing any locking?  With MVCC, it is not required, and
just slows everything down where it doesn't need to ... :(

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]



UdmSearch: Slowly, we're getting somewhere ...

2000-03-08 Thread The Hermit Hacker


Is it possible to get the PID of the CGI added to the statistics too?  Now
I get a bunch of lines of:

SQL 0.24s: SELECT word,lang FROM stopword
SQL 0.10s: SELECT word,lang FROM stopword
SQL 0.11s: SELECT word,lang FROM stopword
SQL 0.09s: SELECT word,lang FROM stopword
SQL 0.11s: SELECT word,lang FROM stopword
SQL 0.16s: SELECT word,lang FROM stopword
SQL 0.09s: SELECT word,lang FROM stopword
SQL 263.27s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=-606704929 AND url.rec_id=ndict.url_id  AND (url.url LIKE 
'http://www.postgresql.org/%')
SQL 0.61s: SELECT word,lang FROM stopword
SQL 0.16s: SELECT word,lang FROM stopword
SQL 0.32s: SELECT word,lang FROM stopword
SQL 0.08s: SELECT word,lang FROM stopword
SQL 0.57s: SELECT word,lang FROM stopword
SQL 0.10s: SELECT word,lang FROM stopword
SQL 0.16s: SELECT word,lang FROM stopword
SQL 0.08s: SELECT word,lang FROM stopword
SQL 0.27s: SELECT word,lang FROM stopword
SQL 57.02s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=-1183739328 AND url.rec_id=ndict.url_id  AND (url.url LIKE 
'http://www.postgresql.org/docs/%')
SQL 10.09s: SELECT ndict.url_id,ndict.intag FROM ndict,url WHERE 
ndict.word_id=-1183739328 AND url.rec_id=ndict.url_id  AND (url.url LIKE 
'http://www.postgresql.org/docs/admin/%')
SQL 0.12s: SELECT rec_id,url,content_type,last_modified,title,txt, 
docsize,last_index_time,next_index_time, referrer,keywords,description,crc FROM url 
WHERE rec_id=416

But haven't got a clue who belongs to what ... 

Also, why would I get a bunch of 'SELECT ... stopword' queries, with what
appears to be no subsequent queries?

Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]



UdmSearch: CURSORs ...

2000-03-08 Thread The Hermit Hacker


is anyone out there workign on extending udmsearch
w/PostgreSQL?  performance on a PIII 450 w/ 758Meg of RAM is pretty much
atrocious, and one of the things that I'm wondering is what sort of
performance boast going to using CURSORs would give?  Basically, instead
of returning all the rows to the frontend that satisfy the search, just
return the 20 rows that will be displayed ...

So far as I'm able to see, udmsearch is great for small sites, but as soon
as I throw any amount of documents in, it slows down to being pretty much
unusable, and I'd *really* like to figure out a way of fixing that :(


Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]



UdmSearch: udmsearch stable gets periodic crashes ...

1999-12-13 Thread The Hermit Hacker


Using FreeBSD 3.4-RC and PostgreSQL 6.5.0, I get the hfollowing
periodically:

pgsql sbin/indexer
UdmSearch[58068]: indexer from UdmSearch v.2.2.1/PostgreSQL started with config 
/home/projects/pgsql/etc/indexer.conf
UdmSearch[58068]: http://www.postgresql.org/docs/pgsql/src/test/suite/quote.sql
UdmSearch[58068]: Error: Error: 'ERROR:  parser: parse error at or near "world"

If I run it a second time, it goes through fine...


Marc G. Fournier   ICQ#7615664   IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: [EMAIL PROTECTED]   secondary: scrappy@{freebsd|postgresql}.org 

__
If you want to unsubscribe send "unsubscribe udmsearch"
to [EMAIL PROTECTED]