Re: [OT] Re: search.cpan.org

2001-12-04 Thread Ask Bjoern Hansen

On Tue, 27 Nov 2001, Nick Tonkin wrote:

 Well, ask Ask if you want the whole truth. But when I asked him that's
 what he said. Maybe there's a problem with the architecture and some
 pre-indexing is done per session or something suboptimal like that. Ask?

No, Robert is right. It's just searches that are doing a full scan
of the database.  I know Graham is working on a better search
system.
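
To illustrate the difference between a full scan on every query and a prebuilt index, here is a minimal Python sketch. It is not how search.cpan.org is implemented; the corpus, the names DOCS and INDEX, and the three functions are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus; not data from search.cpan.org.
DOCS = {
    1: "Net::FTP is a simple FTP client",
    2: "LWP is a web client library",
    3: "DBI is the database interface",
}


def full_scan(docs, word):
    """Look at every document on every query: cost grows with corpus size."""
    word = word.lower()
    return sorted(
        doc_id for doc_id, text in docs.items()
        if word in text.lower().split()
    )


def build_index(docs):
    """One-time pass mapping each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def indexed_search(index, word):
    """After indexing, a single-word query is one dictionary lookup."""
    return sorted(index.get(word.lower(), set()))


INDEX = build_index(DOCS)
```

Both functions return the same hits; the indexed version pays the scanning cost once, at build time, rather than on each search.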

If Bill got swish-e to support incremental database updates I'm sure
it would help. ;-) 


 - ask

-- 
ask bjoern hansen, http://ask.netcetera.dk/ !try; do();
more than a billion impressions per week, http://valueclick.com





Re: [OT] Re: search.cpan.org

2001-11-27 Thread Bill Moseley

At 12:55 PM 11/27/01 -0800, Nick Tonkin wrote:

Because it does a full text search of all the contents of the DB.

Perhaps, but it's just overloaded.

I'm sure he's working on it, but anyone want to offer Graham free hosting?
A few mirrors would be nice, too.

(Plus, all my CPAN.pm setups are now failing to work, too)



Bill Moseley
mailto:[EMAIL PROTECTED]



Re: [OT] Re: search.cpan.org

2001-11-27 Thread Mark Maunder

Nick Tonkin wrote:

 Because it does a full text search of all the contents of the DB.


Not sure what he's using for a back end, but MySQL 4.0 (in alpha) has very fast and
feature-rich full-text searching now, so perhaps he can migrate to that once it's
released sometime in December. I'm using it on our site and searching full-text
indexes on three fields (including a large text field) in under 3 seconds on over
70,000 records on a p550 with 490 megs of RAM.




Re: [OT] Re: search.cpan.org

2001-11-27 Thread Bill Moseley

At 09:02 PM 11/27/01 +, Mark Maunder wrote:
 I'm using it on our site and searching fulltext indexes on three fields
 (including a large text field) in under 3 seconds on over 70,000 records
 on a p550 with 490 megs of ram.


Hi Mark,

plug

Some day if you are bored, try indexing with swish-e (the development
version).
http://swish-e.org

The big problem with it right now is it doesn't do incremental indexing.
One of the developers is trying to get that working within a few weeks.
But for most small sets of files it's not an issue since indexing is so fast.
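
For readers unfamiliar with the term, incremental indexing means updating an existing index when one file changes instead of re-indexing everything. The Python sketch below shows the general idea only; it says nothing about swish-e's actual index format, and the class IncrementalIndex and its methods are hypothetical.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that supports adding, updating and removing one document."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of documents containing it
        self.doc_words = {}               # doc id -> its words, needed to undo old postings

    def remove(self, doc_id):
        """Drop one document's postings without touching any other document."""
        for word in self.doc_words.pop(doc_id, set()):
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]

    def add(self, doc_id, text):
        """Index a new document, or re-index a changed one in place."""
        self.remove(doc_id)
        words = set(text.lower().split())
        self.doc_words[doc_id] = words
        for word in words:
            self.postings[word].add(doc_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))
```

The extra doc_words map is the price of the feature: without a record of what each document contributed, the only safe way to handle a changed file is a full rebuild.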

My favorite feature is it can run an external program, such as a perl mbox
or html parser or perl spider, or DBI program or whatever to get the source
to index.  Use it with Cache::Cache and mod_perl and it's nice and fast
from page to page of results.
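
The caching idea can be sketched roughly as follows: run the search once, keep the full hit list for a while, and serve each results page from the cached list. This is Python rather than Perl, and ResultCache, page_of_results and the run_search callback are illustrative stand-ins, not the Cache::Cache or swish-e API.

```python
import time


class ResultCache:
    """Tiny in-memory cache with expiry, a stand-in for a real cache module."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # query -> (expiry time, full hit list)

    def get(self, query):
        entry = self.store.get(query)
        if entry is None:
            return None
        expires_at, hits = entry
        if expires_at <= self.clock():
            del self.store[query]  # stale entry: force a fresh search
            return None
        return hits

    def set(self, query, hits):
        self.store[query] = (self.clock() + self.ttl, hits)


def page_of_results(cache, query, page, per_page, run_search):
    """Return one page of hits, calling run_search only on a cache miss."""
    hits = cache.get(query)
    if hits is None:
        hits = run_search(query)  # hypothetical callback that performs the search
        cache.set(query, hits)
    start = (page - 1) * per_page
    return hits[start:start + per_page]
```

With this arrangement, moving from page 1 to page 2 of the same query costs a dictionary lookup and a list slice rather than a second search.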

Here's indexing only 24,000 files:

 ./swish-e -c u -i /usr/doc
Indexing Data Source: File-System
Indexing /usr/doc
270279 unique words indexed.
4 properties sorted.  
23840 files indexed.  177638538 total bytes.
Elapsed time: 00:03:50 CPU time: 00:03:16
Indexing done!

Here's searching:

 ./swish-e -w install -m 1
# SWISH format: 2.1-dev-24
# Search words: install
# Number of hits: 2202
# Search time: 0.006 seconds
# Run time: 0.011 seconds

A phrase:

 ./swish-e -w 'public license' -m 1
# SWISH format: 2.1-dev-24
# Search words: public license
# Number of hits: 348
# Search time: 0.007 seconds
# Run time: 0.012 seconds
998 /usr/doc/packages/ijb/gpl.html gpl.html 26002


A wild card and boolean search:

 ./swish-e -w 'sa* or java' -m 1
# SWISH format: 2.1-dev-24
# Search words: sa* or java
# Number of hits: 7476
# Search time: 0.082 seconds
# Run time: 0.087 seconds
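
As a rough picture of what a query like 'sa* or java' involves, the Python sketch below expands a trailing wildcard against a sorted word list and unions the matching posting sets. It is an assumption-laden toy, not swish-e's algorithm; prefix_matches and search_or are hypothetical names, and the index is a plain dict of word to document ids.

```python
from bisect import bisect_left


def prefix_matches(sorted_words, prefix):
    """Return every word starting with prefix, using binary search to find the first."""
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # sorted order means no later word can match
        matches.append(word)
    return matches


def search_or(index, terms):
    """OR together the hits for each term; a trailing '*' is a prefix wildcard."""
    sorted_words = sorted(index)
    hits = set()
    for term in terms:
        if term.endswith("*"):
            words = prefix_matches(sorted_words, term[:-1])
        else:
            words = [term]
        for word in words:
            hits |= index.get(word, set())
    return sorted(hits)
```

A wildcard term can expand to many words, which is one plausible reason the wildcard query above takes longer than the single-word search.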

Or a good number of results:

 ./swish-e -w 'is or und or run' -m 1
# SWISH format: 2.1-dev-24
# Search words: is or und or run
# Number of hits: 14477
# Search time: 0.084 seconds
# Run time: 0.089 seconds

Or everything:

 ./swish-e -w 'not dksksks' -m 1
# SWISH format: 2.1-dev-24
# Search words: not dksksks
# Number of hits: 23840
# Search time: 0.069 seconds
# Run time: 0.074 seconds


This is pushing the limit for little old swish, but here's indexing a few
more very small xml files (~150 bytes each)

3830016 files indexed.  582898349 total bytes.
Elapsed time: 00:48:22 CPU time: 00:44:01

/plug

Bill Moseley
mailto:[EMAIL PROTECTED]



Re: [OT] Re: search.cpan.org

2001-11-27 Thread Randy Kobes

On Tue, 27 Nov 2001, Bill Moseley wrote:

 At 12:55 PM 11/27/01 -0800, Nick Tonkin wrote:
 
 Because it does a full text search of all the contents of the DB.

 Perhaps, but it's just overloaded.

I think the load and the network connection are the main reasons; the
search itself, if you were connected locally at a time when the
machine isn't so busy, is pretty quick.

 I'm sure he's working on it, but anyone want to offer Graham free hosting?
 A few mirrors would be nice, too.

They (Graham and Elaine) are aware that it can be slow at times,
and have set up at least one mirror site to help spread the load.

best regards,
randy kobes