Re: speeding up queries (MySQL faster)

2004-08-27 Thread Yonik Seeley
FYI, this optimization resulted in a fantastic
performance boost!  I went from 133 queries/sec to 990
queries per sec!  I'm now more limited by socket
overhead, as I get 1700 queries/sec when I stick the
clients right in the same process as the server.

Oddly enough, the performance increased, but the CPU
utilization decreased to around 55% (in both
configurations above).  I'll have to look into that
later, but any additional performance at this point is
pure gravy.

-Yonik


--- Yonik Seeley [EMAIL PROTECTED] wrote:
 Doug wrote:
  For example, Nutch automatically translates such
  clauses into QueryFilters.
 
 Thanks for the excellent pointer Doug!  I'll
 definitely be implementing this optimization.




__
Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!
http://promotions.yahoo.com/new_mail

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: speeding up queries (MySQL faster)

2004-08-22 Thread Yonik Seeley
Oops, CPU usage is *not* 50%, but closer to 98%.
This is due to a bug in CPU% reporting on RHEL 3 on
multiprocessor systems (I can run multiple threads in
while(1) loops, and it will still only show 50% CPU
usage for that process).  The aggregated (not
per-process) statistics shown by top are correct, and
they show about 73% user time, 25% system time, and
anywhere between .5% and 2% idle time.

Unfortunately, this means that I won't be getting any
performance improvements from using a second
IndexSearcher, and I'm stuck at being 3 times slower
than MySQL on the same data/queries.

I guess the next step is some profiling... move the
server out of the servlet container and move the
clients in with the server, and then try some hprof
work.

Does anyone have pointers to lucene caching and how to
tune it?

-Yonik 





--- Bernhard Messer [EMAIL PROTECTED]
wrote:
 Yonik,
 
 there is another synchronized block in
 CSInputStream which could block 
 your second cpu out.






Re: speeding up queries (MySQL faster)

2004-08-22 Thread Doug Cutting
Yonik Seeley wrote:
Setup info & Stats:
- 4.3M documents, 12 keyword fields per document, 11
 [ ... ]
field1:4 AND field2:188453 AND field3:1
field1:4  done alone selects around 4.2M records
field2:188453 done alone selects around 1.6M records
field3:1  done alone selects around 1K records
The whole query normally selects less than 50 records
Only the first 10 are returned (or whatever range
the client selects).
The field1:4 clause is probably dominating the cost of query 
execution.  Clauses which match large portions of the collection are 
slow to evaluate.  If there are not too many different such clauses then 
you can optimize this by re-using a Filter in place of such clauses, 
typically a QueryFilter.
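The payoff of the Filter approach can be illustrated without Lucene at all (names and sizes below are hypothetical, sketching the idea rather than Lucene's implementation): the bit vector for the unselective clause is built once and reused, so each query pays an O(1) bit test per candidate doc instead of re-walking millions of postings. Lucene's QueryFilter does the analogous work per IndexReader and caches the resulting bits.

```java
import java.util.BitSet;

public class FilterSketch {
    static final int MAX_DOC = 100;  // stand-in for the 4.3M-doc index

    // Built once and reused by every query; Lucene's QueryFilter builds
    // and caches an equivalent bit vector per IndexReader.
    static BitSet buildFilter() {
        BitSet bits = new BitSet(MAX_DOC);
        for (int doc = 0; doc < MAX_DOC; doc += 2) {
            bits.set(doc);  // pretend the even docs match field1:4
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet filter = buildFilter();        // cost paid once, not per query
        int[] candidates = {3, 10, 11, 42};   // docs matched by the selective clauses
        StringBuilder hits = new StringBuilder();
        for (int doc : candidates) {
            if (filter.get(doc)) {            // O(1) test vs. re-scanning the posting list
                hits.append(doc).append(' ');
            }
        }
        System.out.println(hits.toString().trim());  // 10 42
    }
}
```

With the search API this corresponds to passing the Filter alongside the query (searcher.search(query, filter)) instead of keeping the unselective clause in the BooleanQuery.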

For example, Nutch automatically translates such clauses into 
QueryFilters.  See:

http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/searcher/LuceneQueryOptimizer.java?view=markup
Note that this only converts clauses whose boost is zero.  Since filters 
do not affect ranking we can only safely convert clauses which do not 
contribute to the score, i.e., those whose boost is zero.  Scores might 
still be different in the filtered results because of 
Similarity.coord().  But, in Nutch, Similarity.coord() is overridden to 
always return 1.0, so that the replacement of clauses with filters does 
not alter the final scores at all.
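The coord() point can be made concrete with plain numbers. This is a sketch outside Lucene, assuming DefaultSimilarity's formula (overlap over maxOverlap); the Nutch-style override pins the factor so moving clauses into filters cannot shift scores.

```java
public class CoordSketch {
    // DefaultSimilarity-style coord: fraction of query clauses a doc matched.
    static float defaultCoord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    // Nutch-style override: a constant, so a clause scores the same whether
    // it sits in the query or has been moved into a filter.
    static float nutchCoord(int overlap, int maxOverlap) {
        return 1.0f;
    }

    public static void main(String[] args) {
        // With the default, a doc matching 2 of 3 clauses is penalized...
        System.out.println(defaultCoord(2, 3));
        // ...with the override it is not, so filtering clauses out of the
        // query leaves every document's score untouched.
        System.out.println(nutchCoord(2, 3));
    }
}
```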

Doug


Re: speeding up queries (MySQL faster)

2004-08-22 Thread Yonik Seeley

 For example, Nutch automatically translates such
 clauses into QueryFilters.

Thanks for the excellent pointer Doug!  I'll
definitely be implementing this optimization.

If anyone cares, I did a 1 minute hprof test with the
search server in a servlet container.  Here are the
results (sorry about Yahoo's short line length).

-Yonik

resin.hprof.txt: Exclusive Method Times (CPU) (virtual times)
 27390  (37.5%) java.net.PlainSocketImpl.socketAccept
 14885  (20.4%) org.apache.lucene.index.SegmentTermDocs.skipTo
  6700   (9.2%) org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal
  5810   (8.0%) java.io.UnixFileSystem.list
  4785   (6.5%) org.apache.lucene.store.InputStream.readByte
  3315   (4.5%) java.io.RandomAccessFile.readBytes
  1302   (1.8%) java.net.SocketOutputStream.socketWrite0
  1004   (1.4%) java.io.RandomAccessFile.seek
   546   (0.7%) java.lang.String.intern
   336   (0.5%) com.caucho.vfs.WriteStream.print
   248   (0.3%) org.apache.lucene.search.TermScorer.next
   236   (0.3%) org.apache.lucene.queryParser.QueryParser.jj_scan_token
   232   (0.3%) org.apache.lucene.index.SegmentTermEnum.readTerm
   228   (0.3%) org.apache.lucene.search.ConjunctionScorer.score
   200   (0.3%) org.apache.lucene.queryParser.FastCharStream.refill
   196   (0.3%) org.apache.lucene.store.InputStream.readVInt
   180   (0.2%) java.security.AccessController.doPrivileged
   172   (0.2%) org.apache.lucene.search.ConjunctionScorer.doNext
   152   (0.2%) java.lang.Object.clone
   152   (0.2%) org.apache.lucene.index.SegmentReader.document
   148   (0.2%) java.lang.Throwable.fillInStackTrace
   128   (0.2%) org.apache.lucene.index.SegmentReader.norms
   116   (0.2%) org.apache.lucene.store.InputStream.readString
   112   (0.2%) java.lang.StrictMath.log
   108   (0.1%) java.util.LinkedList.addLast
   100   (0.1%) java.net.SocketInputStream.socketRead0
    88   (0.1%) org.apache.lucene.search.ConjunctionScorer.next








Re: speeding up queries (MySQL faster)

2004-08-21 Thread Otis Gospodnetic
Ah, you may be right (no stack trace in email any more).  Somebody
recently identified a few bottlenecks that, if I recall correctly, were
related to synchronized blocks.  I believe Doug committed some
improvements, but I can't remember which version of Lucene that is in. 
It's definitely in 1.4.1.

Otis


--- Yonik Seeley [EMAIL PROTECTED] wrote:

 
 --- Otis Gospodnetic [EMAIL PROTECTED]
 wrote:
 
  The bottleneck seems to be disk IO.
 
 But it's not.  Linux is caching the whole file, and
 there really isn't any disk activity at all.  Most of
 the threads are blocked on InputStream.refill, not
 waiting for the disk, but waiting for their turn to
 enter the synchronized block that reads from the disk
 (which is why I asked about caching above that level).
 
 CPU is a constant 50% on a dual CPU system (meaning
 100% of 1 CPU).
 
 -Yonik
 



Re: speeding up queries (MySQL faster)

2004-08-21 Thread Bernhard Messer
Yonik,
there is another synchronized block in CSInputStream which could block 
your second cpu out. Do you think there is a chance to recreate the 
index (maybe a smaller subset) without compound file option enabled and 
run your test again, so that we can see if this helps ?

regards
Bernhard



speeding up queries (MySQL faster)

2004-08-20 Thread Yonik Seeley
Hi,

I'm trying to figure out how to speed up queries to a
large index.
I'm currently getting 133 req/sec, which isn't bad,
but isn't too close
to MySQL, which is getting 500 req/sec on the same
hardware with the
same set of documents.

Setup info & Stats:
- 4.3M documents, 12 keyword fields per document, 11
unindexed fields per document.
- lucene index size on disk=1.3G
- Hardware: dual opteron w/ 16GB memory, running 64
bit JVM (Sun 1.5 beta)
- Lucene version 1.4.1
- Hitting multithreaded server w/ 10 clients at once
- This is a read-only index... no updating is done
- Single IndexSearcher that is reused for all requests
 

Q1)  while hitting it with multiple queries at once,
lucene is pegged at 50% CPU usage (meaning it is
only using 1 out of 2 CPUs on average).  I took a
thread dump
and all of the lucene threads except one are blocked
on
reading a file (see trace below).  I could create two
index
readers, but that seems like it might be a waste, and
fixing
a symptom instead of the root problem.  Would multiple
IndexSearchers or IndexReaders share internal caches?
Is there a way to cache more info at a higher level
such that
it would get rid of this bottleneck?  The JVM isn't
taking up
much space (125M or so), and I have 16GB to work with!
The OS (linux) is obviously caching the index file,
but
that doesn't get rid of the synchronization issues,
and the
overhead of re-reading.
How is caching in lucene configured?
Does it internally use FieldCache, or do I have to use
that
somehow myself?
 
tcpConnection-8080-72 daemon prio=1 tid=0x002b24412490 nid=0x34a4 waiting for monitor entry [0x45aba000..0x45abb2d0]
    at org.apache.lucene.index.CompoundFileReader$CSInputStream.readInternal(CompoundFileReader.java:215)
    - waiting to lock 0x002ae153fa00 (a org.apache.lucene.store.FSInputStream)
    at org.apache.lucene.store.InputStream.refill(InputStream.java:158)
    at org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
    at org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
    at org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:176)
    at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:88)
    at org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:53)
    at org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:48)
    at org.apache.lucene.search.Scorer.score(Scorer.java:37)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.<init>(Hits.java:43)
    at org.apache.lucene.search.Searcher.search(Searcher.java:33)
    at org.apache.lucene.search.Searcher.search(Searcher.java:27)
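The monitor in that trace is the per-file FSInputStream lock: every seek+read on the one shared RandomAccessFile is serialized across all searching threads. A minimal sketch of what independent handles buy you (hypothetical code, not Lucene's; opening a second IndexReader amounts to the same thing, since it gets its own streams):

```java
import java.io.*;

public class PerThreadFileSketch {
    // Each call opens its own RandomAccessFile, so the file pointer is
    // private to the caller and no shared monitor is needed -- unlike a
    // single stream shared by all threads, where seek+read must be
    // serialized to keep the file pointer consistent.
    static String readAt(File f, long pos, int len) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(pos);
            byte[] buf = new byte[len];
            raf.readFully(buf);
            return new String(buf, "US-ASCII");
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("idx", ".dat");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("hello index".getBytes("US-ASCII"));
        }
        // Two threads read the same region concurrently; neither blocks
        // waiting for the other's lock.
        Runnable task = () -> {
            try {
                System.out.println(readAt(f, 6, 5));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}
```

The trade-off is open file descriptors: one handle per thread per index file adds up, which is presumably why Lucene shares one stream and synchronizes.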


Even using only 1 cpu though, MySQL is faster. Here is
what
the queries look like:

field1:4 AND field2:188453 AND field3:1

field1:4  done alone selects around 4.2M records
field2:188453 done alone selects around 1.6M records
field3:1  done alone selects around 1K records
The whole query normally selects less than 50 records
Only the first 10 are returned (or whatever range
the client selects).

The fields are all keywords checked for exact matches
(no
fulltext search is done).  Is there anything I can do
to
speed these queries up, or is the structure just more
suited
to MySQL (and not an inverted index)?

How is a query like this carried out?
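For what it's worth, the thread dump above shows how: ConjunctionScorer leapfrogs the clauses' posting lists via skipTo(), always advancing the list that is furthest behind until all lists land on the same doc. A toy sketch of that intersection (plain sorted arrays stand in for postings; the real SegmentTermDocs.skipTo uses skip tables rather than a linear scan):

```java
import java.util.Arrays;

public class LeapfrogSketch {
    // Advance a cursor to the first doc id >= target, like
    // SegmentTermDocs.skipTo (here linear; Lucene uses skip data).
    static int skipTo(int[] postings, int from, int target) {
        int i = from;
        while (i < postings.length && postings[i] < target) i++;
        return i;
    }

    // Intersect two posting lists by leapfrogging: each miss skips the
    // lagging list forward to the other's current doc.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int n = 0, i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out[n++] = a[i]; i++; j++;
            } else if (a[i] < b[j]) {
                i = skipTo(a, i, b[j]);
            } else {
                j = skipTo(b, j, a[i]);
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        int[] field1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // dense clause (like field1:4)
        int[] field3 = {3, 7, 42};                      // sparse clause (like field3:1)
        System.out.println(Arrays.toString(intersect(field1, field3)));  // [3, 7]
    }
}
```

Note the cost is driven by how far the dense list must be stepped through, which is why the 4.2M-doc field1:4 clause dominates and why replacing it with a cached filter helps so much.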

Any help would be greatly appreciated.  There's not a
lot of info
on searching (much more on updating). I'm looking
forward
to Lucene in Action!  too bad it's not out till
October.

-Yonik






Re: speeding up queries (MySQL faster)

2004-08-20 Thread Otis Gospodnetic
The bottleneck seems to be disk IO.
Since this is a read-only index, why not spread some of the frequently
scanned index files over multiple disks, or put the index on SCSI disks
hooked up in a RAID.  Maybe this is already the case, but you didn't
mention it.

Oh, I already answered a similar question once before:
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05103.html

Otis
http://www.simpy.com/ -- Index, Search and Share your bookmarks





Re: speeding up queries (MySQL faster)

2004-08-20 Thread Yonik Seeley

--- Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 The bottleneck seems to be disk IO.

But it's not.  Linux is caching the whole file, and
there really isn't any disk activity at all.  Most of
the threads are blocked on InputStream.refill, not
waiting for the disk, but waiting for their turn to
enter the synchronized block that reads from the disk
(which is why I asked about caching above that level).

CPU is a constant 50% on a dual CPU system (meaning
100% of 1 CPU).

-Yonik
