[SQL] Suggestions on finetuning this search?

2001-02-16 Thread Justin Long

I would welcome any suggestions for fine-tuning this search to run faster.


Here is the SQL. Basically what we're allowing people to do is to specify
words to search our article index.

THE TABLES:
knowledge = the knowledge base of articles
kb_categories = the category that each article is assigned to
kbwords = an index of every word in the knowledge base (kbid, wordid)
wordindex = an index of every word in the knowledge base (wordid, word)

AMPLIFICATION:
kbwords = a list of all the words that appear in a specific article (might
return a list of 5 articles where the word 'monk' appears)
wordindex = a unique list of all the words that appear in all the articles
(would only return a single entry for the word 'monk')

THE SQL:
select * from knowledge k, kb_categories c , kbwords w0 , kbwords w1 WHERE
k.catid=c.catid AND ((k.kbid=w0.kbid and w0.wordid=(SELECT wordid from
wordindex where word='BUDDIST')) AND  (k.kbid=w1.kbid and w1.wordid=(SELECT
wordid from wordindex where word='MONK'))) ORDER BY k.regionid , k.ctryid ,
k.catid , k.title ;

WHAT "EXPLAIN" RETURNS:
Sort  (cost=2796577.40..2796577.40 rows=2878549 width=332)
  InitPlan
->  Index Scan using wordindex_word on wordindex  (cost=0.00..247.71
rows=376 width=4)
->  Index Scan using wordindex_word on wordindex  (cost=0.00..247.71
rows=376 width=4)
  ->  Merge Join  (cost=21187.45..21993.59 rows=2878549 width=332)
->  Merge Join  (cost=11060.50..11140.94 rows=52582 width=324)
  ->  Sort  (cost=933.56..933.56 rows=960 width=316)
->  Hash Join  (cost=1.43..885.97 rows=960 width=316)
  ->  Seq Scan on knowledge k  (cost=0.00..559.25
rows=2825 width=284)
  ->  Hash  (cost=1.34..1.34 rows=34 width=32)
->  Seq Scan on kb_categories c
(cost=0.00..1.34 rows=34 width=32)
  ->  Sort  (cost=10126.95..10126.95 rows=5474 width=8)
->  Seq Scan on kbwords w0  (cost=0.00..9787.02
rows=5474 width=8)
->  Sort  (cost=10126.95..10126.95 rows=5474 width=8)
  ->  Seq Scan on kbwords w1  (cost=0.00..9787.02 rows=5474
width=8)

Please e-mail suggestions to [EMAIL PROTECTED] Thanks!

To see the code in action, visit
http://www.strategicnetwork.org/index.asp?loc=kb




[SQL] Optimizing Query

2001-03-05 Thread Justin Long

Any suggestions welcome!

Here is my query:

select k.*, c.category from knowledge k, kb_categories c , kbwords w0 ,
kbwords w1 WHERE k.catid=c.catid AND ((k.kbid=w0.kbid and
w0.wordid=42743) AND (k.kbid=w1.kbid and w1.wordid=85369)) ORDER BY
k.kbid DESC LIMIT 25;

Now for the details
knowledge k = 3,150 records
kbwords = 2-field database (kbid, wordid), 825,748 records

Each word in the knowledge base is stored in a database called wordindex,
which has 50,000 records or so. The system first explodes the query
string and pulls the word #s from this database, which is where we get
42743 and 85369 above, "ASIA" and "CHILDREN"
respectively.) The idea is then to pull all the articles in the knowledge
base which contain both of these words.

Here is the EXPLAIN for the query:

NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..527690060.67 rows=2878549 width=308)
  ->  Nested Loop  (cost=0.00..9472443.40 rows=52582
width=304)
    ->  Nested Loop 
(cost=0.00..6278.63 rows=960 width=300)
 
->  Index Scan Backward using knowledge_kbid_key on knowledge
k  (cost=0.00..1292.51 rows=2825 width=284)
 
->  Seq Scan on kb_categories c  (cost=0.00..1.34 rows=34
width=16)
    ->  Seq Scan on
kbwords w0  (cost=0.00..9787.02 rows=5474 width=4)
  ->  Seq Scan on kbwords w1  (cost=0.00..9787.02
rows=5474 width=4)

This takes quite a while to return results... prohibitively long. There
are indexes on k.catid, c.catid, k.kbid, w0.kbid, w0.wordid. Any
suggestions for further optimization would be very welcome. We get about
3,000 searches on our database daily...

Blessings,
Justin Long



________
Justin
Long Network
for Strategic Missions
[EMAIL PROTECTED] 1732 South Park
Court
http://www.strategicnetwork.org Chesapeake,
VA 23320, USA
Reality Check e-zine: [EMAIL PROTECTED]

Law: Never retreat. Never surrender. Never cut a deal with a 
dragon.
Corollary: No armor? Unclean life? Then do not mess in the affairs 
of dragons, for you are crunchy and taste good with ketchup.


________
Justin
Long Network
for Strategic Missions
[EMAIL PROTECTED] 1732 South Park
Court
http://www.strategicnetwork.org Chesapeake,
VA 23320, USA
Reality Check e-zine: [EMAIL PROTECTED]

Law: Never retreat. Never surrender. Never cut a deal with a 
dragon.
Corollary: No armor? Unclean life? Then do not mess in the affairs 
of dragons, for you are crunchy and taste good with ketchup.



Re: [SQL] Optimizing Query

2001-03-05 Thread Justin Long

Does that mean that if you have 3 indexes on a table and you cluster one,
it deletes the other 2?

At 04:45 PM 3/5/2001 -0500, you wrote:
Yes.

> On Mon, Mar 05, 2001 at 04:07:57PM -0500, Bruce Momjian allegedly
wrote:
> > Have you tried VACUUM ANALYZE and CLUSTER?
> 
> I assume CLUSTER still drops all indexes except the one you're
clustering
> on?
> 
> Mathijs
> -- 
> It's not that perl programmers are idiots, it's that the
language
> rewards idiotic behavior in a way that no other language or tool
has
> ever done.
>
Erik Naggum
> 


-- 
  Bruce
Momjian   
| 
http://candle.pha.pa.us
 
[EMAIL PROTECTED]  
|  (610) 853-3000
  +  If your life is a hard drive,
|  830 Blythe Avenue
  +  Christ can be your
backup.    |  Drexel Hill,
Pennsylvania 19026


____
Justin
Long Network
for Strategic Missions
[EMAIL PROTECTED] 1732 South Park
Court
http://www.strategicnetwork.org Chesapeake,
VA 23320, USA
Reality Check e-zine: [EMAIL PROTECTED]

Law: Never retreat. Never surrender. Never cut a deal with a 
dragon.
Corollary: No armor? Unclean life? Then do not mess in the affairs 
of dragons, for you are crunchy and taste good with ketchup.



Re: [SQL] Optimizing Query

2001-03-05 Thread Justin Long

Yes, it drops indexes, much to my chagrin, as I just realized ...
including SERIALs...

Justin

At 04:45 PM 3/5/2001 -0500, you wrote:
Yes.

> On Mon, Mar 05, 2001 at 04:07:57PM -0500, Bruce Momjian allegedly
wrote:
> > Have you tried VACUUM ANALYZE and CLUSTER?
> 
> I assume CLUSTER still drops all indexes except the one you're
clustering
> on?
> 
> Mathijs
> -- 
> It's not that perl programmers are idiots, it's that the
language
> rewards idiotic behavior in a way that no other language or tool
has
> ever done.
>
Erik Naggum
> 


-- 
  Bruce
Momjian   
| 
http://candle.pha.pa.us
 
[EMAIL PROTECTED]  
|  (610) 853-3000
  +  If your life is a hard drive,
|  830 Blythe Avenue
  +  Christ can be your
backup.    |  Drexel Hill,
Pennsylvania 19026


____
Justin
Long Network
for Strategic Missions
[EMAIL PROTECTED] 1732 South Park
Court
http://www.strategicnetwork.org Chesapeake,
VA 23320, USA
Reality Check e-zine: [EMAIL PROTECTED]

Law: Never retreat. Never surrender. Never cut a deal with a 
dragon.
Corollary: No armor? Unclean life? Then do not mess in the affairs 
of dragons, for you are crunchy and taste good with ketchup.



Re: [SQL] Optimizing Query

2001-03-05 Thread Justin Long

Ok, now I have another question... it doesn't seem to be accessing the
index.

explain select k.kbid,k.title from knowledge k , kbwords w0 , kbwords w1
WHERE ((k.kbid=w0.kbid and w0.wordid=42743) AND (k.kbid=w1.kbid and
w1.wordid=85369))

NOTICE:  QUERY PLAN:

Merge Join  (cost=32339.30..35496.97 rows=19262538 width=24)
  ->  Merge Join  (cost=16530.24..16668.77 rows=233274
width=20)
    ->  Sort 
(cost=15809.06..15809.06 rows=8257 width=4)
 
->  Seq Scan on kbwords w1  (cost=0.00..15271.85 rows=8257
width=4)
    ->  Sort 
(cost=721.18..721.18 rows=2825 width=16)
 
->  Seq Scan on knowledge k  (cost=0.00..559.25 rows=2825
width=16)
  ->  Sort  (cost=15809.06..15809.06 rows=8257
width=4)
    ->  Seq Scan on
kbwords w0  (cost=0.00..15271.85 rows=8257 width=4)

Note the sequential scans... there is a wordindex where
w0.wordid=42743... why isn't it doing an indexscan? wouldn't that be more
efficient?

Justin


At 04:45 PM 3/5/2001 -0500, you wrote:
Yes.

> On Mon, Mar 05, 2001 at 04:07:57PM -0500, Bruce Momjian allegedly
wrote:
> > Have you tried VACUUM ANALYZE and CLUSTER?
> 
> I assume CLUSTER still drops all indexes except the one you're
clustering
> on?
> 
> Mathijs
> -- 
> It's not that perl programmers are idiots, it's that the
language
> rewards idiotic behavior in a way that no other language or tool
has
> ever done.
>
Erik Naggum
> 


-- 
  Bruce
Momjian   
| 
http://candle.pha.pa.us
 
[EMAIL PROTECTED]  
|  (610) 853-3000
  +  If your life is a hard drive,
|  830 Blythe Avenue
  +  Christ can be your
backup.    |  Drexel Hill,
Pennsylvania 19026



Justin
Long Network
for Strategic Missions
[EMAIL PROTECTED] 1732 South Park
Court
http://www.strategicnetwork.org Chesapeake,
VA 23320, USA
Reality Check e-zine: [EMAIL PROTECTED]

Law: Never retreat. Never surrender. Never cut a deal with a 
dragon.
Corollary: No armor? Unclean life? Then do not mess in the affairs 
of dragons, for you are crunchy and taste good with ketchup.



Re: [SQL] Optimizing Query

2001-03-06 Thread Justin Long

Wow. I can't believe the difference. It didn't take too long. I'll set up
a script in my etc/cron.weekly to run it... would there be any benefit to
doing a vacuum analyze nightly?

Justin Long

At 11:10 PM 3/5/2001 -0500, you wrote:
Justin Long
<[EMAIL PROTECTED]> writes:
> Ok, now I have another question... it doesn't seem to be accessing
the index.

> explain select k.kbid,k.title from knowledge k , kbwords w0 ,
kbwords w1 
> WHERE ((k.kbid=w0.kbid and w0.wordid=42743) AND (k.kbid=w1.kbid and

> w1.wordid=85369))

> NOTICE:  QUERY PLAN:

> Merge Join  (cost=32339.30..35496.97 rows=19262538
width=24)
>    ->  Merge Join 
(cost=16530.24..16668.77 rows=233274 width=20)
>  -> 
Sort  (cost=15809.06..15809.06 rows=8257 width=4)
>   
->  Seq Scan on kbwords w1  (cost=0.00..15271.85 rows=8257

> width=4)
>  -> 
Sort  (cost=721.18..721.18 rows=2825 width=16)
>   
->  Seq Scan on knowledge k  (cost=0.00..559.25 rows=2825

> width=16)
>    ->  Sort  (cost=15809.06..15809.06
rows=8257 width=4)
>  -> 
Seq Scan on kbwords w0  (cost=0.00..15271.85 rows=8257 
width=4)

> Note the sequential scans... there is a wordindex where
w0.wordid=42743... 
> why isn't it doing an indexscan? wouldn't that be more
efficient?

It probably thinks not, because the estimated number of hits (8257)
is
so high.  That estimate is currently driven by the frequency of the
most
common value in the column (mainly because that's the only stat we
have
:-().  I am guessing that you have a few very common words, which
are
skewing the stats for kbwords and causing it not to pick an
indexscan.

Does your setup have a notion of "stop words" that shouldn't be
indexed,
like "a", "an", "the", etc?  Perhaps
you need to add such a feature, or
throw in a few more stopwords if you already have 'em.

regards,
tom lane

---(end of
broadcast)-------
TIP 1: subscribe and unsubscribe commands go to
[EMAIL PROTECTED]



Justin
Long Network
for Strategic Missions
[EMAIL PROTECTED] 1732 South Park
Court
http://www.strategicnetwork.org Chesapeake, VA 23320,
USA
Reality Check e-zine: [EMAIL PROTECTED]

Law: Never retreat. Never surrender. Never cut a deal with a 
dragon.
Corollary: No armor? Unclean life? Then do not mess in the affairs 
of dragons, for you are crunchy and taste good with ketchup.