On 4/3/06, Qvx <[EMAIL PROTECTED]> wrote:
A sum may be good enough for government work; I'm trying to get a sense of relevance, and a sum would also be a good approximation of what I was trying to approximate.
So is it correct if I build up a query, say, with two words (populating the histogram.word), and then select:
Stored procedure will be different for each database. I can help you with Oracle.
I would make my query like this:
SELECT doc.id, SUM( his.cnt/doc.word_count) weight
FROM doc, his
WHERE doc.id = his.doc_id
AND his.word IN (:w1, :w2, :w3)
GROUP BY doc.id
HAVING COUNT(*) = :num_search_words
ORDER BY 2 DESC
But then again, this is not what you want. You want product and I don't know of any. For that you will have to loop and multiply yourself.
A sum may be good enough for government work; I'm trying to get a sense of relevance, and a sum would also be a good approximation of what I was trying to approximate.
So is it correct if I build up a query, say, with two words (populating the histogram.word), and then select:
result = mapper.select_text ("""SELECT document_table, SUM(histogram_table.count/document_table.word_count) weight
FROM document_table, histogram_table
WHERE document_table.html_name = histogram_table.html_name
AND histogram.word in (:"sausage", :"spam")
GROUP BY document_table.html_name
HAVING COUNT(*) = document_table.word_count
ORDER BY 2 DESC""" )
How do I coerce floating rather than integer division?
Both histogram_table.count and document_table.word_count are integers.
How do I get 1.0/2.0 = 0.5 instead of 1/2 = 0?
On 4/3/06, Jonathan Hayward http://JonathansCorner.com <[EMAIL PROTECTED]> wrote:
> Thank you! I've just added that column.
>
> Would this be an appropriate time to have an SQL stored procedure? I'm
> thinking pseudocode like:
>
> function count(document, list_of_keywords)
> result = 1.0
> if document.word_count > 0:
> for keyword in list_of_keywords:
> get histogram result (= histogram count of number of
> occurrences of document keyword; if this is 0, return 0.)
> result = result * float(histogram_count) /
> float(document.word_count)
> return result
>
> Sorry if this is crude; I don't know SQL well, but I have the impression
> that a stored procedure might look good in a case like this.
>
>
> On 4/3/06, Qvx <[EMAIL PROTECTED]> wrote:
> > I would definitely add a "word_count" column to document table no
> > matter what your resulting query will look like. But you are going to
> > have some trouble making an efficient query which will also take into
> > account the given formula (the part where you expect element "0/4").
> >
> > Tvrtko
> >
> > On 4/3/06, Jonathan Hayward http://JonathansCorner.com
> > < [EMAIL PROTECTED]> wrote:
> > > I owe a BIG thank you to those of you who have patiently helped me. My
> > > program
> > > is basically running now, and though it's definitely a prototype, it
> works.
> > > (I had started the move in the hopes that a sluggish program could be
> sped
> > > up,
> > > and it is DEFINITELY speeding up.)
> > >
> > > There is one immediate help request for a complex use of SQLAlchemy:
> > >
> > > A document_table represents a webpage (roughly); a histogram_table
> > > represents
> > > an entry into a histogram of words. If the webpage says "egg bacon spam
> > > spam"
> > > it will have:
> > >
> > > * one histogram with its primary key, "egg", and a count of 1
> > > * one histogram with its primary key, "bacon", and a count of 1
> > > * one histogram with its primary key, "spam", and a count of 2
> > >
> > > In the kind of search I want to enable, a webpage has a score which is
> > > calculated by dividing its count for each search keyword by the total
> word
> > > count for the webpage. Thus if someone searches for "egg bacon spam",
> the
> > > score
> > > will be 1/4 * 1/4 * 2/4 = 0.03125. (This is actually a very high score,
> > > althoguh it looks low.) If a keyword is missing, it will have a score
> of 0:
> > > "egg sausage spam" will have a score of 1/4 * 0/4 * 2/4 = 0.0.
> > >
> > > What I want to do is select all documents with a nonzero score based on
> the
> > > keyword search terms, and sort them by score descending. I know I could
> > > bludgeon it and eventually get it working, but this is complex enough
> > > (should I
> > > make a separate histogram_grand_total_table with the total word count
> for a
> > > webpage?) that I wanted to ask for help.
> > >
> > >
> > > --
> > > ++ Jonathan Hayward, [EMAIL PROTECTED]
> > > ** To see an award-winning website with stories, essays, artwork,
> > > ** games, and a four-dimensional maze, why not visit my home page?
> > > ** All of this is waiting for you at http://JonathansCorner.com
> > >
> > > ** If you'd like a Google Mail ( gmail.com ) account, please tell me!
> >
>
>
>
> --
>
> ++ Jonathan Hayward, [EMAIL PROTECTED]
> ** To see an award-winning website with stories, essays, artwork,
> ** games, and a four-dimensional maze, why not visit my home page?
> ** All of this is waiting for you at http://JonathansCorner.com
>
> ** If you'd like a Google Mail ( gmail.com) account, please tell me!
--
++ Jonathan Hayward, [EMAIL PROTECTED]
** To see an award-winning website with stories, essays, artwork,
** games, and a four-dimensional maze, why not visit my home page?
** All of this is waiting for you at http://JonathansCorner.com
** If you'd like a Google Mail ( gmail.com) account, please tell me!