Possible to "quickly" fetch count of other terms based on a query

2013-02-22 Thread Lars-Erik Aabech
Hi!

I'm sorry I didn't do any hard research on this, it's so quick to ask. ;)

Is it possible to somehow find the count of each term in a set for each 
document returned by a query?

For instance, if I use the query +(foo:bar foo:morebar) +(bar:foo),
could I, without fetching all the documents from this query, find the count of 
occurrences of the terms [barette, fooish, bar, morebar, foo]?
The result I'm after is something like
barette: 10,
fooish: 0,
bar: 5,
morebar: 8
foo: 3

Hope the question is clear enough.
Any suggestion is welcome.
I'd prefer not having to build a second index, though..

(I guess I could do a new "combined" query for each term in the set, but if 
there is any other way, that would be nice.)

Best regards,
Lars-Erik Aabech
Technical Lead, Development
MarkedsPartner AS
Mobile: +47 920 30 537



Re: Possible to "quickly" fetch count of other terms based on a query

2013-02-22 Thread Michael McCandless
For terms that are in your query, you could use the
Scorer.getChildScorers API up front to hold onto each Scorer and then
in a custom collector check if that Scorer matched this particular
hit.

For terms that are not in your query:

You could use term vectors and count up the terms yourself as you go
(in a custom collector), but that'd be insanely slow.

You could create a bit set of all matching docs, and then a bit set
for each of the terms of interest, and intersect them and count the
set bits.

You could pull the DocsEnum for each term of interest up front, and
then in a custom collector call .advance on each, for each collected
docID, and increment counts if that term matches that doc.
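
A minimal sketch of that advance-and-compare loop, using sorted int arrays in place of real postings so it runs without Lucene. PostingsCursor, TermHitCounter and countMatches are hypothetical names; in Lucene the cursor role would be played by the DocsEnum, and the loop over hits would sit in the custom collector.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in for a postings iterator: walks a sorted array of doc IDs. */
final class PostingsCursor {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    private final int[] docs; // ascending doc IDs that contain the term
    private int pos = -1;
    private int current = -1;

    PostingsCursor(int[] sortedDocs) { this.docs = sortedDocs; }

    int docID() { return current; }

    /** Moves to the first doc ID >= target; call only when docID() < target. */
    int advance(int target) {
        do { pos++; } while (pos < docs.length && docs[pos] < target);
        current = pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        return current;
    }
}

final class TermHitCounter {
    /** hits must be ascending, which is how a collector sees them in a segment. */
    static Map<String, Integer> countMatches(int[] hits, Map<String, int[]> postingsByTerm) {
        Map<String, PostingsCursor> cursors = new LinkedHashMap<>();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : postingsByTerm.entrySet()) {
            cursors.put(e.getKey(), new PostingsCursor(e.getValue()));
            counts.put(e.getKey(), 0);
        }
        for (int hit : hits) {
            for (Map.Entry<String, PostingsCursor> e : cursors.entrySet()) {
                PostingsCursor c = e.getValue();
                if (c.docID() < hit) c.advance(hit);   // catch up to this hit
                if (c.docID() == hit) counts.merge(e.getKey(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Each cursor only ever moves forward, so the total work is bounded by the postings lengths plus the number of hits times the number of terms.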

Or you could just do a separate query for each of the terms of
interest AND'd with your original query.
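
As a rough sketch, the per-term queries could be built as query strings like this. CombinedQueries and perTermQueries are hypothetical names, the field for the extra terms is an assumption (the original question does not name one), and the terms are assumed to need no escaping. Each string would then be parsed and searched, and only its total hit count kept.

```java
import java.util.ArrayList;
import java.util.List;

final class CombinedQueries {
    /** Builds one "+(original) +field:term" query string per term of interest. */
    static List<String> perTermQueries(String originalQuery, String field, List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            // Both clauses are required, so only docs matching the original
            // query AND containing the term are counted.
            out.add("+(" + originalQuery + ") +" + field + ":" + term);
        }
        return out;
    }
}
```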

Mike McCandless

http://blog.mikemccandless.com


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Possible to "quickly" fetch count of other terms based on a query

2013-02-22 Thread Lars-Erik Aabech
Thanks.

ANDing was what I meant by "combined" queries.
I think I'll go with that one for now and see how it performs. Not too many 
docs/terms in the index. (~1500/30)

Bit sets sound appealing, but I've got no idea how to go about it. :)
In "Lucene in Action", I only find a short mention of DocIdBitSet.
Any hints?

Lars-Erik




Re: Possible to "quickly" fetch count of other terms based on a query

2013-02-22 Thread Michael McCandless
You can just create a FixedBitSet of size maxDoc(), and then call
.or(DocsEnum) with the DocsEnum you pulled for each term, to get the
bitset for that term.

For a Query, it's a bit trickier: you need to pull its Weight, and
then pull a Scorer from that, and then create a FixedBitSet and call
.or(Scorer) to set all bits.

Then you can .and these bitsets together and call .cardinality to get
total bits set.
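
The or / and / cardinality steps can be sketched with java.util.BitSet standing in for FixedBitSet, so the example runs without Lucene. BitSetCounts, fromDocIds and countIntersection are hypothetical names, and the int arrays stand in for the doc IDs that a DocsEnum or Scorer would produce.

```java
import java.util.BitSet;

final class BitSetCounts {
    /** Sets one bit per doc ID, playing the role of or(iterator) on an empty set. */
    static BitSet fromDocIds(int maxDoc, int[] docIds) {
        BitSet bits = new BitSet(maxDoc);
        for (int doc : docIds) {
            bits.set(doc);
        }
        return bits;
    }

    /** Number of docs that match the query and also contain the term. */
    static int countIntersection(BitSet queryBits, BitSet termBits) {
        BitSet both = (BitSet) termBits.clone(); // copy so termBits stays reusable
        both.and(queryBits);
        return both.cardinality();
    }
}
```

The query bit set is built once and reused for every term, so each additional term costs one bit set build plus one intersection.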

To get the best performance, you should do this per segment (i.e., iterate
over IR.leaves() and run the code above for each segment), but for
easiest-to-write code, you can operate on the top-level reader by
wrapping your IR in SlowCompositeReaderWrapper.
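
The per-segment variant amounts to intersecting within each segment and summing the results. A sketch under the same stand-in assumption (java.util.BitSet instead of FixedBitSet); PerSegmentCount, Segment and bits are hypothetical names, and each Segment here represents what would be computed for one entry of IR.leaves().

```java
import java.util.BitSet;
import java.util.List;

final class PerSegmentCount {
    /** One segment's bit sets; doc IDs are segment-local and start at 0. */
    static final class Segment {
        final BitSet queryBits;
        final BitSet termBits;

        Segment(BitSet queryBits, BitSet termBits) {
            this.queryBits = queryBits;
            this.termBits = termBits;
        }
    }

    /** Convenience helper for building a bit set from doc IDs. */
    static BitSet bits(int... docIds) {
        BitSet b = new BitSet();
        for (int doc : docIds) {
            b.set(doc);
        }
        return b;
    }

    /** Sums the per-segment intersection sizes into one index-wide count. */
    static long count(List<Segment> segments) {
        long total = 0;
        for (Segment s : segments) {
            BitSet both = (BitSet) s.termBits.clone();
            both.and(s.queryBits);
            total += both.cardinality();
        }
        return total;
    }
}
```

Because doc IDs never need to be compared across segments, no docBase offset is required for counting.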

Mike McCandless

http://blog.mikemccandless.com




RE: Possible to "quickly" fetch count of other terms based on a query

2013-02-22 Thread Lars-Erik Aabech
I guess it performs alright :P
Overall Elapsed:00:00:00.0290029

(29ms)

Lars-Erik



RE: Possible to "quickly" fetch count of other terms based on a query

2013-02-22 Thread Lars-Erik Aabech
Thanks again. I'll look into this at a later time. :)
(Have to read the entire book too..)
