Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-11 Thread Alexei Martchenko
Are you boosting your docs?

2011/8/8 Jason Toy jason...@gmail.com

 I am trying to test out and compare different sorts and scoring.

 When I use dismax to search for indie music
 with: qf=all_lists_text&q=indie+music&defType=dismax&rows=100
 I see some stuff that seems irrelevant, meaning in the top results I see only
 1 or 2 mentions of indie music, but when I look further down the list I do
 see other docs that have more occurrences of indie music.
 So I want to test by comparing the different queries against a
 list of docs ranked specifically by the count of occurrences of the phrase
 indie music.


-- 

*Alexei Martchenko* | *CEO* | Superdownloads
ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
5083.1018/5080.3535/5080.3533


bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Jason Toy
Alexei, thank you, that does seem to work.

My sort results seem to be totally wrong though, I'm not sure if it's because
of my sort function or something else.

My query consists of:
sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100
And I get back 4571232 hits.
All the results don't have the phrase indie music anywhere in their data.
Does termfreq not support phrases?
If not, how can I sort specifically by termfreq of a phrase?



On Mon, Aug 8, 2011 at 1:08 PM, Alexei Martchenko 
ale...@superdownloads.com.br wrote:

 You can use the standard query parser and pass q=*:*

 2011/8/8 Jason Toy jason...@gmail.com

  I am trying to list some data based on a function I run,
  specifically termfreq(post_text,'indie music'), and I am unable to do it
  without passing in data to the q parameter. Is it possible to get a
  sorted list without searching for any terms?
 







-- 
- sent from my mobile
6176064373


Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Yury Kats
On 8/8/2011 4:34 PM, Jason Toy wrote:
 Alexei, thank you, that does seem to work.
 
 My sort results seem to be totally wrong though, I'm not sure if it's because
 of my sort function or something else.
 
 My query consists of:
 sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100
 And I get back 4571232 hits.

That would be the total number of docs, I guess,
since your query is *:*, i.e. find everything.

 All the results don't have the phrase indie music anywhere in their data.

You are only sorting on termfreq of indie music, you are not querying
documents that contain it.
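
The difference between sorting and querying can be sketched as two request URLs. This is only an illustration: the host, port and path are placeholders, the field name is taken from the query quoted above, and the sort uses the single term 'indie' because termfreq counts one indexed term. The first request matches every document and merely orders them; the second also restricts the result set to documents containing the phrase.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption); adjust host, port and path as needed.
SOLR_SELECT = "http://localhost:8983/solr/select"

def build_url(params):
    """Return a select URL with escaped, '&'-separated parameters."""
    return SOLR_SELECT + "?" + urlencode(params)

# Matches everything: the sort only orders the full document set.
sort_only = build_url({
    "q": "*:*",
    "sort": "termfreq(all_lists_text,'indie') desc",
    "rows": 100,
})

# Restricts the result set to documents containing the phrase,
# then orders that smaller set.
phrase_filtered = build_url({
    "q": 'all_lists_text:"indie music"',
    "sort": "termfreq(all_lists_text,'indie') desc",
    "rows": 100,
})

print(sort_only)
print(phrase_filtered)
```

Building the query string with urlencode also avoids hand-escaping mistakes in the sort expression.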


Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Markus Jelsma

 Alexei, thank you, that does seem to work.
 
 My sort results seem to be totally wrong though, I'm not sure if it's
 because of my sort function or something else.
 
 My query consists of:
 sort=termfreq(all_lists_text,'indie+music')+desc&q=*:*&rows=100
 And I get back 4571232 hits.

That's normal, you issued a catch-all query. Sorting should work, but...

 All the results don't have the phrase indie music anywhere in their data.
  Does termfreq not support phrases?

No, it is TERM frequency and indie music is not one term. I don't know how 
this function parses your input, but it might not understand your + escape and 
think it's one term consisting of exactly that.

 If not, how can I sort specifically by termfreq of a phrase?

You cannot. What you can do is index multiple terms as one term using the 
shingle filter. Take care, it can significantly increase your index size and 
number of unique terms.
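
To make the shingle idea concrete, here is a toy sketch in plain Python. It is not Solr's shingle filter, only an imitation of the idea under simplifying assumptions (lowercasing and whitespace tokenization; the sample text is invented). Adjacent words are joined into one indexed term, so a two-word phrase gets its own frequency, and the number of unique terms grows.

```python
from collections import Counter

def shingles(tokens, size=2):
    """Return word n-grams ("shingles") of the given size, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def index_terms(text, shingle_size=2):
    """Toy analyzer: lowercase, split on whitespace, emit unigrams plus shingles."""
    tokens = text.lower().split()
    return Counter(tokens + shingles(tokens, shingle_size))

doc = "indie music fans love indie music and indie labels"
terms = index_terms(doc)

print(terms["indie"])        # frequency of the single term
print(terms["indie music"])  # frequency of the two-word shingle
print(len(set(doc.split())), "unique unigrams vs", len(terms), "indexed terms")
```

Even on this one sentence the indexed vocabulary roughly doubles, which is the index-size cost mentioned above.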

 
 
 


Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Jason Toy
Aren't dismax queries able to search for phrases using the default
index (which is what I am using)? If I can already do phrase searches, I
don't understand why I would need to reindex to be able to access phrases
from a function.



Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Jonathan Rochkind

Dismax queries can. But

sort=termfreq(all_lists_text,'indie+music')

is not using dismax. Apparently the termfreq function cannot? I am not familiar 
with the termfreq function.

To understand why you'd need to reindex, you might want to read up on how 
Lucene actually works, to get a basic understanding of how different indexing 
choices affect what is possible at query time. Lucene in Action is a pretty 
good book.








Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Markus Jelsma

 Aren't dismax queries able to search for phrases using the default
 index (which is what I am using)? If I can already do phrase searches, I
 don't understand why I would need to reindex to be able to access phrases
 from a function.

Executing a Lucene phrase query is not the same as looking up a term frequency 
(phrase != term). A phrase consists of multiple terms and Lucene has an inverted 
term index, not an inverted phrase index (unless you index your data that way).
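
A minimal sketch of what "inverted term index" means, assuming a toy whitespace tokenizer and two invented documents. This is not Lucene's data structure, only an illustration of the lookup. Each single token is a key, so asking for the frequency of a two-word string finds nothing, even though both words occur in the document.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}, one entry per single token."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index

def termfreq(index, term, doc_id):
    """Look up one exact term; anything not stored as a term has frequency 0."""
    return index.get(term, {}).get(doc_id, 0)

docs = {
    1: "indie music and more indie music",
    2: "classical music",
}
index = build_inverted_index(docs)

print(termfreq(index, "indie", 1))        # a term that exists in the index
print(termfreq(index, "indie music", 1))  # a phrase, never stored as one term
```

A phrase query works differently: it finds the individual terms and then checks their positions, which a plain frequency lookup does not do.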

 


Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Markus Jelsma

 Dismax queries can. But
 
 sort=termfreq(all_lists_text,'indie+music')
 
 is not using dismax. Apparently the termfreq function cannot? I am not
 familiar with the termfreq function.

It simply returns the TF of the given _term_, as it is indexed, for the current 
document. 

Sorting on TF like this seems strange, as by default query results are already 
sorted that way, since TF plays a big role in the final score.
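
As a rough illustration of how TF enters the score, the sketch below uses two factors of Lucene's classic default similarity: tf = sqrt(freq) and a length norm of 1/sqrt(number of terms in the field). It is a simplification under stated assumptions: coord, queryNorm and boosts are left out, idf is fixed at 1, norms are not rounded the way the index stores them, and the document sizes are invented.

```python
import math

def tf(freq):
    """Term-frequency factor of the classic default similarity: sqrt(freq)."""
    return math.sqrt(freq)

def length_norm(num_terms):
    """Field-length normalization of the classic similarity: 1 / sqrt(numTerms)."""
    return 1.0 / math.sqrt(num_terms)

def simplified_term_score(freq, num_terms, idf=1.0):
    """tf * idf^2 * norm for one term; coord, queryNorm and boosts are left out."""
    return tf(freq) * idf ** 2 * length_norm(num_terms)

# Invented example documents: few mentions in a short field
# versus more mentions in a much longer field.
short_doc = simplified_term_score(freq=2, num_terms=10)
long_doc = simplified_term_score(freq=6, num_terms=400)

print(round(short_doc, 3), round(long_doc, 3))
```

Under these assumptions the short document scores higher despite fewer mentions, which is one way a document with more occurrences can end up lower in the list.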

 


Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Jason Toy
I am trying to test out and compare different sorts and scoring.

When I use dismax to search for indie music
with: qf=all_lists_text&q=indie+music&defType=dismax&rows=100
I see some stuff that seems irrelevant, meaning in the top results I see only
1 or 2 mentions of indie music, but when I look further down the list I do
see other docs that have more occurrences of indie music.
So I want to test by comparing the different queries against a
list of docs ranked specifically by the count of occurrences of the phrase
indie music.



Re: bug in termfreq? was Re: is it possible to do a sort without query?

2011-08-08 Thread Markus Jelsma
If you want to understand and debug the scoring you can use debugQuery=true 
to see how different documents score. Most of the time docs with both terms are 
on top of the result set unless norms are interfering.

To understand this you should check the Solr relevancy wiki, but the Lucene docs 
are much better, although very low level.

http://wiki.apache.org/solr/SolrRelevancyCookbook
http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/search/Similarity.html

Your question is more a relevance question than one about the termfreq function. 
In short, don't use those kinds of functions if you don't yet understand 
similarity as described in the Lucene docs.
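
A small sketch of the dismax request from this thread with debugQuery=true added. The host, port and path are placeholders, and the fl value is an extra assumption added here so that each document's score is returned alongside the explanation.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption); adjust host, port and path as needed.
SOLR_SELECT = "http://localhost:8983/solr/select"

params = {
    "defType": "dismax",
    "qf": "all_lists_text",
    "q": "indie music",
    "rows": 100,
    "debugQuery": "true",  # asks Solr to include score explanations
    "fl": "*,score",       # also return each document's score
}

url = SOLR_SELECT + "?" + urlencode(params)
print(url)
```

The explanation section of the response breaks each score into its factors, so a document with fewer mentions that ranks higher can be compared factor by factor against one ranked below it.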
