Re: Top 10 Terms in Index (by date)

2013-04-02 Thread Tomás Fernández Löbbe
Oh, I see. Essentially you want the sum of the term frequencies for every
term across a subset of documents (instead of the document frequency that
the FacetComponent would give you). I don't know of an easy, out-of-the-box
solution for this. I know the TermVectorComponent will give you the tf for
every term in a document, but I'm not sure whether you can filter or sort
on it. Maybe you can do something like:
https://issues.apache.org/jira/browse/LUCENE-2393
or what's suggested here:
http://search-lucene.com/m/of5Fn1PUOHU/
but I have never used anything like that.

Tomás
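
As a rough illustration of "the sum of the term frequencies for every term
in a subset of documents", here is a minimal client-side sketch. It assumes
the per-document term frequencies have already been fetched and parsed into
a hypothetical list of (dateCreated, {term: tf}) pairs; that structure and
the function name are illustrative only and are not the raw Solr response
format.

```python
from collections import Counter
from datetime import date


def top_terms_in_range(docs, start, end, n=10):
    """Sum term frequencies over documents created within [start, end].

    docs is a hypothetical, already-parsed list of
    (created_date, {term: tf}) pairs, not a raw Solr response.
    """
    totals = Counter()
    for created, tf_by_term in docs:
        if start <= created <= end:
            totals.update(tf_by_term)  # adds the counts term by term
    return totals.most_common(n)


docs = [
    (date(2013, 3, 30), {"i": 2, "think": 1, "you": 1}),
    (date(2013, 4, 1), {"you": 1, "think": 1, "much": 1}),
    (date(2013, 4, 2), {"i": 2, "think": 2, "much": 1, "you": 1}),
]
print(top_terms_in_range(docs, date(2013, 4, 1), date(2013, 4, 2), n=3))
```

Summing on the client means pulling every matching document's term vector
over the wire, so this is likely only practical for modest result sets.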






Re: Top 10 Terms in Index (by date)

2013-04-02 Thread Andy Pickler
A key problem with those approaches, as well as with Lucene's HighFreqTerms
class
(http://lucene.apache.org/core/4_2_0/misc/org/apache/lucene/misc/HighFreqTerms.html),
is that none of them seems able to combine with a date range query, which
is key in my scenario.  I'm kinda thinking that what I'm asking to do just
isn't supported by Lucene or Solr, and that I'll have to pursue another
avenue.  If anyone has any other suggestions, I'm all ears.  I'm starting
to wonder if I need a nightly batch job that executes against my database
and builds up that day's top terms in a table or something.

Thanks,
Andy Pickler
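
The nightly batch idea could be sketched roughly as follows. This is only
an illustration under stated assumptions: the daily_terms table, its
columns, the function names, and the regex tokenizer are all hypothetical,
and an in-memory SQLite database stands in for the real application
database.

```python
import re
import sqlite3
from collections import Counter


def record_daily_counts(conn, day, comments):
    """Count every term occurrence in one day's comments and store the totals."""
    counts = Counter()
    for text in comments:
        # Simplified tokenizer: lowercase, split on non-word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    conn.executemany(
        "INSERT INTO daily_terms (day, term, occurrences) VALUES (?, ?, ?)",
        [(day, term, n) for term, n in counts.items()],
    )


def top_terms(conn, start_day, end_day, limit=10):
    """Sum the stored daily counts over an inclusive date range."""
    return conn.execute(
        "SELECT term, SUM(occurrences) AS total FROM daily_terms "
        "WHERE day BETWEEN ? AND ? GROUP BY term "
        "ORDER BY total DESC, term LIMIT ?",
        (start_day, end_day, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_terms (day TEXT, term TEXT, occurrences INTEGER)"
)
record_daily_counts(
    conn, "2013-04-01",
    ["I think, therefore I am like you", "You think too much"],
)
record_daily_counts(conn, "2013-04-02", ["I think that I think much as you"])
print(top_terms(conn, "2013-04-01", "2013-04-02", limit=4))
```

Storing one row per (day, term) keeps the report query cheap for any date
range, at the cost of the counts being only as fresh as the last batch run.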




Top 10 Terms in Index (by date)

2013-04-01 Thread Andy Pickler
Our company has an application that is Facebook-like for usage by
enterprise customers.  We'd like to do a report of top 10 terms entered by
users over (some time period).  With that in mind I'm using the
DataImportHandler to put all the relevant data from our database into a
Solr 'content' field:

<field name="content" type="text_general" indexed="true" stored="false"
       multiValued="false" required="true" termVectors="true"/>

Along with the content is the 'dateCreated' for that content:

<field name="dateCreated" type="tdate" indexed="true" stored="false"
       multiValued="false" required="true"/>

I'm struggling with the TermVectorComponent documentation to understand how
I can put together a query that answers the 'report' mentioned above.  For
each document I need each term counted however many times it is entered
(content of "I think what I think" would report 'think' as used twice).
Does anyone have any insight as to whether I'm headed in the right
direction and then what my query would be?

Thanks,
Andy Pickler
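
For reference, a TermVectorComponent request restricted to a date range
might be assembled as in the sketch below. The tv, tv.fl, tv.tf, and fq
parameters are standard Solr request parameters; the host, the core name,
and the assumption that a /tvrh handler with the component is configured
in solrconfig.xml are illustrative. Note that this returns term
frequencies per document, so any summing across documents would still have
to happen on the client.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, start, end, rows=100):
    """Build a TermVectorComponent request limited to a dateCreated range."""
    params = {
        "q": "*:*",
        "fq": f"dateCreated:[{start} TO {end}]",  # date range filter
        "rows": rows,
        "tv": "true",        # enable the TermVectorComponent
        "tv.fl": "content",  # field indexed with termVectors="true"
        "tv.tf": "true",     # include the per-document term frequency
        "wt": "json",
    }
    # Assumes a request handler named /tvrh is configured for the core.
    return f"{base_url}/tvrh?{urlencode(params)}"


url = term_vector_url(
    "http://localhost:8983/solr/collection1",  # hypothetical host and core
    "2013-03-01T00:00:00Z",
    "2013-04-01T00:00:00Z",
)
print(url)
```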


Re: Top 10 Terms in Index (by date)

2013-04-01 Thread Tomás Fernández Löbbe
So you have one document per user comment? Why not use faceting plus
filtering on the dateCreated field? That would count the number of
documents for each term (so, in your case, if a term is used twice in one
comment it would only count once). Is that what you are looking for?

Tomás
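
A faceting request along these lines could look roughly like the sketch
below. The facet and fq parameters are standard Solr request parameters;
the host and core name are hypothetical. As noted above, the counts that
come back are document counts per term, not total occurrences.

```python
from urllib.parse import urlencode


def facet_url(base_url, start, end, limit=10):
    """Build a facet request for the top terms in 'content' by document count."""
    params = {
        "q": "*:*",
        "fq": f"dateCreated:[{start} TO {end}]",  # restrict to the time period
        "rows": 0,                 # only the facet counts are needed
        "facet": "true",
        "facet.field": "content",  # facet on the indexed terms of 'content'
        "facet.limit": limit,      # top N terms
        "facet.mincount": 1,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


facet_request = facet_url(
    "http://localhost:8983/solr/collection1",  # hypothetical host and core
    "2013-03-01T00:00:00Z",
    "2013-04-01T00:00:00Z",
)
print(facet_request)
```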





Re: Top 10 Terms in Index (by date)

2013-04-01 Thread Andy Pickler
I need the total number of occurrences across all documents for each term.
Imagine this...

Post #1: I think, therefore I am like you
Reply #1: You think too much
Reply #2: I think that I think much as you

Each of those documents is put into 'content'.  Pretending I don't have
stop words, the top term query (not considering dateCreated in this
example) would result in something like...

think: 4
I: 4
you: 3
much: 2
...

Thus, a plain number-of-documents approach doesn't work, because if a word
occurs more than once in a document it needs to be counted that many
times.  That seemed to rule out faceting like you mentioned as well as the
TermsComponent (which as I understand it also only counts documents).

Thanks,
Andy Pickler
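
The difference between the two kinds of count can be checked on the three
example documents with a small sketch. The regex tokenizer here is a
simplified stand-in for a real analyzer (lowercasing, no stop words) and is
an assumption for illustration only.

```python
import re
from collections import Counter

docs = [
    "I think, therefore I am like you",   # Post #1
    "You think too much",                 # Reply #1
    "I think that I think much as you",   # Reply #2
]


def tokenize(text):
    # Simplified stand-in for an analyzer: lowercase, split on non-word chars.
    return re.findall(r"\w+", text.lower())


total_tf = Counter()   # every occurrence counts
doc_freq = Counter()   # each document counts a term at most once
for text in docs:
    tokens = tokenize(text)
    total_tf.update(tokens)
    doc_freq.update(set(tokens))

for term in ("think", "i", "you", "much"):
    print(term, total_tf[term], doc_freq[term])
```

The total occurrences come out as think: 4, i: 4, you: 3, much: 2, matching
the list above, while the document counts are think: 3, i: 2, you: 3,
much: 2, which is what a document-count approach would report.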
