Re: Top 10 Terms in Index (by date)
Oh, I see. Essentially you want the sum of the term frequencies for every term in a subset of documents (instead of the document frequency, which is what the FacetComponent would give you). I don't know of an easy, out-of-the-box solution for this. I know the TermVectorComponent will give you the tf for every term in a document, but I'm not sure whether you can filter or sort on it.

Maybe you can do something like https://issues.apache.org/jira/browse/LUCENE-2393 or what's suggested here: http://search-lucene.com/m/of5Fn1PUOHU/ but I have never used anything like that.

Tomás

On Mon, Apr 1, 2013 at 9:58 PM, Andy Pickler andy.pick...@gmail.com wrote:

> I need the total number of occurrences across all documents for each term. Imagine this...
>
> Post #1: I think, therefore I am like you
> Reply #1: You think too much
> Reply #2: I think that I think much as you
>
> Each of those documents is put into 'content'. Pretending I don't have stop words, the top term query (not considering dateCreated in this example) would result in something like...
>
> think: 4
> I: 4
> you: 3
> much: 2
> ...
>
> Thus, just a number-of-documents approach doesn't work, because if a word occurs more than once in a document it needs to be counted that many times. That seemed to rule out faceting like you mentioned as well as the TermsComponent (which as I understand it also only counts documents).
>
> Thanks,
> Andy Pickler
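As a rough sketch of the TermVectorComponent route mentioned in this message: one could restrict the documents with a filter query on dateCreated, ask for per-document term frequencies, and add them up on the client. The parameter names `tv`, `tv.tf` and `tv.fl` are the component's own; everything else here is an assumption for illustration. In particular, the function names are hypothetical, the parsing of the actual response is not shown, and `per_doc_tf` stands for an already-parsed list of `{term: tf}` mappings, one per matching document.

```python
from collections import Counter


def term_vector_params(start, end, field="content", rows=100, offset=0):
    """Build request parameters for a handler that has the
    TermVectorComponent configured (which handler that is depends on
    solrconfig.xml and is not assumed here)."""
    return {
        "q": "*:*",
        "fq": f"dateCreated:[{start} TO {end}]",  # date range as a filter
        "tv": "true",        # enable the term vector component
        "tv.tf": "true",     # return the term frequency per document
        "tv.fl": field,      # field to read term vectors from
        "start": str(offset),
        "rows": str(rows),   # every matching document has to be paged through
        "wt": "json",
    }


def sum_term_frequencies(per_doc_tf):
    """Add up per-document term frequencies into totals per term."""
    totals = Counter()
    for tf_by_term in per_doc_tf:
        totals.update(tf_by_term)  # Counter.update adds the counts
    return totals


params = term_vector_params("2013-03-01T00:00:00Z", "2013-04-01T00:00:00Z")

# Hypothetical, already-parsed term vectors for the three example documents.
per_doc_tf = [
    {"i": 2, "think": 1, "therefore": 1, "am": 1, "like": 1, "you": 1},
    {"you": 1, "think": 1, "too": 1, "much": 1},
    {"i": 2, "think": 2, "that": 1, "much": 1, "as": 1, "you": 1},
]
totals = sum_term_frequencies(per_doc_tf)
print(totals.most_common(4))
```

The sorting happens entirely on the client, so the cost grows with the number of documents in the date range.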
Re: Top 10 Terms in Index (by date)
A key problem with those approaches, as well as with Lucene's HighFreqTerms class (http://lucene.apache.org/core/4_2_0/misc/org/apache/lucene/misc/HighFreqTerms.html), is that none of them seems to be able to combine with a date range query, which is key in my scenario. I'm kinda thinking that what I'm asking for just isn't supported by Lucene or Solr, and that I'll have to pursue another avenue. If anyone has any other suggestions, I'm all ears.

I'm starting to wonder if I need some nightly batch job that executes against my database and builds up that day's top terms in a table or something.

Thanks,
Andy Pickler

On Tue, Apr 2, 2013 at 7:16 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote:

> Oh, I see. Essentially you want the sum of the term frequencies for every term in a subset of documents (instead of the document frequency, which is what the FacetComponent would give you). I don't know of an easy, out-of-the-box solution for this. I know the TermVectorComponent will give you the tf for every term in a document, but I'm not sure whether you can filter or sort on it.
>
> Maybe you can do something like https://issues.apache.org/jira/browse/LUCENE-2393 or what's suggested here: http://search-lucene.com/m/of5Fn1PUOHU/ but I have never used anything like that.
>
> Tomás
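The nightly batch idea floated in this message could look roughly like the sketch below: each night the day's term counts are written to a table, and the report sums them over any date range. This is only an illustration under stated assumptions. The table name `daily_terms`, both function names, and the use of SQLite are hypothetical choices, and the sample counts for the second day are made up.

```python
import sqlite3
from collections import Counter


def store_daily_counts(conn, day, term_counts):
    """Write one day's term counts (day is an ISO date string)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_terms ("
        "day TEXT, term TEXT, n INTEGER, PRIMARY KEY (day, term))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_terms (day, term, n) VALUES (?, ?, ?)",
        [(day, term, n) for term, n in term_counts.items()],
    )
    conn.commit()


def top_terms(conn, start_day, end_day, limit=10):
    """Sum the daily counts over an inclusive date range."""
    rows = conn.execute(
        "SELECT term, SUM(n) AS total FROM daily_terms "
        "WHERE day BETWEEN ? AND ? "
        "GROUP BY term ORDER BY total DESC, term ASC LIMIT ?",
        (start_day, end_day, limit),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
# Day 1 uses the counts from the thread's example; day 2 is invented.
store_daily_counts(conn, "2013-04-01", Counter({"think": 4, "i": 4, "you": 3, "much": 2}))
store_daily_counts(conn, "2013-04-02", Counter({"think": 1, "much": 3}))
print(top_terms(conn, "2013-04-01", "2013-04-02", limit=3))
```

ISO date strings sort lexicographically, which is what lets `BETWEEN` work on a plain text column here.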
Top 10 Terms in Index (by date)
Our company has an application that is Facebook-like for usage by enterprise customers. We'd like to do a report of the top 10 terms entered by users over (some time period). With that in mind I'm using the DataImportHandler to put all the relevant data from our database into a Solr 'content' field:

<field name="content" type="text_general" indexed="true" stored="false" multiValued="false" required="true" termVectors="true"/>

Along with the content is the 'dateCreated' for that content:

<field name="dateCreated" type="tdate" indexed="true" stored="false" multiValued="false" required="true"/>

I'm struggling with the TermVectorComponent documentation to understand how I can put together a query that answers the 'report' mentioned above. For each document I need each term counted however many times it is entered (content of 'I think what I think' would report 'think' as used twice). Does anyone have any insight as to whether I'm headed in the right direction, and then what my query would be?

Thanks,
Andy Pickler
Re: Top 10 Terms in Index (by date)
So you have one document per user comment? Why not use faceting plus filtering on the dateCreated field? That would count the number of documents for each term (so, in your case, if a term is used twice in one comment it would only count once). Is that what you are looking for?

Tomás

On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler andy.pick...@gmail.com wrote:

> Our company has an application that is Facebook-like for usage by enterprise customers. We'd like to do a report of the top 10 terms entered by users over (some time period). With that in mind I'm using the DataImportHandler to put all the relevant data from our database into a Solr 'content' field:
>
> <field name="content" type="text_general" indexed="true" stored="false" multiValued="false" required="true" termVectors="true"/>
>
> Along with the content is the 'dateCreated' for that content:
>
> <field name="dateCreated" type="tdate" indexed="true" stored="false" multiValued="false" required="true"/>
>
> I'm struggling with the TermVectorComponent documentation to understand how I can put together a query that answers the 'report' mentioned above. For each document I need each term counted however many times it is entered (content of 'I think what I think' would report 'think' as used twice). Does anyone have any insight as to whether I'm headed in the right direction, and then what my query would be?
>
> Thanks,
> Andy Pickler
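For reference, the faceting-plus-filtering suggestion in this reply could be expressed with a request along the following lines. This is a minimal sketch: the host, port, core name `collection1` and the function name are assumptions, while `fq`, `facet`, `facet.field` and `facet.limit` are standard Solr request parameters. As the reply points out, the resulting counts are numbers of documents per term, not total occurrences.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, start, end, field="content", limit=10):
    """Build a select URL that facets on `field` within a date range."""
    params = [
        ("q", "*:*"),
        ("fq", f"dateCreated:[{start} TO {end}]"),  # restrict by date
        ("rows", "0"),               # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", str(limit)),  # top N terms by document count
        ("wt", "json"),
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical host and core name.
url = facet_query_url(
    "http://localhost:8983/solr/collection1/select",
    "2013-03-01T00:00:00Z",
    "2013-04-01T00:00:00Z",
)
print(url)
```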
Re: Top 10 Terms in Index (by date)
I need the total number of occurrences across all documents for each term. Imagine this...

Post #1: I think, therefore I am like you
Reply #1: You think too much
Reply #2: I think that I think much as you

Each of those documents is put into 'content'. Pretending I don't have stop words, the top term query (not considering dateCreated in this example) would result in something like...

think: 4
I: 4
you: 3
much: 2
...

Thus, just a number-of-documents approach doesn't work, because if a word occurs more than once in a document it needs to be counted that many times. That seemed to rule out faceting like you mentioned as well as the TermsComponent (which as I understand it also only counts documents).

Thanks,
Andy Pickler

On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote:

> So you have one document per user comment? Why not use faceting plus filtering on the dateCreated field? That would count the number of documents for each term (so, in your case, if a term is used twice in one comment it would only count once). Is that what you are looking for?
>
> Tomás
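The difference between the two kinds of count discussed in this message can be checked on the three example documents with a small sketch. The `tokenize` function is a simplified stand-in for the field's analyzer (lowercasing and splitting on non-letters), not what text_general actually does.

```python
import re
from collections import Counter

# The three example documents from this message.
docs = [
    "I think, therefore I am like you",
    "You think too much",
    "I think that I think much as you",
]


def tokenize(text):
    # Simplified stand-in for the analyzer: lowercase, keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())


total_tf = Counter()  # every occurrence counts
doc_freq = Counter()  # a term counts at most once per document
for doc in docs:
    tokens = tokenize(doc)
    total_tf.update(tokens)
    doc_freq.update(set(tokens))

for term in ("think", "i", "you", "much"):
    print(term, "occurrences:", total_tf[term], "documents:", doc_freq[term])
```

The total occurrences match the figures in the message (think: 4, I: 4, you: 3, much: 2), while the document counts come out lower for 'think' and 'I', which is the gap between what faceting reports and what the report needs.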