Certainly, yes. I'm just doing a word count, i.e. how often does a specific term come up in the corpus?

On Oct 24, 2015 4:20 PM, "Upayavira" <u...@odoko.co.uk> wrote:
> yes, but what do you want to do with the TF? What problem are you
> solving with it? If you are able to share that...
>
> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
> > Yes, sorry, I am not being clear.
> >
> > We are not even doing scoring, just getting the raw TF values. We're doing this in solr because it can scale well.
> >
> > But with large corpora, retrieving the word counts takes some time, in part because solr is splitting up word count by document and generating a large request. We then get the request and just sum it all up. I'm wondering if there's a more direct way.
> >
> > On Oct 24, 2015 4:00 PM, "Upayavira" <u...@odoko.co.uk> wrote:
> > >
> > > Can you explain more what you are using TF for? Because it sounds rather like scoring. You could disable field norms and IDF and scoring would be mostly TF, no?
> > >
> > > Upayavira
> > >
> > > On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> > > > Thanks, let me think about that.
> > > >
> > > > We're using termfreq to get the TF score, but we don't know which term we'll need the TF for. So we'd have to do a corpuswide summing of termfreq for each potential term across all documents in the corpus. It seems like it'd require some development work to compute that, and our code would be fragile.
> > > >
> > > > Let me think about that more.
> > > >
> > > > It might make sense to just move to solrcloud, it's the right architectural decision anyway.
> > > >
> > > >
> > > > On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <u...@odoko.co.uk> wrote:
> > > >
> > > > > If you just want word length, then do work during indexing - index a field for the word length. Then, I believe you can do faceting - e.g. with the json faceting API I believe you can do a sum() calculation on a field rather than the more traditional count.
> > > > >
> > > > > Thinking aloud, there might be an easier way - index a field that is the same for all documents, and facet on it. Instead of counting the number of documents, calculate the sum() of your word count field.
> > > > >
> > > > > I *think* that should work.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> > > > > > Hi Jack,
> > > > > >
> > > > > > I'm just using solr to get word count across a large number of documents.
> > > > > >
> > > > > > It's somewhat non-standard, because we're ignoring relevance, but it seems to work well for this use case otherwise.
> > > > > >
> > > > > > My understanding then is:
> > > > > > 1) since termfreq is pre-processed and fetched, there's no good way to speed it up (except by caching earlier calculations)
> > > > > >
> > > > > > 2) there's no way to have solr sum up all of the termfreqs across all documents in a search and just return one number for total termfreqs
> > > > > >
> > > > > >
> > > > > > Are these correct?
> > > > > >
> > > > > > Thanks,
> > > > > > Aki
> > > > > >
> > > > > >
> > > > > > On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> > > > > >
> > > > > > > That's what a normal query does - Lucene takes all the terms used in the query and sums them up for each document in the response, producing a single number, the score, for each document. That's the way Solr is designed to be used. You still haven't elaborated why you are trying to use Solr in a way other than it was intended.
> > > > > > >
> > > > > > > -- Jack Krupansky
> > > > > > >
> > > > > > > On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <a...@marketmuse.com> wrote:
> > > > > > >
> > > > > > > > Gotcha - that's disheartening.
> > > > > > > >
> > > > > > > > One idea: when I run termfreq, I get all of the termfreqs for each document one-by-one.
> > > > > > > >
> > > > > > > > Is there a way to have solr sum it up before creating the request, so I only receive one number in the response?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <u...@odoko.co.uk> wrote:
> > > > > > > >
> > > > > > > > > If you mean using the term frequency function query, then I'm not sure there's a huge amount you can do to improve performance.
> > > > > > > > >
> > > > > > > > > The term frequency is a number that is used often, so it is stored in the index pre-calculated. Perhaps, if your data is not changing, optimising your index would reduce it to one segment, and thus might ever so slightly speed the aggregation of term frequencies, but I doubt it'd make enough difference to make it worth doing.
> > > > > > > > >
> > > > > > > > > Upayavira
> > > > > > > > >
> > > > > > > > > On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> > > > > > > > > > Thanks, Jack. I did some more research and found similar results.
> > > > > > > > > >
> > > > > > > > > > In our application, we are making multiple (think: 50) concurrent requests to calculate term frequency on a set of documents in "real-time". The faster that results return, the better.
> > > > > > > > > >
> > > > > > > > > > Most of these requests are unique, so cache only helps slightly.
> > > > > > > > > >
> > > > > > > > > > This analysis is happening on a single solr instance.
> > > > > > > > > >
> > > > > > > > > > Other than moving to solr cloud and splitting out the processing onto multiple servers, do you have any suggestions for what might speed up termfreq at query time?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Aki
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Term frequency applies only to the indexed terms of a tokenized field. DocValues is really just a copy of the original source text and is not tokenized into terms.
> > > > > > > > > > >
> > > > > > > > > > > Maybe you could explain how exactly you are using term frequency in function queries. More importantly, what is so "heavy" about your usage? Generally, moderate use of a feature is much more advisable to heavy usage, unless you don't care about performance.
> > > > > > > > > > >
> > > > > > > > > > > -- Jack Krupansky
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <a...@marketmuse.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > In our solr application, we use a Function Query (termfreq) very heavily.
> > > > > > > > > > > >
> > > > > > > > > > > > Index time and disk space are not important, but we're looking to improve performance on termfreq at query time.
> > > > > > > > > > > > I've been reading up on docValues. Would this be a way to improve performance?
> > > > > > > > > > > >
> > > > > > > > > > > > I had read that Lucene uses Field Cache for Function Queries, so performance may not be affected.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > And, any general suggestions for improving query performance on Function Queries?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Aki
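
For anyone following the thread, here is a rough sketch of the two approaches discussed above: fetching a per-document termfreq value and summing it client-side, and asking the JSON Facet API for a sum() over a pre-indexed numeric field. It only builds request parameters and sums a response that has already been parsed; it does not talk to a server. The field names (`body`, `word_count_i`), the `tf` alias, the `total` facet label, and the sample response are all made up for illustration, so treat this as an assumption-laden sketch and not a tested recipe for any particular Solr version.

```python
import json


def termfreq_params(field, term, query="*:*", rows=1000):
    """Build params that ask Solr for termfreq() as a per-document pseudo-field.

    Assumes `term` contains no single quotes; `tf` is just an alias we chose.
    """
    return {
        "q": query,
        "fl": f"id,tf:termfreq({field},'{term}')",
        "rows": rows,
        "wt": "json",
    }


def sum_termfreqs(response):
    """Sum the per-document `tf` values from a parsed JSON response (client-side)."""
    return sum(doc.get("tf", 0) for doc in response["response"]["docs"])


def facet_sum_params(numeric_field, query="*:*"):
    """Build params that ask the JSON Facet API for one sum() over a numeric field.

    `numeric_field` is a hypothetical field populated at index time.
    """
    return {
        "q": query,
        "rows": 0,  # no documents needed, only the aggregate
        "wt": "json",
        "json.facet": json.dumps({"total": f"sum({numeric_field})"}),
    }


# Hand-written sample shaped like a Solr JSON response (not real output).
sample = {
    "response": {
        "numFound": 3,
        "docs": [
            {"id": "1", "tf": 2},
            {"id": "2", "tf": 0},
            {"id": "3", "tf": 5},
        ],
    }
}

print(termfreq_params("body", "solr")["fl"])
print(sum_termfreqs(sample))
print(facet_sum_params("word_count_i")["json.facet"])
```

The first approach is what the thread describes as slow on large corpora, since the response grows with the number of documents. The second returns a single aggregate, but it only helps if the value to be summed can be indexed ahead of time.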