With regression you're looking at how a change in one variable affects the change in another variable, so you need values that are changing. What you described is the average of field X, which does not change, regressed against the values of X.
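To make the point about changing values concrete, here is a minimal sketch in plain Python, not Solr code. The helper names `linear_regression` and `moving_average` and the sample numbers are made up for illustration. It fits a least-squares line against a column that holds the same average on every row, and then against a trailing moving average, which does vary.

```python
def linear_regression(x, y):
    """Ordinary least squares for y = slope * x + intercept.

    Returns (slope, intercept, r_squared). r_squared is None when y has
    no variance, because R^2 = 1 - SSE/SST would divide by SST = 0.
    Raises ValueError when x has no variance (the slope is undefined).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0:
        raise ValueError("x is constant; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    r_squared = None if sst == 0 else 1 - sse / sst
    return slope, intercept, r_squared


def moving_average(values, window):
    """Trailing moving average; returns len(values) - window + 1 points."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical sample values standing in for field X.
x = [2.0, 4.0, 6.0, 8.0]

# The same average repeated for every row, as described in the question.
avg_x = [sum(x) / len(x)] * len(x)
flat = linear_regression(x, avg_x)

# A window-2 moving average, aligned with the rows it ends on.
window = 2
trend = linear_regression(x[window - 1 :], moving_average(x, window))
```

With the constant column the fit is a flat line and R^2 cannot be computed, since SST is zero. Swapping the arguments is no better: a constant x makes the slope undefined. The moving-average column gives an ordinary fit.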
I think one approach is to regress the moving average of X against the actual values of X. We can do this with the math library, but before exploring the code, spend some time thinking about whether that's the problem you're trying to solve. Take a look at how moving averages work:

https://en.wikipedia.org/wiki/Moving_average

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Mar 16, 2018 at 9:26 AM, John Smith <localde...@gmail.com> wrote:

> Thanks for the link to the documentation, that will probably come in
> useful.
>
> I didn't see a way, though, to get my avg function working. So instead of
> doing a linear regression on two fields, X and Y, in a hitlist, we need to
> do a linear regression on field X and the average value of X. Is that
> possible? To pass in a function to the regress function instead of a field?
>
> On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
> > I've been working on the user guide for the math expressions. Here is the
> > page on regression:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
> >
> > This page is part of the larger math expression documentation. The TOC is
> > here:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
> >
> > The docs are still very rough but you can get an idea of the coverage.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein <joels...@gmail.com>
> > wrote:
> >
> > > If you want to get everything in one query you can do this:
> > >
> > > let(echo="d,e",
> > >     a=search(tx_prod_production,
> > >              q="oil_first_90_days_production:[1 TO *]",
> > >              fq="isParent:true", rows="1500000",
> > >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > >              sort="id asc"),
> > >     b=col(a, oil_first_90_days_production),
> > >     c=col(a, oil_last_30_days_production),
> > >     d=regress(b, c),
> > >     e=someExpression())
> > >
> > > The echo parameter tells the let expression which variables to output.
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson <erickerick...@gmail.com>
> > > wrote:
> > >
> > >> What does the fq clause look like?
> > >>
> > >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith <localde...@gmail.com>
> > >> wrote:
> > >> > Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> > >> > have nulls in our data; the document contains many fields, and we don't
> > >> > always have values for each field, but we can't set the nulls to 0 either
> > >> > (or any other value, really) as that would mess up other calculations (such
> > >> > as calculating the average); we would normally just ignore fields with null
> > >> > values when calculating stats manually ourselves.
> > >> >
> > >> > Adding a check in the "q" parameter to ensure that the fields used in the
> > >> > calculations are > 0 does work now. Thanks for the tip (and sorry, I should
> > >> > have caught that myself). But I am unable to use "fq" for these checks;
> > >> > they have to be added to the q instead. Adding fq's doesn't have any effect.
> > >> >
> > >> > Anyway, I'm trying to change this up a little.
> > >> > This is what I'm currently
> > >> > using (switched from "random" to "search" since I actually need the full
> > >> > hitlist, not just a random subset):
> > >> >
> > >> > let(a=search(tx_prod_production,
> > >> >              q="oil_first_90_days_production:[1 TO *]",
> > >> >              fq="isParent:true", rows="1500000",
> > >> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > >> >              sort="id asc"),
> > >> >     b=col(a, oil_first_90_days_production),
> > >> >     c=col(a, oil_last_30_days_production),
> > >> >     d=regress(b, c))
> > >> >
> > >> > So I have 2 fields defined there, and that works great (in terms of a test
> > >> > and running the query); but I need to replace the second field,
> > >> > "oil_last_30_days_production", with the avg value of
> > >> > oil_first_90_days_production.
> > >> >
> > >> > I can get the avg with this expression:
> > >> >
> > >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > >> >       fq="isParent:true", rows="1500000",
> > >> >       avg(oil_first_90_days_production))
> > >> >
> > >> > But I don't know how to push that avg value into the first streaming
> > >> > expression; I'm guessing I have to set "c=...." but that is where I'm
> > >> > getting lost, since avg only returns 1 value and the first parameter, "b",
> > >> > returns a list of sorts. Somehow I have to get the avg value stuffed inside
> > >> > a "col", where it is the same value for every row in the hitlist...?
> > >> >
> > >> > Thanks for your help!
> > >> >
> > >> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <joels...@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> I suspect you've got nulls in your data. I just tested with null values and
> > >> >> got the same error. For testing purposes, try loading the data with default
> > >> >> values of zero.
> > >> >>
> > >> >> Joel Bernstein
> > >> >> http://joelsolr.blogspot.com/
> > >> >>
> > >> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <joels...@gmail.com>
> > >> >> wrote:
> > >> >>
> > >> >> > Let's break the expression down and build it up slowly. Let's start with:
> > >> >> >
> > >> >> > let(echo="true",
> > >> >> >     a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> > >> >> >              fl="oil_first_90_days_production,oil_last_30_days_production"),
> > >> >> >     b=col(a, oil_first_90_days_production))
> > >> >> >
> > >> >> > This should return variables a and b. Let's see what the data looks like.
> > >> >> > I changed the rows from 15000 to 15. If it all looks good we can expand the
> > >> >> > rows and continue adding functions.
> > >> >> >
> > >> >> > Joel Bernstein
> > >> >> > http://joelsolr.blogspot.com/
> > >> >> >
> > >> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <localde...@gmail.com>
> > >> >> > wrote:
> > >> >> >
> > >> >> >> Thanks Joel for your help on this.
> > >> >> >>
> > >> >> >> What I've done so far:
> > >> >> >> - unzip downloaded solr-7.2
> > >> >> >> - modify the _default "managed-schema" to add the random field type and
> > >> >> >>   the dynamic random field
> > >> >> >> - start solr7 using "solr start -c"
> > >> >> >> - indexed my data using pint/pdouble/boolean field types etc
> > >> >> >>
> > >> >> >> I can now run the random function all by itself, and it returns random
> > >> >> >> results as expected. So far so good!
> > >> >> >>
> > >> >> >> However...
> > >> >> >> now trying to get the regression stuff working:
> > >> >> >>
> > >> >> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> > >> >> >>              rows="15000",
> > >> >> >>              fl="oil_first_90_days_production,oil_last_30_days_production"),
> > >> >> >>     b=col(a, oil_first_90_days_production),
> > >> >> >>     c=col(a, oil_last_30_days_production),
> > >> >> >>     d=regress(b, c))
> > >> >> >>
> > >> >> >> I posted this directly into the solr admin UI. When I run the streaming
> > >> >> >> expression I get this error message:
> > >> >> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> > >> >> >> expected but found type java.lang.String for value
> > >> >> >> oil_first_90_days_production"
> > >> >> >>
> > >> >> >> It thinks my numeric field is defined as a string? But when I view the
> > >> >> >> schema, those 2 fields are defined as ints:
> > >> >> >>
> > >> >> >>
> > >> >> >> When I run a normal query and choose xml as the output format, it also
> > >> >> >> puts "int" elements into the hitlist, so the schema appears to be correct;
> > >> >> >> it's just when using this regress function that something goes wrong and
> > >> >> >> solr thinks the field is a string.
> > >> >> >>
> > >> >> >> Any suggestions?
> > >> >> >> Thanks!
> > >> >> >>
> > >> >> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <joels...@gmail.com>
> > >> >> >> wrote:
> > >> >> >>
> > >> >> >>> The field type will also need to be in the schema:
> > >> >> >>>
> > >> >> >>> <!-- The "RandomSortField" is not used to store or search any
> > >> >> >>>      data. You can declare fields of this type in your schema
> > >> >> >>>      to generate pseudo-random orderings of your docs for sorting
> > >> >> >>>      or function purposes. The ordering is generated based on the field
> > >> >> >>>      name and the version of the index.
> > >> >> >>>      As long as the index version
> > >> >> >>>      remains unchanged, and the same field name is reused,
> > >> >> >>>      the ordering of the docs will be consistent.
> > >> >> >>>      If you want different pseudo-random orderings of documents,
> > >> >> >>>      for the same version of the index, use a dynamicField and
> > >> >> >>>      change the field name in the request.
> > >> >> >>> -->
> > >> >> >>>
> > >> >> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
> > >> >> >>>
> > >> >> >>> Joel Bernstein
> > >> >> >>> http://joelsolr.blogspot.com/
> > >> >> >>>
> > >> >> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <joels...@gmail.com>
> > >> >> >>> wrote:
> > >> >> >>>
> > >> >> >>> > You'll need to have this field in your schema:
> > >> >> >>> >
> > >> >> >>> > <dynamicField name="random_*" type="random" />
> > >> >> >>> >
> > >> >> >>> > I'll check to see if the default schema used with solr start -c has this
> > >> >> >>> > field; if not, I'll add it. Thanks for pointing this out.
> > >> >> >>> >
> > >> >> >>> > I checked and right now the random expression only accepts one fq,
> > >> >> >>> > but I consider this a bug. It should accept multiple. I'll create a ticket
> > >> >> >>> > for getting this fixed.
> > >> >> >>> >
> > >> >> >>> > Joel Bernstein
> > >> >> >>> > http://joelsolr.blogspot.com/
> > >> >> >>> >
> > >> >> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <localde...@gmail.com>
> > >> >> >>> > wrote:
> > >> >> >>> >
> > >> >> >>> >> Joel, thanks for the pointers to the streaming feature. I had no idea solr
> > >> >> >>> >> had that (and also just discovered the very interesting sql feature!
> > >> >> >>> >> I will
> > >> >> >>> >> be sure to investigate that in more detail in the future).
> > >> >> >>> >>
> > >> >> >>> >> However, I'm having some trouble getting basic streaming functions working.
> > >> >> >>> >> I've already figured out that I had to move to "solr cloud" instead of
> > >> >> >>> >> "solr standalone" because I was getting errors about "cannot find zk
> > >> >> >>> >> instance" or whatever, which went away when using "solr start -c" instead.
> > >> >> >>> >>
> > >> >> >>> >> But now I'm trying to use the random function, since that was one of the
> > >> >> >>> >> functions used in your example.
> > >> >> >>> >>
> > >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> > >> >> >>> >>
> > >> >> >>> >> I posted that directly in the "stream" section of the solr admin UI. This
> > >> >> >>> >> is all on linux, with solr 7.1.0 and 7.2.1 (I tried several versions in
> > >> >> >>> >> case it was a bug in one).
> > >> >> >>> >>
> > >> >> >>> >> I get back an error message:
> > >> >> >>> >> *sort param could not be parsed as a query, and is not a field that exists
> > >> >> >>> >> in the index: random_-255009774*
> > >> >> >>> >>
> > >> >> >>> >> I'm not passing in any sort field anywhere.
> > >> >> >>> >> But the solr logs show these
> > >> >> >>> >> three log entries:
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.954 INFO (qtp257513673-21) [c:tx_header s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> > >> >> >>> >> [tx_header_shard1_replica_n1] webapp=/solr path=/select
> > >> >> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname*&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
> > >> >> >>> >> status=400 QTime=19
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> > >> >> >>> >> Request to collection [tx_header] failed due to (400)
> > >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> > >> >> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort param
> > >> >> >>> >> could not be parsed as a query, and is not a field that exists in the
> > >> >> >>> >> index: random_-255009774, retry?
> > >> >> >>> >> 0
> > >> >> >>> >>
> > >> >> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
> > >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
> > >> >> >>> >> java.io.IOException:
> > >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> > >> >> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort param
> > >> >> >>> >> could not be parsed as a query, and is not a field that exists in the
> > >> >> >>> >> index: random_-255009774
> > >> >> >>> >>
> > >> >> >>> >> So basically it looks like solr is injecting the "sort=random_" stuff into
> > >> >> >>> >> my query, and of course that is failing on the search since that
> > >> >> >>> >> field/column doesn't exist in my schema. Every time I run the random
> > >> >> >>> >> function, I get a slightly different field name that it injects, but they
> > >> >> >>> >> all start with "random_".
> > >> >> >>> >>
> > >> >> >>> >> I have tried adding my own sort field instead, hoping solr wouldn't inject
> > >> >> >>> >> one for me, but it still injected a random sort field name:
> > >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname asc")
> > >> >> >>> >>
> > >> >> >>> >> Assuming I can fix that whole problem, my second question is: can I add
> > >> >> >>> >> multiple "fq=" parameters to the random function?
> > >> >> >>> >> I build a pretty
> > >> >> >>> >> complicated query using many fq= fields, and then want to run some stats
> > >> >> >>> >> on that hitlist; so somehow I have to pass in the query that made up the
> > >> >> >>> >> exact hitlist to these various functions, but when I used multiple "fq="
> > >> >> >>> >> values it only seemed to use the last one I specified and ignored all the
> > >> >> >>> >> previous fq's.
> > >> >> >>> >>
> > >> >> >>> >> Thanks in advance for any comments/suggestions...!
> > >> >> >>> >>
> > >> >> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <joels...@gmail.com>
> > >> >> >>> >> wrote:
> > >> >> >>> >>
> > >> >> >>> >> > This is going to be a complex answer because Solr actually now has multiple
> > >> >> >>> >> > ways of doing regression analysis as part of the Streaming Expression
> > >> >> >>> >> > statistical programming library. The basic documentation is here:
> > >> >> >>> >> >
> > >> >> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> > >> >> >>> >> >
> > >> >> >>> >> > Here is a sample expression that performs a simple linear regression in
> > >> >> >>> >> > Solr 7.2:
> > >> >> >>> >> >
> > >> >> >>> >> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
> > >> >> >>> >> >     b=col(a, fieldA),
> > >> >> >>> >> >     c=col(a, fieldB),
> > >> >> >>> >> >     d=regress(b, c))
> > >> >> >>> >> >
> > >> >> >>> >> > The expression above takes a random sample of 15000 results from
> > >> >> >>> >> > collection1. The result set will include fieldA and fieldB in each record.
> > >> >> >>> >> > The result set is stored in variable "a".
> > >> >> >>> >> >
> > >> >> >>> >> > Then the "col" function creates arrays of numbers from the results stored
> > >> >> >>> >> > in variable "a". The values in fieldA are stored in variable "b". The
> > >> >> >>> >> > values in fieldB are stored in variable "c".
> > >> >> >>> >> >
> > >> >> >>> >> > Then the regress function performs a simple linear regression on the arrays
> > >> >> >>> >> > stored in variables "b" and "c".
> > >> >> >>> >> >
> > >> >> >>> >> > The output of the regress function is a map containing the regression
> > >> >> >>> >> > result. This result includes RSquared and other attributes of the
> > >> >> >>> >> > regression model such as R (correlation), slope, y intercept, etc.
> > >> >> >>> >> >
> > >> >> >>> >> > Joel Bernstein
> > >> >> >>> >> > http://joelsolr.blogspot.com/
> > >> >> >>> >> >
> > >> >> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <localde...@gmail.com>
> > >> >> >>> >> > wrote:
> > >> >> >>> >> >
> > >> >> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
> > >> >> >>> >> > > result of all this is supposed to be obtaining R^2. Is there no way of
> > >> >> >>> >> > > obtaining this value, then (short of iterating over all the results in
> > >> >> >>> >> > > the hitlist and calculating it myself)?
> > >> >> >>> >> > >
> > >> >> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <joels...@gmail.com>
> > >> >> >>> >> > > wrote:
> > >> >> >>> >> > >
> > >> >> >>> >> > > > Typically SSE is the sum of the squared errors of the prediction in a
> > >> >> >>> >> > > > regression analysis. The stats component doesn't perform regression,
> > >> >> >>> >> > > > although it might be a nice feature.
> > >> >> >>> >> > > >
> > >> >> >>> >> > > > Joel Bernstein
> > >> >> >>> >> > > > http://joelsolr.blogspot.com/
> > >> >> >>> >> > > >
> > >> >> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <localde...@gmail.com>
> > >> >> >>> >> > > > wrote:
> > >> >> >>> >> > > >
> > >> >> >>> >> > > > > I'm using solr, and enabling stats as per this page:
> > >> >> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > I want to get more stat values though. Specifically I'm looking for
> > >> >> >>> >> > > > > r-squared (coefficient of determination).
> > >> >> >>> >> > > > > This value is not
> > >> >> >>> >> > > > > present in solr; however, some of the pieces used to calculate r^2 are in
> > >> >> >>> >> > > > > the stats element, for example:
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > <double name="min">0.0</double>
> > >> >> >>> >> > > > > <double name="max">10.0</double>
> > >> >> >>> >> > > > > <long name="count">15</long>
> > >> >> >>> >> > > > > <long name="missing">17</long>
> > >> >> >>> >> > > > > <double name="sum">85.0</double>
> > >> >> >>> >> > > > > <double name="sumOfSquares">603.0</double>
> > >> >> >>> >> > > > > <double name="mean">5.666666666666667</double>
> > >> >> >>> >> > > > > <double name="stddev">2.943920288775949</double>
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > So I have the sumOfSquares available (SST), and using this calculation, I
> > >> >> >>> >> > > > > can get R^2:
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > R^2 = 1 - SSE/SST
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > All I need then is SSE. Is there any way I can get SSE from those other
> > >> >> >>> >> > > > > stats in solr?
> > >> >> >>> >> > > > >
> > >> >> >>> >> > > > > Thanks in advance!
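
A side note on the stats output quoted in the original question, as a small Python sketch rather than anything Solr provides. It assumes sumOfSquares is the raw sum of x^2, not the sum of squared deviations from the mean; the quoted stddev is consistent with that reading. Under that assumption, SST can be recovered from count, sum and sumOfSquares. The function name `total_sum_of_squares` is hypothetical.

```python
import math


def total_sum_of_squares(count, total, sum_of_squares):
    """SST about the mean: sum((x - mean)^2) = sum(x^2) - (sum(x))^2 / n."""
    return sum_of_squares - total * total / count


# Values copied from the stats element quoted in the thread.
count, total, sum_of_squares = 15, 85.0, 603.0

sst = total_sum_of_squares(count, total, sum_of_squares)

# Cross-check: the sample standard deviation derived from SST should match
# the stddev that the stats component reported (2.943920288775949).
sample_stddev = math.sqrt(sst / (count - 1))
```

SSE is a different matter. It measures how far each y value falls from the fitted line, so it depends on the paired x and y values and cannot be derived from the summary stats of a single field.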