Thanks for the link to the documentation, that will probably come in useful.
I didn't see a way, though, to get my avg function working. So instead of doing a linear regression on two fields, X and Y, in a hitlist, we need to do a linear regression on field X and the average value of X. Is that possible? That is, can a function be passed to the regress function instead of a field?

On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein <joels...@gmail.com> wrote:

> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein <joels...@gmail.com> wrote:
>
> > If you want to get everything in query you can do this:
> >
> > let(echo="d,e",
> >     a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >              fq="isParent:true", rows="1500000",
> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >              sort="id asc"),
> >     b=col(a, oil_first_90_days_production),
> >     c=col(a, oil_last_30_days_production),
> >     d=regress(b, c),
> >     e=someExpression())
> >
> > The echo parameter tells the let expression which variables to output.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> What does the fq clause look like?
> >>
> >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith <localde...@gmail.com> wrote:
> >> > Hi Joel, I did some more work on this statistics stuff today.
> >> > Yes, we do have nulls in our data; the document contains many fields, we
> >> > don't always have values for each field, but we can't set the nulls to 0
> >> > either (or any other value, really) as that will mess up other
> >> > calculations (such as when calculating average etc); we would normally
> >> > just ignore fields with null values when calculating stats manually
> >> > ourselves.
> >> >
> >> > Adding a check in the "q" parameter to ensure that the fields used in the
> >> > calculations are > 0 does work now. Thanks for the tip (and sorry, should
> >> > have caught that myself). But I am unable to use "fq" for these checks,
> >> > they have to be added to the q instead. Adding fq's doesn't have any
> >> > effect.
> >> >
> >> > Anyway, I'm trying to change this up a little. This is what I'm currently
> >> > using (switched from "random" to "search" since I actually need the full
> >> > hitlist not just a random subset):
> >> >
> >> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> >              fq="isParent:true", rows="1500000",
> >> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >> >              sort="id asc"),
> >> >     b=col(a, oil_first_90_days_production),
> >> >     c=col(a, oil_last_30_days_production),
> >> >     d=regress(b, c))
> >> >
> >> > So I have 2 fields there defined, that works great (in terms of a test and
> >> > running the query); but I need to replace the second field,
> >> > "oil_last_30_days_production" with the avg value in
> >> > oil_first_90_days_production.
> >> >
> >> > I can get the avg with this expression:
> >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> >       fq="isParent:true", rows="1500000", avg(oil_first_90_days_production))
> >> >
> >> > But I don't know how to push that avg value into the first streaming
> >> > expression; guessing I have to set "c=...."
> >> > but that is where I'm getting lost, since avg only returns 1 value and the
> >> > first parameter, "b", returns a list of sorts. Somehow I have to get the
> >> > avg value stuffed inside a "col", where it is the same value for every row
> >> > in the hitlist...?
> >> >
> >> > Thanks for your help!
> >> >
> >> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <joels...@gmail.com> wrote:
> >> >
> >> >> I suspect you've got nulls in your data. I just tested with null values
> >> >> and got the same error. For testing purposes try loading the data with
> >> >> default values of zero.
> >> >>
> >> >> Joel Bernstein
> >> >> http://joelsolr.blogspot.com/
> >> >>
> >> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <joels...@gmail.com> wrote:
> >> >>
> >> >> > Let's break the expression down and build it up slowly. Let's start with:
> >> >> >
> >> >> > let(echo="true",
> >> >> >     a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> >> >> >              fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >> >     b=col(a, oil_first_90_days_production))
> >> >> >
> >> >> > This should return variables a and b. Let's see what the data looks like.
> >> >> > I changed the rows from 15000 to 15. If it all looks good we can expand
> >> >> > the rows and continue adding functions.
> >> >> >
> >> >> > Joel Bernstein
> >> >> > http://joelsolr.blogspot.com/
> >> >> >
> >> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <localde...@gmail.com> wrote:
> >> >> >
> >> >> >> Thanks Joel for your help on this.
> >> >> >>
> >> >> >> What I've done so far:
> >> >> >> - unzip downloaded solr-7.2
> >> >> >> - modify the _default "managed-schema" to add the random field type and
> >> >> >>   the dynamic random field
> >> >> >> - start solr7 using "solr start -c"
> >> >> >> - indexed my data using pint/pdouble/boolean field types etc
> >> >> >>
> >> >> >> I can now run the random function all by itself, it returns random
> >> >> >> results as expected. So far so good!
> >> >> >>
> >> >> >> However... now trying to get the regression stuff working:
> >> >> >>
> >> >> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> >> >>              rows="15000",
> >> >> >>              fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >> >>     b=col(a, oil_first_90_days_production),
> >> >> >>     c=col(a, oil_last_30_days_production),
> >> >> >>     d=regress(b, c))
> >> >> >>
> >> >> >> Posted directly into solr admin UI. Run the streaming expression and I
> >> >> >> get this error message:
> >> >> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> >> >> >> expected but found type java.lang.String for value
> >> >> >> oil_first_90_days_production"
> >> >> >>
> >> >> >> It thinks my numeric field is defined as a string? But when I view the
> >> >> >> schema, those 2 fields are defined as ints:
> >> >> >>
> >> >> >>
> >> >> >> When I run a normal query and choose xml as output format, then it also
> >> >> >> puts "int" elements into the hitlist, so the schema appears to be
> >> >> >> correct; it's just when using this regress function that something goes
> >> >> >> wrong and solr thinks the field is string.
> >> >> >>
> >> >> >> Any suggestions?
> >> >> >> Thanks!
> >> >> >>
> >> >> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <joels...@gmail.com> wrote:
> >> >> >>
> >> >> >>> The field type will also need to be in the schema:
> >> >> >>>
> >> >> >>> <!-- The "RandomSortField" is not used to store or search any
> >> >> >>>      data. You can declare fields of this type it in your schema
> >> >> >>>      to generate pseudo-random orderings of your docs for sorting
> >> >> >>>      or function purposes. The ordering is generated based on the field
> >> >> >>>      name and the version of the index. As long as the index version
> >> >> >>>      remains unchanged, and the same field name is reused,
> >> >> >>>      the ordering of the docs will be consistent.
> >> >> >>>      If you want different psuedo-random orderings of documents,
> >> >> >>>      for the same version of the index, use a dynamicField and
> >> >> >>>      change the field name in the request.
> >> >> >>> -->
> >> >> >>>
> >> >> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
> >> >> >>>
> >> >> >>> Joel Bernstein
> >> >> >>> http://joelsolr.blogspot.com/
> >> >> >>>
> >> >> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <joels...@gmail.com> wrote:
> >> >> >>>
> >> >> >>> > You'll need to have this field in your schema:
> >> >> >>> >
> >> >> >>> > <dynamicField name="random_*" type="random" />
> >> >> >>> >
> >> >> >>> > I'll check to see if the default schema used with solr start -c has this
> >> >> >>> > field, if not I'll add it. Thanks for pointing this out.
> >> >> >>> >
> >> >> >>> > I checked and right now the random expression is only accepting one fq,
> >> >> >>> > but I consider this a bug. It should accept multiple. I'll create a
> >> >> >>> > ticket for getting this fixed.
> >> >> >>> >
> >> >> >>> > Joel Bernstein
> >> >> >>> > http://joelsolr.blogspot.com/
> >> >> >>> >
> >> >> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <localde...@gmail.com> wrote:
> >> >> >>> >
> >> >> >>> >> Joel, thanks for the pointers to the streaming feature. I had no idea
> >> >> >>> >> solr had that (and also just discovered the very interesting sql
> >> >> >>> >> feature! I will be sure to investigate that in more detail in the
> >> >> >>> >> future).
> >> >> >>> >>
> >> >> >>> >> However I'm having some trouble getting basic streaming functions
> >> >> >>> >> working. I've already figured out that I had to move to "solr cloud"
> >> >> >>> >> instead of "solr standalone" because I was getting errors about "cannot
> >> >> >>> >> find zk instance" or whatever which went away when using "solr start -c"
> >> >> >>> >> instead.
> >> >> >>> >>
> >> >> >>> >> But now I'm trying to use the random function since that was one of the
> >> >> >>> >> functions used in your example.
> >> >> >>> >>
> >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >> >> >>> >>
> >> >> >>> >> I posted that directly in the "stream" section of the solr admin UI.
> >> >> >>> >> This is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions
> >> >> >>> >> in case it was a bug in one)
> >> >> >>> >>
> >> >> >>> >> I get back an error message:
> >> >> >>> >> *sort param could not be parsed as a query, and is not a field that
> >> >> >>> >> exists in the index: random_-255009774*
> >> >> >>> >>
> >> >> >>> >> I'm not passing in any sort field anywhere.
> >> >> >>> >> But the solr logs show these three log entries:
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.954 INFO (qtp257513673-21) [c:tx_header s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> >> >>> >> [tx_header_shard1_replica_n1] webapp=/solr path=/select
> >> >> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname*&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
> >> >> >>> >> status=400 QTime=19
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> >> >> >>> >> Request to collection [tx_header] failed due to (400)
> >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> >> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort
> >> >> >>> >> param could not be parsed as a query, and is not a field that exists in
> >> >> >>> >> the index: random_-255009774, retry? 0
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
> >> >> >>> >> java.io.IOException:
> >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> >> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort
> >> >> >>> >> param could not be parsed as a query, and is not a field that exists in
> >> >> >>> >> the index: random_-255009774
> >> >> >>> >>
> >> >> >>> >> So basically it looks like solr is injecting the "sort=random_" stuff
> >> >> >>> >> into my query and of course that is failing on the search since that
> >> >> >>> >> field/column doesn't exist in my schema. Every time I run the random
> >> >> >>> >> function, I get a slightly different field name that it injects, but
> >> >> >>> >> they all start with "random_" etc.
> >> >> >>> >>
> >> >> >>> >> I have tried adding my own sort field instead, hoping solr wouldn't
> >> >> >>> >> inject one for me, but it still injected a random sort fieldname:
> >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname asc")
> >> >> >>> >>
> >> >> >>> >> Assuming I can fix that whole problem, my second question is: can I add
> >> >> >>> >> multiple "fq=" parameters to the random function?
> >> >> >>> >> I build a pretty complicated query using many fq= fields, and then want
> >> >> >>> >> to run some stats on that hitlist; so somehow I have to pass in the
> >> >> >>> >> query that made up the exact hitlist to these various functions, but
> >> >> >>> >> when I used multiple "fq=" values it only seemed to use the last one I
> >> >> >>> >> specified and just ignored all the previous fq's?
> >> >> >>> >>
> >> >> >>> >> Thanks in advance for any comments/suggestions...!
> >> >> >>> >>
> >> >> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <joels...@gmail.com> wrote:
> >> >> >>> >>
> >> >> >>> >> > This is going to be a complex answer because Solr actually now has
> >> >> >>> >> > multiple ways of doing regression analysis as part of the Streaming
> >> >> >>> >> > Expression statistical programming library. The basic documentation is
> >> >> >>> >> > here:
> >> >> >>> >> >
> >> >> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> >> >> >>> >> >
> >> >> >>> >> > Here is a sample expression that performs a simple linear regression in
> >> >> >>> >> > Solr 7.2:
> >> >> >>> >> >
> >> >> >>> >> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
> >> >> >>> >> >     b=col(a, fieldA),
> >> >> >>> >> >     c=col(a, fieldB),
> >> >> >>> >> >     d=regress(b, c))
> >> >> >>> >> >
> >> >> >>> >> > The expression above takes a random sample of 15000 results from
> >> >> >>> >> > collection1. The result set will include fieldA and fieldB in each
> >> >> >>> >> > record. The result set is stored in variable "a".
> >> >> >>> >> >
> >> >> >>> >> > Then the "col" function creates arrays of numbers from the results
> >> >> >>> >> > stored in variable a. The values in fieldA are stored in the variable
> >> >> >>> >> > "b". The values in fieldB are stored in variable "c".
> >> >> >>> >> >
> >> >> >>> >> > Then the regress function performs a simple linear regression on arrays
> >> >> >>> >> > stored in variables "b" and "c".
> >> >> >>> >> >
> >> >> >>> >> > The output of the regress function is a map containing the regression
> >> >> >>> >> > result. This result includes RSquared and other attributes of the
> >> >> >>> >> > regression model such as R (correlation), slope, y intercept etc...
> >> >> >>> >> >
> >> >> >>> >> > Joel Bernstein
> >> >> >>> >> > http://joelsolr.blogspot.com/
> >> >> >>> >> >
> >> >> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <localde...@gmail.com> wrote:
> >> >> >>> >> >
> >> >> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but the
> >> >> >>> >> > > end result of all this is supposed to be obtaining R^2. Is there no
> >> >> >>> >> > > way of obtaining this value, then (short of iterating over all the
> >> >> >>> >> > > results in the hitlist and calculating it myself)?
> >> >> >>> >> > >
> >> >> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <joels...@gmail.com> wrote:
> >> >> >>> >> > >
> >> >> >>> >> > > > Typically SSE is the sum of the squared errors of the prediction in
> >> >> >>> >> > > > a regression analysis.
> >> >> >>> >> > > > The stats component doesn't perform regression, although it might
> >> >> >>> >> > > > be a nice feature.
> >> >> >>> >> > > >
> >> >> >>> >> > > > Joel Bernstein
> >> >> >>> >> > > > http://joelsolr.blogspot.com/
> >> >> >>> >> > > >
> >> >> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <localde...@gmail.com> wrote:
> >> >> >>> >> > > >
> >> >> >>> >> > > > > I'm using solr, and enabling stats as per this page:
> >> >> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > I want to get more stat values though. Specifically I'm looking
> >> >> >>> >> > > > > for r-squared (coefficient of determination). This value is not
> >> >> >>> >> > > > > present in solr, however some of the pieces used to calculate r^2
> >> >> >>> >> > > > > are in the stats element, for example:
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > <double name="min">0.0</double>
> >> >> >>> >> > > > > <double name="max">10.0</double>
> >> >> >>> >> > > > > <long name="count">15</long>
> >> >> >>> >> > > > > <long name="missing">17</long>
> >> >> >>> >> > > > > <double name="sum">85.0</double>
> >> >> >>> >> > > > > <double name="sumOfSquares">603.0</double>
> >> >> >>> >> > > > > <double name="mean">5.666666666666667</double>
> >> >> >>> >> > > > > <double name="stddev">2.943920288775949</double>
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > So I have the sumOfSquares available (SST), and using this
> >> >> >>> >> > > > > calculation, I can get R^2:
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > R^2 = 1 - SSE/SST
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > All I need then is SSE.
> >> >> >>> >> > > > > Is there any way I can get SSE from those other stats in solr?
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > Thanks in advance!
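
A note on the R^2 = 1 - SSE/SST calculation quoted at the bottom of the thread. The following is a minimal Python sketch, not Solr code; the helper names sst_from_stats and r_squared are hypothetical and introduced only for illustration. It assumes that the stats component's sumOfSquares is the raw sum of the squared values rather than the total sum of squares about the mean. That reading is consistent with the sample output in the thread, since the stddev shown there can be reproduced from count, sum and sumOfSquares under this assumption. If so, SST can be derived from the stats output, while SSE still needs the paired values, which is what a regression over two arrays works from.

```python
import math


def sst_from_stats(count, total, sum_of_squares):
    """Total sum of squares about the mean, derived from summary statistics.

    Assumes sum_of_squares is the raw sum of x^2 (an assumption about what
    the stats output reports, see the note above).
    """
    return sum_of_squares - (total * total) / count


def r_squared(xs, ys):
    """Fit y = intercept + slope * x by least squares, return 1 - SSE/SST."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: squared errors of the fitted line; SST: squared errors of the mean.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


# Values taken from the stats output quoted in the thread.
sst = sst_from_stats(count=15, total=85.0, sum_of_squares=603.0)
stddev = math.sqrt(sst / (15 - 1))  # sample standard deviation implied by SST
```

The stddev computed this way can be compared against the stddev element in the stats output as a sanity check on the assumption. The r_squared helper also shows why a second array made of a single repeated average would not work as a regression input: a constant array has zero variance, so the slope (or the SST denominator, depending on which side it is on) would involve a division by zero.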