Re: statistics in hitlist
With regression you're looking at how a change in one variable affects the change in another variable, so you need values that are changing. What you described is an average of field X, which is not changing, regressed against the value of X.

I think one approach to this is to regress the moving average of X against the actual value of X. We can do this with the math library, but before exploring the code for this, spend some time thinking about whether that's the problem you're trying to solve. Take a look at how moving averages work:

https://en.wikipedia.org/wiki/Moving_average

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Mar 16, 2018 at 9:26 AM, John Smith wrote:

> Thanks for the link to the documentation, that will probably come in
> useful.
>
> I didn't see a way though, to get my avg function working? So instead of
> doing a linear regression on two fields, X and Y, in a hitlist, we need to
> do a linear regression on field X, and the average value of X. Is that
> possible? To pass in a function to the regress function instead of a field?
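To make the moving-average idea concrete outside of Solr, here is a minimal Python sketch. It is not the Solr math library: the function names `moving_average` and `linear_regression`, the window size, and the production values are all made up for illustration. It smooths a series with a simple moving average, aligns each average with the actual value at the end of its window, and fits an ordinary least squares line.

```python
from statistics import mean


def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 items."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def linear_regression(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized lists with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x has no variance, so the slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical production values, not taken from the thread.
production = [120, 135, 128, 150, 162, 158, 171, 180]
window = 3

smoothed = moving_average(production, window)
# Pair each moving average with the actual value that closes its window.
actual = production[window - 1:]

slope, intercept = linear_regression(smoothed, actual)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

Both inputs to the fit vary from row to row, which is what makes the regression meaningful; a single overall average would not.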
Re: statistics in hitlist
Thanks for the link to the documentation, that will probably come in useful.

I didn't see a way, though, to get my avg function working. Instead of doing a linear regression on two fields, X and Y, in a hitlist, we need to do a linear regression on field X and the average value of X. Is that possible? Can a function be passed to the regress function instead of a field?

On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein wrote:

> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
Re: statistics in hitlist
I've been working on the user guide for the math expressions. Here is the page on regression:

https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc

This page is part of the larger math expression documentation. The TOC is here:

https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc

The docs are still very rough but you can get an idea of the coverage.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein wrote:

> If you want to get everything in query you can do this:
>
> let(echo="d,e",
>     a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
>              fq="isParent:true", rows="150",
>              fl="id,oil_first_90_days_production,oil_last_30_days_production",
>              sort="id asc"),
>     b=col(a, oil_first_90_days_production),
>     c=col(a, oil_last_30_days_production),
>     d=regress(b, c),
>     e=someExpression())
>
> The echo parameter tells the let expression which variables to output.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
Re: statistics in hitlist
If you want to get everything in one query you can do this:

let(echo="d,e",
    a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
             fq="isParent:true", rows="150",
             fl="id,oil_first_90_days_production,oil_last_30_days_production",
             sort="id asc"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c),
    e=someExpression())

The echo parameter tells the let expression which variables to output.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson wrote:

> What does the fq clause look like?
Re: statistics in hitlist
What does the fq clause look like?

On Thu, Mar 15, 2018 at 11:51 AM, John Smith wrote:

> Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> have nulls in our data; the document contains many fields, we don't always
> have values for each field, but we can't set the nulls to 0 either (or any
> other value, really) as that will mess up other calculations (such as when
> calculating average etc); we would normally just ignore fields with null
> values when calculating stats manually ourselves.
>
> Adding a check in the "q" parameter to ensure that the fields used in the
> calculations are > 0 does work now. Thanks for the tip (and sorry, should
> have caught that myself). But I am unable to use "fq" for these checks,
> they have to be added to the q instead. Adding fq's doesn't have any effect.
>
> Anyway, I'm trying to change this up a little. This is what I'm currently
> using (switched from "random" to "search" since I actually need the full
> hitlist not just a random subset):
>
> let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
>              fq="isParent:true", rows="150",
>              fl="id,oil_first_90_days_production,oil_last_30_days_production",
>              sort="id asc"),
>     b=col(a, oil_first_90_days_production),
>     c=col(a, oil_last_30_days_production),
>     d=regress(b, c))
>
> So I have 2 fields there defined, that works great (in terms of a test and
> running the query); but I need to replace the second field,
> "oil_last_30_days_production" with the avg value in
> oil_first_90_days_production.
>
> I can get the avg with this expression:
>
> stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
>       fq="isParent:true", rows="150", avg(oil_first_90_days_production))
>
> But I don't know how to push that avg value into the first streaming
> expression; guessing I have to set "c=" but that is where I'm getting
> lost, since avg only returns 1 value and the first parameter, "b", returns
> a list of sorts. Somehow I have to get the avg value stuffed inside a
> "col", where it is the same value for every row in the hitlist...?
>
> Thanks for your help!
Re: statistics in hitlist
Hi Joel, I did some more work on this statistics stuff today. Yes, we do have nulls in our data; the document contains many fields and we don't always have values for each field. We can't set the nulls to 0 (or any other value, really), as that will mess up other calculations, such as averages; we would normally just ignore fields with null values when calculating stats manually ourselves.

Adding a check in the "q" parameter to ensure that the fields used in the calculations are > 0 does work now. Thanks for the tip (and sorry, should have caught that myself). But I am unable to use "fq" for these checks; they have to be added to the q instead. Adding fq's doesn't have any effect.

Anyway, I'm trying to change this up a little. This is what I'm currently using (switched from "random" to "search" since I actually need the full hitlist, not just a random subset):

let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
             fq="isParent:true", rows="150",
             fl="id,oil_first_90_days_production,oil_last_30_days_production",
             sort="id asc"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c))

So I have 2 fields defined there, and that works great (in terms of a test and running the query); but I need to replace the second field, "oil_last_30_days_production", with the avg value of oil_first_90_days_production.

I can get the avg with this expression:

stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
      fq="isParent:true", rows="150", avg(oil_first_90_days_production))

But I don't know how to push that avg value into the first streaming expression; guessing I have to set "c=" but that is where I'm getting lost, since avg only returns 1 value and the first parameter, "b", returns a list of sorts. Somehow I have to get the avg value stuffed inside a "col", where it is the same value for every row in the hitlist...?

Thanks for your help!

On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein wrote:

> I suspect you've got nulls in your data. I just tested with null values and
> got the same error. For testing purposes try loading the data with default
> values of zero.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
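To see what a column holding the same average in every row would do to a regression, here is a minimal Python sketch. It is an illustration only and does not use Solr; `least_squares` and the sample values are hypothetical. It fits a line with the constant column as the response and then as the predictor.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (slope, intercept) for ys = slope * xs + intercept.

    Returns None when xs has zero variance, because the slope is then undefined.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical values for field X (not data from the thread).
x = [100, 140, 160, 200]
# A column with avg(X) repeated once per row, as described in the question.
avg_col = [mean(x)] * len(x)

# Constant column as the response: the fitted line is flat at the average.
print(least_squares(x, avg_col))

# Constant column as the predictor: no variance, so no slope can be computed.
print(least_squares(avg_col, x))
```

In both directions the fit carries no information about how X changes, since one side of the regression never varies.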
Re: statistics in hitlist
I suspect you've got nulls in your data. I just tested with null values and got the same error. For testing purposes try loading the data with default values of zero.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein wrote:
> Let's break the expression down and build it up slowly. [...]
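On the null values mentioned in this reply: a regression needs two equal-length arrays of numbers, so a record that is missing either field has to be skipped (or given a default) before the arrays are built. The following is only a minimal Python sketch of that idea with made-up records; the helper name `paired_columns` is hypothetical and this is not how Solr handles it internally.

```python
def paired_columns(records, x_field, y_field):
    """Build two equal-length numeric lists, skipping records
    where either field is absent or null."""
    xs, ys = [], []
    for rec in records:
        x = rec.get(x_field)
        y = rec.get(y_field)
        if x is None or y is None:
            continue  # ignore incomplete records instead of defaulting to 0
        xs.append(float(x))
        ys.append(float(y))
    return xs, ys


# Made-up sample records; two of them are incomplete.
records = [
    {"oil_first_90_days_production": 900, "oil_last_30_days_production": 120},
    {"oil_first_90_days_production": None, "oil_last_30_days_production": 80},
    {"oil_last_30_days_production": 60},
    {"oil_first_90_days_production": 450, "oil_last_30_days_production": 70},
]

b, c = paired_columns(
    records, "oil_first_90_days_production", "oil_last_30_days_production"
)
print(b, c)
```

Skipping incomplete records keeps averages and other statistics from being pulled toward an artificial zero, which is the usual reason for not defaulting nulls to 0 outside of a quick test.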
Re: statistics in hitlist
Let's break the expression down and build it up slowly. Let's start with:

let(echo="true",
    a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
             fl="oil_first_90_days_production,oil_last_30_days_production"),
    b=col(a, oil_first_90_days_production))

This should return variables a and b. Let's see what the data looks like. I changed the rows from 15000 to 15. If it all looks good we can expand the rows and continue adding functions.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 5, 2018 at 4:11 PM, John Smith wrote:
> Thanks Joel for your help on this. [...]
Re: statistics in hitlist
Thanks Joel for your help on this.

What I've done so far:
- unzip downloaded solr-7.2
- modify the _default "managed-schema" to add the random field type and the dynamic random field
- start solr7 using "solr start -c"
- indexed my data using pint/pdouble/boolean field types etc

I can now run the random function all by itself, it returns random results as expected. So far so good!

However... now trying to get the regression stuff working:

let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
             fl="oil_first_90_days_production,oil_last_30_days_production"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c))

Posted directly into solr admin UI. Run the streaming expression and I get this error message:
"EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value expected but found type java.lang.String for value oil_first_90_days_production"

It thinks my numeric field is defined as a string? But when I view the schema, those 2 fields are defined as ints:

When I run a normal query and choose xml as output format, then it also puts "int" elements into the hitlist, so the schema appears to be correct; it's just when using this regress function that something goes wrong and solr thinks the field is a string.

Any suggestions?
Thanks!

On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein wrote:
> The field type will also need to be in the schema: [...]
Re: statistics in hitlist
The field type will also need to be in the schema:

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein wrote:
> You'll need to have this field in your schema: [...]
Re: statistics in hitlist
You'll need to have this field in your schema:

I'll check to see if the default schema used with solr start -c has this field; if not, I'll add it. Thanks for pointing this out.

I checked and right now the random expression is only accepting one fq, but I consider this a bug. It should accept multiple. I'll create a ticket for getting this fixed.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 1, 2018 at 4:55 PM, John Smith wrote:
> Joel, thanks for the pointers to the streaming feature. [...]
Re: statistics in hitlist
Joel, thanks for the pointers to the streaming feature. I had no idea solr had that (and also just discovered the very interesting sql feature! I will be sure to investigate that in more detail in the future).

However I'm having some trouble getting basic streaming functions working. I've already figured out that I had to move to "solr cloud" instead of "solr standalone" because I was getting errors about "cannot find zk instance" or whatever which went away when using "solr start -c" instead.

But now I'm trying to use the random function since that was one of the functions used in your example.

random(tx_header, q="*:*", rows="100", fl="countyname")

I posted that directly in the "stream" section of the solr admin UI. This is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case it was a bug in one).

I get back an error message:
*sort param could not be parsed as a query, and is not a field that exists in the index: random_-255009774*

I'm not passing in any sort field anywhere. But the solr logs show these three log entries:

2018-03-01 21:41:18.954 INFO (qtp257513673-21) [c:tx_header s:shard1 r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request [tx_header_shard1_replica_n1] webapp=/solr path=/select params={q=*:*&_stateVer_=tx_header:6&fl=countyname&sort=random_-255009774+asc&rows=100&wt=javabin&version=2} status=400 QTime=19

2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1 r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient Request to collection [tx_header] failed due to (400) org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.13.31:8983/solr/tx_header: sort param could not be parsed as a query, and is not a field that exists in the index: random_-255009774, retry? 0

2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1 r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream java.io.IOException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.13.31:8983/solr/tx_header: sort param could not be parsed as a query, and is not a field that exists in the index: random_-255009774

So basically it looks like solr is injecting the "sort=random_" stuff into my query and of course that is failing on the search since that field/column doesn't exist in my schema. Every time I run the random function, I get a slightly different field name that it injects, but they all start with "random_" etc.

I have tried adding my own sort field instead, hoping solr wouldn't inject one for me, but it still injected a random sort fieldname:

random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname asc")

Assuming I can fix that whole problem, my second question is: can I add multiple "fq=" parameters to the random function? I build a pretty complicated query using many fq= fields, and then want to run some stats on that hitlist; so somehow I have to pass in the query that made up the exact hitlist to these various functions, but when I used multiple "fq=" values it only seemed to use the last one I specified and just ignored all the previous fq's?

Thanks in advance for any comments/suggestions...!

On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein wrote:
> This is going to be a complex answer because Solr actually now has multiple ways of doing regression analysis [...]
Re: statistics in hitlist
This is going to be a complex answer because Solr actually now has multiple ways of doing regression analysis as part of the Streaming Expression statistical programming library. The basic documentation is here:

https://lucene.apache.org/solr/guide/7_2/statistical-programming.html

Here is a sample expression that performs a simple linear regression in Solr 7.2:

let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
    b=col(a, fieldA),
    c=col(a, fieldB),
    d=regress(b, c))

The expression above takes a random sample of 15000 results from collection1. The result set will include fieldA and fieldB in each record. The result set is stored in variable "a".

Then the "col" function creates arrays of numbers from the results stored in variable a. The values in fieldA are stored in the variable "b". The values in fieldB are stored in variable "c".

Then the regress function performs a simple linear regression on arrays stored in variables "b" and "c".

The output of the regress function is a map containing the regression result. This result includes RSquared and other attributes of the regression model such as R (correlation), slope, y intercept etc...

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Feb 23, 2018 at 3:10 PM, John Smith wrote:
> Hi Joel, thanks for the answer. [...]
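To make the quantities in the regression result more concrete, here is a minimal pure-Python sketch of simple linear regression on two arrays, producing a slope, y intercept, R and RSquared. The data is made up, and the function name `simple_regress` and the dictionary keys are illustrative assumptions, not the names Solr uses in its output; it also assumes the x values are not all identical.

```python
import math


def simple_regress(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)  # spread of x about its mean
    syy = sum((yi - mean_y) ** 2 for yi in y)  # SST: spread of y about its mean
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # SSE: squared errors of the fitted line's predictions
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return {
        "slope": slope,
        "intercept": intercept,
        "r": sxy / math.sqrt(sxx * syy),
        "r_squared": 1.0 - sse / syy,
    }


# Made-up sample data standing in for arrays b and c.
b = [1.0, 2.0, 3.0, 4.0, 5.0]
c = [2.0, 4.1, 5.9, 8.2, 9.8]
result = simple_regress(b, c)
print(result)
```

For a simple (one-predictor) regression, RSquared equals the square of the correlation R, which is why both can be reported from the same fit.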
Re: statistics in hitlist
Hi Joel, thanks for the answer. I'm not really a stats guy, but the end result of all this is supposed to be obtaining R^2. Is there no way of obtaining this value, then (short of iterating over all the results in the hitlist and calculating it myself)?

On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein wrote:
> Typically SSE is the sum of the squared errors of the prediction in a regression analysis. [...]
Re: statistics in hitlist
Typically SSE is the sum of the squared errors of the prediction in a regression analysis. The stats component doesn't perform regression, although it might be a nice feature.

Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Feb 23, 2018 at 12:17 PM, John Smith wrote:
> I'm using solr, and enabling stats as per this page: [...]
statistics in hitlist
I'm using solr, and enabling stats as per this page:
https://lucene.apache.org/solr/guide/6_6/the-stats-component.html

I want to get more stat values though. Specifically I'm looking for r-squared (coefficient of determination). This value is not present in solr, however some of the pieces used to calculate r^2 are in the stats element, for example:

<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">15</long>
<long name="missing">17</long>
<double name="sum">85.0</double>
<double name="sumOfSquares">603.0</double>
<double name="mean">5.667</double>
<double name="stddev">2.943920288775949</double>

So I have the sumOfSquares available (SST), and using this calculation, I can get R^2:

R^2 = 1 - SSE/SST

All I need then is SSE. Is there any way I can get SSE from those other stats in solr?

Thanks in advance!
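One caution on the arithmetic in this question, offered as a sketch rather than a definitive reading of the stats component: the reported sumOfSquares appears to be the plain sum of the squared values, not the total sum of squares about the mean that the R^2 formula calls SST. SST can be recovered as sumOfSquares - sum^2 / count, and the figures quoted above are consistent with that, because sqrt(SST / (count - 1)) reproduces the reported stddev. The variable names below are illustrative.

```python
import math

# Values quoted in the message above.
count = 15
total = 85.0               # "sum"
sum_of_squares = 603.0     # "sumOfSquares": sum of x^2
reported_stddev = 2.943920288775949

# Total sum of squares about the mean: sum((x - mean)^2)
sst = sum_of_squares - total ** 2 / count

# Sample standard deviation derived from SST, to cross-check the stats.
sample_stddev = math.sqrt(sst / (count - 1))

print(sst, sample_stddev, reported_stddev)
```

SSE, by contrast, depends on a second variable and a fitted line, so it cannot be derived from single-field statistics like these; it has to come from an actual regression.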