Thanks for the link to the documentation; that will probably come in useful.

I didn't see a way, though, to get my avg function working. Instead of
doing a linear regression on two fields, X and Y, in a hitlist, we need to
do a linear regression on field X and the average value of X. Is that
possible? Can a function be passed to the regress function instead of a field?
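
For reference, the arithmetic behind this question can be sketched outside Solr. The Python below is only an illustration, not Solr code: simple_regress is a hypothetical helper implementing ordinary least squares, and the sample values are made up. It suggests that an array holding the average of X in every position has zero variance, so a regression of X against it has no slope to fit; what the average does provide is the total sum of squares (SST) that appears in R^2 = 1 - SSE/SST.

```python
from statistics import fmean


def simple_regress(x, y):
    """Ordinary least squares fit of y = slope * x + intercept.

    Hypothetical helper for illustration only. Returns None when x has
    zero variance, because the slope is then undefined.
    """
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return None
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # SSE: squared errors of the fitted line; SST: squared deviations from the mean.
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    r_squared = 1 - sse / sst if sst else None
    return {"slope": slope, "intercept": intercept,
            "sse": sse, "sst": sst, "r_squared": r_squared}


# Made-up sample values standing in for two numeric fields.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# The "average in every row" column described in the question.
avg_col = [fmean(x)] * len(x)

fit = simple_regress(x, y)               # ordinary two-field regression
degenerate = simple_regress(avg_col, x)  # constant predictor: no fit possible
sst_x = sum((xi - fmean(x)) ** 2 for xi in x)  # SST of x about its mean
```

If the goal is R^2 for a model fitted elsewhere, the mean is only needed for the SST term, which can be computed directly as in the last line.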





On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein <joels...@gmail.com> wrote:

> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein <joels...@gmail.com>
> wrote:
>
> > If you want to get everything in one query you can do this:
> >
> > let(echo="d,e",
> >      a=search(tx_prod_production,
> >               q="oil_first_90_days_production:[1 TO *]",
> >               fq="isParent:true",
> >               rows="1500000",
> >               fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >               sort="id asc"),
> >      b=col(a, oil_first_90_days_production),
> >      c=col(a, oil_last_30_days_production),
> >      d=regress(b, c),
> >      e=someExpression())
> >
> > The echo parameter tells the let expression which variables to output.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> What does the fq clause look like?
> >>
> >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith <localde...@gmail.com>
> >> wrote:
> >> > Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> >> > have nulls in our data. The documents contain many fields, and we don't
> >> > always have values for each field, but we can't set the nulls to 0 (or
> >> > any other value, really), as that would skew other calculations (such as
> >> > averages). When calculating stats manually, we would normally just
> >> > ignore fields with null values.
> >> >
> >> > Adding a check in the "q" parameter to ensure that the fields used in the
> >> > calculations are > 0 does work now. Thanks for the tip (and sorry, I
> >> > should have caught that myself). But I am unable to use "fq" for these
> >> > checks; they have to be added to the q instead. Adding fq's doesn't have
> >> > any effect.
> >> >
> >> >
> >> > Anyway, I'm trying to change this up a little. This is what I'm currently
> >> > using (switched from "random" to "search" since I actually need the full
> >> > hitlist, not just a random subset):
> >> >
> >> > let(a=search(tx_prod_production,
> >> >              q="oil_first_90_days_production:[1 TO *]",
> >> >              fq="isParent:true",
> >> >              rows="1500000",
> >> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >> >              sort="id asc"),
> >> >      b=col(a, oil_first_90_days_production),
> >> >      c=col(a, oil_last_30_days_production),
> >> >      d=regress(b, c))
> >> >
> >> > So I have 2 fields defined there, and that works great (in terms of a
> >> > test and running the query), but I need to replace the second field,
> >> > "oil_last_30_days_production", with the avg value of
> >> > oil_first_90_days_production.
> >> >
> >> > I can get the avg with this expression:
> >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> >       fq="isParent:true", rows="1500000",
> >> >       avg(oil_first_90_days_production))
> >> >
> >> > But I don't know how to push that avg value into the first streaming
> >> > expression. I'm guessing I have to set "c=....", but that is where I'm
> >> > getting lost, since avg returns only 1 value and the first parameter, "b",
> >> > is a list of sorts. Somehow I have to get the avg value stuffed inside a
> >> > "col", where it is the same value for every row in the hitlist...?
> >> >
> >> > Thanks for your help!
> >> >
> >> >
> >> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein <joels...@gmail.com>
> >> > wrote:
> >> >
> >> >> I suspect you've got nulls in your data. I just tested with null values
> >> >> and got the same error. For testing purposes, try loading the data with
> >> >> default values of zero.
> >> >>
> >> >>
> >> >> Joel Bernstein
> >> >> http://joelsolr.blogspot.com/
> >> >>
> >> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein <joels...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > Let's break the expression down and build it up slowly. Let's start with:
> >> >> >
> >> >> > let(echo="true",
> >> >> >      a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> >> >> >               fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >> >      b=col(a, oil_first_90_days_production))
> >> >> >
> >> >> >
> >> >> > This should return variables a and b. Let's see what the data looks like.
> >> >> > I changed the rows from 15000 to 15. If it all looks good we can expand
> >> >> > the rows and continue adding functions.
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > Joel Bernstein
> >> >> > http://joelsolr.blogspot.com/
> >> >> >
> >> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith <localde...@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> >> Thanks Joel for your help on this.
> >> >> >>
> >> >> >> What I've done so far:
> >> >> >> - unzip downloaded solr-7.2
> >> >> >> - modify the _default "managed-schema" to add the random field type and
> >> >> >>   the dynamic random field
> >> >> >> - start solr7 using "solr start -c"
> >> >> >> - indexed my data using pint/pdouble/boolean field types etc
> >> >> >>
> >> >> >> I can now run the random function all by itself, it returns random
> >> >> >> results as expected. So far so good!
> >> >> >>
> >> >> >> However... now trying to get the regression stuff working:
> >> >> >>
> >> >> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> >> >>              rows="15000",
> >> >> >>              fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >> >>     b=col(a, oil_first_90_days_production),
> >> >> >>     c=col(a, oil_last_30_days_production),
> >> >> >>     d=regress(b, c))
> >> >> >>
> >> >> >> I posted this directly into the solr admin UI. When I run the streaming
> >> >> >> expression, I get this error message:
> >> >> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> >> >> >> expected but found type java.lang.String for value
> >> >> >> oil_first_90_days_production"
> >> >> >>
> >> >> >> It thinks my numeric field is defined as a string? But when I view the
> >> >> >> schema, those 2 fields are defined as ints:
> >> >> >>
> >> >> >>
> >> >> >> When I run a normal query and choose xml as the output format, it also
> >> >> >> puts "int" elements into the hitlist, so the schema appears to be
> >> >> >> correct. It's only when using this regress function that something goes
> >> >> >> wrong and solr thinks the field is a string.
> >> >> >>
> >> >> >> Any suggestions?
> >> >> >> Thanks!
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein <joels...@gmail.com>
> >> >> >> wrote:
> >> >> >>
> >> >> >>> The field type will also need to be in the schema:
> >> >> >>>
> >> >> >>>  <!-- The "RandomSortField" is not used to store or search any
> >> >> >>>       data.  You can declare fields of this type in your schema
> >> >> >>>       to generate pseudo-random orderings of your docs for sorting
> >> >> >>>       or function purposes.  The ordering is generated based on the
> >> >> >>>       field name and the version of the index. As long as the index
> >> >> >>>       version remains unchanged, and the same field name is reused,
> >> >> >>>       the ordering of the docs will be consistent.
> >> >> >>>       If you want different pseudo-random orderings of documents,
> >> >> >>>       for the same version of the index, use a dynamicField and
> >> >> >>>       change the field name in the request.
> >> >> >>>  -->
> >> >> >>>
> >> >> >>> <fieldType name="random" class="solr.RandomSortField" indexed="true" />
> >> >> >>>
> >> >> >>>
> >> >> >>> Joel Bernstein
> >> >> >>> http://joelsolr.blogspot.com/
> >> >> >>>
> >> >> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein <joels...@gmail.com>
> >> >> >>> wrote:
> >> >> >>>
> >> >> >>> > You'll need to have this field in your schema:
> >> >> >>> >
> >> >> >>> > <dynamicField name="random_*" type="random" />
> >> >> >>> >
> >> >> >>> > I'll check to see if the default schema used with solr start -c has
> >> >> >>> > this field; if not, I'll add it. Thanks for pointing this out.
> >> >> >>> >
> >> >> >>> > I checked, and right now the random expression accepts only one fq,
> >> >> >>> > but I consider this a bug. It should accept multiple. I'll create a
> >> >> >>> > ticket for getting this fixed.
> >> >> >>> >
> >> >> >>> >
> >> >> >>> >
> >> >> >>> > Joel Bernstein
> >> >> >>> > http://joelsolr.blogspot.com/
> >> >> >>> >
> >> >> >>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith <localde...@gmail.com>
> >> >> >>> > wrote:
> >> >> >>> >
> >> >> >>> >> Joel, thanks for the pointers to the streaming feature. I had no idea
> >> >> >>> >> solr had that (and also just discovered the very interesting sql
> >> >> >>> >> feature! I will be sure to investigate that in more detail in the
> >> >> >>> >> future).
> >> >> >>> >>
> >> >> >>> >> However, I'm having some trouble getting basic streaming functions
> >> >> >>> >> working. I've already figured out that I had to move to "solr cloud"
> >> >> >>> >> instead of "solr standalone" because I was getting errors about
> >> >> >>> >> "cannot find zk instance" or whatever, which went away when using
> >> >> >>> >> "solr start -c" instead.
> >> >> >>> >>
> >> >> >>> >> But now I'm trying to use the random function, since that was one of
> >> >> >>> >> the functions used in your example.
> >> >> >>> >>
> >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >> >> >>> >>
> >> >> >>> >> I posted that directly in the "stream" section of the solr admin UI.
> >> >> >>> >> This is all on linux, with solr 7.1.0 and 7.2.1 (tried several
> >> >> >>> >> versions in case it was a bug in one).
> >> >> >>> >>
> >> >> >>> >> I get back an error message:
> >> >> >>> >> *sort param could not be parsed as a query, and is not a field that
> >> >> >>> >> exists in the index: random_-255009774*
> >> >> >>> >>
> >> >> >>> >> I'm not passing in any sort field anywhere. But the solr logs show
> >> >> >>> >> these three log entries:
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> >> >>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> >> >> >>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >> >> >>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
> >> >> >>> >> status=400 QTime=19
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> >> >> >>> >> Request to collection [tx_header] failed due to (400)
> >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> >> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort
> >> >> >>> >> param could not be parsed as a query, and is not a field that exists
> >> >> >>> >> in the index: random_-255009774, retry? 0
> >> >> >>> >>
> >> >> >>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> >> >>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
> >> >> >>> >> java.io.IOException:
> >> >> >>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> >> >>> >> Error from server at http://192.168.13.31:8983/solr/tx_header: sort
> >> >> >>> >> param could not be parsed as a query, and is not a field that exists
> >> >> >>> >> in the index: random_-255009774
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> So basically it looks like solr is injecting the "sort=random_" stuff
> >> >> >>> >> into my query, and of course that is failing on the search since that
> >> >> >>> >> field/column doesn't exist in my schema. Every time I run the random
> >> >> >>> >> function, I get a slightly different field name that it injects, but
> >> >> >>> >> they all start with "random_".
> >> >> >>> >>
> >> >> >>> >> I have tried adding my own sort field instead, hoping solr wouldn't
> >> >> >>> >> inject one for me, but it still injected a random sort fieldname:
> >> >> >>> >> random(tx_header, q="*:*", rows="100", fl="countyname",
> >> >> >>> >>        sort="countyname asc")
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> Assuming I can fix that whole problem, my second question is: can I
> >> >> >>> >> add multiple "fq=" parameters to the random function? I build a pretty
> >> >> >>> >> complicated query using many fq= fields, and then want to run some
> >> >> >>> >> stats on that hitlist. So somehow I have to pass the query that made
> >> >> >>> >> up the exact hitlist to these various functions, but when I used
> >> >> >>> >> multiple "fq=" values it only seemed to use the last one I specified
> >> >> >>> >> and ignored all the previous fq's.
> >> >> >>> >>
> >> >> >>> >> Thanks in advance for any comments/suggestions...!
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >>
> >> >> >>> >> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein <joels...@gmail.com>
> >> >> >>> >> wrote:
> >> >> >>> >>
> >> >> >>> >> > This is going to be a complex answer because Solr actually now has
> >> >> >>> >> > multiple ways of doing regression analysis as part of the Streaming
> >> >> >>> >> > Expression statistical programming library. The basic documentation
> >> >> >>> >> > is here:
> >> >> >>> >> >
> >> >> >>> >> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> >> >> >>> >> >
> >> >> >>> >> > Here is a sample expression that performs a simple linear regression
> >> >> >>> >> > in Solr 7.2:
> >> >> >>> >> >
> >> >> >>> >> > let(a=random(collection1, q="any query", rows="15000",
> >> >> >>> >> >              fl="fieldA, fieldB"),
> >> >> >>> >> >     b=col(a, fieldA),
> >> >> >>> >> >     c=col(a, fieldB),
> >> >> >>> >> >     d=regress(b, c))
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> > The expression above takes a random sample of 15000 results from
> >> >> >>> >> > collection1. The result set will include fieldA and fieldB in each
> >> >> >>> >> > record. The result set is stored in variable "a".
> >> >> >>> >> >
> >> >> >>> >> > Then the "col" function creates arrays of numbers from the results
> >> >> >>> >> > stored in variable "a". The values in fieldA are stored in the
> >> >> >>> >> > variable "b". The values in fieldB are stored in variable "c".
> >> >> >>> >> >
> >> >> >>> >> > Then the regress function performs a simple linear regression on the
> >> >> >>> >> > arrays stored in variables "b" and "c".
> >> >> >>> >> >
> >> >> >>> >> > The output of the regress function is a map containing the
> >> >> >>> >> > regression result. This result includes RSquared and other
> >> >> >>> >> > attributes of the regression model, such as R (correlation), slope,
> >> >> >>> >> > y intercept, etc.
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> >
> >> >> >>> >> > Joel Bernstein
> >> >> >>> >> > http://joelsolr.blogspot.com/
> >> >> >>> >> >
> >> >> >>> >> > On Fri, Feb 23, 2018 at 3:10 PM, John Smith <localde...@gmail.com>
> >> >> >>> >> > wrote:
> >> >> >>> >> >
> >> >> >>> >> > > Hi Joel, thanks for the answer. I'm not really a stats guy, but the
> >> >> >>> >> > > end result of all this is supposed to be obtaining R^2. Is there no
> >> >> >>> >> > > way of obtaining this value, then (short of iterating over all the
> >> >> >>> >> > > results in the hitlist and calculating it myself)?
> >> >> >>> >> > >
> >> >> >>> >> > > On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein <joels...@gmail.com>
> >> >> >>> >> > > wrote:
> >> >> >>> >> > >
> >> >> >>> >> > > > Typically SSE is the sum of the squared errors of the prediction in
> >> >> >>> >> > > > a regression analysis. The stats component doesn't perform
> >> >> >>> >> > > > regression, although it might be a nice feature.
> >> >> >>> >> > > >
> >> >> >>> >> > > >
> >> >> >>> >> > > >
> >> >> >>> >> > > > Joel Bernstein
> >> >> >>> >> > > > http://joelsolr.blogspot.com/
> >> >> >>> >> > > >
> >> >> >>> >> > > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith <localde...@gmail.com>
> >> >> >>> >> > > > wrote:
> >> >> >>> >> > > >
> >> >> >>> >> > > > > I'm using solr, and enabling stats as per this page:
> >> >> >>> >> > > > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > I want to get more stat values though. Specifically, I'm looking for
> >> >> >>> >> > > > > r-squared (coefficient of determination). This value is not present
> >> >> >>> >> > > > > in solr; however, some of the pieces used to calculate r^2 are in
> >> >> >>> >> > > > > the stats element, for example:
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > <double name="min">0.0</double>
> >> >> >>> >> > > > > <double name="max">10.0</double>
> >> >> >>> >> > > > > <long name="count">15</long>
> >> >> >>> >> > > > > <long name="missing">17</long>
> >> >> >>> >> > > > > <double name="sum">85.0</double>
> >> >> >>> >> > > > > <double name="sumOfSquares">603.0</double>
> >> >> >>> >> > > > > <double name="mean">5.666666666666667</double>
> >> >> >>> >> > > > > <double name="stddev">2.943920288775949</double>
> >> >> >>> >> > > > >
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > So I have the sumOfSquares available (SST), and using this
> >> >> >>> >> > > > > calculation, I can get R^2:
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > R^2 = 1 - SSE/SST
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > All I need then is SSE. Is there any way I can get SSE from those
> >> >> >>> >> > > > > other stats in solr?
> >> >> >>> >> > > > >
> >> >> >>> >> > > > > Thanks in advance!
> >> >> >>> >> > > > >
> >> >> >>> >> > > >
> >> >> >>> >> > >
> >> >> >>> >> >
> >> >> >>> >>
> >> >> >>> >
> >> >> >>> >
> >> >> >>>
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >>
> >
> >
>
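
One point from the original question at the bottom of the thread can be checked with the numbers quoted there. The values suggest that sumOfSquares in the stats output is the sum of the squared values rather than the sum of squared deviations from the mean, so it would not be SST by itself. The sketch below is plain Python, uses only the quoted figures, and the variable names are illustrative; it derives SST as sumOfSquares - sum^2/count and compares the implied sample standard deviation with the reported stddev.

```python
import math

# Figures copied from the stats output quoted in the original question.
count = 15
total = 85.0                 # "sum"
sum_of_squares = 603.0       # "sumOfSquares"
reported_stddev = 2.943920288775949

# Total sum of squares about the mean: sum(x^2) - (sum(x))^2 / n.
sst = sum_of_squares - total ** 2 / count

# Sample standard deviation implied by that SST (n - 1 in the denominator).
sample_stddev = math.sqrt(sst / (count - 1))
```

The implied standard deviation agrees with the reported one, which supports reading sumOfSquares as the raw sum of squares; SSE still has to come from an actual regression fit.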
