Re: statistics in hitlist

2018-03-16 Thread Joel Bernstein
With regression you're looking at how the change in one variable affects
the change in another variable. So you need to have values that are
changing. What you described is an average of field X which is not
changing, regressed against the value of X.
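
To illustrate the point above, here is a minimal Python sketch (not Solr
code) of what happens when one side of the regression is a constant. The
figures and the helper name least_squares are made up for illustration; it
assumes an ordinary least-squares fit, which is what a simple regression
normally computes.

```python
def least_squares(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x does not vary, so the slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical values of field X for four documents.
x = [10.0, 12.0, 11.0, 15.0]

# The average of X repeated for every row: a column that never changes.
avg_column = [sum(x) / len(x)] * len(x)

# Every (y - mean_y) term is zero, so the covariance is zero and the fit
# is a flat line at the average. It says nothing about X.
slope, intercept = least_squares(x, avg_column)
```

With these numbers the fit is a flat line at the average of X, whatever the
individual values of X are.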

I think one approach to this is to regress the moving average of X against
the actual value of X. We can do this with the math library, but before
exploring the code for this, spend some time thinking about whether that's
the problem you're trying to solve. Take a look at how moving averages work:
https://en.wikipedia.org/wiki/Moving_average
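
As a rough illustration of the idea (plain Python, not the Solr math
library), the sketch below computes a simple moving average and regresses it
against the actual values. The data, the window size, and the helper names
moving_average and linear_regression are assumptions chosen for the example.

```python
def moving_average(values, window):
    """Simple moving average: one output per full window of values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def linear_regression(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x does not vary, so the slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up values of X in hitlist order, for illustration only.
x = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0, 17.0, 20.0]
window = 3
ma = moving_average(x, window)

# The moving average has window - 1 fewer points than x, so pair each
# average with the value that closes its window before regressing.
actual = x[window - 1:]
slope, intercept = linear_regression(ma, actual)
```

Unlike a single overall average, the moving average changes from row to row,
so there is variation on both sides of the regression.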





Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Mar 16, 2018 at 9:26 AM, John Smith  wrote:

> Thanks for the link to the documentation, that will probably come in
> useful.
>
> I didn't see a way though, to get my avg function working? So instead of
> doing a linear regression on two fields, X and Y, in a hitlist, we need to
> do a linear regression on field X, and the average value of X. Is that
> possible? To pass in a function to the regress function instead of a field?
>
>
>
>
>
> On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein 
> wrote:
>
> > I've been working on the user guide for the math expressions. Here is the
> > page on regression:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
> >
> > This page is part of the larger math expression documentation. The TOC is
> > here:
> >
> > https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
> >
> > The docs are still very rough but you can get an idea of the coverage.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein 
> > wrote:
> >
> > > If you want to get everything in query you can do this:
> > >
> > > > let(echo="d,e",
> > > >  a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > > > fq="isParent:true", rows="150",
> > > > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > > > sort="id asc"),
> > > >  b=col(a, oil_first_90_days_production),
> > > >  c=col(a, oil_last_30_days_production),
> > > >  d=regress(b, c),
> > > >  e=someExpression())
> > >
> > > The echo parameter tells the let expression which variables to output.
> > >
> > >
> > >
> > > Joel Bernstein
> > > http://joelsolr.blogspot.com/
> > >
> > > > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson wrote:
> > >
> > >> What does the fq clause look like?
> > >>
> > >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith 
> > >> wrote:
> > >> > Hi Joel, I did some more work on this statistics stuff today. Yes,
> we
> > do
> > >> > have nulls in our data; the document contains many fields, we don't
> > >> always
> > >> > have values for each field, but we can't set the nulls to 0 either
> (or
> > >> any
> > >> > other value, really) as that will mess up other calculations (such
> as
> > >> when
> > >> > calculating average etc); we would normally just ignore fields with
> > null
> > >> > values when calculating stats manually ourselves.
> > >> >
> > >> > Adding a check in the "q" parameter to ensure that the fields used
> in
> > >> the
> > >> > calculations are > 0 does work now. Thanks for the tip (and sorry,
> > >> should
> > >> > have caught that myself). But I am unable to use "fq" for these
> > checks,
> > >> > they have to be added to the q instead. Adding fq's doesn't have any
> > >> effect.
> > >> >
> > >> >
> > >> > Anyway, I'm trying to change this up a little. This is what I'm
> > >> currently
> > >> > using (switched from "random" to "search" since I actually need the
> > full
> > >> > hitlist not just a random subset):
> > >> >
> > >> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > >> > fq="isParent:true", rows="150",
> > >> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > >> > sort="id asc"),
> > >> >  b=col(a, oil_first_90_days_production),
> > >> >  c=col(a, oil_last_30_days_production),
> > >> >  d=regress(b, c))
> > >> >
> > >> > So I have 2 fields there defined, that works great (in terms of a
> test
> > >> and
> > >> > running the query); but I need to replace the second field,
> > >> > "oil_last_30_days_production" with the avg value in
> > >> > oil_first_90_days_production.
> > >> >
> > >> > I can get the avg with this expression:
> > >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > >> > fq="isParent:true", rows="150", avg(oil_first_90_days_production))
> > >> >
> > >> > But I don't know how to push that avg value into the first streaming
> > >> > expression; guessing I have to set "c=" but that is where I'm
> > >> getting
> > >> > lost, since avg only 

Re: statistics in hitlist

2018-03-16 Thread John Smith
Thanks for the link to the documentation, that will probably come in useful.

I didn't see a way though, to get my avg function working? So instead of
doing a linear regression on two fields, X and Y, in a hitlist, we need to
do a linear regression on field X, and the average value of X. Is that
possible? To pass in a function to the regress function instead of a field?





On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein  wrote:

> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein 
> wrote:
>
> > If you want to get everything in query you can do this:
> >
> > let(echo="d,e",
> >  a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > fq="isParent:true", rows="150",
> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > sort="id asc"),
> >  b=col(a, oil_first_90_days_production),
> >  c=col(a, oil_last_30_days_production),
> >  d=regress(b, c),
> >  e=someExpression())
> >
> > The echo parameter tells the let expression which variables to output.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson wrote:
> >
> >> What does the fq clause look like?
> >>
> >> On Thu, Mar 15, 2018 at 11:51 AM, John Smith 
> >> wrote:
> >> > Hi Joel, I did some more work on this statistics stuff today. Yes, we
> do
> >> > have nulls in our data; the document contains many fields, we don't
> >> always
> >> > have values for each field, but we can't set the nulls to 0 either (or
> >> any
> >> > other value, really) as that will mess up other calculations (such as
> >> when
> >> > calculating average etc); we would normally just ignore fields with
> null
> >> > values when calculating stats manually ourselves.
> >> >
> >> > Adding a check in the "q" parameter to ensure that the fields used in
> >> the
> >> > calculations are > 0 does work now. Thanks for the tip (and sorry,
> >> should
> >> > have caught that myself). But I am unable to use "fq" for these
> checks,
> >> > they have to be added to the q instead. Adding fq's doesn't have any
> >> effect.
> >> >
> >> >
> >> > Anyway, I'm trying to change this up a little. This is what I'm
> >> currently
> >> > using (switched from "random" to "search" since I actually need the
> full
> >> > hitlist not just a random subset):
> >> >
> >> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> > fq="isParent:true", rows="150",
> >> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >> > sort="id asc"),
> >> >  b=col(a, oil_first_90_days_production),
> >> >  c=col(a, oil_last_30_days_production),
> >> >  d=regress(b, c))
> >> >
> >> > So I have 2 fields there defined, that works great (in terms of a test
> >> and
> >> > running the query); but I need to replace the second field,
> >> > "oil_last_30_days_production" with the avg value in
> >> > oil_first_90_days_production.
> >> >
> >> > I can get the avg with this expression:
> >> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >> > fq="isParent:true", rows="150", avg(oil_first_90_days_production))
> >> >
> >> > But I don't know how to push that avg value into the first streaming
> >> > expression; guessing I have to set "c=" but that is where I'm
> >> getting
> >> > lost, since avg only returns 1 value and the first parameter, "b",
> >> returns
> >> > a list of sorts. Somehow I have to get the avg value stuffed inside a
> >> > "col", where it is the same value for every row in the hitlist...?
> >> >
> >> > Thanks for your help!
> >> >
> >> >
> >> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein 
> >> wrote:
> >> >
> >> >> I suspect you've got nulls in your data. I just tested with null
> >> values and
> >> >> got the same error. For testing purposes try loading the data with
> >> default
> >> >> values of zero.
> >> >>
> >> >>
> >> >> Joel Bernstein
> >> >> http://joelsolr.blogspot.com/
> >> >>
> >> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein 
> >> >> wrote:
> >> >>
> >> >> > Let's break the expression down and build it up slowly. Let's start
> >> with:
> >> >> >
> >> >> > let(echo="true",
> >> >> >  a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> rows="15",
> 

Re: statistics in hitlist

2018-03-15 Thread Joel Bernstein
I've been working on the user guide for the math expressions. Here is the
page on regression:

https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc

This page is part of the larger math expression documentation. The TOC is
here:

https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc

The docs are still very rough but you can get an idea of the coverage.



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein  wrote:

> If you want to get everything in query you can do this:
>
> let(echo="d,e",
>  a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> fq="isParent:true", rows="150",
> fl="id,oil_first_90_days_production,oil_last_30_days_production",
> sort="id asc"),
>  b=col(a, oil_first_90_days_production),
>  c=col(a, oil_last_30_days_production),
>  d=regress(b, c),
>  e=someExpression())
>
> The echo parameter tells the let expression which variables to output.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson 
> wrote:
>
>> What does the fq clause look like?
>>
>> On Thu, Mar 15, 2018 at 11:51 AM, John Smith 
>> wrote:
>> > Hi Joel, I did some more work on this statistics stuff today. Yes, we do
>> > have nulls in our data; the document contains many fields, we don't
>> always
>> > have values for each field, but we can't set the nulls to 0 either (or
>> any
>> > other value, really) as that will mess up other calculations (such as
>> when
>> > calculating average etc); we would normally just ignore fields with null
>> > values when calculating stats manually ourselves.
>> >
>> > Adding a check in the "q" parameter to ensure that the fields used in
>> the
>> > calculations are > 0 does work now. Thanks for the tip (and sorry,
>> should
>> > have caught that myself). But I am unable to use "fq" for these checks,
>> > they have to be added to the q instead. Adding fq's doesn't have any
>> effect.
>> >
>> >
>> > Anyway, I'm trying to change this up a little. This is what I'm
>> currently
>> > using (switched from "random" to "search" since I actually need the full
>> > hitlist not just a random subset):
>> >
>> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
>> > fq="isParent:true", rows="150",
>> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
>> > sort="id asc"),
>> >  b=col(a, oil_first_90_days_production),
>> >  c=col(a, oil_last_30_days_production),
>> >  d=regress(b, c))
>> >
>> > So I have 2 fields there defined, that works great (in terms of a test
>> and
>> > running the query); but I need to replace the second field,
>> > "oil_last_30_days_production" with the avg value in
>> > oil_first_90_days_production.
>> >
>> > I can get the avg with this expression:
>> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
>> > fq="isParent:true", rows="150", avg(oil_first_90_days_production))
>> >
>> > But I don't know how to push that avg value into the first streaming
>> > expression; guessing I have to set "c=" but that is where I'm
>> getting
>> > lost, since avg only returns 1 value and the first parameter, "b",
>> returns
>> > a list of sorts. Somehow I have to get the avg value stuffed inside a
>> > "col", where it is the same value for every row in the hitlist...?
>> >
>> > Thanks for your help!
>> >
>> >
>> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein 
>> wrote:
>> >
>> >> I suspect you've got nulls in your data. I just tested with null
>> values and
>> >> got the same error. For testing purposes try loading the data with
>> default
>> >> values of zero.
>> >>
>> >>
>> >> Joel Bernstein
>> >> http://joelsolr.blogspot.com/
>> >>
>> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein 
>> >> wrote:
>> >>
>> >> > Let's break the expression down and build it up slowly. Let's start
>> with:
>> >> >
>> >> > let(echo="true",
>> >> >  a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
>> >> > fl="oil_first_90_days_production,oil_last_30_days_production"),
>> >> >  b=col(a, oil_first_90_days_production))
>> >> >
>> >> >
>> >> > This should return variables a and b. Let's see what the data looks like.
>> >> > I changed the rows from 15000 to 15. If it all looks good we can expand the
>> >> > rows and continue adding functions.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > Joel Bernstein
>> >> > http://joelsolr.blogspot.com/
>> >> >
>> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith 
>> wrote:
>> >> >
>> >> >> Thanks Joel for your help on this.
>> >> >>
>> >> >> What I've done so far:
>> >> >> - unzip downloaded solr-7.2
>> >> >> - modify the _default 

Re: statistics in hitlist

2018-03-15 Thread Joel Bernstein
If you want to get everything in one query you can do this:

let(echo="d,e",
    a=search(tx_prod_production,
             q="oil_first_90_days_production:[1 TO *]",
             fq="isParent:true",
             rows="150",
             fl="id,oil_first_90_days_production,oil_last_30_days_production",
             sort="id asc"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c),
    e=someExpression())

The echo parameter tells the let expression which variables to output.



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson 
wrote:

> What does the fq clause look like?
>
> On Thu, Mar 15, 2018 at 11:51 AM, John Smith  wrote:
> > Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> > have nulls in our data; the document contains many fields, we don't
> always
> > have values for each field, but we can't set the nulls to 0 either (or
> any
> > other value, really) as that will mess up other calculations (such as
> when
> > calculating average etc); we would normally just ignore fields with null
> > values when calculating stats manually ourselves.
> >
> > Adding a check in the "q" parameter to ensure that the fields used in the
> > calculations are > 0 does work now. Thanks for the tip (and sorry, should
> > have caught that myself). But I am unable to use "fq" for these checks,
> > they have to be added to the q instead. Adding fq's doesn't have any
> effect.
> >
> >
> > Anyway, I'm trying to change this up a little. This is what I'm currently
> > using (switched from "random" to "search" since I actually need the full
> > hitlist not just a random subset):
> >
> > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > fq="isParent:true", rows="150",
> > fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > sort="id asc"),
> >  b=col(a, oil_first_90_days_production),
> >  c=col(a, oil_last_30_days_production),
> >  d=regress(b, c))
> >
> > So I have 2 fields there defined, that works great (in terms of a test
> and
> > running the query); but I need to replace the second field,
> > "oil_last_30_days_production" with the avg value in
> > oil_first_90_days_production.
> >
> > I can get the avg with this expression:
> > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > fq="isParent:true", rows="150", avg(oil_first_90_days_production))
> >
> > But I don't know how to push that avg value into the first streaming
> > expression; guessing I have to set "c=" but that is where I'm getting
> > lost, since avg only returns 1 value and the first parameter, "b",
> returns
> > a list of sorts. Somehow I have to get the avg value stuffed inside a
> > "col", where it is the same value for every row in the hitlist...?
> >
> > Thanks for your help!
> >
> >
> > On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein 
> wrote:
> >
> >> I suspect you've got nulls in your data. I just tested with null values
> and
> >> got the same error. For testing purposes try loading the data with
> default
> >> values of zero.
> >>
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein 
> >> wrote:
> >>
> >> > Let's break the expression down and build it up slowly. Let's start
> with:
> >> >
> >> > let(echo="true",
> >> >  a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> >> > fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >  b=col(a, oil_first_90_days_production))
> >> >
> >> >
> >> > This should return variables a and b. Let's see what the data looks like.
> >> > I changed the rows from 15000 to 15. If it all looks good we can expand the
> >> > rows and continue adding functions.
> >> >
> >> >
> >> >
> >> >
> >> > Joel Bernstein
> >> > http://joelsolr.blogspot.com/
> >> >
> >> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith 
> wrote:
> >> >
> >> >> Thanks Joel for your help on this.
> >> >>
> >> >> What I've done so far:
> >> >> - unzip downloaded solr-7.2
> >> >> - modify the _default "managed-schema" to add the random field type
> and
> >> >> the dynamic random field
> >> >> - start solr7 using "solr start -c"
> >> >> - indexed my data using pint/pdouble/boolean field types etc
> >> >>
> >> >> I can now run the random function all by itself, it returns random
> >> >> results as expected. So far so good!
> >> >>
> >> >> However... now trying to get the regression stuff working:
> >> >>
> >> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> >> rows="15000",
> >> >> fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> >> b=col(a, oil_first_90_days_production),
> >> >> c=col(a, oil_last_30_days_production),
> >> >> d=regress(b, c))
> >> >>
> >> >> Posted directly into solr admin UI. Run the streaming expression and
> I

Re: statistics in hitlist

2018-03-15 Thread Erick Erickson
What does the fq clause look like?

On Thu, Mar 15, 2018 at 11:51 AM, John Smith  wrote:
> Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> have nulls in our data; the document contains many fields, we don't always
> have values for each field, but we can't set the nulls to 0 either (or any
> other value, really) as that will mess up other calculations (such as when
> calculating average etc); we would normally just ignore fields with null
> values when calculating stats manually ourselves.
>
> Adding a check in the "q" parameter to ensure that the fields used in the
> calculations are > 0 does work now. Thanks for the tip (and sorry, should
> have caught that myself). But I am unable to use "fq" for these checks,
> they have to be added to the q instead. Adding fq's doesn't have any effect.
>
>
> Anyway, I'm trying to change this up a little. This is what I'm currently
> using (switched from "random" to "search" since I actually need the full
> hitlist not just a random subset):
>
> let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> fq="isParent:true", rows="150",
> fl="id,oil_first_90_days_production,oil_last_30_days_production",
> sort="id asc"),
>  b=col(a, oil_first_90_days_production),
>  c=col(a, oil_last_30_days_production),
>  d=regress(b, c))
>
> So I have 2 fields there defined, that works great (in terms of a test and
> running the query); but I need to replace the second field,
> "oil_last_30_days_production" with the avg value in
> oil_first_90_days_production.
>
> I can get the avg with this expression:
> stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> fq="isParent:true", rows="150", avg(oil_first_90_days_production))
>
> But I don't know how to push that avg value into the first streaming
> expression; guessing I have to set "c=" but that is where I'm getting
> lost, since avg only returns 1 value and the first parameter, "b", returns
> a list of sorts. Somehow I have to get the avg value stuffed inside a
> "col", where it is the same value for every row in the hitlist...?
>
> Thanks for your help!
>
>
> On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein  wrote:
>
>> I suspect you've got nulls in your data. I just tested with null values and
>> got the same error. For testing purposes try loading the data with default
>> values of zero.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein 
>> wrote:
>>
>> > Let's break the expression down and build it up slowly. Let's start with:
>> >
>> > let(echo="true",
>> >  a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
>> > fl="oil_first_90_days_production,oil_last_30_days_production"),
>> >  b=col(a, oil_first_90_days_production))
>> >
>> >
>> > This should return variables a and b. Let's see what the data looks like.
>> > I changed the rows from 15000 to 15. If it all looks good we can expand the
>> > rows and continue adding functions.
>> >
>> >
>> >
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith  wrote:
>> >
>> >> Thanks Joel for your help on this.
>> >>
>> >> What I've done so far:
>> >> - unzip downloaded solr-7.2
>> >> - modify the _default "managed-schema" to add the random field type and
>> >> the dynamic random field
>> >> - start solr7 using "solr start -c"
>> >> - indexed my data using pint/pdouble/boolean field types etc
>> >>
>> >> I can now run the random function all by itself, it returns random
>> >> results as expected. So far so good!
>> >>
>> >> However... now trying to get the regression stuff working:
>> >>
>> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
>> >> rows="15000",
>> >> fl="oil_first_90_days_production,oil_last_30_days_production"),
>> >> b=col(a, oil_first_90_days_production),
>> >> c=col(a, oil_last_30_days_production),
>> >> d=regress(b, c))
>> >>
>> >> Posted directly into solr admin UI. Run the streaming expression and I
>> >> get this error message:
>> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
>> >> expected but found type java.lang.String for value
>> >> oil_first_90_days_production"
>> >>
>> >> It thinks my numeric field is defined as a string? But when I view the
>> >> schema, those 2 fields are defined as ints:
>> >>
>> >>
>> >> When I run a normal query and choose xml as output format, then it also
>> >> puts "int" elements into the hitlist, so the schema appears to be
>> correct
>> >> it's just when using this regress function that something goes wrong and
>> >> solr thinks the field is string.
>> >>
>> >> Any suggestions?
>> >> Thanks!
>> >>
>> >>
>> >>
>> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein 
>> >> wrote:
>> >>
>> >>> The field type will also need to be in the schema:
>> >>>

Re: statistics in hitlist

2018-03-15 Thread John Smith
Hi Joel, I did some more work on this statistics stuff today. Yes, we do
have nulls in our data; the document contains many fields, we don't always
have values for each field, but we can't set the nulls to 0 either (or any
other value, really) as that will mess up other calculations (such as when
calculating average etc); we would normally just ignore fields with null
values when calculating stats manually ourselves.
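
For readers following along, a small Python sketch (not Solr code) of why
zero-filling distorts an average while skipping nulls does not. The sample
values and the helper name mean_ignoring_nulls are hypothetical.

```python
def mean_ignoring_nulls(values):
    """Average of the values that are present; None entries are skipped."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # no data at all, so there is no average to report
    return sum(present) / len(present)


# Hypothetical field values; None marks a document with no value.
production = [120, None, 80, None, 100]

# Skipping nulls averages only the three real values.
avg_skipping = mean_ignoring_nulls(production)

# Zero-filling treats the two missing values as real zeros, which drags
# the average down.
zero_filled = [v if v is not None else 0 for v in production]
avg_zero_filled = sum(zero_filled) / len(zero_filled)
```

Here the average over the present values is 100.0, while the zero-filled
average drops to 60.0.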

Adding a check in the "q" parameter to ensure that the fields used in the
calculations are > 0 does work now. Thanks for the tip (and sorry, should
have caught that myself). But I am unable to use "fq" for these checks,
they have to be added to the q instead. Adding fq's doesn't have any effect.


Anyway, I'm trying to change this up a little. This is what I'm currently
using (switched from "random" to "search" since I actually need the full
hitlist not just a random subset):

let(a=search(tx_prod_production,
             q="oil_first_90_days_production:[1 TO *]",
             fq="isParent:true",
             rows="150",
             fl="id,oil_first_90_days_production,oil_last_30_days_production",
             sort="id asc"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c))

So I have 2 fields there defined, that works great (in terms of a test and
running the query); but I need to replace the second field,
"oil_last_30_days_production" with the avg value in
oil_first_90_days_production.

I can get the avg with this expression:
stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
fq="isParent:true", rows="150", avg(oil_first_90_days_production))

But I don't know how to push that avg value into the first streaming
expression; guessing I have to set "c=" but that is where I'm getting
lost, since avg only returns 1 value and the first parameter, "b", returns
a list of sorts. Somehow I have to get the avg value stuffed inside a
"col", where it is the same value for every row in the hitlist...?

Thanks for your help!


On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein  wrote:

> I suspect you've got nulls in your data. I just tested with null values and
> got the same error. For testing purposes try loading the data with default
> values of zero.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein 
> wrote:
>
> > Let's break the expression down and build it up slowly. Let's start with:
> >
> > let(echo="true",
> >  a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> > fl="oil_first_90_days_production,oil_last_30_days_production"),
> >  b=col(a, oil_first_90_days_production))
> >
> >
> > This should return variables a and b. Let's see what the data looks like.
> > I changed the rows from 15000 to 15. If it all looks good we can expand the
> > rows and continue adding functions.
> >
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith  wrote:
> >
> >> Thanks Joel for your help on this.
> >>
> >> What I've done so far:
> >> - unzip downloaded solr-7.2
> >> - modify the _default "managed-schema" to add the random field type and
> >> the dynamic random field
> >> - start solr7 using "solr start -c"
> >> - indexed my data using pint/pdouble/boolean field types etc
> >>
> >> I can now run the random function all by itself, it returns random
> >> results as expected. So far so good!
> >>
> >> However... now trying to get the regression stuff working:
> >>
> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> >> rows="15000",
> >> fl="oil_first_90_days_production,oil_last_30_days_production"),
> >> b=col(a, oil_first_90_days_production),
> >> c=col(a, oil_last_30_days_production),
> >> d=regress(b, c))
> >>
> >> Posted directly into solr admin UI. Run the streaming expression and I
> >> get this error message:
> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> >> expected but found type java.lang.String for value
> >> oil_first_90_days_production"
> >>
> >> It thinks my numeric field is defined as a string? But when I view the
> >> schema, those 2 fields are defined as ints:
> >>
> >>
> >> When I run a normal query and choose xml as output format, then it also
> >> puts "int" elements into the hitlist, so the schema appears to be
> correct
> >> it's just when using this regress function that something goes wrong and
> >> solr thinks the field is string.
> >>
> >> Any suggestions?
> >> Thanks!
> >>
> >>
> >>
> >> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein 
> >> wrote:
> >>
> >>> The field type will also need to be in the schema:
> >>>
> >>>  
> >>>
> >>> 
> >>>
> >>>
> >>> Joel Bernstein
> >>> http://joelsolr.blogspot.com/
> >>>
> >>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein 
> >>> wrote:
> >>>
> >>> > You'll need to have this field in your schema:
> >>> >
> >>> > 
> >>> 

Re: statistics in hitlist

2018-03-05 Thread Joel Bernstein
I suspect you've got nulls in your data. I just tested with null values and
got the same error. For testing purposes try loading the data with default
values of zero.
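
An alternative to default values, sketched in plain Python rather than Solr:
keep only the rows where both fields are present before regressing. This is
an assumption about one way to handle the nulls, not a description of what
the regress function does internally; the row data, the field names and the
helper name complete_pairs are made up.

```python
def complete_pairs(rows, x_field, y_field):
    """Return (x, y) tuples for rows where both fields have a value."""
    return [(row[x_field], row[y_field])
            for row in rows
            if row.get(x_field) is not None and row.get(y_field) is not None]


# Hypothetical hitlist; None marks a missing field value.
rows = [
    {"first_90": 120, "last_30": 30},
    {"first_90": None, "last_30": 25},
    {"first_90": 90, "last_30": None},
    {"first_90": 150, "last_30": 40},
]

# Only the first and last rows survive, so both columns stay the same
# length and contain numbers only.
pairs = complete_pairs(rows, "first_90", "last_30")
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
```

Dropping incomplete rows keeps the two columns aligned without inventing
values for the missing fields.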


Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein  wrote:

> Let's break the expression down and build it up slowly. Let's start with:
>
> let(echo="true",
>  a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> fl="oil_first_90_days_production,oil_last_30_days_production"),
>  b=col(a, oil_first_90_days_production))
>
>
> This should return variables a and b. Let's see what the data looks like.
> I changed the rows from 15000 to 15. If it all looks good we can expand the
> rows and continue adding functions.
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 5, 2018 at 4:11 PM, John Smith  wrote:
>
>> Thanks Joel for your help on this.
>>
>> What I've done so far:
>> - unzip downloaded solr-7.2
>> - modify the _default "managed-schema" to add the random field type and
>> the dynamic random field
>> - start solr7 using "solr start -c"
>> - indexed my data using pint/pdouble/boolean field types etc
>>
>> I can now run the random function all by itself, it returns random
>> results as expected. So far so good!
>>
>> However... now trying to get the regression stuff working:
>>
>> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
>> rows="15000",
>> fl="oil_first_90_days_production,oil_last_30_days_production"),
>> b=col(a, oil_first_90_days_production),
>> c=col(a, oil_last_30_days_production),
>> d=regress(b, c))
>>
>> Posted directly into solr admin UI. Run the streaming expression and I
>> get this error message:
>> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
>> expected but found type java.lang.String for value
>> oil_first_90_days_production"
>>
>> It thinks my numeric field is defined as a string? But when I view the
>> schema, those 2 fields are defined as ints:
>>
>>
>> When I run a normal query and choose xml as output format, then it also
>> puts "int" elements into the hitlist, so the schema appears to be correct
>> it's just when using this regress function that something goes wrong and
>> solr thinks the field is string.
>>
>> Any suggestions?
>> Thanks!
>>
>>
>>
>> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein 
>> wrote:
>>
>>> The field type will also need to be in the schema:
>>>
>>>  
>>>
>>> 
>>>
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein 
>>> wrote:
>>>
>>> > You'll need to have this field in your schema:
>>> >
>>> > 
>>> >
>>> > I'll check to see if the default schema used with solr start -c has
>>> this
>>> > field, if not I'll add it. Thanks for pointing this out.
>>> >
>>> > I checked and right now the random expression is only accepting one fq,
>>> > but I consider this a bug. It should accept multiple. I'll create
>>> ticket
>>> > for getting this fixed.
>>> >
>>> >
>>> >
>>> > Joel Bernstein
>>> > http://joelsolr.blogspot.com/
>>> >
>>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith 
>>> wrote:
>>> >
>>> >> Joel, thanks for the pointers to the streaming feature. I had no idea
>>> solr
>>> >> had that (and also just discovered the very interesting sql feature! I
>>> will
>>> >> be sure to investigate that in more detail in the future).
>>> >>
>>> >> However I'm having some trouble getting basic streaming functions
>>> working.
>>> >> I've already figured out that I had to move to "solr cloud" instead of
>>> >> "solr standalone" because I was getting errors about "cannot find zk
>>> >> instance" or whatever which went away when using "solr start -c"
>>> instead.
>>> >>
>>> >> But now I'm trying to use the random function since that was one of
>>> the
>>> >> functions used in your example.
>>> >>
>>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
>>> >>
>>> >> I posted that directly in the "stream" section of the solr admin UI.
>>> This
>>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in
>>> case
>>> >> it was a bug in one)
>>> >>
>>> >> I get back an error message:
>>> >> *sort param could not be parsed as a query, and is not a field that
>>> exists
>>> >> in the index: random_-255009774*
>>> >>
>>> >> I'm not passing in any sort field anywhere. But the solr logs show
>>> these
>>> >> three log entries:
>>> >>
>>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
>>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
>>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
>>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
>>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
>>> status=400
>>> >> QTime=19
>>> >>
>>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
>>> >> r:core_node2 

Re: statistics in hitlist

2018-03-05 Thread Joel Bernstein
Let's break the expression down and build it up slowly. Let's start with:

let(echo="true",
 a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
fl="oil_first_90_days_production,oil_last_30_days_production"),
 b=col(a, oil_first_90_days_production))


This should return variables a and b. Let's see what the data looks like. I
changed the rows from 15000 to 15. If it all looks good we can expand the
rows and continue adding functions.




Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 5, 2018 at 4:11 PM, John Smith  wrote:

> Thanks Joel for your help on this.
>
> What I've done so far:
> - unzip downloaded solr-7.2
> - modify the _default "managed-schema" to add the random field type and
> the dynamic random field
> - start solr7 using "solr start -c"
> - indexed my data using pint/pdouble/boolean field types etc
>
> I can now run the random function all by itself, it returns random results
> as expected. So far so good!
>
> However... now trying to get the regression stuff working:
>
> let(a=random(tx_prod_production, q="*:*", fq="isParent:true",
> rows="15000", fl="oil_first_90_days_production,oil_last_30_days_production"),
> b=col(a, oil_first_90_days_production),
> c=col(a, oil_last_30_days_production),
> d=regress(b, c))
>
> Posted directly into solr admin UI. Run the streaming expression and I get
> this error message:
> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> expected but found type java.lang.String for value
> oil_first_90_days_production"
>
> It thinks my numeric field is defined as a string? But when I view the
> schema, those 2 fields are defined as ints:
>
>
> When I run a normal query and choose xml as output format, then it also
> puts "int" elements into the hitlist, so the schema appears to be correct
> it's just when using this regress function that something goes wrong and
> solr thinks the field is string.
>
> Any suggestions?
> Thanks!
>
>
>
> On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein  wrote:
>
>> The field type will also need to be in the schema:
>>
>>  
>>
>> 
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein 
>> wrote:
>>
>> > You'll need to have this field in your schema:
>> >
>> > 
>> >
>> > I'll check to see if the default schema used with solr start -c has this
>> > field, if not I'll add it. Thanks for pointing this out.
>> >
>> > I checked and right now the random expression is only accepting one fq,
>> > but I consider this a bug. It should accept multiple. I'll create ticket
>> > for getting this fixed.
>> >
>> >
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith 
>> wrote:
>> >
>> >> Joel, thanks for the pointers to the streaming feature. I had no idea
>> solr
>> >> had that (and also just discovered the very interesting sql feature! I
>> will
>> >> be sure to investigate that in more detail in the future).
>> >>
>> >> However I'm having some trouble getting basic streaming functions
>> working.
>> >> I've already figured out that I had to move to "solr cloud" instead of
>> >> "solr standalone" because I was getting errors about "cannot find zk
>> >> instance" or whatever which went away when using "solr start -c"
>> instead.
>> >>
>> >> But now I'm trying to use the random function since that was one of the
>> >> functions used in your example.
>> >>
>> >> random(tx_header, q="*:*", rows="100", fl="countyname")
>> >>
>> >> I posted that directly in the "stream" section of the solr admin UI.
>> This
>> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in
>> case
>> >> it was a bug in one)
>> >>
>> >> I get back an error message:
>> >> *sort param could not be parsed as a query, and is not a field that
>> exists
>> >> in the index: random_-255009774*
>> >>
>> >> I'm not passing in any sort field anywhere. But the solr logs show
>> these
>> >> three log entries:
>> >>
>> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
>> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
>> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
>> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2}
>> status=400
>> >> QTime=19
>> >>
>> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
>> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
>> >> Request to collection [tx_header] failed due to (400)
>> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> >> Error
>> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
>> could
>> >> not be parsed as a query, and is not a field that exists in the index:
>> >> random_-255009774, retry? 0
>> >>
>> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header 

Re: statistics in hitlist

2018-03-05 Thread John Smith
Thanks Joel for your help on this.

What I've done so far:
- unzip downloaded solr-7.2
- modify the _default "managed-schema" to add the random field type and the
dynamic random field
- start solr7 using "solr start -c"
- indexed my data using pint/pdouble/boolean field types etc

I can now run the random function all by itself, it returns random results
as expected. So far so good!

However... now trying to get the regression stuff working:

let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
fl="oil_first_90_days_production,oil_last_30_days_production"),
b=col(a, oil_first_90_days_production),
c=col(a, oil_last_30_days_production),
d=regress(b, c))

Posted directly into solr admin UI. Run the streaming expression and I get
this error message:
"EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
expected but found type java.lang.String for value
oil_first_90_days_production"

It thinks my numeric field is defined as a string? But when I view the
schema, those 2 fields are defined as ints:


When I run a normal query and choose xml as the output format, it also
puts "int" elements into the hitlist, so the schema appears to be correct.
It's just when using this regress function that something goes wrong and
solr thinks the field is a string.

Any suggestions?
Thanks!


On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein  wrote:

> The field type will also need to be in the schema:
>
>  
>
> 
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein  wrote:
>
> > You'll need to have this field in your schema:
> >
> > 
> >
> > I'll check to see if the default schema used with solr start -c has this
> > field, if not I'll add it. Thanks for pointing this out.
> >
> > I checked and right now the random expression is only accepting one fq,
> > but I consider this a bug. It should accept multiple. I'll create ticket
> > for getting this fixed.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith  wrote:
> >
> >> Joel, thanks for the pointers to the streaming feature. I had no idea
> solr
> >> had that (and also just discovered the very interesting sql feature! I
> will
> >> be sure to investigate that in more detail in the future).
> >>
> >> However I'm having some trouble getting basic streaming functions
> working.
> >> I've already figured out that I had to move to "solr cloud" instead of
> >> "solr standalone" because I was getting errors about "cannot find zk
> >> instance" or whatever which went away when using "solr start -c"
> instead.
> >>
> >> But now I'm trying to use the random function since that was one of the
> >> functions used in your example.
> >>
> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >>
> >> I posted that directly in the "stream" section of the solr admin UI.
> This
> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in
> case
> >> it was a bug in one)
> >>
> >> I get back an error message:
> >> *sort param could not be parsed as a query, and is not a field that
> exists
> >> in the index: random_-255009774*
> >>
> >> I'm not passing in any sort field anywhere. But the solr logs show these
> >> three log entries:
> >>
> >> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
> >> QTime=19
> >>
> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> >> Request to collection [tx_header] failed due to (400)
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> Error
> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
> could
> >> not be parsed as a query, and is not a field that exists in the index:
> >> random_-255009774, retry? 0
> >>
> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1]
> o.a.s.c.s.i.s.ExceptionStream
> >> java.io.IOException:
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> >> Error
> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param
> could
> >> not be parsed as a query, and is not a field that exists in the index:
> >> random_-255009774
> >>
> >>
> >> So basically it looks like solr is injecting the "sort=random_" stuff
> into
> >> my query and of course that is failing on the search since that
> >> field/column doesn't exist in my schema. Everytime I run the random
> >> function, I get a slightly different field name that it injects, but
> they
> >> all start with "random_" etc.
> >>
> >> I have tried 

Re: statistics in hitlist

2018-03-01 Thread Joel Bernstein
The field type will also need to be in the schema:

 




Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein  wrote:

> You'll need to have this field in your schema:
>
> 
>
> I'll check to see if the default schema used with solr start -c has this
> field, if not I'll add it. Thanks for pointing this out.
>
> I checked and right now the random expression is only accepting one fq,
> but I consider this a bug. It should accept multiple. I'll create ticket
> for getting this fixed.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 1, 2018 at 4:55 PM, John Smith  wrote:
>
>> Joel, thanks for the pointers to the streaming feature. I had no idea solr
>> had that (and also just discovered the very interesting sql feature! I will
>> be sure to investigate that in more detail in the future).
>>
>> However I'm having some trouble getting basic streaming functions working.
>> I've already figured out that I had to move to "solr cloud" instead of
>> "solr standalone" because I was getting errors about "cannot find zk
>> instance" or whatever which went away when using "solr start -c" instead.
>>
>> But now I'm trying to use the random function since that was one of the
>> functions used in your example.
>>
>> random(tx_header, q="*:*", rows="100", fl="countyname")
>>
>> I posted that directly in the "stream" section of the solr admin UI. This
>> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case
>> it was a bug in one)
>>
>> I get back an error message:
>> *sort param could not be parsed as a query, and is not a field that exists
>> in the index: random_-255009774*
>>
>> I'm not passing in any sort field anywhere. But the solr logs show these
>> three log entries:
>>
>> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
>> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
>> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
>> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
>> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
>> QTime=19
>>
>> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
>> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
>> Request to collection [tx_header] failed due to (400)
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error
>> from server at http://192.168.13.31:8983/solr/tx_header: sort param could
>> not be parsed as a query, and is not a field that exists in the index:
>> random_-255009774, retry? 0
>>
>> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
>> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
>> java.io.IOException:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error
>> from server at http://192.168.13.31:8983/solr/tx_header: sort param could
>> not be parsed as a query, and is not a field that exists in the index:
>> random_-255009774
>>
>>
>> So basically it looks like solr is injecting the "sort=random_" stuff into
>> my query and of course that is failing on the search since that
>> field/column doesn't exist in my schema. Everytime I run the random
>> function, I get a slightly different field name that it injects, but they
>> all start with "random_" etc.
>>
>> I have tried adding my own sort field instead, hoping solr wouldn't inject
>> one for me, but it still injected a random sort fieldname:
>> random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname
>> asc")
>>
>>
>> Assuming I can fix that whole problem, my second question is: can I add
>> multiple "fq=" parameters to the random function? I build a pretty
>> complicated query using many fq= fields, and then want to run some stats
>> on
>> that hitlist; so somehow I have to pass in the query that made up the
>> exact
>> hitlist to these various functions, but when I used multiple "fq=" values
>> it only seemed to use the last one I specified and just ignored all the
>> previous fq's?
>>
>> Thanks in advance for any comments/suggestions...!
>>
>>
>>
>>
>> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein 
>> wrote:
>>
>> > This is going to be a complex answer because Solr actually now has
>> multiple
>> > ways of doing regression analysis as part of the Streaming Expression
>> > statistical programming library. The basic documentation is here:
>> >
>> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
>> >
>> > Here is a sample expression that performs a simple linear regression in
>> > Solr 7.2:
>> >
>> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
>> > fieldB"),
>> > b=col(a, fieldA),
>> > c=col(a, fieldB),
>> > d=regress(b, c))
>> >
>> >
>> > The expression above takes a random sample of 15000 results from
>> > collection1. The result set will include fieldA and fieldB in each
>> record.
>> > The result set is 

Re: statistics in hitlist

2018-03-01 Thread Joel Bernstein
You'll need to have this field in your schema:



I'll check whether the default schema used with solr start -c has this
field; if not, I'll add it. Thanks for pointing this out.

I checked, and right now the random expression only accepts one fq, but
I consider this a bug. It should accept multiple. I'll create a ticket to
get this fixed.



Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Mar 1, 2018 at 4:55 PM, John Smith  wrote:

> Joel, thanks for the pointers to the streaming feature. I had no idea solr
> had that (and also just discovered the very interesting sql feature! I will
> be sure to investigate that in more detail in the future).
>
> However I'm having some trouble getting basic streaming functions working.
> I've already figured out that I had to move to "solr cloud" instead of
> "solr standalone" because I was getting errors about "cannot find zk
> instance" or whatever which went away when using "solr start -c" instead.
>
> But now I'm trying to use the random function since that was one of the
> functions used in your example.
>
> random(tx_header, q="*:*", rows="100", fl="countyname")
>
> I posted that directly in the "stream" section of the solr admin UI. This
> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case
> it was a bug in one)
>
> I get back an error message:
> *sort param could not be parsed as a query, and is not a field that exists
> in the index: random_-255009774*
>
> I'm not passing in any sort field anywhere. But the solr logs show these
> three log entries:
>
> 2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> [tx_header_shard1_replica_n1]  webapp=/solr path=/select
> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
> QTime=19
>
> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> Request to collection [tx_header] failed due to (400)
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error
> from server at http://192.168.13.31:8983/solr/tx_header: sort param could
> not be parsed as a query, and is not a field that exists in the index:
> random_-255009774, retry? 0
>
> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
> java.io.IOException:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
> Error
> from server at http://192.168.13.31:8983/solr/tx_header: sort param could
> not be parsed as a query, and is not a field that exists in the index:
> random_-255009774
>
>
> So basically it looks like solr is injecting the "sort=random_" stuff into
> my query and of course that is failing on the search since that
> field/column doesn't exist in my schema. Everytime I run the random
> function, I get a slightly different field name that it injects, but they
> all start with "random_" etc.
>
> I have tried adding my own sort field instead, hoping solr wouldn't inject
> one for me, but it still injected a random sort fieldname:
> random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname
> asc")
>
>
> Assuming I can fix that whole problem, my second question is: can I add
> multiple "fq=" parameters to the random function? I build a pretty
> complicated query using many fq= fields, and then want to run some stats on
> that hitlist; so somehow I have to pass in the query that made up the exact
> hitlist to these various functions, but when I used multiple "fq=" values
> it only seemed to use the last one I specified and just ignored all the
> previous fq's?
>
> Thanks in advance for any comments/suggestions...!
>
>
>
>
> On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein 
> wrote:
>
> > This is going to be a complex answer because Solr actually now has
> multiple
> > ways of doing regression analysis as part of the Streaming Expression
> > statistical programming library. The basic documentation is here:
> >
> > https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
> >
> > Here is a sample expression that performs a simple linear regression in
> > Solr 7.2:
> >
> > let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
> > fieldB"),
> > b=col(a, fieldA),
> > c=col(a, fieldB),
> > d=regress(b, c))
> >
> >
> > The expression above takes a random sample of 15000 results from
> > collection1. The result set will include fieldA and fieldB in each
> record.
> > The result set is stored in variable "a".
> >
> > Then the "col" function creates arrays of numbers from the results stored
> > in variable a. The values in fieldA are stored in the variable "b". The
> > values in fieldB are stored in variable "c".
> >
> > Then the regress function performs a simple linear regression on arrays
> > 

Re: statistics in hitlist

2018-03-01 Thread John Smith
Joel, thanks for the pointers to the streaming feature. I had no idea solr
had that (and also just discovered the very interesting sql feature! I will
be sure to investigate that in more detail in the future).

However I'm having some trouble getting basic streaming functions working.
I've already figured out that I had to move to "solr cloud" instead of
"solr standalone" because I was getting errors about "cannot find zk
instance" or whatever which went away when using "solr start -c" instead.

But now I'm trying to use the random function since that was one of the
functions used in your example.

random(tx_header, q="*:*", rows="100", fl="countyname")

I posted that directly in the "stream" section of the solr admin UI. This
is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case
it was a bug in one)

I get back an error message:
*sort param could not be parsed as a query, and is not a field that exists
in the index: random_-255009774*

I'm not passing in any sort field anywhere. But the solr logs show these
three log entries:

2018-03-01 21:41:18.954 INFO  (qtp257513673-21) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
[tx_header_shard1_replica_n1]  webapp=/solr path=/select
params={q=*:*&_stateVer_=tx_header:6&fl=countyname
*&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
QTime=19

2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
Request to collection [tx_header] failed due to (400)
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.13.31:8983/solr/tx_header: sort param could
not be parsed as a query, and is not a field that exists in the index:
random_-255009774, retry? 0

2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1
r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream
java.io.IOException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
from server at http://192.168.13.31:8983/solr/tx_header: sort param could
not be parsed as a query, and is not a field that exists in the index:
random_-255009774


So basically it looks like solr is injecting the "sort=random_" stuff into
my query, and of course that fails on the search since that
field/column doesn't exist in my schema. Every time I run the random
function, I get a slightly different field name that it injects, but they
all start with "random_".

I have tried adding my own sort field instead, hoping solr wouldn't inject
one for me, but it still injected a random sort fieldname:
random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname
asc")


Assuming I can fix that whole problem, my second question is: can I add
multiple "fq=" parameters to the random function? I build a pretty
complicated query using many fq= fields, and then want to run some stats on
that hitlist, so somehow I have to pass in the query that made up the exact
hitlist to these various functions. But when I used multiple "fq=" values,
it only seemed to use the last one I specified and ignored all the
previous ones.

Thanks in advance for any comments/suggestions...!




On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein  wrote:

> This is going to be a complex answer because Solr actually now has multiple
> ways of doing regression analysis as part of the Streaming Expression
> statistical programming library. The basic documentation is here:
>
> https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
>
> Here is a sample expression that performs a simple linear regression in
> Solr 7.2:
>
> let(a=random(collection1, q="any query", rows="15000", fl="fieldA,
> fieldB"),
> b=col(a, fieldA),
> c=col(a, fieldB),
> d=regress(b, c))
>
>
> The expression above takes a random sample of 15000 results from
> collection1. The result set will include fieldA and fieldB in each record.
> The result set is stored in variable "a".
>
> Then the "col" function creates arrays of numbers from the results stored
> in variable a. The values in fieldA are stored in the variable "b". The
> values in fieldB are stored in variable "c".
>
> Then the regress function performs a simple linear regression on arrays
> stored in variables "b" and "c".
>
> The output of the regress function is a map containing the regression
> result. This result includes RSquared and other attributes of the
> regression model such as R (correlation), slope, y intercept etc...
>
>
>
>
>
>
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Feb 23, 2018 at 3:10 PM, John Smith  wrote:
>
> > Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
> > result of all this is supposed to be obtaining R^2. Is there no way of
> > obtaining this value, then (short of iterating over all the results in
> the
> > hitlist and calculating it myself)?
> >
> 

Re: statistics in hitlist

2018-02-23 Thread Joel Bernstein
This is going to be a complex answer because Solr actually now has multiple
ways of doing regression analysis as part of the Streaming Expression
statistical programming library. The basic documentation is here:

https://lucene.apache.org/solr/guide/7_2/statistical-programming.html

Here is a sample expression that performs a simple linear regression in
Solr 7.2:

let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
b=col(a, fieldA),
c=col(a, fieldB),
d=regress(b, c))


The expression above takes a random sample of 15000 results from
collection1. The result set will include fieldA and fieldB in each record.
The result set is stored in variable "a".

Then the "col" function creates arrays of numbers from the results stored
in variable a. The values in fieldA are stored in the variable "b". The
values in fieldB are stored in variable "c".

Then the regress function performs a simple linear regression on arrays
stored in variables "b" and "c".

The output of the regress function is a map containing the regression
result. This result includes RSquared and other attributes of the
regression model such as R (correlation), slope and y-intercept.
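
For readers who want to see what the steps above amount to, here is a rough
Python sketch. It is not Solr code and not necessarily how regress is
implemented internally. The records list stands in for the sampled result
set in "a", the column helper plays the role of col, and simple_regression
is a hypothetical stand-in for regress. The sample values and the output key
names are made up for illustration.

```python
import math


def column(records, field):
    """Pull one numeric field out of every record, similar in spirit to col(a, field)."""
    return [float(r[field]) for r in records]


def simple_regression(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return {"slope": slope, "intercept": intercept, "R": r, "RSquared": r * r}


# Made-up sample standing in for the random(...) result set.
records = [
    {"fieldA": 1, "fieldB": 3},
    {"fieldA": 2, "fieldB": 5},
    {"fieldA": 3, "fieldB": 7},
    {"fieldA": 4, "fieldB": 9},
]

b = column(records, "fieldA")
c = column(records, "fieldB")
d = simple_regression(b, c)
print(d)
```

The sample data lies exactly on a line, so the fit is perfect; real data
would give an RSquared somewhere between 0 and 1.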









Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Feb 23, 2018 at 3:10 PM, John Smith  wrote:

> Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
> result of all this is supposed to be obtaining R^2. Is there no way of
> obtaining this value, then (short of iterating over all the results in the
> hitlist and calculating it myself)?
>
> On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein 
> wrote:
>
> > Typically SSE is the sum of the squared errors of the prediction in a
> > regression analysis. The stats component doesn't perform regression,
> > although it might be a nice feature.
> >
> >
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Fri, Feb 23, 2018 at 12:17 PM, John Smith 
> wrote:
> >
> > > I'm using solr, and enabling stats as per this page:
> > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> > >
> > > I want to get more stat values though. Specifically I'm looking for
> > > r-squared (coefficient of determination). This value is not present in
> > > solr, however some of the pieces used to calculate r^2 are in the stats
> > > element, for example:
> > >
> > > min: 0.0
> > > max: 10.0
> > > count: 15
> > > missing: 17
> > > sum: 85.0
> > > sumOfSquares: 603.0
> > > mean: 5.667
> > > stddev: 2.943920288775949
> > >
> > >
> > > So I have the sumOfSquares available (SST), and using this
> calculation, I
> > > can get R^2:
> > >
> > > R^2 = 1 - SSE/SST
> > >
> > > All I need then is SSE. Is there any way I can get SSE from those other
> > > stats in solr?
> > >
> > > Thanks in advance!
> > >
> >
>


Re: statistics in hitlist

2018-02-23 Thread John Smith
Hi Joel, thanks for the answer. I'm not really a stats guy, but the end
result of all this is supposed to be obtaining R^2. Is there no way of
obtaining this value, then (short of iterating over all the results in the
hitlist and calculating it myself)?

On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein  wrote:

> Typically SSE is the sum of the squared errors of the prediction in a
> regression analysis. The stats component doesn't perform regression,
> although it might be a nice feature.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Feb 23, 2018 at 12:17 PM, John Smith  wrote:
>
> > I'm using solr, and enabling stats as per this page:
> > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
> >
> > I want to get more stat values though. Specifically I'm looking for
> > r-squared (coefficient of determination). This value is not present in
> > solr, however some of the pieces used to calculate r^2 are in the stats
> > element, for example:
> >
> > min: 0.0
> > max: 10.0
> > count: 15
> > missing: 17
> > sum: 85.0
> > sumOfSquares: 603.0
> > mean: 5.667
> > stddev: 2.943920288775949
> >
> >
> > So I have the sumOfSquares available (SST), and using this calculation, I
> > can get R^2:
> >
> > R^2 = 1 - SSE/SST
> >
> > All I need then is SSE. Is there any way I can get SSE from those other
> > stats in solr?
> >
> > Thanks in advance!
> >
>


Re: statistics in hitlist

2018-02-23 Thread Joel Bernstein
Typically SSE is the sum of the squared errors of the prediction in a
regression analysis. The stats component doesn't perform regression,
although it might be a nice feature.
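
To make the definition concrete, here is a small Python sketch (plain
Python, not a Solr feature) that fits a least-squares line to made-up data
and then computes SSE, SST and R^2 = 1 - SSE/SST. The data and the function
name are assumptions for illustration only.

```python
def sse_sst_r2(x, y):
    """Fit y = slope * x + intercept by least squares, then return (SSE, SST, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (
        sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        / sum((xi - mean_x) ** 2 for xi in x)
    )
    intercept = mean_y - slope * mean_x
    predictions = [slope * xi + intercept for xi in x]
    # SSE: squared distance between each observed y and its prediction.
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    # SST: squared distance between each observed y and the mean of y.
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return sse, sst, 1.0 - sse / sst


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
sse, sst, r2 = sse_sst_r2(x, y)
print(sse, sst, r2)
```

Note that SSE needs the predictions, which in turn need a second variable
to regress against. That is why it cannot be read off single-field stats.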



Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Feb 23, 2018 at 12:17 PM, John Smith  wrote:

> I'm using solr, and enabling stats as per this page:
> https://lucene.apache.org/solr/guide/6_6/the-stats-component.html
>
> I want to get more stat values though. Specifically I'm looking for
> r-squared (coefficient of determination). This value is not present in
> solr, however some of the pieces used to calculate r^2 are in the stats
> element, for example:
>
> min: 0.0
> max: 10.0
> count: 15
> missing: 17
> sum: 85.0
> sumOfSquares: 603.0
> mean: 5.667
> stddev: 2.943920288775949
>
>
> So I have the sumOfSquares available (SST), and using this calculation, I
> can get R^2:
>
> R^2 = 1 - SSE/SST
>
> All I need then is SSE. Is there any way I can get SSE from those other
> stats in solr?
>
> Thanks in advance!
>


statistics in hitlist

2018-02-23 Thread John Smith
I'm using solr, and enabling stats as per this page:
https://lucene.apache.org/solr/guide/6_6/the-stats-component.html

I want to get more stat values though. Specifically I'm looking for
r-squared (coefficient of determination). This value is not present in
solr, however some of the pieces used to calculate r^2 are in the stats
element, for example:

min: 0.0
max: 10.0
count: 15
missing: 17
sum: 85.0
sumOfSquares: 603.0
mean: 5.667
stddev: 2.943920288775949


So I have the sumOfSquares available (SST), and using this calculation, I
can get R^2:

R^2 = 1 - SSE/SST

All I need then is SSE. Is there any way I can get SSE from those other
stats in solr?

Thanks in advance!
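
A side note on the numbers above, sketched in Python. The sumOfSquares
value reported by the stats component appears to be the raw sum of x^2
rather than the total sum of squares about the mean. Assuming that, SST can
be recovered from count, sum and sumOfSquares, and the result can be
cross-checked against the reported stddev (assuming the sample standard
deviation with n - 1 in the denominator). The variable names are mine.

```python
import math

# Values copied from the stats output quoted above.
count = 15
total = 85.0
sum_of_squares = 603.0
reported_stddev = 2.943920288775949

mean = total / count
# Total sum of squares about the mean: sum(x^2) - n * mean^2.
sst = sum_of_squares - count * mean ** 2
# Sample standard deviation derived from SST, for cross-checking.
stddev = math.sqrt(sst / (count - 1))

print(mean, sst, stddev)
```

The derived stddev agrees with the reported one, which supports reading
sumOfSquares as the raw sum of squares, so SST here is roughly 121.33
rather than 603.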