Re: Overriding Sort and boosting some docs to the top

2021-02-24 Thread Mark Robinson
Thanks Markus for your response.

Best,
Mark

On Wed, Feb 24, 2021 at 4:50 PM Markus Jelsma 
wrote:

> I would stick to the query elevation component; it is pretty fast, and it
> is easier to handle/configure elevation IDs with it than with function
> queries. We have customers that set a dozen documents for a given query
> and it works just fine.
>
> I also do not expect the function query variant to be more performant, but
> I am not sure. If it were, would it be measurable?
>
> Regards,
> Markus
>
> On Wed, Feb 24, 2021 at 12:15, Mark Robinson wrote:
>
> > Thanks for the reply Markus!
> >
> > I did try it.
> > My question specifically was (repasting here):-
> >
> > Which is more recommended/performant?
> >
> > Note:- Assume that I have hundreds of ids to boost like this.
> > Does the answer change if the number of docs to be boosted after the
> > sort is small?
> >
> > Thanks!
> > Mark
> >
> > On Wed, Feb 24, 2021 at 4:41 PM Markus Jelsma <
> markus.jel...@openindex.io>
> > wrote:
> >
> > > Hello,
> > >
> > > You are probably looking for the elevation component, check it out:
> > >
> >
> https://lucene.apache.org/solr/guide/8_8/the-query-elevation-component.html
> > >
> > > Regards,
> > > Markus
> > >
> > > On Wed, Feb 24, 2021 at 11:59, Mark Robinson <mark123lea...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I wanted to sort and then boost some docs to the top, so that these
> > > > docs are the first set in the results and the following ones appear
> > > > according to my sort criteria.
> > > >
> > > > I understand that sort overrides bq, hence bq may not be used in this
> > > > case.
> > > >
> > > > - I brought my boost into sort using "query()" and achieved my goal.
> > > > - I tried sort and then elevate with forceElevation, and that also
> > > > worked.
> > > >
> > > > My question is which is more recommended/performant?
> > > >
> > > > Note:- Assume that I have hundreds of ids to boost like this.
> > > > Does the answer change if the number of docs to be boosted after the
> > > > sort is small?
> > > >
> > > > Could someone please share your thoughts/experience?
> > > >
> > > > Thanks!
> > > > Mark.
> > > >
> > >
> >
>


Re: Overriding Sort and boosting some docs to the top

2021-02-24 Thread Mark Robinson
Thanks for the reply Markus!

I did try it.
My question specifically was (repasting here):-

Which is more recommended/performant?

Note:- Assume that I have hundreds of ids to boost like this.
Does the answer change if the number of docs to be boosted after the sort is
small?

Thanks!
Mark

On Wed, Feb 24, 2021 at 4:41 PM Markus Jelsma 
wrote:

> Hello,
>
> You are probably looking for the elevation component, check it out:
> https://lucene.apache.org/solr/guide/8_8/the-query-elevation-component.html
>
> Regards,
> Markus
>
> On Wed, Feb 24, 2021 at 11:59, Mark Robinson wrote:
>
> > Hi,
> >
> > I wanted to sort and then boost some docs to the top, so that these docs
> > are the first set in the results and the following ones appear according
> > to my sort criteria.
> >
> > I understand that sort overrides bq, hence bq may not be used in this case.
> >
> > - I brought my boost into sort using "query()" and achieved my goal.
> > - I tried sort and then elevate with forceElevation and that also worked.
> >
> > My question is which is more recommended/performant?
> >
> > Note:- Assume that I have hundreds of ids to boost like this.
> > Does the answer change if the number of docs to be boosted after the
> > sort is small?
> >
> > Could someone please share your thoughts/experience?
> >
> > Thanks!
> > Mark.
> >
>


Overriding Sort and boosting some docs to the top

2021-02-24 Thread Mark Robinson
Hi,

I wanted to sort and then boost some docs to the top, so that these docs
are the first set in the results and the following ones appear according
to my sort criteria.

I understand that sort overrides bq, hence bq may not be used in this case.

- I brought my boost into sort using "query()" and achieved my goal.
- I tried sort and then elevate with forceElevation, and that also worked.
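
Roughly, the two variants look like this (the ids and the sort field here
are illustrative, not from my real schema):

  Variant 1, boost folded into the sort via the query() function:
    elevated=id:(101 OR 102 OR 103)
    sort=query($elevated,0) desc, price asc

  Variant 2, elevation forced on top of the sort:
    sort=price asc
    enableElevation=true&forceElevation=true&elevateIds=101,102,103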

My question is which is more recommended/performant?

Note:- Assume that I have hundreds of ids to boost like this.
Does the answer change if the number of docs to be boosted after the sort is
small?

Could someone please share your thoughts/experience?

Thanks!
Mark.


Re: Avoiding single digit and single character ONLY query by putting them in stopwords list

2020-10-27 Thread Mark Robinson
Thanks!

Mark

On Tue, Oct 27, 2020 at 11:56 AM Dave  wrote:

> Agreed. Just a JavaScript check on the input box would work fine for 99%
> of cases, unless something automatic is running them, in which case just
> do a server-side redirect back to the form.
>
> > On Oct 27, 2020, at 11:54 AM, Mark Robinson 
> wrote:
> >
> > Hi  Konstantinos ,
> >
> > Thanks for the reply.
> > I too feel the same. Wanted to find what others also in the Solr world
> > thought about it.
> >
> > Thanks!
> > Mark.
> >
> >> On Tue, Oct 27, 2020 at 11:45 AM Konstantinos Koukouvis <
> >> konstantinos.koukou...@mecenat.com> wrote:
> >>
> >> Oh hi Mark!
> >>
> >> Why would you want to do such a thing on the Solr end? IMHO it would be
> >> much cleaner and easier to do it on the client side.
> >>
> >> Regards,
> >> Konstantinos
> >>
> >>
> >>>> On 27 Oct 2020, at 16:42, Mark Robinson 
> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I want to block queries having only a digit like "1" or "2", ... or
> >>> just a letter like "a" or "b" ...
> >>>
> >>> Is it a good idea to block them, i.e. just single digits 0-9 and a-z,
> >>> by putting them in as stop words? The problem with this, I anticipate,
> >>> is that a query like "1 inch screw" can have the important information
> >>> "1" stripped out if I tokenize it.
> >>>
> >>> So what would be a good way to avoid single-digit-only and
> >>> single-letter-only queries, from the Solr end?
> >>> Or should I not do this at the Solr end at all?
> >>>
> >>> Could someone please share your thoughts?
> >>>
> >>> Thanks!
> >>> Mark
> >>
> >> ==
> >> Konstantinos Koukouvis
> >> konstantinos.koukou...@mecenat.com
> >>
> >> Using Golang and Solr? Try this: https://github.com/mecenat/solr
> >>
> >>
> >>
> >>
> >>
> >>
>


Re: Avoiding single digit and single character ONLY query by putting them in stopwords list

2020-10-27 Thread Mark Robinson
Hi  Konstantinos ,

Thanks for the reply.
I too feel the same. Wanted to find what others also in the Solr world
thought about it.

Thanks!
Mark.

On Tue, Oct 27, 2020 at 11:45 AM Konstantinos Koukouvis <
konstantinos.koukou...@mecenat.com> wrote:

> Oh hi Mark!
>
> Why would you want to do such a thing on the Solr end? IMHO it would be
> much cleaner and easier to do it on the client side.
>
> Regards,
> Konstantinos
>
>
> > On 27 Oct 2020, at 16:42, Mark Robinson  wrote:
> >
> > Hello,
> >
> > I want to block queries having only a digit like "1" or "2", ... or
> > just a letter like "a" or "b" ...
> >
> > Is it a good idea to block them, i.e. just single digits 0-9 and a-z,
> > by putting them in as stop words? The problem with this, I anticipate,
> > is that a query like "1 inch screw" can have the important information
> > "1" stripped out if I tokenize it.
> >
> > So what would be a good way to avoid single-digit-only and
> > single-letter-only queries, from the Solr end?
> > Or should I not do this at the Solr end at all?
> >
> > Could someone please share your thoughts?
> >
> > Thanks!
> > Mark
>
> ==
> Konstantinos Koukouvis
> konstantinos.koukou...@mecenat.com
>
> Using Golang and Solr? Try this: https://github.com/mecenat/solr
>
>
>
>
>
>


Avoiding single digit and single character ONLY query by putting them in stopwords list

2020-10-27 Thread Mark Robinson
Hello,

I want to block queries having only a digit like "1" or "2", ... or
just a letter like "a" or "b" ...

Is it a good idea to block them, i.e. just single digits 0-9 and a-z, by
putting them in as stop words? The problem with this, I anticipate, is that
a query like "1 inch screw" can have the important information "1" stripped
out if I tokenize it.
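
If the check does move to the client or a fronting service, a minimal
sketch (assuming a Java front end; the class and method names are made up)
would reject only a query whose entire trimmed text is one letter or digit,
so "1 inch screw" passes through untouched:

  import java.util.regex.Pattern;

  public class QueryGuard {
      // matches only when the whole query is a single ASCII letter or digit
      private static final Pattern SINGLE_CHAR = Pattern.compile("^[0-9A-Za-z]$");

      public static boolean shouldBlock(String q) {
          return q != null && SINGLE_CHAR.matcher(q.trim()).matches();
      }
  }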

So what would be a good way to avoid single-digit-only and single-letter-only
queries, from the Solr end?
Or should I not do this at the Solr end at all?

Could someone please share your thoughts?

Thanks!
Mark


ElevateIds - should I remove those that might be filtered off in the underlying query

2020-10-19 Thread Mark Robinson
Hi,

Suppose I have, say, 50 elevateIds and I have a way to identify those that
would get filtered out of the query by predefined fqs. In reality they would
never even be in the results, and hence never be elevated.

Is there any performance advantage to omitting them when building the
elevateIds list, or does leaving them in make no performance difference?
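
For concreteness, the kind of request I mean looks roughly like this (the
ids and the fq are illustrative); some of the listed ids would never survive
the fq anyway:

  /select?q=shoes&fq=storeId:123&enableElevation=true&elevateIds=A1,B2,C3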

Thanks!
Mark


Re: "timeAllowed" param with "numFound" having a count value but doc list is empty

2020-09-16 Thread Mark Robinson
Thanks Colvin!
All the responses were helpful.

Best
Mark

On Wed, Sep 16, 2020 at 4:06 AM Colvin Cowie 
wrote:

> Hi Mark,
>
> If queries taking 10 (or however many) seconds isn't acceptable, then
> either you need to a) prevent or optimize those queries, b) improve the
> performance of your index, c) use timeAllowed and accept that queries
> taking that long may fail or provide incomplete results, or d) a
> combination of the above.
>
> If you use timeAllowed then you have to accept the possibility that a query
> won't complete within the time allowed. Therefore you need to be able to
> deal with the possibility of the query failing or of it returning
> incomplete results.
>
> In our use of Solr, if a query exceeds timeAllowed we always treat it as a
> failure, even if it might have returned partial results, and return a 5xx
> response from our own server since we don't want to serve incomplete
> results ever. But you could attempt to return whatever results you do
> receive, perhaps with a warning message for your client indicating what
> happened.
>
>
> On Wed, 16 Sep 2020 at 02:05, Mark Robinson 
> wrote:
>
> > Thanks Dominique!
> > So is this parameter generally recommended or not. I wanted to try with a
> > value of 10s. We are not using it now.
> > My goal is to prevent a query from running more than 10s on the solr
> server
> > and choking it.
> >
> > What is the general recommendation.
> >
> > Thanks!
> > Mark
> >
> > On Tue, Sep 15, 2020 at 5:38 PM Dominique Bejean <
> > dominique.bej...@eolya.fr>
> > wrote:
> >
> > > Hi,
> > >
> > > 1. Yes, your analysis is correct
> > >
> > > 2. Yes, it can occur too with a very slow query.
> > >
> > > Regards
> > >
> > > Dominique
> > >
> > > On Tue, Sep 15, 2020 at 15:14, Mark Robinson wrote:
> > >
> > > > Hi,
> > > >
> > > > When in a sample query I used "timeAllowed" as low as 10 ms, I got a
> > > > value for "numFound" of say 2000, but no docs were returned. But when
> > > > I increased the value for timeAllowed to be in seconds, I never got
> > > > this scenario.
> > > >
> > > >
> > > >
> > > > I have 2 questions:-
> > > >
> > > > 1. Why does numFound have a value like say 2000 or even 6000 but no
> > > > documents are actually returned? During document collection, is the
> > > > calculation of numFound done first and doc collection later? Is the
> > > > doc list empty because, by the time doc collection started, the
> > > > timeAllowed cut-off took effect?
> > > >
> > > >
> > > >
> > > > 2. If I give timeAllowed a value of, say, 10s or above, do you think
> > > > the above scenario (a valid count in numFound but an empty doc list)
> > > > can still occur, given there is more time before the cut-off to
> > > > retrieve at least one doc?
> > > >
> > > >
> > > >
> > > > Thanks!
> > > >
> > > > Mark
> > > >
> > > >
> > >
> >
>


Re: "timeAllowed" param with "numFound" having a count value but doc list is empty

2020-09-16 Thread Mark Robinson
Thanks much Bram!

Best,
Mark

On Wed, Sep 16, 2020 at 3:59 AM Bram Van Dam  wrote:

> There are a couple of open issues related to the timeAllowed parameter.
> For instance, it currently doesn't work in conjunction with the
> cursorMark parameter [1]. And on Solr 7 it doesn't work at all [2].
>
> But other than that, when users have a lot of query flexibility, it's a
> pretty good idea to limit them somehow. You don't want your users to
> blow up your servers.
>
> [1] https://issues.apache.org/jira/browse/SOLR-14413
>
> [2] https://issues.apache.org/jira/browse/SOLR-9882
>
>  - Bram
>
> On 16/09/2020 03:04, Mark Robinson wrote:
> > Thanks Dominique!
> > So is this parameter generally recommended or not? I wanted to try with a
> > value of 10s. We are not using it now.
> > My goal is to prevent a query from running more than 10s on the Solr
> > server and choking it.
> >
> > What is the general recommendation?
> >
> > Thanks!
> > Mark
> >
> > On Tue, Sep 15, 2020 at 5:38 PM Dominique Bejean <
> dominique.bej...@eolya.fr>
> > wrote:
> >
> >> Hi,
> >>
> >> 1. Yes, your analysis is correct
> >>
> >> 2. Yes, it can occur too with a very slow query.
> >>
> >> Regards
> >>
> >> Dominique
> >>
> >> On Tue, Sep 15, 2020 at 15:14, Mark Robinson wrote:
> >>
> >>> Hi,
> >>>
> >>> When in a sample query I used "timeAllowed" as low as 10 ms, I got a
> >>> value for "numFound" of say 2000, but no docs were returned. But when I
> >>> increased the value for timeAllowed to be in seconds, I never got this
> >>> scenario.
> >>>
> >>>
> >>>
> >>> I have 2 questions:-
> >>>
> >>> 1. Why does numFound have a value like say 2000 or even 6000 but no
> >>> documents are actually returned? During document collection, is the
> >>> calculation of numFound done first and doc collection later? Is the doc
> >>> list empty because, by the time doc collection started, the timeAllowed
> >>> cut-off took effect?
> >>>
> >>>
> >>>
> >>> 2. If I give timeAllowed a value of, say, 10s or above, do you think
> >>> the above scenario (a valid count in numFound but an empty doc list)
> >>> can still occur, given there is more time before the cut-off to
> >>> retrieve at least one doc?
> >>>
> >>>
> >>>
> >>> Thanks!
> >>>
> >>> Mark
> >>>
> >>>
> >>
> >
>
>


Re: "timeAllowed" param with "numFound" having a count value but doc list is empty

2020-09-15 Thread Mark Robinson
Thanks Dominique!
So is this parameter generally recommended or not? I wanted to try with a
value of 10s. We are not using it now.
My goal is to prevent a query from running more than 10s on the Solr server
and choking it.

What is the general recommendation?

Thanks!
Mark

On Tue, Sep 15, 2020 at 5:38 PM Dominique Bejean 
wrote:

> Hi,
>
> 1. Yes, your analysis is correct
>
> 2. Yes, it can occur too with a very slow query.
>
> Regards
>
> Dominique
>
> On Tue, Sep 15, 2020 at 15:14, Mark Robinson wrote:
>
> > Hi,
> >
> > When in a sample query I used "timeAllowed" as low as 10 ms, I got a
> > value for "numFound" of say 2000, but no docs were returned. But when I
> > increased the value for timeAllowed to be in seconds, I never got this
> > scenario.
> >
> >
> >
> > I have 2 questions:-
> >
> > 1. Why does numFound have a value like say 2000 or even 6000 but no
> > documents are actually returned? During document collection, is the
> > calculation of numFound done first and doc collection later? Is the doc
> > list empty because, by the time doc collection started, the timeAllowed
> > cut-off took effect?
> >
> >
> >
> > 2. If I give timeAllowed a value of, say, 10s or above, do you think the
> > above scenario (a valid count in numFound but an empty doc list) can
> > still occur, given there is more time before the cut-off to retrieve at
> > least one doc?
> >
> >
> >
> > Thanks!
> >
> > Mark
> >
> >
>


"timeAllowed" param with "numFound" having a count value but doc list is empty

2020-09-15 Thread Mark Robinson
Hi,
When in a sample query I used "timeAllowed" as low as 10 ms, I got a value
for "numFound" of say 2000, but no docs were returned. But when I increased
the value for timeAllowed to be in seconds, I never got this scenario.

I have 2 questions:-
1. Why does numFound have a value like say 2000 or even 6000 but no
documents are actually returned? During document collection, is the
calculation of numFound done first and doc collection later? Is the doc list
empty because, by the time doc collection started, the timeAllowed cut-off
took effect?

2. If I give timeAllowed a value of, say, 10s or above, do you think the
above scenario (a valid count in numFound but an empty doc list) can still
occur, given there is more time before the cut-off to retrieve at least one
doc?
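
For reference, the truncated case looked roughly like this (the numbers are
illustrative; partialResults is the flag Solr sets in the response header
when the limit is hit):

  /select?q=*:*&timeAllowed=10

  "responseHeader": {"status": 0, "QTime": 12, "partialResults": true},
  "response": {"numFound": 2000, "start": 0, "docs": []}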

Thanks!
Mark


Re: What is the Best way to block certain types of queries/ query patterns in Solr?

2020-09-08 Thread Mark Robinson
Makes sense.
Thanks much David!

Mark

On Fri, Sep 4, 2020 at 12:13 AM David Smiley  wrote:

> The general assumption in deploying a search platform is that you are going
> to front it with a service you write that has the search features you care
> about, and only those.  Only this service or other administrative functions
> should reach Solr.  Be wary of making your service so flexible to support
> arbitrary parameters you pass to Solr as-is that you don't know about in
> advance (i.e. use an allow-list).
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Aug 31, 2020 at 10:57 AM Mark Robinson 
> wrote:
>
> > Hi,
> > I had come across a mail (Oct, 2019 one) which suggested the best way is
> to
> > handle it before it reaches Solr. I was curious whether:-
> > 1. A Jetty query filter can be used (came across something like
> > that, need to check)
> > 2. There are any new features in Solr itself (like in a request
> > handler... or solrconfig, schema etc.)
> >
> > Thanks!
> > Mark
> >
>


What is the Best way to block certain types of queries/ query patterns in Solr?

2020-08-31 Thread Mark Robinson
Hi,
I had come across a mail (an Oct 2019 one) which suggested the best way is
to handle it before it reaches Solr (see the sketch below). I was curious
whether:-
   1. A Jetty query filter can be used (came across something like
that, need to check)
   2. There are any new features in Solr itself (like in a request
handler... or solrconfig, schema etc.)
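
A minimal sketch of the allow-list idea from David's reply in this thread
(hypothetical class and parameter names; Java 9+ for Set.of) would forward
only the parameters the fronting service knows about:

  import java.util.Map;
  import java.util.Set;
  import java.util.stream.Collectors;

  public class ParamAllowList {
      // only these parameters are ever forwarded to Solr
      private static final Set<String> ALLOWED = Set.of("q", "start", "rows", "sort");

      public static Map<String, String> filter(Map<String, String> incoming) {
          return incoming.entrySet().stream()
                  .filter(e -> ALLOWED.contains(e.getKey()))
                  .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
      }
  }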

Thanks!
Mark


Re: HttpShardHandlerFactory

2019-08-20 Thread Mark Robinson
Hello Michael,

Thank you for pointing that out.
Today I am planning to try this out along with the insights Shawn had
shared.

Thanks!
Mark.

On Mon, Aug 19, 2019 at 9:21 AM Michael Gibney 
wrote:

> Mark,
>
> Another thing to check is that I believe the configuration you posted may
> not actually be taking effect. Unless I'm mistaken, I think the correct
> element name to configure the shardHandler is "shardHandler*Factory*", not
> "shardHandler" ... as in, ' class="HttpShardHandlerFactory">...'
>
> The element name is documented correctly in the refGuide page for "Format
> of solr.xml":
>
> https://lucene.apache.org/solr/guide/8_1/format-of-solr-xml.html#the-shardhandlerfactory-element
>
> ... but the incorrect (?) element name is included in the refGuide page for
> "Distributed Requests":
>
> https://lucene.apache.org/solr/guide/8_1/distributed-requests.html#configuring-the-shardhandlerfactory
>
> Michael
>
> On Fri, Aug 16, 2019 at 9:40 AM Shawn Heisey  wrote:
>
> > On 8/16/2019 3:51 AM, Mark Robinson wrote:
> > > I am trying to understand the socket time out and connection time out
> in
> > > the HttpShardHandlerFactory:-
> > >
> > > <shardHandler class="HttpShardHandlerFactory">
> > >   <connTimeout>10</connTimeout>
> > >   <socketTimeout>20</socketTimeout>
> > > </shardHandler>
> >
> > The shard handler is used when that Solr instance needs to make
> > connections to another Solr instance (which could be itself, as odd as
> > that might sound).  It does not apply to the requests that you make from
> > outside Solr.
> >
> > > 1. Could someone please help me understand the effect of using such low
> > > values of 10 ms and 20 ms, as given above, inside my /select handler?
> >
> > A connection timeout of 10 milliseconds *might* result in connections
> > not establishing at all.  This is translated down to the TCP socket as
> > the TCP connection timeout -- the time limit imposed on making the TCP
> > connection itself.  Which as I understand it, is the completion of the
> > "SYN", "SYN/ACK", and "ACK" sequence.  If the two endpoints of the
> > connection are on a LAN, you might never see a problem from this -- LAN
> > connections are very low latency.  But if they are across the Internet,
> > they might never work.
> >
> > The socket timeout of 20 milliseconds means that if the connection goes
> > idle for 20 milliseconds, it will be forcibly closed.  So if it took 25
> > milliseconds for the remote Solr instance to respond, this Solr instance
> > would have given up and closed the connection.  It is extremely common
> > for requests to take 100, 500, 2000, or more milliseconds to respond.
> >
> > > 2. What are the guidelines for setting these parameters? Should they be
> > > low or high?
> >
> > I would probably use a value of about 5000 (five seconds) for the
> > connection timeout if everything's on a local LAN.  I might go as high
> > as 15 seconds if there's a high latency network between them, but five
> > seconds is probably long enough too.
> >
> > For the socket timeout, you want a value that's considerably longer than
> > you expect requests to ever take.  Probably somewhere between two and
> > five minutes.
> >
> > > 3. How can I test the effect of this chunk of config after adding it to
> > > my /select handler? I.e., I want to make sure the above snippet is
> > > working. That is why I gave such low values, thinking that when I fired
> > > a query I would get both timeout errors in the logs. But I did not!
> > > Or is it that, within the above time frames (10 ms, 20 ms), if no
> > > request comes, the socket will time out and the connection will be
> > > lost? So to test this, should I give a load of say 100 TPS with these
> > > low values, and then increase the values to maybe 1000 ms and 1500 ms
> > > respectively, and see fewer timeout error messages?
> >
> > If you were running a multi-server SolrCloud setup (or a single-server
> > setup with multiple shards and/or replicas), you probably would see
> > problems from values that low.  But if Solr never has any need to make
> > connections to satisfy a request, then the values will never take effect.
> >
> > If you want to control these values for requests made from outside Solr,
> > you will need to do it in your client software that is making the
> request.
> >
> > Thanks,
> > Shawn
> >
>


Re: HttpShardHandlerFactory

2019-08-20 Thread Mark Robinson
Hello Shawn,

Thank you so much for the detailed response.
It was so helpful!

Thanks!
Mark.

On Fri, Aug 16, 2019 at 9:40 AM Shawn Heisey  wrote:

> On 8/16/2019 3:51 AM, Mark Robinson wrote:
> > I am trying to understand the socket time out and connection time out in
> > the HttpShardHandlerFactory:-
> >
> > <shardHandler class="HttpShardHandlerFactory">
> >   <connTimeout>10</connTimeout>
> >   <socketTimeout>20</socketTimeout>
> > </shardHandler>
>
> The shard handler is used when that Solr instance needs to make
> connections to another Solr instance (which could be itself, as odd as
> that might sound).  It does not apply to the requests that you make from
> outside Solr.
>
> > 1. Could someone please help me understand the effect of using such low
> > values of 10 ms and 20 ms, as given above, inside my /select handler?
>
> A connection timeout of 10 milliseconds *might* result in connections
> not establishing at all.  This is translated down to the TCP socket as
> the TCP connection timeout -- the time limit imposed on making the TCP
> connection itself.  Which as I understand it, is the completion of the
> "SYN", "SYN/ACK", and "ACK" sequence.  If the two endpoints of the
> connection are on a LAN, you might never see a problem from this -- LAN
> connections are very low latency.  But if they are across the Internet,
> they might never work.
>
> The socket timeout of 20 milliseconds means that if the connection goes
> idle for 20 milliseconds, it will be forcibly closed.  So if it took 25
> milliseconds for the remote Solr instance to respond, this Solr instance
> would have given up and closed the connection.  It is extremely common
> for requests to take 100, 500, 2000, or more milliseconds to respond.
>
> > 2. What are the guidelines for setting these parameters? Should they be
> > low or high?
>
> I would probably use a value of about 5000 (five seconds) for the
> connection timeout if everything's on a local LAN.  I might go as high
> as 15 seconds if there's a high latency network between them, but five
> seconds is probably long enough too.
>
> For the socket timeout, you want a value that's considerably longer than
> you expect requests to ever take.  Probably somewhere between two and
> five minutes.
>
> > 3. How can I test the effect of this chunk of config after adding it to
> > my /select handler? I.e., I want to make sure the above snippet is
> > working. That is why I gave such low values, thinking that when I fired a
> > query I would get both timeout errors in the logs. But I did not!
> > Or is it that, within the above time frames (10 ms, 20 ms), if no request
> > comes, the socket will time out and the connection will be lost? So to
> > test this, should I give a load of say 100 TPS with these low values, and
> > then increase the values to maybe 1000 ms and 1500 ms respectively, and
> > see fewer timeout error messages?
>
> If you were running a multi-server SolrCloud setup (or a single-server
> setup with multiple shards and/or replicas), you probably would see
> problems from values that low.  But if Solr never has any need to make
> connections to satisfy a request, then the values will never take effect.
>
> If you want to control these values for requests made from outside Solr,
> you will need to do it in your client software that is making the request.
>
> Thanks,
> Shawn
>


HttpShardHandlerFactory

2019-08-16 Thread Mark Robinson
Hello,

I am trying to understand the socket time out and connection time out in
the HttpShardHandlerFactory:-

   
   <shardHandler class="HttpShardHandlerFactory">
      <connTimeout>10</connTimeout>
      <socketTimeout>20</socketTimeout>
   </shardHandler>
   

1. Could someone please help me understand the effect of using such low
values of 10 ms and 20 ms, as given above, inside my /select handler?

2. What are the guidelines for setting these parameters? Should they be low
or high?

3. How can I test the effect of this chunk of config after adding it to my
/select handler? I.e., I want to make sure the above snippet is working.
That is why I gave such low values, thinking that when I fired a query I
would get both timeout errors in the logs. But I did not!
Or is it that, within the above time frames (10 ms, 20 ms), if no request
comes, the socket will time out and the connection will be lost? So to test
this, should I give a load of say 100 TPS with these low values, and then
increase the values to maybe 1000 ms and 1500 ms respectively, and see
fewer timeout error messages?

I am trying to understand how these parameters can be put to good use.
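
For the archive: per the replies in this thread, the element name in
solr.xml should be shardHandlerFactory, and the values want to be far higher
than my test ones. A sketch along the lines Shawn suggested (about five
seconds to connect, a couple of minutes on the socket; exact values are
judgment calls, not fixed rules):

  <shardHandlerFactory class="HttpShardHandlerFactory">
    <connTimeout>5000</connTimeout>
    <socketTimeout>120000</socketTimeout>
  </shardHandlerFactory>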

Thanks!
Mark


Re: Solr restricting time-consuming/heavy processing queries

2019-08-13 Thread Mark Robinson
Thank you Jan for the reply.
I will try it out.

Best,
Mark.

On Mon, Aug 12, 2019 at 6:29 PM Jan Høydahl  wrote:

> I have never used such settings, but you could check out
> https://lucene.apache.org/solr/guide/8_1/common-query-parameters.html#segmentterminateearly-parameter
> which will allow you to pre-sort the index so that any early termination
> will actually return the most relevant docs. This will probably be easier
> to setup once https://issues.apache.org/jira/browse/SOLR-13681 is done.
>
> According to that same page you will not be able to abort long-running
> faceting using timeAllowed, but there are other ways to optimize faceting,
> such as using jsonFacet, threaded execution etc.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> 12. aug. 2019 kl. 23:10 skrev Mark Robinson :
>
> Hi Jan,
>
> Thanks for the reply.
> Our normal search time is within 650 ms.
> We were analyzing some queries and found that few of them were like 14675
> ms, 13767 ms etc...
> So was curious to see whether we have some way to restrict the query to
> not run beyond say 5s or some ideal timing  in SOLR even if it returns only
> partial results.
>
> That is how I came across the "timeAllowed" and wanted to check on it.
> Also was curious to know whether "shardHandler" could be used to work
> along those lines, or whether it is meant for totally different functionality.
>
> Thanks!
> Best,
> Mark
>
>
> On Sun, Aug 11, 2019 at 8:17 AM Jan Høydahl  wrote:
>
>> What is the root use case you are trying to solve? What kind of solr
>> install is this and do you not have control over the clients or what is the
>> reason that users overload your servers?
>>
>> Normally you would scale the cluster to handle normal expected load
>> instead of trying to give users timeout exceptions. What kind of query
>> times do you experience that are above 1s and are these not important
>> enough to invest extra HW? Trying to understand the real reason behind your
>> questions.
>>
>> Jan Høydahl
>>
>> > 11. aug. 2019 kl. 11:43 skrev Mark Robinson :
>> >
>> > Hello,
>> > Could someone share their thoughts please or point to some link that
>> helps
>> > understand my above queries?
>> > In the Solr documentation I came across a few lines on timeAllowed and
>> > shardHandler, but if there was an example scenario for both it would
>> help
>> > understand them more thoroughly.
>> > Also curious to know different ways, if any, in SOLR to restrict/limit a
>> > time-consuming query from processing for a long time.
>> >
>> > Thanks!
>> > Mark
>> >
>> > On Fri, Aug 9, 2019 at 2:15 PM Mark Robinson 
>> > wrote:
>> >
>> >>
>> >> Hello,
>> >> I have the following questions please:-
>> >>
>> >> In solrconfig.xml I created a new "/selecttimeout" handler copying
>> >> "/select" handler and added the following to my new "/selecttimeout":-
>> >>   <shardHandler class="HttpShardHandlerFactory">
>> >>     <socketTimeout>10</socketTimeout>
>> >>     <connTimeout>20</connTimeout>
>> >>   </shardHandler>
>> >>
>> >> 1.
>> >> Does the above mean that if I don't get a request once in 10 ms on the
>> >> socket handling the /selecttimeout handler, that socket will be closed?
>> >>
>> >> 2.
>> >> Same with connTimeout? I.e. the connection object remains live only if
>> >> at least a connection request comes once in every 20 ms; if not the object
>> >> gets closed?
>> >>
>> >> Suppose a time-consuming query (say with lots of facets etc.) is fired
>> >> against SOLR. How can I prevent Solr from processing it for more than 1s?
>> >>
>> >> 3.
>> >> Is this achieved by setting timeAllowed=1000?  Or are there any other
>> ways
>> >> to do this in Solr?
>> >>
>> >> 4
>> >> For the same purpose, to prevent heavy queries overloading SOLR, does
>> >> the <shardHandler> above help in any way, or is it that shardHandler has
>> >> nothing to restrict a query once fired against Solr?
>> >>
>> >>
>> >> Could someone pls share your views?
>> >>
>> >> Thanks!
>> >> Mark
>> >>
>>
>
>


Re: Solr restricting time-consuming/heavy processing queries

2019-08-12 Thread Mark Robinson
Hi Jan,

Thanks for the reply.
Our normal search time is within 650 ms.
We were analyzing some queries and found that a few of them were like 14675
ms, 13767 ms etc...
So was curious to see whether we have some way to restrict the query to not
run beyond say 5s or some ideal timing  in SOLR even if it returns only
partial results.

That is how I came across the "timeAllowed" and wanted to check on it.
Also was curious to know whether "shardHandler" could be used to work along
those lines, or whether it is meant for totally different functionality.

Thanks!
Best,
Mark


On Sun, Aug 11, 2019 at 8:17 AM Jan Høydahl  wrote:

> What is the root use case you are trying to solve? What kind of solr
> install is this and do you not have control over the clients or what is the
> reason that users overload your servers?
>
> Normally you would scale the cluster to handle normal expected load
> instead of trying to give users timeout exceptions. What kind of query
> times do you experience that are above 1s and are these not important
> enough to invest extra HW? Trying to understand the real reason behind your
> questions.
>
> Jan Høydahl
>
> > 11. aug. 2019 kl. 11:43 skrev Mark Robinson :
> >
> > Hello,
> > Could someone share their thoughts please or point to some link that
> helps
> > understand my above queries?
> > In the Solr documentation I came across a few lines on timeAllowed and
> > shardHandler, but if there was an example scenario for both it would help
> > understand them more thoroughly.
> > Also curious to know different ways, if any, in SOLR to restrict/limit a
> > time-consuming query from processing for a long time.
> >
> > Thanks!
> > Mark
> >
> > On Fri, Aug 9, 2019 at 2:15 PM Mark Robinson 
> > wrote:
> >
> >>
> >> Hello,
> >> I have the following questions please:-
> >>
> >> In solrconfig.xml I created a new "/selecttimeout" handler copying
> >> "/select" handler and added the following to my new "/selecttimeout":-
> >>   <shardHandler class="HttpShardHandlerFactory">
> >>     <socketTimeout>10</socketTimeout>
> >>     <connTimeout>20</connTimeout>
> >>   </shardHandler>
> >>
> >> 1.
> >> Does the above mean that if I don't get a request once in 10 ms on the
> >> socket handling the /selecttimeout handler, that socket will be closed?
> >>
> >> 2.
> >> Same with connTimeout? I.e. the connection object remains live only if
> >> at least a connection request comes once in every 20 ms; if not the object
> >> gets closed?
> >>
> >> Suppose a time-consuming query (say with lots of facets etc.) is fired
> >> against SOLR. How can I prevent Solr from processing it for more than 1s?
> >>
> >> 3.
> >> Is this achieved by setting timeAllowed=1000?  Or are there any other
> ways
> >> to do this in Solr?
> >>
> >> 4
> >> For the same purpose, to prevent heavy queries overloading SOLR, does the
> >> <shardHandler> above help in any way, or is it that shardHandler has
> >> nothing to restrict a query once fired against Solr?
> >>
> >>
> >> Could someone pls share your views?
> >>
> >> Thanks!
> >> Mark
> >>
>


Re: Solr restricting time-consuming/heavy processing queries

2019-08-11 Thread Mark Robinson
Hello,
Could someone share their thoughts please or point to some link that helps
understand my above queries?
In the Solr documentation I came across a few lines on timeAllowed and
shardHandler, but if there was an example scenario for both it would help
understand them more thoroughly.
Also curious to know different ways, if any, in SOLR to restrict/limit a
time-consuming query from processing for a long time.

Thanks!
Mark

On Fri, Aug 9, 2019 at 2:15 PM Mark Robinson 
wrote:

>
> Hello,
> I have the following questions please:-
>
> In solrconfig.xml I created a new "/selecttimeout" handler copying
> "/select" handler and added the following to my new "/selecttimeout":-
>   <shardHandler class="HttpShardHandlerFactory">
>     <socketTimeout>10</socketTimeout>
>     <connTimeout>20</connTimeout>
>   </shardHandler>
>
> 1.
> Does the above mean that if I don't get a request once in 10 ms on the
> socket handling the /selecttimeout handler, that socket will be closed?
>
> 2.
> Same with connTimeout? I.e. the connection object remains live only if at
> least a connection request comes once in every 20 ms; if not the object
> gets closed?
>
> Suppose a time-consuming query (say with lots of facets etc.) is fired
> against SOLR. How can I prevent Solr from processing it for more than 1s?
>
> 3.
> Is this achieved by setting timeAllowed=1000?  Or are there any other ways
> to do this in Solr?
>
> 4
> For the same purpose, to prevent heavy queries overloading SOLR, does the
> <shardHandler> above help in any way, or is it that shardHandler has nothing
> to restrict a query once fired against Solr?
>
>
> Could someone pls share your views?
>
> Thanks!
> Mark
>


Solr restricting time-consuming/heavy processing queries

2019-08-09 Thread Mark Robinson
Hello,
I have the following questions please:-

In solrconfig.xml I created a new "/selecttimeout" handler copying
"/select" handler and added the following to my new "/selecttimeout":-
  
  <shardHandler class="HttpShardHandlerFactory">
    <socketTimeout>10</socketTimeout>
    <connTimeout>20</connTimeout>
  </shardHandler>
  

1.
Does the above mean that if I don't get a request once in 10 ms on the socket
handling the /selecttimeout handler, that socket will be closed?

2.
Same with connTimeout? I.e. the connection object remains live only if at
least a connection request comes once in every 20 ms; if not the object
gets closed?

Suppose a time-consuming query (say with lots of facets etc.) is fired
against SOLR. How can I prevent Solr from processing it for more than 1s?

3.
Is this achieved by setting timeAllowed=1000?  Or are there any other ways
to do this in Solr?
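
For illustration, timeAllowed can be sent per request or set as a handler
default; a minimal sketch, using the 1000 ms value from the question above:

  <requestHandler name="/selecttimeout" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="timeAllowed">1000</str>
    </lst>
  </requestHandler>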

4
For the same purpose, to prevent heavy queries overloading SOLR, does the
<shardHandler> above help in any way, or is it that shardHandler has nothing
to restrict a query once fired against Solr?


Could someone pls share your views?

Thanks!
Mark


Re: UIMA-SOLR integration

2018-03-29 Thread Mark Robinson
Thanks much Steve for the suggestions and pointers.

Best,
Mark

On Thu, Mar 29, 2018 at 3:17 PM, Steve Rowe <sar...@gmail.com> wrote:

> Hi Mark,
>
> Not sure about the advisability of pursuing UIMA - I’ve never used it with
> Lucene or Solr - but soon-to-be-released Solr 7.3, will include OpenNLP
> integration:
>
> * Language analysis, in the Solr reference guide: <
> https://builds.apache.org/view/L/view/Lucene/job/Solr-
> reference-guide-7.3/javadoc/language-analysis.html#opennlp-integration>
>
> * Language detection, in the Solr reference guide: <
> https://builds.apache.org/view/L/view/Lucene/job/Solr-
> reference-guide-7.3/javadoc/detecting-languages-during-indexing.html>
>
> * NER, in javadocs (sorry, couldn’t think of a place where a pre-release
> HTML view is available): <https://git-wip-us.apache.
> org/repos/asf?p=lucene-solr.git;a=blob;f=solr/contrib/
> analysis-extras/src/java/org/apache/solr/update/processor/
> OpenNLPExtractNamedEntitiesUpdateProcessorFactory.java;hb=
> refs/heads/branch_7_3#l60>
>
> --
> Steve
> www.lucidworks.com
>
> > On Mar 29, 2018, at 6:40 AM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> >
> > Hi All,
> >
> > Is it still advisable to pursue UIMA, or can someone please advise
> > something else to look at related to SOLR and NLP?
> >
> > Thanks!
> > Mark
> >
> >
> > -- Forwarded message --
> > From: Mark Robinson <mark123lea...@gmail.com>
> > Date: Wed, Mar 28, 2018 at 2:21 PM
> > Subject: UIMA-SOLR integration
> > To: solr-user@lucene.apache.org
> >
> >
> > Hi,
> >
> > I was trying to integrate UIMA into SOLR following the solr docs and many
> > other hints on the net.
> > While trying to get a VALID_ALCHEMYAPI_KEY I contacted IBM support and
> got
> > the following advice:-
> >
> > "As announced a year a go the Alchemy Service was scheduled and shutdown
> on
> > March 7th, 2018, and is no longer supported.  The AlchemAPI services was
> > broken down into three other services where AlchemyLanguage has been
> > replaced by Natural Language Understanding, AlchemyVision by Visual
> > Recognition, and AlchemyDataNews by Discovery News.  The suggestion is to
> > migrated to the respective merged service in order to be able to take
> > advantage of the features."
> >
> > Could someone please share any other suggestions instead of having to
> > use ALCHEMYAPI so that I can still continue with my work.
> >
> > Note:- I already commented out OpenCalais references in
> > OverridingParamsExtServicesAE.xml, as I was getting errors with
> > OpenCalais, so I was relying on AlchemyAPI only.
> >
> > Any immediate help is greatly appreciated!
> >
> > Thanks!
> >
> > Mark
>
>


Fwd: UIMA-SOLR integration

2018-03-29 Thread Mark Robinson
Hi All,

Is it still advisable to pursue UIMA, or can someone please advise something
else to look at related to SOLR and NLP?

Thanks!
Mark


-- Forwarded message --
From: Mark Robinson <mark123lea...@gmail.com>
Date: Wed, Mar 28, 2018 at 2:21 PM
Subject: UIMA-SOLR integration
To: solr-user@lucene.apache.org


Hi,

I was trying to integrate UIMA into SOLR following the solr docs and many
other hints on the net.
While trying to get a VALID_ALCHEMYAPI_KEY I contacted IBM support and got
the following advice:-

"As announced a year a go the Alchemy Service was scheduled and shutdown on
March 7th, 2018, and is no longer supported.  The AlchemAPI services was
broken down into three other services where AlchemyLanguage has been
replaced by Natural Language Understanding, AlchemyVision by Visual
Recognition, and AlchemyDataNews by Discovery News.  The suggestion is to
migrated to the respective merged service in order to be able to take
advantage of the features."

Could someone please share any other suggestions instead of having to
use ALCHEMYAPI so that I can still continue with my work.

Note:- I already commented out OpenCalais references in
OverridingParamsExtServicesAE.xml, as I was getting errors with OpenCalais,
so I was relying on AlchemyAPI only.

Any immediate help is greatly appreciated!

Thanks!

Mark


UIMA-SOLR integration

2018-03-28 Thread Mark Robinson
Hi,

I was trying to integrate UIMA into SOLR following the solr docs and many
other hints on the net.
While trying to get a VALID_ALCHEMYAPI_KEY I contacted IBM support and got
the following advice:-

"As announced a year a go the Alchemy Service was scheduled and shutdown on
March 7th, 2018, and is no longer supported.  The AlchemAPI services was
broken down into three other services where AlchemyLanguage has been
replaced by Natural Language Understanding, AlchemyVision by Visual
Recognition, and AlchemyDataNews by Discovery News.  The suggestion is to
migrated to the respective merged service in order to be able to take
advantage of the features."

Could someone please share any other suggestions instead of having to
use ALCHEMYAPI so that I can still continue with my work.

Note:- I already commented out OpenCalais references in
OverridingParamsExtServicesAE.xml, as I was getting errors with OpenCalais,
so I was relying on AlchemyAPI only.

Any immediate help is greatly appreciated!

Thanks!

Mark


Re: Reading a parameter set in browser in request handler

2016-10-02 Thread Mark Robinson
Thanks much Alex!
Your pointer was helpful.

Thanks!
Mark.
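
For anyone finding this thread later: the request-time parameter
substitution Alexandre points to just below boils down to something roughly
like this (field names are illustrative). The ${country} macro is expanded
from the request itself rather than from solrconfig.xml, which is why the
solrconfig.xml attempt failed:

  /select?q=*:*&country=USA&fl=id,weather_${country}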

On Sun, Oct 2, 2016 at 12:28 PM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> You've got Parameter Substitution:
> https://cwiki.apache.org/confluence/display/solr/Parameter+Substitution
>
> I think this should do the trick. Perhaps combined with switch query
> parser as https://cwiki.apache.org/confluence/display/solr/Other+
> Parsers#OtherParsers-SwitchQueryParser
> (like in an example here:
> https://gist.github.com/arafalov/5e04884e5aefaf46678c )
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 2 October 2016 at 23:19, Mark Robinson <mark123lea...@gmail.com> wrote:
> > Thanks Alex for the reply.
> >
> > Yes, in this context I want to determine the weather of the country
> > passed from the browser, which will go into another function query.
> >
> > If I do the above directly from the browser it works great. I am trying
> to
> > move this to the requesthandler so that I need to pass only the country
> > name which is what varies.
> > With all parameters in the browser itself:-
> > country=USA&...=if(weatherDetermine, weather_${country},
> > defaultField).
> >
> > Basically  I am trying to configure most of my params in request handler
> so
> > that I need to pass only one param that actually varies, reducing my
> query
> > url length.
> >
> > So basically I was trying to find out *how I can access a parameter passed
> > from the browser in a request handler, to decide on a field dynamically
> > using the info passed from the browser.*
> >
> > Thanks!
> > Mark
> >
> >
> > On Sun, Oct 2, 2016 at 11:46 AM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> >> The variable substitution in the solrconfig.xml happens when that file
> >> is loaded, right at the start of core initialization. Your request
> >> variable comes much later. I don't think you can expand it this way.
> >>
> >> Could you explain a bit more what you are trying to achieve in
> >> business terms. E.g. 'weather' term needs to feed into something else
> >> in query I am guessing.
> >>
> >> Regards,
> >> Alex.
> >> 
> >> Newsletter and resources for Solr beginners and intermediates:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 2 October 2016 at 22:29, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> >> > Hi,
> >> >
> >> > I pass a parameter *country=USA* from the browser as part of my
> >> > query.
> >> > (I have fields *weather_USA*, *weather_UK* etc... in my document).
> >> > How do I retrieve the parameter "country" in my requesthandler.
> >> >
> >> > I tried in my requesthandler:-
> >> >
> >> > 
> >> > <lst name="defaults">
> >> >   <str name="...">storeId</str>
> >> >   <str name="...">weather_${country}</str>
> >> > </lst>
> >> >
> >> > ...but it is not working.  Even tried:-
> >> >  weather_$country
> >> >
> >> > I am trying to set my weather field based on the *country* passed.
> >> >
> >> >   Any suggestion is highly appreciated.
> >> >
> >> > Thanks!
> >> > Mark
> >>
>


Re: Reading a parameter set in browser in request handler

2016-10-02 Thread Mark Robinson
Thanks Alex for the reply.

Yes, in this context I want to determine the weather of the country passed
from the browser, which will go into another function query.

If I do the above directly from the browser it works great. I am trying to
move this to the requesthandler so that I need to pass only the country
name which is what varies.
With all parameters in the browser itself:-
country=USA&...=if(weatherDetermine, weather_${country},
defaultField).

Basically  I am trying to configure most of my params in request handler so
that I need to pass only one param that actually varies, reducing my query
url length.

So basically I was trying to find out *how I can access a parameter passed
from the browser in a request handler, to decide on a field dynamically
using the info passed from the browser.*

Thanks!
Mark


On Sun, Oct 2, 2016 at 11:46 AM, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> The variable substitution in the solrconfig.xml happens when that file
> is loaded, right at the start of core initialization. Your request
> variable comes much later. I don't think you can expand it this way.
>
> Could you explain a bit more what you are trying to achieve in
> business terms. E.g. 'weather' term needs to feed into something else
> in query I am guessing.
>
> Regards,
> Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
> On 2 October 2016 at 22:29, Mark Robinson <mark123lea...@gmail.com> wrote:
> > Hi,
> >
> > I pass a parameter *country=USA* from the browser as part of my query.
> > (I have fields *weather_USA*, *weather_UK* etc... in my document).
> > How do I retrieve the parameter "country" in my requesthandler.
> >
> > I tried in my requesthandler:-
> >
> > 
> > <lst name="defaults">
> >   <str name="...">storeId</str>
> >   <str name="...">weather_${country}</str>
> > </lst>
> >
> > ...but it is not working.  Even tried:-
> >  weather_$country
> >
> > I am trying to set my weather field based on the *country* passed.
> >
> >   Any suggestion is highly appreciated.
> >
> > Thanks!
> > Mark
>


Re: Reading a parameter set in browser in request handler

2016-10-02 Thread Mark Robinson
Hi,

While copy-pasting I made a typo; correcting it here.

I pass a parameter *country=USA* from the browser as part of my query.
(I have fields *weather_USA*, *weather_UK* etc... in my document).
How do I retrieve the parameter "country" in my requesthandler.

I tried in my requesthandler:-


<lst name="defaults">
  <str name="...">country</str>
  <str name="...">weather_${country}</str>
</lst>

...but it is not working.  Even tried:-
 weather_$country

I am trying to set my weather field based on the *country* passed.

  Any suggestion is highly appreciated.

Thanks!

On Sun, Oct 2, 2016 at 11:29 AM, Mark Robinson <mark123lea...@gmail.com>
wrote:

> Hi,
>
> I pass a parameter *country=USA* from the browser as part of my query.
> (I have fields *weather_USA*, *weather_UK* etc... in my document).
> How do I retrieve the parameter "country" in my requesthandler.
>
> I tried in my requesthandler:-
>
> 
> <lst name="defaults">
>   <str name="...">storeId</str>
>   <str name="...">weather_${country}</str>
> </lst>
>
> ...but it is not working.  Even tried:-
>  weather_$country
>
> I am trying to set my weather field based on the *country* passed.
>
>   Any suggestion is highly appreciated.
>
> Thanks!
> Mark
>


Reading a parameter set in browser in request handler

2016-10-02 Thread Mark Robinson
Hi,

I pass a parameter *country=USA* from the browser as part of my query.
(I have fields *weather_USA*, *weather_UK* etc... in my document).
How do I retrieve the parameter "country" in my requesthandler.

I tried in my requesthandler:-


<lst name="defaults">
  <str name="...">storeId</str>
  <str name="...">weather_${country}</str>
</lst>

...but it is not working.  Even tried:-
 weather_$country

I am trying to set my weather field based on the *country* passed.

  Any suggestion is highly appreciated.

Thanks!
Mark


Re: Load a java class on start up

2016-07-02 Thread Mark Robinson
Yes.
Integrating my CustomComponent along with SolrCoreAware (I was unaware of
this previously) should give me what I am looking for.
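
For the archive, a minimal sketch of that combination (HeavyObject is a
made-up stand-in for the object that must be loaded once; this is a sketch
of the pattern, not tested code):

  import java.io.IOException;
  import org.apache.solr.core.SolrCore;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;
  import org.apache.solr.util.plugin.SolrCoreAware;

  public class MyCustomComponent extends SearchComponent implements SolrCoreAware {

      private volatile HeavyObject heavyObject; // hypothetical, loaded once

      @Override
      public void inform(SolrCore core) {
          // called once per core, after init and before requests are served
          this.heavyObject = HeavyObject.load();
      }

      @Override
      public void prepare(ResponseBuilder rb) throws IOException { }

      @Override
      public void process(ResponseBuilder rb) throws IOException {
          // every request reuses the single instance instead of rebuilding it
          heavyObject.doSomething(rb.req);
      }

      @Override
      public String getDescription() {
          return "Loads a shared object once per core";
      }
  }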

Thanks!
Mark.

On Sat, Jul 2, 2016 at 5:57 AM, Andrea Gazzarini <gxs...@gmail.com> wrote:

> You're welcome ;) is that close to what you were looking for?
> On 2 Jul 2016 11:53, "Mark Robinson" <mark123lea...@gmail.com> wrote:
>
> > Thanks much Andrea esp. for the suggestion of SolrCoreAware!
> >
> >
> >
> > Best,
> > Mark.
> >
> > On Thu, Jun 30, 2016 at 10:23 AM, Andrea Gazzarini <gxs...@gmail.com>
> > wrote:
> >
> > > Hi,
> > > the lifecycle of your Solr extension (i.e. the component) is not
> > something
> > > that's up to you.
> > > Before designing the component you should read the framework docs [1],
> in
> > > order to understand the context where it will live, once deployed.
> > >
> > > There's nothing, as far as I know, other than the component callbacks
> > > (e.g. the inform, init methods) that can help you to manage the
> lifecycle
> > > of a custom class you're using within the component. Look at the
> > > SolrCoreAware [2] interface, maybe it could fit your needs.
> > > From what you write it seems you could need something like a singleton
> > > (which is often an anti-pattern in a distributed environment), but
> > > without further details I'm just shooting in the dark.
> > >
> > > In addition: you wrote a component so I guess it shouldn't be so hard
> to
> > > have a look at one of the existing built-in components. I'm quite sure
> > they
> > > already met (and solved) a similar issue.
> > >
> > > Best,
> > > Andrea
> > >
> > > [1]
> > >
> >
> https://lucene.apache.org/solr/6_1_0/solr-core/org/apache/solr/handler/component/SearchComponent.html
> > > [2] https://wiki.apache.org/solr/SolrPlugins#SolrCoreAware
> > >
> > >
> > > On 30/06/16 16:00, Mark Robinson wrote:
> > >
> > >> Hi,
> > >>
> > >> I have a Java object which I need to load once.
> > >> I have written a Java custom component, added under "last-components"
> > >> in solrconfig.xml, from which I want to access the above-mentioned
> > >> object when each search request comes in.
> > >>
> > >> Is there a way I can load a Java object on server/instance startup?
> > >> OR
> > >> Load it when the first call comes to SOLR?
> > >>
> > >> For the time being I created that Java object inside the custom
> > >> component itself, but it is loaded each time a search request comes in.
> > >>
> > >> Could someone please give some pointers on how my above requirement
> > >> can be achieved in SOLR?
> > >>
> > >> Thanks!
> > >> Mark
> > >>
> > >>
> > >
> >
>


Re: Load a java class on start up

2016-07-02 Thread Mark Robinson
Thanks much Andrea esp. for the suggestion of SolrCoreAware!



Best,
Mark.

On Thu, Jun 30, 2016 at 10:23 AM, Andrea Gazzarini <gxs...@gmail.com> wrote:

> Hi,
> the lifecycle of your Solr extension (i.e. the component) is not something
> that's up to you.
> Before designing the component you should read the framework docs [1], in
> order to understand the context where it will live, once deployed.
>
> There's nothing, as far as I know, other than the component callbacks
> (e.g. the inform, init methods) that can help you to manage the lifecycle
> of a custom class you're using within the component. Look at the
> SolrCoreAware [2] interface, maybe it could fit your needs.
> From what you write it seems you could need something like a singleton
> (which is often an anti-pattern in a distributed environment), but without
> further details I'm just shooting in the dark.
>
> In addition: you wrote a component so I guess it shouldn't be so hard to
> have a look at one of the existing built-in components. I'm quite sure they
> already met (and solved) a similar issue.
>
> Best,
> Andrea
>
> [1]
> https://lucene.apache.org/solr/6_1_0/solr-core/org/apache/solr/handler/component/SearchComponent.html
> [2] https://wiki.apache.org/solr/SolrPlugins#SolrCoreAware
>
>
> On 30/06/16 16:00, Mark Robinson wrote:
>
>> Hi,
>>
>> I have a Java object which I need to load once.
>> I have written a Java custom component, added under "last-components" in
>> solrconfig.xml, from which I want to access the above-mentioned object
>> when each search request comes in.
>>
>> Is there a way I can load a Java object on server/instance startup?
>> OR
>> Load it when the first call comes to SOLR?
>>
>> For the time being I created that Java object inside the custom component
>> itself; but it is loaded each time a search request comes in.
>>
>> Could some one pls give some pointers on how my above requirement can be
>> achieved in SOLR?
>>
>> Thanks!
>> Mark
>>
>>
>


Load a java class on start up

2016-06-30 Thread Mark Robinson
Hi,

I have a Java object which I need to load once.
I have written a Java custom component, added under "last-components" in
solrconfig.xml, from which I want to access the above-mentioned object when
each search request comes in.

Is there a way I can load a Java object on server/instance startup?
OR
Load it when the first call comes to SOLR?

For the time being I created that Java object inside the custom component
itself; but it is loaded each time a search request comes in.

Could someone please give some pointers on how my above requirement can be
achieved in SOLR?

Thanks!
Mark


Accessing response docs in process method

2016-06-17 Thread Mark Robinson
Hi,

I would like to check the response for the *authors* data that comes in my
multiValued *authors* field, and do some activity related to it before the
output is sent back.

I know how to access the facets and investigate them.

Could someone please advise (the APIs/methods etc.) on how I can get started
on this (accessing results in the *process* method)?
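
A sketch of the kind of access I mean, pieced together from the built-in
components (so the exact calls may need adjusting):

  DocList docs = rb.getResults().docList;        // inside process(ResponseBuilder rb)
  SolrIndexSearcher searcher = rb.req.getSearcher();
  DocIterator it = docs.iterator();
  while (it.hasNext()) {
      int docId = it.nextDoc();
      Document d = searcher.doc(docId, Collections.singleton("authors"));
      for (IndexableField f : d.getFields("authors")) {
          // inspect each value of the multiValued "authors" field here
      }
  }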

Thanks!
Mark.


Re: Add a new field dynamically to each of the result docs and sort on it

2016-06-01 Thread Mark Robinson
Thanks Charlie!
I will check this and try it out.

Best,
Mark.

On Wed, Jun 1, 2016 at 7:00 AM, Charlie Hull <char...@flax.co.uk> wrote:

> On 01/06/2016 11:56, Mark Robinson wrote:
>
>> Just to complete my previous use case: in case no direct way is possible
>> in SOLR to sort on a field in a different core, is there a way to embed
>> the tagValue of a product dynamically into the results? (The storeid will
>> be passed at query time, so query the product_tags core for that
>> product+storeid, get the tagValue, and embed it into the product results,
>> probably in the "process" method of a custom component; in the first
>> place I believe we can add a value like that to each result doc.) But
>> then how can we sort on this value, as I am now working on the results
>> that came out after any initial sort was applied? And can we re-sort at
>> this very late stage using some Java sorting in the custom component?
>>
>
> Hi Mark,
>
> Not sure if this is directly relevant but we implemented a component to
> join Solr results with external data:
> http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/
>
> Cheers
>
> Charlie
>
>>
>> Thanks!
>> Mark.
>>
>> On Wed, Jun 1, 2016 at 6:44 AM, Mark Robinson <mark123lea...@gmail.com>
>> wrote:
>>
>> Thanks much Eric and Hoss!
>>>
>>> Let me try to detail.
>>> We have our "product" core with a couple of million docs.
>>> We have a couple of thousand outlets where the products get sold.
>>> Each product can have a different *tagValue* in each outlet.
>>> Our "product_tag" core (around 2M times 2000 records), captures tag info
>>> of each product in each outlet. It has some additional info also (a
>>> couple
>>> of more fields in addition to *tagValue*), pertaining to each
>>> product-outlet combination and there can be NRT *tag* updates for this
>>> core (the *tagValue* of each product in each outlet can change and is
>>> updated in real time). So we moved the volatile portion of product out
>>> to a
>>> separate core which has approx 2M times 2000 records and only 4 or 5
>>> fields
>>> per doc.
>>>
>>> A recent requirement is that we want our product results to be bumped up
>>> or down if it has a particular *tagValue*... for example products with
>>> tagValue=X should be at the top. Currently only one tag*Value* is
>>> considered to decide results order.
>>> A future requirement could be products with *tagValue=*X bumped up
>>> followed by products with *tagValue=*Y.
>>>
>>>
>>> ie "product" results need to be ordered based on a field(s) in the
>>> "product_tag" core (a different core).
>>>
>>> Is there ANY way to achieve this scenario?
>>>
>>> Thanks!
>>>
>>> Mark.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, May 31, 2016 at 8:13 PM, Chris Hostetter <
>>> hossman_luc...@fucit.org
>>>
>>>> wrote:
>>>>
>>>
>>>
>>>> : When a query comes in, I want to populate value for this field in the
>>>> : results based on some values passed in the query.
>>>> : So what needs to be accommodated in the result depends on a parameter
>>>> in
>>>> : the query and I would like to sort the final results on this field
>>>> also,
>>>> : which is dynamically populated.
>>>>
>>>> populated how? ... what exactly do you want to provide at query time,
>>>> and
>>>> how exactly do you want it to affect your query results / sorting?
>>>>
>>>> The details of what you *think* you mean matter, because based on the
>>>> information you've provided we have no way of guessing what your goal
>>>> is -- and if we can't guess what you mean, then there's no way to
>>>> imagine
>>>> Solr can figure it out ... software doesn't have an imagination.
>>>>
>>>> We need to know what your documents are going to look like at index
>>>> time (with *real* details, and specific example docs) and what your
>>>> queries are going to look like (again: with *real* details on the "some
>>>> values passed in the query") and a detailed explanation of how what
>>>> results you want to see and why -- describe in words how the final
>>>> sorting
>>>> of the docs you should have already described to use would be determined
>>>> acording to the info pased in at query time which you should have also
>>>> already described to us.
>>>>
>>>>
>>>> In general i think i smell an XY Problem...
>>>>
>>>> https://people.apache.org/~hossman/#xyproblem
>>>> XY Problem
>>>>
>>>> Your question appears to be an "XY Problem" ... that is: you are dealing
>>>> with "X", you are assuming "Y" will help you, and you are asking about
>>>> "Y"
>>>> without giving more details about the "X" so that we can understand the
>>>> full issue.  Perhaps the best solution doesn't involve "Y" at all?
>>>> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>>>>
>>>>
>>>> -Hoss
>>>> http://www.lucidworks.com/
>>>>
>>>>
>>>
>>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>


Re: Sorting documents in one core based on a field in another core

2016-06-01 Thread Mark Robinson
Thanks Mikhail!
I will check and get back.

Best,
Mark

On Tue, May 31, 2016 at 4:58 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello Mark,
>
> Is it sounds like what's described at
>
> http://blog-archive.griddynamics.com/2015/08/scoring-join-party-in-solr-53.html
> ?
>
> On Tue, May 31, 2016 at 5:41 PM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I have a requirement to sort records in one core/ collection based on a
> > field in
> > another core/collection.
> >
> > Could someone please advise how it can be done in SOLR?
> >
> > I have used !join to restrict documents in one core based on field values
> > in another core. Is there some way to sort like that?
> >
> >
> > Thanks!
> > Mark.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mkhlud...@griddynamics.com>
>
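
For reference, the post linked above covers the score-variant of the join
query parser added in Solr 5.3, which lets the score of matching documents
in the "from" core order the documents in the "to" core. A hedged sketch
against the field names used in this thread (the from/to fields and the tag
value are assumptions):

q={!join fromIndex=product_tag from=productId to=id score=max}tagValue:X^100

Products are then ranked by their best-scoring tag document, so boosting a
tagValue in the tag core reorders the product results.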


Re: Add a new field dynamically to each of the result docs and sort on it

2016-06-01 Thread Mark Robinson
Thanks for the reply Hoss!

Let me give a quick explanation of the sort by tagValue. (Actually, I quickly
added this part in a different mail when I found I had missed it in this mail.)
That is where the dynamic input parameter comes in.
The input will specify for which local outlet (outlet id passed) we need to
take the tagValue to sort on. So the tagValue of each of the products for
that (local) outlet is what is taken into consideration for products
returned for that query.
Note:- Two people searching from 2 different locations will have 2
different "local" outlets. So, based on the location of the user, his local
outlet's id is passed so that his products are sorted by the
tagValue corresponding to that outlet.
This is where my JOIN query to filter out only products
belonging to that local store was used, and that went well.
Now I need to sort the product results based on the tagValue of
that local store somehow!

Thanks!
Mark.

On Wed, Jun 1, 2016 at 1:18 PM, Chris Hostetter 
wrote:

>
> : Let me try to detail.
> : We have our "product" core with a couple of million docs.
> : We have a couple of thousand outlets where the products get sold.
> : Each product can have a different *tagValue* in each outlet.
> : Our "product_tag" core (around 2M times 2000 records), captures tag info
> of
> : each product in each outlet. It has some additional info also (a couple
> of
> : more fields in addition to *tagValue*), pertaining to each
> : product-outlet combination and there can be NRT *tag* updates for this
> core
> : (the *tagValue* of each product in each outlet can change and is updated
> in
> : real time). So we moved the volatile portion of product out to a separate
> : core which has approx 2M times 2000 records and only 4 or 5 fields per
> doc.
>
> That information is helpful, but -- as i mentioned before -- to reduce
> miscommunication, providing detailed examples at the document+field level
> is helpful.  ie: make up 2 products, tell us what field values those
> products have in each field (in each collection) and then explain how
> those two products should sort (relative to each other) so that we can see
> a realistic example of what you want to happen.
>
> Based on the information you've provided so far, your question still
> doesn't make any sense to me.
>
> you've said you want "product results to be bumped up or down if it has a
> particular *tagValue* ... for example products with tagValue=X should be
> at the top" -- but you've also said that "Each product can have a
> different *tagValue* in each outlet" indicating that there is not a simple
> "product->tagValue" relationship.  What you've described a
> "(product,outlet)->tagValue" relationship.  So even if anything were
> possible, how would Solr know which tagValue to use when deciding how to
> "bump" a product up/down in scoring?
>
> Imagine a given productA was paired with multiple outlets, and one pairing
> with outlet1 was mapped to tagX which you said should sort first, but a
> diff pairing with outlet2 was mapped to tagZ which should sort
> last? .. what do you want to happen in that case?
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Add a new field dynamically to each of the result docs and sort on it

2016-06-01 Thread Mark Robinson
Just to complete my previous use case: in case no direct way is possible in
SOLR to sort on a field in a different core, is there a way to embed the
tagValue of a product dynamically into the results? The storeid will be
passed at query time, so we could query the product_tags core for that
product+storeid, get the tagValue, and embed it into the product results,
probably in the "process" method of a custom component (in the first
place, I believe we can add a value like that to each result doc). But then
how can we sort on this value, as I am now working on the results which came
out after any initial sort was applied? Can we re-sort at this very late
stage using some Java sorting in the custom component?

Thanks!
Mark.

On Wed, Jun 1, 2016 at 6:44 AM, Mark Robinson <mark123lea...@gmail.com>
wrote:

> Thanks much Eric and Hoss!
>
> Let me try to detail.
> We have our "product" core with a couple of million docs.
> We have a couple of thousand outlets where the products get sold.
> Each product can have a different *tagValue* in each outlet.
> Our "product_tag" core (around 2M times 2000 records), captures tag info
> of each product in each outlet. It has some additional info also (a couple
> of more fields in addition to *tagValue*), pertaining to each
> product-outlet combination and there can be NRT *tag* updates for this
> core (the *tagValue* of each product in each outlet can change and is
> updated in real time). So we moved the volatile portion of product out to a
> separate core which has approx 2M times 2000 records and only 4 or 5 fields
> per doc.
>
> A recent requirement is that we want our product results to be bumped up
> or down if it has a particular *tagValue*... for example products with
> tagValue=X should be at the top. Currently only one tag*Value* is
> considered to decide results order.
> A future requirement could be products with *tagValue=*X bumped up
> followed by products with *tagValue=*Y.
>
> ie "product" results need to be ordered based on a field(s) in the
> "product_tag" core (a different core).
>
> Is there ANY way to achieve this scenario?
>
> Thanks!
>
> Mark.
>
>
>
>
>
>
>
> On Tue, May 31, 2016 at 8:13 PM, Chris Hostetter <hossman_luc...@fucit.org
> > wrote:
>
>>
>> : When a query comes in, I want to populate value for this field in the
>> : results based on some values passed in the query.
>> : So what needs to be accommodated in the result depends on a parameter in
>> : the query and I would like to sort the final results on this field also,
>> : which is dynamically populated.
>>
>> populated how? ... what exactly do you want to provide at query time, and
>> how exactly do you want it to affect your query results / sorting?
>>
>> The details of what you *think* you mean matter, because based on the
>> information you've provided we have no way of guessing what your goal
>> is -- and if we can't guess what you mean, then there's no way to imagine
>> Solr can figure it out ... software doesn't have an imagination.
>>
>> We need to know what your documents are going to look like at index
>> time (with *real* details, and specific example docs) and what your
>> queries are going to look like (again: with *real* details on the "some
>> values passed in the query") and a detailed explanation of how what
>> results you want to see and why -- describe in words how the final sorting
>> of the docs you should have already described to use would be determined
>> acording to the info pased in at query time which you should have also
>> already described to us.
>>
>>
>> In general i think i smell an XY Problem...
>>
>> https://people.apache.org/~hossman/#xyproblem
>> XY Problem
>>
>> Your question appears to be an "XY Problem" ... that is: you are dealing
>> with "X", you are assuming "Y" will help you, and you are asking about "Y"
>> without giving more details about the "X" so that we can understand the
>> full issue.  Perhaps the best solution doesn't involve "Y" at all?
>> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>>
>>
>> -Hoss
>> http://www.lucidworks.com/
>>
>
>


Re: Add a new field dynamically to each of the result docs and sort on it

2016-06-01 Thread Mark Robinson
Thanks much Eric and Hoss!

Let me try to detail.
We have our "product" core with a couple of million docs.
We have a couple of thousand outlets where the products get sold.
Each product can have a different *tagValue* in each outlet.
Our "product_tag" core (around 2M times 2000 records), captures tag info of
each product in each outlet. It has some additional info also (a couple of
more fields in addition to *tagValue*), pertaining to each
product-outlet combination and there can be NRT *tag* updates for this core
(the *tagValue* of each product in each outlet can change and is updated in
real time). So we moved the volatile portion of product out to a separate
core which has approx 2M times 2000 records and only 4 or 5 fields per doc.

A recent requirement is that we want our product results to be bumped up or
down if it has a particular *tagValue*... for example products with
tagValue=X should be at the top. Currently only one tag*Value* is
considered to decide results order.
A future requirement could be products with *tagValue=*X bumped up followed
by products with *tagValue=*Y.

ie "product" results need to be ordered based on a field(s) in the
"product_tag" core (a different core).

Is there ANY way to achieve this scenario?

Thanks!

Mark.







On Tue, May 31, 2016 at 8:13 PM, Chris Hostetter 
wrote:

>
> : When a query comes in, I want to populate value for this field in the
> : results based on some values passed in the query.
> : So what needs to be accommodated in the result depends on a parameter in
> : the query and I would like to sort the final results on this field also,
> : which is dynamically populated.
>
> populated how? ... what exactly do you want to provide at query time, and
> how exactly do you want it to affect your query results / sorting?
>
> The details of what you *think* you mean matter, because based on the
> information you've provided we have no way of guessing what your goal
> is -- and if we can't guess what you mean, then there's no way to imagine
> Solr can figure it out ... software doesn't have an imagination.
>
> We need to know what your documents are going to look like at index
> time (with *real* details, and specific example docs) and what your
> queries are going to look like (again: with *real* details on the "some
> values passed in the query") and a detailed explanation of how what
> results you want to see and why -- describe in words how the final sorting
> of the docs you should have already described to use would be determined
> acording to the info pased in at query time which you should have also
> already described to us.
>
>
> In general i think i smell an XY Problem...
>
> https://people.apache.org/~hossman/#xyproblem
> XY Problem
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Add a new field dynamically to each of the result docs and sort on it

2016-05-31 Thread Mark Robinson
Sorry Eric... I did not phrase it right ... what I meant was the field is
there in the schema, but I do not have values for it when normal indexing
happens.
When a query comes in, I want to populate value for this field in the
results based on some values passed in the query.
So what needs to be accommodated in the result depends on a parameter in
the query and I would like to sort the final results on this field also,
which is dynamically populated.

What could be the best way to dynamically add a value to this field based on
a query parameter and also sort on this field?

Will a custom component help, with code in the *process* method to access
the results one by one and plug in this field?
If so, do I need to first index the value inside the *process* method for
each result, or is there a way to just add this value to each of my result
docs (no indexing), iterating through the result set and plugging in this
value for each result?

How will sort be applicable on this dynamically populated field, as I am
already working on the results? Is it too late to specify a sort, and if
so, how could it be possible?

Thanks!
Mark.






On Tue, May 31, 2016 at 11:10 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> I really don't understand this. If you don't have
> "fieldnew", where is the value coming from? It's
> not in the index so
>
> If you mean you're _adding_ a field after the index
> already has some docs in it, then the normal
> sort rules apply and you can specify sortMissingFirst/Last
> to tell Solr where other docs without that field should go.
>
> Normal sort rules are 'sort=field1 asc,field2 desc' etc.
>
> Best,
> Erick
>
> On Tue, May 31, 2016 at 7:53 AM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> > Hi,
> >
> > My core does not have a field say *fieldnew*.
> >
> > *Case 1:-*
> > But in my results I would like to have *fieldnew *also and my results
> > should be sorted on only this new field.
> >
> > *Case 2:-*
> > Just adding one more case further.
> > Suppose I have other fields also in the sort criteria and *fieldnew *is
> > one among them; in that case, how do I realize this multi-field sort?
> >
> > Could someone suggest a way, please?
> >
> > Thanks!
> > Mark.
>


Re: Sorting documents in one core based on a field in another core

2016-05-31 Thread Mark Robinson
Thanks for the reply Eric!

Can we write a custom sort component to achieve this?...
I am thinking of normalizing as the last option as clear separation of the
cores helps me.

Thanks!
Mark.

On Tue, May 31, 2016 at 11:12 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Join doesn't work like that, which is why it's referred
> to as "pseudo join". There's no way that I know of
> to do what you want here.
>
> I'd strongly recommend you flatten your data at index time.
>
> Best,
> Erick
>
> On Tue, May 31, 2016 at 7:41 AM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> > Hi,
> >
> > I have a requirement to sort records in one core/ collection based on a
> > field in
> > another core/collection.
> >
> > Could someone please advise how it can be done in SOLR?
> >
> > I have used !join to restrict documents in one core based on field values
> > in another core. Is there some way to sort like that?
> >
> >
> > Thanks!
> > Mark.
>


Add a new field dynamically to each of the result docs and sort on it

2016-05-31 Thread Mark Robinson
Hi,

My core does not have a field say *fieldnew*.

*Case 1:-*
But in my results I would like to have *fieldnew *also and my results
should be sorted on only this new field.

*Case 2:-*
Just adding one more case further.
Suppose I have other fields also in the sort criteria and *fieldnew *is one
among them; in that case, how do I realize this multi-field sort?

Could someone suggest a way, please?

Thanks!
Mark.


Sorting documents in one core based on a field in another core

2016-05-31 Thread Mark Robinson
Hi,

I have a requirement to sort records in one core/ collection based on a
field in
another core/collection.

Could someone please advise how it can be done in SOLR?

I have used !join to restrict documents in one core based on field values
in another core. Is there some way to sort like that?


Thanks!
Mark.


Re: Atomic updates and "stored"

2016-05-24 Thread Mark Robinson
Thanks Eric!

Best,
Mark

On Mon, May 23, 2016 at 1:35 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Yes, currently when using Atomic updates _all_ fields
> have to be stored, except the _destinations_ of copyField
> directives.
>
> Yes, it will make your index bigger. The effects on speed are
> probably minimal though. The stored data is in your *.fdt and
> *.fdx segment files and is only referenced to pull
> the top N docs back; it's not referenced for _search_ at all.
>
> Coming Real Soon will be updateable DocValues, which may
> be what you really need.
>
> Best,
> Erick
>
> On Mon, May 23, 2016 at 6:13 AM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> > Hi,
> >
> > I have some 150 fields in my schema out of which about 100 are dynamic
> > fields which I am not storing (stored="false").
> > In case I need to do an atomic update to one or two fields which belong
> to
> > the stored list of fields, do I need to change my dynamic fields (100 or
> so
> > now not "stored") to stored="true"?
> >
> > If so wouldn't it considerably increase index size and affect performance
> > in the negative?
> >
> > Is there any way currently to do partial/ atomic updates to one or two
> > fields (which I will make stored="true") without having to make my now
> > stored="false" fields to stored="true" just
> > to accommodate atomic updates.
> >
> > Could some one pls give your suggestions.
> >
> > Thanks!
> > Mark.
>
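
For reference, a minimal SolrJ sketch of what the atomic update itself looks
like once the stored-field requirement above is satisfied (the URL, core,
id, and field values are made up):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc1");                                    // unique key
      doc.addField("price", Collections.singletonMap("set", 19.99)); // atomic "set"
      client.add(doc);
      client.commit();
    }
  }
}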


Atomic updates and "stored"

2016-05-23 Thread Mark Robinson
Hi,

I have some 150 fields in my schema out of which about 100 are dynamic
fields which I am not storing (stored="false").
In case I need to do an atomic update to one or two fields which belong to
the stored list of fields, do I need to change my dynamic fields (100 or so
now not "stored") to stored="true"?

If so, wouldn't it considerably increase index size and negatively affect
performance?

Is there any way currently to do partial/atomic updates to one or two
fields (which I will make stored="true") without having to change my now
stored="false" fields to stored="true" just
to accommodate atomic updates?

Could someone please give your suggestions.

Thanks!
Mark.


Re: SOLR edismax and mm request parameter

2016-05-04 Thread Mark Robinson
Thanks for the mail Jacques.
I have a doubt here.

When we use q.op=AND, my understanding is that ALL query terms should be
present anywhere across the various "qf" fields, i.e. all of the query terms
need not be present in one single field, but just need to be present
somewhere among the various qf fields.

My requirement is that I need *ALL query terms* to be present in at least *any
one of the qf fields* for a doc to qualify.
So I am not sure whether q.op=AND will help me.

Please correct me if I am missing something.

Thanks!
Mark.
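
Written out as a full request, Ahmet's suggestion quoted below would look
roughly like this (field names and terms are made up); each dismax clause
applies mm=100% within a single field, which is the "all terms in at least
one qf field" behaviour that q.op=AND does not give:

q=_query_:"{!dismax qf=field1 mm=100% v=$qq}"
  OR _query_:"{!dismax qf=field2 mm=100% v=$qq}"
  OR _query_:"{!dismax qf=field3 mm=100% v=$qq}"
&qq=blue stainless washer

(Remember to URL-encode, e.g. 100%25, when typing this into a browser.)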



On Wed, May 4, 2016 at 5:57 AM, Jacques du Rand <jacq...@pricecheck.co.za>
wrote:

> Sorry I meant "Ahmet Arslan" answer :)
>
>
> On 4 May 2016 at 11:56, Jacques du Rand <jacq...@pricecheck.co.za> wrote:
>
> > Although Mark Robinson's answer is correct you are now using the DISMAX
> > not the Edismax parser...
> > You can also play around with changing  q.op parameter  to 'AND'
> >
> >
> >
> > On 4 May 2016 at 11:40, Mark Robinson <mark123lea...@gmail.com> wrote:
> >
> >> Thanks much Ahmet!
> >>
> >> I will try that out.
> >>
> >> Best,
> >> Mark
> >>
> >> On Tue, May 3, 2016 at 11:53 PM, Ahmet Arslan <iori...@yahoo.com.invalid
> >
> >> wrote:
> >>
> >> > Hi Mark,
> >> >
> >> > You could do something like this:
> >> >
> >> > _query_:{!dismax qf='field1' mm='100%' v=$qq}
> >> > OR
> >> > _query_:{!dismax qf='field2' mm='100%' v=$qq}
> >> > OR
> >> > _query_:{!dismax qf='field3' mm='100%' v=$qq}
> >> >
> >> >
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries
> >> >
> >> > Ahmet
> >> >
> >> >
> >> >
> >> > On Wednesday, May 4, 2016 4:59 AM, Mark Robinson <
> >> mark123lea...@gmail.com>
> >> > wrote:
> >> > Hi,
> >> > On further checking cld identify that *blue *is indeed appearing in
> one
> >> of
> >> > the qf fields.My bad!
> >> >
> >> > Cld someone pls help me with the 2nd question.
> >> >
> >> > Thanks!
> >> > Mark.
> >> >
> >> >
> >> >
> >> > On Tue, May 3, 2016 at 8:03 PM, Mark Robinson <
> mark123lea...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi,
> >> > >
> >> > > I made a typo err in the prev mail for my first question when I
> listed
> >> > the
> >> > > query terms.
> >> > > Let me re-type both questions here once again pls.
> >> > > Sorry for any inconvenience.
> >> > >
> >> > > 1.
> >> > > My understanding of the mm parameter related to edismax is that,
> >> > > if mm=100%,  only if ALL my query terms appear across any of the qf
> >> > fields
> >> > > will I get back
> >> > > documents ... ie all the terms *need not be present in one single
> >> field*
> >> > ..
> >> > > they just need to be present across any of the fields in my qf list.
> >> > >
> >> > > But my query for  the terms:-
> >> > > *blue stainless washer*
> >> > > ... returns a document which has *Stainless Washer *in one of my qf
> >> > > fields, but *blue *is not there in any of the qf fields. Then how
> did
> >> it
> >> > > get returned even though I had given mm=100% (100%25 when I typed
> >> > directly
> >> > > in browser). Any suggestions please.. In fact this is my first
> record!
> >> > >
> >> > > 2.
> >> > > Another question I have is:-
> >> > > With edismax can I enforce that all my query terms should appear in
> >> ANY
> >> > of
> >> > > my qf fields to qualify as a result document? I know all terms
> >> appearing
> >> > in
> >> > > a single field can give a boost if we use the "pf" query parameter
> >> > > accordingly. But how can I insist that to qualify as a result, the
> doc
> >> > > should have ALL of my query term in one or more of the qf fields?
> >> > >
> >> > >
> >> > > Cld some one pls help.
> >> > >
> >> > > Thanks!
> >> > >
> >> > > Mark

Re: SOLR edismax and mm request parameter

2016-05-04 Thread Mark Robinson
Thanks much Ahmet!

I will try that out.

Best,
Mark

On Tue, May 3, 2016 at 11:53 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
wrote:

> Hi Mark,
>
> You could do something like this:
>
> _query_:{!dismax qf='field1' mm='100%' v=$qq}
> OR
> _query_:{!dismax qf='field2' mm='100%' v=$qq}
> OR
> _query_:{!dismax qf='field3' mm='100%' v=$qq}
>
>
>
> https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries
>
> Ahmet
>
>
>
> On Wednesday, May 4, 2016 4:59 AM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> Hi,
> On further checking cld identify that *blue *is indeed appearing in one of
> the qf fields.My bad!
>
> Cld someone pls help me with the 2nd question.
>
> Thanks!
> Mark.
>
>
>
> On Tue, May 3, 2016 at 8:03 PM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I made a typo err in the prev mail for my first question when I listed
> the
> > query terms.
> > Let me re-type both questions here once again pls.
> > Sorry for any inconvenience.
> >
> > 1.
> > My understanding of the mm parameter related to edismax is that,
> > if mm=100%,  only if ALL my query terms appear across any of the qf
> fields
> > will I get back
> > documents ... ie all the terms *need not be present in one single field*
> ..
> > they just need to be present across any of the fields in my qf list.
> >
> > But my query for  the terms:-
> > *blue stainless washer*
> > ... returns a document which has *Stainless Washer *in one of my qf
> > fields, but *blue *is not there in any of the qf fields. Then how did it
> > get returned even though I had given mm=100% (100%25 when I typed
> directly
> > in browser). Any suggestions please.. In fact this is my first record!
> >
> > 2.
> > Another question I have is:-
> > With edismax can I enforce that all my query terms should appear in ANY
> of
> > my qf fields to qualify as a result document? I know all terms appearing
> in
> > a single field can give a boost if we use the "pf" query parameter
> > accordingly. But how can I insist that to qualify as a result, the doc
> > should have ALL of my query term in one or more of the qf fields?
> >
> >
> > Cld some one pls help.
> >
> > Thanks!
> >
> > Mark
> >
> > On Tue, May 3, 2016 at 6:28 PM, Mark Robinson <mark123lea...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> 1.
> >> My understanding of the mm parameter related to edismax is that,
> >> if mm=100%,  only if ALL my query terms appear across any of the qf
> >> fields will I get back
> >> documents ... ie all the terms *need not be present in one single field*
> >> .. they just need to be present across any of the fields in my qf list.
> >>
> >> But my query for  the terms:-
> >> *blue stainless washer*
> >> ... returns a document which has *Stainless Washer *in one of my qf
> >> fields, but *refrigerator *is not there in any of the qf fields. Then
> >> how did it get returned even though I had given mm=100% (100%25 when I
> >> typed directly in browser). Any suggestions please.
> >>
> >> 2.
> >> Another question I have is:-
> >> With edismax can I enforce that all my query terms should appear in ANY
> >> of my qf fields to qualify as a result document? I know all terms
> appearing
> >> in a single field can give a boost if we use the "pf" query parameter
> >> accordingly. But how can I insist that to qualify as a result, the doc
> >> should have ALL of my query term in one or more of the qf fields?
> >>
> >>
> >> Cld some one pls help.
> >>
> >> Thanks!
> >> Mark.
> >>
> >
> >
>


Re: SOLR edismax and mm request parameter

2016-05-03 Thread Mark Robinson
Hi,
On further checking, I could identify that *blue* is indeed appearing in one of
the qf fields. My bad!

Could someone please help me with the 2nd question.

Thanks!
Mark.


On Tue, May 3, 2016 at 8:03 PM, Mark Robinson <mark123lea...@gmail.com>
wrote:

> Hi,
>
> I made a typo err in the prev mail for my first question when I listed the
> query terms.
> Let me re-type both questions here once again pls.
> Sorry for any inconvenience.
>
> 1.
> My understanding of the mm parameter related to edismax is that,
> if mm=100%,  only if ALL my query terms appear across any of the qf fields
> will I get back
> documents ... ie all the terms *need not be present in one single field* ..
> they just need to be present across any of the fields in my qf list.
>
> But my query for  the terms:-
> *blue stainless washer*
> ... returns a document which has *Stainless Washer *in one of my qf
> fields, but *blue *is not there in any of the qf fields. Then how did it
> get returned even though I had given mm=100% (100%25 when I typed directly
> in browser). Any suggestions please.. In fact this is my first record!
>
> 2.
> Another question I have is:-
> With edismax can I enforce that all my query terms should appear in ANY of
> my qf fields to qualify as a result document? I know all terms appearing in
> a single field can give a boost if we use the "pf" query parameter
> accordingly. But how can I insist that to qualify as a result, the doc
> should have ALL of my query term in one or more of the qf fields?
>
>
> Cld some one pls help.
>
> Thanks!
>
> Mark
>
> On Tue, May 3, 2016 at 6:28 PM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
>
>> Hi,
>>
>> 1.
>> My understanding of the mm parameter related to edismax is that,
>> if mm=100%,  only if ALL my query terms appear across any of the qf
>> fields will I get back
>> documents ... ie all the terms *need not be present in one single field*
>> .. they just need to be present across any of the fields in my qf list.
>>
>> But my query for  the terms:-
>> *blue stainless washer*
>> ... returns a document which has *Stainless Washer *in one of my qf
>> fields, but *refrigerator *is not there in any of the qf fields. Then
>> how did it get returned even though I had given mm=100% (100%25 when I
>> typed directly in browser). Any suggestions please.
>>
>> 2.
>> Another question I have is:-
>> With edismax can I enforce that all my query terms should appear in ANY
>> of my qf fields to qualify as a result document? I know all terms appearing
>> in a single field can give a boost if we use the "pf" query parameter
>> accordingly. But how can I insist that to qualify as a result, the doc
>> should have ALL of my query term in one or more of the qf fields?
>>
>>
>> Cld some one pls help.
>>
>> Thanks!
>> Mark.
>>
>
>


Re: SOLR edismax and mm request parameter

2016-05-03 Thread Mark Robinson
Hi,

I made a typo in the previous mail for my first question when I listed the
query terms.
Let me re-type both questions here once again.
Sorry for any inconvenience.

1.
My understanding of the mm parameter related to edismax is that,
if mm=100%,  only if ALL my query terms appear across any of the qf fields
will I get back
documents ... ie all the terms *need not be present in one single field* ..
they just need to be present across any of the fields in my qf list.

But my query for  the terms:-
*blue stainless washer*
... returns a document which has *Stainless Washer *in one of my qf fields,
but *blue *is not there in any of the qf fields. Then how did it get
returned even though I had given mm=100% (100%25 when I typed directly in
the browser)? Any suggestions please. In fact, this is my first record!

2.
Another question I have is:-
With edismax can I enforce that all my query terms should appear in ANY of
my qf fields to qualify as a result document? I know all terms appearing in
a single field can give a boost if we use the "pf" query parameter
accordingly. But how can I insist that, to qualify as a result, the doc
should have ALL of my query terms in one or more of the qf fields?


Could someone please help.

Thanks!

Mark

On Tue, May 3, 2016 at 6:28 PM, Mark Robinson <mark123lea...@gmail.com>
wrote:

> Hi,
>
> 1.
> My understanding of the mm parameter related to edismax is that,
> if mm=100%,  only if ALL my query terms appear across any of the qf fields
> will I get back
> documents ... ie all the terms *need not be present in one single field*
> .. they just need to be present across any of the fields in my qf list.
>
> But my query for  the terms:-
> *blue stainless washer*
> ... returns a document which has *Stainless Washer *in one of my qf
> fields, but *refrigerator *is not there in any of the qf fields. Then how
> did it get returned even though I had given mm=100% (100%25 when I typed
> directly in browser). Any suggestions please.
>
> 2.
> Another question I have is:-
> With edismax can I enforce that all my query terms should appear in ANY of
> my qf fields to qualify as a result document? I know all terms appearing in
> a single field can give a boost if we use the "pf" query parameter
> accordingly. But how can I insist that to qualify as a result, the doc
> should have ALL of my query term in one or more of the qf fields?
>
>
> Cld some one pls help.
>
> Thanks!
> Mark.
>


Re: Phrases and edismax

2016-05-03 Thread Mark Robinson
Sorry Eric.
I will check and raise it under the SOLR project.

Thanks!
Mark.

On Mon, May 2, 2016 at 11:43 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Mark:
>
> KYLIN-1644? This should be SOLR-. I suspect you entered the JIRA
> in the wrong Apache project.
>
>
> Erick
>
> On Mon, May 2, 2016 at 8:05 PM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> > Hi Eric,
> >
> > I have raised a JIRA:-   *KYLIN-1644*   with the problem mentioned.
> >
> > Thanks!
> > Mark.
> >
> > On Sun, May 1, 2016 at 5:25 PM, Mark Robinson <mark123lea...@gmail.com>
> > wrote:
> >
> >> Thanks much Eric for checking in detail.
> >> Yes I found the first term being left out in pf.
> >> Because of that I had some cases where a couple of unwanted records came
> >> in the results with higher priority than the normal ones. When I checked
> >> they matched from the 2nd term onwards.
> >>
> >> As suggested, I would raise a JIRA.
> >>
> >> Thanks!
> >> Mark
> >>
> >> On Sat, Apr 30, 2016 at 1:20 PM, Erick Erickson <
> erickerick...@gmail.com>
> >> wrote:
> >>
> >>> Looks like a bug in edismax to me when you field-qualify
> >>> the terms.
> >>>
> >>> As an aside, there's no need to specify the field when you only
> >>> want it to go against the fields defined in "qf" and "pf" etc. And,
> >>> that's a work-around for this particular case. But still:
> >>>
> >>> So here's what I get on 5x:
> >>> q=(erick men truck)&defType=edismax&qf=name&pf=name
> >>> correctly returns:
> >>> "+((name:erick) (name:men) (name:truck)) (name:"erick men truck")",
> >>>
> >>> But,
> >>> q=name:(erick men truck)&defType=edismax&qf=name&pf=name
> >>> incorrectly returns:
> >>> "+(name:erick name:men name:truck) (name:"men truck")",
> >>>
> >>> And this:
> >>> q=name:(erick men truck)&defType=edismax&qf=name&pf=features
> >>> incorrectly gives this.
> >>>
> >>> "+(name:erick name:men name:truck) (features:"men truck")",
> >>>
> >>> Confusingly, the terms (with "erick" left out, strike 1)
> >>> goes against the pf field even though it's fully qualified against the
> >>> name field. Not entirely sure whether this is intended or not frankly.
> >>>
> >>> Please go ahead and raise a JIRA.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Fri, Apr 29, 2016 at 7:55 AM, Mark Robinson <
> mark123lea...@gmail.com>
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > q=productType:(two piece bathtub white)
> >>> > &defType=edismax&pf=productType^20.0&qf=productType^15.0
> >>> >
> >>> > In the debug section this is what I see:-
> >>> > 
> >>> > (+(productType:two productType:piec productType:bathtub
> >>> productType:white)
> >>> > DisjunctionMaxQuery((productType:"piec bathtub
> white"^20.0)))/no_coord
> >>> > 
> >>> >
> >>> > My question is related to the "pf" (phrases) section of edismax.
> >>> > As shown in the debug section why is the phrase taken as "piec
> bathtub
> >>> > white". Why is the first word "two" not considered in the phrase
> fields
> >>> > section.
> >>> > I am looking for queries with the words "two piece bathtub white"
> being
> >>> > together to be boosted and not "piece bathtub white" only to be
> boosted.
> >>> >
> >>> > Could some one help me understand what I am missing?
> >>> >
> >>> > Thanks!
> >>> > Mark
> >>>
> >>
> >>
>


SOLR edismax and mm request parameter

2016-05-03 Thread Mark Robinson
Hi,

1.
My understanding of the mm parameter related to edismax is that,
if mm=100%,  only if ALL my query terms appear across any of the qf fields
will I get back
documents ... ie all the terms *need not be present in one single field* ..
they just need to be present across any of the fields in my qf list.

But my query for  the terms:-
*blue stainless washer*
... returns a document which has *Stainless Washer *in one of my qf fields,
but *refrigerator *is not there in any of the qf fields. Then how did it
get returned even though I had given mm=100% (100%25 when I typed directly
in browser). Any suggestions please.

2.
Another question I have is:-
With edismax can I enforce that all my query terms should appear in ANY of
my qf fields to qualify as a result document? I know all terms appearing in
a single field can give a boost if we use the "pf" query parameter
accordingly. But how can I insist that to qualify as a result, the doc
should have ALL of my query term in one or more of the qf fields?


Cld some one pls help.

Thanks!
Mark.


Re: Phrases and edismax

2016-05-02 Thread Mark Robinson
Hi Eric,

I have raised a JIRA:-   *KYLIN-1644*   with the problem mentioned.

Thanks!
Mark.

On Sun, May 1, 2016 at 5:25 PM, Mark Robinson <mark123lea...@gmail.com>
wrote:

> Thanks much Eric for checking in detail.
> Yes I found the first term being left out in pf.
> Because of that I had some cases where a couple of unwanted records came
> in the results with higher priority than the normal ones. When I checked
> they matched from the 2nd term onwards.
>
> As suggested, I would raise a JIRA.
>
> Thanks!
> Mark
>
> On Sat, Apr 30, 2016 at 1:20 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> Looks like a bug in edismax to me when you field-qualify
>> the terms.
>>
>> As an aside, there's no need to specify the field when you only
>> want it to go against the fields defined in "qf" and "pf" etc. And,
>> that's a work-around for this particular case. But still:
>>
>> So here's what I get on 5x:
>> q=(erick men truck)&defType=edismax&qf=name&pf=name
>> correctly returns:
>> "+((name:erick) (name:men) (name:truck)) (name:"erick men truck")",
>>
>> But,
>> q=name:(erick men truck)&defType=edismax&qf=name&pf=name
>> incorrectly returns:
>> "+(name:erick name:men name:truck) (name:"men truck")",
>>
>> And this:
>> q=name:(erick men truck)&defType=edismax&qf=name&pf=features
>> incorrectly gives this.
>>
>> "+(name:erick name:men name:truck) (features:"men truck")",
>>
>> Confusingly, the terms (with "erick" left out, strike 1)
>> goes against the pf field even though it's fully qualified against the
>> name field. Not entirely sure whether this is intended or not frankly.
>>
>> Please go ahead and raise a JIRA.
>>
>> Best,
>> Erick
>>
>> On Fri, Apr 29, 2016 at 7:55 AM, Mark Robinson <mark123lea...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > q=productType:(two piece bathtub white)
>> > &defType=edismax&pf=productType^20.0&qf=productType^15.0
>> >
>> > In the debug section this is what I see:-
>> > 
>> > (+(productType:two productType:piec productType:bathtub
>> productType:white)
>> > DisjunctionMaxQuery((productType:"piec bathtub white"^20.0)))/no_coord
>> > 
>> >
>> > My question is related to the "pf" (phrases) section of edismax.
>> > As shown in the debug section why is the phrase taken as "piec bathtub
>> > white". Why is the first word "two" not considered in the phrase fields
>> > section.
>> > I am looking for queries with the words "two piece bathtub white" being
>> > together to be boosted and not "piece bathtub white" only to be boosted.
>> >
>> > Could some one help me understand what I am missing?
>> >
>> > Thanks!
>> > Mark
>>
>
>


Re: Phrases and edismax

2016-05-01 Thread Mark Robinson
Thanks much Eric for checking in detail.
Yes I found the first term being left out in pf.
Because of that I had some cases where a couple of unwanted records came in
the results with higher priority than the normal ones. When I checked they
matched from the 2nd term onwards.

As suggested, I would raise a JIRA.

Thanks!
Mark

On Sat, Apr 30, 2016 at 1:20 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Looks like a bug in edismax to me when you field-qualify
> the terms.
>
> As an aside, there's no need to specify the field when you only
> want it to go against the fields defined in "qf" and "pf" etc. And,
> that's a work-around for this particular case. But still:
>
> So here's what I get on 5x:
> q=(erick men truck)&defType=edismax&qf=name&pf=name
> correctly returns:
> "+((name:erick) (name:men) (name:truck)) (name:"erick men truck")",
>
> But,
> q=name:(erick men truck)&defType=edismax&qf=name&pf=name
> incorrectly returns:
> "+(name:erick name:men name:truck) (name:"men truck")",
>
> And this:
> q=name:(erick men truck)&defType=edismax&qf=name&pf=features
> incorrectly gives this.
>
> "+(name:erick name:men name:truck) (features:"men truck")",
>
> Confusingly, the terms (with "erick" left out, strike 1)
> goes against the pf field even though it's fully qualified against the
> name field. Not entirely sure whether this is intended or not frankly.
>
> Please go ahead and raise a JIRA.
>
> Best,
> Erick
>
> On Fri, Apr 29, 2016 at 7:55 AM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> > Hi,
> >
> > q=productType:(two piece bathtub white)
> > &defType=edismax&pf=productType^20.0&qf=productType^15.0
> >
> > In the debug section this is what I see:-
> > 
> > (+(productType:two productType:piec productType:bathtub
> productType:white)
> > DisjunctionMaxQuery((productType:"piec bathtub white"^20.0)))/no_coord
> > 
> >
> > My question is related to the "pf" (phrases) section of edismax.
> > As shown in the debug section why is the phrase taken as "piec bathtub
> > white". Why is the first word "two" not considered in the phrase fields
> > section.
> > I am looking for queries with the words "two piece bathtub white" being
> > together to be boosted and not "piece bathtub white" only to be boosted.
> >
> > Could some one help me understand what I am missing?
> >
> > Thanks!
> > Mark
>


Re: Decide on facets from results

2016-04-29 Thread Mark Robinson
Thanks for the suggestion Joel.
I will check on it.

Thanks!
Mark.

On Fri, Apr 29, 2016 at 11:56 AM, Joel Bernstein <joels...@gmail.com> wrote:

> Check out the new docs for the gatherNodes streaming expression. It allows
> you to aggregate and then use those aggregates as input for another
> expression. You can even do this across collections.
>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62693238
>
> This is slated for Solr 6.1
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Apr 29, 2016 at 10:38 AM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
>
> > Thanks much everyone!
> > Appreciate your responses.
> >
> > Best,
> > Mark
> >
> > On Thu, Apr 28, 2016 at 10:52 AM, Jay Potharaju <jspothar...@gmail.com>
> > wrote:
> >
> > > On the same lines as Erik suggested but using facet stats instead. you
> > can
> > > get stats on your facet fields in the first pass and then include the
> > > facets that you need in the second pass.
> > >
> > >
> > > > On Apr 27, 2016, at 1:21 PM, Mark Robinson <mark123lea...@gmail.com>
> > > wrote:
> > > >
> > > > Thanks Eric!
> > > > So that will mean another call will be definitely required to SOLR
> with
> > > the
> > > > facets, before the results can be sent back (with the facet fields
> > being
> > > > derived traversing through the response).
> > > >
> > > > I was basically checking on whether in the "process" method (I
> believe
> > > > results will be accessed in the process method), we can dynamically
> > > > generate facets after traversing through the results and identifying
> > the
> > > > fields for faceting, using some aggregation function or so, without
> > > having
> > > > to make another call using facet=on&facet.field=<fields>, before
> > the
> > > > response is sent back to the user.
> > > >
> > > > Cheers!
> > > >
> > > > On Wed, Apr 27, 2016 at 2:27 PM, Erik Hatcher <
> erik.hatc...@gmail.com>
> > > > wrote:
> > > >
> > > >> Results will vary based on how you indexed those fields, but sure…
> > > >> facet=on&facet.field=<fields> - with sufficient RAM, lots of
> fun
> > > to be
> > > >> had!
> > > >>
> > > >> —
> > > >> Erik Hatcher, Senior Solutions Architect
> > > >> http://www.lucidworks.com <http://www.lucidworks.com/>
> > > >>
> > > >>
> > > >>
> > > >>>> On Apr 27, 2016, at 12:13 PM, Mark Robinson <
> > mark123lea...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> If I don't have my facet list at query time, from the results can I
> > > >> select
> > > >>> some fields and by any means create a facet on them? ie after I get
> > the
> > > >>> results I want to identify some fields as facets and send back
> facets
> > > for
> > > >>> them in the response.
> > > >>>
> > > >>> A kind of very dynamic faceting based on the results!
> > > >>>
> > > >>> Cld some one pls share their idea.
> > > >>>
> > > >>> Thanks!
> > > >>> Anil.
> > > >>
> > > >>
> > >
> >
>


Phrases and edismax

2016-04-29 Thread Mark Robinson
Hi,

q=productType:(two piece bathtub white)
&defType=edismax&pf=productType^20.0&qf=productType^15.0

In the debug section this is what I see:-

(+(productType:two productType:piec productType:bathtub productType:white)
DisjunctionMaxQuery((productType:"piec bathtub white"^20.0)))/no_coord


My question is related to the "pf" (phrases) section of edismax.
As shown in the debug section, why is the phrase taken as "piec bathtub
white"? Why is the first word "two" not considered in the phrase fields
section?
I am looking for queries with the words "two piece bathtub white" being
together to be boosted and not "piece bathtub white" only to be boosted.

Could some one help me understand what I am missing?

Thanks!
Mark


Re: Decide on facets from results

2016-04-29 Thread Mark Robinson
Thanks much everyone!
Appreciate your responses.

Best,
Mark

On Thu, Apr 28, 2016 at 10:52 AM, Jay Potharaju <jspothar...@gmail.com>
wrote:

> On the same lines as Erik suggested but using facet stats instead. you can
> get stats on your facet fields in the first pass and then include the
> facets that you need in the second pass.
>
>
> > On Apr 27, 2016, at 1:21 PM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> >
> > Thanks Eric!
> > So that will mean another call will be definitely required to SOLR with
> the
> > facets, before the results can be sent back (with the facet fields being
> > derived traversing through the response).
> >
> > I was basically checking on whether in the "process" method (I believe
> > results will be accessed in the process method), we can dynamically
> > generate facets after traversing through the results and identifying the
> > fields for faceting, using some aggregation function or so, without
> having
> > to make another call using facet=on&facet.field=<fields>, before the
> > response is sent back to the user.
> >
> > Cheers!
> >
> > On Wed, Apr 27, 2016 at 2:27 PM, Erik Hatcher <erik.hatc...@gmail.com>
> > wrote:
> >
> >> Results will vary based on how you indexed those fields, but sure…
> >> facet=on&facet.field=<fields> - with sufficient RAM, lots of fun
> to be
> >> had!
> >>
> >> —
> >> Erik Hatcher, Senior Solutions Architect
> >> http://www.lucidworks.com <http://www.lucidworks.com/>
> >>
> >>
> >>
> >>>> On Apr 27, 2016, at 12:13 PM, Mark Robinson <mark123lea...@gmail.com>
> >>> wrote:
> >>>
> >>> Hi,
> >>>
> >>> If I don't have my facet list at query time, from the results can I
> >> select
> >>> some fields and by any means create a facet on them? ie after I get the
> >>> results I want to identify some fields as facets and send back facets
> for
> >>> them in the response.
> >>>
> >>> A kind of very dynamic faceting based on the results!
> >>>
> >>> Cld some one pls share their idea.
> >>>
> >>> Thanks!
> >>> Anil.
> >>
> >>
>


Re: Decide on facets from results

2016-04-27 Thread Mark Robinson
Thanks Eric!
So that will mean another call will be definitely required to SOLR with the
facets, before the results can be sent back (with the facet fields being
derived traversing through the response).

I was basically checking on whether in the "process" method (I believe
results will be accessed in the process method), we can dynamically
generate facets after traversing through the results and identifying the
fields for faceting, using some aggregation function or so, without having
to make another call using facet=on&facet.field=<fields>, before the
response is sent back to the user.

Cheers!

On Wed, Apr 27, 2016 at 2:27 PM, Erik Hatcher <erik.hatc...@gmail.com>
wrote:

> Results will vary based on how you indexed those fields, but sure…
> facet=on&facet.field=<fields> - with sufficient RAM, lots of fun to be
> had!
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com <http://www.lucidworks.com/>
>
>
>
> > On Apr 27, 2016, at 12:13 PM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > If I don't have my facet list at query time, from the results can I
> select
> > some fields and by any means create a facet on them? ie after I get the
> > results I want to identify some fields as facets and send back facets for
> > them in the response.
> >
> > A kind of very dynamic faceting based on the results!
> >
> > Cld some one pls share their idea.
> >
> > Thanks!
> > Anil.
>
>


Decide on facets from results

2016-04-27 Thread Mark Robinson
Hi,

If I don't have my facet list at query time, from the results can I select
some fields and by any means create a facet on them? ie after I get the
results I want to identify some fields as facets and send back facets for
them in the response.

A kind of very dynamic faceting based on the results!

Could someone please share their ideas.

Thanks!
Anil.


Re: Indexing 700 docs per second

2016-04-20 Thread Mark Robinson
Thank you all for your very valuable suggestions.
I will try out the options shared once our setup is ready, and will probably
get back on my experience once it is done.

Thanks!
Mark.

On Wed, Apr 20, 2016 at 9:54 AM, Bram Van Dam  wrote:

> > I have a requirement to index (mainly updation) 700 docs per second.
> > Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260
> > bytes (6 fields out of which only 2 will undergo updation at the above
> > rate). This collection has around 122Million docs and that count is
> pretty
> > much a constant.
>
> We've found that average index size per document is a good predictor of
> performance. For instance, I've got a 150GB index lying around,
> containing 400M documents. That's roughly 400 bytes per document in
> index size. This was indexed @ 4500 documents/second.
>
> If the average index size per document doubles, the throughput will go
> down by about a third. Your mileage may vary.
>
> But yeah, I would say that 700 docs on your machine won't be much of a
> problem. Especially considering your index will likely fit in memory.
>
>  - Bram
>
>
>


Indexing 700 docs per second

2016-04-19 Thread Mark Robinson
Hi,

I have a requirement to index (mainly updation) 700 docs per second.
Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260
bytes (6 fields out of which only 2 will undergo updation at the above
rate). This collection has around 122Million docs and that count is pretty
much a constant.

1. Can I manage this update rate with a non-sharded, i.e. single, Solr
instance setup?
2. Also, is an atomic update or a full update (the whole doc) of the changed
records the better approach in this case?

Could someone please share their views/experience?

Thanks!
Mark.
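
For what it's worth, sustained small-document update rates are usually
governed more by batching and commit policy than by raw hardware. A hedged
SolrJ sketch (URL, core, and field names are placeholders) that sends
updates in one batch and leaves committing to autoCommit/commitWithin:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedUpdates {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore")) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 700; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc" + i);
        doc.addField("fieldA", Collections.singletonMap("set", "newValue")); // atomic update
        batch.add(doc);
      }
      client.add(batch, 10_000); // commitWithin 10s; avoid an explicit commit per batch
    }
  }
}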


Re: Not seeing the tokenized values when using solr.PathHierarchyTokenizerFactory

2016-04-18 Thread Mark Robinson
Thanks much Eric!
Got it.

Best,
Mark.

On Mon, Apr 18, 2016 at 7:53 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Assuming that you're talking about the docs returned in the result
> sets, these are the _stored_ fields, not the analyzed field. Stored
> fields are a verbatim copy of the original input.
>
> Best,
> Erick
>
> On Mon, Apr 18, 2016 at 12:51 PM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> > Hi,
> >
> > I was using the solr.PathHierarchyTokenizerFactory for a field say
> fieldB.
> > An input data like A/B/C when I check using the ANALYSIS facility in the
> > admin UI, is tokenized as A, A/B, A/B/C in fieldB.
> > A/B/C in my system is a "string" value in a fieldA which is both
> > indexed=stored=true. I copyField fieldA to fieldB which has the above
> > solr.PathHierarchyTokenizerFactory.
> >
> > fieldB also has indexed=stored=true as well as multiValued=true.
> >
> > Even then, when results displayed, fieldB shows only the original A/B/C
> ie
> > same as what is in fieldA.
> > But ANALYSIS as mentioned above shows all the different hierarchies for
> > fieldB. Also a querylike:-  fieldA:"A/B" yields no results but
> > fieldB:"A/B" gives results as fieldB has all the hierarchies in it.
> >
> > But then why can't I see all the different hierarchies in my result for
> > fieldB as I clearly see when I check through the ANALYSIS in admin UI?
> >
> > Could some one pls help understand this behavior.
> >
> > Thanks!
> > Mark.
>


Not seeing the tokenized values when using solr.PathHierarchyTokenizerFactory

2016-04-18 Thread Mark Robinson
Hi,

I was using the solr.PathHierarchyTokenizerFactory for a field say fieldB.
Input data like A/B/C, when I check using the ANALYSIS facility in the
admin UI, is tokenized as A, A/B, A/B/C in fieldB.
A/B/C in my system is a "string" value in a fieldA which is both
indexed=stored=true. I copyField fieldA to fieldB which has the above
solr.PathHierarchyTokenizerFactory.

fieldB also has indexed=stored=true as well as multiValued=true.

Even then, when results displayed, fieldB shows only the original A/B/C ie
same as what is in fieldA.
But ANALYSIS as mentioned above shows all the different hierarchies for
fieldB. Also, a query like fieldA:"A/B" yields no results but
fieldB:"A/B" gives results as fieldB has all the hierarchies in it.

But then why can't I see all the different hierarchies in my result for
fieldB as I clearly see when I check through the ANALYSIS in admin UI?

Could some one pls help understand this behavior.

Thanks!
Mark.


Which query to prefer

2016-03-03 Thread Mark Robinson
Hi,

I have a 125 million doc index1.
I identified 25 values for fieldA in the index. Each value can appear
multiple times (1).

There is another fieldB in the same index. I identified 6 values for this
fieldB.
I want only those records in (1)  which contain any of these values in
fieldB.

Query:-
q=fieldA:(value1 OR value2 OR value3 OR ... value25) AND fieldB:(value1
OR ... value6)

Again, I have fieldB values in another index2 of mine which has 3000 docs.
So I can do:-
q=fieldA:(value1 OR value2 OR value3 OR ... value25), with a join of index1
with index2 using fieldB as the join key.

Which query is more efficient?

Or is there a third better way?

Thanks!

Mark
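
One more option worth noting for long OR lists like these (a sketch; the
field and value names are from the question, the parser choice is a
suggestion): the terms query parser matches a list of values without scoring
or full query parsing, and putting it in an fq lets the filter be cached
independently:

q={!terms f=fieldA}value1,value2,value3&fq={!terms f=fieldB}value1,value2,value3,value4,value5,value6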


Querying through SolrJ taking lot of time

2016-03-03 Thread Mark Robinson
Hi,
I am running the following query on an index that has around 123 million
records, using SolrJ.
Each record has only 5 fields.

String *qry* = "( fieldA:(value1 OR value2 OR ... value24) AND
fieldB:(value1 OR value2 OR value3 OR value4 OR value5) )";
(...basically a simple AND of 2 ORs)

When I hit it directly from the browser, QTime is in the range of 300-400
milliseconds max.

But when I run through SolrJ my (endtime - starttime) gives 20 seconds max
(when run on a machine with 16 CPUs and 60GB RAM with heap size 25G
allocated).
When run on my laptop which has only 4GB RAM the SolrJ  (endtime -
starttime) gives 60s to a maximum of 90s sometimes.

Why could there be this huge difference in timing when querying using SolrJ?

Also, could you please suggest how I can get the timing close to the
timing I see when I hit the index directly from the browser.
Note:- All programs (java as well as SOLR) reside on the same machine in
both cases (more powerful machine as well as laptop) when I tried.

When I tried with the more powerful machine I even gave a firstSearcher
query of q=*:*, but no impact was seen.

I am looking for good response times from my first query itself, so I did
not explore caching much.

Any help is greatly appreciated.

Thanks!
Mark


Re: Capture facets and its values in output and modify them

2016-02-27 Thread Mark Robinson
Hi Binoy,

Greatly appreciate your quick response.
Both posts were very helpful.

Thanks!
Mark

On Sat, Feb 27, 2016 at 3:31 AM, Binoy Dalal <binoydala...@gmail.com> wrote:

> Here's the code snippet for highlighting, but you can just replace
> highlighting with 'facet' and do what you want with it.
>
> public void process(ResponseBuilder arg0) throws IOException {
>     // the whole response built so far, as a NamedList
>     NamedList nl = arg0.rsp.getValues();
>     // pull out the section you want to post-process
>     SimpleOrderedMap sop = (SimpleOrderedMap) nl.get("highlighting");
>     // DO YOUR THING
> }
>
> On Sat, Feb 27, 2016 at 1:46 PM Binoy Dalal <binoydala...@gmail.com>
> wrote:
>
> > Take a look here:
> >
> https://github.com/lttazz99/SolrPluginsExamples/blob/master/src/main/java/component/ComponentDemo.java
> >
> > I've uploaded an example of writing a search component.
> > All you have to do is get the facet object from the Solr response and
> > then work on it.
> > I have done something similar with highlighting and will put that code
> > here in some time so you know how to fetch the facet values.
> >
> > On Sat, 27 Feb 2016, 13:42 Mark Robinson <mark123lea...@gmail.com>
> wrote:
> >
> >> Hi,
> >> I have a requirement to capture facet fields in the output and append
> >> additional data to each of the facet values before the final output
> >> (along with the results as well as the facets and values) is sent back,
> >> so that a middle layer can use this additional value.
> >>
> >> I read that a custom search component will do this, but could someone
> >> please point me to how I can get access to the facets and their values
> >> (a small code snippet or any resource that points to it).
> >>
> >> Thanks!
> >> Mark
> >>
> > --
> > Regards,
> > Binoy Dalal
> >
> --
> Regards,
> Binoy Dalal
>


Capture facets and its values in output and modify them

2016-02-27 Thread Mark Robinson
Hi,
I have a requirement to capture facet fields in the output and append
additional data to each of the facet values before the final output (along
with the results as well as the facets and values) is sent back, so that a
middle layer can use this additional value.

I read that a custom search component will do this, but could someone
please point me to how I can get access to the facets and their values (a
small code snippet or any resource that points to it).
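
A minimal sketch of what that access can look like inside a custom
SearchComponent's process() method (facet_counts / facet_fields are the
standard classic-faceting response sections; unchecked casts elided):

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        NamedList<Object> vals = rb.rsp.getValues();
        NamedList<Object> facetCounts = (NamedList<Object>) vals.get("facet_counts");
        if (facetCounts == null) return;  // faceting was not requested
        NamedList<Object> facetFields = (NamedList<Object>) facetCounts.get("facet_fields");
        for (Map.Entry<String, Object> field : facetFields) {
            NamedList<Object> values = (NamedList<Object>) field.getValue();
            // each entry of 'values' is facetValue -> count;
            // append the additional data per value here
        }
    }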

Thanks!
Mark


Re: Retrieving 1000 records at a time

2016-02-19 Thread Mark Robinson
Thanks Shawn!

Best,
Mark.

On Wed, Feb 17, 2016 at 7:48 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 2/17/2016 3:49 PM, Mark Robinson wrote:
> > I have around 121 fields out of which 12 of them are indexed and almost
> all
> > 121 are stored.
> > Average size of a doc is 10KB.
> >
> > I was checking for start=0, rows=1000.
> > We were querying a Solr instance which was on another server and I think
> > network lag might have come into the picture also.
> >
> > I did not go for any caching as I wanted good response time for the
> > first query itself.
>
> Stored fields, which contain the data that is returned to the client in
> the response, are compressed on disk.  Uncompressing this data can
> contribute to the time on a slow query, but I do not think it can
> explain 30 seconds of delay.  Very large documents can be particularly
> slow to decompress, but you have indicated that each entire document is
> about 10K in size, which is not huge.
>
> It is more likely that the delay is caused by one of two things,
> possibly both:
>
> * Extremely long garbage collection pauses due to a heap that is too
> small or VERY huge (beyond 32GB) with inadequate GC tuning.
> * Not enough system memory to effectively cache the index.
>
> Some additional info that may be helpful in tracking this down further:
>
> * For each core on one machine, the size on disk of the data directory.
> * For each core, the number of documents and the number of deleted
> documents.
> * The max heap size for the Solr JVM.
> * Whether there is more than one Solr instance per server.
> * The total installed memory size in the server.
> * Whether or not the server is used for other applications.
> * What operating system the server is running.
> * Whether the index is distributed or contained in a single core.
> * Whether Solr is in SolrCloud mode or not.
> * Solr version.
>
> Thanks,
> Shawn
>
>


Re: Retrieving 1000 records at a time

2016-02-17 Thread Mark Robinson
Thanks Joel and Chris!

I have around 121 fields, out of which 12 are indexed and almost all
121 are stored.
Average size of a doc is 10KB.

I was checking for start=0, rows=1000.
We were querying a Solr instance which was on another server and I think
network lag might have come into the picture also.

I did not go for any caching as I wanted good response time for the first
query itself.

Thanks much for the links and suggestions. I will go thru each of them.

Best,
Mark.

On Wed, Feb 17, 2016 at 5:26 PM, Chris Hostetter 
wrote:

>
> : I have a requirement where I need to retrieve 10000 to 15000 records at
> : a time from Solr.
> : With 20 or 100 records everything happens in milliseconds.
> : When it goes to 1000 or 10000 it is taking more time... like even 30
> : seconds.
>
> so far all you've really told us about your setup is that some
> queries with "rows=1000" are slow -- but you haven't really told us
> anything else we can help you with -- for example it's not obvious if you
> mean that you are using start=0 in all of those queries and they are slow,
> or if you mean you are paginating through results (ie: increasing start
> param) 1000 at a time and it starts getting slow as you page deeply.
>
> you also haven't told us anything about the fields you are returning --
> how many are there?, what data types are they? are they large string
> values?
>
> how are you measuring the time? are you sure network lag, or client side
> processing of the data as solr returns it isn't the bulk of the time you
> are measuring?  what does the QTime in the solr responses for these slow
> queries say?
>
> my best guesses are that either: you are doing deep paging and conflating
> the increased response time for deep results with an increase in response
> time for large rows params (because you are getting "deeper" faster with a
> large rows#) or you are seeing an increase in processing time on the
> client due to the large volume of data being returned -- possibly even
> with SolrJ which is designed to parse the entire response into java
> data structures by default before returning to the client.
>
> w/o more concrete information, it's hard to give you advice beyond
> guesses.
>
>
> potentially helpful links...
>
> https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
>
> https://lucidworks.com/blog/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
>
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>
> https://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/io/stream/expr/StreamFactory.html
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Retrieving 1000 records at a time

2016-02-17 Thread Mark Robinson
Hi,

I have a requirement where I need to retrieve 10000 to 15000 records at a
time from Solr.
With 20 or 100 records everything happens in milliseconds.
When it goes to 1000 or 10000 it is taking more time... like even 30 seconds.

Will Solr be able to return 10000 records at a time in less than, say, 200
milliseconds?

I have read that disk reads are a costly affair, so we have to batch
results, and the fewer the records retrieved in a batch, the faster the
response when using Solr.

So is Solr a straight-away NO candidate in a situation where 10000 records
should be retrieved in <= 200 ms?
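
For batches this large, the cursorMark approach avoids the cost of deep
paging; a SolrJ sketch (client construction assumed, and the sort must
include the uniqueKey):

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(1000);
    q.setSort(SolrQuery.SortClause.asc("id"));  // uniqueKey tie-breaker required
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = client.query(q);
        // process rsp.getResults() here ...
        String next = rsp.getNextCursorMark();
        if (cursorMark.equals(next)) break;     // no more results
        cursorMark = next;
    }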

A quick response would be very helpful.

Thanks!
Mark


Re: Solr architecture

2016-02-12 Thread Mark Robinson
Thanks All for your suggestions!

Rgds,
Mark.

On Thu, Feb 11, 2016 at 9:45 AM, Upayavira <u...@odoko.co.uk> wrote:

> Your biggest issue here is likely to be http connections. Making an HTTP
> connection to Solr is way more expensive than the ask of adding a single
> document to the index. If you are expecting to add 24 billion docs per
> day, I'd suggest that somehow merging those documents into batches
> before sending them to Solr will be necessary.
>
> To my previous question - what do you gain by using Solr that you don't
> get from other solutions? I'd suggest that to make this system really
> work, you are going to need a deep understanding of how Lucene works -
> segments, segment merges, deletions, and many other things because when
> you start to work at that scale, the implementation details behind
> Lucene really start to matter and impact upon your ability to succeed.
>
> I'd suggest that what you are undertaking can certainly be done, but is
> a substantial project.
>
> Upayavira
>
> On Wed, Feb 10, 2016, at 09:48 PM, Mark Robinson wrote:
> > Thanks everyone for your suggestions.
> > Based on it I am planning to have one doc per event with sessionId
> > common.
> >
> > So in this case hopefully indexing each doc as and when it comes would be
> > okay? Or do we still need to batch and index to Solr?
> >
> > Also, with 4M sessions a day and about 6000 docs (events) per session,
> > we can expect about 24 billion docs per day!
> >
> > Will Solr still hold up? If so, could someone please recommend a sizing
> > to cater to this level of data.
> > The query load is around 320 queries per second.
> >
> > Thanks!
> > Mark
> >
> >
> > On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <
> > emir.arnauto...@sematext.com> wrote:
> >
> > > Hi Mark,
> > > Appending session actions just to be able to return more than one
> > > session without retrieving a large number of results is not a good
> > > tradeoff. Like
> > > Upayavira suggested, you should consider storing one action per doc and
> > > aggregate on read time or push to Solr once session ends and aggregate
> on
> > > some other layer.
> > > If you are thinking handling infrastructure might be too much, you may
> > > consider using some of logging services to hold data. One such service
> is
> > > Sematext's Logsene (http://sematext.com/logsene).
> > >
> > > Thanks,
> > > Emir
> > >
> > > --
> > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > Solr & Elasticsearch Support * http://sematext.com/
> > >
> > >
> > >
> > > On 10.02.2016 03:22, Mark Robinson wrote:
> > >
> > >> Thanks for your replies and suggestions!
> > >>
> > >> Why do I store all events related to a session under one doc?
> > >> Each session can have about 500 total entries (events) corresponding
> > >> to it.
> > >> So when I try to retrieve a session's info it can come back with
> > >> around 500 records. If it is this compounded one doc per session, I
> > >> can retrieve more sessions at a time with one doc per session.
> > >> E.g. under a sessionId, an array of eventA activities, eventB
> > >> activities (using JSON). When an eventA activity again occurs, we
> > >> will read all that data for that session, append this extra info to
> > >> the eventA data, and push the whole session-related data back
> > >> (indexing) to Solr. Like this for many sessions in parallel.
> > >>
> > >>
> > >> Why NRT?
> > >> In parallel, many sessions are being written (4 million sessions,
> > >> hence 4 million docs per day). A person can do this querying any time.
> > >>
> > >> It is just a look up?
> > >> Yes. We just need to retrieve all info for a session and pass it on to
> > >> another system. We may even do some extra querying on some data like
> > >> timestamps, pageurl etc in that info added to a session.
> > >>
> > >> I am thinking of keeping the data separate from the actual Solr
> > >> instance and specifying the location of the dataDir in solrconfig.
> > >>
> > >> If Solr is not a good option, could you please suggest something
> > >> which will satisfy this use case with minimal response time while
> > >> querying.
> > >>
> > >> Thanks!
> > >> Mark
> > >>
> > >> On Tue, Feb

Re: Solr architecture

2016-02-10 Thread Mark Robinson
Thanks everyone for your suggestions.
Based on it I am planning to have a doc per event.



On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Mark,
> Appending session actions just to be able to return more than one session
> without retrieving a large number of results is not a good tradeoff. Like
> Upayavira suggested, you should consider storing one action per doc and
> aggregate on read time or push to Solr once session ends and aggregate on
> some other layer.
> If you are thinking handling infrastructure might be too much, you may
> consider using some of logging services to hold data. One such service is
> Sematext's Logsene (http://sematext.com/logsene).
>
> Thanks,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
> On 10.02.2016 03:22, Mark Robinson wrote:
>
>> Thanks for your replies and suggestions!
>>
>> Why do I store all events related to a session under one doc?
>> Each session can have about 500 total entries (events) corresponding to
>> it.
>> So when I try to retrieve a session's info it can come back with around
>> 500 records. If it is this compounded one doc per session, I can retrieve
>> more sessions at a time with one doc per session.
>> E.g. under a sessionId, an array of eventA activities, eventB activities
>> (using JSON). When an eventA activity again occurs, we will read all that
>> data for that session, append this extra info to the eventA data, and
>> push the whole session-related data back (indexing) to Solr. Like this
>> for many sessions in parallel.
>>
>>
>> Why NRT?
>> In parallel, many sessions are being written (4 million sessions, hence
>> 4 million docs per day). A person can do this querying any time.
>>
>> It is just a look up?
>> Yes. We just need to retrieve all info for a session and pass it on to
>> another system. We may even do some extra querying on some data like
>> timestamps, pageurl etc in that info added to a session.
>>
>> I am thinking of keeping the data separate from the actual Solr instance
>> and specifying the location of the dataDir in solrconfig.
>>
>> If Solr is not a good option, could you please suggest something which
>> will satisfy this use case with minimal response time while querying.
>>
>> Thanks!
>> Mark
>>
>> On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins <danwcoll...@gmail.com>
>> wrote:
>>
>>> So as I understand your use case, it's effectively logging actions within a
>>> user session, why do you have to do the update in NRT?  Why not just log
>>> all the user session events (with some unique key, and ensuring the
>>> session
>>> Id is in the document somewhere), then when you want to do the query, you
>>> join on the session id, and that gives you all the data records for that
>>> session. I don't really follow why it has to be 1 document (which you
>>> continually update). If you really need that aggregation, couldn't that
>>> happen offline?
>>>
>>> I guess your 1 saving grace is that you query using the unique ID (in
>>> your
>>> scenario) so you could use the real-time get handler, since you aren't
>>> doing a complex query (strictly it's not a search, it's a raw key lookup).
>>>
>>> But I would still question your use case, if you go the Solr route for
>>> that
>>> kind of scale with querying and indexing that much, you're going to have
>>> to
>>> throw a lot of hardware at it, as Jack says probably in the order of
>>> hundreds of machines...
>>>
>>> On 9 February 2016 at 19:00, Upayavira <u...@odoko.co.uk> wrote:
>>>
>>> Bear in mind that Lucene is optimised towards high read lower write.
>>>> That is, it puts in a lot of effort at write time to make reading
>>>> efficient. It sounds like you are going to be doing far more writing
>>>> than reading, and I wonder whether you are necessarily choosing the
>>>> right tool for the job.
>>>>
>>>> How would you later use this data, and what advantage is there to
>>>> storing it in Solr?
>>>>
>>>> Upayavira
>>>>
>>>> On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:
>>>>
>>>>> Hi,
>>>>> Thanks for all your suggestions. I took some time to get the details to
>>>>> be
>>>>> more accurate. Please find what I have gathered:-
>>>>>
>

Re: Solr architecture

2016-02-10 Thread Mark Robinson
Thanks everyone for your suggestions.
Based on it I am planning to have one doc per event with sessionId common.

So in this case hopefully indexing each doc as and when it comes would be
okay? Or do we still need to batch and index to Solr?

Also, with 4M sessions a day and about 6000 docs (events) per session, we
can expect about 24 billion docs per day!

Will Solr still hold up? If so, could someone please recommend a sizing to
cater to this level of data.
The query load is around 320 queries per second.
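
Batching generally still pays off at this rate, because each HTTP round
trip costs more than adding one document; a SolrJ sketch (the client, the
Event class, and the field names are assumptions):

    List<SolrInputDocument> batch = new ArrayList<>();
    for (Event e : events) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", e.getEventId());
        doc.addField("sessionId", e.getSessionId());
        doc.addField("timestamp", e.getTimestamp());
        batch.add(doc);
        if (batch.size() >= 1000) {   // batch size to be tuned
            client.add(batch);        // one HTTP call for the whole batch
            batch.clear();
        }
    }
    if (!batch.isEmpty()) client.add(batch);
    // let autoCommit/autoSoftCommit in solrconfig.xml handle commits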

Thanks!
Mark


On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Mark,
> Appending session actions just to be able to return more than one session
> without retrieving a large number of results is not a good tradeoff. Like
> Upayavira suggested, you should consider storing one action per doc and
> aggregate on read time or push to Solr once session ends and aggregate on
> some other layer.
> If you are thinking handling infrastructure might be too much, you may
> consider using some of logging services to hold data. One such service is
> Sematext's Logsene (http://sematext.com/logsene).
>
> Thanks,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
> On 10.02.2016 03:22, Mark Robinson wrote:
>
>> Thanks for your replies and suggestions!
>>
>> Why do I store all events related to a session under one doc?
>> Each session can have about 500 total entries (events) corresponding to
>> it.
>> So when I try to retrieve a session's info it can come back with around
>> 500 records. If it is this compounded one doc per session, I can retrieve
>> more sessions at a time with one doc per session.
>> E.g. under a sessionId, an array of eventA activities, eventB activities
>> (using JSON). When an eventA activity again occurs, we will read all that
>> data for that session, append this extra info to the eventA data, and
>> push the whole session-related data back (indexing) to Solr. Like this
>> for many sessions in parallel.
>>
>>
>> Why NRT?
>> In parallel, many sessions are being written (4 million sessions, hence
>> 4 million docs per day). A person can do this querying any time.
>>
>> It is just a look up?
>> Yes. We just need to retrieve all info for a session and pass it on to
>> another system. We may even do some extra querying on some data like
>> timestamps, pageurl etc in that info added to a session.
>>
>> I am thinking of keeping the data separate from the actual Solr instance
>> and specifying the location of the dataDir in solrconfig.
>>
>> If Solr is not a good option, could you please suggest something which
>> will satisfy this use case with minimal response time while querying.
>>
>> Thanks!
>> Mark
>>
>> On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins <danwcoll...@gmail.com>
>> wrote:
>>
>>> So as I understand your use case, it's effectively logging actions within a
>>> user session, why do you have to do the update in NRT?  Why not just log
>>> all the user session events (with some unique key, and ensuring the
>>> session
>>> Id is in the document somewhere), then when you want to do the query, you
>>> join on the session id, and that gives you all the data records for that
>>> session. I don't really follow why it has to be 1 document (which you
>>> continually update). If you really need that aggregation, couldn't that
>>> happen offline?
>>>
>>> I guess your 1 saving grace is that you query using the unique ID (in
>>> your
>>> scenario) so you could use the real-time get handler, since you aren't
>>> doing a complex query (strictly it's not a search, it's a raw key lookup).
>>>
>>> But I would still question your use case, if you go the Solr route for
>>> that
>>> kind of scale with querying and indexing that much, you're going to have
>>> to
>>> throw a lot of hardware at it, as Jack says probably in the order of
>>> hundreds of machines...
>>>
>>> On 9 February 2016 at 19:00, Upayavira <u...@odoko.co.uk> wrote:
>>>
>>> Bear in mind that Lucene is optimised towards high read lower write.
>>>> That is, it puts in a lot of effort at write time to make reading
>>>> efficient. It sounds like you are going to be doing far more writing
>>>> than reading, and I wonder whether you are necessarily choosing the
>>>> right tool for the job.
>>>>
>>>> How would you later use this data, and what advantage is there

Re: Solr architecture

2016-02-09 Thread Mark Robinson
> > ...aggregation of results from those other shards.
> >
> > -- Jack Krupansky
> >
> > On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson <erickerick...@gmail.com
> >
> > wrote:
> >
> >> Short form: You really have to prototype. Here's the long form:
> >>
> >>
> >>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >>
> >> I've seen between 20M and 200M docs fit on a single piece of hardware,
> >> so you'll absolutely have to shard.
> >>
> >> And the other thing you haven't told us is whether you plan on
> >> _adding_ 2B docs a day or whether that number is the total corpus size
> >> and you are re-indexing the 2B docs/day. IOW, if you are adding 2B
> >> docs/day, 30 days later do you have 2B docs or 60B docs in your
> >> corpus?
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar <susheel2...@gmail.com>
> >> wrote:
> >> > Also, are you expecting indexing of the 2 billion docs as NRT, or
> >> > will it be offline (during off hours etc.)?  For more accurate
> >> > sizing you may also want to index, say, 10 million documents, which
> >> > may give you an idea of your index size, and then use that for
> >> > extrapolation to come up with memory requirements.
> >> >
> >> > Thanks,
> >> > Susheel
> >> >
> >> > On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <
> >> > emir.arnauto...@sematext.com> wrote:
> >> >
> >> >> Hi Mark,
> >> >> Can you give us a bit more details: size of docs, query types, are
> >> >> docs grouped somehow, are they time sensitive, will they update or
> >> >> is it a rebuild every time, etc.
> >> >>
> >> >> Thanks,
> >> >> Emir
> >> >>
> >> >>
> >> >> On 08.02.2016 16:56, Mark Robinson wrote:
> >> >>
> >> >>> Hi,
> >> >>> We have a requirement where we would need to index around 2 billion
> >> >>> docs in a day.
> >> >>> The queries against this indexed data set can be around 80K queries
> >> >>> per second during peak time, and during non-peak hours around 12K
> >> >>> queries per second.
> >> >>>
> >> >>> Can Solr handle these huge volumes?
> >> >>>
> >> >>> If so, assuming we have no budget constraints, what would be a
> >> >>> recommended Solr setup (number of shards, number of Solr instances,
> >> >>> etc.)?
> >> >>>
> >> >>> Thanks!
> >> >>> Mark
> >> >>>
> >> >>>
> >> >> --
> >> >> Monitoring * Alerting * Anomaly Detection * Centralized Log
> Management
> >> >> Solr & Elasticsearch Support * http://sematext.com/
> >> >>
> >> >>
> >>
> >
> >
>


Re: Solr architecture

2016-02-09 Thread Mark Robinson
Thanks for your replies and suggestions!

Why do I store all events related to a session under one doc?
Each session can have about 500 total entries (events) corresponding to it.
So when I try to retrieve a session's info it can come back with around 500
records. If it is this compounded one doc per session, I can retrieve more
sessions at a time with one doc per session.
E.g. under a sessionId, an array of eventA activities, eventB activities
(using JSON). When an eventA activity again occurs, we will read all that
data for that session, append this extra info to the eventA data, and push
the whole session-related data back (indexing) to Solr. Like this for many
sessions in parallel.


Why NRT?
In parallel, many sessions are being written (4 million sessions, hence
4 million docs per day). A person can do this querying any time.

It is just a look up?
Yes. We just need to retrieve all info for a session and pass it on to
another system. We may even do some extra querying on some data like
timestamps, pageurl etc in that info added to a session.

I am thinking of keeping the data separate from the actual Solr instance
and specifying the location of the dataDir in solrconfig.

If Solr is not a good option, could you please suggest something which will
satisfy this use case with minimal response time while querying.
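
For the pure session-id lookup part, the real-time get handler can fetch
the latest copy of a doc by key without waiting for a searcher to open; a
SolrJ sketch (client and variable names assumed):

    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/get");
    q.set("id", sessionId);
    QueryResponse rsp = client.query(q);
    SolrDocument doc = (SolrDocument) rsp.getResponse().get("doc");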

Thanks!
Mark

On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins <danwcoll...@gmail.com>
wrote:

> So as I understand your use case, it's effectively logging actions within a
> user session, why do you have to do the update in NRT?  Why not just log
> all the user session events (with some unique key, and ensuring the session
> Id is in the document somewhere), then when you want to do the query, you
> join on the session id, and that gives you all the data records for that
> session. I don't really follow why it has to be 1 document (which you
> continually update). If you really need that aggregation, couldn't that
> happen offline?
>
> I guess your 1 saving grace is that you query using the unique ID (in your
> scenario) so you could use the real-time get handler, since you aren't
> doing a complex query (strictly it's not a search, it's a raw key lookup).
>
> But I would still question your use case, if you go the Solr route for that
> kind of scale with querying and indexing that much, you're going to have to
> throw a lot of hardware at it, as Jack says probably in the order of
> hundreds of machines...
>
> On 9 February 2016 at 19:00, Upayavira <u...@odoko.co.uk> wrote:
>
> > Bear in mind that Lucene is optimised towards high read lower write.
> > That is, it puts in a lot of effort at write time to make reading
> > efficient. It sounds like you are going to be doing far more writing
> > than reading, and I wonder whether you are necessarily choosing the
> > right tool for the job.
> >
> > How would you later use this data, and what advantage is there to
> > storing it in Solr?
> >
> > Upayavira
> >
> > On Tue, Feb 9, 2016, at 03:40 PM, Mark Robinson wrote:
> > > Hi,
> > > Thanks for all your suggestions. I took some time to get the details to
> > > be
> > > more accurate. Please find what I have gathered:-
> > >
> > > My data being indexed is something like this.
> > > I am basically capturing all data related to a user session.
> > > Inside a session I have categorized my actions like actionA, actionB
> > > etc..,
> > > per page.
> > > So each time an action pertaining to say actionA or actionB etc.. (in
> > > each
> > > page) happens, it is updated in Solr under that session (sessionId).
> > >
> > > So in short there is only one doc pertaining to a single session
> > > (identified by sessionid) in my Solr index and that is retrieved and
> > > updated
> > > whenever a new action under that session occurs.
> > > We expect upto 4Million session per day.
> > >
> > > On average, one session's doc has a size of 3MB to 20MB.
> > > So if it is 4 million sessions per day, each session writing around
> > > 500 times to Solr, it is 2 billion writes (i.e. indexing operations)
> > > per day to Solr.
> > > As it is one doc per session, it is 4 million docs per day.
> > > This is around 80K docs indexed per second during peak hours and
> > > around 15K docs indexed per second into Solr during non-peak hours.
> > > The number of queries per second is around 320.
> > >
> > >
> > > 1. Average size of a doc
> > >  3MB to 20MB
> > > 2. Query types:-
> > >  Until that session is in progress, whatever data is there for that
> > > session so far is queried and the 

Solr architecture

2016-02-08 Thread Mark Robinson
Hi,
We have a requirement where we would need to index around 2 billion docs in
a day.
The queries against this indexed data set can be around 80K queries per
second during peak time, and during non-peak hours around 12K queries per
second.

Can Solr handle these huge volumes?

If so, assuming we have no budget constraints, what would be a recommended
Solr setup (number of shards, number of Solr instances, etc.)?

Thanks!
Mark


Dynamically Adding query parameters in my custom Request Handler class

2016-01-09 Thread Mark Robinson
Hi,
When I initially fire a query against my Solr instance using SolrJ, I pass
only, say, q=*:*&fq=(myfield:value1).

I have written a custom RequestHandler, which is what I call in my SolrJ
query.
Inside this custom request handler, can I add more query params, like say
the facets, so that facets which were not specified when I invoked the Solr
URL using SolrJ are ultimately also received back in my results?

In short, instead of constructing the query dynamically in SolrJ up front,
I want to add the extra query params inside Solr by adding a jar (Java code
that will check certain conditions and dynamically add the query params
after the initial SolrJ query comes in). That is why I thought of a custom
RH, which would let me write a Java class and deploy it in Solr.

Is this possible? Could someone get back to me please?
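
One shape this is commonly given — a custom SearchComponent that rewrites
the parameters in prepare() — is sketched below; the class name and facet
field are made up for illustration:

    public class AddParamsComponent extends SearchComponent {
        @Override
        public void prepare(ResponseBuilder rb) throws IOException {
            // wrap the incoming params and add whatever the conditions call for
            ModifiableSolrParams params =
                new ModifiableSolrParams(rb.req.getParams());
            params.set(FacetParams.FACET, true);
            params.add(FacetParams.FACET_FIELD, "someField"); // hypothetical field
            rb.req.setParams(params);
        }

        @Override
        public void process(ResponseBuilder rb) throws IOException {
            // nothing to do here; the standard components run the query
        }

        @Override
        public String getDescription() {
            return "Adds query params dynamically";
        }
        // depending on the Solr version, getSource() may also need overriding
    }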

Thanks!
Mark.


Re: Dynamically Adding query parameters in my custom Request Handler class

2016-01-09 Thread Mark Robinson
> ...you can
> change parameters with a simple text edit rather than require a Java build
> and jar deploy.
>
> Can you share what some of the requirements are for your custom request
> handler, including the motivation? I'd hate to see you go off and invest
> significant effort in a custom request handler when simpler techniques may
> suffice.
>
> -- Jack Krupansky
>
> On Sat, Jan 9, 2016 at 12:08 PM, Ahmet Arslan <iori...@yahoo.com.invalid>
> wrote:
>
> > Hi Mark,
> >
> > Yes this is possible. Better, you can use a custom SearchComponent for
> > this task too.
> > You retrieve solr parameters, wrap it into ModifiableSolrParams. Add
> extra
> > parameters etc, then pass it to underlying search components.
> >
> > Ahmet
> >
> >
> > On Saturday, January 9, 2016 3:59 PM, Mark Robinson <
> > mark123lea...@gmail.com> wrote:
> > Hi,
> > When I initially fire a query against my Solr instance using SolrJ, I
> > pass only, say, q=*:*&fq=(myfield:value1).
> >
> > I have written a custom RequestHandler, which is what I call in my SolrJ
> > query.
> > Inside this custom request handler, can I add more query params, like
> > say the facets, so that facets which were not specified when I invoked
> > the Solr URL using SolrJ are ultimately also received back in my
> > results?
> >
> > In short, instead of constructing the query dynamically initially in
> SolrJ
> > I want to add the extra query params, adding a jar in Solr (a java code
> > that will check certain conditions and dynamically add the query params
> > after the initial SolrJ query is done). That is why I thought of a custom
> > RH which would let me write a Java class and deploy it in Solr.
> >
> > Is this possible? Could someone get back to me please?
> >
> > Thanks!
> > Mark.
> >
>


Re: Dynamically Adding query parameters in my custom Request Handler class

2016-01-09 Thread Mark Robinson
Thanks Eric!

Appreciate your valuable suggestions.
Now I am getting the concept of a search-component better!

So my custom class is just this after removing the SolrJ part, as I just
need to modify the query by adding some parameters dynamically before the
query is actually executed by Solr:-

public void process(ResponseBuilder builder) throws IOException {
    SolrParams params = builder.req.getParams();
    String q = params.get(CommonParams.Q);
    ModifiableSolrParams params1 = new ModifiableSolrParams(params);
    params1.add("fl", "id");
    // Added this line
    builder.req.setParams(params1);

    System.out.println("q is ### " + q);
}

Note:- Nothing is inside the prepare() method.

In my /select RH I added the following in solrconfig.xml just before the
close of the <requestHandler> tag:-

<arr name="first-components">
  <str>exampleComponent</str>
</arr>

<searchComponent name="exampleComponent"
    class="org.ExampleSearchComponent"/>
Still it is not restricting the output fields to only the fl list.
The console output shows the following:-
q is ### *:*
16140 [qtp1856056345-12] INFO  org.apache.solr.core.SolrCore - [collection1]
webapp=/solr path=/select params={q=*:*} hits=4 status=0 QTime=4

Note:- the "###" proves that it accessed the custom class. But the

    ModifiableSolrParams params1 = new ModifiableSolrParams(params);
    params1.add("fl", "id");

did not take effect.

I think I am close to developing my first dynamic query component. Could
someone please tell me where I am going wrong this time?
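
A likely cause, offered as an assumption rather than something confirmed in
this thread: SearchHandler calls prepare() on every component before any
process() runs, and the fl parameter is consumed in QueryComponent's
prepare(). So by the time a custom component's process() swaps in new
params, the return-field list is already fixed. Moving the change into
prepare() should take effect; a sketch:

    @Override
    public void prepare(ResponseBuilder builder) throws IOException {
        // runs before QueryComponent.prepare() when this component is
        // registered ahead of it (e.g. in first-components)
        ModifiableSolrParams params1 =
            new ModifiableSolrParams(builder.req.getParams());
        params1.set(CommonParams.FL, "id");
        builder.req.setParams(params1);
    }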

I appreciate any help. I am very eager to see my first dynamic query
implemented using a customized search component!!

Thanks!
Mark.

On Sat, Jan 9, 2016 at 3:38 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:

> Woah, Mark….  you’re making a search request within a search component.
> Instead, let the built-in “query” component do the work for you.
>
> I think one fix for you is to make your “components” be “first-components”
> instead (allowing the other default search components to come into play).
> You don’t need to search within your component, just affect parameters,
> right?
>
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com <http://www.lucidworks.com/>
>
>
>
> > On Jan 9, 2016, at 3:19 PM, Mark Robinson <mark123lea...@gmail.com>
> wrote:
> >
> > Hi,
> >
> > Ahmet, Jack, Thanks for the pointers.
> > My requirement is that I would not be having the facets or sort fields
> > or their order as static.
> > For example suppose for a particular scenario I need to show only 2
> facets
> > and sort on only one field.
> > For another scenario I may have to do facet.field for a different set of
> > fields and sort on again another set of fields.
> > Consider there are some sort of user preferences for each query.
> >
> > So I think I may not be able to store my parameters like facet fields,
> > sort fields etc. preconfigured in solrconfig.xml.
> > Please correct me if I am wrong.
> >
> > Based on Ahmet's reply I created a custom SearchComponent with help
> > from the net.
> > I created a dummy RH and added this as the searchComponent in
> > solrconfig.xml:-
> >
> > <requestHandler name="/myexample" class="solr.SearchHandler">
> >   <arr name="components">
> >     <str>exampleComponent</str>
> >   </arr>
> > </requestHandler>
> >
> > <searchComponent name="exampleComponent"
> >     class="org.ExampleSearchComponent">
> > </searchComponent>
> >
> > ...invoked it using:-
> > http://localhost:8984/solr/myexample?q=*:*
> > The output gave me that one record with ALL FIELDS in full, in XML
> > format.
> > It did not give only the "id" field, which was what I was trying as a
> > test!
> >
> >
> > My code for the custom SearchComponent is shared below:-
> > Before that I have these queries:-
> > 1. In my code, instead of hitting the server AGAIN using SolrJ to
> > enforce my params (just "fl" newly added), is there any way the query
> > can be executed with my additional fl param?
> > 2. Just adding the additional input params is what I want to achieve. I
> > don't want to do anything to the response.
> >   Currently I am doing it:-
> >   --> builder.rsp.add("example", doc.getFields());
> >   Note:- I removed this line, and when I ran the query again, NO OUTPUT
> >   came.
> >   So suppose I used it along with any of my existing RHs by adding it as
> >   a searchComponent: I want it to only affect the input querying by
> >   adding additional params, and it should not influence the rendering
> >   of the output in any way. How do I add this to one of my existing
> >   request handlers so it only influences the input for querying and NOT
> >   the output format in any way?
> > 3. Why are all fields being rendered for the one doc I selected to come
> > back in my "example" variable,