Re: Managed Schemas and Version Control

2018-06-29 Thread Walter Underwood
I wrote a Python program that:

1. Gets a cluster status.
2. Extracts the Zookeeper location from that.
3. Uploads solr.xml and config to Zookeeper (using kazoo library).
4. Sends an async reload command.
5. Polls for success until all the nodes have finished the reload.
6. Optionally rebuilds the suggester.
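
A rough sketch of that flow, for reference (the collection name, URLs, and the
zkHost lookup are assumptions, not the actual program; error handling omitted):

import time
import requests
from kazoo.client import KazooClient

SOLR = "http://localhost:8983/solr"   # any node; hypothetical host
COLLECTION = "mycollection"           # hypothetical collection

# 1-2. Get cluster/system status; in cloud mode the system info handler
# reports the ZooKeeper connect string (field name assumed here).
info = requests.get(SOLR + "/admin/info/system", params={"wt": "json"}).json()
zk = KazooClient(hosts=info["zkHost"])
zk.start()

# 3. Upload solr.xml / config files with kazoo (one set() per file).
with open("conf/managed-schema", "rb") as f:
    path = "/configs/" + COLLECTION + "/managed-schema"
    zk.ensure_path(path)
    zk.set(path, f.read())
zk.stop()

# 4. Send an async reload command.
requests.get(SOLR + "/admin/collections",
             params={"action": "RELOAD", "name": COLLECTION, "async": "reload-1"})

# 5. Poll REQUESTSTATUS until all the nodes have finished the reload.
while True:
    r = requests.get(SOLR + "/admin/collections",
                     params={"action": "REQUESTSTATUS", "requestid": "reload-1"}).json()
    if r["status"]["state"] in ("completed", "failed"):
        break
    time.sleep(2)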

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 29, 2018, at 8:15 PM, Erick Erickson  wrote:
> 
> Adding to Shawn's comments.
> 
> You've pretty much nailed all the possibilities, it depends on
> what you're most comfortable with I suppose.
> 
> The only thing I'd add is that you probably have dev and prod
> environments and work out the correct schemas on dev then
> migrate to prod (at least that's what paranoid people like me
> do). At some point you have to graduate to prod when you're
> happy with your dev configs. It shouldn't be much of a problem to
> script something that
> 
> - pulled the configs from dev
> - pushed them to Git
> - pulled them from Git (sanity check)
> - pushed them to prod's ZK
> - reloaded the collection.
> 
> I'd be a little reluctant to script all the managed schema steps;
> too easy to forget to put one of those steps in. I'm picturing
> someone at 3 AM _finally_ getting all the schema figured out and
> forgetting to properly log the managed schema step. In my
> proposal it wouldn't matter: what you're archiving is the end result
> after you've done all your QA.
> 
> FWIW,
> Erick
> 
> P.S. you'd be surprised how many prod setups I've seen where
> they don't put their configs in VCS. Makes me break out in
> hives so kudos... ;)
> 
> On Fri, Jun 29, 2018 at 5:04 PM, Shawn Heisey  wrote:
>> On 6/29/2018 3:26 PM, Zimmermann, Thomas wrote:
>>> We're transitioning from Solr 4.10 to 7.x and working through our options 
>>> around managing our schemas. Currently we manage our schema files in a git 
>>> repository, make changes to the xml files,
>> 
>> Hopefully you've got the entire config in version control and not just
>> the schema.
>> 
>>> and then push them out to our zookeeper cluster via the zkcli and the 
>>> upconfig command like:
>>> 
>>> /apps/solr/bin/zkcli.sh -cmd upconfig -zkhost host.com:9580 -collection 
>>> core -confname core -confdir /apps/solr/cores/core/conf/ -solrhome 
>>> /apps/solr/
>> 
>> I don't think the collection parameter is valid for that command.  It
>> would be valid for the linkconfig command, but not for upconfig.  It's
>> probably not hurting anything, though.
>> 
>>> This allows us to deploy schema changes without restarting the cluster, 
>>> while maintaining version control. It looks like we could do the exact same 
>>> process using Solr 7 and the solr control script like
>>> 
>>> bin/solr zk upconfig -z 111.222.333.444:2181 -n mynewconfig -d 
>>> /path/to/configset
>> 
>> Yes, you can do it that way.
>> 
>>> Now of course we'd like to improve this process if possible, since manually 
>>> pushing schema files to the ZK server and reloading the cores is a bit 
>>> command line intensive. Does anyone have any guidance or experience here 
>>> leveraging the managed schema API to make updates to a schema in production 
>>> while maintaining a version-controlled copy of the schema? I'd considered 
>>> using the API to make changes to our schemas, and then saving off the 
>>> generated schema file to git, or saving off a script that creates the schema 
>>> file using the managed API to git, but I'm not sure if that is any easier 
>>> or just adds complexity.
>> 
>> My preferred method would be manual edits, pushing to git, pushing to
>> zookeeper, and reloading the collection.  I'm comfortable with that
>> method, and don't know much about the schema API.
>> 
>> If you're comfortable with the schema API, you can use that, and then
>> use the "downconfig" command on one one of the ZK scripts included with
>> Solr for pushing to git.
>> 
>> Exactly how to handle the automation would depend on what OS platforms
>> are involved and what sort of tools are accessible to those who will be
>> making the changes.  If it would be on a system accessed with a
>> commandline shell, then a commandline script (perhaps a shell script)
>> seems like the best option.  A script could be created that runs the
>> necessary git commands and then the Solr script to upload the new
>> config, and it could even reload the collection with a tool like curl.
>> 
>> Thanks,
>> Shawn
>> 



Re: Querying in Solrcloud

2018-06-29 Thread Walter Underwood
We use an AWS ALB for all of our Solr clusters. One is 40 instances.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 29, 2018, at 8:33 PM, Sushant Vengurlekar  
> wrote:
> 
> What are some of the suggested load balancers for SolrCloud? Can AWS ELB be
> used for load balancing?
> 
> On Fri, Jun 29, 2018 at 8:04 PM, Erick Erickson 
> wrote:
> 
>> In your setup, the load balancer prevents single points of failure.
>> 
>> Since you're pinging a URL, what happens if that node dies or is turned
>> off?
>> Your PHP program has no way of knowing what to do, but the load
>> balancer does.
>> 
>> Your understanding of Zookeeper's role shows a common misconception.
>> 
>> Zookeeper keeps track of the topology of the collections, what nodes are
>> up,
>> which ones are down, etc. It does _not_ have anything to do with distributing
>> queries
>> or updates. Imagine a 1,000 node collection. If each and every request had
>> to go through Zookeeper, that would be a bottleneck.
>> 
>> Instead, when each node's state changes, it informs Zookeeper which in turn
>> informs all the other Solr nodes who care. It looks like this.
>> - node starts up.
>> - as each replica comes up, it informs Zookeeper that it is now "active".
>> - for each collection with any replica on that node, a "watch" is set on
>> the
>>   collection's state.json node in Zookeeper
>> - every time that state.json node changes, Zookeeper notifies
>>   the node.
>> - eventually everything starts, all the state changes are broadcast
>>  and Zookeeper just sits there.
>> - periodically Zookeeper pings each Solr node and if it has gone away
>>  it informs all the Solr nodes that this node is dead
>>  and the Solr node updates its snapshot of the cluster's
>>  topology.
>> 
>> A query comes in to a Solr node and this is what happens:
>> - the Solr node looks in its Zookeeper information to see
>>  where all the replicas for the collection are.
>> - Solr picks one replica from each shard and sends the
>>   subquery to them
>> - Solr assembles the response from the subrequests
>> - Solr sends the response to the client.
>> 
>> note that Zookeeper isn't involved at all. In fact, Zookeeper
>> can go away completely and each Solr node will work on its
>> last snapshot of the topology of the network and answer
>> _queries_. Updates will fail completely if Zookeeper falls
>> below quorum, but Zookeeper isn't handling the _update_.
>> It's still Solr knowing that Zookeeper is below quorum
>> and refusing to process an update.
>> 
>> There's more going on of course, but that's the general outline.
>> 
>> Since you're using PHP, it doesn't know about Zookeeper; all it
>> has is a URL, so as I mentioned above, if that node goes away
>> it's your PHP program that's not Zookeeper-aware.
>> 
>> If you were using "CloudSolrClient" in SolrJ, it _is_ Zookeeper
>> aware and you would not need a load balancer. But again
>> that's because it knows the cluster topology (it registers its own
>> watchers) and can "do the right thing" if something goes away.
>> Zookeeper is still not directly involved in processing queries
>> or updates.
>> 
>> Best,
>> Erick
>> 
>> On Fri, Jun 29, 2018 at 7:31 PM, Sushant Vengurlekar
>>  wrote:
>>> Thanks for your reply. I have a follow up question. Why is a load
>> balancer
>>> needed? Isn't that the job of zookeeper to load balance queries across
>> solr
>>> nodes?
>>> 
>>> I was under the impression that you send query to zookeeper and it
>> handles
>>> the rest and sends the response back. Can you please enlighten me on
>> that
>>> one.
>>> 
>>> Thank you
>>> 
>>> On Fri, Jun 29, 2018 at 7:19 PM, Shalin Shekhar Mangar <
>>> shalinman...@gmail.com> wrote:
>>> 
 You send your queries and updates directly to Solr's collection e.g.
 http://host:port/solr/<collection>. You can use any Solr node
 for
 this request. If the node does not have the collection being queried
>> then
 the request will be forwarded internally to a Solr instance which has
>> that
 collection.
 
 ZooKeeper is used by Solr's Java client to look up the list of Solr
>> nodes
 having the collection being queried. But if you are using PHP then you
>> can
 probably keep a list of Solr nodes in configuration and randomly choose
 one. A better implementation would be to set up a load balancer and put
>> all
 Solr nodes behind it and query the load balancer URL in your
>> application.
 
 On Sat, Jun 30, 2018 at 7:31 AM Sushant Vengurlekar <
 svengurle...@curvolabs.com> wrote:
 
> I have a question regarding querying in solrcloud.
> 
> I am working on php code to query solrcloud for search results. Do I
>> send
> the query to zookeeper or send it to a particular solr node? How does
>> the
> querying process work in general?
> 
> Thank you
> 
 
 
 --
 Regards,
 Shalin Shekhar Mangar.
 
>> 



Re: Querying in Solrcloud

2018-06-29 Thread Sushant Vengurlekar
What are some of the suggested load balancers for SolrCloud? Can AWS ELB be
used for load balancing?

On Fri, Jun 29, 2018 at 8:04 PM, Erick Erickson 
wrote:

> In your setup, the load balancer prevents single points of failure.
>
> Since you're pinging a URL, what happens if that node dies or is turned
> off?
> Your PHP program has no way of knowing what to do, but the load
> balancer does.
>
> Your understanding of Zookeeper's role shows a common misconception.
>
> Zookeeper keeps track of the topology of the collections, what nodes are
> up,
> which ones are down, etc. It does _not_ have anything to do with distributing
> queries
> or updates. Imagine a 1,000 node collection. If each and every request had
> to go through Zookeeper, that would be a bottleneck.
>
> Instead, when each node's state changes, it informs Zookeeper which in turn
> informs all the other Solr nodes who care. It looks like this.
> - node starts up.
> - as each replica comes up, it informs Zookeeper that it is now "active".
> - for each collection with any replica on that node, a "watch" is set on
> the
>collection's state.json node in Zookeeper
> - every time that state.json node changes, Zookeeper notifies
>the node.
> - eventually everything starts, all the state changes are broadcast
>   and Zookeeper just sits there.
> - periodically Zookeeper pings each Solr node and if it has gone away
>   it informs all the Solr nodes that this node is dead
>   and the Solr node updates its snapshot of the cluster's
>   topology.
>
> A query comes in to a Solr node and this is what happens:
> - the Solr node looks in its Zookeeper information to see
>   where all the replicas for the collection are.
> - Solr picks one replica from each shard and sends the
>subquery to them
> - Solr assembles the response from the subrequests
> - Solr sends the response to the client.
>
> note that Zookeeper isn't involved at all. In fact, Zookeeper
> can go away completely and each Solr node will work on its
> last snapshot of the topology of the network and answer
> _queries_. Updates will fail completely if Zookeeper falls
> below quorum, but Zookeeper isn't handling the _update_.
> It's still Solr knowing that Zookeeper is below quorum
> and refusing to process an update.
>
> There's more going on of course, but that's the general outline.
>
> Since you're using PHP, it doesn't know about Zookeeper; all it
> has is a URL, so as I mentioned above, if that node goes away
> it's your PHP program that's not Zookeeper-aware.
>
> If you were using "CloudSolrClient" in SolrJ, it _is_ Zookeeper
> aware and you would not need a load balancer. But again
> that's because it knows the cluster topology (it registers its own
> watchers) and can "do the right thing" if something goes away.
> Zookeeper is still not directly involved in processing queries
> or updates.
>
> Best,
> Erick
>
> On Fri, Jun 29, 2018 at 7:31 PM, Sushant Vengurlekar
>  wrote:
> > Thanks for your reply. I have a follow up question. Why is a load
> balancer
> > needed? Isn't that the job of zookeeper to load balance queries across
> solr
> > nodes?
> >
> > I was under the impression that you send query to zookeeper and it
> handles
> > the rest and sends the response back. Can you please enlighten me on
> that
> > one.
> >
> > Thank you
> >
> > On Fri, Jun 29, 2018 at 7:19 PM, Shalin Shekhar Mangar <
> > shalinman...@gmail.com> wrote:
> >
> >> You send your queries and updates directly to Solr's collection e.g.
> >> http://host:port/solr/<collection>. You can use any Solr node
> >> for
> >> this request. If the node does not have the collection being queried
> then
> >> the request will be forwarded internally to a Solr instance which has
> that
> >> collection.
> >>
> >> ZooKeeper is used by Solr's Java client to look up the list of Solr
> nodes
> >> having the collection being queried. But if you are using PHP then you
> can
> >> probably keep a list of Solr nodes in configuration and randomly choose
> >> one. A better implementation would be to set up a load balancer and put
> all
> >> Solr nodes behind it and query the load balancer URL in your
> application.
> >>
> >> On Sat, Jun 30, 2018 at 7:31 AM Sushant Vengurlekar <
> >> svengurle...@curvolabs.com> wrote:
> >>
> >> > I have a question regarding querying in solrcloud.
> >> >
> >> > I am working on php code to query solrcloud for search results. Do I
> send
> >> > the query to zookeeper or send it to a particular solr node? How does
> the
> >> > querying process work in general?
> >> >
> >> > Thank you
> >> >
> >>
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
> >>
>


Re: Managed Schemas and Version Control

2018-06-29 Thread Erick Erickson
Adding to Shawn's comments.

You've pretty much nailed all the possibilities, it depends on
what you're most comfortable with I suppose.

The only thing I'd add is that you probably have dev and prod
environments and work out the correct schemas on dev then
migrate to prod (at least that's what paranoid people like me
do). At some point you have to graduate to prod when you're
happy with your dev configs. It shouldn't be much of a problem to
script something that

- pulled the configs from dev
- pushed them to Git
- pulled them from Git (sanity check)
- pushed them to prod's ZK
- reloaded the collection.
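
A sketch of that pipeline, with Python doing the orchestration (repo path, ZK
addresses, hosts, and collection name are all hypothetical):

import subprocess
import requests

CONF = "configs/mycollection"   # hypothetical path inside the Git repo

def run(*cmd):
    subprocess.run(cmd, check=True)

# pull the configs from dev
run("bin/solr", "zk", "downconfig", "-z", "dev-zk:2181",
    "-n", "mycollection", "-d", CONF)
# push them to Git
run("git", "add", CONF)
run("git", "commit", "-m", "promote mycollection configs")
run("git", "push")
# pull them from Git (sanity check: deploy what Git has)
run("git", "pull")
# push them to prod's ZK
run("bin/solr", "zk", "upconfig", "-z", "prod-zk:2181",
    "-n", "mycollection", "-d", CONF)
# reload the collection so the new config takes effect
requests.get("http://prod-solr:8983/solr/admin/collections",
             params={"action": "RELOAD", "name": "mycollection"})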

I'd be a little reluctant to script all the managed schema steps;
too easy to forget to put one of those steps in. I'm picturing
someone at 3 AM _finally_ getting all the schema figured out and
forgetting to properly log the managed schema step. In my
proposal it wouldn't matter: what you're archiving is the end result
after you've done all your QA.

FWIW,
Erick

P.S. you'd be surprised how many prod setups I've seen where
they don't put their configs in VCS. Makes me break out in
hives so kudos... ;)

On Fri, Jun 29, 2018 at 5:04 PM, Shawn Heisey  wrote:
> On 6/29/2018 3:26 PM, Zimmermann, Thomas wrote:
>> We're transitioning from Solr 4.10 to 7.x and working through our options 
>> around managing our schemas. Currently we manage our schema files in a git 
>> repository, make changes to the xml files,
>
> Hopefully you've got the entire config in version control and not just
> the schema.
>
>> and then push them out to our zookeeper cluster via the zkcli and the 
>> upconfig command like:
>>
>> /apps/solr/bin/zkcli.sh -cmd upconfig -zkhost host.com:9580 -collection core 
>> -confname core -confdir /apps/solr/cores/core/conf/ -solrhome /apps/solr/
>
> I don't think the collection parameter is valid for that command.  It
> would be valid for the linkconfig command, but not for upconfig.  It's
> probably not hurting anything, though.
>
>> This allows us to deploy schema changes without restarting the cluster, 
>> while maintaining version control. It looks like we could do the exact same 
>> process using Solr 7 and the solr control script like
>>
>> bin/solr zk upconfig -z 111.222.333.444:2181 -n mynewconfig -d 
>> /path/to/configset
>
> Yes, you can do it that way.
>
>> Now of course we'd like to improve this process if possible, since manually 
>> pushing schema files to the ZK server and reloading the cores is a bit 
>> command line intensive. Does anyone have any guidance or experience here 
>> leveraging the managed schema API to make updates to a schema in production 
>> while maintaining a version-controlled copy of the schema? I'd considered 
>> using the API to make changes to our schemas, and then saving off the 
>> generated schema file to git, or saving off a script that creates the schema 
>> file using the managed API to git, but I'm not sure if that is any easier 
>> or just adds complexity.
>
> My preferred method would be manual edits, pushing to git, pushing to
> zookeeper, and reloading the collection.  I'm comfortable with that
> method, and don't know much about the schema API.
>
> If you're comfortable with the schema API, you can use that, and then
> use the "downconfig" command on one one of the ZK scripts included with
> Solr for pushing to git.
>
> Exactly how to handle the automation would depend on what OS platforms
> are involved and what sort of tools are accessible to those who will be
> making the changes.  If it would be on a system accessed with a
> commandline shell, then a commandline script (perhaps a shell script)
> seems like the best option.  A script could be created that runs the
> necessary git commands and then the Solr script to upload the new
> config, and it could even reload the collection with a tool like curl.
>
> Thanks,
> Shawn
>


Re: Querying in Solrcloud

2018-06-29 Thread Sushant Vengurlekar
Thanks for the detailed explanation, Erick. It really helped clear up my
understanding.

On Fri, Jun 29, 2018 at 8:04 PM, Erick Erickson 
wrote:

> In your setup, the load balancer prevents single points of failure.
>
> Since you're pinging a URL, what happens if that node dies or is turned
> off?
> Your PHP program has no way of knowing what to do, but the load
> balancer does.
>
> Your understanding of Zookeeper's role shows a common misconception.
>
> Zookeeper keeps track of the topology of the collections, what nodes are
> up,
> which ones are down, etc. It does _not_ have anything to do with distributing
> queries
> or updates. Imagine a 1,000 node collection. If each and every request had
> to go through Zookeeper, that would be a bottleneck.
>
> Instead, when each node's state changes, it informs Zookeeper which in turn
> informs all the other Solr nodes who care. It looks like this.
> - node starts up.
> - as each replica comes up, it informs Zookeeper that it is now "active".
> - for each collection with any replica on that node, a "watch" is set on
> the
>collection's state.json node in Zookeeper
> - every time that state.json node changes, Zookeeper notifies
>the node.
> - eventually everything starts, all the state changes are broadcast
>   and Zookeeper just sits there.
> - periodically Zookeeper pings each Solr node and if it has gone away
>   it informs all the Solr nodes that this node is dead
>   and the Solr node updates its snapshot of the cluster's
>   topology.
>
> A query comes in to a Solr node and this is what happens:
> - the Solr node looks in its Zookeeper information to see
>   where all the replicas for the collection are.
> - Solr picks one replica from each shard and sends the
>subquery to them
> - Solr assembles the response from the subrequests
> - Solr sends the response to the client.
>
> note that Zookeeper isn't involved at all. In fact, Zookeeper
> can go away completely and each Solr node will work on its
> last snapshot of the topology of the network and answer
> _queries_. Updates will fail completely if Zookeeper falls
> below quorum, but Zookeeper isn't handling the _update_.
> It's still Solr knowing that Zookeeper is below quorum
> and refusing to process an update.
>
> There's more going on of course, but that's the general outline.
>
> Since you're using PHP, it doesn't know about Zookeeper; all it
> has is a URL, so as I mentioned above, if that node goes away
> it's your PHP program that's not Zookeeper-aware.
>
> If you were using "CloudSolrClient" in SolrJ, it _is_ Zookeeper
> aware and you would not need a load balancer. But again
> that's because it knows the cluster topology (it registers its own
> watchers) and can "do the right thing" if something goes away.
> Zookeeper is still not directly involved in processing queries
> or updates.
>
> Best,
> Erick
>
> On Fri, Jun 29, 2018 at 7:31 PM, Sushant Vengurlekar
>  wrote:
> > Thanks for your reply. I have a follow up question. Why is a load
> balancer
> > needed? Isn't that the job of zookeeper to load balance queries across
> solr
> > nodes?
> >
> > I was under the impression that you send query to zookeeper and it
> handles
> > the rest and sends the response back. Can you please enlighten me on
> that
> > one.
> >
> > Thank you
> >
> > On Fri, Jun 29, 2018 at 7:19 PM, Shalin Shekhar Mangar <
> > shalinman...@gmail.com> wrote:
> >
> >> You send your queries and updates directly to Solr's collection e.g.
> >> http://host:port/solr/<collection>. You can use any Solr node
> >> for
> >> this request. If the node does not have the collection being queried
> then
> >> the request will be forwarded internally to a Solr instance which has
> that
> >> collection.
> >>
> >> ZooKeeper is used by Solr's Java client to look up the list of Solr
> nodes
> >> having the collection being queried. But if you are using PHP then you
> can
> >> probably keep a list of Solr nodes in configuration and randomly choose
> >> one. A better implementation would be to set up a load balancer and put
> all
> >> Solr nodes behind it and query the load balancer URL in your
> application.
> >>
> >> On Sat, Jun 30, 2018 at 7:31 AM Sushant Vengurlekar <
> >> svengurle...@curvolabs.com> wrote:
> >>
> >> > I have a question regarding querying in solrcloud.
> >> >
> >> > I am working on php code to query solrcloud for search results. Do I
> send
> >> > the query to zookeeper or send it to a particular solr node? How does
> the
> >> > querying process work in general?
> >> >
> >> > Thank you
> >> >
> >>
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
> >>
>


Re: Querying in Solrcloud

2018-06-29 Thread Erick Erickson
In your setup, the load balancer prevents single points of failure.

Since you're pinging a URL, what happens if that node dies or is turned off?
Your PHP program has no way of knowing what to do, but the load
balancer does.

Your understanding of Zookeeper's role shows a common misconception.

Zookeeper keeps track of the topology of the collections, what nodes are up,
which ones are down, etc. It does _not_ have anything to do with distributing queries
or updates. Imagine a 1,000 node collection. If each and every request had
to go through Zookeeper, that would be a bottleneck.

Instead, when each node's state changes, it informs Zookeeper which in turn
informs all the other Solr nodes who care. It looks like this.
- node starts up.
- as each replica comes up, it informs Zookeeper that it is now "active".
- for each collection with any replica on that node, a "watch" is set on the
   collection's state.json node in Zookeeper
- every time that state.json node changes, Zookeeper notifies
   the node.
- eventually everything starts, all the state changes are broadcast
  and Zookeeper just sits there.
- periodically Zookeeper pings each Solr node and if it has gone away
  it informs all the Solr nodes that this node is dead
  and the Solr node updates its snapshot of the cluster's
  topology.

A query comes in to a Solr node and this is what happens:
- the Solr node looks in its Zookeeper information to see
  where all the replicas for the collection are.
- Solr picks one replica from each shard and sends the
   subquery to them
- Solr assembles the response from the subrequests
- Solr sends the response to the client.

note that Zookeeper isn't involved at all. In fact, Zookeeper
can go away completely and each Solr node will work on its
last snapshot of the topology of the network and answer
_queries_. Updates will fail completely if Zookeeper falls
below quorum, but Zookeeper isn't handling the _update_.
It's still Solr knowing that Zookeeper is below quorum
and refusing to process an update.

There's more going on of course, but that's the general outline.

Since you're using PHP, it doesn't know about Zookeeper; all it
has is a URL, so as I mentioned above, if that node goes away
it's your PHP program that's not Zookeeper-aware.

If you were using "CloudSolrClient" in SolrJ, it _is_ Zookeeper
aware and you would not need a load balancer. But again
that's because it knows the cluster topology (it registers its own
watchers) and can "do the right thing" if something goes away.
Zookeeper is still not directly involved in processing queries
or updates.

Best,
Erick

On Fri, Jun 29, 2018 at 7:31 PM, Sushant Vengurlekar
 wrote:
> Thanks for your reply. I have a follow up question. Why is a load balancer
> needed? Isn't that the job of zookeeper to load balance queries across solr
> nodes?
>
> I was under the impression that you send query to zookeeper and it handles
> the rest and sends the response back. Can you please enlighten me on that
> one.
>
> Thank you
>
> On Fri, Jun 29, 2018 at 7:19 PM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
>> You send your queries and updates directly to Solr's collection e.g.
>> http://host:port/solr/<collection>. You can use any Solr node
>> for
>> this request. If the node does not have the collection being queried then
>> the request will be forwarded internally to a Solr instance which has that
>> collection.
>>
>> ZooKeeper is used by Solr's Java client to look up the list of Solr nodes
>> having the collection being queried. But if you are using PHP then you can
>> probably keep a list of Solr nodes in configuration and randomly choose
>> one. A better implementation would be to set up a load balancer and put all
>> Solr nodes behind it and query the load balancer URL in your application.
>>
>> On Sat, Jun 30, 2018 at 7:31 AM Sushant Vengurlekar <
>> svengurle...@curvolabs.com> wrote:
>>
>> > I have a question regarding querying in solrcloud.
>> >
>> > I am working on php code to query solrcloud for search results. Do I send
>> > the query to zookeeper or send it to a particular solr node? How does the
>> > querying process work in general?
>> >
>> > Thank you
>> >
>>
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>


Re: Querying in Solrcloud

2018-06-29 Thread Sushant Vengurlekar
Thanks for your reply. I have a follow up question. Why is a load balancer
needed? Isn't that the job of zookeeper to load balance queries across solr
nodes?

I was under the impression that you send query to zookeeper and it handles
the rest and sends the response back. Can you please enlighten me on that
one.

Thank you

On Fri, Jun 29, 2018 at 7:19 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> You send your queries and updates directly to Solr's collection e.g.
> http://host:port/solr/<collection>. You can use any Solr node
> for
> this request. If the node does not have the collection being queried then
> the request will be forwarded internally to a Solr instance which has that
> collection.
>
> ZooKeeper is used by Solr's Java client to look up the list of Solr nodes
> having the collection being queried. But if you are using PHP then you can
> probably keep a list of Solr nodes in configuration and randomly choose
> one. A better implementation would be to set up a load balancer and put all
> Solr nodes behind it and query the load balancer URL in your application.
>
> On Sat, Jun 30, 2018 at 7:31 AM Sushant Vengurlekar <
> svengurle...@curvolabs.com> wrote:
>
> > I have a question regarding querying in solrcloud.
> >
> > I am working on php code to query solrcloud for search results. Do I send
> > the query to zookeeper or send it to a particular solr node? How does the
> > querying process work in general?
> >
> > Thank you
> >
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Querying in Solrcloud

2018-06-29 Thread Shalin Shekhar Mangar
You send your queries and updates directly to Solr's collection e.g.
http://host:port/solr/<collection>. You can use any Solr node for
this request. If the node does not have the collection being queried then
the request will be forwarded internally to a Solr instance which has that
collection.

ZooKeeper is used by Solr's Java client to look up the list of Solr nodes
having the collection being queried. But if you are using PHP then you can
probably keep a list of Solr nodes in configuration and randomly choose
one. A better implementation would be to set up a load balancer and put all
Solr nodes behind it and query the load balancer URL in your application.
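
A rough sketch of the keep-a-node-list idea, in Python for brevity (the node
URLs and collection name are made up):

import random
import requests

SOLR_NODES = ["http://solr1:8983", "http://solr2:8983", "http://solr3:8983"]

def query(collection, params):
    # try the nodes in random order and fall through to the next on failure
    for node in random.sample(SOLR_NODES, len(SOLR_NODES)):
        try:
            r = requests.get(node + "/solr/" + collection + "/select",
                             params=params, timeout=5)
            r.raise_for_status()
            return r.json()
        except requests.RequestException:
            continue
    raise RuntimeError("no Solr node reachable")

docs = query("mycollection", {"q": "*:*", "wt": "json"})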

On Sat, Jun 30, 2018 at 7:31 AM Sushant Vengurlekar <
svengurle...@curvolabs.com> wrote:

> I have a question regarding querying in solrcloud.
>
> I am working on php code to query solrcloud for search results. Do I send
> the query to zookeeper or send it to a particular solr node? How does the
> querying process work in general?
>
> Thank you
>


-- 
Regards,
Shalin Shekhar Mangar.


Querying in Solrcloud

2018-06-29 Thread Sushant Vengurlekar
I have a question regarding querying in solrcloud.

I am working on php code to query solrcloud for search results. Do I send
the query to zookeeper or send it to a particular solr node? How does the
querying process work in general?

Thank you


Re: Managed Schemas and Version Control

2018-06-29 Thread Shawn Heisey
On 6/29/2018 3:26 PM, Zimmermann, Thomas wrote:
> We're transitioning from Solr 4.10 to 7.x and working through our options 
> around managing our schemas. Currently we manage our schema files in a git 
> repository, make changes to the xml files,

Hopefully you've got the entire config in version control and not just
the schema.

> and then push them out to our zookeeper cluster via the zkcli and the 
> upconfig command like:
>
> /apps/solr/bin/zkcli.sh -cmd upconfig -zkhost host.com:9580 -collection core 
> -confname core -confdir /apps/solr/cores/core/conf/ -solrhome /apps/solr/

I don't think the collection parameter is valid for that command.  It
would be valid for the linkconfig command, but not for upconfig.  It's
probably not hurting anything, though.

> This allows us to deploy schema changes without restarting the cluster, while 
> maintaining version control. It looks like we could do the exact same process 
> using Solr 7 and the solr control script like
>
> bin/solr zk upconfig -z 111.222.333.444:2181 -n mynewconfig -d 
> /path/to/configset

Yes, you can do it that way.

> Now of course we'd like to improve this process if possible, since manually 
> pushing schema files to the ZK server and reloading the cores is a bit 
> command line intensive. Does anyone have any guidance or experience here 
> leveraging the managed schema API to make updates to a schema in production 
> while maintaining a version-controlled copy of the schema? I'd considered 
> using the API to make changes to our schemas, and then saving off the 
> generated schema file to git, or saving off a script that creates the schema 
> file using the managed API to git, but I'm not sure if that is any easier or 
> just adds complexity.

My preferred method would be manual edits, pushing to git, pushing to
zookeeper, and reloading the collection.  I'm comfortable with that
method, and don't know much about the schema API.

If you're comfortable with the schema API, you can use that, and then
use the "downconfig" command on one one of the ZK scripts included with
Solr for pushing to git.

Exactly how to handle the automation would depend on what OS platforms
are involved and what sort of tools are accessible to those who will be
making the changes.  If it would be on a system accessed with a
commandline shell, then a commandline script (perhaps a shell script)
seems like the best option.  A script could be created that runs the
necessary git commands and then the Solr script to upload the new
config, and it could even reload the collection with a tool like curl.
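
The reload itself is a single Collections API call, something like this (host
and collection name assumed):

curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"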

Thanks,
Shawn



Managed Schemas and Version Control

2018-06-29 Thread Zimmermann, Thomas
Hi,

We're transitioning from Solr 4.10 to 7.x and working through our options 
around managing our schemas. Currently we manage our schema files in a git 
repository, make changes to the xml files, and then push them out to our 
zookeeper cluster via the zkcli and the upconfig command like:

/apps/solr/bin/zkcli.sh -cmd upconfig -zkhost host.com:9580 -collection core 
-confname core -confdir /apps/solr/cores/core/conf/ -solrhome /apps/solr/

This allows us to deploy schema changes without restarting the cluster, while 
maintaining version control. It looks like we could do the exact same process 
using Solr 7 and the solr control script like

bin/solr zk upconfig -z 111.222.333.444:2181 -n mynewconfig -d 
/path/to/configset

Now of course we'd like to improve this process if possible, since manually 
pushing schema files to the ZK server and reloading the cores is a bit command 
line intensive. Does anyone have any guidance or experience here leveraging the 
managed schema API to make updates to a schema in production while maintaining 
a version-controlled copy of the schema? I'd considered using the API to make 
changes to our schemas, and then saving off the generated schema file to git, or 
saving off a script that creates the schema file using the managed API to git, 
but I'm not sure if that is any easier or just adds complexity.
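
For context, a managed Schema API change is a single HTTP call, along these
lines (host, collection, and field name are hypothetical):

curl -X POST -H 'Content-type:application/json' \
  --data-binary '{"add-field":{"name":"title_sort","type":"string","stored":true}}' \
  http://localhost:8983/solr/mycollection/schema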

Any thoughts or experience appreciated.

Thanks,
TZ


Re: Solr 7.4 and Zookeeper 3.4.12

2018-06-29 Thread Walter Underwood
The documentation does not say that Solr uses the zk client 3.4.11. It says, 
"Solr currently uses Apache ZooKeeper v3.4.11.” That is on the page titled 
"Setting Up an External ZooKeeper Ensemble” in the section "Download Apache 
ZooKeeper”. Maybe that is supposed to mean “The Solr code uses the 3.4.11 
client, use whatever server you want”, but I read that as a recommendation to 
use the 3.4.11 server.

http://lucene.apache.org/solr/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#download-apache-zookeeper

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 29, 2018, at 11:54 AM, Zimmermann, Thomas  
> wrote:
> 
> Thanks Shawn - I misspoke when I said recommendation, should have said
> "packaged with". I appreciate the feedback and the quick updates to the
> Jira issue. We'll plan to proceed with 3.4.12 when we go live.
> 
> -TZ
> 
> On 6/29/18, 11:38 AM, "Shawn Heisey"  wrote:
> 
>> On 6/28/2018 8:39 PM, Zimmermann, Thomas wrote:
>>> I was wondering if there was a reason Solr 7.4 is still recommending ZK
>>> 3.4.11 as the major version in the official changelog vs shipping with
>>> 3.4.12 despite the known regression in 3.4.11. Are there any known
>>> issues with running 7.4 alongside ZK 3.4.12. We are beginning a major
>>> Solr upgrade project (4.10 to 7.4) and want to stand up the most recent
>>> supported versions of both ZK/Solr as part of the process.
>> 
>> That is NOT a recommendation.
>> 
>> The mention of ZK 3.4.11 in Solr's CHANGES.txt file is simply the
>> version of ZK that Solr ships with.  ZK is included with Solr mostly for
>> the client functionality.  The regression is in the server code, and
>> unless you run the embedded ZK server, which is not recommended for
>> production, the ZK library that ships with Solr will not experience the
>> regression.
>> 
>> I am not aware of anywhere in Solr or its reference guide that makes a
>> recommendation about a specific version of ZK.  The reference guide does
>> mention version 3.4.11, but that's only because that's the version that
>> Solr includes.  The version number in the documentation source code is
>> dynamic and will always match the specific version that Solr includes.
>> 
>> The compatibility goals of the ZK project indicate that you can run any
>> 3.4.x or 3.5.x version of ZK on the server side and be compatible with
>> the ZK 3.4.x client that's in Solr.
>> 
>> Look for "Backward Compatibility" on this page:
>> 
>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
>> 
>> We have an issue for upgrading the version of ZK in Solr to 3.4.12.  I
>> have uploaded a new patch on that issue to try and clear up any
>> confusion about what version of ZK is recommended for use with Solr:
>> 
>> https://issues.apache.org/jira/browse/SOLR-12346
>> 
>> Thanks,
>> Shawn
>> 
> 



Re: Solr 7.4 and Zookeeper 3.4.12

2018-06-29 Thread Zimmermann, Thomas
Thanks Shawn - I misspoke when I said recommendation, should have said
"packaged with". I appreciate the feedback and the quick updates to the
Jira issue. We'll plan to proceed with 3.4.12 when we go live.

-TZ

On 6/29/18, 11:38 AM, "Shawn Heisey"  wrote:

>On 6/28/2018 8:39 PM, Zimmermann, Thomas wrote:
>> I was wondering if there was a reason Solr 7.4 is still recommending ZK
>>3.4.11 as the major version in the official changelog vs shipping with
>>3.4.12 despite the known regression in 3.4.11. Are there any known
>>issues with running 7.4 alongside ZK 3.4.12. We are beginning a major
>>Solr upgrade project (4.10 to 7.4) and want to stand up the most recent
>>supported versions of both ZK/Solr as part of the process.
>
>That is NOT a recommendation.
>
>The mention of ZK 3.4.11 in Solr's CHANGES.txt file is simply the
>version of ZK that Solr ships with.  ZK is included with Solr mostly for
>the client functionality.  The regression is in the server code, and
>unless you run the embedded ZK server, which is not recommended for
>production, the ZK library that ships with Solr will not experience the
>regression.
>
>I am not aware of anywhere in Solr or its reference guide that makes a
>recommendation about a specific version of ZK.  The reference guide does
>mention version 3.4.11, but that's only because that's the version that
>Solr includes.  The version number in the documentation source code is
>dynamic and will always match the specific version that Solr includes.
>
>The compatibility goals of the ZK project indicate that you can run any
>3.4.x or 3.5.x version of ZK on the server side and be compatible with
>the ZK 3.4.x client that's in Solr.
>
>Look for "Backward Compatibility" on this page:
>
>https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement
>
>We have an issue for upgrading the version of ZK in Solr to 3.4.12.  I
>have uploaded a new patch on that issue to try and clear up any
>confusion about what version of ZK is recommended for use with Solr:
>
>https://issues.apache.org/jira/browse/SOLR-12346
>
>Thanks,
>Shawn
>



Re: 7.3 appears to leak

2018-06-29 Thread Erick Erickson
This is truly puzzling then; I'm clueless. It's hard to imagine this
is lurking out there and nobody else notices, but you've eliminated
the custom code. And this is also very peculiar:

* it occurs only in our main text search collection, all other
collections are unaffected;
* despite what i said earlier, it is so far unreproducible outside
production, even when mimicking production as well as we can;

Here's a tedious idea. Restart Solr with the -v option; I _think_ that
shows you each and every jar file Solr loads. Is it "somehow" possible
that your main collection is loading some jar from somewhere that's
different than you expect? 'cause silly ideas like this are all I can
come up with.
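
(Assuming a typical cloud install, that would be something like

bin/solr restart -v -c -z zk1:2181

with whatever -z/-p flags you normally start the node with.)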

Erick

On Fri, Jun 29, 2018 at 9:56 AM, Markus Jelsma
 wrote:
> Hello Erick,
>
> The custom search handler doesn't interact with SolrIndexSearcher, this is 
> really all it does:
>
>   public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
>       throws Exception {
>     super.handleRequestBody(req, rsp);
>
>     if (rsp.getToLog().get("hits") instanceof Integer) {
>       rsp.addHttpHeader("X-Solr-Hits",
>           String.valueOf((Integer)rsp.getToLog().get("hits")));
>     }
>     if (rsp.getToLog().get("hits") instanceof Long) {
>       rsp.addHttpHeader("X-Solr-Hits",
>           String.valueOf((Long)rsp.getToLog().get("hits")));
>     }
>   }
>
> I am not sure this qualifies as one more to go.
>
> Re: compiler warnings on resources, yes! This and tests failing due to 
> resources leaks have always warned me when i forgot to release something or 
> decrement a reference. But except for the above method (and the token filters 
> which i really can't disable) are all that is left.
>
> I am quite desperate about this problem so although i am unwilling to disable 
> stuff, i can do it if i must. But i see no reason, yet, to remove the search
> handler or the token filter stuff, i mean, how could those leak a 
> SolrIndexSearcher?
>
> Let me know :)
>
> Many thanks!
> Markus
>
> -Original message-
>> From:Erick Erickson 
>> Sent: Friday 29th June 2018 18:46
>> To: solr-user 
>> Subject: Re: 7.3 appears to leak
>>
>> bq. The only custom stuff left is an extension of SearchHandler that
>> only writes numFound to the response headers.
>>
>> Well, one more to go ;). It's incredibly easy to overlook
>> innocent-seeming calls that increment the underlying reference count
>> of some objects but don't decrement them, usually through a close
>> call. Which isn't necessarily a close if the underlying reference
>> count is still > 0.
>>
>> You may infer that I've been there and done that ;). Sometimes the
>> compiler warnings about "resource leak" can help pinpoint those too.
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 29, 2018 at 9:16 AM, Markus Jelsma
>>  wrote:
>> > Hello Yonik,
>> >
>> > I took one node of the 7.2.1 cluster out of the load balancer so it would 
>> > only receive shard queries, this way i could kind of 'safely' disable our 
>> > custom components one by one, while keeping functionality in place by 
>> > letting the other 7.2.1 nodes continue on with the full configuration.
>> >
>> > I am now at a point where literally all custom components are deleted or 
>> > commented out in the config for the node running 7.4. The only custom 
>> > stuff left is an extension of SearchHandler that only writes numFound to 
>> > the response headers, and all the token filters in our schema.
>> >
>> > You were right, it was leaking exactly one SolrIndexSearcher instance on 
>> > each commit. But, with all our stuff gone, the leak is still there! I 
>> > triple checked it! Of course, the bastard is locally still not 
>> > reproducible.
>> >
>> > So, what is next? I have no clues left.
>> >
>> > Many, many thanks,
>> > Markus
>> >
>> > -Original message-
>> >> From:Markus Jelsma 
>> >> Sent: Thursday 28th June 2018 23:52
>> >> To: solr-user@lucene.apache.org
>> >> Subject: RE: 7.3 appears to leak
>> >>
>> >> Hello Yonik,
>> >>
>> >> If leaking a whole SolrIndexSearcher would cause this problem, then the 
>> >> only custom component in play, our copy/paste-and-enhance version of the 
>> >> elevator component, is the root of all problems. It is a direct copy of 
>> >> the 7.2 source where only things like getAnalyzedQuery, the ElevationObj 
>> >> and the loop over the map entries are changed.
>> >>
>> >> There are no changes to code related to the searcher. Other component 
>> >> where we get a RefCount of searcher is used without issues, we always 
>> >> decrement the reference after using it. But those components are not in 
>> >> use in this collection.
>> >>
>> >> The source has changed a lot with 7.4 but we still use the old code. I 
>> >> will investigate the component thoroughly, even revert to the old 7.2 
>> >> vanilla component for a brief period in production for one machine. It 
>> >> may not be a problem if i don't let our load balancer access it directly, 
>> >> so it only serves shard queries.
>> >>
>> >> I will get back to this topic tomorrow!

Re: Solr - zoo with more than 1000 collections

2018-06-29 Thread Yago Riveiro
Solr doesn’t scale very well with ~2K collections, and yes, the bottleneck is 
Zookeeper itself.

Zookeeper doesn’t perform operations as quickly as expected on folders with a 
lot of children.

In a scenario where you are in a recovery state (a node crash), this limitation 
hurts a lot: the work queue stacks up recovery operations due to the low 
throughput consuming the queue.

Regards.

--

Yago Riveiro

On 29 Jun 2018 17:38 +0100, Bertrand Mahé , wrote:
> Hi,
>
>
>
> In order to store timeseries data and perform deletion easily, we create
> several collections per day and then use aliases.
>
>
>
> We are using SOLR 7.3 and we have 2 questions:
>
>
>
> Q1 : In order to access quickly the latest data would it be possible to load
> cores in descending chronological order rather than alphabetical order?
>
>
>
> Q2: When we exceed 1200-1300 collections, zookeeper suddenly changes from
> 600-700 KB of RAM to 3 GB, which makes zoo very slow or almost unusable. Is
> this normal?
>
>
>
> Thanks in advance,
>
>
>
> Bertrand
>


RE: 7.3 appears to leak

2018-06-29 Thread Markus Jelsma
Hello Erick,

The custom search handler doesn't interact with SolrIndexSearcher, this is 
really all it does:

  // Delegate to the stock SearchHandler, then copy the hit count it put
  // in the response log into an HTTP header.
  public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
      throws Exception {
    super.handleRequestBody(req, rsp);

    // "hits" may be logged as an Integer or a Long depending on the code path.
    if (rsp.getToLog().get("hits") instanceof Integer) {
      rsp.addHttpHeader("X-Solr-Hits",
          String.valueOf((Integer)rsp.getToLog().get("hits")));
    }
    if (rsp.getToLog().get("hits") instanceof Long) {
      rsp.addHttpHeader("X-Solr-Hits",
          String.valueOf((Long)rsp.getToLog().get("hits")));
    }
  }

I am not sure this qualifies as one more to go.

Re: compiler warnings on resources, yes! This and tests failing due to 
resources leaks have always warned me when i forgot to release something or 
decrement a reference. But except for the above method (and the token filters 
which i really can't disable) are all that is left.

I am quite desperate about this problem so although i am unwilling to disable 
stuff, i can do it if i must. But i so reason, yet, to remove the search 
handler or the token filter stuff, i mean, how could those leak a 
SolrIndexSearcher?

Let me know :)

Many thanks!
Markus

-Original message-
> From:Erick Erickson 
> Sent: Friday 29th June 2018 18:46
> To: solr-user 
> Subject: Re: 7.3 appears to leak
> 
> bq. The only custom stuff left is an extension of SearchHandler that
> only writes numFound to the response headers.
> 
> Well, one more to go ;). It's incredibly easy to overlook
> innocent-seeming calls that increment the underlying reference count
> of some objects but don't decrement them, usually through a close
> call. Which isn't necessarily a close if the underlying reference
> count is still > 0.
> 
> You may infer that I've been there and done that ;). Sometimes the
> compiler warnings about "resource leak" can help pinpoint those too.
> 
> Best,
> Erick
> 
> On Fri, Jun 29, 2018 at 9:16 AM, Markus Jelsma
>  wrote:
> > Hello Yonik,
> >
> > I took one node of the 7.2.1 cluster out of the load balancer so it would 
> > only receive shard queries, this way i could kind of 'safely' disable our 
> > custom components one by one, while keeping functionality in place by 
> > letting the other 7.2.1 nodes continue on with the full configuration.
> >
> > I am now at a point where literally all custom components are deleted or 
> > commented out in the config for the node running 7.4. The only custom stuff 
> > left is an extension of SearchHandler that only writes numFound to the 
> > response headers, and all the token filters in our schema.
> >
> > You were right, it was leaking exactly one SolrIndexSearcher instance on 
> > each commit. But, with all our stuff gone, the leak is still there! I 
> > triple checked it! Of course, the bastard is locally still not reproducible.
> >
> > So, what is next? I have no clues left.
> >
> > Many, many thanks,
> > Markus
> >
> > -Original message-
> >> From:Markus Jelsma 
> >> Sent: Thursday 28th June 2018 23:52
> >> To: solr-user@lucene.apache.org
> >> Subject: RE: 7.3 appears to leak
> >>
> >> Hello Yonik,
> >>
> >> If leaking a whole SolrIndexSearcher would cause this problem, then the 
> >> only custom component in play, our copy/paste-and-enhance version of the 
> >> elevator component, is the root of all problems. It is a direct copy of 
> >> the 7.2 source where only things like getAnalyzedQuery, the ElevationObj 
> >> and the loop over the map entries are changed.
> >>
> >> There are no changes to code related to the searcher. Other component 
> >> where we get a RefCount of searcher is used without issues, we always 
> >> decrement the reference after using it. But those components are not in 
> >> use in this collection.
> >>
> >> The source has changed a lot with 7.4 but we still use the old code. I 
> >> will investigate the component thoroughly, even revert to the old 7.2 
> >> vanilla component for a brief period in production for one machine. It may 
> >> not be a problem if i don't let our load balancer access it directly, so 
> >> it only serves shard queries.
> >>
> >> I will get back to this topic tomorrow!
> >>
> >> Many thanks,
> >> Markus
> >>
> >>
> >>
> >> -Original message-
> >> > From:Yonik Seeley 
> >> > Sent: Thursday 28th June 2018 23:30
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Re: 7.3 appears to leak
> >> >
> >> > > * SortedIntDocSet instances ánd ConcurrentLRUCache$CacheEntry 
> >> > > instances are both leaked on commit;
> >> >
> >> > If these are actually filterCache entries being leaked, it stands to
> >> > reason that a whole searcher is being leaked somewhere.
> >> >
> >> > -Yonik
> >> >
> >>
> 


Re: 7.3 appears to leak

2018-06-29 Thread Erick Erickson
bq. The only custom stuff left is an extension of SearchHandler that
only writes numFound to the response headers.

Well, one more to go ;). It's incredibly easy to overlook
innocent-seeming calls that increment the underlying reference count
of some objects but don't decrement them, usually through a close
call. Which isn't necessarily a close if the underlying reference
count is still > 0.

You may infer that I've been there and done that ;). Sometimes the
compiler warnings about "resource leak" can help pinpoint those too.

Best,
Erick

On Fri, Jun 29, 2018 at 9:16 AM, Markus Jelsma
 wrote:
> Hello Yonik,
>
> I took one node of the 7.2.1 cluster out of the load balancer so it would 
> only receive shard queries, this way i could kind of 'safely' disable our 
> custom components one by one, while keeping functionality in place by letting 
> the other 7.2.1 nodes continue on with the full configuration.
>
> I am now at a point where literally all custom components are deleted or 
> commented out in the config for the node running 7.4. The only custom stuff 
> left is an extension of SearchHandler that only writes numFound to the 
> response headers, and all the token filters in our schema.
>
> You were right, it was leaking exactly one SolrIndexSearcher instance on each 
> commit. But, with all our stuff gone, the leak is still there! I triple 
> checked it! Of course, the bastard is locally still not reproducible.
>
> So, what is next? I have no clues left.
>
> Many, many thanks,
> Markus
>
> -Original message-
>> From:Markus Jelsma 
>> Sent: Thursday 28th June 2018 23:52
>> To: solr-user@lucene.apache.org
>> Subject: RE: 7.3 appears to leak
>>
>> Hello Yonik,
>>
>> If leaking a whole SolrIndexSearcher would cause this problem, then the only 
>> custom component in play, our copy/paste-and-enhance version of the elevator 
>> component, is the root of all problems. It is a direct copy of the 7.2 
>> source where only things like getAnalyzedQuery, the ElevationObj and the 
>> loop over the map entries are changed.
>>
>> There are no changes to code related to the searcher. Other component where 
>> we get a RefCount of searcher is used without issues, we always decrement 
>> the reference after using it. But those components are not in use in this 
>> collection.
>>
>> The source has changed a lot with 7.4 but we still use the old code. I will 
>> investigate the component thoroughly, even revert to the old 7.2 vanilla 
>> component for a brief period in production for one machine. It may not be a 
>> problem if i don't let our load balancer access it directly, so it only 
>> serves shard queries.
>>
>> I will get back to this topic tomorrow!
>>
>> Many thanks,
>> Markus
>>
>>
>>
>> -Original message-
>> > From:Yonik Seeley 
>> > Sent: Thursday 28th June 2018 23:30
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: 7.3 appears to leak
>> >
>> > > * SortedIntDocSet instances ánd ConcurrentLRUCache$CacheEntry instances 
>> > > are both leaked on commit;
>> >
>> > If these are actually filterCache entries being leaked, it stands to
>> > reason that a whole searcher is being leaked somewhere.
>> >
>> > -Yonik
>> >
>>


Solr - zoo with more than 1000 collections

2018-06-29 Thread Bertrand Mahé
Hi,

 

In order to store timeseries data and perform deletion easily, we create
several collections per day and then use aliases.
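
For reference, repointing a daily alias is a single Collections API call, e.g.
(all names here are hypothetical):

curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=data&collections=data_20180629,data_20180628"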

 

We are using SOLR 7.3 and we have 2 questions:

 

Q1 : In order to access quickly the latest data would it be possible to load
cores in descending chronological order rather than alphabetical order?

 

Q2: When we exceed 1200-1300 collections, zookeeper suddenly changes from
600-700 KB of RAM to 3 GB, which makes zoo very slow or almost unusable. Is
this normal?

 

Thanks in advance,

 

Bertrand



RE: 7.3 appears to leak

2018-06-29 Thread Markus Jelsma
Hello Yonik,

I took one node of the 7.2.1 cluster out of the load balancer so it would only 
receive shard queries, this way i could kind of 'safely' disable our custom 
components one by one, while keeping functionality in place by letting the 
other 7.2.1 nodes continue on with the full configuration.

I am now at a point where literally all custom components are deleted or 
commented out in the config for the node running 7.4. The only custom stuff 
left is an extension of SearchHandler that only writes numFound to the response 
headers, and all the token filters in our schema.

You were right, it was leaking exactly one SolrIndexSearcher instance on each 
commit. But, with all our stuff gone, the leak is still there! I triple checked 
it! Of course, the bastard is locally still not reproducible.

So, what is next? I have no clues left.

Many, many thanks,
Markus 
 
-Original message-
> From:Markus Jelsma 
> Sent: Thursday 28th June 2018 23:52
> To: solr-user@lucene.apache.org
> Subject: RE: 7.3 appears to leak
> 
> Hello Yonik,
> 
> If leaking a whole SolrIndexSearcher would cause this problem, then the only 
> custom component in play, our copy/paste-and-enhance version of the elevator 
> component, is the root of all problems. It is a direct copy of the 7.2 source 
> where only things like getAnalyzedQuery, the ElevationObj and the loop over 
> the map entries are changed.
> 
> There are no changes to code related to the searcher. Other component where 
> we get a RefCount of searcher is used without issues, we always decrement the 
> reference after using it. But those components are not in use in this 
> collection.
> 
> The source has changed a lot with 7.4 but we still use the old code. I will 
> investigate the component thoroughly, even revert to the old 7.2 vanilla 
> component for a brief period in production for one machine. It may not be a 
> problem if i don't let our load balancer access it directly, so it only 
> serves shard queries.
> 
> I will get back to this topic tomorrow!
> 
> Many thanks,
> Markus
> 
>  
>  
> -Original message-
> > From:Yonik Seeley 
> > Sent: Thursday 28th June 2018 23:30
> > To: solr-user@lucene.apache.org
> > Subject: Re: 7.3 appears to leak
> > 
> > > * SortedIntDocSet instances ánd ConcurrentLRUCache$CacheEntry instances 
> > > are both leaked on commit;
> > 
> > If these are actually filterCache entries being leaked, it stands to
> > reason that a whole searcher is being leaked somewhere.
> > 
> > -Yonik
> > 
> 


Re: Graph, GraphML, Gephi and Edge Labels

2018-06-29 Thread Heidi McClure
Ok. Will do. I saw the place in the code, but haven’t managed to get the code 
to build, yet. 

> On Jun 29, 2018, at 9:03 AM, Joel Bernstein  wrote:
> 
> Hi,
> 
> Currently the nodes expression doesn't have this capability. Feel free to
> make a feature request on jira. This sounds like a fairly easy feature to
> add.
> 
> 
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Wed, Jun 27, 2018 at 5:21 PM, Heidi McClure <
> heidi.mccl...@polarisalpha.com> wrote:
> 
>> Hello,
>> 
>> I am trying to export graph data from a Solr index (version 7.2) in a
>> format that can be imported to Gephi for visualization.  I'm getting
>> close!  Is there a way to add edge labels to the exports from this type of
>> command (see curl command that follows and sample outputs)?
>> 
>> Thanks in advance,
>> -heidi
>> 
>> Based on the examples found here: https://lucene.apache.org/
>> solr/guide/7_2/graph-traversal.html , this is working in my GDELT-based
>> data set query request:
>> 
>> curl --data-urlencode 'expr=nodes(gdelt_graph,
>>  nodes(gdelt_graph,
>>walk="POLICE->Actor1Name_s",
>>trackTraversal="true",
>>gather="Actor2Name_s"),
>>  walk="node->Actor1Name_s",
>>  scatter="leaves,branches",
>>  trackTraversal="true",
>>  gather="Actor2Name_s")'
>> http://mymachine:8983/solr/gdelt_graph/graph
>> 
>> Output is like this (just a subset):
>> 
>> <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
>> http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
>> <graph>
>> <node id="...">
>> <data key="field">node</data>
>> <data key="level">0</data>
>> </node>
>> <node id="...">
>> <data key="field">Actor2Name_s</data>
>> <data key="level">1</data>
>> </node>
>> <node id="...">
>> <data key="field">Actor2Name_s</data>
>> <data key="level">1</data>
>> </node>
>> ...
>> </graph>
>> </graphml>
>> 
>> 
>> And I'd like to have a key for label and the data tag on the edges so that
>> I can get the Labels into Gephi.  Does anyone know if this can be done?
>> Below is example of what I mean.  Notice the key for label at the top of
>> the file and the "This is an edge description" entries on two of the edges
>> (ids 1 and 2).
>> 
>> 
>> <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
>> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>> xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
>> http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
>> <key id="label" for="edge" attr.name="label" attr.type="string"/>
>> <graph>
>> <node id="...">
>> <data key="field">node</data>
>> <data key="level">0</data>
>> </node>
>> <edge id="1" source="..." target="...">
>> <data key="label">This is an edge description.</data>
>> </edge>
>> <node id="...">
>> <data key="field">Actor2Name_s</data>
>> <data key="level">1</data>
>> </node>
>> <edge id="2" source="..." target="...">
>> <data key="label">This is an edge description.</data>
>> </edge>
>> <node id="...">
>> <data key="field">Actor2Name_s</data>
>> <data key="level">1</data>
>> </node>
>> </graph>
>> </graphml>
>> 
>> 


Re: Sorting issue while using collection parameter

2018-06-29 Thread Erick Erickson
What _is_ your expectation? You haven't provided any examples of what
your input and expectations _are_.

You might review: https://wiki.apache.org/solr/UsingMailingLists

String types are case-sensitive, for instance, so that's one thing that
could be happening. You
can also specify sortMissingFirst/Last to determine where docs with
missing fields appear in the results.

Best,
Erick
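
For reference, a hedged SolrJ sketch of the copyField approach from the
earlier reply (field and type names are assumptions, an "alphaOnlySort"-style
field type must already exist in the schema, and "client" is an initialized
SolrClient):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

// Add a sort-only field, then copy "title" into it.
Map<String, Object> field = new HashMap<>();
field.put("name", "title_sort");
field.put("type", "alphaOnlySort");
field.put("indexed", true);
field.put("stored", false);
new SchemaRequest.AddField(field).process(client, "content");
new SchemaRequest.AddCopyField("title", Arrays.asList("title_sort"))
    .process(client, "content");
// Queries then search on "title" but sort with: sort=title_sort asc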

On Fri, Jun 29, 2018 at 3:13 AM, Vijay Tiwary  wrote:
> Hello Eric,
>
> title is a string field
>
> On Wed, 27 Jun 2018, 9:21 pm Erick Erickson, 
> wrote:
>
>> what kind of field is title? text_general or something? Sorting on a
>> tokenized field is usually something you don't want to do. If a field
>> has aardvard and zebra, how would it sort?
>>
>> There's usually something like alphaOnlySort. People often copyField
>> from "title" to "title_sort" and search on "title" and sort on
>> title_sort.
>>
>> alphaOnlySort uses KeywordTokenizer and LowercaseFilterFactory.
>>
>> Best,
>> Erick
>>
>> On Wed, Jun 27, 2018 at 12:45 AM, Vijay Tiwary 
>> wrote:
>> > Hello Team,
>> >
>> > I have multiple collection on solr (5.4.1) cloud based on year
>> > content2107
>> > content2018
>> >
>> > Also I have a collection "content" which does not have any data.
>> >
>> > Now if I query them as follows
> > http://host:port/solr/content/select?q=*:*&collection=content2107,content2108&sort=title asc
>> >
>> > Where title is string field then results are not getting sorted as per
>> the
>> > expectation. Also note value for title is not present for some documents.
>> >
>> > Please help.
>>


Re: /replication?command=details does not show infos for all replicas on the core

2018-06-29 Thread Shawn Heisey

On 6/29/2018 8:47 AM, Arturas Mazeika wrote:

Out of curiosity: some cores give info for both shards (through the
replication query) and some only for one (if you are still able to see the
prev post). I wonder why..


Adding to what Erick said:

If SolrCloud has initiated a replication on that core at some point 
since that Solr instance started, then you might see both the master and 
slave side of that replication reported by the replication handler.  If 
a replication has never been initiated, then you will only see info 
about the local core.


The replication handler is used by SolrCloud for two things:

1) Index recovery when a replica gets too far out of sync.
2) Replicating data to TLOG and PULL replica types (new in 7.x).

Thanks,
Shawn



Re: CursorMarks and 'end of results'

2018-06-29 Thread Erick Erickson
bq. It basically cuts down the search time in half in the usual case
for us, so it's an important 'feature'.

Wait. You mean that the "extra" call to get back 0 rows doubles your
query time? That's surprising, tell us more.

How many times does your "usual" use case call using CursorMark? My
off-the-cuff explanation would be that
you usually get all the rows in the first call.

CursorMark is intended to help with the "deep paging" problem, i.e.
where start=some_big_number to allow
returning large result sets in chunks, say through 10s of K rows.
Part of our puzzlement is that in that
case the overhead of the last call is minuscule compared to the rest.

There's no reason that it can't be used for small result sets, those
are just usually handled by setting the
start parameter. Up through, say, 1,000 or so the extra overhead is
pretty unnoticeable. So my head was
in the "what's the problem with 1 extra call after making the first 50?".

OTOH, if you make 100 successive calls to search with the CursorMark
and call 101 takes as long as
the previous 100, something's horribly wrong.

Best,
Erick
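
For reference, the standard cursorMark loop in SolrJ looks roughly like the
sketch below ("client" is assumed to be an initialized SolrClient, and the
sort must include the uniqueKey field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

SolrQuery q = new SolrQuery("*:*");
q.setRows(50);
q.setSort(SolrQuery.SortClause.asc("id"));  // uniqueKey keeps the order total
String cursorMark = CursorMarkParams.CURSOR_MARK_START;
while (true) {
  q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
  QueryResponse rsp = client.query(q);
  // ... process rsp.getResults() here ...
  String next = rsp.getNextCursorMark();
  if (cursorMark.equals(next)) break;  // the one "extra" terminating request
  cursorMark = next;
}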


On Fri, Jun 29, 2018 at 4:01 AM, David Frese
 wrote:
> On 22.06.18 at 02:37, Chris Hostetter wrote:
>>
>>
>> : the documentation of 'cursorMarks' recommends to fetch until a query
>> returns
>> : the cursorMark that was passed in to a request.
>> :
>> : But that always requires an additional request at the end, so I wonder
>> if I
>> : can stop already, if a request returns fewer results than requested (num
>> rows).
>> : There won't be new documents added during the search in my use case, so
>> could
>> : there ever be a non-empty 'page' after a non-full 'page'?
>>
>> You could stop then -- if that fits your use case -- but the documentation
>> (in particular the sentence you are referring to) is trying to be as
>> straightforward and general as possible ... which includes the use case
>> where someone is "tailing" an index and documents may be continually
>> added.
>>
>> When originally writing those docs, I did have a bit in there about
>> *either* getting back less than "rows" docs *or* getting back the same
>> cursor you passed in (to try to cover both use cases as efficiently as
>> possible) but it seemed more confusing -- and I was worried people might
>> be surprised/confused when the number of docs was perfectly divisible by
>> "rows" so the "less than rows" case could still wind up in a final
>> request that returned "0" docs.
>>
>> the current docs seemed like a good balance between brevity & clarity,
>> with the added bonus of being correct :)
>>
>> But as Anshum said: if you have suggested improvements for rewording,
>> patches/PRs certainly welcome.  It's hard to have a good perspective on
>> what docs are helpful to new users when you have been working with the
>> software for 14 years and wrote the code in question.
>
>
> Thank you very much for the clarification.
>
> It basically cuts down the search time in half in the usual case for us, so
> it's an important 'feature'.
>
>
> --
> David Frese
> +49 7071 70896 75
>
> Active Group GmbH
> Hechinger Str. 12/1, 72072 Tübingen
> Registergericht: Amtsgericht Stuttgart, HRB 224404
> Geschäftsführer: Dr. Michael Sperber


Re: Solr 7.4 and Zookeeper 3.4.12

2018-06-29 Thread Shawn Heisey

On 6/28/2018 8:39 PM, Zimmermann, Thomas wrote:

I was wondering if there was a reason Solr 7.4 is still recommending ZK 3.4.11 
as the major version in the official changelog vs shipping with 3.4.12 despite 
the known regression in 3.4.11. Are there any known issues with running 7.4 
alongside ZK 3.4.12? We are beginning a major Solr upgrade project (4.10 to 
7.4) and want to stand up the most recent supported versions of both ZK/Solr as 
part of the process.


That is NOT a recommendation.

The mention of ZK 3.4.11 in Solr's CHANGES.txt file is simply the 
version of ZK that Solr ships with.  ZK is included with Solr mostly for 
the client functionality.  The regression is in the server code, and 
unless you run the embedded ZK server, which is not recommended for 
production, the ZK library that ships with Solr will not experience the 
regression.


I am not aware of anywhere in Solr or its reference guide that makes a 
recommendation about a specific version of ZK.  The reference guide does 
mention version 3.4.11, but that's only because that's the version that 
Solr includes.  The version number in the documentation source code is 
dynamic and will always match the specific version that Solr includes.


The compatibility goals of the ZK project indicate that you can run any 
3.4.x or 3.5.x version of ZK on the server side and be compatible with 
the ZK 3.4.x client that's in Solr.


Look for "Backward Compatibility" on this page:

https://cwiki.apache.org/confluence/display/ZOOKEEPER/ReleaseManagement

We have an issue for upgrading the version of ZK in Solr to 3.4.12.  I 
have uploaded a new patch on that issue to try and clear up any 
confusion about what version of ZK is recommended for use with Solr:


https://issues.apache.org/jira/browse/SOLR-12346

Thanks,
Shawn



Re: /replication?command=details does not show infos for all replicas on the core

2018-06-29 Thread Erick Erickson
Arturas:

Please make yourself a promise, "Only use the collections commands" ;)
At least for a while.

Trying to mix collection-level commands and core-level commands is
extremely confusing at the start. Under the covers, the Collections
API _uses_ the Core API, but in a very precise manner. Any seemingly
innocent mistake will be hard to untangle.

For your first question: "I wonder why the infos for the second
replica are not shown..." the answer is that you are using a
core-level API which does not "understand" anything about SolrCloud,
it's all purely local to that instance. So it's doing exactly what you
ask it to; reporting on the status of cores (replicas) _on that
particular Solr instance_. The _Collections_ API _is_ cloud/Zookeeper
aware and will report them all. What it does is fire the core-level
command out to all live Solr nodes and assemble the response into a
single cluster-wide report.

Second, the core-level "replication" command is all about old-style
master/slave index replication and I have no idea what it's reporting
on when you ask for replication status in SolrCloud. It has nothing to
do with, say, "replication factor" or anything else cloud related as
Shawn indicates. Old-style master/slave is used in SolrCloud under the
covers for "full sync", perhaps that's happened sometime (although
ideally it won't happen at all unless something goes wrong with normal
indexing and the only option is to copy the entire index from the
leader). The take-away is that the replication command is probably not
doing what you think it is.

Best,
Erick
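
For example, a hedged SolrJ sketch of the cloud-aware route ("client" is
assumed to be an initialized CloudSolrClient):

import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.response.CollectionAdminResponse;

// CLUSTERSTATUS goes through the Collections API, so it reports every
// shard and replica in the cluster, not just the cores on one node.
CollectionAdminResponse status = CollectionAdminRequest.getClusterStatus()
    .setCollectionName("de_wiki_man")
    .process(client);
System.out.println(status.getResponse().get("cluster"));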

On Fri, Jun 29, 2018 at 7:47 AM, Arturas Mazeika  wrote:
> Hi Shawn et al,
>
> Thanks a lot for the clarification. It makes a lot of sense and explains
> which functionality needs to be used to get the infos :-).
>
> Out of curiosity: some cores give info for both shards (through the
> replication query) and some only for one (if you are still able to see the
> prev post). I wonder why..
>
> Cheers,
> Arturas
>
> On Fri, Jun 29, 2018 at 4:30 PM, Shawn Heisey  wrote:
>
>> On 6/29/2018 7:53 AM, Arturas Mazeika wrote:
>>
>>> but the query reports infos on only one shard:
>>>
>>> F:\solr_server\solr-7.2.1>curl -s
>>> http://localhost:9996/solr/de_wiki_man/replication?command=details | grep
>>> "indexPath\|indexSize"
>>>  "indexSize":"15.04 GB",
>>>
>>> "indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\no
>>> de4\\solr\\de_wiki_man_shard4_replica_n12\\data\\index/",
>>>
>>> I wonder why the infos for the second replica are not shown. Comments?
>>>
>>
>> SolrCloud is aware of (and uses) the replication feature, but the
>> replication feature is not cloud-aware.  It is a core-level feature (not a
>> cloud-specific feature) and is only aware of that one specific core (shard
>> replica).  This is not likely to ever change.
>>
>> Thanks,
>> Shawn
>>
>>


Re: Graph, GraphML, Gephi and Edge Labels

2018-06-29 Thread Joel Bernstein
Hi,

Currently the nodes expression doesn't have this capability. Feel free to
make a feature request on jira. This sounds like a fairly easy feature to
add.



Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jun 27, 2018 at 5:21 PM, Heidi McClure <
heidi.mccl...@polarisalpha.com> wrote:

> Hello,
>
> I am trying to export graph data from a Solr index (version 7.2) in a
> format that can be imported to Gephi for visualization.  I'm getting
> close!  Is there a way to add edge labels to the exports from this type of
> command (see curl command that follows and sample outputs)?
>
> Thanks in advance,
> -heidi
>
> Based on the examples found here: https://lucene.apache.org/
> solr/guide/7_2/graph-traversal.html , this is working in my GDELT-based
> data set query request:
>
> curl --data-urlencode 'expr=nodes(gdelt_graph,
>   nodes(gdelt_graph,
> walk="POLICE->Actor1Name_s",
> trackTraversal="true",
> gather="Actor2Name_s"),
>   walk="node->Actor1Name_s",
>   scatter="leaves,branches",
>   trackTraversal="true",
>   gather="Actor2Name_s")'
> http://mymachine:8983/solr/gdelt_graph/graph
>
> Output is like this (just a subset):
> 
> <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
> http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
> <graph>
> <node id="...">
> <data key="field">node</data>
> <data key="level">0</data>
> </node>
> <node id="...">
> <data key="field">Actor2Name_s</data>
> <data key="level">1</data>
> </node>
> <node id="...">
> <data key="field">Actor2Name_s</data>
> <data key="level">1</data>
> </node>
> ...
> </graph>
> </graphml>
>
> And I'd like to have a key for label and the data tag on the edges so that
> I can get the Labels into Gephi.  Does anyone know if this can be done?
> Below is example of what I mean.  Notice the key for label at the top of
> the file and the "This is an edge description" entries on two of the edges
> (ids 1 and 2).
>
> 
> <graphml xmlns="http://graphml.graphdrawing.org/xmlns"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
> http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
> <key id="label" for="edge" attr.name="label" attr.type="string"/>
> <graph>
> <node id="...">
> <data key="field">node</data>
> <data key="level">0</data>
> </node>
> <edge id="1" source="..." target="...">
> <data key="label">This is an edge description.</data>
> </edge>
> <node id="...">
> <data key="field">Actor2Name_s</data>
> <data key="level">1</data>
> </node>
> <edge id="2" source="..." target="...">
> <data key="label">This is an edge description.</data>
> </edge>
> <node id="...">
> <data key="field">Actor2Name_s</data>
> <data key="level">1</data>
> </node>
> </graph>
> </graphml>
>
>


Re: /replication?command=details does not show infos for all replicas on the core

2018-06-29 Thread Arturas Mazeika
Hi Shawn et al,

Thanks a lot for the clarification. It makes a lot of sense and explains
which functionality needs to be used to get the infos :-).

Out of curiosity: some cores give info for both shards (through the
replication query) and some only for one (if you are still able to see the
prev post). I wonder why..

Cheers,
Arturas

On Fri, Jun 29, 2018 at 4:30 PM, Shawn Heisey  wrote:

> On 6/29/2018 7:53 AM, Arturas Mazeika wrote:
>
>> but the query reports infos on only one shard:
>>
>> F:\solr_server\solr-7.2.1>curl -s
>> http://localhost:9996/solr/de_wiki_man/replication?command=details | grep
>> "indexPath\|indexSize"
>>  "indexSize":"15.04 GB",
>>
>> "indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\no
>> de4\\solr\\de_wiki_man_shard4_replica_n12\\data\\index/",
>>
>> I wonder why the infos for the second replica are not shown. Comments?
>>
>
> SolrCloud is aware of (and uses) the replication feature, but the
> replication feature is not cloud-aware.  It is a core-level feature (not a
> cloud-specific feature) and is only aware of that one specific core (shard
> replica).  This is not likely to ever change.
>
> Thanks,
> Shawn
>
>


Re: /replication?command=details does not show infos for all replicas on the core

2018-06-29 Thread Shawn Heisey

On 6/29/2018 7:53 AM, Arturas Mazeika wrote:

but the query reports infos on only one shard:

F:\solr_server\solr-7.2.1>curl -s
http://localhost:9996/solr/de_wiki_man/replication?command=details | grep
"indexPath\|indexSize"
 "indexSize":"15.04 GB",

"indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\node4\\solr\\de_wiki_man_shard4_replica_n12\\data\\index/",

I wonder why the infos for the second replica are not shown. Comments?


SolrCloud is aware of (and uses) the replication feature, but the 
replication feature is not cloud-aware.  It is a core-level feature (not 
a cloud-specific feature) and is only aware of that one specific core 
(shard replica).  This is not likely to ever change.


Thanks,
Shawn



/replication?command=details does not show infos for all replicas on the core

2018-06-29 Thread Arturas Mazeika
Hi Solr-Team,

I am benchmarking Solr with the German Wikipedia pages on 4 nodes (running
on ports , 9998, 9997 and 9996), with 4 shards and replication factor 2:

"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -m 3g -cloud -p  -s
"F:\solr_server\solr-7.2.1\example\cloud\node1\solr"
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -m 3g -cloud -p 9998 -s
"F:\solr_server\solr-7.2.1\example\cloud\node2\solr" -z localhost:10999
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -m 3g -cloud -p 9997 -s
"F:\solr_server\solr-7.2.1\example\cloud\node3\solr" -z localhost:10999
"F:\solr_server\solr-7.2.1\bin\solr.cmd" start -m 3g -cloud -p 9996 -s
"F:\solr_server\solr-7.2.1\example\cloud\node4\solr" -z localhost:10999

created with

http://localhost:/solr/admin/collections?action=CREATE&name=de_wiki_man&numShards=4&replicationFactor=2&maxShardsPerNode=2&wt=xml

Then I inserted 40GB of data into the system and was curious how large the
index got. The query

F:\solr_server\solr-7.2.1>curl -s
http://localhost:9996/solr/admin/cores?action=STATUS | grep
"size\|numDocs\|name" | sed "s/}},/\n/g"
  "name":"de_wiki_all_shard1_replica_n2",
"numDocs":671396,
"sizeInBytes":3781265902,
"size":"3.52 GB"

  "name":"de_wiki_all_shard3_replica_n10",
"numDocs":670564,
"sizeInBytes":3874165653,
"size":"3.61 GB"

  "name":"de_wiki_man_shard2_replica_n4",
"numDocs":670498,
"sizeInBytes":11936390483,
"size":"11.12 GB"

  "name":"de_wiki_man_shard4_replica_n12",
"numDocs":671484,
"sizeInBytes":16153375004,
"size":"15.04 GB"

  "name":"trans_shard1_replica_n1",
"numDocs":0,
"sizeInBytes":69,
"size":"69 bytes"

but the query reports infos on only one shard:

F:\solr_server\solr-7.2.1>curl -s
http://localhost:9996/solr/de_wiki_man/replication?command=details | grep
"indexPath\|indexSize"
"indexSize":"15.04 GB",

"indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\node4\\solr\\de_wiki_man_shard4_replica_n12\\data\\index/",

I wonder why the infos for the second replica are not shown. Comments?

Cheers,
Arturas

Additional infos:


F:\solr_server\solr-7.2.1>curl -s
http://localhost:/solr/de_wiki_man/replication?command=details | grep
"indexPath\|indexSize"
"indexSize":"16.73 GB",

"indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\node1\\solr\\de_wiki_man_shard1_replica_n1\\data\\index.20180629092013755",
"indexSize":"15.32 GB",

"indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\node2\\solr\\de_wiki_man_shard1_replica_n2\\data\\index/",

F:\solr_server\solr-7.2.1>curl -s
http://localhost:9998/solr/de_wiki_man/replication?command=details | grep
"indexPath\|indexSize"
"indexSize":"15.32 GB",

"indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\node2\\solr\\de_wiki_man_shard1_replica_n2\\data\\index/",

F:\solr_server\solr-7.2.1>curl -s
http://localhost:9997/solr/de_wiki_man/replication?command=details | grep
"indexPath\|indexSize"
"indexSize":"16.51 GB",

"indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\node3\\solr\\de_wiki_man_shard2_replica_n6\\data\\index.20180629063901343",
"indexSize":"11.12 GB",

"indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\node4\\solr\\de_wiki_man_shard2_replica_n4\\data\\index/",

F:\solr_server\solr-7.2.1>curl -s
http://localhost:9996/solr/de_wiki_man/replication?command=details | grep
"indexPath\|indexSize"
"indexSize":"11.12 GB",

"indexPath":"F:\\solr_server\\solr-7.2.1\\example\\cloud\\node4\\solr\\de_wiki_man_shard2_replica_n4\\data\\index/",








F:\solr_server\solr-7.2.1>curl -s
http://localhost:/solr/admin/cores?action=STATUS | grep
"size\|numDocs\|name" | sed "s/}},/\n/g"
  "name":"de_wiki_all_shard1_replica_n1",
"numDocs":671396,
"sizeInBytes":3815456445,
"size":"3.55 GB"

  "name":"de_wiki_all_shard3_replica_n8",
"numDocs":670564,
"sizeInBytes":3821193139,
"size":"3.56 GB"

  "name":"de_wiki_man_shard1_replica_n1",
"numDocs":1141843,
"sizeInBytes":17967817775,
"size":"16.73 GB"

  "name":"de_wiki_man_shard3_replica_n8",
"numDocs":670823,
"sizeInBytes":11625124732,
"size":"10.83 GB"

F:\solr_server\solr-7.2.1>curl -s
http://localhost:9998/solr/admin/cores?action=STATUS | grep
"size\|numDocs\|name" | sed "s/}},/\n/g"
  "name":"de_wiki_all_shard2_replica_n6",
"numDocs":670221,
"sizeInBytes":3828566867,
"size":"3.57 GB"

  "name":"de_wiki_all_shard4_replica_n14",
"numDocs":669221,
"sizeInBytes":3772631249,
"size":"3.51 GB"

  "name":"de_wiki_man_shard1_replica_n2",
"numDocs":668807,
"sizeInBytes":16449833639,
"size":"15.32 GB"

  "name":"de_wiki_man_shard3_replica_n10",
"numDocs":670823,
"sizeInBytes":15987092480,

Re: Importance of having the lsof utility on our solr server VMs

2018-06-29 Thread THADC
Thanks. I think that's a good point that it helps recognize port conflict at
start up. Although that scenario is unlikely in my case, I am going to try
to get it installed.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: SolrJ Kerberos Client API

2018-06-29 Thread Jason Gerlowski
Hi Tushar,

You're right; the docs are a little out of date there.
Krb5HttpClientConfigurer underwent some refactoring recently and came
out with a different name: Krb5HttpClientBuilder.

The ref-guide should update the snippet you were referencing to
something more like:

System.setProperty("java.security.auth.login.config",
"/home/foo/jaas-client.conf");
HttpClientUtil.setHttpClientBuilder(new Krb5HttpClientBuilder().getBuilder());

There might be other small changes too.
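
Once that builder is installed, SolrJ clients created afterwards through the
normal path should pick up the Kerberos-enabled HttpClient. A minimal, hedged
sketch (the URL is an assumption):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Build the client only after setHttpClientBuilder() has run, so the
// underlying HttpClient is created with the Kerberos configuration.
HttpSolrClient client =
    new HttpSolrClient.Builder("http://solr-host:8983/solr").build();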

Best,

Jason
On Thu, Jun 28, 2018 at 9:05 PM Tushar Inamdar  wrote:
>
> Hello,
>
> We are looking to move from SolrJ client v5.5.x to the latest v7.4.x.
>
> The documentation on wiring kerberos with the client API here
> 
> seems out-of-date. The HttpClientUtil class doesn't have a method
> setConfigurer(). Also Krb5HttpClientConfigurer class is missing from the
> SolrJ library. This mechanism used to work with v5.5.4, but doesn't work
> with any 7.x.
>
> Am I missing something or is the documentation indeed out-of-date?
>
> I am interested in the conventional jaas/keytab based access (not
> delegation token).
>
> Thanks,
> Tushar.


Re: Retrieving json.facet from a search

2018-06-29 Thread Jason Gerlowski
You might also have luck using the "NoOpResponseParser"

https://opensourceconnections.com/blog/2015/01/08/using-solr-cloud-for-robustness-but-returning-json-format/
https://lucene.apache.org/solr/7_0_0/solr-solrj/org/apache/solr/client/solrj/impl/NoOpResponseParser.html

(Disclaimer: Didn't try this out, but it looks like what you want).
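
Putting the two suggestions together, a rough sketch (the collection name,
json.facet body and the "cat" field are made-up examples, not from this
thread):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

HttpSolrClient client = new HttpSolrClient.Builder(
    "http://localhost:8983/solr/mycollection").build();

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "*:*");
params.set("rows", 0);
params.set("json.facet", "{categories:{type:terms,field:cat}}");

QueryResponse rsp = client.query(params);
// There is no typed accessor for json.facet, so dig the generic
// NamedList out of the raw response, as Yonik suggests below.
NamedList<?> facets = (NamedList<?>) rsp.getResponse().get("facets");
System.out.println(facets);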
On Thu, Jun 28, 2018 at 2:41 PM Yonik Seeley  wrote:
>
> There isn't typed support, but you can use the generic support like so:
>
> .getResponse().get("facets")
>
> -Yonik
>
> On Thu, Jun 28, 2018 at 2:31 PM, Webster Homer  wrote:
> > I have a fairly large existing code base for querying Solr. It is
> > architected where common code calls solr and returns a solrj QueryResponse
> > object.
> >
> > I'm currently using Solr 7.2; the code interacts with Solr using the SolrJ
> > client API.
> >
> > I have a need that would be very easily met by using the json.facet api.
> > The problem is that I don't see how to get the json.facet out of a
> > QueryResponse object.
> >
> > There doesn't seem to be a lot of discussion online about this.
> > Is there a way to get the Json object out of the QueryResponse?
> >
> > --
> >
> >


Re: Solr 7 MoreLikeThis boost calculation

2018-06-29 Thread Alessandro Benedetti
Hi Jesse,
you are correct: the variable 'bestScore' used in
createQuery(PriorityQueue q) should really be named 'minScore'.

It is used to normalise the term scores:
tq = new BoostQuery(tq, boostFactor * myScore / bestScore);
e.g.

Queue -> Term1:100 , Term2:50, Term3:20, Term4:10

The minScore will be 10 and the normalised scores will be:
Term1:10 , Term2:5, Term3:2, Term4:1

These values will be used to build the boost term queries.

I see no particular problem with that.
What is your concern ?



--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: CursorMarks and 'end of results'

2018-06-29 Thread David Frese

On 22.06.18 at 02:37, Chris Hostetter wrote:


: the documentation of 'cursorMarks' recommends to fetch until a query returns
: the cursorMark that was passed in to a request.
:
: But that always requires an additional request at the end, so I wonder if I
: can stop already, if a request returns fewer results than requested (num rows).
: There won't be new documents added during the search in my use case, so could
: there ever be a non-empty 'page' after a non-full 'page'?

You could stop then -- if that fits your use case -- but the documentation
(in particular the sentence you are referring to) is trying to be as
straightforward and general as possible ... which includes the use case
where someone is "tailing" an index and documents may be continually
added.

When originally writing those docs, I did have a bit in there about
*either* getting back less than "rows" docs *or* getting back the same
cursor you passed in (to try to cover both use cases as efficiently as
possible) but it seemed more confusing -- and I was worried people might
be surprised/confused when the number of docs was perfectly divisible by
"rows" so the "less than rows" case could still wind up in a final
request that returned "0" docs.

the current docs seemed like a good balance between brevity & clarity,
with the added bonus of being correct :)

But as Anshum said: if you have suggested improvements for rewording,
patches/PRs certainly welcome.  It's hard to have a good perspective on
what docs are helpful to new users when you have been working with the
software for 14 years and wrote the code in question.


Thank you very much for the clarification.

It basically cuts down the search time in half in the usual case for us, 
so it's an important 'feature'.



--
David Frese
+49 7071 70896 75

Active Group GmbH
Hechinger Str. 12/1, 72072 Tübingen
Registergericht: Amtsgericht Stuttgart, HRB 224404
Geschäftsführer: Dr. Michael Sperber


Re: Sorting issue while using collection parameter

2018-06-29 Thread Vijay Tiwary
Hello Eric,

title is a string field

On Wed, 27 Jun 2018, 9:21 pm Erick Erickson, 
wrote:

> what kind of field is title? text_general or something? Sorting on a
> tokenized field is usually something you don't want to do. If a field
> has aardvard and zebra, how would it sort?
>
> There's usually something like alphaOnlySort. People often copyField
> from "title" to "title_sort" and search on "title" and sort on
> title_sort.
>
> alphaOnlySort uses KeywordTokenizer and LowercaseFilterFactory.
>
> Best,
> Erick
>
> On Wed, Jun 27, 2018 at 12:45 AM, Vijay Tiwary 
> wrote:
> > Hello Team,
> >
> > I have multiple collection on solr (5.4.1) cloud based on year
> > content2107
> > content2018
> >
> > Also I have a collection "content" which does not have any data.
> >
> > Now if I query them as follows
> > http://host:port/solr/content/select?q=*:*&collection=content2107,content2108&sort=title asc
> >
> > Where title is string field then results are not getting sorted as per
> the
> > expectation. Also note value for title is not present for some documents.
> >
> > Please help.
>


Re: Maximum number of SolrCloud collections in limited hardware resource

2018-06-29 Thread Emir Arnautović
Hi,
It is probably best if you merge some (or all) of your collections and use a 
discriminator field to filter out each tenant's documents. If you go with 
multiple collections serving multiple tenants, you would have to have logic on 
top of it to resolve a tenant to a collection. Unfortunately, Solr does not 
have aliases with filtering, like ES does, which would come in handy in such cases.
If you stick with multiple collections, you can turn off caches completely, 
monitor latency, and turn caches back on for a collection when it reaches some 
threshold.
Caches are invalidated on commit, so submitting a dummy doc and committing 
should invalidate them. The alternative is to reload the collection.
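
For illustration, a minimal sketch of the discriminator-field approach
(collection and field names are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;

CloudSolrClient client = new CloudSolrClient.Builder()
    .withZkHost("localhost:2181").build();
client.setDefaultCollection("tenants_shared");

// Fence every request to a single tenant with a filter query; fq results
// are cached, so frequently active tenants get cheap repeat filtering.
SolrQuery q = new SolrQuery("some user query");
q.addFilterQuery("tenant_id:acme");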

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Jun 2018, at 14:46, Shawn Heisey  wrote:
> 
> On 6/27/2018 5:10 AM, Sharif Shahrair wrote:
>> Now the problem is, when we create about 1400 collections (all of them are
>> empty, i.e. no document is added yet) the Solr service goes down showing an
>> out of memory exception. We have a few questions here-
>> 
>> 1. When we are creating collections, each collection is taking about 8 MB
>> to 12 MB of memory when there is no document yet. Is there any way to
>> configure SolrCloud in a way that it takes low memory for each collection
>> initially (like 1 MB for each collection), so that we could create
>> 1500 collections using about 3 GB of machine RAM?
> 
> Solr doesn't dictate how much memory it allocates for a collection.  It 
> allocates what it needs, and if the heap size is too small for that, then you 
> get OOME.
> 
> You're going to need a lot more than two Solr servers to handle that many 
> collections, and they're going to need more than 12GB of memory.  You should 
> already have at least three servers in your setup, because ZooKeeper requires 
> three servers for redundancy.
> 
> http://zookeeper.apache.org/doc/r3.4.12/zookeeperAdmin.html#sc_zkMulitServerSetup
> 
> Handling a large number of collections is one area where SolrCloud needs 
> improvement.  Work is constantly happening towards this goal, but it's a very 
> complex piece of software, so making design changes is not trivial.
> 
>> 2. Is there any way to clear/flush the cache of SolrCloud, specially from
>> those collections which we don't access for while(May be we can take those
>> inactive collections out of memory and load them back when they are needed
>> again)?
> 
> Unfortunately the functionality that allows index cores to be unloaded (which 
> we have colloquially called "LotsOfCores") does not work when Solr is running 
> in SolrCloud mode. SolrCloud functionality would break if its cores get 
> unloaded.  It would take a fair amount of development effort to allow the two 
> features to work together.
> 
>> 3. Is there any way to collect the Garbage Memory from SolrCloud(may be
>> created by deleting documents and collections) ?
> 
> Java handles garbage collection automatically.  It's possible to explicitly 
> ask the system to collect garbage, but any good programming guide for Java 
> will recommend that programmers should NOT explicitly trigger GC.  While it 
> might be possible for Solr's memory usage to become more efficient through 
> development effort, it's already pretty good.  To our knowledge, Solr does 
> not currently have any memory leak bugs, and if any are found, they are taken 
> seriously and fixed as fast as we can fix them.
> 
>> Our target is without increasing the hardware resources, create maximum
>> number of collections, and keeping the highly accessed collections &
>> documents in memory. We'll appreciate your help.
> 
> That goal will require a fair amount of hardware.  You may have no choice but 
> to increase your hardware resources.
> 
> Thanks,
> Shawn
>