Re: Down Replica is elected as Leader (solr v8.7.0)

2021-02-16 Thread matthew sporleder
I've run into this (or a similar) issue in the past (solr6? I don't
remember exactly) where tlogs get stuck either growing indefinitely
and/or refusing to commit on restart.

What I ended up doing was writing a monitor to check for the number of
tlogs and alert if they got over some limit (100 or whatever) and then
I could stay ahead of the issue by rebuilding individual nodes
as-needed.
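
The check itself can be as simple as counting files in each core's tlog
directory; a rough sketch (paths and threshold are from my setup,
adjust to yours):

  count=$(ls /var/solr/data/mycore/data/tlog | wc -l)
  if [ "$count" -gt 100 ]; then
    echo "WARN: $count tlog files for mycore on $(hostname)"
  fi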

Are yours growing always, on all nodes, forever?  Or is it one or two
that end up in a bad state?

On Tue, Feb 16, 2021 at 3:57 PM mmb1234  wrote:
>
> Looks like the problem is related to tlog rotation on the follower shard.
>
> We did the following for a specific shard.
>
> 0. start solr cloud
> 1. solr-0 (leader), solr-1, solr-2
> 2. rebalance to make solr-1 as preferred leader
> 3. solr-0, solr-1 (leader), solr-2
>
> The tlog file on solr-0 kept on growing infinitely (100s of GBs) until we
> shut down the cluster and dropped all shards (manually).
>
> The only way to "restart" tlog rotation on solr-0 (follower) was to issue
> /admin/cores?action=RELOAD&core=x at least twice when the tlog size was
> small (in MBs).
>
> Also, if a rebalance is issued to select solr-0 as the leader, leader election
> never completes.
>
> solr-0 output after step (3) above.
>
> solr-0
> 2140856 ./data2/mydata_0_e000-/tlog
> 2140712 ./data2/mydata_0_e000-/tlog/tlog.021
>
> solr-1 (leader)
> 35268   ./data2/mydata_0_e000-/tlog
> 35264   ./data2/mydata_0_e000-/tlog/tlog.055
>
> solr-2
> 35256   ./data2/mydata_0_e000-/tlog
> 35252   ./data2/mydata_0_e000-/tlog/tlog.054
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Leading wildcard searches very slow

2021-01-19 Thread matthew sporleder
https://lucene.apache.org/solr/4_6_0/solr-core/org/apache/solr/analysis/ReversedWildcardFilterFactory.html
?
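
i.e. something like this in the index-time analyzer of the field
(a sketch; attribute values are just examples, see the javadoc above):

  <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
          maxPosAsterisk="2" maxPosQuestion="1"/>

It indexes a reversed copy of each token, so a leading-wildcard query
can be rewritten into a cheap trailing-wildcard scan.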

On Tue, Jan 19, 2021 at 4:01 AM mosheB  wrote:
>
> Hi, is there any sophisticated way [using the schema] to block brutal regex
> queries?
>
>
> Thanks
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr Slack Workspace

2021-01-15 Thread matthew sporleder
IRC has kind of died off.
https://lucene.apache.org/solr/community.html has a Slack mentioned;
I'm on https://opensourceconnections.com/slack after taking their Solr
training class and assume it's mostly open to the solr community.

On Fri, Jan 15, 2021 at 8:10 PM Justin Sweeney
 wrote:
>
> Hi all,
>
> I did some googling and didn't find anything, but is there a Slack
> workspace for Solr? I think this could be useful to expand interaction
> within the community of Solr users and connect people solving similar
> problems.
>
> I'd be happy to get this setup if it does not exist already.
>
> Justin


Re: leader election stuck after hosts restarts

2021-01-12 Thread matthew sporleder
When this has happened to me before I have had pretty good luck by
restarting the overseer leader, which can be found in zookeeper under
/overseer_elect/leader
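
e.g. with zookeeper's own client (host/chroot will differ per setup):

  zkCli.sh -server zk1:2181 get /overseer_elect/leader

That prints a bit of json identifying the node; restart that Solr node
and a new overseer gets elected.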

If that doesn't work I've had to do more intrusive and manual recovery
methods, which suck.

On Tue, Jan 12, 2021 at 10:36 AM Pierre Salagnac
 wrote:
>
> Hello,
> We had a stuck leader election for a shard.
>
> We have collections with 2 shards, each shard has 5 replicas. We have many
> collections, but the issue happened for a single shard. Once all host
> restarts completed, this shard was stuck with one replica in "recovering"
> state and all others in "down" state.
>
> Here is the state of the shard returned by CLUSTERSTATUS command.
>   "replicas":{
> "core_node3":{
>   "core":"_shard1_replica_n1",
>   "base_url":"https://host1:8983/solr",
>   "node_name":"host1:8983_solr",
>   "state":"recovering",
>   "type":"NRT",
>   "force_set_state":"false"},
> "core_node9":{
>   "core":"_shard1_replica_n6",
>   "base_url":"https://host2:8983/solr",
>   "node_name":"host2:8983_solr",
>   "state":"down",
>   "type":"NRT",
>   "force_set_state":"false"},
> "core_node26":{
>   "core":"_shard1_replica_n25",
>   "base_url":"https://host3:8983/solr",
>   "node_name":"host3:8983_solr",
>   "state":"down",
>   "type":"NRT",
>   "force_set_state":"false"},
> "core_node28":{
>   "core":"_shard1_replica_n27",
>   "base_url":"https://host4:8983/solr",
>   "node_name":"host4:8983_solr",
>   "state":"down",
>   "type":"NRT",
>   "force_set_state":"false"},
> "core_node34":{
>   "core":"_shard1_replica_n33",
>   "base_url":"https://host5:8983/solr",
>   "node_name":"host5:8983_solr",
>   "state":"down",
>   "type":"NRT",
>   "force_set_state":"false"}}}
>
> The workaround was to shut down server host1 with the replica stuck in recovery
> state. This unblocked the leader election, and the 4 other replicas went active.
>
> Here is the first error I found in logs related to this shard. It happened
> while shutting down server host3, which was the leader at that time.
>  (updateExecutor-5-thread-33908-processing-x:..._shard1_replica_n25
> r:core_node26 null n:... s:shard1) [c:... s:shard1 r:core_node26
> x:..._shard1_replica_n25] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Error
> consuming and closing http response stream. =>
> java.nio.channels.AsynchronousCloseException
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> java.nio.channels.AsynchronousCloseException: null
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:316)
> at java.io.InputStream.read(InputStream.java:205) ~[?:?]
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:287)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:283)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:176)
> at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> ~[?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> ~[?:?]
> at java.lang.Thread.run(Thread.java:834) [?:?]
>
> My understanding is that following this error, each server restart ended with
> the replica on that server being in "down" state, but I'm not sure how to
> confirm that.
> We then entered a loop where the term is increased because of failed
> replication.
>
> Is this a known issue? I found no similar ticket in Jira.
> Could you please help me get a better understanding of the issue?
> Thanks


Re: Re:Query over migrating a solr database from 7.7.1 to 8.7.0

2021-01-10 Thread matthew sporleder
I think the general advice is to do a full re-index on a major version
upgrade.  Also - did you ever commit?

On Sun, Jan 10, 2021 at 11:13 AM Flowerday, Matthew J <
matthew.flower...@gb.unisys.com> wrote:

> Hi There
>
>
>
> Thanks for contacting me.
>
>
>
> I carried out this analysis of the solr log from the updates I carried out
> at the time:
>
>
>
> Looking at the update requests sent to Solr. The first update of an
> existing record generated
>
>
>
> 2021-01-07 06:04:18.958 INFO  (qtp1458091526-17) [   x:uleaf]
> o.a.s.u.p.LogUpdateProcessorFactory [uleaf]  webapp=/solr path=/update
> params={wt=javabin&version=2}{add=[9901020319M01-X11
> (1688206792619720704)]} 0 59
>
> 2021-01-07 06:04:19.186 INFO
> (searcherExecutor-15-thread-1-processing-x:uleaf) [   x:uleaf]
> o.a.s.c.QuerySenderListener QuerySenderListener done.
>
> 2021-01-07 06:04:19.196 INFO
> (searcherExecutor-15-thread-1-processing-x:uleaf) [   x:uleaf]
> o.a.s.c.SolrCore [uleaf]  Registered new searcher autowarm time: 1 ms
>
> 2021-01-07 06:04:19.198 INFO  (qtp1458091526-23) [   x:uleaf]
> o.a.s.u.p.LogUpdateProcessorFactory [uleaf]  webapp=/solr path=/update
> params={waitSearcher=true&commit=true&softCommit=false&wt=javabin&version=2}{commit=}
> 0 228
>
>
>
> And the record was duplicated:
>
>
>
>
>
> The next update generated
>
>
>
> 2021-01-07 06:10:59.786 INFO  (qtp1458091526-17) [   x:uleaf]
> o.a.s.u.p.LogUpdateProcessorFactory [uleaf]  webapp=/solr path=/update
> params={wt=javabin&version=2}{add=[9901020319M01-X11
> (1688207212953993216)]} 0 20
>
> 2021-01-07 06:10:59.974 INFO
> (searcherExecutor-15-thread-1-processing-x:uleaf) [   x:uleaf]
> o.a.s.c.QuerySenderListener QuerySenderListener done.
>
> 2021-01-07 06:10:59.982 INFO
> (searcherExecutor-15-thread-1-processing-x:uleaf) [   x:uleaf]
> o.a.s.c.SolrCore [uleaf]  Registered new searcher autowarm time: 0 ms
>
> 2021-01-07 06:10:59.998 INFO  (qtp1458091526-26) [   x:uleaf]
> o.a.s.u.p.LogUpdateProcessorFactory [uleaf]  webapp=/solr path=/update
> params={waitSearcher=true&commit=true&softCommit=false&wt=javabin&version=2}{commit=}
> 0 208
>
>
>
> Which looks the same as the previous command – so no real difference here.
>
>
>
> And then the records looked like
>
>
>
>
>
> And this shows that the original (7.7.1) item is untouched and only the
> 8.6.3 item is updated on subsequent updates.
>
>
>
> A brand new record being sent to solr generated this dialog:
>
>
>
> 2021-01-07 06:20:10.645 INFO  (qtp1458091526-25) [   x:uleaf]
> o.a.s.u.p.LogUpdateProcessorFactory [uleaf]  webapp=/solr path=/update
> params={wt=javabin&version=2}{add=[9901020319M01-X15 (1688207790576762880),
> 9901020319M01-DI21 (1688207790587248640)]} 0 15
>
> 2021-01-07 06:20:10.798 INFO
> (searcherExecutor-15-thread-1-processing-x:uleaf) [   x:uleaf]
> o.a.s.c.QuerySenderListener QuerySenderListener done.
>
> 2021-01-07 06:20:10.802 INFO
> (searcherExecutor-15-thread-1-processing-x:uleaf) [   x:uleaf]
> o.a.s.c.SolrCore [uleaf]  Registered new searcher autowarm time: 0 ms
>
> 2021-01-07 06:20:10.803 INFO  (qtp1458091526-23) [   x:uleaf]
> o.a.s.u.p.LogUpdateProcessorFactory [uleaf]  webapp=/solr path=/update
> params={waitSearcher=true&commit=true&softCommit=false&wt=javabin&version=2}{commit=}
> 0 153
>
>
>
> And this has a similar update request line as the others – so no
> differences here. Solr just seems to leave the migrated records as is and
> just creates a duplicate when they are updated for some reason.
>
>
>
> I hope this is what you are after.
>
>
>
> Many Thanks
>
>
>
> Matthew
>
>
>
> *Matthew Flowerday* | Consultant | ULEAF
>
> Unisys | 01908 774830| matthew.flower...@unisys.com
>
> Address Enigma | Wavendon Business Park | Wavendon | Milton Keynes | MK17
> 8LX
>
>
>
>
>
>
> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY
> MATERIAL and is for use only by the intended recipient. If you received
> this in error, please contact the sender and delete the e-mail and its
> attachments from all devices.
>
>
>
>
> *From:* xiefengchang 
> *Sent:* 10 January 2021 08:44
> *To:* solr-user@lucene.apache.org
> *Subject:* Re:Query over migrating a solr database from 7.7.1 to 8.7.0
>
>
>
> *EXTERNAL EMAIL - Be cautious of all links and attachments.*
>
> can you show the update request?
>
>
>
>
>
>
>
>
>
>
>
> At 2021-01-07 20:25:13, "Flowerday, Matthew J" <
> matthew.flower...@gb.unisys.com> wrote:
>
> Hi There
>
>
>
> I have recently upgraded a solr database from 7.7.1 to 8.7.0 and not wiped
> the database and re-indexed (as this would take too long to run on site).
>
>
>
> On my local windows machine I have a single solr server 7.7.1 installation
>
>
>
> I upgraded in the following manner
>
>
>
>- Installed windows solr 8.7.0 

Re: Query over migrating a solr database from 7.7.1 to 8.7.0

2021-01-09 Thread matthew sporleder
Did you commit?

> On Jan 9, 2021, at 5:44 AM, Flowerday, Matthew J 
>  wrote:
> 
> 
> Hi There
>  
> As a test I stopped Solr and ran the IndexUpgrader tool on the database to 
> see if this might fix the issue. It completed OK but unfortunately the issue 
> still occurs – a new version of the record on solr is created rather than 
> updating the original record.
>  
> It looks to me as if the record created under 7.7.1 is somehow not being 
> ‘marked as deleted’ in the way that records created under 8.7.0 are. Is there 
> a way for these records to be marked as deleted when they are updated?
>  
> Many Thanks
>  
> Matthew
>  
>  
> Matthew Flowerday | Consultant | ULEAF
> Unisys | 01908 774830| matthew.flower...@unisys.com
> Address Enigma | Wavendon Business Park | Wavendon | Milton Keynes | MK17 8LX
>  
> 
>  
> THIS COMMUNICATION MAY CONTAIN CONFIDENTIAL AND/OR OTHERWISE PROPRIETARY 
> MATERIAL and is for use only by the intended recipient. If you received this 
> in error, please contact the sender and delete the e-mail and its attachments 
> from all devices.
> 
>  
> 
>  
> 
> 
> 
> 
>  
> From: Flowerday, Matthew J  
> Sent: 07 January 2021 12:25
> To: solr-user@lucene.apache.org
> Subject: Query over migrating a solr database from 7.7.1 to 8.7.0
>  
> Hi There
>  
> I have recently upgraded a solr database from 7.7.1 to 8.7.0 and not wiped 
> the database and re-indexed (as this would take too long to run on site).
>  
> On my local windows machine I have a single solr server 7.7.1 installation
>  
> I upgraded in the following manner
>  
> - Installed windows solr 8.7.0 on my machine in a different folder
> - Copied the core related folder (holding conf, data, lib, core.properties)
>   from 7.7.1 to the new 8.7.0 folder
> - Brought up the solr
> - Checked that queries work through the Solr Admin Tool and our application
>  
> This all worked fine until I tried to update a record which had been created 
> under 7.7.1. Instead of marking the old record as deleted it effectively 
> created a new copy of the record with the change in and left the old image as 
> still visible. When I updated the record again it then correctly updated the 
> new 8.7.0 version without leaving the old image behind. If I created a new 
> record and then updated it the solr record would be updated correctly. The 
> issue only seemed to affect the old 7.7.1 created records.
>  
> An example of the duplication as follows (the first record is 7.7.1 created 
> version and the second record is the 8.7.0 version after carrying out an 
> update):
>  
> {
>   "responseHeader":{
> "status":0,
> "QTime":4,
> "params":{
>   "q":"id:9901020319M01-N26",
>   "_":"1610016003669"}},
>   "response":{"numFound":2,"start":0,"numFoundExact":true,"docs":[
>   {
> "id":"9901020319M01-N26",
> "groupId":"9901020319M01",
> "urn":"N26",
> "specification":"nominal",
> "owningGroupId":"9901020319M01",
> "description":"N26, Yates, Mike, Alan, Richard, MALE",
> "group_t":"9901020319M01",
> "nominalUrn_t":"N26",
> "dateTimeCreated_dtr":"2020-12-30T12:00:53Z",
> "dateTimeCreated_dt":"2020-12-30T12:00:53Z",
> "title_t":"Captain",
> "surname_t":"Yates",
> "qualifier_t":"Voyager",
> "forename1_t":"Mike",
> "forename2_t":"Alan",
> "forename3_t":"Richard",
> "sex_t":"MALE",
> "orderedType_t":"Nominal",
> "_version_":1687507566832123904},
>   {
> "id":"9901020319M01-N26",
> "groupId":"9901020319M01",
> "urn":"N26",
> "specification":"nominal",
> "owningGroupId":"9901020319M01",
> "description":"N26, Yates, Mike, Alan, Richard, MALE",
> "group_t":"9901020319M01",
> "nominalUrn_t":"N26",
> "dateTimeCreated_dtr":"2020-12-30T12:00:53Z",
> "dateTimeCreated_dt":"2020-12-30T12:00:53Z",
> "title_t":"Captain",
> "surname_t":"Yates",
> "qualifier_t":"Voyager enterprise defiant yorktown xx yy",
> "forename1_t":"Mike",
> "forename2_t":"Alan",
> "forename3_t":"Richard",
> "sex_t":"MALE",
> "orderedType_t":"Nominal",
> "_version_":1688224966566215680}]
>   }}
>  
> I checked the solrconfig.xml file and it does have a uniqueKey set up
>  
> <field name="id" type="string" indexed="true" stored="true"
>        required="true" multiValued="false" />
>
> <uniqueKey>id</uniqueKey>
>  
> I was wondering if this behaviour is expected and if there is a way to make 
> sure that records created under a previous version are updated correctly (so 
> that the old data is deleted when updated).
>  
> Also am I upgrading solr correctly as it could be that the way I have 
> upgraded it might be causing this issue (I tried hunting through the solr 
> documentation online but struggled to find window upgrade notes and the above 
> steps I worked out by trial and error).
>  
> Many thanks
>  
> Matthew
>  
> Matthew Flowerday | 

Re: Converting a collection name to an alias

2021-01-07 Thread matthew sporleder
https://lucene.apache.org/solr/guide/8_1/collections-api.html#rename
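
i.e. roughly (default ports assumed; note it doesn't rewrite the index,
it just wires up an alias under the covers, so read the caveats on that
page):

  curl "http://localhost:8983/solr/admin/collections?action=RENAME&name=A&target=A_1"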

On Thu, Jan 7, 2021 at 2:07 PM ufuk yılmaz  wrote:
>
> Hi again,
>
> Lets say I have a collection named A.
> I’m trying to rename it to A_1, then create an alias named A, which points to 
> the A_1 collection.
> Is this possible without deleting and reindexing the collection from scratch?
>
> Regards,
> uyilmaz
>


Re: Sending compressed (gzip) UpdateRequest with SolrJ

2021-01-07 Thread matthew sporleder
jetty supports http gzip and I've added it to solr before in my own
installs (and submitted patches to do so by default to solr) but I
don't know about the handling for solrj.

IME compression helps a little, sometimes a lot, and never hurts.
Even the admin interface benefits a lot from regular old http gzip
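
For responses it was roughly a matter of wiring jetty's GzipHandler
into server/etc/jetty.xml; a sketch from memory, so double-check it
against your jetty version:

  <New id="GzipHandler" class="org.eclipse.jetty.server.handler.gzip.GzipHandler">
    <Set name="minGzipSize">2048</Set>
    <Call name="addIncludedMimeTypes"><Arg>application/json</Arg></Call>
  </New>

plus setting it as the wrapper around the existing handler chain.
Requests are a different story, as you found below.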

On Thu, Jan 7, 2021 at 8:03 AM Gael Jourdan-Weil
 wrote:
>
> Answering to myself on this one.
>
> Solr uses Jetty 9.x, which does not support compressed requests by itself,
> meaning the application behind Jetty (that is, Solr) has to decompress by
> itself, which is not the case for now.
> Thus even without using SolrJ, sending XML compressed in GZIP to Solr (with 
> cURL for instance) is not possible for now.
>
> Seems quite surprising to me though.
>
> -
>
> Hello,
>
> I was wondering if someone ever had the need to send compressed (gzip) update 
> requests (adding/deleting documents), especially using SolrJ.
>
> Somehow I expected it to be done by default, but didn't find any 
> documentation about it and when looking at the code it seems there is no 
> option to do it. Or is javabin compressed by default?
> - 
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/BinaryRequestWriter.java#L49
> - 
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/request/RequestWriter.java#L55
>  (if not using Javabin)
> - 
> https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/Http2SolrClient.java#L587
>
> By the way, is there any documentation about javabin? I could only find one 
> on the "old wiki".
>
> Thanks,
> Gaël


Re: Commits (with openSearcher = true) are too slow in solr 8

2020-12-07 Thread matthew sporleder
I would stick to soft commits and schedule hard-commits as
spaced-out-as-possible in regular maintenance windows until you can
find the culprit of the timeout.

This way you will have very focused windows for intense monitoring
during the hard-commit runs.
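
Something like this in solrconfig.xml is the usual shape of it (numbers
are illustrative only):

  <autoCommit>
    <maxTime>600000</maxTime>          <!-- hard commit every 10 min -->
    <openSearcher>false</openSearcher> <!-- don't open a searcher here -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>30000</maxTime>           <!-- visibility every 30 s -->
  </autoSoftCommit>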


On Mon, Dec 7, 2020 at 9:24 AM raj.yadav  wrote:
>
> Hi Folks,
>
> Do let me know if any more information required to debug this.
>
>
> Regards,
> Raj
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Commits (with openSearcher = true) are too slow in solr 8

2020-12-06 Thread matthew sporleder
Is zookeeper on the solr hosts or on its own?  Have you tried
opensearcher=false (soft commit?)

On Sun, Dec 6, 2020 at 6:19 PM raj.yadav  wrote:
>
> Hi Everyone,
>
>
> matthew sporleder wrote
> > Are you stuck in iowait during that commit?
>
> During commit operation, there is no iowait.
> Infact most of the time cpu utilization percentage is very low.
>
> /*As I mentioned in my previous post, we are getting `SolrCmdDistributor
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> waiting response from server` and `DistributedZkUpdateProcessor` ERROR on
> one of the shards. And this error is always occurring on the shard that is
> used (in curl command) to issue commit. (See below example for better
> understanding)*/
>
> Here are the shard and corresponding node details:
> shard1_0=>solr_199
> shard1_1=>solr_200
> shard2_0=> solr_254
> shard2_1=> solr_132
> shard3_0=>solr_133
> shard3_1=>solr_198
>
> We are using the following command to issue commit:
> /curl
> "http://solr_node:8389/solr/my_collection/update?openSearcher=true=true=json"/
>
> For example, in the above command, if we replace solr_node with solr_254,
> then it's throwing SolrCmdDistributor and DistributedZkUpdateProcessor
> errors on shard2_0. Similarly, if we replace solr_node with solr_200 its
> throws errors on shard1_1.
>
> *I'm not able to figure out why this is happening. Is there any connection
> timeout setting that is affecting this? Is there any limit such that only N
> shards can run commit ops simultaneously at a time, or is it some
> network-related issue?*
>
>
> For a better understanding of what's happening in the SOLR logs, I will
> demonstrate one commit operation here.
>
> I used the below command to issue commit at `2020-12-06 18:37:40` (approx)
> curl
> "http://solr_200:8389/solr/my_collection/update?openSearcher=true=true=json;
>
>
> /*shard2_0 (node: solr_254) Logs:*/
>
>
> *Commit is received at `2020-12-06 18:37:47` and got over by `2020-12-06
> 18:37:47` since there were no changes to commit. And CPU utilization during
> the whole period is around 2%.*
>
>
> 2020-12-06 18:37:47.023 INFO  (qtp2034610694-31355) [c:my_collection
> s:shard2_0 r:core_node13 x:my_collection_shard2_0_replica_n11]
> o.a.s.u.DirectUpdateHandler2 start
> commit{_version_=1685355093842460672,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 2020-12-06 18:37:47.023 INFO No uncommitted changes. Skipping IW.commit.
> 2020-12-06 18:37:47.023 INFO end_commit_flush
> 2020-12-06 18:37:47.023 INFO  (qtp2034610694-31355) [c:my_collection
> s:shard2_0 r:core_node13 x:my_collection_shard2_0_replica_n11]
> o.a.s.u.p.LogUpdateProcessorFactory [my_collection_shard2_0_replica_n11]
> webapp=/solr path=/update
>
> params={update.distrib=TOLEADER=true=true=true=false=http://solr_200:8389/solr/my_collection_shard1_1_replica_n19/_end_point=leaders=javabin=2=false}{commit=} 0 3
>
> /*shard2_1 (node: solr_132) Logs:*/
>
> *Commit is received at `2020-12-06 18:37:47` and got over by `2020-12-06
> 18:50:46` in between there were some external file reloading operations (our
> solr-5.4.2 system is also taking similar time to reload external files so
> right now this is not a major concern for us)
> CPU utilization before commit (i.e `2020-12-06 18:37:47` timestamp) is 2%
> and between commit ops (i.e from `2020-12-06 18:37:47`  to `2020-12-06
> 18:50:46` timestamp) is 14%, and after the commit operation is done it again
> falls back to 2%*
>
>
> 2020-12-06 18:37:47.024 INFO  (qtp2034610694-30058) [c:my_collection
> s:shard2_1 r:core_node22 x:my_collection_shard2_1_replica_n21]
> o.a.s.u.DirectUpdateHandler2 start
> commit{_version_=1685355093844557824,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
>
> 2020-12-06 18:50:46.218 INFO  (qtp2034610694-30058) [c:my_collection
> s:shard2_1 r:core_node22 x:my_collection_shard2_1_replica_n21]
> o.a.s.u.p.LogUpdateProcessorFactory [my_collection_shard2_1_replica_n21]
> webapp=/solr path=/update
> params={update.distrib=TOLEADER=true=true=true=false=http://solr_200:8389/solr/my_collection_shard1_1_replica_n19/_end_point=leaders=javabin=2=false}{commit=}
> 0 779196
>
>
> /*shard3_0 (node: solr_133) logs*/
>
> Same as shard2_1, commit received at `2020-12-06 18:37:47` and got over by
> `2020-12-06 18:49:24`.
> CPU utilization pattern is the same as shard2_1.
>
> /*shard3_1 (node: solr_198) logs.*/
>
> Same as shard2_1, commit received at `2020-12-06 18:37:47` and got over by
> `2020-12-06 18:53:57`.
> CPU utilization pattern is the same as sh

Re: Commits (with openSearcher = true) are too slow in solr 8

2020-12-06 Thread matthew sporleder
On unix the top command will tell you.  On windows you need to find
the disk latency stuff.
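
e.g. on linux:

  top          # watch the "wa" value in the %Cpu(s) line
  iostat -x 1  # per-device await/%util, needs the sysstat package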

Are you on a spinning disk or on a (good) SSD?

Anyway my theory is that trying to do too many commits in parallel
(too many or not enough shards) is causing iowait = high latency to
work through.

On Sun, Dec 6, 2020 at 9:05 AM raj.yadav  wrote:
>
> matthew sporleder wrote
> > Are you stuck in iowait during that commit?
>
> I am not sure how do I determine that, could you help me here.
>
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Commits (with openSearcher = true) are too slow in solr 8

2020-12-04 Thread matthew sporleder
Are you stuck in iowait during that commit?



On Fri, Dec 4, 2020 at 6:28 AM raj.yadav  wrote:
>
> Hi everyone,
>
> As per the suggestions in the previous post (by Erick and Shawn), we made the
> following changes.
>
> OLD CACHE CONFIG
> <filterCache size="32768"
>              initialSize="6000"
>              autowarmCount="6000"/>
>
> <queryResultCache size="25600"
>                   initialSize="6000"
>                   autowarmCount="0"/>
>
> <documentCache size="32768"
>                initialSize="6144"
>                autowarmCount="0"/>
>
> NEW CACHE CONFIG
> <filterCache size="8192"
>              initialSize="512"
>              autowarmCount="512"/>
>
> <queryResultCache size="8192"
>                   initialSize="3000"
>                   autowarmCount="0"/>
>
> <documentCache size="8192"
>                initialSize="3072"
>                autowarmCount="0"/>
>
>
> *Reduced JVM heap size from 30GB to 26GB*
>
>
>
> *Currently query request rate on the system is zero.
> But still, commit with openSearcher=true is taking 25 mins.*
>
> We looked into solr logs, and observed the following things:
>
> 1. /Once the commit is issued, five (shard1_0, shard1_1, shard2_0, shard3_0,
> shard3_1) of the six shards immediately started processing the commit, but on
> one shard (shard2_1) we are getting the following error:/
>
> 2020-12-03 12:29:17.518 ERROR
> (updateExecutor-5-thread-6-processing-n:solr_132:8389_solr
> x:my_collection_shard2_1_replica_n21 c:my_collection s:shard2_1
> r:core_node22) [c:my_collection s:shard2_1 r:core_node22
> x:my_collection_shard2_1_replica_n21] o.a.s.u.SolrCmdDistributor
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> waiting response from server at:
> http://solr_198:8389/solr/my_collection_shard3_1_replica_n23/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fsolr_132%3A8389%2Fsolr%2Fmy_collection_shard2_1_replica_n21%2F
> at
> org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:407)
> at
> org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:753)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:369)
> at 
> org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
> at
> org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:344)
> at
> org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:333)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
> at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.TimeoutException
> at
> org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:216)
> at
> org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:398)
> ... 13 more
>
> 2020-12-03 12:29:17.518 ERROR
> (updateExecutor-5-thread-2-processing-n:solr_132:8389_solr
> x:my_collection_shard2_1_replica_n21 c:my_collection s:shard2_1
> r:core_node22) [c:my_collection s:shard2_1 r:core_node22
> x:my_collection_shard2_1_replica_n21] o.a.s.u.SolrCmdDistributor
> org.apache.solr.client.solrj.SolrServerException: Timeout occured while
> waiting response from server at:
> http://solr_199:8389/solr/my_collection_shard1_0_replica_n7/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fsolr_132%3A8389%2Fsolr%2Fmy_collection_shard2_1_replica_n21%2F
> at
> org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:407)
> at
> org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:753)
> at
> org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:369)
> at 
> org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
> at
> org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:344)
> at
> org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:333)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> 

Re: data import handler deprecated?

2020-11-28 Thread matthew sporleder
I went through the same stages of grief that you are about to start
but (luckily?) my core dataset grew some weird cousins and we ended up
writing our own indexer to join them all together/do partial
updates/other stuff beyond DIH.  It's not difficult to upload docs but
is definitely slower so far.  I think there is a bit of a 'clean core'
focus going on in solr-land right now and DIH is easy(!) but it's also
easy to hit its limits (atomic/partial updates?  wtf is an "entity?"
etc) so anyway try to be happy that you are aware of it now.

On Sat, Nov 28, 2020 at 7:41 PM Dmitri Maziuk  wrote:
>
> On 11/28/2020 5:48 PM, matthew sporleder wrote:
>
> > ...  The bottom of
> > that github page isn't hopeful however :)
>
> Yeah, "works with MariaDB" is a particularly bad way of saying "BYO JDBC
> JAR" :)
>
> It's a more general question though: what is the path forward for users
> with data in two places? Hope that a community-maintained plugin
> will still be there tomorrow? Dump our tables to CSV (and POST them) and
> roll our own delta-updates logic? Or are we to choose one datastore and
> drop the other?
>
> Dima


Re: data import handler deprecated?

2020-11-28 Thread matthew sporleder
https://solr.cool/#utilities -> https://github.com/rohitbemax/dataimporthandler

You can import it in the many new/novel ways to add things to a solr
install and it should work like always (apparently).  The bottom of
that github page isn't hopeful however :)
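
The package-manager route is roughly the following (from memory, so
check that README for the exact repo url and package name; solr also
has to be started with -Denable.packages=true):

  bin/solr package add-repo dih "https://raw.githubusercontent.com/rohitbemax/dataimporthandler/master/repo/"
  bin/solr package install data-import-handler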

On Sat, Nov 28, 2020 at 5:21 PM Dmitri Maziuk  wrote:
>
> Hi all,
>
> trying to set up solr-8.7.0, contrib/dataimporthandler/README.txt says
> this module is deprecated as of 8.6 and scheduled for removal in 9.0.
>
> How do we pull data out of our relational database in 8.7+?
>
> TIA
> Dima


Re: Query generation is different for search terms with and without "-"

2020-11-24 Thread matthew sporleder
Is the normal/standard solution here to regex remove the '-'s and
combine them into a single token?
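
i.e. something like this charFilter ahead of the tokenizer, on both the
index and query analyzers (a sketch; the pattern likely needs tuning):

  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(\w)-(\w)" replacement="$1$2"/>

so "high-tech" becomes "hightech" before tokenization?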

On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson  wrote:
>
> This is a common point of confusion. There are two phases for creating a 
> query,
> query _parsing_ first, then the analysis chain for the parsed result.
>
> So what e-dismax sees in the two cases is:
>
> Name_enUS:“high tech” -> two tokens, since there are two of them pf2 comes 
> into play.
>
> Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply, 
> splitting it on the hyphen comes later.
>
> It’s especially confusing since the field analysis then breaks up “high-tech” 
> into two tokens that
> look the same as “high tech” in the debug response, just without the phrase 
> query.
>
> Name_enUS:high
> Name_enUS:tech
>
> Best,
> Erick
>
> > On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez 
> >  wrote:
> >
> > I am troubleshooting an issue with ranking for search terms that contain a
> > "-" vs the same query that does not contain the dash e.g. "high-tech" vs
> > "high tech". The field that I am querying is using the standard tokenizer,
> > so I would expect that the underlying lucene query should be the same for
> > both versions of the query, however when printing the debug, it appears
> > they are generated differently. I know "-" must be escaped as it has
> > special meaning in lucene, however escaping does not fix the problem. It
> > appears that with the "-" present, the pf2 edismax parameter is not
> > respected and omitted from the final query. We use sow=false as we have
> > multiterm synonyms and need to ensure they are included in the final lucene
> > query. My expectation is that the final underlying lucene query should be
> > based on the output  of the field analyzer, however after briefly looking
> > at the code for ExtendedDismaxQParser, it appears that there is some string
> > processing happening outside of the analysis step which causes the
> > unexpected lucene query.
> >
> >
> > Solr Debug for "high tech":
> >
> > parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
> > DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
> > DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
> > DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
> > parsedquery_toString: "+(((Name_enUS:high)~0.4
> > (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
> > (Name_enUS:"high tech"~4)~0.4",
> >
> >
> > Solr Debug for "high-tech"
> >
> > parsedquery: "+DisjunctionMaxQueryName_enUS:high
> > Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
> > tech"~5)~0.4)",
> > parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
> > (Name_enUS:"high tech"~5)~0.4"
> >
> > SolrConfig:
> >
> >  
> >
> >  true
> >  true
> >  json
> >  375%
> >  Name_enUS
> >  Name_enUS
> >  5
> >  Name_enUS
> >  4   
> >  3
> >  0.4
> >  explicit
> >  100
> >  false
> >
> >
> >  edismax
> >
> >  
> >
> > Schema:
> >
> >   > positionIncrementGap="100">
> >  
> >
> >
> >
> >
> >  
> >  
> >
> >
> > Using Solr 8.6.3
> >


Re: how do you manage your config and schema

2020-11-03 Thread matthew sporleder
Is there a more conservative starting point that is still up to date
than _default?

On Tue, Nov 3, 2020 at 11:13 AM matthew sporleder  wrote:
>
> So _default considered unsafe?  :)
>
> On Tue, Nov 3, 2020 at 11:08 AM Erick Erickson  
> wrote:
> >
> > The caution I would add is that you should be careful
> > that you don’t enable schemaless mode without understanding
> > the consequences in detail.
> >
> > There is, in fact, some discussion of removing schemaless entirely,
> > see:
> > https://issues.apache.org/jira/browse/SOLR-14701
> >
> > Otherwise, I usually recommend that you take the stock configs and
> > overlay whatever customizations you’ve added in terms of
> > field definitions and the like.
> >
> > Do also be careful, some default field params have changed…
> >
> > Best,
> > Erick
> >
> > > On Nov 3, 2020, at 9:30 AM, matthew sporleder  
> > > wrote:
> > >
> > > Yesterday I realized that we have been carrying forward our configs
> > > since, probably, 4.x days.
> > >
> > > I ran a config set action=create (from _default) and saw files I
> > > didn't recognize, and a lot *fewer* things than I've been uploading
> > > for the last few years.
> > >
> > > Anyway my new plan is to just use _default and keep params.json,
> > > solrconfig.xml, and schema.xml in git and just use the defaults for
> > > the rest.  (modulo synonyms/etc)
> > >
> > > Did everyone move on to managed schema and use some kind of
> > > intermediate format to upload?
> > >
> > > I'm just looking for updated best practices and a little survey of usage 
> > > trends.
> > >
> > > Thanks,
> > > Matt
> >


Re: how do you manage your config and schema

2020-11-03 Thread matthew sporleder
So _default considered unsafe?  :)

On Tue, Nov 3, 2020 at 11:08 AM Erick Erickson  wrote:
>
> The caution I would add is that you should be careful
> that you don’t enable schemaless mode without understanding
> the consequences in detail.
>
> There is, in fact, some discussion of removing schemaless entirely,
> see:
> https://issues.apache.org/jira/browse/SOLR-14701
>
> Otherwise, I usually recommend that you take the stock configs and
> overlay whatever customizations you’ve added in terms of
> field definitions and the like.
>
> Do also be careful, some default field params have changed…
>
> Best,
> Erick
>
> > On Nov 3, 2020, at 9:30 AM, matthew sporleder  wrote:
> >
> > Yesterday I realized that we have been carrying forward our configs
> > since, probably, 4.x days.
> >
> > I ran a config set action=create (from _default) and saw files I
> > didn't recognize, and a lot *fewer* things than I've been uploading
> > for the last few years.
> >
> > Anyway my new plan is to just use _default and keep params.json,
> > solrconfig.xml, and schema.xml in git and just use the defaults for
> > the rest.  (modulo synonyms/etc)
> >
> > Did everyone move on to managed schema and use some kind of
> > intermediate format to upload?
> >
> > I'm just looking for updated best practices and a little survey of usage 
> > trends.
> >
> > Thanks,
> > Matt
>


how do you manage your config and schema

2020-11-03 Thread matthew sporleder
Yesterday I realized that we have been carrying forward our configs
since, probably, 4.x days.

I ran a config set action=create (from _default) and saw files I
didn't recognize, and a lot *fewer* things than I've been uploading
for the last few years.

Anyway my new plan is to just use _default and keep params.json,
solrconfig.xml, and schema.xml in git and just use the defaults for
the rest.  (modulo synonyms/etc)
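
So the configset lives in git, and a deploy is roughly (names/paths
made up, but that's the shape of it):

  bin/solr zk upconfig -n myconf -d ./conf -z localhost:2181
  curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycoll"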

Did everyone move on to managed schema and use some kind of
intermediate format to upload?

I'm just looking for updated best practices and a little survey of usage trends.

Thanks,
Matt


Re: Solr dependency update at Apache Beam - which versions should be supported

2020-10-30 Thread matthew sporleder
Is there a reason you can't use a bunch of solr versions and let beam users 
choose at runtime?

> On Oct 30, 2020, at 4:58 AM, Piotr Szuberski  
> wrote:
> 
> Thank you very much for your answer!
> 
> Beam has a compile time dependency on Solr so the user doesn't have to
> provide his own. The problem would happen when a user wants to use both
> Solr X version and Beam SolrIO in the same project.
> 
> As I understood it'd be the best choice to use the 8.x.y version and it
> shouldn't break anything to the users using Beam as their only dependency?
> 
> Regards,
> Piotr
> 
>> On Tue, Oct 27, 2020 at 10:26 PM Mike Drob  wrote:
>> 
>> Piotr,
>> 
>> Based on the questions that we've seen over the past month on this list,
>> there are still users with Solr on 6, 7, and 8. I suspect there are still
>> Solr 5 users out there too, although they don't appear to be asking for
>> help - likely they are in set it and forget it mode.
>> 
>> Solr 7 may not be officially deprecated on our site, but it's pretty old at
>> this point and we're not doing any development on it outside of mybe a
>> very high profile security fix. Even then, we might acknowledge it and
>> recommend users update to 8.x anyway.
>> 
>> The index files generated by Lucene and consumed by Solr are backwards
>> compatible up to one major version. Some of the API remains compatible, a
>> client issuing simple queries to Solr 5 would probably work fine even
>> against Solr 9 when it comes out eventually. A client doing admin
>> operations will be less certain. I don't know enough about Beam to tell you
>> where on the spectrum your use will fall.
>> 
>> I'm not sure if this was helpful or not, but maybe it is a nudge in the
>> right direction.
>> 
>> Good luck,
>> Mike
>> 
>> 
>> On Tue, Oct 27, 2020 at 11:09 AM Piotr Szuberski <
>> piotr.szuber...@polidea.com> wrote:
>> 
>>> Hi,
>>> 
>>> We are working on dependency updates at Apache Beam and I would like to
>>> consult which versions should be supported so we don't break any existing
>>> users.
>>> 
>>> Previously the supported Solr version was 5.5.4.
>>> 
>>> Versions 8.x.y and 7.x.y naturally come to mind as they are the only not
>>> deprecated. But maybe there are users that use some earlier versions?
>>> 
>>> Are these versions backwards-compatible or there are things to be aware
>> of?
>>> 
>>> Regards
>>> 
>> 
> 
> 
> -- 
> 
> *Piotr Szuberski*
> Polidea  | Junior Software Engineer
> 
> E: piotr.szuber...@polidea.com
> 
> Unique Tech
> Check out our projects! 


Re: solr performance with >1 NUMAs

2020-10-22 Thread matthew sporleder
Great updates.  Thanks for keeping us all in the loop!

On Thu, Oct 22, 2020 at 7:43 PM Wei  wrote:
>
> Hi Shawn,
>
> I'm circling back with some new findings on our 2 NUMA issue.  After a
> few iterations, we do see improvement with the useNUMA flag and other JVM
> setting changes. Here are the current settings, with Java 11:
>
> -XX:+UseNUMA
>
> -XX:+UseG1GC
>
> -XX:+AlwaysPreTouch
>
> -XX:+UseTLAB
>
> -XX:G1MaxNewSizePercent=20
>
> -XX:MaxGCPauseMillis=150
>
> -XX:+DisableExplicitGC
>
> -XX:+DoEscapeAnalysis
>
> -XX:+ParallelRefProcEnabled
>
> -XX:+UnlockDiagnosticVMOptions
>
> -XX:+UnlockExperimentalVMOptions
>
>
> Compared to previous Java 8 + CMS on 2 NUMA servers,  P99 latency has
> improved over 20%.
>
>
> Thanks,
>
> Wei
>
>
>
>
> On Mon, Sep 28, 2020 at 4:02 PM Shawn Heisey  wrote:
>
> > On 9/28/2020 12:17 PM, Wei wrote:
> > > Thanks Shawn. Looks like Java 11 is the way to go with -XX:+UseNUMA. Do
> > you
> > > see any backward compatibility issue for Solr 8 with Java 11? Can we run
> > > Solr 8 built with JDK 8 in Java 11 JRE, or need to rebuild solr with Java
> > > 11 JDK?
> >
> > I do not know of any problems running the binary release of Solr 8
> > (which is most likely built with the Java 8 JDK) with a newer release
> > like Java 11 or higher.
> >
> > I think Sun was really burned by such problems cropping up in the days
> > of Java 5 and 6, and their developers have worked really hard to make
> > sure that never happens again.
> >
> > If you're running Java 11, you will need to pick a different garbage
> > collector if you expect the NUMA flag to function.  The most recent
> > releases of Solr are defaulting to G1GC, which as previously mentioned,
> > did not gain NUMA optimizations until Java 14.
> >
> > It is not clear to me whether the NUMA optimizations will work with any
> > collector other than Parallel until Java 14.  You would need to check
> > Java documentation carefully or ask someone involved with development of
> > Java.
> >
> > If you do see an improvement using the NUMA flag with Java 11, please
> > let us know exactly what options Solr was started with.
> >
> > Thanks,
> > Shawn
> >


Re: Java GC issue investigation

2020-10-06 Thread matthew sporleder
Your index is so small that it should easily get cached into OS memory
as it is accessed.  Having a too-big heap is a known problem
situation.

https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-HowmuchheapspacedoIneed?
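
The heap is set in solr.in.sh, e.g.:

  SOLR_HEAP="1g"

(or the equivalent -Xms/-Xmx pair via SOLR_JAVA_MEM).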

On Tue, Oct 6, 2020 at 9:44 AM Karol Grzyb  wrote:
>
> Hi Matthew,
>
> Thank you for the answer. I cannot reproduce the setup locally; I'll
> try to convince them to reduce Xmx. I guess they won't agree
> to 1GB, but something less than 12G for sure.
> And to have a proper dev setup, because for now we can only test prod
> or stage, which are difficult to adjust.
>
> Is being stuck in GC common behaviour when the index is small compared
> to available heap during bigger load? I was more worried about the
> ratio of heap to total host memory.
>
> Regards,
> Karol
>
>
> On Tue, Oct 6, 2020 at 2:39 PM matthew sporleder  wrote:
> >
> > You have a 12G heap for a 200MB index?  Can you just try changing Xmx
> > to, like, 1g ?
> >
> > On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb  wrote:
> > >
> > > Hi,
> > >
> > > I'm involved in the investigation of an issue that involves huge GC
> > > overhead during performance tests on Solr nodes. Solr version is
> > > 6.1. The last tests were done on a staging env, and we ran into problems at
> > > <100 requests/second.
> > >
> > > The size of the index itself is ~200MB ~ 50K docs
> > > Index has small updates every 15min.
> > >
> > >
> > >
> > > Queries involve sorting and faceting.
> > >
> > > I've gathered some heap dumps, I can see from them that most of heap
> > > memory is retained because of object of following classes:
> > >
> > > -org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
> > > (>4G, 91% of heap)
> > > -org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
> > > -org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
> > > -org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
> > > (>3.7G 76% of heap)
> > >
> > >
> > >
> > > Based on the information above, is there anything generic that can be
> > > looked at as a source of potential improvement without diving deeply
> > > into the schema and queries (which may be very difficult to change at this
> > > moment)? I don't see docvalues being enabled - could this help, as if
> > > I get the docs correctly, it's specifically helpful when there are
> > > many sorts/grouping/facets? Or I
> > >
> > > Additionally I see that many threads are blocked on LRUCache.get;
> > > should I recommend switching to FastLRUCache?
> > >
> > > Also, I wonder if -Xmx12288m for java heap is not too much for 16G
> > > memory? I see some (~5/s) page faults in Dynatrace during the biggest
> > > traffic.
> > >
> > > Thank you very much for any help,
> > > Kind regards,
> > > Karol


Re: Java GC issue investigation

2020-10-06 Thread matthew sporleder
You have a 12G heap for a 200MB index?  Can you just try changing Xmx
to, like, 1g ?

On Tue, Oct 6, 2020 at 7:43 AM Karol Grzyb  wrote:
>
> Hi,
>
> I'm involved in the investigation of an issue that involves huge GC
> overhead during performance tests on Solr nodes. Solr version is
> 6.1. The last tests were done on a staging env, and we ran into problems at
> <100 requests/second.
>
> The size of the index itself is ~200MB ~ 50K docs
> Index has small updates every 15min.
>
>
>
> Queries involve sorting and faceting.
>
> I've gathered some heap dumps, I can see from them that most of heap
> memory is retained because of object of following classes:
>
> -org.apache.lucene.search.grouping.term.TermSecondPassGroupingCollector
> (>4G, 91% of heap)
> -org.apache.lucene.search.grouping.AbstractSecondPassGroupingCollector$SearchGroupDocs
> -org.apache.lucene.search.FieldValueHitQueue$MultiComparatorsFieldValueHitQueue
> -org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector
> (>3.7G 76% of heap)
>
>
>
> Based on the information above, is there anything generic that can be
> looked at as a source of potential improvement without diving deeply
> into the schema and queries (which may be very difficult to change at this
> moment)? I don't see docvalues being enabled - could this help, as if
> I get the docs correctly, it's specifically helpful when there are
> many sorts/grouping/facets? Or I
>
> Additionally I see that many threads are blocked on LRUCache.get;
> should I recommend switching to FastLRUCache?
>
> Also, I wonder if -Xmx12288m for java heap is not too much for 16G
> memory? I see some (~5/s) page faults in Dynatrace during the biggest
> traffic.
>
> Thank you very much for any help,
> Kind regards,
> Karol


Re: Solr training

2020-09-17 Thread matthew sporleder
Is there a friends-on-the-mailing list discount?  I had a bit of sticker shock!

On Wed, Sep 16, 2020 at 9:38 AM Charlie Hull  wrote:
>
> I do of course mean 'Group Discounts': you don't get a discount for
> being in a 'froup' sadly (I wasn't even aware that was a thing!)
>
> Charlie
>
> On 16/09/2020 13:26, Charlie Hull wrote:
> >
> > Hi all,
> >
> > We're running our Solr Think Like a Relevance Engineer training 6-9 Oct
> > - you can find out more & book tickets at
> > https://opensourceconnections.com/training/solr-think-like-a-relevance-engineer-tlre/
> >
> > The course is delivered over 4 half-days from 9am EST / 2pm BST / 3pm
> > CET and is led by Eric Pugh who co-wrote the first book on Solr and is
> > a Solr Committer. It's suitable for all members of the search team -
> > search engineers, data scientists, even product owners who want to
> > know how Solr search can be measured & tuned. Delivered by working
> > relevance engineers the course features practical exercises and will
> > give you a great foundation in how to use Solr to build great search.
> >
> > The early bird discount expires at the end of this week so do book soon if
> > you're interested! Froup discounts also available. We're also running
> > a more advanced course on Learning to Rank a couple of weeks later -
> > you can find all our training courses and dates at
> > https://opensourceconnections.com/training/
> >
> > Cheers
> >
> > Charlie
> >
> > --
> > Charlie Hull
> > OpenSource Connections, previously Flax
> >
> > tel/fax: +44 (0)8700 118334
> > mobile:  +44 (0)7767 825828
> > web:www.o19s.com
>
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>


Re: Unexpected Performance decrease when upgrading Solr 5.5.2 to 8.5.2

2020-09-16 Thread matthew sporleder
Did you re-work your schema at all?  There are new primitive types,
new lucene versions, DocValues, etc.
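
e.g. the old Trie* numeric types gave way to Point types, which want
docValues for sorting/faceting; a 5.5-era schema copied forward won't
pick any of that up (illustration only, not your schema):

  <fieldType name="plong" class="solr.LongPointField" docValues="true"/>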

On Wed, Sep 16, 2020 at 12:40 PM Keene Chen  wrote:
>
> Hi,
>
> Thanks for pointing that out. I've linked the images below:
>
> solr5_response_times.png
> 
>
> solr8_response_times.png
> 
>
> solr5_throughput.png
> 
>
> solr8_throughput.png
> 
>
> Regards,
> Keene
>
>
> On Wed, 16 Sep 2020 at 09:09, Colvin Cowie 
> wrote:
>
> > Hello,
> >
> > Your images won't appear on the mailing list. You'll need to post them
> > elsewhere and link to them.
> >
> > On Tue, 15 Sep 2020 at 09:44, Keene Chen  wrote:
> >
> > > Hi Solr users community,
> > >
> > >
> > > We have been doing some performance tests on Solr 5.5.2 and Solr 8.5.2 as
> > > part of an upgrading process, and we have noticed some reduced
> > performance
> > > for certain types of requests, particularly those that requests a large
> > > number of rows, eg. 1. Would anyone have an explanation as to why the
> > > performance degrades, and what areas can be looked at in order to improve
> > > its performance?
> > >
> > > The performance test example below was carried out using 18000 of such
> > > queries, running at a constant throughput as specified by the label in
> > the
> > > x-axis. “Rpm” here stands for “requests per minute”.
> > >
> > > Solr 8.5’s maximum response times are consistently better. However, the
> > > 95th and 99th percentile are comparably worse than Solr 5.5’s response
> > > times.
> > > [image: image.png]
> > > [image: image.png]
> > >
> > > The maximum throughput for solr 8.5 is reached sooner than Solr 5.5 at
> > > around 4 requests per second.
> > >
> > >
> > > [image: image.png]
> > > [image: image.png]
> > >
> > > Regards,
> > > Keene
> > >
> > > --
> > >
> > >
> > > Keene Chen  | Senior Software Developer
> > >
> > >
> > >
> > >
> > >
> > >
> > > Mintel Group Ltd | 11 Pilgrim Street | London | EC4V 6RN
> > > Registered in England: Number 1475918. | VAT Number: GB 232 9342 72
> > >
> > > Contact details for our other offices can be found at
> > > http://www.mintel.com/office-locations.
> > >
> > > This email and any attachments may include content that is confidential,
> > > privileged
> > > or otherwise protected under applicable law. Unauthorised disclosure,
> > > copying, distribution
> > > or use of the contents is prohibited and may be unlawful. If you have
> > > received this email in error,
> > > including without appropriate authorisation, then please reply to the
> > > sender about the error
> > > and delete this email and any attachments.
> > >
> >
>
> --
>
> Mintel Group Ltd | 11 Pilgrim Street | London | EC4V 6RN
> Registered in
> England: Number 1475918. | VAT Number: GB 232 9342 72
>
> Contact details for
> our other offices can be found at http://www.mintel.com/office-locations
> .
>
> This email and any attachments
> may include content that is confidential, privileged
> or otherwise
> protected under applicable law. Unauthorised disclosure, copying,
> distribution
> or use of the contents is prohibited and may be unlawful. If
> you have received this email in error,
> including without appropriate
> authorisation, then please reply to the sender about the error
> and delete
> this email and any attachments.
>


Re: join query limitations

2020-09-14 Thread matthew sporleder
This probably carried forward from a very old version organically.  I
am running 7.7

On Mon, Sep 14, 2020 at 6:25 PM Erick Erickson  wrote:
>
> What version of Solr are you using? ‘cause 8x has this definition for 
> _version_
>
> <field name="_version_" type="plong" indexed="false" stored="false"/>
>
> and I find no text like you’re seeing in any schema file in 8x….
>
> So with a prior version, “try it and see”? See: 
> https://issues.apache.org/jira/browse/SOLR-9449 and linked JIRAs,
> the _version_ can be indexed=“false” since 6.3 at least if it’s 
> docValues=“true". It’s not clear to me that it needed
> to be indexed=“true” even before that, but no guarantees.
>
> updateLog will be defined in solrconfig.xml, but unless you’re on a very old 
> version of Solr it doesn’t matter
> ‘cause you don’t need to have indexed=“true”. Updatelog is not necessary if 
> you’re not running SolrCloud...
>
> I strongly urge you to completely remove all your indexes (perhaps create a 
> new collection) and re-index
> from scratch if you change the definition. You might be able to get away with 
> deleting all the docs then
> re-indexing, but just re-indexing all the docs without starting fresh can 
> have “interesting” results.
>
> Best,
> Erick
>
> > On Sep 14, 2020, at 5:16 PM, matthew sporleder  wrote:
> >
> > Yes but "the _version_ field is also a non-indexed, non-stored single
> > valued docValues field;"  <- is that a problem?
> >
> > My schema has this:
> >  
> >  
> >
> > I don't know if I use the updateLog or not.  How can I find out?
> >
> > I think that would work for me as I could just make a dynamic field like:
> >  > stored="false" multiValued="false" required="false" docValues="true"
> > />
> >
> > ---
> > Yes it is just for functions, sorting, and boosting
> >
> > On Mon, Sep 14, 2020 at 4:51 PM Erick Erickson  
> > wrote:
> >>
> >> Have you seen “In-place updates”?
> >>
> >> See:
> >> https://lucene.apache.org/solr/guide/8_1/updating-parts-of-documents.html
> >>
> >> Then use the field as part of a function query. Since it’s non-indexed, you
> >> won’t be searching on it. That said, you can do a lot with function queries
> >> to satisfy use-cases.
> >>
> >> Best.
> >> Erick
> >>
> >>> On Sep 14, 2020, at 3:12 PM, matthew sporleder  
> >>> wrote:
> >>>
> >>> I have hit a bit of a cross-road with our usage of solr where I want
> >>> to include some slightly dynamic data.
> >>>
> >>> I want to ask solr to find things like "text query" but only if they
> >>> meet some specific criteria.  When I have all of those criteria
> >>> indexed, everything works great.  (text contains "apples", in_season=1
> >>> ,sort by latest)
> >>>
> >>> Now I would like to add a criteria which changes every day -
> >>> popularity of a document, specifically.  This appeared to be *the*
> >>> canonical use case for external field files but I have 50M documents
> >>> (and growing) so a *text* file doesn't fit the bill.
> >>>
> >>> I also looked at using a !join but the limitations of !join, as I
> >>> understand them, appear to mean I can't use it for my use case? aka I
> >>> can't actually use the data from my traffic-stats core to sort/filter
> >>> "text contains" "apples", in_season=1, sort by most traffic, sort by
> >>> latest
> >>>
> >>> The last option appears to be updating all of my documents every
> >>> single day, possibly using atomic/partial updates, but even those have
> >>> a growing list of gotchas: losing stored=false documents is a big one,
> >>> caveats I don't quite understand related to copyFields, changes to the
> >>> _version_ field (the _version_ field is also a non-indexed, non-stored
> >>> single valued docValues field;), etc
> >>>
> >>> Where else can I look?  The last time we attempted something like this
> >>> we ended up rebuilding the index from scratch each day and shuffling
> >>> it out, which was really pretty nasty.
> >>>
> >>> Thanks,
> >>> Matt
> >>
>


Re: join query limitations

2020-09-14 Thread matthew sporleder
Yes but "the _version_ field is also a non-indexed, non-stored single
valued docValues field;"  <- is that a problem?

My schema has this:
  
  

I don't know if I use the updateLog or not.  How can I find out?

I think that would work for me as I could just make a dynamic field like:


---
Yes it is just for functions, sorting, and boosting
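
A minimal sketch of the in-place-update flow under discussion (collection, field, and id names here are made up; for Solr to do the update in place, the field must be single-valued, non-indexed, non-stored, docValues="true", and not the target of a copyField):

# Atomic-update "set" syntax; Solr performs it in place because of the
# field's properties, so the rest of the document is not reindexed.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycollection/update?commit=true' \
  --data-binary '[{"id":"doc-42","popularity_dv":{"set":1234}}]'

# The value is then available to sorting, boosting, and function queries:
curl 'http://localhost:8983/solr/mycollection/select?q=text:apples&sort=popularity_dv+desc'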

On Mon, Sep 14, 2020 at 4:51 PM Erick Erickson  wrote:
>
> Have you seen “In-place updates”?
>
> See:
> https://lucene.apache.org/solr/guide/8_1/updating-parts-of-documents.html
>
> Then use the field as part of a function query. Since it’s non-indexed, you
> won’t be searching on it. That said, you can do a lot with function queries
> to satisfy use-cases.
>
> Best.
> Erick
>
> > On Sep 14, 2020, at 3:12 PM, matthew sporleder  wrote:
> >
> > I have hit a bit of a cross-road with our usage of solr where I want
> > to include some slightly dynamic data.
> >
> > I want to ask solr to find things like "text query" but only if they
> > meet some specific criteria.  When I have all of those criteria
> > indexed, everything works great.  (text contains "apples", in_season=1
> > ,sort by latest)
> >
> > Now I would like to add a criteria which changes every day -
> > popularity of a document, specifically.  This appeared to be *the*
> > canonical use case for external field files but I have 50M documents
> > (and growing) so a *text* file doesn't fit the bill.
> >
> > I also looked at using a !join but the limitations of !join, as I
> > understand them, appear to mean I can't use it for my use case? aka I
> > can't actually use the data from my traffic-stats core to sort/filter
> > "text contains" "apples", in_season=1, sort by most traffic, sort by
> > latest
> >
> > The last option appears to be updating all of my documents every
> > single day, possibly using atomic/partial updates, but even those have
> > a growing list of gotchas: losing stored=false documents is a big one,
> > caveats I don't quite understand related to copyFields, changes to the
> > _version_ field (the _version_ field is also a non-indexed, non-stored
> > single valued docValues field;), etc
> >
> > Where else can I look?  The last time we attempted something like this
> > we ended up rebuilding the index from scratch each day and shuffling
> > it out, which was really pretty nasty.
> >
> > Thanks,
> > Matt
>


join query limitations

2020-09-14 Thread matthew sporleder
I have hit a bit of a cross-road with our usage of solr where I want
to include some slightly dynamic data.

I want to ask solr to find things like "text query" but only if they
meet some specific criteria.  When I have all of those criteria
indexed, everything works great.  (text contains "apples", in_season=1
,sort by latest)

Now I would like to add a criteria which changes every day -
popularity of a document, specifically.  This appeared to be *the*
canonical use case for external field files but I have 50M documents
(and growing) so a *text* file doesn't fit the bill.

I also looked at using a !join but the limitations of !join, as I
understand them, appear to mean I can't use it for my use case? aka I
can't actually use the data from my traffic-stats core to sort/filter
"text contains" "apples", in_season=1, sort by most traffic, sort by
latest
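
For reference, a sketch of the cross-core join in question (core and field names are hypothetical). It can filter the main documents by a condition held in a stats core, but it cannot pull a stats field across for sorting, which is the limitation described above:

# Filter "apples" docs to those with >= 100 hits in the traffic-stats core;
# sorting still has to use fields that live in the main core.
curl 'http://localhost:8983/solr/main/select' \
  --data-urlencode 'q=text:apples' \
  --data-urlencode 'fq=in_season:1' \
  --data-urlencode 'fq={!join fromIndex=traffic-stats from=doc_id to=id}hits:[100 TO *]' \
  --data-urlencode 'sort=published_dt desc'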

The last option appears to be updating all of my documents every
single day, possibly using atomic/partial updates, but even those have
a growing list of gotchas: losing stored=false documents is a big one,
caveats I don't quite understand related to copyFields, changes to the
_version_ field (the _version_ field is also a non-indexed, non-stored
single valued docValues field;), etc

Where else can I look?  The last time we attempted something like this
we ended up rebuilding the index from scratch each day and shuffling
it out, which was really pretty nasty.

Thanks,
Matt


downsides to infoStream ?

2020-09-08 Thread matthew sporleder
I saw
https://lucene.apache.org/solr/guide/8_5/indexconfig-in-solrconfig.html#other-indexing-settings
mentioned in a thread and was wondering if there was any downside to
running this potentially super useful log.

Thanks,
Matt
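
For reference, a sketch of turning it on in solrconfig.xml, per the page linked above; the documented behavior is very verbose low-level Lucene indexing diagnostics written to the log, so the practical cost to expect is log volume:

<indexConfig>
  <infoStream>true</infoStream>
</indexConfig>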


Re: external field file size

2020-09-01 Thread matthew sporleder
Okay thanks for the tip.  I am pretty wary of streaming logs into my
main set of documents + tons of $stat_updated_at fields + resetting
stats on ~every document every day + whatever else we feel like
trending.  It just feels like a lot of churn.

I will lean towards the !join on stats-$DATE probably.

On Tue, Sep 1, 2020 at 11:32 AM Erick Erickson  wrote:
>
> I wouldn’t use ExternalFileField if your use-case is served by in-place 
> updates. See
>
> https://lucene.apache.org/solr/guide/8_1/updating-parts-of-documents.html#in-place-updates
>
> EFFs were put in in order to have _some_ capability to change individual 
> fields in a doc
> long before in-place updates were around and long before SolrCloud. Using EFF 
> in any
> kind of sharded system will cause you significant heartburn in terms of 
> keeping the
> file up to date on all replicas.
>
> Best,
> Erick
>
> > On Sep 1, 2020, at 11:21 AM, matthew sporleder  wrote:
> >
> > We are researching the canonical use case for external fields --
> > traffic-based rankings
> >
> > What are the practical limits on the size of the external field file?
> > A k=v text file seems like it might fall over if it grows into the GB
> > range?
> >
> > Our other thought is to use rolling cores where we stream in web logs
> > and use !join queries.
> >
> > Does anyone have practical experience with this that they might want to 
> > share?
> >
> > Thanks,
> > Matt
>


external field file size

2020-09-01 Thread matthew sporleder
We are researching the canonical use case for external fields --
traffic-based rankings

What are the practical limits on the size of the external field file?
A k=v text file seems like it might fall over if it grows into the GB
range?
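
For reference, a sketch of the file format in question (ExternalFileField), with hypothetical names. The file lives in the core's data directory as external_<fieldName>, one uniqueKey=float pair per line, and is typically reloaded on searcher events via the ExternalFileFieldReloader listener:

# Write the external file (path and values are illustrative):
cat > /var/solr/data/mycore/data/external_popularity <<'EOF'
doc-1=42.5
doc-2=7.0
EOF
# Values are usable only through function queries, e.g.:
curl 'http://localhost:8983/solr/mycore/select?q=*:*&sort=field(popularity)+desc'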

Our other thought is to use rolling cores where we stream in web logs
and use !join queries.

Does anyone have practical experience with this that they might want to share?

Thanks,
Matt


Re: Cannot add replica during backup

2020-08-11 Thread matthew sporleder
I can already tell you it is EFS that is slow. I had to switch to an EBS disk
for backups on a different project because EFS couldn't keep up.
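
For reference, a sketch of the collections-API backup call discussed below (collection name and mount path are hypothetical; the location has to be a filesystem every node can see, which is why NFS/EFS gets used here):

curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=nightly-mycollection&collection=mycollection&location=/mnt/efs/solr-backups'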

> On Aug 10, 2020, at 9:43 PM, Ashwin Ramesh  wrote:
> 
> Hey Aroop, the general process for our backup is:
> - Connect all machines to an EFS drive (AWS's NFS service)
> - Call the collections API to backup into EFS
> - ZIP the directory once the backup is completed
> - Copy the ZIP into an s3 bucket
> 
> I'll probably have to see which part of the process is the slowest.
> 
> On another note, can you simply remove the task from the ZK path to
> continue the execution of tasks?
> 
> Regards,
> 
> Ash
> 
>> On Tue, Aug 11, 2020 at 11:40 AM Aroop Ganguly
>>  wrote:
>> 
>> 12 hours is extreme, we take backups of 10TB worth of indexes in 15 mins
>> using the collection backup api.
>> How are you taking the backup?
>> 
>> Do you actually see any backup progress or u are just seeing the task in
>> the overseer queue linger ?
>> I have seen restore tasks hanging in the queue forever despite process
>> completing in Solr 77 so wouldn’t be surprised this happens with backup as
>> well. And also observed that unless that unless that task is removed from
>> the overseer-collection-queue the next ones do not proceed.
>> 
>> Also adding replicas while backup seems like overkill, why don’t you just
>> have the appropriate replication factor in the first place and have
>> autoAddReplicas=true for indemnity?
>> 
>>> On Aug 10, 2020, at 6:32 PM, Ashwin Ramesh 
>> wrote:
>>> 
>>> Hi everybody,
>>> 
>>> We are using solr 7.6 (SolrCloud). We noticed that when the backup is
>>> running, we cannot add any replicas to the collection. By the looks of
>> it,
>>> the job to add the replica is put into the Overseer queue, but it is not
>>> being processed. Is this expected? And are there any workarounds?
>>> 
>>> Our backups take about 12 hours. Maybe we should try to optimize that too.
>>> 
>>> Regards,
>>> 
>>> Ash
>>> 


Re: copyField from empty multivalue

2020-08-07 Thread matthew sporleder
Nevermind I think we found this was caused by a bug in our (new) custom indexer

On Thu, Aug 6, 2020 at 4:11 PM matthew sporleder  wrote:
>
> I have a copyField:
> <copyField source="preview" dest="catchall"/>
>
> But sometimes preview (a field with indexed="true" stored="true" multiValued="true") is not populated.
>
> It appears that the "catchall" field does not get created when preview
> has no content in it.  Can I use required=false or similar on a
> copyField?
>
> Thanks,
> Matt


copyField from empty multivalue

2020-08-06 Thread matthew sporleder
I have a copyField:
<copyField source="preview" dest="catchall"/>

But sometimes preview (a field with indexed="true" stored="true" multiValued="true") is not populated.

It appears that the "catchall" field does not get created when preview
has no content in it.  Can I use required=false or similar on a
copyField?

Thanks,
Matt


Re: Meow attacks

2020-07-28 Thread matthew sporleder
On Tue, Jul 28, 2020 at 4:39 PM Odysci  wrote:
>
> Folks,
>
> I suspect one of our Zookeeper installations on AWS was subject to a Meow
> attack (
> https://arstechnica.com/information-technology/2020/07/more-than-1000-databases-have-been-nuked-by-mystery-meow-attack/
> )
>
> Basically, the configuration for one of our collections disappeared from
> the Zookeeper tree (when looking at the Solr interface), and it left
> several files ending in "-meow"
> Before I realized it, I stopped and restarted the ZK and Solr machines (as
> part of ubuntu updates), and when ZK didn't find the configuration for a
> collection, it deleted the collection from Solr. At least that's what I
> suspect happened.
>
> Fortunately it affected a very small index and we had backups. But it is
> very worrisome.
> Has anyone had any problems with this?
> Is there any type of log that I can check to sort out how this happened?
> The ZK log complained that the configs for the collection were not there,
> but that's about it.
>
> and, is there a better way to protect against such attacks?
> Thanks
>
> Reinaldo

Use VPC and private networks!

ask in ##aws on freenode if you are really lost


Re: Cybersecurity Incident Report

2020-07-24 Thread matthew sporleder
docker pull solr:8.4.1-slim

docker run -it --rm solr:8.4.1-slim /bin/bash

solr@223042112be5:/opt/solr-8.4.1$ find ./ -name "*jackson*"
./server/solr-webapp/webapp/WEB-INF/lib/jackson-core-2.10.0.jar
./server/solr-webapp/webapp/WEB-INF/lib/jackson-annotations-2.10.0.jar
./server/solr-webapp/webapp/WEB-INF/lib/jackson-dataformat-smile-2.10.0.jar
./server/solr-webapp/webapp/WEB-INF/lib/jackson-databind-2.10.0.jar
./contrib/prometheus-exporter/lib/jackson-jq-0.0.8.jar
./contrib/prometheus-exporter/lib/jackson-core-2.10.0.jar
./contrib/prometheus-exporter/lib/jackson-annotations-2.10.0.jar
./contrib/prometheus-exporter/lib/jackson-databind-2.10.0.jar
./contrib/clustering/lib/jackson-annotations-2.10.0.jar
./contrib/clustering/lib/jackson-databind-2.10.0.jar

How does the scanner work?

On Thu, Jul 23, 2020 at 11:23 PM Man with No Name
 wrote:
>
> Any help on this.?
>
> On Wed, Jul 22, 2020 at 4:25 PM Man with No Name 
> wrote:
>
> > The image is pulled from docker hub. After scanning the image from docker
> > hub, without any modification, this is the list of CVE we're getting.
> >
> >
> > Image: solr:8.4.1-slim (image ID 57561b4889690532 for every row)
> >
> > CVE             Package                                       Version       Severity  Status                             CVSS
> > --------------  --------------------------------------------  ------------  --------  ---------------------------------  ----
> > CVE-2019-16335  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.10                    9.8
> > CVE-2020-8840   com.fasterxml.jackson.core_jackson-databind   2.4.0         critical                                     9.8
> > CVE-2020-11620  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.10.4                  9.8
> > CVE-2020-9546   com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.10.4                  9.8
> > CVE-2020-9547   com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.10.4                  9.8
> > CVE-2019-20445  io.netty_netty-codec                          4.1.29.Final  critical  fixed in 4.1.44                    9.1
> > CVE-2020-9548   com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.10.4                  9.8
> > CVE-2017-15095  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.1, 2.8.10             9.8
> > CVE-2018-14718  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.7                     9.8
> > CVE-2019-16942  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical                                     9.8
> > CVE-2019-14893  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.10.0, 2.9.10            9.8
> > CVE-2018-7489   com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.5, 2.8.11.1, 2.7.9.3  9.8
> > CVE-2019-20444  io.netty_netty-codec                          4.1.29.Final  critical  fixed in 4.1.44                    9.1
> > CVE-2019-14540  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.10                    9.8
> > CVE-2019-16943  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical                                     9.8
> > CVE-2020-11612  io.netty_netty-codec                          4.1.29.Final  critical  fixed in 4.1.46                    9.8
> > CVE-2019-20330  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.10.2                  9.8
> > CVE-2019-17267  com.fasterxml.jackson.core_jackson-databind   2.4.0         critical  fixed in 2.9.10                    9.8
> >
> >
> > On Tue, Jul 21, 2020 at 

Re: Sitecore 9.3 / Solr 8.1.1 - Zookeeper Issue

2020-07-20 Thread matthew sporleder
FWIW the real error is "msg":"SolrCore is loading", which is bad if you are in the
middle of indexing.

What is happening on solr at this time?

> On Jul 20, 2020, at 4:46 AM, Charlie Hull  wrote:
> 
> Hi Austin,
> 
> Sitecore is a commercial product so your first port of call should be whoever 
> sold you or is supporting Sitecorea quick (and by no means deep) bit of 
> research shows this error may be generated by the Sitecore indexer process 
> calling Solr. We won't be able to see how it does that if it's closed source 
> code.
> 
> Cheers
> 
> Charlie
> 
>> On 20/07/2020 04:53, Austin Kimmel wrote:
>> Hello,
>> 
>> We are seeing the following errors with Sitecore 9.3 connecting to a Solr 
>> 8.1.1 cluster running on Zookeeper and haven't been able to resolve:
>> 
>> 
>> 2020-07-17 18:10:58.238 WARN  (zkCallback-8-thread-3) 
>> [c:pj4_sitecore_web_index s:shard1 r:core_node5 
>> x:pj4_sitecore_web_index_shard1_replica_n2] o.a.s.u.PeerSync PeerSync: 
>> core=pj4_sitecore_web_index_shard1_replica_n2 
>> url=https://10.5.64.40:8984/solr  got a 503 from 
>> https://10.5.64.41:8984/solr/pj4_sitecore_web_index_shard1_replica_n1/, 
>> counting as success => 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
>> from server at 
>> https://10.5.64.41:8984/solr/pj4_sitecore_web_index_shard1_replica_n1: 
>> Expected mime type application/octet-stream but got application/json. {   
>> "error":{ "metadata":[   
>> "error-class","org.apache.solr.common.SolrException",   
>> "root-error-class","org.apache.solr.common.SolrException"], 
>> "msg":"SolrCore is loading", "code":503}}
>> 
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:613)
>>  org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
>> from server at 
>> https://10.5.64.41:8984/solr/pj4_sitecore_web_index_shard1_replica_n1: 
>> Expected mime type application/octet-stream but got application/json. {   
>> "error":{ "metadata":[   
>> "error-class","org.apache.solr.common.SolrException",   
>> "root-error-class","org.apache.solr.common.SolrException"], 
>> "msg":"SolrCore is loading", "code":503}}
>> 
>> 
>> 2020-07-17 18:10:58.276 ERROR (zkCallback-8-thread-3) 
>> [c:pj4_sitecore_web_index s:shard1 r:core_node5 
>> x:pj4_sitecore_web_index_shard1_replica_n2] o.a.s.c.SyncStrategy Sync 
>> request error: 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
>> from server at 
>> https://10.5.64.41:8984/solr/pj4_sitecore_web_index_shard1_replica_n1: 
>> Expected mime type application/octet-stream but got application/json. {   
>> "error":{ "metadata":[   
>> "error-class","org.apache.solr.common.SolrException",   
>> "root-error-class","org.apache.solr.common.SolrException"], 
>> "msg":"SolrCore is loading", "code":503}}
>> 
>> 
>> 2020-07-17 18:10:59.598 ERROR (qtp1661210650-149) [   ] o.a.s.s.HttpSolrCall 
>> null:org.apache.solr.common.SolrException: Error trying to proxy request for 
>> url: https://10.5.64.42:8984/solr/pj4_sitecore_web_index/admin/ping at 
>> org.apache.solr.servlet.HttpSolrCall.remoteQuery(HttpSolrCall.java:692) 
>> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:526) at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:397)
>>  at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)
>>  at 
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
>>  at 
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)   
>>   at 
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
>>  at 
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)  
>>at 
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>>  at 
>> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>>  at 
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1588)
>>  at 
>> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>>  at 
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
>>  at 
>> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>>  at 
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
>>  at 
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1557)
>>  at 
>> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
>>  at 
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
>>  at 
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>>  at 
>> 

Re: [ANNOUNCE] Apache Solr 8.6.0 released

2020-07-16 Thread matthew sporleder
I hear all of that and agree, obviously, but "curl
solr:8983/collection/dataimport?blah" in cron was *pretty freaking
easy* ;)

Not sure why "pull" is elevated to "anti-pattern"; data is data is data
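
For reference, the kind of cron-driven pull being described, with hypothetical host/core names and the handler at its conventional /dataimport path:

# Kick off a nightly DIH full import; progress can be polled with command=status.
15 2 * * * curl -s 'http://solr:8983/solr/mycore/dataimport?command=full-import&clean=true' >/dev/null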

On Thu, Jul 16, 2020 at 8:49 PM Ishan Chattopadhyaya
 wrote:
>
> Thanks Aroop for your feedback. We shall try to ensure continuity of
> functionality via packages. Your help in those efforts would be greatly
> appreciated as well. Let us take this discussion to SOLR-14660.
>
> > Is there a replacement for DIH?
> DIH is available as a community supported package. However, it is an
> anti-pattern for a search engine to be pulling data from outside. Instead,
> please consider writing separate indexing programs that pull data from the
> database systems and index into Solr. It is not only a good practice, but
> also more efficient in terms of throughput. For more information on this,
> please start another thread in solr-users@ list, and more people can
> suggest best alternatives here.
>
>
> On Fri, Jul 17, 2020 at 5:50 AM matthew sporleder 
> wrote:
>
> > Is there a replacement for DIH?
> >
> > On Wed, Jul 15, 2020 at 10:08 AM Ishan Chattopadhyaya
> >  wrote:
> > >
> > > Dear Solr Users,
> > >
> > > In this release (Solr 8.6), we have deprecated the following:
> > >
> > >   1. Data Import Handler
> > >
> > >   2. HDFS support
> > >
> > >   3. Cross Data Center Replication (CDCR)
> > >
> > >
> > >
> > > All of these are scheduled to be removed in a future 9.x release.
> > >
> > > It was decided that these components did not meet the standards of
> > quality
> > > and support that we wish to ensure for all components we ship. Some of
> > > these also relied on design patterns that we no longer recommend for use
> > in
> > > critical production environments.
> > >
> > > If you rely on these features, you are encouraged to try out community
> > > supported versions of these, where available [0]. Where such community
> > > support is not available, we encourage you to participate in the
> > migration
> > > of these components into community supported packages and help continue
> > the
> > > development. We envision that using packages for these components via
> > > package manager will actually make it easier for users to use such
> > features.
> > >
> > > Regards,
> > >
> > > Ishan Chattopadhyaya
> > >
> > > (On behalf of the Apache Lucene/Solr PMC)
> > >
> > > [0] -
> > >
> > https://cwiki.apache.org/confluence/display/SOLR/Community+supported+packages+for+Solr
> > >
> > > On Wed, Jul 15, 2020 at 2:30 PM Bruno Roustant  > >
> > > wrote:
> > >
> > > > The Lucene PMC is pleased to announce the release of Apache Solr 8.6.0.
> > > >
> > > >
> > > > Solr is the popular, blazing fast, open source NoSQL search platform
> > from
> > > > the Apache Lucene project. Its major features include powerful
> > full-text
> > > > search, hit highlighting, faceted search, dynamic clustering, database
> > > > integration, rich document handling, and geospatial search. Solr is
> > highly
> > > > scalable, providing fault tolerant distributed search and indexing, and
> > > > powers the search and navigation features of many of the world's
> > largest
> > > > internet sites.
> > > >
> > > >
> > > > Solr 8.6.0 is available for immediate download at:
> > > >
> > > >
> > > >   <https://lucene.apache.org/solr/downloads.html>
> > > >
> > > >
> > > > ### Solr 8.6.0 Release Highlights:
> > > >
> > > >
> > > >  * Cross-Collection Join Queries: Join queries can now work
> > > > cross-collection, even when shared or when spanning nodes.
> > > >
> > > >  * Search: Performance improvement for some types of queries when exact
> > > > hit count isn't needed by using BlockMax WAND algorithm.
> > > >
> > > >  * Streaming Expression: Percentiles and standard deviation
> > aggregations
> > > > added to stats, facet and time series.  Streaming expressions added to
> > > > /export handler.  Drill Streaming Expression for efficient and accurate
> > > > high cardinality aggregation.
> > > >
> > >  * Package manager: Support for cluster (CoreContainer) level plugins.

Re: [ANNOUNCE] Apache Solr 8.6.0 released

2020-07-16 Thread matthew sporleder
Is there a replacement for DIH?

On Wed, Jul 15, 2020 at 10:08 AM Ishan Chattopadhyaya
 wrote:
>
> Dear Solr Users,
>
> In this release (Solr 8.6), we have deprecated the following:
>
>   1. Data Import Handler
>
>   2. HDFS support
>
>   3. Cross Data Center Replication (CDCR)
>
>
>
> All of these are scheduled to be removed in a future 9.x release.
>
> It was decided that these components did not meet the standards of quality
> and support that we wish to ensure for all components we ship. Some of
> these also relied on design patterns that we no longer recommend for use in
> critical production environments.
>
> If you rely on these features, you are encouraged to try out community
> supported versions of these, where available [0]. Where such community
> support is not available, we encourage you to participate in the migration
> of these components into community supported packages and help continue the
> development. We envision that using packages for these components via
> package manager will actually make it easier for users to use such features.
>
> Regards,
>
> Ishan Chattopadhyaya
>
> (On behalf of the Apache Lucene/Solr PMC)
>
> [0] -
> https://cwiki.apache.org/confluence/display/SOLR/Community+supported+packages+for+Solr
>
> On Wed, Jul 15, 2020 at 2:30 PM Bruno Roustant 
> wrote:
>
> > The Lucene PMC is pleased to announce the release of Apache Solr 8.6.0.
> >
> >
> > Solr is the popular, blazing fast, open source NoSQL search platform from
> > the Apache Lucene project. Its major features include powerful full-text
> > search, hit highlighting, faceted search, dynamic clustering, database
> > integration, rich document handling, and geospatial search. Solr is highly
> > scalable, providing fault tolerant distributed search and indexing, and
> > powers the search and navigation features of many of the world's largest
> > internet sites.
> >
> >
> > Solr 8.6.0 is available for immediate download at:
> >
> >
> >   <https://lucene.apache.org/solr/downloads.html>
> >
> >
> > ### Solr 8.6.0 Release Highlights:
> >
> >
> >  * Cross-Collection Join Queries: Join queries can now work
> > cross-collection, even when shared or when spanning nodes.
> >
> >  * Search: Performance improvement for some types of queries when exact
> > hit count isn't needed by using BlockMax WAND algorithm.
> >
> >  * Streaming Expression: Percentiles and standard deviation aggregations
> > added to stats, facet and time series.  Streaming expressions added to
> > /export handler.  Drill Streaming Expression for efficient and accurate
> > high cardinality aggregation.
> >
> >  * Package manager: Support for cluster (CoreContainer) level plugins.
> >
> >  * Health Check: HealthCheckHandler can now require that all cores are
> > healthy before returning OK.
> >
> >  * Zookeeper read API: A read API at /api/cluster/zk/* to fetch raw ZK
> > data and view contents of a ZK directory.
> >
> >  * Admin UI: New panel with security info in admin UI's dashboard.
> >
> >  * Query DSL: Support for {param:ref} and {bool: {excludeTags:""}}
> >
> >  * Ref Guide: Major redesign of Solr's documentation.
> >
> >
> > Please read CHANGES.txt for a full list of new features and changes:
> >
> >
> >   
> >
> >
> > Solr 8.6.0 also includes features, optimizations  and bugfixes in the
> > corresponding Apache Lucene release:
> >
> >
> >   
> >
> >
> > Note: The Apache Software Foundation uses an extensive mirroring network
> > for
> >
> > distributing releases. It is possible that the mirror you are using may
> > not have
> >
> > replicated the release yet. If that is the case, please try another mirror.
> >
> > This also applies to Maven access.
> >


Re: CDCR stress-test issues

2020-06-24 Thread matthew sporleder
On Wed, Jun 24, 2020 at 9:46 AM Oakley, Craig (NIH/NLM/NCBI) [C]
 wrote:
>
> In attempting to stress-test CDCR (running Solr 7.4), I am running into a 
> couple of issues.
>
> One is that the tlog files keep accumulating for some nodes in the CDCR 
> system, particularly for the non-Leader nodes in the Source SolrCloud. No 
> quantity of hard commits seem to cause any of these tlog files to be 
> released. This can become a problem upon reboot if there are hundreds of 
> thousands of tlog files, and Solr fails to start (complaining that there are 
> too many open files).
>
> The tlogs had been accumulating on all the nodes of the CDCR set of 
> SolrClouds until I added these two lines to the solrconfig.xml file (for 
> testing purposes, using numbers much lower than in the examples):
> <int name="numRecordsToKeep">5</int>
> <int name="maxNumLogsToKeep">2</int>
> Since then, it is mostly the non-Leader nodes of the Source SolrCloud which 
> accumulates tlog files (the Target SolrCloud does seem to have a tendency to 
> clean up the tlog files, as does the Leader of the Source SolrCloud). If I 
> use ADDREPLICAPROP and REBALANCELEADERS to change which node is the Leader, 
> and if I then start adding more data, the tlogs on the new Leader sometimes 
> will go away, but then the old Leader begins accumulating tlog files. I am 
> dubious whether frequent reassignment of Leadership would be a practical 
> solution.
>
> I also have several times attempted to simulate a production environment by 
> running several loops simultaneously, each of which inserts multiple records 
> on each iteration of the loop. Several times, I end up with a dozen records 
> on (both replicas of) the Source which never make it to (either replica of) 
> the Target. The Target has thousands of records which were inserted before 
> the missing records, and thousands of records which were inserted after the 
> missing records (and all these records, the replicated and the missing, were 
> inserted by curl commands which only differed in sequential numbers 
> incorporated into the values being inserted).
>
> I also have a question regarding SOLR-13141: the 11/Feb/19 comment says that 
> the fix for Solr 7.3 had a problem; and the header says "Affects Version/s: 
> 7.5, 7.6": does that indicate that Solr 7.4 is not affected?
>
> Are  there any suggestions?
>
> Thanks

Just going to "me too" here: I've had (non-CDCR) installs accumulate
tlogs until eventual rebuilds or crashes.


Re: Getting rid of zookeeper

2020-06-10 Thread matthew sporleder
FWIW -- zookeeper is pretty set-and-forget in my experience with
settings like autopurge.snapRetainCount, autopurge.purgeInterval, and
rotating the zookeeper.out stdout file.

It is a big hassle to set up the individual myid files and keep them in
sync with the server.$id=hostname entries in zoo.cfg but, again, one-time
pain.

I think smaller solr deployments could benefit from some easier
ability to configure the embedded zookeeper (like the improved zk
upconfig and friends) which might address this entire point?  The only
reason I don't run embedded zk (I use three small ec2's) is because
cpu/disk contention on the same server have burned me in the past.
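
For reference, the zoo.cfg pieces mentioned above (hostnames hypothetical). Each server additionally needs a myid file in its dataDir whose number matches its server.N entry, which is the one-time pain being described:

# zoo.cfg (sketch)
dataDir=/var/lib/zookeeper
autopurge.snapRetainCount=5
autopurge.purgeInterval=24
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

# on the zk1 host:
echo 1 > /var/lib/zookeeper/myid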

On Wed, Jun 10, 2020 at 3:30 AM Jan Høydahl  wrote:
>
> Curator is just on the client (solr) side, to make it easier to integrate 
> with Zookeeper, right?
>
> If you study Elastic, they had terrible cluster stability a few years ago 
> since everything
> was too «dynamic» and «zero config». That led to the system outsmarting 
> itself when facing
> real-life network partitions and other failures. Solr did not have these 
> issues exactly because
> it relies on Zookeeper which is very static and hard to change (on purpose), 
> and thus delivers
> a strong, stable quorum. So what did Elastic do a couple years ago? They 
> adopted the same
> best practice as ZK, recommending 3 or 5 (statically defined) master nodes 
> that owns the
> cluster state.
>
> Solr could get rid of ZK the same way as KAFKA. But while KAFKA already has a
> distributed log they could replace ZK with (hey, Kafka IS a log), Solr would 
> need to add
> such a log, and it would need to be embedded in the Solr process to avoid 
> that extra runtime.
> I believe it could be done with Apache Ratis 
> (https://ratis.incubator.apache.org ) 
> which
> is a RAFT Java library. But I’m doubtful if the project has the bandwidth and 
> dedication right
> now to embark on such a project. It would probably be a multi-year effort, 
> first building
> abstractions on top of ZK, then moving one piece of ZK dependency over to 
> RAFT at a time,
> needing both systems in parallel, before at the end ZK could go away.
>
> I’d like to see it happen. Especially for smaller deployments it would be 
> fantastic.
>
> Jan
>
> > 10. jun. 2020 kl. 01:03 skrev Erick Erickson :
> >
> > The intermediate solution is to migrate to Curator. I don’t know all the 
> > ins and outs
> > of that and whether or not it would be easier to setup and maintain.
> >
> > I do know that Zookeeper is deeply embedded in Solr and taking replacing it 
> > with
> > most anything would be a major pain.
> >
> > I’m also certain that rewriting Zookeeper is a rat-hole that would take a 
> > major
> > effort. If anyone would like to try it, all patches welcome.
> >
> > FWIW,
> > er...@curmudgeon.com
> >
> >> On Jun 9, 2020, at 6:01 PM, Dave  wrote:
> >>
> >> Is it horrible that I’m already burnt out from just reading that?
> >>
> >> I’m going to stick to the classic solr master slave set up for the 
> >> foreseeable future, at least that let’s me focus more on the search theory 
> >> rather than the back end system non stop.
> >>
> >>> On Jun 9, 2020, at 5:11 PM, Vincenzo D'Amore  wrote:
> >>>
> >>> My 2 cents, I have few solrcloud productions installations, I would share
> >>> some thoughts of what I learned in the latest 4/5 years (fwiw) just as 
> >>> they
> >>> come out of my mind.
> >>>
> >>> - to configure a SolrCloud *production* Cluster you have to be a zookeeper
> >>> expert even if you only need Solr.
> >>> - the Zookeeper ensemble (3 or 5 zookeeper nodes) is recommended to run on
> >>> separate machines but for many customers this is too expensive. And for 
> >>> the
> >>> rest it is expensive just to have the instances (i.e. dockers). It is
> >>> expensive even to have people that know Zookeeper or even only train them.
> >>> - given the high availability function of a zookeeper cluster you have
> >>> to monitor it and promptly backup and restore. But it is hard to monitor
> >>> (and configure the monitoring) and it is even harder to backup and restore
> >>> (when it is running).
> >>> - You can't add or remove nodes in zookeeper when it is up. Only the 
> >>> latest
> >>> version should finally give the possibility to add/remove nodes when it is
> >>> running, but afak this is not still supported by SolrCloud (out of the 
> >>> box).
> >>> - many people fail when they try to run a SolrCloud cluster because it is
> >>> hard to set up, for example: SolrCloud zkcli runs poorly on windows.
> >>> - it is hard to admin the zookeeper remotely, basically there are no
> >>> utilities that let you easily list/read/write/delete files on a zookeeper
> >>> filesystem.
> >>> - it was really hard to create a zookeeper ensemble in kubernetes, only
> >>> recently appeared few solutions. This was so counter-productive for the
> >>> Solr project because now the world is moving to Kubernetes, and there is
> >>> 

Re: JMX metrics for solr cloud cluster state

2020-05-31 Thread matthew sporleder
Complain to New Relic about their lagging Solr support!!!  I have, and I
could use some support!

To address your actual question: I have found JMX in Solr to be crazy
unreliable, but the admin/metrics web endpoint is pretty good.

I have some (crappy) python for parsing it for datadog:
https://github.com/msporleder/dd-solrcloud  you might be able to ship
something similar to insights if you were so inclined
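
For reference, sketches of polling those endpoints (host and params illustrative); replica state for degraded-collection alerting can also be pulled from the collections API:

curl -s 'http://localhost:8983/solr/admin/metrics?group=core'
# Per-replica state (active, recovering, down) for every collection:
curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS'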

On Sun, May 31, 2020 at 7:15 PM Ganesh Sethuraman
 wrote:
>
> Hi
>
> We use New Relic to monitor a SolrCloud 7.2.1 cluster. We would like to get
> alerted on any cluster state change, for example a degraded shard or a
> replica down. New Relic can monitor any JMX metrics.
>
> Can you suggest JMX metrics that will help monitor degraded cluster,
> replica recovering, shard replica down, etc?
>
> I couldn't find any such metric in the Solr documentation.
>
> Regards
> Ganesh


Re: Indexing huge data onto solr

2020-05-22 Thread matthew sporleder
I can index (without nested entities, of course ;) ) 100M records in about
6-8 hours on a pretty low-powered machine using vanilla DIH -> MySQL, so it
is probably worth looking at why it is going slow before writing your own
indexer (which we are finally having to do).

On Fri, May 22, 2020 at 1:22 PM Erick Erickson  wrote:
>
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
>
> https://lucidworks.com/post/indexing-with-solrj/
>
> It’s especially instructive to comment out just the call to 
> CloudSolrClient.add(doclist…); If
> that _still_ takes a long time, then your DB query is the root of the 
> problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above 
> test will tell you
> where to go to try to speed things up.
>
> Best,
> Erick
>
> > On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
> >  wrote:
> >
> > Hi All,
> >
> > We are running Solr 8.4.1. We have a database table which has more than 100 
> > million records. Till now we have been using DIH to do a full-import on the 
> > tables. But for this table, when we do a full-import via DIH it takes 
> > more than 3-4 days to complete, and it also consumes a fair bit of JVM memory 
> > while running.
> >
> > Are there any speedier/alternate ways to load data onto this Solr core?
> >
> > P.S: Only initial data import is problem, further updates/additions to this 
> > core is being done through SolrJ.
> >
> > Thanks,
> > Srinivas
> > 
>


Re: when to use docvalue

2020-05-19 Thread matthew sporleder
You can index AND docvalue?  For some reason I thought they were exclusive
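
A sketch via the Schema API (field name hypothetical) of a field carrying both: indexed="true" for q/fq searching and docValues="true" for sort/facet/function use:

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycore/schema' \
  --data-binary '{"add-field":{"name":"price","type":"pint","indexed":true,"stored":false,"docValues":true}}'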

On Tue, May 19, 2020 at 5:36 PM Erick Erickson  wrote:
>
> Yes. You should also index them….
>
> Here’s the way I think of it.
>
> For questions “For term X, which docs contain that value?” means index=true. 
> This is a search.
>
> For questions “Does doc X have value Y in field Z”, means docValues=true.
>
> what’s the difference? Well, the first one is to get the result set. The 
> second is for, given a result set,
> count/sort/whatever.
>
> fq clauses are searches, so index=true.
>
> sorting, faceting, grouping and function queries  are “for each doc in the 
> result set, what values does field Y contain?”
>
> Maybe that made things clear as mud, but it’s the way I think of it ;)
>
> Best,
> Erick
>
>
>
>
> > On May 19, 2020, at 4:00 PM, matthew sporleder  wrote:
> >
> > I have quite a few numeric / meta-data type fields in my schema and
> > pretty much only use them in fq=, sort=, and friends.  Should I always
> > use DocValue on these if i never plan to q=search: on them?  Are there
> > any drawbacks?
> >
> > Thanks,
> > Matt
>


when to use docvalue

2020-05-19 Thread matthew sporleder
I have quite a few numeric / meta-data type fields in my schema and
pretty much only use them in fq=, sort=, and friends.  Should I always
use DocValue on these if i never plan to q=search: on them?  Are there
any drawbacks?

Thanks,
Matt


Re: nested entities and DIH indexing time

2020-05-14 Thread matthew sporleder
On Thu, May 14, 2020 at 4:46 PM Shawn Heisey  wrote:
>
> On 5/14/2020 9:36 AM, matthew sporleder wrote:
> > It appears that adding nested entities to the entities in my data import
> > config is slowing down my import process by a lot.  Is there a good
> > way to speed this up?  I see the ID's are individually queried instead
> > of using IN() or similar normal techniques to make things faster.
> >
> > Just looking for some tips.  I prefer this architecture to the way we
> > currently do it with complex SQL, inserting weird strings, and then
> > splitting on them (gross but faster).
>
> When you have nested entities, this is how DIH works.  A separate SQL
> query for the inner entity is made for each row returned on the outer
> entity.  Nested entities tend to be extremely slow for this reason.
>
> The best way to work around this is to make the database server do the
> heavy lifting -- using JOIN or other methods so that you only need one
> entity and one SQL query.  Doing this will mean that you'll need to
> split the data after import, using either the DIH config or the analysis
> configuration in the schema.
>
> Thanks,
> Shawn

This is too bad because nested entities are very clean and the JOIN/CONCAT/SPLIT
method is very gross.

I was also hoping to use different delta queries for each nested entity.

Can a non-nested entity write into existing docs, or do they always
have to produce document-per-entity?
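
For reference, a sketch of the JOIN/CONCAT/SPLIT workaround being discussed, with hypothetical table and field names: do the heavy lifting in SQL, then split the concatenated column back into multiple values with DIH's RegexTransformer (declare transformer="RegexTransformer" on the entity and splitBy="\|" on the field in data-config.xml):

-- single-entity SQL (MySQL syntax shown)
SELECT p.id, p.title,
       GROUP_CONCAT(c.body SEPARATOR '|') AS nested1
FROM parent p
LEFT JOIN child c ON c.p_id = p.id
GROUP BY p.id;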


nested entities and DIH indexing time

2020-05-14 Thread matthew sporleder
It appears that adding nested entities to the entities in my data import
config is slowing down my import process by a lot.  Is there a good
way to speed this up?  I see the ID's are individually queried instead
of using IN() or similar normal techniques to make things faster.

Just looking for some tips.  I prefer this architecture to the way we
currently do it with complex SQL, inserting weird strings, and then
splitting on them (gross but faster).


Re: DIH nested entity repeating query in verbose output

2020-05-14 Thread matthew sporleder
I think this is just an issue in the verbose/debug output.  tcpdump
does not show the same issue.

On Wed, May 13, 2020 at 7:39 PM matthew sporleder  wrote:
>
> I am attempting to use nested entities to populate documents from
> different tables and verbose/debug output is showing repeated queries
> on import.  Each document number shows the same SQL repeated several times.
>
> "verbose-output":
> [ "entity:parent",
> ..
> [ "document#5", [
> ...
> "entity:nested1", [
> "query", "SELECT body AS nested1 FROM table WHERE p_id = '1234'",
> "query", "SELECT body AS nested1 FROM table WHERE p_id = '1234'",
> "query", "SELECT body AS nested1 FROM table WHERE p_id = '1234'",
> "query", "SELECT body AS nested1 FROM table WHERE p_id = '1234",
> "query", "SELECT body AS nested1 FROM table WHERE p_id = '1234",
> "time-taken", "0:0:0.1",
> "time-taken", "0:0:0.1",
> "time-taken", "0:0:0.1",
> "time-taken", "0:0:0.1",
> "time-taken", "0:0:0.1" ],
>
>
> The counts appears to be correct?
> Requests: 61 , Fetched: 20 , Skipped: 0 , Processed: 20
>
>
> I have a config like:
>
> <entity
>   dataSource="database"
>   name="parent"
>   pk="id"
>   query="SELECT .."
>   deltaImportQuery="SELECT.."
>   deltaQuery="SELECT.."
>   >
>   <entity
>     name="child1"
>     query="SELECT body AS nested1 FROM table WHERE p_id = '${parent.id}'"
>     deltaQuery=...
>     parentDeltaQuery=...
>     etc
>   >
>   </entity>
>   <entity
>     name="child2"
>     query="SELECT body AS nested2 FROM table WHERE p_id = '${parent.id}'"
>     deltaQuery=...
>     parentDeltaQuery=...
>     etc
>   >
>   </entity>
>   <entity
>     name="child3"
>     query="SELECT body AS nested3 FROM table WHERE p_id = '${parent.id}'"
>     deltaQuery=...
>     parentDeltaQuery=...
>     etc
>   >
>   </entity>
> </entity>


DIH nested entity repeating query in verbose output

2020-05-13 Thread matthew sporleder
I am attempting to use nested entities to populate documents from
different tables and verbose/debug output is showing repeated queries
on import.  Each document number shows the same SQL repeated several times.

"verbose-output":
[ "entity:parent",
..
[ "document#5", [
...
"entity:nested1", [
"query", "SELECT body AS nested1 FROM table WHERE p_id = '1234'",
"query", "SELECT body AS nested1 FROM table WHERE p_id = '1234'",
"query", "SELECT body AS nested1 FROM table WHERE p_id = '1234'",
"query", "SELECT body AS nested1 FROM table WHERE p_id = '1234",
"query", "SELECT body AS nested1 FROM table WHERE p_id = '1234",
"time-taken", "0:0:0.1",
"time-taken", "0:0:0.1",
"time-taken", "0:0:0.1",
"time-taken", "0:0:0.1",
"time-taken", "0:0:0.1" ],


The counts appears to be correct?
Requests: 61 , Fetched: 20 , Skipped: 0 , Processed: 20


I have a config like:

<entity
  dataSource="database"
  name="parent"
  pk="id"
  query="SELECT .."
  deltaImportQuery="SELECT.."
  deltaQuery="SELECT.."
  >
  <entity
    name="child1"
    query="SELECT body AS nested1 FROM table WHERE p_id = '${parent.id}'"
    deltaQuery=...
    parentDeltaQuery=...
    etc
  >
  </entity>
  <entity
    name="child2"
    query="SELECT body AS nested2 FROM table WHERE p_id = '${parent.id}'"
    deltaQuery=...
    parentDeltaQuery=...
    etc
  >
  </entity>
  <entity
    name="child3"
    query="SELECT body AS nested3 FROM table WHERE p_id = '${parent.id}'"
    deltaQuery=...
    parentDeltaQuery=...
    etc
  >
  </entity>
</entity>

Re: Response Time Diff between Collection with low deletes

2020-05-10 Thread matthew sporleder
Why so many shards?

> On May 10, 2020, at 9:09 PM, Ganesh Sethuraman  
> wrote:
> 
> We are using dedicated hosts, CentOS on EC2 r5.12xlarge (48 CPU, ~360GB
> RAM), 2 nodes, swappiness set to 1, with General Purpose 2TB EBS SSD volumes.
> JVM size of 18GB, with G1 GC enabled. About 92 collections with an average of 8
> shards and 2 replicas each. Most updates arrive via daily batch updates.
> 
> We have Solr disk utilization of about ~800GB. Most of the collection
> space is for real-time GET (/get) calls. The issue we are having is with the few
> collections where we have a query use case/need. One of these has 32 replicas (16
> shards, 2 replicas each). During performance tests the issue is a few calls with
> high response times; it is noticeable when the test duration is short, and the
> response times improve when the test runs for a longer duration.
> 
> Hope this information helps.
> 
> Regards
> Ganesh
> 
> Regards
> Ganesh
> 
> 
>> On Sun, May 10, 2020, 8:14 PM Shawn Heisey  wrote:
>> 
>>> On 5/10/2020 4:48 PM, Ganesh Sethuraman wrote:
>>> The additional info is that when we execute the test for longer (20mins)
>> we
>>> are seeing better response time, however for a short test (5mins) and
>> rerun
>>> the test after an hour or so we are seeing slow response times again.
>> Note
>>> that we don't update the collection during the test or in between the
>> test.
>>> Does this help to identify the issue?
>> 
>> Assuming Solr is the only software that is running, most operating
>> systems would not remove Solr data from the disk cache, so unless you
>> have other software running on the machine, it's a little weird that
>> performance drops back down after waiting an hour.  Windows is an
>> example of an OS that *does* proactively change data in the disk cache,
>> and on that OS, I would not be surprised by such behavior.  You haven't
>> mentioned which OS you're running on.
>> 
>>> 3. We have designed our test to mimick reality where filter cache is not
>>> hit at all. From solr, we are seeing that there is ZERO Filter cache hit.
>>> There is about 4% query and document cache hit in prod and we are seeing
>> no
>>> filter cache hit in both QA and PROD
>> 
>> If you're getting zero cache hits, you should disable the cache that is
>> getting zero hits.  There is no reason to waste the memory that the
>> cache uses, because there is no benefit.
>> 
>>> Give that, could this be some warming up related issue to keep the Solr /
>>> Lucene memory-mapped file in RAM? Is there any way to measure which
>>> collection is using memory? we do have 350GB RAM, but we see it full with
>>> buffer cache, not really sure what is really using this memory.
>> 
>> You would have to ask the OS which files are contained by the OS disk
>> cache, and it's possible that even if the information is available, that
>> it is very difficult to get.  There is no way Solr can report this.
>> 
>> Thanks,
>> Shawn
>> 


Re: SolrCloud degraded during backup and batch CSV update

2020-05-01 Thread matthew sporleder
If the errors happen with garbage collection then potentially, yes.
You should never pause longer than your zk timeout (both sides).


On Thu, Apr 30, 2020 at 11:03 PM Ganesh Sethuraman
 wrote:
>
> Are any other JVM settings changes possible?
>
> On Tue, Apr 28, 2020, 10:15 PM Sethuraman, Ganesh
>  wrote:
>
> > Hi
> >
> > We are using SolrCloud 7.2.1 with a 3-node Zookeeper ensemble. We have 92
> > collections, each on avg. having 8 shards and 2 replicas, across 2 EC2 nodes,
> > with a JVM size of 18GB (G1 GC). We need your help with an issue we faced
> > today: SolrCloud went into a degraded state (for a
> > few collections) when a Solr backup and a Solr batch CSV update load
> > happened at the same time. The CSV data load was about ~5 GB per
> > shard/replica. We think this happened after the zkClient disconnect
> > noted below.  We had to restart Solr to bring it back to normal.
> >
> >
> >   1.  Is it not suggested to run a backup and a large Solr batch CSV update
> > load at the same time?
> >   2.  In the past we have seen that two CSV batch update loads in parallel
> > cause issues; is this also not suggested (this issue is not related to
> > that)?
> >   3.  Do you think we should increase the Zookeeper timeout?
> >   4.  How do we know if we need to up the JVM max memory, and by how much?
> >   5.  We also see that once Solr goes into a degraded collection state and
> > recovery fails, it NEVER gets back to normal, even when there is no
> > load. Is this a bug?
> >
> > The GC information and Solr Log below
> >
> >
> > https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMjAvMDQvMjkvLS0wMl9zb2xyX2djLmxvZy56aXAtLTEtNDAtMzE==WEB
> >
> >
> > 2020-04-27 07:34:07.322 WARN
> > (zkConnectionManagerCallback-6-thread-1-processing-n:mysolrsever.com:6010_solr-SendThread(zoo-prd-n1:2181))
> > [   ] o.a.z.ClientCnxn Client session timed out, have not heard from server
> > in 10775ms for sessionid 0x171a6fb51310008
> > 
> > 2020-04-27 07:34:07.426 WARN
> > (zkConnectionManagerCallback-6-thread-1-processing-n:mysolrsever.com:6010_solr-EventThread)
> > [   ] o.a.s.c.c.ConnectionManager zkClient has disconnected
> >
> >
> >
> >
> > SOLR Log Below (Curtailed WARN log)
> > 
> > 2020-04-27 07:26:45.402 WARN
> > (recoveryExecutor-4-thread-697-processing-n:mysolrsever.com:6010_solr
> > x:mycollection_shard13_replica_n48 s:shard13 c:mycollection r:core_node51)
> > [c:mycollection s:shard13 r:core_node51 x:mycollection_shard13_replica_n48]
> > o.a.s.h.IndexFetcher Error in fetching file: _1kr_r.liv (downloaded 0 of
> > 587 bytes)
> > java.io.EOFException
> >   at
> > org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:168)
> >   at
> > org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
> >   at
> > org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1579)
> >   at
> > org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1545)
> >   at
> > org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1526)
> >   at
> > org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1008)
> >   at
> > org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:566)
> >   at
> > org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:345)
> >   at
> > org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:420)
> >   at
> > org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:225)
> >   at
> > org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:626)
> >   at
> > org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
> >   at
> > org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:292)
> >   at
> > com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
> >   at
> > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> >   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >   at
> > org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
> >   at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >   at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >   at java.lang.Thread.run(Thread.java:748)
> > 2020-04-27 07:26:45.405 WARN
> > (recoveryExecutor-4-thread-697-processing-n:mysolrsever.com:6010_solr
> > x:mycollection_shard13_replica_n48 s:shard13 c:mycollection r:core_node51)
> > [c:mycollection s:shard13 r:core_node51 x:mycollection_shard13_replica_n48]
> > o.a.s.h.IndexFetcher 

Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread matthew sporleder
If you use the stemmer in your query analysis it should act the same, right?
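
For reference, a sketch of the debug check Erick suggests below (core name hypothetical, field name from the thread); the parsedquery entry shows the stemmed term actually searched, e.g. "identif" vs "identifi":

curl 'http://localhost:8983/solr/mycore/select?q=sectionbody_t_en:identification&debug=query'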

On Thu, Apr 30, 2020 at 3:54 PM Erick Erickson  wrote:
>
> They are being stemmed to two different tokens, “identif” and “identifi”. 
> Stemming is algorithmic and imperfect and in this case you’re getting bitten 
> by that algorithm. It looks like you’re using PorterStemFilter, if you want 
> you can look up the exact algorithm, but I don’t think it’s a bug, just one 
> of those little joys of English...
>
> To get a clearer picture of exactly what’s being searched, try adding 
> =query to your query, in particular looking at the parsed query that’s 
> returned. That’ll tell you a bunch. In this particular case I don’t think 
> it’ll tell you anything more, but for future…
>
> Best,
> Erick
>
> On, and un-checking the ‘verbose’ box on the analysis page removes a lot of 
> distraction, the detailed information is often TMI ;)
>
> > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez  
> > wrote:
> >
> > Sure, rewriting the message with links for images:
> >
> >
> > We’re facing an issue with stemming in solr. Most of the cases are working 
> > correctly, for example, if we search for bidding, solr brings results for 
> > bidding, bid, bids, etc. However, with nouns ended with ‘ion’ suffix, 
> > stemming is not working. Even when analyzers seems to have correct stemming 
> > of the word, the results are not reflecting that. One example. If I search 
> > ‘identifying’, this is the output:
> >
> > Analyzer (image link):
> > https://1drv.ms/u/s!AlRTlFq8tQbShd4-Cp40Cmc0QioS0A?e=1f3GJp
> >
> > A clip of results:
> > "haschildren_b":false,
> >"isbucket_text_s":"0",
> >"sectionbody_t":"\n\n\nIn order to identify 1st price auctions, 
> > leverage the proprietary tools available or manually pull a log file report 
> > to understand the trends and gauge auction spread overtime to assess the 
> > impact of variable auction dynamics.\n\n\n\n\n\n\n",
> >"parsedupdatedby_s":"sitecorecarvaini",
> >"sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, 
> > leverage the proprietary tools available or manually pull a log file report 
> > to understand the trends and gauge auction spread overtime to assess the 
> > impact of variable auction dynamics.\n\n\n\n\n\n\n",
> >"hide_section_b":false
> >
> >
> > As you can see, it has used the stemming correctly and brings results for 
> > other words based in the root, in this case “Identify”.
> >
> > However, if I search for “Identification”, this is the output:
> >
> > Analyzer (image link):
> > https://1drv.ms/u/s!AlRTlFq8tQbShd49RpiQObzMgSjVhA
> >
> >
> > Even with proper stemming, solr is only bringing results for the word 
> > identification (or identifications) but nothing else.
> >
> > The queries are over the same field that has the Porter Stemming Filter 
> > applied for both, query and index. This behavior is consistent with other 
> > ‘ion’ ended nouns: representation, modification, etc.
> >
> > Solr Version: 8.1. Does anyone know why is it happening? Is it a bug?
> >
> > Thanks.
> >
> >
> >
> >
> >
> > -Original Message-
> >
> > From: Erick Erickson 
> >
> > Sent: jueves, 30 de abril de 2020 1:47 p. m.
> >
> > To: solr-user@lucene.apache.org
> >
> > Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'
> >
> >
> >
> >
> > The mail server is pretty aggressive about stripping links, so we can’t see 
> > the images.
> >
> >
> >
> > Could you put them somewhere and paste a link?
> >
> >
> >
> > Best,
> >
> > Erick
> >
> >
> >
> >> On Apr 30, 2020, at 2:40 PM, Jhonny Lopez  
> >> wrote:
> >
> >>
> >
> >> We’re facing an issue with stemming in solr. Most of the cases are working 
> >> correctly, for example, if we search for bidding, solr brings results for 
> >> bidding, bid, bids, etc. However, with nouns ended with ‘ion’ suffix, 
> >> stemming is not working. Even when analyzers seems to have correct 
> >> stemming of the word, the results are not reflecting that. One example. If 
> >> I search ‘identifying’, this is the output:
> >
> >>
> >
> >> Analyzer (image):
> >
> >>
> >
> >> A clip of results:
> >
> >> "haschildren_b":false,
> >
> >>"isbucket_text_s":"0",
> >
> >>"sectionbody_t":"\n\n\nIn order to identify 1st price auctions, 
> >> leverage the proprietary tools available or manually pull a log file 
> >> report to understand the trends and gauge auction spread overtime to 
> >> assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
> >
> >>"parsedupdatedby_s":"sitecorecarvaini",
> >
> >>"sectionbody_t_en":"\n\n\nIn order 

Re: Solr fields mapping

2020-04-30 Thread matthew sporleder
fl=createdByMap:concat("createdBy.userName:
",createdBy.userName,",","createdBy.name: ",createdBy.name," ...)

On Thu, Apr 30, 2020 at 3:20 PM sambasivarao giddaluri
 wrote:
>
> Hi Audrey,
>
> Yes, I am aware of copyField but it does not fit my use case. The reason is
> that when returning output we have to show each field with its own
> value; with copyField the values get combined and we no longer know the
> field-to-value relationship.
>
> regards
> sam
>
> On Wed, Apr 29, 2020 at 9:53 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > Hi, Sam!
> >
> > Have you tried creating a copyField?
> > https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/copying-fields.html
> >
> > Best,
> > Audrey
> >
> > On 4/28/20, 1:07 PM, "sambasivarao giddaluri" <
> > sambasiva.giddal...@gmail.com> wrote:
> >
> > Hi All,
> > Is there a way we can map fields into a single field?
> > Ex: the schema has the below fields
> > createdBy.userName
> > createdBy.name
> > createdBy.email
> >
> > If I have to retrieve these fields I need to pass all three fields in the
> > *fl* parameter; instead, is there a way I can have a map or an object of
> > these fields under createdBy, so that in fl I pass only createdBy and get
> > all three as output?
> >
> > Regards
> > sam
> >
> >
> >


Re: SolrCloud degraded during backup and batch CSV update

2020-04-29 Thread matthew sporleder
You can add something like this to SOLR_OPTS: -DzkClientTimeout=30000
in your init script, or adjust the zkClientTimeout setting in solr.xml
(which defaults to ${zkClientTimeout:15000}).
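
For reference, the corresponding solr.xml block looks something like this
(the 30000 is only an illustrative value, not a recommendation):

  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>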

On Wed, Apr 29, 2020 at 5:41 PM Sethuraman, Ganesh
 wrote:
>
> 3 Zookeeper ensemble are all in 3 separate boxes (EC2 instances). Each have 
> separate transactional logs directory (separate EBS volume, separate disk), 
> as this was zookeeper best practices.
>
> It feels like the ZK timeout is more a symptom and Solr slowness is the cause. 
> Having said that, do you increase the timeout setting in Solr or in Zookeeper? 
> If you can share the parameters it will certainly help.
>
> Regards
> Ganesh
>
> -----Original Message-
> From: matthew sporleder 
> Sent: Wednesday, April 29, 2020 11:47 AM
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud degraded during backup and batch CSV update
>
>
> FWIW I've had some luck with strategy 3 (increase zk timeout) when you
> overwhelm the connection to zk or the disk on zk.
>
> Is zk on the same boxes as solr?
>
> On Tue, Apr 28, 2020 at 10:15 PM Sethuraman, Ganesh
>  wrote:
> >
> > Hi
> >
> > We are using SolrCloud 7.2.1 with a 3-node Zookeeper ensemble. We have 92 
> > collections, each on avg. having 8 shards and 2 replicas, on 2 EC2 nodes, 
> > with a JVM size of 18GB (G1 GC). We need your help with an issue we faced 
> > today: the SolrCloud server went into a degraded state (for a few 
> > collections) when the Solr backup and a Solr batch CSV update load ran at 
> > the same time. The CSV data load was about ~5 GB per shard/replica. We think 
> > this happened after the zkClient disconnect noted below.  We had to restart 
> > Solr to bring it back to normal.
> >
> >
> >   1.  Is it not suggested to run a backup and a large Solr batch CSV update 
> > load at the same time?
> >   2.  In the past we have seen two CSV batch update loads in parallel cause 
> > issues; is this also not suggested? (That issue is not related to this one.)
> >   3.  Do you think we should increase the Zookeeper timeout?
> >   4.  How do we know if we need to up the JVM max memory, and by how much?
> >   5.  We also see that once Solr goes into a degraded collection state and 
> > recovery fails, it NEVER gets back to normal, even when there is no 
> > load. Is this a bug?
> >
> > The GC information and Solr Log below
> >
> > https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMjAvMDQvMjkvLS0wMl9zb2xyX2djLmxvZy56aXAtLTEtNDAtMzE=&channel=WEB
> >
> >
> > 2020-04-27 07:34:07.322 WARN  
> > (zkConnectionManagerCallback-6-thread-1-processing-n:mysolrsever.com:6010_solr-SendThread(zoo-prd-n1:2181))
> >  [   ] o.a.z.ClientCnxn Client session timed out, have not heard from 
> > server in 10775ms for sessionid 0x171a6fb51310008
> > 
> > 2020-04-27 07:34:07.426 WARN  
> > (zkConnectionManagerCallback-6-thread-1-processing-n:mysolrsever.com:6010_solr-EventThread)
> >  [   ] o.a.s.c.c.ConnectionManager zkClient has disconnected
> >
> >
> >
> >
> > SOLR Log Below (Curtailed WARN log)
> > 
> > 2020-04-27 07:26:45.402 WARN  
> > (recoveryExecutor-4-thread-697-processing-n:mysolrsever.com:6010_solr 
> > x:mycollection_shard13_replica_n48 s:shard13 c:mycollection r:core_node51) 
> > [c:mycollection s:shard13 r:core_node51 x:mycollection_shard13_replica_n48] 
> > o.a.s.h.IndexFetcher Error in fetching file: _1kr_r.liv (downloaded 0 of 
> > 587 bytes)
> > java.io.EOFException
> >   at 
> > org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:168)
> >   at 
> > org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
> >   at 
> > org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1579)
> >   at 
> > org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1545)
> >   at 
> > org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1526)
> >   at 
> > org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1008)
> >   at 
> > org.apache

Re: off-heap OOM

2020-04-29 Thread matthew sporleder
What does the message look like, exactly, from solr.log ?

On Wed, Apr 29, 2020 at 1:27 PM Raji N  wrote:
>
> Thank you for your reply.  When OOM happens, somehow it doesn't generate a
> dump file, so we have hourly heap dumps running to diagnose this issue. Heap is
> around 700MB and threads around 150, but 29GB of native memory is used up,
> consumed by java.nio.DirectByteBufferR (27GB, the major consumer) and
> java.nio.DirectByteBuffer objects.
>
> We use solr 7.6.0 in solrcloud mode and OS is alpine . Java version
>
> java -version
>
> Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
>
> java version "1.8.0_211"
>
> Java(TM) SE Runtime Environment (build 1.8.0_211-b12)
>
> Java HotSpot(TM) 64-Bit Server VM (build 25.211-b12, mixed mode)
>
>
>
> Thanks much for taking a look at it.
>
> Raji
>
>
>
> On Wed, Apr 29, 2020 at 10:04 AM Shawn Heisey  wrote:
>
> > On 4/29/2020 2:07 AM, Raji N wrote:
> > > Has anyone encountered off-heap OOM. We are thinking of reducing heap
> > > further and increasing the hardcommit interval . Any other suggestions? .
> > > Please share your thoughts.
> >
> > It sounds like it's not heap memory that's running out.
> >
> > When the OutOfMemoryError is logged, it will also contain a message
> > mentioning which resource ran out.
> >
> > A common message that might be logged with the OOME is "Unable to create
> > native thread".  This type of error, if that's what's happening,
> > actually has nothing at all to do with memory, OOME is just how Java
> > happens to report it.
> >
> > You will need to know exactly which resource is running out before we
> > can offer any assistance.
> >
> > If the OOME is logged, the message you're looking for will be in the
> > solr log, not the tiny special log that is created when Solr is killed
> > by an OOME.  What version of Solr are you running, and what OS is it
> > running on?
> >
> > Thanks,
> > Shawn
> >


Re: SolrCloud degraded during backup and batch CSV update

2020-04-29 Thread matthew sporleder
FWIW I've had some luck with strategy 3 (increase zk timeout) when you
overwhelm the connection to zk or the disk on zk.

Is zk on the same boxes as solr?

On Tue, Apr 28, 2020 at 10:15 PM Sethuraman, Ganesh
 wrote:
>
> Hi
>
> We are using SolrCloud 7.2.1 with a 3-node Zookeeper ensemble. We have 92 
> collections, each on avg. having 8 shards and 2 replicas, on 2 EC2 nodes, with 
> a JVM size of 18GB (G1 GC). We need your help with an issue we faced today: 
> the SolrCloud server went into a degraded state (for a few collections) when 
> the Solr backup and a Solr batch CSV update load ran at the same time. The 
> CSV data load was about ~5 GB per shard/replica. We think this happened after 
> the zkClient disconnect noted below.  We had to restart Solr to bring it back 
> to normal.
>
>
>   1.  Is it not suggested to run a backup and a large Solr batch CSV update 
> load at the same time?
>   2.  In the past we have seen two CSV batch update loads in parallel cause 
> issues; is this also not suggested? (That issue is not related to this one.)
>   3.  Do you think we should increase the Zookeeper timeout?
>   4.  How do we know if we need to up the JVM max memory, and by how much?
>   5.  We also see that once Solr goes into a degraded collection state and 
> recovery fails, it NEVER gets back to normal, even when there is no 
> load. Is this a bug?
>
> The GC information and Solr Log below
>
> https://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMjAvMDQvMjkvLS0wMl9zb2xyX2djLmxvZy56aXAtLTEtNDAtMzE=&channel=WEB
>
>
> 2020-04-27 07:34:07.322 WARN  
> (zkConnectionManagerCallback-6-thread-1-processing-n:mysolrsever.com:6010_solr-SendThread(zoo-prd-n1:2181))
>  [   ] o.a.z.ClientCnxn Client session timed out, have not heard from server 
> in 10775ms for sessionid 0x171a6fb51310008
> 
> 2020-04-27 07:34:07.426 WARN  
> (zkConnectionManagerCallback-6-thread-1-processing-n:mysolrsever.com:6010_solr-EventThread)
>  [   ] o.a.s.c.c.ConnectionManager zkClient has disconnected
>
>
>
>
> SOLR Log Below (Curtailed WARN log)
> 
> 2020-04-27 07:26:45.402 WARN  
> (recoveryExecutor-4-thread-697-processing-n:mysolrsever.com:6010_solr 
> x:mycollection_shard13_replica_n48 s:shard13 c:mycollection r:core_node51) 
> [c:mycollection s:shard13 r:core_node51 x:mycollection_shard13_replica_n48] 
> o.a.s.h.IndexFetcher Error in fetching file: _1kr_r.liv (downloaded 0 of 587 
> bytes)
> java.io.EOFException
>   at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:168)
>   at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:160)
>   at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1579)
>   at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1545)
>   at 
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1526)
>   at 
> org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1008)
>   at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:566)
>   at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:345)
>   at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:420)
>   at 
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:225)
>   at 
> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:626)
>   at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
>   at 
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:292)
>   at 
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:188)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-04-27 07:26:45.405 WARN  
> (recoveryExecutor-4-thread-697-processing-n:mysolrsever.com:6010_solr 
> x:mycollection_shard13_replica_n48 s:shard13 c:mycollection r:core_node51) 
> [c:mycollection s:shard13 r:core_node51 x:mycollection_shard13_replica_n48] 
> o.a.s.h.IndexFetcher Error in fetching file: _1kr_r.liv (downloaded 0 of 587 
> bytes)
> java.io.EOFException
>   at 
> org.apache.solr.common.util.FastInputStream.readFully(FastInputStream.java:168)
>   at 
> 

Re: Solr Ref Guide Redesign coming in 8.6

2020-04-28 Thread matthew sporleder
I highly recommend a version selector in the header!  I am *always*
landing on 6.x docs from google.

On Tue, Apr 28, 2020 at 5:18 PM Cassandra Targett  wrote:
>
> In case the list breaks the URL to view the Jenkins build, here's a shorter
> URL:
>
> https://s.apache.org/df7ew.
>
> On Tue, Apr 28, 2020 at 3:12 PM Cassandra Targett 
> wrote:
>
> > The PMC would like to engage the Solr user community for feedback on an
> > extensive redesign of the Solr Reference Guide I've just committed to the
> > master (future 9.0) branch.
> >
> > You can see the new design from our Jenkins build of master:
> >
> > https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-master/javadoc/
> >
> > The hope is that you will receive these changes positively. If so, we'll
> > use this for the upcoming 8.6 Ref Guide and future releases. We also may
> > re-publish earlier 8.x versions so they use this design.
> >
> > I embarked on this project last December simply as an attempt to upgrade
> > the version of Bootstrap used by the Guide. After a couple of days, I'd
> > changed the layout entirely. In the ensuing few months I've tried to iron
> > out the kinks and made some extensive changes to the "backend" (the CSS,
> > JavaScript, etc.).
> >
> > I'm no graphic designer, but some of my guiding thoughts were to try to
> > make full use of the browser window, improve responsiveness for different
> > sized screens, and just give it a more modern feel. The full list of what
> > has changed is detailed in the Jira issue if you are interested:
> > https://issues.apache.org/jira/browse/SOLR-14173
> >
> > This is Phase 1 of several changes. There is one glaring remaining issue,
> > which is that our list of top-level categories is too long for the new
> > design. I've punted fixing that to Phase 2, which will be an extensive
> > re-consideration of how the Ref Guide is organized with the goal of
> > trimming down the top-level categories to only 4-6. SOLR-1 will track
> > phase 2.
> >
> > One last thing to note: this redesign really only changes the presentation
> > of the pages and some of the framework under the hood - it doesn't yet add
> > full-text search. All of the obstacles to providing search still exist, but
> > please know that we fully understand frustration on this point and still
> > hope to fix it.
> >
> > I look forward to hearing your feedback in this thread.
> >
> > Best,
> > Cassandra
> >


Re: Which Solr metrics do you find important?

2020-04-28 Thread matthew sporleder
I think clusterstatus is how you find some of that stuff.

I wrote this when I was using datadog to supplement what they offered:
https://github.com/msporleder/dd-solrcloud/blob/master/solrcloud.py
(sorry for crappy python) and it got me most of the monitoring I
needed for my particular situation.
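
The two Collections API calls worth knowing here (host/port hypothetical):

  http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json
  http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json

CLUSTERSTATUS gives shard/replica state plus aliases in one shot;
OVERSEERSTATUS covers the overseer/election side.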




On Tue, Apr 28, 2020 at 10:52 AM Radu Gheorghe
 wrote:
>
> Thanks a lot, Matthew! OK, so you do care about the size of tlogs. As well
> as Collections API stuff (clusterstatus, overseerstatus).
>
> And DIH, I didn't think that these stats would be interesting, but surely
> they are for people who use DIH :)
>
> Best regards,
> Radu
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
> On Tue, Apr 28, 2020 at 4:17 PM matthew sporleder 
> wrote:
>
> > size-on-disk of cores, size of tlogs, DIH stats over time, last
> > modified date of cores
> >
> > The most important alert-type things are -- collections in recovery or
> > down state, solrcloud election events, various error rates
> >
> > It's also important to be able to tie these back to aliases so you are
> > only monitoring cores you care about, even if their backing collection
> > name changes every so often
> >
> >
> >
> > On Tue, Apr 28, 2020 at 7:57 AM Radu Gheorghe
> >  wrote:
> > >
> > > Hi fellow Solr users,
> > >
> > > I'm looking into improving our Solr monitoring
> > > <https://sematext.com/docs/integration/solr/> and I was curious on which
> > > metrics you consider relevant.
> > >
> > > From what we currently have, I'm only really missing fieldCache. Which we
> > > collect, but not show in the UI yet (unless you add a custom chart -
> > we'll
> > > add it to default soon).
> > >
> > > You can click on a demo account <https://apps.sematext.com/demo>
> > (there's a
> > > Solr app there called PH.Prod.Solr7) to see what we already collect, but
> > > I'll write it here in short:
> > > - query rate and latency (you can group per handler, per core, per
> > > collection if it's SolrCloud)
> > > - index size (number of segments, files...)
> > > - indexing: added/deleted docs, commits
> > > - caches (size, hit ratio, warmup...)
> > > - OS- and JVM-level metrics (from CPU iowait to GC latency and everything
> > > in between)
> > >
> > > Anything that we should add?
> > >
> > > I went through the Metrics API output, and the only significant thing I
> > can
> > > think of is the transaction log. But to be honest I never checked those
> > > metrics in practice.
> > >
> > > Or maybe there's something outside the Metrics API that would be useful?
> > I
> > > thought about the breakdown of shards that are up/down/recovering... as
> > > well as replica types. We plan on adding those, but there's a challenge
> > in
> > > de-duplicating metrics. Because one would install one agent per node, and
> > > I'm not aware of a way to show only local shards in the Collections API
> > ->
> > > CLUSTERSTATUS.
> > >
> > > Thanks in advance for any feedback that you may have!
> > > Radu
> > > --
> > > Monitoring - Log Management - Alerting - Anomaly Detection
> > > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >


Re: Which Solr metrics do you find important?

2020-04-28 Thread matthew sporleder
size-on-disk of cores, size of tlogs, DIH stats over time, last
modified date of cores

The most important alert-type things are -- collections in recovery or
down state, solrcloud election events, various error rates

It's also important to be able to tie these back to aliases so you are
only monitoring cores you care about, even if their backing collection
name changes every so often
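
If it helps, the alias-to-collection mapping is also available directly, in
addition to riding along in CLUSTERSTATUS (host hypothetical):

  http://localhost:8983/solr/admin/collections?action=LISTALIASES&wt=json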



On Tue, Apr 28, 2020 at 7:57 AM Radu Gheorghe
 wrote:
>
> Hi fellow Solr users,
>
> I'm looking into improving our Solr monitoring
>  and I was curious on which
> metrics you consider relevant.
>
> From what we currently have, I'm only really missing fieldCache. Which we
> collect, but not show in the UI yet (unless you add a custom chart - we'll
> add it to default soon).
>
> You can click on a demo account  (there's a
> Solr app there called PH.Prod.Solr7) to see what we already collect, but
> I'll write it here in short:
> - query rate and latency (you can group per handler, per core, per
> collection if it's SolrCloud)
> - index size (number of segments, files...)
> - indexing: added/deleted docs, commits
> - caches (size, hit ratio, warmup...)
> - OS- and JVM-level metrics (from CPU iowait to GC latency and everything
> in between)
>
> Anything that we should add?
>
> I went through the Metrics API output, and the only significant thing I can
> think of is the transaction log. But to be honest I never checked those
> metrics in practice.
>
> Or maybe there's something outside the Metrics API that would be useful? I
> thought about the breakdown of shards that are up/down/recovering... as
> well as replica types. We plan on adding those, but there's a challenge in
> de-duplicating metrics. Because one would install one agent per node, and
> I'm not aware of a way to show only local shards in the Collections API ->
> CLUSTERSTATUS.
>
> Thanks in advance for any feedback that you may have!
> Radu
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/


Re: stored=true what should I see from stem fields

2020-04-25 Thread matthew sporleder
I was just doing that
to troubleshoot/discover.  I knew that you couldn't copy-to-copy but,
apparently, needed to be reminded.

My end goal (which I don't think I can achieve?) was to get my
everything field to contain something like:
everything: [ 'the quick brown fox jumped over the sleeping dog',
'quick brown fox jump over sleep dog']

So that a single/simple query would match that doc for q=dog or q=jump
or q=sleeping and would score extra high for "the dog jump", but I
guess I will need to change the query logic to search on both fields.

On Sat, Apr 25, 2020 at 8:16 AM Erick Erickson  wrote:
>
> One other bit:
>
> There’s rarely a reason to set stored=true for the _destination_ of a
> copyField, and there are multiple reasons _not_ to; set it for the source field.
>
> If you need to retrieve the original, just specify the source field in the fl 
> list.
>
> Best,
> Erick
>
> > On Apr 24, 2020, at 8:42 PM, Chris Hostetter  
> > wrote:
> >
> >
> > : Is what is shown in "analysis" the same as what is stored in a field?
> >
> > https://lucene.apache.org/solr/guide/8_5/analyzers.html
> >
> > The output of an Analyzer affects the terms indexed in a given field (and
> > the terms used when parsing queries against those fields) but it has no
> > impact on the stored value for the fields. For example: an analyzer might
> > split "Brown Cow" into two indexed terms "brown" and "cow", but the stored
> > value will still be a single String: "Brown Cow"
> >
> >
> > : So I indexed a document with "the quick brown fox jumped over the
> > : sleeping dog" set for stuff_raw and when I query for the document
> > : stuff_stems just has "the quick brown fox jumped over the sleeping
> > : dog" and NOT "quick brown fox jump over sleep dog"
> >
> >
> > https://lucene.apache.org/solr/guide/8_5/copying-fields.html
> >
> > Fields are copied before analysis is done, meaning you can have two
> > fields with identical original content, but which use different analysis
> > chains and are stored in the index differently.
> >
> >
> >
> > : Also stuff_everything only contains a single item, which is weird
> > : because I copy two things into it.
> >
> > https://lucene.apache.org/solr/guide/8_5/copying-fields.html
> >
> > Copying is done at the stream source level and no copy feeds into another
> > copy. This means that copy fields cannot be chained i.e., you cannot copy
> > from here to there and then from there to elsewhere. However, the same
> > source field can be copied to multiple destination fields:
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
>


stored=true what should I see from stem fields

2020-04-24 Thread matthew sporleder
Is what is shown in "analysis" the same as what is stored in a field?

I am confusing myself pretty thoroughly:

I have some fields:
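(The archive stripped the XML here; judging from the analysis output quoted
below -- "the" dropped, "jumped" -> "jump" -- it was presumably something
along these lines, exact classes and attributes being my guess:)

  <fieldType name="text_stems" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="stuff_raw" type="string" indexed="true" stored="true"/>
  <field name="stuff_stems" type="text_stems" indexed="true" stored="true"/>
  <field name="stuff_everything" type="text_stems" indexed="true" stored="true" multiValued="true"/>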
  
 
  
 


 
   
   
   
   
   
   
 

  

  
  




 


And I have this:
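(presumably three copyFields, something like:)

  <copyField source="stuff_raw" dest="stuff_stems"/>
  <copyField source="stuff_raw" dest="stuff_everything"/>
  <copyField source="stuff_stems" dest="stuff_everything"/>

which matters below: the stems -> everything copy never fires, since copies
read the original source value and don't chain.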
 
 
 


I run this through the analyzer for stuff_stems:
"the quick brown fox jumped over the sleeping dog"

It prints out a bunch of stuff but the last thing it says is:
"quick brown fox jump over sleep dog"

So far so good.

So I indexed a document with "the quick brown fox jumped over the
sleeping dog" set for stuff_raw and when I query for the document
stuff_stems just has "the quick brown fox jumped over the sleeping
dog" and NOT "quick brown fox jump over sleep dog"

Also stuff_everything only contains a single item, which is weird
because I copy two things into it.

In fact here is everything:

{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":0,
"params":{
  "q":"*:*",
  "wt":"json"}},
  "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
  {
"id":1,
"stuff_raw":"the quick brown fox jumped over the sleeping dog",
"stuff_stems":"the quick brown fox jumped over the sleeping dog",
"stuff_everything":["the quick brown fox jumped over the sleeping dog"],
"_version_":1664899022194737152,
"timestamp":"2020-04-24T23:37:16.877Z",
"score":1.0},
  {
"id":2,
"stuff_raw":"jumped jumping jumper",
"stuff_stems":"jumped jumping jumper",
"stuff_everything":["jumped jumping jumper"],
"_version_":1664899046865633280,
"timestamp":"2020-04-24T23:37:40.404Z",
"score":1.0}]
  }}


Re: How to update dataImportHandler config in solr version 5.3

2020-04-24 Thread matthew sporleder
Are you 100% sure it is using solrcloud and that the config is not
simply on the disk?
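
If it *is* real SolrCloud, the whole config set (db-data-config.xml included)
lives under /configs/<confname> in ZooKeeper, and the 5.x round trip is
roughly this (paths/names hypothetical):

  server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd downconfig -confdir /tmp/myconf -confname myconf
  # edit /tmp/myconf/db-data-config.xml, then:
  server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig -confdir /tmp/myconf -confname myconf

followed by a collection RELOAD. If ZooKeeper only holds the dataimport
.properties files, the config is coming off local disk and you can just edit
it there and reload the core.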

On Fri, Apr 24, 2020 at 7:11 AM Lewin Joy (TMNA)  wrote:
>
> Hi,
>
> We have an old collection running on a very old solr version. 5.3
> Now, we have a need to update the url string inside db-data-config.xml for 
> the DataImportHandler.
>
> Now, I see that this version does not support downconfig and upconfig as well 
> as current versions do.
> I was able to downconfig using zkcli.sh scripts. But, I notice that zookeeper 
> is not storing all the collection config.
> It was just storing 2 properties files that stored last index times.
> So, downconfig was not useful. If I take all files individually, I could 
> create the structure for the whole collection config.
> But, since currently zookeeper is not storing these configs, would upconfig 
> even work if I do this?
>
> So, my question is:
> How can I just update the url string in db-data-config.xml in Solr version 
> 5.3.2?
> Does anyone remember? Any pointers?
>
> Thanks,
> Lewin


Re: solr as a general search engine

2020-04-21 Thread matthew sporleder
Sorry for the vague question and I appreciate the book recommendations
-- I actually think I am mostly confused about suggest vs spellcheck
vs morelikethis as they relate to what I referred to as "expected"
behavior (like from a typed-in search bar).

For reference we have been using solr as search in some form for
almost 10 years and it's always been great in finding things based on
clear keywords, programmatic-type discovery, a nosql/distributed k:v
(actually really really good at this) but has always fallen short
(imho and also our fault, obviously) in the "typed in a search query"
experience.

We are in the midst of re-developing our internal content ranking
system and it has me grasping at how to *really* elevate our game in
terms of giving an excellent human-driven discovery experience vs our current
behavior of: "here is everything we have that contains those words,
minus ones I took out".





On Tue, Apr 21, 2020 at 5:35 AM Charlie Hull  wrote:
>
> Hi Matt,
>
> Are you looking for a good, general purpose schema and config for Solr?
> Well, there's the problem: you need to define what you mean by general
> purpose. Every search application will have its own requirements and
> they'll be slightly different to every other application. Yes, there
> will be some commonalities too. I guess by "as a human might expect one
> to behave" you mean "a bit like how Google works" but unfortunately
> Google is a poor example: you won't have Google's money or staff or
> platform in your company, nor are you likely to be building a
> massive-scale web search engine, so at best you can just take
> inspiration from it, not replicate it.
>
> In practice, what a lot of people do is start with an example setup
> (perhaps from one of the examples supplied with Solr, e.g.
> 'techproducts') and adapt it: or they might start with the Solr
> configset provided by another framework, e.g. Drupal (yay! Pink
> Ponies!). Unfortunately the standard example configsets are littered
> with comments that say things like 'Here is how you *could* do XYZ but
> please don't actually attempt it this way' and other config sections
> that if you un-comment them may just get you into further trouble. It's
> grown rather than been built, and to my mind there's a good argument for
> starting with an absolutely minimal Solr configset and only adding
> things in as you need them and understand them (see
> https://lucene.472066.n3.nabble.com/minimal-solrconfig-example-td4322977.html
> for some background and a great presentation from Alex Rafalovitch on
> the examples).
>
> You're also going to need some background on *why* all these features
> should be used, and for that I'd recommend my colleague Doug's book
> Relevant Search https://www.manning.com/books/relevant-search - or maybe
> our training (quick plug: we're running some online training in a couple
> of weeks
> https://opensourceconnections.com/blog/2020/05/05/tlre-solr-remote/ )
>
> Hope this helps,
>
> Cheers
>
> Charlie
>
> On 20/04/2020 23:43, matthew sporleder wrote:
> > Is there a comprehensive/big set of tips for making solr into a
> > search-engine as a human would expect one to behave?  I poked around
> > in the nutch github for a minute and found this:
> > https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml
> >   but I was wondering if I was missing a very obvious document
> > somewhere.
> >
> > I guess I'm looking for things like:
> > use suggester here, use spelling there, use DocValues around here, DIY
> > pagerank, etc
> >
> > Thanks,
> > Matt
>
>
> --
> Charlie Hull
> OpenSource Connections, previously Flax
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.o19s.com
>


solr as a general search engine

2020-04-20 Thread matthew sporleder
Is there a comprehensive/big set of tips for making solr into a
search-engine as a human would expect one to behave?  I poked around
in the nutch github for a minute and found this:
https://github.com/apache/nutch/blob/9e5ae7366f7dd51eaa76e77bee6eb69f812bd29b/src/plugin/indexer-solr/schema.xml
 but I was wondering if I was missing a very obvious document
somewhere.

I guess I'm looking for things like:
use suggester here, use spelling there, use DocValues around here, DIY
pagerank, etc

Thanks,
Matt


Re: entity in DIH for partial update?

2020-04-10 Thread matthew sporleder
Do you mean something along the lines of this (hackish?)
https://stackoverflow.com/questions/21006045/can-solr-dih-do-atomic-updates
method?
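
For the jsonl-plus-atomic-processor route Jörn mentions below, a sketch
(Solr 7+; collection and field names hypothetical):

  curl 'http://localhost:8983/solr/mycoll/update/json/docs?processor=atomic&atomic.other_field=set&commit=true' \
    -H 'Content-Type: application/json' \
    -d '{"id":"42","other_field":"value from the second database"}'

Each exported line becomes a set on that one field, keyed by uniqueKey,
without touching the rest of the doc.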

On Fri, Apr 10, 2020 at 10:19 AM Jörn Franke  wrote:
>
> You could use atomic updates in DIH. However, there is a bug in the current 
> (and potentially also older) Solr versions where this leaks a searcher (which 
> means the index data grows indefinitely until you restart the server).
> You can also export from the database to JSON Lines and post it to the json 
> update handler together with the atomic processor.
>
> > Am 10.04.2020 um 16:02 schrieb matthew sporleder :
> >
> > I have an field I would like to add to my schema which is stored in a
> > different database from my primary data.  Can I use a separate entity
> > in my DIH to update a single field of my documents?
> >
> > Thanks,
> > Matt


entity in DIH for partial update?

2020-04-10 Thread matthew sporleder
I have an field I would like to add to my schema which is stored in a
different database from my primary data.  Can I use a separate entity
in my DIH to update a single field of my documents?

Thanks,
Matt


spelling dictionaries

2020-04-03 Thread matthew sporleder
Does anyone have good sources for word dictionaries to use for the
spell checker?

Thanks,
Matt


Re: Solr Instance Migration - Server Access

2020-03-26 Thread matthew sporleder
If it's solrcloud + zookeeper you can get most of the configs from the
"tree" browser on the console: /solr/#/~cloud?view=tree

You can otherwise derive a lot of the configs/schema/data-import
properties from the web console and api, neither of which require
server access.
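
If the Solr/ZooKeeper ports are reachable at all, a local Solr download (7+
syntax; names hypothetical) can also pull a whole config set down without any
server login:

  bin/solr zk cp -r zk:/configs/myconf /tmp/myconf -z their-zk-host:2181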

It is also possible to get into servers where you do not have the
passwords assuming you have physical access/cloud console access/etc
but that is not a solr question.

On Thu, Mar 26, 2020 at 3:24 AM Landon Cowan  wrote:
>
> Hello!  I’m working on a website for a client that was migrated from another 
> website development company.  The previous company used Solr to build out the 
> site search – but they did not send us the server credentials.  The 
> developers who built the tool are no longer with the company – is there a 
> process we should follow to secure the credentials?  I worry we may need to 
> rebuild the feature from the ground up.
>
>


Re: edge ngram/find as you type sorting

2020-03-26 Thread matthew sporleder
That explains the OOM's I've been getting in the initial test cycle.
I'm working with about 50M (small) documents.

On Thu, Mar 26, 2020 at 7:58 AM Erick Erickson  wrote:
>
> the ngramming is a time/space tradeoff. Typically,
> if you restrict the wildcards to have three or more
> “real” characters performance is fine. One real
> character (i.e. a*) will be your worst-case. I’ve
> seen requiring two characters in the prefix work well
> too. It Depends (tm).
>
> Conceptually what happens here is that Lucene has
> to enumerate all of the terms that start with the prefix
> and create a ginormous OR clause. The term
> enumeration will take longer the more terms there are.
> Things are more efficient than that, but still...
>
> So make sure you’re testing with a real corpus. Having
> a test index with just a few terms will be misleading.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 9:37 PM, matthew sporleder  wrote:
> >
> > Okay confirmed-
> > I am getting a more predictable results set after adding an additional 
> > field:
> >   > sortMissingLast="true" omitNorms="true">
> > 
> >  
> >  
> >   > pattern="\p{Punct}" replacement=""/>
> > 
> >  
> >
> > q=slug:what_is_lo*&fl=slug&rows=1000&wt=csv&sort=slug_alpha%20asc
> >
> > So it appears I can skip edge ngram entirely using this method as
> > slug:foo* appears to be the exact same results as fayt:foo, but I have
> > the cost of the alphaOnly field :)
> >
> > I will try to figure out some benchmarks or something to decide how to go.
> >
> > Thanks again for the help so far.
> >
> >
> > On Wed, Mar 25, 2020 at 2:39 PM Erick Erickson  
> > wrote:
> >>
> >> You’re getting the correct sorted order… The underscore character is 
> >> confusing you.
> >>
> >> It’s ascii code for underscore is %5f which sorts before any letter, 
> >> uppercase or lowercase.
> >>
> >> See the alphaOnlySort type for a way to remove this, although the output 
> >> there can also
> >> be confusing.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 25, 2020, at 1:30 PM, matthew sporleder  
> >>> wrote:
> >>>
> >>> What_is_Lov_Holtz_known_for
> >>> What_is_lova_after_it_harddens
> >>> What_is_Lova_Moor's_birthday
> >>> What_is_lovable_in_Spanish
> >>> What_is_lovage
> >>> What_is_Lovagny's_population
> >>> What_is_lovan_for
> >>> What_is_lovanox
> >>> What_is_lovarstan_for
> >>> What_is_Lovasatin
> >>
>


Re: edge ngram/find as you type sorting

2020-03-25 Thread matthew sporleder
Okay confirmed-
I am getting a more predictable results set after adding an additional field:
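(The XML got stripped again; from the fragments that survive where Erick
quotes this message in the reply above -- sortMissingLast/omitNorms, the
\p{Punct} PatternReplaceFilter -- the new field was presumably along these
lines:)

  <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="\p{Punct}" replacement=""/>
    </analyzer>
  </fieldType>

  <field name="slug_alpha" type="alphaOnlySort" indexed="true" stored="false"/>
  <copyField source="slug" dest="slug_alpha"/>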
  
 
  
  
  
 
  

q=slug:what_is_lo*&fl=slug&rows=1000&wt=csv&sort=slug_alpha%20asc

So it appears I can skip edge ngram entirely using this method as
slug:foo* appears to be the exact same results as fayt:foo, but I have
the cost of the alphaOnly field :)

I will try to figure out some benchmarks or something to decide how to go.

Thanks again for the help so far.


On Wed, Mar 25, 2020 at 2:39 PM Erick Erickson  wrote:
>
> You’re getting the correct sorted order… The underscore character is 
> confusing you.
>
> It’s ascii code for underscore is %5f which sorts before any letter, 
> uppercase or lowercase.
>
> See the alphaOnlySort type for a way to remove this, although the output 
> there can also
> be confusing.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 1:30 PM, matthew sporleder  wrote:
> >
> > What_is_Lov_Holtz_known_for
> > What_is_lova_after_it_harddens
> > What_is_Lova_Moor's_birthday
> > What_is_lovable_in_Spanish
> > What_is_lovage
> > What_is_Lovagny's_population
> > What_is_lovan_for
> > What_is_lovanox
> > What_is_lovarstan_for
> > What_is_Lovasatin
>


Re: edge ngram/find as you type sorting

2020-03-25 Thread matthew sporleder
Okay.  I am getting pretty much a random order of documents containing
the prefix.

Does my "string_ci" defined below count as
"keywordtokenizer+lowecasefilter"?  (assumption)
Does my "fayt" copy field below look right? (assumption)

I have a bunch of web pages indexed with "slug" fields with the prefix
"what_is_lov"
so I search:
select?q=fayt:what_is_lov&fl=slug&rows=1000&sort=slug%20asc&wt=csv

and get:
slug
What_is_Lov_Holtz_known_for
What_is_lova_after_it_harddens
What_is_Lova_Moor's_birthday
What_is_lovable_in_Spanish
What_is_lovage
What_is_Lovagny's_population
What_is_lovan_for
What_is_lovanox
What_is_lovarstan_for
What_is_Lovasatin



On Wed, Mar 25, 2020 at 1:15 PM Erick Erickson  wrote:
>
> What _is_ happening? Please provide examples of the inputs
> and outputs that don’t work for you. ‘cause
> the sort order should be “nothing comes before something"
> so sorting ascending on a keywordtokenizer+lowercasefilter
> should give you exactly what you’re asking for with no
> need for a length field.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 11:07 AM, matthew sporleder  
> > wrote:
> >
> > My original goal was to avoid indexing the string length because I
> > wanted edge ngram to "score" based on how "exact" the match was:
> >
> > q=abc
> > "abc" has a high score
> > "abcd" has a lower score
> > "abcde" has an even lower score
> >
> > You say sorting by the original field will do that but in practice
> > it is not happening so I am probably missing something.
> >
> > I *am* getting a close version of what I said above with sorting on
> > the length, which I added to the index.
> >
> > searching for my keyword-lowercase field:abc* + sorting by length is
> > also working so maybe I can skip the edge ngram field entirely and
> > just do that but I was hoping the trade some disk space for
> > performance.  This field will get queried a lot.
> >
> >
> > On Wed, Mar 25, 2020 at 10:39 AM Erick Erickson  
> > wrote:
> >>
> >> Why do you want to deal with score at all? Sorting
> >> overrides score-based sorting. Well, unless you
> >> specify score as a secondary sort. But since you’re
> >> sorting by length anyway, trying to score
> >> based on proximity to the end does nothing.
> >>
> >> The weirdness you’re going to get here, though, is
> >> that the order of the results will not be alphabetical.
> >> Say you have two docs, one with abcd and one with
> >> abce. Now say you search on abc. Whether abcd or
> >> abce comes first is indeterminant.
> >>
> >> If you simply stored the keyword-lowercased value
> >> in a copyfield and sorted on _that_, you wouldn’t have
> >> this problem. But if you’re really worried about space,
> >> that might not be an option.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 25, 2020, at 9:49 AM, matthew sporleder  
> >>> wrote:
> >>>
> >>> Where I landed:
> >>>
> >>>  >>> sortMissingLast="true" omitNorms="false">
> >>>
> >>> 
> >>> 
> >>>
> >>> 
> >>>
> >>>  >>> positionIncrementGap="100">
> >>> 
> >>>  
> >>>   >>> maxGramSize="25" />
> >>>  
> >>> 
> >>> 
> >>>  
> >>>  
> >>> 
> >>> 
> >>>
> >>>
> >>>  >>> multiValued="false" />
> >>>  >>> omitNorms="false" omitTermFreqAndPositions="false" multiValued="true"
> >>> />
> >>>  >>> multiValued="false" />
> >>>
> >>> ---
> >>>
> >>> I can then do a search for
> >>>
> >>> q=fayt:my_article_slu&sort=qt_len asc
> >>>
> >>> to get the shortest/most exact find-as-you-type match.  I couldn't get
> >>> around all results having the same score (can I boost proximity to the
> >>> end of a string?) in the edge ngram search but I am hoping this is the
> >>> fastest way to do this type of search since I can avoid wildcards
> >>> "my_article_slu*" and stuff.
> >>>
> >>> More suggestions welcome and thanks for the help.  I will re-index
> >>> with omitNorms=true again to see if I can sa

Re: edge ngram/find as you type sorting

2020-03-25 Thread matthew sporleder
My original goal was to avoid indexing the string length because I
wanted edge ngram to "score" based on how "exact" the match was:

q=abc
"abc" has a high score
"abcd" has a lower score
"abcde" has an even lower score

You say sorting by the original field will do that but in practice
it is not happening so I am probably missing something.

I *am* getting a close version of what I said above with sorting on
the length, which I added to the index.

searching for my keyword-lowercase field:abc* + sorting by length is
also working so maybe I can skip the edge ngram field entirely and
just do that but I was hoping the trade some disk space for
performance.  This field will get queried a lot.


On Wed, Mar 25, 2020 at 10:39 AM Erick Erickson  wrote:
>
> Why do you want to deal with score at all? Sorting
> overrides score-based sorting. Well, unless you
> specify score as a secondary sort. But since you’re
> sorting by length anyway, trying to score
> based on proximity to the end does nothing.
>
> The weirdness you’re going to get here, though, is
> that the order of the results will not be alphabetical.
> Say you have two docs, one with abcd and one with
> abce. Now say you search on abc. Whether abcd or
> abce comes first is indeterminant.
>
> If you simply stored the keyword-lowercased value
> in a copyfield and sorted on _that_, you wouldn’t have
> this problem. But if you’re really worried about space,
> that might not be an option.
>
> Best,
> Erick
>
> > On Mar 25, 2020, at 9:49 AM, matthew sporleder  wrote:
> >
> > Where I landed:
> >
> >   > sortMissingLast="true" omitNorms="false">
> > 
> >  
> >  
> > 
> >  
> >
> >  > positionIncrementGap="100">
> > 
> >   
> >> maxGramSize="25" />
> >   
> > 
> > 
> >   
> >   
> > 
> > 
> >
> >
> >   > multiValued="false" />
> >   > omitNorms="false" omitTermFreqAndPositions="false" multiValued="true"
> > />
> >   > multiValued="false" />
> >
> > ---
> >
> > I can then do a search for
> >
> > q=fayt:my_article_slu&sort=qt_len asc
> >
> > to get the shortest/most exact find-as-you-type match.  I couldn't get
> > around all results having the same score (can I boost proximity to the
> > end of a string?) in the edge ngram search but I am hoping this is the
> > fastest way to do this type of search since I can avoid wildcards
> > "my_article_slu*" and stuff.
> >
> > More suggestions welcome and thanks for the help.  I will re-index
> > with omitNorms=true again to see if I can save a little space.
> >
> >
> >
> >
> >
> > On Tue, Mar 24, 2020 at 11:39 AM matthew sporleder  
> > wrote:
> >>
> >> Okay I appreciate you responding.
> >>
> >> Switching "slug" from "string_ci" class="solr.StrField" accomplished
> >> about the same results, which makes sense to me now :)
> >>
> >> The previous definition of string_ci was:
> >>   >> sortMissingLast="true" omitNorms="true">
> >> 
> >>  
> >>  
> >> 
> >>  
> >>
> >> So lowercase + KeywordTokenizerFactory;
> >>
> >> I am trying again with omitNorms=false  to see if I can get the more
> >> "exact" matches to score better this time around.
> >>
> >>
> >> On Tue, Mar 24, 2020 at 9:54 AM Erick Erickson  
> >> wrote:
> >>>
> >>> Won’t work. String types are totally unanalyzed. Your string_ci fieldType 
> >>> is what I was looking for.
> >>>
> >>> No, you shouldn’t kill the lowercasefilter unless you want all of your 
> >>> searches to then be case-sensitive.
> >>>
> >>> So you should try:
> >>>
> >>> q=edgy_text:whatever&sort=string_ci asc
> >>>
> >>> Please use the admin>>pick_core>>analysis page when thinking about 
> >>> changing your schema, it’ll answer a _lot_ of these questions immediately.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>>> On Mar 24, 2020, at 8:37 AM, matthew sporleder  
> >>>> wrote:
> >>>>
> >>>> Oh maybe a schema bug!
> >>>>
> >>>> my string_ci:
&

Re: edge ngram/find as you type sorting

2020-03-25 Thread matthew sporleder
Where I landed:

  
 
  
  
 
  


 
   
   
   
 
 
   
   
 



  
  
  

---

I can then do a search for

q=fayt:my_article_slu&sort=qt_len asc

to get the shortest/most exact find-as-you-type match.  I couldn't get
around all results having the same score (can I boost proximity to the
end of a string?) in the edge ngram search but I am hoping this is the
fastest way to do this type of search since I can avoid wildcards
"my_article_slu*" and stuff.

More suggestions welcome and thanks for the help.  I will re-index
with omitNorms=true again to see if I can save a little space.





On Tue, Mar 24, 2020 at 11:39 AM matthew sporleder  wrote:
>
> Okay I appreciate you responding.
>
> Switching "slug" from "string_ci" class="solr.StrField" accomplished
> about the same results, which makes sense to me now :)
>
> The previous definition of string_ci was:
>sortMissingLast="true" omitNorms="true">
>  
>   
>   
>  
>   
>
> So lowercase + KeywordTokenizerFactory;
>
> I am trying again with omitNorms=false  to see if I can get the more
> "exact" matches to score better this time around.
>
>
> On Tue, Mar 24, 2020 at 9:54 AM Erick Erickson  
> wrote:
> >
> > Won’t work. String types are totally unanalyzed. Your string_ci fieldType 
> > is what I was looking for.
> >
> > No, you shouldn’t kill the lowercasefilter unless you want all of your 
> > searches to then be case-sensitive.
> >
> > So you should try:
> >
> > q=edgy_text:whatever&sort=string_ci asc
> >
> > Please use the admin>>pick_core>>analysis page when thinking about changing 
> > your schema, it’ll answer a _lot_ of these questions immediately.
> >
> > Best,
> > Erick
> >
> > > On Mar 24, 2020, at 8:37 AM, matthew sporleder  
> > > wrote:
> > >
> > > Oh maybe a schema bug!
> > >
> > > my string_ci:
> > >  > > sortMissingLast="true" omitNorms="true">
> > > 
> > >  
> > >  
> > > 
> > >  
> > >
> > > going to try this instead:
> > >   > > sortMissingLast="true" omitNorms="true">
> > > 
> > >      
> > >  
> > > 
> > >  
> > >
> > > Then I can probably kill the lowercasefilter on edgeytext:
> > >
> > >
> > >
> > > On Tue, Mar 24, 2020 at 7:44 AM Erick Erickson  
> > > wrote:
> > >>
> > >> Sort by the full field. You’ll need to copy to a field with 
> > >> keywordTokenizer and lowercaseFilter (string_ci? assuming it’s not 
> > >> really a :”string”) type.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>> On Mar 24, 2020, at 7:10 AM, matthew sporleder  
> > >>> wrote:
> > >>>
> > >>> I have added an edge ngram field to my index and get decent results
> > >>> with partial words but the results appear randomly sorted and all
> > >>> contain the same score.  Ideally I would like to sort by shortest
> > >>> ngram match within my other qualifiers.
> > >>>
> > >>> Is there a canonical solution to this?
> > >>>
> > >>> Thanks,
> > >>> Matt
> > >>>
> > >>> p.s. I mostly followed
> > >>> https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/
> > >>>
> > >>> schema bits:
> > >>>
> > >>>  > >>> positionIncrementGap="100">
> > >>> 
> > >>>  
> > >>>  
> > >>>   > >>> maxGramSize="25" />
> > >>> 
> > >>>
> > >>>  > >>> multiValued="false" />
> > >>>
> > >>>  > >>> omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"
> > >>> />
> > >>>
> > >>>
> > >>> 
> > >>
> >


Re: edge ngram/find as you type sorting

2020-03-24 Thread matthew sporleder
Okay I appreciate you responding.

Switching "slug" from "string_ci" class="solr.StrField" accomplished
about the same results, which makes sense to me now :)

The previous definition of string_ci was:
  
 
  
  
 
  

So lowercase + KeywordTokenizerFactory;

I am trying again with omitNorms=false  to see if I can get the more
"exact" matches to score better this time around.


On Tue, Mar 24, 2020 at 9:54 AM Erick Erickson  wrote:
>
> Won’t work. String types are totally unanalyzed. Your string_ci fieldType is 
> what I was looking for.
>
> No, you shouldn’t kill the lowercasefilter unless you want all of your 
> searches to then be case-sensitive.
>
> So you should try:
>
> q=edgy_text:whatever&sort=string_ci asc
>
> Please use the admin>>pick_core>>analysis page when thinking about changing 
> your schema, it’ll answer a _lot_ of these questions immediately.
>
> Best,
> Erick
>
> > On Mar 24, 2020, at 8:37 AM, matthew sporleder  wrote:
> >
> > Oh maybe a schema bug!
> >
> > my string_ci:
> >  > sortMissingLast="true" omitNorms="true">
> > 
> >  
> >  
> > 
> >  
> >
> > going to try this instead:
> >   > sortMissingLast="true" omitNorms="true">
> > 
> >  
> >  
> > 
> >  
> >
> > Then I can probably kill the lowercasefilter on edgeytext:
> >
> >
> >
> > On Tue, Mar 24, 2020 at 7:44 AM Erick Erickson  
> > wrote:
> >>
> >> Sort by the full field. You’ll need to copy to a field with 
> >> keywordTokenizer and lowercaseFilter (string_ci? assuming it’s not really 
> >> a :”string”) type.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 24, 2020, at 7:10 AM, matthew sporleder  
> >>> wrote:
> >>>
> >>> I have added an edge ngram field to my index and get decent results
> >>> with partial words but the results appear randomly sorted and all
> >>> contain the same score.  Ideally I would like to sort by shortest
> >>> ngram match within my other qualifiers.
> >>>
> >>> Is there a canonical solution to this?
> >>>
> >>> Thanks,
> >>> Matt
> >>>
> >>> p.s. I mostly followed
> >>> https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/
> >>>
> >>> schema bits:
> >>>
> >>>  >>> positionIncrementGap="100">
> >>> 
> >>>  
> >>>  
> >>>   >>> maxGramSize="25" />
> >>> 
> >>>
> >>>  >>> multiValued="false" />
> >>>
> >>>  >>> omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"
> >>> />
> >>>
> >>>
> >>> 
> >>
>


Re: edge ngram/find as you type sorting

2020-03-24 Thread matthew sporleder
Oh maybe a schema bug!

my string_ci:
 
 
  
  
 
  

going to try this instead:
  
 
  
  
 
  

Then I can probably kill the lowercasefilter on edgeytext:



On Tue, Mar 24, 2020 at 7:44 AM Erick Erickson  wrote:
>
> Sort by the full field. You’ll need to copy to a field with keywordTokenizer 
> and lowercaseFilter (string_ci? assuming it’s not really a :”string”) type.
>
> Best,
> Erick
>
> > On Mar 24, 2020, at 7:10 AM, matthew sporleder  wrote:
> >
> > I have added an edge ngram field to my index and get decent results
> > with partial words but the results appear randomly sorted and all
> > contain the same score.  Ideally I would like to sort by shortest
> > ngram match within my other qualifiers.
> >
> > Is there a canonical solution to this?
> >
> > Thanks,
> > Matt
> >
> > p.s. I mostly followed
> > https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/
> >
> > schema bits:
> >
> >  > positionIncrementGap="100">
> > 
> >   
> >   
> >> maxGramSize="25" />
> > 
> >
> >   > multiValued="false" />
> >
> >   > omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"
> > />
> >
> >
> > 
>


edge ngram/find as you type sorting

2020-03-24 Thread matthew sporleder
I have added an edge ngram field to my index and get decent results
with partial words but the results appear randomly sorted and all
contain the same score.  Ideally I would like to sort by shortest
ngram match within my other qualifiers.

Is there a canonical solution to this?

Thanks,
Matt

p.s. I mostly followed
https://lucidworks.com/post/auto-suggest-from-popular-queries-using-edgengrams/

schema bits:
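(The archive ate the XML below; reconstructed from the fragments that survive
in the quoted replies above -- positionIncrementGap="100", maxGramSize="25",
the omitNorms/omitTermFreqAndPositions attributes -- with minGramSize and the
tokenizers being my assumptions based on the lucidworks article:)

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="slug" type="string_ci" indexed="true" stored="true" multiValued="false"/>
  <field name="fayt" type="edgytext" indexed="true" stored="false"
         omitNorms="false" omitTermFreqAndPositions="true" multiValued="true"/>
  <copyField source="slug" dest="fayt"/>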


 
   
   
   
 

  

  





Re: disappearing MBeans

2011-02-02 Thread matthew sporleder
Sorry to reply to myself, but I just wanted to see if anyone saw
this/had ideas why MBeans would be removed/re-added/removed.

I tried looking for this in the code but was unable to grok what
triggers bean removal.

Any hints?


On Thu, Jan 27, 2011 at 3:30 PM, matthew sporleder msporle...@gmail.com wrote:
 I am using JMX to monitor my replication status and am finding that my
 MBeans are disappearing.  I turned on debugging for JMX and found that
 solr seems to be deleting the mbeans.

 Is this a bug?  Some trace info is below..

 here's me reading the mbean successfully:
 Jan 27, 2011 5:00:02 PM ServerCommunicatorAdmin reqIncoming
 FINER: Receive a new request.
 Jan 27, 2011 5:00:02 PM DefaultMBeanServerInterceptor getAttribute
 FINER: Attribute= indexReplicatedAt, obj=
 solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:00:02 PM Repository retrieve
 FINER: 
 name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:00:02 PM ServerCommunicatorAdmin reqIncoming
 FINER: Finish a request.


 a little while later it removes the mbean from the PM Repository
 (whatever that is) and then re-adds it:
 FINER: Send create notification of object
 solr/myapp-core:id=org.apache.solr.handler.component.SearchHandler,type=atlas
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
 FINER: JMX.mbean.registered
 solr/myapp-core:type=atlas,id=org.apache.solr.handler.component.SearchHandler
 Jan 27, 2011 5:16:14 PM Repository contains
 FINER: 
 name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM Repository retrieve
 FINER: 
 name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM Repository remove
 FINER: 
 name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor unregisterMBean
 FINER: Send delete notification of object
 solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=/replication
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
 FINER: JMX.mbean.unregistered
 solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor registerMBean
 FINER: ObjectName =
 solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM Repository addMBean
 FINER: 
 name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor addObject
 FINER: Send create notification of object
 solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=/replication
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
 FINER: JMX.mbean.registered
 solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler


 And after a tons of messages but still in the same second it does:
 Jan 27, 2011 5:16:14 PM Repository contains
 FINER: 
 name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM Repository retrieve
 FINER: 
 name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM Repository remove
 FINER:
 name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor unregisterMBean
 FINER: Send delete notification of object
 solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
 FINER: JMX.mbean.unregistered
 solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor registerMBean
 FINER: ObjectName =
 solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandlerJan
 27, 2011 5:16:14 PM Repository addMBean
 FINER: 
 name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor addObjectFINER:
 Send create notification of object
 solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=org.apache.solr.handler.ReplicationHandler
 Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor
 sendNotificationFINER: JMX.mbean.registered
 solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler


 And then I don't know what this is about but it removes the bean again:
 Jan 27, 2011 5:16:15 PM Repository contains
 FINER: 
 name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id

disappearing MBeans

2011-01-27 Thread matthew sporleder
I am using JMX to monitor my replication status and am finding that my
MBeans are disappearing.  I turned on debugging for JMX and found that
Solr seems to be deleting the MBeans.

Is this a bug?  Some trace info is below...

here's me reading the mbean successfully:
Jan 27, 2011 5:00:02 PM ServerCommunicatorAdmin reqIncoming
FINER: Receive a new request.
Jan 27, 2011 5:00:02 PM DefaultMBeanServerInterceptor getAttribute
FINER: Attribute= indexReplicatedAt, obj=
solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:00:02 PM Repository retrieve
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:00:02 PM ServerCommunicatorAdmin reqIncoming
FINER: Finish a request.
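
(For reference, the poll itself is just a remote getAttribute call,
roughly like the sketch below. The ObjectName and attribute name are
taken from the trace above; the service URL and port are placeholders
for whatever your Solr JVM exposes.)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReplicationPoll {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; point this at your Solr JVM's JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:3000/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // The same ObjectName the getAttribute call in the trace hits.
            ObjectName replication = new ObjectName(
                    "solr/myapp-core:type=/replication,"
                    + "id=org.apache.solr.handler.ReplicationHandler");
            Object value = conn.getAttribute(replication, "indexReplicatedAt");
            System.out.println("indexReplicatedAt = " + value);
        } finally {
            jmxc.close();
        }
    }
}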


a little while later it removes the mbean from the Repository (the
MBean server's internal bean store; the "PM" in the trace is just the
timestamp) and then re-adds it:
FINER: Send create notification of object
solr/myapp-core:id=org.apache.solr.handler.component.SearchHandler,type=atlas
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered
solr/myapp-core:type=atlas,id=org.apache.solr.handler.component.SearchHandler
Jan 27, 2011 5:16:14 PM Repository contains
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository retrieve
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository remove
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor unregisterMBean
FINER: Send delete notification of object
solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=/replication
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.unregistered
solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName =
solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository addMBean
FINER: 
name=solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor addObject
FINER: Send create notification of object
solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=/replication
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered
solr/myapp-core:type=/replication,id=org.apache.solr.handler.ReplicationHandler


And after a ton of messages, but still in the same second, it does:
Jan 27, 2011 5:16:14 PM Repository contains
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository retrieve
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository remove
FINER:
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor unregisterMBean
FINER: Send delete notification of object
solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.unregistered
solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor registerMBean
FINER: ObjectName =
solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM Repository addMBean
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor addObject
FINER: Send create notification of object
solr/myapp-core:id=org.apache.solr.handler.ReplicationHandler,type=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:14 PM DefaultMBeanServerInterceptor sendNotification
FINER: JMX.mbean.registered
solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler


And then I don't know what this is about but it removes the bean again:
Jan 27, 2011 5:16:15 PM Repository contains
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:15 PM Repository retrieve
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011 5:16:15 PM Repository remove
FINER: 
name=solr/myapp-core:type=org.apache.solr.handler.ReplicationHandler,id=org.apache.solr.handler.ReplicationHandler
Jan 27, 2011