Re: Extract a list of the most recent field values?

2021-02-05 Thread Emir Arnautović
Hi Jimi,
It seems to me that you could get the results using the collapsing query parser: 
https://lucene.apache.org/solr/guide/6_6/collapse-and-expand-results.html 
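For example, something along these lines (an untested sketch; field names are made up,
and collapsing needs a single-valued docValues field, so you would collapse on a
single-valued copy of the category):

q=*:*&fq=category_s:[* TO *]&fq={!collapse field=category_s sort='last_modified desc'}&sort=last_modified desc&fl=category_s,last_modified&rows=10

Each returned document is then the most recent one for its category, so the first N
rows give the N most recently used categories. The sort local parameter inside collapse
may not be available on older versions - max on a numeric timestamp field is an alternative.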


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Feb 2021, at 14:17, Hullegård, Jimi 
>  wrote:
> 
> Hi,
> 
> Say we have a bunch of documents in Solr, and each document has a multi value 
> field "category". Now I would like to get the N most recently used 
> categories, ordered so that the most recently used category comes first and 
> then in falling order.
> 
> My simplistic solution to this would be:
> 
> 1. Perform a search for all documents with at least one category set, sorted 
> by last modified date
> 2. Iterate over the search results, extracting the categories used, and add 
> them to a list
> 3. If we have N number of categories, stop iterating the results
> 4. If there isn't enough categories, go to the next page in the search 
> results and start over at step 2 above, until N categories are found or no 
> more search results
> 
> But this doesn't seem very efficient. Especially if there are lots of 
> documents, and N is a high number and/or only a handful of categories are 
> used most of the time, since it could mean one has to look through a whole 
> lot of documents before having enough categories. Worst case scenario: N is 
> higher than the total number of unique categories used, in which case one 
> would iterate over every single document that has a category.
> 
> Is there a way one can construct some special query to solr to get this data 
> in a more efficient way?
> 
> Regards
> /Jimi
> 



Re: SOLR uses too much CPU and GC is also weird on Windows server

2020-10-27 Thread Emir Arnautović
Hi Jaan,
You can also check in admin console in caches the sizes of field* caches. That 
will tell you if some field needs docValues=true.
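If you prefer an API over the UI, the metrics endpoint should show the same numbers
(exact key names can differ between versions):

curl 'http://localhost:8983/solr/admin/metrics?group=core&prefix=CACHE'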

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Oct 2020, at 14:36, Jaan Arjasepp  wrote:
> 
> Hi Erick,
> 
> Thanks for this information, I will look into it.
> Main changes were regarding parsing the results JSON got from solr, not the 
> queries or updates.
> 
> Jaan
> 
> P.S. configuration change about requestParser was not it.
> 
> 
> -Original Message-
> From: Erick Erickson
> Sent: 27 October 2020 15:03
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR uses too much CPU and GC is also weird on Windows server
> 
> Jean:
> 
> The basic search uses an “inverted index”, which is basically a list of terms 
> and the documents they appear in, e.g.
> my - 1, 4, 9, 12
> dog - 4, 8, 10
> 
> So the word “my” appears in docs 1, 4, 9 and 12, and “dog” appears in 4, 8, 
> 10. Makes it easy to search for my AND dog for instance, obviously both 
> appear in doc 4.
> 
> But that’s a lousy structure for faceting, where you have a list of documents 
> and are trying to find the terms it has to count them up. For that, you want 
> to “uninvert” the above structure,
> 1 - my
> 4 - my dog
> 8 - dog
> 9 - my
> 10 - dog
> 12 - my
> 
> From there, it’s easy to say “count the distinct terms for docs 1 and 4 and 
> put them in a bucket”, giving facet counts like 
> 
> my (2)
> dog (1)
> 
> If docValues=true, then the second structure is built at index time and 
> occupies memory at run time out in MMapDirectory space, i.e. _not_ on the 
> heap. 
> 
> If docValues=false, the second structure is built _on_ the heap when it’s 
> needed, adding to GC, memory pressure, CPU utilization etc.
> 
> So one theory is that when you upgraded your system (and you did completely 
> rebuild your corpus, right?) you inadvertently changed the docValues property 
> for one or more fields that you facet, group, sort, or use function queries 
> on and Solr is doing all the extra work of uninverting the field that it 
> didn’t have to before.
> 
> To answer that, you need to go through your schema and insure that 
> docValues=true is set for any field you facet, group, sort, or use function 
> queries on. If you do change this value, you need to blow away your index so 
> there are no segments and index all your documents again.
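> As an illustration only (the field name is made up), such a field in the schema would look like:
> 
> <field name="category" type="string" indexed="true" stored="true" docValues="true"/>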
> 
> But that theory has problems:
> 1> Why should Solr run for a while and then go crazy? It'd have to be that the
>    query that triggers uninversion is uncommon.
> 2> docValues defaults to true for simple types in recent schemas. Perhaps you
>    pulled over an old definition from your former schema?
> 
> 
> One other thing: you mention a bit of custom code you needed to change. I 
> always try to investigate that first. Is it possible to
> 1> reproduce the problem on a non-prod system
> 2> see what happens if you take the custom code out?
> 
> Best,
> Erick
> 
> 
>> On Oct 27, 2020, at 4:42 AM, Emir Arnautović  
>> wrote:
>> 
>> Hi Jaan,
>> It can be several things:
>> caches
>> fieldCache/fieldValueCache - it can be that you are missing doc values 
>> on some fields that are used for faceting/sorting/functions and that 
>> uninverted field structures are eating your memory. 
>> filterCache - you’ve changed the setting for filter caches and set it to 
>> some large value
>> heavy queries
>> return a lot of documents
>> facet on high cardinality fields
>> deep pagination
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
>> Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 27 Oct 2020, at 08:48, Jaan Arjasepp  wrote:
>>> 
>>> Hello,
>>> 
>>> We have been using SOLR for quite some time. We used 6.0 and now we did a 
>>> little upgrade to our system and servers and we started to use 8.6.1.
>>> We use it on a Windows Server 2019.
>>> Java version is 11
>>> Basically using it in a default setting, except giving SOLR 2G of heap. It 
>>> used 512, but it ran out of memory and stopped responding. Not sure if it 
>>> was the issue. When older version, it managed fine with 512MB.
>>> SOLR is not in a cloud mode, but in solo mode as we use it internally and 
>>> it does not have too many requests nor indexing actually.

Re: SOLR uses too much CPU and GC is also weird on Windows server

2020-10-27 Thread Emir Arnautović
Hi Jaan,
It can be several things:
caches
fieldCache/fieldValueCache - it can be that you are missing doc values on 
some fields that are used for faceting/sorting/functions and that uninverted 
field structures are eating your memory. 
filterCache - you’ve changed setting for filter caches and set it to some large 
value
heavy queries
return a lot of documents
facet on high cardinality fields
deep pagination
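For reference, the filterCache lives in solrconfig.xml and looks roughly like this
(the sizes are placeholders; the class is CaffeineCache on recent 8.x releases,
FastLRUCache on older ones):

<filterCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="0"/>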

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Oct 2020, at 08:48, Jaan Arjasepp  wrote:
> 
> Hello,
> 
> We have been using SOLR for quite some time. We used 6.0 and now we did a 
> little upgrade to our system and servers and we started to use 8.6.1.
> We use it on a Windows Server 2019.
> Java version is 11
> Basically using it in a default setting, except giving SOLR 2G of heap. It 
> used 512, but it ran out of memory and stopped responding. Not sure if it was 
> the issue. When older version, it managed fine with 512MB.
> SOLR is not in a cloud mode, but in solo mode as we use it internally and it 
> does not have too many request nor indexing actually.
> Document sizes are not big, I guess. We only use one core.
> Document stats are here:
> Num Docs: 3627341
> Max Doc: 4981019
> Heap Memory Usage: 434400
> Deleted Docs: 1353678
> Version: 15999036
> Segment Count: 30
> 
> The size of index is 2.66GB
> 
> While making the upgrade we had to modify one field and a bit of code that uses 
> it. That's basically it. It works.
> If needed more information about background of the system, I am happy to help.
> 
> 
> But now to the issue I am having.
> If SOLR is started, at first 40-60 minutes it works just fine. CPU is not 
> high, heap usage seem normal. All is good, but then suddenly, the heap usage 
> goes crazy, going up and down, up and down and CPU rises to 50-60% of the 
> usage. Also I noticed over the weekend, when there are no writing usage, the 
> CPU remains low and decent. I can try it this weekend again to see if and how 
> this works out.
> Also it seems to me, that after 4-5 days of working like this, it stops 
> responding, but needs to be confirmed with more heap also.
> 
> Heap memory usage via JMX and jconsole -> 
> https://drive.google.com/file/d/1Zo3B_xFsrrt-WRaxW-0A0QMXDNscXYih/view?usp=sharing
> As you can see, it starts of normal, but then goes crazy and it has been like 
> this over night.
> 
> This is overall monitoring graphs, as you can see CPU is working hard or 
> hardly working. -> 
> https://drive.google.com/file/d/1_Gtz-Bi7LUrj8UZvKfmNMr-8gF_lM2Ra/view?usp=sharing
> VM summary can be found here -> 
> https://drive.google.com/file/d/1FvdCz0N5pFG1fmX_5OQ2855MVkaL048w/view?usp=sharing
> And finally to have better and quick overview of the SOLR executing 
> parameters that I have -> 
> https://drive.google.com/file/d/10VCtYDxflJcvb1aOoxt0u3Nb5JzTjrAI/view?usp=sharing
> 
> If you can point me what I have to do to make it work, then I appreciate it a 
> lot.
> 
> Thank you in advance.
> 
> Best regards,
> Jaan
> 
> 



Re: Question on solr metrics

2020-10-27 Thread Emir Arnautović
Hi,
In order to see time range metrics, you’ll need to collect metrics periodically 
and send it to some storage and then query/visualise. Solr has exporters for 
some popular backends, or you can use some cloud based solution. One such 
solution is our: https://sematext.com/integrations/solr-monitoring/ and we’ve 
also just added Solr logs integration so you can collect/visualise/alert on 
both metrics and logs.
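If you only want to poll the raw numbers yourself, you can hit the metrics API
periodically, e.g. (the metric key is just an example, adjust it to your handler):

curl 'http://localhost:8983/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes'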

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 Oct 2020, at 22:08, yaswanth kumar  wrote:
> 
> Can we get the metrics for a particular time range? I know metrics history
> was not enabled, so that I will be having only from when the solr node is
> up and running last time, but even from it can we do a data range like for
> example on to see CPU usage on a particular time range?
> 
> Note: Solr version: 8.2
> 
> -- 
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com



Re: Any blog or url that explain step by step configure grafana dashboard to monitor solr metrics

2020-09-25 Thread Emir Arnautović
Hi,
In case you decide to go with cloud solution, you can check how you can monitor 
Solr with Sematext: 
https://sematext.com/blog/solr-monitoring-made-easy-with-sematext/ 


Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 24 Sep 2020, at 04:55, yaswanth kumar  wrote:
> 
> Can some one post here any blogs or url where I can get the detailed steps 
> involved in configuring grafana dashboard for monitoring solr metrics??
> 
> Sent from my iPhone



Re: Replication in soft commit

2020-09-03 Thread Emir Arnautović
Hi Tushar,
This is not a use case suitable for the MS (master/slave) model. You should go with 
SolrCloud, or if that is an overhead for you, have separate Solr instances, each doing 
indexing on its own. Solr provides eventual consistency anyway, so you should have some 
sort of stickiness in place even if you use the MS model.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Sep 2020, at 13:54, Tushar Arora  wrote:
> 
> Hi Emir,
> Thanks for the response.
> Actually the use case is real time indexing from DB to solr in every second
> on the master server using queueing mechanism.
> So, I think instead of doing hard commits every second we should go for
> soft commits. And doing hard commits after some intervals.
> And we have to replicate the data to slave immediately.
> 
> Regards,
> Tushar
> On Thu, 3 Sep 2020 at 16:17, Emir Arnautović 
> wrote:
> 
>> Hi Tushar,
>> Replication is file based process and hard commit is when segment is
>> flushed to disk. It is not common that you use soft commits on master. The
>> only usecase that I can think of is when you read your index as part of
>> indexing process, but even that is bad practice and should be avoided.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 3 Sep 2020, at 08:38, Tushar Arora  wrote:
>>> 
>>> Hi,
>>> I want to ask if the soft commit works in replication.
>>> One of our use cases deals with indexing the data every second on a
>> master
>>> server. And then it has to replicate to slaves. So if we use soft commit,
>>> then does the data replicate immediately to the slave server or after the
>>> hard commit takes place.
>>> Use cases require transfer of data from master to slave immediately.
>>> 
>>> Regards,
>>> Tushar
>> 
>> 



Re: Replication in soft commit

2020-09-03 Thread Emir Arnautović
Hi Tushar,
Replication is file based process and hard commit is when segment is flushed to 
disk. It is not common that you use soft commits on master. The only usecase 
that I can think of is when you read your index as part of indexing process, 
but even that is bad practice and should be avoided.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Sep 2020, at 08:38, Tushar Arora  wrote:
> 
> Hi,
> I want to ask if the soft commit works in replication.
> One of our use cases deals with indexing the data every second on a master
> server. And then it has to replicate to slaves. So if we use soft commit,
> then does the data replicate immediately to the slave server or after the
> hard commit takes place.
> Use cases require transfer of data from master to slave immediately.
> 
> Regards,
> Tushar



Re: Understanding Negative Filter Queries

2020-07-14 Thread Emir Arnautović
Hi Chris,
tag:* is a wildcard query while *:* is a match-all query. I believe that 
adjusting pure negative queries is turned on by default, so you can safely just use 
-tag:email and it’ll be translated to *:* -tag:email.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 14 Jul 2020, at 14:00, Chris Dempsey  wrote:
> 
> I'm trying to understand the difference between something like
> fq={!cache=false}(tag:* -tag:email) which is very slow compared to
> fq={!cache=false}(*:* -tag:email) on Solr 7.7.1.
> 
> I believe in the case of `tag:*` Solr spends some effort to gather all of
> the documents that have a value for `tag` and then removes those with
> `-tag:email` while in the `*:*` Solr simply uses the document set as-is
> and  then remove those with `-tag:email` (*and I believe Erick mentioned
> there were special optimizations for `*:*`*)?



Re: Search for term except within phrase

2020-07-07 Thread Emir Arnautović
Hi Stavros,
I didn’t check what’s supported in ComplexPhraseQueryParser, but that is a wrapper 
around span queries, so you should be able to do what you need: 
https://lucene.apache.org/solr/guide/7_6/other-parsers.html#complex-phrase-query-parser
 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Jul 2020, at 03:11, Stavros Macrakis  wrote:
> 
> (Sorry for sending this with the wrong subject earlier.)
> 
> How can I search for a term except when it's part of certain phrases?
> 
> For example, I might want to find documents mentioning "pepper" where it is
> not part of the phrases "chili pepper", "hot pepper", or "pepper sauce".
> 
> It does not work to search for [pepper NOT ("chili pepper" OR "hot pepper"
> OR "pepper sauce")] because that excludes all documents which mention
> "chili pepper" even if they also mention "black pepper" or the unmodified
> word "pepper". Maybe some way using synonyms?
> 
> Thanks!
> 
> -s



Re: Searching document content and mult-valued fields

2020-07-06 Thread Emir Arnautović
Hi Shaun,
If the project content is relatively static, you could use nested documents, or 
you could play with the join query parser.
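For the join option, the query would be something like this (field names are hypothetical,
assuming the content documents carry a project_id pointing at the project's id):

q={!join from=project_id to=id}content_text:searchterm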

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 1 Jul 2020, at 18:19, Shaun Campbell  wrote:
> 
> Hi
> 
> Been using Solr on a project now for a couple of years and is working well.
> It's just a simple index of about 20 - 25 fields and 7,000 project records.
> 
> Now there's a requirement to be able to search on the content of documents
> (web pages, Word, pdf etc) related to those projects.  My initial thought
> was to just create a new index to store the Tika'd content and just search
> on that. However, the requirement is to somehow search through both the
> project records and the content records at the same time and list the main
> project with perhaps some info on the matching content data. I tried to
> explain that you may find matching main project records but no content, and
> vice versa.
> 
> My only solution to this search problem is to either concatenate all the
> document content into one field on the main project record, and add that to
> my dismax search, and use boosting etc or to use a multi-valued field to
> store the content of each project document.  I'm a bit reluctant to do this
> as the application is running well and I'm a bit nervous about a change to
> the schema and the indexing process.  I just wondered what you thought
> about adding a lot of content to an existing schema (single or multivalued
> field) that doesn't normally store big amounts of data.
> 
> Or does anyone know of any way, I can join two searches like this together
> and two separate indexes?
> 
> Thanks
> Shaun



Re: Solr caches per node or per core

2020-06-24 Thread Emir Arnautović
Hi Reinaldo,
It is per core. Single node can have cores from different collections, each 
configured differently. When you size caches from a memory consumption point of 
view, you have to take into account how many cores will be placed on each node. 
Of course, you have to count replicas as well.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 24 Jun 2020, at 16:38, Odysci  wrote:
> 
> Hi,
> 
> I have a Solrcloud configuration with 2 nodes and 2 shards/2 replicas.
> I configure the sizes of the solr caches on solrconfig.xml, which I
> believe apply to nodes.
> 
> But when I look at the caches in the Solr UI, they are shown per core
> (e.g., shard1_replica_N1). Are the cache sizes defined in the
> solrconfig.xml the total size (adding up the caches for all cores in the
> node)? or are the cache sizes defined in the solrconfig.xm applied to each
> core separately?
> Thanks
> 
> Reinaldo



Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-24 Thread Emir Arnautović
Hi all,
Here is how I see it and explain to others that are not too familiar with Solr: 
Solr comes in two flavours - Cloud and Standalone. In any mode Solr writes to 
primary core(s). There is option to have different types of replicas, but in 
Standalone mode one can only have pull replica. In addition to different types 
of replicas, in SolrCloud mode multiple cores can be shards of a single 
collection and primary is not fixed.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 24 Jun 2020, at 15:19, Mark H. Wood  wrote:
> 
> On Wed, Jun 24, 2020 at 12:45:25PM +0200, Jan Høydahl wrote:
>> Master/slave and standalone are used interchangably to mean zk-less Solr. I 
>> have a feeling that master/slave is the more popular of the two, but 
>> personally I have been using both.
> 
> I've been trying to stay quiet and let the new-terminology issue
> settle, but I had a thought.  Someone has already pointed out that the
> so-called master/slave cluster is misnamed:  the so-called "master"
> node doesn't order the "slaves" about and indeed has no notion of
> being a master in any sense.  It acts as a servant to the "slave"
> nodes, which are in charge of keeping themselves updated.
> 
> So, it's kind of odd, but I could get used to calling this mode a
> "client/server cluster".
> 
> That leaves the question of what to call Solr Cloud mode, in which no
> node is permanently special.  I could see calling it a "herd" or
> suchlike.
> 
> Now I'll try to shut up again. :-)
> 
> -- 
> Mark H. Wood
> Lead Technology Analyst
> 
> University Library
> Indiana University - Purdue University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> www.ulib.iupui.edu



Re: Solr Deletes

2020-05-26 Thread Emir Arnautović
Hi Dwane,
DBQ does not play well with concurrent updates - it’ll block updates on 
replicas, causing replicas to fall behind, triggering full replication and 
potentially OOM. My advice is to go with cursors (or even better, use some DB as 
the source of IDs) and DBID with some batching. You’ll need some tests to see which 
batch size is best in your case.
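For the DBID batches, each batch is just a small update request, e.g. (collection name,
IDs and batch size are only illustrative):

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/<collection>/update' -d '{"delete":["id1","id2","id3"]}'

and let autoCommit make the deletes visible rather than committing per batch.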

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 May 2020, at 01:48, Dwane Hall  wrote:
> 
> Hey Solr users,
> 
> 
> 
> I'd really appreciate some community advice if somebody can spare some time 
> to assist me.  My question relates to initially deleting a large amount of 
> unwanted data from a Solr Cloud collection, and then advice on best patterns 
> for managing delete operations on a regular basis.   We have a situation 
> where data in our index can be 're-mastered' and as a result orphan records 
> are left dormant and unneeded in the index (think of a scenario similar to 
> client resolution where an entity can switch between golden records depending 
> on the information available at the time).  I'm considering removing these 
> dormant records with a large initial bulk delete, and then running a delete 
> process on a regular maintenance basis.  The initial record backlog is 
> ~50million records in a ~1.2billion document index (~4%) and the maintenance 
> deletes are small in comparison ~20,000/week.
> 
> 
> 
> So with this scenario in mind I'm wondering what my best approach is for the 
> initial bulk delete:
> 
>  1.  Do nothing with the initial backlog and remove the unwanted documents 
> during the next large reindexing process?
>  2.  Delete by query (DBQ) with a specific delete query using the document 
> id's?
>  3.  Delete by id (DBID)?
> 
> Are there any significant performance advantages between using DBID over a 
> specific DBQ? Should I break the delete operations up into batches of say 
> 1000, 10000, 100000, ... N DOC_IDs at a time if I take this approach?
> 
> 
> 
> The Solr Reference guide mentions DBQ ignores the commitWithin parameter but 
> you can specify multiple documents to remove with an OR (||) clause in a DBQ 
> i.e.
> 
> 
> Option 1 – Delete by id
> 
> {"delete":["",""]}
> 
> 
> 
> Option 2 – Delete by query (commitWithin ignored)
> 
> {"delete":{"query":"DOC_ID:( || )"}}
> 
> 
> 
> Shawn also provides a great explanation in this user group post from 2015 of 
> the DBQ process 
> (https://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html)
> 
> 
> 
> I follow the Solr release notes fairly closely and also noticed this 
> excellent addition and discussion from Hossman and committers in the Solr 8.5 
> release and it looks ideal for this scenario 
> (https://issues.apache.org/jira/browse/SOLR-14241).  Unfortunately we're 
> still on the 7.7.2 branch and are unable to take advantage of the streaming 
> deletes feature.
> 
> 
> 
> If I do implement a weekly delete maintenance regime is there any advice the 
> community can offer from experience?  I'll definitely want to avoid times of 
> heavy indexing but how do deletes effect query performance?  Will users 
> notice decreased performance during delete operations so they should be 
> avoided during peak query windows as well?
> 
> 
> 
> As always any advice greatly is appreciated,
> 
> 
> 
> Thanks,
> 
> 
> 
> Dwane
> 
> 
> 
> Environment
> 
> SolrCloud 7.7.2, 30 shards, 2 replicas
> 
> ~3 qps during peak times



Re: solr payloads performance

2020-05-11 Thread Emir Arnautović
Hi Wei,
In order to use payloads you have to use functions, and that’s not cheap. To make it 
fast, you could use the payload function as a post filter and filter on some 
summary field like minPrice/maxPrice/defaultPrice.
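A rough sketch of that idea (field and store names are made up, and it assumes the
payloads are indexed as floats):

fq=min_price:[10 TO 100]&fq={!frange cache=false cost=200 l=10 u=100}payload(store_prices_payload,'store1')

The cheap range filter on the summary field narrows the set first, and the frange with
cache=false and cost >= 100 runs as a post filter, so payload() is only evaluated on the
remaining documents.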

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 9 May 2020, at 01:26, Wei  wrote:
> 
> Hi everyone,
> 
> Have a question regarding typical  e-commerce scenario: each item may have
> different price in different store. suppose there are 10 million items and
> 1000 stores.
> 
> Option 1:  use solr payloads, each document have
> store_prices_payload:store1|price1 store2|price2  .
> store1000|price1000
> 
> Option 2: use dynamic fields and have 1000 fields in each document, i.e.
>   field1:  store1_price:  price1
>   field2:  store2_price:  price2
>   ...
>   field1000:  store1000_price: price1000
> 
> Option 2 doesn't look elegant,  but is there any performance benchmark on
> solr payloads? In terms of filtering, sorting or faceting, how would query
> performance compare between the two?
> 
> Thanks,
> Wei



Re: Minimum Match Query

2020-05-07 Thread Emir Arnautović
Hi Russel,
You are right about mm - it is about min term matches. Frequencies are usually 
used to determine score. But you can also filter on number of matches using 
function queries:
fq={!frange l=3}sum(termfreq(field, ‘barker’), termfreq(field, ‘jones’), 
termfreq(field, ‘baker’))

It is not perfect and you will need to handle phrases at index time to be able 
to match phrases. Or you can combine it with some other query to filter out 
unwanted results and use this approach to make sure frequencies match.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 May 2020, at 03:12, Russell Bahr  wrote:
> 
> Hi Atita,
> We actually looked into that and it does not appear to match based on a
> single phrase, but says that it must match a certain percentage of the
> listed phrases.  What we need is something that would match based on a
> single phrase appearing a minimum number of times i.e. "Barker" minimum
> number of matches =3 where "Barker" showed up in a document 3 or more times.
> 
> Am I missing something there or am I reading this wrong?
> The mm (Minimum Should Match) Parameter When processing queries,
> Lucene/Solr recognizes three types of clauses: mandatory, prohibited, and
> "optional" (also known as "should" clauses). By default, all words or
> phrases specified in the q parameter are treated as "optional" clauses
> unless they are preceded by a "+" or a "-". When dealing with these
> "optional" clauses, the mm parameter makes it possible to say that a
> certain minimum number of those clauses must match. The DisMax query parser
> offers great flexibility in how the minimum number can be specified.
> 
> We did try doing a query and the results that came back were reflective
> only of minimum number of phrases matching as opposed to a phrase being
> mentioned a minimum number of times.
> 
> For example, If I say query for “Google” with mm=100 it doesn’t find
> Articles with 100 mentions of Google.  It is used for multiple phrase
> queries.  Example against our servers:
> 
> query = "Barker" OR "Jones" OR “Baker” mm=1 103,896 results
> query = "Barker" OR "Jones" OR “Baker” mm=2 1200 results
> query = "Barker" OR "Jones" OR “Baker” mm=3 16 results
> 
> Please let me know.
> Thank you,
> Russ
> 
> 
> 
> On Wed, May 6, 2020 at 10:13 AM Atita Arora  wrote:
> 
>> Hi,
>> 
>> Did you happen to look into :
>> 
>> 
>> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Themm_MinimumShouldMatch_Parameter
>> 
>> I believe 6.5.1 has it too.
>> 
>> I hope it should help.
>> 
>> 
>> On Wed, May 6, 2020 at 6:46 PM Russell Bahr  wrote:
>> 
>>> Hi SOLR team,
>>> I have been asked if there is a way to return results only if those
>>> results match a minimum number of times present in the query.
>>> ( queries looking for a minimum amount of mentions for a particular
>>> term/phrase. Ie must be mentioned 'x' amount of times to return results).
>>> Is this something that is possible using SOLR 6.5.1?  Is this something
>>> that would require a newer version of SOLR?
>>> Any help on this would be appreciated.
>>> Thank you,
>>> Russ
>>> 
>> 



Re: Reindexing using dataimporthandler

2020-04-27 Thread Emir Arnautović
Hi Bjarke,
I don’t see a problem with that approach if you have enough resources to handle 
both cores at the same time, especially if you are doing that while serving 
production queries. The only issue is that if you plan to do that then you have 
to have all fields stored. Also note that cursorMark support was added a bit 
later to entity processor, so if you are running a bit older version of Solr, 
you might not have cursors - I found that out the hard way.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Apr 2020, at 13:11, Bjarke Buur Mortensen  wrote:
> 
> Hi list,
> 
> Let's say I add a copyField to my solr schema, or change the analysis chain
> of a field or some other change.
> It seems to me to be an alluring choice to use a very simple
> dataimporthandler to reindex all documents, by using a SolrEntityProcessor
> that points to itself. I have just done this for a very small collection,
> but I was wondering what the caveats are, since this is not the recommended
> practice. What can go wrong using this approach?
> 
>   "http://localhost:8983/solr/mycollection; qt="lucene" query="*:*" wt=
> "javabin" rows="1000" cursorMark="true" sort="id asc" fl=
> "*,orig_version_l:_version_"/> 
> 
> PS: (It is probably necessary to add a version:[* TO
> ] to ensure it terminates for large imports)
> PPS: (Obviously you shouldn't add the clean parameter)
> 
> /Bjarke



Re: Rule of thumb for determining maxTime of AutoCommit

2020-02-27 Thread Emir Arnautović
Hi Kaya,
Since you do not have soft commits, you must have explicit commits somewhere 
since your hard commits are configured not to open searcher.

Re warming up: yes - you are right. You need to check your queries and warmup 
numbers in cache configs. What you need to check is how long warmup takes, 
and if it takes too long, reduce the number of warmup queries/items. I think that 
there is cumulative warming time in admin console, or if you prefer some proper 
Solr monitoring tool, you can check out our Solr integration: 
https://apps.sematext.com/demo
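For reference, the static warmup queries are the newSearcher listener in solrconfig.xml,
and the dynamic part is the autowarmCount on each cache. Something like this (the query
and sort are only placeholders):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">some popular query</str><str name="sort">last_modified desc</str></lst>
  </arr>
</listener>

Start by measuring the warmup times you already have, and only then reduce autowarmCount
or the number of listed queries.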

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Feb 2020, at 03:00, Kayak28  wrote:
> 
> Hello, Emir:
> 
> Thank you for your reply.
> I do understand that the frequency of creating searcher depends on how much
> realitime-search is required.
> 
> As you advise me, I have checked a soft-commit configuration.
> It is configured as:
> <autoSoftCommit><maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime></autoSoftCommit>
> 
> If I am correct, I have not set autoSoftCommit, which means autoSoftCommit
> does not create a new searcher.
> Next, I will take a look at my explicit commit.
> 
> I am curious about what you say "warming strategy."
> Is there any good resource to learn about tuning warming up?
> 
> As far as I know about warming up, there is two warming-up functions in
> Solr.
> One is static warming up, which you can configure queries in solrconfig.xml
> The other is dynamic warming up, which uses queries from old cache.
> 
> How should I tune them?
> What is the first step to look at?
> (I am kinda guessing the answer can vary depends on the system, the
> service, etc... )
> 
> 
> 
> Sincerely,
> Kaya Ota
> 
> 
> 
> 2020年2月26日(水) 17:36 Emir Arnautović :
> 
>> Hi Kaya,
>> The answer is simple: as much as your requirements allow delay between
>> data being indexed and changes being visible. It is sometimes seconds and
>> sometimes hours or even a day is tolerable. On each commit your caches are
>> invalidated and warmed (if it is configured like that) so in order to get
>> better use of caches, you should commit as rare as possible.
>> 
>> The setting that you provided is about hard commits and those are
>> configured not to open new searcher so such commit does not cause “exceeded
>> limit” error. You either have soft auto commits configured or you do
>> explicit commits when updating documents. Check and tune those and if you
>> do explicit commits, remove those if possible. If you cannot afford less
>> frequent commits, you have to tune your warming strategy to make sure it
>> does not take as much time as period between two commits.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 26 Feb 2020, at 06:16, Kayak28  wrote:
>>> 
>>> Hello, Solr Community:
>>> 
>>> Another day, I had an error "exceeded limit of maxWarmingSearchers=2."
>>> I know this error causes when multiple commits(which opens a new
>> searcher)
>>> are requested too frequently.
>>> 
>>> As far as I read Solr wiki, it recommends for me to have more interval
>>> between each commit, and make commit frequency less.
>>> Using autoCommit,  I would like to decrease the commit frequency, but I
>> am
>>> not sure how much I should increase the value of maxTime in autoCommit?
>>> 
>>> My current configuration is the following:
>>> 
>>> 
>>> <autoCommit>
>>>   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>>>   <openSearcher>false</openSearcher>
>>> </autoCommit>
>>> 
>>> 
>>> 
>>> How do you determine how much you increase the value in this case?
>>> Is there any rule of thumb advice to configure commit frequency?
>>> 
>>> Any help will be appreciated.
>>> 
>>> Sincerely,
>>> Kaya Ota
>> 
>> 



Re: Rule of thumb for determining maxTime of AutoCommit

2020-02-26 Thread Emir Arnautović
Hi Kaya,
The answer is simple: as much as your requirements allow delay between data 
being indexed and changes being visible. It is sometimes seconds and sometimes 
hours or even a day is tolerable. On each commit your caches are invalidated 
and warmed (if it is configured like that) so in order to get better use of 
caches, you should commit as rare as possible.

The setting that you provided is about hard commits and those are configured 
not to open new searcher so such commit does not cause “exceeded limit” error. 
You either have soft auto commits configured or you do explicit commits when 
updating documents. Check and tune those and if you do explicit commits, remove 
those if possible. If you cannot afford less frequent commits, you have to tune 
your warming strategy to make sure it does not take as much time as period 
between two commits.
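Just as an illustration of the shape (the actual numbers have to come from your freshness
requirements, not from this example):

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:300000}</maxTime>
</autoSoftCommit>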

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 Feb 2020, at 06:16, Kayak28  wrote:
> 
> Hello, Solr Community:
> 
> Another day, I had an error "exceeded limit of maxWarmingSearchers=2."
> I know this error causes when multiple commits(which opens a new searcher)
> are requested too frequently.
> 
> As far as I read Solr wiki, it recommends for me to have more interval
> between each commit, and make commit frequency less.
> Using autoCommit,  I would like to decrease the commit frequency, but I am
> not sure how much I should increase the value of maxTime in autoCommit?
> 
> My current configuration is the following:
> 
>
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
> 
> 
> 
> How do you determine how much you increase the value in this case?
> Is there any rule of thumb advice to configure commit frequency?
> 
> Any help will be appreciated.
> 
> Sincerely,
> Kaya Ota



Re: How to monitor the performance of the SolrCloud cluster in real time

2020-02-23 Thread Emir Arnautović
Hi Adonis,
If you are open to a 3rd-party, cloud-based monitoring solution, you can try our 
integration for Solr/SolrCloud: https://sematext.com/cloud/ 


Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Feb 2020, at 08:54, Adonis Ling  wrote:
> 
> Hi team,
> 
> Our team is using Solr as a complementary full text search service for our
> NoSQL database and I'm building the monitor system for Solr.
> 
> After I read the related section (Performance Statistics Reference) in
> reference guide, I realized the requestTimes metrics are collected since
> the Solr core was first created. Is it possible to monitor the requests
> (count or latency) of a collection in real time?
> 
> I think it should reset the related metrics periodically. Are there some
> configurations to do this?
> 
> -- 
> Adonis



Re: SOLR PERFORMANCE Warning

2020-02-20 Thread Emir Arnautović
Hi,
It means that you are either committing too frequently or your warming up takes 
too long. If you are committing on every bulk, stop doing that and use 
autocommit.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 21 Feb 2020, at 06:54, Akreeti Agarwal  wrote:
> 
> Hi All,
> 
> 
> 
> I am using SOLR 7.5 version with master slave architecture.
> 
> I am getting :
> 
> 
> 
> "PERFORMANCE WARNING: Overlapping onDeckSearchers=2"
> 
> 
> 
> continuously on my master logs for all cores. Please help me to resolve this.
> 
> 
> 
> 
> 
> Thanks & Regards,
> 
> Akreeti Agarwal
> 
> 



Re: Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

2020-02-10 Thread Emir Arnautović
Hi Pratik,
Shingle filter should do that.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 10 Feb 2020, at 18:57, Pratik Patel  wrote:
> 
> Thanks for the reply Emir.
> 
> I will be exploring the option of creating a custom filter. It's good to
> know that we can consume more than one tokens from previous filter and emit
> different number of tokens. Do you know of any existing filter in Solr
> which does something similar? It would be greatly helpful to see how more
> than one tokens can be consumed. I can implement my custom logic once I
> have access to multiple tokens from previous filter.
> 
> Thanks
> Pratik
> 
> On Mon, Feb 10, 2020 at 2:47 AM Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Pratik,
>> You might be able to do some of the required things using
>> PatternReplaceCharFilter, but as you can see it does not operate on the token
>> level but on the input string. Your best bet is a custom token filter. Not sure how
>> familiar you are with how token filters work, but you have access to tokens
>> from previous filter and you can implement any logic you want: you consume
>> three tokens and emit tokens based on adjacent tokens.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 7 Feb 2020, at 19:27, Pratik Patel  wrote:
>>> 
>>> Hello Everyone,
>>> 
>>> Let's say I have an analyzer which has following token stream as an
>> output.
>>> 
>>> *token stream : [], a, ab, [], c, [], d, de, def .*
>>> 
>>> Now let's say I want to add another filter which will drop a certain
>> tokens
>>> based on whether adjacent token on the right side is [] or some string.
>>> 
>>> for a given token,
>>>drop/replace it by empty string it if there is a non-empty string
>>> token on its right and
>>>keep it if there is an empty token string on its right.
>>> 
>>> based on this, the resulting token stream would be like this.
>>> 
>>> *desired output stream : [], [a], ab, [], c, [], d,
>>> de, def *
>>> 
>>> 
>>> *Is there any Filter available in solr with which this can be achieved?*
>>> *If writing a custom filter is the only possible option then I want to
>> know
>>> whether its possible to access adjacent tokens in the custom filter?*
>>> 
>>> *Any idea about this would be really helpful.*
>>> 
>>> Thanks,
>>> Pratik
>> 
>> 



Re: Solr Analyzer : Filter to drop tokens based on some logic which needs access to adjacent tokens

2020-02-09 Thread Emir Arnautović
Hi Pratik,
You might be able to do some of the required things using 
PatternReplaceCharFilter, but as you can see it does not operate on the token 
level but on the input string. Your best bet is a custom token filter. Not sure how 
familiar you are with how token filters work, but you have access to tokens 
from previous filter and you can implement any logic you want: you consume 
three tokens and emit tokens based on adjacent tokens.
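Just to illustrate the mechanics, here is an untested sketch of a one-token look-ahead
filter (the class name and the keep/blank rule are only an example - adapt the decision
to your own logic):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class LookAheadExampleFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private State current;          // token waiting for a look-ahead decision
  private boolean inputExhausted;

  public LookAheadExampleFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (current == null) {
      if (inputExhausted || !input.incrementToken()) {
        return false;
      }
      current = captureState();   // remember the token we are deciding about
    }

    // Peek at the following token (if any).
    State next = null;
    boolean nextIsEmpty = false;
    if (!inputExhausted && input.incrementToken()) {
      next = captureState();
      nextIsEmpty = termAtt.length() == 0;
    } else {
      inputExhausted = true;
    }

    restoreState(current);        // emit the current token now
    current = next;               // the look-ahead token is handled on the next call
    if (next != null && !nextIsEmpty) {
      termAtt.setEmpty();         // non-empty token on the right: blank this one
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    current = null;
    inputExhausted = false;
  }
}

To use it in Solr you would also need a small factory extending TokenFilterFactory and
reference that factory in the field type's analysis chain.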

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Feb 2020, at 19:27, Pratik Patel  wrote:
> 
> Hello Everyone,
> 
> Let's say I have an analyzer which has following token stream as an output.
> 
> *token stream : [], a, ab, [], c, [], d, de, def .*
> 
> Now let's say I want to add another filter which will drop a certain tokens
> based on whether adjacent token on the right side is [] or some string.
> 
> for a given token,
> drop/replace it by empty string it if there is a non-empty string
> token on its right and
> keep it if there is an empty token string on its right.
> 
> based on this, the resulting token stream would be like this.
> 
> *desired output stream : [], [a], ab, [], c, [], d,
> de, def *
> 
> 
> *Is there any Filter available in solr with which this can be achieved?*
> *If writing a custom filter is the only possible option then I want to know
> whether its possible to access adjacent tokens in the custom filter?*
> 
> *Any idea about this would be really helpful.*
> 
> Thanks,
> Pratik



Re: Number of requested rows

2020-02-05 Thread Emir Arnautović
Hi Toke,
Thanks for the post. Good that things are moving forward! It took a while!

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Feb 2020, at 15:23, Toke Eskildsen  wrote:
> 
> On Wed, 2020-02-05 at 13:00 +0100, Emir Arnautović wrote:
>> I was thinking in that direction. Do you know where it is in the
>> codebase or which structure is used - I am guessing some array of
>> objects?
> 
> Yeah. More precisely a priority queue of Objects, initialized with
> sentinel Objects. rows=100 is bad both from a memory allocation POW
> and because the heap-structure of the priority queue implementation has
> extremely bad memory locality when it is being updated.
> 
> I performed some measurements and did some experiments a few years ago:
> https://sbdevel.wordpress.com/2015/10/05/speeding-up-core-search/
> and there is https://issues.apache.org/jira/browse/LUCENE-8875 which
> takes care of the Sentinel thing in solr 8.2.
> 
> - Toke Eskildsen, Royal Danish Library
> 
> 



Re: Number of requested rows

2020-02-05 Thread Emir Arnautović
Thanks a lot!

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Feb 2020, at 13:27, Mikhail Khludnev  wrote:
> 
> Hi, Emir.
> 
> Please check callers of org.apache.lucene.search.HitQueue.HitQueue(int,
> boolean), you may found an alternative usage you probably is looking for.
> 
> On Wed, Feb 5, 2020 at 3:01 PM Emir Arnautović 
> wrote:
> 
>> Hi Mikhail,
>> I was thinking in that direction. Do you know where it is in the codebase
>> or which structure is used - I am guessing some array of objects?
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 5 Feb 2020, at 12:54, Mikhail Khludnev  wrote:
>>> 
>>> Absolutely. The searcher doesn't know the number of hits a priori. It eagerly
>>> allocates the results heap before collecting results. The only cap I'm aware
>> of
>>> is maxDocs.
>>> 
>>> On Wed, Feb 5, 2020 at 2:42 PM Emir Arnautović <
>> emir.arnauto...@sematext.com>
>>> wrote:
>>> 
>>>> Hi,
>>>> Does somebody know if requested number of rows is used internally to set
>>>> some temp structures? In other words will query with rows=100 be
>> more
>>>> expensive than query with rows=1000 if number of hits is 1000?
>>>> 
>>>> Thanks,
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> --
>>> Sincerely yours
>>> Mikhail Khludnev
>> 
>> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev



Re: Number of requested rows

2020-02-05 Thread Emir Arnautović
Hi Mikhail,
I was thinking in that direction. Do you know where it is in the codebase or 
which structure is used - I am guessing some array of objects?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Feb 2020, at 12:54, Mikhail Khludnev  wrote:
> 
> Absolutely. The searcher doesn't know the number of hits a priori. It eagerly
> allocates the results heap before collecting results. The only cap I'm aware of
> is maxDocs.
> 
> On Wed, Feb 5, 2020 at 2:42 PM Emir Arnautović 
> wrote:
> 
>> Hi,
>> Does somebody know if requested number of rows is used internally to set
>> some temp structures? In other words will query with rows=100 be more
>> expensive than query with rows=1000 if number of hits is 1000?
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev



Number of requested rows

2020-02-05 Thread Emir Arnautović
Hi,
Does somebody know if requested number of rows is used internally to set some 
temp structures? In other words will query with rows=100 be more expensive 
than query with rows=1000 if number of hits is 1000?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/





Re: How expensive is core loading?

2020-01-29 Thread Emir Arnautović
Hi Rahul,
It depends. You might have warmup queries that would populate caches. For each 
core Solr exposes JMX stats, so you can read just those without “touching" the core. 
You can also try using some of the existing tools for monitoring Solr, but I don’t 
think that any of them provides info about cores that are not loaded - you 
would only see them as occupied disk.
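For already loaded cores, the stats you mention can be read cheaply via the metrics API,
e.g. (key names from memory, double-check them for your version):

curl 'http://localhost:8983/solr/admin/metrics?group=core&prefix=INDEX.sizeInBytes,SEARCHER.searcher.numDocs'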

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 30 Jan 2020, at 01:01, Rahul Goswami  wrote:
> 
> Hi Shawn,
> Thanks for the inputs. I realize I could have been clearer. By "expensive",
> I mean expensive in terms of memory utilization. Eg: Let's say I have a
> core with an index size of 10 GB and is not loaded on startup as per
> configuration. If I load it in order to know the total documents and the
> index size (to gather stats about the Solr server), is the amount of memory
> consumed proportional to the index size in some way?
> 
> Thanks,
> Rahul
> 
> On Wed, Jan 29, 2020 at 6:43 PM Shawn Heisey  wrote:
> 
>> On 1/29/2020 3:01 PM, Rahul Goswami wrote:
>>> 1) How expensive is core loading if I am only getting stats like the
>> total
>>> docs and size of the index (no expensive queries)?
>>> 2) Does the memory consumption on core loading depend on the index size ?
>>> 3) What is a reasonable value for transient cache size in a production
>>> setup with above configuration?
>> 
>> What I would do is issue a RELOAD command.  For non-cloud deployments,
>> I'd use the CoreAdmin API.  For cloud deployments, I'd use the
>> Collections API.  To discover the answer, see how long it takes for the
>> response to come back.
>> 
>> The time interval for a RELOAD is likely different than when Solr starts
>> ... but it sounds like you're more interested in the numbers for core
>> loading after Solr starts than the ones during startup.
>> 
>> Thanks,
>> Shawn
>> 



Re: Easiest way to export the entire index

2020-01-29 Thread Emir Arnautović
Hi Amanda,
I assume that you have all the fields stored so you will be able to export full 
documents.

Several thousand records should not be too much for regular start+rows 
pagination, but the proper way of doing it would be to use cursors. 
Adjust the page size to avoid creating huge responses, and you can use curl or some 
similar tool to avoid the admin console. I did a quick search and there are 
several blog posts with scripts that do what you need.
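The core of such a script is just (collection name and page size are examples):

curl 'http://localhost:8983/solr/mycollection/select?q=*:*&sort=id+asc&rows=500&wt=json&cursorMark=*'

Take nextCursorMark from each response and pass it back as cursorMark until it stops
changing; the only requirement is that the sort includes the uniqueKey field.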

HTH,
Emir

--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 29 Jan 2020, at 15:43, Amanda Shuman  wrote:
> 
> Dear all:
> 
> I've been asked to produce a JSON file of our index so it can be combined
> and indexed with other records. (We run solr 5.3.1 on this project; we're
> not going to upgrade, in part because funding has ended.) The index has
> several thousand rows, but nothing too drastic. Unfortunately, this is too
> much to handle for a simple query dump from the admin console. I tried to
> follow instructions related to running /export directly but I guess the
> export handler isn't installed. I tried to divide the query into rows, but
> after a certain amount it freezes, and it also freezes when I try to limit
> rows (e.g., rows 501-551 freezes the console). Is there any other way to
> export the index short of having to install the export handler considering
> we're not working on this project anymore?
> 
> Thanks,
> Amanda
> 
> --
> Dr. Amanda Shuman
> Researcher and Lecturer, Institute of Chinese Studies, University of
> Freiburg
> Coordinator for the MA program in Modern China Studies
> Database Administrator, The Maoist Legacy 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 96748



Re: In-place re-indexing after DocValue schema change

2020-01-29 Thread Emir Arnautović
Hi,
1. No, it’s not valid. Solr will look at schema to see if it can use docValues 
or if it has to uninvert field and it assumes that all fields will have doc 
values. You might expect from wrong results to errors if you do something like 
that.
2. Not sure if it would work, but it is not better than reindexing everything. 
Lucene segments are immutable and it needs to create new document and flag 
existing as deleted and purge it at segment merge time. If you are trying to 
avoid changing collection name, maybe you could do something like that by using 
aliases: index into new collection, delete existing collection, create alias 
with old collection name pointing to new collection.
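The alias variant would look roughly like this (names are placeholders): index into
mycollection_v2, delete mycollection, then

curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mycollection&collections=mycollection_v2'

so clients keep using the old name.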

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 29 Jan 2020, at 09:37, moscovig  wrote:
> 
> Hi all
> 
> We are about to alter our schema with some DocValue annotations. 
> According to docs, we should whether delete all docs and re-insert, or
> create a new collection with the new schema.
> 
> 1. Is it valid to modify the schema in the current collection, where all
> documents were created without docValue, and having docValue for new docs?
> 
> 2. Is it valid to upsert all documents onto the same collection, having all
> docs re-indexed in-place? It does sound risky, but would it work if we will
> take care of *all* documents?
> 
> Thanks!
> 
> 
> 
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Coming back to search after some time... SOLR or Elastic for text search?

2020-01-16 Thread Emir Arnautović
Hi Jan,
Here is a blog post related to this topic: 
https://sematext.com/blog/solr-vs-elasticsearch-differences/ 

It also contains links to other resources that might help you make a decision.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 15 Jan 2020, at 05:02, Dc Tech  wrote:
> 
> I am a SOLR fan and had implemented it in our company over 10 years ago.
> I moved away from that role and the new search team in the meanwhile
> implemented a proprietary (and expensive) NoSQL-style search engine. That
> project did not go well, and now I am back on the project and reviewing the
> technology stack.
> 
> Some of the team think that ElasticSearch could be a good option,
> especially since we can easily get hosted versions with AWS where we have
> all the contractual stuff sorted out.
> 
> While SOLR definitely seems more advanced  (LTR, streaming expressions,
> graph, and all the knobs and dials for relevancy tuning), Elastic may be
> sufficient for our needs. It does not seem to have LTR out of the box but
> the relevancy tuning knobs and dials seem to be similar to what SOLR has.
> 
> The corpus size is not a challenge - we have about one million documents,
> of which about 1/2 have full text, while the rest are simpler (i.e. company
> directory etc.).
> The query volumes are also quite low (max 5/second at peak).
> We have implemented the content ingestion and processing pipelines already
> in python and SPARK, so most of the data will be pushed in using APIs.
> 
> I would really appreciate any guidance from the community !!



Re: Boosting only top n results that match a criteria

2019-12-28 Thread Emir Arnautović
You could try and see if field collapsing can help you. That could let you 
return top 5 from each class if that is acceptable. Otherwise, you’ll have to 
go with two queries.
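Something along these lines (the field name is made up; check expand behaviour on your
version) gives the top document per class plus up to 5 more for each class via the
expand component:

fq={!collapse field=class}&expand=true&expand.rows=5&expand.sort=score desc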

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Dec 2019, at 19:08, Nitin Arora  wrote:
> 
> Simply boosting on class A1 won't work since there may be many documents
> from that class, all getting equal boost. I want only top 5 docs of that
> class to get the boost.
> 
> On Fri, 27 Dec 2019 at 22:42, Erick Erickson 
> wrote:
> 
>> Yes. Rerank essentially takes the top N results of one query and re-scores
>> them through another query. So just boost the secondary query.
>> 
>> But you may not even have to do that. Just add a boost clause to a single
>> query and boost your class A1 quite high. See “boost” and/or “bq”.
>> 
>> Best,
>> Erick
>> 
>>> On Dec 27, 2019, at 10:57 AM, Nitin Arora 
>> wrote:
>>> 
>>> Hi Erick, I was not able to figure how exactly I will use
>>> RerankQParserPlugin to achieve the desired reranking. I see that I can
>>> rerank all the top RERANK_DOCS results - it is possible that they
>> contain a
>>> hundred results of class A1 or none. But the desired behaviour I want is
>> to
>>> pick (only) the top 5 results of class A1 from my potentially 100s of
>>> results. Then boost them to first page.
>>> Do you think this(or near this) behaviour is possible
>>> using RerankQParserPlugin? Please shed more light how.
>>> 
>>> On Fri, 27 Dec 2019 at 19:48, Erick Erickson 
>>> wrote:
>>> 
>>>> Have you seen RerankQParserPlugin?
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Dec 27, 2019, at 8:49 AM, Emir Arnautović <
>>>> emir.arnauto...@sematext.com> wrote:
>>>>> 
>>>>> Hi Nitin,
>>>>> Can you simply filter and return top 5:
>>>>> 
>>>>> …&fq=class:A1&rows=5
>>>>> 
>>>>> Emir
>>>>> --
>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>> Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 27 Dec 2019, at 13:55, Nitin Arora  wrote:
>>>>>> 
>>>>>> Hello, I have a complex solr query with various boosts applied that
>>>>>> returns, say a few hundred results. Out of these hundreds of results I
>>>> want
>>>>>> to further boost, say the top 5 results that satisfy a particular
>>>> criteria
>>>>>> - e.g. class=A1. So I want the top 5 results from class A1 in my
>>>> existing
>>>>>> results set to come further higher, so that I can show them on the
>> first
>>>>>> page of my final results. How do I achieve this?
>>>>>> I am new to SOLR and this community so apologies if this is
>>>> trivial/repeat.
>>>>>> 
>>>>>> Thanks,
>>>>>> Nitin
>>>>> 
>>>> 
>>>> 
>> 
>> 



Re: Boosting only top n results that match a criteria

2019-12-27 Thread Emir Arnautović
Hi Nitin,
Can you simply filter and return top 5:

…&fq=class:A1&rows=5

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Dec 2019, at 13:55, Nitin Arora  wrote:
> 
> Hello, I have a complex solr query with various boosts applied that
> returns, say a few hundred results. Out of these hundreds of results I want
> to further boost, say the top 5 results that satisfy a particular criteria
> - e.g. class=A1. So I want the top 5 results from class A1 in my existing
> results set to come further higher, so that I can show them on the first
> page of my final results. How do I achieve this?
> I am new to SOLR and this community so apologies if this is trivial/repeat.
> 
> Thanks,
> Nitin



Re: Indexing with customized parameters

2019-12-12 Thread Emir Arnautović
Hi Anuj,
Maybe I am missing something, but this is more a question for an SQL group than 
for a Solr group. I am surprised that you get any records at all. You can consult your 
DB documentation for a more elegant solution, but a brute-force solution, if 
your column is a string, could be:
WHERE sector = '27' OR sector LIKE '27,%' OR sector LIKE '%,27,%' OR sector LIKE 
'%,27' OR sector = '2701'…

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 12 Dec 2019, at 08:38, Anuj Bhargava  wrote:
> 
> Any suggestions?
> 
> Regards,
> 
> Anuj
> 
> On Tue, 10 Dec 2019 at 20:52, Anuj Bhargava  wrote:
> 
>> I am trying to index where the *sector field* has the values 27 and/or
>> 2701 and/or 2702 using the following -
>> 
>> <entity query="SELECT * FROM country WHERE sector = 27 OR sector = 2701 OR
>> sector = 2702"
>>   deltaImportQuery="SELECT * FROM country
>>     WHERE posting_id = '${dataimporter.delta.posting_id}' AND sector = 27
>> OR sector = 2701 OR sector = 2702"
>>   deltaQuery="SELECT posting_id FROM country
>>     WHERE last_modified > '${dataimporter.last_index_time}' AND sector =
>> 27 OR sector = 2701 OR sector = 2702">
>> 
>> 
>> The sector field has comma separated multiple values like -
>> 27,19,527
>> 38,27,62701
>> 2701,49
>> 55,2702,327
>> 
>> The issue is when I run the above, it indexes the fields containing data
>> 27,19,527 and 2701,49 and ignores the other data. It indexes if the data in
>> the sector fields starts with either 27 or 2701 or 2702. It doesn't index
>> if the values 27 or 2701 or 2702 are placed 2nd or 3rd in the sector data
>> field
>> 



Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-04 Thread Emir Arnautović
Hi,
I’ve spent quite a lot time working on a similar issue but I did not think 
about it much since (at the time it was Solr 1.3) so some new features could 
push me to some other direction, but here is what I remember: You cannot rely 
on users entering standardised address format even within one country. Users 
will use both abbreviations and full names. If you need to support Japan - good 
luck. India is a similar story. You might want to preprocess input and do some 
entity extraction and parsing both at index time and query time. Solr scoring 
is not good enough for addresses - it is good for giving you candidates but 
after that you need to apply custom scoring function on either Solr or client 
side. If you have ability to use full blown geocoder, use it at both index and 
query time - you can even store multiple geocoding results with scores and use 
those scores to calculate final score. The good thing is that Solr has many 
extension points and I’ve used almost all but unfortunately, those were 
proprietary plugins and was not able to persuade client to open source it.

Good Luck,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 4 Dec 2019, at 08:13, Paras Lehana  wrote:
> 
> Hi Yeikel,
> 
> I want to stress on three things:
> 
>   1. If you know the probable words which can be written in different ways
>   (like street), you can use Synonyms.
> 
>   2. The longer queries can have different mm's. The mm parameter supports
>   different values for different word lengths. We generally do 100% mm match
>   for 2 words, decrease it words-1 for words > 2 and 70% for words > 7.
> 
>   3. The returned numDocs should not heavily impact your response time.
>   You can always use rows parameter to decrease the result set. Is your issue
>   regarding the ranking of documents or the number of documents? Please give
>   examples of the results that you don't want to get fetched for a query.
> 
> 
> On Tue, 3 Dec 2019 at 10:13,  wrote:
> 
>> Thank you for jumping in @hastings.recurs...@gmail.com
>> 
>> I have an index with raw addresses in a nonstandardized format such as
>> "123 main street" or "main street 123", and I am looking to search this
>> index and pull the closest addresses from another raw input with a similar
>> unpredictable format. Ideally, I am trying to reduce the number of results
>> as much as possible because of time constraints.
>> 
>> At the moment, I am launching a dismax query with the mm(minimum should
>> match) parameter set to a value I am comfortable with(say 50% for example).
>> 
>> In an address such as "123 main street CA 90201 US" , if I execute a query
>> such as: "return addresses that match 50% of the tokens"(dismax,with mm set
>> to 50%),  I will potentially get records with "US Street 123" or "main
>> street CA", which is not something that I am looking for. I understand that
>> I could increase the mm parameter and set it to say "100%", but again, I am
>> not sure if the token "street" should be considered when calculating the mm
>> parameter as I could miss a record such as "123 main CA 90201 US"
>> 
>> For longer addresses, the relevance of "main" or "street" is much lower
>> than keywords such as apartment number or the city.
>> 
>> I am not sure if this is the right way to search for unstructured
>> addresses so we are open for suggestions.
>> 
>> Thank you
>> 
>> -Original Message-
>> From: Dave 
>> Sent: Monday, December 2, 2019 7:50 PM
>> To: solr-user@lucene.apache.org
>> Cc: wun...@wunderwood.org; jornfra...@gmail.com
>> Subject: Re: Is it possible to have different Stop words depending on the
>> value of a field?
>> 
>> I’ll add to that since I’m up. Stopwords are in a practical sense useless
>> and serve no purpose. It’s an old way to save index size that’s not needed
>> any more. You’d need very specific use cases to want to use them. Maybe you
>> do, but generally you never do unless it’s for training a machine or
>> something a bit more on the experimental side. If you can explain *why you
>> think you need stop words that would be helpful in perhaps guiding you to
>> an alternative
>> 
>>> On Dec 2, 2019, at 7:45 PM,   wrote:
>>> 
>>> That makes sense, thank you for the clarification!
>>> 
>>> @wun...@wunderwood.org If you can, please build on your explanation as
>> It sounds relevant.
>>> -Original Message-
>>> From: Dave 
>>> Sent: Monday, December 2, 2019 7:38 PM
>>> To: solr-user@lucene.apache.org
>>> Cc: jornfra...@gmail.com
>>> Subject: Re: Is it possible to have different Stop words depending on
>> the value of a field?
>>> 
>>> It clarifies yes. You need new fields. In this case something like
>> Address_us Address_uk And index and search them accordingly with different
>> stopword files used in different field types, hence the copy field from
>> “address” into as many new fields as needed
>>> 
 On Dec 2, 2019, at 7:33 PM,  
>> 

Re: Solr Case Insensitive Search while preserving cases in Index and allowing Boolean AND/OR searches

2019-12-02 Thread Emir Arnautović
Hi Lewin,
Not sure I follow your example. From what I read, you could have one field 
lowercased and the other not, then filter on the first field and facet on the second. 
There is probably something that I am missing, so an example would help.
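
Just to illustrate what I mean (field/type names are assumptions; the "lowercase" type would be KeywordTokenizer + LowerCaseFilterFactory):

<field name="brand" type="string" indexed="true" stored="true" docValues="true"/>
<field name="brand_lc" type="lowercase" indexed="true" stored="false"/>
<copyField source="brand" dest="brand_lc"/>

and then something like:

fq=brand_lc:(apple OR dell OR hp)&facet=true&facet.field=brand

so the filter is case-insensitive while the facet values keep their original case.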

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 25 Nov 2019, at 23:00, Lewin Joy (TMNA)  wrote:
> 
> Hi,
> 
> I am exploring possibility to do case insensitive filter/facet queries in 
> solr.
> I would also need to preserve the cases in the index.
> This means that the normal LowerCaseFilterFactory approach would not work as 
> facet values will not preserve cases and will show in all lowercase.
> 
> One method was to use facet.contains along with 
> f.fieldname.facet.ignoreCase=true.
> But, I need an option to do more with the search keyword. 
> Example if possible,  would be something like  --> facet.contains=Apple OR 
> Dell OR HP
> 
> Another approach is to do a filter query with general expressions, which gets 
> costly.
> Or copy field with edge Ngram and LowerCaseFilter factory which is again 
> costly.
> 
> 
> Does anyone have any suggestions? It would be good if we have an option with 
> the facet.contains 
> Just need a Boolean capability in there.
> 
> Thanks,
> Lewin



Re: Exact match

2019-12-02 Thread Emir Arnautović
Hi Omer,
From a performance perspective, it is best to index the title as a single 
token: KeywordTokenizer + LowerCaseFilter.

If you need to query that field in some other way, you can index it into another 
field using copyField.
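
For example, a minimal sketch (type/field names are just placeholders):

<fieldType name="string_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="title_exact" type="string_exact" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

A query like title_exact:"united states of america" then matches only that exact (case-insensitive) value.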

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 2 Dec 2019, at 21:43, OTH  wrote:
> 
> Hello,
> 
> What would be the best way to get exact matches (if any) to a query?
> 
> E.g.:  Let's say the document text is:  "united states of america".
> Currently, any query containing one or more of the three words "united",
> "states", or "america" will match with the above document.  I would like a
> way so that the document matches only and only if the query were also
> "united states of america" (case-insensitive).
> 
> Document field type:  TextField
> Index Analyzer: TokenizerChain
> Index Tokenizer: StandardTokenizerFactory
> Index Token Filters: StopFilterFactory, LowerCaseFilterFactory,
> SnowballPorterFilterFactory
> The Query Analyzer / Tokenizer / Token Filters are the same as the Index
> ones above.
> 
> FYI I'm relatively novice at Solr / Lucene / Search.
> 
> Much appreciated
> Omer



Re: How to implement NOTIN operator with Solr

2019-11-19 Thread Emir Arnautović
Right - I didn’t read all your examples. In that case you can use span queries; 
the complexphrase query parser should do the trick:
{!complexphrase df=text}"credit -card"

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Nov 2019, at 11:08, Raboah, Avi  wrote:
> 
> In that case I got only doc1
> 
> -Original Message-
> From: Emir Arnautović [mailto:emir.arnauto...@sematext.com]
> Sent: Tuesday, November 19, 2019 11:51 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to implement NOTIN operator with Solr
> 
> Hi Avi,
> There are span queries, but in this case you don’t need them. It is enough to 
> simply filter out documents that contain "credit card". Your query can be 
> something like:
> +text:credit -text:"credit card"
> If you prefer using boolean operators, you can write it as:
> text:credit AND NOT text:"credit card"
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 19 Nov 2019, at 10:30, Raboah, Avi  wrote:
>> 
>> I am trying to find the documents which hit this example:
>> 
>> q=text:"credit" NOTIN "credit card"
>> 
>> for that query I want to get all the documents which contain the term 
>> "credit" but not as part of the phrase "credit card".
>> 
>> so:
>> 
>> 1. I don't want to get the documents which include just "credit card".
>> 
>> 2. I want to get the documents which include just "credit".
>> 
>> 3. I want to get the documents which include "credit" but not as part of 
>> credit card.
>> 
>> 
>> 
>> for example:
>> 
>> doc1 text: "I want to buy with my credit in my card"
>> 
>> doc2 text: "I want to buy with my credit in my credit card"
>> 
>> doc3 text: "I want to buy with my credit card"
>> 
>> The documents should be returned:
>> 
>> doc1, doc2
>> 
>> I can't find anything about a NOTIN operator implementation in the SOLR docs.
>> 
>> 
>> 



Re: How to implement NOTIN operator with Solr

2019-11-19 Thread Emir Arnautović
Hi Avi,
There are span queries, but in this case you don’t need them. It is enough to 
simply filter out documents that contain "credit card". Your query can be 
something like:
+text:credit -text:"credit card"
If you prefer using boolean operators, you can write it as:
text:credit AND NOT text:"credit card"

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Nov 2019, at 10:30, Raboah, Avi  wrote:
> 
> I am trying to find the documents which hit this example:
> 
> q=text:"credit" NOTIN "credit card"
> 
> for that query I want to get all the documents which contain the term 
> "credit" but not as part of the phrase "credit card".
> 
> so:
> 
> 1. I don't want to get the documents which include just "credit card".
> 
> 2. I want to get the documents which include just "credit".
> 
> 3. I want to get the documents which include "credit" but not as part of 
> credit card.
> 
> 
> 
> for example:
> 
> doc1 text: "I want to buy with my credit in my card"
> 
> doc2 text: "I want to buy with my credit in my credit card"
> 
> doc3 text: "I want to buy with my credit card"
> 
> The documents should be returned:
> 
> doc1, doc2
> 
> I can't find anything about a NOTIN operator implementation in the SOLR docs.
> 
> 
> 



Re: Use of TLog

2019-11-18 Thread Emir Arnautović
Hi,
Copying indices will work and it is a fine approach. An alternative would be to 
join a new node to the cluster, use the add-replica API to copy cores to this new 
node, and then remove the replicas from the old nodes if you want to move cores.
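
A rough sketch with the Collections API (collection/shard/node/replica names are placeholders):

/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=newhost:8983_solr
  ... wait for the new replica to become active, then:
/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node5

Repeat per shard until everything you want to move lives on the new node.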

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Nov 2019, at 13:22, Sripra deep  wrote:
> 
> Hi Emir,
> 
>  Thank you so much. Now the purpose of the TLOG is clear to me.
>   I am trying to copy the index of one Solr cluster in order to build
> another Solr cluster. I am able to make that work, but is this design okay? Or is there
> any other approach I can try to spin up a new cluster with the same
> data as the old one?
> 
> Thanks,
> Sripradeep P
> 
> 
> On Mon, Nov 18, 2019 at 2:12 PM Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Sripradeep,
>> Simplified: TLog files are used to replay index updates from the last
>> successful hard commit in case of some Solr crashes. It is used on the next
>> Solr startup. It does not contain all updates, otherwise, it would
>> duplicate the index size.
>> If you start from these premises, you will understand why it is not copied
>> when adding replicas and why it is not needed and why you cannot use TLog
>> to spin up a new cluster.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 18 Nov 2019, at 06:35, Sripra deep 
>> wrote:
>>> 
>>> Hi Guys,
>>> 
>>> I observed a scenario with the tlog creation and usage and couldn't find
>>> any usage for the tlog.
>>> 
>>> Solr version: 7.1.0
>>> Number of shards = 3
>>> Number of replica = 1
>>> I indexed the about 10k docs into the collection.
>>> 
>>> Scenario 1:
>>> Using add replica collection API, I created one more replica (tried with
>>> both nrt and tlog) neither of the replicas doesn't pull the tlog files.
>>> Only the index files are pulled from master.
>>> * If the tlog is not present in a replica then during ungraceful
>> shutdown
>>> of the solr server how the replicas will regain the index without tlog
>>> files.
>>> * To verify the above scenario, I killed the newly added replica server
>>> with kill -9  command and started back
>>> also stopped the leader node.
>>> 
>>> Questions:
>>> 1) TLog files are not used even in the case of ungraceful shutdown,
>> where
>>> else it will be used?
>>> 2) Tlog files doesn't get copied to the newly added replica so adding a
>>> new replica to the already created collection with data/index is not
>>> advisable?
>>> 3) Is there a way to make the newly added slave node to replicate the
>>> tlog file as it does for the data/index files from leader?
>>> 4) Is it possible to use the Tlog files /index files from an existing
>>> solr server to spin up a new solr cluster?
>>> 
>>> 
>>> It would be much helpful for me to understand the core working of Solr
>>> server.
>>> 
>>> Thanks,
>>> Sripradeep P
>> 
>> 



Re: Use of TLog

2019-11-18 Thread Emir Arnautović
Hi Sripradeep,
Simplified: TLog files are used to replay index updates since the last 
successful hard commit in case Solr crashes. They are used on the next 
Solr startup. They do not contain all updates, otherwise they would duplicate 
the index size.
If you start from these premises, you will understand why the TLog is not copied when 
adding replicas, why it is not needed there, and why you cannot use it to spin up 
a new cluster.
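
For reference, the relevant bits in solrconfig.xml are the update log and the hard commit settings, e.g. (these are just the common defaults):

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

Whatever was indexed after the last hard commit is what the tlog can replay on startup.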

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Nov 2019, at 06:35, Sripra deep  wrote:
> 
> Hi Guys,
> 
> I observed a scenario with the tlog creation and usage and couldn't find
> any usage for the tlog.
> 
> Solr version: 7.1.0
> Number of shards = 3
> Number of replica = 1
> I indexed the about 10k docs into the collection.
> 
> Scenario 1:
>  Using add replica collection API, I created one more replica (tried with
> both nrt and tlog) and neither of the replicas pulls the tlog files.
> Only the index files are pulled from master.
>  * If the tlog is not present in a replica then during ungraceful shutdown
> of the solr server how the replicas will regain the index without tlog
> files.
>  * To verify the above scenario, I killed the newly added replica server
> with kill -9  command and started back
>  also stopped the leader node.
> 
> Questions:
>  1) TLog files are not used even in the case of ungraceful shutdown, where
> else it will be used?
>  2) Tlog files don't get copied to the newly added replica, so adding a
> new replica to the already created collection with data/index is not
> advisable?
>  3) Is there a way to make the newly added slave node to replicate the
> tlog file as it does for the data/index files from leader?
>  4) Is it possible to use the Tlog files /index files from an existing
> solr server to spin up a new solr cluster?
> 
> 
> It would be much helpful for me to understand the core working of Solr
> server.
> 
> Thanks,
> Sripradeep P



Re: Solr 7.2.1 - unexpected docvalues type

2019-11-11 Thread Emir Arnautović
Hi Antony,
Like Erick explained, you still have to preprocess your field in order to be 
able to use doc values. What you can do is use an update request processor chain 
and keep all the logic in Solr. Here is a blog post explaining how it could work: 
https://www.od-bits.com/2018/02/solr-docvalues-on-analysed-field.html 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 10 Nov 2019, at 15:54, Erick Erickson  wrote:
> 
> So “lowercase” is, indeed, a solr.TextField, which is ineligible for 
> docValues. Given that definition, the difference will be that a “string” type 
> is totally un-analyzed, so the values that go into the index and the query 
> itself will be case-sensitive. You’ll have to pre-process both to do the 
> right thing.
> 
>> On Nov 9, 2019, at 6:15 PM, Antony Alphonse  wrote:
>> 
>> Hi Shawn,
>> 
>> Thank you. I switched the fieldType=string and it worked. I might have to
>> check on the use-case to see if "string" will work for us.
>> 
>> I have noted the "lowercase" field type which I believe is similar to the
>> one in schema ver 1.6.
>> 
>> 
>> <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer>
>>     <tokenizer class="solr.KeywordTokenizerFactory" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>   </analyzer>
>> </fieldType>
>> 
>> Thanks,
>> Antony
>> 
>> On Sat, Nov 9, 2019 at 7:52 AM Erick Erickson 
>> wrote:
>> 
>>> We can’t answer whether you should change the field type for two reasons:
>>> 
>>> 1> It depends on your use case.
>>> 2> we don’t know what the field type “lowercase” does. It’s composed of an
>>> analysis chain that you may have changed. And whatever config you are using
>>> may have changed with different releases of Solr.
>>> 
>>> Grouping is generally done on a docValues-eligible field type. AFAIK,
>>> “lowercase” is a solr-text based field so is ineligible for docValues. I’ve
>>> got to guess here, but I’d suggest you start with a fieldType of “string”,
>>> and enable docValues on it.
>>> 
>>> Best,
>>> Erick
>>> 
>>> 
>>> 
 On Nov 9, 2019, at 12:54 AM, Antony Alphonse 
>>> wrote:
 
> 
> Hi Shawn,
> 
 
 I will try that solution. Also I had to mention that the queries that
>>> fail
 with this error has the "group.field":"lowercase". Should I change the
 field type?
 
 Thanks,
 Antony
>>> 
>>> 
> 



Re: Commit disabled

2019-11-08 Thread Emir Arnautović
Hi David,
The index will get updated (a hard commit happens every 15s), but the changes will 
not be visible until you explicitly commit or reload the core. Note that a Solr 
restart reloads cores.
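
If you do want to make the changes visible on demand, an explicit commit is enough, e.g. (host/core are placeholders):

curl 'http://localhost:8983/solr/mycore/update?commit=true'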

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 8 Nov 2019, at 12:19, Villacorta, David (Arlington) 
>  wrote:
> 
> Just want to confirm, given the following config settings at solrconfig.xml:
> 
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
> 
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
> </autoSoftCommit>
> 
> Solr index will not be updated unless created item in Sitecore is manually 
> indexed, right?
> 
> Regards
> David Villacorta
> 



Re: Query on changing FieldType

2019-10-23 Thread Emir Arnautović
Hi Shubham,
My guess is that it might be working for text because it uses o.toString(), so 
there are no runtime errors, while in the case of other types it has to assume a 
class and does class casting. You can check in the logs what sort of error 
happens. But in any case, like Jason pointed out, that is a problem that is 
just waiting to happen somewhere, and the only way to make sure it does not 
happen is to do full reindexing, or to create a new field (with a new name) and 
stop using the one that is wrong. Different field types are indexed in 
different structures and with different defaults (e.g. for docValues), and I 
would not rely on features working after the field type has changed.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 23 Oct 2019, at 08:18, Shubham Goswami  wrote:
> 
> Hi Jason
> 
> Thanks for the response.
> You are right that re-indexing is required after making any changes to the
> schema, and I am re-indexing the docs in which I have
> changed the field types. But here Emir is talking about full re-indexing,
> i.e. deleting the existing core and creating a new one, which is
> time consuming I think. My doubt is that I am not able to change a type
> backed by implementation classes like LongPointField/IntPointField to another
> type backed by those point-based implementation classes.
> 
> But I am able to change into text-related fields like TextField,
> and from TextField to any other int/long type field.
> So I just want to know what the exact dependency on these classes is, so
> that I am able to change the types of some fields?
> 
> Thanks
> Shubham
> 
> On Tue, Oct 22, 2019 at 6:29 PM Jason Gerlowski 
> wrote:
> 
>> Hi Shubbham,
>> 
>> Emir gave you accurate advice - you cannot (safely) change field types
>> without reindexing.  You may avoid errors for a time, and searches may
>> even return the results you expect.  But the type-change is still a
>> ticking time bomb...Solr might try to merge segments down the road or
>> do some other operation and blow up in unexpected ways.  For more
>> information on why this is, see the documentation here:
>> https://lucene.apache.org/solr/guide/8_2/reindexing.html.
>> 
>> Unfortunately there's no way around it.  This, by the way, is why the
>> community strongly recommends against using schema-guessing mode for
>> anything other than experimentation.
>> 
>> Best of luck,
>> 
>> Jason
>> 
>> On Tue, Oct 22, 2019 at 7:42 AM Shubham Goswami
>>  wrote:
>>> 
>>> Hi Emir
>>> 
>>> As you have mentioned above we cannot change field type after indexing
>> once
>>> and we have to do dull re-indexing again, I tried to change field type
>> from
>>> plong to pint which has implemented class solr.LongPointField and
>>> solr.IntPointField respectively and it was showing error as expected.
>>>But when i changed field types from pint/plong to any type which
>>> has implemented class solr.TextField, in this case its working fine and i
>>> am able to index the documents after changing its fieldtype with same and
>>> different id.
>>> 
>>> So i want to know if is there any compatibility with implemented classes
>> ?
>>> 
>>> Thanks
>>> Shubham
>>> 
>>> On Tue, Oct 22, 2019 at 2:46 PM Emir Arnautović <
>>> emir.arnauto...@sematext.com> wrote:
>>> 
>>>> Hi Shubham,
>>>> No you cannot. What you can do is to use copy field or update request
>>>> processor to store is as some other field and use that in your query
>> and
>>>> ignore the old one that will eventually disappear as the result of
>> segment
>>>> merges.
>>>> 
>>>> HTH,
>>>> Emir
>>>> --
>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>> Solr & Elasticsearch Consulting Support Training -
>> http://sematext.com/
>>>> 
>>>> 
>>>> 
>>>>> On 22 Oct 2019, at 10:53, Shubham Goswami >> 
>>>> wrote:
>>>>> 
>>>>> Hi Emir
>>>>> 
>>>>> Thanks for the reply, i got your point.
>>>>> But is there any other way to do like one field could have two or
>> more
>>>>> different types defined ?
>>>>> or  if i talk about my previous query, can we index some data for the
>>>> same
>>>>> field with different unique id after replacing the type ?
>>>

Re: Query on changing FieldType

2019-10-22 Thread Emir Arnautović
Hi Shubham,
No, you cannot. What you can do is use a copy field or an update request processor 
to store it as some other field and use that in your query, ignoring the old 
one, which will eventually disappear as a result of segment merges.
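
For example, a sketch (field names are assumptions) - add a new field with the desired type and copy into it:

<field name="myfield_int" type="pint" indexed="true" stored="true"/>
<copyField source="myfield" dest="myfield_int"/>

Newly indexed documents get the new field, and queries switch to myfield_int.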

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Oct 2019, at 10:53, Shubham Goswami  wrote:
> 
> Hi Emir
> 
> Thanks for the reply, i got your point.
> But is there any other way to do like one field could have two or more
> different types defined ?
> or  if i talk about my previous query, can we index some data for the same
> field with different unique id after replacing the type ?
> 
> Thanks again
> Shubham
> 
> On Tue, Oct 22, 2019 at 1:23 PM Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Shubham,
>> Changing type is not allowed without full reindexing. If you do something
>> like that, Solr will end up with segments with different types for the same
>> field. Remember that segments are immutable and that reindexing some
>> document will be in new segment, but old segment will still be there and at
>> query type Solr will have mismatch between what is stated in schema and
>> what is in segment. In order to change type you have to do full reindexing
>> - create a new collection and reindex all documents.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 22 Oct 2019, at 09:25, Shubham Goswami 
>> wrote:
>>> 
>>> Hello Community
>>> 
>>> I have indexed some documents for which solr has taken its type="plongs"
>> by
>>> auto guessing but i am trying to change its type="pint" and re-indexing
>> the
>>> same data with the same id and indexing the data with different id where
>> id
>>> is unique key but it is showing error.
>>> 
>>> Can somebody please let me know if it is possible or not, if not possible
>>> then why it is not possible as i am using different id as well ? if
>>> possible then how we could achieve it ?
>>> Any help will be appreciated. Thanks in advance.
>>> 
>>> --
>>> *Thanks & Regards*
>>> Shubham Goswami
>>> Enterprise Software Engineer
>>> *HotWax Systems*
>>> *Enterprise open source experts*
>>> cell: +91-7803886288
>>> office: 0731-409-3684
>>> http://www.hotwaxsystems.com
>> 
>> 
> 
> -- 
> *Thanks & Regards*
> Shubham Goswami
> Enterprise Software Engineer
> *HotWax Systems*
> *Enterprise open source experts*
> cell: +91-7803886288
> office: 0731-409-3684
> http://www.hotwaxsystems.com



Re: Query on changing FieldType

2019-10-22 Thread Emir Arnautović
Hi Shubham,
Changing a type is not allowed without full reindexing. If you do something like 
that, Solr will end up with segments with different types for the same field. 
Remember that segments are immutable and that a reindexed document will go into 
a new segment, but the old segment will still be there, and at query time Solr will 
have a mismatch between what is stated in the schema and what is in the segment. In order 
to change a type you have to do full reindexing - create a new collection and 
reindex all documents.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Oct 2019, at 09:25, Shubham Goswami  wrote:
> 
> Hello Community
> 
> I have indexed some documents for which solr has taken its type="plongs" by
> auto guessing but i am trying to change its type="pint" and re-indexing the
> same data with the same id and indexing the data with different id where id
> is unique key but it is showing error.
> 
> Can somebody please let me know if it is possible or not, if not possible
> then why it is not possible as i am using different id as well ? if
> possible then how we could achieve it ?
> Any help will be appreciated. Thanks in advance.
> 
> -- 
> *Thanks & Regards*
> Shubham Goswami
> Enterprise Software Engineer
> *HotWax Systems*
> *Enterprise open source experts*
> cell: +91-7803886288
> office: 0731-409-3684
> http://www.hotwaxsystems.com



Re: Metrics API - Documentation

2019-10-07 Thread Emir Arnautović
Hi Richard,
We do not use the API to collect metrics but JMX; however, I believe those are the 
same (I did not verify it in the code). You can see how we turned those metrics into 
reports/charts, or even use our agent to send data to Prometheus: 
https://github.com/sematext/sematext-agent-integrations/tree/master/solr

You can also see some links to Solr metrics-related blog posts in that repo. If 
you find that managing your own monitoring stack is overwhelming, you can 
try our Solr integration.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Oct 2019, at 12:40, Richard Goodman  wrote:
> 
> Hi there,
> 
> I'm currently working on using the prometheus exporter to provide some 
> detailed insights for our Solr Cloud clusters.
> 
> Using the provided template killed our prometheus server, as well as the 
> exporter due to the size of our clusters (each cluster is around 96 nodes, 
> ~300 collections with 3way replication and 16 shards), so you can imagine the 
> amount of data that comes through /admin/metrics and not filtering it down 
> first.
> 
> I've began working on writing my own template to reduce the amount of data 
> being requested and it's working fine, and I'm starting to build some nice 
> graphs in Grafana.
> 
> The only difficulty I'm having with this, is I'm struggling to find decent 
> documentation on the metrics themselves. I was using the resources metrics 
> reporting - metrics-api 
>  
> and monitoring solr with prometheus and grafana 
> 
>  but there is a lack of information on most metrics. 
> 
> For example:
> "ADMIN./admin/collections.totalTime":6715327903,
> I understand this is a counter, however, I'm not sure what unit this would be 
> represented when displaying it, for example:
> 
> 
> 
> A latency of 1mil, not sure if this means milliseconds, million, etc., 
> Another example would be the GC metrics:
>   "gc.ConcurrentMarkSweep.count":7,
>   "gc.ConcurrentMarkSweep.time":1247,
>   "gc.ParNew.count":16759,
>   "gc.ParNew.time":884173,
> Which when displayed, doesn't give the clearest insight as to what the unit 
> is:
> 
> 
> If anyone has any advice / guidance, that would be greatly appreciated. If 
> there isn't documentation for the API, then this would also be something I'll 
> look into help contributing with too.
> 
> Thanks,
> -- 
> Richard Goodman



Re: Optimizing after daily Replication - Does optimization resets cache?

2019-09-16 Thread Emir Arnautović
Hi Paras,
In the master-slave model, optimisation will affect only the master since slaves will 
get the optimised segments from the master. But note that slaves fetch only what changed on 
the master, and in case of optimisation the entire index will be replicated, so you can 
experience longer replications, and with large indices you can run into network 
issues unless you throttle replication (which will again result in longer 
replication time).

When it comes to caches, there are some per-segment caches, but the majority of 
caches are invalidated whenever the index searcher is reopened, so it does not matter 
if you optimised or not. What you need to do is autowarm caches if you do not want to 
experience performance drops.

When it comes to performance, optimisation will give you some boost since you are 
searching a single segment instead of many, but you should not expect a significant 
difference. Optimisation might be a good option if you do a lot of deletes/updates, 
so your big segments end up with a large number of deleted documents, but you can 
also consider using reclaimDeletesWeight to tune the merge policy or use expungeDeletes.
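
For example (just a sketch - check that your Solr version still supports these knobs), in solrconfig.xml:

<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <double name="reclaimDeletesWeight">3.0</double>
</mergePolicyFactory>

or an explicit commit that targets segments with many deletes (core name is a placeholder):

curl 'http://localhost:8983/solr/mycore/update?commit=true&expungeDeletes=true'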

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 16 Sep 2019, at 08:06, Paras Lehana  wrote:
> 
> Hi,
> 
> *The Scenario: *We *atomically* update (*no deletion*) data for our
> Auto-Suggest index every morning on master. The slave is our production
> server. Master replicates to slave when the traffic is least on production
> server.
> 
> *Objective: *Faster search. Real-Time updates are really not necessary.
> 
> *Query: *I was reading articles about optimizations, merging and
> committing. My basic queries can be summed up as:
> 
>   1. Should we optimize the index after atomic updates before replication?
>   This equates to single optimization operation daily during low traffic
>   (morning).
>   2. Since speed is our focus (and not storage resource at this time) and
>   that optimizations builds the index again by merging, does this mean that
>   the whole cache is also refreshed? If yes, this would mean that we would be
>   flushing cache everyday and if we really want to go ahead with this, I
>   think, we should relook about autowarming.
>   3. What's tradeoff in our scenario? Storage v/s Speed? Or am I missing
>   something. What if I choose not to optimize?
> 
> 
> -- 
> -- 
> Regards,
> 
> *Paras Lehana* [65871]
> Software Programmer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
> 
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
> 
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
> 



Re: Is it possible to skip scoring completely?

2019-09-12 Thread Emir Arnautović
Hi Ash,
I did not check the code, so I am not sure if your question is based on something 
that you found in the codebase or you are just assuming that scoring is called. 
I would assume differently: if you use only fq, then Solr does not have 
anything to score. Also, if you order by something other than score and do not 
request the score to be returned, I would assume that Solr will not calculate the 
score. Again, I didn't have time to check the code, so these are just assumptions.
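
For example, something like

q=*:*&fq=color:red&sort=popularityScore desc&fl=id,popularityScore

keeps scoring out of the picture: the fq is cached and not scored, the match-all q gives constant scores, and the sort never looks at the score.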

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 12 Sep 2019, at 01:27, Ashwin Ramesh  wrote:
> 
> Hi everybody,
> 
> I was wondering if there is a way we can tell solr (7.3+) to run none of
> it's scoring logic. We would like to simply add a set of filter queries and
> order on a specific docValue field.
> 
> e.g. "Give me all fq=color:red documents ORDER on popularityScore DESC"
> 
> Thanks in advance,
> 
> Ash
> 



Re: how to use copy filed as only taken after the suffix

2019-07-15 Thread Emir Arnautović
Hi Uma,
Take a look at 
https://lucene.apache.org/solr/guide/8_1/charfilterfactories.html#solr-patternreplacecharfilterfactory
 

Depending on your use case, this might be enough for you.
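
For example, a sketch of a field type that keeps only the extension (regex and names are assumptions, adjust to your data):

<fieldType name="file_ext" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern=".*\.([A-Za-z0-9]+)$" replacement="$1"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With a copyField from the title into a field of this type, "report.docx" would be indexed as just "docx".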

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 15 Jul 2019, at 14:54, UMA MAHESWAR  wrote:
> 
> hi all,
> I am working as a Solr developer and my ask is: the title field is copied
> via copyField to a content-type field, but I want to index only the suffix after
> the dot in the value (like docx, ppt, jpg). Is there any way?
> 
> thank and regards,
> uma
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Function Query with multi-value field

2019-07-15 Thread Emir Arnautović
Hi Wei,
I see two options:
1. create a custom distance function for colors
2. split each color component into a separate numeric field and calculate the 
distance using the standard set of functions - see the sketch below (I think that Solr does not 
support 3D points).
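
For option 2, assuming single-valued color_r/color_g/color_b fields (pint), something like the dist() function could rank by closeness to a query colour, e.g. #FF00FF:

sort=dist(2, color_r, color_g, color_b, 255, 0, 255) asc

For the multi-valued case you would still need option 1 (a custom function/plugin).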

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 14 Jul 2019, at 00:47, Wei  wrote:
> 
> Any suggestion?
> 
> On Thu, Jul 11, 2019 at 3:03 PM Wei  wrote:
> 
>> Hi,
>> 
>> I have a question regarding function query that operates on multi-value
>> fields.  For the following field:
>> 
>> > multivalued="true"/>
>> 
>> Each value is a hex string representation of RGB value.  for example
>> there are 3 values indexed
>> 
>> #FF00FF- C1
>> #EE82EE   - C2
>> #DA70D6   - C3
>> 
>> How would I write a function query that operates on all values of the
>> field?  Given color S in query, how to calculate the similarities between
>> S and C1/C2/C3 and find which one is the closest?
>> I checked https://lucene.apache.org/solr/guide/6_6/function-queries.html but
>> didn't see an example.
>> 
>> Thanks,
>> Wei
>> 



Re: Solr cloud setup

2019-06-07 Thread Emir Arnautović
Hi Abhishek,
Here is a nice blog post about migrating to SolrCloud: 
https://sematext.com/blog/solr-master-slave-solrcloud-migration/ 


Re number of shards - there is no definite answer - it depends on your 
indexing/search latency requirements. Only tests can tell. Here are some 
thoughts on how to perform such tests: 
https://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Jun 2019, at 09:05, Midas A  wrote:
> 
> Hi ,
> 
> Currently we are on a master-slave architecture and we want to move to a
> SolrCloud architecture.
> How should I decide the number of shards in SolrCloud?
> 
> My current Solr is version 6 and the index size is 300 GB.
> 
> 
> 
> Regards,
> Abhishek Tiwari



Re: Softer version of grouping and/or filter query

2019-05-08 Thread Emir Arnautović
Hi Doug,
It seems to me that you’ve found a way to increase the score for documents that are 
within the selected price range, but also that “a price higher than $150 should not increase 
the score”. I’ll just remind you that scores in Solr are relative to the query and 
that you cannot do much with them other than sort on them, so it should not matter much 
whether you boost the ones that you like more or decrease the score for those that are 
not your first choice.
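
For example, with edismax you could drop the price fq and instead boost the preferred range (the weight is just an assumption to tune):

defType=edismax&q=<your query>&bq=price:[150 TO *]^5

Documents under $150 still match but, all else being equal, score below those at or above $150.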

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 8 May 2019, at 23:56, Doug Reeder  wrote:
> 
> We have a query to return products related to a given product. To give some
> variety to the results, we group by vendor:
> group=true=true=merchantId
> 
> We need at least four results to display. Unfortunately, some categories
> don't have a lot of products, and grouping takes us (say) from five results
> to three.
> 
> Can I "soften" the grouping, so other products by the same vendor will
> appear in the results, but with much lower score?
> 
> 
> Similarly, we have a filter query that only returns products over $150:
> fq=price:[150+TO+*]
> 
> Can this be changed to a q or qf parameter where products less than $150
> have score less than any product priced $150 or more? (A price higher than
> $150 should not increase the score.)



Re: Solr monitoring

2019-04-29 Thread Emir Arnautović
Hi Shruti,
One such tool is our https://sematext.com/spm. It 
provides a Solr integration and the ability to also ship Solr logs.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 29 Apr 2019, at 07:17, shruti suri  wrote:
> 
> Hi, 
> 
> I want to monitor solr Java heap and health check. What will be the best way
> or tool to do it in production.. Also I want to check per collection
> requests and memory utilization. I am using solr6.1.
> 
> Thanks
> Shruti
> 
> 
> 
> -
> Regards
> Shruti
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Optimal RAM to size index ration

2019-04-15 Thread Emir Arnautović
Hi,
The recommendation to have enough RAM to fit your entire index in memory is a 
sort of worst-case scenario (maybe better called the best-case scenario) where 
your index is optimal and fully used all the time. The OS loads into memory the pages that 
are used and those that might be used, so even if you have 40GB of 
index files on disk, files that you do not use will not be loaded into 
memory. Why would you not use some files? Maybe some fields are stored but you 
never retrieve them, or you enabled doc values but never use them, or 
you use only a subset of your documents and old documents are never part of the 
results…
The best thing is to run your Solr with some monitoring tool, see how 
much RAM is actually used on average/max, and use that value with some headroom. 
You can put an alert on used RAM and react if/when your system starts 
requiring more. One such tool is our https://sematext.com/spm

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 15 Apr 2019, at 15:25, SOLR4189  wrote:
> 
> Hi all,
> 
> I have a collection with many shards. Each shard is in separate SOLR node
> (VM) has 40Gb index size, 4 CPU and SSD. 
> 
> When I run performance checking with 50GB RAM (10Gb for JVM and 40Gb for
> index) per node and 25GB RAM (10Gb for JVM and 15Gb for index), I get the
> same query times (percentile80, percentile90 and percentile95). I ran a
> long test - 8 hours of production queries and updates.
> 
> What does it mean? Is having the whole index in RAM not a must? Maybe it is due to the SSD? How
> can I check it?
> 
> Thank you.
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Filters and data cleansing

2019-04-15 Thread Emir Arnautović
Hi Ken,
What Solr returns is the stored value, which is the original value. Analysis is applied 
and its result is stored in the index and used for searching. In order to get 
what you want, you have to move the analysis at least one step earlier. It can be 
moved to an update request processor chain, where you apply the analysis to some 
document field and alter the input document, or you move it completely to the client 
side and apply the analysis before constructing the document that is sent to Solr.

HTH,
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 15 Apr 2019, at 15:50, Ken Wiltshire  wrote:
> 
> hello experts.
> 
> I have what is probably a simple question.  Feels like it should be.  I
> have some filters set up on the index analyzer.  Let's say "LowerCaseFilterFactory" for
> instance.  I understand the data will be indexed as lowercase but when I
> query this same data it is still in its original form.  This works for most
> instances but I'd also like to filter strings on the query response as well so
> that the data returned is scrubbed.  EX: indexed "á"; a query for this document
> should return "a" using ASCIIFolding.  I can see in the analysis screen that it actually
> folds the characters accordingly, but when I retrieve the data via a query it is
> still in its original form.
> 
> Any help is appreciated.
> 
> Best,
> K
> 
> 
> Ken Wiltshire
> *VP of Technology*
> Shoppable®  
> 139 Fulton St.
> New York, NY 10038
> 347 675 5213



Re: nested documents performance

2019-04-15 Thread Emir Arnautović
Hi Roi,
I don’t know the details of your test, but let me try to assume how it looks and 
explain what you observed. With your flat test you are denormalising data, which means 
creating data duplication, so the resulting document set is larger. That 
means more fields/text for Solr/Lucene to analyse and to write to disk. With 
parent/child you are doing some data normalisation, so less data, less analysis, 
fewer disk writes. You should observe similar behaviour with an RDBMS as well, 
and similarly to an RDBMS you pay the price at query time. What is different is that 
RDBMSs are built to work with relational/normalised data while Solr 
is not, so joining is not as fast as with an RDBMS.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 14 Apr 2019, at 08:57, Roi Wexler  wrote:
> 
> Hi,
> we're in the process of testing Solr for its indexing speed, which is very 
> important to our application.
> we've witnessed strange behavior that we wish to understand before using it.
> when we indexed 1M docs it took about 63 seconds but when we indexed the same 
> documents only now we've nested them as 1000 parented with 1000 child 
> documents each, it took only 27 seconds.
> 
> we know that Lucene don't support nested documents for it has a flat object 
> model, and we do see that in fact it does index each of the child documents 
> as a separate document.
> 
> we have tests shows that we get the same results in case we index all 
> documents flat (without childs) or when we index them as 1000 parents with 
> 1000 nested documents each.
> 
> do we miss something here?
> why does it behave like that?
> what kind of constraints does child documents have, or what is the price we 
> pay to get this better index speed?
> we're trying to establish if this is a valid way to get a better performance 
> in index speed..
> 
> any help will be appreciated.



Re: DocValues or stored fields to enable atomic updates

2019-04-05 Thread Emir Arnautović
Hi Andreas,
Stored values are compressed so they should take less disk space. I am thinking that doc 
values might perform better when it comes to executing atomic updates.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Apr 2019, at 12:54, Andreas Hubold  wrote:
> 
> Hi,
> 
> I have a question on schema design: If a single-valued StrField is just used 
> for filtering results by exact value (indexed=true) and its value isn't 
> needed in the search result and not for sorting, faceting or highlighting - 
> should I use docValues=true or stored=true to enable atomic updates? Or even 
> both? I understand that either docValues or stored fields are needed for 
> atomic updates but which of the two would perform better / consume less 
> resources in this scenario?
> 
> Thank you.
> 
> Best regards,
> Andreas
> 
> 
> 



Re: Solr index slow response

2019-03-19 Thread Emir Arnautović
The fact that it is happening with a single client suggests that it is not 
about concurrency. If it is happening equally frequently, I would assume it is 
about the bulks - they might appear the same but can be significantly different 
from Solr’s POV. Is it always an update or an append? If it is append, maybe try to 
isolate a bulk that is taking longer and repeat the same bulk multiple 
times to see if it is always slow.
Maybe try taking a thread dump while a slow bulk is being processed - it might show you 
some pointers on where Solr is spending time. Or maybe even try sending 
single-doc bulks and see if some documents are significantly heavier than others.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Mar 2019, at 13:22, Aaron Yingcai Sun  wrote:
> 
> Yes, the same behavior even with a single thread client. The following page 
> says "In general, adding many documents per update request is faster than one 
> per update request."  but in reality, add many documents per request result 
> in much longer response time, it's not liner, response time of 100 docs per 
> request  is bigger than (the response time of 10 docs per request) * 10.
> 
> 
> https://wiki.apache.org/solr/SolrPerformanceFactors#mergeFactor
> 
> 
> 
> 
> 
> 
> From: Emir Arnautović 
> Sent: Tuesday, March 19, 2019 1:00:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr index slow response
> 
> If you start indexing with just a single thread/client, do you still see slow 
> bulks?
> 
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 19 Mar 2019, at 12:54, Aaron Yingcai Sun  wrote:
>> 
>> "QTime" value is from the solr rest api response, extracted from the 
>> http/json payload.  The "Request time" is what I measured from client side, 
>> it's almost the same value as QTime, just some milliseconds difference.  I 
>> could provide tcpdump to prove that it is really solr slow response.
>> 
>> Those long response time is not really spikes, it's constantly happening, 
>> almost half of the request has such long delay.  The more document added in 
>> one request the more delay it has.
>> 
>> 
>> From: Emir Arnautović 
>> Sent: Tuesday, March 19, 2019 12:30:33 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr index slow response
>> 
>> Just to add different perspective here: how do you send documents to Solr? 
>> Are those log lines from your client? Maybe it is not Solr that is slow. 
>> Could it be network or client itself. If you have some dry run on client, 
>> maybe try running it without Solr to eliminate client from the suspects.
>> 
>> Do you observe similar spikes when you run indexing with less concurrent 
>> clients?
>> 
>> It is really hard to pinpoint the issue without looking at some monitoring 
>> tool.
>> 
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 19 Mar 2019, at 09:17, Aaron Yingcai Sun  wrote:
>>> 
>>> We have around 80 million documents to index, total index size around 3TB,  
>>> I guess I'm not the first one to work with this big amount of data. with 
>>> such slow response time, the index process would take around 2 weeks. While 
>>> the system resource is not very loaded, there must be a way to speed it up.
>>> 
>>> 
>>> To Walter, I don't see why G1GC would improve this, we only do index, no 
>>> query in the background. There is no memory constraint. It feels more 
>>> like some internal threads are blocking each other.
>>> 
>>> 
>>> I used to run with more documents in one request, that give much worse 
>>> response time, 300 documents in one request could end up 20 minutes 
>>> response time, now I changed to max 10 documents in one request, still many 
>>> response time around 30 seconds, while some of them are very fast( ~100 
>>> ms).  How come there are such big difference? the documents size does not 
>>> have such big difference.

Re: Solr index slow response

2019-03-19 Thread Emir Arnautović
If you start indexing with just a single thread/client, do you still see slow 
bulks?

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Mar 2019, at 12:54, Aaron Yingcai Sun  wrote:
> 
> "QTime" value is from the solr rest api response, extracted from the 
> http/json payload.  The "Request time" is what I measured from client side, 
> it's almost the same value as QTime, just some milliseconds difference.  I 
> could provide tcpdump to prove that it is really solr slow response.
> 
> Those long response time is not really spikes, it's constantly happening, 
> almost half of the request has such long delay.  The more document added in 
> one request the more delay it has.
> 
> 
> From: Emir Arnautović 
> Sent: Tuesday, March 19, 2019 12:30:33 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr index slow response
> 
> Just to add different perspective here: how do you send documents to Solr? 
> Are those log lines from your client? Maybe it is not Solr that is slow. 
> Could it be network or client itself. If you have some dry run on client, 
> maybe try running it without Solr to eliminate client from the suspects.
> 
> Do you observe similar spikes when you run indexing with less concurrent 
> clients?
> 
> It is really hard to pinpoint the issue without looking at some monitoring 
> tool.
> 
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 19 Mar 2019, at 09:17, Aaron Yingcai Sun  wrote:
>> 
>> We have around 80 million documents to index, total index size around 3TB,  
>> I guess I'm not the first one to work with this big amount of data. with 
>> such slow response time, the index process would take around 2 weeks. While 
>> the system resource is not very loaded, there must be a way to speed it up.
>> 
>> 
>> To Walter, I don't see why G1GC would improve this, we only do index, no 
>> query in the background. There is no memory constraint. It feels more like 
>> some internal threads are blocking each other.
>> 
>> 
>> I used to run with more documents in one request, that give much worse 
>> response time, 300 documents in one request could end up 20 minutes response 
>> time, now I changed to max 10 documents in one request, still many response 
>> time around 30 seconds, while some of them are very fast( ~100 ms).  How 
>> come there are such big difference? the documents size does not have such 
>> big difference.
>> 
>> 
>> I just want to speed it up since nothing seems to be overloaded.  Are there 
>> any other faster way to index such big amount of data?
>> 
>> 
>> BRs
>> 
>> //Aaron
>> 
>> 
>> From: Walter Underwood 
>> Sent: Monday, March 18, 2019 4:59:20 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr index slow response
>> 
>> Solr is not designed to have consistent response times for updates. You are 
>> expecting Solr to do something that it does not do.
>> 
>> About Xms and Xmx, the JVM will continue to allocate memory until it hits 
>> the max. After it hits the max, it will start to collect garbage. A smaller 
>> Xms just wastes time doing allocations after the JVM is running. Avoid that 
>> by making Xms and Xms the same.
>> 
>> We run all of our JVMs with 8 GB of heap and the G1 collector. You probably 
>> do not need more than 8 GB unless you are doing high-cardinality facets or 
>> some other memory-hungry querying.
>> 
>> The first step would be to use a good configuration. We start our Java 8 
>> JVMs with these parameters:
>> 
>> SOLR_HEAP=8g
>> # Use G1 GC  -- wunder 2017-01-23
>> # Settings from https://wiki.apache.org/solr/ShawnHeisey
>> GC_TUNE=" \
>> -XX:+UseG1GC \
>> -XX:+ParallelRefProcEnabled \
>> -XX:G1HeapRegionSize=8m \
>> -XX:MaxGCPauseMillis=200 \
>> -XX:+UseLargePages \
>> -XX:+AggressiveOpts \
>> "
>> 
>> Use SSD for disks, with total space about 3X as big as the expected index 
>> size.
>> 
>> Have RAM not used by Solr or the OS that is equal to the expected index size.
>> 
>> After that, let’s figure out what the real requirement is. If you must have 
>> consistent response times for update requests, you’ll need to do that 
>> outside of Solr. But if you need high data import rates, we can probably 
>> help.
>> 

Re: Solr index slow response

2019-03-19 Thread Emir Arnautović
>> hardcommit interval should be much lower. Probably something on the
>> order of seconds (15000) instead of hours (currently 360). When the
>> hard commit fires, numerous merges might be firing off in the background
>> due to the volume of documents you are indexing, which might explain the
>> periodic bad response times shown in your logs.
>> 
>> It would depend on your specific scenario, but here's our setup. During
>> long periods of constant indexing of documents to a staging collection (~2
>> billion documents), we have following commit settings
>> 
>> softcommit: 360ms (for periodic validation of data, since it's not in
>> production)
>> hardcommit: openSearcher -> false, 15000ms (no document limit)
>> 
>> This makes the documents available for searching every hour, but doesn't
>> result in the large bursts of IO due to the infrequent hard commits.
>> 
>> For more info, Erick Erickson has a great write up:
>> https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>> 
>> Best,
>> Chris
>> 
>> On Mon, Mar 18, 2019 at 9:36 AM Aaron Yingcai Sun  wrote:
>> 
>>> Hi, Emir,
>>> 
>>> My system used to run with max 32GB, the response time is bad as well.
>>> swap is set to 4GB, there 3.2 free, I doubt swap would affect it since
>>> there is such huge free memory.
>>> 
>>> I could try to with set Xms and Xmx to the same value, but I doubt how
>>> much would that change the response time.
>>> 
>>> 
>>> BRs
>>> 
>>> //Aaron
>>> 
>>> 
>>> From: Emir Arnautović 
>>> Sent: Monday, March 18, 2019 2:19:19 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr index slow response
>>> 
>>> Hi Aaron,
>>> Without looking too much into numbers, my bet would be that it is large
>>> heap that is causing issues. I would decrease it significantly (<30GB) and
>>> see if it is enough for your max load. Also, disable swap or reduce
>>> swappiness to min.
>>> 
>>> In any case, you should install some monitoring tool that would help you
>>> do better analysis when you run into problems. One such tool is our
>>> monitoring solution: https://sematext.com/spm
>>> 
>>> HTH,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>> 
>>> 
>>> 
>>>> On 18 Mar 2019, at 13:14, Aaron Yingcai Sun  wrote:
>>>> 
>>>> Hello, Emir,
>>>> 
>>>> Thanks for the reply, this is the solr version and heap info, standalone
>>> single solr server. I don't have monitor tool connected. only look at
>>> 'top', has not seen cpu spike so far, when the slow response happens, cpu
>>> usage is not high at all, around 30%.
>>>> 
>>>> 
>>>> # curl 'http://.../solr/admin/info/system?wt=json&indent=true'
>>>> {
>>>> "responseHeader":{
>>>>  "status":0,
>>>>  "QTime":27},
>>>> "mode":"std",
>>>> "solr_home":"/ardome/solr",
>>>> "lucene":{
>>>>  "solr-spec-version":"6.5.1",
>>>>  "solr-impl-version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 -
>>> jimczi - 2017-04-21 12:23:42",
>>>>  "lucene-spec-version":"6.5.1",
>>>>  "lucene-impl-version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75
>>> - jimczi - 2017-04-21 12:17:15"},
>>>> "jvm":{
>>>>  "version":"1.8.0_144 25.144-b01",
>>>>  "name":"Oracle Corporation Java HotSpot(TM) 64-Bit Server VM",
>>>>  "spec":{
>>>>"vendor":"Oracle Corporation",
>>>>"name":"Java Platform API Specification",
>>>>"version":"1.8"},
>>>>  "jre":{
>>>>    "vendor":"Oracle Corporation",
>>>>"version":"1.8.0_144"},
>>>>  "vm":{
>>>>"vendor":"Oracle Corporation",
>>>>"name":"Java HotSpot(TM) 64-Bit Server VM",
>>>>"version":"25.144-b0

Re: Solr index slow response

2019-03-18 Thread Emir Arnautović
4GB of swap on a 400GB machine does not make much sense, so disable it. Even 
with 4GB, some pages might be swapped, and if those are Solr pages, it’ll 
affect Solr.

Setting Xms and Xmx to the same value will not solve your issue, but you will 
avoid heap resizes once your heap grows beyond Xms.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Mar 2019, at 14:36, Aaron Yingcai Sun  wrote:
> 
> Hi, Emir,
> 
> My system used to run with max 32GB, the response time is bad as well.  swap 
> is set to 4GB, there 3.2 free, I doubt swap would affect it since there is 
> such huge free memory.
> 
> I could try to with set Xms and Xmx to the same value, but I doubt how much 
> would that change the response time.
> 
> 
> BRs
> 
> //Aaron
> 
> ____
> From: Emir Arnautović 
> Sent: Monday, March 18, 2019 2:19:19 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr index slow response
> 
> Hi Aaron,
> Without looking too much into numbers, my bet would be that it is large heap 
> that is causing issues. I would decrease it significantly (<30GB) and see if 
> it is enough for your max load. Also, disable swap or reduce swappiness to 
> min.
> 
> In any case, you should install some monitoring tool that would help you do 
> better analysis when you run into problems. One such tool is our monitoring 
> solution: https://sematext.com/spm
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 18 Mar 2019, at 13:14, Aaron Yingcai Sun  wrote:
>> 
>> Hello, Emir,
>> 
>> Thanks for the reply, this is the solr version and heap info, standalone 
>> single solr server. I don't have monitor tool connected. only look at 'top', 
>> has not seen cpu spike so far, when the slow response happens, cpu usage is 
>> not high at all, around 30%.
>> 
>> 
>> # curl 'http://.../solr/admin/info/system?wt=json&indent=true'
>> {
>> "responseHeader":{
>>   "status":0,
>>   "QTime":27},
>> "mode":"std",
>> "solr_home":"/ardome/solr",
>> "lucene":{
>>   "solr-spec-version":"6.5.1",
>>   "solr-impl-version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - 
>> jimczi - 2017-04-21 12:23:42",
>>   "lucene-spec-version":"6.5.1",
>>   "lucene-impl-version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - 
>> jimczi - 2017-04-21 12:17:15"},
>> "jvm":{
>>   "version":"1.8.0_144 25.144-b01",
>>   "name":"Oracle Corporation Java HotSpot(TM) 64-Bit Server VM",
>>   "spec":{
>> "vendor":"Oracle Corporation",
>> "name":"Java Platform API Specification",
>> "version":"1.8"},
>>   "jre":{
>> "vendor":"Oracle Corporation",
>> "version":"1.8.0_144"},
>>   "vm":{
>> "vendor":"Oracle Corporation",
>> "name":"Java HotSpot(TM) 64-Bit Server VM",
>> "version":"25.144-b01"},
>>   "processors":32,
>>   "memory":{
>> "free":"69.1 GB",
>> "total":"180.2 GB",
>> "max":"266.7 GB",
>> "used":"111 GB (%41.6)",
>> "raw":{
>>   "free":74238728336,
>>   "total":193470136320,
>>   "max":286331502592,
>>   "used":119231407984,
>>   "used%":41.64103736566334}},
>>   "jmx":{
>> 
>> "bootclasspath":"/usr/java/jdk1.8.0_144/jre/lib/resources.jar:/usr/java/jdk1.8.0_144/jre/lib/rt.jar:/usr/java/jdk1.8.0_144/jre/lib/sunrsasign.jar:/usr/java/jdk1.8.0_144/jre/lib/jsse.jar:/usr/java/jdk1.8.0_144/jre/lib/jce.jar:/usr/java/jdk1.8.0_144/jre/lib/charsets.jar:/usr/java/jdk1.8.0_144/jre/lib/jfr.jar:/usr/java/jdk1.8.0_144/jre/classes",
>> "classpath":"...",
>> "commandLineArgs":["-Xms100G",
>>   "-Xmx300G",
>>   "-DSTOP.PORT=8079",
>>   "-DSTOP.KEY=..",
>>   "-Dsolr.solr.home=..",
>>   "-Djetty.port=8983"],
>> 

Re: Solr index slow response

2019-03-18 Thread Emir Arnautović
Hi Aaron,
You are right - a large heap means that there will be no major GC for a long 
time, but eventually it will happen, and the larger the heap the longer it will 
take. So with a 300GB heap it takes the observed 300s. If you used to run on a 
32GB heap and it was slow, it probably means that heap is too small for your 
load, but if you did not run into OOM, then it means it is starving but can 
still handle the load.
In any case, I would not go with a super large heap. One option is to split 
your server into more Solr instances. That would let you run more instances 
with smaller heaps. I am not sure about your business case and whether it is 
possible to do it without switching to SolrCloud, but if you have 100s of 
clients that are completely separate, you can simply split clients across 
several Solr instances running on the same server and have some logic that 
routes each client to the right instance.
If your average doc size is 5MB, then I would reduce the number of documents 
per request, and that will reduce some load on the heap. Unfortunately, the 
only way to answer similar questions is to run some tests.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Mar 2019, at 14:30, Aaron Yingcai Sun  wrote:
> 
> I'm a bit confused, why large heap size would make it slower?  Isn't that 
> give it enough room to make it not busy doing GC all the time?
> 
> My http/json request contains 100 documents, the total size of the 100 
> documents is around 5M, there are ~100 client sending those requests 
> continuously.
> 
> Previously the JVM is set to max 32 GB ,  the speed was even worse,  now it's 
> running with min 100GB, max 300GB, it use around 100GB.
> 
> 
> this page suggest to use smaller number of documents per request,   
> https://wiki.apache.org/solr/SolrPerformanceProblems
> 
> SolrPerformanceProblems - Solr 
> Wiki
> wiki.apache.org
> General information. There is a performance bug that makes *everything* slow 
> in versions 6.4.0 and 6.4.1. The problem is fixed in 6.4.2. It is described 
> by SOLR-10130.This is highly version specific, so if you are not running one 
> of the affected versions, don't worry about it.
> 
> So I try to reduce the number, still I could get lots of large response QTime:
> 
> 190318-142652.695-160214 DBG1:doc_count: 10 , doc_size: 609  KB, Res code: 
> 200, QTime: 47918 ms, Request time: 47921 ms.
> 190318-142652.704-160179 DBG1:doc_count: 10 , doc_size: 568  KB, Res code: 
> 200, QTime: 36919 ms, Request time: 36922 ms.
> 190318-142652.780-160197 DBG1:doc_count: 10 , doc_size: 609  KB, Res code: 
> 200, QTime: 36082 ms, Request time: 36084 ms.
> 190318-142652.859-160200 DBG1:doc_count: 10 , doc_size: 569  KB, Res code: 
> 200, QTime: 36880 ms, Request time: 36882 ms.
> 190318-142653.131-160148 DBG1:doc_count: 10 , doc_size: 608  KB, Res code: 
> 200, QTime: 37222 ms, Request time: 37224 ms.
> 190318-142653.154-160211 DBG1:doc_count: 10 , doc_size: 541  KB, Res code: 
> 200, QTime: 37241 ms, Request time: 37243 ms.
> 190318-142653.223-163490 DBG1:doc_count: 10 , doc_size: 589  KB, Res code: 
> 200, QTime: 37174 ms, Request time: 37176 ms.
> 190318-142653.359-160154 DBG1:doc_count: 10 , doc_size: 592  KB, Res code: 
> 200, QTime: 37008 ms, Request time: 37011 ms.
> 190318-142653.497-163491 DBG1:doc_count: 10 , doc_size: 583  KB, Res code: 
> 200, QTime: 24828 ms, Request time: 24830 ms.
> 190318-142653.987-160208 DBG1:doc_count: 10 , doc_size: 669  KB, Res code: 
> 200, QTime: 23900 ms, Request time: 23902 ms.
> 190318-142654.114-160208 DBG1:doc_count: 10 , doc_size: 544  KB, Res code: 
> 200, QTime: 121 ms, Request time: 122 ms.
> 190318-142654.233-160208 DBG1:doc_count: 10 , doc_size: 536  KB, Res code: 
> 200, QTime: 113 ms, Request time: 115 ms.
> 190318-142654.354-160208 DBG1:doc_count: 10 , doc_size: 598  KB, Res code: 
> 200, QTime: 116 ms, Request time: 117 ms.
> 190318-142654.466-160208 DBG1:doc_count: 10 , doc_size: 546  KB, Res code: 
> 200, QTime: 107 ms, Request time: 108 ms.
> 190318-142654.586-160208 DBG1:doc_count: 10 , doc_size: 566  KB, Res code: 
> 200, QTime: 114 ms, Request time: 115 ms.
> 190318-142654.687-160208 DBG1:doc_count: 10 , doc_size: 541  KB, Res code: 
> 200, QTime: 96 ms, Request time: 98 ms.
> 190318-142654.768-160208 DBG1:doc_count: 10 , doc_size: 455  KB, Res code: 
> 200, QTime: 75 ms, Request time: 77 ms.
> 190318-142654.870-160208 DBG1:doc_count: 10 , doc_size: 538  KB, Res code: 
> 200, QTime: 97 ms, Request time: 98 ms.
> 190318-142654.967-160208 DBG1:doc_count: 10 , doc_size: 539  KB, Res code: 
> 200, QTime: 92 ms, Request time: 93 ms.
> 190318-142655.096-160208 DBG1:doc_count: 10 , doc_size: 672  KB, Res code: 
> 200, QTime: 124 ms, Request time: 125 ms.
> 190318-142655.210-160208 DBG1:doc_count: 10 , doc_size: 605  KB, Res code: 
> 200, QTime: 108 ms, Request time: 110 ms.
> 190318-142655.304-160208 
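
For reference, the commit policy Chris describes earlier in this thread maps to
solrconfig.xml roughly like this (values illustrative, not a recommendation for
every workload):

    <autoCommit>
      <!-- flush to disk frequently, but do not open a new searcher -->
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- make documents visible to searches once an hour -->
      <maxTime>3600000</maxTime>
    </autoSoftCommit>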

Re: Solr index slow response

2019-03-18 Thread Emir Arnautović
One more thing - it is considered a good practice to use the same value for Xmx 
and Xms.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
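
If Solr is started through bin/solr, this is usually done in solr.in.sh with a
single setting (size illustrative):

    # SOLR_HEAP sets both -Xms and -Xmx to the same value
    SOLR_HEAP="8g"
    # equivalent to: SOLR_JAVA_MEM="-Xms8g -Xmx8g"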



> On 18 Mar 2019, at 14:19, Emir Arnautović  
> wrote:
> 
> Hi Aaron,
> Without looking too much into numbers, my bet would be that it is large heap 
> that is causing issues. I would decrease it significantly (<30GB) and see if 
> it is enough for your max load. Also, disable swap or reduce swappiness to 
> min.
> 
> In any case, you should install some monitoring tool that would help you do 
> better analysis when you run into problems. One such tool is our monitoring 
> solution: https://sematext.com/spm
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 18 Mar 2019, at 13:14, Aaron Yingcai Sun  wrote:
>> 
>> Hello, Emir,
>> 
>> Thanks for the reply, this is the solr version and heap info, standalone 
>> single solr server. I don't have monitor tool connected. only look at 'top', 
>> has not seen cpu spike so far, when the slow response happens, cpu usage is 
>> not high at all, around 30%.
>> 
>> 
>> # curl 'http://.../solr/admin/info/system?wt=json&indent=true'
>> {
>> "responseHeader":{
>>   "status":0,
>>   "QTime":27},
>> "mode":"std",
>> "solr_home":"/ardome/solr",
>> "lucene":{
>>   "solr-spec-version":"6.5.1",
>>   "solr-impl-version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - 
>> jimczi - 2017-04-21 12:23:42",
>>   "lucene-spec-version":"6.5.1",
>>   "lucene-impl-version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - 
>> jimczi - 2017-04-21 12:17:15"},
>> "jvm":{
>>   "version":"1.8.0_144 25.144-b01",
>>   "name":"Oracle Corporation Java HotSpot(TM) 64-Bit Server VM",
>>   "spec":{
>> "vendor":"Oracle Corporation",
>> "name":"Java Platform API Specification",
>> "version":"1.8"},
>>   "jre":{
>> "vendor":"Oracle Corporation",
>> "version":"1.8.0_144"},
>>   "vm":{
>> "vendor":"Oracle Corporation",
>> "name":"Java HotSpot(TM) 64-Bit Server VM",
>> "version":"25.144-b01"},
>>   "processors":32,
>>   "memory":{
>> "free":"69.1 GB",
>> "total":"180.2 GB",
>> "max":"266.7 GB",
>> "used":"111 GB (%41.6)",
>> "raw":{
>>   "free":74238728336,
>>   "total":193470136320,
>>   "max":286331502592,
>>   "used":119231407984,
>>   "used%":41.64103736566334}},
>>   "jmx":{
>> 
>> "bootclasspath":"/usr/java/jdk1.8.0_144/jre/lib/resources.jar:/usr/java/jdk1.8.0_144/jre/lib/rt.jar:/usr/java/jdk1.8.0_144/jre/lib/sunrsasign.jar:/usr/java/jdk1.8.0_144/jre/lib/jsse.jar:/usr/java/jdk1.8.0_144/jre/lib/jce.jar:/usr/java/jdk1.8.0_144/jre/lib/charsets.jar:/usr/java/jdk1.8.0_144/jre/lib/jfr.jar:/usr/java/jdk1.8.0_144/jre/classes",
>> "classpath":"...",
>> "commandLineArgs":["-Xms100G",
>>   "-Xmx300G",
>>   "-DSTOP.PORT=8079",
>>   "-DSTOP.KEY=..",
>>   "-Dsolr.solr.home=..",
>>   "-Djetty.port=8983"],
>> "startTime":"2019-03-18T09:35:27.892Z",
>> "upTimeMS":9258422}},
>> "system":{
>>   "name":"Linux",
>>   "arch":"amd64",
>>   "availableProcessors":32,
>>   "systemLoadAverage":14.72,
>>   "version":"3.0.101-311.g08a8a9d-default",
>>   "committedVirtualMemorySize":2547960700928,
>>   "freePhysicalMemorySize":4530696192,
>>   "freeSwapSpaceSize":3486846976,
>>   "processCpuLoad":0.3257436126790475,
>>   "processCpuTime":9386945000,
>>   "systemCpuLoad":0.3279781055816521,
>>   "totalPhysicalMe

Re: Solr index slow response

2019-03-18 Thread Emir Arnautović
Hi Aaron,
Without looking too much into the numbers, my bet would be that it is the large 
heap that is causing issues. I would decrease it significantly (<30GB) and see 
if it is enough for your max load. Also, disable swap or reduce swappiness to 
the minimum.

In any case, you should install some monitoring tool that would help you do 
better analysis when you run into problems. One such tool is our monitoring 
solution: https://sematext.com/spm

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Mar 2019, at 13:14, Aaron Yingcai Sun  wrote:
> 
> Hello, Emir,
> 
> Thanks for the reply, this is the solr version and heap info, standalone 
> single solr server. I don't have monitor tool connected. only look at 'top', 
> has not seen cpu spike so far, when the slow response happens, cpu usage is 
> not high at all, around 30%.
> 
> 
> # curl 'http://.../solr/admin/info/system?wt=json&indent=true'
> {
>  "responseHeader":{
>"status":0,
>"QTime":27},
>  "mode":"std",
>  "solr_home":"/ardome/solr",
>  "lucene":{
>"solr-spec-version":"6.5.1",
>"solr-impl-version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - 
> jimczi - 2017-04-21 12:23:42",
>"lucene-spec-version":"6.5.1",
>"lucene-impl-version":"6.5.1 cd1f23c63abe03ae650c75ec8ccb37762806cc75 - 
> jimczi - 2017-04-21 12:17:15"},
>  "jvm":{
>"version":"1.8.0_144 25.144-b01",
>"name":"Oracle Corporation Java HotSpot(TM) 64-Bit Server VM",
>"spec":{
>  "vendor":"Oracle Corporation",
>  "name":"Java Platform API Specification",
>  "version":"1.8"},
>"jre":{
>  "vendor":"Oracle Corporation",
>  "version":"1.8.0_144"},
>"vm":{
>  "vendor":"Oracle Corporation",
>  "name":"Java HotSpot(TM) 64-Bit Server VM",
>  "version":"25.144-b01"},
>"processors":32,
>"memory":{
>  "free":"69.1 GB",
>  "total":"180.2 GB",
>  "max":"266.7 GB",
>  "used":"111 GB (%41.6)",
>  "raw":{
>"free":74238728336,
>"total":193470136320,
>"max":286331502592,
>"used":119231407984,
>"used%":41.64103736566334}},
>"jmx":{
>  
> "bootclasspath":"/usr/java/jdk1.8.0_144/jre/lib/resources.jar:/usr/java/jdk1.8.0_144/jre/lib/rt.jar:/usr/java/jdk1.8.0_144/jre/lib/sunrsasign.jar:/usr/java/jdk1.8.0_144/jre/lib/jsse.jar:/usr/java/jdk1.8.0_144/jre/lib/jce.jar:/usr/java/jdk1.8.0_144/jre/lib/charsets.jar:/usr/java/jdk1.8.0_144/jre/lib/jfr.jar:/usr/java/jdk1.8.0_144/jre/classes",
>  "classpath":"...",
>  "commandLineArgs":["-Xms100G",
>"-Xmx300G",
>"-DSTOP.PORT=8079",
>"-DSTOP.KEY=..",
>"-Dsolr.solr.home=..",
>"-Djetty.port=8983"],
>  "startTime":"2019-03-18T09:35:27.892Z",
>  "upTimeMS":9258422}},
>  "system":{
>"name":"Linux",
>"arch":"amd64",
>"availableProcessors":32,
>"systemLoadAverage":14.72,
>"version":"3.0.101-311.g08a8a9d-default",
>"committedVirtualMemorySize":2547960700928,
>"freePhysicalMemorySize":4530696192,
>"freeSwapSpaceSize":3486846976,
>"processCpuLoad":0.3257436126790475,
>"processCpuTime":9386945000,
>"systemCpuLoad":0.3279781055816521,
>"totalPhysicalMemorySize":406480175104,
>"totalSwapSpaceSize":4302303232,
>"maxFileDescriptorCount":32768,
>"openFileDescriptorCount":385,
>"uname":"Linux ... 3.0.101-311.g08a8a9d-default #1 SMP Wed Dec 14 10:15:37 
> UTC 2016 (08a8a9d) x86_64 x86_64 x86_64 GNU/Linux\n",
>"uptime":" 13:09pm  up 5 days 21:23,  7 users,  load average: 14.72, 
> 12.28, 11.48\n"}}
> 
> 
> 
> 
> 
> From: Emir Arnautović 
> Sent: Monday, March 18, 2019 12:10:30 PM
> To: solr-user@lucene.apache.or

Re: Solr index slow response

2019-03-18 Thread Emir Arnautović
Hi Aaron,
Which version of Solr? How did you configure your heap? Is it standalone Solr 
or SolrCloud? A single server? Do you use some monitoring tool? Do you see some 
spikes, pauses or CPU usage is constant?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Mar 2019, at 11:47, Aaron Yingcai Sun  wrote:
> 
> Hello, Solr!
> 
> 
> We are having some performance issue when try to send documents for solr to 
> index. The repose time is very slow and unpredictable some time.
> 
> 
> Solr server is running on a quit powerful server, 32 cpus, 400GB RAM, while 
> 300 GB is reserved for solr, while this happening, cpu usage is around 30%, 
> mem usage is 34%.  io also look ok according to iotop. SSD disk.
> 
> 
> Our application send 100 documents to solr per request, json encoded. the 
> size is around 5M each time. some times the response time is under 1 seconds, 
> some times could be 300 seconds, the slow response happens very often.
> 
> 
> "Soft AutoCommit: disabled", "Hard AutoCommit: if uncommited for 360ms; 
> if 100 uncommited docs"
> 
> 
> There are around 100 clients sending those documents at the same time, but 
> each for the client is blocking call which wait the http response then send 
> the next one.
> 
> 
> I tried to make the number of documents smaller in one request, such as 20, 
> but  still I see slow response time to time, like 80 seconds.
> 
> 
> Would you help to give some hint how improve the response time?  solr does 
> not seems very loaded, there must be a way to make the response faster.
> 
> 
> BRs
> 
> //Aaron
> 
> 
> 



Re: questions regrading stored fields role in query time

2019-02-26 Thread Emir Arnautović
Hi Saurabh,
DocValues can be used for retrieving field values (note that order will not be 
preserved in the case of a multivalued field), but they are also stored in 
files, just in different structures. Doc values will load some structures into 
memory, but will also use memory-mapped files to access values (not familiar 
with this code and just assuming), so in any case they will use “shared” OS 
caches. Those caches will be affected when loading stored fields to do a 
partial update. It will also take some memory when indexing documents. That is 
why storing fields and doing partial updates could indirectly affect query 
performance. But that might be insignificant, and only a test can tell for 
sure. Unless you have a small index and enough RAM - then I can also tell that 
for sure.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
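
As a concrete illustration of the partial updates being discussed, an atomic
update only sends the unique key plus the changed field (collection and field
names are hypothetical):

    curl 'http://localhost:8983/solr/products/update' \
      -H 'Content-Type: application/json' \
      -d '[{"id": "SKU-1", "price_i": {"set": 1999}}]'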



> On 26 Feb 2019, at 11:21, Saurabh Sharma  wrote:
> 
> Hi Emir,
> 
> I had this question in my mind if I store my only returnable field as
> docValue in RAM.will my stored documents be referenced while constructing
> the response after the query. Ideally, as the field asked to return i.e fl
> is already in RAM then documents on disk should not be consulted for this
> field.
> 
> Any insight about the usage of docValued field vs stored field and
> preference order will help here in understanding the situation in a better
> way.
> 
> Thanks
> Saurabh
> 
> On Tue, Feb 26, 2019 at 2:41 PM Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Saurabh,
>> Welcome to the channel!
>> Storing fields should not affect query performances directly if you use
>> lazy field loading and it is the default set. And it should not affect at
>> all if you have enough RAM compared to index size. Otherwise OS caches
>> might be affected by stored fields. The best way to tell is to tests with
>> expected indexing/partial updates load and see if/how much it affects
>> performances.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 26 Feb 2019, at 09:34, Saurabh Sharma 
>> wrote:
>>> 
>>> Hi All ,
>>> 
>>> 
>>> I am new here on this channel.
>>> Few days back we upgraded our solr cloud to version 7.3 and doing
>> real-time
>>> document posting with 15 seconds soft commit and 2 minutes hard commit
>>> time.As of now we posting full document to solr which includes data
>>> accumulations from various sources.
>>> 
>>> Now we want to do partial updates.I went through the documentation and
>>> found that all the fields should be stored or docValues for partial
>>> updates. I have few questions regarding this?
>>> 
>>> 1) In case i am just fetching only 1 field while making query.What will
>> the
>>> performance impact due to all fields being stored? Lets say i have an
>> "id"
>>> field and i do have doc value true for the field, will solr use stored
>>> fields in this case? will it load whole document in RAM ?
>>> 
>>> 2)What's the impact of large stored fields (.fdt) on query time
>>> performance. Do query time even depend on the stored field or they just
>>> depend on indexes?
>>> 
>>> 
>>> Thanks and regards
>>> Saurabh
>> 
>> 



Re: questions regrading stored fields role in query time

2019-02-26 Thread Emir Arnautović
Hi Saurabh,
Welcome to the channel!
Storing fields should not affect query performance directly if you use lazy 
field loading, which is enabled by default. And it should not affect it at all 
if you have enough RAM compared to the index size; otherwise OS caches might be 
affected by stored fields. The best way to tell is to test with the expected 
indexing/partial update load and see if/how much it affects performance.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 Feb 2019, at 09:34, Saurabh Sharma  wrote:
> 
> Hi All ,
> 
> 
> I am new here on this channel.
> Few days back we upgraded our solr cloud to version 7.3 and doing real-time
> document posting with 15 seconds soft commit and 2 minutes hard commit
> time.As of now we posting full document to solr which includes data
> accumulations from various sources.
> 
> Now we want to do partial updates.I went through the documentation and
> found that all the fields should be stored or docValues for partial
> updates. I have few questions regarding this?
> 
> 1) In case i am just fetching only 1 field while making query.What will the
> performance impact due to all fields being stored? Lets say i have an "id"
> field and i do have doc value true for the field, will solr use stored
> fields in this case? will it load whole document in RAM ?
> 
> 2)What's the impact of large stored fields (.fdt) on query time
> performance. Do query time even depend on the stored field or they just
> depend on indexes?
> 
> 
> Thanks and regards
> Saurabh



Re: What's the deal with dataimporthandler overwriting indexes?

2019-02-12 Thread Emir Arnautović
Hi Joakim,
This might not be what you expect, but it is expected behaviour. When you do 
clean=true, DIH will first delete all records. That is how it works in both 
master/slave and Cloud. The difference might be that you disabled replication 
or disabled auto commits in your old setup, so it was not visible. You can 
disable auto commits in Cloud and you will keep your old index until the next 
commit, but that is not the recommended way. What is usually done when you want 
to control what becomes the active index is to use aliases and do the full 
import into a new collection. After you verify that everything is ok, you 
update the alias to point to the new collection and it becomes the active one. 
You can keep the old one so you can roll back in case you notice some issues, 
or simply drop it once the alias is updated.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
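
The alias approach above could look like this with the Collections API
(collection and alias names are hypothetical):

    # import into a brand new collection while the live one keeps serving queries
    curl 'http://localhost:8983/solr/products_v2/dataimport?command=full-import&clean=true'

    # once the new index is verified, repoint the alias that searches use
    curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2'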



> On 12 Feb 2019, at 10:46, Joakim Hansson  wrote:
> 
> Hi!
> We are currently upgrading from solr 6.2 master slave setup to solr 7.6
> running solrcloud.
> I dont know if I've missed something really trivial, but everytime I start
> a full import (dataimport?command=full-import=true=true) the
> old index gets overwritten by the new import.
> 
> In 6.2 this wasn't really a problem since I could disable replication in
> the API on the master and enable it once the import was completed.
> With 7.6 and solrcloud we use NRT-shards and replicas since those are the
> only ones that support rule-based replica placement and whenever I start a
> new import the old index is overwritten all over the solrcloud cluster.
> 
> I have tried changing to clean=false, but that makes the import finish
> without adding any docs.
> Doesn't matter if I use soft or hard commits.
> 
> I don't get the logic in this. Why would you ever want to delete an
> existing index before there is a new one in place? What is it I'm missing
> here?
> 
> Please enlighten me.



Re: Load balance writes

2019-02-11 Thread Emir Arnautović
Hi Boban,
Not sure if there is a SolrJ port to Go, but you can take SolrJ as a model to 
build your own ZK-aware client that groups updates and sends them to the shard 
leaders. I see that there are a couple of Solr Go clients, so you might first 
check if one of them already supports this, or whether it makes sense to 
contribute that part to the one of your choice.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 11 Feb 2019, at 16:09, Boban Acimovic  wrote:
> 
> Thank you Emir for quick reply. I use home brewed Go client and write just to 
> one of 12 available nodes. I believe I should find out this smart way to 
> handle this :)
> 
> 
> 
> 
>> On 11. Feb 2019, at 15:21, Emir Arnautović  
>> wrote:
>> 
>> Hi Boban,
>> If you use SolrCloud  Solrj client and initialise it with ZK, it should be 
>> aware of masters and send documents in a smart way.
>> 
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 



Re: Load balance writes

2019-02-11 Thread Emir Arnautović
Hi Boban,
If you use the SolrCloud SolrJ client and initialise it with ZK, it should be 
aware of the shard leaders and send documents in a smart way.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 11 Feb 2019, at 12:18, Boban Acimovic  wrote:
> 
> I am wondering would I get performance benefits if I distribute writes to 
> Solr nodes by sending documents exactly to the master of collection where the 
> document belongs? My idea is that this would save some load between the 
> cluster nodes and improve performances. How to do writes in the best way? 
> Thank you in advance.



Re: shingles + stop words

2018-12-10 Thread Emir Arnautović
Hi David,
As you already observed, shingles concatenate tokens based on positions, and in 
the case of stopwords that results in an empty string (you can configure it to 
be something else with the fillerToken option).
You can do the following:
1. if you do not have too many stopwords, you could use 
PatternReplaceCharFilter to remove stopwords before the input hits the 
tokenizer. That way stopwords will not increase positions and it will result in 
the expected shingles. The downside is that you lose the managed part of the 
stopwords and will have to reload cores in order to change them (see the sketch 
after this message).
2. customise the stopword filter not to increment positions when it finds a 
stopword.
3. customise the shingle filter to be able to add the desired flag.

HTH,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
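
Option 1 above could look roughly like this (the stopword pattern and the
tokenizer are placeholders for whatever the schema actually uses):

    <fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- drop stopwords before tokenizing, so token positions never skip over them -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="\b(a|an|and|of|the)\b" replacement=" "/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
                outputUnigrams="false" fillerToken=""/>
      </analyzer>
    </fieldType>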



> On 7 Dec 2018, at 15:18, David Hastings  wrote:
> 
> Hey there, I have a field type defined as such:
> 
>     <fieldType name="..." class="solr.TextField">
>       <analyzer>
>         <tokenizer class="..."/>
>         <filter class="solr.ManagedStopFilterFactory" managed="..."/>
>         <filter class="solr.ShingleFilterFactory" outputUnigrams="false" fillerToken="" maxShingleSize="2"/>
>       </analyzer>
>     </fieldType>
> 
> but whats happening is the shingles being returned are often times "
> nonstopword"
> with the space being defined as the filter token.  I was hoping that the
> ManagedStopFilterFactory would have removed the stop words completely
> before going to the shingle factory, and would have returned "nonstopword1
> nonstopword2" with an indexed value of
> "nonstopword1 stopword1 stopword2 nonstopword2" but obviously isnt the
> case.  is there a way to force it as such?
> 
> Thanks, David



Re: Delete by query in SOLR 6.3

2018-11-15 Thread Emir Arnautović
Hi Rakesh,
Since Solr has to maintain eventual consistency across all replicas, it has to 
block updates while a DBQ is running. Here is a blog post with a high-level 
explanation of the issue: 
http://www.od-bits.com/2018/03/dbq-or-delete-by-query.html 


You should query first and then delete by IDs in order to avoid the issues caused by DBQ.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
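
The query-then-delete-by-id approach could be scripted along these lines
(collection, field and ids are hypothetical):

    # 1. collect the ids of the documents that match the delete criteria
    curl 'http://localhost:8983/solr/mycollection/select?q=status_s:expired&fl=id&rows=10000&wt=json'

    # 2. delete exactly those documents by id instead of running a delete-by-query
    curl 'http://localhost:8983/solr/mycollection/update' \
      -H 'Content-Type: application/json' \
      -d '{"delete": ["id-1", "id-2", "id-3"]}'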



> On 15 Nov 2018, at 06:09, RAKESH KOTE  wrote:
> 
> Hi,   We are using SOLR 6.3 in cloud and we have created 2 collections in a 
> single SOLR cluster consisting of 20 shards and 3 replicas each(overall 20X3 
> = 60 instances). The first collection has close to 2.5 billion records and 
> the second collection has 350 million records. Both the collection uses the 
> same instances which has 4 cores and 26 GB RAM (10 -12 GB assigned for Heap 
> and 14 GB assigned for OS).The first collection's index size is close to 50GB 
> and second collection index size is close to 5 GB in each of the instances. 
> We are using the default solrconfig values and the autoCommit and softCommits 
> are set to 5 minutes. The SOLR cluster is supported by 3 ZK.
> We are able to reach 5000/s updates and we are using solrj to index the data 
> to solr. We also delete the documents in each of the collection periodically 
> using solrj  delete by query method(we use a non-id filed in delete 
> query).(we are using java 1.8) The updates happens without much issues but 
> when we try to delete, it is taking considerable amount of time(close to 20 
> sec on an average but some of them takes more than 4-5 mins) which slows down 
> the whole application. We don't do an explicit commit after deletion and let 
> the autoCommit take care of it for every 5 mins. Since we are not doing a 
> commit we are wondering why the delete is taking more time comparing to 
> updates which are very fast and finishes in less than 50ms - 100 ms. Could 
> you please let us know the reason or how the deletes are different than the 
> updates operation in SOLR.
> with warm regards,RK.



Re: A different result with filters

2018-10-26 Thread Emir Arnautović
Hi,
The second query is equivalent to:
> {
>  "query": "*:*",
>  "limit": 0,
>  "filter": [
>"{!parent which=kind_s:edition}condition_s:0",
>"price_i:[* TO 75]"
>  ]
> }


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 Oct 2018, at 08:49, Владислав Властовский  wrote:
> 
> Hi, I use 7.5.0 Solr
> 
> Why do I get two different results for similar requests?
> 
> First req/res:
> {
>  "query": "*:*",
>  "limit": 0,
>  "filter": [
>"{!parent which=kind_s:edition}condition_s:0",
>"{!parent which=kind_s:edition}price_i:[* TO 75]"
>  ]
> }
> 
> {
>  "response": {
>"numFound": 453,
>"start": 0,
>"docs": []
>  }
> }
> 
> And second query:
> {
>  "query": "*:*",
>  "limit": 0,
>  "filter": [
>"{!parent which=kind_s:edition}condition_s:0 AND price_i:[* TO 75]"
>  ]
> }
> 
> {
>  "response": {
>"numFound": 452,
>"start": 0,
>"docs": []
>  }
> }



Re: Internal Solr communication question

2018-10-25 Thread Emir Arnautović
Hi Fernando,
I did not look at the code and am not sure if there is special handling in the 
case of a single-shard collection, but Solr does not have to choose the local 
shard to query. It assumes that one node will receive all requests and that it 
therefore needs to balance the load. What you can do is add 
preferLocalShards=true to your queries to make sure local shards are queried.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
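
For example (host and collection name are hypothetical):

    # ask the node that received the request to use its own replicas where possible
    curl 'http://localhost:8983/solr/mycollection/select?q=*:*&preferLocalShards=true'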



> On 25 Oct 2018, at 16:18, Fernando Otero  wrote:
> 
> Hey Shawn
>Thanks for your answer!. I changed the config to 1 shard with 7
> replicas but I still see communication between nodes, is that expected?
> Each node has 1 shard so it should have all the data needed to compute, I
> don't get why I'm seeing communication between them.
> 
> Thanks
> 
> On Tue, Oct 23, 2018 at 2:21 PM Shawn Heisey  wrote:
> 
>> On 10/23/2018 9:31 AM, Fernando Otero wrote:
>>> Hey all
>>>  I'm running some tests on Solr cloud (10 nodes, 3 shards, 3
>> replicas),
>>> when I run the queries I end up seeing 7x traffic ( requests / minute)
>> in
>>> Newrelic.
>>> 
>>> Could it be that the internal communication between nodes is done through
>>> HTTP and newrelic counts those calls?
>> 
>> The inter-node communication is indeed done over HTTP, using the same
>> handlers that clients use, and if you have something watching Solr's
>> statistics or watching Jetty's counters, one of the counters will go up
>> when an inter-node request happens.
>> 
>> With 3 shards, one request coming in will generate as many as six
>> additional requests -- one request to a replica for each shard, and then
>> another request to each shard that has matches for the query, to
>> retrieve the documents that will be in the response. The node that
>> received the initial request will compile the results from all the
>> shards and send them back in response to the original request.
>> Nutshell:  One request from a client expands. With three shards, that
>> will be four to seven requests total.  If you have 10 shards, it will be
>> between 11 and 21 total requests.
>> 
>> Thanks,
>> Shawn
>> 
>> 
> 
> -- 
> 
> Fernando Otero
> 
> Sr Engineering Manager, Panamera
> 
> Buenos Aires - Argentina
> 
> Mobile: +54 911 67697108
> 
> Email:  fernando.ot...@olx.com



Re: indexed and stored for fields that are sources of a copy field

2018-10-22 Thread Emir Arnautović
Hi Chris,
Even better - you can contribute the documentation yourself: you can create a 
JIRA issue with a patch.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Oct 2018, at 15:43, Chris Wareham  
> wrote:
> 
> Hi Emir,
> 
> Many thanks for the confirmation. I'd kind of inferred this was correct
> from the paragraph starting with "Copying is done at the stream source
> level", but it would be good to mention it in the "Copying Fields"
> section of the Solr documentation. Should I create a JIRA issue asking
> for this?
> 
> Regards,
> 
> Chris
> 
> On 22/10/2018 14:28, Emir Arnautović wrote:
>> Hi Chris,
>> Yes you can do that. There is also type=“ignored” that you can use in such 
>> scenario.
>> HTH,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>> On 22 Oct 2018, at 15:22, Chris Wareham  
>>> wrote:
>>> 
>>> Hi folks,
>>> 
>>> I have a number of fields defined in my managed-schema file that are used 
>>> as the sources for a copy field:
>>> 
>>>   <field name="body"      type="..." indexed="true" stored="true"/>
>>>   <field name="sectors"   type="..." indexed="true" stored="true"  multiValued="true"/>
>>>   <field name="locations" type="..." indexed="true" stored="true"  multiValued="true"/>
>>> 
>>>   <field name="..."       type="..." indexed="true" stored="false" multiValued="true"/>
>>> 
>>>   <copyField source="body"      dest="..."/>
>>>   <copyField source="sectors"   dest="..."/>
>>>   <copyField source="locations" dest="..."/>
>>> 
>>> Can I set both the indexed and stored values to false for the body, sectors 
>>> and locations fields since I don't want to search or retrieve them?
>>> 
>>> Regards,
>>> 
>>> Chris



Re: indexed and stored for fields that are sources of a copy field

2018-10-22 Thread Emir Arnautović
Hi Chris,
Yes, you can do that. There is also the “ignored” field type that you can use 
in such a scenario.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
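
A sketch of what that could look like, with the source fields neither indexed
nor stored and only the copyField destination searchable (field types and the
destination name are assumptions):

    <field name="body"      type="text_general" indexed="false" stored="false"/>
    <field name="sectors"   type="text_general" indexed="false" stored="false" multiValued="true"/>
    <field name="locations" type="text_general" indexed="false" stored="false" multiValued="true"/>

    <!-- only the copyField target is indexed and searched -->
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>

    <copyField source="body"      dest="text"/>
    <copyField source="sectors"   dest="text"/>
    <copyField source="locations" dest="text"/>

Alternatively, the three source fields could simply use the built-in "ignored"
field type, which amounts to the same thing.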



> On 22 Oct 2018, at 15:22, Chris Wareham  
> wrote:
> 
> Hi folks,
> 
> I have a number of fields defined in my managed-schema file that are used as 
> the sources for a copy field:
> 
>   <field name="body"      type="..." indexed="true" stored="true"/>
>   <field name="sectors"   type="..." indexed="true" stored="true"  multiValued="true"/>
>   <field name="locations" type="..." indexed="true" stored="true"  multiValued="true"/>
> 
>   <field name="..."       type="..." indexed="true" stored="false" multiValued="true"/>
> 
>   <copyField source="body"      dest="..."/>
>   <copyField source="sectors"   dest="..."/>
>   <copyField source="locations" dest="..."/>
> 
> Can I set both the indexed and stored values to false for the body, sectors 
> and locations fields since I don't want to search or retrieve them?
> 
> Regards,
> 
> Chris



Re: Custom typeahead using Solr

2018-10-17 Thread Emir Arnautović
Hi Vineet,
You can index your jobtitle field in two different ways (sketched below):
1. standard tokenizer -> edge ngram
2. keyword tokenizer -> edge ngram

The first field will be used to match a word regardless of its position in the 
title, and the second one to prefer matches from the very beginning of the title.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
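
A sketch of the two field types (names and gram sizes are illustrative):

    <!-- 1. word-level prefixes: "soft" matches "software" anywhere in the title -->
    <fieldType name="text_prefix_word" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- 2. whole-title prefixes: only matches from the very beginning of the value -->
    <fieldType name="text_prefix_full" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The jobtitle value would be copied into one field of each type, and queries would
search both while boosting the keyword-based one (e.g. with edismax and
qf=jobtitle_word jobtitle_full^10) so whole-title prefix matches rank first.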



> On 17 Oct 2018, at 11:20, Vineet Mangla  wrote:
> 
> Hi All,
> 
> We have a requirement to create typeahead using Solr with following logic:
> 
> Let's say my Solr core has a field called jobtitle with two values as
> "senior software engineer" and "software engineer"
> 
>   1. Now, if I search for "*sen*", result should be "*sen*ior software
>   engineer"
>   2.
> 
>   if I search for "*soft*", result should be in following order:
> 
>   "*soft*ware engineer"
> 
>   "senior *soft*ware engineer"
>   3. If I search for "*soft**ware eng*",result should be in following
>   order:   "*soft**ware eng*ineer"
> 
>   "senior *soft**ware eng*ineer"
> 
> Is there anyway we can achieve this functionality?
> 
> 
> Regards,
> Vineet Mangla



Re: Using function in fiter query

2018-10-08 Thread Emir Arnautović
Hi Skanth,
You can use FunctionRangeQueryParser to do that:
https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FunctionRangeQueryParser
 


Let us know if you are having trouble forming the query. You have examples in 
https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-UsingFunctionQuery
 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
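
A sketch of such a filter with the frange parser (assumes startTime and
bufferTime are single-valued numeric fields that can be used in functions):

    fq={!frange l=0}sub(ms(NOW),sum(startTime,bufferTime))

This keeps documents where NOW - (startTime + bufferTime) >= 0, i.e. where
startTime <= NOW - bufferTime.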



> On 7 Oct 2018, at 10:07, skanth2...@gmail.com wrote:
> 
> Hi,
> 
> I need help on using a custom function in filter query. Can anyone help on
> how to get it wokring. Below is the problem statement.
> 
> Have a date field in long and a buffer time in milliseconds in the documents
> which can vary.
> 
> startTime: 153886680
> bufferTime: 86400
> 
> Need to query for docs who's startTime is currentTime - bufferTime
> 
> like:
> 
> fq=startTime:[* TO sub(NOW, bufferTime)]
> 
> Thanks,
> skanth
> 
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Migrate cores from 4.10.2 to 7.5.0

2018-10-03 Thread Emir Arnautović
Hi Wolfgang,
I would say that your safest bet is to start from the 7.5 schema, adjust it to 
suit your needs and reindex (better than trying to adjust your existing schema 
to 7.5). If all your fields are stored in the current collection, you might be 
able to use DIH to reindex: http://www.od-bits.com/2018/07/reindexing-solr-core.html 


I’ve recently used this approach for 4.x to 6.x.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
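
One way to do such a DIH reindex is with the SolrEntityProcessor; a minimal
data-config sketch, assuming all fields are stored and the old core is still
reachable (URL and rows are placeholders):

    <dataConfig>
      <document>
        <!-- stream every stored document out of the old core and index it here -->
        <entity name="old_core"
                processor="SolrEntityProcessor"
                url="http://oldhost:8983/solr/oldcore"
                query="*:*"
                rows="500"/>
      </document>
    </dataConfig>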



> On 3 Oct 2018, at 23:17, Pure Host - Wolfgang Freudenberger 
>  wrote:
> 
> Hi guys,
> 
> Is there any way to migrate cores from 4.10.2 to 7.5.0? I guess not, but 
> perhaps someone has an idea. ^^
> 
> -- 
> Mit freundlichem Gruß / kind regards
> 
> Wolfgang Freudenberger
> Pure Host IT-Services
> Münsterstr. 14
> 48341 Altenberge
> GERMANY
> Tel.: (+49) 25 71 - 99 20 170
> Fax: (+49) 25 71 - 99 20 171
> 
> Umsatzsteuer ID DE259181123
> 
> Informieren Sie sich über unser gesamtes Leistungsspektrum unter 
> www.pure-host.de
> Get our whole services at www.pure-host.de
> 
> 



Re: How to do rollback from solrclient using python

2018-10-03 Thread Emir Arnautović
Hi Chetra,
In addition to what Jason explained, rollbacks do not work in Solr Cloud.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Oct 2018, at 14:45, Jason Gerlowski  wrote:
> 
> Hi Chetra,
> 
> The syntax that you're looking for is 
> "/solr/someCoreName/update?rollback=true".
> 
> But I'm afraid Rollback might not be quite what you think it is.  You
> mentioned: "but it doesn't work, whenever there is a commit the
> request still updates on the server".  Yes, that is the expected
> behavior with rollbacks.  Rollbacks reset your index to the last
> commit point.  If there was a commit right before a rollback, the
> rollback will have no effect.
> 
> One last point is that you should be very careful using rollbacks.
> Rollbacks are going to undo all changes to your index since the last
> commit.  If you have more than one client thread changing documents,
> this can be very dangerous as you will reset a lot of things you
> didn't intend.  Even if you can guarantee that there's only one client
> making changes to your index, and that client is itself
> single-threaded, the result of a rollback is still indeterminate if
> you're using server auto-commit settings.  The client-triggered
> rollback will occasionally race against the server-triggered commit.
> Will your doc changes get rolled back?  They will if the rollback
> happens first, but if the commit happens right before the rollback,
> your rollback won't do anything!  Anyways rollbacks have their place,
> but be very careful when using them!
> 
> Hope that helps,
> 
> Jason
> On Wed, Oct 3, 2018 at 4:41 AM Chetra Tep  wrote:
>> 
>> Hi Solr team,
>> Current I am creating a python application that accesses to solr server.
>> I have to handle updating document and need a rollback function.
>> I want to send a rollback request whenever exception occurs.
>> first I try sth like this from curl command :
>> curl http://localhost:8983/solr/mysolr/update?command=rollback
>> and I also try
>> curl http://localhost:8983/solr/mysolr/update?rollback true
>> 
>> but it doesn't work. whenever there is a commit the request still updates
>> on the server.
>> 
>> I also try to submit xml document  , but it doesn't work, too.
>> 
>> Could you guide me how to do this?  I haven't found much documentation
>> about this on the internet.
>> 
>> Thanks you in advance.
>> Best regards,
>> Chetra



Re: Clarification about Solr Cloud and Shard

2018-10-03 Thread Emir Arnautović
Hi Rekha,
In addition to what Shawn explained, the answer to your last question is yes 
and no: you can split existing shards, but you cannot otherwise change the 
number of shards without reindexing. And you can add nodes, but you should make 
sure that adding nodes will result in a well-balanced cluster.
You can also address scalability differently. Depending on your case, you might 
not need a single index with 200 billion documents. E.g. if you have a 
multi-tenant system and each tenant searches only its own data, each tenant or 
group of tenants can have a separate index or even a separate cluster. Also, if 
you only append data and often filter by time, you may use time-based indices.

Here is a blog post explaining how to run tests to estimate shard/cluster size: 
Here is blog explaining how to run tests to estimate shard/cluster size: 
http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
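
For reference, splitting an existing shard goes through the Collections API,
e.g. (collection and shard names are hypothetical):

    curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1'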



> On 2 Oct 2018, at 22:41, Shawn Heisey  wrote:
> 
> On 10/2/2018 9:33 AM, Rekha wrote:
>> Dear Solr Team, I need the following clarification from you, please check and 
>> give suggestions to me:
>> 1. I want to store and search 200 billion documents (each document contains 16 fields). Can I achieve this by using Solr Cloud?
>> 2. For my case, how many shards and nodes will be needed?
>> 3. In future, can I increase the nodes and shards?
>> Thanks, Rekha Karthick
> 
> In a nutshell:  It's not possible to give generic advice. The contents of the 
> fields will affect exactly what you need.  The nature of the queries that you 
> send will affect exactly what you need.  The query rate will affect exactly 
> what you need. The overall size of the index (disk space, as well as document 
> count) will affect what you need.
> 
> In the "not very helpful" department, but I promise this is absolute truth, 
> there's this blog post:
> 
> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> To handle 200 billion documents *in a single collection*, you're probably 
> going to want at least 200 shards, and there are good reasons to go with even 
> more shards than that.  But you need to be warned that there can be serious 
> scalability problems when SolrCloud must keep track of that many different 
> indexes.  Here's an issue I filed for scalability problems with thousands of 
> collections ... there can be similar problems with lots of shards as well.  
> This issue says it is fixed, but no code changes that I am aware of were ever 
> made related to the issue, and as far as I can tell, it's still a problem 
> even in the latest version:
> 
> https://issues.apache.org/jira/browse/SOLR-7191
> 
> That many shard/replicas on one collection is likely to need zookeeper's 
> maximum znode size (jute.maxbuffer) boosted, because it will probably require 
> more than one megabyte to hold the JSON structure describing the collection.
> 
> As for how many machines you'll need ... absolutely no idea.  If query rate 
> will be insanely high, you'll want a dedicated machine for each shard 
> replica, and you may need many replicas, which is going to mean hundreds, 
> possibly thousands, of servers.  If the query rate is really low and/or each 
> document is very small, you might be able to house more than one shard per 
> server.  But you should know that handling 200 billion documents is going to 
> require a lot of hardware even if it turns out that you're not going to be 
> handling tons of data (per document) or queries.
> 
> Thanks,
> Shawn
> 



Re: Dynamic filters

2018-10-02 Thread Emir Arnautović
Hi Tamas,
Maybe I am missing the point and you already discarded that option, but you 
should be able to cover such cases with simple faceting?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
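
For example, the first page of results and the currently available filter
values can come back in one request (collection and field names are
hypothetical); facet.mincount=1 hides values that do not occur in the current
result set:

    curl 'http://localhost:8983/solr/products/select?q=shoes&rows=24&facet=true&facet.field=type&facet.field=brand&facet.mincount=1'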



> On 2 Oct 2018, at 12:55, Tamás Barta  wrote:
> 
> Hi,
> 
> I have been using Solr for a while for an online web store. After search a
> filter box appears where user can filter results by many attributes. My
> question is how can I do it with Solr so that the filter box shows only
> available options based on the result. For example if attribute "type" can be
> 1, 2, 3 but the results contains only 1 and 2, then only these two values
> should be available in the filters.
> 
> I get only the first page results and I don't want to read the full results
> from Solr because of performance. Is there any way to get available values
> by fields for a query without degrade performance?
> 
> Thanks, Tamás



Re: CACHE -> fieldValueCache usage

2018-09-20 Thread Emir Arnautović
Hi Vincenzo,
Are you saying that you used to see numbers other than 0 and now you see 0? If 
it is always zero, it means that you are not using features that require an 
uninverted version of a field (mainly faceting), or that you have docValues 
enabled on all fields used in such scenarios, so there is no need for the 
fieldValueCache, which is the recommended setup.

The fieldValueCache has built-in defaults, which is why it does not appear in 
solrconfig.xml, but you can override it there.
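If faceting is used on fields without docValues, this cache will start filling. 
The recommended setup is docValues on the fields you facet/sort/group on, e.g. 
something like this in the schema (field name is just an example):

    <field name="category" type="string" indexed="true" stored="true" docValues="true"/>

Alternatively you can tune the cache explicitly in solrconfig.xml, e.g.:

    <!-- sizes below are examples only, adjust to your data -->
    <fieldValueCache class="solr.FastLRUCache" size="10000" autowarmCount="0" showItems="32"/>

but docValues is usually the better option since it avoids building the 
uninverted structure on the heap at all.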

HTH,
Emir 

--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Sep 2018, at 15:43, Vincenzo D'Amore  wrote:
> 
> Hi all,
> 
> sorry to bother you all, but these days I'm struggling to understand what's
> going on with my production servers...
> 
> Looking at Solr Admin Panel I've found the CACHE -> fieldValueCache tab
> where all the values are 0.
> 
> class:org.apache.solr.search.FastLRUCache
> description:Concurrent LRU Cache(maxSize=1, initialSize=10,
> minSize=9000, acceptableSize=9500, cleanupThread=false)
> stats:
> CACHE.searcher.fieldValueCache.cumulative_evictions:0
> CACHE.searcher.fieldValueCache.cumulative_hitratio:0
> CACHE.searcher.fieldValueCache.cumulative_hits:0
> CACHE.searcher.fieldValueCache.cumulative_inserts:0
> CACHE.searcher.fieldValueCache.cumulative_lookups:0
> CACHE.searcher.fieldValueCache.evictions:0
> CACHE.searcher.fieldValueCache.hitratio:0
> CACHE.searcher.fieldValueCache.hits:0
> CACHE.searcher.fieldValueCache.inserts:0
> CACHE.searcher.fieldValueCache.lookups:0
> CACHE.searcher.fieldValueCache.size:0
> CACHE.searcher.fieldValueCache.warmupTime:0
> 
> what do you think, is that normal? Given that these stats come from a
> production server I would expect to see some numbers here, and looking at
> solrconfig.xml I don't see any configuration regarding the fieldValueCache.
> Shouldn't I see something here?
> 
> Cheers,
> Vincenzo
> 
> -- 
> Vincenzo D'Amore



Re: 20180913 - Clarification about Limitation

2018-09-14 Thread Emir Arnautović
Hi,
Here are some thoughts on how to resolve some of the “it depends”: 
http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 13 Sep 2018, at 14:59, Shawn Heisey  wrote:
> 
> On 9/13/2018 2:07 AM, Rekha wrote:
>> Hi Solr Team,
>> I am new to Solr. I need the following clarifications from you:
>> 1. How many documents can be stored in one core?
>> 2. Is there any limit on the number of fields per document?
>> 3. How many cores can be created in one Solr?
>> 4. Is there any other limitation based on disk storage size? I mean, some
>> databases have a 10 GB limit, which is why I ask.
>> 5. Can we use Solr as a database?
> 
> You *can* use Solr as a database, but I wouldn't.  It's not designed for that 
> role.  Actual database software is better for that.  If all you need is 
> simple data storage, Solr can handle that, but as soon as you start talking 
> about complex operations like JOIN, a real database is FAR better.  Solr is a 
> search engine, and in my opinion, that's what it should be used for.
> 
> The only HARD limit that Solr has is actually a Lucene limit.  Lucene uses 
> the java "int" type for its internal document ID.  Which means that the 
> absolute maximum number of documents in one Solr core is 2147483647.  That's 
> a little over two billion.  You're likely to have scalability problems long 
> before you reach this number, though.  Also, this number includes deleted 
> documents, so it's not a good idea to actually get close to the limit.  One 
> rough rule of thumb that sometimes gets used:  If you have more than one 
> hundred million documents in a single core, you PROBABLY need to think about 
> re-designing your setup.
> 
> Using a sharded index (which SolrCloud can do a lot easier than standalone 
> Solr) removes the two billion document limitation for an index -- by 
> spreading the index across multiple Solr cores.
> 
> As for storage, you should have enough disk space available so that your 
> index data can triple in size temporarily.  This is not a joke -- that's 
> really the recommendation.  The way that Lucene operates requires that you 
> have at least *double* capacity, but there are real world situations in which 
> the index can triple in size.
> 
> Running with really big indexes means that you also need a lot of memory.  
> Good performance with Solr requires that the operating system has enough 
> memory to effectively cache the often-used parts of the index.
> 
> https://wiki.apache.org/solr/SolrPerformanceProblems#RAM
> 
> Thanks,
> Shawn
> 



Re: Data Import Handler with Solr Source behind Load Balancer

2018-09-14 Thread Emir Arnautović
Hi Thomas,
Is this SolrCloud or Solr master-slave? Do you update the index while the 
import is running? Did you check whether all your instances behind the LB are 
in sync, if you are using master-slave?
My guess would be that DIH is using cursors to read data from the other Solr. 
If you are using multiple Solr instances behind the LB, there might be 
differences between the indexes that result in different documents being 
returned for the same cursor mark. Are numDocs and maxDocs the same on the new 
instance after the import?
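One way to rule the LB out is to point the DIH at a single node directly. 
Assuming you are using SolrEntityProcessor, something along these lines (host 
and core name are placeholders):

    <entity name="sourceSolr"
            processor="SolrEntityProcessor"
            url="http://solr4-node1:8983/solr/core_name"
            query="*:*"
            rows="500"/>

If the import is complete against each individual node but incomplete through 
the ELB, requests landing on slightly different indexes is the likely culprit.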

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 12 Sep 2018, at 05:53, Zimmermann, Thomas  
> wrote:
> 
> We have a Solr v7 Instance sourcing data from a Data Import Handler with a 
> Solr data source running Solr v4. When it hits a single server in that 
> instance directly, all documents are read and written correctly to the v7. 
> When we hit the load balancer DNS entry, the resulting data import handler 
> json states that it read all the documents and skipped none, and all looks 
> fine, but the result set is missing ~20% of the documents in the v7 core. 
> This has happened multiple time on multiple environments.
> 
> Any thoughts on whether this might be a bug in the underlying DIH code? I'll 
> also pass it along to the server admins on our side for input.



Re: Boost only first 10 records

2018-09-03 Thread Emir Arnautović
Hi,
The requirement is not 100% clear or logical. If the user selects the filter 
type:comedy, it does not make sense to show anything else. You might have an 
“Other relevant results from other categories” section, and that can be done as 
a separate query. It seems that you want to prefer comedy, but boosting it too 
much results in only comedy in the top results, while boosting it too little 
does not make comedy the top hit all the time. Boosting is usually used to 
prefer one type when results are otherwise similar, but it does not guarantee 
that they will be on top all the time. Your options are:
1. tune the boost parameter so the results are as expected most of the time (it 
will never be all of the time) - see the boost query sketch below
2. use the collapse (group) feature to make sure you get results from all 
categories - see the collapse sketch below
3. run two queries and combine the results on the UI side
4. use faceting in combination with the query and let the user choose the genre.
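As a rough illustration (untested, field names are just examples), option 1 
with edismax would use a boost query rather than a hard filter:

    q=<user query>&defType=edismax&bq=genre:comedy^3

and option 2 could collapse on genre so each genre contributes only its best 
hit, with the groups expandable to a few documents each:

    q=<user query>&fq={!collapse field=genre}&expand=true&expand.rows=3

You would still need a bit of UI logic to lay out “3 comedy first, then the 
rest”, which is why option 3 (two queries) is often the simplest to reason 
about.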

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Sep 2018, at 08:48, mama  wrote:
> 
> Hi 
> We have a requirement to boost only the first few records; the rest of the
> results should be as per the search.
> E.g. if I have books of different genres and a user searches for some book
> (interested in genre: comedy), then
> we want to show, say, the first 3 records of genre:comedy and the rest of the
> results should be of different genres.
> The reason for this is that we have lots of books in the db; if we boost the
> comedy genre then the first 100s of records will be comedy and the user may
> not be aware of other books.
> Is it possible?
> 
> Query for boosting genre comedy
> genre:comedy^0.5
> 
> can someone help with requirement of limiting boost to first few records ?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Split on whitespace parameter doubt

2018-08-30 Thread Emir Arnautović
Hi David,
Your observations seem correct. If all fields produce the same tokens then 
Solr goes for a “term centric” query, but if different fields produce different 
tokens, then it uses a “field centric” query. Here is a blog post that explains 
it from the multi-word synonyms perspective: 
https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/

IMO the issue is that it is not clear what a term centric query would look like 
in the case of different tokens. Imagine that your query is “a b” and you are 
searching two fields, title (analysed) and title_s (string), so you end up with 
the tokens ‘a’, ‘b’ and ‘a b’. A term centric query would then be 
(title:a || title_s:a) (title:b || title_s:b) (title:a b || title_s:a b). 
If that is not already weird enough, assume you also allow one token to be 
missed…

I am not sure why the field centric query is not used all the time, or at least 
why there is no parameter to force it.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 30 Aug 2018, at 15:02, David Argüello Sánchez 
>  wrote:
> 
> Hi everyone,
> 
> I am doing some tests to understand how the split on whitespace
> parameter works with eDisMax query parser. I understand the behaviour,
> but I have a doubt about why it works like that.
> 
> When sow=true, it works as it did with previous Solr versions.
> When sow=false, the behaviour changes and all the terms have to be
> present in the same field. However, if all queried fields' query
> structure is the same, it works as if it had sow=true. This is the
> thing that I don’t fully understand.
> Specifying sow=false I might want to match only those documents
> containing all the terms in the same field, but because of all queried
> fields having the same query structure, I would get back documents
> containing both terms in any of the fields.
> 
> Does anyone know the reasoning behind this decision?
> Thank you in advance.
> 
> Regards,
> David



Re: Issue with adding an extra Solr Slave

2018-08-28 Thread Emir Arnautović
Hi Zafar,
Slaves are separate nodes, and accessing the admin console through the ELB does 
not make much sense since different requests will go to different nodes; that 
is why you sometimes see cores and other times it is empty. Since it is empty, 
it seems that you did not define the core(s) on this new slave. The replication 
handler is defined at the core level, so I am not sure what you mean when you 
say the solrconfig.xml is the same on both servers.

What you need to do is create the core on the new slave. Make sure the 
replication handler is properly configured and that the master is reachable 
(try pinging the master's replication handler from the slave). Then issue the 
fetch index command on the new slave 
(http://slave_host:port/solr/core_name/replication?command=fetchindex). 
And when checking in the admin console, use the slave's IP, not the ELB.
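For reference, the slave side of the replication handler in solrconfig.xml 
typically looks something like this (host, port and poll interval are 
placeholders):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <!-- masterUrl below is a placeholder, point it at the real master core -->
        <str name="masterUrl">http://master_host:port/solr/core_name/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>

On the master, the same handler has a “master” section (replicateAfter, 
confFiles) instead.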

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 28 Aug 2018, at 21:03, Zafar Khurasani  
> wrote:
> 
> Hi Emir,
> 
> I access the admin console through the ELB. I do NOT see any replication 
> errors in the new Slave's logs. I also double checked to make sure the 
> connectivity between the master and slaves exist. The only error I see in the 
> new Slave log is what I shared originally.
> 
> Thanks,
> Zafar.
> 
> 
> 
> -Original Message-
> From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
> Sent: Tuesday, August 28, 2018 2:55 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Issue with adding an extra Solr Slave
> 
> Hi Zafar,
> How do you access admin console? Through ELB or you see this behaviour when 
> accessing admin console of a new slave? Do you see any replication related 
> errors in new slave’s logs? Did you check connectivity of a new slave and 
> master nodes?
> 
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 27 Aug 2018, at 16:52, Zafar Khurasani  
>> wrote:
>> 
>> Hi,
>> 
>> I'm running Solr 5.3 in one of our applications. Currently, we have 
>> one Solr Master and one Solr slave running on AWS EC2 instances. I'm 
>> trying to add an additional Solr slave. I'm using an Elastic 
>> LoadBalancer (ELB) in front of my Slaves. I see the following error in 
>> the logs after adding the second slave,
>> 
>> 
>> java version "1.8.0_121"
>> 
>> Solr version: 5.3.0 1696229
>> 
>> 
>> org.apache.solr.common.SolrException: Core with core name [xxx-xxx-] 
>> does not exist.
>>   at 
>> org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:770)
>>   at 
>> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:240)
>>   at 
>> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:194)
>>   at 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>>   at 
>> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:675)
>>   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:443)
>>   at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)
>>   at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
>>   at 
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>>   at 
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>>   at 
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>>   at 
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>>   at 
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>>   at 
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>>   at 
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>>   at 
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>>   at 
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>>   at 
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>>   at 
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>>   at 
>> org.eclipse.jetty.server.handler.Hand

Re: Issue with adding an extra Solr Slave

2018-08-28 Thread Emir Arnautović
Hi Zafar,
How do you access admin console? Through ELB or you see this behaviour when 
accessing admin console of a new slave? Do you see any replication related 
errors in new slave’s logs? Did you check connectivity of a new slave and 
master nodes?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Aug 2018, at 16:52, Zafar Khurasani  
> wrote:
> 
> Hi,
> 
> I'm running Solr 5.3 in one of our applications. Currently, we have one Solr 
> Master and one Solr slave running on AWS EC2 instances. I'm trying to add an 
> additional Solr slave. I'm using an Elastic LoadBalancer (ELB) in front of my 
> Slaves. I see the following error in the logs after adding the second slave,
> 
> 
> java version "1.8.0_121"
> 
> Solr version: 5.3.0 1696229
> 
> 
> org.apache.solr.common.SolrException: Core with core name [xxx-xxx-] does 
> not exist.
>at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:770)
>at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal(CoreAdminHandler.java:240)
>at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:194)
>at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>at 
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:675)
>at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:443)
>at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)
>at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
>at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>at org.eclipse.jetty.server.Server.handle(Server.java:499)
>at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>at 
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>at java.lang.Thread.run(Thread.java:745)
> 
> 
> Also, when I hit the Solr Admin UI, I'm able to see my core infrequently. I 
> have to refresh the page multiple times to be able to see it.  What's the 
> right way to add a slave to my existing setup?
> 
> FYI - the Solr Replication section in solrconfig.xml is exactly the same for 
> both the Slaves.
> 
> Thanks,
> Zafar Khurasani
> 



Re: How to hit filterCache?if filterQuery is a sub range query of another already cache range filterQuery

2018-08-24 Thread Emir Arnautović
Hi,
No, it will not, and it would not make sense to: the cached [2 TO 100] DocSet 
can contain documents with value 2, so Solr would still have to apply a filter 
on top of it to answer [3 TO 100]. You can think of the exact filter query as 
the key into the filterCache.
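As an illustration (field name is just an example):

    fq=price:[2 TO 100]    ->  one filterCache entry
    fq=price:[3 TO 100]    ->  a different entry, computed from scratch

If you have many overlapping ranges and recomputing them is expensive, one 
option is to keep the coarse range as a cached filter and apply the narrower 
one uncached so it does not pollute the cache:

    fq=price:[2 TO 100]&fq={!cache=false}price:[3 TO 100]

Whether that is worth it depends on your cardinalities and hit rates.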

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 24 Aug 2018, at 13:23, zhenyuan wei  wrote:
> 
> Hi All,
> I am confused about how filterCache hits work.
> 
> If a filter query is the range [3 TO 100], which is not yet cached in the filterCache,
> and the filterCache already contains the filter query range [2 TO 100],
> 
> my question is: will this filter query [3 TO 100] fetch its DocSet
> from the cached filterCache entry for [2 TO 100]?



Re: How to add solr admin ui

2018-08-22 Thread Emir Arnautović
Hi Ahmed,
I am not aware of an extension point in the UI, but you could maybe use a 
combination of a request handler and the velocity response writer to get what 
you want; you just will not have a link to it in the admin UI.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 22 Aug 2018, at 03:11, Ahmed Musallam  wrote:
> 
> Hi,
> 
> I'd like to build a UI plugin for solr, I can see all the ui related assets
> in `/server/solr-webapp` but is there a way to add UI plugins without
> modifying the ui assets under `/server/solr-webapp`?
> 
> by plugin, I mean some way I can add some form of UI to the admin UI, and
> even better, make it specific to a certain core.
> 
> any tutorials or documentation would be greatly appreciated!
> 
> Thanks!
> Ahmed



Re: Metrics for a healthy Solr cluster

2018-08-16 Thread Emir Arnautović
Hi,
If you are up to ready-to-go Solr monitoring, you can check out Sematext’s Solr 
integration: https://sematext.com/integrations/solr-monitoring/ 


Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 16 Aug 2018, at 17:24, Greenhorn Techie  wrote:
> 
> Hi,
> 
> Solr provides numerous JMX metrics for monitoring the health of the
> cluster. We are setting up a SolrCloud cluster and hence wondering what are
> the important parameters / metrics to look into, to ascertain that the
> cluster health is good. Obvious things comes to my mind are CPU utilisation
> and memory utilisation.
> 
> However, wondering what are the other parameters to look into from the
> health of the cluster? Are there any best practices?
> 
> Thanks



Re: Solr changing the search when given many qf fields?

2018-08-16 Thread Emir Arnautović
Hi Aaron,
It is probably not about the number of fields but about the different analysis 
of the different fields. As long as all your fields' analyzers produce the same 
tokens you should get a “term centric” query. Once any of the analyzers 
produces different tokens, it becomes “field centric”. It is likely that one of 
your fields (the tipping point) is a string field and produces a single 
“foo bar” token.
Here is a blog post explaining this part of edismax in the context of 
multi-term synonyms: 
https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 15 Aug 2018, at 17:23, Aaron Gibbons  wrote:
> 
> I found a tipping point where the search being built changes with the
> number of qf fields being passed in.
> 
> Example search: "foo bar"
> solr 7.2.1
> select?q.op=AND=edismax=foo bar
> 
> Debugging the query you can see it results in:
> "parsedquery_toString":"+(+(text:foo) +(text:bar))"
> 
> Adding more qf values you get:
> "text name_text"
> "parsedquery_toString":"+(+(name_text:foo | text:foo) +(name_text:bar |
> text:bar))"
> "text name_text city_text"
> "parsedquery_toString":"+(+(city_text:foo | name_text:foo | text:foo)
> +(city_text:bar | name_text:bar | text:bar))"
> 
> The search continues to build this way until I get to a certain amount of
> qf values.
> 
> Large number of qf values:
> "34 values.."
> "parsedquery_toString":"+(+((+comments_text:foo +comments_text:bar) |
> (+zip_text:foo +zip_text:bar) | (+city_text:foo +city_text:bar) |
> (+street_address_text:foo +street_address_text:bar) |
> (+street_address_two_text:foo +street_address_two_text:bar) |
> (+state_text:foo +state_text:bar)..."
> Now the search is requiring both foo and bar to be in each qf field in the
> search, not foo to be in any qf field and bar to be in any qf field. I had
> to cut the number of qf values down to 15 to get it back to the correct
> search.
> 
> Why is the search changing? Is there any way around this or a better way we
> should be doing the search?
> I realize we could copy all of the fields to the default text field. However,
> most of the fields are searchable individually as well as keyword
> searchable so specifying the fields vs using the default text field makes
> sense in that respect.
> 
> Thank you,
> Aaron



Re: Ignored fields and copyfield

2018-08-06 Thread Emir Arnautović
Hi John,
Yes it can, and it is a common pattern when you want to index multiple fields 
into a single field, or when you want to standardise naming without changing 
your indexing logic.
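A minimal sketch of what that can look like in the schema (field and type names 
are just examples):

    <!-- source field is neither indexed nor stored -->
    <field name="raw_input" type="ignored" indexed="false" stored="false"/>
    <field name="searchable_text" type="text_general" indexed="true" stored="false"/>
    <copyField source="raw_input" dest="searchable_text"/>

The copy uses the value from the incoming document, before the source field's 
own indexed/stored settings apply, which is why an “ignored” source still works.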

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 6 Aug 2018, at 22:47, John Davis  wrote:
> 
> Hi there,
> If a field is set as "ignored" (indexed=false, stored=false) can it be used
> for another field as part of copyfield directive which might index/store it.
> 
> John



Re: Need an advice for architecture.

2018-07-19 Thread Emir Arnautović
Hi Francois,
If I got your numbers right, you are indexing on a single server and your 
indexing rate is ~30 doc/s. I would first check whether something is wrong with 
the indexing logic. Check where the bottleneck is: do you read documents from 
the DB fast enough, do you batch documents…
Assuming you cannot get a better rate than ~30 doc/s per node and that the 
bottleneck is Solr, then in order to finish in 6h you need to parallelise 
indexing by splitting the index across ~6 shards/servers, for an overall 
indexing rate of ~180 doc/s.
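The arithmetic behind that, using your numbers:

    3.8M docs / (35 h * 3600 s/h)  ~  30 docs/s   (current rate)
    3.8M docs / (6 h * 3600 s/h)   ~ 176 docs/s   (needed rate)
    176 / 30                       ~   6 parallel indexing streams/shards

assuming the per-node rate stays roughly constant as you add shards.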

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Jul 2018, at 09:59, servus01  wrote:
> 
> Would like to ask what your recommendations are for a new performant Solr
> architecture. 
> 
> SQL DB 4M documents with up to 5000 metadata fields each document [2xXeon
> 2.1Ghz, 32GB RAM]
> Actual Solr: 1 Core version 4.6, 3.8M documents, schema has 300 metadata
> fields to import, size 3.6GB [2xXeon 2.4Ghz, 32GB RAM]
> (atm we need 35h to build the index and about 24h for a mass update which
> affects the production)
> 
> Building the index should be less than 6h. Sometimes we change some of the
> Metadata fields which affects most of the documents and therefore a
> massupdate / reindex is necessary. Reindex is ok also for about 6h (night)
> but should not have an impact to user queries. Anyway, every faster indexing
> is very welcome. We will have max. 20 - 30 CCUser.
> 
> So I asked myself: how many nodes, shards, replicas etc.? Could someone
> please give me a recommendation for a fast, working architecture.
> 
> really appreciate this, best
> 
> Francois 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr7.3.1 Installation

2018-07-11 Thread Emir Arnautović
Hi,
Why are you building Solr? Because you added your custom query parser? If that 
is the case, then that is not the way to do it. You should set up a separate 
project for your query parser, build it against the Solr/Lucene artifacts, and 
include the resulting jar in your Solr setup.
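Roughly, that means dropping the jar somewhere Solr can load it and registering 
the parser in solrconfig.xml, e.g. (path and class name are placeholders):

    <!-- path and class name below are examples only -->
    <lib dir="${solr.install.dir:../..}/contrib/my-parser/lib" regex=".*\.jar"/>
    <queryParser name="myparser" class="com.example.MyQParserPlugin"/>

You can then select it per request with defType=myparser or {!myparser}, and 
you never have to rebuild Solr itself, so the official binary release and its 
test suite stay untouched.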
It is not a query parser, but here is a blog post with code for a simple 
update processor, as an example of building a custom plugin: 
https://www.od-bits.com/2018/02/solr-docvalues-on-analysed-field.html 


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 11 Jul 2018, at 07:20, tapan1707  wrote:
> 
> We are trying to install solr-7.3.1 into our existing system (We have also
> made some changes by adding one custom query parser).
> 
> I am having some build issues and it would be really helpful if someone can
> help.
> 
> While running ant test(in the process of building the solr package), it
> terminates because of failed tests.
> At first time (build with ant-1.9)
> Tests with failures [seed: C2C0D761AEAAE8A4] (first 10 out of 23):
> 21:25:20[junit4]   -
> org.apache.solr.client.solrj.response.TestSuggesterResponse (suite)
> 21:25:20[junit4]   -
> org.apache.solr.client.solrj.response.TermsResponseTest (suite)
> 21:25:20[junit4]   - org.apache.solr.client.solrj.TestSolrJErrorHandling
> (suite)
> 21:25:20[junit4]   - org.apache.solr.client.solrj.GetByIdTest (suite)
> 21:25:20[junit4]   -
> org.apache.solr.client.solrj.response.TestSpellCheckResponse (suite)
> 21:25:20[junit4]   -
> org.apache.solr.client.solrj.embedded.LargeVolumeEmbeddedTest (suite)
> 21:25:20[junit4]   -
> org.apache.solr.client.solrj.embedded.JettyWebappTest.testAdminUI
> 21:25:20[junit4]   -
> org.apache.solr.client.solrj.embedded.SolrExampleStreamingBinaryTest (suite)
> 21:25:20[junit4]   - org.apache.solr.client.solrj.SolrExampleBinaryTest
> (suite)
> 21:25:20[junit4]   -
> org.apache.solr.client.solrj.embedded.LargeVolumeBinaryJettyTest (suite)
> 
> Running the same ant test command without doing any changes (build with
> ant-1.10)
> Tests with failures [seed: 7E004642A6008D89]:
> 11:30:57[junit4]   -
> org.apache.solr.cloud.MoveReplicaHDFSTest.testFailedMove  
> 
> Thirds time (build with ant 1.10)
> [junit4] Tests with failures [seed: EFD939D82A6EC707]:
> [junit4]   - org.apache.solr.cloud.autoscaling.SystemLogListenerTest.test
> 
> Even though I'm not making any changes, build is failing with different
> failed tests. Can anyone help me with this, I mean if there is any problem
> with the code then shouldn't it fail with same test cases?
> Also, all above-mentioned test cases work fine if I check them individually.
> (using ant test -Dtests.class=)
> 
> Also, does the ant version have any effect on the build?
> 
> Lastly, at present we are using solr-6.4.2 which has a zookeeper-3.4.6
> dependency, but for solr-7 the zookeeper dependency has been upgraded to
> 3.4.10, so my question is: to what extent might this affect our system
> performance? Can we use zookeeper-3.4.6 with solr-7?
> (same with the jetty version) 
> 
> Thanks in advance
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Maximum number of SolrCloud collections in limited hardware resource

2018-06-29 Thread Emir Arnautović
Hi,
It is probably the best if you merge some of your collections (or all) and have 
discriminator field that will be used to filter out tenant’s documents only. In 
case you go with multiple collections serving multiple tenants, you would have 
to have logic on top of it to resolve tenant to collection. Unfortunately, Solr 
does not have alias with filtering like ES that would come handy in such cases.
If you stick with multiple collections, you can turn off caches completely, 
monitor latency and turn on caches for collections when it is reaching some 
threshold.
Caches are invalidated on commit, so submitting dummy doc and committing should 
invalidate caches. Alternative is to reload collection.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 27 Jun 2018, at 14:46, Shawn Heisey  wrote:
> 
> On 6/27/2018 5:10 AM, Sharif Shahrair wrote:
>> Now the problem is, when we create about 1400 collection(all of them are
>> empty i.e. no document is added yet) the solr service goes down showing out
>> of memory exception. We have few questions here-
>> 
>> 1. When we are creating collections, each collection is taking about 8 MB
>> to 12 MB of memory when there is no document yet. Is there any way to
>> configure SolrCloud in a way that it takes low memory for each collection
>> initially(like 1MB for each collection), then we would be able to create
>> 1500 collection using about 3GB of machines RAM?
> 
> Solr doesn't dictate how much memory it allocates for a collection.  It 
> allocates what it needs, and if the heap size is too small for that, then you 
> get OOME.
> 
> You're going to need a lot more than two Solr servers to handle that many 
> collections, and they're going to need more than 12GB of memory.  You should 
> already have at least three servers in your setup, because ZooKeeper requires 
> three servers for redundancy.
> 
> http://zookeeper.apache.org/doc/r3.4.12/zookeeperAdmin.html#sc_zkMulitServerSetup
> 
> Handling a large number of collections is one area where SolrCloud needs 
> improvement.  Work is constantly happening towards this goal, but it's a very 
> complex piece of software, so making design changes is not trivial.
> 
>> 2. Is there any way to clear/flush the cache of SolrCloud, specially from
>> those collections which we don't access for while(May be we can take those
>> inactive collections out of memory and load them back when they are needed
>> again)?
> 
> Unfortunately the functionality that allows index cores to be unloaded (which 
> we have colloquially called "LotsOfCores") does not work when Solr is running 
> in SolrCloud mode.  SolrCloud functionality would break if its cores get 
> unloaded.  It would take a fair amount of development effort to allow the two 
> features to work together.
> 
>> 3. Is there any way to collect the Garbage Memory from SolrCloud(may be
>> created by deleting documents and collections) ?
> 
> Java handles garbage collection automatically.  It's possible to explicitly 
> ask the system to collect garbage, but any good programming guide for Java 
> will recommend that programmers should NOT explicitly trigger GC.  While it 
> might be possible for Solr's memory usage to become more efficient through 
> development effort, it's already pretty good.  To our knowledge, Solr does 
> not currently have any memory leak bugs, and if any are found, they are taken 
> seriously and fixed as fast as we can fix them.
> 
>> Our target is without increasing the hardware resources, create maximum
>> number of collections, and keeping the highly accessed collections &
>> documents in memory. We'll appreciate your help.
> 
> That goal will require a fair amount of hardware.  You may have no choice but 
> to increase your hardware resources.
> 
> Thanks,
> Shawn
> 



Re: SolrCloud Large Cluster Performance Issues

2018-06-25 Thread Emir Arnautović
Hi,
With such a big cluster a lot of things can go wrong, and it is hard to give 
any answer without looking into it more and understanding your model. I assume 
that you are monitoring your system (both Solr/ZK and the components that 
index/query), so that should be the first thing to look at to see if there are 
bottlenecks. If you doubled the number of nodes and don't see an increase in 
indexing throughput, it is likely that the bottleneck is the indexing 
component, or that you did not spread the load across your entire cluster. 
With more nodes there is also more pressure on ZK, so check that as well. 
You will have to dive in and search for the bottleneck, or find a Solr 
consultant to do it for you.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 25 Jun 2018, at 03:38, 苗海泉  wrote:
> 
> Hello, everyone, we encountered two solr problems and hoped to get help.
> Our data volume is very large, 24.5TB a day, and the number of records is
> 110 billion. We originally used 49 solr nodes. Because of insufficient
> storage, we expanded to 100. For a solr cluster composed of multiple
> machines, we found that the performance of 60 solrclouds and the overall
> performance of 49 solr clusters are the same. How do we optimize it? Now
> the cluster speed is 1.5 million on average per second. Why is that?
> 
> The second problem solrhome can only specify a solrhome, but now the disk
> is divided into two directories, another solr can be stored using hdfs, but
> the overall indexing performance is not up to standard, how to do, thank
> you for your attention.



Re: Delete By Query issue followed by Delete By Id Issues

2018-06-24 Thread Emir Arnautović
Hi Sujatha,
Did I get it right that you are deleting the same documents that will be 
updated afterwards? If that is the case, then you can simply skip the delete 
and just send the updated version of each document. Solr (Lucene) does not 
delete in place - it just flags the document as deleted. Updating a document 
(assuming the id is the same) results in the same thing: the old document is 
no longer retrievable and is removed from the index when the segment holding 
it is merged.
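In other words, a plain add with the same uniqueKey already is the 
delete+insert, e.g. something along these lines (host, collection name and 
fields are placeholders):

    curl 'http://host:8983/solr/collection/update?commit=true' \
         -H 'Content-Type: application/json' \
         -d '[{"id":"doc-1","title":"new title","genre":"comedy"}]'

So for the batch you can drop the delete step entirely for documents that are 
being re-sent, and only delete the ids that really disappeared from the source.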

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 21 Jun 2018, at 19:59, sujatha sankaran  wrote:
> 
> Thanks, Shawn.
> 
> Our use case is something like this: in a batch load of several 1000s of
> documents, we do a delete first followed by an update. For example, delete all
> 1000 docs and then send an update request for the same 1000.
> 
> What we see is that there are many missing docs due to DBQ re-ordering
> deletes and updates. We also saw an issue with nodes
> going down,
> similar to the issue described here:
> http://lucene.472066.n3.nabble.com/SolrCloud-Nodes-going-to-recovery-state-during-indexing-td4369396.html
> 
> we see at the end of this batch process, many (several thousand ) missing
> docs.
> 
> Due to this and after reading above thread , we decided to move to DBI and
> now are facing issues due to custom routing or implicit routing which we
> have in place.So I don't think DBQ was working for us, but we did have
> several such process ( DBQ followed by updates) for different activities in
> the collection happening at the same time.
> 
> 
> Sujatha
> 
> On Thu, Jun 21, 2018 at 1:21 PM, Shawn Heisey  wrote:
> 
>> On 6/21/2018 9:59 AM, sujatha sankaran wrote:
>>> Currently from our business perspective we find that we are left with no
>>> options for deleting docs in a batch load as :
>>> 
>>> DBQ+ batch does not work well together
>>> DBI+ custom routing (batch load / normal)would not work as well.
>> 
>> I would expect DBQ to work, just with the caveat that if you are trying
>> to do other indexing operations at the same time, you may run into
>> significant delays, and if there are timeouts configured anywhere that
>> are shorter than those delays, requests may return failure responses or
>> log failures.
>> 
>> If you are using DBQ, you just need to be sure that there are no other
>> operations happening at the same time, or that your error handling is
>> bulletproof.  Making sure that no other operations are happening at the
>> same time as the DBQ is in my opinion a better option.
>> 
>> Thanks,
>> Shawn
>> 
>> 



Re: Solr cloud with different JVM size nodes

2018-06-19 Thread Emir Arnautović
Hi Rishi,
It is not uncommon to have tiers in your cluster, assuming you have weighed 
whether it is the best choice.

I would remind you that 32GB is not a good heap size, since at that size the 
JVM can no longer use compressed OOPs. Check what the exact limit is for your 
JVM, but 30GB is a safe bet.
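You can check where the cut-off is for your particular JVM with something along 
these lines:

    java -Xmx32g -XX:+PrintFlagsFinal -version | grep UseCompressedOops   -> typically false
    java -Xmx30g -XX:+PrintFlagsFinal -version | grep UseCompressedOops   -> typically true

(the exact boundary varies slightly between JVM builds).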
Also, what did you mean by “got high field cache”?

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Jun 2018, at 09:11, Rishikant Snigh  wrote:
> 
> Hello everyone,
> 
> I am planning to create a Solr cloud with 16GB and 32GB nodes.
> Somewhat like creating an underlying pseudo cluster -
> 32G to hold historical data (got high field cache).
> 16G to hold regular collections.
> 
> NOTE - Shards of collection placed on 16G will never be placed on 32G and
> vice versa.
> 
> Do you guys see an impact ?
> 
> Thanks, Rishi


