Re: Regarding LTR feature

2018-05-02 Thread Prateek Agarwal
Hi Alessandro, Thanks for responding. Let me take a step back and tell you the problem I have been facing with this. So one of the features in my LTR model is: { "store" : "my_feature_store", "name" : "in_aggregated_terms", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" :

Re: Regarding LTR feature

2018-05-02 Thread Prateek
Hi Alessandro, Thanks for responding. Let me take a step back and tell you the problem I have been facing with this. So one of the features in my LTR model is: { "store" : "my_feature_store", "name" : "in_aggregated_terms", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" : { "q" :

Solr working £ Symbol

2018-05-02 Thread Mohan Cheema
Hi There, We are using Solr to index our data. The data contains £ symbol within the text and for currency. When data is exported from the source system data contains £ symbol, however, when the data is imported into the Solr £ symbol is converted to �. How can we keep the £ symbol as is when

SolrCloud replicaition

2018-05-02 Thread Greenhorn Techie
Hi, Good Morning!! In the case of a SolrCloud setup with sharding and replication in place, when a document is sent for indexing, what happens when only the shard leader has indexed the document but the replicas failed, for whatever reason? Will the document be resent by the leader to the

Too many commits

2018-05-02 Thread Patrick Recchia
Hello, I'm seeing way too many commits on our solr cluster, and I don't know why. Here is the landscape: - Each collection we create (one per day) is created with 10 shards with 2 replicas each. - we send live data, 2B records / day. so on average 200M records/shard per day - for a size of

Collection reload leaves dangling SolrCore instances

2018-05-02 Thread Markus Jelsma
Hello, One of our collections, that is heavy with tons of TokenFilters using large dictionaries, has a lot of trouble dealing with collection reload. I removed all custom plugins from solrconfig, dumbed the schema down and removed all custom filters and replaced a customized decompounder with

Re: Regarding LTR feature

2018-05-02 Thread prateek . agarwal
Hi Alessandro, Thanks for responding. Let me take a step back and tell you the problem I have been facing with this. So one of the features in my LTR model is: { "store" : "my_feature_store", "name" : "in_aggregated_terms", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" : { "q" :

Re: Solr Heap usage

2018-05-02 Thread Susheel Kumar
Take a look at https://wiki.apache.org/solr/SolrPerformanceProblems. The section "how much heap do i need" talks about that. Caches also live on the JVM heap, so take a look at how much you need and are allocating for the different caches. Thanks On Tue, May 1, 2018 at 7:33 PM, Greenhorn Techie

Re: Query Regarding Solr Garbage Collection

2018-05-02 Thread Susheel Kumar
A very high rate of indexing documents could cause heap usage to go high (all the temporary objects being created live in JVM memory, so at a very high rate heap utilization may go high). Caches that are not sized/set correctly would also result in high JVM usage since, as searches are happening, it

Autocomplete returning shingles

2018-05-02 Thread O. Klein
I need to use autocomplete with edismax (ngrams,edgegrams) to return shingled suggestions. Field value "new york city" needs to return on query "ne" -> "new","new york","new york city". With suggester this is easy. But I'm forced to use edismax because I need to apply multiple filter queries. What
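
To make the expected output in the question above concrete, here is a minimal sketch in plain Python. It is not Solr configuration and not how edismax or the suggester works internally; the function name `shingle_suggestions` is hypothetical and only illustrates which shingles of a field value should come back for a typed prefix.

```python
def shingle_suggestions(field_value, prefix):
    """Return the word-level shingles of field_value that start with prefix.

    Illustration only: a shingle here is any run of consecutive words.
    """
    words = field_value.split()
    prefix = prefix.lower()
    suggestions = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            shingle = " ".join(words[start:end])
            # Keep only shingles whose text begins with what the user typed.
            if shingle.lower().startswith(prefix):
                suggestions.append(shingle)
    return suggestions


print(shingle_suggestions("new york city", "ne"))
```

With the field value and query from the message, this yields "new", "new york" and "new york city", which is the behaviour being asked for.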

SorCloud Sharding

2018-05-02 Thread Greenhorn Techie
Hi, I have a few questions on sharding in a SolrCloud setup: 1. How do I determine the optimal number of shards required for a SolrCloud setup? What factors should be considered when deciding on the value for the *numShards* parameter? 2. If over-sharding has been done, i.e. if numShards has been set to a

count mismatch: number of records indexed

2018-05-02 Thread Srinivas Kashyap
Hi, I have a standalone Solr 5.2.1 index server with a core of 15 fields (all indexed and stored). Through DIH I'm indexing the data (around 65 million records). The indexing process took 6 hours to complete. But after completion, when I checked through the Solr admin query console (*:*),

Is it normal for BlendedInfixLookupFactory to not show terms?

2018-05-02 Thread O. Klein
BlendedInfixLookupFactory is not returning terms, but returns the field value. If I change to FuzzyLookupFactory it works fine. Am I doing something wrong? default BlendedInfixLookupFactory position_linear DocumentDictionaryFactory weight text_suggest language

Re: SorCloud Sharding

2018-05-02 Thread Erick Erickson
1> You have to prototype, see: https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ 2> No. It could be done, but it'd take some very careful work. Basically you'd have to merge "adjacent" shards where "adjacent" is measured by the shard range of

Re: SolrCloud replicaition

2018-05-02 Thread Erick Erickson
1> When the replica fails, the leader tries to resend it, and if the resends fail, then the follower goes into recovery which will eventually get the document caught up. 2> Yes, the client will get a failure indication. Best, Erick On Wed, May 2, 2018 at 3:03 AM, Greenhorn Techie

Re: Query regarding solr 7.3.0

2018-05-02 Thread Erick Erickson
Just what it says. Solr/Lucene like lots of file handles, I regularly see several thousand. If you run out of file handles Solr stops working. Ditto processes. Solr in particular spawns a lot of threads, particularly when handling many incoming requests through Jetty. If you exceed the limit,

Re: Shard size variation

2018-05-02 Thread Michael Joyner
The main reason we go this route is that after a while (with default settings) we end up with hundreds of shards and performance of course drops abysmally as a result. By using a stepped optimize a) we don't run into the issue of needing 3x+ headroom, b) optimize performance penalty during

Re: SolrCloud replicaition

2018-05-02 Thread Erick Erickson
That's a pretty open-ended question. The short form is when the replica switches back to "active" (or green on the admin UI) then it's been caught up. This is all about NRT replicas. PULL and TLOG replicas pull the segments from the leader so the idea of "sending a doc to the replica" doesn't

RE: Collection reload leaves dangling SolrCore instances

2018-05-02 Thread Markus Jelsma
Sounds just like it, i will check it out! Thanks both! Markus -Original message- > From:Erick Erickson > Sent: Wednesday 2nd May 2018 17:21 > To: solr-user > Subject: Re: Collection reload leaves dangling SolrCore instances > >

Re: count mismatch: number of records indexed

2018-05-02 Thread ANNAMANENI RAVEENDRA
Possible causes: if you don’t have a unique key, there is a high chance that you will see less data. Try a hard commit, or check your commit times (hard/soft). On Wed, May 2, 2018 at 9:30 AM Srinivas Kashyap < srini...@tradestonesoftware.com> wrote: > Hi, > > I have standalone solr index

Re: Too many commits

2018-05-02 Thread Erick Erickson
Two possibilities: 1> you have multiple replicas in the same JVM and are seeing commits happen with all of them. 2> ramBufferSizeMB: when you index docs, segments are flushed when the in-memory structures exceed this limit. Is this perhaps what you're seeing? Best, Erick On Wed, May 2, 2018 at

Re: SolrCloud replicaition

2018-05-02 Thread kumar gaurav
Hi Erick, What will happen after the replica has recovered? Does the leader continuously check the status of the replica and send the document again after recovery, or will the replica pull the document for indexing after recovering? Please clarify this behavior for all replica types, i.e. NRT, TLOG and PULL. (I have implemented solr

Re: count mismatch: number of records indexed

2018-05-02 Thread Erick Erickson
And if you _do_ have a uniqueKey ("id" by default), subsequent records will overwrite older records with the same key. The tip from Annamaneni is the first thing I'd try though, make sure you've issued a commit. Best, Erick On Wed, May 2, 2018 at 7:09 AM, ANNAMANENI RAVEENDRA
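
The overwrite behaviour described above can be shown with a toy model. This is a sketch in plain Python, assuming only what the message says (a later record with the same uniqueKey replaces the earlier one); `index_records` is a hypothetical helper, not a Solr API, and the sample records are made up for illustration.

```python
def index_records(records, unique_key="id"):
    """Toy model of an index keyed on a uniqueKey field.

    A later record with the same key replaces the earlier one, so the
    number of stored documents can be lower than the number of records sent.
    """
    index = {}
    for record in records:
        index[record[unique_key]] = record
    return index


# Hypothetical sample data: two of the three records share id "1".
records = [
    {"id": "1", "name": "first"},
    {"id": "2", "name": "second"},
    {"id": "1", "name": "first, resent"},
]
index = index_records(records)
print(len(records), "records sent,", len(index), "documents stored")
```

If the source data contains repeated key values, a gap between the number of rows read and the document count is therefore expected rather than a sign of lost data.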

Way for DataImportHandler to use bind variables

2018-05-02 Thread Mike Konikoff
Is there a way to configure the DataImportHandler to use bind variables for the entity queries? To improve database performance. Thanks, Mike

Introducing a stopword in a query causes ExtendedDismaxQueryParser to produce a radically different parsed query

2018-05-02 Thread Chris Wilt
I began with a 7.2.1 solr instance using the techproducts sample data. Next, I added "a" as a stopword (there were originally no stopwords). I tried two queries: "x a b" and "x b". Here are the raw query parameters: q=x b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0

Re: Indexing throughput

2018-05-02 Thread Greenhorn Techie
Thanks Walter and Erick for the valuable suggestions. We shall try out various values for shards and as well other tuning metrics I discussed in various threads earlier. Kind Regards On 2 May 2018 at 18:24:31, Erick Erickson (erickerick...@gmail.com) wrote: I've seen 1.5 M docs/second.

Re: Solr Heap usage

2018-05-02 Thread Greenhorn Techie
Thanks Shawn for the inputs, which will definitely help us to scale our cluster better. Regards On 2 May 2018 at 18:15:12, Shawn Heisey (apa...@elyograg.org) wrote: On 5/1/2018 5:33 PM, Greenhorn Techie wrote: > Wondering what are the considerations to be aware to arrive at an optimal > heap

Re: Shard size variation

2018-05-02 Thread Erick Erickson
You can always increase the maximum segment size. For large indexes that should reduce the number of segments. But watch your indexing stats, I can't predict the consequences of bumping it to 100G for instance. I'd _expect_ bursty I/O when those large segments started to be created or merged

Re: Too many commits

2018-05-02 Thread Shawn Heisey
On 5/2/2018 4:54 AM, Patrick Recchia wrote: > I'm seeing way too many commits on our solr cluster, and I don't know why. Are you sure there are commits happening?  Do you have logs actually saying that a commit is occurring?  The creation of a new segment does not necessarily mean a commit

Re: Solr Heap usage

2018-05-02 Thread Shawn Heisey
On 5/1/2018 5:33 PM, Greenhorn Techie wrote: > Wondering what are the considerations to be aware to arrive at an optimal > heap size for Solr JVM? Though I did discuss this on the IRC, I am still > unclear on how Solr uses the JVM heap space. Are there any pointers to > understand this aspect

Re: Solr working £ Symbol

2018-05-02 Thread Shawn Heisey
On 5/2/2018 3:13 AM, Mohan Cheema wrote: > We are using Solr to index our data. The data contains £ symbol within the > text and for currency. When data is exported from the source system data > contains £ symbol, however, when the data is imported into the Solr £ symbol > is converted to �. >

Query regarding solr 7.3.0

2018-05-02 Thread Agarwal, Monica (Nokia - IN/Bangalore)
Hi , I am trying to upgrade solr from 7.1.0 to 7.3.0 . While trying to start the solr process the below warnings are observed: *** [WARN] *** Your open file limit is currently 1024. It should be set to 65000 to avoid operational disruption. If you no longer wish to see this warning, set
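
For anyone who wants to check the limit from the warning above before starting Solr, here is a small sketch using Python's standard `resource` module (Unix-like systems only). The threshold of 65000 is the value quoted in the warning; the helper name `open_file_limit_ok` is hypothetical, and this script is not part of Solr.

```python
import resource

RECOMMENDED_OPEN_FILES = 65000  # value quoted in the startup warning


def open_file_limit_ok(soft_limit, recommended=RECOMMENDED_OPEN_FILES):
    """Return True if the soft open-file limit meets the recommendation."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # an unlimited setting trivially satisfies the check
    return soft_limit >= recommended


# Read the limits of the current process (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard} ok={open_file_limit_ok(soft)}")
```

The limit that matters is the one seen by the user account that actually runs Solr, so the check should be run as that user.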

Re: Collection reload leaves dangling SolrCore instances

2018-05-02 Thread Shawn Heisey
On 5/2/2018 4:40 AM, Markus Jelsma wrote: > One of our collections, that is heavy with tons of TokenFilters using large > dictionaries, has a lot of trouble dealing with collection reload. I removed > all custom plugins from solrconfig, dumbed the schema down and removed all > custom filters

Re: Collection reload leaves dangling SolrCore instances

2018-05-02 Thread Erick Erickson
Markus: You may well be hitting SOLR-11882. On Wed, May 2, 2018 at 8:18 AM, Shawn Heisey wrote: > On 5/2/2018 4:40 AM, Markus Jelsma wrote: >> One of our collections, that is heavy with tons of TokenFilters using large >> dictionaries, has a lot of trouble dealing with

Re: Learning to Rank (LTR) with grouping

2018-05-02 Thread ilayaraja
Figured out that offset is used as part of the grouping patch which I applied (SOLR-8776) : solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java + if (query instanceof AbstractReRankQuery){ +topNGroups = cmd.getOffset() +

Re: Indexing throughput

2018-05-02 Thread Erick Erickson
I've seen 1.5 M docs/second. Basically the indexing throughput is gated by two things: 1> the number of shards. Indexing throughput essentially scales up reasonably linearly with the number of shards. 2> the indexing program that pushes data to Solr. Before thinking Solr is the bottleneck, check

Re: Too many commits

2018-05-02 Thread Erick Erickson
You can turn on "infostream", but that is _very_ voluminous. The regular Solr logs at INFO level should show commits though. On Wed, May 2, 2018 at 10:45 AM, Patrick Recchia wrote: > Shawn, > thank you very much for your answer. > > > On Wed, May 2, 2018 at 6:27 PM,

Indexing throughput

2018-05-02 Thread Greenhorn Techie
Hi, The current hardware profile for our production cluster is 20 nodes, each with 24 cores and 256GB memory. Data being indexed is very structured in nature and is about 30 columns or so, out of which half of them are categorical with a defined list of values. The expected peak indexing

Re: Indexing throughput

2018-05-02 Thread Walter Underwood
We have a similar sized cluster, 32 nodes with 36 processors and 60 GB RAM each (EC2 C4.8xlarge). The collection is 24 million documents with four shards. The cluster is Solr 6.6.2. All storage is SSD EBS. We built a simple batch loader in Java. We get about one million documents per minute

Re: Too many commits

2018-05-02 Thread Patrick Recchia
Shawn, thank you very much for your answer. On Wed, May 2, 2018 at 6:27 PM, Shawn Heisey wrote: > On 5/2/2018 4:54 AM, Patrick Recchia wrote: > > I'm seeing way too many commits on our solr cluster, and I don't know > why. > > Are you sure there are commits happening? Do

Re: SolrCloud replicaition

2018-05-02 Thread Greenhorn Techie
Shalin, Given the earlier response by Erick, wondering when this scenario occurs i.e. when the replica node recovers after a time period, wouldn’t it automatically recover all the missed updates by connecting to the leader? My understanding is the below from the responses so far (assuming

RE: Solr working £ Symbol

2018-05-02 Thread Mohan Cheema
>> We are using Solr to index our data. The data contains £ symbol within the >> text and for currency. When data is exported from the source system data >> contains £ symbol, however, when the data is imported into the Solr £ symbol >> is converted to . >> > >How can we keep the £ symbol as

Re: Median Date

2018-05-02 Thread Jim Freeby
All, percentiles only work with numbers, not dates. If I use the ms function, I can get the number of milliseconds between NOW and the import date. Then we can use that result in calculating the median age of the documents using percentiles. rows=0&stats=true&stats.field={!tag=piv1 percentiles='50' func}ms(NOW,
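
The arithmetic behind this approach can be sketched outside Solr. The following is a minimal Python illustration, assuming a plain list of import dates: it turns each date into an age in milliseconds (the role the ms function plays above), takes the median of those numbers, and converts the result back into a date. The function name `median_import_date` and the sample dates are hypothetical, and this computes an exact median, whereas a percentile reported by a search engine may be an approximation.

```python
from datetime import datetime, timedelta, timezone
from statistics import median


def median_import_date(import_dates, now):
    """Return the date whose age is the median age of import_dates."""
    # Age of each document in milliseconds, analogous to ms(NOW, date).
    ages_ms = [(now - d) / timedelta(milliseconds=1) for d in import_dates]
    median_age_ms = median(ages_ms)
    # Subtract the median age from "now" to get back to a date.
    return now - timedelta(milliseconds=median_age_ms)


# Hypothetical sample data: documents imported 1, 3 and 10 days ago.
now = datetime(2018, 5, 2, tzinfo=timezone.utc)
dates = [now - timedelta(days=n) for n in (1, 3, 10)]
print(median_import_date(dates, now))
```

Here the median age is three days, so the median import date comes out as 29 April 2018.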

Faceting question

2018-05-02 Thread Weffelmeyer, Stacie
Hi, Question on faceting. We have a dynamicField that we want to facet on. Below is the field and the type of information that field generates. [cid:image001.png@01D3E22D.DE028870]

Load balanced Solr cluster not updating leader

2018-05-02 Thread Michael B. Klein
Hi all, I've encountered a reproducible and confusing issue with our Solr 6.6 cluster. (Updating to 7.x is an option, but not an immediate one.) This is in our staging environment, running on AWS. To save money, we scale our entire stack down to zero instances every night and spin it back up

RE: User queries end up in filterCache if facetting is enabled

2018-05-02 Thread Markus Jelsma
Hello, Is anyone here able to reproduce this oddity? It shows up in all our collections once we enable the stats page to show filterCache entries. Is this normal? Am I completely missing something? Thanks, Markus -Original message- > From:Markus Jelsma > Sent:

Re: Load balanced Solr cluster not updating leader

2018-05-02 Thread Shawn Heisey
On 5/2/2018 3:52 PM, Michael B. Klein wrote: > It works ALMOST perfectly. The restore operation reports success, and if I > look at the UI, everything looks great in the Cloud graph view. All green, > one leader and two other active instances per collection. > > But once we start updating, we run

Re: Way for DataImportHandler to use bind variables

2018-05-02 Thread Shawn Heisey
On 5/2/2018 1:03 PM, Mike Konikoff wrote: > Is there a way to configure the DataImportHandler to use bind variables for > the entity queries? To improve database performance. Can you clarify where these variables would come from and precisely what you want to do? From what I can tell, you're

Re: Load balanced Solr cluster not updating leader

2018-05-02 Thread Erick Erickson
Perhaps this is: SOLR-11660? On Wed, May 2, 2018 at 4:46 PM, Shawn Heisey wrote: > On 5/2/2018 3:52 PM, Michael B. Klein wrote: >> It works ALMOST perfectly. The restore operation reports success, and if I >> look at the UI, everything looks great in the Cloud graph view.

Re: Introducing a stopword in a query causes ExtendedDismaxQueryParser to produce a radically different parsed query

2018-05-02 Thread Doug Turnbull
This is a problem that we’ve noted too. This blog post discusses the underlying cause https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/ Hope that helps On Wed, May 2, 2018 at 3:07 PM Chris Wilt wrote: > I began with a 7.2.1 solr

Re: Indexing throughput

2018-05-02 Thread Shawn Heisey
On 5/2/2018 10:58 AM, Greenhorn Techie wrote: > The current hardware profile for our production cluster is 20 nodes, each > with 24 cores and 256GB memory. Data being indexed is very structured in > nature and is about 30 columns or so, out of which half of them are > categorical with a defined

Re: Faceting question

2018-05-02 Thread Shawn Heisey
On 5/2/2018 2:56 PM, Weffelmeyer, Stacie wrote: > Question on faceting.  We have a dynamicField that we want to facet > on. Below is the field and the type of information that field generates. > >   > > cid:image001.png@01D3E22D.DE028870 > This image is not available.  This mailing list will

Re: Too many commits

2018-05-02 Thread Shawn Heisey
On 5/2/2018 11:45 AM, Patrick Recchia wrote: > Is there any logging I can turn on to know when a commit happens and/or > when a segment is flushed? The normal INFO-level logging that Solr ships with will log all commits.  It probably doesn't log segment flushes unless they happen as a result of a

Re: Load balanced Solr cluster not updating leader

2018-05-02 Thread Shawn Heisey
On 5/2/2018 6:23 PM, Erick Erickson wrote: > Perhaps this is: SOLR-11660? That definitely looks like the problem that Michael describes. And it indicates that restarting Solr instances after restore is a workaround. The issue also says something that might indicate that collection reload after

Re: SolrCloud replicaition

2018-05-02 Thread Shalin Shekhar Mangar
The min_rf parameter does not fail indexing. It only tells you how many replicas received the live update. So if the value is less than what you wanted then it is up to you to retry the update later. On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie wrote: > Hi, > >
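
The client-side retry that this reply leaves to the caller might look roughly like the sketch below. It is only an illustration under stated assumptions: `send_update` is a hypothetical callable standing in for whatever client code sends the update, and the response layout (an achieved replication factor under `responseHeader` as `rf`) is assumed here rather than taken from the thread.

```python
def needs_retry(response, wanted_rf):
    """Return True if the reported replication factor is below wanted_rf.

    Assumption: the achieved factor is reported as responseHeader["rf"].
    If no value is reported, this sketch does not request a retry.
    """
    achieved = response.get("responseHeader", {}).get("rf")
    return achieved is not None and achieved < wanted_rf


def send_with_retry(send_update, docs, wanted_rf, max_attempts=3):
    """Resend docs until enough replicas acknowledge the live update.

    send_update is a hypothetical callable: send_update(docs, min_rf=...)
    returning the parsed response as a dict. Returns the number of the
    attempt that succeeded, or None if every attempt fell short.
    """
    for attempt in range(1, max_attempts + 1):
        response = send_update(docs, min_rf=wanted_rf)
        if not needs_retry(response, wanted_rf):
            return attempt
    return None
```

A real client would probably wait between attempts and log the shortfall; both are left out to keep the sketch short.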