Re: Handling Locales in Solr

2021-02-24 Thread Markus Jelsma
Hello,

We put all our customers in the same core/collection for this reason: it is
not practical to manage hundreds of cores, each with its own small overhead.
Separate cores can be advantageous for relevance tuning, though, since the
statistics are not skewed by other customers' data.

In your case, an unused core is probably slow because its index is no longer
in the OS cache and/or it has to be loaded from a slow drive.

With regards to the locales, I would probably separate the cores by topic
only, and have the different languages share the same collection/core.

Regards,
Markus
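
To illustrate the single-collection-per-topic approach, a minimal sketch of
what the schema and queries could look like (the field and locale names below
are only examples, not taken from Florian's setup):

  <field name="locale" type="string" indexed="true" stored="true"/>
  <field name="title_de" type="text_de" indexed="true" stored="true"/>
  <field name="title_en" type="text_en" indexed="true" stored="true"/>

A query against a single kbarticle collection would then restrict by locale
with a filter query, e.g. q=title_de:drucker&fq=locale:de-de, instead of
selecting a dedicated core per locale. Filter queries are cached, so the
per-locale restriction stays cheap.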



On Wed, 24 Feb 2021 at 12:09, Krönert Florian <
florian.kroen...@orbis.de> wrote:

> Hi everyone,
>
>
>
> First up thanks for this group, I appreciate it very much for exchanging
> opinions on how to use Solr.
>
>
>
> We built a Solr instance for one of our customers which is used for
> searching data on his website.
>
> We need to search different data (kb articles, products and external
> links) in different locales.
>
>
>
> For our logic it seemed best to separate solr Cores by topic and locale,
> so we have cores like this:
>
> kbarticle_de-de
>
> kbarticle_en-us
>
> …
>
> products_de-de
>
> products_en-us
>
> …
>
> links_de-de
>
> links_en-us
>
>
>
> First we had only 3 locales, but it grew pretty quickly to 16 locales, so
> that we now have 48 Solr cores already.
>
> There would of course have been different approaches to realizing this, so
> we're wondering whether we are using Solr in a suboptimal way.
>
>
>
> We found out that when a search is started on a locale that has not been used
> for some time, it often takes >10 seconds to execute the search.
>
>
>
> We then find logs like this, where it seems as if Solr needs to start a
> searcher first, which takes time:
>
> 2021-02-20 04:33:42.634 INFO  (Thread-20674) [   ]
> o.a.s.s.SolrIndexSearcher Opening [Searcher@775f8595[kbarticles_en-gb]
> main]
>
> 2021-02-20 04:33:42.643 INFO  (searcherExecutor-26-thread-1) [   ]
> o.a.s.c.QuerySenderListener QuerySenderListener sending requests to
> Searcher@775f8595[kbarticles_en-gb]
>
> …
>
>
>
> Is that an issue? It would be good to know whether our localization
> approach causes issues with Solr and whether we should restructure our core
> design.
>
> Any help would be very much appreciated.
>
>
>
> Kind Regards,
>
>
>
> *Florian Krönert*
> Senior Software Developer
>
> 
>
> *ORBIS AG | *Planckstraße 10 | D-88677 Markdorf
>
> Phone: +49 7544 50398 21 | Mobile: +49 162 3065972 | E-Mail:
> florian.kroen...@orbis.de
> www.orbis.de
>
>
> 
>
> Registered Seat: Saarbrücken
> Commercial Register Court: Amtsgericht Saarbrücken, HRB 12022
> Board of Management: Thomas Gard (Chairman), Michael Jung, Stefan
> Mailänder, Frank Schmelzer
> Chairman of the Supervisory Board: Ulrich Holzer


Re: Overriding Sort and boosting some docs to the top

2021-02-24 Thread Markus Jelsma
I would stick to the query elevation component; it is pretty fast, and it is
easier to handle/configure elevation IDs there than with function queries. We
have customers that elevate a dozen documents for a given query and it works
just fine.

I also do not expect the function query variant to be more performant, but I
am not sure. Even if it were, would the difference be measurable?

Regards,
Markus
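
For reference, a minimal sketch of how the component is usually wired up
(component name, file name and document IDs are just examples):

In solrconfig.xml:

  <searchComponent name="elevator" class="solr.QueryElevationComponent">
    <str name="queryFieldType">string</str>
    <str name="config-file">elevate.xml</str>
  </searchComponent>

In elevate.xml:

  <elevate>
    <query text="example query">
      <doc id="DOC1"/>
      <doc id="DOC2"/>
    </query>
  </elevate>

At request time the IDs can also be passed ad hoc with elevateIds=DOC1,DOC2,
and forceElevation=true keeps the elevated documents on top even when an
explicit sort is given.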

On Wed, 24 Feb 2021 at 12:15, Mark Robinson  wrote:

> Thanks for the reply Markus!
>
> I did try it.
> My question specifically was (repasting here):-
>
> Which is more recommended/ performant?
>
> Note: Assume that I have hundreds of ids to boost like this.
> Is there a difference in the answer if the number of docs to be boosted after
> the sort is small?
>
> Thanks!
> Mark
>
> On Wed, Feb 24, 2021 at 4:41 PM Markus Jelsma 
> wrote:
>
> > Hello,
> >
> > You are probably looking for the elevation component, check it out:
> >
> https://lucene.apache.org/solr/guide/8_8/the-query-elevation-component.html
> >
> > Regards,
> > Markus
> >
> > On Wed, 24 Feb 2021 at 11:59, Mark Robinson <
> > mark123lea...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I wanted to sort and then boost some docs to the top and these docs
> > should
> > > be my first set in the results and the following ones appearing
> according
> > > to my sort criteria.
> > >
> > > I understand that sort overrides bq hence bq may not be used in this
> case
> > >
> > > - I brought my boost into sort using "query()" and achieved my goal.
> > > - I tried sort and then elevate with forceElevation and that also
> worked.
> > >
> > > My question is which is more recommended/ performant?
> > >
> > > Note: Assume that I have hundreds of ids to boost like this.
> > > Is there a difference in the answer if the number of docs to be boosted
> > > after the sort is small?
> > >
> > > Could someone please share your thoughts/experience?
> > >
> > > Thanks!
> > > Mark.
> > >
> >
>


Re: Overriding Sort and boosting some docs to the top

2021-02-24 Thread Markus Jelsma
Hello,

You are probably looking for the elevation component, check it out:
https://lucene.apache.org/solr/guide/8_8/the-query-elevation-component.html

Regards,
Markus

On Wed, 24 Feb 2021 at 11:59, Mark Robinson  wrote:

> Hi,
>
> I wanted to sort and then boost some docs to the top and these docs should
> be my first set in the results and the following ones appearing according
> to my sort criteria.
>
> I understand that sort overrides bq hence bq may not be used in this case
>
> - I brought my boost into sort using "query()" and achieved my goal.
> - I tried sort and then elevate with forceElevation and that also worked.
>
> My question is which is more recommended/ performant?
>
> Note: Assume that I have hundreds of ids to boost like this.
> Is there a difference in the answer if the number of docs to be boosted after
> the sort is small?
>
> Could someone please share your thoughts/experience?
>
> Thanks!
> Mark.
>


Re: Using multiple language stop words in Solr Core

2021-02-11 Thread Markus Jelsma
Hello Abhay,

Do not enable stopwords unless you absolutely know what you are doing. In
general, it is a bad practice that somehow still lingers on.

But to answer the question: you should have one field and fieldType per
language, so the language-specific filters go there. Also, using edismax for
multi-language search with mm (minimum should match) and stopwords enabled
spells trouble.

Set up per language fieldTypes without stopwords.

Regards,
Markus
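
As an illustration, per-language fieldTypes could look roughly like this (a
sketch based on the analysis factories Solr ships with, without any stopword
filter):

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.ElisionFilterFactory" articles="lang/contractions_fr.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.FrenchLightStemFilterFactory"/>
    </analyzer>
  </fieldType>

Each language then gets its own field (e.g. title_en, title_fr) that uses the
matching fieldType.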

On Thu, 11 Feb 2021 at 12:44, Abhay Kumar <
abhay.ku...@anjusoftware.com> wrote:

> Hello Team,
>
>
>
> Solr provides some data types out of the box in the managed schema for
> different languages such as English, French, Japanese, etc.
>
>
>
> We are using the common data type "text_general" for field declarations and
> using stopwords.txt for stopword filtering.
>
>
>
> <fieldType name="text_general" class="solr.TextField" autoGeneratePhraseQueries="true"
>  positionIncrementGap="100" multiValued="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.SynonymGraphFilterFactory" ignoreCase="true" synonyms="synonyms.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
>
>
> While syncing data to the Solr core we are importing text in different
> languages into the fields, such as French, English, German, etc.
>
>
>
> My question is: should we put the stopwords for all the different languages
> into the same "stopwords.txt" file, or how does Solr use different language
> stopwords?
>
>
>
>
>
>
>
> *Warm Regards,*
>
>
>
> *Abhay Kumar* | Lead Developer
>
> 401/402, Pride Portal, Shivaji Housing Society, Off. S. B. Road | Shivaji
> Nagar, Pune-411 016
> +91 20 2563 1011 | Mobile: +91 9096644108
> anjusoftware.com
>
>


Re: Excessive logging 8.8.0

2021-02-05 Thread Markus Jelsma
Thanks!

On Thu, 4 Feb 2021 at 20:04, Chris Hostetter  wrote:

>
> FWIW: that log message was added to branch_8x by 3c02c9197376 as part of
> SOLR-15052 ... it's based on master commit 8505d4d416fd -- but that does
> not add that same logging message ... so it definitely smells like a
> mistake to me that 8x would add this INFO level log message that master
> doesn't have.
>
> it's worth noting that 3c02c9197376 included many other "log.info(...)"
> messages that had 'nocommit' comments to change them to debug later ...
> making me more confident this is a mistake...
>
> https://issues.apache.org/jira/browse/SOLR-15136
>
>
> : Date: Thu, 4 Feb 2021 12:45:16 +0100
> : From: Markus Jelsma 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user@lucene.apache.org
> : Subject: Excessive logging 8.8.0
> :
> : Hello all,
> :
> : We upgraded some nodes to 8.8.0 and notice there is excessive logging on
> : INFO when some traffic/indexing is going on:
> :
> : 2021-02-04 11:42:48.535 INFO  (qtp261748192-268) [c:data s:shard2
> : r:core_node4 x:data_shard2_replica_t2] o.a.s.c.c.ZkStateReader already
> : watching , added to stateWatchers
> :
> : Is this to be expected?
> :
> : Thanks,
> : Markus
> :
>
> -Hoss
> http://www.lucidworks.com/
>


Excessive logging 8.8.0

2021-02-04 Thread Markus Jelsma
Hello all,

We upgraded some nodes to 8.8.0 and notice there is excessive logging on
INFO when some traffic/indexing is going on:

2021-02-04 11:42:48.535 INFO  (qtp261748192-268) [c:data s:shard2
r:core_node4 x:data_shard2_replica_t2] o.a.s.c.c.ZkStateReader already
watching , added to stateWatchers

Is this to be expected?

Thanks,
Markus


Re: different score from different replica of same shard

2021-01-13 Thread Markus Jelsma
Hallo Bernd,

I see the different replica types in the 7.1 manual [1] but not in the 6.6 one.
ExactStatsCache should work in 6.6; just add it to solrconfig.xml, not the
request handler [2]. It will slow down searches due to the added overhead.

Regards,
Markus

[1]
https://lucene.apache.org/solr/guide/7_1/shards-and-indexing-data-in-solrcloud.html#types-of-replicas
[2] https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
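
For completeness, enabling ExactStatsCache is a single top-level element in
solrconfig.xml (standard syntax, nothing specific to this setup):

  <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>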

On Wed, 13 Jan 2021 at 15:11, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> Hello Markus,
>
> thanks a lot.
> Is TLOG also for SOLR 6.6.6 or only 8.x and up?
>
> I will first try ExactStatsCache.
> Should be added as invariant to request handler, right?
>
> Comparing the replica index directories they have different size and
> the index version and generation is different. Also Max Doc.
> But Num Docs is the same.
>
> Regards,
> Bernd
>
>
> On 13.01.21 at 14:54, Markus Jelsma wrote:
> > Hello Bernd,
> >
> > This is normal for NRT replicas, because the way segments are merged and
> > deletes are removed is not synchronized between replicas. In that case
> > counts for TF and IDF and norms become slightly different.
> >
> > You can either use ExactStatsCache, which fetches counts for terms before
> > scoring so that all replicas use the same counts, or change the replica
> > types to TLOG. With TLOG, segments are fetched from the leader and are thus
> > identical.
> >
> > Regards,
> > Markus
> >
> > On Wed, 13 Jan 2021 at 14:45, Bernd Fehling <
> > bernd.fehl...@uni-bielefeld.de> wrote:
> >
> >> Hello list,
> >>
> >> a question for better understanding scoring of a shard in a cloud.
> >>
> >> I see different scores from different replicas of the same shard.
> >> Is this normal and if yes, why?
> >>
> >> My understanding until now was that replicas are always the same within
> a
> >> shard
> >> and the same query to each replica within a shard gives always the same
> >> score.
> >>
> >> Can someone help me to understand this?
> >>
> >> Regards
> >> Bernd
> >>
> >
>


Re: different score from different replica of same shard

2021-01-13 Thread Markus Jelsma
Hello Bernd,

This is normal for NRT replicas, because the way segments are merged and
deletes are removed is not synchronized between replicas. In that case
counts for TF and IDF and norms become slightly different.

You can either use ExactStatsCache, which fetches counts for terms before
scoring so that all replicas use the same counts, or change the replica types
to TLOG. With TLOG, segments are fetched from the leader and are thus
identical.

Regards,
Markus
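
As an illustration of the TLOG option, a collection can be created with only
TLOG replicas through the Collections API (the collection name and counts
below are placeholders):

  /admin/collections?action=CREATE&name=mycollection&numShards=2&tlogReplicas=2

Existing NRT replicas cannot be converted in place; the usual route is
ADDREPLICA with type=TLOG followed by DELETEREPLICA for the old NRT replicas.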

On Wed, 13 Jan 2021 at 14:45, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> Hello list,
>
> a question for better understanding scoring of a shard in a cloud.
>
> I see different scores from different replicas of the same shard.
> Is this normal and if yes, why?
>
> My understanding until now was that replicas are always the same within a
> shard
> and the same query to each replica within a shard gives always the same
> score.
>
> Can someone help me to understand this?
>
> Regards
> Bernd
>


Re: Monitoring Solr for currently running queries

2020-12-29 Thread Markus Jelsma
Hello Ufuk,

You can log slow queries [1].

If you want to see currently running queries you would have to extend
SearchHandler and build the custom logic yourself. Watch out for SolrCloud,
because the main query as well as the per-shard queries can pass through that
same SearchHandler; you can distinguish between them by reading the
isShard=true parameter.

Regards,
Markus

[1] https://lucene.apache.org/solr/guide/6_6/configuring-logging.html
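
The slow query threshold itself is configured in the <query> section of
solrconfig.xml (the value is just an example):

  <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>

Requests that take longer than the threshold are then written to the log at
WARN level, so they are easy to separate from regular INFO request logging.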

On Tue, 29 Dec 2020 at 16:49, ufuk yılmaz  wrote:

> Hello All,
>
> Is there a way to see currently executing queries in a SolrCloud? Or a
> general strategy to detect a query using an absurd amount of resources?
>
> We are using Solr for not only simple querying, but running complex
> streaming expressions, facets with large data etc. Sometimes, randomly, CPU
> usage gets so high that it starts to respond very slowly to even simple
> queries, or don’t respond at all. I’m trying to determine if it’s a result
> of simple overloading of the system by many “normal” queries, or someone
> sends Solr an unreasonably compute-heavy request.
>
> A few days ago when this occured, I stopped every service that can send
> Solr a query. After that, for about an hour, nodes were reading from the
> disk at 1GB/s which is the maximum of our disks. Then everything went back
> to the normal as I started the other services.
>
> One (bad) idea I had is to build a proxy service which proxies every
> request to our SolrCloud and monitors current running requests, but scaling
> this to the size of SolrCloud may be reinventing the wheel.
>
> For now all I can detect is that Solr is struggling, but I have no idea
> what causes that and when.
>
> -Chees and happy new year
>


RE: Performance issues with CursorMark

2020-10-26 Thread Markus Jelsma
Hello Anshum,

Good point! We sort on the collection's uniqueKey, our id field, and this one
does not have docValues enabled. It could be a contender, but is it the
problem? I cannot easily test it at this scale.

Thanks,
Markus
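
If the sort field does turn out to be the culprit, the schema change would be
roughly this (our id field, otherwise standard syntax):

  <field name="id" type="string" indexed="true" stored="true" docValues="true" required="true"/>

Without docValues on the sort field, Solr has to uninvert the field and build
the sort values on the heap, which would fit the memory behaviour described
below.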
 
-Original message-
> From:Anshum Gupta 
> Sent: Monday 26th October 2020 17:00
> To: solr-user@lucene.apache.org
> Subject: Re: Performance issues with CursorMark
> 
> Hey Markus,
> 
> What are you sorting on? Do you have docValues enabled on the sort field ?
> 
> On Mon, Oct 26, 2020 at 5:36 AM Markus Jelsma 
> wrote:
> 
> > Hello,
> >
> > We have been using a simple Python tool for a long time that eases
> > movement of data between Solr collections, it uses CursorMark to fetch
> > small or large pieces of data. Recently it stopped working when moving data
> > from a production collection to my local machine for testing, the Solr
> > nodes began to run OOM.
> >
> > I added 500M to the 3G heap and now it works again, but slow (240docs/s)
> > and costing 3G of the entire heap just to move 32k docs out of 76m total.
> >
> > Solr 8.6.0 is running with two shards (1 leader+1 replica), each shard has
> > 38m docs almost no deletions (0.4%) taking up ~10.6g disk space. The
> > documents are very small, they are logs of various interactions of users
> > with our main text search engine.
> >
> > I monitored all four nodes with VisualVM during the transfer, all four
> > went up to 3g heap consumption very quickly. After the transfer it took a
> > while for two nodes to (forcefully) release the heap space that was no longer
> > needed for the transfer. The two other nodes, now, 17 minutes later, still think
> > they have to hang on to their heap consumption. When I start the same
> > transfer again, the nodes that already have high memory consumption just
> > seem to reuse that, not consuming additional heap. At least the second time
> > it went at 920 docs/s, while we are used to transferring these tiny documents
> > at the light speed of multiple thousands per second.
> >
> > What is going on? We do not need additional heap, Solr is clearly not
> > asking for more and GC activity is minimal. Why did it become so slow?
> > Regular queries on the collection are still going fast, but CursorMarking
> > even through a tiny portion is taking time and memory.
> >
> > Many thanks,
> > Markus
> >
> 
> 
> -- 
> Anshum Gupta
> 


Performance issues with CursorMark

2020-10-26 Thread Markus Jelsma
Hello,

We have been using a simple Python tool for a long time that eases movement of
data between Solr collections; it uses CursorMark to fetch small or large
pieces of data. Recently it stopped working when moving data from a production
collection to my local machine for testing: the Solr nodes began to run OOM.

I added 500M to the 3G heap and now it works again, but slowly (240 docs/s), and
it costs 3G of the entire heap just to move 32k docs out of 76m total.

Solr 8.6.0 is running with two shards (1 leader+1 replica), each shard has 38m 
docs almost no deletions (0.4%) taking up ~10.6g disk space. The documents are 
very small, they are logs of various interactions of users with our main text 
search engine.

I monitored all four nodes with VisualVM during the transfer; all four went up
to 3g heap consumption very quickly. After the transfer it took a while for two
nodes to (forcefully) release the heap space that was no longer needed for the
transfer. The two other nodes, now, 17 minutes later, still think they have to
hang on to their heap consumption. When I start the same transfer again, the
nodes that already have high memory consumption just seem to reuse that, not
consuming additional heap. At least the second time it went at 920 docs/s,
while we are used to transferring these tiny documents at the light speed of
multiple thousands per second.

What is going on? We do not need additional heap; Solr is clearly not asking
for more and GC activity is minimal. Why did it become so slow? Regular queries
on the collection are still fast, but cursoring even through a tiny portion
takes time and memory.

Many thanks,
Markus


RE: advice on whether to use stopwords for use case

2020-10-01 Thread Markus Jelsma
Well, when not splitting on whitespace you can use the CharFilter for regex
replacements [1] to clear the entire search string if a banned word is found
anywhere in the string:

.*(cigarette|tobacco).*

[1] 
https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.PatternReplaceCharFilterFactory
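
Put into a fieldType, that could look roughly like this (a sketch, not taken
from Derek's schema; the char filter is only applied on the query side so the
indexed content stays untouched):

  <fieldType name="text_blocklist" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern=".*(cigarette|tobacco).*" replacement=""/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

If the pattern matches anywhere in the input, the char filter empties the whole
string before tokenization, so the query produces no terms for that field and
therefore no hits.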
 
-Original message-
> From:Walter Underwood 
> Sent: Thursday 1st October 2020 18:20
> To: solr-user@lucene.apache.org
> Subject: Re: advice on whether to use stopwords for use case
> 
> I can’t think of an easy way to do this in Solr.
> 
> Do a bunch of string searches on the query on the client side. If any of them 
> match, 
> make a “no hits” result page.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> > On Sep 30, 2020, at 11:56 PM, Derek Poh  wrote:
> > 
> > Yes, the requirement (for now) is not to return any results. I think they
> > may change the requirements, pending their return from the holidays.
> > 
> >> If so, then check for those words in the query before sending it to Solr.
> > That is what I think so too.
> > 
> > Thinking further, using stopwords for this, there will still be results
> > returned when the number of words in the search keywords is more than the
> > stopwords.
> > 
> > On 1/10/2020 2:57 am, Walter Underwood wrote:
> >> I’m not clear on the requirements. It sounds like the query “cigar” or 
> >> “cuban cigar”
> >> should return zero results. Is that right?
> >> 
> >> If so, then check for those words in the query before sending it to Solr.
> >> 
> >> But the stopwords approach seems like the requirement is different. Could 
> >> you give
> >> some examples?
> >> 
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org 
> >> http://observer.wunderwood.org/   (my 
> >> blog)
> >> 
> >>> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  
> >>>  wrote:
> >>> 
> >>> You may also want to look at something like: 
> >>> https://docs.querqy.org/index.html 
> >>> 
> >>> ApacheCon had (is having..) a presentation on it that seemed quite
> >>> relevant to your needs. The videos should be live in a week or so.
> >>> 
> >>> Regards,
> >>>   Alex.
> >>> 
> >>> On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  
> >>>  wrote:
>  I am not sure why you think stop words are your first choice. Maybe I
>  misunderstand the question. I read it as that you need to exclude
>  completely a set of documents that include specific keywords when
>  called from specific module.
>  
>  If I wanted to differentiate the searches from specific module, I
>  would give that module a different end-point (Request Query Handler),
>  instead of /select. So, /nocigs or whatever.
>  
>  Then, in that end-point, you could do all sorts of extra things, such
>  as setting appends or even invariants parameters, which would include
>  filter query to exclude any documents matching specific keywords. I
>  assume it is ok to return documents that are matching for other
>  reasons.
>  
>  Ideally, you would mark the cigs documents during indexing with a
>  binary or enumeration flag and then during search you just need to
>  check against that flag. In that case, you could copyField  your text
>  and run it against something like
>  https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
>   
>  
>  combined with Shingles for multiwords. Or similar. And just transform
>  it as index-only so that the result is basically a yes/no flag.
>  Similar thing could be done with UpdateRequestProcessor pipeline if
>  you want to end up with a true boolean flag. The idea is the same,
>  just to have an index-only flag that you force lock into for any
>  request from specific module.
>  
>  Or even with something like ElevationSearchComponent. Same idea.
>  
>  Hope this helps.
>  
>  Regards,
>    Alex.
>  
>  On Tue, 29 Sep 2020 at 22:28, Derek Poh  
>   wrote:
> > Hi
> > 
> > I have read in the mailings list that we should try to avoid using stop
> > words.
> > 
> > I have a use case where I would like to know if there is other
> > alternative solutions beside using stop words.
> > 
> > There is a business requirement to return zero results when the search is
> > for cigarette-related words and the search is coming from a particular
> > module on our site. It does not apply to all searches from our site.
> > There is a list of these cigarette related words. This list contains
> > single word, multiple words (Electronic cigar), multiple words with
> 

RE: Trailing space issue with indexed data.

2020-08-18 Thread Markus Jelsma
Hello,

You can use TrimFieldUpdateProcessorFactory [1] in your URP chain to remove 
leading or trailing whitespace when indexing.

Regards,
Markus

[1] 
https://lucene.apache.org/solr/8_6_0//solr-core/org/apache/solr/update/processor/TrimFieldUpdateProcessorFactory.html
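
A minimal chain using it could look like this (the chain name is an example;
the processor sits before the usual Log/Run processors):

  <updateRequestProcessorChain name="trim-fields">
    <processor class="solr.TrimFieldUpdateProcessorFactory"/>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

With no selectors it trims every CharSequence-valued field; it can be limited
with the usual fieldName/typeClass selectors of the field-mutating processors,
and the chain is selected per request with update.chain=trim-fields.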

 
 
-Original message-
> From:Fiz N 
> Sent: Tuesday 18th August 2020 19:57
> To: solr-user@lucene.apache.org
> Subject: Trailing space issue with indexed data.
> 
> Hello Solr experts,
> 
> I am using SOLR 8.6 and indexing data from MSSQL DB.
> 
> after indexing is done I am seeing
> 
> “Page_number”:”1“,
> “Doc_name”:”  office 770 toll free “
> “Doc_text”:” From:  Hyan, gan \nTo:  Delacruz
>   Decruz \n“
> 
> I want to remove these empty spaces.
> 
> How do I achieve this?
> 
> Thanks
> Fiz Nadian.
> 


RE: Drop bad document in update batch

2020-08-18 Thread Markus Jelsma
Ah yes, I should have looked at the list of subclasses of
UpdateRequestProcessorFactory in the API docs, as it is not mentioned in the
manual.

Thanks Erick! 
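
For reference, a sketch of a chain using it (the maxErrors value is arbitrary):

  <updateRequestProcessorChain name="tolerant-chain">
    <processor class="solr.TolerantUpdateProcessorFactory">
      <int name="maxErrors">10</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Up to maxErrors bad documents are skipped and reported in the response header
instead of failing the whole batch.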
 
-Original message-
> From:Erick Erickson 
> Sent: Tuesday 18th August 2020 19:04
> To: solr-user@lucene.apache.org
> Subject: Re: Drop bad document in update batch
> 
> I think you’re looking for TolerantUpdateProcessor(Factory), added in 
> SOLR-445. It hung around for a LOGGG time and didn’t actually get 
> added until 6.1.
> 
> > On Aug 18, 2020, at 12:51 PM, Markus Jelsma  
> > wrote:
> > 
> > Hello,
> > 
> > Normally, if a single document is bad, the whole indexing batch is dropped. 
> > I think i remember there was an URP(?) that discards bad documents from the 
> > batch, but i cannot find it in the manual [1].
> > 
> > Is it possible or am i starting to imagine things?
> > 
> > Thanks,
> > Markus
> > 
> > [1] https://lucene.apache.org/solr/guide/8_6/update-request-processors.html
> 
> 


Drop bad document in update batch

2020-08-18 Thread Markus Jelsma
Hello,

Normally, if a single document is bad, the whole indexing batch is dropped. I 
think I remember there was an URP(?) that discards bad documents from the
batch, but I cannot find it in the manual [1].

Is it possible or am I starting to imagine things?

Thanks,
Markus

[1] https://lucene.apache.org/solr/guide/8_6/update-request-processors.html


RE: Manipulating client's query using a Query object

2020-08-17 Thread Markus Jelsma
Hello Edward,

You asked for the 'Lucene Query representation of the client's query', which is
already inside Solr and needs no forwarding to anything. Just return it in
parse() and you are good to go.

The Query object contains the analyzed form of your query string.
ExtendedDismax has a variable (I think it was qstr) that contains the original
input string, so you have access to that too.

Regards,
Markus
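
As a rough sketch of that extension point (class and plugin names are made up
for the example; only parse() is overridden):

  import org.apache.lucene.search.Query;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.ExtendedDismaxQParser;
  import org.apache.solr.search.ExtendedDismaxQParserPlugin;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.SyntaxError;

  public class MyEdismaxQParserPlugin extends ExtendedDismaxQParserPlugin {
    @Override
    public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
      return new ExtendedDismaxQParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          Query q = super.parse();
          // inspect or rewrite the Lucene Query here before Solr executes it
          return q;
        }
      };
    }
  }

Register it in solrconfig.xml with <queryParser name="myedismax"
class="com.example.MyEdismaxQParserPlugin"/> and select it with
defType=myedismax.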


-Original message-
> From:Edward Turner 
> Sent: Monday 17th August 2020 21:25
> To: solr-user@lucene.apache.org
> Subject: Re: Manipulating client's query using a Query object
> 
> Hi Markus,
> 
> That's really great info. Thank you.
> 
> Supposing we've now modified the Query object, do you know how we would get
> the corresponding query String, which we could then forward to our
> Solrcloud via SolrClient?
> 
> (Or should we be using this extended ExtendedDisMaxQParser class server
> side in Solr?)
> 
> Kind regards,
> 
> Edd
> 
> ----
> Edward Turner
> 
> 
> On Mon, 17 Aug 2020 at 15:06, Markus Jelsma 
> wrote:
> 
> > Hello Edward,
> >
> > Yes you can by extending ExtendedDismaxQParser [1] and override its
> > parse() method. You get the main Query object through super.parse().
> >
> > If you need even more fine grained control on how Query objects are
> > created you can extend ExtendedSolrQueryParser's [2] (inner class)
> > newFieldQuery() method.
> >
> > Regards,
> > Markus
> >
> > [1]
> > https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/search/ExtendedDismaxQParser.html
> > [2]
> > https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/search/ExtendedDismaxQParser.ExtendedSolrQueryParser.html
> >
> > -Original message-
> > > From:Edward Turner 
> > > Sent: Monday 17th August 2020 15:53
> > > To: solr-user@lucene.apache.org
> > > Subject: Manipulating client's query using a Query object
> > >
> > > Hi all,
> > >
> > > Thanks for all your help recently. We're now using the edismax query
> > parser
> > > and are happy with its behaviour. We have another question which maybe
> > > someone can help with.
> > >
> > > We have one use case where we optimise our query before sending it to
> > Solr,
> > > and we do this by manipulating the client's input query string. However,
> > > we're slightly uncomfortable using String manipulation to do this as
> > > there's always the possibility we parse their string wrongly. (We have a
> > > large test suite to check if we're doing the right thing, but even then,
> > > String manipulation doesn't feel right ...).
> > >
> > > Question: is it possible to get a Lucene Query representation of the
> > > client's query, which we can then navigate and manipulate -- before we
> > then
> > > send the String representation of this Query to Solr for evaluation?
> > >
> > > Kind regards and thank you for your help in advance,
> > >
> > > Edd
> > >
> >
> 


RE: Manipulating client's query using a Query object

2020-08-17 Thread Markus Jelsma
Hello Edward,

Yes you can, by extending ExtendedDismaxQParser [1] and overriding its parse()
method. You get the main Query object through super.parse().

If you need even more fine grained control on how Query objects are created you 
can extend ExtendedSolrQueryParser's [2] (inner class) newFieldQuery() method.

Regards,
Markus

[1] 
https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/search/ExtendedDismaxQParser.html
[2] 
https://lucene.apache.org/solr/8_6_0/solr-core/org/apache/solr/search/ExtendedDismaxQParser.ExtendedSolrQueryParser.html

-Original message-
> From:Edward Turner 
> Sent: Monday 17th August 2020 15:53
> To: solr-user@lucene.apache.org
> Subject: Manipulating client's query using a Query object
> 
> Hi all,
> 
> Thanks for all your help recently. We're now using the edismax query parser
> and are happy with its behaviour. We have another question which maybe
> someone can help with.
> 
> We have one use case where we optimise our query before sending it to Solr,
> and we do this by manipulating the client's input query string. However,
> we're slightly uncomfortable using String manipulation to do this as
> there's always the possibility we parse their string wrongly. (We have a
> large test suite to check if we're doing the right thing, but even then,
> String manipulation doesn't feel right ...).
> 
> Question: is it possible to get a Lucene Query representation of the
> client's query, which we can then navigate and manipulate -- before we then
> send the String representation of this Query to Solr for evaluation?
> 
> Kind regards and thank you for your help in advance,
> 
> Edd
> 


RE: eDismax query syntax question

2020-06-13 Thread Markus Jelsma
Hello,

These are special characters; if you don't need them as operators, you must escape them.

See top of the article:
https://lucene.apache.org/solr/guide/8_5/the-extended-dismax-query-parser.html

Markus
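
If the query string comes from an application anyway, the escaping can also be
done with SolrJ instead of by hand (just an illustration):

  import org.apache.solr.client.solrj.util.ClientUtils;

  String userInput = "1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE";
  String escaped = ClientUtils.escapeQueryChars(userInput);
  // escaped can now be passed as q without ( ) - + etc. acting as operators

ClientUtils.escapeQueryChars escapes every character the Lucene/Solr query
syntax treats as special.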


 
 
-Original message-
> From:Webster Homer 
> Sent: Friday 12th June 2020 22:09
> To: solr-user@lucene.apache.org
> Subject: eDismax query syntax question
> 
> Recently we found strange behavior in a query. We use eDismax as the query 
> parser.
> 
> This is the query term:
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)-PYRIMIDINE-2,4,6-TRIONE
> 
> It should hit one document in our index. It does not. However, if you use the 
> Dismax query parser it does match the record.
> 
> The problem seems to involve the parenthesis and the dashes. If you escape 
> the dash after the parenthesis it matches
> 1,3-DIMETHYL-5-(3-PHENYL-ALLYLIDENE)\-PYRIMIDINE-2,4,6-TRIONE
> 
> I thought that eDismax and Dismax escaped all Lucene special characters
> before passing the query to Lucene, although I also remember reading that +
> and - can have special significance in a query if preceded by white space.
> I can find very little documentation on how either query parser works.
> 
> Is this expected behavior or is this a bug? If expected, where can I find 
> documentation?
> 
> 
> 
> 


RE: Building a web based search engine

2020-06-02 Thread Markus Jelsma
Hello, see inline.

Markus 
 
-Original message-
> From:Jim Anderson 
> Sent: Tuesday 2nd June 2020 19:59
> To: solr-user@lucene.apache.org
> Subject: Re: Building a web based search engine
> 
> Hi Markus,
> 
> Thanks for your response. I appreciate you giving me the bullet list of
> things to do. I can take that list and work from it and hopefully make
> progress, but I don't think it will get me where I want to be - just a bit
> closer.
> 
> You say, "We have been building precisely that for over ten years now". Is
> it in a document? I would like to read it.

No, i haven't written a book about it and don't intend to.

> Some basic things I would like to know that should be documented:
> 
> 1) Using nutch as the crawler, how do I run a nutch thread that crawls my
> named URLs.

You don't, but run Nutch as a separate process from the command line. Or when 
you have to deal with 50+ million records, you run it on Apache Hadoop.

> 2) I will use nutch to visit websites and create documents in solr. How do
> I verify that documents have been created in Solr via nutch?

By searching for them using Solr, or retrieving them by URL, using Solr's 
simple HTTP API. You can use SolrJ, the Java client, too. 

> 3) Solr will store and index the documents. How do I verify the index?

See 2.

> 4) I assume I can run a tomcat server on my host and then provide a
> localhost URI to my web browser. Tomcat will then forward the URI to my
> application. My application will take a query and using a java API is will
> pass the query to Solr. I would like to see an example of a java program
> passing a query to Solr.

See 3. Though I would recommend using Solr's HTTP API; it is much easier to
deal with.
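
As a tiny illustration of point 4 (the core name and query are placeholders),
with SolrJ a query is only a few lines:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;

  public class QueryDemo {
    public static void main(String[] args) throws Exception {
      // point the client at the core/collection Nutch indexes into
      try (HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/nutch").build()) {
        QueryResponse rsp = client.query(new SolrQuery("content:lucene"));
        for (SolrDocument doc : rsp.getResults()) {
          System.out.println(doc.getFieldValue("url"));
        }
      }
    }
  }

The same query over plain HTTP is simply
http://localhost:8983/solr/nutch/select?q=content:lucene.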

> 5) Solr will take the query, parse it and then locate appropriate documents
> using the index. Is there a log in Solr showing what queries have been
> parsed?

Yes, see Solr's log directory.

> 6) Solr will pass back the list of documents it has located. I have not
> really looked at this issue yet, but it would be nice to have an example of
> this.

Search for a SolrJ tutorial, they are plentiful. Also check out Solr's own 
extensive manual, everything you need is there.

> Jim
> 
> 
> 
> On Tue, Jun 2, 2020 at 12:12 PM Markus Jelsma 
> wrote:
> 
> > Hello,
> >
> > We have been building precisely that for over ten years now. The '10,000
> > foot level overview' is basically:
> >
> > * forget about Lucene for now, Solr uses it under the hood;
> > * get Solr, and start it with the schema.xml file that comes with Nutch;
> > * get Nutch, give it a set of domains or hosts to crawl and some URLs to
> > start the crawl with and point the indexer towards the previously
> > configured Solr;
> > * put a proxy in front of Solr (we use Nginx), or skip this step if it is
> > just an internal demo (do not expose Solr to the outside world);
> > * make some basic JS tool that handles input and search result responses.
> >
> > This was our first web search engine prototype and it was set up in a few
> > days. The chapter "How To Build A Web Based Search Engine With Solr, Lucene
> > and Nutch" just means: set up Solr, and point Nutch towards it, and tell it
> > to start crawling and indexing.
> >
> > Then there comes an endless list of things to improve: autocomplete,
> > spell checking, query and click log handling and analysis, proper text
> > extraction, etc.
> >
> > Regards,
> > Markus
> >
> > -Original message-
> > > From:Jim Anderson 
> > > Sent: Tuesday 2nd June 2020 16:36
> > > To: solr-user@lucene.apache.org
> > > Subject: Building a web based search engine
> > >
> > > Hi,
> > >
> > > I have been looking at solr, lucene and nutch websites and tutorials for
> > > over a week now, experimenting and learning, but also frustrated by the
> > > fact that I am totally missing the 'how to' do what I want. I see a lot of
> > > examples of how to use each of the tools, but not how to put them all
> > > together. I think an 'overview' at the 10,000 foot level is needed, Maybe
> > > one is available and I have not yet found it. If someone can point me to
> > > one, please do.
> > >
> > > If I am correct that an overview on "How To Build A Web Based Search
> > Engine
> > > With Solr, Lucene and Nutch" is not available, then I will be willing to
> > > write an overview and make it available to the Solr community.  I will
> > need
> > > input, explanation and review of others.
> > >
> > > My 2 goals are:
> > >
> > > 1) Build a demo web based search engine [Note: I have a very specific
> > > business need to able to demonstrate a web application on top of a search
> > > engine. This demo is intended to show a 'proof of concept' of the web
> > > application to a small audience.]
> > >
> > > 2) Document the process of building the demo and customizing it using the
> > > java API so that others can more easily build their own web base search
> > > engine.
> > >
> > > Jim Anderson
> > >
> >
> 


RE: Building a web based search engine

2020-06-02 Thread Markus Jelsma
Hello,

We have been building precisely that for over ten years now. The '10,000 foot 
level overview' is basically:

* forget about Lucene for now, Solr uses it under the hood;
* get Solr, and start it with the schema.xml file that comes with Nutch;
* get Nutch, give it a set of domains or hosts to crawl and some URLs to start 
the crawl with and point the indexer towards the previously configured Solr;
* put a proxy in front of Solr (we use Nginx), or skip this step if it is just 
an internal demo (do not expose Solr to the outside world);
* make some basic JS tool that handles input and search result responses.

This was our first web search engine prototype and it was set up in a few days. 
The chapter "How To Build A Web Based Search Engine With Solr, Lucene and 
Nutch" just means: set up Solr, and point Nutch towards it, and tell it to 
start crawling and indexing.

Then there comes an endless list of things to improve: autocomplete, spell
checking, query and click log handling and analysis, proper text extraction, 
etc.

Regards,
Markus

-Original message-
> From:Jim Anderson 
> Sent: Tuesday 2nd June 2020 16:36
> To: solr-user@lucene.apache.org
> Subject: Building a web based search engine
> 
> Hi,
> 
> I have been looking at solr, lucene and nutch websites and tutorials for
> over a week now, experimenting and learning, but also frustrated by the
> fact that I am totally missing the 'how to' do what I want. I see a lot of
> examples of how to use each of the tools, but not how to put them all
> together. I think an 'overview' at the 10,000 foot level is needed, Maybe
> one is available and I have not yet found it. If someone can point me to
> one, please do.
> 
> If I am correct that an overview on "How To Build A Web Based Search Engine
> With Solr, Lucene and Nutch" is not available, then I will be willing to
> write an overview and make it available to the Solr community.  I will need
> input, explanation and review of others.
> 
> My 2 goals are:
> 
> 1) Build a demo web based search engine [Note: I have a very specific
> business need to able to demonstrate a web application on top of a search
> engine. This demo is intended to show a 'proof of concept' of the web
> application to a small audience.]
> 
> 2) Document the process of building the demo and customizing it using the
> java API so that others can more easily build their own web base search
> engine.
> 
> Jim Anderson
> 


RE: 8.5.1 LogReplayer extremely slow

2020-05-12 Thread Markus Jelsma
I found the bastard: it was a freaky document that screwed Solr over. Indexing
kept failing, passing documents between replicas timed out, documents got
reindexed, and so the document (and others) ended up in the transaction log
(many times) and stayed eligible for reindexing. Reindexing and replaying of
the transaction log both fail on that specific document. Recovery was also not
possible due to time outs.

Although the original document [1] is a mess, Solr should have no difficulties
ingesting it [2]. Any ideas what is going on? Should I open a ticket, and if
so, about what exactly? For the record, this is PreAnalyzed.

Many thanks,
Markus

[1] https://pastebin.com/1NqBdYCM
[2] https://www.openindex.io/export/do_not_index.xml

-Original message-
> From:Markus Jelsma 
> Sent: Monday 11th May 2020 18:43
> To: solr-user 
> Subject: 8.5.1 LogReplayer extremely slow
> 
> Hello,
> 
> Our main Solr text search collection broke down last night (search was still 
> working fine), every indexing action timed out with the Solr master spending 
> most of its time in Java regex. One shard has only one replica left for 
> queries and it stays like that. I have copied both shards' leaders to local to
> see what is going on.
> 
> One shard is fine but the other has a replica with about 600MB of data to
> replay and it is extremely slow. Using the VisualVM sampler I find that the
> replayer is also spending almost all time in dealing with Java regex (stack 
> trace below). Is this to be expected? And what is it actually doing? Where do 
> the TokenFilters come from?
> 
> I had an old but clean collection on the same cluster and started indexing to
> it to see what is going on but it too timed out due to Java regex. This is 
> weird, because locally i have no problem indexing a million records in a 
> 8.5.1 collection, and the broken down cluster has been running fine for over 
> a month.
> 
> A note: this index uses PreAnalyzedField, so I would expect no analysis
> whatsoever, certainly no regex.
> 
> Thanks,
> Markus
> 
> "replayUpdatesExecutor-3-thread-1-processing-n:127.0.1.1:8983_solr 
> x:sitesearch_shard2_replica_t2 c:sitesearch s:shard2 r:core_node4" #222 
> prio=5 os_prio=0 cpu=239207,44ms elapsed=239,50s tid=0x7ffde0057000 
> nid=0x24f5 runnable  [0x7ffeedd0f000]
>    java.lang.Thread.State: RUNNABLE
> at 
>java.util.regex.Pattern$GroupTail.match(java.base@11.0.7/Pattern.java:4863)
> at 
>java.util.regex.Pattern$CharPropertyGreedy.match(java.base@11.0.7/Pattern.java:4306)
> at 
>java.util.regex.Pattern$GroupHead.match(java.base@11.0.7/Pattern.java:4804)
> at 
>java.util.regex.Pattern$CharPropertyGreedy.match(java.base@11.0.7/Pattern.java:4306)
> at 
>java.util.regex.Pattern$Start.match(java.base@11.0.7/Pattern.java:3619)
> at java.util.regex.Matcher.search(java.base@11.0.7/Matcher.java:1729)
> at java.util.regex.Matcher.find(java.base@11.0.7/Matcher.java:746)
> at 
>org.apache.lucene.analysis.pattern.PatternReplaceFilter.incrementToken(PatternReplaceFilter.java:71)
> at 
>org.apache.lucene.analysis.miscellaneous.TrimFilter.incrementToken(TrimFilter.java:42)
> at 
>org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
> at 
>org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:812)
> at 
>org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
> at 
>org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
> at 
>org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
> at 
>org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
> at 
>org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
> at 
>org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
> at 
>org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:979)
> at 
>org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:345)
> at 
>org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:292)
> at 
>org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:239)
> at 
>org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
> at 
>org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
> at 
>org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:259)
> at 
>org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:489)
> at 

8.5.1 LogReplayer extremely slow

2020-05-11 Thread Markus Jelsma
Hello,

Our main Solr text search collection broke down last night (search was still
working fine): every indexing action timed out, with the Solr master spending
most of its time in Java regex. One shard has only one replica left for queries
and it stays like that. I have copied both shards' leaders to local to see what
is going on.

One shard is fine but the other has a replica with about 600MB of data to
replay and it is extremely slow. Using the VisualVM sampler I find that the
replayer is also spending almost all time in dealing with Java regex (stack 
trace below). Is this to be expected? And what is it actually doing? Where do 
the TokenFilters come from?

I had an old but clean collection on the same cluster and started indexing to it
to see what is going on but it too timed out due to Java regex. This is weird, 
because locally i have no problem indexing a million records in a 8.5.1 
collection, and the broken down cluster has been running fine for over a month.

A note: this index uses PreAnalyzedField, so I would expect no analysis
whatsoever, certainly no regex.

Thanks,
Markus

"replayUpdatesExecutor-3-thread-1-processing-n:127.0.1.1:8983_solr 
x:sitesearch_shard2_replica_t2 c:sitesearch s:shard2 r:core_node4" #222 prio=5 
os_prio=0 cpu=239207,44ms elapsed=239,50s tid=0x7ffde0057000 nid=0x24f5 
runnable  [0x7ffeedd0f000]
   java.lang.Thread.State: RUNNABLE
at 
java.util.regex.Pattern$GroupTail.match(java.base@11.0.7/Pattern.java:4863)
at 
java.util.regex.Pattern$CharPropertyGreedy.match(java.base@11.0.7/Pattern.java:4306)
at 
java.util.regex.Pattern$GroupHead.match(java.base@11.0.7/Pattern.java:4804)
at 
java.util.regex.Pattern$CharPropertyGreedy.match(java.base@11.0.7/Pattern.java:4306)
at 
java.util.regex.Pattern$Start.match(java.base@11.0.7/Pattern.java:3619)
at java.util.regex.Matcher.search(java.base@11.0.7/Matcher.java:1729)
at java.util.regex.Matcher.find(java.base@11.0.7/Matcher.java:746)
at 
org.apache.lucene.analysis.pattern.PatternReplaceFilter.incrementToken(PatternReplaceFilter.java:71)
at 
org.apache.lucene.analysis.miscellaneous.TrimFilter.incrementToken(TrimFilter.java:42)
at 
org.apache.lucene.analysis.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:49)
at 
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:812)
at 
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:442)
at 
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:406)
at 
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:250)
at 
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495)
at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594)
at 
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1586)
at 
org.apache.solr.update.DirectUpdateHandler2.updateDocOrDocValues(DirectUpdateHandler2.java:979)
at 
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:345)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:292)
at 
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:239)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:76)
at 
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:55)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:259)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.doVersionAdd(DistributedUpdateProcessor.java:489)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.lambda$versionAdd$0(DistributedUpdateProcessor.java:339)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor$$Lambda$631/0x000840670c40.apply(Unknown
 Source)
at 
org.apache.solr.update.VersionBucket.runWithLock(VersionBucket.java:50)
- locked <0xa7df5620> (a org.apache.solr.update.VersionBucket)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:339)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:225)
at 
org.apache.solr.update.processor.DistributedZkUpdateProcessor.processAdd(DistributedZkUpdateProcessor.java:245)
at 
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
at 
org.apache.solr.update.UpdateLog$LogReplayer.lambda$execute$1(UpdateLog.java:2025)
at 
org.apache.solr.update.UpdateLog$LogReplayer$$Lambda$629/0x000840672c40.run(Unknown
 

RE: Indexing Korean

2020-05-01 Thread Markus Jelsma
Hello,

Although it is not mentioned in Solr's language analysis page in the manual, 
Lucene has had support for Korean for quite a while now.

https://lucene.apache.org/core/8_5_0/analyzers-nori/index.html

Regards,
Markus
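
A field type using it could look roughly like this (a sketch; depending on the
Solr version the Nori jars may have to be added from the analysis-extras
contrib):

  <fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard"/>
      <filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
      <filter class="solr.KoreanReadingFormFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>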

 
 
-Original message-
> From:Audrey Lorberfeld - audrey.lorberf...@ibm.com 
> Sent: Friday 1st May 2020 17:34
> To: solr-user@lucene.apache.org
> Subject: Indexing Korean
> 
>  Hi All,
> 
> My team would like to index Korean, but it looks like Solr OOTB does not have 
> explicit support for Korean. If any of you have schema pipelines you could 
> share for your Korean documents, I would love to see them! I'm assuming I 
> would just use some combination of the OOTB CJK factories
> 
> Best,
> Audrey
> 
> 


RE: heavy reads from disk when off-heap ram is constrained

2020-02-27 Thread Markus Jelsma
Hello Kyle,

This is actually what the manual [1] clearly warns about. Snippet copied from
the manual:

"When setting the maximum heap size, be careful not to let the JVM consume all 
available physical memory. If the JVM process space grows too large, the 
operating system will start swapping it, which will severely impact 
performance. In addition, the operating system uses memory space not allocated 
to processes for file system cache and other purposes. This is especially 
important for I/O-intensive applications, like Lucene/Solr. The larger your 
indexes, the more you will benefit from filesystem caching by the OS. It may 
require some experimentation to determine the optimal tradeoff between heap 
space for the JVM and memory space for the OS to use."

Please check it out, there are more useful hints to be found there.

Regards,
Markus

[1] 
https://lucene.apache.org/solr/guide/8_4/jvm-settings.html#JVMSettings-ChoosingMemoryHeapSettings

 
-Original message-
> From:lstusr 5u93n4 
> Sent: Thursday 27th February 2020 18:45
> To: solr-user@lucene.apache.org
> Subject: heavy reads from disk when off-heap ram is constrained
> 
> Hi All,
> 
> Something we learned recently that might be useful to the community.
> 
> We're running solr in docker, and we've constrained each of our containers
> to have access to 10G of the host's ram. Also, through `docker stats`, we
> can see the Block IO (filesystem reads/writes) that the solr process is
> doing.
> 
> On a test system with three nodes, three shards, each with two NRT
> replicas, and indexing a reference set of a million documents:
> 
>  - When allocating half of the container's available ram to the jvm (i.e.
> starting solr with -m 5g) we see a read/write distribution of roughly
> 400M/2G on each solr node.
> 
>  - When allocating ALL of the container's available ram to the jvm (i.e.
> starting solr with -m 10g) we see a read/write distribution of around 10G /
> 2G on each solr node, and the latency on the underlying disk soars.
> 
> The takeaway here is that Solr really does need non-jvm RAM to function,
> and if you're having performance issues, "adding more ram to the jvm" isn't
> always the right way to get things going faster.
> 
> Best,
> 
> Kyle
> 


RE: Repeatable search term bug in Solr 8?

2020-02-27 Thread Markus Jelsma
Hello Phil,

Solr never returns "The website encountered an unexpected error. Please try 
again later." as an error. To get to the root of the problem, you should at 
least post error logs that Solr actually throws, if it does at all.

You either have an application error, or an actual Solr problem. Neither is 
sure with this information.

It would be helpful if you can reproduce actual queries on Solr itself, without 
the application layer, and then if an error occurs share it with the community.

Regards,
Markus

 
 
-Original message-
> From:Staley, Phil R - DCF 
> Sent: Thursday 27th February 2020 22:32
> To: 'solr-user@lucene.apache.org' 
> Subject: Repeatable search term bug in Solr 8?
> 
> All,
> 
> We recently upgraded our Drupal 8 sites to SOLR 8.3.1.  We are now getting 
> reports of certain patterns of search terms resulting in an error that reads, 
> "The website encountered an unexpected error. Please try again later."
> 
> Below is a list of example terms that repeatably result in this error and a 
> similar list that works fine.  The problem pattern seems to be a search term 
> that contains 2 or 3 characters followed by a space, followed by additional 
> text.
> 
> To confirm that the problem is version 8 of SOLR, I have updated our local 
> and UAT sites with the latest Drupal updates that did include an update to 
> the Search API Solr module and tested the terms below under SOLR 7.7.2, 
> 8.3.1, and 8.4.1.  Under version 7.7.2 everything works fine. Under either
> of the version 8 releases, the problem returns.
> 
> Thoughts?
> 
> Search terms that result in error
> 
>   *   w-2 agency directory
>   *   agency w-2 directory
>   *   w-2 agency
>   *   w-2 directory
>   *   w2 agency directory
>   *   w2 agency
>   *   w2 directory
> 
> Search terms that do not result in error
> 
>   *   w-22 agency directory
>   *   agency directory w-2
>   *   agency w-2directory
>   *   agencyw-2 directory
>   *   w-2
>   *   w2
>   *   agency directory
>   *   agency
>   *   directory
>   *   -2 agency directory
>   *   2 agency directory
>   *   w-2agency directory
>   *   w2agency directory
> 
> 
> 
> 


Solr 8.x Startup problems when ZK is partially unavailable

2020-01-10 Thread Markus Jelsma
Hello,

I have multiple collections, one on 7.5.0 and the rest on 8.3.1. They all share
the same ZK ensemble and have the same ZK connection string. The first ZK
address in the connection string is not reachable, it seems firewalled; the
rest are accessible.

The 7.5.0 nodes do not appear to have problems with a partially accessible ZK
ensemble. They give a simple warning but the cores on the nodes keep starting
up nicely.

I have trouble starting up 8.x nodes because it times out when connecting to 
ZK. The logs are filled with:

2020-01-10 16:33:33.146 WARN  (qtp1620948294-21) [   ] 
o.a.s.h.a.ZookeeperStatusHandler Failed talking to zookeeper bad_node1:2181 => 
org.apache.solr.common.SolrException: Failed talking to Zookeeper 
89.188.14.28:2181
at 
org.apache.solr.handler.admin.ZookeeperStatusHandler.getZkRawResponse(ZookeeperStatusHandler.java:245)

And i get this one for one of the cores on a restarted node:

2020-01-10 16:31:11.752 ERROR 
(searcherExecutor-12-thread-1-processing-n:s2.io:8983_solr 
x:documents_shard2_replica_t19 c:documents s:shard2 r:core_node20) [c:documents 
s:shard2 r:core_node20 x:documents_shard2_replica_t19] 
o.a.s.h.RequestHandlerBase java.lang.NullPointerException
at 
org.apache.solr.handler.component.SearchHandler.initComponents(SearchHandler.java:183)

This one is probably preventing the core from getting properly loaded. On the 
same node, however, there is another shard of the same collection, which did 
start up normally, as did other cores on the node.

Is this a known 8.x problem? I can work around it by temporarily removing the 
bad node address from the ZK connection string but thats all.

Thanks,
Markus



PreAnalyzedFieldUpdateProcessor issues in Solrcloud

2019-12-20 Thread Markus Jelsma
Hello,

We are moving our text analysis to outside of Solr and use PreAnalyzedField to 
speed up indexing. We also use MLT, but these two don't work together: there is 
no way for MLT to properly analyze a document using the PreAnalyzedField's 
analyzer, and it does not pass the check in the MLT qparser for 
FieldType.isExplicitAnalyzer().

So instead of changing the schema, i tried using 
PreAnalyzedFieldUpdateProcessor. This would be ideal because MLT still works 
and i can still manually index non-preanalyzed documents when developing, just 
by switching the URP chain.

I cannot get it to work. When i place the URP on top of all others i get:

TransactionLog doesn't know how to serialize class 
org.apache.lucene.document.Field; try implementing ObjectResolver?
at 
org.apache.solr.update.TransactionLog$1.resolve(TransactionLog.java:100)
at 
org.apache.solr.common.util.JavaBinCodec.writeVal(JavaBinCodec.java:264)

If i put the URP directly above Run i get:

Remote error message: TransactionLog doesn't know how to serialize class 
org.apache.lucene.document.Field; try implementing ObjectResolver?
at 
org.apache.solr.update.processor.DistributedZkUpdateProcessor.doDistribFinish(DistributedZkUpdateProcessor.java:1189)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1096)

If i remove the DistributedURP, indexing a preanalyzed document works, but my 
stored field is suddenly prefixed with:

org.apache.lucene.document.Field:stored,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition,omitNorms

RE: Position search

2019-10-15 Thread Markus Jelsma
Hello Adi,

There is no SpanLastQuery or equivalent. But you could reverse the text and use 
SpanFirstQuery. Or, perhaps easier, add a bogus term to the end of the field 
and use PhraseQuery.
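
For the "search in the first N words" case, a minimal Lucene-level sketch (field, 
term and class names here are made up; on the Solr side it would still have to be 
exposed through something like the XML query parser or a small custom qparser 
plugin):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class FirstPositions {
  // Match `term` only when it occurs within the first `limit` positions of `field`,
  // e.g. "book" within the first 100 words of a "content_txt" field.
  public static Query firstN(String field, String term, int limit) {
    return new SpanFirstQuery(new SpanTermQuery(new Term(field, term)), limit);
  }
}

For the "last N words" case the same query can be used on a field that indexes the 
reversed text, as mentioned above.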

Regards,
Markus
 
-Original message-
> From:Kaminski, Adi 
> Sent: Tuesday 15th October 2019 10:57
> To: solr-user@lucene.apache.org
> Subject: RE: Position search
> 
> Hi Markus,
> Thanks for the guidance.
> 
> Is there any official Solr documentation for that ? Tried some googling, only 
> some Stackoverflow / Lucene posts are available.
> 
> Also, will that approach work for the other use case of searching from end of 
> documents ?
> For example if I need to perform some term search from the end, e.g. "book" 
> in the last 30 or 100 words.
> 
> Is there SpanLastQuery ?
> 
> Thanks,
> Adi
> 
> -Original Message-
> From: Markus Jelsma 
> Sent: Tuesday, October 15, 2019 11:04 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Position search
> 
> Hello Adi,
> 
> Try SpanFirstQuery. It limits the search to within the Nth term in the field.
> 
> Regards,
> Markus
> 
> 
> 
> -Original message-
> > From:Kaminski, Adi 
> > Sent: Tuesday 15th October 2019 8:25
> > To: solr-user@lucene.apache.org
> > Subject: Position search
> >
> > Hi,
> > What's the recommended way to search in Solr (assuming 8.2 is used) for 
> > specific terms/phrases/expressions while limiting the search from position 
> > perspective.
> > For example to search only in the first/last 100 words of the document ?
> >
> > Is there any built-in functionality for that ?
> >
> > Thanks in advance,
> > Adi
> >
> >
> > This electronic message may contain proprietary and confidential 
> > information of Verint Systems Inc., its affiliates and/or subsidiaries. The 
> > information is intended to be for the use of the individual(s) or 
> > entity(ies) named above. If you are not the intended recipient (or 
> > authorized to receive this e-mail for the intended recipient), you may not 
> > use, copy, disclose or distribute to anyone this message or any information 
> > contained in this message. If you have received this electronic message in 
> > error, please notify us by replying to this e-mail.
> >
> 
> 
> This electronic message may contain proprietary and confidential information 
> of Verint Systems Inc., its affiliates and/or subsidiaries. The information 
> is intended to be for the use of the individual(s) or entity(ies) named 
> above. If you are not the intended recipient (or authorized to receive this 
> e-mail for the intended recipient), you may not use, copy, disclose or 
> distribute to anyone this message or any information contained in this 
> message. If you have received this electronic message in error, please notify 
> us by replying to this e-mail.
> 


RE: Position search

2019-10-15 Thread Markus Jelsma
Hello Adi,

Try SpanFirstQuery. It limits the search to within the Nth term in the field.

Regards,
Markus

 
 
-Original message-
> From:Kaminski, Adi 
> Sent: Tuesday 15th October 2019 8:25
> To: solr-user@lucene.apache.org
> Subject: Position search
> 
> Hi,
> What's the recommended way to search in Solr (assuming 8.2 is used) for 
> specific terms/phrases/expressions while limiting the search from position 
> perspective.
> For example to search only in the first/last 100 words of the document ?
> 
> Is there any built-in functionality for that ?
> 
> Thanks in advance,
> Adi
> 
> 
> This electronic message may contain proprietary and confidential information 
> of Verint Systems Inc., its affiliates and/or subsidiaries. The information 
> is intended to be for the use of the individual(s) or entity(ies) named 
> above. If you are not the intended recipient (or authorized to receive this 
> e-mail for the intended recipient), you may not use, copy, disclose or 
> distribute to anyone this message or any information contained in this 
> message. If you have received this electronic message in error, please notify 
> us by replying to this e-mail.
> 


RE: Custom update processor not kicking in

2019-09-18 Thread Markus Jelsma
Hello Rahul,

I don't know why you don't see your log lines, but if i remember correctly, you 
must put all custom processors above Log, Distributed and Run; at least i 
remember reading that somewhere a long time ago.

We put all our custom processors on top of the three default processors and 
they run just fine.
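
For reference, a minimal sketch of such a custom processor (class and field names 
are hypothetical); in solrconfig.xml its factory would then be listed before the 
LogUpdate/DistributedUpdate/RunUpdate factories of the chain:

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class TagDocumentProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        doc.setField("processed_by_s", "TagDocumentProcessor"); // example mutation
        super.processAdd(cmd); // always delegate so the rest of the chain runs
      }
    };
  }
}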

Try it.

Regards,
Markus
 
-Original message-
> From:Rahul Goswami 
> Sent: Wednesday 18th September 2019 22:20
> To: solr-user@lucene.apache.org
> Subject: Custom update processor not kicking in
> 
> Hello,
> 
> I am using solr 7.2.1 in a standalone mode. I created a custom update
> request processor and placed it between the distributed processor and run
> update processor in my chain. I made sure the chain is invoked since I see
> log lines from the getInstance() method of my processor factory. But I
> don’t see any log lines from the processAdd() method.
> 
> Any inputs on why the processor is getting skipped if placed after
> distributed processor?
> 
> Thanks,
> Rahul
> 


RE: SolrClient from inside processAdd function

2019-09-05 Thread Markus Jelsma
Hello Arnold,

In the Factory's inform() method you receive a SolrCore reference. Using this 
you can get the CloudDescriptor and the ZkController references. These provide 
access to what you need to open a connection for SolrClient. 

Our plugins usually work in cloud and non-cloud environments, so we initialize 
different things for each situation, like this, abstracted in some CloudUtils 
helper:

cloudDescriptor = core.getCoreDescriptor().getCloudDescriptor();
zk = core.getCoreContainer().getZkController(); // this is the ZkController ref
coreName = core.getCoreDescriptor().getName();

// Are we in cloud mode?
if (zk != null) {
  collectionName = core.getCoreDescriptor().getCollectionName();
  shardId = cloudDescriptor.getShardId();
} else {
  collectionName = null;
  shardId = null;
}

Depending on cloudMode we create new SolrClient instances based on these 
classes. 
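
For the cloud branch, a rough sketch of building such a client from the core (the 
target collection name is made up and error handling is left out):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.core.SolrCore;

public class ClientFromCore {
  // Build a CloudSolrClient from the ZK address the core already knows about.
  public static CloudSolrClient build(SolrCore core) {
    String zkHost = core.getCoreContainer().getZkController().getZkServerAddress();
    CloudSolrClient client = new CloudSolrClient.Builder(
        Collections.singletonList(zkHost), Optional.empty()).build();
    client.setDefaultCollection("other_collection"); // hypothetical target
    return client;
  }
}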

Check the apidocs and you'll quickly see what you need.

We use these APIs to get what we need, but you can also find these things if 
you check the Java system properties, which is easier. We use the APIs to read 
the data because if the APIs change, we get a compile error; if the system 
properties change, we don't. So the system properties are easier, but the APIs 
are safer, although a unit test should guard against that as well.

Regards,
Markus

ps, on this list there is normally no need to create a new thread for an 
existing one, even if you are eagerly waiting for a reply. It might take some 
patience though.
 
-Original message-
> From:Arnold Bronley 
> Sent: Thursday 5th September 2019 18:44
> To: solr-user@lucene.apache.org
> Subject: Re: SolrClient from inside processAdd function
> 
> Hi Markus,
> 
> Is there any way to get the information about the current Solr endpoint
> from within the custom URP?
> 
> On Wed, Sep 4, 2019 at 3:10 PM Markus Jelsma 
> wrote:
> 
> > Hello Arnold,
> >
> > Yes, we do this too for several cases.
> >
> > You can create the SolrClient in the Factory's inform() method, and pass
> > is to the URP when it is created. You must implement SolrCoreAware and
> > close the client when the core closes as well. Use a CloseHook for this.
> >
> > If you do not close the client, it will cause trouble if you run unit
> > tests, and most certainly when you regularly reload cores.
> >
> > Regards,
> > Markus
> >
> >
> >
> > -Original message-
> > > From:Arnold Bronley 
> > > Sent: Wednesday 4th September 2019 20:10
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrClient from inside processAdd function
> > >
> > > I need to search some other collection inside processAdd function and
> > > append that information to the indexing request.
> > >
> > > On Tue, Sep 3, 2019 at 7:55 PM Erick Erickson 
> > > wrote:
> > >
> > > > This really sounds like an XY problem. What do you need the SolrClient
> > > > _for_? I suspect there’s an easier way to do this…..
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Sep 3, 2019, at 6:17 PM, Arnold Bronley 
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Is there a way to create SolrClient from inside processAdd function
> > for
> > > > > custom update processor for the same Solr on which it is executing?
> > > >
> > > >
> > >
> >
> 


RE: SolrClient from inside processAdd function

2019-09-04 Thread Markus Jelsma
Hello Arnold,

Yes, we do this too for several cases.

You can create the SolrClient in the Factory's inform() method and pass it to 
the URP when it is created. You must implement SolrCoreAware and close the 
client when the core closes as well. Use a CloseHook for this.

If you do not close the client, it will cause trouble if you run unit tests, 
and most certainly when you regularly reload cores.
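
A minimal sketch of registering that close hook from the factory's inform(SolrCore) 
method (the helper and variable names are made up):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.core.CloseHook;
import org.apache.solr.core.SolrCore;

public class CloseClientWithCore {
  // Call this once the SolrClient has been created in inform(SolrCore).
  public static void register(SolrCore core, SolrClient client) {
    core.addCloseHook(new CloseHook() {
      @Override
      public void preClose(SolrCore c) {
        try {
          client.close(); // release connections before the core goes away
        } catch (IOException e) {
          // log and continue; the core is closing anyway
        }
      }
      @Override
      public void postClose(SolrCore c) {
      }
    });
  }
}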

Regards,
Markus

 
 
-Original message-
> From:Arnold Bronley 
> Sent: Wednesday 4th September 2019 20:10
> To: solr-user@lucene.apache.org
> Subject: Re: SolrClient from inside processAdd function
> 
> I need to search some other collection inside processAdd function and
> append that information to the indexing request.
> 
> On Tue, Sep 3, 2019 at 7:55 PM Erick Erickson 
> wrote:
> 
> > This really sounds like an XY problem. What do you need the SolrClient
> > _for_? I suspect there’s an easier way to do this…..
> >
> > Best,
> > Erick
> >
> > > On Sep 3, 2019, at 6:17 PM, Arnold Bronley 
> > wrote:
> > >
> > > Hi,
> > >
> > > Is there a way to create SolrClient from inside processAdd function for
> > > custom update processor for the same Solr on which it is executing?
> >
> >
> 


RE: 8.2.0 After changing replica types, state.json is wrong and replication no longer takes place

2019-08-23 Thread Markus Jelsma
Hello,

Reloading and restarting don't seem to help here. Just occasionally the 
replicas decide to finally replicate some files, and then the next few commits 
are just ignored.

I did finally found some errors.

On the leader:
2019-08-23 01:11:10.989 ERROR (qtp367746789-4669) [c:nutch s:shard1 
r:core_node40 x:collection_shard1_replica_t39] o.a.s.h.ReplicationHandler 
Unable to get file names for indexCommit generation:
 1205 => java.nio.file.NoSuchFileException: 
//data/collection_shard1_replica_t39/data/index/_27m_8b.liv
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
java.nio.file.NoSuchFileException: 
/app/data/nutch_shard1_replica_t39/data/index/_27m_8b.liv
at 
sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) 
~[?:1.8.0_222]
at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) 
~[?:1.8.0_222]
at 
sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) 
~[?:1.8.0_222]
at 
sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
 ~[?:1.8.0_222]
at 
sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
 ~[?:1.8.0_222]
at 
sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
 ~[?:1.8.0_222]

On the slave:
No files to download for index generation: 1205

So it seems obvious now. The replica won't replicate because of the error on 
the leader. Is this a known error? New Jira?

Regards,
Markus
 
-Original message-
> From:Ere Maijala 
> Sent: Friday 23rd August 2019 11:24
> To: solr-user@lucene.apache.org
> Subject: Re: 8.2.0 After changing replica types, state.json is wrong and 
> replication no longer takes place
> 
> Hi,
> 
> We've had PULL replicas stop replicating a couple of times in Solr 7.x.
> Restarting Solr has got it going again. No errors in logs, and I've been
> unable to reproduce the issue at will. At least once it happened when I
> reloaded a collection, but other times that hasn't caused any issues.
> 
> I'll make a note to check state.json next time we encounter the
> situation to see if I can see what you reported.
> 
> Regards,
> Ere
> 
> Markus Jelsma kirjoitti 22.8.2019 klo 16.36:
> > Hello,
> > 
> > There is a newly created 8.2.0 all NRT type cluster for which i replaced 
> > each NRT replica with a TLOG type replica. Now, the replicas no longer 
> > replicate when the leader receives data. The situation is odd, because some 
> > shard replicas kept replicating up until eight hours ago, another one (same 
> > collection, same node) seven hours, and even another one four hours!
> > 
> > I inspected state.json to see what might be wrong, and compare it with 
> > another fully working, but much older, 8.2.0 all TLOG collection.
> > 
> > The faulty one still lists, probably from when it was created:
> > "nrtReplicas":"2",
> > "tlogReplicas":"0"
> > "pullReplicas":"0",
> > "replicationFactor":"2",
> > 
> > The working collection only has:
> > "replicationFactor":"1",
> > 
> > What actually could cause this new collection to start replicating when i 
> > delete the data directory, but later on stop replicating at some random 
> > time, which is different for each shard.
> > 
> > Is there something i should change in state.json, and can it just be 
> > reuploaded to ZK?
> > 
> > Thanks,
> > Markus
> > 
> 
> -- 
> Ere Maijala
> Kansalliskirjasto / The National Library of Finland
> 


8.2.0 After changing replica types, state.json is wrong and replication no longer takes place

2019-08-22 Thread Markus Jelsma
Hello,

There is a newly created 8.2.0 all NRT type cluster for which i replaced each 
NRT replica with a TLOG type replica. Now, the replicas no longer replicate 
when the leader receives data. The situation is odd, because some shard 
replicas kept replicating up until eight hours ago, another one (same 
collection, same node) seven hours, and even another one four hours!

I inspected state.json to see what might be wrong, and compare it with another 
fully working, but much older, 8.2.0 all TLOG collection.

The faulty one still lists, probably from when it was created:
"nrtReplicas":"2",
"tlogReplicas":"0"
"pullReplicas":"0",
"replicationFactor":"2",

The working collection only has:
"replicationFactor":"1",

What could actually cause this new collection to start replicating when i 
delete the data directory, but later on stop replicating at some random time, 
which is different for each shard?

Is there something i should change in state.json, and can it just be reuploaded 
to ZK?

Thanks,
Markus


StackOverflowError leader election on 8.2.0

2019-08-21 Thread Markus Jelsma
Hello,

Looking this up i found SOLR-5692, but that was solved a lifetime ago, so i am 
just checking whether this is a familiar error and one i am missing in Jira:

A client's Solr 8.2.0 cluster brought us the next StackOverflowError while 
running 8.2.0 on Java 8:

Exception in thread "coreZkRegister-1-thread-3" java.lang.StackOverflowError
at 
org.apache.logging.log4j.ThreadContext.getImmutableContext(ThreadContext.java:352)
at 
org.apache.logging.log4j.core.impl.ThreadContextDataInjector$ForDefaultThreadContextMap.injectContextData(ThreadContextDataInjector.java:66)
at 
org.apache.logging.log4j.core.impl.Log4jLogEvent.createContextData(Log4jLogEvent.java:473)
at 
org.apache.logging.log4j.core.impl.Log4jLogEvent.&lt;init&gt;(Log4jLogEvent.java:331)
at 
org.apache.logging.log4j.core.impl.DefaultLogEventFactory.createEvent(DefaultLogEventFactory.java:54)
at 
org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:404)
at 
org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
at 
org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2170)
at 
org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2125)
at 
org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2108)
at 
org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2007)
at 
org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1866)
at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:179)
at org.apache.solr.update.PeerSync.sync(PeerSync.java:172)
at 
org.apache.solr.cloud.SyncStrategy.syncWithReplicas(SyncStrategy.java:187)
at 
org.apache.solr.cloud.SyncStrategy.syncReplicas(SyncStrategy.java:131)
at org.apache.solr.cloud.SyncStrategy.sync(SyncStrategy.java:109)
at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:400)
at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:172)
at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:137)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:309)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:218)
at 
org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:703)
at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:449)
at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:172)
at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:137)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:309)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:218)
at 
org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:703)
at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:449)
at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:172)
at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:137)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:309)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:218)
at 
org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:703)
at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:449)
at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:172)
at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:137)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:309)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:218)

... it repeats hundreds of times ...

at 
org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:703)
at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:449)
at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:172)
at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:137)
at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:309)
at 
org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1479)
at org.apache.solr.cloud.ZkController.register(ZkController.java:1219)
at org.apache.solr.cloud.ZkController.register(ZkController.java:1171)
at 

RE: Solr 8 getZkStateReader throwing AlreadyClosedException

2019-07-01 Thread Markus Jelsma
Opened SOLR-13591.

https://issues.apache.org/jira/browse/SOLR-13591

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Thursday 27th June 2019 13:20
> To: solr-user@lucene.apache.org; solr-user 
> Subject: RE: Solr 8 getZkStateReader throwing AlreadyClosedException
> 
> This was 8.1.1 to be precise. Sorry!
> 
>  
>  
> -Original message-
> > From:Markus Jelsma 
> > Sent: Thursday 27th June 2019 13:19
> > To: solr-user 
> > Subject: Solr 8 getZkStateReader throwing AlreadyClosedException
> > 
> > Hello,
> > 
> > We had two different SolrClients failing on different collections and 
> > machines just around the same time. After restarting everything was just 
> > fine again. The following exception was thrown:
> > 
> > 2019-06-27 11:04:28.117 ERROR (qtp203849460-13532) [c:_shard1_replica_t15] 
> > o.a.s.h.RequestHandlerBase org.apache.solr.common.AlreadyClosedException
> > at 
> > org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getZkStateReader(ZkClientClusterStateProvider.java:165)
> > at 
> > org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.connect(ZkClientClusterStateProvider.java:160)
> > at 
> > org.apache.solr.client.solrj.impl.BaseCloudSolrClient.connect(BaseCloudSolrClient.java:329)
> > at 
> > org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:779)
> > at 
> > org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:769)
> > at 
> > org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1274)
> > 
> > I tried looking for it in Jira but could not find it right away. Is this a 
> > bug, known issue? 
> > 
> > Thanks,
> > Markus
> > 
> 


RE: refused connection

2019-06-28 Thread Markus Jelsma
Hello,

If you get a Connection Refused, then normally the server is just offline. But 
something weird is hiding in your stack trace that you should check out further:

> Caused by: java.net.ConnectException: Cannot assign requested address
> (connect failed)

I have not seen this before.

Regards,
Markus 
 
-Original message-
> From:Midas A 
> Sent: Friday 28th June 2019 10:03
> To: solr-user@lucene.apache.org
> Subject: Re: refused connection
> 
> Please reply .  THis error is coming intermittently.
> 
> On Fri, Jun 28, 2019 at 11:50 AM Midas A  wrote:
> 
> > Hi All ,
> >
> > I am getting following error while indexing . Please suggest resolution.
> >
> > We are using kafka consumer to index solr .
> >
> >
> > org.apache.solr.client.solrj.SolrServerException: Server
> > *refused connection* at: http://host:port/solr/research
> > at
> > org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:656)
> > ~[solr-solrj-8.1.1.jar!/:8.1.1 fcbe46c28cef11bc058779afba09521de1b19bef -
> > ab - 2019-05-22 15:20:04]
> > at
> > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
> > ~[solr-solrj-8.1.1.jar!/:8.1.1 fcbe46c28cef11bc058779afba09521de1b19bef -
> > ab - 2019-05-22 15:20:04]
> > at
> > org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
> > ~[solr-solrj-8.1.1.jar!/:8.1.1 fcbe46c28cef11bc058779afba09521de1b19bef -
> > ab - 2019-05-22 15:20:04]
> > at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:207)
> > ~[solr-solrj-8.1.1.jar!/:8.1.1 fcbe46c28cef11bc058779afba09521de1b19bef -
> > ab - 2019-05-22 15:20:04]
> > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177)
> > ~[solr-solrj-8.1.1.jar!/:8.1.1 fcbe46c28cef11bc058779afba09521de1b19bef -
> > ab - 2019-05-22 15:20:04]
> > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:138)
> > ~[solr-solrj-8.1.1.jar!/:8.1.1 fcbe46c28cef11bc058779afba09521de1b19bef -
> > ab - 2019-05-22 15:20:04]
> > at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:156)
> > ~[solr-solrj-8.1.1.jar!/:8.1.1 fcbe46c28cef11bc058779afba09521de1b19bef -
> > ab - 2019-05-22 15:20:04]
> > at
> > com.monster.blue.jay.repositories.impl.ResumesDocumentRepositoryImpl.pushToSolr(ResumesDocumentRepositoryImpl.java:425)
> > [classes!/:1.0.0]
> > at
> > com.monster.blue.jay.repositories.impl.ResumesDocumentRepositoryImpl.createResumeDocument(ResumesDocumentRepositoryImpl.java:397)
> > [classes!/:1.0.0]
> > at
> > com.monster.blue.jay.repositories.impl.ResumesDocumentRepositoryImpl$$FastClassBySpringCGLIB$$e5ddf9e4.invoke()
> > [classes!/:1.0.0]
> > at
> > org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:204)
> > [spring-core-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
> > at
> > org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:746)
> > [spring-aop-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
> > at
> > org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
> > [spring-aop-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
> > at
> > org.springframework.dao.support.PersistenceExceptionTranslationInterceptor.invoke(PersistenceExceptionTranslationInterceptor.java:139)
> > [spring-tx-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
> > at
> > org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:185)
> > [spring-aop-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
> > at
> > org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:688)
> > [spring-aop-5.0.7.RELEASE.jar!/:5.0.7.RELEASE]
> > at
> > com.monster.blue.jay.repositories.impl.ResumesDocumentRepositoryImpl$$EnhancerBySpringCGLIB$$3885a0b4.createResumeDocument()
> > [classes!/:1.0.0]
> > at
> > com.monster.blue.jay.services.ResumeDocumentService.getResumeDocument(ResumeDocumentService.java:46)
> > [classes!/:1.0.0]
> > at
> > com.monster.blue.jay.runable.impl.ParallelGroupProcessor$GroupIndexingTaskCallable.call(ParallelGroupProcessor.java:200)
> > [classes!/:1.0.0]
> > at
> > com.monster.blue.jay.runable.impl.ParallelGroupProcessor$GroupIndexingTaskCallable.call(ParallelGroupProcessor.java:148)
> > [classes!/:1.0.0]
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> > at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > [na:1.8.0_121]
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > [na:1.8.0_121]
> > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> > Caused by: org.apache.http.conn.HttpHostConnectException: Connect to
> > 10.216.204.70:3112 [/10.216.204.70] failed: Cannot assign requested
> > address (connect failed)
> > at
> > org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
> > ~[httpclient-4.5.5.jar!/:4.5.5]
> > at
> > 

RE: Solr 8 getZkStateReader throwing AlreadyClosedException

2019-06-27 Thread Markus Jelsma
This was 8.1.1 to be precise. Sorry!

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Thursday 27th June 2019 13:19
> To: solr-user 
> Subject: Solr 8 getZkStateReader throwing AlreadyClosedException
> 
> Hello,
> 
> We had two different SolrClients failing on different collections and 
> machines just around the same time. After restarting everything was just fine 
> again. The following exception was thrown:
> 
> 2019-06-27 11:04:28.117 ERROR (qtp203849460-13532) [c:_shard1_replica_t15] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.AlreadyClosedException
> at 
> org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getZkStateReader(ZkClientClusterStateProvider.java:165)
> at 
> org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.connect(ZkClientClusterStateProvider.java:160)
> at 
> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.connect(BaseCloudSolrClient.java:329)
> at 
> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:779)
> at 
> org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:769)
> at 
> org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1274)
> 
> I tried looking for it in Jira but could not find it right away. Is this a 
> bug, known issue? 
> 
> Thanks,
> Markus
> 


Solr 8 getZkStateReader throwing AlreadyClosedException

2019-06-27 Thread Markus Jelsma
Hello,

We had two different SolrClients failing on different collections and machines 
just around the same time. After restarting everything was just fine again. The 
following exception was thrown:

2019-06-27 11:04:28.117 ERROR (qtp203849460-13532) [c:_shard1_replica_t15] 
o.a.s.h.RequestHandlerBase org.apache.solr.common.AlreadyClosedException
at 
org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getZkStateReader(ZkClientClusterStateProvider.java:165)
at 
org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.connect(ZkClientClusterStateProvider.java:160)
at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.connect(BaseCloudSolrClient.java:329)
at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:779)
at 
org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:769)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1274)

I tried looking for it in Jira but could not find it right away. Is this a bug, 
known issue? 

Thanks,
Markus


RE: Increased disk space usage 8.1.1 vs 7.7.1

2019-06-13 Thread Markus Jelsma
Hello,

It has something to do with the skewed facet counts seen in another thread. To 
make a full comparison i indexed the same set to a fresh 7.7 build. Without my 
DocValues error, there is still a reasonable difference:

7.7 shard 1: 7.8 GB
7.7 shard 2: 7.3 GB

8.1 shard 1: 8.3 GB
8.1 shard 2: 5.9 GB

Strangely enough, one is larger and the second a lot smaller, and overall 8.1 
takes about 1 GB less.

So it was my DocValues error that caused 8.1 locally to be larger than the old 
7.7 production.

My bad, again!

Many thanks,
Markus 
 
-Original message-
> From:Shawn Heisey 
> Sent: Thursday 13th June 2019 13:42
> To: solr-user@lucene.apache.org
> Subject: Re: Increased disk space usage 8.1.1 vs 7.7.1
> 
> On 6/13/2019 4:19 AM, Markus Jelsma wrote:
> > We are upgrading to Solr 8. One of our reindexed collections takes a GB 
> > more than the production uses which is on 7.7.1. Production also has 
> > deleted documents. This means Solr 8 somehow uses more disk space. I have 
> > checked both Solr and Lucene's CHANGES but no ticket was immediately 
> > obvious.
> 
> Did you index to a core with nothing in it, or reindex on an existing 
> index without deleting everything first and letting Lucene erase all the 
> segments?
> 
> If you reindexed into an existing index, you could simply have deleted 
> documents taking up the extra space.  Full comparison would need to be 
> done after optimizing both indexes to clear out deleted documents.
> 
> You're probably already aware that optimizing in production is 
> discouraged, unless you're willing to do it frequently ... which gets 
> expensive with large indexes.
> 
> If the size is 1GB larger after both indexes are optimized to clear 
> deleted documents, then the other replies you've gotten will be important.
> 
> Thanks,
> Shawn
> 


RE: Different facet count between 7.7.1 and 8.1.1

2019-06-13 Thread Markus Jelsma
Hello Jan,

We traced it back to not reindexing 'everything' when we enabled docValues for 
the field i facetted on. Most records before the change do not show up if i 
query old data, and it was only partially reindexed.

My bad!

Thanks,
Markus
 
-Original message-
> From:Jan Høydahl 
> Sent: Thursday 13th June 2019 0:17
> To: solr-user 
> Subject: Re: Different facet count between 7.7.1 and 8.1.1
> 
> Can you reproduce it from a clean 7.7.1 install? I mean, index N docs and 
> then run the facet query? Is it a distributed query or a single shard? Does 
> an "optimize" change anything? Is this DocValues strings?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> > 12. jun. 2019 kl. 23:49 skrev Markus Jelsma :
> > 
> > Hello again,
> > 
> > We found another oddity when upgrading to Solr 8. For a *:* query, the 
> > facet counts for a simple string field do not match at all between these 
> > versions. Solr 7.7.1 gives less or zero counts where as for 8 we see the 
> > correct counts. So something seems fixed for a bug that i was not aware of, 
> > although are unit tests rely heavily on correct facet counts.
> > 
> > When i do a field query : the numFound matches the 
> > correct facet counts i see on Solr 8.
> > 
> > I checked CHANGES.txt for anything on this subject, but the issues do not 
> > seem to match this description. Does anyone have an idea what difference in 
> > behaviour i see, and what ticket dealt with this subject?
> > 
> > We do not use JSON-facets here.
> > 
> > Many thanks,
> > Markus
> 
> 


Increased disk space usage 8.1.1 vs 7.7.1

2019-06-13 Thread Markus Jelsma
Hello,

We are upgrading to Solr 8. One of our reindexed collections takes a GB more 
than production, which is on 7.7.1, and production also has deleted documents. 
This means Solr 8 somehow uses more disk space. I have checked both Solr's and 
Lucene's CHANGES but no ticket was immediately obvious.

Does anyone know what is going on?

Many thanks,
Markus


CursorMark, batch size/speed

2019-06-12 Thread Markus Jelsma
Hello,

One of our collections hates CursorMark, it really does. When under very heavy 
load, the nodes can occasionally consume GBs of additional heap for no clear 
reason immediately after downloading the entire corpus.

Although the additional heap consumption is a separate problem that i hope 
anyone can shed some light on, there is another strange behaviour i would like 
to see explained.

When under little load and with a batch size of just a few hundred, the 
download speed creeps along at 150 docs/s at most. But when i increase the batch 
size to absurd numbers such as 20k, the speed jumps to 2.5k docs/s, changing the 
total time from days to just a few hours.
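
For context, the "batch size" here is simply the rows value used while paging with 
cursorMark; a minimal SolrJ sketch (collection name and sort field are made up):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorDump {
  public static void dump(SolrClient client, int batchSize) throws Exception {
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(batchSize);                 // the batch size discussed above
    q.setSort("id", SolrQuery.ORDER.asc); // cursorMark requires a sort on the uniqueKey
    String cursor = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = client.query("logs", q);
      // ... process rsp.getResults() ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) {
        break; // no more documents
      }
      cursor = next;
    }
  }
}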

We see the heap and the speed differences only really with one big collection 
of millions of small documents. They are just query, click and view logs with 
additional metadata fields such as time, digests, ranks, dates, uids, view time 
etc.

Can someone here shed some light on these vague subjects?

Many thanks,
Markus


Different facet count between 7.7.1 and 8.1.1

2019-06-12 Thread Markus Jelsma
Hello again,

We found another oddity when upgrading to Solr 8. For a *:* query, the facet 
counts for a simple string field do not match at all between these versions. 
Solr 7.7.1 gives lower or zero counts whereas for 8 we see the correct counts. 
So something seems to have fixed a bug that i was not aware of, although our unit 
tests rely heavily on correct facet counts.

When i do a field query (field:value), the numFound matches the 
correct facet counts i see on Solr 8.

I checked CHANGES.txt for anything on this subject, but the issues do not seem 
to match this description. Does anyone have an idea what difference in 
behaviour i see, and what ticket dealt with this subject?

We do not use JSON-facets here.

Many thanks,
Markus


RE: Solr Heap Usage

2019-06-07 Thread Markus Jelsma
Hello,

We use VisualVM for making observations. But use Eclipse MAT for in-depth 
analysis, usually only when there is a suspected memory leak.

Regards,
Markus

 
 
-Original message-
> From:John Davis 
> Sent: Friday 7th June 2019 20:30
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Heap Usage
> 
> What would be the best way to understand where heap is being used?
> 
> On Tue, Jun 4, 2019 at 9:31 PM Greg Harris  wrote:
> 
> > Just a couple of points I’d make here. I did some testing a while back in
> > which if no commit is made, (hard or soft) there are internal memory
> > structures holding tlogs and it will continue to get worse the more docs
> > that come in. I don’t know if that’s changed in further versions. I’d
> > recommend doing commits with some amount of frequency in indexing heavy
> > apps, otherwise you are likely to have heap issues. I personally would
> > advocate for some of the points already made. There are too many variables
> > going on here and ways to modify stuff to make sizing decisions and think
> > you’re doing anything other than a pure guess if you don’t test and
> > monitor. I’d advocate for a process in which testing is done regularly to
> > figure out questions like number of shards/replicas, heap size, memory etc.
> > Hard data, good process and regular testing will trump guesswork every time
> >
> > Greg
> >
> > On Tue, Jun 4, 2019 at 9:22 AM John Davis 
> > wrote:
> >
> > > You might want to test with softcommit of hours vs 5m for heavy indexing
> > +
> > > light query -- even though there is internal memory structure overhead
> > for
> > > no soft commits, in our testing a 5m soft commit (via commitWithin) has
> > > resulted in a very very large heap usage which I suspect is because of
> > > other overhead associated with it.
> > >
> > > On Tue, Jun 4, 2019 at 8:03 AM Erick Erickson 
> > > wrote:
> > >
> > > > I need to update that, didn’t understand the bits about retaining
> > > internal
> > > > memory structures at the time.
> > > >
> > > > > On Jun 4, 2019, at 2:10 AM, John Davis 
> > > > wrote:
> > > > >
> > > > > Erick - These conflict, what's changed?
> > > > >
> > > > > So if I were going to recommend settings, they’d be something like
> > > this:
> > > > > Do a hard commit with openSearcher=false every 60 seconds.
> > > > > Do a soft commit every 5 minutes.
> > > > >
> > > > > vs
> > > > >
> > > > > Index-heavy, Query-light
> > > > > Set your soft commit interval quite long, up to the maximum latency
> > you
> > > > can
> > > > > stand for documents to be visible. This could be just a couple of
> > > minutes
> > > > > or much longer. Maybe even hours with the capability of issuing a
> > hard
> > > > > commit (openSearcher=true) or soft commit on demand.
> > > > >
> > > >
> > >
> > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Jun 2, 2019 at 8:58 PM Erick Erickson <
> > erickerick...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > >>> I've looked through SolrJ, DIH and others -- is the bottomline
> > > > >>> across all of them to "batch updates" and not commit as long as
> > > > possible?
> > > > >>
> > > > >> Of course it’s more complicated than that ;)….
> > > > >>
> > > > >> But to start, yes, I urge you to batch. Here’s some stats:
> > > > >> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
> > > > >>
> > > > >> Note that at about 100 docs/batch you hit diminishing returns.
> > > > _However_,
> > > > >> that test was run on a single shard collection, so if you have 10
> > > shards
> > > > >> you’d
> > > > >> have to send 1,000 docs/batch. I wouldn’t sweat that number much,
> > just
> > > > >> don’t
> > > > >> send one at a time. And there are the usual gotchas if your
> > documents
> > > > are
> > > > >> 1M .vs. 1K.
> > > > >>
> > > > >> About committing. No, don’t hold off as long as possible. When you
> > > > commit,
> > > > >> segments are merged. _However_, the default 100M internal buffer
> > size
> > > > means
> > > > >> that segments are written anyway even if you don’t hit a commit
> > point
> > > > when
> > > > >> you have 100M of index data, and merges happen anyway. So you won’t
> > > save
> > > > >> anything on merging by holding off commits.
> > > > >> And you’ll incur penalties. Here’s more than you want to know about
> > > > >> commits:
> > > > >>
> > > > >>
> > > >
> > >
> > https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > > > >>
> > > > >> But some key take-aways… If for some reason Solr abnormally
> > > > >> terminates, the accumulated documents since the last hard
> > > > >> commit are replayed. So say you don’t commit for an hour of
> > > > >> furious indexing and someone does a “kill -9”. When you restart
> > > > >> Solr it’ll try to re-index all the docs for the last hour. Hard
> > > commits
> > > > >> with openSearcher=false aren’t 

RE: Solr 8.1.1, JMX and VisualVM

2019-05-30 Thread Markus Jelsma
Hello,

It solves the problem! So, with this flag disabled, does that mean our Solr 
will have lower performance than with it enabled?

Thanks Andrzej!
Markus
 
-Original message-
> From:Andrzej Białecki 
> Sent: Thursday 30th May 2019 17:35
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 8.1.1, JMX and VisualVM
> 
> Hi,
> 
> This has to do with the new JVM flags that optimise performance, they were 
> added roughly at the same time when Solr switched to G1GC.
> 
> In ‘bin/solr’ please comment out this flag: '-XX:+PerfDisableSharedMem'.
> 
> > On 30 May 2019, at 14:59, Markus Jelsma  wrote:
> > 
> > Hello,
> > 
> > Slight correction, SolrCLI does become visible in the local applications 
> > view. I just missed it before.
> > 
> > Thanks,
> > Markus
> > 
> > -Original message-
> >> From:Markus Jelsma 
> >> Sent: Thursday 30th May 2019 14:47
> >> To: solr-user 
> >> Subject: Solr 8.1.1, JMX and VisualVM
> >> 
> >> Hello,
> >> 
> >> While upgrading from 7.7 to 8.1.1, i noticed start.jar and SolrCLI no 
> >> longer pop up in the local applications view of VisualVM! I CTRL-F'ed my 
> >> way through the changelog for Solr 8.0.0 to 8.1.1 but could not find 
> >> anything related. I am clueless!
> >> 
> >> Using OpenJDK 11.0.3 2019-04-16 and Solr 8, how can i attach my VisualVM 
> >> to it?
> >> 
> >> Many thanks,
> >> Markus
> >> 
> > 
> 
> 


RE: Query of Death Lucene/Solr 7.6

2019-05-30 Thread Markus Jelsma
Hello,

When upgrading to 8.1.1 i took some time to quickly test this problem. Good 
news, it has disappeared, for me at least. I can immediately reproduce it on a 
local 7.7 node, which dies right away, but it runs smoothly on the production 
7.5 and local 8.1.1 nodes! The problem still exists in 8.0.0.

So i went through Lucene's and Solr's CHANGELOG again but could not find any 
ticket about the problem. Does anyone have an idea which ticket could be 
responsible for fixing this?

Anyway, many thanks!
Markus
 
-Original message-
> From:Michael Gibney 
> Sent: Friday 22nd February 2019 17:22
> To: solr-user@lucene.apache.org
> Subject: Re: Query of Death Lucene/Solr 7.6
> 
> Ah... I think there are two issues likely at play here. One is LUCENE-8531
> <https://issues.apache.org/jira/browse/LUCENE-8531>, which reverts a bug
> related to SpanNearQuery semantics, causing possible query paths to be
> enumarated up front. Setting ps=0 (although perhaps not appropriate for
> some use cases) should address problems related to this issue.
> 
> The other (likely affecting Gregg, for whom ps=0 did not help) is SOLR-12243
> <https://issues.apache.org/jira/browse/SOLR-12243>. Prior to 7.6,
> SpanNearQuery (generated for relatively complex "graph" tokenized queries,
> such as would be generated with WDGF, SynonymGraphFilter, etc.) were simply
> getting dropped. This was surely a bug, in that pf did not contribute at
> all to boosting such queries; but the silver lining was that performance
> was great ;-)
> 
> Markus, Gregg, could send examples (parsed query toString()) of problematic
> queries (and perhaps relevant analysis chain configs)?
> 
> Michael
> 
> 
> 
> On Fri, Feb 22, 2019 at 11:00 AM Gregg Donovan  wrote:
> 
> > FWIW: we have also seen serious Query of Death issues after our upgrade to
> > Solr 7.6. Are there any open issues we can watch? Is Markus' findings
> > around `pf` our best guess? We've seen these issues even with ps=0. We also
> > use the WDF.
> >
> > On Fri, Feb 22, 2019 at 8:58 AM Markus Jelsma 
> > wrote:
> >
> > > Hello Michael,
> > >
> > > Sorry it took so long to get back to this, too many things to do.
> > >
> > > Anyway, yes, we have WDF on our query-time analysers. I uploaded two log
> > > files, both the same query of death with and without synonym filter
> > enabled.
> > >
> > > https://mail.openindex.io/export/solr-8983-console.log 23 MB
> > > https://mail.openindex.io/export/solr-8983-console-without-syns.log 1.9
> > MB
> > >
> > > Without the synonym we still see a huge number of entries. Many different
> > > parts of our analyser chain contribute to the expansion of queries, but
> > pf
> > > itself really turns the problem on or off.
> > >
> > > Since SOLR-12243 is new in 7.6, does anyone know that SOLR-12243 could
> > > have this side-effect?
> > >
> > > Thanks,
> > > Markus
> > >
> > >
> > > -Original message-
> > > > From:Michael Gibney 
> > > > Sent: Friday 8th February 2019 17:19
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Query of Death Lucene/Solr 7.6
> > > >
> > > > Hi Markus,
> > > > As of 7.6, LUCENE-8531 <
> > > https://issues.apache.org/jira/browse/LUCENE-8531>
> > > > reverted a graph/Spans-based phrase query implementation (introduced in
> > > 6.5
> > > > -- LUCENE-7699 <https://issues.apache.org/jira/browse/LUCENE-7699>) to
> > > an
> > > > implementation that builds a separate phrase query for each possible
> > > > enumerated path through the graph described by a parsed query.
> > > > The potential for combinatoric explosion of the enumerated approach was
> > > (as
> > > > far as I can tell) one of the main motivations for introducing the
> > > > Spans-based implementation. Some real-world use cases would be good to
> > > > explore. Markus, could you send (as an attachment) the debug toString()
> > > for
> > > > the queries with/without synonyms enabled? I'm also guessing you may
> > have
> > > > WordDelimiterGraphFilter on the query analyzer?
> > > > As an alternative to disabling pf, LUCENE-8531 only reverts to the
> > > > enumerated approach for phrase queries where slop>0, so setting ps=0
> > > would
> > > > probably also help.
> > > > Michael
> > > >
> > > > On Fri, Feb 8, 2019 at 5:57 AM Markus Jelsma <
> > markus

RE: Solr 8.1.1, JMX and VisualVM

2019-05-30 Thread Markus Jelsma
Hello,

Slight correction, SolrCLI does become visible in the local applications view. 
I just missed it before.

Thanks,
Markus
 
-Original message-
> From:Markus Jelsma 
> Sent: Thursday 30th May 2019 14:47
> To: solr-user 
> Subject: Solr 8.1.1, JMX and VisualVM
> 
> Hello,
> 
> While upgrading from 7.7 to 8.1.1, i noticed start.jar and SolrCLI no longer 
> pop up in the local applications view of VisualVM! I CTRL-F'ed my way through 
> the changelog for Solr 8.0.0 to 8.1.1 but could not find anything related. I 
> am clueless!
> 
> Using OpenJDK 11.0.3 2019-04-16 and Solr 8, how can i attach my VisualVM to 
> it?
> 
> Many thanks,
> Markus
> 


Solr 8.1.1, JMX and VisualVM

2019-05-30 Thread Markus Jelsma
Hello,

While upgrading from 7.7 to 8.1.1, i noticed start.jar and SolrCLI no longer 
pop up in the local applications view of VisualVM! I CTRL-F'ed my way through 
the changelog for Solr 8.0.0 to 8.1.1 but could not find anything related. I am 
clueless!

Using OpenJDK 11.0.3 2019-04-16 and Solr 8, how can i attach my VisualVM to it?

Many thanks,
Markus


Field ByteArrayUtf8CharSequence instead of String

2019-05-30 Thread Markus Jelsma
Hello,

When upgrading to 7.7 i ran into SOLR-13249, where a SolrInputField's value 
suddenly became ByteArrayUtf8CharSequence instead of a String. That has been 
addressed.

I am now upgrading to 8.1.1 and have a SearchComponent that uses SolrClient to 
fetch documents from elsewhere on-the-fly. It walks over the fetched 
SolrDocumentList and wants to read a String field from each result. Since 8.1.1, 
this is no longer always a String, but a ByteArrayUtf8CharSequence instead.
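
A defensive sketch of the read in the component, which should work for both types 
since ByteArrayUtf8CharSequence implements CharSequence (field name is made up):

import org.apache.solr.common.SolrDocument;

public class FieldValues {
  // Accept either a String or any other CharSequence, such as the
  // ByteArrayUtf8CharSequence values returned since 8.1.1.
  public static String readString(SolrDocument doc, String field) {
    Object raw = doc.getFieldValue(field);
    return raw == null ? null : raw.toString();
  }
}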

I assume this is a bug, should i open a new ticket?

Many thanks!
Markus


RE: Very low filter cache hit ratio

2019-05-29 Thread Markus Jelsma
Hello,

What is missing in that article is that you must never use NOW without rounding 
it down in a filter query. If you have it, round it down to an hour, day or 
minute to prevent flooding the filter cache.
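
A small SolrJ sketch of the difference (the date field name is made up):

import org.apache.solr.client.solrj.SolrQuery;

public class RoundedNowFilter {
  public static SolrQuery build() {
    SolrQuery q = new SolrQuery("*:*");
    // Bad: an un-rounded NOW is different on every request, so every request
    // creates a new filterCache entry that can never be reused.
    // q.addFilterQuery("timestamp_dt:[NOW-1DAY TO NOW]");

    // Better: rounding to the hour lets identical requests share one entry.
    q.addFilterQuery("timestamp_dt:[NOW/HOUR-1DAY TO NOW/HOUR]");
    return q;
  }
}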

Regards,
Markus

-Original message-
> From:Atita Arora 
> Sent: Wednesday 29th May 2019 15:43
> To: solr-user@lucene.apache.org
> Subject: Re: Very low filter cache hit ratio
> 
> You can refer to this one:
> https://teaspoon-consulting.com/articles/solr-cache-tuning.html
> 
> HTH,
> Atita
> 
> On Wed, May 29, 2019 at 3:33 PM Saurabh Sharma 
> wrote:
> 
> > Hi Shwan,
> >
> > Many filters are common among the queries. AFAIK, filter cache are created
> > against filters and by that logic one should get good hit ratio for those
> > cached filter conditions.i tried to create a cache of 100K size and that
> > too was not producing good hit ratio. Any document/suggetion about
> > efficient usage of various caches  and their internal working.
> >
> > Thanks
> > Saurabh
> >
> > On Wed 29 May, 2019, 6:53 PM Shawn Heisey,  wrote:
> >
> > > On 5/29/2019 6:57 AM, Saurabh Sharma wrote:
> > > > What can be the possible reasons for low cache usage?
> > > > How can I leverage cache feature for high traffic indexes?
> > >
> > > Your usage apparently does not use the exact same query (or filter
> > > query, in the case of filterCache) very often.
> > >
> > > In order to achieve a high hit ratio on a cache, the same query will
> > > need to be used by many users.  That's not happening here.  I'm betting
> > > that each user is sending something unique to Solr - which means it will
> > > be impossible to get a hit, unless that user sends the same query again.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
> 


Facetting heat map, too many cells

2019-05-03 Thread Markus Jelsma
Hello,

With gridLevel set to 3 i have a map of 256 x 128. However, i would really like 
a higher resolution, preferably twice as high. But with any gridLevel higher 
than 3, or distErrPct of 0.1 or lower, i get an IllegalArgumentException saying 
it does not want to give me a 1024x1024 sized map.
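
For reference, a heatmap request with these parameters can be built from SolrJ like 
this (field name and geometry are made up):

import org.apache.solr.client.solrj.SolrQuery;

public class HeatmapRequest {
  public static SolrQuery build() {
    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    q.set("facet.heatmap", "location_rpt");                      // spatial RPT field
    q.set("facet.heatmap.geom", "[\"-180 -90\" TO \"180 90\"]"); // whole Earth
    q.set("facet.heatmap.gridLevel", "3"); // alternatively facet.heatmap.distErrPct
    return q;
  }
}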

How can i get a 512x256 sized heat map for the whole Earth?

Many thanks,
Markus


RE: Solr-Batch Update

2019-04-25 Thread Markus Jelsma
Hello,

There is no definitive rule for this; it depends on your situation, such as the 
size of documents, resource constraints and a possibly heavy analysis chain. And 
in the case of (re)indexing a large amount, your autocommit time/limit is 
probably more important.

In our case, some collections are fine with 5000+ batch sizes, but others are 
happy with just a hundred. One has small documents and no text analysis, the 
other quite the opposite.

Finding a sweet spot is trial and error.
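
A minimal SolrJ batching sketch to experiment with (collection name and batch size 
are arbitrary):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
  // Send documents in fixed-size batches instead of one add() call per document.
  public static void index(SolrClient client, Iterable<SolrInputDocument> docs)
      throws Exception {
    int batchSize = 500;
    List<SolrInputDocument> batch = new ArrayList<>(batchSize);
    for (SolrInputDocument doc : docs) {
      batch.add(doc);
      if (batch.size() >= batchSize) {
        client.add("mycollection", batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      client.add("mycollection", batch);
    }
    // Let autocommit (or an explicit commit elsewhere) make the documents visible.
  }
}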

Cheers,
Markus

 
 
-Original message-
> From:Lucky Sharma 
> Sent: Thursday 25th April 2019 21:48
> To: solr-user@lucene.apache.org
> Subject: Solr-Batch Update
> 
> Hi all,
> While creating an update request to solr, Its recommended creating
> batch request instead of small updates. What is the optimum batch
> size? Is there any number or any computation which can help us to
> assist on the same.
> 
> 
> -- 
> Warm Regards,
> 
> Lucky Sharma
> Contact No :+91 9821559918
> 


NPE in CharsRefBuilder

2019-04-15 Thread Markus Jelsma
Hello,

I made a ConditionalTokenFilter filter and factory. Its Lucene-based unit tests 
work really well, and i can see it is doing something: queries are analyzed 
differently based on some condition.

But when debugging through the GUI i get the following:

2019-04-15 12:37:42.219 ERROR (qtp815674463-213) [c:sitesearch s:shard2 
r:core_node9 x:sitesearch_shard2_replica_t6] o.a.s.s.HttpSolrCall 
null:java.lang.NullPointerException
    at 
org.apache.lucene.util.CharsRefBuilder.copyUTF8Bytes(CharsRefBuilder.java:120)
    at 
org.apache.solr.schema.FieldType.indexedToReadable(FieldType.java:387)
    at 
org.apache.solr.handler.AnalysisRequestHandlerBase.convertTokensToNamedLists(AnalysisRequestHandlerBase.java:273)
    at 
org.apache.solr.handler.AnalysisRequestHandlerBase.analyzeValue(AnalysisRequestHandlerBase.java:144)
    at 
org.apache.solr.handler.FieldAnalysisRequestHandler.analyzeValues(FieldAnalysisRequestHandler.java:227)
    at 
org.apache.solr.handler.FieldAnalysisRequestHandler.handleAnalysisRequest(FieldAnalysisRequestHandler.java:183)

So, although the NPE is in Lucene, is this a bug for the Solr Jira, or for 
Lucene?

Many thanks,
Markus


7.7.1 FlattenGraphFilterFactory at query-time?

2019-03-12 Thread Markus Jelsma
Hello,

Due to reading 'This filter must be included on index-time analyzer..' in the 
documentation, i never considered adding it to a query-time analyser.

However, we had problems with a set of three two-word synonyms never yielding 
the same number of results with SynonymGraph. When switching to the good old 
SynonymFilter, the problem was solved, all three synonyms gave the same number 
of results.

Then i decided to try SynonymGraph with FlattenGraph at query-time, which also 
solved the problem i had with SynonymGraph.

So what is the deal with it, and what about the documentation? Is the 
documentation wrong and should we apply it query-time? Is there a bug?

Many thanks,
Markus


RE: Re: Suppress stack trace in error response

2019-02-22 Thread Markus Jelsma
Hello,

Solr's error responses respect the configured response writer settings, so you 
could probably remove the trace element and the stuff it contains using XSLT. It 
is not too fancy, but it should work.

Regards,
Markus
 
-Original message-
> From:Branham, Jeremy (Experis) 
> Sent: Friday 22nd February 2019 16:53
> To: solr-user@lucene.apache.org
> Subject: Re:  Re: Suppress stack trace in error response
> 
> Thanks Edwin – You’re right, I could explain that a bit more.
> My security team has run a scan against the SOLR servers and identified a few 
> things they want suppressed, one being the stack trace in an error message.
> 
> For example –
> 
> 
> 500
> 1
> 
> `
> 
> 
> 
> For input string: "`"
> 
> java.lang.NumberFormatException: For input string: "`" at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) 
> at …
> 
> 
> I’ve got a long-term solution involving middleware changes, but I’m not sure 
> there is a quick fix for this.
> 
>  
> Jeremy Branham
> jb...@allstate.com
> 
> On 2/21/19, 9:53 PM, "Zheng Lin Edwin Yeo"  wrote:
> 
> Hi,
> 
> There's too little information provided in your questions.
> You can explain more on the issue or the exception that you are facing.
> 
> Regards,
> Edwin
> 
> On Thu, 21 Feb 2019 at 23:45, Branham, Jeremy (Experis) 
> 
> wrote:
> 
> > When Solr throws an exception, like when a client sends a badly formed
> > query string, is there a way to suppress the stack trace in the error
> > response?
> >
> >
> >
> > Jeremy Branham
> > jb...@allstate.com
> > Allstate Insurance Company | UCV Technology Services | Information
> > Services Group
> >
> >
> 
> 
> 


RE: Query of Death Lucene/Solr 7.6

2019-02-22 Thread Markus Jelsma
Hello Michael,

Sorry it took so long to get back to this, too many things to do.

Anyway, yes, we have WDF on our query-time analysers. I uploaded two log files, 
both of the same query of death, with and without the synonym filter enabled.

https://mail.openindex.io/export/solr-8983-console.log 23 MB
https://mail.openindex.io/export/solr-8983-console-without-syns.log 1.9 MB

Without the synonym we still see a huge number of entries. Many different parts 
of our analyser chain contribute to the expansion of queries, but pf itself 
really turns the problem on or off.

Since SOLR-12243 is new in 7.6, does anyone know whether it could have this 
side-effect?

Thanks,
Markus


-Original message-
> From:Michael Gibney 
> Sent: Friday 8th February 2019 17:19
> To: solr-user@lucene.apache.org
> Subject: Re: Query of Death Lucene/Solr 7.6
> 
> Hi Markus,
> As of 7.6, LUCENE-8531 <https://issues.apache.org/jira/browse/LUCENE-8531>
> reverted a graph/Spans-based phrase query implementation (introduced in 6.5
> -- LUCENE-7699 <https://issues.apache.org/jira/browse/LUCENE-7699>) to an
> implementation that builds a separate phrase query for each possible
> enumerated path through the graph described by a parsed query.
> The potential for combinatoric explosion of the enumerated approach was (as
> far as I can tell) one of the main motivations for introducing the
> Spans-based implementation. Some real-world use cases would be good to
> explore. Markus, could you send (as an attachment) the debug toString() for
> the queries with/without synonyms enabled? I'm also guessing you may have
> WordDelimiterGraphFilter on the query analyzer?
> As an alternative to disabling pf, LUCENE-8531 only reverts to the
> enumerated approach for phrase queries where slop>0, so setting ps=0 would
> probably also help.
> Michael
> 
> On Fri, Feb 8, 2019 at 5:57 AM Markus Jelsma 
> wrote:
> 
> > Hello (apologies for cross-posting),
> >
> > While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the
> > remaining four, we stumbled upon a situation where the 7.6 nodes quickly
> > succumb when a 'Query-of-Death' is issued, 7.2.1 up to 7.5 are all
> > unaffected (tested and confirmed).
> >
> > Following Smiley's suggestion i used Eclipse MAT to find the problem in
> > the heap dump i obtained, this fantastic tool revealed within minutes that
> > a query thread ate 65 % of all resources, in the class variables i could
> > find the query, and reproduce the problem.
> >
> > The problematic query is 'dubbele dijk/rijke dijkproject in het dijktracé
> > eemshaven-delfzijl', on 7.6 this input produces a 40+ MB toString() output
> > in edismax' newFieldQuery. If the node survives it takes 2+ seconds for the
> > query to run (150 ms otherwise). If i disable all query time
> > SynonymGraphFilters it still takes a second and produces just a 9 MB
> > toString() for the query.
> >
> > I could not find anything like this in Jira. I did think of LUCENE-8479
> > and LUCENE-8531 but they were about graphs, this problem looked related
> > though.
> >
> > I think i tracked it further down to LUCENE-8589 or SOLR-12243. When i
> > leave Solr's edismax' pf parameter empty, everything runs fast. When all
> > fields are configured for pf, the node dies.
> >
> > I am now unsure whether this is a Solr or a Lucene issue.
> >
> > Please let me know.
> >
> > Many thanks,
> > Markus
> >
> > ps. in Solr i even got an 'Impossible Exception', my first!
> >
> 


RE: TLOG replica, updateHandler errors in metrics, no logs

2019-02-21 Thread Markus Jelsma
Hello Erick,

I just delete a replica and add again, but with type=tlog.

Yes, it is reproducible both locally and in production, and with various 
collections. For each document added, the metric increments as well.
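
(For anyone wanting to watch the same counter, the metrics API works, e.g.
curl 'http://localhost:8983/solr/admin/metrics?group=core&prefix=UPDATE.updateHandler'
-- host and port are whatever your node uses.)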

I'll open a ticket!

Thanks!
Markus

https://issues.apache.org/jira/browse/SOLR-13265


 
-Original message-
> From:Erick Erickson 
> Sent: Thursday 21st February 2019 17:06
> To: solr-user@lucene.apache.org
> Subject: Re: TLOG replica, updateHandler errors in metrics, no logs
> 
> How are you “moving”? There’s no provision that I know of to _change_ an 
> existing replica.
> 
> But no, if you’re starting with replicas created as TLOG then I haven’t heard 
> of this. If
> the documents are getting indexed and replicated properly then it sounds like 
> a bogus
> counter is being incremented. That said, if you can reliably reproduce this 
> should be 
> a JIRA IMO.
> 
> Best,
> Erick
> 
> > On Feb 21, 2019, at 2:33 AM, Markus Jelsma  
> > wrote:
> > 
> > Hello,
> > 
> > We are moving some replicas to TLOG, one collection runs 7.5, the others 
> > 7.7. When indexing, we see UPDATE.updateHandler.errors increment for each 
> > document being indexed, there is nothing in the logs.
> > 
> > Is this a known issue? 
> > 
> > Thanks,
> > Markus
> 
> 


TLOG replica, updateHandler errors in metrics, no logs

2019-02-21 Thread Markus Jelsma
Hello,

We are moving some replicas to TLOG; one collection runs 7.5, the others 7.7. 
When indexing, we see UPDATE.updateHandler.errors increment for each document 
being indexed, there is nothing in the logs.

Is this a known issue? 

Thanks,
Markus


RE: solr cloud version upgrade 7.6 to 7.7 collection indexes all marked as down

2019-02-19 Thread Markus Jelsma
Hello,

We just witnessed this too with 7.7. No obvious messages in the logs, and the 
replica status would not come out of 'down'.

Meanwhile we got another weird exception from a neighbouring collection sharing 
the same nodes:

2019-02-18 13:47:20.622 ERROR 
(updateExecutor-3-thread-1-processing-n:idx1:8983_solr 
x:search_20180717_shard1_replica_t81 c:search_20180717 s:shard1 r:core_node82
) [c:search_20180717 s:shard1 r:core_node82 
x:search_20180717_shard1_replica_t81] o.a.s.u.SolrCmdDistributor 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Er
ror from server at http://idx5:8983/solr/search_20180717_shard1_replica_t91: 
invalid boolean value: replicas
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:643)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.request(ConcurrentUpdateSolrClient.java:491)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260)
at 
org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:326)
at 
org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:315)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Although probably not related, is this a known thing? Or shall i open an issue?

Thanks,
Markus
 
-Original message-
> From:Jeff Courtade 
> Sent: Friday 15th February 2019 21:54
> To: solr-user@lucene.apache.org
> Subject: Re: solr cloud version upgrade 7.6 to 7.7 collection indexes all 
> marked as down
> 
> Yes... nothing in the logs does mean that there was nothing of interest. I
> have actual entries.
> 
> This is a test environment so this isn't an emergency. Thanks for the
> clarification about what I should be seeing.
> 
> I was just so flabbergasted by this because it's so strange I had to tell
> somebody and yell at the universe basically so I yelled at the solar
> mailing list.
> 
> This is an automated upgrading so the next step is to go through and
> manually perform all the steps and see if I get the same behavior.
> 
> I am fairly certain I just going to be some dumb thing that I'm doing and I
> will be happy to update the mailing list when I figure this out for
> everyone's Mutual entertainment.
> --
> Jeff Courtade
> M: 240.507.6116
> 
> On Fri, Feb 15, 2019, 12:33 PM Erick Erickson  wrote:
> 
> > Hmmm. I'm assuming that "nothing in the logs" is node/logs/solr.log, and
> > that
> > you're not finding errors/exceptipons. Just sanity checking here.
> >
> > My guess: you're picking up the default SOLR_HOME which is in your new
> > installation directory and all your
> > replicas are under the old install directory.
> >
> > There should be some kind of message in the log files indicating that
> > Solr is at least trying to load replicas, something similar to:
> >
> > Using system property solr.solr.home:
> > /Users/Erick/apache/solrVersions/playspace/solr/example/cloud/node1/solr
> >
> > and/or:
> >
> > CorePropertiesLocator Found 3 core definitions underneath
> > /Users/Erick/apache/solrVersions/playspace/solr/example/cloud/node1/solr
> >
> > A bit of background: When Solr starts up, it recursively descends from
> > SOLR_HOME and whenever it finds a "core.properties" file
> > it says "Aha, this must be a core, I'll try to load it". So if
> > SOLR_HOME doesn't point to an ancestor of your existing replicas,
> > Solr won't find any replicas and everything will stay down. _If_
> > SOLR_HOME is defined in solr.in.sh, this should just be picked up.
> >
> > Best,
> > Erick
> >
> > On Thu, Feb 14, 2019 at 7:43 PM Zheng Lin Edwin Yeo
> >  wrote:
> > >
> > > Hi,
> > >
> > > Which version of zookeeper are you using?
> > >
> > > Also, if you tried to query the index, did you get any error message?
> > >
> > > Regards,
> > > Edwin
> > >
> > >
> > > On Fri, 15 Feb 2019 at 02:34, Jeff Courtade 
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > I am working n doing a simple point upgrade from solr 7.6 to 7.7 cloud.
> > > >
> > > > 6 servers
> > > > 3 zookeepers
> > > > one simple test collection using the prepackages _default config.
> > > >
> > > > i stop all solr servers 

RE: Solr 7.7 UpdateRequestProcessor broken

2019-02-15 Thread Markus Jelsma
I stumbled upon this too yesterday and created SOLR-13249. In local unit tests 
we get String but in distributed unit tests we get a ByteArrayUtf8CharSequence 
instead.

https://issues.apache.org/jira/browse/SOLR-13249 
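
For anyone with custom code reading SolrInputDocument values and hitting the
same thing, the defensive pattern is simply to accept CharSequence and call
toString(), e.g. (a sketch, the field name is only an example):

Object content = doc.getFieldValue("title");
if (content instanceof CharSequence) {
  // covers both String and ByteArrayUtf8CharSequence
  String value = content.toString();
  // ... use value ...
}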

 
 
-Original message-
> From:Andreas Hubold 
> Sent: Friday 15th February 2019 10:10
> To: solr-user@lucene.apache.org
> Subject: Re: Solr 7.7 UpdateRequestProcessor broken
> 
> Hi,
> 
> thank you, Jan.
> 
> I've created https://issues.apache.org/jira/browse/SOLR-13255. Maybe you 
> want to add your patch to that ticket. I did not have time to test it yet.
> 
> So I guess, all SolrJ usages have to handle CharSequence now for string 
> fields? Well, this really sounds like a major breaking change for custom 
> code.
> 
> Thanks,
> Andreas
> 
> Jan Høydahl schrieb am 15.02.19 um 09:14:
> > Hi
> >
> > This is a subtle change which is not detected by our langid unit tests, as 
> > I think it only happens when document is trasferred with SolrJ and Javabin 
> > codec.
> > Was introduced in https://issues.apache.org/jira/browse/SOLR-12992
> >
> > Please create a new JIRA issue for langid so we can try to fix it in 7.7.1
> >
> > Other SolrInputDocument users assuming String type for strings in 
> > SolrInputDocument would also be vulnerable.
> >
> > I have a patch ready that you could test:
> >
> > Index: 
> > solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
> > IDEA additional info:
> > Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> > <+>UTF-8
> > ===
> > --- 
> > solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
> >   (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
> > +++ 
> > solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
> >   (date 1550217809000)
> > @@ -60,12 +60,12 @@
> > Collection fieldValues = doc.getFieldValues(fieldName);
> > if (fieldValues != null) {
> >   for (Object content : fieldValues) {
> > -  if (content instanceof String) {
> > -String stringContent = (String) content;
> > +  if (content instanceof CharSequence) {
> > +CharSequence stringContent = (CharSequence) content;
> >   if (stringContent.length() > maxFieldValueChars) {
> > -  detector.append(stringContent.substring(0, 
> > maxFieldValueChars));
> > +  detector.append(stringContent.subSequence(0, 
> > maxFieldValueChars).toString());
> >   } else {
> > -  detector.append(stringContent);
> > +  detector.append(stringContent.toString());
> >   }
> >   detector.append(" ");
> > } else {
> > Index: 
> > solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
> > IDEA additional info:
> > Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
> > <+>UTF-8
> > ===
> > --- 
> > solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
> > (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
> > +++ 
> > solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
> > (date 1550217691000)
> > @@ -413,10 +413,10 @@
> >   Collection fieldValues = doc.getFieldValues(fieldName);
> >   if (fieldValues != null) {
> > for (Object content : fieldValues) {
> > -if (content instanceof String) {
> > -  String stringContent = (String) content;
> > +if (content instanceof CharSequence) {
> > +  CharSequence stringContent = (CharSequence) content;
> > if (stringContent.length() > maxFieldValueChars) {
> > -sb.append(stringContent.substring(0, maxFieldValueChars));
> > +sb.append(stringContent.subSequence(0, 
> > maxFieldValueChars));
> > } else {
> >   sb.append(stringContent);
> > }
> > @@ -449,8 +449,8 @@
> >   Collection contents = doc.getFieldValues(field);
> >   if (contents != null) {
> > for (Object content : contents) {
> > -if (content instanceof String) {
> > -  docSize += Math.min(((String) content).length(), 
> > maxFieldValueChars);
> > +if (content instanceof CharSequence) {
> > +  docSize += Math.min(((CharSequence) content).length(), 
> > maxFieldValueChars);
> >   }
> > }
> >   
> >
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> >> 14. feb. 2019 kl. 16:02 skrev Andreas Hubold 
> >> :
> >>
> >> Hi,
> >>
> 

Query of Death Lucene/Solr 7.6

2019-02-08 Thread Markus Jelsma
Hello (apologies for cross-posting),

While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the remaining 
four, we stumbled upon a situation where the 7.6 nodes quickly succumb when a 
'Query-of-Death' is issued, 7.2.1 up to 7.5 are all unaffected (tested and 
confirmed).

Following Smiley's suggestion i used Eclipse MAT to find the problem in the 
heap dump i obtained, this fantastic tool revealed within minutes that a query 
thread ate 65 % of all resources, in the class variables i could find the 
query, and reproduce the problem.

The problematic query is 'dubbele dijk/rijke dijkproject in het dijktracé 
eemshaven-delfzijl', on 7.6 this input produces a 40+ MB toString() output in 
edismax' newFieldQuery. If the node survives it takes 2+ seconds for the query 
to run (150 ms otherwise). If i disable all query time SynonymGraphFilters it 
still takes a second and produces just a 9 MB toString() for the query.

I could not find anything like this in Jira. I did think of LUCENE-8479 and 
LUCENE-8531 but they were about graphs, this problem looked related though.

I think i tracked it further down to LUCENE-8589 or SOLR-12243. When i leave 
Solr's edismax' pf parameter empty, everything runs fast. When all fields are 
configured for pf, the node dies.

I am now unsure whether this is a Solr or a Lucene issue. 

Please let me know.

Many thanks,
Markus

ps. in Solr i even got an 'Impossible Exception', my first!


LFUCache

2019-02-04 Thread Markus Jelsma
Hello,

Thanks to SOLR-12743 - one of our collections can't use FastLRUCache - we are 
considering LFUCache instead. But there is SOLR-3393 as well, claiming the 
current implementation is inefficient.
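
The swap we have in mind is nothing more than (sizes are placeholders, not a
recommendation):

<filterCache class="solr.LFUCache" size="4096" initialSize="1024" autowarmCount="512"/>

instead of class="solr.FastLRUCache" with the same settings.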

But ConcurrentLRUCache and ConcurrentLFUCache both use ConcurrentHashMap under 
the hood, the get() code is practically identical. So based on the code, i 
would think that, despite LFUCache being inefficient, it is neither slower nor 
faster than FastLRUCache for get(), right?

Or am i missing something obvious here?

Thanks,
Markus

https://issues.apache.org/jira/browse/SOLR-12743
https://issues.apache.org/jira/browse/SOLR-3393


RE: Re: Delayed/waiting requests

2019-01-16 Thread Markus Jelsma
Hello,

There is an extremely undocumented parameter to get the cache's contents 
displayed. Set showItems="100" on the filter cache. 
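
E.g., only showItems is the relevant bit, the rest is whatever you already have:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0" showItems="100"/>

The top entries then show up in the cache's stats output (admin UI under
Plugins / Stats, or the mbeans/metrics APIs).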

Regards,
Markus

 
 
-Original message-
> From:Erick Erickson 
> Sent: Wednesday 16th January 2019 17:40
> To: solr-user 
> Subject: Re: Re: Delayed/waiting requests
> 
> I don't know of any tools to inspect the cache. Under the covers,
> these are things like Java's ConcurrentHashMap which don't, for
> instance, carry along information like last access time IIUC.
> 
> I usually have to cull the Solr logs and eyeball the fq clauses to see
> if anything jumps out. If you do find any such patterns, you can
> always add {!cache=false} to those clauses to not use up cache
> entries
> 
> Best,
> Erick
> 
> On Wed, Jan 16, 2019 at 7:53 AM Gael Jourdan-Weil
>  wrote:
> >
> > Ok, I get your point.
> >
> >
> > Do you know if there is a tool to easily view filterCache content?
> >
> > I know we can see the top entries in the API or the UI but could we see 
> > more?
> >
> >
> > Regards,
> >
> > Gaël
> >
> > 
> > De : Erick Erickson 
> > Envoyé : mardi 15 janvier 2019 19:46:19
> > À : solr-user
> > Objet : Re: Re: Delayed/waiting requests
> >
> > bq. If I get your point, having a big cache might cause more troubles
> > than help if the cache hit ratio is not high enough because the cache
> > is constantly evicting/inserting entries?
> >
> > Pretty much. Although there are nuances.
> >
> > Right now, you have a 12K autowarm count. That means your cache will
> > eventually always contain 12K entries whether or not you ever use the
> > last 11K! I'm simplifying a bit, but it grows like this.
> >
> > Let's say I start Solr. Initially it has no cache entries. Now I start
> > both querying and indexing. For simplicity, say I have 100 _new_  fq
> > clauses come in between each commit. The first commit will autowarm
> > 100. The next will autowarm 200, then 300.. etc. Eventually this
> > will grow to 12K. So your performance will start to vary depending on
> > how long Solr has been running.
> >
> > Worse. it's not clear that you _ever_ re-use those clauses. One example:
> > fq=date_field:[* TO NOW]
> > NOW is really a Unix timestamp. So issuing the same fq 1 millisecond
> > from the first one will not re-use the entry. In the worst case almost
> > all of your autwarming is useless. It neither loads relevant index
> > data into RAM nor is reusable.
> >
> > Even if you use "date math" to round to, say, a minute, if you run
> > Solr long enough you'll still fill up with useless fq clauses.
> >
> > Best,
> > Erick
> >
> > On Tue, Jan 15, 2019 at 9:33 AM Gael Jourdan-Weil
> >  wrote:
> > >
> > > @Erick:
> > >
> > >
> > > We will try to lower the autowarm and run some tests to compare.
> > >
> > > If I get your point, having a big cache might cause more troubles than 
> > > help if the cache hit ratio is not high enough because the cache is 
> > > constantly evicting/inserting entries?
> > >
> > >
> > >
> > > @Jeremy:
> > >
> > >
> > > Index size: ~20G and ~14M documents
> > >
> > > Server memory available: 256G from which ~30G used and ~100G system cache
> > >
> > > Server CPU count: 32, ~10% usage
> > >
> > > JVM memory settings: -Xms12G -Xmx12G
> > >
> > >
> > > We have 3 servers and 3 clusters of 3 Solr instances.
> > >
> > > That is each server hosts 1 Solr instance for each cluster.
> > >
> > > And, indeed, each cluster only has 1 shard with replication factor 3.
> > >
> > >
> > > Among all these Solr instances, the pauses are observed on only one 
> > > single cluster but on every server at different times (sometimes on all 
> > > servers at the same time but I would say it's very rare).
> > >
> > > We do observe the traffic is evenly balanced across the 3 servers, around 
> > > 30-40 queries per second sent to each server.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Gaël
> > >
> > >
> > > 
> > > De : Branham, Jeremy (Experis) 
> > > Envoyé : mardi 15 janvier 2019 17:59:56
> > > À : solr-user@lucene.apache.org
> > > Objet : Re: Re: Delayed/waiting requests
> > >
> > > Hi Gael –
> > >
> > > Could you share this information?
> > > Size of the index
> > > Server memory available
> > > Server CPU count
> > > JVM memory settings
> > >
> > > You mentioned a cloud configuration of 3 replicas.
> > > Does that mean you have 1 shard with a replication factor of 3?
> > > Do the pauses occur on all 3 servers?
> > > Is the traffic evenly balanced across those servers?
> > >
> > >
> > > Jeremy Branham
> > > jb...@allstate.com
> > >
> > >
> > > On 1/15/19, 9:50 AM, "Erick Erickson"  wrote:
> > >
> > > Well, it was a nice theory anyway.
> > >
> > > "Other collections with the same settings"
> > > doesn't really mean much unless those other collections are very 
> > > similar,
> > > especially in terms of numbers of docs.
> > >
> > > You should only see a new searcher opening when you do a
> > > 

RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

2018-11-29 Thread Markus Jelsma
Hello, 

Sorry for trying this once more. Is there anyone around who can help me, and 
perhaps others, on this subject and the linked Jira ticket and failing test?

I could really use some help from someone who is really familiar with edismax 
code and the underlying QueryBuilder parts that are used, and then get replaced 
by Solr code.

Many thanks,
Markus

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Thursday 22nd November 2018 15:39
> To: solr-user@lucene.apache.org; solr-user 
> Subject: RE: KeywordRepeat, stemming, (single term) synonyms and minimum 
> should match (edismax)
> 
> Hello,
> 
> I have opened a SOLR-13009 describing the problem. The attached patch 
> contains a unit test proving the problem, i.e. the test fails. Any help would 
> be greatly appreciated.
> 
> Many thanks,
> Markus
> 
> https://issues.apache.org/jira/browse/SOLR-13009
> 
>  
>  
> -Original message-
> > From:Markus Jelsma 
> > Sent: Sunday 18th November 2018 23:21
> > To: solr-user@lucene.apache.org; solr-user 
> > Subject: RE: KeywordRepeat, stemming, (single term) synonyms and minimum 
> > should match (edismax)
> > 
> > Hello,
> > 
> > Apologies for bothering you all again, but i really need some help in this 
> > matter. How can we resolve this issue? Are we dealing with a bug here (then 
> > i'll open a ticket), am i doing something wrong?
> > 
> > Is here anyone who had the same issue or understand the problem?
> > 
> > Many thanks,
> > Markus 
> > 
> >  
> >  
> > -Original message-
> > > From:Markus Jelsma 
> > > Sent: Tuesday 13th November 2018 9:52
> > > To: solr-user 
> > > Subject: KeywordRepeat, stemming, (single term) synonyms and minimum 
> > > should match (edismax)
> > > 
> > > Hello, apologies for this long winded e-mail.
> > > 
> > > Our fields have KeywordRepeat and language specific filters such as a 
> > > stemmer, the final filter at query-time is SynonymGraph. We do not use 
> > > RemoveDuplicatesFilter for those of you wondering why when you see the 
> > > parsed queries below, this is due to [1]. 
> > > 
> > > We use a custom QParser extending edismax and also extend 
> > > ExtendedSolrQueryParser, so we are able to override newFieldQuery in case 
> > > we have to. The problem also directly applies to Solr's vanilla edismax. 
> > > The file synonyms.txt contains the stemmed versions of the original terms.
> > > 
> > > Consider this example synonym set [bier,brouw] where bier means beer and 
> > > brouw is the stemmed version of brouwsel (brewage, concoction), and 
> > > consider these parameters on /select: 
> > > qf=content_nl=edismax=2<-1 5<-2 6<90%25.
> > > 
> > > The queries q=bier and q=brouw both parse to the following query and give 
> > > the desired results (notice the missing RemoveDuplicates here):
> > > +(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier 
> > > content_nl:brouw))~2))
> > > 
> > > However, for q=brouwsel something (partially) unexpected happens:
> > > +(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))
> > > 
> > > This results in a BooleanQuery where, due to mm=2, both clauses need to 
> > > match, giving very few matches. Removing KeywordRepeat or setting mm=1 of 
> > > course fixes the problem, but that is not what we want.
> > > 
> > > What is also unexpected, and may be related to the problem, is that when 
> > > checking the analzer output via the GUI, we see the position incrementing 
> > > when KeywordRepeat and SynonymGraph are combined. When these filters are 
> > > not combined, the positions are always 1, as expected. When combined we 
> > > get this for 'brouw':
> > > term: bier brouw bier brouw
> > > pos:  1 1 2  2
> > > 
> > > or for 'brouwsel':
> > > term: brouwsel bier brouw
> > > pos:  1   2  2
> > > 
> > > ExtendedSolrQueryParser, and everything underneath, is a complicated 
> > > piece of code. In the end it extends Lucene's QueryBuilder, but not 
> > > always relying on its results, it seems. Edismax for example 'resets' 
> > > minShouldMatch in SolrPluginUtils.setMinShouldMatch(), so this is a 
> > > complicated web of code and i am a bit too deep in this unfamiliar area, 
> > > and i am in need of help here.
> > > 
> > > So, my question is, how to solve this problem? Or how to approach it?  
> > > What is the actual problem? How can i get the same stable results for 
> > > both queries? Does the odd positon increment have anything to do with it 
> > > (it seems Lucene's QueryBuilder does something with it). What do i need 
> > > to do?
> > > 
> > > Many thanks,
> > > Markus
> > > 
> > > ps. this is on Solr 7.2.1 and 7.5.0.
> > > 
> > > [1] 
> > > http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
> > > 
> > 
> 


RE: Delete all, index all, end up with 1 segment with 50% deletes

2018-11-28 Thread Markus Jelsma
Hello Shawn, Erick,

I thought about that too, but dismissed it; other similar batched processes 
don't show this problem. Nonetheless i reset cumulativeAdds and watched a batch 
being indexed: it got indexed twice!
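
(For anyone wanting to check the same, cumulativeAdds is visible per core via
e.g. curl 'http://localhost:8983/solr/<core>/admin/mbeans?cat=UPDATE&stats=true&wt=json',
or via the metrics API.)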

Thanks!
Markus
 
-Original message-
> From:Erick Erickson 
> Sent: Wednesday 28th November 2018 2:59
> To: solr-user 
> Subject: Re: Delete all, index all, end up with 1 segment with 50% deletes
> 
> Shawn's comment seems likely, somehow you're adding all the docs twice
> and only committing at the end. In that case there'd be only 1
> segment. That's about the only way I can imagine your index has
> exactly one segment with exactly half the docs deleted.
> 
> It'd be interesting for you to look at the admin UI>>schema browser
> for your uniqueKey field. It'll report the most frequent entries and
> if every uniqueKey value has exactly 2 entries, then you're indexing the
> same docs twice in one go.
> 
> Plus, the default TieredMergePolicy doesn't necessarily kick in unless
> there are multiple segments of roughly the same size. With an index
> this small it's perfectly possible that TMP is getting triggered and
> saying, in essence, "there's not enough work to do here to bother".
> 
> In Solr 7.5, you can optimize/forceMerge without any danger of
> creating massive segments, see:
> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
> (pre Solr 7.5)
> and
> https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/
> (Solr 7.5+).
> 
> Best,
> Erick
> On Tue, Nov 27, 2018 at 4:29 AM Markus Jelsma
>  wrote:
> >
> > Hello,
> >
> > A background  batch process compiles a data set, when finished, it sends a 
> > delete all to its target collection, then everything gets sent by SolrJ, 
> > followed by a regular commit. When inspecting the core i notice it has one 
> > segment with 9578 documents, of which exactly half are deleted.
> >
> > That Solr node is on 7.5, how can i encourage the merge scheduler to do its 
> > job and merge away all those deletes?
> >
> > Thanks,
> > Markus
> 


Delete all, index all, end up with 1 segment with 50% deletes

2018-11-27 Thread Markus Jelsma
Hello,

A background batch process compiles a data set; when finished, it sends a 
delete all to its target collection, then everything gets sent by SolrJ, 
followed by a regular commit. When inspecting the core i notice it has one 
segment with 9578 documents, of which exactly half are deleted. 

That Solr node is on 7.5, how can i encourage the merge scheduler to do its job 
and merge away all those deletes?

Thanks,
Markus


RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

2018-11-22 Thread Markus Jelsma
Hello,

I have opened a SOLR-13009 describing the problem. The attached patch contains 
a unit test proving the problem, i.e. the test fails. Any help would be greatly 
appreciated.

Many thanks,
Markus

https://issues.apache.org/jira/browse/SOLR-13009

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Sunday 18th November 2018 23:21
> To: solr-user@lucene.apache.org; solr-user 
> Subject: RE: KeywordRepeat, stemming, (single term) synonyms and minimum 
> should match (edismax)
> 
> Hello,
> 
> Apologies for bothering you all again, but i really need some help in this 
> matter. How can we resolve this issue? Are we dealing with a bug here (then 
> i'll open a ticket), am i doing something wrong?
> 
> Is here anyone who had the same issue or understand the problem?
> 
> Many thanks,
> Markus 
> 
>  
>  
> -Original message-
> > From:Markus Jelsma 
> > Sent: Tuesday 13th November 2018 9:52
> > To: solr-user 
> > Subject: KeywordRepeat, stemming, (single term) synonyms and minimum should 
> > match (edismax)
> > 
> > Hello, apologies for this long winded e-mail.
> > 
> > Our fields have KeywordRepeat and language specific filters such as a 
> > stemmer, the final filter at query-time is SynonymGraph. We do not use 
> > RemoveDuplicatesFilter for those of you wondering why when you see the 
> > parsed queries below, this is due to [1]. 
> > 
> > We use a custom QParser extending edismax and also extend 
> > ExtendedSolrQueryParser, so we are able to override newFieldQuery in case 
> > we have to. The problem also directly applies to Solr's vanilla edismax. 
> > The file synonyms.txt contains the stemmed versions of the original terms.
> > 
> > Consider this example synonym set [bier,brouw] where bier means beer and 
> > brouw is the stemmed version of brouwsel (brewage, concoction), and 
> > consider these parameters on /select: qf=content_nl=edismax=2<-1 
> > 5<-2 6<90%25.
> > 
> > The queries q=bier and q=brouw both parse to the following query and give 
> > the desired results (notice the missing RemoveDuplicates here):
> > +(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier 
> > content_nl:brouw))~2))
> > 
> > However, for q=brouwsel something (partially) unexpected happens:
> > +(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))
> > 
> > This results in a BooleanQuery where, due to mm=2, both clauses need to 
> > match, giving very few matches. Removing KeywordRepeat or setting mm=1 of 
> > course fixes the problem, but that is not what we want.
> > 
> > What is also unexpected, and may be related to the problem, is that when 
> > checking the analzer output via the GUI, we see the position incrementing 
> > when KeywordRepeat and SynonymGraph are combined. When these filters are 
> > not combined, the positions are always 1, as expected. When combined we get 
> > this for 'brouw':
> > term: bier brouw bier brouw
> > pos:  1 1 2  2
> > 
> > or for 'brouwsel':
> > term: brouwsel bier brouw
> > pos:  1   2  2
> > 
> > ExtendedSolrQueryParser, and everything underneath, is a complicated piece 
> > of code. In the end it extends Lucene's QueryBuilder, but not always 
> > relying on its results, it seems. Edismax for example 'resets' 
> > minShouldMatch in SolrPluginUtils.setMinShouldMatch(), so this is a 
> > complicated web of code and i am a bit too deep in this unfamiliar area, 
> > and i am in need of help here.
> > 
> > So, my question is, how to solve this problem? Or how to approach it?  What 
> > is the actual problem? How can i get the same stable results for both 
> > queries? Does the odd positon increment have anything to do with it (it 
> > seems Lucene's QueryBuilder does something with it). What do i need to do?
> > 
> > Many thanks,
> > Markus
> > 
> > ps. this is on Solr 7.2.1 and 7.5.0.
> > 
> > [1] 
> > http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
> > 
> 


RE: KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

2018-11-18 Thread Markus Jelsma
Hello,

Apologies for bothering you all again, but i really need some help in this 
matter. How can we resolve this issue? Are we dealing with a bug here (then 
i'll open a ticket), am i doing something wrong?

Is here anyone who had the same issue or understand the problem?

Many thanks,
Markus 

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Tuesday 13th November 2018 9:52
> To: solr-user 
> Subject: KeywordRepeat, stemming, (single term) synonyms and minimum should 
> match (edismax)
> 
> Hello, apologies for this long winded e-mail.
> 
> Our fields have KeywordRepeat and language specific filters such as a 
> stemmer, the final filter at query-time is SynonymGraph. We do not use 
> RemoveDuplicatesFilter for those of you wondering why when you see the parsed 
> queries below, this is due to [1]. 
> 
> We use a custom QParser extending edismax and also extend 
> ExtendedSolrQueryParser, so we are able to override newFieldQuery in case we 
> have to. The problem also directly applies to Solr's vanilla edismax. The 
> file synonyms.txt contains the stemmed versions of the original terms.
> 
> Consider this example synonym set [bier,brouw] where bier means beer and 
> brouw is the stemmed version of brouwsel (brewage, concoction), and consider 
> these parameters on /select: qf=content_nl=edismax=2<-1 5<-2 
> 6<90%25.
> 
> The queries q=bier and q=brouw both parse to the following query and give the 
> desired results (notice the missing RemoveDuplicates here):
> +(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier 
> content_nl:brouw))~2))
> 
> However, for q=brouwsel something (partially) unexpected happens:
> +(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))
> 
> This results in a BooleanQuery where, due to mm=2, both clauses need to 
> match, giving very few matches. Removing KeywordRepeat or setting mm=1 of 
> course fixes the problem, but that is not what we want.
> 
> What is also unexpected, and may be related to the problem, is that when 
> checking the analzer output via the GUI, we see the position incrementing 
> when KeywordRepeat and SynonymGraph are combined. When these filters are not 
> combined, the positions are always 1, as expected. When combined we get this 
> for 'brouw':
> term: bier brouw bier brouw
> pos:  1 1 2  2
> 
> or for 'brouwsel':
> term: brouwsel bier brouw
> pos:  1   2  2
> 
> ExtendedSolrQueryParser, and everything underneath, is a complicated piece of 
> code. In the end it extends Lucene's QueryBuilder, but not always relying on 
> its results, it seems. Edismax for example 'resets' minShouldMatch in 
> SolrPluginUtils.setMinShouldMatch(), so this is a complicated web of code and 
> i am a bit too deep in this unfamiliar area, and i am in need of help here.
> 
> So, my question is, how to solve this problem? Or how to approach it?  What 
> is the actual problem? How can i get the same stable results for both 
> queries? Does the odd positon increment have anything to do with it (it seems 
> Lucene's QueryBuilder does something with it). What do i need to do?
> 
> Many thanks,
> Markus
> 
> ps. this is on Solr 7.2.1 and 7.5.0.
> 
> [1] 
> http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html
> 


RE: Extracting important multi term phrases from the text

2018-11-15 Thread Markus Jelsma
Hello Pratik,

How about not using StopFilter at all? We got rid of it a long time ago, and 
only use it in very specific circumstances.

LUCENE-4065 is not going to be fixed any time soon. Removing StopFilter will 
introduce noise, but you could work around it with SKG. Please let us know if 
it works for you.

Regards,
Markus

 
 
-Original message-
> From:Pratik Patel 
> Sent: Thursday 15th November 2018 23:16
> To: solr-user@lucene.apache.org
> Subject: Re: Extracting important multi term phrases from the text
> 
> Hi Markus,
> 
> Thanks for the reply. I tried using ShingleFilter and it seems to
> be working. However, I am hitting an issue when it is used with
> StopWordFilter. StopWordFilter leaves an underscore "_" for removed words
> and it kind of screws up the data in index.
> 
> I tried setting enablePositionIncrements="false" for stop word filter but
> that parameter only works for lucene version 4.3 or earlier. Looks like
> it's an open issue in lucene
> https://issues.apache.org/jira/browse/LUCENE-4065
> 
> For now, I am trying to find a workaround using PatternReplaceFilterFactory.
> 
> Regards,
> Pratik
> 
> On Thu, Nov 15, 2018 at 4:15 PM Markus Jelsma 
> wrote:
> 
> > Hello Pratik,
> >
> > We would use ShingleFilter for this indeed. If you only want
> > bigrams/shingles, don't forget to disable outputUnigrams and set both
> > shingle size limits to 2.
> >
> > Regards,
> > Markus
> >
> > -Original message-
> > > From:Pratik Patel 
> > > Sent: Thursday 15th November 2018 17:00
> > > To: solr-user@lucene.apache.org
> > > Subject: Extracting important multi term phrases from the text
> > >
> > > Hello Everyone,
> > >
> > > Standard way of tokenizing in solr would divide the text by white space
> > in
> > > solr.
> > >
> > > Is there a way by which we can index multi-term phrases like "Machine
> > > Learning" instead of "Machine", "Learning"?
> > > Is it possible to create a specific field type for such phrases which has
> > > its own indexing pipeline? I am open to storing n-grams but these n-grams
> > > would be across terms and not just one term? In other words, I don't want
> > > to store n-grams of the term "machine", I want to store n-grams for a
> > > sentence like below.
> > >
> > > "I like machine learning" --> "I like", "like machine", "machine
> > learning"
> > > and so on.
> > >
> > > It seems like Shingle Filter (
> > >
> > https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter
> > )
> > > may be used for this. Is there a better alternative?
> > >
> > > I want to use this field as an input to Semantic Knowledge Graph. The
> > > plugin works great for words. But now I want to use it for phrases. Any
> > > idea around this would be really helpful.
> > >
> > > Thanks a lot!
> > >
> > > - Pratik
> > >
> >
> 


RE: Extracting important multi term phrases from the text

2018-11-15 Thread Markus Jelsma
Hello Pratik,

We would use ShingleFilter for this indeed. If you only want bigrams/shingles, 
don't forget to disable outputUnigrams and set both shingle size limits to 2.
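
Something like this (the rest of the analyzer chain is up to you):

<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigrams="false"/>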

Regards,
Markus

-Original message-
> From:Pratik Patel 
> Sent: Thursday 15th November 2018 17:00
> To: solr-user@lucene.apache.org
> Subject: Extracting important multi term phrases from the text
> 
> Hello Everyone,
> 
> Standard way of tokenizing in solr would divide the text by white space in
> solr.
> 
> Is there a way by which we can index multi-term phrases like "Machine
> Learning" instead of "Machine", "Learning"?
> Is it possible to create a specific field type for such phrases which has
> its own indexing pipeline? I am open to storing n-grams but these n-grams
> would be across terms and not just one term? In other words, I don't want
> to store n-grams of the term "machine", I want to store n-grams for a
> sentence like below.
> 
> "I like machine learning" --> "I like", "like machine", "machine learning"
> and so on.
> 
> It seems like Shingle Filter (
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ShingleFilter)
> may be used for this. Is there a better alternative?
> 
> I want to use this field as an input to Semantic Knowledge Graph. The
> plugin works great for words. But now I want to use it for phrases. Any
> idea around this would be really helpful.
> 
> Thanks a lot!
> 
> - Pratik
> 


KeywordRepeat, stemming, (single term) synonyms and minimum should match (edismax)

2018-11-13 Thread Markus Jelsma
Hello, apologies for this long winded e-mail.

Our fields have KeywordRepeat and language-specific filters such as a stemmer; 
the final filter at query-time is SynonymGraph. For those of you wondering, 
when you see the parsed queries below, why we do not use 
RemoveDuplicatesFilter: this is due to [1]. 

We use a custom QParser extending edismax and also extend 
ExtendedSolrQueryParser, so we are able to override newFieldQuery in case we 
have to. The problem also directly applies to Solr's vanilla edismax. The file 
synonyms.txt contains the stemmed versions of the original terms.
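
For concreteness, the field type has essentially this shape (simplified; the
Dutch chain below is an illustration, not our exact config):

<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>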

Consider this example synonym set [bier,brouw] where bier means beer and brouw 
is the stemmed version of brouwsel (brewage, concoction), and consider these 
parameters on /select: qf=content_nl&defType=edismax&mm=2<-1 5<-2 6<90%25.

The queries q=bier and q=brouw both parse to the following query and give the 
desired results (notice the missing RemoveDuplicates here):
+(((Synonym(content_nl:bier content_nl:brouw) Synonym(content_nl:bier 
content_nl:brouw))~2))

However, for q=brouwsel something (partially) unexpected happens:
+(((content_nl:brouwsel Synonym(content_nl:bier content_nl:brouw))~2))

This results in a BooleanQuery where, due to mm=2, both clauses need to match, 
giving very few matches. Removing KeywordRepeat or setting mm=1 of course fixes 
the problem, but that is not what we want.

What is also unexpected, and may be related to the problem, is that when 
checking the analzer output via the GUI, we see the position incrementing when 
KeywordRepeat and SynonymGraph are combined. When these filters are not 
combined, the positions are always 1, as expected. When combined we get this 
for 'brouw':
term: bier brouw bier brouw
pos:  1    1     2    2

or for 'brouwsel':
term: brouwsel bier brouw
pos:  1        2    2

ExtendedSolrQueryParser, and everything underneath, is a complicated piece of 
code. In the end it extends Lucene's QueryBuilder, but not always relying on 
its results, it seems. Edismax for example 'resets' minShouldMatch in 
SolrPluginUtils.setMinShouldMatch(), so this is a complicated web of code and i 
am a bit too deep in this unfamiliar area, and i am in need of help here.

So, my question is, how to solve this problem? Or how to approach it?  What is 
the actual problem? How can i get the same stable results for both queries? 
Does the odd position increment have anything to do with it (it seems Lucene's 
QueryBuilder does something with it). What do i need to do?

Many thanks,
Markus

ps. this is on Solr 7.2.1 and 7.5.0.

[1] 
http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html


RE: Odd Scoring behavior

2018-10-30 Thread Markus Jelsma
Hello Webster,

It smells like KeywordRepeat. In general it is not a problem if all terms are 
scored twice. But you also have RemoveDuplicates, and this means that in some 
cases a term in one field is scored twice but only once in another field, and 
then you have a problem.

Due to lack of replies, in the end i chose to remove the RemoveDuplicates 
filter, so that everything is always scored twice. This 'solution' at least 
solved the general scoring problem of searching across many fields.

Thus far there is no real solution to this problem, as far as i know.

Regards,
Markus

http://lucene.472066.n3.nabble.com/Multiple-languages-boosting-and-stemming-and-KeywordRepeat-td4389086.html

 
 
-Original message-
> From:Webster Homer 
> Sent: Tuesday 30th October 2018 22:34
> To: solr-user@lucene.apache.org
> Subject: Odd Scoring behavior
> 
> I noticed that sometimes query matches seem to get counted twice when they 
> are scored. This will happen if the fieldtype is being stemmed, and there is 
> a matching synonym.
> It seems that the score for the field is 2X higher than it should be. We see 
> this only when there is a matching synonym that has a stemmed term in it.
> 
> 
> We have this synonym defined:
> bsa, bovine serum albumin
> 
> We have this fieldtype:
>  positionIncrementGap="100">
>   
> 
>  words="lang/stopwords_en.txt" />
> 
> 
> 
> 
> 
>  
> 
> 
>  words="lang/stopwords_en.txt" />
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> 
> 
> 
> 
>   
> 
> 
> Which is used as:
>  stored="true" required="false" multiValued="false" />
> 
> When we query this field using the eDismax query parser the field, 
> search_en_root_name seems to contribute twice to the score for this query:
> bovine serum albumin
> 
> once for the base query, and once for the stemmed form of the query:
> bovin serum albumin
> 
> If we remove the synonym it will only be counted once. We only see this 
> behavior If part of the synonym can be stemmed. This seems odd and has the 
> effect of overpowering boosts on other fields.
> 
> The explain plan without synonym
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":44,
> "params":{
>   "mm":"2<-25%",
>   "fl":"searchmv_pno, search_en_p_pri_name [explain style=nl]",
>   "group.limit":"1",
>   "q.op":"OR",
>   "sort":"score desc,sort_en_name asc ,sort_ds asc,  search_pid asc",
>   "group.ngroups":"true",
>   "q":"bovine serum albumin",
>   "tie":".45",
>   "defType":"edismax",
>   "group.sort":"sort_ds asc, score desc",
>   "qf":"search_en_p_pri_name_min^7500
> search_en_root_name_min^12000 search_en_p_pri_name^3000
> search_pid^2500 searchmv_pno^2500 searchmv_cas_number^2500
> searchmv_p_skus^2500 search_lform_lc^2500  search_en_root_name^2500
> searchmv_en_s_pri_name^2500 searchmv_en_keywords^2500
> searchmv_lookahead_terms^2000 searchmv_user_term^2000
> searchmv_en_acronym^1500 searchmv_en_synonyms^1500
> searchmv_concat_sku^1000 search_concat_pno^1000
> searchmv_en_name_suf^1000 searchmv_component_cas^1000
> search_lform^1000 searchmv_pno_genr^500 search_concat_pno_genr^500
> searchmv_p_skus_genr^500 search_eform search_mol_form 
> searchmv_component_molform searchmv_en_descriptions searchmv_en_chem_comp 
> searchmv_en_attributes searchmv_en_page_title search_mdl_number 
> searchmv_xref_comparable_pno searchmv_xref_comparable_sku 
> searchmv_xref_equivalent_pno searchmv_xref_exact_pno searchmv_xref_exact_sku 
> searchmv_vendor_sku searchmv_material_number search_en_sortkey searchmv_rtecs 
> search_color_idx search_beilstein search_ecnumber search_egecnumber 
> search_femanumber searchmv_isbn",
>   "group.field":"id_s",
>   "_":"1540331449276",
>   "group":"true"}},
>   "grouped":{
> "id_s":{
>   "matches":4701,
>   "ngroups":4393,
>   "groups":[{
>   "groupValue":"bovineserumalbumin123459048468",
>   "doclist":{"numFound":57,"start":0,"docs":[
>   {
> "search_en_p_pri_name":"Bovine Serum Albumin",
> "searchmv_pno":["A2153"],
> "[explain]":{
>   "match":true,
>   "value":38145.117,
>   "description":"max plus 0.45 times others of:",
>   "details":[{
>   "match":true,
>   "value":10434.111,
>   "description":"sum of:",
>   "details":[{
>   "match":true,
>   "value":4042.5876,
> 
> "description":"weight(Synonym(search_en_root_name:bovin
> search_en_root_name:bovine) in 20407) [SialBM25Similarity], result of:",
>   "details":[{
>   "match":true,
>   "value":4042.5876,
>

RE: Merging data from different sources

2018-10-30 Thread Markus Jelsma
Hello Martin,

We also use a URP for this in some cases. We index documents to some 
collection; the URP reads a field from the incoming document which is an ID in another 
collection. So we fetch that remote Solr document on-the-fly, and use those 
fields to enrich the incoming document.

It is very straightforward and works very well.
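
A bare-bones sketch of the idea (class, field and host names are made up, and
error handling/cleanup is left out):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class EnrichFromRemoteProcessor extends UpdateRequestProcessor {

  // client pointing at the collection we enrich from
  private final SolrClient remote =
      new HttpSolrClient.Builder("http://remotehost:8983/solr/othercollection").build();

  public EnrichFromRemoteProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    // field on the incoming document that holds the ID in the other collection
    Object refId = doc.getFieldValue("ref_id");
    if (refId != null) {
      try {
        SolrDocument remoteDoc = remote.getById(refId.toString());
        if (remoteDoc != null) {
          // copy the remote fields onto the incoming document
          for (String f : remoteDoc.getFieldNames()) {
            doc.addField("remote_" + f, remoteDoc.getFieldValue(f));
          }
        }
      } catch (SolrServerException e) {
        throw new IOException(e);
      }
    }
    super.processAdd(cmd);
  }
}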

Regards,
Markus

 
 
-Original message-
> From:Martin Frank Hansen (MHQ) 
> Sent: Tuesday 30th October 2018 21:55
> To: solr-user@lucene.apache.org
> Subject: RE: Merging data from different sources
> 
> Hi Alex,
> 
> Thanks for your help. I will take a look at the update-request-processor.
> 
> I wonder if there is a way to link documents together, so that they always 
> show up together should one of the documents match a search query?
> 
> -Original Message-
> From: Alexandre Rafalovitch 
> Sent: 30. oktober 2018 13:16
> To: solr-user 
> Subject: Re: Merging data from different sources
> 
> Maybe
> https://lucene.apache.org/solr/guide/7_5/update-request-processors.html#atomicupdateprocessorfactory
> 
> Regards,
> Alex
> 
> On Tue, Oct 30, 2018, 7:57 AM Martin Frank Hansen (MHQ),  wrote:
> 
> > Hi,
> >
> > I am trying to merge files from different sources and with different
> > content (except for one key-field) , how can this be done in Solr?
> >
> > An example could be:
> >
> > Document 1
> > 
> > 001  Unique id
> > for Document 1
> > test-123
> > …
> > 
> >
> > Document 2
> > 
> > abcdefgh   Unique id
> > for Document 2
> > test-123
> > …
> > 
> >
> > In the above case I would like to merge on Journalnumber thus ending
> > up with something like this:
> >
> >  
> > 001  Unique id
> > for the merge
> > test-123
> > abcdefgh   Reference id
> > for Document 2.
> > …
> > 
> >
> > How would I go about this? I was thinking about embedded documents,
> > but since I am not indexing the different data sources at the same
> > time I don’t think it will work. The ideal result would be to have
> > Document 2 imbedded in Document 1.
> >
> > I am currently using a schema that contains all fields from Document 1
> > and Document 2.
> >
> > I really hope that Solr can handle this, and any help/feedback is much
> > appreciated.
> >
> > Best regards
> >
> > Martin
> >
> >
> >
> >
> >
> 


RE: Solr Shards down for unknown reason

2018-10-15 Thread Markus Jelsma
Hello,

We observed this problem too with older Solr versions. Whenever none of the 
shard's replicas would come up, we would just shut them all down again, 
restart just one replica and wait. In some cases it won't come up (still true 
for Solr 7.4); then start a second replica a while later and wait again. 
Usually one would become leader and then all is fine; if not, shut all down 
and start another replica first and repeat.

Regards,
Markus

 
 
-Original message-
> From:Dasarathi Minjur 
> Sent: Monday 15th October 2018 21:30
> To: solr-user@lucene.apache.org
> Subject: Solr Shards down for unknown reason
> 
> We have a Hadoop cluster with Solr 6.3 running as service. After an OS
> security patching, when the cluster was restarted, Solr Cloud is up but the
> shards are down all the time. No specific messages in Solr.log or console
> logs. Tried restarting solr but that didn't help. Any pointers to get the
> shards up will be helpful. Thanks for taking time to respond.
> 


RE: Opinions on index optimization...

2018-10-03 Thread Markus Jelsma
There are a few bugs for which you need to merge the index; see SOLR-8807 
and related issues.

https://issues.apache.org/jira/browse/SOLR-8807

-Original message-
> From:Erick Erickson 
> Sent: Wednesday 3rd October 2018 4:50
> To: solr-user 
> Subject: Re: Opinions on index optimization...
> 
> The problem you're at now is that, having run optimize, that single
> massive segment will accumulate deletes until it has < 2.5G "live"
> documents. So once you do optimize (and until you get to Solr 7.5),
> unless you can live with this one segment accumulating deletes for a
> very long time, you must continue to optimize.
> 
> Or you could re-index from scratch if possible and never optimize.
> 
> Best,
> Erick
> On Tue, Oct 2, 2018 at 7:28 AM Walter Underwood  wrote:
> >
> > Don’t optimize. The first article isn’t as clear as it should be. The 
> > important sentence is "Unless you are running into resource problems, it’s 
> > best to leave merging alone.”
> >
> > I’ve been running Solr in production since version 1.3, with several 
> > different kinds and sizes of collections. I’ve never run a daily optimize, 
> > even on collections that only change once per day.
> >
> > The section titles "What? I can’t afford 50% “wasted” space” should have 
> > just been “Then don’t run Solr”. Really, you should have 100% free sapce, 
> > so a 22 Gb index would be on a volume with 22 Gb of free space.
> >
> > It was a mistake to name it “optimize”. It should have been “force merge”.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Oct 2, 2018, at 6:04 AM, Jeff Courtade  wrote:
> > >
> > > We run an old master/slave solr 4.3.0 solr cluster
> > >
> > > 14 nodes 7/7
> > > indexes average 47/5 gig per shard around 2 mill docs per shard.
> > >
> > > We have constant daily additions and a small amount of deletes.
> > >
> > > We optimize nightly currently and it is a system hog.
> > >
> > > Is it feasible to never run optimize?
> > >
> > > I ask because it seems like it would be very bad not to but this
> > > information is out there apparently recommending exactly that... never
> > > optimizing.
> > >
> > > https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
> > >
> > > https://lucidworks.com/2018/06/20/solr-and-optimizing-your-index-take-ii/
> > >
> > > https://wiki.apache.org/solr/SolrPerformanceFactors#Optimization_Considerations
> >
> 


RE: Java version 11 for solr 7.5?

2018-09-26 Thread Markus Jelsma
Indeed, but JDK-8038348 has been fixed very recently for Java 9 or higher.
 
-Original message-
> From:Jeff Courtade 
> Sent: Wednesday 26th September 2018 17:36
> To: solr-user@lucene.apache.org
> Subject: Re: Java version 11 for solr 7.5?
> 
> My concern with using g1 is solely based on finding this.
> Does anyone have any information on this?
> 
> https://wiki.apache.org/lucene-java/JavaBugs#Oracle_Java_.2F_Sun_Java_.2F_OpenJDK_Bugs
> 
> "Do not, under any circumstances, run Lucene with the G1 garbage collector.
> Lucene's test suite fails with the G1 garbage collector on a regular basis,
> including bugs that cause index corruption. There is no person on this
> planet that seems to understand such bugs (see
> https://bugs.openjdk.java.net/browse/JDK-8038348, open for over a year), so
> don't count on the situation changing soon. This information is not out of
> date, and don't think that the next oracle java release will fix the
> situation."
> 
> 
> On Wed, Sep 26, 2018 at 11:08 AM Walter Underwood 
> wrote:
> 
> > We’ve been running G1 in prod for at least 18 months. Our biggest cluster
> > is 48 machines, each with 36 CPUs, running 6.6.2. We also run it on our
> > 4.10.4 master/slave cluster.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Sep 26, 2018, at 7:37 AM, Jeff Courtade 
> > wrote:
> > >
> > > Thanks for that...
> > > I am just starting to look at this I was unaware of the license debacle.
> > >
> > > Automated testing up to 10 is great.
> > >
> > > I am still curious about the GC1 being supported now...
> > >
> > > On Wed, Sep 26, 2018 at 10:25 AM Zisis T.  wrote:
> > >
> > >> Jeff Courtade wrote
> > >>> Can we use GC1 garbage collection yet or do we still need to use CMS?
> > >>
> > >> I believe you should be safe to go with G1. We've applied it in a
> > Solr
> > >> 6.6 cluster with 10 shards, 3 replicas per shard and an index of about
> > >> 500GB
> > >> (1,5T counting all replicas) and it works extremely well (throughput >
> > >> 99%).
> > >> The use-case includes complex search queries and faceting.
> > >> There is also this post you can use as a starting point
> > >>
> > >>
> > http://blog.cloudera.com/blog/2017/06/apache-solr-memory-tuning-for-production/
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> > >>
> > > --
> > >
> > > Jeff Courtade
> > > M: 240.507.6116 <(240)%20507-6116>
> >
> > --
> 
> Jeff Courtade
> M: 240.507.6116
> 


RE: Grammatical tenses Stemming in SOLR

2018-09-21 Thread Markus Jelsma
Hello Aishwarya,

KStem does a really bad job with the examples you have given, it won't remove 
the -s and -ing suffixes in some strange cases. Porter/Snowball work just fine 
for this example.

What won't work, of course, are irregular verbs and nouns (irregular plurals). Those 
always need to be hard-coded, either inside the algorithm, which these stemmers do 
not do, or outside of it, for example with a StemmerOverrideFilter.
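
For the original run/running/runs/ran question, a rough sketch of the schema side; the field type name and the stemdict.txt entries are just examples, not taken from an existing configuration:

  <fieldType name="text_stem" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- stemdict.txt holds tab-separated "form<TAB>stem" pairs for the irregular
           cases, e.g. ran -> run and feet -> foot. Matched tokens are marked as
           keywords, so the stemmer below leaves them alone. -->
      <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>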

Regards,
Markus
 
-Original message-
> From:aishwarya 
> Sent: Friday 21st September 2018 10:38
> To: solr-user@lucene.apache.org
> Subject: Grammatical tenses Stemming in SOLR
> 
> 
> i want to know which stemming filter factory can be used to fetch all the
> possible tenses of a stem word.
> 
> example : if "run" is the search word -> it has to fetch results for all
> files involving run , running , runs , ran.
> 
> also the vice-versa --> whichever tense of a word is searched , it has to
> retrieve all the results from the files.
> 
> i tried using PorterStemFilterFactory , snowball , kstem --> none of these
> seems to fetch the intended results.
> 
> Please help ! thanks in advance
> 
> Thanks, Aishwarya
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 


RE: Heap Memory Problem after Upgrading to 7.4.0

2018-09-06 Thread Markus Jelsma
Thanks Tomás!

Björn, can you reproduce the problem in a local and controlled environment?

Markus

 
 
-Original message-
> From:Tomás Fernández Löbbe 
> Sent: Wednesday 5th September 2018 18:32
> To: solr-user@lucene.apache.org
> Subject: Re: Heap Memory Problem after Upgrading to 7.4.0
> 
> I think this is pretty bad. I created
> https://issues.apache.org/jira/browse/SOLR-12743. Feel free to add any more
> details you have there.
> 
> On Mon, Sep 3, 2018 at 1:50 PM Markus Jelsma 
> wrote:
> 
> > Hello Björn,
> >
> > Take great care, 7.2.1 cannot read an index written by 7.4.0, so you
> > cannot roll back but need to reindex!
> >
> > Andrey Kudryavtsev made a good suggestion in the thread on how to find the
> > culprit, but it will be a tedious task. I have not yet had the time or
> > courage to venture there.
> >
> > Hope it helps,
> > Markus
> >
> >
> >
> > -Original message-
> > > From:Björn Häuser 
> > > Sent: Monday 3rd September 2018 22:28
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Heap Memory Problem after Upgrading to 7.4.0
> > >
> > > Hi Markus,
> > >
> > > this reads exactly like what we have. Were you able to figure out
> > anything? Currently thinking about rolling back to 7.2.1.
> > >
> > >
> > >
> > > > On 3. Sep 2018, at 21:54, Markus Jelsma 
> > wrote:
> > > >
> > > > Hello,
> > > >
> > > > Getting an OOM plus the fact you are having a lot of IndexSearcher
> > instances rings a familiar bell. One of our collections has the same issue
> > [1] when we attempted an upgrade 7.2.1 > 7.3.0. I managed to rule out all
> > our custom Solr code but had to keep our Lucene filters in the schema, the
> > problem persisted.
> > > >
> > > > The odd thing, however, is that you appear to have the same problem,
> > but not with 7.3.0? Since you shortly after 7.3.0 upgraded to 7.4.0, can
> > you confirm the problem is not also in 7.3.0?
> > > >
> > >
> > > We had very similar problems with 7.3.0 but never analyzed them and just
> > updated to 7.4.0 because I thought thats the bug we hit:
> > https://issues.apache.org/jira/browse/SOLR-11882 <
> > https://issues.apache.org/jira/browse/SOLR-11882>
> > >
> > >
> > > > You should see the instance count for IndexSearcher increase by one
> > for each replica on each commit.
> > >
> > >
> > > Sorry, where can I find this? ;) Sorry, did not find anything.
> > >
> > > Thanks
> > > Björn
> > >
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > > [1]
> > http://lucene.472066.n3.nabble.com/RE-7-3-appears-to-leak-td4396232.html
> > > >
> > > >
> > > >
> > > > -Original message-
> > > >> From:Erick Erickson 
> > > >> Sent: Monday 3rd September 2018 20:49
> > > >> To: solr-user 
> > > >> Subject: Re: Heap Memory Problem after Upgrading to 7.4.0
> > > >>
> > > >> I would expect at least 1 IndexSearcher per replica, how many total
> > > >> replicas hosted in your JVM?
> > > >>
> > > >> Plus, if you're actively indexing, there may temporarily be 2
> > > >> IndexSearchers open while the new searcher warms.
> > > >>
> > > >> And there may be quite a few caches, at least queryResultCache and
> > > >> filterCache and documentCache, one of each per replica and maybe two
> > > >> (for queryResultCache and filterCache) if you have a background
> > > >> searcher autowarming.
> > > >>
> > > >> At a glance, your autowarm counts are very high, so it may take some
> > > >> time to autowarm leading to multiple IndexSearchers and caches open
> > > >> per replica when you happen to hit a commit point. I usually start
> > > >> with 16-20 as an autowarm count, the benefit decreases rapidly as you
> > > >> increase the count.
> > > >>
> > > >> I'm not quite sure why it would be different in 7x .vs. 6x. How much
> > > >> heap do you allocate to the JVM? And do you see similar heap dumps in
> > > >> 6.6?
> > > >>
> > > >> Best,
> > > >> Erick
> > > >> On Mon, Sep 3, 2018 at 10:33 AM 

RE: Heap Memory Problem after Upgrading to 7.4.0

2018-09-03 Thread Markus Jelsma
Hello Björn,

Take great care, 7.2.1 cannot read an index written by 7.4.0, so you cannot 
roll back but need to reindex! 

Andrey Kudryavtsev made a good suggestion in the thread on how to find the 
culprit, but it will be a tedious task. I have not yet had the time or courage 
to venture there.

Hope it helps,
Markus

 
 
-Original message-
> From:Björn Häuser 
> Sent: Monday 3rd September 2018 22:28
> To: solr-user@lucene.apache.org
> Subject: Re: Heap Memory Problem after Upgrading to 7.4.0
> 
> Hi Markus,
> 
> this reads exactly like what we have. Were you able to figure out anything? 
> Currently thinking about rolling back to 7.2.1. 
> 
> 
> 
> > On 3. Sep 2018, at 21:54, Markus Jelsma  wrote:
> > 
> > Hello,
> > 
> > Getting an OOM plus the fact you are having a lot of IndexSearcher 
> > instances rings a familiar bell. One of our collections has the same issue 
> > [1] when we attempted an upgrade 7.2.1 > 7.3.0. I managed to rule out all 
> > our custom Solr code but had to keep our Lucene filters in the schema, the 
> > problem persisted.
> > 
> > The odd thing, however, is that you appear to have the same problem, but 
> > not with 7.3.0? Since you shortly after 7.3.0 upgraded to 7.4.0, can you 
> > confirm the problem is not also in 7.3.0? 
> > 
> 
> We had very similar problems with 7.3.0 but never analyzed them and just 
> updated to 7.4.0 because I thought thats the bug we hit: 
> https://issues.apache.org/jira/browse/SOLR-11882 
> <https://issues.apache.org/jira/browse/SOLR-11882>
> 
> 
> > You should see the instance count for IndexSearcher increase by one for 
> > each replica on each commit.
> 
> 
> Sorry, where can I find this? ;) Sorry, did not find anything. 
> 
> Thanks
> Björn
> 
> > 
> > Regards,
> > Markus
> > 
> > [1] 
> > http://lucene.472066.n3.nabble.com/RE-7-3-appears-to-leak-td4396232.html 
> > 
> > 
> > 
> > -Original message-
> >> From:Erick Erickson 
> >> Sent: Monday 3rd September 2018 20:49
> >> To: solr-user 
> >> Subject: Re: Heap Memory Problem after Upgrading to 7.4.0
> >> 
> >> I would expect at least 1 IndexSearcher per replica, how many total
> >> replicas hosted in your JVM?
> >> 
> >> Plus, if you're actively indexing, there may temporarily be 2
> >> IndexSearchers open while the new searcher warms.
> >> 
> >> And there may be quite a few caches, at least queryResultCache and
> >> filterCache and documentCache, one of each per replica and maybe two
> >> (for queryResultCache and filterCache) if you have a background
> >> searcher autowarming.
> >> 
> >> At a glance, your autowarm counts are very high, so it may take some
> >> time to autowarm leading to multiple IndexSearchers and caches open
> >> per replica when you happen to hit a commit point. I usually start
> >> with 16-20 as an autowarm count, the benefit decreases rapidly as you
> >> increase the count.
> >> 
> >> I'm not quite sure why it would be different in 7x .vs. 6x. How much
> >> heap do you allocate to the JVM? And do you see similar heap dumps in
> >> 6.6?
> >> 
> >> Best,
> >> Erick
> >> On Mon, Sep 3, 2018 at 10:33 AM Björn Häuser  
> >> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> we recently upgraded our solrcloud (5 nodes, 25 collections, 1 shard 
> >>> each, 4 replicas each) from 6.6.0 to 7.3.0 and shortly after to 7.4.0. We 
> >>> are running Zookeeper 4.1.13.
> >>> 
> >>> Since the upgrade to 7.3.0 and also 7.4.0 we encountering heap space 
> >>> exhaustion. After obtaining a heap dump it looks like that we have a lot 
> >>> of IndexSearchers open for our largest collection.
> >>> 
> >>> The dump contains around ~60 IndexSearchers, and each containing around 
> >>> ~40mb heap. Another 500MB of heap is the fieldcache, which is expected in 
> >>> my opinion.
> >>> 
> >>> The current config can be found here: 
> >>> https://gist.github.com/bjoernhaeuser/327a65291ac9793e744b87f0a561e844 
> >>> <https://gist.github.com/bjoernhaeuser/327a65291ac9793e744b87f0a561e844>
> >>> 
> >>> Analyzing the heap dump eclipse MAT says this:
> >>> 
> >>> Problem Suspect 1
> >>> 
> >>> 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded by 
> >>> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> >>> 1.981.148.336 (38,26%) bytes.
> >>> 
> >>> Biggest instances:
> >>> 
> >>>    • org.apache.solr.search.SolrIndexSearcher @ 0x6ffd47ea8 - 
> >>>70.087.272 (1,35%) bytes.
> >>>    • org.apache.solr.search.SolrIndexSearcher @ 0x79ea9c040 - 
> >>>65.678.264 (1,27%) bytes.
> >>>    • org.apache.solr.search.SolrIndexSearcher @ 0x6855ad680 - 
> >>>63.050.600 (1,22%) bytes.
> >>> 
> >>> 
> >>> Problem Suspect 2
> >>> 
> >>> 223 instances of "org.apache.solr.util.ConcurrentLRUCache", loaded by 
> >>> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> >>> 1.373.110.208 (26,52%) bytes.
> >>> 
> >>> 
> >>> Any help is appreciated. Thank you very much!
> >>> Björn
> >> 
> 
> 


RE: Heap Memory Problem after Upgrading to 7.4.0

2018-09-03 Thread Markus Jelsma
Hello,

Getting an OOM plus the fact you are having a lot of IndexSearcher instances 
rings a familiar bell. One of our collections has the same issue [1] when we 
attempted an upgrade 7.2.1 > 7.3.0. I managed to rule out all our custom Solr 
code but had to keep our Lucene filters in the schema, the problem persisted.

The odd thing, however, is that you appear to have the same problem, but not 
with 7.3.0? Since you shortly after 7.3.0 upgraded to 7.4.0, can you confirm 
the problem is not also in 7.3.0? 

You should see the instance count for IndexSearcher increase by one for each 
replica on each commit.
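
One way to watch this on a live node, not mentioned in this thread but a generic JVM trick, is a class histogram of the running Solr process, where <solr-pid> is a placeholder for the actual process id:

  jmap -histo <solr-pid> | grep org.apache.solr.search.SolrIndexSearcher

The count should drop again shortly after each commit once the old searcher is released; if it only ever grows, something is holding on to them.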

Regards,
Markus

[1] http://lucene.472066.n3.nabble.com/RE-7-3-appears-to-leak-td4396232.html 

 
 
-Original message-
> From:Erick Erickson 
> Sent: Monday 3rd September 2018 20:49
> To: solr-user 
> Subject: Re: Heap Memory Problem after Upgrading to 7.4.0
> 
> I would expect at least 1 IndexSearcher per replica, how many total
> replicas hosted in your JVM?
> 
> Plus, if you're actively indexing, there may temporarily be 2
> IndexSearchers open while the new searcher warms.
> 
> And there may be quite a few caches, at least queryResultCache and
> filterCache and documentCache, one of each per replica and maybe two
> (for queryResultCache and filterCache) if you have a background
> searcher autowarming.
> 
> At a glance, your autowarm counts are very high, so it may take some
> time to autowarm leading to multiple IndexSearchers and caches open
> per replica when you happen to hit a commit point. I usually start
> with 16-20 as an autowarm count, the benefit decreases rapidly as you
> increase the count.
> 
> I'm not quite sure why it would be different in 7x .vs. 6x. How much
> heap do you allocate to the JVM? And do you see similar heap dumps in
> 6.6?
> 
> Best,
> Erick
> On Mon, Sep 3, 2018 at 10:33 AM Björn Häuser  wrote:
> >
> > Hello,
> >
> > we recently upgraded our solrcloud (5 nodes, 25 collections, 1 shard each, 
> > 4 replicas each) from 6.6.0 to 7.3.0 and shortly after to 7.4.0. We are 
> > running Zookeeper 4.1.13.
> >
> > Since the upgrade to 7.3.0 and also 7.4.0 we encountering heap space 
> > exhaustion. After obtaining a heap dump it looks like that we have a lot of 
> > IndexSearchers open for our largest collection.
> >
> > The dump contains around ~60 IndexSearchers, and each containing around 
> > ~40mb heap. Another 500MB of heap is the fieldcache, which is expected in 
> > my opinion.
> >
> > The current config can be found here: 
> > https://gist.github.com/bjoernhaeuser/327a65291ac9793e744b87f0a561e844 
> > 
> >
> > Analyzing the heap dump eclipse MAT says this:
> >
> > Problem Suspect 1
> >
> > 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded by 
> > "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> > 1.981.148.336 (38,26%) bytes.
> >
> > Biggest instances:
> >
> > • org.apache.solr.search.SolrIndexSearcher @ 0x6ffd47ea8 - 
> > 70.087.272 (1,35%) bytes.
> > • org.apache.solr.search.SolrIndexSearcher @ 0x79ea9c040 - 
> > 65.678.264 (1,27%) bytes.
> > • org.apache.solr.search.SolrIndexSearcher @ 0x6855ad680 - 
> > 63.050.600 (1,22%) bytes.
> >
> >
> > Problem Suspect 2
> >
> > 223 instances of "org.apache.solr.util.ConcurrentLRUCache", loaded by 
> > "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> > 1.373.110.208 (26,52%) bytes.
> >
> >
> > Any help is appreciated. Thank you very much!
> > Björn
> 


RE: Boost matches occurring early in the field (offset)

2018-08-29 Thread Markus Jelsma
Hello Jan,

Many years ago i made an extension of SpanFirstQuery called GradientSpanFirstQuery 
that did just that: it decreased the boost the further the match occurred in the 
text. Then Lucene 4 or 5 came along and this code wouldn't compile any more.

  @Override
  protected AcceptStatus acceptPosition(Spans spans) throws IOException {
    assert spans.startPosition() != spans.endPosition() : "start equals end: " + spans.startPosition();
    if (spans.startPosition() >= end) {
      return AcceptStatus.NO_MORE_IN_CURRENT_DOC;
    } else if (spans.endPosition() <= end) {
      super.setBoost(this.boost / (spans.endPosition() / fraction));
      return AcceptStatus.YES;
    } else {
      return AcceptStatus.NO;
    }
  }

We never actually used this class in production, but i did ask either the Lucene or 
the Solr list what could be done to quickly fix it, even though we didn't use it 
anyway. Despite the thread being public, i cannot find it any more. But i do 
remember, probably Adrien Grand, saying that i had to implement a custom scorer to 
get the class back to work. 
to work. 

Hope it helps.
Markus
 
-Original message-
> From:Jan Høydahl 
> Sent: Wednesday 29th August 2018 22:18
> To: solr-user 
> Subject: Re: Boost matches occurring early in the field (offset)
> 
> I also tend to use "sentinel tokens" for exact match or to anchor a search. 
> But in order to obtain decaying boost the further down in the article a match 
> is, you'd need to write several such span/slop queries with varying slops, 
> e.g. highest boost for first 10 words, medium boost for first 50 words, low 
> boost for first 150 words, no boost below that.
> 
> As I wrote in my initial mail, we can do such workarounds, or play with 
> payloads etc. But my real question is whether/how it is possible to factor 
> the actual term offset information from a matching term into the scoring 
> algorithm? Would you need to implement your own Scorer/Weight impl?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> > 29. aug. 2018 kl. 15:37 skrev Doug Turnbull 
> > :
> > 
> > You can also insert a token at the beginning of the query during analysis
> > using a char filter. I call these sort of boundary tokens "sentinel
> > tokens". So a phrase search for "red shoes" becomes " red shoes".
> > You can add some slop to allow for permissible distance (with
> > 
> > You can also use the Limit Token Count Token Filter and create a copyField,
> > so if you want to boost on first 10 matches, just limit to 10 tokens then
> > use this as a boost query
> > https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-LimitTokenCountFilter
> > 
> > -Doug
> > 
> > On Wed, Aug 29, 2018 at 6:26 AM Mikhail Khludnev  wrote:
> > 
> >> 
> >> <
> >> https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-XMLQueryParser
> >>> 
> >> 
> >> On Wed, Aug 29, 2018 at 1:19 PM Jan Høydahl  wrote:
> >> 
> >>> Hi,
> >>> 
> >>> Is there an ootb way to boost term matches based on their position/offset
> >>> inside a field, so that the term gets a higher score if it occurs in the
> >>> befinning of the field and lower boost or a deboost if it occurs towards
> >>> the end of a field?
> >>> 
> >>> I know that I could index the first part of the text in a new field and
> >>> boost on that, but that is kind of "binary".
> >>> I could also add the term offset as payload for every term and boost on
> >>> that, but this should not be necessary since offset info is already part
> >> of
> >>> the index?
> >>> 
> >>> --
> >>> Jan Høydahl, search solution architect
> >>> Cominvent AS - www.cominvent.com
> >>> 
> >>> 
> >> 
> >> --
> >> Sincerely yours
> >> Mikhail Khludnev
> >> 
> > -- 
> > CTO, OpenSource Connections
> > Author, Relevant Search
> > http://o19s.com/doug
> 
> 


RE: 7.4.0 SQL handler throws exception if WHERE clause is present

2018-08-29 Thread Markus Jelsma
Hi,

Forget about it; after ten years without SQL, i managed to forget that i had to wrap 
the WHERE value in quotes, single quotes in this case.
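
For the record, the statement works once the value is quoted:

  SELECT * FROM logs WHERE type = 'query' LIMIT 10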

Thanks,
Markus
 
 
-Original message-
> From:Markus Jelsma 
> Sent: Wednesday 29th August 2018 11:51
> To: solr-user 
> Subject: 7.4.0 SQL handler throws exception if WHERE clause is present
> 
> Hello,
> 
> I was, finally, trying the SQL handler on one of our collections. Executing a 
> SELECT * FROM logs LIMIT 10 runs fine, but restricting the set using a WHERE 
> clause gives me the exception below. The type field is a String type, indexed 
> and has DocValues.
> 
> I must be doing something wrong, but have no idea what.
> 
> Many thanks,
> Markus
> 
> stmt=SELECT * FROM logs WHERE type=query LIMIT 10 gives me:
> 
> java.lang.AssertionError: cannot translate call =(CAST($7):VARCHAR CHARACTER 
> SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary", CAST($2):VARCHAR 
> CHARACTER SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary") 
> at 
> org.apache.solr.handler.sql.SolrFilter$Translator.translateBinary(SolrFilter.java:181)
>  
> at 
> org.apache.solr.handler.sql.SolrFilter$Translator.translateComparison(SolrFilter.java:128)
>  
> at 
> org.apache.solr.handler.sql.SolrFilter$Translator.translateMatch(SolrFilter.java:81)
>  
> at 
> org.apache.solr.handler.sql.SolrFilter$Translator.access$100(SolrFilter.java:70)
>  
> at org.apache.solr.handler.sql.SolrFilter.implement(SolrFilter.java:64) 
> at 
> org.apache.solr.handler.sql.SolrRel$Implementor.visitChild(SolrRel.java:103) 
> at org.apache.solr.handler.sql.SolrSort.implement(SolrSort.java:58) 
> at 
> org.apache.solr.handler.sql.SolrRel$Implementor.visitChild(SolrRel.java:103) 
> at org.apache.solr.handler.sql.SolrProject.implement(SolrProject.java:55) 
> at 
> org.apache.solr.handler.sql.SolrRel$Implementor.visitChild(SolrRel.java:103) 
> at 
> org.apache.solr.handler.sql.SolrToEnumerableConverter.implement(SolrToEnumerableConverter.java:61)
>  
> at 
> org.apache.calcite.adapter.enumerable.EnumerableRelImplementor.implementRoot(EnumerableRelImplementor.java:108)
>  
> at 
> org.apache.calcite.adapter.enumerable.EnumerableInterpretable.toBindable(EnumerableInterpretable.java:92)
>  
> at 
> org.apache.calcite.prepare.CalcitePrepareImpl$CalcitePreparingStmt.implement(CalcitePrepareImpl.java:1257)
>  
> at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:331) 
> at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:228) 
> at 
> org.apache.calcite.prepare.CalcitePrepareImpl.prepare2_(CalcitePrepareImpl.java:784)
>  
> at 
> org.apache.calcite.prepare.CalcitePrepareImpl.prepare_(CalcitePrepareImpl.java:639)
>  
> at 
> org.apache.calcite.prepare.CalcitePrepareImpl.prepareSql(CalcitePrepareImpl.java:609)
>  
> at 
> org.apache.calcite.jdbc.CalciteConnectionImpl.parseQuery(CalciteConnectionImpl.java:214)
>  
> at 
> org.apache.calcite.jdbc.CalciteMetaImpl.prepareAndExecute(CalciteMetaImpl.java:603)
>  
> at 
> org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:638)
>  
> at 
> org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:149)
>  
> at 
> org.apache.calcite.avatica.AvaticaStatement.executeQuery(AvaticaStatement.java:218)
>  
> at 
> org.apache.solr.client.solrj.io.stream.JDBCStream.open(JDBCStream.java:269) 
> at 
> org.apache.solr.client.solrj.io.stream.ExceptionStream.open(ExceptionStream.java:54)
>  
> at 
> org.apache.solr.handler.StreamHandler$TimerStream.open(StreamHandler.java:397)
>  
> at 
> org.apache.solr.client.solrj.io.stream.TupleStream.writeMap(TupleStream.java:83)
> 


7.4.0 SQL handler throws exception if WHERE clause is present

2018-08-29 Thread Markus Jelsma
Hello,

I was, finally, trying the SQL handler on one of our collections. Executing a 
SELECT * FROM logs LIMIT 10 runs fine, but restricting the set using a WHERE 
clause gives me the exception below. The type field is a String type, indexed 
and has DocValues.

I must be doing something wrong, but have no idea what.

Many thanks,
Markus

stmt=SELECT * FROM logs WHERE type=query LIMIT 10 gives me:

java.lang.AssertionError: cannot translate call =(CAST($7):VARCHAR CHARACTER 
SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary", CAST($2):VARCHAR CHARACTER 
SET "ISO-8859-1" COLLATE "ISO-8859-1$en_US$primary") 
at 
org.apache.solr.handler.sql.SolrFilter$Translator.translateBinary(SolrFilter.java:181)
 
at 
org.apache.solr.handler.sql.SolrFilter$Translator.translateComparison(SolrFilter.java:128)
 
at 
org.apache.solr.handler.sql.SolrFilter$Translator.translateMatch(SolrFilter.java:81)
 
at 
org.apache.solr.handler.sql.SolrFilter$Translator.access$100(SolrFilter.java:70)
 
at org.apache.solr.handler.sql.SolrFilter.implement(SolrFilter.java:64) 
at org.apache.solr.handler.sql.SolrRel$Implementor.visitChild(SolrRel.java:103) 
at org.apache.solr.handler.sql.SolrSort.implement(SolrSort.java:58) 
at org.apache.solr.handler.sql.SolrRel$Implementor.visitChild(SolrRel.java:103) 
at org.apache.solr.handler.sql.SolrProject.implement(SolrProject.java:55) 
at org.apache.solr.handler.sql.SolrRel$Implementor.visitChild(SolrRel.java:103) 
at 
org.apache.solr.handler.sql.SolrToEnumerableConverter.implement(SolrToEnumerableConverter.java:61)
 
at 
org.apache.calcite.adapter.enumerable.EnumerableRelImplementor.implementRoot(EnumerableRelImplementor.java:108)
 
at 
org.apache.calcite.adapter.enumerable.EnumerableInterpretable.toBindable(EnumerableInterpretable.java:92)
 
at 
org.apache.calcite.prepare.CalcitePrepareImpl$CalcitePreparingStmt.implement(CalcitePrepareImpl.java:1257)
 
at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:331) 
at org.apache.calcite.prepare.Prepare.prepareSql(Prepare.java:228) 
at 
org.apache.calcite.prepare.CalcitePrepareImpl.prepare2_(CalcitePrepareImpl.java:784)
 
at 
org.apache.calcite.prepare.CalcitePrepareImpl.prepare_(CalcitePrepareImpl.java:639)
 
at 
org.apache.calcite.prepare.CalcitePrepareImpl.prepareSql(CalcitePrepareImpl.java:609)
 
at 
org.apache.calcite.jdbc.CalciteConnectionImpl.parseQuery(CalciteConnectionImpl.java:214)
 
at 
org.apache.calcite.jdbc.CalciteMetaImpl.prepareAndExecute(CalciteMetaImpl.java:603)
 
at 
org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:638)
 
at 
org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:149)
 
at 
org.apache.calcite.avatica.AvaticaStatement.executeQuery(AvaticaStatement.java:218)
 
at org.apache.solr.client.solrj.io.stream.JDBCStream.open(JDBCStream.java:269) 
at 
org.apache.solr.client.solrj.io.stream.ExceptionStream.open(ExceptionStream.java:54)
 
at 
org.apache.solr.handler.StreamHandler$TimerStream.open(StreamHandler.java:397) 
at 
org.apache.solr.client.solrj.io.stream.TupleStream.writeMap(TupleStream.java:83)


RE: Contextual Synonym Filter

2018-08-17 Thread Markus Jelsma
Hello,

If you are using Dismax or Edismax, you can easily extend the QParser and 
reconstruct your analyzer on-the-fly, based on what you find in the filter 
query. Be sure to keep a cache of the analyzer because construction can be very 
heavy.

Check the Edismax code, it offers clear examples on how to do it, e.g. 
reconstructing the existing analyzer but without the StopFilter based on some 
parameter.
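
As an illustration of the caching part only, a minimal sketch; the class name, the config directory and the per-context synonym file names are made up for the example:

  import java.io.IOException;
  import java.nio.file.Paths;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.custom.CustomAnalyzer;

  /** Builds and caches one query analyzer per synonym context (e.g. per fq value). */
  public class SynonymAnalyzerCache {

    private final Map<String, Analyzer> cache = new ConcurrentHashMap<>();

    /** Returns the analyzer for this context, building it only the first time. */
    public Analyzer forContext(String context) {
      return cache.computeIfAbsent(context, key -> {
        try {
          return CustomAnalyzer.builder(Paths.get("conf"))
              .withTokenizer("standard")
              .addTokenFilter("lowercase")
              // synonyms_<context>.txt is a hypothetical per-context synonym file
              .addTokenFilter("synonymGraph",
                  "synonyms", "synonyms_" + key + ".txt",
                  "ignoreCase", "true")
              .build();
        } catch (IOException e) {
          throw new RuntimeException("Failed to build analyzer for context " + key, e);
        }
      });
    }
  }

A custom QParser(Plugin) would then call forContext() with whatever key it derives from the fq parameter and use the returned analyzer when building the query.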

Regards,
Markus

 
 
-Original message-
> From:Vergantini Luca 
> Sent: Friday 17th August 2018 14:04
> To: solr-user@lucene.apache.org
> Subject: Contextual Synonym Filter
> 
> I need to create a contextual Synonym Filter: 
> I need that the Synonym Filter load different synonym configuration based on 
> the fq query parameter. 
> I've already modified the SynonymGraphFilterFactory to load from DB (this is 
> another requirement) but I can't understand how to make the fq parameter 
> arrive to the Factory. 
> Maybe I need a Query Parser plugin? 
> Please help 
> 
> 
> Luca Vergantini
> Whitehall Reply
> Via del Giorgione, 59
> 00147 - Roma - ITALY
> phone: +39 06 844341 
> l.vergant...@reply.it 
> www.reply.it 
> 


RE: Searching by dates

2018-08-16 Thread Markus Jelsma
Hello Christopher,

We have a library whose sole purpose is to extract, parse and validate dates found 
in free text, in all major world languages (and many more) and in every thinkable 
format/notation. It can also deal with times, timezones (resolving them back to 
UTC), different eras (e.g. Buddhist), validate dates (e.g. 2018-1-4) and figure out 
which format is correct (yyyy-m-d or yyyy-d-m) if a day name is found somewhere very 
close to the date. And it supports month names, including the abbreviated format 
(thanks to Locale).

We use it to get the date for an article/web page on our Sitesearch platform, 
and index it to Solr so we can boost recent articles. But some of our customers 
use it together with a Lucene CharFilter to transform it on-the-fly 
(maintaining offsets and positions for highlighting) when indexing or 
searching, or embedded in a QueryParser.

It is a mature project in on-going development since 2010, but not open source, 
so if you are interested contact us off list.

Regards,
Markus

 
 
-Original message-
> From:Shawn Heisey 
> Sent: Thursday 16th August 2018 20:09
> To: solr-user@lucene.apache.org
> Subject: Re: Searching by dates
> 
> On 8/16/2018 9:20 AM, Christopher Schultz wrote:
> > Hmm. I could have sworn the documentation I read in the past (maybe as
> > long as 3-4 months ago) indicated that date+timestamp was necessary.
> > Maybe that was just for the index, while the searches can be partial.
> 
> DateRangeField was introduced four years ago, first available in Solr
> version 5.0.
> 
> https://issues.apache.org/jira/browse/SOLR-6103
> 
> > As for i18n, is there a way to have the query analyzer convert strings
> > like "mm/dd/yyyy" into "yyyy-mm-dd"?
> 
> Solr doesn't accept dates in mm/dd/yyyy syntax, and can't convert that
> for you.  The ISO standard that *is* accepted is the more logical
> yyyy-mm-dd.  It's generally best if you don't use a freeform text field
> for dates ... provide a full interface for choosing specific dates so
> that user input is predictable.  Probably something like this:
> 
> https://developer.mozilla.org/en-US/docs/Web/HTML/Element/input/date
> 
> Looking at the documentation, I don't see any way to search for just a
> day without the year.  That could be a useful enhancement for
> birthday-related use cases, but I have no idea how hard it would be to
> write.
> 
> Thanks,
> Shawn
> 
> 


7.2.1 Solr collection sluggish

2018-08-08 Thread Markus Jelsma
Hello,

We've got, again, a little mystery here. Our main text collection has suddenly been 
running at a snail's pace since very early Monday morning, when the monitoring graph 
for response time went up. This is not unusual for Solr, so all the JVMs were 
restarted; that always solves a sluggish collection, but not this time. They were 
restarted yesterday as well, no change. The VMs Solr is running on were rebooted 
today, also no change.

Not all queries are slow all the time: a random query is just slow sometimes, or 
sometimes most of the time. All 6 replicas are sometimes slow.

We also took a good look at our monitoring: JVM heap was normal, IO was normal, and 
CPU was normal until the first restart. Since the first restart, CPU usage has been 
erratic, but not worryingly off the charts, just not 'normal' as usual. 

No changes were made to the collection for days before it became sluggish.

CPU sampling with VisualVM is not helpful either, nothing really stands out, 
especially when i compare it to another cluster that is still healthy. GC is 
also normal.

So, any ideas out here?

Many thanks,
Markus



RE: Recipe for moving to solr cloud without reindexing

2018-08-07 Thread Markus Jelsma
Hello Bjarke,

You can use shard splitting:
https://lucene.apache.org/solr/guide/6_6/collections-api.html#CollectionsAPI-splitshard
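
For example (collection and shard names are placeholders):

  http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1

The parent shard is marked inactive once the two sub-shards are active, after which it can be removed with DELETESHARD.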

Regards,
Markus

 
 
-Original message-
> From:Bjarke Buur Mortensen 
> Sent: Tuesday 7th August 2018 13:47
> To: solr-user@lucene.apache.org
> Subject: Re: Recipe for moving to solr cloud without reindexing
> 
> Thank you, that is of course a way to go, but I would actually like to be
> able to shard ...
> Could I use your approach and add shards dynamically?
> 
> 
> 2018-08-07 13:28 GMT+02:00 Markus Jelsma :
> 
> > Hello Bjarke,
> >
> > If you are not going to shard you can just create a 1 shard/1 replica
> > collection, shut down Solr, copy the data directory into the replica's
> > directory and start up again.
> >
> > Regards,
> > Markus
> >
> > -Original message-
> > > From:Bjarke Buur Mortensen 
> > > Sent: Tuesday 7th August 2018 13:06
> > > To: solr-user@lucene.apache.org
> > > Subject: Recipe for moving to solr cloud without reindexing
> > >
> > > Hi List,
> > >
> > > is there a cookbook recipe for moving an existing solr core to a solr
> > cloud
> > > collection.
> > >
> > > We currently have a single machine with a large core (~150gb), and we
> > would
> > > like to move to solr cloud.
> > >
> > > I haven't been able to find anything that reuses an existing index, so
> > any
> > > pointers much appreciated.
> > >
> > > Thanks,
> > > Bjarke
> > >
> >
> 


RE: Recipe for moving to solr cloud without reindexing

2018-08-07 Thread Markus Jelsma
Hello Bjarke,

If you are not going to shard you can just create a 1 shard/1 replica 
collection, shut down Solr, copy the data directory into the replica's 
directory and start up again.
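
Roughly like this; the port, the paths and the replica directory name are only examples for a 7.x layout, adjust them to your install:

  # on the SolrCloud node, after creating the empty 1 shard / 1 replica collection
  bin/solr stop -p 8983
  cp -r /path/to/old/solr/server/solr/mycore/data/* \
        /path/to/new/solr/server/solr/mycollection_shard1_replica_n1/data/
  bin/solr start -c -p 8983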

Regards,
Markus
 
-Original message-
> From:Bjarke Buur Mortensen 
> Sent: Tuesday 7th August 2018 13:06
> To: solr-user@lucene.apache.org
> Subject: Recipe for moving to solr cloud without reindexing
> 
> Hi List,
> 
> is there a cookbook recipe for moving an existing solr core to a solr cloud
> collection.
> 
> We currently have a single machine with a large core (~150gb), and we would
> like to move to solr cloud.
> 
> I haven't been able to find anything that reuses an existing index, so any
> pointers much appreciated.
> 
> Thanks,
> Bjarke
> 


RE: indexing two words, searching single word

2018-08-03 Thread Markus Jelsma
Hello,

If your case is English, you could use synonyms to work around the problem, since 
the language has only a few compound words. However, if you were dealing with a 
Germanic compounding language, the HyphenationCompoundWordTokenFilter [1] or the 
DictionaryCompoundWordTokenFilter would be a better choice. The former is much more 
flexible but has its drawbacks.
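
A minimal field type sketch; the hyphenation grammar and dictionary file names are only examples and depend on your language:

  <fieldType name="text_compound" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- hyph_de.xml is an OFFO/FOP hyphenation grammar, dictionary.txt an optional word list -->
      <filter class="solr.HyphenationCompoundWordTokenFilterFactory"
              hyphenator="hyph_de.xml" dictionary="dictionary.txt"
              minWordSize="5" onlyLongestMatch="true"/>
    </analyzer>
  </fieldType>

For the English "sound stage" / "soundstage" case, a simple synonym entry such as "sound stage, soundstage" in a SynonymGraphFilter is usually enough.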

Regards,
Markus

[1] https://lucene.apache.org/core/7_4_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

 
 
-Original message-
> From:Clemens Wyss DEV 
> Sent: Friday 3rd August 2018 12:22
> To: solr-user@lucene.apache.org
> Subject: indexing two words, searching single word
> 
> Sounds like a rather simple issue:
> if I index "sound stage" and search for "soundstage" I get no hits
> 
> What am I doing wrong 
> a) when indexing
> b) when searching
> ?
> 
> Thx in advance
> - Clemens
> 


RE: Solr Server crashes when requesting a result with too large resultRows

2018-07-31 Thread Markus Jelsma
Hello Georg,

As you have seen, a high rows parameter is a bad idea. Use cursor mark [1] 
instead.
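
A rough SolrJ sketch of walking a large result set with a cursor; the URL and collection are placeholders, the query and filter are only carried over from your mail as examples, and the sort must include the uniqueKey field:

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.params.CursorMarkParams;

  public class CursorWalk {
    public static void main(String[] args) throws Exception {
      SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/catalog").build();

      SolrQuery q = new SolrQuery("string_catalog_aliases:(*2*)");
      q.addFilterQuery("string_field_type:catalog_entry");
      q.setRows(1000);                                    // page size per request, not the total
      q.addSort(SolrQuery.SortClause.asc("id"));          // must include the uniqueKey for cursors

      String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
      boolean done = false;
      while (!done) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = client.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          // process doc here
        }
        String next = rsp.getNextCursorMark();
        done = cursor.equals(next);                       // same mark twice means no more results
        cursor = next;
      }
      client.close();
    }
  }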

Regards,
Markus

[1] https://lucene.apache.org/solr/guide/7_4/pagination-of-results.html
 
 
-Original message-
> From:Georg Fette 
> Sent: Tuesday 31st July 2018 10:44
> To: solr-user@lucene.apache.org
> Subject: Solr Server crashes when requesting a result with too large 
> resultRows
> 
> Hello,
> We run the server version 7.3.1. on a machine with 32GB RAM in a mode 
> having -10g.
> When requesting a query with
> q={!boost 
> b=sv_int_catalog_count_document}string_catalog_aliases:(*2*)&fq=string_field_type:catalog_entry&rows=2147483647
> the server takes all available memory up to 10GB and is then no longer 
> accessible with one processor at 100%.
> When we reduce the rows parameter to 1000 the query works. The query 
> returns only 581 results.
> The documentation at https://wiki.apache.org/solr/CommonQueryParameters 
> states that as the "rows" parameter a "ridiculously large value" may be 
> used, but this could pose a problem. The number we used was Int.max from 
> Java.
> Greetings
> Georg
> 
> -- 
> -
> Dipl.-Inf. Georg Fette  Raum: B001
> Universität WürzburgTel.: +49-(0)931-31-85516
> Am Hubland  Fax.: +49-(0)931-31-86732
> 97074 Würzburg  mail: georg.fe...@uni-wuerzburg.de
> -
> 
> 


RE: Recent configuration change to our site causes frequent index corruption

2018-07-26 Thread Markus Jelsma
Hello,

Is your maximum number of open files 1024? If so, increase it to a more regular 
65536. Some operating systems ship with 1024 for reasons i don't understand. 
Whenever installing Solr anywhere for the past ten years, we have had to check 
this each and every time, and still have to!
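
To check and raise the limit (assuming the process runs as a user called solr):

  # show the current limit for the shell/user
  ulimit -n

  # /etc/security/limits.conf
  solr  soft  nofile  65536
  solr  hard  nofile  65536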

Regards,
Markus

 
 
-Original message-
> From:cyndefromva 
> Sent: Thursday 26th July 2018 22:18
> To: solr-user@lucene.apache.org
> Subject: Recent configuration change to our site causes frequent index 
> corruption
> 
> I have Rails 5 application that uses solr to index and search our site. The
> sunspot gem is used to integrate ruby and sunspot.  It's a relatively small
> site (no more 100,000 records) and has moderate usage (except for the
> googlebot).
> 
> Until recently we regularly received 503 errors; reloading the page
> generally cleared it up, but that was not exactly the user experience we
> wanted, so we added the following initializer to force the retry on failures:
> 
> Sunspot.session =
> Sunspot::SessionProxy::Retry5xxSessionProxy.new(Sunspot.session)
> 
> As a result, about every third day the site locks up until we rebuild the
> data directory (stop solr, move data directory to another location, start
> solr, reindex). 
> 
> At the point it starts failing I see a java exception: "java.io-IOException:
> Too many open files" in the solr log file and a SolrException (Error open
> new searcher) is returned to the user.
> 
> In the solrconfig.xml file we have autoCommit and autoSoftCommit set as
> follows:
> 
>   
>  ${solr.autoCommit.maxTime:15000}
>  false
>   
> 
>   
>  ${solr.autoSoftCommit.maxTime:-1}
>   
> 
> Which I believe means there should be a hard commit every 15 seconds.
> 
> But it appears to be calling commit more frequently. In the solr log I see
> the following commit written miliseconds from each other:
> 
>   UpdateHandler start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> 
> I also see the following written right below it:
> 
> PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> 
> Note: maxWarmingSearchers is set to 2.
> 
> 
> I would really appreciate any help I can get to resolve this issue.
> 
> Thank you!
> 
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 


RE: Can I use RegEx function?

2018-07-23 Thread Markus Jelsma
Hello,

Neither fl nor facet.field support functions, but facet.query is analogous to 
the latter. I do not understand what you need/want with fl and regex.

Regards,
Markus

 
 
-Original message-
> From:Peter Sh 
> Sent: Monday 23rd July 2018 11:21
> To: solr-user@lucene.apache.org
> Subject: Re: Can I use RegEx function?
> 
> Can I use it in "fl" and  "facet.field" as a function
> 
> On Mon, Jul 23, 2018 at 11:33 AM Markus Jelsma 
> wrote:
> 
> > Hello,
> >
> > The usual faceting works for all queries, facet.query=q:field:/[a-z]+$/
> > will probably work too, i would be really surprised if it didn't. Keep in
> > mind that my example doesn't work, the + needs to be URL encoded!
> >
> > Regards,
> > Markus
> >
> >
> >
> > -Original message-
> > > From:Peter Sh 
> > > Sent: Monday 23rd July 2018 10:26
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Can I use RegEx function?
> > >
> > > can it be used in facets?
> > >
> > > On Mon, Jul 23, 2018, 11:24 Markus Jelsma 
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > It is not really obvious in documentation, but the standard query
> > parser
> > > > supports regular expressions. Encapsulate your regex with forward
> > slashes
> > > > /, q=field:/[a-z]+$/ will work.
> > > >
> > > > Regards,
> > > > Markus
> > > >
> > > >
> > > >
> > > > -Original message-
> > > > > From:Peter Sh 
> > > > > Sent: Monday 23rd July 2018 10:09
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Can I use RegEx function?
> > > > >
> > > > > I've got collection with a string or text field storing free-text.
> > I'd
> > > > like
> > > > > to use some RexEx function looking for patterns like "KEY:VALUE"
> > from the
> > > > > text and use it for filtering and faceting.
> > > > >
> > > >
> > >
> >
> 


RE: Can I use RegEx function?

2018-07-23 Thread Markus Jelsma
Hello,

The usual faceting works for all queries, and facet.query=field:/[a-z]+$/ will 
probably work too; i would be really surprised if it didn't. Keep in mind that this 
example doesn't work as pasted, the + needs to be URL encoded!
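
URL encoded it would look something like this, with a made-up field name:

  facet=true&facet.query=field:/[a-z]%2B$/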

Regards,
Markus

 
 
-Original message-
> From:Peter Sh 
> Sent: Monday 23rd July 2018 10:26
> To: solr-user@lucene.apache.org
> Subject: Re: Can I use RegEx function?
> 
> can it be used in facets?
> 
> On Mon, Jul 23, 2018, 11:24 Markus Jelsma 
> wrote:
> 
> > Hello,
> >
> > It is not really obvious in documentation, but the standard query parser
> > supports regular expressions. Encapsulate your regex with forward slashes
> > /, q=field:/[a-z]+$/ will work.
> >
> > Regards,
> > Markus
> >
> >
> >
> > -Original message-
> > > From:Peter Sh 
> > > Sent: Monday 23rd July 2018 10:09
> > > To: solr-user@lucene.apache.org
> > > Subject: Can I use RegEx function?
> > >
> > > I've got collection with a string or text field storing free-text. I'd
> > like
> > > to use some RexEx function looking for patterns like "KEY:VALUE" from the
> > > text and use it for filtering and faceting.
> > >
> >
> 


RE: Can I use RegEx function?

2018-07-23 Thread Markus Jelsma
Hello,

It is not really obvious in the documentation, but the standard query parser 
supports regular expressions. Encapsulate your regex in forward slashes; for 
example, q=field:/[a-z]+$/ will work.

Regards,
Markus

 
 
-Original message-
> From:Peter Sh 
> Sent: Monday 23rd July 2018 10:09
> To: solr-user@lucene.apache.org
> Subject: Can I use RegEx function?
> 
> I've got collection with a string or text field storing free-text. I'd like
> to use some RexEx function looking for patterns like "KEY:VALUE" from the
> text and use it for filtering and faceting.
> 


RE: Cannot index to 7.2.1 collection alias

2018-07-18 Thread Markus Jelsma
Ah, it was caused by a badly made alias via the GUI. If you do not select the 
destination collection in that popup, it will mess things up and show these 
exceptions.
 
-Original message-
> From:Markus Jelsma 
> Sent: Tuesday 17th July 2018 16:52
> To: solr-user@lucene.apache.org
> Subject: RE: Cannot index to 7.2.1 collection alias
> 
> Hi Shawn,
> 
> Indexing stack trace:
> 
> null:java.lang.NullPointerException
>   at 
> org.apache.solr.servlet.HttpSolrCall.getCoreUrl(HttpSolrCall.java:931)
>   at 
> org.apache.solr.servlet.HttpSolrCall.getRemotCoreUrl(HttpSolrCall.java:902)
>   at 
> org.apache.solr.servlet.HttpSolrCall.extractRemotePath(HttpSolrCall.java:432)
>   at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:289)
>   at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:470)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
> 
> Reloading an alias is just not supported it seems: 
> 
> 2018-07-17 14:51:35.223 ERROR 
> (OverseerThreadFactory-32-thread-5-processing-n:idx2.oi.dev:8983_solr) [   ] 
> o.a.s.c.OverseerCollectionMessageHandler Collection: c1 operation: reload 
> failed:org.apache.solr.common.SolrException: Could not find collection : c1
> at 
> org.apache.solr.common.cloud.ClusterState.getCollection(ClusterState.java:111)
> at 
> org.apache.solr.cloud.OverseerCollectionMessageHandler.collectionCmd(OverseerCollectionMessageHandler.java:795)
> at 
> org.apache.solr.cloud.OverseerCollectionMessageHandler.collectionCmd(OverseerCollectionMessageHandler.java:784)
> 
>  
> Thanks,
> MArkus
>  
> -Original message-
> > From:Shawn Heisey 
> > Sent: Tuesday 17th July 2018 16:39
> > To: solr-user@lucene.apache.org
> > Subject: Re: Cannot index to 7.2.1 collection alias
> > 
> > On 7/17/2018 6:28 AM, Markus Jelsma wrote:
> > > Just attempted to connect and index a bunch of documents to a collection 
> > > alias, got a NPE right away. Can't find this error in Jira, did i 
> > > overlook something? Create new ticket?
> > 
> > Indexing to an alias should send the documents only to the first 
> > collection in the alias.  I am not aware of any problems in this 
> > functionality.
> > 
> > Before opening a Jira, can we see the full stacktrace from the error, so 
> > we can look into it?  Can you confirm that 7.2.1 is the version that 
> > created the stacktrace?
> > 
> > I don't know whether RELOAD is supported on aliases.  It would be good 
> > to see that stacktrace as well.
> > 
> > Thanks,
> > Shawn
> > 
> > 
> 


RE: Cannot index to 7.2.1 collection alias

2018-07-17 Thread Markus Jelsma
Hi Shawn,

Indexing stack trace:

null:java.lang.NullPointerException
at 
org.apache.solr.servlet.HttpSolrCall.getCoreUrl(HttpSolrCall.java:931)
at 
org.apache.solr.servlet.HttpSolrCall.getRemotCoreUrl(HttpSolrCall.java:902)
at 
org.apache.solr.servlet.HttpSolrCall.extractRemotePath(HttpSolrCall.java:432)
at org.apache.solr.servlet.HttpSolrCall.init(HttpSolrCall.java:289)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:470)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)

Reloading an alias is just not supported it seems: 

2018-07-17 14:51:35.223 ERROR 
(OverseerThreadFactory-32-thread-5-processing-n:idx2.oi.dev:8983_solr) [   ] 
o.a.s.c.OverseerCollectionMessageHandler Collection: c1 operation: reload 
failed:org.apache.solr.common.SolrException: Could not find collection : c1
at 
org.apache.solr.common.cloud.ClusterState.getCollection(ClusterState.java:111)
at 
org.apache.solr.cloud.OverseerCollectionMessageHandler.collectionCmd(OverseerCollectionMessageHandler.java:795)
at 
org.apache.solr.cloud.OverseerCollectionMessageHandler.collectionCmd(OverseerCollectionMessageHandler.java:784)

 
Thanks,
MArkus
 
-Original message-
> From:Shawn Heisey 
> Sent: Tuesday 17th July 2018 16:39
> To: solr-user@lucene.apache.org
> Subject: Re: Cannot index to 7.2.1 collection alias
> 
> On 7/17/2018 6:28 AM, Markus Jelsma wrote:
> > Just attempted to connect and index a bunch of documents to a collection 
> > alias, got a NPE right away. Can't find this error in Jira, did i overlook 
> > something? Create new ticket?
> 
> Indexing to an alias should send the documents only to the first 
> collection in the alias.  I am not aware of any problems in this 
> functionality.
> 
> Before opening a Jira, can we see the full stacktrace from the error, so 
> we can look into it?  Can you confirm that 7.2.1 is the version that 
> created the stacktrace?
> 
> I don't know whether RELOAD is supported on aliases.  It would be good 
> to see that stacktrace as well.
> 
> Thanks,
> Shawn
> 
> 


RE: Cannot index to 7.2.1 collection alias

2018-07-17 Thread Markus Jelsma
Additionally, reloading a collection alias also doesn't work. Can't find that 
one in Jira either, new ticket?

Thanks,
Markus

 
 
-Original message-
> From:Markus Jelsma 
> Sent: Tuesday 17th July 2018 14:28
> To: solr-user 
> Subject: Cannot index to 7.2.1 collection alias
> 
> Hello,
> 
> Just attempted to connect and index a bunch of documents to a collection 
> alias, got a NPE right away. Can't find this error in Jira, did i overlook 
> something? Create new ticket?
> 
> Thanks,
> Markus
> 


Cannot index to 7.2.1 collection alias

2018-07-17 Thread Markus Jelsma
Hello,

Just attempted to connect and index a bunch of documents to a collection alias, 
got a NPE right away. Can't find this error in Jira, did i overlook something? 
Create new ticket?

Thanks,
Markus


RE: 7.3 appears to leak

2018-07-16 Thread Markus Jelsma
Hello Thomas,

To be absolutely sure you suffer from the same problem as one of our 
collections, can you confirm that your Solr cores are leaking a 
SolrIndexSearcher instance on each commit? If not, there may be a second 
problem.

Also, do you run any custom plugins or apply patches to your Solr instances? Or 
is your Solr a 100 % official build?

Thanks,
Markus

 
 
-Original message-
> From:Thomas Scheffler 
> Sent: Monday 16th July 2018 13:39
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3 appears to leak
> 
> Hi,
> 
> we noticed the same problems here in a rather small setup. 40.000 metadata 
> documents with nearly as much files that have „literal.*“ fields with it. 
> While 7.2.1 has brought some tika issues the real problems started to appear 
> with version 7.3.0 which are currently unresolved in 7.4.0. Memory 
> consumption is out-of-roof. Where previously 512MB heap was enough, now 6G 
> aren’t enough to index all files.
> 
> kind regards,
> 
> Thomas
> 
> > Am 04.07.2018 um 15:03 schrieb Markus Jelsma :
> > 
> > Hello Andrey,
> > 
> > I didn't think of that! I will try it when i have the courage again, 
> > probably next week or so.
> > 
> > Many thanks,
> > Markus
> > 
> > 
> > -Original message-
> >> From:Kydryavtsev Andrey 
> >> Sent: Wednesday 4th July 2018 14:48
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: 7.3 appears to leak
> >> 
> >> If it is not possible to find a resource leak by code analysis and there 
> >> is no better ideas, I can suggest a brute force approach:
> >> - Clone Solr's sources from appropriate branch 
> >> https://github.com/apache/lucene-solr/tree/branch_7_3
> >> - Log every searcher's holder increment/decrement operation in a way to 
> >> catch every caller name (use Thread.currentThread().getStackTrace() or 
> >> something) 
> >> https://github.com/apache/lucene-solr/blob/branch_7_3/solr/core/src/java/org/apache/solr/util/RefCounted.java
> >> - Build custom artefacts and upload them on prod
> >> - After memory leak happened - analyse logs to see what part of 
> >> functionality doesn't decrement searcher after counter was incremented. If 
> >> searchers are leaked - there should be such code I guess.
> >> 
> >> This is not something someone would like to do, but it is what it is.
> >> 
> >> 
> >> 
> >> Thank you,
> >> 
> >> Andrey Kudryavtsev
> >> 
> >> 
> >> 03.07.2018, 14:26, "Markus Jelsma" :
> >>> Hello Erick,
> >>> 
> >>> Even the silliest ideas may help us, but unfortunately this is not the 
> >>> case. All our Solr nodes run binaries from the same source from our 
> >>> central build server, with the same libraries thanks to provisioning. 
> >>> Only schema and config are different, but the  directive is the 
> >>> same all over.
> >>> 
> >>> Are there any other ideas, speculations, whatever, on why only our main 
> >>> text collection leaks a SolrIndexSearcher instance on commit since 7.3.0 
> >>> and every version up?
> >>> 
> >>> Many thanks?
> >>> Markus
> >>> 
> >>> -Original message-
> >>>>  From:Erick Erickson 
> >>>>  Sent: Friday 29th June 2018 19:34
> >>>>  To: solr-user 
> >>>>  Subject: Re: 7.3 appears to leak
> >>>> 
> >>>>  This is truly puzzling then, I'm clueless. It's hard to imagine this
> >>>>  is lurking out there and nobody else notices, but you've eliminated
> >>>>  the custom code. And this is also very peculiar:
> >>>> 
> >>>>  * it occurs only in our main text search collection, all other
> >>>>  collections are unaffected;
> >>>>  * despite what i said earlier, it is so far unreproducible outside
> >>>>  production, even when mimicking production as good as we can;
> >>>> 
> >>>>  Here's a tedious idea. Restart Solr with the -v option, I _think_ that
> >>>>  shows you each and every jar file Solr loads. Is it "somehow" possible
> >>>>  that your main collection is loading some jar from somewhere that's
> >>>>  different than you expect? 'cause silly ideas like this are all I can
> >>>>  come up with.
> >>>> 
> >>>>  Erick
> >>>> 
> >>>>  On Fri, Jun 29, 2018 at 9:56 AM, Ma
