Re: Announcing githubsearch!

2024-02-20 Thread Walter Underwood
Oops, I followed a link which went to the main GitHub search. Nevermind.

I’m getting zero results for “wunder” now, with no error. It looks like my 
username there is “wrunderwood”; that works correctly, as do quoted searches 
for my name.

I’ll fool around some more, but so far it looks clean and fast. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 20, 2024, at 3:29 AM, Michael McCandless  
> wrote:
> 
> On Mon, Feb 19, 2024 at 1:00 PM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> 
>> It appears to always search prefixes, so there is no way to search for 
>> “wunder” without getting “wundermap” and “wunderground”. Putting the term in 
>> quotes doesn’t turn that off.
> 
> Hmm that shouldn't be the case?  It does split on camel case though (thank 
> you WordDelimiterFilter!).  E.g. try searching on infix 
> <https://githubsearch.mikemccandless.com/search.py?text=infix=status%3AOpen>
>  and you should see it highlighted inside terms like AnalyzingInfixSuggester.
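A minimal sketch of the kind of analysis chain that produces this camel-case splitting, assuming WordDelimiterGraphFilter with SPLIT_ON_CASE_CHANGE (not necessarily githubsearch's exact configuration):

    import org.apache.lucene.analysis.*;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;

    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
            | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE
            | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;
        // split on case changes, keeping the original token too
        TokenStream result = new WordDelimiterGraphFilter(source, flags, null);
        result = new LowerCaseFilter(result);
        return new TokenStreamComponents(source, result);
      }
    };
    // "AnalyzingInfixSuggester" -> analyzinginfixsuggester, analyzing, infix, suggester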
> 
> In fact when I search for wunder 
> <https://githubsearch.mikemccandless.com/search.py?chg=new=wunder===0=25966=recentlyUpdated=list=acbloeox20ko=status%3AOpen=wunder>
>  I get a horrible exception, I think I know why (it happens for any query 
> that gets no hits!).  I opened this issue 
> <https://github.com/mikemccand/luceneserver/issues/26>.  I'll try to fix that 
> soon.
> 
> Walter, I'm not sure how you were able to even search on "wunder" -- did you 
> get actual results?  From githubsearch 
> <https://githubsearch.mikemccandless.com/search.py>?
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com <http://blog.mikemccandless.com/>
> 



Re: Announcing githubsearch!

2024-02-19 Thread Walter Underwood
It appears to always search prefixes, so there is no way to search for “wunder” 
without getting “wundermap” and “wunderground”. Putting the term in quotes 
doesn’t turn that off.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 19, 2024, at 8:39 AM, Michael McCandless  
> wrote:
> 
> Hi Team,
> 
> ~1.5 years ago (August 2022) we migrated our Lucene issue tracking from Jira 
> to GitHub. Thank you Tomoko for all the hard work doing such a complex, 
> multi-phased, high-fidelity migration!
> 
> I finally finished also migrating jirasearch to GitHub: 
> githubsearch.mikemccandless.com <https://githubsearch.mikemccandless.com/>. 
> It was tricky because GitHub issues/PRs are fundamentally more complex than 
> Jira's data model, and the GitHub REST API is also quite rich / heavily 
> normalized. All of the source code for githubsearch lives here 
> <https://github.com/mikemccand/luceneserver/tree/master/examples/githubsearch>.
>  The UI remains its barebones self ;)
> 
> Githubsearch 
> <https://github.com/mikemccand/luceneserver/tree/master/examples/githubsearch>
>  is dog food for us: it showcases Lucene (currently 9.8.0), and many of its 
> fun features like infix autosuggest, block join queries (each comment is a 
> sub-document on the issue/PR), DrillSideways faceting, near-real-time 
> indexing/searching, synonyms (try “oome 
> <https://githubsearch.mikemccandless.com/search.py?text=oome=status%3AOpen>”),
>  expressions, non-relevance and blended-relevance sort, etc.  (This old blog 
> post 
> <https://blog.mikemccandless.com/2016/10/jiraseseach-20-dog-food-using-lucene-to.html>
>  goes into detail.)  Plus, it’s meta-fun to use Lucene to search its own 
> issues, to help us be more productive in improving Lucene!  Nicely recursive.
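As an aside, the block-join arrangement mentioned above (each comment indexed as a child document of its issue/PR) looks roughly like this with Lucene's join module; the field names here are hypothetical, not necessarily what githubsearch uses:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.join.*;

    // Parents are the issue/PR docs; comments are the child docs in each block.
    BitSetProducer parents =
        new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));

    // Match issues/PRs that have at least one comment containing "oome".
    Query childQuery = new TermQuery(new Term("commentBody", "oome"));
    Query joined = new ToParentBlockJoinQuery(childQuery, parents, ScoreMode.Max);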
> 
> In addition to good ol’ searching by text, githubsearch 
> <https://githubsearch.mikemccandless.com/> has some new/fun features:
> 
> - Drill down to just PRs or issues
> - Filter by “review requested” for a given user: poor Adrien has 8 (open) now 
>   <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated=status%3AOpen=requested_reviewers%3Ajpountz>
>   (sorry)! Or see your mentions (Robert is mentioned in 27 open issues/PRs 
>   <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated=status%3AOpen=mentioned_users%3Armuir>).
>   Or PRs that you reviewed (Uwe has reviewed 9 still-open PRs 
>   <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated=status%3AOpen=reviewed_users%3Auschindler>).
>   Or issues and PRs where a user has had any involvement at all (Dawid has 
>   interacted on 197 issues/PRs 
>   <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated=reviewed_users%3Adweiss>).
> - Find still-open PRs that were created by a New Contributor 
>   <https://githubsearch.mikemccandless.com/search.py?chg=dds==author_association=New+contributor=0=25792=recentlyUpdated=list=cjhfx60attlt=status%3AOpen=>
>   (an author who has no changes merged into our repository) or Contributor 
>   <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated=status%3AOpen=author_association%3AContributor>
>   (a non-committer who has had some changes merged into our repository) or 
>   Member 
>   <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated=status%3AOpen=author_association%3AMember>
> - Here are the uber-stale (last touched more than a month ago) open PRs by 
>   outside contributors 
>   <https://githubsearch.mikemccandless.com/search.py?sort=recentlyUpdated=status%3AOpen=author_association%3ANew+contributor%2CContributor%2CNone=updated_ago%3A%3E+1+month+ago=issue_or_pr%3APR>.
>   We should ideally keep this at 0, but it’s 83 now!
> - “Link to this search” to get a short-er, more permanent URL (it is NOT a 
>   URL shortener, though!)
> - Save named searches you frequently run (they just save to local cookie 
>   state on that one browser)
> I’m sure there are exciting bugs, feedback/patches welcome!  If you see 
> problems, please reply to this email or file an issue here 
> <https://github.com/mikemccand/luceneserver/issues>.
> 
> Note that jirasearch <https://jirasearch.mikemccandless.com/search.py> 
> remains running, to search Solr, Tika and Infra issues.
> 
> Happy Searching,
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com <http://blog.mikemccandless.com/>


Re: Bump minimum Java version requirement to 21

2023-11-06 Thread Walter Underwood
We love the performance improvements, but most of our upgrades are because of 
CVEs that aren’t backported. We need to upgrade Thing X to the next major 
version and that version requires a more recent Java. 

Java versions for Solr are managed separately from the massive Java codebase, 
but we’d probably bump everything around the same time.

We would absolutely upgrade for a substantial performance improvement because 
of the number of nodes we run. Reduction in cost and better response time would 
make it worth it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 6, 2023, at 8:47 AM, Dawid Weiss  wrote:
> 
> 
>> It's not just you - we have an internal JDK11 fork at BIG COMPANY for some 
>> folks that can't get off the stick.
> 
>  The truth is - most large companies will be reluctant to upgrade unless they 
> see a benefit in doing so. Here, we can offer this benefit (call it a carrot, 
> if you mentioned the stick, Mike) - speedups to vector routines, new 
> directory implementations Uwe has been working on, probably more.
> 
> I'm fairly conservative myself, but I also think that those new APIs are 
> probably worth the investment (and potential pain) to upgrade.
> 
> Dawid



Re: Bump minimum Java version requirement to 21

2023-11-06 Thread Walter Underwood
Yes, LexisNexis is running Java 11 and will probably move to Java 17 soon 
because of Spring Boot 3 requirements. We are running a few hundred Solr nodes, 
mostly 9.1. Probably a few 8.10 clusters out there.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 6, 2023, at 5:18 AM, Gus Heck  wrote:
> 
> For perspective, I'm still seeing java 11 as the norm for clients... 17 is 
> uncommon. Anything requiring 21 is likely to be difficult to sell. I am 
> however a small shop, and "migrating off of solr 6" and "trying out solr 
> cloud" is still a thing for some clients.
> 
> Just a datapoint/anecdote, possibly skewed.
> 
> On Mon, Nov 6, 2023 at 7:41 AM Chris Hegarty 
>  wrote:
>> Hi Robert,
>> 
>> > On 6 Nov 2023, at 12:24, Robert Muir > > <mailto:rcm...@gmail.com>> wrote:
>> > 
>> >> …
>> >> The only concern I have with no.2 is that it could be considered an 
>> >> “aggressive” adoption of Java 21 - adoption sooner than the ecosystem can 
>> >> handle, e.g. are environments in which Lucene is deployed, and their 
>> >> transitive dependencies, ready to run on Java 21? By the time we’re ready 
>> >> to release 10.0.0, say March 2024, then I expect no issue with this.
>> > 
>> > The problem is worse, historically jdk version X isn't adopted as a
>> > minimum until it is already EOL. And the lucene major versions take an
>> > eternity to get out there, code just sits in "main" branch for years
>> > unreleased to nobody. It is really discouraging as a contributor to
>> > contribute code that literally sits on the shelf for years, for no
>> > good reason at all.
>> 
>> Agreed. I feel discouraged by this approach too, and I also want to
>> avoid the “backport the world” pattern, since it’s counterproductive.
>> 
>> > So why delay?
>> > 
>> > The argument of "moving sooner than ecosystem can handle" is also
>> > bogus in the same way. You mean versus the code sitting on the shelf
>> > and being released to nobody?
>> 
>> Yes - sitting on the shelf is no good to anyone.
>> 
>> Ok, what I’m hearing are good arguments for releasing 10.0.0 *now*, with
>> a Java 17 minimum - this is what is in _main_ today.
>> 
>> If we do that, then we can follow up with _main_ later (after the 10.x
>> branch is created). That is, 1) bump _main_ to Java 21, and 2) decide
>> when a Lucene 11 is to be released (I would like to see Lucene 11 ~1yr after
>> Lucene 10).
>> 
>> This is Uwe’s proposal, earlier in this thread.
>> 
>> -Chris.
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org 
>> <mailto:dev-unsubscr...@lucene.apache.org>
>> For additional commands, e-mail: dev-h...@lucene.apache.org 
>> <mailto:dev-h...@lucene.apache.org>
>> 
> 
> 
> -- 
> http://www.needhamsoftware.com <http://www.needhamsoftware.com/> (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)



Re: ConjunctionDISI nextDoc can return immediately when NO_MORE_DOCS

2023-10-01 Thread Walter Underwood
At Infoseek, the engine checked the terms in frequency order, with the rarest 
term first. If the conjunction reached zero matches at any point, it stopped 
checking.

This might be a related but more general approach.
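For illustration, a rough sketch of that rarest-first strategy, which is essentially the cost-sorted leapfrog that Lucene's conjunction already performs (names here are illustrative, not Lucene internals):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.lucene.search.DocIdSetIterator;

    static void intersect(DocIdSetIterator[] its) throws IOException {
      // cheapest (rarest) clause leads; if it is empty, we stop immediately
      Arrays.sort(its, Comparator.comparingLong(DocIdSetIterator::cost));
      int doc = its[0].nextDoc();
      while (doc != DocIdSetIterator.NO_MORE_DOCS) {
        boolean match = true;
        for (int i = 1; i < its.length; i++) {
          int other = its[i].docID() < doc ? its[i].advance(doc) : its[i].docID();
          if (other != doc) {           // mismatch: leapfrog the lead forward
            doc = its[0].advance(other);
            match = false;
            break;
          }
        }
        if (match) {
          // collect(doc) would go here
          doc = its[0].nextDoc();
        }
      }
    }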

That was almost 30 years ago, so any patents are long-expired. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 1, 2023, at 10:12 AM, Adrien Grand  wrote:
> 
> I agree that it would save work in that case, but this query should be very 
> fast anyway.
> 
> On the other hand, if term1, term2 and term3 have 10M matches each, the 
> conjunction will need to check if the current candidate match is NO_MORE_DOCS 
> millions of times even though this would only happen once.
> 
> In general it's better to have less overhead for expensive queries and more 
> overhead for cheap queries than the other way around.
> 
> On Sun, Oct 1, 2023 at 17:35, YouPeng Yang <mailto:yypvsxf19870...@gmail.com> wrote:
>> Hi Adrien
>> suppose that conjunction query like  (term1 AND term2 AND term3) ,if the 
>> term1 does not exist ,and then the loop execution may cause unnecessary  
>> overhead.(sorry I have not yet find out whether there is any filter work 
>> before the doNext()..
>> 
>> Best Regard
>> 
>> Adrien Grand <mailto:jpou...@gmail.com> wrote on Sun, Oct 1, 2023 at 22:30:
>>> Hello,
>>> 
>>> This change would be correct, but it would only save work when the 
>>> conjunction is exhausted, and add overhead otherwise?
>>> 
>>> On Sat, Sep 30, 2023 at 16:20, YouPeng Yang <mailto:yypvsxf19870...@gmail.com> wrote:
>>>> Hi
>>>>   I am reading the code of the ConjunctionDISI class, specifically the 
>>>> nextDoc method. Suppose the sub-DISI is empty in lead1/lead2 - should it 
>>>> return immediately when the input doc == NO_MORE_DOCS? 
>>>> 
>>>> 
>>>> private int doNext(int doc) throws IOException {
>>>>   advanceHead:
>>>>   for (;;) {
>>>>     assert doc == lead1.docID();
>>>>     // proposed change: if doc is already NO_MORE_DOCS, return immediately
>>>>     if (doc == NO_MORE_DOCS) {
>>>>       return NO_MORE_DOCS;
>>>>     }
>>>>     // find agreement between the two iterators with the lower costs
>>>>     // we special case them because they do not need the
>>>>     // 'other.docID() < doc' check that the 'others' iterators need
>>>>     final int next2 = lead2.advance(doc);
>>>>     if (next2 != doc) {
>>>>       doc = lead1.advance(next2);
>>>>       // proposed change: same early return after advancing lead1
>>>>       if (doc == NO_MORE_DOCS) {
>>>>         return NO_MORE_DOCS;
>>>>       }
>>>>       if (next2 != doc) {
>>>>         continue advanceHead;
>>>>       }
>>>>     }
>>>>     // ... rest omitted ...
>>>>   }
>>>> }



Re: Sitemap to get latest reference manual to rank in Google/Bing?

2023-09-21 Thread Walter Underwood
Oops, I’ll re-ask over there. Thanks. —wunder

> On Sep 21, 2023, at 9:31 AM, Adrien Grand  wrote:
> 
> Hi Walter,
> 
> You emailed the Lucene dev list (dev@lucene.a.o) but I think you meant
> to ask this question to the Solr list (dev@solr.a.o).
> 
> On Wed, Sep 20, 2023 at 8:59 PM Walter Underwood  
> wrote:
>> 
>> When I get web search results that include the Solr Reference Guide, I often 
>> get older versions (6.6, 7.4) in the results. I would prefer to always get 
>> the latest reference (https://solr.apache.org/guide/solr/latest/index.html).
>> 
>> I think we can list the URLs for that in a sitemap.xml file with a higher 
>> priority to suggest to the crawlers that these are the preferred pages.
>> 
>> I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apache.org.
>> 
>> Should we prefer the latest manual? How do we build/deploy a sitemap? See: 
>> https://www.sitemaps.org/
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 
> 
> -- 
> Adrien
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Sitemap to get latest reference manual to rank in Google/Bing?

2023-09-20 Thread Walter Underwood
When I get web search results that include the Solr Reference Guide, I often 
get older versions (6.6, 7.4) in the results. I would prefer to always get the 
latest reference (https://solr.apache.org/guide/solr/latest/index.html).

I think we can list the URLs for that in a sitemap.xml file with a higher 
priority to suggest to the crawlers that these are the preferred pages.

I don’t see a sitemap.xml or sitemap.xml.gz at https://solr.apache.org 
<https://solr.apache.org/>.

Should we prefer the latest manual? How do we build/deploy a sitemap? See: 
https://www.sitemaps.org/
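For illustration, a minimal sitemap.xml along those lines (the URLs and values here are just examples):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://solr.apache.org/guide/solr/latest/index.html</loc>
        <changefreq>weekly</changefreq>
        <priority>1.0</priority>
      </url>
      <!-- older guide versions could be listed with a lower priority, e.g. 0.1 -->
    </urlset>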

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Patch to change murmurhash implementation slightly

2023-04-25 Thread Walter Underwood
I would recommend some non-English tests. Non-Latin scripts (CJK, Arabic, 
Hebrew) will produce longer byte strings because of UTF-8 encoding. German has 
large compound words.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 25, 2023, at 10:57 AM, Thomas Dullien 
>  wrote:
> 
> Hey all,
> 
> ok, attached is a second patch that adds some unit tests; I am happy to add 
> more.
> 
> This brings me back to my original question: I'd like to run some pretty 
> thorough benchmarking on Lucene, both for this change and for possible other 
> future changes, largely focused on indexing performance. What are good 
> command lines to do so? What are good corpora?
> 
> Cheers,
> Thomas
> 
> On Tue, Apr 25, 2023 at 6:04 PM Thomas Dullien  <mailto:thomas.dull...@elastic.co>> wrote:
>> Hey,
>> 
>> ok, I've done some digging: Unfortunately, MurmurHash3 does not publish 
>> official test vectors, see the following URLs:
>> https://github.com/aappleby/smhasher/issues/6
>> https://github.com/multiformats/go-multihash/issues/135#issuecomment-791178958
>> There is a link to a pastebin entry in the first issue, which leads to 
>> https://pastebin.com/kkggV9Vx
>> 
>> Now, the test vectors in that pastebin match neither the output of 
>> pre-change Lucene's murmur3 nor the output of the Python mmh3 package. That 
>> said, pre-change Lucene and the mmh3 package agree, just not with the 
>> published list.
>> 
>> There *are* test vectors in the source code for the mmh3 python package, 
>> which I could use, or cook up a set of bespoke ones, or both (I share the 
>> concern about 8-byte boundaries and signedness).
>> https://github.com/hajimes/mmh3/blob/3bf1e5aef777d701305c1be7ad0550e093038902/test_mmh3.py#L75
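For example, the two boundary tests Robert suggests below might look like this - a sketch against Lucene's StringHelper, where the expected values are placeholders to be computed from the pre-change implementation or a reference mmh3:

    import java.nio.charset.StandardCharsets;
    import org.apache.lucene.util.StringHelper;

    // exactly-8-byte input exercises the length-8 boundary
    byte[] len8 = "abcdefgh".getBytes(StandardCharsets.UTF_8);
    int h1 = StringHelper.murmurhash3_x86_32(len8, 0, len8.length,
        StringHelper.GOOD_FAST_HASH_SEED);

    // non-ASCII input: UTF-8 bytes with the high ("sign") bit set
    byte[] nonAscii = "日本語テスト".getBytes(StandardCharsets.UTF_8);
    int h2 = StringHelper.murmurhash3_x86_32(nonAscii, 0, nonAscii.length,
        StringHelper.GOOD_FAST_HASH_SEED);

    // assertEquals(expectedH1, h1);  // expected values from pre-change code or mmh3
    // assertEquals(expectedH2, h2);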
>> 
>> Cheers,
>> Thomas
>> 
>> On Tue, Apr 25, 2023 at 5:15 PM Robert Muir > <mailto:rcm...@gmail.com>> wrote:
>>> i dont think we need a ton of random strings. But if you want to
>>> optimize for strings of length 8, at a minimum there should be very
>>> simple tests ensuring correctness for some boundary conditions (e.g.
>>> string of length exactly 8). i would also strongly recommend testing
>>> non-ascii since java is a language with signed integer types so it may
>>> be susceptible to bugs where the input bytes have the "sign bit" set.
>>> 
>>> IMO this could be 2 simple unit tests.
>>> 
>>> usually at least with these kinds of algorithms you can also find
>>> published "test vectors" that intend to seek out the corner cases. if
>>> these exist for murmurhash, we should fold them in too.
>>> 
>>> On Tue, Apr 25, 2023 at 11:08 AM Thomas Dullien
>>> mailto:thomas.dull...@elastic.co>> wrote:
>>> >
>>> > Hey,
>>> >
>>> > I offered to run a large number of random-string-hashes to ensure that 
>>> > the output is the same pre- and post-change. I can add an arbitrary 
>>> > number of such tests to TestStringHelper.java, just specify the number 
>>> > you wish.
>>> >
>>> > If your worry is that my change breaches the inlining bytecode limit: Did 
>>> > you check whether the old version was inlineable or not? The new version 
>>> > is 263 bytecode instructions, the old version was 110. The default 
>>> > inlining limit appears to be 35 bytecode instructions on cursory checking 
>>> > (I may be wrong on this, though), so I don't think it was ever inlineable 
>>> > in default configs.
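One way to settle the inlining question empirically is to ask HotSpot directly (diagnostic flags; output format varies by JVM version, and MyBenchmark is a stand-in for whatever driver exercises the hash):

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyBenchmark 2>&1 | grep murmurhash3
    # relevant defaults: -XX:MaxInlineSize=35 (cold call sites), -XX:FreqInlineSize=325 (hot methods)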
>>> >
>>> > On your statement "we haven't seen performance gains" -- the starting 
>>> > point of this thread was a friendly request to please point me to 
>>> > instructions for running a broad range of Lucene indexing benchmarks, so 
>>> > I can gather data for further discussion; from my perspective, we haven't 
>>> > even gathered any data, so obviously we haven't seen any gains.
>>> >
>>> > Cheers,
>>> > Thomas
>>> >
>>> > On Tue, Apr 25, 2023 at 4:27 PM Robert Muir >> > <mailto:rcm...@gmail.com>> wrote:
>>> >>
>>> >> There is literally one string, all-ascii. This won't fail if all the
>>> >> shifts and masks are wrong.
>>> >>
>>> >> About the inlining, i'm not talking about cpu stuff, i'm talking about
>>> >> java. There are limits to the size of methods that get inlined (e.g.
>>> >> -XX:MaxInlineS

Re: [Proposal] Remove max number of dimensions for KNN vectors

2023-04-06 Thread Walter Underwood
If we find issues with larger limits, maybe have a configurable limit like we 
do for maxBooleanClauses. Maybe somebody wants to run with a 100G heap and do 
one query per second. 
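For reference, the maxBooleanClauses analogue is a one-line, application-level knob in Lucene (in 9.x it lives on IndexSearcher; older versions had it on BooleanQuery):

    // raise the boolean clause ceiling for an app that knows its workload
    org.apache.lucene.search.IndexSearcher.setMaxClauseCount(4096);  // default 1024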

Where I work (LexisNexis), we have high-value queries, but just not that many 
of them per second.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 6, 2023, at 8:57 AM, Alessandro Benedetti  wrote:
> 
> To be clear Robert, I agree with you about not bumping it just to 2048 or 
> some other insufficiently motivated constant. 
> 
> But I disagree on the performance perspective:
> I am absolutely positive about working to improve the current performance, 
> but I think this is disconnected from that limit. 
> 
> Not all users need billions of vectors, and maybe tomorrow a new chip is 
> released that speeds up the processing 100x or whatever...
> 
> The limit, as far as I know, is not used to initialise or optimise any data 
> structure; it's only used to raise an exception. 
> 
> I don't see a big problem in allowing 10k vectors, for example, even if the 
> majority of people won't be able to use such vectors because they are slow 
> on the average computer.
> If we just get 1 new user, it's better than 0.
> Or well, if it's a reputation thing, then it's a completely different 
> discussion, I guess. 
> 
> 
> On Thu, 6 Apr 2023, 16:47 Robert Muir,  <mailto:rcm...@gmail.com>> wrote:
>> Well, I'm asking ppl actually try to test using such high dimensions.
>> Based on my own experience, I consider it unusable. It seems other
>> folks may have run into trouble too. If the project committers can't
>> even really use vectors with such high dimension counts, then its not
>> in an OK state for users, and we shouldn't bump the limit.
>> 
>> I'm happy to discuss/compromise etc, but simply bumping the limit
>> without addressing the underlying usability/scalability is a real
>> no-go, it is not really solving anything, nor is it giving users any
>> freedom or allowing them to do something they couldnt do before.
>> Because if it still doesnt work it still doesnt work.
>> 
>> We all need to be on the same page, grounded in reality, not fantasy,
>> where if we set a limit of 1024 or 2048, that you can actually index
>> vectors with that many dimensions and it actually works and scales.
>> 
>> On Thu, Apr 6, 2023 at 11:38 AM Alessandro Benedetti
>> mailto:a.benede...@sease.io>> wrote:
>> >
>> > As I said earlier, a max limit limits usability.
>> > It's not forcing users with small vectors to pay the performance penalty 
>> > of big vectors; it's literally preventing some users from using 
>> > Lucene/Solr/Elasticsearch at all.
>> > As far as I know, the max limit is used to raise an exception; it's not 
>> > used to initialise or optimise data structures (please correct me if I'm 
>> > wrong).
>> >
>> > Improving the algorithm performance is a separate discussion.
>> > I don't see a correlation between a usability parameter and the fact that 
>> > indexing billions of vectors of whatever dimension is slow.
>> >
>> > What about potential users that need few high dimensional vectors?
>> >
>> > As I said before, I am a big +1 for NOT just raise it blindly, but I 
>> > believe we need to remove the limit or size it in a way it's not a problem 
>> > for both users and internal data structure optimizations, if any.
>> >
>> >
>> > On Wed, 5 Apr 2023, 18:54 Robert Muir, > > <mailto:rcm...@gmail.com>> wrote:
>> >>
>> >> I'd ask anyone voting +1 to raise this limit to at least try to index
>> >> a few million vectors with 756 or 1024, which is allowed today.
>> >>
>> >> IMO based on how painful it is, it seems the limit is already too
>> >> high, I realize that will sound controversial but please at least try
>> >> it out!
>> >>
>> >> voting +1 without at least doing this is really the
>> >> "weak/unscientifically minded" approach.
>> >>
>> >> On Wed, Apr 5, 2023 at 12:52 PM Michael Wechner
>> >> mailto:michael.wech...@wyona.com>> wrote:
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > I agree, that it should not crash.
>> >> >
>> >> > So far we did not experience crashes ourselves, but we did not index
>> >> > millions of vectors.
>> >> >
>> >> > I will try to reproduce the crash, maybe this will help us to move 

Re: Maximum score estimation

2022-12-20 Thread Walter Underwood
Comparing scores within the result set for a single query works fine. Mapping 
those to [0,1] is fine, too.

Comparing scores for different queries, or even for the same query at different 
times, isn’t valid. Showing the scores to people almost guarantees they’ll 
compare the scores between different queries.

The BM25 implementation in Lucene changes the formulas for idf, tf, and length 
normalization. It is still fundamentally a tf.idf model.

https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
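For reference, the classic Okapi BM25 scoring function being discussed (Lucene's BM25Similarity follows it; since 8.0 Lucene drops the constant (k1 + 1) numerator factor because it does not affect ranking):

    \mathrm{score}(D,Q) = \sum_{t \in Q}
        \ln\!\left(1 + \frac{N - \mathrm{df}_t + 0.5}{\mathrm{df}_t + 0.5}\right)
        \cdot \frac{\mathrm{tf}_{t,D} \cdot (k_1 + 1)}
                   {\mathrm{tf}_{t,D} + k_1 \left(1 - b + b \, \frac{|D|}{\mathrm{avgdl}}\right)}

with defaults k_1 = 1.2 and b = 0.75; the idf and length-normalization factors are exactly where it departs from classic tf.idf.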

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 19, 2022, at 10:03 PM, J. Delgado  wrote:
> 
> Actually, I believe that the Lucene scoring function is based on Okapi BM25 
> (BM is an abbreviation of best matching), which is based on the probabilistic 
> retrieval framework 
> <https://en.m.wikipedia.org/wiki/Probabilistic_relevance_model> developed in 
> the 1970s and 1980s by Stephen E. Robertson 
> <https://en.m.wikipedia.org/wiki/Stephen_E._Robertson>, Karen Spärck Jones 
> <https://en.m.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones>, and others.
> 
> There are several interpretations for IDF and slight variations on its 
> formula. In the original BM25 derivation, the IDF component is derived from 
> the Binary Independence Model 
> <https://en.m.wikipedia.org/wiki/Binary_Independence_Model>.
> 
> Info from: 
> https://en.m.wikipedia.org/wiki/Okapi_BM25 
> <https://en.m.wikipedia.org/wiki/Okapi_BM25>
> 
>> You could calculate an ideal score, but that can change every time a 
>> document is added to or deleted from the index, because of idf. So the ideal 
>> score isn’t a useful mental model. 
>> 
>> Essentially, you need to tell your users to worry about something that 
>> matters. The absolute value of the score does not matter.
>> 
> While I understand the concern, BM25 scores are quite often used post 
> retrieval (in 2-stage retrieval/ranking systems) to fuel learning-to-rank 
> models, which typically transform the score into [0,1] using a normalization 
> function that involves estimating a max score from the score distribution.
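A sketch of that kind of normalization - min-max over an observed score distribution, purely illustrative:

    // Normalize raw BM25 scores into [0,1] for a learning-to-rank stage.
    // minScore/maxScore come from the observed (or estimated) distribution.
    static float[] normalize(float[] scores, float minScore, float maxScore) {
      float[] out = new float[scores.length];
      float range = maxScore - minScore;
      for (int i = 0; i < scores.length; i++) {
        // clamp, since real scores can exceed an *estimated* max
        float s = Math.min(Math.max(scores[i], minScore), maxScore);
        out[i] = range == 0 ? 0f : (s - minScore) / range;
      }
      return out;
    }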
> 
> J
> 
> On Mon, Dec 19, 2022 at 11:31 AM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> That article is copied from the old wiki, so it is much earlier than 2019, 
> more like 2009. Unfortunately, the links to the email discussion are all 
> dead, but the issues in the article are still true.
> 
> If you really want to go down that path, you might be able to do it with a 
> similarity class that implements a probabilistic relevance model. I’d start 
> the literature search with this Google query.
> 
> probablistic information retrieval 
> <https://www.google.com/search?client=safari=en=probablistic+information+retrieval=UTF-8=UTF-8>
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Dec 18, 2022, at 2:47 AM, Mikhail Khludnev > <mailto:m...@apache.org>> wrote:
>> 
>> Thanks for replym Walter.
>> Recently Robert commented on PR with the link 
>> https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages 
>> <https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages> it 
>> gives arguments against my proposal. Honestly, I'm still in doubt.  
>> 
>> On Tue, Dec 6, 2022 at 8:15 PM Walter Underwood > <mailto:wun...@wunderwood.org>> wrote:
>> As you point out, this is a probabilistic relevance model. Lucene uses a 
>> vector space model.
>> 
>> A probabilistic model gives an estimate of how relevant each document is to 
>> the query. Unfortunately, their overall relevance isn’t as good as a vector 
>> space model.
>> 
> 
>> You could calculate an ideal score, but that can change every time a 
>> document is added to or deleted from the index, because of idf. So the ideal 
>> score isn’t a useful mental model. 
>> 
>> Essentially, you need to tell your users to worry about something that 
>> matters. The absolute value of the score does not matter.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>> 
>>> On Dec 5, 2022, at 11:02 PM, Mikhail Khludnev >> <mailto:m...@apache.org>> wrote:
>>> 
>>> Hello dev! 
>>> Users are interested in the meaning of absolute value of the score, but we 
>>> always reply that it's j

Re: Maximum score estimation

2022-12-19 Thread Walter Underwood
That article is copied from the old wiki, so it is much earlier than 2019, more 
like 2009. Unfortunately, the links to the email discussion are all dead, but 
the issues in the article are still true.

If you really want to go down that path, you might be able to do it with a 
similarity class that implements a probabilistic relevance model. I’d start the 
literature search with this Google query.

probablistic information retrieval 
<https://www.google.com/search?client=safari=en=probablistic+information+retrieval=UTF-8=UTF-8>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 18, 2022, at 2:47 AM, Mikhail Khludnev  wrote:
> 
> Thanks for replym Walter.
> Recently Robert commented on PR with the link 
> https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages 
> <https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages> it 
> gives arguments against my proposal. Honestly, I'm still in doubt.  
> 
> On Tue, Dec 6, 2022 at 8:15 PM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> As you point out, this is a probabilistic relevance model. Lucene uses a 
> vector space model.
> 
> A probabilistic model gives an estimate of how relevant each document is to 
> the query. Unfortunately, their overall relevance isn’t as good as a vector 
> space model.
> 
> You could calculate an ideal score, but that can change every time a document 
> is added to or deleted from the index, because of idf. So the ideal score 
> isn’t a useful mental model. 
> 
> Essentially, you need to tell your users to worry about something that 
> matters. The absolute value of the score does not matter.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Dec 5, 2022, at 11:02 PM, Mikhail Khludnev > <mailto:m...@apache.org>> wrote:
>> 
>> Hello dev! 
>> Users are interested in the meaning of absolute value of the score, but we 
>> always reply that it's just relative value. Maximum score of matched docs is 
>> not an answer. 
>> Ultimately we need to measure how much sense a query has in the index. e.g. 
>> [jet OR propulsion OR spider] query should be measured like nonsense, 
>> because the best matching docs have much lower scores than hypothetical (and 
>> assuming absent) doc matching [jet AND propulsion AND spider].
>> Could it be a method that returns the maximum possible score if all query 
>> terms would match. Something like stubbing postings on virtual all_matching 
>> doc with average stats like tf and field length and kicks scorers in? It 
>> reminds me something about probabilistic retrieval, but not much. Is there 
>> anything like this already?   
>> 
>> -- 
>> Sincerely yours
>> Mikhail Khludnev
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev



Re: Maximum score estimation

2022-12-06 Thread Walter Underwood
As you point out, this is a probabilistic relevance model. Lucene uses a vector 
space model.

A probabilistic model gives an estimate of how relevant each document is to the 
query. Unfortunately, their overall relevance isn’t as good as a vector space 
model.

You could calculate an ideal score, but that can change every time a document 
is added to or deleted from the index, because of idf. So the ideal score isn’t 
a useful mental model. 

Essentially, you need to tell your users to worry about something that matters. 
The absolute value of the score does not matter.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Dec 5, 2022, at 11:02 PM, Mikhail Khludnev  wrote:
> 
> Hello dev! 
> Users are interested in the meaning of absolute value of the score, but we 
> always reply that it's just relative value. Maximum score of matched docs is 
> not an answer. 
> Ultimately we need to measure how much sense a query has in the index. e.g. 
> [jet OR propulsion OR spider] query should be measured like nonsense, because 
> the best matching docs have much lower scores than hypothetical (and assuming 
> absent) doc matching [jet AND propulsion AND spider].
> Could it be a method that returns the maximum possible score if all query 
> terms would match. Something like stubbing postings on virtual all_matching 
> doc with average stats like tf and field length and kicks scorers in? It 
> reminds me something about probabilistic retrieval, but not much. Is there 
> anything like this already?   
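Something adjacent does exist for upper bounds: since Lucene 8, impacts let a Scorer report a maximum possible score over a doc range, though this bounds actual matches rather than the hypothetical all-terms-match score asked about above. A sketch (searcher, query and leafContext assumed to exist):

    import org.apache.lucene.search.*;

    Weight w = searcher.createWeight(searcher.rewrite(query),
        ScoreMode.TOP_SCORES, 1.0f);
    Scorer s = w.scorer(leafContext);
    if (s != null) {
      // upper bound on the score of any doc this scorer can match
      float bound = s.getMaxScore(DocIdSetIterator.NO_MORE_DOCS);
    }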
> 
> -- 
> Sincerely yours
> Mikhail Khludnev



Re: [HELP] Link your Apache Lucene Jira and GitHub account ids before Thursday August 4 midnight (in your local time)

2022-08-08 Thread Walter Underwood
JiraName,GitHubAccount,JiraDispName
wunder,wrunderwood,Walter Underwood

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 6, 2022, at 3:05 AM, Michael McCandless  
> wrote:
> 
> Hi Adam, I added your linked accounts here: 
> https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c
>  
> <https://github.com/apache/lucene-jira-archive/commit/c228cb184c073f4b96cd68d45a000cf390455b7c>
> 
> And Tomoko added Rushabh's linked accounts here:
> 
> https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb
>  
> <https://github.com/apache/lucene-jira-archive/commit/6f9501ec68792c1b287e93770f7a9dfd351b86fb>
> Keep the linked accounts coming!
> 
> Mike
> 
> On Thu, Aug 4, 2022 at 7:02 PM Rushabh Shah 
>  wrote:
> Hi,
> My mapping is:
> JiraName,GitHubAccount,JiraDispName
> shahrs87, shahrs87, Rushabh Shah
> 
> Thank you Tomoko and Mike for all of your hard work.
> 
> 
> 
> 
> On Sun, Jul 31, 2022 at 3:08 AM Michael McCandless  <mailto:luc...@mikemccandless.com>> wrote:
> Hello Lucene users, contributors and developers,
> 
> If you have used Lucene's Jira and you have a GitHub account as well, please 
> check whether your user id mapping is in this file: 
> https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified
>  
> <https://github.com/apache/lucene-jira-archive/blob/main/migration/mappings-data/account-map.csv.20220722.verified>
> 
> If not, please reply to this email and we will try to add you.
> 
> Please forward this email to anyone you know might be impacted and who might 
> not be tracking the Lucene lists.
> 
> 
> Full details:
> 
> The Lucene project will soon migrate from Jira to GitHub for issue tracking.
> 
> There have been discussions, votes, a migration tool created / iterated 
> (thanks to Tomoko Uchida's incredibly hard work), all iterating on Lucene's 
> dev list.
> 
> When we run the migration, we would like to map Jira users to the right 
> GitHub users to properly @-mention the right person and make it easier for 
> you to find issues you have engaged with.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com <http://blog.mikemccandless.com/>-- 
> Mike McCandless
> 
> http://blog.mikemccandless.com <http://blog.mikemccandless.com/>


Re: Finding out which fields matched the query

2022-06-27 Thread Walter Underwood
For a quick hack, you can use highlighting. That does more than you want, 
showing which words match, but it does have the info. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 27, 2022, at 3:23 AM, Shai Erera  wrote:
> 
> Thanks Uwe, I didn't know about named queries, but it seems useful. Is there 
> interest in getting similar functionality in Lucene, or perhaps just the 
> FieldMatching collector? I'd be happy to PR it.
> 
> As for usecase, I was thinking of using something similar to this collector 
> for some kind of (simple) entity recognition task. If you have a corpus of 
> documents with many fields which denote product attributes, you could match a 
> word like "Red" to the various product attribute fields and determine based 
> on the matching fields + their doc count whether this word likely represents 
> a Color or Brand entity (hint: it matches both, the question is which is more 
> probable).
> 
> I'm sure there are other ways to achieve this, and probably much smarter NER 
> implementations, but this one is at least based on the actual data that you 
> index which guarantees something about the results you will receive if 
> applying a certain attribute filtering.
> 
> Shai
> 
> On Mon, Jun 27, 2022 at 1:01 PM Uwe Schindler  <mailto:u...@thetaphi.de>> wrote:
> I think the collector approach is perfectly fine for mass-processing of 
> queries.
> 
> By the way: Elasticsearch/OpenSearch have a feature already built in and it 
> is working based on collector API in a similar way like you mentioned (as far 
> as I remember). It is a bit different as you can tag any clause in a BQ (so 
> every query) using a "name" (they call it "named query", 
> https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries
>  
> <https://www.elastic.co/guide/en/elasticsearch/reference/8.2/query-dsl-bool-query.html#named-queries>).
>  When you get the search results, for each hit it tells you which named 
> queries were a match on the hit. The actual implementation is some wrapper 
> query on each of those clauses that contains the name. In hit collection it 
> just collects all named query instances found in the query tree. I think in 
> their implementation the wrapper query's scorer impl somehow adds the name to 
> some global state.
> 
> Uwe
> 
> Am 27.06.2022 um 11:51 schrieb Shai Erera:
>> Out of curiosity and for education purposes, is the Collector approach I 
>> proposed wrong/inefficient? Or less efficient than the matches() API?
>> 
>> I'm thinking, if you want to both match/rank documents and as a side effect 
>> know which fields matched, the Collector will perform better than 
>> Weight.matches(), but I could be wrong.
>> 
>> Shai
>> 
>> On Mon, Jun 27, 2022 at 11:57 AM Dawid Weiss > <mailto:dawid.we...@gmail.com>> wrote:
>> The matches API is awesome. Use it. You can also get a rough glimpse
>> into a superset of fields potentially matching the query via:
>> 
>> Set<String> affectedFields = new HashSet<>();
>> query.visit(
>>     new QueryVisitor() {
>>       @Override
>>       public boolean acceptField(String field) {
>>         affectedFields.add(field);  // record every field the query touches
>>         return false;               // don't descend into per-term visits
>>       }
>>     });
>> 
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)
>>  
>> <https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/Query.html#visit(org.apache.lucene.search.QueryVisitor)>
>> 
>> I'd go with the Matches API though.
>> 
>> Dawid
>> 
>> On Mon, Jun 27, 2022 at 10:48 AM Alan Woodward > <mailto:romseyg...@gmail.com>> wrote:
>> >
>> > The Matches API will give you this information - it’s still likely to be 
>> > fairly slow, but it’s a lot easier to use than trying to parse Explain 
>> > output.
>> >
>> > Query q = ….;
>> > Weight w = searcher.createWeight(searcher.rewrite(q), 
>> >     ScoreMode.COMPLETE_NO_SCORES, 1.0f);
>> >
>> > Matches m = w.matches(context, doc);   // context = leaf containing doc
>> > List<String> matchingFields = new ArrayList<>();
>> > for (String field : m) {               // Matches iterates matching field names
>> >   matchingFields.add(field);
>> > }
>> >
>> > Bear in mind that `matches` doesn’t maintain any state between calls, so 
>> > calling it for every matching document is likely to be slow; for those 
>> > cases Shai’s suggestion of using a Collector and examining low-level 
>> > scorers will perform better, but it won’t work for every query type.
>> >

Re: [RESULT] [VOTE] Migration to GitHub issue from Jira

2022-06-15 Thread Walter Underwood
Totally agree. The history of closed issues answer “when did this change and 
why?”. Migrate them all. Computers can do that. It avoids asking humans to 
think about where stuff is.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 15, 2022, at 6:49 AM, Michael McCandless  
> wrote:
> 
> I would prefer that we migrate all issues to GitHub and then make the Jira 
> project read only.
> 
> There is a rich history to our project in these issues that we should not 
> discard.  This is a unique property of Apache Lucene since our project has 
> been in existence and so vibrant for so much time.  Those who do not 
> know/study history are doomed to repeat it :)
> 
> Expecting new developers to remember to "oh, long ago this project used this 
> old thing called Jira, you have to go search that, to find out why XYZ was 
> done this way in Lucene, pre-Github-issue-migration" is not a good solution 
> -- many won't remember (nor eventually, know) to do so.
> 
> If I saw it correctly, at least two other projects (or maybe two people from 
> the same project, not sure) created a one-off tool to perform the migration 
> for their projects.  It isn't perfect of course (GitHub issues may not be 
> able to represent all metadata that Jira has), but we should migrate what is 
> possible.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com <http://blog.mikemccandless.com/>
> 
> On Wed, Jun 15, 2022 at 9:34 AM Michael Sokolov  <mailto:msoko...@gmail.com>> wrote:
> Agree with everyone here. Also consider that if we duplicate there
> will be two copies of the same issue, and they will inevitably
> diverge...
> 
> On Wed, Jun 15, 2022 at 9:28 AM Jan Høydahl  <mailto:jan@cominvent.com>> wrote:
> >
> > +1 for a manual approach
> >
> > Over time the volume will gravitate to mostly GitHub issues. And JIRA will 
> > always remain as an archive, so nothing is lost. Devs can always peek into 
> > the list of open JIRAs any time and choose to start a PR for one. A slight 
> > disadvantage is of course that in the first year or two you need to look in 
> > both systems to get a full overview of all open issues. But auto migrating 
> > hundreds of abandoned JIRA issues to GitHub is no better imo.
> >
> > Jan
> >
> > 15. jun. 2022 kl. 13:03 skrev Dawid Weiss  > <mailto:dawid.we...@gmail.com>>:
> >
> >
> >> Maybe a 3rd option could be to only use GitHub for new issues by adding 
> >> some text to the Jira create issue dialog that says something like "JIRA 
> >> is deprecated, please use GitHub for new issues" to encourage users to 
> >> create new issues on GitHub instead of JIRA.
> >
> >
> > I was thinking this too, actually. It'd allow for a more graceful 
> > transition period from one system to another. It'd also help keep 
> > cross-links (from comments, etc.) in the old issues reference proper 
> > targets. And I don't see many disadvantages - I imagine that folks who wish 
> > to revisit old(er) open Jira issues and prefer GH can close the jira ticket 
> > as a duplicate and open a new corresponding GH issue. Wouldn't this be 
> > easier (for you as well)? The key change would be procedural -- allow pull 
> > requests and github issues as first-class "change" tickets.
> >
> > D.
> >
> >
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org 
> <mailto:dev-unsubscr...@lucene.apache.org>
> For additional commands, e-mail: dev-h...@lucene.apache.org 
> <mailto:dev-h...@lucene.apache.org>
> 



Re: [VOTE] Migration to GitHub issue from Jira (LUCENE-10557)

2022-05-30 Thread Walter Underwood
So 15% is a quorum for votes.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 30, 2022, at 1:14 PM, Tomoko Uchida  
> wrote:
> 
> Hi, 
> thank you for participating in this!
> 
> I may need to clarify the local rule I set.
> The "15 votes" threshold means literally 15 votes; that includes approvals 
> (+1), disapprovals (-1), and no-opinion votes (+0).
> I don't mean we need 15 approvals or 15 disapprovals to make the decision - 
> that could be too high a hurdle for either side, I think.
> 
> I mean, I need at least "15 participants" who pay attention/take time on it 
> and decide to cast their valuable votes for this proposal.
> 
> Thanks,
> Tomoko
> 
> 
> On Tue, May 31, 2022 at 4:04, Dragan Ivanovic <mailto:dragan.ivano...@uns.ac.rs> wrote:
> Hello everyone,
> 
> Not sure whether this email might help you, but let me share the VIVO 
> community experience with this issue. We have migrated JIRA issues available 
> at https://vivo-project.atlassian.net/jira/software/c/projects/VIVO/issues/ 
> <https://vivo-project.atlassian.net/jira/software/c/projects/VIVO/issues/> to 
> GitHub issues available at https://github.com/vivo-project/VIVO/issues 
> <https://github.com/vivo-project/VIVO/issues>. We used a customized version 
> of this project - https://github.com/rstoyanchev/jira-to-gh-issues 
> <https://github.com/rstoyanchev/jira-to-gh-issues>  (our customization is 
> available at https://github.com/chenejac/jira-to-gh-issues 
> <https://github.com/chenejac/jira-to-gh-issues>). Basically, it is possible 
> to migrate issues, not perfect, but majority of information is there, and we 
> are happy with our decision to move to GitHub issues. 
> 
> Good luck with migration.
> 
> Regards,
> 
> Dragan Ivanovic
> 
> the VIVO tech lead
> 
> On 5/30/2022 8:53 PM, Patrick Zhai wrote:
>> Thank you Tomoko for starting the vote. Although I didn't participate in the 
>> earlier discussion, I'd love to see us moving towards GitHub issues.
>> 
>> So here's my +1 (committer, non-PMC)
>> 
>> BTW, by "the vote will be effective if it successfully gains more than 15% 
>> of voters (>= 15) from committers", do you mean to make it successful we 
>> need to collect 15 "+1" from committers or just 15 votes (regardless of the 
>> opinion)? 
>> 
>> Best
>> Patrick
>> 
>> On Mon, May 30, 2022 at 8:41 AM Tomoko Uchida > <mailto:tomoko.uchida.1...@gmail.com>> wrote:
>> Hi everyone!
>> 
>> As we had previous discussion thread [1], I propose migration to GitHub 
>> issue from Jira.
>> It'd be technically possible (see [2] for details) and I think it'd be good 
>> for the project - not only for welcoming new developers who are not familiar 
>> with Jira, but also for improving the experiences of long-term 
>> committers/contributors by consolidating the conversation platform.
>> 
>> You can see a short summary of the discussion, some stats on current Jira 
>> issues, and a draft migration plan in [2].
>> Please review [2] if you haven't seen it and vote for this proposal.
>> 
>> The vote will be open until 2022-06-06 16:00 UTC.
>> 
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>> 
>> Here is my +1
>> 
>> *IMPORTANT NOTE*
>> I set a local protocol for this vote.
>> There are 95 committers on this project [3] - the vote will be effective if 
>> it successfully gains more than 15% of voters (>= 15) from committers 
>> (including PMC members). This means that although only PMC member votes are 
>> counted for the final result, the votes from all committers are important to 
>> make the vote result effective.
>> 
>> If there are less than 15 votes at 2022-06-06 16:00 UTC, I will expand the 
>> term to 2022-06-13 16:00 UTC. If this fails to get sufficient voters after 
>> the expanded time limit, I'll cancel this vote regardless of the result.
>> But why do I set such an extra bar? My fear is that if such things are 
>> decided by the opinions of only a few members, the result may not yield a good 
>> outcome for the future. It isn't my goal to just pass the vote [4].
>> 
>> [1] https://lists.apache.org/thread/78wj0vll73sct065m5jjm4z8gqb5yffk 
>> <https://lists.apache.org/thread/78wj0vll73sct065m5jjm4z8gqb5yffk>
>> [2] https://issues.apache.org/jira/browse/LUCENE-10557 
>> <https://issues.apache.org/jira/browse/LUCENE-10557>
>> [3] https://projects.apache.org/committee.html?lucene 
>> <https://projects.apache.org/committee.html?lucene>
>> [4] I'm sorry for being overly cautious, but I have never met in person or 
>> virtually any of the committers (with a very few exceptions), therefore 
>> cannot assess if the vote result is reliable or not unless there is certain 
>> explicit feedback.
>> 
>> Tomoko
> 
> 


Re: XML retrieval with Intervals

2022-05-06 Thread Walter Underwood
If you need to search XML, consider MarkLogic. It is a very full-featured 
database and search engine based on XML.

https://www.marklogic.com

Disclaimer: I worked there for a couple of years ten years ago. But I’ve been 
inside that product and it is non-muggle technology.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 6, 2022, at 5:35 AM, Michael Sokolov  wrote:
> 
> Many years ago I had started this Lux project that was designed to
> build an XML-aware index using Solr; see
> https://github.com/msokolov/lux/tree/master/src/main/java/lux/index/analysis
> for the analysis chain I used. Maybe you'll find something useful in
> this project? It's dormant for years, and pre-dates interval queries,
> but the code is still the code, and XML has not really changed...
> 
> On Fri, May 6, 2022 at 5:23 AM Mikhail Khludnev  wrote:
>> 
>> Hi Devs!
>> 
>> I found intervals quite nice and natural for retrieving scoped data (thanks, 
>> Alan!):
>> <t>foo stuff bar</t>
>> I.containing(I.ordered(I.term("<t>"), I.term("</t>")),
>>   I.unordered(I.term("bar"), I.term("foo")));
>> It works like a charm until it encounters ill-nested tags:
>> <t>foo <t>bug</t> bar</t>
>> Due to intrinsic minimization it picks the internal tag pair. I feel like 
>> plain intervals backed by positions lack tag-scoping information.
>> Do you know any approaches for retrieving XML in Lucene?
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 



Re: FST codec for *infix* queries. No luck so far.

2022-04-26 Thread Walter Underwood
I built the original Netflix autocomplete. That used edge Ngrams running on 
Solr 1.3.

It isn’t a really big index. There just aren’t that many movies and TV shows. I 
think we had 70k titles and 150k people (actors, directors, …).

We handled one corner case in the client code. Movies with a one-character 
title must show up for that character or they are unmatchable. You can’t type 
more characters to match A, M, X, or Z (all movies). That special case still 
works on dvd.netflix.com, but not on the streaming site. 
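A sketch of an edge n-gram chain of that sort, in modern Lucene terms (Solr 1.3 configured this via schema.xml, and the exact classes differed):

    import org.apache.lucene.analysis.*;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;

    // Index-time: "netflix" -> n, ne, net, netf, ... so prefix lookups
    // become plain term matches. minGram=1 keeps one-letter titles findable.
    Analyzer indexAnalyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new EdgeNGramTokenFilter(result, 1, 20, false);
        return new TokenStreamComponents(source, result);
      }
    };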

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 26, 2022, at 12:45 PM, Michael Sokolov  wrote:
> 
> I'm not sure under which scenario ngrams (edgengrams) would not be an
> option? Another thing to try might be BPE (byte pair encoding). In this
> encoding, you train a set of tokens from a vocabulary based on frequency
> of occurrence, and agglomerate them iteratively until you have the
> vocabulary at a size you like. You tend to end up with commonly-occurring
> subwords (morphemes) that can possibly be good indexing choices for this
> sort of thing?
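A toy sketch of the BPE merge loop described above (illustrative only; real tokenizers train on large corpora and handle word boundaries more carefully):

    import java.util.*;

    // Repeatedly fuse the most frequent adjacent symbol pair into a new token.
    static List<String> trainBpe(List<String> words, int numMerges) {
      List<List<String>> corpus = new ArrayList<>();
      for (String w : words) {               // start from single characters
        List<String> syms = new ArrayList<>();
        for (char c : w.toCharArray()) syms.add(String.valueOf(c));
        corpus.add(syms);
      }
      List<String> merges = new ArrayList<>();
      for (int m = 0; m < numMerges; m++) {
        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> syms : corpus)
          for (int i = 0; i + 1 < syms.size(); i++)
            pairCounts.merge(syms.get(i) + " " + syms.get(i + 1), 1, Integer::sum);
        if (pairCounts.isEmpty()) break;
        String[] best = Collections.max(pairCounts.entrySet(),
            Map.Entry.comparingByValue()).getKey().split(" ");
        merges.add(best[0] + best[1]);       // learned subword unit
        for (List<String> syms : corpus)     // apply the merge everywhere
          for (int i = 0; i + 1 < syms.size(); i++)
            if (syms.get(i).equals(best[0]) && syms.get(i + 1).equals(best[1])) {
              syms.set(i, best[0] + best[1]);
              syms.remove(i + 1);
            }
      }
      return merges;  // frequent subwords, candidate indexing units
    }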
> 
> On Tue, Apr 26, 2022 at 9:07 AM Michael McCandless
>  wrote:
>> 
>> One small datapoint: Amazon's customer facing product search now includes 
>> some infix suggestions (using Lucene's AnalyzingInfixSuggester), but only in 
>> fallback cases when the prefix suggesters didn't find compelling options.
>> 
>> And I think Netflix's suggester used to be primarily infix, but now when I 
>> tested it, I get no suggestions at all, only live search results, which I 
>> like less :)
>> 
>> Mike McCandless
>> 
>> http://blog.mikemccandless.com
>> 
>> 
>> On Tue, Apr 26, 2022 at 8:13 AM Dawid Weiss  wrote:
>>> 
>>> Hi Mikhail,
>>> 
>>> I don't have any spectacular suggestions but something stemming from 
>>> experience.
>>> 
>>> 1) While the problem is intellectually interesting, I rarely found
>>> anybody who'd be comfortable with using infix suggestions - people are
>>> very used to "completions" happening on a prefix of one or multiple
>>> words (see my note below, though).
>>> 
>>> 2) Wouldn't it be better/ more efficient to maintain an fst/ index of
>>> word suffix(es) -> complete word instead of offsets within the block?
>>> This can be combined with term frequency to limit the number of
>>> suggested words to just certain categories (or most frequent terms)
>>> which would make the fst smaller still.
>>> 
>>> 3) I'd never try to store infixes shorter than 2-3 characters (you
>>> said you did this - "I even limited suffixes length to reduce their
>>> number"). This requires folks to type in longer input, but it prevents fst
>>> bloat and in general leads to higher-quality suggestions (with very short
>>> infixes there'd be far too many of them).
>>> 
>>>> Otherwise, with many smaller segments, fully scanning the term dictionaries 
>>>> is comparable to seeking the suffix FST and scanning certain blocks.
>>> 
>>> Yeah, I'd expect the automaton here to be huge. The complexity of the
>>> vocabulary and number of characters in the language will also play a
>>> key role.
>>> 
>>> 4) IntelliJ idea has this kind of "search everywhere" functionality
>>> which greps for infixes (it is really nice). I recall looking at the
>>> (open source engine) to see how it was done and my conclusion from
>>> glancing over the code was that it's a fixed, coarse, n-gram based
>>> index of consecutive letters pointing at potential matches, which are
>>> then revalidated against the query. So you have a super-simple index,
>>> with a very fast lookup and the cost of verifying and finding exact
>>> matches is shifted to once you have a candidate list. While this
>>> doesn't help with Lucene indexes, perhaps it's a sign that for this
>>> particular task a different index/search paradigm is needed?
>>> 
>>> 
>>> Dawid
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 



Re: Soften Jira's note when opening new issues?

2021-09-24 Thread Walter Underwood
I have a search story from Netflix about how much the first words matter.

We had two movies named “The Other Boleyn Girl”. One was an older BBC 
miniseries and the other was a recent Hollywood film. The BBC result was #1, 
which seemed wrong, but it was in more rental queues, so people really were 
adding it. The images were just women in big taffeta dresses, and the 
description for the Hollywood film started with something like “From acclaimed 
director …”, so it was hard to tell which was which.

I asked the movie metadata team to reword the description of the Hollywood film 
so that the first two words were “Nicole Kidman”. They pushed the change, and 
one hour later, the Hollywood film was the first result.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 24, 2021, at 11:59 AM, David Smiley  wrote:
> 
> Agreed Walter.  At least it's better than before.
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> <http://www.linkedin.com/in/davidwsmiley>
> 
> On Fri, Sep 24, 2021 at 11:09 AM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> It seems odd to start with a statement that there is a mailing list without 
> any idea why the person cares. That is why my suggestions started with the 
> person’s need, not with the bare fact of the mailing list. 
> 
> People are likely to skip over that whole paragraph after they scan “This 
> project has a user mailing list…”. The first few words are by far the most 
> important. Again, I strongly suggest starting with "If you want help or have 
> a feature idea…”
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Sep 24, 2021, at 12:57 AM, Adrien Grand > <mailto:jpou...@gmail.com>> wrote:
>> 
>> Infra helped me change the message 
>> <https://issues.apache.org/jira/browse/INFRA-22353> yesterday, thanks for 
>> the discussion on this thread.
>> 
>> +1 on your PR to the project's README.
>> 
>> The problem I saw with Jira recently - and I acknowledge that there might be 
>> a bias - is that users had read our HowToContribute guide already, which 
>> suggests opening a Jira. But then Jira told these contributors to go to the 
>> mailing-list first before we updated the message. I like the idea of linking 
>> HowToContribute from the perspective that it would be welcoming and 
>> encourage contributions, but it would increase the amount of text that you 
>> have to read when using Jira, yet the anecdotal evidence I have is that these 
>> contributors were already familiar with the HowToContribute since it is the 
>> thing that led them to Jira in the first place. No strong feelings, I could 
>> be convinced otherwise but wanted to give this perspective.
>> 
>> On Thu, Sep 23, 2021 at 7:41 PM Greg Miller > <mailto:gsmil...@gmail.com>> wrote:
>> Hi Adrien- that's totally fair. There are probably better places for
>> the additional content I'm proposing. A couple things along these
>> lines:
>> 
>> 1. Do you think it would be worth linking this guide from the JIRA
>> message (maybe after updating it)?
>> https://cwiki.apache.org/confluence/display/lucene/HowToContribute 
>> <https://cwiki.apache.org/confluence/display/lucene/HowToContribute>. It
>> could be a nice hook for new users to learn more (and it's what we
>> link from our README). Maybe it would make the message too long
>> though?
>> 2. I just put up a very brief PR to add my proposed "friendly message"
>> to the README before linking off to the above-mentioned guide:
>> https://github.com/apache/lucene/pull/318 
>> <https://github.com/apache/lucene/pull/318>.
>> 
>> Back to your original proposal though, I'll add my +1 as I think it's
>> a big improvement from the current messaging. Thanks for bringing this
>> up!
>> 
>> Cheers,
>> -Greg
>> 
>> On Wed, Sep 22, 2021 at 9:23 AM Walter Underwood > <mailto:wun...@wunderwood.org>> wrote:
>> >
>> > Hmm. How is this? It is a single longer sentence, but essentially a string 
>> > of simple ones.
>> >
>> > If you want help or have a feature idea, please ask on the mailing list or 
>> > IRC channel before submitting a Jira issue.
>> >
>> > wunder
>> > Walter Underwood
>> > wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> > http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my 
>> blog)

Re: Soften Jira's note when opening new issues?

2021-09-24 Thread Walter Underwood
It seems odd to start with a statement that there is a mailing list without 
any idea why the person cares. That is why my suggestions started with the 
person’s need, not with the bare fact of the mailing list. 

People are likely to skip over that whole paragraph after they scan “This 
project has a user mailing list…”. The first few words are by far the most 
important. Again, I strongly suggest starting with "If you want help or have a 
feature idea…”

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 24, 2021, at 12:57 AM, Adrien Grand  wrote:
> 
> Infra helped me change the message 
> <https://issues.apache.org/jira/browse/INFRA-22353> yesterday, thanks for the 
> discussion on this thread.
> 
> +1 on your PR to the project's README.
> 
> The problem I saw with Jira recently - and I acknowledge that there might be 
> a bias - is that users had read our HowToContribute guide already, which 
> suggests opening a Jira. But then Jira told these contributors to go to the 
> mailing-list first before we updated the message. I like the idea of linking 
> HowToContribute from the perspective that it would be welcoming and encourage 
> contributions, but it would increase the amount of text that you have to read 
> when using Jira, yet the anecdotal evidence I have is that these contributors 
> were already familiar with the HowToContribute since it is the thing that led 
> them to Jira in the first place. No strong feelings, I could be convinced 
> otherwise but wanted to give this perspective.
> 
> On Thu, Sep 23, 2021 at 7:41 PM Greg Miller  <mailto:gsmil...@gmail.com>> wrote:
> Hi Adrien- that's totally fair. There are probably better places for
> the additional content I'm proposing. A couple things along these
> lines:
> 
> 1. Do you think it would be worth linking this guide from the JIRA
> message (maybe after updating it)?
> https://cwiki.apache.org/confluence/display/lucene/HowToContribute 
> <https://cwiki.apache.org/confluence/display/lucene/HowToContribute>. It
> could be a nice hook for new users to learn more (and it's what we
> link from our README). Maybe it would make the message too long
> though?
> 2. I just put up a very brief PR to add my proposed "friendly message"
> to the README before linking off to the above-mentioned guide:
> https://github.com/apache/lucene/pull/318 
> <https://github.com/apache/lucene/pull/318>.
> 
> Back to your original proposal though, I'll add my +1 as I think it's
> a big improvement from the current messaging. Thanks for bringing this
> up!
> 
> Cheers,
> -Greg
> 
> On Wed, Sep 22, 2021 at 9:23 AM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> >
> > Hmm. How is this? It is a single longer sentence, but essentially a string 
> > of simple ones.
> >
> > If you want help or have a feature idea, please ask on the mailing list or 
> > IRC channel before submitting a Jira issue.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> > http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> >
> > On Sep 22, 2021, at 9:18 AM, Adrien Grand  > <mailto:jpou...@gmail.com>> wrote:
> >
> > Greg, I understand and agree with the intent, but I also would like to keep 
> > this as short as possible since the screen to create a new issue in JIRA is 
> > already quite intimidating with all its text boxes, and the current version 
> > is already taking two lines even though it's short. Maybe this is the sort 
> > of thing that we could try to better emphasize in our project's README?
> >
> > On Wed, Sep 22, 2021 at 6:07 PM Walter Underwood  > <mailto:wun...@wunderwood.org>> wrote:
> >>
> >> Two excellent points. So it could be:
> >>
> >> Are you looking for support for Lucene? Have you seen unexpected behavior? 
> >> Have an idea for a new feature or improvement? Please ask for help on the 
> >> Lucene user mailing list or the IRC channel. If it is a new problem or 
> >> idea, then you can submit a Jira issue.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> >> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my 
> >> blog)
> >>
> >> On Sep 22, 2021, at 6:38 AM, Greg Miller  >> <mailto:gsmil...@gmail.com>> wrote:
> >>
> >> Love this idea!
> >>
> >> I wonder if there's a way to make the messaging clear that ideas for
>> new features/improvements are also always welcome?

Re: Soften Jira's note when opening new issues?

2021-09-22 Thread Walter Underwood
Hmm. How is this? It is a single longer sentence, but essentially a string of 
simple ones.

If you want help or have a feature idea, please ask on the mailing list or IRC 
channel before submitting a Jira issue.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 22, 2021, at 9:18 AM, Adrien Grand  wrote:
> 
> Greg, I understand and agree with the intent, but I also would like to keep 
> this as short as possible since the screen to create a new issue in JIRA is 
> already quite intimidating with all its text boxes, and the current version 
> is already taking two lines even though it's short. Maybe this is the sort of 
> thing that we could try to better emphasize in our project's README?
> 
> On Wed, Sep 22, 2021 at 6:07 PM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> Two excellent points. So it could be:
> 
> Are you looking for support for Lucene? Have you seen unexpected behavior? 
> Have an idea for a new feature or improvement? Please ask for help on the 
> Lucene user mailing list or the IRC channel. If it is a new problem or idea, 
> then you can submit a Jira issue.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Sep 22, 2021, at 6:38 AM, Greg Miller > <mailto:gsmil...@gmail.com>> wrote:
>> 
>> Love this idea!
>> 
>> I wonder if there's a way to make the messaging clear that ideas for
>> new features/improvements are also always welcome? When I read the
>> current language, I interpret it as bug reporting. Maybe adding a
>> leading sentence would help?
>> 
>> ```
>> Bug reports, improvements and new feature ideas are always welcome!
>> Please note, this project has a user mailing list and an IRC channel
>> for support. If you are looking for support, or if you are not sure
>> whether the behavior that you are observing is expected or not, please
>> discuss it there first.
>> ```
>> 
>> Cheers,
>> -Greg
>> 
>> On Wed, Sep 22, 2021 at 5:35 AM Adrien Grand > <mailto:jpou...@gmail.com>> wrote:
>>> 
>>> Hi Walter,
>>> 
>>> Though it doesn't invalidate your comment, I was considering changing the 
>>> message only for the Lucene JIRA, at least for now.
>>> 
>>> On Tue, Sep 21, 2021 at 5:08 PM Walter Underwood >> <mailto:wun...@wunderwood.org>> wrote:
>>>> 
>>>> Here is one with shorter, less complex sentences and clear calls to action.
>>>> 
>>>> Are you looking for support for Solr? Have you seen unexpected behavior? 
>>>> Please ask for help on the Solr user mailing list or the IRC channel. If 
>>>> it is a new problem, then you can submit a Jira issue.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my 
>>>> blog)
>>>> 
>>>> On Sep 21, 2021, at 12:23 AM, Adrien Grand >>> <mailto:jpou...@gmail.com>> wrote:
>>>> 
>>>> I think you made a good point, Alexandre. Would something like this read 
>>>> better:
>>>> 
>>>> ```
>>>> This project has a user mailing list and an IRC channel for support. If 
>>>> you are looking for support, or if you are not sure whether the behavior 
>>>> that you are observing is expected or not, please discuss it there first.
>>>> ```
>>>> 
>>>> On Mon, Sep 20, 2021 at 2:22 PM Alexandre Rafalovitch >>> <mailto:arafa...@gmail.com>> wrote:
>>>>> 
>>>>> +1.
>>>>> Ideally, the final version could still be several shorter sentences. To 
>>>>> avoid needing to be a programmer to parse the deeply nested, if totally 
>>>>> logical, structure.
>>>>> 
>>>>> On Mon., Sep. 20, 2021, 4:33 a.m. Adrien Grand, >>>> <mailto:jpou...@gmail.com>> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> Jira gives the following note when opening an issue:
>>>>>> 
>>>>>> ```
>>>>>> This project has a user mailing list and an IRC channel for support. 
>>>>>> Please ensure that you have discussed your problem using one of those 
>>>>>> resources BEFORE creating this ticket.

Re: Soften Jira's note when opening new issues?

2021-09-22 Thread Walter Underwood
Two excellent points. So it could be:

Are you looking for support for Lucene? Have you seen unexpected behavior? Have 
an idea for a new feature or improvement? Please ask for help on the Lucene 
user mailing list or the IRC channel. If it is a new problem or idea, then you 
can submit a Jira issue.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 22, 2021, at 6:38 AM, Greg Miller  wrote:
> 
> Love this idea!
> 
> I wonder if there's a way to make the messaging clear that ideas for
> new features/improvements are also always welcome? When I read the
> current language, I interpret it as bug reporting. Maybe adding a
> leading sentence would help?
> 
> ```
> Bug reports, improvements and new feature ideas are always welcome!
> Please note, this project has a user mailing list and an IRC channel
> for support. If you are looking for support, or if you are not sure
> whether the behavior that you are observing is expected or not, please
> discuss it there first.
> ```
> 
> Cheers,
> -Greg
> 
> On Wed, Sep 22, 2021 at 5:35 AM Adrien Grand  wrote:
>> 
>> Hi Walter,
>> 
>> Though it doesn't invalidate your comment, I was considering changing the 
>> message only for the Lucene JIRA, at least for now.
>> 
>> On Tue, Sep 21, 2021 at 5:08 PM Walter Underwood  
>> wrote:
>>> 
>>> Here is one with shorter, less complex sentences and clear calls to action.
>>> 
>>> Are you looking for support for Solr? Have you seen unexpected behavior? 
>>> Please ask for help on the Solr user mailing list or the IRC channel. If it 
>>> is a new problem, then you can submit a Jira issue.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> On Sep 21, 2021, at 12:23 AM, Adrien Grand  wrote:
>>> 
>>> I think you made a good point, Alexandre. Would something like this read 
>>> better:
>>> 
>>> ```
>>> This project has a user mailing list and an IRC channel for support. If you 
>>> are looking for support, or if you are not sure whether the behavior that 
>>> you are observing is expected or not, please discuss it there first.
>>> ```
>>> 
>>> On Mon, Sep 20, 2021 at 2:22 PM Alexandre Rafalovitch  
>>> wrote:
>>>> 
>>>> +1.
>>>> Ideally, the final version could still be several shorter sentences. To 
>>>> avoid needing to be a programmer to parse the deeply nested, if totally 
>>>> logical, structure.
>>>> 
>>>> On Mon., Sep. 20, 2021, 4:33 a.m. Adrien Grand,  wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> Jira gives the following note when opening an issue:
>>>>> 
>>>>> ```
>>>>> This project has a user mailing list and an IRC channel for support. 
>>>>> Please ensure that you have discussed your problem using one of those 
>>>>> resources BEFORE creating this ticket.
>>>>> ```
>>>>> 
>>>>> This can be quite intimidating for someone who has never worked with us 
>>>>> before, and we don't apply this logic for ourselves, for instance I feel 
>>>>> free to open JIRAs without discussing them first on IRC or dev@l.a.o. 
>>>>> Given that we are not seeing much irrelevant traffic on JIRA, I'd like to 
>>>>> soften the message to something like below:
>>>>> 
>>>>> ```
>>>>> If you are looking for support, or if you are not sure whether the 
>>>>> behavior that you are observing is expected or not, please discuss your 
>>>>> problem on the user mailing-list instead before creating a ticket.
>>>>> ```
>>>>> 
>>>>> What do you think?
>>>>> 
>>>>> --
>>>>> Adrien
>>> 
>>> 
>>> 
>>> --
>>> Adrien
>>> 
>>> 
>> 
>> 
>> --
>> Adrien
> 



Re: Soften Jira's note when opening new issues?

2021-09-21 Thread Walter Underwood
Here is one with shorter, less complex sentences and clear calls to action.

Are you looking for support for Solr? Have you seen unexpected behavior? Please 
ask for help on the Solr user mailing list or the IRC channel. If it is a new 
problem, then you can submit a Jira issue.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 21, 2021, at 12:23 AM, Adrien Grand  wrote:
> 
> I think you made a good point, Alexandre. Would something like this read 
> better:
> 
> ```
> This project has a user mailing list and an IRC channel for support. If you 
> are looking for support, or if you are not sure whether the behavior that you 
> are observing is expected or not, please discuss it there first.
> ```
> 
> On Mon, Sep 20, 2021 at 2:22 PM Alexandre Rafalovitch  <mailto:arafa...@gmail.com>> wrote:
> +1. 
> Ideally, the final version could still be several shorter sentences. To avoid 
> needing to be a programmer to parse the deeply nested, if totally logical, 
> structure. 
> 
> On Mon., Sep. 20, 2021, 4:33 a.m. Adrien Grand,  <mailto:jpou...@gmail.com>> wrote:
> Hello,
> 
> Jira gives the following note when opening an issue:
> 
> ```
> This project has a user mailing list and an IRC channel for support. Please 
> ensure that you have discussed your problem using one of those resources 
> BEFORE creating this ticket.
> ```
> 
> This can be quite intimidating for someone who has never worked with us 
> before, and we don't apply this logic for ourselves, for instance I feel free 
> to open JIRAs without discussing them first on IRC or dev@l.a.o. Given that 
> we are not seeing much irrelevant traffic on JIRA, I'd like to soften the 
> message to something like below:
> 
> ```
> If you are looking for support, or if you are not sure whether the behavior 
> that you are observing is expected or not, please discuss your problem on the 
> user mailing-list instead before creating a ticket.
> ```
> 
> What do you think?
> 
> -- 
> Adrien
> 
> 
> -- 
> Adrien



Re: Release Lucene/Solr 8.9.0 should we have it soon

2021-06-01 Thread Walter Underwood
This is really frustrating. We have a feature that never should have been 
committed. The behavior and documentation don’t match and the inputs are 
limited to values that make it unusable. The documentation contains a 
nonfunctional link.

I contribute a patch that implements both the original behavior and the 
documented behavior, with unit tests and detailed documentation.

What else am I supposed to do? This seems like we’ve lost the spirit of Yonik’s 
Law of Patches and perfection has become the enemy of progress.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 1, 2021, at 2:01 PM, David Smiley  wrote:
> 
> +1 to Jan's comment; no need to hold up the release.
> 
> I also think we should be open to more releases in the future for 8.x.
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> <http://www.linkedin.com/in/davidwsmiley>
> 
> On Tue, Jun 1, 2021 at 4:55 PM Jan Høydahl  <mailto:jan@cominvent.com>> wrote:
> Let's not hold up the release due to this incomplete PR. It obviously needs 
> more time for completion and there is always a new train to catch.
> As far as I understand, Circuit breakers are pluggable, so anyone can 
> configure their own implementation in the meantime?
> 
> Jan
> 
>> 1. jun. 2021 kl. 22:13 skrev Atri Sharma > <mailto:a...@apache.org>>:
>> 
>> I appreciate you fixing this and adding the new circuit breaker and look 
>> forward to having it in the hands of our users soon.
>> 
>> However, the current state of the PR, with significant API churn for a single 
>> change and overlapping code, is not yet ready.
>> 
>> If this is too much of a rework, I am happy to take the existing PR and do 
>> the changes, post which I believe the PR should be close to completion. 
>> 
>> Let me know if you need me to help, but unfortunately, the two objections I 
>> raised are blockers, at least until we establish that they cannot be done 
>> away with. 
>> 
>> 
>> On Wed, 2 Jun 2021, 01:37 Walter Underwood, > <mailto:wun...@wunderwood.org>> wrote:
>> I would appreciate a second opinion on the pull request. Substantive issues 
>> have been resolved. At this point, the discussion is about code style and 
>> coding standards. I don’t have detailed knowledge about the Solr coding 
>> style, so I’d appreciate another set of eyes.
>> 
>> The current behavior is buggy, and we are not able to use it at Chegg. The 
>> patch fixes those bugs.
>> 
>> https://github.com/apache/solr/pull/96 
>> <https://github.com/apache/solr/pull/96>
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>> 
>>> On Jun 1, 2021, at 12:27 PM, Walter Underwood >> <mailto:wun...@wunderwood.org>> wrote:
>>> 
>>> I answered the comments. I don’t see those answers on github, oddly.
>>> 
>>> I’ll re-answer them. Most of your questions are already answered in the 
>>> discussion on Jira.
>>> 
>>> A central issue is that load average is not always a CPU measure. In some 
>>> systems, it includes threads in iowait. So it is potentially misleading to 
>>> label it as CPU and document it as CPU. The updated documentation makes 
>>> that clear, so that should have already answered your comment. That is why 
>>> it is important to rename the existing circuit breaker.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>> 
>>>> On Jun 1, 2021, at 12:20 PM, Atri Sharma >>> <mailto:a...@apache.org>> wrote:
>>>> 
>>>> I took a look at the PR and gave comments for SOLR-15056, and the last I 
>>>> checked, my comments were not addressed?
>>>> 
>>>> On Wed, 2 Jun 2021, 00:31 Walter Underwood, >>> <mailto:wun...@wunderwood.org>> wrote:
>>>> Could someone else please take a look at SOLR-15056? This is a small blast 
>>>> radius change that improves the circuit breakers. It includes unit tests 
>>>> and documentation and has been ready since January.
>>>> 
>>>> https://github.com/apache/solr/pull/96/files 
>>>> <https://github.com/apache/solr/pull/96/files>
>>>> https://issues.apache.org/jira/browse/SOLR-15056 
>>>>

Re: Release Lucene/Solr 8.9.0 should we have it soon

2021-06-01 Thread Walter Underwood
I would appreciate a second opinion on the pull request. Substantive issues 
have been resolved. At this point, the discussion is about code style and 
coding standards. I don’t have detailed knowledge about the Solr coding style, 
so I’d appreciate another set of eyes.

The current behavior is buggy, and we are not able to use it at Chegg. The 
patch fixes those bugs.

https://github.com/apache/solr/pull/96

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 1, 2021, at 12:27 PM, Walter Underwood  wrote:
> 
> I answered the comments. I don’t see those answers on github, oddly.
> 
> I’ll re-answer them. Most of your questions are already answered in the 
> discussion on Jira.
> 
> A central issue is that load average is not always a CPU measure. In some 
> systems, it includes threads in iowait. So it is potentially misleading to 
> label it as CPU and document it as CPU. The updated documentation makes that 
> clear, so that should have already answered your comment. That is why it is 
> important to rename the existing circuit breaker.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/  (my blog)
> 
>> On Jun 1, 2021, at 12:20 PM, Atri Sharma > <mailto:a...@apache.org>> wrote:
>> 
>> I took a look at the PR and gave comments for SOLR-15056, and the last I 
>> checked, my comments were not addressed?
>> 
>> On Wed, 2 Jun 2021, 00:31 Walter Underwood, > <mailto:wun...@wunderwood.org>> wrote:
>> Could someone else please take a look at SOLR-15056? This is a small blast 
>> radius change that improves the circuit breakers. It includes unit tests and 
>> documentation and has been ready since January.
>> 
>> https://github.com/apache/solr/pull/96/files 
>> <https://github.com/apache/solr/pull/96/files>
>> https://issues.apache.org/jira/browse/SOLR-15056 
>> <https://issues.apache.org/jira/browse/SOLR-15056>
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>> 
>>> On Jun 1, 2021, at 11:53 AM, Mayya Sharipova 
>>> >> <mailto:mayya.sharip...@elastic.co.INVALID>> wrote:
>>> 
>>> Thank you for the update, Houston.
>>> 
>>> I've started the release process, the branch 8.9 is now cut.
>>> 
>>> On Tue, Jun 1, 2021 at 11:21 AM Houston Putman >> <mailto:hous...@apache.org>> wrote:
>>> Mayya, SOLR-14978 is now in 8.x. So no longer a blocker.
>>> 
>>> - Houston
>>> 
>>> On Thu, May 27, 2021 at 11:42 PM David Smiley >> <mailto:dsmi...@apache.org>> wrote:
>>> SOLR-15412 is rather serious as the title suggests.  I haven't been 
>>> tracking the progress so if it's already resolved, that's unknown to me and 
>>> isn't reflected in JIRA.
>>> 
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley 
>>> <http://www.linkedin.com/in/davidwsmiley>
>>> 
>>> On Thu, May 27, 2021 at 5:24 PM Mayya Sharipova 
>>> >> <mailto:mayya.sharip...@elastic.co.invalid>> wrote:
>>> Hello everyone,
>>> I wonder if everyone is ok for May 31st (Monday) as the date for the 
>>> feature freeze date and branch cut?
>>> I've noticed that `releaseWizard.py` is also asking for the length of 
>>> feature freeze. What is the custom length to put there?
>>> 
>>> Looks like Lucene 
>>> <https://issues.apache.org/jira/projects/LUCENE/versions/12349562> doesn't 
>>> have any unresolved issues for 8.9.
>>> SOLR <https://issues.apache.org/jira/projects/SOLR/versions/12349563> has:
>>> -  SOLR-15412  Strict validation on Replica metadata can cause complete 
>>> outage  (Looks like it may be resolved already?)
>>> - SOLR-15410 GC log is directed to console when starting Solr with Java 11 
>>> Open J9 on Windows
>>> - SOLR-15056  CPU circuit breaker needs to use CPU utilization, not Unix 
>>> load average
>>> 
>>> Are we ok to postpone these issues to later releases if they are not 
>>> resolved and merged before feature freeze?
>>> 
>>> Thank you.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Tue, May 25, 2021 at 12:41 PM Colvin Cowie >> <mailto:colvin.cowie@gmail.com>> wrote:
>>> Hello,
>>> 

Re: Release Lucene/Solr 8.9.0 should we have it soon

2021-06-01 Thread Walter Underwood
I answered the comments. I don’t see those answers on github, oddly.

I’ll re-answer them. Most of your questions are already answered in the 
discussion on Jira.

A central issue is that load average is not always a CPU measure. In some 
systems, it includes threads in iowait. So it is potentially misleading to 
label it as CPU and document it as CPU. The updated documentation makes that 
clear, so that should have already answered your comment. That is why it is 
important to rename the existing circuit breaker.
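
For a rough illustration of the difference, a minimal sketch using the standard JMX beans (this is not the patch itself; getSystemCpuLoad() is the pre-JDK-14 method name, later renamed getCpuLoad()):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class CpuVsLoad {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();

        // Unix load average counts runnable threads, and on some systems also
        // threads blocked in iowait -- it is not a pure CPU measure.
        double loadAvg = os.getSystemLoadAverage();  // -1.0 if unavailable

        // Actual system CPU utilization (0.0..1.0), where the JVM exposes it.
        double cpu = -1.0;
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            cpu = ((com.sun.management.OperatingSystemMXBean) os).getSystemCpuLoad();
        }
        System.out.printf("load average: %.2f, CPU utilization: %.2f%n", loadAvg, cpu);
    }
}
```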

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 1, 2021, at 12:20 PM, Atri Sharma  wrote:
> 
> I took a look at the PR and gave comments for SOLR-15056, and the last I 
> checked, my comments were not addressed?
> 
> On Wed, 2 Jun 2021, 00:31 Walter Underwood,  <mailto:wun...@wunderwood.org>> wrote:
> Could someone else please take a look at SOLR-15056? This is a small blast 
> radius change that improves the circuit breakers. It includes unit tests and 
> documentation and has been ready since January.
> 
> https://github.com/apache/solr/pull/96/files 
> <https://github.com/apache/solr/pull/96/files>
> https://issues.apache.org/jira/browse/SOLR-15056 
> <https://issues.apache.org/jira/browse/SOLR-15056>
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Jun 1, 2021, at 11:53 AM, Mayya Sharipova 
>> > <mailto:mayya.sharip...@elastic.co.INVALID>> wrote:
>> 
>> Thank you for the update, Houston.
>> 
>> I've started the release process, the branch 8.9 is now cut.
>> 
>> On Tue, Jun 1, 2021 at 11:21 AM Houston Putman > <mailto:hous...@apache.org>> wrote:
>> Mayya, SOLR-14978 is now in 8.x. So no longer a blocker.
>> 
>> - Houston
>> 
>> On Thu, May 27, 2021 at 11:42 PM David Smiley > <mailto:dsmi...@apache.org>> wrote:
>> SOLR-15412 is rather serious as the title suggests.  I haven't been tracking 
>> the progress so if it's already resolved, that's unknown to me and isn't 
>> reflected in JIRA.
>> 
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley 
>> <http://www.linkedin.com/in/davidwsmiley>
>> 
>> On Thu, May 27, 2021 at 5:24 PM Mayya Sharipova 
>> > <mailto:mayya.sharip...@elastic.co.invalid>> wrote:
>> Hello everyone,
>> I wonder if everyone is ok for May 31st (Monday) as the date for the feature 
>> freeze and branch cut?
>> I've noticed that `releaseWizard.py` is also asking for the length of 
>> feature freeze. What is the custom length to put there?
>> 
>> Looks like Lucene 
>> <https://issues.apache.org/jira/projects/LUCENE/versions/12349562> doesn't 
>> have any unresolved issues for 8.9.
>> SOLR <https://issues.apache.org/jira/projects/SOLR/versions/12349563> has:
>> -  SOLR-15412  Strict validation on Replica metadata can cause complete 
>> outage  (Looks like it may be resolved already?)
>> - SOLR-15410 GC log is directed to console when starting Solr with Java 11 
>> Open J9 on Windows
>> - SOLR-15056  CPU circuit breaker needs to use CPU utilization, not Unix 
>> load average
>> 
>> Are we ok to postpone these issues to later releases if they are not 
>> resolved and merged before feature freeze?
>> 
>> Thank you.
>> 
>> 
>> 
>> 
>> 
>> 
>> On Tue, May 25, 2021 at 12:41 PM Colvin Cowie > <mailto:colvin.cowie@gmail.com>> wrote:
>> Hello,
>> Eric was going to have a look at the PR.
>> But if it isn't done in time then I don't think it needs to block the release
>> 
>> Thanks
>> 
>> On Tue, 25 May 2021 at 15:50, Mayya Sharipova 
>> > <mailto:mayya.sharip...@elastic.co.invalid>> wrote:
>> Hello Colvin,
>> I am wondering if you still want to merge SOLR-15410 for the Lucene/Solr 8.9 
>> release?  
>> Should we have a deadline for feature freeze? Say May 30th (Sunday)? 
>> 
>> Thank you.
>> 
>> On Tue, May 18, 2021 at 8:49 AM Noble Paul > <mailto:noble.p...@gmail.com>> wrote:
>> +1
>> 
>> 
>> On Tue, May 18, 2021 at 9:30 PM Colvin Cowie > <mailto:colvin.cowie@gmail.com>> wrote:
>> >
>> > Hello,
>> >
>> > I raised SOLR-15410 yesterday with a PR to fix an issue with GC logging 
>> > when using new versions of OpenJ9. It's small, so if somebody could have a 
>> > look at it in time for 8.9 that would be great

Re: Release Lucene/Solr 8.9.0 should we have it soon

2021-06-01 Thread Walter Underwood
Could someone else please take a look at SOLR-15056? This is a small blast 
radius change that improves the circuit breakers. It includes unit tests and 
documentation and has been ready since January.

https://github.com/apache/solr/pull/96/files 
<https://github.com/apache/solr/pull/96/files>
https://issues.apache.org/jira/browse/SOLR-15056 
<https://issues.apache.org/jira/browse/SOLR-15056>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 1, 2021, at 11:53 AM, Mayya Sharipova 
>  wrote:
> 
> Thank you for the update, Houston.
> 
> I've started the release process, the branch 8.9 is now cut.
> 
> On Tue, Jun 1, 2021 at 11:21 AM Houston Putman  <mailto:hous...@apache.org>> wrote:
> Mayya, SOLR-14978 is now in 8.x. So no longer a blocker.
> 
> - Houston
> 
> On Thu, May 27, 2021 at 11:42 PM David Smiley  <mailto:dsmi...@apache.org>> wrote:
> SOLR-15412 is rather serious as the title suggests.  I haven't been tracking 
> the progress so if it's already resolved, that's unknown to me and isn't 
> reflected in JIRA.
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> <http://www.linkedin.com/in/davidwsmiley>
> 
> On Thu, May 27, 2021 at 5:24 PM Mayya Sharipova 
>  wrote:
> Hello everyone,
> I wonder if everyone is ok for May 31st (Monday) as the date for the feature 
> freeze and branch cut?
> I've noticed that `releaseWizard.py` is also asking for the length of feature 
> freeze. What is the custom length to put there?
> 
> Looks like Lucene 
> <https://issues.apache.org/jira/projects/LUCENE/versions/12349562> doesn't 
> have any unresolved issues for 8.9.
> SOLR <https://issues.apache.org/jira/projects/SOLR/versions/12349563> has:
> -  SOLR-15412  Strict validation on Replica metadata can cause complete 
> outage  (Looks like it may be resolved already?)
> - SOLR-15410 GC log is directed to console when starting Solr with Java 11 
> Open J9 on Windows
> - SOLR-15056  CPU circuit breaker needs to use CPU utilization, not Unix load 
> average
> 
> Are we ok to postpone these issues to later releases if they are not resolved 
> and merged before feature freeze?
> 
> Thank you.
> 
> 
> 
> 
> 
> 
> On Tue, May 25, 2021 at 12:41 PM Colvin Cowie  <mailto:colvin.cowie@gmail.com>> wrote:
> Hello,
> Eric was going to have a look at the PR.
> But if it isn't done in time then I don't think it needs to block the release
> 
> Thanks
> 
> On Tue, 25 May 2021 at 15:50, Mayya Sharipova 
>  wrote:
> Hello Colvin,
> I am wondering if you still want to merge SOLR-15410 for the Lucene/Solr 8.9 
> release?  
> Should we have a deadline for feature freeze? Say May 30th (Sunday)? 
> 
> Thank you.
> 
> On Tue, May 18, 2021 at 8:49 AM Noble Paul  <mailto:noble.p...@gmail.com>> wrote:
> +1
> 
> 
> On Tue, May 18, 2021 at 9:30 PM Colvin Cowie  <mailto:colvin.cowie@gmail.com>> wrote:
> >
> > Hello,
> >
> > I raised SOLR-15410 yesterday with a PR to fix an issue with GC logging 
> > when using new versions of OpenJ9. It's small, so if somebody could have a 
> > look at it in time for 8.9 that would be great
> >
> > Thanks,
> > Colvin
> >
> > On Thu, 13 May 2021 at 17:52, Nhat Nguyen  > <mailto:nhat.ngu...@elastic.co>.invalid> wrote:
> >>
> >> Hi Mayya,
> >>
> >> I would like to backport LUCENE-9935, which enables bulk-merge for stored 
> >> fields with index sort, to 8.x this weekend. The patch is ready, but we 
> >> prefer to give CI some cycles before backporting. Please let me know if 
> >> it's okay with the release plan.
> >>
> >> Thanks,
> >> Nhat
> >>
> >> On Thu, May 13, 2021 at 12:44 PM Gus Heck  >> <mailto:gus.h...@gmail.com>> wrote:
> >>>
> >>> Perhaps https://issues.apache.org/jira/browse/SOLR-15378 
> >>> <https://issues.apache.org/jira/browse/SOLR-15378> should be investigated 
> >>> before 8.9, maybe make it a blocker?
> >>>
> >>> On Thu, May 13, 2021 at 1:35 AM Robert Muir  >>> <mailto:rcm...@gmail.com>> wrote:
> >>>>
> >>>> Mayya, I created backport for Adrien's issue here, to try to help out:
> >>>> https://github.com/apache/lucene-solr/pull/2495 
> >>>> <https://github.com/apache/lucene-solr/pull/2495>
> >>>>
> >>>> Personally, I felt that merging non-trivial changes from main branch
> >>>> to 8.x has som

Re: Circuit Breakers -- SOLR-15056

2021-05-25 Thread Walter Underwood
Has anyone had a chance to look at this for 8.9?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 6, 2021, at 10:03 AM, Walter Underwood  wrote:
> 
> Understood, getting well is more important.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/  (my blog)
> 
>> On May 6, 2021, at 9:23 AM, Atri Sharma > <mailto:a...@apache.org>> wrote:
>> 
>> Hello,
>> 
>> I have been recovering from Covid so this has been delayed.
>> 
>> Apologies for not being able to look into this. This is on my ToDo
>> list for this week.
>> 
>> On Thu, May 6, 2021 at 9:48 PM Walter Underwood > <mailto:wun...@wunderwood.org>> wrote:
>>> 
>>> How do we make progress on SOLR-15056?
>>> 
>>> This is a simple fix:
>>> 
>>> * Improve the name for an existing circuit breaker
>>> * Add a new circuit breaker that does what the name for the first one 
>>> suggested
>>> * Make the documentation more accurate
>>> 
>>> I submitted a patch in mid-January. I resubmitted those as a PR last month.
>>> 
>>> https://issues.apache.org/jira/browse/SOLR-15056 
>>> <https://issues.apache.org/jira/browse/SOLR-15056>
>>> https://github.com/apache/solr/pull/96
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>> 
>> 
>> -- 
>> Regards,
>> 
>> Atri
>> Apache Concerted
>> 
> 



Re: Text search in Arabic

2021-05-20 Thread Walter Underwood
I recommend normalizing all characters with a compatibility transformation, 
whether they are Arabic or not. 

We use this charFilter as the first step in every query and indexing analysis 
chain.


<charFilter class="solr.ICUNormalizer2CharFilterFactory"/>

You’ll also need to include the ICU library, which should be included by 
default. Actually, the compatibility normalization should be done by default, 
too. That transform was designed specifically for string matching and search.

We have this in every solrconfig.xml.

<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="icu4j-.*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-.*\.jar" />
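
To see what the compatibility transform actually does, here is a tiny standalone ICU4J sketch (assuming icu4j is on the classpath; it shows the NFKC mapping itself, not the Solr charFilter):

```java
import com.ibm.icu.text.Normalizer2;

public class CompatNormalize {
    public static void main(String[] args) {
        // NFKC folds compatibility characters -- ligatures, full-width
        // forms, presentation forms -- into their canonical equivalents.
        Normalizer2 nfkc = Normalizer2.getNFKCInstance();
        System.out.println(nfkc.normalize("ﬁle"));      // fi ligature -> "file"
        System.out.println(nfkc.normalize("Ｌｕｃｅｎｅ"));  // full-width -> "Lucene"
        System.out.println(nfkc.normalize("\uFEFB"));   // Arabic lam-alef ligature -> "لا"
    }
}
```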

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 20, 2021, at 9:38 AM, Mete Kural  wrote:
> 
> Hello Michael,
> 
> Thank you very much for this information.
> 
> I will try at  java-u...@lucene.apache.org 
> <mailto:java-u...@lucene.apache.org> also.
> 
> By the way, is the Arabic analyzer referenced here 
> (https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar)
>  just for the Arabic language or all languages written with the Arabic script?
> 
> Thank you,
> Mete
> 
> 
>> On May 20, 2021, at 4:35 PM, Michael Wechner  
>> wrote:
>> 
>> Hi Mete
>> 
>> You might also want to try the java-u...@lucene.apache.org mailing list
>> 
>> https://lucene.apache.org/core/discussion.html#java-user-list-java-userluceneapacheorg
>> 
>> Re languages other than english you might find more information at
>> 
>> https://cwiki.apache.org/confluence/display/lucene/LuceneFAQ#LuceneFAQ-CanIuseLucenetoindextextinChinese,Japanese,Korean,andothermulti-bytecharactersets?
>> 
>> whereas I just realize that the following link does not work anymore
>> 
>> https://lucene.apache.org/core/lucene-sandbox/
>> 
>> Are these analyzers now inside
>> 
>> https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis
>> https://github.com/apache/lucene/tree/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/ar
>> 
>> ?
>> 
>> Thanks
>> 
>> Michael
>> 
>> 
>> Am 20.05.21 um 14:48 schrieb Mete Kural:
>>> Hello Lucene Community,
>>> 
>>> I hope this finds you all well. I want to ask you if this would be the 
>>> right medium to discuss some matters surrounding text search in relation to 
>>> variant Unicode codings of words in Arabic and Arabic scripted languages. 
>>> This is not a great example but the said matters are similar to matters 
>>> around Latin scripted searches where the letter “İ” needs to be substituted 
>>> with “I” in searches and so forth. Would this mailing list be the best 
>>> medium to discuss such matters? If not, would you mind recommending me a 
>>> medium for discussion on this?
>>> 
>>> Kind regards,
>>> Mete Kural
>> 
>> 
> 



Re: Circuit Breakers -- SOLR-15056

2021-05-06 Thread Walter Underwood
Understood, getting well is more important.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 6, 2021, at 9:23 AM, Atri Sharma  wrote:
> 
> Hello,
> 
> I have been recovering from Covid so this has been delayed.
> 
> Apologies for not being able to look into this. This is on my ToDo
> list for this week.
> 
> On Thu, May 6, 2021 at 9:48 PM Walter Underwood  wrote:
>> 
>> How do we make progress on SOLR-15056?
>> 
>> This is a simple fix:
>> 
>> * Improve the name for an existing circuit breaker
>> * Add a new circuit breaker that does what the name for the first one 
>> suggested
>> * Make the documentation more accurate
>> 
>> I submitted a patch in mid-January. I resubmitted those as a PR last month.
>> 
>> https://issues.apache.org/jira/browse/SOLR-15056
>> https://github.com/apache/solr/pull/96
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 
> 
> -- 
> Regards,
> 
> Atri
> Apache Concerted
> 



Circuit Breakers -- SOLR-15056

2021-05-06 Thread Walter Underwood
How do we make progress on SOLR-15056?

This is a simple fix:

* Improve the name for an existing circuit breaker
* Add a new circuit breaker that does what the name for the first one suggested
* Make the documentation more accurate

I submitted a patch in mid-January. I resubmitted those as a PR last month.

https://issues.apache.org/jira/browse/SOLR-15056
https://github.com/apache/solr/pull/96

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: SOLR-15056 circuit breaker bugfix?

2021-04-29 Thread Walter Underwood
I made a PR three days ago and linked it in the Jira.

https://github.com/apache/solr/pull/96/files

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 23, 2021, at 2:33 PM, Anshum Gupta  wrote:
> 
> Hi Walter,
> 
> Can you please create a PR, as Ab mentioned in the JIRA. That would make it 
> much easier to review considering the size of the patch. It will also be 
> easier to comment and iterate that way.
> 
> -Anshum
> 
> On Tue, Mar 23, 2021 at 2:24 PM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> The patch for SOLR-15056 was submitted over two months ago. It fixes a bug, 
> adds a new kind of circuit breaker, improves the documentation, and has unit 
> tests.
> 
>> I’ll make it into a PR if that is required, but it is discouraging to put a 
> bunch of work into a fix and have it ignored. This is a much better 
> submission than the bar in Yonik’s Law of Patches.
> 
> https://issues.apache.org/jira/browse/SOLR-15056 
> <https://issues.apache.org/jira/browse/SOLR-15056>
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
> 
> 
> -- 
> Anshum Gupta



Re: SOLR-15056 circuit breaker bugfix?

2021-03-23 Thread Walter Underwood
If patches are hard for people, then the How to Contribute page needs to be 
rewritten to tell people to create PRs. I followed that page when doing this 
work.

It should have specific instructions on how to make PRs, maybe branch naming, 
etc. I make PRs all day at work, but I never use github and I don’t know what 
Solr wants.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 23, 2021, at 8:10 PM, Atri Sharma  wrote:
> 
> +1
> 
> I tried reviewing the patch and echoed AB's point of needing a PR. Happy to 
> do the review as soon as we have that.
> 
> On Wed, 24 Mar 2021, 03:04 Anshum Gupta,  <mailto:ans...@anshumgupta.net>> wrote:
> Hi Walter,
> 
> Can you please create a PR, as Ab mentioned in the JIRA. That would make it 
> much easier to review considering the size of the patch. It will also be 
> easier to comment and iterate that way.
> 
> -Anshum
> 
> On Tue, Mar 23, 2021 at 2:24 PM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> The patch for SOLR-15056 was submitted over two months ago. It fixes a bug, 
> adds a new kind of circuit breaker, improves the documentation, and has unit 
> tests.
> 
> I’ll make it into a PR if that is required, but it is discouraging to put a 
> bunch of work into a fix and have it ignored. This is a much better 
> submission than the bar in Yonik’s Law of Patches.
> 
> https://issues.apache.org/jira/browse/SOLR-15056 
> <https://issues.apache.org/jira/browse/SOLR-15056>
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
> 
> 
> -- 
> Anshum Gupta



SOLR-15056 circuit breaker bugfix?

2021-03-23 Thread Walter Underwood
The patch for SOLR-15056 was submitted over two months ago. It fixes a bug, 
adds a new kind of circuit breaker, improves the documentation, and has unit 
tests.

I’ll make it into a PR if that is required, but it is discouraging to put a 
bunch of work into a fix and have it ignored. This is a much better submission 
than the bar in Yonik’s Law of Patches.

https://issues.apache.org/jira/browse/SOLR-15056

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Circuit Breakers interaction with Shards

2021-02-16 Thread Walter Underwood
Limiting open connections is not the same as rate limiting. The open connection 
count is the number of requests being processed by a node. When the load balancer gets 
a new request and all current connections are waiting for a response, a new 
connection is opened. 

If the requests are all the same query and returned from the query cache, the 
rate can be very high with a few connections. If the requests are very slow, 
like deep paging, it only takes a few hundred requests to max out the 
connections. 100 queries/sec could be 5% CPU or 100% CPU. 

Think of the count of requests waiting to be handled (the number of active 
connections) as a cluster-wide load average: one connection per request 
being processed, plus one connection per request waiting.
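
A minimal sketch of that approach, assuming a semaphore in front of the request handler (illustrative names, not the actual Netflix implementation): the cap is on in-flight work, so it sheds load without ever needing a rate number.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch: cap in-flight requests instead of requests per second. The cap
// tracks work waiting to be done, so it does not need retuning when the
// per-query CPU cost changes. All names here are illustrative.
public class ConcurrencyLimiter {
    private final Semaphore inFlight;

    public ConcurrencyLimiter(int maxConcurrent) {
        this.inFlight = new Semaphore(maxConcurrent);
    }

    public String handle(Supplier<String> request) {
        if (!inFlight.tryAcquire()) {
            return "503 Service Unavailable";  // shed load rather than queue it
        }
        try {
            return request.get();
        } finally {
            inFlight.release();
        }
    }

    public static void main(String[] args) {
        ConcurrencyLimiter limiter = new ConcurrencyLimiter(100);
        System.out.println(limiter.handle(() -> "200 OK"));
    }
}
```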

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 16, 2021, at 8:53 AM, David Smiley  wrote:
> 
> Walter, it sounds like you were doing rate limiting, just in a different way 
> that is more dynamic than a simple (yet fiddly) constant?
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> <http://www.linkedin.com/in/davidwsmiley>
> 
> On Sun, Feb 14, 2021 at 2:54 PM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> Rate limiting is a good idea. It requires a lot of ongoing engineering to 
> adjust the rates to the current cluster behavior. It doesn’t help with some 
> kinds of overload. The ROI just doesn’t work out. It is too much work for not 
> enough benefit.
> 
> Rate limiting works if the collection size doesn’t change and the queries 
> don’t change.
> 
> At Netflix, we limited traffic based on number of connections to each server. 
> This is basically the length of the queue of requests for that server. This 
> is similar to limiting by load average, which is also the work waiting to be 
> done. It has the same weaknesses as the load average circuit breaker, but it 
> did not need to be changed when average CPU usage per query increased. It was 
> “set and forget”. Rate limiters require constant adjustment.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Feb 14, 2021, at 11:44 AM, Atri Sharma > <mailto:a...@apache.org>> wrote:
>> 
>> This is a debate better suited for  a different forum  -- but I would 
>> disagree with your assertion that rate limiting is a bad idea.
>> 
>> Solr allows you to specify node level request quotas which also follow the 
>> principle of not limiting internal requests. I find that to be pretty useful 
>> in three ways: 1. Use it in conjunction with a global request limit which is 
>> typically 0.75 of my total load capacity given my average query resource 
>> consumption. 2. Allow per node request limits to ensure fairness and 
>> dedicated capacity for different types of requests. 3. Allow circuit 
>> breakers to handle cases where a couple of rogue queries can take down nodes.
>> 
>> We digress -- as I said, it should be fairly simple to have a circuit 
>> breaker which rejects only external requests,  but should be clearly 
>> documented with its downsides.
>> 
>> On Mon, 15 Feb 2021, 00:33 Walter Underwood, > <mailto:wun...@wunderwood.org>> wrote:
>> We’ve looked at and rejected rate limiters as high-maintenance and not 
>> sufficient protection.
>> 
>> We would have run nginx on each node, sent external traffic to nginx on a 
>> different port and let internal traffic stay on the default Solr port. This 
>> has other advantages (monitoring), but the rate limiting part is way too 
>> fiddly.
>> 
>> Rates depend on how much CPU is used per query and on the size of the 
>> cluster (if they are not on each node). Some examples from our largest 
>> cluster which would need a change in rate limits. Some of these could be set 
>> by doing offline load benchmarks, some not.
>> 
>> * Experiment cell that uses 2.5X more CPU for each query (running now in 
>> prod)
>> * Increasing traffic allocated to that cell (did this last week)
>> * Increase in index size (number of docs and CPU requirements increase about 
>> 5% every month)
>> * Website slowdown that shifts most traffic to mobile, where queries use 2X 
>> as much CPU
>> * Horizontal scaling from 24 to 48 nodes
>> * Vertical scaling from c5.8xlarge to c5.18xlarge
>> 
>> And so on. Rate limiting would require almost weekly load benchmarks and it 
>> still wouldn’t catch the outage-causing problems.
>> 
>> wunder
>> Walter Underwood

Re: Circuit Breakers interaction with Shards

2021-02-14 Thread Walter Underwood
Rate limiting is a good idea. It requires a lot of ongoing engineering to 
adjust the rates to the current cluster behavior. It doesn’t help with some 
kinds of overload. The ROI just doesn’t work out. It is too much work for not 
enough benefit.

Rate limiting works if the collection size doesn’t change and the queries don’t 
change.

At Netflix, we limited traffic based on number of connections to each server. 
This is basically the length of the queue of requests for that server. This is 
similar to limiting by load average, which is also the work waiting to be done. 
It has the same weaknesses as the load average circuit breaker, but it did not 
need to be changed when average CPU usage per query increased. It was “set and 
forget”. Rate limiters require constant adjustment.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 14, 2021, at 11:44 AM, Atri Sharma  wrote:
> 
> This is a debate better suited for  a different forum  -- but I would 
> disagree with your assertion that rate limiting is a bad idea.
> 
> Solr allows you to specify node level request quotas which also follow the 
> principle of not limiting internal requests. I find that to be pretty useful 
> in three ways: 1. Use it in conjunction with a global request limit which is 
> typically 0.75 of my total load capacity given my average query resource 
> consumption. 2. Allow per node request limits to ensure fairness and 
> dedicated capacity for different types of requests. 3. Allow circuit breakers 
> to handle cases where a couple of rogue queries can take down nodes.
> 
> We digress -- as I said, it should be fairly simple to have a circuit breaker 
> which rejects only external requests,  but should be clearly documented with 
> its downsides.
> 
> On Mon, 15 Feb 2021, 00:33 Walter Underwood,  <mailto:wun...@wunderwood.org>> wrote:
> We’ve looked at and rejected rate limiters as high-maintenance and not 
> sufficient protection.
> 
> We would have run nginx on each node, sent external traffic to nginx on a 
> different port and let internal traffic stay on the default Solr port. This 
> has other advantages (monitoring), but the rate limiting part is way too 
> fiddly.
> 
> Rates depend on how much CPU is used per query and on the size of the cluster 
> (if they are not on each node). Some examples from our largest cluster which 
> would need a change in rate limits. Some of these could be set by doing 
> offline load benchmarks, some not.
> 
> * Experiment cell that uses 2.5X more CPU for each query (running now in prod)
> * Increasing traffic allocated to that cell (did this last week)
> * Increase in index size (number of docs and CPU requirements increase about 
> 5% every month)
> * Website slowdown that shifts most traffic to mobile, where queries use 2X 
> as much CPU
> * Horizontal scaling from 24 to 48 nodes
> * Vertical scaling from c5.8xlarge to c5.18xlarge
> 
> And so on. Rate limiting would require almost weekly load benchmarks and it 
> still wouldn’t catch the outage-causing problems.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Feb 14, 2021, at 10:25 AM, Atri Sharma > <mailto:a...@apache.org>> wrote:
>> 
>> The way I look at it is that for cluster level stability, rate limiters 
>> should be used which allow rate limiting of only external requests. They are 
>> "circuit breakers" in the sense of defending against cluster level 
>> instability, which is what you describe.
>> 
>> Circuit breakers, in Solr world, are targeted to be the last resort defense 
>> of a node.
>> 
>> As I said earlier, it is possible to write a circuit breaker which rejects 
>> only external requests, but I personally do not see the benefit in presence 
>> of rate limiters.
>> 
>> On Sun, 14 Feb 2021, 23:50 Walter Underwood, > <mailto:wun...@wunderwood.org>> wrote:
>> Ideally, it would only affect a few queries. In reality, with a sharded 
>> system, the impact will be large.
>> 
>> I disagree that the goal is to protect a node. The goal is to make the 
>> entire cluster avoid congestion failure when overloaded, while providing 
>> good service for the load that it can handle.
>> 
>> I have had Solr clusters take down entire websites when overloaded, both at 
>> Netflix and Chegg, and I’ve built defenses for this at both places. I’m a 
>> huge fan of circuit breakers.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)

Re: Circuit Breakers interaction with Shards

2021-02-14 Thread Walter Underwood
We’ve looked at and rejected rate limiters as high-maintenance and not 
sufficient protection.

We would have run nginx on each node, sent external traffic to nginx on a 
different port and let internal traffic stay on the default Solr port. This has 
other advantages (monitoring), but the rate limiting part is way too fiddly.

Rates depend on how much CPU is used per query and on the size of the cluster 
(if they are not on each node). Some examples from our largest cluster which 
would need a change in rate limits. Some of these could be set by doing offline 
load benchmarks, some not.

* Experiment cell that uses 2.5X more CPU for each query (running now in prod)
* Increasing traffic allocated to that cell (did this last week)
* Increase in index size (number of docs and CPU requirements increase about 5% 
every month)
* Website slowdown that shifts most traffic to mobile, where queries use 2X as 
much CPU
* Horizontal scaling from 24 to 48 nodes
* Vertical scaling from c5.8xlarge to c5.18xlarge

And so on. Rate limiting would require almost weekly load benchmarks and it 
still wouldn’t catch the outage-causing problems.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 14, 2021, at 10:25 AM, Atri Sharma  wrote:
> 
> The way I look at it is that for cluster level stability, rate limiters 
> should be used which allow rate limiting of only external requests. They are 
> "circuit breakers" in the sense of defending against cluster level 
> instability, which is what you describe.
> 
> Circuit breakers, in Solr world, are targeted to be the last resort defense 
> of a node.
> 
> As I said earlier, it is possible to write a circuit breaker which rejects 
> only external requests, but I personally do not see the benefit in presence 
> of rate limiters.
> 
> On Sun, 14 Feb 2021, 23:50 Walter Underwood,  <mailto:wun...@wunderwood.org>> wrote:
> Ideally, it would only affect a few queries. In reality, with a sharded 
> system, the impact will be large.
> 
> I disagree that the goal is to protect a node. The goal is to make the entire 
> cluster avoid congestion failure when overloaded, while providing good 
> service for the load that it can handle.
> 
> I have had Solr clusters take down entire websites when overloaded, both at 
> Netflix and Chegg, and I’ve built defenses for this at both places. I’m a 
> huge fan of circuit breakers.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Feb 14, 2021, at 9:50 AM, Atri Sharma > <mailto:a...@apache.org>> wrote:
>> 
>> This has an issue of still leading to node outages if the fanout for a query 
>> is high.
>> 
>> Circuit breakers follow a simple rule -- defend the node at the cost of 
>> degraded responses.
>> 
>> Ideally, only a few requests will be completely rejected -- some will see 
>> partial results. Due to this non discriminating nature of circuit breakers, 
>> the typical blip on service quality due to high resource usage is short 
>> lived.
>> 
>> However, it is possible to write a circuit breaker which rejects only 
>> external requests in master branch (we have the ability to identify requests 
>> as internal or external there).
>> 
>> Regards,
>> 
>> Atri
>> 
>> On Sun, 14 Feb 2021, 23:07 Walter Underwood, > <mailto:wun...@wunderwood.org>> wrote:
>> This got zero responses on the solr-user list, so I’ll raise the issue here.
>> 
>> Should circuit breakers only kill external search requests and not 
>> cluster-internal requests to shards?
>> 
>> Circuit breakers can kill any request, whether it is a client request from 
>> outside the cluster or an internal distributed request to a shard. Killing a 
>> portion of a distributed request will affect the main request. Not sure 
>> whether a 503 from a shard will kill the whole request or cause partial 
>> results, but it isn’t good.
>> 
>> We run with 8 shards. If a circuit breaker is killing 10% of requests on 
>> each host, that will hit 57% of all external requests (0.9^8 = 0.43). That 
>> seems like “overkill” to me. If it only kills external requests, then 10% 
>> means 10%.
>> 
>> Killing only external requests requires that external requests go roughly 
>> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 



Re: Circuit Breakers interaction with Shards

2021-02-14 Thread Walter Underwood
Ideally, it would only affect a few queries. In reality, with a sharded system, 
the impact will be large.

I disagree that the goal is to protect a node. The goal is to make the entire 
cluster avoid congestion failure when overloaded, while providing good service 
for the load that it can handle.

I have had Solr clusters take down entire websites when overloaded, both at 
Netflix and Chegg, and I’ve built defenses for this at both places. I’m a huge 
fan of circuit breakers.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 14, 2021, at 9:50 AM, Atri Sharma  wrote:
> 
> This has an issue of still leading to node outages if the fanout for a query 
> is high.
> 
> Circuit breakers follow a simple rule -- defend the node at the cost of 
> degraded responses.
> 
> Ideally, only few requests will be completely rejected -- some will see 
> partial results. Due to this non discriminating nature of circuit breakers, 
> the typical blip on service quality due to high resource usage is short lived.
> 
> However, it is possible to write a circuit breaker which rejects only 
> external requests in master branch (we have the ability to identify requests 
> as internal or external there).
> 
> Regards,
> 
> Atri
> 
> On Sun, 14 Feb 2021, 23:07 Walter Underwood,  <mailto:wun...@wunderwood.org>> wrote:
> This got zero responses on the solr-user list, so I’ll raise the issue here.
> 
> Should circuit breakers only kill external search requests and not 
> cluster-internal requests to shards?
> 
> Circuit breakers can kill any request, whether it is a client request from 
> outside the cluster or an internal distributed request to a shard. Killing a 
> portion of a distributed request will affect the main request. Not sure whether 
> a 503 from a shard will kill the whole request or cause partial results, but 
> it isn’t good.
> 
> We run with 8 shards. If a circuit breaker is killing 10% of requests on each 
> host, that will hit 57% of all external requests (0.9^8 = 0.43). That seems 
> like “overkill” to me. If it only kills external requests, then 10% means 10%.
> 
> Killing only external requests requires that external requests go roughly 
> equally to all hosts in the cluster, or at least all NRT or PULL replicas.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)



Re: Circuit Breaker Clean-up/Extension Jira

2021-02-14 Thread Walter Underwood
Sorry, couldn’t figure out how to do that for Solr. I do PRs all day on our 
company system, but that uses Bitbucket.

The “how to contribute” docs just said to make a PR, which didn’t really help. 
I tried, but nothing worked.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 14, 2021, at 9:56 AM, Atri Sharma  wrote:
> 
> Also, if you could open a PR, it would be easier to review.
> 
> On Sun, 14 Feb 2021, 23:22 Atri Sharma,  <mailto:a...@apache.org>> wrote:
> Apologies for the delay. I will review this tomorrow
> 
> On Sun, 14 Feb 2021, 23:06 Walter Underwood,  <mailto:wun...@wunderwood.org>> wrote:
> Please review for 8.9. We will use this feature after it is updated. The 
> current circuit breakers won’t work for us.
> 
> https://issues.apache.org/jira/browse/SOLR-15056 
> <https://issues.apache.org/jira/browse/SOLR-15056>
> 
> This change:
> 
> * Preserves existing functionality.
> * Renames the existing load average circuit breaker to a more accurate name.
> * Adds a circuit breaker for CPU usage that is available if the JVM supports 
> it.
> * Adds detail to documentation, listing which JMX calls each circuit breaker 
> is based on.
> * Copy-edits on docs for more detail, less complicated wording (good when 
> English is not the reader’s primary language)
> * Includes unit tests.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 



Circuit Breakers interaction with Shards

2021-02-14 Thread Walter Underwood
This got zero responses on the solr-user list, so I’ll raise the issue here.

Should circuit breakers only kill external search requests and not 
cluster-internal requests to shards?

Circuit breakers can kill any request, whether it is a client request from 
outside the cluster or an internal distributed request to a shard. Killing a 
portion of a distributed request will affect the main request. Not sure whether a 
503 from a shard will kill the whole request or cause partial results, but it 
isn’t good.

We run with 8 shards. If a circuit breaker is killing 10% of requests on each 
host, that will hit 57% of all external requests (0.9^8 = 0.43). That seems 
like “overkill” to me. If it only kills external requests, then 10% means 10%.
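
The arithmetic, as a quick sketch (Java):

  double pKill = 0.10;          // each node rejects 10% of shard requests
  int shards = 8;               // fanout of one distributed request
  double allSucceed = Math.pow(1.0 - pKill, shards);  // 0.9^8 ≈ 0.43
  // 1.0 - allSucceed ≈ 0.57, so ~57% of external requests lose at least one shard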

Killing only external requests requires that external requests go roughly 
equally to all hosts in the cluster, or at least all NRT or PULL replicas.

wunder
Walter Underwood
wun...@wunderwood.org <mailto:wun...@wunderwood.org>
http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)

Circuit Breaker Clean-up/Extension Jira

2021-02-14 Thread Walter Underwood
Please review for 8.9. We will use this feature after it is updated. The 
current circuit breakers won’t work for us.

https://issues.apache.org/jira/browse/SOLR-15056

This change:

* Preserves existing functionality.
* Renames the existing load average circuit breaker to a more accurate name.
* Adds a circuit breaker for CPU usage that is available if the JVM supports it.
* Adds detail to documentation, listing which JMX calls each circuit breaker is 
based on.
* Copy-edits on docs for more detail, less complicated wording (good when 
English is not the reader’s primary language)
* Includes unit tests.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: 8.8 Release

2021-01-12 Thread Walter Underwood
I’d love for SOLR-15056 to be in, but it is just a patch now and hasn’t had 
anything besides
local testing, so that is a bit of a long shot.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 11, 2021, at 9:29 AM, Timothy Potter  wrote:
> 
> 15036 will be in later today, so you can plan to cut this evening US time or 
> tomorrow.
> 
> Cheers,
> Tim
> 
> On Mon, Jan 11, 2021 at 9:54 AM Ishan Chattopadhyaya 
> mailto:ichattopadhy...@gmail.com>> wrote:
> I think all the issues mentioned above are in the branch, except SOLR-15036. 
> Tim, I'll cut a branch once that is in, latest by Wednesday AM (USA time).
> Thanks,
> Ishan
> 
> On Thu, Jan 7, 2021 at 5:07 AM Timothy Potter  <mailto:thelabd...@gmail.com>> wrote:
> Thanks for following up on this Ishan ... I intend to get SOLR-15059 and 
> -15036 into 8.8 as well. I should have a proper PR up for SOLR-15036 by 
> Friday sometime, which seems to align with others' timeframes
> 
> Cheers,
> Tim
> 
> On Wed, Jan 6, 2021 at 6:54 AM David Smiley  <mailto:dsmi...@apache.org>> wrote:
> Happy New Year!
> I would much prefer that ensure 8.8 includes SOLR-14923 (a bad nested docs 
> performance issue)
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> <http://www.linkedin.com/in/davidwsmiley>
> 
> On Wed, Jan 6, 2021 at 6:59 AM Ishan Chattopadhyaya 
> mailto:ichattopadhy...@gmail.com>> wrote:
> Happy New Year!
> 
> I was supposed to start the process tomorrow, but I think we're not ready 
> yet? I see SOLR-15052 still under review with intention of inclusion into 8.8.
> Would it be reasonable to cut the release branch end of this week and start 
> the RC process around 13th January?
> If there are any issues someone would want me to wait on, please let me know.
> 
> Thanks,
> Ishan
> 
> On Fri, Dec 18, 2020 at 6:10 AM Ishan Chattopadhyaya 
> mailto:ichattopadhy...@gmail.com>> wrote:
> Sure, Houston. I'll wait another week. Have a good new year and merry 
> Christmas!
> 
> On Fri, 18 Dec, 2020, 5:58 am Timothy Potter,  <mailto:thelabd...@gmail.com>> wrote:
> Great point Houston! +1 on waiting until a week into January
> 
> On Thu, Dec 17, 2020 at 4:46 PM Houston Putman  <mailto:houstonput...@gmail.com>> wrote:
> Thanks for volunteering Ishan.
> 
> I think it might be a good idea to wait to cut and release 8.8 at least a 
> week into January. Many people are going to be away during the holiday 
> season, and particularly the last week of the year. Pushing into January just 
> gives more people a chance to look at the release and be involved.
> 
> - Houston 
> 
> On Fri, Dec 11, 2020 at 3:26 PM Noble Paul  <mailto:noble.p...@gmail.com>> wrote:
> Thanks Ishan for volunteering
> 
> On Fri, Dec 11, 2020 at 5:07 AM Christine Poerschke (BLOOMBERG/
> LONDON) mailto:cpoersc...@bloomberg.net>> wrote:
> >
> > With a view towards including it in the release, I'd appreciate code review 
> > input on
> >
> > https://github.com/apache/lucene-solr/pull/1992 
> > <https://github.com/apache/lucene-solr/pull/1992> for
> >
> > https://issues.apache.org/jira/browse/SOLR-14939 
> > <https://issues.apache.org/jira/browse/SOLR-14939> (JSON facets: range 
> > faceting to support cache=false parameter)
> >
> > if anyone has some time next week perhaps?
> >
> > Thanks in advance!
> >
> > Christine
> >
> > From: dev@lucene.apache.org <mailto:dev@lucene.apache.org> At: 12/10/20 
> > 18:01:58
> > To: dev@lucene.apache.org <mailto:dev@lucene.apache.org>
> > Subject: Re: 8.8 Release
> >
> > +1
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/ <http://joelsolr.blogspot.com/>
> >
> >
> > On Thu, Dec 10, 2020 at 11:23 AM David Smiley  > <mailto:dsmi...@apache.org>> wrote:
> >>
> >> Thanks for volunteering!
> >>
> >> On Thu, Dec 10, 2020 at 11:11 AM Ishan Chattopadhyaya 
> >> mailto:ichattopadhy...@gmail.com>> wrote:
> >>>
> >>> Hi Devs,
> >>> There are lots of changes accumulated and some underway. I wish to 
> >>> volunteer for a 8.8 release, if there are no objections. I'm planning to 
> >>> build the RC in three weeks, i.e. 31 December (and cut the branch about 
> >>> 3-4 days before that). Please let me know if someone has any concerns.
> >>> Thanks and regards,
> >>> Ishan
> >>>
> >> --
> >> Sent from Gmail Mobile
> >
> >
> 
> 
> -- 
> Noble Paul
> 
> 



Re: SOLR-15056 change circuit breaker metric

2021-01-08 Thread Walter Underwood
OK, I’ll do it against master. We’d love to see it in an 8.x release, though.
This would protect against the major Solr outages we’ve had in the past
few years.

A question for someone who knows the Java metrics stuff better than I do:

Is it OK to use com.sun.management.OperatingSystemMXBean instead
of the java.lang.management version? getSystemCpuLoad() is only in the
former. The current code uses java.lang.management.OperatingSystemMXBean.
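
For reference, the pattern I mean is a cast (a sketch; it works where the platform bean is the HotSpot implementation, which also implements the com.sun subinterface):

  import java.lang.management.ManagementFactory;

  // getSystemCpuLoad() only exists on the com.sun.management subinterface
  com.sun.management.OperatingSystemMXBean os =
      (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
  double load = os.getSystemCpuLoad();  // 0.0..1.0, or negative if unavailable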

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 8, 2021, at 10:16 AM, David Smiley  wrote:
> 
> Glad to see you contributing Walter!
> 
> Unless you know it only applies to 8x, you should branch against master.
> 
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley 
> <http://www.linkedin.com/in/davidwsmiley>
> 
> On Thu, Jan 7, 2021 at 4:26 PM Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> Starting work on this change. Should that be against branch_8x?
> 
> https://issues.apache.org/jira/browse/SOLR-15056 
> <https://issues.apache.org/jira/browse/SOLR-15056>
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 



SOLR-15056 change circuit breaker metric

2021-01-07 Thread Walter Underwood
Starting work on this change. Should that be against branch_8x?

https://issues.apache.org/jira/browse/SOLR-15056 
<https://issues.apache.org/jira/browse/SOLR-15056>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Lucene/Solr and Java versions, what we know

2019-03-27 Thread Walter Underwood
> On Mar 27, 2019, at 9:03 AM, Erick Erickson  wrote:
> 
> Problem here is that there won’t be any entries for any ref guide prior to 
> 8.1 (at best), we’re not going to go back and re-publish them all just to add 
> this. So someone asking “can I run Solr 7.7 on JDK 11” would have to look in 
> the 8.1 reference guide which is confusing as well.

I don’t have any problem with that.

Consider someone upgrading from 7.x to 8.x. They might upgrade the JVM first, 
if they know that 7.x runs on that JVM version. Then upgrade Solr. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Lucene/Solr and Java versions, what we know

2019-03-27 Thread Walter Underwood
I think it is appropriate to have the table in the reference guide. Yes, the 
guide is versioned, but we can’t expect people to manually diff two versions to 
figure out what changed.

If we expect people to upgrade to a new version, we should have the table in 
the reference guide, not hidden in a wiki.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 27, 2019, at 1:24 AM, Jan Høydahl  wrote:
> 
> Because the Reference Guide is versioned, we only need to discuss what JDK to 
> grab that works with e.g. Solr/Lucene 8. But I think your wiki page is great 
> as an in-depth guide to JDKs and compatibility. Perhaps it can just live 
> on the wiki as now, and link to it from the RefGuide?
> 
> So on the 
> https://lucene.apache.org/solr/guide/7_7/solr-system-requirements.html 
> <https://lucene.apache.org/solr/guide/7_7/solr-system-requirements.html> 
> page, right now we just say "You will need the Java Runtime Environment (JRE) 
> version 1.8 or higher" and then people to download Oracle Java.
> 
> I think instead of linking to only the Oracle paid version, we could say 
> something like:
> 
> If you do not have any opinion or requirements on a particular distribution 
> or version of Java, we recommend that you install the free, Open Source 
> OpenJDK version 11 which is the latest LTS (long term support) version. You 
> can download it from many sources such as AdoptOpenJDK (link), Amazon 
> Corretto (link), Zulu (link) or Oracle (link). We do not endorse any 
> particular vendor. Note that each vendor may have different policies for bug- 
> and security fixes, so choose one you are comfortable with. You may also find 
> that your Operating System already includes a supported version of the JDK.
> 
> There are also commercial paid versions of Java. If your organisation has a 
> policy on what vendor or version of Java to use, make sure to consult that 
> before deciding.
> 
> For a more in-depth discussion of JDK compatibility, known issues etc for 
> various versions of Lucene/Solr and Java, please see x.
> 
> For security reasons, Java should be kept up to date on minor versions. Never 
> upgrade Java to a higher major version before first checking and testing that 
> it is compatible with the version of Solr you have.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com <http://www.cominvent.com/>
> 
>> 26. mar. 2019 kl. 16:14 skrev Erick Erickson > <mailto:erickerick...@gmail.com>>:
>> 
>> So I assume everyone thinks I’ve nailed it perfectly with this page? 
>> https://wiki.apache.org/solr/SolrJavaVersions 
>> <https://wiki.apache.org/solr/SolrJavaVersions>. ‘cause I haven’t seen much 
>> feedback.
>> 
>> Look, we give _no_ guidance at this point about whether Lucene/Solr even 
>> work on Java X. Well, I guess we’re saying Solr 9 works with Java 11. Or 
>> at least it will since it’s about to be required.
>> 
>> I don’t particularly care if we say “If you’re upgrading Java, use Java 11” 
>> for Lucene/Solr 8x or 7x or 6x for that matter. Let’s just get our 
>> collective act together and give some guidance.
>> 
>> 
>> 
>> 
>> 
>> 
> 



Re: ISSUE:solrj “org.apache.solr.common.util.SimpleOrderedMap cannot be cast to java.util.Map” exception when using “/suggest” handler

2019-03-12 Thread Walter Underwood
I’ve responded on Stack Overflow, but questions always go to 
solr-u...@lucene.apache.org, never to this list.

Also, a quick summary of the question in the email would make it more likely 
that you would get help.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 12, 2019, at 2:39 AM, praveenraj 4ever  
> wrote:
> 
> HI Team,
> Can you please look into this issue as raised in StackOverflow.
> 
> https://stackoverflow.com/questions/55115760/solrj-org-apache-solr-common-util-simpleorderedmap-cannot-be-cast-to-java-util
>  
> <https://stackoverflow.com/questions/55115760/solrj-org-apache-solr-common-util-simpleorderedmap-cannot-be-cast-to-java-util>
> 
> 
> Regards,
> Praveenraj D,
> 9566067066



Re: Unicode Quotes in query parser

2019-01-21 Thread Walter Underwood
First, check which transforms are already handled by Unicode normalization. Put 
this in all of your analyzer chains:
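
Something like this (a sketch; requires the analysis-extras contrib on the classpath):

  <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>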

Probably need this in solrconfig.xml:
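
That is, lib directives for the analysis-extras jars, something like:

  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar" />
  <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar" />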


I really cannot think of a reason to use unnormalized Unicode in Solr. That 
should be in all the sample files.

For search character matching, yes, all spaces should be normalized. I have too 
many hacks fixing non-breaking spaces spread around the code. When matching, 
there is zero use for stuff like ideographic space (U+3000).

I’m not sure if quotes are normalized. I did some searching around without 
success. That might come under character folding. There was a draft, now 
withdrawn, for standard character folding. I’d probably start there for a 
Unicode folding char filter.

https://www.unicode.org/reports/tr30/tr30-4.html
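
In the meantime, a mapping char filter is a workable stopgap for the quotes (a sketch; file name assumed):

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-quotes.txt"/>

with mapping-quotes.txt along the lines of:

  "\u201C" => "\""
  "\u201D" => "\""
  "\u00AB" => "\""
  "\u00BB" => "\""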

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 21, 2019, at 7:43 AM, Michael Sokolov  wrote:
> 
> I think this is probably better to discuss on solr-user, or maybe solr-dev, 
> since it is dismax parser you are talking about, which really lives in Solr. 
> However, my 2c  - this seems somewhat dubious. Maybe people want to include 
> those in their terms? Also, it leads to a kind of slippery slope: would you 
> also want to convert all the various white space characters (no-break space, 
> thin space, em space, etc)  as vanilla ascii 32? How about all the other 
> "operator" characters like brackets?
> 
> On Mon, Jan 21, 2019 at 9:50 AM John Ryan  
> wrote:
> I'm looking to create an issue to add support for Unicode Double Quotes to 
> the dismax parser. 
> 
> I want to replace all types of double quotes with standard ones before they 
> get stripped 
> 
> i.e.
> “ ” „ “ „ « » ‟ ❝ ❞ ⹂ "
> 
> With 
> "
> I presume this has been discussed before?
> 
> I have a POC here: 
> https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x 
> <https://github.com/apache/lucene-solr/compare/branch_7x...jnyryan:branch_7x>
> 
> Thanks, 
> 
> John
> 



Re: SOLR: Unnecessary logging

2018-11-27 Thread Walter Underwood
I’m not a big fan of console/file magic. If that is done, there needs to be a 
big WARN at the beginning that says some log messages are suppressed because a 
console has been detected.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 27, 2018, at 6:00 PM, David Smiley  wrote:
> 
> Maybe we can have it both ways: might the distinction be done between console 
> & files?  So for example keep the console more brief, but have the log files 
> contain more logs?  Just an idea.
> 
> On Tue, Nov 27, 2018 at 8:07 AM Erick Erickson  <mailto:erickerick...@gmail.com>> wrote:
> bq: Maybe the real thing to do since everyone's preferences vary is to
> have a --log-config start script option that points solr at one's
> favorite dev/testing/demo oriented logging config
> 
> There's already LOG4J_PROPS so if you have a favorite log4j config you
> want to use, just set that env variable and forget about it, does that
> work?
> 
> I spent a fair amount of time untangling the multiple log config files
> down to two, I would be loathe to add any more config files to the
> release. We could also expand the one we _do_ release for with
> comments for various options...
> 
> Best,
> Erick
> 
> 
> -- 
> Lucene/Solr Search Committer (PMC), Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley 
> <http://linkedin.com/in/davidwsmiley> | Book: 
> http://www.solrenterprisesearchserver.com 
> <http://www.solrenterprisesearchserver.com/>


Re: Lucene/Solr 7.6

2018-11-09 Thread Walter Underwood
Should I try and get a new patch for SOLR-629 working with 7.x?

https://issues.apache.org/jira/browse/SOLR-629

Previous patches have been ignored, so I’d like somebody to promise to look at 
it. Getting those changes into the edismax code makes my brain hurt.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 9, 2018, at 9:43 AM, Chris Hostetter  wrote:
> 
> 
> : I don't have any problem waiting until tomorrow to cut the branch. I
> : noticed a patch was just committed so let me know if you happen to resolve
> : this sooner than expected.
> 
> Heh .. yeah, since I didn't hear back from you I stayed up a little later 
> than usual the last 2 nights re-reviewing everything (that and the 
> patch review from smiley gave me some re-assurance.)
> 
> (As far as SOLR-12962 is concerned) Feel free to cut branch_7_6 whenever 
> you're ready.
> 
> 
> 
> 
> : On Thu, Nov 8, 2018 at 3:59 PM Chris Hostetter 
> : wrote:
> : 
> : >
> : > : Let me know if there are any other blockers that need to be resolved
> : > prior
> : > : to cutting the branch. If not, I will plan to cut the branch on Friday 
> or
> : > : (provided they are close to resolution) whenever these issues are
> : > resolved.
> : >
> : > nknize:  I'd like to try and get SOLR-12962 into 7.6...
> : >
> : > It's a new "opt in" feature, but one that should help prevent a lot of
> : > performance problems for people who want to try it out -- so I'd really
> : > like to release it ASAP so we can hopefully get feedback from folks who
> : > try it in time to consider changing the default to "opt out" ~8.0 (see
> : > SOLR-12963 and parent SOLR-8273 for background)
> : >
> : > The patch is currently feature complete -- the only 'nocommits' are for
> : > updating a small amount of documentation, which I should have done
> : > sometime thursday morning (GMT-0700).
> : >
> : > I'm currently having my machine hammer on the tests, and waiting/hoping
> : > for some patch review / sanity checks.
> : >
> : >
> : > My typically "personal workflow based on comfort level" for a change like
> : > this would be to do nothing but testing & self-review for at least a day
> : > before committing & backporting ... which would mean wrapping it up
> : > sometime friday afternoon (GMT-0700).
> : >
> : > Unless there are any objections, I'd appreciate knowing if either:
> : >
> : >  1) You would be ok holding off on cutting branch_7_6 until
> : > sometime *after* friday ?
> : >
> : >  2) Folks would be ok w/me backporting SOLR-12962 to banch_7_6 after
> : > you fork it (even though it's a new feature, not a blocker bug fix) ?
> : >
> : >
> : > Thoughts? Concerns?
> : >
> : >
> : >
> : >
> : > -Hoss
> : > http://www.lucidworks.com/
> : >
> : >
> : > --
> : 
> : Nicholas Knize, Ph.D., GISP
> : Geospatial Software Guy  |  Elasticsearch
> : Apache Lucene Committer
> : nkn...@apache.org
> : 
> 
> -Hoss
> http://www.lucidworks.com/
> 
> 



[jira] [Commented] (SOLR-629) Fuzzy search with eDisMax request handler

2018-09-24 Thread Walter Underwood (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16625868#comment-16625868
 ] 

Walter Underwood commented on SOLR-629:
---

The patch was for 4.10. Before that, I did it for 3.x and 1.3.

It is on my list to re-port this to 7.x, but I'm getting a little tired of 
re-porting the same code. Someone else has been doing some work on it, I'll 
check with them.

> Fuzzy search with eDisMax request handler
> -
>
> Key: SOLR-629
> URL: https://issues.apache.org/jira/browse/SOLR-629
> Project: Solr
>  Issue Type: Improvement
>  Components: query parsers
>Affects Versions: 7.4
>Reporter: Guillaume Smet
>Priority: Minor
> Attachments: SOLR-629.patch, SOLR-629.patch, 
> dismax_fuzzy_query_field.v0.1.diff, dismax_fuzzy_query_field.v0.1.diff
>
>
> The DisMax search handler doesn't support fuzzy queries which would be quite 
> useful for our usage of Solr and from what I've seen on the list, it's 
> something several people would like to have.
> Following this discussion 
> http://markmail.org/message/tx6kqr7ga6ponefa#query:solr%20dismax%20fuzzy+page:1+mid:c4pciq6rlr4dwtgm+state:results
>  , I added the ability to add fuzzy query field in the qf parameter. I kept 
> the patch as conservative as possible.
> The syntax supported is: fieldOne^2.3 fieldTwo~0.3 fieldThree~0.2^-0.4 
> fieldFour as discussed in the above thread.
> The recursive query aliasing should work even with fuzzy query fields using a 
> very simple rule: the aliased fields inherit the minSimilarity of their 
> parent, combined with their own one if they have one.
> Only the qf parameter supports this syntax atm. I suppose we should make it 
> usable in pf too. Any opinion?
> Comments are very welcome, I'll spend the time needed to put this patch in 
> good shape.
> Thanks.






Re: G1GC collector warning on JavaBugs page

2018-08-02 Thread Walter Underwood
We’ve had G1GC in production with 6.6.2 for nearly a year with Java 1.8.0_121. 
No issues.

That is a 32 node cluster of 36 CPU instances with 1-2 million queries per day. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Aug 2, 2018, at 2:00 PM, David Smiley  wrote:
> 
> +1
> 
> On Thu, Aug 2, 2018 at 3:15 PM Erick Erickson  <mailto:erickerick...@gmail.com>> wrote:
> There's still the very firm warning _not_ to use the G1GC collector,
> and a link to https://bugs.openjdk.java.net/browse/JDK-8038348 
> <https://bugs.openjdk.java.net/browse/JDK-8038348> which
> is marked as fixed "7u25, 8, 9". Should we remove that warning at this
> point?
> 
> 
> -- 
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley 
> <http://linkedin.com/in/davidwsmiley> | Book: 
> http://www.solrenterprisesearchserver.com 
> <http://www.solrenterprisesearchserver.com/>


Re: SynonymGraphFilter followed by StopFilter

2018-07-26 Thread Walter Underwood
Move the synonym filter to the index analyzer chain. That provides better 
performance and avoids some surprising relevance behavior. With synonyms at 
query time, you’ll see different idf for terms in the synonym set, with the 
rare variant scoring higher. That is probably the opposite of what is expected.

Also, phrase synonyms just don’t work at query time because the terms are 
parsed into individual tokens by the query parser, not the tokenizer.

Don’t use stop words. Just remove that line. Removing stop words is a 
performance and space hack that was useful in the 1960’s, but causes problems 
now. I’ve never used stop word removal and I started in search with Infoseek in 
1996. Stop word removal is like a binary idf, ignoring common words. Since we 
have idf, we can give a lower score to common words and keep them in the index. 

Do those two things and it should work as you expect. 
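
Concretely, something like this (a sketch; tokenizer and file names assumed; FlattenGraphFilter is required after graph filters at index time):

  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>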

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 26, 2018, at 3:23 AM, Andrea Gazzarini  wrote:
> 
> Hi Alan, thanks for the response and thank you very much for the pointers
> 
> On 26/07/18 12:16, Alan Woodward wrote:
>> Hi Andrea,
>> 
>> This is a long-standing issue: see 
>> https://issues.apache.org/jira/browse/LUCENE-4065 
>> <https://issues.apache.org/jira/browse/LUCENE-4065> and 
>> https://issues.apache.org/jira/browse/LUCENE-8250 
>> <https://issues.apache.org/jira/browse/LUCENE-8250> for discussion.  I don’t 
>> think we’ve reached a consensus on how to fix it yet, but more examples are 
>> good.
>> 
>> Unfortunately I don’t think changing the StopFilter to ignore SYNONYM tokens 
>> will work, because then you’ll generate queries that always fail - they’ll 
>> search for ‘of’ in the middle of the phrase, but ‘of’ never gets indexed 
>> because it’s removed by the StopFilter at index time.
>> 
>> - Alan
>> 
>>> On 26 Jul 2018, at 08:04, Andrea Gazzarini >> <mailto:a.gazzar...@sease.io>> wrote:
>>> 
>>> Hi, 
>>> I have the following field type definition: 
>>> <fieldType ... autoGeneratePhraseQueries="true">
>>>   <analyzer>
>>>     [tokenizer and earlier filters stripped from the archived message]
>>>     <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="false" expand="true"/>
>>>     <filter class="solr.StopFilterFactory" ... ignoreCase="false"/>
>>>   </analyzer>
>>> </fieldType>
>>> 
>>> 
>>> Where synonyms and stopwords are defined as follows: 
>>> 
>>> synonyms = out of warranty,oow
>>> stopwords = of
>>> 
>>> Running the following query:
>>> 
>>> q=my tv went out of warranty something of
>>> 
>>> I get wrong results, with the following explain: 
>>> 
>>> title:my title:tv title:went (title:oow PhraseQuery(title:"out ? warranty 
>>> something"))
>>> 
>>> That is, the synonyms is correctly detected, I see the graph information 
>>> are correctly reported in the positionLength, it seems they are wrongly 
>>> interpreted by the QueryParser. 
>>> I guess the reason is the "of" removal operated by the StopFilter, which 
>>> removes the "of" term within the phrase (I wouldn't want that)
>>> creates a "hole" in the span defined by the "oow" term, which has been 
>>> marked as a synonym with a positionLength = 3, therefore including the next 
>>> available term (something). 
>>> I tried to change the StopFilter in order to ignore stopwords that are 
>>> marked as SYNONYM or that are part of a previous synonym span, and it 
>>> works: it correctly produces the following query: 
>>> 
>>> title:my title:tv title:went (title:oow PhraseQuery(title:"out of 
>>> warranty")) title:something
>>> 
>>> So I'd like to ask your opinion about this. Am I missing something? Do you 
>>> think it's better to open a JIRA issue? If the solution is a graph aware 
>>> stop filter, do you think it's better to change the existing filter or to 
>>> subclass it?
>>> 
>>> Best, 
>>> Andrea
>>> 
>>> 
>> 
> 



Re: I am closing Resolved issues

2018-07-05 Thread Walter Underwood
Thanks.

Update the version if that helps people. It is a new feature and still hasn’t 
been implemented or obsoleted.

There are enough changes in edismax that we’ll need to re-implement the patch.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 5, 2018, at 2:51 PM, Alexandre Rafalovitch  wrote:
> 
> Yes, I skipped over it in the manual review. Would it make sense to
> update the version affected to the latest though if the issue is still
> open? I've also updated the issue title to say eDisMax as requested.
> 
> Regards,
>   Alex.
> 
> On 5 July 2018 at 17:46, Walter Underwood  wrote:
>> Please do not close SOLR-629. I’ve been submitting patches and trying to get
>> someone to commit that for years. This is blocking us from upgrading one of
>> our clusters from 4.10.4.
>> 
>> https://issues.apache.org/jira/browse/SOLR-629
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> On Jul 5, 2018, at 2:33 PM, Alexandre Rafalovitch 
>> wrote:
>> 
>> Ok. I found Bulk Transition to Close workflow.
>> 
>> Would it make sense to transition all Solr 'Resolved' issues older
>> than (say) 600 days to 'Closed' state? project = SOLR AND status =
>> Resolved AND "Last public comment date" <= -600d ? Without emails, as
>> I suspect I already annoyed a bunch of people with my cleanup.
>> 
>> That's 1500 issues we are really unlikely to touch ever again.
>> 
>> Thoughts?
>> 
>> Regards,
>>  Alex.
>> 
>> 
>> On 5 July 2018 at 14:32, Alexandre Rafalovitch  wrote:
>> 
>> Do I have a JIRA permission/procedure to do the bulk closures? I
>> wasn't sure. And definitely did not want to step on any RM feet, given
>> that I know nothing about those processes.
>> 
>> As is, for now, I am mostly trying to clean out the ancient weeds,
>> those that just get in the way.
>> 
>> Regards,
>>  Alex.
>> 
>> On 5 July 2018 at 14:27, David Smiley  wrote:
>> 
>> Thanks for doing the JIRA gardening.
>> 
>> By the way, there were past releases where the RM apparently forgot to
>> perform the step of bulk closing resolved issues.  That can be done without
>> sending a notification, by the way -- thus no list noise.  You might want to
>> start with rectifying this?
>> 
>> ~ David
>> 
>> On Thu, Jul 5, 2018 at 1:49 PM Alexandre Rafalovitch 
>> wrote:
>> 
>> 
>> Hi,
>> 
>> I am manually reviewing and closing issues marked Resolved against 7.4
>> and earlier as well as in versions unknown.
>> 
>> I am leaving those I can't quite figure out the "true final" status of
>> as is. Like a patch is in the tree, but it is not marked with a
>> version and nothing in CHANGES file.
>> 
>> I hope this approach is ok. I apologize in advance if I mis-read
>> somebody's intention. Feel free to 'reopened' the case (Jira flow
>> state transition name needs fixing too).
>> 
>> Please let me know if you disagree with this cleanup and/or if you
>> have a better algorithm to improve searches/statistics for
>> still-active issues.
>> 
>> Regards,
>>  Alex.
>> 
>> 
>> --
>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
>> http://www.solrenterprisesearchserver.com
>> 
>> 
>> 
>> 
> 
> 



Re: I am closing Resolved issues

2018-07-05 Thread Walter Underwood
Please do not close SOLR-629. I’ve been submitting patches and trying to get 
someone to commit that for years. This is blocking us from upgrading one of our 
clusters from 4.10.4.

https://issues.apache.org/jira/browse/SOLR-629

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jul 5, 2018, at 2:33 PM, Alexandre Rafalovitch  wrote:
> 
> Ok. I found Bulk Transition to Close workflow.
> 
> Would it make sense to transition all Solr 'Resolved' issues older
> than (say) 600 days to 'Closed' state? project = SOLR AND status =
> Resolved AND "Last public comment date" <= -600d ? Without emails, as
> I suspect I already annoyed a bunch of people with my cleanup.
> 
> That's 1500 issues we are really unlikely to touch ever again.
> 
> Thoughts?
> 
> Regards,
>   Alex.
> 
> 
> On 5 July 2018 at 14:32, Alexandre Rafalovitch  wrote:
>> Do I have a JIRA permission/procedure to do the bulk closures? I
>> wasn't sure. And definitely did not want to step on any RM feet, given
>> that I know nothing about those processes.
>> 
>> As is, for now, I am mostly trying to clean out the ancient weeds,
>> those that just get in the way.
>> 
>> Regards,
>>   Alex.
>> 
>> On 5 July 2018 at 14:27, David Smiley  wrote:
>>> Thanks for doing the JIRA gardening.
>>> 
>>> By the way, there were past releases where the RM apparently forgot to
>>> perform the step of bulk closing resolved issues.  That can be done without
>>> sending a notification, by the way -- thus no list noise.  You might want to
>>> start with rectifying this?
>>> 
>>> ~ David
>>> 
>>> On Thu, Jul 5, 2018 at 1:49 PM Alexandre Rafalovitch 
>>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I am manually reviewing and closing issues marked Resolved against 7.4
>>>> and earlier as well as in versions unknown.
>>>> 
>>>> I am leaving those I can't quite figure out the "true final" status of
>>>> as is. Like a patch is in the tree, but it is not marked with a
>>>> version and nothing in CHANGES file.
>>>> 
>>>> I hope this approach is ok. I apologize in advance if I mis-read
>>>> somebody's intention. Feel free to 'reopened' the case (Jira flow
>>>> state transition name needs fixing too).
>>>> 
>>>> Please let me know if you disagree with this cleanup and/or if you
>>>> have a better algorithm to improve searches/statistics for
>>>> still-active issues.
>>>> 
>>>> Regards,
>>>>   Alex.
>>>> 
>>>> 
>>> --
>>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
>>> http://www.solrenterprisesearchserver.com
> 
> 



Re: Solr search implement in magento 1

2018-06-28 Thread Walter Underwood
This is Solr used inside the Magento platform. I recommend asking in a Magento 
group.

https://community.magento.com/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 28, 2018, at 7:40 AM, David Smiley  wrote:
> 
> Hello,
> 
> This is the "dev" list for Lucene/Solr which is for the internals of 
> Lucene/Solr, not how to use Lucene/Solr.  Please join and post to the "Solr 
> User" list: http://lucene.apache.org/solr/community.html#mailing-lists-irc 
> <http://lucene.apache.org/solr/community.html#mailing-lists-irc>
> 
> BTW when you re-ask, consider trying to improve the wording, maybe get input 
> from a colleague who speaks English better.  I don't understand your inquiry.
> 
> ~ David
> 
> On Thu, Jun 28, 2018 at 10:23 AM Amit Hazra  <mailto:amit.ha...@navsoft.in>> wrote:
> 
> Hi,
> 
> I have implement a solr search in magento 1.8.
> 
> But client Wants that some custom product attribute search as per attribute 
> value ="Parent", as per requirements i have merge query with custom atrribute 
> value = "Parent" and get the proper result.
> 
> And now all the product those have attribute value select Parent are coming.
> 
> Not: But everyday search result rebase to default all result, and when i have 
> save  all those product again from magento admin, it will come proper parent 
> base result.
> And i have to do it regularly.
> 
> So please tell why it will rebase search data everyday. And how to solve?.
> 
> 
> 
> -- 
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley 
> <http://linkedin.com/in/davidwsmiley> | Book: 
> http://www.solrenterprisesearchserver.com 
> <http://www.solrenterprisesearchserver.com/>


[jira] [Commented] (SOLR-12413) Solr ignores aliases.json from ZooKeeper at startup

2018-05-29 Thread Walter Underwood (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494436#comment-16494436
 ] 

Walter Underwood commented on SOLR-12413:
-

We upload configuration files to zookeeper, then use a RELOAD command to the 
collections API. We don't use aliases.

We keep the files under version control, so loading them to zookeeper works for 
us.
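
Roughly this sequence (a sketch; host names, config name, and collection name illustrative):

{code}
bin/solr zk upconfig -z zk1:2181 -n myconfig -d ./conf
curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection'
{code}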

> Solr ignores aliases.json from ZooKeeper at startup
> ---
>
> Key: SOLR-12413
> URL: https://issues.apache.org/jira/browse/SOLR-12413
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCloud
>Affects Versions: 7.2.1
> Environment: A SolrCloud cluster with ZooKeeper (one node is enough 
> to reproduce).
> Solr 7.2.1.
> ZooKeeper 3.4.6.
>Reporter: Gaël Jourdan
>Priority: Major
>
> Since upgrading to 7.2.1, we ran into an issue where Solr ignores 
> the _aliases.json_ file stored in ZooKeeper.
>  
> +Steps to reproduce the problem:+
>  # SolrCloud cluster is down
>  # Direct update of _aliases.json_ file in ZooKeeper with Solr ZkCLI *without 
> using Collections API* :
>  ** {{java ... org.apache.solr.cloud.ZkCLI -zkhost ... -cmd clear 
> /aliases.json}}
>  ** {{java ... org.apache.solr.cloud.ZkCLI -zkhost ... -cmd put /aliases.json 
> "new content"}}
>  # SolrCloud cluster is started => _aliases.json_ not taken into account
>  
> +Analysis:+ 
> Digging a bit in the code, what is actually causing the issue is that, when 
> starting, Solr now checks for the metadata of the _aliases.json_ file and if 
> the version metadata from ZooKeeper is lower or equal to local version, it 
> keeps the local version.
> When it starts, Solr has a local version of 0 for the aliases but ZooKeeper 
> also has a version of 0 of the file because we just recreated it. So Solr 
> ignores ZooKeeper configuration and never has a chance to load aliases.
>  
> Relevant parts of Solr code are:
>  * 
> [https://github.com/apache/lucene-solr/blob/branch_7_2/solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java]
>  : line 1562 : method setIfNewer
> {code:java}
> /**
> * Update the internal aliases reference with a new one, provided that its ZK 
> version has increased.
> *
> * @param newAliases the potentially newer version of Aliases
> */
> private boolean setIfNewer(Aliases newAliases) {
>   synchronized (this) {
>     int cmp = Integer.compare(aliases.getZNodeVersion(), 
> newAliases.getZNodeVersion());
>     if (cmp < 0) {
>   LOG.debug("Aliases: cmp={}, new definition is: {}", cmp, newAliases);
>   aliases = newAliases;
>   this.notifyAll();
>       return true;
>     } else {
>   LOG.debug("Aliases: cmp={}, not overwriting ZK version.", cmp);
>       assert cmp != 0 || Arrays.equals(aliases.toJSON(), newAliases.toJSON()) 
> : aliases + " != " + newAliases;
>     return false;
>     }
>   }
> }{code}
>  * 
> [https://github.com/apache/lucene-solr/blob/branch_7_2/solr/solrj/src/java/org/apache/solr/common/cloud/Aliases.java]
>  : line 45 : the "empty" Aliases object with default version 0
> {code:java}
> /**
> * An empty, minimal Aliases primarily used to support the non-cloud solr use 
> cases. Not normally useful
> * in cloud situations where the version of the node needs to be tracked even 
> if all aliases are removed.
> * A version of 0 is provided rather than -1 to minimize the possibility that 
> if this is used in a cloud
> * instance data is written without version checking.
> */
> public static final Aliases EMPTY = new Aliases(Collections.emptyMap(), 
> Collections.emptyMap(), 0);{code}
>  
> Note that a workaround is to force ZooKeeper to always have a version greater 
> than 0 for _aliases.json_ file (for instance by not clearing the file and 
> just overwriting it again and again).






[jira] [Commented] (SOLR-12278) Ignore very large document on indexing

2018-04-30 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458631#comment-16458631
 ] 

Walter Underwood commented on SOLR-12278:
-

Finding the size should be much quicker than the time already spent receiving 
the update request, parsing it, and creating the Solr input document. This is 
optimizing the wrong thing.

If you want to reject large requests, configure Jetty to reject large requests.
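
For example, Solr already exposes upload caps in solrconfig.xml that bound the accepted request size before any document parsing (Solr-side rather than Jetty proper; values illustrative):

{code:xml}
<requestDispatcher>
  <requestParsers enableRemoteStreaming="false"
                  multipartUploadLimitInKB="2048"
                  formdataUploadLimitInKB="2048"/>
</requestDispatcher>
{code}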

> Ignore very large document on indexing
> --
>
> Key: SOLR-12278
> URL: https://issues.apache.org/jira/browse/SOLR-12278
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-12278.patch, SOLR-12278.patch
>
>
> Solr should be able to ignore very large document, so it won't affect the 
> index as well as the tlog. 






[jira] [Commented] (SOLR-12278) Ignore very large document on indexing

2018-04-26 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454329#comment-16454329
 ] 

Walter Underwood commented on SOLR-12278:
-

This can be done easily in an update request processor script. Get the field, 
check the size, if it is over the limit return false.
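
A sketch of such a script for StatelessScriptUpdateProcessorFactory (field name and limit assumed):

{code}
function processAdd(cmd) {
  var body = cmd.solrDoc.getFieldValue("body");
  if (body != null && body.length() > 1000000) {
    return false;  // skip this document; the rest of the chain is not run
  }
}
{code}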

> Ignore very large document on indexing
> --
>
> Key: SOLR-12278
> URL: https://issues.apache.org/jira/browse/SOLR-12278
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Cao Manh Dat
>Priority: Major
> Attachments: SOLR-12278.patch
>
>
> Solr should be able to ignore very large document, so it won't affect the 
> index as well as the tlog. 






[jira] [Commented] (SOLR-11971) CVE-2018-1308: XXE attack through DIH's dataConfig request parameter

2018-04-10 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16432354#comment-16432354
 ] 

Walter Underwood commented on SOLR-11971:
-

Sorry, figured it out after I posted that.

 

Does the default config enable the dataimporthandler? We only have it enabled 
on the very old clusters (Solr 3). This is a fine excuse for shutting them down.

> CVE-2018-1308: XXE attack through DIH's dataConfig request parameter
> 
>
> Key: SOLR-11971
> URL: https://issues.apache.org/jira/browse/SOLR-11971
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 6.6.3, 7.3, master (8.0)
>
> Attachments: ApacheSolrDIH-XXE.pdf, SOLR-11971.patch
>
>
> We got a security report about an XXE attack when using the 
> {{dataConfig}} request parameter of Solr's DataImportHandler. See the attached PDF 
> file with full details (I converted it to PDF, originally it was a DOC file).






[jira] [Commented] (SOLR-11971) CVE-2018-1308: XXE attack through DIH's dataConfig request parameter

2018-04-09 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430990#comment-16430990
 ] 

Walter Underwood commented on SOLR-11971:
-

Is there a workaround for this that does not require upgrading?

> CVE-2018-1308: XXE attack through DIH's dataConfig request parameter
> 
>
> Key: SOLR-11971
> URL: https://issues.apache.org/jira/browse/SOLR-11971
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - DataImportHandler
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 6.6.3, 7.3, master (8.0)
>
> Attachments: ApacheSolrDIH-XXE.pdf, SOLR-11971.patch
>
>
> We got a security report about an XXE attack when using the 
> {{dataConfig}} request parameter of Solr's DataImportHandler. See the attached PDF 
> file with full details (I converted it to PDF, originally it was a DOC file).






Re: solr

2018-04-05 Thread Walter Underwood
But…why do you want an obsolete version of Solr? 

4.3.1 is from almost five years ago!

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 5, 2018, at 9:05 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
> On 4/5/2018 7:32 AM, Steve Rowe wrote:
>> You can find all past versions here: 
>> http://archive.apache.org/dist/lucene/solr/
> 
> Also, the source code for releases back to 3.1.0 is definitely included in a 
> checkout from the git repository as tag branches.  So if you do a "git 
> clone", you'll have all of that.
> 
> https://wiki.apache.org/solr/HowToContribute#Getting_the_source_code
> 
> Before that release, Solr was in a separate repository from Lucene, but it 
> does look like there might be tags for older releases in the main repository.
> 
> Thanks,
> Shawn
> 
> 
> 



[jira] [Commented] (SOLR-12059) Unable to rename solr.xml

2018-03-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16401447#comment-16401447
 ] 

Walter Underwood commented on SOLR-12059:
-

Changing the file name for version control is not a good reason. I've been 
using version control systems that allow keeping the same file name since, um, 
1981. Yeah, that was SCCS in Unix v6/PWB.

> Unable to rename solr.xml
> -
>
> Key: SOLR-12059
> URL: https://issues.apache.org/jira/browse/SOLR-12059
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.5.1
> Environment: Renaming of solr.xml in the $SOLR_HOME directory
>Reporter: Edwin Yeo Zheng Lin
>Priority: Major
>
> I am able to rename the file names like solrconfig.xml and solr.log to custom 
> names like myconfig.xml and my.log quite seamlessly. 
> However, I am not able to rename the same for solr.xml. Understand that the 
> solr.xml is hard-coded at the SolrXmlConfig.java. Meaning it requires a 
> re-compile of the Jar file in order to rename it.
> Since we can rename files like solrconfig.xml from the properties files, 
> shouldn't we be able to do the same for solr.xml?
>  
>  






[jira] [Comment Edited] (SOLR-11874) Add ulimit recommendations to the "Taking Solr to Production" section in the ref guide

2018-02-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365869#comment-16365869
 ] 

Walter Underwood edited comment on SOLR-11874 at 2/15/18 4:41 PM:
--

This is what we do for each new Amazon instance. (Sorry, had to edit this three 
times to get Jira to not corrupt the commands)

 
{code}
# Increase number of open files. must be run as root
sudo su -
echo '* hard nofile 50' >> /etc/security/limits.conf
echo '* soft nofile 50' >> /etc/security/limits.conf
echo 'root hard nofile 50' >> /etc/security/limits.conf
echo 'root soft nofile 50' >> /etc/security/limits.conf
echo 'fs.file-max = 2097152' >> /etc/sysctl.conf
# set the processes/threads limit
sudo vi /etc/security/limits.d/20-nproc.conf
# * soft nproc 122944
{code}


was (Author: wunder):
This is what we do for each new Amazon instance.

 
 {{}}
{code:java}
# Increase number of open files. must be run as root
sudo su -
echo '* hard nofile 50' >> /etc/security/limits.conf
echo '* soft nofile 50' >> /etc/security/limits.conf
echo 'root hard nofile 50' >> /etc/security/limits.conf
echo 'root hard nofile 50' >> /etc/security/limits.conf
echo 'fs.file-max = 2097152' >> /etc/sysctl.conf
# set the processes/threads limit
sudo vi /etc/security/limits.d/20-nproc.conf
# * soft nproc 122944{code}

> Add ulimit recommendations to the "Taking Solr to Production" section in the 
> ref guide
> --
>
> Key: SOLR-11874
> URL: https://issues.apache.org/jira/browse/SOLR-11874
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Minor
>
> Just noticed that we never mention appropriate ulimits in the ref guide 
> except for one spot when talking about cfs files.
> Anyone who wants to pick this up feel free. Otherwise I'll get to this 
> probably over the weekend.






[jira] [Comment Edited] (SOLR-11874) Add ulimit recommendations to the "Taking Solr to Production" section in the ref guide

2018-02-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365869#comment-16365869
 ] 

Walter Underwood edited comment on SOLR-11874 at 2/15/18 4:40 PM:
--

This is what we do for each new Amazon instance.

 
 {{}}
{code:java}
# Increase number of open files. must be run as root
sudo su -
echo '* hard nofile 50' >> /etc/security/limits.conf
echo '* soft nofile 50' >> /etc/security/limits.conf
echo 'root hard nofile 50' >> /etc/security/limits.conf
echo 'root hard nofile 50' >> /etc/security/limits.conf
echo 'fs.file-max = 2097152' >> /etc/sysctl.conf
# set the processes/threads limit
sudo vi /etc/security/limits.d/20-nproc.conf
# * soft nproc 122944{code}


was (Author: wunder):
This is what we do for each new Amazon instance.

 
{{# Increase number of open files. must be run as root}}
{{sudo su -}}
{{echo }}{{'* hard nofile 50'}} {{>> /etc/security/limits.conf}}
{{echo }}{{'* soft nofile 50'}} {{>> /etc/security/limits.conf}}
{{echo }}{{'root hard nofile 50'}} {{>> /etc/security/limits.conf}}
{{echo }}{{'root hard nofile 50'}} {{>> /etc/security/limits.conf}}
{{echo }}{{'fs.file-max = 2097152'}} {{>> /etc/sysctl.conf}}
{{# set the processes/threads limit}}
{{sudo vi /etc/security/limits.d/}}{{20}}{{-nproc.conf}}
{{# * soft nproc }}{{122944}}
 

> Add ulimit recommendations to the "Taking Solr to Production" section in the 
> ref guide
> --
>
> Key: SOLR-11874
> URL: https://issues.apache.org/jira/browse/SOLR-11874
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Minor
>
> Just noticed that we never mention appropriate ulimits in the ref guide 
> except for one spot when talking about cfs files.
> Anyone who wants to pick this up feel free. Otherwise I'll get to this 
> probably over the weekend.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11874) Add ulimit recommendations to the "Taking Solr to Production" section in the ref guide

2018-02-15 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365869#comment-16365869
 ] 

Walter Underwood commented on SOLR-11874:
-

This is what we do for each new Amazon instance.

 
{{# Increase number of open files. must be run as root}}
{{sudo su -}}
{{echo }}{{'* hard nofile 50'}} {{>> /etc/security/limits.conf}}
{{echo }}{{'* soft nofile 50'}} {{>> /etc/security/limits.conf}}
{{echo }}{{'root hard nofile 50'}} {{>> /etc/security/limits.conf}}
{{echo }}{{'root hard nofile 50'}} {{>> /etc/security/limits.conf}}
{{echo }}{{'fs.file-max = 2097152'}} {{>> /etc/sysctl.conf}}
{{# set the processes/threads limit}}
{{sudo vi /etc/security/limits.d/}}{{20}}{{-nproc.conf}}
{{# * soft nproc }}{{122944}}
 

> Add ulimit recommendations to the "Taking Solr to Production" section in the 
> ref guide
> --
>
> Key: SOLR-11874
> URL: https://issues.apache.org/jira/browse/SOLR-11874
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: documentation
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Minor
>
> Just noticed that we never mention appropriate ulimits in the ref guide 
> except for one spot when talking about cfs files.
> Anyone who wants to pick this up feel free. Otherwise I'll get to this 
> probably over the weekend.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Logging panel in new UI?

2018-01-23 Thread Walter Underwood
In 6.5.1, when the logging panel in the new UI refreshes, it closes any stack 
trace that has been opened up. This makes me always use the old UI for viewing 
the log.

Has this been reported and/or fixed?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: performance drop on 27 oct?

2017-11-14 Thread Walter Underwood
The other approach would be to do equality tests with a fuzz factor, because 
floating point is like that. But that would probably make things slower.

Here is an example of fuzzy equals:

https://github.com/OpenGamma/Strata/blob/master/modules/math/src/test/java/com/opengamma/strata/math/impl/FuzzyEquals.java
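
The core of it, as a minimal sketch (tolerances illustrative, not OpenGamma's actual constants):

class FuzzyEquals {
    // Relative tolerance with an absolute floor near zero.
    static boolean fuzzyEquals(double a, double b) {
        if (a == b) return true;                    // exact match, also handles infinities
        double diff = Math.abs(a - b);
        double norm = Math.max(Math.abs(a), Math.abs(b));
        return diff <= 1e-14 * norm || diff <= Double.MIN_NORMAL;
    }
}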

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Nov 14, 2017, at 8:57 AM, Chris Hostetter <hossman_luc...@fucit.org> wrote:
> 
> : In the BM25 case, scores would decrease in some situations with very
> : high TF values because of floating point issues, e.g. so
> : score(freq=100,000) would be unexpectedly less than
> : score(freq=99,999), all other things being equal. There may be other
> : ways to re-arrange the code to avoid this problem, feel free to open
> : an issue if you can optimize the code better while still behaving
> : properly!
> 
> i don't have any idea how to optimize the current code, and I am 
> completely willing to believe the changes in LUCENE-7997 are an 
> improvement in terms of correctness -- which is certainly more important 
> than performance -- I just wanted to point out that Alan's observation 
> about LUCENE-8018 being the only commit around the time the performance 
> graphs dip wasn't accurate before anyone started ripping their hair out 
> trying to explain it.
> 
> If you think the float/double math in LUCENE-7997 might explain the change 
> in mike's graphs, then maybe mike can annotate them to record that?
> 
> (Wild spitballing idea: would be worthwhile to offer an 
> "ImpreciseBM25Similarity" that used floats instead of doubles for people 
> who want to eke out every last bit of performance -- provided it was 
> heavily documented with caveats regarding inaccurate scores due to 
> rounding errors?)
> 
> 
> -Hoss
> http://www.lucidworks.com/
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 



Re: Solr 7 default Response now JSON instead of XML causing issues

2017-10-17 Thread Walter Underwood
In your request handlers, add:

  <str name="wt">xml</str>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 2, 2017, at 4:48 AM, Roland Villemoes <r...@alpha-solutions.us> wrote:
> 
> Hi 
>  
> Default response in Solr 7 is now JSON instead of XML 
> (https://issues.apache.org/jira/browse/SOLR-10494 
> <https://issues.apache.org/jira/browse/SOLR-10494>)
>  
> We are using a system that use the Solr admin/cores api for core status etc. 
> and we can’t really change that system. That system expects the XML response. 
> And as far as I can see default also changed to JSON there.
>  
> So: 
>  
> Are there any way I can change the admin/cores API back to responses using 
> XML instead of JSON?
>  
>  
> /Roland Villemoes



Re: Release 7.0 process starts

2017-09-20 Thread Walter Underwood
I scan it for all big changes, not just features. Is a version of Java dropped 
in this release? That sort of thing should be included.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 20, 2017, at 9:12 AM, Anshum Gupta <ansh...@apple.com> wrote:
> 
> Do you mean to remove the note about upgrading completely? Currently it’s 
> just a pointer to the CHANGES and recommends users to go through it before 
> upgrading.
> 
> -Anshum
> 
> 
> 
>> On Sep 20, 2017, at 8:34 AM, Joel Bernstein <joels...@gmail.com 
>> <mailto:joels...@gmail.com>> wrote:
>> 
>> I think the release highlights are about what's exciting in the release. So 
>> leading with the most exciting features is the way to go. Informing people 
>> of changes that will affect them can be done in the upgrade notes in 
>> CHANGES.txt.
>> 
>> What do other people think about this?
>> 
>> Joel Bernstein
>> http://joelsolr.blogspot.com/ <http://joelsolr.blogspot.com/>
>> 
>> On Wed, Sep 20, 2017 at 11:21 AM, Anshum Gupta <ansh...@apple.com 
>> <mailto:ansh...@apple.com>> wrote:
>> Also, I think it might make sense to add a line saying that the Ref Guide 
>> for 7.0 would be released soon.
>> 
>> -Anshum
>> 
>> 
>> 
>>> On Sep 20, 2017, at 8:20 AM, Anshum Gupta <ansh...@apple.com 
>>> <mailto:ansh...@apple.com>> wrote:
>>> 
>>> Sounds good.
>>> 
>>> Also, I am not a java expert like Uwe, and a few others here so let me know 
>>> if we should leave in the ‘Jigsaw’ part.
>>> 
>>> David, you added that yesterday and Mike looked at the Lucene release notes 
>>> and let it stay there. So I was wondering if it’s important/reasonable 
>>> enough to highlight in the release notes.
>>> 
>>> -Anshum
>>> 
>>> 
>>> 
>>>> On Sep 20, 2017, at 8:12 AM, Joel Bernstein <joels...@gmail.com 
>>>> <mailto:joels...@gmail.com>> wrote:
>>>> 
>>>> I would also consider changing the order of the list to highlight the most 
>>>> interesting features.
>>>> 
>>>> If I saw this as the top highlight I would think of this is mainly a 
>>>> maintenance release. 
>>>> 
>>>> 
>>>> Indented JSON is now the default response format for all APIs,
>>>>   pass wt=xml and/or indent=off to use the previous unindented XML format.
>>>> 
>>>> 
>>>> 
>>>> Joel Bernstein
>>>> http://joelsolr.blogspot.com/ <http://joelsolr.blogspot.com/>
>>>> 
>>>> On Wed, Sep 20, 2017 at 11:09 AM, Joel Bernstein <joels...@gmail.com 
>>>> <mailto:joels...@gmail.com>> wrote:
>>>> I just made the edit.
>>>> 
>>>> Joel Bernstein
>>>> http://joelsolr.blogspot.com/ <http://joelsolr.blogspot.com/>
>>>> 
>>>> On Wed, Sep 20, 2017 at 11:06 AM, Joel Bernstein <joels...@gmail.com 
>>>> <mailto:joels...@gmail.com>> wrote:
>>>> For streaming expressions let's go with:
>>>> 
>>>> Solr 7 Streaming Expressions adds a new statistical programming syntax for
>>>> the statistical analysis of sql queries, random samples, time series and
>>>> graph result sets.
>>>> 
>>>> 
>>>> Joel Bernstein
>>>> http://joelsolr.blogspot.com/ <http://joelsolr.blogspot.com/>
>>>> 
>>>> On Wed, Sep 20, 2017 at 11:01 AM, Christine Poerschke (BLOOMBERG/ LONDON) 
>>>> <cpoersc...@bloomberg.net <mailto:cpoersc...@bloomberg.net>> wrote:
>>>> Cool. How about 7th and 8th bullet points like this. 8th bullet ending in 
>>>> Java 9 future magic still, not that the magic counts but fitting things on 
>>>> roughly a screen full for folks to easily get the gist of the new release 
>>>> is important I think.
>>>> 
>>>> -Christine
>>>> 
>>>> * Solr 7 adds Streaming Expressions, a new statistical programming syntax 
>>>> for
>>>>   the statistical analysis of sql queries, random samples, time series and
>>>>   graph result sets.
>>>> 
>>>> * Solr 7 is tested with and verified to support Java 9
>>>> 
>>>> From: dev@lucene.apache.org <mailto:dev@lucene.apache.org> At: 09/20/17 
>>>> 15:54:54
>>>> To:  Christine Poerschke (BLOOMBERG/ LONDON )  

Re: Release 7.0 process starts

2017-09-20 Thread Walter Underwood
I’m fine with more than seven bullet points. In fact, when I see a list of 8 or 
9 things, I know it is a real list, and not someone just trying to get to the 
magic 7 or 10.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 20, 2017, at 7:39 AM, Joel Bernstein <joels...@gmail.com> wrote:
> 
> I think Solr 7 is a truly exciting release stacked with interesting new 
> features. So let's not drop off anything really interesting just to keep it 
> to 7 bullet points. But we should prune the list to include the high impact 
> features. 
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/ <http://joelsolr.blogspot.com/>
> 
> On Wed, Sep 20, 2017 at 10:21 AM, Christine Poerschke (BLOOMBERG/ LONDON) 
> <cpoersc...@bloomberg.net <mailto:cpoersc...@bloomberg.net>> wrote:
> Totally agree with choosing _7_ highlights for the Solr _7_ release!
> 
> Below is the revised draft I came up with:
> 
> (Notice that v2 is the 2nd bullet, though I think it yet needs to mention one 
> or _two_ benefits of using the new API especially since we mention that 
> /solr/ continues to work.)
> 
> (Also notice some re-ordering of the bullets starting with the used-by-many 
> JSON first, then v2 API second, then third collection creation which mentions 
> faceting and so leads over to the fourth bullet re: facet refinement. Fifth 
> is the new replica types (that bullet being slightly longer than the others 
> to explain what the types are about). Sixth is auto-scaling which mentions 
> future releases (would folks use new replica types first before moving on to 
> auto-scaling?). Seventh and last then is Solr _7_ mention with Java _9_ i.e. 
> the just-arrived future again there.)
> 
> Solr 7.0 Release Highlights:
> 
> * Indented JSON is now the default response format for all APIs,
>   pass wt=xml and/or indent=off to use the previous unindented XML format.
> 
> * The new v2 API, exposed at /api/ and also supported via SolrJ, is now the
>   preferred API, but /solr/ continues to work.
> 
> * A new `_default` configset is used if no config is specified at collection
>   creation. The data-driven functionality of this configset indexes strings as
>   analyzed text while at the same time copying to a `*_str` field suitable for
>   faceting.
> 
> * The JSON Facet API now supports two-phase facet refinement to ensure 
> accurate
>   counts and statistics for facet buckets returned in distributed mode.
> 
> * Replica Types - Solr 7 supports different replica types, which handle 
> updates
>   differently. In addition to pure NRT operation where all replicas build an
>   index and keep a replication log, you can now also add so called PULL
>   replicas, achieving the read-speed optimized benefits of a master/slave
>   setup while at the same time keeping index redundancy.
> 
> * Auto-scaling. Solr can now allocate new replicas to nodes using a new auto
>   scaling policy framework. This framework will in future releases enable Solr
>   to move shards around based on load, disk etc.
> 
> * Solr 7 is tested with and verified to support Java 9.
> 
> From: dev@lucene.apache.org <mailto:dev@lucene.apache.org> At: 09/20/17 
> 15:02:38
> To:  dev@lucene.apache.org <mailto:dev@lucene.apache.org>
> Subject: Re: Release 7.0 process starts
> 
> 
> On Wed, Sep 20, 2017 at 9:16 AM Jan Høydahl <jan@cominvent.com 
> <mailto:jan@cominvent.com>> wrote:
> And please, I was serious about choosing 7 major features and not adding 
> random single improvements. The list has already crept from 7 to 9 bullets. 
> If you want to add something, then ask youself which of the other bullets 
> that are less important to MOST USERS and then replace that bullet instead of 
> adding more. Agree?
> 
> I agree with that very much!  Each bullet added de-values the list as a 
> whole.  IMO the Java 9 bullet can be removed (too few are even using it yet) 
> and we get to 8 bullets; and those 8 are pretty good. 
> -- 
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley 
> <http://linkedin.com/in/davidwsmiley> | Book: 
> http://www.solrenterprisesearchserver.com 
> <http://www.solrenterprisesearchserver.com/>



Re: Pathological index condition

2017-08-28 Thread Walter Underwood
That makes sense.

I guess the alternative would be to occasionally roll the dice and decide to 
merge that kind of segment.
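
Roughly this (numbers purely illustrative, not actual TieredMergePolicy code):

import java.util.Random;

class DiceMergeCheck {
    private static final Random RANDOM = new Random();

    // A segment with more than 50% live docs is normally never merge-eligible;
    // roll the dice and let one through a small fraction of the time anyway.
    static boolean mergeEligible(double liveDocsFraction) {
        return liveDocsFraction < 0.50 || RANDOM.nextDouble() < 0.01;
    }
}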

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 28, 2017, at 1:28 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> I don't think jitter would help. As long as a segment has > 50% max
> segment size "live" docs, it's forever ineligible for merging (outside
> optimize or expungeDeletes commands). So the "zone" is anything over
> 50%.
> 
> Or I missed your point.
> 
> Erick
> 
> On Mon, Aug 28, 2017 at 12:50 PM, Walter Underwood
> <wun...@wunderwood.org> wrote:
>> If this happens in a precise zone, how about adding some random jitter to
>> the threshold? That tends to get this kind of lock-up unstuck.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>> On Aug 28, 2017, at 12:44 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>> And one more thought (not very well thought out).
>> 
>> A parameter on TMP (or whatever) that did <3> something like:
>> 
>> a parameter 
>> a parameter 
>> On startup TMP takes the current timestamp
>> 
>> *> Every minute (or whatever) it checks the current timestamp and if
>>  is in between the last check time and now, do <2>.
>> 
>> set the last checked time to the value from * above.
>> 
>> 
>> Taking the current timestamp would keep from kicking off the compaction
>> on startup, so we wouldn't need to keep some stateful information
>> across restarts and wouldn't go into a compact cycle on startup.
>> 
>> Erick
>> 
>> On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson
>> <erickerick...@gmail.com> wrote:
>> 
>> I've been thinking about this a little more. Since this is an outlier,
>> I'm loath to change the core TMP merge selection process. Say the max
>> segment size is 5G. You'd be doing an awful lot of I/O to merge a
>> segment with 4.75G "live" docs with one with 0.25G. Plus that doesn't
>> really allow users who issue the tempting "optimize" command to
>> recover; that one huge segment can hang around for a _very_ long time,
>> accumulating lots of deleted docs. Same with expungeDeletes.
>> 
>> I can think of several approaches:
>> 
>> 1> despite my comment above, a flag that says something like "if a
>> segment has > X% deleted docs, merge it with a smaller segment anyway
>> respecting the max segment size. I know, I know this will affect
>> indexing throughput, do it anyway".
>> 
>> 2> A special op (or perhaps a flag on expungeDeletes) that would
>> behave like <1> but on-demand rather than part of standard merging.
>> 
>> In both of these cases, if a segment had > X% deleted docs but the
>> live doc size for that segment was > the max seg size, rewrite it into
>> a single new segment removing deleted docs.
>> 
>> 3> some way to do the above on a schedule. My notion is something like
>> a maintenance window at 1:00 AM. You'd still have a live collection,
>> but (presumably) a way to purge the day's accumulation of deleted
>> documents during off hours.
>> 
>> 4> ???
>> 
>> I probably like <2> best so far, I don't see this condition in the
>> wild very often it usually occurs during heavy re-indexing operations
>> and often after an optimize or expungeDeletes has happened. <1> could
>> get horribly pathological if the threshold was 1% or something.
>> 
>> WDYT?
>> 
>> 
>> On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>> 
>> Thanks Mike:
>> 
>> bq: Or are you saying that each segments 20% of not-deleted docs is
>> still greater than 1/2 of the max segment size, and so TMP considers
>> them ineligible?
>> 
>> Exactly.
>> 
>> Hadn't seen the blog, thanks for that. Added to my list of things to refer
>> to.
>> 
>> The problem we're seeing is that "in the wild" there are cases where
>> people can now get satisfactory performance from huge numbers of
>> documents, as in close to 2B (there was a question on the user's list
>> about that recently). So allowing up to 60% deleted documents is
>> dangerous in that situation.
>> 
>> And the situation is exacerbated by optimizing (I know, "don't do that").
>> 
>>> Ah, well, the joys of people using this open source thing and pushing
>>> its limits.

Re: Pathological index condition

2017-08-28 Thread Walter Underwood
If this happens in a precise zone, how about adding some random jitter to the 
threshold? That tends to get this kind of lock-up unstuck.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 28, 2017, at 12:44 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> And one more thought (not very well thought out).
> 
> A parameter on TMP (or whatever) that did <3> something like:
>> a parameter 
>> a parameter 
>> On startup TMP takes the current timestamp
> *> Every minute (or whatever) it checks the current timestamp and if
>  is in between the last check time and now, do <2>.
>> set the last checked time to the value from * above.
> 
> Taking the current timestamp would keep from kicking off the compaction
> on startup, so we wouldn't need to keep some stateful information
> across restarts and wouldn't go into a compact cycle on startup.
> 
> Erick
> 
> On Sun, Aug 27, 2017 at 11:31 AM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>> I've been thinking about this a little more. Since this is an outlier,
>> I'm loath to change the core TMP merge selection process. Say the max
>> segment size is 5G. You'd be doing an awful lot of I/O to merge a
>> segment with 4.75G "live" docs with one with 0.25G. Plus that doesn't
>> really allow users who issue the tempting "optimize" command to
>> recover; that one huge segment can hang around for a _very_ long time,
>> accumulating lots of deleted docs. Same with expungeDeletes.
>> 
>> I can think of several approaches:
>> 
>> 1> despite my comment above, a flag that says something like "if a
>> segment has > X% deleted docs, merge it with a smaller segment anyway
>> respecting the max segment size. I know, I know this will affect
>> indexing throughput, do it anyway".
>> 
>> 2> A special op (or perhaps a flag on expungeDeletes) that would
>> behave like <1> but on-demand rather than part of standard merging.
>> 
>> In both of these cases, if a segment had > X% deleted docs but the
>> live doc size for that segment was > the max seg size, rewrite it into
>> a single new segment removing deleted docs.
>> 
>> 3> some way to do the above on a schedule. My notion is something like
>> a maintenance window at 1:00 AM. You'd still have a live collection,
>> but (presumably) a way to purge the day's accumulation of deleted
>> documents during off hours.
>> 
>> 4> ???
>> 
>> I probably like <2> best so far, I don't see this condition in the
>> wild very often it usually occurs during heavy re-indexing operations
>> and often after an optimize or expungeDeletes has happened. <1> could
>> get horribly pathological if the threshold was 1% or something.
>> 
>> WDYT?
>> 
>> 
>> On Wed, Aug 9, 2017 at 2:40 PM, Erick Erickson <erickerick...@gmail.com> 
>> wrote:
>>> Thanks Mike:
>>> 
>>> bq: Or are you saying that each segments 20% of not-deleted docs is
>>> still greater than 1/2 of the max segment size, and so TMP considers
>>> them ineligible?
>>> 
>>> Exactly.
>>> 
>>> Hadn't seen the blog, thanks for that. Added to my list of things to refer 
>>> to.
>>> 
>>> The problem we're seeing is that "in the wild" there are cases where
>>> people can now get satisfactory performance from huge numbers of
>>> documents, as in close to 2B (there was a question on the user's list
>>> about that recently). So allowing up to 60% deleted documents is
>>> dangerous in that situation.
>>> 
>>> And the situation is exacerbated by optimizing (I know, "don't do that").
>>> 
>>> Ah, well, the joys of people using this open source thing and pushing
>>> its limits.
>>> 
>>> Thanks again,
>>> Erick
>>> 
>>> On Tue, Aug 8, 2017 at 3:49 PM, Michael McCandless
>>> <luc...@mikemccandless.com> wrote:
>>>> Hi Erick,
>>>> 
>>>> Some questions/answers below:
>>>> 
>>>> On Sun, Aug 6, 2017 at 8:22 PM, Erick Erickson <erickerick...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Particularly interested if Mr. McCandless has any opinions here.
>>>>> 
>>>>> I admit it took some work, but I can create an index that never merges
>>>>> and is 80% deleted documents using TieredMergePolicy.
>>>>> 
>>>>> I'm trying to understand

[jira] [Comment Edited] (SOLR-9458) DocumentDictionaryFactory StackOverflowError on many documents

2017-08-16 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129210#comment-16129210
 ] 

Walter Underwood edited comment on SOLR-9458 at 8/16/17 6:29 PM:
-

I'm getting the same failure using FileDictionaryFactory with 6.5.1.


{code:xml}
   
  concepts_fuzzy
  FuzzyLookupFactory
  true
  suggest-concepts.txt
  suggest_subjects_infix
  text_lower
  1
  true
  false
  false
  true

{code}



was (Author: wunder):
I'm getting the same failure using FileDictionaryFactory.


{code:xml}
   
  concepts_fuzzy
  FuzzyLookupFactory
  true
  suggest-concepts.txt
  suggest_subjects_infix
  text_lower
  1
  true
  false
  false
  true

{code}


> DocumentDictionaryFactory StackOverflowError on many documents
> --
>
> Key: SOLR-9458
> URL: https://issues.apache.org/jira/browse/SOLR-9458
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Suggester
>Affects Versions: 6.1, 6.2
>Reporter: Chris de Kok
>
> When using the FuzzyLookupFactory in combination with the 
> DocumentDictionaryFactory it will throw a stackoverflow trying to build the 
> dictionary.
> Using the HighFrequencyDictionaryFactory works ok but behaves very differently.
> ```
> 
> 
> suggest
> suggestions
> suggestions
> FuzzyLookupFactory
> DocumentDictionaryFactory
> suggest_fuzzy
> true
> false
> false
> true
> 0
> 
> 
> null:java.lang.StackOverflowError
>   at 
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)
>   at 
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)
>   at 
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)
>   at 
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9458) DocumentDictionaryFactory StackOverflowError on many documents

2017-08-16 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129210#comment-16129210
 ] 

Walter Underwood commented on SOLR-9458:


I'm getting the same failure using FileDictionaryFactory.


{code:xml}
   
  concepts_fuzzy
  FuzzyLookupFactory
  true
  suggest-concepts.txt
  suggest_subjects_infix
  text_lower
  1
  true
  false
  false
  true

{code}


> DocumentDictionaryFactory StackOverflowError on many documents
> --
>
> Key: SOLR-9458
> URL: https://issues.apache.org/jira/browse/SOLR-9458
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Suggester
>Affects Versions: 6.1, 6.2
>Reporter: Chris de Kok
>
> When using the FuzzyLookupFactory in combination with the 
> DocumentDictionaryFactory it will throw a stackoverflow trying to build the 
> dictionary.
> Using the HighFrequencyDictionaryFactory works ok but behaves very differently.
> ```
> 
> 
> suggest
> suggestions
> suggestions
> FuzzyLookupFactory
> DocumentDictionaryFactory
> suggest_fuzzy
> true
> false
> false
> true
> 0
> 
> 
> null:java.lang.StackOverflowError
>   at 
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)
>   at 
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)
>   at 
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)
>   at 
> org.apache.lucene.util.automaton.Operations.topoSortStatesRecurse(Operations.java:1311)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-11196) Solr 6.5.0 consuming entire Heap suddenly while working smoothly on Solr 6.1.0

2017-08-04 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114510#comment-16114510
 ] 

Walter Underwood commented on SOLR-11196:
-

Ah, missed that openSearcher was false.

This host is named production-solr-master, so it might be master-slave. 

> Solr 6.5.0 consuming entire Heap suddenly while working smoothly on Solr 6.1.0
> --
>
> Key: SOLR-11196
> URL: https://issues.apache.org/jira/browse/SOLR-11196
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.5, 6.6
>Reporter: Amit
>Priority: Critical
>
> Please note, this issue does not occur on Solr-6.1.0 while the same occurs 
> on Solr-6.5.0 and above. To fix this we had to move back to Solr-6.1.0 
> version.
> We have been hit by a Solr Behavior in production which we are unable to 
> debug. To start with here are the configurations for solr:
> Solr Version: 6.5, Master with 1 Slave of the same configuration as mentioned 
> below.
> *JVM Config:*
>   
> {code:java}
>  -Xms2048m
>  -Xmx4096m
>  -XX:+ParallelRefProcEnabled
>  -XX:+UseCMSInitiatingOccupancyOnly
>  -XX:CMSInitiatingOccupancyFraction=50
> {code}
> Rest all are default values.
> *Solr Config:*
>  
> {code:java}
>
>   
>   {solr.autoCommit.maxTime:30}
>   false
> 
> 
> 
>   {solr.autoSoftCommit.maxTime:90}
> 
> 
> 
>   1024
>autowarmCount="0" />
>autowarmCount="0" />
>autowarmCount="0" />
>initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
>   true
>   20
>   ${solr.query.max.docs:40}
>   
>   false
>   2
> 
> {code}
> *The Host (AWS) configurations are:*
> RAM: 7.65GB
> Cores: 4
> Now, our solr works perfectly fine for hours and sometimes for days but 
> sometimes suddenly memory jumps up and the GC kicks in causing long big 
> pauses with not much to recover. We are seeing this happening most often when 
> one or multiple segments get added or deleted after a hard commit. It doesn't 
> matter how many documents got indexed. The attached images show that just 1 
> document was indexed, causing an addition of one segment and it all got 
> messed up till we restarted the Solr.
> Here are the images from NewRelic and Sematext (Kindly click on the links to 
> view):
> [JVM Heap Memory Image | https://i.stack.imgur.com/9dQAy.png]
> [1 Document and 1 Segment addition Image | 
> https://i.stack.imgur.com/6N4FC.png]
> Update: Here is the JMap output when SOLR last died, we have now increased 
> the JVM memory to xmx of 12GB:
>  
> {code:java}
>  num #instances #bytes  class name
>   --
>   1:  11210921 1076248416  
> org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat$IntBlockTermState
>   2:  10623486  934866768  [Lorg.apache.lucene.index.TermState;
>   3:  15567646  475873992  [B
>   4:  10623485  424939400  
> org.apache.lucene.search.spans.SpanTermQuery$SpanTermWeight
>   5:  15508972  372215328  org.apache.lucene.util.BytesRef
>   6:  15485834  371660016  org.apache.lucene.index.Term
>   7:  15477679  371464296  
> org.apache.lucene.search.spans.SpanTermQuery
>   8:  10623486  339951552  org.apache.lucene.index.TermContext
>   9:   1516724  150564320  [Ljava.lang.Object;
>  10:724486   50948800  [C
>  11:   1528110   36674640  java.util.ArrayList
>  12:849884   27196288  
> org.apache.lucene.search.spans.SpanNearQuery
>  13:582008   23280320  
> org.apache.lucene.search.spans.SpanNearQuery$SpanNearWeight
>  14:481601   23116848  org.apache.lucene.document.FieldType
>  15:623073   19938336  org.apache.lucene.document.StoredField
>  16:721649   17319576  java.lang.String
>  17: 327297329640  [J
>  18: 146435788376  [F
> {code}
> The load on Solr is not much - max it goes to 2000 requests per minute. The 
> indexing load can sometimes be in burst but most of the time its pretty low. 
> But as mentioned above sometimes even a single document indexing can put solr 
> into tizzy and sometimes it just works like a charm.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: [jira] [Updated] (SOLR-11196) Solr 6.5.0 consuming entire Heap suddenly while working smoothly on Solr 6.1.0

2017-08-04 Thread Walter Underwood
1. This should be a question on the solr-u...@lucene.apache.org, not a bug 
report.

2. A 12 GB heap on an instance with 7.65 GB of RAM is a fatal configuration. A 
full GC will cause lots of swapping and an extreme slowdown.

3. A 4 GB heap on an instance with 7.65 GB of RAM is not a good configuration. 
That does not leave enough room for the OS, other processes, and file buffers 
to cache Solr’s index files.

4. That instance is pretty small for Solr. The smallest AWS instance we run has 
15 GB of RAM. We run an 8 GB heap. Check the disk access on New Relic during 
the slowdown. 

5. Does this instance swap to magnetic disk? Are the Solr indexes on magnetic 
ephemeral or magnetic EBS? Check the iops on New Relic. When you hit the max 
iops for a disk volume, very bad performance things happen.

6. Set -Xms equal to -Xmx. Growing the heap to max at startup is a waste of 
time and makes Solr slow at the beginning. The heap will always get to max.

7. Setting a longer time for auto soft commit than for auto hard commit is 
nonsense. Just don’t do the soft commit.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 4, 2017, at 7:13 AM, Amit (JIRA) <j...@apache.org> wrote:
> 
> 
> [ 
> https://issues.apache.org/jira/browse/SOLR-11196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>  ]
> 
> Amit updated SOLR-11196:
> 
>Description: 
> Please note, this issue does not occur on Solr-6.1.0 while the same occurs 
> on Solr-6.5.0 and above. To fix this we had to move back to Solr-6.1.0 
> version.
> 
> We have been hit by a Solr Behavior in production which we are unable to 
> debug. To start with here are the configurations for solr:
> 
> Solr Version: 6.5, Master with 1 Slave of the same configuration as mentioned 
> below.
> 
> *JVM Config:*
> 
> 
> {code:java}
> -Xms2048m
> -Xmx4096m
> -XX:+ParallelRefProcEnabled
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSInitiatingOccupancyFraction=50
> {code}
> 
> Rest all are default values.
> 
> *Solr Config:*
> 
> 
> {code:java}
>   
>  
>  {solr.autoCommit.maxTime:30}
>  false
>
>
>
>  {solr.autoSoftCommit.maxTime:90}
>
>
> 
>
>  1024
>   autowarmCount="0" />
>   autowarmCount="0" />
>   autowarmCount="0" />
>   initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
>  true
>  20
>  ${solr.query.max.docs:40}
>  
>  false
>  2
>
> {code}
> 
> *The Host (AWS) configurations are:*
> 
> RAM: 7.65GB
> Cores: 4
> 
> Now, our solr works perfectly fine for hours and sometimes for days but 
> sometimes suddenly memory jumps up and the GC kicks in causing long big 
> pauses with not much to recover. We are seeing this happening most often when 
> one or multiple segments get added or deleted after a hard commit. It doesn't 
> matter how many documents got indexed. The attached images show that just 1 
> document was indexed, causing an addition of one segment and it all got 
> messed up till we restarted the Solr.
> 
> Here are the images from NewRelic and Sematext (Kindly click on the links to 
> view):
> 
> JVM Heap Memory Image : [https://i.stack.imgur.com/9dQAy.png]
> 1 Document and 1 Segment addition Image: [https://i.stack.imgur.com/6N4FC.png]
> 
> Update: Here is the JMap output when SOLR last died, we have now increased 
> the JVM memory to xmx of 12GB:
> 
> 
> {code:java}
> num #instances #bytes  class name
>  --
>  1:  11210921 1076248416  
> org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat$IntBlockTermState
>  2:  10623486  934866768  [Lorg.apache.lucene.index.TermState;
>  3:  15567646  475873992  [B
>  4:  10623485  424939400  
> org.apache.lucene.search.spans.SpanTermQuery$SpanTermWeight
>  5:  15508972  372215328  org.apache.lucene.util.BytesRef
>  6:  15485834  371660016  org.apache.lucene.index.Term
>  7:  15477679  371464296  org.apache.lucene.search.spans.SpanTermQuery
>  8:  10623486  339951552  org.apache.lucene.index.TermContext
>  9:   1516724  150564320  [Ljava.lang.Object;
> 10:724486   50948800  [C
> 11:   1528110   36674640  java.util.ArrayList
> 12:849884   27196288  org.apache.lucene.search.spans.SpanNearQuery
> 13:582008   23280320  
> org.apache.lucene.search.spans.SpanNearQuery$SpanNearWeight
> 14:481601   23116848  org.apache.lucene.

Re: Searching multiple terms and mapping each term with their search result.

2017-08-02 Thread Walter Underwood
I’d do multiple queries in parallel. It is super easy with Solr. Send each 
query request, then go back and read each response. They happen in parallel 
without any threads in the client.

If you want the total, then do another query. It will share cached posting 
lists with the other queries, so they’ll all be fast together.
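
With SolrJ that might look like the sketch below. The classic client blocks, so this approximates the parallelism with a small thread pool; the URL, field name, and terms are placeholders.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParallelTermQueries {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();
        List<String> terms = Arrays.asList("wunder", "wundermap", "wunderground");
        ExecutorService pool = Executors.newFixedThreadPool(terms.size());

        // Send all query requests first...
        List<Future<QueryResponse>> futures = new ArrayList<>();
        for (String term : terms) {
            futures.add(pool.submit(() -> solr.query(new SolrQuery("title:" + term))));
        }
        // ...then go back and read each response.
        for (int i = 0; i < terms.size(); i++) {
            System.out.println(terms.get(i) + ": "
                + futures.get(i).get().getResults().getNumFound() + " hits");
        }
        pool.shutdown();
        solr.close();
    }
}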

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 2, 2017, at 3:11 PM, Michael McCandless <luc...@mikemccandless.com> 
> wrote:
> 
> Do you really need to know which term/s caused a given hit for every single 
> hit, or only for the top N hits?
> 
> If you must know this for every single hit, you should make your own 
> Collector and use the Scorer.getChildren API to gather the sub-scorer for 
> each term, and on collecting each hit, look at those children to see which 
> ones matched the current hit.
> 
> If it's only for the top N hits, it's more efficient to do a 2nd step with 
> those hits where you test each term separately to see if they match it.
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com <http://blog.mikemccandless.com/>
> On Wed, Aug 2, 2017 at 1:48 AM, Sarthak Sugandhi <sarthaksugand...@gmail.com 
> <mailto:sarthaksugand...@gmail.com>> wrote:
> Hi Fellow Luceneers,
> 
> I am trying to search multiple terms in an index at one go and I want to 
> differentiate search results for each term all in the same query.
> 
> Could you suggest what is the class/method name.
> 
> Thanks,
> Sarthak
> 
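
A minimal sketch of the custom-Collector approach Mike describes above (Lucene 6.x-era APIs; the class name and bookkeeping are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.SimpleCollector;

public class MatchingTermsCollector extends SimpleCollector {
    final Map<Integer, List<String>> matchesPerDoc = new HashMap<>();
    private final List<Scorer> subScorers = new ArrayList<>();
    private int docBase;

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // Gather the per-term sub-scorers for this segment.
        subScorers.clear();
        for (Scorer.ChildScorer child : scorer.getChildren()) {
            subScorers.add(child.child);
        }
    }

    @Override
    public void collect(int doc) throws IOException {
        List<String> matched = new ArrayList<>();
        for (Scorer sub : subScorers) {
            if (sub.docID() == doc) {  // positioned on this doc => this clause matched
                matched.add(sub.getWeight().getQuery().toString());
            }
        }
        matchesPerDoc.put(docBase + doc, matched);
    }

    @Override
    protected void doSetNextReader(LeafReaderContext context) {
        docBase = context.docBase;
    }

    @Override
    public boolean needsScores() {
        return true;  // the sub-scorers must actually advance for the docID() test
    }
}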



Re: merge policy vs commit rates

2017-08-01 Thread Walter Underwood
Optimizing for frequent changes sounds like a caching strategy, maybe “LRU 
merging”. Perhaps prefer merging segments that have not changed in a while?
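
For the commit-rate idea in the quoted thread below, a rough sketch of where such a check could hook in (Lucene 6/7-era MergePolicyWrapper API; the bookkeeping and threshold are illustrative and untested):

import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergePolicyWrapper;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;

public class RateAwareMergePolicy extends MergePolicyWrapper {
    private final AtomicLong lastCheckNanos = new AtomicLong(System.nanoTime());
    private final long quietNanos;  // merge only if checks are at least this far apart

    public RateAwareMergePolicy(MergePolicy in, long quietMillis) {
        super(in);
        this.quietNanos = quietMillis * 1_000_000L;
    }

    @Override
    public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                         IndexWriter writer) throws IOException {
        long now = System.nanoTime();
        long last = lastCheckNanos.getAndSet(now);
        if (now - last < quietNanos) {
            return null;  // checks arriving fast => commit rate is high, skip merging
        }
        return in.findMerges(trigger, infos, writer);
    }
}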

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 1, 2017, at 5:50 AM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> 
> 
> 
> Il giorno mar 1 ago 2017 alle ore 14:04 Adrien Grand <jpou...@gmail.com 
> <mailto:jpou...@gmail.com>> ha scritto:
> The trade-off does not sound simple to me. This approach could lead to having 
> more segments overall, making search requests and updates potentially slower 
> and more I/O-intensive since they have to iterate over more segments? I'm not 
> saying this is a bad idea, but it could have unexpected side-effects.
> 
> yes, that's my same concern.
>  
> 
> Do you actually have a high commit rate or a high reopen rate 
> (DirectoryReader.open(IndexWriter))?
> 
> in my scenario both, but the commit rate far exceeds the reopen rate. 
>  
> Maybe reopening instead of committing (and still committing, but less 
> frequently) would decrease the I/O load since NRT segments might never need 
> to be actually written to disk if they are merged before the next commit 
> happens and you give enough memory to the filesystem cache.
> 
> makes sense in general, however I am a bit constrained in how much I can 
> avoid committing (states from an MVCC system are tied to commits, so it's 
> trickier).
> 
> In general I was wondering if we could have the merge policy look at both 
> the commit rate and the number of segments, and decide whether to merge based 
> on both, so that if segment growth stays within a threshold we can skip some 
> merges under high commit rates, though as you say we may then have to do 
> bigger merges. 
> I can imagine this to make more sense when a lot of tiny changes are made to 
> the index rather than a few big ones (then the bigger merges problem should 
> be less significant).
> 
> Other than my specific scenario, I am thinking that we can look again at the 
> current MP algorithm and see if we can improve it, or make it more flexible 
> to the way the "sneaky opponent" (Mike's ™ [1]) behaves.
> 
> My 2 cents,
> Tommaso
> 
> [1] : 
> http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
>  
> <http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html>
>  
> 
> Le mar. 1 août 2017 à 10:59, Tommaso Teofili <tommaso.teof...@gmail.com 
> <mailto:tommaso.teof...@gmail.com>> a écrit :
> Hi all,
> 
> lately I am looking a bit closer at merge policies, of course particularly at 
> the tiered one, and I was wondering if we can mitigate the amount of possibly 
> avoidable merges in high commit rates scenarios, especially when a high 
> percentage of the commits happens on same docs.
> I've observed several evolutions of merges in such scenarios and it seemed to 
> me the merge policy was too aggressive in merging, causing a large IO 
> overhead.
> I've then tried the same with a merge policy which was tentatively looking at 
> commit rates and skipping merges if such a rate is higher than a threshold 
> which seemed to give slightly better results in reducing the unneeded IO 
> caused by avoidable merges.
> 
> I know this is a bit abstract but I would like to know if anyone has any 
> ideas or plans about mitigating the merge overhead in general and / or in 
> similar cases.
> 
> Regards,
> Tommaso
> 
> 
> 



SOLR-629 fuzzy in edismax for 7.0?

2017-06-28 Thread Walter Underwood
I’m working on updating the 4.10.4 patch for SOLR-629 to work on master.

If anyone is familiar with the guts of the edismax query parser, I might need 
some help. It seems to take me a week to figure out that code every time I 
update this patch.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




[jira] [Commented] (SOLR-10531) JMX cache beans names / properties changed in 6.4

2017-06-27 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065018#comment-16065018
 ] 

Walter Underwood commented on SOLR-10531:
-

New Relic fixed their problem. The fix is in the 3.40.0 Java agent.

https://docs.newrelic.com/docs/release-notes/agent-release-notes/java-release-notes

> JMX cache beans names / properties changed in 6.4
> -
>
> Key: SOLR-10531
> URL: https://issues.apache.org/jira/browse/SOLR-10531
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 6.4, 6.5
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: branch_6_3.png, branch_6x.png
>
>
> As reported by [~wunder]:
> {quote}
> New Relic displays the cache hit rate for each collection, showing the query 
> result cache, filter cache, and document cache.
> With 6.5.0, that page shows this message:
> New Relic recorded no Solr caches data for this application in the last 
> 24 hours
> If you think there should be Solr data here, first check to see that JMX 
> is enabled for your application server. If enabled, then please contact 
> support.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10531) JMX cache beans names / properties changed in 6.4

2017-06-13 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048235#comment-16048235
 ] 

Walter Underwood commented on SOLR-10531:
-

Finally got some time to open a case with New Relic. That is here:

https://discuss.newrelic.com/t/solr-cache-monitoring-not-working-with-solr-6-4-or-later/49658

Let's reopen this Jira so New Relic can help troubleshoot.

> JMX cache beans names / properties changed in 6.4
> -
>
> Key: SOLR-10531
> URL: https://issues.apache.org/jira/browse/SOLR-10531
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 6.4, 6.5
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: branch_6_3.png, branch_6x.png
>
>
> As reported by [~wunder]:
> {quote}
> New Relic displays the cache hit rate for each collection, showing the query 
> result cache, filter cache, and document cache.
> With 6.5.0, that page shows this message:
> New Relic recorded no Solr caches data for this application in the last 
> 24 hours
> If you think there should be Solr data here, first check to see that JMX 
> is enabled for your application server. If enabled, then please contact 
> support.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10531) JMX cache beans names / properties changed in 6.4

2017-05-30 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16030221#comment-16030221
 ] 

Walter Underwood commented on SOLR-10531:
-

I'm getting our people to open a case with New Relic about this problem. It 
might be in their code, but they are the ones who can figure all that out.

When that is done, I'll link this back to that case and reopen it.

> JMX cache beans names / properties changed in 6.4
> -
>
> Key: SOLR-10531
> URL: https://issues.apache.org/jira/browse/SOLR-10531
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 6.4, 6.5
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: branch_6_3.png, branch_6x.png
>
>
> As reported by [~wunder]:
> {quote}
> New Relic displays the cache hit rate for each collection, showing the query 
> result cache, filter cache, and document cache.
> With 6.5.0, that page shows this message:
> New Relic recorded no Solr caches data for this application in the last 
> 24 hours
> If you think there should be Solr data here, first check to see that JMX 
> is enabled for your application server. If enabled, then please contact 
> support.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Why no composite primary-key in lucene ?

2017-04-29 Thread Walter Underwood
If you do want a composite key in Solr, you could use an update request 
processor script to make it out of the multiple fields.
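
A rough sketch of the same idea as a custom processor in Java (the field names and the "!" separator are invented for illustration; a script via StatelessScriptUpdateProcessorFactory would do the equivalent in a few lines of JavaScript):

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class CompositeKeyProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                // Build the uniqueKey from two other fields before indexing.
                doc.setField("id",
                    doc.getFieldValue("org") + "!" + doc.getFieldValue("docnum"));
                super.processAdd(cmd);
            }
        };
    }
}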

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 29, 2017, at 11:02 AM, Yonik Seeley <ysee...@gmail.com> wrote:
> 
> On Sat, Apr 29, 2017 at 1:45 PM, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
>> @Yonik
>> 
>> Thanks, makes sense. So this means that the 'id' needs to be indexed (is it
>> always?), (so you can get/update/delete docs not in translog), right?
> 
> In Solr, yes.  In Lucene, only if you want lookup-by-id to be fast, or
> if you want to use updateDocument with an indexed term for overwriting
> documents.
> 
> -Yonik
> 
> 
>> On Sat, Apr 29, 2017 at 7:24 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>>> 
>>> Solr doesn't use Lucene for RT GET, it uses it's transaction log.
>>> Only when the document is not found in the transaction log will it go
>>> and consult the lucene index (which can only search as of the last
>>> commit).
>>> 
>>> -Yonik
>>> 
>>> On Sat, Apr 29, 2017 at 12:57 PM, Dorian Hoxha <dorian.ho...@gmail.com>
>>> wrote:
>>>> I know all that. My point is, lucene is NRT, while GET is RT (in both
>>>> ES/SOLR). How does lucene return the right document (Term Query) before
>>>> doing a commit on GET ?
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>> 
>> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 



Re: SolrCloud "master mode" planned?

2017-04-26 Thread Walter Underwood
We will never use managed configs.

Other things we needed to do to get here.

* solr.in.sh is under source control.
* etc/jetty.xml is under source control so we can put the logs in /solr.
* Each solrconfig.xml has /solr/… hardcoded as the data directory.
* solr.in.sh has some dancing around to pass in the right properties for the 
metrics destination and New Relic. This avoids maintaining separate configs for 
dev, test, and prod. Solr could be better about properties vs host settings vs 
JVM settings. Putting all of them in solr.in.sh is sloppy.

Here is what we append to solr.in.sh. Line wrapping might make this look odd 
for you.

# Generate a Graphite prefix from the hostname and environment portion of the 
domain
environment=`hostname | cut -d . -f 2`
base_hostname=`hostname -s`
graphite_prefix="${environment}.${base_hostname}"
# Default values are the settings for dev3.
graphite_host="metrics.test3.cheggnet.com"
ZK_HOST="zookeeper-eb342a2d.dev3.cloud.cheggnet.com:2181/solr-cloud"
if [[ "$environment" = 'test3' ]]
then
# dev3 and test3 share the same Graphite metrics host
ZK_HOST='zookeeper1.test3.cloud.cheggnet.com:2181,zookeeper2.test3.cloud.cheggnet.com:2181,zookeeper3.test3.cloud.cheggnet.com:2181/solr-cloud'
fi
if [[ "$environment" = 'prod2' ]]
then
graphite_host='metrics-eng.prod.cheggnet.com'
ZK_HOST='thor-zk01.prod2.cloud.cheggnet.com:2181,thor-zk02.prod2.cloud.cheggnet.com:2181,thor-zk03.prod2.cloud.cheggnet.com:2181/solr-cloud'
fi
SOLR_OPTS="$SOLR_OPTS -Dgraphite_prefix=${graphite_prefix}"
SOLR_OPTS="$SOLR_OPTS -Dgraphite_host=${graphite_host}"
SOLR_OPTS="$SOLR_OPTS -javaagent:/apps/solr6/newrelic/newrelic.jar"
SOLR_OPTS="$SOLR_OPTS -Dnewrelic.environment=${environment}"
ENABLE_REMOTE_JMX_OPTS="true"
SOLR_LOGS_DIR=/solr/logs

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 26, 2017, at 3:25 PM, Otis Gospodnetić <otis.gospodne...@gmail.com> 
> wrote:
> 
> Right, that "bin/solr zk start" is sort of how one could present that to 
> users.  I took the liberty of creating 
> https://issues.apache.org/jira/browse/SOLR-10573 
> <https://issues.apache.org/jira/browse/SOLR-10573> after not being able to 
> find any such issues (yet hearing about such ideas at Lucene Revolution).
> 
> Ishan & Co, you may want to link other related issues or use e.g. "hideZK" 
> label and treat SOLR-10573 just as an umbrella?
> 
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ 
> <http://sematext.com/>
> 
> 
> On Wed, Apr 26, 2017 at 4:35 PM, Upayavira <u...@odoko.co.uk 
> <mailto:u...@odoko.co.uk>> wrote:
> I have done a *lot* of automating this. Redoing it recently it was quite 
> embarrassing to realise how much complexity there is involved in it - it is 
> crazy hard to get a basic, production ready SolrCloud setup running.
> 
> One thing that is hard is getting a ZooKeeper ensemble going - using 
> Exhibitor makes it much easier.
> 
> Something that has often occurred to me is, why do we require people to go 
> download a separate ZooKeeper, and work out how to install and configure it, 
> when we have it embedded already? Why can't we just have a 'bin/solr zk 
> start' command which starts an "embedded" zookeeper, but without Solr? To 
> really make it neat, we offer some way (a la Exhibitor) for multiple 
> concurrently started ZK nodes to autodiscover each other, then getting our 
> three ZK nodes up won't be quite so treacherous.
> 
> Just a thought.
> 
> Upayavira
> 
> On Wed, 26 Apr 2017, at 03:58 PM, Mike Drob wrote:
>> Could the zk role also be guaranteed to run the Overseer (and no 
>> collections)? If we already have that separated out, it would make sense to 
>> put it with the embedded zk. I think you can already configure and place 
>> things manually this way, but it would be a huge win to package it all up 
>> nicely for users and set it to turnkey operation.
>> 
>> I think it was a great improvement for deployment when we dropped tomcat, 
>> this is the next logical step.
>> 
>> Mike
>> 
>> On Wed, Apr 26, 2017, 4:22 AM Jan Høydahl <jan@cominvent.com 
>> <mailto:jan@cominvent.com>> wrote:
>> There have been suggestions to add a “node controller” process which again 
>> could start Solr and perhaps ZK on a node.
>> 
>> But adding a new “zk” role which would let that node start (embedded) ZK I 
>> cannot recall. It would of course make a deploy simpler if ZK was hidden as 
>> a solr role/feature and perhaps assigned to 

Re: Release 6.6

2017-04-25 Thread Walter Underwood
I’m a little tired of re-implementing the patch. I did it for 1.3, 3.x, and 
4.x. Perhaps someone more familiar with edismax could take a look.

With the 100X speedup for fuzzy in 4.x, should be widely useful. 

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 25, 2017, at 4:57 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
> 
> Probably no bribe needed, but an updated patch would be a good start (or will 
> the 2.5 year old patch still apply and be acceptable as-is?)
> 
>   Erik
> 
>> On Apr 25, 2017, at 7:52 PM, Walter Underwood <wun...@wunderwood.org 
>> <mailto:wun...@wunderwood.org>> wrote:
>> 
>> Who do I have to bribe to get SOLR-629 included?
>> 
>> https://issues.apache.org/jira/browse/SOLR-629 
>> <https://issues.apache.org/jira/browse/SOLR-629>
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>> 
>> 
>>> On Apr 25, 2017, at 10:46 AM, Ishan Chattopadhyaya 
>>> <ichattopadhy...@gmail.com <mailto:ichattopadhy...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> We have lots of changes, optimizations, bug fixes for 6.6. I'd like to 
>>> propose a 6.6 release (perhaps the last 6x non-bug-fix release before 7.0 
>>> release).
>>> 
>>> I can volunteer to release this, and I can cut the release branch around 
>>> 4th May, in order to let some time for devs to finish current issues.
>>> 
>>> What do you think?
>>> 
>>> Regards,
>>> Ishan
>>> 
>> 
> 



Re: Release 6.6

2017-04-25 Thread Walter Underwood
Who do I have to bribe to get SOLR-629 included?

https://issues.apache.org/jira/browse/SOLR-629 
<https://issues.apache.org/jira/browse/SOLR-629>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 25, 2017, at 10:46 AM, Ishan Chattopadhyaya 
> <ichattopadhy...@gmail.com> wrote:
> 
> Hi,
> We have lots of changes, optimizations, bug fixes for 6.6. I'd like to 
> propose a 6.6 release (perhaps the last 6x non-bug-fix release before 7.0 
> release).
> 
> I can volunteer to release this, and I can cut the release branch around 4th 
> May, in order to let some time for devs to finish current issues.
> 
> What do you think?
> 
> Regards,
> Ishan
> 



[jira] [Commented] (SOLR-10531) JMX cache beans names / properties changed in 6.4

2017-04-20 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15977084#comment-15977084
 ] 

Walter Underwood commented on SOLR-10531:
-

This is from a discussion on the New Relic support forum.

Aug 30, 2016 04:05:40 -0700 [26 60] com.newrelic.agent.jmx.JmxService FINER: 
JMX Service : MBeans query solr*:type=updateHandler,*, matches 8
Aug 30, 2016 04:05:40 -0700 [26 60] com.newrelic.agent.jmx.JmxService FINER: 
JMX Service : MBeans query solr*:type=documentCache,*, matches 8
Aug 30, 2016 04:05:40 -0700 [26 60] com.newrelic.agent.jmx.JmxService FINER: 
JMX Service : MBeans query solr*:type=filterCache,*, matches 8
Aug 30, 2016 04:05:40 -0700 [26 60] com.newrelic.agent.jmx.JmxService FINER: 
JMX Service : MBeans query solr*:type=queryResultCache,*, matches 8

https://discuss.newrelic.com/t/solr-data-not-appearing-in-apm-solr-tabs-caches-updates/37507/6

I'll try to get New Relic involved in this bug.
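
Those object-name patterns are easy to re-check by hand from inside the Solr JVM. A sketch against the local platform MBean server (a remote check would go through a JMXConnector instead):

{code:java}
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class SolrMBeanCheck {
  public static void main(String[] args) throws Exception {
    MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    // The same pattern the agent logs for the filter cache.
    Set<ObjectName> hits =
        mbs.queryNames(new ObjectName("solr*:type=filterCache,*"), null);
    hits.forEach(System.out::println);
  }
}
{code}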

> JMX cache beans names / properties changed in 6.4
> -
>
> Key: SOLR-10531
> URL: https://issues.apache.org/jira/browse/SOLR-10531
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 6.4, 6.5
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: branch_6_3.png, branch_6x.png
>
>
> As reported by [~wunder]:
> {quote}
> New Relic displays the cache hit rate for each collection, showing the query 
> result cache, filter cache, and document cache.
> With 6.5.0, that page shows this message:
> New Relic recorded no Solr caches data for this application in the last 
> 24 hours
> If you think there should be Solr data here, first check to see that JMX 
> is enabled for your application server. If enabled, then please contact 
> support.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Embedding distributed-solr in client-app without storing data

2017-04-19 Thread Walter Underwood
I know exactly how this works. MarkLogic can be configured with separate 
execute and data nodes. But in MarkLogic, the execute nodes can do a lot of 
work. The query may be a mix of indexed searching and “table scan” searching, 
all expressed in an XQuery program.

It does not make sense for Solr. The distributed portion of query execution is 
just not enough work to farm out to CPU-intensive nodes.

It will mean more nodes to do the same work.

The execute nodes would need to be part of the cluster in order to get config 
updates. But they would not host shards. So now we need a new kind of node in 
the collection.

Lots more code, lots more testing, no benefit.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 19, 2017, at 9:40 AM, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> 
> @Walter:
> 
> Think of it this way.
> 
> 1. Separate data-solr from merger-solr. So merger-solr has more CPU, while 
> data-solr has more RAM (very simplistic).
> This way, you can also scale them separately. (ES has something like a 
> search-only node.)
> 
> 2. Second step is to join client-app with merger-solr. So you do one less hop. 
> This node doesn't have to lose the global-idf, query-cache or whatever 
> merger-only-solr currently does.
> If the client-app is just a frontend/proxy for Solr, then this should be better.
> 
> 3. The whole point is to have fewer, more powerful machines. And each 
> client-app should be able to saturate its own embedded-solr.
> 
> Makes sense?
> 
> On Wed, Apr 19, 2017 at 6:29 PM, Walter Underwood <wun...@wunderwood.org 
> <mailto:wun...@wunderwood.org>> wrote:
> That is exactly what I thought you meant. Adds complexity with no benefit.
> 
> Now the merger needs to keep caches for global IDF. But those caches don’t 
> get the benefit of other requests to the same cluster.
> 
> I’m not sure if query result caches cache the results of distributed queries, 
> but if they do, then this approach loses that benefit too.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
> 
>> On Apr 19, 2017, at 9:01 AM, Dorian Hoxha <dorian.ho...@gmail.com 
>> <mailto:dorian.ho...@gmail.com>> wrote:
>> 
>> @Walter
>> 
>> Usually you have: client-app --> random-solr-node (merger) --> each other 
>> node that has a shard
>> While what I want: client-app (merger is in same jvm) --> each other node 
>> that has a shard
>> 
>> Makes sense?
>> 
>> On Wed, Apr 19, 2017 at 4:50 PM, Walter Underwood <wun...@wunderwood.org 
>> <mailto:wun...@wunderwood.org>> wrote:
>> Does not make sense to me. It would do more queries from the client to the 
>> cluster, not fewer. And those HTTP requests would probably be slower than the 
>> intra-cluster requests.
>> 
>> I expect the distributed portion of the query load is small compared to 
>> other CPU usage.
>> 
>> It adds complexity for no gain in performance. Maybe a slight loss.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>> 
>> 
>>> On Apr 19, 2017, at 6:32 AM, Mikhail Khludnev <m...@apache.org 
>>> <mailto:m...@apache.org>> wrote:
>>> 
>>> Hello, Dorian.
>>> I'm not sure about 1. But you can create EmbeddedSolrServer and add a 
>>> "collection" parameter. That's what's done in 
>>> org.apache.solr.response.transform.SubQueryAugmenter [subquery]
>>> 
>>> On Wed, Apr 19, 2017 at 3:53 PM, Dorian Hoxha <dorian.ho...@gmail.com 
>>> <mailto:dorian.ho...@gmail.com>> wrote:
>>> Hi friends,
>>> 
>>> Has anybody done this? Reasons being: one less HTTP request when doing a 
>>> distributed search, not storing data itself (like a search-only node), and 
>>> the other nodes not caring about search nodes.
>>> 
>>> Makes sense?
>>> 
>>> Regards,
>>> Dorian
>>> 
>>> 
>>> 
>>> -- 
>>> Sincerely yours
>>> Mikhail Khludnev
>> 
>> 
> 
> 



Re: Embedding distributed-solr in client-app without storing data

2017-04-19 Thread Walter Underwood
That is exactly what I thought you meant. Adds complexity with no benefit.

Now the merger needs to keep caches for global IDF. But those caches don’t get 
the benefit of other requests to the same cluster.

I’m not sure if query result caches cache the results of distributed queries, 
but if they do, then this approach loses that benefit too.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 19, 2017, at 9:01 AM, Dorian Hoxha <dorian.ho...@gmail.com> wrote:
> 
> @Walter
> 
> Usually you have: client-app --> random-solr-node (merger) --> each other 
> node that has a shard
> While what I want: client-app (merger is in same jvm) --> each other node 
> that has a shard
> 
> Makes sense?
> 
> On Wed, Apr 19, 2017 at 4:50 PM, Walter Underwood <wun...@wunderwood.org 
> <mailto:wun...@wunderwood.org>> wrote:
> Does not make sense to me. It would do more queries from the client to the 
> cluster, not fewer. And those HTTP requests would probably be slower than the 
> intra-cluster requests.
> 
> I expect the distributed portion of the query load is small compared to other 
> CPU usage.
> 
> It adds complexity for no gain in performance. Maybe a slight loss.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
> 
>> On Apr 19, 2017, at 6:32 AM, Mikhail Khludnev <m...@apache.org 
>> <mailto:m...@apache.org>> wrote:
>> 
>> Hello, Dorian.
>> I'm not sure about 1. But you can create EmbeddedSolrServer and add a 
>> "collection" parameter. That's what's done in 
>> org.apache.solr.response.transform.SubQueryAugmenter [subquery]
>> 
>> On Wed, Apr 19, 2017 at 3:53 PM, Dorian Hoxha <dorian.ho...@gmail.com 
>> <mailto:dorian.ho...@gmail.com>> wrote:
>> Hi friends,
>> 
>> Has anybody done this? Reasons being: one less HTTP request when doing a 
>> distributed search, not storing data itself (like a search-only node), and 
>> the other nodes not caring about search nodes.
>> 
>> Makes sense?
>> 
>> Regards,
>> Dorian
>> 
>> 
>> 
>> -- 
>> Sincerely yours
>> Mikhail Khludnev
> 
> 



Re: Embedding distributed-solr in client-app without storing data

2017-04-19 Thread Walter Underwood
Does not make sense to me. It would do more queries from the client to the 
cluster, not fewer. And those HTTP requests would probably be slower than the 
intra-cluster requests.

I expect the distributed portion of the query load is small compared to other 
CPU usage.

It adds complexity for no gain in performance. Maybe a slight loss.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 19, 2017, at 6:32 AM, Mikhail Khludnev <m...@apache.org> wrote:
> 
> Hello, Dorian.
> I'm not sure about 1. But you can create EmbeddedSolrServer and add a 
> "collection" parameter. That's what's done in 
> org.apache.solr.response.transform.SubQueryAugmenter [subquery]
> 
> On Wed, Apr 19, 2017 at 3:53 PM, Dorian Hoxha <dorian.ho...@gmail.com 
> <mailto:dorian.ho...@gmail.com>> wrote:
> Hi friends,
> 
> Has anybody done this? Reasons being: one less HTTP request when doing a 
> distributed search, not storing data itself (like a search-only node), and 
> the other nodes not caring about search nodes.
> 
> Makes sense?
> 
> Regards,
> Dorian
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
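
For reference, a rough SolrJ sketch of the EmbeddedSolrServer approach Mikhail suggests above. The solr home path, core name, and target collection are invented for illustration; whether an embedded core can fan out a distributed query via the "collection" parameter the way [subquery] does -- and what happens to global IDF -- is exactly the open question in this thread.

    import java.nio.file.Paths;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class EmbeddedMergerSketch {
        public static void main(String[] args) throws Exception {
            // Embedded node inside the client app; hosts no shards itself.
            try (EmbeddedSolrServer merger =
                     new EmbeddedSolrServer(Paths.get("/path/to/solr/home"), "localCore")) {
                SolrQuery q = new SolrQuery("test");
                // Route the distributed query to the remote collection,
                // as the [subquery] transformer does.
                q.set("collection", "remoteCollection");
                QueryResponse rsp = merger.query(q);
                System.out.println(rsp.getResults().getNumFound() + " hits");
            }
        }
    }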



Re: Solr JMX changes and backwards (in)compatibility

2017-04-19 Thread Walter Underwood
I did report it a week ago. This is a reply to that email.

New Relic displays the cache hit rate for each collection, showing the query 
result cache, filter cache, and document cache.

With 6.5.0, that page shows this message:

New Relic recorded no Solr caches data for this application in the last 24 
hours
If you think there should be Solr data here, first check to see that JMX is 
enabled for your application server. If enabled, then please contact support.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 19, 2017, at 2:28 AM, Andrzej Białecki 
> <andrzej.biale...@lucidworks.com> wrote:
> 
> 
>> On 18 Apr 2017, at 19:15, Walter Underwood <wun...@wunderwood.org 
>> <mailto:wun...@wunderwood.org>> wrote:
>> 
>> Pretty sure the back-compat did not work, because New Relic cannot find the 
>> MBeans in our 6.5.0 cluster.
>> 
> 
> I don’t use New Relic and nobody reported this until now, which is a pity… 
> Could you please be more specific about what MBeans can’t be found? Does this 
> affect all caches or just some of them? 
> 
>> 
>>> On Apr 11, 2017, at 2:28 PM, Walter Underwood <wun...@wunderwood.org 
>>> <mailto:wun...@wunderwood.org>> wrote:
>>> 
>>> We are running 6.5.0 in prod and New Relic is not showing cache stats. I 
>>> think this means it cannot find the MBeans.
>>> 
>>> I gleaned that from the discussion here:
>>> 
>>> https://discuss.newrelic.com/t/solr-data-not-appearing-in-apm-solr-tabs-caches-updates/37507/4
>>>  
>>> <https://discuss.newrelic.com/t/solr-data-not-appearing-in-apm-solr-tabs-caches-updates/37507/4>
>>> https://docs.newrelic.com/docs/agents/java-agent/troubleshooting/solr-data-not-appearing-apm-solr-tab-java
>>>  
>>> <https://docs.newrelic.com/docs/agents/java-agent/troubleshooting/solr-data-not-appearing-apm-solr-tab-java>
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>>> 
>>>> On Mar 3, 2017, at 7:26 AM, Andrzej Białecki 
>>>> <andrzej.biale...@lucidworks.com <mailto:andrzej.biale...@lucidworks.com>> 
>>>> wrote:
>>>> 
>>>> 
>>>>> On 2 Mar 2017, at 16:45, Otis Gospodnetić <otis.gospodne...@gmail.com 
>>>>> <mailto:otis.gospodne...@gmail.com>> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> While I love all the new metrics in Solr, I think metrics should be 
>>>>> treated like code/features in terms of how backwards 
>>>>> compatibility/deprecation is handled. Otherwise, on upgrade, people's 
>>>>> monitoring breaks and monitoring is kind of important... 
>>>>> Note: Looks like recent Solr metrics changes broke/changed 
>>>>> previously-existing MBeans... 
>>>>> Don't have the details about what was changed and how exactly, but I see 
>>>>> people using Sematext SPM for monitoring Solr are reporting this with 
>>>>> Solr 6.4.1.
>>>>> 
>>>> 
>>>> Otis,
>>>> 
>>>> Yes, we’ll be more careful, but we need proper feedback too. My 
>>>> understanding was that SOLR-10035 addressed this by adding back-compat 
>>>> registration under old names. Are you saying there are still some issues 
>>>> in 6.4.1? Can you please be more specific? 6.4.2 is almost out, but if 
>>>> it’s something serious then we should fix it.
>>>> 
>>>>> 
>>>>> Otis
>>>>> --
>>>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ 
>>>>> <http://sematext.com/>
>>>>> 
>>>> 
>>> 
>> 
> 



Re: Solr JMX changes and backwards (in)compatibility

2017-04-18 Thread Walter Underwood
Pretty sure the back-compat did not work, because New Relic cannot find the 
MBeans in our 6.5.0 cluster.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 11, 2017, at 2:28 PM, Walter Underwood <wun...@wunderwood.org> wrote:
> 
> We are running 6.5.0 in prod and New Relic is not showing cache stats. I 
> think this means it cannot find the MBeans.
> 
> I gleaned that from the discussion here:
> 
> https://discuss.newrelic.com/t/solr-data-not-appearing-in-apm-solr-tabs-caches-updates/37507/4
>  
> <https://discuss.newrelic.com/t/solr-data-not-appearing-in-apm-solr-tabs-caches-updates/37507/4>
> https://docs.newrelic.com/docs/agents/java-agent/troubleshooting/solr-data-not-appearing-apm-solr-tab-java
>  
> <https://docs.newrelic.com/docs/agents/java-agent/troubleshooting/solr-data-not-appearing-apm-solr-tab-java>
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/  (my blog)
> 
>> On Mar 3, 2017, at 7:26 AM, Andrzej Białecki 
>> <andrzej.biale...@lucidworks.com <mailto:andrzej.biale...@lucidworks.com>> 
>> wrote:
>> 
>> 
>>> On 2 Mar 2017, at 16:45, Otis Gospodnetić <otis.gospodne...@gmail.com 
>>> <mailto:otis.gospodne...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> While I love all the new metrics in Solr, I think metrics should be treated 
>>> like code/features in terms of how backwards compatibility/deprecation is 
>>> handled. Otherwise, on upgrade, people's monitoring breaks and 
>>> monitoring is kind of important... 
>>> Note: Looks like recent Solr metrics changes broke/changed 
>>> previously-existing MBeans... 
>>> Don't have the details about what was changed and how exactly, but I see 
>>> people using Sematext SPM for monitoring Solr are reporting this with Solr 
>>> 6.4.1.
>>> 
>> 
>> Otis,
>> 
>> Yes, we’ll be more careful, but we need proper feedback too. My 
>> understanding was that SOLR-10035 addressed this by adding back-compat 
>> registration under old names. Are you saying there are still some issues in 
>> 6.4.1? Can you please be more specific? 6.4.2 is almost out, but if it’s 
>> something serious then we should fix it.
>> 
>>> 
>>> Otis
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ 
>>> <http://sematext.com/>
>>> 
>> 
> 



Re: Change Default Response Format (wt) to JSON in Solr 7.0?

2017-04-14 Thread Walter Underwood
I like that.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 14, 2017, at 7:47 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> 
> We could put a commented-out wt=xml in solrconfig.xml as a secondary 
> reminder.
> 
> Regards,
>Alex
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
> 
> 
> On 15 April 2017 at 03:53, Doug Turnbull
> <dturnb...@opensourceconnections.com> wrote:
>> Sounds great. I agree!
>> 
>> I can imagine there might be really old client libraries/integrations that
>> assume XML without sending a wt, but I think it's ok to break those sorts of
>> things in a major release. And those folks can learn to send wt=xml
>> 
>> -Doug
>> 
>> On Fri, Apr 14, 2017 at 2:53 PM Trey Grainger <solrt...@gmail.com> wrote:
>>> 
>>> Just wanted to throw this out there for discussion. Solr's default query
>>> response format is still XML, despite the fact that Solr has supported the
>>> JSON response format for over a decade, developer mindshare has clearly
>>> shifted toward JSON over the years, and most modern/competing systems also
>>> use JSON format now by default.
>>> 
>>> In fact, Solr's admin UI even explicitly adds wt=json to the request (by
>>> default in the UI) to override the default of wt=xml, so Solr's Admin UI
>>> effectively has a different default than the API.
>>> 
>>> We have now introduced things like the JSON faceting API, and the new more
>>> modern /V2 apis assume JSON for the areas of Solr they cover, so clearly
>>> we're moving in the direction of JSON anyway.
>>> 
>>> I'd like to propose that we switch the default response writer to JSON
>>> (wt=json) instead of XML for Solr 7.0, as this seems to me like the right
>>> direction and a good time to make this change with the next major version.
>>> 
>>> Before I create a JIRA and submit a patch, though, I wanted to check here to
>>> make sure there were no strong objections to changing the default.
>>> 
>>> -Trey Grainger
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> 
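
A sketch of that solrconfig.xml reminder, if it helps; the handler name and placement are illustrative, not a tested config:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <!-- Uncomment to pin the pre-7.0 response format explicitly: -->
        <!-- <str name="wt">xml</str> -->
      </lst>
    </requestHandler>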



Re: Change Default Response Format (wt) to JSON in Solr 7.0?

2017-04-14 Thread Walter Underwood
This is guaranteed to break stuff, but I support it. Even though I put the 
first XML support into a search engine and worked for an XML database company.

Also, if someone even proposes a PDF response writer, I will hunt them down.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 14, 2017, at 5:53 PM, Doug Turnbull 
> <dturnb...@opensourceconnections.com> wrote:
> 
> Sounds great. I agree!
> 
> I can imagine there might be really old client libraries/integrations that 
> assume XML without sending a wt, but I think it's ok to break those sorts of 
> things in a major release. And those folks can learn to send wt=xml
> 
> -Doug
> 
> On Fri, Apr 14, 2017 at 2:53 PM Trey Grainger <solrt...@gmail.com 
> <mailto:solrt...@gmail.com>> wrote:
> Just wanted to throw this out there for discussion. Solr's default query 
> response format is still XML, despite the fact that Solr has supported the 
> JSON response format for over a decade, developer mindshare has clearly 
> shifted toward JSON over the years, and most modern/competing systems also 
> use JSON format now by default.
> 
> In fact, Solr's admin UI even explicitly adds wt=json to the request (by 
> default in the UI) to override the default of wt=xml, so Solr's Admin UI 
> effectively has a different default than the API.
> 
> We have now introduced things like the JSON faceting API, and the new more 
> modern /V2 apis assume JSON for the areas of Solr they cover, so clearly 
> we're moving in the direction of JSON anyway.
> 
> I'd like to propose that we switch the default response writer to JSON (wt=json) 
> instead of XML for Solr 7.0, as this seems to me like the right direction and 
> a good time to make this change with the next major version.
> 
> Before I create a JIRA and submit a patch, though, I wanted to check here to 
> make sure there were no strong objections to changing the default.
> 
> -Trey Grainger



Re: [VOTE] Release Lucene/Solr 6.5.1 RC1

2017-04-12 Thread Walter Underwood
SOLR-10420 is the only reason I would install a 6.5.1. That is a serious bug 
that affects every user of Solr Cloud.

That bug was the reason why the 6.5.1 process was started, right?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Apr 12, 2017, at 7:14 AM, Joel Bernstein <joels...@gmail.com> wrote:
> 
> We have a couple options here:
> 
> 1) The vote ends for this release tomorrow. We can continue with this release 
> and ship 6.5.1. In the background start working on 6.5.2.
> 
> 2) Or we could respin 6.5.1 to include SOLR-10420. The problem with this is 
> it restarts the clock on the 6.5.1 and delays other bug fixes.
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/ <http://joelsolr.blogspot.com/>
> 
> On Wed, Apr 12, 2017 at 9:48 AM, Đạt Cao Mạnh <caomanhdat...@gmail.com 
> <mailto:caomanhdat...@gmail.com>> wrote:
> Sorry Joel, but can we wait for SOLR-10420? It is a serious bug that 
> leaks one SolrZkClient per second on the Overseer node.
> 
> On Tue, Apr 11, 2017 at 6:34 PM Tommaso Teofili <tommaso.teof...@gmail.com 
> <mailto:tommaso.teof...@gmail.com>> wrote:
> +1
> 
> SUCCESS! [2:25:09.313218]
> 
> Tommaso
> 
> Il giorno mar 11 apr 2017 alle ore 06:47 Shalin Shekhar Mangar 
> <shalinman...@gmail.com <mailto:shalinman...@gmail.com>> ha scritto:
> +1
> 
> SUCCESS! [2:03:24.673867]
> 
> On Mon, Apr 10, 2017 at 8:31 AM, Joel Bernstein <joels...@gmail.com 
> <mailto:joels...@gmail.com>> wrote:
> > Please vote for release candidate 1 for Lucene/Solr 6.5.1
> >
> > The artifacts can be downloaded from:
> >
> > https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-6.5.1-RC1-revf88f850034eee845b8021af47ecffc9c42aa8539
> >  
> > <https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-6.5.1-RC1-revf88f850034eee845b8021af47ecffc9c42aa8539>
> >
> > You can run the smoke tester directly with this command:
> >
> > python3 -u dev-tools/scripts/smokeTestRelease.py \
> > https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-6.5.1-RC1-revf88f850034eee845b8021af47ecffc9c42aa8539
> >  
> > <https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-6.5.1-RC1-revf88f850034eee845b8021af47ecffc9c42aa8539>
> >
> > Here's my +1
> >
> > SUCCESS! [0:40:54.049621]
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/ <http://joelsolr.blogspot.com/>
> 
> 
> 
> --
> Regards,
> Shalin Shekhar Mangar.
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org 
> <mailto:dev-unsubscr...@lucene.apache.org>
> For additional commands, e-mail: dev-h...@lucene.apache.org 
> <mailto:dev-h...@lucene.apache.org>
> 
> 



Re: Solr JMX changes and backwards (in)compatibility

2017-04-11 Thread Walter Underwood
We are running 6.5.0 in prod and New Relic is not showing cache stats. I think 
this means it cannot find the MBeans.

I gleaned that from the discussion here:

https://discuss.newrelic.com/t/solr-data-not-appearing-in-apm-solr-tabs-caches-updates/37507/4
 
<https://discuss.newrelic.com/t/solr-data-not-appearing-in-apm-solr-tabs-caches-updates/37507/4>
https://docs.newrelic.com/docs/agents/java-agent/troubleshooting/solr-data-not-appearing-apm-solr-tab-java
 
<https://docs.newrelic.com/docs/agents/java-agent/troubleshooting/solr-data-not-appearing-apm-solr-tab-java>

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 3, 2017, at 7:26 AM, Andrzej Białecki 
> <andrzej.biale...@lucidworks.com> wrote:
> 
> 
>> On 2 Mar 2017, at 16:45, Otis Gospodnetić <otis.gospodne...@gmail.com 
>> <mailto:otis.gospodne...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> While I love all the new metrics in Solr, I think metrics should be treated 
>> like code/features in terms of how backwards compatibility/deprecation is 
>> handled. Otherwise, on upgrade, people's monitoring breaks and 
>> monitoring is kind of important... 
>> Note: Looks like recent Solr metrics changes broke/changed 
>> previously-existing MBeans... 
>> Don't have the details about what was changed and how exactly, but I see 
>> people using Sematext SPM for monitoring Solr are reporting this with Solr 
>> 6.4.1.
>> 
> 
> Otis,
> 
> Yes, we’ll be more careful, but we need proper feedback too. My understanding 
> was that SOLR-10035 addressed this by adding back-compat registration under 
> old names. Are you saying there are still some issues in 6.4.1? Can you 
> please be more specific? 6.4.2 is almost out, but if it’s something serious 
> then we should fix it.
> 
>> 
>> Otis
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/ 
>> <http://sematext.com/>
>> 
> 



[jira] [Commented] (SOLR-10420) Solr 6.x leaking one SolrZkClient instance per second

2017-04-04 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955538#comment-15955538
 ] 

Walter Underwood commented on SOLR-10420:
-

To be clear, these are uncollectable objects?
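
A toy illustration (not Solr code) of what the description below implies: if the watcher bookkeeping keeps a strong reference to each old client, a forced GC reclaims nothing -- the instances are live, not merely awaiting collection.

{code}
import java.util.ArrayList;
import java.util.List;

public class LeakSketch {
    // Stands in for the watcher bookkeeping: anything reachable
    // from here survives every collection cycle.
    private static final List<Object> liveWatchers = new ArrayList<>();

    public static void main(String[] args) {
        for (int i = 0; i < 60; i++) {
            liveWatchers.add(new Object()); // stand-in for one SolrZkClient per second
        }
        System.gc(); // cannot reclaim strongly reachable objects
        System.out.println(liveWatchers.size() + " instances still reachable");
    }
}
{code}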

> Solr 6.x leaking one SolrZkClient instance per second
> -
>
> Key: SOLR-10420
> URL: https://issues.apache.org/jira/browse/SOLR-10420
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 6.5, 6.4.2
>Reporter: Markus Jelsma
> Fix For: master (7.0), branch_6x
>
>
> One of our nodes went berserk after a restart, Solr went completely nuts! 
> So i opened VisualVM to keep an eye on it and spotted a different problem 
> that occurs in all our Solr 6.4.2 and 6.5.0 nodes.
> It appears Solr is leaking one SolrZkClient instance per second via 
> DistributedQueue$ChildWatcher. That one per second is quite accurate for all 
> nodes, there are about the same number of instances as there are seconds 
> since Solr started. I know VisualVM's instance count includes 
> objects-to-be-collected, but the instance count does not drop after a forced 
> garbage collection round.
> It doesn't matter how many cores or collections the nodes carry or how heavy 
> traffic is.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-9859) replication.properties cannot be updated after being written and neither replication.properties or index.properties are durable in the face of a crash

2017-03-21 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-9859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935286#comment-15935286
 ] 

Walter Underwood commented on SOLR-9859:


Is there a workaround for fixing this on a 6.3.0 host? Does it work to delete 
replication.properties then start Solr?
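
The failure mode is visible in a standalone sketch: java.nio's CREATE_NEW option, which the stack trace below goes through, throws once the file exists. This is a hypothetical demo, not the Solr code path; whether deleting the stale file before restart is a safe workaround is the question above.

{code}
import java.io.OutputStream;
import java.nio.file.*;

public class CreateNewDemo {
    public static void main(String[] args) throws Exception {
        Path props = Files.createTempDirectory("demo").resolve("replication.properties");

        // First recovery: CREATE_NEW succeeds because the file does not exist yet.
        try (OutputStream out = Files.newOutputStream(props, StandardOpenOption.CREATE_NEW)) {
            out.write("indexReplicatedAt=1".getBytes("UTF-8"));
        }

        // Second recovery: CREATE_NEW now throws, matching the stack trace below.
        try (OutputStream out = Files.newOutputStream(props, StandardOpenOption.CREATE_NEW)) {
            out.write("indexReplicatedAt=2".getBytes("UTF-8"));
        } catch (FileAlreadyExistsException expected) {
            System.out.println("second write fails: " + expected);
        }
    }
}
{code}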

> replication.properties cannot be updated after being written and neither 
> replication.properties or index.properties are durable in the face of a crash
> --
>
> Key: SOLR-9859
> URL: https://issues.apache.org/jira/browse/SOLR-9859
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 5.5.3, 6.3
>Reporter: Pushkar Raste
>Assignee: Mark Miller
>Priority: Minor
> Fix For: 6.4, master (7.0)
>
> Attachments: SOLR-9859.patch, SOLR-9859.patch, SOLR-9859.patch, 
> SOLR-9859.patch, SOLR-9859.patch, SOLR-9859.patch
>
>
> If a shard recovers via replication (vs PeerSync) a file named 
> {{replication.properties}} gets created. If the same shard recovers once more 
> via replication, IndexFetcher fails to write latest replication information 
> as it tries to create {{replication.properties}} but as file already exists. 
> Here is the stack trace I saw 
> {code}
> java.nio.file.FileAlreadyExistsException: 
> \shard-3-001\cores\collection1\data\replication.properties
>   at sun.nio.fs.WindowsException.translateToIOException(Unknown Source)
>   at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
>   at sun.nio.fs.WindowsException.rethrowAsIOException(Unknown Source)
>   at sun.nio.fs.WindowsFileSystemProvider.newByteChannel(Unknown Source)
>   at java.nio.file.spi.FileSystemProvider.newOutputStream(Unknown Source)
>   at java.nio.file.Files.newOutputStream(Unknown Source)
>   at 
> org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:413)
>   at 
> org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:409)
>   at 
> org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:253)
>   at 
> org.apache.solr.handler.IndexFetcher.logReplicationTimeAndConfFiles(IndexFetcher.java:689)
>   at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:501)
>   at 
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:265)
>   at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:397)
>   at 
> org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:157)
>   at 
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:409)
>   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:222)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>   at java.util.concurrent.FutureTask.run(Unknown Source)
>   at 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$0(ExecutorUtil.java:229)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>   at java.lang.Thread.run(Unknown Source)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-10130) Serious performance degradation in Solr 6.4.1 due to the new metrics collection

2017-02-16 Thread Walter Underwood (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870751#comment-15870751
 ] 

Walter Underwood commented on SOLR-10130:
-

This might be part of it:

[wunder@new-solr-c01.test3]# ls -lh 
/solr/data/questions_shard2_replica1/data/tlog/
total 4.7G
-rw-r--r-- 1 bin bin 4.7G Feb 13 11:04 tlog.000
[wunder@new-solr-c01.test3]# du -sh /solr/data/questions_shard2_replica1/data/*
8.4G/solr/data/questions_shard2_replica1/data/index
4.0K/solr/data/questions_shard2_replica1/data/snapshot_metadata
4.7G/solr/data/questions_shard2_replica1/data/tlog


Last Modified: 3 days ago
Num Docs: 3683075
Max Doc: 3683075
Heap Memory Usage: -1
Deleted Docs: 0
Version: 2737
Segment Count: 26
Optimized: yes
Current: yes



> Serious performance degradation in Solr 6.4.1 due to the new metrics 
> collection
> ---
>
> Key: SOLR-10130
> URL: https://issues.apache.org/jira/browse/SOLR-10130
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: metrics
>Affects Versions: 6.4.1
> Environment: Centos 7, OpenJDK 1.8.0 update 111
>Reporter: Ere Maijala
>Assignee: Andrzej Bialecki 
>Priority: Blocker
>  Labels: perfomance
> Fix For: master (7.0), 6.4.2
>
> Attachments: SOLR-10130.patch, SOLR-10130.patch, 
> solr-8983-console-f1.log
>
>
> We've stumbled on serious performance issues after upgrading to Solr 6.4.1. 
> Looks like the new metrics collection system in MetricsDirectoryFactory is 
> causing a major slowdown. This happens with an index configuration that, as 
> far as I can see, has no metrics specific configuration and uses 
> luceneMatchVersion 5.5.0. In practice a moderate load will completely bog 
> down the server with Solr threads constantly using up all CPU (600% on 6 core 
> machine) capacity with a load that normally  where we normally see an average 
> load of < 50%.
> I took stack traces (I'll attach them) and noticed that the threads are 
> spending time in com.codahale.metrics.Meter.mark. I tested building Solr 
> 6.4.1 with the metrics collection disabled in MetricsDirectoryFactory getByte 
> and getBytes methods and was unable to reproduce the issue.
> As far as I can see there are several issues:
> 1. Collecting metrics on every single byte read is slow.
> 2. Having it enabled by default is not a good idea.
> 3. The comment "enable coarse-grained metrics by default" at 
> https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/update/SolrIndexConfig.java#L104
>  implies that only coarse-grained metrics should be enabled by default, and 
> this contradicts with collecting metrics on every single byte read.
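
A rough micro-benchmark sketch of point 1 above, against the same com.codahale.metrics Meter; the absolute numbers are JIT-sensitive and illustrative only, but the gap between metering every byte and metering once per bulk read is the point.

{code}
import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class MeterOverhead {
    public static void main(String[] args) {
        Meter meter = new MetricRegistry().meter("directory.reads");
        byte[] buf = new byte[1 << 20]; // one 1 MB read

        long t0 = System.nanoTime();
        for (int i = 0; i < buf.length; i++) {
            meter.mark();           // fine-grained: one metered event per byte
        }
        long perByte = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        meter.mark(buf.length);     // coarse-grained: one event per bulk read
        long perBlock = System.nanoTime() - t1;

        System.out.printf("per-byte: %,d ns  per-block: %,d ns%n", perByte, perBlock);
    }
}
{code}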



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


