[jira] [Commented] (SOLR-12490) Query DSL supports for further referring and exclusion in JSON facets

2019-10-18 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954570#comment-16954570
 ] 

David Wayne Smiley commented on SOLR-12490:
---

excludeTags references "top" but I don't see a "top" tag.

RE json_param: I don't think I like yet another odd query parser for 
this case _if we can avoid it_.  Wouldn't this work?:
{noformat}
{!query v=$color_fq}
{noformat}
If it doesn't work because the params embedded within the JSON aren't seen, then 
perhaps they should be overlaid such that this does work.
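If it helps to make the suggestion concrete, this is roughly the request shape it assumes; the color_fq param name and the query values are made up for illustration:
{code:json}
{
  "query": "*:*",
  "filter": ["{!query v=$color_fq}"],
  "params": { "color_fq": "color:blue" }
}
{code}
The open question is whether $color_fq resolves when it's defined in the JSON body's params block rather than as a top-level request parameter.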

This language/syntax is foreign to me; I'm not sure why I'm getting involved, 
so I'd rather back away :-)

> Query DSL supports for further referring and exclusion in JSON facets 
> --
>
> Key: SOLR-12490
> URL: https://issues.apache.org/jira/browse/SOLR-12490
> Project: Solr
>  Issue Type: Improvement
>  Components: Facet Module, faceting
>Reporter: Mikhail Khludnev
>Priority: Major
>  Labels: newdev
> Attachments: SOLR-12490.patch
>
>
> It's spin off from the 
> [discussion|https://issues.apache.org/jira/browse/SOLR-9685?focusedCommentId=16508720&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16508720].
>  
> h2. Problem
> # after SOLR-9685 we can tag separate clauses in hairish queries like 
> {{parent}}, {{bool}}
> # we can {{domain.excludeTags}}
> # we are looking for child faceting with exclusions, see SOLR-9510, SOLR-8998 
>
> # but we can refer only to separate params in {{domain.filter}}; it's not 
> possible to refer to separate clauses
> see the first comment



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8041) All Fields.terms(fld) impls should be O(1) not O(log(N))

2019-10-17 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954040#comment-16954040
 ] 

David Wayne Smiley commented on LUCENE-8041:


My recommendation is not to use a TreeMap at all.  Use a plain HashMap.  In the 
constructor, save away an Iterable<String> using a method such as the following 
(taken from a fork of Lucene at work):
{code:java}
private static Iterable<String> sortedFieldNames(Collection<String> unsortedFields) {
  List<String> fieldsNames = new ArrayList<>(unsortedFields);
  Collections.sort(fieldsNames);
  return Collections.unmodifiableCollection(fieldsNames);
}
{code}

You could just do this for PerFieldPostingsFormat's reader and 
Lucene50PostingsReader as these are the common ones, or extend this idea to 
others if you wish.  UniformSplit is already doing this approach.  Maybe just 
do those 2 up front for code review.  You could put these few lines of code 
into the affected files, or consider adding this as a protected method on 
Fields.

One day we may get rid of iterator(), and when that day comes it will be less 
of a change if we already use a HashMap for the fields, which is what we'll 
want then.
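The idea above can be sketched as follows; the class and names here are made up for illustration and are not the actual patch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch: a HashMap gives O(1) terms(field) lookup, while a name list sorted
// once in the constructor preserves the sorted iterator() contract of Fields.
class FieldsByNameSketch {
  private final Map<String, Object> termsByField; // Object stands in for Terms
  private final List<String> sortedNames;

  FieldsByNameSketch(Map<String, Object> terms) {
    this.termsByField = new HashMap<>(terms);
    List<String> names = new ArrayList<>(terms.keySet());
    Collections.sort(names);
    this.sortedNames = Collections.unmodifiableList(names);
  }

  Object terms(String field) {
    return termsByField.get(field); // O(1) hash lookup, no tree traversal
  }

  Iterator<String> iterator() {
    return sortedNames.iterator(); // still sorted, computed only once
  }
}
```

The sorted view costs one sort at construction instead of a log-time tree walk on every terms() call.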

> All Fields.terms(fld) impls should be O(1) not O(log(N))
> 
>
> Key: LUCENE-8041
> URL: https://issues.apache.org/jira/browse/LUCENE-8041
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: David Wayne Smiley
>Priority: Major
> Attachments: LUCENE-8041-LinkedHashMap.patch, LUCENE-8041.patch
>
>
> I've seen apps that have a good number of fields -- hundreds.  The O(log(N)) 
> of TreeMap definitely shows up in a profiler; sometimes 20% of search time, 
> if I recall.  There are many Field implementations that are impacted... in 
> part because Fields is the base class of FieldsProducer.  
> As an aside, I hope Fields goes away some day; FieldsProducer should be 
> TermsProducer and not have an iterator of fields. If DocValuesProducer 
> doesn't have this then why should the terms index part of our API have it?  
> If we did this then the issue here would be a simple transition to a HashMap.
> Or maybe we can switch to HashMap and relax the definition of Fields.iterator 
> to not necessarily be sorted?
> Perhaps the fix can be a relatively simple conversion over to LinkedHashMap 
> in many cases if we can assume when we initialize these internal maps that we 
> consume them in sorted order to begin with.






[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2019-10-17 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16954017#comment-16954017
 ] 

David Wayne Smiley commented on LUCENE-9004:


bq.  The DocValues API basically requires creating a new DocValues whenever 
docids go backwards, so it's not ideal for this use case

I thought it was a real shame that the random access API to DocValues was 
removed.  I'm skeptical it was actually _necessary_ for sparse docValues, but 
that was the rationale.  Anyway, for vectors, do you think we need an entirely 
new Codec-level format, or can we just have a new _type_ of DocValues that is 
random-access BinaryDocValues?  For example, imagine a 
DocValuesProducer.getBinaryRandomAccess()?  And/or imagine BinaryDocValues 
allowing you to call advance() on any doc ID and have this be acceptable -- it 
needn't be forward-only.
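To make the musing concrete, here is a hedged sketch of such a hypothetical API; none of these names exist in Lucene, and a real implementation would read from the codec rather than a heap map:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical API sketch: a random-access BinaryDocValues variant where
// lookups may target any docID, in any order, rather than forward-only.
interface RandomAccessBinaryValues {
  byte[] binaryValue(int docID); // null when the doc has no value
}

class HeapBinaryValues implements RandomAccessBinaryValues {
  private final Map<Integer, byte[]> perDoc = new HashMap<>();

  void put(int docID, byte[] value) {
    perDoc.put(docID, value);
  }

  @Override
  public byte[] binaryValue(int docID) {
    return perDoc.get(docID); // docIDs may go backwards; no iteration state
  }
}
```

The key contrast with today's BinaryDocValues is the absence of any advancing cursor, so a graph traversal could revisit lower docIDs freely.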

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but 

[jira] [Commented] (SOLR-10653) After Upgrade from 5.3.1 to 6.4.2, Solr is storing certain fields like UUID, BigDecimal, Enums as :

2019-10-17 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-10653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953991#comment-16953991
 ] 

David Wayne Smiley commented on SOLR-10653:
---

Maybe kinda sorta related to SOLR-8866.
I don't think it's well documented how non-String objects get serialized.  
Still... silently converting to class:value is clearly bad though -- like, when 
would anyone want that?!
You might try working around it by using another writer like XML or JSON.
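A minimal sketch of the client-side workaround described in this issue, assuming you control the code that populates SolrInputDocument; the helper name is made up, and SolrInputDocument itself is omitted so the snippet stands alone:

```java
import java.math.BigDecimal;
import java.util.UUID;

// Convert UUID/BigDecimal/Enum values to plain Strings yourself before handing
// them to SolrInputDocument, so SolrJ never serializes them as "class:value".
// Usage would look like: doc.addField("id", ToSolrString.toSolrString(uuid));
class ToSolrString {
  static String toSolrString(Object value) {
    if (value instanceof UUID || value instanceof BigDecimal || value instanceof Enum) {
      return value.toString(); // plain string form, no class-name prefix
    }
    return String.valueOf(value);
  }
}
```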

> After Upgrade from 5.3.1 to 6.4.2, Solr is storing certain fields like UUID, 
> BigDecimal, Enums as :
> 
>
> Key: SOLR-10653
> URL: https://issues.apache.org/jira/browse/SOLR-10653
> Project: Solr
>  Issue Type: Bug
>  Components: clients - java, SolrJ
>Affects Versions: 6.4.2
>Reporter: Sudharshan Krishnamurthy
>Priority: Major
>
> Originally, in 5.3.1, when object types such as java.util.UUID, Enum, or 
> BigDecimal were supplied to SolrInputDocument, the conversion to the 
> corresponding data types defined in the Solr schema (here string, string, and 
> float respectively) happened just fine. After the upgrade to 6.4.2, I see 
> that when such values are supplied to SolrInputDocument, they get stored as 
> "java.util.UUID:0997e78e-6e3d-4824-8c52-8cc15533e541" for a UUID, for 
> example, and with the fully qualified name of the class for Enums etc. Hence 
> while deserializing we get errors such as 
> Invalid UUID String: 'java.util.UUID:0997e78e-6e3d-4824-8c52-8cc15533e541'
> Converting these fields to String before supplying them to SolrInputDocument, 
> or converting to varchar for delta-import queries, seems to fix the problem. 
> I wonder what changed between the two versions to require this String or 
> varchar conversion that was not needed before.






[jira] [Commented] (SOLR-12490) Query DSL supports for further referring and exclusion in JSON facets

2019-10-17 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953974#comment-16953974
 ] 

David Wayne Smiley commented on SOLR-12490:
---

I have not used the Query DSL so I'm not of much help.  But I couldn't help but 
think maybe the _tag_ should be specified using a {{tag:color}} within the JSON 
somewhere so that it's more consistent with the existing local-param based 
query tagging that's been in Solr for a long time?  It doesn't have to be; it's 
just a consideration.  If we don't, I do like your use of the # pound sign, 
which is kinda consistent with HTML id references.
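For reference, the long-standing local-param tagging this would be consistent with looks roughly like the following; the field and tag names are made up:
{noformat}
fq={!tag=COLOR}color:blue
json.facet={colors:{type:terms, field:color, domain:{excludeTags:COLOR}}}
{noformat}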

> Query DSL supports for further referring and exclusion in JSON facets 
> --
>
> Key: SOLR-12490
> URL: https://issues.apache.org/jira/browse/SOLR-12490
> Project: Solr
>  Issue Type: Improvement
>  Components: Facet Module, faceting
>Reporter: Mikhail Khludnev
>Priority: Major
>  Labels: newdev
> Attachments: SOLR-12490.patch
>
>
> It's spin off from the 
> [discussion|https://issues.apache.org/jira/browse/SOLR-9685?focusedCommentId=16508720&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16508720].
>  
> h2. Problem
> # after SOLR-9685 we can tag separate clauses in hairish queries like 
> {{parent}}, {{bool}}
> # we can {{domain.excludeTags}}
> # we are looking for child faceting with exclusions, see SOLR-9510, SOLR-8998 
>
> # but we can refer only to separate params in {{domain.filter}}; it's not 
> possible to refer to separate clauses
> see the first comment






[jira] [Commented] (SOLR-13851) SolrIndexSearcher.getFirstMatch trips assertion if multiple matches

2019-10-17 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953893#comment-16953893
 ] 

David Wayne Smiley commented on SOLR-13851:
---

I agree with your analysis.  That "assert" should be an IllegalStateException.  
And the implementation of getFirstMatch is fine; just update its documentation.

I dislike the names of these similar methods: "getFirstMatch" and "lookupId" 
are basically doing the same thing, yet the method names are so different.  And 
as you note, we always use the ID.  Perhaps these might be renamed to 
lookupDocIdByUniqueKey and lookupDocIdAsPairByUniqueKey, and have them both 
simply take a BytesRef?  It's okay to me that my proposed names are long-ish.
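A hedged sketch of what the proposed rename could look like, with a plain Map standing in for the index and String standing in for BytesRef; the duplicate check mirrors the IllegalStateException point above:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: the Map stands in for the uniqueKey term index.
class UniqueKeyLookupSketch {
  private final Map<String, Integer> uniqueKeyToDocId = new HashMap<>();

  void index(String uniqueKey, int docId) {
    if (uniqueKeyToDocId.containsKey(uniqueKey)) {
      // An IllegalStateException instead of an assert, per the comment above.
      throw new IllegalStateException("Duplicate uniqueKey: " + uniqueKey);
    }
    uniqueKeyToDocId.put(uniqueKey, docId);
  }

  /** Returns the docId for the uniqueKey, or -1 if no document matches. */
  int lookupDocIdByUniqueKey(String uniqueKey) {
    Integer docId = uniqueKeyToDocId.get(uniqueKey);
    return docId == null ? -1 : docId;
  }
}
```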

> SolrIndexSearcher.getFirstMatch trips assertion if multiple matches
> ---
>
> Key: SOLR-13851
> URL: https://issues.apache.org/jira/browse/SOLR-13851
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Chris M. Hostetter
>Priority: Major
>
> the documentation for {{SolrIndexSearcher.getFirstMatch}} says...
> {quote}
> Returns the first document number containing the term t Returns 
> -1 if no document was found. This method is primarily intended for clients 
> that want to fetch documents using a unique identifier."
> @return the first document number containing the term
> {quote}
> But SOLR-12366 refactored {{SolrIndexSearcher.getFirstMatch}} to eliminate 
> its previous implementation and replace it with a call to (a refactored 
> version of) {{SolrIndexSearcher.lookupId}} -- but the code in {{lookupId}} 
> was always designed *explicitly* for dealing with a uniqueKey field, and has 
> an assertion that once it finds a match _there will be no other matches in 
> the index_
> This means that even though {{getFirstMatch}} is _intended_ for fields that 
> are unique between documents, if it's used on a field that is not unique, it 
> can trip an assertion.
> At a minimum we need to either "fix" {{getFirstMatch}} to behave as 
> documented, or fix its documentation.
> Given that the current behavior has now been in place since Solr 7.4, and 
> given that all existing uses in "core" solr code are for looking up docs by 
> uniqueKey, it's probably best to simply fix the documentation, but we should 
> also consider replacing the assertion with an IllegalStateException, or 
> SolrException -- anything not dependent on having assertions enabled -- to 
> prevent silent bugs.






[jira] [Commented] (LUCENE-9006) Ensure WordDelimiterGraphFilter always emits catenateAll token early

2019-10-16 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952799#comment-16952799
 ] 

David Wayne Smiley commented on LUCENE-9006:


Thanks for the explanation RE graphOffsetsAreCorrect.  I guess there is no new 
concern here in the PR then.

I discovered this problem due to a custom filter that directly collaborates 
with a delegated WDGF instance.  It assumes the first two tokens are 
preserveOriginal then catenateAll.  This was the case with the now deprecated 
WDF.  It's intuitive too, so "looks" odd when it doesn't happen.  I noticed in 
LUCENE-8730 a precedent for making the token orderings consistent, which makes 
sense to me.

> Ensure WordDelimiterGraphFilter always emits catenateAll token early
> 
>
> Key: LUCENE-9006
> URL: https://issues.apache.org/jira/browse/LUCENE-9006
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: David Wayne Smiley
>Assignee: David Wayne Smiley
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Ideally, the first token of WDGF is the preserveOriginal (if configured to 
> emit), and the second should be the catenateAll (if configured to emit).  The 
> deprecated WDF does this but WDGF can sometimes put the first other token 
> earlier when there is a non-emitted candidate sub-token.
> Example input "8-other" when only generateWordParts and catenateAll -- *not* 
> generateNumberParts.  WDGF internally sees the '8' but moves on.  Ultimately, 
> the "other" token and the catenated "8other" will appear at the same internal 
> position, which by luck fools the sorter to emit "other" first.






[jira] [Commented] (LUCENE-8509) NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can produce backwards offsets

2019-10-15 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951982#comment-16951982
 ] 

David Wayne Smiley commented on LUCENE-8509:


[~romseygeek] why was this option not added as a new configuration flag?  This 
is sort of an internal implementation detail, so it's not a big deal, but if it 
were a flag, it'd be easier for the tests to toggle.  It's also 
disappointing to see yet another constructor arg when we already have a bit 
field for booleans.

Also:
* the CHANGES.txt claims offset adjusting defaults to false, but actually it 
defaults to true.
* there was no documentation change; at the least, the javadocs of this class, 
which show all the other options, should cover it.

> NGramTokenizer, TrimFilter and WordDelimiterGraphFilter in combination can 
> produce backwards offsets
> 
>
> Key: LUCENE-8509
> URL: https://issues.apache.org/jira/browse/LUCENE-8509
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Major
> Fix For: 8.0
>
> Attachments: LUCENE-8509.patch, LUCENE-8509.patch
>
>
> Discovered by an elasticsearch user and described here: 
> https://github.com/elastic/elasticsearch/issues/33710
> The ngram tokenizer produces tokens "a b" and " bb" (note the space at the 
> beginning of the second token).  The WDGF takes the first token and splits it 
> into two, adjusting the offsets of the second token, so we get "a"[0,1] and 
> "b"[2,3].  The trim filter removes the leading space from the second token, 
> leaving offsets unchanged, so WDGF sees "bb"[1,4]; because the leading space 
> has already been stripped, WDGF sees no need to adjust offsets, and emits the 
> token as-is, resulting in the start offsets of the tokenstream being [0, 2, 
> 1], and the IndexWriter rejecting it.






[jira] [Created] (LUCENE-9006) Ensure WordDelimiterGraphFilter always emits catenateAll token early

2019-10-15 Thread David Wayne Smiley (Jira)
David Wayne Smiley created LUCENE-9006:
--

 Summary: Ensure WordDelimiterGraphFilter always emits catenateAll 
token early
 Key: LUCENE-9006
 URL: https://issues.apache.org/jira/browse/LUCENE-9006
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: David Wayne Smiley
Assignee: David Wayne Smiley


Ideally, the first token of WDGF is the preserveOriginal (if configured to 
emit), and the second should be the catenateAll (if configured to emit).  The 
deprecated WDF does this but WDGF can sometimes put the first other token 
earlier when there is a non-emitted candidate sub-token.

Example input "8-other" when only generateWordParts and catenateAll -- *not* 
generateNumberParts.  WDGF internally sees the '8' but moves on.  Ultimately, 
the "other" token and the catenated "8other" will appear at the same internal 
position, which by luck fools the sorter to emit "other" first.






[jira] [Commented] (SOLR-13821) Package Store

2019-10-08 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947116#comment-16947116
 ] 

David Wayne Smiley commented on SOLR-13821:
---

[~noble.paul] Close this as part of 8.3?

> Package Store
> -
>
> Key: SOLR-13821
> URL: https://issues.apache.org/jira/browse/SOLR-13821
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Ishan Chattopadhyaya
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Package store is a storage managed by Solr that holds the package artifacts. 
> This is replicated across nodes.
> Design is here: 
> [https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?ts=5d86a8ad#]
> The package store is powered by an underlying filestore. This filestore is a 
> fully replicated p2p filesystem storage for artifacts.
> The APIs are as follows
> {code:java}
> # add a file
> POST  /api/cluster/files/path/to/file.jar
> #retrieve a file
> GET /api/cluster/files/path/to/file.jar
> #list files in the /path/to directory
> GET /api/cluster/files/path/to
> #GET meta info of the jar
> GET /api/cluster/files/path/to/file.jar?meta=true
> {code}
> This store keeps 2 files per file
>  # The actual file say {{myplugin.jar}}
>  # A metadata file {{.myplugin.jar.json}} in the same directory
> The contents of the metadata file are:
> {code:json}
> {
> "sha512" : "",
> "sig": {
> "" :""
> }}
> {code}






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-10-08 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946777#comment-16946777
 ] 

David Wayne Smiley commented on LUCENE-8920:


You _might_ want to start with a bit of cleanup/refactoring that makes what 
you're doing easier.  Both from my own experience looking through the FST code 
and from seeing Michael Sokolov's input on the code readability (Dawid too?)... 
I wish the code were easier to read.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) 
> which make gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?






[jira] [Commented] (SOLR-13817) Deprecate legacy SolrCache implementations

2019-10-04 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944801#comment-16944801
 ] 

David Wayne Smiley commented on SOLR-13817:
---

+1
A basic initial step would be to make Caffeine the default right away, and 
eliminate class= on all such caches unless the test is literally testing 
something else.
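As an illustration only (the cache sizes here are placeholders), such an explicit declaration in solrconfig.xml might look like:
{code:xml}
<filterCache class="solr.CaffeineCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
{code}
Making Caffeine the default would mean this class= attribute could simply be omitted.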

> Deprecate legacy SolrCache implementations
> --
>
> Key: SOLR-13817
> URL: https://issues.apache.org/jira/browse/SOLR-13817
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Andrzej Bialecki
>Assignee: Andrzej Bialecki
>Priority: Major
>
> Now that SOLR-8241 has been committed I propose to deprecate other cache 
> implementations in 8x and remove them altogether from 9.0, in order to reduce 
> confusion and maintenance costs.






[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-10-04 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944770#comment-16944770
 ] 

David Wayne Smiley commented on SOLR-13796:
---

I'm looking forward to seeing the PR.

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build upon it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests 
> often take dozens of seconds to minutes when they should take mere seconds or 
> ten.
> As part of this issue, I would like to move the focus of non-Nightly tests 
> towards being more minimal, consistent, and fast.
> In exchange, we must put more effort and care into Nightly tests. Not 
> something that happens now, but if we have solid, fast, consistent 
> non-Nightly tests, that should open up some room for Nightly to get a status 
> boost.






[jira] [Commented] (SOLR-11508) Make coreRootDirectory configurable via an environment variable (SOLR_CORE_HOME)

2019-10-04 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944519#comment-16944519
 ] 

David Wayne Smiley commented on SOLR-11508:
---

This issue has been re-focused to something straightforward and, I think, 
non-controversial: just add a SOLR_CORE_HOME env var option for the existing 
coreRootDirectory setting, and probably explicitly declare this setting in the 
default solr.xml with a Java sys prop -- something already done for tests.

I agree on a key point Shawn makes -- 3 settings is confusing.  We can/should 
do better.  Maybe a new thread on the dev list is where we should have further 
discussion instead of this issue, since it's distracting here.
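A sketch of what that explicit declaration might look like in the default solr.xml; the solr.core.home sys prop name follows the issue description and is not committed syntax:
{code:xml}
<solr>
  <str name="coreRootDirectory">${solr.core.home:}</str>
</solr>
{code}
The empty default after the colon keeps today's behavior when neither the sys prop nor the env var is set.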

> Make coreRootDirectory configurable via an environment variable 
> (SOLR_CORE_HOME)
> 
>
> Key: SOLR-11508
> URL: https://issues.apache.org/jira/browse/SOLR-11508
> Project: Solr
>  Issue Type: Bug
>Reporter: Marc Morissette
>Priority: Major
>
> (Heavily edited)
> Since Solr 7, it is possible to store Solr cores in separate disk locations 
> using solr.data.home (see SOLR-6671). This is very useful when running Solr 
> in Docker where data must be stored in a directory which is independent from 
> the rest of the container.
> While this works well in standalone mode, it doesn't in Cloud mode as the 
> core.properties automatically created by Solr are still stored in 
> coreRootDirectory and cores created that way disappear when the Solr Docker 
> container is redeployed.
> The solution is to configure coreRootDirectory to an empty directory that can 
> be mounted outside the Docker container.
> The incoming patch makes this easier to do by allowing coreRootDirectory to 
> be configured via a solr.core.home system property and SOLR_CORE_HOME 
> environment variable.






[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

2019-10-02 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943149#comment-16943149
 ] 

David Wayne Smiley commented on SOLR-13790:
---

Interesting; your proposal makes sense.  Thanks [~ab].

> LRUStatsCache size explosion and ineffective caching
> 
>
> Key: SOLR-13790
> URL: https://issues.apache.org/jira/browse/SOLR-13790
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Affects Versions: 7.7.2, 8.2, 8.3
>Reporter: Andrzej Bialecki
>Assignee: Andrzej Bialecki
>Priority: Critical
> Fix For: 7.7.3, 8.3
>
> Attachments: SOLR-13790.patch, SOLR-13790.patch
>
>
> On a sizeable cluster with multi-shard multi-replica collections, when 
> {{LRUStatsCache}} was in use we encountered excessive memory usage, which 
> consequently led to severe performance problems.
> On a closer examination of the heapdumps it became apparent that when 
> {{LRUStatsCache.addToPerShardTermStats}} is called it creates instances of 
> {{FastLRUCache}} using the passed {{shard}} argument - however, the value of 
> this argument is not a simple shard name but instead it's a randomly ordered 
> list of ALL replica URLs for this shard.
> As a result, due to the combinatoric number of possible keys, over time the 
> map in {{LRUStatsCache.perShardTermStats}} grew to contain ~2 million entries...
> The fix seems to be simply to extract the shard name and cache using this 
> name instead of the full string value of the {{shard}} parameter. Existing 
> unit tests also need much improvement.
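
The combinatoric blow-up described above is easy to reproduce: if the cache key is a randomly ordered list of replica URLs rather than the shard name, every permutation becomes a distinct key. A minimal sketch of the described fix, where the URL format and the helper are assumptions for illustration, not the actual SOLR-13790 patch:

```python
from itertools import permutations

# Hypothetical keys as LRUStatsCache received them: a randomly ordered,
# comma-separated list of ALL replica URLs for one shard.
replicas = [
    "http://node1:8983/solr/coll_shard1_replica_n1",
    "http://node2:8983/solr/coll_shard1_replica_n2",
    "http://node3:8983/solr/coll_shard1_replica_n3",
]
raw_keys = {"|".join(p) for p in permutations(replicas)}
# One logical shard produces 3! = 6 distinct cache keys; with more
# replicas and shards this multiplies into millions of entries.
print(len(raw_keys))  # 6

def shard_key(raw: str) -> str:
    """Extract a stable shard name from the first replica URL in the key.

    Assumes core names follow the common <collection>_shardN_replica_* pattern;
    illustrative sketch only."""
    core = raw.split("|")[0].rsplit("/", 1)[-1]   # e.g. coll_shard1_replica_n1
    collection, shard, *_ = core.split("_")
    return f"{collection}_{shard}"

# All permutations collapse to a single logical key.
normalized = {shard_key(k) for k in raw_keys}
print(normalized)  # {'coll_shard1'}
```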





[jira] [Commented] (SOLR-8241) Evaluate W-TinyLfu cache

2019-10-02 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943141#comment-16943141
 ] 

David Wayne Smiley commented on SOLR-8241:
--

Woohoo!  Thanks [~ab], and thanks for your extreme persistence [~ben.manes].  
Better late than never.  I hope to see this as the default in Solr configs in 9.0.

> Evaluate W-TinyLfu cache
> 
>
> Key: SOLR-8241
> URL: https://issues.apache.org/jira/browse/SOLR-8241
> Project: Solr
>  Issue Type: Improvement
>  Components: search
>Reporter: Ben Manes
>Assignee: Andrzej Bialecki
>Priority: Major
> Fix For: master (9.0)
>
> Attachments: EvictionBenchmark.png, GetPutBenchmark.png, 
> SOLR-8241.patch, SOLR-8241.patch, SOLR-8241.patch, SOLR-8241.patch, 
> SOLR-8241.patch, SOLR-8241.patch, caffeine-benchmark.txt, proposal.patch, 
> solr_caffeine.patch.gz, solr_jmh_results.json
>
>
> SOLR-2906 introduced an LFU cache and in-progress SOLR-3393 makes it O(1). 
> The discussions seem to indicate that the higher hit rate (vs LRU) is offset 
> by the slower performance of the implementation. An original goal appeared to 
> be to introduce ARC, a patented algorithm that uses ghost entries to retain 
> history information.
> My analysis of Window TinyLfu indicates that it may be a better option. It 
> uses a frequency sketch to compactly estimate an entry's popularity. It uses 
> LRU to capture recency and operates in O(1) time. When using available 
> academic traces the policy provides a near optimal hit rate regardless of the 
> workload.
> I'm getting ready to release the policy in Caffeine, which Solr already has a 
> dependency on. But, the code is fairly straightforward and a port into Solr's 
> caches instead is a pragmatic alternative. More interesting is what the 
> impact would be in Solr's workloads and feedback on the policy's design.
> https://github.com/ben-manes/caffeine/wiki/Efficiency
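
The "frequency sketch" mentioned above can be illustrated with a tiny count-min sketch: an item's popularity is estimated compactly as the minimum over a few hashed counters, which is the core idea behind W-TinyLFU's admission decision. This is a from-scratch sketch of the concept, not Caffeine's implementation:

```python
import random

class CountMinSketch:
    """Compact frequency estimator: depth rows of width counters, one hash
    per row.  The estimate is the min over rows, so it can only over-count,
    never under-count."""
    def __init__(self, width=256, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.seeds = [rng.randrange(1 << 30) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, seed):
        return hash((seed, item)) % self.width

    def add(self, item):
        for seed, row in zip(self.seeds, self.rows):
            row[self._index(item, seed)] += 1

    def estimate(self, item):
        return min(row[self._index(item, seed)]
                   for seed, row in zip(self.seeds, self.rows))

sketch = CountMinSketch()
for _ in range(100):
    sketch.add("hot-query")
sketch.add("one-off-query")

# A TinyLFU-style admission test: the cache admits the candidate only if
# its estimated popularity exceeds that of the eviction victim.
def admit(candidate, victim):
    return sketch.estimate(candidate) > sketch.estimate(victim)

print(admit("hot-query", "one-off-query"))  # True
```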





[jira] [Commented] (SOLR-13722) Package Management APIs

2019-10-02 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943087#comment-16943087
 ] 

David Wayne Smiley commented on SOLR-13722:
---

I'm confused on the status here.  It's in master but not 8x; was that 
deliberate?  This is not an implied "ask" to merge to 8x, as it ought to be 
reviewed anyway.

> Package Management APIs
> ---
>
> Key: SOLR-13722
> URL: https://issues.apache.org/jira/browse/SOLR-13722
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Labels: package
>
> This ticket totally eliminates the need for an external service to host the 
> jars, so a URL will no longer be required. An external URL leads to 
> unreliability because the service may go offline or be DDoSed if/when too 
> many requests are sent to it.
>  
>  
>  Add a jar to cluster as follows
> {code:java}
> curl -X POST -H 'Content-Type: application/octet-stream' --data-binary 
> @myjar.jar http://localhost:8983/api/cluster/filestore/package?name=myjar.jar
> {code}
> This does the following operations
>  * Upload this jar to all the live nodes in the system
>  * The name of the file is the {{sha256-}} of the file/payload
>  * The store is agnostic of the content of the file/payload
> h2. How it works
> A blob that is POSTed to the {{/api/cluster/blob}} end point is persisted 
> locally & all nodes are instructed to download it from this node or from any 
> other available node. If a node comes up later, it can query other nodes in 
> the system and download the blobs as required
> h2. {{add}} package command
> {code:java}
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>   "add": {
>"name": "my-package" ,
>   "file":{"id" : "", "sig" : ""}
>   }}' http://localhost:8983/api/cluster/package
> {code}
>  
>  
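
The content-addressed naming described above (the file name is the sha256 of the payload) can be sketched as follows; the helper name is invented for illustration:

```python
import hashlib

def filestore_name(payload: bytes) -> str:
    """Derive the file-store name from the payload itself, so any node can
    verify a downloaded blob by re-hashing it.  Illustrative sketch only;
    the exact naming scheme in the patch may differ."""
    return hashlib.sha256(payload).hexdigest()

jar = b"\xca\xfe\xba\xbe fake jar bytes"
name = filestore_name(jar)
print(name)

# Verification on a receiving node: recompute the hash and compare.
assert filestore_name(jar) == name
```

Because the name is derived from the content, a tampered or corrupted download fails verification on arrival.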





[jira] [Commented] (SOLR-13661) A package management system for Solr

2019-10-01 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942472#comment-16942472
 ] 

David Wayne Smiley commented on SOLR-13661:
---

Don't merge this to master.  Please be patient for peer review.

> A package management system for Solr
> 
>
> Key: SOLR-13661
> URL: https://issues.apache.org/jira/browse/SOLR-13661
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Ishan Chattopadhyaya
>Priority: Blocker
>  Labels: package
> Attachments: plugin-usage.png, repos.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Here's the design doc:
> https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?usp=sharing





[jira] [Commented] (SOLR-13710) DistribFileStore: a distributed p2p file store

2019-10-01 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942466#comment-16942466
 ] 

David Wayne Smiley commented on SOLR-13710:
---

Just "FileStore" might imply local (non-distributed) by itself; I wonder if 
"[Cluster|Cloud|Shared]FileStore" might be better options.  Nevertheless 
FileStore is a decent name.

Also I think this internally should be an abstraction with multiple possible 
implementations.  It should be _possible_ to have a WebDAV-like impl, a shared 
file system impl, an S3 impl, or heck even a ZooKeeper impl (not necessarily the 
same ZK ensemble as the cluster state).  The design should allow for these impls 
to be extremely simple, especially the shared file system impl -- at least 
theoretically, since it need not be implemented right away.

I'm not sure if the SHA256 stuff needs to be a fundamental part of this 
abstraction.  Why not just a shared directory of persisted data and _that's 
it_?  Could the "package" concept be layered on top (which may or may not use 
SHA256 naming schemes)?  I could imagine one day an option to put configSets in 
this thing.
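
The abstraction argued for here could look something like the minimal interface sketch below. All names are invented for illustration; none of this is Solr API:

```python
from abc import ABC, abstractmethod

class FileStore(ABC):
    """Hypothetical pluggable file store: the cluster-p2p impl would be just
    one backend; shared-filesystem or S3 impls satisfy the same contract."""

    @abstractmethod
    def put(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, path: str) -> bytes: ...

    @abstractmethod
    def list(self, prefix: str) -> list[str]: ...

class InMemoryFileStore(FileStore):
    """Trivial impl standing in for a shared-filesystem backend."""
    def __init__(self):
        self._files: dict[str, bytes] = {}

    def put(self, path, data):
        self._files[path] = data

    def get(self, path):
        return self._files[path]

    def list(self, prefix):
        return sorted(p for p in self._files if p.startswith(prefix))

store: FileStore = InMemoryFileStore()
store.put("package/myjar.jar", b"jar-bytes")
print(store.list("package/"))  # ['package/myjar.jar']
```

With such a contract, layering a "package" concept (and any SHA256 naming scheme) on top becomes a policy choice rather than part of the storage abstraction.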

> DistribFileStore: a distributed p2p file store
> --
>
> Key: SOLR-13710
> URL: https://issues.apache.org/jira/browse/SOLR-13710
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> * All jars for packages downloaded are stored in a dir 
> SOLR_HOME/filestore/package. 
> * The file names will be the sha256_hash-.
> * Before downloading a jar from a location, it is first checked for in the 
> local directory
> * POST a jar to 
> {{http://localhost:8983/api/cluster/filestore/package?name=}} to 
> distribute it in the cluster
> * A new API end point {{http://localhost:8983/api/node/filestore/package}} 
> will list the available files
> * The file will be downloadable at 
> {{http://localhost:8983/api/node/filestore/package/}} 
> Design: 
> https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?ts=5d86a8ad#heading=h.qxgax9a5br5o





[jira] [Assigned] (SOLR-13802) Analyzer property luceneMatchVersion is not written to managed schema

2019-09-30 Thread David Wayne Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Wayne Smiley reassigned SOLR-13802:
-

Assignee: David Wayne Smiley

> Analyzer property luceneMatchVersion is not written to managed schema
> -
>
> Key: SOLR-13802
> URL: https://issues.apache.org/jira/browse/SOLR-13802
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Schema and Analysis
>Affects Versions: 7.7.2, master (9.0), 8.2
>Reporter: Thomas Wöckinger
>Assignee: David Wayne Smiley
>Priority: Major
>  Labels: easy-fix, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The analyzer property luceneMatchVersion is not written to the managed 
> schema; it is simply not handled by the code.





[jira] [Commented] (SOLR-13661) A package management system for Solr

2019-09-30 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941007#comment-16941007
 ] 

David Wayne Smiley commented on SOLR-13661:
---

The design document is pretty fantastic in its overall scope (not too much or 
too little) and structure (easy to consume).  Of course I have things in it to 
debate, but it was a breath of fresh air to read.

I want to be clear on one thing: the concern/frustration that Jan and I have 
on peer review is because this issue is not some ordinary JIRA issue.  It's 
highly impactful to Solr.  As such, IMO peer review is _required_ for at least 
the major ideas / high level, naming, CLI, release-plan.  Getting into the 
small details, no, not needed.  Thankfully the peer review is here now :-)

> A package management system for Solr
> 
>
> Key: SOLR-13661
> URL: https://issues.apache.org/jira/browse/SOLR-13661
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Ishan Chattopadhyaya
>Priority: Major
>  Labels: package
> Attachments: plugin-usage.png, repos.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Here's the design doc:
> https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?usp=sharing





[jira] [Commented] (SOLR-13764) Parse Interval Query from JSON API

2019-09-28 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940136#comment-16940136
 ] 

David Wayne Smiley commented on SOLR-13764:
---

I admit I'm a little confused.  I just reacquainted myself with the JSON Query 
DSL implemented via JsonQueryConverter; a fine piece of work IMO.  Do you 
propose something that competes with it or that works nice with it?  You've 
mentioned not wanting to make new QParsers for spans or intervals but I don't 
see that as a real concern.

> Parse Interval Query from JSON API
> --
>
> Key: SOLR-13764
> URL: https://issues.apache.org/jira/browse/SOLR-13764
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Reporter: Mikhail Khludnev
>Priority: Major
>
> h2. Context
> Lucene has Intervals query LUCENE-8196. Note: these are a kind of healthy 
> man's Spans/Phrases. Note: It's not about ranges nor facets.
> h2. Problem
> There's no way to search by IntervalQuery via JSON Query DSL.
> h2. Suggestion
> * Create a classic QParser \{{ {!interval df=text_content}a_json_param}}, i.e. 
> one can combine a few such refs in {{json.query.bool}}
> * It accepts just the name of a JSON param; nothing like this exists yet.
> * This param carries plain JSON which is accessible via {{req.getJSON()}}
> please examine 
> https://cwiki.apache.org/confluence/display/SOLR/SOLR-13764+Discussion+-+Interval+Queries+in+JSON
>  for syntax proposal.
> h2. Challenges
> * I have no idea about a particular JSON DSL for these queries; the Lucene 
> API seems easily JSON-able. Proposals are welcome.
> * Another awkward thing is combining analysis and the low-level query API, 
> e.g. what if one requests a term for one word and analysis yields two tokens; 
> and vice versa, requesting a phrase might end up with a single token stream.
>  * Putting json into Jira ticket description
> h2. Q: Why don't..
> .. put intervals DSL right into {{json.query}}, avoiding these odd param 
> refs? 
>  A: It requires heavy lifting for {{JsonQueryConverter}}, which is streamlined 
> for handling good old HTTP parameterized queries.





[jira] [Assigned] (SOLR-13795) SolrIndexSearcher still uses old schema after schema update using schema-api

2019-09-27 Thread David Wayne Smiley (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Wayne Smiley reassigned SOLR-13795:
-

Assignee: David Wayne Smiley

> SolrIndexSearcher still uses old schema after schema update using schema-api
> 
>
> Key: SOLR-13795
> URL: https://issues.apache.org/jira/browse/SOLR-13795
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: config-api, Schema and Analysis, Server, SolrJ, v2 API
>Affects Versions: 7.7.2, master (9.0), 8.2
>Reporter: Thomas Wöckinger
>Assignee: David Wayne Smiley
>Priority: Critical
>  Labels: easyfix, pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When adding a new field to the schema using the schema-api, the new field is 
> not known by the current SolrIndexSearcher. In SolrCloud any core gets 
> reloaded after the new schema is persisted; this does not happen in the case 
> of a standalone HTTP Solr server or EmbeddedSolrServer.
> So currently an additional commit is necessary to open a new 
> SolrIndexSearcher using the new schema.
> The fix is really easy: just reload the core!





[jira] [Commented] (SOLR-13722) A cluster-wide blob upload package option & avoid remote url

2019-09-26 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938711#comment-16938711
 ] 

David Wayne Smiley commented on SOLR-13722:
---

There are two sub-tasks that, based on the title alone, seem like the same 
thing: this one and SOLR-13710.  So it's not clear where to discuss a new/second 
"blob store".  Can you help disambiguate these for me [~noble.paul]?

> A cluster-wide blob upload package option & avoid remote url
> 
>
> Key: SOLR-13722
> URL: https://issues.apache.org/jira/browse/SOLR-13722
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Noble Paul
>Assignee: Noble Paul
>Priority: Major
>  Labels: package
>
> This ticket totally eliminates the need for an external service to host the 
> jars, so a URL will no longer be required. An external URL leads to 
> unreliability because the service may go offline or be DDoSed if/when too 
> many requests are sent to it.
>  
>  
>  Add a jar to cluster as follows
> {code:java}
> curl -X POST -H 'Content-Type: application/octet-stream' --data-binary 
> @myjar.jar http://localhost:8983/api/cluster/blob
> {code}
> This does the following operations
>  * Upload this jar to all the live nodes in the system
>  * The name of the file is the {{sha256}} of the file/payload
>  * The blob is agnostic of the content of the file/payload
> h2. How it works
> A blob that is POSTed to the {{/api/cluster/blob}} end point is persisted 
> locally & all nodes are instructed to download it from this node or from any 
> other available node. If a node comes up later, it can query other nodes in 
> the system and download the blobs as required
> h2. {{add-package}} command
> {code:java}
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>   "add-package": {
>"name": "my-package" ,
>   "sha256":""
>   }}' http://localhost:8983/api/cluster
> {code}
>  The {{sha256}} is the same as the file name. It gets hold of the jar using 
> the following steps:
>  * check the local file system for the blob
>  * if not available locally, query other live nodes (one by one) to see if 
> they have the blob
>  * if a node has it, it's downloaded and persisted to its local {{blob}} dir
> h2. Security
> The blob upload does not check the content of the payload and it does not 
> verify the file. However, the {{add-package}} and {{update-package}} commands 
> check the signatures (if enabled). 
>  The size of the file is limited to 5 MB, to avoid OOM. This can be changed 
> using the system property {{runtime.lib.size}}. 
>  
>  
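
The jar-resolution steps in the description (check locally, then ask live nodes one by one, then persist) amount to a simple fallback chain. A sketch under the assumption that peers are reachable via some lookup call, here modeled as plain callables rather than real HTTP requests:

```python
def fetch_blob(sha256, local_store, peers):
    """Resolve a blob: local dir first, then live nodes one by one;
    persist whatever was fetched.  'peers' is a list of callables that
    return the blob bytes or None -- stand-ins for real HTTP calls."""
    if sha256 in local_store:
        return local_store[sha256]
    for ask_peer in peers:
        data = ask_peer(sha256)
        if data is not None:
            local_store[sha256] = data  # persist to the local blob dir
            return data
    raise FileNotFoundError(f"blob {sha256} not found on any live node")

local = {}
peers = [lambda h: None,  # first node doesn't have it
         lambda h: b"jar-bytes" if h == "abc123" else None]
print(fetch_blob("abc123", local, peers))  # b'jar-bytes'
print("abc123" in local)  # True: subsequent lookups are served locally
```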


