[jira] [Commented] (LUCENE-9004) Approximate nearest vector search

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018518#comment-17018518
 ] 

Tomoko Uchida commented on LUCENE-9004:
---

Hi [~jtibshirani], 
 thanks for your comments/suggestions. I will check the links you mentioned.
{quote}It also suggests that graph-based kNN is an active research area and 
that there are likely to be improvements + new approaches that come out.
{quote}
Yes, there are so many proposed methods and their variants in this field. 
Currently I'm not fully sure which approach is the most feasible for Lucene.

Also, I just noticed an issue that proposes a product quantization based 
approach - roughly speaking, it may need less disk and memory space than 
graph-based methods like HNSW, but incurs higher indexing and query-time costs: 
https://issues.apache.org/jira/browse/LUCENE-9136
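
To make the trade-off concrete, here is a minimal product-quantization encoder 
sketch (illustrative only - not the LUCENE-9136 implementation; the codebooks 
are assumed to be pre-trained, e.g. by k-means): each vector is split into m 
subvectors, and each subvector is replaced by the id of its nearest codebook 
centroid, so a vector costs m bytes instead of 4*D, at the price of the extra 
distance computations at index and query time mentioned above.
{code:java}
/** Hedged sketch of product quantization: a D-dim float vector is reduced to
 *  m one-byte centroid ids, trading index/query time for memory. */
class ProductQuantizer {
  private final float[][][] codebooks; // [m][256][D/m], assumed pre-trained
  private final int m;                 // number of subspaces

  ProductQuantizer(float[][][] codebooks) {
    this.codebooks = codebooks;
    this.m = codebooks.length;
  }

  byte[] encode(float[] vector) {
    int subDim = vector.length / m;
    byte[] codes = new byte[m];
    for (int i = 0; i < m; i++) {
      int best = 0;
      float bestDist = Float.POSITIVE_INFINITY;
      for (int c = 0; c < codebooks[i].length; c++) {
        float dist = 0f;
        for (int d = 0; d < subDim; d++) {
          float diff = vector[i * subDim + d] - codebooks[i][c][d];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      codes[i] = (byte) best; // one byte per subspace instead of 4*subDim bytes
    }
    return codes;
  }
}
{code}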

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to look up through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segments' graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, 
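
A rough sketch of the greedy traversal described in the proposal above (hedged: 
a single-layer adjacency map and a pluggable distance function rather than 
Lucene's actual per-segment structures; in HNSW this descent runs per layer, 
with a beam wider than 1 at the bottom layer):
{code:java}
import java.util.Map;
import java.util.function.BiFunction;

class GreedyGraphSearch {
  static int search(int entryPoint, float[] query,
                    Map<Integer, int[]> neighbors,
                    BiFunction<Integer, float[], Float> distance) {
    int current = entryPoint;
    float currentDist = distance.apply(current, query);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int candidate : neighbors.get(current)) {
        float d = distance.apply(candidate, query);
        if (d < currentDist) { // greedily move to any closer neighbor
          current = candidate;
          currentDist = d;
          improved = true;
        }
      }
    }
    return current; // a local minimum = the approximate nearest neighbor
  }
}
{code}
Likewise, the "efficient conversion from bytes to floats" mentioned for the 
vector encoding could look roughly like this (a minimal sketch, assuming each 
vector is stored as 4*dimension bytes; the encoding Lucene ends up using may 
differ):
{code:java}
import java.nio.ByteBuffer;

class VectorDecoding {
  /** Decode one document's vector from its binary doc value bytes. */
  static float[] decode(byte[] bytes, int offset, int dimension) {
    float[] vector = new float[dimension];
    ByteBuffer.wrap(bytes, offset, dimension * Float.BYTES)
        .asFloatBuffer()
        .get(vector);
    return vector;
  }
}
{code}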

[jira] [Assigned] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida reassigned LUCENE-9123:
-

Assignee: Tomoko Uchida

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Assignee: Tomoko Uchida
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance of increasing 
> recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField" 
> positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" 
> tags="lang/stoptags_ja.txt" />
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" 
> minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the output on the console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018504#comment-17018504
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

{quote}OK. I will prepare another patch for the master branch.
{quote}
Thanks [~h.kazuaki]! Once the work is done I can create the patch for the 8x 
branch by applying some modifications to yours, if you feel bothered by arranging 
two patches.
 Also we need to add some tests to {{TestJapaneseTokenizer}} and 
{{TestJapaneseTokenizerFactory}}. And by convention, the final patch for the 
master branch should be named "LUCENE-9123.patch", so can you please overwrite 
the obsolete patch instead of uploading new ones?
{quote}Then, the maintainer of the Japanese Tokenizer can choose how to merge 
the changes (who is responsible for the Japanese Tokenizer now?)
{quote}
I'm not sure if there is an explicit maintainer for each Lucene module; 
theoretically, every person who has write access to the ASF repo can commit any 
patch on their own responsibility.
 Let us wait for a few days and I will commit the patch if there are no other 
comments or objections.
 [~cm] do you have any feedback about this change?







[jira] [Commented] (LUCENE-9142) Add documentation to Operations.determinize, SortedIntSet, and FrozenSet

2020-01-17 Thread Mike Drob (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018492#comment-17018492
 ] 

Mike Drob commented on LUCENE-9142:
---

Yep, good eye. I was able to reproduce the bug in a unit test and then 
refactored the code to be more readable and less mysterious.

> Add documentation to Operations.determinize, SortedIntSet, and FrozenSet
> 
>
> Key: LUCENE-9142
> URL: https://issues.apache.org/jira/browse/LUCENE-9142
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Mike Drob
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Was tracing through the fuzzy query code, and IntelliJ helpfully pointed out 
> that we have mismatched types when trying to reuse states, and so we may be 
> creating more states than we need to.
> Relevant snippets:
> {code:title=Operations.java}
> Map<SortedIntSet.FrozenIntSet,Integer> newstate = new HashMap<>();
> final SortedIntSet statesSet = new SortedIntSet(5);
> Integer q = newstate.get(statesSet);
> {code}
> {{q}} is always going to be null in this path because there are no 
> SortedIntSet keys in the map.
> There is also very little javadoc on SortedIntSet, so I'm having trouble 
> following the precise relationship between all the pieces here.
> cc: [~mikemccand] [~romseygeek] - I would appreciate any pointers if you have 
> them






[jira] [Comment Edited] (LUCENE-9004) Approximate nearest vector search

2020-01-17 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017696#comment-17017696
 ] 

Julie Tibshirani edited comment on LUCENE-9004 at 1/18/20 3:12 AM:
---

Hello and thank you for this very exciting work! We have been doing research 
into nearest neighbor search on high-dimensional vectors and I wanted to share 
some thoughts here in the hope that they're helpful.

Related to Adrien's comment about search filters, I am wondering how deleted 
documents would be handled. If I'm understanding correctly, a segment's deletes 
are applied 'on top of' the query. So if the k nearest neighbors to the query 
vector all happen to be deleted, then the query won't bring back any documents. 
From a user's perspective, I could see this behavior being surprising or hard 
to work with. One approach would be to keep expanding the search while skipping 
over deleted documents, but I'm not sure about the performance + accuracy it 
would give (there's a [short 
discussion|https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892] 
in the hnswlib repo on this point).
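
For concreteness, the "keep expanding while skipping deleted documents" option 
could be sketched like this (hedged: {{Bits}} is a minimal stand-in for 
Lucene's liveDocs, and the best-first candidate iterator is assumed to come 
from the graph search):
{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class LiveTopK {
  interface Bits { boolean get(int index); } // stand-in for liveDocs

  static List<Integer> topKLive(Iterator<Integer> candidatesBestFirst,
                                Bits liveDocs, int k) {
    List<Integer> results = new ArrayList<>();
    while (results.size() < k && candidatesBestFirst.hasNext()) {
      int docId = candidatesBestFirst.next();
      if (liveDocs == null || liveDocs.get(docId)) { // skip deleted docs
        results.add(docId);
      }
    }
    return results; // may hold fewer than k hits if candidates run out
  }
}
{code}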

The recent paper [Graph based Nearest Neighbor Search: Promises and 
Failures|https://arxiv.org/abs/1904.02077] compares HNSW to other graph-based 
approaches and claims that the hierarchy of layers only really helps in low 
dimensions (Figure 4). In these experiments, they see that a 'flat' version of 
HNSW performs very similarly to the original above around 16 dimensions. The 
original HNSW paper also cites the hierarchy as most helpful in low dimensions. 
This seemed interesting in that it may be possible to avoid some complexity if 
the focus is not on low-dimensional vectors. (It also suggests that graph-based 
kNN is an active research area and that there are likely to be improvements + 
new approaches that come out. One such new approach is [DiskANN Fast Accurate 
Billion-point Nearest Neighbor Search on a Single 
Node|https://suhasjs.github.io/files/diskann_neurips19.pdf]).

On the subject of testing recall, I'm working on adding [sentence 
embedding|https://github.com/erikbern/ann-benchmarks/issues/144] and [deep 
image descriptor|https://github.com/erikbern/ann-benchmarks/issues/143] 
datasets to the ann-benchmarks repo. Hopefully that will help in providing 
realistic shared data to test against.

 




[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Kazuaki Hiraga (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018470#comment-17018470
 ] 

Kazuaki Hiraga commented on LUCENE-9123:


{quote}
I thought the change in the behavior has very small or no impact for users who 
use the Tokenizer for searching, but yes it would affect users who use it for 
pure tokenization purpose.
{quote}

 Yes, I think the point is that changing the default behavior affects 
Solr/Elasticsearch users as well if those products don't change the parameter. But 
you may be right that we can change the default behavior. I have no idea...

{quote}
How about this proposal: we can create two patches, one for the master and one 
for 8x. On 8x branch, add the new constructor so you can use it from the next 
update. There is no change in the default behavior. On the master branch, 
switch the default behavior (users who don't like the change can still switch 
back by using the full constructor).
{quote}

OK. I will prepare another patch for the master branch. Then, the maintainer of 
the Japanese Tokenizer can choose how to merge the changes (who is responsible 
for the Japanese Tokenizer now?)







[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"

2020-01-17 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018437#comment-17018437
 ] 

Chris M. Hostetter commented on SOLR-11746:
---

bq.  I think it would be nice to have a separate patch/issue to add something 
like a hasNorms() functionality to SchemaField

Sounds like a good idea ... but I think it would probably be cleaner if 
{{PointField.init(...)}} & {{PointField.checkSchemaField(...)}} would check for 
types/fields using {{omitNorms==false}} and if found explicitly fail with an 
error that {{PointFields don't support 'omitNorms=false'}}  (...unless 
{{TEST_HACK_IGNORE_USELESS_TRIEFIELD_ARGS == true}} -- similar to how 
precisionStep is silently ignored at a 'type' level -- so that we could 
continue to randomize either point fields or trie fields in the same test 
schema)

that way -- IIUC -- your proposed {{SchemaField.hasNorms()}} would just be {{! 
SchemaField.omitNorms()}}
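
In code form, that relationship would just be (a minimal sketch on 
SchemaField, assuming the PointField checks above are in place):
{code:java}
// Norms exist exactly when they were not omitted.
public boolean hasNorms() {
  return !omitNorms();
}
{code}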

> numeric fields need better error handling for prefix/wildcard syntax -- 
> consider uniform support for "foo:* == foo:[* TO *]"
> 
>
> Key: SOLR-11746
> URL: https://issues.apache.org/jira/browse/SOLR-11746
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.0
>Reporter: Chris M. Hostetter
>Assignee: Houston Putman
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, 
> SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch
>
>
> On the solr-user mailing list, Torsten Krah pointed out that with Trie 
> numeric fields, query syntax such as {{foo_d:\*}} has been functionally 
> equivalent to {{foo_d:\[\* TO \*]}} and asked why this was not also supported 
> for Point based numeric fields.
> The fact that this type of syntax works (for {{indexed="true"}} Trie fields) 
> appears to have been an (untested, undocumented) fluke of Trie fields given 
> that they use indexed terms for the (encoded) numeric terms and inherit the 
> default implementation of {{FieldType.getPrefixQuery}} which produces a 
> prefix query against the {{""}} (empty string) term.  
> (Note that this syntax has apparently _*never*_ worked for Trie fields with 
> {{indexed="false" docValues="true"}} )
> In general, we should assess the behavior when users attempt a prefix/wildcard 
> syntax query against numeric fields, as currently the behavior is largely 
> nonsensical: prefix/wildcard syntax frequently matches no docs w/o any sort 
> of error, and the aforementioned {{numeric_field:*}} behaves inconsistently 
> between points/trie fields and between indexed/docValued trie fields.






[GitHub] [lucene-solr] msfroh commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit

2020-01-17 Thread GitBox
msfroh commented on a change in pull request #1155: LUCENE-8962: Add ability to 
selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r368187135
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -3223,15 +3259,44 @@ private long prepareCommitInternal() throws 
IOException {
   // sneak into the commit point:
   toCommit = segmentInfos.clone();
 
+  if (anyChanges) {
+mergeAwaitLatchRef = new AtomicReference<>();
 
 Review comment:
   This is a little bit of hackery to share state between this thread and the 
threads that do the merges. 
   
   We initialize the ref here, pass it to `waitForMergeOnCommitPolicy` on the 
next line to make sure it gets shared with any computed `OneMerge`s. Then, 
before we release the IW lock (so we're guaranteed that those `OneMerge`s 
haven't run yet), we populate the ref with the `CountDownLatch` (once we know 
what we're counting down).
   
   That said, I think I could simplify things a lot by not using 
`OneMergeWrappingMergePolicy`, but rather decorating the returned `OneMerge`s 
(if applicable) directly. I'm going to take a stab at that approach in my next 
commit.
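
A hedged sketch of that hand-off pattern in isolation (illustrative only; the 
real code publishes the latch while still holding the IW lock, so the merge 
threads never need to spin):
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class LatchHandoffDemo {
  public static void main(String[] args) throws InterruptedException {
    AtomicReference<CountDownLatch> latchRef = new AtomicReference<>();
    int mergeCount = 2; // in IndexWriter, known only after findCommitMerges()

    for (int i = 0; i < mergeCount; i++) {
      new Thread(() -> {
        // simulate a merge, then signal completion once the latch exists
        CountDownLatch latch;
        while ((latch = latchRef.get()) == null) {
          Thread.onSpinWait();
        }
        latch.countDown();
      }).start();
    }

    latchRef.set(new CountDownLatch(mergeCount)); // count is now known
    latchRef.get().await();                       // wait for all "merges"
    System.out.println("all merges finished before commit");
  }
}
{code}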


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob opened a new pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize

2020-01-17 Thread GitBox
madrob opened a new pull request #1184: LUCENE-9142 Refactor IntSet operations 
for determinize
URL: https://github.com/apache/lucene-solr/pull/1184
 
 
   Fix a bug where a frozen set could be symmetrically unequal to the
   sorted set that created it because we compared the backing array
   instead of only the active elements.
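
A hedged illustration of the bug class being fixed (the class below is 
simplified and hypothetical, not the actual SortedIntSet/FrozenIntSet):
{code:java}
import java.util.Arrays;

class IntSet {
  final int[] backing; // may be oversized, with garbage past 'size'
  final int size;

  IntSet(int[] backing, int size) { this.backing = backing; this.size = size; }

  boolean equalsFullArray(IntSet other) {      // buggy: compares garbage slots too
    return Arrays.equals(backing, other.backing);
  }

  boolean equalsActiveElements(IntSet other) { // fixed: compares only the prefix
    if (size != other.size) return false;
    for (int i = 0; i < size; i++) {
      if (backing[i] != other.backing[i]) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    IntSet frozen = new IntSet(new int[] {1, 2, 3}, 3);
    IntSet sorted = new IntSet(new int[] {1, 2, 3, 0, 0}, 3); // oversized backing
    System.out.println(sorted.equalsFullArray(frozen));       // false
    System.out.println(sorted.equalsActiveElements(frozen));  // true
  }
}
{code}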
   





[GitHub] [lucene-solr] msfroh commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit

2020-01-17 Thread GitBox
msfroh commented on a change in pull request #1155: LUCENE-8962: Add ability to 
selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r368184052
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/index/IndexWriter.java
 ##
 @@ -3223,15 +3259,44 @@ private long prepareCommitInternal() throws 
IOException {
   // sneak into the commit point:
   toCommit = segmentInfos.clone();
 
+  if (anyChanges) {
+mergeAwaitLatchRef = new AtomicReference<>();
+MergePolicy mergeOnCommitPolicy = 
waitForMergeOnCommitPolicy(config.getMergePolicy(), toCommit, 
mergeAwaitLatchRef);
+
+// Find any merges that can execute on commit (per 
MergePolicy).
+commitMerges = 
mergeOnCommitPolicy.findCommitMerges(segmentInfos, this);
+if (commitMerges != null && commitMerges.merges.size() > 0) {
 
 Review comment:
   Based on e.g. NoMergePolicy, it looks like the convention to say "I found no 
merges" is to return null, but I can't see anything preventing a MergePolicy 
from returning an empty MergeSpecification.
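
A defensive caller can collapse the two conventions into one check, e.g. (a 
sketch, assuming MergeSpecification exposes its merges list as in Lucene):
{code:java}
// Treat both null and an empty specification as "no merges found".
boolean hasMerges = commitMerges != null && !commitMerges.merges.isEmpty();
{code}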





[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"

2020-01-17 Thread Houston Putman (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018422#comment-17018422
 ] 

Houston Putman commented on SOLR-11746:
---

The {{getSpecializedRangeQuery()}} call is to make sure that there's no 
circular dependency, since {{getRangeQuery()}} can possibly call 
{{getExistenceQuery()}}. And yeah, I will remove using norms with Point fields.

Thanks for looking into the norms & PointFields, that makes sense. I think it 
would be nice to have a separate patch/issue to add something like a 
{{hasNorms()}} functionality to SchemaField, so that there doesn't have to be 
special logic for certain types. But it's easy enough to use the special case 
of PointFields in this patch.







[GitHub] [lucene-solr] ErickErickson opened a new pull request #1183: LUCENE-9134: Port ant-regenerate tasks to Gradle build

2020-01-17 Thread GitBox
ErickErickson opened a new pull request #1183: LUCENE-9134: Port ant-regenerate 
tasks to Gradle build
URL: https://github.com/apache/lucene-solr/pull/1183
 
 
   # Description
   
   The only file that counts here really is lucene/queryparser/build.gradle. 
All the rest are a result of running the regenerate task. 
   
   There are a couple of dodgy bits, see //nocommit in the build.gradle file.
   
   NOTES:
   - this does not address any of the other regenerate tasks yet.
   - I'm not going to untangle warnings until later.
   
   # Tests
   
   precommit and the full test suite (both gradle and ant) pass, so I think 
it's close for this part of the regenerate task
   





[GitHub] [lucene-solr] msfroh commented on a change in pull request #1155: LUCENE-8962: Add ability to selectively merge on commit

2020-01-17 Thread GitBox
msfroh commented on a change in pull request #1155: LUCENE-8962: Add ability to 
selectively merge on commit
URL: https://github.com/apache/lucene-solr/pull/1155#discussion_r368182168
 
 

 ##
 File path: 
lucene/core/src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java
 ##
 @@ -277,6 +285,92 @@ public void testSetters() {
 assertSetters(new LogDocMergePolicy());
   }
 
+  public void testMergeOnCommit() throws IOException, InterruptedException {
+Directory dir = newDirectory();
+IndexWriter firstWriter = new IndexWriter(dir, newIndexWriterConfig(new 
MockAnalyzer(random()))
+.setMergePolicy(NoMergePolicy.INSTANCE));
+for (int i = 0; i < 5; i++) {
+  TestIndexWriter.addDoc(firstWriter);
+  firstWriter.flush();
+}
+DirectoryReader firstReader = DirectoryReader.open(firstWriter);
+assertEquals(5, firstReader.leaves().size());
+firstReader.close();
+firstWriter.close();
+
+MergePolicy mergeOnCommitPolicy = new LogDocMergePolicy() {
+  @Override
+  public MergeSpecification findCommitMerges(SegmentInfos segmentInfos, 
MergeContext mergeContext) throws IOException {
+// Optimize down to a single segment on commit
+MergeSpecification mergeSpecification = new MergeSpecification();
+List<SegmentCommitInfo> nonMergingSegments = new ArrayList<>();
+for (SegmentCommitInfo sci : segmentInfos) {
+  if (mergeContext.getMergingSegments().contains(sci) == false) {
+nonMergingSegments.add(sci);
+  }
+}
+mergeSpecification.add(new OneMerge(nonMergingSegments));
+return mergeSpecification;
+  }
+};
+
+IndexWriter writerWithMergePolicy = new IndexWriter(dir, 
newIndexWriterConfig(new MockAnalyzer(random()))
+.setMergePolicy(mergeOnCommitPolicy));
+
+writerWithMergePolicy.commit();
+
+DirectoryReader unmergedReader = 
DirectoryReader.open(writerWithMergePolicy);
+assertEquals(5, unmergedReader.leaves().size()); // Don't merge unless 
there's a change
+unmergedReader.close();
+
+TestIndexWriter.addDoc(writerWithMergePolicy);
+writerWithMergePolicy.commit();
+
+DirectoryReader mergedReader = DirectoryReader.open(writerWithMergePolicy);
+assertEquals(1, mergedReader.leaves().size()); // Now we merge on commit
+mergedReader.close();
+
+LineFileDocs lineFileDocs = new LineFileDocs(random());
+int docCount = atLeast(1000);
+AtomicInteger indexedDocs = new AtomicInteger(0);
+int numIndexingThreads = atLeast(2);
+CountDownLatch startingGun = new CountDownLatch(1);
+Collection<Thread> indexingThreads = new ArrayList<>();
+for (int i = 0; i < numIndexingThreads; i++) {
+  Thread t = new Thread(() -> {
+try {
+  while (indexedDocs.getAndIncrement() < docCount) {
+writerWithMergePolicy.addDocument(lineFileDocs.nextDoc());
+if (rarely()) {
+  writerWithMergePolicy.commit();
+}
+  }
+} catch (IOException e) {
+  e.printStackTrace();
+  fail();
+}
+  });
+  t.start();
+  indexingThreads.add(t);
+}
+startingGun.countDown();
+for (Thread t : indexingThreads) {
+  t.join();
+}
+writerWithMergePolicy.commit();
+assertEquals(1, writerWithMergePolicy.listOfSegmentCommitInfos().size());
 
 Review comment:
   I just found that this assertion sometimes fails. If there are some 
pending/running merges left over from the indexing threads, the segments 
associated with those merges will be excluded from merging on commit. I'll 
update this test to wait for pending merges to finish before committing.





[jira] [Commented] (SOLR-12859) DocExpirationUpdateProcessorFactory does not work with BasicAuth

2020-01-17 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018407#comment-17018407
 ] 

Lucene/Solr QA commented on SOLR-12859:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  1m 45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  1m 45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  1m 45s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 63m  
0s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 68m  1s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-12859 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12991247/SOLR-12859.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene2-us-west.apache.org 4.4.0-170-generic #199-Ubuntu SMP 
Thu Nov 14 01:45:04 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / aad849b |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/653/testReport/ |
| modules | C: solr/core U: solr/core |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/653/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> DocExpirationUpdateProcessorFactory does not work with BasicAuth
> 
>
> Key: SOLR-12859
> URL: https://issues.apache.org/jira/browse/SOLR-12859
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.5
>Reporter: Varun Thacker
>Priority: Major
> Attachments: SOLR-12859.patch, SOLR-12859.patch
>
>
> I set up a cluster with basic auth and then wanted to use Solr's TTL feature ( 
> DocExpirationUpdateProcessorFactory ) to auto-delete documents.
>  
> Turns out it doesn't work when Basic Auth is enabled. I get the following 
> stacktrace from the logs
> {code:java}
> 2018-10-12 22:06:38.967 ERROR (autoExpireDocs-42-thread-1) [   ] 
> o.a.s.u.p.DocExpirationUpdateProcessorFactory Runtime error in periodic 
> deletion of expired docs: Async exception during distributed update: Error 
> from server at http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6: 
> require authentication
> request: 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.0.8%3A8983%2Fsolr%2Fgettingstarted_shard1_replica_n2%2F&wt=javabin&version=2
> org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>  Async exception during distributed update: Error from server at 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6: require 
> authentication
> request: 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2F192.168.0.8%3A8983%2Fsolr%2Fgettingstarted_shard1_replica_n2%2F&wt=javabin&version=2
>     at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:964)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1976)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> 

[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"

2020-01-17 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018406#comment-17018406
 ] 

Chris M. Hostetter commented on SOLR-11746:
---

[~houston] - i have to step away for a minute, but based on my poking around a 
bit i think that fundamentally the problem is that – at a lucene level – Point 
fields just don't seem to ever support (or care about) norms?

Unlike other solr fieldtypes, none of the {{...solr.schema.*PointField}} impls 
ever _create_ or pass a {{...lucene.document.FieldType}} instance (containing 
the "omitNorms" setting from the SchemaField) to the underlying 
{{...lucene.document.Field}} instance that they create in their 
{{...solr.schema.FieldType.createField()}} method – because the underlying 
classes (like {{...lucene.document.IntPoint}}) don't _allow_ you to specify 
your own FieldType (where you set things like {{omitNorms}}) ... instead 
that's all private & internal to the point {{Field}} impl.

There are no existing lucene layer test cases of using Point Fields + 
NormsFieldExistsQuery, and I'm pretty sure if you tried to write one you'd find 
that you can never make it pass, because it's physically impossible to create a 
"Point" field with {{omitNorms=false}} ?

BUT ... I don't think this is a bug? ... If you look back at what Uwe said 
above when he suggested using NormsFieldExistsQuery he was very specific...
{quote}If the field has norms enabled (e.g. a text field or StringField with 
norms), then you can also use NormsFieldExistsQuery:
{quote}
...i think fundamentally your patch should be restructured to ensure it never 
attempts to use NormsFieldExistsQuery with Point based fields?

Off the top of my head, i would straw man suggest eliminating the concept of 
{{getSpecializedExistenceQuery}} and instead just make FieldType use...
{code:java}
public Query getExistenceQuery(QParser parser, SchemaField field) {
  if (field.hasDocValues()) {
return new DocValuesFieldExistsQuery(field.getName());
  } else if (!isPointField() && !field.omitNorms() && field.indexOptions() != 
IndexOptions.NONE) {
return new NormsFieldExistsQuery(field.getName());
  }
  // Default to an unbounded range query
  return getRangeQuery(...); // ? getSpecializedRangeQuery ? 
}
{code}
And let subclasses (like the point fields) override getExistenceQuery as needed.

(Although I generally hate the existence of that {{isPointField()}} method, so 
i'm not fully sold on this idea ... I'm also not really clear on the 
purpose/need of getSpecializedRangeQuery as opposed to just letting subclasses 
override {{getRangeQuery(...)}} ... so take this entire suggestion with a grain 
of salt)







[jira] [Commented] (SOLR-13965) Adding new functions to GraphHandler should be same as Streamhandler

2020-01-17 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018391#comment-17018391
 ] 

Lucene/Solr QA commented on SOLR-13965:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | 
{color:green}  1m  1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | 
{color:green}  1m  0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | 
{color:green}  1m  0s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 44m 
53s{color} | {color:green} core in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 48m 39s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | SOLR-13965 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12991242/SOLR-13965.01.patch |
| Optional Tests |  compile  javac  unit  ratsources  checkforbiddenapis  
validatesourcepatterns  |
| uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 
10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-SOLR-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
 |
| git revision | master / aad849bf87a |
| ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
| Default Java | LTS |
|  Test Results | 
https://builds.apache.org/job/PreCommit-SOLR-Build/652/testReport/ |
| modules | C: solr/core U: solr/core |
| Console output | 
https://builds.apache.org/job/PreCommit-SOLR-Build/652/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> Adding new functions to GraphHandler should be same as Streamhandler
> 
>
> Key: SOLR-13965
> URL: https://issues.apache.org/jira/browse/SOLR-13965
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Affects Versions: 8.3
>Reporter: David Eric Pugh
>Priority: Minor
> Attachments: SOLR-13965.01.patch
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Currently you add new functions to GraphHandler differently than you do in 
> StreamHandler.  We should have one way of extending the handlers that support 
> streaming expressions.






[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"

2020-01-17 Thread Houston Putman (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018375#comment-17018375
 ] 

Houston Putman commented on SOLR-11746:
---

Sorry about that Hoss, it's #2. I've posted my patch with a comment around the 
test that fails {{TestSolrQueryParser.testDocsWithValuesInField()}}. 
{code:java}
reproduce with: ant test -Dtestcase=TestSolrQueryParser 
-Dtests.method=testDocsWithValuesInField -Dtests.seed=8945CFEE0F9CB0A8 
-Dtests.slow=true -Dtests.badapples=true -Dtests.locale=kab-DZ 
-Dtests.timezone=Brazil/East -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8{code}
I'm pretty sure it's a Solr bug with creating FieldInfo objects, but I'm new to 
this part of Solr and haven't been able to track down how the IndexOptions get 
populated yet. 







[jira] [Updated] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"

2020-01-17 Thread Houston Putman (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Houston Putman updated SOLR-11746:
--
Attachment: SOLR-11746.patch







[jira] [Created] (LUCENE-9150) Restore support for dynamic PlanetModel in Geo3D

2020-01-17 Thread Nick Knize (Jira)
Nick Knize created LUCENE-9150:
--

 Summary: Restore support for dynamic PlanetModel in Geo3D
 Key: LUCENE-9150
 URL: https://issues.apache.org/jira/browse/LUCENE-9150
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nick Knize


LUCENE-7072 removed dynamic planet model support in Geo3D. This was logical at 
the time (given the state of Lucene and spatial projections and coordinate 
reference systems). Since then, however, there have been a lot of new 
developments within the OGC community around [Coordinate Reference 
Systems|https://docs.opengeospatial.org/as/18-005r4/18-005r4.html], [Dynamic 
Coordinate Reference 
Systems|http://docs.opengeospatial.org/DRAFTS/18-058.html], and [Updated ISO 
Standards|https://www.iso.org/obp/ui/#iso:std:iso:19111:ed-3:v1:en].

It would be useful for Geo3D (and eventually LatLon*) to support different 
geographic datums to make Lucene a viable option for indexing/searching in 
different spatial reference systems (e.g., more accurately computing query 
shape relations to BKD's internal nodes using a datum consistent with the spatial 
projection). This would also provide an alternative to other limitations of the 
{{LatLon*/XY*}} implementation (e.g., pole/dateline crossing, quantization of 
small polygons). 

I'd like to propose keeping the current WGS84 static datum as the default for 
Geo3D but adding back the constructors to accept custom planet models. Perhaps 
this could be listed as an "expert" API feature?
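
For illustration, the expert API might be used roughly like this (a sketch: 
the two-axis constructor and the axis values shown are assumptions modeled on 
the pre-LUCENE-7072 API, not a committed design):
{code:java}
import org.apache.lucene.spatial3d.geom.GeoPoint;
import org.apache.lucene.spatial3d.geom.PlanetModel;

public class CustomDatumSketch {
  public static void main(String[] args) {
    // Axes as fractions of the mean earth radius; values approximate WGS84.
    PlanetModel custom = new PlanetModel(1.0011188540, 0.9977622540);
    GeoPoint p = new GeoPoint(custom, Math.toRadians(40.7), Math.toRadians(-74.0));
    System.out.println(p);
  }
}
{code}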






[jira] [Updated] (LUCENE-9077) Gradle build

2020-01-17 Thread Erick Erickson (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erick Erickson updated LUCENE-9077:
---
Description: 
This task focuses on providing gradle-based build equivalent for Lucene and 
Solr (on master branch). See notes below on why this respin is needed.

The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
Try running the following to see an overview of helper guides concerning 
typical workflow, testing and ant-migration helpers:

gradlew :help

A list of items that needs to be added or requires work. If you'd like to work 
on any of these, please add your name to the list. Once you have a patch/ pull 
request let me (dweiss) know - I'll try to coordinate the merges.
 * (/) Apply forbiddenAPIs
 * (/) Generate hardware-aware gradle defaults for parallelism (count of 
workers and test JVMs).
 * (/) Fail the build if --tests filter is applied and no tests execute during 
the entire build (this allows for an empty set of filtered tests at single 
project level).
 * (/) Port other settings and randomizations from common-build.xml
 * (/) Configure security policy/ sandboxing for tests.
 * (/) test's console output on -Ptests.verbose=true
 * (/) add a :helpDeps explanation of how the dependency system works (palantir 
plugin, lockfile) and how to retrieve structured information about current 
dependencies of a given module (in a tree-like output).
 * (/) jar checksums, jar checksum computation and validation. This should be 
done without intermediate folders (directly on dependency sets).
 * (/) verify min. JVM version and exact gradle version on build startup to 
minimize odd build side-effects
 * (/) Repro-line for failed tests/ runs.
 * (/) add a top-level README note about building with gradle (and the required 
JVM).
 * (/) add an equivalent of 'validate-source-patterns' 
(check-source-patterns.groovy) to precommit.
 * (/) add an equivalent of 'rat-sources' to precommit.
 * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) to 
precommit.
 * (/) javadoc compilation

Hard-to-implement stuff already investigated:
 * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
to be any way to do this in a reasonably efficient way. There are onOutput 
listeners but they're slow to operate and solr tests emit *tons* of output, so 
it's overkill.-
 * (!) (LUCENE-9120) *Tests working with security-debug logs or other JVM-early 
log output*. Gradle's test runner works by redirecting Java's stdout/ syserr so 
this just won't work. Perhaps we can spin the ant-based test runner for such 
corner-cases.

Of lesser importance:
 * Add an equivalent of 'documentation-lint' to precommit.
 * Do not require files to be committed before running precommit.
 * (/) add rendering of javadocs (gradlew javadoc)
 * Attach javadocs to maven publications.
 * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
it'll be difficult to run it sensibly because gradle doesn't offer cwd 
separation for the forked test runners.
 * If you diff the solr packaged distribution against the ant-created 
distribution there are minor differences in library versions and some JARs are 
excluded/ moved around. I didn't try to force these as everything seems to work 
(tests, etc.) – perhaps these differences should be fixed in the ant build instead.
 * [EOE] identify and port various "regenerate" tasks from ant builds (javacc, 
precompiled automata, etc.)
 * Fill in POM details in gradle/defaults-maven.gradle so that they reflect the 
previous content better (dependencies aside).
 * Add any IDE integration layers that should be added (I use IntelliJ and it 
imports the project out of the box, without the need for any special tuning).
 * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; currently 
XSLT...)
 * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
from a binary distribution?) 

 

*{color:#ff}Note:{color}* this builds on the work done by Mark Miller and 
Cao Mạnh Đạt but also applies lessons learned from those two efforts:
 * *Do not try to do too many things at once*. If we deviate too far from 
master, the branch will be hard to merge.
 * *Do everything in baby-steps* and add small, independent build fragments 
replacing the old ant infrastructure.
 * *Try to engage people to run, test and contribute early*. It can't be a 
one-man effort. The more people understand and can contribute to the build, the 
more healthy it will be.

 

  was:
This task focuses on providing gradle-based build equivalent for Lucene and 
Solr (on master branch). See notes below on why this respin is needed.

The code lives on *gradle-master* branch. It is kept with sync with *master*. 
Try running the following to see an overview of helper guides concerning 
typical workflow, testing and ant-migration helpers:

gradlew :help

A 

[GitHub] [lucene-solr] nknize opened a new pull request #1182: LUCENE-9149: Increase data dimension limit in BKD

2020-01-17 Thread GitBox
nknize opened a new pull request #1182: LUCENE-9149: Increase data dimension 
limit in BKD
URL: https://github.com/apache/lucene-solr/pull/1182
 
 
   [LUCENE-8496](https://issues.apache.org/jira/browse/LUCENE-8496) added 
selective indexing: the ability to designate the first K <= N dimensions for 
driving the construction of the BKD internal nodes. Follow-on work stored the 
"data dimensions" only for the leaf nodes, while only the "index dimensions" are 
stored for the internal nodes. While maxPointsInLeafNode is still important for 
managing the BKD heap memory footprint (and thus we don't want it to get too 
large), I'd like to propose increasing the `MAX_DIMENSIONS` limit (to something 
not too crazy like 16, effectively doubling the data dimension limit) while 
maintaining `MAX_INDEX_DIMENSIONS` at 8.
   
   Doing this will enable us to encode higher dimension data within a lower 
dimension index (e.g., 3D tessellated triangles as a 10 dimension point using 
only the first 6 dimensions for index construction).
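   
   For context, here's roughly what this enables at the field level 
(an illustrative sketch only; the 10/6 split mirrors the triangle example 
above, and the raised limit is what this PR proposes):
   
   ```java
   // A point with 10 data dimensions, of which only the first 6 drive BKD
   // index construction; the remaining 4 are stored only in the leaves.
   // This configuration becomes legal once MAX_DIMENSIONS is 16
   // (MAX_INDEX_DIMENSIONS stays 8).
   FieldType type = new FieldType();
   type.setDimensions(/* dimensionCount */ 10, /* indexDimensionCount */ 6, /* bytesPerDim */ Integer.BYTES);
   type.freeze();
   doc.add(new BinaryPoint("triangle", packedTriangleBytes, type)); // packedTriangleBytes: hypothetical encoding
   ```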


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9149) Increase data dimension limit in BKD

2020-01-17 Thread Nick Knize (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Knize updated LUCENE-9149:
---
Attachment: LUCENE-9149.patch
Status: Open  (was: Open)

Attached patch:

* refactors most {{numDataDim}} variables to the more accurate name {{numDims}}
* increases {{MAX_DIMENSIONS}} to 16, keeps {{MAX_INDEX_DIMENSIONS}} at 8
* updates random test suites to test with the new increased limit

Will open a PR for review.

> Increase data dimension limit in BKD
> 
>
> Key: LUCENE-9149
> URL: https://issues.apache.org/jira/browse/LUCENE-9149
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Nick Knize
>Priority: Major
> Attachments: LUCENE-9149.patch
>
>
> LUCENE-8496 added selective indexing: the ability to designate the first K <= 
> N dimensions for driving the construction of the BKD internal nodes. Follow-on 
> work stored the "data dimensions" only for the leaf nodes, while only the 
> "index dimensions" are stored for the internal nodes. While 
> {{maxPointsInLeafNode}} is still important for managing the BKD heap memory 
> footprint (and thus we don't want it to get too large), I'd like to propose 
> increasing the {{MAX_DIMENSIONS}} limit (to something not too crazy like 16, 
> effectively doubling the data dimension limit) while maintaining 
> {{MAX_INDEX_DIMENSIONS}} at 8.
> Doing this will enable us to encode higher dimension data within a lower 
> dimension index (e.g., 3D tessellated triangles as a 10 dimension point using 
> only the first 6 dimensions for index construction)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-11746) numeric fields need better error handling for prefix/wildcard syntax -- consider uniform support for "foo:* == foo:[* TO *]"

2020-01-17 Thread Houston Putman (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-11746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018314#comment-17018314
 ] 

Houston Putman commented on SOLR-11746:
---

I've been updating the patch with everything detailed above (docValues, norms, 
and the specialized [* TO *] functionality for floats and doubles), as well as 
extended tests.

I have run into a snag with the {{NormsFieldExistsQuery}}. For PointFields (not 
TrieFields), the behavior of a field's {{SchemaField.indexOptions}} does not line 
up with the {{FieldInfo.indexOptions}} for the same field. This means that when 
[FieldInfo.hasNorms()|https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/FieldInfo.java#L321]
 is called in {{NormsFieldExistsQuery}}, for PointFields, *false* will be 
returned even if the same logic {{(omitNorms == false && IndexOptions != 
IndexOptions.None)}} applied to the data in {{SchemaField}} returns *true*. 
Since this "hasNorms" logic differs between {{FieldType}}, which uses 
{{SchemaField}}, and {{NormsFieldExistsQuery}}, which uses {{FieldInfo}}, the 
logic in FieldType cannot accurately determine whether NormsFieldExistsQuery is 
the correct query to use. 

I've been unable so far to figure out how FieldInfo and SchemaField received 
different values for IndexOptions. (This seems to be the reason the logic 
produces different results; {{omitNorms}} has the correct value in both 
classes.) Any advice here beyond just omitting the 
{{NormsFieldExistsQuery}} entirely?

To be clear, this issue only exists for PointFields. (A sketch of the mismatch 
follows below.)
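
Illustrating the mismatch (the {{FieldInfo}} side uses real Lucene APIs; the 
{{SchemaField}} accessors are shown as approximate pseudocode, since the exact 
schema-side methods are internal):

{code:java}
// Lucene side: what NormsFieldExistsQuery consults per segment.
FieldInfo fieldInfo = leafReader.getFieldInfos().fieldInfo("my_point_field");
boolean luceneHasNorms = fieldInfo.hasNorms();
// internally: indexOptions != IndexOptions.NONE && omitNorms == false

// Solr schema side (approximate accessors): what FieldType consults.
// boolean solrHasNorms = schemaField.omitNorms() == false
//     && schemaField.indexOptions() != IndexOptions.NONE;

// For PointFields the two disagree: luceneHasNorms == false
// while the schema-side check returns true.
{code}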

> numeric fields need better error handling for prefix/wildcard syntax -- 
> consider uniform support for "foo:* == foo:[* TO *]"
> 
>
> Key: SOLR-11746
> URL: https://issues.apache.org/jira/browse/SOLR-11746
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.0
>Reporter: Chris M. Hostetter
>Assignee: Houston Putman
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch, 
> SOLR-11746.patch, SOLR-11746.patch, SOLR-11746.patch
>
>
> On the solr-user mailing list, Torsten Krah pointed out that with Trie 
> numeric fields, query syntax such as {{foo_d:\*}} has been functionally 
> equivalent to {{foo_d:\[\* TO \*]}} and asked why this was not also supported 
> for Point based numeric fields.
> The fact that this type of syntax works (for {{indexed="true"}} Trie fields) 
> appears to have been an (untested, undocumented) fluke of Trie fields given 
> that they use indexed terms for the (encoded) numeric terms and inherit the 
> default implementation of {{FieldType.getPrefixQuery}} which produces a 
> prefix query against the {{""}} (empty string) term.  
> (Note that this syntax has apparently _*never*_ worked for Trie fields with 
> {{indexed="false" docValues="true"}} )
> In general, we should assess the behavior when users attempt a prefix/wildcard 
> syntax query against numeric fields, as currently the behavior is largely 
> nonsensical: prefix/wildcard syntax frequently matches no docs w/o any sort 
> of error, and the aforementioned {{numeric_field:*}} behaves inconsistently 
> between points/trie fields and between indexed/docValued trie fields.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9149) Increase data dimension limit in BKD

2020-01-17 Thread Nick Knize (Jira)
Nick Knize created LUCENE-9149:
--

 Summary: Increase data dimension limit in BKD
 Key: LUCENE-9149
 URL: https://issues.apache.org/jira/browse/LUCENE-9149
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Nick Knize


LUCENE-8496 added selective indexing; the ability to designate the first K <= N 
dimensions for driving the construction of the BKD internal nodes. Follow on 
work stored the "data dimensions" for only the leaf nodes and only the "index 
dimensions" are stored for the internal nodes. While {{maxPointsInLeafNode}} is 
still important for managing the BKD heap memory footprint (thus we don't want 
this to get too large), I'd like to propose increasing the {{MAX_DIMENSIONS}} 
limit (to something not too crazy like 16; effectively doubling the index 
dimension limit) while maintaining the {{MAX_INDEX_DIMENSIONS}} at 8.

Doing this will enable us to encode higher dimension data within a lower 
dimension index (e.g., 3D tessellated triangles as a 10 dimension point using 
only the first 6 dimensions for index construction)

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] asfgit merged pull request #1174: LUCENE-8621: Refactor LatLonShape, XYShape, and all query and utility classes to core

2020-01-17 Thread GitBox
asfgit merged pull request #1174: LUCENE-8621: Refactor LatLonShape, XYShape, 
and all query and utility classes to core
URL: https://github.com/apache/lucene-solr/pull/1174
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8621) Move LatLonShape and XYShape out of sandbox

2020-01-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018308#comment-17018308
 ] 

ASF subversion and git services commented on LUCENE-8621:
-

Commit aad849bf87ab69c1bd0eb34518181e1f3c1c42f2 in lucene-solr's branch 
refs/heads/master from Nicholas Knize
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=aad849b ]

LUCENE-8621: Refactor LatLonShape, XYShape, and all query and utility classes 
from sandbox to core


> Move LatLonShape and XYShape out of sandbox
> ---
>
> Key: LUCENE-8621
> URL: https://issues.apache.org/jira/browse/LUCENE-8621
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Nick Knize
>Priority: Minor
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> LatLonShape has matured a lot over the last months, I'd like to start 
> thinking about moving it out of sandbox so that it doesn't stay there for too 
> long like what happened to LatLonPoint. I am pretty happy with the current 
> encoding. To my knowledge, we might just need to do a minor modification 
> because of 
> LUCENE-8620.
> {{XYShape}} and foundation classes will also need to be refactored. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018278#comment-17018278
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

I thought the change in behavior would have little or no impact on users who 
use the Tokenizer for searching, but yes, it would affect users who use it for 
pure tokenization purposes.

While keeping backward compatibility (within the same major version) is 
important, not emitting compound tokens would be preferred for playing well with 
succeeding token filters, and compound tokens are not needed for most use cases. 
I think it'd be better to change the behavior at some point.

How about this proposal: we create two patches, one for master and one for 8x. 
On the 8x branch, add the new constructor so users can opt in from the next 
update; there is no change in the default behavior. On the master branch, 
switch the default behavior (users who don't like the change can still switch 
back by using the full constructor). A sketch of the proposed constructor 
follows below.
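
For reference, a sketch of what the proposed 8x constructor might look like 
(illustrative only; the flag name follows the "discardCompoundToken" option 
discussed in this thread):

{code:java}
// Proposed 8x-only expert constructor: existing constructors keep the
// current behavior (discardCompoundToken == false), so nothing changes
// unless a user opts in explicitly.
public JapaneseTokenizer(AttributeFactory factory, UserDictionary userDictionary,
    boolean discardPunctuation, boolean discardCompoundToken, Mode mode) {
  // ... same initialization as today, plus storing the new flag,
  // which suppresses emission of the overlapping compound token ...
}
{code}

On master the default would then flip so that compound tokens are discarded 
unless explicitly requested.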

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that generates an error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9145) Address warnings found by static analysis

2020-01-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018260#comment-17018260
 ] 

ASF subversion and git services commented on LUCENE-9145:
-

Commit 338d386ae08a1edecb89df5497cb46d0abf154ad in lucene-solr's branch 
refs/heads/master from Mike
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=338d386 ]

LUCENE-9145 First pass addressing static analysis (#1181)

Fixed a bunch of the smaller warnings found by error-prone compiler
plugin, while ignoring a lot of the bigger ones.

> Address warnings found by static analysis
> -
>
> Key: LUCENE-9145
> URL: https://issues.apache.org/jira/browse/LUCENE-9145
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob merged pull request #1181: LUCENE-9145 First pass addressing static analysis

2020-01-17 Thread GitBox
madrob merged pull request #1181: LUCENE-9145 First pass addressing static 
analysis
URL: https://github.com/apache/lucene-solr/pull/1181
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] nknize edited a comment on issue #1174: LUCENE-8621: Refactor LatLonShape, XYShape, and all query and utility classes to core

2020-01-17 Thread GitBox
nknize edited a comment on issue #1174: LUCENE-8621: Refactor LatLonShape, 
XYShape, and all query and utility classes to core
URL: https://github.com/apache/lucene-solr/pull/1174#issuecomment-575759956
 
 
   Thanks @mikemccand. Since this is a refactor from sandbox to the core lucene 
jar I'm planning to backport this to 8x.  This way future master to 8x 
backports are reduced to bug fixes only. Any objections?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] nknize commented on issue #1174: LUCENE-8621: Refactor LatLonShape, XYShape, and all query and utility classes to core

2020-01-17 Thread GitBox
nknize commented on issue #1174: LUCENE-8621: Refactor LatLonShape, XYShape, 
and all query and utility classes to core
URL: https://github.com/apache/lucene-solr/pull/1174#issuecomment-575759956
 
 
   Thanks @mikemccand. Since this is a refactor from sandbox to the core lucene 
jar I'm planning to backport this to 8x.  This way master to 8x backports are 
reduced to bug fixes only. Any objections?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-12859) DocExpirationUpdateProcessorFactory does not work with BasicAuth

2020-01-17 Thread Chris M. Hostetter (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-12859:
--
Attachment: SOLR-12859.patch
Status: Patch Available  (was: Patch Available)

Attaching a slightly updated patch with javadoc additions that make it clear 
the new {{setUserPrincipalName()}} method in LocalSolrQueryRequest is 
experimental and subject to change, in case we want to remove it if/when more 
comprehensive changes are made to PKI/isSolrThread.


> DocExpirationUpdateProcessorFactory does not work with BasicAuth
> 
>
> Key: SOLR-12859
> URL: https://issues.apache.org/jira/browse/SOLR-12859
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.5
>Reporter: Varun Thacker
>Priority: Major
> Attachments: SOLR-12859.patch, SOLR-12859.patch
>
>
> I setup a cluster with basic auth and then wanted to use Solr's TTL feature ( 
> DocExpirationUpdateProcessorFactory ) to auto-delete documents.
>  
> Turns out it doesn't work when Basic Auth is enabled. I get the following 
> stacktrace from the logs
> {code:java}
> 2018-10-12 22:06:38.967 ERROR (autoExpireDocs-42-thread-1) [   ] 
> o.a.s.u.p.DocExpirationUpdateProcessorFactory Runtime error in periodic 
> deletion of expired docs: Async exception during distributed update: Error 
> from server at http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6: 
> require authentication
> request: 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6/update?update.distrib=TOLEADER=http%3A%2F%2F192.168.0.8%3A8983%2Fsolr%2Fgettingstarted_shard1_replica_n2%2F=javabin=2
> org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>  Async exception during distributed update: Error from server at 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6: require 
> authentication
> request: 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6/update?update.distrib=TOLEADER=http%3A%2F%2F192.168.0.8%3A8983%2Fsolr%2Fgettingstarted_shard1_replica_n2%2F=javabin=2
>     at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:964)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1976)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:182)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  

[jira] [Commented] (LUCENE-9053) java.lang.AssertionError: inputs are added out of order lastInput=[f0 9d 9c 8b] vs input=[ef ac 81 67 75 72 65]

2020-01-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018237#comment-17018237
 ] 

ASF subversion and git services commented on LUCENE-9053:
-

Commit 8147e491ce3905bb3543f2c7e34a4ecb60382b49 in lucene-solr's branch 
refs/heads/master from Michael McCandless
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8147e49 ]

LUCENE-9053: improve FST's package-info.java comment to clarify required 
(Unicode code point) sort order for FST.Builder


> java.lang.AssertionError: inputs are added out of order lastInput=[f0 9d 9c 
> 8b] vs input=[ef ac 81 67 75 72 65]
> ---
>
> Key: LUCENE-9053
> URL: https://issues.apache.org/jira/browse/LUCENE-9053
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: gitesh
>Priority: Minor
>
> Even if the inputs are sorted in unicode order, I get following exception 
> while creating FST:
>  
> {code:java}
> // Input values (keys). These must be provided to Builder in Unicode sorted 
> order!
> String inputValues[] = {"퐴", "figure", "flagship"};
> long outputValues[] = {5, 7, 12};
> PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
> Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
> BytesRefBuilder scratchBytes = new BytesRefBuilder();
> IntsRefBuilder scratchInts = new IntsRefBuilder();
> for (int i = 0; i < inputValues.length; i++) {
>  scratchBytes.copyChars(inputValues[i]);
>  builder.add(Util.toIntsRef(scratchBytes.get(), scratchInts), 
> outputValues[i]);
> }
> FST<Long> fst = builder.finish();
> Long value = Util.get(fst, new BytesRef("figure"));
> System.out.println(value);
> {code}
>  Please note that figure and flagship are using the ligature character fl 
> above.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] cpoerschke commented on a change in pull request #1033: SOLR-13965: Use Plugin to add new expressions to GraphHandler

2020-01-17 Thread GitBox
cpoerschke commented on a change in pull request #1033: SOLR-13965: Use Plugin 
to add new expressions to GraphHandler
URL: https://github.com/apache/lucene-solr/pull/1033#discussion_r368076257
 
 

 ##
 File path: solr/core/src/java/org/apache/solr/handler/GraphHandler.java
 ##
 @@ -92,24 +104,29 @@ public void inform(SolrCore core) {
 }
 
 // This pulls all the overrides and additions from the config
+List<PluginInfo> pluginInfos = core.getSolrConfig().getPluginInfos(Expressible.class.getName());
+
+// Check deprecated approach.
 Object functionMappingsObj = initArgs.get("streamFunctions");
 if(null != functionMappingsObj){
+  log.warn("solrconfig.xml: <streamFunctions> is deprecated for adding additional streaming functions to GraphHandler.");
   NamedList<?> functionMappings = (NamedList<?>) functionMappingsObj;
   for(Entry<String, ?> functionMapping : functionMappings) {
 String key = functionMapping.getKey();
 PluginInfo pluginInfo = new PluginInfo(key, Collections.singletonMap("class", functionMapping.getValue()));
-
-if (pluginInfo.pkgName == null) {
-  Class<? extends Expressible> clazz = core.getResourceLoader().findClass((String) functionMapping.getValue(),
-  Expressible.class);
-  streamFactory.withFunctionName(key, clazz);
-} else {
-  StreamHandler.ExpressibleHolder holder = new StreamHandler.ExpressibleHolder(pluginInfo, core, SolrConfig.classVsSolrPluginInfo.get(Expressible.class));
-  streamFactory.withFunctionName(key, () -> holder.getClazz());
-}
-
+pluginInfos.add(pluginInfo);
   }
+}
 
+for (PluginInfo pluginInfo : pluginInfos) {
+  if (pluginInfo.pkgName != null) {
+ExpressibleHolder holder = new ExpressibleHolder(pluginInfo, core, SolrConfig.classVsSolrPluginInfo.get(Expressible.class));
+streamFactory.withFunctionName(pluginInfo.name,
+() -> holder.getClazz());
+  } else {
+Class<? extends Expressible> clazz = core.getMemClassLoader().findClass(pluginInfo.className, Expressible.class);
 
 Review comment:
   > Since this code is duplicated between Stream & Graph, can we factor it out 
into a common method?
   
   Attached SOLR-13965.01.patch to the JIRA ticket.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4499) Multi-word synonym filter (synonym expansion)

2020-01-17 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-4499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018233#comment-17018233
 ] 

Alessandro Benedetti commented on LUCENE-4499:
--

hi [~Tagar], sorry for the late reply, I contributed a patch that is still 
waiting for a review:
[https://issues.apache.org/jira/browse/SOLR-12238|https://issues.apache.org/jira/browse/SOLR-12238]
It's a bit old, so it may require some porting effort, but it could help you.


> Multi-word synonym filter (synonym expansion)
> -
>
> Key: LUCENE-4499
> URL: https://issues.apache.org/jira/browse/LUCENE-4499
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Affects Versions: 4.1, 6.0
>Reporter: Roman Chyla
>Priority: Major
>  Labels: analysis, multi-word, synonyms
> Fix For: 6.0
>
> Attachments: LUCENE-4499.patch, LUCENE-4499.patch
>
>
> I apologize for bringing the multi-token synonym expansion up again. There is 
> an old, unresolved issue at LUCENE-1622 [1]
> While solving the problem for our needs [2], I discovered that the current 
> SolrSynonym parser (and the wonderful FTS) have almost everything to 
> satisfactorily handle both the query and index time synonym expansion. It 
> seems that people often need to use the synonym filter *slightly* differently 
> at indexing and query time.
> In our case, we must do different things during indexing and querying.
> Example sentence: Mirrors of the Hubble space telescope pointed at XA5
> This is what we need (comma marks position bump):
> indexing: mirrors,hubble|hubble space 
> telescope|hst,space,telescope,pointed,xa5|astroobject#5
> querying: +mirrors +(hubble space telescope | hst) +pointed 
> +(xa5|astroboject#5)
> This translates to the following needs:
>   indexing time: 
> single-token synonyms => return only synonyms
> multi-token synonyms => return original tokens *AND* the synonyms
>   query time:
> single-token: return only synonyms (but preserve case)
> multi-token: return only synonyms
>  
> We need the original tokens for the proximity queries, if we indexed 'hubble 
> space telescope'
> as one token, we cannot search for 'hubble NEAR telescope'
> You may (not) be surprised, but Lucene already supports ALL of these 
> requirements. The patch is an attempt to state the problem differently. I am 
> not sure if it is the best option, however it works perfectly for our needs 
> and it seems it could work for general public too. Especially if the 
> SynonymFilterFactory had a preconfigured sets of SynonymMapBuilders - and 
> people would just choose what situation they use. Please look at the unittest.
> links:
> [1] https://issues.apache.org/jira/browse/LUCENE-1622
> [2] http://labs.adsabs.harvard.edu/trac/ads-invenio/ticket/158
> [3] seems to have similar request: 
> http://lucene.472066.n3.nabble.com/Proposal-Full-support-for-multi-word-synonyms-at-query-time-td4000522.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13965) Adding new functions to GraphHandler should be same as Streamhandler

2020-01-17 Thread Christine Poerschke (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018234#comment-17018234
 ] 

Christine Poerschke commented on SOLR-13965:


GraphHandler and StreamHandler code sharing/duplication was mentioned both here 
and on the pull request.

SOLR-13965.01.patch factors out a static 
{{StreamHandler.addExpressiblePlugins}} method which GraphHandler could then 
use too.

(Note that this does _not_ fix the 
{{SolrConfig.classVsSolrPluginInfo.get(Expressible.class)}} suspected bug that 
[~mdrob] mentioned on the PR – {{Expressible.class}} vs. 
{{Expressible.class.getName()}} was the suspected type mismatch there, right?)
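
For reference, a sketch of what the factored-out helper might look like (the 
actual patch is authoritative; this just mirrors the logic from the PR diff):

{code:java}
// Shared by StreamHandler.inform() and GraphHandler.inform().
public static void addExpressiblePlugins(StreamFactory streamFactory, SolrCore core) {
  List<PluginInfo> pluginInfos = core.getSolrConfig().getPluginInfos(Expressible.class.getName());
  for (PluginInfo pluginInfo : pluginInfos) {
    if (pluginInfo.pkgName != null) {
      ExpressibleHolder holder = new ExpressibleHolder(pluginInfo, core,
          SolrConfig.classVsSolrPluginInfo.get(Expressible.class)); // suspected bug left as-is, see above
      streamFactory.withFunctionName(pluginInfo.name, () -> holder.getClazz());
    } else {
      Class<? extends Expressible> clazz =
          core.getResourceLoader().findClass(pluginInfo.className, Expressible.class);
      streamFactory.withFunctionName(pluginInfo.name, clazz);
    }
  }
}
{code}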

> Adding new functions to GraphHandler should be same as Streamhandler
> 
>
> Key: SOLR-13965
> URL: https://issues.apache.org/jira/browse/SOLR-13965
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Affects Versions: 8.3
>Reporter: David Eric Pugh
>Priority: Minor
> Attachments: SOLR-13965.01.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently you add new functions to GraphHandler differently than you do in 
> StreamHandler.  We should have one way of extending the handlers that support 
> streaming expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13965) Adding new functions to GraphHandler should be same as Streamhandler

2020-01-17 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke updated SOLR-13965:
---
Status: Patch Available  (was: Open)

> Adding new functions to GraphHandler should be same as Streamhandler
> 
>
> Key: SOLR-13965
> URL: https://issues.apache.org/jira/browse/SOLR-13965
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Affects Versions: 8.3
>Reporter: David Eric Pugh
>Priority: Minor
> Attachments: SOLR-13965.01.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently you add new functions to GraphHandler differently than you do in 
> StreamHandler.  We should have one way of extending the handlers that support 
> streaming expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-13965) Adding new functions to GraphHandler should be same as Streamhandler

2020-01-17 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke updated SOLR-13965:
---
Attachment: SOLR-13965.01.patch

> Adding new functions to GraphHandler should be same as Streamhandler
> 
>
> Key: SOLR-13965
> URL: https://issues.apache.org/jira/browse/SOLR-13965
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Affects Versions: 8.3
>Reporter: David Eric Pugh
>Priority: Minor
> Attachments: SOLR-13965.01.patch
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Currently you add new functions to GraphHandler differently than you do in 
> StreamHandler.  We should have one way of extending the handlers that support 
> streaming expressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4702) Terms dictionary compression

2020-01-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018228#comment-17018228
 ] 

Adrien Grand commented on LUCENE-4702:
--

I did some research on what's taking space in the terms dictionary. While 
suffixes take a fair amount of space for text fields, for ID fields it tends to 
be the stats instead. So I made a couple of changes to also use LZ4 for 
run-length encoding of doc freqs (frequent runs of 1s for IDs, and 
interestingly there are many runs of 1s for the body field of wikibigall too) 
and of suffix lengths, which are also frequently the same, especially for ID 
fields (always the same for UUID or flake IDs, and very little variance for 
auto-increment IDs). Finally, we were wasting some space with the pulsing 
optimization too, since we kept writing the delta of file pointers even though 
these deltas are almost always zero for ID fields, where postings are written 
in the terms dictionary rather than in the doc file. The compression is 
significantly better now: the size of the tim file goes down by 18%, from 
937MB to 767MB. Here are the stats for the body and id fields if you are 
curious:

{code}
"id" field
  index FST:
72 bytes
  terms:
6647577 terms
39885462 bytes (6.0 bytes/term)
  blocks:
189932 blocks
184655 terms-only blocks
5277 sub-block-only blocks
0 mixed blocks
0 floor blocks
189932 non-floor blocks
0 floor sub-blocks
14059850 term suffix bytes before compression (52.8 suffix-bytes/block)
10023973 compressed term suffix bytes (0.71 compression ratio - compression 
count by algorithm: NO_COMPRESSION: 189932)
6647577 term stats bytes before compression (11.7 stats-bytes/block)
2226414 compressed term stats bytes (0.33 compression ratio)
26962631 other bytes (142.0 other-bytes/block)
{code}

{code}
"body" field
  index FST:
72 bytes
  terms:
46916528 terms
595069147 bytes (12.7 bytes/term)
  blocks:
1507239 blocks
1158537 terms-only blocks
471 sub-block-only blocks
348231 mixed blocks
318391 floor blocks
491775 non-floor blocks
1015464 floor sub-blocks
359880365 term suffix bytes before compression (196.3 suffix-bytes/block)
295898442 compressed term suffix bytes (0.82 compression ratio - 
compression count by algorithm: NO_COMPRESSION: 252273, LOWERCASE_ASCII: 
1190011, LZ4: 64955)
94426201 term stats bytes before compression (45.1 stats-bytes/block)
68022105 compressed term stats bytes (0.72 compression ratio)
213996755 other bytes (142.0 other-bytes/block)
{code}
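
To illustrate why those runs compress so well, here's a toy sketch (assuming 
Lucene's internal {{org.apache.lucene.util.compress.LZ4}} helper; the actual 
terms-dictionary writer is more involved):

{code:java}
// 1024 doc freqs that are all 1, as is typical for a primary-key field.
byte[] docFreqs = new byte[1024];
Arrays.fill(docFreqs, (byte) 1);

// LZ4 collapses the long run of identical bytes into a tiny output.
ByteBuffersDataOutput out = new ByteBuffersDataOutput();
LZ4.compress(docFreqs, 0, docFreqs.length, out, new LZ4.FastCompressionHashTable());
System.out.println(out.size()); // a handful of bytes instead of 1024
{code}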
 
I see a 10% slowdown on PKLookup that I'll look into.

> Terms dictionary compression
> 
>
> Key: LUCENE-4702
> URL: https://issues.apache.org/jira/browse/LUCENE-4702
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Trivial
> Attachments: LUCENE-4702.patch, LUCENE-4702.patch
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> I've done a quick test with the block tree terms dictionary by replacing a 
> call to IndexOutput.writeBytes to write suffix bytes with a call to 
> LZ4.compressHC to test the performance hit. Interestingly, search performance 
> was very good (see comparison table below) and the tim files were 14% smaller 
> (from 150432 bytes overall to 129516).
> {noformat}
> TaskQPS baseline  StdDevQPS compressed  StdDev
> Pct diff
>   Fuzzy1  111.50  (2.0%)   78.78  (1.5%)  
> -29.4% ( -32% -  -26%)
>   Fuzzy2   36.99  (2.7%)   28.59  (1.5%)  
> -22.7% ( -26% -  -18%)
>  Respell  122.86  (2.1%)  103.89  (1.7%)  
> -15.4% ( -18% -  -11%)
> Wildcard  100.58  (4.3%)   94.42  (3.2%)   
> -6.1% ( -13% -1%)
>  Prefix3  124.90  (5.7%)  122.67  (4.7%)   
> -1.8% ( -11% -9%)
>OrHighLow  169.87  (6.8%)  167.77  (8.0%)   
> -1.2% ( -15% -   14%)
>  LowTerm 1949.85  (4.5%) 1929.02  (3.4%)   
> -1.1% (  -8% -7%)
>   AndHighLow 2011.95  (3.5%) 1991.85  (3.3%)   
> -1.0% (  -7% -5%)
>   OrHighHigh  155.63  (6.7%)  154.12  (7.9%)   
> -1.0% ( -14% -   14%)
>  AndHighHigh  341.82  (1.2%)  339.49  (1.7%)   
> -0.7% (  -3% -2%)
>OrHighMed  217.55  (6.3%)  216.16  (7.1%)   
> -0.6% ( -13% -   13%)
>   IntNRQ   53.10 (10.9%)   52.90  (8.6%)   
> -0.4% ( -17% -   21%)
>  MedTerm  998.11  (3.8%)  994.82  (5.6%)   
> -0.3% (  -9% -9%)
>  

[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Kazuaki Hiraga (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018159#comment-17018159
 ] 

Kazuaki Hiraga edited comment on LUCENE-9123 at 1/17/20 5:56 PM:
-

{quote}
This solution would fix Kuromoji to create a simple chain of tokens, all with 
position increment 1 (no overlapping compound tokens)?
{quote}
Yes. Although I may need to test more documents to ensure that the fix will 
produce a simple chain of tokens, it seems to be working fine so far.

{quote}
Would you only use that mode when parsing the synonyms to build the synonym 
filter (or synonym graph filter)? (Since that seems to be where the error is 
occurring here). Or would you also use that as your primary Tokenizer (which 
would mean you don't also get compound words directly out of Kuromoji). 
{quote}

In my case, I use this mode as my primary Tokenizer configuration since I 
usually want to have decompound tokens.

It would be nice if the synonym filter and synonym graph filter could work with 
this mode without the patch. However, I don't think there are many situations 
where we need original tokens along with decompound ones (though I cannot say we 
will never need them). The current workaround for this issue is using normal 
mode, which will not produce decompound tokens. But then, for example, we cannot 
get a document that contains 株式会社 by using the query 会社, because 株式会社 
will be one token and normal mode doesn't produce the decompound tokens 株式 and 
会社 (in this case we can use n-grams in addition to the tokenized field to get 
the document, but that has other issues).

Therefore, there are two issues: #1, Kuromoji produces compound and decompound 
tokens in both search mode and extended mode, and the compound ones are rarely 
needed; #2, neither synonym filter nor synonym graph filter can work with tokens 
that overlap positions. 

 [~mikemccand], I will try to find the ticket for #2. If there isn't one, I will 
create one. And I will change the title of this ticket to focus on #1.


was (Author: h.kazuaki):
{quote}
This solution would fix Kuromoji to create a simple chain of tokens, all with 
position increment 1 (no overlapping compound tokens)?
{quote}
Yes. Although I may need to test more documents to ensure that the fix will 
produce a simple chain of tokens, it seems working fine so far.

{quote}
Would you only use that mode when parsing the synonyms to build the synonym 
filter (or synonym graph filter)? (Since that seems to be where the error is 
occurring here). Or would you also use that as your primary Tokenizer (which 
would mean you don't also get compound words directly out of Kuromoji). 
{quote}

In my case, I use this mode as my primary Tokenizer configuration since I 
usually want to have decompound tokens.

It would be nice if synonym filter and synonym graph filter can work with this 
mode without the patch. However, I don't think there are many situations that 
we need original tokens along with decompound ones (I cannot say we will never 
need though).  Current workaround for this issue is using normal mode that will 
not produce decompound tokens. But, for example, we cannot get a document that 
contains 株式会社 by using a query 会社 because 株式会社 will be one token and normal 
mode doesn't produce decoumpound tokens that will produce two tokens 株式 and 会社 
(in this case, we can use n-gram in addition to tokenize field to get a 
document but it has other issues).

I will try to find out that one which dedicated issue for the filter. If 
there's no one, I will create a ticket to record the issue. 

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that 

[jira] [Commented] (SOLR-12859) DocExpirationUpdateProcessorFactory does not work with BasicAuth

2020-01-17 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018213#comment-17018213
 ] 

Chris M. Hostetter commented on SOLR-12859:
---

[~caomanhdat] - i can't find your patch, did you already delete it? (or forget 
to add it)

I think what you were saying...
{quote}if isSolrThread == true, set usr = "$" even incase of principal == null
{quote}
...is pretty similar to what i had hypothesized...
{quote}My initial reaction was to "swap" the order of the 
Principle/isSolrThread() checks...
{quote}
...but i still have the same concerns...
{quote}...that seems risky (especially since AFAICT, every SolrClient used for 
distributed Solr requests will return "true" for isSolrThread() – meaning PKI 
would probably completely stop forwarding credentials if we did that?
{quote}
To put it another way:
 * Assume {{blockUnknown=false}}.
 * So an unauthenticated request that gets accepted by the AuthenticationPlugin 
will have a null Principal.
 * As things stand right now, if that "principal==null" request gets forwarded 
to a distributed node, it will _never_ have a PKI header added.
 * With your proposed change, if it does get forwarded _by another thread_ such 
as ConcurrentUpdateSolrClient (where {{isSolrThread == true}}), then a PKI 
header will be added indicating it is a "principal == '$' (Super User)" request.
 ** ie: anonymous requests will be "promoted" to super user requests on 
the distributed nodes.

i have no idea what practical problems that may cause, but it certainly sounds 
bad.

{quote}...think that the Hoss's initial approach is more valid than mine because
 * it makes less significant change to the code base.{quote}
Yeah, it seems like there are a lot of weird edge cases in the way PKI for 
"background server threads" works that really need to be 
discussed/re-considered -- particularly since even forwarding updates from 
leader to replicas happens in a background thread (inside of the update client) 
... but that should probably happen in a new jira with wider discussion & 
visibility.

My patch seems like the minimum amount of change needed to fix the current bug, 
so unless someone sees a security problem with it (that doesn't already exist) 
i would suggest we move forward using my patch and re-consider the overall 
"isSolrThread" concept in a new jira?

> DocExpirationUpdateProcessorFactory does not work with BasicAuth
> 
>
> Key: SOLR-12859
> URL: https://issues.apache.org/jira/browse/SOLR-12859
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.5
>Reporter: Varun Thacker
>Priority: Major
> Attachments: SOLR-12859.patch
>
>
> I setup a cluster with basic auth and then wanted to use Solr's TTL feature ( 
> DocExpirationUpdateProcessorFactory ) to auto-delete documents.
>  
> Turns out it doesn't work when Basic Auth is enabled. I get the following 
> stacktrace from the logs
> {code:java}
> 2018-10-12 22:06:38.967 ERROR (autoExpireDocs-42-thread-1) [   ] 
> o.a.s.u.p.DocExpirationUpdateProcessorFactory Runtime error in periodic 
> deletion of expired docs: Async exception during distributed update: Error 
> from server at http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6: 
> require authentication
> request: 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6/update?update.distrib=TOLEADER=http%3A%2F%2F192.168.0.8%3A8983%2Fsolr%2Fgettingstarted_shard1_replica_n2%2F=javabin=2
> org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>  Async exception during distributed update: Error from server at 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6: require 
> authentication
> request: 
> http://192.168.0.8:8983/solr/gettingstarted_shard2_replica_n6/update?update.distrib=TOLEADER=http%3A%2F%2F192.168.0.8%3A8983%2Fsolr%2Fgettingstarted_shard1_replica_n2%2F=javabin=2
>     at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:964)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1976)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:182)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]
>     at 
> org.apache.solr.update.processor.UpdateRequestProcessor.finish(UpdateRequestProcessor.java:80)
>  ~[solr-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - 
> jimczi - 2018-09-18 13:07:55]

[GitHub] [lucene-solr] dweiss commented on issue #1181: LUCENE-9145 First pass addressing static analysis

2020-01-17 Thread GitBox
dweiss commented on issue #1181: LUCENE-9145 First pass addressing static 
analysis
URL: https://github.com/apache/lucene-solr/pull/1181#issuecomment-575727247
 
 
   I'd run a full test suite and if it passes just commit it in. Most of these 
look like legitimate bug fixes!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Kazuaki Hiraga (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018207#comment-17018207
 ] 

Kazuaki Hiraga commented on LUCENE-9123:


[~tomoko], Hm.. it sounds tricky, and it's difficult to draw a line between an 
acceptable change and an unacceptable one. I think changing the default behavior 
has more impact on the Tokenizer's output than modifying the signature of the 
constructors, which users have to intentionally opt into. I thought we didn't 
want to do that in a point release; that's the reason I set this option to 
false. What do you think, can we change the behavior?


> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JT generates multiple 
> tokens as an output. If we use `mode=normal`, it should be fine. However, we 
> would like to use decomposed tokens that can maximize the chance to increase 
> recall.
> Snippet of schema:
> {code:xml}
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   
> 
>  synonyms="lang/synonyms_ja.txt"
> tokenizerFactory="solr.JapaneseTokenizerFactory"/>
> 
> 
>  tags="lang/stoptags_ja.txt" />
> 
> 
> 
> 
> 
>  minimumLength="4"/>
> 
> 
>   
> 
> {code}
> A synonym entry that generates an error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is an output on console:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on issue #1179: LUCENE-9147: Move the stored fields index off-heap.

2020-01-17 Thread GitBox
jpountz commented on issue #1179: LUCENE-9147: Move the stored fields index 
off-heap.
URL: https://github.com/apache/lucene-solr/pull/1179#issuecomment-575722955
 
 
We have this information for a few datasets at 
https://elasticsearch-benchmarks.elastic.co/index.html#tracks/geonames/nightly/default/30d.
 For instance 3.8MB for the geonames dataset or 6MB for the HTTP logs dataset. 
It's not much, but these datasets are not very large either (3.1GB and 18.6GB 
respectively). As you can see it's the main contributor to memory usage after 
the terms index (which we still load on-heap for the `_id` field for now), 
so if a user loads the terms index off-heap, it may well be that stored fields 
are the main heap user.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob opened a new pull request #1181: LUCENE-9145 First pass addressing static analysis

2020-01-17 Thread GitBox
madrob opened a new pull request #1181: LUCENE-9145 First pass addressing 
static analysis
URL: https://github.com/apache/lucene-solr/pull/1181
 
 
   Fixed a bunch of the smaller warnings found by error-prone compiler
   plugin, while ignoring a lot of the bigger ones.
   
   This is just the warnings found by #1176 without the build changes


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob commented on issue #1176: LUCENE-9143 Add error-prone checks to build, but disabled

2020-01-17 Thread GitBox
madrob commented on issue #1176: LUCENE-9143 Add error-prone checks to build, 
but disabled
URL: https://github.com/apache/lucene-solr/pull/1176#issuecomment-575721644
 
 
   Yeah, I'll split this out into the warnings separately and the build changes 
into an individual file as well.





[jira] [Comment Edited] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018187#comment-17018187
 ] 

Tomoko Uchida edited comment on LUCENE-9123 at 1/17/20 4:59 PM:


{quote}
However, I don't think there are many situations that we need original tokens 
along with decompound ones
{quote}

Personally I agree with that. Concerning full-text searching, we rarely need 
the original tokens when we use the "search mode". Why don't we set 
"discardCompoundToken" to true by default from here? (I think this minor change 
in behaviour is okay for the next 8.x release.)
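
For illustration, a minimal sketch of what that default would mean for code 
constructing the tokenizer directly. The discardCompoundToken flag is the one 
proposed in the attached patch, so the constructor signature here is an 
assumption, not a committed API:

{code:java}
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.util.AttributeFactory;

public class DiscardCompoundDefault {
  public static void main(String[] args) throws Exception {
    // Hypothetical: discardCompoundToken and its position in the argument
    // list come from the LUCENE-9123 patch, not from a released API.
    JapaneseTokenizer tok = new JapaneseTokenizer(
        AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY,
        null,  // no user dictionary
        true,  // discardPunctuation
        true,  // discardCompoundToken: keep only the decompounded tokens
        JapaneseTokenizer.Mode.SEARCH);
    tok.close();
  }
}
{code}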


was (Author: tomoko uchida):
{{quote}}
However, I don't think there are many situations that we need original tokens 
along with decompound ones
{{quote}}

Personally I agree with that. Concerning full text searching, we rarely need 
original tokens when we use the "search mode". Why don't we set 
"discardCompoundToken" to true by default from here (I think this minor change 
in behaviour is Okay for next 8.x release)?

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JapaneseTokenizer generates 
> multiple tokens as output. If we use `mode=normal`, it should be fine. 
> However, we would like to use decomposed tokens, which maximize the chance of 
> increasing recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}






[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018187#comment-17018187
 ] 

Tomoko Uchida commented on LUCENE-9123:
---

{quote}
However, I don't think there are many situations that we need original tokens 
along with decompound ones
{quote}

Personally I agree with that. Concerning full-text searching, we rarely need 
the original tokens when we use the "search mode". Why don't we set 
"discardCompoundToken" to true by default from here? (I think this minor change 
in behaviour is okay for the next 8.x release.)

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JapaneseTokenizer generates 
> multiple tokens as output. If we use `mode=normal`, it should be fine. 
> However, we would like to use decomposed tokens, which maximize the chance of 
> increasing recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}






[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Kazuaki Hiraga (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018159#comment-17018159
 ] 

Kazuaki Hiraga commented on LUCENE-9123:


{quote}
This solution would fix Kuromoji to create a simple chain of tokens, all with 
position increment 1 (no overlapping compound tokens)?
{quote}
Yes. Although I may need to test more documents to ensure that the fix will 
produce a simple chain of tokens, it seems to be working fine so far.

{quote}
Would you only use that mode when parsing the synonyms to build the synonym 
filter (or synonym graph filter)? (Since that seems to be where the error is 
occurring here). Or would you also use that as your primary Tokenizer (which 
would mean you don't also get compound words directly out of Kuromoji). 
{quote}

In my case, I use this mode as my primary Tokenizer configuration since I 
usually want decompounded tokens.

It would be nice if the synonym filter and synonym graph filter could work with 
this mode without the patch. However, I don't think there are many situations 
where we need the original tokens along with the decompounded ones (though I 
cannot say we will never need them). The current workaround for this issue is 
to use normal mode, which does not produce decompounded tokens. But then, for 
example, we cannot find a document containing 株式会社 with the query 会社, because 
株式会社 stays one token and normal mode does not decompound it into the two 
tokens 株式 and 会社 (we could additionally index the field with n-grams to match 
such a document, but that has other issues).

I will try to find the dedicated issue for the filter. If there isn't one, I 
will create a ticket to record the issue. 
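
As a rough illustration of the difference between the modes (a sketch using the 
stock tokenizer API, not the attached patch):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ModeComparison {
  public static void main(String[] args) throws Exception {
    for (Mode mode : new Mode[] {Mode.NORMAL, Mode.SEARCH}) {
      // No user dictionary; discard punctuation.
      JapaneseTokenizer tok = new JapaneseTokenizer(null, true, mode);
      tok.setReader(new StringReader("株式会社"));
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      System.out.print(mode + ":");
      while (tok.incrementToken()) {
        // NORMAL keeps 株式会社 as one token; SEARCH also emits the
        // decompounded 株式 and 会社 (and, by default, the compound too).
        System.out.print(" " + term);
      }
      tok.end();
      tok.close();
      System.out.println();
    }
  }
}
{code}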

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JapaneseTokenizer generates 
> multiple tokens as output. If we use `mode=normal`, it should be fine. 
> However, we would like to use decomposed tokens, which maximize the chance of 
> increasing recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}






[jira] [Comment Edited] (SOLR-14073) Fix segment look ahead NPE in CollapsingQParserPlugin

2020-01-17 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018150#comment-17018150
 ] 

Joel Bernstein edited comment on SOLR-14073 at 1/17/20 4:03 PM:


The initial patch seems to work but it's quite hard to reason about.

I'm going to try a different approach that is easier to reason about: 
pre-populate the contexts array rather than populating it as the segments 
are visited. This should eliminate the look-ahead NPE as well.


was (Author: joel.bernstein):
The initial patch seems to work but it's quite hard to reason about.

I'm going to try a different approach which is easier to reason about which is 
to prepopulate the contexts array rather than populating it as the segments are 
visited. This should eliminate the look-ahead NPE as well.

> Fix segment look ahead NPE in CollapsingQParserPlugin
> -
>
> Key: SOLR-14073
> URL: https://issues.apache.org/jira/browse/SOLR-14073
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: SOLR-14073.patch
>
>
> The CollapsingQParserPlugin has a bug that if every segment is not visited 
> during the collect it throws an NPE. This causes the CollapsingQParserPlugin 
> to not work when used with any feature that short circuits the segments 
> during the collect. This includes using the CollapsingQParserPlugin twice in 
> the same query and the time limiting collector.






[jira] [Commented] (SOLR-14073) Fix segment look ahead NPE in CollapsingQParserPlugin

2020-01-17 Thread Joel Bernstein (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018150#comment-17018150
 ] 

Joel Bernstein commented on SOLR-14073:
---

The initial patch seems to work but it's quite hard to reason about.

I'm going to try a different approach that is easier to reason about: 
prepopulate the contexts array rather than populating it as the segments are 
visited. This should eliminate the look-ahead NPE as well.
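
A sketch of the idea (illustrative only; this is not the actual 
CollapsingQParserPlugin code):

{code:java}
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;

public class PrepopulateContexts {
  // Hypothetical helper mirroring the proposed fix: fill the per-segment
  // contexts array up front from the reader's leaves, so a collector that
  // skips segments can never observe an unpopulated slot.
  static LeafReaderContext[] prepopulate(IndexReader reader) {
    List<LeafReaderContext> leaves = reader.leaves();
    LeafReaderContext[] contexts = new LeafReaderContext[leaves.size()];
    for (LeafReaderContext leaf : leaves) {
      // leaf.ord is the segment's position among the reader's leaves.
      contexts[leaf.ord] = leaf;
    }
    return contexts;
  }
}
{code}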

> Fix segment look ahead NPE in CollapsingQParserPlugin
> -
>
> Key: SOLR-14073
> URL: https://issues.apache.org/jira/browse/SOLR-14073
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: SOLR-14073.patch
>
>
> The CollapsingQParserPlugin has a bug that if every segment is not visited 
> during the collect it throws an NPE. This causes the CollapsingQParserPlugin 
> to not work when used with any feature that short circuits the segments 
> during the collect. This includes using the CollapsingQParserPlugin twice in 
> the same query and the time limiting collector.






[jira] [Commented] (LUCENE-9098) Report problematic term value when fuzzy query is too complex

2020-01-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018138#comment-17018138
 ] 

Michael McCandless commented on LUCENE-9098:


Indeed, this is still failing on current master 
(fb3ca8d000d6e5203a57625942b754f1d5757fac).  It looks like the test tries to 
build a random string that is guaranteed to use too many states during 
determinize, yet this particular random string does not... here's the full 
failure:
{noformat}
[junit4:pickseed] Seed property 'tests.seed' already defined: CE3DF037C6D29401
   [junit4] <JUnit4> says Привет! Master seed: CE3DF037C6D29401
   [junit4] Executing 1 suite with 1 JVM.
   [junit4]
   [junit4] Started J0 PID(16174@localhost).
   [junit4] Suite: org.apache.lucene.search.TestFuzzyQuery
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestFuzzyQuery 
-Dtests.method=testErrorMessage -Dtests.seed=CE3DF037C6D29401 -Dtests.slow=true 
-Dtests.badapples=true -Dtests.locale=fr-GN -Dtests.timezone=US/Pacific-New 
-Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1
   [junit4] FAILURE 0.23s | TestFuzzyQuery.testErrorMessage <<<
   [junit4]    > Throwable #1: junit.framework.AssertionFailedError: Unexpected 
exception type, expected FuzzyTermsException but got 
java.lang.UnsupportedOperationException
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([CE3DF037C6D29401:1836CAB94FFCBD4F]:0)
   [junit4]    >        at 
org.apache.lucene.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2752)
   [junit4]    >        at 
org.apache.lucene.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2740)
   [junit4]    >        at 
org.apache.lucene.search.TestFuzzyQuery.testErrorMessage(TestFuzzyQuery.java:507)
   [junit4]    >        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   [junit4]    >        at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   [junit4]    >        at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   [junit4]    >        at 
java.base/java.lang.reflect.Method.invoke(Method.java:566)
   [junit4]    >        at java.base/java.lang.Thread.run(Thread.java:834)
   [junit4]    > Caused by: java.lang.UnsupportedOperationException
   [junit4]    >        at 
org.apache.lucene.search.TestFuzzyQuery$1.iterator(TestFuzzyQuery.java:511)
   [junit4]    >        at 
org.apache.lucene.index.Terms.intersect(Terms.java:70)
   [junit4]    >        at 
org.apache.lucene.search.FuzzyTermsEnum.getAutomatonEnum(FuzzyTermsEnum.java:205)
   [junit4]    >        at 
org.apache.lucene.search.FuzzyTermsEnum.bottomChanged(FuzzyTermsEnum.java:232)
   [junit4]    >        at 
org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:131)
   [junit4]    >        at 
org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:196)
   [junit4]    >        at 
org.apache.lucene.search.MultiTermQuery.getTermsEnum(MultiTermQuery.java:303)
   [junit4]    >        at 
org.apache.lucene.search.TestFuzzyQuery.lambda$testErrorMessage$6(TestFuzzyQuery.java:508)
   [junit4]    >        at 
org.apache.lucene.util.LuceneTestCase._expectThrows(LuceneTestCase.java:2870)
   [junit4]    >        at 
org.apache.lucene.util.LuceneTestCase.expectThrows(LuceneTestCase.java:2745)
   [junit4]    >        ... 38 more
   [junit4]   2> NOTE: test params are: 
codec=DummyCompressingStoredFields(storedFieldsFormat=CompressingStoredFieldsFormat(compressionMode=DUMMY,
 chunkSize=5, maxDocsPerChunk=10, blockSize=8), 
termVectorsFormat=CompressingTermVectorsFormat(compressionMode=DUMMY, 
chunkSize=5, blockSize=8)), 
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@3f2c18f9),
 locale=fr-GN, timezone=US/Pacific-New
   [junit4]   2> NOTE: Linux 4.4.0-165-generic amd64/Oracle Corporation 11.0.2 
(64-bit)/cpus=8,threads=1,free=477446800,total=536870912
   [junit4]   2> NOTE: All tests run in this JVM: [TestFuzzyQuery]
   [junit4] Completed [1/1 (1!)] in 0.45s, 1 test, 1 failure <<< 
FAILURES!{noformat}

> Report problematic term value when fuzzy query is too complex
> -
>
> Key: LUCENE-9098
> URL: https://issues.apache.org/jira/browse/LUCENE-9098
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Minor
> Fix For: master (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is the lucene compliment to SOLR-13190, when fuzzy query gets a term 
> that expands to too many states, we throw an exception but don't provide 
> insight on the problematic term. We should improve the error reporting.





[jira] [Reopened] (LUCENE-9098) Report problematic term value when fuzzy query is too complex

2020-01-17 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reopened LUCENE-9098:


Reopen so we can fix the failing seed ...

> Report problematic term value when fuzzy query is too complex
> -
>
> Key: LUCENE-9098
> URL: https://issues.apache.org/jira/browse/LUCENE-9098
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Minor
> Fix For: master (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is the lucene compliment to SOLR-13190, when fuzzy query gets a term 
> that expands to too many states, we throw an exception but don't provide 
> insight on the problematic term. We should improve the error reporting.






[jira] [Commented] (LUCENE-9125) Improve Automaton.step() with binary search and introduce Automaton.next()

2020-01-17 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018125#comment-17018125
 ] 

Bruno Roustant commented on LUCENE-9125:


{quote}The Automaton queries only use the {{step}} API while constructing the 
{{RunAutomaton}}
{quote}
Correct. Automaton.step() was also used in the minimize() operation, which is 
now a bit faster.
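
For reference, the core of the change is a classic binary search over sorted, 
disjoint label ranges; a simplified sketch (Automaton itself stores transitions 
in packed int arrays, so this flattened layout is only illustrative):

{code:java}
public class SortedTransitionLookup {
  // mins[i]..maxs[i] is the label range of transition i, sorted by min;
  // dests[i] is its destination state. Returns the destination state for
  // `label`, or -1 if no transition matches.
  static int step(int[] mins, int[] maxs, int[] dests, int label) {
    int lo = 0, hi = mins.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (maxs[mid] < label) {
        lo = mid + 1;        // range entirely below the label
      } else if (mins[mid] > label) {
        hi = mid - 1;        // range entirely above the label
      } else {
        return dests[mid];   // mins[mid] <= label <= maxs[mid]
      }
    }
    return -1;
  }
}
{code}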

> Improve Automaton.step() with binary search and introduce Automaton.next()
> --
>
> Key: LUCENE-9125
> URL: https://issues.apache.org/jira/browse/LUCENE-9125
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Implement the existing TODO in Automaton.step() (look up a transition from a 
> source state for a given label) using binary search, since the transitions 
> are sorted.
> Introduce a new method Automaton.next() to optimize iteration & lookup over 
> all the transitions of a state. This will be used in the RunAutomaton 
> constructor and in MinimizationOperations.minimize().






[jira] [Commented] (LUCENE-9137) Broken link 'Change log' for 8.4.1 on https://lucene.apache.org/core/downloads.html

2020-01-17 Thread Sebb (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018122#comment-17018122
 ] 

Sebb commented on LUCENE-9137:
--

The published web page has yet to be updated:

$ curl -Is https://lucene.apache.org/core/downloads.html 
HTTP/1.1 200 OK
Date: Fri, 17 Jan 2020 15:29:26 GMT
Last-Modified: Tue, 14 Jan 2020 13:08:01 GMT



> Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
> ---
>
> Key: LUCENE-9137
> URL: https://issues.apache.org/jira/browse/LUCENE-9137
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
>Reporter: Sebb
>Priority: Major
>







[jira] [Commented] (LUCENE-9123) JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter

2020-01-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018114#comment-17018114
 ] 

Michael McCandless commented on LUCENE-9123:


This solution would fix Kuromoji to create a simple chain of tokens, all with 
position increment 1 (no overlapping compound tokens)?

Would you only use that mode when parsing the synonyms to build the synonym 
filter (or synonym graph filter)?  (Since that seems to be where the error is 
occurring here).  Or would you also use that as your primary Tokenizer (which 
would mean you don't also get compound words directly out of Kuromoji).

Net/net it's disappointing that neither synonym filter nor synonym graph filter 
can correctly consume an incoming token graph; it'd be somewhat tricky to fix, 
but it is important.  I thought we had a dedicated issue for that but I cannot 
locate it now.

> JapaneseTokenizer with search mode doesn't work with SynonymGraphFilter
> ---
>
> Key: LUCENE-9123
> URL: https://issues.apache.org/jira/browse/LUCENE-9123
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.4
>Reporter: Kazuaki Hiraga
>Priority: Major
> Attachments: LUCENE-9123.patch, LUCENE-9123_revised.patch
>
>
> JapaneseTokenizer with `mode=search` or `mode=extended` doesn't work with 
> either SynonymGraphFilter or SynonymFilter when JapaneseTokenizer generates 
> multiple tokens as output. If we use `mode=normal`, it should be fine. 
> However, we would like to use decomposed tokens, which maximize the chance of 
> increasing recall.
> Snippet of schema:
> {code:xml}
> <fieldType name="text_ja" class="solr.TextField"
>  positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/synonyms_ja.txt"
>             tokenizerFactory="solr.JapaneseTokenizerFactory"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="lang/stoptags_ja.txt"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> {code}
> A synonym entry that generates the error:
> {noformat}
> 株式会社,コーポレーション
> {noformat}
> The following is the console output:
> {noformat}
> $ ./bin/solr create_core -c jp_test -d ../config/solrconfs
> ERROR: Error CREATEing SolrCore 'jp_test': Unable to create core [jp_test3] 
> Caused by: term: 株式会社 analyzed to a token (株式会社) with position increment != 1 
> (got: 0)
> {noformat}






[jira] [Commented] (LUCENE-9142) Add documentation to Operations.determinize, SortedIntSet, and FrozenSet

2020-01-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018103#comment-17018103
 ] 

Michael McCandless commented on LUCENE-9142:


This is indeed dangerously sneaky code – {{SortedIntSet.equals}} has special 
logic to compare only to a {{FrozenIntSet}} ... it's kinda weird that it cannot 
compare against another {{SortedIntSet}}, while {{FrozenIntSet.equals}} is 
symmetric (can compare against either class).

Maybe we could at least fix both of these {{equals}} methods to invoke the same 
(static) {{equals}}?

Hmm, and {{FrozenIntSet.equals}} looks buggy when it's comparing to a 
{{SortedIntSet}} – it's checking the length of the {{values}} array in the 
{{SortedIntSet}} when I think it should instead check against {{upto}}?  If 
that's really a bug, it may indeed be causing our determinize/minimize to not 
collapse as many states as it should?
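
Something like the following shape, perhaps (a sketch only; the field names are 
assumptions based on the discussion above):

{code:java}
public class IntSetEquality {
  // One static comparison both equals() implementations could call.
  // For a SortedIntSet, pass its live size (upto), not values.length;
  // a FrozenIntSet's values array is assumed to be exactly sized.
  static boolean setsEqual(int[] aValues, int aSize, int[] bValues, int bSize) {
    if (aSize != bSize) {
      return false;
    }
    for (int i = 0; i < aSize; i++) {
      if (aValues[i] != bValues[i]) {
        return false;
      }
    }
    return true;
  }
}
{code}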

> Add documentation to Operations.determinize, SortedIntSet, and FrozenSet
> 
>
> Key: LUCENE-9142
> URL: https://issues.apache.org/jira/browse/LUCENE-9142
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Mike Drob
>Priority: Major
>
> Was tracing through the fuzzy query code, and IntelliJ helpfully pointed out 
> that we have mismatched types when trying to reuse states, and so we may be 
> creating more states than we need to.
> Relevant snippets:
> {code:title=Operations.java}
> Map<SortedIntSet.FrozenIntSet, Integer> newstate = new HashMap<>();
> final SortedIntSet statesSet = new SortedIntSet(5);
> Integer q = newstate.get(statesSet);
> {code}
> {{q}} is always going to be null in this path because there are no 
> SortedIntSet keys in the map.
> There is also very little javadoc on SortedIntSet, so I'm having trouble 
> following the precise relationship between all the pieces here.
> cc: [~mikemccand] [~romseygeek] - I would appreciate any pointers if you have 
> them
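
A tiny illustration of why the lookup always misses when the key types differ 
(a generic sketch, not the Lucene classes):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class MismatchedKeyLookup {
  public static void main(String[] args) {
    // If the map's keys are all of one type and we probe with another type
    // whose equals() never matches, get() always returns null, so callers
    // rebuild state that already exists.
    Map<Long, Integer> newstate = new HashMap<>();
    newstate.put(42L, 7);
    Integer q = newstate.get(42); // Integer 42 never equals Long 42L
    System.out.println(q);        // prints: null
  }
}
{code}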






[jira] [Commented] (LUCENE-9134) Port ant-regenerate tasks to Gradle build

2020-01-17 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018096#comment-17018096
 ] 

Erick Erickson commented on LUCENE-9134:


BTW, I've got the javacc bits working, just trying to clean up enough so we 
don't need to hand-edit the results afterwards.

I'm having real trouble getting IntelliJ to recompile on demand even when the 
files haven't changed; it used to.

I'm also having trouble getting the gradle build to use -Xlint options. Digging...

> Port ant-regenerate tasks to Gradle build
> -
>
> Key: LUCENE-9134
> URL: https://issues.apache.org/jira/browse/LUCENE-9134
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-9134.patch, gen-kuromoji.patch
>
>
> Here are the "regenerate" targets I found in the ant version. There are a 
> couple for which I don't have evidence either way about being rebuilt.
>  // Very top level
> {code:java}
> ./build.xml: 
> ./build.xml:  failonerror="true">
> ./build.xml:  depends="regenerate,-check-after-regeneration"/>
>  {code}
> // Top-level Lucene. This includes the core/build.xml and 
> test-framework/build.xml files.
> {code:java}
> ./lucene/build.xml: 
> ./lucene/build.xml:  inheritall="false">
> ./lucene/build.xml: 
>  {code}
> // This one has quite a number of customizations.
> {code:java}
> ./lucene/core/build.xml: <target name="regenerate" depends="createLevAutomata,createPackedIntSources,jflex"/>
>  {code}
> // This one has a bunch of code modifications _after_ javacc is run on 
> certain of the
>  // output files. Save this one for last?
> {code:java}
> ./lucene/queryparser/build.xml: 
>  {code}
> // The files under ./lucene/analysis... are pretty self-contained. I expect 
> these could be done as a unit.
> {code:java}
> ./lucene/analysis/build.xml: 
> ./lucene/analysis/build.xml: 
> ./lucene/analysis/common/build.xml: <target name="regenerate" depends="jflex,unicode-data"/>
> ./lucene/analysis/icu/build.xml: <target name="regenerate" depends="gen-utr30-data-files,gennorm2,genrbbi"/>
> ./lucene/analysis/kuromoji/build.xml: <target name="regenerate" depends="build-dict"/>
> ./lucene/analysis/nori/build.xml: <target name="regenerate" depends="build-dict"/>
> ./lucene/analysis/opennlp/build.xml: <target name="regenerate" depends="train-test-models"/>
>  {code}
>  
> // These _are_ regenerated from the top-level regenerate target, but for 
> LUCENE-9080 the changes were only in imports, so there are no 
> corresponding files checked in in that JIRA.
> {code:java}
> ./lucene/expressions/build.xml: <target name="regenerate" depends="run-antlr"/>
>  {code}
> // Apparently unrelated to the ./lucene/analysis/opennlp/build.xml 
> "train-test-models" target.
> // Apparently not rebuilt from the top level, but _are_ regenerated when 
> executed from ./solr/contrib/langid:
> {code:java}
> ./solr/contrib/langid/build.xml: <target name="regenerate" depends="train-test-models"/>
>  {code}
>  






[jira] [Commented] (LUCENE-9134) Port ant-regenerate tasks to Gradle build

2020-01-17 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018090#comment-17018090
 ] 

Dawid Weiss commented on LUCENE-9134:
-

Note to self: use jgit. Treat the expanded dictionary as a fresh repo and apply 
the patch with jgit's patch command.
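
Roughly like this, perhaps (an untested sketch assuming JGit's ApplyCommand 
fits the job):

{code:java}
import java.io.File;
import java.io.FileInputStream;
import org.eclipse.jgit.api.Git;

public class ApplyDictPatch {
  public static void main(String[] args) throws Exception {
    File dictDir = new File(args[0]);   // expanded dictionary directory
    File patchFile = new File(args[1]); // patch to apply
    // Treat the expanded dictionary as a fresh repository and apply the
    // patch with JGit instead of the platform's `patch` command.
    try (Git git = Git.init().setDirectory(dictDir).call();
         FileInputStream patch = new FileInputStream(patchFile)) {
      git.apply().setPatch(patch).call();
    }
  }
}
{code}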

> Port ant-regenerate tasks to Gradle build
> -
>
> Key: LUCENE-9134
> URL: https://issues.apache.org/jira/browse/LUCENE-9134
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-9134.patch, gen-kuromoji.patch
>
>
> Here are the "regenerate" targets I found in the ant version. There are a 
> couple for which I don't have evidence either way about being rebuilt.
>  // Very top level
> {code:java}
> ./build.xml: 
> ./build.xml:  failonerror="true">
> ./build.xml:  depends="regenerate,-check-after-regeneration"/>
>  {code}
> // Top-level Lucene. This includes the core/build.xml and 
> test-framework/build.xml files.
> {code:java}
> ./lucene/build.xml: 
> ./lucene/build.xml:  inheritall="false">
> ./lucene/build.xml: 
>  {code}
> // This one has quite a number of customizations.
> {code:java}
> ./lucene/core/build.xml: <target name="regenerate" depends="createLevAutomata,createPackedIntSources,jflex"/>
>  {code}
> // This one has a bunch of code modifications _after_ javacc is run on 
> certain of the
>  // output files. Save this one for last?
> {code:java}
> ./lucene/queryparser/build.xml: 
>  {code}
> // The files under ./lucene/analysis... are pretty self-contained. I expect 
> these could be done as a unit.
> {code:java}
> ./lucene/analysis/build.xml: 
> ./lucene/analysis/build.xml: 
> ./lucene/analysis/common/build.xml: <target name="regenerate" depends="jflex,unicode-data"/>
> ./lucene/analysis/icu/build.xml: <target name="regenerate" depends="gen-utr30-data-files,gennorm2,genrbbi"/>
> ./lucene/analysis/kuromoji/build.xml: <target name="regenerate" depends="build-dict"/>
> ./lucene/analysis/nori/build.xml: <target name="regenerate" depends="build-dict"/>
> ./lucene/analysis/opennlp/build.xml: <target name="regenerate" depends="train-test-models"/>
>  {code}
>  
> // These _are_ regenerated from the top-level regenerate target, but for 
> LUCENE-9080 the changes were only in imports, so there are no 
> corresponding files checked in in that JIRA.
> {code:java}
> ./lucene/expressions/build.xml: <target name="regenerate" depends="run-antlr"/>
>  {code}
> // Apparently unrelated to the ./lucene/analysis/opennlp/build.xml 
> "train-test-models" target.
> // Apparently not rebuilt from the top level, but _are_ regenerated when 
> executed from ./solr/contrib/langid:
> {code:java}
> ./solr/contrib/langid/build.xml: <target name="regenerate" depends="train-test-models"/>
>  {code}
>  






[jira] [Commented] (LUCENE-9134) Port ant-regenerate tasks to Gradle build

2020-01-17 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018088#comment-17018088
 ] 

Dawid Weiss commented on LUCENE-9134:
-

I wanted to start with Kuromoji dictionary regeneration but it turned out it 
uses a patch command (which Windows lacks). I'll leave the patch here for now, 
and will pick it up again when I figure out how to do this in a 
platform-independent way.

> Port ant-regenerate tasks to Gradle build
> -
>
> Key: LUCENE-9134
> URL: https://issues.apache.org/jira/browse/LUCENE-9134
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-9134.patch, gen-kuromoji.patch
>
>
> Here are the "regenerate" targets I found in the ant version. There are a 
> couple for which I don't have evidence either way about being rebuilt.
>  // Very top level
> {code:java}
> ./build.xml: 
> ./build.xml:  failonerror="true">
> ./build.xml:  depends="regenerate,-check-after-regeneration"/>
>  {code}
> // Top-level Lucene. This includes the core/build.xml and 
> test-framework/build.xml files.
> {code:java}
> ./lucene/build.xml: 
> ./lucene/build.xml:  inheritall="false">
> ./lucene/build.xml: 
>  {code}
> // This one has quite a number of customizations.
> {code:java}
> ./lucene/core/build.xml: <target name="regenerate" depends="createLevAutomata,createPackedIntSources,jflex"/>
>  {code}
> // This one has a bunch of code modifications _after_ javacc is run on 
> certain of the
>  // output files. Save this one for last?
> {code:java}
> ./lucene/queryparser/build.xml: 
>  {code}
> // The files under ./lucene/analysis... are pretty self-contained. I expect 
> these could be done as a unit.
> {code:java}
> ./lucene/analysis/build.xml: 
> ./lucene/analysis/build.xml: 
> ./lucene/analysis/common/build.xml: <target name="regenerate" depends="jflex,unicode-data"/>
> ./lucene/analysis/icu/build.xml: <target name="regenerate" depends="gen-utr30-data-files,gennorm2,genrbbi"/>
> ./lucene/analysis/kuromoji/build.xml: <target name="regenerate" depends="build-dict"/>
> ./lucene/analysis/nori/build.xml: <target name="regenerate" depends="build-dict"/>
> ./lucene/analysis/opennlp/build.xml: <target name="regenerate" depends="train-test-models"/>
>  {code}
>  
> // These _are_ regenerated from the top-level regenerate target, but for 
> LUCENE-9080 the changes were only in imports, so there are no 
> corresponding files checked in in that JIRA.
> {code:java}
> ./lucene/expressions/build.xml: <target name="regenerate" depends="run-antlr"/>
>  {code}
> // Apparently unrelated to the ./lucene/analysis/opennlp/build.xml 
> "train-test-models" target.
> // Apparently not rebuilt from the top level, but _are_ regenerated when 
> executed from ./solr/contrib/langid:
> {code:java}
> ./solr/contrib/langid/build.xml: <target name="regenerate" depends="train-test-models"/>
>  {code}
>  






[jira] [Updated] (LUCENE-9134) Port ant-regenerate tasks to Gradle build

2020-01-17 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-9134:

Attachment: gen-kuromoji.patch

> Port ant-regenerate tasks to Gradle build
> -
>
> Key: LUCENE-9134
> URL: https://issues.apache.org/jira/browse/LUCENE-9134
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Erick Erickson
>Assignee: Erick Erickson
>Priority: Major
> Attachments: LUCENE-9134.patch, gen-kuromoji.patch
>
>
> Here are the "regenerate" targets I found in the ant version. There are a 
> couple for which I don't have evidence either way about being rebuilt.
>  // Very top level
> {code:java}
> ./build.xml: 
> ./build.xml:  failonerror="true">
> ./build.xml:  depends="regenerate,-check-after-regeneration"/>
>  {code}
> // Top-level Lucene. This includes the core/build.xml and 
> test-framework/build.xml files.
> {code:java}
> ./lucene/build.xml: 
> ./lucene/build.xml:  inheritall="false">
> ./lucene/build.xml: 
>  {code}
> // This one has quite a number of customizations.
> {code:java}
> ./lucene/core/build.xml: <target name="regenerate" depends="createLevAutomata,createPackedIntSources,jflex"/>
>  {code}
> // This one has a bunch of code modifications _after_ javacc is run on 
> certain of the
>  // output files. Save this one for last?
> {code:java}
> ./lucene/queryparser/build.xml: 
>  {code}
> // The files under ./lucene/analysis... are pretty self-contained. I expect 
> these could be done as a unit.
> {code:java}
> ./lucene/analysis/build.xml: 
> ./lucene/analysis/build.xml: 
> ./lucene/analysis/common/build.xml: <target name="regenerate" depends="jflex,unicode-data"/>
> ./lucene/analysis/icu/build.xml: <target name="regenerate" depends="gen-utr30-data-files,gennorm2,genrbbi"/>
> ./lucene/analysis/kuromoji/build.xml: <target name="regenerate" depends="build-dict"/>
> ./lucene/analysis/nori/build.xml: <target name="regenerate" depends="build-dict"/>
> ./lucene/analysis/opennlp/build.xml: <target name="regenerate" depends="train-test-models"/>
>  {code}
>  
> // These _are_ regenerated from the top-level regenerate target, but for 
> LUCENE-9080 the changes were only in imports, so there are no 
> corresponding files checked in in that JIRA.
> {code:java}
> ./lucene/expressions/build.xml: <target name="regenerate" depends="run-antlr"/>
>  {code}
> // Apparently unrelated to the ./lucene/analysis/opennlp/build.xml 
> "train-test-models" target.
> // Apparently not rebuilt from the top level, but _are_ regenerated when 
> executed from ./solr/contrib/langid:
> {code:java}
> ./solr/contrib/langid/build.xml: <target name="regenerate" depends="train-test-models"/>
>  {code}
>  






[jira] [Comment Edited] (LUCENE-9077) Gradle build

2020-01-17 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018082#comment-17018082
 ] 

David Smiley edited comment on LUCENE-9077 at 1/17/20 2:41 PM:
---

In general I suggest NOT jumping between major versions on a single 
checkout/worktree.  On my machine I have multiple "git worktree" checkouts for 
the major branches.  If I want to go between, say, branch_8x and the 8.4 
release branch, then I do it on that worktree and *not* on master.  It's just 
too disruptive to things like the expected JDK, modules, IDE issues, etc.


was (Author: dsmiley):
In general I suggest jumping between major versions on a single 
checkout/worktree.  On my machine I have multiple "git worktree" for the major 
branches.  If I want to go between say branch_8x and say perhaps the 8.4 
release branch then I do it on that worktree and *not* master.  It's just too 
disruptive to things like the expected JDK, modules, IDE issues, etc.

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
> * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint' to precommit.
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that 

[jira] [Commented] (LUCENE-9077) Gradle build

2020-01-17 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018082#comment-17018082
 ] 

David Smiley commented on LUCENE-9077:
--

In general I suggest jumping between major versions on a single 
checkout/worktree.  On my machine I have multiple "git worktree" for the major 
branches.  If I want to go between say branch_8x and say perhaps the 8.4 
release branch then I do it on that worktree and *not* master.  It's just too 
disruptive to things like the expected JDK, modules, IDE issues, etc.

> Gradle build
> 
>
> Key: LUCENE-9077
> URL: https://issues.apache.org/jira/browse/LUCENE-9077
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> This task focuses on providing gradle-based build equivalent for Lucene and 
> Solr (on master branch). See notes below on why this respin is needed.
> The code lives on the *gradle-master* branch. It is kept in sync with *master*. 
> Try running the following to see an overview of helper guides concerning 
> typical workflow, testing and ant-migration helpers:
> gradlew :help
> A list of items that need to be added or require work. If you'd like to 
> work on any of these, please add your name to the list. Once you have a 
> patch/ pull request let me (dweiss) know - I'll try to coordinate the merges.
>  * (/) Apply forbiddenAPIs
>  * (/) Generate hardware-aware gradle defaults for parallelism (count of 
> workers and test JVMs).
>  * (/) Fail the build if --tests filter is applied and no tests execute 
> during the entire build (this allows for an empty set of filtered tests at 
> single project level).
>  * (/) Port other settings and randomizations from common-build.xml
>  * (/) Configure security policy/ sandboxing for tests.
>  * (/) test's console output on -Ptests.verbose=true
>  * (/) add a :helpDeps explanation to how the dependency system works 
> (palantir plugin, lockfile) and how to retrieve structured information about 
> current dependencies of a given module (in a tree-like output).
>  * (/) jar checksums, jar checksum computation and validation. This should be 
> done without intermediate folders (directly on dependency sets).
>  * (/) verify min. JVM version and exact gradle version on build startup to 
> minimize odd build side-effects
>  * (/) Repro-line for failed tests/ runs.
>  * (/) add a top-level README note about building with gradle (and the 
> required JVM).
>  * (/) add an equivalent of 'validate-source-patterns' 
> (check-source-patterns.groovy) to precommit.
>  * (/) add an equivalent of 'rat-sources' to precommit.
>  * (/) add an equivalent of 'check-example-lucene-match-version' (solr only) 
> to precommit.
> * (/) javadoc compilation
> Hard-to-implement stuff already investigated:
>  * (/) (done)  -*Printing console output of failed tests.* There doesn't seem 
> to be any way to do this in a reasonably efficient way. There are onOutput 
> listeners but they're slow to operate and solr tests emit *tons* of output so 
> it's an overkill.-
>  * (!) (LUCENE-9120) *Tests working with security-debug logs or other 
> JVM-early log output*. Gradle's test runner works by redirecting Java's 
> stdout/ syserr so this just won't work. Perhaps we can spin the ant-based 
> test runner for such corner-cases.
> Of lesser importance:
>  * Add an equivalent of 'documentation-lint' to precommit.
>  * (/) add rendering of javadocs (gradlew javadoc)
>  * Attach javadocs to maven publications.
>  * Add test 'beasting' (rerunning the same suite multiple times). I'm afraid 
> it'll be difficult to run it sensibly because gradle doesn't offer cwd 
> separation for the forked test runners.
>  * if you diff solr packaged distribution against ant-created distribution 
> there are minor differences in library versions and some JARs are excluded/ 
> moved around. I didn't try to force these as everything seems to work (tests, 
> etc.) – perhaps these differences should  be fixed in the ant build instead.
>  * [EOE] identify and port various "regenerate" tasks from ant builds 
> (javacc, precompiled automata, etc.)
>  * Fill in POM details in gradle/defaults-maven.gradle so that they reflect 
> the previous content better (dependencies aside).
>  * Add any IDE integration layers that should be added (I use IntelliJ and it 
> imports the project out of the box, without the need for any special tuning).
>  * Add Solr packaging for docs/* (see TODO in packaging/build.gradle; 
> currently XSLT...)
>  * I didn't bother adding Solr dist/test-framework to packaging (who'd use it 
> from a binary distribution?) 
>  
> *{color:#ff}Note:{color}* this builds on the work done by Mark Miller and 
> Cao Mạnh Đạt but also 

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #1179: LUCENE-9147: Move the stored fields index off-heap.

2020-01-17 Thread GitBox
mikemccand commented on a change in pull request #1179: LUCENE-9147: Move the 
stored fields index off-heap.
URL: https://github.com/apache/lucene-solr/pull/1179#discussion_r367955319
 
 

 File path: 
lucene/core/src/java/org/apache/lucene/util/packed/DirectMonotonicReader.java
 @@ -101,20 +104,99 @@ public static LongValues getInstance(Meta meta, 
RandomAccessInput data) throws I
 readers[i] = DirectReader.getInstance(data, meta.bpvs[i], 
meta.offsets[i]);
   }
 }
-final int blockShift = meta.blockShift;
-
-final long[] mins = meta.mins;
-final float[] avgs = meta.avgs;
-return new LongValues() {
-
-  @Override
-  public long get(long index) {
-final int block = (int) (index >>> blockShift);
-final long blockIndex = index & ((1 << blockShift) - 1);
-final long delta = readers[block].get(blockIndex);
-return mins[block] + (long) (avgs[block] * blockIndex) + delta;
+
+return new DirectMonotonicReader(meta.blockShift, readers, meta.mins, 
meta.avgs, meta.bpvs);
+  }
+
+  private final int blockShift;
+  private final LongValues[] readers;
+  private final long[] mins;
+  private final float[] avgs;
+  private final byte[] bpvs;
+  private final int nonZeroBpvs;
+
+  private DirectMonotonicReader(int blockShift, LongValues[] readers, long[] 
mins, float[] avgs, byte[] bpvs) {
+this.blockShift = blockShift;
+this.readers = readers;
+this.mins = mins;
+this.avgs = avgs;
+this.bpvs = bpvs;
+if (readers.length != mins.length || readers.length != avgs.length || 
readers.length != bpvs.length) {
+  throw new IllegalArgumentException();
+}
+int nonZeroBpvs = 0;
+for (byte b : bpvs) {
+  if (b != 0) {
+nonZeroBpvs++;
+  }
+}
+this.nonZeroBpvs = nonZeroBpvs;
+  }
+
+  @Override
+  public long get(long index) {
+final int block = (int) (index >>> blockShift);
+final long blockIndex = index & ((1 << blockShift) - 1);
+final long delta = readers[block].get(blockIndex);
+return mins[block] + (long) (avgs[block] * blockIndex) + delta;
+  }
+
+  /** Get lower/upper bounds for the value at a given index without hitting 
the direct reader. */
+  private long[] getBounds(long index) {
+final int block = (int) (index >>> blockShift);
 
 Review comment:
   Do we know this incoming `long index` is small enough not to overflow `int` 
after right shift?  Should we use `Math.toIntExact` instead to confirm?
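
   i.e., a sketch of the suggestion (fail loudly on overflow instead of
   silently truncating):
   
       final int block = Math.toIntExact(index >>> blockShift);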





[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-01-17 Thread Gus Heck (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018063#comment-17018063
 ] 

Gus Heck commented on SOLR-13749:
-

Sure, I'll look at it tomorrow.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This ticket includes two query parsers.
> The first is the "cross-collection join filter" (XCJF) parser. It can call 
> out to a remote collection to get a set of join keys to be used as a filter 
> against the local collection.
> The second is the hash range query parser: you specify a field name and a 
> hash range, and only the documents whose values hash to that range are 
> returned.
> The XCJF parser does an intersection based on join keys between two 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request will execute a query 
> against the remote collection.  If the local collection is set up with the 
> compositeId router to be routed on the join key field, a hash range query is 
> applied to the remote collection query to only match the documents that 
> contain a potential match for the documents that are in the local shard/core. 
>  
>  
> Here's some vocab to help with the descriptions of the various parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The lucene query that executes a search to get back a set of join 
> keys from a remote collection|
> |HashRangeQuery|The lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values ( required )|
> |zkHost|Optional|The connection string to be used to connect to Zookeeper.  
> zkHost and solrUrl are both optional parameters, and at most one of them 
> should be specified.  
> If neither zkHost nor solrUrl is specified, the local Zookeeper cluster 
> will be used. ( optional )|
> |solrUrl|Optional|The URL of the external Solr node to be queried ( optional 
> )|
> |from|Required|The join key field name in the external collection ( required 
> )|
> |to|Required|The join key field name in the local collection|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF queries may give inaccurate 
> results.  
> After the ttl period has expired, the XCJF query will re-execute the join 
> against the remote collection.|
> |_All others_| |Any normal Solr parameter can also be specified as a local 
> param.|
>  
> Example solrconfig.xml changes:
>  
> {code:xml}
> <cache name="hash_vin"
>        class="solr.LRUCache"
>        size="128"
>        initialSize="0"
>        regenerator="solr.NoOpRegenerator"/>
>  
> <queryParser name="xcjf" class="org.apache.solr.search.join.XCJFQueryParserPlugin">
>   <str name="routerField">vin</str>
> </queryParser>
> {code}
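
A hypothetical request sketch via SolrJ (the local-params syntax below is 
inferred from the parameter table above, not taken verbatim from the ticket; 
the collection and field names are made up):

{code:java}
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class XcjfExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/localCollection").build()) {
      SolrQuery q = new SolrQuery("*:*");
      // Join against remoteCollection on the vin field; the "v" parameter
      // carries the query run remotely to produce the join keys.
      q.addFilterQuery("{!xcjf collection=remoteCollection from=vin to=vin v=$joinQuery}");
      q.set("joinQuery", "model:Camry");
      System.out.println(client.query(q).getResults().getNumFound());
    }
  }
}
{code}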

[jira] [Commented] (LUCENE-9130) Failed to match when create PhraseQuery with terms analyzed from long query text

2020-01-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018061#comment-17018061
 ] 

Michael McCandless commented on LUCENE-9130:


{quote}This is an issue tracker not a support portal. It's for when you are 
*certain* of a specific behavior in contravention of published documentation, 
or clear errors (like unexpected stack traces). When you are *confused* or 
don't know something you will get better responses using the mailing list.
{quote}
+1, but really [~mkhl] should have stated this when he originally resolved the 
issue as INVALID.

> Failed to match when create PhraseQuery with terms analyzed from long query 
> text
> 
>
> Key: LUCENE-9130
> URL: https://issues.apache.org/jira/browse/LUCENE-9130
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Chen Zhixiang
>Priority: Major
> Attachments: LongTextFieldSearchTest.java
>
>
> When I use a long text (which is equal to the doc's StringField at indexing 
> time) to build a PhraseQuery, I cannot match the document. But a BooleanQuery 
> with MUST/AND clauses succeeds.
>  
> The long query text is an address string: 
> "申长路988弄虹桥万科中心地下停车场LG2层2179-2184车位(锡虹路入,LG1层开到底下LG2)"
> A test case is attached.
> Logs:
>  
> 15:46:11.940 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:11.956 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:11.962 [main] INFO test.LongTextFieldSearchTest - query: +(+address:申 
> +address:长 +address:路 +address:988 +address:弄 +address:虹桥 +address:万 
> +address:科 +address:中 +address:心 +address:地下 +address:停车场 +address:lg 
> +address:2 +address:层 +address:2179 +address:2184 +address:车位 +address:锡 
> +address:虹 +address:路 +address:入 +address:lg +address:1 +address:层 +address:开 
> +address:到 +address:底下 +address:lg +address:2)
> 15:46:11.988 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=1
> 15:46:12.181 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:12.185 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:12.188 [main] INFO test.LongTextFieldSearchTest - query: +address:"申 长 
> 路 988 弄 虹桥 万 科 中 心 地下 停车场 lg 2 层 2179 2184 车位 锡 虹 路 入 lg 1 层 开 到 底下 lg 2"~2
> 15:46:12.210 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=0
> 15:46:12.214 [main] INFO test.LongTextFieldSearchTest - no matching phrase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9130) Failed to match when create PhraseQuery with terms analyzed from long query text

2020-01-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018057#comment-17018057
 ] 

Michael McCandless commented on LUCENE-9130:


{quote}After I change the API call to add(term, position), and pass 0,2,4,6,7, 
which is the same as analyzed at indexing time (because they are the same text 
string), it matches!
{quote}
If you use the same analyzer during query parsing as you used during indexing, 
this (setting the right positions for each term in the {{PhraseQuery}}) should 
have happened "for free".
{quote}Now I'm confused: what's the relation between a Term's position value and 
PhraseQuery's slop parameter?
{quote}
The {{slop}} parameter states how precisely the term positions in the document 
must match the positions in the query.  A {{slop}} of 0 means the match must be 
identical; a {{slop}} of 1 means it can tolerate one term being in the wrong 
position in the document, etc.  It's an edit-distance measure, at the term 
level.
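
For illustration, a minimal sketch of that edit-distance behavior (the field 
and terms are made up; assume a document containing "quick brown fox" indexed 
at positions 0, 1, 2):

{code:java}
PhraseQuery.Builder b = new PhraseQuery.Builder();
b.add(new Term("body", "quick"), 0);
b.add(new Term("body", "fox"), 1);  // claims fox is immediately after quick
Query exact = b.build();            // slop 0: no match, fox is really at position 2
b.setSlop(1);
Query sloppy = b.build();           // slop 1: matches, one position move is tolerated
{code}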

> Failed to match when create PhraseQuery with terms analyzed from long query 
> text
> 
>
> Key: LUCENE-9130
> URL: https://issues.apache.org/jira/browse/LUCENE-9130
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Chen Zhixiang
>Priority: Major
> Attachments: LongTextFieldSearchTest.java
>
>
> When I use a long text (which is equal to the doc's StringField at indexing 
> time) to build a PhraseQuery, I cannot match the document. But a BooleanQuery 
> with MUST/AND mode succeeds.
>  
> The long query text is an address string: 
> "申长路988弄虹桥万科中心地下停车场LG2层2179-2184车位(锡虹路入,LG1层开到底下LG2)"
> test case is attached.
> logs:
>  
> 15:46:11.940 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:11.956 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:11.962 [main] INFO test.LongTextFieldSearchTest - query: +(+address:申 
> +address:长 +address:路 +address:988 +address:弄 +address:虹桥 +address:万 
> +address:科 +address:中 +address:心 +address:地下 +address:停车场 +address:lg 
> +address:2 +address:层 +address:2179 +address:2184 +address:车位 +address:锡 
> +address:虹 +address:路 +address:入 +address:lg +address:1 +address:层 +address:开 
> +address:到 +address:底下 +address:lg +address:2)
> 15:46:11.988 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=1
> 15:46:12.181 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:12.185 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:12.188 [main] INFO test.LongTextFieldSearchTest - query: +address:"申 长 
> 路 988 弄 虹桥 万 科 中 心 地下 停车场 lg 2 层 2179 2184 车位 锡 虹 路 入 lg 1 层 开 到 底下 lg 2"~2
> 15:46:12.210 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=0
> 15:46:12.214 [main] INFO test.LongTextFieldSearchTest - no matching phrase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-5669) queries containing \u return error: "Truncated unicode escape sequence."

2020-01-17 Thread Gus Heck (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-5669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018051#comment-17018051
 ] 

Gus Heck commented on SOLR-5669:


This is more an undocumented feature than a bug. That error message comes from 
SolrQueryParserBase:

{code}
  /** Returns the numeric value of the hexadecimal character */
  static final int hexToInt(char c) throws ParseException {
    if ('0' <= c && c <= '9') {
      return c - '0';
    } else if ('a' <= c && c <= 'f') {
      return c - 'a' + 10;
    } else if ('A' <= c && c <= 'F') {
      return c - 'A' + 10;
    } else {
      throw new ParseException("Non-hex character in Unicode escape sequence: " + c);
    }
  }
{code}

I don't find documentation of it in the ref guide, however, so that could be 
added. For edismax one might also request an enhancement to not error on this, 
which would be consistent with the stated goal in the edismax docs of being 
tolerant of errors. This ticket, however, should probably only document the 
feature in the ref guide (or point to said documentation, if my quick search 
through the guide failed to reveal it).
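
For reference, a sketch of what that parser accepts (the query strings here are 
made up; the escape must be followed by four hex digits):

{code}
q=\u0041BC   parses to the term "ABC" (\u0041 is 'A')
q=\ujb       fails: "Non-hex character in Unicode escape sequence: j"
{code}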

> queries containing \u  return error: "Truncated unicode escape sequence."
> -
>
> Key: SOLR-5669
> URL: https://issues.apache.org/jira/browse/SOLR-5669
> Project: Solr
>  Issue Type: Bug
>  Components: query parsers
>Affects Versions: 4.4
>Reporter: Dorin Oltean
>Priority: Minor
>
> When I do the following query:
> /select?q=\ujb
> I get 
> {quote}
> "org.apache.solr.search.SyntaxError: Non-hex character in Unicode escape 
> sequence: j",
> {quote}
> To make it work I have to put another '\' in front of the query:
> {noformat}\\ujb{noformat}
> which in fact leads to a different query in Solr.
> I use the edismax qparser.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13749) Implement support for joining across collections with multiple shards ( XCJF )

2020-01-17 Thread Kevin Watters (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018052#comment-17018052
 ] 

Kevin Watters commented on SOLR-13749:
--

Hey [~gus], Dan Fox just did the backport.  It's available here:  

[https://github.com/apache/lucene-solr/pull/1175]

I was curious if you wouldn't mind giving it a merge?  There were no code changes 
between master and 8x for this pull request.

> Implement support for joining across collections with multiple shards ( XCJF )
> --
>
> Key: SOLR-13749
> URL: https://issues.apache.org/jira/browse/SOLR-13749
> Project: Solr
>  Issue Type: New Feature
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Kevin Watters
>Assignee: Gus Heck
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This ticket includes 2 query parsers.
> The first one is the "Cross-collection join filter" (XCJF) query parser. It 
> can call out to a remote collection to get a set of join keys to be used as 
> a filter against the local collection.
> The second one is the Hash Range query parser: you specify a field name and 
> a hash range, and only the documents that would have hashed to that range 
> will be returned.
> The XCJF query parser does an intersection based on join keys between 2 
> collections.
> The local collection is the collection that you are searching against.
> The remote collection is the collection that contains the join keys that you 
> want to use as a filter.
> Each shard participating in the distributed request will execute a query 
> against the remote collection.  If the local collection is setup with the 
> compositeId router to be routed on the join key field, a hash range query is 
> applied to the remote collection query to only match the documents that 
> contain a potential match for the documents that are in the local shard/core. 
>  
>  
> Here's some vocab to help with the descriptions of the various parameters.
> ||Term||Description||
> |Local Collection|This is the main collection that is being queried.|
> |Remote Collection|This is the collection that the XCJFQuery will query to 
> resolve the join keys.|
> |XCJFQuery|The lucene query that executes a search to get back a set of join 
> keys from a remote collection|
> |HashRangeQuery|The lucene query that matches only the documents whose hash 
> code on a field falls within a specified range.|
>  
>  
> ||Param ||Required ||Description||
> |collection|Required|The name of the external Solr collection to be queried 
> to retrieve the set of join key values.|
> |zkHost|Optional|The connection string to be used to connect to Zookeeper.  
> zkHost and solrUrl are both optional parameters, and at most one of them 
> should be specified.  
> If neither zkHost nor solrUrl is specified, the local Zookeeper cluster 
> will be used.|
> |solrUrl|Optional|The URL of the external Solr node to be queried.|
> |from|Required|The join key field name in the external collection.|
> |to|Required|The join key field name in the local collection.|
> |v|See Note|The query to be executed against the external Solr collection to 
> retrieve the set of join key values.  
> Note:  The original query can be passed at the end of the string or as the 
> "v" parameter.  
> It's recommended to use query parameter substitution with the "v" parameter 
> to ensure no issues arise with the default query parsers.|
> |routed| |true / false.  If true, the XCJF query will use each shard's hash 
> range to determine the set of join keys to retrieve for that shard.
> This parameter improves the performance of the cross-collection join, but 
> it depends on the local collection being routed by the toField.  If this 
> parameter is not specified, 
> the XCJF query will try to determine the correct value automatically.|
> |ttl| |The length of time that an XCJF query in the cache will be considered 
> valid, in seconds.  Defaults to 3600 (one hour).  
> The XCJF query will not be aware of changes to the remote collection, so 
> if the remote collection is updated, cached XCJF queries may give inaccurate 
> results.  
> After the ttl period has expired, the XCJF query will re-execute the join 
> against the remote collection.|
> |_All others_| |Any normal Solr parameter can also be specified as a local 
> param.|
>  
> Example solrconfig.xml changes:
>  
> <cache name="hash_vin"
>        class="solr.LRUCache"
>        size="128"
>        initialSize="0"
>        regenerator="solr.NoOpRegenerator"/>
>  
> <queryParser name="xcjf"
>              class="org.apache.solr.search.join.XCJFQueryParserPlugin">
>   <str name="routerField">vin</str>
> </queryParser>

[jira] [Commented] (LUCENE-9125) Improve Automaton.step() with binary search and introduce Automaton.next()

2020-01-17 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018049#comment-17018049
 ] 

Michael McCandless commented on LUCENE-9125:


{quote}Here is the benchmark for wikimediumall:
{quote}
Thanks – these results look more realistic!  Looks like mostly noise ...

The Automaton queries only use the {{step}} API while constructing the 
{{RunAutomaton}}, which is then used to (quickly) walk the transitions, right?

> Improve Automaton.step() with binary search and introduce Automaton.next()
> --
>
> Key: LUCENE-9125
> URL: https://issues.apache.org/jira/browse/LUCENE-9125
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
> Fix For: 8.5
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Implement the existing TODO in Automaton.step() (look up a transition from a 
> source state depending on a given label) to use binary search, since the 
> transitions are sorted.
> Introduce a new method, Automaton.next(), to optimize iteration & lookup over 
> all the transitions of a state. This will be used in the RunAutomaton 
> constructor and in MinimizationOperations.minimize().
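
Not the actual patch, but a minimal sketch of the binary-search idea over one 
state's sorted transitions (the flat arrays {{min}}, {{max}}, {{dest}} are 
hypothetical, holding that state's transitions sorted by {{min}}):

{code:java}
/** Returns the destination state for label, or -1 if no transition matches. */
static int step(int[] min, int[] max, int[] dest, int label) {
  int lo = 0, hi = min.length - 1;
  while (lo <= hi) {
    int mid = (lo + hi) >>> 1;
    if (max[mid] < label) {
      lo = mid + 1;        // this interval lies entirely below the label
    } else if (min[mid] > label) {
      hi = mid - 1;        // this interval lies entirely above the label
    } else {
      return dest[mid];    // min[mid] <= label <= max[mid]
    }
  }
  return -1;
}
{code}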



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8369) Remove the spatial module as it is obsolete

2020-01-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018026#comment-17018026
 ] 

ASF subversion and git services commented on LUCENE-8369:
-

Commit 78655239c58a1ed72d6e015dd05a0b355c936999 in lucene-solr's branch 
refs/heads/gradle-master from Nicholas Knize
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=7865523 ]

LUCENE-8369: Remove obsolete spatial module


> Remove the spatial module as it is obsolete
> ---
>
> Key: LUCENE-8369
> URL: https://issues.apache.org/jira/browse/LUCENE-8369
> Project: Lucene - Core
>  Issue Type: Task
>  Components: modules/spatial
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
> Attachments: LUCENE-8369.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The "spatial" module is at this juncture nearly empty with only a couple 
> utilities that aren't used by anything in the entire codebase -- 
> GeoRelationUtils, and MortonEncoder.  Perhaps it should have been removed 
> earlier in LUCENE-7664 which was the removal of GeoPointField which was 
> essentially why the module existed.  Better late than never.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14130) Add postlogs command line tool for indexing Solr logs

2020-01-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018025#comment-17018025
 ] 

ASF subversion and git services commented on SOLR-14130:


Commit 35d8e3de6d5931bfd6cba3221cfd0dca7f97c1a1 in lucene-solr's branch 
refs/heads/gradle-master from Joel Bernstein
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=35d8e3d ]

SOLR-14130: Continue to improve log parsing logic


> Add postlogs command line tool for indexing Solr logs
> -
>
> Key: SOLR-14130
> URL: https://issues.apache.org/jira/browse/SOLR-14130
> Project: Solr
>  Issue Type: Task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Joel Bernstein
>Assignee: Joel Bernstein
>Priority: Major
> Attachments: SOLR-14130.patch, SOLR-14130.patch, SOLR-14130.patch, 
> SOLR-14130.patch, SOLR-14130.patch, SOLR-14130.patch, SOLR-14130.patch, 
> Screen Shot 2019-12-19 at 2.04.41 PM.png, Screen Shot 2019-12-19 at 2.16.01 
> PM.png, Screen Shot 2019-12-19 at 2.35.41 PM.png, Screen Shot 2019-12-21 at 
> 8.46.51 AM.png
>
>
> This ticket adds a simple command line tool for posting Solr logs to a Solr 
> index. The tool works with the out-of-the-box Solr log format. It is still a 
> work in progress but currently indexes:
>  * queries
>  * updates
>  * commits
>  * new searchers
>  * errors - including stack traces
> Attached are some sample visualizations using Solr Streaming Expressions and 
> Math Expressions after the data has been loaded. The visualizations show: 
> time series, scatter plots, histograms and quantile plots, but really this is 
> just scratching the surface of the visualizations that can be done with the 
> Solr logs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-14184) replace DirectUpdateHandler2.commitOnClose with (negated) TestInjection.skipIndexWriterCommitOnClose

2020-01-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018027#comment-17018027
 ] 

ASF subversion and git services commented on SOLR-14184:


Commit 5f2d7c4855987670489d68884c787e4cfb377fa9 in lucene-solr's branch 
refs/heads/gradle-master from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5f2d7c4 ]

SOLR-14184: Internal 'test' variable DirectUpdateHandler2.commitOnClose has 
been removed and replaced with TestInjection.skipIndexWriterCommitOnClose


> replace DirectUpdateHandler2.commitOnClose with (negated) 
> TestInjection.skipIndexWriterCommitOnClose
> 
>
> Key: SOLR-14184
> URL: https://issues.apache.org/jira/browse/SOLR-14184
> Project: Solr
>  Issue Type: Test
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: master (9.0), 8.5
>
> Attachments: SOLR-14184.patch, SOLR-14184.patch
>
>
> {code:java}
> public static volatile boolean commitOnClose = true;  // TODO: make this a 
> real config option or move it to TestInjection
> {code}
> Lots of tests muck with this (to simulate unclean shutdown and force tlog 
> replay on restart) but there's no guarantee that it is reset properly.
> It should be replaced by logic in {{TestInjection}} that is correctly cleaned 
> up by {{TestInjection.reset()}}
> 
> It's been replaced with the (negated) option 
> {{TestInjection.skipIndexWriterCommitOnClose}} which is automatically reset 
> to its default value of {{false}} by {{TestInjection.reset()}}
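
Not the actual class, but a sketch of the pattern the issue describes (the 
names follow the issue; the reset bookkeeping shown is illustrative):

{code:java}
public class TestInjection {
  // defaults to false; tests flip it to simulate an unclean shutdown
  public static volatile boolean skipIndexWriterCommitOnClose = false;

  /** Called between tests, so a forgotten flag cannot leak into later tests. */
  public static void reset() {
    skipIndexWriterCommitOnClose = false;
  }
}
{code}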



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-01-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018022#comment-17018022
 ] 

Adrien Grand commented on LUCENE-9147:
--

[~erickerickson] Yeah I have similar motivations, with many users who want to 
open terabytes of indices on rather small nodes. In my case the main heap user 
is usually the terms index of a primary/foreign key, so the ability to load the 
terms index off-heap addresses most of the problem. But since it should be an 
even less contentious move for stored fields and term vectors, I thought we 
should do it! :)

> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (SOLR-14194) Allow Highlighting to work for indexes with uniqueKey that is not stored

2020-01-17 Thread Andrzej Wislowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Wislowski updated SOLR-14194:
-
Attachment: SOLR-14194.patch
Status: Open  (was: Open)

> Allow Highlighting to work for indexes with uniqueKey that is not stored
> 
>
> Key: SOLR-14194
> URL: https://issues.apache.org/jira/browse/SOLR-14194
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: highlighter
>Affects Versions: master (9.0)
>Reporter: Andrzej Wislowski
>Priority: Minor
> Fix For: master (9.0)
>
> Attachments: SOLR-14194.patch
>
>
> Highlighting requires uniqueKey to be a stored field. I have changed the 
> Highlighter to allow returning results on indexes whose uniqueKey is not a 
> stored field but is saved as a docValues type.
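
For illustration, such a uniqueKey field would be declared along these lines in 
the schema (a sketch; the field name is assumed):

{code:xml}
<field name="id" type="string" indexed="true" stored="false" docValues="true" required="true"/>
<uniqueKey>id</uniqueKey>
{code}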



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9147) Move the stored fields index off-heap

2020-01-17 Thread Erick Erickson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018007#comment-17018007
 ] 

Erick Erickson commented on LUCENE-9147:


If you only knew how much of my time with clients is spent dealing with "how 
much memory should I allocate" ;). So while I don't have an opinion on the 
technical aspects, anything we can do to reduce heap requirements is welcome.

> Move the stored fields index off-heap
> -
>
> Key: LUCENE-9147
> URL: https://issues.apache.org/jira/browse/LUCENE-9147
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that the terms index is off-heap by default, it's almost embarrassing 
> that many indices spend most of their memory usage on the stored fields index 
> or the term vectors index, which are much less performance-sensitive than the 
> terms index. We should move them off-heap too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9125) Improve Automaton.step() with binary search and introduce Automaton.next()

2020-01-17 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017978#comment-17017978
 ] 

Bruno Roustant commented on LUCENE-9125:


In the benchmark above I mistakenly used wikimedium10k (I edited it to mention that).

Here is the benchmark for wikimediumall:

                    Task   QPS trunk      StdDev   QPS patch      StdDev            Pct diff
           OrHighNotHigh      769.84      (4.8%)      756.84      (5.0%)   -1.7% ( -10% -    8%)
            OrNotHighLow      664.03      (4.2%)      653.64      (3.4%)   -1.6% (  -8% -    6%)
            OrNotHighMed      574.56      (3.0%)      566.90      (2.5%)   -1.3% (  -6% -    4%)
                 MedTerm     1373.80      (3.9%)     1359.30      (5.1%)   -1.1% (  -9% -    8%)
             AndHighHigh       19.84      (3.6%)       19.67      (2.9%)   -0.9% (  -7% -    5%)
              AndHighLow      474.49      (2.9%)      470.36      (3.6%)   -0.9% (  -7% -    5%)
                  Fuzzy1       69.27     (10.7%)       68.75     (11.0%)   -0.7% ( -20% -   23%)
           OrNotHighHigh      569.30      (3.4%)      565.26      (5.0%)   -0.7% (  -8% -    7%)
               MedPhrase       36.97      (2.4%)       36.76      (2.7%)   -0.6% (  -5% -    4%)
                HighTerm     1133.65      (4.2%)     1128.30      (4.3%)   -0.5% (  -8% -    8%)
               OrHighLow      227.08      (2.9%)      226.24      (3.3%)   -0.4% (  -6% -    6%)
              OrHighHigh       24.17      (2.6%)       24.08      (2.4%)   -0.4% (  -5% -    4%)
                 Prefix3       25.30      (3.8%)       25.22      (3.7%)   -0.3% (  -7% -    7%)
               OrHighMed       48.26      (3.1%)       48.11      (3.1%)   -0.3% (  -6% -    6%)
                 LowTerm     1087.75      (3.4%)     1084.44      (3.3%)   -0.3% (  -6% -    6%)
              AndHighMed       69.62      (3.9%)       69.44      (4.1%)   -0.3% (  -7% -    7%)
        HighSloppyPhrase       15.11      (2.6%)       15.08      (2.6%)   -0.2% (  -5% -    5%)
                 Respell       43.34      (2.0%)       43.28      (2.3%)   -0.1% (  -4% -    4%)
            OrHighNotLow      666.79      (3.4%)      665.98      (4.9%)   -0.1% (  -8% -    8%)
            HighSpanNear        8.21      (1.8%)        8.20      (2.0%)   -0.1% (  -3% -    3%)
    HighIntervalsOrdered       14.46      (1.2%)       14.45      (1.4%)   -0.1% (  -2% -    2%)
              HighPhrase      333.99      (3.3%)      333.74      (3.9%)   -0.1% (  -7% -    7%)
             MedSpanNear       12.08      (1.8%)       12.07      (2.0%)   -0.1% (  -3% -    3%)
               LowPhrase      481.10      (2.5%)      481.14      (3.4%)    0.0% (  -5% -    6%)
         MedSloppyPhrase        6.78      (2.9%)        6.78      (2.9%)    0.0% (  -5% -    6%)
                PKLookup      157.80      (2.5%)      157.83      (2.5%)    0.0% (  -4% -    5%)
             LowSpanNear       21.48      (2.1%)       21.48      (2.3%)    0.0% (  -4% -    4%)
            OrHighNotMed      590.59      (3.9%)      591.21      (3.8%)    0.1% (  -7% -    8%)
   BrowseMonthTaxoFacets        1.06      (1.1%)        1.06      (0.9%)    0.1% (  -1% -    2%)
         LowSloppyPhrase       40.57      (2.1%)       40.63      (2.2%)    0.1% (  -4% -    4%)
                  IntNRQ      124.31      (4.2%)      124.53      (4.9%)    0.2% (  -8% -    9%)
    BrowseDateTaxoFacets        1.00      (1.0%)        1.00      (0.7%)    0.2% (  -1% -    1%)
BrowseDayOfYearTaxoFacets        0.99      (0.9%)        1.00      (0.7%)    0.2% (  -1% -    1%)
   HighTermDayOfYearSort       18.57      (6.2%)       18.62      (6.0%)    0.3% ( -11% -   13%)
   BrowseMonthSSDVFacets        4.38      (1.0%)        4.40      (0.9%)    0.4% (  -1% -    2%)
BrowseDayOfYearSSDVFacets        3.92      (0.7%)        3.94      (0.7%)    0.5% (   0% -    1%)
                Wildcard       52.17      (4.0%)       52.47      (5.0%)    0.6% (  -8% -    9%)
                  Fuzzy2       57.57      (9.5%)       58.32      (9.3%)    1.3% ( -16% -   22%)
       HighTermMonthSort       40.51     (14.2%)       41.47     (13.9%)    2.4% ( -22% -   35%)

 
{quote}There's an option for lucene-util to format the output for JIRA
{quote}
Last time I used this option Jira interpreted some tags and the resulting 
display was not better than this basic one.
{quote}Looking at the results you posted, the optimization seems fairly 
invisible
{quote}
Yes. The change optimizes only the construction of the CompiledAutomaton, so 
this is a tiny part of the fuzzy query execution.
{quote}that's 4.7% of "noise"
{quote}
Yes, there is noise. I tried baseline vs baseline and got the same noise. Maybe 
with wikimediumall this time there is less noise.

> Improve Automaton.step() with binary search and introduce Automaton.next()
> 

[jira] [Comment Edited] (LUCENE-9125) Improve Automaton.step() with binary search and introduce Automaton.next()

2020-01-17 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017089#comment-17017089
 ] 

Bruno Roustant edited comment on LUCENE-9125 at 1/17/20 12:44 PM:
--

Here it is: (wikimedium10K)

                    Task   QPS trunk      StdDev   QPS patch      StdDev            Pct diff
    HighIntervalsOrdered      463.57     (13.2%)      443.74     (19.6%)   -4.3% ( -32% -   32%)
                 Respell      382.45     (14.7%)      374.88     (21.3%)   -2.0% ( -33% -   39%)
               OrHighLow     1746.37      (6.8%)     1737.44      (7.0%)   -0.5% ( -13% -   14%)
              AndHighLow     4208.34      (6.1%)     4186.85      (5.8%)   -0.5% ( -11% -   12%)
                HighTerm     5697.99      (7.5%)     5673.66      (5.1%)   -0.4% ( -12% -   13%)
   BrowseMonthTaxoFacets     4679.40      (3.7%)     4664.60      (2.6%)   -0.3% (  -6% -    6%)
                 Prefix3      442.09     (17.3%)      441.77     (16.6%)   -0.1% ( -28% -   40%)
    BrowseDateTaxoFacets     4104.50      (3.4%)     4102.05      (2.8%)   -0.1% (  -6% -    6%)
               OrHighMed      681.54     (11.8%)      681.70     (10.6%)    0.0% ( -20% -   25%)
             AndHighHigh      978.85      (8.3%)      979.47      (9.9%)    0.1% ( -16% -   19%)
BrowseDayOfYearTaxoFacets     3615.56      (2.8%)     3620.94      (2.4%)    0.1% (  -4% -    5%)
                 MedTerm     5964.33      (5.7%)     5980.59      (5.8%)    0.3% ( -10% -   12%)
                 LowTerm     6555.56      (4.8%)     6576.49      (5.3%)    0.3% (  -9% -   10%)
                  Fuzzy2       73.24     (16.4%)       73.55     (16.1%)    0.4% ( -27% -   39%)
                  Fuzzy1      887.86      (5.3%)      892.14      (2.7%)    0.5% (  -7% -    8%)
              HighPhrase      901.57      (5.7%)      905.94      (6.6%)    0.5% ( -11% -   13%)
              OrHighHigh      741.70     (11.5%)      745.44      (8.4%)    0.5% ( -17% -   23%)
   BrowseMonthSSDVFacets     3462.54      (4.2%)     3480.43      (3.0%)    0.5% (  -6% -    8%)
        HighSloppyPhrase      617.51      (6.9%)      620.74      (7.8%)    0.5% ( -13% -   16%)
                PKLookup      275.55      (5.2%)      277.01      (5.0%)    0.5% (  -9% -   11%)
         MedSloppyPhrase     1843.18      (4.7%)     1853.23      (3.8%)    0.5% (  -7% -    9%)
         LowSloppyPhrase     2085.07      (4.3%)     2098.25      (3.9%)    0.6% (  -7% -    9%)
BrowseDayOfYearSSDVFacets     2985.60      (2.5%)     3009.10      (2.6%)    0.8% (  -4% -    6%)
              AndHighMed     1712.96      (5.8%)     1729.47      (4.5%)    1.0% (  -8% -   12%)
             LowSpanNear     2006.25      (6.2%)     2029.83      (6.0%)    1.2% ( -10% -   14%)
             MedSpanNear      814.10     (12.3%)      823.97     (10.1%)    1.2% ( -18% -   26%)
            HighSpanNear      593.47     (10.3%)      600.77     (10.6%)    1.2% ( -17% -   24%)
   HighTermDayOfYearSort     1035.41      (7.8%)     1050.76      (6.5%)    1.5% ( -11% -   17%)
                Wildcard      772.44     (10.7%)      791.42     (12.7%)    2.5% ( -18% -   28%)
               MedPhrase      806.70      (8.7%)      827.27      (8.1%)    2.5% ( -13% -   21%)
               LowPhrase      805.91      (7.9%)      831.26      (5.3%)    3.1% (  -9% -   17%)
                  IntNRQ     1898.15      (8.1%)     1967.24      (9.8%)    3.6% ( -13% -   23%)
       HighTermMonthSort     3150.77     (12.1%)     3300.42     (13.5%)    4.7% ( -18% -   34%)


was (Author: broustant):
Here it is:

                    Task   QPS trunk      StdDev   QPS patch      StdDev            Pct diff
    HighIntervalsOrdered      463.57     (13.2%)      443.74     (19.6%)   -4.3% ( -32% -   32%)
                 Respell      382.45     (14.7%)      374.88     (21.3%)   -2.0% ( -33% -   39%)
               OrHighLow     1746.37      (6.8%)     1737.44      (7.0%)   -0.5% ( -13% -   14%)
              AndHighLow     4208.34      (6.1%)     4186.85      (5.8%)   -0.5% ( -11% -   12%)
                HighTerm     5697.99      (7.5%)     5673.66      (5.1%)   -0.4% ( -12% -   13%)
   BrowseMonthTaxoFacets     4679.40      (3.7%)     4664.60      (2.6%)   -0.3% (  -6% -    6%)
                 Prefix3      442.09     (17.3%)      441.77     (16.6%)   -0.1% ( -28% -   40%)
    BrowseDateTaxoFacets     4104.50      (3.4%)     4102.05      (2.8%)   -0.1% (  -6% -    6%)
               OrHighMed      681.54     (11.8%)      681.70     (10.6%)    0.0% ( -20% -   25%)
             AndHighHigh      978.85      (8.3%)      979.47      (9.9%)    0.1% ( -16% -   19%)
BrowseDayOfYearTaxoFacets     3615.56      (2.8%)     3620.94      (2.4%)    0.1% (  -4% -    5%)
                 MedTerm     5964.33      (5.7%)     5980.59      (5.8%)    

[jira] [Created] (SOLR-14193) Update tutorial.adoc (line no:664) so that the command executes in a Windows environment

2020-01-17 Thread balaji sundaram (Jira)
balaji sundaram created SOLR-14193:
--

 Summary: Update tutorial.adoc (line no:664) so that the command 
executes in a Windows environment
 Key: SOLR-14193
 URL: https://issues.apache.org/jira/browse/SOLR-14193
 Project: Solr
  Issue Type: Bug
  Components: documentation
Affects Versions: 8.4
Reporter: balaji sundaram


 

When executing the following command in Windows 10: "java -jar -Dc=films 
-Dparams=f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=| 
-Dauto example\exampledocs\post.jar example\films\*.csv", it throws the error "& 
was unexpected at this time."

Fix: the command should escape the "&" and "|" symbols.
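
For illustration, one way to avoid the error is to quote the -Dparams argument 
so that cmd.exe does not interpret "&" and "|" (a sketch, assuming the standard 
films-tutorial parameters):

{code}
java -jar -Dc=films "-Dparams=f.genre.split=true&f.directed_by.split=true&f.genre.separator=|&f.directed_by.separator=|" -Dauto example\exampledocs\post.jar example\films\*.csv
{code}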



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9116) Simplify postings API by removing long[] metadata

2020-01-17 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9116.
--
Fix Version/s: 8.5
   Resolution: Fixed

> Simplify postings API by removing long[] metadata
> -
>
> Key: LUCENE-9116
> URL: https://issues.apache.org/jira/browse/LUCENE-9116
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.5
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The postings API allows storing metadata about a term either in a long[] or 
> in a byte[]. This is unnecessary, as all the information could be encoded in 
> the byte[], which is what most codecs do in practice.
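
For illustration, a sketch of what "encode everything in the byte[]" looks like 
(not the actual Lucene code; TermMeta and its fields are made up):

{code:java}
// hypothetical per-term metadata holder; real codecs use their own state classes
class TermMeta { long docStartFP; long posStartFP; int docFreq; }

// values that used to occupy long[] slots become plain vlongs/vints
// written straight into the byte stream behind the DataOutput
void encodeTerm(DataOutput out, TermMeta meta) throws IOException {
  out.writeVLong(meta.docStartFP);
  out.writeVLong(meta.posStartFP);
  out.writeVInt(meta.docFreq);
}
{code}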



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9116) Simplify postings API by removing long[] metadata

2020-01-17 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017968#comment-17017968
 ] 

ASF subversion and git services commented on LUCENE-9116:
-

Commit fb3ca8d000d6e5203a57625942b754f1d5757fac in lucene-solr's branch 
refs/heads/master from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fb3ca8d ]

LUCENE-9116: Remove long[] from `PostingsWriterBase#encodeTerm`. (#1149) (#1158)

All the metadata can be directly encoded in the `DataOutput`.

> Simplify postings API by removing long[] metadata
> -
>
> Key: LUCENE-9116
> URL: https://issues.apache.org/jira/browse/LUCENE-9116
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The postings API allows storing metadata about a term either in a long[] or 
> in a byte[]. This is unnecessary, as all the information could be encoded in 
> the byte[], which is what most codecs do in practice.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz merged pull request #1158: LUCENE-9116: Remove long[] from `PostingsWriterBase#encodeTerm`.

2020-01-17 Thread GitBox
jpountz merged pull request #1158: LUCENE-9116: Remove long[] from 
`PostingsWriterBase#encodeTerm`.
URL: https://github.com/apache/lucene-solr/pull/1158
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9130) Failed to match when create PhraseQuery with terms analyzed from long query text

2020-01-17 Thread Gus Heck (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017956#comment-17017956
 ] 

Gus Heck commented on LUCENE-9130:
--

This is an issue tracker not a support portal. It's for when you are *certain* 
of a specific behavior in contravention of published documentation, or clear 
errors (like unexpected stack traces). When you are *confused* or don't know 
something you will get better responses using the mailing list.

> Failed to match when create PhraseQuery with terms analyzed from long query 
> text
> 
>
> Key: LUCENE-9130
> URL: https://issues.apache.org/jira/browse/LUCENE-9130
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Chen Zhixiang
>Priority: Major
> Attachments: LongTextFieldSearchTest.java
>
>
> When I use a long text (which is equal to the doc's StringField at indexing 
> time) to build a PhraseQuery, I cannot match the document. But a BooleanQuery 
> with MUST/AND mode succeeds.
>  
> The long query text is an address string: 
> "申长路988弄虹桥万科中心地下停车场LG2层2179-2184车位(锡虹路入,LG1层开到底下LG2)"
> test case is attached.
> logs:
>  
> 15:46:11.940 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:11.956 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:11.962 [main] INFO test.LongTextFieldSearchTest - query: +(+address:申 
> +address:长 +address:路 +address:988 +address:弄 +address:虹桥 +address:万 
> +address:科 +address:中 +address:心 +address:地下 +address:停车场 +address:lg 
> +address:2 +address:层 +address:2179 +address:2184 +address:车位 +address:锡 
> +address:虹 +address:路 +address:入 +address:lg +address:1 +address:层 +address:开 
> +address:到 +address:底下 +address:lg +address:2)
> 15:46:11.988 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=1
> 15:46:12.181 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:12.185 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:12.188 [main] INFO test.LongTextFieldSearchTest - query: +address:"申 长 
> 路 988 弄 虹桥 万 科 中 心 地下 停车场 lg 2 层 2179 2184 车位 锡 虹 路 入 lg 1 层 开 到 底下 lg 2"~2
> 15:46:12.210 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=0
> 15:46:12.214 [main] INFO test.LongTextFieldSearchTest - no matching phrase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] balaji-s closed pull request #1180: Update solr-tutorial.adoc

2020-01-17 Thread GitBox
balaji-s closed pull request #1180: Update solr-tutorial.adoc
URL: https://github.com/apache/lucene-solr/pull/1180
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] balaji-s opened a new pull request #1180: Update solr-tutorial.adoc

2020-01-17 Thread GitBox
balaji-s opened a new pull request #1180: Update solr-tutorial.adoc
URL: https://github.com/apache/lucene-solr/pull/1180
 
 
   When executing the tutorial on the Windows 10 platform, line 664 throws the error 
"& was unexpected at this time.", so this adds escape characters for "&" and "|".
   
   
   
   
   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9130) Failed to match when create PhraseQuery with terms analyzed from long query text

2020-01-17 Thread Chen Zhixiang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chen Zhixiang resolved LUCENE-9130.
---
Resolution: Not A Bug

> Failed to match when create PhraseQuery with terms analyzed from long query 
> text
> 
>
> Key: LUCENE-9130
> URL: https://issues.apache.org/jira/browse/LUCENE-9130
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Chen Zhixiang
>Priority: Major
> Attachments: LongTextFieldSearchTest.java
>
>
> When I use a long text (which is equal to the doc's StringField at indexing 
> time) to build a PhraseQuery, I cannot match the document. But a BooleanQuery 
> with MUST/AND mode succeeds.
>  
> The long query text is an address string: 
> "申长路988弄虹桥万科中心地下停车场LG2层2179-2184车位(锡虹路入,LG1层开到底下LG2)"
> test case is attached.
> logs:
>  
> 15:46:11.940 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:11.956 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:11.962 [main] INFO test.LongTextFieldSearchTest - query: +(+address:申 
> +address:长 +address:路 +address:988 +address:弄 +address:虹桥 +address:万 
> +address:科 +address:中 +address:心 +address:地下 +address:停车场 +address:lg 
> +address:2 +address:层 +address:2179 +address:2184 +address:车位 +address:锡 
> +address:虹 +address:路 +address:入 +address:lg +address:1 +address:层 +address:开 
> +address:到 +address:底下 +address:lg +address:2)
> 15:46:11.988 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=1
> 15:46:12.181 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:12.185 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:12.188 [main] INFO test.LongTextFieldSearchTest - query: +address:"申 长 
> 路 988 弄 虹桥 万 科 中 心 地下 停车场 lg 2 层 2179 2184 车位 锡 虹 路 入 lg 1 层 开 到 底下 lg 2"~2
> 15:46:12.210 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=0
> 15:46:12.214 [main] INFO test.LongTextFieldSearchTest - no matching phrase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9130) Failed to match when create PhraseQuery with terms analyzed from long query text

2020-01-17 Thread Chen Zhixiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017917#comment-17017917
 ] 

Chen Zhixiang commented on LUCENE-9130:
---

After I change the API call to add(term, position), and pass 0,2,4,6,7, which is 
the same as analyzed at indexing time (because they are the same text string), it 
matches!

Now I'm confused: what's the relation between a Term's position value and 
PhraseQuery's slop parameter?

> Failed to match when create PhraseQuery with terms analyzed from long query 
> text
> 
>
> Key: LUCENE-9130
> URL: https://issues.apache.org/jira/browse/LUCENE-9130
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Chen Zhixiang
>Priority: Major
> Attachments: LongTextFieldSearchTest.java
>
>
> When I use a long text (which is equal to the doc's StringField at indexing 
> time) to build a PhraseQuery, I cannot match the document. But a BooleanQuery 
> with MUST/AND mode succeeds.
>  
> The long query text is an address string: 
> "申长路988弄虹桥万科中心地下停车场LG2层2179-2184车位(锡虹路入,LG1层开到底下LG2)"
> test case is attached.
> logs:
>  
> 15:46:11.940 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:11.956 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:11.962 [main] INFO test.LongTextFieldSearchTest - query: +(+address:申 
> +address:长 +address:路 +address:988 +address:弄 +address:虹桥 +address:万 
> +address:科 +address:中 +address:心 +address:地下 +address:停车场 +address:lg 
> +address:2 +address:层 +address:2179 +address:2184 +address:车位 +address:锡 
> +address:虹 +address:路 +address:入 +address:lg +address:1 +address:层 +address:开 
> +address:到 +address:底下 +address:lg +address:2)
> 15:46:11.988 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=1
> 15:46:12.181 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:12.185 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:12.188 [main] INFO test.LongTextFieldSearchTest - query: +address:"申 长 
> 路 988 弄 虹桥 万 科 中 心 地下 停车场 lg 2 层 2179 2184 车位 锡 虹 路 入 lg 1 层 开 到 底下 lg 2"~2
> 15:46:12.210 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=0
> 15:46:12.214 [main] INFO test.LongTextFieldSearchTest - no matching phrase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9130) Failed to match when create PhraseQuery with terms analyzed from long query text

2020-01-17 Thread Chen Zhixiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017914#comment-17017914
 ] 

Chen Zhixiang commented on LUCENE-9130:
---

PhraseQuery.Builder:



public Builder add(Term term) {
  return add(term, positions.isEmpty() ? 0 : 1 + positions.get(positions.size() - 1));
}

/**
 * Adds a term to the end of the query phrase.
 * The relative position of the term within the phrase is specified explicitly,
 * but must be greater than or equal to that of the previously added term.
 * A greater position allows phrases with gaps (e.g. in connection with
 * stopwords).
 * If the position is equal, you most likely should be using
 * {@link MultiPhraseQuery} instead, which only requires one term at each
 * position to match; this class requires all of them.
 */
public Builder add(Term term, int position) {
...


I used the previous API, add(term), but there is another API that can specify an 
extra position argument.


Here in this case, maybe I should pass in positions 0, 2, 4, 6, 7, which can be 
obtained from analyzing the raw query text...
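
For illustration, the positions can be taken from the analysis chain instead of 
being hard-coded, so query-time positions always line up with index-time 
positions (a sketch; the field name, analyzer, and slop are whatever the index 
and query need):

{code:java}
// build a PhraseQuery whose positions come from the same analyzer used at indexing
PhraseQuery.Builder builder = new PhraseQuery.Builder();
try (TokenStream ts = analyzer.tokenStream("address", queryText)) {
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  PositionIncrementAttribute posIncrAtt = ts.addAttribute(PositionIncrementAttribute.class);
  ts.reset();
  int position = -1;
  while (ts.incrementToken()) {
    position += posIncrAtt.getPositionIncrement();  // honors gaps, e.g. stopwords
    builder.add(new Term("address", termAtt.toString()), position);
  }
  ts.end();
}
builder.setSlop(2);
Query phrase = builder.build();
{code}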

> Failed to match when create PhraseQuery with terms analyzed from long query 
> text
> 
>
> Key: LUCENE-9130
> URL: https://issues.apache.org/jira/browse/LUCENE-9130
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Chen Zhixiang
>Priority: Major
> Attachments: LongTextFieldSearchTest.java
>
>
> When I use a long text (which is equal to the doc's StringField at indexing 
> time) to build a PhraseQuery, I cannot match the document. But a BooleanQuery 
> with MUST/AND mode succeeds.
>  
> The long query text is an address string: 
> "申长路988弄虹桥万科中心地下停车场LG2层2179-2184车位(锡虹路入,LG1层开到底下LG2)"
> test case is attached.
> logs:
>  
> 15:46:11.940 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:11.956 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:11.962 [main] INFO test.LongTextFieldSearchTest - query: +(+address:申 
> +address:长 +address:路 +address:988 +address:弄 +address:虹桥 +address:万 
> +address:科 +address:中 +address:心 +address:地下 +address:停车场 +address:lg 
> +address:2 +address:层 +address:2179 +address:2184 +address:车位 +address:锡 
> +address:虹 +address:路 +address:入 +address:lg +address:1 +address:层 +address:开 
> +address:到 +address:底下 +address:lg +address:2)
> 15:46:11.988 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=1
> 15:46:12.181 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:12.185 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:12.188 [main] INFO test.LongTextFieldSearchTest - query: +address:"申 长 
> 路 988 弄 虹桥 万 科 中 心 地下 停车场 lg 2 层 2179 2184 车位 锡 虹 路 入 lg 1 层 开 到 底下 lg 2"~2
> 15:46:12.210 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=0
> 15:46:12.214 [main] INFO test.LongTextFieldSearchTest - no matching phrase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9137) Broken link 'Change log' for 8.4.1 on https://lucene.apache.org/core/downloads.html

2020-01-17 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017907#comment-17017907
 ] 

Ishan Chattopadhyaya commented on LUCENE-9137:
--

Seems like sloppy grep search/replace work on my end. 
Instead of %s/8\.4\.0/8.4.1/g, I must've done %s/8.4.0/8.4.1/g, which also 
replaced the underscores with dots (the unescaped dots match any character, 
including '_').

> Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
> ---
>
> Key: LUCENE-9137
> URL: https://issues.apache.org/jira/browse/LUCENE-9137
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
>Reporter: Sebb
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9137) Broken link 'Change log' for 8.4.1 on https://lucene.apache.org/core/downloads.html

2020-01-17 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017902#comment-17017902
 ] 

Ishan Chattopadhyaya commented on LUCENE-9137:
--

Oh, I'm sorry that this happened. Thanks a lot, Adrien. I'll take a look 
tomorrow as to how I missed it.

> Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
> ---
>
> Key: LUCENE-9137
> URL: https://issues.apache.org/jira/browse/LUCENE-9137
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
>Reporter: Sebb
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-9137) Broken link 'Change log' for 8.4.1 on https://lucene.apache.org/core/downloads.html

2020-01-17 Thread Ishan Chattopadhyaya (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017902#comment-17017902
 ] 

Ishan Chattopadhyaya edited comment on LUCENE-9137 at 1/17/20 10:46 AM:


Oh, I'm sorry that this happened. Thanks a lot, Sebb & Adrien. I'll take a look 
tomorrow as to how I missed it.


was (Author: ichattopadhyaya):
Oh, I'm sorry that this happened. Thanks a lot, Adrien. I'll take a look 
tomorrow as to how I missed it.

> Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
> ---
>
> Key: LUCENE-9137
> URL: https://issues.apache.org/jira/browse/LUCENE-9137
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
>Reporter: Sebb
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9130) Failed to match when create PhraseQuery with terms analyzed from long query text

2020-01-17 Thread Chen Zhixiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017898#comment-17017898
 ] 

Chen Zhixiang commented on LUCENE-9130:
---

PhraseQuery.java:

 

public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost) throws IOException {
  return new PhraseWeight(this, field, searcher, scoreMode) {

    private transient TermStates states[];

    @Override
    protected Similarity.SimScorer getStats(IndexSearcher searcher) throws IOException {
      final int[] positions = PhraseQuery.this.getPositions();


Here positions' values are 0,1,2,3,4. Why is it inited so? Is there any 
document for PhraseQuery's sloppy match algorithm?
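If I understand the builder correctly, those values simply come from PhraseQuery assigning consecutive positions 0, 1, 2, ... when terms are added without explicit positions. A minimal sketch (field name and terms are just for illustration):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class PositionsDemo {
  public static void main(String[] args) {
    PhraseQuery.Builder builder = new PhraseQuery.Builder();
    builder.add(new Term("address", "虹桥")); // implicitly position 0
    builder.add(new Term("address", "万"));   // position 1
    builder.add(new Term("address", "科"));   // position 2
    builder.setSlop(2);
    PhraseQuery query = builder.build();
    // prints [0, 1, 2]
    System.out.println(java.util.Arrays.toString(query.getPositions()));
  }
}
{code}

There is also an add(Term, int) overload for supplying explicit positions, e.g. to mirror position gaps from the analysis chain.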

> Failed to match when create PhraseQuery with terms analyzed from long query 
> text
> 
>
> Key: LUCENE-9130
> URL: https://issues.apache.org/jira/browse/LUCENE-9130
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Chen Zhixiang
>Priority: Major
> Attachments: LongTextFieldSearchTest.java
>
>
> When I use a long text (which is equal to the doc's StringField at indexing 
> time) to build a PhraseQuery, I cannot match the document. But a BooleanQuery 
> with MUST/AND mode succeeds.
>  
> The long query text is an address string: 
> "申长路988弄虹桥万科中心地下停车场LG2层2179-2184车位(锡虹路入,LG1层开到底下LG2)"
> The test case is attached.
> logs:
>  
> 15:46:11.940 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:11.956 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:11.962 [main] INFO test.LongTextFieldSearchTest - query: +(+address:申 
> +address:长 +address:路 +address:988 +address:弄 +address:虹桥 +address:万 
> +address:科 +address:中 +address:心 +address:地下 +address:停车场 +address:lg 
> +address:2 +address:层 +address:2179 +address:2184 +address:车位 +address:锡 
> +address:虹 +address:路 +address:入 +address:lg +address:1 +address:层 +address:开 
> +address:到 +address:底下 +address:lg +address:2)
> 15:46:11.988 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=1
> 15:46:12.181 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:12.185 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:12.188 [main] INFO test.LongTextFieldSearchTest - query: +address:"申 长 
> 路 988 弄 虹桥 万 科 中 心 地下 停车场 lg 2 层 2179 2184 车位 锡 虹 路 入 lg 1 层 开 到 底下 lg 2"~2
> 15:46:12.210 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=0
> 15:46:12.214 [main] INFO test.LongTextFieldSearchTest - no matching phrase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9130) Failed to match when create PhraseQuery with terms analyzed from long query text

2020-01-17 Thread Chen Zhixiang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017895#comment-17017895
 ] 

Chen Zhixiang commented on LUCENE-9130:
---

Lucene SloppyPhraseMatcher.java:

{code:java}
public boolean nextMatch() throws IOException {
  if (!positioned) {
    return false;
  }
  PhrasePositions pp = pq.pop();
  assert pp != null; // if the pq is not full, then positioned == false
  captureLead(pp);
  matchLength = end - pp.position;
  int next = pq.top().position;
  while (advancePP(pp)) {
    if (hasRpts && !advanceRpts(pp)) {
      break; // pps exhausted
    }
    if (pp.position > next) { // done minimizing current match-length
      pq.add(pp);
      if (matchLength <= slop) {
        return true;
      }
      pp = pq.pop();
      next = pq.top().position;
      assert pp != null; // if the pq is not full, then positioned == false
      matchLength = end - pp.position;
    } else {
      int matchLength2 = end - pp.position;
      if (matchLength2 < matchLength) {
        matchLength = matchLength2;
      }
    }
    captureLead(pp);
  }
  positioned = false;
  return matchLength <= slop;
}
{code}

The condition in while (advancePP(pp)) is false, so the loop body is skipped entirely; 
matchLength=3 and slop=2, so the method returns false.

I believe there is a bug here, but I cannot figure out why.
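For what it's worth, here is a minimal, self-contained illustration of the slop semantics I'd expect (my own test, not the attached test case): slop is the total number of position moves allowed to line the phrase terms up against the document.

{code:java}
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SlopDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("body", "w x y z", Store.NO)); // positions 0..3
      w.addDocument(doc);
    }
    try (IndexReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // "w y": y is indexed at position 2, but the phrase expects it at 1.
      System.out.println(searcher.count(new PhraseQuery(0, "body", "w", "y"))); // 0
      // With slop 1, y may move one position, so the phrase matches.
      System.out.println(searcher.count(new PhraseQuery(1, "body", "w", "y"))); // 1
    }
  }
}
{code}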

> Failed to match when create PhraseQuery with terms analyzed from long query 
> text
> 
>
> Key: LUCENE-9130
> URL: https://issues.apache.org/jira/browse/LUCENE-9130
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 8.4
>Reporter: Chen Zhixiang
>Priority: Major
> Attachments: LongTextFieldSearchTest.java
>
>
> When I use a long text (which is equal to the doc's StringField at indexing 
> time) to build a PhraseQuery, I cannot match the document. But a BooleanQuery 
> with MUST/AND mode succeeds.
>  
> The long query text is an address string: 
> "申长路988弄虹桥万科中心地下停车场LG2层2179-2184车位(锡虹路入,LG1层开到底下LG2)"
> The test case is attached.
> logs:
>  
> 15:46:11.940 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:11.956 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:11.962 [main] INFO test.LongTextFieldSearchTest - query: +(+address:申 
> +address:长 +address:路 +address:988 +address:弄 +address:虹桥 +address:万 
> +address:科 +address:中 +address:心 +address:地下 +address:停车场 +address:lg 
> +address:2 +address:层 +address:2179 +address:2184 +address:车位 +address:锡 
> +address:虹 +address:路 +address:入 +address:lg +address:1 +address:层 +address:开 
> +address:到 +address:底下 +address:lg +address:2)
> 15:46:11.988 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=1
> 15:46:12.181 [main] INFO test.LongTextFieldSearchTest - indexed terms: 开, 层, 
> 心, 弄, 万, 停车场, 地下, 科, 虹桥, 底下, 锡, 入, 2184, 中, 路, 到, 1, 2, 申, 2179, 车位, 988, 虹, 
> lg, 长
> 15:46:12.185 [main] INFO test.LongTextFieldSearchTest - terms: 申, 长, 路, 988, 
> 弄, 虹桥, 万, 科, 中, 心, 地下, 停车场, lg, 2, 层, 2179, 2184, 车位, 锡, 虹, 路, 入, lg, 1, 层, 
> 开, 到, 底下, lg, 2
> 15:46:12.188 [main] INFO test.LongTextFieldSearchTest - query: +address:"申 长 
> 路 988 弄 虹桥 万 科 中 心 地下 停车场 lg 2 层 2179 2184 车位 锡 虹 路 入 lg 1 层 开 到 底下 lg 2"~2
> 15:46:12.210 [main] INFO test.LongTextFieldSearchTest - 
> results.totalHits.value=0
> 15:46:12.214 [main] INFO test.LongTextFieldSearchTest - no matching phrase



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9139) TestXYMultiPolygonShapeQueries test failures

2020-01-17 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017896#comment-17017896
 ] 

Ignacio Vera commented on LUCENE-9139:
--

You are totally right, my proposal does not solve the issue.

What I have done to check that the lines do not intersect is to change the 
intersection logic to use BigDecimals instead of doubles. In that case the lines do 
not intersect (and the tests don't fail for those seeds).

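Sketched roughly, that check looks like this (a simplified orientation determinant with the coordinates from the description; GeoUtils.orient itself includes additional error-bound logic, so this is only an approximation of it):

{code:java}
import java.math.BigDecimal;

public class OrientDemo {
  // Sign of the cross product (b-a) x (c-a), computed in doubles.
  static double orientDouble(double ax, double ay, double bx, double by,
                             double cx, double cy) {
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
  }

  static int orientExact(double ax, double ay, double bx, double by,
                         double cx, double cy) {
    // new BigDecimal(double) is exact, so every operation below is exact.
    BigDecimal lhs = new BigDecimal(bx).subtract(new BigDecimal(ax))
        .multiply(new BigDecimal(cy).subtract(new BigDecimal(ay)));
    BigDecimal rhs = new BigDecimal(by).subtract(new BigDecimal(ay))
        .multiply(new BigDecimal(cx).subtract(new BigDecimal(ax)));
    return lhs.subtract(rhs).signum();
  }

  public static void main(String[] args) {
    double ax = 3.31439550712E38, ay = -1.4151510014141656E37;
    double bx = 3.4028234663852886E38, by = 9.641030236797581E20;
    double cx = 3.4028234663852886E38, cy = -0.0;
    System.out.println(orientDouble(ax, ay, bx, by, cx, cy)); // rounded result
    System.out.println(orientExact(ax, ay, bx, by, cx, cy));  // exact sign
  }
}
{code}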
 

> TestXYMultiPolygonShapeQueries test failures
> 
>
> Key: LUCENE-9139
> URL: https://issues.apache.org/jira/browse/LUCENE-9139
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Ignacio Vera
>Priority: Major
>
> We recently had two failures on CI from the test method 
> TestXYMultiPolygonShapeQueries. The reproduction lines are:
>  
> {code:java}
> ant test  -Dtestcase=TestXYMultiPolygonShapeQueries 
> -Dtests.method=testRandomMedium -Dtests.seed=F1E142C2FBB612AF 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=el -Dtests.timezone=EST5EDT -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII{code}
> {code:java}
> ant test  -Dtestcase=TestXYMultiPolygonShapeQueries 
> -Dtests.method=testRandomMedium -Dtests.seed=363603A0428EC788 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sv-SE -Dtests.timezone=America/Yakutat -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8{code}
>  
> I dug into the failures and they seem to be due to numerical errors in the 
> GeoUtils.orient method. The method is detecting intersections of two very 
> long lines when it shouldn't. For example:
> Line 1: 
> {code:java}
> double ax = 3.31439550712E38;
> double ay = -1.4151510014141656E37;
> double bx = 3.4028234663852886E38;
> double by = 9.641030236797581E20;{code}
> Line 2:
> {code:java}
> double cx = 3.4028234663852886E38;
> double cy = -0.0;
> double dx = 3.4028234663852886E38;
> double dy = -2.7386422951137726E38;{code}
> My proposal to prevent those numerical errors is to modify the shape 
> generator to avoid creating shapes that expand across more than half of the 
> float space.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9148) Move the BKD index to its own file.

2020-01-17 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-9148:


 Summary: Move the BKD index to its own file.
 Key: LUCENE-9148
 URL: https://issues.apache.org/jira/browse/LUCENE-9148
 Project: Lucene - Core
  Issue Type: Task
Reporter: Adrien Grand


Lucene60PointsWriter stores both inner nodes and leaf nodes in the same file, 
interleaved. For instance, if you have two fields, you would have 
{{}}. It's not ideal 
since leaves and inner nodes have quite different access patterns. Should we 
split this into two files? In the case where the BKD index is off-heap, this 
would also help force it into RAM with {{MMapDirectory#setPreload}}.

Note that Lucene60PointsFormat already has a file that it calls "index", but 
it's really only about mapping fields to file pointers in the other file, not 
what I'm discussing here. That said, we could store the BKD indices in this 
existing file if we want to avoid creating a new one.
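For reference, a sketch of the preload idea (the _0.kdi file name is hypothetical; setPreload is a directory-wide flag, so in practice one might dedicate an MMapDirectory instance to the small index files):

{code:java}
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.MMapDirectory;

public class PreloadSketch {
  public static void main(String[] args) throws IOException {
    MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
    dir.setPreload(true); // pages are touched up-front when openInput is called
    try (IndexInput in = dir.openInput("_0.kdi", IOContext.READ)) { // hypothetical file
      byte first = in.readByte(); // BKD inner nodes now read from RAM-resident pages
    }
  }
}
{code}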



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8615) Can LatLonShape's tessellator create more search-efficient triangles?

2020-01-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017875#comment-17017875
 ] 

Adrien Grand commented on LUCENE-8615:
--

This sounds like an interesting idea!

> Can LatLonShape's tessellator create more search-efficient triangles?
> -
>
> Key: LUCENE-8615
> URL: https://issues.apache.org/jira/browse/LUCENE-8615
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: 2-tessellations.png, re-tessellate-triangle.png, 
> screenshot-1.png
>
>
> The triangular mesh produced by LatLonShape's Tessellator creates reasonable 
> numbers of triangles, which is helpful for indexing speed. However, I'm 
> wondering whether there are conditions under which it might be beneficial to 
> run tessellation slightly differently in order to create triangles that are 
> more search-friendly. Given that we only index the minimum bounding rectangle 
> for each triangle, we always check for intersection between the query and the 
> triangle if the query intersects with the MBR of the triangle. So the smaller 
> the area of the triangle compared to its MBR, the higher the likelihood of 
> false positives when querying.
> For instance, see the following shape: there are two ways it can be 
> tessellated into two triangles. LatLonShape's Tessellator is going to return 
> either of them depending on which point is listed first in the polygon. Yet 
> the first one is more efficient than the second one: with the second one, 
> both triangles have roughly the same MBR (which is also the MBR of the 
> polygon), so both triangles will need to be checked all the time whenever the 
> query intersects with this shared MBR. On the other hand, with the first way, 
> both MBRs are smaller and don't overlap, which makes it more likely that only 
> one triangle needs to be checked at query time.
>  !2-tessellations.png! 
> Another example is the following polygon. It can be tessellated into a single 
> triangle. Yet at times it might be a better idea to create more triangles so 
> that the overall area of the MBRs is smaller and queries are less likely to 
> run into false positives.
>  !re-tessellate-triangle.png! 
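One way to quantify the intuition in the description, that a small triangle-to-MBR area ratio means more false positives, is a simple efficiency heuristic; a sketch under that reading (my own code, not LatLonShape's):

{code:java}
public class MbrWaste {
  // Returns area(triangle) / area(MBR); at most 0.5 for any triangle,
  // and the closer to 0, the more MBR-only hits are false positives.
  static double efficiency(double ax, double ay, double bx, double by,
                           double cx, double cy) {
    double triangle = Math.abs((bx - ax) * (cy - ay) - (by - ay) * (cx - ax)) / 2;
    double width  = Math.max(ax, Math.max(bx, cx)) - Math.min(ax, Math.min(bx, cx));
    double height = Math.max(ay, Math.max(by, cy)) - Math.min(ay, Math.min(by, cy));
    double mbr = width * height;
    return mbr == 0 ? 0 : triangle / mbr;
  }
}
{code}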



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9137) Broken link 'Change log' for 8.4.1 on https://lucene.apache.org/core/downloads.html

2020-01-17 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9137.
--
Resolution: Fixed

I just pushed a fix; it might take a couple of minutes to be applied. Thanks for 
reporting, [~sebb]!

cc [~ichattopadhyaya]

> Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
> ---
>
> Key: LUCENE-9137
> URL: https://issues.apache.org/jira/browse/LUCENE-9137
> Project: Lucene - Core
>  Issue Type: Bug
> Environment: Broken link 'Change log' for 8.4.1 on 
> https://lucene.apache.org/core/downloads.html
>Reporter: Sebb
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9138) Behaviour of concurrent calls to IndexInput#clone is unclear

2020-01-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017865#comment-17017865
 ] 

Adrien Grand commented on LUCENE-9138:
--

Would you like to open a pull request that clarifies this documentation?

> Behaviour of concurrent calls to IndexInput#clone is unclear
> 
>
> Key: LUCENE-9138
> URL: https://issues.apache.org/jira/browse/LUCENE-9138
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Affects Versions: 8.4
>Reporter: David Turner
>Priority: Minor
>
> I think this is a documentation issue, rather than anything actually wrong, 
> but I need expert guidance to propose a fix.
> The Javadocs for {{IndexInput#clone}} warn that it is not thread safe:
> * This method is NOT thread safe, so if the current \{@code IndexInput}
>  * is being used by one thread while \{@code clone} is called by another,
>  * disaster could strike.
>  */
> @Override
> public IndexInput clone() {
>  
> However, there are places where {{clone()}} may be called concurrently. For 
> instance I believe {{SegmentReader#getFieldsReader}} clones an {{IndexInput}} 
> and requires no extra synchronization. I think this comment is supposed to 
> mean that you should not {{clone()}} an {{IndexInput}} while you're _reading 
> or seeking from it_ concurrently, but the precise guarantees aren't totally 
> clear.
>  
>  Furthermore, there's no mention of the thread-safety of {{slice()}}, and there 
> seem to be similar concurrent usages of it in e.g. 
> {{Lucene80DocValuesProducer}}. Does this have the same guarantees as 
> {{clone()}}?
>   
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9138) Behaviour of concurrent calls to IndexInput#clone is unclear

2020-01-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017864#comment-17017864
 ] 

Adrien Grand commented on LUCENE-9138:
--

Your analysis sounds right to me. It looks like we don't have similar warnings on 
slice() because, unlike clone(), it doesn't depend on the current position of 
the input.
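A minimal sketch of the pattern I believe the docs intend (directory and file name are hypothetical): the original input is never read or seeked while other threads hold it; each thread clones first, then seeks and reads only its own clone.

{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class CloneSketch {
  static void readConcurrently(Directory dir) throws IOException {
    IndexInput main = dir.openInput("some_file", IOContext.READ); // hypothetical
    Runnable task = () -> {
      IndexInput in = main.clone(); // clone carries its own file position
      try {
        in.seek(0);
        in.readByte();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    };
    new Thread(task).start();
    new Thread(task).start();
  }
}
{code}

Note that the javadocs also say clones are never closed by Lucene; only the original input is closed, which invalidates its clones.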

> Behaviour of concurrent calls to IndexInput#clone is unclear
> 
>
> Key: LUCENE-9138
> URL: https://issues.apache.org/jira/browse/LUCENE-9138
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/store
>Affects Versions: 8.4
>Reporter: David Turner
>Priority: Minor
>
> I think this is a documentation issue, rather than anything actually wrong, 
> but I need expert guidance to propose a fix.
> The Javadocs for {{IndexInput#clone}} warn that it is not thread safe:
> * This method is NOT thread safe, so if the current \{@code IndexInput}
>  * is being used by one thread while \{@code clone} is called by another,
>  * disaster could strike.
>  */
> @Override
> public IndexInput clone() {
>  
> However, there are places where {{clone()}} may be called concurrently. For 
> instance I believe {{SegmentReader#getFieldsReader}} clones an {{IndexInput}} 
> and requires no extra synchronization. I think this comment is supposed to 
> mean that you should not {{clone()}} an {{IndexInput}} while you're _reading 
> or seeking from it_ concurrently, but the precise guarantees aren't totally 
> clear.
>  
>  Furthermore, there's no mention of the thread-safety of {{slice()}}, and there 
> seem to be similar concurrent usages of it in e.g. 
> {{Lucene80DocValuesProducer}}. Does this have the same guarantees as 
> {{clone()}}?
>   
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9139) TestXYMultiPolygonShapeQueries test failures

2020-01-17 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017853#comment-17017853
 ] 

Adrien Grand commented on LUCENE-9139:
--

The subtraction by-ay is indeed already inaccurate in spite of the promotion 
from floats to doubles, since the exponents of by and ay differ by more than the 
number of mantissa bits of a double. And things might get worse with the 
multiplications. I wonder whether your proposal would actually address the 
problem though: the issue is not so much the absolute values of the coordinates 
as their relative values. For instance, I believe you could have the same issue 
with coordinates that are close but not equal to zero?

I haven't looked at the test, but how does it know that these lines don't 
intersect? Is it using better logic?
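The loss is easy to see in isolation; with the values from the description, by is smaller than half an ulp of ay, so it vanishes entirely from the subtraction:

{code:java}
double ay = -1.4151510014141656E37;
double by = 9.641030236797581E20;
// ulp(ay) is about 2.4E21, more than twice by, so by is lost entirely:
System.out.println(Math.ulp(ay));      // ~2.36E21
System.out.println((by - ay) == -ay);  // true: by is absorbed by the rounding
{code}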

> TestXYMultiPolygonShapeQueries test failures
> 
>
> Key: LUCENE-9139
> URL: https://issues.apache.org/jira/browse/LUCENE-9139
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Ignacio Vera
>Priority: Major
>
> We recently had two failures on CI from the test method 
> TestXYMultiPolygonShapeQueries. The reproduction lines are:
>  
> {code:java}
> ant test  -Dtestcase=TestXYMultiPolygonShapeQueries 
> -Dtests.method=testRandomMedium -Dtests.seed=F1E142C2FBB612AF 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=el -Dtests.timezone=EST5EDT -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII{code}
> {code:java}
> ant test  -Dtestcase=TestXYMultiPolygonShapeQueries 
> -Dtests.method=testRandomMedium -Dtests.seed=363603A0428EC788 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.badapples=true 
> -Dtests.locale=sv-SE -Dtests.timezone=America/Yakutat -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8{code}
>  
> I dug into the failures and they seem to be due to numerical errors in the 
> GeoUtils.orient method. The method is detecting intersections of two very 
> long lines when it shouldn't. For example:
> Line 1: 
> {code:java}
> double ax = 3.31439550712E38;
> double ay = -1.4151510014141656E37;
> double bx = 3.4028234663852886E38;
> double by = 9.641030236797581E20;{code}
> Line 2:
> {code:java}
> double cx = 3.4028234663852886E38;
> double cy = -0.0;
> double dx = 3.4028234663852886E38;
> double dy = -2.7386422951137726E38;{code}
> My proposal to prevent those numerical errors is to modify the shape 
> generator to avoid creating shapes that expand across more than half of the 
> float space.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org


