date:20210212



dweiss commented on pull request #2342:
URL: https://github.com/apache/lucene-solr/pull/2342#issuecomment-778577814


   If such a situation arises, folks can write a custom proxy listener that 
can switch, add and/or remove delegates. This keeps the freedom to do 
everything you described and at the same time simplifies the implementation 
(and concurrency issues) on the IW side?
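
   A rough sketch of the proxy idea, for illustration only. This is not code 
from the PR: `WriterEventListener` is a hypothetical stand-in for the listener 
interface under review, and `DelegatingEventListener` is an invented name. IW 
would hold one fixed reference to the proxy, while the application changes the 
delegates behind it.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical stand-in for the listener interface discussed in the PR. */
interface WriterEventListener {
  void onEvent(String event);
}

/** Proxy that the writer sees as a single listener that never changes. */
final class DelegatingEventListener implements WriterEventListener {
  // Copy-on-write keeps iteration safe while delegates are added or removed.
  private final List<WriterEventListener> delegates = new CopyOnWriteArrayList<>();

  void addDelegate(WriterEventListener listener) {
    delegates.add(listener);
  }

  boolean removeDelegate(WriterEventListener listener) {
    return delegates.remove(listener);
  }

  @Override
  public void onEvent(String event) {
    // Forward the event to whichever delegates are registered right now.
    for (WriterEventListener listener : delegates) {
      listener.onEvent(event);
    }
  }
}
```

   With this shape the listener field on the IW side could stay a plain 
field that is read once, since all switching happens inside the proxy.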



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] muse-dev[bot] commented on a change in pull request #2306: SOLR-15121: Move XSLT (tr param) to scripting contrib



muse-dev[bot] commented on a change in pull request #2306:
URL: https://github.com/apache/lucene-solr/pull/2306#discussion_r575631373



##
File path: solr/contrib/scripting/src/java/org/apache/solr/scripting/xslt/XSLTUpdateRequestHandler.java
##
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.scripting.xslt;
+
+import static org.apache.solr.scripting.xslt.XSLTConstants.*;
+
+import java.util.Map;
+import javax.xml.stream.XMLStreamException;
+import javax.xml.stream.XMLStreamReader;
+import javax.xml.transform.Transformer;
+import javax.xml.transform.TransformerException;
+import javax.xml.transform.dom.DOMResult;
+import javax.xml.transform.dom.DOMSource;
+import javax.xml.transform.sax.SAXSource;
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.solr.common.EmptyEntityResolver;
+import org.apache.solr.common.SolrException;
+import org.apache.solr.common.params.SolrParams;
+import org.apache.solr.common.util.ContentStream;
+import org.apache.solr.common.util.ContentStreamBase;
+import org.apache.solr.common.util.NamedList;
+import org.apache.solr.handler.UpdateRequestHandler;
+import org.apache.solr.handler.loader.XMLLoader;
+import org.apache.solr.request.SolrQueryRequest;
+import org.apache.solr.response.SolrQueryResponse;
+import org.apache.solr.update.processor.UpdateRequestProcessor;
+import org.xml.sax.InputSource;
+import org.xml.sax.XMLReader;
+
+/**
+ * Send XML formatted documents to Solr, transforming them from the original XML
+ * format to the Solr XML format using an XSLT stylesheet via the 'tr' parameter.
+ */
+public class XSLTUpdateRequestHandler extends UpdateRequestHandler {
+
+  @Override
+  public void init(@SuppressWarnings({"rawtypes"})NamedList args) {
+super.init(args);
+setAssumeContentType("application/xml");
+
+SolrParams p = null;
+if (args != null) {
+  p = args.toSolrParams();
+}
+final XsltXMLLoader loader = new XsltXMLLoader().init(p);
+loaders = Map.of("application/xml", loader, "text/xml", loader);
+  }
+
+  @VisibleForTesting
+  static class XsltXMLLoader extends XMLLoader {
+
+int xsltCacheLifetimeSeconds;
+
+@Override
+public XsltXMLLoader init(SolrParams args) {
+  super.init(args);
+
+  xsltCacheLifetimeSeconds = XSLT_CACHE_DEFAULT;
+  if (args != null) {
+xsltCacheLifetimeSeconds = args.getInt(XSLT_CACHE_PARAM, XSLT_CACHE_DEFAULT);
+  }
+  return this;
+}
+
+@Override
+public void load(
+SolrQueryRequest req,
+SolrQueryResponse rsp,
+ContentStream stream,
+UpdateRequestProcessor processor)
+throws Exception {
+
+  String tr = req.getParams().get(TR, null);
+  if (tr == null) {
+super.load(req, rsp, stream, processor); // no XSLT; do standard processing
+return;
+  }
+
+  if (req.getCore().getCoreDescriptor().isConfigSetTrusted() == false) {
+throw new SolrException(
+SolrException.ErrorCode.UNAUTHORIZED,
+"The configset for this collection was uploaded without any authentication in place,"
++ " and this operation is not available for collections with untrusted configsets. To use this feature, re-upload the configset"
++ " after enabling authentication and authorization.");
+  }
+
+  final Transformer t = TransformerProvider.getTransformer(req, tr, xsltCacheLifetimeSeconds);
+  final DOMResult result = new DOMResult();
+
+  // first step: read XML and build DOM using Transformer (this is no overhead, as XSL always
+  // produces
+  // an internal result DOM tree, we just access it directly as input for StAX):
+  try (var is = stream.getStream()) {
+final XMLReader xmlr = saxFactory.newSAXParser().getXMLReader();
+xmlr.setErrorHandler(xmllog);
+xmlr.setEntityResolver(EmptyEntityResolver.SAX_INSTANCE);
+final InputSource isrc = new InputSource(is);
+isrc.setEncoding(ContentStreamBase.getCharsetFromContentType(stream.getContentType()));
+final SAXSource source = new SAXSource(xmlr, isrc);
+t.transform(source,

[GitHub] [lucene-solr] rmuir commented on a change in pull request #2362: LUCENE-9767: infrastructure for icu regeneration in place.

2021-02-12 Thread ASF subversion and git services (Jira)



rmuir commented on a change in pull request #2362:
URL: https://github.com/apache/lucene-solr/pull/2362#discussion_r575622281



##
File path: gradle/generation/icu.gradle
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/* NOTE: when regenerating, you need icu4c binaries in PATH.
+ * The icu4c version must match exactly the icu4j version in version.props:
+ * The one on your system is probably different.
+ *
+ * You can download the C sources to a standalone folder with these steps:
+ * (example for icu 62.2)
+ *
+ * # download version matching icu4j version in version.props
+ * curl -fLO https://github.com/unicode-org/icu/releases/download/release-62-2/icu4c-62_2-src.tgz
+ * # extract
+ * tar -zxvf icu4c-62_2-src.tgz
+ * # compile
+ * (cd icu/source && ./configure --prefix=$(pwd) --enable-rpath && make -j4)
+ * # test binaries work
+ * icu/source/bin/derb -V
+ * # put in PATH
+ * export PATH=$(pwd)/icu/source/bin:$PATH
+ */
+configure(project(":lucene:analysis:icu")) {
+  def utr30DataDir = file("src/data/utr30")
+
+  task genUtr30DataFiles() {
+// May be undefined yet, so use a provider.
+dependsOn { sourceSets.tools.runtimeClasspath }
+
+doFirst {
+  // all these steps must be done sequentially: it's a pipeline resulting in utr30.nrm
+  project.javaexec {
+main = "org.apache.lucene.analysis.icu.GenerateUTR30DataFiles"
+classpath = sourceSets.tools.runtimeClasspath
+
+ignoreExitValue false
+workingDir utr30DataDir
+  }
+
+  def gennorm = 'gennorm2'
+  def icupkg = 'icupkg'
+  project.exec {
+executable gennorm
+ignoreExitValue = false
+args = [
+"-v",
+"-s",
+utr30DataDir,
+"-o",
+"${buildDir}/utr30.tmp",
+"nfc.txt", "nfkc.txt", "nfkc_cf.txt", "BasicFoldings.txt",
+"DiacriticFolding.txt", "DingbatFolding.txt", "HanRadicalFolding.txt",
+"NativeDigitFolding.txt"
+]
+  }
+  project.exec {
+executable icupkg
+ignoreExitValue = false
+args = [
+"-tb",
+"${buildDir}/utr30.tmp",
+"src/resources/org/apache/lucene/analysis/icu/utr30.nrm"
+]
+  }
+}
+  }
+
+  task genRbbi() {
+// May be undefined yet, so use a provider.
+dependsOn { sourceSets.tools.runtimeClasspath }
+
+doFirst {
+  project.javaexec {
+main = "org.apache.lucene.analysis.icu.RBBIRuleCompiler"
+classpath = sourceSets.tools.runtimeClasspath
+
+ignoreExitValue false
+enableAssertions true
+args = [
+"src/data/uax29",
+"src/resources/org/apache/lucene/analysis/icu/segmentation"
+]
+  }
+}
+  }
+
+  task regenerate() {
+dependsOn genUtr30DataFiles

Review comment:
   That's right. The RBBI one is pure Java, so we can run it in parallel 
alongside the other one.






[jira] [Commented] (SOLR-15138) PerReplicaStates does not scale to large collections as well as state.json



[ https://issues.apache.org/jira/browse/SOLR-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284075#comment-17284075 ]

ASF subversion and git services commented on SOLR-15138:


Commit 5f065acfbdb10e35633859520bb59122f4809f0b in lucene-solr's branch 
refs/heads/branch_8_8 from Ishan Chattopadhyaya
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5f065ac ]

Revert "SOLR-15138: Collection creation for PerReplicaStates does not scale to 
large collections as well as regular collections (closes #2359 and #2318)"

This reverts commit 22c716bcd946fa2d49e6cea53c0f0dd689954d76.


> PerReplicaStates does not scale to large collections as well as state.json
> --
>
> Key: SOLR-15138
> URL: https://issues.apache.org/jira/browse/SOLR-15138
> Project: Solr
> Issue Type: Bug
> Components: SolrCloud
> Affects Versions: 8.8
> Reporter: Mike Drob
> Assignee: Noble Paul
> Priority: Major
> Fix For: 8.8.1
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> I was testing PRS collection creation with larger collections today 
> (previously I had tested with many small collections) and it seemed to be 
> having trouble keeping up.
>  
> I was running a 4 node instance, each JVM with 4G Heap in k8s, and a single 
> zookeeper.
>  
> With this cluster configuration, I am able to create several (at least 10) 
> collections with 11 shards and 11 replicas using the "old way" of keeping 
> state. These collections are created serially, waiting for all replicas to be 
> active before proceeding.
> However, when attempting to do the same with PRS, the creation stalls on 
> collection 2 or 3, with several replicas stuck in a "down" state. Further, 
> when attempting to delete these collections using the regular API it 
> sometimes takes several attempts after getting stuck a few times as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [lucene-solr] zacharymorn commented on pull request #2342: LUCENE-9406: Add IndexWriterEventListener to track events in IndexWriter

2021-02-12 Thread ASF subversion and git services (Jira)



zacharymorn commented on pull request #2342:
URL: https://github.com/apache/lucene-solr/pull/2342#issuecomment-778550370


   > I think it's good overall but I'm wondering whether it makes sense to make 
that field volatile... do we want to allow changing listeners over index writer 
lifecycle? I think it should be a regular field and IW should just read it once 
(and set forever).
   
   Thanks for the feedback, Dawid! I thought about this a bit as well when I 
noticed that most of the other fields are volatile, but I couldn't decide for 
sure which direction to go, as I don't have enough context about the different 
use cases IW may need to support in the wild, so I ended up following the 
existing pattern here. However, I do feel that it may not hurt for an 
application to have the freedom to switch the event listener midway if the 
situation or need dictates (e.g. a service application embedding Lucene may 
need to switch event listeners if it exposes an API that lets its clients 
choose how the IW event stream is sent and stored)?




[jira] [Commented] (SOLR-14928) Remove Overseer ClusterStateUpdater



[ https://issues.apache.org/jira/browse/SOLR-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284056#comment-17284056 ]

ASF subversion and git services commented on SOLR-14928:


Commit 23755ddfdd36a9613010cb9e6201127df55be744 in lucene-solr's branch 
refs/heads/master from Ilan Ginzburg
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=23755dd ]

SOLR-14928: allow cluster state updates to be done in a distributed way and not 
through Overseer (#2364)



> Remove Overseer ClusterStateUpdater
> ---
>
> Key: SOLR-14928
> URL: https://issues.apache.org/jira/browse/SOLR-14928
> Project: Solr
> Issue Type: Sub-task
> Components: SolrCloud
> Reporter: Ilan Ginzburg
> Assignee: Ilan Ginzburg
> Priority: Major
> Labels: cluster, collection-api, overseer
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> Remove the Overseer {{ClusterStateUpdater}} thread and associated Zookeeper 
> queue at {{<_chroot_>/overseer/queue}}.
> Change cluster state updates so that each (Collection API) command execution 
> does the update directly in Zookeeper using optimistic locking (Compare and 
> Swap on the {{state.json}} Zookeeper files).
> Following this change cluster state updates would still be happening only 
> from the Overseer node (that's where Collection API commands are executing), 
> but the code will be ready for distribution once such commands can be 
> executed by any node (other work done in the context of parent task 
> SOLR-14927).
> See the [Cluster State 
> Updater|https://docs.google.com/document/d/1u4QHsIHuIxlglIW6hekYlXGNOP0HjLGVX5N6inkj6Ok/edit#heading=h.ymtfm3p518c]
>  section in the Removing Overseer doc.
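
The optimistic-locking update described above can be sketched as a read, 
compute, conditional-write loop. This is a minimal illustration under stated 
assumptions, not the code from the commit: `VersionedNode` is a hypothetical 
in-memory stand-in for a versioned Zookeeper file such as {{state.json}}, and 
`OptimisticUpdater` is an invented helper. In real Zookeeper the conditional 
write corresponds to `setData` with an expected version.

```java
import java.util.function.UnaryOperator;

/** Hypothetical in-memory stand-in for a versioned node such as state.json. */
final class VersionedNode {
  /** Immutable snapshot: the content plus the version it was read at. */
  static final class Snapshot {
    final String data;
    final int version;

    Snapshot(String data, int version) {
      this.data = data;
      this.version = version;
    }
  }

  private Snapshot current;

  VersionedNode(String initial) {
    current = new Snapshot(initial, 0);
  }

  synchronized Snapshot read() {
    return current;
  }

  /** Writes only if nobody else has written since expectedVersion was read. */
  synchronized boolean compareAndSet(int expectedVersion, String newData) {
    if (current.version != expectedVersion) {
      return false; // another writer got there first
    }
    current = new Snapshot(newData, expectedVersion + 1);
    return true;
  }
}

final class OptimisticUpdater {
  /** Retries until the conditional write succeeds; returns the attempt count. */
  static int update(VersionedNode node, UnaryOperator<String> change) {
    int attempts = 0;
    while (true) {
      attempts++;
      VersionedNode.Snapshot seen = node.read();
      String updated = change.apply(seen.data);
      if (node.compareAndSet(seen.version, updated)) {
        return attempts;
      }
      // Version moved on: re-read and re-apply the change to the fresh state.
    }
  }
}
```

No lock is held while the new state is computed; a losing writer simply 
re-reads and tries again, which is what lets the update run without a single 
serializing queue.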




[GitHub] [lucene-solr] murblanc merged pull request #2364: SOLR-14928: allow cluster state updates to be done in a distributed way



murblanc merged pull request #2364:
URL: https://github.com/apache/lucene-solr/pull/2364


   




[jira] [Commented] (SOLR-15089) Allow backup/restoration to Amazon's S3 blobstore

2021-02-12 Thread Andy Throgmorton (Jira)



[ https://issues.apache.org/jira/browse/SOLR-15089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284054#comment-17284054 ]

Andy Throgmorton commented on SOLR-15089:
-

Sounds good [~gerlowskija], I'll look into tidying up our code to make it 
open-sourceable. Ours has been used in production for a little over a year, but 
since the cleanup work may be on the heavy side, the eventual codebase that 
gets open-sourced will not really have any time in production :)

I agree with you and Ishan that solr-core isn't the place for this, but I think 
we can figure out exactly where at a later date. One more thing to call out is 
that our implementation is built on AWS SDK v1. We would like to move to v2 at 
a later date (for built-in metrics, etc.) but haven't had the time yet.

> Allow backup/restoration to Amazon's S3 blobstore 
> --
>
> Key: SOLR-15089
> URL: https://issues.apache.org/jira/browse/SOLR-15089
> Project: Solr
> Issue Type: Sub-task
> Reporter: Jason Gerlowski
> Priority: Major
>
> Solr's BackupRepository interface provides an abstraction around the physical 
> location/format that backups are stored in.  This allows plugin writers to 
> create "repositories" for a variety of storage mediums.  It'd be nice if Solr 
> offered more mediums out of the box though, such as some of the "blobstore" 
> offerings provided by various cloud providers.
> This ticket proposes a "BackupRepository" implementation for Amazon's 
> popular 'S3' blobstore, so that Solr users can use it for backups without 
> needing to write their own code.
> Amazon offers an S3 Java client with acceptable licensing, and the required 
> code is relatively simple.  The biggest challenge in supporting this will 
> likely be procedural - integration testing requires S3 access and S3 access 
> costs money.  We can check with INFRA to see if there is any way to get cloud 
> credits for an integration test to run in nightly Jenkins runs on the ASF 
> Jenkins server.  Alternatively we can try to stub out the blobstore in some 
> reliable way.




[GitHub] [lucene-solr] murblanc closed pull request #2285: SOLR-14928: introduce distributed cluster state updates



murblanc closed pull request #2285:
URL: https://github.com/apache/lucene-solr/pull/2285


   




[GitHub] [lucene-solr] murblanc commented on pull request #2285: SOLR-14928: introduce distributed cluster state updates



murblanc commented on pull request #2285:
URL: https://github.com/apache/lucene-solr/pull/2285#issuecomment-778536347


   Moved to 
[https://github.com/apache/lucene-solr/pull/2364](https://github.com/apache/lucene-solr/pull/2364)




[GitHub] [lucene-solr] murblanc opened a new pull request #2364: SOLR-14928: allow cluster state updates to be done in a distributed way



murblanc opened a new pull request #2364:
URL: https://github.com/apache/lucene-solr/pull/2364


   SOLR-14928: allow cluster state updates to be done in a distributed way and 
not through Overseer
   




[jira] [Updated] (SOLR-14593) Package store API to disable file upload over HTTP

2021-02-12 Thread Noble Paul (Jira)



 [ https://issues.apache.org/jira/browse/SOLR-14593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-14593:
--
Fix Version/s: (was: 8.8)

> Package store API to disable file upload over HTTP
> --
>
> Key: SOLR-14593
> URL: https://issues.apache.org/jira/browse/SOLR-14593
> Project: Solr
> Issue Type: Task
> Reporter: Noble Paul
> Priority: Blocker
>
> h2. Why?
> Users installing third party plugins from external repos trust the public 
> keys of that repository owner. Anyone who has a private key to that repo will 
> be able to push any executable binary into such a cluster using the HTTP 
> upload endpoints. These executables will remain trusted.
> h3. Solution: Disable uploading jars over HTTP (they can be downloaded via 
> CLI by the user)
>  * {{/cluster/files/*}} endpoint will stop accepting files. That end-point 
> will not exist
>  * All jar files will need to be uploaded using the CLI. The CLI has access 
> to a physical file system where it copies the jar file to 
> {{$SOLR_HOME/filestore/*}} and issues the sync command. The sync command asks 
> other nodes to sync the jar file from this local node. (This is how the keys 
> are distributed today)
> h2. Is this backward compatible?
> No. For anyone using the internal APIs only to deploy, their packages will 
> stop working. Anyone using the CLI will have the same experience and they do 
> not need to make any changes to their workflow. All packages that are 
> currently installed will continue to work fine




[jira] [Closed] (SOLR-14827) Refactor schema loading to not use XPath

2021-02-12 Thread Noble Paul (Jira)



 [ https://issues.apache.org/jira/browse/SOLR-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul closed SOLR-14827.
-

> Refactor schema loading to not use XPath
> 
>
> Key: SOLR-14827
> URL: https://issues.apache.org/jira/browse/SOLR-14827
> Project: Solr
> Issue Type: Task
> Reporter: Noble Paul
> Assignee: Noble Paul
> Priority: Major
> Labels: perfomance
> Fix For: 8.8
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> XPath is slower compared to DOM. 




[jira] [Resolved] (SOLR-14827) Refactor schema loading to not use XPath

2021-02-12 Thread Noble Paul (Jira)



 [ https://issues.apache.org/jira/browse/SOLR-14827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul resolved SOLR-14827.
---
Fix Version/s: 8.8
   Resolution: Fixed

> Refactor schema loading to not use XPath
> 
>
> Key: SOLR-14827
> URL: https://issues.apache.org/jira/browse/SOLR-14827
> Project: Solr
> Issue Type: Task
> Reporter: Noble Paul
> Assignee: Noble Paul
> Priority: Major
> Labels: perfomance
> Fix For: 8.8
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> XPath is slower compared to DOM. 




[jira] [Commented] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently



[ https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284030#comment-17284030 ]

Robert Muir commented on LUCENE-9754:
-

Sorry, I think this tokenizer works differently behind the scenes than you 
imagine: if you want a purer Unicode-standard tokenizer, then use 
{{StandardTokenizer}}. 

But {{ICUTokenizer}} differs from {{StandardTokenizer}} in that it tries to 
track more modern Unicode standards and, as I mentioned in my comment above, it 
first chunks the input, then divides on scripts. This lets someone customize how 
the tokenization works for a particular writing system. And we give options for 
the tricky ones (e.g. Thai/Lao/Burmese/whatever) that are usable in case the 
JDK might not be.

With no disrespect intended, the rules you see don't mean what you might 
infer. You need to go to the notes section :) If we encounter text in Thai etc., 
the ICU dictionary takes care of it. But we also let the end user supply their 
own rules, in case they want something different.

The differences in this issue just have to do with stupid low-level text 
buffering, when segmentation usually just needs sentence context... and from the 
NLP perspective, that is typically what it is trained on. So it makes sense to 
chunk on sentences rather than on ranges and devolving to spaces. That's the 
issue the base {{SegmentingTokenizerBase}} fixes for its subclasses (e.g. CJK); 
we should fix it here too.

I don't care how good or terrible UAX29 sentence segmentation is; I want 
to use it for chunking. If you don't like it, you can optionally provide your 
own rules that you think are better. That is how I feel about this from a 
search engine library perspective.

> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> --
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
> Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
> Reporter: Trey Jones
> Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  
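
The reporter's guess can be modelled with a toy tokenizer. This is a sketch of 
that hypothesis only, not {{ICUTokenizer}}'s actual code, and the comment above 
indicates the real mechanism (chunking, then dividing on scripts) differs. 
`ScriptCarryOverDemo` is an invented class: it splits on whitespace, and also 
splits when a letter's script differs from the last script seen, without 
resetting that memory at whitespace. Digits have script COMMON, so they never 
update it.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the "script flag is not reset at the space" hypothesis. */
final class ScriptCarryOverDemo {

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    UnicodeScript lastScript = null; // deliberately NOT reset at whitespace

    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      i += Character.charCount(cp);

      if (Character.isWhitespace(cp)) {
        flush(tokens, current);
        continue;
      }

      UnicodeScript script = UnicodeScript.of(cp);
      boolean neutral =
          script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED;
      if (!neutral) {
        // A script change ends the current token, even right after digits.
        if (lastScript != null && script != lastScript) {
          flush(tokens, current);
        }
        lastScript = script;
      }
      current.appendCodePoint(cp);
    }
    flush(tokens, current);
    return tokens;
  }

  private static void flush(List<String> tokens, StringBuilder current) {
    if (current.length() > 0) {
      tokens.add(current.toString());
      current.setLength(0);
    }
  }
}
```

Under this model `tokenize("x 14th")` yields `[x, 14th]`, while 
`tokenize("\u30A1 14th")` (the Katakana example from the description) yields 
three tokens, `\u30A1`, `14` and `th`, matching the two outcomes reported.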




[GitHub] [lucene-solr] epugh edited a comment on pull request #2356: SOLR-15152: Export Tool should export nested docs cleanly in .json, .jsonl, and javabin



epugh edited a comment on pull request #2356:
URL: https://github.com/apache/lucene-solr/pull/2356#issuecomment-778497757


   In more testing, I don't have the `javabin` format working yet with nested 
children...   It's not roundtripping exports with the javabin format back into 
Solr.  It does work for non-nested children.




[jira] [Updated] (SOLR-15150) add request level option to fail an atomic update if it can't be done 'in-place'

2021-02-12 Thread Chris M. Hostetter (Jira)



 [ https://issues.apache.org/jira/browse/SOLR-15150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris M. Hostetter updated SOLR-15150:
--
Attachment: SOLR-15150.patch
Status: Open  (was: Open)

Updated patch with revised param name and ref-guide docs.

> add request level option to fail an atomic update if it can't be done 
> 'in-place'
> 
>
> Key: SOLR-15150
> URL: https://issues.apache.org/jira/browse/SOLR-15150
> Project: Solr
> Issue Type: New Feature
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Chris M. Hostetter
> Assignee: Chris M. Hostetter
> Priority: Major
> Attachments: SOLR-15150.patch, SOLR-15150.patch
>
>
> When "In-Place" DocValue updates were added to Solr, the choice was made to 
> re-use the existing "Atomic Update" syntax, and use the DocValue updating 
> code if possible based on the index & schema, otherwise fall back to the 
> existing Atomic Update logic (to re-index the entire document). In essence, 
> "In-Place Atomic Updates" are treated as a (possible) optimization to 
> "regular" Atomic Updates
> This works fine, but it leaves open the possibility of a "gotcha" situation 
> where users may (reasonably) assume that an update can be done "In-Place" but 
> some aspect of the schema prevents it, and the performance of the updates 
> doesn't meet expectations (notably in the case of things like deeply nested 
> documents, where the re-indexing cost is multiplicative based on the total 
> size of the document tree)
> I think it would be a good idea to support an optional request param users 
> can specify with the semantics that say "If this update is an Atomic Update, 
> fail to execute it unless it can be done In-Place"
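
The proposed semantics amount to a small decision table, sketched below. This 
is illustrative only and not taken from the attached patch: 
`AtomicUpdatePlanner`, `Plan` and the `requireInPlace` flag are hypothetical 
names, and the real check for whether an update can be done in place depends 
on the index and schema.

```java
/** Hypothetical sketch of the "fail unless in-place" request option. */
final class AtomicUpdatePlanner {

  enum Plan { IN_PLACE, FULL_REINDEX }

  /**
   * @param canUpdateInPlace whether index and schema allow a DocValue update
   * @param requireInPlace hypothetical request flag from the proposal
   */
  static Plan plan(boolean canUpdateInPlace, boolean requireInPlace) {
    if (canUpdateInPlace) {
      return Plan.IN_PLACE; // the optimization applies, flag or no flag
    }
    if (requireInPlace) {
      // Fail loudly instead of silently re-indexing the whole document.
      throw new IllegalStateException(
          "Atomic update cannot be done in-place, and in-place was required");
    }
    return Plan.FULL_REINDEX; // existing fallback behaviour
  }
}
```

Without the flag the behaviour stays as it is today; with it, the silent 
fallback becomes an error the caller can see.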




[GitHub] [lucene-solr] epugh commented on pull request #2356: SOLR-15152: Export Tool should export nested docs cleanly in .json, .jsonl, and javabin



epugh commented on pull request #2356:
URL: https://github.com/apache/lucene-solr/pull/2356#issuecomment-778497757


   In more testing, I don't have the `javabin` format working yet...   It's not 
roundtripping exports with the javabin format back into Solr.




[jira] [Commented] (SOLR-15129) Use the Solr TGZ artifact as Docker context

2021-02-12 Thread Houston Putman (Jira)



[ https://issues.apache.org/jira/browse/SOLR-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284025#comment-17284025 ]

Houston Putman commented on SOLR-15129:
---

I have created an issue in the Docker solr repo, and tagged tianon. Please feel 
free to add any information if I have missed it.

https://github.com/docker-solr/docker-solr/issues/368

> Use the Solr TGZ artifact as Docker context
> ---
>
> Key: SOLR-15129
> URL: https://issues.apache.org/jira/browse/SOLR-15129
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public (Default Security Level. Issues are Public)
> Affects Versions: master (9.0)
> Reporter: Houston Putman
> Priority: Major
>
> As discussed in SOLR-15127, there is a need for a unified Dockerfile that 
> allows for release and local builds.
> This ticket is an attempt to achieve this by using the Solr distribution TGZ 
> as the docker context to build from.
> Therefore release images would be completely reproducible by running:
> {{docker build -f solr-9.0.0/Dockerfile 
> https://www.apache.org/dyn/closer.lua/lucene/solr/9.0.0/solr-9.0.0.tgz}}
> The changes to the Solr distribution would include adding a Dockerfile at 
> {{solr-/Dockerfile}}, adding the docker scripts under 
> {{solr-/docker}}, and adding a version file at 
> {{solr-/VERSION.txt}}.




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284020#comment-17284020
 ] 

Robert Muir commented on LUCENE-9767:
-

I get a conflicting ICU version even from {{bluez}}, just because I want 
low-level Bluetooth support on Linux. The chance of a library version conflict 
is 100%.

I don't even turn Bluetooth on yet; just like the wifi, I simply leave it off, 
but maybe it'd be cool to scan for virusy people on the trail behind my house, 
or whatever I might want to do with those radios.

> port ICU regeneration to gradle build
> -
>
> Key: LUCENE-9767
> URL: https://issues.apache.org/jira/browse/LUCENE-9767
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When we upgrade ICU dependency we have to regenerate a lot of stuff. The ant 
> build has it all automated.
> You do need icu4c installed corresponding to the icu4j version to regenerate 
> some of the datastructures. There are also some java regenerators that do 
> processing of unicode data and so on.
> Will try to see if I can get this hobbling when I have the time, the icu 
> dependency is quite old at this point. The hard part for me is learning 
> gradle's crazy ways every time, but maybe i can start it off super-ugly with 
> something like shell script that everyone hates, but at least works 
> correctly. 
> cc [~dweiss]




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284018#comment-17284018
 ] 

Robert Muir commented on LUCENE-9767:
-

Also, I checked on a headless Linux router, which has about the most embedded 
profile you can get: very few packages installed, it is a network router and 
not much more.

It has ICU, because anything using {{glib2}} drags that in. So in my case it's 
relatively low-level packages like {{polkit}}, {{gettext}} and {{perf}} 
bringing in ICU transitively.




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284013#comment-17284013
 ] 

Robert Muir commented on LUCENE-9767:
-

No worries, you did more than enough. Thank you for getting it to a place where 
I could help out without fighting Gradle so much!


[jira] [Commented] (SOLR-15152) Export Tool should export nested docs cleanly in .json, .jsonl, and javabin

2021-02-12 Thread David Eric Pugh (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284012#comment-17284012
 ] 

David Eric Pugh commented on SOLR-15152:


Hi all, I believe the attached PR is now ready for review.   

I'm pretty happy with everything EXCEPT some changes I had to make to 
{{ChildDocTransformerFactory.processPathHierarchyQueryString()}}. I added 
some checks to decide when to escape the {{childFilter}} query string, but I 
did them kind of blindly, just to get the tests to pass :(. I would love 
[~dsmiley] or others to take a look here.

I'm looking forward to being able to roundtrip export and import Solr docs that 
have children successfully!

> Export Tool should export nested docs cleanly in .json, .jsonl, and javabin
> ---
>
> Key: SOLR-15152
> URL: https://issues.apache.org/jira/browse/SOLR-15152
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SolrCLI
>Affects Versions: 8.8
>Reporter: David Eric Pugh
>Assignee: David Eric Pugh
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ExportTool doesn't properly handle anonymous child docs or nested docs. It 
> also confuses the JSONL format with the JSON format.
> I'd like to have the JSON Lines format output as .jsonl, which is the 
> standard, and have the JSON format output as .json, which is the same output as 
> if you wanted to post a Solr doc as JSON to upload the data... This will 
> let us round-trip the data.
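
To make the distinction in the description concrete, here is a minimal Java sketch of the two output shapes. It is only an illustration: the class and method names are hypothetical and are not part of ExportTool, and it assumes the .json output is a single array of documents, as the description suggests.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Solr's ExportTool.
final class JsonVsJsonLines {
  /** JSON Lines (.jsonl): one complete JSON object per line, no enclosing array. */
  static String toJsonLines(List<String> docs) {
    return String.join("\n", docs) + "\n";
  }

  /** JSON (.json): a single array of objects (assumed shape for re-posting). */
  static String toJsonArray(List<String> docs) {
    return "[\n  " + String.join(",\n  ", docs) + "\n]\n";
  }

  public static void main(String[] args) {
    // Each element is an already-serialized document.
    List<String> docs = Arrays.asList(
        "{\"id\":\"1\",\"title\":\"first\"}",
        "{\"id\":\"2\",\"title\":\"second\"}");
    System.out.print(toJsonLines(docs));
    System.out.print(toJsonArray(docs));
  }
}
```

A .jsonl file can be processed line by line, while a .json file has to be parsed as one document, which is why mixing up the two extensions matters for round-tripping.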




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284011#comment-17284011
 ] 

Dawid Weiss commented on LUCENE-9767:
-

Yep, I agree. I'll take a look at it (but not tonight, I'm done...).


[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284010#comment-17284010
 ] 

Dawid Weiss commented on LUCENE-9767:
-

I've tried on my local (old-ish) ubuntu, I get this:
{code}
dweiss@:~/tmp/icu$ ./usr/local/sbin/gennorm2
./usr/local/sbin/gennorm2: error while loading shared libraries: 
libicutu.so.68: cannot open shared object file: No such file or directory
{code}
and when I set LD_LIBRARY_PATH:
{code}
export LD_LIBRARY_PATH=/home/dweiss/tmp/icu/usr/local/$LD_LIBRARY_PATH
dweiss@:~/tmp/icu$ ./usr/local/sbin/gennorm2
./usr/local/sbin/gennorm2: /lib/x86_64-linux-gnu/libm.so.6: version 
`GLIBC_2.29' not found (required by 
/home/dweiss/tmp/icu/usr/local/lib/libicuuc.so.68)
{code}

So yeah... perhaps better to compile from sources on *nix systems.


[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284008#comment-17284008
 ] 

Robert Muir commented on LUCENE-9767:
-

I think it's challenging on Linux; you have several problems to contend with:
* temp directories: lots of people will have /tmp mounted noexec
* "dll hell": I run a minimal Linux system, and I have ICU installed via "Required 
By : bind  boost-libs  brltty  libical  libxml2  tracker3". It isn't just 
because I wanted the "nslookup" command :) Pretty much everyone will have a 
conflicting version
* the dynamic linker configuration (/etc/ld.so.conf, /etc/ld.so.conf.d) is 
complex; I don't know what kind of "dll hell" issues we can run into
* the things we do with ICU are invasive, we depend on its exact version, we 
can't be lazy about it

Also, we still need a good solution for the Mac: there are no binaries 
for the Mac. If we want to compile "standalone" from source code, the Mac 
instructions are exactly equivalent to the Linux ones. So why try to fight with 
Linux binary executables? I suggest just one path for Linux/Mac and another 
for Windows.


[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284004#comment-17284004
 ] 

Dawid Weiss commented on LUCENE-9767:
-

The windows package has dlls along with executables and will use these prior to 
anything else... and I bet the compilation process on Windows is really complex 
(either cross-compilation or you need to download tons of stuff from 
Microsoft...).

Would changing LD_LIBRARY_PATH affect the lookup of shared libraries?


[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284003#comment-17284003
 ] 

Robert Muir commented on LUCENE-9767:
-

btw, on windows, the binaries might still be appropriate to use, i don't know, 
i don't use windows. that won't be my "dll hell".

But on linux, binaries will cause "dll hell", especially as icu is a very 
common package for linux systems to have installed (and we need exact versions 
to match).



[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284002#comment-17284002
 ] 

Robert Muir commented on LUCENE-9767:
-

For example, here is the output for the binary package: you can see it links to 
the ICU libraries I have in /usr/lib (the system ICU), so using it would break 
everything.

{noformat}
$ ldd usr/local/sbin/gennorm2
linux-vdso.so.1 (0x77fca000)
libicutu.so.68 => /usr/lib/libicutu.so.68 (0x77f37000)
libicuuc.so.68 => /usr/lib/libicuuc.so.68 (0x77d48000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x77b6b000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x77b51000)
libc.so.6 => /usr/lib/libc.so.6 (0x77984000)
libicui18n.so.68 => /usr/lib/libicui18n.so.68 (0x77665000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x77642000)
libicudata.so.68 => /usr/lib/libicudata.so.68 (0x75b01000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x75afa000)
libm.so.6 => /usr/lib/libm.so.6 (0x759b5000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x77fcc000)
{noformat}

On the other hand, here is what it looks like with the "standalone rpath" build 
detailed in the comments of my PR (I used /home/rmuir/icu):
{noformat}
$ ldd ~/icu/source/bin/gennorm2
linux-vdso.so.1 (0x77fca000)
libicutu.so.62 => /home/rmuir/icu/source/lib/libicutu.so.62 (0x77f4c000)
libicui18n.so.62 => /home/rmuir/icu/source/lib/libicui18n.so.62 (0x77c61000)
libicuuc.so.62 => /home/rmuir/icu/source/lib/libicuuc.so.62 (0x77a88000)
libicudata.so.62 => /home/rmuir/icu/source/lib/libicudata.so.62 (0x760ed000)
libpthread.so.0 => /usr/lib/libpthread.so.0 (0x760b8000)
libdl.so.2 => /usr/lib/libdl.so.2 (0x760b1000)
libm.so.6 => /usr/lib/libm.so.6 (0x75f6a000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x75d8d000)
libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x75d73000)
libc.so.6 => /usr/lib/libc.so.6 (0x75ba6000)
/lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x77fcc000)
{noformat}

As you can see, it ensures the correct version is really used.
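
The manual check above (reading the ldd listing and looking at where each libicu* entry resolves) could be automated roughly as follows. This is only a sketch under stated assumptions: the class and method names are hypothetical, it is not part of the PR, and it assumes the usual "name => path (address)" layout of ldd output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags ICU libraries that resolve outside the expected prefix.
final class LddIcuCheck {
  /** Returns the libicu* entries of an ldd listing whose path is not under expectedPrefix. */
  static List<String> icuLibsOutside(String lddOutput, String expectedPrefix) {
    List<String> offenders = new ArrayList<>();
    for (String line : lddOutput.split("\\R")) {
      String trimmed = line.trim();
      int arrow = trimmed.indexOf("=>");
      if (!trimmed.startsWith("libicu") || arrow < 0) {
        continue; // only ICU libraries that ldd tried to resolve are of interest
      }
      String name = trimmed.substring(0, arrow).trim();
      String rest = trimmed.substring(arrow + 2).trim();
      // The path ends at the first space, before the load address in parentheses;
      // an unresolved library ("not found") is kept whole and reported as an offender.
      int space = rest.indexOf(' ');
      String path = (space < 0 || rest.startsWith("not found")) ? rest : rest.substring(0, space);
      if (!path.startsWith(expectedPrefix)) {
        offenders.add(name + " -> " + path);
      }
    }
    return offenders;
  }

  public static void main(String[] args) {
    String sample = String.join("\n",
        "libicutu.so.68 => /usr/lib/libicutu.so.68 (0x77f37000)",
        "libicuuc.so.62 => /home/rmuir/icu/source/lib/libicuuc.so.62 (0x77a88000)",
        "libc.so.6 => /usr/lib/libc.so.6 (0x75ba6000)");
    // Only the first entry resolves to the system ICU and is reported.
    System.out.println(icuLibsOutside(sample, "/home/rmuir/icu/"));
  }
}
```

An empty result would mean every ICU library resolves under the standalone build directory, which is the property the rpath build is meant to guarantee.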


[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284001#comment-17284001
 ] 

Robert Muir commented on LUCENE-9767:
-

Yeah the linux binaries "seemed to work" but are a no-go. I did "ldd 
usr/local/sbin/gennorm2" and they are linking to the system libicudata etc. So 
they aren't gonna work for our case where we need to match the versions 
exactly. That's why i have rpath / standalone compilation in the comments.


[GitHub] [lucene-solr] epugh commented on a change in pull request #2356: SOLR-15152: Export Tool should export nested docs cleanly in .json, .jsonl, and javabin



epugh commented on a change in pull request #2356:
URL: https://github.com/apache/lucene-solr/pull/2356#discussion_r575534307



##
File path: 
solr/core/src/java/org/apache/solr/response/transform/ChildDocTransformerFactory.java
##
@@ -160,8 +161,18 @@ protected static String processPathHierarchyQueryString(String queryString) {
     int indexOfLastPathSepChar = queryString.lastIndexOf(PATH_SEP_CHAR, indexOfFirstColon);
     if (indexOfLastPathSepChar < 0) {
       // regular filter, not hierarchy based.
-      return ClientUtils.escapeQueryChars(queryString.substring(0, indexOfFirstColon))

Review comment:
   @dsmiley I need your help on this! I am struggling with childFilters. 
`level_i:1` works great with the escaping, but queries like `level_i:[1 TO 3]` 
fail due to the escaping; I get weird "can parse number" errors. The same 
happens if I have `type_s:Chocolate OR type_s:Regular`, but if I skip the 
escaping then things work.
   
   This is definitely at the very outer edge of my knowledge. Would love to 
chat about how to fix this.
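
   A small self-contained Java sketch of what the escaping probably does to these three filters. It rests on assumptions: `escape` is a local stand-in modeled on `ClientUtils.escapeQueryChars` (backslash-escaping query-syntax characters and whitespace; the real character list may differ), and `escapeAroundFirstColon` is a hypothetical helper that assumes both sides of the first colon are escaped. Neither is the code in the PR.

```java
// Hypothetical demo class, not part of Solr.
final class ChildFilterEscapeDemo {
  // Stand-in for ClientUtils.escapeQueryChars (assumed behavior, see lead-in).
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if ("\\+-!():^[]\"{}~*?|&;/".indexOf(c) >= 0 || Character.isWhitespace(c)) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  // Assumption: field name and value are escaped separately around the first colon.
  static String escapeAroundFirstColon(String query) {
    int colon = query.indexOf(':');
    if (colon < 0) {
      return escape(query);
    }
    return escape(query.substring(0, colon)) + ":" + escape(query.substring(colon + 1));
  }

  public static void main(String[] args) {
    String[] filters = {"level_i:1", "level_i:[1 TO 3]", "type_s:Chocolate OR type_s:Regular"};
    for (String filter : filters) {
      System.out.println(filter + "  =>  " + escapeAroundFirstColon(filter));
    }
  }
}
```

   Under these assumptions `level_i:1` passes through unchanged, while the range becomes `level_i:\[1\ TO\ 3\]` and the boolean filter becomes `type_s:Chocolate\ OR\ type_s\:Regular`. Both are then single literal terms rather than a range or an OR query, which would be consistent with a number-parsing error on an integer field.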






[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build

2021-02-12 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283999#comment-17283999
 ] 

Dawid Weiss commented on LUCENE-9767:
-

> For mac, what to do? There is no binary there, and macs are popular for 
> developers.

Ouch. Didn't notice that. I have a Mac, actually, but don't use it for 
development - old habits die hard. I'll poke around and see. We can make 
different variants, of course, including compilation from sources.


[jira] [Commented] (SOLR-13608) Incremental backup for Solr



[ 
https://issues.apache.org/jira/browse/SOLR-13608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283997#comment-17283997
 ] 

ASF subversion and git services commented on SOLR-13608:


Commit fd1af8f524d5d459188d755bdf8eb02b0e88f31f in lucene-solr's branch 
refs/heads/branch_8x from Jason Gerlowski
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fd1af8f ]

SOLR-13608: Incremental backup file format

This commit introduces a new way for Solr to do backups (with a new
underlying file structure).  This new "incremental" backup process
improves over the existing backup mechanism in several ways:

- multiple backup "points" can now be stored at a given backup
  location/name, allowing users to choose which point in time they want
  to restore
- subsequent backups skip over uploading files that were uploaded by
  previous backups, saving time and network bandwidth.
- files are checksummed as they're uploaded, ensuring that corrupted
  indices aren't persisted and accidentally restored later.

Incremental backups are now the default, and traditional backups
should now be considered 'deprecated' but can still be created by
passing an `incremental=false` parameter on backup requests.
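
The "skip files already uploaded" idea can be sketched in a few lines of Java. This is an illustration only, not the code in this commit: the class and method names are hypothetical, and using a CRC32 over the file content to decide whether a stored file can be reused is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical sketch, not Solr's backup implementation.
final class IncrementalBackupSketch {
  /** CRC32 of the file content (assumed checksum for this example). */
  static long checksum(byte[] content) {
    CRC32 crc = new CRC32();
    crc.update(content, 0, content.length);
    return crc.getValue();
  }

  /** Names (sorted) of index files that are missing from, or differ from, what is already stored. */
  static List<String> filesToUpload(Map<String, byte[]> indexFiles, Map<String, Long> stored) {
    List<String> toUpload = new ArrayList<>();
    for (Map.Entry<String, byte[]> entry : new TreeMap<>(indexFiles).entrySet()) {
      Long previous = stored.get(entry.getKey());
      if (previous == null || previous.longValue() != checksum(entry.getValue())) {
        toUpload.add(entry.getKey());
      }
    }
    return toUpload;
  }

  public static void main(String[] args) {
    Map<String, byte[]> index = new HashMap<>();
    index.put("_0.cfs", new byte[] {1, 2, 3});
    index.put("_1.cfs", new byte[] {4, 5, 6});
    // The first backup point already holds _0.cfs with a matching checksum.
    Map<String, Long> stored = new HashMap<>();
    stored.put("_0.cfs", checksum(new byte[] {1, 2, 3}));
    System.out.println(filesToUpload(index, stored)); // only _1.cfs needs uploading
  }
}
```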

> Incremental backup for Solr
> ---
>
> Key: SOLR-13608
> URL: https://issues.apache.org/jira/browse/SOLR-13608
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Cao Manh Dat
>Assignee: Jason Gerlowski
>Priority: Major
>  Time Spent: 82.5h
>  Remaining Estimate: 14.5h
>
> SIP-12 lays out a plan for adding support for incremental backups to Solr.  
> At a high level, the idea is that Solr will be able to store multiple backups 
> in the same location, and backups beyond the first one will only upload those 
> files that were not uploaded by previous backups.
> This involves changes to the file structure within a particular backup 
> location.  It also entails changes to some of the backup/restore API 
> parameters and semantics, to accommodate storing multiple backups in the same 
> place, etc.
> This ticket covers the changes required for this functionality, as described 
> in SIP-12 unless mentioned specifically below.  It does not implement all of 
> [SIP-12.|https://cwiki.apache.org/confluence/display/SOLR/SIP-12%3A+Incremental+Backup+and+Restore]
>   Same-collection-restoration, support for popular proprietary blob stores, 
> etc. are left for separate tickets in an attempt to keep PRs manageable and 
> conceptually cohesive. 




[GitHub] [lucene-solr] gerlowskija merged pull request #2360: SOLR-13608: Incremental backup file format (#2250)



gerlowskija merged pull request #2360:
URL: https://github.com/apache/lucene-solr/pull/2360


   




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283995#comment-17283995
 ] 

Robert Muir commented on LUCENE-9767:
-

I tried both the linux binaries here on my linux machine (which is neither 
fedora nor ubuntu, not sure why it matters for the packages), both worked. You 
do have to be careful that they put some stuff (such as gennorm2 which we need) 
in /usr/local/sbin

For windows, I didn't test, but that seems attractive.

For mac, what to do? There is no binary there, and macs are popular for 
developers. 

So if we want to get fancy, i suggest starting simple, like having a 
compile-from-source approach which works on all linux/mac variants (the steps i 
have outlined as comments in the PR will work for both), and using the binaries 
only for windows?


[GitHub] [lucene-solr] dweiss commented on a change in pull request #2362: LUCENE-9767: infrastructure for icu regeneration in place.



dweiss commented on a change in pull request #2362:
URL: https://github.com/apache/lucene-solr/pull/2362#discussion_r575530042



##
File path: gradle/generation/icu.gradle
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/* NOTE: when regenerating, you need icu4c binaries in PATH.
+ * The icu4c version must match exactly the icu4j version in version.props:
+ * The one on your system is probably different.
+ *
+ * You can download the C sources to a standalone folder with these steps:
+ * (example for icu 62.2)
+ *
+ * # download version matching icu4j version in version.props
+ * curl -fLO https://github.com/unicode-org/icu/releases/download/release-62-2/icu4c-62_2-src.tgz
+ * # extract
+ * tar -zxvf icu4c-62_2-src.tgz
+ * # compile
+ * (cd icu/source && ./configure --prefix=$(pwd) --enable-rpath && make -j4)
+ * # test binaries work
+ * icu/source/bin/derb -V
+ * # put in PATH
+ * export PATH=$(pwd)/icu/source/bin:$PATH
+ */
+configure(project(":lucene:analysis:icu")) {
+  def utr30DataDir = file("src/data/utr30")
+
+  task genUtr30DataFiles() {
+    // May be undefined yet, so use a provider.
+    dependsOn { sourceSets.tools.runtimeClasspath }
+
+    doFirst {
+      // all these steps must be done sequentially: it's a pipeline resulting in utr30.nrm
+      project.javaexec {
+        main = "org.apache.lucene.analysis.icu.GenerateUTR30DataFiles"
+        classpath = sourceSets.tools.runtimeClasspath
+
+        ignoreExitValue false
+        workingDir utr30DataDir
+      }
+
+      def gennorm = 'gennorm2'
+      def icupkg = 'icupkg'
+      project.exec {
+        executable gennorm
+        ignoreExitValue = false
+        args = [
+          "-v",
+          "-s",
+          utr30DataDir,
+          "-o",
+          "${buildDir}/utr30.tmp",
+          "nfc.txt", "nfkc.txt", "nfkc_cf.txt", "BasicFoldings.txt",
+          "DiacriticFolding.txt", "DingbatFolding.txt", "HanRadicalFolding.txt",
+          "NativeDigitFolding.txt"
+        ]
+      }
+      project.exec {
+        executable icupkg
+        ignoreExitValue = false
+        args = [
+          "-tb",
+          "${buildDir}/utr30.tmp",
+          "src/resources/org/apache/lucene/analysis/icu/utr30.nrm"
+        ]
+      }
+    }
+  }
+
+  task genRbbi() {
+    // May be undefined yet, so use a provider.
+    dependsOn { sourceSets.tools.runtimeClasspath }
+
+    doFirst {
+      project.javaexec {
+        main = "org.apache.lucene.analysis.icu.RBBIRuleCompiler"
+        classpath = sourceSets.tools.runtimeClasspath
+
+        ignoreExitValue false
+        enableAssertions true
+        args = [
+          "src/data/uax29",
+          "src/resources/org/apache/lucene/analysis/icu/segmentation"
+        ]
+      }
+    }
+  }
+
+  task regenerate() {
+    dependsOn genUtr30DataFiles

Review comment:
   these two dependsOn statements don't give ordering guarantees in gradle 
(genRbbi can run before genUtr). Is this correct? If there are dependencies 
between them then these should be declared on tasks (for example 
genRbbi.dependsOn genUtr...).
   
   I'll wait for your decision whether it makes sense to automate binary 
downloads - if so, I'll try to do this and we can commit it cleaned up and 
shiny.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283988#comment-17283988
 ] 

Dawid Weiss commented on LUCENE-9767:
-

Would it be any different if we downloaded a binary (for a given system) in a 
specific version? For example:

https://github.com/unicode-org/icu/releases/tag/release-68-2

I downloaded the Windows version and the tools were there (and worked). I can 
try to automate this process - it won't work forever (until they change the 
download links) but it'd be easier to switch a link than compile it locally. 
Plus, it'd work on exotic environments like Windows ;) 

I haven't looked at the patch you committed yet but I can try to switch to 
binaries if this is acceptable.

> port ICU regeneration to gradle build
> -
>
> Key: LUCENE-9767
> URL: https://issues.apache.org/jira/browse/LUCENE-9767
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When we upgrade the ICU dependency we have to regenerate a lot of stuff. The ant 
> build has it all automated.
> You do need icu4c installed, in the version corresponding to icu4j, to regenerate 
> some of the data structures. There are also some Java regenerators that do 
> processing of Unicode data and so on.
> Will try to see if I can get this hobbling when I have the time; the ICU 
> dependency is quite old at this point. The hard part for me is learning 
> gradle's crazy ways every time, but maybe I can start it off super-ugly with 
> something like a shell script that everyone hates, but at least works 
> correctly. 
> cc [~dweiss]




[GitHub] [lucene-solr] dweiss commented on pull request #2361: LUCENE-9768: Add source sets for src/tools, clean up forbidden API and formatting errors



dweiss commented on pull request #2361:
URL: https://github.com/apache/lucene-solr/pull/2361#issuecomment-778461944


   I didn't really do much - I just corrected what the forbiddenapis were 
complaining about. I'm sure you could make them nicer. It's good they're part 
of the build now though!




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build

2021-02-12 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283983#comment-17283983
 ] 

Robert Muir commented on LUCENE-9767:
-

Looking into it again, it would be good to think of ways to make regeneration 
less trappy; maybe automation is good here. I basically got the branch working 
the same as the old ant build, so it regenerates everything, but that relies 
on you having the C packages compiled in the correct versions and in the 
correct place in your $PATH, and so on. 

For Linux and Mac at least, the formula is the same: download the sources 
of a specific version (which probably conflicts with the one on your system) to 
a scratch directory and compile them with rpath linkage, which ensures that the 
tools work "standalone". You could blow that temp work away afterwards. So it 
could maybe be automated, to make it easier to regenerate everything from 
scratch. Compilation is slow though; it is a big library.


[jira] [Comment Edited] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-02-12 Thread Trey Jones (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283971#comment-17283971
 ] 

Trey Jones edited comment on LUCENE-9754 at 2/12/21, 8:44 PM:
--

The inconsistency caused by chunking is a very confusing, albeit rare, 
problem—but I don't think it is what needs to be fixed here. The chunking 
algorithm assumes that whitespace is a reasonable place to split tokens, and 
that should be a valid assumption.

Right now the ICU Tokenizer tokenizes _cat 14th γάτα 1ος cat 1ος γάτα 14th_ as 
_cat | 14th | γάτα | 1οσ | cat | 1 | οσ | γάτα | 14 | th._ Does anyone expect 
the tokenization of _14th_ or _1ος_ (Greek "1st") to depend on the word before 
it? It happens across punctuation too, so a word in a different _sentence_ can 
trigger different tokenization; in this example, "The top results are: 1st is 
the Greek word for cat, γάτα. 2nd is the French word for cat, chat. 3rd is ..." 
No one would reasonably expect that you would get the tokens _1st, 2, nd,_ and 
_3rd_ out of this, but that's what happens. (Splitting on sentences wouldn't 
solve this one either—just replace periods with semicolons and it's one long 
sentence.)

The Word Boundary Rules that Robert linked to explicitly say _Do not break 
within sequences of digits, or digits adjacent to letters (“3a”, or “A3”)._ The 
[Unicode Segmentation 
Utility|https://util.unicode.org/UnicodeJsps/breaks.jsp?a=The%20top%20results%20are:%201st%20is%20the%20Greek%20word%20for%20cat,%20%CE%B3%CE%AC%CF%84%CE%B1.%202nd%20is%20the%20French%20word%20for%20cat,%20chat.%203rd%20is%20...]
 also doesn't split the tokens this way.

Like I said above, my guess is that there is a flag of some sort for "most 
recent character set" that should be reset to null or "none" or something at 
whitespace, line breaks, etc.
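
To make that guess concrete, here is a toy Java model of it. This is not the 
ICU or Lucene code: the class ScriptFlagSketch, its tokenize method and the 
resetAtWhitespace switch are all made up for illustration. It only models the 
hypothesis that digits carry no script of their own, that a token is split 
when a letter's script differs from the last script seen, and that the "last 
script" flag either survives whitespace (the suspected bug) or is cleared 
there.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Toy model of the "stale script flag" hypothesis; NOT the real ICU tokenizer. */
public class ScriptFlagSketch {

  /** Splits on whitespace and on script changes between letters. */
  public static List<String> tokenize(String text, boolean resetAtWhitespace) {
    List<String> tokens = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    UnicodeScript lastScript = null; // hypothetical "most recent script" flag

    for (int i = 0; i < text.length(); ) {
      int cp = text.codePointAt(i);
      i += Character.charCount(cp);

      if (Character.isWhitespace(cp)) {
        flush(current, tokens);
        if (resetAtWhitespace) {
          lastScript = null; // the proposed fix: forget the script at whitespace
        }
        continue;
      }
      if (Character.isLetter(cp)) {
        UnicodeScript script = UnicodeScript.of(cp);
        if (lastScript != null && script != lastScript) {
          flush(current, tokens); // script change: start a new token
        }
        lastScript = script;
      }
      // Digits and other non-letters are appended without touching the flag.
      current.appendCodePoint(cp);
    }
    flush(current, tokens);
    return tokens;
  }

  private static void flush(StringBuilder current, List<String> tokens) {
    if (current.length() > 0) {
      tokens.add(current.toString());
      current.setLength(0);
    }
  }

  public static void main(String[] args) {
    String katakana = "\u30A1 14th"; // KATAKANA LETTER SMALL A, space, 14th
    System.out.println(tokenize("x 14th", false));  // [x, 14th]
    System.out.println(tokenize(katakana, false));  // three tokens: the letter, 14, th
    System.out.println(tokenize(katakana, true));   // two tokens: the letter, 14th
  }
}
```

With the flag left alone at whitespace, the toy reproduces the reported 
pattern (two tokens when the scripts match, three when they differ); with the 
reset, both inputs give two tokens. Whether the real tokenizer behaves this 
way for the same reason is still only a guess.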

Other examples taken from English Wikipedia (it does not use the ICU Tokenizer, 
but it's a good place to find natural examples): resistor 1.5kΩ 12W (12|w); 
πρώτη 5G πόλη (5|G); the σ 2p has (2|p); Суворове в 3D (3|D); ФИБА 3x3 (3|x3); 
интерконективен 400kV (400|kv); collection crosses रु 18cr mark (18|cr); 2019 
వేడుక 17th Santosham Awards (17|th); หลวงพี่แจ๊ส 4G (4|g); factor of 2π (2|π); 
50m-bazen.pdf 50м базен (50|м); hydroxyprednisolone 16α,17α-acetonide 
(16|α|17α); 

That last one is particularly egregious, since 16α is separated, but 17α is not.




[jira] [Commented] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

2021-02-12 Thread Trey Jones (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283971#comment-17283971
 ] 

Trey Jones commented on LUCENE-9754:




> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> --
>
> Key: LUCENE-9754
> URL: https://issues.apache.org/jira/browse/LUCENE-9754
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 7.5
> Environment: Tested most recently on Elasticsearch 6.5.4.
>Reporter: Trey Jones
>Priority: Major
> Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by 
> the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 
> 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system 
> before the space is the same as the writing system after the number, then you 
> get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does 
> (trying to split on spaces to create <4k chunks) can create an artificial 
> boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the 
> unexpected split of the second token (_14th_). Because chunking changes can 
> ripple through a long document, editing text or the effects of a character 
> filter can cause changes in tokenization thousands of lines later in a 
> document.
> My guess is that some "previous character set" flag is not reset at the 
> space, and numbers are not in a character set, so _t_ is compared to _ァ_ and 
> they are not the same—causing a token split at the character set change—but 
> I'm not sure.
>  




[jira] [Resolved] (SOLR-15138) PerReplicaStates does not scale to large collections as well as state.json

2021-02-12 Thread Ishan Chattopadhyaya (Jira)



 [ 
https://issues.apache.org/jira/browse/SOLR-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ishan Chattopadhyaya resolved SOLR-15138.
-
Fix Version/s: 8.8.1
   Resolution: Fixed

Thanks [~mdrob] and [~ilan].

> PerReplicaStates does not scale to large collections as well as state.json
> --
>
> Key: SOLR-15138
> URL: https://issues.apache.org/jira/browse/SOLR-15138
> Project: Solr
>  Issue Type: Bug
>  Components: SolrCloud
>Affects Versions: 8.8
>Reporter: Mike Drob
>Assignee: Noble Paul
>Priority: Major
> Fix For: 8.8.1
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> I was testing PRS collection creation with larger collections today 
> (previously I had tested with many small collections) and it seemed to be 
> having trouble keeping up.
>  
> I was running a 4 node instance, each JVM with 4G Heap in k8s, and a single 
> zookeeper.
>  
> With this cluster configuration, I am able to create several (at least 10) 
> collections with 11 shards and 11 replicas using the "old way" of keeping 
> state. These collections are created serially, waiting for all replicas to be 
> active before proceeding.
> However, when attempting to do the same with PRS, the creation stalls on 
> collection 2 or 3, with several replicas stuck in a "down" state. Further, 
> when attempting to delete these collections using the regular API it 
> sometimes takes several attempts after getting stuck a few times as well.




[jira] [Commented] (SOLR-15138) PerReplicaStates does not scale to large collections as well as state.json



[ 
https://issues.apache.org/jira/browse/SOLR-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283964#comment-17283964
 ] 

ASF subversion and git services commented on SOLR-15138:


Commit 22c716bcd946fa2d49e6cea53c0f0dd689954d76 in lucene-solr's branch 
refs/heads/branch_8_8 from Ishan Chattopadhyaya
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=22c716b ]

SOLR-15138: Collection creation for PerReplicaStates does not scale to large 
collections as well as regular collections (closes #2359 and #2318)



[jira] [Commented] (SOLR-15138) PerReplicaStates does not scale to large collections as well as state.json

2021-02-12 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283963#comment-17283963
 ] 

ASF subversion and git services commented on SOLR-15138:


Commit 1a05c83b5f93b9014008e30b519c3f6064c78731 in lucene-solr's branch 
refs/heads/branch_8x from Ishan Chattopadhyaya
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1a05c83 ]

SOLR-15138: Collection creation for PerReplicaStates does not scale to large 
collections as well as regular collections (closes #2359 and #2318)



[jira] [Commented] (SOLR-15132) Add temporal graph query to the nodes Streaming Expression

2021-02-12 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283958#comment-17283958
 ] 

ASF subversion and git services commented on SOLR-15132:


Commit 4a42ecd9364efea9867976295cd0342c96875786 in lucene-solr's branch 
refs/heads/master from Joel Bernstein
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4a42ecd ]

SOLR-15132: Add temporal graph query to the nodes Streaming Expression


> Add temporal graph query to the nodes Streaming Expression
> --
>
> Key: SOLR-15132
> URL: https://issues.apache.org/jira/browse/SOLR-15132
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: streaming expressions
>Reporter: Joel Bernstein
>Priority: Major
> Attachments: SOLR-15132.patch, SOLR-15132.patch, SOLR-15132.patch, 
> SOLR-15132.patch
>
>
> The *nodes* Streaming Expression performs a breadth-first graph traversal. 
> This ticket will add a *window* parameter to allow the nodes expression to 
> traverse the graph within a window of time. 
> To take advantage of this feature you must index the content with a String 
> field which is an ISO timestamp truncated at ten seconds. Then the *window* 
> parameter can be applied to walk the graph within a *window prior* to a 
> specific ten-second window and perform aggregations. 
> *The main use case for this feature is auto-detecting lagged correlations.* 
> This is useful in many different fields.
> Here is an example using Solr logs to answer the following question: 
> What types of log events occur most frequently in the 30-second window prior 
> to the 10-second windows with the most slow queries?
> {code}
> nodes(logs,
>   facet(logs, q="qtime_s:[5000 TO *]", buckets="time_ten_seconds", rows="25"),
>   walk="time_ten_seconds->time_ten_seconds",
>   window="3",
>   gather="type_s",
>   count(*))
> {code}
> This ticket is phase 1. Phase 2 will auto-detect different ISO timestamp 
> truncations so that increments of one second, one minute, one day, etc. can 
> also be traversed using the same query syntax. There will be a follow-on 
> ticket for that after this ticket is completed. This will create a more 
> general-purpose time graph.
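
The ten-second truncation described in the ticket can be sketched in plain 
Java. This is only an illustration under stated assumptions: the class 
TenSecondBuckets and both method names are hypothetical, the exact string 
format Solr expects in the field is not given in the ticket (Instant's 
default ISO-8601 rendering is assumed here), and priorBuckets is one possible 
reading of "the window prior", not a description of how the nodes expression 
is implemented.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Sketch of ten-second bucketing; names and string format are assumptions. */
public class TenSecondBuckets {

  /** Rounds an ISO-8601 instant down to the start of its ten-second bucket. */
  public static String truncateToTenSeconds(String isoTimestamp) {
    long epoch = Instant.parse(isoTimestamp).getEpochSecond();
    long bucketStart = epoch - Math.floorMod(epoch, 10L);
    return Instant.ofEpochSecond(bucketStart).toString();
  }

  /** Lists the keys of the {@code window} ten-second buckets before the given one. */
  public static List<String> priorBuckets(String bucketKey, int window) {
    long start = Instant.parse(bucketKey).getEpochSecond();
    List<String> keys = new ArrayList<>();
    for (int i = 1; i <= window; i++) {
      keys.add(Instant.ofEpochSecond(start - 10L * i).toString());
    }
    return keys;
  }

  public static void main(String[] args) {
    String bucket = truncateToTenSeconds("2021-02-12T20:44:37Z");
    System.out.println(bucket);                  // 2021-02-12T20:44:30Z
    System.out.println(priorBuckets(bucket, 3)); // the three buckets covering the prior 30 seconds
  }
}
```

Under this reading, window="3" on ten-second keys corresponds to the 30 
seconds before each anchor bucket, which matches the example question in the 
ticket.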




[jira] [Updated] (SOLR-15132) Add temporal graph query to the nodes Streaming Expression

2021-02-12 Thread Joel Bernstein (Jira)



 [ 
https://issues.apache.org/jira/browse/SOLR-15132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joel Bernstein updated SOLR-15132:
--
Summary: Add temporal graph query to the nodes Streaming Expression  (was: 
Add window paramater to the nodes Streaming Expression)


[GitHub] [lucene-solr] chatman commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

2021-02-12 Thread ASF subversion and git services (Jira)



chatman commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r575485111



##
File path: 
solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##
@@ -256,6 +280,23 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
   shardRequestTracker.processResponses(results, shardHandler, false, null, Collections.emptySet());
   @SuppressWarnings({"rawtypes"})
   boolean failure = results.get("failure") != null && ((SimpleOrderedMap)results.get("failure")).size() > 0;
+  if(isPrs) {
+    TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
+    PerReplicaStates prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+    while (!timeout.hasTimedOut()) {
+      if(prs.allActive()) break;
+      Thread.sleep(100);
+      prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+    }
+    if (prs.allActive()) {
+      // we have successfully found all replicas to be ACTIVE
+      // Now ask Overseer to fetch the latest state of collection
+      // from ZK
+      ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
+    } else {
+      failure = true;
+    }
+  }
   if (failure) {
     // Let's cleanup as we hit an exception
     // We shouldn't be passing 'results' here for the cleanup as the response would then contain 'success'

Review comment:
   This is resolved now.
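
   For readers skimming the hunk: the wait loop is an instance of a 
poll-until-deadline pattern. A standalone sketch of that pattern follows, 
using no Solr classes. PollUntil and waitUntil are hypothetical names, and 
the timings in main are arbitrary.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Generic poll-until-deadline helper; a sketch, not code from the pull request. */
public class PollUntil {

  /**
   * Re-evaluates the condition every pollMillis until it holds or the timeout
   * expires. The condition is always checked before the deadline check, so it
   * is evaluated at least once and never skipped right before giving up.
   */
  public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out with the condition still false
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        return false;
      }
    }
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    // Condition becomes true roughly 50 ms after start; allow up to 2 seconds.
    boolean ok = waitUntil(() -> System.currentTimeMillis() - start >= 50, 2000, 10);
    System.out.println(ok);
  }
}
```

   Returning a boolean instead of throwing mirrors the hunk, where a timeout 
simply sets failure = true and lets the existing cleanup path run.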






[jira] [Commented] (SOLR-15138) PerReplicaStates does not scale to large collections as well as state.json



[ 
https://issues.apache.org/jira/browse/SOLR-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283942#comment-17283942
 ] 

ASF subversion and git services commented on SOLR-15138:


Commit 4b113067d8185a62d0ea1292f5088d5f8300d75e in lucene-solr's branch 
refs/heads/master from Ishan Chattopadhyaya
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4b11306 ]

SOLR-15138: Collection creation for PerReplicaStates does not scale to large 
collections as well as regular collections (#2318)



[jira] [Created] (SOLR-15155) Let CloudHttp2SolrClient accept an external Http2SolrClient Builder

2021-02-12 Thread Tomas Eduardo Fernandez Lobbe (Jira)

Tomas Eduardo Fernandez Lobbe created SOLR-15155:


 Summary: Let CloudHttp2SolrClient accept an external 
Http2SolrClient Builder
 Key: SOLR-15155
 URL: https://issues.apache.org/jira/browse/SOLR-15155
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Tomas Eduardo Fernandez Lobbe


{{CloudHttp2SolrClient}} doesn't expose many of the options that 
{{Http2SolrClient}} does (timeouts, max connections per host, etc.). 
Technically it accepts a fully built {{Http2SolrClient}}; however, in that case 
the client becomes "external", which means it won't be closed when the 
CloudClient is closed (one needs to keep a reference and close it explicitly 
after closing the CloudClient). 
Otherwise, {{CloudHttp2SolrClient}} uses an empty/default {{Http2SolrClient.Builder}} 
to build its internal client. I propose we allow providing a configured 
{{Http2SolrClient.Builder}} instead and let the {{CloudHttp2SolrClient}} just 
build from it. This would be optional, of course, and backwards compatible.
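
The ownership difference described above can be illustrated with a small, self-contained sketch. All class names below (InnerClient, InnerBuilder, CloudClient) are hypothetical stand-ins and not the SolrJ API; the sketch only assumes that a wrapper closes a client it built itself and leaves a client it was handed alone.

```java
// Hypothetical stand-ins for illustration only; these are NOT SolrJ classes.
final class ClientOwnershipSketch {

  /** Plays the role of the low-level HTTP client. */
  static final class InnerClient implements AutoCloseable {
    final int idleTimeoutMs;
    boolean closed;

    InnerClient(int idleTimeoutMs) {
      this.idleTimeoutMs = idleTimeoutMs;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Plays the role of the low-level client's builder. */
  static final class InnerBuilder {
    private int idleTimeoutMs = 0; // arbitrary default for the sketch

    InnerBuilder idleTimeout(int ms) {
      this.idleTimeoutMs = ms;
      return this;
    }

    InnerClient build() {
      return new InnerClient(idleTimeoutMs);
    }
  }

  /** Plays the role of the cloud-aware wrapper. */
  static final class CloudClient implements AutoCloseable {
    final InnerClient inner;
    private final boolean ownsInner;

    /** "External" client: the caller keeps ownership and must close it. */
    CloudClient(InnerClient external) {
      this.inner = external;
      this.ownsInner = false;
    }

    /** Proposed variant: build from a configured builder and own the result. */
    CloudClient(InnerBuilder configured) {
      this.inner = configured.build();
      this.ownsInner = true;
    }

    @Override
    public void close() {
      if (ownsInner) {
        inner.close(); // only close what this wrapper created
      }
    }
  }

  public static void main(String[] args) {
    InnerClient external = new InnerBuilder().idleTimeout(5000).build();
    try (CloudClient cloud = new CloudClient(external)) {
      // use cloud
    }
    System.out.println("external closed after wrapper close: " + external.closed);
    external.close(); // caller's responsibility

    CloudClient owning = new CloudClient(new InnerBuilder().idleTimeout(5000));
    owning.close();
    System.out.println("internal closed after wrapper close: " + owning.inner.closed);
  }
}
```

With the builder-based constructor the caller configures the options once and has a single object to close, which is the convenience the proposal is after.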




[jira] [Created] (SOLR-15154) Let Http2SolrClient pass Basic Auth credentials to all requests

2021-02-12 Thread Tomas Eduardo Fernandez Lobbe (Jira)

Tomas Eduardo Fernandez Lobbe created SOLR-15154:


 Summary: Let Http2SolrClient pass Basic Auth credentials to all 
requests
 Key: SOLR-15154
 URL: https://issues.apache.org/jira/browse/SOLR-15154
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SolrJ
Reporter: Tomas Eduardo Fernandez Lobbe


In {{HttpSolrClient}}, one could specify credentials [at the JVM 
level|https://lucene.apache.org/solr/guide/8_8/basic-authentication-plugin.html#global-jvm-basic-auth-credentials],
 and that would make all requests to Solr carry them. This doesn't work with the 
Http2 clients, and I think it would be very useful to have. 
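
For reference, "passing Basic Auth credentials to all requests" amounts to attaching the same Authorization header to each request. A minimal sketch of how that header value is formed under HTTP Basic authentication follows; the class and method names are made up for illustration, and how the Http2 clients would actually obtain and attach the credentials is left open here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative helper only; the name is hypothetical and not part of SolrJ.
final class BasicAuthHeader {

  /** Returns the value of the Authorization header for HTTP Basic auth. */
  static String headerValue(String user, String password) {
    // Basic auth encodes "user:password" with Base64.
    String pair = user + ":" + password;
    String encoded = Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    return "Basic " + encoded;
  }

  public static void main(String[] args) {
    // Example credentials, chosen only for the demonstration.
    System.out.println("Authorization: " + headerValue("solr", "SolrRocks"));
  }
}
```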




[GitHub] [lucene-site] chatman closed pull request #39: Adding slack links to community and discussion pages



chatman closed pull request #39:
URL: https://github.com/apache/lucene-site/pull/39


   




[GitHub] [lucene-site] chatman commented on pull request #39: Adding slack links to community and discussion pages



chatman commented on pull request #39:
URL: https://github.com/apache/lucene-site/pull/39#issuecomment-778366288


   Thanks for the review, @epugh , @anshumg !




[GitHub] [lucene-site] anshumg commented on pull request #39: Adding slack links to community and discussion pages



anshumg commented on pull request #39:
URL: https://github.com/apache/lucene-site/pull/39#issuecomment-778357060


   Thanks for adding this information, @chatman 
   
   Can you please highlight and maybe list the unofficial channels separately? 
You could list those as "Third Party", perhaps? 
   You've mentioned this already, but I just wanted to be sure that the channels 
aren't perceived as official ones, even accidentally.




[jira] [Created] (LUCENE-9769) Hunspell: KEEPCASE should take precedence over affixed forms

Peter Gromov created LUCENE-9769:


 Summary: Hunspell: KEEPCASE should take precedence over affixed 
forms
 Key: LUCENE-9769
 URL: https://issues.apache.org/jira/browse/LUCENE-9769
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Peter Gromov


If an inflected form is listed in the dictionary as KEEPCASE, its other case 
variations should be considered misspelled, even if affix removal would result 
in a non-KEEPCASE root.




[GitHub] [lucene-site] epugh commented on a change in pull request #39: Adding slack links to community and discussion pages



epugh commented on a change in pull request #39:
URL: https://github.com/apache/lucene-site/pull/39#discussion_r575407662



##
File path: content/pages/core/discussion.md
##
@@ -86,7 +86,9 @@ but developers should be careful to transfer all the official 
decisions or usefu
 
 ## Slack
 
-The project's Slack channel is the **#lucene-dev** channel in the **the-asf** 
organization. Link: 
+- The project's Slack channel are the `#lucene-dev` and `#solr-dev` channels 
in the `the-asf` organization. These are primarily for developer discussions 
and not meant as support channels. Link: 

+- For Solr support, there is a (community maintained/unofficial) Slack 
organization that relays messages bi-directionally to/from the officially 
supported IRC channels. Link: https://s.apache.org/solr-slack
+- For relevance related discussions (Solr or other search engines), there's an 
unofficial Slack organization: https://opensourceconnections.com/slack

Review comment:
   This Slack community's name is "Relevance Slack"; the invite just 
happens to be through OSC. So how about "For relevance related discussions 
(Solr or other search engines), join Relevance Slack: 
http://opensourceconnections.com/slack"






[GitHub] [lucene-solr] cpoerschke commented on a change in pull request #2350: SOLR-15149: model creation errors fixes



cpoerschke commented on a change in pull request #2350:
URL: https://github.com/apache/lucene-solr/pull/2350#discussion_r575388380



##
File path: solr/contrib/ltr/src/java/org/apache/solr/ltr/model/LinearModel.java
##
@@ -80,10 +80,10 @@
 
   public void setWeights(Object weights) {
 @SuppressWarnings({"unchecked"})
-final Map<String,Double> modelWeights = (Map<String,Double>) weights;
+final Map<String,Number> modelWeights = (Map<String,Number>) weights;
 for (int ii = 0; ii < features.size(); ++ii) {
   final String key = features.get(ii).getName();
-  final Double val = modelWeights.get(key);
+  final Number val = modelWeights.get(key);

Review comment:
   neat!

##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/store/rest/ManagedModelStore.java
##
@@ -294,11 +294,18 @@ private static void initWrapperModel(SolrResourceLoader 
solrResourceLoader,
 return modelMap;
   }
 
-  private static Feature lookupFeatureFromFeatureMap(Map 
featureMap,
-  FeatureStore featureStore) {
-final String featureName = (String)featureMap.get(NAME_KEY);
-return (featureName == null ? null
-: featureStore.get(featureName));
+  private static Feature lookupFeatureFromFeatureMap(Map 
featureMap, FeatureStore featureStore)
+  {
+final String featureName = (String) featureMap.get(NAME_KEY);
+Feature extractedFromStore = featureName == null ? null : 
featureStore.get(featureName);
+if (extractedFromStore == null) {
+  if (featureStore.getFeatures().isEmpty()) {
+throw new ModelException("Feature Store not found: " + 
featureStore.getName());
+  } else {
+throw new ModelException("Feature:" + featureName + " not found in 
store: " + featureStore.getName());

Review comment:
   ```suggestion
   throw new ModelException("Feature: " + featureName + " not found in 
store: " + featureStore.getName());
   ```

##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/model/LTRScoringModel.java
##
@@ -108,7 +108,7 @@ public static LTRScoringModel 
getInstance(SolrResourceLoader solrResourceLoader,
 SolrPluginUtils.invokeSetters(model, params.entrySet());
   }
 } catch (final Exception e) {
-  throw new ModelException("Model type does not exist " + className, e);
+  throw new ModelException("Model creation failed for " + className, e);

Review comment:
   minor: how about not saying "creation" but perhaps "loading" or 
something to that effect? Technically the "creation", i.e. storage of the 
model's JSON, succeeded (if I remember things right), but it's the "loading" or 
"using" of the model's JSON that encounters the error. So the user seeing the 
error needs to look at the content of what they uploaded, not at the 
mechanics of how they uploaded/created the model.

##
File path: 
solr/contrib/ltr/src/test-files/modelExamples/multipleadditivetreesmodel_unknownFeature.json
##
@@ -0,0 +1,38 @@
+{
+"class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",
+"name":"multipleadditivetreesmodel",
+"features":[
+{ "name": "notExist1"},
+{ "name": "notExist2"}
+],
+"params":{
+"trees": [
+{
+"weight" : "1f",
+"root": {
+"feature": "matchedTitle",
+"threshold": "0.5f",
+"left" : {
+"value" : "-100"
+},
+"right": {
+"feature" : 
"constantScoreToForceMultipleAdditiveTreesScoreAllDocs",

Review comment:
   ```suggestion
   "feature" : "notExist2",
   ```

##
File path: 
solr/contrib/ltr/src/java/org/apache/solr/ltr/store/rest/ManagedModelStore.java
##
@@ -294,11 +294,18 @@ private static void initWrapperModel(SolrResourceLoader 
solrResourceLoader,
 return modelMap;
   }
 
-  private static Feature lookupFeatureFromFeatureMap(Map 
featureMap,
-  FeatureStore featureStore) {
-final String featureName = (String)featureMap.get(NAME_KEY);
-return (featureName == null ? null
-: featureStore.get(featureName));
+  private static Feature lookupFeatureFromFeatureMap(Map 
featureMap, FeatureStore featureStore)
+  {
+final String featureName = (String) featureMap.get(NAME_KEY);
+Feature extractedFromStore = featureName == null ? null : 
featureStore.get(featureName);
+if (extractedFromStore == null) {
+  if (featureStore.getFeatures().isEmpty()) {
+throw new ModelException("Feature Store not found: " + 
featureStore.getName());

Review comment:
   ```suggestion
   throw new ModelException("Missing or empty feature store: " + 
featureStore.getName());
   ```

##
File path: 
solr/contrib/ltr/src/test-files/modelExamples/multipleadditivetreesmodel_unknownFeature.json
##
@@ -0,0 +1,38 @@
+{
+"class":"org.apache.solr.ltr.model.MultipleAdditiveTreesModel",

[GitHub] [lucene-solr] rmuir commented on pull request #2362: LUCENE-9767: infrastructure for icu regeneration in place.



rmuir commented on pull request #2362:
URL: https://github.com/apache/lucene-solr/pull/2362#issuecomment-778342807


   @dweiss This is now doing everything the ant build was doing, thanks! I 
tested everything works: nuked all the generated files locally, regenerated 
them, and git status is clean.
   
   It still has the hairiness of having to deal with the c package, but that's 
nothing new. I added comments for now about how to deal with this locally.




[GitHub] [lucene-site] chatman commented on pull request #39: Adding slack links to community and discussion pages



chatman commented on pull request #39:
URL: https://github.com/apache/lucene-site/pull/39#issuecomment-778332733


   @epugh FYI, contains a link to the OSC slack.




[GitHub] [lucene-site] chatman opened a new pull request #39: Adding slack links to community and discussion pages



chatman opened a new pull request #39:
URL: https://github.com/apache/lucene-site/pull/39


   There are more options for support these days than the community page 
mentions. Let's add as much info as possible.




[jira] [Updated] (SOLR-15149) Learning To Rank model upload fails generically

2021-02-12 Thread Christine Poerschke (Jira)



 [ 
https://issues.apache.org/jira/browse/SOLR-15149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke updated SOLR-15149:
---
Component/s: contrib - LTR

> Learning To Rank model upload fails generically
> ---
>
> Key: SOLR-15149
> URL: https://issues.apache.org/jira/browse/SOLR-15149
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: contrib - LTR
>Reporter: Alessandro Benedetti
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When uploading a model that uses a nonexistent feature store or other incorrect 
> parameters, you get:
> "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","java.lang.ClassCastException"],
> "msg":"org.apache.solr.ltr.model.ModelException: Model type does not 
> exist org.apache.solr.ltr.model.LinearModel",
> "code":400}}
> in the response. The logs don't help much out of the box; I had to resort to 
> remote debugging, and of course we don't want the typical user to do that.
> The reason is in org/apache/solr/ltr/model/LTRScoringModel.java:111
> {code:java}
> try {
>   // create an instance of the model
>   model = solrResourceLoader.newInstance(
>   className,
>   LTRScoringModel.class,
>   new String[0], // no sub packages
>   new Class[] { String.class, List.class, List.class, String.class, 
> List.class, Map.class },
>   new Object[] { name, features, norms, featureStoreName, 
> allFeatures, params });
>   if (params != null) {
> SolrPluginUtils.invokeSetters(model, params.entrySet());
>   }
> } catch (final Exception e) {
>   throw new ModelException("Model type does not exist " + className, e);
> }
> {code}
> This happens when you:
> - use a nonexistent feature store
> - use a nonexistent feature
> - use an integer instead of a Double as a weight in a linear model
> Unless there are objections, we should replace this message with the real cause.
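
The integer-weight case from the list above can be reproduced in isolation. The sketch below is not the LTR code; it uses a made-up class and sample map, and assumes that a weight written as 1 in the uploaded JSON arrives as an Integer in the parsed map. It shows why reading the value as a Double fails with a ClassCastException, while reading it as a Number and converting works.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only; class, method and feature names are hypothetical.
final class WeightCastSketch {

  /** Reads a weight assuming every value is a Double. */
  @SuppressWarnings("unchecked")
  static double readAsDouble(Object weights, String key) {
    // The unchecked cast itself always succeeds because of type erasure.
    final Map<String, Double> m = (Map<String, Double>) weights;
    // The implicit cast to Double happens here and fails for an Integer value.
    final Double val = m.get(key);
    return val; // a missing key would cause a NullPointerException; not handled in this sketch
  }

  /** Reads a weight accepting any numeric type. */
  @SuppressWarnings("unchecked")
  static double readAsNumber(Object weights, String key) {
    final Map<String, Number> m = (Map<String, Number>) weights;
    final Number val = m.get(key);
    return val.doubleValue(); // a missing key is not handled in this sketch either
  }

  /** Sample data: one integer weight and one floating-point weight. */
  static Map<String, Object> sampleWeights() {
    final Map<String, Object> w = new LinkedHashMap<>();
    w.put("titleMatch", 1); // Integer, as if the JSON said {"titleMatch": 1}
    w.put("popularity", 0.5); // Double
    return w;
  }

  public static void main(String[] args) {
    final Object weights = sampleWeights();
    System.out.println("as Number: " + readAsNumber(weights, "titleMatch"));
    try {
      readAsDouble(weights, "titleMatch");
    } catch (ClassCastException e) {
      System.out.println("as Double: ClassCastException");
    }
  }
}
```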




[jira] [Created] (SOLR-15153) Collection selector drop down does not sort collections

2021-02-12 Thread Mike Drob (Jira)

Mike Drob created SOLR-15153:


 Summary: Collection selector drop down does not sort collections
 Key: SOLR-15153
 URL: https://issues.apache.org/jira/browse/SOLR-15153
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
  Components: Admin UI
Reporter: Mike Drob


The collections selector drop down on the admin UI does not sort collections, 
making it harder to find the one that you care about.




[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2363: LUCENE-9766: Hunspell: add API for retrieving dictionary morphologica…



donnerpeter commented on a change in pull request #2363:
URL: https://github.com/apache/lucene-solr/pull/2363#discussion_r575352984



##
File path: 
lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestAllDictionaries.java
##
@@ -205,6 +205,7 @@ private static String memoryUsageSummary(Dictionary dic) {
 + ("strips=" + RamUsageTester.humanSizeOf(dic.stripData) + ", ")
 + ("conditions=" + RamUsageTester.humanSizeOf(dic.patterns) + ", ")
 + ("affixData=" + RamUsageTester.humanSizeOf(dic.affixData) + ", ")
++ ("morphData=" + RamUsageTester.humanSizeOf(dic.morphData) + ", ")

Review comment:
   There's a slight memory usage increase, ~6% on average. I hope this is 
fine.






[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2363: LUCENE-9766: Hunspell: add API for retrieving dictionary morphologica…



donnerpeter commented on a change in pull request #2363:
URL: https://github.com/apache/lucene-solr/pull/2363#discussion_r575352045



##
File path: 
lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/StemmerTestBase.java
##
@@ -43,6 +43,11 @@ static void init(String affix, String dictionary) throws 
IOException, ParseExcep
 
   static void init(boolean ignoreCase, String affix, String... dictionaries)
   throws IOException, ParseException {
+stemmer = new Stemmer(loadDictionary(ignoreCase, affix, dictionaries));
+  }
+
+  static Dictionary loadDictionary(boolean ignoreCase, String affix, String... 
dictionaries)

Review comment:
   extract a method to call from a test






[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2354: LUCENE-9765: Hunspell: rename SpellChecker to Hunspell, fix test name…



donnerpeter commented on a change in pull request #2354:
URL: https://github.com/apache/lucene-solr/pull/2354#discussion_r575345376



##
File path: lucene/CHANGES.txt
##
@@ -89,8 +89,8 @@ API Changes
 
 Improvements
 
-* LUCENE-9687: Hunspell support improvements: add SpellChecker API, support 
default encoding and
-  BREAK/FORBIDDENWORD/COMPOUNDRULE affix rules, improve stemming of all-caps 
words (Peter Gromov)
+* LUCENE-9687: Hunspell support improvements: add API for spell-checking and 
suggestions, support compound words,
+  fix various behavior differences between Java and C++ implementations, 
improve performance (Peter Gromov, Dawid Weiss)

Review comment:
   I thought "via" is for pure review activity, but you've made meaningful 
changes yourself which I didn't even plan. Feel free to change to "via" if you 
like :)






[jira] [Updated] (LUCENE-9687) Hunspell support improvements



 [ 
https://issues.apache.org/jira/browse/LUCENE-9687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Gromov updated LUCENE-9687:
-
Description: 
I'd like Lucene's Hunspell support to be on a par with the native C++ Hunspell 
for spellchecking and suggestions, at least for some languages. So I propose to:
* support the affix rules necessary for English, German, French, Spanish and
Russian dictionaries, possibly more languages later
* mirror Hunspell's suggestion algorithm in Lucene
* provide public APIs for spellchecking, suggestion, stemming, morphological 
data
* check corpora for specific languages to find and fix spellchecking/suggestion 
discrepancies between Lucene's implementation and Hunspell/C++


  was:
I'd like Lucene's Hunspell support to be on a par with the native C++ Hunspell 
for spellchecking and suggestions, at least for some languages. So I propose to:
* support the affix rules necessary for English, German, French, Spanish and
Russian dictionaries, possibly more languages later
* provide a public API to check if a word is misspelled
* mirror Hunspell's suggestion algorithm in Lucene, probably in the
"src/suggest" module


> Hunspell support improvements
> -
>
> Key: LUCENE-9687
> URL: https://issues.apache.org/jira/browse/LUCENE-9687
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Peter Gromov
>Priority: Major
> Fix For: master (9.0)
>
>
> I'd like Lucene's Hunspell support to be on a par with the native C++ 
> Hunspell for spellchecking and suggestions, at least for some languages. So I 
> propose to:
> * support the affix rules necessary for English, German, French, Spanish and
> Russian dictionaries, possibly more languages later
> * mirror Hunspell's suggestion algorithm in Lucene
> * provide public APIs for spellchecking, suggestion, stemming, 
> morphological data
> * check corpora for specific languages to find and fix 
> spellchecking/suggestion discrepancies between Lucene's implementation and 
> Hunspell/C++




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283804#comment-17283804
 ] 

Robert Muir commented on LUCENE-9767:
-

{quote}
I'd add a task to download gennorm2
{quote}

What do you mean, download and compile C code? I think we should let that be :) 
Besides, the version of icu4c must match the targeted icu4j version exactly. 

{quote}
If you'd like to take over and try, please go ahead (and push to that PR 
directly)? If not, I'll get back to this later in the evening.
{quote}

I'll try it out, see if I can make some progress, thanks for getting it started!


> port ICU regeneration to gradle build
> -
>
> Key: LUCENE-9767
> URL: https://issues.apache.org/jira/browse/LUCENE-9767
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When we upgrade ICU dependency we have to regenerate a lot of stuff. The ant 
> build has it all automated.
> You do need icu4c installed corresponding to the icu4j version to regenerate 
> some of the data structures. There are also some Java regenerators that do 
> processing of Unicode data and so on.
> Will try to see if I can get this hobbling when I have the time, the icu 
> dependency is quite old at this point. The hard part for me is learning 
> gradle's crazy ways every time, but maybe i can start it off super-ugly with 
> something like shell script that everyone hates, but at least works 
> correctly. 
> cc [~dweiss]




[GitHub] [lucene-solr] donnerpeter opened a new pull request #2363: LUCENE-9766: Hunspell: add API for retrieving dictionary morphologica…



donnerpeter opened a new pull request #2363:
URL: https://github.com/apache/lucene-solr/pull/2363


   …l data and stemming
   
   
   
   
   # Description
   
   We need to rank suggestions based on metadata associated with corresponding 
dictionary entries. For that, we need a stemming API (to get the entries) and 
an API to get the morphological data (where we'd store the needed info)
   
   # Solution
   
   Add public `Hunspell.getRoots` and `Dictionary.lookupEntries`
   
   # Tests
   
   Yes, a test for each new introduced method
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283801#comment-17283801
 ] 

Dawid Weiss commented on LUCENE-9767:
-

Added an example of running gennorm but really have to be gone now.





[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283793#comment-17283793
 ] 

Dawid Weiss commented on LUCENE-9767:
-

Hi Robert. I've filed a PR that runs just the first step here (and it does seem 
to work just fine). I've got to dash home now - the rest should be quite 
straightforward. I'd add a task to download gennorm2 (there are examples of 
that); alternatively, just hardcode gennorm2's execution for now (using 
project.exec). 

If you'd like to take over and try, please go ahead (and push to that PR 
directly)? If not, I'll get back to this later in the evening.





[GitHub] [lucene-solr] dweiss opened a new pull request #2362: LUCENE-9767: infrastructure for icu regeneration in place.



dweiss opened a new pull request #2362:
URL: https://github.com/apache/lucene-solr/pull/2362


   




[jira] [Resolved] (LUCENE-9768) Add source sets for src/tools

2021-02-12 Thread ASF subversion and git services (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9768.
-
Fix Version/s: master (9.0)
   Resolution: Fixed

> Add source sets for src/tools
> -
>
> Key: LUCENE-9768
> URL: https://issues.apache.org/jira/browse/LUCENE-9768
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> src/tools contain project-specific utilities. This should be a separate 
> source set (with its separate compilation phases). 




[jira] [Commented] (LUCENE-9768) Add source sets for src/tools



[ 
https://issues.apache.org/jira/browse/LUCENE-9768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283788#comment-17283788
 ] 

ASF subversion and git services commented on LUCENE-9768:
-

Commit f7e42bdb35b4c1a834a902ba3e3524c3b81bb958 in lucene-solr's branch 
refs/heads/master from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=f7e42bd ]

LUCENE-9768: Add source sets for src/tools, clean up forbidden API and 
formatting errors (#2361)



> Add source sets for src/tools
> -
>
> Key: LUCENE-9768
> URL: https://issues.apache.org/jira/browse/LUCENE-9768
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> src/tools contains project-specific utilities. This should be a separate 
> source set (with its separate compilation phases). 




[GitHub] [lucene-solr] dweiss merged pull request #2361: LUCENE-9768: Add source sets for src/tools, clean up forbidden API and formatting errors



dweiss merged pull request #2361:
URL: https://github.com/apache/lucene-solr/pull/2361


   




[GitHub] [lucene-solr] tkaessmann closed pull request #2135: SOLR-15038: Add elevateDocsWithoutMatchingQ and onlyElevatedReprese…



tkaessmann closed pull request #2135:
URL: https://github.com/apache/lucene-solr/pull/2135


   




[GitHub] [lucene-solr] tkaessmann commented on pull request #2135: SOLR-15038: Add elevateDocsWithoutMatchingQ and onlyElevatedReprese…



tkaessmann commented on pull request #2135:
URL: https://github.com/apache/lucene-solr/pull/2135#issuecomment-778279414


   not needed anymore




[GitHub] [lucene-solr] dweiss commented on a change in pull request #2354: LUCENE-9765: Hunspell: rename SpellChecker to Hunspell, fix test name…



dweiss commented on a change in pull request #2354:
URL: https://github.com/apache/lucene-solr/pull/2354#discussion_r575313858



##
File path: lucene/CHANGES.txt
##
@@ -89,8 +89,8 @@ API Changes
 
 Improvements
 
-* LUCENE-9687: Hunspell support improvements: add SpellChecker API, support default encoding and
-  BREAK/FORBIDDENWORD/COMPOUNDRULE affix rules, improve stemming of all-caps words (Peter Gromov)
+* LUCENE-9687: Hunspell support improvements: add API for spell-checking and suggestions, support compound words,
+  fix various behavior differences between Java and C++ implementations, improve performance (Peter Gromov, Dawid Weiss)

Review comment:
   Doesn't matter.






[GitHub] [lucene-solr] dweiss commented on a change in pull request #2354: LUCENE-9765: Hunspell: rename SpellChecker to Hunspell, fix test name…



dweiss commented on a change in pull request #2354:
URL: https://github.com/apache/lucene-solr/pull/2354#discussion_r575313672



##
File path: lucene/CHANGES.txt
##
@@ -89,8 +89,8 @@ API Changes
 
 Improvements
 
-* LUCENE-9687: Hunspell support improvements: add SpellChecker API, support default encoding and
-  BREAK/FORBIDDENWORD/COMPOUNDRULE affix rules, improve stemming of all-caps words (Peter Gromov)
+* LUCENE-9687: Hunspell support improvements: add API for spell-checking and suggestions, support compound words,
+  fix various behavior differences between Java and C++ implementations, improve performance (Peter Gromov, Dawid Weiss)

Review comment:
   There's actually a convention for this: "Peter Gromov via Dawid Weiss".






[GitHub] [lucene-solr] chatman commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json



chatman commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r575312927



##
File path: 
solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##
@@ -256,6 +280,23 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       shardRequestTracker.processResponses(results, shardHandler, false, null, Collections.emptySet());
       @SuppressWarnings({"rawtypes"})
       boolean failure = results.get("failure") != null && ((SimpleOrderedMap)results.get("failure")).size() > 0;
+      if(isPrs) {
+        TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
+        PerReplicaStates prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        while (!timeout.hasTimedOut()) {
+          if(prs.allActive()) break;
+          Thread.sleep(100);
+          prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        }
+        if (prs.allActive()) {
+          // we have successfully found all replicas to be ACTIVE
+          // Now ask Overseer to fetch the latest state of collection
+          // from ZK
+          ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
+        } else {
+          failure = true;
+        }
+      }
       if (failure) {
         // Let's cleanup as we hit an exception
         // We shouldn't be passing 'results' here for the cleanup as the response would then contain 'success'

Review comment:
   ^ My comment above was based on the 8x change; it seems this change was 
missed when porting the changes over to this PR (for master). I'll update 
this branch and bring it in sync with 8x soon.






[jira] [Commented] (LUCENE-9766) Hunspell: add API for retrieving dictionary morphological data and stemming



[ 
https://issues.apache.org/jira/browse/LUCENE-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283775#comment-17283775
 ] 

Dawid Weiss commented on LUCENE-9766:
-

No, I think it's fine and nicely collects information about what's been done. I 
just thought the task itself nears completion.

> Hunspell: add API for retrieving dictionary morphological data and stemming
> ---
>
> Key: LUCENE-9766
> URL: https://issues.apache.org/jira/browse/LUCENE-9766
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Peter Gromov
>Priority: Major
>





[GitHub] [lucene-solr] dweiss commented on pull request #2361: LUCENE-9768: Add source sets for src/tools, clean up forbidden API and formatting errors



dweiss commented on pull request #2361:
URL: https://github.com/apache/lucene-solr/pull/2361#issuecomment-778268089


   Never mind, I see it in the generated .classpath. Should be fine.




[GitHub] [lucene-solr] dweiss commented on pull request #2361: LUCENE-9768: Add source sets for src/tools, clean up forbidden API and formatting errors



dweiss commented on pull request #2361:
URL: https://github.com/apache/lucene-solr/pull/2361#issuecomment-778267709


   @rmuir Can you check if this works with Eclipse? Tools should be included in 
classpath now.




[jira] [Commented] (LUCENE-9766) Hunspell: add API for retrieving dictionary morphological data and stemming



[ 
https://issues.apache.org/jira/browse/LUCENE-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283771#comment-17283771
 ] 

Peter Gromov commented on LUCENE-9766:
--

That's because I'm creating a new subtask for each PR, as advised. I can stop 
doing that :)

> Hunspell: add API for retrieving dictionary morphological data and stemming
> ---
>
> Key: LUCENE-9766
> URL: https://issues.apache.org/jira/browse/LUCENE-9766
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Peter Gromov
>Priority: Major
>





[GitHub] [lucene-solr] dweiss opened a new pull request #2361: LUCENE-9768: Add source sets for src/tools, clean up forbidden API and formatting errors



dweiss opened a new pull request #2361:
URL: https://github.com/apache/lucene-solr/pull/2361


   




[jira] [Commented] (LUCENE-9766) Hunspell: add API for retrieving dictionary morphological data and stemming



[ 
https://issues.apache.org/jira/browse/LUCENE-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283751#comment-17283751
 ] 

Dawid Weiss commented on LUCENE-9766:
-

Ok, up to you. That list of subtasks gets very long, but I'll leave it up to 
you.

> Hunspell: add API for retrieving dictionary morphological data and stemming
> ---
>
> Key: LUCENE-9766
> URL: https://issues.apache.org/jira/browse/LUCENE-9766
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Peter Gromov
>Priority: Major
>





[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2354: LUCENE-9765: Hunspell: rename SpellChecker to Hunspell, fix test name…



donnerpeter commented on a change in pull request #2354:
URL: https://github.com/apache/lucene-solr/pull/2354#discussion_r575288848



##
File path: lucene/CHANGES.txt
##
@@ -89,8 +89,8 @@ API Changes
 
 Improvements
 
-* LUCENE-9687: Hunspell support improvements: add SpellChecker API, support default encoding and
-  BREAK/FORBIDDENWORD/COMPOUNDRULE affix rules, improve stemming of all-caps words (Peter Gromov)
+* LUCENE-9687: Hunspell support improvements: add API for spell-checking and suggestions, support compound words,
+  fix various behavior differences between Java and C++ implementations, improve performance (Peter Gromov, Dawid Weiss)

Review comment:
   I think you do, as you've dedicated quite a lot of your time to my and 
our own PRs. We could also change that to "mostly Peter Gromov, but also Dawid 
Weiss" if you like :)






[GitHub] [lucene-solr] gerlowskija opened a new pull request #2360: SOLR-13608: Incremental backup file format (#2250)

gerlowskija opened a new pull request #2360:
URL: https://github.com/apache/lucene-solr/pull/2360

# Description

Currently backups in SolrCloud are done as full snapshots. The full index
is uploaded each time, even if many of the files are unchanged since the
last backup.

# Solution

This commit introduces a new way for Solr to do backups (with a new
underlying file structure). This new "incremental" backup process
improves over the existing backup mechanism in several ways:

- multiple backup "points" can now be stored at a given backup
location/name, allowing users to choose which point in time they want
to restore
- subsequent backups skip uploading files that were already uploaded by
previous backups, saving time and network bandwidth.
- files are checksummed as they're uploaded, ensuring that corrupted
indices aren't persisted and accidentally restored later.

Incremental backups are now the default, and traditional backups
should now be considered 'deprecated' but can still be created by
passing an `incremental=false` parameter on backup requests.
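
As a rough illustration of the skip-and-checksum idea described above, here is a minimal sketch. It is not the code in this PR: the class `IncrementalBackupSketch`, its `backup` method, the in-memory checksum map, and the choice of CRC32 are all assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical sketch of an incremental file backup; not Solr's implementation. */
public class IncrementalBackupSketch {

  // Checksums of files already stored at the backup location, keyed by file name.
  // Kept in memory here for brevity; a real backup would persist this metadata.
  private final Map<String, Long> storedChecksums = new HashMap<>();

  /** Computes a CRC32 checksum over the given bytes. */
  public static long checksum(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data);
    return crc.getValue();
  }

  /** Copies only files not yet stored with the same checksum; returns how many were uploaded. */
  public int backup(Path indexDir, Path backupDir) throws IOException {
    Files.createDirectories(backupDir);
    int uploaded = 0;
    try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
      for (Path file : files) {
        if (!Files.isRegularFile(file)) {
          continue;
        }
        String name = file.getFileName().toString();
        byte[] data = Files.readAllBytes(file);
        long sum = checksum(data);
        Long previous = storedChecksums.get(name);
        if (previous != null && previous.longValue() == sum) {
          continue; // unchanged since an earlier backup: skip the upload
        }
        Path target = backupDir.resolve(name);
        Files.write(target, data);
        // Re-read the stored copy so a corrupted upload is detected immediately.
        if (checksum(Files.readAllBytes(target)) != sum) {
          throw new IOException("Checksum mismatch for " + name);
        }
        storedChecksums.put(name, sum);
        uploaded++;
      }
    }
    return uploaded;
  }

  public static void main(String[] args) throws IOException {
    Path index = Files.createTempDirectory("index");
    Path backup = Files.createTempDirectory("backup");
    IncrementalBackupSketch sketch = new IncrementalBackupSketch();

    Files.write(index.resolve("_0.cfs"), "segment zero".getBytes(StandardCharsets.UTF_8));
    System.out.println(sketch.backup(index, backup)); // uploads _0.cfs

    Files.write(index.resolve("_1.cfs"), "segment one".getBytes(StandardCharsets.UTF_8));
    System.out.println(sketch.backup(index, backup)); // uploads only _1.cfs
  }
}
```

The second call skips `_0.cfs` because its checksum is already recorded, which is the saving the description refers to.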

# Tests

See TestIncrementalCoreBackup, TestStressIncrementalBackup,
HdfsBackupRepositoryIntegrationTest, LocalFSCloudIncrementalBackupTest, and
HdfsCloudIncrementalBackupTest among others.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms
to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request
title.
- [x] I have given Solr maintainers
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
to contribute to my PR branch. (optional but recommended)
- [ ] I have run `ant precommit test`.
- [x] I have added tests for my changes.
- [x] I have added documentation for the [Ref
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide)
(for Solr changes only).


[jira] [Commented] (LUCENE-9766) Hunspell: add API for retrieving dictionary morphological data and stemming



[ 
https://issues.apache.org/jira/browse/LUCENE-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283749#comment-17283749
 ] 

Peter Gromov commented on LUCENE-9766:
--

This one is indeed a bit outside the current wording of LUCENE-9687, but that 
could be fixed by making that one more general (e.g. "add public APIs for 
spellchecking, stemming, etc.").

I wouldn't close the improvements issue, because I think performance 
improvements would nicely fit there, and I also still have some spellchecking 
differences between C++/Java versions which aren't covered by Hunspell's tests 
but need to be fixed. After that, I'd also use a corpus to check that 
suggestions work in the same way. Then I'd consider the improvements done. I 
can change the description to fit that. What do you think?

> Hunspell: add API for retrieving dictionary morphological data and stemming
> ---
>
> Key: LUCENE-9766
> URL: https://issues.apache.org/jira/browse/LUCENE-9766
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Peter Gromov
>Priority: Major
>





[GitHub] [lucene-solr] chatman commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json

2021-02-12 Thread ASF subversion and git services (Jira)



chatman commented on a change in pull request #2318:
URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r575282512



##
File path: 
solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
##
@@ -256,6 +280,23 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       shardRequestTracker.processResponses(results, shardHandler, false, null, Collections.emptySet());
       @SuppressWarnings({"rawtypes"})
       boolean failure = results.get("failure") != null && ((SimpleOrderedMap)results.get("failure")).size() > 0;
+      if(isPrs) {
+        TimeOut timeout = new TimeOut(Integer.getInteger("solr.waitToSeeReplicasInStateTimeoutSeconds", 120), TimeUnit.SECONDS, timeSource); // could be a big cluster
+        PerReplicaStates prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        while (!timeout.hasTimedOut()) {
+          if(prs.allActive()) break;
+          Thread.sleep(100);
+          prs = PerReplicaStates.fetch(collectionPath, ocmh.zkStateReader.getZkClient(), null);
+        }
+        if (prs.allActive()) {
+          // we have successfully found all replicas to be ACTIVE
+          // Now ask Overseer to fetch the latest state of collection
+          // from ZK
+          ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
+        } else {
+          failure = true;
+        }
+      }
       if (failure) {
         // Let's cleanup as we hit an exception
         // We shouldn't be passing 'results' here for the cleanup as the response would then contain 'success'

Review comment:
   I can confirm that this is no longer an issue after the latest commits. 
I ran CreateCollectionCleanupTest with the following patch [0] and it passed 
consistently. While it ran, I checked the coverage report to verify that these 
lines were covered.
   
   [0] - https://paste.centos.org/view/09e3434d
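
   For readers unfamiliar with the pattern, the wait loop in the diff above is a poll-until-deadline loop. A minimal standalone sketch of that pattern follows. It is only an illustration: `WaitForCondition` and `waitFor` are hypothetical names, and it uses plain JDK types rather than Solr's `TimeOut` or `PerReplicaStates`.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper illustrating a poll-until-deadline loop; not Solr code. */
public class WaitForCondition {

  /**
   * Polls the condition until it holds or the timeout elapses.
   * Returns the state of the condition at the end of the wait.
   */
  public static boolean waitFor(
      BooleanSupplier condition, long timeout, TimeUnit unit, long pollMillis)
      throws InterruptedException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    // Comparing the difference against zero stays correct if nanoTime wraps around.
    while (System.nanoTime() - deadline < 0) {
      if (condition.getAsBoolean()) {
        return true;
      }
      Thread.sleep(pollMillis);
    }
    // One final check after the deadline, similar to the second allActive() test in the diff.
    return condition.getAsBoolean();
  }

  public static void main(String[] args) throws InterruptedException {
    int[] polls = {0};
    // The condition becomes true on the third poll.
    boolean ok = waitFor(() -> ++polls[0] >= 3, 5, TimeUnit.SECONDS, 10);
    System.out.println(ok + " after " + polls[0] + " polls");
  }
}
```

   A caller would treat a `false` return the way the diff treats `failure = true`: the condition never held within the timeout, so cleanup should run.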






[jira] [Commented] (LUCENE-9765) Hunspell: rename SpellChecker to Hunspell, fix test name, update javadoc and CHANGES.txt



[ 
https://issues.apache.org/jira/browse/LUCENE-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283736#comment-17283736
 ] 

ASF subversion and git services commented on LUCENE-9765:
-

Commit 02ea7a11392834857f944f35ace65896c951703f in lucene-solr's branch 
refs/heads/master from Peter Gromov
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=02ea7a1 ]

LUCENE-9765: Hunspell: rename SpellChecker to Hunspell, fix test name, update 
javadoc and CHANGES.txt (#2354)



> Hunspell: rename SpellChecker to Hunspell, fix test name, update javadoc and 
> CHANGES.txt
> 
>
> Key: LUCENE-9765
> URL: https://issues.apache.org/jira/browse/LUCENE-9765
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Peter Gromov
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>





[jira] [Resolved] (LUCENE-9765) Hunspell: rename SpellChecker to Hunspell, fix test name, update javadoc and CHANGES.txt



 [ 
https://issues.apache.org/jira/browse/LUCENE-9765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9765.
-
Fix Version/s: master (9.0)
   Resolution: Fixed

> Hunspell: rename SpellChecker to Hunspell, fix test name, update javadoc and 
> CHANGES.txt
> 
>
> Key: LUCENE-9765
> URL: https://issues.apache.org/jira/browse/LUCENE-9765
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Peter Gromov
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>





[GitHub] [lucene-solr] dweiss merged pull request #2354: LUCENE-9765: Hunspell: rename SpellChecker to Hunspell, fix test name…



dweiss merged pull request #2354:
URL: https://github.com/apache/lucene-solr/pull/2354


   




[GitHub] [lucene-solr] dweiss commented on a change in pull request #2354: LUCENE-9765: Hunspell: rename SpellChecker to Hunspell, fix test name…



dweiss commented on a change in pull request #2354:
URL: https://github.com/apache/lucene-solr/pull/2354#discussion_r575274004



##
File path: lucene/CHANGES.txt
##
@@ -89,8 +89,8 @@ API Changes
 
 Improvements
 
-* LUCENE-9687: Hunspell support improvements: add SpellChecker API, support default encoding and
-  BREAK/FORBIDDENWORD/COMPOUNDRULE affix rules, improve stemming of all-caps words (Peter Gromov)
+* LUCENE-9687: Hunspell support improvements: add API for spell-checking and suggestions, support compound words,
+  fix various behavior differences between Java and C++ implementations, improve performance (Peter Gromov, Dawid Weiss)

Review comment:
   I don't think I deserve to be in the changes entry. :)






[jira] [Commented] (LUCENE-9766) Hunspell: add API for retrieving dictionary morphological data and stemming



[ 
https://issues.apache.org/jira/browse/LUCENE-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283732#comment-17283732
 ] 

Dawid Weiss commented on LUCENE-9766:
-

Hi Peter. Do you think we can move this to a major issue and close the 
"improvements" issue? If we have all the hunspell tests working I'd consider 
that one done.

> Hunspell: add API for retrieving dictionary morphological data and stemming
> ---
>
> Key: LUCENE-9766
> URL: https://issues.apache.org/jira/browse/LUCENE-9766
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Peter Gromov
>Priority: Major
>





[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283706#comment-17283706
 ] 

Dawid Weiss commented on LUCENE-9767:
-

Exactly. I think they should just go through all the other validation/formatting 
checks as well. Maybe we'll have to add some suppressions, but these are fine and 
keep everything consistent. 

> port ICU regeneration to gradle build
> -
>
> Key: LUCENE-9767
> URL: https://issues.apache.org/jira/browse/LUCENE-9767
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>
> When we upgrade ICU dependency we have to regenerate a lot of stuff. The ant 
> build has it all automated.
> You do need icu4c installed corresponding to the icu4j version to regenerate 
> some of the datastructures. There are also some java regenerators that do 
> processing of unicode data and so on.
> Will try to see if I can get this hobbling when I have the time, the icu 
> dependency is quite old at this point. The hard part for me is learning 
> gradle's crazy ways every time, but maybe i can start it off super-ugly with 
> something like shell script that everyone hates, but at least works 
> correctly. 
> cc [~dweiss]




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283703#comment-17283703
 ] 

Robert Muir commented on LUCENE-9767:
-

OK, I was just mentioning it as an idea if it keeps the build simpler. The 
tools folders caused some pain/hair for the ant build. And ideally we'd at 
least be *compiling* the code to make sure it doesn't break, even/especially if 
tools are rarely used.

> port ICU regeneration to gradle build
> -
>
> Key: LUCENE-9767
> URL: https://issues.apache.org/jira/browse/LUCENE-9767
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>
> When we upgrade ICU dependency we have to regenerate a lot of stuff. The ant 
> build has it all automated.
> You do need icu4c installed corresponding to the icu4j version to regenerate 
> some of the datastructures. There are also some java regenerators that do 
> processing of unicode data and so on.
> Will try to see if I can get this hobbling when I have the time, the icu 
> dependency is quite old at this point. The hard part for me is learning 
> gradle's crazy ways every time, but maybe i can start it off super-ugly with 
> something like shell script that everyone hates, but at least works 
> correctly. 
> cc [~dweiss]




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283698#comment-17283698
 ] 

Dawid Weiss commented on LUCENE-9767:
-

Having a separate sourceset is doing much of what you want - it keeps sources 
in the same "module" (project) but at the same time it's not part of the same 
primary published artifact (so not part of javadocs, source zip, etc.). Tests 
are a separate sourceset, for example. 

> port ICU regeneration to gradle build
> -
>
> Key: LUCENE-9767
> URL: https://issues.apache.org/jira/browse/LUCENE-9767
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>
> When we upgrade ICU dependency we have to regenerate a lot of stuff. The ant 
> build has it all automated.
> You do need icu4c installed corresponding to the icu4j version to regenerate 
> some of the datastructures. There are also some java regenerators that do 
> processing of unicode data and so on.
> Will try to see if I can get this hobbling when I have the time, the icu 
> dependency is quite old at this point. The hard part for me is learning 
> gradle's crazy ways every time, but maybe i can start it off super-ugly with 
> something like shell script that everyone hates, but at least works 
> correctly. 
> cc [~dweiss]




[jira] [Commented] (LUCENE-9767) port ICU regeneration to gradle build



[ 
https://issues.apache.org/jira/browse/LUCENE-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283696#comment-17283696
 ] 

Robert Muir commented on LUCENE-9767:
-

[~dweiss] maybe we should ultimately get rid of the tools folders? I think it 
has happened already with kuromoji and nori... they had src/tools and even 
tests for their tools, but we just moved the tooling into src/java.

The biggest downside i see to removing the src/tools: in this case the icu 
tools really shouldn't be showing up in javadocs. but maybe we could put them 
in a separate java package and exclude it from javadocs for now as a 
workaround... without having to go "full modules".

> port ICU regeneration to gradle build
> -
>
> Key: LUCENE-9767
> URL: https://issues.apache.org/jira/browse/LUCENE-9767
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>
> When we upgrade ICU dependency we have to regenerate a lot of stuff. The ant 
> build has it all automated.
> You do need icu4c installed corresponding to the icu4j version to regenerate 
> some of the datastructures. There are also some java regenerators that do 
> processing of unicode data and so on.
> Will try to see if I can get this hobbling when I have the time, the icu 
> dependency is quite old at this point. The hard part for me is learning 
> gradle's crazy ways every time, but maybe i can start it off super-ugly with 
> something like shell script that everyone hates, but at least works 
> correctly. 
> cc [~dweiss]




[jira] [Comment Edited] (SOLR-15089) Allow backup/restoration to Amazon's S3 blobstore

2021-02-12 Thread Jason Gerlowski (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283690#comment-17283690
 ] 

Jason Gerlowski edited comment on SOLR-15089 at 2/12/21, 1:35 PM:
--

bq. I would be happy to look into cleaning it up and submitting it for 
consideration if you haven't started yet

Hey, that's great news Andy!  I'm in a similar situation - I have code from my 
employer (written largely by Shalin and Dat who I used to work with), but it 
also needs some recontextualization/cleanup work before it's ready to share 
here.

FWIW, my employer's implementation is well-tested but I don't think it's seen 
much production traffic.  So maybe your Salesforce implementation is a better 
base to start from, since it sounds like it's seen a good bit of production 
usage?  If you're able to get things cleaned up for contribution, let's work 
from what you have, and we can use my copy as a fallback or a sanity-check as 
necessary.

(Ishan raised some good questions about where this code should ultimately live, 
but if it makes it easier for you to share code we can handle that last.  Feel 
free to put your S3Repository code wherever is easiest for the moment, and we 
can relocate it as necessary at the end.) 

bq. I would strongly prefer for this to stay outside of solr-core, preferably 
in solr-extras repo (when that's created).

My primary goal for this is that it lives as ASF code _somewhere_.  So I'm not 
against solr-extras as a home, if the community has decided on that approach 
for handling future contrib-y modules.

But does that consensus exist right now? The last email thread about it [ends 
ambiguously|http://mail-archives.apache.org/mod_mbox/lucene-dev/202101.mbox/%3Calpine.DEB.2.21.2101141026530.13436%40slate%3E],
 with Hoss asking some questions (that I seconded) about what benefits 
{{solr-extras}} really provides over a single-repo approach.

From that I was under the impression that {{solr-extras}} might happen, but 
was still very much up in the air.  But I might've missed some mail on it?


was (Author: gerlowskija):
bq. I would be happy to look into cleaning it up and submitting it for 
consideration if you haven't started yet

Hey, that's great news Andy!  I'm in a similar situation - I have code from my 
employer (written largely by Shalin and Dat who I used to work with), but it 
also needs some recontextualization/cleanup work before it's ready to share 
here.

FWIW, my employer's implementation is well-tested but I don't think it's seen 
much production traffic.  So maybe your Salesforce implementation is a better 
base to start from, since it sounds like it's seen a good bit of production 
usage?  If you're able to get things cleaned up for contribution, let's work 
from what you have, and we can use my copy as a fallback or a sanity-check as 
necessary.

(Ishan raised some good questions about where this code should ultimately live, 
but if it makes it easier for you to share code we can handle that at the end.  
Feel free to put your S3Repository code wherever is easiest for the moment.) 

bq. I would strongly prefer for this to stay outside of solr-core, preferably 
in solr-extras repo (when that's created).

My primary goal for this is that it lives as ASF code _somewhere_.  So I'm not 
against solr-extras as a home, if the community has decided on that approach 
for handling future contrib-y modules.

But does that consensus exist right now? The last email thread about it [ends 
ambiguously|http://mail-archives.apache.org/mod_mbox/lucene-dev/202101.mbox/%3Calpine.DEB.2.21.2101141026530.13436%40slate%3E],
 with Hoss asking some questions (that I seconded) about what benefits 
{{solr-extras}} really provides over a single-repo approach.

From that I was under the impression that {{solr-extras}} might happen, but 
was still very much up in the air.  But I might've missed some mail on it?

> Allow backup/restoration to Amazon's S3 blobstore 
> --
>
> Key: SOLR-15089
> URL: https://issues.apache.org/jira/browse/SOLR-15089
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public (Default Security Level. Issues are Public) 
> Reporter: Jason Gerlowski
> Priority: Major
>
> Solr's BackupRepository interface provides an abstraction around the physical 
> location/format that backups are stored in.  This allows plugin writers to 
> create "repositories" for a variety of storage mediums.  It'd be nice if Solr 
> offered more mediums out of the box though, such as some of the "blobstore" 
> offerings provided by various cloud providers.
> This ticket proposes a "BackupRepository" implementation for Amazon's 
> popular 'S3' blobstore, so that Solr users can use it for backups without 
> needing to write their own code.
> Amazon offers a

[jira] [Commented] (SOLR-15089) Allow backup/restoration to Amazon's S3 blobstore

2021-02-12 Thread Jason Gerlowski (Jira)



[ 
https://issues.apache.org/jira/browse/SOLR-15089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283690#comment-17283690
 ] 

Jason Gerlowski commented on SOLR-15089:


bq. I would be happy to look into cleaning it up and submitting it for 
consideration if you haven't started yet

Hey, that's great news Andy!  I'm in a similar situation - I have code from my 
employer (written largely by Shalin and Dat who I used to work with), but it 
also needs some recontextualization/cleanup work before it's ready to share 
here.

FWIW, my employer's implementation is well-tested but I don't think it's seen 
much production traffic.  So maybe your Salesforce implementation is a better 
base to start from, since it sounds like it's seen a good bit of production 
usage?  If you're able to get things cleaned up for contribution, let's work 
from what you have, and we can use my copy as a fallback or a sanity-check as 
necessary.

(Ishan raised some good questions about where this code should ultimately live, 
but if it makes it easier for you to share code we can handle that at the end.  
Feel free to put your S3Repository code wherever is easiest for the moment.) 

bq. I would strongly prefer for this to stay outside of solr-core, preferably 
in solr-extras repo (when that's created).

My primary goal for this is that it lives as ASF code _somewhere_.  So I'm not 
against solr-extras as a home, if the community has decided on that approach 
for handling future contrib-y modules.

But does that consensus exist right now? The last email thread about it [ends 
ambiguously|http://mail-archives.apache.org/mod_mbox/lucene-dev/202101.mbox/%3Calpine.DEB.2.21.2101141026530.13436%40slate%3E],
 with Hoss asking some questions (that I seconded) about what benefits 
{{solr-extras}} really provides over a single-repo approach.

From that I was under the impression that {{solr-extras}} might happen, but 
was still very much up in the air.  But I might've missed some mail on it?

> Allow backup/restoration to Amazon's S3 blobstore 
> --
>
> Key: SOLR-15089
> URL: https://issues.apache.org/jira/browse/SOLR-15089
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public (Default Security Level. Issues are Public) 
> Reporter: Jason Gerlowski
> Priority: Major
>
> Solr's BackupRepository interface provides an abstraction around the physical 
> location/format that backups are stored in.  This allows plugin writers to 
> create "repositories" for a variety of storage mediums.  It'd be nice if Solr 
> offered more mediums out of the box though, such as some of the "blobstore" 
> offerings provided by various cloud providers.
> This ticket proposes a "BackupRepository" implementation for Amazon's 
> popular 'S3' blobstore, so that Solr users can use it for backups without 
> needing to write their own code.
> Amazon offers an S3 Java client with acceptable licensing, and the required 
> code is relatively simple.  The biggest challenge in supporting this will 
> likely be procedural - integration testing requires S3 access and S3 access 
> costs money.  We can check with INFRA to see if there is any way to get cloud 
> credits for an integration test to run in nightly Jenkins runs on the ASF 
> Jenkins server.  Alternatively we can try to stub out the blobstore in some 
> reliable way.
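
For illustration only, the "stub out the blobstore" option mentioned above 
could look roughly like the sketch below.  The BlobStore interface and the 
InMemoryBlobStore class are hypothetical names made up for this example; they 
are not Solr's BackupRepository API and not part of any Amazon client.  The 
idea is just that backup code written against a small abstraction can be 
exercised in tests without S3 access or cloud credits.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical minimal blob-store abstraction (an assumption, not a Solr or AWS type). */
interface BlobStore {
  void put(String key, byte[] data);

  byte[] get(String key);

  boolean exists(String key);

  void delete(String key);
}

/** In-memory stand-in that tests could use in place of a real S3 bucket. */
class InMemoryBlobStore implements BlobStore {
  private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

  @Override
  public void put(String key, byte[] data) {
    // Copy defensively so later changes to the caller's array do not alter the stored blob.
    blobs.put(key, Arrays.copyOf(data, data.length));
  }

  @Override
  public byte[] get(String key) {
    byte[] data = blobs.get(key);
    if (data == null) {
      throw new NoSuchElementException("No blob stored under key: " + key);
    }
    // Return a copy so callers cannot mutate the stored bytes.
    return Arrays.copyOf(data, data.length);
  }

  @Override
  public boolean exists(String key) {
    return blobs.containsKey(key);
  }

  @Override
  public void delete(String key) {
    blobs.remove(key);
  }
}
```

A stub like this only checks the logic layered on top of the store; it says 
nothing about real S3 behaviour (latency, permissions, consistency), so an 
integration test against the actual service would still be useful.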




[jira] [Created] (LUCENE-9768) Add source sets for src/tools