Re: Investigating Seeming Deadlock

2021-03-05 Thread Mike Drob
Were you having any OOM errors beforehand? If so, that could have caused
some GC of objects that other threads still expect to be reachable, leading
to these null monitors.
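
If it helps narrow things down, a true Java-level deadlock (threads stuck on
each other's monitors) can also be confirmed with the standard
java.lang.management API; the same calls work over a remote JMX connection to
the Solr process. A minimal, untested sketch of the core check (the class name
is made up, everything else is plain JDK):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Complement to "jcmd <pid> Thread.print": report monitor/lock cycles, if any.
    // Run this inside the Solr JVM (or adapt it to attach over JMX) to be meaningful.
    public class DeadlockCheck {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            // Returns IDs of threads in a deadlock cycle, or null if there is none.
            long[] ids = mx.findDeadlockedThreads();
            if (ids == null) {
                System.out.println("No Java-level deadlock detected");
                return;
            }
            // Include locked monitors and ownable synchronizers in the report.
            for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
                System.out.print(info);
            }
        }
    }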

On Fri, Mar 5, 2021 at 12:55 PM Stephen Lewis Bianamara <
stephen.bianam...@gmail.com> wrote:

> Hi SOLR Community,
>
> I'm investigating a node on solr 8.3.1 running in cloud mode which appears
> to have deadlocked, and I'm trying to figure out if this is a known issue
> or not, and looking for some guidance in understanding both (a) whether
> this is a resolved issue in future releases or needs a bug, and (b) how to
> lower the risk of recurrence until it is fixed.
>
> Here is what I've observed:
>
>- strace shows the main process waiting. A spot check on child processes
>shows the same, though I did not deep dive all of the threads yet (there
>are over 100).
>- the server was not doing anything or busy, except for the JVM sitting at
>constant memory usage. No resource (memory, swap, CPU, etc.) was limited
>or showing active usage.
>- jcmd Thread.Print shows some interesting info which suggests a
>deadlock or another type of locking issue
>   - For example, I found this log entry, which suggests something unusual
>   because it looks like it's trying to lock a null object
>  - "Finalizer" #3 daemon prio=8 os_prio=0 cpu=11.11ms
>  elapsed=11.11s tid=0x0100 nid=0x in
> Object.wait()
>   [0x1000]
> java.lang.Thread.State: WAITING (on object monitor)
>  at java.lang.Object.wait(java.base@11.0.7/Native Method)
>  - waiting on 
>  at java.lang.ref.ReferenceQueue.remove(java.base@11.0.7
>  /ReferenceQueue.java:155)
>  - waiting to re-lock in wait() <0x00020020> (a
>  java.lang.ref.ReferenceQueue$Lock)
>  at java.lang.ref.ReferenceQueue.remove(java.base@11.0.7
>  /ReferenceQueue.java:176)
>  at
>  java.lang.ref.Finalizer$FinalizerThread.run(java.base@11.0.7
>  /Finalizer.java:170)
>  - I also see a lot of this. Some addresses occur multiple times,
>   but one in particular occurs 31 times. Maybe related?
>  - "h2sc-1-thread-11" #110 prio=5 os_prio=0 cpu=54.29ms
>  elapsed=11.11s tid=0x10010100 nid=0x waiting
> on condition
>   [0x10011000]
> java.lang.Thread.State: WAITING (parking)
>  at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native
>  Method)
>  - parking to wait for  <0x00030033>
>
> Can anyone help answer whether this is known or what I could look at next?
>
> Thanks!
> Stephen
>


Re: Partial update bug on solr 8.8.0

2021-03-02 Thread Mike Drob
This looks like a bug that is already fixed but not yet released in 8.9

https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-13034

On Tue, Mar 2, 2021 at 6:27 AM Mohsen Saboorian  wrote:

> Any idea about this post?
> https://stackoverflow.com/q/66335803/141438
>
> Regards.
>


Re: Asymmetric Key Size not sufficient

2021-02-14 Thread Mike Drob
Future vulnerability reports should be sent to secur...@apache.org so that
they can be resolved privately.
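
For context, the underlying fix in newer Solr versions is simply to generate
the RSA pair with a 2048-bit modulus. As a generic Java illustration (not the
actual CryptoKeys code; class name is made up):

    import java.security.KeyPair;
    import java.security.KeyPairGenerator;

    // Generic sketch: RSA keys should use a modulus of at least 2048 bits.
    public class RsaKeySizeExample {
        public static void main(String[] args) throws Exception {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048); // 1024 is considered too weak for new keys
            KeyPair pair = gen.generateKeyPair();
            System.out.println(pair.getPublic().getAlgorithm() + " key pair generated");
        }
    }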

Thank you

On Fri, Feb 12, 2021 at 10:17 AM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Recent versions of Solr use 2048.
>
> https://github.com/apache/lucene-solr/blob/branch_8_6/solr/core/src/java/org/apache/solr/util/CryptoKeys.java#L332
>
> Thanks for your report.
>
> On Fri, Feb 12, 2021 at 3:44 PM Mahir Kabir  wrote:
>
> > Hello,
> >
> > I am a Ph.D. student at Virginia Tech, USA. While working on a
> > security-related project, we came across the following vulnerability in the
> > source code -
> >
> > In file
> >
> >
> https://github.com/apache/lucene-solr/blob/branch_6_6/solr/core/src/java/org/apache/solr/util/CryptoKeys.java
> > <
> >
> https://github.com/apache/ranger/blob/71e1dd40366c8eb8e9c498b0b5158d85d603af02/kms/src/main/java/org/apache/hadoop/crypto/key/RangerKeyStore.java
> > >
> > (at line 300) the key size was set to 1024.
> >
> > *Security Impact*:
> >
> > A key size of less than 2048 bits for the RSA algorithm makes the system
> > vulnerable to brute-force attacks.
> >
> > *Useful resource*:
> > https://rules.sonarsource.com/java/type/Vulnerability/RSPEC-4426
> >
> > *Solution we suggest*:
> >
> > For the RSA algorithm, the key size should be >= 2048.
> >
> > *Please share your opinions/comments with us, if there are any*:
> >
> > Is the bug report helpful?
> >
> > Please let us know what you think about the issue. Any feedback will be
> > appreciated.
> >
> > Thank you,
> > Md Mahir Asef Kabir
> > Ph.D. Student
> > Department of CS
> > Virginia Tech
> >
>


Re: Ghost Documents or Shards out of Sync

2021-02-01 Thread Mike Drob
To expand on what Jason suggested, if the issue is the non-deterministic
ordering due to staggered commits per replica, you may have more
consistency with TLOG replicas rather than the NRT replicas. In this case,
the underlying segment files should be identical and lead to more
predictable results.
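
With SolrJ, a TLOG-only collection can be created along these lines (untested
sketch; the collection name, configset, and ZooKeeper address below are
placeholders):

    import java.util.Collections;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    // Sketch: create a 2-shard collection with 2 TLOG replicas per shard
    // and no NRT or PULL replicas.
    public class CreateTlogCollection {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
                CollectionAdminRequest
                    .createCollection("myCollection", "_default",
                        2 /* shards */, 0 /* NRT */, 2 /* TLOG */, 0 /* PULL */)
                    .process(client);
            }
        }
    }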

On Mon, Feb 1, 2021 at 2:50 PM Jason Gerlowski 
wrote:

> Hi Ronen,
>
> The first thing I'd figure out in your situation is whether the
> results are actually different each time, or whether the ordering is
> what differs (which might push a particular result off the page you're
> looking at, giving the appearance that it didn't match).
>
> In the case of the former, this can happen briefly if queries come in
> when some but not all replicas have seen a commit.  But usually this
> is a transient concern - either waiting for the next autocommit or
> triggering an explicit commit resolves the discrepancy in this case.
> Since you only see identical results after a restart, this _doesn't_
> sound like what you're seeing.
>
> In the case of the latter (same results, differently ordered) this is
> expected sometimes.  Solr sorts on relevance by default with the
> internal Lucene document ID being a tiebreaker.  Both the relevance
> statistics and Lucene's document IDs can differ across SolrCloud
> replicas (due to non-deterministic conditions such as the segment
> merging and deleted-doc removal that Lucene does under the hood), and
> this can produce differently-ordered result sets for users that issue
> the same query repeatedly.
>
> Good luck narrowing things down!
>
> Jason
>
> On Mon, Jan 25, 2021 at 3:32 AM Ronen Nussbaum  wrote:
> >
> > Hi All,
> >
> > I'm using Solr Cloud (version 8.3.0) with shards and replicas
> (replication
> > factor of 2).
> > Recently, I've encountered several times that running the same query
> > repeatedly yields different results. Restarting the nodes fixes the
> problem
> > (until next time).
> > I assume that some shards are not synchronized and I have several
> questions:
> > 1. What can cause this - many atomic updates? issues with commits?
> > 2. Can I trigger the "fixing" mechanism that Solr runs at restart by an
> API
> > call or some other method?
> >
> > Thanks in advance,
> > Ronen.
>


Re: Apache Solr Reference Guide isn't accessible

2021-02-01 Thread Mike Drob
Hi Dorion,

We are currently working with our infra team to get these restored. In the
meantime, the 8.4 guide is still available at
https://lucene.apache.org/solr/guide/8_4/ and we are hopeful that the 8.8
guide will be back up soon. Thank you for your patience.

Mike

On Mon, Feb 1, 2021 at 1:58 PM Dorion Caroline 
wrote:

> Hi,
>
> I haven't been able to access the Apache Solr Reference Guide for a few days.
> Example:
> URL
>
>   *   https://lucene.apache.org/solr/guide/8_8/
>   *   https://lucene.apache.org/solr/guide/8_7/
> Result:
> Not Found
> The requested URL was not found on this server.
>
> Do you know what's going on?
>
> Thanks
> Caroline Dorion
>


Re: Solr 8.7.0 memory leak?

2021-01-27 Thread Mike Drob
Are you running these in docker containers?

Also, I’m assuming this is a typo but just in case the setting is Xmx :)

Can you share the OOM stack trace? It’s not always running out of memory,
sometimes Java throws OOM for file handles or threads.

Mike

On Wed, Jan 27, 2021 at 10:00 PM Luke  wrote:

> Shawn,
>
> It's killed by an OOME exception. The problem is that I just created empty
> collections and the Solr JVM keeps growing and never goes down. There is no
> data at all. At the beginning, I set Xxm=6G, then 10G, now 15G; Solr 8.7
> always uses all of it and will be killed by oom.sh once JVM usage
> reaches 100%.
>
> I have another Solr 8.6.2 cloud (3 nodes) in a separate environment, which
> has over 100 collections; the Xxm = 6G, and the JVM is always at 4-5G.
>
>
>
> On Thu, Jan 28, 2021 at 2:56 AM Shawn Heisey  wrote:
>
> > On 1/27/2021 5:08 PM, Luke Oak wrote:
> > > I just created a few collections and no data, memory keeps growing but
> > never go down, until I got OOM and solr is killed
> > >
> > > Any reason?
> >
> > Was Solr killed by the operating system's oom killer or did the death
> > start with a Java OutOfMemoryError exception?
> >
> > If it was the OS, then the entire system doesn't have enough memory for
> > the demands that are made on it.  The problem might be Solr, or it might
> > be something else.  You will need to either reduce the amount of memory
> > used or increase the memory in the system.
> >
> > If it was a Java OOME exception that led to Solr being killed, then some
> > resource (could be heap memory, but isn't always) will be too small and
> > will need to be increased.  To figure out what resource, you need to see
> > the exception text.  Such exceptions are not always recorded -- it may
> > occur in a section of code that has no logging.
> >
> > Thanks,
> > Shawn
> >
>


Re: NullPointerException in Graph Traversal nodes streaming expression

2021-01-21 Thread Mike Drob
Can you provide a sample expression that would be able to reproduce this?
Are you able to try a newer version, by chance? I know we've fixed a few
NPEs recently, maybe https://issues.apache.org/jira/browse/SOLR-14700
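
For reference, a two-level nodes() expression has roughly this shape and can
be driven from SolrJ along these lines (rough, untested sketch; the collection,
URL, and field names are invented):

    import org.apache.solr.client.solrj.io.Tuple;
    import org.apache.solr.client.solrj.io.stream.SolrStream;
    import org.apache.solr.common.params.ModifiableSolrParams;

    // Sketch: send a 2-level nodes() expression to the /stream handler.
    public class NodesExample {
        public static void main(String[] args) throws Exception {
            String expr = "nodes(emails,"
                        + "  nodes(emails, walk=\"john@example.com->from\", gather=\"to\"),"
                        + "  walk=\"node->from\", gather=\"to\")";
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("expr", expr);
            params.set("qt", "/stream");
            SolrStream stream = new SolrStream("http://localhost:8983/solr/emails", params);
            try {
                stream.open();
                while (true) {
                    Tuple tuple = stream.read();
                    if (tuple.EOF) {
                        break;
                    }
                    System.out.println(tuple.getString("node"));
                }
            } finally {
                stream.close();
            }
        }
    }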

On Thu, Jan 21, 2021 at 4:13 PM ufuk yılmaz 
wrote:

> Solr version 8.4. I’m getting an unexplained NullPointerException when
> executing a simple 2 level nodes stream, do you have any idea what may
> cause this?
>
> I tried setting /stream?partialResults=true&shards.tolerant=true and
> shards.tolerant=true in nodes expressions, with no luck. I also tried
> reading source of GatherNodesStream in branch 8_4, but couldn’t understand
> it. Here is a beautiful stack trace:
>
> solr| 2021-01-21 22:00:12.726 ERROR (qtp832292933-25149)
> [c:WorkerCollection s:shard1 r:core_node10
> x:WorkerCollection_shard1_replica_n9] o.a.s.c.s.i.s.ExceptionStream
> java.lang.RuntimeException: java.util.concurrent.ExecutionException:
> java.lang.RuntimeException: java.lang.NullPointerException
> solr|   at
> org.apache.solr.client.solrj.io.graph.GatherNodesStream.read(GatherNodesStream.java:607)
> solr|   at
> org.apache.solr.client.solrj.io.stream.ExceptionStream.read(ExceptionStream.java:71)
> solr|   at
> org.apache.solr.handler.StreamHandler$TimerStream.read(StreamHandler.java:454)
> solr|   at
> org.apache.solr.client.solrj.io.stream.TupleStream.lambda$writeMap$0(TupleStream.java:84)
> solr|   at
> org.apache.solr.common.util.JsonTextWriter.writeIterator(JsonTextWriter.java:141)
> solr|   at
> org.apache.solr.common.util.TextWriter.writeVal(TextWriter.java:67)
> solr|   at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:152)
> solr|   at
> org.apache.solr.common.util.JsonTextWriter$2.put(JsonTextWriter.java:176)
> solr|   at
> org.apache.solr.client.solrj.io.stream.TupleStream.writeMap(TupleStream.java:81)
> solr|   at
> org.apache.solr.common.util.JsonTextWriter.writeMap(JsonTextWriter.java:164)
> solr|   at
> org.apache.solr.common.util.TextWriter.writeVal(TextWriter.java:69)
> solr|   at
> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:152)
> solr|   at
> org.apache.solr.common.util.JsonTextWriter.writeNamedListAsMapWithDups(JsonTextWriter.java:386)
> solr|   at
> org.apache.solr.common.util.JsonTextWriter.writeNamedList(JsonTextWriter.java:292)
> solr|   at
> org.apache.solr.response.JSONWriter.writeResponse(JSONWriter.java:73)
> solr|   at
> org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:66)
> solr|   at
> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:65)
> solr|   at
> org.apache.solr.servlet.HttpSolrCall.writeResponse(HttpSolrCall.java:892)
> solr|   at
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:594)
> solr|   at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:419)
> solr|   at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
> solr|   at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
> solr|   at
> org.eclipse.jetty.servlets.CrossOriginFilter.handle(CrossOriginFilter.java:311)
> solr|   at
> org.eclipse.jetty.servlets.CrossOriginFilter.doFilter(CrossOriginFilter.java:265)
> solr|   at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
> solr|   at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
> solr|   at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
> solr|   at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> solr|   at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> solr|   at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
> solr|   at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
> solr|   at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> solr|   at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
> solr|   at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> solr|   at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
> solr|   at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
> solr|   at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> solr|   at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
> solr|   at
> org.eclipse.jetty.server.handler.

Re: Cursor Performance Issue

2021-01-13 Thread Mike Drob
You should be using docvalues on your id, but note that switching this
would require a reindex.
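
Independent of the docValues question, the cursor loop itself usually looks
like this in SolrJ (untested sketch; the URL and field names are taken loosely
from the thread and are only placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    // Sketch of deep paging with cursorMark.
    public class CursorExport {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8080/solr/search").build()) {
                SolrQuery q = new SolrQuery("bird toy");
                q.setRows(1000);
                q.setFields("refid");
                // A cursor requires a deterministic sort ending on the uniqueKey field.
                q.setSort("id", SolrQuery.ORDER.asc);
                String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
                while (true) {
                    q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                    QueryResponse rsp = client.query(q);
                    // ... consume rsp.getResults() ...
                    String next = rsp.getNextCursorMark();
                    if (cursor.equals(next)) {
                        break; // no more results
                    }
                    cursor = next;
                }
            }
        }
    }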

On Wed, Jan 13, 2021 at 6:04 AM Ajay Sharma 
wrote:

> Hi All,
>
> I have used cursors to search and export documents in solr according to
>
> https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html#fetching-a-large-number-of-sorted-results-cursors
>
> Solr version: 6.5.0
> No of Documents: 10 crore
>
> Before implementing cursor, I was using the start and rows parameter to
> fetch records
> Service response time used to be 2 sec
>
> *Before implementing Cursor Solr URL:*
> http://localhost:8080/solr/search/select?q=bird
> toy&qt=mapping&ps=3&rows=25&mm=100
>
> Request handler Looks like this: fl contains approx 20 fields
> 
> 
> edismax
> on
> 0.01
> 
> 
> id,refid,title,smalldesc:""
> 
>
> none
> json
> 25
> 15000
> smalldesc
> title_text
> titlews^3
> sdescnisq
> 1
> 
> 2<-1 4<70%
> 
> 
>
> Sharing Response with EchoParams=all > Qtime is 6
> responseHeader: {
> status: 0,
> QTime: 6,
> params: {
> ps: "3",
> echoParams: "all",
> indent: "on",
> fl: "id,refid,title,smalldesc:"",
> tie: "0.01",
> defType: "edismax",
> qf: "customphonetic",
> wt: "json",
>qs: "1",
>qt: "mapping",
>rows: "25",
>q: "bird toy",
>timeAllowed: "15000"
> }
> },
> response: {
> numFound: 17,
> start: 0,
> maxScore: 26.616478,
> docs: [
>   {
> id: "22347708097",
> refid: "152585558",
> title: "Round BIRD COLOURFUL SWINGING CIRCULAR SITTING TOY",
> smalldesc: "",
> score: 26.616478
>  }
> ]
> }
>
> I am facing a performance issue now after implementing the cursor. Service
> response time has increased 3 to 4 times, i.e. 8 sec in some cases.
>
> *After implementing Cursor query is-*
> localhost:8080/solr/search/select?q=bird
> toy&qt=cursor&ps=3&rows=1000&mm=100&sort=score desc,id asc&cursorMark=*
>
> Just added &sort=score desc,id asc&cursorMark=* to the previous query; rows
> to be fetched is now 1000, and fl contains just a single field.
>
> Request handler remains same as before just changed the name and made fl
> change and added df in defaults
>
> 
>
>   edismax
>   on
>   0.01
>
>
>   refid
>
>
>   none
>   json
>   1000
>   smalldesc
>   title_text
>   titlews^3
>   sdescnisq
>   1
>   2<-1 4<70%
>   product_titles
>
> 
>
> Response with cursor and echoParams=all -> *QTime is now 17*, i.e. approx 3
> times the previous QTime
> responseHeader: {
> status: 0,
> QTime: 17,
> params: {
> df: "product_titles",
> ps: "3",
> echoParams: "all",
> indent: "on",
> fl: "refid",
> tie: "0.01",
> defType: "edismax",
> qf: "customphonetic",
> qs: "1",
> qt: "cursor",
> sort: "score desc,id asc",
> rows: "1000",
> q: "bird toy",
> cursorMark: "*",
> }
> },
> response: {
> numFound: 17,
> start: 0,
> docs: [
> {
> refid: "152585558"
> },
> {
> refid: "157276077"
> }
> ]
> }
>
>
> When I curl http://localhost:8080/solr/search/select?q=bird
> toy&qt=mapping&ps=3&rows=25&mm=100, I can get results in 3 seconds.
> When I curl localhost:8080/solr/search/select?q=bird
> toy&qt=cursor&ps=3&rows=1000&mm=100&sort=score desc,id asc&cursorMark=* it
> takes 8 seconds to return results, even if the result count is 0.
>
> BTW, the id schema definition is used in sort
>  omitNorms="true" multiValued="false"/>
>
> Is it due to the sort I have applied, or have I implemented it in the wrong
> way?
> Please help or point me in the right direction to solve this issue.
>
>
> Thanks in advance
>
> --
> Thanks & Regards,
> Ajay Sharma
> Product Search
> Indiamart Intermesh Ltd.
>
> --
>
>


Re: Converting a collection name to an alias

2021-01-07 Thread Mike Drob
I believe you may be able to use that command (or some combination of
create alias commands) to create an alias from A to A, and then in
the future, when you want to change it, you can point alias A at collection B
(assuming this is the point of the alias in the first place).
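
In SolrJ terms, the alias juggling is just CREATEALIAS calls (minimal sketch;
it assumes the collections A_1 and B already exist, and that no collection is
still literally named A when the alias A is created):

    import java.util.Collections;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    // Sketch: point alias "A" at collection "A_1", and later re-point it at "B".
    // CREATEALIAS on an existing alias name simply updates it.
    public class AliasExample {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
                CollectionAdminRequest.createAlias("A", "A_1").process(client);
                // later, when the data should come from B instead:
                CollectionAdminRequest.createAlias("A", "B").process(client);
            }
        }
    }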

On Thu, Jan 7, 2021 at 1:53 PM ufuk yılmaz 
wrote:

> Hi,
> I’m aware of that API but it doesn’t do what I actually want.
>
> regards
>
> Sent from Mail for Windows 10
>
> From: matthew sporleder
> Sent: 07 January 2021 22:46
> To: solr-user@lucene.apache.org
> Subject: Re: Converting a collection name to an alias
>
> https://lucene.apache.org/solr/guide/8_1/collections-api.html#rename
>
> On Thu, Jan 7, 2021 at 2:07 PM ufuk yılmaz 
> wrote:
> >
> > Hi again,
> >
> > Lets say I have a collection named A.
> > I’m trying to rename it to A_1, then create an alias named A, which
> points to the A_1 collection.
> > Is this possible without deleting and reindexing the collection from
> scratch?
> >
> > Regards,
> > uyilmaz
> >
>
>


Re: SPLITSHARD - data loss of child documents

2020-12-17 Thread Mike Drob
I was under the impression that split shard doesn’t work with child
documents; if that is missing from the ref guide, we should update it.

On Thu, Dec 17, 2020 at 4:30 AM Nussbaum, Ronen 
wrote:

> Hi Everyone,
>
> We're using version 8.6.1 with nested documents.
> I used the SPLITSHARD API and after it finished successfully, I've noticed
> the following:
>
>   1.  Most of child documents are missing - before the split: ~600M,
> after: 68M
>   2.  Retrieving a document with its children shows child documents that
> do not belong to this parent (their parentID value is different from the
> parent's ID).
>
> I didn't see any limitation in the API documentation.
> Do you have any suggestions?
>
> Thanks in advance,
> Ronen.
>
>
> This electronic message may contain proprietary and confidential
> information of Verint Systems Inc., its affiliates and/or subsidiaries. The
> information is intended to be for the use of the individual(s) or
> entity(ies) named above. If you are not the intended recipient (or
> authorized to receive this e-mail for the intended recipient), you may not
> use, copy, disclose or distribute to anyone this message or any information
> contained in this message. If you have received this electronic message in
> error, please notify us by replying to this e-mail.
>


Re: solr 8.6.3 and noggit

2020-11-20 Thread Mike Drob
Noggit code was forked into Solr, see SOLR-13427
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.6.3/solr/solrj/src/java/org/noggit/ObjectBuilder.java

It looks like that particular method was added in 8.4 via
https://issues.apache.org/jira/browse/SOLR-13824

Is it possible you're using an older SolrJ against a newer Solr server (or
vice versa)?

Mike

On Fri, Nov 20, 2020 at 2:25 PM Susmit Shukla 
wrote:

> Hi,
> I got this error using streaming with SolrJ 8.6.3. Does it use noggit-0.8?
> It was not mentioned in the dependencies:
> https://github.com/apache/lucene-solr/blob/branch_8_6/solr/solrj/ivy.xml
>
> Caused by: java.lang.NoSuchMethodError: 'java.lang.Object
> org.noggit.ObjectBuilder.getValStrict()'
>
> at org.apache.solr.common.util.Utils.fromJSON(Utils.java:284)
> ~[solr-solrj-8.6.3.jar:8.6.3 e001c2221812a0ba9e9378855040ce72f93eced4 -
> jasongerlowski - 2020-10-03 18:12:06]
>


Re: download binary files will not uncompress

2020-11-03 Thread Mike Drob
Routing back to the mailing list, please do not reply directly to
individual emails.

You did not download the complete file, the releases should be
approximately 180MB, not the 30KB that you show.

Try downloading from a different mirror, or check if you are behind a proxy
or firewall preventing the downloads.


On Tue, Nov 3, 2020 at 4:51 PM James Rome  wrote:

> jar@jarfx ~/.gnupg $ gpg --import ~/download/KEYS
> gpg: key B83EA82A0AFCEE7C: public key "Yonik Seeley "
> imported
> gpg: key E48025ED13E57FFC: public key "Upayavira "
> imported
>
> ...
>
> jar@jarfx ~/download $ ls -l solr*
> -rw-r--r-- 1 root root 30690 Nov  3 17:00 solr-8.6.3.tgz
> -rw-r--r-- 1 root root   833 Oct  3 21:44 solr-8.6.3.tgz.asc
> -rw-r--r-- 1 root root   145 Oct  3 21:44 solr-8.6.3.tgz.sha512
> -rw-r--r-- 1 root root 30718 Nov  3 17:01 solr-8.6.3.zip
>
> gpg --verify  solr-8.6.3.tgz.asc solr-8.6.3.tgz
> gpg: Signature made Sat 03 Oct 2020 06:17:01 PM EDT
> gpg:using RSA key 902CC51935C140BF820230961FD5295281436075
> gpg: BAD signature from "Jason Gerlowski (CODE SIGNING KEY)
> " [unknown]
>
> jar@jarfx ~/download $ tar xvf solr-8.6.3.tgz
>
> gzip: stdin: not in gzip format
> tar: Child returned status 1
> tar: Error is not recoverable: exiting now
>
>
> James A. Rome
> 116 Claymore Lane
> Oak Ridge, TN 37830-7674
> 865 482-5643; Cell: 865 566-7991
> jamesr...@gmail.com
> https://jamesrome.net
>
> On 11/3/20 5:20 PM, Mike Drob wrote:
> > Can you check the signatures to make sure your downloads were not
> > corrupted? I just checked and was able to download and uncompress both of
> > them.
> >
> > Also, depending on your version of tar, you don't want the - for your
> > flags... tar xf solr-8.6.3.tgz
> >
> > Mike
> >
> > On Tue, Nov 3, 2020 at 4:15 PM James Rome  wrote:
> >
> >> # Source release: solr-8.6.3-src.tgz
> >> <
> >>
> https://www.apache.org/dyn/closer.lua/lucene/solr/8.6.3/solr-8.6.3-src.tgz
> >
> >>
> >> [PGP
> >> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3-src.tgz.asc
> >]
> >> [SHA512
> >> <
> https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3-src.tgz.sha512
> >>> ]
> >> # Binary releases: solr-8.6.3.tgz
> >> <https://www.apache.org/dyn/closer.lua/lucene/solr/8.6.3/solr-8.6.3.tgz
> >
> >> [PGP
> >> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3.tgz.asc>]
> >> [SHA512
> >> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3.tgz.sha512>]
> >> / solr-8.6.3.zip
> >> <https://www.apache.org/dyn/closer.lua/lucene/solr/8.6.3/solr-8.6.3.zip
> >
> >> [PGP
> >> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3.zip.asc>]
> >> [SHA512
> >> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3.zip.sha512>]
> >>
> >>unzip solr-8.6.3.zip
> >> Archive:  solr-8.6.3.zip
> >> End-of-central-directory signature not found.  Either this file is
> not
> >> a zipfile, or it constitutes one disk of a multi-part archive. In
> the
> >> latter case the central directory and zipfile comment will be found
> on
> >> the last disk(s) of this archive.
> >>
> >>
> >> and
> >>
> >> # tar -xvf solr-8.6.3.tgz
> >>
> >> gzip: stdin: not in gzip format
> >> tar: Child returned status 1
> >> tar: Error is not recoverable: exiting now
> >>
> >> --
> >> James A. Rome
> >>
> >> https://jamesrome.net
> >>
> >>
>


Re: download binary files will not uncompress

2020-11-03 Thread Mike Drob
Can you check the signatures to make sure your downloads were not
corrupted? I just checked and was able to download and uncompress both of
them.

Also, depending on your version of tar, you don't want the - for your
flags... tar xf solr-8.6.3.tgz

Mike

On Tue, Nov 3, 2020 at 4:15 PM James Rome  wrote:

> # Source release: solr-8.6.3-src.tgz
> <
> https://www.apache.org/dyn/closer.lua/lucene/solr/8.6.3/solr-8.6.3-src.tgz>
>
> [PGP
> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3-src.tgz.asc>]
> [SHA512
> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3-src.tgz.sha512
> >]
> # Binary releases: solr-8.6.3.tgz
> <https://www.apache.org/dyn/closer.lua/lucene/solr/8.6.3/solr-8.6.3.tgz>
> [PGP
> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3.tgz.asc>]
> [SHA512
> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3.tgz.sha512>]
> / solr-8.6.3.zip
> <https://www.apache.org/dyn/closer.lua/lucene/solr/8.6.3/solr-8.6.3.zip>
> [PGP
> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3.zip.asc>]
> [SHA512
> <https://downloads.apache.org/lucene/solr/8.6.3/solr-8.6.3.zip.sha512>]
>
>   unzip solr-8.6.3.zip
> Archive:  solr-8.6.3.zip
>End-of-central-directory signature not found.  Either this file is not
>a zipfile, or it constitutes one disk of a multi-part archive. In the
>latter case the central directory and zipfile comment will be found on
>the last disk(s) of this archive.
>
>
> and
>
> # tar -xvf solr-8.6.3.tgz
>
> gzip: stdin: not in gzip format
> tar: Child returned status 1
> tar: Error is not recoverable: exiting now
>
> --
> James A. Rome
>
> https://jamesrome.net
>
>


Re: Solr dependency update at Apache Beam - which versions should be supported

2020-10-27 Thread Mike Drob
Piotr,

Based on the questions that we've seen over the past month on this list,
there are still users with Solr on 6, 7, and 8. I suspect there are still
Solr 5 users out there too, although they don't appear to be asking for
help - likely they are in set it and forget it mode.

Solr 7 may not be officially deprecated on our site, but it's pretty old at
this point and we're not doing any development on it outside of mybe a
very high profile security fix. Even then, we might acknowledge it and
recommend users update to 8.x anyway.

The index files generated by Lucene and consumed by Solr are backwards
compatible up to one major version. Some of the API remains compatible; a
client issuing simple queries to Solr 5 would probably work fine even
against Solr 9 when it comes out eventually. A client doing admin
operations will be less certain. I don't know enough about Beam to tell you
where on the spectrum your use will fall.

I'm not sure if this was helpful or not, but maybe it is a nudge in the
right direction.

Good luck,
Mike


On Tue, Oct 27, 2020 at 11:09 AM Piotr Szuberski <
piotr.szuber...@polidea.com> wrote:

> Hi,
>
> We are working on dependency updates at Apache Beam and I would like to
> consult which versions should be supported so we don't break any existing
> users.
>
> Previously the supported Solr version was 5.5.4.
>
> Versions 8.x.y and 7.x.y naturally come to mind as they are the only ones
> not deprecated. But maybe there are users that use some earlier versions?
>
> Are these versions backwards-compatible or there are things to be aware of?
>
> Regards
>


Re: Folding Repeated Letters

2020-10-08 Thread Mike Drob
I was thinking about that, but there are words that are legitimately
different with repeated consonants. My primary school teacher lost hair
over getting us to learn the difference between desert and dessert.

Maybe we need something that can borrow the boosting behaviour of fuzzy
query - match the exact term, but also the neighbors with a slight deboost,
so that if the main term exists those others won't show up.
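
As a quick sanity check of the collapse-repeats pattern in the reply quoted
below, plain Java regex shows both the intended behaviour and the
desert/dessert problem (illustrative only):

    // Illustrates the (.)\1+ -> $1 replacement from the quoted field type below.
    public class CollapseRepeats {
        public static void main(String[] args) {
            String[] samples = {"YES", "YESSS", "YYYEEESSS", "desert", "dessert"};
            for (String s : samples) {
                // lowercase first, then collapse any run of a repeated character to one
                String collapsed = s.toLowerCase().replaceAll("(.)\\1+", "$1");
                System.out.println(s + " -> " + collapsed);
            }
        }
    }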

On Thu, Oct 8, 2020 at 5:46 PM Andy Webb  wrote:

> How about something like this?
>
> {
> "add-field-type": [
> {
> "name": "norepeat",
> "class": "solr.TextField",
> "analyzer": {
> "tokenizer": {
> "class": "solr.StandardTokenizerFactory"
> },
> "filters": [
> {
> "class": "solr.LowerCaseFilterFactory"
> },
> {
> "class": "solr.PatternReplaceFilterFactory",
> "pattern": "(.)\\1+",
> "replacement": "$1"
> }
> ]
> }
> }
> ]
> }
>
> This finds a match...
>
> http://localhost:8983/solr/#/norepeat/analysis?analysis.fieldvalue=Yes&analysis.query=YyyeeEssSs&analysis.fieldtype=norepeat
>
> Andy
>
>
>
> On Thu, 8 Oct 2020 at 23:02, Mike Drob  wrote:
>
> > I'm looking for a way to transform words with repeated letters into the
> > same token - does something like this exist out of the box? Do our
> stemmers
> > support it?
> >
> > For example, say I would want all of these terms to return the same
> search
> > results:
> >
> > YES
> > YESSS
> > YYYEEESSS
> > YYEE[...]S
> >
> > I don't know how long a user would hold down the S key at the end to
> > capture their level of excitement, and I don't want to manually define
> > synonyms for every length.
> >
> > I'm pretty sure that I don't want PhoneticFilter here, maybe
> > PatternReplace? Not a huge fan of how that one is configured, and I think
> > I'd have to set up a bunch of patterns inline for it?
> >
> > Mike
> >
>


Folding Repeated Letters

2020-10-08 Thread Mike Drob
I'm looking for a way to transform words with repeated letters into the
same token - does something like this exist out of the box? Do our stemmers
support it?

For example, say I would want all of these terms to return the same search
results:

YES
YESSS
YYYEEESSS
YYEE[...]S

I don't know how long a user would hold down the S key at the end to
capture their level of excitement, and I don't want to manually define
synonyms for every length.

I'm pretty sure that I don't want PhoneticFilter here, maybe
PatternReplace? Not a huge fan of how that one is configured, and I think
I'd have to set up a bunch of patterns inline for it?

Mike


Re: Term too complex for spellcheck.q param

2020-10-07 Thread Mike Drob
Right now the only solution is to use a shorter term.

In a fuzzy query you could also try using a lower edit distance e.g. term~1
(default is 2), but I’m not sure what the syntax for a spellcheck would be.
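
At the Lucene level that knob is the maxEdits argument to FuzzyQuery, e.g.
(illustrative snippet only; the field and term are placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;

    // Sketch: a fuzzy query with maxEdits=1 instead of the default of 2.
    public class FuzzyExample {
        public static void main(String[] args) {
            FuzzyQuery q = new FuzzyQuery(new Term("title", "example"), 1);
            System.out.println(q);
        }
    }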

Mike

On Wed, Oct 7, 2020 at 2:59 PM gnandre  wrote:

> Hi,
>
> I am getting the following error when I pass '
> 김포오피➬유유닷컴➬✗UUDAT3.COM유유닷컴김포풀싸롱て김포오피ふ김포휴게텔け김포마사지❂김포립카페じ김포안마
> ' in spellcheck.q param. How to avoid this error? I am using Solr 8.5.2
>
> {
>   "error": {
> "code": 500,
> "msg": "Term too complex: 김포오피➬유유닷컴➬✗uudat3.com
> 유유닷컴김포풀싸롱て김포오피ふ김포휴게텔け김포마사지❂김포립카페じ김포안마",
> "trace": "org.apache.lucene.search.FuzzyTermsEnum$FuzzyTermsException:
> Term too complex:
> 김포오피➬유유닷컴➬✗uudat3.com유유닷컴김포풀싸롱て김포오피ふ김포휴게텔け김포마사지❂김포립카페じ김포안마\n\tat
>
> org.apache.lucene.search.FuzzyAutomatonBuilder.buildAutomatonSet(FuzzyAutomatonBuilder.java:63)\n\tat
>
> org.apache.lucene.search.FuzzyTermsEnum$AutomatonAttributeImpl.init(FuzzyTermsEnum.java:365)\n\tat
>
> org.apache.lucene.search.FuzzyTermsEnum.(FuzzyTermsEnum.java:125)\n\tat
>
> org.apache.lucene.search.FuzzyTermsEnum.(FuzzyTermsEnum.java:92)\n\tat
>
> org.apache.lucene.search.spell.DirectSpellChecker.suggestSimilar(DirectSpellChecker.java:425)\n\tat
>
> org.apache.lucene.search.spell.DirectSpellChecker.suggestSimilar(DirectSpellChecker.java:376)\n\tat
>
> org.apache.solr.spelling.DirectSolrSpellChecker.getSuggestions(DirectSolrSpellChecker.java:196)\n\tat
>
> org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:195)\n\tat
>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:328)\n\tat
>
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:211)\n\tat
> org.apache.solr.core.SolrCore.execute(SolrCore.java:2596)\n\tat
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:802)\n\tat
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:579)\n\tat
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:420)\n\tat
>
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:352)\n\tat
>
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596)\n\tat
>
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
>
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)\n\tat
>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)\n\tat
>
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)\n\tat
>
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)\n\tat
>
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)\n\tat
>
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)\n\tat
>
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)\n\tat
>
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
>
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)\n\tat
>
> org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)\n\tat
>
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)\n\tat
>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
>
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)\n\tat
>
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:500)\n\tat
>
> org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)\n\tat
> org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)\n\tat
>
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)\n\tat
> org.eclipse.jetty.io
> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)\n\tat
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n\tat
> org.eclipse.jetty.io.Ch

RE: Why use a different analyzer for "index" and "query"?

2020-09-10 Thread Dunham-Wilkie, Mike CITZ:EX
Hi Steven, 

I can think of one case.  If we have an index of database table or column 
names, e.g., words like 'THIS_IS_A_TABLE_NAME', we may want to split the name 
at the underscores when indexing (as well as keep the original), since the 
individual parts might be significant and meaningful.  When querying, though, 
if the searcher types in THIS_IS_A_TABLE_NAME then they are likely looking for 
the whole string, so we wouldn't want to split it apart.

There also seems to be a debate on whether the SYNONYM filter should be 
included on indexing, on querying, or on both.  Google "solr synonyms index vs 
query"

Mike

-Original Message-
From: Steven White  
Sent: September 10, 2020 8:19 AM
To: solr-user@lucene.apache.org
Subject: Why use a different analyzer for "index" and "query"?



Hi everyone,

In Solr's schema, I have come across field types that use different logic for
"index" than for "query".  To be clear, I'm talking about this block:

<fieldType ...>
  <analyzer type="index">
    ...
  </analyzer>
  <analyzer type="query">
    ...
  </analyzer>
</fieldType>

Why would one want to not use the same logic for both and simply use:

<fieldType ...>
  <analyzer>
    ...
  </analyzer>
</fieldType>

What are real-world use cases for using a different analyzer for index and query?

Thanks,

Steve


Lowercase-ing everything but acronyms

2020-09-09 Thread Dunham-Wilkie, Mike CITZ:EX
Hi SOLR list,

I'm currently using the White Space tokenizer and the Lower Case filter with 
SOLR 7.3.  I'd like to modify the logic to keep any tokens that are entirely 
upper case as upper case, and just apply the Lower Case filter (or something 
equivalent) to the remaining tokens.  Is there a way to do this using 
tokenizers and filters?

Thanks
Mike


Mike Dunham-Wilkie | Senior Spatial Data Administration Analyst | PHONE... 
778-676-1791
Data Systems & Services - Digital Platforms and Data Division - Ministry of 
Citizens' Services

For faster response and/or future inquires, the following email addresses are 
monitored continuously:
BC Geographic Warehouse (BCGW) and Replication/ETL | DataBC Data Architecture 
Services (databc...@gov.bc.ca<mailto:databc...@gov.bc.ca>)
BC Data Catalogue (BCDC) and Open Data | DataBC Catalogue Services 
(data...@gov.bc.ca<mailto:data...@gov.bc.ca>)



Re: Adding solr-core via maven fails

2020-07-02 Thread Mike Drob
Does it fail similarly on 8.5.0 and .1?

On Thu, Jul 2, 2020 at 6:38 AM Erick Erickson 
wrote:

> There have been some issues with Maven, see:
> https://issues.apache.org/jira/browse/LUCENE-9170
>
> However, we do not officially support Maven builds, they’re there as a
> convenience, so there may still
> be issues in future.
>
> > On Jul 2, 2020, at 1:27 AM, Ali Akhtar  wrote:
> >
> > If I try adding solr-core to an existing project, e.g (SBT):
> >
> > libraryDependencies += "org.apache.solr" % "solr-core" % "8.5.2"
> >
> > It fails due a 404 on the dependencies:
> >
> > Extracting structure failed
> > stack trace is suppressed; run last update for the full output
> > stack trace is suppressed; run last ssExtractDependencies for the full
> > output
> > (update) sbt.librarymanagement.ResolveException: Error downloading
> > org.restlet.jee:org.restlet:2.4.0
> > Not found
> > Not found
> > not found:
> > /home/ali/.ivy2/local/org.restlet.jee/org.restlet/2.4.0/ivys/ivy.xml
> > not found:
> >
> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet/2.4.0/org.restlet-2.4.0.pom
> > Error downloading org.restlet.jee:org.restlet.ext.servlet:2.4.0
> > Not found
> > Not found
> > not found:
> >
> /home/ali/.ivy2/local/org.restlet.jee/org.restlet.ext.servlet/2.4.0/ivys/ivy.xml
> > not found:
> >
> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.4.0/org.restlet.ext.servlet-2.4.0.pom
> > (ssExtractDependencies) sbt.librarymanagement.ResolveException: Error
> > downloading org.restlet.jee:org.restlet:2.4.0
> > Not found
> > Not found
> > not found:
> > /home/ali/.ivy2/local/org.restlet.jee/org.restlet/2.4.0/ivys/ivy.xml
> > not found:
> >
> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet/2.4.0/org.restlet-2.4.0.pom
> > Error downloading org.restlet.jee:org.restlet.ext.servlet:2.4.0
> > Not found
> > Not found
> > not found:
> >
> /home/ali/.ivy2/local/org.restlet.jee/org.restlet.ext.servlet/2.4.0/ivys/ivy.xml
> > not found:
> >
> https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.4.0/org.restlet.ext.servlet-2.4.0.pom
> >
> >
> >
> > Any ideas? Do I need to add a specific repository to get it to compile?
>
>


Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-24 Thread Mike Drob
Brend,

I appreciate that you are trying to examine this issue from multiple sides
and consider future implications, but I don’t think that is a stirring
argument. By analogy, if we are out of eggs and my wife asks me to go to
the store to get some, refusing to do so on the basis that she might call
me while I’m there and also ask me to get milk would not be reasonable.

What will come next may be an interesting question philosophically, but we
are not discussing abstract concepts here. There is a concrete issue
identified, and we’re soliciting input in how best to address it.

Thank you for the suggestion of "guide/follower"

Mike

On Wed, Jun 24, 2020 at 6:30 AM Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> I've been following this thread for a while now, and I can understand
> the wish to change some naming/wording/speech in one program or another,
> but I always get back to the one question:
> "Is it the weapon which kills people or the hand controlled by
> the mind which fires the weapon?"
>
> The thread started with slave - slavery, then turned over to master
> and followed by leader (for me as a german... you know).
> What will come next?
>
> And moreover, we are now discussing changes to the source code, and
> due to this there need to be changes to the documentation.
> What about the books people wrote about these programs and their source code?
> Should we force these authors to rewrite their books?
> Maybe we should file a request to all web search engines to reject
> all stored content about these "banned" words?
> And contact all web hosters about providing bad content.
>
> To sum things up, within my 40 years of computer science and writing
> programs I have never for a nanosecond had any thoughts about words
> like master, slave, leader, ... other than thinking about computers
> and programming.
>
> Just my 2 cents.
>
> For what it is worth, I tend toward guide/follower if there "must be" any
> changes.
>
> Bernd
>


Re: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Mike Drob
I personally think that using Solr cloud terminology for this would be fine
with leader/follower. The leader is the one that accepts updates, followers
cascade the updates somehow. The presence of ZK or election doesn’t really
change this detail.

However, if folks feel that it’s confusing, then I can’t tell them that
they’re not confused. Especially when they’re working with others who have
less Solr experience than we do and are less familiar with the intricacies.

Primary/Replica seems acceptable. Coordinator instead of Overseer seems
acceptable.

Would love to see this in 9.0!

Mike

On Thu, Jun 18, 2020 at 8:25 AM John Gallagher
 wrote:

> While on the topic of renaming roles, I'd like to propose finding a better
> term than "overseer" which has historical slavery connotations as well.
> Director, perhaps?
>
>
> John Gallagher
>
> On Thu, Jun 18, 2020 at 8:48 AM Jason Gerlowski 
> wrote:
>
> > +1 to rename master/slave, and +1 to choosing terminology distinct
> > from what's used for SolrCloud.  I could be happy with several of the
> > proposed options.  Since a good few have been proposed though, maybe
> > an eventual vote thread is the most organized way to aggregate the
> > opinions here.
> >
> > I'm less positive about the prospect of changing the name of our
> > primary git branch.  Most projects that contributors might come from,
> > most tutorials out there to learn git, most tools built on top of git
> > - the majority are going to assume "master" as the main branch.  I
> > appreciate the change that Github is trying to effect in changing the
> > default for new projects, but it'll be a long time before that
> > competes with the huge bulk of projects, documentation, etc. out there
> > using "master".  Our contributors are smart and I'm sure they'd figure
> > it out if we used "main" or something else instead, but having a
> > non-standard git setup would be one more "papercut" in understanding
> > how to contribute to a project that already makes that harder than it
> > should.
> >
> > Jason
> >
> >
> > On Thu, Jun 18, 2020 at 7:33 AM Demian Katz 
> > wrote:
> > >
> > > Regarding people having a problem with the word "master" -- GitHub is
> > changing the default branch name away from "master," even in isolation
> from
> > a "slave" pairing... so the terminology seems to be falling out of favor
> in
> > all contexts. See:
> > >
> > >
> >
> https://www.cnet.com/news/microsofts-github-is-removing-coding-terms-like-master-and-slave/
> > >
> > > I'm not here to start a debate about the semantics of that, just to
> > provide evidence that in some communities, the term "master" is causing
> > concern all by itself. If we're going to make the change anyway, it might
> > be best to get it over with and pick the most appropriate terminology we
> > can agree upon, rather than trying to minimize the amount of change. It's
> > going to be backward breaking anyway, so we might as well do it all now
> > rather than risk having to go through two separate breaking changes at
> > different points in time.
> > >
> > > - Demian
> > >
> > > -Original Message-
> > > From: Noble Paul 
> > > Sent: Thursday, June 18, 2020 1:51 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: [EXTERNAL] Re: Getting rid of Master/Slave nomenclature in
> Solr
> > >
> > > Looking at the code I see a 692 occurrences of the word "slave".
> > > Mostly variable names and ref guide docs.
> > >
> > > The word "slave" is present in the responses as well. Any change in the
> > request param/response payload is backward incompatible.
> > >
> > > I have no objection to changing the names in ref guide and other
> > internal variables. Going ahead with backward incompatible changes is
> > painful. If somebody has the appetite to take it up, it's OK
> > >
> > > If we must change, master/follower can be a good enough option.
> > >
> > > master (noun): A man in charge of an organization or group.
> > > master(adj) : having or showing very great skill or proficiency.
> > > master(verb): acquire complete knowledge or skill in (a subject,
> > technique, or art).
> > > master (verb): gain control of; overcome.
> > >
> > > I hope nobody has a problem with the term "master"
> > >
> > > On Thu, Jun 18, 2020 at 

Re: Master Slave Terminology

2020-06-17 Thread Mike Drob
Hi Jan,

Can you link to the discussion? I searched the dev list and didn’t see
anything, is it on slack or a jira or somewhere else?

Mike

On Wed, Jun 17, 2020 at 1:51 AM Jan Høydahl  wrote:

> Hi Kaya,
>
> Thanks for bringing it up. The topic is already being discussed by
> developers, so expect to see some change in this area; Not over-night, but
> incremental.
> Also, if you want to lend a helping hand, patches are more than welcome as
> always.
>
> Jan
>
> > 17. jun. 2020 kl. 04:22 skrev Kayak28 :
> >
> > Hello, Community:
> >
> > As GitHub and Python will replace terminology related to slavery,
> > why don't we replace master-slave for Solr as well?
> >
> > https://developers.srad.jp/story/18/09/14/0935201/
> >
> https://developer-tech.com/news/2020/jun/15/github-replace-slavery-terms-master-whitelist/
> >
> > --
> >
> > Sincerely,
> > Kaya
> > github: https://github.com/28kayak
>
>


[ANNOUNCE] Apache Solr 8.5.2 released

2020-05-26 Thread Mike Drob
26 May 2020, Apache Solr™ 8.5.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 8.5.2

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

This release contains two bug fixes. The release is available for immediate
download at:

https://lucene.apache.org/solr/downloads.html

Solr 8.5.2 Bug Fixes:

   - SOLR-14411: Fix regression from SOLR-14359 (Admin UI 'Select an Option')
   - SOLR-14471: base replica selection strategy not applied to "last place"
     shards.preference matches

Solr 8.5.2 also includes 1 bugfix in the corresponding Apache Lucene
release.



Please report any feedback to the mailing lists (
https://lucene.apache.org/solr/community.html#mailing-lists-irc)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


Re: Solr 8.5.1 startup error - lengthTag=109, too big.

2020-05-26 Thread Mike Drob
Did you have SSL enabled with 8.2.1?

The error looks common to certificate handling and not specific to Solr.

I would verify that you have no extra characters in your certificate file
(including line endings) and that the keystore type that you specified
matches the file you are presenting (JKS or PKCS12)
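
One quick way to check the keystore outside of Solr/Jetty is to try loading it
with the type you configured; a small sketch (the path, password, and class
name are placeholders):

    import java.io.FileInputStream;
    import java.security.KeyStore;

    // Sketch: verify the keystore file actually parses as the configured type.
    public class KeystoreCheck {
        public static void main(String[] args) throws Exception {
            // Use "JKS" or "PKCS12" to match SOLR_SSL_KEY_STORE_TYPE in your solr.in.* file.
            KeyStore ks = KeyStore.getInstance("PKCS12");
            try (FileInputStream in = new FileInputStream("/path/to/solr-ssl.keystore.p12")) {
                ks.load(in, "secret".toCharArray());
            }
            System.out.println("Loaded keystore with " + ks.size() + " entries");
        }
    }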

Mike

On Sat, May 23, 2020 at 10:11 PM Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm trying to upgrade from Solr 8.2.1 to Solr 8.5.1, with Solr SSL
> Authentication and Authorization.
>
> However, I get the following error when I enable SSL. The Solr itself can
> start up if there is no SSL.  The main error that I see is this
>
>   java.io.IOException: DerInputStream.getLength(): lengthTag=109, too big.
>
> What could be the reason that causes this?
>
>
> INFO  - 2020-05-24 10:38:20.080;
> org.apache.solr.util.configuration.SSLConfigurations; Setting
> javax.net.ssl.keyStorePassword
> INFO  - 2020-05-24 10:38:20.081;
> org.apache.solr.util.configuration.SSLConfigurations; Setting
> javax.net.ssl.trustStorePassword
> Waiting up to 120 to see Solr running on port 8983
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.eclipse.jetty.start.Main.invokeMain(Main.java:218)
> at org.eclipse.jetty.start.Main.start(Main.java:491)
> at org.eclipse.jetty.start.Main.main(Main.java:77)d
> Caused by: java.security.PrivilegedActionException: java.io.IOException:
> DerInputStream.getLength(): lengthTag=109, too big.
> at java.security.AccessController.doPrivileged(Native Method)
> at
> org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1837)
> ... 7 more
> Caused by: java.io.IOException: DerInputStream.getLength(): lengthTag=109,
> too big.
> at sun.security.util.DerInputStream.getLength(Unknown Source)
> at sun.security.util.DerValue.init(Unknown Source)
> at sun.security.util.DerValue.(Unknown Source)
> at sun.security.util.DerValue.(Unknown Source)
> at sun.security.pkcs12.PKCS12KeyStore.engineLoad(Unknown Source)
> at java.security.KeyStore.load(Unknown Source)
> at
>
> org.eclipse.jetty.util.security.CertificateUtils.getKeyStore(CertificateUtils.java:54)
> at
>
> org.eclipse.jetty.util.ssl.SslContextFactory.loadKeyStore(SslContextFactory.java:1188)
> at
>
> org.eclipse.jetty.util.ssl.SslContextFactory.load(SslContextFactory.java:323)
> at
>
> org.eclipse.jetty.util.ssl.SslContextFactory.doStart(SslContextFactory.java:245)
> at
>
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> at
>
> org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:169)
> at
>
> org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:117)
> at
>
> org.eclipse.jetty.server.SslConnectionFactory.doStart(SslConnectionFactory.java:92)
> at
>
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> at
>
> org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:169)
> at
>
> org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:117)
> at
>
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:320)
> at
>
> org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:81)
> at
> org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:231)
> at
>
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> at org.eclipse.jetty.server.Server.doStart(Server.java:385)
> at
>
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> at
>
> org.eclipse.jetty.xml.XmlConfiguration.lambda$main$0(XmlConfiguration.java:1888)
> ... 9 more
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.eclipse.jetty.start.Main.invokeMain(Main.java:218)
> at org.eclipse.jetty.start.Main.start(Main.java:491)
> at org.eclipse.jetty.

Re: Download a pre-release version? 8.6

2020-05-15 Thread Mike Drob
We could theoretically include this in an 8.5.2 release, which should be
released soon. The change looks minimally risky to backport?

On Fri, May 15, 2020 at 3:43 PM Jan Høydahl  wrote:

> Check Jenkins:
> https://builds.apache.org/view/L/view/Lucene/job/Solr-Artifacts-8.x/lastSuccessfulBuild/artifact/solr/package/
>
> Jan Høydahl
>
> > 15. mai 2020 kl. 22:27 skrev Phill Campbell
> :
> >
> > Is there a way to download a tgz of the binary of a nightly build or
> similar?
> >
> > I have been testing 8.5.1 and ran into the bug with load balancing.
> > https://issues.apache.org/jira/browse/SOLR-14471 <
> https://issues.apache.org/jira/browse/SOLR-14471>
> >
> > It is a deal breaker for me to move forward with an upgrade of the
> system.
> >
> > I would like to start evaluating a version that has the fix.
> >
> > Is there a place to get a build?
> >
> > Thank you.
>


Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Mike Drob
This is how things get stemmed *now*, but I believe there is an open
question as to whether that is how they *should* be stemmed. Specifically,
the case appears to be -ify words not stemming to the same as -ification -
this applies to much more than identify/identification. Also, justify,
fortify, notify, many many others.

$ grep ification /usr/share/dict/words | wc -l
 328

I am by no means an expert on stemming, and if the folks at snowball decide
to tell us that this change is bad or hard because it would overstem some
other words, then I'll happily accept that. But I definitely want to use
their expertise rather than relying on my own.
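
For anyone who wants to poke at this outside Solr, the Snowball classes that
ship with Lucene can be driven directly. A small sketch using the English
(Porter2) stemmer (note the thread itself is about PorterStemFilter, so treat
the exact outputs as something to verify rather than gospel):

    import org.tartarus.snowball.ext.EnglishStemmer;

    // Sketch: run the Snowball English stemmer directly to compare how
    // -ify and -ification forms are reduced.
    public class StemCheck {
        public static void main(String[] args) {
            EnglishStemmer stemmer = new EnglishStemmer();
            String[] words = {"identify", "identifying", "identification",
                              "justify", "justification"};
            for (String word : words) {
                stemmer.setCurrent(word);
                stemmer.stem();
                System.out.println(word + " -> " + stemmer.getCurrent());
            }
        }
    }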

Mike

On Fri, May 1, 2020 at 10:35 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Unless I'm misunderstanding the bug in question, there is no bug. What you
> are observing is simply just how things get stemmed...
>
> Best,
> Audrey
>
> On 4/30/20, 6:37 PM, "Jhonny Lopez" 
> wrote:
>
> Yes, that sounds like it's worth it.
>
> Thanks guys!
>
> -Original Message-
> From: Mike Drob 
> Sent: jueves, 30 de abril de 2020 5:30 p. m.
> To: solr-user@lucene.apache.org
> Subject: Re: Possible issue with Stemming and nouns ended with suffix
> 'ion'
>
>
>
>
> Is this worth filing a bug/suggestion to the folks over at
> snowballstem.org?
>
> On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > I agree with Erick. I think that's just how the cookie crumbles when
> > stemming. If you have some time on your hands, you can integrate
> > OpenNLP with your Solr instance and start using the lemmas of tokens
> > instead of the stems. In this case, I believe if you were to
> lemmatize
> > both "identify" and "identification," they would both condense to
> "identify."
> >
> > Best,
> > Audrey
> >
> > On 4/30/20, 3:54 PM, "Erick Erickson" 
> wrote:
> >
> > They are being stemmed to two different tokens, “identif” and
> > “identifi”. Stemming is algorithmic and imperfect and in this case
> > you’re getting bitten by that algorithm. It looks like you’re using
> > PorterStemFilter, if you want you can look up the exact algorithm,
> but
> > I don’t think it’s a bug, just one of those little joys of English...
> >
> > To get a clearer picture of exactly what’s being searched, try
> > adding &debug=query to your query, in particular looking at the
> parsed
> > query that’s returned. That’ll tell you a bunch. In this particular
> > case I don’t think it’ll tell you anything more, but for future…
> >
> > Best,
> > Erick
> >
> > On, and un-checking the ‘verbose’ box on the analysis page
> removes
> > a lot of distraction, the detailed information is often TMI ;)
> >
> > > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez <
> > jhonny.lo...@publicismedia.com> wrote:
> > >
> > > Sure, rewriting the message with links for images:
> > >
> > >
> > > We’re facing an issue with stemming in solr. Most of the cases
> > are working correctly, for example, if we search for bidding, solr
> > brings results for bidding, bid, bids, etc. However, with nouns
> ended with ‘ion’
> > suffix, stemming is not working. Even when analyzers seems to have
> > correct stemming of the word, the results are not reflecting that.
> One
> > example. If I search ‘identifying’, this is the output:
> > >
> > > Analyzer (image link):
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo&e=
> > >
> > > A clip of results:
> > > "haschildren_b":false,
> > >"isbucket_text_s":"0",
> > >"sectionbody_t":&q

Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Mike Drob
Jhonny,

Are you planning on reporting the issue to snowball, or would you prefer
one of us take care of it?
If you do report it, please share the link to the issue or mail archive
back here so that we know when it is resolved and can update our
dependencies.

Thanks,
Mike

On Thu, Apr 30, 2020 at 5:37 PM Jhonny Lopez 
wrote:

> Yes, sounds like worth it.
>
> Thanks guys!
>
> -Original Message-----
> From: Mike Drob 
> Sent: jueves, 30 de abril de 2020 5:30 p. m.
> To: solr-user@lucene.apache.org
> Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'
>
> This email has been sent from a source external to Publicis Groupe. Please
> use caution when clicking links or opening attachments.
> Cet email a été envoyé depuis une source externe à Publicis Groupe.
> Veuillez faire preuve de prudence lorsque vous cliquez sur des liens ou
> lorsque vous ouvrez des pièces jointes.
>
>
>
> Is this worth filing a bug/suggestion to the folks over at
> snowballstem.org?
>
> On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > I agree with Erick. I think that's just how the cookie crumbles when
> > stemming. If you have some time on your hands, you can integrate
> > OpenNLP with your Solr instance and start using the lemmas of tokens
> > instead of the stems. In this case, I believe if you were to lemmatize
> > both "identify" and "identification," they would both condense to
> "identify."
> >
> > Best,
> > Audrey
> >
> > On 4/30/20, 3:54 PM, "Erick Erickson"  wrote:
> >
> > They are being stemmed to two different tokens, “identif” and
> > “identifi”. Stemming is algorithmic and imperfect and in this case
> > you’re getting bitten by that algorithm. It looks like you’re using
> > PorterStemFilter, if you want you can look up the exact algorithm, but
> > I don’t think it’s a bug, just one of those little joys of English...
> >
> > To get a clearer picture of exactly what’s being searched, try
> > adding &debug=query to your query, in particular looking at the parsed
> > query that’s returned. That’ll tell you a bunch. In this particular
> > case I don’t think it’ll tell you anything more, but for future…
> >
> > Best,
> > Erick
> >
> > On, and un-checking the ‘verbose’ box on the analysis page removes
> > a lot of distraction, the detailed information is often TMI ;)
> >
> > > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez <
> > jhonny.lo...@publicismedia.com> wrote:
> > >
> > > Sure, rewriting the message with links for images:
> > >
> > >
> > > We’re facing an issue with stemming in solr. Most of the cases
> > are working correctly, for example, if we search for bidding, solr
> > brings results for bidding, bid, bids, etc. However, with nouns ended
> with ‘ion’
> > suffix, stemming is not working. Even when analyzers seems to have
> > correct stemming of the word, the results are not reflecting that. One
> > example. If I search ‘identifying’, this is the output:
> > >
> > > Analyzer (image link):
> > >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo&e=
> > >
> > > A clip of results:
> > > "haschildren_b":false,
> > >"isbucket_text_s":"0",
> > >"sectionbody_t":"\n\n\nIn order to identify 1st price
> > auctions, leverage the proprietary tools available or manually pull a
> > log file report to understand the trends and gauge auction spread
> > overtime to assess the impact of variable auction
> dynamics.\n\n\n\n\n\n\n",
> > >"parsedupdatedby_s":"sitecorecarvaini",
> > >"sectionbody_t_en":"\n\n\nIn order to identify 1st price
> > auctions, leverage the proprietary tools available or manually pull a
> > log file report to understand the trends and gauge auction spread
> > overtime to assess the impact of variable auction
> dynamics.\n\n\n\n\n\n\n",
> > >"hide_section_b":false
> > >
> > >
> > > As you can see, it has used the stemming correctly and brings
> > results for other words based in th

Re: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread Mike Drob
Is this worth filing a bug/suggestion to the folks over at snowballstem.org?

On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> I agree with Erick. I think that's just how the cookie crumbles when
> stemming. If you have some time on your hands, you can integrate OpenNLP
> with your Solr instance and start using the lemmas of tokens instead of the
> stems. In this case, I believe if you were to lemmatize both "identify" and
> "identification," they would both condense to "identify."
>
> Best,
> Audrey
>
> On 4/30/20, 3:54 PM, "Erick Erickson"  wrote:
>
> They are being stemmed to two different tokens, “identif” and
> “identifi”. Stemming is algorithmic and imperfect and in this case you’re
> getting bitten by that algorithm. It looks like you’re using
> PorterStemFilter, if you want you can look up the exact algorithm, but I
> don’t think it’s a bug, just one of those little joys of English...
>
> To get a clearer picture of exactly what’s being searched, try adding
> &debug=query to your query, in particular looking at the parsed query
> that’s returned. That’ll tell you a bunch. In this particular case I don’t
> think it’ll tell you anything more, but for future…
>
> Best,
> Erick
>
> On, and un-checking the ‘verbose’ box on the analysis page removes a
> lot of distraction, the detailed information is often TMI ;)
>
> > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez <
> jhonny.lo...@publicismedia.com> wrote:
> >
> > Sure, rewriting the message with links for images:
> >
> >
> > We’re facing an issue with stemming in solr. Most of the cases are
> working correctly, for example, if we search for bidding, solr brings
> results for bidding, bid, bids, etc. However, with nouns ended with ‘ion’
> suffix, stemming is not working. Even when analyzers seems to have correct
> stemming of the word, the results are not reflecting that. One example. If
> I search ‘identifying’, this is the output:
> >
> > Analyzer (image link):
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd4-2DCp40Cmc0QioS0A-3Fe-3D1f3GJp&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=U-Wmu118X5bfNxDnADO_6ompf9kUxZYHj1DZM2lG4jo&e=
> >
> > A clip of results:
> > "haschildren_b":false,
> >"isbucket_text_s":"0",
> >"sectionbody_t":"\n\n\nIn order to identify 1st price
> auctions, leverage the proprietary tools available or manually pull a log
> file report to understand the trends and gauge auction spread overtime to
> assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
> >"parsedupdatedby_s":"sitecorecarvaini",
> >"sectionbody_t_en":"\n\n\nIn order to identify 1st price
> auctions, leverage the proprietary tools available or manually pull a log
> file report to understand the trends and gauge auction spread overtime to
> assess the impact of variable auction dynamics.\n\n\n\n\n\n\n",
> >"hide_section_b":false
> >
> >
> > As you can see, it has used the stemming correctly and brings
> results for other words based in the root, in this case “Identify”.
> >
> > However, if I search for “Identification”, this is the output:
> >
> > Analyzer (imagelink):
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__1drv.ms_u_s-21AlRTlFq8tQbShd49RpiQObzMgSjVhA&d=DwIFaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M&m=8Xt1N2A4ODj--DlLb242c8JMnJr6nWQIwcKjiDiA__s&s=5RlkLH-90sYc4nyIgnPO9MsBlyh7iWSOphEVdjUvTIE&e=
> >
> >
> > Even with proper stemming, solr is only bringing results for the
> word identification (or identifications) but nothing else.
> >
> > The queries are over the same field that has the Porter Stemming
> Filter applied for both, query and index. This behavior is consistent with
> other ‘ion’ ended nouns: representation, modification, etc.
> >
> > Solr Version: 8.1. Does anyone know why is it happening? Is it a bug?
> >
> > Thanks.
> >
> >
> >
> >
> >
> > -Original Message-
> >
> > From: Erick Erickson 
> >
> > Sent: jueves, 30 de abril de 2020 1:47 p. m.
> >
> > To: solr-user@lucene.apache.org
> >
> > Subject: Re: Possible issue with Stemming and nouns ended with
> suffix 'ion'
> >
> >
> >
> > This email has been sent from a source external to Publicis Groupe.
> Please use caution when clicking links or opening attachments.
> >
> > Cet email a été envoyé depuis une source externe à Publicis Groupe.
> Veuillez faire preuve de prudence lorsque vous cliquez sur des liens ou
> lorsque vous ouvrez des pièces jointes.
> >
> >
> >
> >
> >
> >
> >
> > The mail server is pretty aggressive about stripping links, so we
> can’t see the images.
> >
> >
> 

Re: Fuzzy search not working

2020-04-14 Thread Mike Drob
Pradeep,

First, some background on fuzzy term expansions:

1) A query for foobar~2 is really a query for (foobar OR foobar~1 OR
foobar~2)
2) Fuzzy term expansion will only take the first 50 terms found in the
index and drop the rest.

For implementation notes, see this comment -
https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/search/FuzzyTermsEnum.java#L229-L232

So in your first search, the available terms in the title_txt_en field are
few enough that "probl~2" does match "problem".
In your second search, with the copy field, there are likely many more
terms in all_text_txt_ens. Here, the edit-distance-1 terms crowd out the
edit-distance-2 terms, and the latter never match.
You can imagine that the term expands into… "probl OR prob OR probe OR
prob1 OR…"

I don't see a way to specify the number of expansions from a Solr query,
maybe somebody else on the list would know.

But at the end of the day, like Wunder said, you might want a prefix query
based on what you're describing.
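
For what it's worth, the expansion budget is reachable at the Lucene level even
though it is not exposed as a query parameter. A minimal sketch (the field and
term names are just the ones from this thread, and this assumes you are willing
to drop down to Lucene code, e.g. inside a custom query parser plugin):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;

    public class FuzzyExpansionSketch {
        public static void main(String[] args) {
            // FuzzyQuery(term, maxEdits, prefixLength, maxExpansions, transpositions).
            // The default maxExpansions is 50, which is the cap described above;
            // raising it gives edit-distance-2 terms a chance to survive in a
            // field with many close edit-distance-1 neighbours.
            Query fuzzy = new FuzzyQuery(
                    new Term("all_text_txt_ens", "probl"), 2, 0, 500, true);

            // A prefix query matches every term starting with "probl", which may
            // be closer to the intent of "find words that begin like this".
            Query prefix = new PrefixQuery(new Term("all_text_txt_ens", "probl"));

            System.out.println(fuzzy);
            System.out.println(prefix);
        }
    }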

Mike


On Mon, Apr 13, 2020 at 6:01 PM Deepu  wrote:

> Corrected Typo mistake.
>
> Hi Team,
>
> We have 8 text fields (*_txt_en) in schema and one multi valued text field
> which is copy field of other text fields, like below.
>
> tittle_txt_en, configuration_summary_txt_en, all_text_txt_ens (multi value
> field)
>
> Observed one issue with Fuzzy match, same term with distance of two(~2) is
> working on individual fields but not returning any results from multi
> valued field.
>
> Term we used is "probl" and document has "problem" term in two text fields,
> so all_text field has two occurrences of 'problem" terms.
>
>
>
> title_txt_en:probl~2. (given results)
>
> all_text_txt_ens:probl~2 (no results)
>
>
>
> is there any other factors involved in distance calculation other
> than Damerau-Levenshtein Distance algoritham?
>
> what might be the reason same input with same distance worked with one
> field and failed with other field in same collection?
>
> is there a way we can get actual distance solr calculated w.r.t specific
> document and specific field ?
>
>
>
> Thanks in advance !!
>
>
> Thanks,
>
> Pradeep
>
> On Mon, Apr 13, 2020 at 2:35 PM Deepu  wrote:
>
> > Hi Team,
> >
> > We have 8 text fields (*_txt_en) in schema and one multi valued text
> field
> > which is copy field of other text fields, like below.
> >
> > tittle_txt_en, configuration_summary_txt_en, all_text_txt_ens (multi
> value
> > field)
> >
> > Observed one issue with Fuzzy match, same term with distance of two(~2)
> is
> > working on individual fields but not returning any results from multi
> > valued field.
> >
> > Term we used is "prob" and document has "problem" term in two text
> fields,
> > so all_text field has two occurrences of 'problem" terms.
> >
> >
> >
> > title_txt_en:prob~2. (given results)
> >
> > all_text_txt_ens:prob~2 (no results)
> >
> >
> >
> > is there any other factors involved in distance calculation other
> > than Damerau-Levenshtein Distance algoritham?
> >
> > what might be the reason same input with same distance worked with one
> > field and failed with other field in same collection?
> >
> > is there a way we can get actual distance solr calculated w.r.t specific
> > document and specific field ?
> >
> >
> >
> > Thanks in advance !!
> >
> >
> > Thanks,
> >
> > Pradeep
> >
>


Re: Apache Solr 8.4.1 Basic Authentication

2020-03-27 Thread Mike Phillips


The line webResource = client.resource(resourceUrl); defines what action
I am performing, for example
"https://localhost:8985/solr/CoreName/select?q=*%3A*".
Are you setting up your URL correctly? My snippet was outlining the
additional Authorization header that also needs to be part of the
request; it assumed you were already pointing at a valid URL.
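
For reference, here is a minimal, self-contained sketch of such a request using
only JDK classes; the URL, core name, and credentials are placeholders taken
from this thread, not a recommendation. A 401 usually means the Authorization
value is wrong (e.g. not base64-encoded), while a 404 means the URL does not
point at a valid core or handler:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class SolrBasicAuthExample {
        public static void main(String[] args) throws Exception {
            String resourceUrl = "https://localhost:8985/solr/CoreName/select?q=*%3A*";
            String credentials = "solr:SolrRocks";

            // The value after "Basic " must be the base64 of "user:password",
            // not the raw "user:password" string itself.
            String authHeader = "Basic " + Base64.getEncoder()
                    .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));

            HttpURLConnection conn =
                    (HttpURLConnection) new URL(resourceUrl).openConnection();
            conn.setRequestProperty("Accept", "application/json");
            conn.setRequestProperty("Authorization", authHeader);

            int status = conn.getResponseCode();
            System.out.println("HTTP status: " + status);
            if (status == 200) {
                try (BufferedReader in = new BufferedReader(new InputStreamReader(
                        conn.getInputStream(), StandardCharsets.UTF_8))) {
                    in.lines().forEach(System.out::println);
                }
            }
        }
    }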


On 3/26/2020 3:59 PM, Altamirano, Emmanuel wrote:


Thank you so much for replying to my email, Mike.

I did now use base64 to encode the user and password, but now Solr 
doesn’t understand the credentials:


{Accept=[application/json], Content-Type=[application/json], 
*Authorization*=[Basic c29scjpTb2xyUm9ja3M=]}>] 
ERROR[org.springframework.web.client.HttpClientErrorException: 404 Not 
Found]


Before I got:

{Accept=[application/json], Content-Type=[application/json], 
*Authorization*=[Basic solr:SolrRocks]}>] 
ERROR[org.springframework.web.client.HttpClientErrorException: 401 
Invalid authentication token]


Is there something else that I need to configure?

*Emmanuel Altamirano,*

Consultant- Global Technology

International Operations

*Telephone:*312-985-3149

*Mobile:*312-860-3774


555 W. Adams 5^th Floor

Chicago, IL 60661

_transunion.com <http://www.transunion.com/>___



*From:* Mike Phillips 
*Sent:* Thursday, March 26, 2020 3:10 PM
*To:* Altamirano, Emmanuel 
*Subject:* Re: Apache Solr 8.4.1 Basic Authentication

*EXTERNAL SENDER:* Exercise caution with links and attachments.

I use Jersey to talk to solr. Here is a code snippet. You seem to be 
on the right track but you need to base64 encode the username/password 
bytes.


    String combined = username + ":" + password;
    String  encoded = base64.encode(combined.getBytes());
    String  authHeader = "Basic " + encoded;

    // Setup need to encode the query
    webResource = client.resource(resourceUrl);
    webResource.accept("*.*");

    // Perform request
    response = webResource.header("Content-Type", "application/json")
    .header("Authorization", authHeader)
    .get(ClientResponse.class);
    respStatus = response.getStatus();

On 3/26/2020 12:27 PM, Altamirano, Emmanuel wrote:

Hello everyone,

We recently enable Solr Basic Authentication in our Dev
environment and we are testing Solr security. We followed the
instructions provided in the Apache Solr website and it is working
using curl command.

If you could provide us any advice of how do we need to send the
credentials in the HTTP headers in a Java program? It is very
appreciate it.

HttpHeaders headers= *new*HttpHeaders();

headers.setAccept(Arrays./asList/(MediaType.*/APPLICATION_JSON/*));

headers.setContentType(MediaType.*/APPLICATION_JSON/*);

headers.add("Authorization", "Basic "+ "solr:SolrRocks");

Thanks,

*Emmanuel Altamirano,*

Consultant- Global Technology

International Operations

*Telephone:*312-985-3149

*Mobile:*312-860-3774


555 W. Adams 5^th Floor

Chicago, IL 60661

_transunion.com <http://www.transunion.com/>_






SolrCloud location for solr.xml

2020-02-28 Thread Mike Drob
Hi Searchers!

I was recently looking at some of the start-up logic for Solr and was
interested in cleaning it up a little bit. However, I'm not sure how common
certain deployment scenarios are. Specifically is anybody doing the
following combination:

* Using SolrCloud (i.e. state stored in zookeeper)
* Loading solr.xml from a local solr home rather than zookeeper

Much appreciated! Thanks,
Mike


Re: Is this a bug? Wildcard with PatternReplaceFilterFactory

2020-02-21 Thread Mike Phillips
It looks like the debug result you are showing me is for Rod's, not Rod’s, but in 
answer to your question:


This is why I think "Rod’s finds fields Rod's and Rod’s that are now in the index 
as rod's"


The analysis page shows Rod’s gets stored in the index as:
rod's rods rod s

Field Value (Index): Rod’s
Analyse Fieldname / FieldType: _text_

Verbose output: tokens produced by each stage of the index analyzer (text [raw_bytes]):

WT    Rod’s [52 6f 64 e2 80 99 73]
SF    Rod’s [52 6f 64 e2 80 99 73]
WDGF  Rod’s [52 6f 64 e2 80 99 73] | Rods [52 6f 64 73] | Rod [52 6f 64] | s [73]
FGF   Rod’s [52 6f 64 e2 80 99 73] | Rods [52 6f 64 73] | Rod [52 6f 64] | s [73]
PRF   Rod’s [52 6f 64 e2 80 99 73] | Rods [52 6f 64 73] | Rod [52 6f 64] | s [73]
PRF   Rod's [52 6f 64 27 73] | Rods [52 6f 64 73] | Rod [52 6f 64] | s [73]
PRF   Rod's [52 6f 64 27 73] | Rods [52 6f 64 73] | Rod [52 6f 64] | s [73]
PRF   Rod's [52 6f 64 27 73] | Rods [52 6f 64 73] | Rod [52 6f 64] | s [73]
LCF   rod's [72 6f 64 27 73] | rods [72 6f 64 73] | rod [72 6f 64] | s [73]



This is what we were trying to achieve with the <filter class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/> filter.



The problem is when using wildcard *Rod’s* we get no hits

{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "q":"*Rod’s*",
      "debugQuery":"on",
      "_":"1582315262594"}},
  "response":{"numFound":0,"start":0,"docs":[] },
  "debug":{
    "rawquerystring":"*Rod’s*",
    "querystring":"*Rod’s*",
    "parsedquery":"_text_:*rod’s*",
    "parsedquery_toString":"_text_:*rod’s*",
    "explain":{},
    "QParser":"LuceneQParser",
    ...







On 2/21/2020 11:52 AM, Erick Erickson wrote:

Why do you say “…that are now in the index as rod’s”? You have 
WordDelimiterGraphFilterFactory, which breaks things up. When I put your field 
definition in the schema and use the analysis page, turns “rod’s” into  the 
following 4 tokens:

rod’s
rods
rod
s

And querying on field:”*Rod’s*” works just fine. I’m using 8.x, and when I add 
“&debug=query” to the URL, I see:
{
"responseHeader": {
"status": 0, "QTime": 10, "params": {
"q": "eoe:\"*Rod's*\"", "debug": "query"
}
}, "response": {
"numFound": 1, "start": 0, "docs": [
{
"id": "1", "eoe": "Rod's", "_version_": 1659176849231577088
}
]
}, "debug": {
"rawquerystring": "eoe:\"*Rod's*\"", "querystring": "eoe:\"*Rod's*\"", "parsedquery": "SynonymQuery(Synonym(eoe:*rod's* 
eoe:rod))", "parsedquery_toString": "Synonym(eoe:*rod's* eoe:rod)", "QParser": "LuceneQParser"
}
}

What do you see?

Best,
Erick


On Feb 21, 2020, at 12:57 PM, Mike Phillips  
wrote:

Rod’s  finds fields Rod's and Rod’s that are now in the index as rod's

but *Rod’s* finds nothing because the index now only contains rod's





Is this a bug? Wildcard with PatternReplaceFilterFactory

2020-02-21 Thread Mike Phillips

Is this a bug? Wildcard with PatternReplaceFilterFactory

Attempting to normalize left and right single and double quotes for searches

‘   Left single quotation mark    '    Single quote
’   Right single quotation mark   '    Single quote
“   Left double quotation mark    "    Double quotes
”   Right double quotation mark   "    Double quotes


<fieldType name="..." class="solr.TextField" positionIncrementGap="100" multiValued="true">

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="‘" replacement="'"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="“" replacement="&quot;"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="”" replacement="&quot;"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="‘" replacement="'"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="’" replacement="'"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="“" replacement="&quot;"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="”" replacement="&quot;"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The wildcard seems to NOT utilize the PatternReplaceFilterFactory

Rod’s  finds fields Rod's and Rod’s that are now in the index as rod's

but *Rod’s* finds nothing because the index now only contains rod's



Re: Outdated information on JVM heap sizes in Solr 8.3 documentation?

2020-02-15 Thread Mike Drob
Erick,

Can you drop a link to that Jira here after you create it?

Many thanks,
Mike

On Fri, Feb 14, 2020 at 6:05 PM Erick Erickson 
wrote:

> I just read that page over and it looks way out of date. I’ll raise
> a JIRA.
>
> > On Feb 14, 2020, at 2:55 PM, Walter Underwood 
> wrote:
> >
> > Yeah, that is pretty outdated. At Netflix, I was running an 8 GB heap
> with Solr 1.3. :-)
> >
> > Every GC I know about has a stop-the-world collector as a last ditch
> measure.
> >
> > G1GC limits the time that the world will stop. It gives up after
> MaxGCPauseMillis
> > milliseconds and leaves the rest of the garbage uncollected. If it has 5
> seconds
> > worth of work to do that, it might take 10 seconds, but in 200 ms
> chunks. It does
> > a lot of other stuff outside of the pauses to make the major collections
> more effective.
> >
> > We wrote Ultraseek in Python+C because Python used reference counting and
> > did not do garbage collection. That is the only way to have no pauses
> with
> > automatic memory management.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >> On Feb 14, 2020, at 11:35 AM, Tom Burton-West 
> wrote:
> >>
> >> Hello,
> >>
> >> In the section on JVM tuning in the  Solr 8.3 documentation (
> >> https://lucene.apache.org/solr/guide/8_3/jvm-settings.html#jvm-settings
> )
> >> there is a paragraph which cautions about setting heap sizes over 2 GB:
> >>
> >> "The larger the heap the longer it takes to do garbage collection. This
> can
> >> mean minor, random pauses or, in extreme cases, "freeze the world"
> pauses
> >> of a minute or more. As a practical matter, this can become a serious
> >> problem for heap sizes that exceed about **two gigabytes**, even if far
> >> more physical memory is available. On robust hardware, you may get
> better
> >> results running multiple JVMs, rather than just one with a large memory
> >> heap. "  (** added by me)
> >>
> >> I suspect this paragraph is severely outdated, but am not a Java expert.
> >> It seems to be contradicted by the statement in "
> >>
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#memory-and-gc-settings
> "
> >> "...values between 10 and 20 gigabytes are not uncommon for production
> >> servers"
> >>
> >> Are "freeze the world" pauses still an issue with modern JVM's?
> >> Is it still advisable to avoid heap sizes over 2GB?
> >>
> >> Tom
> >> https://www.hathitrust.org/blogslarge-scale-search
> >
>
>


Re: Modify partial configsets using API

2019-05-08 Thread Mike Drob



On 2019/05/08 16:52:52, Shawn Heisey  wrote: 
> On 5/8/2019 10:50 AM, Mike Drob wrote:
> > Solr Experts,
> > 
> > Is there an existing API to modify just part of my configset, for example
> > synonyms or stopwords? I see that there is the schema API, but that is
> > pretty specific in scope.
> > 
> > Not sure if I should be looking at configset API to upload a zip with a
> > single file, or if there are more granular options available.
> 
> Here's a documentation link for managed resources:
> 
> https://lucene.apache.org/solr/guide/6_6/managed-resources.html
> 
> That's the 6.6 version of the documentation.  If you're running 
> something newer, which seems likely since 6.6 is quite old now, you 
> might want to look into a later documentation version.
> 
> Thanks,
> Shawn
> 

Thanks Shawn, this looks like it will fit the bill nicely!

One more question that I don't see covered in the documentation - if I have 
multiple collections sharing the same config set, does updating the managed 
stop words for one collection apply the change to all? Is this change persisted 
in zookeeper?

Mike


Modify partial configsets using API

2019-05-08 Thread Mike Drob
Solr Experts,

Is there an existing API to modify just part of my configset, for example
synonyms or stopwords? I see that there is the schema API, but that is
pretty specific in scope.

Not sure if I should be looking at configset API to upload a zip with a
single file, or if there are more granular options available.

Thanks,
Mike


Highlighting

2019-04-15 Thread Mike Phillips
I don't understand why highlighting does not return anything but the document 
id.
I created a core and imported all my data, and everything seems like it should be 
working.
From reading the documentation I expect it to show me highlight information for
assetName around Potter, but I never get anything but the document id (assetId).

Here are the entries I added to managed-schema for assetName. Does anybody know
what I should be targeting as a solution?




#- My query and results -#




  0
  1
  
Potter
on


  clientId:11
  assetTypeId:1


1555116259160
  


  
3
Harry Potter and The Order of The Phoenix.mov
2012-09-27T02:34:27Z
Quicktime with Audio
1
11
Level 3
27
27
Harry Potter and The Order of The Phoenix.mov
23.976
Non drop frame
2.35
16
4:2:0
1624857205677228032
  
3
StudelyCastleHotel
2019-04-12T22:57:33Z
JPEG
1
11
Level 3
10130
10130

  Producer


  Michael Potter


  7

StudelyCastleHotel
1630650901282684928

  
  



RE: Auto recovery of a failed Solr Cloud Node?

Erick,

Apologies I should have been more specific. "Failed solr node" mean's:

1. SolrCloud instance has crashed
2. SolrCloud Instance is up but not responding
3. SolrCloud Cluster is not responding

I'm trying to determine whether there is a health check available to detect the 
above, and whether SolrCloud has an automated mechanism to restart the instance 
when such an issue happens. Or is this something we have to code ourselves?

Thanks

Mike

-Original Message-
From: Erick Erickson 
Sent: 25 September 2018 18:25
To: solr-user 
Subject: Re: Auto recovery of a failed Solr Cloud Node?

What does "Failed solr node" mean? How do you mean if fails? There's lots of 
recovery built in for a replica that gets out-of-sync somehow (is shut down 
while indexing is going on etc). All that relies on having more than one 
replica per shard of course.

If the node completely dies due to hardware for instance, then yes the best 
solution now is to spin up another Solr node. I'm not sure what REPLACENODE 
does in this scenario.

If you're using HDFS there's an option to do this since the index is replicated 
by HDFS.

Best,
Erick
On Tue, Sep 25, 2018 at 8:48 AM Kimber, Mike  wrote:
>
> Hi,
>
> Is there a recommend design pattern or best practice for auto recovery of a 
> failed Solr Node?
>
> Am I correct to assume there is nothing out of the box for this and we have 
> to code our own solution?
>
> Thanks
>
> Michael Kimber
>
>




Auto recovery of a failed Solr Cloud Node?

Hi,

Is there a recommended design pattern or best practice for auto recovery of a 
failed Solr node?

Am I correct to assume there is nothing out of the box for this and we have to 
code our own solution?

Thanks

Michael Kimber




Way for DataImportHandler to use bind variables

Is there a way to configure the DataImportHandler to use bind variables for
the entity queries? To improve database performance.

Thanks,

Mike


RE: InetAddressPoint support in Solr or other IP type?

Thanks David. Is there a reason we wouldn't want to base the Solr 
implementation on the InetAddressPoint class?

https://lucene.apache.org/core/7_2_1/misc/org/apache/lucene/document/InetAddressPoint.html

I realize that is in the "misc" package for now, so it's not part of core 
Lucene. But it is nice in that it has one class for both ipv4 and ipv6 and 
it's based on point numerics rather than trie numerics which seem to be 
deprecated. I'm pretty familiar with the code base, I could take a stab at 
implementing this. I just wanted to make sure there wasn't something I was 
missing since I couldn't find any discussion on this.
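
For context, this is roughly what the Lucene class already gives you at the
index level. A sketch only: the field name and addresses are made up, and the
class currently lives in the lucene-misc module rather than core:

    import java.net.InetAddress;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.InetAddressPoint;
    import org.apache.lucene.search.Query;

    public class IpPointSketch {
        public static void main(String[] args) throws Exception {
            // Index side: the same field type handles IPv4 and IPv6 values.
            Document doc = new Document();
            doc.add(new InetAddressPoint("ip", InetAddress.getByName("192.168.1.42")));
            doc.add(new InetAddressPoint("ip", InetAddress.getByName("2001:db8::1")));

            // Query side: exact match, CIDR-style subnet match, and range match.
            Query exact = InetAddressPoint.newExactQuery(
                    "ip", InetAddress.getByName("192.168.1.42"));
            Query subnet = InetAddressPoint.newPrefixQuery(
                    "ip", InetAddress.getByName("192.168.0.0"), 16);
            Query range = InetAddressPoint.newRangeQuery(
                    "ip", InetAddress.getByName("10.0.0.0"),
                    InetAddress.getByName("10.0.255.255"));

            System.out.println(exact + " | " + subnet + " | " + range);
        }
    }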

Michael Cooper

-Original Message-
From: David Smiley [mailto:david.w.smi...@gmail.com]
Sent: Friday, March 23, 2018 5:14 PM
To: solr-user@lucene.apache.org
Subject: Re: InetAddressPoint support in Solr or other IP type?

Hi,

For IPv4, use TrieIntField with precisionStep=8

For IPv6 https://issues.apache.org/jira/browse/SOLR-6741   There's nothing
there yet; you could help out if you are familiar with the codebase.  Or you 
might try something relatively simple involving edge ngrams.

~ David

On Thu, Mar 22, 2018 at 1:09 PM Mike Cooper  wrote:

> I have scoured the web and cannot find any discussion of having the
> Lucene InetAddressPoint type exposed in Solr. Is there a reason this
> is omitted from the Solr supported types? Is it on the roadmap? Is
> there an alternative recommended way to index and store Ipv4 and Ipv6
> addresses for optimal range searches and subnet searches? Thanks for your 
> help.
>
>
>
> *Michael Cooper*
>
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com




InetAddressPoint support in Solr or other IP type?

I have scoured the web and cannot find any discussion of having the Lucene
InetAddressPoint type exposed in Solr. Is there a reason this is omitted
from the Solr supported types? Is it on the roadmap? Is there an alternative
recommended way to index and store Ipv4 and Ipv6 addresses for optimal range
searches and subnet searches? Thanks for your help.

 

Michael Cooper





Re: zero-day exploit security issue

Given that the already public nature of the disclosure, does it make sense
to make the work being done public prior to release as well?

Normally security fixes are kept private while the vulnerabilities are
private, but that's not the case here...

On Mon, Oct 16, 2017 at 1:20 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Yes, there is but it is private i.e. only the Apache Lucene PMC
> members can see it. This is standard for all security issues in Apache
> land. The fixes for this issue has been applied to the release
> branches and the Solr 7.1.0 release candidate is already up for vote.
> Barring any unforeseen circumstances, a 7.1.0 release with the fixes
> should be expected this week.
>
> On Fri, Oct 13, 2017 at 8:14 PM, Xie, Sean  wrote:
> > Is there a tracking to address this issue for SOLR 6.6.x and 7.x?
> >
> > https://lucene.apache.org/solr/news.html#12-october-
> 2017-please-secure-your-apache-solr-servers-since-a-
> zero-day-exploit-has-been-reported-on-a-public-mailing-list
> >
> > Sean
> >
> > Confidentiality Notice::  This email, including attachments, may include
> non-public, proprietary, confidential or legally privileged information.
> If you are not an intended recipient or an authorized agent of an intended
> recipient, you are hereby notified that any dissemination, distribution or
> copying of the information contained in or transmitted with this e-mail is
> unauthorized and strictly prohibited.  If you have received this email in
> error, please notify the sender by replying to this message and permanently
> delete this e-mail, its attachments, and any copies of it immediately.  You
> should not retain, copy or use this e-mail or any attachment for any
> purpose, nor disclose all or any part of the contents to any other person.
> Thank you.
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Two separate instances sharing the same zookeeper cluster

When you specify the zk string for a solr instance, you typically include a
chroot in it. I think the default is /solr, but it doesn't have to be, so
you should be able to run one cluster with -z zk1:2181/solr-dev and the other with -z zk1:2181/solr-prod

https://lucene.apache.org/solr/guide/6_6/setting-up-an-external-zookeeper-ensemble.html#SettingUpanExternalZooKeeperEnsemble-PointSolrattheinstance

On Thu, Sep 14, 2017 at 3:01 PM, James Keeney  wrote:

> I have a staging and a production solr cluster. I'd like to have them use
> the same zookeeper cluster. It seems like it is possible if I can set a
> different directory for the second cluster. I've looked through the
> documentation though and I can't quite figure out where to set that up. As
> a result my staging cluster nodes keep trying to add themselves tot he
> production cluster.
>
> If someone could point me in the right direction?
>
> Jim K.
> --
> Jim Keeney
> President, FitterWeb
> E: j...@fitterweb.com
> M: 703-568-5887
>
> *FitterWeb Consulting*
> *Are you lean and agile enough? *
>


Re: IndexReaders cannot exceed 2 Billion

> I have no idea whether you can successfully recover anything from that
> index now that it has broken the hard limit.

Theoretically, I think it's possible with some very surgical edits.
However, I've tried to do this in the past and abandoned it. The code to
split the index needs to be able to open it first, so we reasoned that we'd
have no way to demonstrate correctness and at that point restoring from a
backup was the best option.

Maybe somebody smarter or more determined has a better experience.

Mike

On Tue, Aug 8, 2017 at 10:21 AM, Shawn Heisey  wrote:

> On 8/7/2017 9:41 AM, Wael Kader wrote:
> > I faced an issue that is making me go crazy.
> > I am running SOLR saving data on HDFS and I have a single node setup with
> > an index that has been running fine until today.
> > I know that 2 billion documents is too much on a single node but it has
> > been running fine for my requirements and it was pretty fast.
> >
> > I restarted SOLR today and I am getting an error stating "Too many
> > documents, composite IndexReaders cannot exceed 2147483519.
> > The last backup I have is 2 weeks back and I really need the index to
> start
> > to get the data from the index.
>
> You have run into what I think might be the only *hard* limit in the
> entire Lucene ecosystem.  Other limits can usually be broken with
> careful programming, but that one is set in stone.
>
> A Lucene index uses a 32-bit Java integer to track the internal document
> ID.  In Java, numeric variables are signed.  For that reason, an integer
> cannot exceed (2^31)-1.  That number is 2147483647.  It appears that
> Lucene cuts that off at a value that's smaller by 128.  Not sure why
> that is, but it's probably to prevent problems when a small offset is
> added to the value.
>
> SolrCloud is perfectly capable of running indexes with far more than two
> billion documents, but as Yago mentioned, the collection must be sharded
> for that to happen.
>
> I have no idea whether you can successfully recover anything from that
> index now that it has broken the hard limit.
>
> Thanks,
> Shawn
>
>


Re: Solr Cloud 6.x - rollback best practice

The two collection approach with aliasing is a good approach.

You can also use the backup and restore APIs -
https://lucene.apache.org/solr/guide/6_6/making-and-restoring-backups.html
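
As a rough sketch of the two-collection / alias idea with SolrJ (the collection,
alias, and ZooKeeper addresses below are made up; the point is that searches
always go through the alias, and the alias is only re-pointed after a full
reindex has finished cleanly):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class AliasSwitchSketch {
        public static void main(String[] args) throws Exception {
            // SolrJ 6.x style builder; the chroot is part of the zk string.
            try (CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181/solr")
                    .build()) {
                // Reindex into the passive collection (here "products_b") first.
                // Only when that completes successfully, atomically re-point the
                // search alias; on failure, leave it on "products_a" and retry.
                CollectionAdminRequest.createAlias("products", "products_b")
                        .process(client);
            }
        }
    }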

Mike

On Wed, Jul 12, 2017 at 10:57 AM, Vincenzo D'Amore 
wrote:

> Hi,
>
> I'm moving to Solr Cloud 6.x and I see rollback cannot be supported when is
> in Cloud mode.
>
> In my scenario, there are basically two tasks (full indexing, partial
> indexing).
>
> Full indexing
> =
>
> This is the most important case, where I really need the possibility to
> rollback.
>
> The full reindex is basically done in 3 steps:
>
> 1. delete *:* all collection's documents
> 2. add all existing documents
> 3. commit
>
> If during the step 2 something go wrong (usually some problem with the
> source of data) I had to rollback.
>
> Partial reindexing
> =
>
> Unlike the the former, this case is executed in only 2 steps (no delete)
> and the number of documents indexed usually is small (or very small).
>
> Even in this case if the step 2 go wrong I had to rollback.
>
> Do you know if there is a common pattern, a best practice, something of
> useful to handle a rollback if something go wrong in these cases?
>
> My simplistic idea is to have two collections (active/passive), and switch
> from one to another only when all the steps are completed successfully.
>
> But, as you can understand, having two collections works well with full
> indexing, but how do I handle a partial reindexing if something goes wrong?
>
> So, I'll be grateful to whom would spend his/her time to give me a
> suggestion.
>
> Thanks in advance and best regards,
> Vincenzo
>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>


Re: Swapping indexes on disk

Pool
–  at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at org.eclipse.jetty.server.Server.handle(Server.java:368)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
590125502 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
590125503 [Thread-0] INFO  org.eclipse.jetty.util.thread.QueuedThreadPool
–  at java.lang.Thread.run(Thread.java:745)

Those are the last lines in the log, after all of the other indexes shut
down properly.

After that, a new log file is started, and it cannot start the index,
complaining about missing files. So at that point, the index is gone.

I'd love to prevent this from happening a third time. It's super baffling.
Any ideas?

Mike

On Tue, Jun 20, 2017 at 12:38 PM Mike Lissner <
mliss...@michaeljaylissner.com> wrote:

> Thanks for the suggestions everybody.
>
> Some responses to Shawn's questions:
>
> > Does your solr.xml file contain core definitions, or is that information
> in a core.properties file in each instanceDir?
>
> Were using core.properties files.
>
> > How did you install Solr
>
> Solr is installed just by downloading and unzipping. From there, we use
> the example directories as a starting point.
>
>
> > and how are you starting it?
>
> Using a pretty simple init script. Nothing too exotic here.
>
> > Do you have the full error and stacktrace from those null pointer
> exceptions?
>
> I put a log of the startup here:
> https://www.courtlistener.com/tools/sample-data/misc/null_logs.txt
>
> I created this by doing `grep -C 1000 -i nullpointer`, then cleaning out
> any private queries. I looked through it a bit. It looks like the index was
> missing a file, and was therefore unable to start up. I won't say it's
> impossible that the index was deleted before I started Solr, but it seemed
> to be operating fine using the other name prior to stopping solr and
> putting in a symlink. In the real-world logs, our disks are named /sata and
> /sata8 instead of /old and /new.
>
>
> > In the context of that information, what exactly did you do at each step
> of your process?
>
> The process above was pretty boring really.
>
> 1. Create new index and populate it:
>
>  - copied an existing index configuration into a new directory
>  - tweaked the datadir parameter in core.properties
>  - restarted solr
>  - re-indexed the database using usual HTTP API to populate the new index
>
> 2. stop solr: sudo service solr stop
>
> 3. make symlink:
>
>  - mv'ed the old index out of the way
>  - ln -s old new (or vice versa, I never remember which way ln goes)
>
> 4. start solr: sudo service solr start
>
> FWIW, I've got it working now using the SWAP index functionality, so the
> above is just in case somebody wants to try to track this down. I'll
> probably take those logs offline after a week or two.
>
> Mike
>
>
>

Re: Swapping indexes on disk

Thanks for the suggestions everybody.

Some responses to Shawn's questions:

> Does your solr.xml file contain core definitions, or is that information
in a core.properties file in each instanceDir?

We're using core.properties files.

> How did you install Solr

Solr is installed just by downloading and unzipping. From there, we use the
example directories as a starting point.

> and how are you starting it?

Using a pretty simple init script. Nothing too exotic here.

> Do you have the full error and stacktrace from those null pointer
exceptions?

I put a log of the startup here:
https://www.courtlistener.com/tools/sample-data/misc/null_logs.txt

I created this by doing `grep -C 1000 -i nullpointer`, then cleaning out
any private queries. I looked through it a bit. It looks like the index was
missing a file, and was therefore unable to start up. I won't say it's
impossible that the index was deleted before I started Solr, but it seemed
to be operating fine using the other name prior to stopping solr and
putting in a symlink. In the real-world logs, our disks are named /sata and
/sata8 instead of /old and /new.

> In the context of that information, what exactly did you do at each step
of your process?

The process above was pretty boring really.

1. Create new index and populate it:

 - copied an existing index configuration into a new directory
 - tweaked the datadir parameter in core.properties
 - restarted solr
 - re-indexed the database using usual HTTP API to populate the new index

2. stop solr: sudo service solr stop

3. make symlink:

 - mv'ed the old index out of the way
 - ln -s old new (or vice versa, I never remember which way ln goes)

4. start solr: sudo service solr start

FWIW, I've got it working now using the SWAP index functionality, so the
above is just in case somebody wants to try to track this down. I'll
probably take those logs offline after a week or two.

Mike


On Tue, Jun 20, 2017 at 7:20 AM Shawn Heisey  wrote:

> On 6/14/2017 12:26 PM, Mike Lissner wrote:
> > We are replacing a drive mounted at /old with one mounted at /new. Our
> > index currently lives on /old, and our plan was to:
> >
> > 1. Create a new index on /new
> > 2. Reindex from our database so that the new index on /new is properly
> > populated.
> > 3. Stop solr.
> > 4. Symlink /old to /new (Solr now looks for the index at /old/solr, which
> > redirects to /new/solr)
> > 5. Start solr
> > 6. (Later) Stop solr, swap the drives (old for new), and start solr.
> (Solr
> > now looks for the index at /old/solr again, and finds it there.)
> > 7. Delete the index pointing to /new created in step 1.
> >
> > The idea was that this would create a new index for solr, would populate
> it
> > with the right content, and would avoid having to touch our existing solr
> > configurations aside from creating one new index, which we could soon
> > delete.
> >
> > I just did steps 1-5, but I got null pointer exceptions when starting
> solr,
> > and it appears that the index on /new has been almost completely deleted
> by
> > Solr (this is a bummer, since it takes days to populate).
> >
> > Is this expected? Am I terribly crazy to try to swap indexes on disk? As
> > far as I know, the only difference between the indexes is their name.
> >
> > We're using Solr version 4.10.4.
>
> Solr should not delete indexes on startup.  The only time it should do
> that is when you explicitly request deletion.  Do you have the full
> error and stacktrace from those null pointer exceptions?  Something
> would have to be very wrong for it to behave like you describe.
>
> Does your solr.xml file contain core definitions, or is that information
> in a core.properties file in each instanceDir?  The latter is the only
> option supported in 5.0 and later, but the 4.10 version still supports
> both.
>
> How is Solr itself and the data directories laid out?  How did you
> install Solr, and how are you starting it?  In the context of that
> information, what exactly did you do at each step of your process?
>
> Thanks,
> Shawn
>
>


Re: (how) do folks use the Cloud Graph (Radial) in the Solr Admin UI?

+solr-user

Might get a different audience on this list.

-- Forwarded message --
From: Christine Poerschke (BLOOMBERG/ LONDON) 
Date: Fri, Jun 16, 2017 at 11:43 AM
Subject: (how) do folks use the Cloud Graph (Radial) in the Solr Admin UI?
To: d...@lucene.apache.org


Any thoughts on potentially removing the radial cloud graph?

https://issues.apache.org/jira/browse/SOLR-5405 is the background for the
question and further input and views would be very welcome.

Thanks,
Christine
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org


Re: Swapping indexes on disk

I figured Solr would have a native system built in, but since we don't use
it already, I didn't want to learn all of its ins and outs just for this
disk situation.

Ditto, essentially, applies for the swapping strategy. We don't have a Solr
expert, just me, a generalist, and sorting out these kinds of things can
take a while. The hope was to avoid that kind of complication with some
clever use of symlinks and minor downtime. Our front end has a retry
mechanism, so if solr is down for less than a minute, users will just have
delayed responses, which is fine.

The new strategy is to rsync the files while solr is live, stop solr, do a
rsync diff, then start solr again. That'll give a bit for bit copy with
very little downtime — it's the strategy postgres recommends for disk-based
backups, so it seems like a safer bet. We needed a re-index anyway due to
schema changes, which my first attempt included, but I guess that'll have
to wait.

Thanks for the replies. If anybody can explain why the first strategy
failed, I'd still be interested in learning.

Mike

On Wed, Jun 14, 2017 at 12:09 PM Chris Ulicny  wrote:

> Are you physically swapping the disks to introduce the new index? Or having
> both disks mounted at the same time?
>
> If the disks are simultaneously available, can you just swap the cores and
> then delete the core on the old disk?
>
> https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-SWAP
>
> We periodically move cores to different drives using solr's replication
> functionality and core swapping (after stopping replication). However, I've
> never encountered solr deleting an index like that.
>
>
>
> On Wed, Jun 14, 2017 at 2:48 PM David Hastings <
> hastings.recurs...@gmail.com>
> wrote:
>
> > I dont have an answer to why the folder got cleared, however i am
> wondering
> > why you arent using basic replication to do this exact same thing, since
> > solr will natively take care of all this for you with no interruption to
> > the user and no stop/start routines etc.
> >
> > On Wed, Jun 14, 2017 at 2:26 PM, Mike Lissner <
> > mliss...@michaeljaylissner.com> wrote:
> >
> > > We are replacing a drive mounted at /old with one mounted at /new. Our
> > > index currently lives on /old, and our plan was to:
> > >
> > > 1. Create a new index on /new
> > > 2. Reindex from our database so that the new index on /new is properly
> > > populated.
> > > 3. Stop solr.
> > > 4. Symlink /old to /new (Solr now looks for the index at /old/solr,
> which
> > > redirects to /new/solr)
> > > 5. Start solr
> > > 6. (Later) Stop solr, swap the drives (old for new), and start solr.
> > (Solr
> > > now looks for the index at /old/solr again, and finds it there.)
> > > 7. Delete the index pointing to /new created in step 1.
> > >
> > > The idea was that this would create a new index for solr, would
> populate
> > it
> > > with the right content, and would avoid having to touch our existing
> solr
> > > configurations aside from creating one new index, which we could soon
> > > delete.
> > >
> > > I just did steps 1-5, but I got null pointer exceptions when starting
> > solr,
> > > and it appears that the index on /new has been almost completely
> deleted
> > by
> > > Solr (this is a bummer, since it takes days to populate).
> > >
> > > Is this expected? Am I terribly crazy to try to swap indexes on disk?
> As
> > > far as I know, the only difference between the indexes is their name.
> > >
> > > We're using Solr version 4.10.4.
> > >
> > > Thank you,
> > >
> > > Mike
> > >
> >
>


Swapping indexes on disk

We are replacing a drive mounted at /old with one mounted at /new. Our
index currently lives on /old, and our plan was to:

1. Create a new index on /new
2. Reindex from our database so that the new index on /new is properly
populated.
3. Stop solr.
4. Symlink /old to /new (Solr now looks for the index at /old/solr, which
redirects to /new/solr)
5. Start solr
6. (Later) Stop solr, swap the drives (old for new), and start solr. (Solr
now looks for the index at /old/solr again, and finds it there.)
7. Delete the index pointing to /new created in step 1.

The idea was that this would create a new index for solr, would populate it
with the right content, and would avoid having to touch our existing solr
configurations aside from creating one new index, which we could soon
delete.

I just did steps 1-5, but I got null pointer exceptions when starting solr,
and it appears that the index on /new has been almost completely deleted by
Solr (this is a bummer, since it takes days to populate).

Is this expected? Am I terribly crazy to try to swap indexes on disk? As
far as I know, the only difference between the indexes is their name.

We're using Solr version 4.10.4.

Thank you,

Mike


Re: Can solrcloud be running on a read-only filesystem?

To throw out one possibility, a read-only file system has no (or at least a low)
possibility of corruption. If you have a static index then you shouldn't need to
be doing any recovery. You would still need to run ZK on a read-write filesystem,
but maybe Solr could work?

On Fri, Jun 2, 2017 at 10:15 AM, Erick Erickson 
wrote:

> As Susheel says, this is iffy, very iffy. You can disable tlogs
> entirely through solrconfig.xml, you can _probably_
> disable all of the Solr logging.
>
> You'd also have to _not_ run in SolrCloud. You say
> "some of the nodes eventually are stuck in the recovering phase"
> SolrCloud tries very hard to keep all of the replicas in sync.
> To do this it _must_ be able to copy from the leader to the follower.
> If it ever has to sync with the leader, it'll be stuck in recovery
> as you can see.
>
> You could spend a lot of time trying to make this work, but
> you haven't stated _why_ you want to. Perhaps there are
> other ways to get the functionality you want.
>
> Best,
> Erick
>
> On Fri, Jun 2, 2017 at 5:05 AM, Susheel Kumar 
> wrote:
> > I doubt it can run in readonly file system.  Even though there is no
> > ingestion etc.  Solr still needs to write to logs/tlogs for synching /
> > recovering etc
> >
> > Thnx
> >
> > On Fri, Jun 2, 2017 at 6:56 AM, Wudong Liu  wrote:
> >
> >> Hi All:
> >>
> >> We have a normal build/stage -> prod settings for our production
> pipeline.
> >> And we would build solr index in the build environment and then the
> index
> >> is copied to the prod environment.
> >>
> >> The solrcloud in prod seems working fine when the file system backing
> it is
> >> writable. However, we see many errors when the file system is readonly.
> >> Many exceptions are thrown regarding the tlog file cannot be open for
> write
> >> when the solr nodes are restarted with the new data; some of the nodes
> >> eventually are stuck in the recovering phase and never able to go back
> >> online in the cloud.
> >>
> >> Just wondering is anyone has any experience on Solrcloud running in
> >> readonly file system? Is it possible at all?
> >>
> >> Regards,
> >> Wudong
> >>
>


Re: Solr Web Crawler - Robots.txt

Isn't this exactly what Apache Nutch was built for?

On Thu, Jun 1, 2017 at 6:56 PM, David Choi  wrote:

> In any case after digging further I have found where it checks for
> robots.txt. Thanks!
>
> On Thu, Jun 1, 2017 at 5:34 PM Walter Underwood 
> wrote:
>
> > Which was exactly what I suggested.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Jun 1, 2017, at 3:31 PM, David Choi  wrote:
> > >
> > > In the mean time I have found a better solution at the moment is to
> test
> > on
> > > a site that allows users to crawl their site.
> > >
> > > On Thu, Jun 1, 2017 at 5:26 PM David Choi 
> > wrote:
> > >
> > >> I think you misunderstand the argument was about stealing content.
> Sorry
> > >> but I think you need to read what people write before making bold
> > >> statements.
> > >>
> > >> On Thu, Jun 1, 2017 at 5:20 PM Walter Underwood <
> wun...@wunderwood.org>
> > >> wrote:
> > >>
> > >>> Let’s not get snarky right away, especially when you are wrong.
> > >>>
> > >>> Corporations do not generally ignore robots.txt. I worked on a
> > commercial
> > >>> web spider for ten years. Occasionally, our customers did need to
> > bypass
> > >>> portions of robots.txt. That was usually because of a
> > poorly-maintained web
> > >>> server, or because our spider could safely crawl some content that
> > would
> > >>> cause problems for other crawlers.
> > >>>
> > >>> If you want to learn crawling, don’t start by breaking the
> conventions
> > of
> > >>> good web citizenship. Instead, start with sitemap.xml and crawl the
> > >>> preferred portions of a site.
> > >>>
> > >>> https://www.sitemaps.org/index.html <
> > https://www.sitemaps.org/index.html>
> > >>>
> > >>> If the site blocks you, find a different site to learn on.
> > >>>
> > >>> I like the looks of “Scrapy”, written in Python. I haven’t used it
> for
> > >>> anything big, but I’d start with that for learning.
> > >>>
> > >>> https://scrapy.org/ 
> > >>>
> > >>> If you want to learn on a site with a lot of content, try ours,
> > chegg.com
> > >>> But if your crawler gets out of hand, crawling too fast, we’ll block
> > it.
> > >>> Any other site will do the same.
> > >>>
> > >>> I would not base the crawler directly on Solr. A crawler needs a
> > >>> dedicated database to record the URLs visited, errors, duplicates,
> > etc. The
> > >>> output of the crawl goes to Solr. That is how we did it with
> Ultraseek
> > >>> (before Solr existed).
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> > >>>
> >  On Jun 1, 2017, at 3:01 PM, David Choi 
> > wrote:
> > 
> >  Oh well I guess its ok if a corporation does it but not someone
> > wanting
> > >>> to
> >  learn more about the field. I actually have written a crawler before
> > as
> >  well as the you know Inverted Index of how solr works but I just
> > thought
> >  its architecture was better suited for scaling.
> > 
> >  On Thu, Jun 1, 2017 at 4:47 PM Dave 
> > >>> wrote:
> > 
> > > And I mean that in the context of stealing content from sites that
> > > explicitly declare they don't want to be crawled. Robots.txt is to
> be
> > > followed.
> > >
> > >> On Jun 1, 2017, at 5:31 PM, David Choi 
> > >>> wrote:
> > >>
> > >> Hello,
> > >>
> > >> I was wondering if anyone could guide me on how to crawl the web
> and
> > >> ignore the robots.txt since I can not index some big sites. Or if
> > >>> someone
> > >> could point how to get around it. I read somewhere about a
> > >> protocol.plugin.check.robots
> > >> but that was for nutch.
> > >>
> > >> The way I index is
> > >> bin/post -c gettingstarted https://en.wikipedia.org/
> > >>
> > >> but I can't index the site I'm guessing because of the robots.txt.
> > >> I can index with
> > >> bin/post -c gettingstarted http://lucene.apache.org/solr
> > >>
> > >> which I am guessing allows it. I was also wondering how to find
> the
> > >>> name
> > > of
> > >> the crawler bin/post uses.
> > >
> > >>>
> > >>>
> >
> >
>


Re: Performance warning: Overlapping onDeskSearchers=2 solr

You're committing too frequently, so you have new searchers getting queued
up before the previous ones have been processed.

You have several options for dealing with this: you can increase the commit
interval, add hardware, or reduce query warming.

I don't know if uncommenting that section will help, because I don't know
what your current settings are, or whether you are using manual commits.

Mike

On Wed, May 17, 2017, 4:58 AM Srinivas Kashyap 
wrote:

> Hi All,
>
> We are using Solr 5.2.1 version and are currently experiencing below
> Warning in Solr Logging Console:
>
> Performance warning: Overlapping onDeskSearchers=2
>
> Also we encounter,
>
> org.apache.solr.common.SolrException: Error opening new searcher. exceeded
> limit of maxWarmingSearchers=2,​ try again later.
>
>
> The reason being, we are doing mass update on our application and solr
> experiencing the higher loads at times. Data is being indexed using DIH(sql
> queries).
>
> In solrconfig.xml below is the code.
>
> 
>
> Should we be uncommenting the above lines and try to avoid this error?
> Please help me.
>
> Thanks and Regards,
> Srinivas Kashyap
>
> 
>
> DISCLAIMER: E-mails and attachments from Bamboo Rose, LLC are
> confidential. If you are not the intended recipient, please notify the
> sender immediately by replying to the e-mail, and then delete it without
> making copies or using it in any way. No representation is made that this
> email or any attachments are free of viruses. Virus scanning is recommended
> and is the responsibility of the recipient.
>


Re: SOLR as nosql database store

> The searching install will be able to rebuild itself from the data
> storage install when that is required.

Is this a use case for CDCR?

Mike

On Tue, May 9, 2017 at 6:39 AM, Shawn Heisey  wrote:

> On 5/9/2017 12:58 AM, Bharath Kumar wrote:
> > Thanks Hrishikesh and Dave. We use SOLR cloud with 2 extra replicas,
> will that not serve as backup when something goes wrong? Also we use latest
> solr 6 and from the documentation of solr, the indexing performance has
> been good. The reason is that we are using MySQL as the primary data store
> and the performance might not be optimal if we write data at a very rapid
> rate. Already we index almost half the fields that are in MySQL in solr.
>
> A replica is protection against data loss in the event of hardware
> failure, but there are classes of problems that it cannot protect against.
>
> Although Solr (Lucene) does try *really* hard to never lose data that it
> hasn't been asked to delete, it is not designed to be a database.  It's
> a search engine.  Solr doesn't offer the same kinds of guarantees about
> the data it contains that software like MySQL does.
>
> I personally don't recommend trying to use Solr as a primary data store,
> but if that's what you really want to do, then I would suggest that you
> have two complete Solr installs, with multiple replicas on both.  One of
> them will be used for searching and have a configuration you're already
> familiar with, the other will be purely for data storage -- only certain
> fields like the uniqueKey will be indexed, but every other field will be
> stored only.
>
> Running with two separate Solr installs will allow you to optimize one
> for searching and the other for data storage.  The searching install
> will be able to rebuild itself from the data storage install when that
> is required.  If better performance is needed for the rebuild, you have
> the option of writing a multi-threaded or multi-process program that
> reads from one and writes to the other.
>
> Thanks,
> Shawn
>
>


Re: Both main and replica are trying to access solr_gc.log.0.current file

It might depend somewhat on how you are starting Solr (I am less familiar with
Windows), but you will need to give each instance a separate log4j.properties
file and configure the log location in there.

Also check out the Solr Ref Guide section on Configuring Logging,
subsection Permanent Logging Settings.

https://cwiki.apache.org/confluence/display/solr/Configuring+Logging

Mike

On Sat, Apr 29, 2017, 12:24 PM Zheng Lin Edwin Yeo 
wrote:

> Yes, both Solr instances are running in the same hardware.
>
> I believe they are pointing to the same log directories/config too.
>
> How do we point them to different log directories/config?
>
> Regards,
> Edwin
>
>
> On 30 April 2017 at 00:36, Mike Drob  wrote:
>
> > Are you running both Solr instances in the same hardware and pointing
> them
> > at the same log directories/config?
> >
> > On Sat, Apr 29, 2017, 2:56 AM Zheng Lin Edwin Yeo 
> > wrote:
> >
> > > Hi,
> > >
> > > I'm using Solr 6.4.2 on SolrCloud, and I'm running 2 replica of Solr.
> > >
> > > When I start the replica, I will encounter this error message. It is
> > > probably due to the Solr log, as both the main and the replica are
> trying
> > > to access the same solr_gc.log.0.current file.
> > >
> > > Is there anyway to prevent this?
> > >
> > > Besides this error message, the rest of the Solr for both main and
> > replica
> > > are running normally.
> > >
> > > Exception in thread "main" java.nio.file.FileSystemException:
> > > C:\edwin\solr\server\logs\solr_gc.log.0.current ->
> > > C:\edwin\solr\server\logs\archived\solr_gc.log.0.current: The process
> > >  cannot access the file because it is being used by another process.
> > >
> > > at
> > > sun.nio.fs.WindowsException.translateToIOException(WindowsException.j
> > > ava:86)
> > > at
> > > sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.jav
> > > a:97)
> > > at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> > > at
> > > sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.j
> > > ava:287)
> > > at java.nio.file.Files.move(Files.java:1395)
> > > at
> > > org.apache.solr.util.SolrCLI$UtilsTool.archiveGcLogs(SolrCLI.java:357
> > > 9)
> > > at
> > > org.apache.solr.util.SolrCLI$UtilsTool.runTool(SolrCLI.java:3548)
> > > at org.apache.solr.util.SolrCLI.main(SolrCLI.java:250)
> > > "Failed archiving old GC logs"
> > > Exception in thread "main" java.nio.file.FileSystemException:
> > > C:\edwin\solr\server\logs\solr-8983-console.log ->
> > > C:\edwin\solr\server\logs\archived\solr-8983-console.log: The process
> > >  cannot access the file because it is being used by another process.
> > >
> > > at
> > > sun.nio.fs.WindowsException.translateToIOException(WindowsException.j
> > > ava:86)
> > > at
> > > sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.jav
> > > a:97)
> > > at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> > > at
> > > sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.j
> > > ava:287)
> > > at java.nio.file.Files.move(Files.java:1395)
> > > at
> > > org.apache.solr.util.SolrCLI$UtilsTool.archiveConsoleLogs(SolrCLI.jav
> > > a:3608)
> > > at
> > > org.apache.solr.util.SolrCLI$UtilsTool.runTool(SolrCLI.java:3551)
> > > at org.apache.solr.util.SolrCLI.main(SolrCLI.java:250)
> > > "Failed archiving old console logs"
> > > Exception in thread "main" java.nio.file.FileSystemException:
> > > C:\edwin\solr\server\logs\solr.log -> C:\edwin\solr\server\logs\
> > solr.log.1:
> > > The process cannot access the file because i
> > > t is being used by another process.
> > >
> > > at
> > > sun.nio.fs.WindowsException.translateToIOException(WindowsException.j
> > > ava:86)
> > > at
> > > sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.jav
> > > a:97)
> > > at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> > > at
> > > sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.j
> > > ava:287)
> > > at java.nio.file.Files.move(Files.java:1395)
> > > at
> > > org.apache.solr.util.SolrCLI$UtilsTool.rotateSolrLogs(SolrCLI.java:36
> > > 51)
> > > at
> > > org.apache.solr.util.SolrCLI$UtilsTool.runTool(SolrCLI.java:3545)
> > > at org.apache.solr.util.SolrCLI.main(SolrCLI.java:250)
> > > "Failed rotating old Solr logs"
> > > Waiting up to 30 to see Solr running on port 8984
> > >
> > >
> > > Regards,
> > > Edwin
> > >
> >
>


Re: Both main and replica are trying to access solr_gc.log.0.current file

Are you running both Solr instances on the same hardware and pointing them
at the same log directories/config?

On Sat, Apr 29, 2017, 2:56 AM Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm using Solr 6.4.2 on SolrCloud, and I'm running 2 replica of Solr.
>
> When I start the replica, I will encounter this error message. It is
> probably due to the Solr log, as both the main and the replica are trying
> to access the same solr_gc.log.0.current file.
>
> Is there anyway to prevent this?
>
> Besides this error message, the rest of the Solr for both main and replica
> are running normally.
>
> Exception in thread "main" java.nio.file.FileSystemException:
> C:\edwin\solr\server\logs\solr_gc.log.0.current ->
> C:\edwin\solr\server\logs\archived\solr_gc.log.0.current: The process
>  cannot access the file because it is being used by another process.
>
> at
> sun.nio.fs.WindowsException.translateToIOException(WindowsException.j
> ava:86)
> at
> sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.jav
> a:97)
> at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> at
> sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.j
> ava:287)
> at java.nio.file.Files.move(Files.java:1395)
> at
> org.apache.solr.util.SolrCLI$UtilsTool.archiveGcLogs(SolrCLI.java:357
> 9)
> at
> org.apache.solr.util.SolrCLI$UtilsTool.runTool(SolrCLI.java:3548)
> at org.apache.solr.util.SolrCLI.main(SolrCLI.java:250)
> "Failed archiving old GC logs"
> Exception in thread "main" java.nio.file.FileSystemException:
> C:\edwin\solr\server\logs\solr-8983-console.log ->
> C:\edwin\solr\server\logs\archived\solr-8983-console.log: The process
>  cannot access the file because it is being used by another process.
>
> at
> sun.nio.fs.WindowsException.translateToIOException(WindowsException.j
> ava:86)
> at
> sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.jav
> a:97)
> at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> at
> sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.j
> ava:287)
> at java.nio.file.Files.move(Files.java:1395)
> at
> org.apache.solr.util.SolrCLI$UtilsTool.archiveConsoleLogs(SolrCLI.jav
> a:3608)
> at
> org.apache.solr.util.SolrCLI$UtilsTool.runTool(SolrCLI.java:3551)
> at org.apache.solr.util.SolrCLI.main(SolrCLI.java:250)
> "Failed archiving old console logs"
> Exception in thread "main" java.nio.file.FileSystemException:
> C:\edwin\solr\server\logs\solr.log -> C:\edwin\solr\server\logs\solr.log.1:
> The process cannot access the file because i
> t is being used by another process.
>
> at
> sun.nio.fs.WindowsException.translateToIOException(WindowsException.j
> ava:86)
> at
> sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.jav
> a:97)
> at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
> at
> sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.j
> ava:287)
> at java.nio.file.Files.move(Files.java:1395)
> at
> org.apache.solr.util.SolrCLI$UtilsTool.rotateSolrLogs(SolrCLI.java:36
> 51)
> at
> org.apache.solr.util.SolrCLI$UtilsTool.runTool(SolrCLI.java:3545)
> at org.apache.solr.util.SolrCLI.main(SolrCLI.java:250)
> "Failed rotating old Solr logs"
> Waiting up to 30 to see Solr running on port 8984
>
>
> Regards,
> Edwin
>


Re: SolrJ appears to have problems with Docker Toolbox

Thanks. I think I'll take a look at that. I decided to just build a big
vagrant-managed desktop VM to let me run Ubuntu on my company machine, so I
expect that this pain point may be largely gone soon.

On Mon, Apr 10, 2017 at 12:31 PM, Vincenzo D'Amore 
wrote:

> Hi Mike
>
> disclaimer I'm the author of https://github.com/freedev/
> solrcloud-zookeeper-docker
>
> I had same problem when I tried to create a cluster SolrCloud with docker,
> just because the docker instances were referred by ip addresses I cannot
> access with SolrJ.
>
> I avoided this problem referring each docker instance via a hostname
> instead of ip address.
>
> Docker-compose is a great help to have a network where your docker
> instances can be resolved using their names.
>
> I'll suggest to take a look at my project, in particular at the
> docker-compose.yml used to start a SolrCloud cluster (3 Solr nodes with a
> zookeeper ensemble of 3):
>
> https://raw.githubusercontent.com/freedev/solrcloud-
> zookeeper-docker/master/
> solrcloud-3-nodes-zookeeper-ensemble/docker-compose.yml
>
> Ok, I know, it sounds too much create a SolrCloud into a single VM, I did
> it just to understand how Solr works... :)
>
> Once you've build your SolrCloud Docker network, you can map the name of
> your docker instances externally, for example in your private network or in
> your hosts file.
>
> In other words, given a Docker Solr instance named solr-1, in the docker
> network the instance named solr-1 has a docker ip address that cannot be
> used outside the VM.
>
> So when you use SolrJ client on your computer you must have into /etc/hosts
> an entry solr-1 that points to the ip address your VM (the public network
> interface where the docker instance is mapped).
>
> Hope you understand... :)
>
> Cheers,
> Vincenzo
>
>
> On Sun, Apr 9, 2017 at 2:42 AM, Mike Thomsen 
> wrote:
>
> > I'm running two nodes of SolrCloud in Docker on Windows using Docker
> > Toolbox.  The problem I am having is that Docker Toolbox runs inside of a
> > VM and so it has an internal network inside the VM that is not accessible
> > to the Docker Toolbox VM's host OS. If I go to the VM's IP which is
> > 192.168.99.100, I can load the admin UI and do basic operations that are
> > written to go against that IP and port (like querying, schema editor,
> > manually adding documents, etc.)
> >
> > However, when I try to run code that uses SolrJ to add documents, it
> fails
> > because the ZK configuration has the IPs for the internal Docker network
> > which is 172.X.Y..Z. If I log into the toolbox VM and run the Java code
> > from there, it works just fine. From the host OS, doesn't.
> >
> > Anyone have any ideas on how to get around this? If I rewrite the
> indexing
> > code to do a manual JSON POST to the update handler on one of the nodes,
> it
> > does work just fine, but that leaves me not using SolrJ.
> >
> > Thanks,
> >
> > Mike
> >
>
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251 <349%20851%203251>
>


Re: SolrJ appears to have problems with Docker Toolbox

Hi Rick,

No, I just used the "official" one on Docker Hub (
https://hub.docker.com/_/solr/) and followed the instructions for linking
and working with ZooKeeper to get SolrCloud up and running.

I may have to go to the Docker forum in the end, but I thought I'd ask here
first since the only thing that seems to be broken is the Java client API,
not the servers, in this environment/configuration.

Thanks,

Mike

On Sat, Apr 8, 2017 at 9:41 PM, Rick Leir  wrote:

> Hi Mike
> Did you dockerize Solr yourself? I have some knowledge of Docker, and
> think that this question would get better help in a Docker forum.
> Cheers -- Rick
>
> On April 8, 2017 8:42:13 PM EDT, Mike Thomsen 
> wrote:
> >I'm running two nodes of SolrCloud in Docker on Windows using Docker
> >Toolbox.  The problem I am having is that Docker Toolbox runs inside of
> >a
> >VM and so it has an internal network inside the VM that is not
> >accessible
> >to the Docker Toolbox VM's host OS. If I go to the VM's IP which is
> >192.168.99.100, I can load the admin UI and do basic operations that
> >are
> >written to go against that IP and port (like querying, schema editor,
> >manually adding documents, etc.)
> >
> >However, when I try to run code that uses SolrJ to add documents, it
> >fails
> >because the ZK configuration has the IPs for the internal Docker
> >network
> >which is 172.X.Y..Z. If I log into the toolbox VM and run the Java code
> >from there, it works just fine. From the host OS, doesn't.
> >
> >Anyone have any ideas on how to get around this? If I rewrite the
> >indexing
> >code to do a manual JSON POST to the update handler on one of the
> >nodes, it
> >does work just fine, but that leaves me not using SolrJ.
> >
> >Thanks,
> >
> >Mike
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


SolrJ appears to have problems with Docker Toolbox

I'm running two nodes of SolrCloud in Docker on Windows using Docker
Toolbox.  The problem I am having is that Docker Toolbox runs inside of a
VM and so it has an internal network inside the VM that is not accessible
to the Docker Toolbox VM's host OS. If I go to the VM's IP which is
192.168.99.100, I can load the admin UI and do basic operations that are
written to go against that IP and port (like querying, schema editor,
manually adding documents, etc.)

However, when I try to run code that uses SolrJ to add documents, it fails
because the ZK configuration has the IPs for the internal Docker network
which is 172.X.Y..Z. If I log into the toolbox VM and run the Java code
from there, it works just fine. From the host OS, it doesn't.

Anyone have any ideas on how to get around this? If I rewrite the indexing
code to do a manual JSON POST to the update handler on one of the nodes, it
does work just fine, but that leaves me not using SolrJ.

Thanks,

Mike


Re: Data Import

If Solr is down, then adding through SolrJ would fail as well. Kafka's new
API has some great features for this sort of thing. The new client API is
designed to be run in a long-running loop where you poll for new messages
with a defined timeout (e.g. consumer.poll(1000) for 1s). So if Solr becomes
unstable or goes down, it's easy to have the consumer just stop and either
wait until Solr comes back up or save the data to disk/commit the Kafka
offsets to ZK and stop running.
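
For illustration, here's a minimal sketch of that loop. It assumes SolrJ's
HttpSolrClient, a hypothetical topic named "documents", and a made-up
id/body field mapping, and it only advances the Kafka offsets after Solr
has accepted the batch - not production code, just the shape of it:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class KafkaToSolr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "solr-indexer");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("documents")); // hypothetical topic

        SolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/mycollection").build(); // placeholder URL

        while (true) {
            // Block for up to 1s waiting for new messages.
            ConsumerRecords<String, String> records = consumer.poll(1000);
            if (records.isEmpty()) {
                continue;
            }
            List<SolrInputDocument> batch = new ArrayList<>();
            for (ConsumerRecord<String, String> record : records) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", record.key());        // made-up field mapping
                doc.addField("body_txt", record.value());
                batch.add(doc);
            }
            try {
                solr.add(batch);
                solr.commit(); // commit per batch just to keep the sketch simple
                // Only advance the offsets once Solr has accepted the batch,
                // so nothing is lost if Solr is down.
                consumer.commitSync();
            } catch (Exception e) {
                // Solr unavailable: don't commit offsets, back off and retry.
                Thread.sleep(5000);
            }
        }
    }
}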

On Fri, Mar 17, 2017 at 1:24 PM, OTH  wrote:

> Are Kafka and SQS interchangeable?  (The latter does not seem to be free.)
>
> @Wunder:
> I'm assuming, that updating to Solr would fail if Solr is unavailable not
> just if posting via say a DB trigger, but probably also if trying to post
> through SolrJ?  (Which is what I'm using for now.)  So, even if using
> SolrJ, it would be a good idea to use a queuing software?
>
> Thanks
>
> On Fri, Mar 17, 2017 at 10:12 PM, vishal jain  wrote:
>
> > Streaming the data through kafka would be a good option if near real time
> > data indexing is the key requirement.
> > In our application the RDBMS data is populated by an ETL job periodically
> > so we don't need real time data indexing for now.
> >
> > Cheers,
> > Vishal
> >
> > On Fri, Mar 17, 2017 at 10:30 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> > > Or set a trigger on your RDBMS's main table to put the relevant
> > > information in a different table (call it EVENTS) and have your SolrJ
> > > consult the EVENTS table periodically. Essentially you're using the
> > > EVENTS table as a queue where the trigger is the producer and the
> > > SolrJ program is the consumer.
> > >
> > > It's a polling solution though, so not event-driven. There's no
> > > mechanism that I know of have, say, your RDBMS push an event to DIH
> > > for instance.
> > >
> > > Hmmm, I do wonder if anyone's done anything with queueing (e.g. Kafka)
> > > for this kind of problem..
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Mar 17, 2017 at 8:41 AM, Alexandre Rafalovitch
> > >  wrote:
> > > > One assumes by hooking into the same code that updates RDBMS, as
> > > > opposed to be reverse engineering the changes from looking at the DB
> > > > content. This would be especially the case for Delete changes.
> > > >
> > > > Regards,
> > > >Alex.
> > > > 
> > > > http://www.solr-start.com/ - Resources for Solr users, new and
> > > experienced
> > > >
> > > >
> > > > On 17 March 2017 at 11:37, OTH  wrote:
> > > >>>
> > > >>> Also, solrj is good when you want your RDBMS updates make
> immediately
> > > >>> available in solr.
> > > >>
> > > >> How can SolrJ be used to make RDBMS updates immediately available?
> > > >> Thanks
> > > >>
> > > >> On Fri, Mar 17, 2017 at 2:28 PM, Sujay Bawaskar <
> > > sujaybawas...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hi Vishal,
> > > >>>
> > > >>> As per my experience DIH is the best for RDBMS to solr index. DIH
> > with
> > > >>> caching has best performance. DIH nested entities allow you to
> define
> > > >>> simple queries.
> > > >>> Also, solrj is good when you want your RDBMS updates make
> immediately
> > > >>> available in solr. DIH full import can be used for index all data
> > first
> > > >>> time or restore index in case index is corrupted.
> > > >>>
> > > >>> Thanks,
> > > >>> Sujay
> > > >>>
> > > >>> On Fri, Mar 17, 2017 at 2:34 PM, vishal jain 
> > > wrote:
> > > >>>
> > > >>> > Hi,
> > > >>> >
> > > >>> >
> > > >>> > I am new to Solr and am trying to move data from my RDBMS to
> Solr.
> > I
> > > know
> > > >>> > the available options are:
> > > >>> > 1) Post Tool
> > > >>> > 2) DIH
> > > >>> > 3) SolrJ (as ours is a J2EE application).
> > > >>> >
> > > >>> > I want to know what is the recommended way for Data import in
> > > production
> > > >>> > environment.
> > > >>> > Will sending data via SolrJ in batches be faster than posting a
> csv
> > > using
> > > >>> > POST tool?
> > > >>> >
> > > >>> >
> > > >>> > Thanks,
> > > >>> > Vishal
> > > >>> >
> > > >>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Thanks,
> > > >>> Sujay P Bawaskar
> > > >>> M:+91-77091 53669
> > > >>>
> > >
> >
>


Re: SOLR Data Locality

I've only ever used the HDFS support with Cloudera's build, but my
experience turned me off using HDFS. I'd much rather use the native file
system than HDFS.

On Tue, Mar 14, 2017 at 10:19 AM, Muhammad Imad Qureshi <
imadgr...@yahoo.com.invalid> wrote:

> We have a 30 node Hadoop cluster and each data node has a SOLR instance
> also running. Data is stored in HDFS. We are adding 10 nodes to the
> cluster. After adding nodes, we'll run HDFS balancer and also create SOLR
> replicas on new nodes. This will affect data locality. does this impact how
> solr works (I mean performance) if the data is on a remote node? ThanksImad
>


How to expose new Lucene field type to Solr

Found this project and I'd like to know what would be involved with
exposing its RestrictedField type through Solr for indexing and querying as
a Solr field type.

https://github.com/roshanp/lucure-core

Thanks,

Mike


Re: solr warning - filling logs

It's a brittle ZK configuration. A typical ZK quorum is three nodes for
most production systems. One is fine, though, for development provided the
system it's on is not overloaded.

On Mon, Feb 27, 2017 at 6:43 PM, Rick Leir  wrote:

> Hi Mike
> We are using a single ZK node, I think. What problems should we expect?
> Thanks -- Rick
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Index Segments not Merging

Just barely skimmed the documentation, but it looks like the tool generates
its own shards and pushes them into the collection by manipulating the
configuration of the cluster.

https://www.cloudera.com/documentation/enterprise/5-8-x/topics/search_mapreduceindexertool.html

If that reading is correct, it would stand to reason that Solr (at least as
of Solr 4.10 which is what CDH ships) would not be doing the periodic
cleanup it normally does when building shards through its APIs.

On Thu, Feb 23, 2017 at 10:01 PM, Jordan Drake 
wrote:

> We have solr with the index stored in HDFS. We are running MapReduce jobs
> to build the index using the MapReduceIndexerTool from Cloudera with the
> go-live option to merge into our live index.
>
> We are seeing an issue where the number of segments in the index never
> reduces. It continues to grow until we manually do an optimize.
>
> We are using the following solr config for merge policy
>
>
>
>
>
>
>
>
>
>
>
> * name="maxMergeAtOnce">10 name="segmentsPerTier">10 class="org.apache.lucene.index.ConcurrentMergeScheduler"> name="maxThreadCount">1 name="maxMergeCount">6*
>
> If we add documents into solr without using MapReduce the segments merge
> properly as expected.
>
> Any ideas on why we see this behavior? Does the solr index merge prevent
> the segments from merging?
>
>
> Thanks,
> Jordan
>


Re: solr warning - filling logs

When you transition to an external zookeeper, you'll need at least 3 ZK
nodes. One is insufficient outside of a development environment. That's a
general requirement for any system that uses ZK.

On Sun, Feb 26, 2017 at 7:14 PM, Satya Marivada 
wrote:

> May I ask about the port scanner running? Can you please elaborate?
> Sure, will try to move out to external zookeeper
>
> On Sun, Feb 26, 2017 at 7:07 PM Dave  wrote:
>
> > You shouldn't use the embedded zookeeper with solr, it's just for
> > development not anywhere near worthy of being out in production.
> Otherwise
> > it looks like you may have a port scanner running. In any case don't use
> > the zk that comes with solr
> >
> > > On Feb 26, 2017, at 6:52 PM, Satya Marivada  >
> > wrote:
> > >
> > > Hi All,
> > >
> > > I have configured solr with SSL and enabled http authentication. It is
> > all
> > > working fine on the solr admin page, indexing and querying process. One
> > > bothering thing is that it is filling up logs every second saying no
> > > authority, I have configured host name, port and authentication
> > parameters
> > > right in all config files. Not sure, where is it coming from. Any
> > > suggestions, please. Really appreciate it. It is with sol-6.3.0 cloud
> > with
> > > embedded zookeeper. Could it be some bug with solr-6.3.0 or am I
> missing
> > > some configuration?
> > >
> > > 2017-02-26 23:32:43.660 WARN (qtp606548741-18) [c:plog s:shard1
> > > r:core_node2 x:plog_shard1_replica1] o.e.j.h.HttpParser parse
> exception:
> > > java.lang.IllegalArgumentException: No Authority for
> > > HttpChannelOverHttp@6dac689d{r=0,c=false,a=IDLE,uri=null}
> > > java.lang.IllegalArgumentException: No Authority
> > > at
> > >
> > org.eclipse.jetty.http.HostPortHttpField.(
> HostPortHttpField.java:43)
> > > at org.eclipse.jetty.http.HttpParser.parsedHeader(HttpParser.java:877)
> > > at org.eclipse.jetty.http.HttpParser.parseHeaders(
> HttpParser.java:1050)
> > > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1266)
> > > at
> > >
> > org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(
> HttpConnection.java:344)
> > > at
> > >
> > org.eclipse.jetty.server.HttpConnection.onFillable(
> HttpConnection.java:227)
> > > at org.eclipse.jetty.io
> > > .AbstractConnection$ReadCallback.succeeded(
> AbstractConnection.java:273)
> > > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> > > at
> > org.eclipse.jetty.io.ssl.SslConnection.onFillable(
> SslConnection.java:186)
> > > at org.eclipse.jetty.io
> > > .AbstractConnection$ReadCallback.succeeded(
> AbstractConnection.java:273)
> > > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> > > at org.eclipse.jetty.io
> > > .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> > > at
> > >
> > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.
> produceAndRun(ExecuteProduceConsume.java:246)
> > > at
> > >
> > org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(
> ExecuteProduceConsume.java:156)
> > > at
> > >
> > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
> QueuedThreadPool.java:654)
> > > at
> > >
> > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(
> QueuedThreadPool.java:572)
> > > at java.lang.Thread.run(Thread.java:745)
> >
>


Re: Fwd: Solr dynamic field blowing up the index size

Correct me if I'm wrong, but heavy use of doc values should actually blow
up the size of your index considerably if they are in fields that get sent
a lot of data.

On Tue, Feb 21, 2017 at 10:50 AM, Pratik Patel  wrote:

> Thanks for the reply. I can see that in solr 6, more than 50% of the index
> directory is occupied by ".nvd" file extension. It is something related to
> norms and doc values.
>
> On Tue, Feb 21, 2017 at 10:27 AM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> wrote:
>
> > Did you look in the data directories to check what index file extensions
> > contribute most to the difference? That could give a hint.
> >
> > Regards,
> > Alex
> >
> > On 21 Feb 2017 9:47 AM, "Pratik Patel"  wrote:
> >
> > > Here is the same question in stackOverflow for better format.
> > >
> > > http://stackoverflow.com/questions/42370231/solr-
> > > dynamic-field-blowing-up-the-index-size
> > >
> > > Recently, I upgraded from solr 5.0 to solr 6.4.1. I can run my app fine
> > but
> > > the problem is that index size with solr 6 is way too large. In solr 5,
> > > index size was about 15GB and in solr 6, for the same data, the index
> > size
> > > is 300GB! I am not able to understand what contributes to such huge
> > > difference in solr 6.
> > >
> > > I have been able to identify a field which is blowing up the size of
> > index.
> > > It is as follows.
> > >
> > >  > > stored="true" multiValued="true"  />
> > >
> > >  > > stored="false" multiValued="true"  />
> > > 
> > >
> > > When this field is commented out, the index size reduces to less than
> > 10GB.
> > >
> > > This field is of type text_general. Following is the definition of this
> > > type.
> > >
> > >  > > positionIncrementGap="100">
> > >   
> > > 
> > > 
> > > 
> > >  > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> > >  > > protected="protwords.txt" generateWordParts="1"
> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > > catenateAll="0" splitOnCaseChange="0"/>
> > > 
> > >  > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> > > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> > > />
> > >   
> > >   
> > > 
> > > 
> > > 
> > >  > > pattern="((?m)[a-z]+)'s" replacement="$1s" />
> > >  > > protected="protwords.txt" generateWordParts="1"
> > > generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> > > catenateAll="0" splitOnCaseChange="0"/>
> > > 
> > >  > > words="C:/Users/pratik/Desktop/solr-6.4.1_playground/
> > > solr-6.4.1/server/solr/collection1/conf/stopwords.txt"
> > > />
> > >   
> > >   
> > >
> > > Few things which I did to debug this issue:
> > >
> > >- I have ensured that field type definition is same as what I was
> > using
> > >in solr 5 and it is also valid in version 6. This field type
> > considers a
> > >list of "stopwords" to be ignored during indexing. I have supplied
> the
> > > same
> > >list of stopwords which we were using in solr 5. I have verified
> that
> > > path
> > >of this file is correct and it is being loaded fine in solr admin
> UI.
> > > When
> > >I analyse these fields using "Analysis" tab of the solr admin UI, I
> > can
> > > see
> > >that stopwords are being filtered out. However, when I query with
> some
> > > of
> > >these stopwords, I do get the results back which makes me think that
> > >probably stopwords are being indexed.
> > >
> > > Any idea what could increase the size of index by so much in solr 6?
> > >
> >
>


Re: Solr partial update

Set the fl parameter equal to the fields you want and then query for
id:(SOME_ID OR SOME_ID OR SOME_ID)
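
With SolrJ that check might look something like this (just a sketch; the
core URL and ids are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExistenceCheck {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery query = new SolrQuery("id:(ID_1 OR ID_2 OR ID_3)");
            query.setFields("id"); // fl=id, only return the unique key
            query.setRows(3);
            QueryResponse rsp = solr.query(query);
            // Any id that comes back already exists, so it's safe to send a
            // partial (atomic) update for it; skip the ones that are missing.
            rsp.getResults().forEach(d -> System.out.println(d.getFieldValue("id")));
        }
    }
}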

On Thu, Feb 9, 2017 at 5:37 AM, Midas A  wrote:

> Hi,
>
> i want solr doc partially if unique id exist else we donot want to do any
> thing .
>
> how can i achieve this .
>
> Regards,
> Midas
>


Re: Solr Kafka DIH

Probably not, but writing your own little Java process to do it would be
trivial with Kafka 0.9.X or 0.10.X. You can also look at the Confluent
Platform as they have tons of connectors for Kafka to directly feed into
other systems.

On Mon, Jan 30, 2017 at 3:05 AM, Mahmoud Almokadem 
wrote:

> Hello,
>
> Is there a way to get SolrCloud to pull data from a topic in Kafak
> periodically using Dataimport Handler?
>
> Thanks
> Mahmoud


Re: Is it possible to rewrite part of the solr response?

I finally got a chance to deep dive into this and have a preliminary
working plugin. I'm starting to look at optimization strategies for how to
speed processing up and am wondering if you can give me some more
information about your "bailout" strategy.
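
For anyone else following this thread, here's a stripped-down sketch of the
kind of post filter Erick describes below, with a naive bailout counter. It's
illustrative only, not my actual plugin: the class and field names are made
up, the ACL check is a stub, and it's written against the 6.x API (4.x uses
AtomicReaderContext instead of LeafReaderContext).

import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class AclPostFilterQuery extends ExtendedQueryBase implements PostFilter {

    private final int maxAllowed = 10000; // bail out after this many allowed docs

    @Override
    public boolean getCache() {
        return false; // post filters must not be cached
    }

    @Override
    public int getCost() {
        return Math.max(super.getCost(), 100); // cost >= 100 marks it as a post filter
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
            private int allowed = 0;
            private int base = 0;

            @Override
            public void doSetNextReader(LeafReaderContext context) throws IOException {
                super.doSetNextReader(context);
                base = context.docBase;
            }

            @Override
            public void collect(int doc) throws IOException {
                if (allowed >= maxAllowed) {
                    return; // bailout: drop the rest (or throw to ask for a narrower query)
                }
                if (isVisible(base + doc)) { // stub for the real ACL/business-rule check
                    allowed++;
                    super.collect(doc); // pass the allowed doc down to the delegate collector
                }
            }
        };
    }

    private boolean isVisible(int globalDocId) {
        return true; // real code would consult doc values or an external service here
    }

    @Override
    public boolean equals(Object other) {
        return other == this || (other != null && other.getClass() == getClass());
    }

    @Override
    public int hashCode() {
        return getClass().hashCode();
    }
}

It would get wired in through a small QParserPlugin so it can be used like
any other fq clause; I've left that part out for brevity.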

Thanks,

Mike

On Wed, Dec 21, 2016 at 9:08 PM, Erick Erickson 
wrote:

> "grab the response" is a bit ambiguous here in Solr terms. Sure,
> a SearchComponent (you can write a plugin) gets the response,
> but it only sees the final list being returned to the user, i.e. if you
> have rows=15 it sees only 15 docs. Not sure that's adequate,
> in the case above you could easily not be allowed to see any of
> the top N docs. Plus, doing anything like this would give very
> skewed things like facets, grouping, etc. Say the facets were
> calculated over 534 hits but the user was only allowed to see 10 docs...
> Very confusing.
>
> The most robust solution would be a "post filter", another bit
> of custom code that you write (plugin). See:
> http://yonik.com/advanced-filter-caching-in-solr/
> A post filter sees _all_ the documents that satisfy the query,
> and makes an include/exclude decision on each one (just
> like any other fq clause). So facets, grouping and all the rest
> "just work". Do be aware that if the ACL calculations are  expensive
> you need to be prepared for the system administrator doing a
> *:* query. I usually build in a bailout and stop passing documents
> after some number and pass back a result about "please narrow
> down your search". Of course if your business logic is such that
> you can calculate them all "fast enough", you're golden.
>
> All that said, if there's any way you can build this into tokens in the
> doc and use a standard fq clause it's usually much easier. That may
> take some creative work at indexing time if it's even possible.
>
> Best,
> Erick
>
> On Wed, Dec 21, 2016 at 5:56 PM, Mike Thomsen 
> wrote:
> > We're trying out some ideas on locking down solr and would like to know
> if
> > there is a public API that allows you to grab the response before it is
> > sent and inspect it. What we're trying to do is something for which a
> > filter query is not a good option to really get where we want to be.
> > Basically, it's an integration with some business logic to make a final
> > pass at ensuring that certain business rules are followed in the event a
> > query returns documents a user is not authorized to see.
> >
> > Thanks,
> >
> > Mike
>


Re: Solr ACL Plugin Windows

I didn't see a real Java project there, but the directions to compile on
Linux are almost always applicable to Windows with Java. If you find a
project that says it uses Ant or Maven, all you need to do is download Ant
or Maven plus the Java Development Kit and put them on the Windows
path. Then it's either "ant package" (IIRC most of the time) or "mvn
install" from within the folder that has the project.

FWIW, creating a simple ACL doesn't even require a custom plugin. This is
roughly how you would do it w/ an application that your team has written
that works with solr:

1. Add a multivalue string field called ACL or privileges
2. Write something for your app that can pull a list of
attributes/privileges from a database for the current user.
3. Append a filter query to the query that matches those attributes. Ex:

fq=privileges:(DEVELOPER AND DEVOPS)


If you are using a role-based system that bundles groups of permissions
into a role, all you need to do is decompose the role into a list of
permissions for the user and put all of the required permissions into that
multivalue field.
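
A sketch of steps 2 and 3 with SolrJ (the URL, field name, and role values
below are just placeholders):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class AclSearch {
    public static void main(String[] args) throws Exception {
        // Step 2: in a real app this list comes from your user/role database.
        List<String> privileges = Arrays.asList("DEVELOPER", "DEVOPS");

        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery query = new SolrQuery("some user query");
            // Step 3: append a filter query built from the user's privileges.
            query.addFilterQuery("privileges:(" + String.join(" AND ", privileges) + ")");
            System.out.println(solr.query(query).getResults().getNumFound());
        }
    }
}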

Mike

On Wed, Jan 4, 2017 at 2:55 AM,  wrote:

> I am searching a SOLR ACL Plugin, i found this
> https://lucidworks.com/blog/2015/05/15/custom-security-filtering-solr-5/
>
> but i don't know how i can compile the jave into to a jar - all Infos i
> found was how to complie it on linux - but this doesn't help.
>
> I am running solr version 6.3.0 on windows Server 2003
>
> So i am searching for infos about compiling a plugin under windows.
>
> Thanxs in advance :D
>
> 
> This message was sent using IMP, the Internet Messaging Program.
>
>


Re: HDFS support maturity

Cloudera defaults their Hadoop installation to use HDFS w/ their bundle of
Solr (4.10.3) if that is any indication.

On Tue, Jan 3, 2017 at 7:40 AM, Hendrik Haddorp 
wrote:

> Hi,
>
> is the HDFS support in Solr 6.3 considered production ready?
> Any idea how many setups might be using this?
>
> thanks,
> Hendrik
>


Re: Is it possible to rewrite part of the solr response?

Thanks. I'll look into that stuff. The counts issue is really not a serious
problem for us as far as I know.

On Wed, Dec 21, 2016 at 9:08 PM, Erick Erickson 
wrote:

> "grab the response" is a bit ambiguous here in Solr terms. Sure,
> a SearchComponent (you can write a plugin) gets the response,
> but it only sees the final list being returned to the user, i.e. if you
> have rows=15 it sees only 15 docs. Not sure that's adequate,
> in the case above you could easily not be allowed to see any of
> the top N docs. Plus, doing anything like this would give very
> skewed things like facets, grouping, etc. Say the facets were
> calculated over 534 hits but the user was only allowed to see 10 docs...
> Very confusing.
>
> The most robust solution would be a "post filter", another bit
> of custom code that you write (plugin). See:
> http://yonik.com/advanced-filter-caching-in-solr/
> A post filter sees _all_ the documents that satisfy the query,
> and makes an include/exclude decision on each one (just
> like any other fq clause). So facets, grouping and all the rest
> "just work". Do be aware that if the ACL calculations are  expensive
> you need to be prepared for the system administrator doing a
> *:* query. I usually build in a bailout and stop passing documents
> after some number and pass back a result about "please narrow
> down your search". Of course if your business logic is such that
> you can calculate them all "fast enough", you're golden.
>
> All that said, if there's any way you can build this into tokens in the
> doc and use a standard fq clause it's usually much easier. That may
> take some creative work at indexing time if it's even possible.
>
> Best,
> Erick
>
> On Wed, Dec 21, 2016 at 5:56 PM, Mike Thomsen 
> wrote:
> > We're trying out some ideas on locking down solr and would like to know
> if
> > there is a public API that allows you to grab the response before it is
> > sent and inspect it. What we're trying to do is something for which a
> > filter query is not a good option to really get where we want to be.
> > Basically, it's an integration with some business logic to make a final
> > pass at ensuring that certain business rules are followed in the event a
> > query returns documents a user is not authorized to see.
> >
> > Thanks,
> >
> > Mike
>


Is it possible to rewrite part of the solr response?

We're trying out some ideas on locking down solr and would like to know if
there is a public API that allows you to grab the response before it is
sent and inspect it. What we're trying to do is something for which a
filter query is not a good option to really get where we want to be.
Basically, it's an integration with some business logic to make a final
pass at ensuring that certain business rules are followed in the event a
query returns documents a user is not authorized to see.

Thanks,

Mike


Replica document counts out of sync

In one of our environments, we have an issue where one shard has two
replicas with smaller document counts than the third one. This is on Solr
4.10.3 (Cloudera's build). We've found that shutting down the smaller
replicas, deleting their data folders and restarting one by one will do the
trick of forcing them to get the bigger and fresher index from the third
one.

We aren't doing anything different with the document router configuration
or anything like that. It's a really simple and straight forward
installation of Solr that is largely based on defaults for everything. Any
suggestions on what might be getting us into this situation? Also, is there
a SolrCloud API for forcing those two replicas to sync with the third or do
we have to continue using that manual process?

Thanks,

Mike


RE: Combined Dismax and Block Join Scoring on nested documents

 field value, etc. Don't put quotes around the value though: v=$row.id 
works, but v="$row.id" does not, as that would look for the literal value "$row.id".
doc.rows=1000 controls the maximum number of children returned; 1000 for me is way 
more than will ever exist for this particular client.
Finally, doc.fq=(base_colour:(blue) AND (in_stock:(true))) filters the children 
to only those I really want, so only the "blue" in-stock documents are returned.

So here's the raw string I'm putting in the "q" box in Solr Admin Console:

q=+{!dismax v="skirt" qf="name^ 0 searchtext^0"} +{!parent 
which=content_type:product score=min v=$bjv}&bjv=+(base_colour:(blue)^0 AND 
(in_stock:(true)^0)) {!func}list_price_gbp&sort=score 
asc&fl=*,doc:[subquery]&doc.q={!terms f="productid" 
v=$row.id}&doc.rows=1000&doc.fq=(base_colour:(blue) AND (in_stock:(true)))

By stepping into my Visual Studio code, the encoded request looks like this:

http://localhost:8983/solr/test_core/select?q=%2b%7b!dismax+v%3d%22skirt%22+qf%3d%22name%5e0+searchtext%5e0%22+%7d+%2b%7b!parent+which%3dcontent_type%3aproduct+score%3dmin+v%3d%24bjv%7d&bjv=%2b(base_colour%3a(blue)+AND+(in_stock%3a(true)))+%7b!func%7dlist_price_gbp&doc.q=%7b!terms+f%3d%22productid%22+v%3d%24row.id%7d&doc.rows=1000&doc.fq=(base_colour%3a(blue)+AND+(in_stock%3a(true)))&start=0&rows=103&fl=*%2cdoc%3a%5bsubquery%5d&sort=score+asc

So you'll notice the explicit "+"s have been encoded as %2B and spaces are "+". 
Correct encoding seems half the battle to be honest.
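
(For readability, here's the same request spelled out parameter by parameter 
as a SolrJ sketch - illustrative only, but whichever client library you use 
will handle the encoding for you.)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrQuery.ORDER;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class BlockJoinDismaxQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/test_core").build()) {
            SolrQuery q = new SolrQuery();
            q.set("q", "+{!dismax v=\"skirt\" qf=\"name^0 searchtext^0\"} "
                     + "+{!parent which=content_type:product score=min v=$bjv}");
            q.set("bjv", "+(base_colour:(blue)^0 AND (in_stock:(true)^0)) {!func}list_price_gbp");
            q.set("fl", "*,doc:[subquery]");
            q.set("doc.q", "{!terms f=\"productid\" v=$row.id}");
            q.set("doc.rows", "1000");
            q.set("doc.fq", "(base_colour:(blue) AND (in_stock:(true)))");
            q.setSort("score", ORDER.asc);
            System.out.println(solr.query(q).getResults().getNumFound());
        }
    }
}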

So that's what I've got for now, but I wouldn't take it as gospel that it's 
working correctly. I'm still validating by hand checking the results I would 
expect versus the results I actually get. For instance I need to know for sure 
it's scoring on only matched variants, not all children of a parent - which 
would completely blow the whole thing out of the water. And as I said, I'm 
pretty sure I've yet to figure out applying a query filter to parent docs.

When I'm a bit less clueless about what I'm actually doing, I'll try and write 
it up properly somewhere.

Cheers all,

Mike

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: 21 November 2016 21:59
To: solr-user
Subject: Re: Combined Dismax and Block Join Scoring on nested documents

You could do:
*) LinkedIn
*) Wiki
*) Write it up, give it to me and I'll stick it as a guest post on my blog 
(with attribution of your choice)
*) Write it up, give it to Lucidworks and they may (not sure about
rules) stick it on their blog

Regards,
Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 November 2016 at 02:36, Mike Allen 
 wrote:
> Sure thing Alex. I don't actually do any personal blogging, but if there's a 
> suitable place - the Solr Wiki perhaps - you'd suggest I can write something 
> up I'd be more than happy to. What goes around comes around!
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: 21 November 2016 13:01
> To: solr-user
> Subject: Re: Combined Dismax and Block Join Scoring on nested 
> documents
>
> A blog article about what you learned would be very welcome. These edge cases 
> are something other people could certainly learn from.
> Share the knowledge forward etc.
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and 
> experienced
>
>
> On 21 November 2016 at 23:57, Mike Allen 
>  wrote:
>> Hi Mikhail,
>>
>> Thanks for your advice, it went a long way towards helping me get the right 
>> documents in the first place, especially paramterising the block join with 
>> an explicit v, as otherwise it was a nightmare of parser errors.  Not to 
>> mention I'm still figuring out the nuances of where I need a whitespace and 
>> where I don't! However, I spent a part of the weekend fiddling around with 
>> spaces and +'s and I believe I've got it working as I'd hoped.
>>
>> Again, many thanks,
>>
>> Mike
>>
>> -Original Message-
>> From: Mikhail Khludnev [mailto:m...@apache.org]
>> Sent: 18 November 2016 12:58
>> To: solr-user
>> Subject: Re: Combined Dismax and Block Join Scoring on nested 
>> documents
>>
>> Hello Mike,
>> Structured queries in Solr are way cumbersome.
>> Start from:
>> q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product 
>> score=min v=childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...
>>
>> beside of "explain"

RE: Combined Dismax and Block Join Scoring on nested documents

Sure thing Alex. I don't actually do any personal blogging, but if there's a 
suitable place you'd suggest - the Solr Wiki perhaps - I'd be more than happy 
to write something up. What goes around comes around!

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: 21 November 2016 13:01
To: solr-user
Subject: Re: Combined Dismax and Block Join Scoring on nested documents

A blog article about what you learned would be very welcome. These edge cases 
are something other people could certainly learn from.
Share the knowledge forward etc.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 21 November 2016 at 23:57, Mike Allen 
 wrote:
> Hi Mikhail,
>
> Thanks for your advice, it went a long way towards helping me get the right 
> documents in the first place, especially paramterising the block join with an 
> explicit v, as otherwise it was a nightmare of parser errors.  Not to mention 
> I'm still figuring out the nuances of where I need a whitespace and where I 
> don't! However, I spent a part of the weekend fiddling around with spaces and 
> +'s and I believe I've got it working as I'd hoped.
>
> Again, many thanks,
>
> Mike
>
> -Original Message-
> From: Mikhail Khludnev [mailto:m...@apache.org]
> Sent: 18 November 2016 12:58
> To: solr-user
> Subject: Re: Combined Dismax and Block Join Scoring on nested 
> documents
>
> Hello Mike,
> Structured queries in Solr are way cumbersome.
> Start from:
> q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product 
> score=min v=childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...
>
> beside of "explain" there is a parsed query entry in debug that's more useful 
> for troubleshooting purposes.
> Please also make sure that + is properly encoded by %2B and pass http hurdle.
>
> On Fri, Nov 18, 2016 at 2:14 PM, Mike Allen < 
> mike.al...@thecommercepartnership.com> wrote:
>
>> Apologies if I'm doing something incredibly stupid as I'm new to Solr.
>> I am having an issue with scoring child documents in a block join 
>> query when including a dismax query. I'm actually a little unclear on 
>> whether or not that's a complete oxymoron, combining dismax and block join.
>>
>>
>>
>> Problem statement: Given a set of Product documents - which contain 
>> the product names and descriptions - which contain nested variant 
>> documents (see below for abridged example) - which contain the 
>> boolean stock status
>> (in_stock) and the variant prices (list_price_gbp) - I want to do a 
>> Dismax query of, say, "skirt" on the product name (name) and sort the 
>> resulting product documents by the minimum price (list_price_gbp) of 
>> their child variant documents. Note that, although the abridged 
>> document doesn't show them, there are a number of other arbitrary 
>> fields which may be used as filter queries on the child documents, 
>> for example size or colour, which will in effect change the "active"
>> minimum price of a product. Hence, denormalizing, or flattening, the 
>> documents is not really an option I want to pursue.
>>
>>
>>
>> An abridged example document returned by the Solr Admin Query console 
>> which I am querying:
>>
>>
>>
>> 
>>
>> 12345
>>
>> product
>>
>> black flared skirt
>>
>> 40.0
>>
>> 
>>
>>   
>>
>> 12345abcd
>>
>> 12345
>>
>> variant
>>
>> > name="list_price_gbp">65.0
>>
>> true
>>
>>   
>>
>>   
>>
>> 12345fghi
>>
>> 12345
>>
>> variant
>>
>> > name="list_price_gbp">40.0
>>
>> true
>>
>>   
>>
>> 
>>
>>
>>
>> So I am familiar with the block join score mode; setting aside the 
>> dismax aspect for now, this query, using the Function Query 
>> {!func}list_price_gbp, with score ascending, returns documents 
>> ordered correctly, with a £2.00
>> (cheapest) product first:
>>
>>
>>
>> q={!parent which=content_type:product 
>

RE: Combined Dismax and Block Join Scoring on nested documents

Hi Mikhail,

Thanks for your advice, it went a long way towards helping me get the right 
documents in the first place, especially parameterising the block join with an 
explicit v, as otherwise it was a nightmare of parser errors. Not to mention 
I'm still figuring out the nuances of where I need a whitespace and where I 
don't! However, I spent part of the weekend fiddling around with spaces and 
+'s and I believe I've got it working as I'd hoped. 

Again, many thanks,

Mike

-Original Message-
From: Mikhail Khludnev [mailto:m...@apache.org] 
Sent: 18 November 2016 12:58
To: solr-user
Subject: Re: Combined Dismax and Block Join Scoring on nested documents

Hello Mike,
Structured queries in Solr are way cumbersome.
Start from:
q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product score=min 
v=childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...

beside of "explain" there is a parsed query entry in debug that's more useful 
for troubleshooting purposes.
Please also make sure that + is properly encoded by %2B and pass http hurdle.

On Fri, Nov 18, 2016 at 2:14 PM, Mike Allen < 
mike.al...@thecommercepartnership.com> wrote:

> Apologies if I'm doing something incredibly stupid as I'm new to Solr. 
> I am having an issue with scoring child documents in a block join 
> query when including a dismax query. I'm actually a little unclear on 
> whether or not that's a complete oxymoron, combining dismax and block join.
>
>
>
> Problem statement: Given a set of Product documents - which contain 
> the product names and descriptions - which contain nested variant 
> documents (see below for abridged example) - which contain the boolean 
> stock status
> (in_stock) and the variant prices (list_price_gbp) - I want to do a 
> Dismax query of, say, "skirt" on the product name (name) and sort the 
> resulting product documents by the minimum price (list_price_gbp) of 
> their child variant documents. Note that, although the abridged 
> document doesn't show them, there are a number of other arbitrary 
> fields which may be used as filter queries on the child documents, for 
> example size or colour, which will in effect change the "active" 
> minimum price of a product. Hence, denormalizing, or flattening, the 
> documents is not really an option I want to pursue.
>
>
>
> An abridged example document returned by the Solr Admin Query console 
> which I am querying:
>
>
>
> <doc>
>   <str name="id">12345</str>
>   <str name="content_type">product</str>
>   <str name="name">black flared skirt</str>
>   <float name="min_list_price_gbp">40.0</float>
>
>   <doc>
>     <str name="id">12345abcd</str>
>     <str name="productid">12345</str>
>     <str name="content_type">variant</str>
>     <float name="list_price_gbp">65.0</float>
>     <bool name="in_stock">true</bool>
>   </doc>
>
>   <doc>
>     <str name="id">12345fghi</str>
>     <str name="productid">12345</str>
>     <str name="content_type">variant</str>
>     <float name="list_price_gbp">40.0</float>
>     <bool name="in_stock">true</bool>
>   </doc>
> </doc>
>
>
>
> So I am familiar with the block join score mode; setting aside the 
> dismax aspect for now, this query, using the Function Query 
> {!func}list_price_gbp, with score ascending, returns documents ordered 
> correctly, with a £2.00
> (cheapest) product first:
>
>
>
> q={!parent which=content_type:product
> score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms
> f="productid"
> v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(
> true))&start=0&row
> s=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
>
>
> The "explain" for this is:
>
>
>
> 2.184 = Score based on 1 child docs in range from 26752 to 26752, 
> best
> match:
>
>   2.184 = sum of:
>
> 1.8374416E-5 = weight(in_stock:T in 26752) [], result of:
>
>   1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0
>
> ), product of:
>
> 1.8374416E-5 = idf(docFreq=27211, docCount=27211)
>
> 1.0 = tfNorm, computed from:
>
>   1.0 = termFreq=1.0
>
>   1.2 = parameter k1
>
>   0.0 = parameter b (norms omitted for field)
>
> 2.0 = FunctionQuery(float(list_price_gbp)), product of:
>
>   2.0 = float(list_price_gbp)=2.0
>
>   1.0 = boost
>
>   1.0 = queryNorm
>
>
>
> Even though this is doing what I want, I have a slight niggle that the
> overall score is not just the result o

Combined Dismax and Block Join Scoring on nested documents

does not exist, as the Function Query always returns
zero. 

 

6243963 = sum of:

  3.624396 = weight(name:skirt in 18113) [], result of:

3.624396 = score(doc=18113,freq=1.0 = termFreq=1.0

), product of:

  3.5851278 = idf(docFreq=103, docCount=3731)

  1.0109531 = tfNorm, computed from:

1.0 = termFreq=1.0

1.2 = parameter k1

0.75 = parameter b

4.108818 = avgFieldLength

4.0 = fieldLength

  1.0 =
{!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(con
tent_type:product))), product of:

1.0 = boost

1.0 = queryNorm

  0.0 = FunctionQuery(float(list_price_gbp)), product of:

0.0 = float(list_price_gbp)=0.0

1.0 = boost

1.0 = queryNorm



Indeed, if I change the Function Query field to a product scoped field,
min_list_price_gbp, like so:



q=+name:skirt +{!parent which=content_type:product
score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms
f="productid"
v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&row
s=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml

 

then the "explain" certainly does show the Function Query evaluating

 

8.624397 = sum of:

  3.624396 = weight(name:skirt in 17890) [], result of:

3.624396 = score(doc=17890,freq=1.0 = termFreq=1.0

), product of:

  3.5851278 = idf(docFreq=103, docCount=3731)

  1.0109531 = tfNorm, computed from:

1.0 = termFreq=1.0

1.2 = parameter k1

0.75 = parameter b

4.108818 = avgFieldLength

4.0 = fieldLength

  1.0 =
{!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(con
tent_type:product))), product of:

1.0 = boost

1.0 = queryNorm

  14.0 = FunctionQuery(float(min_list_price_gbp)), product of:

14.0 = float(min_list_price_gbp)=14.0

1.0 = boost

1.0 = queryNorm

 

My grasp of the syntax is pretty flakey, so I would be immensely grateful if
someone could point out if I'm just doing something incredibly dumb. In my
head, I see what I am trying to do as 

 

(some dismax or lucene query on parent document [e.g. "skirt"])
=> (get a subset of these parent docs based on a block join)
=> (where the children match a bunch of arbitrary filter queries [e.g. "colour:red"])
=> (then subquery the child docs that match the same filter queries [e.g. "colour:red"])
=> (then score this subset of child documents)
=> (and order by that score)

 


Is this actually possible? I've been googling about this for a day or so and
can't quite find anything definitive. I might try to dive into the Solr source
code, but I'm a C# guy, not Java, without a debuggable environment set up (I
haven't needed one until now), and that could prove pretty painful.

 

Any help would be appreciated, even if it is just "can't be done", as at
least I could stop chasing my tail.

 

Mike

 

 

 

 





Detecting schema errors while adding documents

We're stuck on Solr 4.10.3 (Cloudera bundle). Is there any way to detect
with SolrJ when a document added to the index violated the schema? All we
see when we look at the stacktrace for the SolrException that comes back is
that it contains messages about an IOException when talking to the solr
nodes. Solr is up and running, and the documents are only invalid because I
added a Java statement to make a field invalid for testing purposes. When I
remove that statement, the indexing happens just fine.

Any way to do this? I seem to recall that at least in newer versions of
Solr it would tell you more about the specific error.

Thanks,

Mike
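
For what it's worth, a minimal SolrJ 4.x sketch of the kind of exception
unwrapping that at least surfaces whatever message the node sends back (the
client and method names are illustrative and the snippet is untested):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class AddWithErrorReporting {
    public static void addDoc(CloudSolrServer server, SolrInputDocument doc) {
        try {
            server.add(doc);
        } catch (SolrException e) {
            // Remote 400s (e.g. schema violations) often surface as a SolrException subclass.
            System.err.println("Solr rejected the doc, HTTP " + e.code() + ": " + e.getMessage());
        } catch (SolrServerException | IOException e) {
            // Otherwise walk the cause chain; the useful message is often buried a few levels down.
            for (Throwable t = e; t != null; t = t.getCause()) {
                System.err.println(t.getClass().getSimpleName() + ": " + t.getMessage());
            }
        }
    }
}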


Re: Rolling backups of a collection

Thanks. If we write such a process, I'll see if I can get permission to
release it. It might be a moot point because I found out we're stuck on
4.10.3 for the time being. Haven't used that version in a while and forgot
it didn't even have the collection backup API.

On Wed, Nov 9, 2016 at 2:18 PM, Hrishikesh Gadre 
wrote:

> Hi Mike,
>
> I filed SOLR-9744 <https://issues.apache.org/jira/browse/SOLR-9744> to
> track this work. Please comment on this jira if you have any suggestions.
>
> Thanks
> Hrishikesh
>
>
> On Wed, Nov 9, 2016 at 11:07 AM, Hrishikesh Gadre 
> wrote:
>
> > Hi Mike,
> >
> > Currently we don't have capability to take rolling backups for the Solr
> > collections. I think it should be fairly straightforward to write a
> script
> > that implements this functionality outside of Solr. If you post that
> > script, may be we can even ship it as part of Solr itself (for the
> benefit
> > of the community).
> >
> > Thanks
> > Hrishikesh
> >
> >
> >
> > On Wed, Nov 9, 2016 at 9:17 AM, Mike Thomsen 
> > wrote:
> >
> >> I read over the docs (
> >> https://cwiki.apache.org/confluence/display/solr/Making+and+
> >> Restoring+Backups)
> >> and am not quite sure what route to take. My team is looking for a way
> to
> >> backup the entire index of a SolrCloud collection with regular rotation
> >> similar to the backup option available in a single node deployment.
> >>
> >> We have plenty of space in our HDFS cluster. Resources are not an issue
> in
> >> the least to have a rolling back up of say, the last seven days. Is
> there
> >> a
> >> good way to implement this sort of rolling backup with the APIs or will
> we
> >> have to roll some of the functionality ourselves?
> >>
> >> I'm not averse to using the API to dump a copy of each shard to HDFS.
> >> Something like this:
> >>
> >> /solr/collection/replication?command=backup&name=shard_1_1&
> numberToKeep=7
> >>
> >> Is that a viable route to achieve this or do we need to do something
> else?
> >>
> >> Thanks,
> >>
> >> Mike
> >>
> >
> >
>


Rolling backups of a collection

I read over the docs (
https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups)
and am not quite sure what route to take. My team is looking for a way to
backup the entire index of a SolrCloud collection with regular rotation
similar to the backup option available in a single node deployment.

We have plenty of space in our HDFS cluster. Resources are not an issue in
the least to have a rolling back up of say, the last seven days. Is there a
good way to implement this sort of rolling backup with the APIs or will we
have to roll some of the functionality ourselves?

I'm not averse to using the API to dump a copy of each shard to HDFS.
Something like this:

/solr/collection/replication?command=backup&name=shard_1_1&numberToKeep=7

Is that a viable route to achieve this or do we need to do something else?

Thanks,

Mike
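
For reference, a rough SolrJ 4.x sketch of driving that per-core replication
backup from a scheduled job (the host and core names are illustrative; untested):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class RollingShardBackup {
    public static void main(String[] args) throws Exception {
        // In SolrCloud each shard's core gets its own backup call.
        String[] cores = {"collection_shard1_replica1", "collection_shard2_replica1"};
        for (String core : cores) {
            HttpSolrServer server = new HttpSolrServer("http://solr-host:8983/solr/" + core);
            try {
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("command", "backup");
                params.set("name", core);          // snapshot name, as in the URL above
                params.set("numberToKeep", "7");   // rolling window of seven snapshots
                QueryRequest request = new QueryRequest(params);
                request.setPath("/replication");   // hit the core's replication handler
                server.request(request);
            } finally {
                server.shutdown();
            }
        }
    }
}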


Backup to HDFS while running cluster on local disk

We have SolrCloud running on bare metal but want the nightly snapshots to
be written to HDFS. Can someone give me some help on configuring the
HdfsBackupRepository?



<backup>
  <repository name="hdfs" class="org.apache.solr.core.backup.repository.HdfsBackupRepository" default="false">
    <str name="location">${solr.hdfs.default.backup.path}</str>
    <str name="solr.hdfs.home">${solr.hdfs.home:}</str>
    <str name="solr.hdfs.confdir">${solr.hdfs.confdir:}</str>
  </repository>
</backup>



Not sure how to proceed on configuring this because the documentation is a
bit sparse on what some of those values mean in this context. The example
looked geared toward someone using HDFS both to store the index and do
backup/restore.

Thanks,

Mike
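
For reference, assuming the repository above ends up registered under the name
"hdfs", a rough SolrJ sketch of triggering a collection-level backup against it
would look something like this (the ZK string, collection name, backup name and
location are illustrative; untested):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class HdfsCollectionBackup {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr")) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "BACKUP");
            params.set("name", "nightly-backup");      // backup name
            params.set("collection", "mycollection");
            params.set("repository", "hdfs");          // matches the repository name above
            params.set("location", "/backups/solr");   // path resolved against solr.hdfs.home
            QueryRequest request = new QueryRequest(params);
            request.setPath("/admin/collections");     // Collections API BACKUP call
            client.request(request);
        }
    }
}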


Re: UpdateProcessor as a batch

Maybe introduce a distributed queue such as Apache Ignite, Hazelcast or even
Redis. Read from the queue in batches, do your lookup, then index the same
batch.

just a thought.

Mike St. John.

On Nov 3, 2016 3:58 PM, "Erick Erickson"  wrote:

> I thought we might be talking past each other...
>
> I think you're into "roll your own" here. Anything that
> accumulated docs for a while, did a batch lookup
> on the external system, then passed on the docs
> runs the risk of losing docs if the server is abnormally
> shut down.
>
> I guess ideally you'd like to augment the list coming in
> rather than the docs once they're removed from the
> incoming batch and passed on, but I admit I have no
> clue where to do that. Possibly in an update chain? If
> so, you'd need to be careful to only augment when
> they'd reached their final shard leader or all at once
> before distribution to shard leaders.
>
> Is the expense for the external lookup doing the actual
> lookups or establishing the connection? Would
> having some kind of shared connection to the external
> source be worthwhile?
>
> FWIW,
> Erick
>
> On Thu, Nov 3, 2016 at 12:06 PM, Markus Jelsma
>  wrote:
> > Hi - i believe i did not explain myself well enough.
> >
> > Getting the data in Solr is not a problem, various sources index docs to
> Solr, all in fine batches as everyone should do indeed. The thing is that i
> need to do some preprocessing before it is indexed. Normally,
> UpdateProcessors are the way to go. I've made quite a few of them and they
> work fine.
> >
> > The problem is, i need to do a remote lookup for each document being
> indexed. Right now, i make an external connection for each doc being
> indexed in the current UpdateProcessor. This is still fast. But the remote
> backend supports batched lookups, which are faster.
> >
> > This is why i'd love to be able to buffer documents in an
> UpdateProcessor, and if there are enough, i do a remote lookup for all of
> them, do some processing and let them be indexed.
> >
> > Thanks,
> > Markus
> >
> >
> >
> > -Original message-
> >> From:Erick Erickson 
> >> Sent: Thursday 3rd November 2016 19:18
> >> To: solr-user 
> >> Subject: Re: UpdateProcessor as a batch
> >>
> >> I _thought_ you'd been around long enough to know about the options I
> >> mentioned ;).
> >>
> >> Right. I'd guess you're in UpdateHandler.addDoc and there's really no
> >> batching at that level that I know of. I'm pretty sure that even
> >> indexing batches of 1,000 documents from, say, SolrJ go through this
> >> method.
> >>
> >> I don't think there's much to be gained by any batching at this level,
> >> it pretty immediately tells Lucene to index the doc.
> >>
> >> FWIW
> >> Erick
> >>
> >> On Thu, Nov 3, 2016 at 11:10 AM, Markus Jelsma
> >>  wrote:
> >> > Erick - in this case data can come from anywhere. There is one piece
> of code all incoming documents, regardless of their origin, are passed
> thru, the update handler and update processors of Solr.
> >> >
> >> > In my case that is the most convenient point to partially modify the
> documents, instead of moving that logic to separate places.
> >> >
> >> > I've seen the ContentStream in SolrQueryResponse and i probably could
> tear incoming data apart and put it back together again, but that would not
> be so easy as working with already deserialized objects such as
> SolrInputDocument.
> >> >
> >> > UpdateHandler doesn't seem to work on a list of documents, it looked
> like it works on incoming stuff, not a whole list. I've also looked if i
> could buffer a batch in UpdateProcessor, work on them, and release them,
> but that seems impossible.
> >> >
> >> > Thanks,
> >> > Markus
> >> >
> >> > -Original message-
> >> >> From:Erick Erickson 
> >> >> Sent: Thursday 3rd November 2016 18:57
> >> >> To: solr-user 
> >> >> Subject: Re: UpdateProcessor as a batch
> >> >>
> >> >> Markus:
> >> >>
> >> >> How are you indexing? SolrJ has a client.add(List<
> SolrInputDocument>)
> >> >> form, and post.jar lets you add as many documents as you want in a
> >> >> batch
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Nov 3, 2016 at 10:18 AM, Markus Jelsma
> >> >>  wrote:
> >> >> > Hi - i need to process a batch of documents on update but i cannot
> seem to find a point where i can hook in and process a list of
> SolrInputDocuments, not in UpdateProcessor nor in UpdateHandler.
> >> >> >
> >> >> > For now i let it go and implemented it on a per-document basis, it
> is fast, but i'd prefer batches. Is that possible at all?
> >> >> >
> >> >> > Thanks,
> >> >> > Markus
> >> >>
> >>
>
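
For anyone revisiting this thread, here is a rough sketch of the buffering
UpdateRequestProcessor idea discussed above. The lookup client is hypothetical,
per-document flags such as overwrite and commitWithin are dropped, adds get
reordered relative to any interleaved deletes, you still need the usual factory
to wire it into the chain, and, per Erick's caveat, anything still buffered at
an abnormal shutdown is lost. Untested:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class BatchLookupProcessor extends UpdateRequestProcessor {
    private static final int BATCH_SIZE = 100;

    private final SolrQueryRequest req;
    private final List<SolrInputDocument> buffer = new ArrayList<>();

    public BatchLookupProcessor(SolrQueryRequest req, UpdateRequestProcessor next) {
        super(next);
        this.req = req;
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        // Copy the doc: some loaders reuse the AddUpdateCommand instance between calls.
        buffer.add(cmd.getSolrInputDocument().deepCopy());
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    @Override
    public void finish() throws IOException {
        flush();          // flush whatever is left at the end of the request
        super.finish();
    }

    private void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        // One remote call for the whole batch (hypothetical client):
        // Map<String, String> extra = lookupClient.lookup(idsOf(buffer));
        for (SolrInputDocument doc : buffer) {
            // doc.setField("extra_field", extra.get((String) doc.getFieldValue("id")));
            AddUpdateCommand cmd = new AddUpdateCommand(req);
            cmd.solrDoc = doc;
            super.processAdd(cmd);   // now pass each doc down the rest of the chain
        }
        buffer.clear();
    }
}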


Result Grouping vs. Collapsing Query Parser -- Can one be deprecated?

Hi all,

I've had a rotten day today because of Solr. I want to share my experience
and perhaps see if we can do something to fix this particular situation in
the future.

Solr currently has two ways to get grouped results (so far!). You can
either use Result Grouping or you can use the Collapsing Query Parser.
Result grouping seems like the obvious way to go. It's well documented, the
parameters are clear, it doesn't use a bunch of weird syntax (ie,
{!collapse blah=foo}), and it uses the feature name from SQL (so it comes
up in Google).

OTOH, if you use faceting with result grouping, which I imagine many people
do, you get terrible performance. In our case it went from subsecond to
10-120 seconds for big queries. Insanely bad.

Collapsing Query Parser looks like a good way forward for us, and we'll be
investigating that, but it uses the Expand component that our library
doesn't support, to say nothing of the truly bizarre syntax. So this will
be a fair amount of effort to switch.

I'm curious if there is anything we can do to clean up this situation. What
I'd really like to do is:

1. Put a HUGE warning on the Result Grouping docs directing people away
from the feature if they plan to use faceting (or perhaps directing them
away no matter what?)

2. Work towards eliminating one or the other of these features. They're
nearly completely compatible, except for their syntax and performance. The
collapsing query parser apparently was only written because the result
grouping had such bad performance -- In other words, it doesn't exist to
provide unique features, it exists to be faster than the old way. Maybe we
can get rid of one or the other of these, taking the best parts from each
(syntax from Result Grouping, and performance from Collapse Query Parser)?

Thanks,

Mike

PS -- For some extra context, I want to share some other reasons this is
frustrating:

1. I just spent a week upgrading a third-party library so it would support
grouped results, and another week implementing the feature in our code with
tests and everything. That was a waste.
2. It's hard to notice performance issues until after you deploy to a big
data environment. This creates a bad situation for users until you detect
it and revert the new features.
3. The documentation *could* say something about the fact that a new
feature was developed to provide better performance for grouping. It could
say that using facets with groups is an anti-feature. It says neither.

I only mention these because, like others, I've had a real rough time with
solr (again), and these are the kinds of seemingly small things that could
have made all the difference.
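
For reference, the collapse/expand equivalent looks roughly like this in SolrJ
(the collection, group field and facet field names are illustrative; untested):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CollapseExpandSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection");

        SolrQuery q = new SolrQuery("some query");
        q.addFilterQuery("{!collapse field=group_field}");  // keep one doc per group_field value
        q.set("expand", "true");                            // bring back the collapsed members
        q.set("expand.rows", "5");
        q.addFacetField("some_facet_field");                // facets are computed on the collapsed set

        QueryResponse rsp = solr.query(q);
        System.out.println(rsp.getResults().getNumFound());
        System.out.println(rsp.getExpandedResults());       // Map of group value -> collapsed docs
    }
}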


Re: Real Time Search and External File Fields

Thanks for the replies. I made the changes so that the external file field
is loaded per:


  <listener event="newSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>
  <listener event="firstSearcher" class="org.apache.solr.schema.ExternalFileFieldReloader"/>

Re: Real Time Search and External File Fields

On Fri, Oct 7, 2016 at 8:18 PM Erick Erickson 
wrote:

> What you haven't mentioned is how often you add new docs. Is it once a
> day? Steadily
> from 8:00 to 17:00?
>

Alas, it's a steady trickle during business hours. We're ingesting court
documents as they're posted on court websites, then sending alerts as soon
as possible.


> Whatever, your soft commit really should be longer than your autowarm
> interval. Configure
> autowarming to reference queries (firstSearcher or newSearcher events
> or autowarm
> counts in queryResultCache and filterCache. Say 16 in each of these
> latter for a start) such
> that they cause the external file to load. That _should_ prevent any
> queries from being
> blocked since the autowarming will happen in the background and while
> it's happening
> incoming queries will be served by the old searcher.
>

I want to make sure I understand this properly and document this for future
people that may find this thread. Here's what I interpret your advice to be:

0. Slacken my auto soft commit interval to something more like a minute.

1. Set up a query in the newSearcher listener that uses my external file
field.
1a. Do the same in firstSearcher if I want newly started solr to warm up
before getting queries (this doesn't matter to me, so I'm skipping this).

and/or

2. Set autowarmcount in queryResultCache and filterCache to 16 so that the
top 16 query results from the previous searcher are regenerated in the new
searcher.

Doing #1 seems like a safe strategy since it's guaranteed to hit the
external file field. #2 feels like a bonus.

I'm a bit confused about the example autowarmcount for the caches, which is
0. Why not set this to something higher? I guess it's a RAM utilization vs.
speed tradeoff? A low number like 16 seems like it'd have minimal impact on
RAM?

Thanks for all the great replies and for everything you do for Solr. I
truly appreciate your efforts.

Mike


Re: Real Time Search and External File Fields

On Sat, Oct 8, 2016 at 8:46 AM Shawn Heisey  wrote:

> Most soft commit
> > documentation talks about setting up soft commits with  of
> about a
> > second.
>
> IMHO any documentation that recommends autoSoftCommit with a maxTime of
> one second is bad documentation, and needs to be fixed.  Where have you
> seen such a recommendation?


You know, I must have made that up, sorry. But the documentation you linked
to (on the Lucidworks blog) and the example file say 15 seconds for hard
commits, so I think that got me thinking that soft commits could be more
frequent.

Should soft commits be less frequent than hard commits
(opensearcher=False)? If so, I didn't find that to be at all clear.


> right now Solr/Lucene has no
> way of knowing that your external file has not changed, so it must read
> the file every time it builds a searcher.


Is it crazy to file a feature request asking that Solr/Lucene keep the
modtime of this file and only reload it if it has changed? Seems like an easy
win.


>  I doubt this feature was
> designed to deal well with an extremely large external file like yours.
>

Perhaps not. It's probably worth mentioning that part of the reason the
file is so large is that pagerank uses very small, high-precision floats.
So a typical line is:

1=9.50539603222e-08

Not something smaller like:

1=3.2

Pagerank also provides a value for every item in the index, so that makes
the file long. I'd suspect that anybody with a pagerank boosted index of
moderate size would have a similarly-sized file.


> If the info changes that infrequently, can you just incorporate it
> directly into the index with a standard field, with the info coming in
> as a part of your normal indexing process?


We've considered that, but whenever you re-run pagerank, it updates EVERY
value. So I guess we could try updating every doc in our index whenever we
run pagerank, but that's a nasty solution.


> It seems unlikely that Solr would stop serving queries while setting up
> a new searcher.  The old searcher should continue to serve requests
> until the new searcher is ready.  If this is happening, that definitely
> seems like a bug.
>

I'm positive I've observed this, though you're right, some queries still
seem to come through. Is it possible that queries relying on the field are
stopped while the field is loading? I've observed this two ways:

1. From the front end, things were stalling every time I was doing a hard
commit (opensearcher=true). I had hard commits coming in every ten minutes
via cron job, and sure enough, at ten, twenty, thirty...minutes after every
hour, I'd see stalls.

2. Watching the logs, I saw a flood of queries come through after the line:

Loaded external value source external_pagerank

Some queries were coming through before this line, but I think none of
those queries use the external file field (external_pagerank).

Mike


Real Time Search and External File Fields

I have an index of about 4M documents with an external file field
configured to do boosting based on pagerank scores of each document. The
pagerank file is about 93MB as of today -- it's pretty big.

Each day, I add about 1,000 new documents to the index, and I need them to
be available as soon as possible so that I can send out alerts to our users
about new content (this is Google Alerts, essentially).

Soft commits seem to be exactly the thing for this, but whenever I open a
new searcher (which soft commits seem to do), the external file is
reloaded, and all queries are halted until it finishes loading. When I just
measured, this took about 30 seconds to complete. Most soft commit
documentation talks about setting up soft commits with  of about a
second.

Is there anything I can do to make the external file field not get reloaded
constantly? It only changes about once a month, and I want to use soft
commits to power the alerts feature.

Thanks,

Mike


Best way to generate multivalue fields from streaming API

Read this article and thought it could be interesting as a way to do
ingestion:

https://dzone.com/articles/solr-streaming-expressions-for-collection-auto-upd-1

Example from the article:

daemon(id="12345",
       runInterval="6",
       update(users,
              batchSize=10,
              jdbc(connection="jdbc:mysql://localhost/users?user=root&password=solr",
                   sql="SELECT id, name FROM users", sort="id asc",
                   driver="com.mysql.jdbc.Driver")))

What's the best way to handle a multivalue field using this API? Is
there a way to tokenize something returned in a database field?

Thanks,

Mike


Re: Missed update on replica

I should add that this is on Solr 5.1.0.

On Thu, Apr 28, 2016 at 2:42 PM, Mike Wartes  wrote:

> I have a three node, one shard SolrCloud cluster.
>
> Last week one of the nodes went out of sync with the other two and I'm
> trying to understand why that happened.
>
> After poking through my logs and the solr code here's what I've pieced
> together:
>
> 1. Leader gets an update request for a batch delete of 306 items. It sends
> this update along to Replica A and Replica B.
> 2. On Replica A all is well. It receives the update request and logs that
> 306 documents were deleted.
> 3. Replica B also receives the update request but at some point during the
> request something kills the connection. Leader logs a "connection reset"
> socket error. Replica B doesn't have any errors but it does log that it
> only deleted 95 documents as a result of the update call.
> 4. Because of the socket error, Leader starts leader-initiated-recovery
> for Replica B. It sets Replica B to the "down" state in ZK.
> 5. Replica B gets the leader-initiated-recovery request, updates its ZK
> state to "recovering", and starts the PeerSync process.
> 6. Replica B's PeerSync reports that it has gotten "100 versions" from the
> leader but then declares that "Our versions are newer" and finishes
> successfully.
> 7. Replica B puts itself back in the active state, but it is now out of
> sync with the Leader and Replica A. It is left with 211 documents in it
> that should have been deleted.
>
> I am curious if anyone has any thoughts on why Replica B failed to detect
> that it was behind the leader in this scenario.
>
> I'm not really clear on how the update version numbers are assigned, but
> is it possible that the 95 documents that did make it to Replica B had a
> later version number than the 211 that didn't? I don't have perfect
> understanding of the PeerSync code but looking through it, in particular at
> the logic that prints the "Our versions are newer" message, it seems like
> if 95 of the 100 documents fetched from the leader during PeerSync did
> match what the replica already has it might declare itself up-to-date
> without looking at the last few.
>


Missed update on replica

I have a three node, one shard SolrCloud cluster.

Last week one of the nodes went out of sync with the other two and I'm
trying to understand why that happened.

After poking through my logs and the solr code here's what I've pieced
together:

1. Leader gets an update request for a batch delete of 306 items. It sends
this update along to Replica A and Replica B.
2. On Replica A all is well. It receives the update request and logs that
306 documents were deleted.
3. Replica B also receives the update request but at some point during the
request something kills the connection. Leader logs a "connection reset"
socket error. Replica B doesn't have any errors but it does log that it
only deleted 95 documents as a result of the update call.
4. Because of the socket error, Leader starts leader-initiated-recovery for
Replica B. It sets Replica B to the "down" state in ZK.
5. Replica B gets the leader-initiated-recovery request, updates its ZK
state to "recovering", and starts the PeerSync process.
6. Replica B's PeerSync reports that it has gotten "100 versions" from the
leader but then declares that "Our versions are newer" and finishes
successfully.
7. Replica B puts itself back in the active state, but it is now out of
sync with the Leader and Replica A. It is left with 211 documents in it
that should have been deleted.

I am curious if anyone has any thoughts on why Replica B failed to detect
that it was behind the leader in this scenario.

I'm not really clear on how the update version numbers are assigned, but is
it possible that the 95 documents that did make it to Replica B had a later
version number than the 211 that didn't? I don't have perfect understanding
of the PeerSync code but looking through it, in particular at the logic
that prints the "Our versions are newer" message, it seems like if 95 of
the 100 documents fetched from the leader during PeerSync did match what
the replica already has it might declare itself up-to-date without looking
at the last few.


Update command not working

I posted this to http://localhost:8983/solr/default-collection/update and
it treated it like I was adding a whole document, not a partial update:

{
"id": "0be0daa1-a6ee-46d0-ba05-717a9c6ae283",
"tags": {
"add": [ "news article" ]
}
}

In the logs, I found this:

2016-02-26 14:07:50.831 ERROR (qtp2096057945-17) [c:default-collection
s:shard1_1 r:core_node21 x:default-collection] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException:
[doc=0be0daa1-a6ee-46d0-ba05-717a9c6ae283] missing required field: data_type
at
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:198)
at
org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:83)
at
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:273)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:207)
at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)

Does this make any sense? I sent updates just fine a day or two ago like
that; now it is acting like the update request is a whole new document.

Thanks,

Mike
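
For comparison, the SolrJ way of expressing the same atomic update is roughly
the following (the client setup is illustrative and the snippet is untested):

import java.util.Arrays;
import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/default-collection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "0be0daa1-a6ee-46d0-ba05-717a9c6ae283");
        // A Map value marks the field as an atomic "add" rather than a full replacement.
        doc.addField("tags", Collections.singletonMap("add", Arrays.asList("news article")));

        client.add(doc);
        client.commit();
        client.close();
    }
}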


Re: /select changes between 4 and 5

Yeah, it was a problem on my end. Not just the content-type as you
suggested, but I had to wrap that whole JSON body so it looked like this:

{
"params": { ///That block pasted here }
}

On Wed, Feb 24, 2016 at 11:05 AM, Yonik Seeley  wrote:

> POST in general still works for queries... I just verified it:
>
> curl -XPOST "http://localhost:8983/solr/techproducts/select"; -d "q=*:*"
>
> Maybe it's your content-type (since it seems like you are posting
> Python)... Were you using some sort of custom code that could
> read/accept other content types?
>
> -Yonik
>
>
> On Wed, Feb 24, 2016 at 8:48 AM, Mike Thomsen 
> wrote:
> > With 4.10, we used to post JSON like this example (part of it is Python)
> to
> > /select:
> >
> > {
> > "q": "LONG_QUERY_HERE",
> > "fq": fq,
> > "fl": ["id", "title", "date_of_information", "link", "search_text"],
> > "rows": 100,
> > "wt": "json",
> > "indent": "true",
> > "_": int(time.time())
> > }
> >
> > We just upgraded to 5.4.1, and now we can't seem to POST anything to
> > /select. I tried it out in the admin tool, and it only does GET
> operations
> > against /select (tried changing it to POST and moving query string to the
> > body with Firefox dev tools, but that failed).
> >
> > Is there a way to keep doing something like what we were doing or do we
> > need to limit ourselves to GETs? I think our queries are all small enough
> > now for that, but it would be helpful to know for planning.
> >
> > Thanks,
> >
> > Mike
>


/select changes between 4 and 5

With 4.10, we used to post JSON like this example (part of it is Python) to
/select:

{
"q": "LONG_QUERY_HERE",
"fq": fq,
"fl": ["id", "title", "date_of_information", "link", "search_text"],
"rows": 100,
"wt": "json",
"indent": "true",
"_": int(time.time())
}

We just upgraded to 5.4.1, and now we can't seem to POST anything to
/select. I tried it out in the admin tool, and it only does GET operations
against /select (tried changing it to POST and moving query string to the
body with Firefox dev tools, but that failed).

Is there a way to keep doing something like what we were doing or do we
need to limit ourselves to GETs? I think our queries are all small enough
now for that, but it would be helpful to know for planning.

Thanks,

Mike


Leader election issues after upgrade from 4.10.4 to 5.4.1

We get this error on one of our nodes:

Caused by: org.apache.solr.common.SolrException: There is conflicting
information about the leader of shard: shard2 our state says:
http://server01:8983/solr/collection/ but zookeeper says:
http://server02:8983/collection/


Then I noticed this in the log:

] o.a.s.c.c.ZkStateReader Load collection config
from:/collections/collection
2016-02-09 00:09:56.763 INFO  (qtp1037197792-12) [   ]
o.a.s.c.c.ZkStateReader path=/collections/collection configName=collection
specified config exists in ZooKeeper

We have a clusterstate.json file left over from 4.X. I read this thread and
the first comment or two suggested that clusterstate.json is now broken up
and refactored into the collections' configuration:

http://grokbase.com/t/lucene/solr-user/152v8bab2z/solr-cloud-does-not-start-with-many-collections

So should we get rid of the clusterstate.json file or keep it? We have 4
Solr VMs in our devops environment. They have 2 CPUs and 4GB of RAM. There
are about 7 collections shared between them, but all are negligible (a few
hundred KB each) except for one, which is about 22GB.

Thanks,

Mike


zkCli.sh not in solr 5.4?

I downloaded a build of 5.4.0 to install in some VMs and noticed that
zkCli.sh is not there. I need it in order to upload a configuration set to
ZooKeeper before I create the collection. What's the preferred way of doing
that?

Specifically, I need to specify a configuration like this because it's in a
Vagrant-managed set of VMs and I need to tell it to use the private network
IP addresses not my host's IP address:

/admin/collections?action=CREATE&name=default-collection2&numShards=4&replicationFactor=1&maxShardsPerNode=1&createNodeSet=192.168.56.20:8983
_solr,192.168.56.21:8983_solr,192.168.56.22:8983_solr,192.168.56.23:8983
_solr&collection.configName=default-collection

Thanks,

Mike


Phrase query not matching exact tokens in some cases

For the query "police office" our users are getting back highlighted
results for "police office*r*" (and "police office*rs*"). I get why a search
for "police officers" would also match just "office", since the stemmer would
cause that behavior. However, I don't understand why "office" is matching
"officer" here when no fuzzy matching is being done. Is that also a result
of our stemmer?

Here's the text field we're using:

























Thanks,

Mike


Re: Too many Soft commits and opening searchers realtime

Are the clients that are posting updates requesting commits?

On Tue, Jul 7, 2015 at 4:29 PM, Summer Shire  wrote:

> HI All,
>
> Can someone help me understand the following behavior.
> I have the following maxTimes on hard and soft commits
>
> yet I see a lot of Opening Searchers in the log
> org.apache.solr.search.SolrIndexSearcher- Opening Searcher@1656a258[main]
> realtime
> also I see a soft commit happening almost every 30 secs
> org.apache.solr.update.UpdateHandler - start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
> <autoCommit>
>   <maxTime>48</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxTime>18</maxTime>
> </autoSoftCommit>
> I tried disabling softCommit by setting maxTime to -1.
> On startup solrCore recognized it and logged "Soft AutoCommit: disabled"
> but I could still see softCommit=true
> org.apache.solr.update.UpdateHandler - start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
> <autoSoftCommit>
>   <maxTime>-1</maxTime>
> </autoSoftCommit>
>
> Thanks,
> Summer

