Re: URGENT Documents automatically getting deleted in SOLR 6.6.0

2019-09-26 Thread Alexandre Rafalovitch
Your system is under attack; something is trying to hack into it via
Solr, possibly a cryptominer or similar, and it is using the DIH endpoint
to do so.

Shawn explained the most likely cause for Solr actually deleting the
records. I would also suggest:
1) Figure out where the requests are coming from and treat the source as
a threat. If it is internal, that machine is infected. If it is external
and persistent, it may need to be blocked, etc.
2) Check that your system has not already been infected by looking for
suspicious processes. If you are not on Windows, that particular payload
may not be a threat, but the attack may have used several methods.
3) If you are not using the DataImportHandler, remove it from
solrconfig.xml. Or rename it (though that will lose the Admin UI
integration). Or block access to it at the firewall (a rough sketch
follows below).
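
As a sketch of the firewall option, assuming a typical setup where Solr
listens on port 8983 and only one trusted application host needs access
(both are assumptions; path-level blocking of just /dataimport would need
a reverse proxy in front of Solr instead):

    # Hedged sketch: permit only a trusted application host (assumed
    # 10.0.0.5) to reach Solr's port (assumed 8983), drop everything else.
    iptables -A INPUT -p tcp --dport 8983 -s 10.0.0.5 -j ACCEPT
    iptables -A INPUT -p tcp --dport 8983 -j DROP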

Regards,
   Alex.

On Thu, 26 Sep 2019 at 08:42, Neha  wrote:
>
> Hello SOLR Users,
>
> Today I have noticed that in my Solr instance 6.6.0, documents are
> getting automatically deleted.
>
> In the Solr traces I found the lines below, and it seems this is the cause.
>
>
> 2019-09-26 09:01:21.599 INFO  (qtp225493257-14) [   x:Ecotron]
> o.a.s.c.S.Request [xyz]  webapp=/solr path=/dataimport
> 

Re: [Apache Solr ReRanking] Sort Clauses Bug

2019-09-26 Thread Alessandro Benedetti
Personally I was expecting the sort request parameter to be applied to the
final search results:
1) run the original query, get the top K based on score
2) run the rerank query on the top K, recalculating the scores
3) finally, apply the sort

But when you mentioned "you expect the sort specified to be applied to both
the “outer” and “inner” queries",
I changed my mind: it is probably a better solution to give the user
flexibility in controlling both the original query sort (which affects the
top K retrieval) and the final sort (the one ordering the reranked results).

*Currently the 'sort' global request parameter affects the way the top K
are retrieved, then they are re-ranked.*
Unfortunately the workaround you suggested through the local params of the
rerank query parser doesn't seem to work at all in 8.1.1 :(
Unless it was introduced in 8.2, I think it is a good idea to create the
Jira issue, with this in mind:
1) we want to be able to decide the sort for both the original query (to
determine the top K) and the final results
2) we need to decide which request parameter should do what
e.g.
should the 'sort' request param affect *the original query* OR the final
results?
should the 'sort' in the local params of the reRank query parser affect
 the original query OR *the final results*?

My personal preference is in bold, but I don't have a hard position either
way.
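
To make the proposal concrete, a hypothetical request under the semantics
sketched above (this is the proposed behaviour, not what 8.1.1 currently
does; all parameter roles shown are the proposal, not an existing API):

    # Proposed semantics (hypothetical): the global 'sort' shapes the
    # top-K retrieval, while the local-param 'sort' on the rerank parser
    # orders the final reranked results.
    curl "http://localhost:8983/solr/books/select" \
      --data-urlencode "q=*:*" \
      --data-urlencode "rq={!rerank reRankQuery=\$rqq reRankDocs=1200 reRankWeight=3 sort='score desc, downloads desc'}" \
      --data-urlencode "rqq=(vegeta ssj)" \
      --data-urlencode "sort=score desc" \
      --data-urlencode "fl=id,title,score,downloads"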

Cheers
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Thu, Sep 26, 2019 at 5:23 PM Erick Erickson  wrote:

> OK so to restate, you expect the sort specified to be applied to both the
> “outer” and “inner” queries. Makes sense, seems like a good enhancement.
>
> Hmm, I wonder if you can put the sort parameter in with the rerank
> specification, like: q={!rerank reRankQuery=$rqq reRankDocs=1200
> reRankWeight=3 sort="score desc, downloads desc”}
>
> That doesn’t address your initial point, just curious if it’d do as a
> workaround meanwhile.
>
> Best,
> Erick
>
>
> > On Sep 26, 2019, at 10:54 AM, Alessandro Benedetti  wrote:
> >
> > In the first OK scenario, the search results are sorted with score desc,
> > and when the score is identical, the secondary sort field is applied.
> >
> > In the KO scenario, only score desc is taken into consideration (the
> > reranked score); the secondary sort by the sort field is ignored.
> >
> > I suspect an intuitive expected result would be to have the same behaviour
> > that happens with no reranking, so:
> > 1) sort the final results by reranked score desc
> > 2) when the reranked score is identical, sort by the secondary sort field
> >
> > Is it clearer?
> > Any wrong assumption?
> >
> >
> > On Thu, 26 Sep 2019, 14:34 Erick Erickson,  wrote:
> >
> >> Hmmm, can we see a bit of sample output? I always have to read this
> >> backwards, the outer query results are sent to the inner query, so my
> >> _guess_ is that the sort is applied to the “q=*:*” and then the top 1,200
> >> are sorted by score by the rerank. But then I’m often confused about this.
> >>
> >> Erick
> >>
> >>> On Sep 25, 2019, at 5:47 PM, Alessandro Benedetti <a.benede...@sease.io> wrote:
> >>>
> >>> Hi all,
> >>> I was playing a bit with the reranking capability and I discovered
> that:
> >>>
> >>> *Sort by score, then by secondary field -> OK*
> >>> http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score
> >>> desc,downloads desc*&fl=id,title,score,downloads
> >>>
> >>> *ReRank, Sort by score, then by secondary field -> KO*
> >>> http://localhost:8983/solr/books/select?q=*:*&rq={!rerank reRankQuery=$rqq
> >>> reRankDocs=1200 reRankWeight=3}&rqq=(vegeta ssj)&*sort=score desc,downloads
> >>> desc*&fl=id,title,score,downloads
> >>>
> >>> Is this intended? It sounds counter-intuitive to me and I wanted to check
> >>> before opening a Jira issue.
> >>> Tested on 8.1.1 but it should be in master as well.
> >>>
> >>> Regards
> >>> --
> >>> Alessandro Benedetti
> >>> Search Consultant, R&D Software Engineer, Director
> >>> www.sease.io
> >>
> >>
>
>


Re: auto scaling question - solr 8.2.0

2019-09-26 Thread Joe Obernberger
Just as another data point.  I just tried again, and this time, I got an 
error from one of the remaining 3 nodes:


Error while trying to recover. 
core=UNCLASS_2019_6_8_36_shard2_replica_n21:java.util.concurrent.ExecutionException:
 org.apache.solr.client.solrj.SolrServerException: IOException occurred when 
talking to server at: http://telesto:9100/solr
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:902)
at 
org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:603)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:336)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:317)
at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException 
occurred when talking to server at: http://telesto:9100/solr
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:670)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.lambda$httpUriRequest$0(HttpSolrClient.java:306)
... 5 more
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:204)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at 
org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at 
org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at 
org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at 
org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120)
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:555)
... 6 more


At this point, no nodes are hosting one of the collections.

-Joe

On 9/26/2019 1:32 PM, Joe Obernberger wrote:
Hi all - I have a 4 node cluster for test, and created several solr 
collections with 2 shards and 2 replicas each.


I'd like the global policy to be to not place more than one replica of 
the same shard on the same node.  I did this with this curl command:
curl -X POST -H 'Content-type:application/json' --data-binary 
'{"set-cluster-policy":[{"replica": "<2", "shard": "#EACH", "node": 
"#ANY"}]}' http://localhost:9100/solr/admin/autoscaling


Creating the collections works great - they are distributed across the 
nodes nicely.  When I turn a node off, however (going from 4 nodes to
3), not only is the same node assigned both replicas of a
shard, but one node is now hosting all of the replicas of a collection,
i.e.:

collection->shard1->replica1,replica2
collection->shard2->replica1,replica2

all of those replicas above are hosted by the same node.  What am I 
doing wrong here?  
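
A quick way to confirm which nodes actually host which replicas after the
node goes down is the standard CLUSTERSTATUS call (host/port as in the
policy command above; the collection name is taken from the error):

    # Standard Collections API read; adjust host/port and collection name.
    curl "http://localhost:9100/solr/admin/collections?action=CLUSTERSTATUS&collection=UNCLASS_2019_6_8_36"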

auto scaling question - solr 8.2.0

2019-09-26 Thread Joe Obernberger
Hi all - I have a 4 node cluster for test, and created several solr 
collections with 2 shards and 2 replicas each.


I'd like the global policy to be to not place more than one replica of 
the same shard on the same node.  I did this with this curl command:
curl -X POST -H 'Content-type:application/json' --data-binary 
'{"set-cluster-policy":[{"replica": "<2", "shard": "#EACH", "node": 
"#ANY"}]}' http://localhost:9100/solr/admin/autoscaling


Creating the collections works great - they are distributed across the 
nodes nicely.  When I turn a node off, however (going from 4 nodes to
3), not only is the same node assigned both replicas of a shard,
but one node is now hosting all of the replicas of a collection, i.e.:

collection->shard1->replica1,replica2
collection->shard2->replica1,replica2

all of those replicas above are hosted by the same node.  What am I 
doing wrong here?  Thank you!


-Joe



RE: How to split a shard?

2019-09-26 Thread Gael Jourdan-Weil
Thanks for your answer Shawn, let's use the Collections API only then :)

Any idea what could cause the "missing index size information for parent shard 
leader" error message?

Regards,
Gaël

From: Shawn Heisey 
Sent: Thursday, September 26, 2019, 16:58
To: solr-user@lucene.apache.org 
Subject: Re: How to split a shard?

On 9/26/2019 8:50 AM, Gael Jourdan-Weil wrote:
> We are trying to split a single shard into two but we are encountering some 
> issues we don't understand.



> A) Create a new core "col_core2", then run the SPLIT 
> (https://lucene.apache.org/solr/guide/7_6/coreadmin-api.html#coreadmin-split)

If you are running SolrCloud, do NOT use the CoreAdmin API.  Use of the
CoreAdmin API on SolrCloud will lead to problems.

Use the Collections API only.

https://lucene.apache.org/solr/guide/7_6/collections-api.html#splitshard

Thanks,
Shawn
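
For the setup described in this thread, a minimal SPLITSHARD call would
look something like this (host and port are assumptions; collection "col"
and shard "shard1" are from the original message):

    # SPLITSHARD creates shard1_0 and shard1_1 from shard1, then retires
    # the parent shard once the sub-shards are active.
    curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=col&shard=shard1"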


Re: [Apache Solr ReRanking] Sort Clauses Bug

2019-09-26 Thread Erick Erickson
OK so to restate, you expect the sort specified to be applied to both the 
“outer” and “inner” queries. Makes sense, seems like a good enhancement.

Hmm, I wonder if you can put the sort parameter in with the rerank 
specification, like: q={!rerank reRankQuery=$rqq reRankDocs=1200 reRankWeight=3 
sort="score desc, downloads desc”}

That doesn’t address your initial point, just curious if it’d do as a 
workaround meanwhile.

Best,
Erick


> On Sep 26, 2019, at 10:54 AM, Alessandro Benedetti  
> wrote:
> 
> In the first OK scenario, the search results are sorted with score desc,
> and when the score is identical, the secondary sort field is applied.
> 
> In the KO scenario, only score desc is taken into consideration (the
> reranked score); the secondary sort by the sort field is ignored.
> 
> I suspect an intuitive expected result would be to have the same behaviour
> that happens with no reranking, so:
> 1) sort the final results by reranked score desc
> 2) when the reranked score is identical, sort by the secondary sort field
> 
> Is it clearer?
> Any wrong assumption?
> 
> 
> On Thu, 26 Sep 2019, 14:34 Erick Erickson,  wrote:
> 
>> Hmmm, can we see a bit of sample output? I always have to read this
>> backwards, the outer query results are sent to the inner query, so my
>> _guess_ is that the sort is applied to the “q=*:*” and then the top 1,200
>> are sorted by score by the rerank. But then I’m often confused about this.
>> 
>> Erick
>> 
>>> On Sep 25, 2019, at 5:47 PM, Alessandro Benedetti  wrote:
>>> 
>>> Hi all,
>>> I was playing a bit with the reranking capability and I discovered that:
>>> 
>>> *Sort by score, then by secondary field -> OK*
>>> http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score
>>> desc,downloads desc*&fl=id,title,score,downloads
>>> 
>>> *ReRank, Sort by score, then by secondary field -> KO*
>>> http://localhost:8983/solr/books/select?q=*:*&rq={!rerank reRankQuery=$rqq
>>> reRankDocs=1200 reRankWeight=3}&rqq=(vegeta ssj)&*sort=score desc,downloads
>>> desc*&fl=id,title,score,downloads
>>> 
>>> Is this intended? It sounds counter-intuitive to me and I wanted to check
>>> before opening a Jira issue
>>> Tested on 8.1.1 but it should be in master as well.
>>> 
>>> Regards
>>> --
>>> Alessandro Benedetti
>>> Search Consultant, R&D Software Engineer, Director
>>> www.sease.io
>> 
>> 



Re: [Apache Solr ReRanking] Sort Clauses Bug

2019-09-26 Thread Alessandro Benedetti
In the first OK scenario, the search results are sorted with score desc,
and when the score is identical, the secondary sort field is applied.

In the KO scenario, only score desc is taken into consideration (the
reranked score); the secondary sort by the sort field is ignored.

I suspect an intuitive expected result would be to have the same behaviour
that happens with no reranking, so:
1) sort the final results by reranked score desc
2) when the reranked score is identical, sort by the secondary sort field

Is it clearer?
Any wrong assumption?


On Thu, 26 Sep 2019, 14:34 Erick Erickson,  wrote:

> Hmmm, can we see a bit of sample output? I always have to read this
> backwards, the outer query results are sent to the inner query, so my
> _guess_ is that the sort is applied to the “q=*:*” and then the top 1,200
> are sorted by score by the rerank. But then I’m often confused about this.
>
> Erick
>
> > On Sep 25, 2019, at 5:47 PM, Alessandro Benedetti  wrote:
> >
> > Hi all,
> > I was playing a bit with the reranking capability and I discovered that:
> >
> > *Sort by score, then by secondary field -> OK*
> > http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score
> > desc,downloads desc*&fl=id,title,score,downloads
> >
> > *ReRank, Sort by score, then by secondary field -> KO*
> > http://localhost:8983/solr/books/select?q=*:*&rq={!rerank reRankQuery=$rqq
> > reRankDocs=1200 reRankWeight=3}&rqq=(vegeta ssj)&*sort=score desc,downloads
> > desc*&fl=id,title,score,downloads
> >
> > Is this intended? It sounds counter-intuitive to me and I wanted to check
> > before opening a Jira issue
> > Tested on 8.1.1 but it should be in master as well.
> >
> > Regards
> > --
> > Alessandro Benedetti
> > Search Consultant, R&D Software Engineer, Director
> > www.sease.io
>
>


Re: How to split a shard?

2019-09-26 Thread Shawn Heisey

On 9/26/2019 8:50 AM, Gael Jourdan-Weil wrote:

We are trying to split a single shard into two but we are encountering some 
issues we don't understand.





A) Create a new core "col_core2", then run the SPLIT 
(https://lucene.apache.org/solr/guide/7_6/coreadmin-api.html#coreadmin-split)


If you are running SolrCloud, do NOT use the CoreAdmin API.  Use of the 
CoreAdmin API on SolrCloud will lead to problems.


Use the Collections API only.

https://lucene.apache.org/solr/guide/7_6/collections-api.html#splitshard

Thanks,
Shawn


How to split a shard?

2019-09-26 Thread Gael Jourdan-Weil
Hi,

We are trying to split a single shard into two but we are encountering some 
issues we don't understand.

Our current setup:
- 1 collection "col"
- 1 shard "shard1"
- 2 nodes, each having the whole collection (SolrCloud)
- 1 core on each node "col_core"

What we would like to have is:
- 1 collection "col"
- 2 shards: "shard1" and "shard2"
- 2 nodes, each still having the whole collection, thus the 2 shards (SolrCloud)
- 2 cores on each node: "col_core", "col_core2"

We tried following actions:
A) Create a new core "col_core2", then run the SPLIT 
(https://lucene.apache.org/solr/guide/7_6/coreadmin-api.html#coreadmin-split)
action to split "col_core" to targetCore "col_core2" but we get the error "Core 
with core name col_core2 must be the only replica in shard shard2"
B) Run the SPLITSHARD action to split the shard "shard1" into two other shards 
but we get the error "missing index size information for parent shard leader".

What would be the best approach to this splitting? And any idea why we 
get these errors?

(We are using Solr 7.6.0)
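
Regarding error B: the "missing index size information" message appears to
come from SPLITSHARD asking the parent shard leader for its index size
through the Metrics API, so one hedged check is whether that metric is
exposed on the leader node (host and port are assumptions):

    # Check that the core-level index size metric is available; SPLITSHARD
    # reads it to estimate required disk space (hedged interpretation).
    curl "http://localhost:8983/solr/admin/metrics?group=core&prefix=INDEX.sizeInBytes"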

Thanks for reading,
Regards,

Gaël

Re: How to resolve a single domain name to multiple zookeeper IP in Solr

2019-09-26 Thread LEE Ween Jiann
Thank you, this is what I needed to know.

On 26/9/19, 9:08 PM, "Shawn Heisey"  wrote:

On 9/26/2019 4:12 AM, LEE Ween Jiann wrote:
> I'm trying to modify the helm chart for solr such that it works for 
kubernetes (k8s) deployment correctly. There needs to be a particular change in 
the way solr resolves zookeepers hostname in order for this to happen.

This is the solr-user mailing list.  Your question is about ZooKeeper.

Solr uses the ZK client without any modifications.  It passes the zkHost 
string to ZK and ZK handles it.  Solr does not interpret that string -- 
it is ZK that is looking up the hosts, not Solr.

You're going to need to ask ZK folks this question.

Thanks,
Shawn




Re: how to configure AWS S3 bucket to index data

2019-09-26 Thread nenzius
did you solved this problem?

Thanks



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: URGENT Documents automatically getting deleted in SOLR 6.6.0

2019-09-26 Thread Shawn Heisey

On 9/26/2019 6:42 AM, Neha wrote:
Today I have noticed that in my Solr instance 6.6.0, documents are 
getting automatically deleted.


In the Solr traces I found the lines below, and it seems this is the cause.

2019-09-26 09:01:21.599 INFO  (qtp225493257-14) [   x:Ecotron] 



Also, the "dataimport.properties" file of each core is getting updated 
with something like the below:


*stackoverflow.last_index_time=2019-09-26 08\:24\:11*


One of the parameters of your DIH request is "clean=true".  I can see 
this in the logged message that contains "o.a.s.c.S.Request".  What this 
parameter means is that DIH will delete all documents in the index as 
its first step.


There is an error logged, but the fact that dataimport.properties is 
being updated suggests that DIH is probably honoring the clean=true 
parameter, then throwing the error that says the config is not good.
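
For illustration, the delete-all step is controlled by the clean
parameter; a request that preserves existing documents passes clean=false
explicitly (host, port, and the core name -- taken from the quoted
dataimport.properties -- are assumptions):

    # clean=false tells DIH not to delete existing documents before importing.
    curl "http://localhost:8983/solr/stackoverflow/dataimport?command=full-import&clean=false"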


Thanks,
Shawn


[JOB] remote job at Help Scout

2019-09-26 Thread Leah Knobler
Hey all!

Help Scout, a 100-person remote company that builds helpful customer
messaging tools, is looking for a Java Data Engineer to join our Search
team. We are looking to hire someone who relishes designing
and building systems and services that can manage large data sets with a
high transaction volume that are scaling constantly to meet customer
demand. The ideal person takes pride in building coherent and usable
interfaces making it easy to use and operate on data. This role would allow
you to take on challenging problems, choose the right tools for the job and
build elegant, scalable solutions.

If this sounds like a fit, please apply, and feel free to reach out to me.
Thanks!

Leah


Re: How to resolve a single domain name to multiple zookeeper IP in Solr

2019-09-26 Thread Shawn Heisey

On 9/26/2019 4:12 AM, LEE Ween Jiann wrote:

I'm trying to modify the helm chart for solr such that it works for kubernetes 
(k8s) deployment correctly. There needs to be a particular change in the way 
solr resolves zookeepers hostname in order for this to happen.


This is the solr-user mailing list.  Your question is about ZooKeeper.

Solr uses the ZK client without any modifications.  It passes the zkHost 
string to ZK and ZK handles it.  Solr does not interpret that string -- 
it is ZK that is looking up the hosts, not Solr.


You're going to need to ask ZK folks this question.

Thanks,
Shawn


URGENT Documents automatically getting deleted in SOLR 6.6.0

2019-09-26 Thread Neha

Hello SOLR Users,

Today I have noticed that in my Solr instance 6.6.0, documents are 
getting automatically deleted.


In the Solr traces I found the lines below, and it seems this is the cause.


2019-09-26 09:01:21.599 INFO  (qtp225493257-14) [   x:Ecotron] 
o.a.s.c.S.Request [xyz]  webapp=/solr path=/dataimport 

Re: [Apache Solr ReRanking] Sort Clauses Bug

2019-09-26 Thread Erick Erickson
Hmmm, can we see a bit of sample output? I always have to read this backwards, 
the outer query results are sent to the inner query, so my _guess_ is that the 
sort is applied to the “q=*:*” and then the top 1,200 are sorted by score by 
the rerank. But then I’m often confused about this.

Erick

> On Sep 25, 2019, at 5:47 PM, Alessandro Benedetti  
> wrote:
> 
> Hi all,
> I was playing a bit with the reranking capability and I discovered that:
> 
> *Sort by score, then by secondary field -> OK*
> http://localhost:8983/solr/books/select?q=vegeta ssj&*sort=score
> desc,downloads desc*=id,title,score,downloads
> 
> *ReRank, Sort by score, then by secondary field -> KO*
> http://localhost:8983/solr/books/select?q=*:*={!rerank reRankQuery=$rqq
> reRankDocs=1200 reRankWeight=3}=(vegeta ssj)&*sort=score desc,downloads
> desc*=id,title,score,downloads
> 
> Is this intended? It sounds counter-intuitive to me and I wanted to check
> before opening a Jira issue
> Tested on 8.1.1 but it should be in master as well.
> 
> Regards
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io



Re: Undefined field - solr 7.2.1 cloud

2019-09-26 Thread Erick Erickson
BTW, my purpose in suggesting you remove managed-schema is just to ensure that 
you’re really using classic. Solr will blow up because it’s unable to find 
managed-schema if, for some strange reason, you’re really using managed.

The two can exist perfectly well together; one should be used and one ignored, 
based on the solrconfig.xml settings.

Last I knew, what _can’t_ happen is using a managed schema factory while 
specifying that your managed file is “schema.xml”; that’ll throw an error on 
startup. “Fail early” and all that.

Best,
Erick

> On Sep 25, 2019, at 1:15 PM, Antony A  wrote:
> 
> Thanks Erick.
> 
> I have removed the managed-schema for now. This setup was running perfectly
> for a couple of years. I implemented basic auth around the collection a year
> back, but nothing really changed in my process for updating the schema. Let me
> see if removing managed-schema has any impact, and I will update.
> 
> 
> 
> On Wed, Sep 25, 2019 at 9:16 AM Erick Erickson 
> wrote:
> 
>> Then something sounds wrong with your setup. The configs are stored in ZK,
>> and read from ZooKeeper every time Solr starts. So how the replica “does
>> not have the correct schema” is a complete mystery.
>> 
>> You say you have ClassicIndexSchemaFactory set up. Take a look at your
>> configs _through the Admin UI from the “collections” drop-down_ and verify.
>> This reads the same thing in ZooKeeper. Sometimes I’ve thought I was set up
>> one way and discovered later that I wasn’t.
>> 
>> Next: Do you have “managed-schema” _and_ “schema.xml” in your configs? If
>> you’re indeed using classic, you can remove managed-schema.
>> 
>> All to make sure you’re operating as you think you are.
>> 
>> Best,
>> Erick
>> 
>>> On Sep 24, 2019, at 3:58 PM, Antony A  wrote:
>>> 
>>> Hi,
>>> 
>>> I also observed that whenever the JVM crashes, the replicas do not have
>>> the correct schema. Has anyone seen similar behavior?
>>> 
>>> Thanks,
>>> AA
>>> 
 On Wed, Sep 4, 2019 at 9:58 PM Antony A  wrote:
>>> 
 Hi,
 
 I have confirmed that the ZK ensemble is external. Even though both
 managed-schema and schema.xml are on the admin UI, I see the below class
 defined in solrconfig.
 
 
 The workaround is still to run "solr zk upconfig" followed by restarting
 the cores of the collection. Anything else I should be looking into?
 
 Thanks
 
 On Wed, Sep 4, 2019 at 6:31 PM Erick Erickson  wrote:
 
> This almost always means that you really _didn’t_ update the schema and
> reload the collection, you just thought you did ;).
> 
> One common reason is to fire up Solr with an internal ZooKeeper but have
> the rest of your collection be using an external ensemble.
> 
> Another is to be modifying schema.xml when using managed-schema or
> vice-versa.
> 
> First thing I’d do is check the ZK ensemble: do any of the ports
> referenced by the admin screen use 9983? If so, it’s internal.
> 
> Second thing I’d do is, in the admin UI, select my collection from the
> drop-down list, then click "files" and open up the schema. Check that there
> is only managed-schema or schema.xml. If both are present, check your
> solrconfig to see which one you’re using. Then open the schema and check
> that your field is there. BTW, the field will be explicitly stated in the
> Solr log.
> 
> Third thing I’d do is open the admin
> UI>>configsets>>the_configset_you’re_using and check which schema you’re
> using and again check whether the field is in the schema.
> 
> Best,
> Erick
> 
>> On Sep 4, 2019, at 3:27 PM, Antony A 
>> wrote:
>> 
>> Hi,
>> 
>> I ran the collection reload after a new "leader" core was selected for the
>> collection due to heap failure on the previous core. But I still have a
>> stack trace with common.SolrException: undefined field.
>> 
>> On Thu, Aug 29, 2019 at 1:36 PM Antony A  wrote:
>> 
>>> Yes. I do restart the cores on all the different servers. I will look at
>>> implementing reloading the collection. Thank you for your suggestion.
>>> 
>>> Cheers,
>>> Antony
>>> 
>>> On Thu, Aug 29, 2019 at 1:34 PM Shawn Heisey  wrote:
>>> 
 On 8/29/2019 1:22 PM, Antony A wrote:
> I do restart Solr after changing schema using "solr zk upconfig". I am
> yet to confirm but I do have a daily cron that does "delta" import. Does
> that process have any bearing on some cores losing the field?
 
 Did you restart all the Solr servers?  If the collection lives on
 multiple servers, restarting one of the servers is not going to affect
 replicas living on other servers.
 
 Reloading the collection with an HTTP request to the collections API is
 a better option than restarting Solr.
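
A hedged sketch of the workflow suggested in this thread -- push the
edited config to ZooKeeper, then reload the collection via the Collections
API instead of restarting Solr (config set name, path, ZK address, and
collection name are all placeholders):

    # Placeholder names throughout; adjust to your config set and collection.
    bin/solr zk upconfig -n my_config -d /path/to/conf -z zk1:2181
    curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=my_collection"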
 
 

RE: How to resolve a single domain name to multiple zookeeper IP in Solr

2019-09-26 Thread LEE Ween Jiann

Yes, ZooKeeper supports dynamic reconfiguration from 3.5.x.
I am referring to Solr here.

You would need to specify the list of zookeeper servers in solr.in.sh or 
solr.in.cmd or as -z param.
https://lucene.apache.org/solr/guide/8_1/setting-up-an-external-zookeeper-ensemble.html
But scaling ZooKeeper after a helm deployment does not change this ZK_HOST 
list automatically; this is intended, as helm/k8s does not do this for you, 
and you should not change it manually.

K8s has DNS that allows you to resolve a single domain name to multiple IPs. 
Let's say this domain is zk-headless; then ZK_HOST="zk-headless:2181".
Solr should resolve all instances of ZooKeeper from a single domain name.

nslookup zk-headless
Server:  xxx
Address:  xxx
Non-authoritative answer:
Name:zk-headless
Addresses:  
  10.0.0.11
  10.0.0.12
  10.0.0.13

These three addresses will be the ZooKeeper servers.
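
If Solr will not expand a multi-address name itself, one hedged workaround
is to build a conventional ZK_HOST string from the headless service at
container start, before launching Solr (service name and port as above;
getent is assumed to be available in the image):

    # Resolve every A record behind the headless service and join them into
    # a standard host:port,host:port connect string before starting Solr.
    ZK_HOST=$(getent ahosts zk-headless | awk '/STREAM/ {print $1 ":2181"}' | sort -u | paste -sd,)
    export ZK_HOST
    bin/solr start -c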

-Original Message-
From: Jörn Franke  
Sent: Thursday, September 26, 2019 6:41 PM
To: solr-user@lucene.apache.org
Subject: Re: How to resolve a single domain name to multiple zookeeper IP in 
Solr

The newest zk version supports dynamic change of the zk instances:

https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html

However, for that to work properly in the case of a Solr restart, you always 
need a minimal set of servers that do not change, and you should only 
increase/decrease the additional ones.

> Am 26.09.2019 um 12:22 schrieb LEE Ween Jiann :
> 
> I'm trying to modify the helm chart for solr such that it works for 
> kubernetes (k8s) deployment correctly. There needs to be a particular change 
> in the way solr resolves zookeepers hostname in order for this to happen.
> 
> Let me explain...
> The standard way to configure solr is by listing all the zookeeper 
> hostname/IP in either:
> 
>  *   solr.in.sh or solr.in.cmd
>  *   zoo.cfg
>  *   -z param
> For example: ZK_HOST="zk1:2181,zk2:2181,zk3:2181".
> 
> However, when it comes to cloud deployment, in particular on k8s using helm 
> chart, this is not an ideal situation as the user is required to modify 
> zk_host each time they scale the number of zookeeper up/down.
> 
>  *   For example (scale down): ZK_HOST="zk1:2181,zk2:2181".
>  *   For example (scale up): ZK_HOST="zk1:2181,zk2:2181,zk3:2181,zk4:2181".
> 
> This cannot be done automatically in helm/k8s. In k8s, this parameter 
> should remain static, meaning that it should not be changed after deployment 
> of the chart.
> 
>  *   For example (k8s): ZK_HOST="zk-headless:2181".
> 
> What a chart can do is to create a service with a DNS name such as 
> zk-headless that contains all the IPs of the ZooKeepers, and as ZooKeeper 
> scales, the number of IPs resolved from zk-headless changes. Could Solr 
> resolve multiple ZooKeeper IPs from a single name?
> 
> Cheers,
> Ween Jiann


Re: How to resolve a single domain name to multiple zookeeper IP in Solr

2019-09-26 Thread Jörn Franke
The newest zk version supports dynamic change of the zk instances:

https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html

However, for that to work properly in the case of a Solr restart, you always 
need a minimal set of servers that do not change, and you should only 
increase/decrease the additional ones.

> Am 26.09.2019 um 12:22 schrieb LEE Ween Jiann :
> 
> I'm trying to modify the helm chart for solr such that it works for 
> kubernetes (k8s) deployment correctly. There needs to be a particular change 
> in the way solr resolves zookeepers hostname in order for this to happen.
> 
> Let me explain...
> The standard way to configure solr is by listing all the zookeeper 
> hostname/IP in either:
> 
>  *   solr.in.sh or solr.in.cmd
>  *   zoo.cfg
>  *   -z param
> For example: ZK_HOST="zk1:2181,zk2:2181,zk3:2181".
> 
> However, when it comes to cloud deployment, in particular on k8s using helm 
> chart, this is not an ideal situation as the user is required to modify 
> zk_host each time they scale the number of zookeeper up/down.
> 
>  *   For example (scale down): ZK_HOST="zk1:2181,zk2:2181".
>  *   For example (scale up): ZK_HOST="zk1:2181,zk2:2181,zk3:2181,zk4:2181".
> 
> This cannot be done automatically in helm/k8s. In k8s, this parameter 
> should remain static, meaning that it should not be changed after deployment 
> of the chart.
> 
>  *   For example (k8s): ZK_HOST="zk-headless:2181".
> 
> What a chart can do is to create a service with a DNS name such as 
> zk-headless that contains all the IPs of the ZooKeepers, and as ZooKeeper 
> scales, the number of IPs resolved from zk-headless changes. Could Solr 
> resolve multiple ZooKeeper IPs from a single name?
> 
> Cheers,
> Ween Jiann


How to resolve a single domain name to multiple zookeeper IP in Solr

2019-09-26 Thread LEE Ween Jiann
I'm trying to modify the helm chart for Solr so that it works correctly for 
kubernetes (k8s) deployment. There needs to be a particular change in the way 
Solr resolves ZooKeeper hostnames in order for this to happen.

Let me explain...
The standard way to configure solr is by listing all the zookeeper hostname/IP 
in either:

  *   solr.in.sh or solr.in.cmd
  *   zoo.cfg
  *   -z param
For example: ZK_HOST="zk1:2181,zk2:2181,zk3:2181".

However, when it comes to cloud deployment, in particular on k8s using a helm 
chart, this is not ideal, as the user is required to modify zk_host 
each time they scale the number of ZooKeepers up/down.

  *   For example (scale down): ZK_HOST="zk1:2181,zk2:2181".
  *   For example (scale up): ZK_HOST="zk1:2181,zk2:2181,zk3:2181,zk4:2181".

This cannot be done automatically in helm/k8s. In k8s, this parameter 
should remain static, meaning that it should not be changed after deployment of 
the chart.

  *   For example (k8s): ZK_HOST="zk-headless:2181".

What a chart can do is create a service with a DNS name such as zk-headless 
that contains all the IPs of the ZooKeepers; as ZooKeeper scales, the number 
of IPs resolved from zk-headless changes. Could Solr resolve multiple 
ZooKeeper IPs from a single name?

Cheers,
Ween Jiann


Re: Need more info on MLT (More Like This) feature

2019-09-26 Thread Alessandro Benedetti
In addition to all the valuable information already shared, I am curious to
understand why you think the results are unreliable.
Most of the time it is the parameters that cause some of the terms of the
original document/corpus to be ignored (for example, the min/max document
frequency to consider, or the min term frequency in the source doc).

I have been working a lot on the MLT in the past years and presenting the
work done (and internals) at various conferences/meetups.

I'll share some slides and some Jira issues that may help you:

https://www.youtube.com/watch?v=jkaj89XwHHw&t=540s
  
https://www.slideshare.net/SeaseLtd/how-the-lucene-more-like-this-works
  

https://issues.apache.org/jira/browse/LUCENE-8326
  
https://issues.apache.org/jira/browse/LUCENE-7802
  
https://issues.apache.org/jira/browse/LUCENE-7498
  

Generally speaking, I favour the MLT query parser: it builds the MLT query
and gives you the chance to inspect it using debug.
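
For example, a minimal MLT query parser request of that kind (collection
name, fields, and document id are placeholders; debug=query exposes the
More Like This query that gets built):

    # Placeholder collection/fields/id; mintf and mindf are the MLT local
    # params for minimum term frequency and minimum document frequency.
    curl "http://localhost:8983/solr/my_collection/select" \
      --data-urlencode "q={!mlt qf=title,description mintf=1 mindf=2}doc-id-1" \
      --data-urlencode "debug=query"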



-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html