Questions for SynonymGraphFilter and WordDelimiterGraphFilter

2019-01-04 Thread Wei
Hello,

We are upgrading to Solr 7.6.0 and noticed that SynonymFilter and
WordDelimiterFilter have been deprecated. The Solr documentation recommends
using SynonymGraphFilter and WordDelimiterGraphFilter instead. In our current
schema, we have a text field type defined as:

  [fieldType XML omitted]

In the index-time analyzer we have both SynonymFilter and WordDelimiterFilter
configured:

  [analyzer XML omitted]

The Solr documentation states that the graph filters produce correct token
graphs but cannot consume an input token graph correctly, and that when you
use these graph filters during indexing you must follow them with a
FlattenGraphFilter. I am confused as to how to replace our current filters
with the new SynonymGraphFilter and WordDelimiterGraphFilter. A few questions:

1. Regarding the FlattenGraphFilter, should it be used only once at the end of
the index-time chain, or multiple times, once after each graph filter? Can we
have a configuration like this? (I've put a sketch of my current understanding
from the Ref Guide below, after question 3.)

   [proposed analyzer XML omitted]

2. Is it possible to have two graph filters, i.e. both SynonymGraphFilter and
WordDelimiterGraphFilter, in the same analysis chain? If not, what's the best
option to replace our current config?

3. With the StopFilterFactory in between SynonymGraphFilter and
WordDelimiterGraphFilter, I get a few index errors:

Exception writing document id XX to the index; possible analysis error

Caused by: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1

But if I move the StopFilter before the SynonymGraphFilter, the errors go away.

I guess the StopFilter messes up the SynonymGraphFilter output? I'm not sure
whether this is a Solr defect or whether there is a guideline that StopFilter
should not be put after graph filters.
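
To make question 1 concrete, the layout I currently understand from the Ref
Guide examples for a single graph filter is roughly the following (a sketch
only; the field name, tokenizer choice and file names are placeholders, not
our real schema):

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <!-- index time only: flatten the token graph so it can be written to the index -->
      <filter class="solr.FlattenGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <!-- no FlattenGraphFilter at query time -->
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

What I can't tell from an example like this is how WordDelimiterGraphFilter and
the StopFilter placement fit into the same chain, which is what questions 2 and
3 are about.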

Thanks in advance for your input.


Thanks,

Wei


Re: Solr relevancy score different on replicated nodes

2019-01-04 Thread Erick Erickson
Ashish:

Deleting and re-adding a replica is not a solution. Even if you did,
the two replicas would be identical only until you started indexing again,
and then the stats could skew a bit.

When you index to NRT replicas, the wall clock times that cause the
commits to trigger will be different due to network delays. What
happens essentially is that the doc gets indexed on the leader at time
X but hits the replica Y milliseconds later. So on the leader, the
autocommit interval expires at time X+Z (Z being your autocommit
interval), but at X+Y+Z on the follower. Meanwhile, some additional docs may
have already been indexed on the leader but not yet on the follower
when the autocommit triggers, so the newly-closed segment on the
leader can have docs that the newly-closed segment on the follower
does not have.

The point is that the termfreq does _not_ change when a document is
deleted in some segment (and remember that an update is really a
delete followed by an add). The data associated with deleted docs is
not purged until segments are merged. Further, the decision about
which segments to merge is influenced by how many documents are
deleted in each.

All of which means that the tf/idf statistics are slightly different,
and you either have to use distributed IDF or just live with it.

You're saying that the document count of live documents is different,
and that's more concerning. Is this true for brief intervals or is it
true when there is _no_ indexing going on _and_ your autocommit
interval is allowed to expire? If it's the latter, that's a different problem.
However, if the condition is transitory and goes away if you stop
indexing, then it's the same issue I outlined above; autocommit is
happening at different wall-clock times.

Best,
Erick

On Fri, Jan 4, 2019 at 11:12 AM Ashish Bisht  wrote:
>
> Hi Erick,
>
> I have updated that I am not facing this problem in a new collection.
>
> As per 3) I can try deleting a replica and adding it again, but the
> confusion is which one out of two should I delete.(wondering which replica
> is giving correct score for query)
>
> Both replicas give same number of docs while doing all query.Its strange
> that in query explain docCount and docFreq is differing.
>
> Regards
> Ashish
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Warnings in Zookeeper Server Logs

2019-01-04 Thread Joe Lerner
Hi (yes again):

We have a simple architecture: 2 SOLR Cloud servers (on servers #1 and #2),
and 3 zookeeper instances (on servers #1, #2, and #3). Things appear to work
fine, and I have confirmed that our basic configuration is correct. But we
are seeing TONS of the following warnings in all of our zookeeper server
logs:

2019-01-04 14:48:04,266 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] -
Accepted socket connection from /XXX.YY.ZZZ.46:51516
2019-01-04 14:48:04,266 [myid:1] - WARN 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@368] - caught end
of stream exception
EndOfStreamException: Unable to read additional data from client sessionid
0x0, likely client has closed socket
at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
2019-01-04 14:48:04,266 [myid:1] - INFO 
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1044] - Closed
socket connection for client /XXX.YY.ZZZ.46:51516 (no session established
for client)


These messages seem to correspond to similar messages we are seeing in the
application client-side logs. (I don't see any messages that would indicate
"too many connections.")

Reading the log content, it seems to be saying that a connection is
accepted, but then there is an "end of stream" exception. But our users are
not experiencing any problems--they are searching SOLR like crazy.

Any suggestions?

Thanks!

Joe





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Time consuming for insert record

2019-01-04 Thread Shawn Heisey

On 12/25/2018 11:23 PM, jay harkhani wrote:

We are using the add method of CloudSolrClient to insert data into a Solr Cloud
index. In a specific scenario we need to insert a record of around 3 MB
into Solr, which takes 5-6 seconds.


Is this a single document that's 3 MB in size, or many documents 
totaling 3 MB?


If it's a single document, there's probably little you can do to make it 
faster other than shrinking the document.


If it's many documents, then the way to increase speed would be to use 
multiple threads or multiple processes to index documents in parallel.


If the "3 MB" you have mentioned means 3 million documents, then 5-6 
seconds is *VERY* fast already, and it would be extremely difficult to 
improve on that.


Thanks,
Shawn



Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report

2019-01-04 Thread Shawn Heisey

On 1/3/2019 11:15 AM, Bob Hathaway wrote:
We want to use SOLR v7 but Sonatype scans past v6.5 show dozens of 
critical and severe security issues and dozens of licensing issues.


None of the images that you attached to your message are visible to us.  
Attachments are regularly stripped by Apache mailing lists and cannot be 
relied on.


Some of the security issues you've mentioned could be problems.  But if 
you follow recommendations and make sure that Solr is not directly 
accessible to unauthorized parties, it will not be possible for those 
parties to exploit security issues without first finding and exploiting 
a vulnerability on an authorized system.


Vulnerabilities in SolrJ, if any exist, are slightly different, but 
unless unauthorized parties have the ability to *directly* send input to 
SolrJ code without intermediate code sanitizing the input, they will not 
be able to exploit those vulnerabilities. JSON support in SolrJ is 
provided by noggit, not jackson, and JSON/XML are not used by recent 
versions of SolrJ unless they are very specifically requested by the 
programmer.  Are there any vulnerabilities you've found that affect 
SolrJ itself, separately from the rest of Solr?


As we become aware of issues with either project code or third-party 
software, we get them fixed.  Sometimes it is not completely 
straightforward to upgrade to newer versions of third-party software, 
but staying current is a priority.


Licensing issues are of major concern to the entire Apache Foundation.  
As a project, we are unaware of any licensing problems at this time.  
All of the third-party software that is included with Solr should be 
available under a license that is compatible with the Apache license.  I 
didn't examine the list you sent super closely, but what I did look at 
didn't look like a problem.


https://www.apache.org/legal/resolved.html#category-b

The mere presence of GPL in the available licenses for third party 
software is not an indication of a problem.  If that were the ONLY 
license available, then it would be a problem.


Thanks,
Shawn



Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report

2019-01-04 Thread Jörn Franke
Jackson-databind is actually not such an old version. The problem with Jackson
databind is that for deserialization it has just a blacklist of objects not to
deserialize, and it is impossible to keep that blacklist up to date. For
version 3.0 they are apparently changing to a whitelist approach, which should
resolve those findings. Until then, all future versions of databind based on a
blacklist approach are vulnerable. BTW, this applies to all applications using
that library. Spring Security has put additional items on top of that blacklist,
so even if NexusIQ shows a security issue with databind, you may have introduced
additional means (e.g. you or another library have worked on the blacklist) to be
less vulnerable; NexusIQ can't know that. BTW, this is also what they explain when
you open the detail of the security assessment.

Then, it depends on how you deploy software such as Solr in your enterprise
environment and the risks related to that. E.g. one could have introduced means
as above. Most users don't have direct access to Solr itself but go through a
custom application, so there is no "direct" attack possible.

Finally, the absence of findings in the report does not mean an application is
secure.

> Am 04.01.2019 um 19:27 schrieb Gus Heck :
> 
> Hi Bob,
> 
> Wrt licensing keep in mind that multi licensed software allows you to
> choose which license you are using the software under. Also there's some
> good detail on the Apache policy here:
> 
> https://www.apache.org/legal/resolved.html#what-can-we-not-include-in-an-asf-project-category-x
> 
> One has to be careful with license scanners, often they have very
> conservative settings. I had to spend untold hours getting jfrog's license
> plugin to select the correct license and hunting down missing licenses when
> I finally sorted out licensing for JesterJ. (though MANY fewer hours than
> if I had done this by hand!)
> 
>> On Fri, Jan 4, 2019, 11:17 AM Bob Hathaway wrote:
>> The most important feature of any software running today is that it can be
>> run at all. Security vulnerabilities can preclude software from running in
>> enterprise environments. Today software must be free of critical and severe
>> security vulnerabilities or they can't be run at all from Information
>> Security policies. Enterprises today run security scan software to check
>> for security and licensing vulnerabilities because today most organizations
>> are using open source software where this has become most relevant.
>> Forrester has a good summary on the need for software composition analysis
>> tools which virtually all enterprises run today befor allowing software to
>> run in production environments:
>> 
>> https://www.blackducksoftware.com/sites/default/files/images/Downloads/Reports/USA/ForresterWave-Rpt.pdf
>> 
>> Solr version 6.5 passes security scans showing no critical security
>> issues.  Solr version 7 fails security scans with over a dozen critical and
>> severe security vulnerabilities for Solr version from 7.1.  Then we ran
>> scans against the latest Solr version 7.6 which failed as well.  Most of
>> the issues are due to using old libraries including the JSON Jackson
>> framework, Dom 4j and Xerces and should be easy to bring up to date. Only
>> the latest version of SimpleXML has severe security vulnerabilities. Derby
>> leads the most severe security violations at Level 9.1 by using an out of
>> date version.
>> 
>> What good is software or any features if enterprises can't run them?
>> Today software cybersecurity is a top priority and risk for enterprises.
>> Solr version 6.5 is very old exposing the zookeeper backend from the SolrJ
>> client which is a differentiating capability.
>> 
>> Is security and remediation a priority for SolrJ?  I believe this should be
>> a top feature to allow SolrJ to continue providing search features to
>> enterprises and a security roadmap and plan to keep Solr secure and usable
>> by continually adapting and improving in the ever changing security
>> landscape and ecosystem.  The Darby vulnerability issue CVE-2015-1832 was a
>> passing medium Level 6.2  issue in CVSS 2.0 last year but is the most
>> critical issue with Solr 7.6 at Level 9.1 in this year's CVSS 3.0.  These
>> changes need to be tracked and updates and fixes incorporated into new Solr
>> versions.
>> https://nvd.nist.gov/vuln/detail/CVE-2015-1832
>> 
>>> On Thu, Jan 3, 2019 at 12:19 PM Bob Hathaway  wrote:
>>> 
>>> Critical and Severe security vulnerabilities against Solr v7.1.  Many of
>>> these appear to be from old open source  framework versions.
>>> 
>>> *9* CVE-2017-7525 com.fasterxml.jackson.core : jackson-databind : 2.5.4
>>> Open
>>> 
>>>   CVE-2016-131 commons-fileupload : commons-fileupload : 1.3.2 Open
>>> 
>>>   CVE-2015-1832 org.apache.derby : derby : 10.9.1.0 Open
>>> 
>>>   CVE-2017-7525 org.codehaus.jackson : jackson-mapper-asl : 1.9.13 Open
>>> 
>>>   CVE-2017-7657 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
>>> 
>>>   

Re: Solr relevancy score different on replicated nodes

2019-01-04 Thread Ashish Bisht
Hi Erick, 

As I mentioned in my update, I am not facing this problem in a new collection.

As per 3), I can try deleting a replica and adding it again, but the confusion
is which of the two I should delete (wondering which replica is giving the
correct score for the query).

Both replicas give the same number of docs for a match-all query. It's strange
that docCount and docFreq differ in the query explain.

Regards
Ashish



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report

2019-01-04 Thread Gus Heck
Hi Bob,

Wrt licensing, keep in mind that multi-licensed software allows you to
choose which license you use the software under. Also there's some
good detail on the Apache policy here:

https://www.apache.org/legal/resolved.html#what-can-we-not-include-in-an-asf-project-category-x

One has to be careful with license scanners; often they have very
conservative settings. I had to spend untold hours getting jfrog's license
plugin to select the correct license and hunting down missing licenses when
I finally sorted out licensing for JesterJ. (though MANY fewer hours than
if I had done this by hand!)

On Fri, Jan 4, 2019, 11:17 AM Bob Hathaway wrote:

> The most important feature of any software running today is that it can be
> run at all. Security vulnerabilities can preclude software from running in
> enterprise environments. Today software must be free of critical and severe
> security vulnerabilities or they can't be run at all from Information
> Security policies. Enterprises today run security scan software to check
> for security and licensing vulnerabilities because today most organizations
> are using open source software where this has become most relevant.
> Forrester has a good summary on the need for software composition analysis
> tools which virtually all enterprises run today befor allowing software to
> run in production environments:
>
> https://www.blackducksoftware.com/sites/default/files/images/Downloads/Reports/USA/ForresterWave-Rpt.pdf
>
> Solr version 6.5 passes security scans showing no critical security
> issues.  Solr version 7 fails security scans with over a dozen critical and
> severe security vulnerabilities for Solr version from 7.1.  Then we ran
> scans against the latest Solr version 7.6 which failed as well.  Most of
> the issues are due to using old libraries including the JSON Jackson
> framework, Dom 4j and Xerces and should be easy to bring up to date. Only
> the latest version of SimpleXML has severe security vulnerabilities. Derby
> leads the most severe security violations at Level 9.1 by using an out of
> date version.
>
> What good is software or any features if enterprises can't run them?
> Today software cybersecurity is a top priority and risk for enterprises.
> Solr version 6.5 is very old exposing the zookeeper backend from the SolrJ
> client which is a differentiating capability.
>
> Is security and remediation a priority for SolrJ?  I believe this should be
> a top feature to allow SolrJ to continue providing search features to
> enterprises and a security roadmap and plan to keep Solr secure and usable
> by continually adapting and improving in the ever changing security
> landscape and ecosystem.  The Darby vulnerability issue CVE-2015-1832 was a
> passing medium Level 6.2  issue in CVSS 2.0 last year but is the most
> critical issue with Solr 7.6 at Level 9.1 in this year's CVSS 3.0.  These
> changes need to be tracked and updates and fixes incorporated into new Solr
> versions.
> https://nvd.nist.gov/vuln/detail/CVE-2015-1832
>
> On Thu, Jan 3, 2019 at 12:19 PM Bob Hathaway  wrote:
>
> > Critical and Severe security vulnerabilities against Solr v7.1.  Many of
> > these appear to be from old open source  framework versions.
> >
> > *9* CVE-2017-7525 com.fasterxml.jackson.core : jackson-databind : 2.5.4
> > Open
> >
> >CVE-2016-131 commons-fileupload : commons-fileupload : 1.3.2 Open
> >
> >CVE-2015-1832 org.apache.derby : derby : 10.9.1.0 Open
> >
> >CVE-2017-7525 org.codehaus.jackson : jackson-mapper-asl : 1.9.13 Open
> >
> >CVE-2017-7657 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> >
> >CVE-2017-7658 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> >
> >CVE-2017-1000190 org.simpleframework : simple-xml : 2.7.1 Open
> >
> > *7* sonatype-2016-0397 com.fasterxml.jackson.core : jackson-core : 2.5.4
> > Open
> >
> >sonatype-2017-0355 com.fasterxml.jackson.core : jackson-core : 2.5.4
> > Open
> >
> >CVE-2014-0114 commons-beanutils : commons-beanutils : 1.8.3 Open
> >
> >CVE-2018-1000632 dom4j : dom4j : 1.6.1 Open
> >
> >CVE-2018-8009 org.apache.hadoop : hadoop-common : 2.7.4 Open
> >
> >CVE-2017-12626 org.apache.poi : poi : 3.17-beta1 Open
> >
> >CVE-2017-12626 org.apache.poi : poi-scratchpad : 3.17-beta1 Open
> >
> >CVE-2018-1308 org.apache.solr : solr-dataimporthandler : 7.1.0 Open
> >
> >CVE-2016-4434 org.apache.tika : tika-core : 1.16 Open
> >
> >CVE-2018-11761 org.apache.tika : tika-core : 1.16 Open
> >
> >CVE-2016-1000338 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> >
> >CVE-2016-1000343 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> >
> >CVE-2018-1000180 org.bouncycastle : bcprov-jdk15 : 1.45 Open
> >
> >CVE-2017-7656 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
> >
> >CVE-2012-0881 xerces : xercesImpl : 2.9.1 Open
> >
> >CVE-2013-4002 xerces : xercesImpl : 2.9.1 Open
> >
> > On Thu, Jan 3, 2019 at 12:15 PM Bob Hathaway 
> wrote:
> >
> >> 

RE: [solr-solrcloud] How does DIH work when there are multiple nodes?

2019-01-04 Thread Davis, Daniel (NIH/NLM) [C]
DIH is also not designed to multi-thread very well. One way I've handled this
is to have a DIH XML config that breaks up a database query into multiple import
processes by taking the modulo of a row id. The shape of it is sketched below:
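
(A simplified sketch of the idea; the request parameter names, table and SQL
here are illustrative only, not my exact config.)

  <dataConfig>
    <dataSource driver="org.postgresql.Driver"
                url="jdbc:postgresql://dbhost:5432/mydb"
                user="solr" password="secret"/>
    <document>
      <!-- each import run is started with its own slice, e.g.
           /dataimport?command=full-import&totalSlices=4&sliceNo=0,
           so several handlers/cores can work through the table in parallel -->
      <entity name="item"
              query="SELECT id, name, description FROM item
                     WHERE MOD(id, ${dataimporter.request.totalSlices}) = ${dataimporter.request.sliceNo}">
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="description" name="description"/>
      </entity>
    </document>
  </dataConfig>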



This allows me to do sub-queries within the entity, but it is often better to 
just write a small program to get this data from the database, and ETL 
processors such as Pentaho DI (Kettle) and Talend DI do this quite well.

If you can express what you want in a database view, even a complicated one, 
then your best way to get it into Solr IMO is to use logstash with the jdbc 
input plugin.   It can do some transformation, but you'll need your database 
view to process the data.

> -Original Message-
> From: Shawn Heisey 
> Sent: Friday, January 4, 2019 12:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: [solr-solrcloud] How does DIH work when there are multiple
> nodes?
> 
> On 1/4/2019 1:04 AM, 유정인 wrote:
> > The reader was looking for a way to do 'DIH' automatically.
> >
> > The reason was for HA configuration.
> 
> If you send a DIH request to the collection (as opposed to a specific
> core), that request will be load balanced across the cloud.  You won't
> know which replica/core actually handles it. This means that an import
> command may be handled by a different host than a status command.  In
> that situation, the status command will not know about the import,
> because it will be running on a different Solr core.
> 
> When doing DIH on SolrCloud, you should send your requests directly to a
> specific core on a specific node.  It's the only way to be sure what's
> happening.  High availability would have to be handled in your application.
> 
> Thanks,
> Shawn



Re: [solr-solrcloud] How does DIH work when there are multiple nodes?

2019-01-04 Thread Shawn Heisey

On 1/4/2019 1:04 AM, 유정인 wrote:

The reader was looking for a way to do 'DIH' automatically.

The reason was for HA configuration.


If you send a DIH request to the collection (as opposed to a specific 
core), that request will be load balanced across the cloud.  You won't 
know which replica/core actually handles it. This means that an import 
command may be handled by a different host than a status command.  In 
that situation, the status command will not know about the import, 
because it will be running on a different Solr core.


When doing DIH on SolrCloud, you should send your requests directly to a 
specific core on a specific node.  It's the only way to be sure what's 
happening.  High availability would have to be handled in your application.
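
For example (the core name here is illustrative; check the Core Admin screen for
your actual core names):

http://node1:8983/solr/mycollection_shard1_replica_n1/dataimport?command=full-import
http://node1:8983/solr/mycollection_shard1_replica_n1/dataimport?command=status

Because both requests go to the same core, the status reflects the import that
was actually started there.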


Thanks,
Shawn



Re: Regarding Shards - Composite / Implicit , Replica Type - NRT / TLOG

2019-01-04 Thread Shawn Heisey

On 1/3/2019 11:26 PM, Doss wrote:

We are planning to setup a SOLR cloud with 6 nodes for 3 million records
(expected to grow to 5 million in a year), with 150 fields and over all
index would come around 120GB.

We plan to use NRT with 5 sec soft commit and 1 min hard commit.


Five seconds is likely far too short an interval.  That's something 
you'll have to experiment with.



Expected query volume would be 5000 select hits per second and 7000 inserts
/ updates per second.


5000 queries per second is an extremely high query rate.  I would guess 
that six nodes is far too few to handle that much of a query load.  It 
might also be plenty ... it's nearly impossible to gauge that with the 
information you've shared so far.  Usually the only way to find out for 
sure is to actually BUILD the system and try it.


7000 documents inserted per second is also ambitious.  It's achievable, 
but is almost certainly going to require parallel threads/processes 
indexing at the same time.  That's going to reduce the query volume you 
can handle.


If you expect 3 million documents to reach 120GB of index size, then 
each of those documents must be fairly large.  Large documents will 
index more slowly, and can also reduce query capacity.


Memory will be your biggest challenge.  If a Solr instance must handle 
120GB of index and achieve a high query volume, then you'll want that 
Solr instance to have about 128GB of memory, so the entire index will 
fit into the operating system disk cache.



Our records can be classified under 15 categories, but they will not have
even number of records, few categories will have more number of records.

Queries will also come in the same pattern, that is., categories with high
number of records will get high volume of select / updates.

For this situation we are confused in choosing what type of sharding would
help us in better performance in both select and updates?

Composite / implicit - Composite with 15 shards or implicit based on 15
categories.


15 shards is probably far too many for only a few million documents, 
especially with the extremely high query volume and low host count you 
have projected.  With a high query volume, you want the absolute minimum 
number of shards possible ... one if you can.  Handling several million 
documents in a single shard is usually doable.



Our select queries will have minimum 15 filters in fq, with extensive
function queries used in sort.


When a query has multiple filters, they will generally all be run in 
parallel, not sequentially.  This can affect the query volume you can 
handle, it's very difficult to know whether the effect will be helpful 
or harmful.



For our kind of situation which replica Type can we choose? All NRT or NRT
with TLOG ?


If you will only have two replicas, they should both be either NRT or 
TLOG.  With more than two replicas, my suggestion would be to make two 
of them TLOG and the rest PULL. One of the TLOG replicas will be elected 
leader, and all other replicas will copy the index from the leader, 
rather than do the independent indexing that NRT replicas do.


Thanks,
Shawn



Re: Regarding Shards - Composite / Implicit , Replica Type - NRT / TLOG

2019-01-04 Thread Erick Erickson
It's usually best to use compositeId routing. That distributes
the load evenly. Otherwise, _you_ have to be responsible
for making sure that the docs are reasonably evenly distributed,
which can be a pain.

Implicit routing is usually best in situations where you index
to a particular shard for a while then move on to another
shard, think news stories where you want to keep them for
30 days then dispose of them. Implicit lets you add/remove
shards on a daily basis. Doesn't sound particularly suitable for
your situation.

But I do have to ask why you're sharding at all? 5M docs is a fairly
small index by modern standards. There's some inevitable overhead
with sharding that you could avoid. Mostly I'm asking if you've
stress-tested with that query and update rate. The 7,000 updates/second
do worry me a bit with a single-shard solution, but if you get adequate
response times under that load, then there's no need to shard. Use all
the hardware to support querying.

Sharding will improve indexing throughput without doubt, Solr scales
roughly linearly with the number of shards. Do use CloudSolrClient
for your updates as it routes docs to the correct leader, avoiding
one extra hop.

Given your soft commit setting of 5 seconds, I infer that the allowable
time for updates to become searchable is quite small, indicating that NRT
replicas are the way to go. I'll also say that this commit rate is pretty
aggressive given your volume; is it really necessary to be that short?
Your caches are going to be pretty useless since they won't stick around
for very long. Look carefully at the autowarming time: in order to
make any good use of your filterCache, you'll have to autowarm it some,
and if you do, you need to ensure that the autowarm time is less than
your commit interval.
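
For reference, those knobs live in solrconfig.xml and, with the numbers you
mention, would look roughly like this (a sketch; the filterCache line is only
there to show where autowarmCount comes in, not a recommendation of values):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>60000</maxTime>           <!-- 1 min hard commit -->
      <openSearcher>false</openSearcher> <!-- don't open a new searcher on hard commit -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>5000</maxTime>            <!-- 5 sec soft commit: opens a new searcher, dropping caches -->
    </autoSoftCommit>
  </updateHandler>

  <query>
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="64"/>
  </query>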

Best,
Erick

On Thu, Jan 3, 2019 at 10:34 PM Doss  wrote:
>
> Hi,
>
> We are planning to setup a SOLR cloud with 6 nodes for 3 million records
> (expected to grow to 5 million in a year), with 150 fields and over all
> index would come around 120GB.
>
> We plan to use NRT with 5 sec soft commit and 1 min hard commit.
>
> Expected query volume would be 5000 select hits per second and 7000 inserts
> / updates per second.
>
> Our records can be classified under 15 categories, but they will not have
> even number of records, few categories will have more number of records.
>
> Queries will also come in the same pattern, that is., categories with high
> number of records will get high volume of select / updates.
>
> For this situation we are confused in choosing what type of sharding would
> help us in better performance in both select and updates?
>
> Composite / implicit - Composite with 15 shards or implicit based on 15
> categories.
>
> Our select queries will have minimum 15 filters in fq, with extensive
> function queries used in sort.
>
> Updates will have 6 integer fields, 5 string fields and 4 string/integer
> fields with multi valued.
>
> If we choose implicit to boost select performance, our updates will be
> heavy on few shards (major category shards), will this be a problem?
>
> For our kind of situation which replica Type can we choose? All NRT or NRT
> with TLOG ?
>
> Thanks in advance!
>
> Best,
> Doss.


Re: How to debug empty ParsedQuery from Edismax Query Parser

2019-01-04 Thread Kay Wrobel
I'd like to follow up on this post here because it has become relevant to me 
now.

I have set up a debugging environment and took a deep-dive into the SOLR 7.6.0 
source code with Eclipse as my IDE of choice for this task. I have isolated the 
exact line as to where things fall apart for my two sample queries that I have 
been testing with, which are "q=a3f*" and "q=aa3f*". As you can see here, the
only visible difference between the two search terms is that the second search
term has two alphabetic characters in succession before switching to a numerical portion.

First things first, the Extended Dismax Query Parser hands over portions of the 
parsing to the Standard Query Parser early on in the parsing process.
Following down the rabbit hole, I ended up in the
SolrQueryParserBase.getPrefixQuery() method. On line 1173 of this method, we
have the following statement:

termStr = analyzeIfMultitermTermText(field, termStr, 
schema.getFieldType(field));

This statement, when executing with the "a3f" search term, returns "a3f" as a 
result. However, when using "aa3f", it throws a SolrException with exactly the
same multi-term error as shown below, only like this:
> analyzer returned too many terms for multiTerm term: aa3f

At this point, I would like to reiterate the purpose of our search: we are a 
part number house. We deal with millions of part numbers in our system and on 
our web site. A customer of ours typically searches our site with a given part 
number (or SKU if you will). Some part numbers are intelligent, and so 
customers might reduce the part number string to a portion at the beginning. 
Either way, it is *not* a typical "word" based search. Yet, the system (Drupal) 
does treat those two query fields like standard "Text" search fields. Those who 
know Drupal Commerce will recognize the Title field of a node and also possible 
the Product Variation or (SKU) field.

With that in mind, multi-term was introduced with SOLR 5, and I think this 
error (or limitation) has probably been in SOLR since then. Can anyone closer
to the matter or having struggled with this same issue chime in on the subject?
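
One thing I am considering trying, based on the reference guide's description of
multi-term analysis, is declaring an explicit multiterm analyzer on the field
type, so that wildcard/prefix terms are analyzed with a chain that can only ever
emit a single token. Roughly like this (a sketch only; the index/query chains
here are stand-ins, not our actual Drupal-generated schema):

  <fieldType name="text_sku" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <!-- used for wildcard/prefix terms such as aa3f*; KeywordTokenizer keeps the
         term whole, so analyzeMultiTerm sees exactly one token -->
    <analyzer type="multiterm">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

I would still be interested to hear whether that is the intended way to handle
part-number style prefix searches, or whether there is a better approach.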

Kind regards,

Kay

> On Dec 28, 2018, at 9:57 AM, Kay Wrobel  wrote:
> 
> Here are my log entries:
> 
> SOLR 7.x (non-working)
> 2018-12-28 15:36:32.786 INFO  (qtp1769193365-20) [   x:collection1] 
> o.a.s.c.S.Request [collection1]  webapp=/solr path=/select 
> params={q=ac6023*=tm_field_product^21.0=tm_title_field^8.0=all=10=xml=true}
>  hits=0 status=0 QTime=2
> 
> SOLR 4.x (working)
> INFO  - 2018-12-28 15:43:41.938; org.apache.solr.core.SolrCore; [collection1] 
> webapp=/solr path=/select 
> params={q=ac6023*=tm_field_product^21.0=tm_title_field^8.0=all=10=xml=true}
>  hits=32 status=0 QTime=8 
> 
> EchoParams=all did not show anything different in the resulting XML from SOLR 
> 7.x.
> 
> 
> I found out something curious yesterday. When I try to force the Standard 
> query parser on SOLR 7.x using the same query, but adding "defType=lucene" at 
> the beginning, SOLR 7 raises a SolrException with this message: "analyzer 
> returned too many terms for multiTerm term: ac6023" (full response: 
> https://pastebin.com/ijdBj4GF)
> 
> Log entry for that request:
> 2018-12-28 15:50:58.804 ERROR (qtp1769193365-15) [   x:collection1] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: analyzer 
> returned too many terms for multiTerm term: ac6023
>at 
> org.apache.solr.schema.TextField.analyzeMultiTerm(TextField.java:180)
>at 
> org.apache.solr.parser.SolrQueryParserBase.analyzeIfMultitermTermText(SolrQueryParserBase.java:992)
>at 
> org.apache.solr.parser.SolrQueryParserBase.getPrefixQuery(SolrQueryParserBase.java:1173)
>at 
> org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:781)
>at org.apache.solr.parser.QueryParser.Term(QueryParser.java:421)
>at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:278)
>at org.apache.solr.parser.QueryParser.Query(QueryParser.java:162)
>at 
> org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:131)
>at 
> org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:254)
>at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:49)
>at org.apache.solr.search.QParser.getQuery(QParser.java:173)
>at 
> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:160)
>at 
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:279)
>at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)
>at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
>at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
>at 
> 

Re: Solr relevancy score different on replicated nodes

2019-01-04 Thread Mikhail Khludnev
Replicated segments might have different deleted documents by design.
Precise numbers can be achieved via exact stats; see
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_
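
The corresponding setting is a one-liner at the top level of solrconfig.xml,
roughly like this (note that exact stats add an extra distributed phase, so
they are not free):

  <!-- in solrconfig.xml, directly under <config> -->
  <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>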


On Fri, Jan 4, 2019 at 2:40 PM AshB  wrote:

> Version Solr 7.4.0 zookeeper 3.4.11 Achitecture Two boxes
> Machine-1,Machine-2
> holding single instances of solr
>
> We are having a collection which was single shard and single replica i.e
> s=1
> and rf=1
>
> Few days back we tried to add replica to it.But the score for same query is
> coming different from different replicas.
>
>
> http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json
>
> "response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[
>
> whereas on another machine(replica)
>
>
> http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json
>
> "response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[
>
> The maxScore is different.
>
> Relevancy gets affected due to sharding but replication was not expected as
> same documents get copied to other node. score explaination gives issue
> with
> docCount and docFreq uneven.
>
> idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
> from:
> 1.050635000 docCount :*10020.0* docFreq :*3504.000*
>
> idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
> from:
> 1.068795100
>
> docCount :*10291.0* docFreq :*3534.000*
>
> Is this expected?What could be wrong here?Please suggest
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Sincerely yours
Mikhail Khludnev


Re: So Many Zookeeper Warnings--There Must Be a Problem

2019-01-04 Thread Erick Erickson
How brave are you? ;)

I'll defer to Scott on the internals of ZK and why it
might be necessary to delete the ZK data dirs, but
what happens if you just correct your configuration and
drive on?

If that doesn't work, here's something to try.
Shut down your Solr instances, then:

- bin/solr zk cp -r zk:/ some_local_dir

- fix your ZK, perhaps blowing the data directories away
and bring the ZK servers back up.

- bin/solr zk cp -r some_local_dir zk:/

Start your Solr instances.

NOTE: if you've configured your solr info with a "chroot", the ZK path
will be slightly different.

NOTE: I'm going from memory on the exact form of those commands.
bin/solr -help
should show you the info

WARNING: This worked at some point in the past, but is _not_
"officially" supported. It was just a happy consequence of code added to
copy data to and from ZK in place of the zkCli functionality, creating
one less thing for Solr users to have to keep track of.

What that does is copy the cluster state relevant to Solr out of ZK and then back into ZK.

DO NOT change your Solr data in any way when doing this. What this is
trying to do is copy all the topology information in ZK. Assuming the Solr
nodes haven't changed, have the same IP address etc. it _might_ work for you.

Best,
Erick

On Fri, Jan 4, 2019 at 4:25 AM Joe Lerner  wrote:
>
> wrt, "You'll probably have to delete the contents of the zk data directory
> and rebuild your collections."
>
> Rebuild my *SOLR* collections? That's easy enough for us.
>
> If this is how we're incorrectly configured now:
>
> server #1 = myid#1
> server #2 = myid#2
> server #3 = myid#2
>
> My plan would be to do the following, while users are still online (it's a
> big [bad] deal if we need to take search offline):
>
> 1. Take zk #3 down.
> 2. Fix zk #3 by deleting the contents of the zk data directory and assign it
> myid#3
> 3. Bring zk#3 back up
> 4. Do a full re-build of all collections
>
> Thanks!
>
> Joe
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: So Many Zookeeper Warnings--There Must Be a Problem

2019-01-04 Thread Shawn Heisey

On 1/4/2019 5:24 AM, Joe Lerner wrote:

server #1 = myid#1
server #2 = myid#2
server #3 = myid#2

My plan would be to do the following, while users are still online (it's a
big [bad] deal if we need to take search offline):

1. Take zk #3 down.
2. Fix zk #3 by deleting the contents of the zk data directory and assign it
myid#3
3. Bring zk#3 back up
4. Do a full re-build of all collections


There should be no need to rebuild anything in Solr once zookeeper is 
repaired in this fashion.  The third zookeeper will replicate data from 
whichever of the other two has won the leader election.  A three-node 
zookeeper ensemble is 100% functional with two nodes running.


You would only need to rebuild the Solr side if all data on the 
zookeeper side were lost.  I would not expect this action to lose any 
data in zookeeper.


The info you tried to share about your log messages in the original post 
for this thread did not come through.  I do not see it either on the 
mailing list or in the Nabble mirror.  It does look like you started 
another thread which does have the info.  I will address those messages 
in that thread.


Thanks,
Shawn



Re: SOLR v7 Security Issues Caused Denial of Use - Sonatype Application Composition Report

2019-01-04 Thread Bob Hathaway
The most important feature of any software running today is that it can be
run at all. Security vulnerabilities can preclude software from running in
enterprise environments. Today software must be free of critical and severe
security vulnerabilities or they can't be run at all from Information
Security policies. Enterprises today run security scan software to check
for security and licensing vulnerabilities because today most organizations
are using open source software where this has become most relevant.
Forrester has a good summary on the need for software composition analysis
tools which virtually all enterprises run today before allowing software to
run in production environments:
https://www.blackducksoftware.com/sites/default/files/images/Downloads/Reports/USA/ForresterWave-Rpt.pdf

Solr version 6.5 passes security scans showing no critical security
issues.  Solr version 7 fails security scans with over a dozen critical and
severe security vulnerabilities for Solr versions from 7.1 onward.  Then we ran
scans against the latest Solr version 7.6 which failed as well.  Most of
the issues are due to using old libraries including the JSON Jackson
framework, Dom 4j and Xerces and should be easy to bring up to date. Only
the latest version of SimpleXML has severe security vulnerabilities. Derby
leads the most severe security violations at Level 9.1 by using an out of
date version.

What good is software or any features if enterprises can't run them?
Today software cybersecurity is a top priority and risk for enterprises.
Solr version 6.5 is getting very old. Exposing the zookeeper backend from the SolrJ
client is a differentiating capability.

Is security and remediation a priority for SolrJ?  I believe this should be
a top feature to allow SolrJ to continue providing search features to
enterprises and a security roadmap and plan to keep Solr secure and usable
by continually adapting and improving in the ever changing security
landscape and ecosystem.  The Derby vulnerability issue CVE-2015-1832 was a
passing medium Level 6.2 issue in CVSS 2.0 last year but is the most
critical issue with Solr 7.6 at Level 9.1 in this year's CVSS 3.0.  These
changes need to be tracked and updates and fixes incorporated into new Solr
versions.
https://nvd.nist.gov/vuln/detail/CVE-2015-1832

On Thu, Jan 3, 2019 at 12:19 PM Bob Hathaway  wrote:

> Critical and Severe security vulnerabilities against Solr v7.1.  Many of
> these appear to be from old open source  framework versions.
>
> *9* CVE-2017-7525 com.fasterxml.jackson.core : jackson-databind : 2.5.4
> Open
>
>CVE-2016-131 commons-fileupload : commons-fileupload : 1.3.2 Open
>
>CVE-2015-1832 org.apache.derby : derby : 10.9.1.0 Open
>
>CVE-2017-7525 org.codehaus.jackson : jackson-mapper-asl : 1.9.13 Open
>
>CVE-2017-7657 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
>
>CVE-2017-7658 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
>
>CVE-2017-1000190 org.simpleframework : simple-xml : 2.7.1 Open
>
> *7* sonatype-2016-0397 com.fasterxml.jackson.core : jackson-core : 2.5.4
> Open
>
>sonatype-2017-0355 com.fasterxml.jackson.core : jackson-core : 2.5.4
> Open
>
>CVE-2014-0114 commons-beanutils : commons-beanutils : 1.8.3 Open
>
>CVE-2018-1000632 dom4j : dom4j : 1.6.1 Open
>
>CVE-2018-8009 org.apache.hadoop : hadoop-common : 2.7.4 Open
>
>CVE-2017-12626 org.apache.poi : poi : 3.17-beta1 Open
>
>CVE-2017-12626 org.apache.poi : poi-scratchpad : 3.17-beta1 Open
>
>CVE-2018-1308 org.apache.solr : solr-dataimporthandler : 7.1.0 Open
>
>CVE-2016-4434 org.apache.tika : tika-core : 1.16 Open
>
>CVE-2018-11761 org.apache.tika : tika-core : 1.16 Open
>
>CVE-2016-1000338 org.bouncycastle : bcprov-jdk15 : 1.45 Open
>
>CVE-2016-1000343 org.bouncycastle : bcprov-jdk15 : 1.45 Open
>
>CVE-2018-1000180 org.bouncycastle : bcprov-jdk15 : 1.45 Open
>
>CVE-2017-7656 org.eclipse.jetty : jetty-http : 9.3.20.v20170531 Open
>
>CVE-2012-0881 xerces : xercesImpl : 2.9.1 Open
>
>CVE-2013-4002 xerces : xercesImpl : 2.9.1 Open
>
> On Thu, Jan 3, 2019 at 12:15 PM Bob Hathaway  wrote:
>
>> We want to use SOLR v7 but Sonatype scans past v6.5 show dozens of
>> critical and severe security issues and dozens of licensing issues. The
>> critical security violations using Sonatype are inline and are indexed with
>> codes from the National Vulnerability Database,
>>
>> Are there recommended steps for running Solr 7 in secure enterprises
>> specifically infosec remediation over Sonatype Application Composition
>> Reports?
>>
>> Are there plans to make Solr more secure in v7 or v8?
>>
>> I'm new to the Solr User forum and suggestions are welcome.
>>
>>
>> Sonatype Application Composition Reports
>> Of Solr - 7.6.0, Build Scanned On Thu Jan 03 2019 at 14:49:49
>> Using Scanner 1.56.0-01
>>
>> Security Issues
>> Threat Level Problem Code Component Status

Re: Solr relevancy score different on replicated nodes

2019-01-04 Thread Erick Erickson
See particularly point 3 here and to a lesser extent point 2.
https://support.lucidworks.com/s/question/0D5803LRpijCAD/the-number-of-results-returned-is-not-constant-every-time-i-query-solr

For point two (the internal Lucene doc IDs are different) you can
easily correct it by adding sort=score desc, solrId asc to the query.

That article was written before TLOG and PULL replicas came into the
picture. Since those replica types all have the
exact same index structure you shouldn't have this problem in that case.

Best,
Erick

On Fri, Jan 4, 2019 at 3:40 AM AshB  wrote:
>
> Version Solr 7.4.0 zookeeper 3.4.11 Achitecture Two boxes Machine-1,Machine-2
> holding single instances of solr
>
> We are having a collection which was single shard and single replica i.e s=1
> and rf=1
>
> Few days back we tried to add replica to it.But the score for same query is
> coming different from different replicas.
>
> http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json
>
> "response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[
>
> whereas on another machine(replica)
>
> http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json
>
> "response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[
>
> The maxScore is different.
>
> Relevancy gets affected due to sharding but replication was not expected as
> same documents get copied to other node. score explaination gives issue with
> docCount and docFreq uneven.
>
> idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
> 1.050635000 docCount :*10020.0* docFreq :*3504.000*
>
> idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
> 1.068795100
>
> docCount :*10291.0* docFreq :*3534.000*
>
> Is this expected?What could be wrong here?Please suggest
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Continuous Zookeeper Client Warnings

2019-01-04 Thread Joe Lerner
Hi, 

We have a simple architecture: 2 SOLR Cloud servers (on servers #1 and #2), 
and 3 zookeeper instances (on servers #1, #2, and #3). Things appear to work 
fine but: 

We are getting *TONS* of continuous log warnings from our client 
applications. From one server it shows this: 


[MYAPP-WEB] 2019-01-03 14:17:46,519 WARN
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:46,519 WARN
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:47,385 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:47,386 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:47,386 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:47,386 INFO
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:47,386 WARN [org.apache.zookeeper.ClientCnxn] -

java.lang.NoClassDefFoundError: org/apache/zookeeper/proto/WatcherEvent 
at
org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:770) 
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94) 
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
 
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1144) 
[MYAPP-WEB] 2019-01-03 14:17:47,487 WARN
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:47,487 WARN
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:47,943 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:47,943 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:47,943 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:47,944 INFO
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:47,944 WARN [org.apache.zookeeper.ClientCnxn] -

java.lang.NoClassDefFoundError: org/apache/zookeeper/proto/WatcherEvent 
at
org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:770) 
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94) 
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
 
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1144) 
[MYAPP-WEB] 2019-01-03 14:17:48,044 WARN
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:48,044 WARN
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:48,687 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:48,687 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:48,688 INFO [org.apache.zookeeper.ClientCnxn] -

[MYAPP-WEB] 2019-01-03 14:17:48,689 INFO
[org.apache.solr.common.cloud.ConnectionManager] - 
[MYAPP-WEB] 2019-01-03 14:17:48,689 WARN [org.apache.zookeeper.ClientCnxn] -

java.lang.NoClassDefFoundError: org/apache/zookeeper/proto/WatcherEvent 
at
org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:770) 
at
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94) 
at
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
 
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1144) 






And from another server we get this: 


[MYAPP-WEB] 2019-01-03 14:19:47,273 WARN [org.apache.zookeeper.ClientCnxn] -

java.lang.NoClassDefFoundError: org/apache/zookeeper/Login 
at
org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:216)
 
at
org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:119)
 
at
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1011) 
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1063) 
[MYAPP-WEB] 2019-01-03 14:19:47,753 WARN [org.apache.zookeeper.ClientCnxn] -

java.lang.NoClassDefFoundError: org/apache/zookeeper/Login 
at
org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:216)
 
at
org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:119)
 
at
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1011) 
at
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1063) 
[MYAPP-WEB] 2019-01-03 14:19:48,197 INFO
[gov.fbi.guardian.web.filter.SentinelRedirectFilter] - 
[MYAPP-WEB] 2019-01-03 14:19:48,450 WARN [org.apache.zookeeper.ClientCnxn] -

java.lang.NoClassDefFoundError: org/apache/zookeeper/Login 
at
org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:216)
 
at
org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:119)
 
at
org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:1011) 
at

Re: So Many Zookeeper Warnings--There Must Be a Problem

2019-01-04 Thread Joe Lerner
wrt, "You'll probably have to delete the contents of the zk data directory
and rebuild your collections."

Rebuild my *SOLR* collections? That's easy enough for us. 

If this is how we're incorrectly configured now:

server #1 = myid#1
server #2 = myid#2
server #3 = myid#2

My plan would be to do the following, while users are still online (it's a
big [bad] deal if we need to take search offline):

1. Take zk #3 down.
2. Fix zk #3 by deleting the contents of the zk data directory and assign it
myid#3
3. Bring zk#3 back up
4. Do a full re-build of all collections

Thanks!

Joe



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr relevancy score different on replicated nodes

2019-01-04 Thread AshB
Version: Solr 7.4.0, ZooKeeper 3.4.11. Architecture: two boxes, Machine-1 and
Machine-2, each holding a single instance of Solr.

We have a collection which was single shard and single replica, i.e. s=1
and rf=1.

A few days back we tried to add a replica to it, but the score for the same
query now comes back different from different replicas.

http://Machine-1:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json

"response":{"numFound":5836,"start":0,"maxScore":*4.418847*,"docs":[

whereas on another machine(replica)

http://Machine-2:8983/solr/MyTestCollection/select?q=%22data%22+OR+(data)=10=score=edismax=search_field+content=json

"response":{"numFound":5836,"start":0,"maxScore":*4.4952264*,"docs":[

The maxScore is different.

Relevancy is known to be affected by sharding, but we did not expect this with
replication, since the same documents get copied to the other node. The score
explanation points to uneven docCount and docFreq.

idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
1.050635000 docCount :*10020.0* docFreq :*3504.000*

idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
1.068795100

docCount :*10291.0* docFreq :*3534.000*

Is this expected? What could be wrong here? Please suggest.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Inconsistent debugQuery score with multiplicative boost

2019-01-04 Thread Thomas Aglassinger
Hi!

When debugging a query using multiplicative boost based on the product() 
function I noticed that the score computed in the explain section is correct 
while the score in the actual result is wrong.

As an example here’s a simple query that boosts a field name_text_de 
(containing German product names). The term “Netzteil” boost to 200% and “Sony” 
boosts to 300%. A name that contains both terms would be boosted to 600%. If a 
term does not match, a default pseudo boost of 1 is used (multiplicative 
identity). The params of the responseHeader in the query result are:

"q":"{!boost b=$ymb}(+{!lucene v=$yq})",
"ymb":"product(query({!v=\"name_text_de\\:Netzteil\\^=2.0\"},1),query({!v=\"name_text_de\\:Sony\\^=3.0\"},1))",
"yq":"*:*",

The parsed query of the ymb parameter translates to:

FunctionScoreQuery(FunctionScoreQuery(+*:*, scored by 
boost(product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0),query((ConstantScore(name_text_de:sony))^3.0,def=1.0)

For a product that contains both terms, the score in the result and explain 
section correctly yields 6.0:

"name_text_de":"Original Sony Vaio Netzteil",
"score":6.0,

6.0 = product of:
  1.0 = boost
  6.0 = product of:
1.0 = *:*
6.0 = 
product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0)=2.0,query((ConstantScore(name_text_de:sony))^3.0,def=1.0)=3.0)

However, for a product with only “Netzteil” in the name, the result score 
wrongly is 1.0 while the explain score correctly is 2.0:

"name_text_de":"GS-Netzteil 20W schwarz",
"score":1.0,

2.0 = product of:
  1.0 = boost
  2.0 = product of:
1.0 = *:*
2.0 = 
product(query((ConstantScore(name_text_de:netzteil))^2.0,def=1.0)=2.0,query((ConstantScore(name_text_de:sony))^3.0,def=1.0)=1.0)

(Note: the filter chain splits words on hyphen so the “GS-“ in front of the 
“Netzteil” should not be an issue.)

Here’s the complete filter chain for the text_de field type:

  [filter chain XML omitted]


Interestingly, if I simplify the query to only boost on “Netzteil”, the scores in
both the result and the explain section are correctly 2.0.

I reproduced this with a local Solr 7.5.0 server (no sharding, no replica) on 
Mac OS X 10.14.1.

I found mention of a somewhat similar situation with BooleanQuery, which was 
considered a bug and fixed in 2016: 
https://issues.apache.org/jira/browse/LUCENE-7132

So my questions are:

1. Is there something wrong in my query that prevents the “Netzteil”-only 
product to get a score of 2.0?
2. Shouldn’t the score in the result and the explain section always be the same?

Best regards,
Thomas


RE: [solr-solrcloud] How does DIH work when there are multiple nodes?

2019-01-04 Thread 유정인
Hi

I was looking for a way to run 'DIH' automatically.

The reason is an HA (high availability) configuration.

Thank you for the answer.

If you know how, please reply.
-Original Message-
From: Doss  
Sent: Friday, January 04, 2019 3:59 PM
To: solr-user@lucene.apache.org
Subject: RE: [solr-solrcloud] How does DIH work when there are multiple
nodes?

Hi,

The data import process will not happen automatically; we have to do it
manually through the admin interface or by calling the URL:

https://lucene.apache.org/solr/guide/7_5/uploading-structured-data-store-data-with-the-data-import-handler.html

Full Import:

http://node1ip:8983/solr/yourindexname/dataimport?command=full-import=true

Delta Import:

http://node1ip:8983/solr/yourindexname/dataimport?command=delta-import=true


If you want to do the delta import automatically you can setup a cron
(linux) which can call the URL periodically.

Best,
Doss.




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html