Re: What is “high cardinality” in facet streams?

2018-02-22 Thread Alfonso Muñoz-Pomer Fuentes
Right now we’re sharding the collection because we hit performance issues in the 
past with legacy Solr (i.e. a single Solr core), and we’re also experimenting a 
bit to see which replication factor we can get away with (in terms of resources 
and cost). Unfortunately, PSQL isn’t yet an option because it doesn’t support 
point fields, which we’re using in our schema 
(https://issues.apache.org/jira/browse/SOLR-10427).

Thanks for pointing at the parallel function. What I don’t understand, though, 
is this: if I don’t use the parallel decorator, is my query not distributed across 
my cluster nodes (e.g. I have four shards and no replicas)?
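
For anyone following along, my understanding of the kind of expression Joel 
describes is roughly the following (collection name, field name and worker count 
are placeholders for illustration, not our actual setup):

  # sketch only: a parallel rollup over the /export handler; partitionKeys is required by parallel()
  curl --data-urlencode 'expr=parallel(mycollection,
      rollup(
        search(mycollection, q="*:*", fl="my_facet_field", sort="my_facet_field asc",
               qt="/export", partitionKeys="my_facet_field"),
        over="my_facet_field", count(*)),
      workers="4", sort="my_facet_field asc")' \
    "http://localhost:8983/solr/mycollection/stream"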


> On 22 Feb 2018, at 03:01, Joel Bernstein  wrote:
> 
> With Streaming Expressions you have options for speeding up large
> aggregations.
> 
> 1) Shard
> 2) Use the parallel function to run the aggregation in parallel.
> 3) Add more replicas
> 
> When you use the parallel function the same aggregation can be pulled from
> every shard and every shard replica in the cluster.
> 
> The parallel SQL interface supports a map_reduce aggregation mode where you
> can specify the number of parallel workers. If a SQL group by query works
> for you that might be the easiest way to go. The docs have good coverage of
> this topic.
> 
> 
> 
> 
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Wed, Feb 21, 2018 at 8:43 PM, Shawn Heisey  wrote:
> 
>> On 2/21/2018 12:08 PM, Alfonso Muñoz-Pomer Fuentes wrote:
>>> Some more details about my collection:
>>> - Approximately 200M documents
>>> - 1.2M different values in the field I’m faceting over
>>> 
>>> The query I’m doing is over a single bucket, which after applying q and
>>> fq the 1.2M values are reduced to, at most 60K (often times half that
>>> value). From your replies I assume I’m not going to hit a bottleneck any
>>> time soon. Thanks a lot.
>> 
>> Two hundred million documents is going to be a pretty big index even if
>> the documents are small.  The server is going to need a lot of spare
>> memory (not assigned to programs) for good general performance.
>> 
>> As I understand it, facet performance is going to be heavily determined
>> by the 1.2 million unique values in the field you're using.  Facet
>> performance is probably going to be very similar whether your query
>> matches 60K or 1 million.
>> 
>> Thanks,
>> Shawn
>> 
>> 

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer



Re: What is “high cardinality” in facet streams?

2018-02-22 Thread Alfonso Muñoz-Pomer Fuentes
All in all the index is about 250GB, and it’s sharded across two dedicated VMs 
with 24GB of memory. It’s performing OK so far (queries take about 7 seconds, the 
worst cases about 10). At some point in the past we needed to transition to 
SolrCloud because a single Solr core, of course, wouldn’t scale.

> On 22 Feb 2018, at 01:43, Shawn Heisey  wrote:
> 
> On 2/21/2018 12:08 PM, Alfonso Muñoz-Pomer Fuentes wrote:
>> Some more details about my collection:
>> - Approximately 200M documents
>> - 1.2M different values in the field I’m faceting over
>> 
>> The query I’m doing is over a single bucket, which after applying q and fq 
>> the 1.2M values are reduced to, at most 60K (often times half that value). 
>> From your replies I assume I’m not going to hit a bottleneck any time soon. 
>> Thanks a lot.
> 
> Two hundred million documents is going to be a pretty big index even if
> the documents are small.  The server is going to need a lot of spare
> memory (not assigned to programs) for good general performance.
> 
> As I understand it, facet performance is going to be heavily determined
> by the 1.2 million unique values in the field you're using.  Facet
> performance is probably going to be very similar whether your query
> matches 60K or 1 million.
> 
> Thanks,
> Shawn
> 

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel:+ 44 (0) 1223 49 2633
Skype: amunozpomer



Re: Sort by nested field but only in matching nested documents

2018-02-22 Thread Florian Fankhauser
Thanks for your answer, Mikhail.

Florian


-Original Message-
From: Mikhail Khludnev [mailto:m...@apache.org] 
Sent: Tuesday, 6 February 2018 11:44
To: solr-user 
Subject: Re: Sort by nested field but only in matching nested documents

Hello Florian,
No. As an alternative you can put it into the q param, suppressing scoring from 
undesired clauses with ^=0.
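
A rough, untested sketch of that idea (field names as in your example documents 
below) would be to move everything into q and sort by score:

q={!parent which=doc_type_s:book score=max v='+doc_type_s:library^=0 +city_t:Innsbruck^=0 +{!func}acquisition_date_i'}
sort=score desc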

On Thu, Feb 1, 2018 at 5:22 PM, Florian Fankhauser 
wrote:

> Hello,
> given the following document structure (books as parent, libraries 
> having these books as children):
>
> <add>
>   <doc>
>     <field name="doc_type_s">book</field>
>     <field name="id">1000</field>
>     <field name="title_t">Mr. Mercedes</field>
>     <field name="autor_t">Stephen King</field>
>     <doc>
>       <field name="doc_type_s">library</field>
>       <field name="id">1000/100</field>
>       <field name="acquisition_date_i">20160810</field>
>       <field name="city_t">Innsbruck</field>
>     </doc>
>     <doc>
>       <field name="doc_type_s">library</field>
>       <field name="id">1000/101</field>
>       <field name="acquisition_date_i">20180103</field>
>       <field name="city_t">Hall</field>
>     </doc>
>   </doc>
>   <doc>
>     <field name="doc_type_s">book</field>
>     <field name="id">1001</field>
>     <field name="title_t">Noah</field>
>     <field name="autor_t">Sebastian Fitzek</field>
>     <doc>
>       <field name="doc_type_s">library</field>
>       <field name="id">1001/100</field>
>       <field name="acquisition_date_i">20170810</field>
>       <field name="city_t">Innsbruck</field>
>     </doc>
>   </doc>
> </add>
>
> Now I want to get all books located in libraries in the city "Innsbruck", 
> sorted by acquisition date descending.
> In other words: I want to filter on the field city_t in the child 
> documents, but return only the parent document. And I want to sort by 
> the field acquisition_date_i in the child documents in descending 
> order, newest first.
>
> My first try:
> -
>
> URL:
> http://localhost:8983/solr/test1/select?q=title_t:*&fq={!
> parent%20which=doc_type_s:book}city_t:Innsbruck&sort={!
> parent%20which=doc_type_s:book%20score=max%20v=%27%
> 2Bdoc_type_s:library%20%2B{!func}acquisition_date_i%27}%20desc
>
> URL params decoded:
> q=title_t:*
> fq={!parent which=doc_type_s:book}city_t:Innsbruck
> sort={!parent which=doc_type_s:book score=max v='+doc_type_s:library
> +{!func}acquisition_date_i'} desc
>
> Result:
> {
>   "responseHeader":{
> "status":0,
> "QTime":4,
> "params":{
>   "q":"title_t:*",
>   "fq":"{!parent which=doc_type_s:book}city_t:Innsbruck",
>   "sort":"{!parent which=doc_type_s:book score=max 
> v='+doc_type_s:library +{!func}acquisition_date_i'} desc"}},
>   "response":{"numFound":2,"start":0,"docs":[
>   {
> "doc_type_s":"book",
> "text":["book",
>   "Mr. Mercedes",
>   "Stephen King"],
> "id":"1000",
> "title_t":"Mr. Mercedes",
> "title_t_fac":"Mr. Mercedes",
> "autor_t":"Stephen King",
> "autor_t_fac":"Stephen King",
> "_version_":1591205521252155392},
>   {
> "doc_type_s":"book",
> "text":["book",
>   "Noah",
>   "Sebastian Fitzek"],
> "id":"1001",
> "title_t":"Noah",
> "title_t_fac":"Noah",
> "autor_t":"Sebastian Fitzek",
> "autor_t_fac":"Sebastian Fitzek",
> "_version_":1591205521256349696}]
>   }}
>
> The result is wrong, because "Noah" should be before "Mr. Mercedes" in 
> the list. The reason is, I guess, that "Mr. Mercedes" has another 
> child document with a newer acquisition_date. But this child document 
> is not in city "Innsbruck" and should not influence the sorting.
>
> So I tried to add the city filter to the sort parameter as well in my 
> second try:
> -
>
> URL:
> http://localhost:8983/solr/test1/select?q=title_t:*&fq={!
> parent%20which=doc_type_s:book}city_t:Innsbruck&sort={!
> parent%20which=doc_type_s:book%20score=max%20v=%27%
> 2Bdoc_type_s:library%20%2Bcity_t:Innsbruck%20%2B{!
> func}acquisition_date_i%27}%20desc
>
> URL params decoded:
> q=title_t:*
> fq={!parent which=doc_type_s:book}city_t:Innsbruck
> sort={!parent which=doc_type_s:book score=max v='+doc_type_s:library
> +city_t:Innsbruck +{!func}acquisition_date_i'} desc
>
> (I added "+city_t:Innsbruck" to the sort param)
>
> Result:
> {
>   "responseHeader":{
> "status":0,
> "QTime":3,
> "params":{
>   "q":"title_t:*",
>   "fq":"{!parent which=doc_type_s:book}city_t:Innsbruck",
>   "sort":"{!parent which=doc_type_s:book score=max 
> v='+doc_type_s:library +city_t:Innsbruck +{!func}acquisition_date_i'} 
> desc"}},
>   "response":{"numFound":2,"start":0,"docs":[
>   {
> "doc_type_s":"book",
> "text":["book",
>   "Noah",
>   "Sebastian Fitzek"],
> "id":"1001",
> "title_t":"Noah",
> "title_t_fac":"Noah",
> "autor_t":"Sebastian Fitzek",
> "autor_t_fac":"Sebastian Fitzek",
> "_version_":1591205521256349696},
>   {
> "doc_type_s":"book",
> "text":["book",
>   "Mr. Mercedes",
>   "Stephen King"],
> "id":"1000",
> "title_t":"Mr. Mercedes",
> "title_t_fac":"Mr. Mercedes",
> "autor_t":"Stephen King",
> "autor_t_fac":"Stephen King",
> "_version_

Problems with DocExpirationUpdateProcessor with Secured SolrCloud

2018-02-22 Thread Chris Ulicny
Hi,

We recently setup a 7.2.1 cloud with the intent to have the documents be
automatically deleted from the collection using the
DocExpirationUpdateProcessorFactory. We also have the cloud secured using
the BasicAuthenticationPlugin. Our current config settings are below.

The deployment is 3 nodes, each with a single solr instance which hosts a
single replica for the collection. The collection itself only has 1 shard,
so we have 3 copies (all NRT) of the same index.

What keeps happening is that the follower replicas end up being published
in a down state by the leader replica on the first autoDelete pass, since its
distributed updates are not authenticated. Relevant log dump:
https://pastebin.com/ZtirJLSu

Is there something that we were missing when we set this up? Besides the
replicas going down, the processor works as expected on the leader replica.

Thanks,
Chris


  

  <initParams path="/update/**">
    <lst name="defaults">
      <str name="_ttl_">+300SECONDS</str>
      <str name="update.chain">doc-expiration-processor-chain</str>
    </lst>
  </initParams>

  <updateRequestProcessorChain name="doc-expiration-processor-chain">
    <processor class="solr.processor.DocExpirationUpdateProcessorFactory">
      <str name="expirationFieldName">_expireat_</str>
      <str name="ttlFieldName">_ttl_</str>
      <int name="autoDeletePeriodSeconds">300</int>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>


Re: Limit search queries only to pull replicas

2018-02-22 Thread Stanislav Sandalnikov
Hi,

The use case for this is that our indexing node hosts more shards than it has CPU 
cores; that is enough for indexing, but not enough to serve the search queries if 
those queries are heavy. To keep it out of serving requests we are using an 
in-house solution that routes the queries to pull replicas based on information 
from ZooKeeper.
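
For anyone curious, a manual approach looks roughly like this (collection name and 
replica core URLs are made up for illustration): read the replica types from the 
Collections API, then pass only the PULL replicas in the shards parameter, 
separating same-shard alternatives with "|".

  # which replicas are of type PULL?
  curl 'http://solrhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycoll'

  # route a query to those replicas only (core URLs are placeholders)
  curl 'http://solrhost:8983/solr/mycoll/select?q=*:*&shards=node2:8983/solr/mycoll_shard1_replica_p2|node3:8983/solr/mycoll_shard1_replica_p4'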

Ere, thanks for the patch, looking forward to trying it.

Regards
Stanislav

> On 14 Feb 2018, at 18:18, Ere Maijala  wrote:
> 
> I've now posted https://issues.apache.org/jira/browse/SOLR-11982 with a 
> patch. It works just like preferLocalShards. SOLR-10880 is awesome, but my 
> idea is not to filter out anything, so this just adjusts the order of nodes.
> 
> --Ere
> 
>> Tomas Fernandez Lobbe wrote on 8.1.2018 at 21.42:
>> This feature is not currently supported. I was thinking in implementing it 
>> by extending the work done in SOLR-10880. I still didn’t have time to work 
>> on it though.  There is a patch for SOLR-10880 that doesn’t implement 
>> support for replica types, but could be used as base.
>> Tomás
>>> On Jan 8, 2018, at 12:04 AM, Ere Maijala  wrote:
>>> 
>>> Server load alone doesn't always indicate the server's ability to serve 
>>> queries. Memory and cache state are important too, and they're not as easy 
>>> to monitor. Additionally, server load at any single point in time or a 
>>> short term average is not indicative of the server's ability to handle 
>>> search requests if indexing happens in short but intense bursts.
>>> 
>>> It can also complicate things if there are more than one Solr instance 
>>> running on a single server.
>>> 
>>> I'm definitely not against intelligent routing. In many cases it makes 
>>> perfect sense, and I'd still like to use it, just limited to the pull 
>>> replicas.
>>> 
>>> --Ere
>>> 
>>> Erick Erickson wrote on 5.1.2018 at 19.03:
 Actually, I think a much better option is to route queries based on server load.
 The theory of preferring pull replicas to leaders would be that the leader
 will be doing the indexing work and the pull replicas would be doing less
 work therefore serving queries faster. But that's a fragile assumption.
 Let's say indexing stops totally. Now your leader is sitting there idle
 when it could be serving queries.
 The autoscaling work will allow for more intelligent routing, you can
 monitor the CPU load on your servers and if the leader has some spare
 cycles use them .vs. crudely routing all queries to pull replicas (or tlog
 replicas for that matter). NOTE: I don't know whether this is being
 actively worked on or not, but seems a logical extension of the increased
 monitoring capabilities being put in place for autoscaling, but I'd rather
 see effort put in there than support routing based solely on a node's type.
 Best,
 Erick
 On Fri, Jan 5, 2018 at 7:51 AM, Emir Arnautović <
 emir.arnauto...@sematext.com> wrote:
> It is interesting that ES had similar feature to prefer primary/replica
> but it deprecating that and will remove it - could not find explanation 
> why.
> 
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 5 Jan 2018, at 15:22, Ere Maijala  wrote:
>> 
>> Hi,
>> 
>> It would be really nice to have a server-side option, though. Not
> everyone uses Solrj, and a typical fairly dummy client just queries the
> server without any understanding about shards etc. Solr could be clever
> enough to not forward the query to NRT shards when configured to prefer
> PULL shards and they're available. Maybe it could be something similar to
> the preferLocalShards parameter, like "preferShardTypes=TLOG,PULL".
>> 
>> --Ere
>> 
>>> Emir Arnautović wrote on 14.12.2017 at 11.41:
>>> Hi Stanislav,
>>> I don’t think that there is a built in feature to do this, but that
> sounds like nice feature of Solrj - maybe you should check if available.
> You can implement it outside of Solrj - check cluster state to see which
> shards are available and send queries only to pull replicas.
>>> HTH,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
 On 14 Dec 2017, at 09:58, Stanislav Sandalnikov <
> s.sandalni...@gmail.com> wrote:
 
 Hi,
 
 We have a Solr 7.1 setup with SolrCloud where we have multiple shards
> on one server (for indexing) each shard has a pull replica on other 
> servers.
 
 What are the possible ways to limit search request only to pull type
> replicase?
 At the moment the only solution I found is to append shards parameter
> to each query, but if new shards added later it requires to change
>

Response time under 1 second?

2018-02-22 Thread LOPEZ-CORTES Mariano-ext
Hello

With a 3-node cluster, each 12GB, and a corpus of 5GB (CSV format).

Is it better to disable the Solr caches completely? There is enough RAM for the 
entire index.

Is there a way to keep random queries under 1 second?

Thanks!





Re: Solr Swap space

2018-02-22 Thread Shawn Heisey

On 2/21/2018 7:58 PM, Susheel Kumar wrote:

Below output for prod machine based on the steps you described.  Please
take a look.  The solr searches are returning fine and no issue with
performance but since last 4 months swap space started going up. After
restart, it comes down to zero and then few weeks, it utilization reaches
to 40-50% and thus requires restart of solr process.


I bet that if you run this command, it will show you a value of 60:

cat /proc/sys/vm/swappiness

This makes the OS very aggressive about using swap, even when there is 
absolutely no need for it to do so.


If you type the following series of commands, it should fix the problem 
and prevent it from happening again until you reboot the system:


echo "0" > /proc/sys/vm/swappiness
swapoff -a
swapon -a

Note that when the swapoff command runs, it will force the OS to read 
all the swapped data back into memory.  It will take several minutes for 
this to occur, because it must read nearly a gigabyte of data and figure 
out how to put it back in memory. Both of the command outputs you 
included say that there is over 20GB of free memory.  So I do not 
anticipate the system having problems from running these commands.  It 
will slow the machine down temporarily, though -- so only do it during a 
quiet time for your Solr install.
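
To confirm the result afterwards, something like the following should show swap 
usage back at (or near) zero once swapon finishes (free and swapon are standard 
util-linux tools; output formats vary a bit by distribution):

# swap used should drop to ~0 after the swapoff/swapon cycle
free -h
swapon -s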


To make this setting survive a reboot, find the sysctl.conf file 
somewhere in your /etc directory and add this line to it:


vm.swappiness = 0

This setting does not completely disable swap.  If the system finds 
itself with real memory pressure and actually does NEED to use swap, it 
still will ... it just won't swap anything out before it's actually 
required.


I do not think the behavior you are seeing is actually causing problems, 
based on your system load and CPU usage.  But what I've shared should 
fix it for you.


Thanks,
Shawn



Re: Issue Using JSON Facet API Buckets in Solr 6.6

2018-02-22 Thread Yonik Seeley
Thanks Antelmo, I'm trying to reproduce this now.
-Yonik


On Mon, Feb 19, 2018 at 10:13 AM, Antelmo Aguilar  wrote:
> Hi all,
>
> I was wondering if the information I sent is sufficient to look into the
> issue.  Let me know if you need anything else from me please.
>
> Thanks,
> Antelmo
>
> On Thu, Feb 15, 2018 at 1:56 PM, Antelmo Aguilar  wrote:
>
>> Hi,
>>
>> Here are two pastebins.  The first is the full complete response with the
>> search parameters used.  The second is the stack trace from the logs:
>>
>> https://pastebin.com/rsHvKK63
>>
>> https://pastebin.com/8amxacAj
>>
>> I am not using any custom code or plugins with the Solr instance.
>>
>> Please let me know if you need anything else and thanks for looking into
>> this.
>>
>> -Antelmo
>>
>> On Wed, Feb 14, 2018 at 12:56 PM, Yonik Seeley  wrote:
>>
>>> Could you provide the full stack trace containing "Invalid Date
>>> String"  and the full request that causes it?
>>> Are you using any custom code/plugins in Solr?
>>> -Yonik
>>>
>>>
>>> On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar  wrote:
>>> > Hi,
>>> >
>>> > I was using the following part of a query to get facet buckets so that I
>>> > can use the information in the buckets for some post-processing:
>>> >
>>> > "json": {
>>> >   "filter": ["bundle:pop_sample", "has_abundance_data_b:true",
>>> >              "has_geodata:true", "${project}"],
>>> >   "facet": {
>>> >     "term": {
>>> >       "type": "terms", "limit": -1, "field": "${term:species_category}",
>>> >       "facet": {
>>> >         "collection_dates": {
>>> >           "type": "terms", "limit": -1, "field": "collection_date",
>>> >           "facet": {
>>> >             "collection": {
>>> >               "type": "terms", "field": "collection_assay_id_s",
>>> >               "facet": {
>>> >                 "abnd": "sum(div(sample_size_i, collection_duration_days_i))"
>>> >   }}}}}}}}
>>> >
>>> > Sorry if it is hard to read.  Basically what it was doing was getting
>>> > the following buckets:
>>> >
>>> > First bucket will be categorized by "Species category" by default
>>> unless we
>>> > pass in the request the "term" parameter which we will categories the
>>> first
>>> > bucket by whatever "term" is set to.  Then inside this first bucket, we
>>> > create another buckets of the "Collection date" category.  Then inside
>>> the
>>> > "Collection date" category buckets, we would use some functions to do
>>> some
>>> > calculations and return those calculations inside the "Collection date"
>>> > category buckets.
>>> >
>>> > This query is working fine in Solr 6.2, but I upgraded our instance of
>>> Solr
>>> > 6.2 to the latest 6.6 version.  However it seems that upgrading to Solr
>>> 6.6
>>> > broke the above query.  Now it complains when trying to create the
>>> buckets
>>> > of the "Collection date" category.  I get the following error:
>>> >
>>> > Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014'
>>> >
>>> > It seems that when creating the buckets of a date field, it does some
>>> > conversion of the way the date is stored and causes the error to appear.
>>> > Does anyone have an idea as to why this error is happening?  I would
>>> really
>>> > appreciate any help.  Hopefully I was able to explain my issue well.
>>> >
>>> > Thanks,
>>> > Antelmo
>>>
>>
>>


Re: Response time under 1 second?

2018-02-22 Thread Shawn Heisey

On 2/22/2018 8:53 AM, LOPEZ-CORTES Mariano-ext wrote:

With a 3 nodes cluster each 12GB and a corpus of 5GB (CSV format).

Is it better to disable completely Solr cache ? There is enough RAM for the 
entire index.


The size of the input data will have an effect on how big the index is, 
but it is not a direct indication of the index size.  The size of the 
index is more important than the size of the data that you send to Solr 
to create the index.


You say 12GB ... but is this total system memory, or the max Java heap 
size for Solr?  What are these two numbers for your servers?


If you go to the admin UI for one of these servers and look at the 
Overview page for all of the index cores it contains, you will be able 
to see how many documents and what size each index is on disk.  What are 
these numbers?  If the numbers are similar for all the servers, then I 
will only need to see it for one of them.
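
If the admin UI is awkward to reach, the same numbers (document count and on-disk 
index size per core) should also be available from the CoreAdmin API, something 
along these lines (port and URL depend on your install):

# per core, look under status -> <core name> -> index in the response
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"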


If the machine is running an OS like Linux that has the gnu top program, 
then I can see a lot of useful information from that program.  Run "top" 
(not htop or other variants), press shift-M to sort the list by memory, 
and grab a screenshot.  This will probably be an image file, so you'll 
need to find a file sharing site and give us a URL to access the file.  
Attachments rarely make it to the mailing list.


Thanks,
Shawn



SOLR Score Range Changed

2018-02-22 Thread Hodder, Rick
I am migrating from SOLR 4.10.2 to SOLR 7.1.

All seems to be going well, except for one thing: the score that is coming back 
for the resulting documents is giving different scores.

The core uses a schema. Here's the schema info for the field that I am 
searching on:




When searching maxrows=750, fields: *,score

IDX_Company:(cat and scratch)

SOLR 7.1: max score 6.95 and a min of 6.28

SOLR 4.10.2: max score 8.63 and a min of 0.91

IDX_InsuredName:(cat and scratch and fever)

SOLR 7.1 max score of 12.99 and a min of 11.25 SOLR 4.10.2 max 3.97 and min of 
0.77

See how the range of values is different (ranges in 7.1 don't go down to 0.x). 
Also notice that the max score doubles when I add one word to the search terms 
in 7.1. Most important, the ranges in 4.10.2 overlap - but the 7.1 ranges don't.

A little more information to show you how I use this information, and why this 
is causing a problem.

I get a company name like "bobs cabinetry" and another "all american tech 
enterprise"

I run two SOLR queries per company name, I'll call them 1-AND, 1-OR, 2-AND, 
2-OR.

IDX_Company:(bobs AND cabinetry) &f=*,score,requestid:"1-AND"
IDX_Company:(bobs OR cabinetry) &f=*,score,requestid:"1-OR"
IDX_Company:(all AND american AND tech AND enterprise) 
&f=*,score,requestid:"2-AND"
IDX_Company:(all OR american OR tech OR enterprise) &f=*,score,requestid:"2-OR"

I combine the results together sort by descending score, and then take the top 
750 rows.(The requestid lets me know which query the results came from)

Because of the changes in the range of scores, the sort pushes all of the "all 
american tech enterprise" rows to the top of the results (because of no overlap), 
and when the top 750 are taken, everything for "bobs cabinetry" is removed from 
the results.

Is there some config setting I can change to make score calculation act like it 
did in 4.10.2?

Or something else?
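
(One thing I'm aware of but haven't tried yet: 7.x changed the default similarity 
to BM25, whereas 4.10.2 used the classic TF-IDF scoring. If that turns out to be 
the cause, I assume something like the following in the schema would bring back 
the old behaviour - treat it as an untested sketch, not a recommendation:)

<!-- untested: global similarity override in the schema -->
<similarity class="solr.ClassicSimilarityFactory"/>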


Re: Solr Swap space

2018-02-22 Thread Susheel Kumar
Cool, thanks, Shawn.  I was also looking at the swappiness and it is set to
60.  Will try this out and let you know.  Thanks again.

On Thu, Feb 22, 2018 at 10:55 AM, Shawn Heisey  wrote:

> On 2/21/2018 7:58 PM, Susheel Kumar wrote:
>
>> Below output for prod machine based on the steps you described.  Please
>> take a look.  The solr searches are returning fine and no issue with
>> performance but since last 4 months swap space started going up. After
>> restart, it comes down to zero and then few weeks, it utilization reaches
>> to 40-50% and thus requires restart of solr process.
>>
>
> I bet that if you run this command, it will show you a value of 60:
>
> cat /proc/sys/vm/swappiness
>
> This makes the OS very aggressive about using swap, even when there is
> absolutely no need for it to do so.
>
> If you type the following series of commands, it should fix the problem
> and prevent it from happening again until you reboot the system:
>
> echo "0" > /proc/sys/vm/swappiness
> swapoff -a
> swapon -a
>
> Note that when the swapoff command runs, it will force the OS to read all
> the swapped data back into memory.  It will take several minutes for this
> to occur, because it must read nearly a gigabyte of data and figure out how
> to put it back in memory. Both of the command outputs you included say that
> there is over 20GB of free memory.  So I do not anticipate the system
> having problems from running these commands.  It will slow the machine down
> temporarily, though -- so only do it during a quiet time for your Solr
> install.
>
> To make this setting survive a reboot, find the sysctl.conf file somewhere
> in your /etc directory and add this line to it:
>
> vm.swappiness = 0
>
> This setting does not completely disable swap.  If the system finds itself
> with real memory pressure and actually does NEED to use swap, it still will
> ... it just won't swap anything out before it's actually required.
>
> I do not think the behavior you are seeing is actually causing problems,
> based on your system load and CPU usage.  But what I've shared should fix
> it for you.
>
> Thanks,
> Shawn
>
>


Re: SOLR Score Range Changed

2018-02-22 Thread Shawn Heisey

On 2/22/2018 9:50 AM, Hodder, Rick wrote:

I am migrating from SOLR 4.10.2 to SOLR 7.1.

All seems to be going well, except for one thing: the score that is coming back 
for the resulting documents is giving different scores.


The absolute score has no meaning when you change something -- the 
index, the query, the software version, etc.  You can't compare absolute 
scores.


What matters is the relative score of one document to another *in the 
same query*.  The amount of difference is almost irrelevant -- the goal 
of Lucene's score calculation gymnastics is to have one document score 
higher than another, so the *order* is reasonably correct.


Assuming you're using the default relevancy sort, does the order of your 
search results change dramatically from one version to the other?  If it 
does, is the order generally better from a relevance standpoint, or 
generally worse?  If you are specifying an explicit sort, then the 
scores will likely be ignored.


What I am describing is also why it's strongly recommended that you 
never try to convert scores to percentages:


https://wiki.apache.org/lucene-java/ScoresAsPercentages

Thanks,
Shawn



RE: Response time under 1 second?

2018-02-22 Thread LOPEZ-CORTES Mariano-ext
For the moment, I have the following information:

12GB is the max Java heap. I don't know the total memory; we have no direct access to the host.

2 replicas = 
Size 1 = 11.51 GB
Size 2 = 11.82 GB
(Sizes showed in the Core-Overview admin gui)

Thanks very much!

-Original Message-
From: Shawn Heisey [mailto:elyog...@elyograg.org] 
Sent: Thursday, 22 February 2018 17:06
To: solr-user@lucene.apache.org
Subject: Re: Response time under 1 second?

On 2/22/2018 8:53 AM, LOPEZ-CORTES Mariano-ext wrote:
> With a 3 nodes cluster each 12GB and a corpus of 5GB (CSV format).
>
> Is it better to disable completely Solr cache ? There is enough RAM for the 
> entire index.

The size of the input data will have an effect on how big the index is, but it 
is not a direct indication of the index size.  The size of the index is more 
important than the size of the data that you send to Solr to create the index.

You say 12GB ... but is this total system memory, or the max Java heap size for 
Solr?  What are these two numbers for your servers?

If you go to the admin UI for one of these servers and look at the Overview 
page for all of the index cores it contains, you will be able to see how many 
documents and what size each index is on disk.  What are these numbers?  If the 
numbers are similar for all the servers, then I will only need to see it for 
one of them.

If the machine is running an OS like Linux that has the gnu top program, then I 
can see a lot of useful information from that program.  Run "top" 
(not htop or other variants), press shift-M to sort the list by memory, and 
grab a screenshot.  This will probably be an image file, so you'll need to find 
a file sharing site and give us a URL to access the file. Attachments rarely 
make it to the mailing list.

Thanks,
Shawn



Re: Response time under 1 second?

2018-02-22 Thread Shawn Heisey

On 2/22/2018 10:45 AM, LOPEZ-CORTES Mariano-ext wrote:

For the moment, I have the following information:

12GB is max java heap. Total memory i don't know. No direct access to host.

2 replicas =
Size 1 = 11.51 GB
Size 2 = 11.82 GB
(Sizes showed in the Core-Overview admin gui)


OK, so you have about 23GB of total index data on the machine.  With a 
12GB heap, and assuming there's no other software running on the 
machine, then for good performance I would want to have at least 32GB 
total memory, which leaves around 20GB for the OS to cache the 23GB 
index.  More memory would be better, but probably isn't a requirement.  
If there is other software running on the machine, then that will 
increase the total memory requirement.


It is always possible that your Solr install is in a situation where 
12GB of heap is actually not quite big enough. If that happens, 
performance will usually be a lot worse than in situations where the 
left-over memory is not enough for the OS to cache the index properly.


You might be able to get decent performance if the total memory is about 
24GB, but that much might NOT be enough.  There are a lot of factors 
affecting actual memory requirements.


The Solr admin UI will tell you what the total physical memory in the 
system is, on the dashboard.  It will be the upper right graph.  Note 
that this graph is likely to show 100% or nearly 100% full.  Don't let 
this alarm you -- it's normal.


How did you arrive at the 12GB size for your heap?  Have you tried 
reducing this number so that there is more memory left for the OS to 
handle disk caching?  I have no idea whether your Solr install will 
still work properly with a smaller heap, so be aware that reducing the 
heap might cause more problems.


https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn



Turn on/off query based on a url parameter

2018-02-22 Thread Roopa Rao
Hi,

I want to enable or disable a SolrFeature in LTR based on efi parameter.

In simple the query should be executed only if a parameter is true.

Any examples or suggestion on how to accomplish this?

The function query examples use fields to supply the value. In my
case I want to execute the query only if a URL parameter is true.

Thanks,
Roopa


RE: Turn on/off query based on a url parameter

2018-02-22 Thread Phil Scadden
I always filter solr request via a proxy (so solr itself is not exposed 
directly to the web). In that proxy, the query parameters can be broken down 
and filtered as desired (I examine authorities granted to a session to control 
even which indexes are being searched) before passing the modified url to solr. 
The coding of the proxy obviously depends on your application environment. We 
use java and Spring.

-Original Message-
From: Roopa Rao [mailto:roop...@gmail.com]
Sent: Friday, 23 February 2018 8:04 a.m.
To: solr-user@lucene.apache.org
Subject: Turn on/off query based on a url parameter

Hi,

I want to enable or disable a SolrFeature in LTR based on efi parameter.

In simple the query should be executed only if a parameter is true.

Any examples or suggestion on how to accomplish this?

The function query examples use fields to supply the value. In my case 
I want to execute the query only if a URL parameter is true.

Thanks,
Roopa
Notice: This email and any attachments are confidential and may not be used, 
published or redistributed without the prior written consent of the Institute 
of Geological and Nuclear Sciences Limited (GNS Science). If received in error 
please destroy and immediately notify GNS Science. Do not copy or disclose the 
contents.


Re: Solr Autoscaling multi-AZ rules

2018-02-22 Thread Jeff Wartes

I managed to miss this reply earlier, but:

Shard: A logical segment of a collection
Replica: A physical core, representing a particular Shard
Replication Factor (RF): A set of Replicas, such that a single Replica exists 
for each Shard in a Collection. 
Availability Zone (AZ): A partitioned set of nodes such that a physical or 
hardware failure in one AZ should not affect another AZ. AZ could mean distinct 
racks in a data center, or distinct  data centers, but I happen to specifically 
mean the AWS definition here: 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-regions-availability-zones

So an RF2 collection with 2 shards means I have four Replicas in my collection, 
two shard1 and two shard2. If it's RF3, then I have six: three shard1 and three 
shard2.
I'm using "Distinct RF" as a shorthand for "a single replica for every shard in 
the collection". 
In the RF2 example above, if I have two Availability Zones, I would want a 
Distinct RF in each AZ. So, a replica for shard1 and shard2 in AZ1, and a 
replica for shard1 and shard2 in AZ2. I would *not* want, say, both shard1 
replicas in AZ1 because then a failure of AZ1 could leave me with no replicas 
for shard1 and an incomplete collection.
If I had RF6 and two AZs, I would want three Distinct RFs in each AZ. (three 
replicas for each shard, per AZ)

I understand that {"replica": "<7", "node":"#ANY"} may result in two replicas 
of the same shard ending up on the same node. However, the other rule should 
prevent this: {"replica": "<2", "shard": "#EACH", "node": "#ANY"}
So by using both rules, that should mean "no more than six replicas on a node, 
where all the replicas on that node represent distinct shards". Right?
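
For what it's worth, the preferences and the two rules above can be pushed in a 
single call to the autoscaling API; roughly like this (host/port are placeholders, 
and this is a sketch rather than a paste of my working config):

# sketch only; adjust host/port and add further rules as needed
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/api/cluster/autoscaling' -d '{
  "set-cluster-preferences": [{"maximize": "cores"}],
  "set-cluster-policy": [
    {"replica": "<2", "shard": "#EACH", "node": "#ANY"},
    {"replica": "<7", "node": "#ANY"}
  ]
}'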



On 2/12/18, 12:18 PM, "Noble Paul"  wrote:

>>Goal: No node should have more than 6 shards

This is not possible today

 {"replica": "<7", "node":"#ANY"} , means don't put more than 7
replicas of the collection (irrespective of the shards) in a given
node

what do you mean by distinct 'RF' ? I think we are screwing up the
terminologies a bit here

On Wed, Feb 7, 2018 at 1:38 PM, Jeff Wartes  wrote:
> I’ve been messing around with the Solr 7.2 autoscaling framework this 
week. Some things seem trivial, but I’m also running into questions and issues. 
If anyone else has experience with this stuff, I’d be glad to hear it. 
Specifically:
>
>
> Context:
> -One collection, consisting of 42 shards, where up to 6 shards can fit on 
a single node. (which means 7 nodes per Replication Factor)
> -Three AZs, each with its own ip_2 value.
>
> Goals:
>
> Goal: Fully utilize available nodes.
> Cluster Preference: {“maximize”: "cores”}
>
> Goal: No node should have more than one replica of a given shard
> Rule: {"replica": "<2", "shard": "#EACH", "node": "#ANY"}
>
> Goal: No node should have more than 6 shards
> Rule: {"replica": "<7", "node":"#ANY"}
>
> Goal: Where possible, distinct RFs should each exist in an AZ.
> (Example1: I’d like 7 nodes with a complete RF in AZ 1 and 7 nodes with a 
complete RF in AZ 2, and not end up with, say, both shard2 replicas in AZ 1)
> (Example2: If I have 14 nodes in AZ 1 and 7 in AZ 2, I should have two 
full RFs in AZ 1 and one in AZ 2)
> Rule: ???
>
> I could have multiple non-strict rules perhaps? Like:
> {"replica": "<2", "shard": "#EACH", "ip_2": "1", "strict":false}
> {"replica": "<3", "shard": "#EACH", "ip_2": "1", "strict":false}
> {"replica": "<4", "shard": "#EACH", "ip_2": "1", "strict":false}
> {"replica": "<2", "shard": "#EACH", "ip_2": "2", "strict":false}
> {"replica": "<3", "shard": "#EACH", "ip_2": "2", "strict":false}
> {"replica": "<4", "shard": "#EACH", "ip_2": "2", "strict":false}
> etc
> So having more than one RF in an AZ is a technical “violation”, but if 
placement minimizes non-strict violations, replicas would tend to get placed 
correctly.
>
>
> Given a working set of rules, I’m still having trouble with two things:
>
>   1.  I’ve manually created the “.system” collection, as it didn’t seem 
to get created automatically. However, autoscaling activity is not getting 
logged to it.
>   2.  I can’t seem to figure out how to scale up.
>  *   I’d presumed editing the collection’s “replicationFactor” would 
do the trick, but it does not.
>  *   The “node-up” trigger will serve to replace lost replicas, but 
won’t otherwise take advantage of additional capacity.
>
>   i.  There’s a UTILIZENODE command in 7.2, but it appears that’s still 
something you need to trigger manually.
>
> Anyone played with this stuff?



-- 
-
Noble Paul




Re: Deploying solr to tomcat 7

2018-02-22 Thread Rehaman
Dear Shawn,

Thanks a lot for the quick response.

I will check with the same.

Thanks & Regards
Fazulur Rehaman

On Wed, Feb 21, 2018 at 4:55 PM, Shawn Heisey  wrote:

> On 2/21/2018 3:00 AM, Rehaman wrote:
>
>> We installed Ensembl server in our environment and not able to query
>> databases with large number of entries. And for that purpose we need to use
>> indexed databases through Solr search engine.
>>
>> We have installed Solr search engine (Solr Specification Version: 3.6.1)
>> on Tomcat 7. Able to see Solr main page "Welcome to Solr"  with ensembl
>> shards.
>>
>
> I don't know anything about Ensembl.  But I can comment about Solr.
> Version 3.6.1 is nearly six years old, and is four major versions out of
> date, as version 7.2.1 is the current release.  I can attempt to help, but
> this version is so old that it's effectively end of life.
>
> When I try to query each shard I am getting error "HTTP status 500" error.
>> I have searched in forum for this and not able to resolve. Please find
>> attached error log.
>>
>
> This is the relevant line from the log that indicates the problem:
>
> Caused by: java.net.ConnectException: Connection refused (Connection
> refused)
>
> The Solr server is trying to access one of the URL endpoints mentioned in
> the "shards" parameter.  That connection is being refused.  Which means
> that either the traffic is being blocked, possibly by a firewall, or the
> URL endpoint in the shards parameter is not correct.
>
> Thanks,
> Shawn
>
>


Re: Issue Using JSON Facet API Buckets in Solr 6.6

2018-02-22 Thread Yonik Seeley
I've reproduced the issue and opened
https://issues.apache.org/jira/browse/SOLR-12020

-Yonik



On Thu, Feb 22, 2018 at 11:03 AM, Yonik Seeley  wrote:
> Thanks Antelmo, I'm trying to reproduce this now.
> -Yonik
>
>
> On Mon, Feb 19, 2018 at 10:13 AM, Antelmo Aguilar  wrote:
>> Hi all,
>>
>> I was wondering if the information I sent is sufficient to look into the
>> issue.  Let me know if you need anything else from me please.
>>
>> Thanks,
>> Antelmo
>>
>> On Thu, Feb 15, 2018 at 1:56 PM, Antelmo Aguilar  wrote:
>>
>>> Hi,
>>>
>>> Here are two pastebins.  The first is the full complete response with the
>>> search parameters used.  The second is the stack trace from the logs:
>>>
>>> https://pastebin.com/rsHvKK63
>>>
>>> https://pastebin.com/8amxacAj
>>>
>>> I am not using any custom code or plugins with the Solr instance.
>>>
>>> Please let me know if you need anything else and thanks for looking into
>>> this.
>>>
>>> -Antelmo
>>>
>>> On Wed, Feb 14, 2018 at 12:56 PM, Yonik Seeley  wrote:
>>>
 Could you provide the full stack trace containing "Invalid Date
 String"  and the full request that causes it?
 Are you using any custom code/plugins in Solr?
 -Yonik


 On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar  wrote:
 > Hi,
 >
 > I was using the following part of a query to get facet buckets so that I
 > can use the information in the buckets for some post-processing:
 >
 > "json": {
 >   "filter": ["bundle:pop_sample", "has_abundance_data_b:true",
 >              "has_geodata:true", "${project}"],
 >   "facet": {
 >     "term": {
 >       "type": "terms", "limit": -1, "field": "${term:species_category}",
 >       "facet": {
 >         "collection_dates": {
 >           "type": "terms", "limit": -1, "field": "collection_date",
 >           "facet": {
 >             "collection": {
 >               "type": "terms", "field": "collection_assay_id_s",
 >               "facet": {
 >                 "abnd": "sum(div(sample_size_i, collection_duration_days_i))"
 >   }}}}}}}}
 >
 > Sorry if it is hard to read.  Basically what it was doing was getting
 > the following buckets:
 >
 > First bucket will be categorized by "Species category" by default
 unless we
 > pass in the request the "term" parameter which we will categories the
 first
 > bucket by whatever "term" is set to.  Then inside this first bucket, we
 > create another buckets of the "Collection date" category.  Then inside
 the
 > "Collection date" category buckets, we would use some functions to do
 some
 > calculations and return those calculations inside the "Collection date"
 > category buckets.
 >
 > This query is working fine in Solr 6.2, but I upgraded our instance of
 Solr
 > 6.2 to the latest 6.6 version.  However it seems that upgrading to Solr
 6.6
 > broke the above query.  Now it complains when trying to create the
 buckets
 > of the "Collection date" category.  I get the following error:
 >
 > Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014'
 >
 > It seems that when creating the buckets of a date field, it does some
 > conversion of the way the date is stored and causes the error to appear.
 > Does anyone have an idea as to why this error is happening?  I would
 really
 > appreciate any help.  Hopefully I was able to explain my issue well.
 >
 > Thanks,
 > Antelmo

>>>
>>>


Indexing timeout issues with SolrCloud 7.1

2018-02-22 Thread Tom Peters
I'm trying to debug why indexing in SolrCloud 7.1 is having so many issues. It 
will hang most of the time, and time out the rest.

Here's an example:

time curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d 
'{"solr_id":"test_001", "data_type":"test"}'|jq .
{
  "responseHeader": {
"status": 0,
"QTime": 5004
  }
}
curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d   0.00s user 
0.00s system 0% cpu 5.025 total
jq .  0.01s user 0.00s system 0% cpu 5.025 total

Here's some of the timeout errors I'm seeing:

2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection s:shard1 
r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.h.RequestHandlerBase 
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout 
expired: 12/12 ms
2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection s:shard1 
r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.s.HttpSolrCall 
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout 
expired: 12/12 ms
2018-02-23 03:55:36.517 ERROR 
(recoveryExecutor-3-thread-4-processing-n:solr2-d.myhost:8080_solr 
x:mycollection_shard1_replica_n11 s:shard1 c:mycollection r:core_node12) 
[c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] 
o.a.s.h.ReplicationHandler Index fetch failed 
:org.apache.solr.common.SolrException: Index fetch failed :
2018-02-23 03:55:36.517 ERROR 
(recoveryExecutor-3-thread-4-processing-n:solr2-d.myhost:8080_solr 
x:mycollection_shard1_replica_n11 s:shard1 c:mycollection r:core_node12) 
[c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] 
o.a.s.c.RecoveryStrategy Error while trying to 
recover:org.apache.solr.common.SolrException: Replication for recovery failed.


We currently have two separate Solr clusters. Our current in-production cluster 
which runs on Solr 3.4 and a new ring that I'm trying to bring up which runs on 
SolrCloud 7.1. I have the exact same code that is indexing to both clusters. 
The Solr 3.4 indexes fine, but I'm running into lots of issues with SolrCloud 
7.1.


Some additional details about the setup:

* 5 nodes solr2-a through solr2-e.
* 5 replicas
* 1 shard
* The servers have 48G of RAM with -Xmx and -Xms set to 16G
* I currently have soft commits at 10m intervals and hard commits (with 
openSearcher=false) at 1m intervals; a sketch of the corresponding config is 
below. I also tried 5m (soft) and 15s (hard) as well.
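
For reference, those intervals correspond roughly to this solrconfig.xml fragment 
(times in milliseconds; a sketch, not a paste of the actual config):

<!-- sketch; goes inside the <updateHandler> section of solrconfig.xml -->
<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit every 1 minute -->
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime>           <!-- soft commit every 10 minutes -->
</autoSoftCommit>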

Any help or pointers would be greatly appreciated. Thanks!


This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


Solrj SolrServer not converting the Collection of Pojo Objects inside Parent Pojo

2018-02-22 Thread vracks


We are using Solrj version 4.10.4 as the java client to add documents into
Solr version 1.4.1

Sample Pojo Object:

@SolrDocument(solrCoreName = "customer")
public class Customer {
    private String customerId;
    private String customerName;
    private int age;
    private List<Address> addresses;
    // getters and setters
}

public class Address {
    private String street;
    private String city;
    private String state;
    private String country;
    private Long zip;
    // getters and setters
}

When indexing the customer Document with the below schema






The Customer document that gets indexed in Solr ends up containing the Address 
objects' memory references, e.g.

Address@spjdspf13Address@sdf535

as an arr of string elements instead of the individual fields of Address.




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr not accessible - javax.net.ssl.SSLException

2018-02-22 Thread protonmail4us
Greetings, Apache Solr community.

I'm here to ask for your help and advice about a Solr-related problem I'm 
having.

My company is an e-commerce website and uses Solr in production for the 
querying of items in our inventory. The Solr installation was done by an 
engineer who has left the company. About 2 weeks ago, Solr stopped working 
completely (our website wasn't rendering completely and we lost the search 
functionality).

We also couldn't access the Solr dashboard, located in our server at 
https://api.ishippo.com:8282/solr/#

(NB - Solr runs on port 8282 on our server.)

I logged onto the remote server where Solr was installed and ran
> bin/solr status

I got this message -
Found 1 Solr nodes:

Solr process 4365 running on port 8282

ERROR: Failed to get system information from https://localhost:8282/solr due 
to: javax.net.ssl.SSLException: Certificate for <localhost> doesn't match any of the 
subject alternative names: [*.ishippo.com, ishippo.com]

We figured that it could be an SSL issue and tried accessing the Solr dashboard 
through plain HTTP by plugging in our server's IP address. This time, we could 
access the Solr dashboard. But our website works solely by https, so the Solr 
query gets blocked every time.

It seems that only https connections are being blocked by Solr and its port 
(8282). Everything works fine on the other ports, and on http.

We contacted our SSL certificate authority, and they said everything was fine 
from their end. They even made us perform openssl tests and send them the 
output, but they couldn't find any cause from their end. (I have the 
openssl messages returned from the tests, which are long. I can share them if 
someone needs them.)
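
For anyone who wants to check the same thing, the certificate actually served on 
the Solr port can be inspected with something like the following (hostname and 
port as in the error above; exact flags may vary by openssl version):

# print the subject and Subject Alternative Names of the cert served on 8282
openssl s_client -connect api.ishippo.com:8282 -servername api.ishippo.com </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -text | grep -A1 'Subject Alternative Name'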

What could be the issue here? I have tried so many things in order to fix this 
to no avail. Does anybody know what's going on and help a user out?

Thank you for your patience,
iShippo

Here is a summary of the problem -

- Solr dashboard (located on https://api.ishippo.com:8282/solr/#) is not 
accessible.
- Only port 8282 (which Solr runs on) is affected.  Services also running on 
api.ishippo.com on other ports are running fine.
- Solr throws a javax.net.ssl.SSLException error.

- We discovered we are able to access the Solr dashboard by looking up the IP 
address of our server (and not the URL) on http 
(http://52.66.65.108:8282/solr/#)
- Our platform runs solely on HTTPS, so we're not able to go around it by 
using http.
- Our SSL certificate authority couldn't find a cause on their end.