Re: Some performance questions....

2018-03-11 Thread Deepak Goel
On 12 Mar 2018 05:51, "Shawn Heisey"  wrote:

On 3/11/2018 11:35 AM, BlackIce wrote:

> I have some questions regarding performance.
>
> Let's say I have a dual-CPU machine with a total of 8 cores and 24 GB RAM
> for my Solr and some other stuff.
>
> Would it be more beneficial to only run 1 instance of Solr with the
> collection stored on 4 HDs in RAID 0? Or have several virtual
> machines, each running off its own HD, i.e. have 4 VMs running Solr?
>

Performance is always going to be better on bare metal than on virtual
machines.  Virtualization in modern times is really good, so the difference
*might* be minimal, but there is ALWAYS overhead.

*Deepak*

I doubt this. It would be great if someone could substantiate this with
hard facts.
*Deepak*


I used to create virtual machines on my hardware for Solr, initially with
VMware ESXi, then later natively in Linux with KVM.  At that time, I was
running one index core per VM.  Just for some testing, I took a similar
machine and set up one Solr instance handling all the same cores on bare
metal.  I do not remember HOW much faster it was, but it was definitely
faster. One big thing I like about bare metal is that there's only one
"machine", IP address, and Solr instance to administer.

Unless you're willing to completely rebuild the whole thing in the event of
drive failure, don't use RAID0.  If one drive dies (and every hard drive IS
eventually going to die if it's used long enough), then *all* of the data
on the whole RAID volume is gone.

You could do RAID5, which has decent redundancy and good space efficiency,
but if you're not familiar with the RAID5 write penalty, do some research
on it, and you'll probably come out of it not wanting to EVER use it.  If
you like, I can explain exactly why you should avoid any RAID level that
incorporates 5 or 6.

Overall, the best level is RAID10 ... but it has a glaring disadvantage
from a cost perspective -- you lose half of your raw capacity.  Since
drives are relatively cheap, I always build my servers with RAID10, using a
1MB stripe size and a battery-backed caching controller.  For the typical
hardware I'm using, that means that I'm going to end up with 6 to 12TB of
usable space instead of 10 to 20TB (raid5), but the volume is FAST.

Thanks,
Shawn


Re: Some performance questions....

2018-03-11 Thread Shawn Heisey

On 3/11/2018 11:35 AM, BlackIce wrote:

I have some questions regarding performance.

Let's say I have a dual-CPU machine with a total of 8 cores and 24 GB RAM for
my Solr and some other stuff.

Would it be more beneficial to only run 1 instance of Solr with the
collection stored on 4 HDs in RAID 0? Or have several virtual
machines, each running off its own HD, i.e. have 4 VMs running Solr?


Performance is always going to be better on bare metal than on virtual 
machines.  Virtualization in modern times is really good, so the 
difference *might* be minimal, but there is ALWAYS overhead.


I used to create virtual machines on my hardware for Solr, initially 
with VMware ESXi, then later natively in Linux with KVM.  At that time, 
I was running one index core per VM.  Just for some testing, I took a 
similar machine and set up one Solr instance handling all the same cores 
on bare metal.  I do not remember HOW much faster it was, but it was 
definitely faster. One big thing I like about bare metal is that there's 
only one "machine", IP address, and Solr instance to administer.


Unless you're willing to completely rebuild the whole thing in the event 
of drive failure, don't use RAID0.  If one drive dies (and every hard 
drive IS eventually going to die if it's used long enough), then *all* 
of the data on the whole RAID volume is gone.


You could do RAID5, which has decent redundancy and good space 
efficiency, but if you're not familiar with the RAID5 write penalty, do 
some research on it, and you'll probably come out of it not wanting to 
EVER use it.  If you like, I can explain exactly why you should avoid 
any RAID level that incorporates 5 or 6.


Overall, the best level is RAID10 ... but it has a glaring disadvantage 
from a cost perspective -- you lose half of your raw capacity.  Since 
drives are relatively cheap, I always build my servers with RAID10, 
using a 1MB stripe size and a battery-backed caching controller.  For 
the typical hardware I'm using, that means that I'm going to end up with 
6 to 12TB of usable space instead of 10 to 20TB (raid5), but the volume 
is FAST.


Thanks,
Shawn



Re: Solr search engine configuration

2018-03-11 Thread PeterKerk
Sorry for this lengthy post, but I wanted to be complete.

The only occurrence of edismax in solrconfig.xml is this one (the XML tags
were stripped in the archive, leaving only the values):

  edismax
  explicit
  10
  double_score
  false
  *:*



I don't have a requestHandler named "/select".
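
For reference, this is roughly how such defaults are normally declared inside
a request handler in solrconfig.xml (a sketch only; the handler name "/query"
is a placeholder and the placement of the surviving values is guessed):

  <requestHandler name="/query" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="echoParams">explicit</str>
      <int name="rows">10</int>
      <str name="q.alt">*:*</str>
      <!-- "double_score" and "false" presumably belong to other defaults
           (e.g. a boost function or a flag) that did not survive -->
    </lst>
  </requestHandler>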


Also, removing the gramming definitely helped! :-)

I tried to simplify my setup first and then expand, so what I have now is
this (the field type and field definitions were pasted here, but the XML was
stripped in the archive):


In my database I have these 4 values for "title" that populate
"title_search_global"   

"Hi there dier something else"
"Hi there dieren zaak something else"
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

ps. "dier" is singular of plural "dieren". 

Using this query:
http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)=(lang%3A%22nl%22+OR+lang%3A%22all%22)=id%2Ctitle=xml=true=edismax=title_search_global=true=true=true

These results are found:
"Hi there dier something else"
"Hi there dieren zaak something else"

And these are NOT:
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

I'd expect it to be fairly easy (although I don't know how) to also include
the "dierenzaak" result by compounding the 2 query values. And yes, you are
correct: in Dutch "dieren zaak" means the same as "dierenzaak". I'm not sure
what logic would also include "dierzaak".

Regarding your question: yes, I do consider "dieren zaak something else" an
exact match of "dieren zaak".
So I also checked the usage of pf parameters with edismax (based on these
links:
https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html,
http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/)
And also for dismax:
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter

But I can't find any examples of how to actually use these parameters.
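
For what it's worth, this is roughly how those phrase parameters are passed on
the query string with edismax (a sketch only; the boosts are arbitrary):

  http://localhost:8983/solr/search-global/select
      ?q=dieren+zaak
      &defType=edismax
      &qf=title_search_global
      &pf=title_search_global^1000    <- boost docs matching the whole phrase
      &pf2=title_search_global^600    <- boost on word-pair phrase matches
      &pf3=title_search_global^300    <- boost on word-triple phrase matches
      &ps=0                           <- phrase slop: 0 = terms must be adjacent

pf2 and pf3 are edismax-only; plain dismax only has pf (with ps).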


The search results, including debug info, are below (XML tags again stripped,
leaving only values). The response header shows status 0 and QTime 7, followed
by the echoed request parameters:

  title_search_global:(dieren zaak)
  edismax
  true
  true
  title_search_global
  id,title
  (lang:"nl" OR lang:"all")
  xml
  true
  true


Two documents are returned (title / id):

  dieren zaak / 115_3699638
  dier / 115_3699637


The debug section shows the raw and parsed queries:

  rawquerystring / querystring: title_search_global:(dieren zaak)
  parsedquery: (+(title_search_global:dier title_search_global:zaak))/no_coord
  parsedquery_toString: +(title_search_global:dier title_search_global:zaak)

The explain output:

5.489122 = (MATCH) sum of:
  2.4387078 = (MATCH) weight(title_search_global:dier in 51) [DefaultSimilarity], result of:
    2.4387078 = score(doc=51,freq=1.0 = termFreq=1.0), product of:
      0.66654336 = queryWeight, product of:
        5.8539815 = idf(docFreq=3, maxDocs=513)
        0.113861546 = queryNorm
      3.6587384 = fieldWeight in 51, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.8539815 = idf(docFreq=3, maxDocs=513)
        0.625 = fieldNorm(doc=51)
  3.050414 = (MATCH) weight(title_search_global:zaak in 51) [DefaultSimilarity], result of:
    3.050414 = score(doc=51,freq=1.0 = termFreq=1.0), product of:
      0.7454662 = queryWeight, product of:
        6.5471287 = idf(docFreq=1, maxDocs=513)
        0.113861546 = queryNorm
      4.091955 = fieldWeight in 51, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5471287 = idf(docFreq=1, maxDocs=513)
        0.625 = fieldNorm(doc=51)


1.9509662 = (MATCH) product of:
  3.9019325 = (MATCH) sum of:
    3.9019325 = (MATCH) weight(title_search_global:dier in 50) [DefaultSimilarity], result of:
      3.9019325 = score(doc=50,freq=1.0 = termFreq=1.0), product of:
        0.66654336 = queryWeight, product of:
          5.8539815 = idf(docFreq=3, maxDocs=513)
          0.113861546 = queryNorm
        5.8539815 = fieldWeight in 50, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.8539815 = idf(docFreq=3, maxDocs=513)
          1.0 = fieldNorm(doc=50)
  0.5 = coord(1/2)


0.9754831 = (MATCH) product of:
  1.9509662 = (MATCH) sum of:
    1.9509662 = (MATCH) weight(title_search_global:dier in 132) [DefaultSimilarity], result of:
      1.9509662 = score(doc=132,freq=1.0 = termFreq=1.0), product of:
        0.66654336 = queryWeight, product of:
          5.8539815 = idf(docFreq=3, maxDocs=513)
          0.113861546 = queryNorm
        2.9269907 = fieldWeight in 132, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.8539815 =

Re: Some performance questions....

2018-03-11 Thread BlackIce
Second to this: wouldn't 4 Solr instances, each with its own HD, be more fault
tolerant than one Solr instance with 4 HDs in RAID 0? On top of this comes
the storage capacity: I need the capacity of those 4 drives... the more I
read, the more questions.

On Sun, Mar 11, 2018 at 9:43 PM, BlackIce  wrote:

> Thnx for the pointers.
>
> I haven't given much thought to Solr aside from schema.xml and solrconfig.xml,
> and I'm just diving into the deeper stuff!
>
> Greetz
>
> RRK
>
> On Sun, Mar 11, 2018 at 8:58 PM, Deepak Goel  wrote:
>
>> To rephrase your question:
>>
>> "Does Solr do well with scale-up or scale-out?"
>>
>> Are there any performance benchmarks out there addressing this?
>>
>> On 11 Mar 2018 23:05, "BlackIce"  wrote:
>>
>> > Hi,
>> >
>> > I have some questions regarding performance.
>> >
>> > Let's say I have a dual-CPU machine with a total of 8 cores and 24 GB RAM
>> > for my Solr and some other stuff.
>> >
>> > Would it be more beneficial to only run 1 instance of Solr with the
>> > collection stored on 4 HDs in RAID 0? Or have several virtual
>> > machines, each running off its own HD, i.e. have 4 VMs running Solr?
>> >
>> > Any Thoughts?
>> >
>> > Thank you!
>> >
>> > RRK
>> >
>>
>
>


Re: Some performance questions....

2018-03-11 Thread BlackIce
Thnx for the pointers.

I haven't given much thought to Solr aside from schema.xml and solrconfig.xml,
and I'm just diving into the deeper stuff!

Greetz

RRK

On Sun, Mar 11, 2018 at 8:58 PM, Deepak Goel  wrote:

> To rephrase your question:
>
> "Does Solr do well with scale-up or scale-out?"
>
> Are there any performance benchmarks out there addressing this?
>
> On 11 Mar 2018 23:05, "BlackIce"  wrote:
>
> > Hi,
> >
> > I have some questions regarding performance.
> >
> > Let's say I have a dual-CPU machine with a total of 8 cores and 24 GB RAM
> > for my Solr and some other stuff.
> >
> > Would it be more beneficial to only run 1 instance of Solr with the
> > collection stored on 4 HDs in RAID 0? Or have several virtual
> > machines, each running off its own HD, i.e. have 4 VMs running Solr?
> >
> > Any Thoughts?
> >
> > Thank you!
> >
> > RRK
> >
>


Re: Some performance questions....

2018-03-11 Thread Deepak Goel
To rephrase your question:

"Does Solr do well with scale-up or scale-out?"

Are there any performance benchmarks out there addressing this?

On 11 Mar 2018 23:05, "BlackIce"  wrote:

> Hi,
>
> I have some questions regarding performance.
>
> Let's say I have a dual-CPU machine with a total of 8 cores and 24 GB RAM
> for my Solr and some other stuff.
>
> Would it be more beneficial to only run 1 instance of Solr with the
> collection stored on 4 HDs in RAID 0? Or have several virtual
> machines, each running off its own HD, i.e. have 4 VMs running Solr?
>
> Any Thoughts?
>
> Thank you!
>
> RRK
>


Re: Solr search engine configuration

2018-03-11 Thread Erick Erickson
bq: I tried the query with and without the defType=edismax parameter but I'm
getting the EXACT same results. Does that mean some configuration error?

Well, not an error at all. This line:

  ExtendedDismaxQParser

means you're using edismax. If that happens both with and without
defType=edismax on the URL, that means your request handler in
solrconfig.xml has this defined as a default. Look for an entry like:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">edismax</str>

So any search you send to Solr like
http://blah blah/solr/collection/select?

will use edismax if no defType overrides it on the URL.

---
Let's talk about what "exact match" means ;)


Exact match "dieren zaak". Does "Exact match" here mean it would or
would not be an exact match on "dieren zaak something else"?

If you do NOT consider the above "exact match", the usual trick is to
use a copyField directive to a field that uses KeywordTokenizerFactory
(probably) followed by LowerCaseFilterFactory etc.
KeywordTokenizerFactory takes the entire input field as a _single_
token, then you can transform it various ways, things like folding
accents, lowercasing and the like if desired.
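
Something like this in schema.xml (a sketch; the type and field names are
made up):

  <fieldType name="text_exactish" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="title_exact" type="text_exactish" indexed="true" stored="false"/>
  <copyField source="title" dest="title_exact"/>

You'd then query (or boost) on title_exact for that strict flavor of "exact
match".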

If you DO consider the above "exact match", take a look at the pf, pf2
and pf3 parameters in edismax. They're all about forming phrases,
bigrams and trigrams respectively for this form of "exact match".

Exact match "dierenzaak". This one is tricky. There's little OOB that
understands that "dieren zaak" is equivalent to "dierenzaak". I know
that in German there's prior art on "decompounding" filters, I don't
know about Dutch. Further, given my total lack of understanding the
rules of either language I don't know if it does "compounding" too,
i.e. understanding that "dieren zaak" is equivalent to "dierenzaak".
Can't help much there.
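
That said, Solr does ship a dictionary-based decompounding filter that is
language-agnostic as long as you can supply a word list. A sketch
(dutch-words.txt is a placeholder you would have to provide):

  <fieldType name="text_nl_decompound" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
              dictionary="dutch-words.txt"
              minWordSize="5"
              minSubwordSize="3"
              maxSubwordSize="15"
              onlyLongestMatch="false"/>
    </analyzer>
  </fieldType>

With "dieren" and "zaak" in the dictionary, "dierenzaak" gets indexed with
those sub-words as extra tokens, so a query on "dieren zaak" can match it.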

For a start I'd get rid of the gramming until I'd explored other
alternatives. Gramming is generally a good thing for pre-and-post
wildcards, i.e. matching *some*. Since you're concerned with
relevance, I suspect that gramming will make your task harder.

And if you haven't discovered the admin UI/analysis page, I recommend
you spend some time with it (hint, un-check the "verbose" checkbox).
As you play with various combinations of tokenizers and filters it'll
give you a much better understanding of what the effects of various
combinations are.

If only human language followed strict rules ;)

Professor:"In English, two negatives are
allowed and mean a positive, but two positives don't mean a negative."
Bored voice from the back: "Yeah, right".

Erick

On Sun, Mar 11, 2018 at 5:19 AM, PeterKerk  wrote:
> Thanks! That provides me with some more insight, I altered the search query
> to "dieren zaak" to see how queries consisting of more than 1 word are
> handled.
> I see that words are tokenized into groups of 3, I think because of my
> NGramFilterFactory with minGramSize of 3.
>
> 
> 
> (title_search_global:(dieren zaak) OR 
> description_search_global:(dieren
> zaak))
> 
> 
> (title_search_global:(dieren zaak) OR 
> description_search_global:(dieren
> zaak))
> 
> 
> (+(((title_search_global:die title_search_global:ier
> title_search_global:ere title_search_global:ren title_search_global:dier
> title_search_global:iere title_search_global:eren title_search_global:diere
> title_search_global:ieren title_search_global:dieren)
> (title_search_global:zaa title_search_global:aak title_search_global:zaak))
> (((description_search_global:dier description_search_global:diere
> description_search_global:dieren)/no_coord)
> description_search_global:zaak)))/no_coord
> 
> 
> +(((title_search_global:die title_search_global:ier 
> title_search_global:ere
> title_search_global:ren title_search_global:dier title_search_global:iere
> title_search_global:eren title_search_global:diere title_search_global:ieren
> title_search_global:dieren) (title_search_global:zaa title_search_global:aak
> title_search_global:zaak)) ((description_search_global:dier
> description_search_global:diere description_search_global:dieren)
> description_search_global:zaak))
> 
> ExtendedDismaxQParser
> 
> 
> 
> 
> 
> (lang:"nl" OR lang:"all")
> 
> 
> lang:nl lang:all
> 
> 
>
>
> I tried the query with and without the defType=edismax parameter but I'm
> getting the EXACT same results. Does that mean some configuration error?
>
> I'm not sure how to progress from here. Can you see if your presumption that
> I'm mixing two different parsers is correct? My schema.xml is here:
> http://www.telefonievergelijken.nl/schema.xml
>
>
> Related: do you know of the existence of any sample schema.xml config that
> would be usable for a search engine? Seems like something so obvious to
> float around out there. I feel that would go a long way.
>
>
>
> Not sure if it matters but my requirements are:
>
> Exact match 

CLUSTERSTATUS API and Error loading specified collection / config in Solr 5.3.2.

2018-03-11 Thread Atita Arora
Hi,

I am working on an application that runs on a highly distributed Solr cloud
environment. The application supports multi-tenancy and we have around
250-300 collections in Solr, where each client has their own collection and a
new shard is created as clientid-<timestamp>, where the timestamp is whenever
new data comes in for the client (typically every 4-8 hrs). The reason for
this convention is to make sure that when the indexes are built (on demand)
the timestamp closely matches the time the last indexing was run (the earlier
shard is de-provisioned as soon as the new one is created). Whenever indexing
is triggered, it first makes a DB entry and then creates a catalog with the
timestamp in Solr.
The Solr cloud has 10 Nodes distributed geographically among 10 datacenters.
The replication factor is 2. The Solr version is 5.3.2.
Coming to my problem: I had to write a utility to verify that the DB insert
timestamp closely matches the Solr index timestamp, so that if the difference
between the DB timestamp and the Solr index timestamp is <= 2 hrs, we know we
have a fresh index. The new index contains revised prices of products, offers,
etc., which are critical to update as and when they come. Hence this utility
tracks that the required updates have been made successfully.
I used the *CLUSTERSTATUS* API for this task. It has been serving the purpose
well so far, but pretty recently our Solr cloud started complaining of strange
things, because of which the *CLUSTERSTATUS* API keeps returning an error.
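
For reference, the check boils down to a CLUSTERSTATUS call along these lines
(the host is a placeholder and the collection name is taken from the error
below; parsing the timestamp out of the clientid-<timestamp> name afterwards
is just one way to do the comparison):

  http://<solr-host>:8983/solr/admin/collections?action=CLUSTERSTATUS
      &collection=1785-1520548816454&wt=json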

The error complains about a missing config and sometimes a missing
collection, like:

org.apache.solr.common.SolrException: Could not find collection :
> 1785-1520548816454

org.apache.solr.common.SolrException: Could not find collection :
1785-1520548816454
at
org.apache.solr.common.cloud.ClusterState.getCollection(ClusterState.java:165)
at
org.apache.solr.handler.admin.ClusterStatus.getClusterStatus(ClusterStatus.java:110)
at
org.apache.solr.handler.admin.CollectionsHandler$CollectionOperation$19.call(CollectionsHandler.java:614)
at
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:166)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:678)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:444)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:215)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)

At other times it complains of a missing config for the same or a different
clientid-<timestamp>, like:

1532-1518669619526_shard1_replica3:
org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException:
Specified config does not exist in ZooKeeper:1532-1518669619526I

I would really appreciate it if:


   1. Someone could guide me as to what is going on with the Solr cloud.
   2. Someone could tell me whether CLUSTERSTATUS is the right pick for
      building such a utility. Do we have any other option?


Thanks for any pointers and suggestions.

I appreciate your taking the time to look through this.

Atita


Re: What are decent disk I/O for Solr and Zookeeper?

2018-03-11 Thread Dominique Bejean
Hi Shawn,

I agree about disk I/O versus available memory where Solr performance is
concerned. However, in a heavy indexing and heavy searching context, even
with a lot of RAM, disk I/O can still be critical.

My concern is also about write I/O for the ZooKeeper transaction log. My
understanding is that this is critical not so much for SolrCloud performance,
but mainly for SolrCloud stability.
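
On that point, ZooKeeper lets the transaction log be placed on its own (fast)
device, which is the usual recommendation. A minimal zoo.cfg sketch, with the
paths as placeholders:

  # snapshots
  dataDir=/var/lib/zookeeper/data
  # transaction log on a dedicated low-latency disk
  dataLogDir=/mnt/fast-disk/zookeeper/txnlog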

Sometimes, even with best practices respected and all possible configuration
tuning, SolrCloud is not stable or not performant due to a lack of hardware
resources. Monitoring CPU, CPU load, iowait, JVM GC, etc. should highlight
this lack of resources. If the hardware is undersized, we need metrics in
order to explain and demonstrate this to the customer (especially if the
infrastructure provider does not want to admit there are issues with the
hardware or virtualization). That was the meaning of my question about
"decent disk I/O".
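
A simple way to capture those disk metrics on the nodes is iostat from the
sysstat package (the interval is arbitrary):

  # extended device statistics every 5 seconds; watch %iowait on the CPU line
  # plus await and %util on the devices holding the Solr index and the
  # ZooKeeper transaction log
  iostat -x 5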

Regards

Dominique


On Fri, Mar 9, 2018 at 00:40, Shawn Heisey  wrote:

> On 3/8/2018 2:55 PM, Dominique Bejean wrote:
> > Disk I/O are critical for high performance Solrcloud.
>
> This statement has truth to it, but if your system is correctly sized,
> disk performance will not have much of an impact on Solr performance.
> If upgrading to faster disks does improve long-term query performance,
> the system probably doesn't have enough memory installed.  There can be
> other causes, but that is the most common.
>
> When there is enough memory available to allow the operating system to
> effectively cache the index data, Solr will not need to access the disk
> much at all for queries -- all that data will be already in memory.
> Indexing will still be dependent on disk performance even when there is
> plenty of memory, because that will require writing new data to the disk.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> This is my hammer.  To me, your question looks like a nail.  :)
>
> Thanks,
> Shawn
>
> --
Dominique Béjean
06 08 46 12 43


Some performance questions....

2018-03-11 Thread BlackIce
Hi,

I have some questions regarding performance.

Let's say I have a dual-CPU machine with a total of 8 cores and 24 GB RAM for
my Solr and some other stuff.

Would it be more beneficial to only run 1 instance of Solr with the
collection stored on 4 HDs in RAID 0? Or have several virtual
machines, each running off its own HD, i.e. have 4 VMs running Solr?

Any Thoughts?

Thank you!

RRK


Re: Solr search engine configuration

2018-03-11 Thread PeterKerk
Thanks! That provides me with some more insight. I altered the search query
to "dieren zaak" to see how queries consisting of more than 1 word are
handled.
I see that words are tokenized into groups of 3, I think because of my
NGramFilterFactory with minGramSize of 3.



The debug output (XML tags stripped, leaving only the values; labels restored):

rawquerystring / querystring:
  (title_search_global:(dieren zaak) OR description_search_global:(dieren
  zaak))

parsedquery:
  (+(((title_search_global:die title_search_global:ier
  title_search_global:ere title_search_global:ren title_search_global:dier
  title_search_global:iere title_search_global:eren title_search_global:diere
  title_search_global:ieren title_search_global:dieren)
  (title_search_global:zaa title_search_global:aak title_search_global:zaak))
  (((description_search_global:dier description_search_global:diere
  description_search_global:dieren)/no_coord)
  description_search_global:zaak)))/no_coord

parsedquery_toString:
  +(((title_search_global:die title_search_global:ier title_search_global:ere
  title_search_global:ren title_search_global:dier title_search_global:iere
  title_search_global:eren title_search_global:diere title_search_global:ieren
  title_search_global:dieren) (title_search_global:zaa title_search_global:aak
  title_search_global:zaak)) ((description_search_global:dier
  description_search_global:diere description_search_global:dieren)
  description_search_global:zaak))

QParser: ExtendedDismaxQParser

filter_queries: (lang:"nl" OR lang:"all")
parsed_filter_queries: lang:nl lang:all


I tried the query with and without the defType=edismax parameter but I'm
getting the EXACT same results. Does that mean some configuration error?

I'm not sure how to progress from here. Can you see if your presumption that
I'm mixing two different parsers is correct? My schema.xml is here:
http://www.telefonievergelijken.nl/schema.xml


Related: do you know of the existence of any sample schema.xml config that
would be usable for a search engine? Seems like something so obvious to
float around out there. I feel that would go a long way.



Not sure if it matters but my requirements are:

Exact match "dieren zaak" boost result with 1000 
Exact match "dierenzaak" boost result with 900 
Exact match "dieren" or "zaak" boost result with 600 

Partial match "huisdierenzaak" or "huisdieren zaak" boost result with 500 
Stem match "dier" boost result with 100 
Stem partial match "huisdier" boost result with 70 
Other partial matches "die" boost result with 10 




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html