Which fields matched?

2012-12-07 Thread Jeff Wartes

If I have an arbitrarily complex query that uses ORs, something like:
q=(simple_fieldtype:foo OR complex_fieldtype:foo) AND 
(another_simple_fieldtype:bar OR another_complex_fieldtype:bar)

I want to know which fields actually contributed to the match for each document 
returned. Something like:
docID=1, 
fields_matched=simple_fieldtype,complex_fieldtype,another_complex_fieldtype
docID=2, fields_matched=simple_fieldtype,another_complex_fieldtype


My basic use case is that I have several copyField'ed variations on the same 
data (using different complex FieldTypes), and I want to know which variations 
contributed to the document so I can conclude things like "Well, this document 
matched the field with the SynonymFilterFactory, but not the one without, so 
this particular document must've been a synonym match."

I know you could probably lift this from debugQuery output, but that's a 
non-starter due to parsing complexity and query performance impact.
I think you could edge into some of this using the HighlightComponent output, 
but that's a non-starter because it requires fields be stored=true. Most of my 
fieldTypes are intended solely for indexing/search, and make no sense from a 
stored/retrieval standpoint. And to be clear, I really don't care about which 
terms matched anyway, only which fields.

If there's an easy way to get this, I'd love to hear it. Otherwise, I'm mostly 
looking for a head start on where to go looking for this data so I can add my 
own Component or something - assuming the data is even available in the solr 
layer?

Thanks.


RE: Which fields matched?

2012-12-07 Thread Jeff Wartes
Thanks, I did start to dig into how DebugComponent does its thing a little, and 
I'm not all the way down the rabbit hole yet, but the lucene indexSearcher's 
explain() method has this comment:

This is intended to be used in developing Similarity implementations, and, for 
good performance, should not be displayed with every hit. Computing an 
explanation is as expensive as executing the query over the entire index.

Which makes me wonder if I'd get almost all of the debugQuery=true performance 
penalty anyway if I try to do as you suggest.


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Friday, December 07, 2012 10:47 AM
To: solr-user@lucene.apache.org
Subject: Re: Which fields matched?

The debugQuery explain is simply a text display of what Lucene has already 
calculated. As such, you could do a custom search component that gets the 
non-text Lucene Explanation object for the query and then traverse it to get 
your matched field list without all the text. No parsing would be required, but 
the Explanation structure could get messy.
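A rough sketch of the traversal Jack describes, assuming the usual "weight(field:term ...)" 
wording in term-level Explanation descriptions (the parsing here is illustrative only, not a 
definitive recipe; the Explanation itself would come from the searcher's explain(query, docid) 
inside a custom component):

import java.util.Set;
import java.util.TreeSet;
import org.apache.lucene.search.Explanation;

public class MatchedFieldCollector {
  // Walk an Explanation tree and collect the field names of matching clauses.
  public static Set<String> collectFields(Explanation expl) {
    Set<String> fields = new TreeSet<String>();
    walk(expl, fields);
    return fields;
  }

  private static void walk(Explanation expl, Set<String> fields) {
    if (expl == null || !expl.isMatch()) {
      return; // only descend into clauses that actually matched
    }
    String desc = expl.getDescription();
    int open = desc.indexOf("weight(");
    if (open >= 0) {
      int colon = desc.indexOf(':', open);
      if (colon > open) {
        fields.add(desc.substring(open + "weight(".length(), colon));
      }
    }
    Explanation[] details = expl.getDetails();
    if (details != null) {
      for (Explanation detail : details) {
        walk(detail, fields);
      }
    }
  }
}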

-- Jack Krupansky

-Original Message-
From: Jeff Wartes
Sent: Friday, December 07, 2012 11:59 AM
To: solr-user@lucene.apache.org
Subject: Which fields matched?


If I have an arbitrarily complex query that uses ORs, something like:
q=(simple_fieldtype:foo OR complex_fieldtype:foo) AND 
(another_simple_fieldtype:bar OR another_complex_fieldtype:bar)

I want to know which fields actually contributed to the match for each 
document returned. Something like:
docID=1, 
fields_matched=simple_fieldtype,complex_fieldtype,another_complex_fieldtype
docID=2, fields_matched=simple_fieldtype,another_complex_fieldtype


My basic use case is that I have several copyField'ed variations on the same 
data (using different complex FieldTypes), and I want to know which 
variations contributed to the document so I can conclude things like "Well, 
this document matched the field with the SynonymFilterFactory, but not the 
one without, so this particular document must've been a synonym match."

I know you could probably lift this from debugQuery output, but that's a 
non-starter due to parsing complexity and query performance impact.
I think you could edge into some of this using the HighlightComponent 
output, but that's a non-starter because it requires fields be stored=true. 
Most of my fieldTypes are intended solely for indexing/search, and make no 
sense from a stored/retrieval standpoint. And to be clear, I really don't 
care about which terms matched anyway, only which fields.

If there's an easy way to get this, I'd love to hear it. Otherwise, I'm 
mostly looking for a head start on where to go looking for this data so I 
can add my own Component or something - assuming the data is even available 
in the solr layer?

Thanks. 



RE: Which fields matched?

2012-12-11 Thread Jeff Wartes
Thanks, this is good stuff, I hadn't seen LUCENE-1999, and it even has a 
reference to some methods now in core.

I think I'm still stuck for the moment though. I'm pretty fixed on Solr 3.5 for 
the next few development cycles, and I've been trying really hard to avoid 
compiling my own Solr - as it appears most of these approaches would require. 
(I'm not above inserting my own jars into the Stock Solr WAR, but I have to 
draw the line someplace.)

I'm going to spend some time with the highlight component and see if I can get 
something working with that. The basic String-version of my fields (which I 
copyField into many indexed-only field types) is stored, so I'm thinking I 
might be able to use hl.q to at least answer a basic "did this document match 
the original query unaltered" question.



-Original Message-
From: Paul Libbrecht [mailto:p...@hoplahup.net] 
Sent: Saturday, December 08, 2012 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Which fields matched?

We've used lucene-1999 with some success in ActiveMath to find the language 
that was matched.

paul


On 8 Dec 2012, at 10:09, Mikhail Khludnev wrote:

 Jeff,
 The explain() algorithm is definitely too slow to be used at search time. 
 There is an approach I'm aware of - watch the scorers during the 
 search: if a scorer matches some doc, then _at some moment_ 
 scorer.docID()==docNum.
 My team successfully implemented such a "match spotting" algorithm; it 
 performs quite well, and provides info like http://goo.gl/7vgrB The 
 problem with this algorithm is that it's tightly coupled to low-level 
 scorer behavior, which is sometimes counter-intuitive and changes due 
 to performance optimizations in the Lucene core.
 https://issues.apache.org/jira/browse/LUCENE-1999 sounds almost the 
 same, but I never looked into the source.
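A bare-bones sketch of that "match spotting" idea, assuming the Lucene 4.x Collector and 
Scorer.getChildren() APIs (the class name is made up, and what you record per hit is left as 
a comment):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Scorer.ChildScorer;

public class MatchSpottingCollector extends Collector {
  private final List<Scorer> leafScorers = new ArrayList<Scorer>();
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    leafScorers.clear();
    flatten(scorer);
  }

  // Recursively gather the leaf scorers (e.g. one TermScorer per field clause).
  private void flatten(Scorer scorer) {
    Collection<ChildScorer> children = scorer.getChildren();
    if (children == null || children.isEmpty()) {
      leafScorers.add(scorer);
    } else {
      for (ChildScorer child : children) {
        flatten(child.child);
      }
    }
  }

  @Override
  public void collect(int doc) throws IOException {
    for (Scorer leaf : leafScorers) {
      if (leaf.docID() == doc) {
        // This leaf scorer is positioned on the current doc, so its clause
        // (and therefore its field) contributed to matching doc + docBase.
      }
    }
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    this.docBase = context.docBase;
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return false;
  }
}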
 
 
 On Fri, Dec 7, 2012 at 11:00 PM, Jeff Wartes jwar...@whitepages.com wrote:
 
 Thanks, I did start to dig into how DebugComponent does its thing a 
 little, and I'm not all the way down the rabbit hole yet, but the 
 lucene indexSearcher's explain() method has this comment:
 
 This is intended to be used in developing Similarity 
 implementations, and, for good performance, should not be displayed with 
 every hit.
 Computing an explanation is as expensive as executing the query over 
 the entire index.
 
 Which makes me wonder if I'd get almost all of the debugQuery=true 
 performance penalty anyway if I try to do as you suggest.
 
 
 -Original Message-
 From: Jack Krupansky [mailto:j...@basetechnology.com]
 Sent: Friday, December 07, 2012 10:47 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Which fields matched?
 
 The debugQuery explain is simply a text display of what Lucene has 
 already calculated. As such, you could do a custom search component 
 that gets the non-text Lucene Explanation object for the query and 
 then traverse it to get your matched field list without all the text. 
 No parsing would be required, but the Explanation structure could get messy.
 
 -- Jack Krupansky
 
 -Original Message-
 From: Jeff Wartes
 Sent: Friday, December 07, 2012 11:59 AM
 To: solr-user@lucene.apache.org
 Subject: Which fields matched?
 
 
 If I have an arbitrarily complex query that uses ORs, something like:
 q=(simple_fieldtype:foo OR complex_fieldtype:foo) AND 
 (another_simple_fieldtype:bar OR another_complex_fieldtype:bar)
 
 I want to know which fields actually contributed to the match for 
 each document returned. Something like:
 docID=1,
 fields_matched=simple_fieldtype,complex_fieldtype,another_complex_fie
 ldtype docID=2, 
 fields_matched=simple_fieldtype,another_complex_fieldtype
 
 
 My basic use case is that I have several copyField'ed variations on 
 the same data (using different complex FieldTypes), and I want to 
 know which variations contributed to the document so I can conclude 
 things like Well, this document matched the field with the 
 SynonymFilterFactory, but not the one without, so this particular 
 document must've been a synonym match.
 
 I know you could probably lift this from debugQuery output, but 
 that's a non-starter due to parsing complexity and query performance impact.
 I think you could edge into some of this using the HighlightComponent 
 output, but that's a non-starter because it requires fields be stored=true.
 Most of my fieldTypes are intended solely for indexing/search, and 
 make no sense from a stored/retrieval standpoint. And to be clear, I 
 really don't care about which terms matched anyway, only which fields.
 
 If there's an easy way to get this, I'd love to hear it. Otherwise, 
 I'm mostly looking for a head start on where to go looking for this 
 data so I can add my own Component or something - assuming the data 
 is even available in the solr layer?
 
 Thanks.
 
 
 
 
 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics
 
 http://www.griddynamics.com
 mkhlud...@griddynamics.com



RE: Solr load balancer

2013-01-31 Thread Jeff Wartes

For what it's worth, Google has done some pretty interesting research into 
coping with the idea that particular shards might very well be busy doing 
something else when your query comes in.

Check out this slide deck: http://research.google.com/people/jeff/latency.html
Lots of interesting ideas, but in particular, around slide 39 he talks about 
backup requests where you wait for something like your typical response time 
and then issue a second request to a different shard. You take whichever answer 
you get first, and cancel the other. The initial wait + cancellation means your 
extra cluster load is minimal, and you still get the benefit of reducing your 
p95+ response times if the first request was high-latency due to something 
unrelated to the query. (Say, GC.)
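To make the idea concrete, here's a minimal client-side sketch (plain java.util.concurrent, 
with the actual Solr calls left as hypothetical Callables) of waiting roughly the typical 
latency before firing the backup request and keeping whichever response arrives first:

import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BackupRequest {
  // Issue 'primary', and if it hasn't answered within typicalLatencyMs,
  // also issue 'backup'; return whichever response comes back first.
  static <T> T queryWithBackup(ExecutorService pool, Callable<T> primary,
                               Callable<T> backup, long typicalLatencyMs)
      throws Exception {
    CompletionService<T> cs = new ExecutorCompletionService<T>(pool);
    Future<T> first = cs.submit(primary);
    Future<T> done = cs.poll(typicalLatencyMs, TimeUnit.MILLISECONDS);
    if (done != null) {
      return done.get();                  // primary answered within typical latency
    }
    Future<T> second = cs.submit(backup); // fire the backup request
    done = cs.take();                     // first response wins
    first.cancel(true);                   // cancelling the finished one is a no-op...
    second.cancel(true);                  // ...and, as noted below, cancelling a Solr
                                          // request doesn't stop work on the server
    return done.get();
  }
}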

Of course, a central principle of this approach is being able to cancel a query 
and have it stop consuming resources. I'd love to be corrected, but I don't 
think Solr allows this. You can stop waiting for a response, but even the 
timeAllowed param doesn't seem to stop resource usage after the allotted time.  
Meaning, a few exceptionally long-running queries can take out your 
high-throughput cluster by tying up entire CPUs for long periods.

Let me know the JIRA number, I'd love to see work in this area.


-Original Message-
From: Phil Hoy [mailto:p...@brightsolid.com] 
Sent: Tuesday, January 29, 2013 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Solr load balancer

Hi Erick,

Thanks, I have read the blogs you cited and I found them very interesting, and 
we have tuned the jvm accordingly but still we get the odd longish gc pause. 

That said we perhaps have an unusual setup; we index a lot of small documents 
using servers with ssd's and 128 GB RAM in a sharded set up with replicas and 
our queries rely heavily on query filters and faceting with minimal free-text 
style searching. For that reason we rely heavily on the filter cache to improve 
query latency, therefore we assign a large percentage of available ram to the 
jvm hosting solr. 

Anyhow we are happy with the current configuration and performance profile, 
aside from the odd gc pause that is, and as we have index replicas it seems to 
me that we should be able to cope, hence my willingness to tweak how the load 
balancer behaves.

Thanks,
Phil



-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: 20 January 2013 15:56
To: solr-user@lucene.apache.org
Subject: Re: Solr load balancer

Hmmm, the first thing I'd look at is why you are having long GC pauses. Here's 
a great place to start:

http://www.lucidimagination.com/blog/2011/03/27/garbage-collection-bootcamp-1-0/
and:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

I've wondered about a similar approach, but by firing off the same query to 
multiple nodes in your cluster, you'll be effectively doubling (at least) the 
load on your system. Leading to more memory issues perhaps in a non-virtuous 
cycle.

FWIW,
Erick

On Fri, Jan 18, 2013 at 5:41 AM, Phil Hoy p...@brightsolid.com wrote:
 Hi,

 I would like to experiment with some custom load balancers to help with query 
 latency in the face of long gc pauses and the odd time-consuming query that 
 we need to be able to support. At the moment setting the socket timeout via 
 the HttpShardHandlerFactory does help, but of course it can only be set to a 
 length of time as long as the most time consuming query we are likely to 
 receive.

 For example perhaps a load balancer that sends multiple queries concurrently 
 to all/some replicas and only keeps the first response might be effective. Or 
 maybe a load balancer which takes account of the frequency of timeouts would 
 be able to recognize zombies more effectively.

 To use alternative load balancer implementations cleanly and without having 
 to hack solr directly, I would need to be able to make the existing 
 LBHttpSolrServer and HttpShardHandlerFactory more amenable to extension, I 
 can then override the default load balancer using solr's plugin mechanism.

 So my question is, if I made a patch to make the load balancer more 
 pluggable, is this something that would be acceptable and if so what do I do 
 next?

 Phil


Can't mix Synonyms with Shingles?

2011-08-10 Thread Jeff Wartes

I would like to combine the ShingleFilterFactory with a SynonymFilterFactory in 
a field type. 

I've looked at something like this using the analysis.jsp tool: 

<fieldType name="TestTerm" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" stemEnglishPosessive="1"/>
    <filter class="solr.ShingleFilterFactory" tokenSeparator=""/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.BusinessNames.txt" ignoreCase="true" expand="true"/>
    ...
  </analyzer>
  <analyzer type="query">
    ...
  </analyzer>
</fieldType>

However, when a ShingleFilterFactory is applied first, the SynonymFilterFactory 
appears to do nothing. 
I haven't found any documentation or other warnings against this combination, 
and I don't want to apply shingles after synonyms (this works) because 
multi-word synonyms then cause severe term expansion. I don't really mind if 
the synonyms fail to match shingles, (although I'd prefer they succeed) but I'd 
at least expect that synonyms would continue to match on the original tokens, 
as they do if I remove the ShingleFilterFactory.

I'm using Solr 3.3, any clarification would be appreciated.

Thanks,
  -Jeff Wartes



RE: Can't mix Synonyms with Shingles?

2011-08-10 Thread Jeff Wartes

Hi Steven,

The token separator was certainly a deliberate choice, are you saying that 
after applying shingles, synonyms can only match shingled terms? The term 
analysis suggests the original tokens still exist. 
You've made me realize that only certain synonyms seem to have problems though, 
so it's not a blanket failure.

Take this synonym definition:
wamu, washington mutual bank, washington mutual

Indexing "wamu" looks like it'll work fine - there are no shingles, and all 
three synonym expansions appear to get indexed. (expand=true) However, 
indexing "washington mutual" applies the shingles correctly (adds 
"washingtonmutual" to position 1), but the synonym expansion does not happen. I 
would still expect the synonym definition to match the original terms and index 
'wamu' along with the other stuff.

Thanks.



-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu] 
Sent: Wednesday, August 10, 2011 12:54 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't mix Synonyms with Shingles?

Hi Jeff,

You have configured ShingleFilterFactory with a token separator of "", so e.g. 
"International Corporation" will output the shingle "InternationalCorporation". 
If this is the form you want to use for synonym matching, it must exist in 
your synonym file. Does it?

Steve

 -Original Message-
 From: Jeff Wartes [mailto:jwar...@whitepages.com]
 Sent: Wednesday, August 10, 2011 3:43 PM
 To: solr-user@lucene.apache.org
 Subject: Can't mix Synonyms with Shingles?
 
 
 I would like to combine the ShingleFilterFactory with a 
 SynonymFilterFactory in a field type.
 
 I've looked at something like this using the analysis.jsp tool:
 
 <fieldType name="TestTerm" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="1" stemEnglishPosessive="1"/>
     <filter class="solr.ShingleFilterFactory" tokenSeparator=""/>
     <filter class="solr.SynonymFilterFactory"
             synonyms="synonyms.BusinessNames.txt" ignoreCase="true" expand="true"/>
     ...
   </analyzer>
   <analyzer type="query">
     ...
   </analyzer>
 </fieldType>
 
 However, when a ShingleFilterFactory is applied first, the 
 SynonymFilterFactory appears to do nothing.
 I haven't found any documentation or other warnings against this 
 combination, and I don't want to apply shingles after synonyms (this
 works) because multi-word synonyms then cause severe term expansion. I 
 don't really mind if the synonyms fail to match shingles, (although 
 I'd prefer they succeed) but I'd at least expect that synonyms would 
 continue to match on the original tokens, as they do if I remove the 
 ShingleFilterFactory.
 
 I'm using Solr 3.3, any clarification would be appreciated.
 
 Thanks,
   -Jeff Wartes



RE: Can't mix Synonyms with Shingles?

2011-08-10 Thread Jeff Wartes

After some further playing around, I think I understand what's going on. 
Because the SynonymFilterFactory pays attention to term position when it 
inserts a multi-word synonym, I had assumed it scanned for matches in a way 
that respected term position as well. (ie, for a two-word synonym, I assumed it 
would try to find the second word in position n+1 if it found the first word in 
position n) 

This does not appear to be the case. It appears to find multi-word synonym 
matches by simply walking the list of terms, exhausting all the terms in 
position one before looking at any terms in position two. The ShingleFilter 
adds terms to most positions, so that throws off the 'adjacency' of the 
flattened list of terms. Meaning, a two-word synonym can only match if the 
synonym consists of the original term (position 1) followed by the added 
shingle (also in position 1).
Perhaps a better description: if you're looking at the analysis.jsp display, 
it does not scan for multi-word synonym tokens "across then down"; it scans 
"down then across".
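To illustrate with the "washington mutual" example from earlier in the thread, the flattened 
token stream the synonym filter walks looks like this (as analysis.jsp would show it):

position 1: washington, washingtonmutual
position 2: mutual

Since it exhausts position 1 before moving to position 2, the adjacent pair it sees is 
"washington" followed by "washingtonmutual", never "washington" followed by "mutual", so the 
two-word synonym entry never gets a chance to fire.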


It doesn't look like there's a way to do what I'm trying to do (index shingles 
AND multi-word synonyms in one field) without writing my own filter.


-Original Message-
From: Jeff Wartes [mailto:jwar...@whitepages.com] 
Sent: Wednesday, August 10, 2011 1:27 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't mix Synonyms with Shingles?


Hi Steven,

The token separator was certainly a deliberate choice, are you saying that 
after applying shingles, synonyms can only match shingled terms? The term 
analysis suggests the original tokens still exist. 
You've made me realize that only certain synonyms seem to have problems though, 
so it's not a blanket failure.

Take this synonym definition:
wamu, washington mutual bank, washington mutual

Indexing "wamu" looks like it'll work fine - there are no shingles, and all 
three synonym expansions appear to get indexed. (expand=true) However, 
indexing "washington mutual" applies the shingles correctly (adds 
"washingtonmutual" to position 1), but the synonym expansion does not happen. I 
would still expect the synonym definition to match the original terms and index 
'wamu' along with the other stuff.

Thanks.



-Original Message-
From: Steven A Rowe [mailto:sar...@syr.edu]
Sent: Wednesday, August 10, 2011 12:54 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't mix Synonyms with Shingles?

Hi Jeff,

You have configured ShingleFilterFactory with a token separator of "", so e.g. 
"International Corporation" will output the shingle "InternationalCorporation". 
If this is the form you want to use for synonym matching, it must exist in 
your synonym file. Does it?

Steve

 -Original Message-
 From: Jeff Wartes [mailto:jwar...@whitepages.com]
 Sent: Wednesday, August 10, 2011 3:43 PM
 To: solr-user@lucene.apache.org
 Subject: Can't mix Synonyms with Shingles?
 
 
 I would like to combine the ShingleFilterFactory with a 
 SynonymFilterFactory in a field type.
 
 I've looked at something like this using the analysis.jsp tool:
 
 <fieldType name="TestTerm" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
             generateNumberParts="1" stemEnglishPosessive="1"/>
     <filter class="solr.ShingleFilterFactory" tokenSeparator=""/>
     <filter class="solr.SynonymFilterFactory"
             synonyms="synonyms.BusinessNames.txt" ignoreCase="true" expand="true"/>
     ...
   </analyzer>
   <analyzer type="query">
     ...
   </analyzer>
 </fieldType>
 
 However, when a ShingleFilterFactory is applied first, the 
 SynonymFilterFactory appears to do nothing.
 I haven't found any documentation or other warnings against this 
 combination, and I don't want to apply shingles after synonyms (this
 works) because multi-word synonyms then cause severe term expansion. I 
 don't really mind if the synonyms fail to match shingles, (although 
 I'd prefer they succeed) but I'd at least expect that synonyms would 
 continue to match on the original tokens, as they do if I remove the 
 ShingleFilterFactory.
 
 I'm using Solr 3.3, any clarification would be appreciated.
 
 Thanks,
   -Jeff Wartes



DistributedSearchDesign and multiple requests

2010-10-21 Thread Jeff Wartes

I'm using Solr 1.4. My observations and this page 
http://wiki.apache.org/solr/DistributedSearchDesign#line-254 indicate that the 
general strategy for Distributed Search is something like:
1. Query the shards with the user's query and fl=unique_field,score
2. Re-query (maybe a subset of) the shards for certain documents by 
unique_field with the field list the user requested.
3. Maybe re-query the shards again to flesh out faceting info.

I'm encountering a significant performance penalty using DistributedSearch due 
to these additional queries, and it seems like there are some obvious 
optimizations that could avoid them in certain cases. 

For example, a way to say "I claim the fields I'm requesting are small enough 
that querying again for stored fields is worse than just getting the stored 
fields in the first request." 
(assert_tiny_data=true&fl=tiny_stored_field,unique_field) 
Or, "If the field list of the original query is contained in the first round of 
shard requests, don't bother querying again for more fields." 
(fl=unique_field,score)

Has anyone else looked into this? I'd be interested to learn if there are 
issues that makes these kind of shortcuts difficult before I dig in.

Thanks,
  -Jeff Wartes


Re: Knowing what field caused the retrival of the document

2013-08-06 Thread Jeff Wartes

For what it's worth, I had the same question last year, and I never really
got a good solution:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3C81e9a7879c550b42a767f0b86b2b81591a15b...@ex4.corp.w3data.com%3E


I dug into the highlight component for a while, but it turned out I
couldn't use that approach. I'm afraid I don't recall exactly why. The
debugQuery method had a *huge* performance cost, so that was a non-starter.

I managed to solve a subset of my problem by writing a custom
QueryComponent that (re)examines the documents being returned and
annotates the response. This worked because I was able to reduce my
problem space to just determining whether a document was a string-literal
match with the query vs via synonyms or other fuzzy expansion.
It still required that I had stored and was returning the fields I wanted
to (re)examine, so it was hardly ideal on several fronts.

If you can, I'd suggest just doing two queries.


On 8/6/13 7:38 AM, Jack Krupansky j...@basetechnology.com wrote:

Add the debugQuery=true parameter and the explain section will detail
exactly what terms matched for each document.

You could also use the Solr term vectors component to get info on what
terms occur where in a document, but that adds more overhead to the index
for stored term vectors.

-- Jack Krupansky

-Original Message-
From: Mysurf Mail
Sent: Tuesday, August 06, 2013 5:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Knowing what field caused the retrival of the document

But what if this is for multiple words?
I am guessing Solr knows why the document is there, since I get to see the
paragraph in the highlight (hl) section.


On Tue, Aug 6, 2013 at 11:36 AM, Raymond Wiker rwi...@gmail.com wrote:

 If you were searching for single words (terms), you could use the 'tf'
 function, by adding something like

 matchesinname:tf(name, whatever)

 to the 'fl' parameter - if the 'name' field contains whatever, the
 (result) field 'matchesinname' will be 1.




 On Tue, Aug 6, 2013 at 10:24 AM, Mysurf Mail stammail...@gmail.com
 wrote:

  I have two indexed fields in my document.- Name, Comment.
  The user searches for a phrase and I need to act differently if it
 appeared
  in the comment or the name.
  Is there a way to know why the document was retrieved?
  Thanks.
 
 




TermFrequency in a multi-valued field

2013-08-07 Thread Jeff Wartes

This might end up being more of a Lucene question, but anyway...

For a multivalued field, it appears that term frequency is calculated as
something a little like:

sum(tf(value1), ..., tf(valueN))

I'd rather my score not give preference based on how *many* of the values
in the multivalued field matched, I want it to give preference based on
the value that matched *best*. In other words, something more like:

max(tf(value1), ..., tf(valueN))


Put another way, I want a search like q=mvf:foo against a document with a
multivalued field: 
mvf: [ "foo" ]
to get scored the exact same as a document with a multivalued field:
mvf: [ "foo", "foo" ]
but worse than a document with a multivalued field:
mvf: [ "foo foo" ]


I'm guessing this'd require a custom Similarity implementation, but I'm
beginning to wonder if even that is low enough level.
Other thoughts? This seems like a pretty obvious desire.

Thanks.



Re: TermFrequency in a multi-valued field

2013-08-07 Thread Jeff Wartes


A multivalued text field is directly equivalent to concatenating the
values, 
with a possible position gap between the last and first terms of adjacent
values.


That, in a nutshell, would be the problem. Maybe the discussion is over at
this point. 


It could be I dumbed down the problem a bit too much for illustration
purposes. I'm actually doing phrase query matches with slop. As such, the
search phrase I'm interested in could easily be in more than one of the
(unique) values, and the score for each value-match could be very
different when considered alone.

For document scoring purposes, I don't care that (for example) I got a
sloppy match on one value if I got a nice phrase out of another value in
the same document. In fact, I explicitly want to ignore the fact that
there was also a sloppy match. I also don't care if the exact phrase
occurred in more than one value, and I don't want the case where it does
match more than one influencing that document's score.




Distance sort on a multi-value field

2013-08-14 Thread Jeff Wartes

I'm still pondering aggregate-type operations for scoring multi-valued
fields (original thread: http://goo.gl/zOX53f ), and it occurred to me
that distance-sort with SpatialRecursivePrefixTreeFieldType must be doing
something like that.

Somewhat surprisingly I don't see this in the documentation anywhere, but
I presume the example query: (from:
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4)
q={!geofilt score=distance sfield=geo pt=54.729696,-98.525391 d=10}

assigns the distance/score based on the *closest* lat/long if the sfield
is a multi-valued field.

That's a reasonable default, but it's a bit arbitrary. Can I sort based on
the *furthest* lat/long in the document? Or the average distance?

Anyone know more about how this works and could give me some pointers?

Thanks.



Re: Distance sort on a multi-value field

2013-08-14 Thread Jeff Wartes

Hm, "Give me all the stores that only have branches in this area" might be
a plausible use case for farthest distance.
That's essentially a "contains" question though, so maybe that's already
supported? I guess it depends on how contains/intersects/etc. handle
multi-values. I feel like multi-value interaction really deserves its own
section in the documentation.


I'm aware of the memory issue, but it seems like if you want sort
multi-valued points, it's either this or try to pull in the 2155 patch. In
general I'd rather go with the thing that's being maintained.


Thanks for the code pointer. You're right, that doesn't look like
something I can easily use for more general aggregate scoring control. Ah
well.



On 8/14/13 12:35 PM, Smiley, David W. dsmi...@mitre.org wrote:



On 8/14/13 2:26 PM, Jeff Wartes jwar...@whitepages.com wrote:


I'm still pondering aggregate-type operations for scoring multi-valued
fields (original thread: http://goo.gl/zOX53f ), and it occurred to me
that distance-sort with SpatialRecursivePrefixTreeFieldType must be doing
something like that.

It isn't.


Somewhat surprisingly I don't see this in the documentation anywhere, but
I presume the example query: (from:
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4)
q={!geofilt score=distance sfield=geo pt=54.729696,-98.525391 d=10}

assigns the distance/score based on the *closest* lat/long if the sfield
is a multi-valued field.

Yes it does.


That's a reasonable default, but it's a bit arbitrary. Can I sort based
on
the *furthest* lat/long in the document? Or the average distance?

Anyone know more about how this works and could give me some pointers?

I considered briefly supporting the farthest distance but dismissed it as
I saw no real use-case.  I didn't think of the average distance; that's
plausible.  Any way, you're best bet is to dig into the code.  The
relevant part is ShapeFieldCacheDistanceValueSource.

FYI something to keep in mind:
https://issues.apache.org/jira/browse/LUCENE-4698

~ David




Re: Distance sort on a multi-value field

2013-08-22 Thread Jeff Wartes

This is actually pretty far afield from my original subject, but it turns
out that I also had issues  with NRT and multi-field geospatial
performance in Solr 4, so I'll follow that up.


I've been testing and working with David's SOLR-5170 patch ever since he
posted it, and I pushed it into production with only some cosmetic changes
a few hours ago. 
I have a relatively low update and query rate for this particular query
type, (something like 2 updates/sec, 10 queries/sec) but a short
autosoftcommit time. (5 sec) Based on the data so far this patch looks
like it's brought my average response time down from 4 seconds to about
50ms.

Very nice!



On 8/20/13 7:37 PM, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:

The distance sorting code in SOLR-2155 is roughly equivalent to the code
that
RPT uses (RPT has its lineage in SOLR-2155 after all).  I just reviewed it
to double-check.  It's possible the behavior is slightly better in
SOLR-2155
because the cache (a Solr cache) contains normal hard-references whereas
RPT
has one based on weak references, which will linger longer.  But I think
the
likelihood of OOM is the same.

Any way, the current best option is
https://issues.apache.org/jira/browse/SOLR-5170  which I posted a few days
ago.

~ David


Billnbell wrote
 We have been using 2155 for over 6 months in production with over 2M
hits
 every 10 minutes. No OOM yet.
 
 2155 seems great, and would this issue be any worse than 2155?
 
 
 
 On Wed, Aug 14, 2013 at 4:08 PM, Jeff Wartes <jwartes@> wrote:
 

 Hm, Give me all the stores that only have branches in this area might
 be
 a plausible use case for farthest distance.
 That's essentially a contains question though, so maybe that's
already
 supported? I guess it depends on how contains/intersects/etc handle
 multi-values. I feel like multi-value interaction really deserves its
own
 section in the documentation.


 I'm aware of the memory issue, but it seems like if you want sort
 multi-valued points, it's either this or try to pull in the 2155 patch.
 In
 general I'd rather go with the thing that's being maintained.


 Thanks for the code pointer. You're right, that doesn't look like
 something I can easily use for more general aggregate scoring control.
Ah
 well.



 On 8/14/13 12:35 PM, Smiley, David W. <dsmiley@> wrote:

 
 
 On 8/14/13 2:26 PM, Jeff Wartes <jwartes@> wrote:
 
 
 I'm still pondering aggregate-type operations for scoring
multi-valued
 fields (original thread: http://goo.gl/zOX53f ), and it occurred to
me
 that distance-sort with SpatialRecursivePrefixTreeFieldType must be
 doing
 something like that.
 
 It isn't.
 
 
 Somewhat surprisingly I don't see this in the documentation anywhere,
 but
 I presume the example query: (from:
 http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4)
 q={!geofilt score=distance sfield=geo pt=54.729696,-98.525391 d=10}
 
 assigns the distance/score based on the *closest* lat/long if the
 sfield
 is a multi-valued field.
 
 Yes it does.
 
 
 That's a reasonable default, but it's a bit arbitrary. Can I sort
based
 on
 the *furthest* lat/long in the document? Or the average distance?
 
 Anyone know more about how this works and could give me some
pointers?
 
 I considered briefly supporting the farthest distance but dismissed it
 as
 I saw no real use-case.  I didn't think of the average distance;
that's
 plausible.  Any way, you're best bet is to dig into the code.  The
 relevant part is ShapeFieldCacheDistanceValueSource.
 
 FYI something to keep in mind:
 https://issues.apache.org/jira/browse/LUCENE-4698
 
 ~ David
 


 
 
 -- 
 Bill Bell

 billnbell@

 cell 720-256-8076





-
 Author: 
http://www.packtpub.com/apache-solr-3-enterprise-search-server/book



Required local configuration with ZK solr.xml?

2014-01-28 Thread Jeff Wartes


It was my hope that storing solr.xml would mean I could spin up a Solr node 
pointing it to a properly configured zookeeper ensamble, and that no further 
local configuration or knowledge would be necessary.

However, I’m beginning to wonder if that’s sufficient. It’s looking like I may 
also need each node to be preconfigured with at least a directory and a 
core.properties for each collection/core the node intends to participate in? Is 
that correct?

I figured I’d test this by starting a stand-alone ZK, and configuring it by 
issuing a zkCli bootstrap against the solr example's solr dir, then manually 
putfile-ing the (new-style) solr.xml.
I then attempted to connect two solr instances that referenced that zookeeper, 
but did NOT use the solr examples dir as the base. I essentially used empty 
directories for the solr home. Although both connected and zk shows both in the 
/live_nodes, both report “0 cores discovered” in the logs, and don’t seem to 
find and participate in the collection as happens when you follow the SolrCloud 
example verbatim. 
(http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster)

I may have some other configuration issues at present, but I’m going to be 
disappointed if I need to have preknowledge of what collections/cores may have 
been dynamically created in a cluster in order to add a node that participates 
in that cluster.
It feels like I might be missing something.

Any clarifications would be appreciated.


Re: Required local configuration with ZK solr.xml?

2014-01-29 Thread Jeff Wartes


...the difference between that example and what you are doing here is that
in that example, because both nodes already had collection1 instance
dirs, they expected to be part of collection1 when they joined the
cluster.


And that, I think, is my misunderstanding. I had assumed that the link
between a node and the collections it belongs to would be the (possibly
chroot'ed) zookeeper reference *itself*, not the node's directory
structure. Instead, it appears that ZK is simply a repository for the
collection configuration, where nodes may look up what they need based on
filesystem core references.

So assume I have a ZK running separately, and it already has a config
uploaded which all my collections will use. I then add two nodes with
empty solr home dirs, (no cores found) and use the collections API to
create a new collection "collection1" with numShards=1 and
replicationFactor=2.
Please check my assertions:

WRONG: If I take the second node down, wipe the solr home dir on it, and
add it back, the collection will properly replicate back onto the node.
RIGHT: If I replace the second node, it must have a
$SOLR_HOME/collection1_shard1_replica2/core.properties file in order to
properly act as the replacement.
RIGHT: If I replace the first node, it must have a
$SOLR_HOME/collection1_shard1_replica1/core.properties file in order to
properly act as the replacement.

If correct, this sounds kinda painful, because you have to know the exact
list of directory names for the set of collections/shards/replicas the
node was participating in if you want to replace a given node. Worse, if
you're replacing a node whose disk failed, you need to manually
reconstruct the contents of those core.properties files?
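For reference, a hand-reconstructed core.properties for that second-node case would look 
something like the following (a sketch; the property names are the ones core discovery uses, 
but the exact values are hypothetical and would have to match what the collections API 
originally generated):

# $SOLR_HOME/collection1_shard1_replica2/core.properties (hypothetical reconstruction)
name=collection1_shard1_replica2
collection=collection1
shard=shard1
coreNodeName=core_node2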


This also leaves me a bit confused about how to increase the
replicationFactor on an existing collection, but I guess that's tangential.

Thanks.






Re: Required local configuration with ZK solr.xml?

2014-01-30 Thread Jeff Wartes

Work is underway towards a new mode where zookeeper is the ultimate
source of truth, and each node will behave accordingly to implement and
maintain that truth.  I can't seem to locate a Jira issue for it,
unfortunately.  It's possible that one doesn't exist yet, or that it has
an obscure title.  Mark Miller is the one who really understands the
full details, as he's a primary author of SolrCloud code.

Currently, what SolrCloud considers to be truth is dictated by both
zookeeper and an amalgamation of which cores each server actually has
present.  The collections API modifies both.  With an older config (all
current and future 4.x versions), the latter is in solr.xml.  If you're
using the new solr.xml format (available 4.4 and later, will be
mandatory in 5.0), it's done with Core Discovery.  Zookeeper has a list
of everything and coordinates the cluster state, but has no real control
over the cores that actually exist on each server.  When the two sources
of truth disagree, nothing happens to fix the situation, manual
intervention is required.


Thanks Shawn, this was exactly the confirmation I was looking for. I think
I have a much better understanding now.

The takeaway I have is that SolrCloud's current automation assumes
relatively static clusters, and that if I want anything like dynamic
scaling, I'm going to have to write my own tooling to add nodes safely.

Fortunately, it appears that the necessary CoreAdmin commands don't need
much besides the collection name, so it smells like a simple thing to
query zookeeper's /collections path (or clusterstate.json) and issue GET
requests accordingly when I spin up a new node.

If you (or anyone) does happen to recall a reference to the work you
alluded to, I'd certainly be interested. I googled around myself for a few
minutes, but haven't found anything so far.




Re: Required local configuration with ZK solr.xml?

2014-01-30 Thread Jeff Wartes
Found it. In case anyone else cares, this appears to be the root issue:
https://issues.apache.org/jira/browse/SOLR-5128

Thanks again.


On 1/30/14, 9:01 AM, Jeff Wartes jwar...@whitepages.com wrote:


Work is underway towards a new mode where zookeeper is the ultimate
source of truth, and each node will behave accordingly to implement and
maintain that truth.  I can't seem to locate a Jira issue for it,
unfortunately.  It's possible that one doesn't exist yet, or that it has
an obscure title.  Mark Miller is the one who really understands the
full details, as he's a primary author of SolrCloud code.

Currently, what SolrCloud considers to be truth is dictated by both
zookeeper and an amalgamation of which cores each server actually has
present.  The collections API modifies both.  With an older config (all
current and future 4.x versions), the latter is in solr.xml.  If you're
using the new solr.xml format (available 4.4 and later, will be
mandatory in 5.0), it's done with Core Discovery.  Zookeeper has a list
of everything and coordinates the cluster state, but has no real control
over the cores that actually exist on each server.  When the two sources
of truth disagree, nothing happens to fix the situation, manual
intervention is required.


Thanks Shawn, this was exactly the confirmation I was looking for. I think
I have a much better understanding now.

The takeaway I have is that SolrCloud's current automation assumes
relatively static clusters, and that if I want anything like dynamic
scaling, I'm going to have to write my own tooling to add nodes safely.

Fortunately, it appears that the necessary CoreAdmin commands don't need
much besides the collection name, so it smells like a simple thing to
query zookeeper's /collections path (or clusterstate.json) and issue GET
requests accordingly when I spin up a new node.

If you (or anyone) does happen to recall a reference to the work you
alluded to, I'd certainly be interested. I googled around myself for a few
minutes, but haven't found anything so far.





Re: SolrCloud how to spread out to multiple nodes

2014-02-10 Thread Jeff Wartes

If you're only concerned with moving your shards (rather than changing
the number of shards), I'd:

1. Add a new server and fire up Solr pointed to the same ZooKeeper with
the same config

At this point the new server won't be indexing anything, but will still
technically be part of the SolrCloud cluster since it points to the same
ZK.

2. Add the shard to the node by making it a replica.

The CoreAdmin API is necessary for this, since the Collections API doesn't
support changing the number of replicas on the fly yet. See
http://wiki.apache.org/solr/SolrCloud#Creating_cores_via_CoreAdmin

3. After the new server has finished replicating the shard, you can delete
the shard from the original server. Again, use the CoreAdmin command
against the specific server. http://wiki.apache.org/solr/CoreAdmin#UNLOAD
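As a sketch, steps 2 and 3 boil down to two CoreAdmin calls along these lines (host, 
collection, and core names are placeholders):

# 2) on the new server: create a core that joins the existing collection/shard as a replica
curl "http://newserver:8983/solr/admin/cores?action=CREATE&name=mycollection_shard1_replica2&collection=mycollection&shard=shard1"

# 3) once replication has finished, unload the old core on the original server
curl "http://oldserver:8983/solr/admin/cores?action=UNLOAD&core=mycollection_shard1_replica1"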


If you're comfortable with downtime, you could accomplish the same thing
by physically copying the shard's index files, but I think you'd have to
make some careful edits to the core.properties before you bring things
back up again.



On 2/9/14, 7:27 AM, soodyogesh soodyog...@gmail.com wrote:



Since the amount of data we would be indexing will increase over time
(read: 100-200G and more), we would like to use SolrCloud.

Now I have been reading posts and wiki pages, plus trying things on my own
to test.

To simplify, I would create a collection with n shards where n = let's say
10. I would start everything on a single machine to start with; however,
as my index grows I would like to spread out those shards to multiple
machines.

My question is: how do I spread shards (one collection) from one machine to
multiple machines? It would be a great help if someone could provide me
steps to test this.










handleSelect=true with SolrCloud

2014-02-11 Thread Jeff Wartes

I’m working on a port of a Solr service to SolrCloud. (Targeting v4.6.0 at 
present.) The old query style relied on using /solr/select?qt=foo to select the 
proper requestHandler. I know handleSelect=true is deprecated now, but it’d be 
pretty handy for testing to be able to be backwards compatible, at least until 
some time after the initial release.

So in my SolrCloud configuration, I set <requestDispatcher handleSelect="true"> 
and deleted the /select requestHandler as suggested here: 
http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Resolution_.28qt_param.29

However, my /solr/collection1/select?qt=foo query throws an “unknown handler: 
null” error with this configuration. Has anyone successfully tried 
handleSelect=true with the collections api?

Thanks.




Re: handleSelect=true with SolrCloud

2014-02-11 Thread Jeff Wartes

Got it in one. Thanks!


On 2/11/14, 9:50 AM, Shawn Heisey s...@elyograg.org wrote:

On 2/11/2014 10:21 AM, Jeff Wartes wrote:
 I'm working on a port of a Solr service to SolrCloud. (Targeting v4.6.0
at present.) The old query style relied on using /solr/select?qt=foo to
select the proper requestHandler. I know handleSelect=true is deprecated
now, but it'd be pretty handy for testing to be able to be backwards
compatible, at least until some time after the initial release.

 So in my SolrCloud configuration, I set <requestDispatcher
handleSelect="true"> and deleted the /select requestHandler as suggested
here: 
http://wiki.apache.org/solr/SolrRequestHandler#Old_handleSelect.3Dtrue_Re
solution_.28qt_param.29

 However, my /solr/collection1/select?qt=foo query throws an "unknown
handler: null" error with this configuration. Has anyone successfully
tried handleSelect=true with the collections api?

I'm pretty sure that if you don't have a handler named /select, then you
need to have default="true" as an attribute on one of your other handler
definitions.

See line 715 of the example solrconfig.xml for Solr 3.5:

http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_3_5/solr/example/solr/conf/solrconfig.xml?view=annotate
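In other words, something along these lines in solrconfig.xml (handler name and class are 
just illustrative):

<requestHandler name="foo" class="solr.SearchHandler" default="true">
  ...
</requestHandler>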

Thanks,
Shawn




ZK connection problems

2014-02-21 Thread Jeff Wartes

I’ve been experimenting with SolrCloud configurations in AWS. One issue I’ve 
been plagued with is that during indexing, occasionally a node decides it can’t 
talk to ZK, and this disables updates in the pool. The node usually recovers 
within a second or two. It’s possible this happens when I’m not indexing too, 
but I’m much less likely to notice.

I’ve seen this with multiple sharding configurations and multiple cluster 
sizes. I’ve searched around, and I think I’ve addressed the usual resolutions 
when someone complains about ZK and Solr. I’m using:

  *   60-sec ZK connection timeout (although this seems like a pretty terrible 
requirement)
  *   Independent 3-node ZK cluster, also in AWS.
  *   Solr 4.6.1
  *   Optimized GC settings (and I’ve confirmed no GC pauses are occurring)
  *   5-min auto-hard-commit with openSearcher=false
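For reference, the timeout and hard-commit items above correspond to config roughly like the 
following sketch (placement per the stock 4.x examples, values taken from the list):

<!-- solr.xml, inside <solrcloud>: -->
<int name="zkClientTimeout">60000</int>

<!-- solrconfig.xml: -->
<autoCommit>
  <maxTime>300000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>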

I’m indexing some 10K docs/sec using CloudSolrServer, but the CPU usage on the 
nodes doesn’t exceed 20%, typically it’s around 5%.

Here is the relevant section of logs from one of the nodes when this happened:
http://pastebin.com/K0ZdKmL4

It looks like it had a connection timeout, and tried to re-establish the same 
session on a connection to a new ZK node, except the session had also expired. 
It then closes *that* connection, changes to read-only mode, and eventually 
creates a new connection and new session which allows writes again.

Can anyone familiar with the ZK connection/session stuff comment on whether 
this is a bug? I really know nothing about proper ZK client behaviour.

Thanks.



Re: DistributedSearch: Skipping STAGE_GET_FIELDS?

2014-02-24 Thread Jeff Wartes

I'll second that thank-you, this is awesome.

I asked about this issue in 2010, but when I didn't hear anything (and
disappointingly didn't find SOLR-1880), we ended up rolling our own
version of this functionality. I've been laboriously migrating it every
time we bump our Solr version ever since. The performance difference is
quite noticeable. 
One thing is that our version interferes pretty badly with various other
Components. It's been a while, but my recollection is that other
Components like Debug assumed some stuff happened in STAGE_GET_FIELDS.

I think I'll try to apply SOLR-1880 to 4.6.1 and see what happens.




On 2/24/14, 11:07 AM, Gregg Donovan gregg...@gmail.com wrote:

Thank you Shalin and Yonik! Both
SOLR-1880https://issues.apache.org/jira/browse/SOLR-1880
 and SOLR-5768 https://issues.apache.org/jira/browse/SOLR-5768 will be
very helpful for our distributed search performance.



On Mon, Feb 24, 2014 at 5:02 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 I opened SOLR-5768

 https://issues.apache.org/jira/browse/SOLR-5768

 On Mon, Feb 24, 2014 at 12:56 AM, Shalin Shekhar Mangar
 shalinman...@gmail.com wrote:
  Yes that should be simple. But regardless of the parameter, the
  fl=id,score use-case should be optimized by default. I think I'll
  commit the patch as-is and open a new issue to add the
  distrib.singlePass parameter.
 
  On Sun, Feb 23, 2014 at 11:49 PM, Yonik Seeley yo...@heliosearch.com
 wrote:
  On Sun, Feb 23, 2014 at 1:08 PM, Shalin Shekhar Mangar
  shalinman...@gmail.com wrote:
  I should clarify though that this optimization only works with
 fl=id,score.
 
  Although it seems like it should be relatively simple to make it work
  with other fields as well, by passing down the complete fl
requested
  if some optional parameter is set (distrib.singlePass?)
 
  -Yonik
  http://heliosearch.org - native off-heap filters and fieldcache for
 solr
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.



 --
 Regards,
 Shalin Shekhar Mangar.




Re: SolrCloud Startup

2014-02-24 Thread Jeff Wartes

There is a RELOAD collection command you might try:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api2


I think you'll find this a lot faster than restarting your whole JVM.
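Something like this, against any node in the cluster (collection name is a placeholder):

curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1"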


On 2/24/14, 4:12 PM, KNitin nitin.t...@gmail.com wrote:

Hi

 I have a 4 node solrcloud cluster with more than 50 collections with 4
shards each. Everytime I want to make a schema change, I upload configs to
zookeeper and then restart all nodes. However the restart of every node is
very slow and takes about 20-30 minutes per node.

Is it recommended to make loadOnStartup=false and allow solrcloud to lazy
load? Is there a way to make schema changes without restarting solrcloud?


Thanks



Re: Automate search results filtering based on scoring

2014-03-05 Thread Jeff Wartes

It's worth mentioning that scores should not be considered comparable
across queries, so equating "confidence" and "score" is a tricky
proposition. 
That is, the maxScore for the search "field1:foo" may be 10.0, and the
maxScore for "field1:bar" may be 1.0, but that doesn't mean the top result
for "foo" is ten times better than the top result for "bar". It might just
be that "foo" and "bar" have very different frequencies in your data.

You can work around this with a custom Similarity class. You'd do this by
removing most of Solr/Lucene's term and document frequency scoring
intelligence, so that scoring is almost entirely a factor of which fields
matched with which boosts.
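A minimal sketch of that kind of Similarity, assuming the Lucene 4.x DefaultSimilarity API 
(this flattens tf and idf so the score is driven mostly by boosts and which fields matched; 
it's a starting point, not a drop-in recommendation):

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class FlatSimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f; // ignore how often the term occurs
  }

  @Override
  public float idf(long docFreq, long numDocs) {
    return 1.0f; // ignore how rare the term is across the index
  }
}

You'd wire it in with a <similarity> element in schema.xml, either globally or per fieldType.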

You could also truncate a given result set based on a percentage of the
maxScore for that result set, but even within a single result set, you
shouldn't assume that 1/3rd the score means 1/3rd the match quality.
You'll need to find numbers that work for you.
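A client-side sketch (SolrJ) of that truncation idea; the fraction is something you'd have 
to tune against your own data, as said above:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class ScoreCutoff {
  // Keep only documents scoring at least 'fraction' of the result set's maxScore.
  // Requires the request to include fl=*,score so scores come back with the docs.
  public static List<SolrDocument> truncate(SolrDocumentList results, float fraction) {
    float cutoff = results.getMaxScore() * fraction;
    List<SolrDocument> kept = new ArrayList<SolrDocument>();
    for (SolrDocument doc : results) {
      Float score = (Float) doc.getFieldValue("score");
      if (score != null && score >= cutoff) {
        kept.add(doc);
      }
    }
    return kept;
  }
}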



On 3/3/14, 7:50 PM, Susheel Kumar susheel.ku...@thedigitalgroup.net
wrote:

Hi,

We are looking to automate searches (name searches) and filter out the
results based on some scoring confidence. Any suggestions on what
different approaches we can use to pick only the closest matches and
filter out rest of the results.


Thanks,
Susheel




Re: Result merging takes too long

2014-03-17 Thread Jeff Wartes

This is highly anecdotal, but I tried SOLR-1880 with 4.7 for some tests I
was running, and saw almost a 30% improvement in latency. If you're only
doing document selection, it's definitely worth having.

I'm reasonably certain that the patch would work in 4.6 too, but the test
file relies on some test infrastructure changes in 4.7, so I couldn't try
that without re-writing the test.



On 3/16/14, 7:31 AM, Shalin Shekhar Mangar shalinman...@gmail.com
wrote:

On Thu, Mar 13, 2014 at 1:49 PM, remi tassing tassingr...@gmail.com
wrote:

 I've used the fl=id parameter to avoid retrieving the actual documents
 (step 4 in your mail) but the problem still exists.
 Any ideas on how to find the merging time(step 3)?

Actually that doesn't skip steps #4 and #5. That optimization will be
available in Solr 4.8 (the next release). See
https://issues.apache.org/jira/browse/SOLR-1880

-- 
Regards,
Shalin Shekhar Mangar.



Re: Bootstrapping SolrCloud cluster with multiple collections in differene sharding/replication setup

2014-03-20 Thread Jeff Wartes

Please note that although the article talks about the ADDREPLICA command,
that feature is coming in Solr 4.8, so don't be confused if you can't find
it yet. See https://issues.apache.org/jira/browse/SOLR-5130



On 3/20/14, 7:45 AM, Erick Erickson erickerick...@gmail.com wrote:

You might find this useful:
http://heliosearch.org/solrcloud-assigning-nodes-machines/


It uses the collections API to create your collection with zero
nodes, then shows how to assign your leaders to specific
machines (well, at least specify the nodes the leaders will
be created on, it doesn't show how to assign, for instance,
shard1 to nodeX)

It also shows a way to assign specific replicas on specific nodes
to specific shards, although as Mark says this is a transitional
technique. I know there's an addreplica command in the works
for the collections API that should make this easier, but that's
not released yet.

Best,
Erick


On Thu, Mar 20, 2014 at 7:23 AM, Ugo Matrangolo
ugo.matrang...@gmail.com wrote:
 Hi,

 I would like some advice about the best way to bootstrap from scratch a
 SolrCloud cluster housing at least two collections with different
 sharding/replication setup.

 Going through the docs/'Solr In Action' book, what I have seen so far is
 that there is a way to bootstrap a SolrCloud cluster with sharding
 configuration using the:

   -DnumShards=2

 but this (afaik) works only for a single collection. What I need is a
way
 to deploy from scratch a SolrCloud cluster housing (e.g.) two
collections
 Foo and Bar where Foo has only one shard and is replicated everywhere
while
 Bar has three shards and, again, is replicated.

 I can't find a config file where to put this sharding plan and I'm
starting
 to think that the only way to do this is after the deploy using the
 Collections API.

 Is there a best approach way to do this ?

 Ugo



Re: Logging which client connected to Solr

2014-03-27 Thread Jeff Wartes

You could always just pass the username as part of the GET params for the
query. Solr will faithfully ignore and log any parameters it doesn't
recognize, so it'd show up in your {lots of params}.

That means your log parser would need more intelligence, and your client
would have to pass in the data, but it would save any custom work on the
server side.
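For example, a request like the following would leave username=juha sitting in the logged 
parameter list for that query (the parameter name is arbitrary, as long as no component 
claims it):

http://localhost:8983/solr/collection1/select?q=foo&username=juha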



On 3/27/14, 7:07 AM, Greg Walters greg.walt...@answers.com wrote:

We do something similar and include the server's hostname in solr's
response. To accomplish this you'll have to write a class that extends
org.apache.solr.servlet.SolrDispatchFilter and put your custom class in
place as the SolrRequestFilter in solr's web.xml.

Thanks,
Greg

On Mar 27, 2014, at 8:59 AM, Juha Haaga juha.ha...@codenomicon.com
wrote:

 Hello,
 
 I'm investigating the possibility of logging the username of the client
who did the search on Solr along with the normal logging information.
The username is in the basic auth headers of the request, and the access
control is managed by an Apache instance proxying to Solr. Is there a
way to append that information to the Solr query log, so that the log
would look like this:
 
 INFO  - 2014-03-27 11:16:24.000; org.apache.solr.core.SolrCore;
[generic] webapp=/solr path=/select params={lots of params} hits=0
status=0 QTime=49 username=juha
 
 I need to log both username and the query, and if I do it directly in
Apache then I lose the information about amount of hits and the query
time. If I log it with Solr then I get query time and hits, but no
username. Username logging is higher priority requirement than the hits
and query time, but I¹m looking for solution that covers both cases.
 
 Has anyone implemented this kind of logging scheme, and how would I
accomplish this? I couldn¹t find this as a configuration option.
 
 Regards,
 Juha
 
 
 
 
 




Re: svn vs GIT

2014-04-14 Thread Jeff Wartes
I vastly prefer git, but last I checked, (admittedly, some time ago) you
couldn't build the project from the git clone. Some of the build scripts
assumed some svn commands will work.



On 4/12/14, 3:56 PM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Amon;

There has been a conversation about it at dev list:
http://search-lucene.com/m/PrTmPXyDlv/The+Old+Git+Discussionsubj=Re+The+O
ld+Git+Discussion
On
the other hand you do not need to know SVN to use, develop and contribute
to Apache Solr project. You can follow the project at GitHub:
https://github.com/apache/lucene-solr

Thanks;
Furkan KAMACI


2014-04-11 5:12 GMT+03:00 Aman Tandon amantandon...@gmail.com:

 thanks sir,
 in that case i need to know about svn as well.


 Thanks
 Aman Tandon

 On Fri, Apr 11, 2014 at 7:26 AM, Alexandre Rafalovitch
 arafa...@gmail.comwrote:

  You can find the read-only Git's version of Lucene+Solr source code
  here: https://github.com/apache/lucene-solr . The SVN preference is
  Apache Foundation's choice and legacy. Most of the developers'
  workflows are also around SVN.
 
  Regards,
 Alex.
 
  Personal website: http://www.outerthoughts.com/
  Current project: http://www.solr-start.com/ - Accelerating your Solr
  proficiency
 
 
  On Fri, Apr 11, 2014 at 7:48 AM, Aman Tandon amantandon...@gmail.com
  wrote:
   Hi,
  
   I am new here, i have question in mind that why we are preferring
the
 svn
   more than git?
  
   --
   With Regards
   Aman Tandon
 




Re: svn vs GIT

2014-04-15 Thread Jeff Wartes

I guess I should've double-checked it was still the case before saying
anything, but I'm glad to be proven wrong.
Yes, it worked nicely for me when I tried today, which should simplify my
life a bit.


On 4/14/14, 4:35 PM, Shawn Heisey s...@elyograg.org wrote:

On 4/14/2014 12:56 PM, Ramkumar R. Aiyengar wrote:
 ant compile / ant -f solr dist / ant test certainly work, I use them
with a
 git working copy. You trying something else?
 On 14 Apr 2014 19:36, Jeff Wartes jwar...@whitepages.com wrote:

 I vastly prefer git, but last I checked, (admittedly, some time ago)
you
 couldn't build the project from the git clone. Some of the build
scripts
 assumed some svn commands will work.

The nightly-smoke build target uses svn.  There is a related smoketest
script that uses provided URL parameters (or svn if it's a checkout from
svn and the parameters are not supplied) to obtain artifacts for
testing.  This may not be the only build target that uses facilities not
available from git, but it's the only one that I know about for sure.

Ordinary people should be able to use repositories cloned from the
git.apache.org or github mirrors with no problem if they are not using
exotic build targets or build scripts.

When I tried 'ant precommit' it worked, but it did say at least once in
what scrolled by that this was not an SVN checkout, so the
'-check-svn-working-copy' build target (which is part of precommit)
didn't work.

Thanks,
Shawn




Re: timeAllowed in not honoring

2014-04-30 Thread Jeff Wartes

It's not just FacetComponent, here's the original feature ticket for
timeAllowed:
https://issues.apache.org/jira/browse/SOLR-502


As I read it, timeAllowed only limits the time spent actually getting
documents, not the time spent figuring out what data to get or how. I
think that means the primary use-case is serving as a guard against
excessive paging.
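
For what it's worth, it's just a request parameter (the value here is made
up), e.g. adding timeAllowed=500 to the query string; when the limit is hit,
the response header should be flagged with partialResults=true.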



On 4/30/14, 4:49 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Apr 30, 2014 at 2:16 PM, Aman Tandon
amantandon...@gmail.comwrote:

  <lst name="query"><double name="time">3337.0</double></lst>
  <lst name="facet"><double name="time">6739.0</double></lst>


Most time is spent in facet counting. FacetComponent doesn't check
timeAllowed right now. You can try to experiment with facet.method=enum or
even with https://issues.apache.org/jira/browse/SOLR-5725 or try to
distribute search with SolrCloud. AFAIK, you can't employ threads to speed
up multivalue facets.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: When not to use NRTCachingDirectory and what to use instead.

2014-04-30 Thread Jeff Wartes


On 4/19/14, 6:51 AM, Ken Krugler kkrugler_li...@transpac.com wrote:

The code I see seems to be using an FSDirectory, or is there another
layer of wrapping going on here?

return new NRTCachingDirectory(FSDirectory.open(new File(path)),
maxMergeSizeMB, maxCachedMB);


I was also curious about this subject. Not enough to test anything, but
enough to look at the code too.

FSDirectory.open picks one of MMapDirectory, SimpleFSDirectory and
NIOFSDirectory in that order of preference based on what it thinks your
system will support.

There's still the possibility that the added caching functionality slows
down bulk index operations, but setting that aside, it does look like
NRTCachingDirectoryFactory is almost always the best choice.
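
For reference, opting into it is just the usual directoryFactory line in
solrconfig.xml, e.g.:

  <directoryFactory name="DirectoryFactory"
                    class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>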



Re: Strategy for removing an active shard from zookeeper

2014-07-03 Thread Jeff Wartes

To expand on that, the Collections API DELETEREPLICA command is available
in Solr >= 4.6, but will not have the ability to wipe the disk until Solr
4.10.
Note that whether or not it deletes anything from disk, DELETEREPLICA will
remove that replica from your cluster state in ZK, so even in 4.10,
rebooting the node will NOT cause it to copy the data from the remaining
replica. You'd need to explicitly ADDREPLICA (Solr >= 4.8) to get it
participating again. On the plus side, you could do this without
restarting any servers.

The CoreAdmin UNLOAD command (which I think DELETEREPLICA uses under the
hood) has been available and able to wipe the disk since Solr 4.0. It
looks like specifying "deleteIndex=true" might essentially do what you're
currently doing. I'm not sure if you'd still need a restart.
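
For reference, the two calls look roughly like this (host, collection, and
core names below are placeholders):

  http://host:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycollection&shard=shard1&replica=core_node2
  http://host:8983/solr/admin/cores?action=UNLOAD&core=mycollection_shard1_replica1&deleteIndex=true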


It's odd to me that one replica would use more disk space than the other
though, that implies a replication issue. Which, in turn, means you
probably don't have any assurances that deleting the node with a bigger
index isn't losing unique documents.



On 6/30/14, 7:10 PM, Anshum Gupta ans...@anshumgupta.net wrote:

You should use the DELETEREPLICA Collections API:
https://cwiki.apache.org/confluence/display/solr/Collections+API#Collectio
nsAPI-api9

As of the last release, I don't think it deletes the index directory
but I remember there was a JIRA for the same.
For now you could perhaps use this API and follow it up with manually
deleting the directory after that. This should help you maintain the
sanity of the SolrCloud state.


On Mon, Jun 30, 2014 at 8:45 PM, tomasv dadk...@gmail.com wrote:
 Hello All,
 (I'm a newbie, so if my terminology is incorrect or my concepts are
wrong,
 please point me in the right direction)(This is the first of several
 questions to come)

 I've inherited a SOLR 4 cloud installation and we're having some issues
with
 disk space on one of our shards.

 We currently have 64 servers serving a collection. The collection is
managed
 by a zookeeper instance. There are two servers for each shard (32
replicated
 shards).

 We have a service that is constantly running and inserting new records
into
 our collection as we get new data to be indexed.

 One of our shards is growing (on disk)  disproportionately  quickly.
When
 the disk gets full, we start getting 500-series errors from the SOLR
system
 and our websites start to fail.

 Currently, when we start seeing these errors, and IT sees that the disk
is
 full on this particular server, the folks in IT delete the /data
directory
 and restart the server (linux based). This has the effect of causing the
 shard to reboot and re-load itself from its paired partner.

 But I would expect that there is a more elegant way to recover from this
 event.

 Can anyone point me to a strategy that may be used in an instance such
as
 this? Should we be taking steps to save the indexed information prior to
 restarting the server (more on this in a separate question). Should we
be
 backing up something (anything) prior to the restart?

 (I'm still going through the SOLR wiki; so if the answer is there a
link is
 appreciated).

 Thanks!





 --
 View this message in context:
http://lucene.472066.n3.nabble.com/Strategy-for-removing-an-active-shard-
from-zookeeper-tp4144892.html
 Sent from the Solr - User mailing list archive at Nabble.com.



-- 

Anshum Gupta
http://www.anshumgupta.net



Re: Listening on SolrCloud events

2014-07-03 Thread Jeff Wartes
If you're using SolrJ, CloudSolrServer exposes the information you need
directly, although you'd have to poll it for changes.
Specifically, this code path will get you a snapshot of the clusterstate:
http://lucene.apache.org/solr/4_5_0/solr-solrj/org/apache/solr/client/solrj
/impl/CloudSolrServer.html#getZkStateReader()

http://lucene.apache.org/solr/4_5_0/solr-solrj/org/apache/solr/common/cloud
/ZkStateReader.html#updateClusterState(boolean)

http://lucene.apache.org/solr/4_5_0/solr-solrj/org/apache/solr/common/cloud
/ZkStateReader.html#getClusterState()
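
A rough polling sketch using those methods (ZK address is a placeholder;
exception handling omitted):

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.cloud.ClusterState;
  import org.apache.solr.common.cloud.ZkStateReader;

  CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
  server.connect();
  ZkStateReader reader = server.getZkStateReader();
  reader.updateClusterState(true);                 // force a refresh from ZK
  ClusterState state = reader.getClusterState();
  // diff state.getLiveNodes() / state.getCollections() against your last snapshot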


If you're not using SolrJ, or don't want to poll, you really only need a
zookeeper client. As Shawn Heisey suggests, you put a watch on the
clusterstate.json, and write something to determine if the change was
relevant to you. 

A few months ago I wanted to do analysis of changes in my SolrCloud
cluster, so I threw together something using the Curator library to set
watches on clusterstate.json and commit the new version to a local git
repo whenever it changed.
(https://github.com/randomstatistic/git_zk_monitor) The hardest part
turned out to be learning about the Zookeeper watch semantics, but you
could use any client you like.
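
If it helps as a starting point, the watch setup itself is only a few lines
with Curator (connection string is a placeholder; remember ZK watches are
one-shot, so you re-register after each event; exception handling omitted):

  import org.apache.curator.framework.CuratorFramework;
  import org.apache.curator.framework.CuratorFrameworkFactory;
  import org.apache.curator.retry.ExponentialBackoffRetry;
  import org.apache.zookeeper.WatchedEvent;
  import org.apache.zookeeper.Watcher;

  CuratorFramework zk = CuratorFrameworkFactory.newClient(
      "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
  zk.start();
  Watcher watcher = new Watcher() {
    public void process(WatchedEvent event) {
      // clusterstate.json changed: read the new version, diff it, and re-set the watch
    }
  };
  byte[] clusterstate = zk.getData().usingWatcher(watcher).forPath("/clusterstate.json");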



On 7/3/14, 9:23 AM, Shawn Heisey s...@elyograg.org wrote:

On 7/3/2014 7:49 AM, Ugo Matrangolo wrote:
 I would like to be informed as soon as a cluster event happens like a
node
 dropping and/or starting a recovery process.
 
 What is the best way (if any) to listening on SolrCloud events ?

I don't know how it's done, but if you are using SolrJ and
CloudSolrServer, you have access to the zookeeper client.  With the
zookeeper client, you can put a watcher on various parts of the
zookeeper database, which should fire whenever there is a change.

Thanks,
Shawn




SolrCloud extended warmup support

2014-07-21 Thread Jeff Wartes

I’d like to ensure an extended warmup is done on each SolrCloud node prior to 
that node serving traffic.
I can do certain things prior to starting Solr, such as pump the index dir 
through /dev/null to pre-warm the filesystem cache, and post-start I can use 
the ping handler with a health check file to prevent the node from entering the 
clients load balancer until I’m ready.
What I seem to be missing is control over when a node starts participating in 
queries sent to the other nodes.

I can, of course, add solrconfig.xml firstSearcher queries, which I assume (and 
fervently hope!) happens before a node registers itself in ZK clusterstate.json 
as ready for work, but that doesn’t scale so well if I want that initial warmup 
to run thousands of queries, or run them with some parallelism. I’m storing 
solrconfig.xml in ZK, so I’m sensitive to the size.
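
To be concrete, this is the sort of thing I mean - an illustrative firstSearcher
listener with made-up queries:

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">some representative query</str>
        <str name="sort">price asc</str>
      </lst>
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">category</str>
      </lst>
    </arr>
  </listener>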

Any ideas, or corrections to my assumptions?

Thanks.


Re: SolrCloud extended warmup support

2014-07-21 Thread Jeff Wartes

On 7/21/14, 4:50 PM, Shawn Heisey s...@elyograg.org wrote:

On 7/21/2014 5:37 PM, Jeff Wartes wrote:
 I'd like to ensure an extended warmup is done on each SolrCloud node
prior to that node serving traffic.
 I can do certain things prior to starting Solr, such as pump the index
dir through /dev/null to pre-warm the filesystem cache, and post-start I
can use the ping handler with a health check file to prevent the node
from entering the clients load balancer until I'm ready.
 What I seem to be missing is control over when a node starts
participating in queries sent to the other nodes.
 
 I can, of course, add solrconfig.xml firstSearcher queries, which I
assume (and fervently hope!) happens before a node registers itself in
ZK clusterstate.json as ready for work, but that doesn't scale so well
if I want that initial warmup to run thousands of queries, or run them
with some parallelism. I'm storing solrconfig.xml in ZK, so I'm sensitive
to the size.
 
 Any ideas, or corrections to my assumptions?

I think that firstSearcher/newSearcher (and making sure useColdSearcher
is set to false) is going to be the only way you can do this in a way
that's compatible with SolrCloud.  If you were doing manual distributed
search without SolrCloud, you'd have more options available.

If useColdSearcher is set to false, that should keep *everything* from
using the searcher until the warmup has finished.  I cannot be certain
that this is the case, but I have some reasonable confidence that this
is how it works.  If you find that it doesn't behave this way, I'd call
it a bug.

Thanks,
Shawn


Thanks for the quick reply. Since distributed search latency is the max of
the shard sub-requests, I'm trying my best to minimize any spikes in
cluster latency due to node restarts.
I double-checked useColdSearcher was false, but the doc says this means
requests "block until the first searcher is done warming", which
translates pretty clearly to "latency spike". The more I think about it,
the more worried I am that a node might indeed register itself in
live_nodes and get distributed requests before it's got a searcher to work
with. *Especially* if I have lots of serial firstSearcher queries.

I'll look through the code myself tomorrow, but if anyone can help
confirm/deny the order of operations here, I'd appreciate it.



Re: SolrCloud extended warmup support

2014-07-24 Thread Jeff Wartes

Well, I’m not sure what to say. I’ve been observing a noticeable latency
decrease over the first few thousand queries. I’m not doing anything too
tricky either. Same exact query pattern, only one fq, always on the same
field, no faceting. The only potential suspects that occur to me could be
that it’s a large index (although properly sharded to fit in system
memory), and that it’s doing geo filtering & ordering.

Since I don’t have a good mechanism for many queries, I’ll probably just
do a few queries in firstSearcher for now and cross my fingers, but I’m
not optimistic.

For what it’s worth, I did verify that a replica doesn’t make itself
available to other nodes until after the firstSearcher queries are
completed.



On 7/21/14, 5:57 PM, Erick Erickson erickerick...@gmail.com wrote:

I've never seen it necessary to run thousands of queries
to warm Solr. Usually less than a dozen will work fine. My
challenge would be for you to measure performance differences
on queries after running, say, 12 well-chosen queries as
opposed to hundreds/thousands. I bet that if
1 you search across all the relevant fields, you'll fill up the
 low-level caches for those fields.
2 you facet on all the fields you intend to facet on.
3 you sort on all the fields you intend to sort on.
4 you specify some filter queries. This is fuzzy since
 really depends on you being able to predict what
 those will be for firstSearcher. Things like in the
 last day/week/month can be pre-configured, but
 others you won't get. BTW, here's a blog about
 why in the last day fq clauses can be tricky.
   http://searchhub.org/2012/02/23/date-math-now-and-filter-queries/

that you'll pretty much nail warmup and be fine. Note that
you can do all the faceting on a single query. Specifying
the primary, secondary  etc. sorts will fill those caches.
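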

Best,
Erick


On Mon, Jul 21, 2014 at 5:07 PM, Jeff Wartes jwar...@whitepages.com
wrote:


 On 7/21/14, 4:50 PM, Shawn Heisey s...@elyograg.org wrote:

 On 7/21/2014 5:37 PM, Jeff Wartes wrote:
  I'd like to ensure an extended warmup is done on each SolrCloud node
 prior to that node serving traffic.
  I can do certain things prior to starting Solr, such as pump the
index
 dir through /dev/null to pre-warm the filesystem cache, and
post-start I
 can use the ping handler with a health check file to prevent the node
 from entering the clients load balancer until I'm ready.
  What I seem to be missing is control over when a node starts
 participating in queries sent to the other nodes.
 
  I can, of course, add solrconfig.xml firstSearcher queries, which I
 assume (and fervently hope!) happens before a node registers itself in
 ZK clusterstate.json as ready for work, but that doesn't scale so well
 if I want that initial warmup to run thousands of queries, or run them
 with some parallelism. I'm storing solrconfig.xml in ZK, so I'm
sensitive
 to the size.
 
  Any ideas, or corrections to my assumptions?
 
 I think that firstSearcher/newSearcher (and making sure useColdSearcher
 is set to false) is going to be the only way you can do this in a way
 that's compatible with SolrCloud.  If you were doing manual distributed
 search without SolrCloud, you'd have more options available.
 
 If useColdSearcher is set to false, that should keep *everything* from
 using the searcher until the warmup has finished.  I cannot be certain
 that this is the case, but I have some reasonable confidence that this
 is how it works.  If you find that it doesn't behave this way, I'd call
 it a bug.
 
 Thanks,
 Shawn


 Thanks for the quick reply. Since distributed search latency is the max
of
 the shard sub-requests, I'm trying my best to minimize any spikes in
 cluster latency due to node restarts.
 I double-checked useColdSearcher was false, but the doc says this means
 requests "block until the first searcher is done warming", which
 translates pretty clearly to "latency spike". The more I think about it,
 the more worried I am that a node might indeed register itself in
 live_nodes and get distributed requests before it's got a searcher to
work
 with. *Especially* if I have lots of serial firstSearcher queries.

 I'll look through the code myself tomorrow, but if anyone can help
 confirm/deny the order of operations here, I'd appreciate it.




Re: SolrCloud extended warmup support

2014-07-25 Thread Jeff Wartes
It's a command like this just prior to jetty startup:

find -L <solr home dir> -type f -exec cat {} > /dev/null \;


On 7/24/14, 2:11 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

Jeff Wartes [jwar...@whitepages.com] wrote:
 Well, I'm not sure what to say. I've been observing a noticeable latency
 decrease over the first few thousand queries.

How exactly do you get the index files fully cached? The cp-command will
(at least for some systems) happily skip copying if the destination is
/dev/null. One way to ensure caching is to cat all the files to
/dev/null.

- Toke Eskildsen



Re: SOLR cloud creating multiple copies of the same index

2014-07-25 Thread Jeff Wartes

Looks to me like you are, or were, hitting the replication handler's
backup function:
http://wiki.apache.org/solr/SolrReplication#HTTP_API

ie, http://master_host:port/solr/replication?command=backup

You might not have been doing it explicitly, there's some support for a
backup being triggered when certain things happen:
http://wiki.apache.org/solr/SolrReplication#Master




On 7/25/14, 1:50 PM, pras.venkatesh prasann...@outlook.com wrote:

Hi , we have a solr cloud instance with 8 nodes and 4 shards. We are
starting
to see that index size is growing so huge and when looked at the file
system
solr has created several copies of the index.
However using solr admin, I could see it's using only one of them.

This is what I see in solr admin.

Index:
/opt/solr/collections/aq-collection/data/index.20140725024044234

Master (Searching)
1406320016969
Gen - 81553
size - 58.72 GB

But when I go in to the file system , This is how it looks.

16G   index.20140527220456134
  45G   index.20140630001131038
 4.6G   index.20140630090031282
  20G   index.20140703192128959
 1.3G   index.20140703200948410
  31G   index.20140708162308859
  52G   index.20140716165801658
  59G   index.20140725024044234
   4K   index.properties
   4K   replication.properties

it is actually pointing only to the index.20140725024044234, and using
that
for searching and indexing. The timestamps on other indexes are old (about a
month or so)

Can someone explain why it created so many copies of the index (we did
not create them manually). and how it can be prevented.

Our solr instances are running on solaris VMs



--
View this message in context:
http://lucene.472066.n3.nabble.com/SOLR-cloud-creating-multiple-copies-of-
the-same-index-tp4149264.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to restore an index from a backup over HTTP

2014-08-18 Thread Jeff Wartes

I'm able to do cross-solrcloud-cluster index copy using nothing more than
careful use of the "fetchindex" replication handler command.

I'm using this as a build/deployment tool, so I manually create a
collection in two clusters, index into one, test, and then ask the other
cluster to fetchindex from it on each shard/replica.

Some caveats:
  1. It seems like fetchindex may silently decline if it thinks the index
it has is newer.
  2. I'm not doing this on an index that's currently receiving updates.
  3. SolrCloud replication doesn't come into this flow, even if you
fetchindex on a leader. (although once you're done, updates should get
replicated normally)
  4. Both collections must be created with the same number of shards and
sharding mechanism. (although replication factor can vary)
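
The call itself is just the stock replication handler fetchindex command,
issued once per target core, roughly (hosts and core names are placeholders):

  http://target-host:8983/solr/collection1_shard1_replica1/replication?command=fetchindex&masterUrl=http://source-host:8983/solr/collection1_shard1_replica1/replication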
 

I've got a tool for automating this that I'd like to push to github at
some point, let me know if you're interested.





On 8/16/14, 3:03 AM, Greg Solovyev g...@zimbra.com wrote:

Thanks Shawn, this is a pretty cool idea. Adding the handler seems pretty
straight forward, but the main concern I have is the internal data format
that ReplicationHandler and SnapPuller use. This new handler as well as
the code that I've already written to download the index files from Solr
will depend on that format. Unfortunately, this format is not documented
and is not abstracted by SolrJ, so I wonder what I can do to make sure it
does not change on us without notice.

Thanks,
Greg

- Original Message -
From: Shawn Heisey s...@elyograg.org
To: solr-user@lucene.apache.org
Sent: Friday, August 15, 2014 7:31:19 PM
Subject: Re: How to restore an index from a backup over HTTP

On 8/15/2014 5:51 AM, Greg Solovyev wrote:
 What I want to achieve is being able to send the backed up index to
Solr (either standalone or with ZooKeeper) in a way similar to creating
a new Collection. I.e. create a new collection and upload an existing
index directly into that Collection. I've looked through Solr code and
so far I have not found a handler that would allow this scenario. So,
the last idea is to implement a special handler for this case, perhaps
extending CoreAdminHandler. ReplicationHandler together with SnapPuller
do pretty much what I need to do, except that the action has to be
initiated by the receiving Solr server and I need to initiate the action
externally. I.e., instead of having Solr slave download an index from
Solr master, I need to feed the index to Solr master and ideally this
would work the same way in standalone and SolrCloud modes.

I have not made any attempt to verify what I'm stating below.  It may
not work.

What I think I would *try* is setting up a standalone Solr (no cloud) on
the backup server.  Use scripted index/config copies and Solr start/stop
actions to get the index up and running on a known core in the
standalone Solr.  Then use the replication handler's HTTP API to
replicate the index from that standalone server to each of the replicas
in your cluster.

https://wiki.apache.org/solr/SolrReplication#HTTP_API
https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexRe
plication-HTTPAPICommandsfortheReplicationHandler

One thing that I do not know is whether SolrCloud itself might interfere
with these actions, or whether it might automatically take care of
additional replicas if you replicate to the shard leader.  If SolrCloud
*would* interfere, then this idea might need special support in
SolrCloud, perhaps as an extension to the Collections API.  If it won't
interfere, then the use-case would need to be documented (on the user
wiki at a minimum) so that committers will be aware of it and preserve
the capability in future versions.  An extension to the Collections API
might be a good idea either way -- I've seen a number of questions about
capability that falls under this basic heading.

Thanks,
Shawn



Re: Replicating Between Solr Clouds

2014-08-19 Thread Jeff Wartes

I've been working on this tool, which wraps the collections API to do more
advanced cluster-management operations:
https://github.com/whitepages/solrcloud_manager


One of the operations I've added (copy) is a deployment mechanism that
uses the replication handler's snap puller to hot-load a pre-indexed
collection from one solrcloud cluster into another. You create the same
collection name with the same shard count in two clusters, index into one,
and copy from that into the other.

This method won't work as a method of active replication, since it copies
the whole index. If you only need a periodic copy between data centers
though, or want someplace to restore from in case of critical failure
(until you can properly rebuild), there might be something you can use
here. 




On 8/19/14, 12:45 PM, reparker23 reparke...@gmail.com wrote:

Are there any more OOB solutions for inter-SolrCloud replication now?  Our
indexing is so slow that we cannot rely on a complete re-index of data
from
our DB of record (SQL) to recover data in the Solr indices.



--
View this message in context:
http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp41211
96p4153856.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to restore an index from a backup over HTTP

2014-08-20 Thread Jeff Wartes

Here’s the repo:
https://github.com/whitepages/solrcloud_manager


Comments/Issues/Patches welcome.


On 8/18/14, 11:28 AM, Greg Solovyev g...@zimbra.com wrote:

Thanks Jeff, I'd be interested in taking a look at the code for this
tool. My github ID is grishick.

Thanks,
Greg

- Original Message -
From: Jeff Wartes jwar...@whitepages.com
To: solr-user@lucene.apache.org
Sent: Monday, August 18, 2014 9:49:28 PM
Subject: Re: How to restore an index from a backup over HTTP

I'm able to do cross-solrcloud-cluster index copy using nothing more than
careful use of the "fetchindex" replication handler command.

I'm using this as a build/deployment tool, so I manually create a
collection in two clusters, index into one, test, and then ask the other
cluster to fetchindex from it on each shard/replica.

Some caveats:
  1. It seems like fetchindex may silently decline if it thinks the index
it has is newer.
  2. I'm not doing this on an index that's currently receiving updates.
  3. SolrCloud replication doesn't come into this flow, even if you
fetchindex on a leader. (although once you're done, updates should get
replicated normally)
  4. Both collections must be created with the same number of shards and
sharding mechanism. (although replication factor can vary)
 

I've got a tool for automating this that I'd like to push to github at
some point, let me know if you're interested.





On 8/16/14, 3:03 AM, Greg Solovyev g...@zimbra.com wrote:

Thanks Shawn, this is a pretty cool idea. Adding the handler seems pretty
straight forward, but the main concern I have is the internal data format
that ReplicationHandler and SnapPuller use. This new handler as well as
the code that I've already written to download the index files from Solr
will depend on that format. Unfortunately, this format is not documented
and is not abstracted by SolrJ, so I wonder what I can do to make sure it
does not change on us without notice.

Thanks,
Greg

- Original Message -
From: Shawn Heisey s...@elyograg.org
To: solr-user@lucene.apache.org
Sent: Friday, August 15, 2014 7:31:19 PM
Subject: Re: How to restore an index from a backup over HTTP

On 8/15/2014 5:51 AM, Greg Solovyev wrote:
 What I want to achieve is being able to send the backed up index to
Solr (either standalone or with ZooKeeper) in a way similar to creating
a new Collection. I.e. create a new collection and upload an existing
index directly into that Collection. I've looked through Solr code and
so far I have not found a handler that would allow this scenario. So,
the last idea is to implement a special handler for this case, perhaps
extending CoreAdminHandler. ReplicationHandler together with SnapPuller
do pretty much what I need to do, except that the action has to be
initiated by the receiving Solr server and I need to initiate the action
externally. I.e., instead of having Solr slave download an index from
Solr master, I need to feed the index to Solr master and ideally this
would work the same way in standalone and SolrCloud modes.

I have not made any attempt to verify what I'm stating below.  It may
not work.

What I think I would *try* is setting up a standalone Solr (no cloud) on
the backup server.  Use scripted index/config copies and Solr start/stop
actions to get the index up and running on a known core in the
standalone Solr.  Then use the replication handler's HTTP API to
replicate the index from that standalone server to each of the replicas
in your cluster.

https://wiki.apache.org/solr/SolrReplication#HTTP_API
https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexR
e
plication-HTTPAPICommandsfortheReplicationHandler

One thing that I do not know is whether SolrCloud itself might interfere
with these actions, or whether it might automatically take care of
additional replicas if you replicate to the shard leader.  If SolrCloud
*would* interfere, then this idea might need special support in
SolrCloud, perhaps as an extension to the Collections API.  If it won't
interfere, then the use-case would need to be documented (on the user
wiki at a minimum) so that committers will be aware of it and preserve
the capability in future versions.  An extension to the Collections API
might be a good idea either way -- I've seen a number of questions about
capability that falls under this basic heading.

Thanks,
Shawn



Re: Solr API for getting shard's leader/replica status

2014-09-08 Thread Jeff Wartes

I had a similar need. The resulting tool is in scala, but it still might
be useful to look at. I had to work through some of those same issues:
https://github.com/whitepages/solrcloud_manager


From a clusterstate perspective, I mostly cared about active vs
non-active, so here's a sample output of printing the cluster state:

sbt run -z zookeeper.example.com:2181/qa
[info] Running com.whitepages.CLI -z zookeeper.example.com:2181/qa
Aliases:
collection1-current -  collection1
Nodes:
101.16.0.101:8980/solr (up/used)
101.16.0.103:8980/solr (up/used)
Collection   Slice    ReplicaName  Host                    ActiveSlice  ActiveReplica  CoreName
collection1  shard1*  core_node2   101.16.0.103:8980/solr  true         true           collection1_shard1_replica1
collection1  shard1   core_node3   101.16.0.101:8980/solr  true         true           collection1_shard1_replica2
collection1  shard2*  core_node1   101.16.0.103:8980/solr  true         true           collection1_shard2_replica1
collection1  shard2   core_node4   101.16.0.101:8980/solr  true         true           collection1_shard2_replica2






On 9/5/14, 2:21 AM, manohar211 manohar.srip...@gmail.com wrote:

Thanks for the comments!!
I found out the solution on how I can get the replica's state. Here's the
piece of code.

while (iter.hasNext()) {
    Slice slice = iter.next();
    for (Replica replica : slice.getReplicas()) {
        System.out.println("replica state for " + replica.getStr("core")
                + " : " + replica.getStr("state"));
        System.out.println(slice.getName());
        System.out.println(slice.getState());
    }
}



--
View this message in context:
http://lucene.472066.n3.nabble.com/Solr-API-for-getting-shard-s-leader-rep
lica-status-tp4156902p4157108.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Sharding Help

2014-09-08 Thread Jeff Wartes

You need to specify a replication factor of 2 if you want two copies of
each shard. Solr doesn't "auto fill" available capacity, contrary to the
misleading examples on the http://wiki.apache.org/solr/SolrCloud page.
Those examples only have that behavior because they ask you to copy the
examples directory, which brings some on-disk configuration with it.
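
In other words, something like this (note replicationFactor) should get you
2 shards x 2 replicas across those four nodes:

  http://serv001:5258/solr/admin/collections?action=CREATE&name=Main&numShards=2&replicationFactor=2&maxShardsPerNode=1&createNodeSet=serv001:5258_solr,serv002:5258_solr,serv003:5258_solr,serv004:5258_solr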



On 9/8/14, 1:33 PM, Ethan eh198...@gmail.com wrote:

Thanks Erick.  That cleared my confusion.

I have a follow up question -  If I run the CREATE command with 4 nodes in
createNodeSet, I thought 2 leaders and 2 followers will be created
automatically. Thats not the case, however.

http://serv001:5258/solr/admin/collections?action=CREATE&name=Main&numShards=2&maxShardsPerNode=1&createNodeSet=serv001:5258_solr,serv002:5258_solr,serv003:5258_solr,serv004:5258_solr

I still get the same response.  I see 2 leaders being created, but I do
not
see other 2 nodes show up as followers in the cloud page in Solr Admin UI.
 It looks like collection was not created for those 2 nodes at all.

Is there additional step involved to add them?

On Mon, Sep 8, 2014 at 12:11 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Ahhh, this is a continual source of confusion. I've started a one-man
 campaign to talk about leaders and followers when relevant...

 _Every_ node is a replica. This is because a node can be a leader or
 follower, and the role can change.

 So your case is entirely normal. These nodes are probably the leaders
 too, and will remain so while you add more replicas/followers.

 Best,
 Erick

 On Mon, Sep 8, 2014 at 11:20 AM, Ethan eh198...@gmail.com wrote:
  I am trying to setup 2 shard cluster with 2 replicas with dedicated
nodes
  for replicas.  I have 4 node SolrCloud setup that I am trying to shard
  using collections api .. (Like
 
 
https://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_
shard_replicas_and_zookeeper_ensemble
  )
 
  I ran this command -
 
 
 
 http://serv001:5258/solr/admin/collections?action=CREATE&name=Main&numShards=2&maxShardsPerNode=1&createNodeSet=serv001:5258_solr,serv002:5258_solr
 
  Response -
 
  response
  lst name=responseHeader
  int name=status0/int
  int name=QTime3932/int
  /lst
  lst name=success
  lst
  lst name=responseHeader
  int name=status0/int
  int name=QTime2982/int
  /lst
  str name=coreMain_shard2_replica1/str
  /lst
  lst
  lst name=responseHeader
  int name=status0/int
  int name=QTime3005/int
  /lst
  str name=coreMain_shard1_replica1/str
  /lst
  /lst
  /response
 
  I want to know what *_replica1 or *_replica2 means?  Are they actually
  replicas and not the shards?  I intended to add 2 more nodes as
dedicated
  replication nodes.  How to accomplish that?
 
  Would appreciate any pointers.
 
  -E




Re: replica recovery

2015-10-27 Thread Jeff Wartes

On the face of it, your scenario seems plausible. I can offer two pieces
of info that may or may not help you:

1. A write request to Solr will not be acknowledged until an attempt has
been made to write to all relevant replicas. So, B won’t ever be missing
updates that were applied to A, unless communication with B was disrupted
somehow at the time of the update request. You can add a min_rf param to
your write request, in which case the response will tell you how many
replicas received the update, but it’s still up to your indexer client to
decide what to do if that’s less than your replication factor.

See 
https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+
Tolerance for more info.
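
For what it's worth, a rough SolrJ sketch of using min_rf on the client side
(assumes SolrJ 5.x; the collection name and ZK address are placeholders):

  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.request.UpdateRequest;
  import org.apache.solr.client.solrj.response.UpdateResponse;
  import org.apache.solr.common.SolrInputDocument;

  CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
  client.setDefaultCollection("mycollection");

  UpdateRequest req = new UpdateRequest();
  req.setParam("min_rf", "2");            // ask Solr to report the achieved replication factor
  SolrInputDocument doc = new SolrInputDocument();
  doc.addField("id", "doc-1");
  req.add(doc);

  UpdateResponse rsp = req.process(client);
  int achieved = client.getMinAchievedReplicationFactor("mycollection", rsp.getResponse());
  // if achieved < 2, it's the client's call: retry, queue for later, or alert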

2. There are two forms of replication. The usual thing is for the leader
for each shard to write an update to all replicas before acknowledging the
write itself, as above. If a replica is less than N docs behind the
leader, the leader can replay those docs to the replica from its
transaction log. If a replica is more than N docs behind though, it falls
back to the replication handler recovery mode you mention, and attempts to
re-sync the whole shard from the leader.
The default N for this is 100, which is pretty low for a high-update-rate
index. It can be changed by increasing the size of the transaction log,
(via numRecordsToKeep) but be aware that a large transaction log size can
delay node restart.

See 
https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConf
ig#UpdateHandlersinSolrConfig-TransactionLog for more info.
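
For reference, the knob lives in the updateLog config in solrconfig.xml,
roughly (the value is just an example, not a recommendation):

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">10000</int>
  </updateLog>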


Hope some of that helps, I don’t know a way to say
delete-first-on-recovery.



On 10/27/15, 5:21 PM, "Brian Scholl"  wrote:

>Whoops, in the description of my setup that should say 2 replicas per
>shard.  Every server has a replica.
>
>
>> On Oct 27, 2015, at 20:16, Brian Scholl  wrote:
>> 
>> Hello,
>> 
>> I am experiencing a failure mode where a replica is unable to recover
>>and it will try to do so forever.  In writing this email I want to make
>>sure that I haven't missed anything obvious or missed a configurable
>>option that could help.  If something about this looks funny, I would
>>really like to hear from you.
>> 
>> Relevant details:
>> - solr 5.3.1
>> - java 1.8
>> - ubuntu linux 14.04 lts
>> - the cluster is composed of 1 SolrCloud collection with 100 shards
>>backed by a 3 node zookeeper ensemble
>> - there are 200 solr servers in the cluster, 1 replica per shard
>> - a shard replica is larger than 50% of the available disk
>> - ~40M docs added per day, total indexing time is 8-10 hours spread
>>over the day
>> - autoCommit is set to 15s
>> - softCommit is not defined
>>  
>> I think I have traced the failure to the following set of events but
>>would appreciate feedback:
>> 
>> 1. new documents are being indexed
>> 2. the leader of a shard, server A, fails for any reason (java crashes,
>>times out with zookeeper, etc)
>> 3. zookeeper promotes the other replica of the shard, server B, to the
>>leader position and indexing resumes
>> 4. server A comes back online (typically 10s of seconds later) and
>>reports to zookeeper
>> 5. zookeeper tells server A that it is no longer the leader and to sync
>>with server B
>> 6. server A checks with server B but finds that server B's index
>>version is different from its own
>> 7. server A begins replicating a new copy of the index from server B
>>using the (legacy?) replication handler
>> 8. the original index on server A was not deleted so it runs out of
>>disk space mid-replication
>> 9. server A throws an error, deletes the partially replicated index,
>>and then tries to replicate again
>> 
>> At this point I think steps 6  => 9 will loop forever
>> 
>> If the actual errors from solr.log are useful let me know, not doing
>>that now for brevity since this email is already pretty long.  In a
>>nutshell and in order, on server A I can find the error that took it
>>down, the post-recovery instruction from ZK to unregister itself as a
>>leader, the corrupt index error message, and then the (start - whoops,
>>out of disk- stop) loop of the replication messages.
>> 
>> I first want to ask if what I described is possible or did I get lost
>>somewhere along the way reading the docs?  Is there any reason to think
>>that solr should not do this?
>> 
>> If my version of events is feasible I have a few other questions:
>> 
>> 1. What happens to the docs that were indexed on server A but never
>>replicated to server B before the failure?  Assuming that the replica on
>>server A were to complete the recovery process would those docs appear
>>in the index or are they gone for good?
>> 
>> 2. I am guessing that the corrupt replica on server A is not deleted
>>because it is still viable, if server B had a catastrophic failure you
>>could pick up the pieces from server A.  If so is this a configurable
>>option somewhere?  I'd rather take my chances on server B going down

Re: Facet queries blow out the filterCache

2015-10-28 Thread Jeff Wartes

FWIW, since it seemed like there was at least one bug here (and possibly
more), I filed
https://issues.apache.org/jira/browse/SOLR-8171



On 10/6/15, 3:58 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:

>
>I dug far enough yesterday to find the GET_DOCSET, but not far enough to
>find why. Thanks, a little context is really helpful sometimes.
>
>
>So, starting with an empty filterCache...
>
>http://localhost:8983/solr/techproducts/select?q=name:foo=1=tru
>e
>=popularity
>
>New values:lookups: 0, hits: 0, inserts: 1, size: 1
>
>So for the reasons you explained, "inserts" is incremented for this new
>search
>
>http://localhost:8983/solr/techproducts/select?q=name:boo=1=tru
>e
>=popularity
>
>New values: inserts:   lookups: 0, hits: 0, inserts 2, size: 2
>
>
>Another new search, another new insert. No "lookups" though, so how does
>it know name:boo wasn’t cached?
>
>http://localhost:8983/solr/techproducts/select?q=name:boo=1=tru
>e
>=popularity
>New values: inserts:   lookups: 1, hits: 1, inserts: 2, size: 2
>
>
>But it clearly does know - when I repeat the search, I get both a lookup
>and a hit. (and no insert) So is this just
>a bug in the stats reporting, perhaps?
>
>
>When I first started looking at this, it was in a solrcloud cluster, and
>one interesting thing about that cluster is that it was configured with
>the queryResultCache turned off, so let’s repeat the above experiment
>without the queryResultCache. (I’m just commenting it out in the
>techproducts config for this run.)
>
>
>Starting with an empty filterCache...
>
>http://localhost:8983/solr/techproducts/select?q=name:foo=1=tru
>e
>=popularity
>New values:lookups: 0, hits: 0, inserts: 1, size: 1
>
>Same as before...
>
>http://localhost:8983/solr/techproducts/select?q=name:boo=1=tru
>e
>=popularity
>New values: inserts:   lookups: 0, hits: 0, inserts 2, size: 2
>
>Same as before...
>
>http://localhost:8983/solr/techproducts/select?q=name:boo=1=tru
>e
>=popularity
>New values: inserts: lookups: 0, hits: 0, inserts 3, size: 2
>
>No cache hit! We get an insert instead, but it’s already in there, so the
>size doesn’t change. So disabling the queryResultCache apparently causes
>facet queries to be unable to use the filterCache?
>
>
>
>
>I’m increasingly thinking that different use cases need different
>filterCaches, rather than try to bundle every explicit or unexpected
>use-case under one cache with one size and one regenerator.
>
>
>
>
>
>
>On 10/6/15, 2:45 PM, "Chris Hostetter" <hossman_luc...@fucit.org> wrote:
>
>>: So, no SolrCloud, default example config, about as basic as you get. I
>>: didn’t even bother indexing any docs. Then I issued this query:
>>: 
>>: 
>>http://localhost:8983/solr/techproducts/select?q=name:foo=1=tr
>>u
>>e
>>: =popularity=0=-1
>>
>>: This still causes an insert into the filterCache.
>>
>>the faceting component is a type of operation that indicates in the
>>QueryCommand that it needs to GET_DOCSET for the set of all documents
>>matching the query (independent of pagination) -- the point of this
>>DocSet 
>>is so the faceting logic can then compute the intersection of the set of
>>all matching documents with the set of documents matching each facet
>>constraint.  the cached DocSet will be re-used both within the context
>>of the current request, and in future facet requests over the
>>same query+filters.
>>
>>: The only real difference I’m noticing vs my solrcloud collection is
>>that
>>: repeating the query increments cache lookups and hits. It’s still odd
>>: though, because issuing new distinct queries causes a reported insert,
>>but
>>: not a lookup, so the cache hit ratio is always exactly 1.
>>
>>i'm not following what you are saying at all ... can you give some
>>concrete examples (ie: "starting with an empty cache i do this request,
>>then i see these cache stats, then i do this identical/different query
>>and 
>>then the cache stats look like this...")
>>
>>
>>
>>-Hoss
>>http://www.lucidworks.com/
>



Re: copy data between collection

2015-10-26 Thread Jeff Wartes

The “copy” command in this tool automatically does what Upayavira
describes, including bringing the replicas up to date. (if any)
https://github.com/whitepages/solrcloud_manager


I’ve been using it as a mechanism for copying a collection into a new
cluster (different ZK), but it should work within
a cluster too. The same caveats apply - see the entry in the README.

I’ve also been doing some collection backup/restore stuff that could be
used to copy a collection within a cluster, (back up your collection, then
restore into a new collection with a different name) but I only just
pushed that, and haven’t bundled it into a release yet.

In all cases, you’re responsible for managing the actual collection
definitions yourself.

An alternative tool I’m aware of is this one:
https://github.com/bloomreach/solrcloud-haft

This says it’s only tested with Solr 4.6, but I’d think it should work.
The Solr APIs for replication haven’t changed much. I haven’t used it, but
it looks like it has some stuff around saving ZK data that could be
useful, and that’s one thing I haven’t focused on myself yet.



On 10/26/15, 4:46 AM, "Upayavira"  wrote:

>Hi Shani,
>
>There isn't a SolrCloud way to do it. A proper 'clone this collection'
>feature would be a very useful thing.
>
>However, I have managed to do it, in a way that involves some caveats:
> * you should only do this on a collection that has no replicas. Add
> replicas *after* cloning the index
> * if you must do it on a sharded index, then you will need to do it
> once for each shard. No guarantees though
>
>All SolrCloud nodes are already enabled as 'replication masters' so
>that new replicas can pull a full index from the current leader. We're
>gonna use this feature to pull our index (assuming single shard):
>
>http://<host>:8983/solr/<new collection>_shard1_replica1/replication?command=fetchindex&masterUrl=http://<host>:8983/solr/<old collection>_shard1_replica1/replication
>
>This basically says to the core behind your new collection: "Go to the
>core behind the old collection, and pull its entire index".
>
>This worked for me. I added a replica afterwards, and the index cloned
>correctly. However, when I did it against a collection that had a
>replica already, the replica *didn't* notice, meaning the leader/replica
>were now out of sync, i.e: Really make sure you do this replication
>before you add replicas to your new collection.
>
>Hope this helps.
>
>Upayavira
>
>On Mon, Oct 26, 2015, at 11:21 AM, Chaushu, Shani wrote:
>> Hi,
>> Is there an API to copy all the documents from one collection to another
>> collection in the same solr server simply?
>> I'm using solr cloud 4.10
>> Thanks,
>> Shani
>> 
>> -
>> Intel Electronics Ltd.
>> 
>> This e-mail and any attachments may contain confidential material for
>> the sole use of the intended recipient(s). Any review or distribution
>> by others is strictly prohibited. If you are not the intended
>> recipient, please contact the sender and delete all copies.



Re: are there any SolrCloud supervisors?

2015-10-14 Thread Jeff Wartes

I’m aware of two public administration tools:
This was announced to the list just recently:
https://github.com/bloomreach/solrcloud-haft
And I’ve been working in this:
https://github.com/whitepages/solrcloud_manager

Both of these hook the Solrcloud client’s ZK access to inspect the cluster
state and execute more complex cluster-aware operations. I was also a bit
amused, because it looks like we both independently arrived at the same
replication-handler-based copy-collection operation. (Which suggests to me
that the functionality should be pushed into the collections API.)

Neither of these is a supervisor though, they merely provide a way to
execute cluster aware commands. Another monitor-oriented mechanism would
be needed to detect when to perform those commands, and I’ve not seen
anything existing along those lines.



On 10/13/15, 5:35 AM, "Susheel Kumar"  wrote:

>Sounds interesting...
>
>On Tue, Oct 13, 2015 at 12:58 AM, Trey Grainger 
>wrote:
>
>> I'd be very interested in taking a look if you post the code.
>>
>> Trey Grainger
>> Co-Author, Solr in Action
>> Director of Engineering, Search & Recommendations @ CareerBuilder
>>
>> On Fri, Oct 2, 2015 at 3:09 PM, r b  wrote:
>>
>> > I've been working on something that just monitors ZooKeeper to add and
>> > remove nodes from collections. the use case being I put SolrCloud in
>> > an autoscaling group on EC2 and as instances go up and down, I need
>> > them added to the collection. It's something I've built for work and
>> > could clean up to share on GitHub if there is much interest.
>> >
>> > I asked in the IRC about a SolrCloud supervisor utility but wanted to
>> > extend that question to this list. are there any more "full featured"
>> > supervisors out there?
>> >
>> >
>> > -renning
>> >
>>



Re: DevOps question : auto deployment/setup of Solr & Zookeeper on medium-large clusters

2015-10-20 Thread Jeff Wartes

If you’re using AWS, there’s this:
https://github.com/LucidWorks/solr-scale-tk
If you’re using chef, there’s this:
https://github.com/vkhatri/chef-solrcloud

(There are several other chef cookbooks for Solr out there, but this is
the only one I’m aware of that supports Solr 5.3.)

For ZK, I’m less familiar, but if you’re using chef there’s this:
https://github.com/SimpleFinance/chef-zookeeper
And this might be handy to know about too:
https://github.com/Netflix/exhibitor/wiki


On 10/20/15, 6:37 AM, "Davis, Daniel (NIH/NLM) [C]" 
wrote:

>Waste of money in my opinion.   I would point you towards other tools -
>bash scripts and free configuration managers such as puppet, chef, salt,
>or ansible.Depending on what development you are doing, you may want
>a continuous integration environment.   For a small company starting out,
>using a free CI, maybe SaaS, is a good choice.   A professional version
>such as Bamboo, TeamCity, Jenkins are almost essential in a large
>enterprise if you are doing diverse builds.
>
>When you create a VM, you can generally specify a script to run after the
>VM is mostly created.   There is a protocol (PXE Boot) that enables this
>- a PXE server listens and hears that a new server with such-and-such
>Ethernet Address is starting.   The PXE server makes it boot like a
>CD-ROM/DVD install, booting from installation media on the network and
>installing.Once that install is down, a custom script may be invoked.
>  This script is typically a bash script, because you may not be able to
>count on too much else being installed.   However, python/perl are also
>reasonable choices - just be careful that the modules/libraries you are
>using for the script are present.The same PXE protocol is used in
>large on-premises installations (vCenter) and in the cloud (AWS/Digital
>Ocean).  We don't care about the PXE server - the point is that you can
>generally run a bash script after your install.
>
>The bash script can bootstrap other services such as puppet, chef, or
>salt, and/or setup keys so that push configuration management tools such
>as ansible can reach the server.   The bash script may even be smart
>enough to do all of the setup you need, depending on what other servers
>you need to configure.   Smart bash scripts are good for a small company,
>but for large setups, I'd use puppet, chef, salt, and/or ansible.
>
>What I tend to do is to deploy things in such a way that puppet (because
>it is what we use here) can setup things so that a "solradm" account can
>setup everything else, and solr and zookeeper are running as a "solrapp"
>user using puppet.Then, my continuous integration server, which is
>Atlassian Bamboo (you can also use tools such as Jenkins, TeamCity,
>BuildBot), installs solr as "solradm" and sets it up to run as "solrapp".
>
>I am not a systems administrator, and I'm not really in "DevOps", my job
>is to be above all of that and do "systems architecture" which I am lucky
>still involves coding both in system administration and applications
>development.   So, that's my 2 cents.
>
>Dan Davis, Systems/Applications Architect (Contractor),
>Office of Computer and Communications Systems,
>National Library of Medicine, NIH
>
>-Original Message-
>From: Susheel Kumar [mailto:susheel2...@gmail.com]
>Sent: Tuesday, October 20, 2015 9:19 AM
>To: solr-user@lucene.apache.org
>Subject: DevOps question : auto deployment/setup of Solr & Zookeeper on
>medium-large clusters
>
>Hello,
>
>Resending to see opinion from Dev-Ops perspective on the tools for
>installing/deployment of Solr & ZK on large no of machines and
>maintaining them. I have heard Bladelogic or HP OO (commercial tools)
>etc. being used.
>Please share your experience or pros / cons of such tools.
>
>Thanks,
>Susheel
>
>On Mon, Oct 19, 2015 at 3:32 PM, Susheel Kumar 
>wrote:
>
>> Hi,
>>
>> I am trying to find the best practises for setting up Solr on new 20+
>> machines  & ZK (5+) and repeating same on other environments.  What's
>> the best way to download, extract, setup Solr & ZK in an automated way
>> along with other dependencies like java etc.  Among shell scripts or
>> puppet or docker or imaged vm's what is being used & suggested from
>> Dev-Ops perspective.
>>
>> Thanks,
>> Susheel
>>



Re: Facet queries blow out the filterCache

2015-10-06 Thread Jeff Wartes

I dug far enough yesterday to find the GET_DOCSET, but not far enough to
find why. Thanks, a little context is really helpful sometimes.


So, starting with an empty filterCache...

http://localhost:8983/solr/techproducts/select?q=name:foo=1=true
=popularity

New values: lookups: 0, hits: 0, inserts: 1, size: 1

So for the reasons you explained, "inserts" is incremented for this new
search

http://localhost:8983/solr/techproducts/select?q=name:boo=1=true
=popularity

New values: inserts:lookups: 0, hits: 0, inserts 2, size: 2


Another new search, another new insert. No "lookups" though, so how does
it know name:boo wasn’t cached?

http://localhost:8983/solr/techproducts/select?q=name:boo=1=true
=popularity
New values: inserts:lookups: 1, hits: 1, inserts: 2, size: 2


But it clearly does know - when I repeat the search, I get both a lookup
and a hit. (and no insert) So is this just
a bug in the stats reporting, perhaps?


When I first started looking at this, it was in a solrcloud cluster, and
one interesting thing about that cluster is that it was configured with
the queryResultCache turned off, so let’s repeat the above experiment
without the queryResultCache. (I’m just commenting it out in the
techproducts config for this run.)


Starting with an empty filterCache...

http://localhost:8983/solr/techproducts/select?q=name:foo=1=true
=popularity
New values: lookups: 0, hits: 0, inserts: 1, size: 1

Same as before...

http://localhost:8983/solr/techproducts/select?q=name:boo=1=true
=popularity
New values: inserts:lookups: 0, hits: 0, inserts 2, size: 2

Same as before...

http://localhost:8983/solr/techproducts/select?q=name:boo=1=true
=popularity
New values: inserts: lookups: 0, hits: 0, inserts 3, size: 2

No cache hit! We get an insert instead, but it’s already in there, so the
size doesn’t change. So disabling the queryResultCache apparently causes
facet queries to be unable to use the filterCache?




I’m increasingly thinking that different use cases need different
filterCaches, rather than try to bundle every explicit or unexpected
use-case under one cache with one size and one regenerator.






On 10/6/15, 2:45 PM, "Chris Hostetter"  wrote:

>: So, no SolrCloud, default example config, about as basic as you get. I
>: didn’t even bother indexing any docs. Then I issued this query:
>: 
>: 
>: http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=true
>: &facet.field=popularity&facet.mincount=0&facet.limit=-1
>
>: This still causes an insert into the filterCache.
>
>the faceting component is a type of operation that indicates in the
>QueryCommand that it needs to GET_DOCSET for the set of all documents
>matching the query (independent of pagination) -- the point of this
>DocSet 
>is so the faceting logic can then compute the intersection of the set of
>all matching documents with the set of documents matching each facet
>constraint.  the cached DocSet will be re-used both within the context
>of the current request, and in future facet requests over the
>same query+filters.
>
>: The only real difference I’m noticing vs my solrcloud collection is that
>: repeating the query increments cache lookups and hits. It’s still odd
>: though, because issuing new distinct queries causes a reported insert,
>but
>: not a lookup, so the cache hit ratio is always exactly 1.
>
>i'm not following what you are saying at all ... can you give some
>concrete examples (ie: "starting with an empty cache i do this request,
>then i see these cache stats, then i do this identical/different query
>and 
>then the cache stats look like this...")
>
>
>
>-Hoss
>http://www.lucidworks.com/



Re: Data Import Handler / Backup indexes

2015-11-17 Thread Jeff Wartes

https://github.com/whitepages/solrcloud_manager supports 5.x, and I added
some backup/restore functionality similar to SOLR-5750 in the last
release. 
Like SOLR-5750, this backup strategy requires a shared filesystem, but
note that unlike SOLR-5750, I haven’t yet added any backup functionality
for the contents of ZK. I’m currently working on some parts of that.


Making a copy of a collection is supported too, with some caveats.


On 11/17/15, 10:20 AM, "Brian Narsi"  wrote:

>Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>
>
>
>On Tue, Nov 17, 2015 at 12:09 PM, KNitin  wrote:
>
>> afaik Data import handler does not offer backups. You can try using the
>> replication handler to backup data as you wish to any custom end point.
>>
>> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>>This
>> helps backup solr indices across clusters.
>>
>> On Tue, Nov 17, 2015 at 7:08 AM, Brian Narsi  wrote:
>>
>> > I am using Data Import Handler to retrieve data from a database with
>> >
>> > full-import, clean = true, commit = true and optimize = true
>> >
>> > This has always worked correctly without any errors.
>> >
>> > But just to be on the safe side, I am thinking that we should do a
>>backup
>> > before initiating Data Import Handler. And just in case something
>>happens
>> > restore the backup.
>> >
>> > Can backup be done automatically (before initiating Data Import
>>Handler)?
>> >
>> > Thanks
>> >
>>



Re: Cached fq decreases performance

2015-09-04 Thread Jeff Wartes


On 9/4/15, 7:06 AM, "Yonik Seeley"  wrote:
>
>Lucene seems to always be changing it's execution model, so it can be
>difficult to keep up.  What version of Solr are you using?
>Lucene also changed how filters work,  so now, a filter is
>incorporated with the query like so:
>
>query = new BooleanQuery.Builder()
>.add(query, Occur.MUST)
>.add(pf.filter, Occur.FILTER)
>.build();
>
>It may be that term queries are no longer worth caching... if this is
>the case, we could automatically not cache them.
>
>It also may be the structure of the query that is making the
>difference.  Solr is creating
>
>(complicated stuff) +(filter(enabled:true))
>
>If you added +enabled:true directly to an existing boolean query, that
>may be more efficient for lucene to process (flatter structure).
>
>If you haven't already, could you try putting parens around your
>(complicated stuff) to see if it makes any difference?
>
>-Yonik


I’ll reply at this point in the thread, since it’s addressed to me, but I
strongly agree with some of the later comments in the thread about knowing
what’s going on. The whole point of this post is that this situation
violated my mental heuristics about how to craft a query.

In answer to the question, this is a Solrcloud 5.3 cluster. I can provide
a little more detail on (complicated stuff) too if that’s helpful. I have
not tried putting everything else in parens, but it’s a couple of distinct
paren clauses anyway:

q=+(_query_:"{!dismax (several fields)}") +(_query_:"{!dismax (several
fields)}") +(_query_:"{!dismax (several fields)}") +enabled:true

So to be clear, that query template outperforms this one:
q=+(_query_:"{!dismax (several fields)}") +(_query_:"{!dismax (several
fields)}") +(_query_:"{!dismax (several fields)}")&fq=enabled:true


Your comments remind me that I migrated from 5.2.1 to 5.3 while I’ve been
doing my performance testing, and I thought I noticed a performance
degradation in that transition, but I never followed though to confirm
that. I hadn’t tested moving that FQ clause into the Q on 5.2.1, only 5.3.






Re: Cached fq decreases performance

2015-09-03 Thread Jeff Wartes

I’m measuring performance in the aggregate, over several minutes and tens
of thousands of distinct queries that all use this specific fq.
The cache hit count reported is roughly identical to the number of queries
I’ve sent, so no, this isn’t a first-query cache-miss situation.

The fq result will be large, 15% of my documents qualify, so if solr is
intelligent enough to ignore that restriction in the main query until it’s
found a much smaller set to scan for that criteria, I could see how simply
processing the intersection of the full fq cache value could be time
consuming. Is that the kind of thing you’re talking about with
intersection hopping?


On 9/3/15, 2:00 PM, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:

>FQ has to calculate the result bit set for every document to be able
>to cache it. Q will only calculate it for the documents it matches on
>and there is some intersection hopping going on.
>
>Are you seeing this performance hit on first query only or or every
>one? I would expect on first query only unless your filter cache size
>assumptions are somehow wrong.
>
>Regards,
>   Alex.
>
>
>Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>http://www.solr-start.com/
>
>
>On 3 September 2015 at 16:45, Jeff Wartes <jwar...@whitepages.com> wrote:
>>
>> I have a query like:
>>
>> q=<complicated stuff>&fq=enabled:true
>>
>> For purposes of this conversation, "fq=enabled:true" is set for every
>>query, I never open a new searcher, and this is the only fq I ever use,
>>so the filter cache size is 1, and the hit ratio is 1.
>> The fq=enabled:true clause matches about 15% of my documents. I have
>>some 20M documents per shard, in a 5.3 solrcloud cluster.
>>
>> Under these circumstances, this alternate version of the query averages
>>about 1/3 faster, consumes less CPU, and generates less garbage:
>>
>> q=<complicated stuff> +enabled:true
>>
>> So it appears I have a case where using the cached fq result is more
>>expensive than just putting the same restriction in the query.
>> Does someone have a clear mental model of how “q” and “fq” interact?
>> Naively, I’d expect that either the “q” operates within the set matched
>>by the fq (in which case it’s doing "complicated stuff" on only a subset
>>and should be faster) or that Solr takes the intersection of the q & fq
>>sets (in which case putting the restriction in the “q” means that set
>>needs to be generated instead of retrieved from cache, and should be
>>slower).
>> This has me wondering, if you want fq cache speed boosts, but also want
>>ranking involved, can you do that? Would something like q=<complicated stuff> AND <fq stuff> help, or just be
>>more work?
>>
>> Thanks.



Cached fq decreases performance

2015-09-03 Thread Jeff Wartes

I have a query like:

q=<complicated stuff>&fq=enabled:true

For purposes of this conversation, "fq=enabled:true" is set for every query, I 
never open a new searcher, and this is the only fq I ever use, so the filter 
cache size is 1, and the hit ratio is 1.
The fq=enabled:true clause matches about 15% of my documents. I have some 20M 
documents per shard, in a 5.3 solrcloud cluster.

Under these circumstances, this alternate version of the query averages about 
1/3 faster, consumes less CPU, and generates less garbage:

q=<complicated stuff> +enabled:true

So it appears I have a case where using the cached fq result is more expensive 
than just putting the same restriction in the query.
Does someone have a clear mental model of how “q” and “fq” interact?
Naively, I’d expect that either the “q” operates within the set matched by the 
fq (in which case it’s doing "complicated stuff" on only a subset and should be 
faster) or that Solr takes the intersection of the q & fq sets (in which case 
putting the restriction in the “q” means that set needs to be generated instead 
of retrieved from cache, and should be slower).
This has me wondering, if you want fq cache speed boosts, but also want ranking 
involved, can you do that? Would something like q=<complicated stuff> AND
<fq stuff> help, or just be more work?
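
(For what it's worth, I think newer Solr versions - 5.4+, if I remember right,
so not the 5.3 I'm on - let you embed a cached filter clause directly in the
main query, which might be exactly this. A sketch, using my enabled:true
example:

q=<complicated stuff> AND filter(enabled:true)

The filter() clause is supposed to be satisfied from the filterCache while the
rest of the query still drives scoring. I haven't measured it here though.)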

Thanks.


Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes

I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
index on fields like this:



Re: Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes

No change, still shows an insert per-request. As does a simplified request
with only the facet params
"facet.field=city&facet=true"

It’s definitely facet related though, facet=false eliminates the insert.



On 10/1/15, 1:50 PM, "Mikhail Khludnev" <mkhlud...@griddynamics.com> wrote:

>what if you set f.city.facet.limit=-1 ?
>
>On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes <jwar...@whitepages.com>
>wrote:
>
>>
>> I’m doing some fairly simple facet queries in a two-shard 5.3 SolrCloud
>> index on fields like this:
>>
>> > docValues="true”/>
>>
>> that look something like this:
>> 
>>q=...=id,score=city=true=1
>>it
>> y.facet.limit=50=0=0=fc
>>
>> (no, NOT facet.method=enum - the usage of the filterCache there is
>>pretty
>> well documented)
>>
>> Watching the filterCache stats, it appears that every one of these
>>queries
>> causes the "inserts" counter to be incremented by one. Distinct "q="
>> queries also increase the "size", and eviction happens as normal. If I
>> repeat the same query a few times, "lookups" is not incremented, so
>>these
>> entries generally appear to be completely wasted. (Although when
>>running a
>> lot of these queries, it appears as though a very small set also
>>increment
>> the "lookups" counter, but only a small set, and I haven’t figured out
>>why
>> some are special.)
>>
>> So the question is, why does this facet query have anything to do with
>>the
>> filterCache? This causes a huge amount of filterCache churn with no
>> apparent benefit.
>>
>>
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
><http://www.griddynamics.com>
><mkhlud...@griddynamics.com>



Re: Facet queries blow out the filterCache

2015-10-01 Thread Jeff Wartes
It still inserts if I address the core directly and use distrib=false.

I’ve got a few collections sharing the same config, so it’s surprisingly
annoying to
change solrconfig.xml right now, but it seemed pretty clear the query is
the thing being cached, since
the cache size only changes when the query does.



On 10/1/15, 3:01 PM, "Mikhail Khludnev" <mkhlud...@griddynamics.com> wrote:

>hm..
>This option was useful for introspecting cache content
>https://wiki.apache.org/solr/SolrCaching#showItems It might help you to
>find-out a cause.
>I'm still blaming distributed requests, it expained here
>https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Over-Re
>questParameters
>eg does it happen if you run with distrib=false?
>
>On Fri, Oct 2, 2015 at 12:27 AM, Jeff Wartes <jwar...@whitepages.com>
>wrote:
>
>>
>> No change, still shows an insert per-request. As does a simplified
>>request
>> with only the facet params
>> "=city=true"
>>
>by default it's 100
>https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-Theface
>t.limitParameter
>and can cause filtering by values, it can be seen in logs, btw.
>
>>
>> It’s definitely facet related though, facet=false eliminates the insert.
>>
>>
>>
>> On 10/1/15, 1:50 PM, "Mikhail Khludnev" <mkhlud...@griddynamics.com>
>> wrote:
>>
>> >what if you set f.city.facet.limit=-1 ?
>> >
>> >On Thu, Oct 1, 2015 at 7:43 PM, Jeff Wartes <jwar...@whitepages.com>
>> >wrote:
>> >
>> >>
>> >> I’m doing some fairly simple facet queries in a two-shard 5.3
>>SolrCloud
>> >> index on fields like this:
>> >>
>> >> > >> docValues="true”/>
>> >>
>> >> that look something like this:
>> >>
>> 
>>>>q=...=id,score=city=true=1
>>>>.c
>> >>it
>> >> y.facet.limit=50=0=0=fc
>> >>
>> >> (no, NOT facet.method=enum - the usage of the filterCache there is
>> >>pretty
>> >> well documented)
>> >>
>> >> Watching the filterCache stats, it appears that every one of these
>> >>queries
>> >> causes the "inserts" counter to be incremented by one. Distinct "q="
>> >> queries also increase the "size", and eviction happens as normal. If
>>I
>> >> repeat the same query a few times, "lookups" is not incremented, so
>> >>these
>> >> entries generally appear to be completely wasted. (Although when
>> >>running a
>> >> lot of these queries, it appears as though a very small set also
>> >>increment
>> >> the "lookups" counter, but only a small set, and I haven’t figured
>>out
>> >>why
>> >> some are special.)
>> >>
>> >> So the question is, why does this facet query have anything to do
>>with
>> >>the
>> >> filterCache? This causes a huge amount of filterCache churn with no
>> >> apparent benefit.
>> >>
>> >>
>> >
>> >
>> >--
>> >Sincerely yours
>> >Mikhail Khludnev
>> >Principal Engineer,
>> >Grid Dynamics
>> >
>> ><http://www.griddynamics.com>
>> ><mkhlud...@griddynamics.com>
>>
>>
>
>
>-- 
>Sincerely yours
>Mikhail Khludnev
>Principal Engineer,
>Grid Dynamics
>
><http://www.griddynamics.com>
><mkhlud...@griddynamics.com>



Re: Facet queries blow out the filterCache

2015-10-02 Thread Jeff Wartes

I backed up a bit. I took the stock solr download and did this:

solr-5.3.1>$ bin/solr -e techproducts

So, no SolrCloud, default example config, about as basic as you get. I
didn’t even bother indexing any docs. Then I issued this query:

http://localhost:8983/solr/techproducts/select?q=name:foo&rows=1&facet=true
&facet.field=popularity&facet.mincount=0&facet.limit=-1


This still causes an insert into the filterCache.

The only real difference I’m noticing vs my solrcloud collection is that
repeating the query increments cache lookups and hits. It’s still odd
though, because issuing new distinct queries causes a reported insert, but
not a lookup, so the cache hit ratio is always exactly 1.



On 10/2/15, 4:18 AM, "Toke Eskildsen" <t...@statsbiblioteket.dk> wrote:

>On Thu, 2015-10-01 at 22:31 +, Jeff Wartes wrote:
>> It still inserts if I address the core directly and use distrib=false.
>
>It is quite strange that it is triggered with the direct access. If that
>can be reproduced in test, it looks like a performance optimization to
>be done.
>
>Anyway, operating under the assumption that the single-core facet
>request for some reason acts as a distributed call, the key to avoid the
>fine-counting is to ensure that _all_ possibly relevant term counts has
>been returned in the first facet phase.
>
>Try setting both facet.mincount=0 and facet.limit=-1.
>
>- Toke Eskildsen, State and University Library, Denmark
>
>



Re: Cost of having multiple search handlers?

2015-09-29 Thread Jeff Wartes

At the risk of going increasingly off-thread, yes, please do.
I’ve been using this:
https://dropwizard.github.io/metrics/3.1.0/manual/jetty/, which is
convenient, but doesn’t even have request-handler-level resolution.

Something I’ve started doing for issues that don’t seem likely to get
pulled in but also don’t really need changes in solr/lucene source code is
to publish a free-standing project (with tests) that builds the necessary
jar. For example, https://github.com/whitepages/SOLR-4449.
Seems like a decent middle ground where people can easily use or
contribute changes, and then if it gets popular enough, that’s a strong
signal it should be in the solr distribution itself.



On 9/28/15, 6:05 PM, "Walter Underwood" <wun...@wunderwood.org> wrote:

>We built our own because there was no movement on that. Don’t hold your
>breath.
>
>Glad to contribute it. We’ve been running it in production for a year,
>but the config is pretty manual.
>
>wunder
>Walter Underwood
>wun...@wunderwood.org
>http://observer.wunderwood.org/  (my blog)
>
>
>> On Sep 28, 2015, at 4:41 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>> 
>> 
>> One would hope that https://issues.apache.org/jira/browse/SOLR-4735 will
>> be done by then.
>> 
>> 
>> On 9/28/15, 11:39 AM, "Walter Underwood" <wun...@wunderwood.org> wrote:
>> 
>>> We did the same thing, but reporting performance metrics to Graphite.
>>> 
>>> But we won’t be able to add servlet filters in 6.x, because it won’t
>>>be a
>>> webapp.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Sep 28, 2015, at 11:32 AM, Gili Nachum <gilinac...@gmail.com>
>>>>wrote:
>>>> 
>>>> A different solution to the same need: I'm measuring response times of
>>>> different collections measuring  online/batch queries apart using New
>>>> Relic. I've added a servlet filter that analyses the request and makes
>>>> this
>>>> info available to new relic over a request argument.
>>>> 
>>>> The built in new relic solr plug in doesn't provide much.
>>>> On Sep 28, 2015 17:16, "Shawn Heisey" <apa...@elyograg.org> wrote:
>>>> 
>>>>> On 9/28/2015 6:30 AM, Oliver Schrenk wrote:
>>>>>> I want to register multiple but identical search handler to have
>>>>> multiple buckets to measure performance for our different apis and
>>>>> consumers (and to find out who is actually using Solr).
>>>>>> 
>>>>>> What are there some costs associated with having multiple search
>>>>> handlers? Are they neglible?
>>>>> 
>>>>> Unless you are creating hundreds or thousands of them, I doubt you'll
>>>>> notice any significant increase in resource usage from additional
>>>>> handlers.  Each handler definition creates an additional URL endpoint
>>>>> within the servlet container, additional object creation within Solr,
>>>>> and perhaps an additional thread pool and threads to go with it, so
>>>>> it's
>>>>> not free, but I doubt that it's significant.  The resources required
>>>>> for
>>>>> actually handling a request is likely to dwarf what's required for
>>>>>more
>>>>> handlers.
>>>>> 
>>>>> Disclaimer: I have not delved into the code to figure out exactly
>>>>>what
>>>>> gets created with a search handler config, so I don't know exactly
>>>>>what
>>>>> happens.  I'm basing this on general knowledge about how Java
>>>>>programs
>>>>> are constructed by expert developers, not specifics about Solr.
>>>>> 
>>>>> There are others on the list who have a much better idea than I do,
>>>>>so
>>>>> if I'm wrong, I'm sure one of them will let me know.
>>>>> 
>>>>> Thanks,
>>>>> Shawn
>>>>> 
>>>>> 
>>> 
>> 
>



Re: Cost of having multiple search handlers?

2015-09-28 Thread Jeff Wartes

One would hope that https://issues.apache.org/jira/browse/SOLR-4735 will
be done by then. 


On 9/28/15, 11:39 AM, "Walter Underwood"  wrote:

>We did the same thing, but reporting performance metrics to Graphite.
>
>But we won’t be able to add servlet filters in 6.x, because it won’t be a
>webapp.
>
>wunder
>Walter Underwood
>wun...@wunderwood.org
>http://observer.wunderwood.org/  (my blog)
>
>
>> On Sep 28, 2015, at 11:32 AM, Gili Nachum  wrote:
>> 
>> A different solution to the same need: I'm measuring response times of
>> different collections measuring  online/batch queries apart using New
>> Relic. I've added a servlet filter that analyses the request and makes
>>this
>> info available to new relic over a request argument.
>> 
>> The built in new relic solr plug in doesn't provide much.
>> On Sep 28, 2015 17:16, "Shawn Heisey"  wrote:
>> 
>>> On 9/28/2015 6:30 AM, Oliver Schrenk wrote:
 I want to register multiple but identical search handler to have
>>> multiple buckets to measure performance for our different apis and
>>> consumers (and to find out who is actually using Solr).
 
 What are there some costs associated with having multiple search
>>> handlers? Are they neglible?
>>> 
>>> Unless you are creating hundreds or thousands of them, I doubt you'll
>>> notice any significant increase in resource usage from additional
>>> handlers.  Each handler definition creates an additional URL endpoint
>>> within the servlet container, additional object creation within Solr,
>>> and perhaps an additional thread pool and threads to go with it, so
>>>it's
>>> not free, but I doubt that it's significant.  The resources required
>>>for
>>> actually handling a request is likely to dwarf what's required for more
>>> handlers.
>>> 
>>> Disclaimer: I have not delved into the code to figure out exactly what
>>> gets created with a search handler config, so I don't know exactly what
>>> happens.  I'm basing this on general knowledge about how Java programs
>>> are constructed by expert developers, not specifics about Solr.
>>> 
>>> There are others on the list who have a much better idea than I do, so
>>> if I'm wrong, I'm sure one of them will let me know.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
>



Autowarm and filtercache invalidation

2015-09-24 Thread Jeff Wartes

If I configure my filterCache like this:

<filterCache ... autowarmCount="10"/>
and I have <= 10 distinct filter queries I ever use, does that mean I’ve
effectively disabled cache invalidation? So my cached filter query results
will never change? (short of JVM restart)

I’m unclear on whether autowarm simply copies the value into the new
searcher’s cache or whether it tries to rebuild the results of the cached
filter query based on the new searcher’s view of the data.



Re: Autowarm and filtercache invalidation

2015-09-24 Thread Jeff Wartes
Answering my own question: Looks like the default filterCache regenerator
uses the old cache's keys to re-execute those queries in the context of the new
searcher, and does nothing with the old cached values.

So, the new searcher’s cache contents will be consistent with that
searcher’s view, regardless of whether it was populated via autowarm.


On 9/24/15, 11:28 AM, "Jeff Wartes" <jwar...@whitepages.com> wrote:

>
>If I configure my filterCache like this:
><filterCache ... autowarmCount="10"/>
>
>and I have <= 10 distinct filter queries I ever use, does that mean I’ve
>effectively disabled cache invalidation? So my cached filter query results
>will never change? (short of JVM restart)
>
>I’m unclear on whether autowarm simply copies the value into the new
>searcher’s cache or whether it tries to rebuild the results of the cached
>filter query based on the new searcher’s view of the data.
>



Re: How to know index file in OS Cache

2015-09-25 Thread Jeff Wartes


I’ve been relying on this:
https://code.google.com/archive/p/linux-ftools/


fincore will tell you what percentage of a given file is in cache, and
fadvise can suggest to the OS that a file be cached.

All of the solr start scripts at my company first call fadvise
(FADV_WILLNEED) on all the files in the index directories. It works great
if you’re on a linux system.
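
The exact binary names and flags depend on how you build linux-ftools, but from
memory it's roughly this - the index paths are just examples, check --help:

# report how much of each index file is resident in the page cache
fincore --pages=false --summarize --only-cached /data/solr/*/data/index/*

# hint to the OS that the index files should be pulled into cache
for f in /data/solr/*/data/index/*; do fadvise "$f" POSIX_FADV_WILLNEED; done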



On 9/25/15, 8:41 AM, "Gili Nachum"  wrote:

>Gonna try Mikhail's suggestion, but just for fun you can also empirically
>"test" how much of a file is in the OS cache with:
>time cat <file> > /dev/null
>
>The faster it completes, the more blocks are cached. You can take a baseline
>after manually purging the cache - I don't recall the command. Note that
>running the command by itself encourages the OS to cache the file.
>On Sep 25, 2015 12:39, "Aman Tandon"  wrote:
>
>> Awesome thank you Mikhail. This is what I was looking for.
>>
>> This was just a random question poped up in my mind. So I just asked
>>this
>> on the group.
>>
>> With Regards
>> Aman Tandon
>>
>> On Fri, Sep 25, 2015 at 2:49 PM, Mikhail Khludnev <
>> mkhlud...@griddynamics.com> wrote:
>>
>> > What about Linux:
>> > $less /proc//maps
>> > $pmap 
>> >
>> > On Fri, Sep 25, 2015 at 10:57 AM, Markus Jelsma <
>> > markus.jel...@openindex.io>
>> > wrote:
>> >
>> > > Hello - as far as i remember, you don't. A file itself is not the
>>unit
>> to
>> > > cache, but blocks are.
>> > > Markus
>> > >
>> > >
>> > > -Original message-
>> > > > From:Aman Tandon 
>> > > > Sent: Friday 25th September 2015 5:56
>> > > > To: solr-user@lucene.apache.org
>> > > > Subject: How to know index file in OS Cache
>> > > >
>> > > > Hi,
>> > > >
>> > > > Is there any way to know that the index file/s is present in the
>>OS
>> > cache
>> > > > or RAM. I want to check if the index is present in the RAM or in
>>OS
>> > cache
>> > > > and which files are not in either of them.
>> > > >
>> > > > With Regards
>> > > > Aman Tandon
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Sincerely yours
>> > Mikhail Khludnev
>> > Principal Engineer,
>> > Grid Dynamics
>> >
>> > 
>> > 
>> >
>>



Re: Solrcloud: 1 server, 1 configset, multiple collections, multiple schemas

2015-12-04 Thread Jeff Wartes

If you want two different collections to have two different schemas, those
collections need to reference two different configsets.
So you need another copy of your config available using a different name,
and to reference that other name when you create the second collection.


On 12/4/15, 6:26 AM, "bengates"  wrote:

>Hello,
>
>I'm having usage issues with *Solrcloud*.
>
>What I want to do:
>- Manage a solr server *only with the API* (create / reload / delete
>collections, create / replace / delete fields, etc).
>- A new collection should* start with pre-defined default fields,
>fieldTypes
>and copyFields* (let's say, field1 and field2 for fields).
>- Each collection must *have its own schema*.
>
>What I've setup yet:
>- Installed a *Solr 5.3.1* in //opt/solr/ on an Ubuntu 14.04 server
>- Installed *Zookeeper 3.4.6* in //opt/zookeeper/ as described in the solr
>wiki
>- Added line "server.1=127.0.0.1:2888:3888" in
>//opt/zookeeper/conf/zoo.cfg/
>- Added line "127.0.0.1:2181" in
>//var/solr/data/solr.xml/
>- Told solr or zookeeper somewhere (don't remember where I setup this) to
>use //home/me/configSet/managed-schema.xml/ and
>//home/me/configSet/solrconfig.xml/ for configSet
>- Run solr on port 8983
>
>My //home/me/configSet/managed-schema.xml/ contains *field1* and *field2*.
>
>Now let's create a collection:
>http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=colle
>ction1&numShards=1
>- *collection1 *is created, with *field1 *and *field2*. Perfect.
>
>Let's create another collection:
>http://my.remote.addr:8983/solr/admin/collections?action=CREATE&name=colle
>ction2&numShards=1
>- *collection2 *is created, with *field1 *and *field2*. Perfect.
>
>No, if I *add some fields* on *collection1 *by POSTing to :
>/http://my.remote.addr:8983/solr/collection1/schema/ the following:
>
>
>- *field3 *and *field4 *are successfully added to *collection1*
>- ... but they are *also added* to *collection2* (verified by GETting
>/http://my.remote.addr:8983/solr/collection2/schema/fields/)
>
>How to prevent this behavior, since my collections have *different kind of
>datas*, and may have the same field names but not the same types?
>
>Thanks,
>Ben
>
>
>
>--
>View this message in context:
>http://lucene.472066.n3.nabble.com/Solrcloud-1-server-1-configset-multiple
>-collections-multiple-schemas-tp4243584.html
>Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 5: Schema.xml vs. Managed Schema - which is advisable?

2015-12-03 Thread Jeff Wartes
I’ve never used the managed schema, so I’m probably biased, but I’ve never
seen much of a point to the Schema API.

I need to make changes sometimes to solrconfig.xml, in addition to
schema.xml and other config files, and there’s no API for those, so my
process has been like:

1. Put the entire config directory used by a collection in source control
somewhere. solrconfig.xml, schema.xml, synonyms.txt, everything.
2. Make changes, test, commit
3. “Release” by uploading the whole config dir at a specific commit to ZK
(overwriting any existing files) and issuing a collections API “reload”.


This has the downside that I can upload a broken config and take down my
collection, but with the whole config dir in source control,
I can also easily roll back to any point by uploading an old commit.
You still have to be aware of how the changes you’re making will affect
your current index, but that’s unavoidable.
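
Concretely, step 3 is just something like this - the zkhost and names here are
examples:

server/scripts/cloud-scripts/zkcli.sh -zkhost zk1.example.com:2181 \
  -cmd upconfig -confname mycollection_conf -confdir ./conf

curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"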


On 12/3/15, 7:09 AM, "Kelly, Frank"  wrote:

>Just wondering if folks have any suggestions on using Schema.xml vs.
>Managed Schema going forward.
>
>Our deployment will be
>> 3 Zk, 3 Shards, 3 replicas
>> Copies of each collection in 5 AWS regions (EBS-backed EC2 instances)
>> Planning at least 1 Billion objects indexed (currently < 100 million)
>
>I'm sure our schema.xml will have changes and fixes and just wondering
>which approach (schema.xml vs. managed)
>will be easier to deploy / maintain?
>
>Cheers!
>
>-Frank
>
>
>Frank Kelly
>Principal Software Engineer
>Predictive Analytics Team (SCBE/HAC/CDA)
>
>
>
>
>
>
>
>



Re: How to list all collections in solr-4.7.2

2015-12-03 Thread Jeff Wartes
Looks like LIST was added in 4.8, so I guess you’re stuck looking at ZK,
or finding some tool that looks in ZK for you.

The zkCli.sh that ships with zookeeper would probably suffice for a
one-off manual inspection:
https://zookeeper.apache.org/doc/trunk/zookeeperStarted.html#sc_ConnectingT
oZooKeeper
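
Something like this, using an example zk host (add your chroot in front of
/collections if you use one):

bin/zkCli.sh -server zk1.example.com:2181
ls /collections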



On 12/3/15, 12:05 PM, "Pushkar Raste"  wrote:

>Will 'wget http://host:port/solr/admin/collections?action=LIST' help?
>
>On 3 December 2015 at 12:12, rashi gandhi  wrote:
>
>> Hi all,
>>
>> I have setup two solr-4.7.2 server instances on two diff machines with 3
>> zookeeper severs in solrcloud mode.
>>
>> Now, I want to retrieve list of all the collections that I have created
>>in
>> solrcloud mode.
>>
>> I tried LIST command of collections api, but its not working with
>> solr-4.7.2.
>> Error: unknown command LIST
>>
>> Please suggest me the command, that I can use.
>>
>> Thanks.
>>



Re: Fully automated replica creation in AWS

2015-12-09 Thread Jeff Wartes

It’s a pretty common misperception that since solr scales, you can just
spin up new nodes and be done. Amazon ElasticSearch and older solrcloud
getting-started docs encourage this misperception, as does the HDFS-only
autoAddReplicas flag.

I agree that auto-scaling should be approached carefully, and
per-collection, but the question also comes up a lot, so the lack of
available/blessed solr tools hurts, in my opinion.
https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placeme
nt helps, but you still need to say when and how many.

Anyway, the last time this came up
(https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201510.mbox/%3C
cae+cwktalxicdfc2zlfxvxregvnt-56yyqct3u-onhqchxa...@mail.gmail.com%3E) I
suggested this as a place to get started - it’s a tool that knows where to
look: https://github.com/whitepages/solrcloud_manager

Using this, scaling up a collection is pretty easy. Add some nodes, then:
// Add replicas as "cluster space” allows. (assumes you’re not using
built-in rule-based placement)
// Respects current maxShardsPerNode for the collection, prevents
adding replicas that already exist on a node,
// adds replicas for shards with a lower replication factor first.
java -jar solrcloud_manager-assembly-1.4.0.jar fill -z
zk0.example.com:2181/myapp -c collection1

Scaling down is also a one-liner. Turn off some nodes and run:
// removes any replicas in the collection that are not marked “active”
from the cluster state
java -jar solrcloud_manager-assembly-1.4.0.jar cleancollection -z
zk0.example.com:2181/myapp -c collection1
BUT, you still need to take care not to take down all of the nodes with a
given shard at once. This can be tricky to figure out if your collection
has shards spread across many nodes.


Another downscaling option would be to proactively delete the replicas off
of a node before turning it off:
// deletes all replicas for the collection on the node(s),
// but refuses to delete a replica if that’d bring the
replicationFactor for that shard below 2
java -jar solrcloud_manager-assembly-1.4.0.jar clean -z
zk0.example.com:2181/myapp -c collection1 --nodes abc.example.com
--safetyFactor 2



On 12/9/15, 12:09 PM, "Erick Erickson"  wrote:

>bq: As a side note, we do this for our
>customers as that's baked into our cloud provisioning software,
>
>Exactly, nothing OOB is there, but all the data is available, you
>"just" have to write a tool that knows where to look ;) That said,
>this would certainly be something that would have to be optional
>and, IMO, not on by default. It'd need some design effort, as I'd
>guess it'd be on a per-collection basis thus turned on in solrconfig.xml.
>Hmmm, or perhaps this would be some kind of list of nodes
>in Zookeeper. Or..
>
>So when a new Solr node came up for the first time, it'd need
>to find out which collections were configured with this option
>either through enumerating all the collections and checking
>their states or looking in their solrconfig files or enumerating
>a list of children in Zookeeper or Plus you'd need some
>kind of safeguards in place to handle bringing up, say, 10 new
>Solr instances one at a time so all the new replicas didn't
>end up on the first node you added. And so on
>
>The more I think about all that the less I like it; it seems that
>custom utilities on a per-collection basis make more sense.
>
>And yeah, autoAddReplicas is exactly for that on HDFS. Using
>autoAddReplicas for non shared filesystems doesn't really
>make sense IMO. The supposition is that the Solr node has
>gone away. How would some _other_ Solr instance on some _other_
>node know where to look for the index?
>
>Best,
>Erick
>
>On Wed, Dec 9, 2015 at 11:37 AM, Sameer Maggon
> wrote:
>> Erick,
>>
>> Typically, while creating collections, a replicationFactor is specified.
>> Thus, the meta data about the collection does have information about
>>what
>> the "desired" replicationFactor is for the collection. If that's the
>>case,
>> when a Solr node joins the cluster, there could be a pro-active
>>add-replica
>> operation that can be initiated if the Solr detects that the current
>> replicas are less than the desired replicationFactor and pull the
>> collection data from the leader.
>>
>> Isn't that what the attribute "autoAddReplicas" does for HDFS - can
>>this be
>> done for non-shared filesystem? As a side note, we do this for our
>> customers as that's baked into our cloud provisioning software, but it
>> would be nice if Solr supports that OOTB. Are there any underlying
>>flaws of
>> doing that?
>>
>> Thanks,
>> --
>>
>> *Sameer Maggon*
>> www.measuredsearch.com
>> 
>>>?url=http%3A%2F%2Fmeasuredsearch.com%2F=6dbc74f0abef4882>
>> |
>> Deploy, Scale & Manage Solr in the cloud of your choice.
>>
>>
>> On Wed, Dec 9, 2015 at 11:19 AM, Erick Erickson

Re: Moving to SolrCloud, specifying dataDir correctly

2015-12-14 Thread Jeff Wartes

Don’t set solr.data.dir. Instead, set the install dir. Something like:
-Dsolr.solr.home=/data/solr
-Dsolr.install.dir=/opt/solr

I have many solrcloud collections, and separate data/install dirs, and
I’ve never had to do anything with manual per-collection or per-replica
data dirs.

That said, it’s been a while since I set this up, and I may not remember
all the pieces. 
You might need something like this too, for example:

-Djetty.home=/opt/solr/server
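
If you're starting Solr with the bin/solr script, the equivalent is roughly:

bin/solr start -c -z zk0.example.com:2181 -s /data/solr

where -s sets solr.solr.home (so cores and their data land under /data/solr)
while the install itself stays in /opt/solr. The zk address is an example.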


On 12/14/15, 3:11 PM, "Erick Erickson"  wrote:

>Currently, it'll be a little tedious but here's what you can do (going
>partly from memory)...
>
>When you create the collection, specify the special value EMPTY for
>createNodeSet (Solr 5.3+).
>Use ADDREPLICA to add each individual replica. When you do this, you
>can add a dataDir for
>each individual replica and thus keep them separate, i.e. for a
>particular box the first
>replica would get /data/solr/collection1_shard1_replica1, the second
>/data/solr/collection1_shard2_replica1 and so forth.
>
>If you don't have Solr 5.3+, you can still to the same thing, except
>you create your collection letting
>the replicas fall where they will. Then do the ADDREPLICA as above.
>When that's all done,
>DELETEREPLICA for the original replicas.
>
>Best,
>Erick
>
>On Mon, Dec 14, 2015 at 2:21 PM, Tom Evans 
>wrote:
>> On Mon, Dec 14, 2015 at 1:22 PM, Shawn Heisey 
>>wrote:
>>> On 12/14/2015 10:49 AM, Tom Evans wrote:
 When I tried this in SolrCloud mode, specifying
 "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine
 for the first collection, but then the second collection tried to use
 the same directory to store its index, which obviously failed. I fixed
 this by changing solrconfig.xml in each collection to specify a
 specific directory, like so:

   <dataDir>${solr.data.dir:}products</dataDir>

 Looking back after the weekend, I'm not a big fan of this. Is there a
 way to add a core.properties to ZK, or a way to specify
 core.baseDatadir on the command line, or just a better way of handling
 this that I'm not aware of?
>>>
>>> Since you're running SolrCloud, just let Solr handle the dataDir, don't
>>> try to override it.  It will default to "data" relative to the
>>> instanceDir.  Each instanceDir is likely to be in the solr home.
>>>
>>> With SolrCloud, your cores will not contain a "conf" directory (unless
>>> you create it manually), therefore the on-disk locations will be *only*
>>> data, there's not really any need to have separate locations for
>>> instanceDir and dataDir.  All active configuration information for
>>> SolrCloud is in zookeeper.
>>>
>>
>> That makes sense, but I guess I was asking the wrong question :)
>>
>> We have our SSDs mounted on /data/solr, which is where our indexes
>> should go, but our solr install is on /opt/solr, with the default solr
>> home in /opt/solr/server/solr. How do we change where the indexes get
>> put so they end up on the fast storage?
>>
>> Cheers
>>
>> Tom



Re: SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Jeff Wartes

Honestly, I have no idea which is "old". The solr source itself uses slice 
pretty consistently, so I stuck with that when I started the project last year. 
And logically, a shard being an instance of a slice makes sense to me. But one 
significant place where they word shard is exposed is the default names of the 
slices, so it’s a mixed bag.


See here:
  https://github.com/whitepages/solrcloud_manager#terminology






On 1/8/16, 2:34 PM, "Robert Brown" <r...@intelcompute.com> wrote:

>Thanks for the pointer Jeff,
>
>For SolrCloud it turned out to be...
>
>=xxx
>
>btw, for your app, isn't "slice" old notation?
>
>
>
>
>On 08/01/16 22:05, Jeff Wartes wrote:
>>
>> I’m pretty sure you could change the name when you ADDREPLICA using a 
>> core.name property. I don’t know if you can when you initially create the 
>> collection though.
>>
>> The CLUSTERSTATUS command will tell you the core names: 
>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18
>>
>> That said, this tool might make things easier.
>> https://github.com/whitepages/solrcloud_manager
>>
>>
>> # shows cluster status, including core names:
>> java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp
>>
>>
>> # deletes a replica by node/collection/shard (figures out the core name 
>> under the hood)
>> java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z 
>> zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice 
>> shard2
>>
>>
>> I mention this tool every now and then on this list because I like it, but 
>> I’m the author, so take that with a pretty big grain of salt. Feedback is 
>> very welcome.
>>
>>
>>
>>
>>
>>
>>
>> On 1/8/16, 1:18 PM, "Robert Brown" <r...@intelcompute.com> wrote:
>>
>>> Hi,
>>>
>>> I'm having trouble identifying a replica to delete...
>>>
>>> I've created a 3-shard cluster, all 3 created on a single host, then
>>> added a replica for shard2 onto another host, no problem so far.
>>>
>>> Now I want to delete the original shard, but got this error when trying
>>> a *replica* param value I thought would work...
>>>
>>> shard2/uk available replicas are core_node1,core_node4
>>>
>>> I can't find any mention of core_node1 or core_node4 via the admin UI,
>>> how would I know/find the name of each one?
>>>
>>> Is it possible to set these names explicitly myself for easier maintenance?
>>>
>>> Many thanks for any guidance,
>>> Rob
>>>
>


Re: SolrCloud: Setting/finding node names for deleting replicas

2016-01-08 Thread Jeff Wartes


I’m pretty sure you could change the name when you ADDREPLICA using a core.name 
property. I don’t know if you can when you initially create the collection 
though.
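
Untested, but I'd expect the ADDREPLICA call to look something like this - the
host names and core name here are examples:

http://node1.example.com:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard2&node=node2.example.com:8983_solr&property.name=collection1_shard2_replica3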

The CLUSTERSTATUS command will tell you the core names: 
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api18
 

That said, this tool might make things easier.
https://github.com/whitepages/solrcloud_manager


# shows cluster status, including core names:
java -jar solrcloud_manager-assembly-1.4.0.jar -z zk0.example.com:2181/myapp


# deletes a replica by node/collection/shard (figures out the core name under 
the hood)
java -jar solrcloud_manager-assembly-1.4.0.jar deletereplica -z 
zk0.example.com:2181/myapp -c collection1 --node node1.example.com --slice 
shard2


I mention this tool every now and then on this list because I like it, but I’m 
the author, so take that with a pretty big grain of salt. Feedback is very 
welcome.







On 1/8/16, 1:18 PM, "Robert Brown"  wrote:

>Hi,
>
>I'm having trouble identifying a replica to delete...
>
>I've created a 3-shard cluster, all 3 created on a single host, then 
>added a replica for shard2 onto another host, no problem so far.
>
>Now I want to delete the original shard, but got this error when trying 
>a *replica* param value I thought would work...
>
>shard2/uk available replicas are core_node1,core_node4
>
>I can't find any mention of core_node1 or core_node4 via the admin UI, 
>how would I know/find the name of each one?
>
>Is it possible to set these names explicitly myself for easier maintenance?
>
>Many thanks for any guidance,
>Rob
>


Re: How to check when a search exceeds the threshold of timeAllowed parameter

2015-12-23 Thread Jeff Wartes
Looks like it’ll set partialResults=true on your results if you hit the 
timeout. 

https://issues.apache.org/jira/browse/SOLR-502

https://issues.apache.org/jira/browse/SOLR-5986
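
So a response that hit the limit should have a header roughly like this (the
QTime value is just an example):

"responseHeader": {
  "status": 0,
  "QTime": 15012,
  "partialResults": true
}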






On 12/22/15, 5:43 PM, "Vincenzo D'Amore"  wrote:

>Well... I can write everything, but really all this just to understand
>when timeAllowed
>parameter trigger a partial answer? I mean, isn't there anything set in the
>response when is partial?
>
>On Wed, Dec 23, 2015 at 2:38 AM, Walter Underwood 
>wrote:
>
>> We need to know a LOT more about your site. Number of documents, size of
>> index, frequency of updates, length of queries approximate size of server
>> (CPUs, RAM, type of disk), version of Solr, version of Java, and features
>> you are using (faceting, highlighting, etc.).
>>
>> After that, we’ll have more questions.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Dec 22, 2015, at 4:58 PM, Vincenzo D'Amore 
>> wrote:
>> >
>> > Hi All,
>> >
>> > my website is under pressure, there is a big number of concurrent
>> searches.
>> > When the connected users are too many, the searches becomes so slow that
>> in
>> > some cases users have to wait many seconds.
>> > The queue of searches becomes so long that, in same cases, servers are
>> > blocked trying to serve all these requests.
>> > As far as I know because some searches are very expensive, and when many
>> > expensive searches clog the queue server becomes unresponsive.
>> >
>> > In order to quickly workaround this herd effect, I have added a
>> > default timeAllowed to 15 seconds, and this seems help a lot.
>> >
>> > But during stress tests but I'm unable to understand when and what
>> requests
>> > are affected by timeAllowed parameter.
>> >
>> > Just be clear, I have configure timeAllowed parameter in a SolrCloud
>> > environment, given that partial results may be returned (if there are
>> any),
>> > how can I know when this happens? When the timeAllowed parameter trigger
>> a
>> > partial answer?
>> >
>> > Best regards,
>> > Vincenzo
>> >
>> >
>> >
>> > --
>> > Vincenzo D'Amore
>> > email: v.dam...@gmail.com
>> > skype: free.dev
>> > mobile: +39 349 8513251
>>
>>
>
>
>-- 
>Vincenzo D'Amore
>email: v.dam...@gmail.com
>skype: free.dev
>mobile: +39 349 8513251


Re: replica recovery

2015-11-19 Thread Jeff Wartes

I completely agree with the other comments on this thread with regard to
needing more disk space asap, but I thought I’d add a few comments
regarding the specific questions here.

If your goal is to prevent full recovery requests, you only need to cover
the duration you expect a replica to be unavailable.

If your common issue is GC due to bad queries, you probably don’t need to
cover more than the number of docs you write in your typical full GC
pause. I suspect this is less than 10M.
If your common issue is the length of time it takes you to notice a server
crashed and restart it, you may need to cover something like 10 minutes
worth of docs. I suspect this is still less than 10M.

You certainly don’t need to keep an entire day’s transaction logs. If your
servers routinely go down for a whole day, solve that by fixing your
servers. :)

With respect to ulimit, if Solr is the only thing of significance on the
box, there’s no reason not to bump that up. I usually just set something
like 32k and stop thinking about it. I get hit by a low ulimit every now
and then, but I can’t recall ever having had an issue with it being too
high.
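
If you manage it via /etc/security/limits.conf, that's just a couple of lines
like this, assuming Solr runs as a "solr" user:

solr  soft  nofile  32768
solr  hard  nofile  32768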



On 11/19/15, 6:21 AM, "Brian Scholl" <bsch...@legendary.com> wrote:

>I have opted to modify the number and size of transaction logs that I
>keep to resolve the original issue I described.  In so doing I think I
>have created a new problem, feedback is appreciated.
>
>Here are the new updateLog settings:
>
>
><updateLog>
>  <str name="dir">${solr.ulog.dir:}</str>
>  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
>  <int name="numRecordsToKeep">1000</int>
>  <int name="maxNumLogsToKeep">5760</int>
></updateLog>
>First I want to make sure I understand what these settings do:
>   numRecordsToKeep: per transaction log file keep this number of documents
>   maxNumLogsToKeep: retain this number of transaction log files total
>
>During my testing I thought I observed that a new tlog is created every
>time auto-commit is triggered (every 15 seconds in my case) so I set
>maxNumLogsToKeep high enough to contain an entire days worth of updates.
> Knowing that I could potentially need to bulk load some data I set
>numRecordsToKeep higher than my max throughput per replica for 15 seconds.
>
>The problem that I think this has created is I am now running out of file
>descriptors on the servers.  After indexing new documents for a couple
>hours a some servers (not all) will start logging this error rapidly:
>
>73021439 WARN  
>(qtp1476011703-18-acceptor-0@6d5514d9-ServerConnector@6392e703{HTTP/1.1}{0
>.0.0.0:8983}) [   ] o.e.j.s.ServerConnector
>java.io.IOException: Too many open files
>   at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>   at 
>sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422
>)
>   at 
>sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250
>)
>   at 
>org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
>   at 
>org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.
>java:500)
>   at 
>org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.jav
>a:635)
>   at 
>org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java
>:555)
>   at java.lang.Thread.run(Thread.java:745)
>
>The output of ulimit -n for the user running the solr process is 1024.  I
>am pretty sure I can prevent this error from occurring  by increasing the
>limit on each server but it isn't clear to me how high it should be or if
>raising the limit will cause new problems.
>
>Any advice you could provide in this situation would be awesome!
>
>Cheers,
>Brian
>
>
>
>> On Oct 27, 2015, at 20:50, Jeff Wartes <jwar...@whitepages.com> wrote:
>> 
>> 
>> On the face of it, your scenario seems plausible. I can offer two pieces
>> of info that may or may not help you:
>> 
>> 1. A write request to Solr will not be acknowledged until an attempt has
>> been made to write to all relevant replicas. So, B won’t ever be missing
>> updates that were applied to A, unless communication with B was
>>disrupted
>> somehow at the time of the update request. You can add a min_rf param to
>> your write request, in which case the response will tell you how many
>> replicas received the update, but it’s still up to your indexer client
>>to
>> decide what to do if that’s less than your replication factor.
>> 
>> See 
>> 
>>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Faul
>>t+
>> Tolerance for more info.
>> 
>> 2. There are two forms of replication. The usual thing is for the leader
>> for each shard to write an update to all replicas before acknowledg

Re: Data Import Handler / Backup indexes

2015-11-23 Thread Jeff Wartes

The backup/restore approach in SOLR-5750 and in solrcloud_manager is
really just that - copying the index files.
On backup, it saves your index directories, and on restore, it puts them
in the data dir, moves a pointer for the current index dir, and opens a
new searcher. Both are mostly just wrappers on the proper Solr
replication-handler commands, since Solr already has some lower level APIs
for these operations.

There is a shared filesystem requirement for backup/restore though, which
is to account for the fact that when you make the backup you don’t know
which nodes will need to restore a given shard.

The commands would look something like:

java -jar solrcloud_manager-assembly-1.4.0.jar backupindex -z
zk0.example.com:2181/myapp -c collection1 --dir 
java -jar solrcloud_manager-assembly-1.4.0.jar restoreindex -z
zk0.example.com:2181/myapp -c collection1 --dir 

Or you could restore into a new collection:
java -jar solrcloud_manager-assembly-1.4.0.jar backupindex -z
zk0.example.com:2181/myapp -c collection1 --dir 
java -jar solrcloud_manager-assembly-1.4.0.jar clonecollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1
java -jar solrcloud_manager-assembly-1.4.0.jar restoreindex -z
zk0.example.com:2181/myapp -c newcollection --dir 
--restoreFrom collection1

If you don’t have a shared filesystem, you can still do the copy
collection route:
java -jar solrcloud_manager-assembly-1.4.0.jar clonecollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1

java -jar solrcloud_manager-assembly-1.4.0.jar copycollection -z
zk0.example.com:2181/myapp -c newcollection --fromCollection collection1

This creates a new collection with the same settings, (clonecollection)
and triggers a one-shot “replication” into it. (copycollection) Again,
this is just framework for the proper (largely undocumented) Solr API
commands, to work around the lack of a convenient collections-level API
command.

One nice thing about using copy collection is that it can be used to keep
a backup collection up to date, only copying if necessary. Honestly
though, I don’t have as much experience with this use case as KNitin does
in solrcloud-haft, which is why I suggest using an empty collection in the
README right now. If you try that use case with solrcloud_manager, I’d be
interested in your experience. It should work, but you’ll need to disable
the verification with --skipCheck and check manually.


Having said all that though, yes, with your simple use case and small
collection, you can do everything you want with just cp. The easiest way
would be to make a backup copy of your index dir. If you need to restore,
shut down solr, nuke your index dir, and copy the backup in there. You’d
probably need to do this on all nodes at once though, to prevent a
non-leader from coming up and re-syncing with a piece of the index you
hadn’t restored yet.
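
In other words, something like this per node, with indexing paused or Solr
stopped so the segment files aren't changing underneath you (paths are
examples):

# backup
cp -rp /var/solr/data/collection1_shard1_replica1/data/index /backups/collection1_shard1_index

# restore, with solr stopped
rm -rf /var/solr/data/collection1_shard1_replica1/data/index
cp -rp /backups/collection1_shard1_index /var/solr/data/collection1_shard1_replica1/data/index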




On 11/21/15, 10:12 PM, "Brian Narsi" <bnars...@gmail.com> wrote:

>What are the caveats regarding the copy of a collection?
>
>At this time DIH takes only about 10 minutes. So in case of accidental
>delete we can just re-run the DIH. The reason I am thinking about backup
>is
>just in case records are deleted accidentally and the DIH cannot be run
>because the database is unavailable.
>
>Our collection is simple: 2 nodes - 1 collection - 2 shards with 2
>replicas
>each
>
>So a simple copy (cp command) for both the nodes/shards might work for us?
>How do I restore the data back?
>
>
>
>On Tue, Nov 17, 2015 at 4:56 PM, Jeff Wartes <jwar...@whitepages.com>
>wrote:
>
>>
>> https://github.com/whitepages/solrcloud_manager supports 5.x, and I
>>added
>> some backup/restore functionality similar to SOLR-5750 in the last
>> release.
>> Like SOLR-5750, this backup strategy requires a shared filesystem, but
>> note that unlike SOLR-5750, I haven’t yet added any backup functionality
>> for the contents of ZK. I’m currently working on some parts of that.
>>
>>
>> Making a copy of a collection is supported too, with some caveats.
>>
>>
>> On 11/17/15, 10:20 AM, "Brian Narsi" <bnars...@gmail.com> wrote:
>>
>> >Sorry I forgot to mention that we are using SolrCloud 5.1.0.
>> >
>> >
>> >
>> >On Tue, Nov 17, 2015 at 12:09 PM, KNitin <nitin.t...@gmail.com> wrote:
>> >
>> >> afaik Data import handler does not offer backups. You can try using
>>the
>> >> replication handler to backup data as you wish to any custom end
>>point.
>> >>
>> >> You can also try out : https://github.com/bloomreach/solrcloud-haft.
>> >>This
>> >> helps backup solr indices across clusters.
>> >>
>> >> On T

Re: Solr off-heap FieldCache & HelioSearch

2016-06-03 Thread Jeff Wartes

For what it’s worth, I’d suggest you go into a conversation with Azul with a 
more explicit “I’m looking to buy” approach. I reached out to them with a more 
“I’m exploring my options” attitude, and never even got a trial. I get the 
impression their business model involves a fairly expensive (to them) trial 
process, so they’re looking for more urgency on the part of the client than I 
was expressing.

Instead, I spent a few weeks analyzing how my specific index allocated memory. 
This turned out to be quite worthwhile. Armed with that information, I was able 
to file a few patches (coming in 6.1, perhaps?) that reduced allocations by a 
pretty decent amount on large indexes. (SOLR-8922, particularly) It also 
straight-up ruled out certain things Solr supports, because the allocations 
were just too heavy. (SOLR-9125)
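
For anyone wanting to do similar analysis, plain GC logs plus an allocation
profiler go a long way. A minimal starting point, using Java 8 flag syntax,
would be something like:

-Xloggc:/var/log/solr/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution

The log path is an example; allocation rate and pause times fall out of that,
and a profiler tells you which query paths are doing the allocating.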

I suppose the next thing I’m considering is using multiple JVMs per host, 
essentially one per shard. This wouldn’t change the allocation rate, but does 
serve to reduce the worst-case GC pause, since each JVM can have a smaller 
heap. I’d be trading a little p50 latency for some p90 latency reduction, I’d 
expect. Of course, that adds a bunch of headache to managing replica locations 
too.


On 6/2/16, 6:30 PM, "Phillip Peleshok"  wrote:

>Fantastic! I'm sorry I couldn't find that JIRA before and for getting you
>to track it down.
>
>Yup, I noticed that for the docvalues with the ordinal map and I'm
>definitely leveraging all that but I'm hitting the terms limit now and that
>ends up pushing me over.  I'll see about giving Zing/Azul a try.  From all
>my readings using theUnsafe seemed a little sketchy (
>http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/) so
>I'm glad that seemed to be the point of contention bringing it in and not
>anything else.
>
>Thank you very much for the info,
>Phil
>
>On Thu, Jun 2, 2016 at 6:14 PM, Erick Erickson 
>wrote:
>
>> Basically it never reached consensus, see the discussion at:
>> https://issues.apache.org/jira/browse/SOLR-6638
>>
>> If you can afford it I've seen people with very good results
>> using Zing/Azul, but that can be expensive.
>>
>> DocValues can help for fields you facet and sort on,
>> those essentially move memory into the OS
>> cache.
>>
>> But memory is an ongoing struggle I'm afraid.
>>
>> Best,
>> Erick
>>
>> On Wed, Jun 1, 2016 at 12:34 PM, Phillip Peleshok 
>> wrote:
>> > Hey everyone,
>> >
>> > I've been using Solr for some time now and running into GC issues as most
>> > others have.  Now I've exhausted all the traditional GC settings
>> > recommended by various individuals (ie Shawn Heisey, etc) but neither
>> > proved sufficient.  The one solution that I've seen that proved useful is
>> > Heliosearch and the off-heap implementation.
>> >
>> > My question is this, why wasn't the off-heap FieldCache implementation (
>> > http://yonik.com/hs-solr-off-heap-fieldcache-performance/) ever rolled
>> into
>> > Solr when the other HelioSearch improvements were merged? Was there a
>> > fundamental design problem or just a matter of time/testing that would be
>> > incurred by the move?
>> >
>> > Thanks,
>> > Phil
>>



Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser

2016-05-31 Thread Jeff Wartes
never
>> > > > >> match and add alternatives.  See <
>> > > > >> https://issues.apache.org/jira/browse/LUCENE-2605>, where I’ve
>> > > posted a
>> > > > >> patch to directly address that problem - note that it’s still a
>> work
>> > > in
>> > > > >> progress.
>> > > > >>
>> > > > >> Once LUCENE-2605 has been fixed, there is still work to do getting
>> > > > >> (e)dismax to work with the modified Lucene QueryParser, and
>> > addressing
>> > > > >> problems with how queries are constructed from Lucene’s
>> “sausagized”
>> > > > token
>> > > > >> stream.
>> > > > >>
>> > > > >> --
>> > > > >> Steve
>> > > > >> www.lucidworks.com
>> > > > >>
>> > > > >> > On May 26, 2016, at 2:21 PM, John Bickerstaff <
>> > > > j...@johnbickerstaff.com>
>> > > > >> wrote:
>> > > > >> >
>> > > > >> > Thanks Chris --
>> > > > >> >
>> > > > >> > The two projects I'm aware of are:
>> > > > >> >
>> > > > >> > https://github.com/healthonnet/hon-lucene-synonyms
>> > > > >> >
>> > > > >> > and the one referenced from the Lucidworks page here:
>> > > > >> >
>> > > > >>
>> > > >
>> > >
>> >
>> https://lucidworks.com/blog/2014/07/12/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
>> > > > >> >
>> > > > >> > ... which is here :
>> > > > >> https://github.com/LucidWorks/auto-phrase-tokenfilter
>> > > > >> >
>> > > > >> > Is there anything else out there that you would recommend I look
>> > at?
>> > > > >> >
>> > > > >> > On Thu, May 26, 2016 at 12:01 PM, Chris Morley <
>> > ch...@depahelix.com
>> > > >
>> > > > >> wrote:
>> > > > >> >
>> > > > >> >> Chris Morley here, from Wayfair.  (Depahelix = my domain)
>> > > > >> >>
>> > > > >> >> Suyash Sonawane and I have worked on multiple word synonyms at
>> > > > Wayfair.
>> > > > >> >> We worked mostly off of Ted Sullivan's work and also off of
>> some
>> > > > >> >> suggestions from Koorosh Vakhshoori.  We have gotten to a point
>> > > where
>> > > > >> we
>> > > > >> >> have a more sophisticated internal implementation, however,
>> we've
>> > > > found
>> > > > >> >> that it is very difficult to make it do what you want it to do,
>> > and
>> > > > >> also be
>> > > > >> >> sufficiently performant.  Watch out for exceptional situations
>> > with
>> > > > mm
>> > > > >> >> (minimum should match).
>> > > > >> >>
>> > > > >> >> Trey Grainger (now at Lucidworks) and Simon Hughes of Dice.com
>> > have
>> > > > >> also
>> > > > >> >> done work in this area.
>> > > > >> >>
>> > > > >> >> It should be very possible to get this kind of thing working on
>> > > > >> >> SolrCloud.  I haven't tried it yet but I think theoretically,
>> it
>> > > > should
>> > > > >> >> just work.  The synonyms stuff is mostly about doing things at
>> > > index
>> > > > >> time
>> > > > >> >> and query time.  The index time stuff should translate to
>> > SolrCloud
>> > > > >> >> directly, while the query time stuff might pose some issues,
>> but
>> > > > >> probably
>> > > > >> >> not too bad, if there are any issues at all.
>> > > > >> >>
>> > > > >> >> I've had decent luck porting our various plugins from 4.10.x to
>> > > 5.5.0
>> > > > >> >> because a lot of stuff is just Java, and it still works within
>> > the
>>

Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser

2016-05-26 Thread Jeff Wartes
Oh, interesting. I’ve certainly encountered issues with multi-word synonyms, 
but I hadn’t come across this. If you end up using it with a recent solr 
version, I’d be glad to hear your experience.

I haven’t used it, but I am aware of one other project in this vein that you 
might be interested in looking at: 
https://github.com/LucidWorks/auto-phrase-tokenfilter


On 5/26/16, 9:29 AM, "John Bickerstaff"  wrote:

>Ahh - for question #3 I may have spoken too soon.  This line from the
>github repository readme suggests a way.
>
>Update: We have tested to run with the jar in $SOLR_HOME/lib as well, and
>it works (Jetty).
>
>I'll try that and only respond back if that doesn't work.
>
>Questions 1 and 2 still stand of course...  If anyone on the list has
>experience in this area...
>
>Thanks.
>
>On Thu, May 26, 2016 at 10:25 AM, John Bickerstaff > wrote:
>
>> Hi all,
>>
>> I'm creating a Solr Cloud that will index and search medical text.
>> Multi-word synonyms are a pretty important factor.
>>
>> I find that there are some challenges around multi-word synonyms and I
>> also found on the wiki that there is a recommended 3rd-party parser
>> (synonym_edismax parser) created by Nolan Lawson and found here:
>> https://github.com/healthonnet/hon-lucene-synonyms
>>
>> Here's the thing - the instructions on the github site involve bringing
>> the jar file into the war file - which is not applicable any more... at
>> least I think it's not...
>>
>> I have three questions:
>>
>> 1. Is this still a good solution for multi-word synonyms (I.e. Solr Cloud
>> doesn't break it in some way)
>> 2. Is there a tool or plug-in out there that the contributors would
>> recommend above this one?
>> 3. Assuming 1 = yes and 2 = no, can anyone tell me an updated procedure
>> for bringing it in to Solr Cloud (I'm running 5.4.x)
>>
>> Thanks
>>



Re: Solr Cloud and Multi-word Synonyms :: synonym_edismax parser

2016-06-01 Thread Jeff Wartes
To answer the specific questions directed at me:

I’m using 5.4, solrcloud. 
I’ve never used the blob store thing, didn’t even know it existed before this 
thread.

I’m uncertain how not finding the class could be specific to hon; it really 
feels like a general solr config issue. But you could try some other foreign 
jar and see if that works. 
Here’s one I use: https://github.com/whitepages/SOLR-4449 (although this one is 
also why I use WEB-INF/lib, because it overrides a protected method, so it 
might not be the greatest example)
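
For reference, the solrconfig.xml route mentioned further down this thread is just a 
lib directive pointing at whatever directory holds the jar. A minimal sketch, with a 
placeholder path:

  <!-- in solrconfig.xml: load every jar in this directory into the core's classloader -->
  <lib dir="/opt/solr-plugins" regex=".*\.jar" />

Jars loaded that way land in a core-level classloader, which is exactly why I fall back 
to WEB-INF/lib when a plugin needs to override something Solr-internal.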


On 5/31/16, 4:02 PM, "John Bickerstaff" <j...@johnbickerstaff.com> wrote:

>Thanks Jeff,
>
>I believe I tried that, and it still refused to load..  But I'd sure love
>it to work since the other process is a bit convoluted - although I see
>it's value in a large Solr installation.
>
>When I "locate" the jar on the linux command line I get:
>
>/opt/solr-5.4.0/server/solr-webapp/webapp/WEB-INF/lib/hon-lucene-synonyms-2.0.0.jar
>
>But the log file is still carrying class not found exceptions when I
>restart...
>
>Are you in "Cloud" mode?  What version of Solr are you using?
>
>On Tue, May 31, 2016 at 4:08 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>
>> I’ve generally been dropping foreign plugin jars in this dir:
>> server/solr-webapp/webapp/WEB-INF/lib/
>> This is because it then gets loaded by the same classloader as Solr
>> itself, which can be useful if you’re, say, overriding some
>> solr-protected-space method.
>>
>> If you don’t care about the classloader, I believe you can use whatever
>> dir you want, with the appropriate bit of solrconfig.xml to load it.
>> Something like:
>> 
>>
>>
>> On 5/31/16, 2:13 PM, "John Bickerstaff" <j...@johnbickerstaff.com> wrote:
>>
>> >All --
>> >
>> >I'm now attempting to use the hon_lucene_synonyms project from github.
>> >
>> >I found the documents that were infered by the dead links on the readme in
>> >the repository -- however, given that I'm using Solr 5.4.x, I no longer
>> >have the need to integrate into a war file (as far as I can see).
>> >
>> >The suggestion on the readme is that I can drop the hon_lucene_synonyms
>> jar
>> >file into the $SOLR_HOME directory, but this does not seem to be working -
>> >I'm getting class not found exceptions.
>> >
>> >Does anyone on this list have direct experience with getting this plugin
>> to
>> >work in Solr 5.x?
>> >
>> >Thanks in advance...
>> >
>> >On Mon, May 30, 2016 at 6:57 PM, MaryJo Sminkey <mjsmin...@gmail.com>
>> wrote:
>> >
>> >> It's been awhile since I installed it so I really can't say. I'm more
>> of a
>> >> code monkey than a server gal (particularly Linux... I'm amazed I got
>> Solr
>> >> installed in the first place, LOL!) So I had asked our network guy to
>> look
>> >> it over recently and see if it looked like I did it okay. He said since
>> it
>> >> shows up in the list of jars in the Solr admin that it's installed
>> if
>> >> that's not necessarily true, I probably need to point him in the right
>> >> direction for what else to do since he really doesn't know Solr well
>> >> either.
>> >>
>> >> Mary Jo
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, May 30, 2016 at 7:49 PM, John Bickerstaff <
>> >> j...@johnbickerstaff.com>
>> >> wrote:
>> >>
>> >> > Thanks for the comment Mary Jo...
>> >> >
>> >> > The error loading the class rings a bell - did you find and follow
>> >> > instructions for adding that to the WAR file?  I vaguely remember
>> seeing
>> >> > something about that.
>> >> >
>> >> > I'm going to try my own tests on the auto phrasing one..  If I'm
>> >> > successful, I'll post back.
>> >> >
>> >> > On Mon, May 30, 2016 at 3:45 PM, MaryJo Sminkey <mjsmin...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > This is a very timely discussion for me as well as we're trying to
>> >> tackle
>> >> > > the multi term synonym issue as well and have not been able to
>> >> hon-lucene
>> >> > > plugin to work, the jar shows up as installed but when we set up the
>> >> > sample
>> >> > > request handler it throws this error:
>> >> > >

Re: Multiple calls across the distributed nodes for a query

2016-06-15 Thread Jeff Wartes
Any distributed query falls into the two-phase process. Actually, I think some 
components may require a third phase. (faceting?)

However, there are also cases where only a single pass is required. A query with 
fl=id,score will only need a single pass, for example, since it doesn’t need to 
get the field values.

https://issues.apache.org/jira/browse/SOLR-5768 would be a good place to read 
about some of this, and provides a way to help force a one-pass even if you 
need other fields.
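
As a rough illustration (field and collection names are placeholders, and 
distrib.singlePass is the parameter that ticket added, so check your version supports it):

  # One phase: ids and scores only, no second trip to fetch stored fields
  http://host:8983/solr/mycoll/select?q=foo&fl=id,score

  # Request stored fields but ask for a single pass anyway
  http://host:8983/solr/mycoll/select?q=foo&fl=id,name,price&distrib.singlePass=true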


On 6/15/16, 7:31 AM, "Raveendra Yerraguntla"  
wrote:

>I need help in understanding a query in solr cloud.
>When user issues a query , there are are two phases of query - one with the
>purpose(from debug info) of GET_TOP_FIELDS and another with GET_FIELDS.
>
>This is having an effect on end to end performance of the application.
>
>- what triggers (any components like facet, highlight, spellchecker ??? )
>the two calls
>- How can I make a query to be executed only with GET_FIELDS only .



Re: Long STW GCs with Solr Cloud

2016-06-16 Thread Jeff Wartes
Check your gc log for CMS “concurrent mode failure” messages. 

If a concurrent CMS collection fails, it does a stop-the-world pause while it 
cleans up using a *single thread*. This means the stop-the-world CMS collection 
in the failure case is typically several times slower than a concurrent CMS 
collection. The single-thread business means it will also be several times 
slower than the Parallel collector, which is probably what you’re seeing. I 
understand that it needs to stop the world in this case, but I really wish the 
CMS failure would fall back to a Parallel collector run instead.
The Parallel collector is always going to be the fastest at getting rid of 
garbage, but only because it stops all the application threads while it runs, 
so it’s got less complexity to deal with. That said, it’s probably not going to 
be orders of magnitude faster than a (successfully) concurrent CMS collection.

Regardless, the bigger the heap, the bigger the pause.

If your application is generating a lot of garbage, or can generate a lot of 
garbage very suddenly, CMS concurrent mode failures are more likely. You can 
turn down the  -XX:CMSInitiatingOccupancyFraction value in order to give the 
CMS collection more of a head start at the cost of more frequent collections. 
If that doesn’t work, you can try using a bigger heap, but you may eventually 
find yourself trying to figure out what about your query load generates so much 
garbage (or causes garbage spikes) and trying to address that. Even G1 won’t 
protect you from highly unpredictable garbage generation rates.

In my case, for example, I found that a very small subset of my queries were 
using the CollapseQParserPlugin, which requires quite a lot of memory 
allocations, especially on a large index. Although generally this was fine, if 
I got several of these rare queries in a very short window, it would always 
spike enough garbage to cause CMS concurrent mode failures. The single-threaded 
concurrent-mode failure would then take long enough that the ZK heartbeat would 
fail, and things would just go downhill from there.
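
For what it’s worth, the knobs I’m describing look roughly like this; the numbers are 
purely illustrative, not a recommendation:

  # Use CMS, and start concurrent collections earlier so garbage spikes have headroom
  -XX:+UseConcMarkSweepGC
  -XX:+UseParNewGC
  -XX:CMSInitiatingOccupancyFraction=65
  -XX:+UseCMSInitiatingOccupancyOnly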



On 6/15/16, 3:57 PM, "Cas Rusnov"  wrote:

>Hey Shawn! Thanks for replying.
>
>Yes I meant HugePages not HugeTable, brain fart. I will give the
>transparent off option a go.
>
>I have attempted to use your CMS configs as is and also the default
>settings and the cluster dies under our load (basically a node will get a
>35-60s GC STW and then the others in the shard will take the load, and they
>will in turn get long STWs until the shard dies), which is why basically in
>a fit of desperation I tried out ParallelGC and found it to be half-way
>acceptable. I will run a test using your configs (and the defaults) again
>just to be sure (since I'm certain the machine config has changed since we
>used your unaltered settings).
>
>Thanks!
>Cas
>
>
>On Wed, Jun 15, 2016 at 3:41 PM, Shawn Heisey  wrote:
>
>> On 6/15/2016 3:05 PM, Cas Rusnov wrote:
>> > After trying many of the off the shelf configurations (including CMS
>> > configurations but excluding G1GC, which we're still taking the
>> > warnings about seriously), numerous tweaks, rumors, various instance
>> > sizes, and all the rest, most of which regardless of heap size and
>> > newspace size resulted in frequent 30+ second STW GCs, we settled on
>> > the following configuration which leads to occasional high GCs but
>> > mostly stays between 10-20 second STWs every few minutes (which is
>> > almost acceptable): -XX:+AggressiveOpts -XX:+UnlockDiagnosticVMOptions
>> > -XX:+UseAdaptiveSizePolicy -XX:+UseLargePages -XX:+UseParallelGC
>> > -XX:+UseParallelOldGC -XX:MaxGCPauseMillis=15000 -XX:MaxNewSize=12000m
>> > -XX:ParGCCardsPerStrideChunk=4096 -XX:ParallelGCThreads=16 -Xms31000m
>> > -Xmx31000m
>>
>> You mentioned something called "HugeTable" ... I assume you're talking
>> about huge pages.  If that's what you're talking about, have you also
>> turned off transparent huge pages?  If you haven't, you might want to
>> completely disable huge pages in your OS.  There's evidence that the
>> transparent option can affect performance.
>>
>> I assume you've probably looked at my GC info at the following URL:
>>
>> http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
>>
>> The parallel collector is most definitely not a good choice.  It does
>> not optimize for latency.  It's my understanding that it actually
>> prefers full GCs, because it is optimized for throughput.  Solr thrives
>> on good latency, throughput doesn't matter very much.
>>
>> If you want to continue avoiding G1, you should definitely be using
>> CMS.  My recommendation right now would be to try the G1 settings on my
>> wiki page under the heading "Current experiments" or the CMS settings
>> just below that.
>>
>> The out-of-the-box GC tuning included with Solr 6 is probably a better
>> option than the parallel collector you've got configured now.
>>
>> Thanks,
>> Shawn
>>
>>
>
>
>-- 

Re: Long STW GCs with Solr Cloud

2016-06-17 Thread Jeff Wartes
For what it’s worth, I looked into reducing the allocation footprint of 
CollapsingQParserPlugin a bit, but without success. See 
https://issues.apache.org/jira/browse/SOLR-9125

As it happened, I was collapsing on a field with such high cardinality that the 
chances of a query even doing much collapsing of interest were pretty low. That 
allowed me to use a vastly stripped-down version of CollapsingQParserPlugin 
with a *much* lower memory footprint, in exchange for collapsed document heads 
essentially being picked at random. (That is, when collapsing two documents, 
the one that gets returned is random.)

If that’s of interest, I could probably throw the code someplace public.
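
For context, the offending queries were ordinary field collapses, something like this 
(the field name is made up):

  # Keep only one document per store_id in the result set
  fq={!collapse field=store_id}

As I understand it, the plugin’s allocations scale with the number of unique values in 
the collapse field, which is why a high-cardinality field on a large index generated 
such garbage spikes for me.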


On 6/16/16, 3:39 PM, "Cas Rusnov" <c...@manzama.com> wrote:

>Hey thanks for your reply.
>
>Looks like running the suggested CMS config from Shawn, we're getting some
>nodes with 30+sec pauses, I gather due to large heap, interestingly enough
>while the scenario Jeff talked about is remarkably similar (we use field
>collapsing), including the performance aspects of it, we are getting
>concurrent mode failures both due to new space allocation failures and due
>to promotion failures. I suspect there's a lot of garbage building up.
>We're going to run tests with field collapsing disabled and see if that
>makes a difference.
>
>Cas
>
>
>On Thu, Jun 16, 2016 at 1:08 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>
>> Check your gc log for CMS “concurrent mode failure” messages.
>>
>> If a concurrent CMS collection fails, it does a stop-the-world pause while
>> it cleans up using a *single thread*. This means the stop-the-world CMS
>> collection in the failure case is typically several times slower than a
>> concurrent CMS collection. The single-thread business means it will also be
>> several times slower than the Parallel collector, which is probably what
>> you’re seeing. I understand that it needs to stop the world in this case,
>> but I really wish the CMS failure would fall back to a Parallel collector
>> run instead.
>> The Parallel collector is always going to be the fastest at getting rid of
>> garbage, but only because it stops all the application threads while it
>> runs, so it’s got less complexity to deal with. That said, it’s probably
>> not going to be orders of magnitude faster than a (successfully) concurrent
>> CMS collection.
>>
>> Regardless, the bigger the heap, the bigger the pause.
>>
>> If your application is generating a lot of garbage, or can generate a lot
>> of garbage very suddenly, CMS concurrent mode failures are more likely. You
>> can turn down the  -XX:CMSInitiatingOccupancyFraction value in order to
>> give the CMS collection more of a head start at the cost of more frequent
>> collections. If that doesn’t work, you can try using a bigger heap, but you
>> may eventually find yourself trying to figure out what about your query
>> load generates so much garbage (or causes garbage spikes) and trying to
>> address that. Even G1 won’t protect you from highly unpredictable garbage
>> generation rates.
>>
>> In my case, for example, I found that a very small subset of my queries
>> were using the CollapseQParserPlugin, which requires quite a lot of memory
>> allocations, especially on a large index. Although generally this was fine,
>> if I got several of these rare queries in a very short window, it would
>> always spike enough garbage to cause CMS concurrent mode failures. The
>> single-threaded concurrent-mode failure would then take long enough that
>> the ZK heartbeat would fail, and things would just go downhill from there.
>>
>>
>>
>> On 6/15/16, 3:57 PM, "Cas Rusnov" <c...@manzama.com> wrote:
>>
>> >Hey Shawn! Thanks for replying.
>> >
>> >Yes I meant HugePages not HugeTable, brain fart. I will give the
>> >transparent off option a go.
>> >
>> >I have attempted to use your CMS configs as is and also the default
>> >settings and the cluster dies under our load (basically a node will get a
>> >35-60s GC STW and then the others in the shard will take the load, and
>> they
>> >will in turn get long STWs until the shard dies), which is why basically
>> in
>> >a fit of desperation I tried out ParallelGC and found it to be half-way
>> >acceptable. I will run a test using your configs (and the defaults) again
>> >just to be sure (since I'm certain the machine config has changed since we
>> >used your unaltered settings).
>> >
>> >Thanks!
>> >Cas
>> >
>> >
>> >On Wed, Jun 15, 2016 at 3:41 PM, Shawn Heisey

Re: SolrCloud: Adding a very large collection to a pre-existing cluster

2016-06-21 Thread Jeff Wartes

There’s no official way of doing #1, but there are some less official ways:
1. The Backup/Restore API provides some hooks into loading pre-existing data 
dirs into an existing collection. Lots of caveats.
2. If you don’t have many shards, there’s always rsync/reload.
3. There are some third-party tools that help with this kind of thing:
a. https://github.com/whitepages/solrcloud_manager (primarily a command line 
tool)
b. https://github.com/bloomreach/solrcloud-haft (primarily a library)

For #2, absolutely. Spin up some new nodes in your cluster, and then use the 
“createNodeSet” parameter when creating the new collection to restrict to those 
new nodes:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
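
A sketch of what that create call might look like; the collection, config, and node 
names are placeholders, and the URL is wrapped here for readability:

  # Create the new collection only on the freshly-added nodes
  http://host:8983/solr/admin/collections?action=CREATE&name=newcoll
    &numShards=4&replicationFactor=2&collection.configName=myconf
    &createNodeSet=newnode1:8983_solr,newnode2:8983_solr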




On 6/21/16, 12:33 PM, "Kelly, Frank"  wrote:

>We have about 200 million documents (~70 GB) we need to keep indexed across 3 
>collections.
>
>Currently 2 of the 3 collections are already indexed (roughly 90m docs).
>
>We'd like to create the remaining collection (about 100 m documents) but 
>minimizing the performance impact on the existing collections on Solr servers 
>during that Time.
>
>Is there some way to do this either by
>
>  1.  Creating the collection in another environment and shipping the 
> (underlying Lucene) index files
>  2.  Creating the collection on (dedicated) new machines that we add to the 
> SolrCloud cluster?
>
>Thoughts, comments or suggestions appreciated,
>
>Best
>
>-Frank Kelly
>



Re: collection aliasing

2016-01-28 Thread Jeff Wartes
I enjoy using collection aliases in all client references, because that allows 
me to change the collection all clients use without updating the clients. I 
just move the alias. 
This is particularly useful if I’m doing a full index rebuild and want an 
atomic, zero-downtime switchover.
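
Re-pointing an alias is a single CREATEALIAS call; issuing it again with the same name 
just replaces the old mapping. A sketch with placeholder names:

  # Clients query "products"; this atomically points it at the rebuilt collection
  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2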





On 1/28/16, 6:07 AM, "Shawn Heisey"  wrote:

>On 1/28/2016 2:59 AM, vidya wrote:
>> Hi
>>
>> Then what is the difference between collection aliasing and shards parameter
>> mentioned in request handler of solrconfig.xml.
>>
>> In request handler of new collection's solrconfig.xml
>>shards =
>> http://localhost:8983/solr/collection1,http://localhost:8983/solr/collection1
>> I can query both data of collection1 and collection2 in new collection which
>> is same as collection aliasing.
>>
>> Is my understanding correct ? If so, then what is the special characteristic
>> of collection alaising. Please help me.
>
>Collection aliasing handles it completely automatically, no need to put
>a shards parameter *anywhere*.  That is the main difference.
>
>The shards parameter is the old way of doing distributed searches. 
>SolrCloud completely automates the process so that neither the admin nor
>the user has to worry about it.  Aliases are part of that automation.
>
>Thanks,
>Shawn
>


Re: SolrCloud replicas out of sync

2016-01-27 Thread Jeff Wartes

On 1/27/16, 8:28 AM, "Shawn Heisey"  wrote:





>
>I don't think any documentation states this, but it seems like a good
>idea to me use an alias from day one, so that you always have the option
>of swapping the "real" collection that you are using without needing to
>change anything else.  I'll need to ask some people if they think this
>is a good documentation addition, and think of a good place to mention
>it in the reference guide.


+1 - I recommend this at every opportunity. 

I’ve even considered creating 10 aliases for a single collection and having the 
client select one of the aliases randomly per query. This would allow 
transparently shifting traffic between collections in 10% increments.




Re: SolrCloud replicas out of sync

2016-01-27 Thread Jeff Wartes

If you can identify the problem documents, you can just re-index those after 
forcing a sync. Might save a full rebuild and downtime.

You might describe your cluster setup, including ZK. It sounds like you’ve done 
your research, but improper ZK node distribution could certainly invalidate 
some of Solr’s assumptions.




On 1/27/16, 7:59 AM, "David Smith"  wrote:

>Jeff, again, very much appreciate your feedback.  
>
>It is interesting — the article you linked to by Shalin is exactly why we 
>picked SolrCloud over ES, because (eventual) consistency is critical for our 
>application and we will sacrifice availability for it.  To be clear, after the 
>outage, NONE of our three replicas are correct or complete.
>
>So we definitely don’t have CP yet — our very first network outage resulted in 
>multiple overlapped lost updates.  As a result, I can’t pick one replica and 
>make it the new “master”.  I must rebuild this collection from scratch, which 
>I can do, but that requires downtime which is a problem in our app (24/7 High 
>Availability with few maintenance windows).
>
>
>So, I definitely need to “fix” this somehow.  I wish I could outline a 
>reproducible test case, but as the root cause is likely very tight timing 
>issues and complicated interactions with Zookeeper, that is not really an 
>option.  I’m happy to share the full logs of all 3 replicas though if that 
>helps.
>
>I am curious though if the thoughts have changed since 
>https://issues.apache.org/jira/browse/SOLR-5468 of seriously considering a 
>“majority quorum” model, with rollback?  Done properly, this should be free of 
>all lost update problems, at the cost of availability.  Some SolrCloud users 
>(like us!!!) would gladly accept that tradeoff.  
>
>Regards
>
>David
>
>


Re: Shard allocation across nodes

2016-02-01 Thread Jeff Wartes

You could write your own snitch: 
https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement

Or, it would be more annoying, but you can always add/remove replicas manually 
and juggle things yourself after you create the initial collection.
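
If you go the rule-based route, the create call takes a rule parameter. A rough sketch, 
assuming each node is started with a system property identifying its physical host 
(the property name here is made up):

  # At most one replica of any given shard per value of the "physhost" sysprop
  http://host:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2
    &replicationFactor=3&rule=shard:*,replica:<2,sysprop.physhost:*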




On 2/1/16, 8:42 AM, "Tom Evans"  wrote:

>Hi all
>
>We're setting up a solr cloud cluster, and unfortunately some of our
>VMs may be physically located on the same VM host. Is there a way of
>ensuring that all copies of a shard are not located on the same
>physical server?
>
>If they do end up in that state, is there a way of rebalancing them?
>
>Cheers
>
>Tom


Re: Restoring backups of solrcores

2016-02-01 Thread Jeff Wartes

Aliases work when indexing too.

Create collection: collection1
Create alias: this_week -> collection1
Index to: this_week

Next week...

Create collection: collection2
Create (Move) alias: this_week -> collection2
Index to: this_week




On 2/1/16, 2:14 AM, "vidya"  wrote:

>Hi 
>
>How can that be useful, can u please explain.
>I want to have the same collection name everytime when I index data i.e.,
>current_collection.
>
>By collection aliasing, i can create a new collection and point my alias
>(say ALIAS) to new collection but cannot rename that collection to the same
>current_collection which i have created and indexed previous week.
>
>So, are you asking me to create whatever collection name i want to create
>but point out my alias with name i want and change that alias pointing to
>new collection that i create and query using my alias name.
>
>Please help me on this.
>
>Thanks in advance
>
>
>
>--
>View this message in context: 
>http://lucene.472066.n3.nabble.com/Restoring-backups-of-solrcores-tp4254080p4254366.html
>Sent from the Solr - User mailing list archive at Nabble.com.


Re: very slow frequent updates

2016-02-24 Thread Jeff Wartes

I suspect your problem is the intersection of “very large document” and “high 
rate of change”. Either of those alone would be fine.

You’re correct: if the thing you need to search or sort by is the thing with a 
high change rate, you probably aren’t going to be able to peel those things out 
of your index. 

Perhaps you could work something out with join queries? So you have two kinds 
of documents - book content and book price - and your high-frequency change is 
limited to documents with very little data.
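
A sketch with made-up field names: each book gets a tiny companion document holding just 
book_id and price, and a join query maps matches on the price docs back to the books:

  # Books matching the text query whose companion price doc is under 1000
  q=content:gardening&fq={!join from=book_id to=id}type:price AND price:[* TO 1000]

That only gets you filtering, though; sorting by the joined price would take more work.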





On 2/24/16, 4:01 AM, "roland.sz...@booknwalk.com on behalf of Szűcs Roland" 
<roland.sz...@booknwalk.com on behalf of szucs.rol...@bookandwalk.hu> wrote:

>I have checked it already in the ref. guide. It is stated that you can not
>search in external fields:
>https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes
>
>Really I am very curios that my problem is not a usual one or the case is
>that SOLR mainly focuses on search and not a kind of end-to-end support.
>How this approach works with 1 million documents with frequently changing
>prices?
>
>Thanks your time,
>
>Roland
>
>2016-02-24 12:39 GMT+01:00 Stefan Matheis <matheis.ste...@gmail.com>:
>
>> Depending of what features you do actually need, might be worth a look
>> on "External File Fields" Roland?
>>
>> -Stefan
>>
>> On Wed, Feb 24, 2016 at 12:24 PM, Szűcs Roland
>> <szucs.rol...@bookandwalk.hu> wrote:
>> > Thanks Jeff your help,
>> >
>> > Can it work in production environment? Imagine when my customer initiate
>> a
>> > query having 1 000 docs in the result set. I can not use the pagination
>> of
>> > SOLR as the field which is the basis of the sort is not included in the
>> > schema for example the price. The customer wants the list in descending
>> > order of the price.
>> >
>> > So I have to get all the 1000 docids from solr and find the metadata of
>> > them in a sql database or in cache in best case. This is the way you
>> > suggested? Is it not too slow?
>> >
>> > Regards,
>> > Roland
>> >
>> > 2016-02-23 19:29 GMT+01:00 Jeff Wartes <jwar...@whitepages.com>:
>> >
>> >>
>> >> My suggestion would be to split your problem domain. Use Solr
>> exclusively
>> >> for search - index the id and only those fields you need to search on.
>> Then
>> >> use some other data store for retrieval. Get the id’s from the solr
>> >> results, and look them up in the data store to get the rest of your
>> fields.
>> >> This allows you to keep your solr docs as small as possible, and you
>> only
>> >> need to update them when a *searchable* field changes.
>> >>
>> >> Every “update" in solr is a delete/insert. Even the "atomic update”
>> >> feature is just a shortcut for that. It requires stored fields because
>> the
>> >> data from the stored fields gets copied into the new insert.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On 2/22/16, 12:21 PM, "Roland Szűcs" <roland.sz...@booknwalk.com>
>> wrote:
>> >>
>> >> >Hi folks,
>> >> >
>> >> >We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of the
>> >> >fields do not change at all like content, author, publisher Only
>> the
>> >> >price field changes frequently.
>> >> >
>> >> >We let the customers to make full text search so we indexed the content
>> >> >filed. Due to the frequency of the price updates we use the atomic
>> update
>> >> >feature. As a requirement of the atomic updates we have to store all
>> the
>> >> >fields even the content field which is 1MB/document and we did not
>> want to
>> >> >store it just index it.
>> >> >
>> >> >As we wanted to update 100 documents with atomic update it took about 3
>> >> >minutes. Taking into account that our metadata /document is 1 Kb and
>> our
>> >> >content field / document is 1MB we use 1000 more memory to accelerate
>> the
>> >> >update process.
>> >> >
>> >> >I am almost 100% sure that we make something wrong.
>> >> >
>> >> >What is the best practice of the frequent updates when 99% part of a
>> given
>> >> >document is constant forever?
>> >> >
>> >> >Thank i

Re: very slow frequent updates

2016-02-23 Thread Jeff Wartes

My suggestion would be to split your problem domain. Use Solr exclusively for 
search - index the id and only those fields you need to search on. Then use 
some other data store for retrieval. Get the id’s from the solr results, and 
look them up in the data store to get the rest of your fields. This allows you 
to keep your solr docs as small as possible, and you only need to update them 
when a *searchable* field changes.

Every “update" in solr is a delete/insert. Even the "atomic update” feature is 
just a shortcut for that. It requires stored fields because the data from the 
stored fields gets copied into the new insert.
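
In practice the flow is just: ask Solr for ids, then hydrate from your store of record. 
A minimal sketch with placeholder names:

  # 1. The search returns only ids (plus whatever you rank or sort on)
  http://host:8983/solr/books/select?q=content:gardening&fl=id&rows=20

  # 2. Look those 20 ids up in your database or KV store to get titles, prices, etc.,
  #    which is also where the frequently-changing price lives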





On 2/22/16, 12:21 PM, "Roland Szűcs"  wrote:

>Hi folks,
>
>We use SOLR 5.2.1. We have ebooks stored in SOLR. The majority of the
>fields do not change at all like content, author, publisher Only the
>price field changes frequently.
>
>We let the customers to make full text search so we indexed the content
>filed. Due to the frequency of the price updates we use the atomic update
>feature. As a requirement of the atomic updates we have to store all the
>fields even the content field which is 1MB/document and we did not want to
>store it just index it.
>
>As we wanted to update 100 documents with atomic update it took about 3
>minutes. Taking into account that our metadata /document is 1 Kb and our
>content field / document is 1MB we use 1000 more memory to accelerate the
>update process.
>
>I am almost 100% sure that we make something wrong.
>
>What is the best practice of the frequent updates when 99% part of a given
>document is constant forever?
>
>Thank in advance
>
>-- 
> Roland Szűcs
> Connect with
>me on Linkedin 
>
>CEO Phone: +36 1 210 81 13
>Bookandwalk.hu 


Re: Shard State vs Replica State

2016-02-26 Thread Jeff Wartes

I believe the shard state is a reflection of whether that shard is still in use 
by the collection, and has nothing to do with the state of the replicas. I 
think doing a split-shard operation would create two new shards, and mark the 
old one as inactive, for example.
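
For example (collection name is a placeholder):

  # Splits shard1 into shard1_0 and shard1_1; once they’re active, shard1 stays in the
  # cluster state but is marked inactive
  http://host:8983/solr/admin/collections?action=SPLITSHARD&collection=mycoll&shard=shard1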




On 2/26/16, 8:50 AM, "Dennis Gove"  wrote:

>In clusterstate.json (or just state.json in new versions) I'm seeing the
>following
>
>"shard1":{
>"range":"8000-d554",
>"state":"active",
>"replicas":{
>  "core_node7":{
>"core":"people_shard1_replica3",
>"base_url":"http://192.168.2.32:8983/solr;,
>"node_name":"192.168.2.32:8983_solr",
>"state":"down"},
>  "core_node9":{
>"core":"people_shard1_replica2",
>"base_url":"http://192.168.2.32:8983/solr;,
>"node_name":"192.168.2.32:8983_solr",
>"state":"down"},
>  "core_node2":{
>"core":"people_shard1_replica1",
>"base_url":"http://192.168.2.32:8983/solr;,
>"node_name":"192.168.2.32:8983_solr",
>"state":"down"}
>}
>},
>
>All replicas are down (I hosed the index for one of the replicas on purpose
>to simulate this) and each replica is showing its state accurately as
>"down". But the shard state is still showing "active". I would expect the
>shard state to reflect the availability of that shard (ie, the best state
>across all the replicas). For example, if one replica is active then the
>shard state is active, if two replicas are recovering and one is down then
>the shard state shows recovering, etc...
>
>What I'm seeing, however, doesn't match my expectation so I'm wondering
>what is shard state showing?
>
>Thanks,
>Dennis


Re: SolrCloud replicas out of sync

2016-01-26 Thread Jeff Wartes

My understanding is that the "version" represents the timestamp at which the searcher 
was opened, so it doesn’t really offer any assurances about your data.

Although you could probably bounce a node and get your document counts back in 
sync (by provoking a check), it’s interesting that you’re in this situation. It 
implies to me that at some point the leader couldn’t write a doc to one of the 
replicas, but that the replica didn’t consider itself down enough to check 
itself.

You might watch the achieved replication factor of your updates and see if it 
ever changes:
https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
 (See Achieved Replication Factor/min_rf)

If it does, that might give you clues about how this is happening. Also, it 
might allow you to work around the issue by trying the write again.
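
Concretely, that means sending min_rf with your updates and checking the rf value in the 
response; if it comes back lower than your replica count, a replica didn’t acknowledge 
the write and you can retry. A sketch with a placeholder document:

  curl "http://host:8983/solr/mycoll/update?min_rf=2" \
    -H "Content-Type: application/json" \
    -d '[{"id":"doc1","title":"example"}]'
  # The response includes an "rf" value; rf < 2 here means not all replicas confirmed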






On 1/22/16, 10:52 AM, "David Smith"  wrote:

>I have a SolrCloud v5.4 collection with 3 replicas that appear to have fallen 
>permanently out of sync.  Users started to complain that the same search, 
>executed twice, sometimes returned different result counts.  Sure enough, our 
>replicas are not identical:
>
>>> shard1_replica1:  89867 documents / version 1453479763194
>>> shard1_replica2:  89866 documents / version 1453479763194
>>> shard1_replica3:  89867 documents / version 1453479763191
>
>I do not think this discrepancy is going to resolve itself.  The Solr Admin 
>screen reports all 3 replicas as “Current”.  The last modification to this 
>collection was 2 hours before I captured this information, and our auto commit 
>time is 60 seconds.  
>
>I have a lot of concerns here, but my first question is if anyone else has had 
>problems with out of sync replicas, and if so, what they have done to correct 
>this?
>
>Kind Regards,
>
>David
>


Re: SolrCloud replicas out of sync

2016-01-26 Thread Jeff Wartes

Ah, perhaps you fell into something like this then? 
https://issues.apache.org/jira/browse/SOLR-7844

That says it’s fixed in 5.4, but that would be an example of a split-brain type 
incident, where different documents were accepted by different replicas who 
each thought they were the leader. If this is the case, and you actually have 
different data on each replica, I’m not aware of any way to fix the problem 
short of reindexing those documents. Before that, you’ll probably need to 
choose a replica and just force the others to get in sync with it. I’d choose 
the current leader, since that’s slightly easier.

Typically, a leader writes an update to its transaction log, then sends the 
request to all replicas, and when those all finish it acknowledges the update. 
If a replica gets restarted, and is less than N documents behind, the leader 
will only replay that transaction log. (Where N is the numRecordsToKeep 
configured in the updateLog section of solrconfig.xml)

What you want is to provoke the heavy-duty process normally invoked if a 
replica has missed more than N docs, which essentially does a checksum and file 
copy on all the raw index files. FetchIndex would probably work, but it’s a 
replication handler API originally designed for master/slave replication, so 
take care: https://wiki.apache.org/solr/SolrReplication#HTTP_API
Probably a lot easier would be to just delete the replica and re-create it. 
That will also trigger a full file copy of the index from the leader onto the 
new replica.

I think design decisions around Solr generally use CP as a goal. (I sometimes 
wish I could get more AP behavior!) See posts like this: 
http://lucidworks.com/blog/2014/12/10/call-maybe-solrcloud-jepsen-flaky-networks/
 
So the fact that you encountered this sounds like a bug to me.
That said, another general recommendation (of mine) is that you not use Solr as 
your primary data source, so you can rebuild your index from scratch if you 
really need to. 






On 1/26/16, 1:10 PM, "David Smith" <dsmiths...@yahoo.com.INVALID> wrote:

>Thanks Jeff!  A few comments
>
>>>
>>> Although you could probably bounce a node and get your document counts back 
>>> in sync (by provoking a check)
>>>
> 
>
>If the check is a simple doc count, that will not work. We have found that 
>replica1 and replica3, although they contain the same doc count, don’t have 
>the SAME docs.  They each missed at least one update, but of different docs.  
>This also means none of our three replicas are complete.
>
>>>
>>>it’s interesting that you’re in this situation. It implies to me that at 
>>>some point the leader couldn’t write a doc to one of the replicas,
>>>
>
>That is our belief as well. We experienced a datacenter-wide network 
>disruption of a few seconds, and user complaints started the first workday 
>after that event.  
>
>The most interesting log entry during the outage is this:
>
>"1/19/2016, 5:08:07 PM ERROR null DistributedUpdateProcessorRequest says it is 
>coming from leader,​ but we are the leader: 
>update.distrib=FROMLEADER=http://dot.dot.dot.dot:8983/solr/blah_blah_shard1_replica3/=javabin=2;
>
>>>
>>> You might watch the achieved replication factor of your updates and see if 
>>> it ever changes
>>>
>
>This is a good tip. I’m not sure I like the implication that any failure to 
>write all 3 of our replicas must be retried at the app layer.  Is this really 
>how SolrCloud applications must be built to survive network partitions without 
>data loss? 
>
>Regards,
>
>David
>
>
>On 1/26/16, 12:20 PM, "Jeff Wartes" <jwar...@whitepages.com> wrote:
>
>>
>>My understanding is that the "version" represents the timestamp the searcher 
>>was opened, so it doesn’t really offer any assurances about your data.
>>
>>Although you could probably bounce a node and get your document counts back 
>>in sync (by provoking a check), it’s interesting that you’re in this 
>>situation. It implies to me that at some point the leader couldn’t write a 
>>doc to one of the replicas, but that the replica didn’t consider itself down 
>>enough to check itself.
>>
>>You might watch the achieved replication factor of your updates and see if it 
>>ever changes:
>>https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance
>> (See Achieved Replication Factor/min_rf)
>>
>>If it does, that might give you clues about how this is happening. Also, it 
>>might allow you to work around the issue by trying the write again.
>>
>>
>>
>>
>>
>>
>>On 1/22/16, 10:52 AM, "David Smith" <dsmiths...@yahoo.com.INVALID> 

Re: Adding nodes

2016-02-17 Thread Jeff Wartes
Solrcloud does not come with any autoscaling functionality. If you want such a 
thing, you’ll need to write it yourself.

https://github.com/whitepages/solrcloud_manager might be a useful head start 
though, particularly the “fill” and “cleancollection” commands. I don’t do 
*auto* scaling, but I do use this for all my cluster management, which 
certainly involves moving collections/shards around among nodes, adding 
capacity, and removing capacity.






On 2/14/16, 11:17 AM, "McCallick, Paul"  wrote:

>These are excellent questions and give me a good sense of why you suggest 
>using the collections api.
>
>In our case we have 8 shards of product data with a even distribution of data 
>per shard, no hot spots. We have very different load at different points in 
>the year (cyber monday), and we tend to have very little traffic at night. I'm 
>thinking of two use cases:
>
>1) we are seeing increased latency due to load and want to add 8 more replicas 
>to handle the query volume.  Once the volume subsides, we'd remove the nodes. 
>
>2) we lose a node due to some unexpected failure (ec2 tends to do this). We 
>want auto scaling to detect the failure and add a node to replace the failed 
>one. 
>
>In both cases the core api makes it easy. It adds nodes to the shards evenly. 
>Otherwise we have to write a fairly involved script that is subject to race 
>conditions to determine which shard to add nodes to. 
>
>Let me know if I'm making dangerous or uninformed assumptions, as I'm new to 
>solr. 
>
>Thanks,
>Paul
>
>> On Feb 14, 2016, at 10:35 AM, Susheel Kumar  wrote:
>> 
>> Hi Pual,
>> 
>> 
>> For Auto-scaling, it depends on how you are thinking to design and what/how
>> do you want to scale. Which scenario you think makes coreadmin API easy to
>> use for a sharded SolrCloud environment?
>> 
>> Isn't if in a sharded environment (assume 3 shards A,B & C) and shard B has
>> having higher or more load,  then you want to add Replica for shard B to
>> distribute the load or if a particular shard replica goes down then you
>> want to add another Replica back for the shard in which case ADDREPLICA
>> requires a shard name?
>> 
>> Can you describe your scenario / provide more detail?
>> 
>> Thanks,
>> Susheel
>> 
>> 
>> 
>> On Sun, Feb 14, 2016 at 11:51 AM, McCallick, Paul <
>> paul.e.mccall...@nordstrom.com> wrote:
>> 
>>> Hi all,
>>> 
>>> 
>>> This doesn’t really answer the following question:
>>> 
>>> What is the suggested way to add a new node to a collection via the
>>> apis?  I  am specifically thinking of autoscale scenarios where a node has
>>> gone down or more nodes are needed to handle load.
>>> 
>>> 
>>> The coreadmin api makes this easy.  The collections api (ADDREPLICA),
>>> makes this very difficult.
>>> 
>>> 
 On 2/14/16, 8:19 AM, "Susheel Kumar"  wrote:
 
 Hi Paul,
 
 Shawn is referring to use Collections API
 https://cwiki.apache.org/confluence/display/solr/Collections+API  than
>>> Core
 Admin API https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API
 for SolrCloud.
 
 Hope that clarifies and you mentioned about ADDREPLICA which is the
 collections API, so you are on right track.
 
 Thanks,
 Susheel
 
 
 
 On Sun, Feb 14, 2016 at 10:51 AM, McCallick, Paul <
 paul.e.mccall...@nordstrom.com> wrote:
 
> Then what is the suggested way to add a new node to a collection via the
> apis?  I  am specifically thinking of autoscale scenarios where a node
>>> has
> gone down or more nodes are needed to handle load.
> 
> Note that the ADDREPLICA endpoint requires a shard name, which puts the
> onus of how to scale out on the user. This can be challenging in an
> autoscale scenario.
> 
> Thanks,
> Paul
> 
>> On Feb 14, 2016, at 12:25 AM, Shawn Heisey 
>>> wrote:
>> 
>>> On 2/13/2016 6:01 PM, McCallick, Paul wrote:
>>> - When creating a new collection, SOLRCloud will use all available
> nodes for the collection, adding cores to each.  This assumes that you
>>> do
> not specify a replicationFactor.
>> 
>> The number of nodes that will be used is numShards multipled by
>> replicationFactor.  The default value for replicationFactor is 1.  If
>> you do not specify numShards, there is no default -- the CREATE call
>> will fail.  The value of maxShardsPerNode can also affect the overall
>> result.
>> 
>>> - When adding new nodes to the cluster AFTER the collection is
>>> created,
> one must use the core admin api to add the node to the collection.
>> 
>> Using the CoreAdmin API is strongly discouraged when running
>>> SolrCloud.
>> It works, but it is an expert API when in cloud mode, and can cause
>> serious problems if not used correctly.  Instead, use the Collections
>> API.  It can handle all normal 

Re: SolrCloud - Strategy for recovering cluster states

2016-03-01 Thread Jeff Wartes

I’ve been running SolrCloud clusters in various versions for a few years here, 
and I can only think of two or three cases where the ZK-stored cluster state was 
broken in a way that I had to manually intervene by hand-editing the contents 
of ZK. I think I’ve seen Solr fixes go by for those cases, too. I’ve never 
completely wiped ZK. (Although granted, my ZK cluster has been pretty stable, 
and my collection count is smaller than yours)

My philosophy is that ZK is the source of cluster configuration, not the 
collection of core.properties files on the nodes. 
Currently, cluster state is shared between ZK and core directories. I’d prefer, 
and I think Solr development is going this way (SOLR-7269), that all cluster 
state exist and be managed via ZK, and all state be removed from the local disk 
of the cluster nodes. The fact that a node uses local disk based configuration 
to figure out what collections/replicas it has is something that should be 
fixed, in my opinion.

If you’re frequently getting into bad states due to ZK issues, I’d suggest you 
file bugs against Solr for the fact that you got into the state, and then fix 
your ZK cluster.

Failing that, can you just periodically back up your ZK data and restore it if 
something breaks? I wrote a little tool to watch clusterstate.json and write 
every version to a local git repo a few years ago. I was mostly interested 
because I wanted to see changes that happened pretty fast, but it could also 
serve as a backup approach. Here’s a link, although I clearly haven’t touched 
it lately. Feel free to ask if you have issues: 
https://github.com/randomstatistic/git_zk_monitor




On 3/1/16, 12:09 PM, "danny teichthal"  wrote:

>Hi,
>Just summarizing my questions if the long mail is a little intimidating:
>1. Is there a best practice/automated tool for overcoming problems in
>cluster state coming from zookeeper disconnections?
>2. Creating a collection via core admin is discouraged, is it true also for
>core.properties discovery?
>
>I would like to be able to specify collection.configName in the
>core.properties and when starting server, the collection will be created
>and linked to the config name specified.
>
>
>
>On Mon, Feb 29, 2016 at 4:01 PM, danny teichthal 
>wrote:
>
>> Hi,
>>
>>
>> I would like to describe a process we use for overcoming problems in
>> cluster state when we have networking issues. Would appreciate if anyone
>> can answer about what are the flaws on this solution and what is the best
>> practice for recovery in case of network problems involving zookeeper.
>> I'm working with Solr Cloud with version 5.2.1
>> ~100 collections in a cluster of 6 machines.
>>
>> This is the short procedure:
>> 1. Bring all the cluster down.
>> 2. Clear all data from zookeeper.
>> 3. Upload configuration.
>> 4. Restart the cluster.
>>
>> We rely on the fact that a collection is created on core discovery
>> process, if it does not exist. It gives us much flexibility.
>> When the cluster comes up, it reads from core.properties and creates the
>> collections if needed.
>> Since we have only one configuration, the collections are automatically
>> linked to it and the cores inherit it from the collection.
>> This is a very robust procedure, that helped us overcome many problems
>> until we stabilized our cluster which is now pretty stable.
>> I know that the leader might change in such case and may lose updates, but
>> it is ok.
>>
>>
>> The problem is that today I want to add a new config set.
>> When I add it and clear zookeeper, the cores cannot be created because
>> there are 2 configurations. This breaks my recovery procedure.
>>
>> I thought about a few options:
>> 1. Put the config Name in core.properties - this doesn't work. (It is
>> supported in CoreAdminHandler, but  is discouraged according to
>> documentation)
>> 2. Change recovery procedure to not delete all data from zookeeper, but
>> only relevant parts.
>> 3. Change recovery procedure to delete all, but recreate and link
>> configurations for all collections before startup.
>>
>> Option #1 is my favorite, because it is very simple, it is currently not
>> supported, but from looking on code it looked like it is not complex to
>> implement.
>>
>>
>>
>> My questions are:
>> 1. Is there something wrong in the recovery procedure that I described ?
>> 2. What is the best way to fix problems in cluster state, except from
>> editing clusterstate.json manually? Is there an automated tool for that? We
>> have about 100 collections in a cluster, so editing is not really a
>> solution.
>> 3.Is creating a collection via core.properties is also discouraged?
>>
>>
>>
>> Would very appreciate any answers/ thoughts on that.
>>
>>
>> Thanks,
>>
>>
>>
>>
>>
>>


Re: SolrCloud backup/restore

2016-04-05 Thread Jeff Wartes

There is some automation around this process in the backup commands here:
https://github.com/whitepages/solrcloud_manager


It’s been tested with 5.4, and will restore arbitrary replication factors. 
Assuming a shared filesystem for backups, of course.



On 4/5/16, 3:18 AM, "Reth RM"  wrote:

>Yes. It should be backing up each shard leader of collection. For each
>collection, for each shard, find the leader and request a backup command on
>that. Further, restore this on new collection, in its respective shard and
>then go on adding new replica which will duly pull it from the newly added
>shards.
>
>
>On Mon, Apr 4, 2016 at 10:32 PM, Zisis Tachtsidis 
>wrote:
>
>> I've tested backup/restore successfully in a SolrCloud installation with a
>> single node (no replicas). This has been achieved in
>> https://issues.apache.org/jira/browse/SOLR-6637
>> Can you do something similar when more replicas are involved? What I'm
>> looking for is a restore command that will restore index in all replicas of
>> a collection.
>> Judging from the code in /ReplicationHandler.java/ and
>> https://issues.apache.org/jira/browse/SOLR-5750 I assume that more work
>> needs to be done to achieve this.
>>
>> Is my understanding correct? If the situation is like this I guess an
>> alternative would be to just create a new collection, restore index and
>> then
>> add replicas. (I'm using Solr 5.5.0)
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/SolrCloud-backup-restore-tp4267954.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>


Re: SolrCloud no leader for collection

2016-04-05 Thread Jeff Wartes
I recall I had some luck fixing a leader-less shard (after a ZK quorum failure) 
by forcibly removing the records for the down-state replicas from the leader 
election list, and then forcing an election. 
The ZK path looks like /collections/{collection}/leader_elect/shardX/election. 
Usually you’ll find the down-state one that keeps getting elected is the first 
one. Delete that, then try the force-election collections api command again.
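
As a sketch of the mechanics (collection, shard, and election-node names are 
illustrative), using the plain ZooKeeper CLI followed by the collections API:

  # In zkCli.sh: inspect the election queue and delete the stale head-of-line entry
  ls /collections/mycoll/leader_elect/shard1/election
  delete /collections/mycoll/leader_elect/shard1/election/98765432100000-core_node1-n_0000000000

  # Then ask Solr to run the election again
  http://host:8983/solr/admin/collections?action=FORCELEADER&collection=mycoll&shard=shard1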



On 4/5/16, 3:15 AM, "Tom Evans"  wrote:

>Hi all, I have an 8 node SolrCloud 5.5 cluster with 11 collections,
>most of them in a 1 shard x 8 replicas configuration. We have 5 ZK
>nodes.
>
>During the night, we attempted to reindex one of the larger
>collections. We reindex by pushing json docs to the update handler
>from a number of processes. It seemed this overwhelmed the servers,
>and caused all of the collections to fail and end up in either a down
>or a recovering state, often with no leader.
>
>Restarting and rebooting the servers brought a lot of the collections
>back online, but we are left with a few collections for which all the
>nodes hosting those replicas are up, but the replica reports as either
>"active" or "down", and with no leader.
>
>Trying to force a leader election has no effect, it keeps choosing a
>leader that is in "down" state. Removing all the nodes that are in
>"down" state and forcing a leader election also has no effect.
>
>
>Any ideas? The only viable option I see is to create a new collection,
>index it and then remove the old collection and alias it in.
>
>Cheers
>
>Tom


Re: SolrCloud - Strategy for recovering cluster states

2016-03-02 Thread Jeff Wartes
Well, with the understanding that someone who isn’t involved in the process is 
describing something that isn’t built yet...

I could imagine changes like:
 - Core discovery ignores cores that aren’t present in the ZK cluster state
 - New cores are automatically created to bring a node in line with ZK cluster 
state (addreplica, essentially) 
 
So if the clusterstate said “node XYZ has a replica of shard3 of collection1 
and that’s all”, and you downed node XYZ and deleted the data directory, it’d 
get restored when you started the node again. And if you copied the core 
directory for shard1 of collection2 in there and restarted the node, it’d get 
ignored because the clusterstate says node XYZ doesn’t have that.

More importantly, if you completely destroyed a node and rebuilt it from an 
image (AWS?), that image wouldn't need any special core directories specific to 
that node. As long as the node name was the same, Solr would handle bringing 
that node back to where it was in the cluster.

Back to opinions, I think mixing the cluster definition between local disk on 
the nodes and ZK clusterstate is just confusing. It should really be one or the 
other. Specifically, I think it should be local disk for non-SolrCloud, and ZK 
for SolrCloud.





On 3/2/16, 12:13 AM, "danny teichthal" <dannyt...@gmail.com> wrote:

>Thanks Jeff,
>I understand your philosophy and it sounds correct.
>Since we had many problems with zookeeper when switching to Solr Cloud. we
>couldn't make it as a source of knowledge and had to relay on a more stable
>source.
>The issues is that when we get such an event of zookeeper, it brought our
>system down, and in this case, clearing the core.properties were a life
>saver.
>We've managed to make it pretty stable not, but we will always need a
>"dooms day" weapon.
>
>I looked into the related JIRA and it confused me a little, and raised a
>few other questions:
>1. What exactly defines zookeeper as a truth?
>2. What is the role of core.properties if the state is only in zookeeper?
>
>
>
>Your tool is very interesting; I had just been thinking about writing such
>a tool myself.
>From the sources I understand that you represent each ZK node as a path in
>the git repository.
>So I guess that for restore purposes I will have to go in the opposite
>direction and create a ZK node for every path entry.
>
>
>
>
>On Tue, Mar 1, 2016 at 11:36 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>
>>
>> I’ve been running SolrCloud clusters in various versions for a few years
>> here, and I can only think of two or three cases that the ZK-stored cluster
>> state was broken in a way that I had to manually intervene by hand-editing
>> the contents of ZK. I think I’ve seen Solr fixes go by for those cases,
>> too. I’ve never completely wiped ZK. (Although granted, my ZK cluster has
>> been pretty stable, and my collection count is smaller than yours)
>>
>> My philosophy is that ZK is the source of cluster configuration, not the
>> collection of core.properties files on the nodes.
>> Currently, cluster state is shared between ZK and core directories. I’d
>> prefer, and I think Solr development is going this way, (SOLR-7269) that
>> all cluster state exist and be managed via ZK, and all state be removed
>> from the local disk of the cluster nodes. The fact that a node uses local
>> disk based configuration to figure out what collections/replicas it has is
>> something that should be fixed, in my opinion.
>>
>> If you’re frequently getting into bad states due to ZK issues, I’d suggest
>> you file bugs against Solr for the fact that you got into the state, and
>> then fix your ZK cluster.
>>
>> Failing that, can you just periodically back up your ZK data and restore
>> it if something breaks? I wrote a little tool to watch clusterstate.json
>> and write every version to a local git repo a few years ago. I was mostly
>> interested because I wanted to see changes that happened pretty fast, but
>> it could also serve as a backup approach. Here’s a link, although I clearly
>> haven’t touched it lately. Feel free to ask if you have issues:
>> https://github.com/randomstatistic/git_zk_monitor
>>
>>
>>
>>
>> On 3/1/16, 12:09 PM, "danny teichthal" <dannyt...@gmail.com> wrote:
>>
>> >Hi,
>> >Just summarizing my questions if the long mail is a little intimidating:
>> >1. Is there a best practice/automated tool for overcoming problems in
>> >cluster state coming from zookeeper disconnections?
>> >2. Creating a collection via core admin is discouraged, is it true also
>> for
>> >core.properties discovery?
>> >
>> >I would like

Re: Separating cores from Solr home

2016-03-03 Thread Jeff Wartes
It’s a bit backwards feeling, but I’ve had luck setting the install dir and 
solr home, instead of the data dir.

Something like:
-Dsolr.solr.home=/data/solr 
-Dsolr.install.dir=/opt/solr


So all of the Solr files are in /opt/solr and all of the index/core-related 
files end up in /data/solr.
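
The same thing can live in solr.in.sh if you'd rather not pass -D flags by hand 
(paths are just examples; bin/solr normally derives solr.install.dir from its own 
location anyway):

  SOLR_HOME=/data/solr
  SOLR_OPTS="$SOLR_OPTS -Dsolr.install.dir=/opt/solr"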




On 3/3/16, 3:58 AM, "Tom Evans"  wrote:

>Hmm, I've worked around this by setting the directory where the
>indexes should live to be the actual solr home, and symlinking the files
>from the current release into that directory, but it feels icky.
>
>Any better ideas?
>
>Cheers
>
>Tom
>
>On Thu, Mar 3, 2016 at 11:12 AM, Tom Evans  wrote:
>> Hi all
>>
>> I'm struggling to configure solr cloud to put the index files and
>> core.properties in the correct places in SolrCloud 5.5. Let me explain
>> what I am trying to achieve:
>>
>> * solr is installed in /opt/solr
>> * the user who runs solr only has read only access to that tree
>> * the solr home files - custom libraries, log4j.properties, solr.in.sh
>> and solr.xml - live in /data/project/solr/releases/, which
>> is then the target of a symlink /data/project/solr/releases/current
>> * releasing a new version of the solr home (eg adding/changing
>> libraries, changing logging options) is done by checking out a fresh
>> copy of the solr home, switching the symlink and restarting solr
>> * the solr core.properties and any data live in /data/project/indexes,
>> so they are preserved when new solr home is released
>>
>> Setting core specific dataDir with absolute paths in solrconfig.xml
>> only gets me part of the way, as the core.properties for each shard is
>> created inside the solr home.
>>
>> This is obviously no good, as when releasing a new version of the solr
>> home, they will no longer be in the current solr home.
>>
>> Cheers
>>
>> Tom


Re: XX:ParGCCardsPerStrideChunk

2016-03-03 Thread Jeff Wartes

I've experimented with that a bit, and Shawn added my comments in IRC to his 
Solr/GC page here: https://wiki.apache.org/solr/ShawnHeisey

The relevant bit:
"With values of 4096 and 32768, the IRC user was able to achieve 15% and 19% 
reductions in average pause time, respectively, with the maximum pause cut 
nearly in half. That option was added to the options you can find in the Solr 
5.x start script, on a 12GB heap."


This was with CMS, obviously. I haven’t repeated the experiment recently, but 
I’m using the option in production at this point.
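
If you want to try it, something like this in solr.in.sh should do it - it's a 
diagnostic option on HotSpot, so it needs unlocking first, and 4096 is just the 
value from the experiment above:

  GC_TUNE="$GC_TUNE \
    -XX:+UnlockDiagnosticVMOptions \
    -XX:ParGCCardsPerStrideChunk=4096"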






On 3/2/16, 9:39 PM, "William Bell"  wrote:

>Has anyone tried -XX:ParGCCardsPerStrideChunk with Solr?
>
>There has been reports of improved GC times.
>
>-- 
>Bill Bell
>billnb...@gmail.com
>cell 720-256-8076


Re: Replicas for same shard not in sync

2016-04-27 Thread Jeff Wartes
I didn’t leave it out, I was asking what it was. I’ve been reading around some 
more this morning though, and here’s what I’ve come up with, feel free to 
correct.

Continuing my scenario:

If you did NOT specify min_rf
5. leader sets leader_initiated_recovery in ZK for the replica with the 
failure. Hopefully that replica notices and re-syncs at some point, because it 
can’t become a leader until it does. (SOLR-5495, SOLR-8034)
6. leader returns success to the client (http://bit.ly/1UhB2cF)

If you specified a min_rf and it WAS achieved:

5. leader sets leader_initiated_recovery in ZK for the replica with the failure.

6. leader returns success (and the achieved rf) to the client (SOLR-5468, 
SOLR-8062)


If you specified a min_rf and it WASN'T achieved:
5. leader does NOT set leader_initiated_recovery (SOLR-8034)
6. leader returns success (and the achieved rf) to the client (SOLR-5468, 
SOLR-8062)

I couldn’t seem to find anyplace that’d cause an error return to the client, 
aside from race conditions around who the leader should be, or if the update 
couldn’t be applied to the leader itself.
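
For what it's worth, exercising min_rf from the client side looks roughly like 
this (collection and doc are made up); the achieved factor comes back as "rf" in 
the response, and checking it is entirely the client's job:

  # Ask that at least 2 replicas acknowledge the update
  curl 'http://solr1:8983/solr/collection1/update?min_rf=2&wt=json' \
    -H 'Content-Type: application/json' \
    -d '[{"id":"doc1"}]'
  # The response is still HTTP 200; look for something like "rf":1,"min_rf":2 in it
  # and have the client re-send if rf came back lower than requested.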






On 4/26/16, 8:22 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>You left out step 5... leader responds with fail for the update to the
>client. At this point, the client is in charge of retrying the docs.
>Retrying will update all the docs that were successfully indexed in
>the failed packet, but that's not unusual.
>
>There's no real rollback semantics that I know of. This is analogous
>to not hitting minRF, see:
>https://support.lucidworks.com/hc/en-us/articles/212834227-How-does-indexing-work-in-SolrCloud.
>In particular the bit about "it is the client's responsibility to
>re-send it"...
>
>There's some retry logic in the code that distributes the updates from
>the leader as well.
>
>Best,
>Erick
>
>On Tue, Apr 26, 2016 at 12:51 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>>
>> At the risk of thread hijacking, this is an area I'm not sure I fully 
>> understand, so I want to make sure.
>>
>> I understand the case where a node is marked “down” in the clusterstate, but 
>> what if it’s down for less than the ZK heartbeat? That’s not unreasonable, 
>> I’ve seen some recommendations for really high ZK timeouts. Let’s assume 
>> there’s some big GC pause, or some other ephemeral service interruption that 
>> recovers very quickly.
>>
>> So,
>> 1. leader gets an update request
>> 2. leader makes update requests to all live nodes
>> 3. leader gets success responses from all but one replica
>> 4. leader gets failure response from one replica
>>
>> At this point we have different replicas with different data sets. Does 
>> anything signal that the failure-response node has now diverged? Does the 
>> leader attempt to roll back the other replicas? I’ve seen references to 
>> leader-initiated-recovery, is this that?
>>
>> And regardless, is the update request considered a success (and reported as 
>> such to the client) by the leader?
>>
>>
>>
>> On 4/25/16, 12:14 PM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>
>>>Ted:
>>>Yes, deleting and re-adding the replica will be fine.
>>>
>>>Having commits happen from the client when you _also_ have
>>>autocommits that frequently (10 seconds and 1 second are pretty
>>>aggressive BTW) is usually not recommended or necessary.
>>>
>>>David:
>>>
>>>bq: if one or more replicas are down, updates presented to the leader
>>>still succeed, right?  If so, tedsolr is correct that the Solr client
>>>app needs to re-issue update
>>>
>>>Absolutely not the case. When the replicas are down, they're marked as
>>>down by Zookeeper. When then come back up they find the leader through
>>>Zookeeper magic and ask, essentially "Did I miss any updates"? If the
>>>replica did miss any updates it gets them from the leader either
>>>through the leader replaying the updates from its transaction log to
>>>the replica or by replicating the entire index from the leader. Which
>>>path is followed is a function of how far behind the replica is.
>>>
>>>In this latter case, any updates that come in to the leader while the
>>>replication is happening are buffered and replayed on top of the index
>>>when the full replication finishes.
>>>
>>>The net-net here is that you should not have to track whether updates
>>>got to all the replicas or not. One of the major advantages of
>>>SolrCloud is to remove that worry from the indexing client...
>>

Re: Solr 5.2.1 on Java 8 GC

2016-04-28 Thread Jeff Wartes

Shawn Heisey’s page is the usual reference guide for GC settings: 
https://wiki.apache.org/solr/ShawnHeisey
Most of the learnings from that are in the Solr 5.x startup scripts already, 
but your heap is bigger, so your mileage may vary.

Some tools I’ve used while doing GC tuning:

* VisualVM - Comes with the jdk. It has a Visual GC plug-in that’s pretty nice 
for visualizing what’s going on in realtime, but you need to connect it via 
jstatd for that to work.
* GCViewer - Visualizes a GC log (flags for producing one are sketched after this 
list). The UI leaves a lot to be desired, but it's the best tool I've found for 
this purpose. Use this fork for jdk 6+ - https://github.com/chewiebug/GCViewer
* Swiss Java Knife has a bunch of useful features - 
https://github.com/aragozin/jvm-tools
* YourKit - I’ve been using this lately to analyze where garbage comes from. 
It’s not free though. 
* Eclipse Memory Analyzer - I used this to analyze heap dumps before I got a 
YourKit license: http://www.eclipse.org/mat/

Good luck!






On 4/28/16, 9:27 AM, "Yonik Seeley"  wrote:

>On Thu, Apr 28, 2016 at 12:21 PM, Nick Vasilyev
> wrote:
>> Hi Yonik,
>>
>> There are a lot of logistics involved with re-indexing and naturally
>> upgrading Solr. I was hoping that there is an easier alternative since this
>> is only a single back end script that is having problems.
>>
>> Is there any room for improvement with tweaking GC params?
>
>There always is ;-)  But I'm not a GC tuning expert.  I prefer to
>attack memory problems more head-on (i.e. with code to use less
>memory).
>
>-Yonik

