Optimizing integer primary key lookup speed: optimal FieldType and Codec?

2019-06-17 Thread Gregg Donovan
Hello! We (Etsy) would like to optimize primary key lookup speed. Our
primary key is a 32-bit integer, and we are wondering what the
state-of-the-art is for FieldType and Codec these days for maximizing the
throughput of 32-bit ID lookups.


Context:
Specifically, we're looking to optimize the loading loop of
ExternalFileField.
We are developing a specialized binary file version of the EFF that is
optimized for 32-bit int primary keys and their scores. Our motivation is
saving on storage, bandwidth, etc. via specializing the code for our
use-case -- we are heavy EFF users.

In pseudo-code, the inner EFF loading loop is:

for each (primary_key, score) pair in the external file:
    termsEnum.seekExact(primary_key)
    doc_id = postingsEnum.nextDoc()
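
In actual Lucene API terms, the shape we have in mind is roughly the
following (a hedged sketch, not our real code: the "id" field name, the
int-to-BytesRef key encoding, and the key/score map are stand-ins):

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class EffStyleLoader {
  // Looks up each primary key's docid and records its score into scores[docid].
  void load(LeafReader reader, Map<BytesRef, Float> keyToScore, float[] scores)
      throws IOException {
    Terms terms = reader.terms("id");          // "id" = our unique key field
    if (terms == null) return;
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postings = null;
    for (Map.Entry<BytesRef, Float> e : keyToScore.entrySet()) {
      if (termsEnum.seekExact(e.getKey())) {   // one seek per primary key
        postings = termsEnum.postings(postings, PostingsEnum.NONE);
        int docId = postings.nextDoc();        // unique key => at most one doc
        scores[docId] = e.getValue();
      }
    }
  }
}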


Re: Codecs:
Is anything special needed to make ID lookups faster now that "pulsing" has
been incorporated into the default codec?
What about using IDVersionPostingsFormat?
Is that likely to be faster? Or is it the wrong choice if we don't need the
version support?


FieldType:
I see that EFFs do not currently support the new Points-based int fields,
but this does not appear to be due to any inherent limitation in the Points
field. At least, that's what I infer from the JIRA. Are the Point fields
the right choice for fast 32-bit int ID lookups?

Thanks!

Gregg


1969 vs 1960s: not-quite-synonyms in Solr

2019-03-06 Thread Gregg Donovan
For a search like "1969 shirt" I would like to return items with either
1969 or 1960s but boost 1969 items higher. For the query "1960s shirt",
1960s and 1960, 1961, ... 1969 should all match equally.

Is there a standard technique for this? I'm struggling to do this with
eDisMax without adding new fields to the index.

Thanks.

Gregg


Re: Query of Death Lucene/Solr 7.6

2019-02-22 Thread Gregg Donovan
FWIW: we have also seen serious Query of Death issues after our upgrade to
Solr 7.6. Are there any open issues we can watch? Are Markus's findings
around `pf` our best guess? We've seen these issues even with ps=0. We also
use the WDF.

On Fri, Feb 22, 2019 at 8:58 AM Markus Jelsma 
wrote:

> Hello Michael,
>
> Sorry it took so long to get back to this, too many things to do.
>
> Anyway, yes, we have WDF on our query-time analysers. I uploaded two log
> files, both the same query of death with and without synonym filter enabled.
>
> https://mail.openindex.io/export/solr-8983-console.log 23 MB
> https://mail.openindex.io/export/solr-8983-console-without-syns.log 1.9 MB
>
> Without the synonym we still see a huge number of entries. Many different
> parts of our analyser chain contribute to the expansion of queries, but pf
> itself really turns the problem on or off.
>
> Since SOLR-12243 is new in 7.6, does anyone know whether SOLR-12243 could
> have this side-effect?
>
> Thanks,
> Markus
>
>
> -Original message-
> > From:Michael Gibney 
> > Sent: Friday 8th February 2019 17:19
> > To: solr-user@lucene.apache.org
> > Subject: Re: Query of Death Lucene/Solr 7.6
> >
> > Hi Markus,
> > As of 7.6, LUCENE-8531 <https://issues.apache.org/jira/browse/LUCENE-8531>
> > reverted a graph/Spans-based phrase query implementation (introduced in 6.5
> > -- LUCENE-7699) to an implementation that builds a separate phrase query
> > for each possible enumerated path through the graph described by a parsed
> > query.
> > The potential for combinatoric explosion of the enumerated approach was (as
> > far as I can tell) one of the main motivations for introducing the
> > Spans-based implementation. Some real-world use cases would be good to
> > explore. Markus, could you send (as an attachment) the debug toString() for
> > the queries with/without synonyms enabled? I'm also guessing you may have
> > WordDelimiterGraphFilter on the query analyzer?
> > As an alternative to disabling pf, LUCENE-8531 only reverts to the
> > enumerated approach for phrase queries where slop>0, so setting ps=0 would
> > probably also help.
> > Michael
> >
> > On Fri, Feb 8, 2019 at 5:57 AM Markus Jelsma  >
> > wrote:
> >
> > > Hello (apologies for cross-posting),
> > >
> > > While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the
> > > remaining four, we stumbled upon a situation where the 7.6 nodes quickly
> > > succumb when a 'Query-of-Death' is issued, 7.2.1 up to 7.5 are all
> > > unaffected (tested and confirmed).
> > >
> > > Following Smiley's suggestion i used Eclipse MAT to find the problem in
> > > the heap dump i obtained, this fantastic tool revealed within minutes that
> > > a query thread ate 65 % of all resources, in the class variables i could
> > > find the query, and reproduce the problem.
> > >
> > > The problematic query is 'dubbele dijk/rijke dijkproject in het dijktracé
> > > eemshaven-delfzijl', on 7.6 this input produces a 40+ MB toString() output
> > > in edismax' newFieldQuery. If the node survives it takes 2+ seconds for the
> > > query to run (150 ms otherwise). If i disable all query time
> > > SynonymGraphFilters it still takes a second and produces just a 9 MB
> > > toString() for the query.
> > >
> > > I could not find anything like this in Jira. I did think of LUCENE-8479
> > > and LUCENE-8531 but they were about graphs, this problem looked related
> > > though.
> > >
> > > I think i tracked it further down to LUCENE-8589 or SOLR-12243. When i
> > > leave Solr's edismax' pf parameter empty, everything runs fast. When all
> > > fields are configured for pf, the node dies.
> > >
> > > I am now unsure whether this is a Solr or a Lucene issue.
> > >
> > > Please let me know.
> > >
> > > Many thanks,
> > > Markus
> > >
> > > ps. in Solr i even got an 'Impossible Exception', my first!
> > >
> >
>


Compression for solrbin?

2015-11-13 Thread Gregg Donovan
We've had success with LZ4 compression in a custom ShardHandler to reduce
network overhead, getting ~25% compression with low CPU impact. LZ4 or
Snappy seem like reasonable choices[1] for minimizing compression +
transfer + decompression time in the data center.

Would it make sense to integrate compression into javabin itself? For the
ShardHandler and transaction log javabin usage it seems to make sense. We
could flip on gzip in Jetty for HTTP, but GZIP may add more CPU than is
desirable and wouldn't help with the transaction log.

If we did, it seems incrementing the javabin version[2] and
compressing/decompressing inside of JavaBinCodec#marshal[3] and
JavaBinCodec#unmarshal[4] would allow us to retain backwards compatibility
with older clients or existing files.
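
For illustration, the layering could look something like this if done
outside the codec (a sketch that assumes the lz4-java block streams; it
ignores the version-byte negotiation the real change would need):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import net.jpountz.lz4.LZ4BlockInputStream;
import net.jpountz.lz4.LZ4BlockOutputStream;
import org.apache.solr.common.util.JavaBinCodec;

public class CompressedJavabin {
  // javabin bytes pass through an LZ4 block stream on the way out...
  public static byte[] marshalCompressed(Object payload) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (LZ4BlockOutputStream lz4 = new LZ4BlockOutputStream(bytes)) {
      new JavaBinCodec().marshal(payload, lz4);
    }
    return bytes.toByteArray();
  }

  // ...and get decompressed transparently on the way back in.
  public static Object unmarshalCompressed(byte[] data) throws IOException {
    try (LZ4BlockInputStream lz4 =
             new LZ4BlockInputStream(new ByteArrayInputStream(data))) {
      return new JavaBinCodec().unmarshal(lz4);
    }
  }
}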

Thoughts?

--Gregg

[1] http://cyan4973.github.io/lz4/#tab-2
[2]
https://github.com/apache/lucene-solr/blob/trunk/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L83
[3]
https://github.com/apache/lucene-solr/blob/trunk/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L112:L120
[4]
https://github.com/apache/lucene-solr/blob/trunk/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L129:L137


ShardHandler semantics

2015-04-02 Thread Gregg Donovan
We're starting work on adding backup requests
http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
to the ShardHandler. Roughly something like:

1. Send requests to 100 shards.
2. Wait for results from 75 to come back.
3. Wait for either a) the other 25 to come back or b) 20% more time to
elapse
4. If any shards have still not returned results, send a second request to
a different server for each of the missing shards (roughly sketched in the
code below).
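
Very roughly, the timing in steps 2-4 could look like this (generic Java
with no Solr types -- the completion service wiring and the resubmit hook
are assumed to exist elsewhere):

import java.util.concurrent.CompletionService;
import java.util.concurrent.TimeUnit;

class BackupRequestPolicy<ShardResult> {
  // Wait for `threshold` shards, then up to 20% more wall-clock time, then
  // fire backup requests for whatever is still missing.
  void await(CompletionService<ShardResult> completion, int totalShards,
             int threshold, Runnable resubmitMissing) throws InterruptedException {
    long start = System.nanoTime();
    int received = 0;
    while (received < threshold) {
      completion.take();                        // step 2: block for the first 75
      received++;
    }
    long budget = (long) (0.2 * (System.nanoTime() - start));
    long deadline = System.nanoTime() + budget; // step 3b: 20% more time
    while (received < totalShards) {
      long remaining = deadline - System.nanoTime();
      if (remaining <= 0
          || completion.poll(remaining, TimeUnit.NANOSECONDS) == null) {
        break;
      }
      received++;
    }
    if (received < totalShards) {
      resubmitMissing.run();                    // step 4: backup requests
    }
  }
}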

I want to be sure I understand the ShardHandler contract correctly before
getting started. My understanding is:

--ShardHandler#take methods
https://github.com/apache/lucene-solr/blob/dff38c2051ba26f928687139218bbc43e9004ebe/solr/core/src/java/org/apache/solr/handler/component/ShardHandler.java#L25:L26
can be called with different ShardRequests having been submitted:
https://github.com/apache/lucene-solr/blob/dff38c2051ba26f928687139218bbc43e9004ebe/solr/core/src/java/org/apache/solr/handler/component/ShardHandler.java#L24
--ShardHandler#takeXXX is then called in a loop, returning a ShardResponse
from the last shard returning for a given ShardRequest.
--When ShardHandler#takeXXX returns null, the SearchHandler
https://github.com/apache/lucene-solr/blob/dff38c2051ba26f928687139218bbc43e9004ebe/solr/core/src/java/org/apache/solr/handler/component/SearchHandler.java#L277:L367
proceeds:
https://github.com/apache/lucene-solr/blob/dff38c2051ba26f928687139218bbc43e9004ebe/solr/core/src/java/org/apache/solr/handler/component/SearchHandler.java#L333

For example, the flow could look like:

shardHandler.submit(slowGroupingRequest, shard1, groupingParams);
shardHandler.submit(slowGroupingRequest, shard2, groupingParams);
shardHandler.submit(fastFacetRefinementRequest, shard1, facetParams);
shardHandler.submit(fastFacetRefinementRequest, shard2, facetParams);
shardHandler.takeCompletedOrError(); // returns fastFacetRefinementRequest with responses
shardHandler.takeCompletedOrError(); // returns slowGroupingRequest with responses
shardHandler.takeCompletedOrError(); // returns null, SearchHandler exits take loop

Does that seem like a correct understanding of the
SearchHandler-ShardHandler interaction?

If so, it seems that to make backup requests work we'd need to fanout
individual ShardRequests independently, each with its own completion
service and pending queue. Does that sound right?

Thanks!

--Gregg


Re: How To Interrupt Solr Query Execution

2015-03-20 Thread Gregg Donovan
SOLR-5986 looks like a great enhancement for enforcing timeouts. I'm
curious about how to handle *manual* cancellation.

We're working on backup requests -- e.g. wait till 90% of shards have
responded then send out a backup request for the lagging (e.g. GC, cache
miss, overloaded, etc.) shards after a configurable delay -- and would like
to be able to cancel the slower to respond shard. From what I can see in
SOLR-5986, I'm not sure that Solr will check the interrupt status of a
thread if it is manually interrupted. Am I reading that right? If so, would
it make sense to add a check for the interrupt status in
SolrQueryTimeoutImpl#shouldExit()[1]?


[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/SolrQueryTimeoutImpl.java#L57:L63
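
In outline, I mean something like this (a hedged sketch, not the real
SolrQueryTimeoutImpl -- the ThreadLocal deadline is a stand-in for whatever
it actually stores per request):

class InterruptAwareQueryTimeout {
  private static final ThreadLocal<Long> timeoutAtNanos = new ThreadLocal<>();

  public boolean shouldExit() {
    if (Thread.currentThread().isInterrupted()) {
      return true;  // query was cancelled externally, e.g. a backup request won
    }
    Long deadline = timeoutAtNanos.get();
    return deadline != null && System.nanoTime() > deadline;
  }
}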

On Fri, Dec 19, 2014 at 10:15 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:

 Also note SOLR-5986 which will help in such cases when queries are stuck
 iterating through terms. This will be released with Solr 5.0

 On Fri, Dec 19, 2014 at 9:14 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:
 
  Hello,
  Note, that timeout is checked only during the search. But for example, it
  isn't checked during facet counting. Check debugQuery=true output, to
  understand how the processing time is distributed across components.
 
  On Fri, Dec 19, 2014 at 12:05 PM, Vishnu Mishra vdil...@gmail.com
 wrote:
  
   Hi, I am using solr 4.9 for searching over 90 million+ documents. My Solr
   is running on tomcat server and I am querying Solr from an application. I
   have a problem with long-running queries against Solr. Although I have set
   timeAllowed to 4ms, but it seems that solr still running this query
   until it processed fully. I read some articles where it is written that

   *Internally, Solr does nothing to time out any requests -- it lets both
   updates and queries take however long they need to take to be processed
   fully.*

   Is this what Solr does? If so, is there a configuration option to change
   this behavior? or can I interrupt solr query execution.
  
  
  
    --
    View this message in context:
    http://lucene.472066.n3.nabble.com/How-To-Interrupt-Solr-Query-Execution-tp4175190.html
    Sent from the Solr - User mailing list archive at Nabble.com.
  
 
 
  --
  Sincerely yours
  Mikhail Khludnev
  Principal Engineer,
  Grid Dynamics
 
  http://www.griddynamics.com
  mkhlud...@griddynamics.com
 


 --
 Regards,
 Shalin Shekhar Mangar.



Re: Enforcing a hard timeout on shard requests?

2014-06-02 Thread Gregg Donovan
On our search pages we have a main request where we really want to give the
correct answer, but we also have a number of other child searches performed
on that page where we're happy to get 90% of the way there and be able to
enforce an SLA.

Right now, when the main search finishes we have to completely cancel the
other searches if they haven't returned. We'd prefer to give them a max
timeAllowed and take whatever results are back. Without any timeouts, we
would end up waiting for the slowest shard of all of the requests before
returning results to the user.


On Fri, May 30, 2014 at 6:09 PM, Jason Hellman 
jhell...@innoventsolutions.com wrote:

 Gregg,

 I don’t have an answer to your question but I’m very curious what use case
 you have that permits such arbitrary partial-results.  Is it just an edge
 case or do you want to permit a common occurrence?

 Jason

 On May 30, 2014, at 3:05 PM, Gregg Donovan gregg...@gmail.com wrote:

  I'd like to add a hard timeout on some of my sharded requests. E.g.: for
  about 30% of the requests, I want to wait no longer than 120ms before a
  response comes back, but aggregating results from as many shards as
  possible in that 120ms.

  My first attempt was to use timeAllowed=120&shards.tolerant=true. This sort
  of works, in that I'll see partial results occasionally, but slow shards
  will still take much longer than my timeout to return, sometimes up to
  700ms. I imagine if the CPU is busy or the node is GC-ing that it won't be
  able to enforce the timeAllowed and return.

  Is there a way to enforce this timeout without failing the request
  entirely? I'd still like to get as many shards to return in 120ms as I can,
  even if they have partialResults.
 
  Thanks.
 
  --Gregg




Enforcing a hard timeout on shard requests?

2014-05-30 Thread Gregg Donovan
I'd like to add a hard timeout on some of my sharded requests. E.g.: for
about 30% of the requests, I want to wait no longer than 120ms before a
response comes back, but aggregating results from as many shards as
possible in that 120ms.

My first attempt was to use timeAllowed=120&shards.tolerant=true. This sort
of works, in that I'll see partial results occasionally, but slow shards
will still take much longer than my timeout to return, sometimes up to
700ms. I imagine if the CPU is busy or the node is GC-ing that it won't be
able to enforce the timeAllowed and return.
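
For reference, the request shape I'm describing is just this (SolrJ sketch;
the client construction and query string are assumptions):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;

class TolerantTimeoutQuery {
  QueryResponse run(SolrClient client) throws SolrServerException, IOException {
    SolrQuery q = new SolrQuery("*:*");
    q.set("timeAllowed", 120);       // ms; a soft budget, not a hard cutoff
    q.set("shards.tolerant", true);  // keep whatever shards did respond
    QueryResponse rsp = client.query(q);
    // the response header carries partialResults=true when shards were cut short
    System.out.println(rsp.getHeader().get("partialResults"));
    return rsp;
  }
}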

Is there a way to enforce this timeout without failing the request
entirely? I'd still like to get as many shards to return in 120ms as I can,
even if they have partialResults.

Thanks.

--Gregg


How to optimize a DisMax of multiple cachable queries?

2014-04-25 Thread Gregg Donovan
Some of our site's categories are actually search driven. They're created
manually by crafting a Solr query out of a list of Lucene queries that are
joined in a DisjunctionMaxQuery. There are often 100+ disjuncts in this
query.

This works nicely, but is much slower than it could be because many parts
of the disjuncts could be cached -- both filters and queries -- but none
are because only the top-level DisjunctionMaxQuery is fed to the
SolrIndexSearcher.

We're looking for ways to speed this up. One way we've considered is
sending each of the disjuncts to the SolrIndexSearcher and then merging
the DocLists manually. My concern is that this would lose the query
normalization that happens in DisjunctionMaxQuery.

This seems like a common problem: how to cache parts of a complex Solr
query individually. Any ideas or common patterns for solving it?

Thanks.

--Gregg

Gregg Donovan
Senior Software Engineer, Etsy.com
gregg...@gmail.com


Re: Estimating RAM usage of SolrCache instances?

2014-04-16 Thread Gregg Donovan
Ideally we could get good approximations for all of them, including any of
our custom caches (of which we have about five). The RAM size estimator
spreadsheet [1] is helpful but we'd love to get accurate live size metrics.

[1]
https://github.com/apache/lucene-solr/blob/trunk/dev-tools/size-estimator-lucene-solr.xls
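
To make the per-entry estimates quoted below concrete with made-up numbers:
for an index with maxDoc = 20,000,000, a single filterCache entry is on the
order of 20,000,000 / 8 = 2,500,000 bytes (~2.4MB), so a 512-entry
filterCache alone would be roughly 1.2-1.3GB before counting the filter
query keys themselves.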


 On Mon, Apr 14, 2014 at 10:56 PM, Erick Erickson erickerick...@gmail.com wrote:

 _which_ solrCache objects? filterCache? result cache? documentcache?

 result cache is about average size of a query + (window size *
 sizeof(int)) for each entry.
 filter cache is about average size of a filter query + maxdoc/8
 document cache is about average size of the stored fields in bytes *
 size.

 HTH,
 Erick

 On Mon, Apr 14, 2014 at 5:17 PM, Gregg Donovan gregg...@gmail.com wrote:
  We'd like to graph the approximate RAM size of our SolrCache instances.
 Our
  first attempt at doing this was to use the Lucene RamUsageEstimator [1].
 
  Unfortunately, this appears to give a bogus result. Every instance of
  FastLRUCache was judged to have the same exact size, down to the byte. I
  assume this is due to an issue with how the variably-sized backing maps
  were calculated, but I'm not sure.
 
  Any ideas for how to get an accurate RAM estimation for SolrCache
 objects?
 
  --Gregg
 
  [1] https://gist.github.com/greggdonovan/10682810



Estimating RAM usage of SolrCache instances?

2014-04-14 Thread Gregg Donovan
We'd like to graph the approximate RAM size of our SolrCache instances. Our
first attempt at doing this was to use the Lucene RamUsageEstimator [1].

Unfortunately, this appears to give a bogus result. Every instance of
FastLRUCache was judged to have the same exact size, down to the byte. I
assume this is due to an issue with how the variably-sized backing maps
were calculated, but I'm not sure.

Any ideas for how to get an accurate RAM estimation for SolrCache objects?

--Gregg

[1] https://gist.github.com/greggdonovan/10682810


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Gregg Donovan
That was my first attempt, but it's much trickier than I anticipated.

A filter that calls HttpServletRequest#getParameter() before
SolrDispatchFilter will trigger an exception  -- see
getParameterIncompatibilityException [1] -- if the request is a POST. It
seems that Solr depends on the configured per-core SolrRequestParser to
properly parse the request parameters. A servlet filter that came before
SolrDispatchFilter would need to fetch the correct SolrRequestParser for
the requested core, parse the request, and reset the InputStream before
pulling the data into the MDC. It also duplicates the work of request
parsing. It's especially tricky if you want to remove the tracing
parameters from the SolrParams and just have them in the MDC to avoid them
being logged twice.


[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628


On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch arafa...@gmail.com wrote:

 On the second thought,

 If you are already managing to pass the value using the request
 parameters, what stops you from just having a servlet filter looking
 for that parameter and assigning it directly to the MDC context?

 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:
  I like the idea. No comments about implementation, leave it to others.
 
  But if it is done, maybe somebody very familiar with logging can also
  review Solr's current logging config. I suspect it is not optimized
  for troubleshooting at this point.
 
  Regards,
 Alex.
  Personal website: http://www.outerthoughts.com/
  Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency
 
 
  On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan gregg...@gmail.com
 wrote:
  We have some metadata -- e.g. a request UUID -- that we log to every log
  line using Log4J's MDC [1]. The UUID logging allows us to connect any log
  lines we have for a given request across servers. Sort of like Zipkin [2].

  Currently we're using EmbeddedSolrServer without sharding, so adding the
  UUID is fairly simple, since everything is in one process and one thread.
  But, we're testing a sharded HTTP implementation and running into some
  difficulties getting this data passed around in a way that lets us trace
  all log lines generated by a request to its UUID.
 



Re: Fetching uniqueKey and other int quickly from documentCache?

2014-04-07 Thread Gregg Donovan
Yonik,

Requesting
fl=unique_key:field(unique_key),secondary_key:field(secondary_key),score vs
fl=unique_key,secondary_key,score was a nice performance win, as unique_key
and secondary_key were both already in the fieldCache. We removed our
documentCache, in fact, as it got so little use.
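
In SolrJ terms the change was essentially just the field list (a sketch;
the field names are ours and the query string is a placeholder):

import org.apache.solr.client.solrj.SolrQuery;

class PseudoFieldFl {
  // Alias field() function values so both keys are served from the fieldCache
  // rather than decoded from stored fields.
  static SolrQuery keysOnly(String queryString) {
    SolrQuery q = new SolrQuery(queryString);
    q.setFields("unique_key:field(unique_key)",
                "secondary_key:field(secondary_key)",
                "score");
    return q;
  }
}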

We do see a code path that fetches stored fields, though, in
BinaryResponseWriter, for the case of *only* pseudo-fields being requested.
I opened a ticket and attached a patch to
https://issues.apache.org/jira/browse/SOLR-5968.




On Mon, Mar 3, 2014 at 11:30 AM, Yonik Seeley yo...@heliosearch.com wrote:

 On Mon, Mar 3, 2014 at 11:14 AM, Gregg Donovan gregg...@gmail.com wrote:
  Yonik,
 
  That's a very clever idea. Unfortunately, I think that will skip the
  distributed query optimization we were hoping to take advantage of in
  SOLR-1880 [1], but it should work with the proposed distrib.singlePass
  optimization in SOLR-5768 [2]. Does that sound right?


 Yep, the two together should do the trick.

 -Yonik
 http://heliosearch.org - native off-heap filters and fieldcache for solr


  --Gregg
 
  [1] https://issues.apache.org/jira/browse/SOLR-1880
  [2] https://issues.apache.org/jira/browse/SOLR-5768
 
 
  On Wed, Feb 26, 2014 at 8:53 PM, Yonik Seeley yo...@heliosearch.com
 wrote:
 
  You could try forcing things to go through function queries (via
  pseudo-fields):
 
  fl=field(id), field(myfield)
 
  If you're not requesting any stored fields, that *might* currently
  skip that step.
 
  -Yonik
  http://heliosearch.org - native off-heap filters and fieldcache for
 solr
 
 
  On Mon, Feb 24, 2014 at 9:58 PM, Gregg Donovan gregg...@gmail.com
 wrote:
    We fetch a large number of documents -- 1000+ -- for each search. Each
    request fetches only the uniqueKey or the uniqueKey plus one secondary
    integer key. Despite this, we find that we spent a sizable amount of time
    in SolrIndexSearcher#doc(int docId, Set<String> fields). Time is spent
    fetching the two stored fields, LZ4 decoding, etc.

    I would love to be able to tell Solr to always fetch these two fields from
    memory. We have them both in the fieldCache so we're already spending the
    RAM. I've seen this asked previously [1], so it seems like a fairly common
    need, especially for distributed search. Any ideas?

    A few possible ideas I had:

    --Check FieldCache.html#getCacheEntries() before going to stored fields.
    --Give the documentCache config a list of fields it should load from the
    fieldCache

    Having an in-memory mapping from docId->uniqueKey has come up for us
    before. We've used a custom SolrCache maintaining that mapping to quickly
    filter over personalized collections. Maybe the uniqueKey should be more
    optimized out of the box? Perhaps a custom uniqueKey codec that also
    maintained the docId->uniqueKey mapping in memory?

    --Gregg

    [1] http://search-lucene.com/m/oCUKJ1heHUU1
 



Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Gregg Donovan
Michael,

Thanks! Unfortunately, as we use POSTs, that approach would trigger the
getParameterIncompatibilityException call due to the Enumeration of
getParameterNames before SolrDispatchFilter has a chance to access the
InputStream.

I opened https://issues.apache.org/jira/browse/SOLR-5969 to discuss further
and attached our current patch.


On Mon, Apr 7, 2014 at 2:02 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 I had to grapple with something like this problem when I wrote Lux's
 app-server.  I extended SolrDispatchFilter and handle parameter swizzling
 to keep everything nicey-nicey for Solr while being able to play games with
 parameters of my own.  Perhaps this will give you some ideas:

 https://github.com/msokolov/lux/blob/master/src/main/java/
 lux/solr/LuxDispatchFilter.java

 It's definitely hackish, but seems to get the job done - for me - it's not
 a reusable component, but might serve as an illustration of one way to
 handle the problem

 -Mike


 On 04/07/2014 12:23 PM, Gregg Donovan wrote:

 That was my first attempt, but it's much trickier than I anticipated.

 A filter that calls HttpServletRequest#getParameter() before
 SolrDispatchFilter will trigger an exception  -- see
 getParameterIncompatibilityException [1] -- if the request is a POST. It
 seems that Solr depends on the configured per-core SolrRequestParser to
 properly parse the request parameters. A servlet filter that came before
 SolrDispatchFilter would need to fetch the correct SolrRequestParser for
 the requested core, parse the request, and reset the InputStream before
 pulling the data into the MDC. It also duplicates the work of request
 parsing. It's especially tricky if you want to remove the tracing
 parameters from the SolrParams and just have them in the MDC to avoid them
 being logged twice.


 [1]
 https://github.com/apache/lucene-solr/blob/trunk/solr/
 core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628


 On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch arafa...@gmail.com
 wrote:

  On the second thought,

 If you are already managing to pass the value using the request
 parameters, what stops you from just having a servlet filter looking
 for that parameter and assigning it directly to the MDC context?

 Regards,
 Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
 arafa...@gmail.com wrote:

 I like the idea. No comments about implementation, leave it to others.

 But if it is done, maybe somebody very familiar with logging can also
 review Solr's current logging config. I suspect it is not optimized
 for troubleshooting at this point.

 Regards,
 Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr

 proficiency


 On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan gregg...@gmail.com

 wrote:

 We have some metadata -- e.g. a request UUID -- that we log to every log
 line using Log4J's MDC [1]. The UUID logging allows us to connect any

 log

 lines we have for a given request across servers. Sort of like Zipkin

 [2].

 Currently we're using EmbeddedSolrServer without sharding, so adding the
 UUID is fairly simple, since everything is in one process and one

 thread.

 But, we're testing a sharded HTTP implementation and running into some
 difficulties getting this data passed around in a way that lets us
 trace
 all log lines generated by a request to its UUID.





Distributed tracing for Solr via adding HTTP headers?

2014-04-04 Thread Gregg Donovan
We have some metadata -- e.g. a request UUID -- that we log to every log
line using Log4J's MDC [1]. The UUID logging allows us to connect any log
lines we have for a given request across servers. Sort of like Zipkin [2].

Currently we're using EmbeddedSolrServer without sharding, so adding the
UUID is fairly simple, since everything is in one process and one thread.
But, we're testing a sharded HTTP implementation and running into some
difficulties getting this data passed around in a way that lets us trace
all log lines generated by a request to its UUID.

The first thing I tried was to add the UUID by adding it to the SolrParams.
This achieves the goal of getting those values logged on the shards if a
request is successful, but we miss having those values in the MDC if there
are other log lines before the final log line. E.g. an Exception in a
custom component.

My current thought is that sending HTTP headers with diagnostic information
would be very useful. Those could be placed in the MDC even before handing
off to work to SolrDispatchFilter, so that any Solr problem will have the
proper logging.

I.e. every additional header added to a Solr request gets a Solr- prefix.
On the server, we look for those headers and add them to the SLF4J MDC[3].
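
As a sketch of the server half (hedged: this shows the shape of the idea,
not the attached patch; the prefix handling and MDC key names are
illustrative):

import java.io.IOException;
import java.util.Enumeration;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

public class HeaderToMdcFilter implements Filter {
  @Override public void init(FilterConfig config) {}
  @Override public void destroy() {}

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest http = (HttpServletRequest) req;
    Enumeration<String> names = http.getHeaderNames();
    try {
      while (names.hasMoreElements()) {
        String name = names.nextElement();
        if (name.regionMatches(true, 0, "Solr-", 0, 5)) {
          // reading headers never consumes the body, unlike getParameter()
          MDC.put(name.substring(5), http.getHeader(name));
        }
      }
      chain.doFilter(req, res);
    } finally {
      MDC.clear();  // don't leak request-scoped context across pooled threads
    }
  }
}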

Here's a patch [4] that does this that we're testing out. Is this a good
idea? Would anyone else find this useful? If so, I'll open a ticket.

--Gregg

[1] http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/MDC.html
[2] http://twitter.github.io/zipkin/
[3] http://www.slf4j.org/api/org/slf4j/MDC.html
[4] https://gist.github.com/greggdonovan/9982327


Re: SolrCloud: heartbeat succeeding while node has failing SSD?

2014-03-03 Thread Gregg Donovan
Thanks, Mark!

The supervised process sounds very promising but complicated to get right.
E.g. where does the supervisor run, where do nodes report their status to,
are the checks active or passive, etc.

Having each node perform a regular background self-check and remove itself
from the cluster if that healthcheck doesn't pass seems like a great first
step, though. The most common failure we've seen has been disk failure and
a self-check should usually detect that. (JIRA:
https://issues.apache.org/jira/browse/SOLR-5805)
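
In its simplest form I'm picturing something like this (a generic sketch,
no Solr or ZooKeeper APIs; the "remove myself from the cluster" action is
left to the caller):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class DiskSelfCheck {
  // Periodically do a tiny write+read under the data dir; on failure, run the
  // supplied action (e.g. close the ZK connection so the node drops out).
  static ScheduledExecutorService start(Path dataDir, Runnable removeSelfFromCluster) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      Path probe = dataDir.resolve("healthcheck.probe");
      try {
        Files.write(probe, new byte[] {1});
        Files.readAllBytes(probe);
        Files.deleteIfExists(probe);
      } catch (IOException e) {
        removeSelfFromCluster.run();
      }
    }, 1, 1, TimeUnit.MINUTES);
    return scheduler;
  }
}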

It would also be nice, as a cluster operator, to have an easy way to remove
a failing node from the cluster. Ideally, right from the Solr UI, but even
from a command-line script would be great. In the cases of disk failure, we
can often not SSH into a node to shut down the VM that's still connected to
ZooKeeper. We have to physically power it down. Having something quicker
would be great. (JIRA: https://issues.apache.org/jira/browse/SOLR-5806)




On Sun, Mar 2, 2014 at 9:36 PM, Mark Miller markrmil...@gmail.com wrote:

 The heartbeat that keeps the node alive is the connection it maintains
 with ZooKeeper.

  We don't currently have anything built in that will actively make sure
  each node can serve queries and remove it from clusterstate.json if it
  cannot. If a replica is maintaining its connection with ZooKeeper and, in
  most cases, if it is accepting updates, it will appear up. Load balancing
  should handle the failures, but I guess it depends on how sticky the
  request failures are.

 In the past, I've seen this handled on a different search engine by having
 a variety of external agent scripts that would occasionally attempt to do a
 query, and if things did not go right, it killed the process to cause it to
 try and startup again (supervised process).

 I'm not sure what the right long term feature for Solr is here, but feel
 free to start a JIRA issue around it.

 One simple improvement might even be a background thread that periodically
 checks some local readings and depending on the results, pulls itself out
 of the mix as best it can (remove itself from clusterstate.json or simply
 closes it's zk conneciton).

 - Mark

 http://about.me/markrmiller

 On Mar 2, 2014, at 3:42 PM, Gregg Donovan gregg...@gmail.com wrote:

  We had a brief SolrCloud outage this weekend when a node's SSD began to
  fail but the node still appeared to be up to the rest of the SolrCloud
  cluster (i.e. still green in clusterstate.json). Distributed queries that
  reached this node would fail but whatever heartbeat keeps the node in the
  clusterstate.json must have continued to succeed.
 
  We eventually had to power the node down to get it to be removed from
  clusterstate.json.
 
  This is our first foray into SolrCloud, so I'm still somewhat fuzzy on
 what
  the default heartbeat mechanism is and how we may augment it to be sure
  that the disk is checked as part of the heartbeat and/or we verify that
 it
  can serve queries.
 
  Any pointers would be appreciated.
 
  Thanks!
 
  --Gregg




Re: Fetching uniqueKey and other int quickly from documentCache?

2014-03-03 Thread Gregg Donovan
Yonik,

That's a very clever idea. Unfortunately, I think that will skip the
distributed query optimization we were hoping to take advantage of in
SOLR-1880 [1], but it should work with the proposed distrib.singlePass
optimization in SOLR-5768 [2]. Does that sound right?

--Gregg

[1] https://issues.apache.org/jira/browse/SOLR-1880
[2] https://issues.apache.org/jira/browse/SOLR-5768


On Wed, Feb 26, 2014 at 8:53 PM, Yonik Seeley yo...@heliosearch.com wrote:

 You could try forcing things to go through function queries (via
 pseudo-fields):

 fl=field(id), field(myfield)

 If you're not requesting any stored fields, that *might* currently
 skip that step.

 -Yonik
 http://heliosearch.org - native off-heap filters and fieldcache for solr


 On Mon, Feb 24, 2014 at 9:58 PM, Gregg Donovan gregg...@gmail.com wrote:
  We fetch a large number of documents -- 1000+ -- for each search. Each
  request fetches only the uniqueKey or the uniqueKey plus one secondary
  integer key. Despite this, we find that we spent a sizable amount of time
  in SolrIndexSearcher#doc(int docId, Set<String> fields). Time is spent
  fetching the two stored fields, LZ4 decoding, etc.

  I would love to be able to tell Solr to always fetch these two fields from
  memory. We have them both in the fieldCache so we're already spending the
  RAM. I've seen this asked previously [1], so it seems like a fairly common
  need, especially for distributed search. Any ideas?

  A few possible ideas I had:

  --Check FieldCache.html#getCacheEntries() before going to stored fields.
  --Give the documentCache config a list of fields it should load from the
  fieldCache

  Having an in-memory mapping from docId->uniqueKey has come up for us
  before. We've used a custom SolrCache maintaining that mapping to quickly
  filter over personalized collections. Maybe the uniqueKey should be more
  optimized out of the box? Perhaps a custom uniqueKey codec that also
  maintained the docId->uniqueKey mapping in memory?

  --Gregg

  [1] http://search-lucene.com/m/oCUKJ1heHUU1



SolrCloud: heartbeat succeeding while node has failing SSD?

2014-03-02 Thread Gregg Donovan
We had a brief SolrCloud outage this weekend when a node's SSD began to
fail but the node still appeared to be up to the rest of the SolrCloud
cluster (i.e. still green in clusterstate.json). Distributed queries that
reached this node would fail but whatever heartbeat keeps the node in the
clusterstate.json must have continued to succeed.

We eventually had to power the node down to get it to be removed from
clusterstate.json.

This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what
the default heartbeat mechanism is and how we may augment it to be sure
that the disk is checked as part of the heartbeat and/or we verify that it
can serve queries.

Any pointers would be appreciated.

Thanks!

--Gregg


Fetching uniqueKey and other int quickly from documentCache?

2014-02-24 Thread Gregg Donovan
We fetch a large number of documents -- 1000+ -- for each search. Each
request fetches only the uniqueKey or the uniqueKey plus one secondary
integer key. Despite this, we find that we spent a sizable amount of time
in SolrIndexSearcher#doc(int docId, Set<String> fields). Time is spent
fetching the two stored fields, LZ4 decoding, etc.

I would love to be able to tell Solr to always fetch these two fields from
memory. We have them both in the fieldCache so we're already spending the
RAM. I've seen this asked previously [1], so it seems like a fairly common
need, especially for distributed search. Any ideas?

A few possible ideas I had:

--Check FieldCache.html#getCacheEntries() before going to stored fields.
--Give the documentCache config a list of fields it should load from the
fieldCache


Having an in-memory mapping from docId->uniqueKey has come up for us
before. We've used a custom SolrCache maintaining that mapping to quickly
filter over personalized collections. Maybe the uniqueKey should be more
optimized out of the box? Perhaps a custom uniqueKey codec that also
maintained the docId->uniqueKey mapping in memory?

--Gregg

[1] http://search-lucene.com/m/oCUKJ1heHUU1


DistributedSearch: Skipping STAGE_GET_FIELDS?

2014-02-23 Thread Gregg Donovan
In most of our Solr use-cases, we fetch only fl=uniqueKey or
fl=uniqueKey,another_int_field. I'd like to be able to do a distributed
search and skip STAGE_GET_FIELDS -- i.e. the stage where each shard is
queried for the documents found in the top ids -- as it seems like we
could be collecting this information earlier in the pipeline.

Is this possible out-of-the-box? If not, how would you recommend
implementing it?

Thanks!

--Gregg


Caching Solr boost functions?

2014-02-18 Thread Gregg Donovan
We're testing out a new handler that uses edismax with three different
boost functions. One has a random() function in it, so is not very
cacheable, but the other two boost functions do not change from query to
query.

I'd like to tell Solr to cache those boost queries for the life of the
Searcher so they don't get recomputed every time. Is there any way to do
that out of the box?

In a different custom QParser we have we wrote a CachingValueSource that
wrapped a ValueSource with a custom ValueSource cache. Would it make sense
to implement that as a standard Solr function so that one could do:

boost=cache(expensiveFunctionQuery())

Thanks.

--Gregg


Re: SolrCore#getIndexDir() contract change between 3.6 and 4.1?

2013-02-07 Thread Gregg Donovan
Thanks, Mark. I created SOLR-4413 [1] for it.

I'm not sure what the best fix is since it looks like a lot of the work at
that time went into refactoring SolrIndexSearcher to use DirectoryFactory
everywhere and index.properties doesn't make much sense when an FSDirectory
is not used...

Anyway, I'll follow up in JIRA.

--Gregg

[1] https://issues.apache.org/jira/browse/SOLR-4413

On Wed, Feb 6, 2013 at 8:42 PM, Mark Miller markrmil...@gmail.com wrote:

 Thanks Gregg - can you file a JIRA issue?

 - Mark

 On Feb 6, 2013, at 5:57 PM, Gregg Donovan gregg...@gmail.com wrote:

  Mark-
 
  You're right that SolrCore#getIndexDir() did not directly read
  index.properties in 3.6. In 3.6, it gets it indirectly from what is
 passed
  to the constructor of SolrIndexSearcher. Here's SolrCore#getIndexDir() in
  3.6:
 
    public String getIndexDir() {
      synchronized (searcherLock) {
        if (_searcher == null)
          return dataDir + "index/";
        SolrIndexSearcher searcher = _searcher.get();
        return searcher.getIndexDir() == null ? dataDir + "index/" :
          searcher.getIndexDir();
      }
    }
 
  In 3.6 the only time I see a new SolrIndexSearcher created without the
  results of SolrCore#getNewIndexDir() getting passed in somehow would be if
  SolrCore#newSearcher(String, boolean) is called manually before any other
  SolrIndexSearcher. Otherwise, it looks like getNewIndexDir() is getting
  passed to new SolrIndexSearcher which is then reflected back
  in SolrCore#getIndexDir().

  So, in 3.6 we had been able to rely on SolrCore#getIndexDir() giving us
  either the value of the index referenced in index.properties OR dataDir +
  "index/" if index.properties was missing. In 4.1, it always gives us
  dataDir + "index/".
 
  Here's the comment in 3.6 on SolrCore#getNewIndexDir() that I think you
  were referring to. The comment is unchanged in 4.1:
 
    /**
     * Returns the indexdir as given in index.properties. If index.properties
     * exists in dataDir and there is a property <i>index</i> available and it
     * points to a valid directory in dataDir that is returned Else
     * dataDir/index is returned. Only called for creating new indexSearchers
     * and indexwriters. Use the getIndexDir() method to know the active
     * index directory
     *
     * @return the indexdir as given in index.properties
     */
    public String getNewIndexDir() {
 
   *Use the getIndexDir() method to know the active index directory* is the
   behavior that we were reliant on. Since it's now hardcoded to dataDir +
   "index/", it doesn't always return the active index directory.
 
  --Gregg
 
  On Wed, Feb 6, 2013 at 5:13 PM, Mark Miller markrmil...@gmail.com
 wrote:
 
 
  On Feb 6, 2013, at 4:23 PM, Gregg Donovan gregg...@gmail.com wrote:
 
  code we had that relied on the 3.6 behavior of SolrCore#getIndexDir()
 is
  not working the same way.
 
  Can you be very specific about the different behavior that you are
 seeing?
  What exactly where you seeing and counting on and what are you seeing
 now?
 
  - Mark




SolrCore#getIndexDir() contract change between 3.6 and 4.1?

2013-02-06 Thread Gregg Donovan
In the process of upgrading from 3.6 to 4.1, we've noticed that much of the
code we had that relied on the 3.6 behavior of SolrCore#getIndexDir() is
not working the same way.

In 3.6, SolrCore#getIndexDir() would get us the index directory read from
index.properties, if it existed, otherwise it would return dataDir +
"index/". As of svn 1420992 [1], SolrCore#getIndexDir() just
returns dataDir + "index/" and does not take index.properties into account.

This has me wondering what the intended state of support for
index.properties is in 4.1. After reading the code for some of the relevant
components -- Core admin, HTTP Replication, etc. -- I'm somewhat confused.

--In CoreAdminHandler#handleUnloadAction(SolrQueryRequest,
SolrQueryResponse) if the deleteIndex flag is set to true, it calls
core.getDirectoryFactory().remove(core.getIndexDir()). If a value other
than index/ is set in index.properties, won't this delete the wrong
directory?

--In CoreAdminHandler#getIndexSize(SolrCore), the existence of
SolrCore#getIndexDir() is checked before SolrCore#getNewIndexDir(). If a
value other than index/ is set in index.properties, won't this return the
size of the wrong directory?

Seeing these two examples, I wondered if index.properties and the use of
directories other than dataDir/index/ was deprecated, but I see that
SnapPuller will create a new directory within dataDir and update
index.properties to point to it in cases where isFullCopyNeeded=true.

Our current Solr 3.6 reindexing scheme works by modifying index.properties
to point to a new directory and then doing a core reload. I'm wondering if
this method is intended to be deprecated at this point, or if the SolrCloud
scenarios are just getting more attention and some bugs have slipped into
the older code paths. I can certainly appreciate that it's tough to make
the changes needed for SolrCloud while maintaining perfect compatibility in
pre-Cloud code paths. Would restoring the previous contact of
SolrCore#getIndexDir() break anything in SolrCloud?

Thanks!

--Gregg


Gregg Donovan
Senior Software Engineer, Etsy.com
gr...@etsy.com

[1]
http://svn.apache.org/viewvc?diff_format=h&view=revision&revision=1420992


Re: SolrCore#getIndexDir() contract change between 3.6 and 4.1?

2013-02-06 Thread Gregg Donovan
Mark-

You're right that SolrCore#getIndexDir() did not directly read
index.properties in 3.6. In 3.6, it gets it indirectly from what is passed
to the constructor of SolrIndexSearcher. Here's SolrCore#getIndexDir() in
3.6:

  public String getIndexDir() {
    synchronized (searcherLock) {
      if (_searcher == null)
        return dataDir + "index/";
      SolrIndexSearcher searcher = _searcher.get();
      return searcher.getIndexDir() == null ? dataDir + "index/" :
        searcher.getIndexDir();
    }
  }

In 3.6 the only time I see a new SolrIndexSearcher created without the
results of SolrCore#getNewIndexDir() getting passed in somehow would be if
SolrCore#newSearcher(String, boolean) is called manually before any other
SolrIndexSearcher. Otherwise, it looks like getNewIndexDir() is getting
passed to new SolrIndexSearcher which is then reflected back
in SolrCore#getIndexDir().

So, in 3.6 we had been able to rely on SolrCore#getIndexDir() giving us
either the value of the index referenced in index.properties OR dataDir +
"index/" if index.properties was missing. In 4.1, it always gives us
dataDir + "index/".

Here's the comment in 3.6 on SolrCore#getNewIndexDir() that I think you
were referring to. The comment is unchanged in 4.1:

  /**
   * Returns the indexdir as given in index.properties. If index.properties
   * exists in dataDir and there is a property <i>index</i> available and it
   * points to a valid directory in dataDir that is returned Else
   * dataDir/index is returned. Only called for creating new indexSearchers
   * and indexwriters. Use the getIndexDir() method to know the active
   * index directory
   *
   * @return the indexdir as given in index.properties
   */
  public String getNewIndexDir() {

*Use the getIndexDir() method to know the active index directory* is the
behavior that we were reliant on. Since it's now hardcoded to dataDir +
"index/", it doesn't always return the active index directory.

--Gregg

On Wed, Feb 6, 2013 at 5:13 PM, Mark Miller markrmil...@gmail.com wrote:


 On Feb 6, 2013, at 4:23 PM, Gregg Donovan gregg...@gmail.com wrote:

  code we had that relied on the 3.6 behavior of SolrCore#getIndexDir() is
  not working the same way.

 Can you be very specific about the different behavior that you are seeing?
 What exactly where you seeing and counting on and what are you seeing now?

 - Mark


replicateOnStartup not finding commits after SOLR-3911?

2013-01-29 Thread Gregg Donovan
In the process of upgrading to 4.1 from 3.6, I've noticed that our
master servers do not show any commit points available until after a
new commit happens. So, for static indexes, replication doesn't happen
and for dynamic indexes, we have to wait until an incremental update
of master for slaves to see any commits.

Tracing through the code, it looks like the change that may have
affected us was part of SOLR-3911 [1], specifically commenting out the
initialization of the newIndexWriter in the replicateAfterStartup
block [2]:

// TODO: perhaps this is no longer necessary then?
// core.getUpdateHandler().newIndexWriter(true);

I'm guessing this is commented out because it is assumed that
indexCommitPoint was going to be set by that block, but when a slave
requests commits, that goes back to
core.getDeletionPolicy().getCommits() to fetch the list of commits. If
no indexWriter has been initialized, then, as far as I can tell,
IndexDeletionPolicyWrapper#onInit will not have been called and there
will be no commits available.

Is there something in the code or configuration that we may be missing
that should be initializing the commits for replication or should we
just try uncommenting that line in ReplicationHandler?

Thanks!

--Gregg

Gregg Donovan
Senior Software Engineer, Etsy.com
gr...@etsy.com


[1]
https://issues.apache.org/jira/browse/SOLR-3911
https://issues.apache.org/jira/secure/attachment/12548596/SOLR-3911.patch

[2]
http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/solr/core/src/java/org/apache/solr/handler/ReplicationHandler.java?annotate=1420992&diff_format=h&pathrev=1420992#l880


Re: replicateOnStartup not finding commits after SOLR-3911?

2013-01-29 Thread Gregg Donovan
Thanks, Mark -- that fixed the issue for us. I created
https://issues.apache.org/jira/browse/SOLR-4380 to track it.

On Tue, Jan 29, 2013 at 4:06 PM, Mark Miller markrmil...@gmail.com wrote:

 On Jan 29, 2013, at 3:50 PM, Gregg Donovan gregg...@gmail.com wrote:

  should we
 just try uncommenting that line in ReplicationHandler?

 Please try. I'd file a JIRA issue in any case. I can probably take a closer 
 look.

 - Mark


PK uniqueness aware Solr index merging?

2013-01-24 Thread Gregg Donovan
We have a Hadoop process that produces a set of Solr indexes from a cluster
of HBase documents. After the job runs, we pull the indexes from HDFS and
merge them together locally. The issue we're running into is that
sometimes we'll have duplicate occurrences of a primary key across indexes
that we'll want merged out. For example, a set of directories with:

./dir00/
doc_id=0
PK=1

./dir01/
doc_id=0
PK=1

should merge into a Solr index containing a single document rather than one
with two Lucene documents each containing PK=1.

The Lucene-level merge code -- i.e. oal.index.SegmentMerger.merge()--
doesn't know about the Solr schema, so it will merge these two directories
into two duplicate documents. It doesn't appear that either Solr's
oas.handler.admin.CoreAdminHandler.handleMergeAction(SolrQueryRequest,
SolrQueryResponse) handles this either, as it ends up passing the list of
merge directories to oal.index.IndexWriter.addIndexes(IndexReader...) via
oas.update.DirectUpdateHandler2.mergeIndexes(MergeIndexesCommand).

So, if I want to merge multiple Solr directories in a way that respects
primary key uniqueness, is there any more efficient manner than re-adding
all of the documents in each directory to a new Solr index to avoid PK
duplicates?
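
The closest cheap variant of "re-add everything" I can think of is driving
IndexWriter.updateDocument with the PK term so the last occurrence of each
key wins (a hedged sketch against a recent Lucene API: it assumes every
field is stored, ignores deleted docs and index-time-only field config, and
the "PK" field name and analyzer are placeholders):

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;

class PkDedupMerge {
  static void merge(Directory target, Directory... sources) throws IOException {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(target, config)) {
      for (Directory source : sources) {
        try (DirectoryReader reader = DirectoryReader.open(source)) {
          for (int docId = 0; docId < reader.maxDoc(); docId++) {
            Document doc = reader.document(docId);  // needs all fields stored
            // updateDocument deletes any earlier doc with the same PK term
            writer.updateDocument(new Term("PK", doc.get("PK")), doc);
          }
        }
      }
    }
  }
}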

Thanks.

--Gregg

Gregg Donovan
Senior Software Engineer, Etsy.com
gr...@etsy.com


Solr 4.0 SnapPuller version vs. generation issue

2013-01-10 Thread Gregg Donovan
We are in the midst of upgrading from Solr 3.6 to Solr 4.0 and have
encountered an issue with the method the SnapPuller now uses to determine
if a new directory is needed when fetching files to a slave from master.

With Solr 3.6, our reindexing process was:

On master:
1. Reindex in a separate process into a new directory:
solr.data.dir/core/index-timestamp_of_reindex_start
2. Update solr.data.dir/core/index.properties to
'index=solr.data.dir/core/index-timestamp_of_reindex_start'
3. Reload core on master so that the new index referenced in
solr.data.dir/core/index.properties would be loaded.

Slaves would then fetch this new index into a new directory without any
manual intervention because the slave would determine that a full new index
copy was needed. SnapPuller in 3.6 used:

boolean isFullCopyNeeded = commit.getGeneration() >= latestGeneration;

Since the generation on master would be near zero after a reindex
and larger on a slave, the new index would be placed in a new directory on
the slave.

In Solr 4.0, beginning with svn 1235888 [1], the check is now:

boolean isFullCopyNeeded =
    IndexDeletionPolicyWrapper.getCommitTimestamp(commit) >= latestVersion
    || forceReplication;
As far as I can tell, forceReplication is only used in SolrCloud recovery
scenarios. Our new index on master has a newer  commitTimeMSec than the
slave index, so neither of these conditions is true. With
isFullCopyNeeded=false, our new index files get pulled into the existing
directory on the slave where they are then deleted. I think this is because
their generation is older than the current slave generation.

Would it still make sense to create a new directory on the slave if
master's generation is less than the slave's generation? I can't see a
scenario where you'd want a slave to fetch files from master with a smaller
generation into the current index directory of a slave.

If the commitTimeMSec based check in Solr 4.0 is needed for SolrCloud, what
other methods of swapping in an entirely new index would you recommend for
those not using SolrCloud?

[1] --
http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/SnapPuller.java?r1=1144761&r2=1235888&pathrev=1235888&diff_format=h

Thanks!

--Gregg


Gregg Donovan
Senior Software Engineer, Etsy.com
gr...@etsy.com


Re: Solr 4.0 SnapPuller version vs. generation issue

2013-01-10 Thread Gregg Donovan
Thanks, Mark.

The relevant commit on the solrcloud branch appears to be 1231134 and is
focused on the recovery aspect of SolrCloud:

http://svn.apache.org/viewvc?diff_format=h&view=revision&revision=1231134
http://svn.apache.org/viewvc/lucene/dev/branches/solrcloud/solr/core/src/java/org/apache/solr/handler/SnapPuller.java?diff_format=h&r1=1231133&r2=1231134

I tried changed the check on our 4.0 test cluster to:

boolean isFullCopyNeeded =
    IndexDeletionPolicyWrapper.getCommitTimestamp(commit) >= latestVersion
    || commit.getGeneration() >= latestGeneration || forceReplication;

and that fixed our post-reindexing HTTP replication issues. But I'm not
sure if that check works for all of the cases that SnapPuller is designed
for.

--Gregg

On Thu, Jan 10, 2013 at 4:28 PM, Mark Miller markrmil...@gmail.com wrote:


 On Jan 10, 2013, at 4:11 PM, Gregg Donovan gregg...@gmail.com wrote:

  If the commitTimeMSec based check in Solr 4.0 is needed for SolrCloud,

 It's not. SolrCloud just uses the force option. I think this other change
 was made because Lucene stopped using both generation and version. I can
 try and look closer later - can't remember who made the change in Solr.

 - Mark


Re: Is FileFloatSource's WeakHashMap cache only cleaned by GC?

2012-06-06 Thread Gregg Donovan
Thanks for the suggestion, Erick. I created a JIRA and moved the patch
to SVN, just to be safe. [1]

--Gregg

[1] https://issues.apache.org/jira/browse/SOLR-3514

On Wed, Jun 6, 2012 at 2:35 PM, Erick Erickson erickerick...@gmail.com wrote:

 Hmmm, it would be better to open a Solr JIRA and attach this as a patch.
 Although we've had some folks provide a Git-based rather than an SVN-based
 patch.

 Anyone can open a JIRA, but you must create a signon to do that. It'd get more
 attention that way

 Best
 Erick

 On Tue, Jun 5, 2012 at 2:19 PM, Gregg Donovan gregg...@gmail.com wrote:
  We've encountered GC spikes at Etsy after adding new
  ExternalFileFields a decent number of times. I was always a little
  confused by this behavior -- isn't it just one big float[]? why does
  that cause problems for the GC? -- but looking at the FileFloatSource
  code a little more carefully, I wonder if this is due to using a
  WeakHashMap that is only cleaned by GC or manual invocation of a
  request handler.
 
   FileFloatSource stores a WeakHashMap mapping IndexReader to float[]
   or CreationPlaceholder. In the code[1], it mentions that the
  implementation is modeled after the FieldCache implementation.
  However, the FieldCacheImpl adds listeners for IndexReader close
  events and uses those to purge its caches. [2] Should we be doing the
  same in FileFloatSource?
 
  Here's a mostly untested patch[3] with a possible implementation.
  There are probably better ways to do it (e.g. I don't love using
  another WeakHashMap), but I found it tough to hook into the
  IndexReader lifecycle without a) relying on classes other than
   FileFloatSource b) changing the public API of FileFloatSource or c)
  changing the implementation too much.
 
  There is a RequestHandler inside of FileFloatSource
  (ReloadCacheRequestHandler) that can be used to clear the cache
  entirely[4], but this is sub-optimal for us for a few reasons:
 
  --It clears the entire cache. ExternalFileFields often take some
  non-trivial time to load and we prefer to do so during SolrCore
  warmups. Clearing the entire cache while serving traffic would likely
  cause user-facing requests to timeout.
  --It forces an extra commit with its consequent cache cycling, etc..
 
  I'm thinking of ways to monitor the size of FileFloatSource's cache to
  track its size against GC pause times, but it seems tricky because
  even calling WeakHashMap#size() has side-effects. Any ideas?
 
  Overall, what do you think? Does relying on GC to clean this cache
  make sense as a possible cause of GC spikiness? If so, does the patch
  [3] look like a decent approach?
 
  Thanks!
 
  --Gregg
 
  [1] https://github.com/apache/lucene-solr/blob/a3914cb5c0243913b827762db2d616ad7cc6801d/solr/core/src/java/org/apache/solr/search/function/FileFloatSource.java#L135
  [2] https://github.com/apache/lucene-solr/blob/1c0eee5c5cdfddcc715369dad9d35c81027bddca/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java#L166
  [3] https://gist.github.com/2876371
  [4] https://github.com/apache/lucene-solr/blob/a3914cb5c0243913b827762db2d616ad7cc6801d/solr/core/src/java/org/apache/solr/search/function/FileFloatSource.java#L310


Is FileFloatSource's WeakHashMap cache only cleaned by GC?

2012-06-05 Thread Gregg Donovan
We've encountered GC spikes at Etsy after adding new
ExternalFileFields a decent number of times. I was always a little
confused by this behavior -- isn't it just one big float[]? why does
that cause problems for the GC? -- but looking at the FileFloatSource
code a little more carefully, I wonder if this is due to using a
WeakHashMap that is only cleaned by GC or manual invocation of a
request handler.

FileFloatSource stores a WeakHashMap mapping IndexReader to float[]
or CreationPlaceholder. In the code[1], it mentions that the
implementation is modeled after the FieldCache implementation.
However, the FieldCacheImpl adds listeners for IndexReader close
events and uses those to purge its caches. [2] Should we be doing the
same in FileFloatSource?
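
In outline, the hook I mean is something like the following (hedged: the
listener API and names are approximations for the Lucene version we're on,
and the map stands in for FileFloatSource's static WeakHashMap):

import java.util.Map;
import java.util.WeakHashMap;
import org.apache.lucene.index.IndexReader;

class ReaderScopedFloatCache {
  private final Map<IndexReader, float[]> cache = new WeakHashMap<>();

  synchronized float[] get(IndexReader reader) {
    return cache.get(reader);
  }

  synchronized void put(final IndexReader reader, float[] values) {
    cache.put(reader, values);
    // Mirror FieldCacheImpl: evict eagerly when the reader closes instead of
    // waiting for the WeakHashMap entry to be collected.
    reader.addReaderClosedListener(new IndexReader.ReaderClosedListener() {
      @Override
      public void onClose(IndexReader closed) {
        synchronized (ReaderScopedFloatCache.this) {
          cache.remove(closed);
        }
      }
    });
  }
}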

Here's a mostly untested patch[3] with a possible implementation. There are
probably better ways to do it (e.g. I don't love using another WeakHashMap),
but I found it tough to hook into the IndexReader lifecycle without a) relying
on classes other than FileFloatSource, b) changing the public API of
FileFloatSource, or c) changing the implementation too much.
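
In rough outline (this is just a sketch of the idea, not the actual patch[3]),
the eviction I have in mind looks like the class below. Note the assumptions:
it uses the Lucene 4.x listener names (IndexReader.addReaderClosedListener and
ReaderClosedListener.onClose; 3.x spells the hook ReaderFinishedListener), and
the cache class itself is hypothetical rather than the real FileFloatSource:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.index.IndexReader;

// Hypothetical cache, not the real FileFloatSource: entries are removed
// eagerly when their IndexReader closes instead of waiting for GC to clear
// a WeakHashMap entry.
public class ReaderScopedFloatCache {
  private final Map<IndexReader, float[]> cache =
      new ConcurrentHashMap<IndexReader, float[]>();

  public float[] get(IndexReader reader) {
    return cache.get(reader);
  }

  public void put(final IndexReader reader, float[] values) {
    if (cache.put(reader, values) == null) {
      // Mirror FieldCacheImpl: purge the entry when the reader is closed.
      reader.addReaderClosedListener(new IndexReader.ReaderClosedListener() {
        public void onClose(IndexReader closed) {
          cache.remove(closed);
        }
      });
    }
  }
}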

There is a RequestHandler inside of FileFloatSource
(ReloadCacheRequestHandler) that can be used to clear the cache
entirely[4], but this is sub-optimal for us for a few reasons:

--It clears the entire cache. ExternalFileFields often take some
non-trivial time to load and we prefer to do so during SolrCore
warmups. Clearing the entire cache while serving traffic would likely
cause user-facing requests to timeout.
--It forces an extra commit, with its consequent cache cycling, etc.

I'm thinking of ways to monitor the size of FileFloatSource's cache and track
it against GC pause times, but it seems tricky because even calling
WeakHashMap#size() has side-effects. Any ideas?
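
(To make that side-effect concrete: WeakHashMap#size() expunges stale entries
before counting, so even a monitoring call mutates the map. A minimal,
standalone illustration, with the caveat that the output depends on when the
JVM actually collects the key:)

import java.util.WeakHashMap;

public class WeakMapSizeDemo {
  public static void main(String[] args) throws InterruptedException {
    WeakHashMap<Object, float[]> cache = new WeakHashMap<Object, float[]>();
    Object key = new Object();
    cache.put(key, new float[1024]);
    System.out.println("before: " + cache.size()); // 1
    key = null;       // drop the only strong reference to the key
    System.gc();      // a hint, not a guarantee
    Thread.sleep(100);
    // size() purges the stale entry as part of answering, so this usually prints 0
    System.out.println("after:  " + cache.size());
  }
}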

Overall, what do you think? Does relying on GC to clean this cache
make sense as a possible cause of GC spikiness? If so, does the patch
[3] look like a decent approach?

Thanks!

--Gregg

[1] https://github.com/apache/lucene-solr/blob/a3914cb5c0243913b827762db2d616ad7cc6801d/solr/core/src/java/org/apache/solr/search/function/FileFloatSource.java#L135
[2] https://github.com/apache/lucene-solr/blob/1c0eee5c5cdfddcc715369dad9d35c81027bddca/lucene/core/src/java/org/apache/lucene/search/FieldCacheImpl.java#L166
[3] https://gist.github.com/2876371
[4] https://github.com/apache/lucene-solr/blob/a3914cb5c0243913b827762db2d616ad7cc6801d/solr/core/src/java/org/apache/solr/search/function/FileFloatSource.java#L310


Good time for an upgrade to Solr/Lucene trunk?

2011-06-21 Thread Gregg Donovan
We (Etsy.com) are currently using a version of trunk from mid-October 2010
(SVN tag 1021515, to be exact). We'd like to upgrade to the current trunk
and are wondering if this is a good time. Is the new stuff (esp. DocValues)
stable? Are any other major features or performance improvements about to
land on trunk that are worth waiting a few weeks for?

Thanks for the guidance!

--Gregg

Gregg Donovan
Technical Lead, Search, Etsy.com
gr...@etsy.com


Sorting and filtering on fluctuating multi-currency price data?

2010-10-20 Thread Gregg Donovan
In our current search app, we have sorting and filtering based on item
prices. We'd like to extend this to support sorting and filtering in the
buyer's native currency with the items themselves listed in the seller's
native currency. E.g: as a buyer, if my native currency is the Euro, my
search of all items between 10 and 20 Euros would also find all items listed
in USD between 13.90 and 27.80, in CAD between 14.29 and 28.58, etc.

I wanted to run a few possible approaches by the list to see if we were on
the right track or not. Our index is updated every few minutes, but we only
update our currency conversions every few hours.

The easiest approach would be to update the documents with non-USD listings
every few hours with the USD-converted price. That would work, but if the
number of non-USD listings is large, it would be too expensive (i.e. large
parts of the index would be recreated frequently).

Another approach would be to use ExternalFileField and keep the price data,
normalized to USD, outside of the index. Every time the currency rates
changed, we would calculate new normalized prices for every document in the
index.
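
For what it's worth, a rough sketch of what that regeneration could look like:
whenever the rates change, rewrite the external file with each listing's price
normalized to USD. Solr reads lines of docKey=floatValue from a file named
external_<fieldname> in the core's data directory; the field name, rate table,
and price inputs below are placeholders, and the rounding is deliberately naive:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Map;

// Writes data/external_price_usd as "listingId=usdPrice" lines.
public class ExternalUsdPriceWriter {

  // nativePrices: listingId -> price in the seller's currency (assumed input)
  // currencies:   listingId -> ISO currency code (assumed input)
  // usdPerUnit:   currency code -> USD per one unit of that currency (assumed input)
  public static void write(File dataDir,
                           Map<String, Float> nativePrices,
                           Map<String, String> currencies,
                           Map<String, Float> usdPerUnit) throws IOException {
    File out = new File(dataDir, "external_price_usd");
    Writer w = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(out), "UTF-8"));
    try {
      for (Map.Entry<String, Float> e : nativePrices.entrySet()) {
        Float rate = usdPerUnit.get(currencies.get(e.getKey()));
        if (rate == null) continue; // skip listings in an unknown currency
        w.write(e.getKey() + "=" + (e.getValue() * rate) + "\n");
      }
    } finally {
      w.close();
    }
  }
}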

Still another approach would be to do the currency conversion at IndexReader
warmup time. We would index the native price and currency code and create a
normalized currency field on the fly. This would be somewhat like
ExternalFileField in that it involves data from outside the index, but it
wouldn't need to be scoped to the parent SolrIndexReader; it could be
per-segment. Perhaps a custom poly-field could accomplish something like
this?
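
To make the per-segment part concrete, here is a very rough sketch of the
warmup-time computation. It leans on the Lucene 2.9/3.x FieldCache API
(getFloats/getStrings); the field names and rate table are placeholders, and
wiring the resulting array into sorting or filtering (e.g. via a custom
ValueSource) is left out:

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

// Builds a USD-normalized price array for one segment at warmup time.
public class SegmentUsdPrices {
  public static float[] normalize(IndexReader segmentReader,
                                  Map<String, Float> usdPerUnit) throws IOException {
    float[] nativePrice = FieldCache.DEFAULT.getFloats(segmentReader, "native_price");
    String[] currency = FieldCache.DEFAULT.getStrings(segmentReader, "currency_code");
    float[] usd = new float[segmentReader.maxDoc()];
    for (int doc = 0; doc < usd.length; doc++) {
      Float rate = (currency[doc] == null) ? null : usdPerUnit.get(currency[doc]);
      usd[doc] = (rate == null) ? nativePrice[doc] : nativePrice[doc] * rate;
    }
    return usd;
  }
}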

Has anyone dealt with this sort of problem? Do any of these approaches sound
more or less reasonable? Are we missing anything?

Thanks for the help!

Gregg Donovan
Technical Lead, Search
Etsy.com


Re: Difficulty with Multi-Word Synonyms

2009-09-17 Thread Gregg Donovan
Thanks. And thanks for the help -- we're hoping to switch from query-time to
index-time synonym expansion for all of the reasons listed on the wiki
(http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-2c461ac74b4ddd82e453dc68fcfc92da77358d46),
so this will be great to resolve.

I created SOLR-1445 (https://issues.apache.org/jira/browse/SOLR-1445), though
the problem seems to be caused by LUCENE-1919
(https://issues.apache.org/jira/browse/LUCENE-1919), as you noted.

Is there a recommended workaround that avoids combining the new and old
APIs? Would a version of SynonymFilter that also implemented
incrementToken() be helpful?

--Gregg

On Thu, Sep 17, 2009 at 7:38 PM, Yonik Seeley yo...@lucidimagination.com wrote:

 On Thu, Sep 17, 2009 at 6:29 PM, Lance Norskog goks...@gmail.com wrote:
  Please add a Jira issue for this. It will get more attention there.
 
  BTW, thanks for creating such a precise bug report.

 +1

 Thanks, I had missed this.  This is serious, and looks due to a Lucene
 back compat break.
 I've added the testcase and can confirm the bug.

 -Yonik
 http://www.lucidimagination.com






Difficulty with Multi-Word Synonyms

2009-09-14 Thread Gregg Donovan
I'm running into an odd issue with multi-word synonyms in Solr (using
the latest [9/14/09] nightly). Things generally seem to work as
expected, but I sometimes see words that are the leading term in a
multi-word synonym being replaced with the token that follows them in
the stream when they should just be ignored (i.e. there's no synonym
match for just that token). When I preview the analysis at
admin/analysis.jsp it looks fine, but at runtime I see problems like
the one in the unit test below. It's a simple case, so I assume I'm
making some sort of configuration and/or usage error.

package org.apache.solr.analysis;
import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class TestMultiWordSynonmys extends junit.framework.TestCase {

  public void testMultiWordSynonmys() throws IOException {
    List<String> rules = new ArrayList<String>();
    rules.add( "a b c,d" );
    SynonymMap synMap = new SynonymMap( true );
    SynonymFilterFactory.parseRules( rules, synMap, "=", ",", true, null);

    SynonymFilter ts = new SynonymFilter( new WhitespaceTokenizer(
        new StringReader("a e") ), synMap );
    TermAttribute termAtt = (TermAttribute) ts.getAttribute(TermAttribute.class);

    ts.reset();
    List<String> tokens = new ArrayList<String>();
    while (ts.incrementToken()) tokens.add( termAtt.term() );

    // This fails because [e,e] is the value of the token stream
    assertEquals(Arrays.asList("a","e"), tokens);
  }
}

Any help would be much appreciated. Thanks.

--Gregg


Re: How to handle database replication delay when using DataImportHandler?

2009-01-29 Thread Gregg Donovan
Noble,

Thanks for the suggestion. The unfortunate thing is that we really don't
know ahead of time what sort of replication delay we're going to encounter
-- it could be one millisecond or it could be one hour. So, we end up
needing to do something like:

For delta-import run N:
1. query DB slave for seconds_behind_master, use this to calculate
Date(N).
2. query DB slave for records updated since Date(N - 1)

I see there are plugin points for EventListener classes (onImportStart,
onImportEnd). Would those be the right spot to calculate these dates so that
I could expose them to my custom function at query time?
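
To make the question concrete, here is the rough, untested shape of the custom
function I have in mind. Caveats: the evaluate(String, Context) signature is
taken from the Evaluator API as documented on the wiki and may differ across
releases; fetchSecondsBehindMaster() is a placeholder for something like
reading Seconds_Behind_Master from MySQL's SHOW SLAVE STATUS; and in practice
the value would need to be captured per run (e.g. in onImportStart/onImportEnd)
rather than recomputed wherever the function is evaluated:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Evaluator;

// Hypothetical: registered in data-config.xml as
//   <function name="slaveAwareNow" class="com.example.SlaveAwareNowEvaluator"/>
// and referenced in a delta query as ${dataimporter.functions.slaveAwareNow()}.
public class SlaveAwareNowEvaluator extends Evaluator {

  public String evaluate(String expression, Context context) {
    long delayMillis = fetchSecondsBehindMaster() * 1000L;
    Date safeNow = new Date(System.currentTimeMillis() - delayMillis);
    // DIH's usual date format
    return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(safeNow);
  }

  // Placeholder: ask the slave how far behind the master it currently is.
  private long fetchSecondsBehindMaster() {
    return 0L;
  }
}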

Thanks.

--Gregg

On Wed, Jan 28, 2009 at 11:20 PM, Noble Paul നോബിള്‍ नोब्ळ् 
noble.p...@gmail.com wrote:

 The problem you are trying to solve is that you cannot use
 ${dataimporter.last_index_time} as is. You may need something like
 ${dataimporter.last_index_time} - 3secs.

 am I right?

 There is no straightforward way to do this.
 1) You may write your own function, say 'lastIndexMinus3Secs', and add it.
 Functions can be plugged into DIH using a
 <function name="lastIndexMinus3Secs" class="foo.Foo"/> element under the
 <dataConfig> tag, and you can use it as
 ${dataimporter.functions.lastIndexMinus3Secs()}.
 This will add to the existing in-built functions:

 http://wiki.apache.org/solr/DataImportHandler#head-5675e913396a42eb7c6c5d3c894ada5dadbb62d7

 The class must extend org.apache.solr.handler.dataimport.Evaluator.

 We may add a standard function for this too; you can raise an issue.
 --Noble



 On Thu, Jan 29, 2009 at 6:26 AM, Gregg gregg...@gmail.com wrote:
  I'd like to use the DataImportHandler running against a slave database
 that,
  at any given time, may be significantly behind the master DB. This can
 cause
  updates to be missed if you use the clock-time as the last_index_time.
  E.g., if the slave catches up to the master between two delta-imports.
 
  Has anyone run into this? In our non-DIH indexing system we get around
 this
  by either using the slave DB's seconds-behind-master or the max last
 update
  time of the records returned.
 
  Thanks.
 
  Gregg
 



 --
 --Noble Paul