Re: Issue with multivalued fields in UIMA

2014-10-30 Thread 5ton3
I had to overcome this issue, as I needed to analyze multivalued fields. The
fact that UIMA doesn't analyse multivalued fields is a known bug. With the
help of Maryam, I solved the issue. The JIRA issue, along with a working
patch, can be found here: https://issues.apache.org/jira/browse/SOLR-6622



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issue-with-multivalued-fields-in-UIMA-tp4155609p4166576.html
Sent from the Solr - User mailing list archive at Nabble.com.


issue related to blank value in datefield

2014-10-30 Thread Aman Tandon
Hi,

I want to set the value -00-00T00:00:00Z for a date field where I do not
have a value. When I index that field with the desired value, it gets
indexed as 0002-11-30T00:00:00Z.

What is the reason behind this?

With Regards
Aman Tandon


Re: Solr Memory Usage

2014-10-30 Thread Toke Eskildsen
On Wed, 2014-10-29 at 23:37 +0100, Will Martin wrote:
 This command only touches OS level caches that hold pages destined for (or
 not) the swap cache. Its use means that disk will be hit on future requests,
 but in many instances the pages were headed for ejection anyway.
 
 It does not have anything whatsoever to do with Solr caches.

If you re-read my post, you will see that the OS had to spend a lot of
resources just on memory bookkeeping. OS, not JVM.

 It also is not fragmentation related; it is a result of the kernel
 managing virtual pages in an as designed manner. The proper command
 is
 
 #sync; echo 3 > /proc/sys/vm/drop_caches

I just talked with a Systems guy to verify what happened when we had
the problem:

- The machine spawned Xmx1g JVMs with Tika, each instance processing a 
  single 100M ARC file, sending the result to a shared Solr instance 
  and shutting down. 40 instances were running at all times, each 
  instance living for a little less than 3 minutes.
  Besides taking ~40GB of RAM in total, this also meant that about 10GB 
  of RAM was released and re-requested from the system each minute.
  I don't know how the memory mapping in Solr works with regard to
  re-use of existing allocations, so I can't say if Solr added to that
  number or not.

- The indexing speed deteriorated after some days, grinding down to 
  (loose guess) something like 1/4th of initial speed.

- Running top showed that the majority of time was spent in the kernel.

- Running echo 3 > /proc/sys/vm/drop_caches (I asked Systems explicitly
  about the integer and it was '3') brought the speed back to the 
  initial level. The temporary patch was to run it once every hour.

- Running top with the patch showed the vast majority of time was spent 
  in user space.

- Systems investigated and determined that huge pages were 
  automatically requested by processes on the machine, leading to 
  (virtual) memory fragmentation on the OS level. They used a tool in 
  'sysfsutils' (just relaying what they said here) to change the default
  from huge pages to small pages (or whatever the default is named).

- The disabling of huge pages made the problem go away and we no longer
  use the drop_caches-trick.

 http://linux.die.net/man/5/proc
 
 I have encountered resistance on the use of this on long-running processes
 for years ... from people who don't even research the matter.

The resistance is natural: Although it might work to drop_cache, as it
did for us, it is still symptom treatment. Until the cause has been
isolated and determined to be practically unresolvable, the drop_cache
is a red flag.

Your undetermined core problem might not be the same as ours, but it is
simple to check: Watch kernel time percentage. If it rises over time,
try disabling huge pages.
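
(For reference: I believe the mechanism involved here was transparent huge
pages. On many Linux distributions the current setting can be inspected and
changed with something along the lines of

  cat /sys/kernel/mm/transparent_hugepage/enabled
  echo never > /sys/kernel/mm/transparent_hugepage/enabled

but the exact sysfs path and tooling differ between distributions, so treat
that as a sketch rather than a recipe.)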

- Toke Eskildsen, State and University Library, Denmark




prefix length in fuzzy search solr 4.10.1

2014-10-30 Thread elisabeth benoit
Hello all,

Is there a parameter in the Solr 4.10.1 API that allows the user to set the
prefix length in fuzzy search?

Best regards,
Elisabeth


Re: Sharding configuration

2014-10-30 Thread Anca Kopetz

Hi,

We did some tests with 4 shards / 4 different tomcat instances on the
same server and the average latency was smaller than the one when having
only one shard.
We also tested 2 shards on different servers and the performance results
were also worse.

It seems that the sharding does not make any difference for our index in
terms of latency gains.

Thanks for your response,
Anca

On 10/28/2014 08:44 PM, Ramkumar R. Aiyengar wrote:

As far as the second option goes, unless you are using a large amount of
memory and you reach a point where a JVM can't sensibly deal with a GC
load, having multiple JVMs wouldn't buy you much. With a 26GB index, you
probably haven't reached that point. There are also other shared resources
at an instance level like connection pools and ZK connections, but those
are tunable and you probably aren't pushing them as well (I would imagine
you are just trying to have only a handful of shards given that you aren't
sharded at all currently).

That leaves single vs multiple machines. Assuming the network isn't a
bottleneck, and given the same amount of resources overall (number of
cores, amount of memory, IO bandwidth times number of machines), it
shouldn't matter between the two. If you are procuring new hardware, I
would say buy more, smaller machines, but if you already have the hardware,
you could serve as much as possible off a machine before moving to a
second. There's nothing which limits the number of shards as long as the
underlying machine has the sufficient amount of parallelism.

Again, this advice is for a small number of shards, if you had a lot more
(hundreds) of shards and significant volume of requests, things start to
become a bit more fuzzy with other limits kicking in.
On 28 Oct 2014 09:26, Anca Kopetz anca.kop...@kelkoo.com wrote:


Hi,

We have a SolrCloud configuration of 10 servers, no sharding, 20
million documents, and an index of 26 GB.
As the number of documents has increased recently, the performance of
the cluster has decreased.

We thought of sharding the index in order to reduce the latency. What
is the best approach?
- to use shard splitting and have several sub-shards on the same server
and in the same tomcat instance
- having several shards on the same server but on different tomcat
instances
- having one shard on each server (for example 2 shards / 5 replicas on
10 servers)

What's the impact of these 3 configurations on performance?

Thanks,
Anca



Re: SolrCloud : node recovery fails with No registered leader was found

2014-10-30 Thread yann180
Hi guys,

just wondering if any solution was found for this?

I have a similar problem - Solr 4.7.2, 2-server cloud, single replicated
shard.

At random times one of the servers dies with the same message as in the
title of this thread.

I was hoping there might be a solution? (upgrading Solr is not practical for
me because of the JDK 1.7 requirement).

Thanks

Yann



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-node-recovery-fails-with-No-registered-leader-was-found-tp4137331p4166601.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Exporting Error in 4.10.1

2014-10-30 Thread Dmitry Kan
Hi,
Luke has a feature of index exporting, given that output format suits your
needs (xml). https://github.com/DmitryKey/luke/releases/tag/luke-4.10.1

http://dmitrykan.blogspot.fi/2014/09/exporting-lucene-index-to-xml-with-luke.html

It does not have the option to export select fields only, though.

Dmitry

On Thu, Oct 30, 2014 at 12:39 AM, Joseph Obernberger 
joseph.obernber...@gmail.com wrote:

 Hi - I'm trying to use 4.10.1 with /export.  I've defined a field as
 follows:
 <field name="DocumentId" type="string" indexed="true" stored="true"
 required="true" multiValued="false" docValues="true"/>

 I then call:
 http://server:port
 /solr/COLLECT1/export?q=Collection:COLLECT2000&sort=DocumentId
 desc&fl=DocumentId

 The error I receive is:
 java.io.IOException: DocumentId must have DocValues to use this feature.
 at

 org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:228)
 at

 org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:119)
 at

 org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
 at

 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
 at
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
 at

 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
 at

 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at

 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
 at
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
 at

 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at

 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
 at

 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at

 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at

 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at

 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:368)
 at

 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
 at

 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at

 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
 at

 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at

 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at

 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at

 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at

 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:745)

 Any ideas what I'm doing wrong?
 Thank you!

 -Joe Obernberger




-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: phrase query in solr 4

2014-10-30 Thread Dmitry Kan
On top of what Shawn rightly said, two things:

1. Try to benchmark the solution yourself (best bet) with and without the
shingles. Then you know better and have a story with numbers to tell.
2. If you go with the shingles approach, consider removing duplicates with
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory
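
For illustration, an analyzer chain along these lines (the field type name
and the rest of the chain are just placeholders, adapt to your own schema):

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true" maxShingleSize="3"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>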

Dmitry

On Mon, Oct 27, 2014 at 3:11 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/27/2014 6:20 AM, Robust Links wrote:
  1) we want to index and search all tokens in a document (i.e. we do not
  rely on external stores)
 
  2) we need search time to be fast and willing to pay larger indexing time
  and index size,
 
  3)  be able to search as fast as possible ngrams of 3 tokens or less
 (i.e,
  unigrams, bigrams and trigrams).
 
 
  To satisfy (1) we used the default
   <maxFieldLength>2147483647</maxFieldLength> in
  solrconfig.xml of 3.6.1 index to specify the total number of tokens to
  index in an article. In solr 4 we are specifying it via the tokenizer in
  the analyzer chain
 
 
   <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647" />
 
 
  To satisfy 2 and 3 in our 3.6.1 index we indexed using the following
  shingedFilterFactory in the analyzer chain
 
 
   <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
   maxShingleSize="3"/>
 
 
  This was based on this thread:
 
 
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200808.mbox/%3c856ac15f0808161539p54417df2ga5a6fdfa35889...@mail.gmail.com%3E
 
 
  The open questions we are trying to understand now are:
 
 
  1) whether shingling is still the best strategy for phrase (ngram) search
  given our requirements above?
 
  2) if not then what would be a better strategy.

 The maxFieldLength setting is different than maxTokenLength.  The former
 is the number of tokens that are allowed.  The latter is the number of
 characters allowed in *each* token.  Since the value you were using
 should be the default value for maxFieldLength, you don't need it in
 your config.

 As for maxTokenLength, if the older version worked right without that
 setting, you probably don't need it now.  Really long tokens are usually
 useless, unless a later step in the analysis will be breaking it up into
 additional tokens (terms).  It's exceptionally rare that people will use
 or type a word that's 256 characters.  I have seen documents that
 exceed the token length on keyword fields where the input is only
 separated by commas -- there are no spaces for the WhiteSpaceTokenizer
 to split on, so a document with a lot of keywords ends up indexing none
 of them because the tokenizer ignores the input due to length.  If it
 had indexed them, they would have been further tokenized by the
 WordDelimiterFilter.

 Shingles may or may not be required to match the way you have described.
  It all depends on the *exact* nature of your queries.  I haven't
 wrapped my head around the possibilities, so I can't give you an
 example.  Since it's been working on your older index, chances are
 excellent that it will continue to work on the newer index.  Shingles
 can indeed increase search performance, if the conditions are right.

 Search performance in general is better in 4.x than it was in 3.x.

 It's always a good idea to look at this wiki page (and even dive into
 the Lucene javadocs) from time to time in order to determine whether
 there's a better way of doing your analysis:

 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

 It sounds like you've been at this a while, so you probably already know
 this next part, but it would be irresponsible of me to talk about all
 this without mentioning it.  When you change your index analysis, you
 must reindex.

 http://wiki.apache.org/solr/HowToReindex

 Thanks,
 Shawn




-- 
Dmitry Kan
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Score phrases higher than the records containing the words?

2014-10-30 Thread hschillig
How can I tell if the stop words issue is resolved? This is what I get when I turn
debugging on:

http://apaste.info/0Uz

When I put:
q=title:(what if) OR title:"what if"^10
I get this:

rawquerystring: title:(what if) OR title:\"what if\"^10,
querystring: title:(what if) OR title:\"what if\"^10,
parsedquery: (+((title:what title:if) PhraseQuery(title:\"what
if\"^10.0)))/no_coord,
parsedquery_toString: +((title:what title:if) title:\"what if\"^10.0)

The other two titles still appear on top of the one that contains the "what
if" phrase.
I tried turning edismax on and placing title in the pf field and the same
results appear.
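
(The edismax attempt was along these lines -- I may be misremembering the
exact parameters:
q=what if&defType=edismax&qf=title&pf=title^10&debugQuery=true )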

Thanks for any help,
Haley



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Score-phrases-higher-than-the-records-containing-the-words-tp4166488p4166608.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Score phrases higher than the records containing the words?

2014-10-30 Thread hschillig
Edit:
I filtered my query to author:randall so I could see the score that it's
getting from the query. This is the score of the record that contains "what
if":
score: 0.004032644

The other two books are getting this score:
score: 0.0069850935

So... the boost is obviously not hitting that record. I wonder why?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Score-phrases-higher-than-the-records-containing-the-words-tp4166488p4166615.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Exporting Error in 4.10.1

2014-10-30 Thread Joseph Obernberger
Thank you Dmitry.  Any ideas why the Solr /export is not working for me?  I
forgot to mention that this is Solr Cloud.
I believe I've defined the field correctly, and I've also tried using
another field (title), but I get the same error:
Title must have DocValues to use this feature..

My goal is a high-speed method of getting a list of IDs out of Solr Cloud -
perhaps faster than using cursormark.

Thank you!

On Thu, Oct 30, 2014 at 7:24 AM, Dmitry Kan solrexp...@gmail.com wrote:

 Hi,
 Luke has a feature of index exporting, given that output format suits your
 needs (xml). https://github.com/DmitryKey/luke/releases/tag/luke-4.10.1


 http://dmitrykan.blogspot.fi/2014/09/exporting-lucene-index-to-xml-with-luke.html

 It does not have the option to export select fields only, though.

 Dmitry

 On Thu, Oct 30, 2014 at 12:39 AM, Joseph Obernberger 
 joseph.obernber...@gmail.com wrote:

  Hi - I'm trying to use 4.10.1 with /export.  I've defined a field as
  follows:
   <field name="DocumentId" type="string" indexed="true" stored="true"
   required="true" multiValued="false" docValues="true"/>
 
  I then call:
  http://server:port
   /solr/COLLECT1/export?q=Collection:COLLECT2000&sort=DocumentId
   desc&fl=DocumentId
 
  The error I receive is:
  java.io.IOException: DocumentId must have DocValues to use this feature.
  at
 
 
 org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:228)
  at
 
 
 org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:119)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
  at
 
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
  at
 
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
  at
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
  at
 
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at
 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
  at
 
 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
  at
  org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
  at
 
 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
  at
 
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at
 
 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at
 
 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
  at
 
 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:368)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
  at
 
 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
  at
 
 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
  at
 
 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
  at
 
 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
  at
 
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
  at
 
 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
  at java.lang.Thread.run(Thread.java:745)
 
  Any ideas what I'm doing wrong?
  Thank you!
 
  -Joe Obernberger
 



 --
 Dmitry Kan
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
 SemanticAnalyzer: www.semanticanalyzer.info



Re: Exporting Error in 4.10.1

2014-10-30 Thread Joel Bernstein
Solr 4.10 is the very first release of the export feature. It does require
that all fields being sorted and exported have docValues = true in the
schema. This is likely to change in the future, but DocValues will likely
always provide the best indexing option for sorting and exporting full
result sets.

This feature includes an entirely new sorting and exporting engine so there
are some bugs lurking.

I'll be opening a ticket to resolve these bugs shortly.



Joel Bernstein
Search Engineer at Heliosearch

On Thu, Oct 30, 2014 at 9:41 AM, Joseph Obernberger 
joseph.obernber...@gmail.com wrote:

 Thank you Dmitry.  Any ideas why the Solr /export is not working for me?  I
 forgot to mention that this is Solr Cloud.
 I believe I've defined the field correctly, and I've also tried using
 another field (title), but I get the same error:
 Title must have DocValues to use this feature..

 My goal is a high-speed method of getting a list of IDs out of Solr Cloud -
 perhaps faster than using cursormark.

 Thank you!

 On Thu, Oct 30, 2014 at 7:24 AM, Dmitry Kan solrexp...@gmail.com wrote:

  Hi,
  Luke has a feature of index exporting, given that output format suits
 your
  needs (xml). https://github.com/DmitryKey/luke/releases/tag/luke-4.10.1
 
 
 
 http://dmitrykan.blogspot.fi/2014/09/exporting-lucene-index-to-xml-with-luke.html
 
  It does not have the option to export select fields only, though.
 
  Dmitry
 
  On Thu, Oct 30, 2014 at 12:39 AM, Joseph Obernberger 
  joseph.obernber...@gmail.com wrote:
 
   Hi - I'm trying to use 4.10.1 with /export.  I've defined a field as
   follows:
    <field name="DocumentId" type="string" indexed="true" stored="true"
    required="true" multiValued="false" docValues="true"/>
  
   I then call:
   http://server:port
    /solr/COLLECT1/export?q=Collection:COLLECT2000&sort=DocumentId
    desc&fl=DocumentId
  
   The error I receive is:
   java.io.IOException: DocumentId must have DocValues to use this
 feature.
   at
  
  
 
 org.apache.solr.response.SortingResponseWriter.getFieldWriters(SortingResponseWriter.java:228)
   at
  
  
 
 org.apache.solr.response.SortingResponseWriter.write(SortingResponseWriter.java:119)
   at
  
  
 
 org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:765)
   at
  
  
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
   at
  
  
 
 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
   at
  
  
 
 org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
   at
  
 
 org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
   at
  
  
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
   at
  
 
 org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
   at
  
  
 
 org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
   at
  
  
 
 org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
   at
  
 org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
   at
  
  
 
 org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
   at
  
  
 
 org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
   at
  
  
 
 org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
   at
  
  
 
 org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
   at
  
  
 
 org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
   at
  
  
 
 org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
   at org.eclipse.jetty.server.Server.handle(Server.java:368)
   at
  
  
 
 org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
   at
  
  
 
 org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
   at
  
  
 
 org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
   at
  
  
 
 org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
   at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
   at
 org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
   at
  
  
 
 org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
   at
  
  
 
 org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
   at
  
  
 
 org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
   at
  
  
 
 org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
   at java.lang.Thread.run(Thread.java:745)
  
   Any ideas what I'm doing wrong?
   Thank you!
  
   -Joe Obernberger
  
 
 
 
  --
  Dmitry Kan
  Blog: http://dmitrykan.blogspot.com
  Twitter: http://twitter.com/dmitrykan
  

Re: Sharding configuration

2014-10-30 Thread Shawn Heisey
On 10/30/2014 4:32 AM, Anca Kopetz wrote:
 We did some tests with 4 shards / 4 different tomcat instances on the
 same server and the average latency was smaller than the one when having
 only one shard.
 We also tested 2 shards on different servers and the performance results
 were also worse.

 It seems that the sharding does not make any difference for our index in
 terms of latency gains.

That statement is confusing, because if latency goes down, that's good,
not worse.

If you're going to put multiple shards on one server, it should be done
with one solr/tomcat instance, not multiple.  One instance is perfectly
capable of dealing with many shards, and has a lot less overhead.  The
SolrCloud collection create command would need the maxShardsPerNode
parameter.
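
For example (collection name and numbers are just placeholders), something
like this on the Collections API:

http://host:8983/solr/admin/collections?action=CREATE&name=test&numShards=4&replicationFactor=1&maxShardsPerNode=4

allows up to four shard replicas of that collection on each node.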

In order to see a gain in performance from multiple shards per server,
the server must have a lot of CPUs and the query rate must be fairly
low.  If the query rate is high, then all the CPUs will be busy just
handling simultaneous queries, so putting multiple shards per server
will probably slow things down.  When query rate is low, multiple CPUs
can handle each shard query simultaneously, speeding up the overall query.

Thanks,
Shawn



Design optimal Solr Schema

2014-10-30 Thread tomas.kalas
Hello, I have a problem with the design of a schema in Solr. I have a transcript
of a telephone conversation in this format. I parse it into individual fields. I
have this schema:

<?xml version="1.0"?>
<add>
<doc>
<field name="id">01.cn</field>
<field name="t">0<br /> 1<br /> 2<br /> 2 <br /> 3 <br /></field>
<field name="st">0.00<br /> 1.54<br /> 1.54<br /> 1.54 <br /> 1.57 <br /></field>
<field name="et">1.54<br /> 1.54<br /> 1.57<br /> 1.57 <br /> 1.7 <br /></field>
<field name="w">_SILENCE_<br /> s<br /> HELLO<br /> HALLO <br /> _DELETE_ <br /></field>
<field name="p">0.00<br /> 1<br /> 1<br /> 2.06115e-009 <br /> 1 <br /></field>
<field name="c">0<br /> 0<br /> 0<br /> 0 <br /> 0 <br /></field>
</doc>
</add>

I displayed it in an HTML document, and therefore I used the <br />.

This is a original document:

T=0 ST=0.00 ET=1.54 W=_SILENCE_ P=0.00 C=0
T=1 ST=1.54 ET=1.54 W=s P=1 C=0
T=2 ST=1.54 ET=1.57 W=HELLO P=1 C=0
T=2 ST=1.54 ET=1.57 W=HALLO P=2.06115e-009 C=0
T=3 ST=1.57 ET=1.70 W=_DELETE_ P=1 C=0
T=3 ST=1.57 ET=1.70 W=NO P=2.06115e-009 C=0
T=4 ST=1.70 ET=2.12 W=HOW P=1 C=0
T=5 ST=2.12 ET=2.18 W=ARE_ P=0.25 C=0
T=5 ST=2.12 ET=2.18 W=_DELETE_ P=0.25 C=0
..
..

Id - filename
T = Segment
ST = Start time
ET = End time
W = Word
P = Probability
C = Chanel

I want to search, for example, for a word that occurs up to time 1.57: (w:HELLO) AND (t:[0
TO 1.57]). But if I have all the data in one field (t, st, et ...) then it
doesn't work. It finds all files where hello occurs even at a later time than 1.57.

Do you have any ideas how to make it work? Thanks a lot for your help.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Design optimal Solr Schema

2014-10-30 Thread Jorge Luis Betancourt González
Are you going to use the values stored in Solr to display the data in HTML? For 
searching purposes I suggest removing all the HTML tags and indexing the plain 
text; for this you could use the HTMLStripCharFilterFactory char filter. This 
will clean your content and only pass on the actual text, which is in the end 
what you're going to use. 

If you are going to use the Solr result to display the content in an HTML page, 
I would still suggest keeping your index clean and indexing only the actual 
searchable text, no HTML; I actually use the recommended filter to strip HTML 
out of crawled HTML pages. Also, what does a Solr document mean to you? Is an 
entire conversation modeled as one Solr document? Have you considered making 
each conversation interaction a separate document, as in the rough sketch below? 
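
A rough sketch of what one document per transcript entry could look like (the
id convention here is made up, one document per W line):

<add>
  <doc>
    <field name="id">01.cn-2</field>
    <field name="t">2</field>
    <field name="st">1.54</field>
    <field name="et">1.57</field>
    <field name="w">HELLO</field>
    <field name="p">1</field>
    <field name="c">0</field>
  </doc>
</add>

That way a query like w:HELLO AND et:[0 TO 1.57] (or st, whichever time you
mean) matches a single entry instead of the whole conversation.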


- Original Message -
From: tomas.kalas kala...@email.cz
To: solr-user@lucene.apache.org
Sent: Thursday, October 30, 2014 10:27:50 AM
Subject: Design optimal Solr Schema

Hello, I have a problem with the design of a schema in Solr. I have a transcript
of a telephone conversation in this format. I parse it into individual fields. I
have this schema:

<?xml version="1.0"?>
<add>
<doc>
<field name="id">01.cn</field>
<field name="t">0<br /> 1<br /> 2<br /> 2 <br /> 3 <br /></field>
<field name="st">0.00<br /> 1.54<br /> 1.54<br /> 1.54 <br /> 1.57 <br /></field>
<field name="et">1.54<br /> 1.54<br /> 1.57<br /> 1.57 <br /> 1.7 <br /></field>
<field name="w">_SILENCE_<br /> s<br /> HELLO<br /> HALLO <br /> _DELETE_ <br /></field>
<field name="p">0.00<br /> 1<br /> 1<br /> 2.06115e-009 <br /> 1 <br /></field>
<field name="c">0<br /> 0<br /> 0<br /> 0 <br /> 0 <br /></field>
</doc>
</add>

I displayed it in an HTML document, and therefore I used the <br />.

This is a original document:

T=0 ST=0.00 ET=1.54 W=_SILENCE_ P=0.00 C=0
T=1 ST=1.54 ET=1.54 W=s P=1 C=0
T=2 ST=1.54 ET=1.57 W=HELLO P=1 C=0
T=2 ST=1.54 ET=1.57 W=HALLO P=2.06115e-009 C=0
T=3 ST=1.57 ET=1.70 W=_DELETE_ P=1 C=0
T=3 ST=1.57 ET=1.70 W=NO P=2.06115e-009 C=0
T=4 ST=1.70 ET=2.12 W=HOW P=1 C=0
T=5 ST=2.12 ET=2.18 W=ARE_ P=0.25 C=0
T=5 ST=2.12 ET=2.18 W=_DELETE_ P=0.25 C=0
..
..

Id - filename
T = Segment
ST = Start time
ET = End time
W = Word
P = Probability
C = Chanel

I want to search, for example, for a word that occurs up to time 1.57: (w:HELLO) AND (t:[0
TO 1.57]). But if I have all the data in one field (t, st, et ...) then it
doesn't work. It finds all files where hello occurs even at a later time than 1.57.

Do you have any ideas how to make it work? Thanks a lot for your help.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Design-optimal-Solr-Schema-tp4166632.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sharding configuration

2014-10-30 Thread Anca Kopetz

Hi,

You are right, it was a mistake in my sentence: for the tests with 4
shards / 4 instances, the latency was worse (therefore *bigger*) than
for the tests with one shard.

In our case, the query rate is high.

Thanks,
Anca

On 10/30/2014 03:48 PM, Shawn Heisey wrote:

On 10/30/2014 4:32 AM, Anca Kopetz wrote:

We did some tests with 4 shards / 4 different tomcat instances on the
same server and the average latency was smaller than the one when having
only one shard.
We also tested 2 shards on different servers and the performance results
were also worse.

It seems that the sharding does not make any difference for our index in
terms of latency gains.

That statement is confusing, because if latency goes down, that's good,
not worse.

If you're going to put multiple shards on one server, it should be done
with one solr/tomcat instance, not multiple.  One instance is perfectly
capable of dealing with many shards, and has a lot less overhead.  The
SolrCloud collection create command would need the maxShardsPerNode
parameter.

In order to see a gain in performance from multiple shards per server,
the server must have a lot of CPUs and the query rate must be fairly
low.  If the query rate is high, then all the CPUs will be busy just
handling simultaneous queries, so putting multiple shards per server
will probably slow things down.  When query rate is low, multiple CPUs
can handle each shard query simultaneously, speeding up the overall query.

Thanks,
Shawn





Re: Score phrases higher than the records containing the words?

2014-10-30 Thread hschillig
The other ones are still ranking higher. I think it's because the other two
titles contain "what" 3 times... the more often a title says "what", the higher it scores.
I'm not sure what else can be done. Does anybody else have any ideas?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Score-phrases-higher-than-the-records-containing-the-words-tp4166488p4166656.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: facet on field aliases of same field

2014-10-30 Thread Dan Field
Thanks Michael. We’re looking into the use of localparams now. 

 On 29 Oct 2014, at 12:56, Michael Ryan mr...@moreover.com wrote:
 
 It is indeed possible. Just need to use a different syntax. As far as I know, 
 the facet parameters need to be local parameters, like this...
 
 facet.range={!key=date_decade facet.range.start=1600-01-01T00:00:00Z 
 facet.range.end=2000-01-01T00:00:00Z 
 facet.range.gap=%2B10YEARS}date&facet.range={!key=date_year 
 facet.range.start=1600-01-01T00:00:00Z facet.range.end=2000-01-01T00:00:00Z 
 facet.range.gap=%2B1YEARS}date
 
 -Michael
 
 -Original Message-
 From: Dan Field [mailto:d...@llgc.org.uk] 
 Sent: Wednesday, October 29, 2014 5:54 AM
 To: solr-user@lucene.apache.org
 Subject: facet on field aliases of same field
 
 Hi, we have a use case where we are trying to create multiple facet ranges 
 based on a single field. 
 
 I have successfully aliased the field by using the fl parameter e.g. 
 fl=date_decade:date,date_year:date,date_month:date,date_day:date where date 
 is the original field and the day_decade etc are the aliases. 
 
 What I am failing to do is to create multiple facet ranges based on these 
 aliased fields e.g:
 
 facet.field={!key=date_month ex=date_month}date_month&
 facet.field={!key=date_day ex=date_day}date&
 facet.range={!key=date_decade ex=date_decade}date_decade&
 facet.range={!key=date_year ex=date_year}date_year&
 f.date_decade.facet.range.start=1600-01-01T00:00:00Z&f.date_decade.facet.range.end=2000-01-01T00:00:00Z&f.date_decade.facet.range.gap=+10YEARS&
 f.date_year.facet.range.start=1600-01-01T00:00:00Z&f.date_year.facet.range.end=2000-01-01T00:00:00Z&f.date_year.facet.range.gap=+1YEARS
 
 We’re using Solarium here to generate the query and facet ranges but if we 
 can do this in a raw HTTP request, that’s fine. I’m just not sure whether 
 Solr will allow us to generate multiple facet ranges based on a single data 
 field. Or am I approaching the problem in the wrong way?
 
 Server is Solr 4.1
 
 Any help appreciated
 
 -- 
 Dan Field d...@llgc.org.uk mailto:d...@llgc.org.uk   
 Ffôn/Tel. +44 1970 632 582
 Pennaeth Uned DatblyguHead of Development Unit
 Llyfrgell Genedlaethol Cymru  National Library of Wales
 

-- 
Dan Field d...@llgc.org.uk mailto:d...@llgc.org.uk   
Ffôn/Tel. +44 1970 632 582
Pennaeth Uned DatblyguHead of Development Unit
Llyfrgell Genedlaethol Cymru  National Library of Wales



Re: Design optimal Solr Schema

2014-10-30 Thread Alexandre Rafalovitch
I am afraid, it is not very clear what you are trying to do here (the
sentence below). Could you explain again the business level results.
Are you trying to search for words within particular given time range?
Can those words span the segments? Or are you trying to find segments
with all their words from given segments.

Your Solr design should be driven by what you want to find, not what
you have to index.

Regards,
   Alex.
On 30 October 2014 10:27, tomas.kalas kala...@email.cz wrote:
 I want to search for example word which is to time 1.57 (w:HeLLO) AND (t:[0 
 TO 1.57]).


Re: Slow forwarding requests to collection leader

2014-10-30 Thread Matt Hilt
Thanks for the info Daniel. I will go forth and make a better client.


On Oct 29, 2014, at 2:28 AM, Daniel Collins danwcoll...@gmail.com wrote:

 I kind of think this might be working as designed, but I'll be happy to
 be corrected by others :)
 
 We had a similar issue which we discovered by accident, we had 2 or 3
 collections spread across some machines, and we accidentally tried to send
  an indexing request to a node in the cloud that didn't have a replica of
 collection1 (but it had other collections). We saw an instant jump in
 indexing latency to 5s, which given the previous latencies had been ~20ms
 was rather obvious!
 
 Querying seems to be fine with this kind of forwarding approach, but
 indexing would logically require ZK information (to find the right shard
 for the destination collection and the leader of that shard), so I'm
 wondering if a node in the cloud that has a replica of collection1 has that
 information cached, whereas a node in the (same) cloud that only has a
 collection2 replica only has collection2 information cached, and has to go
 to ZK for every forwarding request.
 
 I haven't checked the code recently, but that seems plausible to me. Would
 you really want all your collection2 nodes to be running ZK watches for all
 collection1 updates as well as their own collection2 watches, that would
  clog them up processing updates that, in all honesty, they shouldn't have
 to deal with. Every node in the cloud would have to have a watch on
 everything else which if you have a lot of independent collections would be
 an unnecessary burden on each of them.
 
 If you use SolrJ as a client, that would route to a correct node in the
 cloud (which is what we ended up using through JNI which was
 interesting), but if you are using HTTP to index, that's something your
 application has to take care of.
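 
 A minimal SolrJ sketch (the zkHost string, collection name and field values
 are placeholders) of what that client-side routing looks like:
 
 import org.apache.solr.client.solrj.impl.CloudSolrServer;
 import org.apache.solr.common.SolrInputDocument;
 
 public class IndexViaZk {
   public static void main(String[] args) throws Exception {
     // CloudSolrServer reads cluster state from ZooKeeper and sends updates
     // to the right shard leader, so the client does the routing itself.
     CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
     server.setDefaultCollection("collection1");
     SolrInputDocument doc = new SolrInputDocument();
     doc.addField("id", "doc-1");
     server.add(doc);
     server.commit();
     server.shutdown();
   }
 }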
 
 On 28 October 2014 19:29, Matt Hilt matt.h...@numerica.us wrote:
 
 I have three equal machines each running solr cloud (4.8). I have multiple
 collections that are replicated but not sharded. I also have document
 generation processes running on these nodes which involves querying the
 collection ~5 times per document generated.
 
 Node 1 has a replica of collection A and is running document generation
 code that pushes to the HTTP /update/json hander.
 Node 2 is the leader of collection A.
 Node 3 does not have a replica of node A, but is running document
 generation code for collection A.
 
 The issue I see is that node 1 can push documents into Solr 3-5 times
 faster than node 3 when they both talk to the solr instance on their
 localhost. If either of them talk directly to the solr instance on node 2,
 the performance is excellent (on par with node 1). To me it seems that the
 only difference in these cases is the query/put request forwarding. Does
 this involve some slow zookeeper communication that should be avoided? Any
 other insights?
 
 Thanks





Solr And query

2014-10-30 Thread vsriram30
Hi All,

This might be a simple question. I tried to find a solution, but not exactly
finding what I want. I have the following fields f1, f2 and f3. I want to do
an AND query in these fields. 

If I want to search for a single word in these 3 fields, then I face no
problem. I can simply construct a query like q=f1:word1 AND f2:word2 AND
f3:word3. But if I want to search for more than one word, then I am
required to enclose it in double quotes, e.g. q=f1:"word1 word2" AND f2:word3
AND f3:word4. But the problem I am facing with this approach is that word1 and
word2 have to appear in the same order in field f1. But for my use case, I
don't require that. They can be present anywhere in that field and I want the
same scoring irrespective of where they are present. In simpler words, I just
want basic term matching of those words in that field.

Hence I tried the following solutions,

1. Using a slop query:

I constructed a query like q=f1:"word1 word2"~1000 AND f2:word3 AND f3:word4.
I read that it puts more load on the CPU, as it finds the position difference
between those words and uses it in the score. I just want plain term matching
and I don't require the score to vary based on distance.

2. Using Filter query:

I constructed a query like q=word1 word2&df=f1&fq=f2:word3&fq=f3:word4. The score
is way less, as it is not using the filter query terms for scoring. Also, since I
enabled the filter cache, I don't want these filter queries to be cached. Hence
I don't want to use filter queries for these fields.

3. Using AND operator and df:

I constructed a query like q=word1 word2 AND f2:word3 AND f3:word4&df=f1.
This works perfectly fine, as word1 and word2 are searched in f1 and the other
AND clauses are also working fine. But now, if I want to search for 2 words
in f2 as well, then I am not sure how to construct the query.

e.g. q=word1 word2 AND f2:word3 word4 AND f3:word5 word6&df=f1. Here, word4
and word6 will be searched against field f1. But I want them to be searched on
f2 and f3 respectively. Hence please help me with this.

Thanks,
Sriram



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-And-query-tp4166685.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Migrating cloud to another set of machines

2014-10-30 Thread Jakov Sosic

On 10/30/2014 04:47 AM, Otis Gospodnetic wrote:

Hi/Bok Jakov,

2) sounds good to me.  It means no down-time.  1) means stoppage.  If
stoppage is not OK, but falling behind with indexing new content is OK, you
could:
* add a new cluster
* start reading from old index and indexing into the new index
* stop old cluster when done
* index new content to new cluster (or maybe you can be doing this all
along if indexing old + new at the same time is OK for you)
--


Thank you for suggestions Otis.

Everything is acceptable currently, but in the future as the data grows, 
we will certainly enter those edge cases where neither stopping indexing 
nor stopping queries will be acceptable.


What makes things a little bit more problematic is that ZooKeepers are 
migrating also to new machines.





Re: issue related to blank value in datefield

2014-10-30 Thread Chris Hostetter

Solr has never really worked well with years prior to 1 because the 
specs for how they should be formatted/parsed -- in particular related to 
year 0 -- have always been painfully ambiguous/contradictory.

https://issues.apache.org/jira/browse/SOLR-2773

If you are really trying to deal with year 0 and dates that are BC 
then the current TrieDateField code probably isn't going to work well for 
you -- but if your goal, as you said, is to index -00-00T00:00:00Z for 
documents that have no value in the date field -- I have to ask why?

the best solution is to not index anything in that field for those 
documents -- that should give you the optimal behavior in all 
situations (queries, faceting, returned documents, etc...) 

so why do you want to put -00-00T00:00:00Z in these documents?

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an XY Problem ... that is: you are dealing
with X, you are assuming Y will help you, and you are asking about Y
without giving more details about the X so that we can understand the
full issue.  Perhaps the best solution doesn't involve Y at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341




: Date: Thu, 30 Oct 2014 14:09:13 +0530
: From: Aman Tandon amantandon...@gmail.com
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org solr-user@lucene.apache.org
: Subject: issue related to blank value in datefield
: 
: Hi,
: 
: I wants to set -00-00T00:00:00Z value for date field where I do not
: have the value. When the index the at field with value as desired it is
: getting indexed as 0002-11-30T00:00:00Z.
: 
: What is the reason behind this?
: 
: With Regards
: Aman Tandon
: 

-Hoss
http://www.lucidworks.com/


Boosting on field-not-empty

2014-10-30 Thread Håvard Wahl Kongsgård
Hi, a simple question: how do I boost on field-not-empty? For some reason
Solr (4.6) returns rows with empty fields first (while the fields are not
part of the search query).

I came across this old thread
http://grokbase.com/t/lucene/solr-user/125e4yenha/boosting-on-field-empty-or-not
, but no solution



-- 
Håvard Wahl Kongsgård


Automating Solr

2014-10-30 Thread Craig Hoffman
Simple question:
What is best way to automate re-indexing Solr? Setup a CRON JOB / Curl Script? 

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman















Re: Automating Solr

2014-10-30 Thread Alexandre Rafalovitch
You don't reindex Solr. You reindex data into Solr. So, this depends
where your data is coming from and how often it changes. If the data
does not change, no point re-indexing it. And how do you get the data
into the Solr in the first place?

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On 30 October 2014 13:58, Craig Hoffman mountain@gmail.com wrote:
 Simple question:
 What is best way to automate re-indexing Solr? Setup a CRON JOB / Curl Script?

 Thanks,
 Craig
 --
 Craig Hoffman
 w: http://www.craighoffmanphotography.com
 FB: www.facebook.com/CraigHoffmanPhotography
 TW: https://twitter.com/craiglhoffman















Re: Automating Solr

2014-10-30 Thread Craig Hoffman
Right, of course. The data changes every few days. According to this
article, you can run a CRON Job to create a new index.
http://www.finalconcept.com.au/article/view/apache-solr-hints-and-tips

On Thu, Oct 30, 2014 at 12:04 PM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 You don't reindex Solr. You reindex data into Solr. So, this depends
 where you data is coming from and how often it changes. If the data
 does not change, no point re-indexing it. And how do you get the data
 into the Solr in the first place?

 Regards,
Alex.
 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


 On 30 October 2014 13:58, Craig Hoffman mountain@gmail.com wrote:
  Simple question:
  What is best way to automate re-indexing Solr? Setup a CRON JOB / Curl
 Script?
 
  Thanks,
  Craig
  --
  Craig Hoffman
  w: http://www.craighoffmanphotography.com
  FB: www.facebook.com/CraigHoffmanPhotography
  TW: https://twitter.com/craiglhoffman
 
 
 
 
 
 
 
 
 
 
 
 
 




-- 
__
Craig Hoffman
iChat / AIM:mountain.do
__


Re: Automating Solr

2014-10-30 Thread Craig Hoffman
The data gets into Solr via MySQL script.
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman













 On Oct 30, 2014, at 12:11 PM, Craig Hoffman mountain@gmail.com wrote:
 
 Right, of course. The data changes every few days. According to this article, 
 you can run a CRON Job to create a new index.
 http://www.finalconcept.com.au/article/view/apache-solr-hints-and-tips 
 http://www.finalconcept.com.au/article/view/apache-solr-hints-and-tips
 
 On Thu, Oct 30, 2014 at 12:04 PM, Alexandre Rafalovitch arafa...@gmail.com 
 mailto:arafa...@gmail.com wrote:
 You don't reindex Solr. You reindex data into Solr. So, this depends
 where you data is coming from and how often it changes. If the data
 does not change, no point re-indexing it. And how do you get the data
 into the Solr in the first place?
 
 Regards,
Alex.
 Personal: http://www.outerthoughts.com/ http://www.outerthoughts.com/ and 
 @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ 
 http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 
 https://www.linkedin.com/groups?gid=6713853
 
 
 On 30 October 2014 13:58, Craig Hoffman mountain@gmail.com 
 mailto:mountain@gmail.com wrote:
  Simple question:
  What is best way to automate re-indexing Solr? Setup a CRON JOB / Curl 
  Script?
 
  Thanks,
  Craig
  --
  Craig Hoffman
  w: http://www.craighoffmanphotography.com 
  http://www.craighoffmanphotography.com/
  FB: www.facebook.com/CraigHoffmanPhotography 
  http://www.facebook.com/CraigHoffmanPhotography
  TW: https://twitter.com/craiglhoffman https://twitter.com/craiglhoffman
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 -- 
 __
 Craig Hoffman 
 iChat / AIM:mountain.do
 __



Re: Automating Solr

2014-10-30 Thread Håvard Wahl Kongsgård
Then you have to run it again and again
On 30 Oct 2014 19:18, Craig Hoffman mountain@gmail.com wrote:

 The data gets into Solr via MySQL script.
 --
 Craig Hoffman
 w: http://www.craighoffmanphotography.com
 FB: www.facebook.com/CraigHoffmanPhotography
 TW: https://twitter.com/craiglhoffman













  On Oct 30, 2014, at 12:11 PM, Craig Hoffman mountain@gmail.com
 wrote:
 
  Right, of course. The data changes every few days. According to this
 article, you can run a CRON Job to create a new index.
  http://www.finalconcept.com.au/article/view/apache-solr-hints-and-tips 
 http://www.finalconcept.com.au/article/view/apache-solr-hints-and-tips
 
  On Thu, Oct 30, 2014 at 12:04 PM, Alexandre Rafalovitch 
 arafa...@gmail.com mailto:arafa...@gmail.com wrote:
  You don't reindex Solr. You reindex data into Solr. So, this depends
  where you data is coming from and how often it changes. If the data
  does not change, no point re-indexing it. And how do you get the data
  into the Solr in the first place?
 
  Regards,
 Alex.
  Personal: http://www.outerthoughts.com/ http://www.outerthoughts.com/
 and @arafalov
  Solr resources and newsletter: http://www.solr-start.com/ 
 http://www.solr-start.com/ and @solrstart
  Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
 https://www.linkedin.com/groups?gid=6713853
 
 
  On 30 October 2014 13:58, Craig Hoffman mountain@gmail.com mailto:
 mountain@gmail.com wrote:
   Simple question:
   What is best way to automate re-indexing Solr? Setup a CRON JOB / Curl
 Script?
  
   Thanks,
   Craig
   --
   Craig Hoffman
   w: http://www.craighoffmanphotography.com 
 http://www.craighoffmanphotography.com/
   FB: www.facebook.com/CraigHoffmanPhotography 
 http://www.facebook.com/CraigHoffmanPhotography
   TW: https://twitter.com/craiglhoffman 
 https://twitter.com/craiglhoffman
  
  
  
  
  
  
  
  
  
  
  
  
  
 
 
 
  --
  __
  Craig Hoffman
  iChat / AIM:mountain.do
  __




Re: Automating Solr

2014-10-30 Thread Alexandre Rafalovitch
Do you mean DataImportHandler? If so, you can create full and
incremental queries and trigger them - from CRON - as often as you
would like. E.g. 1am nightly.

Regards,
   Alex.
On 30 October 2014 14:17, Craig Hoffman mountain@gmail.com wrote:
 The data gets into Solr via MySQL script.


Re: Automating Solr

2014-10-30 Thread Ramzi Alqrainy
Simply add this line to your crontab with the crontab -e command:

0,30 * * * * /usr/bin/wget
http://solr_host:8983/solr/core_name/dataimport?command=full-import 

This will run a full import every 30 minutes. Replace solr_host and core_name
with your configuration.

*Using delta-import command*

Delta Import operation can be started by hitting the URL
http://localhost:8983/solr/dataimport?command=delta-import. This operation
will be started in a new thread and the status attribute in the response
should be shown busy now. Depending on the size of your data set, this
operation may take some time. At any time, you can hit
http://localhost:8983/solr/dataimport to see the status flag.

When delta-import command is executed, it reads the start time stored in
conf/dataimport.properties. It uses that timestamp to run delta queries and
after completion, updates the timestamp in conf/dataimport.properties.
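
Similarly, if you want the delta import on a schedule, a crontab entry along
the same lines (host, core and interval are just examples) would be:

0,30 * * * * /usr/bin/wget http://solr_host:8983/solr/core_name/dataimport?command=delta-import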

Note: there is an alternative approach for updating documents in Solr, which
is in many cases more efficient and also requires less configuration
explained on DataImportHandlerDeltaQueryViaFullImport.

*Delta-Import Example*

We will use the same example database used in the full import example. Note
that the database schema has been updated and each table contains an
additional column last_modified of timestamp type. You may want to download
the database again since it has been updated recently. We use this timestamp
field to determine what rows in each table have changed since the last
indexed time.

Take a look at the following data-config.xml


<dataConfig>
    <dataSource driver="org.hsqldb.jdbcDriver"
                url="jdbc:hsqldb:/temp/example/ex" user="sa" />
    <document name="products">
        <entity name="item" pk="ID"
                query="select * from item"
                deltaImportQuery="select * from item where ID='${dih.delta.id}'"
                deltaQuery="select id from item where last_modified &gt; '${dih.last_index_time}'">
            <entity name="feature" pk="ITEM_ID"
                    query="select description as features from feature where item_id='${item.ID}'">
            </entity>
            <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                    query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                <entity name="category" pk="ID"
                        query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>
Pay attention to the deltaQuery attribute which has an SQL statement capable
of detecting changes in the item table. Note the variable
${dataimporter.last_index_time} The DataImportHandler exposes a variable
called last_index_time which is a timestamp value denoting the last time
full-import 'or' delta-import was run. You can use this variable anywhere in
the SQL you write in data-config.xml and it will be replaced by the value
during processing.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Automating-Solr-tp4166696p4166707.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Boosting on field-not-empty

2014-10-30 Thread Ramzi Alqrainy
You can use a FunctionQuery, which allows one to use the actual value of a field
and functions of those fields in a relevancy score.

Two function will help you, which are :

*exists*

exists(field|function) returns true if a value exists for a given document.

Example use: exists(myField) will return true if myField has a value, while
exists(query({!v='year:2012'})) will return true for docs with year=2012.

*if*

if(expression,trueValue,falseValue) emits trueValue if the expression is
true, else falseValue. An expression can be any function which outputs
boolean values, or even functions returning numeric values, in which case
value 0 will be interpreted as false, or strings, in which case empty string
is interpreted as false.

Example use: if(exists(myField),100,0) returns 100 if myField exists

*Solution: *

Use it in a parameter that is explicitly for specifying functions, such as the
EDisMax query parser's boost param, or DisMax query parser's bf (boost
function) parameter. (Note that the bf parameter actually takes a list of
function queries separated by white space and each with an optional boost.
Make sure you eliminate any internal white space in single function queries
when using bf). For example:

http://lucene.472066.n3.nabble.com/file/n4166709/Screen_Shot_2014-10-30_at_9.png
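
In case the screenshot above doesn't come through, a rough SolrJ equivalent of
that kind of request is sketched below (the collection URL and the
title/description/popularity field names are just placeholders for your own
schema):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoostIfExists {
    public static void main(String[] args) throws Exception {
        SolrQuery query = new SolrQuery("laptop");          // placeholder search terms
        query.set("defType", "edismax");
        query.set("qf", "title description");               // placeholder query fields
        // multiplicative boost: docs with a value in the (hypothetical) popularity
        // field score 10x higher than docs where it is empty
        query.set("boost", "if(exists(popularity),10,1)");

        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        QueryResponse rsp = server.query(query);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        server.shutdown();
    }
}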
 




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boosting-on-field-not-empty-tp4166692p4166709.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Automating Solr

2014-10-30 Thread Craig Hoffman
Thanks! One more question. wget seems to be choking on my URL, in particular the
# and the & characters. What’s the best method of escaping?

http://My Host 
:8983/solr/#/articles/dataimport//dataimport?command=full-import&clean=true&optimize=true
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman













 On Oct 30, 2014, at 12:30 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com wrote:
 
 Simple add this line to your crontab with crontab -e command:
 
 0,30 * * * * /usr/bin/wget
 http://solr_host:8983/solr/core_name/dataimport?command=full-import 
 
 This will full import every 30 minutes. Replace solr_host and core_name
 with your configuration
 
 *Using delta-import command*
 
 Delta Import operation can be started by hitting the URL
 http://localhost:8983/solr/dataimport?command=delta-import. This operation
 will be started in a new thread and the status attribute in the response
 should be shown busy now. Depending on the size of your data set, this
 operation may take some time. At any time, you can hit
 http://localhost:8983/solr/dataimport to see the status flag.
 
 When delta-import command is executed, it reads the start time stored in
 conf/dataimport.properties. It uses that timestamp to run delta queries and
 after completion, updates the timestamp in conf/dataimport.properties.
 
 Note: there is an alternative approach for updating documents in Solr, which
 is in many cases more efficient and also requires less configuration
 explained on DataImportHandlerDeltaQueryViaFullImport.
 
 *Delta-Import Example*
 
 We will use the same example database used in the full import example. Note
 that the database schema has been updated and each table contains an
 additional column last_modified of timestamp type. You may want to download
 the database again since it has been updated recently. We use this timestamp
 field to determine what rows in each table have changed since the last
 indexed time.
 
 Take a look at the following data-config.xml
 
 
 <dataConfig>
     <dataSource driver="org.hsqldb.jdbcDriver"
                 url="jdbc:hsqldb:/temp/example/ex" user="sa" />
     <document name="products">
         <entity name="item" pk="ID"
                 query="select * from item"
                 deltaImportQuery="select * from item where ID='${dih.delta.id}'"
                 deltaQuery="select id from item where last_modified &gt; '${dih.last_index_time}'">
             <entity name="feature" pk="ITEM_ID"
                     query="select description as features from feature where item_id='${item.ID}'">
             </entity>
             <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                     query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                 <entity name="category" pk="ID"
                         query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">
                 </entity>
             </entity>
         </entity>
     </document>
 </dataConfig>
 Pay attention to the deltaQuery attribute which has an SQL statement capable
 of detecting changes in the item table. Note the variable
 ${dataimporter.last_index_time} The DataImportHandler exposes a variable
 called last_index_time which is a timestamp value denoting the last time
 full-import 'or' delta-import was run. You can use this variable anywhere in
 the SQL you write in data-config.xml and it will be replaced by the value
 during processing.
 
 
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Automating-Solr-tp4166696p4166707.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Automating Solr

2014-10-30 Thread Michael Della Bitta

You probably just need to put double quotes around the url.


On 10/30/14 15:27, Craig Hoffman wrote:

Thanks! One more question. wget seems to be choking on my URL, in particular the #
and the & characters. What’s the best method of escaping?

http://My Host 
:8983/solr/#/articles/dataimport//dataimport?command=full-import&clean=true&optimize=true
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman














On Oct 30, 2014, at 12:30 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com wrote:

Simple add this line to your crontab with crontab -e command:

0,30 * * * * /usr/bin/wget
http://solr_host:8983/solr/core_name/dataimport?command=full-import

This will full import every 30 minutes. Replace solr_host and core_name
with your configuration

*Using delta-import command*

Delta Import operation can be started by hitting the URL
http://localhost:8983/solr/dataimport?command=delta-import. This operation
will be started in a new thread and the status attribute in the response
should be shown busy now. Depending on the size of your data set, this
operation may take some time. At any time, you can hit
http://localhost:8983/solr/dataimport to see the status flag.

When delta-import command is executed, it reads the start time stored in
conf/dataimport.properties. It uses that timestamp to run delta queries and
after completion, updates the timestamp in conf/dataimport.properties.

Note: there is an alternative approach for updating documents in Solr, which
is in many cases more efficient and also requires less configuration
explained on DataImportHandlerDeltaQueryViaFullImport.

*Delta-Import Example*

We will use the same example database used in the full import example. Note
that the database schema has been updated and each table contains an
additional column last_modified of timestamp type. You may want to download
the database again since it has been updated recently. We use this timestamp
field to determine what rows in each table have changed since the last
indexed time.

Take a look at the following data-config.xml


<dataConfig>
    <dataSource driver="org.hsqldb.jdbcDriver"
                url="jdbc:hsqldb:/temp/example/ex" user="sa" />
    <document name="products">
        <entity name="item" pk="ID"
                query="select * from item"
                deltaImportQuery="select * from item where ID='${dih.delta.id}'"
                deltaQuery="select id from item where last_modified &gt; '${dih.last_index_time}'">
            <entity name="feature" pk="ITEM_ID"
                    query="select description as features from feature where item_id='${item.ID}'">
            </entity>
            <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
                    query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
                <entity name="category" pk="ID"
                        query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>
Pay attention to the deltaQuery attribute which has an SQL statement capable
of detecting changes in the item table. Note the variable
${dataimporter.last_index_time} The DataImportHandler exposes a variable
called last_index_time which is a timestamp value denoting the last time
full-import 'or' delta-import was run. You can use this variable anywhere in
the SQL you write in data-config.xml and it will be replaced by the value
during processing.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Automating-Solr-tp4166696p4166707.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Automating Solr

2014-10-30 Thread Shawn Heisey
On 10/30/2014 1:27 PM, Craig Hoffman wrote:
 Thanks! One more question. wget seems to be choking on my URL, in particular
 the # and the & characters. What’s the best method of escaping?

 http://My Host 
 :8983/solr/#/articles/dataimport//dataimport?command=full-import&clean=true&optimize=true

Putting the URL in quotes would work ... but if you are calling a Solr
URL with /#/ in it, you're doing it wrong.

URLs with /#/ in them are specifically for the admin UI.  They only work
properly in a browser, where javascript and AJAX are available.  They
will NOT work like you expect with wget, even if you get the URL escaped
properly.

See the cron example that Ramzi Alqrainy gave you for the proper way of
requesting a full-import.
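
(For the URL above that would be something like
"http://yourhost:8983/solr/articles/dataimport?command=full-import&clean=true&optimize=true"
-- in quotes so the shell doesn't eat the & -- assuming the core really is
named articles.)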

Thanks,
Shawn



Re: Automating Solr

2014-10-30 Thread Craig Hoffman
Thanks everyone. I got it working.
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman













 On Oct 30, 2014, at 1:48 PM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 10/30/2014 1:27 PM, Craig Hoffman wrote:
 Thanks! One more question. wget seems to be choking on my URL, in particular
 the # and the & characters. What’s the best method of escaping?

 http://My Host 
 :8983/solr/#/articles/dataimport//dataimport?command=full-import&clean=true&optimize=true
 
 Putting the URL in quotes would work ... but if you are calling a Solr
 URL with /#/ in it, you're doing it wrong.
 
 URLs with /#/ in them are specifically for the admin UI.  They only work
 properly in a browser, where javascript and AJAX are available.  They
 will NOT work like you expect with wget, even if you get the URL escaped
 properly.
 
 See the cron example that Ramzi Alqrainy gave you for the proper way of
 requesting a full-import.
 
 Thanks,
 Shawn
 



Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Howdy all -

The short version is: We are not seeing Solr Cloud performance scale (even
close to) linearly as we add nodes. Can anyone suggest good diagnostics for
finding scaling bottlenecks? Are there known 'gotchas' that make Solr Cloud
fail to scale?

In detail:

We have used Solr (in non-Cloud mode) for over a year and are now beginning
a transition to SolrCloud.  To this end I have been running some basic load
tests to figure out what kind of capacity we should expect to provision.
In short, I am seeing very poor scalability (increase in effective QPS) as
I add Solr nodes.  I'm hoping to get some ideas on where I should be
looking to debug this.  Apologies in advance for the length of this email;
I'm trying to be comprehensive and provide all relevant information.

Our setup:

1 load generating client
 - generates tiny, fake documents with unique IDs
 - performs only writes (no queries at all)
 - chooses a random solr server for each ADD request (with 1 doc per add
request)

N collections spread over K solr servers
 - every collection is sharded K times (so every solr instance has 1 shard
from every collection)
 - no replicas
 - external zookeeper server (not using zkRun)
 - autoCommit maxTime=60000
 - autoSoftCommit maxTime=15000

Everything is running within a single zone on Google Compute Engine, so
high quality gigabit network links between all machines (ping times < 1ms).

My methodology is as follows.
1. Start up K solr servers.
2. Remove all existing collections.
3. Create N collections, with numShards=K for each.
4. Start load testing.  Every minute, print the number of successful
updates and the number of failed updates.
5. Keep increasing the offered load (via simulated users) until the qps
flatlines.

In brief (more detailed results at the bottom of email), I find that for
any number of nodes between 2 and 5, the QPS always caps out at ~3000.
Obviously something must be wrong here, as there should be a trend of the
QPS scaling (roughly) linearly with the number of nodes.  Or at the very
least going up at all!

So my question is what else should I be looking at here?

* CPU on the loadtest client is well under 100%
* No other obvious bottlenecks on loadtest client (running 2 clients leads
to ~1/2 qps on each)
* In many cases, CPU on the solr servers is quite low as well (e.g. with
100 users hitting 5 solr nodes, all nodes are 50% idle)
* Network bandwidth is a few MB/s, well under the gigabit capacity of our
network
* Disk bandwidth (< 2 MB/s) and iops (< 20/s) are low.

Any ideas?  Thanks very much!
- Ian


p.s. Here is my raw data broken out by number of nodes and number of
simulated users:


Num Nodes   Num Users   QPS
1           1           1020
1           5           3180
1           10          3825
1           15          3900
1           20          4050
1           40          4100
2           1           472
2           5           1790
2           10          2290
2           15          2850
2           20          2900
2           40          3210
2           60          3200
2           80          3210
2           100         3180
3           1           385
3           5           1580
3           10          2090
3           15          2560
3           20          2760
3           25          2890
3           80          3050
4           1           375
4           5           1560
4           10          2200
4           15          2500
4           20          2700
4           25          2800
4           30          2850
5           15          2450
5           20          2640
5           25          2790
5           30          2840
5           100         2900
5           200         2810


Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Shawn Heisey
On 10/30/2014 2:23 PM, Ian Rose wrote:
 My methodology is as follows.
 1. Start up a K solr servers.
 2. Remove all existing collections.
 3. Create N collections, with numShards=K for each.
 4. Start load testing.  Every minute, print the number of successful
 updates and the number of failed updates.
 5. Keep increasing the offered load (via simulated users) until the qps
 flatlines.

If you want to increase QPS, you should not be increasing numShards. 
You need to increase replicationFactor.  When your numShards matches the
number of servers, every single server will be doing part of the work
for every query.  If you increase replicationFactor instead, then each
server can be doing a different query in parallel.

Sharding the index is what you need to do when you need to scale the
size of the index, so each server does not get overwhelmed by dealing
with every document for every query.

Getting a high QPS with a big index requires increasing both numShards
*AND* replicationFactor.
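
For example, with four servers, creating the collection with something like
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2
(name and numbers adjusted to your setup) splits the index across two shards
and keeps two copies of each, so you get both the size scaling and the
parallel query capacity.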

Thanks,
Shawn



Re: Boosting on field-not-empty

2014-10-30 Thread Håvard Wahl Kongsgård
Thanks :)

On Thu, Oct 30, 2014 at 7:49 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com
wrote:

 You can use FunctionQuery that allows one to use the actual value of a
 field
 and functions of those fields in a relevancy score.

 Two function will help you, which are :

 *exists*

 exists(field|function) returns true if a value exists for a given document.

 Example use: exists(myField) will return true if myField has a value, while
 exists(query({!v='year:2012'})) will return true for docs with year=2012.

 *if*

 if(expression,trueValue,falseValue) emits trueValue if the expression is
 true, else falseValue. An expression can be any function which outputs
 boolean values, or even functions returning numeric values, in which case
 value 0 will be interpreted as false, or strings, in which case empty
 string
 is interpreted as false.

 Example use: if(exists(myField),100,0) returns 100 if myField exists

 *Solution: *

 Use in a parameter that is explicitly for specifying functions, such as the
 EDisMax query parser's boost param, or DisMax query parser's bf (boost
 function) parameter. (Note that the bf parameter actually takes a list of
 function queries separated by white space and each with an optional boost.
 Make sure you eliminate any internal white space in single function queries
 when using bf). For example:

 
 http://lucene.472066.n3.nabble.com/file/n4166709/Screen_Shot_2014-10-30_at_9.png
 




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Boosting-on-field-not-empty-tp4166692p4166709.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose

 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.



I think this is true only for actual queries, right?  I am not issuing any
queries, only writes (document inserts).  In the case of writes, increasing
the number of shards should increase my throughput (in ops/sec) more or
less linearly, right?


On Thu, Oct 30, 2014 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote:

 On 10/30/2014 2:23 PM, Ian Rose wrote:
  My methodology is as follows.
  1. Start up a K solr servers.
  2. Remove all existing collections.
  3. Create N collections, with numShards=K for each.
  4. Start load testing.  Every minute, print the number of successful
  updates and the number of failed updates.
  5. Keep increasing the offered load (via simulated users) until the qps
  flatlines.

 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.  If you increase replicationFactor instead, then each
 server can be doing a different query in parallel.

 Sharding the index is what you need to do when you need to scale the
 size of the index, so each server does not get overwhelmed by dealing
 with every document for every query.

 Getting a high QPS with a big index requires increasing both numShards
 *AND* replicationFactor.

 Thanks,
 Shawn




Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Matt Hilt
If you are issuing writes to shard non-leaders, then there is a large overhead 
for the eventual redirect to the leader. I noticed a 3-5 times performance 
increase by making my write client leader aware.


On Oct 30, 2014, at 2:56 PM, Ian Rose ianr...@fullstory.com wrote:

 
 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.
 
 
 
 I think this is true only for actual queries, right?  I am not issuing any
 queries, only writes (document inserts).  In the case of writes, increasing
 the number of shards should increase my throughput (in ops/sec) more or
 less linearly, right?
 
 
 On Thu, Oct 30, 2014 at 4:50 PM, Shawn Heisey apa...@elyograg.org wrote:
 
 On 10/30/2014 2:23 PM, Ian Rose wrote:
 My methodology is as follows.
 1. Start up a K solr servers.
 2. Remove all existing collections.
 3. Create N collections, with numShards=K for each.
 4. Start load testing.  Every minute, print the number of successful
 updates and the number of failed updates.
 5. Keep increasing the offered load (via simulated users) until the qps
 flatlines.
 
 If you want to increase QPS, you should not be increasing numShards.
 You need to increase replicationFactor.  When your numShards matches the
 number of servers, every single server will be doing part of the work
 for every query.  If you increase replicationFactor instead, then each
 server can be doing a different query in parallel.
 
 Sharding the index is what you need to do when you need to scale the
 size of the index, so each server does not get overwhelmed by dealing
 with every document for every query.
 
 Getting a high QPS with a big index requires increasing both numShards
 *AND* replicationFactor.
 
 Thanks,
 Shawn
 
 



smime.p7s
Description: S/MIME cryptographic signature


Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Shawn Heisey
On 10/30/2014 2:56 PM, Ian Rose wrote:
 I think this is true only for actual queries, right? I am not issuing
 any queries, only writes (document inserts). In the case of writes,
 increasing the number of shards should increase my throughput (in
 ops/sec) more or less linearly, right?

No, that won't affect indexing speed all that much.  The way to increase
indexing speed is to increase the number of processes or threads that
are indexing at the same time.  Instead of having one client sending
update requests, try five of them.  Also, index many documents with each
update request.  Sending one document at a time is very inefficient.
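
For what it's worth, here is a rough SolrJ sketch of that batching pattern
(URL and field names are placeholders; the same idea applies if you are
posting JSON/XML over HTTP -- put many docs in each request):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);                 // placeholder fields
            doc.addField("title", "tiny fake doc " + i);
            batch.add(doc);
            if (batch.size() >= 1000) {                     // many docs per update request
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                                    // one explicit commit at the end
        server.shutdown();
    }
}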

You didn't say how you're doing commits, but those need to be as
infrequent as you can manage.  Ideally, you would use autoCommit with
openSearcher=false on an interval of about five minutes, and send an
explicit commit (with the default openSearcher=true) after all the
indexing is done.

You may have requirements regarding document visibility that this won't
satisfy, but try to avoid doing commits with openSearcher=true (soft
commits qualify for this) extremely frequently, like once a second. 
Once a minute is much more realistic.  Opening a new searcher is an
expensive operation, especially if you have cache warming configured.

Thanks,
Shawn



Missing Records

2014-10-30 Thread AJ Lemke
Hi All,

We have a SOLR cloud instance that has been humming along nicely for months.
Last week we started experiencing missing records.

Admin DIH Example:
Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s)
A *:* search claims that there are only 903,902; this is the first full index.
Subsequent full indexes give the following counts for the *:* search
903,805
903,665
826,357

All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0, Processed: 
903,993 (x/s) every time. ---records per second is variable


I found an item that should be in the index but is not found in a search.

Here are the referenced lines of the log file.

DEBUG - 2014-10-30 15:10:51.160; 
org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE 
add{,id=750041421} 
{{params(debug=falseoptimize=trueindent=truecommit=trueclean=truewt=jsoncommand=full-importentity=adsverbose=false),defaults(config=data-config.xml)}}
DEBUG - 2014-10-30 15:10:51.160; org.apache.solr.update.SolrCmdDistributor; 
sending update to http://192.168.20.57:7574/solr/inventory_shard1_replica2/ 
retry:0 add{,id=750041421} 
params:update.distrib=TOLEADERdistrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica1%2F

--- there are 746 lines of log between entries ---

DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire;   
[0x2][0xc3][0xe0]params[0xa2][0xe0].update.distrib(TOLEADER[0xe0],distrib.from?[0x17]http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]delByQ[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
 
Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2DivisionName_Lower,recreational[0xe0]latlon042.4893,-96.3693[0xe0]*PhotoCount!8[0xe0](HasVideo[0x2][0xe0]ID)750041421[0xe0]Engine
 [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux 
City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162 
Long Track 
[0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0xe0]+Description?VThis
 Bad boy will pull you through the deepest snow!With the 162 track and 1000cc 
of power you can fly up any 
hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission 
[0xe0]*ModelFacet7Ski-Doo|Summit Highmark[0xe0]/DealerNameFacet9Certified Auto, 
Inc.|4150[0xe0])StateAbbrIA[0xe0])ClassName+Snowmobiles[0xe0](DealerID$4150[0xe0]AdCode$DX1Q[0xe0]*DealerName4Certified
 Auto, 
Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorColor+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000
 
SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID12[0xe0].FuelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certified
 Auto, Inc.|Sioux City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit 
Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber000105[0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit
 highmark[\n]
What could be the issue and how does one fix this issue?

Thanks so much and if more information is needed I have preserved the log files.

AJ


Re: Indexing documents/files for production use

2014-10-30 Thread Olivier Austina
Thank you Alexandre, Jürgen and Erick for your replies. It is clear for me.

Regards
Olivier


2014-10-28 23:35 GMT+01:00 Erick Erickson erickerick...@gmail.com:

 And one other consideration in addition to the two excellent responses
 so far

 In a SolrCloud environment, SolrJ via CloudSolrServer will automatically
 route the documents to the correct shard leader, saving some additional
 overhead. Post.jar and cURL send the docs to a node, which in turn
 forward the docs to the correct shard leader which lowers
 throughput

 Best,
 Erick

 On Tue, Oct 28, 2014 at 2:32 PM, Jürgen Wagner (DVT)
 juergen.wag...@devoteam.com wrote:
  Hello Olivier,
for real production use, you won't really want to use any toys like
  post.jar or curl. You want a decent connector to whatever data source
 there
  is, that fetches data, possibly massages it a bit, and then feeds it into
  Solr - by means of SolrJ or directly into the web service of Solr via
 binary
  protocols. This way, you can properly handle incremental feeding,
 processing
  of data from remote locations (with the connector being closer to the
 data
  source), and also source data security. Also think about what happens if
 you
  do processing of incoming documents in Solr. What happens if Tika runs
 out
  of memory because of PDF problems? What if this crashes your Solr node?
 In
  our Solr projects, we generally do not do any sizable processing within
 Solr
  as document processing and document indexing or querying have all
 different
  scaling properties.
 
  Production use most typically is not achieved by deploying a vanilla
 Solr,
  but rather having a bit more glue and wrappage, so the whole will fit
 your
  requirements in terms of functionality, scaling, monitoring and
 robustness.
  Some similar platforms like Elasticsearch try to alleviate these pains of
  going to a production-style infrastructure, but that's at the expense of
  flexibility and comes with limitations.
 
  For proof-of-concept or demonstrator-style applications, the plain tools
 out
  of the box will be fine. For production applications, you want to have
 more
  robust components.
 
  Best regards,
  --Jürgen
 
 
  On 28.10.2014 22:12, Olivier Austina wrote:
 
  Hi All,
 
  I am reading the solr documentation. I have understood that post.jar
  
 http://wiki.apache.org/solr/ExtractingRequestHandler#SimplePostTool_.28post.jar.29
 
  is not meant for production use, cURL
  
 https://cwiki.apache.org/confluence/display/solr/Introduction+to+Solr+Indexing
 
  is not recommanded. Is SolrJ better for production?  Thank you.
  Regards
  Olivier
 
 
 
  --
 
  Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
  уважением
  i.A. Jürgen Wagner
  Head of Competence Center Intelligence
   Senior Cloud Consultant
 
  Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
  Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864
 1543
  E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
 
  
  Managing Board: Jürgen Hatzipantelis (CEO)
  Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
  Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
 
 



Re: Missing Records

2014-10-30 Thread S.L
I am curious, how many shards do you have and what's the replication factor
you are using?

On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke aj.le...@securitylabs.com wrote:

 Hi All,

 We have a SOLR cloud instance that has been humming along nicely for
 months.
 Last week we started experiencing missing records.

 Admin DIH Example:
 Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s)
 A *:* search claims that there are only 903,902 this is the first full
 index.
 Subsequent full indexes give the following counts for the *:* search
 903,805
 903,665
 826,357

 All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0,
 Processed: 903,993 (x/s) every time. ---records per second is variable


 I found an item that should be in the index but is not found in a search.

 Here are the referenced lines of the log file.

 DEBUG - 2014-10-30 15:10:51.160;
 org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE
 add{,id=750041421}
 {{params(debug=falseoptimize=trueindent=truecommit=trueclean=truewt=jsoncommand=full-importentity=adsverbose=false),defaults(config=data-config.xml)}}
 DEBUG - 2014-10-30 15:10:51.160;
 org.apache.solr.update.SolrCmdDistributor; sending update to
 http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0
 add{,id=750041421}
 params:update.distrib=TOLEADERdistrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica1%2F

 --- there are 746 lines of log between entries ---

 DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire;  
 [0x2][0xc3][0xe0]params[0xa2][0xe0].update.distrib(TOLEADER[0xe0],distrib.from?[0x17]
 http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]delByQ[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
 Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2DivisionName_Lower,recreational[0xe0]latlon042.4893,-96.3693[0xe0]*PhotoCount!8[0xe0](HasVideo[0x2][0xe0]ID)750041421[0xe0]Engine
 [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux
 City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162
 Long Track
 [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0xe0]+Description?VThis
 Bad boy will pull you through the deepest snow!With the 162 track and
 1000cc of power you can fly up any
 hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission
 [0xe0]*ModelFacet7Ski-Doo|Summit Highmark[0xe0]/DealerNameFacet9Certified
 Auto,
 Inc.|4150[0xe0])StateAbbrIA[0xe0])ClassName+Snowmobiles[0xe0](DealerID$4150[0xe0]AdCode$DX1Q[0xe0]*DealerName4Certified
 Auto,
 Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorColor+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000
 SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID12[0xe0].FuelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certified
 Auto, Inc.|Sioux City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit
 Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber000105[0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit
 highmark[\n]
 What could be the issue and how does one fix this issue?

 Thanks so much and if more information is needed I have preserved the log
 files.

 AJ



Master Slave set up in Solr Cloud

2014-10-30 Thread S.L
Hi All,

As I previously reported, due to no overlap in terms of the documents in the
SolrCloud replicas of the index shards, I have turned off the replication
and basically have three shards with a replication factor of 1.

It obviously seems this will not be scalable, due to the fact that the same core
will be indexed and queried at the same time, as this is a long-running
indexing task.

My question is: what options do I have to set up replicas of the single
per-shard core outside of the SolrCloud replication factor mechanism,
because that does not seem to work for me?


Thanks.


Re: Migrating cloud to another set of machines

2014-10-30 Thread Otis Gospodnetic
I think ZK stuff may actually be easier to handle, no?
Add new ones to the existing ZK cluster and then remove the old ones.
Won't this work smoothly?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Thu, Oct 30, 2014 at 1:16 PM, Jakov Sosic jso...@gmail.com wrote:

 On 10/30/2014 04:47 AM, Otis Gospodnetic wrote:

 Hi/Bok Jakov,

 2) sounds good to me.  It means no down-time.  1) means stoppage.  If
 stoppage is not OK, but falling behind with indexing new content is OK,
 you
 could:
 * add a new cluster
 * start reading from old index and indexing into the new index
 * stop old cluster when done
 * index new content to new cluster (or maybe you can be doing this all
 along if indexing old + new at the same time is OK for you)
 --


 Thank you for suggestions Otis.

 Everything is acceptable currently, but in the future as the data grows,
 we will certainly enter those edge cases where neither stopping indexing
 nor stopping queries will be acceptable.

 What makes things a little bit more problematic is that ZooKeepers are
 migrating also to new machines.





Re: Solr And query

2014-10-30 Thread vsriram30
Actually I found out how to form the query. I just need to use,

q=f1:(word1 word2) AND f2:(word3 word4) AND f3:(word5 word6)

Thanks,
V.Sriram



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-And-query-tp4166685p4166744.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Sharding configuration

2014-10-30 Thread Erick Erickson
This is not too surprising. There are additional hops necessary for a
cloud setup. This is the sequence, let's say there are 4 shards and the
rows parameter on the query is 10 and you're sorting by score

node1 receives request.
node1 sends the request out to each shard
node1 receives the top 10 doc IDs back with the score (note, not the
_contents_).
node1 sorts the 4 lists of 10 docs into the final top 10.
node1 then requests the actual docs from the nodes that they reside on
node1 then gets the results back and assembles them into a final list
node1 then returns the list to the client.

Contrast this with a single shard
node1 receives the request
node1 finds the top 10 docs locally
node1 return the docs to the client

You should only resort to sharding when you have too many docs
to fit in a single shard (and give you acceptable search times). If
all your docs fit comfortably on a single machine, you can _still_ use
SolrCloud, just with a single shard. This configuration deals with all
the replication, NRT processing, self-repair when nodes go up and
down and all that, but since there's no second trip to get the docs
from shards your query performance won't be affected.

And using SolrCloud with a single shard will essentially scale linearly
as you add nodes for queries.

Best,
Erick


On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz anca.kop...@kelkoo.com wrote:
 Hi,

 You are right, it is a mistake in my phrase, for the tests with 4
 shards/ 4 instances,  the latency was worse (therefore *bigger*) than
 for the tests with one shard.

 In our case, the query rate is high.

 Thanks,
 Anca


 On 10/30/2014 03:48 PM, Shawn Heisey wrote:

 On 10/30/2014 4:32 AM, Anca Kopetz wrote:

 We did some tests with 4 shards / 4 different tomcat instances on the
 same server and the average latency was smaller than the one when having
 only one shard.
 We tested also 2 shards on different servers and the performance results
 were also worse.

 It seems that the sharding does not make any difference for our index in
 terms of latency gains.

 That statement is confusing, because if latency goes down, that's good,
 not worse.

 If you're going to put multiple shards on one server, it should be done
 with one solr/tomcat instance, not multiple.  One instance is perfectly
 capable of dealing with many shards, and has a lot less overhead.  The
 SolrCloud collection create command would need the maxShardsPerNode
 parameter.

 In order to see a gain in performance from multiple shards per server,
 the server must have a lot of CPUs and the query rate must be fairly
 low.  If the query rate is high, then all the CPUs will be busy just
 handling simultaneous queries, so putting multiple shards per server
 will probably slow things down.  When query rate is low, multiple CPUs
 can handle each shard query simultaneously, speeding up the overall query.

 Thanks,
 Shawn


 Kelkoo SAS
 Société par Actions Simplifiée
 Au capital de € 4.168.964,30
 Siège social : 8, rue du Sentier 75002 Paris
 425 093 069 RCS Paris

 This message and its attachments are confidential and intended exclusively
 for their addressees. If you are not the intended recipient of this
 message, please delete it and notify the sender.


Re: Score phrases higher than the records containing the words?

2014-10-30 Thread Erick Erickson
So what happens if you increase the boost to 100? or 20?

The problem is that boosting will always be more art than science.

What about the other 3 possibilities I mentioned?

Basically, you have to tweak things to fit your corpus, and it's often
an empirically determined thing.

Best,
Erick

On Thu, Oct 30, 2014 at 9:14 AM, hschillig mouseywi...@live.com wrote:
 The other ones are still rating higher. I think it's because the other two
 titles contain what 3 times.. the more it says what, the higher it scores.
 I'm not sure what else can be done. Does anybody else have any ideas?



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Score-phrases-higher-than-the-records-containing-the-words-tp4166488p4166656.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow forwarding requests to collection leader

2014-10-30 Thread Erick Erickson
Matt:

You might want to look at SolrJ, in particular with the use of CloudSolrServer.
The big benefit here is that it'll route the docs to the correct leader for each
shard rather than relying on the nodes to communicate with each other.

Here's a SolrJ example. NOTE: it used ConcurrentUpdateSolrServer which
you should replace with CloudSolrServer. Other than making the c'tor work, that
should be the only change you need as far as instantiating the right Solr
Server.

This one connects with a DB and also parses Tika files, but you should be able
to remove all that without too much problem.

https://lucidworks.com/blog/indexing-with-solrj/
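
If it helps, the CloudSolrServer part itself is only a few lines; a bare-bones
sketch (the ZooKeeper address and collection name are placeholders for your
own) looks like:

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // point at the ZooKeeper ensemble, not at an individual Solr node
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");         // placeholder collection name

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "hello solrcloud");
        server.add(doc);                                    // routed to the correct shard leader
        server.commit();
        server.shutdown();
    }
}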

Best,
Erick

On Thu, Oct 30, 2014 at 10:08 AM, Matt Hilt matt.h...@numerica.us wrote:
 Thanks for the info Daniel. I will go forth and make a better client.


 On Oct 29, 2014, at 2:28 AM, Daniel Collins danwcoll...@gmail.com wrote:

 I kind of think this might be working as designed, but I'll be happy to
 be corrected by others :)

 We had a similar issue which we discovered by accident, we had 2 or 3
 collections spread across some machines, and we accidentally tried to send
 an indexing request to a node in teh cloud that didn't have a replica of
 collection1 (but it had other collections). We saw an instant jump in
 indexing latency to 5s, which given the previous latencies had been ~20ms
 was rather obvious!

 Querying seems to be fine with this kind of forwarding approach, but
 indexing would logically require ZK information (to find the right shard
 for the destination collection and the leader of that shard), so I'm
 wondering if a node in the cloud that has a replica of collection1 has that
 information cached, whereas a node in the (same) cloud that only has a
 collection2 replica only has collection2 information cached, and has to go
 to ZK for every forwarding request.

 I haven't checked the code recently, but that seems plausible to me. Would
 you really want all your collection2 nodes to be running ZK watches for all
 collection1 updates as well as their own collection2 watches, that would
 clog them up processing updates that in all honestly, they shouldn't have
 to deal with. Every node in the cloud would have to have a watch on
 everything else which if you have a lot of independent collections would be
 an unnecessary burden on each of them.

 If you use SolrJ as a client, that would route to a correct node in the
 cloud (which is what we ended up using through JNI which was
 interesting), but if you are using HTTP to index, that's something your
 application has to take care of.

 On 28 October 2014 19:29, Matt Hilt matt.h...@numerica.us wrote:

 I have three equal machines each running solr cloud (4.8). I have multiple
 collections that are replicated but not sharded. I also have document
 generation processes running on these nodes which involves querying the
 collection ~5 times per document generated.

 Node 1 has a replica of collection A and is running document generation
 code that pushes to the HTTP /update/json hander.
 Node 2 is the leader of collection A.
 Node 3 does not have a replica of node A, but is running document
 generation code for collection A.

 The issue I see is that node 1 can push documents into Solr 3-5 times
 faster than node 3 when they both talk to the solr instance on their
 localhost. If either of them talk directly to the solr instance on node 2,
 the performance is excellent (on par with node 1). To me it seems that the
 only difference in these cases is the query/put request forwarding. Does
 this involve some slow zookeeper communication that should be avoided? Any
 other insights?

 Thanks



Re: Boosting on field-not-empty

2014-10-30 Thread Erick Erickson
bq: ...while the fields are not part of the search query

I'm really confused. The presence or absence of fields that
aren't part of the search should be totally irrelevant to
scoring. Are you perhaps sorting by a different field?

It'd help if you showed us the query you're sending, a sample
of the doc you're having problems with and, perhaps, the
request handler definition.

Barring all that, perhaps just adding debug=query to your URL
(and perhaps showing us that) would shed some light on this.

Because on the surface, I don't see any connection here.

Best,
Erick



On Thu, Oct 30, 2014 at 1:52 PM, Håvard Wahl Kongsgård
haavard.kongsga...@gmail.com wrote:
 Thanks :)

 On Thu, Oct 30, 2014 at 7:49 PM, Ramzi Alqrainy ramzi.alqra...@gmail.com
 wrote:

 You can use FunctionQuery that allows one to use the actual value of a
 field
 and functions of those fields in a relevancy score.

 Two function will help you, which are :

 *exists*

 exists(field|function) returns true if a value exists for a given document.

 Example use: exists(myField) will return true if myField has a value, while
 exists(query({!v='year:2012'})) will return true for docs with year=2012.

 *if*

 if(expression,trueValue,falseValue) emits trueValue if the expression is
 true, else falseValue. An expression can be any function which outputs
 boolean values, or even functions returning numeric values, in which case
 value 0 will be interpreted as false, or strings, in which case empty
 string
 is interpreted as false.

 Example use: if(exists(myField),100,0) returns 100 if myField exists

 *Solution: *

 Use in a parameter that is explicitly for specifying functions, such as the
 EDisMax query parser's boost param, or DisMax query parser's bf (boost
 function) parameter. (Note that the bf parameter actually takes a list of
 function queries separated by white space and each with an optional boost.
 Make sure you eliminate any internal white space in single function queries
 when using bf). For example:

 
 http://lucene.472066.n3.nabble.com/file/n4166709/Screen_Shot_2014-10-30_at_9.png
 




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Boosting-on-field-not-empty-tp4166692p4166709.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: issue related to blank value in datefield

2014-10-30 Thread Aman Tandon
Hi Chris,

Thanks for replying.

but if your goal, as you said, is to index 0000-00-00T00:00:00Z for
 documents that have no value in the date field -- i have to ask why?


I was just trying to index the fields returned by my MySQL and I found this
issue. So I asked in the group. Sorry for writing a confusing mail.
Actually I just want to know why it is getting stored as
'0002-11-30T00:00:00Z' on indexing the value 0000-00-00T00:00:00Z.

Is it somewhere pre-defined?

With Regards
Aman Tandon

On Thu, Oct 30, 2014 at 11:17 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 Solr has never really worked well with years prior to 1 because the
 specs for how they should be formatted/parsed -- in particular related to
 year 0 -- have always been painfully ambiguous/contradictory.

 https://issues.apache.org/jira/browse/SOLR-2773

 If you are really trying to deal with year 0 and dates that are BC
 then the current TrieDateField code probably isn't going to work well for
 you -- but if your goal, as you said, is to index 0000-00-00T00:00:00Z for
 documents that have no value in the date field -- i have to ask why?

 the best solution is to not index anything in that field for those
 documents -- that will should give you the optimal behavior in all
 situations (queries, faceting, returned documents, etc...)

 so why do you want to put 0000-00-00T00:00:00Z in these documents?

 https://people.apache.org/~hossman/#xyproblem
 XY Problem

 Your question appears to be an XY Problem ... that is: you are dealing
 with X, you are assuming Y will help you, and you are asking about Y
 without giving more details about the X so that we can understand the
 full issue.  Perhaps the best solution doesn't involve Y at all?
 See Also: http://www.perlmonks.org/index.pl?node_id=542341




 : Date: Thu, 30 Oct 2014 14:09:13 +0530
 : From: Aman Tandon amantandon...@gmail.com
 : Reply-To: solr-user@lucene.apache.org
 : To: solr-user@lucene.apache.org solr-user@lucene.apache.org
 : Subject: issue related to blank value in datefield
 :
 : Hi,
 :
 : I wants to set 0000-00-00T00:00:00Z value for date field where I do not
 : have the value. When the index the at field with value as desired it is
 : getting indexed as 0002-11-30T00:00:00Z.
 :
 : What is the reason behind this?
 :
 : With Regards
 : Aman Tandon
 :

 -Hoss
 http://www.lucidworks.com/



Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Erick Erickson
Your indexing client, if written in SolrJ, should use CloudSolrServer
which is, in Matt's terms, leader aware. It divides up the
documents to be indexed into packets where each doc in
the packet belongs on the same shard, and then sends the packet
to the shard leader. This avoids a lot of re-routing and should
scale essentially linearly. You may have to add more clients
though, depending upon how hard the document-generator is
working.

Also, make sure that you send batches of documents as Shawn
suggests, I use 1,000 as a starting point.

Best,
Erick

On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote:
 On 10/30/2014 2:56 PM, Ian Rose wrote:
 I think this is true only for actual queries, right? I am not issuing
 any queries, only writes (document inserts). In the case of writes,
 increasing the number of shards should increase my throughput (in
 ops/sec) more or less linearly, right?

 No, that won't affect indexing speed all that much.  The way to increase
 indexing speed is to increase the number of processes or threads that
 are indexing at the same time.  Instead of having one client sending
 update requests, try five of them.  Also, index many documents with each
 update request.  Sending one document at a time is very inefficient.

 You didn't say how you're doing commits, but those need to be as
 infrequent as you can manage.  Ideally, you would use autoCommit with
 openSearcher=false on an interval of about five minutes, and send an
 explicit commit (with the default openSearcher=true) after all the
 indexing is done.

 You may have requirements regarding document visibility that this won't
 satisfy, but try to avoid doing commits with openSearcher=true (soft
 commits qualify for this) extremely frequently, like once a second.
 Once a minute is much more realistic.  Opening a new searcher is an
 expensive operation, especially if you have cache warming configured.

 Thanks,
 Shawn



Re: Missing Records

2014-10-30 Thread Erick Erickson
First question: Is there any possibility that some of the docs
have duplicate IDs (uniqueKeys)? If so, then some of
the docs will be replaced, which will lower your returns.
One way of figuring this out is to go to the admin screen and if
numDocs < maxDoc, then documents have been replaced.

Also, if numDocs is smaller than 903,993 then you probably have
some docs being replaced. One warning, however: even if numDocs
equals maxDoc, docs could still have been replaced, because when segments
are merged the deleted docs are purged.

Best,
Erick

On Thu, Oct 30, 2014 at 3:12 PM, S.L simpleliving...@gmail.com wrote:
 I am curious , how many shards do you have and whats the replication factor
 you are using ?

 On Thu, Oct 30, 2014 at 5:27 PM, AJ Lemke aj.le...@securitylabs.com wrote:

 Hi All,

 We have a SOLR cloud instance that has been humming along nicely for
 months.
 Last week we started experiencing missing records.

 Admin DIH Example:
 Fetched: 903,993 (736/s), Skipped: 0, Processed: 903,993 (736/s)
 A *:* search claims that there are only 903,902 this is the first full
 index.
 Subsequent full indexes give the following counts for the *:* search
 903,805
 903,665
 826,357

 All the while the admin returns: Fetched: 903,993 (x/s), Skipped: 0,
 Processed: 903,993 (x/s) every time. ---records per second is variable


 I found an item that should be in the index but is not found in a search.

 Here are the referenced lines of the log file.

 DEBUG - 2014-10-30 15:10:51.160;
 org.apache.solr.update.processor.LogUpdateProcessor; PRE_UPDATE
 add{,id=750041421}
 {{params(debug=falseoptimize=trueindent=truecommit=trueclean=truewt=jsoncommand=full-importentity=adsverbose=false),defaults(config=data-config.xml)}}
 DEBUG - 2014-10-30 15:10:51.160;
 org.apache.solr.update.SolrCmdDistributor; sending update to
 http://192.168.20.57:7574/solr/inventory_shard1_replica2/ retry:0
 add{,id=750041421}
 params:update.distrib=TOLEADERdistrib.from=http%3A%2F%2F192.168.20.57%3A8983%2Fsolr%2Finventory_shard1_replica1%2F

 --- there are 746 lines of log between entries ---

 DEBUG - 2014-10-30 15:10:51.340; org.apache.http.impl.conn.Wire;  
 [0x2][0xc3][0xe0]params[0xa2][0xe0].update.distrib(TOLEADER[0xe0],distrib.from?[0x17]
 http://192.168.20.57:8983/solr/inventory_shard1_replica1/[0xe0]delByQ[0x0][0xe0]'docsMap[0xe][0x13][0x10]8[0x8]?[0x80][0x0][0x0][0xe0]#Zip%51106[0xe0]-IsReelCentric[0x2][0xe0](HasPrice[0x1][0xe0]*Make_Lower'ski-doo[0xe0])StateName$Iowa[0xe0]-OriginalModel/Summit
 Highmark[0xe0]/VerticalSiteIDs!2[0xe0]-ClassBinaryIDp@[0xe0]#lat(42.48929[0xe0]-SubClassFacet01704|Snowmobiles[0xe0](FuelType%Other[0xe0]2DivisionName_Lower,recreational[0xe0]latlon042.4893,-96.3693[0xe0]*PhotoCount!8[0xe0](HasVideo[0x2][0xe0]ID)750041421[0xe0]Engine
 [0xe0]*ClassFacet.12|Snowmobiles[0xe0]$Make'Ski-Doo[0xe0]$City*Sioux
 City[0xe0]#lng*-96.369302[0xe0]-Certification!N[0xe0]0EmotionalTagline0162
 Long Track
 [0xe0]*IsEnhanced[0x1][0xe0]*SubClassID$1704[0xe0](NetPrice$4500[0xe0]1IsInternetSpecial[0x2][0xe0](HasPhoto[0x1][0xe0]/DealerSortOrder!2[0xe0]+Description?VThis
 Bad boy will pull you through the deepest snow!With the 162 track and
 1000cc of power you can fly up any
 hill!![0xe0],DealerRadius+8046.72[0xe0],Transmission
 [0xe0]*ModelFacet7Ski-Doo|Summit Highmark[0xe0]/DealerNameFacet9Certified
 Auto,
 Inc.|4150[0xe0])StateAbbrIA[0xe0])ClassName+Snowmobiles[0xe0](DealerID$4150[0xe0]AdCode$DX1Q[0xe0]*DealerName4Certified
 Auto,
 Inc.[0xe0])Condition$Used[0xe0]/Condition_Lower$used[0xe0]-ExteriorColor+Blue/Yellow[0xe0],DivisionName,Recreational[0xe0]$Trim(1000
 SDI[0xe0](SourceID!1[0xe0]0HasAdEnhancement!0[0xe0]'ClassID12[0xe0].FuelType_Lower%other[0xe0]$Year$2005[0xe0]+DealerFacet?[0x8]4150|Certified
 Auto, Inc.|Sioux City|IA[0xe0],SubClassName+Snowmobiles[0xe0]%Model/Summit
 Highmark[0xe0])EntryDate42011-11-17T10:46:00Z[0xe0]+StockNumber000105[0xe0]+PriceRebate!0[0xe0]+Model_Lower/summit
 highmark[\n]
 What could be the issue and how does one fix this issue?

 Thanks so much and if more information is needed I have preserved the log
 files.

 AJ



Re: Migrating cloud to another set of machines

2014-10-30 Thread Erick Erickson
Jakov:

Be particularly aware of the ADDREPLICA collections API
command here:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica

That allows you to specify exactly which node the new replica should be on,
so you can force it to be on the new HW. Here's a guide:

http://heliosearch.org/solrcloud-assigning-nodes-machines/

Best,
Erick

On Thu, Oct 30, 2014 at 3:40 PM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
 I think ZK stuff may actually be easier to handle, no?
 Add new ones to the existing ZK cluster and then remove the old ones.
 Won't this work smoothly?

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr  Elasticsearch Support * http://sematext.com/


 On Thu, Oct 30, 2014 at 1:16 PM, Jakov Sosic jso...@gmail.com wrote:

 On 10/30/2014 04:47 AM, Otis Gospodnetic wrote:

 Hi/Bok Jakov,

 2) sounds good to me.  It means no down-time.  1) means stoppage.  If
 stoppage is not OK, but falling behind with indexing new content is OK,
 you
 could:
 * add a new cluster
 * start reading from old index and indexing into the new index
 * stop old cluster when done
 * index new content to new cluster (or maybe you can be doing this all
 along if indexing old + new at the same time is OK for you)
 --


 Thank you for suggestions Otis.

 Everything is acceptable currently, but in the future as the data grows,
 we will certainly enter those edge cases where neither stopping indexing
 nor stopping queries will be acceptable.

 What makes things a little bit more problematic is that ZooKeepers are
 migrating also to new machines.





Re: issue related to blank value in datefield

2014-10-30 Thread Chris Hostetter

: I was just trying to index the fields returned by my msql and i found this

If you are importing dates from MySQL where you have 0000-00-00T00:00:00Z
as the default value, you should actually be getting an error last time I
checked, but this explains the right way to tell the MySQL JDBC driver not
to give you those values ...

https://wiki.apache.org/solr/DataImportHandlerFaq#Invalid_dates_.28e.g._.220000-00-00.22.29_in_my_MySQL_database_cause_my_import_to_abort

(even if you aren't using DIH to talk to MySQL, the same principle holds
if you are using JDBC; if you are talking to MySQL from some other client
language there should be a similar option)
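
The short version of that FAQ, assuming the standard MySQL Connector/J
driver, is to add zeroDateTimeBehavior=convertToNull to the JDBC URL, e.g.
jdbc:mysql://localhost/mydb?zeroDateTimeBehavior=convertToNull -- the driver
then hands those zero dates back as NULL and Solr simply leaves the field
empty.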

: Actually i just want to know why it is getting stored as '
:  0002-11-30T00:00:00Z' on indexing the value -00-00T00:00:00Z.

like i said: bugs. behavior with Year  is ndefined in alot of the 
underlying date code.  as for what that speciic date? ... no idea.


-Hoss
http://www.lucidworks.com/


Re: Solr And query

2014-10-30 Thread Erick Erickson
Right, but do be aware of one thing. The form
f1:(word1 word2) has an implicit OR between them
based on q.op which is specified in your
solrconfig.xml file for the request handler you're
using.

This is no problem, but if you ever specify q.op as AND
either in solrconfig.xml or as an explicit parameter to the
search you'll get a different logical expression.
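
For example, with the default q.op=OR, f1:(word1 word2) is read as
f1:word1 OR f1:word2, while with q.op=AND it becomes
f1:word1 AND f1:word2, which matches a smaller set of documents.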

Best,
Erick

On Thu, Oct 30, 2014 at 3:45 PM, vsriram30 vsrira...@gmail.com wrote:
 Actually I found out how to form the query. I just need to use,

 q=f1:(word1 word2) AND f2:(word3 word4) AND f3:(word5 word6)

 Thanks,
 V.Sriram



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-And-query-tp4166685p4166744.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Slow forwarding requests to collection leader

2014-10-30 Thread CP Mishra
+1 for CloudSolrServer

CloudSolrServer also has built-in fault tolerance (i.e. if the master shard
is not reachable then it adds the document to a replica) and much better error
reporting than ConcurrentUpdateSolrServer. The only downside is lack of
batching. As long as you are adding documents in decent-sized batches (you can
also use multiple threads to add), you will get good indexing performance.

CP

On Thu, Oct 30, 2014 at 6:53 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Matt:

 You might want to look at SolrJ, in particular with the use of
 CloudSolrServer.
 The big benefit here is that it'll route the docs to the correct leader
 for each
 shard rather than relying on the nodes to communicate with each other.

 Here's a SolrJ example. NOTE: it used ConcurrentUpdateSolrServer which
 you should replace with CloudSolrServer. Other than making the c'tor work,
 that
 should be the only change you need as far as instantiating the right Solr
 Server.

 This one connects with a DB and also parses Tika files, but you should be
 able
 to remove all that without too much problem.

 https://lucidworks.com/blog/indexing-with-solrj/

 Best,
 Erick

 On Thu, Oct 30, 2014 at 10:08 AM, Matt Hilt matt.h...@numerica.us wrote:
  Thanks for the info Daniel. I will go forth and make a better client.
 
 
  On Oct 29, 2014, at 2:28 AM, Daniel Collins danwcoll...@gmail.com
 wrote:
 
  I kind of think this might be working as designed, but I'll be happy
 to
  be corrected by others :)
 
  We had a similar issue which we discovered by accident, we had 2 or 3
  collections spread across some machines, and we accidentally tried to
 send
  an indexing request to a node in teh cloud that didn't have a replica of
  collection1 (but it had other collections). We saw an instant jump in
  indexing latency to 5s, which given the previous latencies had been
 ~20ms
  was rather obvious!
 
  Querying seems to be fine with this kind of forwarding approach, but
  indexing would logically require ZK information (to find the right shard
  for the destination collection and the leader of that shard), so I'm
  wondering if a node in the cloud that has a replica of collection1 has
 that
  information cached, whereas a node in the (same) cloud that only has a
  collection2 replica only has collection2 information cached, and has to
 go
  to ZK for every forwarding request.
 
  I haven't checked the code recently, but that seems plausible to me.
 Would
  you really want all your collection2 nodes to be running ZK watches for
 all
  collection1 updates as well as their own collection2 watches? That would
  clog them up processing updates that, in all honesty, they shouldn't
 have
  to deal with. Every node in the cloud would have to have a watch on
  everything else which if you have a lot of independent collections
 would be
  an unnecessary burden on each of them.
 
  If you use SolrJ as a client, that would route to a correct node in the
  cloud (which is what we ended up using through JNI which was
  interesting), but if you are using HTTP to index, that's something
 your
  application has to take care of.
 
  On 28 October 2014 19:29, Matt Hilt matt.h...@numerica.us wrote:
 
  I have three equal machines each running solr cloud (4.8). I have
 multiple
  collections that are replicated but not sharded. I also have document
  generation processes running on these nodes which involves querying the
  collection ~5 times per document generated.
 
  Node 1 has a replica of collection A and is running document generation
  code that pushes to the HTTP /update/json hander.
  Node 2 is the leader of collection A.
  Node 3 does not have a replica of node A, but is running document
  generation code for collection A.
 
  The issue I see is that node 1 can push documents into Solr 3-5 times
  faster than node 3 when they both talk to the solr instance on their
  localhost. If either of them talk directly to the solr instance on
 node 2,
  the performance is excellent (on par with node 1). To me it seems that
 the
  only difference in these cases is the query/put request forwarding.
 Does
  this involve some slow zookeeper communication that should be avoided?
 Any
  other insights?
 
  Thanks
 



Re: Solr And query

2014-10-30 Thread vsriram30
Thanks Erick. I tried q.op=AND and noticed that it is equivalent to
specifying,
q=f1:"word1 word2" AND f2:"word3 word4" AND f3:"word5 word6"



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-And-query-tp4166685p4166760.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Ian Rose
Thanks for the suggestions so for, all.

1) We are not using SolrJ on the client (not using Java at all) but I am
working on writing a smart router so that we can always send to the
correct node.  I am certainly curious to see how that changes things.
Nonetheless even with the overhead of extra routing hops, the observed
behavior (no increase in performance with more nodes) doesn't make any
sense to me.

2) Commits: we are using autoCommit with openSearcher=false (maxTime=60000)
and autoSoftCommit (maxTime=15000).

3) Suggestions to batch documents certainly make sense for production code
but in this case I am not real concerned with absolute performance; I just
want to see the *relative* performance as we use more Solr nodes.  So I
don't think batching or not really matters.

4) "No, that won't affect indexing speed all that much.  The way to
increase indexing speed is to increase the number of processes or threads
that are indexing at the same time.  Instead of having one client
sending update requests, try five of them."

Can you elaborate on this some?  I'm worried I might be misunderstanding
something fundamental.  A cluster of 3 shards over 3 Solr nodes
*should* support
a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
behind sharding.  Regarding your comment of increase the number of
processes or threads, note that for each value of K (number of Solr nodes)
I measured with several different numbers of simulated users so that I
could find a saturation point.  For example, take a look at my data for
K=2:

Num Nodes   Num Users   QPS
2           1           472
2           5           1790
2           10          2290
2           15          2850
2           20          2900
2           40          3210
2           60          3200
2           80          3210
2           100         3180

It's clear that once the load test client has ~40 simulated users, the Solr
cluster is saturated.  Creating more users just increases the average
request latency, such that the total QPS remained (nearly) constant.  So I
feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps.
The problem is that I am finding roughly this same max point, no matter
how many simulated users the load test client created, for any value of K
(> 1).

Cheers,
- Ian


On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Your indexing client, if written in SolrJ, should use CloudSolrServer
 which is, in Matt's terms, leader aware. It divides up the
 documents to be indexed into packets where each doc in
 the packet belongs on the same shard, and then sends the packet
 to the shard leader. This avoids a lot of re-routing and should
 scale essentially linearly. You may have to add more clients
 though, depending upon how hard the document-generator is
 working.

 Also, make sure that you send batches of documents as Shawn
 suggests, I use 1,000 as a starting point.

 Best,
 Erick

 On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey apa...@elyograg.org wrote:
  On 10/30/2014 2:56 PM, Ian Rose wrote:
  I think this is true only for actual queries, right? I am not issuing
  any queries, only writes (document inserts). In the case of writes,
  increasing the number of shards should increase my throughput (in
  ops/sec) more or less linearly, right?
 
  No, that won't affect indexing speed all that much.  The way to increase
  indexing speed is to increase the number of processes or threads that
  are indexing at the same time.  Instead of having one client sending
  update requests, try five of them.  Also, index many documents with each
  update request.  Sending one document at a time is very inefficient.
 
  You didn't say how you're doing commits, but those need to be as
  infrequent as you can manage.  Ideally, you would use autoCommit with
  openSearcher=false on an interval of about five minutes, and send an
  explicit commit (with the default openSearcher=true) after all the
  indexing is done.
 
  You may have requirements regarding document visibility that this won't
  satisfy, but try to avoid doing commits with openSearcher=true (soft
  commits qualify for this) extremely frequently, like once a second.
  Once a minute is much more realistic.  Opening a new searcher is an
  expensive operation, especially if you have cache warming configured.
 
  Thanks,
  Shawn
 



Re: Solr And query

2014-10-30 Thread Erick Erickson
Ummmm. That may be true for your particular example data set, but not
in the general case, so don't be fooled.

q.op=AND is equivalent to
q=f1:(word1 AND word2) AND f2:(word3 AND word4) AND f3:(word5 AND word6)

This query
q=f1:"word1 word2" AND f2:"word3 word4" AND f3:"word5 word6"
would not match a document like this

f1:word2 word1
f2:word3 word4
f3:word5 word6

since it requires that the words be in order. Whereas

q=f1:(word1 AND word2) AND f2:(word3 AND word4) AND f3:(word5 AND word6)

would match the doc.

Best,
Erick


On Thu, Oct 30, 2014 at 5:46 PM, vsriram30 vsrira...@gmail.com wrote:
 Thanks Erick. I tried q.op=AND and noticed that it is equivalent to
 specifying,
 q=f1:"word1 word2" AND f2:"word3 word4" AND f3:"word5 word6"



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Solr-And-query-tp4166685p4166760.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Ideas for debugging poor SolrCloud scalability

2014-10-30 Thread Erick Erickson
I'm really confused:

bq: I am not issuing any queries, only writes (document inserts)

bq: It's clear that once the load test client has ~40 simulated users

bq: A cluster of 3 shards over 3 Solr nodes *should* support
a higher QPS than 2 shards over 2 Solr nodes, right

QPS is usually used to mean Queries Per Second, which is different from
the statement that I am not issuing any queries. And what do the
number of users have to do with inserting documents?

You also state:  In many cases, CPU on the solr servers is quite low as well

So let's talk about indexing first. Indexing should scale nearly
linearly as long as
1> you are routing your docs to the correct leader, which happens with SolrJ
and the CloudSolrServer automatically. Rather than rolling your own, I strongly
suggest you try this out.
2> you have enough clients feeding the cluster to push CPU utilization
on them all.
Very often slow indexing, or in your case lack of scaling, is a
result of document
acquisition; here, your doc generator is spending all its
time waiting for
the individual documents to get to Solr and come back.

bq: chooses a random solr server for each ADD request (with 1 doc per add
request)

Probably your culprit right there. Each and every document requires its own
trip across the network (and a forward to the correct leader). So given
that you're not seeing high CPU utilization, I suspect that you're not sending
enough docs to SolrCloud fast enough to see scaling. You need to batch up
multiple docs, I generally send 1,000 docs at a time.
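
To make the batching and "enough clients" points concrete, here is a rough,
untested sketch that drives a shared CloudSolrServer from several threads
(the thread count, batch size, ZooKeeper address and id field are all
placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // One shared client; a CloudSolrServer instance is generally safe
        // to share across indexing threads.
        final CloudSolrServer server = new CloudSolrServer("zkhost1:2181,zkhost2:2181");
        server.setDefaultCollection("collection1");

        ExecutorService pool = Executors.newFixedThreadPool(5);  // "try five of them"
        for (int t = 0; t < 5; t++) {
            final int threadId = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                        for (int i = 0; i < 1000; i++) {         // ~1,000 docs per request
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", "thread" + threadId + "-doc" + i);
                            batch.add(doc);
                        }
                        server.add(batch);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        server.commit();
        server.shutdown();
    }
}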

But even if you do solve this, the inter-node routing will prevent
linear scaling.
When a doc (or a batch of docs) goes to a random Solr node, here's what
happens:
1> the docs are re-packaged into groups based on which shard they're
destined for
2> the sub-packets are forwarded to the leader for each shard
3> the responses are gathered back and returned to the client.

This set of operations will eventually degrade the scaling.

bq:  A cluster of 3 shards over 3 Solr nodes *should* support
a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
behind sharding.

If we're talking search requests, the answer is no. Sharding is
what you do when your collection no longer fits on a single node.
If it _does_ fit on a single node, then you'll usually get better query
performance by adding a bunch of replicas to a single shard. When
the number of docs on each shard grows large enough that you
no longer get good query performance, _then_ you shard. And
take the query hit.

If we're talking about inserts, then see above. I suspect your problem is
that you're _not_ saturating the SolrCloud cluster, you're sending
docs to Solr very inefficiently and waiting on I/O. Batching docs and
sending them to the right leader should scale pretty linearly until you
start saturating your network.

Best,
Erick

On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose ianr...@fullstory.com wrote:
 Thanks for the suggestions so for, all.

 1) We are not using SolrJ on the client (not using Java at all) but I am
 working on writing a smart router so that we can always send to the
 correct node.  I am certainly curious to see how that changes things.
 Nonetheless even with the overhead of extra routing hops, the observed
 behavior (no increase in performance with more nodes) doesn't make any
 sense to me.

 2) Commits: we are using autoCommit with openSearcher=false (maxTime=60000)
 and autoSoftCommit (maxTime=15000).

 3) Suggestions to batch documents certainly make sense for production code
 but in this case I am not real concerned with absolute performance; I just
 want to see the *relative* performance as we use more Solr nodes.  So I
 don't think batching or not really matters.

 4) "No, that won't affect indexing speed all that much.  The way to
 increase indexing speed is to increase the number of processes or threads
 that are indexing at the same time.  Instead of having one client
 sending update requests, try five of them."

 Can you elaborate on this some?  I'm worried I might be misunderstanding
 something fundamental.  A cluster of 3 shards over 3 Solr nodes
 *should* support
 a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
 behind sharding.  Regarding your comment of increase the number of
 processes or threads, note that for each value of K (number of Solr nodes)
 I measured with several different numbers of simulated users so that I
 could find a saturation point.  For example, take a look at my data for
 K=2:

 Num Nodes   Num Users   QPS
 2           1           472
 2           5           1790
 2           10          2290
 2           15          2850
 2           20          2900
 2           40          3210
 2           60          3200
 2           80          3210
 2           100         3180

 It's clear that once the load test client has ~40 simulated users, the Solr
 cluster is saturated.  Creating more users just increases the average
 request latency, such that the total QPS remained (nearly) constant.  So I
 feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps.
 The problem is that I am finding roughly this same max