Re: Best practices for Solr (how to update jar files safely)

2016-02-22 Thread Ramkumar R. Aiyengar
I side with Toke on this. Enterprise bare metal machines often have
hundreds of gigs of memory and tens of CPU cores -- you would have to fit
multiple instances on a machine to make full use of that hardware while
circumventing huge heaps.

If this is not a common case now, it could well be in the future the way
hardware evolves -- so I would rather mention the factors which need
multiple instances than discourage them.
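
As an aside, the 32GB compressed-oops threshold Toke mentions below is easy
to check on a given JVM; a quick sketch (heap sizes illustrative):

  java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

With -Xmx31g the UseCompressedOops flag should print as true; at -Xmx32g and
beyond it flips to false, and every object reference then costs 8 bytes
instead of 4.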
On 20 Feb 2016 14:55, "Toke Eskildsen"  wrote:

> Shawn Heisey  wrote:
> > I've updated the "Taking Solr to Production" reference guide page with
> > what I feel is an appropriate caution against running multiple instances
> > in a typical installation.  I'd actually like to use stronger language,
>
> And I would like you to use softer language.
>
> Machines get bigger all the time and, as you state yourself, GC can
> (easily) become a problem as the heap grows. With reference to the 32GB JVM
> limit for small pointers, an Xmx just below 32GB looks like a practical
> choice for a Solr installation (if possible, of course): running 2 instances
> of 31GB will provide more usable memory than a single instance of 64GB.
>
> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
>
> Caveat: I have not done any testing on this with Solr, so I do not know
> how large the effect is. Some things, such as String faceting, DocValues
> structures and some of the field caches are array-of-atomics oriented and
> will not suffer with larger pointers. Other things, such as numerics
> faceting, large rows-settings and grouping uses a lot of objects and will
> require more memory. The overhead will differ depending on usage.
>
> We tend to use separate Solr installations on the same machines. For some
> machines we do it to allow for independent upgrades (long story), for
> others because a heap of 200GB is not something we are ready to experiment
> with.
>
> - Toke Eskildsen
>


Re: Is it possible to bring up and debug Solr in the Eclipse IDE?

2016-01-14 Thread Ramkumar R. Aiyengar
I should add to Erick's point that the test framework allows you to test
HTTP APIs through an embedded Jetty instance, so you should be able to do
anything that you do with a remote Solr instance from code..
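
For anyone looking for a concrete starting point, a minimal sketch of the
embedded-Jetty approach (written against 5.x-era SolrJ classes; the solr
home path and core name are assumptions):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.embedded.JettySolrRunner;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;

  public class EmbeddedJettySmokeTest {
    public static void main(String[] args) throws Exception {
      // Port 0 lets Jetty pick a free port; getLocalPort() reports it.
      JettySolrRunner jetty = new JettySolrRunner("/path/to/solr/home", "/solr", 0);
      jetty.start();
      HttpSolrClient client = new HttpSolrClient(
          "http://localhost:" + jetty.getLocalPort() + "/solr/collection1");
      // Any HTTP API call you would make against a remote Solr works here.
      System.out.println(client.query(new SolrQuery("*:*")).getResults().getNumFound());
      client.close();
      jetty.stop();
    }
  }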
On 12 Jan 2016 18:24, "Erick Erickson"  wrote:

> And a neater way to debug stuff than attaching to Solr is to step
> through the JUnit tests that exercise the code you need to work on,
> rather than attaching to a remote Solr. This is often much faster than
> the compile/start-Solr/attach cycle.
>
> Of course some problems don't fit that process, but I thought
> I'd mention it.
>
> Best,
> Erick
>
> On Tue, Jan 12, 2016 at 4:08 AM, Vincenzo D'Amore 
> wrote:
> > Mmmm... I'm not sure it's worth the trouble. Anyway, I'm just curious;
> > when you find a way, let me know.
> >
> > On Tue, Jan 12, 2016 at 1:01 PM, Rodrigo Testillano <
> > rodrite.testill...@gmail.com> wrote:
> >
> >> Yes, remote debug is working, but I want to bring up Jetty with Solr in
> >> Eclipse like I did with Tomcat in older versions. Thank you very much for
> >> your help! I am going to try another way to do it, but maybe it will not
> >> be possible
> >>
> >> 2016-01-12 12:51 GMT+01:00 Rodrigo Testillano <
> >> rodrite.testill...@gmail.com>
> >> :
> >>
> >> > Thank you so much! I'm going to try right now and tell you my
> >> > results!!
> >> >
> >> > 2016-01-12 12:47 GMT+01:00 Vincenzo D'Amore :
> >> >
> >> >> Yep.
> >> >>
> >> >> I have done this just a few hours ago.
> >> >> Let's download Solr source:
> >> >>
> >> >>  wget
> >> http://it.apache.contactlab.it/lucene/solr/5.4.0/solr-5.4.0-src.tgz
> >> >>
> >> >> untar the file.
> >> >>
> >> >> I'm not sure they are needed, but I have already installed the latest
> >> >> versions of: ant, ivy and maven.
> >> >>
> >> >> Then in the solr-5.4.0 directory I did this:
> >> >>
> >> >> ant resolve
> >> >>
> >> >> ant eclipse
> >> >>
> >> >> Now you can import solr-5.4.0 as an Eclipse project.
> >> >>
> >> >> Under the hood, the ant "eclipse" task has created the .project and
> >> >> .classpath files and the .settings directory.
> >> >>
> >> >> Now if you want to debug, all you need to do is create a Java remote
> >> >> debug configuration in Eclipse and start Solr with the debugging
> >> >> parameters:
> >> >>
> >> >> ./solr start -m 4g -a "-Xdebug
> >> >> -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"
> >> >>
> >> >> :)
> >> >>
> >> >> On Tue, Jan 12, 2016 at 12:31 PM, Rodrigo Testillano <
> >> >> rodrite.testill...@gmail.com> wrote:
> >> >>
> >> >> > I need to debug my custom processor (UpdateRequestProcessor) in my
> >> >> > Eclipse IDE. With old Solr versions it was possible, but with Solr
> >> >> > running as a service with Jetty I don't know if there is a way to
> >> >> > do it.
> >> >> > --
> >> >> > Un Saludo.
> >> >> >
> >> >> > Rodrigo Testillano Tordesillas.
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Vincenzo D'Amore
> >> >> email: v.dam...@gmail.com
> >> >> skype: free.dev
> >> >> mobile: +39 349 8513251
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Un Saludo.
> >> >
> >> > Rodrigo Testillano Tordesillas.
> >> >
> >>
> >>
> >>
> >> --
> >> Un Saludo.
> >>
> >> Rodrigo Testillano Tordesillas.
> >>
> >
> >
> >
> > --
> > Vincenzo D'Amore
> > email: v.dam...@gmail.com
> > skype: free.dev
> > mobile: +39 349 8513251
>


Re: Number of requests to each shard is different with and without using of grouping

2015-08-22 Thread Ramkumar R. Aiyengar
M is the number of ids you want for each group, specified by group.limit.
It's unrelated to the number of rows requested..
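
For example, a grouped request along these lines (field name hypothetical)
would make N=10 and M=5:

  /select?q=foo&group=true&group.field=category&rows=10&group.limit=5

With grouping enabled, rows controls how many groups come back (N), while
group.limit controls how many documents are returned within each group (M).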
On 21 Aug 2015 19:54, SolrUser1543 osta...@gmail.com wrote:

 Ramkumar R. Aiyengar wrote
  Grouping does need 3 phases.. The phases are:
 
 
  (2) For the N groups, each shard is asked for the top M ids (M is
  configurable per request).
 

 What exactly do you mean by "M is configurable per request"? How exactly
 is it configurable, and what is the relation between N (which is the
 initial rows number) and M?




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Number-of-requests-to-each-shard-is-different-with-and-without-using-of-grouping-tp4224293p4224521.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: SOLR to SOLR communication with custom authentication

2015-08-21 Thread Ramkumar R. Aiyengar
Custom authentication support was added in 5.x, and the imminent (in the
next few days) 5.3 release has a lot of features in this regard, including
a basic authentication module, so I would suggest upgrading to it. 5.x
versions (including 5.3) do support Java 7, so I don't see an issue there.
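
For reference, the basic HTTP authentication for replication that Prasad
mentions below is configured in the slave section of the ReplicationHandler
in solrconfig.xml; a sketch (master URL and credentials are placeholders):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master:8080/solr/core1/replication</str>
      <str name="httpBasicAuthUser">username</str>
      <str name="httpBasicAuthPassword">password</str>
    </lst>
  </requestHandler>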
On 20 Aug 2015 12:48, Prasad Bodapati prasad.bodap...@pb.com wrote:

 Hi All,

 We have a cluster environment on JBoss; all of our deployed applications,
 including SOLR, are protected by OpenAM. On slave nodes we enabled SOLR to
 communicate with master nodes to get data.
 Since the SOLR on the master is protected with OpenAM, the slave can't talk
 to it. In Solr.xml there is a way to configure replication requests to use
 basic HTTP authentication, but not to use custom authentication.
 I have tried to override the ReplicationHandler and SnapPuller classes to
 provide custom authentication, but I couldn't.

 I have tried to follow instructions at
 https://wiki.apache.org/solr/SolrSecurity but I could not find the
 classes
 org.apache.solr.security.InterSolrNodeAuthCredentialsFactory.SubRequestFactory
 and
 org.apache.solr.security.InterSolrNodeAuthCredentialsFactory.SubRequestFactory.

 Has anyone of you used custom authentication before for replication? Any
 help would be greatly appreciated.

 Environment
 SOLR version: 4.10.2 (We can't upgrade at moment as we use Java 7)
 JBOSS 6.2 EAP

 Thanks,
 Prasad

 




Re: Number of requests to each shard is different with and without using of grouping

2015-08-21 Thread Ramkumar R. Aiyengar
Grouping does need 3 phases.. The phases are:

(1) Each shard is asked for the top N groups (instead of ids), with the
sort value. The federator then sorts the groups from all shards and chooses
the top N groups.
(2) For the N groups, each shard is asked for the top M ids (M is
configurable per request). The top M ids from each shard for every group are
again sorted within each group to find the overall top M. At the end of
this phase, you have the top N groups with the top M ids for each group.
(3) The final phase gets the stored fields for these M*N ids.
On 20 Aug 2015 20:00, SolrUser1543 osta...@gmail.com wrote:

 I want to understand why the number of requests in SOLR CLOUD is different
 with and without using the grouping feature.

 1. Suppose we have several shards in SOLR CLOUD (let's say 3 shards).
 2. One of them gets a query with rows = n.
 3. This shard distributes the request among the others; suppose that every
 shard has a lot of results, many more than n.
 4. Then it receives item IDs from each shard, so the number of results
 in total is 3n.
 5. Then it sorts the results and chooses the best n results; in my
 case each shard has representatives in the total results.
 6. Then it sends a second request to each shard, with the appropriate item
 IDs, to get the stored fields.

 So in this case, each shard will be queried twice: first to get
 item IDs, and second to get stored fields.

 That is what I see in my logs. (I see 6 log entries, 2 for each shard.)

 *The question is, why, when I am using the grouping feature, is the number
 of requests to each shard 3 instead of 2?* (I see 8 or 9 log entries.)




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Number-of-requests-to-each-shard-is-different-with-and-without-using-of-grouping-tp4224293.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 5.2.1 on Solaris

2015-06-19 Thread Ramkumar R. Aiyengar
Please open a JIRA with details of what the issues are, we should try to
support this..
On 18 Jun 2015 15:07, Bence Vass bence.v...@inso.tuwien.ac.at wrote:

 Hello,

 Is there any documentation on how to start Solr 5.2.1 on Solaris (Solaris
 10)? The script (solr start) doesn't work out of the box. Is anyone running
 Solr 5.x on Solaris?

 - Thanks



Re: Please help test the new Angular JS Admin UI

2015-06-17 Thread Ramkumar R. Aiyengar
I started with an empty Solr instance and Firefox 38 on Linux. This is the
trunk source..

There's a 'No cores available. Go and create one' button available in the
old and the new UI. In the old UI, clicking it goes to the core admin, and
pops open the dialog for Add Core. The new UI only goes to the core admin.
Also, when you then click on the Add Core, the dialog bleeds into the
sidebar.

I then started with a getting started config and a cloud of 2x2. Then
brought up admin UI on one of them, opened up one of the cores, and clicked
on the Files tab -- that showed an exception..

{"data":{"responseHeader":{"status":500,"QTime":1},"error":{"msg":"Path
must not end with / character","trace":"java.lang.IllegalArgumentException:
Path must not end with / character\n\tat
org.apache.zookeeper.common.PathUtils.validatePath(PathUtils.java:58)\n\tat
org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1024)\n\tat
org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:319)\n\tat
org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:316)\n\tat
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)\n\tat
org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:316)\n\tat
org.apache.solr.handler.admin.ShowFileRequestHandler.getAdminFileFromZooKeeper(ShowFileRequestHandler.java:324)\n\tat
org.apache.solr.handler.admin.ShowFileRequestHandler.showFromZooKeeper(ShowFileRequestHandler.java:148)\n\tat
org.apache.solr.handler.admin.ShowFileRequestHandler.handleRequestBody(ShowFileRequestHandler.java:135)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat
org.apache.solr.core.SolrCore.execute(SolrCore.java:2057)\n\tat
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:648)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:452)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)\n\tat


Moving to Plugins/Stats, and then Core, and selecting the first searcher
entry (e.g. for me, it is Searcher@3a7bd1[gettingstarted_shard1_replica1]
main), I see stats like:

   - searcherName: Searcher@&#8203;3a7bd1[gettingstarted_shard1_replica1]
   main
   - reader:
   ExitableDirectoryReader(&#8203;UninvertingDirectoryReader(&#8203;))

Notice the unescaped characters there..


Re: SolrCloud Leader Election

2015-05-21 Thread Ramkumar R. Aiyengar
This shouldn't happen, but if it does, there's currently no good way for
Solr to fix it automatically. There are a couple of issues being worked on
to address that. But till then, your best bet is to restart the node
which you expect to be the leader (you can look at ZK to see who is at the
head of the queue it maintains). If you can't figure that out, safest is to
just stop/start all nodes in sequence, and if that doesn't work, stop all
nodes and start them back one after the other.
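
To find the head of the queue, you can list the election znodes with
ZooKeeper's CLI; a sketch (collection and shard names are placeholders):

  zkCli.sh -server zk1:2181
  ls /collections/<collection>/leader_elect/<shard>/election

The entry with the lowest sequence-number suffix is at the head of the queue.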
On 21 May 2015 00:24, Ryan Steele ryan.ste...@pgi.com wrote:

 My SolrCloud cluster isn't reassigning the collection leaders from downed
 cores -- the downed cores are still listed as the leaders. The cluster has
 been in this state for a few hours and the logs continue to report "No
 registered leader was found after waiting for 4000ms". Is there a way to
 force it to reassign the leader?

 I'm running SolrCloud 5.0.
 I have 7 Solr nodes, 3 Zookeeper nodes, and 3739 collections.

 Thanks,
 Ryan




Re: Multiple index.timestamp directories using up disk space

2015-05-05 Thread Ramkumar R. Aiyengar
Yes, data loss is the concern. If the recovering replica is not able to
retrieve the files from the leader, it at least has an older copy.

Also, the entire index is not fetched from the leader, only the segments
which have changed. The replica initially gets the file list from the
leader, checks it against what it has, and then downloads the difference --
then moves it to the main index. Note that this process can fail sometimes
(say due to I/O errors, or due to a problem with the leader itself), in
which case the replica drops all accumulated files from the leader, and
starts from scratch. If that happens, it needs to look back at its old
index again to figure out what it needs to download on the next attempt.

Maybe, with a fair number of assumptions which should usually hold good,
you could still come up with a mechanism to drop existing files; but those
assumptions won't hold in case of serious issues with the cloud, and you
could end up losing data. That's worse than using a bit more disk space!
On 4 May 2015 11:56, Rishi Easwaran rishi.easwa...@aol.com wrote:

Thanks for the responses Mark and Ramkumar.

 The question I had was: why does Solr need 2 copies at any given time,
leading to 2x disk space usage?
 This information does not seem to be published anywhere, which makes HW
estimation almost impossible for large scale deployments. Even if the copies
are temporary, this becomes really expensive, especially when using SSD in
production, when the total size is over 400TB of indexes, running 1000's of
solr cloud shards.

 If a solr follower has decided that it needs to replicate from the leader
and capture a full copy snapshot, why can't it delete the old information and
replicate from scratch, not requiring more disk space?
 Is the concern data loss (a case when both leader and follower lose data)?

 Thanks,
 Rishi.







-Original Message-
From: Mark Miller markrmil...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Apr 28, 2015 10:52 am
Subject: Re: Multiple index.timestamp directories using up disk space


If copies of the index are not eventually cleaned up, I'd file a JIRA to
address the issue. Those directories should be removed over time. At times
there will have to be a couple around at the same time and others may take
a while to clean up.

- Mark

On Tue, Apr 28, 2015 at 3:27 AM Ramkumar R. Aiyengar 
andyetitmo...@gmail.com wrote:

 SolrCloud does need up to twice the amount of disk space as your usual
 index size during replication. Amongst other things, this ensures you have
 a full copy of the index at any point. There's no way around this, I would
 suggest you provision the additional disk space needed.
 On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote:

  Hi All,
 
  We are seeing this problem with solr 4.6 and solr 4.10.3.
  For some reason, solr cloud tries to recover and creates a new index
  directory - (ex: index.20150420181214550), while keeping the older index
  as is. This creates an issue where the disk space fills up and the shard
  never ends up recovering.
  Usually this requires a manual intervention of bouncing the instance and
  wiping the disk clean to allow for a clean recovery.
 
  Any ideas on how to prevent solr from creating multiple copies of the
  index directory?
 
  Thanks,
  Rishi.
 



Re: Multiple index.timestamp directories using up disk space

2015-04-28 Thread Ramkumar R. Aiyengar
SolrCloud does need up to twice the amount of disk space as your usual
index size during replication. Amongst other things, this ensures you have
a full copy of the index at any point. There's no way around this, I would
suggest you provision the additional disk space needed.
On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote:

 Hi All,

 We are seeing this problem with solr 4.6 and solr 4.10.3.
 For some reason, solr cloud tries to recover and creates a new index
 directory - (ex: index.20150420181214550), while keeping the older index as
 is. This creates an issue where the disk space fills up and the shard
 never ends up recovering.
 Usually this requires a manual intervention of bouncing the instance and
 wiping the disk clean to allow for a clean recovery.

 Any ideas on how to prevent solr from creating multiple copies of the index
 directory?

 Thanks,
 Rishi.



Re: Restart solr failed after applied the patch in https://issues.apache.org/jira/browse/SOLR-6359

2015-03-31 Thread Ramkumar R. Aiyengar
It shouldn't be any different without the patch, or with the patch and
(100,10) as parameters, which is why I wanted you to check with 100,10.. If
you see the same issue with that, then the patch is probably not the issue;
maybe it is with the patched build in general..
On 30 Mar 2015 13:01, forest_soup tanglin0...@gmail.com wrote:

 But if the values can only be 100,10, is there any difference from not
 having that patch? Can we enlarge those 2 values? Thanks!



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Restart-solr-failed-after-applied-the-patch-in-https-issues-apache-org-jira-browse-SOLR-6359-tp4196251p4196280.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Restart solr failed after applied the patch in https://issues.apache.org/jira/browse/SOLR-6359

2015-03-30 Thread Ramkumar R. Aiyengar
I doubt this has anything to do with the patch. Do you observe the same
behaviour if you reduce the values for the config to defaults? (100, 10)
On 30 Mar 2015 09:51, forest_soup tanglin0...@gmail.com wrote:

 https://issues.apache.org/jira/browse/SOLR-6359

 I also posted the questions to the JIRA ticket.

 We have a SolrCloud with 5 solr servers of Solr 4.7.0. There is one
 collection with 80 shards (2 replicas per shard) on those 5 servers. We
 made a patched build by merging the patch
 (https://issues.apache.org/jira/secure/attachment/12702473/SOLR-6359.patch)
 into the 4.7.0 stream. After applying the patch to our servers, with the
 config changes uploaded to ZooKeeper, we restarted one of the 5 solr
 servers and met some issues on that server. Below are the details -
 The solrconfig.xml we changed:
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">1</int>
    <int name="maxNumLogsToKeep">100</int>
  </updateLog>

 After we restarted one solr server, without the other 4 servers running, we
 saw the exceptions below in the restarted one:
 ERROR - 2015-03-16 20:48:48.214; org.apache.solr.common.SolrException;
 org.apache.solr.common.SolrException: Exception writing document id
 Q049bGx0bWFpbDIxL089bGxwX3VzMQ==41703656!B68BF5EC5A4A650D85257E0A00724A3B
 to
 the index; possible analysis error.
 at

 org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
 at

 org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
 at

 org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
 at

 org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:703)
 at

 org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:857)
 at

 org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:556)
 at

 org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:96)
 at

 org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:166)
 at

 org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:136)
 at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:225)
 at

 org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:121)
 at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:190)
 at
 org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:116)
 at

 org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:173)
 at

 org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:106)
 at org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
 at

 org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
 at

 org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
 at

 org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
 at

 org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:780)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:427)
 at

 org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
 at

 org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
 at

 org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
 at

 org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
 at

 org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
 at

 org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
 at

 org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
 at

 org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
 at
 org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
 at

 org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1040)
 at

 org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:607)
 at

 org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:314)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1156)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:626)
 at

 org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
 at java.lang.Thread.run(Thread.java:804)
 Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter
 is closed
 at 

Re: How to use ConcurrentUpdateSolrServer for Secured Solr?

2015-03-22 Thread Ramkumar R. Aiyengar
Not a direct answer, but Anshum just created this..

https://issues.apache.org/jira/browse/SOLR-7275
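
That said, one workaround sketch against 4.x SolrJ: ConcurrentUpdateSolrServer
has a constructor that accepts an externally built HttpClient, so you can
configure basic auth on the client before handing it over (URL, credentials
and sizing below are assumptions):

  // Classes: org.apache.solr.client.solrj.impl.{HttpClientUtil, ConcurrentUpdateSolrServer},
  // org.apache.solr.common.params.ModifiableSolrParams, org.apache.http.client.HttpClient
  ModifiableSolrParams params = new ModifiableSolrParams();
  params.set(HttpClientUtil.PROP_BASIC_AUTH_USER, "user");
  params.set(HttpClientUtil.PROP_BASIC_AUTH_PASS, "secret");
  HttpClient httpClient = HttpClientUtil.createClient(params);
  // queueSize=10, threadCount=4 are illustrative
  ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(
      "http://localhost:8983/solr/collection1", httpClient, 10, 4);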
 On 20 Mar 2015 23:21, Furkan KAMACI furkankam...@gmail.com wrote:

 Is there any way to use ConcurrentUpdateSolrServer for secured Solr, as
 with CloudSolrServer:

 HttpClientUtil.setBasicAuth(cloudSolrServer.getLbServer().getHttpClient(),
 username, password);

 I see that there is no way to access the HttpClient for
 ConcurrentUpdateSolrServer?

 Kind Regards,
 Furkan KAMACI



Re: Want to modify Solr Source Code

2015-03-17 Thread Ramkumar R. Aiyengar
Is your concern that you want to be able to modify the source code just on
your machine, or that you can't for some reason install svn?

If it's the former, even if you checkout using svn, you can't modify
anything outside the machine as changes can be checked in only by the
committers of the project. You will need to raise a JIRA for the changes to
go back in as described by the wiki page.

If the latter, try downloading the source code using the downloads section
on https://lucene.apache.org/solr and choose the download which ends in
-src.tgz; that has the source bundled as a single file.
On 17 Mar 2015 07:42, Nitin Solanki nitinml...@gmail.com wrote:

 Hi Gora,
 I want to make changes only on my machine, without svn.
 I want to test the source code. How? Any steps to do so? Please help..

 On Tue, Mar 17, 2015 at 1:01 PM, Gora Mohanty g...@mimirtech.com wrote:

  On 17 March 2015 at 12:22, Nitin Solanki nitinml...@gmail.com wrote:
  
   Hi,
   I want to modify the solr source code. I don't have any idea where the
   source code is available. I want to edit the source code. How can I do
   that? Any help please...
 
  Please start with:
 
 
 http://wiki.apache.org/solr/HowToContribute#Contributing_Code_.28Features.2C_Bug_Fixes.2C_Tests.2C_etc29
 
  Regards,
  Gora
 



Re: Whole RAM consumed while Indexing.

2015-03-16 Thread Ramkumar R. Aiyengar
Yes, and doing so is painful and takes lots of people and hardware
resources to get there for large amounts of data and queries :)

As Erick says, work backwards from 60s and first establish how high the
commit interval can be to satisfy your use case..
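
For concreteness, a 60-second starting point like Erick suggests below would
look like this in solrconfig.xml (a sketch; tune the values against your use
case):

  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>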
On 16 Mar 2015 16:04, Erick Erickson erickerick...@gmail.com wrote:

 First start by lengthening your soft and hard commit intervals
 substantially. Start with 60000 (60 seconds) and work backwards, I'd say.

 Ramkumar has tuned the heck out of his installation to get the commit
 intervals to be that short ;).

 I'm betting that you'll see your RAM usage go way down, but that's a
 guess until you test.

 Best,
 Erick

 On Sun, Mar 15, 2015 at 10:56 PM, Nitin Solanki nitinml...@gmail.com
 wrote:
  Hi Erick,
   You are correct. **Overlapping searchers** warning messages are coming
   in the logs.
   The **numDocs** numbers are changing as documents are added during
   indexing.
   Any help?
 
  On Sat, Mar 14, 2015 at 11:24 PM, Erick Erickson 
 erickerick...@gmail.com
  wrote:
 
  First, the soft commit interval is very short. Very, very, very, very
  short. 300ms is
  just short of insane unless it's a typo ;).
 
  Here's a long background:
 
 
 https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
 
  But the short form is that you're opening searchers every 300 ms. The
  hard commit is better,
   but every 3 seconds is still far too short IMO. I'd start with soft
   commits of 60000 and hard
   commits of 60000 (60 seconds), meaning that you're going to have to
  wait 1 minute for
  docs to show up unless you explicitly commit.
 
  You're throwing away all the caches configured in solrconfig.xml more
  than 3 times a second,
  executing autowarming, etc, etc, etc
 
  Changing these to longer intervals might cure the problem, but if not
  then, as Hoss would
  say, details matter. I suspect you're also seeing overlapping
  searchers warning messages
  in your log, and it;s _possible_ that what's happening is that you're
  just exceeding the
  max warming searchers and never opening a new searcher with the
  newly-indexed documents.
  But that's a total shot in the dark.
 
  How are you looking for docs (and not finding them)? Does the numDocs
  number in
  the solr admin screen change?
 
 
  Best,
  Erick
 
  On Thu, Mar 12, 2015 at 10:27 PM, Nitin Solanki nitinml...@gmail.com
  wrote:
   Hi Alexandre,
  
  
   *Hard Commit* is :
  
    <autoCommit>
      <maxTime>${solr.autoCommit.maxTime:3000}</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  
   *Soft Commit* is :
  
    <autoSoftCommit>
      <maxTime>${solr.autoSoftCommit.maxTime:300}</maxTime>
    </autoSoftCommit>
  
    And I am committing 2 documents each time.
    Is this a good config for committing?
    Or am I doing something wrong?
  
  
   On Fri, Mar 13, 2015 at 8:52 AM, Alexandre Rafalovitch 
  arafa...@gmail.com
   wrote:
  
   What's your commit strategy? Explicit commits? Soft commits/hard
   commits (in solrconfig.xml)?
  
   Regards,
  Alex.
   
   Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
   http://www.solr-start.com/
  
  
   On 12 March 2015 at 23:19, Nitin Solanki nitinml...@gmail.com
 wrote:
Hello,
  I have written a python script to do 2 documents
  indexing
each time on Solr. I have 28 GB RAM with 8 CPU.
When I started indexing, at that time 15 GB RAM was freed. While
   indexing,
all RAM is consumed but **not** a single document is indexed. Why
 so?
And it through *HTTPError: HTTP Error 503: Service Unavailable* in
  python
script.
I think it is due to heavy load on Zookeeper by which all nodes
 went
   down.
I am not sure about that. Any help please..
Or anything else is happening..
And how to overcome this issue.
Please assist me towards right path.
Thanks..
   
Warm Regards,
Nitin Solanki
  
 



Re: Jetty version

2015-03-12 Thread Ramkumar R. Aiyengar
Yes, Solr 5.0 uses Jetty 8.

FYI, the upcoming release 5.1 will move to Jetty 9.

Also, just in case it matters -- as noted in the 5.0 release notes, the use
of Jetty is now an implementation detail and we might move away from it in
the future -- so you shouldn't be depending on Solr using Jetty or a
particular version of Jetty.
On 12 Mar 2015 10:33, Aman Tandon amantandon...@gmail.com wrote:

 Hi,

 I am not sure, but when I look into the server/lib directory I can see
 version 8.1 on all the lib files present in that folder, so I am guessing
 it's version 8.1.

 I confirmed it by downloading the new Jetty server, which was jetty-9.2,
 and comparing the versions of the Jetty libraries.

 With Regards
 Aman Tandon

 On Thu, Mar 12, 2015 at 12:19 PM, Philippe de Rochambeau phi...@free.fr
 wrote:

  Hello,
 
  which jetty version does solr 5 integrate?
 
  Cheers,
 
  Philippe
 



Re: 4.10.4 - nodes up, shard without leader

2015-03-09 Thread Ramkumar R. Aiyengar
The update log replay issue looks like
https://issues.apache.org/jira/browse/SOLR-6583
On 9 Mar 2015 01:41, Mark Miller markrmil...@gmail.com wrote:

 Interesting bug.

  First there is the already closed transaction log. That by itself deserves
  a look. I'm not even positive we should be replaying the log when
  reconnecting from a ZK disconnect, but even if we do, this should never
  happen.

  Beyond that there seems to be some race. Because of the log trouble, we try
  and cancel the election - but we don't find the ephemeral election node yet
  for some reason and so just assume it's fine, no node there to remove
  (well, we WARN, because it is a little unexpected). Then that ephemeral
  node materializes, I guess, and the new leader doesn't register because the
  old leader won't give up the throne. We don't try and force the new leader
  because that may just hide bugs and cause data loss, so no leader is
  elected.

 I'd guess there are two JIRA issues to resolve here.

 - Mark

 On Sun, Mar 8, 2015 at 8:37 AM Markus Jelsma markus.jel...@openindex.io
 wrote:

   Hello - I stumbled upon an issue I've never seen before: a shard with all
   nodes up and running but no leader. This is on 4.10.4. One of the two
   nodes emits the following error log entry:
 
  2015-03-08 05:25:49,095 WARN [solr.cloud.ElectionContext] - [Thread-136]
 -
  : cancelElection did not find election node to remove
  /overseer_elect/election/93434598784958483-178.21.116.
  225:8080_solr-n_000246
  2015-03-08 05:25:49,121 WARN [solr.cloud.ElectionContext] - [Thread-136]
 -
  : cancelElection did not find election node to remove
 
 /collections/oi/leader_elect/shard3/election/93434598784958483-178.21.116.
  225:8080_solr_oi_h-n_43
  2015-03-08 05:25:49,220 ERROR [solr.update.UpdateLog] - [Thread-136] - :
  Error inspecting tlog
 tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0001394
  refcount=2}
  java.nio.channels.ClosedChannelException
  at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
  at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
  at org.apache.solr.update.ChannelFastInputStream.
  readWrappedStream(TransactionLog.java:784)
  at org.apache.solr.common.util.FastInputStream.refill(
  FastInputStream.java:89)
  at org.apache.solr.common.util.FastInputStream.read(
  FastInputStream.java:125)
  at java.io.InputStream.read(InputStream.java:101)
  at org.apache.solr.update.TransactionLog.endsWithCommit(
  TransactionLog.java:218)
  at org.apache.solr.update.UpdateLog.recoverFromLog(
  UpdateLog.java:800)
  at org.apache.solr.cloud.ZkController.register(
  ZkController.java:841)
  at org.apache.solr.cloud.ZkController$1.command(
  ZkController.java:277)
  at org.apache.solr.common.cloud.ConnectionManager$1$1.run(
  ConnectionManager.java:166)
  2015-03-08 05:25:49,225 ERROR [solr.update.UpdateLog] - [Thread-136] - :
  Error inspecting tlog
 tlog{file=/opt/solr/cores/oi_c/data/tlog/tlog.0001471
  refcount=2}
  java.nio.channels.ClosedChannelException
  at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
  at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:679)
  at org.apache.solr.update.ChannelFastInputStream.
  readWrappedStream(TransactionLog.java:784)
  at org.apache.solr.common.util.FastInputStream.refill(
  FastInputStream.java:89)
  at org.apache.solr.common.util.FastInputStream.read(
  FastInputStream.java:125)
  at java.io.InputStream.read(InputStream.java:101)
  at org.apache.solr.update.TransactionLog.endsWithCommit(
  TransactionLog.java:218)
  at org.apache.solr.update.UpdateLog.recoverFromLog(
  UpdateLog.java:800)
  at org.apache.solr.cloud.ZkController.register(
  ZkController.java:841)
  at org.apache.solr.cloud.ZkController$1.command(
  ZkController.java:277)
  at org.apache.solr.common.cloud.ConnectionManager$1$1.run(
  ConnectionManager.java:166)
  2015-03-08 12:21:04,438 WARN [solr.cloud.RecoveryStrategy] -
  [zkCallback-2-thread-28] - : Stopping recovery for core=oi_h
 coreNodeName=
  178.21.116.225:8080_solr_oi_h
 
  The other node makes a mess in the logs:
 
  2015-03-08 05:25:46,020 WARN [solr.cloud.RecoveryStrategy] -
  [zkCallback-2-thread-20] - : Stopping recovery for core=oi_c
 coreNodeName=
  194.145.201.190:
  8080_solr_oi_c
  2015-03-08 05:26:08,670 ERROR [solr.cloud.ShardLeaderElectionContext] -
  [zkCallback-2-thread-19] - : There was a problem trying to register as
 the
  leader:org.
  apache.solr.common.SolrException: Could not register as the leader
  because creating the ephemeral registration node in ZooKeeper failed
  at org.apache.solr.cloud.ShardLeaderElectionContextBase
  .runLeaderProcess(ElectionContext.java:146)
  at org.apache.solr.cloud.ShardLeaderElectionContext.
  runLeaderProcess(ElectionContext.java:317)
  at 

Re: Using tmpfs for Solr index

2015-01-27 Thread Ramkumar R. Aiyengar
I don't have formal benchmarks, but we did get significant performance
gains by switching from a RAMDirectory to a MMapDirectory on tmpfs,
especially under parallel queries. Locking seemed to pull down the former..
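
A sketch of what such a setup can look like (mount point, size and paths are
placeholders). Mount the tmpfs:

  # mount -t tmpfs -o size=32g tmpfs /mnt/solr-tmpfs

and point Solr at it in solrconfig.xml:

  <directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>
  <dataDir>/mnt/solr-tmpfs/data</dataDir>

Keep in mind that tmpfs contents disappear on reboot, so the index has to be
rebuilt or copied back in afterwards.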
On 23 Jan 2015 06:35, deniz denizdurmu...@gmail.com wrote:

 Would it boost performance if the index were switched from
 RAMDirectoryFactory to tmpfs? Or would it simply do the same thing as
 MMap?

 And in case it would be better to use tmpfs rather than RAMDirectory or
 MMap, which directory factory would be the most feasible one for this
 purpose?

 Regards,



 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Using-tmpfs-for-Solr-index-tp4181399.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Recovery process

2015-01-26 Thread Ramkumar R. Aiyengar
https://issues.apache.org/jira/browse/SOLR-6359 has a patch which allows
this to be configured, it has not gone in as yet.

Note that the current design of the UpdateLog causes it to be less
efficient if the number is bumped up too much, but certainly worth
experimenting with.
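
With that patch applied, the retention becomes configurable in
solrconfig.xml; a sketch (the values shown are illustrative, the defaults
being 100 records and 10 logs):

  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">500</int>
    <int name="maxNumLogsToKeep">20</int>
  </updateLog>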
On 22 Jan 2015 02:47, Nishanth S nishanth.2...@gmail.com wrote:

 Thank you Shalin. So in a system where the indexing rate is more than 5K
 TPS or so, the replica will never be able to recover through the peer sync
 process. In my case I have mostly seen step 3, where a full copy happens,
 and if the index size is huge it takes a very long time for replicas to
 recover. Is there a way we can configure the number of missed updates for
 peer sync?

 Thanks,
 Nishanth

 On Wed, Jan 21, 2015 at 4:47 PM, Shalin Shekhar Mangar 
 shalinman...@gmail.com wrote:

  Hi Nishanth,
 
  The recovery happens as follows:
 
  1. PeerSync is attempted first. If the number of new updates on leader is
  less than 100 then the missing documents are fetched directly and indexed
  locally. The tlog tells us the last 100 updates very quickly. Other uses
 of
  the tlog are for durability of updates and of course, startup recovery.
  2. If the above step fails then replication recovery is attempted. A hard
  commit is called on the leader and then the leader is polled for the
 latest
  index version and generation. If the leader's version and generation are
  greater than local index's version/generation then the difference of the
  index files between leader and replica are fetched and installed.
  3. If the above fails (because leader's version/generation is somehow
 equal
  or more than local) then a full index recovery happens and the entire
 index
  from the leader is fetched and installed locally.
 
  There are some other details involved in this process too but probably
 not
  worth going into here.
 
  On Wed, Jan 21, 2015 at 5:13 PM, Nishanth S nishanth.2...@gmail.com
  wrote:
 
   Hello Everyone,
  
   I am hitting a few issues with solr replicas going into recovery and
 then
   doing a full index copy.I am trying to understand the solr recovery
   process.I have read a few blogs  on this and saw  that when leader
  notifies
   a replica to  recover(in my case it is due to connection resets) it
 will
   try to do a peer sync first and  if the missed updates are more than
 100
  it
   will do a full index copy from the leader.I am trying to understand
 what
   peer sync is and where does tlog come into picture.Are tlogs replayed
  only
   during server restart?.Can some one  help me with this?
  
   Thanks,
   Nishanth
  
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 



Re: Easiest way to embed solr in a desktop application

2015-01-16 Thread Ramkumar R. Aiyengar
That's correct, even though it should still be possible to embed Jetty,
that could change in the future, and that's why support for pluggable
containers is being taken away.

If you need to deal with the index at a lower level, there's always Lucene
you can use as a library instead of Solr.

But I am assuming you need to use the search engine at a higher level than
that and hence you ask for Solr. In which case, I urge you to think through
if you really can't run this out of process, may be this is an XY problem.
Keep in mind that Solr has the ability to provide higher level
functionality because it can control almost the entirety of the application
(which is the philosophical reason behind removal of the war as well), and
that's the reason something like EmbeddedSolrServer will always have
caveats.
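
If you do stay embedded, a minimal usage sketch against the 4.x-era API (the
solr home path and core name are assumptions):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.core.CoreContainer;

  CoreContainer container = new CoreContainer("/path/to/solr/home");
  container.load();
  EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1");
  System.out.println(server.query(new SolrQuery("*:*")).getResults().getNumFound());
  server.shutdown();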
On 15 Jan 2015 15:09, Robert Krüger krue...@lesspain.de wrote:

 I was considering the programmatic Jetty option, but then I read that Solr
 5 no longer supports being run with an external servlet container; maybe
 they still support programmatic Jetty use in some way. At the moment I am
 using Solr 4.x, so this would work. No idea if this gets messy
 classloader-wise in any way.

 I have been using exactly the approach you described in the past, i.e. I
 built a really, really simple Swing dialogue to input queries and display
 results in a table, but was just guessing that the built-in UI was far
 superior. Maybe I should just live with it for the time being.

 On Thu, Jan 15, 2015 at 3:56 PM, Erik Hatcher erik.hatc...@gmail.com
 wrote:

   It’d certainly be easiest to just embed Jetty into your application.  You
   don’t need to have Jetty as a separate process; you could launch it through
   its friendly Java API, configured to use solr.war.
 
  If all you needed was to make HTTP(-like) queries to Solr instead of the
  full admin UI, your application could stick to using EmbeddedSolrServer
 and
  also provide a UI that takes in a Solr query string (or builds one up)
 and
  then sends it to the embedded Solr and displays the result.
 
  Erik
 
   On Jan 15, 2015, at 9:44 AM, Robert Krüger krue...@lesspain.de
 wrote:
  
   Hi Andrea,
  
   you are assuming correctly. It is a local, non-distributed index that
 is
   only accessed by the containing desktop application. Do you know if
 there
   is a possibility to run the Solr admin UI on top of an embedded
 instance
   somehow?
  
   Thanks a lot,
  
   Robert
  
   On Thu, Jan 15, 2015 at 3:17 PM, Andrea Gazzarini 
 a.gazzar...@gmail.com
  
   wrote:
  
   Hi Robert,
   I've used the EmbeddedSolrServer in a scenario like that and I never
 had
   problems.
   I assume you're talking about a standalone application, where the
 whole
   index resides locally and you don't need any cluster / cloud /
  distributed
   feature.
  
   I think the usage of EmbeddedSolrServer is discouraged in a
  (distributed)
   service scenario, because it is a direct connection to a SolrCore
   instance...but this is not a problem in the situation you described
 (as
  far
   as I know)
  
   Best,
   Andrea
  
  
   On 01/15/2015 03:10 PM, Robert Krüger wrote:
  
   Hi,
  
   I have been using an embedded instance of solr in my desktop
  application
   for a long time and it works fine. At the time when I made that
  decision
   (vs. firing up a solr web application within my swing application) I
  got
   the impression embedded use is somewhat unsupported and I should
 expect
   problems.
  
   My first question is, is this still the case now (4 years later),
 that
   embedded solr is discouraged?
  
   The one limitation I am running into is that I cannot use the solr
  admin
   UI
   for debugging purposes (mainly for running queries). Is there any
 other
   way
   to do this other than no longer using embedded solr and
  programmatically
   firing up a web application (e.g. using jetty)? Should I do the
 latter
   anyway?
  
   Any insights/advice greatly appreciated.
  
   Best regards,
  
   Robert
  
  
  
  
  
   --
   Robert Krüger
   Managing Partner
   Lesspain GmbH  Co. KG
  
   www.lesspain-software.com
 
 


 --
 Robert Krüger
 Managing Partner
 Lesspain GmbH  Co. KG

 www.lesspain-software.com



Re: Solr startup script in version 4.10.3

2015-01-08 Thread Ramkumar R. Aiyengar
Versions 4.10.3 and beyond already use server rather than example, which
still finds a reference in the script purely for back compat. A major
release 5.0 is coming soon, perhaps the back compat can be removed for that.
On 6 Jan 2015 09:30, Dominique Bejean dominique.bej...@eolya.fr wrote:

 Hi,

 In release 4.10.3, the following lines were removed from solr starting
 script (bin/solr)

  # TODO: see SOLR-3619, need to support server or example
  # depending on the version of Solr
  if [ -e "$SOLR_TIP/server/start.jar" ]; then
    DEFAULT_SERVER_DIR="$SOLR_TIP/server"
  else
    DEFAULT_SERVER_DIR="$SOLR_TIP/example"
  fi

  However, the usage message always says

    -d <dir>  Specify the Solr server directory; defaults to server


  Either the usage has to be fixed or the removed lines put back into the
  script.

  Personally, I like defaulting to the server directory.

  My installation process, in order to have a clean empty solr instance, is
  to copy examples into server and remove directories like example-DIH,
  example-schemaless, multicore and solr/collection1

 Solr server (or node) can be started without the -d parameter.

 If this makes sense, a Jira issue could be open.

 Dominique
 http://www.eolya.fr/



Re: Dealing with bad apples in a SolrCloud cluster

2014-11-26 Thread Ramkumar R. Aiyengar
As Erick mentions, his change to have a state where indexing happens but
querying doesn't surely helps in this case.

But these are still boolean decisions of send vs don't send. In general, it
would be nice to abstract the routing policy so that it is pluggable. You
could then do stuff like have a least pending policy for choosing
replicas -- instead of choosing a replica at random, you maintain a pending
response count, and you always send to the one with least pending (or
randomly amongst a set of replicas if there is a tie).

Also the chances your distrib=false case will be hit is actually 1/5 (or
something like that, I have forgotten my probability theory). Because you
have two shards and you get two chances at hitting the bad apple. This was
one of the reasons we got in SOLR-6730 to use replica and host affinity.
Under good enough load, the load distribution will more or less be the same
with this change, but chances of hitting bad apples will be lesser..
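
To make the "least pending" idea concrete, a minimal sketch (this is not a
Solr API; the names are hypothetical, and the caller is assumed to decrement
the counter when a response arrives):

  // Pick the replica with the fewest in-flight requests, breaking ties randomly.
  // Uses java.util.* and java.util.concurrent.* / .atomic.* types.
  String pickReplica(List<String> replicas, ConcurrentMap<String, AtomicInteger> pending) {
    List<String> shuffled = new ArrayList<>(replicas);
    Collections.shuffle(shuffled);            // randomize tie-breaking
    String best = null;
    int min = Integer.MAX_VALUE;
    for (String r : shuffled) {
      int inFlight = pending.computeIfAbsent(r, k -> new AtomicInteger()).get();
      if (inFlight < min) { min = inFlight; best = r; }
    }
    pending.get(best).incrementAndGet();      // caller decrements on completion
    return best;
  }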
On 21 Nov 2014 18:56, Timothy Potter thelabd...@gmail.com wrote:

Just soliciting some advice from the community ...

Let's say I have a 10-node SolrCloud cluster and have a single collection
with 2 shards with replication factor 10, so basically each shard has one
replica on each of my nodes.

Now imagine one of those nodes starts getting into a bad state and starts
to be slow about serving queries (not bad enough to crash outright though)
... I'm sure we could ponder any number of ways a box might slow down
without crashing.

From my calculations, about 2/10ths of the queries will now be affected
since

1/10 queries from client apps will hit the bad apple
  +
1/10 queries from other replicas will hit the bad apple (distrib=false)


If QPS is high enough and the bad apple is slow enough, things can start to
get out of control pretty fast, esp. since we've set max threads so high to
avoid distributed dead-lock.

What have others done to mitigate this risk? Anything we can do in Solr to
help deal with this? It seems reasonable that nodes can identify a bad
apple by keeping track of query times and looking for nodes that are
significantly outside (>= 2 stddev) what the other nodes are doing. Then
maybe mark the node as being down in ZooKeeper so clients and other nodes
stop trying to send requests to it; or maybe a simple policy of just don't
send requests to that node for a few minutes.


Re: any difference between using collection vs. shard in URL?

2014-11-06 Thread Ramkumar R. Aiyengar
Do keep one thing in mind though. If you are already doing the work of
figuring out the right shard leader (through solrJ or otherwise), using
that location with just the collection name might be suboptimal if there
are multiple shard leaders present in the same instance -- the collection
name just goes to *some* shard leader and not necessarily to the one where
your document is destined. If it chooses the wrong one, it will lead to an
HTTP request to itself.
On 5 Nov 2014 15:33, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 There's no difference between the two. Even if you send updates to a shard
 url, it will still be forwarded to the right shard leader according to the
 hash of the id (assuming you're using the default compositeId router). Of
 course, if you happen to hit the right shard leader then it is just an
 internal forward and not an extra network hop.

 The advantage with using the collection name is that you can hit any
 SolrCloud node (even the ones not hosting this collection) and it will
 still work. So for a non Java client, a load balancer can be setup in front
 of the entire cluster and things will just work.

 On Wed, Nov 5, 2014 at 8:50 PM, Ian Rose ianr...@fullstory.com wrote:

  If I add some documents to a SolrCloud shard in a collection alpha, I
 can
  post them to /solr/alpha/update.  However I notice that you can also
 post
  them using the shard name, e.g. /solr/alpha_shard4_replica1/update - in
  fact this is what Solr seems to do internally (like if you send documents
  to the wrong node so Solr needs to forward them over to the leader of the
  correct shard).
 
  Assuming you *do* always post your documents to the correct shard, is
 there
  any difference between these two, performance or otherwise?
 
  Thanks!
  - Ian
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Sharding configuration

2014-11-01 Thread Ramkumar R. Aiyengar
On 30 Oct 2014 23:46, Erick Erickson erickerick...@gmail.com wrote:

 This configuration deals with all
 the replication, NRT processing, self-repair when nodes go up and
 down and all that, but since there's no second trip to get the docs
 from shards your query performance won't be affected.

More or less.. Vaguely recall that you still would need to add a
shortCircuit parameter to the url in such a case to avoid a second trip. I
might be wrong here but I do recall wondering why that wasn't the default..


 And using SolrCloud with a single shard will essentially scale linearly
 as you add nodes for queries.

 Best,
 Erick


 On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz anca.kop...@kelkoo.com
wrote:
  Hi,
 
  You are right, it is a mistake in my phrase, for the tests with 4
  shards/ 4 instances,  the latency was worse (therefore *bigger*) than
  for the tests with one shard.
 
  In our case, the query rate is high.
 
  Thanks,
  Anca
 
 
  On 10/30/2014 03:48 PM, Shawn Heisey wrote:
 
  On 10/30/2014 4:32 AM, Anca Kopetz wrote:
 
  We did some tests with 4 shards / 4 different tomcat instances on the
  same server and the average latency was smaller than the one when
having
  only one shard.
   We also tested 2 shards on different servers, and the performance results
   were also worse.
 
  It seems that the sharding does not make any difference for our index
in
  terms of latency gains.
 
  That statement is confusing, because if latency goes down, that's good,
  not worse.
 
  If you're going to put multiple shards on one server, it should be done
  with one solr/tomcat instance, not multiple.  One instance is perfectly
  capable of dealing with many shards, and has a lot less overhead.  The
  SolrCloud collection create command would need the maxShardsPerNode
  parameter.
 
  In order to see a gain in performance from multiple shards per server,
  the server must have a lot of CPUs and the query rate must be fairly
  low.  If the query rate is high, then all the CPUs will be busy just
  handling simultaneous queries, so putting multiple shards per server
  will probably slow things down.  When query rate is low, multiple CPUs
  can handle each shard query simultaneously, speeding up the overall
query.
 
  Thanks,
  Shawn
 
 


Re: Sharding configuration

2014-11-01 Thread Ramkumar R. Aiyengar
On 30 Oct 2014 14:49, Shawn Heisey apa...@elyograg.org wrote:

 In order to see a gain in performance from multiple shards per server,
 the server must have a lot of CPUs and the query rate must be fairly
 low.  If the query rate is high, then all the CPUs will be busy just
 handling simultaneous queries, so putting multiple shards per server
 will probably slow things down.  When query rate is low, multiple CPUs
 can handle each shard query simultaneously, speeding up the overall query.

Except that your query latency isn't always CPU bound; there's a
significant IO bound portion as well. I wouldn't go so far as to say that
with large query volumes you shouldn't use multiple shards -- it finally
comes down to how many shards a machine can handle under peak load, and it
could depend on CPU/IO/GC pressure. We have multiple shards on a machine
under heavy query load, for example. The only real way is to benchmark this
and see..

 Thanks,
 Shawn



Re: Sharding configuration

2014-10-28 Thread Ramkumar R. Aiyengar
As far as the second option goes, unless you are using a large amount of
memory and you reach a point where a JVM can't sensibly deal with a GC
load, having multiple JVMs wouldn't buy you much. With a 26GB index, you
probably haven't reached that point. There are also other shared resources
at an instance level like connection pools and ZK connections, but those
are tunable and you probably aren't pushing them as well (I would imagine
you are just trying to have only a handful of shards given that you aren't
sharded at all currently).

That leaves single vs multiple machines. Assuming the network isn't a
bottleneck, and given the same amount of resources overall (number of
cores, amount of memory, IO bandwidth times number of machines), it
shouldn't matter between the two. If you are procuring new hardware, I
would say buy more, smaller machines, but if you already have the hardware,
you could serve as much as possible off a machine before moving to a
second. There's nothing which limits the number of shards as long as the
underlying machine has the sufficient amount of parallelism.

Again, this advice is for a small number of shards, if you had a lot more
(hundreds) of shards and significant volume of requests, things start to
become a bit more fuzzy with other limits kicking in.
On 28 Oct 2014 09:26, Anca Kopetz anca.kop...@kelkoo.com wrote:

 Hi,

 We have a SolrCloud configuration of 10 servers, no sharding, 20
 million documents; the index is 26 GB.
 As the number of documents has increased recently, the performance of
 the cluster has decreased.

 We thought of sharding the index, in order to measure the latency. What
 is the best approach ?
 - to use shard splitting and have several sub-shards on the same server
 and in the same tomcat instance
 - having several shards on the same server but on different tomcat
 instances
 - having one shard on each server (for example 2 shards / 5 replicas on
 10 servers)

 What's the impact of these 3 configurations on performance?

 Thanks,
 Anca




Re: Advice on highlighting

2014-09-14 Thread Ramkumar R. Aiyengar
https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-2878
provides a Lucene API for what you are trying to do; it's not in yet though.
There's a fork which has the change in
https://github.com/flaxsearch/lucene-solr-intervals
On 12 Sep 2014 21:24, Craig Longman clong...@iconect.com wrote:

 In order to take our Solr usage to the next step, we really need to
 improve its highlighting abilities.  What I'm trying to do is to be able
 to write a new component that can return the fields that matched the
 search (including numeric fields) and the start/end positions for the
 alphanumeric matches.



 I see three different approaches to take; each will require making
 some modifications to the lucene/solr parts, as it just does not appear
 to be doable as a completely standalone component.



 1) At initial search time.

 This seemed like a good approach.  I can follow IndexSearcher creating
 the TermContext that parses through AtomicReaderContexts to see if it
 contains a match and then adds it to the contexts available for later.
 However, at this point, inside SegmentTermsEnum.seekExact() it seems
 like Solr is not really looking for matching terms as such, it's just
 scanning what looks like the raw index.  So, I don't think I can easily
 extract term positions at this point.



 2) Write a modified HighlighterComponent.  We have managed to get phrases
 to highlight properly, but it seems like getting the full field matches
 would be more difficult in this module. However, because it does its
 highlighting oblivious to any other criteria, we can't use it as is.
 For example, this search:



   (body:large+AND+user_id:7)+OR+user_id:346



 Will highlight "large" in records that have user_id = 346 when
 technically (for our purposes at least) it should not be considered a
 hit, because the "large" was accompanied by the user_id = 7 criteria.
 It's not immediately clear to me how difficult it would be to change
 this.



 3) Make a modified DebugComponent and enhance the existing explain()
 methods (in the query types we require it at least) to include more
 information such as the start/end positions of the term that was hit.
 I'm exploring this now, but I don't easily see how I can figure out what
 those positions might be from the explain() information.  Any pointers
 on how, at the point that TermQuery.explain() is being called, I can
 figure out which indexed token the actual hit was on?





 Craig Longman

 C++ Developer

 iCONECT Development, LLC
 519-645-1663









Re: Scaling to large Number of Collections

2014-08-31 Thread Ramkumar R. Aiyengar
On 31 Aug 2014 13:24, Mark Miller markrmil...@gmail.com wrote:


  On Aug 31, 2014, at 4:04 AM, Christoph Schmidt 
christoph.schm...@moresophy.de wrote:
 
  we see at least two problems when scaling to a large number of
collections. I would like to ask the community if they are known and maybe
already addressed in development:
  We have a SolrCloud running with the following numbers:
  -  5 servers (each 24 CPUs, 128GB RAM)
  -  13,000 collections with 25,000 SolrCores in the cloud
  The Cloud is working fine, but we see two problems, if we like to scale
further
  1.   Resource consumption of native system threads
  We see that each collection opens at least two threads: one for the
zookeeper (coreZkRegister-1-thread-5154) and one for the searcher
(searcherExecutor-28357-thread-1)
  We will run into OutOfMemoryError: unable to create new native thread.
Maybe the architecture could be changed here to use thread pools?
  2.   The shutdown and the startup of one server in the SolrCloud
takes 2 hours, so a rolling restart is about 10h. To me the problem seems to
be that leader election is linear: the Overseer goes core by core. The
organisation of the cloud is not done in parallel or distributed. Is this
already addressed by https://issues.apache.org/jira/browse/SOLR-5473 or is
there more needed?

 2. No, but it should have been fixed by another issue that will be in
4.10.

Note however that this fix will result in even more temporary thread usage
as all leadership elections will happen in parallel, so you might still end
up with this out-of-threads issue again.

Quite possibly the out-of-threads issue is just some system soft limit
kicking in. Linux certainly has a limit you can configure through sysctl;
your OS, whatever it might be, probably does the same. It may be worth
exploring whether you can bump that up.


 - Mark
 http://about.me/markrmiller


Re: Why does CLUSTERSTATUS return different information than the web cloud view?

2014-08-26 Thread Ramkumar R. Aiyengar
ZK has the list of live nodes available as a set of ephemeral nodes. You
can use /zookeeper on Solr or talk to ZK directly to get that list.
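
As a rough sketch (the ZK connect string is an assumption; point it at the
same ensemble Solr uses, including any chroot), reading the live nodes with
the plain ZooKeeper client looks like this:

import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodesExample {
  public static void main(String[] args) throws Exception {
    // A no-op watcher is fine for a one-off read
    ZooKeeper zk = new ZooKeeper("zkhost1:2181,zkhost2:2181", 15000, new Watcher() {
      public void process(WatchedEvent event) { }
    });
    try {
      // Each live Solr node registers an ephemeral child under /live_nodes
      List<String> liveNodes = zk.getChildren("/live_nodes", false);
      for (String node : liveNodes) {
        System.out.println(node); // e.g. 10.0.0.1:8983_solr
      }
    } finally {
      zk.close();
    }
  }
}

Since the znodes are ephemeral, a node that dies drops out of that list as
soon as its session expires, which is exactly what the cloud screen
reflects.
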
On 24 Aug 2014 03:08, Nathan Neulinger nn...@neulinger.org wrote:

 Is there a way to query the 'live node' state without sending a query to
 every node myself? i.e. to get the same data that is used for that cloud
 status screen?

 -- Nathan

 On 08/23/2014 06:39 PM, Mark Miller wrote:

 The state is actually a combo of the state in clusterstate and the live
 nodes. If the live node is not there, it's gone regardless of the last
 state it published.

 - Mark

  On Aug 23, 2014, at 6:00 PM, Nathan Neulinger nn...@neulinger.org
 wrote:

 In particular, a shard being 'active' vs. 'gone'.

 The web UI is clearly showing the given replicas as being in "Gone"
 state when I shut down a server, yet CLUSTERSTATUS says that each
 replica has state: active.

 Is there any way to ask it for status that will reflect that the replica
 is gone?

 This is with 4.8.0.

 -- Nathan

 
 Nathan Neulinger   nn...@neulinger.org
 Neulinger Consulting   (573) 612-1412


 --
 
 Nathan Neulinger   nn...@neulinger.org
 Neulinger Consulting   (573) 612-1412



Re: Disabling transaction logs

2014-08-13 Thread Ramkumar R. Aiyengar
(1) sounds a lot like the SOLR-6261 I mention above. There are possibly other
improvements since 4.6.1, as Mark mentions; I would certainly suggest you
test with the latest version with the issue above patched (or use the
current stable branch in svn, branch_4x) to see if that makes a difference.


Re: Disabling transaction logs

2014-08-09 Thread Ramkumar R. Aiyengar
I didn't realise you could even disable the tlog when running SolrCloud, but
as Anshum says, it's a bad idea.  In all possibility, even if it worked,
removing transaction logs is likely to make your restarts slower: SolrCloud
would always be forced to do a full recovery because it could no longer use
tlogs for recovery, which it tries first to speed things up.

Your problem is probably elsewhere. How many replicas do you have? Do you
see this problem always, or only when you bounce the leaders? SOLR-6261
recently sped that up and is scheduled to roll out with the next release,
but you can try out the patch meanwhile.
Hello

I am using Solr 4.6.1 with over 1000 collections and 8 nodes. Restarting
nodes takes a long time (especially if we have indexing running against
them). I want to see if disabling transaction logs can help with a more
robust restart. However, I can't see any docs around disabling txn logs in
SolrCloud.

Can anyone help with info on how to disable transaction logs?


Thanks
Nitin


Re: SolrCloud without NRT and indexing only on the master

2014-07-31 Thread Ramkumar R. Aiyengar
I agree with Erick that the gain you are looking at might not be worth it,
so do measure and see if there's a difference.

Also, the next release of Solr is set to have some significant improvements
when it comes to CPU usage under heavy indexing load, and we have had at
least one anecdote so far where the throughput has increased by an order of
magnitude, so one option might be to try that out as well and see. See
SOLR-6136 and potentially SOLR-6259 (probably lesser so, depends on your
schema) if you want to try out before the release.

An another option is to use the HDFS directory support in Solr. That way
you can build indices offline and make them available for all your Solr
replicas for search. See batch indexing at
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Search/latest/Cloudera-Search-User-Guide/csug_introducing.html
On 30 Jul 2014 11:54, Harald Kirsch harald.kir...@raytion.com wrote:

 Hi Daniel,

 well, I assume there is a performance difference on host B between

 a) getting some ready-made segments from host A (master, taking care of
 indexing) to host B (slave, taking care of answering queries)

 and

 b) host B (along with host A) doing all the work necessary to turn
 incoming SolrDocument objects into a segment and make it searchable.

 I am talking here about a setup where during peak loads the CPUs on host B
 are sweating at 80% and I assume the following:

 i) Indexing will draw more than 20% CPU. Thereby it would start competing
 with query answering

 ii) Merely copying finished segments to the query-answering node will not
 draw more than 20% CPU and will thereby not compete with query answering.

 Index consistency is not an issue, because the number of documents and the
 number of different, hard-to-get-at sources we will be indexing will always
 be out-of-sync with the index. Adding an hour or two here is the least of
 my problems.

 Harald.

 On 30.07.2014 11:58, Daniel Collins wrote:

 Working backwards slightly, what do you think SolrCloud is going to give
 you, apart from the consistency of the index (which you want to turn off)?
 What are all the other benefits of SolrCloud, if you are querying
 separate instances that aren't guaranteed to be in sync (since you want to
 use the traditional-style master-slave for indexing)?

 And secondly, why don't you want to use SolrCloud for indexing everywhere?
   Again, what do you think master-slave methodology gains you?  You have
 said you want all the resources of the slaves to be for querying, which
 makes sense, but the slaves have to get the new updates somehow, surely?
 Whether that is from SolrCloud directly, or via a master-slave
 replication,
 the work has to be done at some point?

 If you don't have NRT, and you set your commit frequency to something
 reasonably large, then I don't see the cost of SolrCloud, but I guess it
 depends on the frequency of your updates.


 On 30 July 2014 08:22, Harald Kirsch harald.kir...@raytion.com wrote:

  Thanks Erick,

 for the confirmation.

 You say "traditional" but the docs call it "legacy". Not being a native
 speaker, I might misinterpret the meaning slightly, but to me it conveys
 the notion of "don't use this stuff if you don't have to".


 SolrCloud indexes to all nodes all the time, there's no real way to turn
 that off.

 which is really a pity when only query-load must be scaled and NRT is not
 necessary. :-/

 Harald.


 On 29.07.2014 18:16, Erick Erickson wrote:

  bq: What if I don't need NRT and in particular want the slave to use all
 resources for query answering, i.e. only the master shall index. But at
 the same time I want all the other benefits of SolrCloud.

 You want all the benefits of SolrCloud without... using SolrCloud?

 Your only two choices are traditional master/slave or SolrCloud.
 SolrCloud
 indexes to all nodes all the time, there's no real way to turn that off.
 You _can_ control the frequency of commits but you can't turn off the
 indexing to all the nodes.

 FWIW,
 Erick


 On Tue, Jul 29, 2014 at 5:41 AM, Mikhail Khludnev 
 mkhlud...@griddynamics.com wrote:

  I never did it, but always liked this approach:


 http://lucene.472066.n3.nabble.com/Best-practice-for-rebuild-index-in-SolrCloud-td4054574.html
   From time to time such recipes are mentioned in the list.


 On Tue, Jul 29, 2014 at 12:39 PM, Harald Kirsch 
 harald.kir...@raytion.com


   wrote:


   Hi all,


 from the Solr documentation I find two options for how replication of an
 index is handled:

 a) SolrCloud indexes on the master and all slaves in parallel to support
 NRT (near-realtime search)

 b) Legacy replication, where only the master does the indexing and
 slaves receive index copies once in a while.

 What if I don't need NRT and in particular want the slave to use all
 resources for query answering, i.e. only the master shall index. But at
 the same time I want all the other benefits of SolrCloud.

 Is this setup possible? Is it somewhere described in the docs?

 Thanks,
 Harald.





Re: Anybody knows of a project that indexes SVN repos into Solr?

2014-06-02 Thread Ramkumar R. Aiyengar
Not an exact answer.. OpenGrok uses Lucene, but not Solr.
On 2 Jun 2014 07:48, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Hello,

 Does anybody know of a recent project that indexes SVN repos for Solr
 search? With or without a UI.

 I know of similar efforts for other VCSes, but the only thing I found
 for SVN is from 2010 and looks quiet.

 Regards,
Alex.
 P.s. This could also be a cool show-off project for somebody. Plenty
 of SVN repos around to use as data source.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency



Re: Distributed Search in Solr with different queries per shard

2014-05-25 Thread Ramkumar R. Aiyengar
I agree with Erick that this is premature unless you can show that it makes
a difference.

Firstly, why are you splitting the data into multiple time tiers (one
recent, and one all) and then waiting to merge results from all of them?
Time tiering is useful when you can do the search separately on both and
then pick the one which comes back with full results first (usually it will
be the recent one, but it might not have as many results as you want).

The way you are trying to aggregate the data is sharding, where one of the
cores doesn't have the data the other one has. So you could just 'optimize'
by not having the data present in the historical collection. We have
support for custom sharding keys now in Solr; I haven't used it personally,
but that might be worth a shot.
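
For reference, standard distributed search just fans the same query out to
every shard in the shards parameter and merges the results, which is why
you can't vary the query per core. A minimal SolrJ sketch (hosts and core
names are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedSearchExample {
  public static void main(String[] args) throws Exception {
    // The core the request is sent to acts as the aggregator
    HttpSolrServer server = new HttpSolrServer("http://host1:8983/solr/active");

    SolrQuery query = new SolrQuery("color:green");
    // The same q is forwarded verbatim to every shard listed here
    query.set("shards", "host1:8983/solr/active,host2:8983/solr/history");

    QueryResponse response = server.query(query);
    System.out.println("Total hits: " + response.getResults().getNumFound());
    server.shutdown();
  }
}
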
On 21 May 2014 14:57, Avner Levy av...@checkpoint.com wrote:

 I have 2 cores.
 One with active data and one with historical data (for documents which
 were removed from the active one).
 I want to run Distributed Search on both and get the unified result (as
 supported by Solr Distributed Search, I'm not using Solr Cloud).
 My problem is that the query for each core is different.
 Is there a way to specify a different query per core and still let Solr
 unify the query results?
 For example:
 Active data core query: select all green docs
 History core query: select all green docs with year=2012
 Is there a way to extend the distributed search handler to support such a
 scenario?
 Thanks in advance,
   Avner
 · One option is to send a unified query to both but then each core
 will work harder for no reason.




Re: Can I reconstruct text from tokens?

2014-04-18 Thread Ramkumar R. Aiyengar
Sorry, didn't think this through. You're right, still the same problem..
On 16 Apr 2014 17:40, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Why? I want stored=false, at which point a multivalued field is just offset
 values in the dictionary. I still have to reconstruct from offsets.

 Or am I missing something?

 Regards,
  Alex
 On 16/04/2014 10:59 pm, Ramkumar R. Aiyengar andyetitmo...@gmail.com
 wrote:

  Logically if you tokenize and put the results in a multivalued field, you
  should be able to get all values in sequence?
  On 16 Apr 2014 16:51, Alexandre Rafalovitch arafa...@gmail.com
 wrote:
 
   Hello,
  
   If I use very basic tokenizers, e.g. space based and no filters, can I
   reconstruct the text from the tokenized form?
  
    So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
  
    I know we store enough information, but I don't know the internal API
    well enough to know what I should be looking at for a reconstruction
    algorithm.
  
   Any hints?
  
    The XY problem is that I want to store a large amount of very repeatable
    text into Solr. I want the index to be as small as possible, so I
    thought if I just pre-tokenize, my dictionary will be quite small.
    And I will be reconstructing some final form anyway.
  
    The other option is to just use compressed stored fields, but
    I assume that does not take cross-document efficiencies into account.
   And, it will be a read-only index after build, so I don't care about
   updates messing things up.
  
   Regards,
  Alex
  
   Personal website: http://www.outerthoughts.com/
   Current project: http://www.solr-start.com/ - Accelerating your Solr
   proficiency
  
 



Re: Can I reconstruct text from tokens?

2014-04-16 Thread Ramkumar R. Aiyengar
Logically if you tokenize and put the results in a multivalued field, you
should be able to get all values in sequence?
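
To make that concrete, here's a minimal sketch with the plain Lucene
analysis API (Lucene 4.x style, where analyzers take a Version; the
body_tokens field name is made up and would need to be declared multivalued
in your schema):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.apache.solr.common.SolrInputDocument;

public class PreTokenizeExample {
  public static void main(String[] args) throws Exception {
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_47);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");

    TokenStream ts = analyzer.tokenStream("body", new StringReader("This is a test"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // Each token becomes one value of the multivalued field, in order
      doc.addField("body_tokens", term.toString());
    }
    ts.end();
    ts.close();
    analyzer.close();
  }
}
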
On 16 Apr 2014 16:51, Alexandre Rafalovitch arafa...@gmail.com wrote:

 Hello,

 If I use very basic tokenizers, e.g. space based and no filters, can I
 reconstruct the text from the tokenized form?

 So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?

 I know we store enough information, but I don't know the internal API
 well enough to know what I should be looking at for a reconstruction
 algorithm.

 Any hints?

 The XY problem is that I want to store a large amount of very repeatable
 text into Solr. I want the index to be as small as possible, so I
 thought if I just pre-tokenize, my dictionary will be quite small.
 And I will be reconstructing some final form anyway.

 The other option is to just use compressed stored fields, but
 I assume that does not take cross-document efficiencies into account.
 And, it will be a read-only index after build, so I don't care about
 updates messing things up.

 Regards,
Alex

 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency



Re: svn vs GIT

2014-04-14 Thread Ramkumar R. Aiyengar
ant compile / ant -f solr dist / ant test certainly work; I use them with a
git working copy. Are you trying something else?
On 14 Apr 2014 19:36, Jeff Wartes jwar...@whitepages.com wrote:

 I vastly prefer git, but last I checked (admittedly, some time ago) you
 couldn't build the project from the git clone. Some of the build scripts
 assumed some svn commands would work.



 On 4/12/14, 3:56 PM, Furkan KAMACI furkankam...@gmail.com wrote:

Hi Aman;
 
There has been a conversation about it on the dev list:

http://search-lucene.com/m/PrTmPXyDlv/The+Old+Git+Discussionsubj=Re+The+Old+Git+Discussion

On the other hand, you do not need to know SVN to use, develop and
contribute to the Apache Solr project. You can follow the project at GitHub:
https://github.com/apache/lucene-solr
 
 Thanks;
 Furkan KAMACI
 
 
 2014-04-11 5:12 GMT+03:00 Aman Tandon amantandon...@gmail.com:
 
  thanks sir,
  in that case I need to know about svn as well.
 
 
  Thanks
  Aman Tandon
 
  On Fri, Apr 11, 2014 at 7:26 AM, Alexandre Rafalovitch
  arafa...@gmail.comwrote:
 
    You can find the read-only Git version of the Lucene+Solr source code
    here: https://github.com/apache/lucene-solr . The SVN preference is
   Apache Foundation's choice and legacy. Most of the developers'
   workflows are also around SVN.
  
   Regards,
  Alex.
  
   Personal website: http://www.outerthoughts.com/
   Current project: http://www.solr-start.com/ - Accelerating your Solr
   proficiency
  
  
   On Fri, Apr 11, 2014 at 7:48 AM, Aman Tandon amantandon...@gmail.com
 
   wrote:
Hi,
   
I am new here; I have a question in mind: why are we preferring
svn more than git?
   
--
With Regards
Aman Tandon
  
 




Re: update in SolrCloud through C++ client

2014-02-16 Thread Ramkumar R. Aiyengar
If only availability is your concern, you can always keep a list of servers
to which your C++ clients will send requests, and round-robin amongst them.
If one of the servers goes down, you will either not be able to reach it or
will get a 500+ error in the HTTP response; you can then take it out of
circulation (and probably retry in the background with some kind of a ping
every minute or so to these down servers to ascertain if they have come
back, and then add them back to the list). This is something SolrJ does
currently. This doesn't technically need any ZooKeeper interaction.

The biggest benefit that SolrJ provides (since 4.6, I think) though is that
it finds the shard leader to send an update to using ZK, and saves a hop.
You can technically do this by retrieving and listening to cluster-state
updates using a C++ ZK client (one is available) and doing what SolrJ
currently does. This would be good; the only drawback, apart from the
effort, is that improvements are still happening in the area of managing
clusters and how their state is saved in ZK. These changes might not break
your code, but at the same time you might not be able to take advantage of
them without additional effort.

An alternative approach is to link SolrJ into your C++ client using JNI.
This has the added advantage of using the Javabin format for requests, which
would have some performance benefits.

In short, it comes down to what the performance requirements are. If indexing
speed and throughput are not that big a deal, just go with having a list of
servers and load balancing amongst the active ones. I would suggest you try
this anyway before second-guessing that you need the optimization.

If not, I would probably try the JNI route,  and if that fails, using a C
ZK client to read the cluster state and using that knowledge to decide
where to send requests.
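
To illustrate the first option, this is roughly what SolrJ's load balancer
gives you on the Java side (URLs are placeholders); a C++ client would mimic
the same round-robin-plus-blacklist logic over HTTP:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class RoundRobinExample {
  public static void main(String[] args) throws Exception {
    // Rotates across the listed servers; one that fails is taken out of
    // rotation and pinged in the background until it comes back
    LBHttpSolrServer lb = new LBHttpSolrServer(
        "http://solr1:8983/solr/collection1",
        "http://solr2:8983/solr/collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "42");
    lb.add(doc);    // goes to whichever server is next in rotation
    lb.commit();

    QueryResponse rsp = lb.query(new SolrQuery("id:42"));
    System.out.println(rsp.getResults().getNumFound());
    lb.shutdown();
  }
}
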
On 14 Feb 2014 10:58, neerajp neeraj_star2...@yahoo.com wrote:

 Hello All,
 I am using Solr for indexing my data. My client is in C++, so I make curl
 requests to the Solr server for indexing.
 Now, I want to use indexing in SolrCloud mode using ZooKeeper for HA.  I
 read the wiki link on SolrCloud (http://wiki.apache.org/solr/SolrCloud).

 What I understand from the wiki is that we should always check the Solr
 instance status (up and running) in SolrCloud before making an update
 request. Can I not send the update request to ZooKeeper and let ZooKeeper
 forward it to the appropriate replica/leader? In the latter case I need not
 worry about which servers are up and running before making an indexing
 request.







Re: SolrCloud Zookeeper disconnection/reconnection

2014-02-16 Thread Ramkumar R. Aiyengar
Start with http://wiki.apache.org/solr/SolrPerformanceProblems -- it has a
section on GC tuning and a link to some example settings.
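
If you want a quick sanity check before wading through GC logs, the
standard JMX beans expose cumulative collection counts and times. A rough
sketch, nothing Solr-specific about it:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStatsExample {
  public static void main(String[] args) {
    // One bean per collector, e.g. ParNew and ConcurrentMarkSweep under CMS
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      System.out.println(gc.getName()
          + ": count=" + gc.getCollectionCount()
          + ", totalTimeMs=" + gc.getCollectionTime());
    }
  }
}

Collection time that's large relative to uptime, or that jumps by seconds
at a time, is a hint that pauses may be blowing through the ZK session
timeout.
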
On 16 Feb 2014 21:19, lboutros boutr...@gmail.com wrote:

 Thanks a lot for your answer.

 Is there a web page, on the wiki for instance, where we could find some JVM
 settings or recommendations that we should use for Solr with certain index
 configurations?

 Ludovic.





 -
 Jouve
 France.



Re: SolrCloud Zookeeper disconnection/reconnection

2014-02-14 Thread Ramkumar R. Aiyengar
Ludovic, recent Solr changes won't do much to prevent ZK session expiry;
you might want to enable GC logging on Solr and ZooKeeper to check for
pauses and tune appropriately.

The patch below fixes a situation under which the cloud can get to a bad
state during the recovery after session expiry. The recovery after a
session expiry is unavoidable, but as you guessed, it would be quick if
there aren't too many updates.

4.6.1 also has SOLR-5577 which will prevent updates from unnecessarily
stalling when you are disconnected from ZK for a short while.

These changes (and probably others) will thus probably help the cloud
behave better on ZK expiry, and for that reason I would encourage you to
upgrade; but the ZK expiry problem would have to be dealt with by ensuring
that ZK and Solr don't pause for too long and by choosing an appropriate
session timeout (which btw will be defaulted up to 30s from 15s in Solr 4.7
onwards).
On 13 Feb 2014 08:23, lboutros boutr...@gmail.com wrote:

 Dear all,

 we are currently using Solr 4.3.1 in production (with SolrCloud).

 We encounter much the same problem as described in this older post:


 http://lucene.472066.n3.nabble.com/SolrCloud-CloudSolrServer-Zookeeper-disconnects-and-re-connects-with-heavy-memory-usage-consumption-td4026421.html

 Sometimes some nodes are disconnected from ZooKeeper and then they try to
 reconnect. The process is quite long because we have a quite long warming
 process. And because of this long warming process, just after the recovery
 process, the node is disconnected again and so on... sometimes until OOM.

 We already increased the ZK timeout. But it is not enough.

 We are thinking of migrating to Solr 4.6.1 at least (perhaps 4.7 will be out
 before the end of the migration :) ).

 I know that a lot of SolrCloud bugs have been corrected since Solr 4.3.1.

 But could we be sure that this problem will be resolved? Or can this
 problem occur with the latest Solr version? (I know this is not an easy
 question ;) )

 It seems that this correction:

 Deadlock while trying to recover after a ZK session expiry:
 https://issues.apache.org/jira/browse/SOLR-5615

 is a good start in addressing our current problem.

 But do you think it will be enough?

 One last thing, I don't know if it is already addressed by a correction,
 but if there are no updates between the disconnection and the reconnection,
 the recovery process should not do anything more than the reconnection, I
 mean: no replication, no tLog replay and no warming process. Is that the
 case?

 Ludovic.



 -
 Jouve
 France.



Re: need help in understating solr cloud stats data

2014-02-05 Thread Ramkumar R. Aiyengar
We have had success with starting up Jolokia in the same servlet container
as Solr, and then using its REST/bulk API to JMX from the application of
your choice.
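
As a sketch of the consuming side (the context path assumes the agent WAR
is deployed under /jolokia, and the mbean here is just the standard heap
one), a read is a plain HTTP GET returning JSON:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class JolokiaReadExample {
  public static void main(String[] args) throws Exception {
    // Jolokia's REST read syntax: /read/<mbean name>/<attribute>
    URL url = new URL("http://solr1:8983/jolokia/read/java.lang:type=Memory/HeapMemoryUsage");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // JSON payload with the attribute value
      }
    } finally {
      in.close();
    }
  }
}
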
On 4 Feb 2014 17:16, Walter Underwood wun...@wunderwood.org wrote:

 I agree that sorting and filtering stats in Solr is not a good idea. There
 is certainly some use in aggregation, though. One request to /admin/mbeans
 replaces about 50 JMX requests.

 Is anybody working on https://issues.apache.org/jira/browse/SOLR-4735?

 wunder

 On Feb 4, 2014, at 8:13 AM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

  +101 for more stats.  Was just saying that trying to pre-aggregate them
  along multiple dimensions is probably best left out of Solr.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Tue, Feb 4, 2014 at 10:49 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
  I think that is silly. We can still offer per-shard stats *and* let a
  user easily see stats for a collection without requiring they jump through
  hoops or use a specific monitoring solution where someone else has already
  jumped through hoops for them.
 
  You don't have to guess what ops people really want - *everyone* wants
  stats that make sense for the collections and cluster on top of the
  per-shard stats. *Everyone* wouldn't mind seeing these without having to
  set up a monitoring solution first.
 
  If you want more than that, then you can fiddle with your monitoring
  solution.
 
  - Mark
 
  http://about.me/markrmiller
 
  On Feb 3, 2014, at 11:10 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:
 
  Hi,
 
  Oh, I just saw Greg's email on dev@ about this.
  IMHO aggregating in the search engine is not the way to do it.  Leave
  that to external tools, which are likely to be more flexible when it
  comes to this.
  For example, our SPM for Solr can do all kinds of aggregations and
  filtering by a number of Solr and SolrCloud-specific dimensions already,
  without Solr having to do any sort of aggregation that it thinks Ops
  people will really want.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Mon, Feb 3, 2014 at 11:08 AM, Mark Miller markrmil...@gmail.com
  wrote:
 
  You should contribute that and spread the dev load with others :)
 
  We need something like that at some point, it's just no one has done it.
  We currently expect you to aggregate in the monitoring layer and it's a
  lot to ask IMO.
 
  - Mark
 
  http://about.me/markrmiller
 
  On Feb 3, 2014, at 10:49 AM, Greg Walters greg.walt...@answers.com
  wrote:
 
  I've had some issues monitoring Solr with the per-core mbeans and ended
  up writing a custom request handler that gets loaded then registers
  itself as an mbean. When called it polls all the per-core mbeans then
  adds or averages them where appropriate before returning the requested
  value. I'm not sure if there's a better way to get JVM-wide stats via JMX
  but it is *a* way to get it done.
 
  Thanks,
  Greg
 
  On Feb 3, 2014, at 1:33 AM, adfel70 adfe...@gmail.com wrote:
 
  I'm sending all Solr stats data to Graphite.
  I have some questions:
  1. query_handler/select requestTime -
  if I'm looking at some metric, let's say 75thPcRequestTime, I see that
  each core in a single collection has different values.
  Is the value for each core the time that specific core spent on a
  request? So to get an idea of total request time, should I sum the
  values of all the cores?
 
 
  2. update_handler/commits - does this include auto-commits? Because
  I'm pretty sure I'm not doing any manual commits and yet I see a number
  there.
 
  3. update_handler/docs pending - what does this mean? Pending for
  what? For flush to disk?
 
  thanks.
 
 
 
 
 
 
 
 

 --
 Walter Underwood
 wun...@wunderwood.org






Re: Removing last replica from a SolrCloud collection

2014-02-02 Thread Ramkumar R. Aiyengar
There's already an issue for this,
https://issues.apache.org/jira/browse/SOLR-5209; we were once bitten by the
same issue when we were trying to relocate a shard. As Mark mentions, the
idea was to do this in zk truth mode; the link also references where that
work is being done.
On 31 Jan 2014 23:10, David Smiley (@MITRE.org) dsmi...@mitre.org wrote:

 Hi,

 If I issue either a core UNLOAD command or a collection DELETEREPLICA
 command (which both seem pretty much equivalent), it works, but if there
 are no other replicas for the shard, then the metadata for the shard is
 completely gone from clusterstate.json!  That's pretty disconcerting
 because you're basically hosed.  Of course, why would I even want to do
 that?  Well, I'm experimenting with ways to restore a backed-up replica to
 replace existing data for the shard.

 If this is unexpected behavior then I'll file a bug.

 ~ David



 -
  Author:
 http://www.packtpub.com/apache-solr-3-enterprise-search-server/book



Re: Solr limitations

2013-07-10 Thread Ramkumar R. Aiyengar
I understand, thanks. I just wanted to check in case there were scalability
limitations with how SolrCloud operates..
On 9 Jul 2013 12:45, Erick Erickson erickerick...@gmail.com wrote:

 I think Jack was mostly thinking in slam-dunk terms. I know of
 SolrCloud demo clusters with 500+ nodes, and at that point
 people said "it's going to work for our situation, we don't need
 to push more".

 As you start getting into that kind of scale, though, you really
 have a bunch of ops considerations etc. Mostly when I get into
 larger scales I pretty much want to examine my assumptions
 and see if they're correct, perhaps start to trim my requirements
 etc.

 FWIW,
 Erick

 On Tue, Jul 9, 2013 at 4:07 AM, Ramkumar R. Aiyengar
 andyetitmo...@gmail.com wrote:
  5. No more than 32 nodes in your SolrCloud cluster.
 
  I hope this isn't too OT, but what tradeoffs is this based on? Would have
  thought it easy to hit this number for a big index and high load (hence
  with the view of both the number of shards and replicas horizontally
  scaling..)
 
  6. Don't return more than 250 results on a query.
 
  None of those is a hard limit, but don't go beyond them unless your
 Proof
  of Concept testing proves that performance is acceptable for your
 situation.
 
  Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary
  tests and then scale as needed.
 
  Dynamic and multivalued fields? Try to stay away from them - except for
  the simplest cases, they are usually an indicator of a weak data model.
  Sure, it's fine to store a relatively small number of values in a
  multivalued field (say, dozens of values), but be aware that you can't
  directly access individual values, you can't tell which was matched on a
  query, and you can't coordinate values between multiple multivalued
 fields.
  Except for very simple cases, multivalued fields should be flattened into
  multiple documents with a parent ID.
 
  Since you brought up the topic of dynamic fields, I am curious how you
  got the impression that they were a good technique to use as a starting
  point. They're fine for prototyping and hacking, and fine when used in
  moderation, but not when used to excess. The whole point of Solr is
  searching and searching is optimized within fields, not across fields, so
  having lots of dynamic fields is counter to the primary strengths of
 Lucene
  and Solr. And... schemas with lots  of dynamic fields tend to be
 difficult
  to maintain. For example, if you wanted to ask a support question here,
 one
  of the first things we want to know is what your schema looks like, but
  with lots of dynamic fields it is not possible to have a simple
 discussion
  of what your schema looks like.
 
  Sure, there is something called schemaless design (and Solr supports
  that in 4.4), but that's very different from heavy reliance on dynamic
  fields in the traditional sense. Schemaless design is A-OK, but using
  dynamic fields for arrays of data in a single document is a poor match
  for the search features of Solr (e.g., Edismax searching across multiple
  fields.)
 
  One other tidbit: Although Solr does not enforce naming conventions for
  field names, and you can put special characters in them, there are plenty
  of features in Solr, such as the common fl parameter, where field names
  are expected to adhere to Java naming rules. When people start going
 wild
  with dynamic fields, it is common that they start going wild with their
  names as well, using spaces, colons, slashes, etc. that cannot be parsed
 in
  the fl and qf parameters, for example. Please don't go there!
 
  In short, put up a small cluster and start doing a Proof of Concept
  cluster. Stay within my suggested guidelines and you should do okay.
 
  -- Jack Krupansky
 
  -Original Message- From: Marcelo Elias Del Valle
  Sent: Monday, July 08, 2013 9:46 AM
  To: solr-user@lucene.apache.org
  Subject: Solr limitations
 
 
  Hello everyone,
 
  I am trying to search for information about possible Solr limitations I
  should consider in my architecture. Things like max number of dynamic
  fields, max number of documents in SolrCloud, etc.
  Does anyone know where I can find this info?
 
  Best regards,
  --
  Marcelo Elias Del Valle
  http://mvalle.com - @mvallebr



Re: Solr limitations

2013-07-09 Thread Ramkumar R. Aiyengar
 5. No more than 32 nodes in your SolrCloud cluster.

I hope this isn't too OT, but what tradeoffs is this based on? Would have
thought it easy to hit this number for a big index and high load (hence
with the view of both the number of shards and replicas horizontally
scaling..)

 6. Don't return more than 250 results on a query.

 None of those is a hard limit, but don't go beyond them unless your Proof
of Concept testing proves that performance is acceptable for your situation.

 Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary
tests and then scale as needed.

 Dynamic and multivalued fields? Try to stay away from them - except for
the simplest cases, they are usually an indicator of a weak data model.
Sure, it's fine to store a relatively small number of values in a
multivalued field (say, dozens of values), but be aware that you can't
directly access individual values, you can't tell which was matched on a
query, and you can't coordinate values between multiple multivalued fields.
Except for very simple cases, multivalued fields should be flattened into
multiple documents with a parent ID.

 Since you brought up the topic of dynamic fields, I am curious how you
got the impression that they were a good technique to use as a starting
point. They're fine for prototyping and hacking, and fine when used in
moderation, but not when used to excess. The whole point of Solr is
searching and searching is optimized within fields, not across fields, so
having lots of dynamic fields is counter to the primary strengths of Lucene
and Solr. And... schemas with lots  of dynamic fields tend to be difficult
to maintain. For example, if you wanted to ask a support question here, one
of the first things we want to know is what your schema looks like, but
with lots of dynamic fields it is not possible to have a simple discussion
of what your schema looks like.

 Sure, there is something called schemaless design (and Solr supports
that in 4.4), but that's very different from heavy reliance on dynamic
fields in the traditional sense. Schemaless design is A-OK, but using
dynamic fields for arrays of data in a single document is a poor match
for the search features of Solr (e.g., Edismax searching across multiple
fields.)

 One other tidbit: Although Solr does not enforce naming conventions for
field names, and you can put special characters in them, there are plenty
of features in Solr, such as the common fl parameter, where field names
are expected to adhere to Java naming rules. When people start going wild
with dynamic fields, it is common that they start going wild with their
names as well, using spaces, colons, slashes, etc. that cannot be parsed in
the fl and qf parameters, for example. Please don't go there!

 In short, put up a small cluster and start doing a Proof of Concept
cluster. Stay within my suggested guidelines and you should do okay.

 -- Jack Krupansky

 -Original Message- From: Marcelo Elias Del Valle
 Sent: Monday, July 08, 2013 9:46 AM
 To: solr-user@lucene.apache.org
 Subject: Solr limitations


 Hello everyone,

I am trying to search for information about possible Solr limitations I
 should consider in my architecture. Things like max number of dynamic
 fields, max number of documents in SolrCloud, etc.
Does anyone know where I can find this info?

 Best regards,
 --
 Marcelo Elias Del Valle
 http://mvalle.com - @mvallebr


Re: whole index in memory

2013-06-01 Thread Ramkumar R. Aiyengar
In general, just increasing the cache sizes to make everything fit in
memory might not always give you the best results. Do keep in mind that the
caches are in Java memory and that incurs the penalty of garbage collection
and other housekeeping Java's memory management might have to do.

Reasonably recent Solr distributions should default to memory mapping your
collections on most platforms. What that means is that if you have
sufficient free memory available on your server for the operating system to
use, it would do the caching for you and that invariably ends up being much
better in terms of performance. From that angle, it's preferable to keep the
caches as small as possible so that the OS has more to cache.

That said, as always, YMMV. The ultimate test in all this is to try it out
with various configurations and see the performance differences for
yourself.
On 1 Jun 2013 01:34, alx...@aim.com wrote:

 Hello,

 I have a Solr index of size 5GB. I am thinking of increasing the cache size
 to 5GB, expecting Solr will put the whole index into memory.

 1. Will Solr indeed put the whole index into memory?
 2. What are the drawbacks of this approach?

 Thanks in advance.
 Alex.