Re: High cpu and gc time when performing optimization.

2016-07-12 Thread Otis Gospodnetic
Heap: start small and increase as necessary. Leave as much RAM as possible for
the FS cache; don't give it to the JVM until it starts crying. SPM for Solr will
help you see when Solr and the JVM are starting to hurt.
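
For illustration, a minimal sketch of the "start small" idea at launch time
(flag values are assumptions to tune against your own index, not
recommendations):

  # modest heap to begin with; raise -Xmx only when GC logs show pressure
  java -Xms1g -Xmx4g \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -Xloggc:/var/log/solr/gc.log \
       -jar start.jar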

Otis

> On Jul 12, 2016, at 11:45, Jason  wrote:
> 
> I'm using optimize because it's an option for fast search.
> Our index updates once or more weekly.
> If I don't use optimize, many index files will be kept.
> Any performance issues in that case?
> 
> And I'm wondering about the relation between index file size and heap size.
> In case of running as a master server that only updates the index,
> is there any guide for heap sizing, including Xmx, NewSize, MaxNewSize, etc.?
> 
> 
> 
> Yonik Seeley wrote
>> Optimize is a very expensive operation.  It involves reading the
>> entire index and merging and rewriting it as a single segment.
>> If you find it too expensive, do it less often, or don't do it at all.
>> It's an optional operation.
>> 
>> -Yonik
>> 
>> 
>> On Mon, Jul 11, 2016 at 10:19 PM, Jason 
> 
>> hialooha@
> 
>>  wrote:
>>> hi, all.
>>> 
>>> I'm running a solr instance with two cores and the JVM max heap is 32G.
>>> Each core index size is 68G and 61G respectively.
>>> I always run an optimize after updating the index.
>>> BTW, last week the document update completed but the optimize phase CPU was
>>> very high.
>>> I think that is because of long GC time.
>>> How should I solve this problem?
>>> Any ideas are welcome.
>>> thanks,
>>> 
>>> 
>>> 
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/High-cpu-and-gc-time-when-performing-optimization-tp4286704.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/High-cpu-and-gc-time-when-performing-optimization-tp4286704p4286796.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing logs in Solr

2016-06-05 Thread Otis Gospodnetic
You can ship SOLR logs to Logsene or any other log management service and not 
worry too much about their storage/size.

Otis

> On Jun 5, 2016, at 02:08, Anil  wrote:
> 
> Hi ,
> 
> I would like to index logs using Solr to enable search on them in our application.
> 
> The problem would be the index and stored size, as log file sizes would go up
> to terabytes.
> 
> Is there any way to use the highlighting feature without storing the fields?
> 
> I found the following link where Benedetti Alessandro mentioned a custom
> highlighter on the url field.
> 
> http://lucene.472066.n3.nabble.com/Highlighting-for-non-stored-fields-td1773015.html
> 
> Any ideas would be helpful. Thanks.
> 
> Cheers,
> Anil


Re: Best way to track cumulative GC pauses in Solr

2015-11-13 Thread Otis Gospodnetic
Hi Tom,

SPM for SOLR should be helpful here. See http://sematext.com/spm

Otis

 

> On Nov 13, 2015, at 10:00, Tom Evans  wrote:
> 
> Hi all
> 
> We have some issues with our Solr servers spending too much time
> paused doing GC. From turning on gc debug, and extracting numbers from
> the GC log, we're getting an idea of just how much of a problem we have.
> 
> I'm currently doing this in a hacky, inefficient way:
> 
> grep -h 'Total time for which application threads were stopped:' solr_gc* \
>| awk '($11 > 0.3) { print $1, $11 }' \
>| sed 's#:.*:##' \
>| sort -n \
>| sum_by_date.py
> 
> (Yes, I really am using sed, grep and awk all in one line. Just wrong :)
> 
> The "sum_by_date.py" program simply adds up all the values with the
> same first column, and remembers the largest value seen. This is
> giving me the cumulative GC time for extended pauses (over 0.3s), and
> the maximum pause seen in a given time period (hourly), eg:
> 
> 2015-11-13T11 119.124037 2.203569
> 2015-11-13T12 184.683309 3.156565
> 2015-11-13T13 65.934526 1.978202
> 2015-11-13T14 63.970378 1.411700
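> 
> (sum_by_date.py isn't shown; assuming it does exactly what's described
> above, i.e. sums column 2 per key and tracks the max, an equivalent awk
> sketch would be:
> 
>   awk '{ sum[$1] += $2; if ($2 > max[$1]) max[$1] = $2 }
>        END { for (k in sum) print k, sum[k], max[k] }' | sort
> )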
> 
> 
> This is fine for seeing that we have a problem. However, really I need
> to get this in to our monitoring systems - we use munin. I'm
> struggling to work out the best way to extract this information for
> our monitoring systems, and I think this might be my naivety about
> Java, and working out what should be logged.
> 
> I've turned on JMX debugging, and looking at the different beans
> available using jconsole, but I'm drowning in information. What would
> be the best thing to monitor?
> 
> Ideally, like the stats above, I'd like to know the cumulative time
> spent paused in GC since the last poll, and the longest GC pause that
> we see. munin polls every 5 minutes, are there suitable counters
> exposed by JMX that it could extract?
> 
> Thanks in advance
> 
> Tom
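
Regarding the JMX counters asked about above: the GarbageCollector MXBeans
(java.lang:type=GarbageCollector,name=*) expose cumulative CollectionCount
and CollectionTime attributes, so a poller can diff them between runs. A
lower-tech sketch that skips JMX entirely (the PID lookup is an assumption
about how Solr was started):

  # the last column of jstat -gc output is GCT, cumulative GC time in
  # seconds; a munin plugin can report the delta since the previous poll
  jstat -gc $(pgrep -f start.jar) | awk 'NR==2 { print $NF }'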


Re: Best strategy for logging security

2015-06-02 Thread Otis Gospodnetic
Logstash is open-source and free.  At some point Sematext contributed Solr
connector/output to Logstash.  Here are some numbers about Logstash (and
rsyslog, which is also an option, though it doesn't have Solr output):
http://blog.sematext.com/2015/05/18/tuning-elasticsearch-indexing-pipeline-for-logs/

If you are new to Logstash, this is a good one:
http://blog.sematext.com/2013/12/19/getting-started-with-logstash/

Note: Solr was mentioned as the destination for logs here, but it's not the
only option.  You can send your logs to other systems and services,
including off-site ones, those that also archive your old logs for audit or
other purposes, have more than just basic log search functionality, etc.

HTH

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Jun 1, 2015 at 4:47 PM, Vishal Swaroop vishal@gmail.com wrote:

 Thanks Rajesh... just trying to figure out if *logstash* is open source and
 free?

 On Mon, Jun 1, 2015 at 2:13 PM, Rajesh Hazari rajeshhaz...@gmail.com
 wrote:

  Logging :
 
  Just use logstash to parse your logs for all collections, and logstash
  forwarder with lumberjack at your solr replicas in your solr cloud to send
  the log events to your central logstash server, which sends them back to solr
  (either the same or a different instance) into a different collection.
 
  The default log4j.properties that comes with solr dist can log core name
  with each query log.
 
  Security:
  suggest you to go through this wiki
  https://wiki.apache.org/solr/SolrSecurity
 
  *Thanks,*
  *Rajesh,*
  *(mobile) : 8328789519.*
 
  On Mon, Jun 1, 2015 at 11:20 AM, Vishal Swaroop vishal@gmail.com
  wrote:
 
   It will be great if you can provide your valuable inputs on strategy
 for
   logging  security...
  
  
   Thanks a lot in advance...
  
  
  
   Logging :
  
   - Is there a way to implement logging for each cores separately.
  
   - What will be the best strategy to log every query details (like
 source
   IP, search query, etc.) at some point we will need monthly reports for
   analysis.
  
  
  
   Securing SOLR :
  
   - We need to implement SOLR security from client as well as server
  side...
   requests will be performed via web app as well as other server side
 apps
   e.g. curl...
  
   Please suggest about the best approach we can follow... link to any
   documentation will also help.
  
  
  
   Environment : SOLR 4.7 configured on Tomcat 7  (Linux)
  
 



Re: Solr Performance with Ram size variation

2015-04-17 Thread Otis Gospodnetic
Hi,

Because you went over the 31-32 GB heap threshold, you lost the benefit of
compressed pointers, and even though you gave the JVM more memory the GC may
have had to work harder.  This is a relatively well-educated guess, which you
can confirm if you run tests and look at GC counts, times, JVM heap memory pool
utilization, etc.
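
For illustration, whether a given -Xmx still gets compressed oops can be
checked directly (heap values here are just examples):

  java -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops
  java -Xmx40g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

The first typically prints true and the second false; that flip is the
31-32 GB boundary in question.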

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Apr 17, 2015 at 10:14 PM, Kamal Kishore Aggarwal 
kkroyal@gmail.com wrote:

 Hi,

 As per this article, the Linux machine should preferably have RAM equal to 1.5
 times the index size. So, to verify this, I tried testing the Solr
 performance with different amounts of RAM allocation, keeping the other
 configuration (i.e. solid state drives, 8-core processor, 64-bit) the same
 in both cases. I am using Solr 4.8.1 with Tomcat server.

 https://wiki.apache.org/solr/SolrPerformanceProblems

 1) Initially, the linux machine had 32 GB RAM, out of which I allocated
 14GB to solr.

 export CATALINA_OPTS="-Xms2048m -Xmx14336m -XX:+UseConcMarkSweepGC
 -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails
 -XX:+PrintGCTimeStamps -Xloggc:./logs/info_error/tomcat_gcdetails.log"

 The average search time for 1000 queries was 300ms.

 2) After that, RAM was increased to 68 GB, out of which I allocated 40GB to
 Solr. Now, on a strange note, the average search time for the same set of
 queries was 3000ms.

 Now, after this, I reduced the Solr-allocated RAM to 25GB on the 68GB machine.
 But still the search time was higher compared to the first case.

 What am I missing? Please suggest.



Re: Measuring QPS

2015-04-06 Thread Otis Gospodnetic
Hi Daniel,

See SPM http://sematext.com/spm/, which will give you QPS and a bunch of
other Solr, JVM, and OS metrics, along with alerting, anomaly detection,
and not-yet-announced transaction tracing
https://sematext.atlassian.net/wiki/display/PUBSPM/Transactions+Tracing.
It has percentiles Wunder mentions.  I see others mentioned JMeter.  We use
SPM with JMeter pretty regularly when helping clients with Solr performance
issues.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Apr 3, 2015 at 11:37 AM, Davis, Daniel (NIH/NLM) [C] 
daniel.da...@nih.gov wrote:

 I wanted to gather QPS for our production Solr instances, but I was
 surprised that the Admin UI did not contain this information.   We are
 running a mix of versions, but mostly 4.10 at this point.   We are not
 using SolrCloud at present; that's part of why I'm checking - I want to
 validate the size of our existing setup and what sort of SolrCloud setup
 would be needed to centralize several of them.

 What is the best way to gather QPS information?

 What is the best way to add information like this to the Admin UI, if I
 decide to take that step?

 Dan Davis, Systems/Applications Architect (Contractor),
 Office of Computer and Communications Systems,
 National Library of Medicine, NIH




Re: Best way to monitor Solr regarding crashes

2015-03-28 Thread Otis Gospodnetic
Hi Michael ,

SPM - http://sematext.com/spm will help. It can monitor all SOLR and JVM 
metrics and alert you when their values cross thresholds or become abnormal. In 
your case I'd first look at the JVM metrics - memory pools and their 
utilization. Heartbeat alert will notify you when your server(s) become 
unresponsive without you having to ping them. Solr logs will also likely have 
clues.

Otis
 

 On Mar 28, 2015, at 09:45, Michael Bakonyi m.bako...@civit.de wrote:
 
 Hi,
 
 we were using Solr for about 3 months without problems until a few days ago 
 it crashed one time and we don't know why. After a restart everything was 
 fine again but we want to be better prepared the next time this could happen. 
 So I'd like to know what's the best way to monitor a single Solr-instance and 
 what logging-configuration you think is useful for this kind of monitoring. 
 Maybe there's a possibility to automatically restart Solr after it crashed + 
 to see in detail in the logs what happened right before the crash..?
 
 Can you give me any hints? We're using Tomcat 6.X with Solr 4.8.X
 
 Cheers,
 Michael


Re: Solr Monitoring - Stored Stats?

2015-03-26 Thread Otis Gospodnetic
Matt,

SPM will give you all that out of the box with alerts, anomaly detection etc. 
See http://sematext.com/spm

Otis

 

 On Mar 25, 2015, at 11:26, Matt Kuiper matt.kui...@issinc.com wrote:
 
 Hello,
 
 I am familiar with the JMX points that Solr exposes to allow for monitoring 
 of statistics like QPS, numdocs, Average Query Time...
 
 I am wondering if there is a way to configure Solr to automatically store the 
 value of these stats over time (for a given time interval), and then allow a 
 user to query a stat over a time range.  So for the QPS stat,  the query 
 might return a set that includes the QPS value for each hour in the time 
 range specified.
 
 Thanks,
 Matt
 
 


Re: How To Remove an Alert

2015-03-23 Thread Otis Gospodnetic
Hi,

I think this may have been for Sematext SPM http://sematext.com/spm/ for
Solr monitoring and Jack got our help a few hours ago.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Mar 23, 2015 at 7:46 PM, Erick Erickson erickerick...@gmail.com
wrote:

 What product? What alert? This doesn't sound like straight Solr.

 There is zero context here to help us help you...

 Please review:
 http://wiki.apache.org/solr/UsingMailingLists

 Best,
 Erick

 On Mon, Mar 23, 2015 at 1:37 PM, jack.met...@hp.com
 st.comm.c...@gmail.com wrote:
  Hello,
 
   I have a problem: I just created an alert but I set the threshold too
  low. Is there a way to edit or remove the alert?



Re: backport Heliosearch features to Solr

2015-03-01 Thread Otis Gospodnetic
Hi Yonik,

Now that you joined Cloudera, why not everything?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley ysee...@gmail.com wrote:

 As many of you know, I've been doing some work in the experimental
 heliosearch fork of Solr over the past year.  I think it's time to
 bring some more of those changes back.

 So here's a poll: Which Heliosearch features do you think should be
 brought back to Apache Solr?

 http://bit.ly/1E7wi1Q
 (link to google form)

 -Yonik



Re: how to debug solr performance degradation

2015-02-25 Thread Otis Gospodnetic
Lots of suggestions here already.  +1 for those JVM params from Boogie and
for looking at JMX.
Rebecca, try SPM http://sematext.com/spm (will look at JMX for you, among
other things), it may save you time figuring out
JVM/heap/memory/performance issues.  If you can't tell what's slow via SPM,
we can have a look at your metrics (charts are sharable) and may be able to
help you faster than guessing.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Feb 25, 2015 at 4:27 PM, Erick Erickson erickerick...@gmail.com
wrote:

 Before diving in too deeply, try attaching debug=timing to the query.
 Near the bottom of the response there'll be a list of the time taken
 by each _component_. So there'll be separate entries for query,
 highlighting, etc.

 This may not show any surprises, you might be spending all your time
 scoring. But it's worth doing as a check and might save you from going
 down some dead-ends. I mean if your query winds up spending 80% of its
 time in the highlighter you know where to start looking..

 Best,
 Erick
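 
  A quick way to try that (host, collection name, and query are placeholders):
 
    curl 'http://localhost:8983/solr/collection1/select?q=foo&debug=timing&wt=json'
 
  The timing section of the response breaks out prepare/process time per
  search component (query, facet, highlight, etc.).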


 On Wed, Feb 25, 2015 at 12:01 PM, Boogie Shafer
 boogie.sha...@proquest.com wrote:
  rebecca,
 
  you probably need to dig into your queries, but if you want to
 force/preload the index into memory you could try doing something like
 
   cat `find /path/to/solr/index` > /dev/null
 
 
  if you haven't already reviewed the following, you might take a look here
  https://wiki.apache.org/solr/SolrPerformanceProblems
 
   perhaps going back to a very vanilla/default solr configuration and
  building back up from that baseline to better isolate what specific
  setting might be impacting your environment
 
  
  From: Tang, Rebecca rebecca.t...@ucsf.edu
  Sent: Wednesday, February 25, 2015 11:44
  To: solr-user@lucene.apache.org
  Subject: RE: how to debug solr performance degradation
 
  Sorry, I should have been more specific.
 
   I was referring to the solr admin UI page. Today we started up an AWS
   instance with 240 G of memory to see if fitting all of our index (183G) in
   memory, with enough left for the JVM, could improve the performance.
 
  I attached the admin UI screen shot with the email.
 
   The top bar is "Physical Memory" and we have 240.24 GB, but only 4%, 9.52
   GB, is used.
 
   The next bar is Swap Space and it's at 0.00 MB.
 
  The bottom bar is JVM Memory which is at 2.67 GB and the max is 26G.
 
  My understanding is that when Solr starts up, it reserves some memory for
  the JVM, and then it tries to use up as much of the remaining physical
  memory as possible.  And I used to see the physical memory at anywhere
  between 70% to 90+%.  Is this understanding correct?
 
  And now, even with 240G of memory, our index is performing at 10 - 20
   seconds for a query.  Granted that our queries have fq's and highlighting
  and faceting, I think with a machine this powerful I should be able to
 get
  the queries executed under 5 seconds.
 
  This is what we send to Solr:
  q=(phillip%20morris)
  wt=json
  start=0
  rows=50
  facet=true
  facet.mincount=0
  facet.pivot=industry,collection_facet
  facet.pivot=availability_facet,availabilitystatus_facet
  facet.field=dddate
 
 fq%3DNOT(pg%3A1%20AND%20(dt%3A%22blank%20document%22%20OR%20dt%3A%22blank%
 
 20page%22%20OR%20dt%3A%22file%20folder%22%20OR%20dt%3A%22file%20folder%20be
 
 gin%22%20OR%20dt%3A%22file%20folder%20cover%22%20OR%20dt%3A%22file%20folder
 
 %20end%22%20OR%20dt%3A%22file%20folder%20label%22%20OR%20dt%3A%22file%20she
 
 et%22%20OR%20dt%3A%22file%20sheet%20beginning%22%20OR%20dt%3A%22tab%20page%
  22%20OR%20dt%3A%22tab%20sheet%22))
  facet.field=dt_facet
  facet.field=brd_facet
  facet.field=dg_facet
  hl=true
  hl.simple.pre=%3Ch1%3E
  hl.simple.post=%3C%2Fh1%3E
  hl.requireFieldMatch=false
  hl.preserveMulti=true
  hl.fl=ot,ti
  f.ot.hl.fragsize=300
  f.ot.hl.alternateField=ot
  f.ot.hl.maxAlternateFieldLength=300
  f.ti.hl.fragsize=300
  f.ti.hl.alternateField=ti
  f.ti.hl.maxAlternateFieldLength=300
  fq={!collapse%20field=signature}
  expand=true
  sort=score+desc,availability_facet+asc
 
 
   My guess is that it's performing so badly because it's only using 4% of
  the memory? And searches require disk access.
 
 
  Rebecca
  
  From: Shawn Heisey [apa...@elyograg.org]
  Sent: Tuesday, February 24, 2015 5:23 PM
  To: solr-user@lucene.apache.org
  Subject: Re: how to debug solr performance degradation
 
  On 2/24/2015 5:45 PM, Tang, Rebecca wrote:
  We gave the machine 180G mem to see if it improves performance.
 However,
  after we increased the memory, Solr started using only 5% of the
 physical
  memory.  It has always used 90-something%.
 
  What could be causing solr to not grab all the physical memory (grabbing
  so little of the physical memory)?
 
  I would like to know what memory numbers in which program you are
  looking at, and why you 

Re: Confirm Solr index corruption

2015-02-18 Thread Otis Gospodnetic
Hi,

It sounds like Solr simply could not index some docs.  The index is not
corrupt, it's just that indexing was failing while disk was full.  You'll
need to re-send/re-add/re-index the missing docs (or simply all of them if
you don't know which ones are missing).
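
One low-tech way to double-check, since CheckIndex came back clean: compare
Solr's reported live doc count against your system of record (host and core
name are placeholders):

  curl 'http://master:8983/solr/core1/select?q=*:*&rows=0&wt=json'

numFound in the response counts live (non-deleted) documents.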

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Feb 18, 2015 at 1:52 AM, Thomas Mathew mothas.tho...@gmail.com
wrote:

 Hi All,

 I use Solr 4.4.0 in a master-slave configuration. Last week, the master
 server ran out of disk (logs got too big too quick due to a bug in our
 system). Because of this, we weren't able to add new docs to an index. The
 first thing I did was to delete a few old log files to free up disk space
 (later I moved the other logs to free up disk). The index is working fine
 even after this fiasco.

 The next day, a colleague of mine pointed out that we may be missing a few
 documents in the index. I suspect the above scenario may have broken the
  index. I ran the checkIndex against this index. It didn't mention any
 corruption though.

 Right now, the index has about 25k docs. I haven't optimized this index in
 a while, and there are about 4000 deleted-docs. How can I confirm if we
 lost anything? If we've lost docs, is there a way to recover it?

 Thanks in advance!!

 Regards
 Thomas



Re: 43sec commit duration - blocked by index merge events?

2015-02-13 Thread Otis Gospodnetic
Check http://search-lucene.com/?q=commit+wait+block&fc_type=mail+_hash_+user

e.g. http://search-lucene.com/m/QTPa7Sqx81

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Feb 13, 2015 at 8:50 AM, Gili Nachum gilinac...@gmail.com wrote:

 Thanks Otis, can you confirm that a commit call will wait for merges to
 complete before returning?

 On Thu, Feb 12, 2015 at 8:46 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  If you are using Solr and SPM for Solr, you can check a report that shows
  the # of files in an index and the report that shows you the max docs-num
  docs delta.  If you see the # of files drop during a commit, that's a
  merge.  If you see a big delta change, that's probably a merge, too.
 
  You could also jstack or kill -3 the JVM and see where it's spending its
  time to give you some ideas what's going on inside.
 
  HTH.
 
  Otis
  --
  Monitoring * Alerting * Anomaly Detection * Centralized Log Management
  Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Sun, Feb 8, 2015 at 6:48 AM, Gili Nachum gilinac...@gmail.com
 wrote:
 
   Hello,
  
   During a load test I noticed a commit that took 43 seconds to complete
   (client hard complete).
   Is this to be expected? What's causing it?
   I have a pair of machines hosting a 128M docs collection (8 shards,
   replication factor=2).
  
   Could it be merges? In Lucene merges happen async of commit statements,
  but
    reading Solr's doc for Update Handler
   
  
 
 https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig
   
   it sounds like hard commits do wait for merges to occur: * The
 tradeoff
  is
   that a soft commit gives you faster visibility because it's not waiting
  for
   background merges to finish.*
   Thanks.
  
 



Re: How to make SolrCloud more elastic

2015-02-13 Thread Otis Gospodnetic
Hi Matt,

See:
http://search-lucene.com/?q=query+routing&fc_project=Solr
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 12, 2015 at 2:09 PM, Matt Kuiper matt.kui...@issinc.com wrote:

 Otis,

 Thanks for your reply.  I see your point about too many shards and search
 efficiency.  I also agree that I need to get a better handle on customer
 requirements and expected loads.

 Initially I figured that with the shard splitting option, I would need to
 double my Solr nodes every time I split (as I would want to split every
 shard within the collection).  Where actually only the number of shards
 would double, and then I would have the opportunity to rebalance the shards
 over the existing Solr nodes plus a number of new nodes that make sense at
 the time.  This may be preferable to defining many micro shards up front.

 The time-base collections may be an option for this project.  I am not
 familiar with query routing, can you point me to any documentation on how
 this might be implemented?

 Thanks,
 Matt

 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Wednesday, February 11, 2015 9:13 PM
 To: solr-user@lucene.apache.org
 Subject: Re: How to make SolrCloud more elastic

 Hi Matt,

 You could create extra shards up front, but if your queries are fanned out
 to all of them, you can run into situations where there are too many
  concurrent queries per node causing lots of context switching and
 ultimately being less efficient than if you had fewer shards.  So while
 this is an approach to take, I'd personally first try to run tests to see
 how much a single node can handle in terms of volume, expected query rates,
 and target latency, and then use monitoring/alerting/whatever-helps tools
 to keep an eye on the cluster so that when you start approaching the target
 limits you are ready with additional nodes and shard splitting if needed.

 Of course, if your data and queries are such that newer documents are
  queried more, you should look into time-based collections... and if your
 queries can only query a subset of data you should look into query routing.

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/


 On Wed, Feb 11, 2015 at 3:32 PM, Matt Kuiper matt.kui...@issinc.com
 wrote:

  I am starting a new project and one of the requirements is that Solr
  must scale to handle increasing load (both search performance and index
 size).
 
  My understanding is that one way to address search performance is by
  adding more replicas.
 
  I am more concerned about handling a growing index size.  I have
  already been given some good input on this topic and am considering a
  shard splitting approach, but am more focused on a rebalancing
  approach that includes defining many shards up front and then moving
  these existing shards on to new Solr servers as needed.  Plan to
  experiment with this approach first.
 
  Before I got too deep, I wondered if anyone has any tips or warnings
  on these approaches, or has scaled Solr in a different manner.
 
  Thanks,
  Matt
 



Re: Solr scoring confusion

2015-02-13 Thread Otis Gospodnetic
Hi Scott,

Try optimizing after reindexing and this should go away. Had to do with 
updated/deleted docs participating in score computation.
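
For example, an optimize can be triggered with (core name is a placeholder):

  curl 'http://localhost:8983/solr/core1/update?optimize=true'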

Otis
 

 On Feb 13, 2015, at 18:29, Scott Johnson sjohn...@dag.com wrote:
 
 We are getting inconsistent scoring results in Solr. It works about 95% of
 the time, where a search on one term returns the results which equal exactly
 that one term at the top, and results with multiple terms that also contain
 that one term are returned lower. Occasionally, however, if a subset of the
 data has been re-indexed (the same data just added to the index again) then
 the results will be slightly off, for example the data from the earlier
 index will get a higher score than it should, until we re-index all the
 data.
 
 
 
 Our assumption here is that setting omitNorms to false, then indexing the
 data, then searching, should result in scores where the data with an exact
 match has a higher score. We usually see this but not always. Is something
 added to the score besides the value that is being searched that we are not
  understanding?
 
 
 
 Thanks.
 
 ..
 Scott Johnson
 Data Advantage Group, Inc.
 
 604 Mission Street 
 San Francisco, CA 94105 
 Office:   +1.415.947.0400 x204
 Fax:  +1.415.947.0401
 
 Take the first step towards a successful
 meta data initiative with MetaCenter - 
 the only plug and play, real-time 
  meta data solution. http://www.dag.com/
 ..
 
 
 


Re: Multi-tenancy and guarantee of service per application (tenant)

2015-02-12 Thread Otis Gospodnetic
Not really, not 100%, if tenants share the same hardware and there is no
isolation through things like containers (in which case they don't share
the same SolrCloud cluster, really).

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 12, 2015 at 11:17 AM, Victor Rondel rondelvic...@gmail.com
wrote:

 Hi everyone,

  I am wondering about multi-tenancy and guarantee of service in SolrCloud:

  *Multi-tenant cluster*: Is there a way to *guarantee a level of service* /
  capacity planning for *each tenant* using the cluster (its *own
  collections*)?


 Thanks,



Re: 43sec commit duration - blocked by index merge events?

2015-02-12 Thread Otis Gospodnetic
If you are using Solr and SPM for Solr, you can check a report that shows
the # of files in an index and the report that shows you the max docs-num
docs delta.  If you see the # of files drop during a commit, that's a
merge.  If you see a big delta change, that's probably a merge, too.

You could also jstack or kill -3 the JVM and see where it's spending its
time to give you some ideas what's going on inside.
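
For example (the PID lookup is an assumption about how Solr was started):

  jstack $(pgrep -f start.jar) > /tmp/solr-threads-$(date +%s).txt

A few of these taken seconds apart during a slow commit usually show where
the time is going.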

HTH.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Sun, Feb 8, 2015 at 6:48 AM, Gili Nachum gilinac...@gmail.com wrote:

 Hello,

 During a load test I noticed a commit that took 43 seconds to complete
 (client hard complete).
 Is this to be expected? What's causing it?
 I have a pair of machines hosting a 128M docs collection (8 shards,
 replication factor=2).

 Could it be merges? In Lucene merges happen async of commit statements, but
  reading Solr's doc for Update Handler
 
 https://cwiki.apache.org/confluence/display/solr/UpdateHandlers+in+SolrConfig
 
 it sounds like hard commits do wait for merges to occur: * The tradeoff is
 that a soft commit gives you faster visibility because it's not waiting for
 background merges to finish.*
 Thanks.



Re: Solrcloud performance issues

2015-02-12 Thread Otis Gospodnetic
Hi,

Did you say you have 150 servers in this cluster?  And 10 shards for just
90M docs?  If so, that 150 hosts sounds like too much for all other numbers
I see here.  I'd love to see some metrics here.  e.g. what happens with
disk IO around those commits?  How about GC time/size info?  Are JVM memory
pools full-ish and is the CPU jumping like crazy?  Can you share more info
to give us a more complete picture of your system? SPM for Solr
http://sematext.com/spm/ will help if you don't already capture these
types of things.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 12, 2015 at 11:07 AM, Vijay Sekhri sekhrivi...@gmail.com
wrote:

 Hi Erick,
 We have following configuration of our solr cloud

1. 10 Shards
2. 15 replicas per shard
3. 9 GB of index size per shard
4. a total of around 90 mil documents
    5. 2 collections, viz. search1 serving live traffic and search2 for
    indexing. We swap collections when indexing finishes
    6. On 150 hosts we have 2 JVMs running, one for the search1 collection and
    the other for the search2 collection
7. Each jvm has 12 GB of heap assigned to it while the host has 50GB in
total
8. Each host has 16 processors
9. Linux XXX 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12 00:41:43
UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
10. We have two ways to index data.
    1. Bulk indexing. All 90 million docs pumped in from 14 parallel
    processes (on 14 different client hosts). This is done on the
    collection that is not serving live traffic.
    2. Incremental indexing. Only delta changes (ranging from 100K to 5
    mil) every two hours. This is done on the collection also serving
    live traffic.
11. The request per second count on live collection is around 300 TPS
    12. Hard commit setting is every 30 seconds with openSearcher false and
    soft commit setting is every 15 minutes. We have tried a lot of different
    settings here BTW.




 Now we have two issues with indexing
 1) Solr just could not keep up with the bulk indexing when replicas are
 also active. We have concluded this by changing the number of replicas to
  just 2, to 4 and then to 15. When the number of replicas increases, the
  bulk indexing time increases almost exponentially.
 We seem to have encountered the same issue reported here
 https://issues.apache.org/jira/browse/SOLR-6816
  It gets to a point that even to index 100 docs the solr cluster would take
  300 seconds. It would start off indexing 100 docs in 55 milliseconds and
  slowly increase over time, and within an hour and a half just could not keep
  up. We have a workaround for this, i.e. we stop all the replicas, do the
  bulk indexing and bring all the replicas up one by one. This sort of
 defeats the purpose of solr cloud but we can still work with this
 workaround. We can do this because , bulk indexing happen on the collection
 that is not serving live traffic. However we would love to have a solution
  from the solr cloud itself, like asking it to stop replication and restart it
  via an API at the end of indexing.

 2) This issues is related to soft commit with incremental indexing . When
 we do incremental indexing, it is done on the same collection serving live
 traffic with 300 request per second throughput.  Everything is fine except
 whenever the soft commit happens. Each time soft commit (autosoftcommit in
  solrconfig.xml) happens, which BTW happens almost at the same time
 throughout the cluster , there is a spike in the response times and
 throughput decreases almost to 150 tps. The spike continues for 2 minutes
 and then it happens again at the exact interval when the soft commit
 happens. We have monitored the logs and found a direct co relation when the
 soft commit happens and when the response time tanks.

  Now the latter issue is quite disturbing, because it is serving live
  traffic and we cannot sustain these periodic degradations. We have played
  around with different soft commit settings: intervals ranging from 2 minutes
  to 30 minutes; auto-warming half the cache, auto-warming the full cache,
  auto-warming only 10%; doing warm-up queries on every new searcher, doing no
  warm-up queries on every new searcher. All the different settings yield
  the same results. As and when the soft commit happens, the response time tanks
  and throughput decreases. The difference is almost 50% in response times
  and 50% in throughput


  Our workaround for this issue is to also do incremental delta indexing
 on the collection not serving live traffic and swap when it is done. As you
  can see, this also defeats the purpose of solr cloud. We cannot do
  bulk indexing because replicas cannot keep up and we cannot do incremental
 indexing because of soft commit performance.

 Is there a way to make the cluster not do soft commit all at the same time
 or is there a way to make soft commit not cause this degradation ?
 We are open 

Re: Solr 4.10.x on Oracle Java 1.8.x ?

2015-02-11 Thread Otis Gospodnetic
Hi Jakov,

We've been running Solr with Java 8 for several months without issues.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Feb 10, 2015 at 3:03 PM, Jakov Sosic jso...@gmail.com wrote:

 Hi guys,

 at the end of April Java 1.7 will be obsoleted, and Oracle will stop
 updating it.

 Is it safe to run Tomcat7 / Solr 4.10 on Java 1.8? Did anyone tried it
 already?



Re: Multi words query

2015-02-11 Thread Otis Gospodnetic
Hi,

Can you share details about how exactly you are querying Solr?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Feb 11, 2015 at 5:21 AM, melb melaggo...@gmail.com wrote:

 Hi,
 I have a solr collection which I use to index some documents ( title,
 description, body)
 and I can search it well with solr query when it is a single word query
 When I search for multi words, the result is not satisfactory because I get
 some results with high scores with only one word of the query while
 documents with all terms are scored poorly

  How can I query the solr collection with multiple words and get documents with
  all query terms first, while at the same time keeping the other documents too?

 rgds



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Multi-words-query-tp4185625.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to make SolrCloud more elastic

2015-02-11 Thread Otis Gospodnetic
Hi Matt,

You could create extra shards up front, but if your queries are fanned out
to all of them, you can run into situations where there are too many
concurrent queries per node causing lots of context switching and
ultimately being less efficient than if you had fewer shards.  So while
this is an approach to take, I'd personally first try to run tests to see
how much a single node can handle in terms of volume, expected query rates,
and target latency, and then use monitoring/alerting/whatever-helps tools
to keep an eye on the cluster so that when you start approaching the target
limits you are ready with additional nodes and shard splitting if needed.

Of course, if your data and queries are such that newer documents are
queried more, you should look into time-based collections... and if your
queries can only query a subset of data you should look into query routing.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Feb 11, 2015 at 3:32 PM, Matt Kuiper matt.kui...@issinc.com wrote:

 I am starting a new project and one of the requirements is that Solr must
 scale to handle increasing load (both search performance and index size).

 My understanding is that one way to address search performance is by
 adding more replicas.

 I am more concerned about handling a growing index size.  I have already
 been given some good input on this topic and am considering a shard
 splitting approach, but am more focused on a rebalancing approach that
 includes defining many shards up front and then moving these existing
 shards on to new Solr servers as needed.  Plan to experiment with this
 approach first.

 Before I got too deep, I wondered if anyone has any tips or warnings on
 these approaches, or has scaled Solr in a different manner.

 Thanks,
 Matt



Re: Solrcloud (to HDFS) poor indexing performance

2015-02-10 Thread Otis Gospodnetic
Hi Tim,

Although I doubt Kafka is the problem, I'd look at that first and eliminate
that.

What about those Flume agents?  How are they behaving in terms of CPU/GC,
and such?
You have 18 Solr nodes. what happens if you increase the number of
Flume sinks?

Are you seeing anything specific that makes you think the problem is on the
Solr side?  Can you share charts that show your GC activity, disk IO, etc.?
 (you can share them easily with SPM http://sematext.com/spm, which may
help others help you more easily)

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Feb 3, 2015 at 7:47 PM, Tim Smith secs...@gmail.com wrote:

 Hi,

 I have a SolrCloud (Solr 4.4, writing to HDFS on CDH-5.3) collection
 configured to be populated by flume Morphlines sink. The flume agent reads
 data from Kafka and writes to the Solr collection.

 The issue is that Solr indexing rate is abysmally poor (~6k docs/sec at
 best, dips to a few hundred per sec) across the cluster. The incoming
 data/document rate is about 30-40k/second.

 I have gone wide/thin with 18 nodes and each with 8GB (Java) + 4GB
 (non-heap) memory and narrow/thick with current set of 5 dedicated nodes
 each with 36GB (Java) and 16GB (non-heap) memory (18 shards with the former
 config and 5 shards, right now).

 On the flume side, I have gone from 5 flume instances, each with a single
 sink to 5 sinks for each flume instance. I have tweaked batchSize and
 batchDuration.

 I checked ZooKeeper loads and don't see it stressed. Neither are the
 datanodes. On the Solr nodes, solr is consuming all the allocated memory
 (32GB) but I don't see solr hitting any CPU limits.

 *But*, indexing rate stubbornly stays at ~6k docs/sec. When I bounce the
 flume agent, it jumps up momentarily to several hundreds of thousands but
 then comes down to ~6k/sec and the flume channels get saturated within
 seconds.

 Any clues/pointers for troubleshooting will be appreciated?


 Thanks,

 Tim



Re: Garbage Collection tuning - G1 is now a good option

2015-01-07 Thread Otis Gospodnetic
Not sure about AggressiveOpts, but G1 has been working for us nicely.
We've successfully used it with HBase, Hadoop, Elasticsearch, and other
custom Java apps (all still Java 7, but Java 8 should be even better).  Not
sure if we are using it on our Solr instances.

e.g. see http://blog.sematext.com/2013/06/24/g1-cms-java-garbage-collector/

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Jan 1, 2015 at 8:35 PM, William Bell billnb...@gmail.com wrote:

 But tons of people on this mailing list do not recommend AggressiveOpts

 Why do you recommend it?

 On Thu, Jan 1, 2015 at 12:10 PM, Shawn Heisey apa...@elyograg.org wrote:

  I've been working with Oracle employees to find better GC tuning
  options.  The results are good enough to share with the community:
 
  https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
 
  With the latest Java 7 or Java 8 version, and a couple of tuning
  options, G1GC has grown up enough to be a viable choice.  Two of the
  settings on that list were critical for making the performance
  acceptable with my testing: ParallelRefProcEnabled and G1HeapRegionSize.
 
  I've included some notes on the wiki about how you can size the G1 heap
  regions appropriately for your own index.
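
   For reference, a sketch of the kind of settings meant above (the region
   size must be tuned to your own index; 8m is only an assumption):

     -XX:+UseG1GC -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m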
 
  Thanks,
  Shawn
 
 


 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076



Re: .htaccess / password

2015-01-06 Thread Otis Gospodnetic
Hi Craig,

If you want to protect Solr, put it behind something like Apache / Nginx /
HAProxy and put .htaccess at that level, in front of Solr.
Or try something like
http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/
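
For instance, with nginx or Apache in front, HTTP basic auth only takes a
credentials file referenced from the proxy config (path and user are
illustrative):

  htpasswd -c /etc/nginx/.htpasswd solradmin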

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman choff...@eclimb.net wrote:

  Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/
 will Solr continue to function properly? One thing to note, I will have a
 CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
 looking for a way to secure this area.

 Thanks,
 Craig
 --
 Craig Hoffman
 w: http://www.craighoffmanphotography.com
 FB: www.facebook.com/CraigHoffmanPhotography
 TW: https://twitter.com/craiglhoffman




Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Hi Charles,

See http://search-lucene.com/?q=solr+hdfs and
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
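
The HDFS page above comes down to a few startup properties; roughly (the
NameNode address is a placeholder):

  java -Dsolr.directoryFactory=HdfsDirectoryFactory \
       -Dsolr.lock.type=hdfs \
       -Dsolr.hdfs.home=hdfs://namenode:8020/solr \
       -jar start.jar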

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE charles.val...@edf.fr
wrote:

 I am considering using *Solr* to extend *Hortonworks Data Platform*
 capabilities to search.

 - I found tutorials to index documents into a Solr instance from *HDFS*,
 but I guess this solution would require a Solr cluster distinct to the
 Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop
 cluster instead? - *With the index stored in HDFS?*

 - Where would the processing take place (could it be handed down to
  Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to
 integrate with *Yarn*?

 - What about *SolrCloud*: what does it bring regarding Hadoop based
 use-cases? Does it stand for a Solr-only cluster?

 - Well, if that could lead to something working with a roles-based
  authorization-compliant *Banana*, it would be Christmas again!

 Thanks a lot for any help!

 Charles







Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Oh, and https://issues.apache.org/jira/browse/SOLR-6743

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi Charles,

 See http://search-lucene.com/?q=solr+hdfs and
 https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS

 Otis
 --
 Monitoring * Alerting * Anomaly Detection * Centralized Log Management
 Solr & Elasticsearch Support * http://sematext.com/


 On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE charles.val...@edf.fr
 wrote:

 I am considering using *Solr* to extend *Hortonworks Data Platform*
 capabilities to search.

 - I found tutorials to index documents into a Solr instance from *HDFS*,
 but I guess this solution would require a Solr cluster distinct to the
 Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop
 cluster instead? - *With the index stored in HDFS?*

 - Where would the processing take place (could it be handed down to
  Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to
 integrate with *Yarn*?

 - What about *SolrCloud*: what does it bring regarding Hadoop based
 use-cases? Does it stand for a Solr-only cluster?

 - Well, if that could lead to something working with a roles-based
  authorization-compliant *Banana*, it would be Christmas again!

 Thanks a lot for any help!

 Charles








Re: SolrCloud multi-datacenter failover?

2015-01-05 Thread Otis Gospodnetic
Hi,

Check http://search-lucene.com/?q=%22Cross+Data+Center+Replicaton%22 -
http://issues.apache.org/jira/browse/SOLR-6273

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Jan 2, 2015 at 4:52 PM, jaime spicciati jaime.spicci...@gmail.com
wrote:

 All,

 At my current customer we have developed a custom federator that will
 federate queries between Endeca and Solr to ease the transition from an
 extremely large (TBs of data) Endeca index to Solr. (Endeca is similar to
 Solr in terms of search/faceted navigation/etc).



 During this transition plan we need to support multi datacenter failover
 which we have historically handled via load balancers with the appropriate
 failover configurations (think F5). We are currently playing our dataloads
 into multiple datacenters to ensure data consistency. (Each datacenter has
 a stand-alone instance of solrcloud with its own redundancy/failover)



  I am curious to see how the community handles multi datacenter failover
 at the presentation layer (datacenter A goes down and we want to failover
 to B). Solrcloud within a datacenter will handle single datacenter failure
 within the instance, but in order to support multi datacenter failover I
 haven't seen a definitive ‘answer’ as to how to handle this situation.



 At this point the only two options I can come up with are

 1) Fail the entire datacenter if Solrcloud goes offline (GUI/index/etc go
 offline)

  - This is problematic because some portion of user activity will fail,
 queries that are in transit will not complete

 2) Implement failover at the custom federator level. In doing so we would
 need to detect a failure at datacenter A within our federator, then query
 datacenter B to fulfill the user request, then potentially fail the entire
 datacenter A once all transactions have been fulfilled against A



 Since we are looking up the active solr instance via zookeeper (solrcloud)
 per datacenter I don’t see any reasonable means of failing over to another
 datacenter if a given solrcloud instance goes down?


 Any thoughts are welcome at this point?

 Thanks

 Jaime



Re: questions about default operator within solr query string

2015-01-05 Thread Otis Gospodnetic
Hi Chun,

Something like:
+slug:variety +slug:entertainment headline:entertainment should work.

But you may also want to use filter queries (fq) for slug filtering:
http://search-lucene.com/?q=fq&fc_project=Solr
https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq(FilterQuery)Parameter
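
As a concrete request, the first form could be sent like this (host and
core name are placeholders):

  curl 'http://localhost:8983/solr/searcher/select' \
    --data-urlencode 'q=+slug:variety +slug:entertainment headline:entertainment'

With the fq approach, the required slug clauses would instead go into
separate fq=slug:variety and fq=slug:entertainment parameters, which are
cached independently of the scoring query.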

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Jan 5, 2015 at 6:11 AM, chun.sh...@thomsonreuters.com wrote:

 Hi,

  Nice to have a chance to discuss with solr experts!

   We are using solr as our search solution. But now we have a
 requirement that we don't know how to handle, even after we have looked
 through the Solr documentation.

   The solr version we used is 4.10.1.

   For the question, please refer to the following example url:


  http://10.90.44.33/solr/searcher/select?start=0&rows=24&fl=id,headline,slug&q=slug:variety-entertainment%20headline:entertainment&sort=score%20asc&debug=true


    With our default operator (q.op) configured as OR, the parsed
  query is:

slug:variety slug:entertainment headline:entertainment


   But what we really want is as follows:

   +slug:variety +slug:entertainment headline:entertainment


   So, the question is:

    When searching, is there any way to configure the applied
  operator between the terms from the field slug to be AND, while the
  operator between the fields slug and headline is OR?

  If no, could you please advise on how to handle this
 requirement in other ways?


 Thanks in advance


 Chun



Re: Solr performance issues

2014-12-26 Thread Otis Gospodnetic
Likely lots of disk + network IO, yes. Put SPM for Solr on your nodes to double 
check.

 Otis

 On Dec 26, 2014, at 09:17, Mahmoud Almokadem prog.mahm...@gmail.com wrote:
 
 Dears,
 
 We've installed a cluster of one collection of 350M documents on 3
 r3.2xlarge (60GB RAM) Amazon servers. The size of index on each shard is
  about 1.1TB and the maximum storage on Amazon is 1 TB, so we added 2 SSD EBS
  General Purpose volumes (1x1TB + 1x500GB) on each instance. Then we created a
  logical volume of 1.5TB using LVM to fit our index.
 
 The response time is about 1 and 3 seconds for simple queries (1 token).
 
  Has the LVM become a bottleneck for our index?
 
 Thanks for help.


# of daily/weekly/monthly Solr downloads?

2014-12-09 Thread Otis Gospodnetic
Hi,

Does anyone know the number of daily/weekly/monthly Solr downloads?

Thanks,
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: Standardized index metrics (Was: Constantly high disk read access (40-60M/s))

2014-12-01 Thread Otis Gospodnetic
Hi,

On Sat, Nov 29, 2014 at 2:27 PM, Michael Sokolov 
msoko...@safaribooksonline.com wrote:

 On 11/29/14 1:30 PM, Toke Eskildsen wrote:

 Michael Sokolov [msoko...@safaribooksonline.com] wrote:

 I wonder if there's any value in providing this metric (total index size
 - stored field size - term vector size) as part of the admin panel?  Is
 it meaningful?  It seems like there would be a lot of cases where it
 could give a good rule of thumb for memory sizing, and it would save
 having to root around in the index folder.

 At Lucene/Solr Revolution, I talked with Alexandre Rafalovitch about
  this. We know
  (https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/)
  that we cannot get the full picture of an index, but it is a weekly
  occurrence on this mailing list that people ask questions where it helps
  to have a gist of the index
 metrics and how the index is used.

  Some sort of "copy the content of this concentrated metrics box when you
  need to talk with other people about your index functionality" feature in the
  admin panel might help with this. To get an idea of usage, it could also contain
 a few non-filled fields, such as peak queries per second or typical
 queries.

 - Toke Eskildsen

 Yes - the cautions about the need for prototyping are all very well, but
 even if you take that advice, and build a prototype, it's not clear how to
 tell whether your setup has enough memory or not. You can add more and
 measure response times, but even then you only have a gross measurement,
 and no way of knowing where, in detail, the memory is being used.  Also,
 you might be able to improve your system to make better use of memory with
 more precise information. It seems like we ought to be able to monitor a
 running system, observe its memory requirements over time, and report on
 those.


+1 to that!
I haven't been following this aspect of development super closely, but I
believe there are memory/size estimators for various things at Lucene level
that Elasticsearch is nicely exposing via its stats API.  I don't know the
specifics around those estimators without digging in, otherwise I'd open a
JIRA, because I think this is valuable information -- at Sematext we
regularly deal with hardware sizing, memory / CPU usage estimates, etc.
etc., so the more of this info is surfaced the easier it will be for people
to work with Solr.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: Constantly high disk read access (40-60M/s)

2014-12-01 Thread Otis Gospodnetic
Po-Yu,

To add what others have said:
* Your query cache is clearly not serving its purpose, so you are just
wasting your heap on it.  Consider disabling it.
* That's a pretty big index.  Do your queries really always have to go
against the whole index?  Are there multiple tenants in this index that
would let you break up the index into multiple smaller indices?  Can you
segment your index by time?  Maybe by doing that some indices will be
hotter and some colder, and the OS could do a better job caching.
* You didn't say anything about your queries.  Maybe they can be tightened to
pull less data off disk?
* Add RAM :)
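
To see whether any cache change helps, the hit ratios can be read off the
mbeans endpoint before and after (core name is a placeholder):

  curl 'http://localhost:8983/solr/core1/admin/mbeans?stats=true&cat=CACHE&wt=json'

Disabling the query result cache itself just means removing (or commenting
out) the queryResultCache element in solrconfig.xml.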

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Sat, Nov 29, 2014 at 12:59 AM, Po-Yu Chuang ratbert.chu...@gmail.com
wrote:

 Hi all,

 I am using Solr 4.9 with Tomcat. Thanks to the suggestions from Yonik and
 Dmitry about the slow start up. Everything works fine now, but I noticed
 that the load average of the server is high because there is constantly
 heavy disk read access. Please point me in some direction.

 Some numbers about my system:
 RAM: 18G
 swap space: 2G
 number of documents: 27 million
 Solr home: 185G
 disk read access constantly 40-60M/s
 document cache size: 16K entries
 document cache hit ratio: 0.65
 query cache size: 16K
 query cache hit ratio: 0.03

 At first, I wondered if the disk read comes from swap, so I decreased the
 swappiness from 60 to 10, but the disk read is still there, which means
 that the disk read access does not result from swapping in.

 Then, I tried different document cache sizes and different query cache sizes. The
 effect on changing query cache size is not obvious. I tried 512, 16K, 256K
 entries and the hit ratio is between 0.01 to 0.03.

 For the document cache, a larger cache size did improve the hit ratio
 (I tried 512, 16K, 256K, 512K, and 1024K entries, and the hit ratio
 is between 0.58 - 0.87), but the disk read is still high.

 Is adjusting the document cache size a reasonable direction? Or should I just
 increase the physical memory? Is there any method to estimate the right
 size of document cache (or other caches) and to estimate the size of
 physical memory needed?

 Thanks,
 Po-Yu



Re: Replicate a collection to a 2nd SolrCloud

2014-11-25 Thread Otis Gospodnetic
Hi,

I think you are looking for this:
http://search-lucene.com/?q=Cross+Data+Center+Replicationfc_project=Solr
== https://issues.apache.org/jira/browse/SOLR-6273

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Tue, Nov 25, 2014 at 3:29 PM, Gili Nachum gilinac...@gmail.com wrote:

 Hi,



 *I need to replicate a collection between SolrClouds. Has anyone done it?*
 The replication style I need is one-directional, replicating anything that happens
 on my main-site SolrCloud to the DR site (master-slave).

 I considered and decided against synchronizing the collections' shards'
 Lucene indexes over rsync, as it is too tricky to arrive at a consistent index
 and it is not efficient enough on bandwidth.
 My current approach is writing a replicator app that knows how to sync between
 two collections, in a fairly generic way, but it's a lot of investment
 which I would rather avoid.

 Saw that master-slave replication can't be used in SolrCloud
 https://cwiki.apache.org/confluence/display/solr/Index+Replication



Re: Does any solr version use lucene concurrent flush

2014-11-25 Thread Otis Gospodnetic
Yes.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Tue, Nov 25, 2014 at 4:37 PM, Aaron Beach aaron.be...@sendgrid.com
wrote:

 --
 Aaron Beach
 Senior Data Scientist
 w: +1-303-625-7043

 SendGrid -- Email Delivery. Simplified.
 http://sendgrid.com/careers.html



Re: New Meetup in London - Lucene/Solr User Group

2014-11-18 Thread Otis Gospodnetic
Would LOVE to see the results (assuming you can ensure the same fruit(s?)
are being compared)

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Tue, Nov 18, 2014 at 11:55 AM, Alexandre Rafalovitch arafa...@gmail.com
wrote:

 On 18 November 2014 11:41, Charlie Hull char...@flax.co.uk wrote:
  presenting some results of a Solr/Elasticsearch comparative performance
  study.

 I was asked about that a couple of times at the Solr Revolution
 conference. Looking forward to seeing the results.

 Regards,
Alex.

 Personal: http://www.outerthoughts.com/ and @arafalov
 Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
 Solr popularizers community: https://www.linkedin.com/groups?gid=6713853



Re: Search for partial name in Solr 4.x

2014-11-09 Thread Otis Gospodnetic
Hi,

You may be looking for wildcard queries or ngrams.
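
For the ngram route, a hedged sketch of an edge n-gram field type, in the style
of the stock 4.x schema examples (type name and gram sizes are illustrative):

  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- indexes "harry" as "ha", "har", "harr", "harry" -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With that in place, a partial title like "harr" matches "Harry Potter" without
the run-time cost of wildcard queries.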

Otis
 

 On Nov 9, 2014, at 3:26 PM, PeriS peri.subrahma...@htcinc.com wrote:
 
 I was wondering if there is a way to search on partial names? Ex; Field is a 
 string and stores values like titles of a book; When searching part of the 
 title may be supplied; How do I resolve this? Please let me know
 
 
 Thanks
 -PeriS
 
 
 
 
 
 
 *** DISCLAIMER *** This is a PRIVATE message. If you are not the intended 
 recipient, please delete without copying and kindly advise us by e-mail of 
 the mistake in delivery.
 NOTE: Regardless of content, this e-mail shall not operate to bind HTC Global 
 Services to any order or other contract unless pursuant to explicit written 
 agreement or government initiative expressly permitting the use of e-mail for 
 such purpose.
 
 


Re: recovery process - node with stale data elected leader

2014-11-07 Thread Otis Gospodnetic
Hi,

Not a direct answer to your question, sorry, but since 4.6.0 is relatively
old and there have been a ton of changes around leader election, syncing,
replication, etc., I'd first jump to the latest Solr and then see if this
is still a problem.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Thu, Nov 6, 2014 at 5:32 AM, francois.groll...@barclays.com wrote:

 Hi all,

 Any idea on my issue below?

 Thanks
 Francois

 -Original Message-
 From: Grollier, Francois: IT (PRG)
 Sent: Tuesday, November 04, 2014 6:19 PM
 To: solr-user@lucene.apache.org
 Subject: recovery process - node with stale data elected leader

 Hi,

 I'm running solrCloud 4.6.0 and I have a question/issue regarding the
 recovery process.

 My cluster is made of 2 shards with 2 replicas each. Nodes A1 and B1 are
 leaders, A2 and B2 followers.

 I start indexing docs and kill A2. I keep indexing for a while and then
 kill A1. At this point, the cluster stops serving queries as one shard is
 completely unavailable.
 Then I restart A2 first, then A1. A2 gets elected leader, waits a bit for
 more replicas to be up and once it sees A1 it starts the recovery process.
 My understanding of the recovery process was that at this point A2 would
 notice that A1 has a more up to date state and it would sync with A1. It
 seems to happen like this but then I get:

 INFO  - 2014-11-04 11:50:43.068; org.apache.solr.cloud.RecoveryStrategy;
 Attempting to PeerSync from http://a1:8111/solr/executions/
 core=executions - recoveringAfterStartup=false INFO  - 2014-11-04
 11:50:43.069; org.apache.solr.update.PeerSync; PeerSync: core=executions
 url=http://a2:8211/solr START replicas=[http://a1:8111/solr/executions/]
 nUpdates=100 INFO  - 2014-11-04 11:50:43.076;
 org.apache.solr.update.PeerSync; PeerSync: core=executions url=
 http://a2:8211/solr  Received 98 versions from a1:8111/solr/executions/
 INFO  - 2014-11-04 11:50:43.076; org.apache.solr.update.PeerSync; PeerSync:
 core=executions url=http://a2:8211/solr  Our versions are newer.
 ourLowThreshold=1483859630192852992 otherHigh=1483859633446584320 INFO  -
 2014-11-04 11:50:43.077; org.apache.solr.update.PeerSync; PeerSync:
 core=executions url=http://a2:8211/solr DONE. sync succeeded


 And I end up with a different set of documents in each node (actually A1
 has all the documents but A2 misses some).

  Is my understanding wrong, and is it complete nonsense to start A2
  before A1?

 If my understanding right, what could cause the desync? (I can provide
 more logs) And is there a way to force A2 to index the missing documents? I
 have try the FORCERECOVERY command but it generates the same result as
 shown above.

 Thanks
 francois




Re: Migrating cloud to another set of machines

2014-10-30 Thread Otis Gospodnetic
I think ZK stuff may actually be easier to handle, no?
Add new ones to the existing ZK cluster and then remove the old ones.
Won't this work smoothly?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Thu, Oct 30, 2014 at 1:16 PM, Jakov Sosic jso...@gmail.com wrote:

 On 10/30/2014 04:47 AM, Otis Gospodnetic wrote:

 Hi/Bok Jakov,

 2) sounds good to me.  It means no down-time.  1) means stoppage.  If
 stoppage is not OK, but falling behind with indexing new content is OK,
 you
 could:
 * add a new cluster
 * start reading from old index and indexing into the new index
 * stop old cluster when done
 * index new content to new cluster (or maybe you can be doing this all
 along if indexing old + new at the same time is OK for you)
 --


 Thank you for suggestions Otis.

 Everything is acceptable currently, but in the future as the data grows,
 we will certainly enter those edge cases where neither stopping indexing
 nor stopping queries will be acceptable.

  What makes things a little bit more problematic is that the ZooKeepers are
  also migrating to new machines.





Re: Migrating cloud to another set of machines

2014-10-29 Thread Otis Gospodnetic
Hi/Bok Jakov,

2) sounds good to me.  It means no down-time.  1) means stoppage.  If
stoppage is not OK, but falling behind with indexing new content is OK, you
could:
* add a new cluster
* start reading from old index and indexing into the new index
* stop old cluster when done
* index new content to new cluster (or maybe you can be doing this all
along if indexing old + new at the same time is OK for you)
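
If you go with 2), note that newer releases (4.8+) expose the add/remove dance
through the Collections API; a hedged sketch, with host, collection, shard, and
replica names purely illustrative:

  curl 'http://newhost:8983/solr/admin/collections?action=ADDREPLICA&collection=core1&shard=shard1&node=newhost:8983_solr'
  # wait until the new replica shows as active in clusterstate.json, then:
  curl 'http://oldhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=core1&shard=shard1&replica=core_node3'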

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Wed, Oct 29, 2014 at 10:18 PM, Jakov Sosic jso...@gmail.com wrote:

 Hi guys


 I was wondering is there some smart way to migrate Solr cloud from 1 set
 of machines to another?

 Specificaly, I have 2 cores, each of them with 2 replicas and 2 shards,
 spread across 4 machines.

 We bought new HW and are in a process of moving to new 4 machines.


 What are my options?


 1) - Create new cluster on new set of machines.
- stop write operations
- copy data directories from old machines to new machines
- start solrs on new machines


 2) - expand number of replicas from 2 to 4
- add new solr nodes to cloud
- wait for resync
- stop old solr nodes
- shrink number of replicas from 4 back to 2


 Is there any other path to achieve this?

 I'm leaning towards no1, because I don't feel too comfortable with doing
 all those changes explained in no2 ...

 Ideas?



Re: Heavy Multi-threaded indexing and SolrCloud 4.10.1 replicas out of synch.

2014-10-27 Thread Otis Gospodnetic
Hi,

You may simply be overwhelming your cluster-nodes. Have you checked
various metrics to see if that is the case?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/



 On Oct 26, 2014, at 9:59 PM, S.L simpleliving...@gmail.com wrote:

 Folks,

 I have posted previously about this , I am using SolrCloud 4.10.1 and have
 a sharded collection with  6 nodes , 3 shards and a replication factor of 2.

 I am indexing Solr using a Hadoop job. I have 15 Map fetch tasks, each of
 which can have up to 5 threads, so the load on the indexing side can get
 as high as 75 concurrent threads.

 I am facing an issue where the replicas of a particular shard(s) are
 consistently getting out of sync. Initially I thought this was because I
 was using a custom component, but I did a fresh install, removed the
 custom component, and reindexed using the Hadoop job; I still see the same
 behavior.

 I do not see any exceptions in my catalina.out, like OOM or any other
 exceptions. I suspect this could be because of the multi-threaded
 indexing nature of the Hadoop job. I use CloudSolrServer from my Java code
 to index, and initialize the CloudSolrServer using a 3-node ZK ensemble.

 Does anyone know of any known issues with highly multi-threaded indexing
 and SolrCloud?

 Can someone help? This issue has been slowing things down on my end for a
 while now.

 Thanks and much appreciated!


Re: about Solr log file

2014-10-22 Thread Otis Gospodnetic
Hi Chunki,

Having logs on the local disk is not a problem.  You can use tools like rsyslog
or Logstash or Flume or fluentd and ship your logs wherever you want - your
own centralized logging system or Splunk or Logsene for example.  This will
make it easier to debug/troubleshoot, too - no need to grep big log files...

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Wed, Oct 22, 2014 at 10:50 PM, Lee Chunki lck7...@coupang.com wrote:

 Hi,

 I have two questions about Solr log file.

 First,
 Is it possible to configure logging to use one log file for each core?
 I run many cores on one Solr instance, the log file is getting bigger and
 bigger, and that makes it hard to debug when there is a system error.

 Second,
 Is there any setting to gather SolrCloud logs on any one server?
 I plan to migrate to SolrCloud, but it seems that each Solr node
 writes its logs to its own local disk.

 Thanks,
 Chunki.


Re: Shared Directory for two Solr Clouds(Writer and Reader)

2014-10-20 Thread Otis Gospodnetic
Hi Jae,

Sounds a bit complicated and messy to me, but maybe I'm missing something.
What are you trying to accomplish with this approach?  Which problems do
you have that are making you look for a non-straightforward setup?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr  Elasticsearch Support * http://sematext.com/


On Mon, Oct 20, 2014 at 7:35 PM, Jaeyoung Yoon jaeyoungy...@gmail.com
wrote:

 Hi Folks,

  Here are some of my ideas on using a shared file system with two separate Solr
  Clouds (Writer Solr Cloud and Reader Solr Cloud).

 I want to get your valuable feedbacks

 For prototype, I setup two separate Solr Clouds(one for Writer and the
 other for Reader).

 Basically big picture of my prototype is like below.

 1. Reader and Writer Solr clouds share the same directory
 2. Writer SolrCloud sends the openSearcher commands to Reader Solr Cloud
 inside postCommit eventHandler. That is, when new data are added to Writer
 Solr Cloud, writer Solr Cloud sends own openSearcher command to Reader Solr
 Cloud.
 3. Reader opens searcher only when it receives openSearcher commands
 from Writer SolrCloud
 4. Writer has own deletionPolicy to keep old commit points which might be
 used by running queries on Reader Solr Cloud when new searcher is opened on
 reader SolrCloud.
 5. Reader has no update/no commits. Everything on reader Solr Cloud are
 read-only. It also creates searcher from directory not from
 indexer(nrtMode=false).

 That is,
  In the Writer Solr Cloud, I added a postCommit event listener. Inside the
  postCommit event listener, it sends its openSearcher command to the reader Solr
  Cloud's handler. Then the reader Solr Cloud will open a searcher
  directly, without a commit, and return the writer's request.
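
  For reference, one stock way to wire such a hook is RunExecutableListener in
  the writer's solrconfig.xml -- a hedged sketch, where the script path is
  hypothetical and would call the reader cloud's openSearcher handler over HTTP:

    <listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">/opt/scripts/notify-reader.sh</str>
      <str name="dir">.</str>
      <bool name="wait">true</bool>
    </listener>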

 With this approach, Writer and Reader can use the same commit points in
 shared file system in synchronous way.
  When a Reader SolrCloud starts, it doesn't open a searcher. Instead, the
  Writer Solr Cloud listens to the ZooKeeper of the Reader Solr Cloud. On any
  change in the reader SolrCloud, the writer sends an openSearcher command to
  the reader Solr Cloud.

 Does it make sense? Or am I missing some important stuff?

 any feedback would be very helpful to me.

 Thanks,
 Jae



Re: Retrieving and updating large set of documents on Solr 4.7.2

2014-08-18 Thread Otis Gospodnetic
Hi,

Not sure if you've seen https://issues.apache.org/jira/browse/SOLR-5244 ?

It's not in Solr 4.7.2, but may be a good excuse to update Solr.
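
The cursor API itself is in since 4.7, so the fetch loop can already look like
this hedged sketch (collection, field, and filter names illustrative):

  curl 'http://localhost:8983/solr/activities/select?q=username:userA&fl=id&rows=500&sort=id+asc&cursorMark=*&wt=json'
  # read nextCursorMark from the response, pass it back as cursorMark on the
  # next request, and stop when the returned mark repeats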

Otis
--
Solr Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Mon, Aug 18, 2014 at 4:09 AM, deniz denizdurmu...@gmail.com wrote:


 I am trying to implement an activity feed for a website, and I am planning to
 use Solr for this case. As the feed does not have any follower/following
 relation, Solr fits the requirements.

 There is one point which makes me concerned about performance. As user A,
 I may have 10K activities in the feed, and then I update my
 preferences, so the activities that I have posted should be updated too
 (imagine that I am changing my user name, so all of the activities would
 have my new username). In order to update all 10K activities, I need to
 retrieve the unique document ids from Solr, then update them. Retrieving 10K
 docs at once is not a good idea, if you imagine a bunch of other users
 also making a similar change. I have checked docs and forums; using Cursors
 on Solr seems OK, but it still makes me think about the performance (after id
 retrieval, I need to update each activity).

 Are there any other ways to handle this without Cursors? Or would it be better
 to use another tool/backend to have something like a username - activity_id
 mapping, so I can directly retrieve the ids to update?

 Regards,




 -
 Zeki ama calismiyor... Calissa yapar...
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Retrieving-and-updating-large-set-of-documents-on-Solr-4-7-2-tp4153457.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Anybody uses Solr JMX?

2014-08-07 Thread Otis Gospodnetic
Hi Paul,

There are lots of people/companies using SPM for Solr/SolrCloud and I don't
recall anyone saying SPM agent collecting metrics via JMX had a negative
impact on Solr performance.  That said, some people really dislike JMX and
some open source projects choose to expose metrics via custom stats APIs or
even files.
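
For reference, a minimal way to actually get Solr's MBeans exposed: the <jmx/>
element goes in solrconfig.xml, and the flags below are the standard remote-JMX
JVM options (port illustrative; auth disabled only for testing):

  <!-- solrconfig.xml: register Solr's MBeans with an MBean server -->
  <jmx />

  # start Solr's JVM with a remotely reachable MBean server:
  java -Dcom.sun.management.jmxremote \
       -Dcom.sun.management.jmxremote.port=18983 \
       -Dcom.sun.management.jmxremote.authenticate=false \
       -Dcom.sun.management.jmxremote.ssl=false \
       -jar start.jar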

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Wed, Aug 6, 2014 at 11:18 PM, Paul Libbrecht p...@hoplahup.net wrote:

 Hello Otis,

 this looks like an excellent idea!
 I'm in need of that, erm… last week and probably this one too.

 Is there not a risk that reading certain JMX properties actually hogs the
 process? (or is it by design that MBeans are supposed to be read without
 any lock effect?).

 thanks for the hint.

 paul



 On 6 mai 2014, at 04:43, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

  Alexandre, you could use something like
  http://blog.sematext.com/2012/09/25/new-tool-jmxc-jmx-console/ to
 quickly
  dump everything out of JMX and see if there is anything there Solr Admin
 UI
  doesn't expose.  I think you'll find there is more in JMX than Solr Admin
  UI shows.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr  Elasticsearch Support * http://sematext.com/
 
 
  On Mon, May 5, 2014 at 1:56 AM, Alexandre Rafalovitch 
 arafa...@gmail.comwrote:
 
  Thank you everybody for the links and explanations.
 
  I am still curious whether JMX exposes more details than the Admin UI?
  I am thinking of a troubleshooting context, rather than long-term
  monitoring one.
 
  Regards,
Alex.
  Personal website: http://www.outerthoughts.com/
  Current project: http://www.solr-start.com/ - Accelerating your Solr
  proficiency
 
 
  On Mon, May 5, 2014 at 12:21 PM, Gora Mohanty g...@mimirtech.com
 wrote:
  On May 5, 2014 7:09 AM, Alexandre Rafalovitch arafa...@gmail.com
  wrote:
 
  I have religiously kept jmx statement in my solrconfig.xml, thinking
  it was enabling the web interface statistics output.
 
  But looking at the server logs really closely, I can see that JMX is
  actually disabled without server present. And the Admin UI does not
  actually seem to care after a quick test.
 
  Does anybody have a real experience with Solr JMX? Does it expose more
  information than Admin UI's Plugins/Stats page? Is it good for
 
 
  Have not been using JMX lately, but we were using it in the past. It
 does
  allow monitoring many useful details. As others have commented, it also
  integrates well with other monitoring  tools as JMX is a standard.
 
  Regards,
  Gora
 




Re: Solr vs ElasticSearch

2014-08-01 Thread Otis Gospodnetic
If performance is the main reason, you can stick with Solr.  Both Solr and
ES have many knobs to turn for performance, it is impossible to give a
direct and correct answer to the question which is faster.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Fri, Aug 1, 2014 at 7:35 AM, Salman Akram 
salman.ak...@northbaysolutions.net wrote:

 I did see that earlier. My main concern is search
 performance/scalability/throughput which unfortunately that article didn't
 address. Any benchmarks or comments about that?

 We are already using SOLR but there has been a push to check elasticsearch.
 All the benchmarks I have seen are at least few years old.


 On Fri, Aug 1, 2014 at 4:59 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com
  wrote:

  Not super fresh, but more recent than the 2 links you sent:
 
 http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr  Elasticsearch Support * http://sematext.com/
 
 
  On Thu, Jul 31, 2014 at 10:33 PM, Salman Akram 
  salman.ak...@northbaysolutions.net wrote:
 
   This is quite an old discussion. Wanted to check any new comparisons
  after
   SOLR 4 especially with regards to performance/scalability/throughput?
  
  
   On Tue, Jul 26, 2011 at 7:33 PM, Peter peat...@yahoo.de wrote:
  
Have a look:
   
   
   
  
 
 http://stackoverflow.com/questions/2271600/elasticsearch-sphinx-lucene-solr-xapian-which-fits-for-which-usage
   
   
  http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/
   
Regards,
Peter.
   
--
View this message in context:
   
  
 
 http://lucene.472066.n3.nabble.com/Solr-vs-ElasticSearch-tp3009181p3200492.html
Sent from the Solr - User mailing list archive at Nabble.com.
   
  
  
  
   --
   Regards,
  
   Salman Akram
  
 



 --
 Regards,

 Salman Akram



Re: Auto suggest with adding accents

2014-08-01 Thread Otis Gospodnetic
Aha.  I don't know if Solr Suggester can do that.  Let's see what others
say.  I know http://www.sematext.com/products/autocomplete/ could do that.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Fri, Aug 1, 2014 at 9:26 AM, benjelloun anass@gmail.com wrote:

 hello,

 you didn't understand my problem well; I'll give you an example:
 the document contains the word genève.
 q=gene  auto suggestion gives geneve
 q=genè auto suggestion gives genève

 But what I need is for q=gene to suggest genève, with the accent, like a
 correction of the word.
 I tried to add a spellchecker to correct it, but the maximum number of
 characters for correction is 2.
 Maybe there is another solution; here is my schema for the field:

 <fieldType name="textSuggest" class="solr.TextField"
            positionIncrementGap="100" omitNorms="true">
   <analyzer type="index">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     <filter class="solr.StopFilterFactory" words="stopApostrophe.txt" ignoreCase="true"/>
     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StandardFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
     <filter class="solr.StopFilterFactory" words="stopApostrophe.txt" ignoreCase="true"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StandardFilterFactory"/>
   </analyzer>
 </fieldType>

 thanks best regards,
 Anass BENJELLOUN




 2014-07-31 18:41 GMT+02:00 Otis Gospodnetic-5 [via Lucene] 
 ml-node+s472066n4150410...@n3.nabble.com:

  You need to do the opposite.  Make sure accents are NOT removed at index
 
  query time.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr  Elasticsearch Support * http://sematext.com/
 
 
 
  On Thu, Jul 31, 2014 at 5:49 PM, benjelloun [hidden email] wrote:
 
   hi,
  
   q=gene  it suggest geneve
   ASCIIFoldingFilter work like isolate accent
  
   what i need to suggest is genève
  
   any idea?
  
   thanks
   best reagards
   Anass BENJELLOUN
  
  
  
   --
   View this message in context:
  
 
 http://lucene.472066.n3.nabble.com/Auto-suggest-with-adding-accents-tp4150379p4150392.html
 
   Sent from the Solr - User mailing list archive at Nabble.com.
  
 
 
 
 




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Auto-suggest-with-adding-accents-tp4150379p4150569.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Memory question

2014-08-01 Thread Otis Gospodnetic
Which version of Solr?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Fri, Aug 1, 2014 at 11:17 PM, Ethan eh198...@gmail.com wrote:

 Our SolrCloud setup : 3 Nodes with Zookeeper, 2 running SolrCloud.

  Current dataset size is 97GB; the JVM is 10GB, but only 6GB is used (for
  shorter garbage collection times). RAM is 96GB.

 Our softcommit is set to 2secs and hardcommit is set to 1 hour.

 We are suddenly seeing high disk and network IOs.  During search the leader
  usually logs one more query with its node name and shard information -

 {NOW=1406911121656shard.url=
 chexjvassoms006.ch.expeso.com:52158/solr/Main..

 ids=-9223372036371158536,-9223372036373602680,-9223372036618637568,-9223372036371157736..distrib=falsetimeAllowed=2000wt=javabinisShard=true

  The actual query didn't have any of this information. This started just
  today and is causing a lot of latency issues. We have had nodes go down several
 times today.

 Any of you faced similar issues before?

 E



Re: Auto suggest with adding accents

2014-07-31 Thread Otis Gospodnetic
You need to do the opposite.  Make sure accents are NOT removed at index &
query time.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Thu, Jul 31, 2014 at 5:49 PM, benjelloun anass@gmail.com wrote:

 hi,

 q=gene  it suggest geneve
 ASCIIFoldingFilter work like isolate accent

 what i need to suggest is genève

 any idea?

 thanks
 best regards
 Anass BENJELLOUN



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Auto-suggest-with-adding-accents-tp4150379p4150392.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr is working very slow after certain time

2014-07-31 Thread Otis Gospodnetic
Can we look at your disk IO and CPU?  SPM http://sematext.com/spm/ can
help.

Isn't UseCompressedOops a typo? And deprecated?  In general, may want to
simplify your JVM params unless you are really sure they are helping.
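
As a hedged baseline to restart from (flags illustrative; add the others back
one at a time while watching the GC log):

  java -Xms6g -Xmx6g \
       -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly \
       -XX:CMSInitiatingOccupancyFraction=70 \
       -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log \
       -jar start.jar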

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Thu, Jul 31, 2014 at 7:54 PM, Ameya Aware ameya.aw...@gmail.com wrote:

 Hi,

 I could index around 10 documents in a couple of hours. But after that,
 the time for indexing became very large (around just 15-20 documents per minute).

 I have taken care of garbage collection.

 I am passing the below parameters to Solr:
 -Xms6144m -Xmx6144m -XX:MaxPermSize=128m -XX:+UseConcMarkSweepGC
 -XX:ConcGCThreads=6 -XX:ParallelGCThreads=6
 -XX:CMSInitiatingOccupancyFraction=70 -XX:NewRatio=3
 -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled
 -XX:+UseCompressedOops -XX:+ParallelRefProcEnabled -XX:+UseLargePages
 -XX:+AggressiveOpts -XX:-UseGCOverheadLimit



 Can anyone help to solve this problem?


 Thanks,
 Ameya



Re: Solr vs ElasticSearch

2014-07-31 Thread Otis Gospodnetic
Not super fresh, but more recent than the 2 links you sent:
http://blog.sematext.com/2012/08/23/solr-vs-elasticsearch-part-1-overview/

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Thu, Jul 31, 2014 at 10:33 PM, Salman Akram 
salman.ak...@northbaysolutions.net wrote:

 This is quite an old discussion. Wanted to check any new comparisons after
 SOLR 4 especially with regards to performance/scalability/throughput?


 On Tue, Jul 26, 2011 at 7:33 PM, Peter peat...@yahoo.de wrote:

  Have a look:
 
 
 
 http://stackoverflow.com/questions/2271600/elasticsearch-sphinx-lucene-solr-xapian-which-fits-for-which-usage
 
  http://karussell.wordpress.com/2011/05/12/elasticsearch-vs-solr-lucene/
 
  Regards,
  Peter.
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/Solr-vs-ElasticSearch-tp3009181p3200492.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 



 --
 Regards,

 Salman Akram



Re: Solr enterprise tech support in Brazil

2014-07-09 Thread Otis Gospodnetic
Hello,

Sematext would be happy to help.  Please see signature.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/



 On Jul 9, 2014, at 4:15 PM, Jefferson Olyntho Neto (STI) 
 jefferson.olyn...@unimedbh.com.br wrote:

 Dear all,

 I would like some recommendation of companies who work with enterprise 
 technical support for Solr in Brazil. Could someone help me?

 Thanks!

 Jefferson Olyntho Neto
  jefferson.olyn...@unimedbh.com.br


JOB: Solr / Elasticsearch engineer @ Sematext

2014-07-08 Thread Otis Gospodnetic
Hi,

I think most people on this list have heard of Sematext
http://sematext.com/, so I'll skip the company info, and just jump to the
meat, which involves a lot of fun work with Solr and/or Elasticsearch:

We have an opening for an engineer who knows either Elasticsearch or Solr
or both and wants to use these technologies to implement search and
analytics solutions for both Sematext's own products
http://sematext.com/products/ such as SPM http://sematext.com/spm/
(monitoring,
alerting, machine learning-based anomaly detection, etc.) and Logsene
http://sematext.com/logsene/ (logging), as well as for Sematext's clients
http://sematext.com/clients/.

More info at:
* http://blog.sematext.com/2014/07/07/job-elasticsearch-solr-engineer/
* http://sematext.com/about/jobs.html

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


Re: Tomcat or Jetty to use with solr in production ?

2014-06-30 Thread Otis Gospodnetic
Hi Gurunath,

In 90% of our engagements with various Solr customers we see Jetty, which
we also recommend and use ourselves for Solr + our own services and
products.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/



On Mon, Jun 30, 2014 at 5:07 AM, gurunath gurunath@ge.com wrote:

 Hi,

  I'm confused by the many reviews of Jetty and Tomcat along with Solr 4.7. Is
  there any better option for production? I want to know the complexities of
  Tomcat and Jetty going forward, as I want to cluster huge data on Solr.

 Thanks



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Tomcat-or-Jetty-to-use-with-solr-in-production-tp4144712.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr4 optimization

2014-06-09 Thread Otis Gospodnetic
Hi,

I don't remember last time I ran optimize.  Sure, yes, things will work
faster if you optimize an index and reduce the number of segments, but if
you are regularly writing to that index and performance is OK, leave it to
Lucene segment merges to purge deletes.
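
If the deleted docs ever do become a problem, expungeDeletes on a commit is a
lighter-weight alternative to a full optimize, since it only merges segments
that actually contain deletions -- a sketch, core name illustrative:

  curl 'http://localhost:8983/solr/collection1/update?commit=true&expungeDeletes=true'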

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Mon, Jun 9, 2014 at 4:15 PM, Joshi, Shital shital.jo...@gs.com wrote:

  Hi,

  We have a SolrCloud cluster (5 shards and 2 replicas) on 10 boxes. On some
  of the boxes we have about 5 million deleted docs, and we have never run
  optimization since the beginning. Does the number of deleted docs have anything
  to do with query performance? Should we consider optimization at all if
  we're not worried about disk space?

 Thanks!





Re: Cache response time

2014-06-04 Thread Otis Gospodnetic
Hi Jeremy,

Nothing in Solr tracks that time.  Caches are pluggable.  If you really
want this info you could write your own cache that is just a proxy for the
real cache and then you can time it.
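
A minimal illustration of that proxy idea in plain Java -- note this is NOT the
actual SolrCache interface (which has more methods); it only shows the
timing-wrapper pattern you would apply to it:

  import java.util.Map;
  import java.util.concurrent.atomic.AtomicLong;

  class TimingCache<K, V> {
    private final Map<K, V> delegate;   // stands in for the real cache
    private final AtomicLong nanos = new AtomicLong();
    private final AtomicLong lookups = new AtomicLong();

    TimingCache(Map<K, V> delegate) { this.delegate = delegate; }

    V get(K key) {
      long t0 = System.nanoTime();
      try {
        return delegate.get(key);       // the timed operation
      } finally {
        nanos.addAndGet(System.nanoTime() - t0);
        lookups.incrementAndGet();
      }
    }

    double avgMicros() {                // average lookup time in microseconds
      long n = lookups.get();
      return n == 0 ? 0.0 : nanos.get() / 1000.0 / n;
    }
  }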

But why do you need this info?  Do you suspect that is slow?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Wed, Jun 4, 2014 at 3:33 PM, Branham, Jeremy [HR] 
jeremy.d.bran...@sprint.com wrote:

 Is there a JMX metric for measuring the cache request time?

 I can see the avg request times, but I'm assuming this includes the cache
 and non-cache values.

 http://wiki.apache.org/solr/SolrPerformanceFactors




 




Re: Strange behaviour when tuning the caches

2014-06-03 Thread Otis Gospodnetic
Hi,

Have you seen https://wiki.apache.org/solr/CollapsingQParserPlugin ?  May
help with the field collapsing queries.
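
Usage is a one-line filter query -- a hedged sketch with an illustrative field
name:

  fq={!collapse field=groupId}

It collapses to one document per group value inside a PostFilter, which is
typically much cheaper than group=true on large result sets.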

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Tue, Jun 3, 2014 at 8:41 AM, Jean-Sebastien Vachon 
jean-sebastien.vac...@wantedanalytics.com wrote:

 Hi Otis,

 We saw some improvement when increasing the size of the caches. Since
 then, we followed Shawn advice on the filterCache and gave some additional
 RAM to the JVM in order to reduce GC. The performance is very good right
 now but we are still experiencing some instability but not at the same
 level as before.
 With our current settings the number of evictions is actually very low so
 we might be able to reduce some caches to free up some additional memory
 for the JVM to use.

 As for the queries, it is a set of 5 million queries taken from our logs
 so they vary a lot. All I can say is that all queries involve either
 grouping/field collapsing and/or radius search around a point. Our largest
 customer is using a set of 8-10 filters that are translated as fq
 parameters. The collection contains around 13 million documents distributed
 on 5 shards with 2 replicas. The second collection has the same
 configuration and is used for indexing or as a fail-over index in case the
 first one falls.

 We`ll keep making adjustments today but we are pretty close of having
 something that performs while being stable.

 Thanks all for your help.



  -Original Message-
  From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
  Sent: June-03-14 12:17 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Strange behaviour when tuning the caches
 
  Hi Jean-Sebastien,
 
  One thing you didn't mention is whether as you are increasing(I assume)
  cache sizes you actually see performance improve?  If not, then maybe
 there
  is no value increasing cache sizes.
 
  I assume you changed only one cache at a time? Were you able to get any
  one of them to the point where there were no evictions without things
  breaking?
 
  What are your queries like, can you share a few examples?
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics Solr 
  Elasticsearch Support * http://sematext.com/
 
 
  On Mon, Jun 2, 2014 at 11:09 AM, Jean-Sebastien Vachon  jean-
  sebastien.vac...@wantedanalytics.com wrote:
 
   Thanks for your quick response.
  
   Our JVM is configured with a heap of 8GB. So we are pretty close of
   the optimal configuration you are mentioning. The only other
   programs running is Zookeeper (which has its own storage device) and a
   proprietary API (with a heap of 1GB) we have on top of Solr to server
 our
  customer`s requests.
  
   I will look into the filterCache to see if we can better use it.
  
   Thanks for your help
  
-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: June-02-14 10:48 AM
To: solr-user@lucene.apache.org
Subject: Re: Strange behaviour when tuning the caches
   
On 6/2/2014 8:24 AM, Jean-Sebastien Vachon wrote:
 We have yet to determine where the exact breaking point is.

 The two patterns we are seeing are:

 -  less cache (around 20-30% hit/ratio), poor performance
 but
 overall good stability
   
When caches are too small, a low hit ratio is expected.  Increasing
them
   is a
good idea, but only increase them a little bit at a time.  The
   filterCache in
particular should not be increased dramatically, especially the
autowarmCount value.  Filters can take a very long time to execute,
so a
   high
autowarmCount can result in commits taking forever.
   
Each filter entry can take up a lot of heap memory -- in terms of
bytes,
   it is
the number of documents in the core divided by 8.  This means that
if the core has 10 million documents, each filter entry (for JUST
that
core) will take over a megabyte of RAM.
   
 -  more cache (over 90% hit/ratio), improved performance
 but
 almost no stability. In that case, we start seeing messages such
 as No shards hosting shard X or cancelElection did not find
 election node to remove
   
This would not be a direct result of increasing the cache size,
unless
   perhaps
you've increased them so they are *REALLY* big and you're running
out of RAM for the heap or OS disk cache.
   
 Anyone, has any advice on what could cause this? I am beginning to
 suspect the JVM version, is there any minimal requirements
 regarding the JVM?
   
Oracle Java 7 is recommended for all releases, and required for Solr
   4.8.  You
just need to stay away from 7u40, 7u45, and 7u51 because of bugs in
Java itself.  Right now, the latest release is recommended, which is
 7u60.
The
7u21 release that you are running should be perfectly fine.
   
    With six 9.4GB cores per node, you'll achieve the best performance if you
    have about 60GB of RAM left over for the OS disk cache to use -- the size
    of your index data on disk.

Re: Strange behaviour when tuning the caches

2014-06-02 Thread Otis Gospodnetic
Hi Jean-Sebastien,

One thing you didn't mention is whether as you are increasing(I assume)
cache sizes you actually see performance improve?  If not, then maybe there
is no value increasing cache sizes.

I assume you changed only one cache at a time? Were you able to get any one
of them to the point where there were no evictions without things breaking?

What are your queries like, can you share a few examples?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Mon, Jun 2, 2014 at 11:09 AM, Jean-Sebastien Vachon 
jean-sebastien.vac...@wantedanalytics.com wrote:

 Thanks for your quick response.

 Our JVM is configured with a heap of 8GB. So we are pretty close of the
 optimal configuration you are mentioning. The only other programs running
 is Zookeeper (which has its own storage device) and a proprietary API (with
 a heap of 1GB) we have on top of Solr to server our customer`s requests.

 I will look into the filterCache to see if we can better use it.

 Thanks for your help

  -Original Message-
  From: Shawn Heisey [mailto:s...@elyograg.org]
  Sent: June-02-14 10:48 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Strange behaviour when tuning the caches
 
  On 6/2/2014 8:24 AM, Jean-Sebastien Vachon wrote:
   We have yet to determine where the exact breaking point is.
  
   The two patterns we are seeing are:
  
   -  less cache (around 20-30% hit/ratio), poor performance but
   overall good stability
 
  When caches are too small, a low hit ratio is expected.  Increasing them
 is a
  good idea, but only increase them a little bit at a time.  The
 filterCache in
  particular should not be increased dramatically, especially the
  autowarmCount value.  Filters can take a very long time to execute, so a
 high
  autowarmCount can result in commits taking forever.
 
  Each filter entry can take up a lot of heap memory -- in terms of bytes,
 it is
  the number of documents in the core divided by 8.  This means that if the
  core has 10 million documents, each filter entry (for JUST that
  core) will take over a megabyte of RAM.
 
   -  more cache (over 90% hit/ratio), improved performance but
   almost no stability. In that case, we start seeing messages such as
   No shards hosting shard X or cancelElection did not find election
   node to remove
 
  This would not be a direct result of increasing the cache size, unless
 perhaps
  you've increased them so they are *REALLY* big and you're running out of
  RAM for the heap or OS disk cache.
 
   Anyone, has any advice on what could cause this? I am beginning to
   suspect the JVM version, is there any minimal requirements regarding
   the JVM?
 
  Oracle Java 7 is recommended for all releases, and required for Solr
 4.8.  You
  just need to stay away from 7u40, 7u45, and 7u51 because of bugs in Java
  itself.  Right now, the latest release is recommended, which is 7u60.
  The
  7u21 release that you are running should be perfectly fine.
 
  With six 9.4GB cores per node, you'll achieve the best performance if you
  have about 60GB of RAM left over for the OS disk cache to use -- the
 size of
  your index data on disk.  You did mention that you have 92GB of RAM per
  node, but you have not said how big your Java heap is, or whether there
 is
  other software on the machine that may be eating up RAM for its heap or
  data.
 
  http://wiki.apache.org/solr/SolrPerformanceProblems
 
  Thanks,
  Shawn
 
  -
  Aucun virus trouvé dans ce message.
  Analyse effectuée par AVG - www.avg.fr
  Version: 2014.0.4570 / Base de données virale: 3950/7571 - Date:
  27/05/2014



Re: Uneven shard heap usage

2014-05-31 Thread Otis Gospodnetic
Hi Joe,

Are you/how are you sure all 3 shards are roughly the same size?  Can you
share what you run/see that shows you that?

Are you sure queries are evenly distributed?  Something like SPM
http://sematext.com/spm/ should give you insight into that.

How big are your caches?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Sat, May 31, 2014 at 5:54 PM, Joe Gresock jgres...@gmail.com wrote:

 Interesting thought about the routing.  Our document ids are in 3 parts:

 10-digit identifier!epoch timestamp!format

 e.g., 5/12345678!13025603!TEXT

 Each object has an identifier, and there may be multiple versions of the
 object, hence the timestamp.  We like to be able to pull back all of the
 versions of an object at once, hence the routing scheme.

 The nature of the identifier is that a great many of them begin with a
 certain number.  I'd be interested to know more about the hashing scheme
 used for the document routing.  Perhaps the first character gives it more
 weight as to which shard it lands in?

 It seems strange that certain of the most highly-searched documents would
 happen to fall on this shard, but you may be onto something.   We'll scrape
 through some non-distributed queries and see what we can find.
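
 (A hedged sketch of such a probe, host and collection names illustrative:

   curl 'http://shard1-host:8983/solr/collection1/select?q=*:*&rows=0&distrib=false'

 run it against one replica of each shard and compare numFound and QTime;
 distrib=false keeps the query from fanning out.)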


 On Sat, May 31, 2014 at 1:47 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  This is very weird.
 
  Are you sure that all the Java versions are identical? And all the JVM
  parameters are the same? Grasping at straws here.
 
  More grasping at straws: I'm a little suspicious that you are using
  routing. You say that the indexes are about the same size, but is it is
  possible that your routing is somehow loading the problem shard
 abnormally?
  By that I mean somehow the documents on that shard are different, or
 have a
  drastically higher number of hits than the other shards?
 
  You can fire queries at shards with distrib=false and NOT have it go to
  other shards, perhaps if you can isolate the problem queries that might
  shed some light on the problem.
 
 
  Best
  er...@baffled.com
 
 
  On Sat, May 31, 2014 at 8:33 AM, Joe Gresock jgres...@gmail.com wrote:
 
   It has taken as little as 2 minutes to happen the last time we tried.
  It
   basically happens upon high query load (peak user hours during the
 day).
When we reduce functionality by disabling most searches, it
 stabilizes.
So it really is only on high query load.  Our ingest rate is fairly
 low.
  
   It happens no matter how many nodes in the shard are up.
  
  
   Joe
  
  
   On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky 
  j...@basetechnology.com
   wrote:
  
When you restart, how long does it take it hit the problem? And how
  much
query or update activity is happening in that time? Is there any
 other
activity showing up in the log?
   
If you bring up only a single node in that problematic shard, do you
   still
see the problem?
   
-- Jack Krupansky
   
-Original Message- From: Joe Gresock
Sent: Saturday, May 31, 2014 9:34 AM
To: solr-user@lucene.apache.org
Subject: Uneven shard heap usage
   
   
Hi folks,
   
I'm trying to figure out why one shard of an evenly-distributed
 3-shard
cluster would suddenly start running out of heap space, after 9+
 months
   of
stable performance.  We're using the ! delimiter in our ids to
   distribute
the documents, and indeed the disk size of our shards are very
 similar
(31-32GB on disk per replica).
   
Our setup is:
9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
basically 2 physical CPUs), 24GB disk
3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever).  We
reserve 10g heap for each solr instance.
Also 3 zookeeper VMs, which are very stable
   
Since the troubles started, we've been monitoring all 9 with
 jvisualvm,
   and
shards 2 and 3 keep a steady amount of heap space reserved, always
  having
horizontal lines (with some minor gc).  They're using 4-5GB heap, and
   when
we force gc using jvisualvm, they drop to 1GB usage.  Shard 1,
 however,
quickly has a steep slope, and eventually has concurrent mode
 failures
  in
the gc logs, requiring us to restart the instances when they can no
   longer
do anything but gc.
   
We've tried ruling out physical host problems by moving all 3 Shard 1
replicas to different hosts that are underutilized, however we still
  get
the same problem.  We'll still be working on ruling out
 infrastructure
issues, but I wanted to ask the questions here in case it makes
 sense:
   
* Does it make sense that all the replicas on one shard of a cluster
   would
have heap problems, when the other shard replicas do not, assuming a
   fairly
even data distribution?
* One thing we changed recently was to make all of our fields stored,
instead of only half of them.  This was to 

Re: Offline Indexes Update to Shard

2014-05-29 Thread Otis Gospodnetic
Hi,

On Wed, May 28, 2014 at 4:25 AM, Vineet Mishra clearmido...@gmail.comwrote:

 Hi All,

 Has anyone tried with building Offline indexes with EmbeddedSolrServer and
 posting it to Shards.


What do you mean by posting it to shards?  How is that different than
copying them manually to the right location in FS?  Could you please
elaborate?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/



 FYI, I am done building the indexes but looking out for a way to post these
 index files on shards.
 Copying the indexes manually to each shard's replica is possible and is
 working fine but I don't want to go with that approach.

 Thanks!



Re: Solr High GC issue

2014-05-29 Thread Otis Gospodnetic
Hi Bihan,

That's a lot of parameters and without trying one can't really give you
very specific and good advice.  If I had to suggest something quickly I'd
say:

* go back to the basics - remove most of those params and stick with the
basic ones.  Look at GC and tune slowly by changing/adding params one at a
time.
* consider using G1 GC with the most recent Java7.
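
For the G1 option, an illustrative starting point, assuming a recent Java 7
such as 7u60 (tune from here):

  java -Xms7g -Xmx7g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
       -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log -jar start.jar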

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Thu, May 29, 2014 at 1:36 AM, bihan.chandu bihan.cha...@gmail.comwrote:

 Hi All

 I am currently using Solr 3.6.1 and my system handles a lot of requests. Now we
 are facing a high GC issue in the system. Please find the memory parameters of my
 Solr system below. Can someone help me identify whether there is any relationship
 between my memory parameters and the GC issue?

 MEM_ARGS=-Xms7936M -Xmx7936M -XX:NewSize=512M -XX:MaxNewSize=512M
 -Xss1024k
 -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC
 -XX:CMSInitiatingOccupancyFraction=80 -XX:+UseCMSInitiatingOccupancyOnly
 -XX:+CMSParallelRemarkEnabled -XX:+AggressiveOpts
 -XX:LargePageSizeInBytes=2m -XX:+UseLargePages -XX:MaxTenuringThreshold=15
 -XX:-UseAdaptiveSizePolicy -XX:PermSize=256M -XX:MaxPermSize=256M
 -XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:+PrintGCDetails
 -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGC
 -Xloggc:${GCLOG} -XX:-OmitStackTraceInFastThrow -XX:+DisableExplicitGC
 -XX:-BindGCTaskThreadsToCPUs -verbose:gc -XX:StackShadowPages=20

 Thanks
 Bihan



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-High-GC-issue-tp4138570.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Contribute QParserPlugin

2014-05-28 Thread Otis Gospodnetic
Hi,

I think the question is not really how to do it - that's clear -
http://wiki.apache.org/solr/HowToContribute

The question is really about whether something like this would be of
interest to Solr community, whether it is likely it would be accepted into
Solr core or contrib, or whether, perhaps because of potentially unwanted
dependency on Redis, Solr dev community might not want this in Solr and
this might be better done outside Solr.

Not sure what the answer is; maybe active Solr developers can chime in
here?  Or maybe dev list is a better place to ask?
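
For what it's worth, the mechanics are small either way. A hedged skeleton
against the 4.x API -- the Redis lookup itself is elided and all names are
illustrative:

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.SyntaxError;

  public class RedisQParserPlugin extends QParserPlugin {
    @Override
    public void init(NamedList args) {
      // read Redis host/port etc. from the solrconfig.xml args here
    }

    @Override
    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        @Override
        public Query parse() throws SyntaxError {
          // A real implementation would fetch the members of a Redis set/hash
          // here and build a filter query from them; this stub just matches
          // qstr against the "id" field.
          return new TermQuery(new Term("id", qstr));
        }
      };
    }
  }

It would be registered with <queryParser name="redis" class="RedisQParserPlugin"/>
in solrconfig.xml and used as fq={!redis}somekey.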

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Wed, May 28, 2014 at 2:03 PM, Alan Woodward a...@flax.co.uk wrote:

 Hi Pawel,

 The easiest thing to do is to open a JIRA ticket on the Solr project,
 here: https://issues.apache.org/jira/browse/SOLR, and attach your patch.

 Alan Woodward
 www.flax.co.uk


 On 28 May 2014, at 16:50, Pawel Rog wrote:

  Hi,
  I need a QParserPlugin that will use Redis as a backend to prepare filter
  queries. There are several data structures available in Redis (hash, set,
  etc.). For some reasons I cannot fetch data from the Redis data structures
  and build and send big requests from the application. That's why I want to
  build those filters on the backend (Solr) side.
 
  I'm wondering what do I have to do to contribute QParserPlugin into Solr
  repository. Can you suggest me a way (in a few steps) to publish it in
 Solr
  repository, probably as a contrib?
 
  --
  Paweł Róg




Re: Percolator feature

2014-05-28 Thread Otis Gospodnetic
Yes - Luwak.  Stay tuned for more. :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/


On Wed, May 28, 2014 at 4:44 PM, Jorge Luis Betancourt Gonzalez 
jlbetanco...@uci.cu wrote:

  Is there some workaround in the Solr ecosystem to get something similar to
  the percolator feature offered by Elasticsearch?

  Greetings!



Re: Contribute QParserPlugin

2014-05-28 Thread Otis Gospodnetic
Hi,

On Wed, May 28, 2014 at 10:58 PM, Alexandre Rafalovitch
arafa...@gmail.comwrote:

 Well, Solr just bundled a set of Hadoop jars that does not actually
 contribute anything to Solr itself (not really integrated, etc). So, I


Good point about Hadoop jars.


 am not sure how the 'may not want' process happened there. Would be
 nice to have one actually, because there is a slow building wave of
 external components for Solr which are completely not discoverable by
 the Solr community at large.


Agreed and a Wiki page where people can add this or Google don't cut
it? (serious question)


 So, I would love us to (re-?)start the serious discussion on the
 plugin model for Solr. Probably on the dev list.


Sure.  Separate thread?

I would even commit to building an initial package discovery/search
 website if the dev-list powers would agree on how that mechanism
 (package/plugins/downloads) should look like. ElasticSearch is very
 obviously benefiting from having a plugin system. Solr's kitchen-sync
 approach worked when it was the only one. But with increased speed of
 releases and the growing packages, it is becoming very noticeably
 pudgy. It even had to be excused during the Solr vs. ElasticSearch
 presentation at the BerlinBuzz a couple of days ago.


For the curious - Alex is referring to
http://blog.sematext.com/2014/05/28/presentation-and-video-side-by-side-with-solr-and-elasticsearch/

Re building something - may be best to talk about that in that separate
thread.


 P.s. Regarding the specific issue, I know of another Redis plugin. Not
 sure how relevant or useful it is, but at least it exists:
 https://github.com/dfdeshom/solr-redis-cache


Thanks.  It's different from what Pawel was asking about.  Maybe Pawel can
provide a couple of examples so people can better understand what he is
looking to do.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr  Elasticsearch Support * http://sematext.com/





 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Thu, May 29, 2014 at 2:50 AM, Otis Gospodnetic
 otis.gospodne...@gmail.com wrote:
  Hi,
 
  I think the question is not really how to do it - that's clear -
  http://wiki.apache.org/solr/HowToContribute
 
  The question is really about whether something like this would be of
  interest to Solr community, whether it is likely it would be accepted
 into
  Solr core or contrib, or whether, perhaps because of potentially unwanted
  dependency on Redis, Solr dev community might not want this in Solr and
  this might be better done outside Solr.
 
  Not sure what the answer is. maybe active Solr developers can chime
 in
  here?  Or maybe dev list is a better place to ask?
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr  Elasticsearch Support * http://sematext.com/
 
 
  On Wed, May 28, 2014 at 2:03 PM, Alan Woodward a...@flax.co.uk wrote:
 
  Hi Pawel,
 
  The easiest thing to do is to open a JIRA ticket on the Solr project,
  here: https://issues.apache.org/jira/browse/SOLR, and attach your
 patch.
 
  Alan Woodward
  www.flax.co.uk
 
 
  On 28 May 2014, at 16:50, Pawel Rog wrote:
 
   Hi,
   I need QParserPlugin that will use Redis as a backend to prepare
 filter
   queries. There are several data structures available in Redis (hash,
 set,
   etc.). From some reasons I cannot fetch data from redis data
 structures,
   build and send big requests from application. That's why I want to
 build
   that filters on backend (Solr) side.
  
   I'm wondering what do I have to do to contribute QParserPlugin into
 Solr
   repository. Can you suggest me a way (in a few steps) to publish it in
  Solr
   repository, probably as a contrib?
  
   --
   Paweł Róg
 
 



Re: Physical Files v. Reported Index Size

2014-05-15 Thread Otis Gospodnetic
Darrell,

Look at the top index.x directory in your second image.  Looks like
that's your index, the same one you see in the Solr UI.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



On Tue, May 6, 2014 at 11:34 PM, Darrell Burgan darrell.bur...@infor.comwrote:

  Hello all, I’m trying to reconcile what I’m seeing in the file system
 for a Solr index versus what it is reporting in the UI. Here’s what I see
 in the UI for the index:



 https://s3-us-west-2.amazonaws.com/pa-darrell/ui.png



 As shown, the index is 74.85 GB in size. However, here is what I see in
 the data folder of the file system on that server:



 https://s3-us-west-2.amazonaws.com/pa-darrell/file-system.png



 As shown, it is consuming 109 GB of space. Also note that one of the index
 folders is 75 GB in size.



 My question is why the difference, and whether I can remove some of these
 index folders to reclaim file system space? Or is there a Solr command to
 do it (is it as obvious as “Optimize”)?



 If there is a manual I should RTFM about the file structure, please point me
 to it. :)



 Thanks!

 Darrell






 *Darrell Burgan* | Architect, Sr. Principal, PeopleAnswers

 office: 214 445 2172 | mobile: 214 564 4450 | fax: 972 692 5386 |
 darrell.bur...@infor.com | http://www.infor.com






Re: SolrCloud - Highly Reliable / Scalable Resources?

2014-05-13 Thread Otis Gospodnetic
Hi,

Re:
 we have suffered several issues which always seem quite problematic to
resolve.

Try grabbing the latest version if you can.  We identified a number of
issues in older SolrCloud versions when working on large client setups with
thousands of cores, but a lot of those issues have been fixed in the more
recent versions.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/



On Mon, May 12, 2014 at 9:53 AM, Darren Lee d...@amplience.com wrote:

 Hi everyone,

 We have been using Solr Cloud (4.4) for ~ 6 months now. Functionally it's
 excellent but we have suffered several issues which always seem quite
 problematic to resolve.

 I was wondering if anyone in the community can recommend good resources /
 reading for setting up a highly scalable / highly reliable cluster. A lot
 of what I see in the solr documentation is aimed at small setups or is
 quite sparse.

 Dealing with topics like:

 * Capacity planning

 * Losing nodes

 * Voting panic

 * Recovery failure

 * Replication factors

 * Elasticity / Auto scaling / Scaling recipes

 * Exhibitor

 * Container configuration, concurrency limits, packet drop tuning

 * Increasing capacity without downtime

 * Scalable approaches to full indexing hundreds of millions of
 documents

 * External health check vs CloudSolrServer

 * Separate vs local zookeeper

 * Benchmarks


 Sorry, I know that's a lot to ask heh. We are going to run a project for a
 month or so soon where we re-write all our run books and do deeper testing
 on various failure scenarios and the above but any starting point would be
 much appreciated.

 Thanks all,
 Darren



Re: Anybody uses Solr JMX?

2014-05-05 Thread Otis Gospodnetic
Alexandre, you could use something like
http://blog.sematext.com/2012/09/25/new-tool-jmxc-jmx-console/ to quickly
dump everything out of JMX and see if there is anything there Solr Admin UI
doesn't expose.  I think you'll find there is more in JMX than Solr Admin
UI shows.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, May 5, 2014 at 1:56 AM, Alexandre Rafalovitch arafa...@gmail.comwrote:

 Thank you everybody for the links and explanations.

 I am still curious whether JMX exposes more details than the Admin UI?
 I am thinking of a troubleshooting context, rather than a long-term
 monitoring one.

 Regards,
Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr
 proficiency


 On Mon, May 5, 2014 at 12:21 PM, Gora Mohanty g...@mimirtech.com wrote:
  On May 5, 2014 7:09 AM, Alexandre Rafalovitch arafa...@gmail.com
 wrote:
 
  I have religiously kept the <jmx/> statement in my solrconfig.xml, thinking
  it was enabling the web interface statistics output.
 
  But looking at the server logs really closely, I can see that JMX is
  actually disabled without server present. And the Admin UI does not
  actually seem to care after a quick test.
 
  Does anybody have a real experience with Solr JMX? Does it expose more
  information than Admin UI's Plugins/Stats page? Is it good for
 
 
  Have not been using JMX lately, but we were using it in the past. It does
  allow monitoring many useful details. As others have commented, it also
  integrates well with other monitoring tools as JMX is a standard.
 
  Regards,
  Gora
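
For the troubleshooting angle, a tiny standalone sketch that dumps every
Solr MBean over JMX - a hedged example assuming remote JMX is enabled on
the Solr JVM; the port and the solr* domain pattern are assumptions:

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class SolrJmxDump {
    public static void main(String[] args) throws Exception {
      // Assumes the Solr JVM was started with
      // -Dcom.sun.management.jmxremote.port=18983 (illustrative port).
      JMXServiceURL url = new JMXServiceURL(
          "service:jmx:rmi:///jndi/rmi://localhost:18983/jmxrmi");
      JMXConnector jmxc = JMXConnectorFactory.connect(url);
      try {
        MBeanServerConnection conn = jmxc.getMBeanServerConnection();
        // Solr registers its MBeans under a solr-prefixed domain.
        for (ObjectName name : conn.queryNames(new ObjectName("solr*:*"), null)) {
          System.out.println(name);
        }
      } finally {
        jmxc.close();
      }
    }
  }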



Re: How to get a list of currently executing queries?

2014-04-28 Thread Otis Gospodnetic
No, though one could write a custom SearchComponent, I imagine.  Not
terribly useful for most situations where queries typically run for only a
few milliseconds, but...

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Apr 17, 2014 at 7:34 AM, Nikhil Chhaochharia nikhil...@yahoo.comwrote:

 Hello,

 Is there some way of getting a list of all queries that are currently
 executing?  Something similar to 'show full processlist' in MySQL.

 Thanks,
 Nikhil
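
A rough sketch of the custom SearchComponent idea, assuming Solr 4.x APIs;
it records the parameters of requests currently in flight, and a separate
handler could expose the map (class and map names are illustrative):

  import java.io.IOException;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;

  public class InFlightQueryTracker extends SearchComponent {
    // Thread id -> params of the request that thread is processing.
    public static final Map<Long, String> RUNNING =
        new ConcurrentHashMap<Long, String>();

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
      RUNNING.put(Thread.currentThread().getId(), rb.req.getParamString());
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      // Registered last in the component chain, this runs near the end
      // of the request, so the entry is removed once work is done.
      RUNNING.remove(Thread.currentThread().getId());
    }

    @Override
    public String getDescription() { return "tracks in-flight queries"; }

    @Override
    public String getSource() { return null; }
  }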


Re: TB scale

2014-04-25 Thread Otis Gospodnetic
Hi Ed,

Unfortunately, there is no good *general* advice, so you'd need to provide
a lot more detail to get useful help.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Apr 25, 2014 at 3:48 PM, Ed Smiley esmi...@ebrary.com wrote:

 Anyone with experience, suggestions or lessons learned in the 10 -100 TB
 scale they'd like to share?
 Researching optimum design for a Solr Cloud with, say, about 20TB index.
 -
 Thanks

 Ed Smiley, Senior Software Architect, Ebooks
 ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700
 ext. 3772
 ed.smi...@proquest.commailto:ed.smi...@proquest.com
 www.proquest.comhttp://www.proquest.com/ | www.ebrary.com
 http://www.ebrary.com/ | www.eblib.comhttp://www.eblib.com/
 ebrary and EBL, ProQuest businesses




Re: SolrCloud load balancing during heavy indexing

2014-04-25 Thread Otis Gospodnetic
Hi,

On Fri, Apr 25, 2014 at 12:54 PM, zzT zis@gmail.com wrote:

 Erick Erickson wrote
  Back up, you're misunderstanding the update process. A leader node
  distributes the update to every replica. So _all_ your nodes in a
  slice are indexing when _any_ of them index. So the idea of sending
  queries to just the replicas to avoid performance problems isn't
  relevant.

 Hmm, I thought that it's not actual indexing taking place on the replicas
 but that the changes were somehow transferred to the replicas and thus it
 was less intensive for them.


Unfortunately that's not the case.  Each node that gets a doc still has to
analyze and index it.  I think at some point I sent a message to the list
and/or created a JIRA issue to suggest doing analysis on just the receiving
node, in which case the other nodes that need to index could skip that step
and do a little less work, but that hasn't been implemented yet.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/





 Erick Erickson wrote
  In order to support NRT and HA/DR, it's required that all the nodes be
  ready to take over, so the notion of the leader being the only node
  that actually indexed the documents then distributing only the indexed
  document to the other members of the slice isn't how it's done.

 So, this is where SolrCloud is different from legacy master/slave
 configuration? I mean master/slave sends segments to the slaves using e.g.
 rsync while SolrCloud forwards the indexing request to replicas where it's
 processed locally on each replica, right?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SolrCloud-load-balancing-during-heavy-indexing-tp4133099p4133160.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Search for a mask that matches the requested string

2014-04-25 Thread Otis Gospodnetic
Luwak is not based on a fork of Lucene; or rather, the fork you are seeing
is there only because the Luwak authors needed highlighting.  If you don't
need highlighting you can probably modify Luwak a bit to use regular
Lucene.  The Lucene fork you are seeing there will also, eventually, be
committed to Lucene trunk and then hopefully backported to 4.x.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Apr 25, 2014 at 6:46 PM, Muhammad Gelbana m.gelb...@gmail.comwrote:

 Luwak is based on a fork of solr/lucene which I cannot use. I have to do
 this using solr 4.6, whether by writing extra code or not. Thanks.

 *-*
 *Muhammad Gelbana*
 http://www.linkedin.com/in/mgelbana


 On Sat, Apr 26, 2014 at 12:13 AM, Ahmet Arslan iori...@yahoo.com wrote:

  Hi,
 
  You don't need to write code for this. Use luwak (I gave the link in my
  first e-mail) instead.
 
  If you can't get luwak running because it's too complicated etc, see a
  similar discussion
 
  http://find.searchhub.org/document/9411388c7d2de701#36e50082e918b10c
 
  where diy-percolator example pointer is given. It is an example to use
  memory index.
 
  Ahmet
 
 
 
  On Saturday, April 26, 2014 1:05 AM, Muhammad Gelbana 
 m.gelb...@gmail.com
  wrote:
  @Jack, I am ready to write custom code to implement such a feature but I
  don't know which feature in Solr I should extend? Where should I start? I
  believe it should be a very simple task.
 
  @Ahmet, how can I use the class you mentioned? Is there a tutorial for
  it? I'm not sure how the code in the class's description should work, I've
  never extended solr before.
 
  Thank you all.
 
  *-*
  *Muhammad Gelbana*
  http://www.linkedin.com/in/mgelbana
 
 
 
  On Fri, Apr 25, 2014 at 10:38 PM, Ahmet Arslan iori...@yahoo.com
 wrote:
 
  
  
   Hi,
  
    Your use case is different from ad hoc retrieval, where you have a set of
    documents and varying queries.
  
    In your case it is the reverse: you have a stored query (a string mask)
    such as A?, and incoming documents are percolated against it.
  
   out of the box Solr does not have support for this today.
  
   Please see :
  
  
  
 
 http://lucene.apache.org/core/4_7_2/memory/org/apache/lucene/index/memory/MemoryIndex.html
  
   By the way wildcard ? matches a single character.
  
   Ahmet
  
  
   On Friday, April 25, 2014 11:02 PM, Muhammad Gelbana 
  m.gelb...@gmail.com
   wrote:
    I have no idea how this can help me. I have been using solr for a few
    weeks and I'm not familiar with it yet. I'm asking for a very simple
    task: a way to customize how solr matches a string. Does this exist in
    solr?
  
   *-*
   *Muhammad Gelbana*
   http://www.linkedin.com/in/mgelbana
  
  
  
   On Thu, Apr 24, 2014 at 10:09 PM, Ahmet Arslan iori...@yahoo.com
  wrote:
  
Hi,
   
Please see : https://github.com/flaxsearch/luwak
   
Ahmet
   
   
On Thursday, April 24, 2014 8:40 PM, Muhammad Gelbana 
   m.gelb...@gmail.com
wrote:
(Please make sure you reply to my address because I didn't subscribe
 to
this mailing list)
   
I'm using Solr 4.6
   
I need to store string masks in Solr. By masks, I mean strings that
 can
match other strings.
   
Then I need to search for masks that match the string I'm providing
 in
  my
query. For example, assume the following single-field document stored
  in
Solr:
   
{
fieldA: __A__
}
   
 I need to be able to find this document if I query the fieldA field with
 a string like *12A34*, as each underscore *_* matches a single character.
 The single-character matching mechanism is my strict goal here; multiple
 character matching won't be helpful.
   
I hope I was clear enough. Please elaborate because I'm not versatile
   with
solr and I haven't been using it for too long.
Thank you.
   
*-*
*Muhammad Gelbana*
http://www.linkedin.com/in/mgelbana
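
For reference, a bare-bones sketch of the MemoryIndex percolation idea
discussed above, assuming Lucene 4.6; the stored mask is rewritten as a
wildcard query (? matches exactly one character) and the field name is
illustrative:

  import org.apache.lucene.analysis.core.KeywordAnalyzer;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.memory.MemoryIndex;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.WildcardQuery;

  public class MaskPercolator {
    public static void main(String[] args) {
      // Index the single incoming document in memory; KeywordAnalyzer
      // keeps "12A34" as one untokenized term.
      MemoryIndex doc = new MemoryIndex();
      doc.addField("fieldA", "12A34", new KeywordAnalyzer());

      // The stored mask __A__ becomes the wildcard query ??A??.
      Query mask = new WildcardQuery(new Term("fieldA", "??A??"));

      float score = doc.search(mask);   // 0.0f means no match
      System.out.println(score > 0.0f); // prints true
    }
  }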
   
   
  
 
 



Re: Application of different stemmers / stopword lists within a single field

2014-04-25 Thread Otis Gospodnetic
Hi Tim,

Step one is probably to detect language boundaries.  You know your data.
 If they happen on paragraph breaks, your job will be easier.  If they
don't, a bit harder, but not impossible at all.  I'm sure there is a ton of
research on this topic out there, but the obvious approach would involve
dictionaries and individual terms or shingle lookups, keeping track of the
current language or language of last N terms and watching out for a
switch.

Once you have that you'd know the language of each paragraph.  At that
point you'd feed those into Solr in separate language-specific fields.

Of course, the other side of this is often the more complicated one -
identifying the language of the query.  The problem is they are short.  But
you can handle it via UI, via user preferences, via a combination of these
things, etc.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Apr 25, 2014 at 6:34 AM, Timothy Hill timothy.d.h...@gmail.comwrote:

 This may not be a practically solvable problem, but the company I work for
 has a large number of lengthy mixed-language documents - for example,
 scholarly articles about Islam written in English but containing lengthy
 passages of Arabic. Ideally, we would like users to be able to search both
 the English and Arabic portions of the text, using the full complement of
 language-processing tools such as stemming and stopword removal.

 The problem, of course, is that these two languages co-occur in the same
 field. Is there any way to apply different processing to different words or
 paragraphs within a single field through language detection? Is this to all
 intents and purposes impossible within Solr? Or is another approach (using
 language detection to split the single large field into
 language-differentiated smaller fields, for example) possible/recommended?

 Thanks,

 Tim Hill
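
A sketch of the per-paragraph routing described above, assuming Tika's
LanguageIdentifier for detection and blank lines as paragraph boundaries;
the text_en/text_ar field naming convention is an assumption, and whether
the detector's language profiles cover e.g. Arabic has to be verified:

  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.language.LanguageIdentifier;

  public class MixedLanguageDoc {
    public static SolrInputDocument build(String id, String fullText) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", id);
      // Detect each paragraph's language and route it to a field
      // whose analysis chain matches that language.
      for (String paragraph : fullText.split("\n\n")) {
        String lang = new LanguageIdentifier(paragraph).getLanguage();
        doc.addField("text_" + lang, paragraph); // e.g. text_en, text_ar
      }
      return doc;
    }
  }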



Re: What contributes to disk IO?

2014-03-26 Thread Otis Gospodnetic
Lucene segment merges cause both reads and writes.  If you look at SPM,
you'll see the number of index files and the number of segments, which will
give you an idea what's going on at that level.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Mar 25, 2014 at 8:12 PM, Software Dev static.void@gmail.comwrote:

 What are the main contributing factors for Solr Cloud generating a lot
 of disk IO?

 A lot of reads? Writes? Insufficient RAM?

 I would think if there was enough disk cache available for the whole
 index there would be little to no disk IO.



Re: w/10 ? [was: Partial Counts in SOLR]

2014-03-24 Thread Otis Gospodnetic
I think SQP is getting axed, no?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Mar 24, 2014 at 3:45 PM, T. Kuro Kurosaka k...@healthline.comwrote:

 On 3/19/14 5:13 PM, Otis Gospodnetic wrote: Hi,
 
  Guessing it's surround query parser's support for within backed by span
  queries.
 
  Otis

 You mean this?
 http://wiki.apache.org/solr/SurroundQueryParser

 I guess this parser needs improvement in the documentation area.
 It doesn't explain or have an example of the w/int syntax at all.
 (Is this the infix notation of W?)
 An example would help explain the difference between W and N;
 some readers may not understand what ordered and unordered
 in this context mean.

 Kuro




Re: Limit on # of collections -SolrCloud

2014-03-20 Thread Otis Gospodnetic
Hours sounds too long indeed.  We recently had a client with several
thousand collections, but restart wasn't taking hours...

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Mar 20, 2014 5:49 PM, Erick Erickson erickerick...@gmail.com wrote:

 How many total replicas are we talking here?
 As in how many shards and, for each shard,
 how many replicas? I'm not asking for a long list
 here, just if you have a bazillion replicas in aggregate.

 Hours is surprising.

 Best,
 Erick

 On Thu, Mar 20, 2014 at 2:17 PM, Chris W chris1980@gmail.com wrote:
  Thanks, Shalin. Making clusterstate.json on a collection basis sounds
  awesome.
 
   I am not having problems with #2. #3 is a major time hog in my
   environment. I have over 300+ collections and restarting the entire
   cluster takes on the order of hours (2-3 hours). Can you explain more
   about the leaderVoteWait setting?
 
 
 
 
  On Thu, Mar 20, 2014 at 1:28 PM, Shalin Shekhar Mangar 
  shalinman...@gmail.com wrote:
 
  There are no arbitrary limits on the number of collections but yes
  there are practical limits. For example, the cluster state can become
  a bottleneck. There is a lot of work happening on finding and
  addressing these problems. See
  https://issues.apache.org/jira/browse/SOLR-5381
 
  Boot up time is because of:
  1) Core discovery, schema/config parsing etc
  2) Transaction log replay on startup
  3) Wait time for enough replicas to become available before leader
  election happens
 
  You can't do much about 1 right now I think. For #2, you can keep your
  transaction logs smaller by a hard commit before shutdown. For #3
  there is a leaderVoteWait settings but I'd rather not touch that
  unless it becomes a problem.
 
  On Fri, Mar 21, 2014 at 1:39 AM, Chris W chris1980@gmail.com
 wrote:
   Hi there
  
Is there a limit on the # of collections solrcloud can support? Can
   zk/solrcloud handle 1000s of collections?
  
    Also I see that the bootup time of solrcloud increases with an increase
    in # of cores. I do not have any expensive warm-up queries. How do I
    speed up solr startup?
  
   --
   Best
   --
   C
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 
 
 
 
  --
  Best
  --
  C
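
For #2 above, a minimal SolrJ 4.x sketch of the commit-before-shutdown tip
(the URL is illustrative); the explicit hard commit flushes pending updates
so restart does not have to replay them from the transaction log:

  import org.apache.solr.client.solrj.impl.HttpSolrServer;

  public class CommitBeforeShutdown {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/collection1");
      solr.commit(); // hard commit: flushes and fsyncs the index files
      solr.shutdown();
    }
  }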



Re: w/10 ? [was: Partial Counts in SOLR]

2014-03-19 Thread Otis Gospodnetic
Hi,

Guessing it's surround query parser's support for within backed by span
queries.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Mar 19, 2014 4:44 PM, T. Kuro Kurosaka k...@healthline.com wrote:

 In the thread Partial Counts in SOLR, Salman gave us this sample query:

  ((stock or share*) w/10 (sale or sell* or sold or bought or buy* or
 purchase* or repurchase*)) w/10 (executive or director)


 I'm not familiar with this w/10 notation. What does this mean,
 and what parser(s) supports this syntax?

 Kuro




Re: Excessive Heap Usage from docValues?

2014-03-19 Thread Otis Gospodnetic
Hi,

Which type of doc values? See Wiki or reference guide for a list of types.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Mar 19, 2014 5:02 PM, tradergene nos...@krevets.com wrote:

 Hello All,

 I'm hoping to get your assistance in debugging what seems like a memory
 issue.

 I have a Solr index with about 32 million docs.  Each doc is relatively
 small but has multiple dynamic fields that are storing INTs.  The initial
 problem that I had to resolve is that we were running into OOMs (on a 48GB
 heap, 130GB on-disk index).  I narrowed that issue down to Lucene
 FieldCache
 filling up the heap due to all the dynamic fields.  To mitigate this, I
 enabled docValues on the schema for many of the dynamicField culprits.
  This
 dropped the FieldCache down to almost nothing.

 Now, when re-indexing for docValues functionality, I ran into OOMs as soon
 as I reached 12 million of the 32 million documents.  Before enabling
 docValues, I was able to load up Solr on a 48GB heap but ran into problems
 after enough unique searches occurred (normal FieldCache issue).  Now, with
 docValues, a 48GB heap is giving me OOM after 12 million docs indexed.  I
 split the collection into 10 shards and with 2 nodes (48GB heap each) was
 able to get up to 21 million docs indexed.  Now, I've had to move the
 shards
 to more nodes and am up to 10 shards across 4 nodes and am hoping to be
 able
 to get all 32 million docs indexed.  This will be 48GB x 4 heap which seems
 really excessive for an index that was only 132GB pre-docValues.

 I would love some thoughts as to whether I'm expecting too much efficiency
 with docValues enabled.  I was under the impression that docValues would
 increase storage requirements on disk (which it has), but I thought that
 RAM
 usage would go down during searching (which I haven't tested) as well as
 indexing.

 Thanks for any assistance anyone can provide.

 Gene



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Excessive-Heap-Usage-from-docValues-tp4125577.html
 Sent from the Solr - User mailing list archive at Nabble.com.



Re: Indexing large documents

2014-03-18 Thread Otis Gospodnetic
Hi,

I think you probably want to split giant documents because you / your users
probably want to be able to find smaller sections of those big docs that
are best matches to their queries.  Imagine querying War and Peace.  Almost
any regular word you query for will produce a match. Yes, you may want to
enable field collapsing aka grouping.  I've seen facet counts get messed up
when grouping is turned on, but have not confirmed if this is a (known) bug
or not.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Mar 18, 2014 at 10:52 PM, Stephen Kottmann 
stephen_kottm...@h3biomedicine.com wrote:

 Hi Solr Users,

 I'm looking for advice on best practices when indexing large documents
 (100's of MB or even 1 to 2 GB text files). I've been hunting around on
 google and the mailing list, and have found some suggestions of splitting
 the logical document up into multiple solr documents. However, I haven't
 been able to find anything that seems like conclusive advice.

 Some background...

 We've been using solr with great success for some time on a project that is
 mostly indexing very structured data - ie. mainly based on ingesting
 through DIH.

 I've now started a new project and we're trying to make use of solr again -
 however, in this project we are indexing mostly unstructured data - pdfs,
 powerpoint, word, etc. I've not done much configuration - my solr instance
 is very close to the example provided in the distribution aside from some
 minor schema changes. Our index is relatively small at this point ( ~3k
 documents ), and for initial indexing I am pulling documents from a http
 data source, running them through Tika, and then pushing to solr using
 solrj. For the most part this is working great... until I hit one of these
 huge text files and then OOM on indexing.

 I've got a modest JVM - 4GB allocated. Obviously I can throw more memory at
 it, but it seems like maybe there's a more robust solution that would scale
 better.

 Is splitting the logical document into multiple solr documents best
 practice here? If so, what are the considerations or pitfalls of doing this
 that I should be paying attention to. I guess when querying I always need
 to use a group by field to prevent multiple hits for the same document. Are
 there issues with term frequency, etc that you need to work around?

 Really interested to hear how others are dealing with this.

 Thanks everyone!
 Stephen
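
If the split-into-chunks route is taken, the collapse-at-query-time side
might look like this with SolrJ - a sketch assuming each chunk document
carries a parent_id field pointing at its logical document:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class GroupedChunkSearch {
    public static void main(String[] args) throws Exception {
      HttpSolrServer solr =
          new HttpSolrServer("http://localhost:8983/solr/collection1");
      SolrQuery q = new SolrQuery("some query");
      q.set("group", true);              // collapse chunk-level hits...
      q.set("group.field", "parent_id"); // ...one group per logical doc
      q.set("group.limit", 3);           // keep the best 3 chunks per doc
      QueryResponse rsp = solr.query(q);
      System.out.println(rsp.getGroupResponse().getValues());
    }
  }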




Re: Help me understand these newrelic graphs

2014-03-14 Thread Otis Gospodnetic
Are you trying to bring that 24.9 ms response time down?
Looks like there is room for more aggressive sharding there, yes.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Mar 14, 2014 at 1:07 PM, Software Dev static.void@gmail.comwrote:

 Here is a screenshot of the host information:
 http://postimg.org/image/vub5ihxix/

 As you can see we have 24 core CPU's and the load is only at 5-7.5.


 On Fri, Mar 14, 2014 at 10:02 AM, Software Dev static.void@gmail.com
 wrote:

  If that is the case, what would help?
 
 
  On Thu, Mar 13, 2014 at 8:46 PM, Otis Gospodnetic 
  otis.gospodne...@gmail.com wrote:
 
  It really depends, hard to give a definitive instruction without more
  pieces of info.
  e.g. if your CPUs are all maxed out and you already have a high number
 of
  concurrent queries then sharding may not be of any help at all.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Thu, Mar 13, 2014 at 7:42 PM, Software Dev 
 static.void@gmail.com
  wrote:
 
    Ahh.. it's including the add operation. That makes sense then. A bit
  silly
   on NR's part they don't break it down.
  
   Otis, our index is only 8G so I don't consider that big by any means
 but
   our queries can get a bit complex with a bit of faceting. Do you still
   think it makes sense to shard? How easy would this be to get working?
  
  
   On Thu, Mar 13, 2014 at 4:02 PM, Otis Gospodnetic 
   otis.gospodne...@gmail.com wrote:
  
Hi,
   
I think NR has support for breaking by handler, no?  Just checked -
  no.
 Only webapp controller, but that doesn't apply to Solr.
   
SPM should be more helpful when it comes to monitoring Solr - you
 can
filter by host, handler, collection/core, etc. -- you can see the
  demo -
https://apps.sematext.com/demo - though this is plain Solr, not
   SolrCloud.
   
If your index is big or queries are complex, shard it and
 parallelize
search.
   
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
   
   
On Thu, Mar 13, 2014 at 6:17 PM, ralph tice ralph.t...@gmail.com
   wrote:
   
 I think your response time is including the average response for
 an
  add
 operation, which generally returns very quickly and due to sheer
  numbers
is
 averaging out the response time of your queries.  New Relic should
   break
 out requests based on which handler they're hitting but they don't
  seem
to.


 On Thu, Mar 13, 2014 at 2:18 PM, Software Dev 
   static.void@gmail.com
 wrote:

  Here are some screen shots of our Solr Cloud cluster via
 Newrelic
 
  http://postimg.org/gallery/2hyzyeyc/
 
  We currently have a 5 node cluster and all indexing is done on
   separate
  machines and shipped over. Our machines are running on SSD's
 with
  18G
of
  ram (Index size is 8G). We only have 1 shard at the moment with
replicas
 on
  all 5 machines. I'm guessing that's a bit of a waste?
 
  How come when we do our bulk updating the response time actually
 decreases?
  I would think the load would be higher, therefore response time
  should
   be
  higher. Any way I can decrease the response time?
 
  Thanks
 

   
  
 
 
 



Re: Help me understand these newrelic graphs

2014-03-13 Thread Otis Gospodnetic
Hi,

I think NR has support for breaking by handler, no?  Just checked - no.
 Only webapp controller, but that doesn't apply to Solr.

SPM should be more helpful when it comes to monitoring Solr - you can
filter by host, handler, collection/core, etc. -- you can see the demo -
https://apps.sematext.com/demo - though this is plain Solr, not SolrCloud.

If your index is big or queries are complex, shard it and parallelize
search.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Mar 13, 2014 at 6:17 PM, ralph tice ralph.t...@gmail.com wrote:

 I think your response time is including the average response for an add
 operation, which generally returns very quickly and due to sheer numbers is
 averaging out the response time of your queries.  New Relic should break
 out requests based on which handler they're hitting but they don't seem to.


 On Thu, Mar 13, 2014 at 2:18 PM, Software Dev static.void@gmail.com
 wrote:

  Here are some screen shots of our Solr Cloud cluster via Newrelic
 
  http://postimg.org/gallery/2hyzyeyc/
 
  We currently have a 5 node cluster and all indexing is done on separate
  machines and shipped over. Our machines are running on SSD's with 18G of
  ram (Index size is 8G). We only have 1 shard at the moment with replicas
 on
  all 5 machines. I'm guessing that's a bit of a waste?
 
  How come when we do our bulk updating the response time actually
 decreases?
  I would think the load would be higher, therefore response time should be
  higher. Any way I can decrease the response time?
 
  Thanks
 



Re: Solr supports log-based recovery?

2014-03-13 Thread Otis Gospodnetic
Skimmed this, but yes, docs are durable thanks to transaction log that can
replay on start.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Mar 13, 2014 8:25 PM, shushuai zhu ss...@yahoo.com wrote:

 Hi,

 I noticed the following post indicating that Solr could recover
 uncommitted data from its operational log:


 http://www.opensourceconnections.com/2013/04/25/understanding-solr-soft-commits-and-data-durability/

 which contradicts Solr's web site:

 https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching

 that seems to indicate that data soft-committed before the last
 hard-commit is lost.

 I reproduced what the author did in the first post (the two lessons he
 listed) with Solr 4.7, and specifically compared below two experiments:

 1. I posted some records to Solr without commit
 2. I could not view the records in the browser right away, since I set
    soft-commit to 5 seconds
 3. After 5 seconds, I can view the records in the browser
 4. Hard commit still does not happen, since I set it to 60 seconds
 5. Kill the Solr with a kill -9 processId
 6. Keep the log file
 7. Re-start the Solr
 8. I could see the records via browser

 I think the hard-commit does not happen in the above experiment, since in
 a different experiment, I got:

 1. I posted some records to Solr without commit
 2. I could not view the records in the browser right away, since I set
    soft-commit to 5 seconds
 3. After 5 seconds, I can view the records in the browser
 4. Hard commit still does not happen, since I set it to 60 seconds
 5. Kill the Solr with a kill -9 processId
 6. Remove the log file
 7. Re-start the Solr
 8. I could NOT see the records via browser

 This means Solr supports some database-like recovery (based on log). So,
 as long as the log exists, after a crash, Solr can still recover from the
 log.

 Any comments or idea?

 Thanks.

 Shushuai



Re: Help me understand these newrelic graphs

2014-03-13 Thread Otis Gospodnetic
It really depends, hard to give a definitive instruction without more
pieces of info.
e.g. if your CPUs are all maxed out and you already have a high number of
concurrent queries than sharding may not be of any help at all.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Mar 13, 2014 at 7:42 PM, Software Dev static.void@gmail.comwrote:

 Ahh.. it's including the add operation. That makes sense then. A bit silly
 on NR's part they don't break it down.

 Otis, our index is only 8G so I don't consider that big by any means but
 our queries can get a bit complex with a bit of faceting. Do you still
 think it makes sense to shard? How easy would this be to get working?


 On Thu, Mar 13, 2014 at 4:02 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  Hi,
 
  I think NR has support for breaking by handler, no?  Just checked - no.
   Only webapp controller, but that doesn't apply to Solr.
 
  SPM should be more helpful when it comes to monitoring Solr - you can
  filter by host, handler, collection/core, etc. -- you can see the demo -
  https://apps.sematext.com/demo - though this is plain Solr, not
 SolrCloud.
 
  If your index is big or queries are complex, shard it and parallelize
  search.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Thu, Mar 13, 2014 at 6:17 PM, ralph tice ralph.t...@gmail.com
 wrote:
 
   I think your response time is including the average response for an add
    operation, which generally returns very quickly and due to sheer numbers
   is
   averaging out the response time of your queries.  New Relic should
 break
   out requests based on which handler they're hitting but they don't seem
  to.
  
  
   On Thu, Mar 13, 2014 at 2:18 PM, Software Dev 
 static.void@gmail.com
   wrote:
  
Here are some screen shots of our Solr Cloud cluster via Newrelic
   
http://postimg.org/gallery/2hyzyeyc/
   
We currently have a 5 node cluster and all indexing is done on
 separate
machines and shipped over. Our machines are running on SSD's with 18G
  of
ram (Index size is 8G). We only have 1 shard at the moment with
  replicas
   on
all 5 machines. I'm guessing that's a bit of a waste?
   
How come when we do our bulk updating the response time actually
   decreases?
I would think the load would be higher, therefore response time should
 be
higher. Any way I can decrease the response time?
   
Thanks
   
  
 



Disabling lookups into disabled caches?

2014-03-11 Thread Otis Gospodnetic
Hi,

Is there a way to disable cache *lookups* into caches that are disabled?

Check this for example: https://apps.sematext.com/spm-reports/s/Z04bfIvGyH

This is a Document cache that was enabled, and then got disabled.  But the
lookups are still happening, which is pointless if the cache is disabled.

If that's not doable, I will JIRA?

Thanks,
Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Disabling lookups into disabled caches?

2014-03-11 Thread Otis Gospodnetic
Hi Shawn,

Here it is: https://issues.apache.org/jira/browse/SOLR-5851

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Mar 11, 2014 at 11:22 PM, Shawn Heisey s...@elyograg.org wrote:

 On 3/11/2014 8:51 PM, Shawn Heisey wrote:
  On 3/11/2014 8:07 PM, Otis Gospodnetic wrote:
  Is there a way to disable cache *lookups* into caches that are disabled?
 
  Check this for example:
 https://apps.sematext.com/spm-reports/s/Z04bfIvGyH
 
  This is a Document cache that was enabled, and then got disabled.  But
 the
  lookups are still happening, which is pointless if the cache is
 disabled.
 
  If that's not doable, I will JIRA?
 
  I think this needs an issue.  I've worked up a *possible* patch for the
  problem.  One that still needs testing and review.  Which reminds me, I
  should probably invent new test methods for this.
 
  The lookups should have very little overhead, but any avoidable overhead
  *should* be avoided.

 The quickfix that I started with on FastLRUCache didn't work and made
 most of the tests fail.  It turns out that FastLRUCache bumps the max
 cache size to 2 when you set it to zero.  I haven't looked deeper into
 the other cache types yet.

 Once you create the issue, we can move this discussion there.

 Thanks,
 Shawn




Re: Mixing lucene scoring and other scoring

2014-03-06 Thread Otis Gospodnetic
Hi Benson,

http://lucene.apache.org/core/4_7_0/expressions/org/apache/lucene/expressions/Expression.html
https://issues.apache.org/jira/browse/SOLR-5707

That?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Mar 6, 2014 at 8:34 AM, Benson Margulies bimargul...@gmail.comwrote:

 Some months ago, I talked to some people at LR about this, but I can't
 find my notes.

 Imagine a function of some fields that produces a score between 0 and 1.

 Imagine that you want to combine this score with relevance over some
 more or less complex ordinary query.

 What are the options, given the arbitrary nature of Lucene scores?
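
One concrete option is the Lucene expressions module (4.6+): a custom 0..1
score stored as a numeric docvalues field can be blended with the relevance
score in a sort. A hedged sketch, where the myscore field and the 0.7/0.3
weights are illustrative:

  import org.apache.lucene.expressions.Expression;
  import org.apache.lucene.expressions.SimpleBindings;
  import org.apache.lucene.expressions.js.JavascriptCompiler;
  import org.apache.lucene.search.Sort;
  import org.apache.lucene.search.SortField;

  public class BlendedScoreSort {
    public static Sort build() throws Exception {
      // 70% Lucene relevance + 30% custom per-document score.
      Expression expr = JavascriptCompiler.compile("0.7*_score + 0.3*myscore");
      SimpleBindings bindings = new SimpleBindings();
      bindings.add(new SortField("_score", SortField.Type.SCORE));
      bindings.add(new SortField("myscore", SortField.Type.DOUBLE));
      return new Sort(expr.getSortField(bindings, true)); // true = descending
    }
  }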



Re: Solr Filter Cache Size

2014-03-06 Thread Otis Gospodnetic
What Erick said.  That's a giant Filter Cache.  Have a look at these Solr
metrics and note the Filter Cache in the middle:
http://www.flickr.com/photos/otis/8409088080/

Note how small the cache is and how high the hit rate is.  Those are stats
for http://search-lucene.com/ and http://search-hadoop.com/ where you can
see facets on the right that end up being used as filter queries.  Most
Solr apps I've seen had small Filter Caches.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 3:34 PM, Erick Erickson erickerick...@gmail.comwrote:

 This, BTW, is an ENORMOUS number of cached queries.

 Here's a rough guide:
 Each entry will be (length of query) + maxDoc/8 bytes long.

 Think of the filterCache as a map where the key is the query
 and the value is a bitmap large enough to hold maxDoc bits.

 BTW, I'd kick this back to the default (512?) and periodically check
 it with the admin/plugins/stats page to see what kind of hit ratio
 I have and adjust from there.

 Best,
 Erick

 On Mon, Mar 3, 2014 at 11:00 AM, Benjamin Wiens
 benjamin.wi...@gmail.com wrote:
  How can we calculate how much heap memory the filter cache will consume?
 We
  understand that in order to determine a good size we also need to
 evaluate
  how many filterqueries would be used over a certain time period.
 
 
 
  Here's our setting:
 
 
 
  <filterCache
    class="solr.FastLRUCache"
    size="30"
    initialSize="30"
    autowarmCount="5"/>
 
 
 
  According to the post below, 53 GB of RAM would be needed just by the
  filter cache alone with 1.4 million Docs. Not sure if this is true and how
  this would work.
 
 
 
  Reference:
 
 http://stackoverflow.com/questions/2004/solr-filter-cache-fastlrucache-takes-too-much-memory-and-results-in-out-of-mem
 
 
 
  We filled the filterquery cache with Solr Meter and had a JVM Heap Size
 of
  far less than 53 GB.
 
 
 
  Can anyone chime in and enlighten us?
 
 
 
  Thank you!
 
 
  Ben Wiens & Benjamin Mosior
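
Plugging numbers into Erick's formula makes the 53 GB figure plausible; a
back-of-envelope sketch, assuming maxDoc of 1.4 million and a cache sized
at 300,000 entries (the numbers in the referenced Stack Overflow post):

  public class FilterCacheEstimate {
    public static void main(String[] args) {
      long maxDoc = 1400000L;          // documents in the index
      long bytesPerEntry = maxDoc / 8; // one bit per doc = 175,000 bytes
      long entries = 300000L;          // assumed filterCache size
      // 52,500,000,000 bytes, i.e. ~52.5 GB - the order of the quoted 53 GB.
      System.out.println(entries * bytesPerEntry);
    }
  }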



Re: Indexing huge data

2014-03-05 Thread Otis Gospodnetic
Hi,

6M is really not huge these days.  6B is big, though also still not huge
any more.  What seems to be the bottleneck?  Solr or DB or network or
something else?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu rallav...@gmail.com wrote:

 All,

 Wondering about best practices/common practices to index/re-index huge
 amounts of data in Solr. The data is about 6 million entries in the db and
 other sources (the data is not located in one resource). Trying a solrj-based
 solution to collect data from different resources to index into Solr. It
 takes hours to index into Solr.

 Thanks in advance



Re: Indexing huge data

2014-03-05 Thread Otis Gospodnetic
Hi,

It depends.  Are docs huge or small? Server single core or 32 core?  Heap
big or small?  etc. etc.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 3:02 PM, Rallavagu rallav...@gmail.com wrote:

 It seems the latency is introduced by collecting the data from different
 sources and putting them together, then the actual Solr indexing. I would
 say all these activities are contributing equally. So, is it normal to
 expect indexing to run for long? Wondering what to expect in such cases.
 Thanks.

 On 3/5/14, 11:47 AM, Otis Gospodnetic wrote:

 Hi,

 6M is really not huge these days.  6B is big, though also still not huge
 any more.  What seems to be the bottleneck?  Solr or DB or network or
 something else?

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/


 On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu rallav...@gmail.com wrote:

  All,

 Wondering about best practices/common practices to index/re-index huge
 amounts of data in Solr. The data is about 6 million entries in the db and
 other sources (the data is not located in one resource). Trying a solrj-
 based solution to collect data from different resources to index into Solr.
 It takes hours to index into Solr.

 Thanks in advance





Re: Indexing huge data

2014-03-05 Thread Otis Gospodnetic
Hi,

Each doc is 100K?  That's on the big side, yes, and the server seems on the
small side, yes.  Hence the speed. :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 3:37 PM, Rallavagu rallav...@gmail.com wrote:

 Otis,

 Good points. I guess you are suggesting that it depends on the resources.
 The documents are 100k each & the pre-processing server is a 2 CPU VM running
 with 4G RAM. So, that could be a relatively small machine to process such an
 amount of data??


 On 3/5/14, 12:27 PM, Otis Gospodnetic wrote:

 Hi,

 It depends.  Are docs huge or small? Server single core or 32 core?  Heap
 big or small?  etc. etc.

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/


 On Wed, Mar 5, 2014 at 3:02 PM, Rallavagu rallav...@gmail.com wrote:

 It seems the latency is introduced by collecting the data from different
 sources and putting them together, then the actual Solr indexing. I would
 say all these activities are contributing equally. So, is it normal to
 expect indexing to run for long? Wondering what to expect
 in such cases. Thanks.

 On 3/5/14, 11:47 AM, Otis Gospodnetic wrote:

  Hi,

 6M is really not huge these days.  6B is big, though also still not huge
 any more.  What seems to be the bottleneck?  Solr or DB or network or
 something else?

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/


 On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu rallav...@gmail.com wrote:

   All,


 Wondering about best practices/common practices to index/re-index huge
 amounts of data in Solr. The data is about 6 million entries in the db
 and other sources (the data is not located in one resource). Trying a
 solrj-based solution to collect data from different resources to index
 into Solr. It takes hours to index into Solr.

 Thanks in advance
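
On the Solr side, batching documents and streaming them with a
multi-threaded client usually helps; a SolrJ 4.x sketch where the URL,
batch size, thread count, and the row-fetching step are illustrative
assumptions:

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexer {
    public static void index(Iterable<String[]> rows) throws Exception {
      // 10k-doc queue drained by 4 background indexing threads.
      ConcurrentUpdateSolrServer solr = new ConcurrentUpdateSolrServer(
          "http://localhost:8983/solr/collection1", 10000, 4);
      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (String[] row : rows) { // rows come from the merged sources
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", row[0]);
        doc.addField("text", row[1]);
        batch.add(doc);
        if (batch.size() == 1000) { solr.add(batch); batch.clear(); }
      }
      if (!batch.isEmpty()) solr.add(batch);
      solr.commit();
      solr.shutdown();
    }
  }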







Re: Scalability Limit of SolrCloud

2014-02-27 Thread Otis Gospodnetic
It depends on hardware, your latency requirements and such.

We've helped customers with several billion documents, so big numbers alone
are not a problem.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Feb 27, 2014 6:47 AM, Vineet Mishra clearmido...@gmail.com wrote:

 Hi All

 What is the scalability limit of CloudSolr? Can it reach to index billions
 of documents, each document containing 400-500 number fields (probably
 float or double)?
 Is it possible and feasible to go with the current CloudSolr architecture,
 or is there some other alternative or replacement?

 Regards



Re: SolrCloud Startup

2014-02-24 Thread Otis Gospodnetic
Hi,

Slow startup... could it be your transaction logs are being replayed?  Are
they very big?  Do you see lots of disk reading during those 20-30 minutes?

Shawn was referring to http://wiki.apache.org/solr/SolrPerformanceProblems

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Feb 24, 2014 at 10:41 PM, Shawn Heisey s...@elyograg.org wrote:

  Hi
 
   I have a 4 node solrcloud cluster with more than 50 collections with 4
  shards each. Every time I want to make a schema change, I upload configs
 to
  zookeeper and then restart all nodes. However the restart of every node
 is
  very slow and takes about 20-30 minutes per node.
 
  Is it recommended to make loadOnStartup=false and allow solrcloud to lazy
  load? Is there a way to make schema changes without restarting solrcloud?

 I'm on my phone so getting a Url for you is hard. Search the wiki for
 SolrPerformanceProblems. There's a section there on slow startup.

 If that's not it, it's probably not enough RAM for the OS disk cache. That
 is also discussed on that wiki page.

 Thanks,
 Shawn






JOB @ Sematext: Professional Services Lead => Head

2014-02-18 Thread Otis Gospodnetic
Hello,


We have what I think is a great opening at Sematext. Ideal candidate would
be in New York, but that's not an absolute must. More info below + on
http://sematext.com/about/jobs.html in job-ad-speak, but I'd be happy to
describe what we are looking for, what we do, and what types of companies
we work with in regular-human-speak off-line.

DESCRIPTION

Sematext is hiring a technical, hands-on Professional Services Lead to join,
lead, and grow the Professional Services side of Sematext and potentially
grow into the Head role.

REQUIREMENTS

* Experience working with Solr or Elasticsearch

* Plan and coordinate customer engagements from business and technical
perspective

* Identify customer pain points, needs, and success criteria at the onset
of each engagement

* Provide expert-level consulting and support services and strive to be a
trustworthy advisor to a wide range of customers

* Resolve complex search issues involving Solr or Elasticsearch

* Identify opportunities to provide customers with additional value through
our products or services

* Communicate high-value use cases and customer feedback to our Product
teams

* Participate in open source community by contributing bug fixes,
improvements, answering questions, etc.

EXPERIENCE

* BS or higher in Engineering or Computer Science preferred

* 2 or more years of IT Consulting and/or Professional Services experience
required

* Exposure to other related open source projects (Hadoop, Nutch, Kafka,
Storm, Mahout, etc.) a plus

* Experience with other commercial and open source search technologies a
plus

* Enterprise Search, eCommerce, and/or Business Intelligence experience a
plus

* Experience working in a startup a plus

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


Re: Solr server requirements for 100+ million documents

2014-02-11 Thread Otis Gospodnetic
Hi Susheel,

No, we wouldn't want to go with just 1 ZK. :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Feb 11, 2014 at 5:18 PM, Susheel Kumar 
susheel.ku...@thedigitalgroup.net wrote:

 Hi Otis,

 Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1
 for Zookeeper. Is that correct?

 Thanks,
 Susheel

 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Friday, January 24, 2014 5:21 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 Hi Susheel,

 Like Erick said, it's impossible to give precise recommendations, but
 making a few assumptions and combining them with experience (+ a licked
 finger in the air):
 * 3 servers
 * 32 GB
 * 2+ CPU cores
 * Linux

 Assuming docs are not bigger than a few KB, that they are not being
 reindexed over and over, that you don't have a search rate higher than a
 few dozen QPS, assuming your queries are not a page long, etc. assuming
 best practices are followed, the above should be sufficient.

 I hope this helps.

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics Solr &
 Elasticsearch Support * http://sematext.com/


 On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar 
 susheel.ku...@thedigitalgroup.net wrote:

  Hi,
 
  Currently we are indexing 10 million documents from a database (10 db
  data entities) & index size is around 8 GB on a windows virtual box.
  Indexing in one shot takes 12+ hours, while indexing in parallel in
  separate cores & merging them together takes 4+ hours.
 
  We are looking to scale to 100+ million documents and looking for
  recommendation on servers requirements on below parameters for a
  Production environment. There can be 200+ users performing searches at the
  same time.
 
  No. of physical servers (considering solr cloud)
  Memory requirement
  Processor requirement (# cores)
  Linux as OS as opposed to windows
 
  Thanks in advance.
  Susheel
 
 



Re: need help in understanding solr cloud stats data

2014-02-04 Thread Otis Gospodnetic
+101 for more stats.  Was just saying that trying to pre-aggregate them
along multiple dimensions is probably best left out of Solr.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Feb 4, 2014 at 10:49 AM, Mark Miller markrmil...@gmail.com wrote:

 I think that is silly. We can still offer per shard stats *and* let a user
 easily see stats for a collection without requiring them to jump through
 hoops or use a specific monitoring solution where someone else has already
 jumped through hoops for them.

 You don't have to guess what ops people really want - *everyone* wants
 stats that make sense for the collections and cluster on top of the per
 shard stats. *Everyone* wouldn't mind seeing these without having to setup
 a monitoring solution first.

 If you want more than that, then you can fiddle with your monitoring
 solution.

 - Mark

 http://about.me/markrmiller

 On Feb 3, 2014, at 11:10 PM, Otis Gospodnetic otis.gospodne...@gmail.com
 wrote:

  Hi,
 
  Oh, I just saw Greg's email on dev@ about this.
  IMHO aggregating in the search engine is not the way to go.  Leave that
 to
  external tools, which are likely to be more flexible when it comes to
 this.
  For example, our SPM for Solr can do all kinds of aggregations and
  filtering by a number of Solr and SolrCloud-specific dimensions already,
  without Solr having to do any sort of aggregation that it thinks Ops
 people
  will really want.
 
  Otis
  --
  Performance Monitoring * Log Analytics * Search Analytics
  Solr & Elasticsearch Support * http://sematext.com/
 
 
  On Mon, Feb 3, 2014 at 11:08 AM, Mark Miller markrmil...@gmail.com
 wrote:
 
  You should contribute that and spread the dev load with others :)
 
  We need something like that at some point, it's just no one has done it.
  We currently expect you to aggregate in the monitoring layer and it's a
 lot
  to ask IMO.
 
  - Mark
 
  http://about.me/markrmiller
 
  On Feb 3, 2014, at 10:49 AM, Greg Walters greg.walt...@answers.com
  wrote:
 
  I've had some issues monitoring Solr with the per-core mbeans and ended
  up writing a custom request handler that gets loaded then registers
  itself as an mbean. When called it polls all the per-core mbeans then
 adds
  or averages them where appropriate before returning the requested value.
  I'm not sure if there's a better way to get jvm-wide stats via jmx but
 it
  is *a* way to get it done.
 
  Thanks,
  Greg
 
  On Feb 3, 2014, at 1:33 AM, adfel70 adfe...@gmail.com wrote:
 
  I'm sending all solr stats data to graphite.
  I have some questions:
  1. query_handler/select requestTime -
  if I'm looking at some metric, let's say 75thPcRequestTime - I see that
  each
  core in a single collection has different values.
  Is each value of each core the time that specific core spent on a
  request?
  so to get an idea of total request time, I should sum all the
  values
  of all the cores?
 
 
  2. update_handler/commits - does this include auto_commits? Because I'm
 I'm
  pretty sure I'm not doing any manual commits and yet I see a number
  there.
 
  3. update_handler/docs pending - what does this mean? pending for
 what?
  for
  flush to disk?
 
  thanks.
 
 
 
  --
  View this message in context:
 
 http://lucene.472066.n3.nabble.com/need-help-in-understating-solr-cloud-stats-data-tp4114992.html
  Sent from the Solr - User mailing list archive at Nabble.com.
 
 
 




Re: Special NGRAMish requirement

2014-02-03 Thread Otis Gospodnetic
Hi,

Can you provide an example, Alexander?

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Feb 3, 2014 5:28 AM, Lochschmied, Alexander 
alexander.lochschm...@vishay.com wrote:

 Hi,

 we need to use something very similar to EdgeNGram (minGramSize=1
 maxGramSize=50 side=front).
 The only thing missing is that we would like to reduce the number of
 matches. The request we need to implement is returning only those matches
 with the longest tokens (or terms if that is the right word).

 Is there a way to do this in Solr (not necessarily with EdgeNGram)?

 Thanks,
 Alexander



Re: how to write an efficient query with a subquery to restrict the search space?

2014-02-03 Thread Otis Gospodnetic
Hi,

Sounds like a possible document and query routing use case.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jan 31, 2014 7:11 AM, svante karlsson s...@csi.se wrote:

 It seems to be faster to first restrict the search space and then do the
 scoring, compared to just using the full query and letting solr handle everything.

 For example, in my application one of the scoring fields effectively hits
 1/12 of the database (a month field), and if we have 100'' items in the
 database then this matters.

 /svante


 2014-01-30 Jack Krupansky j...@basetechnology.com:

  Lucene's default scoring should give you much of what you want - ranking
  hits of low-frequency terms higher - without any special query syntax -
  just list out your terms and use OR as your default operator.
 
  -- Jack Krupansky
 
  -Original Message- From: svante karlsson
  Sent: Thursday, January 23, 2014 6:42 AM
  To: solr-user@lucene.apache.org
  Subject: how to write an efficient query with a subquery to restrict the
  search space?
 
 
  I have a solr db containing 1 billion records that I'm trying to use in a
  NoSQL fashion.
 
  What I want to do is find the best matches using all search terms but
  restrict the search space to the most unique terms
 
  In this example I know that val2 and val4 are rare terms and val1 and val3
  are more common. In my real scenario I'll have 20 fields that I want to
  include or exclude in the inner query depending on the uniqueness of the
  requested value.
 
 
  my first approach was:
  q=field1:val1 OR field2:val2 OR field3:val3 OR field4:val4 AND
 (field2:val2
  OR field4:val4)&rows=100&fl=*
 
  but what I think I get is
  ... field4:val4 AND (field2:val2 OR field4:val4), and this result is then
  OR'ed with the rest
 
  if I write
  q=(field1:val1 OR field2:val2 OR field3:val3 OR field4:val4) AND
  (field2:val2 OR field4:val4)&rows=100&fl=*
 
  then what I think I get is two sub-queries that are evaluated separately
 and
  then joined - performance-wise this is bad.
 
  What's the best way to write these types of queries?
 
 
  Are there any performance issues when running it on several solrcloud
 nodes
  vs a single instance or should it scale?
 
 
 
  /svante
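
One built-in way to get restrict-first-then-score semantics is to keep the
scoring clauses in q and move the rare-term restriction into a filter query
(fq), which is applied before scoring and cached separately; a sketch with
SolrJ:

  import org.apache.solr.client.solrj.SolrQuery;

  public class RestrictedScoringQuery {
    public static SolrQuery build() {
      SolrQuery q = new SolrQuery(
          "field1:val1 OR field2:val2 OR field3:val3 OR field4:val4");
      q.addFilterQuery("field2:val2 OR field4:val4"); // restricts candidates
      q.setRows(100);
      q.setFields("*");
      return q;
    }
  }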
 



Re: Adding DocValues in an existing field

2014-02-03 Thread Otis Gospodnetic
Hi,

You can change the field definition and then reindex.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Jan 30, 2014 1:12 PM, yriveiro yago.rive...@gmail.com wrote:

 Hi,

 Can I add the docValues feature to an existing field without wiping the
 current index?

 The modification on the schema will be something like this:
 <field name="surrogate_id" type="tlong" indexed="true" stored="true"
 multiValued="false" />
 <field name="surrogate_id" type="tlong" indexed="true" stored="true"
 multiValued="false" docValues="true"/>

  I want to use the existing data to reindex into the same collection and
  create the docValues in the process - is that possible?

 I'm using solr 4.6.1



 -
 Best regards
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Adding-DocValues-in-an-existing-field-tp4114462.html
 Sent from the Solr - User mailing list archive at Nabble.com.
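
 [A minimal SolrJ 4.x sketch of that reindex-in-place, assuming every field
 is stored, that nothing else commits mid-run (so the paging view stays
 stable), and with URL, batch size, and class name illustrative only:

 import java.util.ArrayList;
 import java.util.List;

 import org.apache.solr.client.solrj.SolrQuery;
 import org.apache.solr.client.solrj.impl.HttpSolrServer;
 import org.apache.solr.client.solrj.response.QueryResponse;
 import org.apache.solr.common.SolrDocument;
 import org.apache.solr.common.SolrInputDocument;

 public class ReindexForDocValues {
     public static void main(String[] args) throws Exception {
         HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
         final int rows = 500;
         int start = 0;
         long numFound;
         do {
             SolrQuery q = new SolrQuery("*:*");
             q.addSort("surrogate_id", SolrQuery.ORDER.asc); // stable paging order
             q.setStart(start);
             q.setRows(rows);
             QueryResponse rsp = solr.query(q);
             numFound = rsp.getResults().getNumFound();
             List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
             for (SolrDocument doc : rsp.getResults()) {
                 SolrInputDocument in = new SolrInputDocument();
                 for (String field : doc.getFieldNames()) {
                     if ("_version_".equals(field)) continue; // let Solr reassign versions
                     in.addField(field, doc.getFieldValue(field));
                 }
                 batch.add(in);
             }
             if (!batch.isEmpty()) {
                 solr.add(batch); // re-adds each doc, now written with docValues
             }
             start += rows;
         } while (start < numFound);
         solr.commit(); // single commit at the end keeps the paging consistent
     }
 }

 Deep paging with start gets slow on large cores; from Solr 4.7 on,
 cursorMark is the better tool for this loop.]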



Re: need help in understating solr cloud stats data

2014-02-03 Thread Otis Gospodnetic
Hi,

Oh, I just saw Greg's email on dev@ about this.
IMHO aggregating in the search engine is not the way to go.  Leave that to
external tools, which are likely to be more flexible when it comes to this.
 For example, our SPM for Solr can do all kinds of aggregations and
filtering by a number of Solr and SolrCloud-specific dimensions already,
without Solr having to guess which aggregations Ops people
will really want.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Feb 3, 2014 at 11:08 AM, Mark Miller markrmil...@gmail.com wrote:

 You should contribute that and spread the dev load with others :)

 We need something like that at some point; it's just that no one has done it.
 We currently expect you to aggregate in the monitoring layer and it's a lot
 to ask IMO.

 - Mark

 http://about.me/markrmiller

 On Feb 3, 2014, at 10:49 AM, Greg Walters greg.walt...@answers.com
 wrote:

  I've had some issues monitoring Solr with the per-core mbeans and ended
 up writing a custom request handler that gets loaded and then registers
 itself as an mbean. When called, it polls all the per-core mbeans, then adds
 or averages them where appropriate before returning the requested value.
 I'm not sure if there's a better way to get JVM-wide stats via JMX, but it
 is *a* way to get it done.
 
  Thanks,
  Greg
 
  On Feb 3, 2014, at 1:33 AM, adfel70 adfe...@gmail.com wrote:
 
  I'm sending all solr stats data to graphite.
  I have some questions:
  1. query_handler/select requestTime -
   if I'm looking at some metric, let's say 75thPcRequestTime - I see that
   each core in a single collection has different values.
   Is the value for each core the time that specific core spent on a request?
   So to get an idea of total request time, should I sum the values of all
   the cores?
 
 
   2. update_handler/commits - does this include auto_commits? Because I'm
  pretty sure I'm not doing any manual commits and yet I see a number
 there.
 
  3. update_handler/docs pending - what does this mean? pending for what?
 for
  flush to disk?
 
  thanks.
 
 
 
  --
  View this message in context:
 http://lucene.472066.n3.nabble.com/need-help-in-understating-solr-cloud-stats-data-tp4114992.html
  Sent from the Solr - User mailing list archive at Nabble.com.
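
 [For the aggregate-in-the-monitoring-layer route, the same per-core stats
 are also exposed over plain HTTP, so an external poller needs neither JMX
 nor a custom handler; for example (core name hypothetical):

   curl 'http://localhost:8983/solr/core1/admin/mbeans?stats=true&wt=json'

 Polling that endpoint for each core and summing or averaging in the
 monitoring tool keeps all aggregation logic out of Solr.]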
 




Re: Duplicate Facet.FIelds cause same results, should dedupe?

2014-02-03 Thread Otis Gospodnetic
Hi,

Don't know if this is an old or a new problem, but it does feel like a bug to me.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Feb 3, 2014 at 10:48 AM, William Bell billnb...@gmail.com wrote:

 If we add :

  facet.field=prac_spec_heir&facet.field=prac_spec_heir

  we get it twice in the results. This breaks deserialization on wt=json,
  since you cannot have the same key twice in a JSON object.

 Thoughts? Seems like a new bug in 4.6 ?


 facet.field:
 [prac_spec_heir,all_proc_name_code,all_cond_name_code,
 prac_spec_heir,{!ex=exgender}gender,{!ex=expayor}payor_code_name],

 --
 Bill Bell
 billnb...@gmail.com
 cell 720-256-8076
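
 [Until the duplicate handling is fixed server-side, a client-side workaround
 is to dedupe the field list before building the request; a SolrJ sketch
 (class and method names here are hypothetical helpers, not Solr API beyond
 SolrQuery itself):

 import java.util.LinkedHashSet;
 import java.util.Set;

 import org.apache.solr.client.solrj.SolrQuery;

 public class DedupedFacets {
     // Builds a query whose facet.field params contain each field only once.
     public static SolrQuery build(String queryString, String... requestedFacets) {
         SolrQuery q = new SolrQuery(queryString);
         q.setFacet(true);
         // LinkedHashSet drops duplicates while preserving the original order.
         Set<String> unique = new LinkedHashSet<String>();
         for (String f : requestedFacets) {
             unique.add(f);
         }
         q.addFacetField(unique.toArray(new String[unique.size()]));
         return q;
     }
 }

 So build("*:*", "prac_spec_heir", "prac_spec_heir") sends
 facet.field=prac_spec_heir only once, and the JSON response deserializes
 cleanly.]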



Re: SOLR USING 100% percent CPU and not responding after a while

2014-01-28 Thread Otis Gospodnetic
Hi,

Show us more graphs.  Is the GC working hard?  Any of the JVM mem pools at
or near 100%?  SPM for Solr is your friend for long-term
monitoring/alerting/trends; jconsole and visualvm for a quick look.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 28, 2014 at 2:11 PM, heaven aheave...@gmail.com wrote:

 I have the same problem, please look at the image:
 http://lucene.472066.n3.nabble.com/file/n4114026/Screenshot_733.png

 And this is on idle. Index size is about 90Gb. Solr 4.4.0. Memory is not an
 issue, there's a lot. RAID 10 (15000RPM rapid hdd).



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/SOLR-USING-100-percent-CPU-and-not-responding-after-a-while-tp4021359p4114026.html
 Sent from the Solr - User mailing list archive at Nabble.com.
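
 [To confirm the GC theory before the next spike, standard HotSpot GC logging
 can be turned on (flags as of Java 6/7; the log path is an example):

   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
   -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/solr/gc.log

 Long "Total time for which application threads were stopped" entries lining
 up with the CPU spikes would point at GC, as would old-gen pools sitting
 near 100% in jconsole or visualvm.]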



Re: Solr Related Search Suggestions

2014-01-27 Thread Otis Gospodnetic
Hi,

I don't know of anything like that in OSS, but we have it here:
http://sematext.com/products/related-searches/index.html

Is that the functionality you are looking for?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Mon, Jan 27, 2014 at 4:29 AM, kumar pavan2...@gmail.com wrote:

  What is the best way to implement related search suggestions?

 For example :

  If the user is looking for marriage halls, I need to show results like
 catering services, photography, wedding cards, invitation cards,
 music organisers.


  Thanks & Regards,
 kumar



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Solr-Related-Search-Suggestions-tp4113672.html
 Sent from the Solr - User mailing list archive at Nabble.com.


