Re: load balancer for solr

2016-11-06 Thread Rallavagu

Hi Shawn,

Curious: I suppose you have haproxy in front of Solr in a Master/Slave
configuration? Thanks.


On 11/6/16 9:33 AM, Shawn Heisey wrote:

On 11/6/2016 4:08 AM, Mugeesh Husain wrote:

Please suggest a load balancer name?


I use haproxy.  It is a software load balancer with pretty impressive
performance characteristics.

My haproxy setup for Solr has been running without problems for years
now.  I'm using pacemaker to provide redundancy for haproxy with two
servers.

http://www.haproxy.org/

Thanks,
Shawn



Re: indexing - offline

2016-10-20 Thread Rallavagu

Thanks Tom for the quick response.

On 10/20/16 10:19 AM, Tom Evans wrote:

On Thu, Oct 20, 2016 at 5:38 PM, Rallavagu <rallav...@gmail.com> wrote:

Solr 5.4.1 cloud with embedded jetty

Looking for some ideas around offline indexing, where an independent node
would be indexed offline (not in the cloud) and then added to the cloud to
become leader, so the other cloud nodes would replicate from it. Wondering if
this is possible without interrupting the live service. Thanks.


How we do this, to reindex collection "foo":

1) First, collection "foo" should be an alias to the real collection,
eg "foo_1" aliased to "foo"
2) Have a node "node_i" in the cluster that is used for indexing. It
doesn't hold any shards of any collections
So, a node is part of the cluster but holds no collections? How can we add a
node to the cloud without it actively participating?



3) Use collections API to create collection "foo_2", with however many
shards required, but all placed on "node_i"
4) Index "foo_2" with new data with DIH or direct indexing to "node_i".
5) Use collections API to expand "foo_2" to all the nodes/replicas
that it should be on
Could you please point me to documentation on how to do this? I am 
referring to this doc: 
https://cwiki.apache.org/confluence/display/solr/Collections+API. But it 
has many options and, honestly, I am not sure which one would be useful in 
this case.


Thanks


6) Remove "foo_2" from "node_i"
7) Verify contents of "foo_2" are correct
8) Use collections API to change alias for "foo" to "foo_2"
9) Remove "foo_1" collection once happy

This avoids indexing overwhelming the performance of the cluster (or
any nodes in the cluster that receive queries), and can be performed
with zero downtime or config changes on the clients.
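
For illustration, those steps map onto Collections API calls roughly like
the following sketch; the hosts, shard counts, node names and replica names
below are placeholders you would need to adapt to your own cluster:

# 1) alias "foo" to the live collection (if not already done)
curl 'http://host:8983/solr/admin/collections?action=CREATEALIAS&name=foo&collections=foo_1'

# 3) create "foo_2" with its shards placed only on the indexing node
curl 'http://host:8983/solr/admin/collections?action=CREATE&name=foo_2&numShards=2&replicationFactor=1&createNodeSet=node_i:8983_solr'

# 5) expand "foo_2" onto the serving nodes (one ADDREPLICA per shard/node)
curl 'http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=foo_2&shard=shard1&node=node1:8983_solr'

# 6) remove the copies on the indexing node (replica names come from CLUSTERSTATUS)
curl 'http://host:8983/solr/admin/collections?action=DELETEREPLICA&collection=foo_2&shard=shard1&replica=core_node1'

# 8) switch the alias, then 9) drop the old collection once happy
curl 'http://host:8983/solr/admin/collections?action=CREATEALIAS&name=foo&collections=foo_2'
curl 'http://host:8983/solr/admin/collections?action=DELETE&name=foo_1'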

Cheers

Tom



indexing - offline

2016-10-20 Thread Rallavagu

Solr 5.4.1 cloud with embedded jetty

Looking for some ideas around offline indexing, where an independent node 
would be indexed offline (not in the cloud) and then added to the cloud to 
become leader, so the other cloud nodes would replicate from it. Wondering 
if this is possible without interrupting the live service. Thanks.


Queries to help warm up (mmap)

2016-10-06 Thread Rallavagu
Looking for clues/recommendations to help warm up during startup - not 
necessarily just the Solr caches, but the mmap as well. I have used queries 
like "q=<field>:[* TO *]" for various fields and it seems to help with mmap 
population to around 40-50%. Is there anything else that could help 
achieve 90% or more? Thanks.
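
For context, a sketch of how such queries can be wired into a "firstSearcher"
listener in solrconfig.xml; the field names, sort and facet below are
hypothetical and should be replaced with whatever your real queries touch:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- open range query to touch the index structures for a given field -->
    <lst><str name="q">title:[* TO *]</str><str name="rows">10</str></lst>
    <!-- include a sort and a facet so sorting/faceting data gets loaded too -->
    <lst><str name="q">*:*</str><str name="sort">timestamp desc</str></lst>
    <lst><str name="q">*:*</str><str name="facet">true</str><str name="facet.field">category</str></lst>
  </arr>
</listener>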


Re: QuerySenderListener

2016-10-05 Thread Rallavagu

Not sure if this is related.

https://issues.apache.org/jira/browse/SOLR-7035

firstSearcher has a few queries that run long (~3 min)

On 10/5/16 6:58 PM, Erick Erickson wrote:

How many cores? Is it possible you're seeing these from two different cores?

Erick

On Wed, Oct 5, 2016 at 11:44 AM, Rallavagu <rallav...@gmail.com> wrote:

Solr Cloud 5.4.1 with embedded jetty, jdk8

At the time of startup it appears that "QuerySenderListener" is run twice
and this is causing "firstSearcher" and "newSearcher" to run twice as well.
Any clues as to why QuerySenderListener is triggered twice? Thanks.


Re: QuerySenderListener

2016-10-05 Thread Rallavagu

It is a single core.

On 10/5/16 6:58 PM, Erick Erickson wrote:

How many cores? Is it possible you're seeing these from two different cores?

Erick

On Wed, Oct 5, 2016 at 11:44 AM, Rallavagu <rallav...@gmail.com> wrote:

Solr Cloud 5.4.1 with embedded jetty, jdk8

At the time of startup it appears that "QuerySenderListener" is run twice
and this is causing "firstSearcher" and "newSearcher" to run twice as well.
Any clues as to why QuerySenderListener is triggered twice? Thanks.


QuerySenderListener

2016-10-05 Thread Rallavagu

Solr Cloud 5.4.1 with embedded jetty, jdk8

At the time of startup it appears that "QuerySenderListener" is run 
twice and this is causing "firstSearcher" and "newSearcher" to run twice 
as well. Any clues as to why QuerySenderListener is triggered twice? Thanks.


disable updates during startup

2016-10-04 Thread Rallavagu

Solr Cloud 5.4.1 with embedded Jetty - jdk 8

Is there a way to disable incoming updates (from the leader) during startup 
until the "firstSearcher" queries have finished? I am noticing that 
firstSearcher queries keep running at startup time while the node shows up 
as "Recovering".


Thanks


Re: slow updates/searches

2016-09-30 Thread Rallavagu

Hi Erick,

Yes. Apparently, there is work to do with phrase queries. As I continue 
to debug, I noticed that a multi-word phrase query is CPU bound, as it 
certainly works "hard". Are there any optimizations to consider?


On 9/29/16 8:14 AM, Erick Erickson wrote:

bq: The QTimes increase as the number of words in a phrase increase

Well, there's more work to do as the # of words increases, and if you
have large slops there's more work yet.

Best,
Erick

On Wed, Sep 28, 2016 at 5:54 PM, Rallavagu <rallav...@gmail.com> wrote:

Thanks Erick.

I have added queries for "firstSearcher" and "newSearcher". After startup,
pmap shows well-populated mmap entries and QTimes are better than before.

However, phrase queries (edismax with pf2) are still sluggish. The QTimes
increase as the number of words in a phrase increases. None of the mmap
"warming" seems to have any impact on this. Am I missing anything? Thanks.

On 9/24/16 5:20 PM, Erick Erickson wrote:


Hmm..

About <1>: Yep, GC is one of the "more art than science" bits of
Java/Solr. Siiigh.

About <2>: that's what autowarming is about. Particularly the
filterCache and queryResultCache. My guess is that you have the
autowarm count on those two caches set to zero. Try setting it to some
modest number like 16 or 32. The whole _point_ of those parameters is
to smooth out these kinds of spikes. Additionally, the newSearcher
event (also in solrconfig.xml) is explicitly intended to allow you to
hard-code queries that fill the internal caches as well as the mmap OS
memory from disk, people include facets, sorts and the like in that
event. It's fired every time a new searcher is opened (i.e. whenever
you commit and open a new searcher)...

FirstSearcher is for restarts. The difference is that newSearcher
presumes Solr has been running for a while and the autowarm counts
have something to work from. OTOH, when you start Solr there's no
history to autowarm, so firstSearcher can be quite a bit more complex
than newSearcher. Practically, most people just copy newSearcher into
firstSearcher on the assumption that restarting Solr is pretty
rare.
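
To make that concrete, a sketch of the relevant solrconfig.xml pieces; the
cache sizes, autowarm counts and the warming query below are illustrative
only, not recommendations:

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="32"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- facet/sort queries here re-fill the caches and OS page cache after each commit -->
    <lst><str name="q">*:*</str><str name="facet">true</str><str name="facet.field">category</str></lst>
  </arr>
</listener>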

about <3> MMap stuff will be controlled by the OS I think. I actually
worked with a much more primitive system at one point that would be
dog-slow during off-hours. Someone wrote an equivalent of a cron job
to tickle the app upon occasion to prevent periodic slowness.

for a nauseating set of details about hard and soft commits, see:

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick


On Sat, Sep 24, 2016 at 11:35 AM, Rallavagu <rallav...@gmail.com> wrote:




On 9/22/16 5:59 AM, Shawn Heisey wrote:



On 9/22/2016 5:46 AM, Muhammad Zahid Iqbal wrote:



Did you find any solution to the slow searches? As far as I know, the jetty
container default configuration is a bit slow for large production
environments.




This might be true for the default configuration that comes with a
completely stock jetty downloaded from eclipse.org, but the jetty
configuration that *Solr* ships with is adequate for just about any Solr
installation.  The Solr configuration may require adjustment as the
query load increases, but the jetty configuration usually doesn't.

Thanks,
Shawn



It turned out to be a "sequence of performance testing sessions" in order to
locate the slowness. Though I am not completely done with it, here are my
findings so far. We are using an NRT configuration (warmup count of 0 for
caches and NRTCachingDirectoryFactory for the index directory).

1. Essentially, Solr searches (particularly with edismax and relevance)
generate a lot of "garbage", which makes GC activity kick in more often. This
becomes even more pronounced when facets are included. It has a huge impact
on QTimes (I have a 12g heap with 6g configured for NewSize).

2. After a fresh restart (or core reload), when searches are performed, Solr
will initially "populate" the mmap entries, and this adds to the total QTimes
(I have made sure that the index files are cached at the filesystem layer
using vmtouch - https://hoytech.com/vmtouch). When I run the same test again
with the mmap entries populated from the previous test, it shows improved
QTimes relative to the previous run.

3. It seems the populated mmap entries are flushed away after a certain idle
time (not sure if this is controlled by Solr or the underlying OS). This makes
subsequent searches fetch from "disk" (even though the disk items are cached
by the OS).

So, what I am going to try next is to tune the field(s) used for facets to
reduce the index size if possible. Though I am not sure it will have an
impact, I will also attempt to change the "caches", even though they will be
invalidated after a softCommit (every 10 minutes in my case).

Any other tips/clues/suggestions are welcome. Thanks.





Re: slow updates/searches

2016-09-28 Thread Rallavagu

Thanks Erick.

I have added queries for "firstSearcher" and "newSearcher". After 
startup, pmap shows well-populated mmap entries and QTimes are better 
than before.

However, phrase queries (edismax with pf2) are still sluggish. The 
QTimes increase as the number of words in a phrase increases. None of the 
mmap "warming" seems to have any impact on this. Am I missing anything? 
Thanks.


On 9/24/16 5:20 PM, Erick Erickson wrote:

Hmm..

About <1>: Yep, GC is one of the "more art than science" bits of
Java/Solr. Siiigh.

About <2>: that's what autowarming is about. Particularly the
filterCache and queryResultCache. My guess is that you have the
autowarm count on those two caches set to zero. Try setting it to some
modest number like 16 or 32. The whole _point_ of those parameters is
to smooth out these kinds of spikes. Additionally, the newSearcher
event (also in solrconfig.xml) is explicitly intended to allow you to
hard-code queries that fill the internal caches as well as the mmap OS
memory from disk, people include facets, sorts and the like in that
event. It's fired every time a new searcher is opened (i.e. whenever
you commit and open a new searcher)...

FirstSearcher is for restarts. The difference is that newSearcher
presumes Solr has been running for a while and the autowarm counts
have something to work from. OTOH, when you start Solr there's no
history to autowarm, so firstSearcher can be quite a bit more complex
than newSearcher. Practically, most people just copy newSearcher into
firstSearcher on the assumption that restarting Solr is pretty
rare.

about <3> MMap stuff will be controlled by the OS I think. I actually
worked with a much more primitive system at one point that would be
dog-slow during off-hours. Someone wrote an equivalent of a cron job
to tickle the app upon occasion to prevent periodic slowness.

for a nauseating set of details about hard and soft commits, see:
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Best,
Erick


On Sat, Sep 24, 2016 at 11:35 AM, Rallavagu <rallav...@gmail.com> wrote:



On 9/22/16 5:59 AM, Shawn Heisey wrote:


On 9/22/2016 5:46 AM, Muhammad Zahid Iqbal wrote:


Did you find any solution to the slow searches? As far as I know, the jetty
container default configuration is a bit slow for large production
environments.



This might be true for the default configuration that comes with a
completely stock jetty downloaded from eclipse.org, but the jetty
configuration that *Solr* ships with is adequate for just about any Solr
installation.  The Solr configuration may require adjustment as the
query load increases, but the jetty configuration usually doesn't.

Thanks,
Shawn



It turned out to be a "sequence of performance testing sessions" in order to
locate the slowness. Though I am not completely done with it, here are my
findings so far. We are using an NRT configuration (warmup count of 0 for
caches and NRTCachingDirectoryFactory for the index directory).

1. Essentially, Solr searches (particularly with edismax and relevance)
generate a lot of "garbage", which makes GC activity kick in more often. This
becomes even more pronounced when facets are included. It has a huge impact
on QTimes (I have a 12g heap with 6g configured for NewSize).

2. After a fresh restart (or core reload), when searches are performed, Solr
will initially "populate" the mmap entries, and this adds to the total QTimes
(I have made sure that the index files are cached at the filesystem layer
using vmtouch - https://hoytech.com/vmtouch). When I run the same test again
with the mmap entries populated from the previous test, it shows improved
QTimes relative to the previous run.

3. It seems the populated mmap entries are flushed away after a certain idle
time (not sure if this is controlled by Solr or the underlying OS). This makes
subsequent searches fetch from "disk" (even though the disk items are cached
by the OS).

So, what I am going to try next is to tune the field(s) used for facets to
reduce the index size if possible. Though I am not sure it will have an
impact, I will also attempt to change the "caches", even though they will be
invalidated after a softCommit (every 10 minutes in my case).

Any other tips/clues/suggestions are welcome. Thanks.



Re: slow updates/searches

2016-09-24 Thread Rallavagu



On 9/22/16 5:59 AM, Shawn Heisey wrote:

On 9/22/2016 5:46 AM, Muhammad Zahid Iqbal wrote:

Did you find any solution to the slow searches? As far as I know, the jetty
container default configuration is a bit slow for large production
environments.


This might be true for the default configuration that comes with a
completely stock jetty downloaded from eclipse.org, but the jetty
configuration that *Solr* ships with is adequate for just about any Solr
installation.  The Solr configuration may require adjustment as the
query load increases, but the jetty configuration usually doesn't.

Thanks,
Shawn



It turned out to be a "sequence of performance testing sessions" in 
order to locate the slowness. Though I am not completely done with it, here 
are my findings so far. We are using an NRT configuration (warmup count of 
0 for caches and NRTCachingDirectoryFactory for the index directory).

1. Essentially, Solr searches (particularly with edismax and relevance) 
generate a lot of "garbage", which makes GC activity kick in more often. 
This becomes even more pronounced when facets are included. It has a huge 
impact on QTimes (I have a 12g heap with 6g configured for NewSize).

2. After a fresh restart (or core reload), when searches are performed, 
Solr will initially "populate" the mmap entries, and this adds to the total 
QTimes (I have made sure that the index files are cached at the filesystem 
layer using vmtouch - https://hoytech.com/vmtouch). When I run the same test 
again with the mmap entries populated from the previous test, it shows 
improved QTimes relative to the previous run.

3. It seems the populated mmap entries are flushed away after a certain idle 
time (not sure if this is controlled by Solr or the underlying OS). This 
makes subsequent searches fetch from "disk" (even though the disk items are 
cached by the OS).

So, what I am going to try next is to tune the field(s) used for facets to 
reduce the index size if possible. Though I am not sure it will have an 
impact, I will also attempt to change the "caches", even though they will be 
invalidated after a softCommit (every 10 minutes in my case).

Any other tips/clues/suggestions are welcome. Thanks.
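
For anyone curious about the vmtouch usage in point 2, a minimal sketch of
the commands (the path is a placeholder for the core's data/index directory):

# report how much of the index is currently resident in the OS page cache
vmtouch -v /path/to/solr/data/index

# pre-fault ("touch") the index files into the page cache, e.g. after a restart
vmtouch -t /path/to/solr/data/index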



Re: slow updates/searches

2016-09-19 Thread Rallavagu

Hi Erick,

Would increasing (or adjusting) the update threads help, as per this JIRA 
("Allow the number of threads ConcurrentUpdateSolrClient 
StreamingSolrClients configurable by a system property")?


https://issues.apache.org/jira/browse/SOLR-8500

Thanks


On 9/19/16 8:30 AM, Erick Erickson wrote:

Hmmm, not sure, and also not sure what to suggest next. QTimes
measure only the search time, not, say, time waiting for the request to get
serviced.

I'm afraid the next suggestion is to throw a profiler at it 'cause nothing jumps
out at me..'

Best,
Erick

On Fri, Sep 16, 2016 at 10:23 AM, Rallavagu <rallav...@gmail.com> wrote:

Comments in line...

On 9/16/16 10:15 AM, Erick Erickson wrote:


Well, the next thing I'd look at is CPU activity. If you're flooding the
system with updates there'll be CPU contention.



Monitoring does not suggest any high CPU, but as you can see from the vmstat
output, "user" CPU is a bit high during updates that are taking time (34
user, 65 idle).



And there are a number of things you can do that make updates in particular
much less efficient, from committing very frequently (sometimes combined
with excessive autowarm parameters) and the like.



softCommit is set to 10 minutes, autowarm count is set to 0 and commit is
set to 15 sec for NRT.



There are a series of ideas that might trigger an "aha" moment:
https://wiki.apache.org/solr/SolrPerformanceFactors



Reviewed this document and made a few changes accordingly a while ago.



But the crude measure is just to look at CPU usage when updates happen, or
just before. Are you running hot with queries alone, and then adding an
update burden?



Essentially, it was the high QTimes for queries that got me looking into the
logs, the system, etc., and I could correlate the update slowness with the
search slowness. The other times QTimes go high are right after a softCommit,
which is expected.

Wondering what causes the update threads to wait and whether it has any impact
on search at all. I had a couple more CPUs added, but I still see similar
behavior.

Thanks.




Best,
Erick

On Fri, Sep 16, 2016 at 9:19 AM, Rallavagu <rallav...@gmail.com> wrote:


Erick,

Was monitoring GC activity and couldn't align GC pauses with this behavior.
Also, vmstat shows no swapping or CPU I/O wait. However, whenever I see high
update response times (and correspondingly high QTimes for searches), vmstat
shows a series of "waiting to runnable" processes in the "r" column of the
"procs" section.


https://dl.dropboxusercontent.com/u/39813705/Screen%20Shot%202016-09-16%20at%209.05.51%20AM.png

procs -----------memory----------- ---swap-- -----io---- --system--- -------cpu------- ------timestamp------
 r  b   swpd     free    inact    active   si   so   bi    bo    in    cs  us sy id wa st        CDT
 2  0  71068 18688496  2526604  24204440    0    0    0     0  1433   462  27  1 73  0  0 2016-09-16 11:02:32
 1  0  71068 18688180  2526600  24204568    0    0    0     0  1388   404  26  1 74  0  0 2016-09-16 11:02:33
 1  0  71068 18687928  2526600  24204568    0    0    0     0  1354   401  25  0 75  0  0 2016-09-16 11:02:34
 1  0  71068 18687800  2526600  24204572    0    0    0     0  1311   397  25  0 74  0  0 2016-09-16 11:02:35
 1  0  71068 18687164  2527116  24204844    0    0    0     0  1770   702  31  1 69  0  0 2016-09-16 11:02:36
 1  0  71068 18686944  2527108  24204908    0    0    0    52  1266   421  26  0 74  0  0 2016-09-16 11:02:37
12  1  71068 18682676  2528560  24207116    0    0    0   280  2388   934  34  1 65  0  0 2016-09-16 11:02:38
 2  1  71068 18651340  2530820  24233368    0    0    0  1052 10258  5696  82  5 13  0  0 2016-09-16 11:02:39
 5  0  71068 18648600  2530112  24235060    0    0    0  1988  7261  3644  84  2 13  1  0 2016-09-16 11:02:40
 9  1  71068 18647804  2530580  24236076    0    0    0  1688  7031  3575  84  2 13  1  0 2016-09-16 11:02:41
 1  0  71068 18647628  2530364  24236256    0    0    0   680  7065  4463  61  3 35  1  0 2016-09-16 11:02:42
 1  0  71068 18646344  2531204  24236536    0    0    0    44  6422  4922  35  3 63  0  0 2016-09-16 11:02:43
 2  0  71068 18644460  2532196  24237440    0    0    0     0  6561  5056  25  3 72  0  0 2016-09-16 11:02:44
 0  0  71068 18661900  2531724  24218764    0    0    0     0  7312 10050  11  3 86  0  0 2016-09-16 11:02:45
 2  0  71068 18649400  2532228  24229800    0    0    0     0  7211  6222  34  3 63  0  0 2016-09-16 11:02:46
 0  0  71068 18648280  2533440  24230300    0    0    0   108  3936  3381  20  1 79  0  0 2016-09-16 11:02:47
 0  0  71068 18648156  2533212  24230684    0    0    0    12  1279  1681   2  0 97  0  0 2016-09-16 11:02:48


Captu

Re: slow updates/searches

2016-09-16 Thread Rallavagu

Comments in line...

On 9/16/16 10:15 AM, Erick Erickson wrote:

Well, the next thing I'd look at is CPU activity. If you're flooding the system
with updates there'll be CPU contention.


Monitoring does not suggest any high CPU, but as you can see from the vmstat 
output, "user" CPU is a bit high during updates that are taking time (34 
user, 65 idle).




And there are a number of things you can do that make updates in particular
much less efficient, from committing very frequently (sometimes combined
with excessive autowarm parameters) and the like.


softCommit is set to 10 minutes, autowarm count is set to 0 and commit 
is set to 15 sec for NRT.




There are a series of ideas that might trigger an "aha" moment:
https://wiki.apache.org/solr/SolrPerformanceFactors


Reviewed this document and made a few changes accordingly a while ago.


But the crude measure is just to look at CPU usage when updates happen, or
just before. Are you running hot with queries alone, and then adding an update burden?


Essentially, it was the high QTimes for queries that got me looking into the 
logs, the system, etc., and I could correlate the update slowness with the 
search slowness. The other times QTimes go high are right after a softCommit, 
which is expected.

Wondering what causes the update threads to wait and whether it has any 
impact on search at all. I had a couple more CPUs added, but I still see 
similar behavior.


Thanks.



Best,
Erick

On Fri, Sep 16, 2016 at 9:19 AM, Rallavagu <rallav...@gmail.com> wrote:

Erick,

Was monitoring GC activity and couldn't align GC pauses with this behavior.
Also, vmstat shows no swapping or CPU I/O wait. However, whenever I see high
update response times (and correspondingly high QTimes for searches), vmstat
shows a series of "waiting to runnable" processes in the "r" column of the
"procs" section.

https://dl.dropboxusercontent.com/u/39813705/Screen%20Shot%202016-09-16%20at%209.05.51%20AM.png

procs -----------memory----------- ---swap-- -----io---- --system--- -------cpu------- ------timestamp------
 r  b   swpd     free    inact    active   si   so   bi    bo    in    cs  us sy id wa st        CDT
 2  0  71068 18688496  2526604  24204440    0    0    0     0  1433   462  27  1 73  0  0 2016-09-16 11:02:32
 1  0  71068 18688180  2526600  24204568    0    0    0     0  1388   404  26  1 74  0  0 2016-09-16 11:02:33
 1  0  71068 18687928  2526600  24204568    0    0    0     0  1354   401  25  0 75  0  0 2016-09-16 11:02:34
 1  0  71068 18687800  2526600  24204572    0    0    0     0  1311   397  25  0 74  0  0 2016-09-16 11:02:35
 1  0  71068 18687164  2527116  24204844    0    0    0     0  1770   702  31  1 69  0  0 2016-09-16 11:02:36
 1  0  71068 18686944  2527108  24204908    0    0    0    52  1266   421  26  0 74  0  0 2016-09-16 11:02:37
12  1  71068 18682676  2528560  24207116    0    0    0   280  2388   934  34  1 65  0  0 2016-09-16 11:02:38
 2  1  71068 18651340  2530820  24233368    0    0    0  1052 10258  5696  82  5 13  0  0 2016-09-16 11:02:39
 5  0  71068 18648600  2530112  24235060    0    0    0  1988  7261  3644  84  2 13  1  0 2016-09-16 11:02:40
 9  1  71068 18647804  2530580  24236076    0    0    0  1688  7031  3575  84  2 13  1  0 2016-09-16 11:02:41
 1  0  71068 18647628  2530364  24236256    0    0    0   680  7065  4463  61  3 35  1  0 2016-09-16 11:02:42
 1  0  71068 18646344  2531204  24236536    0    0    0    44  6422  4922  35  3 63  0  0 2016-09-16 11:02:43
 2  0  71068 18644460  2532196  24237440    0    0    0     0  6561  5056  25  3 72  0  0 2016-09-16 11:02:44
 0  0  71068 18661900  2531724  24218764    0    0    0     0  7312 10050  11  3 86  0  0 2016-09-16 11:02:45
 2  0  71068 18649400  2532228  24229800    0    0    0     0  7211  6222  34  3 63  0  0 2016-09-16 11:02:46
 0  0  71068 18648280  2533440  24230300    0    0    0   108  3936  3381  20  1 79  0  0 2016-09-16 11:02:47
 0  0  71068 18648156  2533212  24230684    0    0    0    12  1279  1681   2  0 97  0  0 2016-09-16 11:02:48


Captured stack trace including timing for one of the update threads.


org.eclipse.jetty.server.handler.ContextHandler:doHandle (method time = 15 ms, total time = 30782 ms)
 Filter - SolrDispatchFilter:doFilter:181 (method time = 0 ms, total time = 30767 ms)
  Filter - SolrDispatchFilter:doFilter:223 (method time = 0 ms, total time = 30767 ms)
   org.apache.solr.servlet.HttpSolrCall:call:457 (method time = 0 ms, total time = 30767 ms)
    org.apache.solr.servlet.HttpSolrCall:execute:658 (method time = 0 ms, total time = 30767 ms)
     org.apache.solr.core.SolrCore:execute:2073 (method time = 0 ms, total time = 30767 ms)

Re: slow updates/searches

2016-09-16 Thread Rallavagu
rocessAdd:69 (method time = 0 ms, total time = 23426 ms)
 org.apache.solr.update.DirectUpdateHandler2:addDoc:169 (method time = 0 ms, total time = 23426 ms)
  org.apache.solr.update.DirectUpdateHandler2:addDoc0:207 (method time = 0 ms, total time = 23426 ms)
   org.apache.solr.update.DirectUpdateHandler2:doNormalUpdate:275 (method time = 0 ms, total time = 23426 ms)
    org.apache.lucene.index.IndexWriter:updateDocument:1477 (method time = 0 ms, total time = 8551 ms)
     org.apache.lucene.index.DocumentsWriter:updateDocument:450 (method time = 0 ms, total time = 8551 ms)
      org.apache.lucene.index.DocumentsWriterPerThread:updateDocument:234 (method time = 0 ms, total time = 8551 ms)
       org.apache.lucene.index.DefaultIndexingChain:processDocument:300 (method time = 0 ms, total time = 8551 ms)
        org.apache.lucene.index.DefaultIndexingChain:processField:344 (method time = 0 ms, total time = 8551 ms)
         org.apache.lucene.index.DefaultIndexingChain$PerField:invert:613 (method time = 0 ms, total time = 4098 ms)
          org.apache.lucene.analysis.util.FilteringTokenFilter:incrementToken:51 (method time = 0 ms, total time = 4098 ms)
           org.apache.lucene.analysis.synonym.SynonymFilter:incrementToken:627 (method time = 0 ms, total time = 4098 ms)
            org.apache.lucene.analysis.synonym.SynonymFilter:parse:396 (method time = 0 ms, total time = 4098 ms)
             org.apache.lucene.util.fst.FST:findTargetArc:1186 (method time = 0 ms, total time = 4098 ms)
              org.apache.lucene.util.fst.FST:findTargetArc:1270 (method time = 0 ms, total time = 4098 ms)
               org.apache.lucene.util.fst.FST:readFirstRealTargetArc:992 (method time = 0 ms, total time = 4098 ms)
                org.apache.lucene.util.fst.FST:readNextRealArc:1085 (method time = 0 ms, total time = 4098 ms)
                 org.apache.lucene.util.fst.FST:readLabel:636 (method time = 0 ms, total time = 4098 ms)
                  org.apache.lucene.store.DataInput:readVInt:125 (method time = 4098 ms, total time = 4098 ms)
         org.apache.lucene.index.DefaultIndexingChain:getOrAddField:484 (method time = 0 ms, total time = 4453 ms)
          org.apache.lucene.index.FieldInfos$Builder:getOrAdd:317 (method time = 0 ms, total time = 4453 ms)
           org.apache.lucene.index.FieldInfos$FieldNumbers:addOrGet:218 (method time = 4453 ms, total time = 4453 ms)
    org.apache.solr.update.UpdateLog:add:412 (method time = 0 ms, total time = 14875 ms)
     org.apache.solr.update.UpdateLog:add:421 (method time = 14875 ms, total time = 14875 ms)
 org.apache.solr.update.SolrCmdDistributor:distribAdd:207 (method time = 0 ms, total time = 260 ms)
  org.apache.solr.update.SolrCmdDistributor:submit:289 (method time = 0 ms, total time = 260 ms)
   org.apache.solr.update.SolrCmdDistributor:doRequest:296 (method time = 0 ms, total time = 260 ms)
    org.apache.solr.client.solrj.SolrClient:request:1220 (method time = 0 ms, total time = 260 ms)
     org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient:request:382 (method time = 0 ms, total time = 260 ms)
      org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient:addRunner:324 (method time = 0 ms, total time = 260 ms)
       org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor:execute:215 (method time = 0 ms, total time = 260 ms)
        org.apache.solr.common.util.SolrjNamedThreadFactory:newThread:40 (method time = 260 ms, total time = 260 ms)
 org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor:finish:183 (method time = 0 ms, total time = 7030 ms)
  org.apache.solr.update.processor.DistributedUpdateProcessor:finish:1626 (method time = 0 ms, total time = 7030 ms)
   org.apache.solr.update.processor.DistributedUpdateProcessor:doFinish:778 (method time = 0 ms, total time = 7030 ms)
    org.apache.solr.update.SolrCmdDistributor:finish:90 (method time = 0 ms, total time = 7030 ms)
     org.apache.solr.update.SolrCmdDistributor:blockAndDoRetries:232 (method time = 0 ms, total time = 7030 ms)
      org.apache.solr.update.StreamingSolrClients:blockUntilFinished:107 (method time = 0 ms, total time = 7030 ms)
       org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient:blockUntilFinished:429 (method time = 0 ms, total time = 7030 ms)
        java.lang.Object:wait (method time = 7030 ms, total time = 7030 ms)


https://dl.dropboxusercontent.com/u/39813705/Screen%20Shot%202016-09-16%20at%209.18.52%20AM.png

It appears there could be threads waiting, but I am not sure how this would 
impact searching.


Thanks

On 9/16/16 8:42 AM, Erick Erickson wrote:

First thing I'd look at is whether you're _also_ seeing stop-the-world GC pauses.
In that case there are a number of JVM options that can be tuned.

Best,
Erick

On Fri, Sep 16, 2016 at 8:40 AM, Rallavagu <rallav...@gmail.com> wrote:

Solr 5.4.1 with embedded jetty single shard - NRT

Looking in the logs, I noticed that there are high QTimes for queries and,
around the same time, high response times for updates. These ar

slow updates/searches

2016-09-16 Thread Rallavagu

Solr 5.4.1 with embedded jetty single shard - NRT

Looking in the logs, I noticed that there are high QTimes for queries and, 
around the same time, high response times for updates. These are not during a 
"commit" or "softCommit", but while the client application is sending updates. 
Wondering how updates could impact query performance. What are the options 
for tuning? Thanks.


Re: How to enable JMX to monitor Jetty

2016-09-12 Thread Rallavagu
I have modified modules/http.mod as follows (for Solr 5.4.1, Jetty 9). 
As you can see, I have referenced jetty-jmx.xml.


#
# Jetty HTTP Connector
#

[depend]
server

[xml]
etc/jetty-http.xml
etc/jetty-jmx.xml



On 5/21/16 3:59 AM, Georg Sorst wrote:

Hi list,

how do I correctly enable JMX in Solr 6 so that I can monitor Jetty's
thread pool?

The first step is to set ENABLE_REMOTE_JMX_OPTS="true" in bin/solr.in.sh.
This will give me JMX access to JVM properties (garbage collection, class
loading etc.) and works fine. However, this will not give me any Jetty
specific properties.

I've tried manually adding jetty-jmx.xml from the jetty 9 distribution to
server/etc/ and then starting Solr with 'java ... start.jar
etc/jetty-jmx.xml'. This works fine and gives me access to the right
properties, but seems wrong. I could similarly copy the contents of
jetty-jmx.xml into jetty.xml but this is not much better either.

Is there a correct way for this?

Thanks!
Georg



Re: ConcurrentUpdateSolrClient threads

2016-09-12 Thread Rallavagu

Any takers?

On 9/9/16 9:03 AM, Rallavagu wrote:

All,

Running Solr 5.4.1 with embedded Jetty, with frequent updates coming in and
softCommit set to 10 min. What I am noticing is occasional "slow" updates
(taking 8 to 15 seconds sometimes) and, at about the same time, slow QTimes.
Upon investigating, it appears that
"ConcurrentUpdateSolrClient:blockUntilFinished:429" is waiting for a thread to
become free. Looking at https://issues.apache.org/jira/browse/SOLR-8500, it
appears to offer an option to increase the number of threads, which might
help with handling more updates without having to wait (though that needs an
update of Solr to 5.5). I could not figure out the default number of threads
for the ConcurrentUpdateSolrClient class. Before I try increasing the number
of threads, I am wondering if there are any "gotchas" in doing so, and what a
reasonable number of threads would be.


org.apache.solr.update.SolrCmdDistributor:finish:90 (method time = 0 ms,
total time = 7489 ms)
 org.apache.solr.update.SolrCmdDistributor:blockAndDoRetries:232 (method
time = 0 ms, total time = 7489 ms)
  org.apache.solr.update.StreamingSolrClients:blockUntilFinished:107
(method time = 0 ms, total time = 7489 ms)

org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient:blockUntilFinished:429
(method time = 0 ms, total time = 7489 ms)
java.lang.Object:wait (method time = 7489 ms, total time = 7489 ms)


Thanks in advance


ConcurrentUpdateSolrClient threads

2016-09-09 Thread Rallavagu

All,

Running Solr 5.4.1 with embedded Jetty, with frequent updates coming in and 
softCommit set to 10 min. What I am noticing is occasional "slow" updates 
(taking 8 to 15 seconds sometimes) and, at about the same time, slow QTimes. 
Upon investigating, it appears that 
"ConcurrentUpdateSolrClient:blockUntilFinished:429" is waiting for a thread 
to become free. Looking at https://issues.apache.org/jira/browse/SOLR-8500, 
it appears to offer an option to increase the number of threads, which might 
help with handling more updates without having to wait (though that needs an 
update of Solr to 5.5). I could not figure out the default number of threads 
for the ConcurrentUpdateSolrClient class. Before I try increasing the number 
of threads, I am wondering if there are any "gotchas" in doing so, and what 
a reasonable number of threads would be.



org.apache.solr.update.SolrCmdDistributor:finish:90 (method time = 0 ms, total time = 7489 ms)
 org.apache.solr.update.SolrCmdDistributor:blockAndDoRetries:232 (method time = 0 ms, total time = 7489 ms)
  org.apache.solr.update.StreamingSolrClients:blockUntilFinished:107 (method time = 0 ms, total time = 7489 ms)
   org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient:blockUntilFinished:429 (method time = 0 ms, total time = 7489 ms)
    java.lang.Object:wait (method time = 7489 ms, total time = 7489 ms)


Thanks in advance
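
As a side note, on the SolrJ client side the runner thread count is an
explicit constructor argument; a minimal sketch (the URL, queue size and
thread count below are illustrative values only, not a recommendation):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentUpdateExample {
  public static void main(String[] args) throws Exception {
    // buffer up to 1000 documents and drain the queue with 4 runner threads
    try (ConcurrentUpdateSolrClient client =
             new ConcurrentUpdateSolrClient("http://localhost:8983/solr/collection1", 1000, 4)) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "example-1");
      client.add(doc);
      client.blockUntilFinished(); // wait until queued updates have been sent
      client.commit();
    }
  }
}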


Re: Default Field Cache

2016-09-01 Thread Rallavagu

Yes. Thanks.

On 9/1/16 4:53 AM, Alessandro Benedetti wrote:

Are you looking for this ?

org/apache/solr/core/SolrConfig.java:243

CacheConfig conf = CacheConfig.getConfig(this, "query/fieldValueCache");
if (conf == null) {
  Map<String, String> args = new HashMap<>();
  args.put(NAME, "fieldValueCache");
  args.put("size", "1");
  args.put("initialSize", "10");
  args.put("showItems", "-1");
  conf = new CacheConfig(FastLRUCache.class, args, null);
}
fieldValueCacheConfig = conf;
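
Expressed as solrconfig.xml, those defaults would correspond to roughly the
following (a sketch derived from the code above, not copied from any shipped
config):

<fieldValueCache class="solr.FastLRUCache"
                 size="10000"
                 initialSize="10"
                 showItems="-1"/>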


Cheers


On Thu, Sep 1, 2016 at 2:41 AM, Rallavagu <rallav...@gmail.com> wrote:


But the configuration is commented out (disabled). As the comments section
mentions:

"The fieldValueCache is created by default even if not configured here"

I would like to know what the configuration of the fieldValueCache created
by default would be.


On 8/31/16 6:37 PM, Zheng Lin Edwin Yeo wrote:


If I didn't get your question wrong, what you have listed is already the
default configuration that comes with your version of Solr.

Regards,
Edwin

On 30 August 2016 at 07:49, Rallavagu <rallav...@gmail.com> wrote:

Solr 5.4.1





Wondering what is the default configuration for "fieldValueCache".









Re: Default Field Cache

2016-08-31 Thread Rallavagu
But the configuration is commented out (disabled). As the comments section 
mentions:

"The fieldValueCache is created by default even if not configured here"

I would like to know what the configuration of the fieldValueCache created 
by default would be.


On 8/31/16 6:37 PM, Zheng Lin Edwin Yeo wrote:

If I didn't get your question wrong, what you have listed is already the
default configuration that comes with your version of Solr.

Regards,
Edwin

On 30 August 2016 at 07:49, Rallavagu <rallav...@gmail.com> wrote:


Solr 5.4.1




Wondering what is the default configuration for "fieldValueCache".





Default Field Cache

2016-08-29 Thread Rallavagu

Solr 5.4.1




Wondering what is the default configuration for "fieldValueCache".


Re: Solr embedded jetty jstack

2016-08-29 Thread Rallavagu

Responding to my own query.

I got this fixed. The Solr startup was managed by a systemd script that was 
configured with "PrivateTmp=true". I changed that to "PrivateTmp=false", so 
"/tmp/hsperfdata_<user>/" is no longer removed after server startup, and 
jstack now works.
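
For anyone hitting the same issue, a sketch of applying that override as a
systemd drop-in instead of editing the unit file directly (the unit name
"solr.service" is an assumption; adjust it to your installation):

# /etc/systemd/system/solr.service.d/override.conf
[Service]
PrivateTmp=false

followed by a "systemctl daemon-reload" and a restart of the service.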


On 8/29/16 11:31 AM, Rallavagu wrote:

I have run into a strange issue where "jstack -l <pid>" does not work. I
have tried this as the user that Solr (5.4.1) is running as. I get the
following error.

$ jstack -l 24064
24064: Unable to open socket file: target process not responding or
HotSpot VM not loaded
The -F option can be used when the target process is not responding

I am running Solr 5.4.1, JDK 8 with the latest updates.

I have also downloaded Jetty separately, installed it and started the
server. However, jstack on the directly downloaded jetty (not solr bundled)
works just fine. After some research, I have found that the
/tmp/hsperfdata_<user>/<pid> file is not created by the bundled Solr,
while a similar file is created by the standalone jetty server. After some
more debugging, it appears that the solr startup process creates the
file (/tmp/hsperfdata_<user>/<pid>) and then removes it. I have tried
the "-F" option but it did not help. I have also set "-XX:+UsePerfData"
explicitly, to no avail. I have enabled JMX and connected via visualvm to
get thread dumps for now. But, for me, jstack is more convenient for
triggering a series of thread dumps. Any ideas? Thanks.


Solr embedded jetty jstack

2016-08-29 Thread Rallavagu
I have run into a strange issue where "jstack -l <pid>" does not work. I 
have tried this as the user that Solr (5.4.1) is running as. I get the 
following error.

$ jstack -l 24064
24064: Unable to open socket file: target process not responding or 
HotSpot VM not loaded

The -F option can be used when the target process is not responding

I am running Solr 5.4.1, JDK 8 with the latest updates.

I have also downloaded Jetty separately, installed it and started the 
server. However, jstack on the directly downloaded jetty (not solr bundled) 
works just fine. After some research, I have found that the 
/tmp/hsperfdata_<user>/<pid> file is not created by the bundled Solr, 
while a similar file is created by the standalone jetty server. After some 
more debugging, it appears that the solr startup process creates the 
file (/tmp/hsperfdata_<user>/<pid>) and then removes it. I have tried 
the "-F" option but it did not help. I have also set "-XX:+UsePerfData" 
explicitly, to no avail. I have enabled JMX and connected via visualvm to 
get thread dumps for now. But, for me, jstack is more convenient for 
triggering a series of thread dumps. Any ideas? Thanks.


Re: solr.NRTCachingDirectoryFactory

2016-08-26 Thread Rallavagu

Thanks Mikhail.

I am unable to locate the bottleneck so far. Will try jstack and other tools.

On 8/25/16 11:40 PM, Mikhail Khludnev wrote:

Rough sampling under load makes sense, as usual. JMC is one of the suitable
tools for this. Sometimes even just jstack <pid>, or looking at
SolrAdmin/Threads, is enough. If only a small ratio of documents is updated
and the bottleneck is the filterCache, you can experiment with segmented
filters, which suit NRT better.
http://blog-archive.griddynamics.com/2014/01/segmented-filter-cache-in-solr.html


On Fri, Aug 26, 2016 at 2:56 AM, Rallavagu <rallav...@gmail.com> wrote:


Follow up update ...

Set the autowarm count to zero for the caches for NRT, and I could negotiate
the latency from 2 min to 5 min :)

However, I am still seeing high QTimes and wondering where else I can look.
Should I debug the code or run some tools to isolate the bottleneck (disk I/O,
CPU, or the query itself)? Looking for some tuning advice. Thanks.


On 7/26/16 9:42 AM, Erick Erickson wrote:


And, I might add, you should look through your old logs
and see how long it takes to open a searcher. Let's
say Shawn's lower bound is what you see, i.e.
it takes a minute each to execute all the autowarming
in filterCache and queryResultCache... So your current
latency is _at least_ 2 minutes between the time something
is indexed and it's available for search, just for autowarming.

Plus up to another 2 minutes for your soft commit interval
to expire.

So if your business people haven't noticed a 4 minute
latency yet, tell them they don't know what they're talking
about when they insist on the NRT interval being a few
seconds ;).

Best,
Erick

On Tue, Jul 26, 2016 at 7:20 AM, Rallavagu <rallav...@gmail.com> wrote:




On 7/26/16 5:46 AM, Shawn Heisey wrote:



On 7/22/2016 10:15 AM, Rallavagu wrote:








 size="2"
 initialSize="2"
 autowarmCount="500"/>




As Erick indicated, these settings are incompatible with Near Real Time
updates.

With those settings, every time you commit and create a new searcher,
Solr will execute up to 1000 queries (potentially 500 for each of the
caches above) before that new searcher will begin returning new results.

I do not know how fast your filter queries execute when they aren't
cached... but even if they only take 100 milliseconds each, that could
take up to a minute for filterCache warming.  If each one takes two
seconds and there are 500 entries in the cache, then autowarming the
filterCache would take nearly 17 minutes. You would also need to wait
for the warming queries on queryResultCache.

The autowarmCount on my filterCache is 4, and warming that cache *still*
sometimes takes ten or more seconds to complete.

If you want true NRT, you need to set all your autowarmCount values to
zero.  The tradeoff with NRT is that your caches are ineffective
immediately after a new searcher is created.



Will look into this and make changes as suggested.



Looking at the "top" screenshot ... you have plenty of memory to cache
the entire index.  Unless your queries are extreme, this is usually
enough for good performance.

One possible problem is that cache warming is taking far longer than
your autoSoftCommit interval, and the server is constantly busy making
thousands of warming queries.  Reducing autowarmCount, possibly to zero,
*might* fix that. I would expect higher CPU load than what your
screenshot shows if this were happening, but it still might be the
problem.



Great point. Thanks for the help.



Thanks,
Shawn









Re: solr.NRTCachingDirectoryFactory

2016-08-25 Thread Rallavagu

Follow up update ...

Set the autowarm count to zero for the caches for NRT, and I could negotiate 
the latency from 2 min to 5 min :)

However, I am still seeing high QTimes and wondering where else I can look. 
Should I debug the code or run some tools to isolate the bottleneck (disk 
I/O, CPU, or the query itself)? Looking for some tuning advice. Thanks.



On 7/26/16 9:42 AM, Erick Erickson wrote:

And, I might add, you should look through your old logs
and see how long it takes to open a searcher. Let's
say Shawn's lower bound is what you see, i.e.
it takes a minute each to execute all the autowarming
in filterCache and queryResultCache... So your current
latency is _at least_ 2 minutes between the time something
is indexed and it's available for search, just for autowarming.

Plus up to another 2 minutes for your soft commit interval
to expire.

So if your business people haven't noticed a 4 minute
latency yet, tell them they don't know what they're talking
about when they insist on the NRT interval being a few
seconds ;).

Best,
Erick

On Tue, Jul 26, 2016 at 7:20 AM, Rallavagu <rallav...@gmail.com> wrote:



On 7/26/16 5:46 AM, Shawn Heisey wrote:


On 7/22/2016 10:15 AM, Rallavagu wrote:











As Erick indicated, these settings are incompatible with Near Real Time
updates.

With those settings, every time you commit and create a new searcher,
Solr will execute up to 1000 queries (potentially 500 for each of the
caches above) before that new searcher will begin returning new results.

I do not know how fast your filter queries execute when they aren't
cached... but even if they only take 100 milliseconds each, that could
take up to a minute for filterCache warming.  If each one takes two
seconds and there are 500 entries in the cache, then autowarming the
filterCache would take nearly 17 minutes. You would also need to wait
for the warming queries on queryResultCache.

The autowarmCount on my filterCache is 4, and warming that cache *still*
sometimes takes ten or more seconds to complete.

If you want true NRT, you need to set all your autowarmCount values to
zero.  The tradeoff with NRT is that your caches are ineffective
immediately after a new searcher is created.


Will look into this and make changes as suggested.



Looking at the "top" screenshot ... you have plenty of memory to cache
the entire index.  Unless your queries are extreme, this is usually
enough for good performance.

One possible problem is that cache warming is taking far longer than
your autoSoftCommit interval, and the server is constantly busy making
thousands of warming queries.  Reducing autowarmCount, possibly to zero,
*might* fix that. I would expect higher CPU load than what your
screenshot shows if this were happening, but it still might be the
problem.


Great point. Thanks for the help.



Thanks,
Shawn





Re: solr.NRTCachingDirectoryFactory

2016-07-26 Thread Rallavagu



On 7/26/16 5:46 AM, Shawn Heisey wrote:

On 7/22/2016 10:15 AM, Rallavagu wrote:









As Erick indicated, these settings are incompatible with Near Real Time
updates.

With those settings, every time you commit and create a new searcher,
Solr will execute up to 1000 queries (potentially 500 for each of the
caches above) before that new searcher will begin returning new results.

I do not know how fast your filter queries execute when they aren't
cached... but even if they only take 100 milliseconds each, that could
take up to a minute for filterCache warming.  If each one takes two
seconds and there are 500 entries in the cache, then autowarming the
filterCache would take nearly 17 minutes. You would also need to wait
for the warming queries on queryResultCache.

The autowarmCount on my filterCache is 4, and warming that cache *still*
sometimes takes ten or more seconds to complete.

If you want true NRT, you need to set all your autowarmCount values to
zero.  The tradeoff with NRT is that your caches are ineffective
immediately after a new searcher is created.

Will look into this and make changes as suggested.
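
For reference, in solrconfig.xml that boils down to something like the
following sketch; the cache classes and sizes are illustrative, the point is
the autowarmCount="0":

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>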



Looking at the "top" screenshot ... you have plenty of memory to cache
the entire index.  Unless your queries are extreme, this is usually
enough for good performance.

One possible problem is that cache warming is taking far longer than
your autoSoftCommit interval, and the server is constantly busy making
thousands of warming queries.  Reducing autowarmCount, possibly to zero,
*might* fix that. I would expect higher CPU load than what your
screenshot shows if this were happening, but it still might be the problem.

Great point. Thanks for the help.



Thanks,
Shawn



Re: solr.NRTCachingDirectoryFactory

2016-07-22 Thread Rallavagu



On 7/22/16 9:56 AM, Erick Erickson wrote:

OK, scratch autowarming. In fact your autowarm counts
are quite high, I suspect far past "diminishing returns".
I usually see autowarm counts < 64, but YMMV.

Are you seeing actual hit ratios that are decent on
those caches (admin UI>>plugins/stats>>cache>>...)
And your cache sizes are also quite high in my experience,
it's probably worth measuring the utilization there as well.
And, BTW, your filterCache can occupy up to 2G of your heap.
That's probably not your central problem, but it's something
to consider.

Will look into it.


So I don't know why your queries are taking that long; my
assumption is that they may simply be very complex queries,
or that you have grouping on.

Queries are a bit complex for sure.


I guess the next thing I'd do is start trying to characterize
what queries are slow. Grouping? Pivot Faceting? 'cause
from everything you've said so far it's surprising that you're
seeing queries take this long, something doesn't feel right
but what it is I don't have a clue.


Thanks



Best,
Erick

On Fri, Jul 22, 2016 at 9:15 AM, Rallavagu <rallav...@gmail.com> wrote:



On 7/22/16 8:34 AM, Erick Erickson wrote:


Mostly this sounds like a problem that could be cured with
autowarming. But two things are conflicting here:
1> you say "We have a requirement to have updates available immediately
(NRT)"
2> your docs aren't available for 120 seconds given your autoSoftCommit
settings unless you're specifying
-Dsolr.autoSoftCommit.maxTime=some_other_interval
as a startup parameter.


Yes. We have 120 seconds available.


So assuming you really do have a 120 second autocommit time, you should be
able to smooth out the spikes by appropriate autowarming. You also haven't
indicated what your filterCache and queryResultCache settings are. They
come with a default of 0 for autowarm. But what is their size? And do you
see a correlation between longer queries and the 2 minute intervals? And
do you have some test harness in place (jmeter works well) to demonstrate
whether differences in your configuration help or hurt? I can't
over-emphasize the importance of this; otherwise, if you rely on somebody
simply saying "it's slow", you have no way to know what effect changes have.



Here is the cache configuration.









We have run load tests using JMeter pointing directly at Solr, and also tests
pointing at the application that queries Solr. In both cases, we have noticed
the slow results.

Thanks



Best,
Erick


On Thu, Jul 21, 2016 at 11:22 PM, Shawn Heisey <apa...@elyograg.org>
wrote:


On 7/21/2016 11:25 PM, Rallavagu wrote:


There is no other software running on the system and it is completely
dedicated to Solr. It is running on Linux. Here is the full version.

Linux version 3.8.13-55.1.6.el7uek.x86_64
(mockbu...@ca-build56.us.oracle.com) (gcc version 4.8.3 20140911 (Red
Hat 4.8.3-9) (GCC) ) #2 SMP Wed Feb 11 14:18:22 PST 2015



Run the top program, press shift-M to sort by memory usage, and then
grab a screenshot of the terminal window.  Share it with a site like
dropbox, imgur, or something similar, and send the URL.  You'll end up
with something like this:

https://www.dropbox.com/s/zlvpvd0rrr14yit/linux-solr-top.png?dl=0

If you know what to look for, you can figure out all the relevant memory
details from that.

Thanks,
Shawn





Re: solr.NRTCachingDirectoryFactory

2016-07-22 Thread Rallavagu

Also, here is the link to screenshot.

https://dl.dropboxusercontent.com/u/39813705/Screen%20Shot%202016-07-22%20at%2010.40.21%20AM.png

Thanks

On 7/21/16 11:22 PM, Shawn Heisey wrote:

On 7/21/2016 11:25 PM, Rallavagu wrote:

There is no other software running on the system and it is completely
dedicated to Solr. It is running on Linux. Here is the full version.

Linux version 3.8.13-55.1.6.el7uek.x86_64
(mockbu...@ca-build56.us.oracle.com) (gcc version 4.8.3 20140911 (Red
Hat 4.8.3-9) (GCC) ) #2 SMP Wed Feb 11 14:18:22 PST 2015


Run the top program, press shift-M to sort by memory usage, and then
grab a screenshot of the terminal window.  Share it with a site like
dropbox, imgur, or something similar, and send the URL.  You'll end up
with something like this:

https://www.dropbox.com/s/zlvpvd0rrr14yit/linux-solr-top.png?dl=0

If you know what to look for, you can figure out all the relevant memory
details from that.

Thanks,
Shawn



Re: solr.NRTCachingDirectoryFactory

2016-07-22 Thread Rallavagu
Here is the snapshot of memory usage from "top" as you mentioned. The first 
row is the "solr" process. Thanks.


  PID USER    PR  NI    VIRT     RES    SHR S  %CPU %MEM     TIME+ COMMAND
29468 solr    20   0 27.536g  0.013t 3.297g S  45.7 27.6   4251:45 java
21366 root    20   0 14.499g  217824  12952 S   1.0  0.4 192:11.54 java
 2077 root    20   0 14.049g  190824   9980 S   0.7  0.4  62:44.00 java
  511 root    20   0  125792   56848  56616 S   0.0  0.1   9:33.23 systemd-journal
  316 splunk  20   0  232056   44284  11804 S   0.7  0.1  84:52.74 splunkd
 1045 root    20   0  257680   39956   6836 S   0.3  0.1   7:05.78 puppet
32631 root    20   0  360956   39292   4788 S   0.0  0.1   4:55.37 mcollectived
  703 root    20   0  250372    9000    976 S   0.0  0.0   1:35.52 rsyslogd
 1058 nslcd   20   0  454192    6004   2996 S   0.0  0.0  15:08.87 nslcd

On 7/21/16 11:22 PM, Shawn Heisey wrote:

On 7/21/2016 11:25 PM, Rallavagu wrote:

There is no other software running on the system and it is completely
dedicated to Solr. It is running on Linux. Here is the full version.

Linux version 3.8.13-55.1.6.el7uek.x86_64
(mockbu...@ca-build56.us.oracle.com) (gcc version 4.8.3 20140911 (Red
Hat 4.8.3-9) (GCC) ) #2 SMP Wed Feb 11 14:18:22 PST 2015


Run the top program, press shift-M to sort by memory usage, and then
grab a screenshot of the terminal window.  Share it with a site like
dropbox, imgur, or something similar, and send the URL.  You'll end up
with something like this:

https://www.dropbox.com/s/zlvpvd0rrr14yit/linux-solr-top.png?dl=0

If you know what to look for, you can figure out all the relevant memory
details from that.

Thanks,
Shawn



Re: solr.NRTCachingDirectoryFactory

2016-07-22 Thread Rallavagu



On 7/22/16 8:34 AM, Erick Erickson wrote:

Mostly this sounds like a problem that could be cured with
autowarming. But two things are conflicting here:
1> you say "We have a requirement to have updates available immediately (NRT)"
2> your docs aren't available for 120 seconds given your autoSoftCommit
settings unless you're specifying
-Dsolr.autoSoftCommit.maxTime=some_other_interval
as a startup parameter.


Yes. We have 120 seconds available.


So assuming you really do have a 120 second autocommit time, you should be
able to smooth out the spikes by appropriate autowarming. You also haven't
indicated what your filterCache and queryResultCache settings are. They
come with a default of 0 for autowarm. But what is their size? And do you
see a correlation between longer queries and the 2 minute intervals? And
do you have some test harness in place (jmeter works well) to demonstrate
whether differences in your configuration help or hurt? I can't
over-emphasize the importance of this; otherwise, if you rely on somebody
simply saying "it's slow", you have no way to know what effect changes have.


Here is the cache configuration.









We have run load tests using JMeter pointing directly at Solr, and also 
tests pointing at the application that queries Solr. In both cases, we have 
noticed the slow results.


Thanks



Best,
Erick


On Thu, Jul 21, 2016 at 11:22 PM, Shawn Heisey <apa...@elyograg.org> wrote:

On 7/21/2016 11:25 PM, Rallavagu wrote:

There is no other software running on the system and it is completely
dedicated to Solr. It is running on Linux. Here is the full version.

Linux version 3.8.13-55.1.6.el7uek.x86_64
(mockbu...@ca-build56.us.oracle.com) (gcc version 4.8.3 20140911 (Red
Hat 4.8.3-9) (GCC) ) #2 SMP Wed Feb 11 14:18:22 PST 2015


Run the top program, press shift-M to sort by memory usage, and then
grab a screenshot of the terminal window.  Share it with a site like
dropbox, imgur, or something similar, and send the URL.  You'll end up
with something like this:

https://www.dropbox.com/s/zlvpvd0rrr14yit/linux-solr-top.png?dl=0

If you know what to look for, you can figure out all the relevant memory
details from that.

Thanks,
Shawn



Re: solr.NRTCachingDirectoryFactory

2016-07-21 Thread Rallavagu



On 7/21/16 9:16 PM, Shawn Heisey wrote:

On 7/21/2016 9:37 AM, Rallavagu wrote:

I suspect swapping as well. But, for my understanding - are the index
files on disk memory-mapped automatically at startup time?


They are *mapped* at startup time, but they are not *read* at startup.
The mapping just sets up a virtual address space for the entire file,
but until something actually reads the data from the disk, it will not
be in memory.  Getting the data in memory is what makes mmap fast.

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html


We are not performing "commit" after every update and here is the
configuration for softCommit and hardCommit.


<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:120000}</maxTime>
</autoSoftCommit>


I am seeing QTimes (for searches) swing between 2 and 10 seconds. Some
queries were showing slowness due to faceting (debug=true). Since we
adjusted the indexing, facet times have improved, but the basic query QTime
is still high, so I am wondering where I can look. Is there a way to debug
(instrument) a query on a Solr node?


Assuming you have not defined the maxTime system properties mentioned in
those configs, that config means you will potentially be creating a new
searcher every two minutes ... but if you are sending explicit commits
or using commitWithin on your updates, then the true situation may be
very different than what's configured here.


We have allocated significant amount of RAM (48G total
physical memory, 12G heap, Total index disk size is 15G)


Assuming there's no other software on the system besides the one
instance of Solr with a 12GB heap, this would mean that you have enough
room to cache the entire index.  What OS are you running on? With that
information, I may be able to relay some instructions that will help
determine what the complete memory situation is on your server.


There is no other software running on the system and it is completely 
dedicated to Solr. It is running on Linux. Here is the full version.


Linux version 3.8.13-55.1.6.el7uek.x86_64 
(mockbu...@ca-build56.us.oracle.com) (gcc version 4.8.3 20140911 (Red 
Hat 4.8.3-9) (GCC) ) #2 SMP Wed Feb 11 14:18:22 PST 2015


Thanks



Thanks,
Shawn



Re: solr.NRTCachingDirectoryFactory

2016-07-21 Thread Rallavagu

Thanks Erick.

On 7/21/16 8:25 AM, Erick Erickson wrote:

bq: map index files so "reading from disk" will be as simple and quick
as reading from memory hence would not incur any significant
performance degradation.

Well, if
1> the read has already been done. First time a page of the file is
accessed, it must be read from disk.
2> You have enough physical memory that _all_ of the files can be held
in memory at once.

<2> is a little tricky since the big slowdown comes from swapping
eventually. But in an LRU scheme, that may be OK if the oldest pages
are the stored=true data which are only accessed to return the top N,
not to satisfy the search.
I suspect swapping as well. But, for my understanding - are the index 
files from disk memory mapped automatically at the startup time?


What are your QTimes anyway? Define "optimal"

I'd really push back on this statement: "We have a requirement to have
updates available immediately (NRT)". Truly? You can't set
expectations that 5 seconds will be needed (or 10?). Often this is an
artificial requirement that does no real service to the user, it's
just something people think they want. If this means you're sending a
commit after every document, it's actually a really bad practice
that'll get you into trouble eventually. Plus you won't be able to do
any autowarming which will read data from disk into the OS memory and
smooth out any spikes
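
(For illustration only: rather than an explicit commit per document, an update message can carry a commitWithin hint and let Solr fold it into a scheduled commit. The document id and the 120000 ms window below are arbitrary examples, not recommendations for this setup.)

<add commitWithin="120000">
  <doc>
    <field name="id">example-doc-1</field>
  </doc>
</add>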


We are not performing "commit" after every update and here is the 
configuration for softCommit and hardCommit.



<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:12}</maxTime>
</autoSoftCommit>


I am seeing QTimes (for searches) swing between 2 and 10 seconds. Some 
queries were showing slowness caused by faceting (per debug=true). Since 
we have adjusted indexing, facet times have improved, but basic query 
QTime is still high, so I am wondering where I can look. Is there a way 
to debug (instrument) a query on a Solr node?




FWIW,
Erick

On Thu, Jul 21, 2016 at 8:18 AM, Rallavagu <rallav...@gmail.com> wrote:

Solr 5.4.1 with embedded jetty with cloud enabled

We have a Solr deployment (approximately 3 million documents) with both
write and search operations happening. We have a requirement to have updates
available immediately (NRT). Configured with default
"solr.NRTCachingDirectoryFactory" for directory factory. Considering the
fact that every time there is an update, caches are invalidated and re-built
I assume that "solr.NRTCachingDirectoryFactory" would memory map index files
so "reading from disk" will be as simple and quick as reading from memory
hence would not incur any significant performance degradation. Am I right in
my assumption? We have allocated significant amount of RAM (48G total
physical memory, 12G heap, Total index disk size is 15G) but not sure if I
am seeing the optimal QTimes (for searches). Any inputs are welcome. Thanks
in advance.


solr.NRTCachingDirectoryFactory

2016-07-21 Thread Rallavagu

Solr 5.4.1 with embedded jetty with cloud enabled

We have a Solr deployment (approximately 3 million documents) with both 
write and search operations happening. We have a requirement to have 
updates available immediately (NRT). Configured with default 
"solr.NRTCachingDirectoryFactory" for directory factory. Considering the 
fact that every time there is an update, caches are invalidated and 
re-built I assume that "solr.NRTCachingDirectoryFactory" would memory 
map index files so "reading from disk" will be as simple and quick as 
reading from memory hence would not incur any significant performance 
degradation. Am I right in my assumption? We have allocated significant 
amount of RAM (48G total physical memory, 12G heap, Total index disk 
size is 15G) but not sure if I am seeing the optimal QTimes (for 
searches). Any inputs are welcome. Thanks in advance.


Re: Document Cache

2016-03-19 Thread Rallavagu

comments in line...

On 3/17/16 2:16 PM, Erick Erickson wrote:

First, I want to make sure when you say "TTL", you're talking about
documents being evicted from the documentCache and not the "Time To Live"
option whereby documents are removed completely from the index.


Maybe TTL was not the right word to use here. I wanted to learn the 
criteria for an entry to be evicted.




The time varies with the number of new documents fetched. This is an LRU
cache whose size is configured in solrconfig.xml. It's pretty much
unpredictable. If for some odd reason every request gets the same document
it'll never be aged out. If no two queries return the same document, it
will be aged out once "cache size" docs have been fetched by subsequent requests.

The entire thing is thrown out whenever a new searcher is opened (i.e.
softCommit or hardCommit with openSearcher=true)




But maybe this is an XY problem. Why do you care? Is there something you're
seeing that you're trying to understand or is this just a general interest
question?

I have the following configuration:

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:12}</maxTime>
</autoSoftCommit>

As you can see, openSearcher is set to "false". What I am seeing (from a 
heap dump taken after an OutOfMemory error) is that the LRUCache backing 
the "Document Cache" occupies around 85% of the available heap, and that is 
causing OOM errors. So I am trying to understand the behavior to address the 
OOM issues.


Thanks



Best,
Erick

On Thu, Mar 17, 2016 at 1:40 PM, Rallavagu <rallav...@gmail.com> wrote:


Solr 5.4 embedded Jetty

Is it the right assumption that whenever a document is returned as a
response to a query, it is cached in the "Document Cache"?

Essentially, if I request for any entry like /select?q=id:
will it be cached in "Document Cache"? If yes, what is the TTL?

Thanks in advance





Solr5 Optimize

2016-03-19 Thread Rallavagu

All,

Solr 5.4 with embedded Jetty (4G heap)

Trying to understand behavior of "optimize" operation if not run 
explicitly. What is the frequency at which this operation is run, what 
are the storage requirements and how do we schedule it? Any 
comments/pointers would greatly help.


Thanks in advance


Re: Solr5 Optimize

2016-03-19 Thread Rallavagu

Thanks Erick. This helps.

On 3/16/16 10:11 AM, Erick Erickson wrote:

First of all, "optimize-like" does _not_ happen
"every time a commit happens". What _does_ happen
is the current state of the index is examined and if
certain conditions are met _then_ segment
merges happen. Think of these as "partial optimizes".

This is under control of the TieredMergePolicy by
default.

There are limits placed on the number of simultaneous
merges that can happen, and they're all done in
background threads so you should see lots of I/O,
but the priority of those threads is low so it shouldn't
have  much impact on query perf.

It's theoretically possible that the background merge
will merge down to one segment, so you still need at
least as much free space on your disk as your index occupies.

Best,
Erick


On Wed, Mar 16, 2016 at 10:07 AM, Rallavagu <rallav...@gmail.com> wrote:

Erick, Thanks for the response. Comments in line...

On 3/16/16 9:56 AM, Erick Erickson wrote:


In general, don't bother with optimize unless the index is quite static,
i.e. there are very few adds/updates or those updates are done in
batches and rarely (i.e. once a day or less frequently).

As far as space, this will require that you have at _least_ as much
free space on your disks as your index occupies. Shouldn't require
much in the way of RAM though.

Optimize, also referred to as "Force Merge" will merge all the segments
down to one, and in the process reclaim data from deleted (or updated)
documents.
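
(For reference, an explicit optimize can be triggered through the update handler, for example with a request along these lines; the host, port and collection name are placeholders:)

http://localhost:8983/solr/collection1/update?optimize=true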

The thing is, this is also accomplished by "background merging" which
happens automatically. Every time you do a hard commit, Lucene
figures out if any segments need to be merged and does that automatically.
During that process, any information associated with deleted docs is
reclaimed.


If "optimize" like operation happening automatically every time a hard
commit happens, with following settings (15 seconds for hard commit) what
would be impact on performance particularly on disk space?


${solr.autoCommit.maxTime:15000}
false
  

Thanks.



The third video down here:

http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
is Mike's visualization of the automatic merging process.

Best,
Erick

On Wed, Mar 16, 2016 at 9:40 AM, Rallavagu <rallav...@gmail.com> wrote:


All,

Solr 5.4 with embedded Jetty (4G heap)

Trying to understand behavior of "optimize" operation if not run
explicitly.
What is the frequency at which this operation is run, what are the
storage
requirements and how do we schedule it? Any comments/pointers would
greatly
help.

Thanks in advance


Re: Document Cache

2016-03-19 Thread Rallavagu



On 3/18/16 9:27 AM, Emir Arnautovic wrote:

Running single query that returns all docs and all fields will actually
load as many document as queryResultWindowSize is.
What you need to do is run multiple queries that will return different
documents. In case your id is numeric, you can run something like id:[1
TO 100] and then id:[100 TO 200] etc. Make sure that it is done within
those two minute period if there is any indexing activities.
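
(For illustration, such warm-up requests could look like the following; the field name and ranges are just examples, and the spaces would be URL-encoded in a real request:)

/select?q=id:[1 TO 100]&rows=100
/select?q=id:[100 TO 200]&rows=100
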
Would the existing cache be cleared while an active thread is 
performing/receiving a query?




Your index is relatively small so filter cache of initial size of 1000
entries should take around 20MB (assuming single shard)
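
(Rough arithmetic behind that estimate, assuming each cached filter is stored as a bitset of maxDoc bits: 161115 / 8 is roughly 20 KB per entry, so 1000 entries come to roughly 20 MB.)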

Thanks,
Emir

On 18.03.2016 17:02, Rallavagu wrote:



On 3/18/16 8:56 AM, Emir Arnautovic wrote:

Problem starts with autowarmCount="5000" - that executes 5000 queries
when new searcher is created and as queries are executed, document cache
is filled. If you have large queryResultWindowSize and queries return
big number of documents, that will eat up memory before new search is
executed. It probably takes some time as well.

This is also combined with filter cache. How big is your index?


Index is not very large.


numDocs:
85933

maxDoc:
161115

deletedDocs:
75182

Size
1.08 GB

I have run a query to return all documents with all fields. I could
not reproduce OOM. I understand that I need to reduce cache sizes but
wondering what conditions could have caused OOM so I can keep a watch.

Thanks



Thanks,
Emir

On 18.03.2016 15:43, Rallavagu wrote:

Thanks for the recommendations Shawn. Those are the lines I am
thinking as well. I am reviewing application also.

Going with the note on cache invalidation for every two minutes due to
soft commit, wonder how would it go OOM in simply two minutes or is it
likely that a thread is holding the searcher due to long running query
that might be potentially causing OOM? Was trying to reproduce but
could not so far.

Here is the filter cache config



Query Results cache



On 3/18/16 7:31 AM, Shawn Heisey wrote:

On 3/18/2016 8:22 AM, Rallavagu wrote:

So, each soft commit would create a new searcher that would
invalidate
the old cache?

Here is the configuration for Document Cache



true


In an earlier message, you indicated you're running into OOM.  I think
we can see why with this cache definition.

There are exactly two ways to deal with OOM.  One is to increase the
heap size.  The other is to reduce the amount of memory that the
program
requires by changing something -- that might be the code, the
config, or
how you're using it.

Start by reducing that cache size to 4096 or 1024.

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

If you've also got a very large filterCache, reduce that size too. The
filterCache typically eats up a LOT of memory, because each entry
in the
cache is very large.

Thanks,
Shawn







Re: Document Cache

2016-03-19 Thread Rallavagu



On 3/18/16 8:56 AM, Emir Arnautovic wrote:

Problem starts with autowarmCount="5000" - that executes 5000 queries
when new searcher is created and as queries are executed, document cache
is filled. If you have large queryResultWindowSize and queries return
big number of documents, that will eat up memory before new search is
executed. It probably takes some time as well.

This is also combined with filter cache. How big is your index?


Index is not very large.


numDocs:
85933

maxDoc:
161115

deletedDocs:
75182

Size
1.08 GB

I have run a query to return all documents with all fields. I could not 
reproduce OOM. I understand that I need to reduce cache sizes but 
wondering what conditions could have caused OOM so I can keep a watch.


Thanks



Thanks,
Emir

On 18.03.2016 15:43, Rallavagu wrote:

Thanks for the recommendations Shawn. Those are the lines I am
thinking as well. I am reviewing application also.

Going with the note on cache invalidation for every two minutes due to
soft commit, wonder how would it go OOM in simply two minutes or is it
likely that a thread is holding the searcher due to long running query
that might be potentially causing OOM? Was trying to reproduce but
could not so far.

Here is the filter cache config



Query Results cache



On 3/18/16 7:31 AM, Shawn Heisey wrote:

On 3/18/2016 8:22 AM, Rallavagu wrote:

So, each soft commit would create a new searcher that would invalidate
the old cache?

Here is the configuration for Document Cache



true


In an earlier message, you indicated you're running into OOM.  I think
we can see why with this cache definition.

There are exactly two ways to deal with OOM.  One is to increase the
heap size.  The other is to reduce the amount of memory that the program
requires by changing something -- that might be the code, the config, or
how you're using it.

Start by reducing that cache size to 4096 or 1024.

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

If you've also got a very large filterCache, reduce that size too.  The
filterCache typically eats up a LOT of memory, because each entry in the
cache is very large.

Thanks,
Shawn





Document Cache

2016-03-19 Thread Rallavagu

Solr 5.4 embedded Jetty

Is it the right assumption that whenever a document is returned as a 
response to a query, it is cached in the "Document Cache"?


Essentially, if I request for any entry like /select?q=id: 
will it be cached in "Document Cache"? If yes, what is the TTL?


Thanks in advance


Re: Document Cache

2016-03-19 Thread Rallavagu
Thanks for the recommendations Shawn. Those are the lines I am thinking 
as well. I am reviewing application also.


Going with the note on cache invalidation every two minutes due to 
soft commit, I wonder how it would go OOM in just two minutes, or whether 
a thread holding the searcher for a long-running query might be causing 
the OOM. I was trying to reproduce it but could not so far.


Here is the filter cache config

<filterCache ... autowarmCount="1000"/>

Query Results cache

<queryResultCache ... autowarmCount="5000"/>


On 3/18/16 7:31 AM, Shawn Heisey wrote:

On 3/18/2016 8:22 AM, Rallavagu wrote:

So, each soft commit would create a new searcher that would invalidate
the old cache?

Here is the configuration for Document Cache



true


In an earlier message, you indicated you're running into OOM.  I think
we can see why with this cache definition.

There are exactly two ways to deal with OOM.  One is to increase the
heap size.  The other is to reduce the amount of memory that the program
requires by changing something -- that might be the code, the config, or
how you're using it.

Start by reducing that cache size to 4096 or 1024.
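
(For example, a trimmed-down definition along these lines; the class name and sizes are illustrative, not prescriptive:)

<documentCache class="solr.LRUCache" size="1024" initialSize="1024" autowarmCount="0"/>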

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

If you've also got a very large filterCache, reduce that size too.  The
filterCache typically eats up a LOT of memory, because each entry in the
cache is very large.

Thanks,
Shawn



Re: Solr5 Optimize

2016-03-19 Thread Rallavagu

Erick, Thanks for the response. Comments in line...

On 3/16/16 9:56 AM, Erick Erickson wrote:

In general, don't bother with optimize unless the index is quite static,
i.e. there are very few adds/updates or those updates are done in
batches and rarely (i.e. once a day or less frequently).

As far as space, this will require that you have at _least_ as much
free space on your disks as your index occupies. Shouldn't require
much in the way of RAM though.

Optimize, also referred to as "Force Merge" will merge all the segments
down to one, and in the process reclaim data from deleted (or updated)
documents.

The thing is, this is also accomplished by "background merging" which
happens automatically. Every time you do a hard commit, Lucene
figures out if any segments need to be merged and does that automatically.
During that process, any information associated with deleted docs is
reclaimed.
If "optimize" like operation happening automatically every time a hard 
commit happens, with following settings (15 seconds for hard commit) 
what would be impact on performance particularly on disk space?



   ${solr.autoCommit.maxTime:15000}
   false
 

Thanks.



The third video down here:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html
is Mike's visualization of the automatic merging process.

Best,
Erick

On Wed, Mar 16, 2016 at 9:40 AM, Rallavagu <rallav...@gmail.com> wrote:

All,

Solr 5.4 with embedded Jetty (4G heap)

Trying to understand behavior of "optimize" operation if not run explicitly.
What is the frequency at which this operation is run, what are the storage
requirements and how do we schedule it? Any comments/pointers would greatly
help.

Thanks in advance


Re: Document Cache

2016-03-18 Thread Rallavagu
So, each soft commit would create a new searcher that would invalidate 
the old cache?


Here is the configuration for Document Cache

<documentCache ... autowarmCount="0"/>

true

Thanks

On 3/18/16 12:45 AM, Emir Arnautovic wrote:

Hi,
Your cache will be cleared on soft commits - every two minutes. It seems
that it is either configured to be huge, or you have big documents and are
retrieving all fields, or don't have lazy field loading set to true.

Can you please share your document cache config and heap settings.

Thanks,
Emir

On 17.03.2016 22:24, Rallavagu wrote:

comments in line...

On 3/17/16 2:16 PM, Erick Erickson wrote:

First, I want to make sure when you say "TTL", you're talking about
documents being evicted from the documentCache and not the "Time To
Live"
option whereby documents are removed completely from the index.


Maybe TTL was not the right word to use here. I wanted to learn the
criteria for an entry to be evicted.



The time varies with the number of new documents fetched. This is an LRU
cache whose size is configured in solrconfig.xml. It's pretty much
unpredictable. If for some odd reason every request gets the same
document
it'll never be aged out. If no two queries return the same document, it
will be aged out once "cache size" docs have been fetched by subsequent
requests.

The entire thing is thrown out whenever a new searcher is opened (i.e.
softCommit or hardCommit with openSearcher=true)




But maybe this is an XY problem. Why do you care? Is there something
you're
seeing that you're trying to understand or is this just a general
interest
question?

I have the following configuration:

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:12}</maxTime>
</autoSoftCommit>


As you can see, openSearcher is set to "false". What I am seeing (from a
heap dump taken after an OutOfMemory error) is that the LRUCache backing
the "Document Cache" occupies around 85% of the available heap, and that
is causing OOM errors. So I am trying to understand the behavior to
address the OOM issues.

Thanks



Best,
Erick

On Thu, Mar 17, 2016 at 1:40 PM, Rallavagu <rallav...@gmail.com> wrote:


Solr 5.4 embedded Jetty

Is it the right assumption that whenever a document is returned as a
response to a query, it is cached in the "Document Cache"?

Essentially, if I request for any entry like /select?q=id:
will it be cached in "Document Cache"? If yes, what is the TTL?

Thanks in advance







Re: SolrCloud breaks and does not recover

2015-11-03 Thread Rallavagu
One other item to look into is increasing the zookeeper timeout in Solr's 
solr.xml. This would help with timeouts caused by long GC pauses.
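
For example, in the solrcloud section of solr.xml (the 60-second value here is only an illustration):

<solr>
  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:60000}</int>
  </solrcloud>
</solr>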


On 11/3/15 9:12 AM, Björn Häuser wrote:

Hi,

thank you for your answer.

1> No OOM hit, the log does not contain any hint of that. Also solr
wasn't restarted automatically. But the gc log has some pauses which
are longer than 15 seconds.

2> So, if we need to recover a system we need to stop ingesting data into it?

3> The JVMs currently use a little bit more then 1GB of Heap, with a
now changed max-heap of 3GB. Currently thinking of lowering the heap
to 1.5 / 2 GB (following Uwe's post).

Also the RES is 4.1gb and VIRT is 12.5gb. Swap is more or less not
used (40mb of 1GB assigned swap). According to our server monitoring
sometimes an io spike happens, but again not that much.

What I am going to do:

1.) make sure that in case of failure we stop ingesting data into solrcloud
2.) lower the heap to 2GB
3.) Make sure that zookeeper can fsync its write-ahead log fast enough (<1 sec)

Thanks
Björn

2015-11-03 16:27 GMT+01:00 Erick Erickson :

The GC logs don't really show anything interesting, there would
be 15+ second GC pauses. The Zookeeper log isn't actually very
interesting. As far as OOM errors, I was thinking of _solr_ logs.

As to why the cluster doesn't self-heal, a couple of things:

1> Once you hit an OOM, all bets are off. The JVM needs to be
bounced. Many installations have kill scripts that bounce the
JVM. So it's explainable if you have OOM errors.

2> The system may be _trying_ to recover, but if you're
still ingesting data it may get into a resource-starved
situation where it makes progress but never catches up.

Again, though, this seems like very little memory for the
situation you describe, I suspect you're memory-starved to
a point where you can't really run. But that's a guess.

When you run, how much JVM memory are you using? The admin
UI should show that.

But the pattern of 8G physical memory and 6G for Java is a red
flag as per Uwe's blog post, you may be swapping a lot (OS
memory) and that may be slowing things down enough to have
sessions drop. Grasping at straws here, but "top" or similar
should tell you what the system is doing.

Best,
Erick

On Tue, Nov 3, 2015 at 12:04 AM, Björn Häuser  wrote:

Hi!

Thank you for your super fast answer.

I can provide more data, the question is which data :-)

These are the config parameters solr runs with:
https://gist.github.com/bjoernhaeuser/24e7080b9ff2a8785740 (taken from
the admin ui)

These are the log files:

https://gist.github.com/bjoernhaeuser/a60c2319d71eb35e9f1b

I think your first observation is correct: SolrCloud loses the
connection to zookeeper, because the connection times out.

But why isn't solrcloud able to recover it self?

Thanks
Björn


2015-11-02 22:32 GMT+01:00 Erick Erickson :

Without more data, I'd guess one of two things:

1> you're seeing stop-the-world GC pauses that cause Zookeeper to
think the node is unresponsive, which puts a node into recovery and
things go bad from there.

2> Somewhere in your solr logs you'll see OutOfMemory errors which can
also cascade a bunch of problems.

In general it's an anti-pattern to allocate such a large portion of
our physical memory to the JVM, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html



Best,
Erick



On Mon, Nov 2, 2015 at 1:21 PM, Björn Häuser  wrote:

Hey there,

we are running a SolrCloud, which has 4 nodes, same config. Each node
has 8gb memory, 6GB assigned to the JVM. This is maybe too much, but
worked for a long time.

We currently run with 2 shards, 2 replicas and 11 collections. The
complete data-dir is about 5.3 GB.
I think we should move some JVM heap back to the OS.

We are running Solr 5.2.1., as I could not see any related bugs to
SolrCloud in the release notes for 5.3.0 and 5.3.1, we did not bother
to upgrade first.

One of our nodes (node A) reports these errors:

org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://10.41.199.201:9004/solr/catalogue: Invalid
version (expected 2, but 101) or the data in not in 'javabin' format

Stacktrace: https://gist.github.com/bjoernhaeuser/46ac851586a51f8ec171

And shortly after (4 seconds) this happens on a *different* node (Node B):

Stopping recovery for core=suggestion coreNodeName=core_node2

No Stacktrace for this, but happens for all 11 collections.

6 seconds after that Node C reports these errors:

org.apache.solr.common.SolrException:
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /configs/customers/params.json

Stacktrace: https://gist.github.com/bjoernhaeuser/45a244dc32d74ac989f8

This also happens for 11 collections.

And then different errors happen:

OverseerAutoReplicaFailoverThread had an error in its thread work

growth of tlog

2015-10-30 Thread Rallavagu

4.10.4 solr cloud, 3 zk quorum, jdk 8

autocommit: 15 sec, softcommit: 2 min

Under heavy indexing load with the above settings, I have seen the tlog 
growing (into GBs). After the updates stop coming in, it settles down and 
takes a while to recover before the cloud becomes "green".


With 15 second autocommit setting, what could potentially cause tlog to 
grow? What to look for?


Re: growth of tlog

2015-10-30 Thread Rallavagu



On 10/30/15 8:39 AM, Erick Erickson wrote:

I infer that this statement: "takes a while to recover before cloud
becomes green"
indicates that the node is in recovery or something while indexing. If you're
still indexing, the new documents will be written to the followers
tlog while the
follower is recovering, leading to it growing. I expect that after followers
all recover, the tlog shrinks after a few commits have gone by.


Correct. The recovery time is extended though. Also, this affects 
available physical memory as tlog continues to grow and it is memory mapped.




If that's all true, the question is why the follower goes into
recovery in the first
place. Prior to 5.2, there was a situation in which very heavy indexing
could cause a follower to go into Leader Initiated Recovery (LIR) (look for this
in both the leader and follower logs). Here's the blog Tim Potter wrote
on this subject:
https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/

The smoking gun here is
1> heavy indexing is required
2> the _leader_ stays up
3> the _follower_ goes into recovery for no readily apparent reason
4> the nail in the coffin for this particular issue is seeing that the follower
  went into LIR.
5> You'll also see a very large number of threads on the leader waiting
   on sending the updates to the follower.


If this is a problem, prior to 5.2 there are really only two solutions
1> throttle indexing
2> take all of the followers offline during indexing. When indexing is
  completed, bring the followers back up and let them replicate the
  full index down from the leader.
Other than shutting followers down, is there an elegant/graceful way of 
taking follower nodes offline? Also, to give you more of an idea, as per the 
following document I am testing the "Index heavy, Query heavy" situation.


https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks



Best,
Erick

On Fri, Oct 30, 2015 at 8:28 AM, Rallavagu <rallav...@gmail.com> wrote:

4.10.4 solr cloud, 3 zk quorum, jdk 8

autocommit: 15 sec, softcommit: 2 min

Under heavy indexing load with above settings, i have seen tlog growing
(into GB). After the updates stopped coming in, it settles down and takes a
while to recover before cloud becomes "green".

With 15 second autocommit setting, what could potentially cause tlog to
grow? What to look for?


Solr for Pictures

2015-10-29 Thread Rallavagu
In general, is there a built-in data handler to index pictures 
(essentially, EXIF and other data embedded in an image)? If not, what is 
the best practice to do so? Thanks.


Re: Solr for Pictures

2015-10-29 Thread Rallavagu
I was playing with exiftool (written in perl) and a custom java class 
built using the metadata-extractor project 
(https://github.com/drewnoakes/metadata-extractor) and wondering if 
there is anything built into Solr or are there any best practices 
(general practices) to index pictures.


On 10/29/15 1:56 PM, Daniel Valdivia wrote:

Some extra googling yield this Wiki from a integration between Tika and a 
EXIFTool

https://wiki.apache.org/tika/EXIFToolParser


On Oct 29, 2015, at 1:48 PM, Daniel Valdivia <h...@danielvaldivia.com> wrote:

I think you can look into Tika for this: https://tika.apache.org/

There’s handlers to integrate Tika and Solr, some context:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika




On Oct 29, 2015, at 1:47 PM, Rallavagu <rallav...@gmail.com> wrote:

In general, is there a built-in data handler to index pictures (essentially, 
EXIF and other data embedded in an image)? If not, what is the best practice to 
do so? Thanks.







Commit Error

2015-10-28 Thread Rallavagu

Solr 4.6.1, cloud

Seeing following commit errors.

[commitScheduler-19-thread-1] ERROR org.apache.solr.update.CommitTracker – auto commit error...:
java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
        at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
        at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
        at java.lang.Thread.run(Thread.java:682)


Looking at the code,

public final void prepareCommit() throws IOException {
  ensureOpen();
  prepareCommitInternal();
}

private void prepareCommitInternal() throws IOException {
  synchronized(commitLock) {
    ensureOpen(false);
    if (infoStream.isEnabled("IW")) {
      infoStream.message("IW", "prepareCommit: flush");
      infoStream.message("IW", "  index before flush " + segString());
    }

    if (hitOOM) {
      throw new IllegalStateException("this writer hit an OutOfMemoryError; cannot commit");
    }

Is it simply checking a flag to see whether it hit OOM? What causes that 
flag to be checked and set? What could the conditions be? Thanks.


Re: Commit Error

2015-10-28 Thread Rallavagu

Thanks Shawn for the response.

Seeing very high CPU during this time and very high warmup times. During 
this time, there were plenty of these errors logged, so I am trying to find 
out possible causes. Could it be disk I/O issues or something else, since 
it is related to commit (writing to disk)?


On 10/28/15 3:57 PM, Shawn Heisey wrote:

On 10/28/2015 2:06 PM, Rallavagu wrote:

Solr 4.6.1, cloud

Seeing following commit errors.

[commitScheduler-19-thread-1] ERROR
org.apache.solr.update.CommitTracker – auto commit
error...:java.lang.IllegalStateException: this writer hit an
OutOfMemoryError; cannot commit at
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
at
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216) at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440) at
java.util.concurrent.FutureTask.run(FutureTask.java:138) at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
at java.lang.Thread.run(Thread.java:682)

Looking at the code,

public final void prepareCommit() throws IOException {
 ensureOpen();
 prepareCommitInternal();
   }

   private void prepareCommitInternal() throws IOException {
 synchronized(commitLock) {
   ensureOpen(false);
   if (infoStream.isEnabled("IW")) {
 infoStream.message("IW", "prepareCommit: flush");
 infoStream.message("IW", "  index before flush " + segString());
   }

   if (hitOOM) {
 throw new IllegalStateException("this writer hit an
OutOfMemoryError; cannot commit");
   }

It simply checking a flag if it hit OOM? What is making to check and
set the flag? What could be the conditions? Thanks.


This exception handling was revamped in Lucene 4.10.1 (and therefore in
Solr 4.10.1) by this issue:

https://issues.apache.org/jira/browse/LUCENE-5958

The "hitOOM" variable was removed by the following specific commit --
this is the commit on the 4.10 branch, but it was also committed to
branch_4x and trunk as well.  Later commits on this same issue were made
to branch_5x -- the cutover to begin the 5.0 release process was made
while this issue was still being fixed.

https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_10/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java?r1=1626189=1626188=1626189

In the code before this fix, the hitOOM flag is set by other methods in
IndexWriter.  It is volatile to prevent problems with multiple threads
updating and accessing it.

Your message doesn't indicate what problems you're having besides an
error message in your log.  LUCENE-5958 indicates that the problems
could be as bad as a corrupt index.

The reason that IndexWriter swallows OOM exceptions is that this is the
only way Lucene can even *attempt* to avoid index corruption in every
error situation.  Lucene has had a very good track record at avoiding
index corruption, but every now and then a bug is found and a user
manages to get a corrupted index.

Thanks,
Shawn



Re: Commit Error

2015-10-28 Thread Rallavagu
Also, is this the thread that went OOM, and what could cause it? The heap was 
doing fine and the server was live and running.


On 10/28/15 3:57 PM, Shawn Heisey wrote:

On 10/28/2015 2:06 PM, Rallavagu wrote:

Solr 4.6.1, cloud

Seeing following commit errors.

[commitScheduler-19-thread-1] ERROR
org.apache.solr.update.CommitTracker – auto commit
error...:java.lang.IllegalStateException: this writer hit an
OutOfMemoryError; cannot commit at
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
at
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216) at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440) at
java.util.concurrent.FutureTask.run(FutureTask.java:138) at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
at java.lang.Thread.run(Thread.java:682)

Looking at the code,

public final void prepareCommit() throws IOException {
 ensureOpen();
 prepareCommitInternal();
   }

   private void prepareCommitInternal() throws IOException {
 synchronized(commitLock) {
   ensureOpen(false);
   if (infoStream.isEnabled("IW")) {
 infoStream.message("IW", "prepareCommit: flush");
 infoStream.message("IW", "  index before flush " + segString());
   }

   if (hitOOM) {
 throw new IllegalStateException("this writer hit an
OutOfMemoryError; cannot commit");
   }

It simply checking a flag if it hit OOM? What is making to check and
set the flag? What could be the conditions? Thanks.


This exception handling was revamped in Lucene 4.10.1 (and therefore in
Solr 4.10.1) by this issue:

https://issues.apache.org/jira/browse/LUCENE-5958

The "hitOOM" variable was removed by the following specific commit --
this is the commit on the 4.10 branch, but it was also committed to
branch_4x and trunk as well.  Later commits on this same issue were made
to branch_5x -- the cutover to begin the 5.0 release process was made
while this issue was still being fixed.

https://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_10/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java?r1=1626189=1626188=1626189

In the code before this fix, the hitOOM flag is set by other methods in
IndexWriter.  It is volatile to prevent problems with multiple threads
updating and accessing it.

Your message doesn't indicate what problems you're having besides an
error message in your log.  LUCENE-5958 indicates that the problems
could be as bad as a corrupt index.

The reason that IndexWriter swallows OOM exceptions is that this is the
only way Lucene can even *attempt* to avoid index corruption in every
error situation.  Lucene has had a very good track record at avoiding
index corruption, but every now and then a bug is found and a user
manages to get a corrupted index.

Thanks,
Shawn



Re: Commit Error

2015-10-28 Thread Rallavagu



On 10/28/15 5:41 PM, Shawn Heisey wrote:

On 10/28/2015 5:11 PM, Rallavagu wrote:

Seeing very high CPU during this time and very high warmup times. During
this time, there were plenty of these errors logged. So, trying to find
out possible causes for this to occur. Could it be disk I/O issues or
something else as it is related to commit (writing to disk).


Lucene is claiming that you're hitting the Out Of Memory exception.  I
pulled down the 4.6.1 source code to verify IndexWriter's behavior.  The
only time hitOOM can be set to true is when OutOfMemoryError is being
thrown, so unless you're running Solr built from modified source code,
Lucene's claim *is* what's happening.


This is very likely true as source is not modified.



In OOM situations, there's a good chance that Java is going to be
spending a lot of time doing garbage collection, which can cause CPU
usage to go high and make warm times long.


Again, I think this is the likely case. Even though there is no apparent 
OOM, the JVM can throw an OOM when an excessive number of full GCs fails 
to reclaim enough memory (e.g., "GC overhead limit exceeded").




The behavior of most Java programs is completely unpredictable when Java
actually runs out of memory.  As already mentioned, the parts of Lucene
that update the index are specifically programmed to deal with OOM
without causing index corruption.  Writing code that is predictable in
OOM situations is challenging, so only a subset of the code in
Lucene/Solr has been hardened in this way.  Most of it is as
unpredictable in OOM as any other Java program.


Thanks Shawn.



Thanks,
Shawn



Re: Solr hard commit

2015-10-27 Thread Rallavagu



On 10/27/15 8:43 AM, Erick Erickson wrote:

bq: So, the updated file(s) on the disk automatically read into memory
as they are Memory mapped?

Yes.


Not quite sure why you care, curiosity or is there something you're
trying to accomplish?
This is out of curiosity, so I can get a better understanding of Solr's 
memory usage (heap & mmap).




The contents of the index's segment files are read into virtual memory
by MMapDirectory as needed to satisfy queries. Which is the point of
autowarming BTW.
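
(For illustration, static warming queries can be wired into solrconfig.xml through a newSearcher listener; the query and rows value below are placeholders:)

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>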


Ok. But, I have noticed that even "tlog" files are memory mapped (output 
from "lsof") in addition to all other files under "data" directory.




commit in the following is either hard commit with openSearcher=true
or soft commit.


Hard commit is setup with openSearcher=false and softCommit is setup for 
every 2 min.




Segments that have been created (closed actually) after the last
commit  are _not_ read at all until the next searcher is opened via
another commit. Nothing is done with these new segments before the new
searcher is opened which you control with your commit strategy.


I see. Thanks for the insight.



Best,
Erick

On Mon, Oct 26, 2015 at 9:07 PM, Rallavagu <rallav...@gmail.com> wrote:

Erick, Thanks for clarification. I was under impression that MMapDirectory
is being used for both read/write operations. Now, I see how it is being
used. Essentially, it only reads from MMapDirectory and writes directly to
disk. So, the updated file(s) on the disk automatically read into memory as
they are Memory mapped?

On 10/26/15 8:43 PM, Erick Erickson wrote:


You're really looking at this backwards. The MMapDirectory stuff is
for Solr (Lucene, really) _reading_ data from closed segment files.

When indexing, there are internal memory structures that are flushed
to disk on commit, but these have nothing to do with MMapDirectory.

So the question is really moot ;)

Best,
Erick

On Mon, Oct 26, 2015 at 5:47 PM, Rallavagu <rallav...@gmail.com> wrote:


All,

Are memory mapped files (mmap) flushed to disk during "hard commit"? If
yes,
should we disable OS level (Linux for example) memory mapped flush?

I am referring to following for mmap files for Lucene/Solr

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Linux level flush

http://www.cyberciti.biz/faq/linux-stop-flushing-of-mmaped-pages-to-disk/

Solr's hard and soft commit


https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks in advance.


Re: Using books.json in solr

2015-10-27 Thread Rallavagu
Could you please share your query? You could use "wt=json" query 
parameter to receive JSON formatted results if that is what you are 
looking for.
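
For example, something along these lines (the core name and field name are guesses, since the original query was not shared):

/solr/books/select?q=genre_s:fantasy&wt=json&indent=true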


On 10/27/15 10:44 AM, Salonee Rege wrote:

Hello,
   We are trying to query the books.json that we have posted to Solr.
But when we try to specifically query it on genre, it does not return
complete JSON with valid key-value pairs. Kindly help.

/Salonee Rege/
USC Viterbi School of Engineering
University of Southern California
Master of Computer Science - Student
Computer Science - B.E
salon...@usc.edu | 619-709-6756


Re: Solr hard commit

2015-10-27 Thread Rallavagu

Is it related to this config?



Solr hard commit

2015-10-26 Thread Rallavagu

All,

Are memory mapped files (mmap) flushed to disk during "hard commit"? If 
yes, should we disable OS level (Linux for example) memory mapped flush?


I am referring to following for mmap files for Lucene/Solr

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Linux level flush

http://www.cyberciti.biz/faq/linux-stop-flushing-of-mmaped-pages-to-disk/

Solr's hard and soft commit

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks in advance.


Re: Solr hard commit

2015-10-26 Thread Rallavagu
Erick, Thanks for clarification. I was under impression that 
MMapDirectory is being used for both read/write operations. Now, I see 
how it is being used. Essentially, it only reads from MMapDirectory and 
writes directly to disk. So, the updated file(s) on the disk 
automatically read into memory as they are Memory mapped?


On 10/26/15 8:43 PM, Erick Erickson wrote:

You're really looking at this backwards. The MMapDirectory stuff is
for Solr (Lucene, really) _reading_ data from closed segment files.

When indexing, there are internal memory structures that are flushed
to disk on commit, but these have nothing to do with MMapDirectory.

So the question is really moot ;)

Best,
Erick

On Mon, Oct 26, 2015 at 5:47 PM, Rallavagu <rallav...@gmail.com> wrote:

All,

Are memory mapped files (mmap) flushed to disk during "hard commit"? If yes,
should we disable OS level (Linux for example) memory mapped flush?

I am referring to following for mmap files for Lucene/Solr

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Linux level flush

http://www.cyberciti.biz/faq/linux-stop-flushing-of-mmaped-pages-to-disk/

Solr's hard and soft commit

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks in advance.


Re: locks and high CPU

2015-10-22 Thread Rallavagu Kon
Erick,

Indexing is happening via the Solr cloud server. This thread was from the leader. Some 
followers show symptoms of high CPU during this time. Do you think this is from 
locking? What is the thread that is holding the lock doing? Also, we are unable 
to reproduce this issue in the load test environment. Any clues would help.

> On Oct 22, 2015, at 09:50, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Prior to Solr 5.2, there were several inefficiencies when distributing
> updates to replicas, see:
> https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/.
> 
> The symptom was that there was significantly higher CPU utilization on
> the followers
> compared to the leader.
> 
> The only real fix is to upgrade to 5.2+ assuming that's your issue.
> 
> How are you indexing? Using SolrJ with CloudSolrServer would help if
> you're not using
> them.
> 
> Best,
> Erick
> 
>> On Thu, Oct 22, 2015 at 9:43 AM, Rallavagu <rallav...@gmail.com> wrote:
>> Solr 4.6.1 cloud
>> 
>> Looking into thread dump 4-5 threads causing cpu to go very high and causing
>> issues. These are tomcat's http threads and are locking. Can anybody help me
>> understand what is going on here? I see that incoming connections coming in
>> for updates and they are being passed on to StreamingSolrServer and
>> subsequently ConcurrentUpdateSolrServer and they both have locks. Thanks.
>> 
>> 
>> "http-bio-8080-exec-4394" id=8774 idx=0x988 tid=14548 prio=5 alive,
>> native_blocked, daemon
>>at __lll_lock_wait+34(:0)@0x38caa0e262
>>at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7fc29b9c9138
>>at trapiNormalHandler+484(traps_posix.c:220)@0x7fc29b9fd745
>>at _L_unlock_16+44(:0)@0x38caa0f710
>>at
>> java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:262)[optimized]
>>at
>> org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:391)[inlined]
>>at
>> org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
>>at
>> org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
>>at
>> org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
>>at
>> org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
>>^-- Holding lock:
>> org/apache/solr/update/StreamingSolrServers$1@0x496cf6e50[biased lock]
>>^-- Holding lock:
>> org/apache/solr/update/StreamingSolrServers@0x49d32adc8[biased lock]
>>at
>> org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
>>at
>> org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
>>at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
>>at
>> org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
>>at
>> org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
>>at
>> org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
>>at
>> org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
>>at
>> org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
>>at
>> org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
>>at
>> org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
>>at
>> org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
>>at
>> org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
>>at
>> org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
>>at
>> org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
>>at
>> org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
>>at
>> org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
>>at
>> org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
>>at
>> org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
>>^-- Holding lock:
>> org/apache/tomcat/util/net/SocketWrapper@0x496e58810[thin lock]
>>at
>> java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)[inlined]
>>at
>> java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)[optimized]
>>at java/lang/Thread.run(Thread.java:682)[optimized]
>>at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: locks and high CPU

2015-10-22 Thread Rallavagu
Thanks Erick. Currently, we are migrating to 5.3 and it is taking a bit of 
time. Meanwhile, I looked at the JIRAs from the blog and the stack trace 
looks a bit different from what I see, but I am not sure if they are related. 
Also, as per the stack trace I included in my original email, it is 
the tomcat thread that is locking, not the recovery thread, which would 
be responsible for writing updates to followers. I agree that we might 
throttle updates, but what is annoying is that we are unable to reproduce 
the issue in the controlled load test environment.


Just to understand better, what is the tomcat thread doing in this case?

Thanks

On 10/22/15 12:53 PM, Erick Erickson wrote:

The details are in Tim's blog post and the linked JIRAs

Unfortunately, the only real solution I know of is to upgrade
to at least Solr 5.2. Meanwhile, throttling the indexing rate
will at least smooth out the issue. Not a great approach but
all there is for 4.6.

Best,
Erick

On Thu, Oct 22, 2015 at 10:48 AM, Rallavagu Kon <rallav...@gmail.com> wrote:

Erick,

Indexing happening via Solr cloud server. This thread was from the leader. Some 
followers show symptom of high cpu during this time. You think this is from 
locking? What is the thread that is holding the lock doing? Also, we are unable 
to reproduce this issue in load test environment. Any clues would help.


On Oct 22, 2015, at 09:50, Erick Erickson <erickerick...@gmail.com> wrote:

Prior to Solr 5.2, there were several inefficiencies when distributing
updates to replicas, see:
https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/.

The symptom was that there was significantly higher CPU utilization on
the followers
compared to the leader.

The only real fix is to upgrade to 5.2+ assuming that's your issue.

How are you indexing? Using SolrJ with CloudSolrServer would help if
you're not using
them.

Best,
Erick


On Thu, Oct 22, 2015 at 9:43 AM, Rallavagu <rallav...@gmail.com> wrote:
Solr 4.6.1 cloud

Looking into thread dump 4-5 threads causing cpu to go very high and causing
issues. These are tomcat's http threads and are locking. Can anybody help me
understand what is going on here? I see that incoming connections coming in
for updates and they are being passed on to StreamingSolrServer and
subsequently ConcurrentUpdateSolrServer and they both have locks. Thanks.


"http-bio-8080-exec-4394" id=8774 idx=0x988 tid=14548 prio=5 alive,
native_blocked, daemon
at __lll_lock_wait+34(:0)@0x38caa0e262
at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7fc29b9c9138
at trapiNormalHandler+484(traps_posix.c:220)@0x7fc29b9fd745
at _L_unlock_16+44(:0)@0x38caa0f710
at
java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:262)[optimized]
at
org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:391)[inlined]
at
org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
at
org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
at
org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
at
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x496cf6e50[biased lock]
^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x49d32adc8[biased lock]
at
org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
at
org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
at
org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
at
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
at
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
at
org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
at
org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
at
org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
at
org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
at
org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
at
org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
at
org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
at
org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
at
org/apache/catalina/connector/CoyoteAdapter.service(Coyot

locks and high CPU

2015-10-22 Thread Rallavagu

Solr 4.6.1 cloud

Looking into the thread dump, 4-5 threads are causing CPU to go very high and 
causing issues. These are tomcat's http threads and they are locking. Can 
anybody help me understand what is going on here? I see that incoming 
connections are coming in for updates and they are being passed on to 
StreamingSolrServers and subsequently ConcurrentUpdateSolrServer, and they 
both hold locks. Thanks.



"http-bio-8080-exec-4394" id=8774 idx=0x988 tid=14548 prio=5 alive, 
native_blocked, daemon

at __lll_lock_wait+34(:0)@0x38caa0e262
at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7fc29b9c9138
at trapiNormalHandler+484(traps_posix.c:220)@0x7fc29b9fd745
at _L_unlock_16+44(:0)@0x38caa0f710
at 
java/util/concurrent/locks/ReentrantLock.lock(ReentrantLock.java:262)[optimized]
at 
org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:391)[inlined]
at 
org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
at 
org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
at 
org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
at 
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
^-- Holding lock: 
org/apache/solr/update/StreamingSolrServers$1@0x496cf6e50[biased lock]
^-- Holding lock: 
org/apache/solr/update/StreamingSolrServers@0x49d32adc8[biased lock]
at 
org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
at 
org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]

at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
at 
org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
at 
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
at 
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
at 
org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
at 
org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
at 
org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
at 
org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
at 
org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
at 
org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
at 
org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
at 
org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
at 
org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
at 
org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
at 
org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
at 
org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
^-- Holding lock: 
org/apache/tomcat/util/net/SocketWrapper@0x496e58810[thin lock]
at 
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)[inlined]
at 
java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)[optimized]

at java/lang/Thread.run(Thread.java:682)[optimized]
at jrockit/vm/RNI.c2java(J)V(Native Method)


coreZkRegister thread

2015-10-20 Thread Rallavagu

Solr 4.6.1, 4 node cloud with 3 zk

I see the following thread as blocked. Could somebody please help me 
understand what is going on here and how it will impact the Solr cloud? 
All four of these threads are blocked. Thanks.


"coreZkRegister-1-thread-1" id=74 idx=0x108 tid=32162 prio=5 alive, 
parked, native_blocked
-- Parking to wait for: 
java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject@0x11a61daf8

at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7f41a970aba8
at 
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7f41a989e0b2

at syncWaitForSignal+189(synchronization.c:85)@0x7f41a989e20e
at vmtPark+164(signaling.c:72)@0x7f41a987a165
at jrockit/vm/Locks.park0(J)V(Native Method)
at jrockit/vm/Locks.park(Locks.java:2230)
at sun/misc/Unsafe.park(ZJ)V(Native Method)
at java/util/concurrent/locks/LockSupport.park(LockSupport.java:156)
at 
java/util/concurrent/locks/AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
at 
java/util/concurrent/LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
at 
java/util/concurrent/ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:957)
at 
java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)

at java/lang/Thread.run(Thread.java:682)
at jrockit/vm/RNI.c2java(J)V(Native Method)
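
For context, a parked stack whose bottom frames are 
LinkedBlockingQueue.take() inside ThreadPoolExecutor.getTask() is what 
an idle executor worker normally looks like: it is waiting for work 
rather than stuck on application state. A toy illustration (plain JDK, 
not Solr code; names are made up):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy illustration: a pool worker with no task to run parks inside
// LinkedBlockingQueue.take(), producing the same kind of parked stack
// as the coreZkRegister thread above. It wakes up as soon as work is
// submitted to the pool.
public class IdleWorkerDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1, r -> {
            Thread t = new Thread(r, "coreZkRegister-like-worker");
            t.setDaemon(true);
            return t;
        });
        pool.submit(() -> System.out.println("one task, then idle"));
        Thread.sleep(5000); // take a jstack during this window to see the parked worker
        pool.shutdown();
    }
}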


Re: Help me read Thread

2015-10-16 Thread Rallavagu
One more observation: Tomcat's HTTP acceptor thread 
(http-bio-8080-acceptor) disappears, and because of this no incoming 
HTTP connections can be opened. During this time ZK apparently still 
thinks the node is up and it shows green from the leader.


On 10/13/15 9:17 AM, Erick Erickson wrote:

How heavy is heavy? The proverbial smoking gun here will be messages in any
logs referring to "leader initiated recovery". (note, that's the
message I remember seeing,
it may not be exact).

There's no particular work-around here except to back off the indexing
load. Certainly increasing the
thread pool size allowed this to surface. Also 5.2 has some
significant improvements in this area, see:
https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/

And a lot depends on how you're indexing, batching up updates is a
good thing. If you go to a
multi-shard setup, using SolrJ and CloudSolrServer (CloudSolrClient in
5.x) would help. More
shards would help as well,  but I'd first take a look at the indexing
process and be sure you're
batching up updates.

It's also possible if indexing is a once-a-day process and it fits
with your SLAs to shut off the replicas,
index to the leader, then turn the replicas back on. That's not all
that satisfactory, but I've seen it used.

But with a single shard setup, I really have to ask why indexing at
such a furious rate is
required that you're hitting this. Are you unable to reduce the indexing rate?

Best,
Erick

On Tue, Oct 13, 2015 at 9:08 AM, Rallavagu <rallav...@gmail.com> wrote:

Also, we have increased number of connections per host from default (20) to
100 for http thread pool to communicate with other nodes. Could this have
caused the issues as it can now spin many threads to send updates?


On 10/13/15 8:56 AM, Erick Erickson wrote:


Is this under a very heavy indexing load? There were some
inefficiencies that caused followers to work a lot harder than the
leader, but the leader had to spin off a bunch of threads to send
update to followers. That's fixed in the 5.2 release.

Best,
Erick

On Tue, Oct 13, 2015 at 8:40 AM, Rallavagu <rallav...@gmail.com> wrote:


Please help me understand what is going on with this thread.

Solr 4.6.1, single shard, 4 node cluster, 3 node zk. Running on tomcat
with
500 threads.


There are 47 threads overall and designated leader becomes unresponsive
though shows "green" from cloud perspective. This is causing issues.

particularly,

"   at

org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
  ^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
  ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
  ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]"



"http-bio-8080-exec-2878" id=5899 idx=0x30c tid=17132 prio=5 alive,
native_blocked, daemon
  at __lll_lock_wait+34(:0)@0x382ba0e262
  at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7f83ae266138
  at trapiNormalHandler+484(traps_posix.c:220)@0x7f83ae29a745
  at _L_unlock_16+44(:0)@0x382ba0f710
  at java/util/LinkedList.peek(LinkedList.java:447)[optimized]
  at

org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:384)[inlined]
  at

org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
  at

org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
  at

org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
  at

org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
  ^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
  ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
  ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]
  at

org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
  at

org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
  at
org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
  at

org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
  at

org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
  at

org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
  at

org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
  at

org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
  at


Help me read Thread

2015-10-13 Thread Rallavagu

Please help me understand what is going on with this thread.

Solr 4.6.1, single shard, 4 node cluster, 3 node zk. Running on tomcat 
with 500 threads.



There are 47 threads overall and the designated leader becomes 
unresponsive even though it shows "green" from the cloud perspective. 
This is causing issues.


In particular,

"   at 
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]

^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
^-- Holding lock: 
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
^-- Holding lock: 
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]"




"http-bio-8080-exec-2878" id=5899 idx=0x30c tid=17132 prio=5 alive, 
native_blocked, daemon

at __lll_lock_wait+34(:0)@0x382ba0e262
at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7f83ae266138
at trapiNormalHandler+484(traps_posix.c:220)@0x7f83ae29a745
at _L_unlock_16+44(:0)@0x382ba0f710
at java/util/LinkedList.peek(LinkedList.java:447)[optimized]
at 
org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:384)[inlined]
at 
org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
at 
org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
at 
org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
at 
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]

^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
^-- Holding lock: 
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
^-- Holding lock: 
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]
at 
org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
at 
org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]

at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
at 
org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
at 
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
at 
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
at 
org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
at 
org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
at 
org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
at 
org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
at 
org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
at 
org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
at 
org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
at 
org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
at 
org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
at 
org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
at 
org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
at 
org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
^-- Holding lock: 
org/apache/tomcat/util/net/SocketWrapper@0x2ee6e4aa8[thin lock]
at 
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)[inlined]
at 
java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)[optimized]

at java/lang/Thread.run(Thread.java:682)[optimized]
at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: Help me read Thread

2015-10-13 Thread Rallavagu
Yes, the indexing load is heavy. During this time, all other nodes are 
in "recovery" mode, so search queries get referred to the leader and 
time out. Is there a temporary workaround for this? Thanks.


On 10/13/15 8:56 AM, Erick Erickson wrote:

Is this under a very heavy indexing load? There were some
inefficiencies that caused followers to work a lot harder than the
leader, but the leader had to spin off a bunch of threads to send
update to followers. That's fixed in the 5.2 release.

Best,
Erick

On Tue, Oct 13, 2015 at 8:40 AM, Rallavagu <rallav...@gmail.com> wrote:

Please help me understand what is going on with this thread.

Solr 4.6.1, single shard, 4 node cluster, 3 node zk. Running on tomcat with
500 threads.


There are 47 threads overall and designated leader becomes unresponsive
though shows "green" from cloud perspective. This is causing issues.

particularly,

"   at
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
 ^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
 ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
 ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]"



"http-bio-8080-exec-2878" id=5899 idx=0x30c tid=17132 prio=5 alive,
native_blocked, daemon
 at __lll_lock_wait+34(:0)@0x382ba0e262
 at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7f83ae266138
 at trapiNormalHandler+484(traps_posix.c:220)@0x7f83ae29a745
 at _L_unlock_16+44(:0)@0x382ba0f710
 at java/util/LinkedList.peek(LinkedList.java:447)[optimized]
 at
org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:384)[inlined]
 at
org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
 at
org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
 at
org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
 at
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
 ^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
 ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
 ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]
 at
org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
 at
org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
 at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
 at
org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
 at
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
 at
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
 at
org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
 at
org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
 at
org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
 at
org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
 at
org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
 at
org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
 at
org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
 at
org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
 at
org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
 at
org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
 at
org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
 at
org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
 ^-- Holding lock:
org/apache/tomcat/util/net/SocketWrapper@0x2ee6e4aa8[thin lock]
 at
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)[inlined]
 at
java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)[optimized]
 at java/lang/Thread.run(Thread.java:682)[optimized]
 at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: Help me read Thread

2015-10-13 Thread Rallavagu
Also, we have increased the number of connections per host from the 
default (20) to 100 for the HTTP thread pool used to communicate with 
other nodes. Could this have caused the issues, since it can now spin up 
many threads to send updates?


On 10/13/15 8:56 AM, Erick Erickson wrote:

Is this under a very heavy indexing load? There were some
inefficiencies that caused followers to work a lot harder than the
leader, but the leader had to spin off a bunch of threads to send
update to followers. That's fixed in the 5.2 release.

Best,
Erick

On Tue, Oct 13, 2015 at 8:40 AM, Rallavagu <rallav...@gmail.com> wrote:

Please help me understand what is going on with this thread.

Solr 4.6.1, single shard, 4 node cluster, 3 node zk. Running on tomcat with
500 threads.


There are 47 threads overall and designated leader becomes unresponsive
though shows "green" from cloud perspective. This is causing issues.

particularly,

"   at
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
 ^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
 ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
 ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]"



"http-bio-8080-exec-2878" id=5899 idx=0x30c tid=17132 prio=5 alive,
native_blocked, daemon
 at __lll_lock_wait+34(:0)@0x382ba0e262
 at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7f83ae266138
 at trapiNormalHandler+484(traps_posix.c:220)@0x7f83ae29a745
 at _L_unlock_16+44(:0)@0x382ba0f710
 at java/util/LinkedList.peek(LinkedList.java:447)[optimized]
 at
org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:384)[inlined]
 at
org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
 at
org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
 at
org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
 at
org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
 ^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
 ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
 ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]
 at
org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
 at
org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
 at org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
 at
org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
 at
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
 at
org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
 at
org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
 at
org/apache/catalina/core/ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)[optimized]
 at
org/apache/catalina/core/StandardWrapperValve.invoke(StandardWrapperValve.java:222)[optimized]
 at
org/apache/catalina/core/StandardContextValve.invoke(StandardContextValve.java:123)[optimized]
 at
org/apache/catalina/core/StandardHostValve.invoke(StandardHostValve.java:171)[optimized]
 at
org/apache/catalina/valves/ErrorReportValve.invoke(ErrorReportValve.java:99)[optimized]
 at
org/apache/catalina/valves/AccessLogValve.invoke(AccessLogValve.java:953)[optimized]
 at
org/apache/catalina/core/StandardEngineValve.invoke(StandardEngineValve.java:118)[optimized]
 at
org/apache/catalina/connector/CoyoteAdapter.service(CoyoteAdapter.java:408)[optimized]
 at
org/apache/coyote/http11/AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)[optimized]
 at
org/apache/coyote/AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)[optimized]
 at
org/apache/tomcat/util/net/JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)[optimized]
 ^-- Holding lock:
org/apache/tomcat/util/net/SocketWrapper@0x2ee6e4aa8[thin lock]
 at
java/util/concurrent/ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)[inlined]
 at
java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)[optimized]
 at java/lang/Thread.run(Thread.java:682)[optimized]
 at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: Help me read Thread

2015-10-13 Thread Rallavagu
The main reason is that the updates are coming from some client 
applications, so it is not a controlled indexing process. The controlled 
indexing process works fine (after spending some time tuning it). Will 
definitely look into throttling incoming update requests and reducing 
the number of connections per host. Thanks for the insight.
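
For reference, a minimal sketch of the kind of batched SolrJ indexing 
Erick describes below (illustrative only: the ZK connect string, 
collection name, field names and batch size are made up; CloudSolrServer 
is the ZooKeeper-aware 4.x client):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Illustrative sketch: send documents in batches instead of one request
// per document. CloudSolrServer routes updates to the current leader via
// the ZooKeeper cluster state.
public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");   // assumed collection name
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("title_t", "example " + i);
            batch.add(doc);
            if (batch.size() == 1000) {               // batch size is arbitrary
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();                              // or rely on autoCommit
        server.shutdown();
    }
}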


On 10/13/15 9:17 AM, Erick Erickson wrote:

How heavy is heavy? The proverbial smoking gun here will be messages in any
logs referring to "leader initiated recovery". (note, that's the
message I remember seeing,
it may not be exact).

There's no particular work-around here except to back off the indexing
load. Certainly increasing the
thread pool size allowed this to surface. Also 5.2 has some
significant improvements in this area, see:
https://lucidworks.com/blog/2015/06/10/indexing-performance-solr-5-2-now-twice-fast/

And a lot depends on how you're indexing, batching up updates is a
good thing. If you go to a
multi-shard setup, using SolrJ and CloudSolrServer (CloudSolrClient in
5.x) would help. More
shards would help as well,  but I'd first take a look at the indexing
process and be sure you're
batching up updates.

It's also possible if indexing is a once-a-day process and it fits
with your SLAs to shut off the replicas,
index to the leader, then turn the replicas back on. That's not all
that satisfactory, but I've seen it used.

But with a single shard setup, I really have to ask why indexing at
such a furious rate is
required that you're hitting this. Are you unable to reduce the indexing rate?

Best,
Erick

On Tue, Oct 13, 2015 at 9:08 AM, Rallavagu <rallav...@gmail.com> wrote:

Also, we have increased number of connections per host from default (20) to
100 for http thread pool to communicate with other nodes. Could this have
caused the issues as it can now spin many threads to send updates?


On 10/13/15 8:56 AM, Erick Erickson wrote:


Is this under a very heavy indexing load? There were some
inefficiencies that caused followers to work a lot harder than the
leader, but the leader had to spin off a bunch of threads to send
update to followers. That's fixed in the 5.2 release.

Best,
Erick

On Tue, Oct 13, 2015 at 8:40 AM, Rallavagu <rallav...@gmail.com> wrote:


Please help me understand what is going on with this thread.

Solr 4.6.1, single shard, 4 node cluster, 3 node zk. Running on tomcat
with
500 threads.


There are 47 threads overall and designated leader becomes unresponsive
though shows "green" from cloud perspective. This is causing issues.

particularly,

"   at

org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
  ^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
  ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
  ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]"



"http-bio-8080-exec-2878" id=5899 idx=0x30c tid=17132 prio=5 alive,
native_blocked, daemon
  at __lll_lock_wait+34(:0)@0x382ba0e262
  at safepointSyncOnPollAccess+167(safepoint.c:83)@0x7f83ae266138
  at trapiNormalHandler+484(traps_posix.c:220)@0x7f83ae29a745
  at _L_unlock_16+44(:0)@0x382ba0f710
  at java/util/LinkedList.peek(LinkedList.java:447)[optimized]
  at

org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:384)[inlined]
  at

org/apache/solr/update/StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:98)[inlined]
  at

org/apache/solr/update/SolrCmdDistributor.finish(SolrCmdDistributor.java:61)[inlined]
  at

org/apache/solr/update/processor/DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:501)[inlined]
  at

org/apache/solr/update/processor/DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1278)[optimized]
  ^-- Holding lock: java/util/LinkedList@0x2ee24e958[thin lock]
  ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers$1@0x2ee24e9c0[biased lock]
  ^-- Holding lock:
org/apache/solr/update/StreamingSolrServers@0x2ee24ea90[biased lock]
  at

org/apache/solr/handler/ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)[optimized]
  at

org/apache/solr/handler/RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)[optimized]
  at
org/apache/solr/core/SolrCore.execute(SolrCore.java:1859)[optimized]
  at

org/apache/solr/servlet/SolrDispatchFilter.execute(SolrDispatchFilter.java:721)[inlined]
  at

org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:417)[inlined]
  at

org/apache/solr/servlet/SolrDispatchFilter.doFilter(SolrDispatchFilter.java:201)[optimized]
  at

org/apache/catalina/core/ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)[inlined]
  at

org/apache/catalina/core/Applica

solr cloud recovery and search

2015-10-13 Thread Rallavagu
It appears that when a node that is in "recovery" mode is queried, it 
defers the query to the leader instead of serving it locally. Is this 
the expected behavior? Thanks.


Re: solr cloud recovery and search

2015-10-13 Thread Rallavagu

Great. Thanks Erick.

On 10/13/15 5:39 PM, Erick Erickson wrote:

More than expected, guaranteed. As long as at least one replica in a
shard is active, all queries should succeed. Maybe more slowly, but
they should succeed.

Best,
Erick



On Tue, Oct 13, 2015 at 4:25 PM, Rallavagu <rallav...@gmail.com> wrote:

It appears that when a node that is in "recovery" mode queried it would
defer the query to leader instead of serving from locally. Is this the
expected behavior? Thanks.


Re: tlog replay

2015-10-08 Thread Rallavagu

As a follow up:

Eventually the tlog file disappeared (I could not track how long it 
took to clear out completely). However, the following message was 
noticed in the follower's log.


5120638 [recoveryExecutor-14-thread-2] WARN 
org.apache.solr.update.UpdateLog  – Starting log replay tlog


On 10/7/15 8:29 PM, Erick Erickson wrote:

The only way I can account for such a large file off the top of my
head is if, for some reason,
the Solr on the node somehow was failing to index documents and kept
adding them to the
log for a long time. But how that would happen without the
node being in recovery
mode I'm not sure. I mean the Solr instance would have to be healthy
otherwise but just not
able to index docs which makes no sense.

The usual question here is whether there were any messages in the solr
log file indicating
problems while this built up.

tlogs will build up to very large sizes if there are very long hard
commit intervals, but I don't
see how that interval would be different on the leader and follower.

So color me puzzled.

Best,
Erick

On Wed, Oct 7, 2015 at 8:09 PM, Rallavagu <rallav...@gmail.com> wrote:

Thanks Erick.

Eventually, followers caught up but the 14G tlog file still persists and
they are healthy. Is there anything to look for? Will monitor and see how
long will it take before it disappears.

Evaluating move to Solr 5.3.

On 10/7/15 7:51 PM, Erick Erickson wrote:


Uhm, that's very weird. Updates are not applied from the tlog. Rather the
raw doc is forwarded to the replica which both indexes the doc and
writes it to the local tlog. So having a 14G tlog on a follower but a
small
tlog on the leader is definitely strange, especially if it persists over
time.

I assume the follower is healthy? And does this very large tlog disappear
after a while? I'd expect it to be aged out after a few commits of > 100
docs.

All that said, there have been a LOT of improvements since 4.6, so it
might
be something that's been addressed in the intervening time.

Best,
Erick



On Wed, Oct 7, 2015 at 7:39 PM, Rallavagu <rallav...@gmail.com> wrote:


Solr 4.6.1, single shard, 4 node cloud, 3 node zk

Like to understand the behavior better when large number of updates
happen
on leader and it generates huge tlog (14G sometimes in my case) on other
nodes. At the same time leader's tlog is few KB. So, what is the rate at
which the changes from transaction log are applied at nodes? The
autocommit
interval is set to 15 seconds after going through

https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks


Re: tlog replay

2015-10-08 Thread Rallavagu

Erick,

Actually, autoCommit is configured to 15 seconds and openSearcher is 
set to false. Neither 2 nor 3 happened. However, softCommit is set to 
10 minutes.



 <autoCommit>
   <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
   <openSearcher>false</openSearcher>
 </autoCommit>

Working on upgrading to 5.3, which will take a bit of time, and trying 
to get this under control until then.


On 10/8/15 5:28 PM, Erick Erickson wrote:

right, so the scenario is
1> somehow you didn't do a hard commit (openSearcher=true or false
doesn't matter) for a really long time while indexing.
2> Solr abnormally terminated.
3> When Solr started back up it replayed the entire log.

How <1> happened is the mystery though. With a hard commit
(autocommit) interval of 15 seconds that's weird.

The message indicates something like that happened. In very recent
Solr versions, the log will have
progress messages printed that'll help see this is happening.

Best,
Erick

On Thu, Oct 8, 2015 at 12:23 PM, Rallavagu <rallav...@gmail.com> wrote:

As a follow up.

Eventually the tlog file is disappeared (could not track the time it took to
clear out completely). However, following messages were noticed in
follower's log.

5120638 [recoveryExecutor-14-thread-2] WARN org.apache.solr.update.UpdateLog
– Starting log replay tlog

On 10/7/15 8:29 PM, Erick Erickson wrote:


The only way I can account for such a large file off the top of my
head is if, for some reason,
the Solr on the node somehow was failing to index documents and kept
adding them to the
log for a long time. But how that would happen without the
node being in recovery
mode I'm not sure. I mean the Solr instance would have to be healthy
otherwise but just not
able to index docs which makes no sense.

The usual question here is whether there were any messages in the solr
log file indicating
problems while this built up.

tlogs will build up to very large sizes if there are very long hard
commit intervals, but I don't
see how that interval would be different on the leader and follower.

So color me puzzled.

Best,
Erick

On Wed, Oct 7, 2015 at 8:09 PM, Rallavagu <rallav...@gmail.com> wrote:


Thanks Erick.

Eventually, followers caught up but the 14G tlog file still persists and
they are healthy. Is there anything to look for? Will monitor and see how
long will it take before it disappears.

Evaluating move to Solr 5.3.

On 10/7/15 7:51 PM, Erick Erickson wrote:



Uhm, that's very weird. Updates are not applied from the tlog. Rather
the
raw doc is forwarded to the replica which both indexes the doc and
writes it to the local tlog. So having a 14G tlog on a follower but a
small
tlog on the leader is definitely strange, especially if it persists over
time.

I assume the follower is healthy? And does this very large tlog
disappear
after a while? I'd expect it to be aged out after a few commits of > 100
docs.

All that said, there have been a LOT of improvements since 4.6, so it
might
be something that's been addressed in the intervening time.

Best,
Erick



On Wed, Oct 7, 2015 at 7:39 PM, Rallavagu <rallav...@gmail.com> wrote:



Solr 4.6.1, single shard, 4 node cloud, 3 node zk

Like to understand the behavior better when large number of updates
happen
on leader and it generates huge tlog (14G sometimes in my case) on
other
nodes. At the same time leader's tlog is few KB. So, what is the rate
at
which the changes from transaction log are applied at nodes? The
autocommit
interval is set to 15 seconds after going through


https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks


Re: tlog replay

2015-10-07 Thread Rallavagu

Thanks Erick.

Eventually, the followers caught up but the 14G tlog file still 
persists and they are healthy. Is there anything to look for? Will 
monitor and see how long it takes before it disappears.


Evaluating move to Solr 5.3.

On 10/7/15 7:51 PM, Erick Erickson wrote:

Uhm, that's very weird. Updates are not applied from the tlog. Rather the
raw doc is forwarded to the replica which both indexes the doc and
writes it to the local tlog. So having a 14G tlog on a follower but a small
tlog on the leader is definitely strange, especially if it persists over time.

I assume the follower is healthy? And does this very large tlog disappear
after a while? I'd expect it to be aged out after a few commits of > 100 docs.

All that said, there have been a LOT of improvements since 4.6, so it might
be something that's been addressed in the intervening time.

Best,
Erick



On Wed, Oct 7, 2015 at 7:39 PM, Rallavagu <rallav...@gmail.com> wrote:

Solr 4.6.1, single shard, 4 node cloud, 3 node zk

Like to understand the behavior better when large number of updates happen
on leader and it generates huge tlog (14G sometimes in my case) on other
nodes. At the same time leader's tlog is few KB. So, what is the rate at
which the changes from transaction log are applied at nodes? The autocommit
interval is set to 15 seconds after going through
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks


tlog replay

2015-10-07 Thread Rallavagu

Solr 4.6.1, single shard, 4 node cloud, 3 node zk

I'd like to understand the behavior better when a large number of 
updates happens on the leader and it generates a huge tlog (14G 
sometimes in my case) on the other nodes. At the same time the leader's 
tlog is a few KB. So, at what rate are the changes from the transaction 
log applied on the nodes? The autocommit interval is set to 15 seconds 
after going through 
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/


Thanks


Re: Recovery Thread Blocked

2015-10-06 Thread Rallavagu
Mark - currently 5.3 is being evaluated for upgrade purposes and 
hopefully we will get there soon. Meanwhile, the following exception is 
noted in the logs during updates:


ERROR org.apache.solr.update.CommitTracker  – auto commit 
error...:java.lang.IllegalStateException: this writer hit an 
OutOfMemoryError; cannot commit
at 
org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
at 
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)

at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)

at java.lang.Thread.run(Thread.java:682)

Considering that the machine is configured with 48G (24G for the JVM, 
which will be reduced in the future), I'm wondering how it would still 
go out of memory. For memory-mapped index files, the remaining 24G (or 
whatever portion of it is free) should be available. Looking at the 
lsof output, the memory-mapped files were around 10G.


Thanks.


On 10/5/15 5:41 PM, Mark Miller wrote:

I'd make two guess:

Looks like you are using JRockit? I don't think that is common or well
tested at this point.

There are a billion or so bug fixes from 4.6.1 to 5.3.2. Given the pace of
SolrCloud, you are dealing with something fairly ancient and so it will be
harder to find help with older issues most likely.

- Mark

On Mon, Oct 5, 2015 at 12:46 PM Rallavagu <rallav...@gmail.com> wrote:


Any takers on this? Any kinda clue would help. Thanks.

On 10/4/15 10:14 AM, Rallavagu wrote:

As there were no responses so far, I assume that this is not a very
common issue that folks come across. So, I went into source (4.6.1) to
see if I can figure out what could be the cause.


The thread that is locking is in this block of code

synchronized (recoveryLock) {
// to be air tight we must also check after lock
if (cc.isShutDown()) {
  log.warn("Skipping recovery because Solr is shutdown");
  return;
}
log.info("Running recovery - first canceling any ongoing

recovery");

cancelRecovery();

while (recoveryRunning) {
  try {
recoveryLock.wait(1000);
  } catch (InterruptedException e) {

  }
  // check again for those that were waiting
  if (cc.isShutDown()) {
log.warn("Skipping recovery because Solr is shutdown");
return;
  }
  if (closed) return;
}

Subsequently, the thread will get into cancelRecovery method as below,

public void cancelRecovery() {
  synchronized (recoveryLock) {
if (recoveryStrat != null && recoveryRunning) {
  recoveryStrat.close();
  while (true) {
try {
  recoveryStrat.join();
} catch (InterruptedException e) {
  // not interruptible - keep waiting
  continue;
}
break;
  }

  recoveryRunning = false;
  recoveryLock.notifyAll();
}
  }
}

As per the stack trace "recoveryStrat.join()" is where things are
holding up.

I wonder why/how cancelRecovery would take time so around 870 threads
would be waiting on. Is it possible that ZK is not responding or
something else like Operating System resources could cause this? Thanks.


On 10/2/15 4:17 PM, Rallavagu wrote:

Here is the stack trace of the thread that is holding the lock.


"Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
native_blocked, daemon
  -- Waiting for notification on:
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
  at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
  at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
  at
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
  at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
  at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
  at


RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a



  at
jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native
Method)
  at java/lang/Object.wait(J)V(Native Method)
  at java/lang/Thread.join(Thread.java:1206)
  ^-- Lock released while waiting:
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
  at ja

Re: Recovery Thread Blocked

2015-10-06 Thread Rallavagu
GC logging shows normal behavior. The "OutOfMemoryError" appears to 
pertain to a thread, not to the JVM heap.


On 10/6/15 1:07 PM, Mark Miller wrote:

That amount of RAM can easily be eaten up depending on your sorting,
faceting, data.

Do you have gc logging enabled? That should describe what is happening with
the heap.

- Mark

On Tue, Oct 6, 2015 at 4:04 PM Rallavagu <rallav...@gmail.com> wrote:


Mark - currently 5.3 is being evaluated for upgrade purposes and
hopefully get there sooner. Meanwhile, following exception is noted from
logs during updates

ERROR org.apache.solr.update.CommitTracker  – auto commit
error...:java.lang.IllegalStateException: this writer hit an
OutOfMemoryError; cannot commit
  at

org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
  at
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
  at

org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
  at
org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
  at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
  at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
  at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
  at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
  at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
  at java.lang.Thread.run(Thread.java:682)

Considering the fact that the machine is configured with 48G (24G for
JVM which will be reduced in future) wondering how would it still go out
of memory. For memory mapped index files the remaining 24G or what is
available off of it should be available. Looking at the lsof output the
memory mapped files were around 10G.

Thanks.


On 10/5/15 5:41 PM, Mark Miller wrote:

I'd make two guess:

Looks like you are using JRockit? I don't think that is common or well
tested at this point.

There are a billion or so bug fixes from 4.6.1 to 5.3.2. Given the pace

of

SolrCloud, you are dealing with something fairly ancient and so it will

be

harder to find help with older issues most likely.

- Mark

On Mon, Oct 5, 2015 at 12:46 PM Rallavagu <rallav...@gmail.com> wrote:


Any takers on this? Any kinda clue would help. Thanks.

On 10/4/15 10:14 AM, Rallavagu wrote:

As there were no responses so far, I assume that this is not a very
common issue that folks come across. So, I went into source (4.6.1) to
see if I can figure out what could be the cause.


The thread that is locking is in this block of code

synchronized (recoveryLock) {
 // to be air tight we must also check after lock
 if (cc.isShutDown()) {
   log.warn("Skipping recovery because Solr is shutdown");
   return;
 }
 log.info("Running recovery - first canceling any ongoing

recovery");

 cancelRecovery();

 while (recoveryRunning) {
   try {
 recoveryLock.wait(1000);
   } catch (InterruptedException e) {

   }
   // check again for those that were waiting
   if (cc.isShutDown()) {
 log.warn("Skipping recovery because Solr is shutdown");
 return;
   }
   if (closed) return;
 }

Subsequently, the thread will get into cancelRecovery method as below,

public void cancelRecovery() {
   synchronized (recoveryLock) {
 if (recoveryStrat != null && recoveryRunning) {
   recoveryStrat.close();
   while (true) {
 try {
   recoveryStrat.join();
 } catch (InterruptedException e) {
   // not interruptible - keep waiting
   continue;
 }
 break;
   }

   recoveryRunning = false;
   recoveryLock.notifyAll();
 }
   }
 }

As per the stack trace "recoveryStrat.join()" is where things are
holding up.

I wonder why/how cancelRecovery would take time so around 870 threads
would be waiting on. Is it possible that ZK is not responding or
something else like Operating System resources could cause this?

Thanks.



On 10/2/15 4:17 PM, Rallavagu wrote:

Here is the stack trace of the thread that is holding the lock.


"Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
native_blocked, daemon
   -- Waiting for notification on:
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
   at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
   at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
   at
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
   at syncWaitForSignal+189(synchronization.c:85)@0x7ff3

Re: Recovery Thread Blocked

2015-10-06 Thread Rallavagu

It is a Java thread though. Does that require increasing OS-level 
thread limits?
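
If it helps, a minimal sketch (plain JDK, nothing Solr-specific) that 
shows this flavor of OutOfMemoryError coming from thread/OS limits 
rather than from the heap; run it only on a disposable machine:

// Minimal sketch: keep starting parked daemon threads until the JVM
// throws java.lang.OutOfMemoryError (typically "unable to create new
// native thread" on HotSpot; the exact message can differ by JVM). The
// heap can be nearly empty when this happens; the ceiling comes from OS
// limits such as ulimit -u, not from -Xmx.
public class ThreadLimitProbe {
    public static void main(String[] args) {
        long count = 0;
        try {
            while (true) {
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        try {
                            Thread.sleep(Long.MAX_VALUE);   // park forever
                        } catch (InterruptedException ignored) {
                        }
                    }
                });
                t.setDaemon(true);
                t.start();
                count++;
            }
        } catch (OutOfMemoryError e) {
            System.out.println("Created " + count + " threads before: " + e);
        }
    }
}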

On 10/6/15 6:21 PM, Mark Miller wrote:

If it's a thread and you have plenty of RAM and the heap is fine, have you
checked raising OS thread limits?

- Mark

On Tue, Oct 6, 2015 at 4:54 PM Rallavagu <rallav...@gmail.com> wrote:


GC logging shows normal. The "OutOfMemoryError" appears to be pertaining
to a thread but not to JVM.

On 10/6/15 1:07 PM, Mark Miller wrote:

That amount of RAM can easily be eaten up depending on your sorting,
faceting, data.

Do you have gc logging enabled? That should describe what is happening

with

the heap.

- Mark

On Tue, Oct 6, 2015 at 4:04 PM Rallavagu <rallav...@gmail.com> wrote:


Mark - currently 5.3 is being evaluated for upgrade purposes and
hopefully get there sooner. Meanwhile, following exception is noted from
logs during updates

ERROR org.apache.solr.update.CommitTracker  – auto commit
error...:java.lang.IllegalStateException: this writer hit an
OutOfMemoryError; cannot commit
   at



org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)

   at


org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)

   at



org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)

   at
org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
   at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
   at



java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)

   at



java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)

   at



java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)

   at



java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)

   at java.lang.Thread.run(Thread.java:682)

Considering the fact that the machine is configured with 48G (24G for
JVM which will be reduced in future) wondering how would it still go out
of memory. For memory mapped index files the remaining 24G or what is
available off of it should be available. Looking at the lsof output the
memory mapped files were around 10G.

Thanks.


On 10/5/15 5:41 PM, Mark Miller wrote:

I'd make two guess:

Looks like you are using JRockit? I don't think that is common or well
tested at this point.

There are a billion or so bug fixes from 4.6.1 to 5.3.2. Given the pace

of

SolrCloud, you are dealing with something fairly ancient and so it will

be

harder to find help with older issues most likely.

- Mark

On Mon, Oct 5, 2015 at 12:46 PM Rallavagu <rallav...@gmail.com> wrote:


Any takers on this? Any kinda clue would help. Thanks.

On 10/4/15 10:14 AM, Rallavagu wrote:

As there were no responses so far, I assume that this is not a very
common issue that folks come across. So, I went into source (4.6.1)

to

see if I can figure out what could be the cause.


The thread that is locking is in this block of code

synchronized (recoveryLock) {
  // to be air tight we must also check after lock
  if (cc.isShutDown()) {
log.warn("Skipping recovery because Solr is shutdown");
return;
  }
  log.info("Running recovery - first canceling any ongoing

recovery");

  cancelRecovery();

  while (recoveryRunning) {
try {
  recoveryLock.wait(1000);
} catch (InterruptedException e) {

}
// check again for those that were waiting
if (cc.isShutDown()) {
  log.warn("Skipping recovery because Solr is shutdown");
  return;
}
if (closed) return;
  }

Subsequently, the thread will get into cancelRecovery method as

below,


public void cancelRecovery() {
synchronized (recoveryLock) {
  if (recoveryStrat != null && recoveryRunning) {
recoveryStrat.close();
while (true) {
  try {
recoveryStrat.join();
  } catch (InterruptedException e) {
// not interruptible - keep waiting
continue;
  }
  break;
}

recoveryRunning = false;
recoveryLock.notifyAll();
  }
}
  }

As per the stack trace "recoveryStrat.join()" is where things are
holding up.

I wonder why/how cancelRecovery would take time so around 870 threads
would be waiting on. Is it possible that ZK is not responding or
something else like Operating System resources could cause this?

Thanks.



On 10/2/15 4:17 PM, Rallavagu wrote:

Here is the stack trace of the thread that is holding the lock.


"Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
native_blocked,

Re: Recovery Thread Blocked

2015-10-05 Thread Rallavagu

Any takers on this? Any kind of clue would help. Thanks.

On 10/4/15 10:14 AM, Rallavagu wrote:

As there were no responses so far, I assume that this is not a very
common issue that folks come across. So, I went into source (4.6.1) to
see if I can figure out what could be the cause.


The thread that is locking is in this block of code

synchronized (recoveryLock) {
   // to be air tight we must also check after lock
   if (cc.isShutDown()) {
 log.warn("Skipping recovery because Solr is shutdown");
 return;
   }
   log.info("Running recovery - first canceling any ongoing recovery");
   cancelRecovery();

   while (recoveryRunning) {
 try {
   recoveryLock.wait(1000);
 } catch (InterruptedException e) {

 }
 // check again for those that were waiting
 if (cc.isShutDown()) {
   log.warn("Skipping recovery because Solr is shutdown");
   return;
 }
 if (closed) return;
   }

Subsequently, the thread will get into cancelRecovery method as below,

public void cancelRecovery() {
 synchronized (recoveryLock) {
   if (recoveryStrat != null && recoveryRunning) {
 recoveryStrat.close();
 while (true) {
   try {
 recoveryStrat.join();
   } catch (InterruptedException e) {
 // not interruptible - keep waiting
 continue;
   }
   break;
 }

 recoveryRunning = false;
 recoveryLock.notifyAll();
   }
 }
   }

As per the stack trace "recoveryStrat.join()" is where things are
holding up.

I wonder why/how cancelRecovery would take time so around 870 threads
would be waiting on. Is it possible that ZK is not responding or
something else like Operating System resources could cause this? Thanks.


On 10/2/15 4:17 PM, Rallavagu wrote:

Here is the stack trace of the thread that is holding the lock.


"Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
native_blocked, daemon
 -- Waiting for notification on:
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
 at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
 at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
 at
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
 at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
 at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
 at
RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a


 at
jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native
Method)
 at java/lang/Object.wait(J)V(Native Method)
 at java/lang/Thread.join(Thread.java:1206)
 ^-- Lock released while waiting:
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
 at java/lang/Thread.join(Thread.java:1259)
 at
org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)


 ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
 at
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)


 ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
 at
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)


 at jrockit/vm/RNI.c2java(J)V(Native Method)


Stack trace of one of the 870 threads that is waiting for the lock to be
released.

"Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
native_blocked, daemon
 -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat
lock]
 at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
 at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
 at
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
 at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
 at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
 at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
 at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
 at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
 at
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
 at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
 at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
 at
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)


 at
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)


 at jrockit/vm/RNI.c2java(J)V(Native Method)

On 10/2/15 4:12 PM, Rallavagu wrote:

Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper

During updates, some nodes are going very high cpu and becomes
unavailable. The thread dump shows the following thread is blocked 870
threads which explains high CPU. Any clues on where to lo

Re: Recovery Thread Blocked

2015-10-04 Thread Rallavagu
As there have been no responses so far, I assume this is not a very 
common issue that folks come across. So I went into the source (4.6.1) 
to see if I can figure out what the cause could be.



The thread that is locking is in this block of code

synchronized (recoveryLock) {
  // to be air tight we must also check after lock
  if (cc.isShutDown()) {
log.warn("Skipping recovery because Solr is shutdown");
return;
  }
  log.info("Running recovery - first canceling any ongoing recovery");
  cancelRecovery();

  while (recoveryRunning) {
try {
  recoveryLock.wait(1000);
} catch (InterruptedException e) {

}
// check again for those that were waiting
if (cc.isShutDown()) {
  log.warn("Skipping recovery because Solr is shutdown");
  return;
}
if (closed) return;
  }

Subsequently, the thread will get into cancelRecovery method as below,

public void cancelRecovery() {
synchronized (recoveryLock) {
  if (recoveryStrat != null && recoveryRunning) {
recoveryStrat.close();
while (true) {
  try {
recoveryStrat.join();
  } catch (InterruptedException e) {
// not interruptible - keep waiting
continue;
  }
  break;
}

recoveryRunning = false;
recoveryLock.notifyAll();
  }
}
  }

As per the stack trace, "recoveryStrat.join()" is where things are 
holding up.


I wonder why/how cancelRecovery would take so long that around 870 
threads end up waiting on it. Is it possible that ZK is not responding, 
or could something else like operating system resources cause this? 
Thanks.
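
To make the pile-up pattern concrete, here is a toy sketch (not Solr 
code) of what the dump seems to show: one thread holds a monitor while 
join()ing a worker that never finishes, so every other thread that needs 
the same monitor stacks up as BLOCKED behind it:

// Toy reproduction of the pattern in the dump: the "canceller" holds
// recoveryLock while joining a slow worker, so the other threads that
// also synchronize on recoveryLock pile up in BLOCKED state (like the
// ~870 doRecovery threads above) until the worker exits.
public class JoinWhileHoldingLock {
    private static final Object recoveryLock = new Object();

    public static void main(String[] args) throws Exception {
        final Thread slowWorker = new Thread(new Runnable() {
            public void run() {
                try {
                    Thread.sleep(60000);   // stands in for a recovery that will not finish
                } catch (InterruptedException ignored) {
                }
            }
        });
        slowWorker.start();

        new Thread(new Runnable() {
            public void run() {
                synchronized (recoveryLock) {     // holds the lock...
                    try {
                        slowWorker.join();        // ...while waiting on the worker
                    } catch (InterruptedException ignored) {
                    }
                }
            }
        }).start();

        Thread.sleep(200);                        // let the canceller grab the lock first

        for (int i = 0; i < 5; i++) {             // these all block on recoveryLock
            new Thread(new Runnable() {
                public void run() {
                    synchronized (recoveryLock) {
                        System.out.println(Thread.currentThread().getName()
                                + " finally got the lock");
                    }
                }
            }).start();
        }
    }
}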



On 10/2/15 4:17 PM, Rallavagu wrote:

Here is the stack trace of the thread that is holding the lock.


"Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
native_blocked, daemon
 -- Waiting for notification on:
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
 at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
 at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
 at
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
 at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
 at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
 at
RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a

 at
jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
 at java/lang/Object.wait(J)V(Native Method)
 at java/lang/Thread.join(Thread.java:1206)
 ^-- Lock released while waiting:
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
 at java/lang/Thread.join(Thread.java:1259)
 at
org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)

 ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
 at
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)

 ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
 at
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)

 at jrockit/vm/RNI.c2java(J)V(Native Method)


Stack trace of one of the 870 threads that is waiting for the lock to be
released.

"Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
native_blocked, daemon
 -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
 at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
 at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
 at
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
 at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
 at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
 at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
 at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
 at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
 at
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
 at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
 at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
 at
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)

 at
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)

 at jrockit/vm/RNI.c2java(J)V(Native Method)

On 10/2/15 4:12 PM, Rallavagu wrote:

Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper

During updates, some nodes go to very high CPU and become unavailable.
The thread dump shows the following blocked thread; 870 threads are blocked
like this, which explains the high CPU. Any clues on where to look?

"Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
native_blocked, daemon
 -- Blocked trying

Re: Zk and Solr Cloud

2015-10-02 Thread Rallavagu

Thanks Shawn.

Right. That is a great insight into the issue. We ended up clearing the 
overseer queue and then the cloud became normal.


We were running a Solr indexing process and wondering if that caused the 
queue to grow. Will Solr (the leader) add a work entry to ZooKeeper for 
every update? If not, what are those work entries?


Thanks

On 10/1/15 10:58 PM, Shawn Heisey wrote:

On 10/1/2015 1:26 PM, Rallavagu wrote:

Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.

See following errors in ZK and Solr and they are connected.

When I see the following error in Zookeeper,

unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len11823809 is out of range!


This is usually caused by the overseer queue (stored in zookeeper)
becoming extraordinarily huge, because it's being flooded with work
entries far faster than the overseer can process them.  This causes the
znode where the queue is stored to become larger than the maximum size
for a znode, which defaults to about 1MB.  In this case (reading your
log message that says len11823809), something in zookeeper has gotten to
be 11MB in size, so the zookeeper client cannot read it.

I think the zookeeper server code must be handling the addition of
children to the queue znode through a code path that doesn't pay
attention to the maximum buffer size, just goes ahead and adds it,
probably by simply appending data.  I'm unfamiliar with how the ZK
database works, so I'm guessing here.

If I'm right about where the problem is, there are two workarounds to
your immediate issue.

1) Delete all the entries in your overseer queue using a zookeeper
client that lets you edit the DB directly.  If you haven't changed the
cloud structure and all your servers are working, this should be safe.

2) Set the jute.maxbuffer system property on the startup commandline for
all ZK servers and all ZK clients (Solr instances) to a size that's
large enough to accommodate the huge znode.  In order to do the deletion
mentioned in option 1 above, you might need to increase jute.maxbuffer on
the servers and the client you use for the deletion.

These are just workarounds.  Whatever caused the huge queue in the first
place must be addressed.  It is frequently a performance issue.  If you
go to the following link, you will see that jute.maxbuffer is considered
an unsafe option:

http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#Unsafe+Options
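
A sketch of what those two workarounds can look like from the command line
(ZooKeeper 3.4's zkCli.sh; the ensemble address, buffer size, and startup
variables are illustrative, not taken from this setup):

  # 1) Clear the overseer queue (only if the cluster layout is otherwise
  #    stable); the client itself needs a bigger jute.maxbuffer just to
  #    list the oversized znode.
  CLIENT_JVMFLAGS="-Djute.maxbuffer=20000000" bin/zkCli.sh -server zk1:2181 rmr /overseer/queue

  # 2) Raise the limit on every ZK server and every Solr node (Tomcat here).
  export SERVER_JVMFLAGS="-Djute.maxbuffer=20000000"                # zkEnv.sh / zookeeper-env.sh
  export CATALINA_OPTS="$CATALINA_OPTS -Djute.maxbuffer=20000000"   # Tomcat setenv.sh for Solr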

In Jira issue SOLR-7191, I wrote the following in one of my comments:

"The giant queue I encountered was about 85 entries, and resulted in
a packet length of a little over 14 megabytes. If I divide 85 by 14,
I know that I can have about 6 overseer queue entries in one znode
before jute.maxbuffer needs to be increased."

https://issues.apache.org/jira/browse/SOLR-7191?focusedCommentId=14347834

Thanks,
Shawn



Re: Zk and Solr Cloud

2015-10-02 Thread Rallavagu

Thanks for the insight into this Erick. Thanks.

On 10/2/15 8:58 AM, Erick Erickson wrote:

Rallavagu:

Absent nodes going up and down or otherwise changing state, Zookeeper
isn't involved in the normal operations of Solr (adding docs,
querying, all that). That said, things that change the state of the
Solr nodes _do_ involve Zookeeper and the Overseer. The Overseer is
used to serialize and control changing information in the
clusterstate.json (or state.json) and others. If the nodes all tried
to write to Zk directly, it's hard to coordinate. That's a little
simplistic and counterintuitive, but maybe this will help.

When a Solr instance starts up it
1> registers itself as live with ZK
2> creates a listener that ZK pings when there's a state change (some
node goes up or down, goes into recovery, gets added, whatever).
3> gets the current cluster state from ZK.

Thereafter, this particular node doesn't need to ask ZK for anything.
It knows the current topology of cluster and can route requests (index
or query) to the correct Solr replica etc.

Now, let's claim that "something changes". Solr stops on one of the
nodes. Or someone adds a collection. Or. The overseer usually gets
involved in changing the state on ZK for this new action. Part of that
is that ZK sends an event to all the Solr nodes that have registered
themselves as listeners that causes them to ask ZK for the current
state of the cluster, and each Solr node adjusts its actions based on
this information. Note the kind of thing here that changes and
triggers this is that a whole replica becomes able or unable to carry
out its functions, NOT that the some collection gets another doc added
or answers a query.

Zk also periodically pings each Solr instance that's registered itself
and, if the node fails to respond may force it into recovery & etc.
Again, though, that has nothing to do with standard Solr operations.

So a massive overseer queue tends to indicate that there's a LOT of
state changes, lots of nodes going up and down etc. One implication of
the above is that if you turn on all your nodes in a large cluster at
the same time, there'll be a LOT of activity; they'll all register
themselves, try to elect leaders for shards, go into/out of recovery,
become active; all these are things that trigger overseer activity.

Or there are simply bugs in how the overseer works in the version
you're using, I know there's been a lot of effort to harden that area
over the various versions.

Two things that are "interesting".
1> Only one of your Solr instances hosts the overseer. If you're doing
a restart of _all_ your boxes, it's advisable to bounce the node
that's the overseer _last_. Otherwise you risk an odd situation: the
overseer is elected and starts to work, that node restarts which
causes the overseer role to switch to another node which immediately
is bounced and a new overseer is elected and

2> As of 5.x, there are two ZK formats
a> the "old" format where the entire clusterstate for all collections
is kept in a single ZK node (/clusterstate.json)
b> the "new" format where each collection has its own state.json that
only contains the state for that collection.

This is very helpful when you have many collections. In the a> case, any
time _any_ node changes, _all_ nodes have to get a new state. In b>,
only the nodes involved in a single collection need to get new
information when any node in _that_ collection changes.

FWIW,
Erick



On Fri, Oct 2, 2015 at 8:03 AM, Ravi Solr <ravis...@gmail.com> wrote:

Awesome nugget Shawn, I also faced a similar issue a while ago while I was
doing a full re-index. It would be great if such tips were added into
FAQ-type documentation on cwiki. I love the SOLR forum; every day I learn
something new :-)

Thanks

Ravi Kiran Bhaskar

On Fri, Oct 2, 2015 at 1:58 AM, Shawn Heisey <apa...@elyograg.org> wrote:


On 10/1/2015 1:26 PM, Rallavagu wrote:

Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.

See following errors in ZK and Solr and they are connected.

When I see the following error in Zookeeper,

unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len11823809 is out of range!


This is usually caused by the overseer queue (stored in zookeeper)
becoming extraordinarily huge, because it's being flooded with work
entries far faster than the overseer can process them.  This causes the
znode where the queue is stored to become larger than the maximum size
for a znode, which defaults to about 1MB.  In this case (reading your
log message that says len11823809), something in zookeeper has gotten to
be 11MB in size, so the zookeeper client cannot read it.

I think the zookeeper server code must be handling the addition of
children to the queue znode through a code path that doesn't pay
attention to the maximum buffer size, just goes ahead and adds it,
probably by simply appending data.

Recovery Thread Blocked

2015-10-02 Thread Rallavagu

Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper

During updates, some nodes go to very high CPU and become unavailable. 
The thread dump shows the following blocked thread; 870 threads are blocked 
like this, which explains the high CPU. Any clues on where to look?


"Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked, 
native_blocked, daemon

-- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
at 
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2

at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
at 
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]

at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
at 
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
at 
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)

at jrockit/vm/RNI.c2java(J)V(Native Method)


Re: Recovery Thread Blocked

2015-10-02 Thread Rallavagu

Here is the stack trace of the thread that is holding the lock.


"Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting, 
native_blocked, daemon
-- Waiting for notification on: 
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]

at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
at 
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2

at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
at 
RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
at 
jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)

at java/lang/Object.wait(J)V(Native Method)
at java/lang/Thread.join(Thread.java:1206)
^-- Lock released while waiting: 
org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]

at java/lang/Thread.join(Thread.java:1259)
at 
org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)

^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
at 
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)

^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
at 
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)

at jrockit/vm/RNI.c2java(J)V(Native Method)


Stack trace of one of the 870 threads that is waiting for the lock to be 
released.


"Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked, 
native_blocked, daemon

-- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
at 
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2

at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
at 
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]

at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
at 
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
at 
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)

at jrockit/vm/RNI.c2java(J)V(Native Method)

On 10/2/15 4:12 PM, Rallavagu wrote:

Solr 4.6.1 on Tomcat 7, single shard 4 node cloud with 3 node zookeeper

During updates, some nodes go to very high CPU and become unavailable.
The thread dump shows the following blocked thread; 870 threads are blocked
like this, which explains the high CPU. Any clues on where to look?

"Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
native_blocked, daemon
 -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
 at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
 at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
 at
syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
 at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
 at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
 at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
 at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
 at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
 at
jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
 at
jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
 at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
 at
org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)

 at
org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)

 at jrockit/vm/RNI.c2java(J)V(Native Method)


PoolingClientConnectionManager

2015-10-01 Thread Rallavagu

Solr 4.6.1, single Shard, cloud with 4 nodes

Solr is running on Tomcat configured with 200 threads for its thread pool. 
As Solr uses "org.apache.http.impl.conn.PoolingClientConnectionManager" 
for replication, my question is: do the Solr threads use connections from 
the Tomcat thread pool, or do they create their own thread pool? I am trying 
to find out if it would be 200 + Solr threads or not. Thanks.


Re: PoolingClientConnectionManager

2015-10-01 Thread Rallavagu

Thanks for the response Andrea.

Assuming that Solr has its own thread pool, it appears that 
"PoolingClientConnectionManager" has a maximum of 20 connections per host 
by default. Is there a way to increase this to handle heavy update 
traffic? Thanks.




On 10/1/15 11:05 AM, Andrea Gazzarini wrote:

Hi,
Maybe I could be wrong as your question is related with Solr internals (I
believe the dev list is a better candidate for such questions).

Anyway, my thoughts: unless you're within a JCA inbound component (and Solr
isn't), the JEE specs say you shouldn' start new threads. For this  reason,
there's no a (standard) way to directly connect to and use the servlet
container threads.

As far as I know Solr 4.x is a standard and JEE compliant web application
so the answer to your question *should* be: "yes, it is using its own
threads"

Best,
Andrea
Solr 4.6.1, single Shard, cloud with 4 nodes

Solr is running on Tomcat configured with 200 threads for thread pool. As
Solr uses "org.apache.http.impl.conn.PoolingClientConnectionManager" for
replication, my question is does Solr threads use connections from tomcat
thread pool or they create their own thread pool? I am trying to find out
if it would be 200 + Solr threads or not. Thanks.



Re: PoolingClientConnectionManager

2015-10-01 Thread Rallavagu

Thanks Shawn. This is good data.

On 10/1/15 11:43 AM, Shawn Heisey wrote:

On 10/1/2015 11:50 AM, Rallavagu wrote:

Solr 4.6.1, single Shard, cloud with 4 nodes

Solr is running on Tomcat configured with 200 threads for thread pool.
As Solr uses
"org.apache.http.impl.conn.PoolingClientConnectionManager" for
replication, my question is does Solr threads use connections from
tomcat thread pool or they create their own thread pool? I am trying
to find out if it would be 200 + Solr threads or not. Thanks.


I don't know the answer to the actual question you have asked ... but I
do know that keeping the container maxThreads at 200 can cause serious
problems for Solr.  It does not take a very big installation to exceed
200 threads, and users have had problems fixed by increasing
maxThreads.  This implies that the container is able to control the
threads in Solr to some degree.

The Jetty included with all versions of Solr that I have actually
checked (back to 3.2.0) has maxThreads set to 10000, which effectively
removes the thread limit for any typical install.  Very large installs
might need it bumped higher than 10000.
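
For a Tomcat install, that limit lives on the HTTP Connector in
conf/server.xml; a minimal sketch where maxThreads is the only point
(port, protocol, and timeout are whatever the install already uses):

  <Connector port="8080" protocol="HTTP/1.1"
             maxThreads="10000"
             connectionTimeout="20000" />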

Thanks,
Shawn



Re: PoolingClientConnectionManager

2015-10-01 Thread Rallavagu

Awesome. This is what I was looking for. Will try these. Thanks.

On 10/1/15 1:31 PM, Shawn Heisey wrote:

On 10/1/2015 12:39 PM, Rallavagu wrote:

Thanks for the response Andrea.

Assuming that Solr has it's own thread pool, it appears that
"PoolingClientConnectionManager" has a maximum 20 threads per host as
default. Is there a way to changes this increase to handle heavy
update traffic? Thanks.


You can configure all ShardHandler instances with the solr.xml file.
The shard handler controls SolrJ (and HttpClient) within Solr.

https://cwiki.apache.org/confluence/display/solr/Moving+to+the+New+solr.xml+Format

That page does not go into all the shard handler options, though.  For
that, you need to look at the page for distributed requests ... but
don't configure it in solrconfig.xml as the following link shows,
configure it in solr.xml as shown by the earlier link.

https://cwiki.apache.org/confluence/display/solr/Distributed+Requests#DistributedRequests-ConfiguringtheShardHandlerFactory
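
Putting the two links together, a fragment of the new-style solr.xml might
look roughly like this; the numbers are placeholders to tune, not
recommendations (maxConnectionsPerHost is the one that defaults to 20):

  <solr>
    <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
      <int name="maxConnectionsPerHost">100</int>
      <int name="socketTimeout">600000</int>
      <int name="connTimeout">60000</int>
    </shardHandlerFactory>
  </solr>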

Thanks,
Shawn



Zk and Solr Cloud

2015-10-01 Thread Rallavagu

Solr 4.6.1 single shard with 4 nodes. Zookeeper 3.4.5 ensemble of 3.

See following errors in ZK and Solr and they are connected.

When I see the following error in Zookeeper,

unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len11823809 is out of range!
at 
org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:355)
at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)



There is the following corresponding error in Solr

caught end of stream exception
EndOfStreamException: Unable to read additional data from client 
sessionid 0x25024c8ea0e, likely client has closed socket
at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)

at java.lang.Thread.run(Thread.java:744)

Any clues as to what is causing these errors. Thanks.


Re: Solr 4.6.1 Cloud Stops Replication

2015-08-31 Thread Rallavagu

Erick,

Apologies for missing an update on the status of the indexing (replication) 
issues, since I originally started this thread. After implementing 
CloudSolrServer instead of ConcurrentUpdateSolrServer things were much 
better. I simply wanted to follow up to understand the memory behavior 
better, though we tuned both heap and physical memory a while ago.


Thanks

On 8/24/15 9:09 AM, Erick Erickson wrote:

bq: As a follow up, the default is set to "NRTCachingDirectoryFactory"
for DirectoryFactory but not MMapDirectory. It is mentioned that
NRTCachingDirectoryFactory "caches small files in memory for better
NRT performance".

NRTCachingDirectoryFactory also uses MMapDirectory under the covers as
well as "caches small files in memory"
so you really can't separate out the two.

I didn't mention this explicitly, but your original problem should
_not_ be happening in a well-tuned
system. Why your nodes go into a down state needs to be understood.
The connection timeout is
the only clue so far, and the usual reason here is that very long GC
pauses are happening. If this
continually happens, you might try turning on GC reporting options.
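
For example, assuming a HotSpot JVM (the log path is a placeholder), GC
reporting can be switched on with flags like:

  -verbose:gc -Xloggc:/var/log/solr/gc.log
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps
  -XX:+PrintGCApplicationStoppedTime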

Best,
Erick


On Mon, Aug 24, 2015 at 2:47 AM, Rallavagu <rallav...@gmail.com> wrote:

As a follow up, the default is set to "NRTCachingDirectoryFactory" for
DirectoryFactory but not MMapDirectory. It is mentioned that
NRTCachingDirectoryFactory "caches small files in memory for better NRT
performance".

Wondering if this would also consume physical memory to the same extent as
the MMap directory. Thoughts?

On 8/18/15 9:29 AM, Erick Erickson wrote:


Couple of things:

1> Here's an excellent backgrounder for MMapDirectory, which is
what makes it appear that Solr is consuming all the physical memory

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

2> It's possible that your transaction log was huge. Perhaps not likely,
but possible. If Solr abnormally terminates (kill -9 is a prime way to do
this),
then upon restart the transaction log is replayed. This log is rolled over
upon
every hard commit (openSearcher true or false doesn't matter). So, in the
scenario where you are indexing a whole lot of stuff without committing,
then
it can take a very long time to replay the log. Not only that, but as you
do
replay the log, any incoming updates are written to the end of the tlog..
That
said, nothing in your e-mails indicates this could be a problem and it's
frankly not consistent with the errors you _do_ report but I thought
I'd mention it.
See:
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
You can avoid the possibility of this by configuring your autoCommit
interval
to be relatively short (say 60 seconds) with openSearcher=false

3> ConcurrentUpdateSolrServer isn't the best thing for bulk loading
SolrCloud,
CloudSolrServer (renamed CloudSolrClient in 5.x) is better. CUSS sends all
the docs to some node, and from there that node figures out which
shard each doc belongs on and forwards the doc (actually in batches) to
the
appropriate leader. So doing what you're doing creates a lot of cross
chatter
amongst nodes. CloudSolrServer/Client figures that out on the client side
and
only sends packets to each leader that consist of only the docs that belong
on that shard. You can get nearly linear throughput with increasing numbers
of shards this way.

Best,
Erick
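
For reference, Erick's point 2> about keeping the tlog small translates to a
solrconfig.xml fragment roughly like the following; the 60-second value is
just the example from his mail:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>
    <autoCommit>
      <maxTime>60000</maxTime>           <!-- hard commit every 60s, rolls the tlog -->
      <openSearcher>false</openSearcher> <!-- don't open a new searcher on hard commit -->
    </autoCommit>
  </updateHandler>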

On Tue, Aug 18, 2015 at 9:03 AM, Rallavagu <rallav...@gmail.com> wrote:


Thanks Shawn.

All participating cloud nodes are running Tomcat and as you suggested
will
review the number of threads and increase them as needed.

Essentially, what I have noticed was that two of four nodes caught up
with
"bulk" updates instantly while other two nodes took almost 3 hours to
completely in sync with "leader". I have "tickled" other nodes by sending
an
update thinking that it would initiate the replication but not sure if
that
caused other two nodes to eventually catch up.

On similar note, I was using "CouncurrentUpdateSolrServer" directly
pointing
to leader to bulk load Solr cloud. I have configured the chunk size and
thread count for the same. Is this the right practice to bulk load
SolrCloud?

Also, the maximum number of connections per host parameter for
"HttpShardHandler" is in solrconfig.xml I suppose?

Thanks



On 8/18/15 8:28 AM, Shawn Heisey wrote:



On 8/18/2015 8:18 AM, Rallavagu wrote:



Thanks for the response. Does this cache behavior influence the delay
in catching up with cloud? How can we explain solr cloud replication
and what are the option to monitor and take proactive action (such as
initializing, pausing etc) if needed?




I don't know enough about your setup to speculate.

I did notice this exception in a previous reply:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for
connection fro

Re: Solr 4.6.1 Cloud Stops Replication

2015-08-24 Thread Rallavagu
As a follow up, the default is set to NRTCachingDirectoryFactory for 
DirectoryFactory but not MMapDirectory. It is mentioned that 
NRTCachingDirectoryFactory caches small files in memory for better NRT 
performance.


Wondering if this would also consume physical memory to the same extent 
as the MMap directory. Thoughts?


On 8/18/15 9:29 AM, Erick Erickson wrote:

Couple of things:

1 Here's an excellent backgrounder for MMapDirectory, which is
what makes it appear that Solr is consuming all the physical memory

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

2 It's possible that your transaction log was huge. Perhaps not likely,
but possible. If Solr abnormally terminates (kill -9 is a prime way to do this),
then upon restart the transaction log is replayed. This log is rolled over upon
every hard commit (openSearcher true or false doesn't matter). So, in the
scenario where you are indexing a whole lot of stuff without committing, then
it can take a very long time to replay the log. Not only that, but as you do
replay the log, any incoming updates are written to the end of the tlog.. That
said, nothing in your e-mails indicates this could be a problem and it's
frankly not consistent with the errors you _do_ report but I thought
I'd mention it.
See: 
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
You can avoid the possibility of this by configuring your autoCommit interval
to be relatively short (say 60 seconds) with openSearcher=false

3 ConcurrentUpdateSolrServer isn't the best thing for bulk loading SolrCloud,
CloudSolrServer (renamed CloudSolrClient in 5.x) is better. CUSS sends all
the docs to some node, and from there that node figures out which
shard each doc belongs on and forwards the doc (actually in batches) to the
appropriate leader. So doing what you're doing creates a lot of cross chatter
amongst nodes. CloudSolrServer/Client figures that out on the client side and
only sends packets to each leader that consist of only the docs that belong on
that shard. You can get nearly linear throughput with increasing numbers of
shards this way.

Best,
Erick

On Tue, Aug 18, 2015 at 9:03 AM, Rallavagu rallav...@gmail.com wrote:

Thanks Shawn.

All participating cloud nodes are running Tomcat and as you suggested will
review the number of threads and increase them as needed.

Essentially, what I have noticed was that two of four nodes caught up with
bulk updates instantly while other two nodes took almost 3 hours to
completely in sync with leader. I have tickled other nodes by sending an
update thinking that it would initiate the replication but not sure if that
caused other two nodes to eventually catch up.

On similar note, I was using CouncurrentUpdateSolrServer directly pointing
to leader to bulk load Solr cloud. I have configured the chunk size and
thread count for the same. Is this the right practice to bulk load
SolrCloud?

Also, the maximum number of connections per host parameter for
HttpShardHandler is in solrconfig.xml I suppose?

Thanks



On 8/18/15 8:28 AM, Shawn Heisey wrote:


On 8/18/2015 8:18 AM, Rallavagu wrote:


Thanks for the response. Does this cache behavior influence the delay
in catching up with cloud? How can we explain solr cloud replication
and what are the option to monitor and take proactive action (such as
initializing, pausing etc) if needed?



I don't know enough about your setup to speculate.

I did notice this exception in a previous reply:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for
connection from pool

I can think of two things that would cause this.

One cause is that your servlet container is limiting the number of
available threads.  A typical jetty or tomcat default for maxThreads is
200, which can easily be exceeded by a small Solr install, especially if
it's SolrCloud.  The jetty included with Solr sets maxThreads to 10000,
which is effectively unlimited except for extremely large installs.  If
you are providing your own container, this will almost certainly need to
be raised.

The other cause is that your install is extremely busy and you have run
out of available HttpClient connections.  The solution in this case is
to increase the maximum number of connections per host in the
HttpShardHandler config, which defaults to 20.


https://wiki.apache.org/solr/SolrConfigXml#Configuration_of_Shard_Handlers_for_Distributed_searches

There might be other causes for that exception, but I think those are
the most common causes.  Depending on how things are set up, you have
problems with both.

Thanks,
Shawn





Re: GC parameters tuning for core of 140M docs on 50G of heap memory

2015-08-24 Thread Rallavagu
One other item to check is non-heap memory usage. This can be monitored 
from the admin page.


On 8/23/15 11:48 PM, Pavel Hladik wrote:

Hi,

we have a Solr 5.2.1 with 9 cores and one of them has 140M docs. Can you
please recommend tuning of those GC parameters? Performance is not an
issue, but sometimes during peaks we have OOMs. We use 50G of heap memory and
the server has 64G of RAM.

GC_TUNE=-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled



--
View this message in context: 
http://lucene.472066.n3.nabble.com/GC-parameters-tuning-for-core-of-140M-docs-on-50G-of-heap-memory-tp4224813.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr 4.6.1 Cloud Stops Replication

2015-08-18 Thread Rallavagu
Thanks for the response. Does this cache behavior influence the delay in 
catching up with the cloud? How can we explain Solr cloud replication, and 
what are the options to monitor and take proactive action (such as 
initializing, pausing etc.) if needed?



On 8/18/15 5:57 AM, Shawn Heisey wrote:

On 8/17/2015 10:53 PM, Rallavagu wrote:

Also, I have noticed that the memory consumption goes very high. For
instance, each node is configured with 48G memory while java heap is
configured with 12G. The available physical memory is consumed almost
46G and the heap size is well within the limits (at this time it is at
8G). Is there a documentation or to understand this behavior? I suspect
it could be lucene related memory consumption but not sure.


This is completely normal.  Your total memory usage could have been
47.9GB instead of 46GB and I would still say the same thing.

Solr cannot consume more than the 12GB heap that you have assigned, plus
a little overhead (probably a few hundred MB) for the JVM itself.  The
rest of your memory (assuming Solr is the only significant software
installed on the system) is used by the operating system for caching
contents on your disk.  Solr *relies* on this behavior (and the
available RAM it requires) for good performance.

https://en.wikipedia.org/wiki/Page_cache
https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: Solr 4.6.1 Cloud Stops Replication

2015-08-18 Thread Rallavagu

Thanks Shawn.

All participating cloud nodes are running Tomcat and as you suggested 
will review the number of threads and increase them as needed.


Essentially, what I have noticed was that two of the four nodes caught up 
with bulk updates instantly while the other two nodes took almost 3 hours 
to get completely in sync with the leader. I have tickled the other nodes by 
sending an update, thinking that it would initiate the replication, but I am 
not sure if that caused the other two nodes to eventually catch up.


On a similar note, I was using ConcurrentUpdateSolrServer directly 
pointing to the leader to bulk load the Solr cloud. I have configured the 
chunk size and thread count for the same. Is this the right practice to bulk 
load SolrCloud?


Also, the maximum number of connections per host parameter for 
HttpShardHandler is in solrconfig.xml I suppose?


Thanks


On 8/18/15 8:28 AM, Shawn Heisey wrote:

On 8/18/2015 8:18 AM, Rallavagu wrote:

Thanks for the response. Does this cache behavior influence the delay
in catching up with cloud? How can we explain solr cloud replication
and what are the option to monitor and take proactive action (such as
initializing, pausing etc) if needed?


I don't know enough about your setup to speculate.

I did notice this exception in a previous reply:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for
connection from pool

I can think of two things that would cause this.

One cause is that your servlet container is limiting the number of
available threads.  A typical jetty or tomcat default for maxThreads is
200, which can easily be exceeded by a small Solr install, especially if
it's SolrCloud.  The jetty included with Solr sets maxThreads to 10000,
which is effectively unlimited except for extremely large installs.  If
you are providing your own container, this will almost certainly need to
be raised.

The other cause is that your install is extremely busy and you have run
out of available HttpClient connections.  The solution in this case is
to increase the maximum number of connections per host in the
HttpShardHandler config, which defaults to 20.

https://wiki.apache.org/solr/SolrConfigXml#Configuration_of_Shard_Handlers_for_Distributed_searches

There might be other causes for that exception, but I think those are
the most common causes.  Depending on how things are set up, you have
problems with both.

Thanks,
Shawn



Re: Solr 4.6.1 Cloud Stops Replication

2015-08-18 Thread Rallavagu
Thanks for the response. Will take a look into using CloudSolrServer 
for updates and review the tlog mechanism.
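
For reference, a minimal SolrJ 4.x sketch of the CloudSolrServer approach
Erick describes below; the ZooKeeper ensemble, collection name, field names,
and batch size are placeholders, not taken from this setup:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkLoader {
    public static void main(String[] args) throws Exception {
      // Point at the ZK ensemble, not at any single Solr node.
      CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
      server.setDefaultCollection("collection1");

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title", "title " + i);
        batch.add(doc);
        if (batch.size() == 1000) {   // send in batches; the client routes each
          server.add(batch);          // batch to the right shard leader
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        server.add(batch);
      }
      server.commit();                // or rely on autoCommit with openSearcher=false
      server.shutdown();
    }
  }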


On 8/18/15 9:29 AM, Erick Erickson wrote:

Couple of things:

1 Here's an excellent backgrounder for MMapDirectory, which is
what makes it appear that Solr is consuming all the physical memory

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

2 It's possible that your transaction log was huge. Perhaps not likely,
but possible. If Solr abnormally terminates (kill -9 is a prime way to do this),
then upon restart the transaction log is replayed. This log is rolled over upon
every hard commit (openSearcher true or false doesn't matter). So, in the
scenario where you are indexing a whole lot of stuff without committing, then
it can take a very long time to replay the log. Not only that, but as you do
replay the log, any incoming updates are written to the end of the tlog.. That
said, nothing in your e-mails indicates this could be a problem and it's
frankly not consistent with the errors you _do_ report but I thought
I'd mention it.
See: 
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
You can avoid the possibility of this by configuring your autoCommit interval
to be relatively short (say 60 seconds) with openSearcher=false

3 ConcurrentUpdateSolrServer isn't the best thing for bulk loading SolrCloud,
CloudSolrServer (renamed CloudSolrClient in 5.x) is better. CUSS sends all
the docs to some node, and from there that node figures out which
shard each doc belongs on and forwards the doc (actually in batches) to the
appropriate leader. So doing what you're doing creates a lot of cross chatter
amongst nodes. CloudSolrServer/Client figures that out on the client side and
only sends packets to each leader that consist of only the docs that belong on
that shard. You can get nearly linear throughput with increasing numbers of
shards this way.

Best,
Erick

On Tue, Aug 18, 2015 at 9:03 AM, Rallavagu rallav...@gmail.com wrote:

Thanks Shawn.

All participating cloud nodes are running Tomcat and as you suggested will
review the number of threads and increase them as needed.

Essentially, what I have noticed was that two of four nodes caught up with
bulk updates instantly while other two nodes took almost 3 hours to
completely in sync with leader. I have tickled other nodes by sending an
update thinking that it would initiate the replication but not sure if that
caused other two nodes to eventually catch up.

On similar note, I was using CouncurrentUpdateSolrServer directly pointing
to leader to bulk load Solr cloud. I have configured the chunk size and
thread count for the same. Is this the right practice to bulk load
SolrCloud?

Also, the maximum number of connections per host parameter for
HttpShardHandler is in solrconfig.xml I suppose?

Thanks



On 8/18/15 8:28 AM, Shawn Heisey wrote:


On 8/18/2015 8:18 AM, Rallavagu wrote:


Thanks for the response. Does this cache behavior influence the delay
in catching up with cloud? How can we explain solr cloud replication
and what are the option to monitor and take proactive action (such as
initializing, pausing etc) if needed?



I don't know enough about your setup to speculate.

I did notice this exception in a previous reply:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for
connection from pool

I can think of two things that would cause this.

One cause is that your servlet container is limiting the number of
available threads.  A typical jetty or tomcat default for maxThreads is
200, which can easily be exceeded by a small Solr install, especially if
it's SolrCloud.  The jetty included with Solr sets maxThreads to 10000,
which is effectively unlimited except for extremely large installs.  If
you are providing your own container, this will almost certainly need to
be raised.

The other cause is that your install is extremely busy and you have run
out of available HttpClient connections.  The solution in this case is
to increase the maximum number of connections per host in the
HttpShardHandler config, which defaults to 20.


https://wiki.apache.org/solr/SolrConfigXml#Configuration_of_Shard_Handlers_for_Distributed_searches

There might be other causes for that exception, but I think those are
the most common causes.  Depending on how things are set up, you have
problems with both.

Thanks,
Shawn





Solr 4.6.1 Cloud Stops Replication

2015-08-17 Thread Rallavagu

Hello,

Have 4 nodes participating in a Solr cloud. After indexing about 2 mil 
documents, only two nodes are Active (green) while the other two are shown 
as down. How can I initiate replication from the leader so the other 
two nodes will receive updates?


Thanks


Re: Solr 4.6.1 Cloud Stops Replication

2015-08-17 Thread Rallavagu
By the time the last email was sent, the other node had also caught up. 
Makes me wonder what happened and how this works.


Thanks

On 8/17/15 9:53 PM, Rallavagu wrote:

response inline..

On 8/17/15 8:40 PM, Erick Erickson wrote:

Is this 4 shards? Two shards each with a leader and follower? Details
matter a lot


It is a single collection single shard.



What, if anything, is in the log file for the down nodes? I'm assuming
that when you
start, all the nodes are active


During the update process found following exceptions

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting 
for connection from pool
at 
org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:232)
at 
org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:199)
at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:456)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:232)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)

at java.lang.Thread.run(Thread.java:682)

However, after couple of hours one of the nodes (out of two that were 
trailing) caught up with status Active. However, other node is still 
in state Down. It has following message.


Log replay finished. recoveryInfo=RecoveryInfo{adds=2009581 
deletes=148 deleteByQuery=0 errors=0 positionOfStart=0}


I am trying to understand the behavior and wondering is there a way to 
trigger the updates to other participating nodes in the cloud.


Also, I have noticed that the memory consumption goes very high. For 
instance, each node is configured with 48G memory while java heap is 
configured with 12G. The available physical memory is consumed almost 
46G and the heap size is well within the limits (at this time it is at 
8G). Is there a documentation or to understand this behavior? I 
suspect it could be lucene related memory consumption but not sure.





You might review:
http://wiki.apache.org/solr/UsingMailingLists


Sorry for not being very clear to start with. Hope the provided 
information would help.


Thanks



Best,
Erick

On Mon, Aug 17, 2015 at 6:19 PM, Rallavagu rallav...@gmail.com wrote:

Hello,

Have 4 nodes participating solr cloud. After indexing about 2 mil 
documents,
only two nodes are Active (green) while other two are shown as 
down. How

can I initialize the replication from leader so other two nodes would
receive updates?

Thanks




Re: Solr 4.6.1 Cloud Stops Replication

2015-08-17 Thread Rallavagu

response inline..

On 8/17/15 8:40 PM, Erick Erickson wrote:

Is this 4 shards? Two shards each with a leader and follower? Details
matter a lot


It is a single collection single shard.



What, if anything, is in the log file for the down nodes? I'm assuming
that when you
start, all the nodes are active


During the update process I found the following exceptions:

org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for 
connection from pool
	at 
org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:232)
	at 
org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:199)
	at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:456)
	at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
	at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:232)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)

at java.lang.Thread.run(Thread.java:682)

However, after a couple of hours one of the nodes (out of the two that were 
trailing) caught up with status Active. However, the other node is still 
in state Down. It has the following message.


Log replay finished. recoveryInfo=RecoveryInfo{adds=2009581 deletes=148 
deleteByQuery=0 errors=0 positionOfStart=0}


I am trying to understand the behavior and wondering if there is a way to 
trigger the updates to the other participating nodes in the cloud.


Also, I have noticed that the memory consumption goes very high. For 
instance, each node is configured with 48G memory while java heap is 
configured with 12G. Almost 46G of the available physical memory is consumed 
while the heap size is well within its limit (at this time it is at 8G). 
Is there documentation to help understand this behavior? I suspect it could 
be Lucene-related memory consumption but am not sure.





You might review:
http://wiki.apache.org/solr/UsingMailingLists


Sorry for not being very clear to start with. Hope the provided 
information would help.


Thanks



Best,
Erick

On Mon, Aug 17, 2015 at 6:19 PM, Rallavagu rallav...@gmail.com wrote:

Hello,

Have 4 nodes participating solr cloud. After indexing about 2 mil documents,
only two nodes are Active (green) while other two are shown as down. How
can I initialize the replication from leader so other two nodes would
receive updates?

Thanks


Collections Design

2014-04-10 Thread Rallavagu

All,

What are the best practices or guidelines for deciding when to use multiple 
collections, particularly in a Solr cloud environment?


Thanks

Srikanth


No route to host

2014-04-09 Thread Rallavagu

All,

I see the following error in the log file. The host that it is trying to 
find is itself. Wondering if anybody has experienced this before; any 
other info would be helpful. Thanks.


709703139 [http-bio-8080-exec-43] ERROR 
org.apache.solr.update.SolrCmdDistributor  – 
org.apache.solr.client.solrj.SolrServerException: IOException occured 
when talking to server at: http://host:8080/solr/collection1
	at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:503)
	at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
	at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:293)
	at 
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:212)
	at 
org.apache.solr.update.SolrCmdDistributor.distribCommit(SolrCmdDistributor.java:181)
	at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1260)
	at 
org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
	at 
org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
	at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
	at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
	at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
	at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
	at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
	at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
	at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
	at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
	at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
	at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
	at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
	at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
	at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
	at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
	at 
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
	at 
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
	at 
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
	at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.NoRouteToHostException: No route to host
at java.net.PlainSocketImpl.socketConnect(Native Method)
	at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)

at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
	at 
org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
	at 
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
	at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
	at 
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
	at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
	at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
	at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
	at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
	at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:393)


Re: No route to host

2014-04-09 Thread Rallavagu

Agreed. But it is failing to find a route to itself. Weird.

On 4/9/14, 1:34 PM, Greg Walters wrote:

This doesn't look like a Solr-specific issue. Be sure to check your routes and 
your firewall. I've seen firewalls refuse packets and return a special flag 
that results in a no route to host error.
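
A few quick checks from the node that logs NoRouteToHostException can narrow
that down; the IP, port, and path here are placeholders for the ones in the
error:

  ip route get 10.0.0.12                      # does the kernel actually have a route?
  curl -v http://10.0.0.12:8080/solr/         # can the Solr port be reached at all?
  sudo iptables -L -n | grep -i reject        # a REJECT rule (icmp-host-prohibited)
                                              # surfaces in Java as "No route to host"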

Thanks,
Greg

On Apr 9, 2014, at 3:28 PM, Rallavagu rallav...@gmail.com wrote:


All,

I see the following error in the log file. The host that it is trying to find 
is itself. Wondering if anybody experienced this before or any other info would 
helpful. Thanks.

709703139 [http-bio-8080-exec-43] ERROR org.apache.solr.update.SolrCmdDistributor  – 
org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to 
server at: http://host:8080/solr/collection1
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:503)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:293)
at 
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:212)
at 
org.apache.solr.update.SolrCmdDistributor.distribCommit(SolrCmdDistributor.java:181)
at 
org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1260)
at 
org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
at 
org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at 
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
at 
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
at 
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.NoRouteToHostException: No route to host
at java.net.PlainSocketImpl.socketConnect(Native Method)
at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at 
org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
at 
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at 
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request

Re: No route to host

2014-04-09 Thread Rallavagu
Sorry. I should have mentioned earlier. I have removed the original host 
name on purpose. Thanks.


On 4/9/14, 1:42 PM, Siegfried Goeschl wrote:

Hi folks,

the URL looks wrong (misconfigured)

http://host:8080/solr/collection1

Cheers,

Siegfried Goeschl

On 09 Apr 2014, at 14:28, Rallavagu rallav...@gmail.com wrote:


All,

I see the following error in the log file. The host that it is trying to find 
is itself. Wondering if anybody experienced this before or any other info would 
helpful. Thanks.

709703139 [http-bio-8080-exec-43] ERROR org.apache.solr.update.SolrCmdDistributor  – org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://host:8080/solr/collection1
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:503)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:197)
at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.request(ConcurrentUpdateSolrServer.java:293)
at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:212)
at org.apache.solr.update.SolrCmdDistributor.distribCommit(SolrCmdDistributor.java:181)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1260)
at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:157)
at org.apache.solr.handler.RequestHandlerUtils.handleCommit(RequestHandlerUtils.java:69)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.NoRouteToHostException: No route to host
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:127)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:643)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:479)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:393)




Re: Indexing huge data

2014-03-08 Thread Rallavagu
Thanks for all the responses so far. Test runs so far do not suggest any 
bottleneck in Solr as I continue to work on different approaches. 
Collecting the data from the different sources seems to be consuming most of 
the time.


On 3/7/14, 5:53 PM, Erick Erickson wrote:

Kranti's and Susheel's approaches are certainly
reasonable, assuming my bet was right :).

Another strategy is to rack together N
indexing programs that simultaneously
feed Solr.

In any of these scenarios, the end goal is to get
Solr using up all the CPU cycles it can, _assuming_
that Solr isn't the bottleneck in the first place.
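
For what it's worth, a hedged sketch of the "rack together N indexing
programs" idea above as N worker threads in one JVM, each with its own
HttpSolrServer and its own slice of the source data; the URL, thread count
and the fetchPartition() helper are assumptions, not anything from the
original setup:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RackedIndexers {
    public static void main(String[] args) throws Exception {
        final int workers = 4;                     // N indexing "programs" as threads
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            final int partition = i;
            pool.submit(new Runnable() {
                public void run() {
                    // Each worker gets its own client and its own slice of the source data.
                    HttpSolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");
                    try {
                        for (List<SolrInputDocument> batch : fetchPartition(partition)) {
                            server.add(batch);     // no per-batch commit; commit separately once all workers finish
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    } finally {
                        server.shutdown();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    // Hypothetical stand-in for reading one slice of the DB/other sources as batches of documents.
    private static Iterable<List<SolrInputDocument>> fetchPartition(int partition) {
        return new java.util.ArrayList<List<SolrInputDocument>>();
    }
}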

Best,
Erick

On Thu, Mar 6, 2014 at 6:38 PM, Kranti Parisa kranti.par...@gmail.com wrote:

That's what I do: pre-create JSONs following the schema and save them in
MongoDB; this is part of the ETL process. After that, just dump the JSONs
into Solr using batching etc. With this you can do full and incremental
indexing as well.
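
For illustration, a minimal SolrJ sketch of the "dump the JSONs into Solr"
step, assuming each batch file is JSON shaped for the /update/json handler;
the URL, directory and class name here are made up:

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class JsonDumpIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL and location; in practice the JSONs come out of MongoDB.
        SolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");
        File[] batches = new File("/tmp/json-batches").listFiles();

        for (File batch : batches) {
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/json");
            req.addFile(batch, "application/json");
            req.process(server);    // no per-batch commit; one commit below
        }
        server.commit();
        server.shutdown();
    }
}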

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu rallav...@gmail.com wrote:


Yeah, I have thought about spitting out JSON and running it against Solr using
parallel HTTP threads separately. Thanks.
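
One hedged way to get those parallel HTTP threads from a single client is
ConcurrentUpdateSolrServer, which queues the adds and streams them to Solr
over several connections; the URL, queue size, thread count, batch size and
the fetchFromSources() helper below are assumptions for the sketch:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelJsonPusher {
    public static void main(String[] args) throws Exception {
        // Queue size 10000 and 4 sender threads are just starting points to tune.
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://localhost:8080/solr/collection1", 10000, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (SolrInputDocument doc : fetchFromSources()) {   // hypothetical acquisition step
            batch.add(doc);
            if (batch.size() == 1000) {                      // hand off in batches of 1000
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.blockUntilFinished();    // wait for the queued updates to drain
        server.commit();
        server.shutdown();
    }

    // Hypothetical stand-in for building documents from the collected JSON.
    private static Iterable<SolrInputDocument> fetchFromSources() {
        return new ArrayList<SolrInputDocument>();
    }
}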


On 3/5/14, 6:46 PM, Susheel Kumar wrote:


One more suggestion is to collect/prepare the data in CSV format (a 1-2
million document sample, depending on size) and then import the data directly
into Solr using the CSV handler & curl. This will give you the pure indexing
time & the differences.
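
A hedged SolrJ equivalent of that curl approach, mirroring the JSON sketch
above but streaming the prepared CSV sample to the CSV handler; the handler
path, file name and "header" parameter are assumptions:

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvSampleLoader {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");

        // Stream the prepared CSV sample to the CSV handler and commit when done.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("/tmp/sample.csv"), "text/csv");
        req.setParam("header", "true");    // first row of the file carries the field names
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        req.process(server);
        server.shutdown();
    }
}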

Thanks,
Susheel

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data

Here's the easiest thing to try to figure out where to concentrate your
energies. Just comment out the server.add call in your SolrJ program.
Well, and any commits you're doing from SolrJ.

My bet: Your program will run at about the same speed it does when you
actually index the docs, indicating that your problem is in the data
acquisition side. Of course the older I get, the more times I've been wrong
:).
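
A rough sketch of that experiment, assuming a typical SolrJ loop; everything
here (URL, class name, the fetchDocuments() helper) is hypothetical and only
shows where the add call gets commented out:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AcquisitionTimer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");
        long start = System.currentTimeMillis();

        for (SolrInputDocument doc : fetchDocuments()) {
            // server.add(doc);   // commented out: time the acquisition side on its own
        }
        // server.commit();       // and no commits either

        System.out.println("acquisition took " + (System.currentTimeMillis() - start) + " ms");
        server.shutdown();
    }

    // Hypothetical stand-in for pulling and converting data from the DB and the other sources.
    private static Iterable<SolrInputDocument> fetchDocuments() {
        return new java.util.ArrayList<SolrInputDocument>();
    }
}

If this run takes roughly as long as a real indexing run, the time is going
into acquisition rather than Solr.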

You can also monitor the CPU usage on the box running Solr. I often see
it idling along < 30% when indexing, or even < 10%, again indicating that
the bottleneck is on the acquisition side.

Note I haven't mentioned any solutions, I'm a believer in identifying the
_problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky j...@basetechnology.com
wrote:


Make sure you're not doing a commit on each individual document add.
Committing every few minutes, or every few hundred or few thousand
documents, is sufficient. You can set up autoCommit in solrconfig.xml.
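
If the client controls commits, commitWithin is one hedged alternative to
per-document commits; the field names and the 5-minute window in this sketch
are arbitrary:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinAdd {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8080/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");             // hypothetical fields
        doc.addField("title", "commitWithin example");

        // Ask Solr to make this visible within 5 minutes instead of committing per add;
        // autoCommit in solrconfig.xml is the server-side alternative.
        server.add(doc, 5 * 60 * 1000);

        server.shutdown();
    }
}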

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best/common practices for indexing/re-indexing a huge
amount of data in Solr. The data is about 6 million entries in the DB
and other sources (the data is not located in one resource). I am trying
a SolrJ-based solution to collect data from the different sources and
index it into Solr. It takes hours to index into Solr.

Thanks in advance





Re: Indexing huge data

2014-03-06 Thread Rallavagu

Erick,

That helps so I can focus on the problem areas. Thanks.

On 3/5/14, 6:03 PM, Erick Erickson wrote:

Here's the easiest thing to try to figure out where to
concentrate your energies. Just comment out the
server.add call in your SolrJ program. Well, and any
commits you're doing from SolrJ.

My bet: Your program will run at about the same speed
it does when you actually index the docs, indicating that
your problem is in the data acquisition side. Of course
the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running
Solr. I often see it idling along < 30% when indexing, or
even < 10%, again indicating that the bottleneck is on the
acquisition side.

Note I haven't mentioned any solutions, I'm a believer in
identifying the _problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky j...@basetechnology.com wrote:

Make sure you're not doing a commit on each individual document add.
Committing every few minutes, or every few hundred or few thousand
documents, is sufficient. You can set up autoCommit in solrconfig.xml.

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best/common practices for indexing/re-indexing a huge
amount of data in Solr. The data is about 6 million entries in the DB
and other sources (the data is not located in one resource). I am trying
a SolrJ-based solution to collect data from the different sources and
index it into Solr. It takes hours to index into Solr.

Thanks in advance

