Metrics in 6.5.1 names and stuff

2017-08-03 Thread Walter Underwood
I’m trying to get what I want out of the metrics reporting in Solr. I want the 
counts and percentiles for each request handler in each collection. If I have 
“/srp”, “/suggest”, and “/seo”, I want three sets of metrics.

I’m getting a lot of weird stuff. For counts for /srp in an eight node cluster, 
one node (new-solr-c01) is reporting all of these graphite metric names:

new-solr-c01.solr.node.QUERY.httpShardHandler.http://new-solr-c01.test3.cloud.cheggnet.com:8983/solr/questions_shard1_replica2/srp.post.requests.count
new-solr-c01.solr.node.QUERY.httpShardHandler.http://new-solr-c02.test3.cloud.cheggnet.com:8983/solr/questions_shard2_replica4/srp.post.requests.count
new-solr-c01.solr.node.QUERY.httpShardHandler.http://new-solr-c03.test3.cloud.cheggnet.com:8983/solr/questions_shard4_replica3/srp.post.requests.count
new-solr-c01.solr.node.QUERY.httpShardHandler.http://new-solr-c04.test3.cloud.cheggnet.com:8983/solr/questions_shard2_replica2/srp.post.requests.count
new-solr-c01.solr.node.QUERY.httpShardHandler.http://new-solr-c05.test3.cloud.cheggnet.com:8983/solr/questions_shard4_replica4/srp.post.requests.count
new-solr-c01.solr.node.QUERY.httpShardHandler.http://new-solr-c06.test3.cloud.cheggnet.com:8983/solr/questions_shard3_replica1/srp.post.requests.count
new-solr-c01.solr.node.QUERY.httpShardHandler.http://new-solr-c07.test3.cloud.cheggnet.com:8983/solr/questions_shard1_replica3/srp.post.requests.count
new-solr-c01.solr.node.QUERY.httpShardHandler.http://new-solr-c08.test3.cloud.cheggnet.com:8983/solr/questions_shard3_replica3/srp.post.requests.count
new-solr-c01.solr.node.UPDATE.updateShardHandler./solr/questions/srp.get.requests.count

1. What are the metrics with the URLs? Are those distributed queries to other 
shards? If so, why are they going to the local shard, too?

2. Why is the last one under UPDATE, when it is a query request handler?

3. The jetty metrics seem to lump everything together under the HTTP method.

4. I enabled the http group, but don’t get any metrics from it.

Here is the config. The prefix is set to the nodename. For this test, I sent 
the metrics to localhost and logged them with “nc -l 2003 > graphite.log”.

   
 
<metrics>
  <reporter name="graphite" class="org.apache.solr.metrics.reporters.SolrGraphiteReporter">
    <str name="host">${graphite_host:NO_HOST_SET}</str>
    <int name="port">2003</int>
    <str name="prefix">${graphite_prefix:NO_PREFIX_SET}</str>
  </reporter>
</metrics>
<!-- element and attribute names reconstructed; the list archive stripped the original XML tags -->

http://observer.wunderwood.org/  (my blog)
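For cross-checking what the reporter sends, the same registries can be pulled on demand from the metrics API and the names compared. A rough Java sketch (host, handler path, and the prefix parameter are assumptions based on the 6.x metrics API, so verify against your version's ref guide):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MetricsPeek {
    public static void main(String[] args) throws Exception {
        // group=core limits the output to per-core registries, where the per-handler
        // counts and percentiles for each collection live; prefix narrows it to /srp.
        URL url = new URL("http://new-solr-c01.test3.cloud.cheggnet.com:8983/solr/admin/metrics"
                + "?wt=json&group=core&prefix=QUERY./srp");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            in.lines().forEach(System.out::println);
        }
    }
}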




Re: Error when trying to replace node with Solr 6.6.0

2017-08-03 Thread Björn Häuser
Okay,

after digging a little bit through the code, I think the problem is in this 
line: 
https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/cloud/OverseerCollectionMessageHandler.java#L153


Is there any reason why this is a SynchronousQueue? If I understand this correctly,
it means there cannot be more than 10 parallel commands, which means that
collection operations can only be executed for fewer than 10 collections at a time?
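For anyone following along, here is a self-contained sketch (not the Solr code itself, and the pool size of 10 is only there to mirror the error below) of why a fixed-size pool fed by a SynchronousQueue rejects work once all threads are busy:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SyncQueueRejectionDemo {
    public static void main(String[] args) {
        // A SynchronousQueue holds no tasks, so once all 10 threads are busy
        // the next submit cannot be queued and is rejected outright.
        ExecutorService pool = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
        for (int i = 1; i <= 11; i++) {
            try {
                pool.submit(() -> {
                    try { Thread.sleep(5000); } catch (InterruptedException ignored) { }
                });
            } catch (RejectedExecutionException e) {
                System.out.println("task " + i + " rejected: " + e.getMessage());
            }
        }
        pool.shutdownNow();
    }
}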

Would love to contribute a patch for this if someone says what that should look
like :)


Thanks
Björn
> On 3. Aug 2017, at 18:51, Björn Häuser  wrote:
> 
> Hey Folks,
> 
> we today hit the same error three times, a REPLACENODE call was not 
> successful.
> 
> Here is our scenario: 
> 
> 3 Node Solrcloud cluster running in Kubernetes on top of AWS. 
> 
> Today we wanted to rotate the underlying storage (increased from 50gb to 
> 300gb). 
> 
> After we rotated one node we tried to replace with this call:
> 
>   • curl 
> 'solr-2.solr-discovery.default.svc.cluster.local:8983/solr/admin/collections?action=REPLACENODE=solr-2.solr-discovery.default.svc.cluster.local.:8983_solr=solr-2.solr-discovery.default.svc.cluster.local.:8983_solr=4495d85b-0aa4-45ab-8067-9d7d4da375d3'
>   • curl 
> 'solr-2.solr-discovery.default.svc.cluster.local:8983/solr/admin/collections?action=REQUESTSTATUS=4495d85b-0aa4-45ab-8067-9d7d4da375d3’
> 
> The error we got was:
> 
> 
> 
> 0 name="QTime">28java.util.concurrent.RejectedExecutionException:java.util.concurrent.RejectedExecutionException:
>  Task 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/509076276@5c9136c8
>  rejected from 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1cce4506[Running,
>  pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 
> 0]Task 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/509076276@5c9136c8
>  rejected from 
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1cce4506[Running,
>  pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 
> 0]-1 name="state">failedfound 
> [4495d85b-0aa4-45ab-8067-9d7d4da375d3] in failed tasks
> 
> 
> 
> The problem was that afterwards we had the same shard on the same node twice. 
> One recovered and we had to delete the other one manually. For some 
> collections the REPLACENODE went through and everything was fine again.
> 
> Can you advice what we did wrong here or which configuration we need to adapt?
> 
> Thanks
> Björn



Commit takes very long with NoSuchFileException

2017-08-03 Thread Nawab Zada Asad Iqbal
Hi,

I have a host with 3 Solr processes running, each with one shard only;
there are no replicas. I am reindexing some 100 GB of data per Solr process (or
per shard, since each Solr process has one shard).

After about 3 hours, I manually committed once. I was able to get through
40 GB in each shard, and the commit response came within 2 minutes.

After another 3 hours, I stopped the indexing client and manually committed
again. Two shards returned within a few minutes (total size now 85 GB+).
However, the 3rd shard (or Solr process) was stuck for almost two hours
(until I stopped the server) and started to throw the following exception.
This shard also has a lot more files in its data/index folder: 68k vs. 31k in the
other two shards. I am not sure whether that has an impact. FWIW, the file
descriptor limit on this host is 65k.
I am relying on the default concurrent merge scheduler (which probably opens 4
threads on each Solr server).

java.nio.file.NoSuchFileException: /box/var/solr/shard1/filesearch/data/index/segments_2
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55)
at sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
at sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
at java.nio.file.Files.readAttributes(Files.java:1737)
at java.nio.file.Files.size(Files.java:2332)
at org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
at org.apache.solr.handler.admin.LukeRequestHandler.getFileLength(LukeRequestHandler.java:615)
at org.apache.solr.handler.admin.LukeRequestHandler.getIndexInfo(LukeRequestHandler.java:588)
at org.apache.solr.handler.admin.LukeRequestHandler.handleRequestBody(LukeRequestHandler.java:138)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
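Given the 68k index files against the 65k descriptor limit mentioned above, it may be worth confirming how close the process actually gets to that limit. A small sketch, assuming a HotSpot JVM on Unix; it reports the JVM it runs in, so it is only indicative unless executed inside the Solr process:

import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdCheck {
    public static void main(String[] args) throws Exception {
        // Index path taken from the exception above.
        Path index = Paths.get("/box/var/solr/shard1/filesearch/data/index");
        long files;
        try (Stream<Path> s = Files.list(index)) {
            files = s.count();
        }
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.printf("index files=%d, open fds=%d, max fds=%d%n",
                files, os.getOpenFileDescriptorCount(), os.getMaxFileDescriptorCount());
    }
}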



After stopping the server, I noticed another stack trace; it seems that one
indexing request was still stuck in the system.

Error when trying to replace node with Solr 6.6.0

2017-08-03 Thread Björn Häuser
Hey Folks,

Today we hit the same error three times: a REPLACENODE call was not successful.

Here is our scenario: 

3-node SolrCloud cluster running in Kubernetes on top of AWS.

Today we wanted to rotate the underlying storage (increased from 50gb to 
300gb). 

After we rotated one node we tried to replace with this call:

• curl 
'solr-2.solr-discovery.default.svc.cluster.local:8983/solr/admin/collections?action=REPLACENODE=solr-2.solr-discovery.default.svc.cluster.local.:8983_solr=solr-2.solr-discovery.default.svc.cluster.local.:8983_solr=4495d85b-0aa4-45ab-8067-9d7d4da375d3'
• curl 
'solr-2.solr-discovery.default.svc.cluster.local:8983/solr/admin/collections?action=REQUESTSTATUS=4495d85b-0aa4-45ab-8067-9d7d4da375d3’
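The archive seems to have stripped the parameter separators from the curl commands above, so here is a hedged Java sketch of what the two requests typically look like (the target node name is made up, and the source/target/async/requestid parameter names are my reading of the 6.x collections API, so verify them against the ref guide):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReplaceNodeSketch {
    public static void main(String[] args) throws Exception {
        String base = "http://solr-2.solr-discovery.default.svc.cluster.local:8983/solr/admin/collections";
        String async = "4495d85b-0aa4-45ab-8067-9d7d4da375d3";
        // Hypothetical target node; the real call moves replicas off the source node onto it.
        String replace = base + "?action=REPLACENODE"
                + "&source=solr-2.solr-discovery.default.svc.cluster.local:8983_solr"
                + "&target=solr-3.solr-discovery.default.svc.cluster.local:8983_solr"
                + "&async=" + async;
        String status = base + "?action=REQUESTSTATUS&requestid=" + async + "&wt=json";
        for (String call : new String[] { replace, status }) {
            HttpURLConnection conn = (HttpURLConnection) new URL(call).openConnection();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                in.lines().forEach(System.out::println);
            }
        }
    }
}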

The error we got was:



028java.util.concurrent.RejectedExecutionException:java.util.concurrent.RejectedExecutionException:
 Task 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/509076276@5c9136c8
 rejected from 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1cce4506[Running,
 pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 
0]Task 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$15/509076276@5c9136c8
 rejected from 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1cce4506[Running,
 pool size = 10, active threads = 10, queued tasks = 0, completed tasks = 
0]-1failedfound 
[4495d85b-0aa4-45ab-8067-9d7d4da375d3] in failed tasks



The problem was that afterwards we had the same shard on the same node twice. 
One recovered and we had to delete the other one manually. For some collections 
the REPLACENODE went through and everything was fine again.

Can you advise what we did wrong here or which configuration we need to adapt?

Thanks
Björn

Re: mixed index with commongrams

2017-08-03 Thread David Hastings
Haven't really looked much into that; here is a snippet from today's GC log,
if you wouldn't mind shedding any light on it:

2017-08-03T11:46:16.265-0400: 3200938.383: [GC (Allocation Failure)
2017-08-03T11:46:16.265-0400: 3200938.383: [ParNew
Desired survivor size 1966060336 bytes, new threshold 8 (max 8)
- age   1:  128529184 bytes,  128529184 total
- age   2:   43075632 bytes,  171604816 total
- age   3:   64402592 bytes,  236007408 total
- age   4:   35621704 bytes,  271629112 total
- age   5:   44285584 bytes,  315914696 total
- age   6:   45372512 bytes,  361287208 total
- age   7:   41975368 bytes,  403262576 total
- age   8:   72959688 bytes,  47664 total
: 9133992K->577219K(1088K), 0.2730329 secs]
23200886K->14693007K(49066688K), 0.2732690 secs] [Times: user=2.01
sys=0.01, real=0.28 secs]
Heap after GC invocations=12835 (full 109):
 par new generation   total 1088K, used 577219K [0x7f802300,
0x7f833040, 0x7f833040)
  eden space 8533376K,   0% used [0x7f802300, 0x7f802300,
0x7f822bd6)
  from space 2133312K,  27% used [0x7f82ae0b, 0x7f82d1460d98,
0x7f833040)
  to   space 2133312K,   0% used [0x7f822bd6, 0x7f822bd6,
0x7f82ae0b)
 concurrent mark-sweep generation total 3840K, used 14115788K
[0x7f833040, 0x7f8c5800, 0x7f8c5800)
 Metaspace   used 36698K, capacity 37169K, committed 37512K, reserved
38912K
}





On Thu, Aug 3, 2017 at 11:58 AM, Walter Underwood 
wrote:

> How long are your GC pauses? Those affect all queries, so they make the
> 99th percentile slow with queries that should be fast.
>
> The G1 collector has helped our 99th percentile.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 3, 2017, at 8:48 AM, David Hastings 
> wrote:
> >
> > Thanks, thats what i kind of expected.  still debating whether the space
> > increase is worth it, right now Im at .7% of searches taking longer than
> 10
> > seconds, and 6% taking longer than 1, so when i see things like this in
> the
> > morning it bugs me a bit:
> >
> > 2017-08-02 11:50:48 : 58979/1000 secs : ("Rules of Practice for the
> Courts
> > of Equity of the United States")
> > 2017-08-02 02:16:36 : 54749/1000 secs : ("The American Cause")
> > 2017-08-02 19:27:58 : 54561/1000 secs : ("register of the department of
> > justice")
> >
> > which could all be annihilated with CG's, at the expense, according to
> HT,
> > of a 40% increase in index size.
> >
> >
> >
> > On Thu, Aug 3, 2017 at 11:21 AM, Erick Erickson  >
> > wrote:
> >
> >> bq: will that search still return results form the earlier documents
> >> as well as the new ones
> >>
> >> In a word, "no". By definition the analysis chain applied at index
> >> time puts tokens in the index and that's all you have to search
> >> against for the doc unless and until you re-index the document.
> >>
> >> You really have two choices here:
> >> 1> live with the differing results until you get done re-indexing
> >> 2> index to an offline collection and then use, say, collection
> >> aliasing to make the switch atomically.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Aug 3, 2017 at 8:07 AM, David Hastings
> >>  wrote:
> >>> Hey all, I have yet to run an experiment to test this but was wondering
> >> if
> >>> anyone knows the answer ahead of time.
> >>> If i have an index built with documents before implementing the
> >> commongrams
> >>> filter, then enable it, and start adding documents that have the
> >>> filter/tokenizer applied, will searches that fit the criteria, for
> >> example:
> >>> "to be or not to be"
> >>> will that search still return results form the earlier documents as
> well
> >> as
> >>> the new ones?  The idea is that a full re-index is going to be
> difficult,
> >>> so would rather do it over time by replacing large numbers of documents
> >>> incrementally.  Thanks,
> >>> Dave
> >>
>
>


Re: mixed index with commongrams

2017-08-03 Thread Walter Underwood
How long are your GC pauses? Those affect all queries, so they make the 99th 
percentile slow with queries that should be fast.

The G1 collector has helped our 99th percentile.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 3, 2017, at 8:48 AM, David Hastings  
> wrote:
> 
> Thanks, thats what i kind of expected.  still debating whether the space
> increase is worth it, right now Im at .7% of searches taking longer than 10
> seconds, and 6% taking longer than 1, so when i see things like this in the
> morning it bugs me a bit:
> 
> 2017-08-02 11:50:48 : 58979/1000 secs : ("Rules of Practice for the Courts
> of Equity of the United States")
> 2017-08-02 02:16:36 : 54749/1000 secs : ("The American Cause")
> 2017-08-02 19:27:58 : 54561/1000 secs : ("register of the department of
> justice")
> 
> which could all be annihilated with CG's, at the expense, according to HT,
> of a 40% increase in index size.
> 
> 
> 
> On Thu, Aug 3, 2017 at 11:21 AM, Erick Erickson 
> wrote:
> 
>> bq: will that search still return results form the earlier documents
>> as well as the new ones
>> 
>> In a word, "no". By definition the analysis chain applied at index
>> time puts tokens in the index and that's all you have to search
>> against for the doc unless and until you re-index the document.
>> 
>> You really have two choices here:
>> 1> live with the differing results until you get done re-indexing
>> 2> index to an offline collection and then use, say, collection
>> aliasing to make the switch atomically.
>> 
>> Best,
>> Erick
>> 
>> On Thu, Aug 3, 2017 at 8:07 AM, David Hastings
>>  wrote:
>>> Hey all, I have yet to run an experiment to test this but was wondering
>> if
>>> anyone knows the answer ahead of time.
>>> If i have an index built with documents before implementing the
>> commongrams
>>> filter, then enable it, and start adding documents that have the
>>> filter/tokenizer applied, will searches that fit the criteria, for
>> example:
>>> "to be or not to be"
>>> will that search still return results form the earlier documents as well
>> as
>>> the new ones?  The idea is that a full re-index is going to be difficult,
>>> so would rather do it over time by replacing large numbers of documents
>>> incrementally.  Thanks,
>>> Dave
>> 



Re: mixed index with commongrams

2017-08-03 Thread David Hastings
Thanks, that's what I kind of expected.  Still debating whether the space
increase is worth it; right now I'm at 0.7% of searches taking longer than 10
seconds, and 6% taking longer than 1, so when I see things like this in the
morning it bugs me a bit:

2017-08-02 11:50:48 : 58979/1000 secs : ("Rules of Practice for the Courts
of Equity of the United States")
2017-08-02 02:16:36 : 54749/1000 secs : ("The American Cause")
2017-08-02 19:27:58 : 54561/1000 secs : ("register of the department of
justice")

which could all be annihilated with CGs, at the expense, according to HT,
of a 40% increase in index size.



On Thu, Aug 3, 2017 at 11:21 AM, Erick Erickson 
wrote:

> bq: will that search still return results form the earlier documents
> as well as the new ones
>
> In a word, "no". By definition the analysis chain applied at index
> time puts tokens in the index and that's all you have to search
> against for the doc unless and until you re-index the document.
>
> You really have two choices here:
> 1> live with the differing results until you get done re-indexing
> 2> index to an offline collection and then use, say, collection
> aliasing to make the switch atomically.
>
> Best,
> Erick
>
> On Thu, Aug 3, 2017 at 8:07 AM, David Hastings
>  wrote:
> > Hey all, I have yet to run an experiment to test this but was wondering
> if
> > anyone knows the answer ahead of time.
> > If i have an index built with documents before implementing the
> commongrams
> > filter, then enable it, and start adding documents that have the
> > filter/tokenizer applied, will searches that fit the criteria, for
> example:
> > "to be or not to be"
> > will that search still return results form the earlier documents as well
> as
> > the new ones?  The idea is that a full re-index is going to be difficult,
> > so would rather do it over time by replacing large numbers of documents
> > incrementally.  Thanks,
> > Dave
>


Re: mixed index with commongrams

2017-08-03 Thread Erick Erickson
bq: will that search still return results form the earlier documents
as well as the new ones

In a word, "no". By definition the analysis chain applied at index
time puts tokens in the index and that's all you have to search
against for the doc unless and until you re-index the document.

You really have two choices here:
1> live with the differing results until you get done re-indexing
2> index to an offline collection and then use, say, collection
aliasing to make the switch atomically.
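For option 2>, the final switch is a single collections API call; a rough Java sketch (alias and collection names are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SwitchAlias {
    public static void main(String[] args) throws Exception {
        // Point the alias the application queries at the freshly re-indexed collection;
        // "books" and "books_cg_v2" are placeholder names.
        URL url = new URL("http://localhost:8983/solr/admin/collections"
                + "?action=CREATEALIAS&name=books&collections=books_cg_v2&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            in.lines().forEach(System.out::println);
        }
    }
}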

Best,
Erick

On Thu, Aug 3, 2017 at 8:07 AM, David Hastings
 wrote:
> Hey all, I have yet to run an experiment to test this but was wondering if
> anyone knows the answer ahead of time.
> If i have an index built with documents before implementing the commongrams
> filter, then enable it, and start adding documents that have the
> filter/tokenizer applied, will searches that fit the criteria, for example:
> "to be or not to be"
> will that search still return results form the earlier documents as well as
> the new ones?  The idea is that a full re-index is going to be difficult,
> so would rather do it over time by replacing large numbers of documents
> incrementally.  Thanks,
> Dave


mixed index with commongrams

2017-08-03 Thread David Hastings
Hey all, I have yet to run an experiment to test this but was wondering if
anyone knows the answer ahead of time.
If I have an index built with documents before implementing the commongrams
filter, then enable it, and start adding documents that have the
filter/tokenizer applied, will searches that fit the criteria, for example:
"to be or not to be"
will that search still return results from the earlier documents as well as
the new ones?  The idea is that a full re-index is going to be difficult,
so I would rather do it over time by replacing large numbers of documents
incrementally.  Thanks,
Dave


Re: High CPU utilization on Upgrading to Solr Version 6.3

2017-08-03 Thread Erick Erickson
Atita:

Thanks for that update. I've opened SOLR-11188 to track this, please
add any details to that JIRA that you can, in particular what bit of
code you've identified as the problem.

Also, do you have any document you can share that would cause this?
I'm wondering if it's sensitive to the particular data or the query
or.

Thanks again!
Erick

On Wed, Aug 2, 2017 at 11:38 PM, Atita Arora  wrote:
> Hi All ,
>
> Just thought of giving quick update on this.
> So we were able to *knock down this issue by using jvisualvm* which comes
> with java .
> So , we enabled monitoring  through jmx and the CPU profiling showed (as
> attached in one of my previous emails) *Highlighting taking maximum
> processing.*
> Mysteriously , this was happening in highlighting-> merge which was invoked
> through when we enabled *mergecontiguous=true* I'm still surprised as to
> turning this only property false, resolved the issue and we happily went
> live last week.
>
> Later , as I found the code for this particular property is causing endless
> recursions as I traced.
>
> Please guide / share if you may have any other thoughts.
>
> Thanks,
> Atita
>
>
>
> On Fri, Jul 28, 2017 at 7:18 PM, Shawn Heisey  wrote:
>
>> On 7/27/2017 1:30 AM, Atita Arora wrote:
>> > What OS is Solr running on?  I'm only asking because some additional
>> > information I'm after has different gathering methods depending on OS.
>> > Other questions:
>> >
>> > /*OpenJDK 64-Bit Server VM (25.141-b16) for linux-amd64 JRE
>> > (1.8.0_141-b16), built on Jul 20 2017 21:47:59 by "mockbuild" with gcc
>> > 4.4.7 20120313 (Red Hat 4.4.7-18)*/
>> > /*Memory: 4k page, physical 264477520k(92198808k free), swap 0k(0k
>> free)*/
>>
>> Linux is the easiest to get good information from.  Run the "top"
>> program in a commandline session.  Press shift-M to sort by memory size,
>> and grab a screenshot.  Share that screenshot with a file sharing site
>> and give us the URL.
>>
>> > Is there only one Solr process per machine, or more than one?
>> > /*On an average yes , one solr process per machine , however , we do
>> > have a machine (where this log is taken) has two solr processes
>> > (master and slave)*/
>>
>> Running a master and a slave on one machine does nothing for
>> redundancy.  They need to be on separate machines for that to really
> help.  As for multiple processes per machine, you can have many indexes
>> in one Solr instance -- you don't need more than one in most cases.
>>
>> > How many total documents are managed by one machine?
>> > */About 220945 per machine ( and double for this machine as it has
>> > instance of master as well as other slave)/*
>> >
>> > How big is all the index data managed by one machine?
>> > */The index is about 4G./*
>>
>> If less than a quarter of a million documents results in a 4GB index,
>> those documents must be ENORMOUS, or else there is something strange
>> going on.
>>
>> > What is the max heap on each Solr process?
>> > */Max heap is 25G for each Solr Process. (Xms 25g Xmx 25g)/*
>> > */
>> > /*
>> > The reason of choosing RAMDirectory was that it was used in the
>> > similar manner while the production Solr was on Version 4.3.2, so no
>> > particular reason but just replicated how it was working , never
>> > thought this may give troubles.
>>
>> Set up the slaves just like the masters, with
>> NRTCachingDirectoryFactory.  For a couple hundred thousand docs, you
>> probably only need a 2GB heap, possibly even less.
>>
>> > I had included a pastebin of GC snapshot (the complete log was too big
>> > to be included in the pastebin , so pasted a sampler)
>>
>> I asked for the full log because that's what I need to look deeper.  A
>> sampler won't be enough.  There are file sharing websites for sharing
>> larger content, and if you compress the file before uploading it, you
>> should be able to achieve a fairly impressive compression ratio.
>> Dropbox is generally a good choice for sharing fairly large content.
>> Dropbox also works for image data, like the "top" screenshot I asked for
>> above.
>>
>> > Another thing is as we observed the CPU cycles yesterday in high load
>> > condition we observed that the Highlighter component was taking
>> > longest , is there anything in particular we forgot to include that
>> > highlighting doesn't gives a performance hit .
>> > Attached is the snapshot taken from jvisualvm.
>>
>> Attachments rarely make it through the mailing list.  Yours didn't, so I
>> cannot see that snapshot.
>>
>> I do not know anything about highlighting, so I cannot comment on how
>> much CPU it takes.  I've never used the feature.
>>
>> My best idea about why your CPU is so high is problems with garbage
>> collection.  To look into that, I need to have the full GC log.  The
>> rest of the information I've asked for will help focus my efforts.
>>
>> Thanks,
>> Shawn
>>
>>


Re: Ambiguous response on TrieDateField

2017-08-03 Thread Erick Erickson
Solr only deals with UTC times. My bet: you're seeing the _stored_
value of the time which is PDT. How are you indexing this field? You
have to have something hanging around that converts the input to
UTC...

Best,
Erick

On Thu, Aug 3, 2017 at 2:48 PM, Imran Rajjad  wrote:
> Hello,
>
> I have observed a difference of Day in TrieDateField when queried from Solr 
> Cloud web interface and SolrK (Java API)
>
> Below is the query response from Web Interface
>
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":22,
> "params":{
>   "q":"id:01af04e1-83ce-4eb0-8fb5-dc737115dcce",
>   "indent":"on",
>   "fl":"dateTime",
>   "sort":"dateTime asc, id asc",
>   "rows":"100",
>   "wt":"json",
>   "_":"1501792144786"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "dateTime":"2017-06-17T00:00:00Z"}]
>   }}
>
> The same query run from SolrJ shows previous day in the same field
>
> query.setQuery("id:01af04e1-83ce-4eb0-8fb5-dc737115dcce");
> query.setFields(""dateTime");
> query.addSort("dateTime", ORDER.asc);
> query.addSort("id", ORDER.asc);
> query.add("wt","json");
>
> gives
> {responseHeader={zkConnected=true,status=0,QTime=24,params={q=id:01af04e1-83ce-4eb0-8fb5-dc737115dcce,_stateVer_=cdr2:818,fl=dateTime,sort=dateTime
>  asc,id asc,wt=javabin,version=2}},response={numFound=1,start=0,docs=
> [SolrDocument{dateTime=Fri Jun 16 17:00:00 PDT 2017}]}}
>
> The problem was found when the a filter query (dateTime:[ 
> 2017-06-17T00:00:00Z TO 2017-06-18T00:00:00Z]) was done for the records of 17 
> June only, however the solrJ response shows some documents with June16 also. 
> Running facet query from web interface shows no records from June16
>
>
> Regards,
> Imran
>
> Sent from Mail for Windows 10
>


Re: plus sign in request / looking for + in title

2017-08-03 Thread Erick Erickson
Take a look at your analysis chain. My bet is that the + is being stripped
by some part of the chain. See the admin UI>>analysis page.

Best,
Erick

On Aug 3, 2017 06:47, "d.ku...@technisat.de"  wrote:

> Hey,
>
> in our title we are having a word named "hd+".
> Now I want to do a query right on these word, but if I do so, solr is just
> looking for "hd" and ignoring the plus sign. But I relay need to search for
> the whole string
> Of course I did a url encode for the plus sign:
>
> q=title:hd%2B
>
> Can please anyone tell me, how to search for the plus sign "+"?
>
> thanks
>
> David
>


plus sign in request / looking for + in title

2017-08-03 Thread d.ku...@technisat.de
Hey,

in our title we have a word named "hd+".
Now I want to do a query right on this word, but if I do so, Solr is just
looking for "hd" and ignoring the plus sign. But I really need to search for the
whole string.
Of course I did a URL encode for the plus sign:

q=title:hd%2B

Can anyone please tell me how to search for the plus sign "+"?

thanks

David


Re: Get handler failure

2017-08-03 Thread Chris Ulicny
By 1 replica, I mean a single copy of the shard with no redundancy.

We haven't encountered any problems with the testing environment Solr
instances that weren't expected, at least none that I'm aware of.

I do have the logs saved from the time frame the issue occurred in if those
would be useful. We're running Solr 6.3.0 on Ubuntu 16.04 virtual machines.

On Thu, Aug 3, 2017 at 9:18 AM Shawn Heisey  wrote:

> On 8/3/2017 6:30 AM, Chris Ulicny wrote:
> > I've run into an issue in a test environment where a document exists, but
> > fails to be retrieved consistently by /get requests. In a series of 10
> > requests for the specific document across a few minute timespan, one of
> the
> > middle requests returned a null document.
> >
> > Currently, nothing is updating existing records in the collection, so it
> > couldn't have actually been deleted.
> >
> > The test cloud and collection have 3 nodes, 6 shards, and 1 replica per
> > shard. Based on the fact that the node that was queried was not the node
> > the document resided on, I assume that there may have been a temporary
> > connectivity issue that we're unaware of and the request couldn't find
> the
> > document and returned null.
>
> When you say "1 replica" do you mean that there are two copies of each
> shard (leader and replica) or one copy (no redundancy)?  I ask because
> this is a common point of confusion about SolrCloud terminology.  If you
> have two copies, then you have two replicas -- because the leader IS a
> replica.
>
> If there are two copies, you might be in a situation where the two
> copies are out of sync for some reason, and one copy has the document
> but the other doesn't.  Because SolrCloud load balances requests,
> sometimes the query will be serviced by one copy, sometimes by the other.
>
> If there is only one copy of each shard, then I do not know how this
> could happen, and it might indicate some kind of a problem with your
> install.
>
> Thanks,
> Shawn
>
>


Re: Sentence level searching

2017-08-03 Thread Naveen33
Hi Michael,
What you are looking for can be achieved in Solr, but not directly.
You will have to write a custom query parser which uses the Lucene query
parser. In that parser you will have to use span queries.
SpanQuery1 - your term1, term2, ... termN, with a range (like the standard 50)
and the boolean "ordered" set to true or false.
SpanQuery2 - your sentence-start and sentence-end indicators, e.g. #sb# and #se#;
this second span query will be #se# #sb#, with length -1 and order = true.
The final SpanQuery will be SpanQuery1 - SpanQuery2.

In this way you would be able to achieve what you want. It works and I have
tried it.
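A rough Lucene-level sketch of that construction (field name "text" and the #sb#/#se# markers are just the examples above, and the slop values are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SentenceSpanSketch {
    // Match the terms near each other, then drop any match whose span crosses a
    // sentence boundary (an end marker followed by a start marker).
    public static SpanQuery sentenceScoped(String... terms) {
        SpanQuery[] clauses = new SpanQuery[terms.length];
        for (int i = 0; i < terms.length; i++) {
            clauses[i] = new SpanTermQuery(new Term("text", terms[i]));
        }
        SpanQuery near = new SpanNearQuery(clauses, 50, false);       // SpanQuery1: slop 50, unordered
        SpanQuery boundary = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("text", "#se#")),
                new SpanTermQuery(new Term("text", "#sb#"))
        }, 0, true);                                                  // SpanQuery2: #se# then #sb#
        return new SpanNotQuery(near, boundary);                      // SpanQuery1 - SpanQuery2
    }
}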


Thanks & Regard
Naveen
India





Re: Get handler failure

2017-08-03 Thread Shawn Heisey
On 8/3/2017 6:30 AM, Chris Ulicny wrote:
> I've run into an issue in a test environment where a document exists, but
> fails to be retrieved consistently by /get requests. In a series of 10
> requests for the specific document across a few minute timespan, one of the
> middle requests returned a null document.
>
> Currently, nothing is updating existing records in the collection, so it
> couldn't have actually been deleted.
>
> The test cloud and collection have 3 nodes, 6 shards, and 1 replica per
> shard. Based on the fact that the node that was queried was not the node
> the document resided on, I assume that there may have been a temporary
> connectivity issue that we're unaware of and the request couldn't find the
> document and returned null.

When you say "1 replica" do you mean that there are two copies of each
shard (leader and replica) or one copy (no redundancy)?  I ask because
this is a common point of confusion about SolrCloud terminology.  If you
have two copies, then you have two replicas -- because the leader IS a
replica.

If there are two copies, you might be in a situation where the two
copies are out of sync for some reason, and one copy has the document
but the other doesn't.  Because SolrCloud load balances requests,
sometimes the query will be serviced by one copy, sometimes by the other.

If there is only one copy of each shard, then I do not know how this
could happen, and it might indicate some kind of a problem with your
install.
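If it does turn out there are two copies per shard, one quick way to see whether they are out of sync is to query each core directly with distrib=false; a rough SolrJ sketch (core URLs and the document id are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ReplicaCompare {
    public static void main(String[] args) throws Exception {
        // List every replica of the shard that owns the document.
        String[] cores = {
                "http://host1:8983/solr/mycoll_shard1_replica1",
                "http://host2:8983/solr/mycoll_shard1_replica2"
        };
        SolrQuery q = new SolrQuery("id:SOME_DOC_ID");
        q.set("distrib", "false");   // answer from this core only, no fan-out
        for (String core : cores) {
            try (HttpSolrClient client = new HttpSolrClient(core)) {
                long found = client.query(q).getResults().getNumFound();
                System.out.println(core + " -> numFound=" + found);
            }
        }
    }
}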

Thanks,
Shawn



Get handler failure

2017-08-03 Thread Chris Ulicny
Hi all,

I've run into an issue in a test environment where a document exists, but
fails to be retrieved consistently by /get requests. In a series of 10
requests for the specific document across a few minute timespan, one of the
middle requests returned a null document.

Currently, nothing is updating existing records in the collection, so it
couldn't have actually been deleted.

The test cloud and collection have 3 nodes, 6 shards, and 1 replica per
shard. Based on the fact that the node that was queried was not the node
the document resided on, I assume that there may have been a temporary
connectivity issue that we're unaware of and the request couldn't find the
document and returned null.

So is that a possibility, and are there any other circumstances where the
/get handler would not be able to return a document that exists in a
collection?

Thanks,
Chris


Re: Ambiguous response on TrieDateField

2017-08-03 Thread Shawn Heisey
On 8/3/2017 3:48 PM, Imran Rajjad wrote:
> I have observed a difference of Day in TrieDateField when queried from Solr 
> Cloud web interface and SolrK (Java API)
>
> Below is the query response from Web Interface
>
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":22,
> "params":{
>   "q":"id:01af04e1-83ce-4eb0-8fb5-dc737115dcce",
>   "indent":"on",
>   "fl":"dateTime",
>   "sort":"dateTime asc, id asc",
>   "rows":"100",
>   "wt":"json",
>   "_":"1501792144786"}},
>   "response":{"numFound":1,"start":0,"docs":[
>   {
> "dateTime":"2017-06-17T00:00:00Z"}]
>   }}
>
> The same query run from SolrJ shows previous day in the same field

This is a timezone issue.

Solr itself only stores dates and thinks in terms of the UTC timezone
(Coordinated Universal Time).  It is possible with date math (using the NOW
keyword) to inform Solr of the timezone so it calculates day boundaries
correctly, but that doesn't change what gets stored and displayed in a
JSON response.

What I quoted above is the text response you are getting (in JSON
format) from the admin UI.  When the response is text-based (JSON or XML
usually), that ISO format is the only thing you are going to get.

SolrJ gets responses in Javabin.  Because this format is a binary
representation of the Java object, rather than text, when SolrJ receives
that information, the dateTime field is a Java date object of some kind,
which is timezone aware, and is going to be populated with the UTC
information stored in Solr.

Whatever you are using for output from the SolrJ response is converting
the Java object to a timezone-specific text representation.  This
happens by default when you use the "toString()" method, which is what
Java calls when you print it without calling any method.

The output is showing up in the PDT (Pacific Daylight Time) timezone,
which is probably the system timezone on the computer that's running the
SolrJ program.  That timezone is 7 hours behind UTC, so midnight on the
17th (what is actually stored in Solr) becomes 5 PM on the 16th.  When
the timezone for the SolrJ program switches to Pacific Standard Time a
few months from now, that display will change to 8 hours behind UTC.

Solr and your SolrJ program are behaving exactly as they are designed. 
The only reasonable way to handle date/time objects with computer
systems is to have the server store them in UTC and the user-facing
program translate them to a local timezone.
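A minimal illustration of that toString() behaviour (the stored value and timezone mirror the example above):

import java.time.Instant;
import java.util.Date;
import java.util.TimeZone;

public class DateDisplayDemo {
    public static void main(String[] args) {
        // Stand-in for the java.util.Date that SolrJ returns for the TrieDateField value.
        Date stored = Date.from(Instant.parse("2017-06-17T00:00:00Z"));
        TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
        System.out.println(stored);              // toString(): Fri Jun 16 17:00:00 PDT 2017
        System.out.println(stored.toInstant());  // what Solr stores: 2017-06-17T00:00:00Z
    }
}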

Thanks,
Shawn



Re: Solr Pagination

2017-08-03 Thread Vincenzo D'Amore
Don't spend your time reading this, I've just found an answer in the
documentation:


> *One way to ensure that a document will never be returned more then once,
> is to use the uniqueKey field as the primary (and therefore: only
> significant) sort criterion. **In this situation, you will be guaranteed
> that each document is only returned once, no matter how it may be be
> modified during the use of the cursor.*


https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results



On Thu, Aug 3, 2017 at 12:47 PM, Vincenzo D'Amore 
wrote:

> Hi all,
>
> I have a collection that is frequently updated, is it possible that a Solr
> Cloud query returns duplicate documents while paginating?
>
> Just to be clear, there is a collection with about 3M of documents and a
> Solr query selects just 500K documents sorted by Id, which are returned
> simply paginating the results with the parameters start, rows and sort.
>
> The query is like this one:
>
> http://localhost:8983/solr/collection1/select?q=idCat:1;
> start=0=2=id asc
>
> To be honest, I've not verified personally, but the consumer of this query
> claims that after few trials, duplicate documents where returned.
>
> Given that the collection is frequently updated, I suppose that adding a
> large bunch of new documents during the pagination can affect the index and
> change the order of results.
>
> In other words, if I have 500K documents returned by 25 queries (20K
> documents for each request) and during the iteration, 1000 new documents
> are inserted.
> Given that I have a query sorted by Id, I think it is possibile that the
> documents returned reflect the new order, so it is possible that a document
> returned in a previous query now is also present in the current results.
>
> Again, I'm trying to solve this problem using the deep paging.
>
> I have read that "unlike basic pagination, Cursor pagination does not rely
> on using an absolute "offset" into the completed sorted list of matching
> documents.  Instead, the cursorMark specified in a request encapsulates
> information about the relative position of the last document returned,
> based on the absolute sort values of that document.  This means that the
> impact of index modifications is much smaller when using a cursor compared
> to basic pagination."
>
> What do you think about, am I right? The deep paging can help to solve
> this problem?
>
> Best regards and thanks for your time,
> Vincenzo
>
>


Solr Pagination

2017-08-03 Thread Vincenzo D'Amore
Hi all,

I have a collection that is frequently updated, is it possible that a Solr
Cloud query returns duplicate documents while paginating?

Just to be clear, there is a collection with about 3M of documents and a
Solr query selects just 500K documents sorted by Id, which are returned
simply paginating the results with the parameters start, rows and sort.

The query is like this one:

http://localhost:8983/solr/collection1/select?q=idCat:1=0=2=id
asc

To be honest, I've not verified personally, but the consumer of this query
claims that after a few trials, duplicate documents were returned.

Given that the collection is frequently updated, I suppose that adding a
large bunch of new documents during the pagination can affect the index and
change the order of results.

In other words, if I have 500K documents returned by 25 queries (20K
documents for each request) and during the iteration, 1000 new documents
are inserted.
Given that I have a query sorted by id, I think it is possible that the
documents returned reflect the new order, so it is possible that a document
returned in a previous query is now also present in the current results.

Again, I'm trying to solve this problem using the deep paging.

I have read that "unlike basic pagination, Cursor pagination does not rely
on using an absolute "offset" into the completed sorted list of matching
documents.  Instead, the cursorMark specified in a request encapsulates
information about the relative position of the last document returned,
based on the absolute sort values of that document.  This means that the
impact of index modifications is much smaller when using a cursor compared
to basic pagination."

What do you think, am I right? Can deep paging help to solve this
problem?
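For completeness, a rough SolrJ sketch of such a cursor loop (collection URL, query and page size are taken from the description above, and the sort ends on the uniqueKey as the docs require):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPagingSketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1")) {
            SolrQuery q = new SolrQuery("idCat:1");
            q.setRows(20000);
            q.setSort("id", SolrQuery.ORDER.asc);      // uniqueKey as the only significant sort
            String cursor = CursorMarkParams.CURSOR_MARK_START;
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
                QueryResponse rsp = client.query(q);
                // process rsp.getResults() here
                String next = rsp.getNextCursorMark();
                done = cursor.equals(next);            // unchanged mark means no more results
                cursor = next;
            }
        }
    }
}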

Best regards and thanks for your time,
Vincenzo


SOLR Learning to Rank Questions

2017-08-03 Thread Joao Palotti
​
Dear all,

First of all, I would like to thank you guys for the amazing job with SOLR.
In special, I highly appreciate the learning to rank plugin. It is a
fantastic work.

I have two questions for the LTR people and I hope this mailing list is the
right place for that.

*1) This is a direct implementation doubt:*

Let's say that I have the popularity of my documents (document hits) in an
external SQL database instead of saving it in the index.

Can I use this information as a feature? How?


*2) This is slightly more philosophical than a practical question:*

Let's say I would like to normalize the score of my documents, for example,
with MinMaxNormalizer. If I correctly understood it, I would have to
calculate the min and the max values for the score seen in the training set
and upload these values in my model.
When using the model, MinMaxNormalizer will apply its normalization formula
for each value retrieved based on the max and the min set in the model.

Although this is a valid approach, I see it as a global approach, not a
local (per query) one.
Hope you understand what I am talking about here.

I was expecting to have a MinMaxNormalizer without previously min and max
set. This would simply apply the min_max formula to all results for
each query. Thus, when I use this new approach, the first document would
have score 1.0 and the last document retrieved would have score 0.0.

Would it be better to normalize per query instead of a global normalization?


Thanks a lot in advance.
Looking forward to hearing back from you soon.

Best,
--
João Palotti
Website: joaopalotti.com
Twitter: @joaopalotti 
Me at Google Scholar



Ambiguous response on TrieDateField

2017-08-03 Thread Imran Rajjad
Hello,

I have observed a difference of a day in a TrieDateField when queried from the Solr
Cloud web interface and from SolrJ (Java API).

Below is the query response from Web Interface

{
  "responseHeader":{
"zkConnected":true,
"status":0,
"QTime":22,
"params":{
  "q":"id:01af04e1-83ce-4eb0-8fb5-dc737115dcce",
  "indent":"on",
  "fl":"dateTime",
  "sort":"dateTime asc, id asc",
  "rows":"100",
  "wt":"json",
  "_":"1501792144786"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"dateTime":"2017-06-17T00:00:00Z"}]
  }}

The same query run from SolrJ shows previous day in the same field

query.setQuery("id:01af04e1-83ce-4eb0-8fb5-dc737115dcce");
query.setFields("dateTime");
query.addSort("dateTime", ORDER.asc);
query.addSort("id", ORDER.asc); 
query.add("wt","json");

gives
{responseHeader={zkConnected=true,status=0,QTime=24,params={q=id:01af04e1-83ce-4eb0-8fb5-dc737115dcce,_stateVer_=cdr2:818,fl=dateTime,sort=dateTime
 asc,id asc,wt=javabin,version=2}},response={numFound=1,start=0,docs=
[SolrDocument{dateTime=Fri Jun 16 17:00:00 PDT 2017}]}}

The problem was found when a filter query (dateTime:[2017-06-17T00:00:00Z
TO 2017-06-18T00:00:00Z]) was done for the records of 17 June only; however, the
SolrJ response shows some documents with June 16 also. Running a facet query from
the web interface shows no records from June 16.


Regards,
Imran

Sent from Mail for Windows 10



Re: Limiting the number of queries/updates to Solr

2017-08-03 Thread Rick Leir



On 2017-08-02 11:33 PM, Shawn Heisey wrote:

On 8/2/2017 8:41 PM, S G wrote:

Problem is that peak load estimates are just estimates.
It would be nice to enforce them from Solr side such that if a rate higher than 
that is seen at any core, the core will automatically begin to reject the 
requests.
Such a feature would contribute to cluster stability while making sure the 
customer gets an exception to remind them of a slower rate.

Solr doesn't have anything like this.  This is primarily because there
is no network server code in Solr.  The networking is provided by the
servlet container.  The container in modern Solr versions is nearly
guaranteed to be Jetty.  As long as I have been using Solr, it has
shipped with a Jetty container.

https://wiki.apache.org/solr/WhyNoWar

I have no idea whether Jetty is capable of the kind of rate limiting
you're after.  If it is, it would be up to you to figure out the
configuration.

You could always put a proxy server like haproxy in front of Solr.  I'm
pretty sure that haproxy is capable rejecting connections when the
request rate gets too high.  Other proxy servers (nginx, apache, F5
BigIP, solutions from Microsoft, Cisco, etc) are probably also capable
of this.

IMHO, intentionally causing connections to fail when a limit is exceeded
would not be a very good idea.  When the rate gets too high, the first
thing that happens is all the requests slow down.  The slowdown could be
dramatic.  As the rate continues to increase, some of the requests
probably would begin to fail.

What you're proposing would be guaranteed to cause requests to fail.
Failing requests are even more likely than slow requests to result in
users finding a new source for whatever service they are getting from
your organization.

Shawn,
Agreed, a connection limit is not a good idea.  But there is the 
timeAllowed parameter 

timeAllowed - This parameter specifies the amount of time, in 
milliseconds, allowed for a search to complete. If this time expires 
before the search is complete, any partial results will be returned.


https://stackoverflow.com/questions/19557476/timing-out-a-query-in-solr

With timeAllowed, you need not estimate what connection rate is 
unbearable. Rather, you would set a max response time. If some queries 
take much longer than other queries, then this would cause the long ones 
to fail, which might be a good strategy. However, if queries normally 
all take about the same time, then this would cause all queries to 
return partial results until the server recovers, which might be a bad 
strategy. In this case, Walter's post is sensible.
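A small SolrJ illustration of setting it per request (URL, query and threshold are arbitrary):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeAllowedExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1")) {
            SolrQuery q = new SolrQuery("some expensive query");
            q.setTimeAllowed(1000);   // milliseconds; partial results may come back if exceeded
            QueryResponse rsp = client.query(q);
            // partialResults=true appears in the response header when the limit was hit
            System.out.println(rsp.getHeader());
        }
    }
}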


A previous thread suggested that timeAllowed could cause bad performance 
on some cloud servers.

cheers -- Rick






Re: Solr 4.10.4 export handler NPE

2017-08-03 Thread Lasitha Wattaladeniya
Thank you Eric for the reply.

Not possible to change from version 4.10.4; we have built a lot of
functionality wrapping the 4.10 version.

I have decided to use the /select handler to fetch data incrementally and write
to a file. I think that will work.

Regards,
Lasitha

On 2 Aug 2017 23:37, "Erick Erickson"  wrote:

> That the JIRA says 5.5 just means that's the version the submitter was
> using when the NPE was encountered. No workarounds that I know of, but
> that code has changed drastically since the 4.10 days so I really have
> no clue. Any chance of trying it on a 6.6 release?
>
> Best,
> Erick
>
> On Tue, Aug 1, 2017 at 8:43 PM, Lasitha Wattaladeniya 
> wrote:
> > Hi devs,
> >
> > I was exploring the /export handler in solr and got an exception. When I
> > research online I found this open jira case : SOLR-8860
> >
> > https://issues.apache.org/jira/plugins/servlet/mobile#issue/SOLR-8806
> >
> > is this a valid jira case? Any workarounds?
> >
> > Jira says affect version is 5.5 but I'm getting this in 4.10.4 also
> >
> >
> > Regards,
> > Lasitha
>


Re: Solr Input and Output format

2017-08-03 Thread Rick Leir

Ranganath,

I googled 'getRecordWriter solr' and came up with (among 446 results) 
this partial stack trace:



at org.apache.solr.handler.component.HttpShardHandlerFactory.init(HttpShardHandlerFactory.java:168)
at org.apache.solr.handler.component.ShardHandlerFactory.newInstance(ShardHandlerFactory.java:49)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:236)
at org.apache.solr.hadoop.SolrRecordWriter.createEmbeddedSolrServer(SolrRecordWriter.java:163)
at org.apache.solr.hadoop.SolrRecordWriter.<init>(SolrRecordWriter.java:119)
at org.apache.solr.hadoop.SolrOutputFormat.getRecordWriter(SolrOutputFormat.java:163)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.<init>(ReduceTask.java:540)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:614)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)

org.apache.solr.hadoop.SolrOutputFormat extends FileOutputFormat

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.0/solr/contrib/map-reduce/src/java/org/apache/solr/hadoop/SolrOutputFormat.java

cheers -- Rick



On 2017-08-02 04:04 AM, Ranganath B N wrote:

Hi,






   I am not asking about the file formats. Rather, It is about SolrInputFormat 
and SolrOutputFormat interfaces which deal with getsplit(), getRecordReader() 
and getRecordWriter() methods. Are there any Implementations for these 
interfaces?





Thanks,

Ranganath B. N.








From: Ranganath B N
Sent: Monday, July 31, 2017 6:33 PM
To: 'solr-user@lucene.apache.org'
Cc: Vadiraj Muradi
Subject: Solr Input and Output format


Hi All,

  Can you point me to some of the implementations  of Solr Input and Output 
format? I wanted to know them to  understand the distributed implementation 
approach.


Thanks,
Ranganath B. N.






Re: High CPU utilization on Upgrading to Solr Version 6.3

2017-08-03 Thread Atita Arora
Hi All ,

Just thought of giving a quick update on this.
So we were able to *knock down this issue by using jvisualvm*, which comes
with Java.
We enabled monitoring through JMX, and the CPU profiling showed (as
attached in one of my previous emails) *highlighting taking maximum
processing*.
Mysteriously, this was happening in highlighting -> merge, which was invoked
when we enabled *mergecontiguous=true*. I'm still surprised that turning only
this property to false resolved the issue, and we happily went live last week.

Later, I found that the code for this particular property causes endless
recursion, as far as I traced it.

Please guide / share if you may have any other thoughts.

Thanks,
Atita



On Fri, Jul 28, 2017 at 7:18 PM, Shawn Heisey  wrote:

> On 7/27/2017 1:30 AM, Atita Arora wrote:
> > What OS is Solr running on?  I'm only asking because some additional
> > information I'm after has different gathering methods depending on OS.
> > Other questions:
> >
> > /*OpenJDK 64-Bit Server VM (25.141-b16) for linux-amd64 JRE
> > (1.8.0_141-b16), built on Jul 20 2017 21:47:59 by "mockbuild" with gcc
> > 4.4.7 20120313 (Red Hat 4.4.7-18)*/
> > /*Memory: 4k page, physical 264477520k(92198808k free), swap 0k(0k
> free)*/
>
> Linux is the easiest to get good information from.  Run the "top"
> program in a commandline session.  Press shift-M to sort by memory size,
> and grab a screenshot.  Share that screenshot with a file sharing site
> and give us the URL.
>
> > Is there only one Solr process per machine, or more than one?
> > /*On an average yes , one solr process per machine , however , we do
> > have a machine (where this log is taken) has two solr processes
> > (master and slave)*/
>
> Running a master and a slave on one machine does nothing for
> redundancy.  They need to be on separate machines for that to really
> help.  As for multiple processes per machine, you can have many indexes
> in one Solr instance -- you don't need more than one in most cases.
>
> > How many total documents are managed by one machine?
> > */About 220945 per machine ( and double for this machine as it has
> > instance of master as well as other slave)/*
> >
> > How big is all the index data managed by one machine?
> > */The index is about 4G./*
>
> If less than a quarter of a million documents results in a 4GB index,
> those documents must be ENORMOUS, or else there is something strange
> going on.
>
> > What is the max heap on each Solr process?
> > */Max heap is 25G for each Solr Process. (Xms 25g Xmx 25g)/*
> > */
> > /*
> > The reason of choosing RAMDirectory was that it was used in the
> > similar manner while the production Solr was on Version 4.3.2, so no
> > particular reason but just replicated how it was working , never
> > thought this may give troubles.
>
> Set up the slaves just like the masters, with
> NRTCachingDirectoryFactory.  For a couple hundred thousand docs, you
> probably only need a 2GB heap, possibly even less.
>
> > I had included a pastebin of GC snapshot (the complete log was too big
> > to be included in the pastebin , so pasted a sampler)
>
> I asked for the full log because that's what I need to look deeper.  A
> sampler won't be enough.  There are file sharing websites for sharing
> larger content, and if you compress the file before uploading it, you
> should be able to achieve a fairly impressive compression ratio.
> Dropbox is generally a good choice for sharing fairly large content.
> Dropbox also works for image data, like the "top" screenshot I asked for
> above.
>
> > Another thing is as we observed the CPU cycles yesterday in high load
> > condition we observed that the Highlighter component was taking
> > longest , is there anything in particular we forgot to include that
> > highlighting doesn't gives a performance hit .
> > Attached is the snapshot taken from jvisualvm.
>
> Attachments rarely make it through the mailing list.  Yours didn't, so I
> cannot see that snapshot.
>
> I do not know anything about highlighting, so I cannot comment on how
> much CPU it takes.  I've never used the feature.
>
> My best idea about why your CPU is so high is problems with garbage
> collection.  To look into that, I need to have the full GC log.  The
> rest of the information I've asked for will help focus my efforts.
>
> Thanks,
> Shawn
>
>