Search through all fields in collection

2017-05-24 Thread Zheng Lin Edwin Yeo
Hi,

I'm using Solr 6.5.1.

I would like to check: is it possible to set a configuration in
solrconfig.xml so that a search will go through all the fields in the
collection?

Currently, I am defining the fields to be searched under the "df" setting,
but unlike "fl", I cannot set it to the value "*". The only way I can think
of at the moment is to list all the fields under the "df" setting. However,
this doesn't seem like a very good approach, so I'm checking whether there
are better methods.




<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <int name="rows">1000</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">id,content</str>
    <str name="fl">*</str>
  </lst>
</requestHandler>


Regards,
Edwin
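A hedged SolrJ sketch of one common workaround, not taken from the thread above: ask the Luke handler which fields actually exist in the index, then pass them to edismax's qf parameter, which, unlike df, accepts multiple fields. (The other usual route is a catch-all copyField in the schema that df can point at.) The URL, collection name, and query string below are placeholders, and non-searchable fields may need to be filtered out of qf first.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class SearchAllFields {
  public static void main(String[] args) throws Exception {
    // Placeholder URL and collection name.
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {

      // Ask the Luke handler for the index's actual field names.
      LukeRequest luke = new LukeRequest();
      luke.setNumTerms(0); // field names only, no top terms
      LukeResponse info = luke.process(client);
      String qf = String.join(" ", info.getFieldInfo().keySet());

      // edismax searches a list of query fields, unlike df.
      SolrQuery q = new SolrQuery("some search terms");
      q.set("defType", "edismax");
      q.set("qf", qf);
      System.out.println(client.query(q).getResults().getNumFound());
    }
  }
}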


Re: How to handle nested documents in solr (SolrJ)

2017-05-24 Thread David Lee

Hi Rick,

Adding to this subject, I do appreciate you pointing us to these
articles, but I'm curious how much of them take into account the
latest versions of Solr (i.e. 6.5+ and 7), given the JSON split
capabilities, etc. I know that is just on the indexing side, so the
searches may be the same, but things are changing quickly these days (not
a bad thing).


Thanks,

David


On 5/24/2017 4:26 AM, Rick Leir wrote:

Prasad,

Gee, you get a confusing jumble from a Google search for:

nested documents 
site:mail-archives.apache.org/mod_mbox/lucene-solr-user/


https://www.google.ca/search?safe=strict&q=nested+documents+site%3Amail-archives.apache.org%2Fmod_mbox%2Flucene-solr-user%2F&oq=nested+documents+site%3Amail-archives.apache.org%2Fmod_mbox%2Flucene-solr-user%2F&gs_l=serp.3...34316.37762.0.37969.10.10.0.0.0.0.104.678.9j1.10.00...1.1.64.serp..0.0.0.JTf887wWCDM 



But my recent posting might help: "Yonik has some good blogs on this."

And Mikhail has an excellent blog:

https://blog.griddynamics.com/how-to-use-block-join-to-improve-search-efficiency-with-nested-documents-in-solr 



cheers -- Rick

On 2017-05-24 02:53 AM, prasad chowdary wrote:

Dear All,

I have a requirement to index documents in Solr using Java code.

Each document contains a sub-document, like below (it's just for
understanding my question).


student id : 123
student name : john
marks :
  maths : 90
  English : 95

student id : 124
student name : rack
marks :
  maths : 80
  English : 96

etc...

So, as shown above, each document contains one child document, i.e. marks.

Actually I don't need any joins or anything. My requirement is:

If I query "English:95", it should return the complete document, i.e. the
child along with the parent, like below:

student id : 123
student name : john
marks :
  maths : 90
  English : 95

And also if I query "student id : 123", it should return the whole document,
same as above.

Currently I am able to get the child along with the parent for a child match
by using the extendedResults option, but I am not able to get the child for a
parent match.
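A hedged SolrJ sketch of the usual block-join answer to this kind of requirement (field names such as doc_type and the collection URL are assumptions, not from the thread): index parent and child together as one block, then a {!parent} block-join query plus the [child] transformer returns the whole block for a child match, and the same transformer re-attaches the children on a parent match.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NestedDocsSketch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/students").build()) {

      // Index parent and child together as one block.
      SolrInputDocument parent = new SolrInputDocument();
      parent.addField("id", "123");
      parent.addField("doc_type", "student");   // assumed discriminator field
      parent.addField("student_name", "john");

      SolrInputDocument marks = new SolrInputDocument();
      marks.addField("id", "123-marks");
      marks.addField("maths", 90);
      marks.addField("english", 95);
      parent.addChildDocument(marks);

      client.add(parent);
      client.commit();

      // Child match -> whole document: block-join parent query; the
      // [child] transformer re-attaches the nested docs to each result.
      SolrQuery byChild = new SolrQuery("{!parent which=doc_type:student}english:95");
      byChild.set("fl", "*,[child parentFilter=doc_type:student]");
      System.out.println(client.query(byChild).getResults());

      // Parent match -> whole document: plain query on the parent,
      // same transformer pulls the children back in.
      SolrQuery byParent = new SolrQuery("id:123");
      byParent.set("fl", "*,[child parentFilter=doc_type:student]");
      System.out.println(client.query(byParent).getResults());
    }
  }
}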









Re: High CPU when use grouping group.ngroups=true

2017-05-24 Thread Nguyen Manh Tien
Without using ngroups=true, is there any way to handle pagination correctly
when we collapse results using grouping?

Regards,
Tien
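One hedged alternative from outside this thread: the CollapsingQParserPlugin collapses to one document per group, and numFound then counts groups, so plain start/rows paging stays consistent without ngroups. A minimal SolrJ sketch; the collapse field name is a placeholder:

import org.apache.solr.client.solrj.SolrQuery;

class CollapsePagingSketch {
  static SolrQuery page(String userQuery, int start) {
    SolrQuery q = new SolrQuery(userQuery);
    // Collapse to the top document per group; numFound now counts groups,
    // so ordinary start/rows pagination works without group.ngroups=true.
    q.addFilterQuery("{!collapse field=groupField}"); // placeholder field
    q.setStart(start);
    q.setRows(10);
    return q;
  }
}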

On Tue, May 23, 2017 at 9:55 PM, Nguyen Manh Tien  wrote:

> The collapse field is a high-cardinality field. I haven't profiled yet but
> will do it.
>
> Thanks,
> Tien
>
> On Tue, May 23, 2017 at 9:48 PM, Erick Erickson 
> wrote:
>
>> How many unique values in your group field? For high-cardinality
>> fields there's quite a bit of bookkeeping that needs to be done.
>>
>> Have you tried profiling to see where the CPU time is being spent?
>>
>> Best,
>> Erick
>>
>> On Tue, May 23, 2017 at 7:46 AM, Nguyen Manh Tien
>>  wrote:
>> > Hi All,
>> >
>> > I recently switched from Solr field collapse/expand to grouping for
>> > collapsing search results.
>> > All seems good, but CPU is always high (80-100%) when I set the param
>> > group.ngroups=true.
>> >
>> > We set ngroups=true to get the number of groups so that we can paginate
>> > search results correctly.
>> > Due to the CPU issue we need to turn it off.
>> >
>> > Is ngroups=true an expensive feature? Is there any way to prevent the CPU
>> > issue and still have correct pagination?
>> >
>> > Thanks,
>> > Tien
>>
>
>


Re: Spread SolrCloud across two locations

2017-05-24 Thread Shawn Heisey
On 5/24/2017 4:14 PM, Jan Høydahl wrote:
> Sure, ZK by design does not support a two-node/two-location setup. But still,
> users may want/need to deploy that,
> and my question was whether there are smart ways to make such a setup as
> painless as possible in case of failure.
>
> Take the example of DC1: 3xZK and DC2: 2xZK again. And then DC1 goes BOOM.
> Without an active action, DC2 would be read-only.
> What if the Ops personnel in DC2 could then, with a single script/command,
> instruct DC2 to resume the “master” role:
> - Add a 3rd DC2 ZK to the two existing ones, reconfigure, and let them sync up.
> - Rolling restart of the Solr nodes with the new ZK_HOST string.
> Of course, they would also then need to make sure that DC1 does not boot up
> again before a compatible change has been made there too.

When ZK 3.5 comes out and SolrCloud is updated to use it, I think that
it might be possible to remove the dc1 servers from the ensemble and add
another server in dc2 to re-form a new quorum, without restarting
anything.  It could be quite some time before a stable 3.5 is available,
based on past release history.  They don't release anywhere near as
often as Lucene/Solr does.

With the current ZK version, I think your procedure would work, but I
definitely wouldn't call it painless.  Indexing would be unavailable
when dc1 goes down, and everything could be down while the restarts are
happening.

Whether ZK 3.5 is there or not, there is potential unknown behavior when
dc1 comes back online, unless you can have dc1 personnel shut the
servers down, or block communication between your servers in dc1 and dc2.

Overall, having one or two ZK servers in each main DC and a tiebreaker
ZK on a low-cost server in a third DC seems like a better option. 
There's no intervention required when a DC goes down, or when it comes
back up.

Thanks,
Shawn
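To make the quorum arithmetic concrete: with N = 5 total servers, int(5/2) + 1 = 3 must be up and able to talk to each other. In the DC1: 3xZK / DC2: 2xZK layout, losing DC2 leaves 3 of 5 servers and quorum survives; losing DC1 leaves only 2 of 5, so quorum, and with it SolrCloud indexing, is lost. No split of five servers across two locations avoids this, because one side always holds three or more.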



Re: Spread SolrCloud across two locations

2017-05-24 Thread Anirudha Jadhav
The latest ZK supports automatic reconfiguration.
Keep one DC as the quorum and the other as observers.

When a DC goes down, initiate a ZK reconfigure action to flip the quorum and
the observers.

When I tested this, Solr survived just fine, but it's been a while.

Ani


On Wed, May 24, 2017 at 6:35 PM Pushkar Raste 
wrote:

> A setup I have used in the past was to have an observer in DC2. If DC1
> goes boom, you need manual intervention to change the observer's role to make
> it a follower.
>
> When DC1 comes back up, change one instance in DC2 to make it an observer
> again.
>
> On May 24, 2017 6:15 PM, "Jan Høydahl"  wrote:
>
> > Sure, ZK by design does not support a two-node/two-location setup. But
> > still, users may want/need to deploy that,
> > and my question was whether there are smart ways to make such a setup as
> > painless as possible in case of failure.
> >
> > Take the example of DC1: 3xZK and DC2: 2xZK again. And then DC1 goes BOOM.
> > Without an active action, DC2 would be read-only.
> > What if the Ops personnel in DC2 could then, with a single script/command,
> > instruct DC2 to resume the “master” role:
> > - Add a 3rd DC2 ZK to the two existing ones, reconfigure, and let them sync up.
> > - Rolling restart of the Solr nodes with the new ZK_HOST string.
> > Of course, they would also then need to make sure that DC1 does not boot
> > up again before a compatible change has been made there too.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> > > 23. mai 2017 kl. 18.56 skrev Shawn Heisey :
> > >
> > > On 5/23/2017 10:12 AM, Susheel Kumar wrote:
> > >> Hi Jan, FYI - since last year I have been running a Solr 6.0 cluster in
> > >> one of our lower envs with 6 shards/replicas in dc1 & 6 shards/replicas
> > >> in dc2 (each shard replicated across data centers), with 3 ZK in dc1 and
> > >> 2 ZK in dc2. (I didn't have a 3rd data center available for ZK, so I went
> > >> with only 2 data centers in the above configuration.) So far no issues.
> > >> It's been running fine: indexing, replicating data, serving queries, etc.
> > >> So in my test, setting up a single cluster across two zones/data centers
> > >> works without any issue when there is no or very minimal latency (in my
> > >> case around 30ms one way).
> > >
> > > With that setup, if dc2 goes down, you're all good, but if dc1 goes
> > down, you're not.
> > >
> > > There aren't enough ZK servers in dc2 to maintain quorum when dc1 is
> > unreachable, and SolrCloud is going to go read-only.  Queries would most
> > likely work, but you would not be able to change the indexes at all.
> > >
> > > ZooKeeper with N total servers requires int((N/2)+1) servers to be
> > operational to maintain quorum.  This means that with five total servers,
> > three must be operational and able to talk to each other, or ZK cannot
> > guarantee that there is no split-brain, so quorum is lost.
> > >
> > > ZK in two data centers will never be fully fault-tolerant. There is no
> > combination of servers that will work properly.  You must have three data
> > centers for a geographically fault-tolerant cluster.  Solr would be
> > optional in the third data center.  ZK must be installed in all three.
> > >
> > > Thanks,
> > > Shawn
> > >
> >
> >
>
-- 
Anirudha P. Jadhav


Re: Spread SolrCloud across two locations

2017-05-24 Thread Pushkar Raste
A setup I have used in the past was to have an observer in DC2. If DC1
goes boom, you need manual intervention to change the observer's role to make
it a follower.

When DC1 comes back up, change one instance in DC2 to make it an observer
again.

On May 24, 2017 6:15 PM, "Jan Høydahl"  wrote:

> Sure, ZK by design does not support a two-node/two-location setup. But
> still, users may want/need to deploy that,
> and my question was whether there are smart ways to make such a setup as
> painless as possible in case of failure.
>
> Take the example of DC1: 3xZK and DC2: 2xZK again. And then DC1 goes BOOM.
> Without an active action, DC2 would be read-only.
> What if the Ops personnel in DC2 could then, with a single script/command,
> instruct DC2 to resume the “master” role:
> - Add a 3rd DC2 ZK to the two existing ones, reconfigure, and let them sync up.
> - Rolling restart of the Solr nodes with the new ZK_HOST string.
> Of course, they would also then need to make sure that DC1 does not boot
> up again before a compatible change has been made there too.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 23. mai 2017 kl. 18.56 skrev Shawn Heisey :
> >
> > On 5/23/2017 10:12 AM, Susheel Kumar wrote:
> >> Hi Jan, FYI - since last year I have been running a Solr 6.0 cluster
> >> in one of our lower envs with 6 shards/replicas in dc1 & 6 shards/replicas
> >> in dc2 (each shard replicated across data centers), with 3 ZK in dc1 and 2
> >> ZK in dc2. (I didn't have a 3rd data center available for ZK, so I went
> >> with only 2 data centers in the above configuration.) So far no issues.
> >> It's been running fine: indexing, replicating data, serving queries, etc.
> >> So in my test, setting up a single cluster across two zones/data centers
> >> works without any issue when there is no or very minimal latency (in my
> >> case around 30ms one way).
> >
> > With that setup, if dc2 goes down, you're all good, but if dc1 goes
> down, you're not.
> >
> > There aren't enough ZK servers in dc2 to maintain quorum when dc1 is
> unreachable, and SolrCloud is going to go read-only.  Queries would most
> likely work, but you would not be able to change the indexes at all.
> >
> > ZooKeeper with N total servers requires int((N/2)+1) servers to be
> operational to maintain quorum.  This means that with five total servers,
> three must be operational and able to talk to each other, or ZK cannot
> guarantee that there is no split-brain, so quorum is lost.
> >
> > ZK in two data centers will never be fully fault-tolerant. There is no
> combination of servers that will work properly.  You must have three data
> centers for a geographically fault-tolerant cluster.  Solr would be
> optional in the third data center.  ZK must be installed in all three.
> >
> > Thanks,
> > Shawn
> >
>
>


RE: How to avoid unnecessary query parsing on distributed search in QueryComponent.prepare()?

2017-05-24 Thread Markus Jelsma
I've asked myself this question a few times too, in this case when extending
the MLT QParser. So far, I've not found a simple means to propagate a parsed
top-level Lucene query object over the wire.

But since there is a clear toString() for that Query object, if we could
translate that String back into a real object, could that be lighter than
parsing the incoming query text?

Markus
 
-Original message-
> From:Mikhail Khludnev 
> Sent: Thursday 11th May 2017 10:43
> To: solr-user 
> Subject: How to avoid unnecessary query parsing on distributed search in 
> QueryComponent.prepare()?
> 
> Hello,
> When a distributed search is requested (SolrCloud), the query component
> invokes prepare(), where the query is parsed. But then it's just ignored, I
> suppose because all the work is done by the subordinate shards' requests.
> It's fine most of the time because query parsing is cheap. Until we
> have a heavily wildcarded complexphrase query, where the expensive expansion
> is done during parsing.
> How can we avoid this waste of resources? Just bypass parsing if rb.isDistrib?
> Or refactor the complexphrase parser, moving expansion downstream to rewrite()
> or createWeight()?
> --
> Mikhail
> 
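A hedged, purely illustrative sketch of the "bypass parsing when distributed" idea from the mail above: a custom component that skips the expensive prepare() on the aggregator node. Whether this is actually safe depends on which other components rely on the parsed query, so treat it as a starting point rather than a fix.

import java.io.IOException;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class LazyPrepareQueryComponent extends QueryComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // On the aggregator node of a distributed request the parsed query is
    // discarded anyway (each shard re-parses it), so skip the expansion.
    // CAUTION: later components' prepare logic may still expect the query.
    if (rb.isDistrib) {
      return;
    }
    super.prepare(rb);
  }
}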


RE: Spread SolrCloud across two locations

2017-05-24 Thread Markus Jelsma
Hi - Again, renting a simple VM at a third location, without Solr on it, sounds
like the simplest solution. It keeps the quorum tight and sound. This simple
solution is the one I would try first.

Or am I completely missing something and sounding like an idiot? Could be, of
course.

Regards,
Markus

 
 
-Original message-
> From:Jan Høydahl 
> Sent: Thursday 25th May 2017 0:15
> To: solr-user@lucene.apache.org
> Subject: Re: Spread SolrCloud across two locations
> 
> Sure, ZK by design does not support a two-node/two-location setup. But still,
> users may want/need to deploy that,
> and my question was whether there are smart ways to make such a setup as
> painless as possible in case of failure.
>
> Take the example of DC1: 3xZK and DC2: 2xZK again. And then DC1 goes BOOM.
> Without an active action, DC2 would be read-only.
> What if the Ops personnel in DC2 could then, with a single script/command,
> instruct DC2 to resume the “master” role:
> - Add a 3rd DC2 ZK to the two existing ones, reconfigure, and let them sync up.
> - Rolling restart of the Solr nodes with the new ZK_HOST string.
> Of course, they would also then need to make sure that DC1 does not boot up
> again before a compatible change has been made there too.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> > 23. mai 2017 kl. 18.56 skrev Shawn Heisey :
> > 
> > On 5/23/2017 10:12 AM, Susheel Kumar wrote:
> >> Hi Jan, FYI - since last year I have been running a Solr 6.0 cluster in
> >> one of our lower envs with 6 shards/replicas in dc1 & 6 shards/replicas in
> >> dc2 (each shard replicated across data centers), with 3 ZK in dc1 and 2 ZK
> >> in dc2. (I didn't have a 3rd data center available for ZK, so I went with
> >> only 2 data centers in the above configuration.) So far no issues. It's
> >> been running fine: indexing, replicating data, serving queries, etc. So in
> >> my test, setting up a single cluster across two zones/data centers works
> >> without any issue when there is no or very minimal latency (in my case
> >> around 30ms one way).
> > 
> > With that setup, if dc2 goes down, you're all good, but if dc1 goes down, 
> > you're not.
> > 
> > There aren't enough ZK servers in dc2 to maintain quorum when dc1 is 
> > unreachable, and SolrCloud is going to go read-only.  Queries would most 
> > likely work, but you would not be able to change the indexes at all.
> > 
> > ZooKeeper with N total servers requires int((N/2)+1) servers to be 
> > operational to maintain quorum.  This means that with five total servers, 
> > three must be operational and able to talk to each other, or ZK cannot 
> > guarantee that there is no split-brain, so quorum is lost.
> > 
> > ZK in two data centers will never be fully fault-tolerant. There is no 
> > combination of servers that will work properly.  You must have three data 
> > centers for a geographically fault-tolerant cluster.  Solr would be 
> > optional in the third data center.  ZK must be installed in all three.
> > 
> > Thanks,
> > Shawn
> > 
> 
> 


Re: Spread SolrCloud across two locations

2017-05-24 Thread Jan Høydahl
Sure, ZK by design does not support a two-node/two-location setup. But still,
users may want/need to deploy that,
and my question was whether there are smart ways to make such a setup as
painless as possible in case of failure.

Take the example of DC1: 3xZK and DC2: 2xZK again. And then DC1 goes BOOM.
Without an active action, DC2 would be read-only.
What if the Ops personnel in DC2 could then, with a single script/command,
instruct DC2 to resume the “master” role:
- Add a 3rd DC2 ZK to the two existing ones, reconfigure, and let them sync up.
- Rolling restart of the Solr nodes with the new ZK_HOST string.
Of course, they would also then need to make sure that DC1 does not boot up
again before a compatible change has been made there too.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 23. mai 2017 kl. 18.56 skrev Shawn Heisey :
> 
> On 5/23/2017 10:12 AM, Susheel Kumar wrote:
>> Hi Jan, FYI - since last year I have been running a Solr 6.0 cluster in one
>> of our lower envs with 6 shards/replicas in dc1 & 6 shards/replicas in dc2
>> (each shard replicated across data centers), with 3 ZK in dc1 and 2 ZK in
>> dc2. (I didn't have a 3rd data center available for ZK, so I went with only
>> 2 data centers in the above configuration.) So far no issues. It's been
>> running fine: indexing, replicating data, serving queries, etc. So in my
>> test, setting up a single cluster across two zones/data centers works
>> without any issue when there is no or very minimal latency (in my case
>> around 30ms one way).
> 
> With that setup, if dc2 goes down, you're all good, but if dc1 goes down, 
> you're not.
> 
> There aren't enough ZK servers in dc2 to maintain quorum when dc1 is 
> unreachable, and SolrCloud is going to go read-only.  Queries would most 
> likely work, but you would not be able to change the indexes at all.
> 
> ZooKeeper with N total servers requires int((N/2)+1) servers to be 
> operational to maintain quorum.  This means that with five total servers, 
> three must be operational and able to talk to each other, or ZK cannot 
> guarantee that there is no split-brain, so quorum is lost.
> 
> ZK in two data centers will never be fully fault-tolerant. There is no 
> combination of servers that will work properly.  You must have three data 
> centers for a geographically fault-tolerant cluster.  Solr would be optional 
> in the third data center.  ZK must be installed in all three.
> 
> Thanks,
> Shawn
> 



Re: solr 6 at scale

2017-05-24 Thread Toke Eskildsen
Nawab Zada Asad Iqbal  wrote:
> @Toke, I stumbled upon your page last week but it seems that your huge
> index doesn't receive a lot of query traffic.

It switches between two kinds of usage:

Everyday use is very low traffic from researchers using it interactively: 1-2
simultaneous queries, with faceting ranging from somewhat heavy to very heavy.
Our setup is optimized for this scenario, and latency starts to go up pretty
quickly as the number of simultaneous requests rises.

Now and then cultural probes are performed, where the index is hammered
continuously by multiple threads. There our experience is that max throughput
for extremely simple queries (existence checks for social security numbers) is
around 50 queries/second.

> Mine is around 60TB and receives around 120 queries per second; ~90 shards on 
> 30 machines.

Sounds interesting. Do you have a more detailed write-up somewhere?

- Toke


Re: Securing Solr with BasicAuth

2017-05-24 Thread Shawn Heisey
On 5/24/2017 2:08 PM, Warden, Jesse wrote:
> We don’t want people modifying Solr on our website. We found this plugin: 
> https://home.apache.org/~ctargett/RefGuidePOC/jekyll-full/basic-authentication-plugin.html#BasicAuthenticationPlugin-EnableBasicAuthentication
>
> However, if someone goes to search on our website, they’re presented with an 
> authentication dialogue. We want our normal users to be able to perform 
> searches, just none of the admin actions.

The admin UI is just static HTML, CSS, images, and JavaScript.  It does
not contain any information about your Solr server.  The admin UI itself
runs in the browser, and its static components do not require authentication
even when authentication is enabled.

It is the Solr request API, which includes searches, information
requests, and indexing, that actually gets authenticated.  This API is
accessed by the admin UI running in the browser in order to display
information about the server and enable admin actions.

Your end users should *NOT* have direct access to your Solr server.  It
sounds like what you have done is put your calls to Solr into javascript
which executes in the end user's browser, and exposed your Solr server
to your users (which may be the entire Internet).  This is a problem. 
The searches should be executed by back-end code running on your
webserver, not by javascript code running in the user's browser.

If you put a proxy server in front of Solr, you may be able to block
certain URL path combinations and prevent the end users from changing
your indexes, but you will not be able to prevent those users from
sending complex/slow denial of service queries.

Thanks,
Shawn
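A hedged SolrJ sketch of the back-end pattern Shawn describes: the web server holds the Basic Auth credentials and runs the search on the user's behalf, so the browser never talks to Solr directly. URL, collection, and credentials are placeholders.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BackendSearch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient solr =
        new HttpSolrClient.Builder("http://localhost:8983/solr/website").build()) {
      SolrQuery q = new SolrQuery("user search terms");
      QueryRequest req = new QueryRequest(q);
      // Credentials live only on the web server, never in browser JS.
      req.setBasicAuthCredentials("search_user", "password");
      QueryResponse rsp = req.process(solr);
      System.out.println(rsp.getResults().getNumFound());
    }
  }
}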



Re: [Simplified my question] How to enhance solr.StandardTokenizerFactory? (was: Why is Standard Tokenizer not separating at this comma?)

2017-05-24 Thread Steve Rowe
Hi Robert,

Two possibilities come to mind:

1. Use a char filter factory (it runs before the tokenizer) to convert commas
between digits to spaces, e.g. PatternReplaceCharFilterFactory.
2. Use WordDelimiterFilterFactory.

--
Steve
www.lucidworks.com
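For illustration, a hedged plain-Lucene sketch of option 1 against a recent Lucene analysis API (the Solr 3.6 APIs differ); the regex, field name, and sample input are illustrative only:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CommaSplitSketch {
  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String field) {
        return new TokenStreamComponents(new StandardTokenizer());
      }
      @Override
      protected Reader initReader(String field, Reader reader) {
        // Turn a comma flanked by digits into a space *before* tokenizing,
        // so the tokenizer's number rule can no longer join "6,000123".
        return new PatternReplaceCharFilter(
            Pattern.compile("(?<=\\d),(?=\\d)"), " ", reader);
      }
    };
    try (TokenStream ts =
        analyzer.tokenStream("f", new StringReader("bob,a-6,000123,xyz"))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // Prints bob / a / 6 / 000123 / xyz (exact splits vary by version).
        System.out.println(term);
      }
      ts.end();
    }
  }
}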

> On May 24, 2017, at 4:19 PM, Robert Hume  wrote:
> 
> Hi,
> 
> Following up on my last email question ... I've learned more and I
> simplified my question ...
> 
> I have a Solr 3.6 deployment.  Currently I'm using
> solr.StandardTokenizerFactory to parse tokens during indexing.
> 
> Here are two example streams that demonstrate my issue:
> 
> Example 1: `bob,a-z,000123,xyz` produces tokens ... `|bob|a-z|000123|xyz|`
> ... which is good.
> 
> Example 2: `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|`
> ... which is not good because users can't search by "000123".
> 
> It seems StandardTokenizerFactory treats the "6,000" differently (like it's
> currency or a product number, maybe?) so it doesn't tokenize at the comma.
> 
> QUESTION: How can I enhance StandardTokenizer to do everything it's doing
> now plus produce a couple of additional tokens like this ...
> 
> `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|a-6|000123|`
> 
> ... so users can search by "000123"?
> 
> Thanks!
> Rob



Re: Why is Standard Tokenizer not separating at this comma?

2017-05-24 Thread Steve Rowe
Hi Robert,

The StandardTokenizer implements the word boundary rules from UAX#29
(http://unicode.org/reports/tr29/), discarding anything between boundaries
that is exclusively non-alphanumeric (e.g. punctuation).

--
Steve
www.lucidworks.com

> On May 24, 2017, at 3:05 PM, Robert Hume  wrote:
> 
> I have a Solr 3.6 deployment I inherited.
> 
> The schema.xml specifies the use of StandardTokenizerFactory like so ...
> 
> <fieldType ... positionIncrementGap="100">
>   ...
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   ...
> </fieldType>
> 
> According to this reference guide (
> https://home.apache.org/~ctargett/RefGuidePOC/jekyll/Tokenizers.html) ...
> the StandardTokenizer will treat punctuation as delimiters.
> 
> 
> However, here is my content that gets indexed:
> 
>"IOM-1:BA9ATS0FAB,\"Company Name
> 
> Module\",8.1.0.16.0.2,B-A,06KB09029932,PASS,,0,0,0,Y:0,0,0,0,0:BA9AUT0FAB,\"Company
> CM Rear Module\",B-6,09XP12133407,"
> 
> 
> 
> This piece `B-A,06KB09029932` gets tokenized into two words ... `|B-A|`
> and `|06KB09029932|`.
> 
> 
> But this piece `B-6,09XP12133407` gets tokenized into one word ...
> `|B-6,09XP12133407|`.
> 
> What I've observed is that the comma is not considered a delimiter when it is
> surrounded by digits ... almost like it considers "6,000" to be currency or
> something?
> 
> 
> QUESTION: Is this a bug in StandardTokenizer, or do I misunderstand how
> commas are used as delimiters?
> 
> Rob



[Simplified my question] How to enhance solr.StandardTokenizerFactory? (was: Why is Standard Tokenizer not separating at this comma?)

2017-05-24 Thread Robert Hume
Hi,

Following up on my last email question ... I've learned more and I
simplified my question ...

I have a Solr 3.6 deployment.  Currently I'm using
solr.StandardTokenizerFactory to parse tokens during indexing.

Here are two example streams that demonstrate my issue:

Example 1: `bob,a-z,000123,xyz` produces tokens ... `|bob|a-z|000123|xyz|`
... which is good.

Example 2: `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|`
... which is not good because users can't search by "000123".

It seems StandardTokenizerFactory treats the "6,000" differently (like it's
currency or a product number, maybe?) so it doesn't tokenize at the comma.

QUESTION: How can I enhance StandardTokenizer to do everything it's doing
now plus produce a couple of additional tokens like this ...

`bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|a-6|000123|`

... so users can search by "000123"?

Thanks!
Rob


Securing Solr with BasicAuth

2017-05-24 Thread Warden, Jesse
We don’t want people modifying Solr on our website. We found this plugin: 
https://home.apache.org/~ctargett/RefGuidePOC/jekyll-full/basic-authentication-plugin.html#BasicAuthenticationPlugin-EnableBasicAuthentication

However, if someone goes to search on our website, they’re presented with an 
authentication dialogue. We want our normal users to be able to perform 
searches, just none of the admin actions.

How do we do this?




Re: solrcloud replicas not in sync

2017-05-24 Thread Walter Underwood
Funny, I took a different approach to the same monitoring problem.

Each document has a published_timestamp field set when it is generated. The 
schema has an indexed_timestamp field with a default of NOW. I wrote some 
Python to get the set of nodes in the collection, query each one, then report 
the freshness to Graphite. It is generally under 300 ms.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
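A hedged Java/SolrJ sketch of the same freshness check (Walter's is Python): query each replica core directly with distrib=false, read the newest indexed_timestamp, and report the lag. The replica URLs are placeholders, and the snippet assumes every replica has at least one document.

import java.util.Arrays;
import java.util.Date;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class FreshnessCheck {
  public static void main(String[] args) throws Exception {
    for (String baseUrl : Arrays.asList(
        "http://node1:8983/solr/mycoll_shard1_replica1",
        "http://node2:8983/solr/mycoll_shard1_replica2")) {
      try (HttpSolrClient replica = new HttpSolrClient.Builder(baseUrl).build()) {
        SolrQuery q = new SolrQuery("*:*");
        q.set("distrib", "false"); // ask only this core, not the whole cloud
        q.setSort("indexed_timestamp", SolrQuery.ORDER.desc);
        q.setRows(1);
        Date newest = (Date) replica.query(q).getResults()
            .get(0).getFirstValue("indexed_timestamp");
        System.out.println(baseUrl + " lag ms: "
            + (System.currentTimeMillis() - newest.getTime()));
      }
    }
  }
}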


> On May 24, 2017, at 12:51 PM, Webster Homer  wrote:
> 
> Actually, I wrote a service that calls the Collections API CLUSTERSTATUS,
> but adds data for each replica by calling the CoreAdmin STATUS:
> https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-STATUS
>
> My service fills in the index information for more data.
>
> This returns the current flag, and it may not always be correct?
> 
> On Wed, May 24, 2017 at 10:21 AM, Erick Erickson 
> wrote:
> 
>> I wouldn't rely on the "current" flag in the admin UI as an indicator.
>> As long as your numDocs and the like match I'd say it's a UI issue.
>> 
>> Best,
>> Erick
>> 
>> On Wed, May 24, 2017 at 8:15 AM, Webster Homer 
>> wrote:
>>> We see data in the target clusters. CDCR replication is working. We first
>>> noticed the current=false flag on the target replicas, but since I
>> started
>>> looking I see it on the source too.
>>> 
>>> 
>>> I have removed the IgnoreCommitOptimizeUpdateProcessorFactory from our
>>> update processor chain, I did two data loads to different collections.
>>> These collections are part of our development system, they are not
>>> configured to use cdcr they are directly loaded by our data load. The ETL
>>> to our solrs use the /update/json request handler and does not send
>>> commits. These collections mirror our production collections and have 2
>>> shards with 2 replicas. I see the situation where the replicas are marked
>>> current=false which should not happen if autoCommit was working
>> correctly.
>>> The last load was yesterday at 5pm and I didn't check until this morning
>>> where I found bb-catalog-material_shard1_replica1 (the leader) was not
>>> current, but the other was. The last modified date on the leader was
>>> 2017-05-23T22:44:54.618Z.
>>> 
>>> My modified autoCommit:
>>> <autoCommit>
>>>   <maxTime>${solr.autoCommit.maxTime:60}</maxTime>
>>>   <openSearcher>false</openSearcher>
>>> </autoCommit>
>>>
>>> <autoSoftCommit>
>>>   <maxTime>${solr.autoSoftCommit.maxTime:6}</maxTime>
>>> </autoSoftCommit>
>>> 
>>> The last indexed record from a search matches up with the above time. For
>>> this test, the numDocs are the same between the two replicas. I think the
>>> soft commit is working. Why wouldn't both replicas be current after so
>> many
>>> hours?
>>> We are using solr 6.2 fyi. I expect to upgrade to solr 6.6 when it
>> becomes
>>> available
>>> 
>>> Thanks,
>>> Webster
>>> 
>>> On Tue, May 23, 2017 at 12:52 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
 This is all quite strange. Optimize (BTW, it's rarely
 necessary/desirable on an index that changes, despite its name)
 shouldn't matter here. CDCR forwards the raw documents to the target
 cluster.
 
 Ample time indeed. With a soft commit of 15 seconds, that's your
 window (with some slop for how long CDCR takes).
 
 If you do a search and sort by your timestamp descending, what do you
 see on the target cluster? And when you are indexing and CDCR is
 running, your target cluster solr logs should show updates coming in.
 Mostly checking if the data is even getting to the target cluster
 here.
 
 Also check the tlogs on the source cluster. By "check" here I just
 mean "are they reasonable size", and "reasonable" should be very
 small. The tlogs are the "queue" that CDCR uses to store docs before
 forwarding to the target cluster, so this is just a sanity check. If
 they're huge, then CDCR is not forwarding anything to the target
 cluster.
 
 It's also vaguely possible that
 IgnoreCommitOptimizeUpdateProcessorFactory is interfering, if so it's
 a bug and should be reported as a JIRA. If you remove that on the
 target cluster, does the behavior change?
 
 I'm mystified here as you can tell.
 
 Best,
 Erick
 
On Tue, May 23, 2017 at 10:12 AM, Webster Homer wrote:
> We see a pretty consistent issue where the replicas show in the admin
> console as not current, indicating that our auto commit isn't committing. In
> one case we loaded the data to the source, cdcr replicated it to the
> targets and we see the source and the target as having current = false. It
> is searchable so the soft commits are happening. We turned off data loading
> to investigate this issue, and the replicas are still not current after 3
> days. So there should have been ample time to catch up.
> This is our autoCommit
> <autoCommit>
>   <maxDocs>25000</maxDocs>
>   <maxTime>${solr.autoCommit.maxTime:30}</maxTime>

Re: solrcloud replicas not in sync

2017-05-24 Thread Webster Homer
Actually, I wrote a service that calls the Collections API CLUSTERSTATUS,
but adds data for each replica by calling the CoreAdmin STATUS:
https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-STATUS

My service fills in the index information for more data.

This returns the current flag, and it may not always be correct?

On Wed, May 24, 2017 at 10:21 AM, Erick Erickson 
wrote:

> I wouldn't rely on the "current" flag in the admin UI as an indicator.
> As long as your numDocs and the like match I'd say it's a UI issue.
>
> Best,
> Erick
>
> On Wed, May 24, 2017 at 8:15 AM, Webster Homer 
> wrote:
> > We see data in the target clusters. CDCR replication is working. We first
> > noticed the current=false flag on the target replicas, but since I
> started
> > looking I see it on the source too.
> >
> >
> > I have removed the IgnoreCommitOptimizeUpdateProcessorFactory from our
> > update processor chain, I did two data loads to different collections.
> > These collections are part of our development system, they are not
> > configured to use cdcr they are directly loaded by our data load. The ETL
> > to our solrs use the /update/json request handler and does not send
> > commits. These collections mirror our production collections and have 2
> > shards with 2 replicas. I see the situation where the replicas are marked
> > current=false which should not happen if autoCommit was working
> correctly.
> > The last load was yesterday at 5pm and I didn't check until this morning
> > where I found bb-catalog-material_shard1_replica1 (the leader) was not
> > current, but the other was. The last modified date on the leader was
> > 2017-05-23T22:44:54.618Z.
> >
> > My modified autoCommit:
> > <autoCommit>
> >   <maxTime>${solr.autoCommit.maxTime:60}</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:6}</maxTime>
> > </autoSoftCommit>
> >
> > The last indexed record from a search matches up with the above time. For
> > this test,the numDocs are the same between the two replicas. I think the
> > soft commit is working. Why wouldn't both replicas be current after so
> many
> > hours?
> > We are using solr 6.2 fyi. I expect to upgrade to solr 6.6 when it
> becomes
> > available
> >
> > Thanks,
> > Webster
> >
> > On Tue, May 23, 2017 at 12:52 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> This is all quite strange. Optimize (BTW, it's rarely
> >> necessary/desirable on an index that changes, despite its name)
> >> shouldn't matter here. CDCR forwards the raw documents to the target
> >> cluster.
> >>
> >> Ample time indeed. With a soft commit of 15 seconds, that's your
> >> window (with some slop for how long CDCR takes).
> >>
> >> If you do a search and sort by your timestamp descending, what do you
> >> see on the target cluster? And when you are indexing and CDCR is
> >> running, your target cluster solr logs should show updates coming in.
> >> Mostly checking if the data is even getting to the target cluster
> >> here.
> >>
> >> Also check the tlogs on the source cluster. By "check" here I just
> >> mean "are they reasonable size", and "reasonable" should be very
> >> small. The tlogs are the "queue" that CDCR uses to store docs before
> >> forwarding to the target cluster, so this is just a sanity check. If
> >> they're huge, then CDCR is not forwarding anything to the target
> >> cluster.
> >>
> >> It's also vaguely possible that
> >> IgnoreCommitOptimizeUpdateProcessorFactory is interfering, if so it's
> >> a bug and should be reported as a JIRA. If you remove that on the
> >> target cluster, does the behavior change?
> >>
> >> I'm mystified here as you can tell.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, May 23, 2017 at 10:12 AM, Webster Homer wrote:
> >> > We see a pretty consistent issue where the replicas show in the admin
> >> > console as not current, indicating that our auto commit isn't committing. In
> >> > one case we loaded the data to the source, cdcr replicated it to the
> >> > targets and we see the source and the target as having current = false. It
> >> > is searchable so the soft commits are happening. We turned off data
> >> loading
> >> > to investigate this issue, and the replicas are still not current
> after 3
> >> > days. So there should have been ample time to catch up.
> >> > This is our autoCommit
> >> > <autoCommit>
> >> >   <maxDocs>25000</maxDocs>
> >> >   <maxTime>${solr.autoCommit.maxTime:30}</maxTime>
> >> >   <openSearcher>false</openSearcher>
> >> > </autoCommit>
> >> >
> >> > This is our autoSoftCommit
> >> > <autoSoftCommit>
> >> >   <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
> >> > </autoSoftCommit>
> >> > neither property, solr.autoCommit.maxTime or
> solr.autoSoftCommit.maxTime
> >> > are set.
> >> >
> >> > We also have an updateChain that calls the
> >> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
> >> commits.
> >> > Could that be the cause of our issue?
> >> > <updateRequestProcessorChain ...>
> >> >   <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
> >> >     <int name="statusCode">200</int>
> >> >   </processor>
> >> >   ...
> >> > </updateRequestProcessorChain>

Re: solrcloud replicas not in sync

2017-05-24 Thread Webster Homer
Oh, those logs probably reflect the update job that runs every 15 minutes
if there are updates, typically 1 or 2 changes. Thanks for the info.

On Wed, May 24, 2017 at 10:37 AM, Erick Erickson 
wrote:

> By default, enough closed log files will be kept to hold the last 100
> documents indexed. This is for "peer sync" purposes. Say replica1 goes
> offline for a bit. When it comes back online, if it's fallen behind by
> no more than 100 docs, the docs are replayed from another replica's
> tlog.
>
> Having such tiny tlogs is kind of unusual. My guess is that your
> ingestion rate is quite low. Every time a hard commit happens, a new
> tlog is opened up and the old one is closed. Having such tiny tlogs
> implies that you are getting one or a few documents per autocommit
> interval, so each tlog contains just a few docs. There's nothing wrong
> with that, mind you, so it's not a problem.
>
> When do log files get deleted? It Depends (tm). In the non-CDCR case,
> if the most recent N closed tlogs contain 100 or more documents, the
> tlogs older than N are deleted.
>
> In the CDCR case, the above condition must be true _and_ the docs in
> tlogs older than N must have been transmitted to the target cluster.
>
> Best,
> Erick
>
> On Wed, May 24, 2017 at 8:27 AM, Webster Homer 
> wrote:
> > The tlog sizes are strange
> > In the case of the collection where we had issues with the replicas the
> > tlog sizes are 740 bytes and 938 bytes on the target side and the same on
> > the source side. There are a lot of them on the source side, when do tlog
> > files get deleted?
> >
> >
> >
> > On Tue, May 23, 2017 at 12:52 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> This is all quite strange. Optimize (BTW, it's rarely
> >> necessary/desirable on an index that changes, despite its name)
> >> shouldn't matter here. CDCR forwards the raw documents to the target
> >> cluster.
> >>
> >> Ample time indeed. With a soft commit of 15 seconds, that's your
> >> window (with some slop for how long CDCR takes).
> >>
> >> If you do a search and sort by your timestamp descending, what do you
> >> see on the target cluster? And when you are indexing and CDCR is
> >> running, your target cluster solr logs should show updates coming in.
> >> Mostly checking if the data is even getting to the target cluster
> >> here.
> >>
> >> Also check the tlogs on the source cluster. By "check" here I just
> >> mean "are they reasonable size", and "reasonable" should be very
> >> small. The tlogs are the "queue" that CDCR uses to store docs before
> >> forwarding to the target cluster, so this is just a sanity check. If
> >> they're huge, then CDCR is not forwarding anything to the target
> >> cluster.
> >>
> >> It's also vaguely possible that
> >> IgnoreCommitOptimizeUpdateProcessorFactory is interfering, if so it's
> >> a bug and should be reported as a JIRA. If you remove that on the
> >> target cluster, does the behavior change?
> >>
> >> I'm mystified here as you can tell.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, May 23, 2017 at 10:12 AM, Webster Homer wrote:
> >> > We see a pretty consistent issue where the replicas show in the admin
> >> > console as not current, indicating that our auto commit isn't committing. In
> >> > one case we loaded the data to the source, cdcr replicated it to the
> >> > targets and we see the source and the target as having current = false. It
> >> > is searchable so the soft commits are happening. We turned off data
> >> loading
> >> > to investigate this issue, and the replicas are still not current
> after 3
> >> > days. So there should have been ample time to catch up.
> >> > This is our autoCommit
> >> > <autoCommit>
> >> >   <maxDocs>25000</maxDocs>
> >> >   <maxTime>${solr.autoCommit.maxTime:30}</maxTime>
> >> >   <openSearcher>false</openSearcher>
> >> > </autoCommit>
> >> >
> >> > This is our autoSoftCommit
> >> > <autoSoftCommit>
> >> >   <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
> >> > </autoSoftCommit>
> >> > neither property, solr.autoCommit.maxTime or
> solr.autoSoftCommit.maxTime
> >> > are set.
> >> >
> >> > We also have an updateChain that calls the
> >> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
> >> commits.
> >> > Could that be the cause of our issue?
> >> > <updateRequestProcessorChain ...>
> >> >   <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
> >> >     <int name="statusCode">200</int>
> >> >   </processor>
> >> >   ...
> >> > </updateRequestProcessorChain>
> >> >
> >> > We did create a date field to all our collections that defaults to NOW
> >> so I
> >> > can see that no new data was added, but the replicas don't seem to get
> >> the
> >> > commit. I assume this is something in our configuration (see above).
> >> >
> >> > Is there a way to determine when the last commit occurred?
> >> >
> >> > I believe that the one replica got out of sync due to an admin running
> >> > an optimize while cdcr was still running.
> >> > That was one collection, but it looks like we are missing commits on
> >> > most of our collections.
> >> >
> >> > Any help would be greatly appreciated!

Re: solr 6 at scale

2017-05-24 Thread Walter Underwood
I remembered why we waited for 6.5.1. It is the object leak in the Zookeeper 
client code. A very slow leak, but worth getting a fix.

I tested our cluster at 6000 requests/minute. It is 18 million documents, four 
shards by four replicas on big AWS instances (c4.8xlarge). We have very long 
free text queries. Students enter queries with hundreds of words (copy/paste), 
but we truncate at 40 terms.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 24, 2017, at 12:33 PM, Nawab Zada Asad Iqbal  wrote:
> 
> Thanks everyone for the responses, I will go with the latest bits for now;
> and will share how it goes.
> 
> @Toke, I stumbled upon your page last week but it seems that your huge
> index doesn't receive a lot of query traffic. Mine is around 60TB and
> receives around 120 queries per second; ~90 shards on 30 machines.
> 
> 
> I look forward to hearing more scale stories.
> Nawab
> 
> On Wed, May 24, 2017 at 7:58 AM, Toke Eskildsen  wrote:
> 
>> Shawn Heisey  wrote:
>>> On 5/24/2017 3:44 AM, Toke Eskildsen wrote:
 It is relatively easy to downgrade to an earlier release within the
 same major version. We have not switched to 6.5.1 simply because we
 have no pressing need for it - Solr 6.3 works well for us.
>> 
>>> That strikes me as a little bit dangerous, unless your indexes are very
>>> static.  The Lucene index format does occasionally change in minor
>>> versions.
>> 
>> Err.. Okay? Thank you for that. I was under the impression that the index
>> format was fixed (modulo critical bugs) for major versions. This will
>> change our approach to updating.
>> 
>> Apologies for the confusion,
>> Toke
>> 



Re: solr 6 at scale

2017-05-24 Thread Nawab Zada Asad Iqbal
Thanks everyone for the responses, I will go with the latest bits for now;
and will share how it goes.

@Toke, I stumbled upon your page last week but it seems that your huge
index doesn't receive a lot of query traffic. Mine is around 60TB and
receives around 120 queries per second; ~90 shards on 30 machines.


I look forward to hearing more scale stories.
Nawab

On Wed, May 24, 2017 at 7:58 AM, Toke Eskildsen  wrote:

> Shawn Heisey  wrote:
> > On 5/24/2017 3:44 AM, Toke Eskildsen wrote:
> >> It is relatively easy to downgrade to an earlier release within the
> >> same major version. We have not switched to 6.5.1 simply because we
> >> have no pressing need for it - Solr 6.3 works well for us.
>
> > That strikes me as a little bit dangerous, unless your indexes are very
> > static.  The Lucene index format does occasionally change in minor
> > versions.
>
> Err.. Okay? Thank you for that. I was under the impression that the index
> format was fixed (modulo critical bugs) for major versions. This will
> change our approach to updating.
>
> Apologies for the confusion,
> Toke
>


Why is Standard Tokenizer not separating at this comma?

2017-05-24 Thread Robert Hume
I have a Solr 3.6 deployment I inherited.

The schema.xml specifies the use of StandardTokenizerFactory like so ...


<fieldType ... positionIncrementGap="100">
  ...
  <tokenizer class="solr.StandardTokenizerFactory"/>
  ...
</fieldType>

According to this reference guide (
https://home.apache.org/~ctargett/RefGuidePOC/jekyll/Tokenizers.html) ...
the StandardTokenizer will treat punctuation as delimiters.


However, here is my content that gets indexed:

"IOM-1:BA9ATS0FAB,\"Company Name

Module\",8.1.0.16.0.2,B-A,06KB09029932,PASS,,0,0,0,Y:0,0,0,0,0:BA9AUT0FAB,\"Company
CM Rear Module\",B-6,09XP12133407,"



This piece `B-A,06KB09029932` gets tokenized into two words ... `|B-A|`
and `|06KB09029932|`.


But this piece `B-6,09XP12133407` gets tokenized into one word ...
`|B-6,09XP12133407|`.

What I've observed is that the comma is not considered a delimiter when it is
surrounded by digits ... almost like it considers "6,000" to be currency or
something?


QUESTION: Is this a bug in StandardTokenizer, or do I misunderstand how
commas are used as delimiters?

Rob


Re: Indexing word with plus sign

2017-05-24 Thread Fundera Developer
Thank you very much Erick!  You're right!

The "Char" part in PatternReplaceCharFilterFactory misguided me and I tought it 
was just for Char replacements. One I have gone through the documentation of 
CharFilters (my fault...) I realized that I could use the very same regex I was 
using with the PatternReplaceFilterFactory to replace the whole "i+d" 
expression, and nothing more than that, and it is working like charm now.

Thanks again!!
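As Erick notes below, the + also has to survive query parsing. A small hedged SolrJ sketch of one way to keep a user-typed + literal; whether you want this depends on whether your users also use + as an operator:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;

class EscapePlus {
  // escapeQueryChars turns "i+d" into "i\+d", so the query parser treats
  // the + as a literal character instead of a mandatory-clause operator.
  static SolrQuery build(String userInput) {
    return new SolrQuery(ClientUtils.escapeQueryChars(userInput));
  }
}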


On 23/05/17 at 19:41, Erick Erickson wrote:

You need to distinguish between

PatternReplaceCharFilterFactory

and

PatternReplaceFilterFactory

The first one is applied to the entire input _before_ tokenization.
The second is applied _after_ tokenization to individual tokens, by
that time it's too late.

It's an easy thing to miss.

And at query time you'll have to be careful to keep the + sign from
being interpreted as an operator.
Best,
Erick

On Tue, May 23, 2017 at 10:12 AM, Fundera Developer
 wrote:


I have also tried this option, by using a PatternReplaceFilterFactory, like 
this:

<filter class="solr.PatternReplaceFilterFactory" pattern="..." replacement="..."/>
but it gets processed AFTER the Tokenizer, so when it executes there is no 
longer an "i+d" token, but two "i" and "d" independent tokens.

Is there a way I could make the filter execute before the Tokenizer? I have 
tried to place it first in the Analyzer definition like this:

<analyzer>
  <filter class="solr.PatternReplaceFilterFactory" .../>
  <tokenizer .../>
  <filter .../>
  ...
</analyzer>

But I had no luck.

Are there any other approaches I could be missing?

Thanks!


On 22/05/17 at 20:50, Rick Leir wrote:

Fundera,
You need a regex which matches a '+' with non-blank chars before and after. It 
should not replace a  '+' preceded by white space, that is important in Solr. 
This is not a perfect solution, but might improve matters for you.
Cheers -- Rick

On May 22, 2017 1:58:21 PM EDT, Fundera Developer wrote:


Thank you Zahid and Erick,

I was going to try the CharFilter suggestion, but then I hesitated. I see how
the indexing process would handle the appearance of 'i+d', but what happens at
query time? If I use the same filter, I could remove '+' chars that the user
added to mark compulsory tokens in the search, couldn't I? However, if I do
not use the CharFilter, I would not be able to match the 'i+d' search
tokens...

Thanks all!



On 22/05/17 at 16:39, Erick Erickson wrote:

You can also use any of the other tokenizers. WhitespaceTokenizer for
instance. There are a couple that use regular expressions. Etc. See:
https://cwiki.apache.org/confluence/display/solr/Tokenizers

Each one has it's considerations. WhitespaceTokenizer won't, for
instance, separate out punctuation so you might then have to use a
filter to remove those. Regex's can be tricky to get right ;). Etc

Best,
Erick

On Mon, May 22, 2017 at 5:26 AM, Muhammad Zahid Iqbal wrote:


Hi,


Before applying the tokenizer, you can replace your special symbols with some
phrase to preserve them, and after tokenizing you can replace it back.

For example:



Thanks,
Zahid iqbal

On Mon, May 22, 2017 at 12:57 AM, Fundera Developer <funderadevelo...@outlook.com> wrote:



Hi all,

I am a bit stuck on a problem that I feel must be easy to solve. In Spanish it
is usual to find the term 'i+d'. We are working with Solr 5.5, and
StandardTokenizer splits 'i' and 'd'. Since the index holds documents in both
Spanish and Catalan, and 'i' is a frequent word in Catalan, a user who
searches for 'i+d' gets Catalan documents as results.

I have tried to use the SynonymFilter, with something like:

i+d => investigacionYdesarrollo

But it does not seem to change anything.

Is there a way I could set an exception in the Tokenizer so that it does not
split this word?

Thanks in advance!









Re: solrcloud replicas not in sync

2017-05-24 Thread Erick Erickson
By default, enough closed log files will be kept to hold the last 100
documents indexed. This is for "peer sync" purposes. Say replica1 goes
offline for a bit. When it comes back online, if it's fallen behind by
no more than 100 docs, the docs are replayed from another replica's
tlog.

Having such tiny tlogs is kind of unusual. My guess is that your
ingestion rate is quite low. Every time a hard commit happens, a new
tlog is opened up and the old one is closed. Having such tiny tlogs
implies that you are getting one or a few documents per autocommit
interval, so each tlog contains just a few docs. There's nothing wrong
with that, mind you, so it's not a problem.

When do log files get deleted? It Depends (tm). In the non-CDCR case,
if the most recent N closed tlogs contain 100 or more documents, the
tlogs older than N are deleted.

In the CDCR case, the above condition must be true _and_ the docs in
tlogs older than N must have been transmitted to the target cluster.

Best,
Erick
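To make that retention rule concrete for Webster's numbers: at one or two documents per hard commit, each closed tlog holds only a doc or two, so on the order of 50-100 closed tlogs must accumulate before the newest N of them cover the last 100 documents; only tlogs older than that N become candidates for deletion, and with CDCR they must additionally have been transmitted to the target cluster first.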

On Wed, May 24, 2017 at 8:27 AM, Webster Homer  wrote:
> The tlog sizes are strange
> In the case of the collection where we had issues with the replicas the
> tlog sizes are 740 bytes and 938 bytes on the target side and the same on
> the source side. There are a lot of them on the source side, when do tlog
> files get deleted?
>
>
>
> On Tue, May 23, 2017 at 12:52 PM, Erick Erickson 
> wrote:
>
>> This is all quite strange. Optimize (BTW, it's rarely
>> necessary/desirable on an index that changes, despite its name)
>> shouldn't matter here. CDCR forwards the raw documents to the target
>> cluster.
>>
>> Ample time indeed. With a soft commit of 15 seconds, that's your
>> window (with some slop for how long CDCR takes).
>>
>> If you do a search and sort by your timestamp descending, what do you
>> see on the target cluster? And when you are indexing and CDCR is
>> running, your target cluster solr logs should show updates coming in.
>> Mostly checking if the data is even getting to the target cluster
>> here.
>>
>> Also check the tlogs on the source cluster. By "check" here I just
>> mean "are they reasonable size", and "reasonable" should be very
>> small. The tlogs are the "queue" that CDCR uses to store docs before
>> forwarding to the target cluster, so this is just a sanity check. If
>> they're huge, then CDCR is not forwarding anything to the target
>> cluster.
>>
>> It's also vaguely possible that
>> IgnoreCommitOptimizeUpdateProcessorFactory is interfering, if so it's
>> a bug and should be reported as a JIRA. If you remove that on the
>> target cluster, does the behavior change?
>>
>> I'm mystified here as you can tell.
>>
>> Best,
>> Erick
>>
>> On Tue, May 23, 2017 at 10:12 AM, Webster Homer 
>> wrote:
>> > We see a pretty consistent issue where the replicas show in the admin
>> > console as not current, indicating that our auto commit isn't committing. In
>> > one case we loaded the data to the source, cdcr replicated it to the
>> > targets and we see the source and the target as having current = false. It
>> > is searchable so the soft commits are happening. We turned off data
>> loading
>> > to investigate this issue, and the replicas are still not current after 3
>> > days. So there should have been ample time to catch up.
>> > This is our autoCommit
>> > <autoCommit>
>> >   <maxDocs>25000</maxDocs>
>> >   <maxTime>${solr.autoCommit.maxTime:30}</maxTime>
>> >   <openSearcher>false</openSearcher>
>> > </autoCommit>
>> >
>> > This is our autoSoftCommit
>> > <autoSoftCommit>
>> >   <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
>> > </autoSoftCommit>
>> > neither property, solr.autoCommit.maxTime or solr.autoSoftCommit.maxTime
>> > are set.
>> >
>> > We also have an updateChain that calls the
>> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
>> commits.
>> > Could that be the cause of our issue?
>> > <updateRequestProcessorChain ...>
>> >   <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
>> >     <int name="statusCode">200</int>
>> >   </processor>
>> >   ...
>> > </updateRequestProcessorChain>
>> >
>> > We did create a date field to all our collections that defaults to NOW
>> so I
>> > can see that no new data was added, but the replicas don't seem to get
>> the
>> > commit. I assume this is something in our configuration (see above).
>> >
>> > Is there a way to determine when the last commit occurred?
>> >
>> > I believe that the one replica got out of sync due to an admin running an
>> > optimize while cdcr was still running.
>> > That was one collection, but it looks like we are missing commits on most
>> > of our collections.
>> >
>> > Any help would be greatly appreciated!
>> >
>> > Thanks,
>> > Webster Homer
>> >
>> > On Mon, May 22, 2017 at 4:12 PM, Erick Erickson wrote:
>> >
>> >> You can ping individual replicas by addressing to a specific replica
>> >> and setting distrib=false, something like
>> >>
>> >>  http://SOLR_NODE:port/solr/collection1_shard1_replica1/
>> >> query?distrib=false&q=..
>> >>
>> >> But one thing to check first is that you've committed. I'd:
>> >> 1> turn off indexing on the source cluster.
>> >> 2> wait until the CDCR had caught up (if necessary).

Re: solrcloud replicas not in sync

2017-05-24 Thread Webster Homer
The tlog sizes are strange.
In the case of the collection where we had issues with the replicas, the
tlog sizes are 740 bytes and 938 bytes on the target side, and the same on
the source side. There are a lot of them on the source side; when do tlog
files get deleted?



On Tue, May 23, 2017 at 12:52 PM, Erick Erickson 
wrote:

> This is all quite strange. Optimize (BTW, it's rarely
> necessary/desirable on an index that changes, despite its name)
> shouldn't matter here. CDCR forwards the raw documents to the target
> cluster.
>
> Ample time indeed. With a soft commit of 15 seconds, that's your
> window (with some slop for how long CDCR takes).
>
> If you do a search and sort by your timestamp descending, what do you
> see on the target cluster? And when you are indexing and CDCR is
> running, your target cluster solr logs should show updates coming in.
> Mostly checking if the data is even getting to the target cluster
> here.
>
> Also check the tlogs on the source cluster. By "check" here I just
> mean "are they reasonable size", and "reasonable" should be very
> small. The tlogs are the "queue" that CDCR uses to store docs before
> forwarding to the target cluster, so this is just a sanity check. If
> they're huge, then CDCR is not forwarding anything to the target
> cluster.
>
> It's also vaguely possible that
> IgnoreCommitOptimizeUpdateProcessorFactory is interfering, if so it's
> a bug and should be reported as a JIRA. If you remove that on the
> target cluster, does the behavior change?
>
> I'm mystified here as you can tell.
>
> Best,
> Erick
>
> On Tue, May 23, 2017 at 10:12 AM, Webster Homer 
> wrote:
> > We see a pretty consistent issue where the replicas show in the admin
> > console as not current, indicating that our auto commit isn't committing. In
> > one case we loaded the data to the source, cdcr replicated it to the
> > targets and we see the source and the target as having current = false. It
> > is searchable so the soft commits are happening. We turned off data
> loading
> > to investigate this issue, and the replicas are still not current after 3
> > days. So there should have been ample time to catch up.
> > This is our autoCommit
> > <autoCommit>
> >   <maxDocs>25000</maxDocs>
> >   <maxTime>${solr.autoCommit.maxTime:30}</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > This is our autoSoftCommit
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
> > </autoSoftCommit>
> > Neither property (solr.autoCommit.maxTime or solr.autoSoftCommit.maxTime)
> > is set.
> >
> > We also have an updateChain that calls the
> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
> > commits.
> > Could that be the cause of our problem?
> >   <updateRequestProcessorChain name="...">
> >     <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
> >       <int name="statusCode">200</int>
> >     </processor>
> >     <!-- ... -->
> >   </updateRequestProcessorChain>
> >
> > We did create a date field to all our collections that defaults to NOW
> so I
> > can see that no new data was added, but the replicas don't seem to get
> the
> > commit. I assume this is something in our configuration (see above).
> >
> > Is there a way to determine when the last commit occurred?
> >
> > I believe that the one replica got out of sync due to an admin running an
> > optimize while cdcr was still running.
> > That was one collection, but it looks like we are missing commits on most
> > of our collections.
> >
> > Any help would be greatly appreciated!
> >
> > Thanks,
> > Webster Homer
> >
> > On Mon, May 22, 2017 at 4:12 PM, Erick Erickson wrote:
> >
> >> You can ping individual replicas by addressing to a specific replica
> >> and setting distrib=false, something like
> >>
> >>  http://SOLR_NODE:port/solr/collection1_shard1_replica1/
> >> query?distrib=false&q=..
> >>
> >> But one thing to check first is that you've committed. I'd:
> >> 1> turn off indexing on the source cluster.
> >> 2> wait until the CDCR had caught up (if necessary).
> >> 3> issue a hard commit on the target
> >> 4> _then_ see if the counts were what is expected.
> >>
> >> Due to the fact that autocommit settings can fire at different clock
> >> times even for replicas on the same shard, it's easier to track
> >> whether it's a transient issue. The other thing I've seen people do is
> >> have a timestamp on the docs set to NOW (there's an update processor
> >> that can do this). Then when you check for consistency you can use
> >> fq=timestamp:[* TO NOW - (some interval significantly longer than your
> >> autocommit interval)].
> >>
> >> bq: Is there a way to recover when a shard has inconsistent replicas.
> >> If I use the delete replica API call to delete one of them and then use
> add
> >> replica to create it from scratch will it auto-populate from the other
> >> replica in the shard?
> >>
> >> Yes. Whenever you ADDREPLICA it'll catch itself up from the leader
> >> before becoming active. It'll have to copy the _entire_ index from the
> >> leader, so you'll see network traffic spike.
> >>
> >> Best,
> >> Erick
> >>
>> On Mon, May 22, 2017 at 1:41 PM, Webster Homer wrote:

Re: solrcloud replicas not in sync

2017-05-24 Thread Erick Erickson
I wouldn't rely on the "current" flag in the admin UI as an indicator.
As long as your numDocs and the like match I'd say it's a UI issue.
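
For anyone who wants to script that check, here's a minimal SolrJ sketch
of the distrib=false per-replica count check mentioned earlier in the
thread (the hosts and replica core names below are made up; substitute
your own):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class ReplicaCountCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical replica URLs: point these at the cores of one shard.
        String[] replicaUrls = {
            "http://solr1:8983/solr/collection1_shard1_replica1",
            "http://solr2:8983/solr/collection1_shard1_replica2"
        };
        for (String url : replicaUrls) {
            try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
                SolrQuery q = new SolrQuery("*:*");
                q.set("distrib", "false"); // ask only this core, no fan-out
                q.setRows(0);              // we only need the count
                long numFound = client.query(q).getResults().getNumFound();
                System.out.println(url + " -> numFound=" + numFound);
            }
        }
    }
}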

Best,
Erick

On Wed, May 24, 2017 at 8:15 AM, Webster Homer  wrote:
> We see data in the target clusters. CDCR replication is working. We first
> noticed the current=false flag on the target replicas, but since I started
> looking I see it on the source too.
>
>
> I have removed the IgnoreCommitOptimizeUpdateProcessorFactory from our
> update processor chain and did two data loads to different collections.
> These collections are part of our development system; they are not
> configured to use cdcr; they are directly loaded by our data load. The ETL
> to our solrs uses the /update/json request handler and does not send
> commits. These collections mirror our production collections and have 2
> shards with 2 replicas. I see the situation where the replicas are marked
> current=false which should not happen if autoCommit was working correctly.
> The last load was yesterday at 5pm and I didn't check until this morning
> where I found bb-catalog-material_shard1_replica1 (the leader) was not
> current, but the other was. The last modified date on the leader was
> 2017-05-23T22:44:54.618Z.
>
> My modified autoCommit:
> <autoCommit>
>   <maxTime>${solr.autoCommit.maxTime:60}</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxTime>${solr.autoSoftCommit.maxTime:6}</maxTime>
> </autoSoftCommit>
>
> The last indexed record from a search matches up with the above time. For
> this test, the numDocs are the same between the two replicas. I think the
> soft commit is working. Why wouldn't both replicas be current after so many
> hours?
> We are using solr 6.2 fyi. I expect to upgrade to solr 6.6 when it becomes
> available
>
> Thanks,
> Webster
>
> On Tue, May 23, 2017 at 12:52 PM, Erick Erickson 
> wrote:
>
>> This is all quite strange. Optimize (BTW, it's rarely
>> necessary/desirable on an index that changes, despite its name)
>> shouldn't matter here. CDCR forwards the raw documents to the target
>> cluster.
>>
>> Ample time indeed. With a soft commit of 15 seconds, that's your
>> window (with some slop for how long CDCR takes).
>>
>> If you do a search and sort by your timestamp descending, what do you
>> see on the target cluster? And when you are indexing and CDCR is
>> running, your target cluster solr logs should show updates coming in.
>> Mostly checking if the data is even getting to the target cluster
>> here.
>>
>> Also check the tlogs on the source cluster. By "check" here I just
>> mean "are they reasonable size", and "reasonable" should be very
>> small. The tlogs are the "queue" that CDCR uses to store docs before
>> forwarding to the target cluster, so this is just a sanity check. If
>> they're huge, then CDCR is not forwarding anything to the target
>> cluster.
>>
>> It's also vaguely possible that
>> IgnoreCommitOptimizeUpdateProcessorFactory is interfering, if so it's
>> a bug and should be reported as a JIRA. If you remove that on the
>> target cluster, does the behavior change?
>>
>> I'm mystified here as you can tell.
>>
>> Best,
>> Erick
>>
>> On Tue, May 23, 2017 at 10:12 AM, Webster Homer 
>> wrote:
>> > We see a pretty consistent issue where the replicas show in the admin
>> > console as not current, indicating that our auto commit isn't committing.
>> In
>> > one case we loaded the data to the source, cdcr replicated it to the
>> > targets and we see the source and the target as having current = false.
>> It
>> > is searchable so the soft commits are happening. We turned off data
>> loading
>> > to investigate this issue, and the replicas are still not current after 3
>> > days. So there should have been ample time to catch up.
>> > This is our autoCommit
>> > <autoCommit>
>> >   <maxDocs>25000</maxDocs>
>> >   <maxTime>${solr.autoCommit.maxTime:30}</maxTime>
>> >   <openSearcher>false</openSearcher>
>> > </autoCommit>
>> >
>> > This is our autoSoftCommit
>> > <autoSoftCommit>
>> >   <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
>> > </autoSoftCommit>
>> > Neither property (solr.autoCommit.maxTime or solr.autoSoftCommit.maxTime)
>> > is set.
>> >
>> > We also have an updateChain that calls the
>> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
>> > commits.
>> > Could that be the cause of our problem?
>> >   <updateRequestProcessorChain name="...">
>> >     <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
>> >       <int name="statusCode">200</int>
>> >     </processor>
>> >     <!-- ... -->
>> >   </updateRequestProcessorChain>
>> >
>> > We did create a date field to all our collections that defaults to NOW
>> so I
>> > can see that no new data was added, but the replicas don't seem to get
>> the
>> > commit. I assume this is something in our configuration (see above).
>> >
>> > Is there a way to determine when the last commit occurred?
>> >
>> > I believe that the one replica got out of sync due to an admin running an
>> > optimize while cdcr was still running.
>> > That was one collection, but it looks like we are missing commits on most
>> > of our collections.
>> >
>> > Any help would be greatly appreciated!
>> >
>> > Thanks,
>> > Webster Homer
>> >
>> > On Mon, May 22, 2017 at 4:12 PM, Erick Erickson wrote:

Re: solrcloud replicas not in sync

2017-05-24 Thread Webster Homer
We see data in the target clusters. CDCR replication is working. We first
noticed the current=false flag on the target replicas, but since I started
looking I see it on the source too.


I have removed the IgnoreCommitOptimizeUpdateProcessorFactory from our
update processor chain and did two data loads to different collections.
These collections are part of our development system; they are not
configured to use cdcr; they are directly loaded by our data load. The ETL
to our solrs uses the /update/json request handler and does not send
commits. These collections mirror our production collections and have 2
shards with 2 replicas. I see the situation where the replicas are marked
current=false which should not happen if autoCommit was working correctly.
The last load was yesterday at 5pm and I didn't check until this morning
where I found bb-catalog-material_shard1_replica1 (the leader) was not
current, but the other was. The last modified date on the leader was
2017-05-23T22:44:54.618Z.
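
One place such a timestamp is exposed is the Luke handler (e.g.
http://host:8983/solr/CORE/admin/luke?show=index), whose "lastModified"
entry reports when the index was last written, i.e. roughly the time of
the last hard commit. A rough SolrJ sketch of reading it (the host and
core name below are made up):

import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class LastCommitTime {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://solr1:8983/solr/bb-catalog-material_shard1_replica1").build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("show", "index"); // index-level info is enough
            NamedList<Object> rsp = client.request(
                new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/luke", params));
            NamedList<?> index = (NamedList<?>) rsp.get("index");
            System.out.println("lastModified = " + index.get("lastModified"));
        }
    }
}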

My modified autoCommit:
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:60}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:6}</maxTime>
</autoSoftCommit>

The last indexed record from a search matches up with the above time. For
this test, the numDocs are the same between the two replicas. I think the
soft commit is working. Why wouldn't both replicas be current after so many
hours?
We are using solr 6.2 fyi. I expect to upgrade to solr 6.6 when it becomes
available

Thanks,
Webster

On Tue, May 23, 2017 at 12:52 PM, Erick Erickson 
wrote:

> This is all quite strange. Optimize (BTW, it's rarely
> necessary/desirable on an index that changes, despite its name)
> shouldn't matter here. CDCR forwards the raw documents to the target
> cluster.
>
> Ample time indeed. With a soft commit of 15 seconds, that's your
> window (with some slop for how long CDCR takes).
>
> If you do a search and sort by your timestamp descending, what do you
> see on the target cluster? And when you are indexing and CDCR is
> running, your target cluster solr logs should show updates coming in.
> Mostly checking if the data is even getting to the target cluster
> here.
>
> Also check the tlogs on the source cluster. By "check" here I just
> mean "are they reasonable size", and "reasonable" should be very
> small. The tlogs are the "queue" that CDCR uses to store docs before
> forwarding to the target cluster, so this is just a sanity check. If
> they're huge, then CDCR is not forwarding anything to the target
> cluster.
>
> It's also vaguely possible that
> IgnoreCommitOptimizeUpdateProcessorFactory is interfering, if so it's
> a bug and should be reported as a JIRA. If you remove that on the
> target cluster, does the behavior change?
>
> I'm mystified here as you can tell.
>
> Best,
> Erick
>
> On Tue, May 23, 2017 at 10:12 AM, Webster Homer 
> wrote:
> > We see a pretty consistent issue where the replicas show in the admin
> > console as not current, indicating that our auto commit isn't committing.
> In
> > one case we loaded the data to the source, cdcr replicated it to the
> > targets and we see the source and the target as having current = false.
> It
> > is searchable so the soft commits are happening. We turned off data
> loading
> > to investigate this issue, and the replicas are still not current after 3
> > days. So there should have been ample time to catch up.
> > This is our autoCommit
> > <autoCommit>
> >   <maxDocs>25000</maxDocs>
> >   <maxTime>${solr.autoCommit.maxTime:30}</maxTime>
> >   <openSearcher>false</openSearcher>
> > </autoCommit>
> >
> > This is our autoSoftCommit
> > <autoSoftCommit>
> >   <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
> > </autoSoftCommit>
> > Neither property (solr.autoCommit.maxTime or solr.autoSoftCommit.maxTime)
> > is set.
> >
> > We also have an updateChain that calls the
> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
> > commits.
> > Could that be the cause of our problem?
> >   <updateRequestProcessorChain name="...">
> >     <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
> >       <int name="statusCode">200</int>
> >     </processor>
> >     <!-- ... -->
> >   </updateRequestProcessorChain>
> >
> > We did create a date field to all our collections that defaults to NOW
> so I
> > can see that no new data was added, but the replicas don't seem to get
> the
> > commit. I assume this is something in our configuration (see above).
> >
> > Is there a way to determine when the last commit occurred?
> >
> > I believe that the one replica got out of sync due to an admin running an
> > optimize while cdcr was still running.
> > That was one collection, but it looks like we are missing commits on most
> > of our collections.
> >
> > Any help would be greatly appreciated!
> >
> > Thanks,
> > Webster Homer
> >
> > On Mon, May 22, 2017 at 4:12 PM, Erick Erickson wrote:
> >
> >> You can ping individual replicas by addressing to a specific replica
> >> and setting distrib=false, something like
> >>
> >>  http://SOLR_NODE:port/solr/collection1_shard1_replica1/
> >> query?distrib=false&q=..
> >>
> >> But one thing to check first is that you've committed. I'd:
> >> 1> turn off indexing on the source cluster.

Re: solr 6 at scale

2017-05-24 Thread Toke Eskildsen
Shawn Heisey  wrote:
> On 5/24/2017 3:44 AM, Toke Eskildsen wrote:
>> It is relatively easy to downgrade to an earlier release within the
>> same major version. We have not switched to 6.5.1 simply because we
>> have no pressing need for it - Solr 6.3 works well for us.

> That strikes me as a little bit dangerous, unless your indexes are very
> static.  The Lucene index format does occasionally change in minor
> versions.

Err.. Okay? Thank you for that. I was under the impression that the index 
format was fixed (modulo critical bugs) for major versions. This will change 
our approach to updating.

Apologies for the confusion,
Toke


Re: How to handle nested documents in solr (SolrJ)

2017-05-24 Thread Erick Erickson
I would ask if you need nested documents at all. If you can denormalize
the docs it's often much easier. In your case I can think of several
options:
1> just index a separate field for each subject. Solr handles a couple
of hundred fields with ease.
student id : 123
student name : john
maths: 90
English :95

2> combine them into a single multiValued field, one token per subject:
student id : 123
student name : john
marks: math_90 english_95

You have one "marks" field and each token has its associated score.
Probably want to left-pad if you want to do range queries here, e.g.
math_009, math_098, math_100. That way you can express [math_009 TO
math_020] and have it "do the right thing".
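
A quick SolrJ sketch of that single-field approach (the collection name
and field names are invented; "marks" would be a multiValued string
field):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexStudent {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/students").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "123");
            doc.addField("student_name", "john");
            // %03d left-pads to three digits so lexicographic ranges work
            doc.addField("marks", String.format("english_%03d", 95));
            doc.addField("marks", String.format("maths_%03d", 90));
            client.add(doc);
            client.commit();
        }
    }
}

A query like marks:english_095 (or marks:[maths_009 TO maths_020]) then
matches the whole flat document, so there is no separate "return the
parent for a child match" step at all.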

3> Use payloads for the scores.

Best,
Erick

On Wed, May 24, 2017 at 2:26 AM, Rick Leir  wrote:
> Prasad,
>
> Gee, you get confusion from a google search for:
>
> nested documents
> site:mail-archives.apache.org/mod_mbox/lucene-solr-user/
>
> https://www.google.ca/search?safe=strict&q=nested+documents+site%3Amail-archives.apache.org%2Fmod_mbox%2Flucene-solr-user%2F&oq=nested+documents+site%3Amail-archives.apache.org%2Fmod_mbox%2Flucene-solr-user%2F&gs_l=serp.3...34316.37762.0.37969.10.10.0.0.0.0.104.678.9j1.10.00...1.1.64.serp..0.0.0.JTf887wWCDM
>
> But my recent posting might help: " Yonick has some good blogs on this."
>
> And Mikhail has an excellent blog:
>
> https://blog.griddynamics.com/how-to-use-block-join-to-improve-search-efficiency-with-nested-documents-in-solr
>
> cheers -- Rick
>
>
> On 2017-05-24 02:53 AM, prasad chowdary wrote:
>>
>> Dear All,
>>
>> I have a requirement that I need to index the documents in solr using Java
>> code.
>>
>> Each document contains a sub-document like below (it's just for
>> understanding my question).
>>
>>
>> student id : 123
>> student name : john
>> marks :
>> maths: 90
>> English :95
>>
>> student id : 124
>> student name : rack
>> marks :
>> maths: 80
>> English :96
>>
>> etc...
>>
>> So, as shown above each document contains one child document i.e marks.
>>
>> Actually I don't need any joins or anything. My requirement is:
>>
>> if I query "English:95" ,it should return the complete document ,i.e child
>> along with parent like below
>>
>> student id : 123
>> student name : john
>> marks :
>> maths: 90
>> English :95
>>
>> and also if I query "student id : 123" , it should return the whole
>> document
>> same as above.
>>
>> Currently I am able to get the child along with parent for child match by
>> using the extendedResults option.
>>
>> But I am not able to get the child for a parent match.
>
>


Re: Too many logs recorded in zookeeper.out

2017-05-24 Thread Noriyuki TAKEI
Hi

Thanks for your reply. I'll try to join the ZooKeeper mailing list!!






Re: solr 6 at scale

2017-05-24 Thread Shawn Heisey
On 5/24/2017 3:44 AM, Toke Eskildsen wrote:
> It is relatively easy to downgrade to an earlier release within the
> same major version. We have not switched to 6.5.1 simply because we
> have no pressing need for it - Solr 6.3 works well for us. 

That strikes me as a little bit dangerous, unless your indexes are very
static.  The Lucene index format does occasionally change in minor
versions.  I do not know whether the index format changed from 6.3 to
6.5, but if it did, then 6.3 would not be able to read index segments
built by 6.5, which might mean that it would refuse to read the entire
index.

Thanks,
Shawn



Re: JSON facet performance for aggregations

2017-05-24 Thread Yonik Seeley
On Mon, May 8, 2017 at 11:27 AM, Yonik Seeley  wrote:
> I opened https://issues.apache.org/jira/browse/SOLR-10634 to address
> this performance issue.

OK, this has been committed.
A quick test shows about a 30x speedup when faceting on a
string/numeric docvalues field with 100K unique values and doing a
simple aggregation on another numeric field (and with limit:-1, i.e. all
buckets returned).
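
For anyone wanting to try it, a request of the shape described above looks
roughly like this from SolrJ (the collection and field names are invented):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class JsonFacetAgg {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0); // only the facet block is of interest
            // terms facet over a high-cardinality docvalues field, all
            // buckets (limit:-1), with a sum() aggregation per bucket
            q.set("json.facet",
                "{ by_id : { type : terms, field : id_s, limit : -1,"
              + "  facet : { total : \"sum(amount_d)\" } } }");
            System.out.println(client.query(q).getResponse().get("facets"));
        }
    }
}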

-Yonik


Re: solr 6 at scale

2017-05-24 Thread Toke Eskildsen
On Tue, 2017-05-23 at 17:27 -0700, Nawab Zada Asad Iqbal wrote:
> Anyone using solr.6.x for multi-terabytes index size: how did you
> decide which version to upgrade to?

We are still stuck with 4.10 for our 70TB+ (split in 83 shards) index,
due to some custom hacks that have not yet been ported. If not for the
hacks, we would probably have switched to Solr 6.x by now, as we would
very much like some of the newer features.

We do have a 2.8TB (split in 6 shards, 2 replicas) index running on
Solr 6.3, which was the newest version at installation time. As long as
there are known stable and well-functioning releases within the same
major version, we are fine with picking the latest release and see how
it goes: It is relatively easy to downgrade to an earlier release
within the same major version. We have not switched to 6.5.1 simply
because we have no pressing need for it - Solr 6.3 works well for us.

I guess it depends quite a bit on your need for stability. We are a
library and uptime is only "best effort".
-- 
Toke Eskildsen, Royal Danish Library


Re: How to handle nested documents in solr (SolrJ)

2017-05-24 Thread Rick Leir

Prasad,

Gee, you get confusion from a google search for:

nested documents 
site:mail-archives.apache.org/mod_mbox/lucene-solr-user/


https://www.google.ca/search?safe=strict&q=nested+documents+site%3Amail-archives.apache.org%2Fmod_mbox%2Flucene-solr-user%2F&oq=nested+documents+site%3Amail-archives.apache.org%2Fmod_mbox%2Flucene-solr-user%2F&gs_l=serp.3...34316.37762.0.37969.10.10.0.0.0.0.104.678.9j1.10.00...1.1.64.serp..0.0.0.JTf887wWCDM

But my recent posting might help: " Yonick has some good blogs on this."

And Mikhail has an excellent blog:

https://blog.griddynamics.com/how-to-use-block-join-to-improve-search-efficiency-with-nested-documents-in-solr

cheers -- Rick

On 2017-05-24 02:53 AM, prasad chowdary wrote:

Dear All,

I have a requirement that I need to index the documents in solr using Java
code.

Each document contains a sub-document like below (it's just for
understanding my question).


student id : 123
student name : john
marks :
maths: 90
English :95

student id : 124
student name : rack
marks :
maths: 80
English :96

etc...

So, as shown above each document contains one child document i.e marks.

Actually I don't need any joins or anything. My requirement is:

if I query "English:95" ,it should return the complete document ,i.e child
along with parent like below

student id : 123
student name : john
marks :
maths: 90
English :95

and also if I query "student id : 123" , it should return the whole document
same as above.

Currently I am able to get the child along with parent for child match by
using the extendedResults option.

But I am not able to get the child for a parent match.