Re: Restarting SolrCloud that is taking realtime updates

2016-11-25 Thread Jichi Guo
Thanks so much for the very quick and detailed explanation, Erick!

According to the following page, it seems numRecordsToKeep cannot be set too
high, because the peer-synced updates must fit in a single POST:

https://support.lucidworks.com/hc/en-us/articles/203842143-Recovery-times-while-restarting-a-SolrCloud-node

It seems your 1> or 3> approaches would be the most practical when the number
of updated documents is high.

Thanks again, and happy Thanksgiving!

  
Sent from Nylas N1, the extensible, open source mail client.

  
On Nov 25 2016, at 2:33 pm, Erick Erickson  wrote:  

> [...]

Re: Restarting SolrCloud that is taking realtime updates

2016-11-25 Thread Erick Erickson
First, get out of thinking about the replication API, things like
DISABLEPOLL and the like when in SolrCloud mode. The
"old style" replication is used under the control of the synching
strategy. Unless you've configured master/slave sections of
your solrconfig.xml files and somehow dealt with the leader
changing (who should be polled?), I'm pretty sure this is a total red herring.

As for the rest, that's just the way it works. In SolrCloud, the
raw documents are forwarded from the leader to the followers.
Outside of a node going into recovery, replication isn't used
at all.

However, when a node goes into recovery (which by definition it will
when the core is reloaded or the Solr instance is restarted) then
the replica checks with the leader to see if it's "too far" out of date. The
default "too far" is 100 docs, although this can be changed by setting
the updatelog numRecordsToKeep to a higher number in solrconfig.xml.
If the replica is too far out of date, a full index replication is done, which
is what you're observing.

If the number of updates the leader has received is < 100
(or numRecordsToKeep), the leader sends the raw documents to the
follower from its update log and there is no "old style" replication there
at all.
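
For reference, numRecordsToKeep lives on the update log definition in
solrconfig.xml; a minimal sketch (the value 500 is purely illustrative):

<updateLog>
  <!-- transaction log location -->
  <str name="dir">${solr.ulog.dir:}</str>
  <!-- how many recent updates the leader keeps for peer sync before a
       restarted replica must fall back to full index replication -->
  <int name="numRecordsToKeep">500</int>
</updateLog>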

So, the net-net here is that your choices are limited:

1> stop indexing while doing the restart.

2> bump numRecordsToKeep to some larger number that
 you expect not to be exceeded for the time it takes to
 restart each node.

3> live with the full index replication in this situation.

I'll add parenthetically that having to redeploy plugins and the like
_should_ be a relatively rare operation, and it seems (at least from
the outside) to be a perfectly reasonable thing to do in a maintenance
window when index updates are disabled.

You can also consider using collection aliasing to switch back and
forth between two collections so you can manipulate the current
cold one and, when you're satisfied, switch the alias.
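
For example, the Collections API call to repoint an alias at the freshly
prepared collection looks like this (alias and collection names are
illustrative):

http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=myalias&collections=collection2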

Best,
Erick

On Fri, Nov 25, 2016 at 1:40 PM, Jichi Guo  wrote:
> [...]


Restarting SolrCloud that is taking realtime updates

2016-11-25 Thread Jichi Guo
Hi,

  

I am seeking the best practice for restarting a sharded SolrCloud that is
taking search traffic as well as realtime updates, without downtime.

When I deploy new customized Solr plugins, for example, it requires
restarting the whole SolrCloud cluster.

I am testing Solr 6.2.1 with 4 shards.

I find that when SolrCloud is taking updates and I restart any Solr node
(no matter whether it is a leader, the overseer, or an ordinary replica),
the restarted node re-syncs its whole index from its leader, i.e., it
re-downloads the whole index and then drops its old data.

The only way I have found to avoid this full re-sync is to temporarily
disable updates, for example by invoking disableReplication on the leader
node before restarting.

  

Additionally, I didn't find a way to temporarily pause Solr replication to a
single replica. Before sharding, we could use disablePoll to disable
replication on a slave. After sharding, disabling replication on the leader
node is the only way I found, which pauses not only the replication to the
one node being restarted but also replication to all nodes in the same shard.

  

The procedure becomes more complex if I want to restart a leader node: I need
to first manually trigger a leader failover through rebalancing, then disable
replication on the new leader, then restart the old leader, and finally
re-enable replication on the new leader.

As you can see, it seems to take many steps to restart SolrCloud node by node
this way.

I am not sure whether this is the best procedure for restarting a whole
SolrCloud that is taking realtime updates.

  

Thanks!

  
Sent from Nylas N1, the extensible, open source mail client.



Re: Search opening hours

2016-11-25 Thread O. Klein
Thank you for your reply, David.

Yes, I ended up using a DateRangeField. The downside is that it needs frequent
updates; luckily that is not an issue for my use case.
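
For reference, a minimal sketch of modeling opening hours this way (field
and type names are illustrative, not from this thread):

<fieldType name="dateRange" class="solr.DateRangeField"/>
<field name="opening_hours" type="dateRange" indexed="true" stored="true" multiValued="true"/>

Each document indexes one range per open interval, e.g.
[2016-11-28T09:00 TO 2016-11-28T17:00], and a query for a single instant,
such as opening_hours:"2016-11-28T10:30:00Z", matches documents open at that
moment; the concrete dates are also why the field needs frequent refreshing.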

BTW, how could I abuse DateRangeField for non-date data?




david.w.smi...@gmail.com wrote
> I just saw this conversation now.  I didn't read every word but I have to
> ask immediately: does DateRangeField address your needs?
> https://cwiki.apache.org/confluence/display/solr/Working+with+Dates  It
> was
> introduced in 5.0.
> 
> On Wed, Nov 16, 2016 at 4:59 AM O. Klein  wrote:
> 
>> Above implementation was too slow, so wondering if Solr 6 with all its
>> new
>> features provides a better solution to tackle operating hours. Especially
>> dealing with different timezones.
>>
>> Any thoughts?
>>
>>
>>
>>
>>
>>
> -- 
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
> http://www.solrenterprisesearchserver.com







Re: Wildcard searches with space in TextField/StrField

2016-11-25 Thread Ahmet Arslan
Hi,

You could try dropping the wildcard approach altogether:

1) Employ EdgeNGramFilter at index time.
2) Use plain searches at query time.
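
A minimal sketch of such a field type (names and gram sizes are
illustrative):

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "john" is indexed as j, jo, joh, john, so plain term queries match prefixes -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>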

Ahmet



On Friday, November 25, 2016 4:59 PM, Sandeep Khanzode 
 wrote:
[...]

Re: Query parser behavior with AND and negative clause

2016-11-25 Thread Sandeep Khanzode
WORKS:
+{!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'} +(*:* -{!field f=dateRange2 op=Contains v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'})


+ConstantScore(IntersectsPrefixTreeFilter(fieldName=dateRange1,queryShape=[2016-11-22T12:01 TO 2016-11-22T13:59:00],detailLevel=9,prefixGridScanLevel=7)) +(MatchAllDocsQuery(*:*) -ConstantScore(ContainsPrefixTreeFilter(fieldName=dateRange2,queryShape=[2016-11-22T12:01 TO 2016-11-22T13:59:00],detailLevel=9,multiOverlappingIndexedShapes=true)))




DOES NOT WORK :
{!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'} AND (*:* -{!field f=dateRange2 op=Contains v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'})


ConstantScore(IntersectsPrefixTreeFilter(fieldName=dateRange1,queryShape=[2016-11-22T12:01 TO 2016-11-22T13:59:00],detailLevel=9,prefixGridScanLevel=7))
 SRK 


On Thursday, November 24, 2016 9:02 PM, Alessandro Benedetti 
 wrote:
 

 Hey Sandeep,
can you debug the query ( debugQuery=on) and show how the query is parsed ?

Cheers



On Thu, Nov 24, 2016 at 12:38 PM, Sandeep Khanzode <
sandeep_khanz...@yahoo.com.invalid> wrote:

> Hi Erick,
> The example record contains:
> dateRange1 = [2016-11-22T18:00:00Z TO 2016-11-22T20:00:00Z], [2016-11-22T06:00:00Z TO 2016-11-22T14:00:00Z]
> dateRange2 = [2016-11-22T12:00:00Z TO 2016-11-22T14:00:00Z]
> The first query works ... which means that it is able to EXCLUDE this
> record from the result (since the negative dateRange2 clause should return
> false). Whereas the second query should also work but it does not and
> actually pulls the record in the result.
> WORKS:
> +{!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO
> 2016-11-22T13:59:00Z]'} +(*:* -{!field f=dateRange2 op=Contains
> v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'})
>
>
> DOES NOT WORK :
> {!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO
> 2016-11-22T13:59:00Z]'} AND (*:* -{!field f=dateRange2 op=Contains
> v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'})
>  SRK
>
>    On Tuesday, November 22, 2016 9:41 PM, Erick Erickson <
> erickerick...@gmail.com> wrote:
>
>
>  _How_ does it "not work"? You haven't told us what you expect .vs.
> what you get back.
>
> Plus a sample doc that that violates your expectations (just the
> dateRange field) would
> also help.
>
> Best,
> Erick
>
> On Tue, Nov 22, 2016 at 4:23 AM, Sandeep Khanzode
>  wrote:
> > Hi,
> > I have a simple query that should intersect with dateRange1 and NOT be
> contained within dateRange2. I have tried the following options:
> >
> > WORKS:
> > +{!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO
> 2016-11-22T13:59:00Z]'} +(*:* -{!field f=dateRange2 op=Contains
> v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'})
> >
> >
> > DOES NOT WORK :
> > {!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO
> 2016-11-22T13:59:00Z]'} AND (*:* -{!field f=dateRange2 op=Contains
> v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'})
> >
> > Why?
> >
> > WILL NOT WORK (because of the negative clause at the top level?):
> > {!field f=dateRange1 op=Intersects v='[2016-11-22T12:01:00Z TO
> 2016-11-22T13:59:00Z]'} AND -{!field f=dateRange2 op=Contains
> v='[2016-11-22T12:01:00Z TO 2016-11-22T13:59:00Z]'}
> >
> >
> > SRK
>
>
>
>



-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


   

Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-25 Thread Marek Ščevlík
I forgot to mention that I am creating a jar file alongside a running Solr 6.3
instance, to which I am hoping to attach from Java via the SolrDispatchFilter
to get at the cores, so that I can then work with the data in code.


2016-11-25 19:31 GMT+01:00 Marek Ščevlík :

> [...]


Re: Data Import Request Handler isolated into its own project - any suggestions?

2016-11-25 Thread Marek Ščevlík
Hi Daniel. Thanks for the reply. I wonder whether it is still possible, with
the release of Solr 6.3, to get hold of the running Jetty server instance that
is part of the solution? I found some code for previous versions where it was
captured as follows, and one could then obtain the cores of a running Solr
instance:

SolrDispatchFilter solrDispatchFilter =
    (SolrDispatchFilter) jetty.getDispatchFilter().getFilter();

I was trying to implement it this way, but it is not working out very well
now. I can't seem to get the Jetty server object for the running instance. I
tried several combinations but none seemed to work.

Can you perhaps point me in the right direction?

Perhaps you may know more than I do at the moment.


Any help would be great.


Thanks a lot
Regards Marek Scevlik



2016-11-18 15:53 GMT+01:00 Davis, Daniel (NIH/NLM) [C] :

> Marek,
>
> I've wanted to do something like this in the past as well.  However, a
> rewrite that supports the same XML syntax might be better.   There are
> several problems with the design of the Data Import Handler that make it
> not quite suitable:
>
> - Not designed for Multi-threading
> - Bad implementation of XPath
>
> Another issue is that one of the big advantages of Data Import Handler
> goes away at this point, which is that it is hosted within Solr, and has a
> UI for testing within the Solr admin.
>
> A better open-source Java solution might be to connect Solr with Apache
> Camel - http://camel.apache.org/solr.html.
>
> If you are not tied absolutely to pure open-source, and freemium products
> will do, then you might look at Pentaho Spoon and Kettle.   Although Talend
> is much more established in the market, I find Pentaho's XML-based ETL a
> bit easier to integrate as a developer, and unit test and such.   Talend
> does better when you have a full infrastructure set up, but then the
> attention required to unit tests and Git integration seems over the top.
>
> Another powerful way to get things done, depending on what you are
> indexing, is to use LogStash and couple that with Document processing
> chains.   Many of our projects benefit from having a single RDBMS view,
> perhaps a materialized view, that is used for the index.   LogStash does
> just fine here, pulling from the RDBMS and posting each row to Solr.  The
> hierarchical execution of Data Import Handler is very nice, but this can
> often be handled on the RDBMS side by creating a view, maybe using
> functions to provide some rows.   Many RDBMS systems also support
> federation and the import of XML from files, so that this brings XML
> processing into the picture.
>
> Hoping this helps,
>
> Dan Davis, Systems/Applications Architect (Contractor),
> Office of Computer and Communications Systems,
> National Library of Medicine, NIH
>
>
>
>
> -Original Message-
> From: Marek Ščevlík [mailto:mscev...@codenameprojects.com]
> Sent: Friday, November 18, 2016 9:29 AM
> To: solr-user@lucene.apache.org
> Subject: Data Import Request Handler isolated into its own project - any
> suggestions?
>
> Hello. My name is Marek Scevlik.
>
>
>
> Currently I am working for a small company where we are interested in
> implementing your Solr 6.3 search engine.
>
>
>
> We are hoping to take out from the original source package the Data Import
> Request Handler into its own project and create a usable .jar file out of
> it.
>
>
>
> It should then serve as a tool that would allow us to connect to a remote
> server and return data to our other application, which would use the
> returned data.
>
>
>
> What do you think? Would anything like this be possible? To isolate the
> Data Import Request Handler into its own standalone project?
>
>
>
> If we can achieve this, we won't mind sharing this new feature with the
> community.
>
>
>
> I realize this is a first email and may lead to several hundred more, so to
> start, my request is very simple and not highly detailed, but I am sure you
> realize it may become quite complex.
>
>
>
> So I wonder if anyone replies.
>
>
>
> Thanks a lot for any replies and further info or guidance.
>
>
>
>
>
> Thanks.
>
> Regards Marek Scevlik
>


Re: AW: AW: Resync after restart

2016-11-25 Thread Pushkar Raste
Did you index any documents while the node was being restarted? There was an
issue introduced by the IndexFingerprint comparison; check SOLR-9310. I am
not sure if the fix made it to Solr 6.2.

On Nov 25, 2016 3:51 AM, "Arkadi Colson"  wrote:

> [...]


Re: Solr 6 Performance Suggestions

2016-11-25 Thread Max Bridgewater
Thanks folks. It looks like the sweet spot where I get comparable results
is at 30 concurrent threads. It progressively degrades from there as I
increase the number of concurrent threads in the test script.

This made me think that something is configured in Tomcat (Solr 4) that is
not set comparably in Solr 6. The only thing I found that would make sense is
the connector's maximum number of threads, which we have set to 800 for
Tomcat. However, in jetty.xml, maxThreads is set to 5. Not sure whether
these two maxThreads settings have the same effect.
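
For reference, the Jetty thread pool is configured in jetty.xml with a stanza
along these lines (a sketch; the exact structure and defaults vary by Solr
version):

<!-- illustrative jetty.xml fragment, not copied from any specific release -->
<New id="threadPool" class="org.eclipse.jetty.util.thread.QueuedThreadPool">
  <!-- upper bound on concurrent request-handling threads, roughly
       comparable to the Tomcat connector's maxThreads -->
  <Set name="maxThreads">10000</Set>
</New>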

I thought about Yonik's suggestion a little bit. Where I am scratching my
head is that if specific kinds of queries were more expensive than others,
shouldn't this be reflected even at 30 concurrent threads?

Anyway, still digging.

On Wed, Nov 23, 2016 at 9:56 AM, Walter Underwood 
wrote:

> I recently ran benchmarks on 4.10.4 and 6.2.1 and found very little
> difference in query performance.
>
> This was with 8 million documents (homework problems) from production. I
> used query logs from
> production. The load is a constant number of requests per minute from 100
> threads. CPU usage
> is under 50% in order to avoid congestion. The benchmarks ran for 100
> minutes.
>
> Measuring median and 95th percentile, the times were within 10%. I think
> that is within the
> repeatability of the benchmark. A different number of GCs could make that
> difference.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Nov 23, 2016, at 8:14 AM, Bram Van Dam  wrote:
> >
> > On 22/11/16 15:34, Prateek Jain J wrote:
> >> I am not sure, but I heard in one of the discussions that you can't
> >> migrate directly from Solr 4 to Solr 6. It has to be incremental, like
> >> Solr 4 to Solr 5 and then to Solr 6. I might be wrong but it is worth
> >> trying.
> >
> > Ideally the index needs to be upgraded using the IndexUpgrader.
> >
> > Something like this should do the trick:
> >
> > java -cp lucene-core-6.0.0.jar:lucene-backward-codecs-6.0.0.jar
> > org.apache.lucene.index.IndexUpgrader /path/to/index
> >
> > - Bram
>
>


Re: Import from S3

2016-11-25 Thread Tom Evans
On Fri, Nov 25, 2016 at 7:23 AM, Aniket Khare  wrote:
> You can use Solr DIH for indexing csv data into solr.
> https://wiki.apache.org/solr/DataImportHandler
>

Seems overkill when you can simply post CSV data to the UpdateHandler,
using either the post tool:

https://cwiki.apache.org/confluence/display/solr/Post+Tool

Or by doing it manually however you wish:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-CSVFormattedIndexUpdates
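
For example, a CSV file can be posted straight to the update handler with
curl (collection name and file are illustrative):

curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
     -H 'Content-type: text/csv' --data-binary @data.csv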

Cheers

Tom


Using Solr CDCR with HdfsDirectoryFactory

2016-11-25 Thread ZHOU Ran (SAFRAN IDENTITY AND SECURITY)
Hi All,

I have followed the guide "Cross Data Center Replication (CDCR)" and got my
source collection replicated to the target. I then tried to use HDFS as
storage for both Solr clusters, but failed with the following error message:

ERROR: Failed to create collection 'collection11' due to: 
{192.168.5.95:8983_solr=org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error
 from server at http://192.168.5.95:8983/solr: Error CREATEing SolrCore 
'collection11_shard1_replica1': Unable to create core 
[collection11_shard1_replica1] Caused by: Solr instance is not configured with 
the cdcr update log.}

Actually Solr with HDFS works for me. In the configuration for CDCR, there is 
one block:

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog class="solr.CdcrUpdateLog">
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>



And I know that if HdfsDirectoryFactory is used, the updateHandler will
initialize the updateLog with the class HdfsUpdateLog. Is this the reason
CDCR does not work with HDFS, since the updateLog then cannot be initialized
as a CdcrUpdateLog?

Thanks in advance for your help!

Best Regards

Ran ZHOU
Software Engineer

T +49 (0) 234 97 87 59
E ran.z...@safrangroup.com

L-1 Identity Solutions AG
Universitätsstrasse 160 I 44801 BOCHUM - GERMANY
www.safran-identity-security.com



Re: Wildcard searches with space in TextField/StrField

2016-11-25 Thread Sandeep Khanzode
Hi All,

Can someone please assist with this query?

My data consists of:
1.] John Doe
2.] John V. Doe
3.] Johnson Doe
4.] Johnson V. Doe
5.] John Smith
6.] Johnson V. Smith
7.] Matt Doe
8.] Matt V. Doe
9.] Matt Doe
10.] Matthew V. Doe
11.] Matthew Smith

12.] Matthew V. Smith

Querying ...
(a) Matt/Matt* should return records 7-12
(b) John/John* should return records 1-6
(c) Doe/Doe* should return records 1-4, 7-10
(d) Smith/Smith* should return records 5,6,11,12
(e) V/V./V.*/V* should return records 2,4,6,8,10,12
(f) V. Doe/V. Doe* should return records 2,4,8,10
(g) John V/John V./John V*/John V.* should return record 2
(h) V. Smith/V. Smith* should return records 6,12

Any guidance would be appreciated!
I have tried ComplexPhraseQueryParser, but with a single token like Doe*, there 
is an error that indicates that the query is being identified as a prefix 
query. I may be missing something in the syntax.
 SRK 

On Thursday, November 24, 2016 11:16 PM, Sandeep Khanzode 
 wrote:
 

 Hi All, Erick,
Please suggest. Would like to use the ComplexPhraseQueryParser for searching 
text (with wildcard) that may contain special characters.
For example ...
John* should match John V. Doe
John* should match Johnson Smith
Bruce-Willis* should match Bruce-Willis
V.* should match John V. F. Doe
SRK 

    On Thursday, November 24, 2016 5:57 PM, Sandeep Khanzode 
 wrote:
 

 Hi,
This is the typical TextField with ...
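
Since the field definition was stripped by the mail archive, here is a sketch
of a typical whitespace-tokenized TextField (illustrative only, not the
poster's actual config):

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <!-- illustrative: splits on whitespace only, so "John V. Doe"
       becomes the tokens John / V. / Doe -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
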
SRK 

    On Thursday, November 24, 2016 1:38 AM, Reth RM  
wrote:
 

 what is the fieldType of those records?  
On Tue, Nov 22, 2016 at 4:18 AM, Sandeep Khanzode 
 wrote:

Hi Erick,
I gave this a try. 
These are my results. There is a record with "John D. Smith", and another named 
"John Doe".

1.] {!complexphrase inOrder=true}name:"John D.*" ... does not fetch any 
results. 

2.] {!complexphrase inOrder=true}name:"John D*" ... fetches both results. 



Second observation: There is a record with "John D Smith"
1.] {!complexphrase inOrder=true}name:"John*" ... does not fetch any results. 

2.] {!complexphrase inOrder=true}name:"John D*" ... fetches that record. 

3.] {!complexphrase inOrder=true}name:"John D S*" ... fetches that record. 

SRK

    On Sunday, November 13, 2016 7:43 AM, Erick Erickson 
 wrote:


 Right, for that kind of use case you want complexPhraseQueryParser, see:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-ComplexPhraseQueryParser

Best,
Erick

On Sat, Nov 12, 2016 at 9:39 AM, Sandeep Khanzode
 wrote:
> Thanks, Erick.
>
> I am actually not trying to use the String field (prefer a TextField here).
> But, in my comparisons with TextField, it seems that something like phrase
> matching with whitespace and wildcard (like, 'my do*' or say, 'my dog*', or
> say, 'my dog has*') can only be accomplished with a string type field,
> especially because, with a WhitespaceTokenizer in TextField, the space will
> be lost, and all tokens will be individually considered. Am I missing
> something?
>
> SRK
>
>
> On Friday, November 11, 2016 10:05 PM, Erick Erickson
>  wrote:
>
>
> You have to query text and string fields differently, that's just the
> way it works. The problem is getting the query string through the
> parser as a _single_ token or as multiple tokens.
>
> Let's say you have a string field with the "a b" example. You have a
> single token
> a b that starts at offset 0.
>
> But with a text field, you have two tokens,
> a at position 0
> b at position 1
>
> But when the query parser sees "a b" (without quotes) it splits it
> into two tokens, and only the text field has both tokens so the string
> field won't match.
>
> OTOH, when the query parser sees "a\ b" it passes this through as a
> single token, which only matches the string field as there's no
> _single_ token "a b" in the text field.
>
> But a more interesting question is why you want to search this way.
> String fields are intended for keywords, machine-generated IDs and the
> like. They're pretty useless for searching anything except
> 1> exact tokens
> 2> prefixes
>
> While if you have "my dog has fleas" in a string field, you _can_
> search "*dog*" and get a hit but the performance is poor when you get
> a large corpus. Performance for "my*" will be pretty good though.
>
> In all this sounds like an XY problem, what's the use-case you're
> trying to solve?
>
> Best,
> Erick
>
>
>
> On Thu, Nov 10, 2016 at 10:11 PM, Sandeep Khanzode
>  wrote:
>> Hi Erick, Reth,
>>
>> The 'a\ b*' as well as the q.op=AND approach worked (successfully) only
>> for StrField for me.
>>
>> Any attempt at creating a 'a\ b*' for a TextField does not match any
>> documents. The parsedQuery in debug mode does show 'field:a b*'. I am sure
>> there are 

Re: Zookeeper version

2016-11-25 Thread Novin Novin
Thanks guys.

On Thu, 24 Nov 2016 at 17:03 Erick Erickson  wrote:

> Well, 3.4.6 gets the most testing, so if you want to upgrade it's at
> your own risk.
>
> See: https://issues.apache.org/jira/browse/SOLR-8724, there are
> problems with 3.4.8 in the Solr context for instance.
>
> There's currently an open Zookeeper JIRA for 3.4.9 that, when fixed,
> Solr will try to upgrade to.
>
> Best,
> Erick
>
> On Thu, Nov 24, 2016 at 2:12 AM, Novin Novin  wrote:
> > Hi Guys,
> >
> > I found in solr docs that "Solr currently uses Apache ZooKeeper v3.4.6".
> > Can I use higher version or I have to use 3.4.6 zookeeper.
> >
> > Thanks in advance,
> > Novin
>


Re: AW: AW: Resync after restart

2016-11-25 Thread Arkadi Colson

I am using SolrCloud on version 6.2.1. I will upgrade to 6.3.0 next week.

This is the current config for numVersionBuckets:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>

Are you saying that I should not use the config below on SolrCloud?

  

  18.75
  05:00:00
  15
  30

  

Br,
Arkadi


On 24-11-16 17:46, Erick Erickson wrote:

Hold on. Are you using SolrCloud or not? There is a lot of talk here
about masters and slaves, then you say "I always add slaves with the
collection API", collections are a SolrCloud construct.

It sounds like you're mixing the two. You should _not_ configure
master/slave replication parameters with SolrCloud. Take a look at the
sample configs

And you haven't told us what version of Solr you're using, we can
infer a relatively recent one because of the high number you have for
numVersionBuckets, but that's guessing.

If you are _not_ in SolrCloud, then maybe:
https://issues.apache.org/jira/browse/SOLR-9036 is relevant.

Best,
Erick

On Thu, Nov 24, 2016 at 3:10 AM, Arkadi Colson  wrote:

This is the code from the master node. All configs are the same on all nodes.
I always add slaves with the collection API. Is there another place to look
for this part of the config?



On 24-11-16 12:02, Michael Aleythe, Sternwald wrote:

You need to change this on the master node. The part of the config you
pasted here, looks like it is from the slave node.

-Ursprüngliche Nachricht-
Von: Arkadi Colson [mailto:ark...@smartbit.be]
Gesendet: Donnerstag, 24. November 2016 11:56
An: solr-user@lucene.apache.org
Betreff: Re: AW: Resync after restart

Hi Michael

Thanks for the quick response! The line does not exist in my config. So
can I assume that the default configuration is to not replicate at startup?

 
   
 18.75
 05:00:00
 15
 30
   
 

Any other ideas?


On 24-11-16 11:49, Michael Aleythe, Sternwald wrote:

Hi Arkadi,

you need to remove the line "startup"
from your ReplicationHandler-config in solrconfig.xml ->
https://wiki.apache.org/solr/SolrReplication.

Greetings
Michael

-Ursprüngliche Nachricht-
Von: Arkadi Colson [mailto:ark...@smartbit.be]
Gesendet: Donnerstag, 24. November 2016 09:26
An: solr-user 
Betreff: Resync after restart

Hi

Almost every time when restarting a solr instance the index is replicated
completely. Is there a way to avoid this somehow? The index currently has a
size of about 17GB.
Some advice here would be great.

99% of the config is defaul:

<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
  <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
</updateLog>
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

If you need more info, just let me know...

Thx!
Arkadi