Re: CDCR - how to deal with the transaction log files

2017-07-10 Thread Xie, Sean
My guess is that it's a documentation gap.

I ran a test where I turned off CDCR using action=stop while continuously
sending documents to the source cluster. The tlog files kept growing, and after
a hard commit a new tlog file was created while the old files stayed there
forever. As soon as I turned CDCR back on, the documents started to replicate
to the target.

After a hard commit and a scheduled log synchronizer run, the old tlog files
were deleted.

Btw, I’m running on 6.5.1.
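
For reference, these are roughly the calls I used in that test (host, port and
collection name are placeholders for my setup):

    # stop / restart CDCR on the source collection
    curl "http://localhost:8983/solr/MY_COLLECTION/cdcr?action=STOP"
    curl "http://localhost:8983/solr/MY_COLLECTION/cdcr?action=START"

    # watch replication progress and the tlog totals
    curl "http://localhost:8983/solr/MY_COLLECTION/cdcr?action=QUEUES"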



On 7/10/17, 10:57 PM, "Varun Thacker"  wrote:

Yeah it just seems weird that you would need to disable the buffer on the
source cluster though.

The docs say "Replicas do not need to buffer updates, and it is recommended
to disable buffer on the target SolrCloud" which means the source should
have it enabled.

But the fact that it's working for you proves otherwise . What version of
Solr are you running? I'll try reproducing this problem at my end and see
if it's a documentation gap or a bug.

On Mon, Jul 10, 2017 at 7:15 PM, Xie, Sean  wrote:

> Yes. Documents are being sent to target. Monitoring the output from
> “action=queues”, depending your settings, you will see the documents
> replication progress.
>
> On the other hand, if enable the buffer, the lastprocessedversion is
> always returning -1. Reading the source code, the CdcrUpdateLogSynchroizer
> does not continue to do the clean if this value is -1.
>
> Sean
>
> On 7/10/17, 5:18 PM, "Varun Thacker"  wrote:
>
> After disabling the buffer are you still seeing documents being
> replicated
> to the target cluster(s) ?
>
> On Mon, Jul 10, 2017 at 1:07 PM, Xie, Sean  wrote:
>
> > After several experiments and observation, finally make it work.
> > The key point is you have to also disablebuffer on source cluster. I
> don’t
> > know why in the wiki, it didn’t mention it, but I figured this out
> through
> > the source code.
> > Once disablebuffer on source cluster, the lastProcessedVersion will
> become
> > a position number, and when there is hard commit, the old unused
> tlog files
> > get deleted.
> >
> > Hope my finding can help other users who experience the same issue.
> >
> >
> > On 7/10/17, 9:08 AM, "Michael McCarthy" 
> wrote:
> >
> > We have been experiencing this same issue for months now, with
> version
> > 6.2.  No solution to date.
> >
> > -Original Message-
> > From: Xie, Sean [mailto:sean@finra.org]
> > Sent: Sunday, July 09, 2017 9:41 PM
> > To: solr-user@lucene.apache.org
> > Subject: [EXTERNAL] Re: CDCR - how to deal with the transaction
> log
> > files
> >
> > Did another round of testing, the tlog on target cluster is
> cleaned up
> > once the hard commit is triggered. However, on source cluster, the
> tlog
> > files stay there and never gets cleaned up.
> >
> > Not sure if there is any command to run manually to trigger the
> > updateLogSynchronizer. The updateLogSynchronizer already set at run
> at
> > every 10 seconds, but seems it didn’t help.
> >
> > Any help?
> >
> > Thanks
> > Sean
> >
> > On 7/8/17, 1:14 PM, "Xie, Sean"  wrote:
> >
> > I have monitored the CDCR process for a while, the updates
> are
> > actively sent to the target without a problem. However the tlog size
> and
> > files count are growing everyday, even when there is 0 updates to
> sent, the
> > tlog stays there:
> >
> > Following is from the action=queues command, and you can see
> after
> > about a month or so running days, the total transaction are reaching
> to
> > 140K total files, and size is about 103G.
> >
> > 
> > 
> > 0
> > 465
> > 
> > 
> > 
> > 
> > 0
> > 2017-07-07T23:19:09.655Z
> > 
> > 
> > 
> > 102740042616
> > 140809
> > stopped
> > 
> >
> > Any help on it? Or do I need to configure something else?
> The CDCR
> > configuration is pretty much following the wiki:
> >
> > On target:
> >
> >   
> > 
> >   disabled
>  

RE: ZooKeeper transaction logs

2017-07-10 Thread Xie, Sean
Not sure if I can answer the question; we previously used the manual command to 
clean up the logs and scheduled it with a Linux cron daemon. On Windows, there 
should be a corresponding tool to do the same.
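
Roughly what we had before moving to Exhibitor, in case it helps (the paths and
the classpath are examples, adjust them to your install):

    # keep only the 3 most recent snapshots and their transaction logs
    java -cp "zookeeper-3.4.6.jar:lib/*" \
        org.apache.zookeeper.server.PurgeTxnLog /path/to/dataLogDir /path/to/dataDir -n 3

    # or schedule the helper script that ships with ZooKeeper, e.g. nightly via cron
    0 3 * * * /opt/zookeeper/bin/zkCleanup.sh -n 3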

We currently use Netflix Exhibitor to manage the ZooKeeper instances, and it 
works pretty well.

Sean


On 7/10/17, 6:43 AM, "Avi Steiner"  wrote:

I did use this class using batch file (from Windows server), but it still 
does not remove anything. I sent number of snapshots to keep as 3, but I have 
more in my folder.

-Original Message-
From: Xie, Sean [mailto:sean@finra.org]
Sent: Sunday, July 9, 2017 7:33 PM
To: solr-user@lucene.apache.org
Subject: Re: ZooKeeper transaction logs

You can try run purge manually see if it is working: 
org.apache.zookeeper.server.PurgeTxnLog.

And use a cron job to do clean up.


On 7/9/17, 11:07 AM, "Avi Steiner"  wrote:

Hello

I'm using Zookeeper 3.4.6

The ZK log data folder keeps growing with transaction logs files 
(log.*).

I set the following in zoo.cfg:
autopurge.purgeInterval=1
autopurge.snapRetainCount=3
dataDir=..\\data

Per ZK log, it reads those parameters:

2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@78] 
- autopurge.snapRetainCount set to 3
2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@79] 
- autopurge.purgeInterval set to 1

It also says that cleanup process is running:

2017-07-09 17:44:59,792 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2017-07-09 17:44:59,823 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.

But actually nothing is deleted.
Every service restart, new file is created.

The only parameter I managed to change is preAllocSize, which means the 
minimum size per file. The default is 64MB. I changed it to 10KB only for 
watching the effect.



This email and any attachments thereto may contain private, 
confidential, and privileged material for the sole use of the intended 
recipient. Any review, copying, or distribution of this email (or any 
attachments thereto) by others is strictly prohibited. If you are not the 
intended recipient, please contact the sender immediately and permanently 
delete the original and any copies of this email and any attachments thereto.



Confidentiality Notice::  This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information.  If 
you are not an intended recipient or an authorized agent of an intended 
recipient, you are hereby notified that any dissemination, distribution or 
copying of the information contained in or transmitted with this e-mail is 
unauthorized and strictly prohibited.  If you have received this email in 
error, please notify the sender by replying to this message and permanently 
delete this e-mail, its attachments, and any copies of it immediately.  You 
should not retain, copy or use this e-mail or any attachment for any purpose, 
nor disclose all or any part of the contents to any other person. Thank you.





Re: CDCR - how to deal with the transaction log files

2017-07-10 Thread Varun Thacker
Yeah it just seems weird that you would need to disable the buffer on the
source cluster though.

The docs say "Replicas do not need to buffer updates, and it is recommended
to disable buffer on the target SolrCloud" which means the source should
have it enabled.

But the fact that it's working for you proves otherwise. What version of
Solr are you running? I'll try reproducing this problem at my end and see
if it's a documentation gap or a bug.

On Mon, Jul 10, 2017 at 7:15 PM, Xie, Sean  wrote:

> Yes. Documents are being sent to target. Monitoring the output from
> “action=queues”, depending your settings, you will see the documents
> replication progress.
>
> On the other hand, if enable the buffer, the lastprocessedversion is
> always returning -1. Reading the source code, the CdcrUpdateLogSynchroizer
> does not continue to do the clean if this value is -1.
>
> Sean
>
> On 7/10/17, 5:18 PM, "Varun Thacker"  wrote:
>
> After disabling the buffer are you still seeing documents being
> replicated
> to the target cluster(s) ?
>
> On Mon, Jul 10, 2017 at 1:07 PM, Xie, Sean  wrote:
>
> > After several experiments and observation, finally make it work.
> > The key point is you have to also disablebuffer on source cluster. I
> don’t
> > know why in the wiki, it didn’t mention it, but I figured this out
> through
> > the source code.
> > Once disablebuffer on source cluster, the lastProcessedVersion will
> become
> > a position number, and when there is hard commit, the old unused
> tlog files
> > get deleted.
> >
> > Hope my finding can help other users who experience the same issue.
> >
> >
> > On 7/10/17, 9:08 AM, "Michael McCarthy" 
> wrote:
> >
> > We have been experiencing this same issue for months now, with
> version
> > 6.2.  No solution to date.
> >
> > -Original Message-
> > From: Xie, Sean [mailto:sean@finra.org]
> > Sent: Sunday, July 09, 2017 9:41 PM
> > To: solr-user@lucene.apache.org
> > Subject: [EXTERNAL] Re: CDCR - how to deal with the transaction
> log
> > files
> >
> > Did another round of testing, the tlog on target cluster is
> cleaned up
> > once the hard commit is triggered. However, on source cluster, the
> tlog
> > files stay there and never gets cleaned up.
> >
> > Not sure if there is any command to run manually to trigger the
> > updateLogSynchronizer. The updateLogSynchronizer already set at run
> at
> > every 10 seconds, but seems it didn’t help.
> >
> > Any help?
> >
> > Thanks
> > Sean
> >
> > On 7/8/17, 1:14 PM, "Xie, Sean"  wrote:
> >
> > I have monitored the CDCR process for a while, the updates
> are
> > actively sent to the target without a problem. However the tlog size
> and
> > files count are growing everyday, even when there is 0 updates to
> sent, the
> > tlog stays there:
> >
> > Following is from the action=queues command, and you can see
> after
> > about a month or so running days, the total transaction are reaching
> to
> > 140K total files, and size is about 103G.
> >
> > 
> > 
> > 0
> > 465
> > 
> > 
> > 
> > 
> > 0
> > 2017-07-07T23:19:09.655Z
> > 
> > 
> > 
> > 102740042616
> > 140809
> > stopped
> > 
> >
> > Any help on it? Or do I need to configure something else?
> The CDCR
> > configuration is pretty much following the wiki:
> >
> > On target:
> >
> >   
> > 
> >   disabled
> > 
> >   
> >
> >   
> > 
> > 
> >   
> >
> >   
> > 
> >   cdcr-processor-chain
> > 
> >   
> >
> >   
> > 
> >   ${solr.ulog.dir:}
> > 
> > 
> >   ${solr.autoCommit.maxTime:18}
> >   false
> > 
> >
> > 
> >   ${solr.autoSoftCommit.maxTime:3}<
> /maxTime>
> > 
> >   
> >
> > On source:
> >   
> > 
> >   ${TargetZk}
> >   MY_COLLECTION
> >   MY_COLLECTION
> > 
> >
> > 
> >   1
> >   1000
> >   128
> > 
> >
> > 
> >   6
> > 

Re: CDCR - how to deal with the transaction log files

2017-07-10 Thread Xie, Sean
Yes. Documents are being sent to the target. Monitoring the output from 
“action=queues”, depending on your settings, you will see the document 
replication progress.

On the other hand, if the buffer is enabled, the lastprocessedversion always 
returns -1. Reading the source code, the CdcrUpdateLogSynchronizer does not 
continue with the cleanup if this value is -1.

Sean

On 7/10/17, 5:18 PM, "Varun Thacker"  wrote:

After disabling the buffer are you still seeing documents being replicated
to the target cluster(s) ?

On Mon, Jul 10, 2017 at 1:07 PM, Xie, Sean  wrote:

> After several experiments and observation, finally make it work.
> The key point is you have to also disablebuffer on source cluster. I don’t
> know why in the wiki, it didn’t mention it, but I figured this out through
> the source code.
> Once disablebuffer on source cluster, the lastProcessedVersion will become
> a position number, and when there is hard commit, the old unused tlog 
files
> get deleted.
>
> Hope my finding can help other users who experience the same issue.
>
>
> On 7/10/17, 9:08 AM, "Michael McCarthy"  wrote:
>
> We have been experiencing this same issue for months now, with version
> 6.2.  No solution to date.
>
> -Original Message-
> From: Xie, Sean [mailto:sean@finra.org]
> Sent: Sunday, July 09, 2017 9:41 PM
> To: solr-user@lucene.apache.org
> Subject: [EXTERNAL] Re: CDCR - how to deal with the transaction log
> files
>
> Did another round of testing, the tlog on target cluster is cleaned up
> once the hard commit is triggered. However, on source cluster, the tlog
> files stay there and never gets cleaned up.
>
> Not sure if there is any command to run manually to trigger the
> updateLogSynchronizer. The updateLogSynchronizer already set at run at
> every 10 seconds, but seems it didn’t help.
>
> Any help?
>
> Thanks
> Sean
>
> On 7/8/17, 1:14 PM, "Xie, Sean"  wrote:
>
> I have monitored the CDCR process for a while, the updates are
> actively sent to the target without a problem. However the tlog size and
> files count are growing everyday, even when there is 0 updates to sent, 
the
> tlog stays there:
>
> Following is from the action=queues command, and you can see after
> about a month or so running days, the total transaction are reaching to
> 140K total files, and size is about 103G.
>
> 
> 
> 0
> 465
> 
> 
> 
> 
> 0
> 2017-07-07T23:19:09.655Z
> 
> 
> 
> 102740042616
> 140809
> stopped
> 
>
> Any help on it? Or do I need to configure something else? The CDCR
> configuration is pretty much following the wiki:
>
> On target:
>
>   
> 
>   disabled
> 
>   
>
>   
> 
> 
>   
>
>   
> 
>   cdcr-processor-chain
> 
>   
>
>   
> 
>   ${solr.ulog.dir:}
> 
> 
>   ${solr.autoCommit.maxTime:18}
>   false
> 
>
> 
>   ${solr.autoSoftCommit.maxTime:3}
> 
>   
>
> On source:
>   
> 
>   ${TargetZk}
>   MY_COLLECTION
>   MY_COLLECTION
> 
>
> 
>   1
>   1000
>   128
> 
>
> 
>   6
> 
>   
>
>   
> 
>   ${solr.ulog.dir:}
> 
> 
>   ${solr.autoCommit.maxTime:18}
>   false
> 
>
> 
>   ${solr.autoSoftCommit.maxTime:3}
> 
>   
>
> Thanks.
> Sean
>
> On 7/8/17, 12:10 PM, "Erick Erickson" 
> wrote:
>
> This should not be the case if you are actively sending
> updates to the
> target cluster. The tlog is used to store unsent updates, so
> if the
> connection is broken for some time, the target cluster will
> have a
> chance to catch up.

Re: CDCR - how to deal with the transaction log files

2017-07-10 Thread Varun Thacker
After disabling the buffer are you still seeing documents being replicated
to the target cluster(s) ?

On Mon, Jul 10, 2017 at 1:07 PM, Xie, Sean  wrote:

> After several experiments and observation, finally make it work.
> The key point is you have to also disablebuffer on source cluster. I don’t
> know why in the wiki, it didn’t mention it, but I figured this out through
> the source code.
> Once disablebuffer on source cluster, the lastProcessedVersion will become
> a position number, and when there is hard commit, the old unused tlog files
> get deleted.
>
> Hope my finding can help other users who experience the same issue.
>
>
> On 7/10/17, 9:08 AM, "Michael McCarthy"  wrote:
>
> We have been experiencing this same issue for months now, with version
> 6.2.  No solution to date.
>
> -Original Message-
> From: Xie, Sean [mailto:sean@finra.org]
> Sent: Sunday, July 09, 2017 9:41 PM
> To: solr-user@lucene.apache.org
> Subject: [EXTERNAL] Re: CDCR - how to deal with the transaction log
> files
>
> Did another round of testing, the tlog on target cluster is cleaned up
> once the hard commit is triggered. However, on source cluster, the tlog
> files stay there and never gets cleaned up.
>
> Not sure if there is any command to run manually to trigger the
> updateLogSynchronizer. The updateLogSynchronizer already set at run at
> every 10 seconds, but seems it didn’t help.
>
> Any help?
>
> Thanks
> Sean
>
> On 7/8/17, 1:14 PM, "Xie, Sean"  wrote:
>
> I have monitored the CDCR process for a while, the updates are
> actively sent to the target without a problem. However the tlog size and
> files count are growing everyday, even when there is 0 updates to sent, the
> tlog stays there:
>
> Following is from the action=queues command, and you can see after
> about a month or so running days, the total transaction are reaching to
> 140K total files, and size is about 103G.
>
> 
> 
> 0
> 465
> 
> 
> 
> 
> 0
> 2017-07-07T23:19:09.655Z
> 
> 
> 
> 102740042616
> 140809
> stopped
> 
>
> Any help on it? Or do I need to configure something else? The CDCR
> configuration is pretty much following the wiki:
>
> On target:
>
>   
> 
>   disabled
> 
>   
>
>   
> 
> 
>   
>
>   
> 
>   cdcr-processor-chain
> 
>   
>
>   
> 
>   ${solr.ulog.dir:}
> 
> 
>   ${solr.autoCommit.maxTime:18}
>   false
> 
>
> 
>   ${solr.autoSoftCommit.maxTime:3}
> 
>   
>
> On source:
>   
> 
>   ${TargetZk}
>   MY_COLLECTION
>   MY_COLLECTION
> 
>
> 
>   1
>   1000
>   128
> 
>
> 
>   6
> 
>   
>
>   
> 
>   ${solr.ulog.dir:}
> 
> 
>   ${solr.autoCommit.maxTime:18}
>   false
> 
>
> 
>   ${solr.autoSoftCommit.maxTime:3}
> 
>   
>
> Thanks.
> Sean
>
> On 7/8/17, 12:10 PM, "Erick Erickson" 
> wrote:
>
> This should not be the case if you are actively sending
> updates to the
> target cluster. The tlog is used to store unsent updates, so
> if the
> connection is broken for some time, the target cluster will
> have a
> chance to catch up.
>
> If you don't have the remote DC online and do not intend to
> bring it
> online soon, you should turn CDCR off.
>
> Best,
> Erick
>
> On Fri, Jul 7, 2017 at 9:35 PM, Xie, Sean 
> wrote:
> > Once enabled CDCR, update log stores an unlimited number of
> entries. This is causing the tlog folder getting bigger and bigger, as well
> as the open files are growing. How can one reduce the number of open files
> and also to reduce the tlog files? If it’s not taken care properly, sooner
> or later the log files size and open file count will exceed the limits.
> >
> > Thanks
> > Sean
> >
> >

Re: Cross DC SolrCloud anti-patterns in presentation shalinmangar/cross-datacenter-replication-in-apache-solr-6

2017-07-10 Thread Arcadius Ahouansou
Hello Shawn.

Thank you very much for the comment.

On 24 June 2017 at 16:14, Shawn Heisey  wrote:

> On 6/24/2017 2:14 AM, Arcadius Ahouansou wrote:
> > Interpretation 1:
>
> ZooKeeper doesn't *need* an odd number of servers, but there's no
> benefit to an even number.  If you have 5 servers, two can go down.  If
> you have 6 servers, you can still only lose two, so you might as well
> just run 5.  You'd have fewer possible points of failure, less power
> usage, and less bandwidth usage.
>
>
About Slide 8 and the odd/even number of nodes...
What I meant is that on Slide 8, if you lose DC1, your cluster will not be
able to recover after DC1 comes back, as there will be no clear majority
and you will have:
-  3 ZK nodes with up-to-date data (that is, DC2+DC3) and
-  3 ZK nodes with out-of-date data (DC1).

But if you had only 2 ZK nodes in DC1, then you could afford to lose any one
of DC1, DC2 or DC3, and the cluster would be able to recover and be OK.


Thank you very much.


Arcadius

-- 
Arcadius Ahouansou
Menelic Ltd | Applied Knowledge Is Power
Office : +441444702101
Mobile: +447908761999
Web: www.menelic.com
---


Re: How to "chain" import handlers: import from DB and from file system

2017-07-10 Thread Susheel Kumar
Use SolrJ if you end up developing an indexer in Java to send documents to
Solr. It's been a long time since I used DIH, but you can give it a try first;
otherwise, as Walter suggested, developing an external indexer is best.
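
A minimal SolrJ sketch of such an indexer (the URL, collection and field names
are just examples):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SimpleIndexer {
      public static void main(String[] args) throws Exception {
        // point the client at the target collection (example URL/collection)
        SolrClient client = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/mycollection").build();

        // one document combining the DB metadata and the extracted file content
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "Title taken from the DB");
        doc.addField("content", "Text extracted from the file");

        client.add(doc);
        client.commit();
        client.close();
      }
    }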

On Sun, Jul 9, 2017 at 6:46 PM, Walter Underwood 
wrote:

> 4. Write an external program that fetches the file, fetches the metadata,
> combines them, and send them to Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jul 9, 2017, at 3:03 PM, Giovanni De Stefano 
> wrote:
> >
> > Hello all,
> >
> > I have to index (and search) data organised as followed: many files on
> the filesystem and each file has extra metadata stored on a DB (the DB
> table has a reference to the file path).
> >
> > I think I should have 1 Solr document per file with fields coming from
> both the DB (through DIH) and from Tika.
> >
> > How do you suggest to proceed?
> >
> > 1. index into different cores and search across cores (I would rather
> not do that but I would be able to reuse “standard” importers)
> > 2. extend the DIH (which one?)
> > 3. implement a custom import handler
> >
> > How would you do it?
> >
> > Developing in Java is not a problem, I would just need some ideas on
> where to start (I have been away from Solr for many years…).
> >
> > Thanks!
> > G.
>
>


Re: How to "chain" import handlers: import from DB and from file system

2017-07-10 Thread Giovanni De Stefano
Thank you guys for your advice!

I would rather take advantage as much as possible of the existing 
handlers/processors.

I just realised that nested entities in DIH are extremely slow: I fixed that 
with a view on the DB (that does a join between the 2 tables).
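
The DIH side of that now looks roughly like this (the data source, view and
column names are made up for illustration):

    <dataConfig>
      <dataSource name="db" driver="org.postgresql.Driver"
                  url="jdbc:postgresql://dbhost/mydb" user="solr" password="secret"/>
      <document>
        <!-- the view already joins the two tables -->
        <entity name="fileMeta" dataSource="db"
                query="SELECT file_path, title, author FROM file_metadata_view">
          <field column="file_path" name="id"/>
          <field column="title" name="title"/>
          <field column="author" name="author"/>
        </entity>
      </document>
    </dataConfig>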

The other thing I have to do is chain the extraction of the file content with 
the DIH: tomorrow I will experiment with the different datasources and 
processors supported by DIH.
I have the feeling I will end up writing a separate service that extracts the 
content and puts it in the DB for faster indexing…

I will report my results here in case others might find them useful.



> On 10 Jul 2017, at 22:06, Walter Underwood  wrote:
> 
> I did this at Netflix with Solr 1.3, read stuff out of various databases and 
> sent it all to Solr. I’m not sure DIH even existed then.
> 
> At Chegg, we have slightly more elaborate system because we have so many 
> collections and data sources. Each content owner writes an “extractor” that 
> makes a JSONL feed with the documents to index. We validate those, then have 
> a common “loader” that reads the JSONL and sends it to Solr with multiple 
> connections. Solr-specific stuff is done in update request processors.
> 
> Document parsing is always in a separate process. I’ve implemented it that 
> way three times with three different parser packages on two engines. Never on 
> Solr, though.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Jul 10, 2017, at 12:40 PM, Allison, Timothy B.  wrote:
>> 
>>> 4. Write an external program that fetches the file, fetches the metadata, 
>>> combines them, and send them to Solr.
>> 
>> I've done this with some custom crawls. Thanks to Erick Erickson, this is a 
>> snap:
>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>> 
>> With the caveat that Tika should really be in a separate vm in production 
>> [1].
>> 
>> [1] 
>> http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
>>  
>> 
> 



Re: mm = 1 and multi-field searches

2017-07-10 Thread Susheel Kumar
How are you specifying multiple fields? Use the qf parameter to specify
multiple fields, e.g.

http://localhost:8983/solr/techproducts/select?indent=on&q=Samsung%20Maxtor%20hard&wt=json&defType=edismax&qf=name%20manu&debug=on&mm=1
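
And if you really want mm=1 applied per field with the per-field queries ORed
together, a sketch using the nested query parser (field names are just
examples; URL-encode as needed):

    q=_query_:"{!edismax qf=name mm=1 v=$qq}" OR _query_:"{!edismax qf=manu mm=1 v=$qq}"
    &qq=Samsung Maxtor hard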


On Mon, Jul 10, 2017 at 4:51 PM, Michael Joyner  wrote:

> Hello all,
>
> How does setting mm = 1 for edismax impact multi-field searches?
>
> We set mm to 1 and get zero results back when specifying multiple fields
> to search across.
>
> Is there a way to set mm = 1 for each field, but to OR the individual
> field searches together?
>
> -Mike/NewsRx
>
>


mm = 1 and multi-field searches

2017-07-10 Thread Michael Joyner

Hello all,

How does setting mm = 1 for edismax impact multi-field searches?

We set mm to 1 and get zero results back when specifying multiple fields 
to search across.


Is there a way to set mm = 1 for each field, but to OR the individual 
field searches together?


-Mike/NewsRx



Re: How to "chain" import handlers: import from DB and from file system

2017-07-10 Thread Walter Underwood
I did this at Netflix with Solr 1.3, read stuff out of various databases and 
sent it all to Solr. I’m not sure DIH even existed then.

At Chegg, we have a slightly more elaborate system because we have so many 
collections and data sources. Each content owner writes an “extractor” that 
makes a JSONL feed with the documents to index. We validate those, then have a 
common “loader” that reads the JSONL and sends it to Solr with multiple 
connections. Solr-specific stuff is done in update request processors.

Document parsing is always in a separate process. I’ve implemented it that way 
three times with three different parser packages on two engines. Never on Solr, 
though.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 10, 2017, at 12:40 PM, Allison, Timothy B.  wrote:
> 
>> 4. Write an external program that fetches the file, fetches the metadata, 
>> combines them, and send them to Solr.
> 
> I've done this with some custom crawls. Thanks to Erick Erickson, this is a 
> snap:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
> 
> With the caveat that Tika should really be in a separate vm in production [1].
> 
> [1] 
> http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
>  
> 



RE: CDCR - how to deal with the transaction log files

2017-07-10 Thread Xie, Sean
After several experiments and observations, I finally made it work. 
The key point is that you also have to run disablebuffer on the source cluster. I 
don't know why the wiki doesn't mention it, but I figured this out through the 
source code. 
Once the buffer is disabled on the source cluster, lastProcessedVersion becomes a 
positive number, and when there is a hard commit, the old unused tlog files get 
deleted.

Hope my findings can help other users who experience the same issue.
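
Concretely, this is roughly what I ran against the source cluster (host and
collection name are placeholders):

    # disable the buffer on the SOURCE collection
    curl "http://source-host:8983/solr/MY_COLLECTION/cdcr?action=DISABLEBUFFER"

    # verify: this should now return a positive version instead of -1
    curl "http://source-host:8983/solr/MY_COLLECTION/cdcr?action=LASTPROCESSEDVERSION"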


On 7/10/17, 9:08 AM, "Michael McCarthy"  wrote:

We have been experiencing this same issue for months now, with version 6.2. 
 No solution to date.

-Original Message-
From: Xie, Sean [mailto:sean@finra.org]
Sent: Sunday, July 09, 2017 9:41 PM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] Re: CDCR - how to deal with the transaction log files

Did another round of testing, the tlog on target cluster is cleaned up once 
the hard commit is triggered. However, on source cluster, the tlog files stay 
there and never gets cleaned up.

Not sure if there is any command to run manually to trigger the 
updateLogSynchronizer. The updateLogSynchronizer already set at run at every 10 
seconds, but seems it didn’t help.

Any help?

Thanks
Sean

On 7/8/17, 1:14 PM, "Xie, Sean"  wrote:

I have monitored the CDCR process for a while, the updates are actively 
sent to the target without a problem. However the tlog size and files count are 
growing everyday, even when there is 0 updates to sent, the tlog stays there:

Following is from the action=queues command, and you can see after 
about a month or so running days, the total transaction are reaching to 140K 
total files, and size is about 103G.



0
465




0
2017-07-07T23:19:09.655Z



102740042616
140809
stopped


Any help on it? Or do I need to configure something else? The CDCR 
configuration is pretty much following the wiki:

On target:

  

  disabled

  

  


  

  

  cdcr-processor-chain

  

  

  ${solr.ulog.dir:}


  ${solr.autoCommit.maxTime:18}
  false



  ${solr.autoSoftCommit.maxTime:3}

  

On source:
  

  ${TargetZk}
  MY_COLLECTION
  MY_COLLECTION



  1
  1000
  128



  6

  

  

  ${solr.ulog.dir:}


  ${solr.autoCommit.maxTime:18}
  false



  ${solr.autoSoftCommit.maxTime:3}

  

Thanks.
Sean

On 7/8/17, 12:10 PM, "Erick Erickson"  wrote:

This should not be the case if you are actively sending updates to 
the
target cluster. The tlog is used to store unsent updates, so if the
connection is broken for some time, the target cluster will have a
chance to catch up.

If you don't have the remote DC online and do not intend to bring it
online soon, you should turn CDCR off.

Best,
Erick

On Fri, Jul 7, 2017 at 9:35 PM, Xie, Sean  
wrote:
> Once enabled CDCR, update log stores an unlimited number of 
entries. This is causing the tlog folder getting bigger and bigger, as well as 
the open files are growing. How can one reduce the number of open files and 
also to reduce the tlog files? If it’s not taken care properly, sooner or later 
the log files size and open file count will exceed the limits.
>
> Thanks
> Sean
>
>

Re: Returning results for multi-word search term

2017-07-10 Thread Erick Erickson
Well, one issue is that  Paddle* Arm* has an implicit OR between the terms. Try

+Paddle* +Arm*

That'll reduce the documents found, although it would find "Paddle
robotic armature" (no such thing, just sayin').

Although another possibility is that you're really sending

some_field:Paddle* Arm*

which is parsed as

some_field:Paddle* default_search_field:Arm*


“Paddle Arm” should find the last two. I suspect you're using the
"string" type for the field you're searching against rather than a
text-based field that tokenizes. You must show us the fieldType of the
field and the results of adding &debug=query to the URL to have a hope
of saying anything more.
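
For comparison, a tokenized field definition along these lines (a sketch; the
field and type names are just examples) is what lets a quoted phrase like
"Paddle Arm" match:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="title" type="text_general" indexed="true" stored="true"/>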

And if you really need phrases and wildcards, see Complex Phrase Query
Parser here: https://lucene.apache.org/solr/guide/6_6/other-parsers.html.
But before going there, I'd figure out what's up with not being able to
search "Paddle Arm" as a phrase, it should certainly do what you're
asking given the right field definition.

Best,
Erick

On Mon, Jul 10, 2017 at 12:10 PM, Miller, William K - Norman, OK -
Contractor  wrote:
> I forgot to mention that I am using Solr 6.5.1 and I am indexing XML files.
> My Solr server is running on a Linux OS.
>
>
>
>
>
>
>
>
>
> ~~~
>
> William Kevin Miller
>
> ECS Federal, Inc.
>
> USPS/MTSC
>
> (405) 573-2158
>
>
>
> From: Miller, William K - Norman, OK - Contractor
> [mailto:william.k.mil...@usps.gov.INVALID]
> Sent: Monday, July 10, 2017 2:03 PM
> To: 'solr-user@lucene.apache.org'
> Subject: Returning results for multi-word search term
>
>
>
> I am trying to return results when using a multi-word term.  I am using
> “Paddle Arm” as my search term(including the quotes).  I know that the field
> that I am querying against has these words together.  If I run the query
> using Paddle* Arm* I get the following results, but I want to get only the
> last two.  I have looked at Fuzzy Searches but that I don’t feel will work
> and I have looked at the Proximity Searches and I get no results back with
> that one whether I use 0,1 or 10.  How can I structure my query to get the
> last items in the below list?
>
>
>
> Paddle Assembly
>
> Paddle
>
> Paddle
>
> Paddle Pneumatic Piping
>
> Paddle
>
> Paddle Assembly
>
> Paddle
>
> Paddle Assembly
>
> Paddle to Bucket Offset Check
>
> Paddle to Bucket Wall
>
> Paddle to Bucket Offset
>
> Paddle
>
> Paddle Assembly Troubleshooting
>
> Paddle Assembly Troubleshooting
>
> Paddle Air Pressure
>
> Paddle Assembly
>
> Paddle
>
> Paddle Stop Adjustment
>
> Paddle Stop
>
> Paddle Assembly
>
> Paddle Assembly
>
> Paddle Vacuum Holes
>
> Paddle Position
>
> Paddle Detection Sensor Adjustment
>
> Paddle Assembly
>
> Paddle
>
> Paddle Assembly
>
> Paddle Stop
>
> Paddle Assembly
>
> Paddle Assembly
>
> Paddle
>
> Paddle Assembly
>
> Paddle Assembly
>
> Paddle Rotary Actuator
>
> Paddle Removal and Replacement
>
> Paddle Assembly
>
> Paddle Removal and Replacement
>
> Paddle Seal Removal and Replacement
>
> Paddle Location
>
> Paddle Location
>
> Paddle Removal Location
>
> Paddle/Belt Speed for Photoeye Inputs
>
> Paddle Arm Spring, Upper Paddle Arm, and Lower Paddle Arm
>
> Paddle Arm Spring, Upper Paddle Arm, and Lower Paddle Arm
>
>
>
>
>
>
>
>
>
> ~~~
>
> William Kevin Miller
>
> ECS Federal, Inc.
>
> USPS/MTSC
>
> (405) 573-2158
>
>


RE: How to "chain" import handlers: import from DB and from file system

2017-07-10 Thread Allison, Timothy B.
>4. Write an external program that fetches the file, fetches the metadata, 
>combines them, and send them to Solr.

I've done this with some custom crawls. Thanks to Erick Erickson, this is a 
snap:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

With the caveat that Tika should really be in a separate vm in production [1].

[1] 
http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf
 



RE: Returning results for multi-word search term

2017-07-10 Thread Miller, William K - Norman, OK - Contractor
I forgot to mention that I am using Solr 6.5.1 and I am indexing XML files.  My 
Solr server is running on a Linux OS.




~~~
William Kevin Miller
[ecsLogo]
ECS Federal, Inc.
USPS/MTSC
(405) 573-2158

From: Miller, William K - Norman, OK - Contractor 
[mailto:william.k.mil...@usps.gov.INVALID]
Sent: Monday, July 10, 2017 2:03 PM
To: 'solr-user@lucene.apache.org'
Subject: Returning results for multi-word search term

I am trying to return results when using a multi-word term.  I am using "Paddle 
Arm" as my search term(including the quotes).  I know that the field that I am 
querying against has these words together.  If I run the query using Paddle* 
Arm* I get the following results, but I want to get only the last two.  I have 
looked at Fuzzy Searches but that I don't feel will work and I have looked at 
the Proximity Searches and I get no results back with that one whether I use 
0,1 or 10.  How can I structure my query to get the last items in the below 
list?

Paddle Assembly

Paddle

Paddle

Paddle Pneumatic Piping

Paddle

Paddle Assembly

Paddle

Paddle Assembly

Paddle to Bucket Offset Check

Paddle to Bucket Wall

Paddle to Bucket Offset

Paddle

Paddle Assembly Troubleshooting

Paddle Assembly Troubleshooting

Paddle Air Pressure

Paddle Assembly

Paddle

Paddle Stop Adjustment

Paddle Stop

Paddle Assembly

Paddle Assembly

Paddle Vacuum Holes

Paddle Position

Paddle Detection Sensor Adjustment

Paddle Assembly

Paddle

Paddle Assembly

Paddle Stop

Paddle Assembly

Paddle Assembly

Paddle

Paddle Assembly

Paddle Assembly

Paddle Rotary Actuator

Paddle Removal and Replacement

Paddle Assembly

Paddle Removal and Replacement

Paddle Seal Removal and Replacement

Paddle Location

Paddle Location

Paddle Removal Location

Paddle/Belt Speed for Photoeye Inputs

Paddle Arm Spring, Upper Paddle Arm, and Lower Paddle Arm

Paddle Arm Spring, Upper Paddle Arm, and Lower Paddle Arm





~~~
William Kevin Miller
[ecsLogo]
ECS Federal, Inc.
USPS/MTSC
(405) 573-2158



Returning results for multi-word search term

2017-07-10 Thread Miller, William K - Norman, OK - Contractor
I am trying to return results when using a multi-word term.  I am using "Paddle 
Arm" as my search term(including the quotes).  I know that the field that I am 
querying against has these words together.  If I run the query using Paddle* 
Arm* I get the following results, but I want to get only the last two.  I have 
looked at Fuzzy Searches but that I don't feel will work and I have looked at 
the Proximity Searches and I get no results back with that one whether I use 
0,1 or 10.  How can I structure my query to get the last items in the below 
list?

Paddle Assembly

Paddle

Paddle

Paddle Pneumatic Piping

Paddle

Paddle Assembly

Paddle

Paddle Assembly

Paddle to Bucket Offset Check

Paddle to Bucket Wall

Paddle to Bucket Offset

Paddle

Paddle Assembly Troubleshooting

Paddle Assembly Troubleshooting

Paddle Air Pressure

Paddle Assembly

Paddle

Paddle Stop Adjustment

Paddle Stop

Paddle Assembly

Paddle Assembly

Paddle Vacuum Holes

Paddle Position

Paddle Detection Sensor Adjustment

Paddle Assembly

Paddle

Paddle Assembly

Paddle Stop

Paddle Assembly

Paddle Assembly

Paddle

Paddle Assembly

Paddle Assembly

Paddle Rotary Actuator

Paddle Removal and Replacement

Paddle Assembly

Paddle Removal and Replacement

Paddle Seal Removal and Replacement

Paddle Location

Paddle Location

Paddle Removal Location

Paddle/Belt Speed for Photoeye Inputs

Paddle Arm Spring, Upper Paddle Arm, and Lower Paddle Arm

Paddle Arm Spring, Upper Paddle Arm, and Lower Paddle Arm





~~~
William Kevin Miller
[ecsLogo]
ECS Federal, Inc.
USPS/MTSC
(405) 573-2158



Re: uploading solr.xml to zk

2017-07-10 Thread Cassandra Targett
In your command, you are missing the "zk" part of the command. Try:

bin/solr zk cp file:local/file/path/to/solr.xml zk:/solr.xml -z localhost:2181

I see this is wrong in the documentation, I will fix it for the next
release of the Ref Guide.

I'm not sure about how to refer to it - I don't think you have to do
anything? I could be very wrong on that, though.
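
The full round trip would look something like this (the path and ZK address are
examples); as far as I know, a node started in cloud mode checks ZooKeeper for
/solr.xml before falling back to the local solr home, so no extra flag should
be needed:

    # push solr.xml to ZooKeeper
    bin/solr zk cp file:local/file/path/to/solr.xml zk:/solr.xml -z localhost:2181

    # start a node against that ZooKeeper ensemble
    bin/solr start -c -z localhost:2181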

On Fri, Jul 7, 2017 at 2:31 PM,   wrote:
> The documentation says
>
> If you for example would like to keep your solr.xml in ZooKeeper to avoid 
> having to copy it to every node's so
> lr_home directory, you can push it to ZooKeeper with the bin/solr utility 
> (Unix example):
> bin/solr cp file:local/file/path/to/solr.xml zk:/solr.xml -z localhost:2181
>
> So Im trying to push the solr.xml my local zookeepr
>
> solr-6.4.1/bin/solr  file:/home/user1/solr/nodes/day1/solr/solr.xml 
> zk:/solr.xml -z localhost:9983
>
> ERROR: cp is not a valid command!
>
> Afterwards
> When starting up a node how do we refer to the solr.xml inside zookeeper? Any 
> examples?
>
> Thanks
> Imran
>
>
> Sent from Mail for Windows 10
>


RE: DIH issue with streaming xml file

2017-07-10 Thread Miller, William K - Norman, OK - Contractor
Please consider this issue closed as we are looking at moving our xml files to 
the solr server for now.




~~~
William Kevin Miller

ECS Federal, Inc.
USPS/MTSC
(405) 573-2158

-Original Message-
From: Miller, William K - Norman, OK - Contractor 
Sent: Monday, June 12, 2017 2:12 PM
To: 'solr-user@lucene.apache.org'
Subject: RE: DIH issue with streaming xml file

Thank you for your response.  I will look into this link.  Also, sorry I did 
not specify the file type.   I am working with XML files.




~~~
William Kevin Miller

ECS Federal, Inc.
USPS/MTSC
(405) 573-2158


-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Monday, June 12, 2017 1:26 PM
To: solr-user
Subject: Re: DIH issue with streaming xml file

Solr 6.5.1 DIH setup has - somewhat broken - RSS example (redone as ATOM 
example in 6.6) that shows how to get stuff from https URL. You can see the 
atom example here:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.6.0/solr/example/example-DIH/solr/atom/conf/atom-data-config.xml


The main issue however is that you are not saying what format is that list of 
file on the server. Is that a plain list? Is it XML with files? Are you doing 
directory listing?

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 12 June 2017 at 14:11, Miller, William K - Norman, OK - Contractor 
 wrote:
> Thank you for your response.  That is the issue that I am having.  I cannot 
> figure out how to get the list of files from the remote server.  I have tried 
> changing the parent Entity Processor to the XPathEntityProcessor and the 
> baseDir to a url using https.  This did not work as it was looking for a 
> "foreach" attribute.  Is there an Entity Processor that can be used to get 
> the list of files from an https source or am I going to have to use solrj or 
> create a custom entity processor?
>
>
>
>
> ~~~
> William Kevin Miller
>
> ECS Federal, Inc.
> USPS/MTSC
> (405) 573-2158
>
>
> -Original Message-
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: Monday, June 12, 2017 12:57 PM
> To: solr-user
> Subject: Re: DIH issue with streaming xml file
>
> How do you get a list of URLs for the files on the remote server? That's 
> probably the first issue. Once you have the URLs in an outside entity or two, 
> you can feed them one by one into the inner entity.
>
> Regards,
>Alex.
>
> 
> http://www.solr-start.com/ - Resources for Solr users, new and 
> experienced
>
> On 12 June 2017 at 09:39, Miller, William K - Norman, OK - Contractor < 
> william.k.mil...@usps.gov.invalid> wrote:
>
>> I am using Solr 6.5.1 and working on importing xml files using the 
>> DataImportHandler.  I am wanting to get the files from a remote 
>> server, but I am dealing with multiple xml files in multiple folders.
>> I am using a nested entity in my dataConfig.  Below is an example of 
>> how I have my dataConfig set up.  I got most of this from an online 
>> reference.  In this example I am getting the xml files from a folder 
>> on the Solr server, but as I mentioned above I want to get the files 
>> from a remote server.  I have looked at the different Entity 
>> Processors for the DIH, but have not seen anything that seems to work.
>> Is there a way to configure the below code to let me do this?
>>
>>
>>
>>
>>
>> 
>>
>>
>>
>> > type="FileDataSource" />
>>
>> 
>>
>> 
>>
>>
>>
>> >
>> name="pickupdir"
>>
>> processor="FileListEntityProcessor"
>>
>> rootEntity="false"
>>
>> dataSource="null"
>>
>> fileName="^[\w\d-]+\.xml$"
>>
>> baseDir="/var/solr/data/hbk/data/xml/"
>>
>> recursive="true"
>>
>>
>>
>> >
>>
>> 
>>
>>
>>
>> >
>> name="xml"
>>
>>
>> pk="itemId"
>>
>>
>> processor="XPathEntityProcessor"
>>
>>
>> transformer="RegexTransformer,TemplateTransformer"
>>
>>
>> datasource="pickupdir"
>>
>>
>> stream="true"
>>
>>
>> xsl="/var/solr/data/hbk/data/xsl/solr_timdex.xsl"
>>
>>
>> url="${pickupdir.fileAbsolutePath}"
>>
>>
>> forEach="/eflow/section | /eflow/section/item"
>>
>> >
>>
>>
>>
>> 
>> > commonField="true" />
>>
>> 
>> > />
>>
>> 
>> > commonField="true" />
>>
>> 
>> > commonField="true" />
>>
>>   

Re: Solr 6.5.1 crashing when too many queries with error or high memory usage are queried

2017-07-10 Thread Joel Bernstein
Yes, the hashJoin will read the entire "hashed" query result into memory. The
documentation explains this.

In general, the streaming joins were designed for OLAP-type workloads.
Unless you have a large cluster powering the streaming joins, you are going
to have problems with high-QPS workloads.
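
For reference, a hashJoin expression has this shape (collection and field names
are just examples); everything read from the stream passed as "hashed" is held
in memory:

    hashJoin(
      search(orders, q="*:*", fl="custId,total", sort="custId asc", qt="/export"),
      hashed=search(customers, q="*:*", fl="custId,name", sort="custId asc", qt="/export"),
      on="custId"
    )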

Joel Bernstein
http://joelsolr.blogspot.com/

On Sun, Jul 9, 2017 at 10:59 PM, Zheng Lin Edwin Yeo 
wrote:

> I have found that it could be likely due to the hashJoin in the streaming
> expression, as this will store all tuples in memory?
>
> I have more than 12 million in the collections which I am querying, in 1
> shard. The index size of the collection is 45 GB.
> Physical RAM of server: 384 GB
> Java Heap: 22 GB
> Typical search latency: 2 to 4 seconds
>
> Regards,
> Edwin
>
>
> On 7 July 2017 at 16:46, Jan Høydahl  wrote:
>
> > You have not told us how many documents you have, how many shards, how
> big
> > the docs are, physical RAM, Java heap, what typical search latency is
> etc.
> >
> > If you have tried to squeeze too many docs into a single node it might
> get
> > overloaded faster, thus sharding would help.
> > If you return too much content (large fields that you won’t use) that may
> > lower the max QPS for a node, so check that.
> > If you are not using DocValues, faceting etc will take too much memory,
> > but since you use streaming I guess you use Docvalues.
> > There are products that you can put in front of Solr that can do rate
> > limiting for you, such as https://getkong.org/ 
> >
> > You really need to debug what is the bottleneck in your case and try to
> > fix that.
> >
> > Can you share your key numbers here so we can do a qualified guess?
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> > > 2. jul. 2017 kl. 09.00 skrev Zheng Lin Edwin Yeo  >:
> > >
> > > Hi,
> > >
> > > I'm currently facing the issue whereby the Solr crashed when I have
> > issued
> > > too many queries with error or those with high memory usage, like JSON
> > > facet or Streaming expressions.
> > >
> > > What could be the issue here?
> > >
> > > I'm using Solr 6.5.1
> > >
> > > Regards,
> > > Edwin
> >
> >
>


Re: Multiple Field Search on Solr

2017-07-10 Thread Erik Hatcher
I recommend first understanding the Solr API, and the parameters you need to 
add the capabilities with just the /select API.   Once you are familiar with 
that, you can then learn what’s needed and apply that to the HTML and 
JavaScript.   While the /browse UI is fairly straightforward, there’s a fair  
bit of HTML, JavaScript, and Solr know-how needed to do what you’re asking.

A first step would be to try using `fq` instead of appending to `q` for things 
you want to “AND" to the query that aren’t relevancy related.
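
For example, against /select something along these lines (techproducts field
names, purely as an illustration) ANDs the extra constraints in without
affecting relevancy:

    http://localhost:8983/solr/techproducts/select?q=name:ipod&fq=manu:Belkin&fq=inStock:true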

Erik

> On Jul 10, 2017, at 6:20 AM, Clare Lee  wrote:
> 
> Hello,
> 
> My name is Clare Lee and I'm working on Apache Solr-6.6.0, Solritas right
> now and I'm not able to do something I want to do. Could you help me with
> this?
> 
> I want to be able to search solr with multiple fields. With the basic
> configurations(I'm using the core techproducts and just changing the data),
> I can search like this [image: enter image description here]
> 
> 
> but I want to search like this[image: enter image description here]
> 
> 
> I want to know which file I have to look into and how I should change the
> code to do so.
> 
> I can put the space to put the additional information by copying and
> pasting this in the query_form.vm file.
> 
> 
> 
> 
> 
> but this doesn't AND the values that I put in.
> 
> I was told that I should look where the action file is(code below), but I
> cannot reach that location.
> 
>   method="GET">
> 
> 
>  
>Name:
>
> 
> 
> The below code is relevant, but I don't know how to change it. (from
> head.vm)
> 
> 

RE: CDCR - how to deal with the transaction log files

2017-07-10 Thread Xie, Sean
Did some source code reading, and it looks like when lastProcessedVersion == -1 
it does nothing:

https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/handler/CdcrUpdateLogSynchronizer.java

// if we received -1, it means that the log reader on the leader has 
not yet started to read log entries
// do nothing
if (lastVersion == -1) {
  return;
}

So I queried Solr to find out, and here are the results:

/cdcr?action=LASTPROCESSEDVERSION

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <long name="lastProcessedVersion">-1</long>
</response>

Any idea what could cause this to happen?


Sean


On 7/10/17, 9:08 AM, "Michael McCarthy"  wrote:

We have been experiencing this same issue for months now, with version 6.2. 
 No solution to date.

-Original Message-
From: Xie, Sean [mailto:sean@finra.org]
Sent: Sunday, July 09, 2017 9:41 PM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] Re: CDCR - how to deal with the transaction log files

Did another round of testing, the tlog on target cluster is cleaned up once 
the hard commit is triggered. However, on source cluster, the tlog files stay 
there and never gets cleaned up.

Not sure if there is any command to run manually to trigger the 
updateLogSynchronizer. The updateLogSynchronizer already set at run at every 10 
seconds, but seems it didn’t help.

Any help?

Thanks
Sean

On 7/8/17, 1:14 PM, "Xie, Sean"  wrote:

I have monitored the CDCR process for a while, the updates are actively 
sent to the target without a problem. However the tlog size and files count are 
growing everyday, even when there is 0 updates to sent, the tlog stays there:

Following is from the action=queues command, and you can see after 
about a month or so running days, the total transaction are reaching to 140K 
total files, and size is about 103G.



0
465




0
2017-07-07T23:19:09.655Z



102740042616
140809
stopped


Any help on it? Or do I need to configure something else? The CDCR 
configuration is pretty much following the wiki:

On target:

  

  disabled

  

  


  

  

  cdcr-processor-chain

  

  

  ${solr.ulog.dir:}


  ${solr.autoCommit.maxTime:18}
  false



  ${solr.autoSoftCommit.maxTime:3}

  

On source:
  

  ${TargetZk}
  MY_COLLECTION
  MY_COLLECTION



  1
  1000
  128



  6

  

  

  ${solr.ulog.dir:}


  ${solr.autoCommit.maxTime:18}
  false



  ${solr.autoSoftCommit.maxTime:3}

  

Thanks.
Sean

On 7/8/17, 12:10 PM, "Erick Erickson"  wrote:

This should not be the case if you are actively sending updates to 
the
target cluster. The tlog is used to store unsent updates, so if the
connection is broken for some time, the target cluster will have a
chance to catch up.

If you don't have the remote DC online and do not intend to bring it
online soon, you should turn CDCR off.

Best,
Erick

On Fri, Jul 7, 2017 at 9:35 PM, Xie, Sean  
wrote:
> Once enabled CDCR, update log stores an unlimited number of 
entries. This is causing the tlog folder getting bigger and bigger, as well as 
the open files are growing. How can one reduce the number of open files and 
also to reduce the tlog files? If it’s not taken care properly, sooner or later 
the log files size and open file count will exceed the limits.
>
> Thanks
> Sean
>
>

Multiple Field Search on Solr

2017-07-10 Thread Clare Lee
Hello,

My name is Clare Lee and I'm working on Apache Solr-6.6.0, Solritas right
now and I'm not able to do something I want to do. Could you help me with
this?

I want to be able to search solr with multiple fields. With the basic
configurations(I'm using the core techproducts and just changing the data),
I can search like this [image: enter image description here]


but I want to search like this[image: enter image description here]


I want to know which file I have to look into and how I should change the
code to do so.

I can put the space to put the additional information by copying and
pasting this in the query_form.vm file.





but this doesn't AND the values that I put in.

I was told that I should look where the action file is(code below), but I
cannot reach that location.

 


  
Name:



The below code is relevant, but I don't know how to change it. (from
head.vm)

 

Re: High disk write usage

2017-07-10 Thread Shawn Heisey
On 7/10/2017 2:57 AM, Antonio De Miguel wrote:
> I continue deeping inside this problem...  high writing rates continues.
>
> Searching in logs i see this:
>
> 2017-07-10 08:46:18.888 INFO  (commitScheduler-11-thread-1) [c:ads s:shard2
> r:core_node47 x:ads_shard2_replica3] o.a.s.u.LoggingInfoStream
> [DWPT][commitScheduler-11-thread-1]: flushed: segment=_mb7 ramUsed=7.531 MB
> newFlushedSize=2.472 MB docs/MB=334.132
> 2017-07-10 08:46:29.336 INFO  (commitScheduler-11-thread-1) [c:ads s:shard2
> r:core_node47 x:ads_shard2_replica3] o.a.s.u.LoggingInfoStream
> [DWPT][commitScheduler-11-thread-1]: flushed: segment=_mba ramUsed=8.079 MB
> newFlushedSize=1.784 MB docs/MB=244.978
>
>
> A flush happens each 10 seconds (my autosoftcommit time is 10 secs and
> hardcommit 5 minutes).  ¿is the expected behaviour?

If you are indexing continuously, then the auto soft commit time of 10
seconds means that this will be happening every ten seconds.

> I thought soft commits does not write into disk...

If you are using the correct DirectoryFactory type, a soft commit has
the *possibility* of not writing to disk, but the amount of memory
reserved is fairly small.

Looking into the source code for NRTCachingDirectoryFactory, I see that
maxMergeSizeMB defaults to 4, and maxCachedMB defaults to 48.  This is a
little bit different than what the javadoc states for
NRTCachingDirectory (5 and 60):

http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/store/NRTCachingDirectory.html

The way I read this, assuming the amount of segment data created is
small, only the first few soft commits will be entirely handled in
memory.  After that, older segments must be flushed to disk to make room
for new ones.
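
For reference, those two knobs can be set explicitly in solrconfig.xml,
something like this sketch (using the default values mentioned above):

    <directoryFactory name="DirectoryFactory"
                      class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
      <double name="maxMergeSizeMB">4</double>
      <double name="maxCachedMB">48</double>
    </directoryFactory>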

If the indexing rate is high, there's not really much difference between
soft commits and hard commits.  This also assumes that you have left the
directory at the default of NRTCachingDirectoryFactory.  If this has
been changed, then there is no caching in RAM, and soft commit probably
behaves *exactly* the same as hard commit.

Thanks,
Shawn



RE: CDCR - how to deal with the transaction log files

2017-07-10 Thread Michael McCarthy
We have been experiencing this same issue for months now, with version 6.2.  No 
solution to date.

-Original Message-
From: Xie, Sean [mailto:sean@finra.org]
Sent: Sunday, July 09, 2017 9:41 PM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] Re: CDCR - how to deal with the transaction log files

Did another round of testing, the tlog on target cluster is cleaned up once the 
hard commit is triggered. However, on source cluster, the tlog files stay there 
and never gets cleaned up.

Not sure if there is any command to run manually to trigger the 
updateLogSynchronizer. The updateLogSynchronizer already set at run at every 10 
seconds, but seems it didn’t help.

Any help?

Thanks
Sean

On 7/8/17, 1:14 PM, "Xie, Sean"  wrote:

I have monitored the CDCR process for a while, the updates are actively 
sent to the target without a problem. However the tlog size and files count are 
growing everyday, even when there is 0 updates to sent, the tlog stays there:

Following is from the action=queues command, and you can see after about a 
month or so running days, the total transaction are reaching to 140K total 
files, and size is about 103G.



0
465




0
2017-07-07T23:19:09.655Z



102740042616
140809
stopped


Any help on it? Or do I need to configure something else? The CDCR 
configuration is pretty much following the wiki:

On target:

  

  disabled

  

  


  

  

  cdcr-processor-chain

  

  

  ${solr.ulog.dir:}


  ${solr.autoCommit.maxTime:18}
  false



  ${solr.autoSoftCommit.maxTime:3}

  

On source:
  

  ${TargetZk}
  MY_COLLECTION
  MY_COLLECTION



  1
  1000
  128



  6

  

  

  ${solr.ulog.dir:}


  ${solr.autoCommit.maxTime:18}
  false



  ${solr.autoSoftCommit.maxTime:3}

  

Thanks.
Sean

On 7/8/17, 12:10 PM, "Erick Erickson"  wrote:

This should not be the case if you are actively sending updates to the
target cluster. The tlog is used to store unsent updates, so if the
connection is broken for some time, the target cluster will have a
chance to catch up.

If you don't have the remote DC online and do not intend to bring it
online soon, you should turn CDCR off.

Best,
Erick

On Fri, Jul 7, 2017 at 9:35 PM, Xie, Sean  wrote:
> Once enabled CDCR, update log stores an unlimited number of entries. 
This is causing the tlog folder getting bigger and bigger, as well as the open 
files are growing. How can one reduce the number of open files and also to 
reduce the tlog files? If it’s not taken care properly, sooner or later the log 
files size and open file count will exceed the limits.
>
> Thanks
> Sean
>
>








Resources for solr design and Architecture

2017-07-10 Thread Ranganath B N
Hi,

   Is there any resource (an article or book) which sheds light on Solr's design 
and architecture (the interaction between client and server modules in Solr, and 
the interaction between Solr modules, i.e. the Java source files)?


Thanks,
Ranganath B. N.


RE: ZooKeeper transaction logs

2017-07-10 Thread Avi Steiner
I did run this class from a batch file (on a Windows server), but it still does 
not remove anything. I set the number of snapshots to keep to 3, but I have more 
than that in my folder.

-Original Message-
From: Xie, Sean [mailto:sean@finra.org]
Sent: Sunday, July 9, 2017 7:33 PM
To: solr-user@lucene.apache.org
Subject: Re: ZooKeeper transaction logs

You can try running the purge manually to see if it works: 
org.apache.zookeeper.server.PurgeTxnLog.

And use a cron job to do the cleanup.
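
A rough sketch of a manual run (jar names, versions and paths are placeholders
for whatever your install uses; on Windows use ';' as the classpath separator):

    # keep the 3 most recent snapshots and the transaction logs they require
    java -cp zookeeper-3.4.6.jar:lib/slf4j-api-1.6.1.jar:lib/log4j-1.2.16.jar:conf \
        org.apache.zookeeper.server.PurgeTxnLog /path/to/dataLogDir /path/to/snapDir -n 3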


On 7/9/17, 11:07 AM, "Avi Steiner"  wrote:

Hello

I'm using Zookeeper 3.4.6

The ZK log data folder keeps growing with transaction logs files (log.*).

I set the following in zoo.cfg:
autopurge.purgeInterval=1
autopurge.snapRetainCount=3
dataDir=..\\data

Per ZK log, it reads those parameters:

2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@78] - 
autopurge.snapRetainCount set to 3
2017-07-09 17:44:59,792 [myid:] - INFO  [main:DatadirCleanupManager@79] - 
autopurge.purgeInterval set to 1

It also says that cleanup process is running:

2017-07-09 17:44:59,792 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2017-07-09 17:44:59,823 [myid:] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.

But nothing is actually deleted.
On every service restart, a new file is created.

The only parameter I managed to change is preAllocSize, which controls the 
pre-allocated size of each transaction log file. The default is 64MB; I changed 
it to 10KB just to observe the effect.





Re: High disk write usage

2017-07-10 Thread Antonio De Miguel
Hi!

I am still digging into this problem... the high write rates continue.

Searching the logs I see this:

2017-07-10 08:46:18.888 INFO  (commitScheduler-11-thread-1) [c:ads s:shard2
r:core_node47 x:ads_shard2_replica3] o.a.s.u.LoggingInfoStream
[DWPT][commitScheduler-11-thread-1]: flushed: segment=_mb7 ramUsed=7.531 MB
newFlushedSize=2.472 MB docs/MB=334.132
2017-07-10 08:46:29.336 INFO  (commitScheduler-11-thread-1) [c:ads s:shard2
r:core_node47 x:ads_shard2_replica3] o.a.s.u.LoggingInfoStream
[DWPT][commitScheduler-11-thread-1]: flushed: segment=_mba ramUsed=8.079 MB
newFlushedSize=1.784 MB docs/MB=244.978


A flush happens every 10 seconds (my autoSoftCommit time is 10 seconds and my
hard commit is 5 minutes). Is this the expected behaviour?

I thought soft commits did not write to disk...


2017-07-06 0:02 GMT+02:00 Antonio De Miguel :

> Hi Erick.
>
> What I meant to say is that we have enough memory to hold the shards and,
> on top of that, the JVM heap spaces.
>
> The machine has 400 GB of RAM. I think we have enough.
>
> We have 10 JVMs running on the machine, each of them using 16 GB.
>
> Shard size is about 8 GB.
>
> When we have query or indexing peaks our problems are the CPU usage and
> the disk I/O, but we have a lot of unused memory.
>
>
>
>
>
>
>
>
>
>> On 5/7/2017 at 19:04, "Erick Erickson"  wrote:
>
>> bq: We have enough physical RAM to store full collection and 16Gb for
>> each JVM.
>>
>> That's not quite what I was asking for. Lucene uses MMapDirectory to
>> map part of the index into the OS memory space. If you've
>> over-allocated the JVM space relative to your physical memory that
>> space can start swapping. Frankly I'd expect your query performance to
>> die if that was happening so this is a sanity check.
>>
>> How much physical memory does the machine have and how much memory is
>> allocated to _all_ of the JVMs running on that machine?
>>
>> see: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on
>> -64bit.html
>>
>> Best,
>> Erick
>>
>>
>> On Wed, Jul 5, 2017 at 9:41 AM, Antonio De Miguel 
>> wrote:
>> > Hi Erick! Thanks for your response!
>> >
>> > Our soft commit is 5 seconds. Why does a soft commit generate I/O? This is
>> > the first I've heard of that.
>> >
>> >
>> > We have enough physical RAM to store the full collection, plus 16 GB for
>> > each JVM.  The collection is relatively small.
>> >
>> > I've tried (for testing purposes) disabling the transaction log (commenting
>> > it out)... but the cluster does not come up. I'll try writing to a
>> > separate drive, nice idea...
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > 2017-07-05 18:04 GMT+02:00 Erick Erickson :
>> >
>> >> What is your soft commit interval? That'll cause I/O as well.
>> >>
>> >> How much physical RAM and how much is dedicated to _all_ the JVMs on a
>> >> machine? One cause here is that Lucene uses MMapDirectory which can be
>> >> starved for OS memory if you use too much JVM, my rule of thumb is
>> >> that _at least_ half of the physical memory should be reserved for the
>> >> OS.
>> >>
>> >> Your transaction logs should fluctuate but even out. By that I mean
>> >> they should increase in size but every hard commit should truncate
>> >> some of them so I wouldn't expect them to grow indefinitely.
>> >>
>> >> One strategy is to put your tlogs on a separate drive exactly to
>> >> reduce contention. You could disable them too at a cost of risking
>> >> your data. That might be a quick experiment you could run though,
>> >> disable tlogs and see what that changes. Of course I'd do this on my
>> >> test system ;).
>> >>
>> >> But yeah, Solr will use a lot of I/O in the scenario you are outlining
>> >> I'm afraid.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Wed, Jul 5, 2017 at 8:08 AM, Antonio De Miguel > >
>> >> wrote:
>> >> > Thanks Markus!
>> >> >
>> >> > We already have SSDs.
>> >> >
>> >> > About changing the topology: we tried 10 shards yesterday, but the
>> >> > system became more inconsistent than with the current topology (5x10).
>> >> > I don't know why... too much traffic perhaps?
>> >> >
>> >> > About the merge factor: we ran the default configuration for some days,
>> >> > but when a merge occurred the system was overloaded. We tried a
>> >> > mergeFactor of 4 to improve query times and to get smaller merges.
>> >> >
>> >> > 2017-07-05 16:51 GMT+02:00 Markus Jelsma > >:
>> >> >
>> >> >> Try a mergeFactor of 10 (the default), which should be fine in most
>> >> >> cases. If you have an extreme case, either create more shards or
>> >> >> consider better hardware (SSDs).
>> >> >>
>> >> >> -Original message-
>> >> >> > From:Antonio De Miguel 
>> >> >> > Sent: Wednesday 5th July 2017 16:48
>> >> >> > To: solr-user@lucene.apache.org
>> >> >> > Subject: Re: High disk write usage
>> >> >> >
>> >> >> > Thanks a lot Alessandro!
>> >> >> >
>> >> >> > Yes, we have very big physical 

Re: index newly discovered fields of different types

2017-07-10 Thread Jan Høydahl
I think Thaer’s answer clarifies how they do it.
So at the time they assemble the full Solr doc to index, there may be a new 
field name that is not known in advance,
but to my understanding the RDF source contains information on the type (else 
they could not do the mapping
to a dynamic field either), so adding a field to the managed schema on the fly 
once an unknown field is detected
should work just fine!
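
For instance, the indexing code could first ask Solr which fields are already
defined and then add any missing one with the proper type through the Schema
API. A minimal sketch (host, collection, field name and field type are
placeholders):

    # list the fields currently defined in the managed schema
    curl 'http://localhost:8983/solr/mycollection/schema/fields'

    # add a newly discovered field with an explicit type
    curl -X POST -H 'Content-Type: application/json' \
        'http://localhost:8983/solr/mycollection/schema' \
        -d '{"add-field": {"name": "birthdate", "type": "tdate", "indexed": true, "stored": true}}'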

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 10 Jul 2017, at 02:08, Rick Leir wrote:
> 
> Jan
> 
> I hope this is not off-topic, but I am curious: if you do not use the three 
> fields, subject, predicate, and object for indexing RDF
> then what is your algorithm? Maybe document nesting is appropriate for this? 
> cheers -- Rick
> 
> 
> On 2017-07-09 05:52 PM, Jan Høydahl wrote:
>> Hi,
>> 
>> I have personally written a Python script to parse RDF files into an 
>> in-memory graph structure and then pull data from that structure to index to 
>> Solr.
>> I.e. you may perfectly well have RDF (nt, turtle, whatever) as source but 
>> index sub structures in very specific ways.
>> Anyway, as Erick points out, that’s probably where in your code that you 
>> should use Managed Schema REST API in order to
>> 1. Query Solr for what fields are defined
>> 2. If you need to index a field that is not yet in Solr, add it, using the 
>> correct field type (your app should know)
>> 3. Push the data
>> 4. Repeat
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> 
>>> On 8 Jul 2017, at 02:36, Rick Leir wrote:
>>> 
>>> Thaer
>>> Whoa, hold everything! You said RDF, meaning resource description 
>>> framework? If so, you have exactly three fields: subject, predicate, and 
>>> object. Maybe they are text type, or for exact matches you might want 
>>> string fields. Add an ID field, which could be automatically generated by 
>>> Solr, so now you have four fields. Or am I on a tangent again? Cheers -- 
>>> Rick
>>> 
>>> On July 7, 2017 6:01:00 AM EDT, Thaer Sammar  wrote:
 Hi Jan,
 
 Thanks!, I am exploring the schemaless option based on Furkan
 suggestion. I
 need the the flexibility because not all fields are known. We get the
 data
 from RDF database (which changes continuously). To be more specific, we
 have a database and all changes on it are sent to a kafka queue. and we
 have a consumer which listen to the queue and update the Solr index.
 
 regards,
 Thaer
 
 On 7 July 2017 at 10:53, Jan Høydahl  wrote:
 
> If you do not need the flexibility of dynamic fields, don’t use them.
> Sounds to me that you really want a field “price” to be float and a
 field
> “birthdate” to be of type date etc.
> If so, simply create your schema (either manually, through Schema API
 or
> using schemaless) up front and index each field as correct type
 without
> messing with field name prefixes.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
>> On 5 Jul 2017, at 15:23, Thaer Sammar wrote:
>> 
>> Hi,
>> We are trying to index documents of different types. Document have
> different fields. fields are known at indexing time. We run a query
 on a
> database and we index what comes using query variables as field names
 in
> solr. Our current solution: we use dynamic fields with prefix, for
 example
> feature_i_*, the issue with that
>> 1) we need to define the type of the dynamic field and to be able
 to
> cover the type of discovered fields we define the following
>> feature_i_* for integers, feature_t_* for string, feature_d_* for
> double, 
>> 1.a) this means we need to check the type of the discovered field
 and
> then put in the corresponding dynamic field
>> 2) at search time, we need to know the right prefix
>> We are looking for help to find away to ignore the prefix and check
 of
> the type
>> regards,
>> Thaer
> 
>>> -- 
>>> Sorry for being brief. Alternate email is rickleir at yahoo dot com
> 



Re: index newly discovered fields of different types

2017-07-10 Thread Thaer Sammar
Hi Rick,

yes, the RDF structure has subject, predicate and object. The object data
type is not only text; it can be an integer or a double as well, or other data
types. The structure of our Solr document doesn't contain only these three
fields. We compose one document per subject and use all the objects found as
fields. Currently, in the schema we define two static fields: uri (the subject)
and a geo field which contains the geographic point. When we find a message in
the Kafka queue, which means something changed in the DB, we query the DB to get
all the subject,predicate,object triples of the affected subjects, and based on
that we create the document. For example, for subjects s1 and s2, we might get
the following from the DB:

s1,geo,(latitude, longitude)
s1,are,200.0
s1,type,office
s2,geo,(latitude, longitude)

for s1, there is more information available and we would like to include it in
the Solr doc; therefore we use the dynamic fields
feature_double_* and feature_text_*. Based on the object data type we add the
value to the appropriate dynamic field:

    uri: s1
    geo: (latitude, longitude)
    feature_double_are: 200.0
    feature_text_type: office

We append the predicate name to the dynamic field prefix, and we use the RDF
data type to decide which dynamic field to use.
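
The dynamic field patterns themselves can be registered once up front through
the Schema API. A rough sketch (host, collection name and the field types are
placeholders and must match types that exist in your schema):

    curl -X POST -H 'Content-Type: application/json' \
        'http://localhost:8983/solr/mycollection/schema' -d '{
          "add-dynamic-field": {"name": "feature_double_*", "type": "tdouble",      "indexed": true, "stored": true},
          "add-dynamic-field": {"name": "feature_text_*",   "type": "text_general", "indexed": true, "stored": true}
        }'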

regards,
Thaer

On 8 July 2017 at 02:36, Rick Leir  wrote:

> Thaer
> Whoa, hold everything! You said RDF, meaning resource description
> framework? If so, you have exactly three fields: subject, predicate, and
> object. Maybe they are text type, or for exact matches you might want
> string fields. Add an ID field, which could be automatically generated by
> Solr, so now you have four fields. Or am I on a tangent again? Cheers --
> Rick
>
> On July 7, 2017 6:01:00 AM EDT, Thaer Sammar  wrote:
> >Hi Jan,
> >
> >Thanks!, I am exploring the schemaless option based on Furkan
> >suggestion. I
> >need the the flexibility because not all fields are known. We get the
> >data
> >from RDF database (which changes continuously). To be more specific, we
> >have a database and all changes on it are sent to a kafka queue. and we
> >have a consumer which listen to the queue and update the Solr index.
> >
> >regards,
> >Thaer
> >
> >On 7 July 2017 at 10:53, Jan Høydahl  wrote:
> >
> >> If you do not need the flexibility of dynamic fields, don’t use them.
> >> Sounds to me that you really want a field “price” to be float and a
> >field
> >> “birthdate” to be of type date etc.
> >> If so, simply create your schema (either manually, through Schema API
> >or
> >> using schemaless) up front and index each field as correct type
> >without
> >> messing with field name prefixes.
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com
> >>
> >> > On 5 Jul 2017, at 15:23, Thaer Sammar wrote:
> >> >
> >> > Hi,
> >> > We are trying to index documents of different types. Document have
> >> different fields. fields are known at indexing time. We run a query
> >on a
> >> database and we index what comes using query variables as field names
> >in
> >> solr. Our current solution: we use dynamic fields with prefix, for
> >example
> >> feature_i_*, the issue with that
> >> > 1) we need to define the type of the dynamic field and to be able
> >to
> >> cover the type of discovered fields we define the following
> >> > feature_i_* for integers, feature_t_* for string, feature_d_* for
> >> double, 
> >> > 1.a) this means we need to check the type of the discovered field
> >and
> >> then put in the corresponding dynamic field
> >> > 2) at search time, we need to know the right prefix
> >> > We are looking for help to find away to ignore the prefix and check
> >of
> >> the type
> >> >
> >> > regards,
> >> > Thaer
> >>
> >>
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com


RE: help on implicit routing

2017-07-10 Thread imran
Thanks for the reference. I am guessing this feature is not available through 
the post utility inside solr/bin.
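
For what it's worth, the shard can be named on the update request itself with
the _route_ parameter, and bin/post can pass extra request parameters through
its -params option. A hedged sketch (host, collection and shard names are
placeholders):

    # send a document straight to shard day_1 of an implicitly routed collection
    curl 'http://localhost:8983/solr/mycollection/update?_route_=day_1&commit=true' \
        -H 'Content-Type: application/json' \
        -d '[{"id": "doc-1"}]'

    # roughly the same thing through the post utility
    bin/post -c mycollection -params "_route_=day_1" docs.json

Alternatively, if the collection was created with a router.field, putting the
shard name (e.g. day_1) in that field of each document has the same effect.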

Regards,
Imran

Sent from Mail for Windows 10

From: Jan Høydahl
Sent: Friday, July 7, 2017 1:51 AM
To: solr-user@lucene.apache.org
Subject: Re: help on implicit routing

http://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html
 


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 6 Jul 2017, at 03:15, im...@elogic.pk wrote:
> 
> I am trying out the document routing feature in Solr 6.4.1. I am unable to 
> comprehend the documentation where it states that 
> “The 'implicit' router does not
> automatically route documents to different
> shards.  Whichever shard you indicate on the
> indexing request (or within each document) will
> be used as the destination for those documents”
> 
> How do you specify the shard inside a document? E.g. if I have a basic 
> collection with two shards called day_1 and day_2, what value should be 
> populated in the router field to ensure the document is routed to the 
> respective shard?
> 
> Regards,
> Imran
> 
> Sent from Mail for Windows 10
>