Re: Commit after every document - alternate approach

2016-03-04 Thread Shawn Heisey
On 3/3/2016 11:36 PM, sangs8788 wrote:
> When a commit fails, the document doesn't get cleared out from MQ, and
> there is a task which runs in the background to republish the files to
> SOLR. If we do a batch commit and it fails, we will not know which
> documents failed and will end up redoing the same batch commit again. We
> currently have a client-side commit which issues the command to SOLR.
> commit() returns a status code. If we plan to use commitWithin(), I don't
> think it will actually return any result from Solr, since it is
> time-oriented.

Do your indexing and commits in batches, as already recommended.  I'd
start with 1000 and go up or down from there as needed.  If the batch
indexing fails, or the commit fails, consider the entire batch failed. 
That may not be the end of the world, though -- if the indexing was
successful (and didn't use ConcurrentUpdateSolrClient), then those
updates will be stored in the Solr transaction log, and will be replayed
if Solr is restarted or the core is reloaded.
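
The batch-and-commit loop described above can be sketched as plain Java. Here `sendBatch` stands in for whatever SolrJ call you make (e.g. `client.add(docs)` followed by `client.commit()`); the client, the batch size of 1000, and the re-queue-to-MQ handling are assumptions, not part of this mail:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of batch indexing with whole-batch failure semantics.
public class BatchIndexer {

    // Split the incoming documents into fixed-size batches.
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    // Send each batch; if indexing or the commit fails, treat the entire
    // batch as failed and return it so the caller can re-queue it in MQ.
    static <T> List<List<T>> indexInBatches(List<T> docs, int batchSize,
                                            Predicate<List<T>> sendBatch) {
        List<List<T>> failed = new ArrayList<>();
        for (List<T> batch : partition(docs, batchSize)) {
            if (!sendBatch.test(batch)) {
                failed.add(batch);
            }
        }
        return failed;
    }
}
```

Since updates are safe to repeat (with the exceptions discussed later in this mail), re-sending a failed batch wholesale is simpler than tracking per-document status.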

If you want to be absolutely certain that the update/commit succeeded by
verifying data, one thing you *could* do is send a batch update, do a
commit, and then request every document in the batch with a query that
includes a limited fl parameter, and verify that the document is present
and the values of the fields requested in the fl parameter are correct. 
I would probably do that query with {!cache=false} to avoid polluting
Solr's caches.
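
A sketch of the verification request parameters, assuming the uniqueKey field is `id` and that `id,title` is your limited field list (both placeholders for whatever your schema actually uses); `q`, `fl`, `rows`, and `{!cache=false}` are standard Solr query syntax:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Build the query parameters for verifying one batch of ids after a commit.
public class BatchVerifier {
    static Map<String, String> verificationParams(List<String> ids) {
        Map<String, String> p = new LinkedHashMap<>();
        // {!cache=false} keeps this check out of Solr's caches.
        p.put("q", "{!cache=false}id:(" + String.join(" OR ", ids) + ")");
        p.put("fl", "id,title");               // limited field list
        p.put("rows", String.valueOf(ids.size()));
        return p;
    }
}
```

The client then checks that numFound equals the batch size and that the returned field values match what was sent.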

Almost every index update you make can be simply made again without
danger.  The exceptions are certain kinds of atomic updates, and certain
situations with deletes.  It's probably best to avoid doing those kinds
of updates, which are described below:

If you're doing atomic updates that increment or decrement a field
value, or atomic updates that add a new value to a multivalued field,
the results will be wrong if that update is repeated, although they
would be correct if the update is replayed from Solr's transaction log,
because atomic updates are no longer atomic when they hit the
transaction log -- they include values for every field in the document,
as if the document were built from scratch.
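
A toy model of that distinction, with Solr itself not involved: an atomic "inc" depends on the current value, so re-sending it applies the delta twice, while the expanded form stored in the transaction log sets an absolute value and is safe to replay:

```java
import java.util.Map;

// Simulates why a repeated atomic increment is wrong but a replayed
// expanded (absolute) update is not.
public class AtomicReplay {
    // Atomic increment: relative to the current value, not idempotent.
    static void applyInc(Map<String, Integer> doc, String field, int delta) {
        doc.merge(field, delta, Integer::sum);
    }

    // Expanded form, as the transaction log stores it: absolute value,
    // idempotent under replay.
    static void applySet(Map<String, Integer> doc, String field, int value) {
        doc.put(field, value);
    }
}
```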

If you are explicitly deleting a document before replacing it, and those
actions were re-done in the opposite order, then the document would be
missing from the index.  Because Solr handles the deletion automatically
when a document is being updated/replaced, explicit deleting is not
recommended for those situations.
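
The ordering hazard can be modeled in a few lines, again without Solr in the loop: delete-then-add leaves the document in place, but the same two operations re-done in the opposite order leave it missing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Replays a sequence of {op, id, value} operations against a toy index.
public class ReplayOrder {
    static Map<String, String> replay(String[][] ops) {
        Map<String, String> index = new LinkedHashMap<>();
        for (String[] op : ops) {
            if (op[0].equals("add")) index.put(op[1], op[2]);
            else index.remove(op[1]);          // "delete"
        }
        return index;
    }
}
```

Relying on Solr's implicit overwrite-by-uniqueKey avoids the hazard entirely, because there is only one operation to replay.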

> If we go with SOLR autocommit, is there a way to send a response to MQ
> saying the commit was successful?

If commits are completely automatic (autoCommit, autoSoftCommit, or
commitWithin), there's no way for a program to be sure that they have
completed.

The general recommendation for Solr indexing, especially if your
pipeline is multi-threaded, is to simply send your updates, let Solr
handle commits, and rely on the design of Lucene combined with Solr's
transaction logs to keep your data safe.  This approach does mean that
when things go wrong it may be a while before new data is searchable.
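
If you do let Solr handle commits, the usual knobs are autoCommit and autoSoftCommit in solrconfig.xml. A sketch, with purely illustrative intervals that you would tune for your own latency requirements:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes to stable storage; openSearcher=false means
       it does not make new documents visible by itself -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: makes new documents visible to searches -->
  <autoSoftCommit>
    <maxTime>60000</maxTime>
  </autoSoftCommit>
  <!-- transaction log, which makes uncommitted updates replayable -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
</updateHandler>
```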

Emir's reply is spot on.  Solr is not recommended as a primary data store.

Thanks,
Shawn



Re: Commit after every document - alternate approach

2016-03-04 Thread Emir Arnautovic

Hi Sangeetha,
It seems to me that you are using Solr as a primary data store? If that is 
true, you should not do that - you should have some other store that is 
transactional and can support what you are trying to do with Solr. If 
you are not using Solr as the primary store, and it is critical to have Solr 
in sync, you can run periodic checks (at about the same frequency as Solr 
commits) that ensure the latest data reached Solr.


Regards,
Emir

On 04.03.2016 05:46, sangs8788 wrote:

Hi Emir,

Right now we have only inserts into SOLR. The main reason for committing
after each document is to get a guarantee that the document has been indexed
in Solr. Until the commit status is received back, the document will not be
deleted from MQ, so that even if there is a commit failure the document can
be resent from MQ.

Thanks
Sangeetha



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Commit-after-every-document-alternate-approach-tp4260946p4261575.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Commit after every document - alternate approach

2016-03-03 Thread sangs8788
When a commit fails, the document doesn't get cleared out from MQ, and there
is a task which runs in the background to republish the files to SOLR. If we
do a batch commit and it fails, we will not know which documents failed and
will end up redoing the same batch commit again. We currently have a
client-side commit which issues the command to SOLR. commit() returns a
status code. If we plan to use commitWithin(), I don't think it will
actually return any result from Solr, since it is time-oriented.

If we go with SOLR autocommit, is there a way to send a response to MQ
saying the commit was successful?

Thanks
Sangeetha



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Commit-after-every-document-alternate-approach-tp4260946p4261587.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Commit after every document - alternate approach

2016-03-03 Thread Walter Underwood
So batch them. You get a response back from Solr saying whether the document 
was accepted. If that fails, there is a failure. What do you do then?

After every 100 docs or one minute, do a commit. Then delete the documents from 
the input queue. What do you do when the commit fails?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 3, 2016, at 8:46 PM, sangs8788  
> wrote:
> 
> Hi Emir,
> 
> Right now we have only inserts into SOLR. The main reason for committing
> after each document is to get a guarantee that the document has been
> indexed in Solr. Until the commit status is received back, the document
> will not be deleted from MQ, so that even if there is a commit failure the
> document can be resent from MQ.
> 
> Thanks
> Sangeetha
> 
> 
> 



Re: Commit after every document - alternate approach

2016-03-03 Thread Walter Underwood
If you need transactions, you should use a different system, like MarkLogic.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Mar 3, 2016, at 8:46 PM, sangs8788  
> wrote:
> 
> Hi Emir,
> 
> Right now we have only inserts into SOLR. The main reason for committing
> after each document is to get a guarantee that the document has been
> indexed in Solr. Until the commit status is received back, the document
> will not be deleted from MQ, so that even if there is a commit failure the
> document can be resent from MQ.
> 
> Thanks
> Sangeetha
> 
> 
> 



Re: Commit after every document - alternate approach

2016-03-03 Thread sangs8788
Hi Varun,

We don't have a SolrCloud setup in our system. We have a Master-Slave
architecture. In that case I don't see a way for SOLR to guarantee whether a
document got indexed/committed successfully or not.

We also thought about having a flag set in the db for whichever documents
were committed to SOLR. But that is also not feasible, because it again
requires a return status from SOLR.

The other option is to run a dataimport periodically to verify that all the
documents got indexed.

Is there any other option which I have missed out? Please let me know.

Thanks
Sangeetha



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Commit-after-every-document-alternate-approach-tp4260946p4261576.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Commit after every document - alternate approach

2016-03-03 Thread sangs8788
Hi Emir,

Right now we have only inserts into SOLR. The main reason for committing
after each document is to get a guarantee that the document has been indexed
in Solr. Until the commit status is received back, the document will not be
deleted from MQ, so that even if there is a commit failure the document can
be resent from MQ.

Thanks
Sangeetha





Re: Commit after every document - alternate approach

2016-03-02 Thread Varun Thacker
Hi Sangeetha,

Well, I don't think you need to commit after every document add.

You can rely on Solr's transaction log feature. If you are using SolrCloud
it's mandatory to have a transaction log, so every document gets written to
the tlog. Now say a node crashes: even if documents were not committed,
since they are present in the tlog Solr will replay them on startup.

Also, if you are using SolrCloud and have multiple replicas, you should use
the min_rf feature to make sure that N replicas acknowledge the write
before you get back success -
https://cwiki.apache.org/confluence/display/solr/Read+and+Write+Side+Fault+Tolerance

On Wed, Mar 2, 2016 at 3:41 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Sangeetha,
> What is sure is that it is not going to work - with 200-300K docs/hour,
> there will be >50 commits/second, meaning there is a <20ms window for each
> doc+commit.
> What you can do is let Solr handle commits, and maybe use real-time get to
> verify a doc is in Solr, or do some periodic sanity checks.
> Are you doing document updates, so that in-order updates are the reason you
> commit each doc before moving to the next one?
>
> Regards,
> Emir
>
>
> On 02.03.2016 09:06, sangeetha.subraman...@gtnexus.com wrote:
>
>> Hi All,
>>
>> I am trying to understand how we can have commits issued to Solr while
>> indexing documents. Around 200K to 300K documents per hour, with an avg
>> size of 10 KB each, will be getting into SOLR. JAVA code fetches the
>> document from MQ and streams it to SOLR. The problem is that the client
>> code issues a hard commit after each document sent to SOLR for indexing,
>> and it waits for the response from SOLR for assurance that the document
>> got indexed successfully. Only if it gets an OK status from SOLR is the
>> document cleared out from MQ.
>>
>> As far as I understand, doing a commit after each document is an
>> expensive operation. But we need to make sure that all the documents
>> which are put into MQ get indexed in SOLR. Is there any other way of
>> getting this done? Please let me know.
>> If we do batch indexing, is there any chance we can identify if some
>> documents were missed from indexing?
>>
>> Thanks
>> Sangeetha
>>
>>


-- 


Regards,
Varun Thacker


Re: Commit after every document - alternate approach

2016-03-02 Thread Emir Arnautovic

Hi Sangeetha,
What is sure is that it is not going to work - with 200-300K docs/hour, 
there will be >50 commits/second, meaning there is a <20ms window for each 
doc+commit.
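
The arithmetic behind that estimate, made explicit (using the low end of 200K docs/hour):

```java
// A commit per document means commit rate == document rate, so the
// per-document time budget is the inverse of that rate.
public class CommitBudget {
    static double docsPerSecond(double docsPerHour) {
        return docsPerHour / 3600.0;
    }
    static double millisPerDoc(double docsPerHour) {
        return 1000.0 / docsPerSecond(docsPerHour);
    }
}
```

200,000 / 3600 is roughly 55 commits/second, leaving about 18 ms for each doc+commit.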
What you can do is let Solr handle commits, and maybe use real-time get to 
verify a doc is in Solr, or do some periodic sanity checks.
Are you doing document updates, so that in-order updates are the reason 
you commit each doc before moving to the next one?


Regards,
Emir

On 02.03.2016 09:06, sangeetha.subraman...@gtnexus.com wrote:

Hi All,

I am trying to understand how we can have commits issued to Solr while 
indexing documents. Around 200K to 300K documents per hour, with an avg size 
of 10 KB each, will be getting into SOLR. JAVA code fetches the document from 
MQ and streams it to SOLR. The problem is that the client code issues a hard 
commit after each document sent to SOLR for indexing, and it waits for the 
response from SOLR for assurance that the document got indexed successfully. 
Only if it gets an OK status from SOLR is the document cleared out from MQ.

As far as I understand, doing a commit after each document is an expensive 
operation. But we need to make sure that all the documents which are put into 
MQ get indexed in SOLR. Is there any other way of getting this done? Please 
let me know.
If we do batch indexing, is there any chance we can identify if some 
documents were missed from indexing?

Thanks
Sangeetha



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Commit after every document - alternate approach

2016-03-02 Thread sangeetha.subraman...@gtnexus.com
Hi All,

I am trying to understand how we can have commits issued to Solr while 
indexing documents. Around 200K to 300K documents per hour, with an avg size 
of 10 KB each, will be getting into SOLR. JAVA code fetches the document from 
MQ and streams it to SOLR. The problem is that the client code issues a hard 
commit after each document sent to SOLR for indexing, and it waits for the 
response from SOLR for assurance that the document got indexed successfully. 
Only if it gets an OK status from SOLR is the document cleared out from MQ.

As far as I understand, doing a commit after each document is an expensive 
operation. But we need to make sure that all the documents which are put into 
MQ get indexed in SOLR. Is there any other way of getting this done? Please 
let me know.
If we do batch indexing, is there any chance we can identify if some 
documents were missed from indexing?

Thanks
Sangeetha