Re: Index documents in async way

2020-10-15 Thread David Smiley
>
> What I mean here is that right now, when we send a batch of documents to
> Solr, we still process them as discrete, unrelated documents by indexing
> them one by one. If indexing the fifth document causes an error, that won't
> affect the 4 already-indexed documents. Using this model we can index the
> batch in an atomic way.
>

It sounds like you are suggesting an alternative version of
TolerantUpdateProcessorFactory (would look nothing like it though) that
sends each document forward in its own thread concurrently instead of
serially/synchronously?  If true, I'm quite supportive of this -- Solr
should just work this way IMO.  It could speed up some use-cases
tremendously and also cap abuse of too many client indexing requests
(people who unwittingly send batches asynchronously/concurrently from a
queue, possibly flooding Solr).  I remember I discussed this proposal with
Mikhail many years ago at a Lucene/Solr Revolution conference in Washington
DC (the first one held there?), as a way to speed up all indexing, not just
DIH (which was his curiosity at the time).  He wasn't a fan.  I'll CC him
to see if it jogs his memory :-)
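
For illustration, a minimal sketch of that concurrent-forwarding idea. Everything here is hypothetical (the Doc record and indexOne callback stand in for Solr's real AddUpdateCommand / processAdd plumbing); it is a sketch of the suggestion, not how Solr is wired today:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Consumer;

    // Hypothetical sketch: fan each document of a batch out to its own task
    // instead of indexing serially, and collect per-doc failures the way
    // TolerantUpdateProcessor reports them.
    public class ConcurrentBatchIndexer {
      private final ExecutorService pool = Executors.newFixedThreadPool(8);

      /** Indexes all docs concurrently; returns docId -> error for failures. */
      public Map<String, String> indexBatch(List<Doc> docs, Consumer<Doc> indexOne)
          throws InterruptedException {
        Map<String, String> errors = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(docs.size());
        for (Doc doc : docs) {
          pool.submit(() -> {
            try {
              indexOne.accept(doc);                 // forward to the next processor
            } catch (Exception e) {
              errors.put(doc.id(), e.getMessage()); // tolerate, record, continue
            } finally {
              done.countDown();
            }
          });
        }
        done.await();  // the request itself still returns synchronously
        return errors;
      }

      public record Doc(String id) {}
    }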

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Oct 15, 2020 at 6:29 AM Đạt Cao Mạnh 
wrote:

> > The biggest problem I have with this is that the client doesn't know
> about indexing problems without awkward callbacks later to see if something
> went wrong.  Even simple stuff like a schema problem (e.g. undefined
> field).  It's a useful *option*, anyway.
> > Currently we guarantee that if solr sends you an OK response the
> document WILL eventually become searchable without further action.
> Maintaining that guarantee becomes impossible if we haven't verified that
> the data is formatted correctly (i.e. dates are in ISO format, etc).
> This may be an acceptable cost for those opting for async indexing, but it
> may be very hard for some folks to swallow if it became the only option,
> however.
>
> I don't mean that we're gonna replace the sync way. Plus the way sync
> works can be changed to leverage the async method. For example, the sync
> thread basically waits until the indexer threads finish indexing that
> update. One thing to note here is that all the validations in update
> processors will still happen in a *sync* way in this model. Only the
> indexing part to Lucene is executed in an async way.
>
> > I'm a bit skeptical that would boost indexing performance.  Please
> understand the intent of that API is about transactionality (atomic add)
> and ensuring all docs go in the same segment.  Solr *does* use that API for
> nested / parent-child documents, and because it has to.  If that API were
> to get called for normal docs, I could see the configured indexing buffer
> RAM or doc limit being exceeded substantially.  Perhaps not a big deal.
> You could test your performance theory on a hacked Solr without many
> modifications, I think?  Just buffer then send in bulk.
>
>>
>>
> What I mean here is that right now, when we send a batch of documents to
> Solr, we still process them as discrete, unrelated documents by indexing
> them one by one. If indexing the fifth document causes an error, that won't
> affect the 4 already-indexed documents. Using this model we can index the
> batch in an atomic way.
>
> > I think that's the real thrust of your motivations, and sounds good to
> me!  Also, please see https://issues.apache.org/jira/browse/SOLR-14778 for
> making the optionality of the updateLog be a better supportable option in
> SolrCloud.
>
> Thank you for bringing it up; I will take a look.
>
> All the other parts that I did not quote are very valuable, and I
> really appreciate them.
>
> On Wed, Oct 14, 2020 at 12:06 AM Gus Heck  wrote:
>
>> This is interesting, though it opens a few cans of worms IMHO.
>>
>>1. Currently we guarantee that if solr sends you an OK response
>>the document WILL eventually become searchable without further action.
>>Maintaining that guarantee becomes impossible if we haven't verified that
>>the data is formatted correctly (i.e. dates are in ISO format, etc).
>>This may be an acceptable cost for those opting for async indexing, but it
>>may be very hard for some folks to swallow if it became the only option,
>>however.
>>2. In the case of errors we need to hold the error message
>>indefinitely for later discovery by the client; this needs to not
>>accumulate forever. Thus:
>>   1. We have a timed cleanup, leasing or some other self-limiting
>>   pattern... possibly by indexing the failures in a TRA with autodelete so
>>   that clients can efficiently find the status of the particular
>>   document(s) they sent; obviously there's at least an async id involved,
>>   probably the uniqueKey (where available) and timestamps for received,
>>   and processed as well.
>>   2. We log more simply with a sequential id and let clients keep
>>   track ...

Re: Index documents in async way

2020-10-15 Thread Đạt Cao Mạnh
> The biggest problem I have with this is that the client doesn't know
about indexing problems without awkward callbacks later to see if something
went wrong.  Even simple stuff like a schema problem (e.g. undefined
field).  It's a useful *option*, anyway.
> Currently we guarantee that if solr sends you an OK response the
document WILL eventually become searchable without further action.
Maintaining that guarantee becomes impossible if we haven't verified that
the data is formatted correctly (i.e. dates are in ISO format, etc).
This may be an acceptable cost for those opting for async indexing, but it
may be very hard for some folks to swallow if it became the only option,
however.

I don't mean that we're gonna replace the sync way. Plus the way sync
works can be changed to leverage the async method. For example, the sync
thread basically waits until the indexer threads finish indexing that
update. Some notice here is all the validations in update processors will
still happen in *sync* way in this model. Only indexing part to Lucene is
executed in an async way.
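
In other words, the sync path would become a thin wrapper that blocks on the async one. A minimal sketch with hypothetical names (indexToLucene stands in for the writer.addDocument call):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch: the request thread enqueues the update and blocks on a future;
    // a dedicated indexer thread completes it once the Lucene-side indexing
    // (the only async part) is done.
    public class SyncOverAsync {
      record PendingUpdate(String docJson, CompletableFuture<Void> done) {}

      private final BlockingQueue<PendingUpdate> queue = new LinkedBlockingQueue<>();

      public SyncOverAsync() {
        Thread indexer = new Thread(() -> {
          while (true) {
            final PendingUpdate u;
            try {
              u = queue.take();
            } catch (InterruptedException e) {
              return;
            }
            try {
              indexToLucene(u.docJson());        // stand-in for writer.addDocument
              u.done().complete(null);           // wake a waiting sync caller
            } catch (Exception e) {
              u.done().completeExceptionally(e); // surface the indexing failure
            }
          }
        });
        indexer.setDaemon(true);
        indexer.start();
      }

      /** Async path: validations already ran; enqueue and return a handle. */
      public CompletableFuture<Void> addAsync(String docJson) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        queue.add(new PendingUpdate(docJson, done));
        return done;
      }

      /** Sync path: same pipeline, but wait for the indexer thread to finish. */
      public void addSync(String docJson) throws Exception {
        addAsync(docJson).get();
      }

      private void indexToLucene(String docJson) { /* ... */ }
    }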

> I'm a bit skeptical that would boost indexing performance.  Please
understand the intent of that API is about transactionality (atomic add)
and ensuring all docs go in the same segment.  Solr *does* use that API for
nested / parent-child documents, and because it has to.  If that API were
to get called for normal docs, I could see the configured indexing buffer
RAM or doc limit being exceeded substantially.  Perhaps not a big deal.
You could test your performance theory on a hacked Solr without many
modifications, I think?  Just buffer then send in bulk.

>
>
> What I mean here is that right now, when we send a batch of documents to
Solr, we still process them as discrete, unrelated documents by indexing
them one by one. If indexing the fifth document causes an error, that won't
affect the 4 already-indexed documents. Using this model we can index the
batch in an atomic way.

> I think that's the real thrust of your motivations, and sounds good to
me!  Also, please see https://issues.apache.org/jira/browse/SOLR-14778 for
making the optionality of the updateLog be a better supportable option in
SolrCloud.

Thank you for bringing it up; I will take a look.

All the other parts that I did not quote are very valuable, and I
really appreciate them.

On Wed, Oct 14, 2020 at 12:06 AM Gus Heck  wrote:

> This is interesting, though it opens a few cans of worms IMHO.
>
>1. Currently we guarantee that if solr sends you an OK response
>the document WILL eventually become searchable without further action.
>Maintaining that guarantee becomes impossible if we haven't verified that
>the data is formatted correctly (i.e. dates are in ISO format, etc).
>This may be an acceptable cost for those opting for async indexing, but it
>may be very hard for some folks to swallow if it became the only option,
>however.
>2. In the case of errors we need to hold the error message
>indefinitely for later discovery by the client; this needs to not
>accumulate forever. Thus:
>   1. We have a timed cleanup, leasing or some other self-limiting
>   pattern... possibly by indexing the failures in a TRA with autodelete so
>   that clients can efficiently find the status of the particular
>   document(s) they sent; obviously there's at least an async id involved,
>   probably the uniqueKey (where available) and timestamps for received,
>   and processed as well.
>   2. We log more simply with a sequential id and let clients keep
>   track of what they have seen... This can lead us down the path of
>   re-inventing kafka, or making kafka a required dependency.
>   3. We provide a push oriented connection (websocket? HTTP2?) that
>   clients that care about failures can listen to and store nothing. A less
>   appetizing variant is to publish errors to a message bus.
>3. If we have more than one thread picking up the submitted documents
>and writing them, we need a state machine that identifies in-progress
>documents to prevent multiple pickups and resets processing to new on
>startup to ensure we don't index the same document twice and don't lose
>things that were in-flight on power loss.
>4. Backpressure/throttling. If we're losing ground continuously on the
>submissions because indexing is heavier than accepting documents, we may
>fill up the disk. Of course the index itself can do that, but we need to
>think about whether this makes it worse.
>
> A big plus to this however is that batches with errors could optionally
> just omit the (one or two?) errored document(s) and publish the error for
> each errored document rather than failing the whole batch, meaning that the
> indexing infrastructure submitting in batches doesn't have to leave several
> hundred docs unprocessed, or alternately do a slow doc-at-a-time resubmit
> to weed out the offenders.
>
> Certainly ...

Re: Index documents in async way

2020-10-13 Thread Gus Heck
This is interesting, though it opens a few cans of worms IMHO.

   1. Currently we guarantee that if solr sends you an OK response the
   document WILL eventually become searchable without further action.
   Maintaining that guarantee becomes impossible if we haven't verified that
   the data is formatted correctly (i.e. dates are in ISO format, etc).
   This may be an acceptable cost for those opting for async indexing, but it
   may be very hard for some folks to swallow if it became the only option,
   however.
   2. In the case of errors we need to hold the error message indefinitely
   for later discovery by the client; this needs to not accumulate forever.
   Thus:
  1. We have a timed cleanup, leasing or some other self-limiting
  pattern... possibly by indexing the failures in a TRA with autodelete so
  that clients can efficiently find the status of the particular document(s)
  they sent; obviously there's at least an async id involved, probably the
  uniqueKey (where available) and timestamps for received, and processed as
  well (see the sketch after this list).
  2. We log more simply with a sequential id and let clients keep track
  of what they have seen... This can lead us down the path of re-inventing
  kafka, or making kafka a required dependency.
  3. We provide a push-oriented connection (websocket? HTTP2?) that
  clients that care about failures can listen to and store nothing. A less
  appetizing variant is to publish errors to a message bus.
   3. If we have more than one thread picking up the submitted documents
   and writing them, we need a state machine that identifies in-progress
   documents to prevent multiple pickups and resets processing to new on
   startup to ensure we don't index the same document twice and don't lose
   things that were in-flight on power loss.
   4. Backpressure/throttling. If we're losing ground continuously on the
   submissions because indexing is heavier than accepting documents, we may
   fill up the disk. Of course the index itself can do that, but we need to
   think about whether this makes it worse.
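
As a concrete illustration of option 2.1 above, a rough sketch of such a failure-status record built with SolrJ. All field names are invented for illustration; nothing like this exists in Solr today:

    import java.time.Instant;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch: index each failure as a document in a time-routed alias (TRA)
    // with autodelete, so clients can look their batch up later.
    public class FailureStatusRecord {
      public static SolrInputDocument toStatusDoc(String asyncId, String uniqueKey,
          Instant received, Instant processed, String errorMessage) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("async_id_s", asyncId);           // correlation id for the client
        doc.addField("doc_key_s", uniqueKey);          // uniqueKey, where available
        doc.addField("received_dt", received.toString());
        doc.addField("processed_dt", processed.toString());
        doc.addField("error_t", errorMessage);         // why indexing failed
        return doc;
      }
    }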

A big plus to this however is that batches with errors could optionally
just omit the (one or two?) errored document(s) and publish the error for
each errored document rather than failing the whole batch, meaning that the
indexing infrastructure submitting in batches doesn't have to leave several
hundred docs unprocessed, or alternately do a slow doc-at-a-time resubmit
to weed out the offenders.

Certainly the involvement of Kafka sounds interesting. If one persists to
an externally addressable location like a Kafka queue, one might leave the
option for the write-on-receipt queue to be different from the
read-to-actually-index queue and put a pipeline behind Solr instead of in
front of it... possibly atomic updates could then be given the same
processing as initial indexing.

On Sat, Oct 10, 2020 at 12:41 AM David Smiley  wrote:

>
>
> On Thu, Oct 8, 2020 at 10:21 AM Cao Mạnh Đạt  wrote:
>
>> Hi guys,
>>
>> First of all, it seems that I have used the term async a lot recently :D.
>> Recently I have been thinking a lot about changing the current indexing
>> model of Solr from the sync way it is now (a user submits an update request
>> and waits for the response). What about changing it to an async model, where
>> nodes will only persist the update into the tlog then return immediately,
>> much like what the tlog is doing now. Then we have a dedicated executor
>> which reads from the tlog to do indexing (a producer-consumer model with the
>> tlog acting like the queue).
>>
>
> The biggest problem I have with this is that the client doesn't know about
> indexing problems without awkward callbacks later to see if something went
> wrong.  Even simple stuff like a schema problem (e.g. undefined field).
> It's a useful *option*, anyway.
>
>
>>
>> I do see several big benefits of this approach
>>
>>- We can batch updates in a single call; right now we do not use the
>>writer.add(documents) api from lucene, and by batching updates this is
>>going to boost the performance of indexing
>>
>> I'm a bit skeptical that would boost indexing performance.  Please
> understand the intent of that API is about transactionality (atomic add)
> and ensuring all docs go in the same segment.  Solr *does* use that API for
> nested / parent-child documents, and because it has to.  If that API were
> to get called for normal docs, I could see the configured indexing buffer
> RAM or doc limit being exceeded substantially.  Perhaps not a big deal.
> You could test your performance theory on a hacked Solr without many
> modifications, I think?  Just buffer then send in bulk.
>
>>
>>- One common problem with Solr now is that we have lots of threads doing
>>indexing, which can end up with many small segments. Using this model we
>>can have bigger segments and so less merge cost
>>
>> This is app/use-case dependent of course.  If you observe the segment
> count to be high, I think it's more ...

Re: Index documents in async way

2020-10-09 Thread David Smiley
On Thu, Oct 8, 2020 at 10:21 AM Cao Mạnh Đạt  wrote:

> Hi guys,
>
> First of all, it seems that I have used the term async a lot recently :D.
> Recently I have been thinking a lot about changing the current indexing
> model of Solr from the sync way it is now (a user submits an update request
> and waits for the response). What about changing it to an async model, where
> nodes will only persist the update into the tlog then return immediately,
> much like what the tlog is doing now. Then we have a dedicated executor
> which reads from the tlog to do indexing (a producer-consumer model with the
> tlog acting like the queue).
>

The biggest problem I have with this is that the client doesn't know about
indexing problems without awkward callbacks later to see if something went
wrong.  Even simple stuff like a schema problem (e.g. undefined field).
It's a useful *option*, anyway.


>
> I do see several big benefits of this approach
>
>- We can batch updates in a single call; right now we do not use the
>writer.add(documents) api from lucene, and by batching updates this is
>going to boost the performance of indexing
>
> I'm a bit skeptical that would boost indexing performance.  Please
understand the intent of that API is about transactionality (atomic add)
and ensuring all docs go in the same segment.  Solr *does* use that API for
nested / parent-child documents, and because it has to.  If that API were
to get called for normal docs, I could see the configured indexing buffer
RAM or doc limit being exceeded substantially.  Perhaps not a big deal.
You could test your performance theory on a hacked Solr without many
modifications, I think?  Just buffer then send in bulk.

>
>- One common problem with Solr now is that we have lots of threads doing
>indexing, which can end up with many small segments. Using this model we
>can have bigger segments and so less merge cost
>
> This is app/use-case dependent of course.  If you observe the segment
count to be high, I think it's more likely due to a sub-optimal commit
strategy.  Many users should put more thought into this.  If update
requests to Solr have a small number of docs and explicitly issue a
commit (commitWithin on the other hand is fine), then this is a recipe for
small segments and is generally expensive as well (commits are kinda
expensive).  Many apps would do well to either pass commitWithin or rely on
a configured commitWithin, accomplishing the same instead of commit.  For
apps that can't do that (e.g. need to immediately read-your-write or for
other reasons where I work), then such an app can't use async anyway.

An idea I've had to help throughput for this case would be for commits that
are about to happen concurrently with other indexing to voluntarily wait a
short period of time (500ms?) in an attempt to coalesce the commit needs of
both concurrent indexing requests.  Very opt-in, probably a solrconfig
value, and probably a wait time in the 100s of milliseconds range.  An
ideal implementation would be conservative and skip the wait when there is
no concurrent indexing request that also requires a commit.
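
For illustration only, a toy sketch of that coalescing idea (not an existing Solr feature; the wait time would come from solrconfig per the suggestion):

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch: a commit that arrives while other indexing is in flight waits
    // briefly so that one physical commit can satisfy several concurrent
    // requests.
    public class CoalescingCommitter {
      private final long waitMillis;   // e.g. 500, an assumed config value
      private final AtomicInteger activeIndexers = new AtomicInteger();
      private final Object commitLock = new Object();
      private long lastCommitNanos;

      public CoalescingCommitter(long waitMillis) {
        this.waitMillis = waitMillis;
      }

      public void beginIndexing() { activeIndexers.incrementAndGet(); }
      public void endIndexing()   { activeIndexers.decrementAndGet(); }

      /** Called when a request asks for a commit. */
      public void commit(Runnable doPhysicalCommit) throws InterruptedException {
        long requestedAt = System.nanoTime();
        // Be conservative: only wait if someone else is indexing right now.
        if (activeIndexers.get() > 0) {
          TimeUnit.MILLISECONDS.sleep(waitMillis);
        }
        synchronized (commitLock) {
          // If another thread committed while we slept, our docs are covered.
          if (lastCommitNanos > requestedAt) {
            return;
          }
          doPhysicalCommit.run();
          lastCommitNanos = System.nanoTime();
        }
      }
    }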

If your goal is fewer segments, then definitely check out the recent
improvements to Lucene to do some lightweight merge-on-commit.  The
Solr-side hook is SOLR-14582 and it requires a custom MergePolicy.  I
contributed such a MergePolicy here:
https://github.com/apache/lucene-solr/pull/1222 although it needs to be
updated in light of Lucene changes that occurred since then.  We've been
using that MergePolicy at Salesforce for a couple of years and it has cut our
segment count in half!  Of course if you can generate fewer segments in the
first place, that's preferable and is more to your point.
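
For reference, the Lucene hook involved is MergePolicy.findFullFlushMerges (added for merge-on-commit in Lucene 8.6). A toy policy, emphatically not the PR above, could look like the following; the 10MB cutoff is an arbitrary assumption, and a real policy must also exclude segments already registered in a running merge:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.FilterMergePolicy;
    import org.apache.lucene.index.MergeTrigger;
    import org.apache.lucene.index.SegmentCommitInfo;
    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.index.TieredMergePolicy;

    // Toy merge-on-commit policy: when a commit happens, ask Lucene to first
    // merge any tiny segments together.
    public class TinySegmentsOnCommitMergePolicy extends FilterMergePolicy {
      private static final long TINY_BYTES = 10L * 1024 * 1024; // assumed cutoff

      public TinySegmentsOnCommitMergePolicy() {
        super(new TieredMergePolicy());
      }

      @Override
      public MergeSpecification findFullFlushMerges(MergeTrigger trigger,
          SegmentInfos infos, MergeContext context) throws IOException {
        if (trigger != MergeTrigger.COMMIT) {
          return null;
        }
        List<SegmentCommitInfo> tiny = new ArrayList<>();
        for (SegmentCommitInfo sci : infos) {
          if (sci.sizeInBytes() <= TINY_BYTES) {
            tiny.add(sci);
          }
        }
        if (tiny.size() < 2) {
          return null; // nothing worth coalescing on this commit
        }
        MergeSpecification spec = new MergeSpecification();
        spec.add(new OneMerge(tiny));
        return spec;
      }
    }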

>
>- Another huge reason here is that after switching to this model, we can
>remove the tlog and use a distributed queue like Kafka or Pulsar. Since the
>purpose of the leader in SolrCloud now is ordering updates, and the
>distributed queue already orders updates for us, there is no need to have a
>dedicated leader. That is just the beginning of things that we can do after
>using a distributed queue.
>
> I think that's the real thrust of your motivations, and sounds good to
me!  Also, please see https://issues.apache.org/jira/browse/SOLR-14778 for
making the optionality of the updateLog be a better supportable option in
SolrCloud.


> What do you guys think about this? Just want to hear from you guys
> before going deep into this rabbit hole.
>
> Thanks!
>
>


Re: Index documents in async way

2020-10-09 Thread Ilan Ginzburg
I like the idea.

Two (main) points are not clear to me:
- Order of updates: If the current leader fails (its tlog becoming
inaccessible) and another leader is elected and indexes some more,
what happens when the first leader comes back? What does it do with
its tlog, and how does it know which part needs to be indexed and which
part should not (more recent versions of the same docs already indexed
by the other leader)?
- Durability: If the tlog only exists on a single node before the
client gets an ACK and that node fails "forever", the update is lost.
We need a way to even detect that an update was lost (which might not
be obvious).

Unclear to me if your previous answers address these points.

Ilan

On Fri, Oct 9, 2020 at 3:54 PM Cao Mạnh Đạt  wrote:
>
> Thank you Tomas
>
> >Atomic updates, can those be supported? I guess yes if we can guarantee that 
> >messages are read once and only once.
> It won't be straightforward since we have multiple consumers on the tlog 
> queue. But it is possible with appropriate locking
>
> >I'm guessing we'd need to read messages in an ordered way, so it'd be a 
> >single Kafka partition per Solr shard, right? (Don't know Pulsar)
> It will likely be the case, but like I said, async updates will be the first 
> piece; switching to using Kafka is going to be another area to look at
>
> >May be difficult to determine what replicas should do after a document 
> >update failure. Do they continue processing (which means if it was a 
> >transient error they'll become inconsistent) or do they stop? but if none of 
> >the replicas could process the document then they would all go to recovery?
> Good question, I had not thought about this, but I think the current model of 
> SolrCloud needs to answer this question too, i.e., the leader failed but 
> others succeeded.
>
> > maybe try to recover from other active replicas?
> I think it is totally possible
>
> > Maybe we could have a way to stream those responses out? (i.e. via another 
> > queue)? Maybe with an option to only stream out errors or something.
> It could be, but for REST users it is going to be difficult for them
>
> >I don't think that's correct? see DUH2.doNormalUpdate.
> You're right, we actually run the update first then writing to the tlog later
>
> > How would this work in your mind with one of the distributed queues?
> For a distributed queue, basically for every commit we need to store the 
> latest consumed offset corresponding to the commit. An easy solution here can 
> be blocking everything and then doing the commit; the commit data will store 
> the latest consumed offset
>
> On Fri, Oct 9, 2020 at 11:49 AM Tomás Fernández Löbbe  
> wrote:
>>
>> Interesting idea Đạt. The first questions/comments that come to my mind 
>> would be:
>> * Atomic updates, can those be supported? I guess yes if we can guarantee 
>> that messages are read once and only once.
>> * I'm guessing we'd need to read messages in an ordered way, so it'd be a 
>> single Kafka partition per Solr shard, right? (Don't know Pulsar)
>> * May be difficult to determine what replicas should do after a document 
>> update failure. Do they continue processing (which means if it was a 
>> transient error they'll become inconsistent) or do they stop? maybe try to 
>> recover from other active replicas? but if none of the replicas could 
>> process the document then they would all go to recovery?
>>
>> > Then the user will call another endpoint for tracking the response like 
>> > GET status_updates?trackId=,
>> Maybe we could have a way to stream those responses out? (i.e. via another 
>> queue)? Maybe with an option to only stream out errors or something.
>>
>> > Currently we are also adding to tlog first then call writer.addDoc later
>> I don't think that's correct? see DUH2.doNormalUpdate.
>>
>> > I think it won't be very different from what we are having now, since on 
>> > commit (producer threads do the commit) we rotate to a new tlog.
>> How would this work in your mind with one of the distributed queues?
>>
>> I think this is a great idea, something that needs to be deeply thought, but 
>> could make big improvements. Thanks for bringing this up, Đạt.
>>
>> On Thu, Oct 8, 2020 at 7:39 PM Đạt Cao Mạnh  wrote:
>>>
>>> > Can there be a situation where the index writer fails after the document 
>>> > was added to tlog and a success is sent to the user? I think we want to 
>>> > avoid such a situation, don't we?
>>> > I suppose failures would be returned to the client on the async response?
>>> To make things clearer, the response for an async update will be something 
>>> like this
>>> { "trackId" : "" }
>>> Then the user will call another endpoint for tracking the response like GET 
>>> status_updates?trackId=, and the response will tell 
>>> whether the update is in_queue, processing, succeeded or failed. Currently we 
>>> are also adding to tlog first then calling writer.addDoc later.
>>> Later we can convert current sync operations by waiting until the update 
>>> gets ...

Re: Index documents in async way

2020-10-09 Thread Cao Mạnh Đạt
Thank you Tomas

>Atomic updates, can those be supported? I guess yes if we can guarantee
that messages are read once and only once.
It won't be straightforward since we have multiple consumers on the tlog
queue. But it is possible with appropriate locking

>I'm guessing we'd need to read messages in an ordered way, so it'd be a
single Kafka partition per Solr shard, right? (Don't know Pulsar)
It will likely be the case, but like I said, async updates will be the first
piece; switching to using Kafka is going to be another area to look at

>May be difficult to determine what replicas should do after a document
update failure. Do they continue processing (which means if it was a
transient error they'll become inconsistent) or do they stop? but if none
of the replicas could process the document then they would all go to
recovery?
Good question, I had not thought about this, but I think the current model
of SolrCloud needs to answer this question too, i.e., the leader failed but
others succeeded.

> maybe try to recover from other active replicas?
I think it is totally possible

> Maybe we could have a way to stream those responses out? (i.e. via
another queue)? Maybe with an option to only stream out errors or something.
It could be, but for REST users it is going to be difficult for them

>I don't think that's correct? see DUH2.doNormalUpdate.
You're right, we actually run the update first then writing to the tlog
later

> How would this work in your mind with one of the distributed queues?
For a distributed queue, basically for every commit we need to store the
latest consumed offset corresponding to the commit. An easy solution here
can be blocking everything and then doing the commit; the commit data will
store the latest consumed offset
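
A minimal sketch of that bookkeeping using Lucene's commit user data; the "kafka.offset" key is made up for illustration:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;

    // Sketch: tie the Lucene commit to the queue position by storing the last
    // consumed offset in the commit user data.
    public class OffsetAwareCommitter {
      public static void commitWithOffset(IndexWriter writer, long lastConsumedOffset)
          throws IOException {
        Map<String, String> userData = new HashMap<>();
        userData.put("kafka.offset", Long.toString(lastConsumedOffset));
        writer.setLiveCommitData(userData.entrySet()); // attached to the next commit
        writer.commit();
      }

      // On restart, read the offset back and resume consuming from offset + 1.
      public static long readCommittedOffset(Directory dir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          String offset = reader.getIndexCommit().getUserData().get("kafka.offset");
          return offset == null ? -1L : Long.parseLong(offset);
        }
      }
    }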

On Fri, Oct 9, 2020 at 11:49 AM Tomás Fernández Löbbe 
wrote:

> Interesting idea Đạt. The first questions/comments that come to my mind
> would be:
> * Atomic updates, can those be supported? I guess yes if we can guarantee
> that messages are read once and only once.
> * I'm guessing we'd need to read messages in an ordered way, so it'd be a
> single Kafka partition per Solr shard, right? (Don't know Pulsar)
> * May be difficult to determine what replicas should do after a document
> update failure. Do they continue processing (which means if it was a
> transient error they'll become inconsistent) or do they stop? maybe try to
> recover from other active replicas? but if none of the replicas could
> process the document then they would all go to recovery?
>
> > Then the user will call another endpoint for tracking the response like
> GET status_updates?trackId=,
> Maybe we could have a way to stream those responses out? (i.e. via another
> queue)? Maybe with an option to only stream out errors or something.
>
> > Currently we are also adding to tlog first then call writer.addDoc later
> I don't think that's correct? see DUH2.doNormalUpdate.
>
> > I think it won't be very different from what we are having now, since on
> commit (producer threads do the commit) we rotate to a new tlog.
> How would this work in your mind with one of the distributed queues?
>
> I think this is a great idea, something that needs to be deeply thought,
> but could make big improvements. Thanks for bringing this up, Đạt.
>
> On Thu, Oct 8, 2020 at 7:39 PM Đạt Cao Mạnh 
> wrote:
>
>> > Can there be a situation where the index writer fails after the
>> document was added to tlog and a success is sent to the user? I think we
>> want to avoid such a situation, don't we?
>> > I suppose failures would be returned to the client on the async
>>  response?
>> To make things clearer, the response for an async update will be
>> something like this
>> { "trackId" : "" }
>> Then the user will call another endpoint for tracking the response like
>> GET status_updates?trackId=, and the response will tell
>> whether the update is in_queue, processing, succeeded or failed. Currently we
>> are also adding to tlog first then calling writer.addDoc later.
>> Later we can convert current sync operations by waiting until the update
>> gets processed before returning to users.
>>
>> >How would one keep the tlog from growing forever if the actual indexing
>> took a long time?
>> I think it won't be very different from what we are having now, since on
>> commit (producer threads do the commit) we rotate to a new tlog.
>>
>> > I'd like to add another wrinkle to this, which is to store the
>> information about each batch as a record in the index. Each batch record
>> would contain a fingerprint for the batch. This solves lots of problems,
>> and allows us to confirm the integrity of the batch. It also means that we
>> can compare indexes by comparing the batch fingerprints rather than
>> building a fingerprint from the entire index.
>> Thank you, it adds another pro to this model :P
>>
>> On Fri, Oct 9, 2020 at 2:10 AM Joel Bernstein  wrote:
>>
>>> I think this model has a lot of potential.
>>>
>>> I'd like to add another wrinkle to this. ...

Re: Index documents in async way

2020-10-08 Thread Tomás Fernández Löbbe
Interesting idea Đạt. The first questions/comments that come to my mind
would be:
* Atomic updates, can those be supported? I guess yes if we can guarantee
that messages are read once and only once.
* I'm guessing we'd need to read messages in an ordered way, so it'd be a
single Kafka partition per Solr shard, right? (Don't know Pulsar)
* May be difficult to determine what replicas should do after a document
update failure. Do they continue processing (which means if it was a
transient error they'll become inconsistent) or do they stop? maybe try to
recover from other active replicas? but if none of the replicas could
process the document then they would all go to recovery?

> Then the user will call another endpoint for tracking the response like
GET status_updates?trackId=,
Maybe we could have a way to stream those responses out? (i.e. via another
queue)? Maybe with an option to only stream out errors or something.

> Currently we are also adding to tlog first then call writer.addDoc later
I don't think that's correct? see DUH2.doNormalUpdate.

> I think it won't be very different from what we are having now, since on
commit (producer threads do the commit) we rotate to a new tlog.
How would this work in your mind with one of the distributed queues?

I think this is a great idea, something that needs to be deeply thought,
but could make big improvements. Thanks for bringing this up, Đạt.

On Thu, Oct 8, 2020 at 7:39 PM Đạt Cao Mạnh  wrote:

> > Can there be a situation where the index writer fails after the document
> was added to tlog and a success is sent to the user? I think we want to
> avoid such a situation, don't we?
> > I suppose failures would be returned to the client on the async
>  response?
> To make things clearer, the response for an async update will be something
> like this
> { "trackId" : "" }
> Then the user will call another endpoint for tracking the response like
> GET status_updates?trackId=, and the response will tell
> whether the update is in_queue, processing, succeeded or failed. Currently we
> are also adding to tlog first then calling writer.addDoc later.
> Later we can convert current sync operations by waiting until the update
> gets processed before returning to users.
>
> >How would one keep the tlog from growing forever if the actual indexing
> took a long time?
> I think it won't be very different from what we are having now, since on
> commit (producer threads do the commit) we rotate to a new tlog.
>
> > I'd like to add another wrinkle to this, which is to store the
> information about each batch as a record in the index. Each batch record
> would contain a fingerprint for the batch. This solves lots of problems,
> and allows us to confirm the integrity of the batch. It also means that we
> can compare indexes by comparing the batch fingerprints rather than
> building a fingerprint from the entire index.
> Thank you, it adds another pro to this model :P
>
> On Fri, Oct 9, 2020 at 2:10 AM Joel Bernstein  wrote:
>
>> I think this model has a lot of potential.
>>
>> I'd like to add another wrinkle to this, which is to store the
>> information about each batch as a record in the index. Each batch record
>> would contain a fingerprint for the batch. This solves lots of problems,
>> and allows us to confirm the integrity of the batch. It also means that we
>> can compare indexes by comparing the batch fingerprints rather than
>> building a fingerprint from the entire index.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>>
>> On Thu, Oct 8, 2020 at 11:31 AM Erick Erickson 
>> wrote:
>>
>>> I suppose failures would be returned to the client on the async
>>> response?
>>>
>>> How would one keep the tlog from growing forever if the actual indexing
>>> took a long time?
>>>
>>> I'm guessing that this would be optional..
>>>
>>> On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya <
>>> ichattopadhy...@gmail.com> wrote:
>>>
 Can there be a situation where the index writer fails after the
 document was added to tlog and a success is sent to the user? I think we
 want to avoid such a situation, don't we?

 On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:

> > Can you explain a little more on how this would impact durability of
> updates?
> Since we persist updates into tlog, I do not think this will be an
> issue
>
> > What does a failure look like, and how does that information get
> propagated back to the client app?
> I have not been able to do much research, but I think this is going to be
> the same as the current way of our asyncId. In this case asyncId will be
> the version of an update (in case of a distributed queue it will be the
> offset); failed updates will be put into a time-to-live map so users can
> query the failure, and for successes we can skip that by leveraging the
> max succeeded version so far.
>
> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:
>
>> Interesting idea! Can ...

Re: Index documents in async way

2020-10-08 Thread Đạt Cao Mạnh
> Can there be a situation where the index writer fails after the document
was added to tlog and a success is sent to the user? I think we want to
avoid such a situation, don't we?
> I suppose failures would be returned to the client on the async response?
To make things clearer, the response for an async update will be something
like this
{ "trackId" : "" }
Then the user will call another endpoint for tracking the response like GET
status_updates?trackId=, and the response will tell
whether the update is in_queue, processing, succeeded or failed. Currently we
are also adding to tlog first then calling writer.addDoc later.
Later we can convert current sync operations by waiting until the update
gets processed before returning to users.
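
To illustrate the flow, a client-side sketch against the hypothetical endpoints named above (the paths, the async flag, and the JSON shape come from this email and are illustrative, not an existing Solr API):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Sketch: submit an update, get a trackId back, then poll for its status.
    public class AsyncUpdateClient {
      private final HttpClient http = HttpClient.newHttpClient();
      private final String baseUrl; // e.g. "http://localhost:8983/solr/mycoll"

      public AsyncUpdateClient(String baseUrl) { this.baseUrl = baseUrl; }

      /** Submit an update; the response body would be {"trackId": "..."}. */
      public String submit(String docsJson) throws Exception {
        HttpRequest req = HttpRequest
            .newBuilder(URI.create(baseUrl + "/update?async=true")) // hypothetical flag
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(docsJson))
            .build();
        return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
      }

      /** Poll until the update leaves in_queue/processing. */
      public String waitForCompletion(String trackId) throws Exception {
        while (true) {
          HttpRequest req = HttpRequest
              .newBuilder(URI.create(baseUrl + "/status_updates?trackId=" + trackId))
              .GET().build();
          String status = http.send(req, HttpResponse.BodyHandlers.ofString()).body();
          if (status.contains("succeeded") || status.contains("failed")) {
            return status;
          }
          Thread.sleep(200); // back off between polls
        }
      }
    }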

>How would one keep the tlog from growing forever if the actual indexing
took a long time?
I think it won't be very different from what we are having now, since on
commit (producer threads do the commit) we rotate to a new tlog.

> I'd like to add another wrinkle to this, which is to store the
information about each batch as a record in the index. Each batch record
would contain a fingerprint for the batch. This solves lots of problems,
and allows us to confirm the integrity of the batch. It also means that we
can compare indexes by comparing the batch fingerprints rather than
building a fingerprint from the entire index.
Thank you, it adds another pro to this model :P

On Fri, Oct 9, 2020 at 2:10 AM Joel Bernstein  wrote:

> I think this model has a lot of potential.
>
> I'd like to add another wrinkle to this, which is to store the information
> about each batch as a record in the index. Each batch record would contain
> a fingerprint for the batch. This solves lots of problems, and allows us to
> confirm the integrity of the batch. It also means that we can compare
> indexes by comparing the batch fingerprints rather than building a
> fingerprint from the entire index.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Oct 8, 2020 at 11:31 AM Erick Erickson 
> wrote:
>
>> I suppose failures would be returned to the client on the async response?
>>
>> How would one keep the tlog from growing forever if the actual indexing
>> took a long time?
>>
>> I'm guessing that this would be optional..
>>
>> On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>>
>>> Can there be a situation where the index writer fails after the document
>>> was added to tlog and a success is sent to the user? I think we want to
>>> avoid such a situation, don't we?
>>>
>>> On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:
>>>
 > Can you explain a little more on how this would impact durability of
 updates?
 Since we persist updates into tlog, I do not think this will be an issue

 > What does a failure look like, and how does that information get
 propagated back to the client app?
 I have not been able to do much research, but I think this is going to be
 the same as the current way of our asyncId. In this case asyncId will be
 the version of an update (in case of a distributed queue it will be the
 offset); failed updates will be put into a time-to-live map so users can
 query the failure, and for successes we can skip that by leveraging the
 max succeeded version so far.

 On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:

> Interesting idea! Can you explain a little more on how this would
> impact durability of updates? What does a failure look like, and how does
> that information get propagated back to the client app?
>
> Mike
>
> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:
>
>> Hi guys,
>>
>> First of all, it seems that I have used the term async a lot recently :D.
>> Recently I have been thinking a lot about changing the current indexing
>> model of Solr from the sync way it is now (a user submits an update request
>> and waits for the response). What about changing it to an async model, where
>> nodes will only persist the update into the tlog then return immediately,
>> much like what the tlog is doing now. Then we have a dedicated executor
>> which reads from the tlog to do indexing (a producer-consumer model with the
>> tlog acting like the queue).
>>
>> I do see several big benefits of this approach
>>
>>- We can batch updates in a single call; right now we do not
>>use the writer.add(documents) api from lucene, and by batching updates
>>this is going to boost the performance of indexing
>>- One common problem with Solr now is that we have lots of threads
>>doing indexing, which can end up with many small segments. Using this
>>model we can have bigger segments and so less merge cost
>>- Another huge reason here is that after switching to this model, we
>>can remove the tlog and use a distributed queue like Kafka or Pulsar.
>>Since the purpose of ...

Re: Index documents in async way

2020-10-08 Thread Joel Bernstein
I think this model has a lot of potential.

I'd like to add another wrinkle to this, which is to store the information
about each batch as a record in the index. Each batch record would contain
a fingerprint for the batch. This solves lots of problems, and allows us to
confirm the integrity of the batch. It also means that we can compare
indexes by comparing the batch fingerprints rather than building a
fingerprint from the entire index.
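
A rough sketch of what such a batch record could look like; the field names and the fingerprint scheme (SHA-256 over the ids, in batch order) are invented for illustration:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Base64;
    import java.util.List;
    import org.apache.solr.common.SolrInputDocument;

    // Sketch: alongside the batch's documents, index one extra record carrying
    // a fingerprint of the batch, so batches can be verified and compared.
    public class BatchRecord {
      public static SolrInputDocument forBatch(String batchId, List<String> docIds)
          throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        for (String id : docIds) {
          sha.update(id.getBytes(StandardCharsets.UTF_8)); // order matters
        }
        SolrInputDocument rec = new SolrInputDocument();
        rec.addField("id", "batch-" + batchId);
        rec.addField("type_s", "batch_record");
        rec.addField("doc_count_i", docIds.size());
        rec.addField("fingerprint_s", Base64.getEncoder().encodeToString(sha.digest()));
        return rec;
      }
    }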


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Oct 8, 2020 at 11:31 AM Erick Erickson 
wrote:

> I suppose failures would be returned to the client on the async response?
>
> How would one keep the tlog from growing forever if the actual indexing
> took a long time?
>
> I'm guessing that this would be optional..
>
> On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya 
> wrote:
>
>> Can there be a situation where the index writer fails after the document
>> was added to tlog and a success is sent to the user? I think we want to
>> avoid such a situation, don't we?
>>
>> On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:
>>
>>> > Can you explain a little more on how this would impact durability of
>>> updates?
>>> Since we persist updates into tlog, I do not think this will be an issue
>>>
>>> > What does a failure look like, and how does that information get
>>> propagated back to the client app?
>>> I have not been able to do much research, but I think this is going to be
>>> the same as the current way of our asyncId. In this case asyncId will be
>>> the version of an update (in case of a distributed queue it will be the
>>> offset); failed updates will be put into a time-to-live map so users can
>>> query the failure, and for successes we can skip that by leveraging the
>>> max succeeded version so far.
>>>
>>> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:
>>>
 Interesting idea! Can you explain a little more on how this would
 impact durability of updates? What does a failure look like, and how does
 that information get propagated back to the client app?

 Mike

 On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:

> Hi guys,
>
> First of all, it seems that I have used the term async a lot recently :D.
> Recently I have been thinking a lot about changing the current indexing
> model of Solr from the sync way it is now (a user submits an update request
> and waits for the response). What about changing it to an async model, where
> nodes will only persist the update into the tlog then return immediately,
> much like what the tlog is doing now. Then we have a dedicated executor
> which reads from the tlog to do indexing (a producer-consumer model with the
> tlog acting like the queue).
>
> I do see several big benefits of this approach
>
>- We can batch updates in a single call; right now we do not
>use the writer.add(documents) api from lucene, and by batching updates
>this is going to boost the performance of indexing
>- One common problem with Solr now is that we have lots of threads
>doing indexing, which can end up with many small segments. Using this
>model we can have bigger segments and so less merge cost
>- Another huge reason here is that after switching to this model, we
>can remove the tlog and use a distributed queue like Kafka or Pulsar.
>Since the purpose of the leader in SolrCloud now is ordering updates,
>and the distributed queue already orders updates for us, there is no
>need to have a dedicated leader. That is just the beginning of things
>that we can do after using a distributed queue.
>
> What do you guys think about this? Just want to hear from you guys
> before going deep into this rabbit hole.
>
> Thanks!
>
>


Re: Index documents in async way

2020-10-08 Thread Erick Erickson
I suppose failures would be returned to the client on the async response?

How would one keep the tlog from growing forever if the actual indexing
took a long time?

I'm guessing that this would be optional..

On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya 
wrote:

> Can there be a situation where the index writer fails after the document
> was added to tlog and a success is sent to the user? I think we want to
> avoid such a situation, don't we?
>
> On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:
>
>> > Can you explain a little more on how this would impact durability of
>> updates?
>> Since we persist updates into tlog, I do not think this will be an issue
>>
>> > What does a failure look like, and how does that information get
>> propagated back to the client app?
>> I have not been able to do much research, but I think this is going to be
>> the same as the current way of our asyncId. In this case asyncId will be
>> the version of an update (in case of a distributed queue it will be the
>> offset); failed updates will be put into a time-to-live map so users can
>> query the failure, and for successes we can skip that by leveraging the
>> max succeeded version so far.
>>
>> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:
>>
>>> Interesting idea! Can you explain a little more on how this would impact
>>> durability of updates? What does a failure look like, and how does that
>>> information get propagated back to the client app?
>>>
>>> Mike
>>>
>>> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:
>>>
 Hi guys,

 First of all, it seems that I have used the term async a lot recently :D.
 Recently I have been thinking a lot about changing the current indexing
 model of Solr from the sync way it is now (a user submits an update request
 and waits for the response). What about changing it to an async model, where
 nodes will only persist the update into the tlog then return immediately,
 much like what the tlog is doing now. Then we have a dedicated executor
 which reads from the tlog to do indexing (a producer-consumer model with the
 tlog acting like the queue).

 I do see several big benefits of this approach

- We can batch updates in a single call; right now we do not use the
writer.add(documents) api from lucene, and by batching updates this is
going to boost the performance of indexing
- One common problem with Solr now is that we have lots of threads doing
indexing, which can end up with many small segments. Using this model we
can have bigger segments and so less merge cost
- Another huge reason here is that after switching to this model, we can
remove the tlog and use a distributed queue like Kafka or Pulsar. Since the
purpose of the leader in SolrCloud now is ordering updates, and the
distributed queue already orders updates for us, there is no need to have a
dedicated leader. That is just the beginning of things that we can do after
using a distributed queue.

 What do you guys think about this? Just want to hear from you guys
 before going deep into this rabbit hole.

 Thanks!




Re: Index documents in async way

2020-10-08 Thread Ishan Chattopadhyaya
Can there be a situation where the index writer fails after the document
was added to tlog and a success is sent to the user? I think we want to
avoid such a situation, don't we?

On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:

> > Can you explain a little more on how this would impact durability of
> updates?
> Since we persist updates into tlog, I do not think this will be an issue
>
> > What does a failure look like, and how does that information get
> propagated back to the client app?
> I have not been able to do much research, but I think this is going to be
> the same as the current way of our asyncId. In this case asyncId will be
> the version of an update (in case of a distributed queue it will be the
> offset); failed updates will be put into a time-to-live map so users can
> query the failure, and for successes we can skip that by leveraging the
> max succeeded version so far.
>
> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:
>
>> Interesting idea! Can you explain a little more on how this would impact
>> durability of updates? What does a failure look like, and how does that
>> information get propagated back to the client app?
>>
>> Mike
>>
>> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:
>>
>>> Hi guys,
>>>
>>> First of all, it seems that I have used the term async a lot recently :D.
>>> Recently I have been thinking a lot about changing the current indexing
>>> model of Solr from the sync way it is now (a user submits an update request
>>> and waits for the response). What about changing it to an async model, where
>>> nodes will only persist the update into the tlog then return immediately,
>>> much like what the tlog is doing now. Then we have a dedicated executor
>>> which reads from the tlog to do indexing (a producer-consumer model with the
>>> tlog acting like the queue).
>>>
>>> I do see several big benefits of this approach
>>>
>>>- We can batch updates in a single call; right now we do not use the
>>>writer.add(documents) api from lucene, and by batching updates this is
>>>going to boost the performance of indexing
>>>- One common problem with Solr now is that we have lots of threads doing
>>>indexing, which can end up with many small segments. Using this model we
>>>can have bigger segments and so less merge cost
>>>- Another huge reason here is that after switching to this model, we can
>>>remove the tlog and use a distributed queue like Kafka or Pulsar. Since the
>>>purpose of the leader in SolrCloud now is ordering updates, and the
>>>distributed queue already orders updates for us, there is no need to have a
>>>dedicated leader. That is just the beginning of things that we can do after
>>>using a distributed queue.
>>>
>>> What do you guys think about this? Just want to hear from you guys
>>> before going deep into this rabbit hole.
>>>
>>> Thanks!
>>>
>>>


Re: Index documents in async way

2020-10-08 Thread Cao Mạnh Đạt
> Can you explain a little more on how this would impact durability of
updates?
Since we persist updates into tlog, I do not think this will be an issue

> What does a failure look like, and how does that information get
propagated back to the client app?
I have not been able to do much research, but I think this is going to be
the same as the current way of our asyncId. In this case asyncId will be the
version of an update (in case of a distributed queue it will be the offset);
failed updates will be put into a time-to-live map so users can query the
failure, and for successes we can skip that by leveraging the max succeeded
version so far.
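
A sketch of that time-to-live map plus max-succeeded-version bookkeeping (illustrative only, not existing Solr code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: failures are kept for a bounded time, keyed by update version;
    // successes are answered from a max-succeeded-version watermark.
    public class UpdateStatusTracker {
      private record Failure(String message, long expiresAtMillis) {}

      private final Map<Long, Failure> failuresByVersion = new ConcurrentHashMap<>();
      private final AtomicLong maxSucceededVersion = new AtomicLong(-1);
      private final long ttlMillis;

      public UpdateStatusTracker(long ttlMillis) {
        this.ttlMillis = ttlMillis;
        ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
        cleaner.scheduleAtFixedRate(() -> {
          long now = System.currentTimeMillis();
          failuresByVersion.values().removeIf(f -> f.expiresAtMillis() < now);
        }, ttlMillis, ttlMillis, TimeUnit.MILLISECONDS);
      }

      public void recordFailure(long version, String message) {
        failuresByVersion.put(version,
            new Failure(message, System.currentTimeMillis() + ttlMillis));
      }

      public void recordSuccess(long version) {
        maxSucceededVersion.accumulateAndGet(version, Math::max);
      }

      /** "succeeded", "failed: ...", or "unknown" (still queued, or expired). */
      public String status(long version) {
        Failure f = failuresByVersion.get(version);
        if (f != null) return "failed: " + f.message();
        if (version <= maxSucceededVersion.get()) return "succeeded";
        return "unknown";
      }
    }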

On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:

> Interesting idea! Can you explain a little more on how this would impact
> durability of updates? What does a failure look like, and how does that
> information get propagated back to the client app?
>
> Mike
>
> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:
>
>> Hi guys,
>>
>> First of all, it seems that I have used the term async a lot recently :D.
>> Recently I have been thinking a lot about changing the current indexing
>> model of Solr from the sync way it is now (a user submits an update request
>> and waits for the response). What about changing it to an async model, where
>> nodes will only persist the update into the tlog then return immediately,
>> much like what the tlog is doing now. Then we have a dedicated executor
>> which reads from the tlog to do indexing (a producer-consumer model with the
>> tlog acting like the queue).
>>
>> I do see several big benefits of this approach
>>
>>- We can batch updates in a single call; right now we do not use the
>>writer.add(documents) api from lucene, and by batching updates this is
>>going to boost the performance of indexing
>>- One common problem with Solr now is that we have lots of threads doing
>>indexing, which can end up with many small segments. Using this model we
>>can have bigger segments and so less merge cost
>>- Another huge reason here is that after switching to this model, we can
>>remove the tlog and use a distributed queue like Kafka or Pulsar. Since the
>>purpose of the leader in SolrCloud now is ordering updates, and the
>>distributed queue already orders updates for us, there is no need to have a
>>dedicated leader. That is just the beginning of things that we can do after
>>using a distributed queue.
>>
>> What do you guys think about this? Just want to hear from you guys
>> before going deep into this rabbit hole.
>>
>> Thanks!
>>
>>


Re: Index documents in async way

2020-10-08 Thread Mike Drob
Interesting idea! Can you explain a little more on how this would impact
durability of updates? What does a failure look like, and how does that
information get propagated back to the client app?

Mike

On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:

> Hi guys,
>
> First of all, it seems that I have used the term async a lot recently :D.
> Recently I have been thinking a lot about changing the current indexing
> model of Solr from the sync way it is now (a user submits an update request
> and waits for the response). What about changing it to an async model, where
> nodes will only persist the update into the tlog then return immediately,
> much like what the tlog is doing now. Then we have a dedicated executor
> which reads from the tlog to do indexing (a producer-consumer model with the
> tlog acting like the queue).
>
> I do see several big benefits of this approach
>
>- We can batch updates in a single call; right now we do not use the
>writer.add(documents) api from lucene, and by batching updates this is
>going to boost the performance of indexing
>- One common problem with Solr now is that we have lots of threads doing
>indexing, which can end up with many small segments. Using this model we
>can have bigger segments and so less merge cost
>- Another huge reason here is that after switching to this model, we can
>remove the tlog and use a distributed queue like Kafka or Pulsar. Since the
>purpose of the leader in SolrCloud now is ordering updates, and the
>distributed queue already orders updates for us, there is no need to have a
>dedicated leader. That is just the beginning of things that we can do after
>using a distributed queue.
>
> What do you guys think about this? Just want to hear from you guys
> before going deep into this rabbit hole.
>
> Thanks!
>
>


Index documents in async way

2020-10-08 Thread Cao Mạnh Đạt
Hi guys,

First of all, it seems that I have used the term async a lot recently :D.
Recently I have been thinking a lot about changing the current indexing
model of Solr from the sync way it is now (a user submits an update request
and waits for the response). What about changing it to an async model, where
nodes will only persist the update into the tlog then return immediately,
much like what the tlog is doing now. Then we have a dedicated executor
which reads from the tlog to do indexing (a producer-consumer model with the
tlog acting like the queue).

I do see several big benefits of this approach

   - We can batch updates in a single call; right now we do not use the
   writer.add(documents) api from lucene, and by batching updates this is
   going to boost the performance of indexing
   - One common problem with Solr now is that we have lots of threads doing
   indexing, which can end up with many small segments. Using this model we
   can have bigger segments and so less merge cost
   - Another huge reason here is that after switching to this model, we can
   remove the tlog and use a distributed queue like Kafka or Pulsar. Since the
   purpose of the leader in SolrCloud now is ordering updates, and the
   distributed queue already orders updates for us, there is no need to have a
   dedicated leader. That is just the beginning of things that we can do after
   using a distributed queue.

What do you guys think about this? Just want to hear from you guys before
going deep into this rabbit hole.

Thanks!