Re: Index documents in async way

2020-10-08 Thread Tomás Fernández Löbbe
Interesting idea Đạt. The first questions/comments that come to my mind
would be:
* Atomic updates, can those be supported? I guess yes if we can guarantee
that messages are read once and only once.
* I'm guessing we'd need to read messages in an ordered way, so it'd be a
single Kafka partition per Solr shard, right? (I don't know Pulsar; a rough
shard-keyed producer sketch follows this list.)
* It may be difficult to determine what replicas should do after a document
update failure. Do they continue processing (which means if it was a
transient error they'll become inconsistent) or do they stop? Maybe try to
recover from other active replicas? But if none of the replicas could
process the document, would they all go to recovery?
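
For the ordering point in the list above, here is a minimal sketch of what
shard-keyed partitioning could look like with a plain Kafka producer. The
topic name, the shard-id key, and the one-partition-per-shard layout are
assumptions for illustration only, not anything Solr does today:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ShardKeyedUpdateProducer {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // Keying every update by its target shard sends all updates for that
          // shard to the same partition, so a consumer reading that partition
          // sees them in the order they were produced.
          String shardId = "collection1_shard1";
          String updateJson = "{\"id\":\"doc1\",\"title_s\":\"hello\"}";
          producer.send(new ProducerRecord<>("solr-updates", shardId, updateJson));
        }
      }
    }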

> Then the user will call another endpoint for tracking the response like
GET status_updates?trackId=,
Maybe we could have a way to stream those responses out (i.e., via another
queue)? Maybe with an option to only stream out errors or something.

> Currently we are also adding to tlog first then call writer.addDoc later
I don't think that's correct? See DUH2.doNormalUpdate.

> I think it won't be very different from what we are having now, since on
commit (producer threads do the commit) we rotate to a new tlog.
How would this work in your mind with one of the distributed queues?

I think this is a great idea, something that needs to be thought through
deeply, but could bring big improvements. Thanks for bringing this up, Đạt.

On Thu, Oct 8, 2020 at 7:39 PM Đạt Cao Mạnh  wrote:

> > Can there be a situation where the index writer fails after the document
> was added to tlog and a success is sent to the user? I think we want to
> avoid such a situation, isn't it?
> > I suppose failures would be returned to the client on the async
> > response?
> To make things more clear, the response for async update will be something
> like this
> { "trackId" : "" }
> Then the user will call another endpoint for tracking the response like
> GET status_updates?trackId=, and the response will tell
> whether the update is in_queue, processing, succeeded, or failed. Currently we
> are also adding to the tlog first and then calling writer.addDoc later.
> Later we can convert the current sync operations by waiting until the update
> gets processed before returning to users.
>
> >How would one keep the tlog from growing forever if the actual indexing
> took a long time?
> I think it won't be very different from what we are having now, since on
> commit (producer threads do the commit) we rotate to a new tlog.
>
> > I'd like to add another wrinkle to this. Which is to store the
> information about each batch as a record in the index. Each batch record
> would contain a fingerprint for the batch. This solves lots of problems,
> and allows us to confirm the integrity of the batch. It also means that we
> can compare indexes by comparing the batch fingerprints rather than
> building a fingerprint from the entire index.
> Thank you, it adds another pro to this model :P
>
> On Fri, Oct 9, 2020 at 2:10 AM Joel Bernstein  wrote:
>
>> I think this model has a lot of potential.
>>
>> I'd like to add another wrinkle to this. Which is to store the
>> information about each batch as a record in the index. Each batch record
>> would contain a fingerprint for the batch. This solves lots of problems,
>> and allows us to confirm the integrity of the batch. It also means that we
>> can compare indexes by comparing the batch fingerprints rather than
>> building a fingerprint from the entire index.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>>
>> On Thu, Oct 8, 2020 at 11:31 AM Erick Erickson 
>> wrote:
>>
>>> I suppose failures would be returned to the client one the async
>>> response?
>>>
>>> How would one keep the tlog from growing forever if the actual indexing
>>> took a long time?
>>>
>>> I'm guessing that this would be optional..
>>>
>>> On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya <
>>> ichattopadhy...@gmail.com> wrote:
>>>
 Can there be a situation where the index writer fails after the
 document was added to tlog and a success is sent to the user? I think we
 want to avoid such a situation, isn't it?

 On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:

> > Can you explain a little more on how this would impact durability of
> updates?
> Since we persist updates into tlog, I do not think this will be an
> issue
>
> > What does a failure look like, and how does that information get
> propagated back to the client app?
> I did not be able to do much research but I think this is gonna be the
> same as the current way of our asyncId. In this case asyncId will be the
> version of an update (in case of distributed queue it will be offset)
> failures update will be put into a time-to-live map so users can query the
> failure, for success we can skip that by leverage the max succeeded 
> version
> so far.
>
> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:
>
>> Interesting idea! Can 

Re: Index documents in async way

2020-10-08 Thread Đạt Cao Mạnh
> Can there be a situation where the index writer fails after the document
was added to tlog and a success is sent to the user? I think we want to
avoid such a situation, isn't it?
> I suppose failures would be returned to the client on the async response?
To make things more clear, the response for an async update will be something
like this:
{ "trackId" : "" }
Then the user will call another endpoint for tracking the response, like GET
status_updates?trackId=, and the response will tell
whether the update is in_queue, processing, succeeded, or failed. Currently we
are also adding to the tlog first and then calling writer.addDoc later.
Later we can convert the current sync operations by waiting until the update
gets processed before returning to users.
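
A rough client-side sketch of that flow could look like the following; the
async flag, the status_updates path, and the example trackId are placeholders
for this proposal, not existing Solr APIs:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class AsyncUpdateClientSketch {
      public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // 1) Submit the update; in the proposed async mode the node answers
        //    immediately with a trackId instead of waiting for the IndexWriter.
        HttpRequest update = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8983/solr/col1/update?async=true"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString("[{\"id\":\"doc1\"}]"))
            .build();
        HttpResponse<String> submitted = http.send(update, HttpResponse.BodyHandlers.ofString());
        System.out.println(submitted.body()); // e.g. {"trackId":"12345"} per the proposal

        // 2) Poll the proposed status endpoint with the returned trackId.
        String trackId = "12345"; // a real client would parse this out of the JSON above
        HttpRequest status = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8983/solr/col1/status_updates?trackId=" + trackId))
            .GET()
            .build();
        System.out.println(http.send(status, HttpResponse.BodyHandlers.ofString()).body());
        // expected states per the proposal: in_queue, processing, succeeded, failed
      }
    }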

>How would one keep the tlog from growing forever if the actual indexing
took a long time?
I think it won't be very different from what we have now, since on
commit (the producer threads do the commit) we rotate to a new tlog.

> I'd like to add another wrinkle to this. Which is to store the
information about each batch as a record in the index. Each batch record
would contain a fingerprint for the batch. This solves lots of problems,
and allows us to confirm the integrity of the batch. It also means that we
can compare indexes by comparing the batch fingerprints rather than
building a fingerprint from the entire index.
Thank you, it adds another pro to this model :P

On Fri, Oct 9, 2020 at 2:10 AM Joel Bernstein  wrote:

> I think this model has a lot of potential.
>
> I'd like to add another wrinkle to this. Which is to store the information
> about each batch as a record in the index. Each batch record would contain
> a fingerprint for the batch. This solves lots of problems, and allows us to
> confirm the integrity of the batch. It also means that we can compare
> indexes by comparing the batch fingerprints rather than building a
> fingerprint from the entire index.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Thu, Oct 8, 2020 at 11:31 AM Erick Erickson 
> wrote:
>
>> I suppose failures would be returned to the client one the async response?
>>
>> How would one keep the tlog from growing forever if the actual indexing
>> took a long time?
>>
>> I'm guessing that this would be optional..
>>
>> On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>>
>>> Can there be a situation where the index writer fails after the document
>>> was added to tlog and a success is sent to the user? I think we want to
>>> avoid such a situation, isn't it?
>>>
>>> On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:
>>>
 > Can you explain a little more on how this would impact durability of
 updates?
 Since we persist updates into tlog, I do not think this will be an issue

 > What does a failure look like, and how does that information get
 propagated back to the client app?
 I did not be able to do much research but I think this is gonna be the
 same as the current way of our asyncId. In this case asyncId will be the
 version of an update (in case of distributed queue it will be offset)
 failures update will be put into a time-to-live map so users can query the
 failure, for success we can skip that by leverage the max succeeded version
 so far.

 On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:

> Interesting idea! Can you explain a little more on how this would
> impact durability of updates? What does a failure look like, and how does
> that information get propagated back to the client app?
>
> Mike
>
> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:
>
>> Hi guys,
>>
>> First of all it seems that I used the term async a lot recently :D.
>> Recently I have been thinking a lot about changing the current
>> indexing model of Solr from sync way like currently (user submit an 
>> update
>> request waiting for response). What about changing it to async model, 
>> where
>> nodes will only persist the update into tlog then return immediately much
>> like what tlog is doing now. Then we have a dedicated executor which 
>> reads
>> from tlog to do indexing (producer consumer model with tlog acting like 
>> the
>> queue).
>>
>> I do see several big benefits of this approach
>>
>>- We can batching updates in a single call, right now we do not
>>use writer.add(documents) api from lucene, by batching updates this 
>> gonna
>>boost the performance of indexing
>>- One common problems with Solr now is we have lot of threads
>>doing indexing so that can ends up with many small segments. Using 
>> this
>>model we can have bigger segments so less merge cost
>>- Another huge reason here is after switching to this model, we
>>can remove tlog and use a distributed queue like Kafka, Pulsar. Since 
>> the
>>purpose of 

Re: 8.6.3 Release

2020-10-08 Thread David Smiley
The way GitHub works for contributors is that you are expected to fork a
repo and then push to your fork.  At that point when you go to the PR area,
you'll see a convenient yellow dialog to create a PR based on your pushed
branch.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Oct 8, 2020 at 10:20 AM Chris Hostetter 
wrote:

>
> FWIW: I followed the docs to update the Dockerfiles + TAGS for 8.6.3, and
> run tests; but since it's in a distinct github repo I don't think i can
> push to it?
>
> so i creaed a GH issue w/patch...
>
> https://github.com/docker-solr/docker-solr/issues/349
>
>
>
> : Date: Tue, 6 Oct 2020 11:33:15 -0400
> : From: Houston Putman 
> : Reply-To: dev@lucene.apache.org
> : To: Solr/Lucene Dev 
> : Subject: Re: 8.6.3 Release
> :
> : That is correct. 8.x docker builds have not been affected in any way.
> :
> : On Tue, Oct 6, 2020 at 11:30 AM Cassandra Targett  >
> : wrote:
> :
> : > I wanted to ask now that the 8.6.3 vote is underway - for the
> docker-solr
> : > image, are the update instructions in the docker-solr repo still the
> same
> : > for 8.x even though the build process has been moved to the main
> project
> : > for 9.0? Meaning, to release the 8.6.3 image there’s no change from
> before,
> : > right?
> : >
> : > I’m asking specifically about these instructions:
> : >
> : > https://github.com/docker-solr/docker-solr/blob/master/update.md
> : > On Oct 1, 2020, 9:28 AM -0500, Jason Gerlowski  >,
> : > wrote:
> : >
> : > I've put together draft Release Notes for 8.6.3 here. [1] [2]. Can
> : > someone please sanity check the summaries there when they get a
> : > chance? Would appreciate the review.
> : >
> : > 8.6.3 is a bit interesting in that Lucene has no changes in this
> : > bugfix release. As a result I had to omit the standard phrase in the
> : > Solr release notes about there being additional changes at the Lucene
> : > level, and change some of the wording in the Lucene announcement to
> : > indicate the lack of changes. So that's something to pay particular
> : > attention to, if someone can check my wording there.
> : >
> : > [1]
> https://cwiki.apache.org/confluence/display/SOLR/DRAFT-ReleaseNote863
> : > [2]
> : >
> https://cwiki.apache.org/confluence/display/LUCENE/DRAFT-ReleaseNote863
> : >
> : > On Wed, Sep 30, 2020 at 10:57 AM Jason Gerlowski <
> gerlowsk...@gmail.com>
> : > wrote:
> : >
> : >
> : > The only one that was previously mentioned as a blocker was
> : > SOLR-14835, but from the comments on the ticket it looks like it ended
> : > up being purely a cosmetic issue. Andrzej left a comment there
> : > suggesting that we "address" this with documentation for 8.6.3 but
> : > otherwise leave it as-is.
> : >
> : > So it looks like we're unblocked on starting the release process.
> : > Will begin the preliminary steps this afternoon.
> : >
> : > On Tue, Sep 29, 2020 at 3:40 PM Cassandra Targett <
> casstarg...@gmail.com>
> : > wrote:
> : >
> : >
> : > It looks to me like everything for 8.6.3 is resolved now (
> : > https://issues.apache.org/jira/projects/SOLR/versions/12348713), and
> it
> : > seems from comments in SOLR-14897 and SOLR-14898 that those fixes make
> a
> : > Jetty upgrade less compelling to try.
> : >
> : > Are there any other issues not currently marked for 8.6.3 we’re waiting
> : > for before starting the RC?
> : > On Sep 29, 2020, 12:04 PM -0500, Jason Gerlowski <
> gerlowsk...@gmail.com>,
> : > wrote:
> : >
> : > That said, if someone can use 8.6.3, what’s stopping them from going to
> : > 8.7 when it’e released?
> : >
> : >
> : > The same things that always stop users from going directly to the
> : > latest-and-greatest: fear of instability from new minor-release
> : > features, reliance on behavior changed across minor versions, breaking
> : > changes on Lucene elements that don't guarantee backcompat (e.g.
> : > SOLR-14254), security issues in later versions (new libraries pulled
> : > in with vulns), etc. There's lots of reasons a given user might want
> : > to stick on 8.6.x rather than 8.7 (in the short/medium term).
> : >
> : > I'm ambivalent to whether we upgrade Jetty in 8.6.3 - as I said above
> : > the worst of the Jetty issue should be mitigated by work on our end -
> : > but I think there's a lot of reasons users might not upgrade as far as
> : > we'd expect/like.
> : >
> : >
> : > On Mon, Sep 28, 2020 at 2:05 PM Erick Erickson <
> erickerick...@gmail.com>
> : > wrote:
> : >
> : >
> : > For me, there’s a sharp distinction between changing a dependency in a
> : > point release just because there’s a new version, and changing the
> : > dependency because there’s a bug in it. That said, if someone can use
> : > 8.6.3, what’s stopping them from going to 8.7 when it’e released?
> Would it
> : > make more sense to do the upgrades for 8.7 and get that out the door
> rather
> : > than backport?
> : >
> : > FWIW,
> : > Erick
> : >
> : > On Sep 28, 2020, at 1:45 PM, Jason 

Re: [ANNOUNCE] Apache Solr 8.6.3 released

2020-10-08 Thread Uwe Schindler
Thanks Jason!

Am October 8, 2020 7:14:33 PM UTC schrieb Jason Gerlowski 
:
>The Lucene PMC is pleased to announce the release of Apache Solr 8.6.3.
>
>Solr is the popular, blazing fast, open source NoSQL search platform
>from the Apache Lucene project. Its major features include powerful
>full-text search, hit highlighting, faceted search, dynamic
>clustering, database integration, rich document handling, and
>geospatial search. Solr is highly scalable, providing fault tolerant
>distributed search and indexing, and powers the search and navigation
>features of many of the world's largest internet sites.
>
>Solr 8.6.3 is available for immediate download at:
>  
>
>### Solr 8.6.3 Release Highlights:
>
> * SOLR-14898: Prevent duplicate header accumulation on internally
>forwarded requests
>* SOLR-14768: Fix HTTP multipart POST requests to Solr (8.6.0
>regression)
> * SOLR-14859: PrefixTree-based fields now reject invalid schema
>properties instead of quietly failing certain queries
> * SOLR-14663: CREATE ConfigSet action now copies base node content
>
>Please refer to the Upgrade Notes in the Solr Ref Guide for
>information on upgrading from previous Solr versions:
>  
>
>Please read CHANGES.txt for a full list of bugfixes:
>  
>
>Solr 8.6.3 also includes bugfixes in the corresponding Apache Lucene
>release:
>  
>
>Note: The Apache Software Foundation uses an extensive mirroring
>network for
>distributing releases. It is possible that the mirror you are using may
>not have
>replicated the release yet. If that is the case, please try another
>mirror.
>This also applies to Maven access.

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de

[ANNOUNCE] Apache Solr 8.6.3 released

2020-10-08 Thread Jason Gerlowski
The Lucene PMC is pleased to announce the release of Apache Solr 8.6.3.

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document handling, and
geospatial search. Solr is highly scalable, providing fault tolerant
distributed search and indexing, and powers the search and navigation
features of many of the world's largest internet sites.

Solr 8.6.3 is available for immediate download at:
  

### Solr 8.6.3 Release Highlights:

 * SOLR-14898: Prevent duplicate header accumulation on internally
forwarded requests
 * SOLR-14768: Fix HTTP multipart POST requests to Solr (8.6.0 regression)
 * SOLR-14859: PrefixTree-based fields now reject invalid schema
properties instead of quietly failing certain queries
 * SOLR-14663: CREATE ConfigSet action now copies base node content

Please refer to the Upgrade Notes in the Solr Ref Guide for
information on upgrading from previous Solr versions:
  

Please read CHANGES.txt for a full list of bugfixes:
  

Solr 8.6.3 also includes bugfixes in the corresponding Apache Lucene release:
  

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not have
replicated the release yet. If that is the case, please try another mirror.
This also applies to Maven access.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[ANNOUNCE] Apache Lucene 8.6.3 released

2020-10-08 Thread Jason Gerlowski
The Lucene PMC is pleased to announce the release of Apache Lucene 8.6.3.

Apache Lucene is a high-performance, full-featured text search engine
library written entirely in Java. It is a technology suitable for
nearly any application that requires full-text search, especially
cross-platform.

This release contains no additional bug fixes over the previous
version 8.6.2. The release is available for immediate download at:

  

Note: The Apache Software Foundation uses an extensive mirroring network for
distributing releases. It is possible that the mirror you are using may not have
replicated the release yet. If that is the case, please try another mirror.

This also applies to Maven access.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Index documents in async way

2020-10-08 Thread Joel Bernstein
I think this model has a lot of potential.

I'd like to add another wrinkle to this, which is to store the information
about each batch as a record in the index. Each batch record would contain
a fingerprint for the batch. This solves lots of problems, and allows us to
confirm the integrity of the batch. It also means that we can compare
indexes by comparing the batch fingerprints rather than building a
fingerprint from the entire index.
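
A rough sketch of how a per-batch record might be indexed next to the batch
itself; the field names and what feeds the fingerprint are placeholders, just
to show the shape of the idea:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Base64;
    import java.util.List;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;

    public class BatchRecordSketch {
      /** Index a batch plus one extra "batch record" document carrying its fingerprint. */
      static void indexBatch(IndexWriter writer, String batchId, List<Document> batch)
          throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        for (Document doc : batch) {
          // Here the fingerprint is just a hash over each document's id field;
          // exactly what should feed it would need to be worked out.
          digest.update(doc.get("id").getBytes(StandardCharsets.UTF_8));
        }
        Document batchRecord = new Document();
        batchRecord.add(new StringField("type", "batch_record", Field.Store.YES));
        batchRecord.add(new StringField("batch_id", batchId, Field.Store.YES));
        batchRecord.add(new StringField("fingerprint",
            Base64.getEncoder().encodeToString(digest.digest()), Field.Store.YES));

        writer.addDocuments(batch);      // the batch itself, in one Lucene call
        writer.addDocument(batchRecord); // plus the per-batch record
        writer.commit();
      }
    }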


Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Oct 8, 2020 at 11:31 AM Erick Erickson 
wrote:

> I suppose failures would be returned to the client one the async response?
>
> How would one keep the tlog from growing forever if the actual indexing
> took a long time?
>
> I'm guessing that this would be optional..
>
> On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya 
> wrote:
>
>> Can there be a situation where the index writer fails after the document
>> was added to tlog and a success is sent to the user? I think we want to
>> avoid such a situation, isn't it?
>>
>> On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:
>>
>>> > Can you explain a little more on how this would impact durability of
>>> updates?
>>> Since we persist updates into tlog, I do not think this will be an issue
>>>
>>> > What does a failure look like, and how does that information get
>>> propagated back to the client app?
>>> I did not be able to do much research but I think this is gonna be the
>>> same as the current way of our asyncId. In this case asyncId will be the
>>> version of an update (in case of distributed queue it will be offset)
>>> failures update will be put into a time-to-live map so users can query the
>>> failure, for success we can skip that by leverage the max succeeded version
>>> so far.
>>>
>>> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:
>>>
 Interesting idea! Can you explain a little more on how this would
 impact durability of updates? What does a failure look like, and how does
 that information get propagated back to the client app?

 Mike

 On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:

> Hi guys,
>
> First of all it seems that I used the term async a lot recently :D.
> Recently I have been thinking a lot about changing the current
> indexing model of Solr from sync way like currently (user submit an update
> request waiting for response). What about changing it to async model, 
> where
> nodes will only persist the update into tlog then return immediately much
> like what tlog is doing now. Then we have a dedicated executor which reads
> from tlog to do indexing (producer consumer model with tlog acting like 
> the
> queue).
>
> I do see several big benefits of this approach
>
>- We can batching updates in a single call, right now we do not
>use writer.add(documents) api from lucene, by batching updates this 
> gonna
>boost the performance of indexing
>- One common problems with Solr now is we have lot of threads
>doing indexing so that can ends up with many small segments. Using this
>model we can have bigger segments so less merge cost
>- Another huge reason here is after switching to this model, we
>can remove tlog and use a distributed queue like Kafka, Pulsar. Since 
> the
>purpose of leader in SolrCloud now is ordering updates, the distributed
>queue is already ordering updates for us, so no need to have a 
> dedicated
>leader. That is just the beginning of things that we can do after 
> using a
>distributed queue.
>
> What do your guys think about this? Just want to hear from your guys
> before going deep into this rabbit hole.
>
> Thanks!
>
>


Re: terms filter in json.facet

2020-10-08 Thread David Smiley
>
> Can I raise a JIRA and work on the code change to support it?
>

Of course.  Search for an existing JIRA first, just in case someone
reported this already.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Thu, Oct 8, 2020 at 12:35 PM gopikannan  wrote:

> Hi David,
>Thanks for replying. I know that it is not supported. Can I raise a
> JIRA and work on the code change to support it? FacetField class needs to
> be modified and probably a new FacetFieldProcessor needs to be added.
>
> Thanks
> Gopi
>
> On Wed, Oct 7, 2020 at 3:56 PM David Smiley  wrote:
>
>> Please ask on the solr-user list.  I think the answer is "no" but I'm not
>> sure.
>>
>> ~ David Smiley
>> Apache Lucene/Solr Search Developer
>> http://www.linkedin.com/in/davidwsmiley
>>
>>
>> On Mon, Oct 5, 2020 at 9:10 PM gopikannan  wrote:
>>
>>> Hi,
>>>   In normal facet request below can be used to filter the facet terms. I
>>> am not able to do the same using json.facet. Please let me know whether I
>>> can raise a JIRA for this. Checked the code and I think I can work on the
>>> changes to support this.
>>>
>>> facet.field={!terms='alfa,betta,with\,with\',with space'}symbol
>>>
>>> https://lucene.apache.org/solr/guide/6_6/faceting.html
>>>
>>> Thanks
>>> Gopi
>>>
>>


Re: terms filter in json.facet

2020-10-08 Thread gopikannan
Hi David,
   Thanks for replying. I know that it is not supported. Can I raise a JIRA
and work on the code change to support it? FacetField class needs to be
modified and probably a new FacetFieldProcessor needs to be added.

Thanks
Gopi

On Wed, Oct 7, 2020 at 3:56 PM David Smiley  wrote:

> Please ask on the solr-user list.  I think the answer is "no" but I'm not
> sure.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Oct 5, 2020 at 9:10 PM gopikannan  wrote:
>
>> Hi,
>>   In normal facet request below can be used to filter the facet terms. I
>> am not able to do the same using json.facet. Please let me know whether I
>> can raise a JIRA for this. Checked the code and I think I can work on the
>> changes to support this.
>>
>> facet.field={!terms='alfa,betta,with\,with\',with space'}symbol
>>
>> https://lucene.apache.org/solr/guide/6_6/faceting.html
>>
>> Thanks
>> Gopi
>>
>


Re: Index documents in async way

2020-10-08 Thread Erick Erickson
I suppose failures would be returned to the client on the async response?

How would one keep the tlog from growing forever if the actual indexing
took a long time?

I'm guessing that this would be optional..

On Thu, Oct 8, 2020, 11:14 Ishan Chattopadhyaya 
wrote:

> Can there be a situation where the index writer fails after the document
> was added to tlog and a success is sent to the user? I think we want to
> avoid such a situation, isn't it?
>
> On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:
>
>> > Can you explain a little more on how this would impact durability of
>> updates?
>> Since we persist updates into tlog, I do not think this will be an issue
>>
>> > What does a failure look like, and how does that information get
>> propagated back to the client app?
>> I did not be able to do much research but I think this is gonna be the
>> same as the current way of our asyncId. In this case asyncId will be the
>> version of an update (in case of distributed queue it will be offset)
>> failures update will be put into a time-to-live map so users can query the
>> failure, for success we can skip that by leverage the max succeeded version
>> so far.
>>
>> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:
>>
>>> Interesting idea! Can you explain a little more on how this would impact
>>> durability of updates? What does a failure look like, and how does that
>>> information get propagated back to the client app?
>>>
>>> Mike
>>>
>>> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:
>>>
 Hi guys,

 First of all it seems that I used the term async a lot recently :D.
 Recently I have been thinking a lot about changing the current indexing
 model of Solr from sync way like currently (user submit an update request
 waiting for response). What about changing it to async model, where nodes
 will only persist the update into tlog then return immediately much like
 what tlog is doing now. Then we have a dedicated executor which reads from
 tlog to do indexing (producer consumer model with tlog acting like the
 queue).

 I do see several big benefits of this approach

- We can batching updates in a single call, right now we do not use
writer.add(documents) api from lucene, by batching updates this gonna 
 boost
the performance of indexing
- One common problems with Solr now is we have lot of threads doing
indexing so that can ends up with many small segments. Using this model 
 we
can have bigger segments so less merge cost
- Another huge reason here is after switching to this model, we can
remove tlog and use a distributed queue like Kafka, Pulsar. Since the
purpose of leader in SolrCloud now is ordering updates, the distributed
queue is already ordering updates for us, so no need to have a dedicated
leader. That is just the beginning of things that we can do after using 
 a
distributed queue.

 What do your guys think about this? Just want to hear from your guys
 before going deep into this rabbit hole.

 Thanks!




Re: Index documents in async way

2020-10-08 Thread Ishan Chattopadhyaya
Can there be a situation where the index writer fails after the document
was added to the tlog and a success is sent to the user? I think we want to
avoid such a situation, don't we?

On Thu, 8 Oct, 2020, 8:25 pm Cao Mạnh Đạt,  wrote:

> > Can you explain a little more on how this would impact durability of
> updates?
> Since we persist updates into tlog, I do not think this will be an issue
>
> > What does a failure look like, and how does that information get
> propagated back to the client app?
> I did not be able to do much research but I think this is gonna be the
> same as the current way of our asyncId. In this case asyncId will be the
> version of an update (in case of distributed queue it will be offset)
> failures update will be put into a time-to-live map so users can query the
> failure, for success we can skip that by leverage the max succeeded version
> so far.
>
> On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:
>
>> Interesting idea! Can you explain a little more on how this would impact
>> durability of updates? What does a failure look like, and how does that
>> information get propagated back to the client app?
>>
>> Mike
>>
>> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:
>>
>>> Hi guys,
>>>
>>> First of all it seems that I used the term async a lot recently :D.
>>> Recently I have been thinking a lot about changing the current indexing
>>> model of Solr from sync way like currently (user submit an update request
>>> waiting for response). What about changing it to async model, where nodes
>>> will only persist the update into tlog then return immediately much like
>>> what tlog is doing now. Then we have a dedicated executor which reads from
>>> tlog to do indexing (producer consumer model with tlog acting like the
>>> queue).
>>>
>>> I do see several big benefits of this approach
>>>
>>>- We can batching updates in a single call, right now we do not use
>>>writer.add(documents) api from lucene, by batching updates this gonna 
>>> boost
>>>the performance of indexing
>>>- One common problems with Solr now is we have lot of threads doing
>>>indexing so that can ends up with many small segments. Using this model 
>>> we
>>>can have bigger segments so less merge cost
>>>- Another huge reason here is after switching to this model, we can
>>>remove tlog and use a distributed queue like Kafka, Pulsar. Since the
>>>purpose of leader in SolrCloud now is ordering updates, the distributed
>>>queue is already ordering updates for us, so no need to have a dedicated
>>>leader. That is just the beginning of things that we can do after using a
>>>distributed queue.
>>>
>>> What do your guys think about this? Just want to hear from your guys
>>> before going deep into this rabbit hole.
>>>
>>> Thanks!
>>>
>>>


Re: Index documents in async way

2020-10-08 Thread Cao Mạnh Đạt
> Can you explain a little more on how this would impact durability of
updates?
Since we persist updates into tlog, I do not think this will be an issue

> What does a failure look like, and how does that information get
propagated back to the client app?
I have not been able to do much research yet, but I think this is going to be
the same as the current way of our asyncId. In this case the asyncId will be
the version of an update (in the case of a distributed queue it will be the
offset). Failed updates will be put into a time-to-live map so users can query
the failure; for successes we can skip that by leveraging the max succeeded
version so far.
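
A minimal sketch of such a time-to-live map, keyed by the update's
version/offset; the class and its behavior are illustrative only, not
existing Solr code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** Time-to-live map for failed updates, keyed by update version/offset. */
    public class FailedUpdateTracker {
      private record Failure(String message, long expiresAtMillis) {}

      private final Map<Long, Failure> failures = new ConcurrentHashMap<>();
      private final long ttlMillis;

      public FailedUpdateTracker(long ttlMillis) {
        this.ttlMillis = ttlMillis;
        ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
        // Periodically evict entries older than the TTL so the map cannot grow forever.
        cleaner.scheduleAtFixedRate(() -> {
          long now = System.currentTimeMillis();
          failures.values().removeIf(f -> f.expiresAtMillis() < now);
        }, ttlMillis, ttlMillis, TimeUnit.MILLISECONDS);
      }

      public void recordFailure(long versionOrOffset, String message) {
        failures.put(versionOrOffset, new Failure(message, System.currentTimeMillis() + ttlMillis));
      }

      /** Failed (with reason), succeeded if at/below the max succeeded version, else still pending. */
      public String statusOf(long versionOrOffset, long maxSucceededVersion) {
        Failure f = failures.get(versionOrOffset);
        if (f != null) return "failed: " + f.message();
        return versionOrOffset <= maxSucceededVersion ? "succeeded" : "in_queue_or_processing";
      }
    }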

On Thu, Oct 8, 2020 at 9:31 PM Mike Drob  wrote:

> Interesting idea! Can you explain a little more on how this would impact
> durability of updates? What does a failure look like, and how does that
> information get propagated back to the client app?
>
> Mike
>
> On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:
>
>> Hi guys,
>>
>> First of all it seems that I used the term async a lot recently :D.
>> Recently I have been thinking a lot about changing the current indexing
>> model of Solr from sync way like currently (user submit an update request
>> waiting for response). What about changing it to async model, where nodes
>> will only persist the update into tlog then return immediately much like
>> what tlog is doing now. Then we have a dedicated executor which reads from
>> tlog to do indexing (producer consumer model with tlog acting like the
>> queue).
>>
>> I do see several big benefits of this approach
>>
>>- We can batching updates in a single call, right now we do not use
>>writer.add(documents) api from lucene, by batching updates this gonna 
>> boost
>>the performance of indexing
>>- One common problems with Solr now is we have lot of threads doing
>>indexing so that can ends up with many small segments. Using this model we
>>can have bigger segments so less merge cost
>>- Another huge reason here is after switching to this model, we can
>>remove tlog and use a distributed queue like Kafka, Pulsar. Since the
>>purpose of leader in SolrCloud now is ordering updates, the distributed
>>queue is already ordering updates for us, so no need to have a dedicated
>>leader. That is just the beginning of things that we can do after using a
>>distributed queue.
>>
>> What do your guys think about this? Just want to hear from your guys
>> before going deep into this rabbit hole.
>>
>> Thanks!
>>
>>


Re: Index documents in async way

2020-10-08 Thread Mike Drob
Interesting idea! Can you explain a little more on how this would impact
durability of updates? What does a failure look like, and how does that
information get propagated back to the client app?

Mike

On Thu, Oct 8, 2020 at 9:21 AM Cao Mạnh Đạt  wrote:

> Hi guys,
>
> First of all it seems that I used the term async a lot recently :D.
> Recently I have been thinking a lot about changing the current indexing
> model of Solr from sync way like currently (user submit an update request
> waiting for response). What about changing it to async model, where nodes
> will only persist the update into tlog then return immediately much like
> what tlog is doing now. Then we have a dedicated executor which reads from
> tlog to do indexing (producer consumer model with tlog acting like the
> queue).
>
> I do see several big benefits of this approach
>
>- We can batching updates in a single call, right now we do not use
>writer.add(documents) api from lucene, by batching updates this gonna boost
>the performance of indexing
>- One common problems with Solr now is we have lot of threads doing
>indexing so that can ends up with many small segments. Using this model we
>can have bigger segments so less merge cost
>- Another huge reason here is after switching to this model, we can
>remove tlog and use a distributed queue like Kafka, Pulsar. Since the
>purpose of leader in SolrCloud now is ordering updates, the distributed
>queue is already ordering updates for us, so no need to have a dedicated
>leader. That is just the beginning of things that we can do after using a
>distributed queue.
>
> What do your guys think about this? Just want to hear from your guys
> before going deep into this rabbit hole.
>
> Thanks!
>
>


Re: 8.6.3 Release

2020-10-08 Thread Chris Hostetter

FWIW: I followed the docs to update the Dockerfiles + TAGS for 8.6.3, and
ran tests; but since it's in a distinct GitHub repo I don't think I can
push to it?

so I created a GH issue w/patch...

https://github.com/docker-solr/docker-solr/issues/349



: Date: Tue, 6 Oct 2020 11:33:15 -0400
: From: Houston Putman 
: Reply-To: dev@lucene.apache.org
: To: Solr/Lucene Dev 
: Subject: Re: 8.6.3 Release
: 
: That is correct. 8.x docker builds have not been affected in any way.
: 
: On Tue, Oct 6, 2020 at 11:30 AM Cassandra Targett 
: wrote:
: 
: > I wanted to ask now that the 8.6.3 vote is underway - for the docker-solr
: > image, are the update instructions in the docker-solr repo still the same
: > for 8.x even though the build process has been moved to the main project
: > for 9.0? Meaning, to release the 8.6.3 image there’s no change from before,
: > right?
: >
: > I’m asking specifically about these instructions:
: >
: > https://github.com/docker-solr/docker-solr/blob/master/update.md
: > On Oct 1, 2020, 9:28 AM -0500, Jason Gerlowski ,
: > wrote:
: >
: > I've put together draft Release Notes for 8.6.3 here. [1] [2]. Can
: > someone please sanity check the summaries there when they get a
: > chance? Would appreciate the review.
: >
: > 8.6.3 is a bit interesting in that Lucene has no changes in this
: > bugfix release. As a result I had to omit the standard phrase in the
: > Solr release notes about there being additional changes at the Lucene
: > level, and change some of the wording in the Lucene announcement to
: > indicate the lack of changes. So that's something to pay particular
: > attention to, if someone can check my wording there.
: >
: > [1] https://cwiki.apache.org/confluence/display/SOLR/DRAFT-ReleaseNote863
: > [2]
: > https://cwiki.apache.org/confluence/display/LUCENE/DRAFT-ReleaseNote863
: >
: > On Wed, Sep 30, 2020 at 10:57 AM Jason Gerlowski 
: > wrote:
: >
: >
: > The only one that was previously mentioned as a blocker was
: > SOLR-14835, but from the comments on the ticket it looks like it ended
: > up being purely a cosmetic issue. Andrzej left a comment there
: > suggesting that we "address" this with documentation for 8.6.3 but
: > otherwise leave it as-is.
: >
: > So it looks like we're unblocked on starting the release process.
: > Will begin the preliminary steps this afternoon.
: >
: > On Tue, Sep 29, 2020 at 3:40 PM Cassandra Targett 
: > wrote:
: >
: >
: > It looks to me like everything for 8.6.3 is resolved now (
: > https://issues.apache.org/jira/projects/SOLR/versions/12348713), and it
: > seems from comments in SOLR-14897 and SOLR-14898 that those fixes make a
: > Jetty upgrade less compelling to try.
: >
: > Are there any other issues not currently marked for 8.6.3 we’re waiting
: > for before starting the RC?
: > On Sep 29, 2020, 12:04 PM -0500, Jason Gerlowski ,
: > wrote:
: >
: > That said, if someone can use 8.6.3, what’s stopping them from going to
: > 8.7 when it’e released?
: >
: >
: > The same things that always stop users from going directly to the
: > latest-and-greatest: fear of instability from new minor-release
: > features, reliance on behavior changed across minor versions, breaking
: > changes on Lucene elements that don't guarantee backcompat (e.g.
: > SOLR-14254), security issues in later versions (new libraries pulled
: > in with vulns), etc. There's lots of reasons a given user might want
: > to stick on 8.6.x rather than 8.7 (in the short/medium term).
: >
: > I'm ambivalent to whether we upgrade Jetty in 8.6.3 - as I said above
: > the worst of the Jetty issue should be mitigated by work on our end -
: > but I think there's a lot of reasons users might not upgrade as far as
: > we'd expect/like.
: >
: >
: > On Mon, Sep 28, 2020 at 2:05 PM Erick Erickson 
: > wrote:
: >
: >
: > For me, there’s a sharp distinction between changing a dependency in a
: > point release just because there’s a new version, and changing the
: > dependency because there’s a bug in it. That said, if someone can use
: > 8.6.3, what’s stopping them from going to 8.7 when it’e released? Would it
: > make more sense to do the upgrades for 8.7 and get that out the door rather
: > than backport?
: >
: > FWIW,
: > Erick
: >
: > On Sep 28, 2020, at 1:45 PM, Jason Gerlowski 
: > wrote:
: >
: > Hey all,
: >
: > I wanted to add 2 more blocker tickets to the list: SOLR-14897 and
: > SOLR-14898. These tickets (while bad bugs in their own right) are
: > especially necessary because they work around a Jetty buffer-reuse bug
: > (see SOLR-14896) that causes sporadic request failures once triggered.
: >
: > So that brings the list of 8.6.3 blockers up to: SOLR-14850,
: > SOLR-14835, SOLR-14897, and SOLR-14898. (Thanks David for the quick
: > work on SOLR-14768!)
: >
: > Additionally, should we also consider a Jetty upgrade for 8.6.3 in
: > light of the issue mentioned above? I know it's atypical for bug-fix
: > releases to change deps, but here the bug is serious and tied directly
: > 

Index documents in async way

2020-10-08 Thread Cao Mạnh Đạt
Hi guys,

First of all, it seems that I have used the term async a lot recently :D.
Recently I have been thinking a lot about changing the current indexing
model of Solr from the sync way it works today (a user submits an update
request and waits for the response). What about changing it to an async
model, where nodes only persist the update into the tlog and then return
immediately, much like what the tlog is doing now? Then we have a dedicated
executor which reads from the tlog to do the indexing (a producer-consumer
model with the tlog acting as the queue).
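
A minimal sketch of that producer-consumer shape; here an in-memory queue
stands in for the tlog just to show the idea, and a real implementation would
read from the tlog (or a distributed queue) instead:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    /** Request threads enqueue and return; one dedicated executor drains and indexes in batches. */
    public class AsyncIndexerSketch {
      private final BlockingQueue<Document> queue = new LinkedBlockingQueue<>();
      private final ExecutorService consumer = Executors.newSingleThreadExecutor();

      public AsyncIndexerSketch(IndexWriter writer) {
        consumer.submit(() -> {
          List<Document> batch = new ArrayList<>();
          while (!Thread.currentThread().isInterrupted()) {
            batch.clear();
            batch.add(queue.take());    // block until at least one update arrives
            queue.drainTo(batch, 999);  // then grab whatever else is already queued
            writer.addDocuments(batch); // one Lucene call for the whole batch
          }
          return null;
        });
      }

      /** Called from the request thread: enqueue ("persist") and return immediately. */
      public void submit(Document doc) {
        queue.add(doc);
      }
    }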

I do see several big benefits of this approach

   - We can batch updates in a single call; right now we do not use the
   writer.addDocuments() API from Lucene, and batching updates is going to
   boost indexing performance.
   - One common problem with Solr now is that we have a lot of threads doing
   indexing, which can end up producing many small segments. Using this model
   we can have bigger segments and thus less merge cost.
   - Another huge reason is that after switching to this model, we can
   remove the tlog and use a distributed queue like Kafka or Pulsar. Since the
   purpose of the leader in SolrCloud now is ordering updates, and the
   distributed queue already orders updates for us, there is no need to have a
   dedicated leader (a rough consumer sketch follows this list). That is just
   the beginning of things that we can do after using a distributed queue.
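
As referenced in the last bullet, a rough sketch of a shard replica consuming
ordered updates from a single Kafka partition; the topic name, the partition
assignment, and the indexing step are assumptions for illustration:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ShardUpdateConsumerSketch {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "solr-shard1-indexer");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          // Every replica of a shard reads the shard's single partition, so all
          // replicas apply the same updates in the same order; the queue plays
          // the ordering role the shard leader plays today.
          consumer.assign(List.of(new TopicPartition("solr-updates", 0)));
          while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> rec : records) {
              // Apply the update to the local Lucene index here;
              // rec.offset() would act as the update's version / asyncId.
              System.out.println(rec.offset() + " -> " + rec.value());
            }
            consumer.commitSync(); // only acknowledge after the batch is indexed
          }
        }
      }
    }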

What do you guys think about this? Just want to hear from you before
going deep into this rabbit hole.

Thanks!


Re: Should ChildDocTransformerFactory's limit be local or global for deep-nested documents?

2020-10-08 Thread David Smiley
On Thu, Oct 8, 2020 at 9:13 AM Bar Rotstein  wrote:

> Hey David,
> long time no speak.
>
> I think I'll start working on SOLR-14869.
>
> Do you have any tips that might enable me to tackle it a little faster?
>
>
ChildDocTransformer loops over document IDs.  They should be in the same
segment.  You should get the LeafReader for that segment and call
getLiveDocs on it.  In the transformer when you loop the IDs, check to see
if the doc is "live".
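
A minimal sketch of that check, assuming a segment-local doc ID; this is just
the general Lucene pattern, not the actual SOLR-14869 patch:

    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.util.Bits;

    public class LiveDocsCheckSketch {
      /**
       * Skip documents that have been deleted in this segment when looping
       * child doc IDs. getLiveDocs() returns null when the segment has no
       * deletions at all, in which case every doc is live.
       */
      static boolean isLive(LeafReader segmentReader, int segmentLocalDocId) {
        Bits liveDocs = segmentReader.getLiveDocs();
        return liveDocs == null || liveDocs.get(segmentLocalDocId);
      }
    }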


Re: Should ChildDocTransformerFactory's limit be local or global for deep-nested documents?

2020-10-08 Thread Bar Rotstein
Hey David,
long time no speak.

I think I'll start working on SOLR-14869.

Do you have any tips that might enable me to tackle it a little faster?

Thanks,
Bar.

On Sun, Oct 4, 2020 at 12:25 AM David Smiley  wrote:

> Glad to hear from you again Bar!
> Also, FYI https://issues.apache.org/jira/browse/SOLR-14869 is a serious
> bug relating to child documents.  It returns deleted docs!
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Sat, Oct 3, 2020 at 3:23 PM Bar Rotstein  wrote:
>
>> Hey,
>> Was a ticket opened?
>>
>> I'd gladly tackle that one if it hasn't been assigned yet.
>>
>> Thanks in advance,
>> Bar
>> On Fri, Oct 2, 2020 at 3:13 PM David Smiley  wrote:
>>
>>> I think that's a bug!  Good catch!
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Thu, Oct 1, 2020 at 11:38 PM Alexandre Rafalovitch <
>>> arafa...@gmail.com> wrote:
>>>
 I am indexing a deeply nested structure and am trying to return it
 with fl=*,[child].

 And it is supposed to have 5 children under the top element but
 returns only 4. Two hours of debugging later, I realize that the
 "limit" parameter is set to 10 by default and that 10 seems to be
 counting children at ANY level. And calculating them depth-first. So,
 it was quite unobvious to discover when the children suddenly stopped
 showing up.

 The documentation says:
 > The maximum number of child documents to be returned per parent
 document. > The default is `10`.

 So, is that (all nested children included in limit) what we actually
 mean? Or did we mean maximum number of "immediate children" for any
 specific document/level and the code is wrong?

 I can update the doc to clarify the results, but I don't know whether
 I am looking at the bug or the feature.

 Regards,
Alex.

 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org




RE: [VOTE] Release Lucene/Solr 8.6.3 RC1

2020-10-08 Thread Uwe Schindler
Here is also my +1:

 

Policeman Jenkins was happy: 
https://jenkins.thetaphi.de/job/Lucene-Solr-Release-Tester/35/console

 

SUCCESS! [1:25:03.704981]

 

It tested Java 8 and 9.

 

-

Uwe Schindler

Achterdiek 19, D-28357 Bremen

https://www.thetaphi.de

eMail: u...@thetaphi.de

 

From: Jason Gerlowski  
Sent: Sunday, October 4, 2020 3:54 AM
To: Lucene Dev 
Subject: [VOTE] Release Lucene/Solr 8.6.3 RC1

 

Please vote for release candidate 1 for Lucene/Solr 8.6.3

 

The artifacts can be downloaded from:

https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.6.3-RC1-reve001c2221812a0ba9e9378855040ce72f93eced4

 

You can run the smoke tester directly with this command:

 

python3 -u dev-tools/scripts/smokeTestRelease.py \

https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.6.3-RC1-reve001c2221812a0ba9e9378855040ce72f93eced4

 

The vote will be open for at least 72 hours i.e. until 2020-10-07 02:00 UTC.

 

[ ] +1  approve

[ ] +0  no opinion

[ ] -1  disapprove (and reason why)

 

Here is my +1