Re: [VOTE] Release Lucene/Solr 8.8.0 RC2

2021-01-25 Thread Tomás Fernández Löbbe
Thanks Noble! And thanks for fixing that concurrency issue, I'd hit it but
didn't have time to investigate it.

+1
SUCCESS! [0:58:32.036482]

On Mon, Jan 25, 2021 at 10:19 AM Timothy Potter wrote:

> Thanks Noble!
>
> +1 SUCCESS! [1:24:28.212370] (my internet is super slow today)
>
> Re-ran all the Solr operator tests and verified the Cloud graph UI renders
> correctly now.
>
> On Mon, Jan 25, 2021 at 3:22 AM Noble Paul  wrote:
>
>> Please vote for release candidate 2 for Lucene/Solr 8.8.0
>>
>> The artifacts can be downloaded from:
>>
>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC2-revb10659f0fc18b58b90929cfdadde94544d202c4a/
>>
>> python3 -u dev-tools/scripts/smokeTestRelease.py \
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC2-revb10659f0fc18b58b90929cfdadde94544d202c4a/
>>
>>
>>
>> The vote will be open for at least 72 hours
>>
>> [ ] +1  approve
>> [ ] +0  no opinion
>> [ ] -1  disapprove (and reason why)
>>
>> Here is my +1
>> --
>> -
>> Noble Paul
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>


Re: [DISCUSS] ConfigSet ZK to file system fallback

2021-01-25 Thread David Smiley
I'm not entirely sure how to react to the feedback.  Maybe in listing
multiple benefits and a follow-on proposal, I inadvertently opened doors to
distracting points.  I know I can be guilty of scope creep.  My proposal
has no impact on where JARs go, so let's not discuss lib directories, the
package store, or LTR's feature store, none of which my proposal relates
to, ok?  My proposal doesn't even add a new configuration place that
doesn't already exist.

Let me try to express this proposal through a different angle / lens that I
think is more clear and motivating than the first:

Each physical Solr node (perhaps a Docker image) is composed of Solr's
code, perhaps some plugin code too, and some configuration files with some
settings.  Baked into any code are settings with a default value.  There
are trivial primitive settings like an integer for "maxMergeAtOnce" on
TieredMergePolicy, and there are more aggregate settings, like what the
default MergePolicy is.  Sometimes the default changes from one release to
the next, or new settings get added or go away (albeit rarely).  Let's just
consider SolrCloud.

... Let's say you need to make a settings change.  ...

For changes specified in solrconfig.xml (generalizable to any file in the
configSet, really), you MUST deploy this to ZooKeeper.  That sucks when the
configuration might only make sense for some nodes.  Most likely you are
doing an upgrade in which you can't simply change the Solr nodes in an
instant, but perhaps some nodes are simply different (different hardware?
-- SSDs vs HDDs).  Upgrades can be orchestrated but it's more complex when
there is ZK resident configuration, and it will impose annoying
restrictions on the underlying code (i.e. back-compat concerns).  By having
a "physical layer configuration" (borrowing Eric's terminology), we can tie
some settings to this layer while still having a higher level layer.  I
proposed one way of doing this; I'd be happy to discuss others.

I'd like to extend the same argument to solr.xml, a node level
configuration file.  Here, at least there is already _some_ flexibility --
you can supply solr.xml with the physical layer (the Docker image) *OR* in
ZooKeeper.  But IMO it's not ideal because it's either-or.  Some
configuration might make sense at the physical node level, and some at the
cluster level.  Ideally IMO, we'd have a way to blend both such that the
deployer chooses where the configuration makes sense based on their cluster.
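
To make the "blend both" idea concrete, here is a minimal sketch of a layered lookup. It is not Solr code: the class name LayeredConfig, its methods, and the sample values are hypothetical, and the precedence (ZooKeeper first, then the node's file system, then the default baked into the code) is an assumption taken from the thread subject, not a settled design.

```java
import java.util.Map;

/**
 * Hypothetical sketch (not Solr's API): resolve a setting from the cluster
 * layer (ZooKeeper) first, fall back to the node-local layer (file system),
 * and finally to the default baked into the code.
 */
final class LayeredConfig {
    private final Map<String, String> clusterLayer; // assumed: settings stored in ZooKeeper
    private final Map<String, String> nodeLayer;    // assumed: settings shipped with the node image

    LayeredConfig(Map<String, String> clusterLayer, Map<String, String> nodeLayer) {
        this.clusterLayer = Map.copyOf(clusterLayer);
        this.nodeLayer = Map.copyOf(nodeLayer);
    }

    /** Returns the first value found: cluster layer, then node layer, then the code default. */
    String get(String key, String codeDefault) {
        String value = clusterLayer.get(key);
        if (value == null) {
            value = nodeLayer.get(key);
        }
        return value != null ? value : codeDefault;
    }

    public static void main(String[] args) {
        LayeredConfig config = new LayeredConfig(
            Map.of("mergePolicy", "TieredMergePolicy"), // cluster-wide choice
            Map.of("maxMergeAtOnce", "20"));            // node-specific tuning (illustrative value)
        System.out.println(config.get("mergePolicy", "LogByteSizeMergePolicy"));
        System.out.println(config.get("maxMergeAtOnce", "10"));
        System.out.println(config.get("ramBufferSizeMB", "100"));
    }
}
```

With this ordering, a deployer who wants a node-specific value simply leaves the key out of ZooKeeper; the opposite precedence would be an equally small change in get().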

WDYT?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Sun, Jan 24, 2021 at 6:08 AM Ilan Ginzburg  wrote:

> An aspect that would be interesting to consider IMO is upgrade and
> configuration changes.
> For example a collection in use across Solr version upgrade might require
> different configuration (config set) with the old and new Solr versions.
> Solr itself can require changes in config across updates.
>
> Backward compatibility is the usual answer (the new code continues working
> with the old config that can be updated once all nodes have been deployed)
> but this imposes constraints on new code.
> If there was a way for the new Solr code to "magically" use a different
> config set for the collection (and for Solr config in general) there would
> be more freedom to add or change features, change default behavior across
> Solr versions etc.
>
> Ilan
>
> On Sat 23 Jan 2021 at 22:22, Gus Heck  wrote:
>
>> I'm in agreement with Eric here that fewer ways (or at least a clearer
>> default way) of supplying resources would be better. Additionally, it
>> should be easy to specify that this resource that I've shared should be
>> loaded on a per SolrCore or per node basis (or even better per collection
>> present on the node, accessible under a standard name to replicas belonging
>> to that collection?). There are not many cases, beyond the simplest
>> single-collection install with few shards, where you want a 1GB resource
>> to be duplicated in memory across N cores running on the same node, though
>> obviously there are ample cases where the 10k stop words file is meant to
>> differ across collections.
>>
>> As it stands Eric's list seems like something that should be in the
>> documentation somewhere just so people can properly troubleshoot where
>> something they don't expect to be loaded is getting loaded from, or why
>> their attempts to load something new aren't working...  especially if it
>> were ordered to show the precedence of these options.
>>
>> As for ease of editing configurations, I've long felt that this should be
>> possible via the admin UI though there's been much worry about security
>> implications there. Personally, I think that those concerns are resolvable,
>> but have not found time to make that case. Aside from that I think we need
>> to support tooling to enable easy management of config sets rather than
>> expanding the possible number of places the configurations might get loaded
>> from.
>>
>> Several years ago I 

Re: Consider Removing the `@` Special Character from RegExp

2021-01-25 Thread Marcus Eagan
That's right. It's optional. I think we should remove it unless we have a
good reason to keep it. I just think that it's maddening and unnecessary.
Perhaps, I am the only one?

On Fri, Jan 22, 2021 at 7:54 AM Gus Heck  wrote:

> I think it's already an optional feature; if you construct the regexp with
> explicit syntax flags you can get an instance that won't consider '@'
> special. Haven't actually had a need to do that so I'm assuming it works as
> documented.
>
> /** Syntax flag, enables anystring (@). */
> public static final int ANYSTRING = 0x0008;
>
>
>
> On Thu, Jan 21, 2021 at 9:21 PM Marcus Eagan wrote:
>
>> Hi All,
>>
>> In looking at the Java Docs, our Lucene team noticed that the `@` symbol
>> is a reserved character in the Lucene regular expression syntax.
>>
>> In re-visiting the page out of curiosity, I found that the symbol was
>> [Optional] for "any string." This came as a surprise because there's a very
>> common way to achieve "any string" in `.*`. Is there any compelling reason
>> to preserve this tiny vector of complexity? I suspect there may be some
>> differences in the constructions of the finite automata produced by `.*`
>> and `@` but I am not sure.
>>
>> If insignificant or non-existent, I suggest we remove `@` from the
>> regular expression syntax.
>>
>> --
>> Marcus Eagan
>>
>>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>


-- 
Marcus Eagan


Re: Merging segment parts concurrently (SegmentMerger)

2021-01-25 Thread Dawid Weiss
Thanks for early feedback.

I freely admit I never had to touch codecs so I'm not sure what ordering
dependencies need to be respected. But it's certainly something I'd like to
look into since that "last" segment merge can now take ~10 minutes on
mostly idle CPU (64 cores, remember...) and I/O. Worth a shot to improve
this.

Dawid

On Mon, Jan 25, 2021 at 10:39 PM Michael Sokolov  wrote:

> At least in theory, since the segmentWriteState is shared among these
> phases, there could be dependencies, but it seems as if it ought to be
> limited to making sure that the FieldInfos are written last? This is
> pure speculation, I haven't dug deeply in the code. However, it would
> be necessary to have some kind of synchronization on updates to that
> state if these were to be run concurrently. If we do this, should we
> also handle the various steps in IndexingChain.flush concurrently? I
> guess the mechanism for providing threads to do so might be
> different. At least in this case, there do seem to be *some*
> dependencies, like between norms and terms?
>
> On Mon, Jan 25, 2021 at 1:58 PM David Smiley  wrote:
> >
> > I suppose we should add a CallerRunsMergeScheduler (a new superclass of
> > SerialMergeScheduler)?  Or make this aspect of SMS configurable.  We might
> > use a semaphore to control how many callers can merge at once (1 == SMS of
> > today, larger for expanded).  It might be debatable if it is then "serial"
> > or not.
> >
> > I do think it'd be possible to merge parts of a segment at once!  That'd
> > be a cool feature to add.
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Mon, Jan 25, 2021 at 11:05 AM Michael Sokolov wrote:
> >>
> >> It makes sense to me. I don't have the full picture, but I did just
> >> implement merging for vector format, and that at least, could be done
> >> fully concurrent with other formats. I expect the same is true of
> >> DocValues, Terms, etc. I'm not sure about the different kinds of
> >> DocValues - they might want to be done together?
> >>
> >> On Mon, Jan 25, 2021 at 5:45 AM Dawid Weiss wrote:
> >> >
> >> >
> >> > Hey everyone,
> >> >
> >> > I'm trying to cut the total wall-time of indexing for some fairly
> >> > large document collections on machines with a high CPU count (> 32 indexing
> >> > threads). So far my observations are:
> >> >
> >> > 1) I resigned from using the concurrent merge scheduler in favor of
> >> > "same thread" merging. This means the indexing thread that encounters a
> >> > merge just does it. The CMS is designed to favor concurrent searches over
> >> > indexing and it really didn't do anything I needed - in fact, I had to
> >> > disable most things it offers. I/O throttling and thread stalling are not
> >> > really practical on fast I/O in the absence of concurrent searches - you
> >> > can literally just use as many merge threads as needed to saturate the I/O.
> >> >
> >> > 2) It is quite frequent that everything is churning nicely until the
> >> > last few merges combine huge smaller segments and form a "long-tail" where
> >> > most cores are just idle... Here comes my question - can we execute the
> >> > individual "parts" involved in segment merging (the logic inside
> >> > SegmentMerger) in separate threads? On the surface it looks like these
> >> > steps can be done independently (even if they're executed sequentially at
> >> > the moment) but perhaps I'm missing something?
> >> >
> >> > I'd like to ask before I try to tinker with it. Thanks for any
> >> > feedback.
> >> >
> >> > Dawid
> >>


Re: Merging segment parts concurrently (SegmentMerger)

2021-01-25 Thread Michael Sokolov
At least in theory, since the segmentWriteState is shared among these
phases, there could be dependencies, but it seems as if it ought to be
limited to making sure that the FieldInfos are written last? This is
pure speculation, I haven't dug deeply in the code. However, it would
be necessary to have some kind of synchronization on updates to that
state if these were to be run concurrently. If we do this, should we
also handle the various steps in IndexingChain.flush concurrently? I
guess the mechanism for providing threads to do so might be
different. At least in this case, there do seem to be *some*
dependencies, like between norms and terms?
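
The ordering constraint speculated about above (independent phases may overlap, but FieldInfos go last) can be sketched with plain java.util.concurrent. This is not Lucene code: OrderedMergePhases, the phase names, and the idea that each phase is a simple task are all hypothetical stand-ins, and a thread-safe list stands in for the shared state that would need synchronization.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: run independent phases concurrently, then one final phase. */
final class OrderedMergePhases {

    /** Returns phase names in completion order; finalPhase is always the last entry. */
    static List<String> run(List<String> independentPhases, String finalPhase, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // Thread-safe list standing in for shared state updated by every phase.
        List<String> completionOrder = new CopyOnWriteArrayList<>();
        try {
            CompletableFuture<?>[] futures = independentPhases.stream()
                .map(name -> CompletableFuture.runAsync(() -> completionOrder.add(name), pool))
                .toArray(CompletableFuture[]::new);
            // The final phase starts only after every independent phase has finished.
            CompletableFuture.allOf(futures)
                .thenRun(() -> completionOrder.add(finalPhase))
                .join();
        } finally {
            pool.shutdown();
        }
        return completionOrder;
    }

    public static void main(String[] args) {
        List<String> order = run(List.of("terms", "docValues", "vectors"), "fieldInfos", 3);
        System.out.println(order); // order of the first three entries may vary between runs
    }
}
```

A dependency such as the one suggested between norms and terms would need an extra edge in this graph, for example chaining one future after the other instead of starting both at once.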

On Mon, Jan 25, 2021 at 1:58 PM David Smiley  wrote:
>
> I suppose we should add a CallerRunsMergeScheduler (a new superclass of 
> SerialMergeScheduler)?  Or make this aspect of SMS configurable.  We might 
> use a semaphore to control how many callers can merge at once (1 == SMS of 
> today, larger for expanded).  It might be debatable if it is then "serial" or 
> not.
>
> I do think it'd be possible to merge parts of a segment at once!  That'd be a 
> cool feature to add.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Mon, Jan 25, 2021 at 11:05 AM Michael Sokolov  wrote:
>>
>> It makes sense to me. I don't have the full picture, but I did just
>> implement merging for vector format, and that at least, could be done
>> fully concurrent with other formats. I expect the same is true of
>> DocValues, Terms, etc. I'm not sure about the different kinds of
>> DocValues - they might want to be done together?
>>
>> On Mon, Jan 25, 2021 at 5:45 AM Dawid Weiss  wrote:
>> >
>> >
>> > Hey everyone,
>> >
>> > I'm trying to cut the total wall-time of indexing for some fairly large 
>> > document collections on machines with a high CPU count (> 32 indexing 
>> > threads). So far my observations are:
>> >
>> > 1) I resigned from using the concurrent merge scheduler in favor of "same 
>> > thread" merging. This means the indexing thread that encounters a merge 
>> > just does it. The CMS is designed to favor concurrent searches over 
>> > indexing and it really didn't do anything I needed - in fact, I had to 
>> > disable most things it offers. I/O throttling and thread stalling are not 
>> > really practical on fast I/O in the absence of concurrent searches - you 
>> > can literally just use as many merge threads as needed to saturate the I/O.
>> >
>> > 2) It is quite frequent that everything is churning nicely until the last 
>> > few merges combine huge smaller segments and form a "long-tail" where most 
>> > cores are just idle... Here comes my question - can we execute the 
>> > individual "parts" involved in segment merging (the logic inside 
>> > SegmentMerger) in separate threads? On the surface it looks like these 
>> > steps can be done independently (even if they're executed sequentially at 
>> > the moment) but perhaps I'm missing something?
>> >
>> > I'd like to ask before I try to tinker with it. Thanks for any feedback.
>> >
>> > Dawid
>>



Re: Merging segment parts concurrently (SegmentMerger)

2021-01-25 Thread David Smiley
I suppose we should add a CallerRunsMergeScheduler (a new superclass of
SerialMergeScheduler)?  Or make this aspect of SMS configurable.  We might
use a semaphore to control how many callers can merge at once (1 == SMS of
today, larger for expanded).  It might be debatable if it is then "serial"
or not.
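
A minimal sketch of the semaphore idea, outside Lucene: the calling thread runs the merge itself, and a semaphore caps how many callers may do so at once. CallerRunsMergeLimiter and its methods are hypothetical names and do not implement Lucene's MergeScheduler API; the sleep only simulates merge work.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: caller-runs merging bounded by a semaphore. */
final class CallerRunsMergeLimiter {
    private final Semaphore permits;

    CallerRunsMergeLimiter(int maxConcurrentMerges) {
        if (maxConcurrentMerges < 1) {
            throw new IllegalArgumentException("maxConcurrentMerges must be >= 1");
        }
        this.permits = new Semaphore(maxConcurrentMerges);
    }

    /** Runs the merge on the calling thread, blocking while the limit is reached. */
    void merge(Runnable mergeWork) throws InterruptedException {
        permits.acquire();
        try {
            mergeWork.run();
        } finally {
            permits.release();
        }
    }

    /** Starts `callers` threads that each merge once; returns the peak number merging together. */
    static int maxObservedConcurrency(int limit, int callers) {
        CallerRunsMergeLimiter limiter = new CallerRunsMergeLimiter(limit);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            threads[i] = new Thread(() -> {
                try {
                    limiter.merge(() -> {
                        maxSeen.accumulateAndGet(active.incrementAndGet(), Math::max);
                        try {
                            Thread.sleep(20); // simulated merge work
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                        active.decrementAndGet();
                    });
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return maxSeen.get();
    }

    public static void main(String[] args) {
        System.out.println("limit 1: peak " + maxObservedConcurrency(1, 4));
        System.out.println("limit 3: peak " + maxObservedConcurrency(3, 8));
    }
}
```

With a limit of 1 the behavior is serial, which matches "1 == SMS of today"; larger limits let more indexing threads merge at the same time.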

I do think it'd be possible to merge parts of a segment at once!  That'd be
a cool feature to add.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Jan 25, 2021 at 11:05 AM Michael Sokolov  wrote:

> It makes sense to me. I don't have the full picture, but I did just
> implement merging for vector format, and that at least, could be done
> fully concurrent with other formats. I expect the same is true of
> DocValues, Terms, etc. I'm not sure about the different kinds of
> DocValues - they might want to be done together?
>
> On Mon, Jan 25, 2021 at 5:45 AM Dawid Weiss  wrote:
> >
> >
> > Hey everyone,
> >
> > I'm trying to cut the total wall-time of indexing for some fairly large
> > document collections on machines with a high CPU count (> 32 indexing
> > threads). So far my observations are:
> >
> > 1) I resigned from using the concurrent merge scheduler in favor of
> > "same thread" merging. This means the indexing thread that encounters a
> > merge just does it. The CMS is designed to favor concurrent searches over
> > indexing and it really didn't do anything I needed - in fact, I had to
> > disable most things it offers. I/O throttling and thread stalling are not
> > really practical on fast I/O in the absence of concurrent searches - you
> > can literally just use as many merge threads as needed to saturate the I/O.
> >
> > 2) It is quite frequent that everything is churning nicely until the
> > last few merges combine huge smaller segments and form a "long-tail" where
> > most cores are just idle... Here comes my question - can we execute the
> > individual "parts" involved in segment merging (the logic inside
> > SegmentMerger) in separate threads? On the surface it looks like these
> > steps can be done independently (even if they're executed sequentially at
> > the moment) but perhaps I'm missing something?
> >
> > I'd like to ask before I try to tinker with it. Thanks for any feedback.
> >
> > Dawid
>


Re: [VOTE] Release Lucene/Solr 8.8.0 RC2

2021-01-25 Thread Timothy Potter
Thanks Noble!

+1 SUCCESS! [1:24:28.212370] (my internet is super slow today)

Re-ran all the Solr operator tests and verified the Cloud graph UI renders
correctly now.

On Mon, Jan 25, 2021 at 3:22 AM Noble Paul  wrote:

> Please vote for release candidate 2 for Lucene/Solr 8.8.0
>
> The artifacts can be downloaded from:
>
>
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC2-revb10659f0fc18b58b90929cfdadde94544d202c4a/
>
> python3 -u dev-tools/scripts/smokeTestRelease.py \
> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC2-revb10659f0fc18b58b90929cfdadde94544d202c4a/
>
>
>
> The vote will be open for at least 72 hours
>
> [ ] +1  approve
> [ ] +0  no opinion
> [ ] -1  disapprove (and reason why)
>
> Here is my +1
> --
> -
> Noble Paul
>


Re: Merging segment parts concurrently (SegmentMerger)

2021-01-25 Thread Michael Sokolov
It makes sense to me. I don't have the full picture, but I did just
implement merging for vector format, and that at least, could be done
fully concurrent with other formats. I expect the same is true of
DocValues, Terms, etc. I'm not sure about the different kinds of
DocValues - they might want to be done together?

On Mon, Jan 25, 2021 at 5:45 AM Dawid Weiss  wrote:
>
>
> Hey everyone,
>
> I'm trying to cut the total wall-time of indexing for some fairly large 
> document collections on machines with a high CPU count (> 32 indexing 
> threads). So far my observations are:
>
> 1) I resigned from using the concurrent merge scheduler in favor of "same 
> thread" merging. This means the indexing thread that encounters a merge just 
> does it. The CMS is designed to favor concurrent searches over indexing and 
> it really didn't do anything I needed - in fact, I had to disable most things 
> it offers. I/O throttling and thread stalling are not really practical on 
> fast I/O in the absence of concurrent searches - you can literally just use 
> as many merge threads as needed to saturate the I/O.
>
> 2) It is quite frequent that everything is churning nicely until the last few 
> merges combine huge smaller segments and form a "long-tail" where most cores 
> are just idle... Here comes my question - can we execute the individual 
> "parts" involved in segment merging (the logic inside SegmentMerger) in 
> separate threads? On the surface it looks like these steps can be done 
> independently (even if they're executed sequentially at the moment) but 
> perhaps I'm missing something?
>
> I'd like to ask before I try to tinker with it. Thanks for any feedback.
>
> Dawid




Re: Separate git repo(s) for Solr modules

2021-01-25 Thread Ishan Chattopadhyaya
I haven't been able to follow up on creation of the extras repo, but more
importantly I wanted to respond to Hoss. I'm out on an emergency for a week
or so, shall resume then. If there's a decision on this until then, I shall
accept it.

On Mon, 25 Jan, 2021, 9:04 am Jason Gerlowski wrote:

> Tentative +1 to Hoss' questions.  I agree with his summary of the
> potential risks here, and I share his ignorance of the perceived
> benefits.
>
> (I thought for a time that this was driven by interest in having
> release cadences independent from Solr-core releases.  I'm all for
> that goal, but if that's the motivation I'm not sure what the obstacle
> is to doing that with a single repo - all build systems these days
> support versioning and releasing modules independent of one another.
> But maybe that was never a driving factor here.)
>
> I know there have been a lot of discussions about this, and I know the
> repo has already been created.  So maybe it's too late to object even
> if I wanted to, which I'm not sure I do.  But can someone that
> understands the motivation please summarize what multiple-repos gets
> us over a single repo that outweighs the "cons" that Hoss mentioned?
>
> Jason
>
> On Thu, Jan 14, 2021 at 12:34 PM Chris Hostetter wrote:
> >
> >
> > : As we discussed over the last few months, there seems a need to move
> > : non-core pieces away from the Solr core module. The contribs are presently
> > : a good place, but it makes sense to have a separate git repository hosting
> > : such modules. Some candidates that come to mind are the present day contrib
> >
> > can you explain why it makes sense to have a separate git repo for these
> > things?
> >
> > I can think of lots of reasons why it makes sense to have a single
> > repo for all things solr (unified CI that quickly identifies if core
> > changes break "first order" plugins, shared feature branches & monotomic
> > commits of code that affects APIs and impls of those APIs, etc...) but
> > I've yet to see any concrete specifics of why multiple git repos are
> > "better" than just having distinct sub-projects (with distinct artifacts)
> > in the same repo other than "it makes sense"
> >
> > why does it make sense?
> >
> > why can't the ideas of "solr-sandbox" and "solr-extras" just be
> > directories in the "solr repo" ? ... what value is gained by making them
> > new repos?
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >


Merging segment parts concurrently (SegmentMerger)

2021-01-25 Thread Dawid Weiss
Hey everyone,

I'm trying to cut the total wall-time of indexing for some fairly large
document collections on machines with a high CPU count (> 32 indexing
threads). So far my observations are:

1) I gave up on the concurrent merge scheduler in favor of "same
thread" merging. This means the indexing thread that encounters a merge
just does it. The CMS is designed to favor concurrent searches over
indexing and it really didn't do anything I needed - in fact, I had to
disable most things it offers. I/O throttling and thread stalling are not
really practical on fast I/O in the absence of concurrent searches - you
can literally just use as many merge threads as needed to saturate the I/O.

2) Quite frequently everything churns along nicely until the last
few merges combine the already-large segments and form a "long tail" where most
cores are just idle... Here comes my question - can we execute the
individual "parts" involved in segment merging (the logic inside
SegmentMerger) in separate threads? On the surface it looks like these
steps can be done independently (even if they're executed sequentially at
the moment) but perhaps I'm missing something?
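
A back-of-the-envelope sketch of what running the parts in separate threads could buy, assuming the parts really are independent. The class MergeWallTimeEstimate, the part names, and the durations in main are hypothetical illustrations and not measurements. Sequential wall time is the sum of the parts; with T threads it cannot drop below the longest part or below the total divided by T.

```java
import java.util.Map;

/** Hypothetical sketch: wall-time bounds for merging one segment's parts. */
final class MergeWallTimeEstimate {

    /** Wall time when the parts run one after another, as they do today. */
    static long sequentialMillis(Map<String, Long> partMillis) {
        long total = 0;
        for (long millis : partMillis.values()) {
            total += millis;
        }
        return total;
    }

    /** Lower bound on wall time when independent parts are spread over `threads` threads. */
    static long concurrentLowerBoundMillis(Map<String, Long> partMillis, int threads) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        long longest = 0;
        for (long millis : partMillis.values()) {
            longest = Math.max(longest, millis);
        }
        long total = sequentialMillis(partMillis);
        long evenSplit = (total + threads - 1) / threads; // ceiling of total / threads
        return Math.max(longest, evenSplit);
    }

    public static void main(String[] args) {
        // Illustrative durations only.
        Map<String, Long> parts = Map.of(
            "terms", 400L, "docValues", 300L, "storedFields", 200L, "norms", 100L);
        System.out.println("sequential: " + sequentialMillis(parts) + " ms");
        System.out.println("4 threads, at best: " + concurrentLowerBoundMillis(parts, 4) + " ms");
    }
}
```

The bound shows the limit of the approach: once there are enough threads, the slowest single part dominates, so the gain depends on how evenly the work is spread across the parts.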

I'd like to ask before I try to tinker with it. Thanks for any feedback.

Dawid


[VOTE] Release Lucene/Solr 8.8.0 RC2

2021-01-25 Thread Noble Paul
Please vote for release candidate 2 for Lucene/Solr 8.8.0

The artifacts can be downloaded from:

https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC2-revb10659f0fc18b58b90929cfdadde94544d202c4a/

python3 -u dev-tools/scripts/smokeTestRelease.py \
https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.8.0-RC2-revb10659f0fc18b58b90929cfdadde94544d202c4a/



The vote will be open for at least 72 hours

[ ] +1  approve
[ ] +0  no opinion
[ ] -1  disapprove (and reason why)

Here is my +1
--
-
Noble Paul
