Solr 8 - Sort Issue

2019-12-05 Thread Anuj Bhargava
When I sort desc on posting_id sort=posting_id%20desc, I get the following
result
"posting_id":"313"
"posting_id":"312"
"posting_id":"310"

When I sort asc on posting_id sort=posting_id%20asc, I get the following
result
"posting_id":"10005343"
"posting_id":"10005349"
"posting_id":"10005359"

*In descending order the 8-figure numbers are not coming up first, and in
ascending order the smaller numbers are not coming up first.*

Entry in schema is -
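
This ordering is exactly what lexicographic (string) sorting produces:
compared character by character, "313" sorts after "10005343" because
'3' > '1'. Numeric ordering needs a numeric field type. A minimal sketch via
the Schema API, assuming a core named "posts" and the plong type (both are
assumptions); a full reindex is required after the change:

curl -X POST -H 'Content-Type: application/json' \
  "http://localhost:8983/solr/posts/schema" \
  -d '{"replace-field": {"name": "posting_id", "type": "plong",
                         "stored": true, "docValues": true}}'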



Re: [Q] Faster Atomic Updates - use docValues?

2019-12-05 Thread Paras Lehana
Hi Erick,

I believed that optimizing explicitly merges segments, which is why I was
expecting it to give a performance boost. I know that optimizations should
not be done very frequently; for a full indexing run, optimizations occurred 30
times between batches. I take your suggestion to undo all the changes, and
that's what I'm going to do. I mentioned the optimizations giving an
indexing boost (for some time) only to support your point about my mergePolicy
backfiring. I will certainly read about the merge process again.

Taking your suggestions - so commits would be handled by autoCommit. What
implicitly handles optimizations? I think it's the merge policy, or is there
any other setting I'm missing?

I'm indexing via curl on the same server. The current speed of curl is
only 50k (down from 1300k in the first batch). I think the documents are
getting indexed while curl is transmitting the XML; only then would the
speed be so low. I don't think the whole XML is held in memory - I remember
I had to change the curl options to get rid of the transmission error for
large files.

This is my curl request:

curl "http://localhost:$port/solr/product/update?commit=true" -T
batch1.xml -X POST -H 'Content-type:text/xml'

Although we have been doing this for ages, I think I should now consider
using the Solr post tool (since the indexing files stay on the same
server) or using Solarium (we use PHP to build the XMLs).
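
For comparison, a minimal sketch of the same upload without the explicit
commit (leaving commits to autoCommit), plus the bundled post tool as an
alternative; the -commit flag is an assumption to check against bin/post -h:

curl "http://localhost:$port/solr/product/update" \
  -X POST -H 'Content-type:text/xml' -T batch1.xml

bin/post -c product -commit no batch1.xml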

On Thu, 5 Dec 2019 at 20:00, Erick Erickson  wrote:

> >  I think I should have also done optimize between batches, no?
>
> No, no, no, no. Absolutely not. Never. Never, never, never between batches.
> I don’t  recommend optimizing at _all_ unless there are demonstrable
> improvements.
>
> Please don’t take this the wrong way, the whole merge process is really
> hard to get your head around. But the very fact that you’d suggest
> optimizing between batches shows that the entire merge process is
> opaque to you. I’ve seen many people just start changing things and
> get themselves into a bad place, then try to change more things to get
> out of that hole. Rinse. Repeat.
>
> I _strongly_ recommend that you undo all your changes. Neither
> commit nor optimize from outside Solr. Set your autocommit
> settings to something like 5 minutes with openSearcher=true.
> Set all autowarm counts in your caches in solrconfig.xml to 0,
> especially filterCache and queryResultCache.
>
> Do not set soft commit at all, leave it at -1.
>
> Repeat do _not_ commit or optimize from the client! Just let your
> autocommit settings do the commits.
>
> It’s also pushing things to send 5M docs in a single XML packet.
> That all has to be held in memory and then indexed, adding to
> pressure on the heap. I usually index from SolrJ in batches
> of 1,000. See:
> https://lucidworks.com/post/indexing-with-solrj/
>
> Simply put, your slowdown should not be happening. I strongly
> believe that it’s something in your environment, most likely
> 1> your changes eventually shoot you in the foot OR
> 2> you are running in too little memory and eventually GC is killing you.
> Really, analyze your GC logs. OR
> 3> you are running on underpowered hardware which just can’t take the load
> OR
> 4> something else in your environment
>
> I’ve never heard of a Solr installation with such a massive slowdown during
> indexing that was fixed by tweaking things like the merge policy etc.
>
> Best,
> Erick
>
>
> > On Dec 5, 2019, at 12:57 AM, Paras Lehana 
> wrote:
> >
> > Hey Erick,
> >
> > This is a huge red flag to me: "(but I could only test for the first few
> >> thousand documents”.
> >
> >
> > Yup, that's probably where the culprit lies. I could only test for the
> > starting batch because I had to wait for a day to actually compare. I
> > tweaked the merge values and kept whatever gave a speed boost. My first
> > batch of 5 million docs took only 40 minutes (atomic updates included)
> and
> > the last batch of 5 million took more than 18 hours. If this is an issue
> of
> > mergePolicy, I think I should have also done optimize between batches,
> no?
> > I remember, when I indexed a single XML of 80 million after optimizing
> the
> > core already indexed with 30 XMLs of 5 million each, I could post 80
> > million in a day only.
> >
> >
> >
> >> The indexing rate you’re seeing is abysmal unless these are _huge_
> >> documents
> >
> >
> > Documents only contain the suggestion name, possible titles,
> > phonetics/spellcheck/synonym fields and numerical fields for boosting.
> They
> > are far smaller than what a Search Document would contain. Auto-Suggest
> is
> > only concerned about suggestions so you can guess how simple the
> documents
> > would be.
> >
> >
> > Some data is held on the heap and some in the OS RAM due to MMapDirectory
> >
> >
> > I'm using StandardDirectory (which will make Solr choose the right
> > implementation). Also, planning to read more about these (looking forward
> > to use MMap). Thanks for the article!

Re: Solr indexing performance

2019-12-05 Thread Shawn Heisey

On 12/5/2019 10:42 PM, Paras Lehana wrote:

Can ulimit settings impact this? Review once.


If the OS limits prevent Solr from opening a file or starting a thread, 
it is far more likely that the indexing would fail.  It's not likely 
that such problems would make indexing slow.
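
For what it's worth, a quick sketch of checking the limits Solr's start
script warns about, run as the user that owns the Solr process:

ulimit -n   # max open files
ulimit -u   # max user processes
ulimit -v   # max virtual memory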


Thanks,
Shawn


Re: From solr to solr cloud

2019-12-05 Thread Shawn Heisey

On 12/5/2019 12:28 PM, Vignan Malyala wrote:

I currently have 500 collections in my standalone Solr. Because of the
day-by-day increase in data, I want to convert it into SolrCloud.
Can you suggest how to do it successfully?
How many shards should there be?
How many nodes should there be?
Are the so-called nodes different machines I should use?
How many ZooKeeper nodes should there be?
Are the so-called ZooKeeper nodes different machines I should use?
In total, how many machines do I have to use to implement a scalable SolrCloud?


500 collections is large enough that running it in SolrCloud is likely 
to encounter scalability issues.  SolrCloud's design does not do well 
with that many collections in the cluster, even if there are a lot of 
machines.


There's a lot of comment history on this issue:

https://issues.apache.org/jira/browse/SOLR-7191

Generally speaking, each machine should only house one Solr node, 
whether you're running cloud or not.  If each one requires a really huge 
heap, it might be worthwhile to split it, but that's the only time I 
would do so.  And I would generally prefer to add more machines than to 
run multiple Solr nodes on one machine.


One thing you might do, if the way your data is divided will permit it, 
is to run multiple SolrCloud clusters.  Multiple clusters can all use 
one ZooKeeper ensemble.


ZooKeeper requires a minimum of three machines for fault tolerance. 
With 3 or 4 machines in the ensemble, you can survive one machine 
failure.  To survive two failures requires at least 5 machines.
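
The arithmetic behind that: an ensemble needs a strict majority (a quorum)
of its members online, so surviving f failures takes 2f+1 servers. A sketch
of pointing a Solr node at a three-machine ensemble, with a ZooKeeper chroot
per cluster so that several clusters can share the ensemble (host names and
the /cluster1 chroot are assumptions):

bin/solr start -c -z "zk1:2181,zk2:2181,zk3:2181/cluster1"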


Thanks,
Shawn


Re: xms/xmx choices

2019-12-05 Thread Shawn Heisey

On 12/5/2019 12:57 PM, David Hastings wrote:

That probably isn't enough data, so if you're interested:

https://gofile.io/?c=rZQ2y4


The previous one was less than 4 minutes, so it doesn't reveal anything 
useful.


This one is a little bit less than two hours.  That's more useful, but 
still pretty short.


Here's the "heap after GC" graph from the larger file:

https://www.dropbox.com/s/q9hs8fl0gfkfqi1/david.hastings.gc.graph.2019.12.png?dl=0

At around 14:15, the heap usage was rather high. It got up over 25GB. 
There were some very long GCs right at that time, which probably means 
they were full GCs.  And they didn't free up any significant amount of 
memory.  So I'm betting that sometimes you actually *do* need a big 
chunk of that 60GB of heap.  You might try reducing it to 31g instead of 
60g.  Java's memory usage is a lot more efficient if the max heap 
size is less than 32 GB.
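
The efficiency point is compressed object pointers: below roughly 32 GB the
JVM can use 32-bit references. A sketch of trying the smaller heap with the
default start script, where -m sets Xms and Xmx to the same value:

bin/solr stop -all
bin/solr start -m 31g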


I can't give you any information about what happened at that time which 
required so much heap.  You could see if you have logfiles that cover 
that timeframe.


Thanks,
Shawn


Re: xms/xmx choices

2019-12-05 Thread Paras Lehana
Hi David,

Your Xmx seems to be overkill, though without usage stats this cannot be
verified. I think you should analyze long GC pauses, given that you have so
much difference between the min and max. I prefer making the min and max the
same before stressing about the values. You can start with 20G, but what would
you do with the remaining memory?

PS: Your configuration is something I admire. :P

On Fri, 6 Dec 2019 at 01:56, David Hastings 
wrote:

> and if this may be of use:
> https://imgur.com/a/qXBuSxG
>
> just been more or less winging the options since solr 1.3
>
>
> On Thu, Dec 5, 2019 at 2:41 PM Shawn Heisey  wrote:
>
> > On 12/5/2019 11:58 AM, David Hastings wrote:
> > > as of now we do an xms of 8gb and xmx of 60gb, generally through the
> > > dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed
> to
> > be
> > > the same so thats the change #1 on my end, I am just concerned of
> > dropping
> > > it from 60 as thus far over the last few years I have had no problems
> nor
> > > performance issues.  I know its said a lot of times to make it lower
> and
> > > let the OS use the ram for caching the file system/index files, so my
> > first
> > > experiment was going to be around 20gb, was wondering if this seems
> > sound,
> > > or should i go even lower?
> >
> > The Xms and Xmx settings should be the same so Java doesn't need to take
> > special action to increase the pool size when more than the minimum is
> > required.  Java tends to always increase to the maximum as it runs, so
> > there's usually little benefit to specifying a lower minimum than the
> > maximum.  With a 60GB max heap, Java is likely to grab a little more
> > than 60GB from the OS, regardless of how much heap is actually in use.
> >
> > If you can provide GC logs from Solr that cover a significant timeframe,
> > especially heavy indexing, we can analyze those and make an estimate
> > about the values you should have for Xms and Xmx.  It will only be a
> > guess ... something might happen later that requires more heap.
> >
> > We can't make recommendations without hard data.  The information you
> > provided isn't enough to guess how much heap you'll need.  Depending on
> > how such a system is used, a few GB might be enough, or you might need a
> > lot more.
> >
> >
> >
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > Thanks,
> > Shawn
> >
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

2019-12-05 Thread Paras Lehana
Hi Michael,

I think you only want to use FlattenGraphFilter *once* in the indexing
> analysis chain


I had been doing this for a long time before I finally shifted to using an FGF
after every GraphFilterFactory. Although I don't know much about it at the
code level, are you sure that all the following filters will be able to
consume a graph if we don't use an FGF after each graph factory?
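
For checking this outside the admin UI, a minimal sketch against the field
analysis handler (host, port, and core name are assumptions):

curl "http://localhost:8983/solr/mycore/analysis/field" \
  --data-urlencode "analysis.fieldtype=text_general" \
  --data-urlencode "analysis.fieldvalue=can't" \
  --data-urlencode "analysis.query=can't"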

On Fri, 6 Dec 2019 at 01:22, Eric Buss  wrote:

> Thanks for the reply,
>
> I wouldn't be surprised if the issue you linked is related, I also found
> another similar issue: https://issues.apache.org/jira/browse/LUCENE-8723
>
> You are absolutely right that the FlattenGraphFilter should only be used
> once, but as you noted the issue I am experiencing seems unrelated.
>
> On 2019-12-05, 10:23 AM, "Michael Gibney" 
> wrote:
>
> I wonder if this might be similar/related to the underlying problem
> that is intended to be addressed by
> https://issues.apache.org/jira/browse/LUCENE-8985?
>
> btw, I think you only want to use FlattenGraphFilter *once* in the
> indexing analysis chain, towards the end (after all components that
> emit graphs). ...though that's probably *not* what's causing the
> problem (based on the fact that the extra FGF doesn't seem to modify
> any attributes).
>
>
>
> On Mon, Nov 25, 2019 at 2:19 PM Eric Buss 
> wrote:
> >
> > Hi all,
> >
> > I have been trying to solve an issue where FlattenGraphFilter (FGF)
> removes
> > tokens produced by WordDelimiterGraphFilter (WDGF) - consequently
> searches that
> > contain the contraction "can't" do not match.
> >
> > This is on Solr version 7.7.1.
> >
> > The field in question is defined as follows:
> >
> >  stored="true"/>
> >
> > And the relevant fieldType "text_general":
> >
> >  positionIncrementGap="100">
> > 
> > 
> >  words="stopwords.txt"/>
> >  stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1"
> splitOnCaseChange="0"/>
> > 
> >  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > 
> >  class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
> > 
> > 
> > 
> >  words="stopwords.txt"/>
> >  stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0"
> splitOnCaseChange="0"/>
> >  words="stopwords.txt"/>
> >  class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
> > 
> > 
> >
> > Finally, the relevant entries in synonyms.txt are:
> >
> > can,cans
> > cants,cant
> >
> > Using the Solr console Analysis and "can't" as the Field Value, the
> following
> > tokens are produced (find the verbose output at the bottom of this
> email):
> >
> > Index
> > ST| can't
> > SF| can't
> > WDGF  | cant | can't | can | t
> > FGF   | cant | can't | can | t
> > SGF   | cants | cant | can't | | cans | can | t
> > ICUFF | cants | cant | can't | | cans | can | t
> > FGF   | cants | cant | can't | | t
> >
> > Query
> > ST| can't
> > SF| can't
> > WDGF  | can | t
> > SF| can | t
> > ICUFF | can | t
> >
> > As you can see after the FGF the tokens "can" and "cans" are pruned
> so the query
> > does not match. Is there a reasonable way to preserve these tokens?
> >
> > My key concern is that I want the "fix" for this to have as little
> impact on
> > other queries as possible.
> >
> > Some things I have checked/tried:
> >
> > Searching for similar problems I found this thread:
> >
> https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
> > Here it is suggested that FGF is not necessary (without any
> supporting
> > evidence). This goes directly against the documentation that states
> "If you use
> > [the SynonymGraphFilter] during indexing, you must follow it with a
> Flatten
> > Graph Filter":
> > https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
> > Despite this warning I tried out removing the FGF on a local
> > cluster and indeed it still runs and this search now works, however
> I am
> > paranoid that this will break far more things than it fixes.
> >
> > I have tried adding the FGF as a filter to the query. This does not
> eliminate
> > the "can" term in the query analysis.
> >
> > I have tested other contracted words. Some have this issue as well -
> > others do not. "haven't", "shouldn't", "couldn't", "I'll", "weren't",
> > "ain't" all preserve their tokens; "won't" does not. I believe the pattern
> > here is that whenever part of the contraction has synonyms, this problem
> > manifests.
> >
> > Eliminating WDGF is not viable as we rely on this functionality 

Re: From solr to solr cloud

2019-12-05 Thread Paras Lehana
Do you mean 500 cores? Tell us more about the data. How many documents per
core do you have, and what performance issues are you facing?

On Fri, 6 Dec 2019 at 01:01, David Hastings 
wrote:

> are you noticing performance decreases in stand alone solr as of now?
>
> On Thu, Dec 5, 2019 at 2:29 PM Vignan Malyala 
> wrote:
>
> > Hi
> > I currently have 500 collections in my stand alone solr. Bcoz of day by
> day
> > increase in Data, I want to convert it into solr cloud.
> > Can you suggest me how to do it successfully.
> > How many shards should be there?
> > How many nodes should be there?
> > Are so called nodes different machines i should take?
> > How many zoo keeper nodes should be there?
> > Are so called zoo keeper nodes different machines i should take?
> > Total how many machines i have to take to implement scalable solr cloud?
> >
> > Plz detail these questions. Any of documents on web aren't clear for
> > production environments.
> > Thanks in advance.
> >
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Re: Solr indexing performance

2019-12-05 Thread Paras Lehana
Can ulimit settings impact this? Review once.

On Thu, 5 Dec 2019 at 23:31, Shawn Heisey  wrote:

> On 12/5/2019 10:28 AM, Rahul Goswami wrote:
> > We have a Solr 7.2.1 Solr Cloud setup where the client is indexing in 5
> > parallel threads with 5000 docs per batch. This is a test setup and all
> > documents are indexed on the same node. We are seeing connection timeout
> > issues thereafter some time into indexing. I am yet to analyze GC pauses
> > and other possibilities, but as a guideline just wanted to know what
> > indexing rate might be "too high" for Solr so as to consider throttling ?
> > The documents are mostly metadata with about 25 odd fields, so not very
> > heavy.
> > Would be nice to know a baseline performance expectation for better
> > application design considerations.
>
> It's not really possible to give you a number here.  It depends on a lot
> of things, and every install is going to be different.
>
> On a setup that I once dealt with, where there was only a single thread
> doing the indexing, indexing on each core happened at about 1000 docs
> per second.  I've heard people mention rates well beyond that.  I've
> also heard people talk about rates of indexing far lower
> than what I was seeing.
>
> When you say "connection timeout" issues ... that could mean a couple of
> different things.  It could mean that the connection never gets
> established because it times out while trying, or it could mean that the
> connection gets established, and then times out after that.  Which are
> you seeing?  Usually dealing with that involves changing timeout
> settings on the client application.  Figuring out what's causing the
> delays that lead to the timeouts might be harder.  GC pauses are a
> primary candidate.
>
> There are typically two bottlenecks possible when indexing.  One is that
> the source system cannot supply the documents fast enough.  The other is
> that the Solr server is sitting mostly idle while the indexing program
> waits for an opportunity to send more documents.  The first is not
> something we can help you with.  The second is dealt with by making the
> indexing application multi-threaded or multi-process, or adding more
> threads/processes.
>
> Thanks,
> Shawn
>


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*



Re: [ANNOUNCE] Apache Solr 8.3.1 released

2019-12-05 Thread Paras Lehana
Yup, now reflected. :)

On Thu, 5 Dec, 2019, 19:43 Erick Erickson,  wrote:

> It’s there for me when I click on your link.
>
> > On Dec 5, 2019, at 1:08 AM, Paras Lehana 
> wrote:
> >
> > Hey Ishan,
> >
> > Cannot find 8.3.1 here: https://lucene.apache.org/solr/downloads.html
> (8.3.0
> > is listed here).
> >
> > Anyways, I'm downloading it from here:
> > https://archive.apache.org/dist/lucene/solr/8.3.1/
> >
> >
> >
> > On Wed, 4 Dec 2019 at 20:27, Rahul Goswami 
> wrote:
> >
> >> Thanks Ishan. I was just going through the list of fixes in 8.3.1
> >> (published in changes.txt) and couldn't see the below JIRA.
> >>
> >> SOLR-13971 : Velocity
> >> response writer's resource loading now possible only through startup
> >> parameters.
> >>
> >> Is it linked appropriately? Or is it some access rights issue for non-PMC
> >> members like me?
> >>
> >> Thanks,
> >> Rahul
> >>
> >>
> >> On Wed, Dec 4, 2019 at 7:12 AM Noble Paul  wrote:
> >>
> >>> Thanks ishan
> >>>
> >>> On Wed, Dec 4, 2019, 3:32 PM Ishan Chattopadhyaya <
> >>> ichattopadhy...@gmail.com>
> >>> wrote:
> >>>
>  ## 3 December 2019, Apache Solr™ 8.3.1 available
> 
>  The Lucene PMC is pleased to announce the release of Apache Solr
> 8.3.1.
> 
>  Solr is the popular, blazing fast, open source NoSQL search platform
>  from the Apache Lucene project. Its major features include powerful
>  full-text search, hit highlighting, faceted search, dynamic
>  clustering, database integration, rich document handling, and
>  geospatial search. Solr is highly scalable, providing fault tolerant
>  distributed search and indexing, and powers the search and navigation
>  features of many of the world's largest internet sites.
> 
>  Solr 8.3.1 is available for immediate download at:
> 
>   
> 
>  ### Solr 8.3.1 Release Highlights:
> 
>   * JavaBinCodec has concurrent modification of CharArr resulting in
>  corrupt internode updates
>   * findRequestType in AuditEvent is more robust
>   * CoreContainer.auditloggerPlugin is volatile now
>   * Velocity response writer's resource loading now possible only
>  through startup parameters
> 
> 
>  Please read CHANGES.txt for a full list of changes:
> 
>   
> 
>  Solr 8.3.1 also includes bugfixes in the corresponding Apache
>  Lucene release:
> 
>   
> 
>  Note: The Apache Software Foundation uses an extensive mirroring
> >> network
>  for
>  distributing releases. It is possible that the mirror you are using
> may
>  not have
>  replicated the release yet. If that is the case, please try another
> >>> mirror.
>  This also applies to Maven access.
> 
>  -
>  To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>  For additional commands, e-mail: dev-h...@lucene.apache.org
> 
> 
> >>>
> >>
> >
> >
> > --
> > --
> > Regards,
> >
> > *Paras Lehana* [65871]
> > Development Engineer, Auto-Suggest,
> > IndiaMART Intermesh Ltd.
> >
> > 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> > Noida, UP, IN - 201303
> >
> > Mob.: +91-9560911996
> > Work: 01203916600 | Extn:  *8173*
> >
>
>



Re: Re:Learning to rank - Bad Request

2019-12-05 Thread walia4
I am using Solr 8.2.0 in Cloud mode... but when I start with
*-Dsolr.ltr.enabled=true* it shows me the error
*techproducts_shard1_replica_n2:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Failed to create new ManagedResource /schema/model-store of type
org.apache.solr.ltr.store.rest.ManagedModelStore due to:
org.apache.solr.common.SolrException:
org.apache.solr.ltr.model.ModelException: Model type does not exist
org.apache.solr.ltr.model.LinearModel techproducts_shard2_replica_n6:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Failed to create new ManagedResource /schema/model-store of type
org.apache.solr.ltr.store.rest.ManagedModelStore due to:
org.apache.solr.common.SolrException:
org.apache.solr.ltr.model.ModelException: Model type does not exist
org.apache.solr.ltr.model.LinearModel Please check your logs for more
information*

Can you please provide a command to enable LTR in SolrCloud mode?
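
For reference, the ref guide enables the plugin for the techproducts example
by passing the property at startup; in cloud mode every node needs the
property (and the LTR contrib jars on its classpath), and the "Model type
does not exist" error usually means the solr-ltr jar was never loaded. A
sketch, with the second node's port and ZooKeeper address as assumptions:

bin/solr start -e techproducts -Dsolr.ltr.enabled=true

bin/solr start -c -Dsolr.ltr.enabled=true
bin/solr start -c -z localhost:9983 -p 8984 -Dsolr.ltr.enabled=true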



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: xms/xmx choices

2019-12-05 Thread David Hastings
and if this may be of use:
https://imgur.com/a/qXBuSxG

I've just been more or less winging the options since Solr 1.3.


On Thu, Dec 5, 2019 at 2:41 PM Shawn Heisey  wrote:

> On 12/5/2019 11:58 AM, David Hastings wrote:
> > as of now we do an xms of 8gb and xmx of 60gb, generally through the
> > dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed to
> be
> > the same so thats the change #1 on my end, I am just concerned of
> dropping
> > it from 60 as thus far over the last few years I have had no problems nor
> > performance issues.  I know its said a lot of times to make it lower and
> > let the OS use the ram for caching the file system/index files, so my
> first
> > experiment was going to be around 20gb, was wondering if this seems
> sound,
> > or should i go even lower?
>
> The Xms and Xmx settings should be the same so Java doesn't need to take
> special action to increase the pool size when more than the minimum is
> required.  Java tends to always increase to the maximum as it runs, so
> there's usually little benefit to specifying a lower minimum than the
> maximum.  With a 60GB max heap, Java is likely to grab a little more
> than 60GB from the OS, regardless of how much heap is actually in use.
>
> If you can provide GC logs from Solr that cover a significant timeframe,
> especially heavy indexing, we can analyze those and make an estimate
> about the values you should have for Xms and Xmx.  It will only be a
> guess ... something might happen later that requires more heap.
>
> We can't make recommendations without hard data.  The information you
> provided isn't enough to guess how much heap you'll need.  Depending on
> how such a system is used, a few GB might be enough, or you might need a
> lot more.
>
>
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Thanks,
> Shawn
>


Re: xms/xmx choices

2019-12-05 Thread David Hastings
That probably isn't enough data, so if you're interested:

https://gofile.io/?c=rZQ2y4

On Thu, Dec 5, 2019 at 2:52 PM David Hastings 
wrote:

> I know theres no hard answer, and I know the Xms and Xmx should be the
> same, but it was a set it and forget it sort of thing from years ago.  I
> will definitely be changing it but figured I may as well figure out as
> much as possible from this user group resource.
> as far as the raw GC data goes:
> https://pastebin.com/vBtpYR1W
>
> (i dont know if people still use pastebin)  i can get more if needed.  the
> systems dont do ANY indexing at all, they are search only slaves.  they
> share resources only with a DB install, and one node will never do both
> live search and live DB.  If theres any more info youd like I would be
> happy to provide, this is interesting.
>
> On Thu, Dec 5, 2019 at 2:41 PM Shawn Heisey  wrote:
>
>> On 12/5/2019 11:58 AM, David Hastings wrote:
>> > as of now we do an xms of 8gb and xmx of 60gb, generally through the
>> > dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed
>> to be
>> > the same so thats the change #1 on my end, I am just concerned of
>> dropping
>> > it from 60 as thus far over the last few years I have had no problems
>> nor
>> > performance issues.  I know its said a lot of times to make it lower and
>> > let the OS use the ram for caching the file system/index files, so my
>> first
>> > experiment was going to be around 20gb, was wondering if this seems
>> sound,
>> > or should i go even lower?
>>
>> The Xms and Xmx settings should be the same so Java doesn't need to take
>> special action to increase the pool size when more than the minimum is
>> required.  Java tends to always increase to the maximum as it runs, so
>> there's usually little benefit to specifying a lower minimum than the
>> maximum.  With a 60GB max heap, Java is likely to grab a little more
>> than 60GB from the OS, regardless of how much heap is actually in use.
>>
>> If you can provide GC logs from Solr that cover a significant timeframe,
>> especially heavy indexing, we can analyze those and make an estimate
>> about the values you should have for Xms and Xmx.  It will only be a
>> guess ... something might happen later that requires more heap.
>>
>> We can't make recommendations without hard data.  The information you
>> provided isn't enough to guess how much heap you'll need.  Depending on
>> how such a system is used, a few GB might be enough, or you might need a
>> lot more.
>>
>>
>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> Thanks,
>> Shawn
>>
>


Re: xms/xmx choices

2019-12-05 Thread David Hastings
I know there's no hard answer, and I know the Xms and Xmx should be the
same, but it was a set-it-and-forget-it sort of thing from years ago.  I
will definitely be changing it, but figured I may as well learn as
much as possible from this user group resource.
As far as the raw GC data goes:
https://pastebin.com/vBtpYR1W

(I don't know if people still use pastebin.)  I can get more if needed.  The
systems don't do ANY indexing at all; they are search-only slaves.  They
share resources only with a DB install, and one node will never do both
live search and live DB.  If there's any more info you'd like I would be
happy to provide it - this is interesting.

On Thu, Dec 5, 2019 at 2:41 PM Shawn Heisey  wrote:

> On 12/5/2019 11:58 AM, David Hastings wrote:
> > as of now we do an xms of 8gb and xmx of 60gb, generally through the
> > dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed to
> be
> > the same so thats the change #1 on my end, I am just concerned of
> dropping
> > it from 60 as thus far over the last few years I have had no problems nor
> > performance issues.  I know its said a lot of times to make it lower and
> > let the OS use the ram for caching the file system/index files, so my
> first
> > experiment was going to be around 20gb, was wondering if this seems
> sound,
> > or should i go even lower?
>
> The Xms and Xmx settings should be the same so Java doesn't need to take
> special action to increase the pool size when more than the minimum is
> required.  Java tends to always increase to the maximum as it runs, so
> there's usually little benefit to specifying a lower minimum than the
> maximum.  With a 60GB max heap, Java is likely to grab a little more
> than 60GB from the OS, regardless of how much heap is actually in use.
>
> If you can provide GC logs from Solr that cover a significant timeframe,
> especially heavy indexing, we can analyze those and make an estimate
> about the values you should have for Xms and Xmx.  It will only be a
> guess ... something might happen later that requires more heap.
>
> We can't make recommendations without hard data.  The information you
> provided isn't enough to guess how much heap you'll need.  Depending on
> how such a system is used, a few GB might be enough, or you might need a
> lot more.
>
>
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Thanks,
> Shawn
>


Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

2019-12-05 Thread Eric Buss
Thanks for the reply,

I wouldn't be surprised if the issue you linked is related; I also found
another similar issue: https://issues.apache.org/jira/browse/LUCENE-8723

You are absolutely right that the FlattenGraphFilter should only be used once,
but as you noted, the issue I am experiencing seems unrelated.

On 2019-12-05, 10:23 AM, "Michael Gibney"  wrote:

I wonder if this might be similar/related to the underlying problem
that is intended to be addressed by
https://issues.apache.org/jira/browse/LUCENE-8985?

btw, I think you only want to use FlattenGraphFilter *once* in the
indexing analysis chain, towards the end (after all components that
emit graphs). ...though that's probably *not* what's causing the
problem (based on the fact that the extra FGF doesn't seem to modify
any attributes).



On Mon, Nov 25, 2019 at 2:19 PM Eric Buss  wrote:
>
> Hi all,
>
> I have been trying to solve an issue where FlattenGraphFilter (FGF) 
removes
> tokens produced by WordDelimiterGraphFilter (WDGF) - consequently 
searches that
> contain the contraction "can't" do not match.
>
> This is on Solr version 7.7.1.
>
> The field in question is defined as follows:
>
> 
>
> And the relevant fieldType "text_general":
>
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>
> Finally, the relevant entries in synonyms.txt are:
>
> can,cans
> cants,cant
>
> Using the Solr console Analysis and "can't" as the Field Value, the 
following
> tokens are produced (find the verbose output at the bottom of this email):
>
> Index
> ST| can't
> SF| can't
> WDGF  | cant | can't | can | t
> FGF   | cant | can't | can | t
> SGF   | cants | cant | can't | | cans | can | t
> ICUFF | cants | cant | can't | | cans | can | t
> FGF   | cants | cant | can't | | t
>
> Query
> ST| can't
> SF| can't
> WDGF  | can | t
> SF| can | t
> ICUFF | can | t
>
> As you can see after the FGF the tokens "can" and "cans" are pruned so 
the query
> does not match. Is there a reasonable way to preserve these tokens?
>
> My key concern is that I want the "fix" for this to have as little impact 
on
> other queries as possible.
>
> Some things I have checked/tried:
>
> Searching for similar problems I found this thread:
> 
https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
> Here it is suggested that FGF is not necessary (without any supporting
> evidence). This goes directly against the documentation that states "If 
you use
> [the SynonymGraphFilter] during indexing, you must follow it with a 
Flatten
> Graph Filter":
> https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
> Despite this warning I tried out removing the FGF on a local
> cluster and indeed it still runs and this search now works, however I am
> paranoid that this will break far more things than it fixes.
>
> I have tried adding the FGF as a filter to the query. This does not 
eliminate
> the "can" term in the query analysis.
>
> I have tested other contracted words. Some have this issue as well - others
> do not. "haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all
> preserve their tokens; "won't" does not. I believe the pattern here is that
> whenever part of the contraction has synonyms, this problem manifests.
>
> Eliminating WDGF is not viable as we rely on this functionality for other 
uses
> of delimiters (such as wi-fi -> wi fi).
>
> Performing WDGF after synonyms is also not viable as in the case that we 
have
> the data "historical-text" we want this to match the search "history 
text".
>
> The hacky solution I have found is to use the PatternReplaceFilterFactory 
to
> replace "can't" with "cant". Though this technically solves the issue, I 
hope it
> is obvious why this does not feel like an ideal solution.
>
> Has anyone encountered this type of issue before? Any advice on how the 
filter
> use here could be improved to handle this case?
>
> Thanks,
> Eric Buss
>
>
> PS. The verbose output from Analysis of "can't"
>
> Index
>
> ST| text  | can't|
>   | raw_bytes | [63 61 6e 27 74] |
>   | start | 0|
>   | end   | 5|
>   | positionLength| 1|
>   | type  ||
>   | termFrequency | 1|
>   | position  | 1|
   

Re: xms/xmx choices

2019-12-05 Thread Shawn Heisey

On 12/5/2019 11:58 AM, David Hastings wrote:

as of now we do an Xms of 8gb and Xmx of 60gb; generally through the
dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed to be
the same, so that's change #1 on my end. I am just concerned about dropping
it from 60, as thus far over the last few years I have had no problems nor
performance issues.  I know it's said a lot of times to make it lower and
let the OS use the RAM for caching the file system/index files, so my first
experiment was going to be around 20gb. Was wondering if this seems sound,
or should I go even lower?


The Xms and Xmx settings should be the same so Java doesn't need to take 
special action to increase the pool size when more than the minimum is 
required.  Java tends to always increase to the maximum as it runs, so 
there's usually little benefit to specifying a lower minimum than the 
maximum.  With a 60GB max heap, Java is likely to grab a little more 
than 60GB from the OS, regardless of how much heap is actually in use.


If you can provide GC logs from Solr that cover a significant timeframe, 
especially heavy indexing, we can analyze those and make an estimate 
about the values you should have for Xms and Xmx.  It will only be a 
guess ... something might happen later that requires more heap.


We can't make recommendations without hard data.  The information you 
provided isn't enough to guess how much heap you'll need.  Depending on 
how such a system is used, a few GB might be enough, or you might need a 
lot more.


https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Thanks,
Shawn


Re: From solr to solr cloud

2019-12-05 Thread David Hastings
are you noticing performance decreases in stand alone solr as of now?

On Thu, Dec 5, 2019 at 2:29 PM Vignan Malyala  wrote:

> Hi
> I currently have 500 collections in my stand alone solr. Bcoz of day by day
> increase in Data, I want to convert it into solr cloud.
> Can you suggest me how to do it successfully.
> How many shards should be there?
> How many nodes should be there?
> Are so called nodes different machines i should take?
> How many zoo keeper nodes should be there?
> Are so called zoo keeper nodes different machines i should take?
> Total how many machines i have to take to implement scalable solr cloud?
>
> Plz detail these questions. Any of documents on web aren't clear for
> production environments.
> Thanks in advance.
>


From solr to solr cloud

2019-12-05 Thread Vignan Malyala
Hi
I currently have 500 collections in my standalone Solr. Because of the
day-by-day increase in data, I want to convert it into SolrCloud.
Can you suggest how to do it successfully?
How many shards should there be?
How many nodes should there be?
Are the so-called nodes different machines I should use?
How many ZooKeeper nodes should there be?
Are the so-called ZooKeeper nodes different machines I should use?
In total, how many machines do I have to use to implement a scalable SolrCloud?

Please go into detail on these questions; the documents on the web aren't
clear for production environments.
Thanks in advance.


xms/xmx choices

2019-12-05 Thread David Hastings
Hey all, over time I've adjusted and changed the Solr Xms/Xmx various times
with not too much thought aside from "more is better", but I've noticed in
many of the emails that the recommended values are much lower than the numbers
I've historically put in.  I never really bothered to change them, as the
performance was always more than acceptable.  Until now: we just got a
memory upgrade on our Solr nodes, so I figure I may as well do it right.

So I'm sitting at around
580gb core
150gb core
270gb core
300gb core
depending on merges etc., with around 50k-100k searches a day depending on
the time of year/school calendar.
The three live nodes each have 4tb of decent SSDs that hold the indexes,
and we just went from 148gb to 288gb of memory.
As of now we do an Xms of 8gb and Xmx of 60gb; generally through the
dashboard the JVM hangs around 16gb.  I know Xms and Xmx are supposed to be
the same, so that's change #1 on my end. I am just concerned about dropping
it from 60, as thus far over the last few years I have had no problems nor
performance issues.  I know it's said a lot of times to make it lower and
let the OS use the RAM for caching the file system/index files, so my first
experiment was going to be around 20gb. Was wondering if this seems sound,
or should I go even lower?
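
Concretely, a sketch of what change #1 looks like (20gb is the experiment
size from above; in solr.in.sh, SOLR_JAVA_MEM sets both values explicitly,
or the -m start flag does the same thing):

SOLR_JAVA_MEM="-Xms20g -Xmx20g"

bin/solr start -m 20g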

Thanks, always good learning with this email group.
-Dave


Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

2019-12-05 Thread Michael Gibney
I wonder if this might be similar/related to the underlying problem
that is intended to be addressed by
https://issues.apache.org/jira/browse/LUCENE-8985?

btw, I think you only want to use FlattenGraphFilter *once* in the
indexing analysis chain, towards the end (after all components that
emit graphs). ...though that's probably *not* what's causing the
problem (based on the fact that the extra FGF doesn't seem to modify
any attributes).



On Mon, Nov 25, 2019 at 2:19 PM Eric Buss  wrote:
>
> Hi all,
>
> I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
> tokens produced by WordDelimiterGraphFilter (WDGF) - consequently searches 
> that
> contain the contraction "can't" do not match.
>
> This is on Solr version 7.7.1.
>
> The field in question is defined as follows:
>
> 
>
> And the relevant fieldType "text_general":
>
>  positionIncrementGap="100">
> 
> 
>  words="stopwords.txt"/>
>  stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1" 
> splitOnCaseChange="0"/>
> 
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> 
>  class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
> 
> 
> 
>  words="stopwords.txt"/>
>  stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0" 
> splitOnCaseChange="0"/>
>  words="stopwords.txt"/>
>  class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
> 
> 
>
> Finally, the relevant entries in synonyms.txt are:
>
> can,cans
> cants,cant
>
> Using the Solr console Analysis and "can't" as the Field Value, the following
> tokens are produced (find the verbose output at the bottom of this email):
>
> Index
> ST| can't
> SF| can't
> WDGF  | cant | can't | can | t
> FGF   | cant | can't | can | t
> SGF   | cants | cant | can't | | cans | can | t
> ICUFF | cants | cant | can't | | cans | can | t
> FGF   | cants | cant | can't | | t
>
> Query
> ST| can't
> SF| can't
> WDGF  | can | t
> SF| can | t
> ICUFF | can | t
>
> As you can see after the FGF the tokens "can" and "cans" are pruned so the 
> query
> does not match. Is there a reasonable way to preserve these tokens?
>
> My key concern is that I want the "fix" for this to have as little impact on
> other queries as possible.
>
> Some things I have checked/tried:
>
> Searching for similar problems I found this thread:
> https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
> Here it is suggested that FGF is not necessary (without any supporting
> evidence). This goes directly against the documentation that states "If you 
> use
> [the SynonymGraphFilter] during indexing, you must follow it with a Flatten
> Graph Filter":
> https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
> Despite this warning I tried out removing the FGF on a local
> cluster and indeed it still runs and this search now works, however I am
> paranoid that this will break far more things than it fixes.
>
> I have tried adding the FGF as a filter to the query. This does not eliminate
> the "can" term in the query analysis.
>
> I have tested other contracted words. Some have this issue as well - others do
> not. "haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all
> preserve their tokens; "won't" does not. I believe the pattern here is that
> whenever part of the contraction has synonyms, this problem manifests.
>
> Eliminating WDGF is not viable as we rely on this functionality for other uses
> of delimiters (such as wi-fi -> wi fi).
>
> Performing WDGF after synonyms is also not viable as in the case that we have
> the data "historical-text" we want this to match the search "history text".
>
> The hacky solution I have found is to use the PatternReplaceFilterFactory to
> replace "can't" with "cant". Though this technically solves the issue, I hope 
> it
> is obvious why this does not feel like an ideal solution.
>
> Has anyone encountered this type of issue before? Any advice on how the filter
> use here could be improved to handle this case?
>
> Thanks,
> Eric Buss
>
>
> PS. The verbose output from Analysis of "can't"
>
> Index
>
> ST| text  | can't|
>   | raw_bytes | [63 61 6e 27 74] |
>   | start | 0|
>   | end   | 5|
>   | positionLength| 1|
>   | type  ||
>   | termFrequency | 1|
>   | position  | 1|
> SF| text  | can't|
>   | raw_bytes | [63 61 6e 27 74] |
>   | start | 0|
>   | end   | 5|
>   | positionLength| 1|
>   | type  ||
>   | termFrequency | 1|
>   | position  | 1|
> WDGF  | text  | cant  | can't| 

Re: Solr indexing performance

2019-12-05 Thread Shawn Heisey

On 12/5/2019 10:28 AM, Rahul Goswami wrote:

We have a Solr 7.2.1 SolrCloud setup where the client is indexing in 5
parallel threads with 5000 docs per batch. This is a test setup and all
documents are indexed on the same node. We are seeing connection timeout
issues some time into indexing. I have yet to analyze GC pauses
and other possibilities, but as a guideline I just wanted to know what
indexing rate might be "too high" for Solr, so as to consider throttling.
The documents are mostly metadata with about 25-odd fields, so not very
heavy.
It would be nice to know a baseline performance expectation for better
application design considerations.


It's not really possible to give you a number here.  It depends on a lot 
of things, and every install is going to be different.


On a setup that I once dealt with, where there was only a single thread 
doing the indexing, indexing on each core happened at about 1000 docs 
per second.  I've heard people mention rates well beyond that.  I've 
also heard people talk about rates of indexing far lower 
than what I was seeing.


When you say "connection timeout" issues ... that could mean a couple of 
different things.  It could mean that the connection never gets 
established because it times out while trying, or it could mean that the 
connection gets established, and then times out after that.  Which are 
you seeing?  Usually dealing with that involves changing timeout 
settings on the client application.  Figuring out what's causing the 
delays that lead to the timeouts might be harder.  GC pauses are a 
primary candidate.


There are typically two bottlenecks possible when indexing.  One is that 
the source system cannot supply the documents fast enough.  The other is 
that the Solr server is sitting mostly idle while the indexing program 
waits for an opportunity to send more documents.  The first is not 
something we can help you with.  The second is dealt with by making the 
indexing application multi-threaded or multi-process, or adding more 
threads/processes.
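
A crude sketch of that second fix from the shell - several posting
processes fanned out at once (file and collection names are placeholders):

for f in batch1.xml batch2.xml batch3.xml; do
  curl "http://localhost:8983/solr/test/update" \
    -H 'Content-type:text/xml' --data-binary @"$f" &
done
wait   # block until all background posts finish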


Thanks,
Shawn


Re: Solr indexing performance

2019-12-05 Thread Vincenzo D'Amore
Hi, are the clients reusing their SolrClient?

Ciao,
Vincenzo

--
mobile: 3498513251
skype: free.dev

> On 5 Dec 2019, at 18:28, Rahul Goswami  wrote:
> 
> Hello,
> 
> We have a Solr 7.2.1 Solr Cloud setup where the client is indexing in 5
> parallel threads with 5000 docs per batch. This is a test setup and all
> documents are indexed on the same node. We are seeing connection timeout
> issues thereafter some time into indexing. I am yet to analyze GC pauses
> and other possibilities, but as a guideline just wanted to know what
> indexing rate might be "too high" for Solr so as to consider throttling ?
> The documents are mostly metadata with about 25 odd fields, so not very
> heavy.
> Would be nice to know a baseline performance expectation for better
> application design considerations.
> 
> Thanks,
> Rahul


Solr indexing performance

2019-12-05 Thread Rahul Goswami
Hello,

We have a Solr 7.2.1 SolrCloud setup where the client is indexing in 5
parallel threads with 5000 docs per batch. This is a test setup and all
documents are indexed on the same node. We are seeing connection timeout
issues some time into indexing. I have yet to analyze GC pauses
and other possibilities, but as a guideline I just wanted to know what
indexing rate might be "too high" for Solr, so as to consider throttling.
The documents are mostly metadata with about 25-odd fields, so not very
heavy.
It would be nice to know a baseline performance expectation for better
application design considerations.

Thanks,
Rahul


Re: [Q] Faster Atomic Updates - use docValues?

2019-12-05 Thread Erick Erickson
>  I think I should have also done optimize between batches, no?

No, no, no, no. Absolutely not. Never. Never, never, never between batches.
I don’t  recommend optimizing at _all_ unless there are demonstrable
improvements.

Please don’t take this the wrong way, the whole merge process is really
hard to get your head around. But the very fact that you’d suggest
optimizing between batches shows that the entire merge process is
opaque to you. I’ve seen many people just start changing things and
get themselves into a bad place, then try to change more things to get
out of that hole. Rinse. Repeat.

I _strongly_ recommend that you undo all your changes. Neither
commit nor optimize from outside Solr. Set your autocommit
settings to something like 5 minutes with openSearcher=true.
Set all autowarm counts in your caches in solrconfig.xml to 0,
especially filterCache and queryResultCache.

Do not set soft commit at all, leave it at -1.

Repeat do _not_ commit or optimize from the client! Just let your
autocommit settings do the commits.
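
As a sketch, those settings applied through the Config API instead of by
hand-editing solrconfig.xml (the core name and the 5-minute value come from
this thread; the property names are from the Config API's editable-properties
list):

curl "http://localhost:8983/solr/product/config" \
  -H 'Content-type:application/json' -d '{
  "set-property": {
    "updateHandler.autoCommit.maxTime": 300000,
    "updateHandler.autoCommit.openSearcher": true,
    "updateHandler.autoSoftCommit.maxTime": -1,
    "query.filterCache.autowarmCount": 0,
    "query.queryResultCache.autowarmCount": 0
  }
}'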

It’s also pushing things to send 5M docs in a single XML packet.
That all has to be held in memory and then indexed, adding to
pressure on the heap. I usually index from SolrJ in batches
of 1,000. See:
https://lucidworks.com/post/indexing-with-solrj/

Simply put, your slowdown should not be happening. I strongly
believe that it’s something in your environment, most likely
1> your changes eventually shoot you in the foot OR
2> you are running in too little memory and eventually GC is killing you. 
Really, analyze your GC logs. OR
3> you are running on underpowered hardware which just can’t take the load OR
4> something else in your environment

I’ve never heard of a Solr installation with such a massive slowdown during
indexing that was fixed by tweaking things like the merge policy etc.

Best,
Erick


> On Dec 5, 2019, at 12:57 AM, Paras Lehana  wrote:
> 
> Hey Erick,
> 
> This is a huge red flag to me: "(but I could only test for the first few
>> thousand documents”.
> 
> 
> Yup, that's probably where the culprit lies. I could only test for the
> starting batch because I had to wait for a day to actually compare. I
> tweaked the merge values and kept whatever gave a speed boost. My first
> batch of 5 million docs took only 40 minutes (atomic updates included) and
> the last batch of 5 million took more than 18 hours. If this is an issue of
> mergePolicy, I think I should have also done optimize between batches, no?
> I remember, when I indexed a single XML of 80 million after optimizing the
> core already indexed with 30 XMLs of 5 million each, I could post 80
> million in a day only.
> 
> 
> 
>> The indexing rate you’re seeing is abysmal unless these are _huge_
>> documents
> 
> 
> Documents only contain the suggestion name, possible titles,
> phonetics/spellcheck/synonym fields and numerical fields for boosting. They
> are far smaller than what a Search Document would contain. Auto-Suggest is
> only concerned about suggestions so you can guess how simple the documents
> would be.
> 
> 
> Some data is held on the heap and some in the OS RAM due to MMapDirectory
> 
> 
> I'm using StandardDirectory (which will make Solr choose the right
> implementation). Also, planning to read more about these (looking forward
> to use MMap). Thanks for the article!
> 
> 
> You're right. I should change one thing at a time. Let me experiment and
> then I will summarize here what I tried. Thank you for your responses. :)
> 
> On Wed, 4 Dec 2019 at 20:31, Erick Erickson  wrote:
> 
>> This is a huge red flag to me: "(but I could only test for the first few
>> thousand documents”
>> 
>> You’re probably right that that would speed things up, but pretty soon
>> when you’re indexing
>> your entire corpus there are lots of other considerations.
>> 
>> The indexing rate you’re seeing is abysmal unless these are _huge_
>> documents, but you
>> indicate that at the start you’re getting 1,400 docs/second so I don’t
>> think the complexity
>> of the docs is the issue here.
>> 
>> Do note that when we’re throwing RAM figures out, we need to draw a sharp
>> distinction
>> between Java heap and total RAM. Some data is held on the heap and some in
>> the OS
>> RAM due to MMapDirectory, see Uwe’s excellent article:
>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>> 
>> Uwe recommends about 25% of your available physical RAM be allocated to
>> Java as
>> a starting point. Your particular Solr installation may need a larger
>> percent, IDK.
>> 
>> But basically I’d go back to all default settings and change one thing at
>> a time.
>> First, I’d look at GC performance. Is it taking all your CPU? In which
>> case you probably need to
>> increase your heap. I pick this first because it’s very common that this
>> is a root cause.
>> 
>> Next, I’d put a profiler on it to see exactly where I’m spending time.
>> Otherwise you wind
>> up making random changes and hoping one of them works.
>> 

Re: [ANNOUNCE] Apache Solr 8.3.1 released

2019-12-05 Thread Erick Erickson
It’s there for me when I click on your link.

> On Dec 5, 2019, at 1:08 AM, Paras Lehana  wrote:
> 
> Hey Ishan,
> 
> Cannot find 8.3.1 here: https://lucene.apache.org/solr/downloads.html (8.3.0
> is listed here).
> 
> Anyways, I'm downloading it from here:
> https://archive.apache.org/dist/lucene/solr/8.3.1/
> 
> 
> 
> On Wed, 4 Dec 2019 at 20:27, Rahul Goswami  wrote:
> 
>> Thanks Ishan. I was just going through the list of fixes in 8.3.1
>> (published in changes.txt) and couldn't see the below JIRA.
>> 
>> SOLR-13971 : Velocity
>> response writer's resource loading now possible only through startup
>> parameters.
>> 
>> Is it linked appropriately? Or is it some access rights issue for non-PMC
>> members like me?
>> 
>> Thanks,
>> Rahul
>> 
>> 
>> On Wed, Dec 4, 2019 at 7:12 AM Noble Paul  wrote:
>> 
>>> Thanks ishan
>>> 
>>> On Wed, Dec 4, 2019, 3:32 PM Ishan Chattopadhyaya <
>>> ichattopadhy...@gmail.com>
>>> wrote:
>>> 
 ## 3 December 2019, Apache Solr™ 8.3.1 available
 
 The Lucene PMC is pleased to announce the release of Apache Solr 8.3.1.
 
 Solr is the popular, blazing fast, open source NoSQL search platform
 from the Apache Lucene project. Its major features include powerful
 full-text search, hit highlighting, faceted search, dynamic
 clustering, database integration, rich document handling, and
 geospatial search. Solr is highly scalable, providing fault tolerant
 distributed search and indexing, and powers the search and navigation
 features of many of the world's largest internet sites.
 
 Solr 8.3.1 is available for immediate download at:
 
  
 
 ### Solr 8.3.1 Release Highlights:
 
  * JavaBinCodec has concurrent modification of CharArr resulting in
 corrupt internode updates
  * findRequestType in AuditEvent is more robust
  * CoreContainer.auditloggerPlugin is volatile now
  * Velocity response writer's resource loading now possible only
 through startup parameters
 
 
 Please read CHANGES.txt for a full list of changes:
 
  
 
 Solr 8.3.1 also includes bugfixes in the corresponding Apache
 Lucene release:
 
  
 
 Note: The Apache Software Foundation uses an extensive mirroring
>> network
 for
 distributing releases. It is possible that the mirror you are using may
 not have
 replicated the release yet. If that is the case, please try another
>>> mirror.
 This also applies to Maven access.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
 For additional commands, e-mail: dev-h...@lucene.apache.org
 
 
>>> 
>> 
> 
> 
> -- 
> -- 
> Regards,
> 
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
> 
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
> 
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
> 
> -- 
> *
> *
> 
> 



Re: shard.preference for single shard queries

2019-12-05 Thread Tomás Fernández Löbbe
Look at SOLR-12217; it explains the limitation and has a patch for the SolrJ
cases. It should be merged soon.

Note that the combination of replica types you are describing is not
recommended. See
https://lucene.apache.org/solr/guide/8_1/shards-and-indexing-data-in-solrcloud.html#combining-replica-types-in-a-cluster
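
For reference, the parameter itself on a distributed request looks like this
(collection name assumed); per SOLR-12217 it is not yet honored when a
single-shard query goes directly to a core, which is the case described
below:

curl "http://localhost:8983/solr/mycol/select?q=*:*&shard.preference=replica.type:PULL"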


On Thu, Dec 5, 2019 at 5:58 AM spanchal 
wrote:

> Hi all. Thanks to SOLR-11982 we can now give a Solr parameter to sort
> replicas when returning results, but ONLY for distributed queries, as per
> the documentation. May I know the reason for this limitation?
>
> As my setup, I have 3 replicas(2 NRT, 1 PULL) of a single shard on 3
> different machines. Since NRT replicas might be busy with indexing, I would
> like my queries to land on PULL replica as a preferred option. And
> shard.preference=replica.type:PULL is not working in my case.
> Please help, thanks.
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Enabling LTR in SolrCloud (Solr version 8.2)

2019-12-05 Thread walia4
I am trying to work with SolrCloud and I have to use Learning to Rank (LTR)
models and features for my project. But I am facing this issue of *SolrCore
Initialization Failures*:

*techproducts_shard1_replica_n2:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Failed to create new ManagedResource /schema/model-store of type
org.apache.solr.ltr.store.rest.ManagedModelStore due to:
org.apache.solr.common.SolrException:
org.apache.solr.ltr.model.ModelException: Model type does not exist
org.apache.solr.ltr.model.LinearModel techproducts_shard2_replica_n6:
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Failed to create new ManagedResource /schema/model-store of type
org.apache.solr.ltr.store.rest.ManagedModelStore due to:
org.apache.solr.common.SolrException:
org.apache.solr.ltr.model.ModelException: Model type does not exist
org.apache.solr.ltr.model.LinearModel Please check your logs for more
information*



These are the Solr logs:

2019-12-04 12:44:05.760 ERROR
(searcherExecutor-15-thread-1-processing-n:192.168.137.1:8983_solr
x:techproducts_shard1_replica_n2 c:techproducts s:shard1 r:core_node5)
[c:techproducts s:shard1 r:core_node5 x:techproducts_shard1_replica_n2]
o.a.s.h.RequestHandlerBase java.lang.NullPointerException
at
org.apache.solr.handler.component.SearchHandler.initComponents(SearchHandler.java:183)
at
org.apache.solr.handler.component.SearchHandler.getComponents(SearchHandler.java:203)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:260)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2578)
at
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:74)
at
org.apache.solr.core.SolrCore.lambda$getSearcher$18(SolrCore.java:2344)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-12-04 12:44:05.760 ERROR
(searcherExecutor-14-thread-1-processing-n:192.168.137.1:8983_solr
x:techproducts_shard2_replica_n6 c:techproducts s:shard2 r:core_node8)
[c:techproducts s:shard2 r:core_node8 x:techproducts_shard2_replica_n6]
o.a.s.h.RequestHandlerBase java.lang.NullPointerException
at
org.apache.solr.handler.component.SearchHandler.initComponents(SearchHandler.java:183)
at
org.apache.solr.handler.component.SearchHandler.getComponents(SearchHandler.java:203)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:260)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2578)
at
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:74)
at
org.apache.solr.core.SolrCore.lambda$getSearcher$18(SolrCore.java:2344)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)





--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Search Performance and omitNorms

2019-12-05 Thread Odysci
Hi Erick,
thanks for the reply.
Just to follow up, I'm using the "unified" highlighter (fastVector does not
work for my purposes). I search and highlight on a multivalued string
field which contains small strings (usually less than 200 chars).
This multivalued field goes through various analysis steps (tokenizer, word
delimiter, stemming), and termVectors, termPositions and termOffsets are all
"true".
This is what I'm using:

-- schema --
[field definitions stripped by the mail archive]
-- schema --
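
For reference, a field definition consistent with the description above
might look like this (a sketch, not the poster's actual schema; the type
name "text_msearchp_type" and anything not mentioned in the prose are
assumptions):

<field name="text_msearchp" type="text_msearchp_type" indexed="true"
       stored="true" multiValued="true" omitNorms="false"
       termVectors="true" termPositions="true" termOffsets="true"/>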

And in the Java code I set the following params (the multivalued
field above is called "text_msearchp"):

SolrQuery solrQ = new SolrQuery();
solrQ.setFilterQueries( /* -- set some filters -- */ );
solrQ.setStart(0);
solrQ.setRows( /* -- set max rows -- */ );
// quote the searched string so it is matched as a phrase
solrQ.setQuery("text_msearchp" + ":(\"" + string_being_searched + "\")");
// activate highlighting
solrQ.setHighlight(true);
solrQ.setHighlightSnippets(500);   // normally this number is low

// set highlighter type
solrQ.setParam("hl.method", "unified");
// set highlight field to be the same as the search field
solrQ.setParam("hl.fl", "text_msearchp");
// set the term that will generate the highlight
solrQ.setParam("hl.q", "text_msearchp" + ":(\"" + string_being_searched + "\")");



Still, my tests indicate a significant speed up using omitNorms="false".
Best,

Reinaldo

On Tue, Dec 3, 2019 at 6:35 PM Erick Erickson 
wrote:

> I suspect this is spurious. Norms are just an encoding
> of the length of a field; offhand I have no clue how having
> them (or not) would affect highlighting at all.
>
> Term _vectors_ OTOH could have a major impact. If
> FastVectorHighlighter is not used, the highlighter has
> to re-analyze the text in order to highlight, and if you’re
> highlighting in large text fields that can be very expensive.
>
> Norms aren’t relevant there….
>
> So let’s see the full highlighter configuration you have, along
> with the field definition for the field you’re highlighting on.
>
> Best,
> Erick
>
> > On Dec 3, 2019, at 4:27 PM, Odysci  wrote:
> >
> > I'm using solr-8.3.1 on a solrcloud set up with 2 solr nodes and 2 ZK
> nodes.
> > I was experiencing very slow search-with-highlighting on a index that had
> > 'omitNorms="true"' on all fields.
> > At the suggestion of a stackoverflow post, I changed all fields to be
> > 'omitNorms="false"' and the search-with-highlight time came down to about
> > 1/10th of what it was!!!
> >
> > This was a relatively small index and I had no issues with memory
> increase.
> > Now my question is whether I should expect the same speed up on regular
> > search calls, or search with only filters (no query)?
> > This would be on a different, much larger index - and I don't want to incur
> > the memory increase unless the search is significantly faster.
> > Does anyone have any experience in comparing search speed using
> "omitNorms"
> > true or false in regular search (non-highlight)?
> > Thanks!
> >
> > Reinaldo
>
>


shard.preference for single shard queries

2019-12-05 Thread spanchal
Hi all, thanks to SOLR-11982 we can now give Solr a
parameter to sort replicas when returning results, but ONLY for distributed
queries as per the documentation. May I know the reason for this limitation?

In my setup, I have 3 replicas (2 NRT, 1 PULL) of a single shard on 3
different machines. Since the NRT replicas might be busy with indexing, I would
like my queries to land on the PULL replica as a preferred option, but
shard.preference=replica.type:PULL is not working in my case.
Please help, thanks.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html