Re: Apache Solr Reference Guide isn't accessible

2021-02-01 Thread Bernd Fehling

Yeah, but guide 8.8 is still buggy.

As I reported a month ago, "ICU Normalizer 2 Filter" states:
- NFC: ... Normalization Form C, canonical decomposition
- NFD: ... Normalization Form D, canonical decomposition, followed by canonical 
composition
- NFKC: ... Normalization Form KC, compatibility decomposition
- NFKD: ... Normalization Form KD, compatibility decomposition, followed by 
canonical composition

But the link to "Unicode Standard Annex #15" right above says:
- NFC: ... Normalization Form C, Canonical Decomposition, followed by Canonical 
Composition
- NFD: ... Normalization Form D, Canonical Decomposition
- NFKC: ... Normalization Form KC, Compatibility Decomposition, followed by 
Canonical Composition
- NFKD: ... Normalization Form KD, Compatibility Decomposition

But, well who cares.

Have a nice day.


Am 01.02.21 um 23:04 schrieb Cassandra Targett:

The problem causing this has been fixed and the docs should be available again.
On Feb 1, 2021, 2:15 PM -0600, Alexandre Rafalovitch , 
wrote:

And if you need something more recent while this is being fixed, you
can look right at the source on GitHub, though navigation, etc. is
missing:
https://github.com/apache/lucene-solr/blob/master/solr/solr-ref-guide/src/analyzers.adoc

Open Source :-)

Regards,
Alex.

On Mon, 1 Feb 2021 at 15:04, Mike Drob  wrote:


Hi Dorion,

We are currently working with our infra team to get these restored. In the
meantime, the 8.4 guide is still available at
https://lucene.apache.org/solr/guide/8_4/, and we are hopeful that the 8.8
guide will be back up soon. Thank you for your patience.

Mike

On Mon, Feb 1, 2021 at 1:58 PM Dorion Caroline 
wrote:


Hi,

I haven't been able to access the Apache Solr Reference Guide for a few days.
Example:
URL

* https://lucene.apache.org/solr/guide/8_8/
* https://lucene.apache.org/solr/guide/8_7/
Result:
Not Found
The requested URL was not found on this server.

Do you know what's going on?

Thanks
Caroline Dorion





Unicode Normalization and ICUNormalizer2Filter

2021-01-15 Thread Bernd Fehling

Hello list,

could it be that the Apache Solr Reference Guide is wrong in all versions?

Example:
https://lucene.apache.org/solr/guide/8_7/filter-descriptions.html#icu-normalizer-2-filter

NFC: (name="nfc" mode="compose") Normalization Form C, canonical decomposition
NFD: (name="nfc" mode="decompose") Normalization Form D, canonical 
decomposition, followed by canonical composition
...

versus:
https://unicode.org/reports/tr15/#Norm_Forms

Normalization Form C (NFC) - Canonical Decomposition, followed by Canonical 
Composition
Normalization Form D (NFD) - Canonical Decomposition
...


I assume that unicode.org is correct?

Can someone please check this and if needed update the Reference Guides?

Regards
Bernd


Re: Handling acronyms

2021-01-15 Thread Bernd Fehling

If you are using multiword synonyms, acronyms, ...
you should escape the spaces within the multiword entries.

As synonyms.txt:
SRN, Stroke\ Research\ Network
IGBP, isolated\ gastric\ bypass
...
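
For reference, a minimal sketch of the analyzer side this plugs into (field type name is a
placeholder, and SynonymGraphFilterFactory at query time is just the common setup, not
necessarily what Shaun uses):

  <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- synonyms.txt with the escaped multiword entries shown above -->
      <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>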

Regards
Bernd


Am 15.01.21 um 10:48 schrieb Shaun Campbell:

I have a medical journals search application and I've a list of some 9,000
acronyms like this:

MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
SRN=>SRN Stroke Research Network
IGBP=>IGBP isolated gastric bypass
TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive
sleep apnoea–hypopnoea
SRM=>SRM standardised response mean
SRT=>SRT substrate reduction therapy
SRS=>SRS Sexual Rating Scale
SRU=>SRU stroke rehabilitation unit
T2w=>T2w T2-weighted
Ab-P=>Ab-P Aberdeen participation restriction subscale
MSOA=>MSOA middle-layer super output area
SSA=>SSA site-specific assessment
SSC=>SSC Study Steering Committee
SSB=>SSB short-stretch bandage
SSE=>SSE sum squared error
SSD=>SSD social services department
NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument

I tried to put them in a synonyms file, either just with a comma between,
or with an arrow in between and the acronym repeated on the right like
above, and no matter what I try I'm getting really strange search results.
It's like words in one acronym are matching with the same word in another
acronym and then searching with that acronym which is completely unrelated.

I don't think Solr can handle this, but does anyone know of any crafty
tricks in Solr to handle this situation where I can either search by the
acronym or by the text?

Shaun



Re: Getting error "Bad Message 414 reason: URI Too Long"

2021-01-14 Thread Bernd Fehling

AFAIK, that could be a limit in Jetty that can be raised in jetty.xml.
You might check the Jetty docs and look for something like BufferSize.
At least for Solr 6.6.x
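
As a rough sketch, in more recent Solr versions the knob usually lives in server/etc/jetty.xml
(property name and default here are from memory and may differ per version):

  <!-- part of the HttpConfiguration in server/etc/jetty.xml -->
  <Set name="requestHeaderSize">
    <Property name="solr.jetty.request.header.size" default="8192" />
  </Set>

Raising that default (or passing -Dsolr.jetty.request.header.size=65536 where the property is
referenced) is the usual way to get past the 414 for long GET request lines.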

Regards
Bernd


Am 14.01.21 um 13:19 schrieb Abhay Kumar:

Thank you Nicolas. Yes, we are making Post request to Solr using SolrNet 
library.
The current request length is approx. 32K characters. I have tested with a 10K-character
request and it works fine.

Any suggestions on how to increase the allowed request length in the Solr configuration?

Thanks.
Abhay

-Original Message-
From: Nicolas Franck 
Sent: 14 January 2021 15:12
To: solr-user@lucene.apache.org
Subject: Re: Getting error "Bad Message 414 reason: URI Too Long"

Euh, sorry: I did not read your message well enough.
You did actually use a post request, with the parameters in the body
(your example suggests otherwise)


On 14 Jan 2021, at 10:37, Nicolas Franck  wrote:

I believe you can also access this path via an HTTP POST request.
That way you do not hit the URI size limit.

cf. 
https://stackoverflow.com/questions/2997014/can-you-use-post-to-run-a-query-in-solr-select

I think some solr libraries already use this approach (e.g.  WebService::Solr 
in perl)

On 14 Jan 2021, at 10:31, Abhay Kumar 
mailto:abhay.ku...@anjusoftware.com>> wrote:

Hello,

I am trying to post below query to Solr but getting error as “Bad Message 
414reason: URI Too Long”.

I am sending query using SolrNet library. Please suggest how to resolve this 
issue.
...





Re: different score from different replica of same shard

2021-01-13 Thread Bernd Fehling

Hello Markus,

thanks a lot.
Is TLOG also for SOLR 6.6.6 or only 8.x and up?

I will first try ExactStatsCache.
Should be added as invariant to request handler, right?

Comparing the replica index directories they have different size and
the index version and generation is different. Also Max Doc.
But Num Docs is the same.

Regards,
Bernd


Am 13.01.21 um 14:54 schrieb Markus Jelsma:

Hello Bernd,

This is normal for NRT replicas, because the way segments are merged and
deletes are removed is not synchronized between replicas. In that case
counts for TF and IDF and norms become slightly different.

You can either use ExactStatsCache, which fetches counts for terms before
scoring, so that all replicas use the same counts. Or change the replica
types to TLOG. With TLOG, segments are fetched from the leader and are thus
identical.
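
A minimal sketch of how ExactStatsCache is typically enabled, i.e. as a top-level element in
solrconfig.xml rather than a request-handler invariant:

  <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>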

Regards,
Markus

Op wo 13 jan. 2021 om 14:45 schreef Bernd Fehling <
bernd.fehl...@uni-bielefeld.de>:


Hello list,

a question for better understanding scoring of a shard in a cloud.

I see different scores from different replicas of the same shard.
Is this normal and if yes, why?

My understanding until now was that replicas are always the same within a
shard
and the same query to each replica within a shard gives always the same
score.

Can someone help me to understand this?

Regards
Bernd





different score from different replica of same shard

2021-01-13 Thread Bernd Fehling

Hello list,

a question for better understanding scoring of a shard in a cloud.

I see different scores from different replicas of the same shard.
Is this normal and if yes, why?

My understanding until now was that replicas are always the same within a shard
and the same query to each replica within a shard gives always the same score.

Can someone help me to understand this?

Regards
Bernd


Re: Solr8.7 Munin ?

2020-11-23 Thread Bernd Fehling

Hi Bruno,

yes, I use munin-solr plugin.
https://github.com/averni/munin-solr

I renamed it to solr_*.py on my servers.

Regards
Bernd


Am 23.11.20 um 09:54 schrieb Bruno Mannina:

Hello Bernd,

Do you use a specific plugin for Solr?

Thanks,
Bruno

-Message d'origine-
De : Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
Envoyé : lundi 23 novembre 2020 09:02
À : solr-user@lucene.apache.org
Objet : Re: Solr8.7 Munin ?

We are using Munin for years now for Solr monitoring.
Currently Munin 2.0.40 and SolrCloud 6.6.

Regards
Bernd


Am 20.11.20 um 21:02 schrieb Matheo Software:

Hello,

   


I would like to use Munin to check my Solr 8.7 but it doesn't work. I
tried to configure the Munin plugins without success.

   


Is anybody using Munin with a recent version of Solr? (version > 5.4)

   


Thanks a lot,

   


Cordialement, Best Regards

Bruno Mannina

   <http://www.matheo-software.com> www.matheo-software.com

   <http://www.patent-pulse.com> www.patent-pulse.com

Tél. +33 0 970 738 743

Mob. +33 0 634 421 817

   facebook: https://www.facebook.com/PatentPulse
   twitter: https://twitter.com/matheosoftware
   linkedin: https://www.linkedin.com/company/matheo-software
   youtube: https://www.youtube.com/user/MatheoSoftware

   









--
*********
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)               LibTec - Library Technology
Universitätsstr. 25  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de
  https://www.ub.uni-bielefeld.de/~befehl/

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


Re: Solr8.7 Munin ?

2020-11-23 Thread Bernd Fehling

We are using Munin for years now for Solr monitoring.
Currently Munin 2.0.40 and SolrCloud 6.6.

Regards
Bernd


Am 20.11.20 um 21:02 schrieb Matheo Software:

Hello,

  


I would like to use Munin to check my Solr 8.7 but it doesn't work. I tried to
configure the Munin plugins without success.

  


Is anybody using Munin with a recent version of Solr? (version > 5.4)

  


Thanks a lot,

  


Cordialement, Best Regards

Bruno Mannina

   www.matheo-software.com

   www.patent-pulse.com

Tél. +33 0 970 738 743

Mob. +33 0 634 421 817


  






Re: [CVE-2020-13957] The checks added to unauthenticated configset uploads in Apache Solr can be circumvented

2020-10-13 Thread Bernd Fehling
Good to know that Version 6.6.6 is not affected, so I am safe ;-)

Regards
Bernd

Am 12.10.20 um 20:38 schrieb Tomas Fernandez Lobbe:
> Severity: High
> 
> Vendor: The Apache Software Foundation
> 
> Versions Affected:
> 6.6.0 to 6.6.5
> 7.0.0 to 7.7.3
> 8.0.0 to 8.6.2
> 
> Description:
> Solr prevents some features considered dangerous (which could be used for
> remote code execution) to be configured in a ConfigSet that's uploaded via
> API without authentication/authorization. The checks in place to prevent
> such features can be circumvented by using a combination of UPLOAD/CREATE
> actions.
> 
> Mitigation:
> Any of the following are enough to prevent this vulnerability:
> * Disable UPLOAD command in ConfigSets API if not used by setting the
> system property: "configset.upload.enabled" to "false" [1]
> * Use Authentication/Authorization and make sure unknown requests aren't
> allowed [2]
> * Upgrade to Solr 8.6.3 or greater.
> * If upgrading is not an option, consider applying the patch in SOLR-14663
> ([3])
> * No Solr API, including the Admin UI, is designed to be exposed to
> non-trusted parties. Tune your firewall so that only trusted computers and
> people are allowed access
> 
> Credit:
> Tomás Fernández Löbbe, András Salamon
> 
> References:
> [1] https://lucene.apache.org/solr/guide/8_6/configsets-api.html
> [2]
> https://lucene.apache.org/solr/guide/8_6/authentication-and-authorization-plugins.html
> [3] https://issues.apache.org/jira/browse/SOLR-14663
> [4] https://issues.apache.org/jira/browse/SOLR-14925
> [5] https://wiki.apache.org/solr/SolrSecurity
> 
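
For the first mitigation, a minimal sketch of setting that property at startup (assuming the
stock bin/solr scripts and solr.in.sh):

  # solr.in.sh - disable the unauthenticated ConfigSets UPLOAD action
  SOLR_OPTS="$SOLR_OPTS -Dconfigset.upload.enabled=false"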


Re: Daylight savings time issue using NOW in Solr 6.1.0

2020-10-07 Thread Bernd Fehling
Hi,

because you are using solr.in.cmd I guess you are using Windows OS.
I don't know much about Solr and Windows but you can check your
Windows, Jetty and Solr time by looking at your solr-8983-console.log
file after starting Solr.
First the timestamp of the file itself, then the timestamp of the
log message leading each message and finally the timestamp within the
log message reporting the "Start time:".

Regards
Bernd


Am 07.10.20 um 08:12 schrieb vishal patel:
> Hi
> 
> I am using Solr 6.1.0. My SOLR_TIMEZONE=UTC  in solr.in.cmd.
> My current Solr server machine time zone is also UTC.
> 
> My one collection has below one field in schema.
>  docValues="true"/>
>  positionIncrementGap="0"/>
> Suppose my current Solr server machine time is 2020-10-01 10:00:00.000. I 
> have one document in that collection and in that document action_date is 
> 2020-10-01T09:45:46Z.
> When I search in Solr action_date:[2020-10-01T08:00:00Z TO NOW] , I cannot 
> return that record. I check my solr log and found that time was different 
> between Solr log time and solr server machine time.(almost 1 hours difference)
> 
> Why I cannot get the result? Why NOW is not taking the 2020-10-01T10:00:00Z?
> "NOW" takes which time? Is there difference due to daylight saving 
> time? How can I configure 
> or change timezone which consider daylight saving time?
> 


Re: How to persist the data in dataimport.properties

2020-09-09 Thread Bernd Fehling
It is kept in zookeeper within /configs/[collection_name], at least with my 
SolrCloud 6.6.6.

bin/solr zk ls /configs/[your_collection_name]
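
To look at or pull the file itself, something along these lines should work (bin/solr zk
syntax as of 6.6; collection/config name and ZooKeeper address are assumptions):

  bin/solr zk ls /configs/mycollection -z localhost:2181
  bin/solr zk cp zk:/configs/mycollection/dataimport.properties /tmp/dataimport.properties -z localhost:2181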

Regards
Bernd

Am 08.09.20 um 21:40 schrieb yaswanth kumar:
> Can someone help me with how to persist the data that's updated in the
> dataimport.properties file? It holds a last index time that my data
> import depends on for catching up on delta imports.
> 
> What I noticed is that every time when I restart solr this file is wiped
> out and getting its default content instead of what I used to see before
> solr service restart. So want to know if there is anything that I can do to
> persist the last successful index timestamp?
> 
> Solr version: 8.2
> Zookeeper: 3.4
> 


Re: Understanding Solr heap %

2020-09-02 Thread Bernd Fehling
You should _not_ set "-XX:G1HeapRegionSize=n", because:
"... The goal is to have around 2048 regions based on the minimum Java heap
size"
The value of G1HeapRegionSize is calculated automatically at JVM startup.

The parameter "-XX:MaxGCPauseMillis=200" is already the default.
What is the point of explicitly setting a parameter to its default
value?
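
Following that reasoning, a trimmed-down variant of the GC_TUNE block quoted below might look
like this (a sketch, not a recommendation for every workload):

  # solr.in.sh - let the JVM size G1 regions itself and keep the default pause goal
  GC_TUNE=" \
  -XX:+UseG1GC \
  -XX:+ParallelRefProcEnabled \
  -XX:+UseLargePages \
  "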

Regards
Bernd


Am 01.09.20 um 18:00 schrieb Walter Underwood:
> This is misleading and not particularly good advice.
> 
> Solr 8 does NOT contain G1. G1GC is a feature of the JVM. We’ve been using
> it with Java 8 and Solr 6.6.2 for a few years.
> 
> A test with eighty documents doesn’t test anything. Try a million documents to
> get Solr memory usage warmed up.
> 
> GC_TUNE has been in the solr.in.sh file for a long time. Here are the settings
> we use with Java 8. We have about 120 hosts running Solr in six prod clusters.
> 
> SOLR_HEAP=8g
> # Use G1 GC  -- wunder 2017-01-23
> # Settings from https://wiki.apache.org/solr/ShawnHeisey
> GC_TUNE=" \
> -XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Sep 1, 2020, at 8:39 AM, Joe Doupnik  wrote:
>>
>> Erick states this correctly. To give some numbers from my experiences, 
>> here are two slides from my presentation about installing Solr 
>> (https://netlab1.net/ , locate item "Solr/Lucene 
>> Search Service"):
>>> 
>>
>>> 
>>
>> Thus we see a) experiments are the key, just as Erick says, and b) the 
>> choice of garbage collection algorithm plays a major role.
>> In my setup I assigned SOLR_HEAP to be 2048m, SOLR_OPTS has -Xss1024k, 
>> plus stock GC_TUNE values. Your "memorage" may vary.
>> Thanks,
>> Joe D.
>>
>> On 01/09/2020 15:33, Erick Erickson wrote:
>>> You want to run with the smallest heap you can due to Lucene’s use of 
>>> MMapDirectory, 
>>> see the excellent:
>>>
>>> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html 
>>> 
>>>
>>> There’s also little reason to have different Xms and Xmx values, that just 
>>> means you’ll
>>> eventually move a bunch of memory around as the heap expands, I usually set 
>>> them both
>>> to the same value.
>>>
>>> How to determine what “the smallest heap you can” is? Unfortunately there’s 
>>> no good way
>>> outside of stress-testing your application with less and less memory until 
>>> you have problems,
>>> then add some extra…
>>>
>>> Best,
>>> Erick
>>>
 On Sep 1, 2020, at 10:27 AM, Dominique Bejean  
  wrote:

 Hi,

 As all Java applications the Heap memory is regularly cleaned by the
 garbage collector (some young items moved to the old generation heap zone
 and unused old items removed from the old generation heap zone). This
 causes heap usage to continuously grow and reduce.

 Regards

 Dominique




 Le mar. 1 sept. 2020 à 13:50, yaswanth kumar  
  a
 écrit :

> Can someone make me understand on how the value % on the column Heap is
> calculated.
>
> I did created a new solr cloud with 3 solr nodes and one zookeeper, its
> not yet live neither interms of indexing or searching, but I do see some
> spikes in the HEAP column against nodes when I refresh the page multiple
> times. Its like almost going to 95% (sometimes) and then coming down to 
> 50%
>
> Solr version: 8.2
> Zookeeper: 3.4
>
> JVM size configured in solr.in.sh is min of 1GB to max of 10GB (actually
> RAM size on the node is 16GB)
>
> Basically need to understand if I need to worry about this heap % which
> was quite altering before making it live? or is that quite normal, because
> this is new UI change on solr cloud is kind of new to us as we used to 
> have
> solr 5 version before and this UI component doesn't exists then.
>
> --
> Thanks & Regards,
> Yaswanth Kumar Konathala.
> yaswanth...@gmail.com 
>
> Sent from my iPhone
>>
> 
> 


Re: Recent and upcoming deprecations

2020-07-17 Thread Bernd Fehling
At first glance I see many deprecations but also many TBD entries under Package location.
:-(

To understand this right: you kicked out the code and are now waiting
for the community to take over and reinvent the wheel?

Or are there any recent plans of the PMC members to at least start
Package locations on github and then let the community take over?

Or what?

I've been following this list for many years, but where were the discussions
about these decisions?

Regards
Bernd


Am 17.07.20 um 07:47 schrieb Ishan Chattopadhyaya:
> Hi Solr Users,
> Here is a list of recent and upcoming deprecations in Solr 8.x.
> https://cwiki.apache.org/confluence/display/SOLR/Deprecations
> 
> Please feel free to chime in if you have any questions. You can comment
> here or in the specific JIRA issues.
> 
> Thanks and regards,
> Ishan Chattopadhyaya
> 


Re: [ANNOUNCE] Apache Solr 8.6.0 released

2020-07-15 Thread Bernd Fehling



Am 15.07.20 um 16:07 schrieb Ishan Chattopadhyaya:
> Dear Solr Users,
> 
> In this release (Solr 8.6), we have deprecated the following:
> 
>   1. Data Import Handler
> 
>   2. HDFS support
> 
>   3. Cross Data Center Replication (CDCR)
> 

Seriously? :-(

So next steps will be kicking out Cloud and going back to single node, or what?

Why don't you just freeze the whole Solr development and switch to Elastic?


> 
> 
> All of these are scheduled to be removed in a future 9.x release.
> 
> It was decided that these components did not meet the standards of quality
> and support that we wish to ensure for all components we ship. Some of
> these also relied on design patterns that we no longer recommend for use in
> critical production environments.
> 
> If you rely on these features, you are encouraged to try out community
> supported versions of these, where available [0]. Where such community
> support is not available, we encourage you to participate in the migration
> of these components into community supported packages and help continue the
> development. We envision that using packages for these components via
> package manager will actually make it easier for users to use such features.
> 
> Regards,
> 
> Ishan Chattopadhyaya
> 
> (On behalf of the Apache Lucene/Solr PMC)
> 
> [0] -
> https://cwiki.apache.org/confluence/display/SOLR/Community+supported+packages+for+Solr
> 
> On Wed, Jul 15, 2020 at 2:30 PM Bruno Roustant 
> wrote:
> 
>> The Lucene PMC is pleased to announce the release of Apache Solr 8.6.0.
>>
>>
>> Solr is the popular, blazing fast, open source NoSQL search platform from
>> the Apache Lucene project. Its major features include powerful full-text
>> search, hit highlighting, faceted search, dynamic clustering, database
>> integration, rich document handling, and geospatial search. Solr is highly
>> scalable, providing fault tolerant distributed search and indexing, and
>> powers the search and navigation features of many of the world's largest
>> internet sites.
>>
>>
>> Solr 8.6.0 is available for immediate download at:
>>
>>
>>   
>>
>>
>> ### Solr 8.6.0 Release Highlights:
>>
>>
>>  * Cross-Collection Join Queries: Join queries can now work
>> cross-collection, even when shared or when spanning nodes.
>>
>>  * Search: Performance improvement for some types of queries when exact
>> hit count isn't needed by using BlockMax WAND algorithm.
>>
>>  * Streaming Expression: Percentiles and standard deviation aggregations
>> added to stats, facet and time series.  Streaming expressions added to
>> /export handler.  Drill Streaming Expression for efficient and accurate
>> high cardinality aggregation.
>>
>>  * Package manager: Support for cluster (CoreContainer) level plugins.
>>
>>  * Health Check: HealthCheckHandler can now require that all cores are
>> healthy before returning OK.
>>
>>  * Zookeeper read API: A read API at /api/cluster/zk/* to fetch raw ZK
>> data and view contents of a ZK directory.
>>
>>  * Admin UI: New panel with security info in admin UI's dashboard.
>>
>>  * Query DSL: Support for {param:ref} and {bool: {excludeTags:""}}
>>
>>  * Ref Guide: Major redesign of Solr's documentation.
>>
>>
>> Please read CHANGES.txt for a full list of new features and changes:
>>
>>
>>   
>>
>>
>> Solr 8.6.0 also includes features, optimizations  and bugfixes in the
>> corresponding Apache Lucene release:
>>
>>
>>   
>>
>>
>> Note: The Apache Software Foundation uses an extensive mirroring network
>> for
>>
>> distributing releases. It is possible that the mirror you are using may
>> not have
>>
>> replicated the release yet. If that is the case, please try another mirror.
>>
>> This also applies to Maven access.
>>
> 


Re: SOLR and Zookeeper compatibility

2020-07-13 Thread Bernd Fehling



Am 13.07.20 um 09:55 schrieb Mithun Seal:
> Hi Team,
> 
> Could you please help me with below compatibility question.
> 
> 1. We are trying to install zookeeper externally along with SOLR 7.5.0. 
> As noted, SOLR 7.5.0 comes with Zookeeper 1.3.11. 

Where did you get that info from?
AFAIK, Solr 7.5.0 comes with Apache ZooKeeper 3.4.11.

Regards
Bernd

> Can I install Zookeeper
> 1.3.10 with SOLR 7.5.0. Zookeeper 1.3.10 will be compatible with SOLR 7.5.0?
> 
> 2. What is the suggested version of Zookeeper should be used with SOLR
> 7.5.0?
> 
> 
> Thanks,
> Mithun
> 


Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-24 Thread Bernd Fehling
I'm following this thread now for a while and I can understand
the wish to change some naming/wording/speech in one or the other
programs but I always get back to the one question:
"Is it the weapon which kills people or the hand controlled by
the mind which fires the weapon?"

The thread started with slave - slavery, then turned over to master
and followed by leader (for me as a german... you know).
What will come next?

And more over, we now discuss about changes in the source code and
due to this there need to be changes to the documentation.
What about the books people wrote about this programs and source code,
should we force this authors to rewrite their books?
May be we should file a request to all web search engines to reject
all stored content about these "banned" words?
And contact all web hosters about providing bad content.

To sum things up, in my 40 years of computer science and writing
programs I have never, for a nanosecond, had any thoughts about words
like master, slave, leader, ... other than thinking about computers
and programming.

Just my 2 cents.

For what it is worth, I tend toward guide/follower if there "must be" any changes.

Bernd


Re: unique key across collections within datacenter

2020-05-13 Thread Bernd Fehling
Thanks Erick for your answer.

I was overcomplicating things and seeing problems which are not there.

I have your second scenario. The first huge collection still remains
and will grow further while the second will start with same schema but
content from a new source. Sure I could also load the content
from the new source into the first huge collection but I want to
have source, loading, maintenance handling separated.
May be I also start the new collection with a new instance.

Regards
Bernd

Am 13.05.20 um 13:40 schrieb Erick Erickson:
> So a doc in your new collection is expected to supersede a doc
> with the same ID in the old one, right? 
> 
> What I’d do is delete the IDs from my old collection as they were added to
> the new one, there’s not much use in keeping both if you always want
> the new one.
> 
> Let’s assume you do this, the next issue is making sure all of your docs in 
> the new collection are deleted from the old one, and your process will
> inevitably have a hiccough or two. You could periodically use streaming to 
> produce a list of IDs common to both collections, and have a cleanup
> process you occasionally ran to make up for any glitches in the normal
> delete-from-the-old-collection process, see:
> https://lucene.apache.org/solr/guide/6_6/stream-decorators.html#stream-decorators
> 
> If that’s not the case, then having the same id in the different collections
> doesn’t matter. Solr doesn’t use the ID for combining results, just routing 
> and
> then updating.
> 
> This is illustrated by the fact that, through user error, you can even get 
> the same
> document repeated in a result set if it gets indexed to two different shards.
> 
> And if neither of those are on target, what about “handling” unique IDs across
> two collections do you think might go wrong?
> 
> Best,
> Erick
> 
>> On May 13, 2020, at 4:26 AM, Bernd Fehling  
>> wrote:
>>
>> Dear list,
>>
>> in my SolrCloud 6.6 I have a huge collection and now I will get
>> much more data from a different source to be indexed.
>> So I'm thinking about a new collection and combine both, the existing
>> one and the new one with an alias.
>>
>> But how to handle the unique key across collections within a datacenter?
>> Is it at all possible?
>>
>> I don't see any problems with add, update and delete of documents because
>> these operations are not using the alias.
>>
>> But searching across collections with an alias and then fetching documents
>> by id from the result may lead to results where the id is in both
>> collections?
>>
>> I have no idea, but there are SolrClouds with a lot of collections out there.
>> How do they handle uniqueness across collections within a datacenter?
>>
>> Regards
>> Bernd
> 


unique key across collections within datacenter

2020-05-13 Thread Bernd Fehling
Dear list,

in my SolrCloud 6.6 I have a huge collection and now I will get
much more data from a different source to be indexed.
So I'm thinking about a new collection and combine both, the existing
one and the new one with an alias.

But how to handle the unique key across collections within a datacenter?
Is it at all possible?

I don't see any problems with add, update and delete of documents because
these operations are not using the alias.

But searching across collections with an alias and then fetching documents
by id from the result may lead to results where the id is in both collections?

I have no idea, but there are SolrClouds with a lot of collections out there.
How do they handle uniqueness across collections within a datacenter?

Regards
Bernd


Re:

2020-05-13 Thread Bernd Fehling
Dear list and mailer admins,

it looks like the mailer of this list needs some care.
Can someone please set this "ART GALLERY" on a black list?

Thank you,
Bernd


Am 13.05.20 um 08:47 schrieb ART GALLERY:
> check out the videos on this website TROO.TUBE don't be such a
> sheep/zombie/loser/NPC. Much love!
> https://troo.tube/videos/watch/aaa64864-52ee-4201-922f-41300032f219
> 
> On Tue, May 12, 2020 at 9:16 AM Nikolai Efseaff  wrote:
>>
>>
>>
>>
>> Any tax advice in this e-mail should be considered in the context of the tax 
>> services we are providing to you. Preliminary tax advice should not be 
>> relied upon and may be insufficient for penalty protection.
>> 
>> The information contained in this message may be privileged and confidential 
>> and protected from disclosure. If the reader of this message is not the 
>> intended recipient, or an employee or agent responsible for delivering this 
>> message to the intended recipient, you are hereby notified that any 
>> dissemination, distribution or copying of this communication is strictly 
>> prohibited. If you have received this communication in error, please notify 
>> us immediately by replying to the message and deleting it from your computer.
>>
>> Notice required by law: This e-mail may constitute an advertisement or 
>> solicitation under U.S. law, if its primary purpose is to advertise or 
>> promote a commercial product or service. You may choose not to receive 
>> advertising and promotional messages from Ernst & Young LLP (except for EY 
>> Client Portal and the ey.com website, which track e-mail preferences through 
>> a separate process) at this e-mail address by forwarding this message to 
>> no-more-m...@ey.com. If you do so, the sender of this message will be 
>> notified promptly. Our principal postal address is 5 Times Square, New York, 
>> NY 10036. Thank you. Ernst & Young LLP


Re: Solr Ref Guide Redesign coming in 8.6

2020-04-29 Thread Bernd Fehling
+1

And a fully indexed search for the Ref Guide.
I have to use Google to search for info in the Ref Guide of a search engine. :-(


Am 29.04.20 um 02:11 schrieb matthew sporleder:
> I highly recommend a version selector in the header!  I am *always*
> landing on 6.x docs from google.
> 
> On Tue, Apr 28, 2020 at 5:18 PM Cassandra Targett  wrote:
>>
>> In case the list breaks the URL to view the Jenkins build, here's a shorter
>> URL:
>>
>> https://s.apache.org/df7ew.
>>
>> On Tue, Apr 28, 2020 at 3:12 PM Cassandra Targett 
>> wrote:
>>
>>> The PMC would like to engage the Solr user community for feedback on an
>>> extensive redesign of the Solr Reference Guide I've just committed to the
>>> master (future 9.0) branch.
>>>
>>> You can see the new design from our Jenkins build of master:
>>>
>>> https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-master/javadoc/
>>>
>>> The hope is that you will receive these changes positively. If so, we'll
>>> use this for the upcoming 8.6 Ref Guide and future releases. We also may
>>> re-publish earlier 8.x versions so they use this design.
>>>
>>> I embarked on this project last December simply as an attempt to upgrade
>>> the version of Bootstrap used by the Guide. After a couple of days, I'd
>>> changed the layout entirely. In the ensuing few months I've tried to iron
>>> out the kinks and made some extensive changes to the "backend" (the CSS,
>>> JavaScript, etc.).
>>>
>>> I'm no graphic designer, but some of my guiding thoughts were to try to
>>> make full use of the browser window, improve responsiveness for different
>>> sized screens, and just give it a more modern feel. The full list of what
>>> has changed is detailed in the Jira issue if you are interested:
>>> https://issues.apache.org/jira/browse/SOLR-14173
>>>
>>> This is Phase 1 of several changes. There is one glaring remaining issue,
>>> which is that our list of top-level categories is too long for the new
>>> design. I've punted fixing that to Phase 2, which will be an extensive
>>> re-consideration of how the Ref Guide is organized with the goal of
>>> trimming down the top-level categories to only 4-6. SOLR-1 will track
>>> phase 2.
>>>
>>> One last thing to note: this redesign really only changes the presentation
>>> of the pages and some of the framework under the hood - it doesn't yet add
>>> full-text search. All of the obstacles to providing search still exist, but
>>> please know that we fully understand frustration on this point and still
>>> hope to fix it.
>>>
>>> I look forward to hearing your feedback in this thread.
>>>
>>> Best,
>>> Cassandra
>>>


Re: Use boolean operator "-", the result is incorrect

2020-04-08 Thread Bernd Fehling
About the first query: you have a negative clause telling the searcher to give only
results _NOT_ containing "name_s:a". From that result list you then get only the
results matching "age_i:10".
Boolean table for OR is:
0 OR 0 = 0
1 OR 0 = 1
0 OR 1 = 1
1 OR 1 = 1
You get one result.

About the second query, your "parsedquery" says you _must_ match id:1 OR id:2
(which both documents do), and you _must_ also match the clause (-name_s:a).
But that clause is purely negative, and a purely negative clause on its own
matches no documents, so you get no results.
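
A common way to get the intended semantics is to anchor the purely negative clause with a
match-all query; a sketch with the same fields:

  (*:* -name_s:a) OR age_i:10        -> returns both documents
  id:("1" "2") AND (*:* -name_s:a)   -> returns document 2 only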

Regards
Bernd

Am 08.04.20 um 11:53 schrieb slly:
> My default query operator is OR. There are two pieces of data in the index:
> { "id":"1", "name_s":"a", "age_i":10, "_version_":1663396766955864064}, { 
> "id":"2", "name_s":"b", "age_i":10, "_version_":1663396767058624512}] }
> 
> 
>   1.   -name_s:a OR age_i:10  # I think two pieces of data should be 
> returned, but only one
> 
> "rawquerystring":"-name_s:a age_i:10", "querystring":"-name_s:a age_i:10", 
> "parsedquery":"-name_s:a IndexOrDocValuesQuery(age_i:[10 TO 10])", 
> "parsedquery_toString":"-name_s:a age_i:[10 TO 10]", 
> "QParser":"LuceneQParser",
> 
> 
> 
>   2.  id:("1" "2") AND (-name_s:a) # I think one data should be returned, but 
> 0 data 
> 
> "rawquerystring":"id:(\"1\" \"2\") AND (-name_s:a)", "querystring":"id:(\"1\" 
> \"2\") AND (-name_s:a)", "parsedquery":"+(id:1 id:2) +(-name_s:a)", 
> "parsedquery_toString":"+(id:1 id:2) +(-name_s:a)", "QParser":"LuceneQParser",
> 
> 
> 
> 
> 
> At 2020-04-08 17:46:37, "Bernd Fehling"  
> wrote:
>> What is debugQuery telling you about:
>> - "rawquerystring"
>> - "querystring"
>> - "parsedquery"
>> - "parsedquery_toString"
>> - "QParser"
>>
>> Also what is your default query operator, AND or OR?
>> This is what matters for your second example with  id:("1" "2")
>> It could be  id:("1" AND "2")  or  id:("1" OR "2") .
>>
>> Regards
>> Bernd
>>
>> Am 08.04.20 um 11:30 schrieb slly:
>>> Thanks Bernd for your reply.
>>>  I run the query on the Solr Web UI in Solr 7.3.1/7.7.2, the screenshot of 
>>> my execution results is as follows,  I don't understand whether there is a 
>>> grammatical error ?
>>> 1. -name_s:a OR age_i:10
>>>
>>> 2. id:("1" "2") AND (-name_s:a)
>>>
>>>
>>> At 2020-04-08 16:33:20, "Bernd Fehling"  
>>> wrote:
>>>> Looks correct to me.
>>>>
>>>> You have to obey the level of the operators and the parenthesis.
>>>> Turn debugQuery on to see the results of parsing of your query.
>>>>
>>>> Regards
>>>> Bernd
>>>>
>>>> Am 08.04.20 um 09:34 schrieb slly:
>>>>>
>>>>>
>>>>> If the following query is executed, the result is different:
>>>>>
>>>>>
>>>>> id:("1" "2") AND (-name_s:a) --> numFound is 0 
>>>>>
>>>>>
>>>>> id:("1" "2") AND -(name_s:a)--> numFound is 1 
>>>>>
>>>>>
>>>>>
>>>>> At 2020-04-08 14:56:26, "slly"  wrote:
>>>>>> Hello Folks,
>>>>>> We are using Solr 7.3.1,  I write the following two lines of data into 
>>>>>> collection:
>>>>>> id, name_s, age_i
>>>>>> 1, a, 10
>>>>>> 2, b, 10
>>>>>> Use the following query syntax:
>>>>>> -name_s:a OR age_i:10
>>>>>>
>>>>>>
>>>>>> I think we should return two pieces of data, but actually only one piece 
>>>>>> of data:
>>>>>> id, name_s, age_i
>>>>>> 2, b, 10
>>>>>>
>>>>>>
>>>>>> Did I get it wrong?  Looking forward to some valuable suggestions. 
>>>>>> Thanks.
>>>


Re: Use boolean operator "-", the result is incorrect

2020-04-08 Thread Bernd Fehling
What is debugQuery telling you about:
- "rawquerystring"
- "querystring"
- "parsedquery"
- "parsedquery_toString"
- "QParser"

Also what is your default query operator, AND or OR?
This is what matters for your second example with  id:("1" "2")
It could be  id:("1" AND "2")  or  id:("1" OR "2") .

Regards
Bernd

Am 08.04.20 um 11:30 schrieb slly:
> Thanks Bernd for your reply.
>  I run the query on the Solr Web UI in Solr 7.3.1/7.7.2, the screenshot of my 
> execution results is as follows,  I don't understand whether there is a 
> grammatical error ?
> 1. -name_s:a OR age_i:10
> 
> 2. id:("1" "2") AND (-name_s:a)
> 
> 
> At 2020-04-08 16:33:20, "Bernd Fehling"  
> wrote:
>> Looks correct to me.
>>
>> You have to obey the level of the operators and the parenthesis.
>> Turn debugQuery on to see the results of parsing of your query.
>>
>> Regards
>> Bernd
>>
>> Am 08.04.20 um 09:34 schrieb slly:
>>>
>>>
>>> If the following query is executed, the result is different:
>>>
>>>
>>> id:("1" "2") AND (-name_s:a) --> numFound is 0 
>>>
>>>
>>> id:("1" "2") AND -(name_s:a)--> numFound is 1 
>>>
>>>
>>>
>>> At 2020-04-08 14:56:26, "slly"  wrote:
>>>> Hello Folks,
>>>> We are using Solr 7.3.1,  I write the following two lines of data into 
>>>> collection:
>>>> id, name_s, age_i
>>>> 1, a, 10
>>>> 2, b, 10
>>>> Use the following query syntax:
>>>> -name_s:a OR age_i:10
>>>>
>>>>
>>>> I think we should return two pieces of data, but actually only one piece 
>>>> of data:
>>>> id, name_s, age_i
>>>> 2, b, 10
>>>>
>>>>
>>>> Did I get it wrong?  Looking forward to some valuable suggestions. Thanks.
> 


Re: Use boolean operator "-", the result is incorrect

2020-04-08 Thread Bernd Fehling
Looks correct to me.

You have to obey the level of the operators and the parenthesis.
Turn debugQuery on to see the results of parsing of your query.

Regards
Bernd

Am 08.04.20 um 09:34 schrieb slly:
> 
> 
> If the following query is executed, the result is different:
> 
> 
> id:("1" "2") AND (-name_s:a) --> numFound is 0 
> 
> 
> id:("1" "2") AND -(name_s:a)--> numFound is 1 
> 
> 
> 
> At 2020-04-08 14:56:26, "slly"  wrote:
>> Hello Folks,
>> We are using Solr 7.3.1,  I write the following two lines of data into 
>> collection:
>> id, name_s, age_i
>> 1, a, 10
>> 2, b, 10
>> Use the following query syntax:
>> -name_s:a OR age_i:10
>>
>>
>> I think we should return two pieces of data, but actually only one piece of 
>> data:
>> id, name_s, age_i
>> 2, b, 10
>>
>>
>> Did I get it wrong?  Looking forward to some valuable suggestions. Thanks.


Re: howto replace fieldType string with text lowercase

2019-12-06 Thread Bernd Fehling
Hi Munendra S N,

thanks for the hint about Tokenizer.
Could I omit Tokenizer at all or is it needed by LowerCaseFilterFactory?

The field "firstname" has no facetting and sorting.
Also, I want to keep the raw content as is, with capital letters and so on.
I think update processors and preprocessing before loading would not help here.
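
A tokenizer is generally still required in a Solr analyzer chain. A minimal sketch of the kind
of type discussed here (the name string_lowercase is just a placeholder):

  <fieldType name="string_lowercase" class="solr.TextField" sortMissingLast="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="firstname" type="string_lowercase" indexed="true" stored="true" multiValued="true"/>

The stored value keeps its original capitalization; only the indexed terms are lowercased.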

Regards
Bernd


Am 06.12.19 um 11:31 schrieb Munendra S N:
> Instead of StandardTokenizerFactory, use KeywordTokenizerFactory, which emits
> the whole text as a single token. Once you make this change, full reindexing
> needs to be done. After the field type change, some functionality, like
> faceting and sorting, might not be performant on the field.
> I'm not sure if there are any out-of-the-box update processors to convert
> the value to lowercase, but implementing one should be easy. The other approach
> is to convert the value in a preprocessing phase before sending it to Solr.
> 
> Regards,
> Munendra S N
> 
> 
> 
> On Fri, Dec 6, 2019 at 2:45 PM Bernd Fehling 
> wrote:
> 
>> Dear list,
>>
>> for one field I want to change fieldType from string to something
>> equal to string, but only lowercase.
>>
>> currently:
>> > multiValued="true">
>> 
>>
>> new:
>> > multiValued="true">
>> > positionIncrementGap="100" autoGeneratePhraseQueries="true">
>>   
>> 
>> 
>>   
>> 
>>
>> Is this the right replacement for "string"?
>> Are the attributes for solr.TextField ok?
>>
>> Regards
>> Bernd
>>
> 


howto replace fieldType string with text lowercase

2019-12-06 Thread Bernd Fehling
Dear list,

for one field I want to change fieldType from string to something
equal to string, but only lowercase.

currently:



new:


  


  


Is this the right replacement for "string"?
Are the attributes for solr.TextField ok?
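
A sketch of the kind of definitions being described above (the field name comes from the
follow-up mail; the type name and remaining attributes are assumptions):

  currently:
  <field name="firstname" type="string" indexed="true" stored="true" multiValued="true"/>

  new:
  <field name="firstname" type="text_lowercase" indexed="true" stored="true" multiValued="true"/>
  <fieldType name="text_lowercase" class="solr.TextField"
             positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>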

Regards
Bernd


Re: hi question about solr

2019-12-03 Thread Bernd Fehling
No, I don't use any highlighting.

Am 03.12.19 um 12:28 schrieb Paras Lehana:
> Hi Bernd,
> 
> Have you gone through Highlighting
> <https://lucene.apache.org/solr/guide/8_3/highlighting.html>?
> 
> On Mon, 2 Dec 2019 at 17:00, eli chen  wrote:
> 
>> yes
>>
>> On Mon, 2 Dec 2019 at 13:29, Bernd Fehling >>
>> wrote:
>>
>>> In short,
>>>
>>> you are trying to use an indexer as a full-text search engine, right?
>>>
>>> Regards
>>> Bernd
>>>
>>> Am 02.12.19 um 12:24 schrieb eli chen:
>>>> hi im kind of new to solr so please be patient
>>>>
>>>> i'll try to explain what do i need and what im trying to do.
>>>>
>>>> we a have a lot of books content and we want to index them and allow
>>> search
>>>> in the books.
>>>> when someone search for a term
>>>> i need to get back the position of matchen word in the book
>>>> for example
>>>> if the book content is "hello my name is jeff" and someone search for
>>> "my".
>>>> i want to get back the position of my in the content field (which is 1
>> in
>>>> this case)
>>>> i tried to do that with payloads but no success. and another problem i
>>>> encourage is .
>>>> lets say the content field is "hello my name is jeff what is your
>> name".
>>>> now if someone search for "name" i want to get back the index of all
>>>> occurrences not just the first one
>>>>
>>>> is there any way to that with solr without develop new plugins
>>>>
>>>> thx
>>>>
>>>
>>
> 
> 


Re: hi question about solr

2019-12-02 Thread Bernd Fehling
In short,

you are trying to use an indexer as a full-text search engine, right?

Regards
Bernd

Am 02.12.19 um 12:24 schrieb eli chen:
> hi im kind of new to solr so please be patient
> 
> i'll try to explain what do i need and what im trying to do.
> 
> we have a lot of book content and we want to index it and allow search
> in the books.
> when someone searches for a term
> i need to get back the position of the matched word in the book
> for example
> if the book content is "hello my name is jeff" and someone search for "my".
> i want to get back the position of my in the content field (which is 1 in
> this case)
> i tried to do that with payloads but no success. and another problem i
> encountered is:
> lets say the content field is "hello my name is jeff what is your name".
> now if someone search for "name" i want to get back the index of all
> occurrences not just the first one
> 
> is there any way to that with solr without develop new plugins
> 
> thx
> 


Re: Synonym filters memory usage

2019-09-30 Thread Bernd Fehling

Yes, I think so.
While integrating a Thesaurus as synonyms.txt I saw massive memory usage.
A heap dump and analysis with MemoryAnalyzer pointed out that the
SynonymMap took 3 times a huge amount of memory, together with each
opened index segment.
Just try it and check that by yourself with heap dump and MemoryAnalyzer.
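
If you want to reproduce that check, a heap dump of a running Solr can be taken with the
standard JDK tooling, for example (the PID being that of your Solr process):

  jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr-pid>

and then opened in Eclipse MemoryAnalyzer.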

Regards
Bernd


Am 30.09.19 um 09:44 schrieb Andrea Gazzarini:
mmm, ok for the core but are you sure things in this case are working per-segment? I would expect a FilterFactory instance per index, 
initialized at schema loading time.


On 30/09/2019 09:04, Bernd Fehling wrote:

And I think this is per core per index segment.

2 cores per instance, each core with 3 index segments, sums up to 6 times
the 2 SynonymMaps. Results in 12 times SynonymMaps.

Regards
Bernd


Am 30.09.19 um 08:41 schrieb Andrea Gazzarini:

  Hi,
looking at the stateful nature of SynonymGraphFilter/FilterFactory classes,
the answer should be 2 times (one time per type instance).
The SynonymMap, which internally holds the synonyms table, is a private
member of the filter factory and it is loaded each time the factory needs
to create a type.

Best,
Andrea

On 29/09/2019 23:49, Dominique Bejean wrote:

Hi,

My concern is about memory used by synonym filter, especially if synonyms
resources files are large.

If in my schema, there are two field types "TypeSyno1" and "TypeSyno2"
using synonym filter with the same synonyms files.
For each of these two field types there are two fields

Field1 type is TypeSyno1
Field2 type is TypeSyno1
Field3 type is TypeSyno2
Field4 type is TypeSyno2

How many times is the synonym file loaded in memory ?
4 times, so one time per field ?
2 times, so one time per instanciated type ?

Regards

Dominique




Re: Synonym filters memory usage

2019-09-30 Thread Bernd Fehling

And I think this is per core per index segment.

2 cores per instance, each core with 3 index segments, sums up to 6 times
the 2 SynonymMaps. Results in 12 times SynonymMaps.

Regards
Bernd


Am 30.09.19 um 08:41 schrieb Andrea Gazzarini:

  Hi,
looking at the stateful nature of SynonymGraphFilter/FilterFactory classes,
the answer should be 2 times (one time per type instance).
The SynonymMap, which internally holds the synonyms table, is a private
member of the filter factory and it is loaded each time the factory needs
to create a type.

Best,
Andrea

On 29/09/2019 23:49, Dominique Bejean wrote:

Hi,

My concern is about memory used by synonym filter, especially if synonyms
resources files are large.

If in my schema, there are two field types "TypeSyno1" and "TypeSyno2"
using synonym filter with the same synonyms files.
For each of these two field types there are two fields

Field1 type is TypeSyno1
Field2 type is TypeSyno1
Field3 type is TypeSyno2
Field4 type is TypeSyno2

How many times is the synonym file loaded in memory ?
4 times, so one time per field ?
2 times, so one time per instanciated type ?

Regards

Dominique


Re: Query number of Lucene documents using Solr?

2019-08-27 Thread Bernd Fehling

You might use the Lucene internal CheckIndex included in lucene core.
It should tell you everything you need. At least a good starting
point for writing your own tool.

Copy lucene-core-x.y.z-SNAPSHOT.jar and lucene-misc-x.y.z-SNAPSHOT.jar
to a local directory.

java -cp lucene-core-x.y.z-SNAPSHOT.jar -ea:org.apache.lucene... 
org.apache.lucene.index.CheckIndex /path/to/your/index

If you append a "-verbose" you will get tons of info about your index.

Regards
Bernd


Am 26.08.19 um 22:19 schrieb Bram Van Dam:

Possibly somewhat unusual question: I'm looking for a way to query the
number of *lucene documents* from within Solr. This can be different
from the number of Solr documents (because of unmerged deletes/updates/
etc).

As a bit of background; we recently found this lovely little error
message in a Solr log, and we'd like to get a bit of an early warning
system going :-)


Too many documents, composite IndexReaders cannot exceed 2147483647


If no way currently exists, I'm not adverse to hacking one in, but I
could use a few pointers in the general direction.

As an alternative strategy, I guess I could use Lucene to walk through
each index segment and add the segment info maxDoc values. But I'm not
sure if that would be a good idea.

Thanks a bunch,

  - Bram



Re: Problem with uploading Large synonym files in cloud mode

2019-08-02 Thread Bernd Fehling

http://lucene.apache.org/solr/guide/6_6/command-line-utilities.html
"Upload a configuration directory"

Take my advice and read the SolrCloud section of the Solr Ref Guide.
It will answer most of your questions and is a good start.
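
For completeness, the upload described on that page is roughly this (zkcli.sh ships with the
Solr distribution; host, port and names below are assumptions):

  server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
    -cmd upconfig -confdir /path/to/your/conf -confname myconfig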



Am 02.08.19 um 08:30 schrieb Salmaan Rashid Syed:

Hi Bernd,

Yet another noob question.

Consider that my conf directory for creating a collection is _default. Suppose
now I made changes to managed-schema and conf.xml. How do I upload them to the
external ZooKeeper at port 2181?

Can you please give me the command that uploads altered config.xml and
managed-schema to zookeeper?

Thanks.


On Fri, Aug 2, 2019 at 11:53 AM Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:



to 1) yes, because -Djute.maxbuffer is going to JAVA as a start parameter.

to 2) I don't know because i never use internal zookeeper

to 3) the configs are located at solr/server/solr/configsets/
- choose one configset, make your changes and upload it to zookeeper
- when creating a new collection choose your uploaded config
- whenever you change something at your config you have to upload
it to zookeeper

I don't know which Solr version you are using, but a good starting point
with solr cloud is
http://lucene.apache.org/solr/guide/6_6/solrcloud.html

Regards
Bernd



Am 02.08.19 um 07:59 schrieb Salmaan Rashid Syed:

Hi Bernd,

Sorry for noob questions.

1) What do you mean by restart? Do you mean that I should issue ./bin/solr
stop -all?

And then issue these commands,

bin/solr restart -cloud -s example/cloud/node1/solr -p 8983

bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr


2) Where can I find solr internal Zookeeper folder for issuing this

command

SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=1000"?


3) Where can I find the schema.xml and config.xml files for the SolrCloud cores to
make changes to the schema and configuration? Or do I have to make changes in
the directory that contains the managed-schema and config.xml files with which
I initialized and created the collections? And will Solr then pick them up
from there when it restarts?


Regards,

Salmaan



On Thu, Aug 1, 2019 at 5:40 PM Bernd Fehling <

bernd.fehl...@uni-bielefeld.de>

wrote:




Am 01.08.19 um 13:57 schrieb Salmaan Rashid Syed:

After I make the -Djute.maxbuffer changes to Solr, deployed in

production,

Do I need to restart the solr to be able to add synonyms >1MB?


Yes, you have to restart Solr.




Or, Was this supposed to be done before putting Solr to production

ever?

Can we make chages when the Solr is running in production?


It depends on your system. In my cloud with 5 shards and 3 replicas I

can

take one by one offline, stop, modify and start again without problems.




Thanks.

Regards,
Salmaan



On Tue, Jul 30, 2019 at 4:53 PM Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:


You have to increase the -Djute.maxbuffer for large configs.

In Solr bin/solr/solr.in.sh use e.g.
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=1000"
This will increase maxbuffer for zookeeper on solr side to 10MB.

In Zookeeper zookeeper/conf/zookeeper-env.sh
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=1000"

I have a >10MB Thesaurus and use 30MB for jute.maxbuffer, works

perfect.


Regards


Am 30.07.19 um 13:09 schrieb Salmaan Rashid Syed:

Hi Solr Users,

I have a very big synonym file (>5MB). I am unable to start Solr in

cloud

mode as it throws an error message stating that the synonmys file is
too large. I figured out that the zookeeper doesn't take a file

greater

than 1MB size.

I tried to break down my synonyms file to smaller chunks less than

1MB

each. But, I am not sure about how to include all the filenames into

the

Solr schema.

Should it be seperated by commas like synonyms = "__1_synonyms.txt,
__2_synonyms.txt, __3synonyms.txt"

Or is there a better way of doing that? Will the bigger file when

broken

down to smaller chunks will be uploaded to zookeeper as well.

Please help or please guide me to relevant documentation regarding

this.


Thank you.

Regards.
Salmaan.















Re: Problem with uploading Large synonym files in cloud mode

2019-08-02 Thread Bernd Fehling



to 1) yes, because -Djute.maxbuffer is going to JAVA as a start parameter.

to 2) I don't know because i never use internal zookeeper

to 3) the configs are located at solr/server/solr/configsets/
  - choose one configset, make your changes and upload it to zookeeper
  - when creating a new collection choose your uploaded config
  - whenever you change something at your config you have to upload it to 
zookeeper

I don't know which Solr version you are using, but a good starting point with 
solr cloud is
http://lucene.apache.org/solr/guide/6_6/solrcloud.html

Regards
Bernd



Am 02.08.19 um 07:59 schrieb Salmaan Rashid Syed:

Hi Bernd,

Sorry for noob questions.

1) What do you mean by restart? Do you mean that I should issue ./bin/solr
stop -all?

And then issue these commands,

bin/solr restart -cloud -s example/cloud/node1/solr -p 8983

bin/solr restart -c -p 7574 -z localhost:9983 -s example/cloud/node2/solr


2) Where can I find solr internal Zookeeper folder for issuing this command
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=1000"?


3) Where can I find the schema.xml and config.xml files for the SolrCloud cores to
make changes to the schema and configuration? Or do I have to make changes in
the directory that contains the managed-schema and config.xml files with which
I initialized and created the collections? And will Solr then pick them up
from there when it restarts?


Regards,

Salmaan



On Thu, Aug 1, 2019 at 5:40 PM Bernd Fehling 
wrote:




Am 01.08.19 um 13:57 schrieb Salmaan Rashid Syed:

After I make the -Djute.maxbuffer changes to Solr, deployed in

production,

Do I need to restart the solr to be able to add synonyms >1MB?


Yes, you have to restart Solr.




Or, Was this supposed to be done before putting Solr to production ever?
Can we make chages when the Solr is running in production?


It depends on your system. In my cloud with 5 shards and 3 replicas I can
take one by one offline, stop, modify and start again without problems.




Thanks.

Regards,
Salmaan



On Tue, Jul 30, 2019 at 4:53 PM Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:


You have to increase the -Djute.maxbuffer for large configs.

In Solr bin/solr/solr.in.sh use e.g.
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=1000"
This will increase maxbuffer for zookeeper on solr side to 10MB.

In Zookeeper zookeeper/conf/zookeeper-env.sh
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=1000"

I have a >10MB Thesaurus and use 30MB for jute.maxbuffer, works perfect.

Regards


Am 30.07.19 um 13:09 schrieb Salmaan Rashid Syed:

Hi Solr Users,

I have a very big synonym file (>5MB). I am unable to start Solr in

cloud

mode as it throws an error message stating that the synonmys file is
too large. I figured out that the zookeeper doesn't take a file greater
than 1MB size.

I tried to break down my synonyms file to smaller chunks less than 1MB
each. But, I am not sure about how to include all the filenames into

the

Solr schema.

Should it be seperated by commas like synonyms = "__1_synonyms.txt,
__2_synonyms.txt, __3synonyms.txt"

Or is there a better way of doing that? Will the bigger file when

broken

down to smaller chunks will be uploaded to zookeeper as well.

Please help or please guide me to relevant documentation regarding

this.


Thank you.

Regards.
Salmaan.











Re: Problem with uploading Large synonym files in cloud mode

2019-08-01 Thread Bernd Fehling




Am 01.08.19 um 13:57 schrieb Salmaan Rashid Syed:

After I make the -Djute.maxbuffer changes to Solr, deployed in production,
Do I need to restart the solr to be able to add synonyms >1MB?


Yes, you have to restart Solr.




Or, Was this supposed to be done before putting Solr to production ever?
Can we make chages when the Solr is running in production?


It depends on your system. In my cloud with 5 shards and 3 replicas I can
take one by one offline, stop, modify and start again without problems.




Thanks.

Regards,
Salmaan



On Tue, Jul 30, 2019 at 4:53 PM Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:


You have to increase the -Djute.maxbuffer for large configs.

In Solr bin/solr/solr.in.sh use e.g.
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=1000"
This will increase maxbuffer for zookeeper on solr side to 10MB.

In Zookeeper zookeeper/conf/zookeeper-env.sh
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=1000"

I have a >10MB Thesaurus and use 30MB for jute.maxbuffer, works perfect.

Regards


Am 30.07.19 um 13:09 schrieb Salmaan Rashid Syed:

Hi Solr Users,

I have a very big synonym file (>5MB). I am unable to start Solr in cloud
mode as it throws an error message stating that the synonmys file is
too large. I figured out that the zookeeper doesn't take a file greater
than 1MB size.

I tried to break down my synonyms file to smaller chunks less than 1MB
each. But, I am not sure about how to include all the filenames into the
Solr schema.

Should it be separated by commas, like synonyms = "__1_synonyms.txt,
__2_synonyms.txt, __3synonyms.txt"

Or is there a better way of doing that? Will the bigger file, when broken
down to smaller chunks, be uploaded to zookeeper as well?

Please help or please guide me to relevant documentation regarding this.

Thank you.

Regards.
Salmaan.







Re: Problem with uploading Large synonym files in cloud mode

2019-07-30 Thread Bernd Fehling

You have to increase the -Djute.maxbuffer for large configs.

In Solr bin/solr/solr.in.sh use e.g.
SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=1000"
This will increase maxbuffer for zookeeper on solr side to 10MB.

In Zookeeper zookeeper/conf/zookeeper-env.sh
SERVER_JVMFLAGS="$SERVER_JVMFLAGS -Djute.maxbuffer=1000"

I have a >10MB Thesaurus and use 30MB for jute.maxbuffer, works perfect.

Regards


Am 30.07.19 um 13:09 schrieb Salmaan Rashid Syed:

Hi Solr Users,

I have a very big synonym file (>5MB). I am unable to start Solr in cloud
mode as it throws an error message stating that the synonyms file is
too large. I figured out that the zookeeper doesn't take a file greater
than 1MB size.

I tried to break down my synonyms file to smaller chunks less than 1MB
each. But, I am not sure about how to include all the filenames into the
Solr schema.

Should it be separated by commas, like synonyms = "__1_synonyms.txt,
__2_synonyms.txt, __3synonyms.txt"

Or is there a better way of doing that? Will the bigger file, when broken
down to smaller chunks, be uploaded to zookeeper as well?

Please help or please guide me to relevant documentation regarding this.

Thank you.

Regards.
Salmaan.



Re: Solr-8.1.0 uses much more memory

2019-05-27 Thread Bernd Fehling

I think it is not fair blaming Solr for not also having a load balancer.
It is up to you and your needs to set up the required infrastructure,
including load balancing. There are many products available on the market.
If your current system can't handle all requests then install more replicas.

Regards
Bernd

Am 27.05.19 um 10:33 schrieb Joe Doupnik:
     While on the topic of resource consumption and locks etc, there is one other aspect to which Solr has been vulnerable. It is failing to 
fend off too many requests at one time. The standard approach is, of course, known as back pressure, such as not replying to a query until 
resources permit and thus keeping competition outside of the application. That limits resource consumption, including locks, memory and sundry, 
while permitting normal work within to progress smoothly. Let the crowds coming to a hit show queue in the rain outside the theatre until empty 
seats become available.


On 27/05/2019 08:52, Joe Doupnik wrote:
Generalizations tend to fail when confronted with conflicting evidence. The simple  evidence is asking how much real memory the Solr owned 
process has been allocated (top, or ps aux or similar) and that yields two very different values (the ~1.6GB of Solr v8.0 and 4.5+GB of Solr 
v8.1). I have no knowledge of how Java chooses to name its usage (heap or otherwise). Prior to v8.1 Solr memory consumption varied with 
activity, thus memory management was occurring, memory was borrowed from and returned to the system. What might be happening in Solr v8.1 is 
the new memory management code is failing to do a proper job, for reasons which are not visible to us in the field, and that failure is 
important to us.
    In regard to the referenced lock discussion, it would be a good idea to not let the tail wag the dog, tend the common cases and live with 
a few corner case difficulties because perfection is not possible.

    Thanks,
    Joe D.

On 26/05/2019 20:30, Shawn Heisey wrote:

On 5/26/2019 12:52 PM, Joe Doupnik wrote:
 I do queries while indexing, have done so for a long time, without difficulty nor memory usage spikes from dual use. The system has 
been designed to support that.
 Again, one may look at the numbers using "top" or similar. Try Solr v8.0 and 8.1 to see the difference which I experience here. For 
reference, the only memory adjustables set in my configuration is in the Solr startup script solr.in.sh saying add "-Xss1024k" in the 
SOLR_OPTS list and setting SOLR_HEAP="4024m".


There is one significant difference between 8.0 and 8.1 in the realm of memory management -- we have switched from the CMS garbage collector 
to the G1 collector.  So the way that Java manages the heap has changed. This was done because the CMS collector is slated for removal from 
Java.


https://issues.apache.org/jira/browse/SOLR-13394

Java is unlike other programs in one respect -- once it allocates heap from the OS, it never gives it back.  This behavior has given Java an 
undeserved reputation as a memory hog ... but in fact Java's overall memory usage can be very easily limited ... an option that many other 
programs do NOT have.


In your configuration, you set the max heap to a little less than 4GB. You have to expect that it *WILL* use that memory.  By using the 
SOLR_HEAP variable, you have instructed Solr's startup script to use the same setting for the minimum heap as well as the maximum heap. This 
is the design intent.


If you want to know how much heap is being used, you can't ask the operating system, which means tools like top.  You have to ask Java. And 
you will have to look at a long-term graph, finding the low points. An instantaneous look at Java's heap usage could show you that the whole 
heap is allocated ... but a significant part of that allocation could be garbage, which becomes available once the garbage is collected.


Thanks,
Shawn






Re: My problem with T-shirts and nested documents

2019-05-24 Thread Bernd Fehling

How about "Pivot (Decision Tree) Faceting"?

http://lucene.apache.org/solr/guide/6_6/faceting.html#Faceting-Pivot_DecisionTree_Faceting
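
For illustration, a request along those lines could look like this (the field
names size, color and quantity on the child documents are assumptions):

q=*:*&rows=0&facet=true&facet.pivot={!stats=piv}size,color&stats=true&stats.field={!tag=piv}quantity

Each size/color pivot bucket then carries statistics (count, sum, ...) for
quantity. Note that this alone does not sort the buckets by that sum, it is
only a starting point.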

Regards
Bernd


Am 24.05.19 um 14:16 schrieb Gian Marco Tagliani:

Hi all,
I'm facing a problem with Nested Documents.

To illustrate my problem I'll use the example with T-shirts in stock.
For every model of a T-shirt, we can have different colors and sizes, for
each combination we have the number of items in stock.

In Solr, for every model we have a document, for every combination of color
and size we have a nested child document.


model A
 - color : red, size M, quantity 8
 - color : blue, size L, quantity 4
 - color : white, size M, quantity 1

model B
 - color yellow, size S, quantity 7
 - color yellow, size M, quantity 3

model C
 - color red, size M, quantity 5
 - color black, size L, quantity 6


I'm interested in size M only, and I want to know our stock ordered by
quantity.

model A, color red, quantity 8
model C, color red, quantity 5
model B, color yellow, quantity 3
model A, color white, quantity 1



My first idea was using the Json Nested Facet (
https://lucene.apache.org/solr/guide/json-facet-api.html#nested-facet-example
)
In that case I'm not able to sort by quantity nor discriminate between the
"color red" and "color white" lines for model A.

My second idea was to use the Analytics Component (
https://lucene.apache.org/solr/guide/analytics.html)
In this case I'm not able to get data from father and child document to
build a facet.

Has any of you encountered a similar problem? Do you have any idea on how
to address my case?


Thanks in advance
Gian Marco Tagliani



Re: Ignore faceting for particular fields in solr using Solrconfig.xml

2019-05-23 Thread Bernd Fehling

Have a look at "invariants" for your requestHandler in solrconfig.xml.
It might be an option for you.
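For illustration, something like this in solrconfig.xml (the handler name and
the field names are just examples) forces the faceted fields regardless of
what the client sends:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="invariants">
    <!-- facet.field values sent by clients are ignored, only these fields are faceted -->
    <str name="facet.field">category</str>
    <str name="facet.field">brand</str>
  </lst>
</requestHandler>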

Regards
Bernd


Am 22.05.19 um 22:23 schrieb RaviTeja:

Hello Solr Expert,

How are you?

Am trying to ignore faceting for some of the fields. Can you please help me
out to ignore faceting using solrconfig.xml.
I tried but I can ignore faceting all the fields that useless. I'm trying
to ignore some specific fields.

Really Appreciate your help for the response!

Regards,
Ravi



Re: Solr query takes a too much time in Solr 6.1.0

2019-05-13 Thread Bernd Fehling

Your "sort" parameter has "sort=id+desc,id+desc".
1. It doesn't make sense to have a sort on "id" in descending order twice.
2. Be aware that the id field has the highest cardinality.
3. To speed up sorting, have a separate field with docValues=true for sorting.
   E.g.
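   A minimal sketch of such a field definition (field and type names are just examples):

   <field name="id_sort" type="string" indexed="false" stored="false" docValues="true"/>
   <copyField source="id" dest="id_sort"/>

   and then sort on id_sort instead of id.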




Regards
Bernd


Am 10.05.19 um 15:32 schrieb vishal patel:

We have 2 shards and 2 replicas in the live environment. We have multiple 
collections.
Sometimes a query takes a lot of time (QTime=52552), while so many other 
documents are being indexed and searched within milliseconds.
When we executed the same query again using the admin panel, it did not take 
much time and completed within 20 milliseconds.

My Solr Logs :
2019-05-10 09:48:56.744 INFO  (qtp1239731077-128223) [c:actionscomments s:shard1 r:core_node1 
x:actionscomments] o.a.s.c.S.Request [actionscomments]  webapp=/solr path=/select 
params={q=%2Bproject_id:(2102117)%2Brecipient_id:(4642365)+%2Bentity_type:(1)+-action_id:(20+32)+%2Baction_status:(0)+%2Bis_active:(true)+%2B(is_formtype_active:true)+%2B(appType:1)=s1.example.com:8983/solr/actionscomments|s1r1.example.com:8983/solr/actionscomments,s2.example.com:8983/solr/actionscomments|s2r1.example.com:8983/solr/actionscomments=off=true=id=0=id+desc,id+desc==1}
 hits=198 status=0 QTime=52552
2019-05-10 09:48:56.744 INFO  (qtp1239731077-127998) [c:actionscomments s:shard1 r:core_node1 
x:actionscomments] o.a.s.c.S.Request [actionscomments]  webapp=/solr path=/select 
params={q=%2Bproject_id:(2102117)%2Brecipient_id:(4642365)+%2Bentity_type:(1)+-action_id:(20+32)+%2Baction_status:(0)+%2Bis_active:(true)+%2Bdue_date:[2019-05-09T19:30:00Z+TO+2019-05-09T19:30:00Z%2B1DAY]+%2B(is_formtype_active:true)+%2B(appType:1)=s1.example.com:8983/solr/actionscomments|s1r1.example.com:8983/solr/actionscomments,s2.example.com:8983/solr/actionscomments|s2r1.example.com:8983/solr/actionscomments=off=true=id=0=id+desc,id+desc==1}
 hits=0 status=0 QTime=51970
2019-05-10 09:48:56.746 INFO  (qtp1239731077-128224) [c:actionscomments s:shard1 r:core_node1 
x:actionscomments] o.a.s.c.S.Request [actionscomments]  webapp=/solr path=/select 
params={q=%2Bproject_id:(2121600+2115171+2104206)%2Brecipient_id:(2834330)+%2Bentity_type:(2)+-action_id:(20+32)+%2Baction_status:(0)+%2Bis_active:(true)+%2Bdue_date:[2019-05-10T00:00:00Z+TO+2019-05-10T00:00:00Z%2B1DAY]=s1.example.com:8983/solr/actionscomments|s1r1.example.com:8983/solr/actionscomments,s2.example.com:8983/solr/actionscomments|s2r1.example.com:8983/solr/actionscomments=off=true=id=0=id+desc,id+desc==1}
 hits=98 status=0 QTime=51402


My schema fields below :












What could be the problem here? Why does the query take so much time at that moment?

Sent from Outlook



Re: Solr monitoring

2019-04-30 Thread Bernd Fehling

I would say yes.
https://lucene.apache.org/solr/guide/7_3/monitoring-solr-with-prometheus-and-grafana.html


Am 30.04.19 um 13:30 schrieb shruti suri:

Prometheus with grafana can be used?

Thanks
Shruti Suri



-
Regards
Shruti
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Solr monitoring

2019-04-30 Thread Bernd Fehling



We use munin with solr plugin but you can also use zabbix with solr plugin.

But there are many more.

Even Oracle has monitoring (Java Mission Control with Java Flight Recorder).

Regards,
Bernd


Am 30.04.19 um 13:09 schrieb shruti suri:

Hi Emir,

Is there any open source tool for monitoring.

Thanks
Shruti



-
Regards
Shruti
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html





SOLR / Lucene which openJDK to use

2019-04-29 Thread Bernd Fehling

Hi list,

while going to change my JAVA from Oracle to openJDK the big question is
which distribution to take?

Currently we use Oracle JDK Java SE 8 because of LTS.
Next would be JDK Java SE 11 again because of LTS but now we have to
change to openJDK.

Any recommendations about openJDK 11 distributions?
( https://www.baeldung.com/oracle-jdk-vs-openjdk )

Also any pros and cons about the different distributions?

What about the SOLR components (Tika, UIMA, Zookeeper, Jetty, ...)
are they all tested with openJDK 11?
( https://wiki.openjdk.java.net/display/quality/Quality+Outreach )

Any own experiences or pitfalls?

What about the recommendation in the Solr Ref Guide?
They point to Oracle:
...
If you don’t have the required version, or if the java command is not
found, download and install the latest version from Oracle at
http://www.oracle.com/technetwork/java/javase/downloads/index.html
...
Shouldn't they better point to openJDK?

Regards
Bernd


SolrCloud with separate JAVA instances

2019-04-03 Thread Bernd Fehling

I have SolrCloud with a collection "test1" with 5 shards and 2 replicas across 5 
servers.
This cloud is started at port 8983 on each server.

Now I have a second collection "test2" with 5 shards and 1 replica across the same
5 servers. But this second collection is started in separate JAVA instances at
port 7574 on all 5 servers.

Both JAVA instances use the same zookeeper pool but each collection has its own
config in zookeeper.

If I now use the Admin GUI at port 8983 and select "Cloud"->"Graph" I see both 
collections.
Also with the Admin GUI at port 7574.
And I can select both collection in "Collection Selection" dropdown box.

Why and is this how it should be?

I thought different JAVA instances at different ports are separated from each 
other?

Regards,
Bernd


Re: Solr index slow response

2019-03-19 Thread Bernd Fehling

Isn't there something about largePageTables which must be enabled
in JAVA and also supported by the OS for such huge heaps?

Just a guess.

Am 19.03.19 um 15:01 schrieb Jörn Franke:

It could be an issue with jdk 8 that may not be suitable for such large heaps. 
Have more nodes with smaller heaps (e.g. 31 GB).


Am 18.03.2019 um 11:47 schrieb Aaron Yingcai Sun :

Hello, Solr!


We are having some performance issues when trying to send documents to Solr for 
indexing. The response time is very slow and unpredictable at times.


The Solr server is running on a quite powerful server, 32 CPUs, 400GB RAM, of which 300 
GB is reserved for Solr. While this is happening, CPU usage is around 30% and memory 
usage is 34%. I/O also looks OK according to iotop. SSD disk.


Our application sends 100 documents to Solr per request, JSON encoded. The size 
is around 5MB each time. Sometimes the response time is under 1 second, sometimes 
it can be 300 seconds; the slow responses happen very often.


"Soft AutoCommit: disabled", "Hard AutoCommit: if uncommited for 360ms; if 
100 uncommited docs"


There are around 100 clients sending those documents at the same time, but each 
client makes a blocking call which waits for the HTTP response before sending the next 
one.


I tried to make the number of documents in one request smaller, such as 20, but
I still see slow responses from time to time, like 80 seconds.


Could you give some hints on how to improve the response time? Solr does not 
seem very loaded; there must be a way to make the responses faster.


BRs

//Aaron





Re: only error logging in solr

2019-02-19 Thread Bernd Fehling

After looking into the source code there seems to be nothing in there for
error logging together with the request which produced the error.
I think there is a need for this to log the request along with the error.

Could be done at o.a.s.core.SolrCore.execute() where the INFO logging is also 
located.
And the response from o.a.s.handler.RequestHandlerBase.handleRequest() is
setting rsp.setException(e) which could be used to select logging only
requests which produced an ERROR.
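A rough, untested sketch of what that could look like (the logger name and the
exact placement are assumptions):

// inside SolrCore.execute(handler, req, rsp), after handler.handleRequest(req, rsp):
Exception ex = rsp.getException();
if (ex != null) {
  // raise the log level for failing requests and include the full parameter string
  log.error("Request failed: path={} params={}",
      req.getContext().get("path"), req.getParamString(), ex);
}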

Are there any opinions about this?

Regards
Bernd


Am 18.02.19 um 14:43 schrieb Bernd Fehling:

Hi list,

logging in solr sounds easy but the problem is logging only errors
and the request which produced the error.
I want to log all 4xx and 5xx http and also solr ERROR.

My request_logs from jetty show nothing useful because of POST requests.
Only that a request got HTTP 4xx or 5xx from solr.

INFO log level for solr_logs is not used because of too much log writing at high 
QPS.

My solr_logs should report ERRORs together with the request which produced the ERROR.

Has anyone an idea or solved this problem?

Is it possible to raise the level of a request from INFO to ERROR if
the request produced an ERROR in solr_logs?

Regards
Bernd





only error logging in solr

2019-02-18 Thread Bernd Fehling

Hi list,

logging in solr sounds easy but the problem is logging only errors
and the request which produced the error.
I want to log all 4xx and 5xx http and also solr ERROR.

My request_logs from jetty show nothing useful because of POST requests.
Only that a request got HTTP 4xx or 5xx from solr.

INFO log level for solr_logs is not used because of too much log writing at high 
QPS.

My solr_logs should report ERRORs together with the request which produced the ERROR.

Has anyone an idea or solved this problem?

Is it possible to raise the level of a request from INFO to ERROR if
the request produced an ERROR in solr_logs?

Regards
Bernd





Re: REBALANCELEADERS is not reliable

2019-01-21 Thread Bernd Fehling

Hi Erik,

patches and the new comments look good.
Unfortunately I'm at 6.6.5 and can't test this with my cloud.
Replica (o.a.s.common.cloud.Replica) at 6.6.5 is too far away from 7.6 and up.
And a backport for 6.6.5 is too much rework, if possible at all.

Thanks for solving this issue.

Regards,
Bernd


Am 20.01.19 um 17:04 schrieb Erick Erickson:

Bernd:

I just committed fixes on SOLR-13091 and SOLR-10935 to the repo, if
you wanted to give it a whirl it's ready. By tonight (Sunday) I expect
to change the response format a bit and update the ref guide, although
you'll have to look at the doc changes in the format. There's a new
summary section that gives "Success" or "Failure" that's supposed to
be the only thing you really need to check...

One judgement call I made was that if a replica on a down node is the
preferredLeader, it _can't_ be made leader, but this is still labeled
"Success".

Best,
Erick

On Sun, Jan 13, 2019 at 7:43 PM Erick Erickson  wrote:


Bernd:

I just attached a patch to
https://issues.apache.org/jira/browse/SOLR-13091. It's still rough,
the response from REBALANCELEADERS needs quite a bit of work (lots of
extra stuff in it now, and no overall verification).
I haven't run all the tests, nor precommit.

I wanted to get something up so if you have a test environment that
you can easily test it in you'd have an early chance to play with it.

It's against master, I also haven't tried to backport to 8.0 or 7x
yet. I doubt it'll be a problem, but if it does't apply cleanly let me
know.

Best,
Erick

On Fri, Jan 11, 2019 at 8:33 AM Erick Erickson  wrote:


bq: You have to check if the cores, participating in leadership
election, are _really_
in sync. And this must be done before starting any rebalance.
Sounds ugly... :-(

This _should_ not be necessary. I'll add parenthetically that leader
election has
been extensively re-worked in Solr 7.3+ though because "interesting" things
could happen.

Manipulating the leader election queue is really no different than
having to deal with, say, someone killing the leader un-gracefully. It  should
"just work". That said if you're seeing evidence to the contrary that's reality.

What do you mean by "stats" though? It's perfectly ordinary for there to
be different numbers of _deleted_ documents on various replicas, and
consequently things like term frequencies and doc frequencies being
different. What's emphatically _not_ expected is for there to be different
numbers of "live" docs.

"making sure nodes are in sync" is certainly an option. That should all
be automatic if you pause indexing and issue a commit, _then_
do a rebalance.

I certainly agree that the code is broken and needs to be fixed, but I
also have to ask how many shards are we talking here? The code was
originally written for the case where 100s of leaders could be on the
same node, until you get in to a significant number of leaders on
a single node (10s at least) there haven't been reliable stats showing
that it's a performance issue. If you have threshold numbers where
you've seen it make a material difference it'd be great to share them.

And I won't be getting back to this until the weekend, other urgent
stuff has come up...

Best,
Erick

On Fri, Jan 11, 2019 at 12:58 AM Bernd Fehling
 wrote:


Hi Erik,
yes, I would be happy to test any patches.

Good news, I got rebalance working.
After running the rebalance about 50 times with debugger and watching
the behavior of my problem shard and its core_nodes within my test cloud
I came to the point of failure. I solved it and now it works.

Bad news, rebalance is still not reliable and there are many more
problems and point of failure initiated by rebalanceLeaders or better
by re-queueing the watchlist.

How I located _my_ problem:
Test cloud is 5 server (VM), 5 shards, 3 replica per shard, 1 java
instance per server. 3 separate zookeepers.
My problem, shard2 wasn't willing to rebalance to a specific core_node.
core_nodes related (core_node1, core_node2, core_node10).
core_node10 was the preferredLeader.
It was just changing leadership between core_node1 and core_node2,
back and forth, whenever I called rebalanceLeader.
First step, I stopped the server holding core_node2.
Result, the leadership was staying at core_node1 whenever I called 
rebalanceLeaders.
Second step, from debugger I _forced_ during rebalanceLeaders the
system to give the leadership to core_node10.
Result, there was no leader anymore for that shard. Yes it can happen,
you can end up with a shard having no leader but active core_nodes!!!
To fix this I was giving preferredLeader to core_node1 and called 
rebalanceLeaders.
After that, preferredLeader was set back to core_node10 and I was back
at the point I started, all calls to rebalanceLeaders kept the leader at 
core_node1.

  From the debug logs I got the hint about PeerSync of cores and 
IndexFingerprint.
The stats from my problem core_node10 showed that they differ from leader core_node1.

Re: Solr Size Limitation upto 32 kb limitation

2019-01-18 Thread Bernd Fehling

Hi,

assuming you have a fieldType for "text_general" defined in your schema, change 
from:



to:
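
For example (the field name is just an example):

from:  <field name="content" type="string" indexed="true" stored="true"/>
to:    <field name="content" type="text_general" indexed="true" stored="true"/>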




Regards,
Bernd


Am 18.01.19 um 11:51 schrieb Kranthi Kumar K:

Hi team,

Thank you Erick Erickson, Bernd Fehling, Jan Hoydahl for your suggested 
solutions. I've tried the suggested ones and still we are unable to import files 
having size >32 kb; it is displaying the same error.

Below link has the suggested solutions. Please have a look once.

http://lucene.472066.n3.nabble.com/Solr-Size-Limitation-upto-32-KB-files-td4419779.html


   1.  As per Erick Erickson, I've changed the string type to a text-based type 
and still the issue occurs.

I've changed from :







Changed to:







If we do so, it is showing error in the log, please find the error in the 
attachment.



If I change to:







It is not showing any error , but the issue still exists.



   1.  As per Jan Hoydahl, I have gone through the link that you have provided 
and checked 'requestParsers' tag in solrconfig.xml,



RequestParsers tag in our application is as follows:



''

The requestParsers setting we are using and the one in the link you provided are similar. 
And still we are unable to import files of size >32 kb.



   1.  As per Bernd Fehling, we are using Solr 4.10.2. you have mentioned as,
'If you are trying to add larger content then you have to "chop" that
by yourself and add it as multivalued. Can be done within a self written 
loader. '


I'm a newbie to Solr and I didn't get what exactly a 'self-written loader' is.



Could you please provide us sample code, that helps us to go further?


Thanks & Regards,
Kranthi Kumar.K,
Software Engineer,
Ccube Fintech Global Services Pvt Ltd.,
Email/Skype: 
kranthikuma...@ccubefintech.com<mailto:kranthikuma...@ccubefintech.com>,
Mobile: +91-8978078449.


From: Kranthi Kumar K 
Sent: Thursday, January 17, 2019 12:43 PM
To: d...@lucene.apache.org; solr-user@lucene.apache.org
Cc: Ananda Babu medida ; Srinivasa Reddy Karri 
; Michelle Ngo 
Subject: Re: Solr Size Limitation upto 32 kb limitation


Hi Team,



Can we have any updates on the below issue? We are awaiting your reply.



Thanks,

Kranthi kumar.K


From: Kranthi Kumar K
Sent: Friday, January 4, 2019 5:01:38 PM
To: d...@lucene.apache.org<mailto:d...@lucene.apache.org>
Cc: Ananda Babu medida; Srinivasa Reddy Karri
Subject: Solr Size Limitation upto 32 kb limitation


Hi team,



We are currently using Solr 4.2.1 version in our project and everything is 
going well. But recently, we are facing an issue with Solr Data Import. It is 
not importing the files with size greater than 32766 bytes (i.e, 32 kb) and 
showing 2 exceptions:



   1.  java.lang.illegalargumentexception
   2.  org.apache.lucene.util.bytesref hash$maxbyteslengthexceededexception



Please find the attached screenshot for reference.



We have searched for solutions in many forums and didn't find the exact 
solution for this issue. Interestingly, we found in an article that changing 
the type of the 'field' from 'string' to 'text_general' might solve the issue. 
Please have a look in the below forum:



https://stackoverflow.com/questions/29445323/adding-a-document-to-the-index-in-solr-document-contains-at-least-one-immense-t



Schema.xml:

Changed from:

''



Changed to:

''



We have tried it but still it is not importing the files > 32 KB or 32766 bytes.



Could you please let us know the solution to fix this issue? We'll be awaiting 
your reply.




Re: REBALANCELEADERS is not reliable

2019-01-11 Thread Bernd Fehling

Hi Erik,
yes, I would be happy to test any patches.

Good news, I got rebalance working.
After running the rebalance about 50 times with debugger and watching
the behavior of my problem shard and its core_nodes within my test cloud
I came to the point of failure. I solved it and now it works.

Bad news, rebalance is still not reliable and there are many more
problems and point of failure initiated by rebalanceLeaders or better
by re-queueing the watchlist.

How I located _my_ problem:
Test cloud is 5 server (VM), 5 shards, 3 replica per shard, 1 java
instance per server. 3 separate zookeepers.
My problem, shard2 wasn't willing to rebalance to a specific core_node.
core_nodes related (core_node1, core_node2, core_node10).
core_node10 was the preferredLeader.
It was just changing leadership between core_node1 and core_node2,
back and forth, whenever I called rebalanceLeader.
First step, I stopped the server holding core_node2.
Result, the leadership was staying at core_node1 whenever I called 
rebalanceLeaders.
Second step, from debugger I _forced_ during rebalanceLeaders the
system to give the leadership to core_node10.
Result, there was no leader anymore for that shard. Yes it can happen,
you can end up with a shard having no leader but active core_nodes!!!
To fix this I was giving preferredLeader to core_node1 and called 
rebalanceLeaders.
After that, preferredLeader was set back to core_node10 and I was back
at the point I started, all calls to rebalanceLeaders kept the leader at 
core_node1.

From the debug logs I got the hint about PeerSync of cores and IndexFingerprint.
The stats from my problem core_node10 showed that they differ from leader 
core_node1.
And the system notices the difference, starts a PeerSync and ends with success.
But actually the PeerSync seem to fail, because the stats of core_node1 and
core_node10 still differ afterwards.
Solution, I also stopped my server holding my problem core_node10, wiped all 
data
directories and started that server again. The core_nodes where rebuilt from 
leader
and now they are really in sync.
Calling now rebalanceLeaders ended now with success to preferredLeader.

My guess:
You have to check if the cores, participating in leadership election, are 
_really_
in sync. And this must be done before starting any rebalance.
Sounds ugly... :-(

Next question, why is PeerSync not reporting an error?
There is an info about "PeerSync START", "PeerSync Received 0 versions from ... 
fingeprint:null"
and "PeerSync DONE. sync succeeded" but the cores are not really in sync.

Another test I did (with my new knowledge about synced cores):
- Removing all preferredLeader properties
- stopping, wiping data directory, starting all server one by one to get
  all cores of all shards in sync
- setting one preferredLeader for each shard but different from the actual 
leader
- calling rebalanceLeaders succeeded only at 2 shards with the first run,
  not for all 5 shards (even with really all cores in sync).
- after calling rebalanceLeaders again the other shards succeeded also.
Result, rebalanceLeaders is still not reliable.

I have to mention that I have about 520.000 docs per core in my test cloud
and that there might also be a timing issue between calling rebalanceLeaders,
detecting that cores to become leader are not in sync with actual leader,
and resync while waiting for new leader election.

So far,
Bernd


Am 10.01.19 um 17:02 schrieb Erick Erickson:

Bernd:

Don't feel bad about missing it, I wrote the silly stuff and it took me
some time to remember.

Those are  the rules.

It's always humbling to look back at my own code and say "that
idiot should have put some comments in here..." ;)

yeah, I agree there are a lot of moving parts here. I have a note to
myself to provide better feedback in the response. You're absolutely
right that we fire all these commands and hope they all work.  Just
returning "success" status doesn't guarantee leadership change.

I'll be on another task the rest of this week, but I should be able
to dress things up over the weekend. That'll give you a patch to test
if you're willing.

The actual code changes are pretty minimal, the bulk of the patch
will be the reworked test.

Best,
Erick



Re: REBALANCELEADERS is not reliable

2019-01-10 Thread Bernd Fehling
The code has three problems. I think I have fixes for all of them:

1> assigning the preferredLeader (or any SHARDUNIQUE property) does not
  properly remove that property from other replicas in the shard if
 present. So you may have multiple preferredLeaders in a shard.

2> the code to resolve tied sequence numbers had been changed
  during some refactoring so the wrong node could be elected.

3> the response from the rebalanceleaders command isn't very useful, it's
  on my plate to fix that. Partly it was not reporting useful
  info, and partly your comment from the other day that it returns
  without verifying the leadership has actually changed is well taken. At
  present, it just changes the election queue and assumes that the
  right thing happens. The test code was supposed to point out when
  that assumption was incorrect, but you know the story there.

Currently, the code is pretty ugly in terms of all the junk I put in trying to
track this down, but when I clean it up I'll put up a patch. I added some
code to restart some of the jettys in the test (it's now "@Slow:") that
catches the restart case. Additionally, I changed the test to force
unique properties to be concentrated on a particular node then issue the
BALANCESHARDUNIQUE command to make sure that <1> above
doesn't happen.

Meanwhile, if there's an alternative approach that's simpler I'd be all
for it.

Best,
Erick

On Wed, Jan 9, 2019 at 1:32 AM Bernd Fehling
 wrote:


Yes, your findings are also very strange.
I wonder if we can discover the "inventor" of all this and ask him
how it should work or better how he originally wanted it to work.

Comments in the code (RebalanceLeaders.java) state that it is possible
to have more than one electionNode with the same sequence number.
Absolutely strange.

I wonder why the queue is not rotated until the new and preferred
leader is at front (position 0)?
But why is it a queue anyway?
Wherever I see any java code to get the content from the queue it
is sorted. What is the point of that?

Also, the electionNodes have another attribute with the name "ephemeral".
Where is that for and why is it not tested in TestRebalanceLeaders.java?

Regards, Bernd


Am 09.01.19 um 02:31 schrieb Erick Erickson:

It's weirder than that. In the current test on master, the
assumption is that the node recorded as leader in ZK
is actually the leader, see
TestRebalanceLeaders.checkZkLeadersAgree(). The theory
is that the identified leader node in ZK is actually the leader
after the rebalance command. But you're right, I don't see
an actual check that the collection's status agrees.

That aside, though, there are several problems I'm uncovering

1> BALANCESHARDUNIQUE can wind up with multiple
"preferredLeader" properties defined. Some time between
the original code and now someone refactored a bunch of
code and missed removing a unique property if it was
already assigned and being assigned to another replica
in the same slice.

2> to make it much worse, I've rewritten the tests
extensively and I can beast the rewritten tests 1,000
times and no failures. If I test manually by just issuing
the commands, everything works fine. By "testing manually"
I mean (working with 4 Vms, 10 shards 4 replicas)

create the collection
issue the BALANCESHARDUNIQUE command
issue the REBALANCELEADERS command



However, if instead I

create the collection
issue the BALANCESHARDUNIQUE command
shut down 3 of 4 Solr instances so all the leaders

 are on the same host.

restart the 3 instances
issue the REBALANCELEADERS command then

 it doesn't work.

At least that's what I think I'm seeing, but it makes no
real sense yet.

So I'm first trying to understand why my manual test
fails so regularly, then I can incorporate that setup
into the unit test (I'm thinking of just shutting down
and restarting some of the Jetty instances).

But it's a total mystery to me why restarting Solr instances
should have any effect. But that's certainly not
something that happens in the current test so I have
hopes that tracking that down will lead to understanding
what the invalid assumption I'm making is and we can
test for that too.,

On Tue, Jan 8, 2019 at 1:42 AM Bernd Fehling
 wrote:


Hi Erick,

after some more hours of debugging the rough result is, who ever invented
this leader election did not check if an action returns the estimated
result. There are only checks for exceptions, true/false, new sequence
numbers and so on, but never if a leader election to the preferredleader
really took place.

If doing a rebalanceleaders to preferredleader I also have to check if:
- a rebalance took place
- the preferredleader has really become leader (and not anyone else)

Currently this is not checked and the call rebalanceleaders to preferredleader
is like a shot into the dark with hope of success. And thats why any
problems have never been discovered or reported.

Bernd



Re: how to recover state.json files

2019-01-09 Thread Bernd Fehling

Have you lost dataDir from all zookeepers?

If not, first take a backup of remaining dataDir and then start that zookeeper.
Use ZooInspector to connect to that zookeeper at localhost and get your
state.json including all other configs and settings.
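
Alternatively, the zkcli script shipped with Solr can dump it from the command
line (paths and collection name are just examples):

server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 -cmd get /collections/mycollection/state.json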


Am 09.01.19 um 12:25 schrieb Yogendra Kumar Soni:

How can I find out attributes like shard names and hash ranges with the associated core
names if we lost the state.json file from zookeeper?
core.properties only contains core level information but hash ranges are
not stored there.

Does solr store collection information and shard information anywhere?





Re: REBALANCELEADERS is not reliable

2019-01-09 Thread Bernd Fehling

Yes, your findings are also very strange.
I wonder if we can discover the "inventor" of all this and ask him
how it should work or better how he originally wanted it to work.

Comments in the code (RebalanceLeaders.java) state that it is possible
to have more than one electionNode with the same sequence number.
Absolutely strange.

I wonder why the queue is not rotated until the new and preferred
leader is at front (position 0)?
But why is it a queue anyway?
Wherever I see any java code to get the content from the queue it
is sorted. What is the point of that?

Also, the electionNodes have another attribute with the name "ephemeral".
Where is that for and why is it not tested in TestRebalanceLeaders.java?

Regards, Bernd


Am 09.01.19 um 02:31 schrieb Erick Erickson:

It's weirder than that. In the current test on master, the
assumption is that the node recorded as leader in ZK
is actually the leader, see
TestRebalanceLeaders.checkZkLeadersAgree(). The theory
is that the identified leader node in ZK is actually the leader
after the rebalance command. But you're right, I don't see
an actual check that the collection's status agrees.

That aside, though, there are several problems I'm uncovering

1> BALANCESHARDUNIQUE can wind up with multiple
"preferredLeader" properties defined. Some time between
the original code and now someone refactored a bunch of
code and missed removing a unique property if it was
already assigned and being assigned to another replica
in the same slice.

2> to make it much worse, I've rewritten the tests
extensively and I can beast the rewritten tests 1,000
times and no failures. If I test manually by just issuing
the commands, everything works fine. By "testing manually"
I mean (working with 4 Vms, 10 shards 4 replicas)

create the collection
issue the BALANCESHARDUNIQUE command
issue the REBALANCELEADERS command



However, if instead I

create the collection
issue the BALANCESHARDUNIQUE command
shut down 3 of 4 Solr instances so all the leaders

are on the same host.

restart the 3 instances
issue the REBALANCELEADERS command then

it doesn't work.

At least that's what I think I'm seeing, but it makes no
real sense yet.

So I'm first trying to understand why my manual test
fails so regularly, then I can incorporate that setup
into the unit test (I'm thinking of just shutting down
and restarting some of the Jetty instances).

But it's a total mystery to me why restarting Solr instances
should have any effect. But that's certainly not
something that happens in the current test so I have
hopes that tracking that down will lead to understanding
what the invalid assumption I'm making is and we can
test for that too.,

On Tue, Jan 8, 2019 at 1:42 AM Bernd Fehling
 wrote:


Hi Erick,

after some more hours of debugging the rough result is, who ever invented
this leader election did not check if an action returns the estimated
result. There are only checks for exceptions, true/false, new sequence
numbers and so on, but never if a leader election to the preferredleader
really took place.

If doing a rebalanceleaders to preferredleader I also have to check if:
- a rebalance took place
- the preferredleader has really become leader (and not anyone else)

Currently this is not checked and the call rebalanceleaders to preferredleader
is like a shot into the dark with hope of success. And thats why any
problems have never been discovered or reported.

Bernd


Am 21.12.18 um 18:00 schrieb Erick Erickson:

I looked at the test last night and it's...disturbing. It succeeds
100% of the time. Manual testing seems to fail very often.
Of course it was late and I was a bit cross-eyed, so maybe
I wasn't looking at the manual tests correctly. Or maybe the
test is buggy.

I beasted the test 100x last night and all of them succeeded.

This was with all NRT replicas.

Today I'm going to modify the test into a stand-alone program
to see if it's something in the test environment that causes
it to succeed. I've got to get this to fail as a unit test before I
have confidence in any fixes, and also confidence that things
like this will be caught going forward.

Erick

On Fri, Dec 21, 2018 at 3:59 AM Bernd Fehling
 wrote:


As far as I could see with the debugger there is still a problem in requeueing.

There is a watcher and it is recognized that the watcher is not a 
preferredleader.
So it tries to locate a preferredleader with success.
It then calls makeReplicaFirstWatcher and gets a new sequence number for
the preferredleader replica. But now we have two replicas with the same
sequence number. One replica which already owns that sequence number and
the replica which got the new (and the same) number as new sequence number.
It now tries to solve this with queueNodesWithSameSequence.
Might be something in rejoinElection.
At least the call to rejoinElection seems right. For preferredleader it
is true for rejoinAtHead and for the other replica with the same sequence number
it is false for rejoinAtHead.

Re: REBALANCELEADERS is not reliable

2019-01-08 Thread Bernd Fehling

Hi Erick,

after some more hours of debugging the rough result is: whoever invented
this leader election did not check whether an action returns the expected
result. There are only checks for exceptions, true/false, new sequence
numbers and so on, but never if a leader election to the preferredleader
really took place.

If doing a rebalanceleaders to preferredleader I also have to check if:
- a rebalance took place
- the preferredleader has really become leader (and not anyone else)

Currently this is not checked and the call rebalanceleaders to preferredleader
is like a shot into the dark with hope of success. And that's why any
problems have never been discovered or reported.
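
For reference, the two collections API calls discussed here look roughly like
this (collection, shard and replica names are placeholders):

/admin/collections?action=ADDREPLICAPROP&collection=test&shard=shard1&replica=core_node10&property=preferredLeader&property.value=true
/admin/collections?action=REBALANCELEADERS&collection=test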

Bernd


Am 21.12.18 um 18:00 schrieb Erick Erickson:

I looked at the test last night and it's...disturbing. It succeeds
100% of the time. Manual testing seems to fail very often.
Of course it was late and I was a bit cross-eyed, so maybe
I wasn't looking at the manual tests correctly. Or maybe the
test is buggy.

I beasted the test 100x last night and all of them succeeded.

This was with all NRT replicas.

Today I'm going to modify the test into a stand-alone program
to see if it's something in the test environment that causes
it to succeed. I've got to get this to fail as a unit test before I
have confidence in any fixes, and also confidence that things
like this will be caught going forward.

Erick

On Fri, Dec 21, 2018 at 3:59 AM Bernd Fehling
 wrote:


As far as I could see with the debugger there is still a problem in requeueing.

There is a watcher and it is recognized that the watcher is not a 
preferredleader.
So it tries to locate a preferredleader with success.
It then calls makeReplicaFirstWatcher and gets a new sequence number for
the preferredleader replica. But now we have two replicas with the same
sequence number. One replica which already owns that sequence number and
the replica which got the new (and the same) number as new sequence number.
It now tries to solve this with queueNodesWithSameSequence.
Might be something in rejoinElection.
At least the call to rejoinElection seems right. For preferredleader it
is true for rejoinAtHead and for the other replica with same sequence number
it is false for rejoinAtHead.

A test case should have 3 shards with 3 cores per shard and should try to
set preferredleader to different replicas at random. And then try to
rebalance and check the results.

So far, regards, Bernd


Am 21.12.18 um 07:11 schrieb Erick Erickson:

I'm reworking the test case, so hold off on doing that. If you want to
raise a JIRA, though. please do and attach your patch...

On Thu, Dec 20, 2018 at 10:53 AM Erick Erickson  wrote:


Nothing that I know of was _intentionally_ changed with this between
6x and 7x. That said, nothing that I know of was done to verify that
TLOG and PULL replicas (added in 7x) were handled correctly. There's a
test "TestRebalanceLeaders" for this functionality that has run since
the feature was put in, but it has _not_ been modified to create TLOG
and PULL replicas and test with those.

For this patch to be complete, we should either extend that test or
make another that fails without this patch and succeeds with it.

I'd probably recommend modifying TestRebalanceLeaders to randomly
create TLOG and (maybe) PULL replicas so we'd keep covering the
various cases.

Best,
Erick


On Thu, Dec 20, 2018 at 8:06 AM Bernd Fehling
 wrote:


Hi Vadim,
I just tried it with 6.6.5.
In my test cloud with 5 shards, 5 nodes, 3 cores per node it missed
one shard to become leader. But noticed that one shard already was
leader. No errors or exceptions in logs.
May be I should enable debug logging and try again to see all logging
messages from the patch.

Might be they also changed other parts between 6.6.5 and 7.6.0 so that
it works for you.

I also just changed from zookeeper 3.4.10 to 3.4.13 which works fine,
even with 3.4.10 dataDir. No errors no complains. Seems to be compatible.

Regards, Bernd


Am 20.12.18 um 12:31 schrieb Vadim Ivanov:

Yes! It works!
I have tested RebalanceLeaders today with the patch provided by Endika Posadas. 
(http://lucene.472066.n3.nabble.com/Rebalance-Leaders-Leader-node-deleted-when-rebalancing-leaders-td4417040.html)
And at last it works as expected on my collection with 5 nodes and about 400 
shards.
Original patch was slightly incompatible with 7.6.0
I hope this patch will help to try this feature with 7.6
https://drive.google.com/file/d/19z_MPjxItGyghTjXr6zTCVsiSJg1tN20

RebalanceLeaders was not very useful feature before 7.0 (as all replicas were 
NRT)
But new replica types made it very helpful to keep big clusters in order...

I wonder why there isn't any jira about this case (or maybe I missed it)?
Anyone who cares, please help to create a jira and improve this feature in the 
nearest release.



Re: Solr Replication

2019-01-07 Thread Bernd Fehling

In SolrCloud terms these are data centers:
your Cluster 1 is Data Center 1 and your Cluster 2 is Data Center 2.
You can then use CDCR (Cross Data Center Replication).
http://lucene.apache.org/solr/guide/7_0/cross-data-center-replication-cdcr.html
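
A minimal sketch of the source-side configuration in solrconfig.xml (the zkHost
and collection names are placeholders; the target cluster additionally needs its
own /cdcr handler and the CdcrUpdateProcessor as described in the guide):

<requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
  <lst name="replica">
    <str name="zkHost">bkpzk1:2181</str>
    <str name="source">mycollection</str>
    <str name="target">mycollection</str>
  </lst>
  <lst name="replicator">
    <str name="threadPoolSize">2</str>
    <str name="schedule">1000</str>
    <str name="batchSize">128</str>
  </lst>
  <lst name="updateLogSynchronizer">
    <str name="schedule">1000</str>
  </lst>
</requestHandler>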

Nevertheless I would give your Cluster 2 another 2 zookeeper instances.

Regards, Bernd

Am 07.01.19 um 06:39 schrieb Mannar mannan:

Hi All,

I would like to configure master slave between two solr cloud clusters (for
failover). Below is the scenario

Solr version : 7.0

Cluster 1:
3 zookeeper instances :   zk1, zk2, zk3
2 solr instances : solr1, solr2

Cluster 2:
1 zookeeper instance : bkpzk1,
2 solr instances : bkpsolr1, bkpsolr2

Master / Slave :  solr1 / bkpsolr1
   solr2 / bkpsolr2

Is it possible to have master / slave replication configured for solr
instances running in cluster1 & cluster2 (for failover). Kindly let me know
the possibility.



Re: Solr Size Limitation upto 32 KB files

2019-01-02 Thread Bernd Fehling

Hi,
I don't know the limits about Solr 4.2.1 but the RefGuide of Solr 6.6
says about Field Types for Class StrField:
"String (UTF-8 encoded string or Unicode). Strings are intended for
small fields and are not tokenized or analyzed in any way.
They have a hard limit of slightly less than 32K."

If you are trying to add larger content then you have to "chop" that
by yourself and add it as multivalued. This can be done within a self-written loader.
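
As an illustration only, a minimal sketch of such a loader in SolrJ (the
collection URL, field names and chunk size are assumptions; on Solr 4.x the
client class is HttpSolrServer instead of HttpSolrClient, but the chopping
logic stays the same):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ChoppingLoader {
  // 8000 chars stays safely below the 32766 byte limit, even with multi-byte UTF-8
  private static final int CHUNK = 8000;

  public static void main(String[] args) throws Exception {
    String hugeText = args[0];  // the oversized content, read from wherever it comes from
    SolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    for (String part : chop(hugeText, CHUNK)) {
      doc.addField("content_txt", part);  // content_txt must be multiValued in the schema
    }
    client.add(doc);
    client.commit();
    client.close();
  }

  // split the text into pieces of at most 'size' characters
  static List<String> chop(String text, int size) {
    List<String> parts = new ArrayList<>();
    for (int i = 0; i < text.length(); i += size) {
      parts.add(text.substring(i, Math.min(text.length(), i + size)));
    }
    return parts;
  }
}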

Don't forget, Solr/Lucene is an indexer and not a fulltext engine.

Regards
Bernd


Am 02.01.19 um 10:23 schrieb Kranthi Kumar K:

Hi,

We are currently using Solr 4.2.1 version in our project and everything is 
going well. But recently, we are facing an issue with Solr Data Import. It is 
not importing the files with size greater than 32766 bytes (i.e, 32 kb) and 
showing 2 exceptions:


   1.  java.lang.illegalargumentexception
   2.  org.apache.lucene.util.bytesref hash$maxbyteslengthexceededexception


Please find the attached screenshot for reference.

We have searched for solutions in many forums and didn't find the exact 
solution for this issue. Interestingly, we found in an article that changing 
the type of the 'field' from 'string' to 'text_general' might solve the issue. 
Please have a look in the below forum:

https://stackoverflow.com/questions/29445323/adding-a-document-to-the-index-in-solr-document-contains-at-least-one-immense-t

Schema.xml:
Changed from:
''

Changed to:
''

We have tried it but still it is not importing the files > 32 KB or 32766 bytes.

Could you please let us know the solution to fix this issue? We'll be awaiting 
your reply.


Thanks & Regards,
Kranthi Kumar.K,
Software Engineer,
Ccube Fintech Global Services Pvt Ltd.,
Email/Skype: 
kranthikuma...@ccubefintech.com,
Mobile: +91-8978078449.





Re: REBALANCELEADERS is not reliable

2018-12-21 Thread Bernd Fehling

As far as I could see with the debugger there is still a problem in requeueing.

There is a watcher and it is recognized that the watcher is not a 
preferredleader.
So it tries to locate a preferredleader with success.
It then calls makeReplicaFirstWatcher and gets a new sequence number for
the preferredleader replica. But now we have two replicas with the same
sequence number. One replica which already owns that sequence number and
the replica which got the new (and the same) number as new sequence number.
It now tries to solve this with queueNodesWithSameSequence.
Might be something in rejoinElection.
At least the call to rejoinElection seems right. For preferredleader it
is true for rejoinAtHead and for the other replica with same sequence number
it is false for rejoinAtHead.

A test case should have 3 shards with 3 cores per shard and should try to
set preferredleader to different replicas at random. And then try to
rebalance and check the results.

So far, regards, Bernd


Am 21.12.18 um 07:11 schrieb Erick Erickson:

I'm reworking the test case, so hold off on doing that. If you want to
raise a JIRA, though. please do and attach your patch...

On Thu, Dec 20, 2018 at 10:53 AM Erick Erickson  wrote:


Nothing that I know of was _intentionally_ changed with this between
6x and 7x. That said, nothing that I know of was done to verify that
TLOG and PULL replicas (added in 7x) were handled correctly. There's a
test "TestRebalanceLeaders" for this functionality that has run since
the feature was put in, but it has _not_ been modified to create TLOG
and PULL replicas and test with those.

For this patch to be complete, we should either extend that test or
make another that fails without this patch and succeeds with it.

I'd probably recommend modifying TestRebalanceLeaders to randomly
create TLOG and (maybe) PULL replicas so we'd keep covering the
various cases.

Best,
Erick


On Thu, Dec 20, 2018 at 8:06 AM Bernd Fehling
 wrote:


Hi Vadim,
I just tried it with 6.6.5.
In my test cloud with 5 shards, 5 nodes, 3 cores per node it missed
one shard to become leader. But noticed that one shard already was
leader. No errors or exceptions in logs.
May be I should enable debug logging and try again to see all logging
messages from the patch.

Might be they also changed other parts between 6.6.5 and 7.6.0 so that
it works for you.

I also just changed from zookeeper 3.4.10 to 3.4.13 which works fine,
even with 3.4.10 dataDir. No errors no complains. Seems to be compatible.

Regards, Bernd


Am 20.12.18 um 12:31 schrieb Vadim Ivanov:

Yes! It works!
I have tested RebalanceLeaders today with the patch provided by Endika Posadas. 
(http://lucene.472066.n3.nabble.com/Rebalance-Leaders-Leader-node-deleted-when-rebalancing-leaders-td4417040.html)
And at last it works as expected on my collection with 5 nodes and about 400 
shards.
Original patch was slightly incompatible with 7.6.0
I hope this patch will help to try this feature with 7.6
https://drive.google.com/file/d/19z_MPjxItGyghTjXr6zTCVsiSJg1tN20

RebalanceLeaders was not very useful feature before 7.0 (as all replicas were 
NRT)
But new replica types made it very helpful to keep big clusters in order...

I wonder why there isn't any jira about this case (or maybe I missed it)?
Anyone who cares, please help to create a jira and improve this feature in the 
nearest release.



Re: REBALANCELEADERS is not reliable

2018-12-20 Thread Bernd Fehling

Hi Vadim,
I just tried it with 6.6.5.
In my test cloud with 5 shards, 5 nodes, 3 cores per node it missed
one shard to become leader. But noticed that one shard already was
leader. No errors or exceptions in logs.
Maybe I should enable debug logging and try again to see all logging
messages from the patch.

Might be they also changed other parts between 6.6.5 and 7.6.0 so that
it works for you.

I also just changed from zookeeper 3.4.10 to 3.4.13 which works fine,
even with 3.4.10 dataDir. No errors no complains. Seems to be compatible.

Regards, Bernd


Am 20.12.18 um 12:31 schrieb Vadim Ivanov:

Yes! It works!
I have tested RebalanceLeaders today with the patch provided by Endika Posadas. 
(http://lucene.472066.n3.nabble.com/Rebalance-Leaders-Leader-node-deleted-when-rebalancing-leaders-td4417040.html)
And at last it works as expected on my collection with 5 nodes and about 400 
shards.
Original patch was slightly incompatible with 7.6.0
I hope this patch will help to try this feature with 7.6
https://drive.google.com/file/d/19z_MPjxItGyghTjXr6zTCVsiSJg1tN20

RebalanceLeaders was not very useful feature before 7.0 (as all replicas were 
NRT)
But new replica types made it very helpful to keep big clusters in order...

I wonder why there isn't any jira about this case (or maybe I missed it)?
Anyone who cares, please help to create a jira and improve this feature in the 
nearest release.



which Zookeper version for Solr 6.6.5

2018-12-14 Thread Bernd Fehling

This question sounds simple but nevertheless its spinning in my head.

While using Solr 6.6.5 in Cloud mode, which has Apache ZooKeeper 3.4.10
in its list of "Major Components": is it possible to use
Apache ZooKeeper 3.4.13 as a stand-alone ensemble together with SolrCloud 6.6.5,
or do I have to recompile SolrCloud 6.6.5 with the ZooKeeper 3.4.13 libraries?

Regards
Bernd


Re: REBALANCELEADERS is not reliable

2018-12-07 Thread Bernd Fehling

Thanks for looking this up.
It could be a hint where to jump into the code.
I wonder why they rejected a jira ticket about this problem?

Regards, Bernd

Am 06.12.18 um 16:31 schrieb Vadim Ivanov:

Is solr-dev forum I came across this post
http://lucene.472066.n3.nabble.com/Rebalance-Leaders-Leader-node-deleted-when-rebalancing-leaders-td4417040.html
May be it will shed some light?



-Original Message-
From: Atita Arora [mailto:atitaar...@gmail.com]
Sent: Thursday, November 29, 2018 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: REBALANCELEADERS is not reliable

Indeed, I tried that on 7.4 & 7.5 too, and it did not work for me either,
even with the preferredLeader property as recommended in the
documentation.
I handled it with a little hack, but certainly this didn't work as expected.
I can provide more details if there's a ticket.

On Thu, Nov 29, 2018 at 8:42 PM Aman Tandon
 wrote:


++ correction

On Fri, Nov 30, 2018, 01:10 Aman Tandon 
wrote:



For me today, I deleted the leader replica of one shard of a two-shard
collection. Then the other replicas of that shard weren't getting elected as
leader.

After waiting for a long time I tried setting addreplicaprop preferredLeader
on one of the replicas, then tried FORCELEADER but no luck. Then I also tried
rebalance but no help. Finally I had to recreate the whole collection.

Not sure what the issue was, but both FORCELEADER and REBALANCING didn't
work if there was no leader, even though the preferredLeader property was set.

On Wed, Nov 28, 2018, 12:54 Bernd Fehling <

bernd.fehl...@uni-bielefeld.de

wrote:


Hi Vadim,

thanks for confirming.
So it seems to be a general problem with Solr 6.x, 7.x and might
be still there in the most recent versions.

But where to start to debug this problem, is it something not
correctly stored in zookeeper or is overseer the problem?

I was also reading something about a "leader queue" where possible
leaders have to be requeued or something similar.

Maybe I should try to get a situation where a "locked" core
is on the overseer and then connect the debugger to it and step
through it.
Peeking and poking around, like old Commodore 64 days :-)

Regards, Bernd


Am 27.11.18 um 15:47 schrieb Vadim Ivanov:

Hi, Bernd
I have tried REBALANCELEADERS with Solr 6.3 and 7.5
I had very similar results and notion that it's not reliable :(
--
Br, Vadim


-Original Message-
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
Sent: Tuesday, November 27, 2018 5:13 PM
To: solr-user@lucene.apache.org
Subject: REBALANCELEADERS is not reliable

Hi list,

unfortunately REBALANCELEADERS is not reliable and the leader
election has unpredictable results with SolrCloud 6.6.5 and
Zookeeper 3.4.10.
Seen with 5 shards / 3 replicas.

- CLUSTERSTATUS reports all replicas (core_nodes) as state=active.
- setting with ADDREPLICAPROP the property preferredLeader to other

replicas

- calling REBALANCELEADERS
- some leaders have changed, some not.

I then tried:
- removing all preferredLeader properties from replicas, which succeeded.
- trying again REBALANCELEADERS for the rest. No success.
- Shutting down nodes to force the leader to a specific replica left running.
   No success.
- calling REBALANCELEADERS responds that the replica is inactive!!!
- calling CLUSTERSTATUS reports that the replica is active!!!

Also, the replica which don't want to become leader is not in the

list

of collections->[collection_name]->leader_elect->shard1..x->election

Where is CLUSTERSTATUS getting it's state info from?

Has anyone else problems with REBALANCELEADERS?

I noticed that the Reference Guide writes "preferredLeader" (with

capital "L")

but the JAVA code has "preferredleader".

Regards, Bernd












Re: solr crashes

2018-12-06 Thread Bernd Fehling




Am 05.12.18 um 17:11 schrieb Walter Underwood:

I’ve never heard a recommendation to have three times as much RAM as the heap. 
That doesn’t make sense to me.


https://wiki.apache.org/solr/SolrPerformanceProblems#RAM



You might need 3X as much disk space as the index size.

For RAM, it is best to have the sum of:

* JVM heap
* A couple of gigabytes for OS and demons
* RAM for other processes needed on the host (keep to a minimum)
* Enough RAM to hold the entire index

Clearly, you are not going to have enough RAM for a 555 gigabyte index. Well, 
Amazon does have a dozen instance types that can do that, but they are 
expensive.

A 24 GB heap on a 30 GB machine will be pretty tight.

Always set Xms (starting heap) to the same as Xmx (maximum heap). If you set it 
smaller, the JVM will keep increasing the heap until it hits the max before 
doing a full GC. It will always end up with the max setting, but it will have 
to do more work to get there. The setting for initial heap size is about the 
most useless thing in Java.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Dec 4, 2018, at 6:06 AM, Bernd Fehling  
wrote:

Hi Danilo,

Full GC points out that you need more heap which also implies that you need 
more RAM.
Raise your heap to 24GB and your physical RAM to about 75GB or better 96GB.
RAM should be about 3 to 4 times heap size.

Regards, Bernd


Am 04.12.18 um 13:37 schrieb Danilo Tomasoni:

Hello Bernd,
Here I list the extra info you requested:
- actually the virtual machine has 22GB of RAM and 16GB of heap
- my 40 million raw data takes about 1364GB on filesystem (in xml format)
- my index optimized (1 segment, 0 deleted docs) takes about 555GB
- solr 7.3, openjdk 1.8.0_181
- GC logs are like
2018-12-03T07:40:22.302+0100: 28752.505: [Full GC (Allocation Failure) 
2018-12-03T07:40:22.302+0100: 28752.505: [CMS: 12287999K->12287999K(12288000K), 
13.6470083 secs] 15701375K->15701373K(15701376K), [Metaspace: 
37438K->37438K(1083392K)], 13.6470726 secs] [Times: user=13.66 sys=0.00, real=13.64 
secs]
Heap after GC invocations=2108 (full 1501):
  par new generation   total 3413376K, used 3413373K [0x0003d800, 
0x0004d200, 0x0004d200)
   eden space 2730752K,  99% used [0x0003d800, 0x00047eabfdc0, 
0x00047eac)
   from space 682624K,  99% used [0x00047eac, 0x0004a855f8a0, 
0x0004a856)
   to   space 682624K,   0% used [0x0004a856, 0x0004a856, 
0x0004d200)
  concurrent mark-sweep generation total 12288000K, used 12287999K 
[0x0004d200, 0x0007c000, 0x0007c000)
  Metaspace   used 37438K, capacity 38438K, committed 38676K, reserved 
1083392K
   class spaceused 4257K, capacity 4521K, committed 4628K, reserved 1048576K
}
Thank you for your help
Danilo
On 03/12/18 10:36, Bernd Fehling wrote:

Hi Danilo,

you have to give more infos about your system and the config.

- 30gb RAM (physical RAM?) how much heap do you have for JAVA?
- how large (in GByte) are your 40 million raw data being indexed?
- how large is your index (in GByte) with 40 million docs indexed?
- which version of Solr and JAVA?
- do you have JAVA garbage collection logs and if so what are they reporting?
- Any FullGC in GC logs?

Regards, Bernd


Am 03.12.18 um 10:09 schrieb Danilo Tomasoni:

Hello all,

We have a configuration with a single node with 30gb of RAM.

We use it to index ~40MLN of documents.

We perform queries with the edismax parser that often contain edismax parser 
subqueries with the syntax

'_query_:{!edismax mm=X v=$subqueryN}'

Often X == 1.

This solves the "too many boolean clauses" error we got expanding the query 
terms (often phrase queries) directly in the main query.

Unfortunately in this scenario solr often crashes while performing a query, 
even with a single query and no other source of system load.


Do you have any idea of what's going on here?

Otherwise,

What kind of solr configuration parameters do you think I need to investigate 
first?

What kind of log lines should I search for to understand what's going on?


Thank you

Danilo
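
As an aside on the '_query_:{!edismax mm=X v=$subqueryN}' pattern above: the
subqueries are usually passed as dereferenced request parameters. A minimal,
hedged SolrJ sketch (the collection name, field names and phrases below are made
up for illustration, not taken from Danilo's setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SubqueryExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            SolrQuery q = new SolrQuery();
            // main query combines two edismax subqueries, each referencing a parameter
            q.setQuery("_query_:\"{!edismax mm=1 v=$subquery1}\" AND _query_:\"{!edismax mm=1 v=$subquery2}\"");
            // the expanded terms / phrases go into separate parameters referenced by v=$...
            q.set("subquery1", "\"some phrase one\" \"some phrase two\"");
            q.set("subquery2", "title:(term1 OR \"another phrase\")");
            q.setRows(10);
            QueryResponse rsp = client.query(q);
            System.out.println("hits: " + rsp.getResults().getNumFound());
        }
    }
}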





Re: solr crashes

2018-12-04 Thread Bernd Fehling




Am 04.12.18 um 16:47 schrieb Danilo Tomasoni:

Hello Bernd,

Thanks for the suggestion,

the problem is that we don't have 75 GB of RAM.

Are you aware of any way to reduce solr memory usage?


Yes, remove all Faceting, especially those for Fields with high cardinality.

Don't use huge Synonym files which build a Synonyms FST for SpellCheckComponent
used for autocomplete suggestions (e.g. Thesaurus).





Thanks

Danilo

On 04/12/18 15:06, Bernd Fehling wrote:

Hi Danilo,

Full GC points out that you need more heap which also implies that you need 
more RAM.
Raise your heap to 24GB and your physical RAM to about 75GB or better 96GB.
RAM should be about 3 to 4 times heap size.

Regards, Bernd


Am 04.12.18 um 13:37 schrieb Danilo Tomasoni:

Hello Bernd,

Here I list the extra info you requested:

- actually the virtual machine has 22GB of RAM and 16GB of heap

- my 40 million raw data takes about 1364GB on filesystem (in xml format)

- my index optimized (1 segment, 0 deleted docs) takes about 555GB

- solr 7.3, openjdk 1.8.0_181

- GC logs are like

2018-12-03T07:40:22.302+0100: 28752.505: [Full GC (Allocation Failure) 2018-12-03T07:40:22.302+0100: 28752.505: [CMS: 
12287999K->12287999K(12288000K), 13.6470083 secs] 15701375K->15701373K(15701376K), [Metaspace: 37438K->37438K(1083392K)], 13.6470726 secs] 
[Times: user=13.66 sys=0.00, real=13.64 secs]

Heap after GC invocations=2108 (full 1501):
  par new generation   total 3413376K, used 3413373K [0x0003d800, 
0x0004d200, 0x0004d200)
   eden space 2730752K,  99% used [0x0003d800, 0x00047eabfdc0, 
0x00047eac)
   from space 682624K,  99% used [0x00047eac, 0x0004a855f8a0, 
0x0004a856)
   to   space 682624K,   0% used [0x0004a856, 0x0004a856, 
0x0004d200)
  concurrent mark-sweep generation total 12288000K, used 12287999K 
[0x0004d200, 0x0007c000, 0x0007c000)
  Metaspace   used 37438K, capacity 38438K, committed 38676K, reserved 
1083392K
   class space    used 4257K, capacity 4521K, committed 4628K, reserved 1048576K
}


Thank you for your help

Danilo


On 03/12/18 10:36, Bernd Fehling wrote:

Hi Danilo,

you have to give more infos about your system and the config.

- 30gb RAM (physical RAM?) how much heap do you have for JAVA?
- how large (in GByte) are your 40 million raw data being indexed?
- how large is your index (in GByte) with 40 million docs indexed?
- which version of Solr and JAVA?
- do you have JAVA garbage collection logs and if so what are they reporting?
- Any FullGC in GC logs?

Regards, Bernd


Am 03.12.18 um 10:09 schrieb Danilo Tomasoni:

Hello all,

We have a configuration with a single node with 30gb of RAM.

We use it to index ~40MLN of documents.

We perform queries with edismax parser that contain often edismax parser 
subqueries with the syntax

'_query_:{!edismax mm=X v=$subqueryN}'

Often X == 1.

This solves the "too many boolean clauses" error we got expanding the query 
terms (often phrase queries) directly in the main query.

Unfortunately in this scenario solr often crashes while performing a query, 
even with a single query and no other source of system load.


Do you have any idea of what's going on here?

Otherwise,

What kind of solr configuration parameters do you think I need to investigate 
first?

What kind of log lines should I search for to understand what's going on?


Thank you

Danilo





Re: solr crashes

2018-12-04 Thread Bernd Fehling

Hi Danilo,

Full GC points out that you need more heap which also implies that you need 
more RAM.
Raise your heap to 24GB and your physical RAM to about 75GB or better 96GB.
RAM should be about 3 to 4 times heap size.

Regards, Bernd


Am 04.12.18 um 13:37 schrieb Danilo Tomasoni:

Hello Bernd,

Here I list the extra info you requested:

- actually the virtual machine has 22GB of RAM and 16GB of heap

- my 40 million raw data takes about 1364GB on filesystem (in xml format)

- my index optimized (1 segment, 0 deleted docs) takes about 555GB

- solr 7.3, openjdk 1.8.0_181

- GC logs are like

2018-12-03T07:40:22.302+0100: 28752.505: [Full GC (Allocation Failure) 2018-12-03T07:40:22.302+0100: 28752.505: [CMS: 
12287999K->12287999K(12288000K), 13.6470083 secs] 15701375K->15701373K(15701376K), [Metaspace: 37438K->37438K(1083392K)], 13.6470726 secs] 
[Times: user=13.66 sys=0.00, real=13.64 secs]

Heap after GC invocations=2108 (full 1501):
  par new generation   total 3413376K, used 3413373K [0x0003d800, 
0x0004d200, 0x0004d200)
   eden space 2730752K,  99% used [0x0003d800, 0x00047eabfdc0, 
0x00047eac)
   from space 682624K,  99% used [0x00047eac, 0x0004a855f8a0, 
0x0004a856)
   to   space 682624K,   0% used [0x0004a856, 0x0004a856, 
0x0004d200)
  concurrent mark-sweep generation total 12288000K, used 12287999K 
[0x0004d200, 0x0007c000, 0x0007c000)
  Metaspace   used 37438K, capacity 38438K, committed 38676K, reserved 
1083392K
   class space    used 4257K, capacity 4521K, committed 4628K, reserved 1048576K
}


Thank you for your help

Danilo


On 03/12/18 10:36, Bernd Fehling wrote:

Hi Danilo,

you have to give more infos about your system and the config.

- 30gb RAM (physical RAM?) how much heap do you have for JAVA?
- how large (in GByte) are your 40 million raw data being indexed?
- how large is your index (in GByte) with 40 million docs indexed?
- which version of Solr and JAVA?
- do you have JAVA garbage collection logs and if so what are they reporting?
- Any FullGC in GC logs?

Regards, Bernd


Am 03.12.18 um 10:09 schrieb Danilo Tomasoni:

Hello all,

We have a configuration with a single node with 30gb of RAM.

We use it to index ~40MLN of documents.

We perform queries with edismax parser that contain often edismax parser 
subqueries with the syntax

'_query_:{!edismax mm=X v=$subqueryN}'

Often X == 1.

This solves the "too many boolean clauses" error we got expanding the query 
terms (often phrase queries) directly in the main query.

Unfortunately in this scenario solr often crashes while performing a query, 
even with a single query and no other source of system load.


Do you have any idea of what's going on here?

Otherwise,

What kind of solr configuration parameters do you think I need to investigate 
first?

What kind of log lines should I search for to understand what's going on?


Thank you

Danilo



--
*********
Bernd FehlingBielefeld University Library
Dipl.-Inform. (FH)LibTec - Library Technology
Universitätsstr. 25  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de
  https://www.ub.uni-bielefeld.de/~befehl/

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


Re: solr crashes

2018-12-03 Thread Bernd Fehling

Hi Danilo,

you have to give more infos about your system and the config.

- 30gb RAM (physical RAM?) how much heap do you have for JAVA?
- how large (in GByte) are your 40 million raw data being indexed?
- how large is your index (in GByte) with 40 million docs indexed?
- which version of Solr and JAVA?
- do you have JAVA garbage collection logs and if so what are they reporting?
- Any FullGC in GC logs?

Regards, Bernd


Am 03.12.18 um 10:09 schrieb Danilo Tomasoni:

Hello all,

We have a configuration with a single node with 30gb of RAM.

We use it to index ~40MLN of documents.

We perform queries with edismax parser that contain often edismax parser 
subqueries with the syntax

'_query_:{!edismax mm=X v=$subqueryN}'

Often X == 1.

This solves the "too many boolean clauses" error we got expanding the query 
terms (often phrase queries) directly in the main query.

Unfortunately in this scenario solr often crashes while performing a query, 
even with a single query and no other source of system load.


Do you have any idea of what's going on here?

Otherwise,

What kind of solr configuration parameters do you think I need to investigate 
first?

What kind of log lines should I search for to understand what's going on?


Thank you

Danilo



Re: REBALANCELEADERS is not reliable

2018-11-27 Thread Bernd Fehling

Hi Vadim,

thanks for confirming.
So it seems to be a general problem with Solr 6.x, 7.x and might
be still there in the most recent versions.

But where to start to debug this problem, is it something not
correctly stored in zookeeper or is overseer the problem?

I was also reading something about a "leader queue" where possible
leaders have to be requeued or something similar.

Maybe I should try to get a situation where a "locked" core
is on the overseer and then connect the debugger to it and step
through it.
Peeking and poking around, like old Commodore 64 days :-)

Regards, Bernd


Am 27.11.18 um 15:47 schrieb Vadim Ivanov:

Hi, Bernd
I have tried REBALANCELEADERS with Solr 6.3 and 7.5
I had very similar results and notion that it's not reliable :(
--
Br, Vadim


-Original Message-
From: Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de]
Sent: Tuesday, November 27, 2018 5:13 PM
To: solr-user@lucene.apache.org
Subject: REBALANCELEADERS is not reliable

Hi list,

unfortunately REBALANCELEADERS is not reliable and the leader
election has unpredictable results with SolrCloud 6.6.5 and
Zookeeper 3.4.10.
Seen with 5 shards / 3 replicas.

- CLUSTERSTATUS reports all replicas (core_nodes) as state=active.
- setting with ADDREPLICAPROP the property preferredLeader to other replicas
- calling REBALANCELEADERS
- some leaders have changed, some not.

I then tried:
- removing all preferredLeader properties from replicas which succeeded.
- trying again REBALANCELEADERS for the rest. No success.
- Shutting down nodes to force the leader to a specific replica left running.
   No success.
- calling REBALANCELEADERS responds that the replica is inactive!!!
- calling CLUSTERSTATUS reports that the replica is active!!!

Also, the replica which don't want to become leader is not in the list
of collections->[collection_name]->leader_elect->shard1..x->election

Where is CLUSTERSTATUS getting it's state info from?

Has anyone else problems with REBALANCELEADERS?

I noticed that the Reference Guide writes "preferredLeader" (with capital "L")
but the JAVA code has "preferredleader".

Regards, Bernd




REBALANCELEADERS is not reliable

2018-11-27 Thread Bernd Fehling

Hi list,

unfortunately REBALANCELEADERS is not reliable and the leader
election has unpredictable results with SolrCloud 6.6.5 and
Zookeeper 3.4.10.
Seen with 5 shards / 3 replicas.

- CLUSTERSTATUS reports all replicas (core_nodes) as state=active.
- setting with ADDREPLICAPROP the property preferredLeader to other replicas
- calling REBALANCELEADERS
- some leaders have changed, some not.

I then tried:
- removing all preferredLeader properties from replicas which succeeded.
- trying again REBALANCELEADERS for the rest. No success.
- Shutting down nodes to force the leader to a specific replica left running.
  No success.
- calling REBALANCELEADERS responds that the replica is inactive!!!
- calling CLUSTERSTATUS reports that the replica is active!!!

Also, the replica which doesn't want to become leader is not in the list
of collections->[collection_name]->leader_elect->shard1..x->election

Where is CLUSTERSTATUS getting its state info from?

Has anyone else problems with REBALANCELEADERS?

I noticed that the Reference Guide writes "preferredLeader" (with capital "L")
but the JAVA code has "preferredleader".

Regards, Bernd
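
For reference, the two calls used here follow this pattern (collection, shard and
replica names are placeholders; maxAtOnce and maxWaitSeconds are optional):

http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&collection=mycollection&shard=shard1&replica=core_node2&property=preferredLeader&property.value=true

http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mycollection&maxAtOnce=5&maxWaitSeconds=60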


Re: Multiple solr instances per host vs Multiple cores in same solr instance

2018-09-03 Thread Bernd Fehling

Yes, that's right, there is no "best" setup at all, only one that
gives the most advantage for your requirements.
And any setup has some disadvantages.

Currently I'm short on time and have to bring our Cloud to production,
but a write-up is in the queue, as already done for other developments.
https://www.ub.uni-bielefeld.de/~befehl/base/solr/index.html

Regards
Bernd


Am 03.09.2018 um 11:33 schrieb Toke Eskildsen:

On Tue, 2018-08-28 at 09:37 +0200, Bernd Fehling wrote:

Yes, I tested many cases.


Erick is absolutely right about the challenge of finding "best" setups.
What we can do is gather observations, as you have done, and hope that
people with similar use cases finds them. With that in mind, have you
considered posting a write-up of your hard work somewhere? It seems a
shame only to have is as an input on this mailing list.

- Toke Eskildsen, Royal Danish Library



Re: Multiple solr instances per host vs Multiple cores in same solr instance

2018-08-28 Thread Bernd Fehling

Yes, I tested many cases.
As I already mentioned 3 Server as 3x3 SolrCloud cluster.
- 12 Mio. data records from our big single index
- always the same queries (SWD, German keyword norm data)
- Apache jmeter 3.1 for the load (separate server)
- Haproxy 1.6.11 with roundrobin (separate server)
- no autowarming in solr
- always with any setup, one first (cold) run (to see how the system behaves 
with empty caches)
- afterwards two (warm) runs with filled caches from first and second run
- all this with preferLocalShards set to true and false
- and all this with single instance multicore and multi instance multinode.
That was a lot of testing, starting, stopping, loading test data...

The difference between single instance and multi instance was that
single instance per server got 12GB JAVA heap (because it had to handle 3 cores)
and multi instance got 4GB JAVA heap per instance (because each instance had to 
handle just 1 core).

No real difference in CPU/memory utilization, but I used different
heap size between single instance and multi instance (see above).
But the response time with multi instance is much better and gives higher 
performance.
Between 30 and 60 QPS multi instance is about 1.5 times better than single 
instance
in my test case with my test data ... and so on, but the Cloud is much more 
complex.

preferLocalShards really gives advantage in 3x3 or 5x5 SolrCloud but I don't
know how it would compare to say 5x3 (5 server, 5 shards, 3 replicas).

Servers in total:
- 3 VM server on 3 different XEN hosts connected with 2 Gigabit Networks
  (the discs were not SSD as in our production system, just 15k rpm spinning 
discs)
  3 zookeeper, one on each server but separate instances (not the solr internal 
ones)
- 1 extra server for haproxy
- 1 extra server for Apache jmeter

It's hard to tell where the bottleneck is, at least not at 60 QPS and with 
spinning discs.
SSD as storage and separate physical server boxes will increase performance.

I think what matters is how complex your data in the index, your query and 
your query analysis are.
My queries are not very easy: rows=100, facet.limit=100, 9 facet.fields and a boost 
with bq.
If you have rows=10 and facet=false without bq you will get higher performance.

Regards
Bernd


Am 27.08.2018 um 22:45 schrieb Wei:

Thanks Bernd.  Do you have preferLocalShards=true in both cases? Do you
notice CPU/memory utilization difference between the two deployments? How
many servers did you use in total?  I am curious what's the bottleneck for
the one instance and 3 cores configuration.

Thanks,
Wei

On Mon, Aug 27, 2018 at 1:45 AM Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:


My tests with many combinations (instance, node, core) on a 3 server
cluster
with SolrCloud pointed out that highest performance is with multiple solr
instances and shards and replicas placed by rules so that you get advantage
from preferLocalShards=true.

The disadvantage is the handling of the system, which means setup,
starting
and stopping, setting up the shards and replicas with rules and so on.

I tested with 3x3 SolrCloud (3 shards, 3 replicas).
A 3x3 system with one instance and 3 cores per host could handle up to
30QPS.
A 3x3 system with multi instance (different ports, single core and shard
per
instance) could handle 60QPS on same hardware with same data.

Also, the single instance per server setup has spikes in the response time
graph
which are not seen with a multi instance setup.

Tested about 2 months ago with SolrCloud 6.4.2.

Regards,
Bernd


Am 26.08.2018 um 08:00 schrieb Wei:

Hi,

I have a question about the deployment configuration in solr cloud.  When
we need to increase the number of shards in solr cloud, there are two
options:

1.  Run multiple solr instances per host, each with a different port and
hosting a single core for one shard.

2.  Run one solr instance per host, and have multiple cores(shards) in

the

same solr instance.

Which would be better performance wise? For the first option I think JVM
size for each solr instance can be smaller, but deployment is more
complicated? Are there any differences for cpu utilization?

Thanks,
Wei







Re: Multiple solr instances per host vs Multiple cores in same solr instance

2018-08-27 Thread Bernd Fehling

There was no real bottleneck.
I just started with 30QPS and after that just doubled the QPS.
But as you mentioned I used my specific data and analysis, and also
used SWD (German keyword norm data) dictionary for querying.

Regards,
Bernd


Am 27.08.2018 um 15:41 schrieb Jan Høydahl:

What was your bottleneck when maxing on 30QPS on 3 node cluster?
I expect such tests to vary quite much between use cases, so a good approach is 
to do just as you did: benchmark on your specific data and usage.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com


27. aug. 2018 kl. 10:45 skrev Bernd Fehling :

My tests with many combinations (instance, node, core) on a 3 server cluster
with SolrCloud pointed out that highest performance is with multiple solr
instances and shards and replicas placed by rules so that you get advantage
from preferLocalShards=true.

The disadvantage is the handling of the system, which means setup, starting
and stopping, setting up the shards and replicas with rules and so on.

I tested with 3x3 SolrCloud (3 shards, 3 replicas).
A 3x3 system with one instance and 3 cores per host could handle up to 30QPS.
A 3x3 system with multi instance (different ports, single core and shard per
instance) could handle 60QPS on same hardware with same data.

Also, the single instance per server setup has spikes in the response time graph
which are not seen with a multi instance setup.

Tested about 2 months ago with SolrCloud 6.4.2.

Regards,
Bernd


Am 26.08.2018 um 08:00 schrieb Wei:

Hi,
I have a question about the deployment configuration in solr cloud.  When
we need to increase the number of shards in solr cloud, there are two
options:
1.  Run multiple solr instances per host, each with a different port and
hosting a single core for one shard.
2.  Run one solr instance per host, and have multiple cores(shards) in the
same solr instance.
Which would be better performance wise? For the first option I think JVM
size for each solr instance can be smaller, but deployment is more
complicated? Are there any differences for cpu utilization?
Thanks,
Wei





Re: Multiple solr instances per host vs Multiple cores in same solr instance

2018-08-27 Thread Bernd Fehling

My tests with many combinations (instance, node, core) on a 3 server cluster
with SolrCloud pointed out that highest performance is with multiple solr
instances and shards and replicas placed by rules so that you get advantage
from preferLocalShards=true.

The disadvantage is the handling of the system, which means setup, starting
and stopping, setting up the shards and replicas with rules and so on.

I tested with 3x3 SolrCloud (3 shards, 3 replicas).
A 3x3 system with one instance and 3 cores per host could handle up to 30QPS.
A 3x3 system with multi instance (different ports, single core and shard per
instance) could handle 60QPS on same hardware with same data.

Also, the single instance per server setup has spikes in the response time graph
which are not seen with a multi instance setup.

Tested about 2 months ago with SolrCloud 6.4.2.

Regards,
Bernd


Am 26.08.2018 um 08:00 schrieb Wei:

Hi,

I have a question about the deployment configuration in solr cloud.  When
we need to increase the number of shards in solr cloud, there are two
options:

1.  Run multiple solr instances per host, each with a different port and
hosting a single core for one shard.

2.  Run one solr instance per host, and have multiple cores(shards) in the
same solr instance.

Which would be better performance wise? For the first option I think JVM
size for each solr instance can be smaller, but deployment is more
complicated? Are there any differences for cpu utilization?

Thanks,
Wei
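
A multi-instance setup like the one described above is typically started as
separate processes with their own ports and Solr home directories, roughly like
this (ports, paths, heap size and the ZooKeeper ensemble are placeholders):

bin/solr start -cloud -p 8983 -s /var/solr/node1 -z zk1:2181,zk2:2181,zk3:2181 -m 4g
bin/solr start -cloud -p 8984 -s /var/solr/node2 -z zk1:2181,zk2:2181,zk3:2181 -m 4g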



Leader is stuck on offline node

2018-08-09 Thread Bernd Fehling

Something strange happened,
in my Solr 6.6.5 cloud (1 collection, 5 shards, 3 replicas) the
leader is stuck on offline node for shard3.

I already tried setting property preferredLeader to true on the
active core_node5 and called REBALANCELEADERS but nothing happened.
In the response of REBALANCELEADERS was nothing about shard3.

It feels like it doesn't know anything about core_node5.

Any idea how to fix this?


  e666-1998
  active
  

  base1_shard3_replica1
  http://server05.myip.com:8983/solr
  server05.myip.com:8983_solr
  active
  true


  base1_shard3_replica2
  http://server02.myip.com:8983/solr
  server02.myip.com:8983_solr
  down
  true


  base1_shard3_replica3
  http://server03.myip.com:8983/solr
  server03.myip.com:8983_solr
  down

  


Regards,
Bernd
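
One hedged pointer for this kind of situation: the FORCELEADER collections API
call is meant for a shard that has lost its leader, so whether it applies here
depends on the actual state, but it can be worth a look (names taken from the
cluster status above):

http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=base1&shard=shard3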



Re: Index filename while indexing JSON file

2018-05-22 Thread Bernd Fehling

I don't know if DIH can solve your problem but I would go for
a simple self programmed ETL in JAVA and use SolrJ for loading.

Best regards,
Bernd


Am 18.05.2018 um 21:47 schrieb S.Ashwath:

Hello,

I have 2 directories: 1 with txt files and the other with corresponding
JSON (metadata) files (around 9 of each). There is one JSON file for
each CSV file, and they share the same name (they don't share any other
fields).

The txt files just have plain text. I mapped each line to a field called
'sentence' and included the file name as a field using the data import
handler. No problems here.

The JSON file has metadata: 3 tags: a URL, author and title (for the
content in the corresponding txt file).
When I index the JSON file (I just used the _default schema, and posted the
fields to the schema, as explained in the official solr tutorial),* I don't
know how to get the file name into the index as a field.* As far as i know,
that's no way to use the Data import handler for JSON files. I've read that
I can pass a literal through the bin/post tool, but again, as far as I
understand, I can't pass in the file name dynamically as a literal.

I NEED to get the file name, it is the only way in which I can associate
the metadata with each sentence in the txt files in my downstream Python
code.

So if anybody has a suggestion about how I should index the JSON file name
along with the JSON content (or even some workaround), I'd be eternally
grateful.

Regards,

Ash
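
A self-programmed ETL as suggested above can stay quite small. A hedged sketch
(Jackson is used for the JSON parsing; the field names 'filename', 'url',
'author' and 'title' and the paths are assumptions based on the description):

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

import java.io.File;
import java.util.Map;

public class JsonMetadataLoader {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
            File[] jsonFiles = new File("/path/to/json-dir")
                    .listFiles((dir, name) -> name.endsWith(".json"));
            for (File f : jsonFiles) {
                // each JSON file holds url, author and title for the matching txt file
                Map<?, ?> meta = mapper.readValue(f, Map.class);
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", f.getName());
                doc.addField("filename", f.getName());   // the field the question asks for
                doc.addField("url", meta.get("url"));
                doc.addField("author", meta.get("author"));
                doc.addField("title", meta.get("title"));
                client.add(doc);
            }
            client.commit();
        }
    }
}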



Re: question about updates to shard leaders only

2018-05-15 Thread Bernd Fehling



Am 15.05.2018 um 14:33 schrieb Erick Erickson:

You might find this useful:

https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/


I have seen that already and can confirm it.
From my observations about a 3x3 cluster with 3 server and my hardware:
- have at least 6 CPUs on each server to keep search performance during NRT 
indexing
- I tried with batch-/queue-size between 100 and 10000.
--- with a batch size of 100 and nearly even distribution across 3
shards I get about 33 docs per update per shard.
--- with a batch size of 1000 I get about 333 docs per update per shard
--- with a batch size of 10000 it can go up to 3333 docs per shard

Yes, the last is "it can go up to" because the size is obviously too high
and I get lots of smaller updates "FROMLEADER".  So somewhere between
1000 and 10000 is the best size for my 3x3 cluster with my hardware.

Another observation in a 3x3 cluster, a multi-node (3 JVM 4G instances per
server [3 nodes]) outperforms a multi-core (1 JVM 12G instance per
server [3 cores]) due to JAVA GC impact at multi-core.
A multi-node at 60qps has nearly the same performance as a multi-core at 30qps.




One tricky bit: Assuming docs have a random distribution amongst
shards, you should batch so at least 100 docs go to each _shard_. You
can see from the link that the speedup is mostly going from 1 to 100.
So if you have 5 shards, I'd create batches of at least 500. That was
a fairly simple test with stupid-simple docs. Large complicated
documents wouldn't show the same curve.

Setup for PULL and TLOG isn't hard, just specify the number of TLOG or
PULL replicas you want at collection creation time. NOTE: this is only
on Solr 7x. See:
https://lucene.apache.org/solr/guide/7_3/shards-and-indexing-data-in-solrcloud.html#types-of-replicas


Unfortunately I'm still at solr 6.4.2 and therefore have to stay with NRT.



About creating your own queue, mine usually look like
List<SolrInputDocument> list...
while (more docs) {
   list.add(new_doc);
   if (list.size > X) {
      client.add(list);
      list.clear();
   }
}


Yes, mine looks similar, a recursive file traverser with for-loop over files.
But don't forget a final client.add(list) after the while-loop ;-)




Not exactly a sophisticated queue ;).

On Tue, May 15, 2018 at 8:15 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:

Hi Erik,

yes indeed, batching solved it.
I used ConcurrentUpdateSolrClient with queue size of 1 but
CloudSolrClient doesn't have this feature.
I build my own queue now.

Ah!!! So I obviously use default NRT but actually don't need it because
I don't have any NRT data to index. A latency of several hours is OK for me.
Currently I'm testing with a 3x3 core-cluster (3 server, 3 cores per
server).

I also tested with 3x3 node-cluster (3 server, 3 nodes per server) which
performed
better, less influence of GarbageCollection.

I have to read more about PULL or TLOG replicas, how to set this up and so
on.
If it is too complex I will go with NRT; indexing happens during the
night anyway.
Thanks for pointing this out.

Regards,
Bernd


Am 15.05.2018 um 13:28 schrieb Erick Erickson:


What did you do to solve your performance problem?

Batching updates is one thing that helps performance.

bq.  I thought that only the leaders are under load
until any commit and then replicate to the other replicas.

True if (and only if) you're using PULL or TLOG replicas.
When using the default NRT replicas, every replica indexes
the docs, it doesn't matter whether they are the leader or replica.
That's required for NRT. Using CloudSolrClient has no bearing
on that functionality.

Best,
Erick

On Tue, May 15, 2018 at 6:53 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:


Thanks, solved, performance is good now.

Regards,
Bernd


Am 15.05.2018 um 08:12 schrieb Bernd Fehling:



OK, I have the CloudSolrClient with SolrJ now running but it seems
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient sends the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and
cores
are under heavy load. I thought that only the leaders are under load
until any commit and then replicate to the other replicas.
And that the replicas which are no leader have capacity to answer search
requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd



Am 09.05.2018 um 19:15 schrieb Erick Erickson:



You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to take the

Re: question about updates to shard leaders only

2018-05-15 Thread Bernd Fehling

Hi Erik,

yes indeed, batching solved it.
I used ConcurrentUpdateSolrClient with queue size of 1 but
CloudSolrClient doesn't have this feature.
I build my own queue now.

Ah!!! So I obviously use default NRT but actually don't need it because
I don't have any NRT data to index. A latency of several hours is OK for me.
Currently I'm testing with a 3x3 core-cluster (3 server, 3 cores per server).

I also tested with 3x3 node-cluster (3 server, 3 nodes per server) which 
performed
better, less influence of GarbageCollection.

I have to read more about PULL or TLOG replicas, how to set this up and so on.
If it is too complex I will go with NRT; indexing happens during the night anyway.
Thanks for pointing this out.

Regards,
Bernd


Am 15.05.2018 um 13:28 schrieb Erick Erickson:

What did you do to solve your performance problem?

Batching updates is one thing that helps performance.

bq.  I thought that only the leaders are under load
until any commit and then replicate to the other replicas.

True if (and only if) you're using PULL or TLOG replicas.
When using the default NRT replicas, every replica indexes
the docs, it doesn't matter whether they are the leader or replica.
That's required for NRT. Using CloudSolrClient has no bearing
on that functionality.

Best,
Erick

On Tue, May 15, 2018 at 6:53 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:

Thanks, solved, performance is good now.

Regards,
Bernd


Am 15.05.2018 um 08:12 schrieb Bernd Fehling:


OK, I have the CloudSolrClient with SolrJ now running but it seems
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient sends the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores
are under heavy load. I thought that only the leaders are under load
until any commit and then replicate to the other replicas.
And that the replicas which are no leader have capacity to answer search
requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd



Am 09.05.2018 um 19:15 schrieb Erick Erickson:


You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to take the list of
documents you provide into sub-lists that consist of the docs destined
for a particular shard and sends those to the leader.

Do the default not work for you?

Best,
Erick

On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:


Hi list,

while going from single core master/slave to cloud multi core/node
with leader/replica I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and has performance
impacts.
I want to use CloudSolrClient with LBHttpSolrClient and updates
should only go to shard leaders.

Question, what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd


Re: question about updates to shard leaders only

2018-05-15 Thread Bernd Fehling

Thanks, solved, performance is good now.

Regards,
Bernd

Am 15.05.2018 um 08:12 schrieb Bernd Fehling:

OK, I have the CloudSolrClient with SolrJ now running but it seems
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient sends the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores
are under heavy load. I thought that only the leaders are under load
until any commit and then replicate to the other replicas.
And that the replicas which are no leader have capacity to answer search 
requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd



Am 09.05.2018 um 19:15 schrieb Erick Erickson:

You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to take the list of
documents you provide into sub-lists that consist of the docs destined
for a particular shard and sends those to the leader.

Do the default not work for you?

Best,
Erick

On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:

Hi list,

while going from single core master/slave to cloud multi core/node
with leader/replica I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and has performance
impacts.
I want to use CloudSolrClient with LBHttpSolrClient and updates
should only go to shard leaders.

Question, what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd


Re: question about updates to shard leaders only

2018-05-15 Thread Bernd Fehling

OK, I have the CloudSolrClient with SolrJ now running but it seems
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient sends the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores
are under heavy load. I thought that only the leaders are under load
until any commit and then replicate to the other replicas.
And that the replicas which are no leader have capacity to answer search 
requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd



Am 09.05.2018 um 19:15 schrieb Erick Erickson:

You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to take the list of
documents you provide into sub-lists that consist of the docs destined
for a particular shard and sends those to the leader.

Do the default not work for you?

Best,
Erick

On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
<bernd.fehl...@uni-bielefeld.de> wrote:

Hi list,

while going from single core master/slave to cloud multi core/node
with leader/replica I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and has performance
impacts.
I want to use CloudSolrClient with LBHttpSolrClient and updates
should only go to shard leaders.

Question, what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd


question about updates to shard leaders only

2018-05-09 Thread Bernd Fehling
Hi list,

while going from single core master/slave to cloud multi core/node
with leader/replica I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and has performance
impacts.
I want to use CloudSolrClient with LBHttpSolrClient and updates
should only go to shard leaders.

Question, what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd
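
For reference, both options are switches on the CloudSolrClient builder. A
hedged sketch of how they are set (assuming the SolrJ 6.x builder API; the
comments reflect my reading of the Javadoc and are worth double-checking):

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class LeaderRoutingClients {
    public static void main(String[] args) throws Exception {
        String zkHost = "zk1:2181,zk2:2181,zk3:2181";

        // Route each update to the leader of its target shard; the leader then
        // distributes it to the other replicas of that shard as usual.
        try (CloudSolrClient leaderRouted = new CloudSolrClient.Builder()
                .withZkHost(zkHost)
                .sendUpdatesOnlyToShardLeaders()
                .build()) {
            // use leaderRouted.add(...) here
        }

        // Additionally refuse to fall back to a non-leader replica when a
        // leader is unreachable; the update fails instead of being retried
        // against another replica.
        try (CloudSolrClient leadersOnly = new CloudSolrClient.Builder()
                .withZkHost(zkHost)
                .sendDirectUpdatesToShardLeadersOnly()
                .build()) {
            // use leadersOnly.add(...) here
        }
    }
}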


Re: Howto disable PrintGCTimeStamps in Solr

2018-05-08 Thread Bernd Fehling
Hi Shawn,

the goal is to help some GC viewers that get confused if both DateStamps and TimeStamps
are present in the solr_gc.log file, and _not_ to reduce the GC log size; that
would be stupid.
Now I have a Perl script which removes the TimeStamps (and only leaves the
DateStamps) for analysis of solr_gc.log with some GC viewers.
Problem solved :-)

Generally I can understand that DateStamps or TimeStamps are added by default
when logging to a file, but it should only be possible to have one type, not both
at once.

Thanks for filing the bug report, I missed that.

Regards
Bernd


Am 08.05.2018 um 11:32 schrieb Shawn Heisey:
> On 5/7/2018 8:22 AM, Bernd Fehling wrote:
>> thanks for asking, I figured it out this morning.
>> If setting -Xloggc= the option -XX:+PrintGCTimeStamps will be set
>> as default and can't be disabled. It's inside JAVA.
>>
>> Currently using Solr 6.4.2 with
>> Java HotSpot(TM) 64-Bit Server VM (25.121-b13) for linux-amd64 JRE 
>> (1.8.0_121-b13)
> 
> What is the end goal that has you trying to disable PrintGCTimeStamps? 
> Is it to reduce the size of the GC log by only including one timestamp,
> or something else?
> 
> Running java 1.8.0_144, I cannot seem to actually do it.  I tried
> removing the parameter from the start script, and I also tried
> *changing* the parameter to explicitly disable it:
> 
>  -XX:-PrintGCTimeStamps
> 
> Both times, I verified that the commandline had changed.  GC logging
> still includes both the full date stamp, which PrintGCDateStamps
> enables, and seconds since JVM start, which PrintGCTimeStamps enables.
> 
> For the attempt where I changed the parameter instead of removing it,
> this is the full commandline on the running java process that the start
> script executed:
> 
> "C:\Program Files\Java\jdk1.8.0_144\bin\java"  -server -Xms512m -Xmx512m
> -Duser.timezone=UTC -XX:NewRatio=3    -XX:SurvivorRatio=4   
> -XX:TargetSurvivorRatio=90    -XX:MaxTenuringThreshold=8   
> -XX:+UseConcMarkSweepGC    -XX:ConcGCThreads=4
> -XX:ParallelGCThreads=4    -XX:+CMSScavengeBeforeRemark   
> -XX:PretenureSizeThreshold=64m    -XX:+UseCMSInitiatingOccupancyOnly   
> -XX:CMSInitiatingOccupancyFraction=50   
> -XX:CMSMaxAbortablePrecleanTime=6000    -XX:+CMSParallelRemarkEnabled   
> -XX:+ParallelRefProcEnabled    -XX:-OmitStackTraceInFastThrow
> -verbose:gc  -XX:+PrintHeapAtGC  -XX:+PrintGCDetails 
> -XX:-PrintGCTimeStamps  -XX:+PrintGCDateStamps 
> -XX:+PrintTenuringDistribution  -XX:+PrintGCApplicationStoppedTime
> "-Xloggc:C:\Users\sheisey\Downloads\solr-7.3.0\server\logs\solr_gc.log"
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M
> -Xss256k 
> -Dsolr.log.dir="C:\Users\sheisey\Downloads\solr-7.3.0\server\logs"
> -Dlog4j.configuration="file:C:\Users\sheisey\Downloads\solr-7.3.0\server\resources\log4j.properties"
> -DSTOP.PORT=7983 -DSTOP.KEY=solrrocks -Dsolr.log.muteconsole
> -Dsolr.solr.home="C:\Users\sheisey\Downloads\solr-7.3.0\server\solr"
> -Dsolr.install.dir="C:\Users\sheisey\Downloads\solr-7.3.0"
> -Dsolr.default.confdir="C:\Users\sheisey\Downloads\solr-7.3.0\server\solr\configsets\_default\conf"
> 
> -Djetty.host=0.0.0.0 -Djetty.port=8983
> -Djetty.home="C:\Users\sheisey\Downloads\solr-7.3.0\server"
> -Djava.io.tmpdir="C:\Users\sheisey\Downloads\solr-7.3.0\server\tmp" -jar
> start.jar "--module=http" ""
> 
> That change should have done it.  I think we're dealing with a Java
> bug/misfeature.
> 
> Solr 5.5.5 with Java 1.7.0_80, 1.7.0_45, and 1.7.0_04 behave the same as
> 7.3.0 with Java 8.  I have also verified that Solr 4.7.2 with Java
> 1.7.0_72 has the same issue.  I do not have any information for Java 6
> versions.  All java versions examined are from Sun/Oracle.
> 
> I filed a bug with Oracle.  They have accepted it and it is now visible
> publicly.
> 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8202752
> 
> Thanks,
> Shawn
> 


Re: Howto disable PrintGCTimeStamps in Solr

2018-05-07 Thread Bernd Fehling
Hi Dominique,

thanks for asking, I figured it out this morning.
If setting -Xloggc= the option -XX:+PrintGCTimeStamps will be set
as default and can't be disabled. It's inside JAVA.

Currently using Solr 6.4.2 with
Java HotSpot(TM) 64-Bit Server VM (25.121-b13) for linux-amd64 JRE 
(1.8.0_121-b13)


Regards,
Bernd


Am 07.05.2018 um 14:50 schrieb Dominique Bejean:
> Hi,
> 
> Which version of Solr are you using ?
> 
> Regards
> 
> Dominique
> 
> 
> Le ven. 4 mai 2018 à 09:13, Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
> a écrit :
> 
>> Hi list,
>>
>> this sounds simple but I can't disable PrintGCTimeStamps in solr_gc
>> logging.
>> I tried with GC_LOG_OPTS in start scripts and --verbose reporting during
>> start to make sure it is not in Solr start scripts.
>> But if Solr is up and running there are always TimeStamps in solr_gc.log
>> and
>> the file reports at the top with "CommandLine flags:" that the option
>> -XX:+PrintGCTimeStamps has been set.
>> But where?
>>
>> Is it something passed down from Jetty?
>>
>> Regards,
>> Bernd
>>
>>
>>
>> --
> Dominique Béjean
> 06 08 46 12 43
> 


Howto disable PrintGCTimeStamps in Solr

2018-05-04 Thread Bernd Fehling
Hi list,

this sounds simple but I can't disable PrintGCTimeStamps in solr_gc logging.
I tried with GC_LOG_OPTS in start scripts and --verbose reporting during
start to make sure it is not in Solr start scripts.
But if Solr is up and running there are always TimeStamps in solr_gc.log and
the file reports at the top with "CommandLine flags:" that the option
-XX:+PrintGCTimeStamps has been set.
But where?

Is it something passed down from Jetty?

Regards,
Bernd





Re: SolrCloud design question

2018-04-20 Thread Bernd Fehling
Thanks Alessandro for the info.

I am currently in the phase to find the right setup with shards,
nodes, replicas and so on.
I have decided to begin with 5 hosts and want to setup 1 collection with 5 
shards.
And start with 2 replicas per shard.

But the next design question is, should each replica get its own instance?

What will give better performance, all replicas in one java instance or
having one instance for each replica?

What is your opinion?

Regards
Bernd


Am 20.04.2018 um 12:17 schrieb Alessandro Benedetti:
> Unless you use recent Solr 7.x features where replicas can have different
> properties[1], each replica is functionally the same at Solr level.
> Zookeeper will elect a leader among them ( so temporary a replica will have
> more responsibilities ) but (R1-R2-R3) does not really exist at Solr level.
> It will just be Shard1 (ReplicaHost1, ReplicaHost2, ReplicaHost3).
> 
> So you can't really shuffle anything at this level.
> 
> 
> 
> 
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 


Re: SolrCloud design question

2018-04-19 Thread Bernd Fehling
Hi Shawn,
OK, got that.

Would shuffling or shifting the replicas bring any benefit or is it just wasted 
time?
time?

   
 |  | |  | |  |
 shard1  | |r1| | | |r2| | | |r3| |
 |  | |  | |  |
 |  | |  | |  |
 |  | |  | |  |
 shard2  | |r3| | | |r1| | | |r2| |
 |  | |  | |  |
 |  | |  | |  |
 |  | |  | |  |
 shard3  | |r2| | | |r3| | | |r1| |
 |  | |  | |  |
   
   host1host2host3


Regards
Bernd


Am 19.04.2018 um 14:43 schrieb Shawn Heisey:
> On 4/19/2018 6:28 AM, Bernd Fehling wrote:
>> How would you setup a SolrCloud an why?
>>
>>
>>   shard1   shard2   shard3
>>   
>> |  | |  | |  |
>> | |r1| | | |r1| | | |r1| |
>> |  | |  | |  |
>> |  | |  | |  |
>> |  | |  | |  |
>> | |r2| | | |r2| | | |r2| |
>> |  | |  | |  |
>> |  | |  | |  |
>> |  | |  | |  |
>> | |r3| | | |r3| | | |r3| |
>> |  | |  | |  |
>>   
>>   host1    host2    host3
> 
> I'm assuming that "r1" means replica1.
> 
> If you set it up this way, you lose one third of the whole index (all 
> replicas of one shard) if *any* host goes down.  All queries will fail in
> that situation if shards.tolerant is not set.  With shards.tolerant=true, you 
> would get partial results.
> 
> So you have three machines that are all single points of failure.  This setup 
> is a bad idea.
> 
>>   
>> |  | |  | |  |
>> shard1  | |r1| | | |r2| | | |r3| |
>> |  | |  | |  |
>> |  | |  | |  |
>> |  | |  | |  |
>> shard2  | |r1| | | |r2| | | |r3| |
>> |  | |  | |  |
>> |  | |  | |  |
>> |  | |  | |  |
>> shard3  | |r1| | | |r2| | | |r3| |
>> |  | |  | |  |
>>   
>>   host1    host2    host3
> 
> With this setup, when any host fails, you still have two working replicas of 
> all shards.  If two hosts fail, you still have one working
> replica.  There are no single points of failure, as long as your clients are 
> able to direct queries to a working replica.  SolrJ clients using
> CloudSolrClient will do this automatically.  Other clients may need a load 
> balancer sitting in front of the cloud.
> 
> This is the recommended way of setting up replicas.
> 
> Thanks,
> Shawn
> 


SolrCloud design question

2018-04-19 Thread Bernd Fehling
How would you setup a SolrCloud an why?


 shard1   shard2   shard3
  
|  | |  | |  |
| |r1| | | |r1| | | |r1| |
|  | |  | |  |
|  | |  | |  |
|  | |  | |  |
| |r2| | | |r2| | | |r2| |
|  | |  | |  |
|  | |  | |  |
|  | |  | |  |
| |r3| | | |r3| | | |r3| |
|  | |  | |  |
  
 host1host2host3


  
|  | |  | |  |
shard1  | |r1| | | |r2| | | |r3| |
|  | |  | |  |
|  | |  | |  |
|  | |  | |  |
shard2  | |r1| | | |r2| | | |r3| |
|  | |  | |  |
|  | |  | |  |
|  | |  | |  |
shard3  | |r1| | | |r2| | | |r3| |
|  | |  | |  |
  
 host1host2host3


Regards
Bernd




Re: Howto change log level with Solr Admin UI ?

2018-04-19 Thread Bernd Fehling
Hi Emir,
thanks for the infos, it works for the Admin UI.
But it fills the Admin UI pretty heavily with log messages.

I think it is a misunderstanding on my side because I was hoping to change
the log level for solr.log with the Admin UI.
It would be cool if that were possible, to change the log level for solr.log
from the Admin UI. Imagine a running system with problems: you could change
the log level and get more logging info into solr.log without restarting the system
and overloading the Logging Admin UI.

In my C and C++ programs I use SIGUSR1 and SIGUSR2 to change log levels during 
runtime.

Regards
Bernd


Am 18.04.2018 um 17:18 schrieb Emir Arnautović:
> Hi,
> It is not exposed in the admin console (would be nice if it is!), but there 
> is a way to set threshold for admin UI logs. You can simply execute 
> following:  
> http://localhost:8983/solr/admin/info/logging?since=0&threshold=INFO and 
> INFO logs will start appearing in admin UI.
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 18 Apr 2018, at 16:30, Shawn Heisey <apa...@elyograg.org> wrote:
>>
>> On 4/18/2018 8:03 AM, Bernd Fehling wrote:
>>> I just tried to change the log level with Solr Admin UI but it
>>> does not change any logging on my running SolrCloud.
>>> It just shows the changes in the Admin UI and the commands in the
>>> request log, but no changes in the level of logging.
>>>
>>> Do I have to RELOAD the collection after changing log level?
>>>
>>> I tried all setting from ALL, TRACE, DEBUG, ...
>>>
>>> Also the Reference Guide 6.6 shows the Admin UI as I see it, but
>>> the table below the image has levels FINEST, FINE, CONFIG, ...
>>> https://lucene.apache.org/solr/guide/6_6/configuring-logging.html
>>> This is confusing.
>>
>> What exact setting in the logging tab did you change, and what did you 
>> expect to happen that didn't happen?
>>
>> The logging events that show up in the admin UI will never include anything 
>> with a severity lower than WARN.  Anything lower would be far too much 
>> information for the admin UI to handle.  Changing the level shown in the 
>> admin UI is likely possible, but probably requires a code change.  If 
>> changed, I think it would result in a UI page that's unusable because it 
>> contains far too many events.
>>
>> Assuming that log4j.properties hasn't been altered, you will find lower 
>> severity events in solr.log, a file on disk.  The default logging level that 
>> Solr uses is INFO, but INFO logs never show up in the admin UI.
>>
>> Also, changes made to logging levels in the admin UI only last as long as 
>> Solr is running.  When Solr is restarted, those changes are gone.  Only 
>> changes made in log4j.properties will survive a restart.
>>
>> Thanks,
>> Shawn
>>
> 
> 


Howto change log level with Solr Admin UI ?

2018-04-18 Thread Bernd Fehling
I just tried to change the log level with Solr Admin UI but it
does not change any logging on my running SolrCloud.
It just shows the changes in the Admin UI and the commands in the
request log, but no changes in the level of logging.

Do I have to RELOAD the collection after changing log level?

I tried all settings from ALL, TRACE, DEBUG, ...

Also the Reference Guide 6.6 shows the Admin UI as I see it, but
the table below the image has levels FINEST, FINE, CONFIG, ...
https://lucene.apache.org/solr/guide/6_6/configuring-logging.html
This is confusing.


Regards,
Bernd


Re: Infostream question

2018-04-18 Thread Bernd Fehling
You have to check your log4j.properties, usually located 
server/resources/log4j.properties
There is a line about infostream logging, change it from OFF to ON.

# set to INFO to enable infostream log messages
log4j.logger.org.apache.solr.update.LoggingInfoStream=OFF

Regards
Bernd


Am 17.04.2018 um 20:56 schrieb Yunee Lee:
> Hi,
> Current solr server is 5.2 and I want to enable infoStream and updated the 
> solrconfig.xml.
> Reload the config. But it doesn’t create any logs. Do I need to configure 
> anything else?
> Thanks.
> <infoStream>true</infoStream>
> 


Re: Performance & CPU Usage of 6.2.1 vs 6.5.1 & above

2018-04-16 Thread Bernd Fehling
It would help if you could trace it down to a version change.
Do you have a test system where you can start with 6.3.0, the next version above 6.2.1,
to see which version change is causing the trouble?
You can then try 6.4.0 and 6.5.0 next, and after that go into the point releases.

Regards, Bernd


Am 16.04.2018 um 09:39 schrieb mganeshs:
> Hi Bernd,
> 
> We didn't change any default settings. 
> 
> Both 6.2.1 and 6.5.1 is running with same settings, same volume of data,
> same code, which means indexing rate is also same. 
> 
> In Case of 6.2.1 CPU is around 60 to 70%. But in 6.5.1 it's always around
> 95%. The CPU % in 6.5.1 is alarming for us and we keep getting alerts as
> it's always more than 95%. 
> 
> Basically, my question is why CPU is low in 6.2.1 and very high in 6.5.1.
> I thought only I was facing this issue, but one more person in the forum
> also raised this issue, but nothing has been concluded so far.
> 
> In another thread Shawn also suggested changes wrt merge policy numbers. But
> CPU % didn't come down. But in 6.2.1 with default settings itself, it works
> fine and CPU is also normal. So created new thread to discuss wrt CPU
> utilization between old version (6.2.1 ) and new version (6.5.1+)
> 
> Regards,
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 


Re: Performance & CPU Usage of 6.2.1 vs 6.5.1 & above

2018-04-15 Thread Bernd Fehling
As a first guess I would say that you have much higher GC activity
which causes much higher CPU usage.
Why do you have much higher GC activity?
Any GC settings changed?
Have you tried increasing heap size?

Regards
Bernd


Am 16.04.2018 um 06:22 schrieb mganeshs:
> Solr experts,
> 
> We found following  link
> 
>   
> where its mentioned like in 6.2.1 it's faster where as in 6.6 its slower. 
> 
> We are also facing same issue...with 6.2.1 in our performance environment
> and we 
> found that CPU usage is around 60 to 70% where as in 6.5.1 it was always 
> more than 95% 
> 
> Settings are same and data size and indexing speed remains same. Pls check 
> the  JVM snapshot 
> 
>
> when we index using 6.2.1 
> 
> 
> Following is the  snapshot 
> 
>  
> taken with 6.5.1 
> 
> Is there any reason why such a huge difference with CPU usage patterns 
> between 6.2.1 and 6.5.1 ? 
> 
> Can we do something in 6.5.1 to make it as 6.2.1? Because we don't want to 
> downgrade to 6.2.1 from 6.5.1. 
> 
> Let us know your thoughts on this. 
> 
> Thanks and Regards, 
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 


Re: edit gc parameters in solr.in.sh or solr?

2018-03-28 Thread Bernd Fehling
Hi Shawn,

the problem with heap regions is that you can't get one advantage without a
corresponding disadvantage.

According to your G1 example:
4GB heap with default 2MB region size = 2048 heap regions

4GB heap with G1HeapRegionSize set to 8MB = 512 heap regions

You see, you only have 1/4th of heap regions left.
This also means that objects which are only 1MB in size occupy 8MB on heap
and therefore a whole region, which is already very low.

Humongous allocations are not generally bad. Sure, the G1 handling of humongous 
allocations
is not that performant and takes time. But just try to limit humongous 
allocations
rather than avoid them under all circumstances.

Regards
Bernd


Am 27.03.2018 um 23:07 schrieb Shawn Heisey:
> On 3/27/2018 12:13 AM, Bernd Fehling wrote:
>> may I give you the advice to _NOT_ set -XX:G1HeapRegionSize.
>> That is computed at JVM start by the engine according to the heap and 
>> available memory.
>> A wrongly set size can force even a huge machine with 31GB heap and 157GB RAM 
>> into OOM.
>> Guess how I figured that out; it took me about one week to locate it.
> 
> I have some notes on why I included that parameter on my wiki page.
> 
> https://wiki.apache.org/solr/ShawnHeisey#G1_.28Garbage_First.29_Collector
> 
> Basically, the filterCache entries were being marked as humongous
> allocations, because each one for my indexes is over 2MB in size. 
> Apparently it takes a full collection to collect humongous allocations
> that become garbage, at least in the versions of Java that I was
> experimenting with.  So without that parameter, full GCs were required,
> and that will always make GC slow unless the heap size is very small.
> 
> If Oracle has made it so that humongous allocations can be collected by
> the generation-specific collectors, then that parameter may no longer be
> required in newer Java versions.  I do not know if this has happened.
> 
> Thanks,
> Shawn
> 


Re: edit gc parameters in solr.in.sh or solr?

2018-03-27 Thread Bernd Fehling
Hi Walter,

may I give you the advice to _NOT_ set -XX:G1HeapRegionSize.
That is computed at JVM start by the engine according to the heap and 
available memory.
A wrongly set size can force even a huge machine with 31GB heap and 157GB RAM 
into OOM.
Guess how I figured that out; it took me about one week to locate it.

Regards
Bernd

Am 26.03.2018 um 17:08 schrieb Walter Underwood:
> We use the G1 collector in Java 8u131 and it works well. We are running 
> 6.6.2. Our Solr instances do a LOT of allocation. We have long queries (25 
> terms average) and many unique queries.
> 
> SOLR_HEAP=8g
> # Use G1 GC  -- wunder 2017-01-23
> # Settings from https://wiki.apache.org/solr/ShawnHeisey
> GC_TUNE=" \
> -XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Mar 26, 2018, at 1:22 AM, Derek Poh  wrote:
>>
>> Hi
>>
>> From your experience, I would like to know if it is advisable to change the GC 
>> parameters in solr.in.sh or in the solr file?
>> It is mentioned in the documentation to edit solr.in.sh, but I would like 
>> to know which file you actually edit.
>>
>> I am using Solr 6.6.2 at the moment.
>>
>> Regards,
>> Derek
>>
>>


[poll] which loadbalancer are you using for SolrCloud

2018-03-02 Thread Bernd Fehling
Dear list,

I would like to poll for the loadbalancer you are using for SolrCloud.

Are you using a loadbalancer for SolrCloud?

If yes, which one (SolrJ, HAProxy, Varnish, Nginx,...) and why?

If not, why not?


Regards, Bernd


Re: Reading data from Oracle

2018-02-15 Thread Bernd Fehling
So it is not SolrJ, but Solr that is your problem?

In your first email there was nothing about heap exceptions, only the runtime 
of the loading.

What do you mean by "injecting too many rows", and what is "too many"?

Some numbers while loading from scratch:
- single node 412GB index
- 92 fields
- 123.6 million docs
- 1.937 billion terms
- loading from file system
- indexing time 9 hrs 5 min
- using SolrJ ConcurrentUpdateSolrClient (see the sketch below)
--- queueSize=1, threads=12
--- waitFlush=true, waitSearcher=true, softcommit=false
And, Solr must be configured to "swallow" all this :-)
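
Roughly, that loader setup maps onto the SolrJ builder like this (a minimal sketch
only; URL, collection and field names are placeholders, and reading the source
files is stubbed out):

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkLoader {
  public static void main(String[] args) throws Exception {
    ConcurrentUpdateSolrClient client =
        new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycollection")
            .withQueueSize(10)        // placeholder, tune to your setup
            .withThreadCount(12)      // threads emptying the queue, as in the figures above
            .build();

    for (int i = 0; i < 1000; i++) {            // stands in for reading your source files
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("title", "example " + i);
      client.add(doc);                          // queued, sent by the worker threads
    }

    client.blockUntilFinished();                // drain the queue
    client.commit(true, true, false);           // waitFlush=true, waitSearcher=true, softCommit=false
    client.close();
  }
}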


You say "8GB per node" so it is SolrCloud?

Anything else besides the heap exception?

How many commits?

Regards
Bernd


Am 15.02.2018 um 10:31 schrieb LOPEZ-CORTES Mariano-ext:
> Injecting too many rows into Solr throws a Java heap exception (higher memory? 
> We have 8GB per node).
> 
> Does DIH have support for paging queries?
> 
> Thanks!
> 
> -Message d'origine-
> De : Bernd Fehling [mailto:bernd.fehl...@uni-bielefeld.de] 
> Envoyé : jeudi 15 février 2018 10:13
> À : solr-user@lucene.apache.org
> Objet : Re: Reading data from Oracle
> 
> And where is the bottleneck?
> 
> Is it reading from Oracle or injecting to Solr?
> 
> Regards
> Bernd
> 
> 
> Am 15.02.2018 um 08:34 schrieb LOPEZ-CORTES Mariano-ext:
>> Hello
>>
>> We have to delete our Solr collection and feed it periodically from an 
>> Oracle database (up to 40M rows).
>>
>> We've done the following test: From a java program, we read chunks of data 
>> from Oracle and inject to Solr (via Solrj).
>>
>> The problem: it is really, really slow (1.5 nights).
>>
>> Is there one faster method to do that ?
>>
>> Thanks in advance.
>>


Re: Reading data from Oracle

2018-02-15 Thread Bernd Fehling
And where is the bottleneck?

Is it reading from Oracle or injecting to Solr?

Regards
Bernd


Am 15.02.2018 um 08:34 schrieb LOPEZ-CORTES Mariano-ext:
> Hello
> 
> We have to delete our Solr collection and feed it periodically from an Oracle 
> database (up to 40M rows).
> 
> We've done the following test: From a java program, we read chunks of data 
> from Oracle and inject to Solr (via Solrj).
> 
> The problem: it is really, really slow (1.5 nights).
> 
> Is there one faster method to do that ?
> 
> Thanks in advance.
> 


Re: Distributed search cross cluster

2018-01-31 Thread Bernd Fehling
Many years ago, in a different universe, when Federated Search was a buzzword we
used Unity from FAST FDS (which is now MS ESP). It worked pretty well across
many systems like FAST FDS, Google, Gigablast, ...
Very flexible with different mixers, parsers, query transformers.
Was written in Python and used pylib.medusa.
Search for "unity federated search", there is a book at Google about this, just
to get an idea.

Regards, Bernd


Am 30.01.2018 um 17:09 schrieb Jan Høydahl:
> Hi,
> 
> A customer has 10 separate SolrCloud clusters, with same schema across all, 
> but different content.
> Now they want users in each location to be able to federate a search across 
> all locations.
> Each location is 100% independent, with separate ZK etc. Bandwidth and 
> latency between the
> clusters is not an issue, they are actually in the same physical datacenter.
> 
> Now my first thought was using a custom shards parameter, and letting the 
> receiving node fan
> out to all shards of all clusters. We’d need to contact the ZK for each 
> environment and find
> all shards and replicas participating in the collection and then construct 
> the shards=A1|A2,B1|B2…
> string, which would be quite big, but if we get it right, it should “just work".
> 
> Now, my question is whether there are other smarter ways that would leave it 
> up to existing Solr
> logic to select shards and load balance, that would also take into account 
> any shard.keys/_route_
> info etc. I thought of these
>   * collection=collA,collB  — but it only supports collections local to one 
> cloud
>   * Create a collection ALIAS to point to all 10 — but same here, only local 
> to one cluster
>   * Streaming expression top(merge(search(q=,zkHost=blabla))) — but we want 
> it with pure search API
>   * Write a custom ShardHandler plugin that knows about all clusters — but 
> this is complex stuff :)
>   * Write a custom SearchComponent plugin that knows about all clusters and 
> adds the shards= param
> 
> Another approach would be for the originating cluster to fan out just ONE 
> request to each of the other
> clusters and then write some SearchComponent to merge those responses. That 
> would let us query
> the other clusters using one LB IP address instead of requiring full 
> visibility to all solr nodes
> of all clusters, but if we don’t need that isolation, that extra merge code 
> seems fairly complex.
> 
> So far I opt for the custom SearchComponent and shards= param approach. Any 
> useful input from
> someone who tried a similar approach would be priceless!
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
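
To make the first idea above concrete, here is a minimal SolrJ sketch of the manual
shards= fan-out. Host names, collection name and query are made-up placeholders; in
practice the replica list would be built from each cluster's ZooKeeper state:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CrossClusterSearch {
  public static void main(String[] args) throws Exception {
    // Hypothetical replica addresses, collected beforehand from each cluster's ZK.
    // Comma separates shards, '|' separates load-balanced replicas of one shard.
    String shards =
        "clusterA-n1:8983/solr/coll_shard1_replica1|clusterA-n2:8983/solr/coll_shard1_replica2,"
      + "clusterB-n1:8983/solr/coll_shard1_replica1|clusterB-n2:8983/solr/coll_shard1_replica2";

    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://clusterA-n1:8983/solr/coll").build()) {
      SolrQuery q = new SolrQuery("title:solr");
      q.set("shards", shards);   // fan the request out across both clusters
      QueryResponse rsp = client.query(q);
      System.out.println("total hits: " + rsp.getResults().getNumFound());
    }
  }
}

The receiving node merges the per-shard responses itself, so no custom merge code is
needed, but it does need network access to every replica listed in the shards= string.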


Re: is ConcurrentUpdateSolrClient.Builder thread safe?

2018-01-11 Thread Bernd Fehling
To sum it up, there is no safe way to do bulk loading in Solr, due to the lack
of a preserved order of operations.
Solr can only support bulk loading if you really have unique data, right?

By the way, the queue used is a java.util.concurrent.BlockingQueue.
Changing that to ArrayBlockingQueue (to force FIFO) would not really help, I 
guess, because the bottleneck is not reading the content from the filesystem, but
analyzing and indexing.

Any other options for bulk loading?

You say "If there are at least three threads in the concurrent client...", but
two threads would work?

How are other users doing bulk loading with archived backups and preserving the 
order?
Can't believe that I'm the only one on earth having this need.

Regards
Bernd


Am 11.01.2018 um 08:53 schrieb Shawn Heisey:
> On 1/11/2018 12:05 AM, Bernd Fehling wrote:
>> This will never pass a Jepsen test and I call it _NOT_ thread safe.
>>
>> I haven't looked into the code yet, to see if the queue is FIFO, otherwise
>> this would be stupid.
> 
> I was not thinking about order of operations when I said that the client was 
> threadsafe.  I meant that one client object can be used
> simultaneously by multiple threads without anything getting 
> cross-contaminated within the program.
> 
> If you are absolutely reliant on operations happening in a precise order, 
> such that a document could get indexed in one request and then
> replaced (or updated) with a later request, you should not use the concurrent 
> client.  You could define it with a single thread, but if you do
> that, then the concurrent client doesn't work any faster than the standard 
> client.
> 
> When a concurrent client is built, it creates the specified number of 
> processing threads.  When updates are sent, they are added to an internal
> queue.  The processing threads will handle requests from the queue as long as 
> the queue is not empty.
> 
> Those threads will process the requests they have been assigned 
> simultaneously.  Although I'm sure that each thread pulls requests off the 
> queue
> in a FIFO manner, I have a scenario for you to consider.  This scenario is 
> not just an intellectual exercise, it is the kind of thing that can
> easily happen in the wild.
> 
> Let's say that when document X is initially indexed, it is at position 997 in 
> a batch of 1000 documents.  Then two update requests later, the
> new version of document X is at position 2 in another batch of 1000 documents.
> 
> If there are at least three threads in the concurrent client, those update 
> requests may begin execution at nearly the same time.  In that
> situation, Solr is likely to index document X in the request added later 
> before it indexes document X in the request added earlier, resulting in
> outdated information ending up in the index.
> 
> The same thing can happen even with a non-concurrent client when it is used 
> in a multi-threaded manner.
> 
> Preserving order of operations cannot be guaranteed if there are multiple 
> threads.  It could be possible to add some VERY sophisticated
> synchronization capabilities, but writing code to do that would be very 
> difficult, and it wouldn't be trivial to use either.
> 
> Thanks,
> Shawn
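
To illustrate the single-thread option Shawn mentions, here is a minimal SolrJ sketch:
with withThreadCount(1) the internal queue is drained by one thread, so two updates of
the same id are sent in the order they were added, at the cost of the concurrency
benefit. URL and field names are placeholders:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class OrderedLoader {
  public static void main(String[] args) throws Exception {
    try (ConcurrentUpdateSolrClient client =
             new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/mycollection")
                 .withQueueSize(10)     // placeholder
                 .withThreadCount(1)    // a single worker thread keeps the send order
                 .build()) {

      SolrInputDocument older = new SolrInputDocument();
      older.addField("id", "my_uniq_id_1234");
      older.addField("timestamp", "2017-03-28T23:21:40Z");   // placeholder field name
      client.add(older);

      SolrInputDocument newer = new SolrInputDocument();
      newer.addField("id", "my_uniq_id_1234");
      newer.addField("timestamp", "2017-04-26T00:42:10Z");
      client.add(newer);   // with one thread this reliably overwrites the older doc

      client.blockUntilFinished();
      client.commit();
    }
  }
}

With withThreadCount(2) or more, the two adds can be picked up by different worker
threads and arrive at Solr in either order, which is exactly the effect described above.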


Re: is ConcurrentUpdateSolrClient.Builder thread safe?

2018-01-10 Thread Bernd Fehling
Hi Shawn,

from your answer I see that you are obviously not using 
ConcurrentUpdateSolrClient.
I didn't say that I use ConcurrentUpdateSolrClient in multiple threads.
I said that ConcurrentUpdateSolrClient.Builder has a method to set
"withThreadCount", to empty the client's queue with multiple threads.
This is useful for bulk loading huge data volumes or replaying a backup into the index.

As I can see at the indexer with infostream, there are _no_ indexing errors.

I have now tried with one thread several times and everything was fine.
The newer docs replaced the older docs (which were marked deleted) in the index.
With more than 1 "threadCount" for emptying the queue there are problems with
ConcurrentUpdateSolrClient.

This will never pass a Jepsen test and I call it _NOT_ thread safe.

I haven't looked into the code yet, to see if the queue is FIFO, otherwise
this would be stupid.

Regards
Bernd


Am 11.01.2018 um 02:27 schrieb Shawn Heisey:
> On 1/10/2018 8:33 AM, Bernd Fehling wrote:
>> after some strange search results I was trying to locate the problem
>> and it turned out that it starts with bulk loading with SolrJ
>> and ConcurrentUpdateSolrClient.Builder with several threads.
>>
>> I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
>> with regard to the docs sent to the indexer?
> 
> Why would you need the Builder to be threadsafe?
> 
> The actual client object (ConcurrentUpdateSolrClient) should be perfectly 
> threadsafe, but the Builder probably isn't, and I can't think of any
> reason to try and use it with multiple threads.  In a well-constructed 
> program, you will use the Builder exactly once, in an initialization
> thread, and then have all the indexing threads use the client object that the 
> Builder creates.
> 
> I hope you're aware that the concurrent client swallows all indexing errors 
> and does not tell your program about them.
> 
> Thanks,
> Shawn
> 


is ConcurrentUpdateSolrClient.Builder thread safe?

2018-01-10 Thread Bernd Fehling
Hi list,

after some strange search results I was trying to locate the problem
and it turned out that it starts with bulk loading with SolrJ
and ConcurrentUpdateSolrClient.Builder with several threads.

I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
with regard to the docs sent to the indexer?

It feels like documents with the same doc_id are not always indexed
in the order they are sent to the indexer. It is some kind of random generator.

Example:
file LR00010.xml

  my_uniq_id_1234
  2017-03-28T23:21:40Z
  ...

file LR01000.xml

  my_uniq_id_1234
  2017-04-26T00:42:10Z
  ...


The files are in the same subdir.
They are loaded, processed, and sent to the indexer in ascending natural order.
LR00010.xml is handled way before LR01000.xml.

But the result is that sometimes the older doc of LR00010.xml is in the index
and the newer doc from LR01000.xml is marked as deleted, and sometimes the
newer doc of LR01000.xml is in the index and the older doc from LR00010.xml
is marked as deleted.

Anyone seens this?

I could try ConcurrentUpdateSolrClient.Builder with only one thread and
see if the problem still exists.

Regards
Bernd




docValues with stored and useDocValuesAsStored

2018-01-08 Thread Bernd Fehling
What is the precedence when docValues is used together with stored=true,
e.g. a field defined with docValues="true" and stored="true"?

My guess is that, because useDocValuesAsStored=true is the default, stored=true is
ignored and the values are pulled from docValues.

And only if useDocValuesAsStored=false is explicitly set does stored=true come
into play.

In short, useDocValuesAsStored=true (the default) takes precedence over 
stored=true.
Is this right?

Regards
Bernd


Re: howto sum of terms of specific field in index

2017-12-21 Thread Bernd Fehling
Hi Emir,

thank you, that's it.

But a question while reading the docs about sumTotalTermFreq from your link.
Example in the docs:

If doc1:(fieldX:A B C) and doc2:(fieldX:A A A A):
...
freq(doc1, fieldX:A) = 4 (A appears 4 times in doc 2)


Shouldn't it be:
freq(doc2, fieldX:A) = 4 (A appears 4 times in doc 2)

Because the "freq" of _doc2_ and not _doc1_ for fieldX:A is 4?
A typo in the docs?


Regards
Bernd


Am 21.12.2017 um 09:53 schrieb Emir Arnautović:
> HI Bernd,
> It seems to me that you are looking for sumTotalTermFreq function.
> https://lucene.apache.org/solr/guide/6_6/function-queries.html 
> <https://lucene.apache.org/solr/guide/6_6/function-queries.html>
> 
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> 
> 
> 
>> On 21 Dec 2017, at 09:23, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> 
>> wrote:
>>
>> Hi list,
>>
>> Actually a simple question, but somehow I can't figure out how to get
>> the total number of terms in a field in the index, example:
>>
>> record_1: fruit: apple, banana, cherry
>> record_2: fruit: apple, pineapple, cherry
>> record_3: fruit: kiwi, pineapple
>> record_4: fruit:
>>
>> - a search for fruit:* gives 3 results   (just a search)
>> - the number of unique terms for fruit is 5  (reported by luke)
>> - the number of term apple is 2  (reported by luke)
>> - the number of terms for fruit of record_1 and record_2 is 3 and
>>  for record_3 is 2
>>
>> But how to get the number of all terms for fruit of all records which should 
>> be 8?
>>
>> I'm talking about 100 Million records, the 4 above are just an example.
>> This is not a general use case, more for statistical purposes.
>>
>> Regards
>> Bernd
> 
> 

-- 
*
Bernd FehlingBielefeld University Library
Dipl.-Inform. (FH)LibTec - Library Technology
Universitätsstr. 25  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060   bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*
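
For reference, a minimal SolrJ sketch of the sumTotalTermFreq approach: the function
can be requested as a pseudo-field, and its value is the same for every document, so
one row is enough. URL and collection name are placeholders; "fruit" is the field from
the example, and the index is assumed not to be empty:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SumTermsExample {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(1);   // the function value is index-wide, one document is enough
      // sumtotaltermfreq sums the totalTermFreq of every term in the field
      q.setFields("id", "total_terms:sumtotaltermfreq(fruit)");
      QueryResponse rsp = client.query(q);
      System.out.println(rsp.getResults().get(0).getFieldValue("total_terms"));
    }
  }
}

For the four example records this should print 8 (3 + 3 + 2 + 0).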


howto sum of terms of specific field in index

2017-12-21 Thread Bernd Fehling
Hi list,

Actually a simple question, but somehow I can't figure out how to get
the total number of terms in a field in the index, example:

record_1: fruit: apple, banana, cherry
record_2: fruit: apple, pineapple, cherry
record_3: fruit: kiwi, pineapple
record_4: fruit:

- a search for fruit:* gives 3 results   (just a search)
- the number of unique terms for fruit is 5  (reported by luke)
- the number of term apple is 2  (reported by luke)
- the number of terms for fruit of record_1 and record_2 is 3 and
  for record_3 is 2

But how do I get the number of all terms for fruit across all records, which should be 
8?

I'm talking about 100 Million records, the 4 above are just an example.
This is not a general use case, more for statistical purposes.

Regards
Bernd


Re: OutOfMemoryError in 6.5.1

2017-11-20 Thread Bernd Fehling
Hi Walter,

you can check whether the JVM OOM hook is acknowledged by the JVM
and set up in the JVM. The options are "-XX:+PrintFlagsFinal -version".

You can modify your bin/solr script and tweak the function "launch_solr"
at the end of the script. Replace "-jar start.jar" with "-XX:+PrintFlagsFinal 
-version".
Instead of starting Solr, this will print a huge list of all the JVM parameters
that are really used (and accepted).
Check what "ccstrlist OnOutOfMemoryError" is telling you.
Is it really pointing to your OOM script?

You can increase MaxGCPauseMillis to give GC more time to clean up.

The default InitiatingHeapOccupancyPercent is at 45, try it with 75
by setting -XX:InitiatingHeapOccupancyPercent=75



By the way, do you really use UseLargePages in your system
(because the OS must also support this) or is the JVM parameter
just set because someone else is also using it?
http://www.oracle.com/technetwork/java/javase/tech/largememory-jsp-137182.html


Regards,
Bernd


Am 21.11.2017 um 02:17 schrieb Walter Underwood:
> When I ran load benchmarks with 6.3.0, an overloaded cluster would get super 
> slow but keep functioning. With 6.5.1, we hit 100% CPU, then start getting 
> OOMs. That is really bad, because it means we need to reboot every node in 
> the cluster.
> 
> Also, the JVM OOM hook isn’t running the process killer (JVM 1.8.0_121-b13). 
> Using the G1 collector with the Shawn Heisey settings in an 8G heap.
> 
> GC_TUNE=" \
> -XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts \
> "
> 
> This is not good behavior in prod. The process goes to the bad place, then we 
> need to wait until someone is paged and kills it manually. Luckily, it 
> usually drops out of the live nodes for each collection and doesn’t take user 
> traffic.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> 


Re: [ANNOUNCE] Apache Solr 6.6.2 released

2017-10-18 Thread Bernd Fehling
Thanks,
but I tried to access the mentioned issues of
https://lucene.apache.org/solr/6_6_2/changes/Changes.html

https://issues.apache.org/jira/browse/SOLR-11477
https://issues.apache.org/jira/browse/SOLR-11482

I get something like "permissionViolation=true", even after login!!!

Is SOLR going to be closed source?

Do we have to pay for seeing the issues? ;-)

Regards
Bernd


Am 18.10.2017 um 10:29 schrieb Ishan Chattopadhyaya:
> 18 October 2017, Apache Solr™ 6.6.2 available
> 
> The Lucene PMC is pleased to announce the release of Apache Solr 6.6.2
> 
> Solr is the popular, blazing fast, open source NoSQL search platform from
> the
> Apache Lucene project. Its major features include powerful full-text
> search,
> hit highlighting, faceted search and analytics, rich document parsing,
> geospatial search, extensive REST APIs as well as parallel SQL. Solr is
> enterprise grade, secure and highly scalable, providing fault tolerant
> distributed search and indexing, and powers the search and navigation
> features
> of many of the world's largest internet sites.
> 
> This release includes a critical security fix and a bugfix. Details:
> 
> * Fix for a 0-day exploit (CVE-2017-12629), details:
> https://s.apache.org/FJDl.
>   RunExecutableListener has been disabled by default (can be enabled by
>   -Dsolr.enableRunExecutableListener=true) and resolving external entities
> in
>   the XML query parser (defType=xmlparser or {!xmlparser ... }) is disabled
> by
>   default.
> 
> * Fix a bug where Solr was attempting to load the same core twice (Error
> message:
>   "Lock held by this virtual machine").
> 
> Furthermore, this release includes Apache Lucene 6.6.2 which includes one
> security
> fix since the 6.6.1 release.
> 
> The release is available for immediate download at:
> 
>   http://www.apache.org/dyn/closer.lua/lucene/solr/6.6.2
> 
> Please read CHANGES.txt for a detailed list of changes:
> 
>   https://lucene.apache.org/solr/6_6_2/changes/Changes.html
> 
> Please report any feedback to the mailing lists
> (http://lucene.apache.org/solr/discussion.html)
> 
> Note: The Apache Software Foundation uses an extensive mirroring
> network for distributing releases. It is possible that the mirror you
> are using may not have replicated the release yet. If that is the
> case, please try another mirror. This also goes for Maven access.
> 

