RE: CDCR - how to deal with the transaction log files

2017-07-20 Thread Patrick Hoeffel
I'm working on my first setup of CDCR, and I'm seeing the same "The log reader 
for target collection {collection name} is not initialised" as you saw.

It looks like you're creating collections on a regular basis, but for me, I 
create them one time and never again. I've been creating the collection first 
from defaults and then applying the CDCR-aware solrconfig changes afterward. It 
sounds like maybe I need to create the configset in ZK first, then create the 
collections, first on the Target and then on the Source, and I should be good?
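For reference, a sketch of that order of operations, with placeholder configset/collection names, paths and hosts:

# 1) upload the CDCR-aware configset to ZooKeeper on both clusters
bin/solr zk upconfig -n cdcr_conf -d /path/to/cdcr_conf -z target-zk:2181
bin/solr zk upconfig -n cdcr_conf -d /path/to/cdcr_conf -z source-zk:2181
# 2) create the collection on the Target cluster first
curl 'http://target-host:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2&collection.configName=cdcr_conf'
# 3) then create it on the Source cluster
curl 'http://source-host:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2&collection.configName=cdcr_conf'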

Thanks,

Patrick Hoeffel

Senior Software Engineer
(Direct)  719-452-7371
(Mobile) 719-210-3706
patrick.hoef...@polarisalpha.com
PolarisAlpha.com 


-Original Message-
From: jmyatt [mailto:jmy...@wayfair.com] 
Sent: Wednesday, July 12, 2017 4:49 PM
To: solr-user@lucene.apache.org
Subject: Re: CDCR - how to deal with the transaction log files

glad to hear you found your solution!  I have been combing over this post and 
others on this discussion board many times and have tried so many tweaks to 
configuration, order of steps, etc, all with absolutely no success in getting 
the Source cluster tlogs to delete.  So incredibly frustrating.  If anyone has 
other pearls of wisdom I'd love some advice.  Quick hits on what I've tried:

- solrconfig exactly like Sean's (target and source respectively) except no 
autoSoftCommit
- I am also calling cdcr?action=DISABLEBUFFER (on source as well as on
target) explicitly before starting since the config setting of 
defaultState=disabled doesn't seem to work
- when I create the collection on source first, I get the warning "The log 
reader for target collection {collection name} is not initialised".  When I 
reverse the order (create the collection on target first), no such warning
- tlogs replicate as expected, hard commits on both target and source cause 
tlogs to rollover, etc - all of that works as expected
- action=QUEUES on source reflects the queueSize accurately.  Also *always* 
shows updateLogSynchronizer state as "stopped"
- action=LASTPROCESSEDVERSION on both source and target always seems correct (I 
don't see the -1 that Sean mentioned).
- I'm creating new collections every time and running full data imports that 
take 5-10 minutes. Again, all data replication, log rollover, and autocommit 
activity seems to work as expected, and logs on target are deleted.  It's just 
those pesky source tlogs I can't get to delete.
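For anyone following along, the CDCR API calls mentioned above look roughly like this (hosts and the collection name are placeholders; /cdcr is the handler defined in the CDCR solrconfig):

# disable buffering on both source and target before starting
curl 'http://source-host:8983/solr/mycoll/cdcr?action=DISABLEBUFFER'
curl 'http://target-host:8983/solr/mycoll/cdcr?action=DISABLEBUFFER'
# start replication on the source, then monitor queues and versions
curl 'http://source-host:8983/solr/mycoll/cdcr?action=START'
curl 'http://source-host:8983/solr/mycoll/cdcr?action=QUEUES'
curl 'http://source-host:8983/solr/mycoll/cdcr?action=LASTPROCESSEDVERSION'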



--
View this message in context: 
http://lucene.472066.n3.nabble.com/CDCR-how-to-deal-with-the-transaction-log-files-tp4345062p4345715.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting IO Exception while Indexing

2017-07-20 Thread mesenthil1
While debugging, the following are the findings.

When we send the same document as JSON, it is indexed without an issue. When
the same document is converted to a SolrInputDocument and sent to Solr using
SolrServer, it fails.
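A quick way to isolate the difference is to capture the working JSON path as a raw /update request and compare it against what SolrJ actually sends (via a proxy or Solr's request logs). A sketch of the raw request, with a placeholder collection and fields:

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycollection/update?commit=true' \
  --data-binary '[{"id":"doc-123","title_t":"example title"}]'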



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-IO-Exception-while-Indexing-Documents-in-SolrCloud-tp4346801p4347096.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Issue While indexing Data

2017-07-20 Thread rajat rastogi
Hi Shawn ,
I have two instances of Solr running, and my indexing process is in Java as well.
PID 15958 is my indexing process.
PID 4499 is my Solr instance which has stuck commits.
PID 9299 is another Solr instance which is working fine.

regards

Rajat

On 20-Jul-2017, at 16:40, Shawn Heisey-2 [via Lucene] wrote:

On 7/20/2017 12:29 AM, rajat rastogi wrote:
> I shared The code base, config , schema with you . Were they of any help , or 
> can You point what I am doing wrong in them .

I did not see any schema or config.

The top output shows that you have three large Java processes, all
running as root.  Which of these is Solr?  Are they all instances of Solr?

Thanks,
Shawn









--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Issue-While-indexing-Data-tp4339417p4347094.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: The unified highlighter html escaping. Seems rather extreme...

2017-07-20 Thread David Smiley
The escaping does appear excessive.  Please file a bug to the Lucene
project in Apache JIRA.

On Fri, May 26, 2017 at 11:26 AM Michael Joyner  wrote:

> Isn't the unified html escaper a bit extreme in its escaping?
>
> It makes it hard to deal with for simple post-processing.
>
> The original html escaper seems to do minimal escaping, not escaping every
> non-alphabetical character it can find.
>
> Also, is there a way to control how much text is returned as context
> around the highlighted frag?
>
> Compare:
>
>
> Unified Snippet:
> 

Re: Highlighting words with special characters

2017-07-20 Thread Lasitha Wattaladeniya
Hi Shawn,

Yes, I can confirm it works without any errors with multiple tokenizers.
Following is my analysis chain

StandardTokenizerFactory (only in index)
StopFilterFactory
LowerCaseFilterFactory
ASCIIFoldingFilterFactory
EnglishPossessiveFilterFactory
StemmerOverrideFilterFactory (only in query)
NgramTokenizerFactory (only in index)

I'll look more into what you said about having a single tokenizer in the
analysis chain.

Regards,
Lasitha

Lasitha Wattaladeniya
Software Engineer

Mobile : +6593896893
Blog : techreadme.blogspot.com

On Thu, Jul 20, 2017 at 7:12 PM, Shawn Heisey  wrote:

> On 7/19/2017 8:31 PM, Lasitha Wattaladeniya wrote:
> > But I have NgramTokenizerFactory at the end of indexing analyzer chain.
> > Therefore I should still tokenize the email address. But how this affects
> > the highlighting?, that's what I'm confused to understand
>
> You can only have one tokenizer in an analysis chain.  I have no idea
> what happens if you have more than one.  I personally would expect that
> to result in an initialization error, but maybe what it does is ignore
> the additional tokenizers.  Your experience seems to indicate that it
> does NOT result in an error.  Can you confirm?
>
> The analysis is done in this order:
>
> CharFilters
> Tokenizer
> Filters
>
> Thanks,
> Shawn
>
>
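For anyone hitting the same situation: one way to keep a single tokenizer and still get n-gram matching is to move the n-gram step into a filter. A Schema API sketch (field-type name, filter list and gram sizes are illustrative, not Lasitha's exact config):

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type": {
    "name": "text_ngram_hl",
    "class": "solr.TextField",
    "indexAnalyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [
        { "class": "solr.LowerCaseFilterFactory" },
        { "class": "solr.ASCIIFoldingFilterFactory" },
        { "class": "solr.NGramFilterFactory", "minGramSize": "2", "maxGramSize": "15" }
      ]
    },
    "queryAnalyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [
        { "class": "solr.LowerCaseFilterFactory" },
        { "class": "solr.ASCIIFoldingFilterFactory" }
      ]
    }
  }
}' http://localhost:8983/solr/mycollection/schema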


RE: Issues trying to boost phrase containing stop word

2017-07-20 Thread Phil Scadden
The simplest suggestion is get rid of the stop word filter. I've seen people 
here comment that it is not worth it for the amount of space it saves.

-Original Message-
From: shamik [mailto:sham...@gmail.com]
Sent: Friday, 21 July 2017 9:49 a.m.
To: solr-user@lucene.apache.org
Subject: Re: Issues trying to boost phrase containing stop word

Any suggestion?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-trying-to-boost-phrase-containing-stop-word-tp4346860p4347068.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Issues trying to boost phrase containing stop word

2017-07-20 Thread shamik
Any suggestion?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-trying-to-boost-phrase-containing-stop-word-tp4346860p4347068.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: finds all documents without a value for field

2017-07-20 Thread Hendrik Haddorp
If the range query is so much better, shouldn't the Solr query parser 
create a range query for a token query that only contains the wildcard? 
For the *:* case it already has a special path.


On 20.07.2017 21:00, Shawn Heisey wrote:

On 7/20/2017 7:20 AM, Hendrik Haddorp wrote:

the Solr 6.6. ref guide states that to "finds all documents without a
value for field" you can use:
-field:[* TO *]

While this is true I'm wondering why it is recommended to use a range
query instead of simply:
-field:*

Performance.

A wildcard is expanded to all possible term values for that field.  If
the field has millions of possible terms, then the query object created
at the Lucene level will quite literally have millions of terms in it.
No matter how you approach a query with those characteristics, it's
going to be slow, for both getting the terms list and executing the query.

A full range query might be somewhat slow when there are many possible
values, but it's a lot faster than a wildcard in those cases.

If the field is only used by a handful of documents and has very few
possible values, then it might be faster than a range query ... but this
is not common, so the recommended way to do this is with a range query.

Thanks,
Shawn





Re: DateRangeField and Timezone

2017-07-20 Thread Ulul
Hi

Got it, thanks to the debug option.

TZ applies only to date computations, so you have to compute a date :)

The document {"date" : "2016-12-31T04:15:00Z", "desc" : "winter time day
before" }

is retrieved with the query date:[2016-12-31T12:15:00Z/DAY TO
2017-01-03T12:15:00Z/DAY] and TZ=Europe/Paris, which translates to
[2016-12-30T23:00:00Z TO 2017-01-02T23:00:00Z].

The same query with TZ=America/New_York translates to [2016-12-31T05:00:00Z TO
2017-01-03T05:00:00Z] and does not retrieve the doc.
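A sketch of the two cases (URL-encoded, using the collection name from the thread) so the translated ranges show up in the debug output:

curl 'http://localhost:7574/solr/date_test/select?q=date:%5B2016-12-31T12:15:00Z%2FDAY%20TO%202017-01-03T12:15:00Z%2FDAY%5D&TZ=Europe%2FParis&debug=query&wt=json'
curl 'http://localhost:7574/solr/date_test/select?q=date:%5B2016-12-31T12:15:00Z%2FDAY%20TO%202017-01-03T12:15:00Z%2FDAY%5D&TZ=America%2FNew_York&debug=query&wt=json'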

Cheers

On 20/07/2017 02:10, Ulul wrote:
> Hi everyone
>
> I'm trying to query on dates with time zone taken into account. I have
> the following document
>
> {"date" : "2016-12-31T04:15:00Z", "desc" : "winter time day before" }
> date being of type DateRangeField
>
> I would like to be able to perform a query based on local date. For
> instance the above date corresponds to 2016-12-30 in New York (UTC-5 in
> winter) so I would expect the following query NOT to retrieve the document :
>
> http://127.0.1.1:7574/solr/date_test/select?TZ=America/New_York=on=date:2016-12-31=json
>
> Unfortunately it does... and it's the same using filter query
>
> https://cwiki.apache.org/confluence/display/solr/Working+with+Dates
> describes how to use TZ in facets, why doesn't it work with simple queries ?
>
> I'm using Solr 6.5.1
>
> I had to add DateRangeField type myself to the collection schema. I did
> it with :
>
> curl -X POST -H 'Content-type:application/json' --data-binary '{
>   "add-field-type" : {
>  "name":"DateRangeField",
>  "class":"solr.DateRangeField"
>   }
> }' http://localhost:7574/solr/date_test/schema
>
> Thank you for your help
>
> Ulul
>



Re: Solr 6.6 test failure: TestSolrCloudWithKerberosAlt.testBasics

2017-07-20 Thread Nawab Zada Asad Iqbal
Mine is actually very different:-




-test:
   [junit4]  says ᐊᐃ! Master seed: C3B77541FB9DE693
   [junit4] Executing 1 suite with 1 JVM.
   [junit4]
   [junit4] Started J0 PID(37742@mbp-9009).
   [junit4] Suite: org.apache.solr.cloud.TestSolrCloudWithKerberosAlt
   [junit4]   2> NOTE: reproduce with: ant test
-Dtestcase=TestSolrCloudWithKerberosAlt -Dtests.method=testBasics
-Dtests.seed=C3B77541FB9DE693 -Dtests.slow=true -Dtests.locale=fr-CA
-Dtests.timezone=IST -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   7.95s | TestSolrCloudWithKerberosAlt.testBasics <<<
   [junit4]> Throwable #1: java.lang.NoSuchFieldError: id_aes128_CBC
   [junit4]> at
__randomizedtesting.SeedInfo.seed([C3B77541FB9DE693:FE6FDB6DC373B8E3]:0)
   [junit4]> at
org.bouncycastle.jce.provider.symmetric.AESMappings.(Unknown Source)
   [junit4]> at
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
   [junit4]> at java.lang.Class.newInstance(Class.java:442)
   [junit4]> at
org.bouncycastle.jce.provider.BouncyCastleProvider.loadAlgorithms(Unknown
Source)
   [junit4]> at
org.bouncycastle.jce.provider.BouncyCastleProvider.setup(Unknown Source)
   [junit4]> at
org.bouncycastle.jce.provider.BouncyCastleProvider.access$000(Unknown
Source)
   [junit4]> at
org.bouncycastle.jce.provider.BouncyCastleProvider$1.run(Unknown Source)
   [junit4]> at java.security.AccessController.doPrivileged(Native
Method)
   [junit4]> at
org.bouncycastle.jce.provider.BouncyCastleProvider.(Unknown Source)
   [junit4]> at
org.apache.directory.server.core.security.TlsKeyGenerator.(TlsKeyGenerator.java:98)
   [junit4]> at
org.apache.directory.server.core.DefaultDirectoryService.createBootstrapEntries(DefaultDirectoryService.java:1483)
   [junit4]> at
org.apache.directory.server.core.DefaultDirectoryService.initialize(DefaultDirectoryService.java:1828)
   [junit4]> at
org.apache.directory.server.core.DefaultDirectoryService.startup(DefaultDirectoryService.java:1248)
   [junit4]> at
org.apache.hadoop.minikdc.MiniKdc.initDirectoryService(MiniKdc.java:383)
   [junit4]> at
org.apache.hadoop.minikdc.MiniKdc.start(MiniKdc.java:319)
   [junit4]> at
org.apache.solr.cloud.KerberosTestServices.start(KerberosTestServices.java:59)
   [junit4]> at
org.apache.solr.cloud.TestSolrCloudWithKerberosAlt.setupMiniKdc(TestSolrCloudWithKerberosAlt.java:115)
   [junit4]> at
org.apache.solr.cloud.TestSolrCloudWithKerberosAlt.setUp(TestSolrCloudWithKerberosAlt.java:101)
   [junit4]> at java.lang.Thread.run(Thread.java:745)
   [junit4]   2> NOTE: leaving temporary files on disk at:
/Users/niqbal/otherdev/box/lucene-solr/solr/build/solr-core/test/J0/temp/solr.cloud.TestSolrCloudWithKerberosAlt_C3B77541FB9DE693-001
   [junit4]   2> Jul 20, 2017 12:34:00 PM
com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
   [junit4]   2> WARNING: Will linger awaiting termination of 5 leaked
thread(s).
   [junit4]   2> Jul 20, 2017 12:34:20 PM
com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
   [junit4]   2> SEVERE: 5 threads leaked from SUITE scope at
org.apache.solr.cloud.TestSolrCloudWithKerberosAlt:
   [junit4]   2>1) Thread[id=15, name=apacheds, state=WAITING,
group=TGRP-TestSolrCloudWithKerberosAlt]
   [junit4]   2> at java.lang.Object.wait(Native Method)
   [junit4]   2> at java.lang.Object.wait(Object.java:502)
   [junit4]   2> at java.util.TimerThread.mainLoop(Timer.java:526)
   [junit4]   2> at java.util.TimerThread.run(Timer.java:505)
   [junit4]   2>2) Thread[id=17, name=groupCache.data,
state=TIMED_WAITING, group=TGRP-TestSolrCloudWithKerberosAlt]
   [junit4]   2> at sun.misc.Unsafe.park(Native Method)
   [junit4]   2> at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
   [junit4]   2> at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
   [junit4]   2> at
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
   [junit4]   2> at
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
   [junit4]   2> at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
   [junit4]   2> at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
   [junit4]   2> at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   [junit4]   2> at java.lang.Thread.run(Thread.java:745)
   [junit4]   2>3) Thread[id=16, name=kdcReplayCache.data,
state=TIMED_WAITING, group=TGRP-TestSolrCloudWithKerberosAlt]
   [junit4]   2> at sun.misc.Unsafe.park(Native Method)
   [junit4]   2> at

Re: Solr 6.6 test failure: TestSolrCloudWithKerberosAlt.testBasics

2017-07-20 Thread Steve Rowe
Does it look like this?: 


I see failures like that on my Jenkins once or twice a week.

--
Steve
www.lucidworks.com

> On Jul 20, 2017, at 3:53 PM, Nawab Zada Asad Iqbal  wrote:
> 
> Hi,
> 
> I cloned solr 6.6 branch today and I see this failure consistently.
> 
> TestSolrCloudWithKerberosAlt.testBasics
> 
> 
> I had done some script changes but after seeing this failure I reverted
> them and ran: `ant -Dtestcase=TestSolrCloudWithKerberosAlt clean test` but
> this test still fails with this error:-
> 
>   [junit4]> Throwable #1: java.lang.NoSuchFieldError: id_aes128_CBC
>   [junit4]> at
> __randomizedtesting.SeedInfo.seed([453D16027AC52FD9:78E5B82E422B71A9]:0)
> 
> 
> I see the jenkins build are all clean, so not sure what I am hitting.
> 
> https://builds.apache.org/job/Lucene-Solr-Maven-6.x/
> 
> https://builds.apache.org/job/Solr-Artifacts-6.x/
> 
> Regards
> Nawab



Solr 6.6 test failure: TestSolrCloudWithKerberosAlt.testBasics

2017-07-20 Thread Nawab Zada Asad Iqbal
Hi,

I cloned solr 6.6 branch today and I see this failure consistently.

TestSolrCloudWithKerberosAlt.testBasics


I had done some script changes but after seeing this failure I reverted
them and ran: `ant -Dtestcase=TestSolrCloudWithKerberosAlt clean test` but
this test still fails with this error:-

   [junit4]> Throwable #1: java.lang.NoSuchFieldError: id_aes128_CBC
   [junit4]> at
__randomizedtesting.SeedInfo.seed([453D16027AC52FD9:78E5B82E422B71A9]:0)


I see the jenkins build are all clean, so not sure what I am hitting.

https://builds.apache.org/job/Lucene-Solr-Maven-6.x/

https://builds.apache.org/job/Solr-Artifacts-6.x/

Regards
Nawab


Re: Copy field a source of copy field

2017-07-20 Thread Erick Erickson
Yep, we're not communicating ;)

Use the original source field for the genus, as:




The difficulty here is that there might be false hits if the genera
names happen to match words in the input that are not part of a
genus/species pair.
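A Schema API sketch of that idea (a copyField from the original text field into a genus field with a keep-words analyzer); all names, the source field and the genera.txt word list are placeholders:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field-type": {
    "name": "genus_keepwords",
    "class": "solr.TextField",
    "analyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [
        { "class": "solr.LowerCaseFilterFactory" },
        { "class": "solr.KeepWordFilterFactory", "words": "genera.txt", "ignoreCase": "true" }
      ]
    }
  },
  "add-field": { "name": "genus", "type": "genus_keepwords", "stored": true },
  "add-copy-field": { "source": "content", "dest": "genus" }
}' http://localhost:8983/solr/mycollection/schema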



On Thu, Jul 20, 2017 at 9:55 AM, tstusr  wrote:
> Well, correct me if I'm wrong.
>
> Your suggestion is to use species field as a source of genus field. We try
> with this
>
> 
> 
>
> Where species work as described and genus just use a KWF, like this:
>
>  positionIncrementGap="0">
> 
>   
>ignoreCase="true"/>
> 
> 
>   
>   
> 
>   
>
> But now, the problem now is different.
>
> When we try the behavior in analysis section in solr provided UI it works as
> expected.
>
> Nevertheless, when we use it at indexing time (When we post pdf files, to
> extractor) the field doesn't even appear. We think it's because the info
> becomes from another copyField.
>
> Did I misunderstand your suggestion?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Copy-field-a-source-of-copy-field-tp4346425p4347013.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: finds all documents without a value for field

2017-07-20 Thread Erick Erickson
One other possibility is to create a second boolean field "has_terms"
or something and just add an fq clause like "fq=has_terms:false"

On Thu, Jul 20, 2017 at 12:00 PM, Shawn Heisey  wrote:
> On 7/20/2017 7:20 AM, Hendrik Haddorp wrote:
>> the Solr 6.6. ref guide states that to "finds all documents without a
>> value for field" you can use:
>> -field:[* TO *]
>>
>> While this is true I'm wondering why it is recommended to use a range
>> query instead of simply:
>> -field:*
>
> Performance.
>
> A wildcard is expanded to all possible term values for that field.  If
> the field has millions of possible terms, then the query object created
> at the Lucene level will quite literally have millions of terms in it.
> No matter how you approach a query with those characteristics, it's
> going to be slow, for both getting the terms list and executing the query.
>
> A full range query might be somewhat slow when there are many possible
> values, but it's a lot faster than a wildcard in those cases.
>
> If the field is only used by a handful of documents and has very few
> possible values, then it might be faster than a range query ... but this
> is not common, so the recommended way to do this is with a range query.
>
> Thanks,
> Shawn
>
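A minimal sketch of that approach (field and collection names are placeholders; the has_terms flag has to be set by the indexing application or an update processor):

# query-time: restrict to documents flagged as having no value
curl 'http://localhost:8983/solr/mycollection/select?q=*:*&fq=has_terms:false'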


Re: finds all documents without a value for field

2017-07-20 Thread Shawn Heisey
On 7/20/2017 7:20 AM, Hendrik Haddorp wrote:
> the Solr 6.6. ref guide states that to "finds all documents without a
> value for field" you can use:
> -field:[* TO *]
>
> While this is true I'm wondering why it is recommended to use a range
> query instead of simply:
> -field:*

Performance.

A wildcard is expanded to all possible term values for that field.  If
the field has millions of possible terms, then the query object created
at the Lucene level will quite literally have millions of terms in it. 
No matter how you approach a query with those characteristics, it's
going to be slow, for both getting the terms list and executing the query.

A full range query might be somewhat slow when there are many possible
values, but it's a lot faster than a wildcard in those cases.

If the field is only used by a handful of documents and has very few
possible values, then it might be faster than a range query ... but this
is not common, so the recommended way to do this is with a range query.

Thanks,
Shawn
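To make the comparison concrete, a sketch with a placeholder collection and field:

# recommended: negated range query, matches docs with no value in the field
curl 'http://localhost:8983/solr/mycollection/select?q=*:*&fq=-publisher:%5B*%20TO%20*%5D'
# also works, but first expands to every distinct term in the field, which is slow on high-cardinality fields
curl 'http://localhost:8983/solr/mycollection/select?q=*:*&fq=-publisher:*'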



RE: Need guidance solrcloud shardings with date interval

2017-07-20 Thread Davis, Daniel (NIH/NLM) [C]
Muhammad,

This sounds like it might be handled better by multiple collections rather than 
multiple "sub collections". Create a new collection for each date, all using the 
same common config set, and then create an alias that contains all of these 
collections. The alias will then function as your "collection", and the 
date-specific collections will function as your "sub-collections", as sketched below.
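A sketch of that pattern with the Collections API (collection, alias and configset names are placeholders):

# create one collection per day from the shared configset
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=docs_2017-07-20&collection.configName=docs_conf&numShards=1&replicationFactor=2'
# point (or re-point) the alias at the set of live collections
curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=docs&collections=docs_2017-07-19,docs_2017-07-20'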

This is a supported scenario, and I agree with the others that playing around 
with specific shard placement and shards is a poor choice.

One way you could do something similar is to limit the # of shards/replicas 
used for date-specific collections.

Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Thursday, July 20, 2017 1:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Need guidance solrcloud shardings with date interval

Well, you have a bad problem. You have a requirement that forces you to build an 
expensive, unreliable search system.

You need to do specific shard creation at specific times every day. What 
happens if that fails? Does search go down until it is fixed because all 
searches are going to a shard that doesn’t exist? Or do the documents get 
randomly sent to existing shards, so you need to search all the shards anyway? 
If docs are distributed, you’ll need to clean that day up with delete by query. 
You need to build that as a failure recovery.

Does your code handle leap years for shard creation? Daylight saving time? How 
do you test that code?

You’ll be writing a lot of custom code that other people don’t need. If you are 
a consultant, this is great. For the customer, not so good.

Whoever wrote that requirement does not know very much about Solr. It sounds 
like they are trying to force RDBMS sharding onto Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 20, 2017, at 8:09 AM, rehman kahloon 
>  wrote:
> 
> Hi Eric,
>   Thank you very much for your guidance.
> No sir that is our requirmnt to load data into specific shard and later after 
> rentention time we will delete that shard.
> Please share if you have any manual sharding exercise dicument. 
> 2nd is it posible data automatically load into specific shard without using 
> shard name during loading. 
> 
> Is there any solr file where i mentioned all my shards name with specific 
> date. When data come automTically load dara into alredy mentioned shard?
> Once again Thank you very much. 
> Kind regards,Muhammad rehman kahloon
> 
> Sent from Yahoo Mail for iPhone
> 
> 
> On Thursday, July 20, 2017, 19:57, Erick Erickson  
> wrote:
> 
> Use the "implicit" router (being renamed "manual". that takes the 
> value of a particular field (_route_ by default) and sends docs to 
> that exact shard.
> 
> But I also question whether sharding on this schema is a good idea. If 
> you have an access pattern where most queries are for, say, the last 
> two days then all the work will be done on only 2 machines and all the 
> rest will be idle. You should at least consider just using normal 
> routing that distributes the data across all shards and then use 
> delete-by-query to delete the data older than 10 days.
> 
> Best,
> Erick
> 
> On Thu, Jul 20, 2017 at 12:51 AM, rehman kahloon 
>  wrote:
>> 
>> Hi Sir,
>> Taken your id from your document on SlideShare.
>> Need your guidance on my plan ,My target is to create sub-collection/shards 
>> within a collection.
>> e.g
>>   Currently 1 have 10 days data and want to store data 
>> against each date in separate partitions.  like oracle partition concepts 
>> (one table can have many partitions) Plan is to store each date data with in 
>> separate node, Total physical nodes are 10 and after 10 days, 11th date data 
>> load in node1 and existing data backup (oldest date data with purge and 
>> backed up).
>> Please guide me how can i perform that using SolrCloud.  1 collection with 
>> unlimited sub collection.
>> 
>> Thank you very much in advanced.
>> 
>> Kind Regards,Muhammad Rehman Kahloon.
> 
> 
> 



Re: Need guidance solrcloud shardings with date interval

2017-07-20 Thread Walter Underwood
Well, you have a bad problem. You have a requirement that forces you to build an 
expensive, unreliable search system.

You need to do specific shard creation at specific times every day. What 
happens if that fails? Does search go down until it is fixed because all 
searches are going to a shard that doesn’t exist? Or do the documents get 
randomly sent to existing shards, so you need to search all the shards anyway? 
If docs are distributed, you’ll need to clean that day up with delete by query. 
You need to build that as a failure recovery.

Does your code handle leap years for shard creation? Daylight saving time? How 
do you test that code?

You’ll be writing a lot of custom code that other people don’t need. If you are 
a consultant, this is great. For the customer, not so good.

Whoever wrote that requirement does not know very much about Solr. It sounds 
like they are trying to force RDBMS sharding onto Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 20, 2017, at 8:09 AM, rehman kahloon 
>  wrote:
> 
> Hi Eric,
>   Thank you very much for your guidance.
> No sir that is our requirmnt to load data into specific shard and later after 
> rentention time we will delete that shard.
> Please share if you have any manual sharding exercise dicument. 
> 2nd is it posible data automatically load into specific shard without using 
> shard name during loading. 
> 
> Is there any solr file where i mentioned all my shards name with specific 
> date. When data come automTically load dara into alredy mentioned shard?
> Once again Thank you very much. 
> Kind regards,Muhammad rehman kahloon
> 
> Sent from Yahoo Mail for iPhone
> 
> 
> On Thursday, July 20, 2017, 19:57, Erick Erickson  
> wrote:
> 
> Use the "implicit" router (being renamed "manual". that takes the
> value of a particular field (_route_ by default) and sends docs to
> that exact shard.
> 
> But I also question whether sharding on this schema is a good idea. If
> you have an access pattern where most queries are for, say, the last
> two days then all the work will be done on only 2 machines and all the
> rest will be idle. You should at least consider just using normal
> routing that distributes the data across all shards and then use
> delete-by-query to delete the data older than 10 days.
> 
> Best,
> Erick
> 
> On Thu, Jul 20, 2017 at 12:51 AM, rehman kahloon
>  wrote:
>> 
>> Hi Sir,
>> Taken your id from your document on SlideShare.
>> Need your guidance on my plan ,My target is to create sub-collection/shards 
>> within a collection.
>> e.g
>>   Currently 1 have 10 days data and want to store data against each 
>> date in separate partitions.  like oracle partition concepts (one table can 
>> have many partitions)
>> Plan is to store each date data with in separate node, Total physical nodes 
>> are 10 and after 10 days, 11th date data load in node1 and existing data 
>> backup (oldest date data with purge and backed up).
>> Please guide me how can i perform that using SolrCloud.  1 collection with 
>> unlimited sub collection.
>> 
>> Thank you very much in advanced.
>> 
>> Kind Regards,Muhammad Rehman Kahloon.
> 
> 
> 



Re: Copy field a source of copy field

2017-07-20 Thread tstusr
Well, correct me if I'm wrong.

Your suggestion is to use species field as a source of genus field. We try
with this




Where species works as described and genus just uses a KWF (KeepWordFilterFactory), like this:



  
  


  
  

  

But now the problem is different.

When we try the behavior in the Analysis section of the Solr admin UI, it works as
expected.

Nevertheless, when we use it at indexing time (when we post PDF files to the
extract handler), the field doesn't even appear. We think it's because the info
comes from another copyField.

Did I misunderstand your suggestion?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Copy-field-a-source-of-copy-field-tp4346425p4347013.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Need guidance solrcloud shardings with date interval

2017-07-20 Thread Erick Erickson
bq:  that is our requirmnt to load data into specific shard and later
after rentention time we will delete that shard

Why is it necessary to delete a shard when deleting the old data by
query removes it? This sounds like an XY problem. Someone has
"required" that you enforce data retention by deleting a shard so
you're asking about deleting shards. Whereas the problem is to purge
old data that _could_ be accomplished by rotating shards, but is
_also_ accomplished by just issuing a "delete all data more than 10
days old" query.

But if it's a requirement, see the reference guide for "implicit"
routing in the document routing sections.

Best,
Erick

On Thu, Jul 20, 2017 at 8:15 AM, Susheel Kumar  wrote:
> Agree. One should first try to measure the performance with standard/common
> approach.
>
> On Thu, Jul 20, 2017 at 11:00 AM, Walter Underwood 
> wrote:
>
>> I agree. Use the standard shard distribution and delete by query to remove
>> older documents.
>>
>> Much, much simpler and probably faster at query time.
>>
>> I’m seeing a lot of e-mails about people trying to do fancy things with
>> sharding before they’ve even tried and measured the performance.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> > On Jul 20, 2017, at 7:57 AM, Erick Erickson 
>> wrote:
>> >
>> > Use the "implicit" router (being renamed "manual". that takes the
>> > value of a particular field (_route_ by default) and sends docs to
>> > that exact shard.
>> >
>> > But I also question whether sharding on this schema is a good idea. If
>> > you have an access pattern where most queries are for, say, the last
>> > two days then all the work will be done on only 2 machines and all the
>> > rest will be idle. You should at least consider just using normal
>> > routing that distributes the data across all shards and then use
>> > delete-by-query to delete the data older than 10 days.
>> >
>> > Best,
>> > Erick
>> >
>> > On Thu, Jul 20, 2017 at 12:51 AM, rehman kahloon
>> >  wrote:
>> >>
>> >> Hi Sir,
>> >>Taken your id from your document on SlideShare.
>> >> Need your guidance on my plan ,My target is to create
>> sub-collection/shards within a collection.
>> >> e.g
>> >> Currently 1 have 10 days data and want to store data against
>> each date in separate partitions.  like oracle partition concepts (one
>> table can have many partitions)
>> >> Plan is to store each date data with in separate node, Total physical
>> nodes are 10 and after 10 days, 11th date data load in node1 and existing
>> data backup (oldest date data with purge and backed up).
>> >> Please guide me how can i perform that using SolrCloud.  1 collection
>> with unlimited sub collection.
>> >>
>> >> Thank you very much in advanced.
>> >>
>> >> Kind Regards,Muhammad Rehman Kahloon.
>>
>>


Re: Need guidance solrcloud shardings with date interval

2017-07-20 Thread rehman kahloon
Hi Eric,
  Thank you very much for your guidance.
No sir, that is our requirement: to load data into a specific shard and later, 
after the retention time, delete that shard.
Please share any manual sharding exercise document you may have.
Second, is it possible for data to automatically load into a specific shard 
without using the shard name during loading?

Is there any Solr file where I can list all my shard names with their specific 
dates, so that when data comes in it automatically loads into the already-defined shard?
Once again, thank you very much.
Kind regards,
Muhammad Rehman Kahloon

Sent from Yahoo Mail for iPhone


On Thursday, July 20, 2017, 19:57, Erick Erickson  
wrote:

Use the "implicit" router (being renamed "manual". that takes the
value of a particular field (_route_ by default) and sends docs to
that exact shard.

But I also question whether sharding on this schema is a good idea. If
you have an access pattern where most queries are for, say, the last
two days then all the work will be done on only 2 machines and all the
rest will be idle. You should at least consider just using normal
routing that distributes the data across all shards and then use
delete-by-query to delete the data older than 10 days.

Best,
Erick

On Thu, Jul 20, 2017 at 12:51 AM, rehman kahloon
 wrote:
>
> Hi Sir,
>            Taken your id from your document on SlideShare.
> Need your guidance on my plan ,My target is to create sub-collection/shards 
> within a collection.
> e.g
>          Currently 1 have 10 days data and want to store data against each 
>date in separate partitions.  like oracle partition concepts (one table can 
>have many partitions)
> Plan is to store each date data with in separate node, Total physical nodes 
> are 10 and after 10 days, 11th date data load in node1 and existing data 
> backup (oldest date data with purge and backed up).
> Please guide me how can i perform that using SolrCloud.  1 collection with 
> unlimited sub collection.
>
> Thank you very much in advanced.
>
> Kind Regards,Muhammad Rehman Kahloon.





Re: Debug Queries field explaination

2017-07-20 Thread Charlie Hull

On 20/07/2017 11:41, Swapnil Pande wrote:

Hi ,
Being an amateur in solr i wanted to learn how solr queries internally and
how score is calculated.
So setting debug=true.
I get a json with fields like 'fromSetSize' , 'toSetSize'.. etc.
Can I get a reference link to understand what these exactly mean.


You might take a look at Open Source Connections' excellent Splainer tool 
- www.splainer.io


Cheers

Charlie


Thanks.






--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Need guidance solrcloud shardings with date interval

2017-07-20 Thread Susheel Kumar
Agree. One should first try to measure the performance with standard/common
approach.

On Thu, Jul 20, 2017 at 11:00 AM, Walter Underwood 
wrote:

> I agree. Use the standard shard distribution and delete by query to remove
> older documents.
>
> Much, much simpler and probably faster at query time.
>
> I’m seeing a lot of e-mails about people trying to do fancy things with
> sharding before they’ve even tried and measured the performance.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jul 20, 2017, at 7:57 AM, Erick Erickson 
> wrote:
> >
> > Use the "implicit" router (being renamed "manual". that takes the
> > value of a particular field (_route_ by default) and sends docs to
> > that exact shard.
> >
> > But I also question whether sharding on this schema is a good idea. If
> > you have an access pattern where most queries are for, say, the last
> > two days then all the work will be done on only 2 machines and all the
> > rest will be idle. You should at least consider just using normal
> > routing that distributes the data across all shards and then use
> > delete-by-query to delete the data older than 10 days.
> >
> > Best,
> > Erick
> >
> > On Thu, Jul 20, 2017 at 12:51 AM, rehman kahloon
> >  wrote:
> >>
> >> Hi Sir,
> >>Taken your id from your document on SlideShare.
> >> Need your guidance on my plan ,My target is to create
> sub-collection/shards within a collection.
> >> e.g
> >> Currently 1 have 10 days data and want to store data against
> each date in separate partitions.  like oracle partition concepts (one
> table can have many partitions)
> >> Plan is to store each date data with in separate node, Total physical
> nodes are 10 and after 10 days, 11th date data load in node1 and existing
> data backup (oldest date data with purge and backed up).
> >> Please guide me how can i perform that using SolrCloud.  1 collection
> with unlimited sub collection.
> >>
> >> Thank you very much in advanced.
> >>
> >> Kind Regards,Muhammad Rehman Kahloon.
>
>


Re: Getting IO Exception while Indexing

2017-07-20 Thread Susheel Kumar
You can try to submit only the failed documents directly, one by one or all at
once, and see if you get any errors.

On Thu, Jul 20, 2017 at 11:01 AM, Walter Underwood 
wrote:

> If Apache is returning 400, then it really is a bad request. Debug the
> request and fix it.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Jul 19, 2017, at 11:27 PM, mesenthil1  viacomcontractor.com> wrote:
> >
> > Hi,
> > This is happening repeatedly for few documents.  When we compared with
> other
> > similar documents, we could not find any difference.
> >
> > As we are seeing 400 on apache, the request is not submitted to solr.  So
> > unable to find out the cause.
> >
> > Senthil
> >
> >
> >
> > --
> > View this message in context: http://lucene.472066.n3.
> nabble.com/Getting-IO-Exception-while-Indexing-Documents-in-SolrCloud-
> tp4346801p4346930.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Getting IO Exception while Indexing

2017-07-20 Thread Walter Underwood
If Apache is returning 400, then it really is a bad request. Debug the request 
and fix it.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 19, 2017, at 11:27 PM, mesenthil1 
>  wrote:
> 
> Hi, 
> This is happening repeatedly for few documents.  When we compared with other
> similar documents, we could not find any difference. 
> 
> As we are seeing 400 on apache, the request is not submitted to solr.  So
> unable to find out the cause. 
> 
> Senthil
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Getting-IO-Exception-while-Indexing-Documents-in-SolrCloud-tp4346801p4346930.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Need guidance solrcloud shardings with date interval

2017-07-20 Thread Walter Underwood
I agree. Use the standard shard distribution and delete by query to remove 
older documents.

Much, much simpler and probably faster at query time.

I’m seeing a lot of e-mails about people trying to do fancy things with 
sharding before they’ve even tried and measured the performance.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 20, 2017, at 7:57 AM, Erick Erickson  wrote:
> 
> Use the "implicit" router (being renamed "manual". that takes the
> value of a particular field (_route_ by default) and sends docs to
> that exact shard.
> 
> But I also question whether sharding on this schema is a good idea. If
> you have an access pattern where most queries are for, say, the last
> two days then all the work will be done on only 2 machines and all the
> rest will be idle. You should at least consider just using normal
> routing that distributes the data across all shards and then use
> delete-by-query to delete the data older than 10 days.
> 
> Best,
> Erick
> 
> On Thu, Jul 20, 2017 at 12:51 AM, rehman kahloon
>  wrote:
>> 
>> Hi Sir,
>>Taken your id from your document on SlideShare.
>> Need your guidance on my plan ,My target is to create sub-collection/shards 
>> within a collection.
>> e.g
>> Currently 1 have 10 days data and want to store data against each 
>> date in separate partitions.  like oracle partition concepts (one table can 
>> have many partitions)
>> Plan is to store each date data with in separate node, Total physical nodes 
>> are 10 and after 10 days, 11th date data load in node1 and existing data 
>> backup (oldest date data with purge and backed up).
>> Please guide me how can i perform that using SolrCloud.  1 collection with 
>> unlimited sub collection.
>> 
>> Thank you very much in advanced.
>> 
>> Kind Regards,Muhammad Rehman Kahloon.



Re: Getting IO Exception while Indexing

2017-07-20 Thread mesenthil1
Hi, 
This is happening repeatedly for a few documents.  When we compared with other
similar documents, we could not find any difference. 

As we are seeing 400 on apache, the request is not submitted to solr.  So
unable to find out the cause. 

Senthil



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-IO-Exception-while-Indexing-Documents-in-SolrCloud-tp4346801p4346930.html
Sent from the Solr - User mailing list archive at Nabble.com.


Debug Queries field explaination

2017-07-20 Thread Swapnil Pande
Hi ,
Being an amateur in Solr, I wanted to learn how Solr queries work internally and
how the score is calculated.
So I set debug=true.
I get a JSON response with fields like 'fromSetSize', 'toSetSize', etc.
Can I get a reference link to understand what these mean exactly?

Thanks.


Re: Need guidance solrcloud shardings with date interval

2017-07-20 Thread Erick Erickson
Use the "implicit" router (being renamed "manual". that takes the
value of a particular field (_route_ by default) and sends docs to
that exact shard.
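A sketch of that setup (collection, shard and configset names are placeholders; defining router.field at collection creation is the alternative to passing _route_ on each request):

# create a collection with explicitly named shards
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=daily&router.name=implicit&shards=d20170719,d20170720&collection.configName=daily_conf&maxShardsPerNode=2'
# send an update to a specific shard
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/daily/update?commit=true&_route_=d20170720' \
  --data-binary '[{"id":"doc-1","body_t":"example"}]'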

But I also question whether sharding on this schema is a good idea. If
you have an access pattern where most queries are for, say, the last
two days then all the work will be done on only 2 machines and all the
rest will be idle. You should at least consider just using normal
routing that distributes the data across all shards and then use
delete-by-query to delete the data older than 10 days.

Best,
Erick

On Thu, Jul 20, 2017 at 12:51 AM, rehman kahloon
 wrote:
>
> Hi Sir,
> Taken your id from your document on SlideShare.
> Need your guidance on my plan ,My target is to create sub-collection/shards 
> within a collection.
> e.g
>  Currently 1 have 10 days data and want to store data against each 
> date in separate partitions.  like oracle partition concepts (one table can 
> have many partitions)
> Plan is to store each date data with in separate node, Total physical nodes 
> are 10 and after 10 days, 11th date data load in node1 and existing data 
> backup (oldest date data with purge and backed up).
> Please guide me how can i perform that using SolrCloud.  1 collection with 
> unlimited sub collection.
>
> Thank you very much in advanced.
>
> Kind Regards,Muhammad Rehman Kahloon.


Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread Erick Erickson
1> are you replaying the tlog? If you have a large tlog for some
reason you may be replaying it. Although a reload should do a commit
first.

2> What do the Solr logs show the node in question to be doing?

3> Sorry to mislead you, async is not a 4.10 option for the RELOAD
command so that was bogus on my part, that support was added later.

Best,
Erick


On Thu, Jul 20, 2017 at 4:38 AM, alessandro.benedetti
 wrote:
> Additional information :
> Try single core reload I identified that an entire shard is not reloading (
> while the other shard is ).
> Taking a look to the "not reloading" shard ( 2 replicas) , it seems that the
> core reload stucks here :
>
> org.apache.solr.core.SolrCores#waitAddPendingCoreOps
>
> The problem is that the wait seems to continue indefinitely and silently.
> Apart a restart, is there any way to clean up the pending core operations ?
> I will continue my investigations
>
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346966.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread Erick Erickson
The key is removing the entire data directory as in

"rm -rf solr_core/data"

with Solr down then restarting Solr. Or create a new core.

It's most probably working on Windows because the schema
was set with multiValued=false when you indexed your first
document.

Best,
Erick

On Thu, Jul 20, 2017 at 5:16 AM, prashantas  wrote:
> I am not running solr in cloud mode.
>
> On Thu, Jul 20, 2017 at 4:40 PM, Shawn Heisey-2 [via Lucene] <
> ml+s472066n4346954...@n3.nabble.com> wrote:
>
>> On 7/20/2017 2:30 AM, prashantas wrote:
>> > I am using solr6.4. In my managed-schema, I have defined my field
>> details.
>> > None of my fields are multiValued. If I set property multiValued=false ,
>> it
>> > works fine in Windows, but in CentOS/RHEL, it does not accept the same
>> and
>> > the field still shows multiValued true in my solr admin UI. Please help
>> me
>> > how can I set multiValued = false  in some fields.
>> > 
>>
>>
>> Is Solr running in cloud mode on either of these systems?
>>
>> Thanks,
>> Shawn
>>
>>
>>
>>
>
>
>
> --
>
> *with regards,Prashanta*
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346967.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: default values for numRecordsToKeep and maxNumLogsToKeep

2017-07-20 Thread Erick Erickson
bq:  I am pretty sure that anytime a core starts for *any* reason, all
the transaction logs that are present will get replayed.

This isn't quite true. If Solr is shut down gracefully, or a hard
commit happened before shutdown (with no new docs added) then the tlog
will _not_ be replayed on startup. It's only when Solr is killed
without a commit and thus the tlog is not truncated (and segments not
closed) by hard commit that the tlog will be replayed on startup.
Which is why I strongly recommend that people stop Solr with the
script rather than "kill -9".

Best,
Erick

On Thu, Jul 20, 2017 at 5:39 AM, Shawn Heisey  wrote:
> On 7/18/2017 11:53 AM, suresh pendap wrote:
>> After looking at the source code I see that the default values for
>> numRecordsToKeep is 100 and maxNumLogsToKeep is 10.
>>
>> So it seems by default the replica can only have 1000 document updates lag
>> before the replica goes for a Full recovery from the leader.
>
> I don't think that's quite right.  In many situations, the number of
> documents in the transaction log will likely be less than 1000.
>
> Enough logs will be kept that *at least* 100 documents are there, if
> that can be accomplished with ten logfiles or less.  Based on a quick
> reading of the code, if the newest ten logs have less than 100
> documents, then there will be less than 100 docs available.  This would
> not end up being a problem for data integrity, because small infrequent
> updates would be the only way to end up with less than 100 docs, and in
> that situation, the small number of documents in the transaction log,
> when replayed at core startup, will be enough to ensure integrity.
>
> I think the reasons the default numbers are so small is an attempt to
> keep startup time low.  I am pretty sure that anytime a core starts for
> *any* reason, all the transaction logs that are present will get
> replayed.  I know for sure that this happens on Solr restart; I think it
> also happens on core reload.  By keeping the required minimum documents
> at a low value like 100, there's a better chance that the transaction
> logs will be small, and therefore core startup will be fast.
>
> On a system where there are no hard commits, all updates end up going
> into a single super-large transaction log.  This meets the default
> configuration numbers, because there are less than ten logs present, and
> what is present contains at least 100 documents.  Unfortunately, this
> means that when the core starts, it will replay that HUGE transaction
> log, a process that could take hours.  This situation is prevented by
> enabling autoCommit with a relatively short maxTime value.  Setting
> openSearcher to false in the autoCommit ensures that document visibility
> behavior is not altered by autoCommit.
>
> Thanks,
> Shawn
>


Re: finds all documents without a value for field

2017-07-20 Thread Hendrik Haddorp

forgot the link with the statement:
https://lucene.apache.org/solr/guide/6_6/the-standard-query-parser.html

On 20.07.2017 15:20, Hendrik Haddorp wrote:

Hi,

the Solr 6.6. ref guide states that to "finds all documents without a 
value for field" you can use:

-field:[* TO *]

While this is true I'm wondering why it is recommended to use a range 
query instead of simply:

-field:*

regards,
Hendrik




finds all documents without a value for field

2017-07-20 Thread Hendrik Haddorp

Hi,

the Solr 6.6. ref guide states that to "finds all documents without a 
value for field" you can use:

-field:[* TO *]

While this is true I'm wondering why it is recommended to use a range 
query instead of simply:

-field:*

regards,
Hendrik


RE: 6.6 cloud starting to eat CPU after 8+ hours

2017-07-20 Thread Markus Jelsma
cc mailinglist

Hello,

I thought that would come to your mind, but do not worry: the heap averages at 
55% all day long, and there is very little garbage collection going on; when it 
does happen, it is the eden space that gets collected. If you really want, I can 
send such a file when the problem occurs again, but even at those moments GC is 
minimal and the heap stays at about 55-60%, only peaking every 15 minutes when 
documents are indexed.

Thanks,
Markus
 
-Original message-
> From:Shawn Heisey 
> Sent: Wednesday 19th July 2017 16:08
> To: Markus Jelsma 
> Subject: Re: 6.6 cloud starting to eat CPU after 8+ hours
> 
> On 7/19/2017 3:35 AM, Markus Jelsma wrote:
> > Another peculiarity here, our six node (2 shards / 3 replica's) cluster is 
> > going crazy after a good part of the day has passed. It starts eating CPU 
> > for no good reason and its latency goes up. Grafana graphs show the problem 
> > really well
> >
> > After restarting 2/6 nodes, there is also quite a distinction in the 
> > VisualVM monitor views, and the VisualVM CPU sampler reports (sorted on 
> > self time (CPU)). The busy nodes are deeply red in 
> > o.a.h.impl.io.AbstractSessionInputBuffer.fillBuffer (as usual), the 
> > restarted nodes are not.
> >
> > The real distinction between busy and calm nodes is that busy nodes all 
> > have o.a.l.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms() as 
> > second to fillBuffer(), what are they doing?! Why? The calm nodes don't 
> > show this at all. Busy nodes all have o.a.l.codec stuff on top, restarted 
> > nodes don't.
> >
> > So, actually, i don't have a clue! Any, any ideas? 
> >
> > Thanks,
> > Markus
> >
> > Each replica is underpowered but performing really well after restart (and 
> > JVM warmup), 4 CPU's, 900M heap, 8 GB RAM, maxDoc 2.8 million, index size 
> > 18 GB.
> 
> A 900MB heap seems very small for an 18GB index with millions of
> documents.  The first thing I would suspect is that the heap is running
> very near the maximum and the JVM is spending a lot of time doing
> garbage collection.  Can you share the gc.log file from an instance that
> is running the high CPU so this  can be checked?  I'd also be interested
> in seeing solrconfig.xml.
> 
> Thanks,
> Shawn
> 
> 


Re: default values for numRecordsToKeep and maxNumLogsToKeep

2017-07-20 Thread Shawn Heisey
On 7/18/2017 11:53 AM, suresh pendap wrote:
> After looking at the source code I see that the default values for
> numRecordsToKeep is 100 and maxNumLogsToKeep is 10.
>
> So it seems by default the replica can only have 1000 document updates lag
> before the replica goes for a Full recovery from the leader.

I don't think that's quite right.  In many situations, the number of
documents in the transaction log will likely be less than 1000.

Enough logs will be kept that *at least* 100 documents are there, if
that can be accomplished with ten logfiles or less.  Based on a quick
reading of the code, if the newest ten logs have less than 100
documents, then there will be less than 100 docs available.  This would
not end up being a problem for data integrity, because small infrequent
updates would be the only way to end up with less than 100 docs, and in
that situation, the small number of documents in the transaction log,
when replayed at core startup, will be enough to ensure integrity.

I think the reasons the default numbers are so small is an attempt to
keep startup time low.  I am pretty sure that anytime a core starts for
*any* reason, all the transaction logs that are present will get
replayed.  I know for sure that this happens on Solr restart; I think it
also happens on core reload.  By keeping the required minimum documents
at a low value like 100, there's a better chance that the transaction
logs will be small, and therefore core startup will be fast.

On a system where there are no hard commits, all updates end up going
into a single super-large transaction log.  This meets the default
configuration numbers, because there are less than ten logs present, and
what is present contains at least 100 documents.  Unfortunately, this
means that when the core starts, it will replay that HUGE transaction
log, a process that could take hours.  This situation is prevented by
enabling autoCommit with a relatively short maxTime value.  Setting
openSearcher to false in the autoCommit ensures that document visibility
behavior is not altered by autoCommit.

Thanks,
Shawn
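One way to put that autoCommit in place without hand-editing solrconfig.xml is the Config API; a sketch, assuming the updateHandler autoCommit properties are editable via set-property (collection name is a placeholder, and the 60-second maxTime is just an example):

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/mycollection/config' \
  --data-binary '{"set-property": {"updateHandler.autoCommit.maxTime": 60000, "updateHandler.autoCommit.openSearcher": false}}'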



Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread prashantas
I am not running solr in cloud mode.

On Thu, Jul 20, 2017 at 4:40 PM, Shawn Heisey-2 [via Lucene] <
ml+s472066n4346954...@n3.nabble.com> wrote:

> On 7/20/2017 2:30 AM, prashantas wrote:
> > I am using solr6.4. In my managed-schema, I have defined my field
> details.
> > None of my fields are multiValued. If I set property multiValued=false ,
> it
> > works fine in Windows, but in CentOS/RHEL, it does not accept the same
> and
> > the field still shows multiValued true in my solr admin UI. Please help
> me
> > how can I set multiValued = false  in some fields.
> > 
>
>
> Is Solr running in cloud mode on either of these systems?
>
> Thanks,
> Shawn
>
>
>
>



-- 

*with regards,Prashanta*




--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346967.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Additional information:
Trying a single-core reload, I identified that an entire shard is not reloading
(while the other shard is).
Taking a look at the "not reloading" shard (2 replicas), it seems that the
core reload gets stuck here:

org.apache.solr.core.SolrCores#waitAddPendingCoreOps

The problem is that the wait seems to continue indefinitely and silently.
Apart from a restart, is there any way to clean up the pending core operations?
I will continue my investigations.




-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346966.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Issue While indexing Data

2017-07-20 Thread rajat rastogi
Hi Shawn ,

I mailed you the info @ apa...@elyograg.org
I can resend the mail.

regards

Rajat



On 20-Jul-2017, at 16:40, Shawn Heisey-2 [via Lucene] wrote:

On 7/20/2017 12:29 AM, rajat rastogi wrote:
> I shared The code base, config , schema with you . Were they of any help , or 
> can You point what I am doing wrong in them .

I did not see any schema or config.

The top output shows that you have three large Java processes, all
running as root.  Which of these is Solr?  Are they all instances of Solr?

Thanks,
Shawn









--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Issue-While-indexing-Data-tp4339417p4346956.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread Shawn Heisey
On 7/20/2017 2:30 AM, prashantas wrote:
> I am using solr6.4. In my managed-schema, I have defined my field details.
> None of my fields are multiValued. If I set property multiValued=false , it
> works fine in Windows, but in CentOS/RHEL, it does not accept the same and
> the field still shows multiValued true in my solr admin UI. Please help me
> how can I set multiValued = false  in some fields.
>  

Is Solr running in cloud mode on either of these systems?

Thanks,
Shawn



Re: Solr Issue While indexing Data

2017-07-20 Thread Shawn Heisey
On 7/20/2017 12:29 AM, rajat rastogi wrote:
> I shared The code base, config , schema with you . Were they of any help , or 
> can You point what I am doing wrong in them .

I did not see any schema or config.

The top output shows that you have three large Java processes, all
running as root.  Which of these is Solr?  Are they all instances of Solr?

Thanks,
Shawn



Re: Boost by Integer value on top of query

2017-07-20 Thread Erik Hatcher
If you’re using edismax, adding boost parameters 
`boost=num_employees&boost=num_locations` should incorporate those integers 
into the scores.  Just try one at a time at first - you’ll likely want to wrap 
it into a single function, along the lines of something like 
`boost=mul(num_employees,num_locations)` 
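
For illustration, the same boost can also be baked into the handler as a default
in solrconfig.xml - a minimal sketch, where the qf field names are only
hypothetical placeholders for your company-name/nationality fields:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- hypothetical query fields, just for this example -->
    <str name="qf">company_name nationality</str>
    <!-- multiply the relevance score by the product of the two integer fields -->
    <str name="boost">mul(num_employees,num_locations)</str>
  </lst>
</requestHandler>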

Erik



> On Jul 20, 2017, at 6:35 AM, marotosg  wrote:
> 
> Hi,
> 
> I have a use where I need to boost documents based on two integer values.
> Basically I need to retrieve companies using specific criteria like Company
> name, nationality etc. 
> On top of that query I need to boost the most important ones which are
> suppose to be the ones with higher number of employees or locations around
> the world.
> 
> These are two integer fields on my Solr index. My question here is
> How can I boost  the companies with a higher number of employees or
> locations?
> 
> Thanks,
> Sergio MAroto
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Boost-by-Integer-value-on-top-of-query-tp4346948.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Taking a look at the 4.10.2 source, I may see why the async call does not work:

/log.info("Reloading Collection : " + req.getParamString());
String name = req.getParams().required().get("name");

*ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION,
OverseerCollectionProcessor.RELOADCOLLECTION, "name", name);*

handleResponse(OverseerCollectionProcessor.RELOADCOLLECTION, m, rsp);
/

Are we sure we are actually passing the "async" param as a ZkNodeProp ?
Because the handleResponse does :

private void handleResponse(String operation, ZkNodeProps m,
    SolrQueryResponse rsp, long timeout)
...
if(m.containsKey(ASYNC) && m.get(ASYNC) != null) {
 
   String asyncId = m.getStr(ASYNC);
...



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346949.html
Sent from the Solr - User mailing list archive at Nabble.com.


Boost by Integer value on top of query

2017-07-20 Thread marotosg
Hi,

I have a use case where I need to boost documents based on two integer values.
Basically I need to retrieve companies using specific criteria like company
name, nationality, etc.
On top of that query I need to boost the most important ones, which are
supposed to be the ones with a higher number of employees or locations around
the world.

These are two integer fields in my Solr index. My question here is:
how can I boost the companies with a higher number of employees or
locations?

Thanks,
Sergio MAroto



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Boost-by-Integer-value-on-top-of-query-tp4346948.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread Amrit Sarkar
By saying:

 I am just adding multiValued=false in the managed-schema file.


Are you modifying the "conf" directory in the local filesystem, or going into the
core's conf directory and changing it there? If you are running SolrCloud, you
should apply the same change in ZooKeeper.
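
For reference, a single-valued field declaration in managed-schema is a one-line
sketch like this ("school_name" is only a hypothetical field name):

<field name="school_name" type="string" indexed="true" stored="true" multiValued="false"/>

If the field was originally created by the add-unknown-fields update processor
(the data-driven configset), it may have been guessed as a multiValued type; in
that case declare it explicitly as above, reload the core or collection, and
reindex.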


Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread alessandro.benedetti
Assuming "service solr restart" does its job, I think the only thing I would do
is to completely remove the data directory content, instead of just running the
delete query.

Bear in mind that when you delete a document in Solr, it is only marked as
deleted, and it can potentially take a while until it really leaves the index
(after a successful segment merge).
This could lead to conflicts in the data structures when documents indexed
under different schemas are in the index.
I don't know if that is your case, but I would double check.
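
If the deleted documents need to physically leave the index sooner, a commit
that asks Solr to merge away segments containing deletes can be posted to
/update - a sketch of the XML body:

<commit expungeDeletes="true"/>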



-
---
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346945.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Create too many zookeeper connections when recreate CloudSolrServer instance

2017-07-20 Thread wg85907
Hi Walter, Shawn,
Thanks for your quick reply; the information you provided is really
helpful. Now I know how to find the right way to resolve my issue.
Regards,
Geng, Wei 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Create-too-many-zookeeper-connections-when-recreate-CloudSolrServer-instance-tp4346040p4346944.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread prashantas
I am just adding multiValued=false in the managed-schema file.

Then deleting the complete data by running the command:   curl
http://localhost:8983/solr/Schools/update?commit=true -d
'<delete><query>*:*</query></delete>'   where 'Schools' is my core name.

Then restarting Solr with "service solr restart",
and then importing the CSV file by executing the command:  curl
'http://localhost:8983/solr/Schools/update?commit=true' --data-binary
@tbl_SCHOOLS.csv -H 'Content-type:application/csv'

Please let me know if I am doing anything wrong.

with regards,
Prashanta

On Thu, Jul 20, 2017 at 2:29 PM, alessandro.benedetti [via Lucene] <
ml+s472066n4346941...@n3.nabble.com> wrote:

> I doubt it is an environment problem at all.
> How are you modifying your schema ?
> How you reloading your core/collection ?
> Are you restarting your Solr instance ?
>
> Regards
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
>
>



-- 

with regards, Prashanta




--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346943.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread alessandro.benedetti
I doubt it is an environment problem at all.
How are you modifying your schema ?
How you reloading your core/collection ?
Are you restarting your Solr instance ?

Regards



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939p4346941.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Apache Solr 4.10.x - Collection Reload times out

2017-07-20 Thread alessandro.benedetti
Thanks for the prompt response Erick,
the reason I am issuing a collection reload is that I modify the solrconfig from
time to time, for example with different spellcheck and request-parameter
default params.
So after the upload to Zookeeper I reload the collection to reflect the
modification.
Aliasing is definitely a valid option, but at the moment I haven't set up the
infrastructure necessary to operate that programmatically.

Returning to my issue, I see no effect at all if I try to run the request
async (it seems like the parameter is completely ignored).

http://blabla:8983/solr/admin/collections?action=RELOAD&name=news&async=55

I checked the source code and the async param seems to be present in the 4.10.2
version, so this is really weird.
I will proceed with my investigations.



-
---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-Solr-4-10-x-Collection-Reload-times-out-tp4346075p4346940.html
Sent from the Solr - User mailing list archive at Nabble.com.


multiValued=false is not working in Solr 6.4 in RHEL/CentOS

2017-07-20 Thread prashantas
I am using solr6.4. In my managed-schema, I have defined my field details.
None of my fields are multiValued. If I set property multiValued=false , it
works fine in Windows, but in CentOS/RHEL, it does not accept the same and
the field still shows multiValued true in my solr admin UI. Please help me
how can I set multiValued = false  in some fields.
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/multiValued-false-is-not-working-in-Solr-6-4-in-RHEL-CentOS-tp4346939.html
Sent from the Solr - User mailing list archive at Nabble.com.


Need guidance solrcloud shardings with date interval

2017-07-20 Thread rehman kahloon

Hi Sir,
            I got your ID from your document on SlideShare.
I need your guidance on my plan. My target is to create sub-collections/shards
within a collection.
For example: currently I have 10 days of data and want to store the data for
each date in a separate partition, like Oracle's partitioning concept (one table
can have many partitions).
The plan is to store each date's data on a separate node. There are 10 physical
nodes in total; after 10 days, the 11th date's data is loaded into node1 and the
existing data is backed up (the oldest date's data is backed up and purged).
Please guide me on how I can do that using SolrCloud: one collection with
unlimited sub-collections.

Thank you very much in advance.

Kind Regards, Muhammad Rehman Kahloon.

Re: Solr Issue While indexing Data

2017-07-20 Thread rajat rastogi
Hi Shawn ,

I shared the code base, config, and schema with you. Were they of any help, or
can you point out what I am doing wrong in them?

regards

Rajat

On 19-Jul-2017, at 21:41, Shawn Heisey-2 [via Lucene] wrote:

On 6/7/2017 5:10 AM, [hidden email] wrote:
> My enviorment
>
> os :Ubuntu 14.04.1 LTS
> java : Orcale hotspot 1.8.0_121
> solr version :6.4.2
> cpu :16 cores
> ram :124 gb

Everybody seems to want different information from you.  Here's my
contribution:

On the linux commandline, run the "top" utility (not htop, or anything
else, actually type "top").  Press shift-M to sort the list by memory
usage, then grab a screenshot or a photo of that display.  Share the
image with us in some way.  Typically a file-sharing website is the best
option.

That will provide a wealth of information that can be useful for
narrowing down performance issues.

Thanks,
Shawn









--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Issue-While-indexing-Data-tp4339417p4346931.html
Sent from the Solr - User mailing list archive at Nabble.com.