Re: Heavy operations in PostFilter are heavy

2018-01-03 Thread Solrmails
Yes, I do so. The problem is that the collect method is called for EVERY 
document the query matches, even if the user only wants to see, say, 10 
documents. The operation I have to perform takes maybe 50 ms per document if I 
have to process them singly, and maybe 30 ms if I could get a document list. But 
if the user e.g. uses a wildcard query that matches maybe 10 documents, 
even 25 ms is much too long.

I can't speed up my code any more. Is there another good place to do my checks? 
I tried to remove the documents later, but I don't know how to fetch more 
documents after removing them in a later step. (Otherwise I would return maybe 
only 5 or zero documents even if the user wants 10 and there are more documents 
available.)
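For what it's worth, a rough, untested sketch of one way to cut the per-document cost: buffer every match in collect() and run a single batched check in finish(), only then forwarding survivors to the delegate. checkBatch() stands in for a hypothetical list-based version of your check; the replay logic follows the "collect late" pattern CollapsingQParserPlugin uses.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LeafCollector;
import org.apache.solr.search.DelegatingCollector;

// Buffer every match cheaply; do ONE batched check in finish(); then replay
// the survivors into the delegate.
public class BatchingCollector extends DelegatingCollector {
  private final IndexSearcher searcher; // pass in from PostFilter.getFilterCollector(searcher)
  private final List<Integer> matches = new ArrayList<>();

  public BatchingCollector(IndexSearcher searcher) {
    this.searcher = searcher;
  }

  @Override
  public void collect(int doc) throws IOException {
    matches.add(docBase + doc); // no heavy work here, just remember the global docid
  }

  @Override
  public void finish() throws IOException {
    List<Integer> passed = checkBatch(matches); // one batched call for the whole list
    List<LeafReaderContext> leaves = searcher.getTopReaderContext().leaves();
    for (int globalDoc : passed) { // 'passed' must stay in docid order
      LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(globalDoc, leaves));
      // Real code should reuse the LeafCollector while consecutive docs share a
      // segment, and call setScorer() on it if anything downstream needs scores.
      LeafCollector lc = delegate.getLeafCollector(leaf);
      lc.collect(globalDoc - leaf.docBase);
    }
    if (delegate instanceof DelegatingCollector) {
      ((DelegatingCollector) delegate).finish();
    }
  }

  // Placeholder for the expensive check, done once for the whole list.
  private List<Integer> checkBatch(List<Integer> globalDocIds) {
    return globalDocIds;
  }
}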

Sent with [ProtonMail](https://protonmail.com) Secure Email.

>  Original Message 
> Subject: Re: Heavy operations in PostFilter are heavy
> Local Time: 3 January 2018 4:08 PM
> UTC Time: 3 January 2018 15:08
> From: arafa...@gmail.com
> To: solr-user , Solrmails 
> 
>
> Are you doing cache=false and cost > 100?
>
> See the recent article on the topic deep-dive, if you haven't:
> https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
>
> Regards,
> Alex.
>
> On 3 January 2018 at 05:31, Solrmails solrma...@protonmail.com wrote:
>
>> Hello,
>> I tried to write a Solr PostFilter to do filtering within the 
>> 'collect'-Method(DelegatingCollector). I have to do some heavy operations 
>> within the 'collect'-Method. This isn't a problem for a few results. But 
>> unfortunately it takes forever with 50 or more results. This is because I 
>> have to do the checks for every single id again and can't process a list of 
>> ids within 'collect'.
>> Is there a better place to do PostFiltering? But I don't want to reimplement 
>> the Solr Paging/Cursor feature to get my things to work.
>> Thank You

Re: Query fields with data of certain length

2018-01-03 Thread Zheng Lin Edwin Yeo
Hi Emir,

So this would likely be different from what the operating system counts, as
the operating system may consider each Chinese character to be 3 to 4 bytes.
Which is probably why I could not find any record with subject:/.{255,}.*/

Are there other tools that we can use to query the length of data that is
already indexed and not in standard English? (E.g. Chinese, Japanese, etc.)

Regards,
Edwin

On 3 January 2018 at 23:51, Emir Arnautović 
wrote:

> Hi Edwin,
> I do not know, but my guess would be that each character is counted as 1
> in regex regardless how many bytes it takes in used encoding.
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo 
> wrote:
> >
> > Thanks for the reply.
> >
> > I am doing the search on existing data that has already been indexed, and
> > it is likely to be a one time thing.
> >
> > This  subject:/.{255,}.*/  works for English characters. However, there are
> > Chinese characters in some of the records. The length seems to be more than
> > 255, but they do not show up in the results.
> >
> > Do you know how the length for Chinese characters and other languages is
> > determined?
> >
> > Regards,
> > Edwin
> >
> >
> > On 3 January 2018 at 23:01, Alexandre Rafalovitch 
> > wrote:
> >
> >> Do that during indexing as Emir suggested. Specifically, use an
> >> UpdateRequestProcessor chain, probably with the Clone and FieldLength
> >> processors: http://www.solr-start.com/javadoc/solr-lucene/org/
> >> apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
> >>
> >> Regards,
> >>   Alex.
> >>
> >> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo  >
> >> wrote:
> >>> Hi,
> >>>
> >>> Would like to check, if it is possible to query a field which has data
> of
> >>> more than a certain length?
> >>>
> >>> Like for example, I want to query the field subject that has more than
> >> 255
> >>> bytes. Is it possible?
> >>>
> >>> I am currently using Solr 6.5.1.
> >>>
> >>> Regards,
> >>> Edwin
> >>
>
>


problem with Solr Sorting by score and distance together

2018-01-03 Thread Deepak Udapudi
Hi all,

Problem :-

Assume that I am searching for car care centers. The Solr collection has the data 
for all the major car care centers. As an example, I search for Firestone car 
care centers in a 5-mile radius. In the search results I am supposed to 
receive the list of Firestone car care centers within 5 miles of the specified 
location, and the centers should be sorted in distance order.

In the Solr query handler I have specified the following.


i)    I have specified the query condition (q) to be based on the 
distance parameter (basically, search for records within a certain distance in 
miles).

ii)   I have specified the filter query conditions (fq) where 
fields accepting general text are matched against free-text input (for example, 
Firestone Carcare).

iii)  I have specified the sort condition (sort) to be based on 
score (calculated from the filter query (fq) conditions applied in the 2nd 
item in the list) and distance. If there are duplicate records matching by 
score, then distance should be used to order the duplicate records.

I am not getting the desired results. Records are still sorted by the score 
calculated using the distance field, not using the filter query 
conditions that were specified. Basically, the Firestone car care centers that 
I am looking for do not appear at the top of the results, and they are also 
not in distance order where records match.

Queries :-


i)    Why is the sort clause I specified (based on the fq 
conditions' score and distance) not being applied correctly?

ii)   Is my approach to fixing the problem correct?

iii)  Please let me know the corrective action; a sketch of the kind of request 
meant here follows below.
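For reference, a hedged sketch of the kind of request described above; the field names, point, and distance are illustrative (d is in km, and 5 miles is roughly 8.05 km):

q=firestone&defType=edismax&qf=name
&fq={!geofilt}&sfield=store_location&pt=37.7749,-122.4194&d=8.05
&sort=score desc,geodist() asc

One thing worth noting: plain fq clauses are constant-scoring in Solr, so conditions that appear only in fq never influence the score; relevance has to come from q.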

Thanks in advance.

Regards,
Deepak



The information contained in this email message and any attachments is 
confidential and intended only for the addressee(s). If you are not an 
addressee, you may not copy or disclose the information, or act upon it, and 
you should delete it entirely from your email system. Please notify the sender 
that you received this email in error.



Re: Small Tokenization issue

2018-01-03 Thread Shawn Heisey

On 1/3/2018 1:56 PM, Nawab Zada Asad Iqbal wrote:

Thanks Emir, Erick.

What i want to do is remove empty tokens after WordDelimiterGraphFilter ?
Is there any such option in WordDelimiterGraphFilter to not generate empty
tokens?


I use LengthFilterFactory with a minimum of 1 and a maximum of 512 to 
remove empty tokens.
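In schema terms that is a one-liner placed after the word delimiter filter, for example:

<filter class="solr.LengthFilterFactory" min="1" max="512"/>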


Thanks,
Shawn



Re: Is DataImportHandler ready for production-usage?

2018-01-03 Thread Shawn Heisey

On 1/3/2018 2:20 PM, Tech Id wrote:

I stumbled across https://wiki.apache.org/solr/DataImportHandler and found
it matching my needs exactly.
So I just wanted to confirm if it is an actively supported plugin, before I
start using it for production.

Are there any users who have had a good or a bad experience with DIH ?


DIH was already there when I first started using Solr 1.4.0 in late 
2009.  It has not changed very much since then.


I used DIH exclusively for all indexing for the first little while.  
Everything was controlled by a Perl program, which also handled deletes.


Now the Perl code has been replaced by SolrJ code, and the SolrJ code 
handles all the normal indexing.  DIH is still used in the environment, 
but only for full index rebuilds, which are very rare.


For a single-threaded indexing process, DIH is remarkably efficient.  
For the fastest possible indexing, multi-threaded or multi-process 
indexing is required, which DIH cannot do.


DIH is not going anywhere as far as I know.  It doesn't get a lot of 
developer attention, because it works well and doesn't really need much 
attention.


Thanks,
Shawn



Re: Is DataImportHandler ready for production-usage?

2018-01-03 Thread Erick Erickson
It's been around forever and lots of people use it in production.

That said, an independent client using SolrJ is often preferable for
reasons outlined here:
https://lucidworks.com/2012/02/14/indexing-with-solrj/

If DIH fits your needs by all means use it. The article I linked to,
though, provides some
bits you should be aware of.

Best,
Erick

On Wed, Jan 3, 2018 at 1:20 PM, Tech Id  wrote:
> Hi,
>
> I stumbled across https://wiki.apache.org/solr/DataImportHandler and found
> it matching my needs exactly.
> So I just wanted to confirm if it is an actively supported plugin, before I
> start using it for production.
>
> Are there any users who have had a good or a bad experience with DIH ?
>
> Thanks
> TI


Re: Small Tokenization issue

2018-01-03 Thread Erick Erickson
WordDelimiterGraphFilterFactory is a new implementation so it's also
quite possible that the behavior just changed.

I just took a look and indeed it does. WordDelimiterFilterFactory
(done on "p / n whatever") produces
token:  p  n  whatever
position:  1  2  3

whereas WordDelimiterGraphFilterFactory produces:

token:  p  n  whatever
position:  1  3  4


Arguably the Graph version is correct behavior.

What if you use phrases to search for this instead?

Best,
Erick

On Wed, Jan 3, 2018 at 12:56 PM, Nawab Zada Asad Iqbal  wrote:
> Thanks Emir, Erick.
>
> What i want to do is remove empty tokens after WordDelimiterGraphFilter ?
> Is there any such option in WordDelimiterGraphFilter to not generate empty
> tokens?
>
> This index field is intended to use for strange strings e.g. part numbers.
> P/N HSC0424PP
> The benefit of removing the empty tokens is that if someone unintentionally
> puts a space around the '/' (in above example) this field is still able to
> match.
>
> In previous solr version, ShingleFilter used to work fine in case of empty
> positions and was making shingles across the empty space. Although, it is
> possible that i have learned to rely on a bug.
>
>
>
>
>
>
> On Wed, Jan 3, 2018 at 12:23 PM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Nawab,
>> The reason why you do not get shingle is because there is empty token
>> because after tokenizer you have 3 tokens ‘abc’, ‘-’ and ‘def’ so the token
>> that you are interested in are not next to each other and cannot form
>> shingle.
>> What you can do is apply char filter before tokenization to remove ‘-‘
>> something like:
>>
>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*-\s*" replacement=" "/>
>>
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>
>>
>> > On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal  wrote:
>> >
>> > Hi,
>> >
>> > So, I have a string for indexing:
>> >
>> > abc - def (notice the space on either side of hyphen)
>> >
>> > which is being processed with this filter list:
>> >
>> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>> >   <analyzer>
>> >     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>> >         name="nfkc" mode="compose"/>
>> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >     <filter class="solr.WordDelimiterGraphFilterFactory"
>> >         generateWordParts="1" generateNumberParts="1" catenateWords="0"
>> >         catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>> >         splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>> >     <filter class="solr.PatternReplaceFilterFactory"
>> >         pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>> >     <filter class="solr.ShingleFilterFactory"
>> >         outputUnigrams="false" fillerToken=""/>
>> >     <filter class="solr.LimitTokenCountFilterFactory"
>> >         maxTokenCount="1" consumeAllTokens="false"/>
>> >   </analyzer>
>> > </fieldType>
>> >
>> >
>> > I get two shingle tokens at the end:
>> >
>> > "abc" "def"
>> >
>> > I want to get "abc def" . What can I tweak to get this?
>> >
>> >
>> > Thanks
>> > Nawab
>>
>>


Is DataImportHandler ready for production-usage?

2018-01-03 Thread Tech Id
Hi,

I stumbled across https://wiki.apache.org/solr/DataImportHandler and found
it matching my needs exactly.
So I just wanted to confirm if it is an actively supported plugin, before I
start using it for production.

Are there any users who have had a good or a bad experience with DIH ?

Thanks
TI


Small Query.

2018-01-03 Thread Fiz Newyorker
Hello Solr Group,

I have a small question: how do autosuggest and spell check work
together in Solr? I need to implement autosuggest on the word ”iPhine”, but
this should return the results for “iPhone” in the suggestions. What is the
best suggester component for addressing this requirement?
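A minimal sketch of one common option: the SuggestComponent with a fuzzy lookup, which tolerates typos like "iPhine". The component, field, and analyzer names here are illustrative, and a matching /suggest request handler still has to be wired up:

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">fuzzy</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>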



Thanks
Fiz..


Re: Small Tokenization issue

2018-01-03 Thread Nawab Zada Asad Iqbal
Thanks Emir, Erick.

What I want to do is remove empty tokens after WordDelimiterGraphFilter.
Is there an option in WordDelimiterGraphFilter to not generate empty
tokens?

This index field is intended to be used for strange strings, e.g. part numbers:
P/N HSC0424PP
The benefit of removing the empty tokens is that if someone unintentionally
puts a space around the '/' (in the above example) this field is still able to
match.

In previous Solr versions, ShingleFilter used to work fine in the case of empty
positions and made shingles across the empty space. Although it is
possible that I have learned to rely on a bug.






On Wed, Jan 3, 2018 at 12:23 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Nawab,
> The reason why you do not get shingle is because there is empty token
> because after tokenizer you have 3 tokens ‘abc’, ‘-’ and ‘def’ so the token
> that you are interested in are not next to each other and cannot form
> shingle.
> What you can do is apply char filter before tokenization to remove ‘-‘
> something like:
>
> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*-\s*" replacement=" "/>
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal  wrote:
> >
> > Hi,
> >
> > So, I have a string for indexing:
> >
> > abc - def (notice the space on either side of hyphen)
> >
> > which is being processed with this filter list:
> >
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >   <analyzer>
> >     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
> >         name="nfkc" mode="compose"/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.WordDelimiterGraphFilterFactory"
> >         generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >         catenateNumbers="0" catenateAll="0" preserveOriginal="0"
> >         splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
> >     <filter class="solr.PatternReplaceFilterFactory"
> >         pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
> >     <filter class="solr.ShingleFilterFactory"
> >         outputUnigrams="false" fillerToken=""/>
> >     <filter class="solr.LimitTokenCountFilterFactory"
> >         maxTokenCount="1" consumeAllTokens="false"/>
> >   </analyzer>
> > </fieldType>
> >
> >
> > I get two shingle tokens at the end:
> >
> > "abc" "def"
> >
> > I want to get "abc def" . What can I tweak to get this?
> >
> >
> > Thanks
> > Nawab
>
>


Re: Small Tokenization issue

2018-01-03 Thread Emir Arnautović
Hi Nawab,
The reason you do not get the shingle is that there is an empty token: after 
the tokenizer you have 3 tokens, ‘abc’, ‘-’ and ‘def’, so the tokens that you 
are interested in are not next to each other and cannot form a shingle.
What you can do is apply a char filter before tokenization to remove the ‘-’, 
something like:

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*-\s*" replacement=" "/>

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal  wrote:
> 
> Hi,
> 
> So, I have a string for indexing:
> 
> abc - def (notice the space on either side of hyphen)
> 
> which is being processed with this filter list:
> 
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>         name="nfkc" mode="compose"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="0"
>         catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>         splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>         pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>     <filter class="solr.ShingleFilterFactory"
>         outputUnigrams="false" fillerToken=""/>
>     <filter class="solr.LimitTokenCountFilterFactory"
>         maxTokenCount="1" consumeAllTokens="false"/>
>   </analyzer>
> </fieldType>
> 
> 
> I get two shingle tokens at the end:
> 
> "abc" "def"
> 
> I want to get "abc def" . What can I tweak to get this?
> 
> 
> Thanks
> Nawab



Re: Small Tokenization issue

2018-01-03 Thread Erick Erickson
If it's regular, you could try using a PatternReplaceCharFilterFactory
(PRCFF), which gets applied to the input before tokenization (note,
this is NOT PatternReplaceFilterFatory, which gets applied after
tokenization).

I don't really see how you could make this work though.
WhitespaceTokenizer will break "abc def" into "abc" and "def" even if
you use PRCFF. WordDelimiterGraphFilterFactory would break up
"abc-def" into "abc" "def" and possibly "abcdef" depending on
catenateWords' value.

Instead of this, would it answer to use _phrase_ searches when you
wanted to find "abc def"?

Best,
Erick

On Wed, Jan 3, 2018 at 12:04 PM, Nawab Zada Asad Iqbal  wrote:
> Hi,
>
> So, I have a string for indexing:
>
> abc - def (notice the space on either side of hyphen)
>
> which is being processed with this filter list:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
>         name="nfkc" mode="compose"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>         generateWordParts="1" generateNumberParts="1" catenateWords="0"
>         catenateNumbers="0" catenateAll="0" preserveOriginal="0"
>         splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
>     <filter class="solr.PatternReplaceFilterFactory"
>         pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
>     <filter class="solr.ShingleFilterFactory"
>         outputUnigrams="false" fillerToken=""/>
>     <filter class="solr.LimitTokenCountFilterFactory"
>         maxTokenCount="1" consumeAllTokens="false"/>
>   </analyzer>
> </fieldType>
>
>
> I get two shingle tokens at the end:
>
> "abc" "def"
>
> I want to get "abc def" . What can I tweak to get this?
>
>
> Thanks
> Nawab


Small Tokenization issue

2018-01-03 Thread Nawab Zada Asad Iqbal
Hi,

So, I have a string for indexing:

abc - def (notice the space on either side of hyphen)

which is being processed with this filter list:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
        name="nfkc" mode="compose"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
        generateWordParts="1" generateNumberParts="1" catenateWords="0"
        catenateNumbers="0" catenateAll="0" preserveOriginal="0"
        splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
    <filter class="solr.ShingleFilterFactory"
        outputUnigrams="false" fillerToken=""/>
    <filter class="solr.LimitTokenCountFilterFactory"
        maxTokenCount="1" consumeAllTokens="false"/>
  </analyzer>
</fieldType>


I get two shingle tokens at the end:

"abc" "def"

I want to get "abc def" . What can I tweak to get this?


Thanks
Nawab


Re: Sorting on Child document.

2018-01-03 Thread crezy
Hello All,

Any updates on my post?

It's quite urgent.


Thanks



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Always use leader for searching queries

2018-01-03 Thread Walter Underwood
If you have a field for the indexed datetime, you can use a filter query to get 
rid of recent updates that might be in transit. I’d use double the autocommit 
time, to leave time for the followers to index.

If the autocommit interval is one minute:

fq=indexed_datetime:[* TO NOW-2MINUTES]

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 3, 2018, at 8:58 AM, Erick Erickson  wrote:
> 
> [I probably not need to do this because I have only one shard but I did
> anyway count was different.]
> 
> This isn't what I meant. I meant to query each replica directly
> _within_ the same shard. Your problem statement is that the leader and
> replicas (I use "followers") have different document counts. How are
> you verifying this? Through the admin UI? Using distrib=false is
> useful when you want to query each core directly (and you have to use
> the core name) in some automated fashion.
> 
> [I have actually turned off auto soft commit for a time being but
> nothing changed]
> 
> OK, I'm assuming then that you issue a manual commit sometime, right?
> Here's what I'd do:
> 1> turn off indexing
> 2> issue a commit (soft or hard-with-opensearcher-true)
> 3> now look at your doc counts on each replica.
> 
> If the counts are different then something's not right, Solr tries
> very hard to not lose data, it's concerning if the leader and replicas
> have different counts.
> 
> Best,
> Erick
> 
> On Wed, Jan 3, 2018 at 1:51 AM, Novin Novin  wrote:
>> Hi Erick,
>> 
>> Thanks for your reply.
>> 
>> [ First of all, replicas can be off in terms of counts for the soft
>> commit interval. The commits don't all happen on the replicas at the
>> same wall-clock time. Solr promises eventual consistency, in this case
>> NOW-autocommit time.]
>> 
>> I realized that, to stop it. I have actually turned off auto soft commit
>> for a time being but nothing changed. Non leader replica still had extra
>> documents.
>> 
>> [ So my first question is whether the replicas in the shard are
>> inconsistent as of, say, NOW-your_soft_commit_time. I'd add a fudge
>> factor of 10 seconds earlier just to be sure I was past autowarming.
>> This does require that there be a time stamp. Absent a timestamp, you
>> could suspend indexing for a few minutes and run the test like below.]
>> 
>> While data was indexing, I was checking the counts in both replicas. What I
>> found is that the leader replica always had 3 docs fewer than the other
>> replica. I don't think they were off by NOW-soft_commit_time. CloudSolrClient
>> adds something like "_stateVer_=main:114" to the query, which I assume is
>> there so that results are consistent between both replicas' searches.
>> 
>> [Adding distrib=false to your command and directing it at a specific
>> _core_ (something like collection1_shard1_replica1) will only return
>> data from that core.]
>> I probably not need to do this because I have only one shard but I did
>> anyway count was different.
>> 
>> [When you say you index every minute, I'm guessing you only index for
>> part of that minute, is that true? In that case you might get more
>> consistency if, instead of relying totally on your autoconfig
>> settings, specify commitWithin on your update command. That should
>> force the commits to happen more closely in-sync, although still not
>> perfect.]
>> 
>> We receive data every minute, so whenever we have new data we send it to
>> Solr cloud using a queue. You said not to rely on the auto config. Do you mean I
>> should turn off autoCommit and use commitWithin via SolrJ, or leave
>> autoCommit as it is and also use commitWithin from the SolrJ client?
>> 
>> I apologize if I am not clear. Thanks for your help again.
>> 
>> Thanks in advance,
>> Navin
>> 
>> 
>> 
>> 
>> 
>> On Tue, 2 Jan 2018 at 18:05 Erick Erickson  wrote:
>> 
>>> First of all, replicas can be off in terms of counts for the soft
>>> commit interval. The commits don't all happen on the replicas at the
>>> same wall-clock time. Solr promises eventual consistency, in this case
>>> NOW-autocommit time.
>>> 
>>> So my first question is whether the replicas in the shard are
>>> inconsistent as of, say, NOW-your_soft_commit_time. I'd add a fudge
>>> factor of 10 seconds earlier just to be sure I was past autowarming.
>>> This does require that there be a time stamp. Absent a timestamp, you
>>> could suspend indexing for a few minutes and run the test like below.
>>> 
>>> Adding distrib=false to your command and directing it at a specific
>>> _core_ (something like collection1_shard1_replica1) will only return
>>> data from that core.
>>> 
>>> When you say you index every minute, I'm guessing you only index for
>>> part of that minute, is that true? In that case you might get more
>>> consistency if, instead of relying totally on your autoconfig
>>> settings, specify commitWithin on your update command. That should
>>> force the commits to happen more closely in-sync, 

Re: Always use leader for searching queries

2018-01-03 Thread Erick Erickson
[I probably not need to do this because I have only one shard but I did
anyway count was different.]

This isn't what I meant. I meant to query each replica directly
_within_ the same shard. Your problem statement is that the leader and
replicas (I use "followers") have different document counts. How are
you verifying this? Through the admin UI? Using distrib=false is
useful when you want to query each core directly (and you have to use
the core name) in some automated fashion.

[I have actually turned off auto soft commit for a time being but
nothing changed]

OK, I'm assuming then that you issue a manual commit sometime, right?
Here's what I'd do:
1> turn off indexing
2> issue a commit (soft or hard-with-opensearcher-true)
3> now look at your doc counts on each replica.

If the counts are different then something's not right, Solr tries
very hard to not lose data, it's concerning if the leader and replicas
have different counts.
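For the commitWithin approach discussed in this thread, a hedged SolrJ sketch; the ZooKeeper addresses, collection, and field names are made up:

import java.util.Date;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinExample {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
      client.setDefaultCollection("main");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "1");
      doc.addField("indexed_datetime", new Date()); // index-time timestamp

      // Ask Solr to commit within 60s of receiving the update, instead of
      // relying only on the autoCommit/autoSoftCommit wall-clock timers.
      client.add(doc, 60_000);
    }
  }
}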

Best,
Erick

On Wed, Jan 3, 2018 at 1:51 AM, Novin Novin  wrote:
> Hi Erick,
>
> Thanks for your reply.
>
> [ First of all, replicas can be off in terms of counts for the soft
> commit interval. The commits don't all happen on the replicas at the
> same wall-clock time. Solr promises eventual consistency, in this case
> NOW-autocommit time.]
>
> I realized that, to stop it. I have actually turned off auto soft commit
> for a time being but nothing changed. Non leader replica still had extra
> documents.
>
> [ So my first question is whether the replicas in the shard are
> inconsistent as of, say, NOW-your_soft_commit_time. I'd add a fudge
> factor of 10 seconds earlier just to be sure I was past autowarming.
> This does require that there be a time stamp. Absent a timestamp, you
> could suspend indexing for a few minutes and run the test like below.]
>
> While data was indexing, I was checking the counts in both replicas. What I
> found is that the leader replica always had 3 docs fewer than the other
> replica. I don't think they were off by NOW-soft_commit_time. CloudSolrClient
> adds something like "_stateVer_=main:114" to the query, which I assume is
> there so that results are consistent between both replicas' searches.
>
> [Adding distrib=false to your command and directing it at a specific
> _core_ (something like collection1_shard1_replica1) will only return
> data from that core.]
> I probably not need to do this because I have only one shard but I did
> anyway count was different.
>
> [When you say you index every minute, I'm guessing you only index for
> part of that minute, is that true? In that case you might get more
> consistency if, instead of relying totally on your autoconfig
> settings, specify commitWithin on your update command. That should
> force the commits to happen more closely in-sync, although still not
> perfect.]
>
> We receive data every minute, so whenever we have new data we send it to
> Solr cloud using a queue. You said not to rely on the auto config. Do you mean I
> should turn off autoCommit and use commitWithin via SolrJ, or leave
> autoCommit as it is and also use commitWithin from the SolrJ client?
>
> I apologize if I am not clear. Thanks for your help again.
>
> Thanks in advance,
> Navin
>
>
>
>
>
> On Tue, 2 Jan 2018 at 18:05 Erick Erickson  wrote:
>
>> First of all, replicas can be off in terms of counts for the soft
>> commit interval. The commits don't all happen on the replicas at the
>> same wall-clock time. Solr promises eventual consistency, in this case
>> NOW-autocommit time.
>>
>> So my first question is whether the replicas in the shard are
>> inconsistent as of, say, NOW-your_soft_commit_time. I'd add a fudge
>> factor of 10 seconds earlier just to be sure I was past autowarming.
>> This does require that there be a time stamp. Absent a timestamp, you
>> could suspend indexing for a few minutes and run the test like below.
>>
>> Adding distrib=false to your command and directing it at a specific
>> _core_ (something like collection1_shard1_replica1) will only return
>> data from that core.
>>
>> When you say you index every minute, I'm guessing you only index for
>> part of that minute, is that true? In that case you might get more
>> consistency if, instead of relying totally on your autoconfig
>> settings, specify commitWithin on your update command. That should
>> force the commits to happen more closely in-sync, although still not
>> perfect.
>>
>> Another option if you're totally and completely sure that your commits
>> happen _only_ from your indexing program is to fire the commit at the
>> end of the run from your SolrJ program.
>>
>> Let us know,
>> Erick
>>
>> On Tue, Jan 2, 2018 at 9:33 AM, Novin Novin  wrote:
>> > Hi Erick,
>> >
>> > You are right, it is XY Problem.
>> >
>> > Allow me to explain best I can, I have two replica of one collection
>> called
>> > "Main". When I was using search feature in my application I get two
>> > different numFound count. So I start digging after 

Re: DIH XPathEntityProcessor XPath subset?

2018-01-03 Thread Erik Hatcher
Stefan -

If you pre-transform the XML, I’d personally recommend either transforming it 
into straight up Solr XML (docs/fields/values) or some other format or posting 
directly to Solr.   Avoid this DIH thing when things get complicated.
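For example, the struct quoted below would flatten into Solr's plain update XML, roughly like this (field names chosen to mirror the sample values):

<add>
  <doc>
    <field name="id">11809</field>
    <field name="post_title">Some titel</field>
  </doc>
</add>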

Erik

> On Jan 3, 2018, at 11:40 AM, Stefan Moises  wrote:
> 
> Hi there,
> 
> I'm trying to index a wordpress site using DIH XPathEntityProcessor... I've 
> read it only supports a subset of XPath, but I couldn't find any docs what 
> exactly is supported.
> 
> After some painful trial and error, I've found that xpath expressions like 
> the following don't work:
> 
> <field column="..."
>     xpath="/methodResponse/params/param/value/array/data/value/struct/member[name='post_title']/value/string"
> />
> 
> I want to find elements like this ("the 'value' element after a 'member' 
> element with a name element 'post_title'"):
> 
> <methodResponse>
>   <params>
>     <param>
>       <value>
>         <array>
>           <data>
>             <value>
>               <struct>
>                 <member><name>post_id</name><value><string>11809</string></value></member>
>                 <member><name>post_title</name><value><string>Some titel</string></value></member>
>               </struct>
>             </value>
>           </data>
>         </array>
>       </value>
>     </param>
>   </params>
> </methodResponse>
> Unfortunately that is the default output structure of Wordpress' XMLrpc calls.
> 
> My XPath expression works, e.g. when testing it with 
> https://www.freeformatter.com/xpath-tester.html, but not when I try to index it 
> with Solr. Any ideas? Or do I have to pre-transform the XML myself to 
> match XPathEntityProcessor's limited abilities?
> 
> Thanks in advance,
> 
> Stefan
> 
> -- 
> --
> 
> Stefan Moises
> Manager Research & Development
> shoptimax GmbH
> Ulmenstraße 52 H
> 90443 Nürnberg
> Tel.: 0911/25566-0
> Fax: 0911/25566-29
> moi...@shoptimax.de
> http://www.shoptimax.de
> 
> Geschäftsführung: Friedrich Schreieck
> Ust.-IdNr.: DE 814340642
> Amtsgericht Nürnberg HRB 21703
>  
> 



DIH XPathEntityProcessor XPath subset?

2018-01-03 Thread Stefan Moises

Hi there,

I'm trying to index a WordPress site using DIH XPathEntityProcessor... 
I've read that it only supports a subset of XPath, but I couldn't find any 
docs on what exactly is supported.


After some painful trial and error, I've found that xpath expressions 
like the following don't work:


    xpath="/methodResponse/params/param/value/array/data/value/struct/member[name='post_title']/value/string" 
/>


I want to find elements like this ("the 'value' element after a 'member' 
element with a name element 'post_title'"):

<methodResponse>
  <params>
    <param>
      <value>
        <array>
          <data>
            <value>
              <struct>
                <member><name>post_id</name><value><string>11809</string></value></member>
                <member><name>post_title</name><value><string>Some titel</string></value></member>
              </struct>
            </value>
          </data>
        </array>
      </value>
    </param>
  </params>
</methodResponse>

Unfortunately that is the default output structure of Wordpress' XMLrpc 
calls.


My XPath expression works, e.g. when testing it with 
https://www.freeformatter.com/xpath-tester.html, but not when I try to 
index it with Solr. Any ideas? Or do I have to pre-transform the XML 
myself to match XPathEntityProcessor's limited abilities?


Thanks in advance,

Stefan

--
--

Stefan Moises
Manager Research & Development
shoptimax GmbH
Ulmenstraße 52 H
90443 Nürnberg
Tel.: 0911/25566-0
Fax: 0911/25566-29
moi...@shoptimax.de
http://www.shoptimax.de

Geschäftsführung: Friedrich Schreieck
Ust.-IdNr.: DE 814340642
Amtsgericht Nürnberg HRB 21703
  





Re: Removing some fields from uprefix

2018-01-03 Thread Zheng Lin Edwin Yeo
Hi Alex,

Thanks for your advice. It works.

Regards,
Edwin


On 3 January 2018 at 23:06, Alexandre Rafalovitch 
wrote:

> uprefix is only for the fields that do NOT exist in schema. So, you
> can define your x_parsed_by in schema, but map it to the type that has
> index=false, store=false, docvalues=false. Which means the field is
> acknowledged but effectively dropped.
>
> Regards,
>Alex.
>
> On 3 January 2018 at 05:53, Zheng Lin Edwin Yeo 
> wrote:
> > Hi,
> >
> > I'm using Solr 7.2.0, and I have this /extract handler in my
> solrconfig.xml
> >
> > <requestHandler name="/extract" startup="lazy"
> >     class="solr.extraction.ExtractingRequestHandler" >
> >   <lst name="defaults">
> >     <str name="xpath">/xhtml:html/xhtml:body/descendant:node()</str>
> >     <str name="...">content</str>
> >     <str name="...">attr_meta_</str>
> >     <str name="uprefix">attr_</str>
> >     <str name="...">true</str>
> >     <str name="update.chain">dedupe</str>
> >   </lst>
> > </requestHandler>
> >
> > Understand that this uprefix of attr_ will cause all
> > generated fields that aren't defined in the schema to be prefixed with
> > attr_.
> >
> > Is there any way that we can remove some of the fields, but keep the
> rest?
> > For example, I would like to remove attr_x_parsed_by.
> >
> > Regards,
> > Edwin
>


Re: Query fields with data of certain length

2018-01-03 Thread Emir Arnautović
Hi Edwin,
I do not know, but my guess would be that each character is counted as 1 in 
the regex, regardless of how many bytes it takes in the encoding used.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Jan 2018, at 16:43, Zheng Lin Edwin Yeo  wrote:
> 
> Thanks for the reply.
> 
> I am doing the search on existing data that has already been indexed, and
> it is likely to be a one time thing.
> 
> This  subject:/.{255,}.*/  works for English characters. However, there are
> Chinese characters in some of the records. The length seems to be more than
> 255, but they do not show up in the results.
> 
> Do you know how the length for Chinese characters and other languages is
> determined?
> 
> Regards,
> Edwin
> 
> 
> On 3 January 2018 at 23:01, Alexandre Rafalovitch 
> wrote:
> 
>> Do that during indexing as Emir suggested. Specifically, use an
>> UpdateRequestProcessor chain, probably with the Clone and FieldLength
>> processors: http://www.solr-start.com/javadoc/solr-lucene/org/
>> apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
>> 
>> Regards,
>>   Alex.
>> 
>> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo 
>> wrote:
>>> Hi,
>>> 
>>> Would like to check, if it is possible to query a field which has data of
>>> more than a certain length?
>>> 
>>> Like for example, I want to query the field subject that has more than
>> 255
>>> bytes. Is it possible?
>>> 
>>> I am currently using Solr 6.5.1.
>>> 
>>> Regards,
>>> Edwin
>> 



Re: Query fields with data of certain length

2018-01-03 Thread Zheng Lin Edwin Yeo
Thanks for the reply.

I am doing the search on existing data that has already been indexed, and
it is likely to be a one time thing.

This  subject:/.{255,}.*/  works for English characters. However, there are
Chinese characters in some of the records. The length seems to be more than
255, but they do not show up in the results.

Do you know how the length for Chinese characters and other languages is
determined?

Regards,
Edwin


On 3 January 2018 at 23:01, Alexandre Rafalovitch 
wrote:

> Do that during indexing as Emir suggested. Specifically, use an
> UpdateRequestProcessor chain, probably with the Clone and FieldLength
> processors: http://www.solr-start.com/javadoc/solr-lucene/org/
> apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
>
> Regards,
>Alex.
>
> On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo 
> wrote:
> > Hi,
> >
> > Would like to check, if it is possible to query a field which has data of
> > more than a certain length?
> >
> > Like for example, I want to query the field subject that has more than
> 255
> > bytes. Is it possible?
> >
> > I am currently using Solr 6.5.1.
> >
> > Regards,
> > Edwin
>


Re: SolrJ with Async Http Client

2018-01-03 Thread Walter Underwood
HTTPClient is non-blocking. Send the request, then the client gets control 
back. It only blocks when you do the read. So one thread can send multiple 
requests then check for each response.
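SolrJ in this era only exposes blocking calls, so the usual workaround is the executor/future wrapping Gus describes below. A minimal sketch, with the URL and collection made up:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AsyncQuery {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    HttpSolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/techproducts").build();

    // The query still blocks a pool thread; only the calling thread is freed.
    CompletableFuture<QueryResponse> future = CompletableFuture.supplyAsync(() -> {
      try {
        return client.query(new SolrQuery("*:*"));
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }, pool);

    future.thenAccept(rsp ->
        System.out.println("hits: " + rsp.getResults().getNumFound()));

    future.join(); // demo only: keep the JVM alive until the response arrives
    client.close();
    pool.shutdown();
  }
}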

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jan 3, 2018, at 5:11 AM, RAUNAK AGRAWAL  wrote:
> 
> Yes, I am talking about event driven way of calling solr, so that I can
> write pure async web service. Does SolrJ provides support for non-blocking
> calls?
> 
> On Wed, Jan 3, 2018 at 6:22 PM, Hendrik Haddorp 
> wrote:
> 
>> There is asynchronous and non-blocking. If I use 100 threads to perform
>> calls to Solr using the standard Java HTTP client or SolrJ I block 100
>> threads even if I don't block my program logic threads by using async
>> calls. However if I perform those HTTP calls using a non-blocking HTTP
>> client, like netty, I basically only need a single eventing thread in
>> addition to my normal threads. The advantage is less memory usage and an
>> often better scaling. I would however expect that the main advantage would
>> be on the server side.
>> 
>> 
>> On 02.01.2018 22:02, Gus Heck wrote:
>> 
>>> It's not very clear (to me) what your use case is, but generally speaking,
>>> asynchronous requests can be achieved by using threads/executors/futures
>>> (java) or ajax (javascript). The link seems to be a scala project, I'm
>>> sure
>>> scala has analogous facilities.
>>> 
>>> On Tue, Jan 2, 2018 at 10:31 AM, RAUNAK AGRAWAL >> wrote:
>>> 
>>> Hi Guys,
 
 I am trying to write fully async service where solr calls are also async.
 Just wondering did anyone tried calling solr in non-blocking mode or is
 there is a way to do it? I have come across one such project
  but wondering is there anything
 provided
 by solrj?
 
 Thanks
 
 
>>> 
>>> 
>> 



Re: SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread Emir Arnautović
Hi Sravan,
DBQ does not play well with indexing: it causes indexing to be completely 
blocked on the replicas while it is running. It is highly likely that it is the 
root cause of your issues. If you can change the indexing logic to avoid it, you 
can quickly test that. What you can do as a workaround is query for the IDs that 
need to be deleted and execute a bulk delete by ID; that will not cause the 
issues DBQ does.
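A hedged SolrJ sketch of that workaround; the timestamp field, query, and page size are illustrative:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;

public class DeleteByIdInsteadOfDbq {
  // Instead of deleteByQuery, fetch the matching ids and bulk-delete by id.
  public static void deleteOldDocs(SolrClient client) throws Exception {
    SolrQuery q = new SolrQuery("updated_at:[* TO NOW-1DAY]"); // hypothetical field
    q.setFields("id");
    q.setRows(1000); // for larger result sets, page with cursorMark
    List<String> ids = new ArrayList<>();
    for (SolrDocument doc : client.query(q).getResults()) {
      ids.add((String) doc.getFieldValue("id"));
    }
    if (!ids.isEmpty()) {
      client.deleteById(ids); // unlike DBQ, this does not block replica indexing
    }
  }
}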

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Jan 2018, at 16:04, Sravan Kumar  wrote:
> 
> Emir,
>Yes there is a delete_by_query on every bulk insert.
>This delete_by_query deletes all the documents which were updated less
> than a day before the current time.
>Is bulk delete_by_query the reason?
> 
> On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Do you have deletes by query while indexing or it is append only index?
>> 
>> Regards,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 3 Jan 2018, at 12:16, sravan  wrote:
>>> 
>>> SolrCloud Nodes going to recovery state during indexing
>>> 
>>> 
>>> We have solr cloud setup with the settings shared below. We have a
>> collection with 3 shards and a replica for each of them.
>>> 
>>> Normal State(As soon as the whole cluster is restarted):
>>>- Status of all the shards is UP.
>>>- a bulk update request of 50 documents each takes < 100ms.
>>>- 6-10 simultaneous bulk updates.
>>> 
>>> Nodes go into recovery state after 15-30 minutes of updates.
>>>- Some shards start giving the following ERRORs:
>>>- o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.
>> DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async
>> exception during distributed update: Read timed out
>>>- o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException:
>> Read timed out
>>>- the following error is seen on the shard which goes to recovery
>> state.
>>>- too many updates received since start - startingUpdates no
>> longer overlaps with our currentUpdates.
>>>- Sometimes, the same shard even goes to DOWN state and needs a node
>> restart to come back.
>>>- a bulk update request of 50 documents takes more than 5 seconds.
>> Sometimes even >120 secs. This is seen for all the requests if at least one
>> node is in recovery state in the whole cluster.
>>> 
>>> We have a standalone setup with the same collection schema which is able
>> to take update & query load without any errors.
>>> 
>>> 
>>> We have the following solrcloud setup.
>>>- setup in AWS.
>>> 
>>>- Zookeeper Setup:
>>>- number of nodes: 3
>>>- aws instance type: t2.small
>>>- instance memory: 2gb
>>> 
>>>- Solr Setup:
>>>- Solr version: 6.6.0
>>>- number of nodes: 3
>>>- aws instance type: m5.xlarge
>>>- instance memory: 16gb
>>>- number of cores: 4
>>>- JAVA HEAP: 8gb
>>>- JAVA VERSION: oracle java version "1.8.0_151"
>>>- GC settings: default CMS.
>>> 
>>>collection settings:
>>>- number of shards: 3
>>>- replication factor: 2
>>>- total 6 replicas.
>>>- total number of documents in the collection: 12 million
>>>- total number of documents in each shard: 4 million
>>>- Each document has around 25 fields with 12 of them
>> containing textual analysers & filters.
>>>- Commit Strategy:
>>>- No explicit commits from application code.
>>>- Hard commit of 15 secs with OpenSearcher as false.
>>>- Soft commit of 10 mins.
>>>- Cache Strategy:
>>>- filter queries
>>>- number: 512
>>>- autowarmCount: 100
>>>- all other caches
>>>- number: 512
>>>- autowarmCount: 0
>>>- maxWarmingSearchers: 2
>>> 
>>> 
>>> - We tried the following
>>>- commit strategy
>>>- hard commit - 150 secs
>>>- soft commit - 5 mins
>>>- with GCG1 garbage collector based on https://wiki.apache.org/solr/
>> ShawnHeisey#Java_8_recommendation_for_Solr:
>>>- the nodes go to recover state in less than a minute.
>>> 
>>> The issue is seen even when the leaders are balanced across the three
>> nodes.
>>> 
>>> Can you help us find the solution to this problem?
>> 
>> 
> 
> 
> -- 
> Regards,
> Sravan



Re: Heavy operations in PostFilter are heavy

2018-01-03 Thread Alexandre Rafalovitch
Are you doing cache=false and cost > 100?

See the recent article on the topic deep-dive, if you haven't:
https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/
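(With a custom post filter, that means attaching it along these lines, where myPostFilter is a hypothetical parser name registered in solrconfig.xml:)

fq={!myPostFilter cache=false cost=200}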

Regards,
   Alex.

On 3 January 2018 at 05:31, Solrmails  wrote:
> Hello,
>
> I tried to write a Solr PostFilter to do filtering within the 
> 'collect'-Method(DelegatingCollector). I have to do some heavy operations 
> within the 'collect'-Method. This isn't a problem for a few results. But 
> unfortunately it takes forever with 50 or more results. This is because I have 
> to do the checks for every single id again and can't process a list of ids 
> within 'collect'.
>
> Is there a better place to do PostFiltering? But I don't want to reimplement 
> the Solr Paging/Cursor feature to get my things to work.
>
> Thank You


Re: Removing some fields from uprefix

2018-01-03 Thread Alexandre Rafalovitch
uprefix is only for the fields that do NOT exist in the schema. So, you
can define your x_parsed_by in the schema, but map it to a type that has
indexed=false, stored=false, docValues=false. Which means the field is
acknowledged but effectively dropped.
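A sketch of that mapping in schema terms; the "ignored" type name is a convention, not a requirement:

<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false"
    docValues="false" multiValued="true"/>
<field name="x_parsed_by" type="ignored"/>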

Regards,
   Alex.

On 3 January 2018 at 05:53, Zheng Lin Edwin Yeo  wrote:
> Hi,
>
> I'm using Solr 7.2.0, and I have this /extract handler in my solrconfig.xml
>
> <requestHandler name="/extract" startup="lazy"
>     class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <str name="xpath">/xhtml:html/xhtml:body/descendant:node()</str>
>     <str name="...">content</str>
>     <str name="...">attr_meta_</str>
>     <str name="uprefix">attr_</str>
>     <str name="...">true</str>
>     <str name="update.chain">dedupe</str>
>   </lst>
> </requestHandler>
>
> Understand that this uprefix of attr_ will cause all
> generated fields that aren't defined in the schema to be prefixed with
> attr_.
>
> Is there any way that we can remove some of the fields, but keep the rest?
> For example, I would like to remove attr_x_parsed_by.
>
> Regards,
> Edwin


Re: SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread Sravan Kumar
Emir,
Yes there is a delete_by_query on every bulk insert.
This delete_by_query deletes all the documents which were updated less
than a day before the current time.
Is bulk delete_by_query the reason?

On Wed, Jan 3, 2018 at 7:58 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Do you have deletes by query while indexing or it is append only index?
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Jan 2018, at 12:16, sravan  wrote:
> >
> > SolrCloud Nodes going to recovery state during indexing
> >
> >
> > We have solr cloud setup with the settings shared below. We have a
> collection with 3 shards and a replica for each of them.
> >
> > Normal State(As soon as the whole cluster is restarted):
> > - Status of all the shards is UP.
> > - a bulk update request of 50 documents each takes < 100ms.
> > - 6-10 simultaneous bulk updates.
> >
> > Nodes go into recovery state after 15-30 minutes of updates.
> > - Some shards start giving the following ERRORs:
> > - o.a.s.h.RequestHandlerBase org.apache.solr.update.processor.
> DistributedUpdateProcessor$DistributedUpdatesAsyncException: Async
> exception during distributed update: Read timed out
> > - o.a.s.u.StreamingSolrClients error 
> > java.net.SocketTimeoutException:
> Read timed out
> > - the following error is seen on the shard which goes to recovery
> state.
> > - too many updates received since start - startingUpdates no
> longer overlaps with our currentUpdates.
> > - Sometimes, the same shard even goes to DOWN state and needs a node
> restart to come back.
> > - a bulk update request of 50 documents takes more than 5 seconds.
> Sometimes even >120 secs. This is seen for all the requests if at least one
> node is in recovery state in the whole cluster.
> >
> > We have a standalone setup with the same collection schema which is able
> to take update & query load without any errors.
> >
> >
> > We have the following solrcloud setup.
> > - setup in AWS.
> >
> > - Zookeeper Setup:
> > - number of nodes: 3
> > - aws instance type: t2.small
> > - instance memory: 2gb
> >
> > - Solr Setup:
> > - Solr version: 6.6.0
> > - number of nodes: 3
> > - aws instance type: m5.xlarge
> > - instance memory: 16gb
> > - number of cores: 4
> > - JAVA HEAP: 8gb
> > - JAVA VERSION: oracle java version "1.8.0_151"
> > - GC settings: default CMS.
> >
> > collection settings:
> > - number of shards: 3
> > - replication factor: 2
> > - total 6 replicas.
> > - total number of documents in the collection: 12 million
> > - total number of documents in each shard: 4 million
> > - Each document has around 25 fields with 12 of them
> containing textual analysers & filters.
> > - Commit Strategy:
> > - No explicit commits from application code.
> > - Hard commit of 15 secs with OpenSearcher as false.
> > - Soft commit of 10 mins.
> > - Cache Strategy:
> > - filter queries
> > - number: 512
> > - autowarmCount: 100
> > - all other caches
> > - number: 512
> > - autowarmCount: 0
> > - maxWarmingSearchers: 2
> >
> >
> > - We tried the following
> > - commit strategy
> > - hard commit - 150 secs
> > - soft commit - 5 mins
> > - with GCG1 garbage collector based on https://wiki.apache.org/solr/
> ShawnHeisey#Java_8_recommendation_for_Solr:
> > - the nodes go to recover state in less than a minute.
> >
> > The issue is seen even when the leaders are balanced across the three
> nodes.
> >
> > Can you help us find the solution to this problem?
>
>


-- 
Regards,
Sravan


Re: Query fields with data of certain length

2018-01-03 Thread Alexandre Rafalovitch
Do that during indexing as Emir suggested. Specifically, use an
UpdateRequestProcessor chain, probably with the Clone and FieldLength
processors: 
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FieldLengthUpdateProcessorFactory.html
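A hedged sketch of such a chain; the chain name and destination field are illustrative, and the source field is assumed to be the subject discussed in this thread:

<updateRequestProcessorChain name="subject-length">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">subject</str>
    <str name="dest">subject_length</str>
  </processor>
  <processor class="solr.FieldLengthUpdateProcessorFactory">
    <str name="fieldName">subject_length</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

After reindexing through this chain, the length query becomes a simple range such as subject_length:[255 TO *].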

Regards,
   Alex.

On 31 December 2017 at 22:00, Zheng Lin Edwin Yeo  wrote:
> Hi,
>
> Would like to check, if it is possible to query a field which has data of
> more than a certain length?
>
> Like for example, I want to query the field subject that has more than 255
> bytes. Is it possible?
>
> I am currently using Solr 6.5.1.
>
> Regards,
> Edwin


Re: SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread Emir Arnautović
Do you have deletes by query while indexing or it is append only index?

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Jan 2018, at 12:16, sravan  wrote:
> 
> SolrCloud Nodes going to recovery state during indexing
> 
> 
> We have solr cloud setup with the settings shared below. We have a collection 
> with 3 shards and a replica for each of them.
> 
> Normal State(As soon as the whole cluster is restarted):
> - Status of all the shards is UP.
> - a bulk update request of 50 documents each takes < 100ms.
> - 6-10 simultaneous bulk updates.
> 
> Nodes go into recovery state after 15-30 minutes of updates.
> - Some shards start giving the following ERRORs:
> - o.a.s.h.RequestHandlerBase 
> org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>  Async exception during distributed update: Read timed out
> - o.a.s.u.StreamingSolrClients error java.net.SocketTimeoutException: 
> Read timed out
> - the following error is seen on the shard which goes to recovery state.
> - too many updates received since start - startingUpdates no longer 
> overlaps with our currentUpdates.
> - Sometimes, the same shard even goes to DOWN state and needs a node 
> restart to come back.
> - a bulk update request of 50 documents takes more than 5 seconds. 
> Sometimes even >120 secs. This is seen for all the requests if at least one 
> node is in recovery state in the whole cluster.
> 
> We have a standalone setup with the same collection schema which is able to 
> take update & query load without any errors.
> 
> 
> We have the following solrcloud setup.
> - setup in AWS.
> 
> - Zookeeper Setup:
> - number of nodes: 3
> - aws instance type: t2.small
> - instance memory: 2gb
> 
> - Solr Setup:
> - Solr version: 6.6.0
> - number of nodes: 3
> - aws instance type: m5.xlarge
> - instance memory: 16gb
> - number of cores: 4
> - JAVA HEAP: 8gb
> - JAVA VERSION: oracle java version "1.8.0_151"
> - GC settings: default CMS.
> 
> collection settings:
> - number of shards: 3
> - replication factor: 2
> - total 6 replicas.
> - total number of documents in the collection: 12 million
> - total number of documents in each shard: 4 million
> - Each document has around 25 fields with 12 of them containing 
> textual analysers & filters.
> - Commit Strategy:
> - No explicit commits from application code.
> - Hard commit of 15 secs with OpenSearcher as false.
> - Soft commit of 10 mins.
> - Cache Strategy:
> - filter queries
> - number: 512
> - autowarmCount: 100
> - all other caches
> - number: 512
> - autowarmCount: 0
> - maxWarmingSearchers: 2
> 
> 
> - We tried the following
> - commit strategy
> - hard commit - 150 secs
> - soft commit - 5 mins
> - with GCG1 garbage collector based on 
> https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
> - the nodes go to recover state in less than a minute.
> 
> The issue is seen even when the leaders are balanced across the three nodes.
> 
> Can you help us find the solution to this problem?



Re: Query fields with data of certain length

2018-01-03 Thread Emir Arnautović
Hi Edwin,
If it is a one-time thing you can use a regex to filter out results that are not 
long enough, something like: subject:/.{255,}.*/.
Of course, this means subject must not be tokenized.

It would probably be best to index the subject length as a separate field and 
include it in the query as subject_length:[255 TO *].

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 1 Jan 2018, at 04:00, Zheng Lin Edwin Yeo  wrote:
> 
> Hi,
> 
> Would like to check, if it is possible to query a field which has data of
> more than a certain length?
> 
> Like for example, I want to query the field subject that has more than 255
> bytes. Is it possible?
> 
> I am currently using Solr 6.5.1.
> 
> Regards,
> Edwin



Re: Limit edismax search to a certain field value and find out matched fields on the results

2018-01-03 Thread Emir Arnautović
Hi Sami,
I would just add that it is probably better to use fq to limit results to some 
category, e.g. q=iphone&fq=category:phones.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 31 Dec 2017, at 22:13, Sami al Subhi  wrote:
> 
> Thank you for your reply, Erick. 
> That solved the problem. 
> 
> q=iphone +category:phones
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: SolrJ with Async Http Client

2018-01-03 Thread Joel Bernstein
Streaming expressions has an event driven architecture built in. There are
two blogs that describe how it works.

This describes the message queues:

http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html

This describes an async model of execution:

http://joelsolr.blogspot.com/2017/01/deploying-solrs-new-parallel-executor.html

After you've read through the two blogs let me know if you have questions
about how to apply this to your use case.


Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jan 3, 2018 at 8:11 AM, RAUNAK AGRAWAL 
wrote:

> Yes, I am talking about event driven way of calling solr, so that I can
> write pure async web service. Does SolrJ provides support for non-blocking
> calls?
>
> On Wed, Jan 3, 2018 at 6:22 PM, Hendrik Haddorp 
> wrote:
>
> > There is asynchronous and non-blocking. If I use 100 threads to perform
> > calls to Solr using the standard Java HTTP client or SolrJ I block 100
> > threads even if I don't block my program logic threads by using async
> > calls. However if I perform those HTTP calls using a non-blocking HTTP
> > client, like netty, I basically only need a single eventing thread in
> > addition to my normal threads. The advantage is less memory usage and an
> > often better scaling. I would however expect that the main advantage
> would
> > be on the server side.
> >
> >
> > On 02.01.2018 22:02, Gus Heck wrote:
> >
> >> It's not very clear (to me) what your use case is, but generally
> speaking,
> >> asynchronous requests can be achieved by using threads/executors/futures
> >> (java) or ajax (javascript). The link seems to be a scala project, I'm
> >> sure
> >> scala has analogous facilities.
> >>
> >> On Tue, Jan 2, 2018 at 10:31 AM, RAUNAK AGRAWAL <
> agrawal.rau...@gmail.com
> >> >
> >> wrote:
> >>
> >> Hi Guys,
> >>>
> >>> I am trying to write fully async service where solr calls are also
> async.
> >>> Just wondering did anyone tried calling solr in non-blocking mode or is
> >>> there is a way to do it? I have come across one such project
> >>>  but wondering is there anything
> >>> provided
> >>> by solrj?
> >>>
> >>> Thanks
> >>>
> >>>
> >>
> >>
> >
>


Re: SolrJ with Async Http Client

2018-01-03 Thread RAUNAK AGRAWAL
Yes, I am talking about an event-driven way of calling Solr, so that I can
write a pure async web service. Does SolrJ provide support for non-blocking
calls?

On Wed, Jan 3, 2018 at 6:22 PM, Hendrik Haddorp 
wrote:

> There is a difference between asynchronous and non-blocking. If I use 100
> threads to perform calls to Solr using the standard Java HTTP client or
> SolrJ, I block 100 threads even if I don't block my program-logic threads
> by using async calls. However, if I perform those HTTP calls using a
> non-blocking HTTP client like netty, I basically only need a single
> eventing thread in addition to my normal threads. The advantage is less
> memory usage and often better scaling. I would, however, expect that the
> main advantage would be on the server side.
>
>
> On 02.01.2018 22:02, Gus Heck wrote:
>
>> It's not very clear (to me) what your use case is, but generally speaking,
>> asynchronous requests can be achieved by using threads/executors/futures
>> (java) or ajax (javascript). The link seems to be a Scala project; I'm
>> sure Scala has analogous facilities.
>>
>> On Tue, Jan 2, 2018 at 10:31 AM, RAUNAK AGRAWAL wrote:
>>
>> Hi Guys,
>>>
>>> I am trying to write a fully async service where Solr calls are also async.
>>> Just wondering, has anyone tried calling Solr in non-blocking mode, or is
>>> there a way to do it? I have come across one such project
>>>  but wondering is there anything provided
>>> by solrj?
>>>
>>> Thanks
>>>
>>>
>>
>>
>


Re: SolrJ with Async Http Client

2018-01-03 Thread Hendrik Haddorp
There is a difference between asynchronous and non-blocking. If I use 100
threads to perform calls to Solr using the standard Java HTTP client or
SolrJ, I block 100 threads even if I don't block my program-logic threads by
using async calls. However, if I perform those HTTP calls using a
non-blocking HTTP client like netty, I basically only need a single eventing
thread in addition to my normal threads. The advantage is less memory usage
and often better scaling. I would, however, expect that the main advantage
would be on the server side.
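
To make the distinction concrete, here is an illustrative (non-SolrJ) sketch
using the netty-based async-http-client, where no caller thread waits on the
socket; the URL and the callbacks are assumptions, not a recommendation:

    import org.asynchttpclient.AsyncHttpClient;
    import org.asynchttpclient.Dsl;

    // One small event-loop pool drives all in-flight requests; the calling
    // thread registers a callback and moves on instead of blocking on I/O.
    AsyncHttpClient http = Dsl.asyncHttpClient();
    http.prepareGet("http://localhost:8983/solr/collection1/select?q=*:*&wt=json")
        .execute()
        .toCompletableFuture()
        .thenAccept(resp -> System.out.println(resp.getResponseBody()))
        .exceptionally(t -> { t.printStackTrace(); return null; });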


On 02.01.2018 22:02, Gus Heck wrote:

It's not very clear (to me) what your use case is, but generally speaking,
asynchronous requests can be achieved by using threads/executors/futures
(java) or ajax (javascript). The link seems to be a Scala project; I'm sure
Scala has analogous facilities.

On Tue, Jan 2, 2018 at 10:31 AM, RAUNAK AGRAWAL 
wrote:


Hi Guys,

I am trying to write a fully async service where Solr calls are also async.
Just wondering, has anyone tried calling Solr in non-blocking mode, or is
there a way to do it? I have come across one such project
 but wondering is there anything provided
by solrj?

Thanks



SolrCloud Nodes going to recovery state during indexing

2018-01-03 Thread sravan


We have a SolrCloud setup with the settings shared below. We have a
collection with 3 shards and a replica for each of them.


Normal state (as soon as the whole cluster is restarted):
    - Status of all the shards is UP.
    - bulk update requests of 50 documents each take < 100 ms.
    - 6-10 simultaneous bulk updates.

Nodes go into recovery state after 15-30 minutes of updates:
    - Some shards start giving the following ERRORs:
        - o.a.s.h.RequestHandlerBase 
org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException: 
Async exception during distributed update: Read timed out
        - o.a.s.u.StreamingSolrClients error 
java.net.SocketTimeoutException: Read timed out
    - the following error is seen on the shard which goes to recovery 
state.
        - too many updates received since start - startingUpdates no 
longer overlaps with our currentUpdates.
    - Sometimes, the same shard even goes to DOWN state and needs a 
node restart to come back.
    - a bulk update request of 50 documents takes more than 5 seconds. 
Sometimes even >120 secs. This is seen for all the requests if at least 
one node is in recovery state in the whole cluster.


We have a standalone setup with the same collection schema which is able 
to take update & query load without any errors.



We have the following solrcloud setup.
    - setup in AWS.

    - Zookeeper Setup:
        - number of nodes: 3
        - aws instance type: t2.small
        - instance memory: 2gb

    - Solr Setup:
        - Solr version: 6.6.0
        - number of nodes: 3
        - aws instance type: m5.xlarge
        - instance memory: 16gb
        - number of cores: 4
        - JAVA HEAP: 8gb
        - JAVA VERSION: oracle java version "1.8.0_151"
        - GC settings: default CMS.

        collection settings:
            - number of shards: 3
            - replication factor: 2
            - total 6 replicas.
            - total number of documents in the collection: 12 million
            - total number of documents in each shard: 4 million
            - Each document has around 25 fields with 12 of them 
containing textual analysers & filters.

            - Commit Strategy:
                - No explicit commits from application code.
                - Hard commit of 15 secs with OpenSearcher as false.
                - Soft commit of 10 mins.
            - Cache Strategy:
                - filter queries
                    - number: 512
                    - autowarmCount: 100
                - all other caches
                    - number: 512
                    - autowarmCount: 0
            - maxWarmingSearchers: 2


- We tried the following
    - commit strategy
        - hard commit - 150 secs
        - soft commit - 5 mins
    - with the G1 garbage collector based on
https://wiki.apache.org/solr/ShawnHeisey#Java_8_recommendation_for_Solr:
        - the nodes go into recovery state in less than a minute.

The issue is seen even when the leaders are balanced across the three 
nodes.


Can you help us find the solution to this problem?


Removing some fields from uprefix

2018-01-03 Thread Zheng Lin Edwin Yeo
Hi,

I'm using Solr 7.2.0, and I have this /extract handler in my solrconfig.xml

  <requestHandler name="/extract" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
      <str name="xpath">/xhtml:html/xhtml:body/descendant:node()</str>
      <str name="fmap.content">content</str>
      <str name="fmap.meta">attr_meta_</str>
      <str name="uprefix">attr_</str>
      <str name="lowernames">true</str>
      <str name="update.chain">dedupe</str>
    </lst>
  </requestHandler>

I understand that this uprefix of attr_ will cause all generated fields that
aren't defined in the schema to be prefixed with attr_.

Is there any way that we can remove some of the fields, but keep the rest?
For example, I would like to remove attr_x_parsed_by.

Regards,
Edwin


Heavy operations in PostFilter are heavy

2018-01-03 Thread Solrmails
Hello,

I tried to write a Solr PostFilter to do filtering within the
'collect' method (DelegatingCollector). I have to do some heavy operations
within the 'collect' method. This isn't a problem for a few results, but
unfortunately it takes forever with 50 or more results. This is because I
have to repeat the check for every single id and can't process a list of ids
within 'collect'.
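
One way to amortize that per-document cost would be to buffer ids inside the
DelegatingCollector and run the heavy check in batches. A minimal sketch
under assumptions: expensiveBatchCheck() is a hypothetical batched call to
the external system, and equals/hashCode are kept trivial:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.solr.search.DelegatingCollector;
    import org.apache.solr.search.ExtendedQueryBase;
    import org.apache.solr.search.PostFilter;

    public class BatchingPostFilter extends ExtendedQueryBase implements PostFilter {

      @Override
      public boolean getCache() { return false; } // post-filters must not be cached

      @Override
      public int getCost() { return Math.max(super.getCost(), 100); } // run last

      @Override
      public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
          private final List<Integer> buffer = new ArrayList<>();

          @Override
          public void collect(int doc) throws IOException {
            buffer.add(doc);              // defer the heavy per-id check
            if (buffer.size() >= 50) {
              flush();
            }
          }

          @Override
          protected void doSetNextReader(LeafReaderContext context) throws IOException {
            flush();                      // doc ids are segment-local, so flush first
            super.doSetNextReader(context);
          }

          @Override
          public void finish() throws IOException {
            flush();
            super.finish();
          }

          private void flush() throws IOException {
            // Hypothetical: one round-trip that returns the ids that passed.
            for (int doc : expensiveBatchCheck(buffer)) {
              super.collect(doc);         // pass survivors downstream
            }
            buffer.clear();
          }
        };
      }

      private List<Integer> expensiveBatchCheck(List<Integer> docs) {
        return new ArrayList<>(docs);     // stand-in; replace with the real batched call
      }

      @Override
      public boolean equals(Object other) { return sameClassAs(other); }

      @Override
      public int hashCode() { return classHash(); }
    }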

Is there a better place to do post-filtering? I don't want to reimplement
the Solr paging/cursor feature to get things to work.

Thank You

Re: Always use leader for searching queries

2018-01-03 Thread Novin Novin
Hi Erick,

Thanks for your reply.

[ First of all, replicas can be off in terms of counts for the soft
commit interval. The commits don't all happen on the replicas at the
same wall-clock time. Solr promises eventual consistency, in this case
NOW-autocommit time.]

I realized that. To rule it out, I actually turned off auto soft commit
for the time being, but nothing changed: the non-leader replica still had
extra documents.

[ So my first question is whether the replicas in the shard are
inconsistent as of, say, NOW-your_soft_commit_time. I'd add a fudge
factor of 10 seconds earlier just to be sure I was past autowarming.
This does require that there be a time stamp. Absent a timestamp, you
could suspend indexing for a few minutes and run the test like below.]

While data was being indexed I checked the counts in both replicas. What I
found is that the leader replica always has 3 docs less than the other
replica. I don't think they were off by NOW-soft_commit_time; CloudSolrClient
adds something like "_stateVer_=main:114" to the query, which I assume is
meant to keep results consistent between both replicas.

[Adding distrib=false to your command and directing it at a specific
_core_ (something like collection1_shard1_replica1) will only return
data from that core.]
I probably don't need to do this because I have only one shard, but I did it
anyway and the count was different.
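
For reference, that per-core count check can be done like this in SolrJ
(the host and core names below are made up):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    // Ask one specific core for its own count, bypassing distributed search.
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.set("distrib", "false");
    try (HttpSolrClient core = new HttpSolrClient.Builder(
        "http://host1:8983/solr/main_shard1_replica1").build()) {
      System.out.println(core.query(q).getResults().getNumFound());
    }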

[When you say you index every minute, I'm guessing you only index for
part of that minute, is that true? In that case you might get more
consistency if, instead of relying totally on your autoconfig
settings, specify commitWithin on your update command. That should
force the commits to happen more closely in-sync, although still not
perfect.]

We receive data every minute, so whenever we have new data we send it to
SolrCloud using a queue. You said not to rely on the auto config. Do you mean
I should turn off autoCommit and use commitWithin via SolrJ, or leave
autoCommit as it is and also use commitWithin from the SolrJ client?
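
For concreteness, commitWithin from SolrJ looks roughly like this (the
collection name, interval, and the cloudClient/docs variables are
placeholders):

    import org.apache.solr.client.solrj.request.UpdateRequest;

    // Each update asks Solr to make it searchable within ~10s of receipt,
    // which keeps replica visibility more closely in sync than a timer alone.
    UpdateRequest req = new UpdateRequest();
    req.add(docs); // docs: a Collection<SolrInputDocument> built from the queue
    req.setCommitWithin(10_000);
    req.process(cloudClient, "main");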

I apologize if I am not clear. Thanks again for your help.

Thanks in advance,
Navin





On Tue, 2 Jan 2018 at 18:05 Erick Erickson  wrote:

> First of all, replicas can be off in terms of counts for the soft
> commit interval. The commits don't all happen on the replicas at the
> same wall-clock time. Solr promises eventual consistency, in this case
> NOW-autocommit time.
>
> So my first question is whether the replicas in the shard are
> inconsistent as of, say, NOW-your_soft_commit_time. I'd add a fudge
> factor of 10 seconds earlier just to be sure I was past autowarming.
> This does require that there be a time stamp. Absent a timestamp, you
> could suspend indexing for a few minutes and run the test like below.
>
> Adding distrib=false to your command and directing it at a specific
> _core_ (something like collection1_shard1_replica1) will only return
> data from that core.
>
> When you say you index every minute, I'm guessing you only index for
> part of that minute, is that true? In that case you might get more
> consistency if, instead of relying totally on your autoconfig
> settings, specify commitWithin on your update command. That should
> force the commits to happen more closely in-sync, although still not
> perfect.
>
> Another option if you're totally and completely sure that your commits
> happen _only_ from your indexing program is to fire the commit at the
> end of the run from your SolrJ program.
>
> Let us know,
> Erick
>
> On Tue, Jan 2, 2018 at 9:33 AM, Novin Novin  wrote:
> > Hi Erick,
> >
> > You are right, it is XY Problem.
> >
> > Allow me to explain as best I can: I have two replicas of one collection
> > called "Main". When I was using the search feature in my application I
> > got two different numFound counts. So I started digging, and after 2-3
> > hours I found that one replica always has a higher numFound count than
> > the other (the higher count was not on the leader). I am not sure how it
> > ended up like that. This count difference affects paging on my
> > application side, not the Solr side.
> >
> > Extra info that might be useful to know:
> > Same query, not a single letter of difference.
> > auto soft commit 2
> > soft commit 6
> > indexing data every minute.
> >
> > Let me know if you need to know anything else. Any help would be highly
> > appreciated.
> >
> > Thanks in advance,
> > Navin
> >
> >
> >
> > On Tue, 2 Jan 2018 at 15:14 Erick Erickson 
> wrote:
> >
> >> This seems like an XY problem. You're asking how to do X
> >> because you think it will solve problem Y without telling
> >> us what Y is.
> >>
> >> I say this because on the surface this seems to defeat the
> >> purpose behind SolrCloud. Why would you want to only make
> >> use of one piece of hardware? That will limit your throughput,
> >> so why bother to have replicas in the first place?
> >>
> >> Or is this some kind of diagnostic you're trying to implement?
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue,