Re: Limit search queries only to pull replicas

2018-02-14 Thread Ere Maijala
I've now posted https://issues.apache.org/jira/browse/SOLR-11982 with a 
patch. It works just like preferLocalShards. SOLR-10880 is awesome, but 
my idea is not to filter anything out, so this patch just adjusts the order 
of the nodes.
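
For illustration, a request using such a parameter might look like the 
following (host and collection are placeholders, and the final parameter name 
is whatever the patch settles on; this mirrors the preferReplicaTypes idea 
discussed in this thread):

    curl 'http://localhost:8983/solr/mycollection/select?q=*:*&preferReplicaTypes=PULL,TLOG'

Replica types earlier in the list would be preferred when ordering the nodes 
to query.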


--Ere

Tomas Fernandez Lobbe wrote on 8.1.2018 at 21.42:

This feature is not currently supported. I was thinking of implementing it by 
extending the work done in SOLR-10880, but I haven't had time to work on it 
yet. There is a patch for SOLR-10880 that doesn't implement support for 
replica types, but it could be used as a base.

Tomás


On Jan 8, 2018, at 12:04 AM, Ere Maijala  wrote:

Server load alone doesn't always indicate a server's ability to serve 
queries. Memory and cache state are important too, and they're not as easy to 
monitor. Additionally, server load at any single point in time, or a short-term 
average, is not indicative of the server's ability to handle search requests if 
indexing happens in short but intense bursts.

It can also complicate things if more than one Solr instance is running 
on a single server.

I'm definitely not against intelligent routing. In many cases it makes perfect 
sense, and I'd still like to use it, just limited to the pull replicas.

--Ere

Erick Erickson wrote on 5.1.2018 at 19.03:

Actually, I think a much better option is to route queries based on server load.
The theory of preferring pull replicas over leaders would be that the leader
is doing the indexing work while the pull replicas do less work and
therefore serve queries faster. But that's a fragile assumption.
Let's say indexing stops totally: now your leader is sitting there idle
when it could be serving queries.
The autoscaling work will allow for more intelligent routing: you can
monitor the CPU load on your servers and, if the leader has some spare
cycles, use them, vs. crudely routing all queries to pull replicas (or tlog
replicas, for that matter). NOTE: I don't know whether this is being
actively worked on or not, but it seems a logical extension of the increased
monitoring capabilities being put in place for autoscaling, and I'd rather
see effort put in there than support routing based solely on a node's type.
Best,
Erick
On Fri, Jan 5, 2018 at 7:51 AM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

It is interesting that ES had a similar feature to prefer primary/replica,
but it is deprecating it and will remove it - I could not find an explanation why.

Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




On 5 Jan 2018, at 15:22, Ere Maijala  wrote:

Hi,

It would be really nice to have a server-side option, though. Not
everyone uses SolrJ, and a typical, fairly dumb client just queries the
server without any understanding of shards etc. Solr could be clever
enough not to forward the query to NRT shards when configured to prefer
PULL shards and they're available. Maybe it could be something similar to
the preferLocalShards parameter, like "preferShardTypes=TLOG,PULL".


--Ere

Emir Arnautović wrote on 14.12.2017 at 11.41:

Hi Stanislav,
I don't think there is a built-in feature to do this, but it sounds like a
nice feature for SolrJ - maybe you should check if it's available.
You can implement it outside of SolrJ: check the cluster state to see which
shards are available and send queries only to the pull replicas, as sketched below.
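
A rough SolrJ sketch of that approach (collection name and ZooKeeper address 
are placeholders; with multiple shards you would pick one PULL core per shard 
and pass them via the shards parameter instead):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;

    public class PullReplicaQuery {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient cloud = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
                .build()) {
          cloud.connect();
          DocCollection coll = cloud.getZkStateReader().getClusterState()
              .getCollection("mycollection");               // placeholder collection
          // Collect the core URLs of all active PULL replicas
          List<String> pullUrls = new ArrayList<>();
          for (Replica r : coll.getReplicas()) {
            if (r.getType() == Replica.Type.PULL && r.getState() == Replica.State.ACTIVE) {
              pullUrls.add(r.getCoreUrl());
            }
          }
          // Query one of them directly (round-robin and retries left out for brevity)
          try (HttpSolrClient solr = new HttpSolrClient.Builder(pullUrls.get(0)).build()) {
            System.out.println(solr.query(new SolrQuery("*:*")).getResults().getNumFound());
          }
        }
      }
    }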

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

On 14 Dec 2017, at 09:58, Stanislav Sandalnikov <

s.sandalni...@gmail.com> wrote:


Hi,

We have a Solr 7.1 setup with SolrCloud where we have multiple shards
on one server (for indexing); each shard has a pull replica on other servers.

What are the possible ways to limit search requests to pull-type replicas only?

At the moment the only solution I have found is to append the shards parameter
to each query, but if new shards are added later, that requires changing the
configuration. Is this the only way to do it?


Thank you

Regards
Stanislav



--
Ere Maijala
Kansalliskirjasto / The National Library of Finland





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


Re: docvalues set to true, and indexed is false and stored is set to false

2018-02-14 Thread Emir Arnautović
Hi Ganesh,
I cannot confirm for sure, but I would assume that it will not get reindexed - 
just the segment's doc values file will be rewritten. It is best if you test 
this and see for yourself.

Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 14 Feb 2018, at 13:09, mganeshs  wrote:
> 
> Hi Emir,
> 
> Thanks for confirming that strField is not considered / available for in
> place updates. 
> 
> As per documentation, it says...
> 
> *An atomic update operation is performed using this approach only when the
> fields to be updated meet these three conditions:
> 
> are non-indexed (indexed="false"), non-stored (stored="false"), single
> valued (multiValued="false") numeric docValues (docValues="true") fields;
> 
> the _version_ field is also a non-indexed, non-stored single valued
> docValues field; and,
> 
> copy targets of updated fields, if any, are also non-indexed, non-stored
> single valued numeric docValues fields.*
> 
> Let's consider that I have declared the following three fields in the schema:
> 
> id
> <field name="Field1" indexed="true" stored="true" docValues="false"/>
> <field name="Field2" indexed="true" stored="true" docValues="false"/>
> <field name="Field3" indexed="false" stored="false" docValues="true"/>
> 
> With this, I am creating a Solr document (id=1) with only Field1 and Field2,
> and it is indexed. I can search the documents based on Field1 and Field2.
> 
> Now, after a while, I am adding a new field called Field3 by passing the id
> field (id=1) and Field3 (Field3=100, which is a docValues field in our case).
> 
> What will happen now? Will the complete document get reindexed, or will only
> Field3 be added under docValues?
> 
> Please confirm.
> 
> Regards,
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Replicas: sending query to leader and replica simultaneously

2018-02-14 Thread Emir Arnautović
Hi,
Solr will load-balance across replicas and, if one is unresponsive, send the 
request to another and flag the unresponsive one. But it will not send a request 
to multiple replicas - that would be a waste of resources. If you want something 
like that, you would probably have to set up two separate clusters and send two 
requests from your client code.
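
If you do go the two-cluster route, a minimal client-side sketch of racing the 
two requests could look like this (URLs are placeholders):

    import java.util.concurrent.CompletableFuture;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class RacingQuery {
      public static void main(String[] args) throws Exception {
        try (HttpSolrClient a = new HttpSolrClient.Builder("http://clusterA:8983/solr/col1").build();
             HttpSolrClient b = new HttpSolrClient.Builder("http://clusterB:8983/solr/col1").build()) {
          SolrQuery q = new SolrQuery("*:*");
          CompletableFuture<QueryResponse> fa = CompletableFuture.supplyAsync(() -> query(a, q));
          CompletableFuture<QueryResponse> fb = CompletableFuture.supplyAsync(() -> query(b, q));
          // Take whichever response arrives first; the slower one is simply ignored.
          QueryResponse first = (QueryResponse) CompletableFuture.anyOf(fa, fb).get();
          System.out.println("hits: " + first.getResults().getNumFound());
        }
      }

      private static QueryResponse query(HttpSolrClient client, SolrQuery q) {
        try {
          return client.query(q);
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    }

Note that the resources spent on the slower request are still spent - you just 
don't wait for it.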

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 13 Feb 2018, at 18:17, SOLR4189  wrote:
> 
> Hi all,
> 
> I use Solr 6.5.1 and I want to start using replicas in SolrCloud mode. I
> have read the ref guide and Solr in Action, and I want to make sure of only
> one thing about replicas:
> 
> Solr can't send a query both to the leader and to a slave simultaneously and
> return the faster of the two responses?
> 
> (For the case where both leader and slave are active, but one of them is
> overloaded and takes a long time to respond.)
> 
> Thank you. 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Using Synonyms as a feature with LTR

2018-02-14 Thread Alessandro Benedetti
"I can go with the "title" field and have that include the synonyms in 
analysis. Only problem is that the number of fields and number of synonyms 
files are quite a lot (~ 8 synonyms files) due to different weightage and 
type of expansion (exact vs partial) based on these. Hence going with this 
approach would mean creating more fields for all these synonyms 
(synonyms.txt) 

So, I am looking to build a custom parser for which I could supply the file 
and the field and that would expand the synonyms and return a score. "

Having a binary or scalar feature is completely up to you and the way you
configure the Solr feature.
If you have 8 (copy?) fields with the same content but different expansion,
that is still OK.
You can have 8 features, one per type of expansion.
LTR will take care of the weight to be assigned to those features.

"So, I am looking to build a custom parser for which I could supply the file 
and the field and that would expand the synonyms and return a score."
I don't get this - can you elaborate?

Regards



-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: docvalues set to true, and indexed is false and stored is set to false

2018-02-14 Thread Emir Arnautović
Hi Ganesh,
Doc values are enabled for strField and UUID fields, but in-place updates of 
them are not.

And it is not free: according to some discussions on the mailing list (I did 
not check the code), an in-place update is not an update of a single value in 
the doc values file, but a rewrite of the doc values file for the segment 
holding the updated document. When updating docs that are in a larger segment, 
a larger doc values file will be rewritten.

Regards,
Emir

--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 14 Feb 2018, at 04:38, mganeshs  wrote:
> 
> Hi,
> 
> Thanks for clearing that up.
> 
> But as per this link (Enabling DocValues), it says that strField and UUID
> fields are also supported.
> 
> Again, what do you mean by "it's not free" for large segments? Can you point
> me to some documentation on that?
> 
> Regards,
> Ganesh
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



RE: Solr search word NOT followed by another word

2018-02-14 Thread ivan
Hi Timothy,

I'm trying to use your parser, but I'm having some trouble with the Solr/Lucene
versions. I'm trying to use version 6.4.1, but I'm facing a lot of
incompatibilities with version 5. Is there an updated version of the plugin?




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: docvalues set to true, and indexed is false and stored is set to false

2018-02-14 Thread mganeshs
Hi Emir,

Thanks for confirming that strField is not considered / available for in
place updates. 

As per documentation, it says...

*An atomic update operation is performed using this approach only when the
fields to be updated meet these three conditions:

are non-indexed (indexed="false"), non-stored (stored="false"), single
valued (multiValued="false") numeric docValues (docValues="true") fields;

the _version_ field is also a non-indexed, non-stored single valued
docValues field; and,

copy targets of updated fields, if any, are also non-indexed, non-stored
single valued numeric docValues fields.*

Let's consider that I have declared the following three fields in the schema:

id
<field name="Field1" indexed="true" stored="true" docValues="false"/>
<field name="Field2" indexed="true" stored="true" docValues="false"/>
<field name="Field3" indexed="false" stored="false" docValues="true"/>

With this, I am creating a Solr document (id=1) with only Field1 and Field2,
and it is indexed. I can search the documents based on Field1 and Field2.

Now, after a while, I am adding a new field called Field3 by passing the id
field (id=1) and Field3 (Field3=100, which is a docValues field in our case).

What will happen now? Will the complete document get reindexed, or will only
Field3 be added under docValues?

Please confirm.
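
(For reference, that partial update can be sent with the standard atomic-update
syntax - the URL and collection name here are illustrative:

    curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/mycollection/update' \
      --data-binary '[{"id":"1", "Field3":{"set":100}}]'

Whether Solr executes it in-place or as a regular atomic update depends on
Field3 meeting the conditions quoted above.)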

Regards,



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Solr search word NOT followed by another word

2018-02-14 Thread Allison, Timothy B.
In progress; it should be finished by the end of this week. I had to put 
SlowFuzzyQuery back in, and I discovered SOLR-11976 while trying to upgrade. 
I'll have to do a workaround until that is fixed.
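
For anyone who wants to build the equivalent query programmatically in Lucene, 
a minimal sketch of the SpanNotQuery approach discussed below might look like 
this (the field name is illustrative; pre=0, post=1 mirrors the !~0,1 syntax):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanNotQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    public class LeonardoQuery {
      public static SpanQuery build() {
        SpanQuery leonardo = new SpanTermQuery(new Term("text", "leonardo"));
        // slop 0, in order: the exact phrase "da vinci"
        SpanQuery daVinci = new SpanNearQuery(new SpanQuery[] {
            new SpanTermQuery(new Term("text", "da")),
            new SpanTermQuery(new Term("text", "vinci"))
        }, 0, true);
        // Match "leonardo" unless "da vinci" begins within 1 position after it
        return new SpanNotQuery(leonardo, daVinci, 0, 1);
      }
    }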

-Original Message-
From: simon [mailto:mtnes...@gmail.com] 
Sent: Monday, February 12, 2018 1:21 PM
To: solr-user 
Subject: Re: Solr search word NOT followed by another word

Tim:

How up to date is the SOLR-5410 patch/zip in JIRA? I'm looking to use the Span 
Query parser in 6.5.1, migrating to 7.x sometime soon.

Would love to see these committed !

-Simon

On Mon, Feb 12, 2018 at 10:41 AM, Allison, Timothy B. 
wrote:

> That requires a SpanNotQuery.  AFAIK, there is no way to do this with 
> the current parsers included in Solr.
>
> My SpanQueryParser does cover this, and I'm hoping to port it to 7.x 
> today or tomorrow.
>
> Syntax would be "Leonardo [da vinci]"!~0,1
>
> https://issues.apache.org/jira/browse/LUCENE-5205
>
> https://github.com/tballison/lucene-addons/tree/master/lucene-5205
>
> https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205
>
> With Solr wrapper: https://issues.apache.org/jira/browse/SOLR-5410
>
>
> -Original Message-
> From: ivan [mailto:i...@presstoday.com]
> Sent: Monday, February 12, 2018 6:00 AM
> To: solr-user@lucene.apache.org
> Subject: Solr search word NOT followed by another word
>
> What I'm trying to do is to only get results for "Leonardo" when it is 
> not followed by "da vinci".
> Any result containing "Leonardo" (not followed by "da vinci") is 
> fine, even if "Leonardo da vinci" also appears in the result. I only want to 
> filter out the results where "Leonardo" never appears without "da vinci".
>
> Examples:
> "Leonardo abc abc abc"   OK
> "Leonardo da vinci abab"  KO
> "Leonardo is the name of Leonardo da Vinci"  OK
>
>
> I can't seem to find any way to do that using Solr queries. I can't 
> use a regex (I have a tokenized text field), and no combination of 
> boolean logic seems to work.
>
> Any help?
> Thanks
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Not getting appropriate spell suggestions

2018-02-14 Thread Alessandro Benedetti
Given your schema, the stemmer seems the most likely culprit.
You need to disable it and re-index.
Just commenting it out is not going to work if you don't re-index.
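
As a sketch, a stemmer-free analysis chain for the field feeding the 
spellchecker could look something like this (the type name and filter 
selection are illustrative):

    <fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- no PorterStemFilterFactory here: stemmed terms like "declar"
             would otherwise end up in the spellcheck dictionary -->
      </analyzer>
    </fieldType>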

Cheers



-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Not getting appropriate spell suggestions

2018-02-14 Thread Jaimin Patel
Here are links to my solrconfig.xml and managed-schema.

I have a document with the following value in "suggestions" (the field used for
spell check) - the same information as is available in the text field.

"suggestion":"Amylase Production, Part 1\nIn humans, salivary α-amylase is
produced by the AMY1 gene on chromosome 1. \n\n;;br\nHumans are diploid
organisms, meaning that they generally have two copies of genes that are
not present on the X and Y chromosomes -- one copy inherited from each
parent. \n\n;;br\nHowever, genetic studies show that people can have
anywhere from two to 15 copies of the AMY1 gene on each chromosome 1,
suggesting that the gene has been duplicated during human
evolution.\n\n;;br\nWhy would humans evolve multiple copies of a gene?"

*Search term and result*
"amylase" - works and provides appropriate result
"amylas" - No result, it identify as correct spelling
"amylos" - No result, spell suggestion is "amylas" (without e)


A similar situation occurs with another term: "declaratoin". The correct
spelling is "declaration", and I have a document with the suggestions field
value "declaration of independence", but the spell suggestion returns
"declar". I would imagine it would return the close match instead of something
like "declar" (which is also missing the ending). I suspect it has something
to do with PorterStemFilterFactory, but even after I commented it out, it
shows the same result.

Can someone help me understand what I am getting wrong? How can I improve
suggestions when there is a one-character difference (a missing character or a
swapped position)?

Thanks.


-- 
Jaimin



Re: Solr - Managed Resources REST API to get stopwords

2018-02-14 Thread ruby
I was hoping to get back the list of stopwords defined in the
server\solr\collection\conf\lang\stopwords_en.txt file.

So are you saying this REST API can't give me access to the stopwords defined
in this file?

Is there a query which will give me the stopwords defined in the
server\solr\collection\conf\lang\stopwords_en.txt file?

Thanks



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Request routing / load-balancing TLOG & PULL replica types

2018-02-14 Thread Ere Maijala

A patch is now available: https://issues.apache.org/jira/browse/SOLR-11982

--Ere

Greg Roodt wrote on 12.2.2018 at 22.06:

Thanks Ere. I've taken a look at the discussion here:
http://lucene.472066.n3.nabble.com/Limit-search-queries-only-to-pull-replicas-td4367323.html
This is how I was imagining TLOG & PULL replicas would work, so if this
functionality does get developed, it would be useful to me.

I still have 2 questions at the moment:
1. I am running the single shard scenario. I'm thinking of using a
dedicated HTTP load-balancer in front of the PULL replicas only with
read-only queries directed directly at the load-balancer. In this
situation, the healthy PULL replicas *should* handle the queries on the
node itself without a proxy hop (assuming state=active). New PULL replicas
added to the load-balancer will internally proxy queries to the other PULL
or TLOG replicas while in state=recovering until the switch to
state=active. Is my understanding correct?

2. Is it all worth it? Is there any advantage to running a cluster of 3
TLOGs + 10 PULL replicas vs running 13 TLOG replicas?




On 12 February 2018 at 19:25, Ere Maijala  wrote:


Your question about directing queries to PULL replicas only has been
discussed on the list. Look for topic "Limit search queries only to pull
replicas". What I'd like to see is something similar to the
preferLocalShards parameter. It could be something like
"preferReplicaTypes=TLOG,PULL". Tomás mentioned previously that
SOLR-10880 could be used as a base for such functionality, and I'm
considering taking a stab at implementing it.

--Ere


Greg Roodt wrote on 12.2.2018 at 6.55:


Thank you both for your very detailed answers.

This is great to know. I knew that SolrJ was cluster-aware (via ZooKeeper),
but I was wondering what something like curl would do.
Great to know that internally the cluster will proxy queries to the
appropriate place regardless.

I am running the single-shard scenario. I'm thinking of using a dedicated
HTTP load-balancer in front of the PULL replicas only, with read-only
queries directed straight at the load-balancer. In this situation, the
healthy PULL replicas *should* handle the queries on the node itself
without a proxy hop (assuming state=active). New PULL replicas added to the
load-balancer will internally proxy queries to the other PULL or TLOG
replicas while in state=recovering, until they switch to state=active.

Is my understanding correct?

Is this sensible to do, or is it not worth it due to the smart proxying
that SolrCloud can do anyway?

If the TLOG and PULL replicas are so similar, is there any real advantage
to having a mixed cluster? I assume a bit less work is required across the
cluster to propagate writes if you only have 3 TLOG nodes vs 10+ PULL
nodes? Or would it be better to just have 13 TLOG nodes?





On 12 February 2018 at 15:24, Tomas Fernandez Lobbe 
wrote:

On the last question:

For writes: yes. Writes are going to be sent to the shard leader, and
since PULL replicas can't be leaders, it's going to be a TLOG replica. If
you are using CloudSolrClient, then this routing will be done directly from
the client (since it will send the update to the leader), and if you are
using some other HTTP client, then yes, the PULL replica will forward the
update, the same way any non-leader node would.

For reads: this won't happen today; any replica can respond to queries.
I do believe there is value in this kind of routing logic - sometimes you
simply don't want the leader to handle any queries, especially when queries
can be expensive. You could do this today if you want, by putting some load
balancer in front and directing your queries only to the nodes you know are
PULL, but keep in mind that this would only work in the single-shard scenario,
and only if you hit an active replica (otherwise, as you said, the query will
be routed to any other node of the shard, regardless of the type). If you have
multiple shards, then you need to use the "shards" parameter and tell Solr
exactly which nodes you want to hit for each shard, as sketched below (the
"shards" approach can also be used in the single-shard case, although you
would be adding an extra hop, I believe).
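
(For illustration, such a request might look like the following - hosts and
core names are placeholders:

    curl 'http://anynode:8983/solr/col/select?q=*:*&shards=host1:8983/solr/col_shard1_replica_p1,host2:8983/solr/col_shard2_replica_p4'

One entry per shard; alternative replicas for the same shard can be listed
separated with "|".)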

Tomás
Sent from my iPhone

On Feb 11, 2018, at 6:35 PM, Greg Roodt  wrote:


Hi

I have a question around how queries are routed and load-balanced in a
cluster of mixed TLOG and PULL replicas.

I thought that I might have to put a load-balancer in front of the PULL
replicas and direct queries at them manually as nodes are added and removed
as PULL replicas. However, it seems that SolrCloud handles this automatically?

If I add a new PULL replica node, it goes into state="recovering" while it
pulls the core. As expected. What happens if queries are directed at this
node while in this state? From what I am observing, the query gets directed
to another node?

If SolrCloud is handling the routing of requests to 

Re: Solr search word NOT followed by another word

2018-02-14 Thread ivan
I'm working on 6.4.1 (but I tried 7.2.1 too), and I'm not getting results
for the case I've shown before.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Judging the MoreLikeThis results for relevancy

2018-02-14 Thread Alessandro Benedetti
So let me answer point by point :

1) Similarity is misleading here if you interpret it as a probabilistic
measure.
Given a query, there is no such thing as the "ideal document". With both
TF-IDF and BM25 (which solves the problem better) you are scoring the
document: the higher the score, the higher the relevance of that document for
the given query. BM25 does a better job here - its relevance function hits a
saturation point, so it is closer to your expectation; this blog post from
Doug should help [1].

2) "if document vector A is at a 
distance of 5 and 10 units from document vectors B and C respectively then 
can't we say that B is twice as relevant to A as C is to A? Or in terms of 
distance, C is twice as distant to  A and B is to A?"

Not in Lucene, at least not strictly.
The current MLT uses TF-IDF as its scoring formula.
When the score of B is double the score of C, you can say that, for Lucene,
B is twice as relevant to A as C is.
From a user perspective this can be different (quoting Doug: "If an
article mentions “dog” six times is it twice as relevant as an article
mentioning “dog” 3 times? Most users say no").

3) Under the hood, MLT builds a Lucene query and retrieves documents from the
index.
When building the MLT query, to keep it simple, it extracts from the seed
document a subset of terms which are considered representative of the seed
document (let's call them relevant terms).
This is managed through a parameter; usually, and by default, you collect a
limited set of relevant terms (not all the terms).
When retrieving similar documents, you score them using TF-IDF (and, in the
future, BM25).
So, first of all, you can have documents with higher scores than the original
(it doesn't make sense in a probabilistic world, but this is how Lucene
works).
Reversing the documents, i.e. applying MLT to document B, you could build a
slightly different query.
So:
given seed(A), score(B) != score(A) given seed(B)

I understand you think it doesn't make sense, but this is how Lucene works.

I also understand that a lot of the time users want a percentage out of an
MLT query.
I will work toward that for sure, step by step; first I need to
have the MLT refactor approved and patched :)
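
(As a concrete illustration of the knobs mentioned above, an MLT request
capping the number of relevant terms extracted from the seed document might
look like this - handler, field names, and values are illustrative:

    curl 'http://localhost:8983/solr/mycollection/select?q=id:A&mlt=true&mlt.fl=title,body&mlt.maxqt=25&mlt.mintf=1&mlt.mindf=1'

mlt.maxqt is the parameter that limits the set of relevant terms.)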




[1]
https://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/



-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Pratik Patel
I had a similar issue with index size after upgrading to version 6.4.1 from
5.x. The issue for me was that the field which caused the index size to
increase disproportionately had a field type ("text_general") for which the
default value of omitNorms was not true. Turning omitNorms on explicitly for
the field fixed the problem (see the sketch below). The following is a link
to my related question; you can check the value of omitNorms for your fields
to see whether this applies in your case.
http://search-lucene.com/m/Solr/eHNlagIB7209f1w1?subj=Fwd+Solr+dynamic+field+blowing+up+the+index+size
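
For reference, the relevant bit of schema was just the omitNorms attribute on
the field type - a sketch, with the analyzer details illustrative:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>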

On Tue, Feb 13, 2018 at 8:48 PM, Howe, David 
wrote:

>
> I have set docValues=false on all of the string fields in our index that
> have indexed=false and stored=true.  This gave a small improvement in the
> index size from 13.3GB to 12.82GB.
>
> I have also tried running an optimize, which then reduced the index to
> 12.6GB.
>
> Next step is to dump the sizes of the Solr index files for the index
> version that is the correct size and the version that has the large size.
>
> Regards,
>
> David
>
>
> David Howe
> Java Domain Architect
> Postal Systems
> Level 16, 111 Bourke Street Melbourne VIC 3000
>
> T  0391067904
>
> M  0424036591
>
> E  david.h...@auspost.com.au
>
> W  auspost.com.au
> W  startrack.com.au
>
> -Original Message-
> From: Howe, David [mailto:david.h...@auspost.com.au]
> Sent: Wednesday, 14 February 2018 7:26 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Index size increases disproportionately to size of added
> field when indexed=false
>
>
> Thanks Hoss.  I will try setting docValues to false, as we only ever want
> to be able to retrieve the value of this field.
>
> Regards,
>
> David
>
> David Howe
> Java Domain Architect
> Postal Systems
> Level 16, 111 Bourke Street Melbourne VIC 3000
>
> T  0391067904
>
> M  0424036591
>
> E  david.h...@auspost.com.au
>
> W  auspost.com.au
> W  startrack.com.au
>
>


Re: Using Synonyms as a feature with LTR

2018-02-14 Thread Roopa Rao
So I would end up with ~6 copy fields and ~8 synonym files, which would be
about 48 field/synonym combinations. Would that be significant in terms of
index size? What would be the best way to measure this?

Custom parser:
This would take the file name and the field to run the analysis on. The field
need not be a copy field which holds data, since we use it only for the
analysis.
Get the synonyms for the user query as tokens.
Create an edismax query based on the query tokens.
Return the score.

This custom parser would be called in LTR as a scalar feature.

I am at the stage where I can get the synonyms from the analysis chain;
however, the tokens are individual tokens, not phrases. So I am stuck at how
to construct a correct query based on the synonym tokens and positions.

Thank you,
Roopa
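
(For anyone attempting the same thing, here is a minimal sketch of walking an
analysis chain and recovering token positions; inside a Solr QParserPlugin
the analyzer would come from req.getSchema().getFieldType(field).getQueryAnalyzer().
A position increment of 0 marks a synonym stacked on the previous token:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

    public class SynonymDump {
      /** Prints each token with its position; posInc == 0 marks a stacked synonym. */
      public static void dump(Analyzer analyzer, String field, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
          ts.reset();
          int pos = -1;
          while (ts.incrementToken()) {
            pos += posInc.getPositionIncrement();
            System.out.println(pos + "\t" + term);
          }
          ts.end();
        }
      }
    }

Multi-word synonyms come back as several tokens; to rebuild phrases you would
also need to inspect offsets or the position length attribute.)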

On Wed, Feb 14, 2018 at 10:12 AM, Roopa Rao  wrote:

> So, I would end up with ~6 copy fields with ~8 synonym files so that would
> be about 48 field/synonym combination. Would that be a significant in terms
> of index size. I guess that depends on the thesaurus size, what would be
> the best way to measure this?
>
> Custom parser:
> This would take the file name, field to run the analysis on. This field
> need not be a copy field which holds data, since we can use this is only
> for getting the analysis.
> Get the synonyms for the user query as tokens.
> Create a edismax query based on the query tokens.
> Return the score
>
> This custom parser would be called in LTR as a scalar feature.
>
> I am at the stage I can get the synonyms from the analysis chain, however
> tokens are individual tokens and not phrases. So, I am stuck at how to
> construct a correct query based on the synonym tokens and positions.
>
> Thank you,
> Roopa
>
>
>
> On Wed, Feb 14, 2018 at 5:23 AM, Alessandro Benedetti <
> a.benede...@sease.io> wrote:
>
>> "I can go with the "title" field and have that include the synonyms in
>> analysis. Only problem is that the number of fields and number of synonyms
>> files are quite a lot (~ 8 synonyms files) due to different weightage and
>> type of expansion (exact vs partial) based on these. Hence going with this
>> approach would mean creating more fields for all these synonyms
>> (synonyms.txt)
>>
>> So, I am looking to build a custom parser for which I could supply the
>> file
>> and the field and that would expand the synonyms and return a score. "
>>
>> Having a binary or scalar feature is completely up to you and the way you
>> configure the Solr feature.
>> If you have 8 (copy?)fields with same content but different expansion,
>> that
>> is still ok.
>> You can have 8 features, one per type of expansion.
>> LTR will take care of the weight to be assigned to those features.
>>
>> "So, I am looking to build a custom parser for which I could supply the
>> file
>> and the field and that would expand the synonyms and return a score. ""
>> I don't get this , can you elaborate ?
>>
>> Regards
>>
>>
>>
>> -----
>> Alessandro Benedetti
>> Search Consultant, R&D Software Engineer, Director
>> Sease Ltd. - www.sease.io
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
>
>


Re: Replicas: sending query to leader and replica simultaneously

2018-02-14 Thread SOLR4189
Thank you, Emir for your answer

*But it will not send request to multiple replicas - that would be a waste
of resources.*
What if a server is overloaded but still responsive? Then it would not be a
waste of resources, because the second replica would respond faster than the
overloaded one.

*and flag unresponsive one*
For how long will it be marked unresponsive? If Solr checks it on every
request, that would also be a waste of resources...





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr CDCR doesn't work if the authentication is enabled

2018-02-14 Thread dimaf
I set up CDCR in my test environment and it worked perfectly until I uploaded
security.json files to the ZooKeeper clusters of the target and the source
SolrClouds. The security.json files are identical for both clouds, as are the
collection names.
The source logs the following errors:

org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from
server at http://target_node:port/solr/col01_shard1_replica1: Expected mime
type application/octet-stream but got text/html. 
...
Error 401 Unauthorized request, Response code: 401

Any idea how I should fix this?
Thanks!




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Re: Using Synonyms as a feature with LTR

2018-02-14 Thread Alessandro Benedetti
I see.
As far as I know, it is not possible to run different query-time analysis
chains for the same field.

I'm not sure if anyone is working on that.

Regards



-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solr Recommended setup

2018-02-14 Thread Wael Kader
Hi,

I would like to get a recommendation for the Solr setup I have.

I have an index getting around 2 million records per day. The index is in
Cloudera Search (Solr), and I am running everything on one node. I commit
whatever data has come into the index every 5 minutes.
The whole Cloudera VM has 64 GB of RAM.

It has been working fine so far with around 80 million records, but Solr gets
slow once a week, so I restart the VM to get things working again.
I would like a recommendation on the setup; note that I can add VMs if needed.
I read somewhere that it's wrong to index and read data from the same place,
which is what I am doing now, so I know I am doing things wrong.
How can I set things up on Cloudera so that Solr does the indexing in one VM
and the reading in another, and what else would you recommend for my setup?


-- 
Regards,
Wael


Re: Issue Using JSON Facet API Buckets in Solr 6.6

2018-02-14 Thread Antelmo Aguilar
Hello,

I just wanted to follow up on this issue I am having, in case it got lost.
I have been trying to figure it out, and so far the only solution I can
find is using the older version.

If you need more details from me, please let me know.  I would really
appreciate any help.

Best,
Antelmo

On Feb 12, 2018 4:55 PM, "Antelmo Aguilar"  wrote:

> Hi,
>
> I was using the following part of a query to get facet buckets so that I
> can use the information in the buckets for some post-processing:
>
> "json": "{\"filter\":[\"bundle:pop_sample\",\"has_abundance_data_
> b:true\",\"has_geodata:true\",\"${project}\"],\"facet\":{\"
> term\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"${term:
> species_category}\",\"facet\":{\"collection_dates\":{\"type\
> ":\"terms\",\"limit\":-1,\"field\":\"collection_date\",\"facet\":{\"collection\":
> {\"type\":\"terms\",\"field\":\"collection_assay_id_s\",\"
> facet\":{\"abnd\":\"sum(div(sample_size_i, collection_duration_days_i))\"
> "
>
> Sorry if it is hard to read. Basically, what it does is build the following
> buckets:
> 
> The first bucket is categorized by "Species category" by default, unless we
> pass the "term" parameter in the request, in which case the first bucket is
> categorized by whatever "term" is set to. Inside this first bucket, we create
> further buckets by the "Collection date" category. Then, inside the
> "Collection date" buckets, we use some functions to do calculations and
> return the results inside those buckets.
>
> This query is working fine in Solr 6.2, but I upgraded our instance of
> Solr 6.2 to the latest 6.6 version.  However it seems that upgrading to
> Solr 6.6 broke the above query.  Now it complains when trying to create the
> buckets of the "Collection date" category.  I get the following error:
>
> Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014'
>
> It seems that when creating the buckets of a date field, it does some
> conversion of the way the date is stored and causes the error to appear.
> Does anyone have an idea as to why this error is happening?  I would really
> appreciate any help.  Hopefully I was able to explain my issue well.
>
> Thanks,
> Antelmo
>


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Alessandro Benedetti
Hi Pratik,
how is it possible that just the norms for a single field were causing such
a massive index size increase in your case?

In your case I think it was a field type used by multiple fields, but
it's still suspicious in my opinion; norms shouldn't be that big.
If I remember correctly, in old versions of Solr, before index-time boosts
were dropped, norms contained both an approximation of the field length and
the index-time boost.
From your mailing list thread, you moved from 10 GB to 300 GB.
That can't be just the norms - are you sure you didn't hit some bug?

Regards



-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr - Managed Resources REST API to get stopwords

2018-02-14 Thread Alessandro Hoss
> So are you saying this REST api can't give me access to stopwords defined in
> this file?
>
> Is there a query which will give me stopwords defined in
> server\solr\collection\conf\lang\stopwords_en.txt file?

No - the managed resources are managed via the API and stored in a
"schema_analysis_stopwords_english.json" file inside the core directory.

Maybe you can convert your .txt file to the new JSON file format and change
the file name accordingly for a warm start.
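
For reference, once a field type uses solr.ManagedStopFilterFactory with
managed="english", the managed list can be read and extended over HTTP, along
these lines (host and collection are placeholders):

    # read the managed stopword list
    curl 'http://localhost:8983/solr/mycollection/schema/analysis/stopwords/english'

    # add words (a core reload is needed before they take effect)
    curl -X PUT -H 'Content-type: application/json' --data-binary '["foo","bar"]' \
      'http://localhost:8983/solr/mycollection/schema/analysis/stopwords/english'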

Regards,
Alessandro Hoss

On Wed, Feb 14, 2018 at 10:57 AM ruby  wrote:

> I was hoping to get back the list of stopwords which are defined in
> server\solr\collection\conf\lang\stopwords_en.txt  file.
>
> So are you saying this REST api can't give me access to stopwords defined
> in
> this file?
>
> Is there a query which will give me stopwords defined in
> server\solr\collection\conf\lang\stopwords_en.txt  file ?
>
> Thanks
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Using Synonyms as a feature with LTR

2018-02-14 Thread Roopa Rao
I see okay, thank you.

On Wed, Feb 14, 2018 at 10:34 AM, Alessandro Benedetti  wrote:

> I see,
> According to what I know it is not possible to run for the same field
> different query time analysis.
>
> Not sure if anyone was working on that.
>
> Regards
>
>
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Using dynamic synonyms file

2018-02-14 Thread Roopa Rao
Hi,

Is it possible to specify the synonyms file as a variable, with a default
synonyms file, and pass the file name in the request? If so, is there an
example of this?

Such as:

<filter class="solr.SynonymGraphFilterFactory" synonyms="${synonymsFile:synonyms.txt}"/>
Thanks,
Roopa


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Erick Erickson
Pratik may have jumped right to the difference. We'd have gotten there
eventually by looking at file extensions, but just checking his
recommendation would be the first thing to do!

bq:  what would be the right scenarios to use docvalues='true'?

Whenever you want to facet, group or sort on the field. This _will_
increase the index size on disk, but it's almost always a good
tradeoff, here's why:

To facet, group or sort, you need to "uninvert" the field. If you have
docValues=false, this uninversion is done at run time, into Java's heap.
If you have docValues=true, the uninversion is done at _index_ time
and the result stored on disk. Now, when it's required, it can be
loaded in from disk efficiently (essentially de-serialized) and is
held in OS memory thanks to the magic of MMapDirectory, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

bq:  In what situation would it make sense to have indexed=false and
docValues=true?

When you want to return _only_ fields that have docValues=true. If you
return fields with stored=true and docValues=false, Solr/Lucene has to
1> read the stored values from disk (minimum 16K block)
2> decompress it
3> extract the field

With docValues, since they're only simple field types, all you have to do is
read the value from the docValues structure - much more efficient. HOWEVER,
there are two caveats:
1> The entire docValues field will be MMapped, so there's a time/space tradeoff.
2> docValues are stored in a SORTED_SET. This is relevant for
multiValued fields because:
2a> values are returned in sorted order, not the order they were in the document
2b> identical values are collapsed.

So if the input values for a particular doc were 4, 3, 6, 4, 5, 2, 6,
5, 6, 5, 4, 3, 2 you'd get back 2, 3, 4, 5, 6.

If you can live with those caveats, then returning field values this way
involves much less work (both I/O and CPU), especially in
high-throughput situations. NOTE: there are a couple of JIRAs IIRC
that have to do with not storing the  though.
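
For example, a docValues-only field meant purely for retrieval might be
declared like this (the name and type are illustrative):

    <field name="price" type="plong" indexed="false" stored="false" docValues="true" useDocValuesAsStored="true"/>

useDocValuesAsStored is what lets the value come back in the fl list even
though stored=false.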

Best,
Erick

On Wed, Feb 14, 2018 at 7:01 AM, Pratik Patel  wrote:
> I had a similar issue with index size after upgrading to version 6.4.1 from
> 5.x. The issue for me was that the field which caused index size to be
> increased disproportionately had a field type("text_general") for which
> default value of omitNorms was not true. Turning it on explicitly on field
> fixed the problem. Following is the link to my related question.  You can
> verify value of omitNorms for your fields to check whether this is
> applicable in your case or not.
> http://search-lucene.com/m/Solr/eHNlagIB7209f1w1?subj=Fwd+Solr+dynamic+field+blowing+up+the+index+size
>
> On Tue, Feb 13, 2018 at 8:48 PM, Howe, David 
> wrote:
>
>>
>> I have set docValues=false on all of the string fields in our index that
>> have indexed=false and stored=true.  This gave a small improvement in the
>> index size from 13.3GB to 12.82GB.
>>
>> I have also tried running an optimize, which then reduced the index to
>> 12.6GB.
>>
>> Next step is to dump the sizes of the Solr index files for the index
>> version that is the correct size and the version that has the large size.
>>
>> Regards,
>>
>> David
>>
>>
>> David Howe
>> Java Domain Architect
>> Postal Systems
>> Level 16, 111 Bourke Street Melbourne VIC 3000
>>
>> T  0391067904
>>
>> M  0424036591
>>
>> E  david.h...@auspost.com.au
>>
>> W  auspost.com.au
>> W  startrack.com.au
>>
>> -Original Message-
>> From: Howe, David [mailto:david.h...@auspost.com.au]
>> Sent: Wednesday, 14 February 2018 7:26 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Index size increases disproportionately to size of added
>> field when indexed=false
>>
>>
>> Thanks Hoss.  I will try setting docValues to false, as we only ever want
>> to be able to retrieve the value of this field.
>>
>> Regards,
>>
>> David
>>
>> David Howe
>> Java Domain Architect
>> Postal Systems
>> Level 16, 111 Bourke Street Melbourne VIC 3000
>>
>> T  0391067904
>>
>> M  0424036591
>>
>> E  david.h...@auspost.com.au
>>
>> W  auspost.com.au
>> W  startrack.com.au
>>

Re: Issue Using JSON Facet API Buckets in Solr 6.6

2018-02-14 Thread Yonik Seeley
Could you provide the full stack trace containing "Invalid Date
String"  and the full request that causes it?
Are you using any custom code/plugins in Solr?
-Yonik


On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar  wrote:
> Hi,
>
> I was using the following part of a query to get facet buckets so that I
> can use the information in the buckets for some post-processing:
>
> "json":
> "{\"filter\":[\"bundle:pop_sample\",\"has_abundance_data_b:true\",\"has_geodata:true\",\"${project}\"],\"facet\":{\"term\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"${term:species_category}\",\"facet\":{\"collection_dates\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"collection_date\",\"facet\":{\"collection\":
> {\"type\":\"terms\",\"field\":\"collection_assay_id_s\",\"facet\":{\"abnd\":\"sum(div(sample_size_i,
> collection_duration_days_i))\""
>
> Sorry if it is hard to read.  Basically what is was doing was getting the
> following buckets:
>
> First bucket will be categorized by "Species category" by default unless we
> pass in the request the "term" parameter which we will categories the first
> bucket by whatever "term" is set to.  Then inside this first bucket, we
> create another buckets of the "Collection date" category.  Then inside the
> "Collection date" category buckets, we would use some functions to do some
> calculations and return those calculations inside the "Collection date"
> category buckets.
>
> This query is working fine in Solr 6.2, but I upgraded our instance of Solr
> 6.2 to the latest 6.6 version.  However it seems that upgrading to Solr 6.6
> broke the above query.  Now it complains when trying to create the buckets
> of the "Collection date" category.  I get the following error:
>
> Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014'
>
> It seems that when creating the buckets of a date field, it does some
> conversion of the way the date is stored and causes the error to appear.
> Does anyone have an idea as to why this error is happening?  I would really
> appreciate any help.  Hopefully I was able to explain my issue well.
>
> Thanks,
> Antelmo


Re: facet.method=uif not working in solr cloud?

2018-02-14 Thread Yonik Seeley
On Wed, Feb 14, 2018 at 2:28 PM, Wei  wrote:
> Thanks all!   It's really great learning.  A bit off the topic, after I
> enabled facet.method = uif in solr cloud,  the faceting performance is
> actually much worse than the original fc( ~1000 ms with uif  vs ~200 ms
> with fc). My cloud has 8 shards with 6 replicas in each shard.  I do see
> that fieldValueCache is getting utilized.  Any reason uif could be so
> slow?

I haven't seen that before.  Are you sure it's not the first time
faceting on a field?  uif has big upfront cost, but is usually faster
once that cost has been paid.


-Yonik

> On Tue, Feb 13, 2018 at 7:41 AM, Yonik Seeley  wrote:
>
>> Great, thanks for tracking that down!
>> It's interesting that a mincount of 0 disables uif processing in the
>> first place.  IIRC, it's only the hash-based method (as opposed to
>> array-based) that can't return zero counts.
>>
>> -Yonik
>>
>>
>> On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
>>  wrote:
>> > *Update* : This has been actually already solved by Hoss.
>> >
>> > https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
>> > Request : https://github.com/apache/lucene-solr/pull/279/files
>> >
>> > This should go live with 7.3
>> >
>> > Cheers
>> >
>> >
>> >
>> > -----
>> > Alessandro Benedetti
>> > Search Consultant, R&D Software Engineer, Director
>> > Sease Ltd. - www.sease.io
>> > --
>> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>


Re: facet.method=uif not working in solr cloud?

2018-02-14 Thread Wei
Thanks all! It has been really great learning. A bit off topic: after I
enabled facet.method=uif in SolrCloud, the faceting performance is
actually much worse than the original fc (~1000 ms with uif vs ~200 ms
with fc). My cloud has 8 shards with 6 replicas in each shard. I do see
that the fieldValueCache is getting utilized. Any reason uif could be so
slow?
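
(For reference, the kind of request in question - the field name is
illustrative; note facet.mincount, since a mincount of 0 disables uif as
discussed below:

    curl 'http://localhost:8983/solr/mycollection/select?q=*:*&rows=0&facet=true&facet.field=category&facet.method=uif&facet.mincount=1'
)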

On Tue, Feb 13, 2018 at 7:41 AM, Yonik Seeley  wrote:

> Great, thanks for tracking that down!
> It's interesting that a mincount of 0 disables uif processing in the
> first place.  IIRC, it's only the hash-based method (as opposed to
> array-based) that can't return zero counts.
>
> -Yonik
>
>
> On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
>  wrote:
> > *Update* : This has been actually already solved by Hoss.
> >
> > https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
> > Request : https://github.com/apache/lucene-solr/pull/279/files
> >
> > This should go live with 7.3
> >
> > Cheers
> >
> >
> >
> > -----
> > Alessandro Benedetti
> > Search Consultant, R&D Software Engineer, Director
> > Sease Ltd. - www.sease.io
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Pratik Patel
You are right, in my case this field type was applied to many text fields.
These includes many copy fields and dynamic fields as well. In my case,
only specifying omitNorms=true for field type "text_general" fixed the
issue. I didn't do anything else or had any other bug.

On Wed, Feb 14, 2018 at 1:01 PM, Alessandro Benedetti 
wrote:

> Hi pratik,
> how is it possible that just the norms for a single field were causing such
> a massive index size increment in your case ?
>
> In your case I think it was for a field type used by multiple fields, but
> it's still suspicious in my opinions,
> norms should be that big.
> If I remember correctly in old versions of Solr before the drop of index
> time boost, norms were containing both an approximation of the length of
> the
> field + index time boost.
> From your mailing list problem you moved from 10 Gb to 300 Gb.
> It can't be just the norms, are you sure you didn't face some bug ?
>
> Regards
>
>
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: facet.method=uif not working in solr cloud?

2018-02-14 Thread Wei
Thanks Yonik. If uif has a big upfront cost when it first hits Solr, then in
SolrCloud the same faceting request could hit different replicas within a
shard, so that cost will be paid at least once per replica? And if we are
doing frequent auto-commits, the fieldValueCache will be invalidated and uif
will have to pay the upfront cost again after each commit?



On Wed, Feb 14, 2018 at 11:51 AM, Yonik Seeley  wrote:

> On Wed, Feb 14, 2018 at 2:28 PM, Wei  wrote:
> > Thanks all!   It's really great learning.  A bit off the topic, after I
> > enabled facet.method = uif in solr cloud,  the faceting performance is
> > actually much worse than the original fc( ~1000 ms with uif  vs ~200 ms
> > with fc). My cloud has 8 shards with 6 replicas in each shard.  I do see
> > that fieldValueCache is getting utilized.  Any reason uif could be so
> > slow?
>
> I haven't seen that before.  Are you sure it's not the first time
> faceting on a field?  uif has big upfront cost, but is usually faster
> once that cost has been paid.
>
>
> -Yonik
>
> > On Tue, Feb 13, 2018 at 7:41 AM, Yonik Seeley  wrote:
> >
> >> Great, thanks for tracking that down!
> >> It's interesting that a mincount of 0 disables uif processing in the
> >> first place.  IIRC, it's only the hash-based method (as opposed to
> >> array-based) that can't return zero counts.
> >>
> >> -Yonik
> >>
> >>
> >> On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
> >>  wrote:
> >> > *Update* : This has been actually already solved by Hoss.
> >> >
> >> > https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
> >> > Request : https://github.com/apache/lucene-solr/pull/279/files
> >> >
> >> > This should go live with 7.3
> >> >
> >> > Cheers
> >> >
> >> >
> >> >
> >> > -----
> >> > Alessandro Benedetti
> >> > Search Consultant, R&D Software Engineer, Director
> >> > Sease Ltd. - www.sease.io
> >> > --
> >> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >>
>


RE: Index size increases disproportionately to size of added field when indexed=false

2018-02-14 Thread Howe, David

I have re-run both scenarios and captured the total size of each type of index 
file.  The MB (1) column is for the baseline scenario which has the smaller 
index and acceptable performance.  The MB(2) column is after I have added the 
extra field to the index.

Ext       MB (1)     MB (2)
.cfe        0.00       0.01
.cfs      335.01    3612.09
.dii        0.00       0.00
.dim      324.38     319.07
.doc     1094.68    2767.53
.dvd     1211.84     625.44
.dvm        0.14       0.08
.fdt     1633.21    5387.92
.fdx        2.12       1.44
.fnm        0.11       0.12
.loc        0.00       0.00
.nvd      127.84     110.67
.nvm        0.01       0.01
.pos      809.23    1272.70
.si         0.02       0.03
.tim      137.94     156.82
.tip        2.52       3.04
Total    5679.06   14256.98


David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au

-Original Message-
From: Howe, David [mailto:david.h...@auspost.com.au]
Sent: Wednesday, 14 February 2018 12:49 PM
To: solr-user@lucene.apache.org
Subject: RE: Index size increases disproportionately to size of added field 
when indexed=false


I have set docValues=false on all of the string fields in our index that have 
indexed=false and stored=true.  This gave a small improvement in the index size 
from 13.3GB to 12.82GB.

I have also tried running an optimize, which then reduced the index to 12.6GB.

Next step is to dump the sizes of the Solr index files for the index version 
that is the correct size and the version that has the large size.

Regards,

David


David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au

-Original Message-
From: Howe, David [mailto:david.h...@auspost.com.au]
Sent: Wednesday, 14 February 2018 7:26 AM
To: solr-user@lucene.apache.org
Subject: RE: Index size increases disproportionately to size of added field 
when indexed=false


Thanks Hoss.  I will try setting docValues to false, as we only ever want to be 
able to retrieve the value of this field.

Regards,

David

David Howe
Java Domain Architect
Postal Systems
Level 16, 111 Bourke Street Melbourne VIC 3000

T  0391067904

M  0424036591

E  david.h...@auspost.com.au

W  auspost.com.au
W  startrack.com.au


Reading data from Oracle

2018-02-14 Thread LOPEZ-CORTES Mariano-ext
Hello

We have to delete our Solr collection and re-feed it periodically from an
Oracle database (up to 40M rows).

We've done the following test: from a Java program, we read chunks of data
from Oracle and push them to Solr (via SolrJ).

The problem: it is really, really slow (1.5 nights).

Is there a faster method to do this?

Thanks in advance.
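
For reference, the usual fixes for this are a large JDBC fetch size, batched
adds, several indexing threads, and a single commit at the end. A minimal
SolrJ sketch along those lines (the URL, JDBC string, table, and sizes are
illustrative):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class OracleToSolr {
      public static void main(String[] args) throws Exception {
        // ConcurrentUpdateSolrClient buffers documents and streams them
        // to Solr from background threads.
        try (ConcurrentUpdateSolrClient solr = new ConcurrentUpdateSolrClient.Builder(
                 "http://localhost:8983/solr/mycollection")
                 .withQueueSize(100)
                 .withThreadCount(4)
                 .build();
             Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/SVC", "user", "password")) {
          try (Statement st = conn.createStatement()) {
            st.setFetchSize(10_000);   // stream rows instead of fetching one by one
            try (ResultSet rs = st.executeQuery("SELECT id, name FROM my_table")) {
              List<SolrInputDocument> batch = new ArrayList<>(1000);
              while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("name", rs.getString("name"));
                batch.add(doc);
                if (batch.size() == 1000) { solr.add(batch); batch.clear(); }
              }
              if (!batch.isEmpty()) solr.add(batch);
            }
          }
          solr.commit();               // one commit at the end, not per batch
        }
      }
    }

If that is still too slow, several such programs can run in parallel over
disjoint id ranges, and with SolrCloud a CloudSolrClient would route documents
straight to the right shard leaders.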