Re: Getting to grips with auto-scaling

2020-06-09 Thread Tom Evans
Hi Radu

Thanks for the reply - I'm starting to lean that way myself: create a
different collection for each set of data, so that I can more easily
control the scaling of each collection, eg increase the replication
factor on those that will be queried more. I was looking at Category
Routed Alias, but that seems to have quite a few gotchas:

* Can't restrict the collections queried - even if you specify the
exact collection to query, eg "collections=items__CRA__2020" (which
exists), it returns no results. Even querying the underlying
collection directly by its name returns no results. I only get
results with collections=items__CRA - it's as if the underlying
collection thinks its name really is "items__CRA" rather than
"items__CRA__2020"
* Some problems with indexing to a new category - I get errors the
first time a category is encountered.

Looks like it might be manually set-up and managed collections and
aliases for now.
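
Roughly what I have in mind, for the record - an untested sketch, the
collection names, config name and alias below are made up:

    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=items_2020&numShards=1&replicationFactor=2&collection.configName=items_conf'
    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=items_2019&numShards=1&replicationFactor=2&collection.configName=items_conf'
    # one alias for queries to point at
    curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=items&collections=items_2020,items_2019'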

Cheers

Tom

On Mon, Jun 8, 2020 at 12:43 PM Radu Gheorghe
 wrote:
>
> Hi Tom,
>
> To your last two questions, I'd like to suggest an alternative design: have
> dedicated "hot" and "warm" nodes. That is, 2020+lists will go to the hot
> tier, and 2019, 2018, 2017+lists go to the warm tier.
>
> Then you can scale the hot tier based on your query load. For the warm
> tier, I assume there will be less need for scaling, and if it is, I guess
> it's less important for shards of each index to be perfectly balanced (so a
> simple "make sure cores are evenly distributed" should be enough).
>
> Granted, this design isn't as flexible as the one you suggested, but it's
> simpler. So simple that I've seen it done without autoscaling (just a few
> scripts run when you add nodes to each tier).
>
> Best regards,
> Radu
>
> https://sematext.com
>
> Fri, 5 Jun 2020, 21:59 Tom Evans wrote:
>
> > Hi
> >
> > I'm trying to get a handle on the newer auto-scaling features in Solr.
> > We're in the process of upgrading an older SolrCloud cluster from 5.5
> > to 8.5, and re-architecting it slightly to improve performance and
> > automate operations.
> >
> > If I boil it down slightly, currently we have two collections, "items"
> > and "lists". Both collections have just one shard. We publish new data
> > to "items" once each day, and our users search and do analysis on
> > them, whilst "lists" contains NRT user-specified collections of ids
> > from items, which we join to from "items" in order to allow them to
> > restrict their searches/analysis to just docs in their curated lists.
> >
> > Most of our searches have specific date ranges in them, usually only
> > from the last 3 years or so, but sometimes we need to do searches
> > across all the data. With the new setup, we want to:
> >
> > * shard by date (year) to make the hottest data available in smaller shards
> > * have more nodes with these shards than we do of the older data.
> > * be able to add/remove nodes predictably based upon our clients
> > (predictable) query load
> > * use TLOG for "items" and NRT for "lists", to avoid unnecessary
> > indexing load for "items" and have NRT for "lists".
> > * spread cores across two AZ
> >
> > With that in mind, I came up with a bunch of simplified rules for
> > testing, with just 4 shards for "items":
> >
> > * "lists" collection has one NRT replica on each node
> > * "items" collection shard 2020 has one TLOG replica on each node
> > * "items" collection shard 2019 has one TLOG replica on 75% of nodes
> > * "items" collection shards 2018 and 2017 each have one TLOG replica
> > on 50% of nodes
> > * all shards have at least 2 replicas if number of nodes > 1
> > * no node should have 2 replicas of the same shard
> > * number of cores should be balanced across nodes
> >
> > Eg, with 1 node, I want to see this topology:
> > A: items: 2020, 2019, 2018, 2017 + lists
> >
> > with 2 nodes:
> > A: items: 2020, 2019, 2018, 2017 + lists
> > B: items: 2020, 2019, 2018, 2017 + lists
> >
> > and if I add two more nodes:
> > A: items: 2020, 2019, 2018 + lists
> > B: items: 2020, 2019, 2017 + lists
> > C: items: 2020, 2019, 2017 + lists
> > D: items: 2020, 2018 + lists
> >
> > To the questions:
> >
> > * The type of replica created when nodeAdded is triggered can't be set
> > per collection. Either everything gets NRT or everything gets TLOG.
> > Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
> > will add NRT replicas if configured that way.

Indexing error when using Category Routed Alias

2020-06-09 Thread Tom Evans
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)
  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)
  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577)
  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
  at org.eclipse.jetty.server.handler.InetAccessHandler.handle(InetAccessHandler.java:177)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
  at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:322)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
  at org.eclipse.jetty.server.Server.handle(Server.java:500)
  at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
  at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)
  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
  at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336)
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313)
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
  at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938)
  at java.base/java.lang.Thread.run(Unknown Source)

2020-06-09 02:12:58.507 INFO  (qtp90045638-16)
[c:products_20200609__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP
s:shard1 r:core_node2
x:products_20200609__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP_shard1_replica_n1]
o.a.s.c.S.Request
[products_20200609__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP_shard1_replica_n1]
 webapp=/solr path=/update/json/docs params={} status=400 QTime=2422

Cheers

Tom


Getting to grips with auto-scaling

2020-06-05 Thread Tom Evans
Hi

I'm trying to get a handle on the newer auto-scaling features in Solr.
We're in the process of upgrading an older SolrCloud cluster from 5.5
to 8.5, and re-architecting it slightly to improve performance and
automate operations.

If I boil it down slightly, currently we have two collections, "items"
and "lists". Both collections have just one shard. We publish new data
to "items" once each day, and our users search and do analysis on
them, whilst "lists" contains NRT user-specified collections of ids
from items, which we join to from "items" in order to allow them to
restrict their searches/analysis to just docs in their curated lists.

Most of our searches have specific date ranges in them, usually only
from the last 3 years or so, but sometimes we need to do searches
across all the data. With the new setup, we want to:

* shard by date (year) to make the hottest data available in smaller shards
* have more nodes with these shards than we do of the older data.
* be able to add/remove nodes predictably based upon our clients
(predictable) query load
* use TLOG for "items" and NRT for "lists", to avoid unnecessary
indexing load for "items" and have NRT for "lists".
* spread cores across two AZ

With that in mind, I came up with a bunch of simplified rules for
testing, with just 4 shards for "items":

* "lists" collection has one NRT replica on each node
* "items" collection shard 2020 has one TLOG replica on each node
* "items" collection shard 2019 has one TLOG replica on 75% of nodes
* "items" collection shards 2018 and 2017 each have one TLOG replica
on 50% of nodes
* all shards have at least 2 replicas if number of nodes > 1
* no node should have 2 replicas of the same shard
* number of cores should be balanced across nodes

Eg, with 1 node, I want to see this topology:
A: items: 2020, 2019, 2018, 2017 + lists

with 2 nodes:
A: items: 2020, 2019, 2018, 2017 + lists
B: items: 2020, 2019, 2018, 2017 + lists

and if I add two more nodes:
A: items: 2020, 2019, 2018 + lists
B: items: 2020, 2019, 2017 + lists
C: items: 2020, 2019, 2017 + lists
D: items: 2020, 2018 + lists

To the questions:

* The type of replica created when nodeAdded is triggered can't be set
per collection. Either everything gets NRT or everything gets TLOG.
Even if I specify nrtReplicas=0 when creating a collection, nodeAdded
will add NRT replicas if configured that way.
* I'm having difficulty expressing these rules in terms of a policy -
I can't seem to figure out a way to specify the number of replicas for
a shard based upon the total number of nodes (my closest attempt is
sketched below).
* Is this beyond the current scope of autoscaling triggers/policies?
Should I instead use the trigger with a custom plugin action (or to
trigger a web hook) to be a bit more intelligent?
* Am I wasting my time trying to ensure there are more replicas of the
hotter shards than of the colder shards? It seems to add a lot of
complexity - should I instead just accept that the colder shards aren't
queried much, so they won't use up cache space that the hot shards will
be using? Disk space is pretty cheap after all (total size for
"items" + "lists" is under 60GB).

Cheers

Tom


Outdated information on JVM heap sizes in Solr 8.3 documentation?

2020-02-14 Thread Tom Burton-West
Hello,

In the section on JVM tuning in the Solr 8.3 documentation (
https://lucene.apache.org/solr/guide/8_3/jvm-settings.html#jvm-settings)
there is a paragraph which cautions about setting heap sizes over 2 GB:

"The larger the heap the longer it takes to do garbage collection. This can
mean minor, random pauses or, in extreme cases, "freeze the world" pauses
of a minute or more. As a practical matter, this can become a serious
problem for heap sizes that exceed about **two gigabytes**, even if far
more physical memory is available. On robust hardware, you may get better
results running multiple JVMs, rather than just one with a large memory
heap. "  (** added by me)

I suspect this paragraph is severely outdated, but am not a Java expert.
 It seems to be contradicted by the statement in
https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#memory-and-gc-settings:
"...values between 10 and 20 gigabytes are not uncommon for production
servers"

Are "freeze the world" pauses still an issue with modern JVM's?
Is it still advisable to avoid heap sizes over 2GB?
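
For context, the settings in question live in solr.in.sh - a sketch with
example values only, not a recommendation:

    SOLR_HEAP="16g"
    GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"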

Tom
https://www.hathitrust.org/blogs/large-scale-search


mapping and tuning payloads in Solr 8

2020-02-12 Thread Burgmans, Tom
Hi all,

In our Solr 6 setup we use string payloads to boost certain tokens (URIs). 
These strings are mapped to floats via a schema parameter "PayloadMapping", 
which can be read out in our custom WKSimilarity class (extending 
TFIDFSimilarity).

   
0.4
0.4
0.5
0
0.0
10.0
3.0
 1.0
 isAbout=15.0,coversFiscalPeriod=10.0,type=5.0,hasTheme=5.0,subject=4.0,mentions=2.0,creator=2.0
   


The reason for this indirection is convenience: by storing payload strings
instead of floats we could change & tune the boosts easily by updating the
schema without having to change the content set.
Inside WKSimilarity each payload string is mapped to its corresponding boost 
value and the final boost is applied via the scorePayload method (where we 
could tune the boost curve via some additional schema parameters). This works 
well in Solr 6.

The problem: we are about to migrate to Solr 8, and after LUCENE-8014 it isn't 
possible anymore to override the scorePayload method in WKSimilarity (it was 
removed from TFIDFSimilarity). I wonder what alternatives there are for mapping 
string payloads to floats and using them in a tunable formula for boosting.

Thanks,
Tom Burgmans


UAX29 URL Email Tokenizer not working as expected

2019-05-06 Thread Tom Van Cuyck
Hi,

The UAX29 URL Email Tokenizer is not working as expected.
According to the documentation (
https://lucene.apache.org/solr/guide/7_2/tokenizers.html): "Words are split
at hyphens, unless there is a number in the word, in which case the token
is not split and the numbers and hyphen(s) are preserved."

So I expect "ABC-123" to remain "ABC-123"
However the term is split in 2 separate tokens "ABC" and "123".

Same for "AB12-CD34" --> "AB12" and "CD34" etc...

Is this behavior to be expected? Or is there a way to get the behavior I
expect?
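
For what it's worth, this is easy to reproduce with the field analysis
handler (the core and field type names below are just examples):

    curl 'http://localhost:8983/solr/mycore/analysis/field?analysis.fieldtype=my_uax29_type&analysis.fieldvalue=ABC-123&wt=json'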

Kind regards, Tom



RE: Multiplicative Boosts broken since 7.3 (LUCENE-8099)

2019-02-13 Thread Burgmans, Tom
I'd like to bump this issue up, since it is a showstopper for us to upgrade 
from Solr 6. In https://issues.apache.org/jira/browse/SOLR-13126 I described a 
couple more use cases in which this bug appears. We see different scores in 
the EXPLAIN compared to the actual scores and our analysis is that the EXPLAIN 
in fact is correct. It happens when a multiplicative boost is used (via the 
"boost" parameter) in combination with some function queries, like "query" and 
"field". 

One example (tested on Solr 7.5.0), when running: 

http://localhost:8983/solr/test/select?defType=edismax&fl=id,score,[explain style=text]&q=*:*&boost=sum(field(price),4)

then the expectation is that a document that doesn't have the price field gets 
a score of 4. The result however is: 

{
"id": "docid123576",
"score": 1.0,
"[explain]": "4.0 = product of:\n  1.0 = boost\n  4.0 = product of:\n
1.0 = *:*\n4.0 = sum(float(price)=0.0,const(4))\n"
}

EXPLAIN and score are not consistent.

Best regards Tom


-Original Message-
From: Tobias Ibounig [mailto:t.ibou...@netconomy.net] 
Sent: Tuesday, 22 January 2019 10:14
To: solr-user@lucene.apache.org
Subject: Multiplicative Boosts broken since 7.3 (LUCENE-8099)

Hello,

As described in https://issues.apache.org/jira/browse/SOLR-13126,
multiplicative boosts (in certain conditions) seem to be broken since 7.3.
The error seems to be introduced in
https://issues.apache.org/jira/browse/LUCENE-8099.
Reverting the SOLR parts to the now deprecated BoostingQuery again fixes the 
issue.
The filed issue contains a test case and a patch with the revert (for testing 
purposes, not really a clean fix).
We sadly couldn't find the actual issue, which seems to lie with the use of 
"FunctionScoreQuery" for boosting.

We were able to patch our 7.5 installation with the patch. As others might be 
affected as well, we hope this can be helpful in resolving this bug.

To all SOLR/Lucene developers, thank you for your work. Looking through the 
code base gave me a new appreciation of your work.

Best Regards,
Tobias

PS: This issue was already posted by a colleague, "Inconsistent debugQuery 
score with multiplicative boost", but I wanted to create a new post with a 
clearer title.



Limit facet terms based on a substring using the JSON facet API

2019-01-29 Thread Tom Van Cuyck
Hi

In the old Solr facet API there are the facet.contains and
facet.contains.ignoreCase parameters to limit the facet values to those
terms containing the specified substring.
Is there an equivalent option in the JSON facet API? Or is there a way to
obtain the same behavior with the JSON API? I can't find anything in the
official documentation.
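
For reference, this is what we do today with the old API (the collection and
field names below are just examples):

    curl 'http://localhost:8983/solr/mycollection/select?q=*:*&rows=0&facet=true&facet.field=category&facet.contains=foo&facet.contains.ignoreCase=true'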

Kind regards, Tom


Re: loadOnStartup=false doesn't appear to work for Solr 6.6

2018-08-17 Thread Tom Burton-West
Thanks Erick,

Silly oversight on my part.  I went into the admin panel and used the core
selector to view information about the core and it was running.
I did some more thinking about it and restarted solr and looked at the core
admin panel where I could see that the startTime was "-".

So the problem is operator error.  I didn't think about how the core
selector actually sends a query to the core to get stats, which of course
starts the core.


Tom

On Fri, Aug 17, 2018 at 12:18 PM, Erick Erickson 
wrote:

> Tom:
>
> That hasn't been _intentionally_ changed. However, any request that
> comes in (update or query) will permanently load the core (assuming no
> transient cores), and any request to the core will autoload it. How
> are you determining that the core hasn't been loaded? And are there
> any background tasks that could be causing them to load (autowarming
> in solrconfig doesn't count).
>
> On Fri, Aug 17, 2018 at 8:57 AM, Tom Burton-West 
> wrote:
> > Hello,
> >
> > I'm not using SolrCloud and want to have some cores not load when Solr
> > starts up.
> > I tried loadOnStartup=false, but the cores seem to start up anyway.
> >
> > Is the loadOnStartup parameter still usable with Solr 6.6 or does the
> > documentation need updating?
> >  Or  Is there something else I need to do/set?
> >
> > Tom
>


loadOnStartup=false doesn't appear to work for Solr 6.6

2018-08-17 Thread Tom Burton-West
Hello,

I'm not using SolrCloud and want to have some cores not load when Solr
starts up.
I tried loadOnStartup=false, but the cores seem to start up anyway.

Is the loadOnStartup parameter still usable with Solr 6.6 or does the
documentation need updating?
 Or  Is there something else I need to do/set?
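
For reference, the sort of core.properties I'm testing with (the core name is
just an example):

    name=archive_2015
    loadOnStartup=false
    # optionally, transient=true lets Solr unload the core again once the
    # transient core cache fills up
    transient=true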

Tom


Re: Can the export handler be used with the edismax or dismax query handler

2018-07-29 Thread Tom Burton-West
Thanks Mikhail and Erick,

I don't need ranks or score.  I just need the full set of results.  Will
the export handler work with a fq that uses edismax? (I'm not at work
today, but I can try it out tomorrow.)
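
What I plan to try is roughly this (untested sketch; qf is cut down from our
real one, and it assumes the fl/sort fields have docValues):

    curl -G 'http://localhost:8983/solr/mycollection/export' \
        --data-urlencode 'q=*:*' \
        --data-urlencode 'fq={!edismax qf="ocr^5 allfields^1 titleProper^50" mm=100%}European Art History' \
        --data-urlencode 'fl=id' \
        --data-urlencode 'sort=id asc'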

Using a simple (not edismax) query, I compared the export handler against
cursorMark with rows of 50K to 200K. The export handler took about 8 ms to
export all 1.9 million results and had minimal impact on server CPU and
memory.  With the cursormark it took about 1 minute 20 seconds, CPU use
increased by about 25% and there were many more garbage collections
although the time for GC totaled only a few seconds.

Tom



On Sat, Jul 28, 2018 at 4:25 AM, Mikhail Khludnev  wrote:

> Tom,
> Do you say you don't need rank results or you don't need to export score?
> If the former is true, you can just put edismax to fq.
> Just a note: using cursorMark with the score may cause duplicate hits and
> probably miss some.
>
> On Sat, Jul 28, 2018 at 5:20 AM Erick Erickson 
> wrote:
>
> > What about cursorMark? That's designed to handle repeated calls with
> > increasing "start" parameters without bogging down.
> >
> > https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
> >
> > Best,
> > Erick
> >
> > On Fri, Jul 27, 2018 at 9:47 AM, Tom Burton-West 
> > wrote:
> > > Thanks Joel,
> > >
> > > My use case is that I have a complex edismax query (example below)  and
> > the
> > > user wants to download the set of *all* search results (ids and some
> > small
> > > metadata fields).   So they don't need the relevance ranking.
> However, I
> > > need to somehow get the exact set that the complex edismax query
> matched.
> > >
> > > Should I try to write some code to rewrite  the logic of the edismax
> > query
> > > with a complex boolean query or would it make sense for me to look at
> > > possibly modifying the export handler for my use case?
> > >
> > > Tom
> > >
> > > "q= _query_:"{!edismax
> > >
> > qf='ocr^5+allfieldsProper^2+allfields^1+titleProper^50+
> title_topProper^30+title_restProper^15+title^10+title_
> top^5+title_rest^2+series^5+series2^5+author^80+author2^
> 50+issn^1+isbn^1+oclc^1+sdrnum^1+ctrlnum^1+id^1+
> rptnum^1+topicProper^2+topic^1+hlb3^1+fullgeographic^1+fullgenre^1+era^1+'
> > >
> > pf='title_ab^1+titleProper^1500+title_topProper^1000+title_
> restProper^800+series^100+series2^100+author^1600+
> author2^800+topicProper^200+fullgenre^200+hlb3^200+allfieldsProper^100+'
> > > mm='100%25' tie='0.9' } European Art History"
> > >
> > >
> > > On Thu, Jul 26, 2018 at 6:02 PM, Joel Bernstein 
> > wrote:
> > >
> > >> The export handler doesn't allow sorting by score at this time. It
> only
> > >> supports sorting on fields. So the edismax qparser won't cxcurrently
> > work
> > >> with the export handler.
> > >>
> > >> Joel Bernstein
> > >> http://joelsolr.blogspot.com/
> > >>
> > >> On Thu, Jul 26, 2018 at 5:52 PM, Tom Burton-West 
> > >> wrote:
> > >>
> > >> > Hello all,
> > >> >
> > >> > I am completely new to the export handler.
> > >> >
> > >> > Can the export handler be used with the edismax or dismax query
> > handler?
> > >> >
> > >> > I tried using local params :
> > >> >
> > >> > q= _query_:"{!edismax qf='ocr^5+allfields^1+titleProper^50'
> > >> > mm='100%25'
> > >> > tie='0.9' } art"
> > >> >
> > >> > which does not seem to be working.
> > >> >
> > >> > Tom
> > >> >
> > >>
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: Can the export handler be used with the edismax or dismax query handler

2018-07-27 Thread Tom Burton-West
Thanks Joel,

My use case is that I have a complex edismax query (example below)  and the
user wants to download the set of *all* search results (ids and some small
metadata fields).   So they don't need the relevance ranking.  However, I
need to somehow get the exact set that the complex edismax query matched.

Should I try to write some code to rewrite  the logic of the edismax query
with a complex boolean query or would it make sense for me to look at
possibly modifying the export handler for my use case?

Tom

"q= _query_:"{!edismax
qf='ocr^5+allfieldsProper^2+allfields^1+titleProper^50+title_topProper^30+title_restProper^15+title^10+title_top^5+title_rest^2+series^5+series2^5+author^80+author2^50+issn^1+isbn^1+oclc^1+sdrnum^1+ctrlnum^1+id^1+rptnum^1+topicProper^2+topic^1+hlb3^1+fullgeographic^1+fullgenre^1+era^1+'
pf='title_ab^1+titleProper^1500+title_topProper^1000+title_restProper^800+series^100+series2^100+author^1600+author2^800+topicProper^200+fullgenre^200+hlb3^200+allfieldsProper^100+'
mm='100%25' tie='0.9' } European Art History"


On Thu, Jul 26, 2018 at 6:02 PM, Joel Bernstein  wrote:

> The export handler doesn't allow sorting by score at this time. It only
> supports sorting on fields. So the edismax qparser won't cxcurrently work
> with the export handler.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Jul 26, 2018 at 5:52 PM, Tom Burton-West 
> wrote:
>
> > Hello all,
> >
> > I am completely new to the export handler.
> >
> > Can the export handler be used with the edismax or dismax query handler?
> >
> > I tried using local params :
> >
> > q= _query_:"{!edismax qf='ocr^5+allfields^1+titleProper^50'
> > mm='100%25'
> > tie='0.9' } art"
> >
> > which does not seem to be working.
> >
> > Tom
> >
>


Can the export handler be used with the edismax or dismax query handler

2018-07-26 Thread Tom Burton-West
Hello all,

I am completely new to the export handler.

Can the export handler be used with the edismax or dismax query handler?

I tried using local params :

q= _query_:"{!edismax qf='ocr^5+allfields^1+titleProper^50' mm='100%25'
tie='0.9' } art"

which does not seem to be working.

Tom


ExternalFileField management strategy with SolrCloud

2018-04-26 Thread Tom Peters
Is there a recommended way of managing external files with SolrCloud? At first 
glance it appears that I would need to manually manage the placement of the 
external_<fieldname>.txt file in each shard's data directory. Is there a better 
way of managing this (Solr API, interface, etc.)?
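
For context, the files in question are plain key=value lines, one file per
shard data directory (the field name and values below are made up):

    # data/external_popularity.txt - one line per document key
    doc1=1.5
    doc2=0.83
    doc3=2.0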



Re: CDCR Bootstrap

2018-04-26 Thread Tom Peters
I'm not sure under what conditions it will be automatically triggered, but if 
you manually wanted to trigger a CDCR Bootstrap you need to issue the following 
query to the leader in your target data center.

/solr/<collection>/cdcr?action=BOOTSTRAP&masterUrl=<url-encoded source leader URL>

The masterUrl will look something like (change the necessary values):
http%3A%2F%2Fsolr-leader.solrurl%3A8983%2Fsolr%2Fcollection
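
Put together, the call looks roughly like this (hostnames and collection name
are placeholders):

    curl 'http://target-leader.example.com:8983/solr/mycollection/cdcr?action=BOOTSTRAP&masterUrl=http%3A%2F%2Fsource-leader.example.com%3A8983%2Fsolr%2Fmycollection'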

> On Apr 26, 2018, at 10:15 AM, Susheel Kumar  wrote:
> 
> Anybody has idea how to trigger Solr CDCR BOOTSTRAP or under what condition
> it gets triggered ?
> 
> Thanks,
> Susheel
> 
> On Tue, Apr 24, 2018 at 12:34 PM, Susheel Kumar 
> wrote:
> 
>> Hello,
>> 
>> I am wondering under what different conditions does that CDCR bootstrap
>> process gets triggered.  I did notice it getting triggered after I stopped
>> CDCR and then started again later and now I am trying to reproduce the same
>> behavior.
>> 
>> In case target cluster is left behind and buffer was disabled on source, i
>> would like the CDCR bootstrap to trigger and sync target.
>> 
>> Does deleting records from target and then starting CDCR would trigger
>> bootstrap ?
>> 
>> Thanks,
>> Susheel
>> 
>> 
>> 




Re: Does CDCR Bootstrap sync leaves replica's out of sync

2018-04-16 Thread Tom Peters
There are two ways I've gotten around this issue:

1. Add replicas in the target data center after CDCR bootstrapping has 
completed.

-or-

2. After the bootstrapping has completed, restart the replica nodes one at a 
time in the target data center (restart, wait for the replica to catch up, then 
restart the next).


I recommend doing method #1 over #2 if you can. If you accidentally restart the 
leader node using method #2, it will promote an out-of-sync replica to the 
leader and all followers will receive that out-of-date index.

I also recommend pausing indexing if you can while you let the target replicas 
catch up. I have run into issues where the replicas will not catch up if the 
leader has a fair amount of updates to replay from the source.
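
For method #1, adding the replicas afterwards is just the normal Collections
API call, for example (the collection, shard and node values are placeholders):

    curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=target-node1:8983_solr'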

> On Apr 16, 2018, at 2:15 PM, Amrit Sarkar  wrote:
> 
> Hi Susheel,
> 
> Pretty sure you are talking about this:
> https://issues.apache.org/jira/browse/SOLR-11724
> 
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
> 
> On Mon, Apr 16, 2018 at 11:35 PM, Susheel Kumar 
> wrote:
> 
>> Does anybody know about known issue where CDCR bootstrap sync leaves the
>> replica's on target cluster non touched/out of sync.
>> 
>> After I stopped and restart CDCR, it builds my target leaders index but
>> replica's on target cluster still showing old index / not modified.
>> 
>> 
>> Thnx
>> 




Re: CDCR performance issues

2018-03-23 Thread Tom Peters
Thanks for responding. My responses are inline.

> On Mar 23, 2018, at 8:16 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
> 
> Hey Tom,
> 
> I'm also having issue with replicas in the target data center. It will go
>> from recovering to down. And when one of my replicas go to down in the
>> target data center, CDCR will no longer send updates from the source to
>> the target.
> 
> 
> Are you able to figure out the issue? As long as the leaders of each shard
> in each collection is up and serving, CDCR shouldn't stop.

I cannot replicate the issue I was having. In a test environment, I'm able to 
knock one of the replicas into recovery mode and can verify that CDCR updates 
are still being sent.
> 
> Sometimes we have to reindex a large chunk of our index (1M+ documents).
>> What's the best way to handle this if the normal CDCR process won't be
>> able to keep up? Manually trigger a bootstrap again? Or is there something
>> else we can do?
>> 
> 
> That's one of the limitations of CDCR: it cannot handle bulk indexing. The
> preferable way to do it is:
> * stop cdcr
> * bulk index
> * issue manual BOOTSTRAP (it is independent of stop and start cdcr)
> * start cdcr

I plan on testing this, but if I issue a bootstrap, will I run into the 
https://issues.apache.org/jira/browse/SOLR-11724 
<https://issues.apache.org/jira/browse/SOLR-11724> bug where the bootstrap 
doesn't replicate to the replicas?

> 1. Is it accurate that updates are not actually batched in transit from the
>> source to the target and instead each document is posted separately?
> 
> 
> The batchsize and schedule regulate how many docs are sent across target.
> This has more details:
> https://lucene.apache.org/solr/guide/7_2/cdcr-config.html#the-replicator-element
> 

As far as I can tell, I'm not seeing batching. I'm using tcpdump (and a script 
to decompile the JavaBin bytes) to monitor what is actually being sent and I'm 
seeing documents arrive one-at-a-time.

POST 
/solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199=javabin=2 
HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields:
 [solr_id=Mytest, _version_=1595749902502068224]):null]]}
--
POST 
/solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199=javabin=2 
HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields:
 [solr_id=Mytest, _version_=1595749902600634368]):null]]}
--
POST 
/solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199=javabin=2 
HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields:
 [solr_id=Mytest, _version_=1595749902698151936]):null]]}

> 
> 
> 
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
> 
> On Tue, Mar 13, 2018 at 12:21 AM, Tom Peters <tpet...@synacor.com> wrote:
> 
>> I'm also having issue with replicas in the target data center. It will go
>> from recovering to down. And when one of my replicas go to down in the
>> target data center, CDCR will no longer send updates from the source to the
>> target.
>> 
>>> On Mar 12, 2018, at 9:24 AM, Tom Peters <tpet...@synacor.com> wrote:
>>> 
>>> Anyone have any thoughts on the questions I raised?
>>> 
>>> I have another question related to CDCR:
>>> Sometimes we have to reindex a large chunk of our index (1M+ documents).
>> What's the best way to handle this if the normal CDCR process won't be able
>> to keep up? Manually trigger a bootstrap again? Or is there something else
>> we can do?
>>> 
>>> Thanks.
>>> 
>>> 
>>> 
>>>> On Mar 9, 2018, at 3:59 PM, Tom Peters <tpet...@synacor.com> wrote:
>>>> 
>>>> Thanks. This was helpful. I did some tcpdumps and I'm noticing that the
>> requests to the target data center are not batched in any way. Each update
>> comes in as an independent update. Some foll

Re: CDCR performance issues

2018-03-12 Thread Tom Peters
I'm also having issue with replicas in the target data center. It will go from 
recovering to down. And when one of my replicas go to down in the target data 
center, CDCR will no longer send updates from the source to the target.

> On Mar 12, 2018, at 9:24 AM, Tom Peters <tpet...@synacor.com> wrote:
> 
> Anyone have any thoughts on the questions I raised?
> 
> I have another question related to CDCR:
> Sometimes we have to reindex a large chunk of our index (1M+ documents). 
> What's the best way to handle this if the normal CDCR process won't be able 
> to keep up? Manually trigger a bootstrap again? Or is there something else we 
> can do?
> 
> Thanks.
> 
> 
> 
>> On Mar 9, 2018, at 3:59 PM, Tom Peters <tpet...@synacor.com> wrote:
>> 
>> Thanks. This was helpful. I did some tcpdumps and I'm noticing that the 
>> requests to the target data center are not batched in any way. Each update 
>> comes in as an independent update. Some follow-up questions:
>> 
>> 1. Is it accurate that updates are not actually batched in transit from the 
>> source to the target and instead each document is posted separately?
>> 
>> 2. Are they done synchronously? I assume yes (since you wouldn't want 
>> operations applied out of order)
>> 
>> 3. If they are done synchronously, and are not batched in any way, does that 
>> mean that the best performance I can expect would be roughly how long it 
>> takes to round-trip a single document? ie. If my average ping is 25ms, then 
>> I can expect a peak performance of roughly 40 ops/s.
>> 
>> Thanks
>> 
>> 
>> 
>>> On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C] 
>>> <daniel.da...@nih.gov> wrote:
>>> 
>>> These are general guidelines, I've done loads of networking, but may be 
>>> less familiar with SolrCloud  and CDCR architecture.  However, I know it's 
>>> all TCP sockets, so general guidelines do apply.
>>> 
>>> Check the round-trip time between the data centers using ping or TCP ping.  
>>>  Throughput tests may be high, but if Solr has to wait for a response to a 
>>> request before sending the next action, then just like any network protocol 
>>> that does that, it will get slow.
>>> 
>>> I'm pretty sure CDCR uses HTTP/HTTPS rather than just TCP, so also check 
>>> whether some proxy/load balancer between data centers is causing it to be a 
>>> single connection per operation.   That will *kill* performance.   Some 
>>> proxies default to HTTP/1.0 (open, send request, server send response, 
>>> close), and that will hurt.
>>> 
>>> Why you should listen to me even without SolrCloud knowledge - checkout 
>>> paper "Latency performance of SOAP Implementations".   Same distribution of 
>>> skills - I knew TCP well, but Apache Axis 1.1 not so well.   I still 
>>> improved response time of Apache Axis 1.1 by 250ms per call with 1-line of 
>>> code.
>>> 
>>> -Original Message-
>>> From: Tom Peters [mailto:tpet...@synacor.com] 
>>> Sent: Wednesday, March 7, 2018 6:19 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: CDCR performance issues
>>> 
>>> I'm having issues with the target collection staying up-to-date with 
>>> indexing from the source collection using CDCR.
>>> 
>>> This is what I'm getting back in terms of OPS:
>>> 
>>>  curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
>>>  {
>>>"responseHeader": {
>>>  "status": 0,
>>>  "QTime": 0
>>>},
>>>"operationsPerSecond": [
>>>  "zook01,zook02,zook03/solr",
>>>  [
>>>"mycollection",
>>>[
>>>  "all",
>>>  49.10140553500938,
>>>  "adds",
>>>  10.27612635309587,
>>>  "deletes",
>>>  38.82527896994054
>>>]
>>>  ]
>>>]
>>>  }
>>> 
>>> The source and target collections are in separate data centers.
>>> 
>>> Doing a network test between the leader node in the source data center and 
>>> the ZooKeeper nodes in the target data center show decent enough network 
>>> performance: ~181 Mbit/s
>>> 
>>> I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 
>>> 2000, 2500) and they've haven't made much of a differe

Re: CDCR performance issues

2018-03-12 Thread Tom Peters
Anyone have any thoughts on the questions I raised?

I have another question related to CDCR:
Sometimes we have to reindex a large chunk of our index (1M+ documents). What's 
the best way to handle this if the normal CDCR process won't be able to keep 
up? Manually trigger a bootstrap again? Or is there something else we can do?

Thanks.



> On Mar 9, 2018, at 3:59 PM, Tom Peters <tpet...@synacor.com> wrote:
> 
> Thanks. This was helpful. I did some tcpdumps and I'm noticing that the 
> requests to the target data center are not batched in any way. Each update 
> comes in as an independent update. Some follow-up questions:
> 
> 1. Is it accurate that updates are not actually batched in transit from the 
> source to the target and instead each document is posted separately?
> 
> 2. Are they done synchronously? I assume yes (since you wouldn't want 
> operations applied out of order)
> 
> 3. If they are done synchronously, and are not batched in any way, does that 
> mean that the best performance I can expect would be roughly how long it 
> takes to round-trip a single document? ie. If my average ping is 25ms, then I 
> can expect a peak performance of roughly 40 ops/s.
> 
> Thanks
> 
> 
> 
>> On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C] 
>> <daniel.da...@nih.gov> wrote:
>> 
>> These are general guidelines, I've done loads of networking, but may be less 
>> familiar with SolrCloud  and CDCR architecture.  However, I know it's all 
>> TCP sockets, so general guidelines do apply.
>> 
>> Check the round-trip time between the data centers using ping or TCP ping.   
>> Throughput tests may be high, but if Solr has to wait for a response to a 
>> request before sending the next action, then just like any network protocol 
>> that does that, it will get slow.
>> 
>> I'm pretty sure CDCR uses HTTP/HTTPS rather than just TCP, so also check 
>> whether some proxy/load balancer between data centers is causing it to be a 
>> single connection per operation.   That will *kill* performance.   Some 
>> proxies default to HTTP/1.0 (open, send request, server send response, 
>> close), and that will hurt.
>> 
>> Why you should listen to me even without SolrCloud knowledge - checkout 
>> paper "Latency performance of SOAP Implementations".   Same distribution of 
>> skills - I knew TCP well, but Apache Axis 1.1 not so well.   I still 
>> improved response time of Apache Axis 1.1 by 250ms per call with 1-line of 
>> code.
>> 
>> -Original Message-
>> From: Tom Peters [mailto:tpet...@synacor.com] 
>> Sent: Wednesday, March 7, 2018 6:19 PM
>> To: solr-user@lucene.apache.org
>> Subject: CDCR performance issues
>> 
>> I'm having issues with the target collection staying up-to-date with 
>> indexing from the source collection using CDCR.
>> 
>> This is what I'm getting back in terms of OPS:
>> 
>>   curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
>>   {
>> "responseHeader": {
>>   "status": 0,
>>   "QTime": 0
>> },
>> "operationsPerSecond": [
>>   "zook01,zook02,zook03/solr",
>>   [
>> "mycollection",
>> [
>>   "all",
>>   49.10140553500938,
>>   "adds",
>>   10.27612635309587,
>>   "deletes",
>>   38.82527896994054
>> ]
>>   ]
>> ]
>>   }
>> 
>> The source and target collections are in separate data centers.
>> 
>> Doing a network test between the leader node in the source data center and 
>> the ZooKeeper nodes in the target data center show decent enough network 
>> performance: ~181 Mbit/s
>> 
>> I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 
>> 2000, 2500) and they haven't made much of a difference.
>> 
>> Any suggestions on potential settings to tune to improve the performance?
>> 
>> Thanks
>> 
>> --
>> 
>> Here's some relevant log lines from the source data center's leader:
>> 
>>   2018-03-07 23:16:11.984 INFO  
>> (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
>> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
>> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
>> o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>>   2018-03-07 23:16:23.062 INFO  
>> (cdcr-replicator-207-thread-4-proce

Re: CDCR performance issues

2018-03-09 Thread Tom Peters
Thanks. This was helpful. I did some tcpdumps and I'm noticing that the 
requests to the target data center are not batched in any way. Each update 
comes in as an independent update. Some follow-up questions:

1. Is it accurate that updates are not actually batched in transit from the 
source to the target and instead each document is posted separately?

2. Are they done synchronously? I assume yes (since you wouldn't want 
operations applied out of order)

3. If they are done synchronously, and are not batched in any way, does that 
mean that the best performance I can expect would be roughly how long it takes 
to round-trip a single document? ie. If my average ping is 25ms, then I can 
expect a peak performance of roughly 40 ops/s.

Thanks



> On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C] 
> <daniel.da...@nih.gov> wrote:
> 
> These are general guidelines, I've done loads of networking, but may be less 
> familiar with SolrCloud  and CDCR architecture.  However, I know it's all TCP 
> sockets, so general guidelines do apply.
> 
> Check the round-trip time between the data centers using ping or TCP ping.   
> Throughput tests may be high, but if Solr has to wait for a response to a 
> request before sending the next action, then just like any network protocol 
> that does that, it will get slow.
> 
> I'm pretty sure CDCR uses HTTP/HTTPS rather than just TCP, so also check 
> whether some proxy/load balancer between data centers is causing it to be a 
> single connection per operation.   That will *kill* performance.   Some 
> proxies default to HTTP/1.0 (open, send request, server send response, 
> close), and that will hurt.
> 
> Why you should listen to me even without SolrCloud knowledge - checkout paper 
> "Latency performance of SOAP Implementations".   Same distribution of skills 
> - I knew TCP well, but Apache Axis 1.1 not so well.   I still improved 
> response time of Apache Axis 1.1 by 250ms per call with 1-line of code.
> 
> -Original Message-
> From: Tom Peters [mailto:tpet...@synacor.com] 
> Sent: Wednesday, March 7, 2018 6:19 PM
> To: solr-user@lucene.apache.org
> Subject: CDCR performance issues
> 
> I'm having issues with the target collection staying up-to-date with indexing 
> from the source collection using CDCR.
> 
> This is what I'm getting back in terms of OPS:
> 
>curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
>{
>  "responseHeader": {
>"status": 0,
>"QTime": 0
>  },
>  "operationsPerSecond": [
>"zook01,zook02,zook03/solr",
>[
>  "mycollection",
>  [
>"all",
>49.10140553500938,
>"adds",
>10.27612635309587,
>"deletes",
>38.82527896994054
>  ]
>]
>  ]
>}
> 
> The source and target collections are in separate data centers.
> 
> Doing a network test between the leader node in the source data center and 
> the ZooKeeper nodes in the target data center show decent enough network 
> performance: ~181 Mbit/s
> 
> I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 
> 2000, 2500) and they haven't made much of a difference.
> 
> Any suggestions on potential settings to tune to improve the performance?
> 
> Thanks
> 
> --
> 
> Here's some relevant log lines from the source data center's leader:
> 
>2018-03-07 23:16:11.984 INFO  
> (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>2018-03-07 23:16:23.062 INFO  
> (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 510 updates to target mycollection
>2018-03-07 23:16:32.063 INFO  
> (cdcr-replicator-207-thread-5-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>2018-03-07 23:16:36.209 INFO  
> (cdcr-replicator-207-thread-1-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n

Re: CDCR performance issues

2018-03-08 Thread Tom Peters
So I'm continuing to look into this and not making much headway, but I have 
additional questions now as well.

I restarted the nodes in the source data center to see if it would have any 
impact. It appeared to initiate another bootstrap with the target. The lag and 
queueSize were brought back down to zero.

Over the next two hours the queueSize has grown back to 106,122 (as reported by 
solr/mycollection/cdcr?action=QUEUES). When I actually look at what we sent to 
Solr though, I only deleted or added a total of 3,805 documents. Could this be 
part of the problem? Should queueSize be representative of the total number of 
document updates, or are there other updates under the hood that I wouldn't see 
that would still need to be tracked by Solr?

Also, I'd welcome any other suggestions on my original issue, which is that 
CDCR cannot keep up despite the relatively low number of updates (3,805 over 
two hours).

Thanks. 

> On Mar 7, 2018, at 6:19 PM, Tom Peters <tpet...@synacor.com> wrote:
> 
> I'm having issues with the target collection staying up-to-date with indexing 
> from the source collection using CDCR.
> 
> This is what I'm getting back in terms of OPS:
> 
>curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
>{
>  "responseHeader": {
>"status": 0,
>"QTime": 0
>  },
>  "operationsPerSecond": [
>"zook01,zook02,zook03/solr",
>[
>  "mycollection",
>  [
>"all",
>49.10140553500938,
>"adds",
>10.27612635309587,
>"deletes",
>38.82527896994054
>  ]
>]
>  ]
>}
> 
> The source and target collections are in separate data centers.
> 
> Doing a network test between the leader node in the source data center and 
> the ZooKeeper nodes in the target data center
> show decent enough network performance: ~181 Mbit/s
> 
> I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 
> 2000, 2500) and they've haven't made much of a difference.
> 
> Any suggestions on potential settings to tune to improve the performance?
> 
> Thanks
> 
> --
> 
> Here's some relevant log lines from the source data center's leader:
> 
>2018-03-07 23:16:11.984 INFO  
> (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>2018-03-07 23:16:23.062 INFO  
> (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 510 updates to target mycollection
>2018-03-07 23:16:32.063 INFO  
> (cdcr-replicator-207-thread-5-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>2018-03-07 23:16:36.209 INFO  
> (cdcr-replicator-207-thread-1-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
>2018-03-07 23:16:42.091 INFO  
> (cdcr-replicator-207-thread-2-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
>2018-03-07 23:16:46.790 INFO  
> (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
>2018-03-07 23:16:50.004 INFO  
> (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr 
> x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
> [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
> o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
> 
> 
> And what the log looks like in the target:
> 
>2018-03-07 23:18:46.475 INFO  (qtp1595212853-26) [c:mycollection s:shard1 
> r:core_n

CDCR performance issues

2018-03-07 Thread Tom Peters
I'm having issues with the target collection staying up-to-date with indexing 
from the source collection using CDCR.
 
This is what I'm getting back in terms of OPS:

curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
{
  "responseHeader": {
"status": 0,
"QTime": 0
  },
  "operationsPerSecond": [
"zook01,zook02,zook03/solr",
[
  "mycollection",
  [
"all",
49.10140553500938,
"adds",
10.27612635309587,
"deletes",
38.82527896994054
  ]
]
  ]
}

The source and target collections are in separate data centers.

Doing a network test between the leader node in the source data center and the 
ZooKeeper nodes in the target data center
show decent enough network performance: ~181 Mbit/s

I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 
2000, 2500) and they haven't made much of a difference.
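
For reference, the replicator section of the source solrconfig.xml is
essentially the stock example (the zkHost and collection names below are
placeholders, and batchSize is whichever value I was testing at the time):

    <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
      <lst name="replica">
        <str name="zkHost">zook01,zook02,zook03/solr</str>
        <str name="source">mycollection</str>
        <str name="target">mycollection</str>
      </lst>
      <lst name="replicator">
        <str name="threadPoolSize">8</str>
        <str name="schedule">1000</str>
        <str name="batchSize">128</str>
      </lst>
      <lst name="updateLogSynchronizer">
        <str name="schedule">1000</str>
      </lst>
    </requestHandler>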

Any suggestions on potential settings to tune to improve the performance?

Thanks

--

Here's some relevant log lines from the source data center's leader:

2018-03-07 23:16:11.984 INFO  
(cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
2018-03-07 23:16:23.062 INFO  
(cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 510 updates to target mycollection
2018-03-07 23:16:32.063 INFO  
(cdcr-replicator-207-thread-5-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
2018-03-07 23:16:36.209 INFO  
(cdcr-replicator-207-thread-1-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
2018-03-07 23:16:42.091 INFO  
(cdcr-replicator-207-thread-2-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
2018-03-07 23:16:46.790 INFO  
(cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
2018-03-07 23:16:50.004 INFO  
(cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr 
x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) 
[c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] 
o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection


And what the log looks like in the target:

2018-03-07 23:18:46.475 INFO  (qtp1595212853-26) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067896487950==javabin=2}
 status=0 QTime=0
2018-03-07 23:18:46.500 INFO  (qtp1595212853-25) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067896487951==javabin=2}
 status=0 QTime=0
2018-03-07 23:18:46.525 INFO  (qtp1595212853-24) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067897536512==javabin=2}
 status=0 QTime=0
2018-03-07 23:18:46.550 INFO  (qtp1595212853-3793) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067897536513==javabin=2}
 status=0 QTime=0
2018-03-07 23:18:46.575 INFO  (qtp1595212853-30) [c:mycollection s:shard1 
r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1]  webapp=/solr path=/update 
params={_stateVer_=mycollection:30&_version_=-1594317067897536514==javabin=2}
 status=0 QTime=0
2018-03-07 23:18:46.600 INFO  (qtp1595212853-26) [c:mycollection s:shard1 
r:core_node2 

Re: Issues with CDCR in Solr 7.1

2018-03-05 Thread Tom Peters
You can ignore this. I think I found the issue (I was missing a block of XML in 
the source config). I'm going to monitor it over the next day and see if it was 
resolved.

> On Mar 5, 2018, at 4:29 PM, Tom Peters <tpet...@synacor.com> wrote:
> 
> I'm trying to get Solr CDCR setup in Solr 7.1 and I'm having issues 
> post-bootstrap.
> 
> I have about 5,572,933 documents in the source cluster (index size is 3.77 
> GB). I'm enabling CDCR in the following manner:
> 
> 1. Delete the existing cluster in the target data center
>   admin/collections?action=DELETE=mycollection
> 
> 2. Stop indexing in source data center
> 
> 3. Do one final hard commit in source data center
>   update -d '{"commit":{}}'
> 
> 4. Create the cluster in the target datacenter
>   
> admin/collections?action=CREATE=mycollection=1=myconfig
> 
>   Note: I'm only creating one replica initially because there is a bug 
> that prevents the bootstrap index from replicating to the replicas
> 
> 5. Disable the buffer in the target data center
>   cdcr?action=DISABLEBUFFER
> 
>   Note: the buffer has already been disabled in the source
> 
> 6. Start CDCR in the source data center
>   cdcr?action=START
> 
> 7. Monitor cdcr?action=BOOTSTRAP_STATUS and wait for complete message
>   NOTE: At this point I can confirm that the documents count in both the 
> source and target data centers are identical
> 
> 8. Re-enable indexing on source
> 
> 
> I'm not seeing any new documents in the target cluster, even after a commit. 
> The document count in the target does change, but it's nothing new. Looking 
> at the logs, I do see plenty of messages like:
>   SOURCE:
> 2018-03-05 21:20:06.290 INFO (qtp1595212853-65472) [c:mycollection 
> s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.c.S.Request 
> [mycollection_shard1_replica_n6] webapp=/solr path=/cdcr 
> params={action=LASTPROCESSEDVERSION=javabin=2} status=0 QTime=0
> 2018-03-05 21:20:06.430 INFO 
> (cdcr-replicator-79-thread-2-processing-n:solr2-a:8080_solr) [ ] 
> o.a.s.h.CdcrReplicator Forwarded 128 updates to target mycollection
> 
>   TARGET:
> 2018-03-05 21:19:38.637 INFO (qtp1595212853-134) [c:mycollection 
> s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
> [mycollection_shard1_replica_n1] webapp=/solr path=/update 
> params={_stateVer_=mycollection:52&_version_=-1593959559286751241==javabin=2}
>  status=0 QTime=0
> 
> 
> The weird thing though is that the lastTimestamp is from a couple days ago 
> when I query cdcr?action=QUEUES
> 
> {
>  "responseHeader": {
>"status": 0,
>"QTime": 24
>  },
>  "queues": [
>"zook01.be,zook02.be,zook03.be/solr",
>[
>  "mycollection",
>  [
>"queueSize",
>8685952,
>"lastTimestamp",
>"2018-03-03T23:07:14.179Z"
>  ]
>]
>  ],
>  "tlogTotalSize": 3458777355,
>  "tlogTotalCount": 5226,
>  "updateLogSynchronizer": "stopped"
> }
> 
> 
> Ultimately my questions are:
> 
> 1. Why am I not seeing updates in the target datacenter after bootstrapping 
> has completed?
> 
> 2. Is there anything I need to do to "reset" the bootstrap if I blow away the 
> target data center and start from scratch again?
> 
> 3. Am I missing anything?
> 
> Thanks for taking the time to read this.
> 
> 
> This message and any attachment may contain information that is confidential 
> and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
> this e-mail or any attached file by anyone other than the intended recipient 
> is strictly prohibited. If you have received this message in error, please 
> notify the sender by reply email and delete the message and any attachments. 
> Thank you.



This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


Issues with CDCR in Solr 7.1

2018-03-05 Thread Tom Peters
I'm trying to get Solr CDCR setup in Solr 7.1 and I'm having issues 
post-bootstrap.

I have about 5,572,933 documents in the source cluster (index size is 3.77 GB). 
I'm enabling CDCR in the following manner:

1. Delete the existing cluster in the target data center
admin/collections?action=DELETE=mycollection

2. Stop indexing in source data center

3. Do one final hard commit in source data center
update -d '{"commit":{}}'

4. Create the cluster in the target datacenter

admin/collections?action=CREATE=mycollection=1=myconfig

Note: I'm only creating one replica initially because there is a bug 
that prevents the bootstrap index from replicating to the replicas

5. Disable the buffer in the target data center
cdcr?action=DISABLEBUFFER

Note: the buffer has already been disabled in the source

6. Start CDCR in the source data center
cdcr?action=START

7. Monitor cdcr?action=BOOTSTRAP_STATUS and wait for complete message
NOTE: At this point I can confirm that the documents count in both the 
source and target data centers are identical

8. Re-enable indexing on source


I'm not seeing any new documents in the target cluster, even after a commit. 
The document count in the target does change, but it's nothing new. Looking at 
the logs, I do see plenty of messages like:
SOURCE:
  2018-03-05 21:20:06.290 INFO (qtp1595212853-65472) [c:mycollection 
s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.c.S.Request 
[mycollection_shard1_replica_n6] webapp=/solr path=/cdcr 
params={action=LASTPROCESSEDVERSION=javabin=2} status=0 QTime=0
  2018-03-05 21:20:06.430 INFO 
(cdcr-replicator-79-thread-2-processing-n:solr2-a:8080_solr) [ ] 
o.a.s.h.CdcrReplicator Forwarded 128 updates to target mycollection

TARGET:
  2018-03-05 21:19:38.637 INFO (qtp1595212853-134) [c:mycollection 
s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request 
[mycollection_shard1_replica_n1] webapp=/solr path=/update 
params={_stateVer_=mycollection:52&_version_=-1593959559286751241==javabin=2}
 status=0 QTime=0


The weird thing though is that the lastTimestamp is from a couple days ago when 
I query cdcr?action=QUEUES

{
  "responseHeader": {
"status": 0,
"QTime": 24
  },
  "queues": [
"zook01.be,zook02.be,zook03.be/solr",
[
  "mycollection",
  [
"queueSize",
8685952,
"lastTimestamp",
"2018-03-03T23:07:14.179Z"
  ]
]
  ],
  "tlogTotalSize": 3458777355,
  "tlogTotalCount": 5226,
  "updateLogSynchronizer": "stopped"
}


Ultimately my questions are:

1. Why am I not seeing updates in the target datacenter after bootstrapping has 
completed?

2. Is there anything I need to do to "reset" the bootstrap if I blow away the 
target data center and start from scratch again?

3. Am I missing anything?

Thanks for taking the time to read this.


This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


Re: /var/solr/data has lots of index* directories

2018-03-05 Thread Tom Peters
Thanks. I went ahead and did that.

I think the multiple directories stemmed from an issue I sent to the list a 
week or two ago about deleteByQueries knocking my replicas offline.
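
For anyone hitting the same thing, the quick check looks something like the 
sketch below (the core path is just an example):

    # index.properties names the directory the core is actively using;
    # the other index.* directories are leftovers that can be deleted
    $ cat /var/solr/data/mycollection_shard1_replica_n2/data/index.properties
    index=index.20180223041040318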

> On Mar 5, 2018, at 1:44 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> 
> wrote:
> 
> You can look inside the index.properties. The directory name mentioned in
> that properties file is the one being used actively. The rest are old
> directories that should be cleaned up on Solr restart but you can delete
> them yourself without any issues.
> 
> On Mon, Mar 5, 2018 at 11:43 PM, Tom Peters <tpet...@synacor.com> wrote:
> 
>> While trying to debug an issue with CDCR, I noticed that the
>> /var/solr/data directories on my source cluster have wildly different sizes.
>> 
>>  % for i in solr2-{a..e}; do echo -n "$i: "; ssh -A $i du -sh
>> /var/solr/data; done
>>  solr2-a: 9.5G   /var/solr/data
>>  solr2-b: 29G    /var/solr/data
>>  solr2-c: 6.6G   /var/solr/data
>>  solr2-d: 9.7G   /var/solr/data
>>  solr2-e: 19G    /var/solr/data
>> 
>> The leader is currently "solr2-a"
>> 
>> Here's the actual index size:
>> 
>>  Master (Searching)
>>  1520273178244 # version
>>  73034 # gen
>>  3.66 GB   # size
>> 
>> When I look inside /var/solr/data/ on solr2-b, I see a bunch of index.*
>> directories:
>> 
>>  % ls | grep index
>>  index.20180223021742634
>>  index.20180223024901983
>>  index.20180223033852960
>>  index.20180223034811193
>>  index.20180223035648403
>>  index.20180223041040318
>>  index.properties
>> 
>> On solr2-a, I only see one index directory (index.20180222192820572).
>> 
>> Does anyone know why this will happen and how I can clean it up without
>> potentially causing any issues? We're currently on version Solr 7.1.
>> 
>> 
>> This message and any attachment may contain information that is
>> confidential and/or proprietary. Any use, disclosure, copying, storing, or
>> distribution of this e-mail or any attached file by anyone other than the
>> intended recipient is strictly prohibited. If you have received this
>> message in error, please notify the sender by reply email and delete the
>> message and any attachments. Thank you.
>> 
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.



This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


/var/solr/data has lots of index* directories

2018-03-05 Thread Tom Peters
While trying to debug an issue with CDCR, I noticed that the /var/solr/data 
directories on my source cluster have wildly different sizes.

  % for i in solr2-{a..e}; do echo -n "$i: "; ssh -A $i du -sh /var/solr/data; 
done
  solr2-a: 9.5G   /var/solr/data
  solr2-b: 29G    /var/solr/data
  solr2-c: 6.6G   /var/solr/data
  solr2-d: 9.7G   /var/solr/data
  solr2-e: 19G    /var/solr/data

The leader is currently "solr2-a"

Here's the actual index size:

  Master (Searching)
  1520273178244 # version
  73034 # gen
  3.66 GB   # size

When I look inside /var/solr/data/ on solr2-b, I see a bunch of index.* 
directories:

  % ls | grep index
  index.20180223021742634
  index.20180223024901983
  index.20180223033852960
  index.20180223034811193
  index.20180223035648403
  index.20180223041040318
  index.properties

On solr2-a, I only see one index directory (index.20180222192820572).

Does anyone know why this will happen and how I can clean it up without 
potentially causing any issues? We're currently on version Solr 7.1.


This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


Is there a way to sort by conditional function in the Solr 7.2 JSON API?

2018-03-02 Thread Tom Van Cuyck
Hi,

In the Solr 7.2 JSON API, when faceting over terms, I would like to sort
the buckets over the average of a numerical property, as shown below

curl http://localhost:8983/solr/core/select -d '
q=*:*&
rows=0&
wt=json&
json.facet={
 "field" : {
"type" : "terms",
"field" : "string-field",
"sort" : "avg desc",
"limit" : 50,
facet : {
avg : "avg(number_i)",
unique : "unique(number_i)"
   }
  }
}'


However, when none of the documents in a bucket has a value for the
numerical property (e.g. unique = 0 in this case), an average value avg = 0
is returned.
This average value of 0 is then used for sorting the buckets.

I would like the buckets with no value for the numerical property to be
sorted last.
Is there a way to e.g. use conditional sorting? E.g.
sort: "if(gt(unique,0),avg,-9) desc"

I can't get this to work, while in the old API this appaers to be possible.

Or is there another way to sort the buckets with a missing numeric value
last?

Kind regards, Tom


Re: Indexing timeout issues with SolrCloud 7.1

2018-03-01 Thread Tom Peters
Thanks Erick. I found an older mailing list thread online where someone had 
similar issues to what I was experiencing 
(http://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html
 
<http://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html>).

I decided to try and rewrite our indexing code to use delete by ID as opposed 
to delete by query (we deployed it today) and it appears to have significantly 
improved the indexing performance and reliability of the replicas.
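
For context, the change amounts to swapping the delete-by-query form of the 
JSON update command for the delete-by-id form, roughly like this (hostname, 
collection name and ids are placeholders):

    # Before: delete by query, then re-add the documents
    curl 'myhost:8080/solr/mycollection/update' -H 'Content-Type: application/json' \
      -d '{"delete": {"query": "object_id:12345"}}'

    # After: delete the existing documents by their unique ids instead
    curl 'myhost:8080/solr/mycollection/update' -H 'Content-Type: application/json' \
      -d '{"delete": ["12345-window1", "12345-window2"]}'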



> On Feb 26, 2018, at 12:08 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> DBQ is something of a heavyweight action. Basically in order to
> preserve ordering it has to lock out updates while it executes since
> all docs (which may live on all shards) have to be deleted before
> subsequent adds of one of the affected docs is processed. In order to
> do that, things need to be locked.
> 
> Delete-by-id OTOH, can use the normal optimistic locking to insure
> proper ordering. So if object_id is your uniqueKey, this may be much
> more robust if you delete-by-id
> 
> Best,
> Erick
> 
> On Sat, Feb 24, 2018 at 1:37 AM, Deepak Goel <deic...@gmail.com> wrote:
>> From the error list, i can see multiple errors:
>> 
>> 1. Failure to recover replica
>> 2. Peer sync error
>> 3. Failure to download file
>> 
>> On 24 Feb 2018 03:10, "Tom Peters" <tpet...@synacor.com> wrote:
>> 
>> I included the last 25 lines from the logs from each of the five nodes
>> during that time period.
>> 
>> I _think_ I'm running into issues with bulking up deleteByQuery. Quick
>> background: we have objects in our system that may have multiple
>> availability windows. So when we index an object, will store it as separate
>> documents each with their own begins and expires date. At index time we
>> don't know if the all of the windows are still valid or not, so we remove
>> all of them with a deleteByQuery (e.g. deleteByQuery=object_id:12345) and
>> then index one or more documents.
>> 
>> I ran an isolated test a number of times where I indexed 1500 documents in
>> this manner (deletes then index). In Solr 3.4, it takes about 15s to
>> complete. In Solr 7.1, it's taking about 5m. If I remove the deleteByQuery,
>> the indexing times are nearly identical.
>> 
>> When run in normal production mode where we have lots of processes indexing
>> at once (~20 or so), it starts to cause lots of issues (which you see
>> below).
>> 
>> 
>> Please let me know if anything I mentioned is unclear. Thanks!
>> 
>> 
>> 
>> 
>> solr2-a:
>> 2018-02-23 04:09:36.551 ERROR (updateExecutor-2-thread-2672-
>> processing-http:solr2-b:8080//solr//mycollection_shard1_replica_n1
>> x:mycollection_shard1_replica_n6 r:core_node9 n:solr2-a.vam.be.cmh.
>> mycollection.com:8080_solr s:shard1 c:mycollection) [c:mycollection
>> s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.
>> ErrorReportingConcurrentUpdateSolrClient error
>> 2018-02-23 04:09:36.551 ERROR (updateExecutor-2-thread-2692-
>> processing-http:solr2-d:8080//solr//mycollection_shard1_replica_n11
>> x:mycollection_shard1_replica_n6 r:core_node9 n:solr2-a.vam.be.cmh.
>> mycollection.com:8080_solr s:shard1 c:mycollection) [c:mycollection
>> s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.
>> ErrorReportingConcurrentUpdateSolrClient error
>> 2018-02-23 04:09:36.551 ERROR (updateExecutor-2-thread-2711-
>> processing-http:solr2-e:8080//solr//mycollection_shard1_replica_n4
>> x:mycollection_shard1_replica_n6 r:core_node9 n:solr2-a.vam.be.cmh.
>> mycollection.com:8080_solr s:shard1 c:mycollection) [c:mycollection
>> s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.u.
>> ErrorReportingConcurrentUpdateSolrClient error
>> 2018-02-23 04:09:36.552 ERROR (qtp1595212853-32739) [c:mycollection
>> s:shard1 r:core_node9 x:mycollection_shard1_replica_n6]
>> o.a.s.u.p.DistributedUpdateProcessor
>> Setting up to try to start recovery on replica http://solr2-b:8080/solr/
>> mycollection_shard1_replica_n1/
>> 2018-02-23 04:09:36.552 ERROR (qtp1595212853-32739) [c:mycollection
>> s:shard1 r:core_node9 x:mycollection_shard1_replica_n6]
>> o.a.s.u.p.DistributedUpdateProcessor
>> Setting up to try to start recovery on replica http://solr2-d:8080/solr/
>> mycollection_shard1_replica_n11/
>> 2018-02-23 04:09:36.552 ERROR (qtp1595212853-32739) [c:mycollection
>> s:shard1 r:core_node9 x:mycollection_shard1_replica_n6]
>> o.a.s.u.p.DistributedUpdateProcessor
>> Setting up to try to start recov

Re: Indexing timeout issues with SolrCloud 7.1

2018-02-23 Thread Tom Peters
-n:solr2-e:8080_solr 
x:mycollection_shard1_replica_n4 s:shard1 c:mycollection r:core_node7) 
[c:mycollection s:shard1 r:core_node7 x:mycollection_shard1_replica_n4] 
o.a.s.h.IndexFetcher Error deleting file: 
tlog.0046787.1593163366289899520
2018-02-23 04:12:22.405 ERROR 
(recoveryExecutor-3-thread-6-processing-n:solr2-e:8080_solr 
x:mycollection_shard1_replica_n4 s:shard1 c:mycollection r:core_node7) 
[c:mycollection s:shard1 r:core_node7 x:mycollection_shard1_replica_n4] 
o.a.s.c.RecoveryStrategy Error while trying to 
recover:org.apache.solr.common.SolrException: Replication for recovery failed.
2018-02-23 04:12:22.405 ERROR 
(recoveryExecutor-3-thread-6-processing-n:solr2-e:8080_solr 
x:mycollection_shard1_replica_n4 s:shard1 c:mycollection r:core_node7) 
[c:mycollection s:shard1 r:core_node7 x:mycollection_shard1_replica_n4] 
o.a.s.c.RecoveryStrategy Recovery failed - trying again... (1)
2018-02-23 04:12:22.405 ERROR 
(recoveryExecutor-3-thread-6-processing-n:solr2-e:8080_solr 
x:mycollection_shard1_replica_n4 s:shard1 c:mycollection r:core_node7) 
[c:mycollection s:shard1 r:core_node7 x:mycollection_shard1_replica_n4] 
o.a.s.h.ReplicationHandler Index fetch failed 
:org.apache.solr.common.SolrException: Unable to download 
tlog.0046787.1593163366289899520 completely. Downloaded 0!=179060


> On Feb 23, 2018, at 4:15 PM, Deepak Goel <deic...@gmail.com> wrote:
> 
> Can you please post all the errors? The current error is only for the node
> 'solr-2d'
> 
> On 23 Feb 2018 09:42, "Tom Peters" <tpet...@synacor.com> wrote:
> 
> I'm trying to debug why indexing in SolrCloud 7.1 is having so many issues.
> It will hang most of the time, and timeout the rest.
> 
> Here's an example:
> 
>time curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d
> '{"solr_id":"test_001", "data_type":"test"}'|jq .
>{
>  "responseHeader": {
>"status": 0,
>"QTime": 5004
>  }
>}
>curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d   0.00s
> user 0.00s system 0% cpu 5.025 total
>jq .  0.01s user 0.00s system 0% cpu 5.025 total
> 
> Here's some of the timeout errors I'm seeing:
> 
>2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection
> s:shard1 r:core_node12 x:mycollection_shard1_replica_n11]
> o.a.s.h.RequestHandlerBase java.io.IOException:
> java.util.concurrent.TimeoutException:
> Idle timeout expired: 12/12 ms
>2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection
> s:shard1 r:core_node12 x:mycollection_shard1_replica_n11]
> o.a.s.s.HttpSolrCall null:java.io.IOException:
> java.util.concurrent.TimeoutException:
> Idle timeout expired: 12/12 ms
>2018-02-23 03:55:36.517 ERROR (recoveryExecutor-3-thread-4-
> processing-n:solr2-d.myhost:8080_solr x:mycollection_shard1_replica_n11
> s:shard1 c:mycollection r:core_node12) [c:mycollection s:shard1
> r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.h.ReplicationHandler
> Index fetch failed :org.apache.solr.common.SolrException: Index fetch
> failed :
>2018-02-23 03:55:36.517 ERROR (recoveryExecutor-3-thread-4-
> processing-n:solr2-d.myhost:8080_solr x:mycollection_shard1_replica_n11
> s:shard1 c:mycollection r:core_node12) [c:mycollection s:shard1
> r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.c.RecoveryStrategy
> Error while trying to recover:org.apache.solr.common.SolrException:
> Replication for recovery failed.
> 
> 
> We currently have two separate Solr clusters. Our current in-production
> cluster which runs on Solr 3.4 and a new ring that I'm trying to bring up
> which runs on SolrCloud 7.1. I have the exact same code that is indexing to
> both clusters. The Solr 3.4 indexes fine, but I'm running into lots of
> issues with SolrCloud 7.1.
> 
> 
> Some additional details about the setup:
> 
> * 5 nodes solr2-a through solr2-e.
> * 5 replicas
> * 1 shard
> * The servers have 48G of RAM with -Xmx and -Xms set to 16G
> * I currently have soft commits at 10m intervals and hard commits (with
> openSearcher=false) at 1m intervals. I also tried 5m (soft) and 15s (hard)
> as well.
> 
> Any help or pointers would be greatly appreciated. Thanks!
> 
> 
> This message and any attachment may contain information that is
> confidential and/or proprietary. Any use, disclosure, copying, storing, or
> distribution of this e-mail or any attached file by anyone other than the
> intended recipient is strictly prohibited. If you have received this
> message in error, please notify the sender by reply email and delete the
> message and any attachments. Thank you.



This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


Indexing timeout issues with SolrCloud 7.1

2018-02-22 Thread Tom Peters
I'm trying to debug why indexing in SolrCloud 7.1 is having so many issues. It 
will hang most of the time, and timeout the rest.

Here's an example:

time curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d 
'{"solr_id":"test_001", "data_type":"test"}'|jq .
{
  "responseHeader": {
"status": 0,
"QTime": 5004
  }
}
curl -s 'myhost:8080/solr/mycollection/update/json/docs' -d   0.00s user 
0.00s system 0% cpu 5.025 total
jq .  0.01s user 0.00s system 0% cpu 5.025 total

Here's some of the timeout errors I'm seeing:

2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection s:shard1 
r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.h.RequestHandlerBase 
java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout 
expired: 12/12 ms
2018-02-23 03:55:02.903 ERROR (qtp1595212853-3607) [c:mycollection s:shard1 
r:core_node12 x:mycollection_shard1_replica_n11] o.a.s.s.HttpSolrCall 
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout 
expired: 12/12 ms
2018-02-23 03:55:36.517 ERROR 
(recoveryExecutor-3-thread-4-processing-n:solr2-d.myhost:8080_solr 
x:mycollection_shard1_replica_n11 s:shard1 c:mycollection r:core_node12) 
[c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] 
o.a.s.h.ReplicationHandler Index fetch failed 
:org.apache.solr.common.SolrException: Index fetch failed :
2018-02-23 03:55:36.517 ERROR 
(recoveryExecutor-3-thread-4-processing-n:solr2-d.myhost:8080_solr 
x:mycollection_shard1_replica_n11 s:shard1 c:mycollection r:core_node12) 
[c:mycollection s:shard1 r:core_node12 x:mycollection_shard1_replica_n11] 
o.a.s.c.RecoveryStrategy Error while trying to 
recover:org.apache.solr.common.SolrException: Replication for recovery failed.


We currently have two separate Solr clusters. Our current in-production cluster 
which runs on Solr 3.4 and a new ring that I'm trying to bring up which runs on 
SolrCloud 7.1. I have the exact same code that is indexing to both clusters. 
The Solr 3.4 indexes fine, but I'm running into lots of issues with SolrCloud 
7.1.


Some additional details about the setup:

* 5 nodes solr2-a through solr2-e.
* 5 replicas
* 1 shard
* The servers have 48G of RAM with -Xmx and -Xms set to 16G
* I currently have soft commits at 10m intervals and hard commits (with 
openSearcher=false) at 1m intervals. I also tried 5m (soft) and 15s (hard) as 
well.
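
For reference, a sketch of how those intervals (1m hard with openSearcher=false, 
10m soft) are typically expressed in solrconfig.xml:

    <autoCommit>
      <maxTime>60000</maxTime>            <!-- 1 minute -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>600000</maxTime>           <!-- 10 minutes -->
    </autoSoftCommit>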

Any help or pointers would be greatly appreciated. Thanks!


This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


Issues with refine parameter when subfaceting in a range facet

2018-01-24 Thread Tom Van Cuyck
Hi,

We encountered an issue when using the refine parameter when subfaceting in
a range facet.
When enabling the refine option, the counts of the response are the double
of the counts of the response without refine option.
We are running Solr 6.6.1 in a cloud setup.

If I execute the query:

curl http://localhost:8899/solr/data/select -d '{ "params" :
{"wt":"json","rows":0,"json.facet":"
  {

\"MaximumAge_f\":
{
  \"type\":\"range\",
  \"field\":\"MaximumAge_f\",
  \"start\":0.0,
  \"end\":55000.0,
  \"gap\":1000.0,
  \"other\":\"between\",
  \"facet\":
  {
\"Gender_sf\":
{
  \"type\":\"terms\",
  \"field\":\"Gender_sf\",
  \"missing\":true,
*  \"refine\":true,*
  \"overrequest\":24,
  \"limit\":12,
  \"offset\":0
}
  }
}
  }",
  "q":"*:*"
}'

I get the following response:

  "facets": {
"count": 379417,
"MaximumAge_f": {
  "buckets": [
{
  "val": 0,
  "count": 8252,
  "Gender_sf": {
"buckets": [
  {
"val": "All",
"count": 8152
  },
  {
"val": "Male",
"count": 74
  },
  {
"val": "Female",
"count": 26
  }
],
"missing": {
  "count": 0
}
  }
},
...

If I execute the same query WITHOUT refine: true in the subfacet, I get the
following response:

  "facets": {
"count": 379417,
"MaximumAge_f": {
  "buckets": [
{
  "val": 0,
  "count": 4126,
  "Gender_sf": {
"buckets": [
  {
"val": "All",
"count": 4076
  },
  {
"val": "Male",
"count": 37
  },
      {
"val": "Female",
"count": 13
  }
],
"missing": {
  "count": 0
}
  }
},
...

There is a factor 2 difference for each count in each bucket.

If I perform the same queries with a larger range gap, e.g.
  \"start\":0.0,
  \"end\":55000.0,
  \"gap\":5000.0,
there is no difference between the response with and without refine: true.

Is this a known issue, or is there something we are overlooking?
And is there information on whether or not this behavior will be the same
in Solr 7?

Kind regards, Tom


Re: Issue with CDCR bootstrapping in Solr 7.1

2017-12-04 Thread Tom Peters
Not sure how it's possible. But I also tried using the _default config and just 
adding in the source and target configuration to make sure I didn't have 
something wonky in my custom solrconfig that was causing this issue. I can 
confirm that until I restart the follower nodes, they will not receive the 
initial index.

> On Dec 1, 2017, at 12:52 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
> 
> Tom,
> 
> (and take care not to restart the leader node otherwise it will replicate
>> from one of the replicas which is missing the index).
> 
> How is this possible? Ok I will look more into it. Appreciate if someone
> else also chimes in if they have similar issue.
> 
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
> 
> On Fri, Dec 1, 2017 at 4:49 AM, Tom Peters <tpet...@synacor.com> wrote:
> 
>> Hi Amrit, I tried issuing hard commits to the various nodes in the target
>> cluster and it does not appear to cause the follower replicas to receive
>> the initial index. The only way I can get the replicas to see the original
>> index is by restarting those nodes (and take care not to restart the leader
>> node otherwise it will replicate from one of the replicas which is missing
>> the index).
>> 
>> 
>>> On Nov 30, 2017, at 12:16 PM, Amrit Sarkar <sarkaramr...@gmail.com>
>> wrote:
>>> 
>>> Tom,
>>> 
>>> This is very useful:
>>> 
>>>> I found a way to get the follower replicas to receive the documents from
>>>> the leader in the target data center, I have to restart the solr
>> instance
>>>> running on that server. Not sure if this information helps at all.
>>> 
>>> 
>>> You have to issue hardcommit on target after the bootstrapping is done.
>>> Reloading makes the core opening a new searcher. While explicit commit is
>>> issued at target leader after the BS is done, follower are left
>> unattended
>>> though the docs are copied over.
>>> 
>>> Amrit Sarkar
>>> Search Engineer
>>> Lucidworks, Inc.
>>> 415-589-9269
>>> www.lucidworks.com
>>> Twitter http://twitter.com/lucidworks
>>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>> Medium: https://medium.com/@sarkaramrit2
>>> 
>>> On Thu, Nov 30, 2017 at 10:06 PM, Tom Peters <tpet...@synacor.com>
>> wrote:
>>> 
>>>> Hi Amrit,
>>>> 
>>>> Starting with more documents doesn't appear to have made a difference.
>>>> This time I tried with >1000 docs. Here are the steps I took:
>>>> 
>>>> 1. Deleted the collection on both the source and target DCs.
>>>> 
>>>> 2. Recreated the collections.
>>>> 
>>>> 3. Indexed >1000 documents on source data center, hard commit
>>>> 
>>>> $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>>>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
>> done
>>>> solr01-a: 1368
>>>> solr01-b: 1368
>>>> solr01-c: 1368
>>>> solr02-a: 0
>>>> solr02-b: 0
>>>> solr02-c: 0
>>>> 
>>>> 4. Enabled CDCR and checked docs
>>>> 
>>>> $ curl 'solr01-a:8080/solr/synacor/cdcr?action=START'
>>>> 
>>>> $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>>>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound';
>> done
>>>> solr01-a: 1368
>>>> solr01-b: 1368
>>>> solr01-c: 1368
>>>> solr02-a: 0
>>>> solr02-b: 0
>>>> solr02-c: 1368
>>>> 
>>>> Some additional notes:
>>>> 
>>>> * I do not have numRecordsToKeep defined in my solrconfig.xml, so I
>> assume
>>>> it will use the default of 100
>>>> 
>>>> * I found a way to get the follower replicas to receive the documents
>> from
>>>> the leader in the target data center, I have to restart the solr
>> instance
>>>> running on that server. Not sure if this information helps at all.
>>>> 
>>>>> On Nov 30, 2017, at 11:22 AM, Amrit Sarkar <sarkaramr...@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Hi Tom,
>>>>> 
>>>>> I see what you are saying and I too think 

Re: Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Tom Peters
Hi Amrit, I tried issuing hard commits to the various nodes in the target 
cluster and it does not appear to cause the follower replicas to receive the 
initial index. The only way I can get the replicas to see the original index is 
by restarting those nodes (and take care not to restart the leader node 
otherwise it will replicate from one of the replicas which is missing the 
index).
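
(For completeness, a hard commit against each target node can be issued with 
something like the following; hostnames as used elsewhere in this thread:)

    for i in solr02-{a,b,c}; do
      curl -s "$i:8080/solr/mycollection/update?commit=true"
    done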


> On Nov 30, 2017, at 12:16 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
> 
> Tom,
> 
> This is very useful:
> 
>> I found a way to get the follower replicas to receive the documents from
>> the leader in the target data center, I have to restart the solr instance
>> running on that server. Not sure if this information helps at all.
> 
> 
> You have to issue hardcommit on target after the bootstrapping is done.
> Reloading makes the core opening a new searcher. While explicit commit is
> issued at target leader after the BS is done, follower are left unattended
> though the docs are copied over.
> 
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
> 
> On Thu, Nov 30, 2017 at 10:06 PM, Tom Peters <tpet...@synacor.com> wrote:
> 
>> Hi Amrit,
>> 
>> Starting with more documents doesn't appear to have made a difference.
>> This time I tried with >1000 docs. Here are the steps I took:
>> 
>> 1. Deleted the collection on both the source and target DCs.
>> 
>> 2. Recreated the collections.
>> 
>> 3. Indexed >1000 documents on source data center, hard commit
>> 
>>  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>  solr01-a: 1368
>>  solr01-b: 1368
>>  solr01-c: 1368
>>  solr02-a: 0
>>  solr02-b: 0
>>  solr02-c: 0
>> 
>> 4. Enabled CDCR and checked docs
>> 
>>  $ curl 'solr01-a:8080/solr/synacor/cdcr?action=START'
>> 
>>  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>  solr01-a: 1368
>>  solr01-b: 1368
>>  solr01-c: 1368
>>  solr02-a: 0
>>  solr02-b: 0
>>  solr02-c: 1368
>> 
>> Some additional notes:
>> 
>> * I do not have numRecordsToKeep defined in my solrconfig.xml, so I assume
>> it will use the default of 100
>> 
>> * I found a way to get the follower replicas to receive the documents from
>> the leader in the target data center, I have to restart the solr instance
>> running on that server. Not sure if this information helps at all.
>> 
>>> On Nov 30, 2017, at 11:22 AM, Amrit Sarkar <sarkaramr...@gmail.com>
>> wrote:
>>> 
>>> Hi Tom,
>>> 
>>> I see what you are saying and I too think this is a bug, but I will
>> confirm
>>> once on the code. Bootstrapping should happen on all the nodes of the
>>> target.
>>> 
>>> Meanwhile can you index more than 100 documents in the source and do the
>>> exact same experiment again. Followers will not copy the entire index of
>>> Leader unless the difference in versions in docs are more than
>>> "numRecordsToKeep", which is default 100, unless you have modified in
>>> solrconfig.xml.
>>> 
>>> Looking forward to your analysis.
>>> 
>>> Amrit Sarkar
>>> Search Engineer
>>> Lucidworks, Inc.
>>> 415-589-9269
>>> www.lucidworks.com
>>> Twitter http://twitter.com/lucidworks
>>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>> Medium: https://medium.com/@sarkaramrit2
>>> 
>>> On Thu, Nov 30, 2017 at 9:03 PM, Tom Peters <tpet...@synacor.com> wrote:
>>> 
>>>> I'm running into an issue with the initial CDCR bootstrapping of an
>>>> existing index. In short, after turning on CDCR only the leader replica
>> in
>>>> the target data center will have the documents replicated and it will
>> not
>>>> exist in any of the follower replicas in the target data center. All
>>>> subsequent incremental updates made to the source datacenter will
>> appear in
>>>> all replicas in the target data center.
>>>> 
>>>> A little more details:
>>>> 
>>>> I have two clusters setup, a source cluster and a target cluster. Each
>>>> cluster has only one shard and th

Re: Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Tom Peters
Hi Amrit,

Starting with more documents doesn't appear to have made a difference. This 
time I tried with >1000 docs. Here are the steps I took:

1. Deleted the collection on both the source and target DCs.

2. Recreated the collections.

3. Indexed >1000 documents on source data center, hard commit

  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
  solr01-a: 1368
  solr01-b: 1368
  solr01-c: 1368
  solr02-a: 0
  solr02-b: 0
  solr02-c: 0

4. Enabled CDCR and checked docs

  $ curl 'solr01-a:8080/solr/synacor/cdcr?action=START'

  $ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
  solr01-a: 1368
  solr01-b: 1368
  solr01-c: 1368
  solr02-a: 0
  solr02-b: 0
  solr02-c: 1368

Some additional notes:

* I do not have numRecordsToKeep defined in my solrconfig.xml, so I assume it 
will use the default of 100

* I found a way to get the follower replicas to receive the documents from the 
leader in the target data center, I have to restart the solr instance running 
on that server. Not sure if this information helps at all.

> On Nov 30, 2017, at 11:22 AM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
> 
> Hi Tom,
> 
> I see what you are saying and I too think this is a bug, but I will confirm
> once on the code. Bootstrapping should happen on all the nodes of the
> target.
> 
> Meanwhile can you index more than 100 documents in the source and do the
> exact same experiment again. Followers will not copy the entire index of
> Leader unless the difference in versions in docs are more than
> "numRecordsToKeep", which is default 100, unless you have modified in
> solrconfig.xml.
> 
> Looking forward to your analysis.
> 
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
> 
> On Thu, Nov 30, 2017 at 9:03 PM, Tom Peters <tpet...@synacor.com> wrote:
> 
>> I'm running into an issue with the initial CDCR bootstrapping of an
>> existing index. In short, after turning on CDCR only the leader replica in
>> the target data center will have the documents replicated and it will not
>> exist in any of the follower replicas in the target data center. All
>> subsequent incremental updates made to the source datacenter will appear in
>> all replicas in the target data center.
>> 
>> A little more details:
>> 
>> I have two clusters setup, a source cluster and a target cluster. Each
>> cluster has only one shard and three replicas. I used the configuration
>> detailed in the Source and Target sections of the reference guide as-is
>> with the exception of updating the zkHost (https://lucene.apache.org/
>> solr/guide/7_1/cross-data-center-replication-cdcr.html#
>> cdcr-configuration-2).
>> 
>> The source data center has the following nodes:
>>solr01-a, solr01-b, and solr01-c
>> 
>> The target data center has the following nodes:
>>solr02-a, solr02-b, and solr02-c
>> 
>> Here are the steps that I've done:
>> 
>> 1. Create collection in source and target data centers
>> 
>> 2. Add a number of documents to the source data center
>> 
>> 3. Verify:
>> 
>>$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>solr01-a: 81
>>solr01-b: 81
>>solr01-c: 81
>>solr02-a: 0
>>solr02-b: 0
>>solr02-c: 0
>> 
>> 4. Start CDCR:
>> 
>>$ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'
>> 
>> 5. See if target data center has received the initial index
>> 
>>$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>solr01-a: 81
>>solr01-b: 81
>>solr01-c: 81
>>solr02-a: 0
>>solr02-b: 0
>>solr02-c: 81
>> 
>>note: only -c has received the index
>> 
>> 6. Add another document to the source cluster
>> 
>> 7. See how many documents are in each node:
>> 
>>$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s
>> $i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
>>solr01-a: 82
>>solr01-b: 82
>>solr01-c: 82
>>solr02-a: 1
>>solr02-b: 1
>>solr02-c: 82
>> 
>> 
&

Issue with CDCR bootstrapping in Solr 7.1

2017-11-30 Thread Tom Peters
I'm running into an issue with the initial CDCR bootstrapping of an existing 
index. In short, after turning on CDCR only the leader replica in the target 
data center will have the documents replicated and it will not exist in any of 
the follower replicas in the target data center. All subsequent incremental 
updates made to the source datacenter will appear in all replicas in the target 
data center.

A little more details:

I have two clusters setup, a source cluster and a target cluster. Each cluster 
has only one shard and three replicas. I used the configuration detailed in the 
Source and Target sections of the reference guide as-is with the exception of 
updating the zkHost 
(https://lucene.apache.org/solr/guide/7_1/cross-data-center-replication-cdcr.html#cdcr-configuration-2).
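
For reference, a minimal sketch of the source-side /cdcr handler from that 
section, with zkHost pointing at the target cluster's ZooKeeper ensemble 
(hostnames below are placeholders):

    <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
      <lst name="replica">
        <str name="zkHost">zk01-target:2181,zk02-target:2181,zk03-target:2181/solr</str>
        <str name="source">mycollection</str>
        <str name="target">mycollection</str>
      </lst>
      <lst name="replicator">
        <str name="threadPoolSize">8</str>
        <str name="schedule">1000</str>
        <str name="batchSize">128</str>
      </lst>
      <lst name="updateLogSynchronizer">
        <str name="schedule">1000</str>
      </lst>
    </requestHandler>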

The source data center has the following nodes:
solr01-a, solr01-b, and solr01-c

The target data center has the following nodes:
solr02-a, solr02-b, and solr02-c

Here are the steps that I've done:

1. Create collection in source and target data centers

2. Add a number of documents to the source data center

3. Verify:

$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
solr01-a: 81
solr01-b: 81
solr01-c: 81
solr02-a: 0
solr02-b: 0
solr02-c: 0

4. Start CDCR:

$ curl 'solr01-a:8080/solr/mycollection/cdcr?action=START'

5. See if target data center has received the initial index

$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
solr01-a: 81
solr01-b: 81
solr01-c: 81
solr02-a: 0
solr02-b: 0
solr02-c: 81

note: only -c has received the index

6. Add another document to the source cluster

7. See how many documents are in each node:

$ for i in solr0{1,2}-{a,b,c}; do echo -n "$i: "; curl -s 
$i:8080/solr/mycollection/select'?q=*:*' | jq '.response.numFound'; done
solr01-a: 82
solr01-b: 82
solr01-c: 82
solr02-a: 1
solr02-b: 1
solr02-c: 82


As you can see, the initial index only made it to one of the replicas in the 
target data center, but subsequent incremental updates have appeared everywhere 
I would expect. Any help would be greatly appreciated, thanks.



This message and any attachment may contain information that is confidential 
and/or proprietary. Any use, disclosure, copying, storing, or distribution of 
this e-mail or any attached file by anyone other than the intended recipient is 
strictly prohibited. If you have received this message in error, please notify 
the sender by reply email and delete the message and any attachments. Thank you.


Re: Data inconsistencies and updates in solrcloud

2017-11-21 Thread Tom Barber

Thanks Erick!

As I said, user error! ;)

Tom

On 21/11/17 22:41, Erick Erickson wrote:

I think you're confusing shards with replicas.

numShards is 2, each with one replica. Therefore half of your docs
will wind up on one replica and half on the other. If you're adding a
single doc, by definition it'll be placed on only one of the two
shards. If your shards had multiple replicas, all of the replicas
associated with that shard would change.

Best,
Erick

On Tue, Nov 21, 2017 at 12:56 PM, Tom Barber <magicaltr...@apache.org> wrote:

Hi folks

I can't find an answer to this, and it's clearly user error. We have a 
collection in SolrCloud that was created with numShards=2 and 
replicationFactor=1; Solr seems happy and the collection seems happy. Yet when 
we post an update to it and then look at the record again, it seems to only 
affect one core and not the second.

What are we likely to be doing wrong in our config or update to prevent the 
replication?

Thanks

Tom




--


Spicule Limited is registered in England & Wales. Company Number: 09954122. 
Registered office: First Floor, Telecom House, 125-135 Preston Road, 
Brighton, England, BN1 6AF. VAT No. 251478891.



All engagements are subject to Spicule Terms and Conditions of Business. 
This email and its contents are intended solely for the individual to whom 
it is addressed and may contain information that is confidential, 
privileged or otherwise protected from disclosure, distributing or copying. 
Any views or opinions presented in this email are solely those of the 
author and do not necessarily represent those of Spicule Limited. The 
company accepts no liability for any damage caused by any virus transmitted 
by this email. If you have received this message in error, please notify us 
immediately by reply email before deleting it from your system. Service of 
legal notice cannot be effected on Spicule Limited by email.


Data inconsistencies and updates in solrcloud

2017-11-21 Thread Tom Barber
Hi folks

I can't find an answer to this, and it's clearly user error. We have a 
collection in SolrCloud that was created with numShards=2 and 
replicationFactor=1; Solr seems happy and the collection seems happy. Yet when 
we post an update to it and then look at the record again, it seems to only 
affect one core and not the second.

What are we likely to be doing wrong in our config or update to prevent the 
replication?

Thanks

Tom


Re: A problem of tracking the commits of Lucene using SHA num

2017-11-20 Thread TOM
Dear Shawn and Chris,
Thanks very much for your replies and help.
And sorry for my mistakes as a first-time user of the mailing lists.

On 11/9/2017 5:13 PM, Shawn wrote:
> Where did this information originate?

My SHA data come from the paper "On the Naturalness of Buggy Code" (Baishakhi Ray, 
et al., ICSE '16), and were downloaded from
http://odd-code.github.io/Data.html.


On 11/9/2017 6:10 PM, Chris wrote:
> Also -- What exactly are you trying to do? what is your objective?

I want to analyze buggy code's statistical properties through some
learning models on Ray's experimental dataset. Because of its large size,
Ray did not put the entire dataset online. What I can acquire is a batch
of commits' SHA data and some other info. So, I need to pick out
the old commits that correspond to these SHAs.


On 17/9/2017 1:47 PM, Shawn wrote:
> The commit data you're using is nearly useless, because the repository
> where it originated has been gone for nearly two years. If you can find
> out how it was generated, you can build a new version from the current
> repository -- either on github or from Apache's official servers.


Thanks for all of your suggestions and help; I am going to try other ways.
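
(One such way, for example, is to look a change up by its commit message rather 
than by SHA in the current repository; a sketch, using the ticket id mentioned 
earlier:

    git log --all --oneline --grep='LUCENE-5909'

This searches every branch of the current clone for commits whose message 
mentions the ticket, regardless of which SHA the old mirror assigned them.)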
Thanks so much.
 
Best,
Xian

A problem of tracking the commits of Lucene using SHA num

2017-11-16 Thread TOM
Thanks for your patience and help.

Recently, I acquired a batch of commits' SHA data for Lucene, spanning the 
years 2010 to 2015. In order to get the original info, I tried to use these SHA 
data to track the commits. First, I cloned the Lucene repository to my local 
host, using the cmd git clone https://github.com/apache/lucene-solr.git. 
Then, I used git show [commit SHA] to get each commit's history record, but it 
failed with CMD output like this:

>> git show be5672c0c242d658b7ce36f291b74c344de925c7

>> fatal: bad object be5672c0c242d658b7ce36f291b74c344de925c7

 

After that, I cloned another mirror of Apache Lucene & Solr 
(https://github.com/mdodsworth/lucene-solr, the update ended at 2014/08/30), 
and got the right record like this:



Moreover, I tried to track a commit using its title message. However, for the 
same commit, e.g. "LUCENE-5909: Fix stupid bug", I found different SHA nums in 
the two mirror repositories above 
(https://github.com/apache/lucene-solr/commit/3c0d111d07184e96a73ca6dc05c6227d839724e2
 and 
https://github.com/mdodsworth/lucene-solr/commit/4bc8dde26371627d11c299f65c399ecb3240a34c),
 which confused me.

In summary: 1) did the method used to generate commit SHAs change at some point? 
2) since the second mirror repository stopped updating in 2014, how can I 
track all the commits in my dataset?

 

Thanks so much!

A problem of tracking the commits of Lucene using SHA num

2017-11-09 Thread TOM
Thanks for your patience and help.

Recently, I acquired a batch of commits' SHA data for Lucene, spanning the 
years 2010 to 2015. In order to get the original info, I tried to use these SHA 
data to track the commits. First, I cloned the Lucene repository to my local 
host, using the cmd git clone https://github.com/apache/lucene-solr.git. 
Then, I used git show [commit SHA] to get each commit's history record, but it 
failed with CMD output like this:

>> git show be5672c0c242d658b7ce36f291b74c344de925c7

>> fatal: bad object be5672c0c242d658b7ce36f291b74c344de925c7

 

After that, I cloned another mirror of Apache Lucene & Solr 
(https://github.com/mdodsworth/lucene-solr, the update ended at 2014/08/30), 
and got the right record like this:



Moreover, I tried to track a commit using its title message. However, for the 
same commit, e.g. "LUCENE-5909: Fix stupid bug", I found different SHA nums in 
the two mirror repositories above 
(https://github.com/apache/lucene-solr/commit/3c0d111d07184e96a73ca6dc05c6227d839724e2
 and 
https://github.com/mdodsworth/lucene-solr/commit/4bc8dde26371627d11c299f65c399ecb3240a34c),
 which confused me.

In summary: 1) did the method used to generate commit SHAs change at some point? 
2) since the second mirror repository stopped updating in 2014, how can I 
track all the commits in my dataset?

 

Thanks so much!

Re: Provide suggestion on indexing performance

2017-09-13 Thread Tom Evans
On Tue, Sep 12, 2017 at 4:06 AM, Aman Tandon <amantandon...@gmail.com> wrote:
> Hi,
>
> We want to know about the indexing performance in the below mentioned
> scenarios, consider the total number of 10 string fields and total number
> of documents are 10 million.
>
> 1) indexed=true, stored=true
> 2) indexed=true, docValues=true
>
> Which one should we prefer in terms of indexing performance, please share
> your experience.
>
> With regards,
> Aman Tandon

Your question doesn't make much sense. You turn on stored when you
need to retrieve the original contents of the fields after searching,
and you use docvalues to speed up faceting, sorting and grouping.
Using docvalues to retrieve values during search is more expensive
than simply using stored values, so if your primary aim is retrieving
stored values, use stored=true.

Secondly, the only way to answer performance questions for your schema
and data is to try it out. Generate 10 million docs, store them in a
file (e.g. as CSV), and then use the post tool to try different schema
and query options.
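
As a rough illustration, the two variants being compared are field definitions 
along these lines (the field name is made up):

    <!-- 1) indexed + stored: retrieve the original value after searching -->
    <field name="brand_s" type="string" indexed="true" stored="true"/>

    <!-- 2) indexed + docValues: faster faceting/sorting/grouping -->
    <field name="brand_s" type="string" indexed="true" stored="false" docValues="true"/>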

Cheers

Tom


Re: Solr returning same object in different page

2017-09-13 Thread Tom Evans
On Tue, Sep 12, 2017 at 7:42 PM, ruby <rshoss...@gmail.com> wrote:
> I'm running into an issue where an object is appearing twice when we are
> paging. My query gives documents a boost based on field values. The first query
> returns 50 objects. The second query is exactly the same as the first, except it
> gets the next 50 objects. We are noticing that a few objects which were
> returned before are being returned again on the second page. Is this a known
> issue with Solr?

Are you using paging (page=N) or deep paging (cursorMark=*)? Do you
have a deterministic sort order (IE, not simply by score)?
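
If you do switch to cursorMark, remember the sort has to end on the uniqueKey 
field so ties are broken deterministically; a sketch (core and field names are 
assumptions):

    # first page: start the cursor at *
    curl 'localhost:8983/solr/core/select?q=*:*&rows=50&sort=score+desc,id+asc&cursorMark=*'
    # next page: pass back the nextCursorMark value from the previous response
    curl 'localhost:8983/solr/core/select?q=*:*&rows=50&sort=score+desc,id+asc&cursorMark=<nextCursorMark>'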

Cheers

Tom


Re: Get results in multiple orders (multiple boosts)

2017-08-18 Thread Tom Evans
On Fri, Aug 18, 2017 at 8:21 AM, Luca Dall'Osto
<tenacious...@yahoo.it.invalid> wrote:
>
> Yes, of course, and excuse me for the misunderstanding.
>
>
> In my scenario I have to display a list with hundreds of documents.
> A user can show these documents in a particular order; this order is decided 
> by the user in a settings view.
>
>
> Order levels are for example:
> 1) Order by category, as most important.
> 2) Order by source, as second level.
> 3) Order by date (ascending or descending).
> 4) Order by title (ascending or descending).
>
>
> For category order, in the settings view, the user has a box with a list of all 
> categories available for him/her.
> User drag elements of the list to set in the favorite order.
> Same thing for sources.
>

Solr can only sort by indexed fields, it needs to be able to compare
one document to another document, and the only information available
at that point are the indexed fields.

This would be untenable in your scenario, because you cannot add a
category..sort_order field to every document for every user.

If this custom sorting is a hard requirement, the only feasible
solution I see is to write a custom sorting plugin, that provides a
function that you can sort on. This blog post describes how this can
be achieved:

https://medium.com/culture-wavelabs/sorting-based-on-a-custom-function-in-solr-c94ddae99a12

I would imagine that you would need one sort function, maybe called
usersortorder(), to which you would provide the users preferred sort
ordering (which you would retrieve from wherever you store such
information) and the field that you want sorted. It would look
something like this:

usersortorder("category_id", "3,5,1,7,2,12,14,58") DESC,
usersortorder("source_id", "5,2,1,4,3") DESC, date DESC, title DESC

Cheers

Tom


Error in Solr 6.6 Example schemas re: DocValues for StrField type must be single-valued?

2017-08-15 Thread Tom Burton-West
Hello,

The comments in the example schemas for Solr 6.6 state that the
StrField type must be single-valued to support doc values.

For example
Solr-6.6.0/server/solr/configsets/basic_configs/conf/managed-schema:

216  

However, on line 221 a StrField is declared with docValues that is
multiValued:
221  

Also note that the comments above say that the field must either be
required or have a default value, but line 221 appears to satisfy neither
condition.

The JavaDocs indicate that StrField can be multi-valued
https://lucene.apache.org/core/6_6_0//core/org/apache/
lucene/index/DocValuesType.html

Is the comment in the example schema file completely wrong, or is there
some issue with using docValues with a multivalued StrField?

Tom Burton-West

https://www.hathitrust.org/blogslarge-scale-search


Re: setup solrcloud from scratch vie web-ui

2017-05-17 Thread Tom Evans
On Wed, May 17, 2017 at 6:28 AM, Thomas Porschberg
<tho...@randspringer.de> wrote:
> Hi,
>
> I did not manipulating the data dir. What I did was:
>
> 1. Downloaded solr-6.5.1.zip
> 2. ensured no solr process is running
> 3. unzipped solr-6.5.1.zip to ~/solr_new2/solr-6.5.1
> 3. started an external zookeeper
> 4. copied a conf directory from a working non-cloudsolr (6.5.1) to
>~/solr_new2/solr-6.5.1 so that I have ~/solr_new2/solr-6.5.1/conf
>   (see http://randspringer.de/solrcloud_test/my.zip for content)

..in which you've manipulated the dataDir! :)

The problem (I think) is that you have set a fixed data dir, and when
Solr attempts to create a second core (for whatever reason, in your
case it looks like you are adding a shard), Solr puts it exactly where
you have told it to, in the same directory as the previous one. It
finds the lock and blows up, because each core needs to be in a
separate directory, but you've instructed Solr to put them in the same
one.

Start with the solrconfig from the basic_configs configset that ships
with Solr and add the special things that your installation needs. I
am not massively surprised that your non-cloud config does not work in
cloud mode. When we moved to SolrCloud, we rewrote solrconfig.xml and
schema.xml from scratch, starting from basic_configs and adding
anything particular that we needed from our old config, checking every
difference that we had from the stock config and noting/discerning
why, and ensuring that our field types use the same names for the same
types as basic_configs wherever possible.

I only say all that because to fix this issue is a single thing, but
you should spend the time comparing configs because this will not be
the only issue. Anyway, to fix this problem, in your solrconfig.xml
you have:

  <dataDir>data</dataDir>

It should be

  <dataDir>${solr.data.dir:}</dataDir>

Which is still in your config, you've just got it commented out :)

Cheers

Tom


Re: to handle expired documents: collection alias or delete by id query

2017-03-24 Thread Tom Evans
On Thu, Mar 23, 2017 at 6:10 AM, Derek Poh <d...@globalsources.com> wrote:
> Hi
>
> I have collections of products. I am doing indexing 3-4 times daily.
> Every day there are products that expire and I need to remove them from
> these collections daily.
>
> I can think of 2 ways to do this.
> 1. using collection alias to switch between a main and temp collection.
> - clear and index the temp collection
> - create alias to temp collection.
> - clear and index the main collection.
> - create alias to main collection.
>
> this way requires additional collections.
>

Another way of doing this is to have a moving alias (not constantly
clearing the "temp" collection). If you reindex daily, your index
would be called "products_yyyymmdd" with an alias to "products". The
advantage of this is that you can roll back to a previous version of
the index if there are problems, and each index is guaranteed to be
freshly created with no artifacts.

The biggest consideration for me would be how long indexing your full
corpus takes you. If you can do it in a small period of time, then
full indexes would be preferable. If it takes a very long time,
deleting is preferable.

If you are doing a cloud setup, full indexes are even more appealing.
You can create the new collection on a single node (even if sharded;
just place each shard on the same node). This would only place the
indexing cost on that one node, whilst the other nodes' regular query
response time would not be degraded by indexing. You also don't have
to distribute the documents around the cluster. There is no
distributed indexing in Solr; each replica has to index each document
again, even if it is not the leader.

Once indexing is complete, you can expand the collection by adding
replicas of that shard on other nodes - perhaps even removing it from
the node that did the indexing. We have a node that solely does
indexing, before the collection is queried for anything it is added to
the querying nodes.

You can do this manually, or you can automate it using the collections API.
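
As a rough sketch of the alias flip with the collections API (the
dates, collection names and host are placeholders, not from your
setup):

    http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_20170325
    http://localhost:8983/solr/admin/collections?action=DELETE&name=products_20170324

CREATEALIAS simply repoints the alias, so queries against "products"
switch to the new collection without any client changes; the old
collection can be deleted once you are happy with the new data.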

Cheers

Tom


Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Tom Evans
Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get
the count of collapsed documents, if that is necessary for you.


Another option might be to use hyperloglog function - hll() - instead
of unique(), which should give slightly better performance.
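
For example, rough, untested sketches of both suggestions (field names
taken from your mail). The collapsing filter:

    fq={!collapse field=item_id}

and the hll() variant of your facet:

    json.facet={
      people: {
        type: terms,
        field: person_id,
        facet: { grouped_count: "hll(item_id)" },
        sort: "grouped_count desc"
      }
    }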

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
<michael.bry...@kcl.ac.uk> wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better 
> performance, especially with high cardinality facet fields. However, the one 
> issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
> trying to simulate the effect of "group.facet" to sort facets according to a 
> grouping field.
>
> My situation, slightly simplified is:
>
> Solr 4.6.1
>
>   *   Doc set: ~200,000 docs
>   *   Grouping by item_id, an indexed, stored, single value string field with 
> ~50,000 unique values, ~4 docs per item
>   *   Faceting by person_id, an indexed, stored, multi-value string field 
> with ~50,000 values (w/ a very skewed distribution)
>   *   No docValues fields
>
> Each document here is a description of an item, and there are several 
> descriptions per item in multiple languages.
>
> With legacy facets I use group.field=item_id and group.facet=true, which 
> gives me facet counts with the number of items rather than descriptions, and 
> correctly sorted by descending item count.
>
> With JSON facets I'm doing the equivalent like so:
>
> json.facet={
> "people": {
> "type": "terms",
> "field": "person_id",
> "facet": {
> "grouped_count": "unique(item_id)"
> },
> "sort": "grouped_count desc"
> }
> }
>
> This works, and is somewhat faster than legacy faceting, but it also produces 
> a massive spike in memory usage when (and only when) the sort parameter is 
> set to the aggregate field. A server that runs happily with a 512MB heap OOMs 
> unless I give it a 4GB heap. With sort set to (the default) "count desc" 
> there is no memory usage spike.
>
> I would be curious if anyone has experienced this kind of memory usage when 
> sorting JSON facets by stats and if there’s anything I can do to mitigate it. 
> I’ve tried reindexing with docValues enabled on the relevant fields and it 
> seems to make no difference in this respect.
>
> Many thanks,
> ~Mike


Re: Interval Facets with JSON

2017-02-10 Thread Tom Evans
On Wed, Feb 8, 2017 at 11:26 PM, deniz <denizdurmu...@gmail.com> wrote:
> Tom Evans-2 wrote
>> I don't think there is such a thing as an interval JSON facet.
>> Whereabouts in the documentation are you seeing an "interval" as JSON
>> facet type?
>>
>>
>> You want a range facet surely?
>>
>> One thing with range facets is that the gap is fixed size. You can
>> actually do your example however:
>>
>> json.facet={height_facet:{type:range, gap:20, start:160, end:190,
>> hardend:true, field:height}}
>>
>> If you do require arbitrary bucket sizes, you will need to do it by
>> specifying query facets instead, I believe.
>>
>> Cheers
>>
>> Tom
>
>
> nothing other than
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-IntervalFaceting
> for documentation on intervals...  i am ok with range queries as well but
> intervals would fit better because of different sizes...

That documentation is not for JSON facets though. You can't pick and
choose features from the old facet system and use them in JSON facets
unless they are mentioned in the JSON facet documentation:

https://cwiki.apache.org/confluence/display/solr/JSON+Request+API

and (not official documentation)

http://yonik.com/json-facet-api/

Cheers

Tom


Re: Interval Facets with JSON

2017-02-08 Thread Tom Evans
On Tue, Feb 7, 2017 at 8:54 AM, deniz <denizdurmu...@gmail.com> wrote:
> Hello,
>
> I am trying to run JSON facets with an interval query as follows:
>
>
> "json.facet":{"height_facet":{"interval":{"field":"height","set":["[160,180]","[180,190]"]}}}
>
> And related field is  stored="true" />
>
> But I keep seeing errors like:
>
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Unknown
> facet or stat. key=height_facet type=interval args={field=height,
> set=[[160,180], [180,190]]} , path=/facet
>

I don't think there is such a thing as an interval JSON facet.
Whereabouts in the documentation are you seeing an "interval" as JSON
facet type?


You want a range facet surely?

One thing with range facets is that the gap is fixed size. You can
actually do your example however:

json.facet={height_facet:{type:range, gap:20, start:160, end:190,
hardend:true, field:height}}

If you do require arbitrary bucket sizes, you will need to do it by
specifying query facets instead, I believe.
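
Something like this, as a rough sketch of the query-facet approach
using your example buckets (bucket names are arbitrary):

    json.facet={
      "160-180": { type: query, q: "height:[160 TO 180}" },
      "180-190": { type: query, q: "height:[180 TO 190}" }
    }

Each bucket is just an arbitrary query, so the ranges can be whatever
sizes you need.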

Cheers

Tom


Re: Upgrade SOLR version - facets perfomance regression

2017-01-31 Thread Tom Evans
On Tue, Jan 31, 2017 at 5:49 AM, SOLR4189 <klin892...@yandex.ru> wrote:
> But I can't run Json Facet API. I checked on SOLR-5.4.1.
> If I write:
> localhost:9001/solr/Test1_shard1_replica1/myHandler/q=*:*=5=*=json=true=someField
> It works fine. But if I write:
> localhost:9001/solr/Test1_shard1_replica1/myHandler/q=*:*=5=*=json={field:someField}
> It doesn't work.
> Are you sure that it is built-in? If it is built-in, why can't I find an
> explanation of it in the reference guide?
> Thank you for your help.

You do have to follow the correct syntax:

  json.facet={name_of_facet_in_output:{type:terms, field:name_of_field}}

It is documented in confluence:

https://cwiki.apache.org/confluence/display/solr/Faceted+Search

Also by yonik:

http://yonik.com/json-facet-api/
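
For example, a minimal request against your handler (facet and field
names are placeholders, and the json.facet value needs URL-encoding in
a real request):

    localhost:9001/solr/Test1_shard1_replica1/myHandler?q=*:*&rows=0&json.facet={someFacet:{type:terms,field:someField}}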

Cheers

Tom


Re: Trouble boosting a field -solved-

2017-01-18 Thread Tom Chiverton
I 'solved' this by removing some of the 'AND's from my full query. AND 
should be optional and have no effect if present, right? But for me it 
was forcing the score to 0.



Which might be the same as saying nothing matched ?


Tom


On 13/01/17 15:10, Tom Chiverton wrote:

I have a few hundred documents with title and content fields.

I want a match in title to trump matches in content. If I search for 
"connected vehicle" then a news article that has that in the content 
shouldn't be ranked higher than the page with that in the title is 
essentially what I want.


I have tried dismax with qf=title^2 as well as several other variants 
with the standard query parser (like q="title:"foo"^2 OR 
content:"foo") but documents without the search term in the title 
still come out before those with the term in the title when ordered by 
score.


Is there something I am missing ?

From the docs, something like q=title:"connected vehicle"^2 OR 
content:"connected vehicle" should have worked ? Even using ^100 
didn't help.


I tried with the dismax parser using

|"q": "Connected Vehicle", "defType": "dismax", "indent": "true", "qf": 
"title^2000 content", "pf": "pf=title^4000 content^2", "sort": "score 
desc", "wt": "json", but that was not better. if I remove content from 
pf/qf then documents seem to rank correctly. |
Example query and results (content omitted) : 
http://pastebin.com/5EhrRJP8 with managed-schema 
http://pastebin.com/mdraWQWE








Re: Concat Fields in JSON Facet

2017-01-17 Thread Tom Evans
On Mon, Jan 16, 2017 at 2:58 PM, Zheng Lin Edwin Yeo
<edwinye...@gmail.com> wrote:
> Hi,
>
> I have been using JSON Facet, but I am facing some constraints in
> displaying the field.
>
> For example, I have 2 fields, itemId and itemName. However, when I do the
> JSON Facet, I can only get it to show one of them in the output, and I
> could not get it to show both together.
> I would like to show both the ID and Name together, so that it will be more
> meaningful and easier for the user to understand, without having to refer to
> another table to determine the match between the ID and Name.

I don't understand what you mean. If you have these three documents in
your index, what data do you want in the facet?

[
  {itemId: 1, itemName: "Apple"},
  {itemId: 2, itemName: "Android"},
  {itemId: 3, itemName: "Android"},
]

Cheers

Tom


Re: Trouble boosting a field

2017-01-16 Thread Tom Chiverton
Ohh, that's handy ! But it needs Solr/ElasticSearch to be publicly 
accessible ?



On 14/01/17 09:23, Alan Woodward wrote:

http://splainer.io/ from the gents at 
OpenSourceConnections is pretty good for this sort of thing, I find…

Alan Woodward
www.flax.co.uk



On 13 Jan 2017, at 16:35, Tom Chiverton <t...@extravision.com> wrote:

Well, I've tried much larger values than 8, and it still doesn't seem to do the 
job ?

For now, assume my users are searching for exact sub strings of a real title.

Tom


On 13/01/17 16:22, Walter Underwood wrote:

I use a boost of 8 for title with no boost on the content. Both Infoseek and 
Inktomi settled on the 8X boost, getting there with completely different 
methodologies.

You might not want the title to completely trump the content. That causes some 
odd anomalies. If someone searches for “ice age 2”, do you really want every 
title with “2” to come before “ice age two”? Or a search for “steve jobs” to 
return every article with “job” or “jobs” in the title first?

Also, use “edismax”, not “dismax”. Dismax was obsolete in Solr 3.x, five years 
ago.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jan 13, 2017, at 7:10 AM, Tom Chiverton <t...@extravision.com> wrote:

I have a few hundred documents with title and content fields.

I want a match in title to trump matches in content. If I search for "connected 
vehicle" then a news article that has that in the content shouldn't be ranked higher 
than the page with that in the title is essentially what I want.

I have tried dismax with qf=title^2 as well as several other variants with the standard query parser 
(like q="title:"foo"^2 OR content:"foo") but documents without the search term 
in the title still come out before those with the term in the title when ordered by score.

Is there something I am missing ?

 From the docs, something like q=title:"connected vehicle"^2 OR content:"connected 
vehicle" should have worked ? Even using ^100 didn't help.

I tried with the dismax parser using

   "q": "Connected Vehicle",
   "defType": "dismax",
   "indent": "true",
   "qf": "title^2000 content",
   "pf": "pf=title^4000 content^2",
   "sort": "score desc",
   "wt": "json",

but that was not better. if I remove content from pf/qf then documents seem to 
rank correctly.
Example query and results (content omitted): http://pastebin.com/5EhrRJP8 
with managed-schema http://pastebin.com/mdraWQWE





Re: Trouble boosting a field

2017-01-13 Thread Tom Chiverton
Well, I've tried much larger values than 8, and it still doesn't seem to 
do the job ?


For now, assume my users are searching for exact sub strings of a real 
title.


Tom


On 13/01/17 16:22, Walter Underwood wrote:

I use a boost of 8 for title with no boost on the content. Both Infoseek and 
Inktomi settled on the 8X boost, getting there with completely different 
methodologies.

You might not want the title to completely trump the content. That causes some 
odd anomalies. If someone searches for “ice age 2”, do you really want every 
title with “2” to come before “ice age two”? Or a search for “steve jobs” to 
return every article with “job” or “jobs” in the title first?

Also, use “edismax”, not “dismax”. Dismax was obsolete in Solr 3.x, five years 
ago.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Jan 13, 2017, at 7:10 AM, Tom Chiverton <t...@extravision.com> wrote:

I have a few hundred documents with title and content fields.

I want a match in title to trump matches in content. If I search for "connected 
vehicle" then a news article that has that in the content shouldn't be ranked higher 
than the page with that in the title is essentially what I want.

I have tried dismax with qf=title^2 as well as several other variants with the standard query parser 
(like q="title:"foo"^2 OR content:"foo") but documents without the search term 
in the title still come out before those with the term in the title when ordered by score.

Is there something I am missing ?

 From the docs, something like q=title:"connected vehicle"^2 OR content:"connected 
vehicle" should have worked ? Even using ^100 didn't help.

I tried with the dismax parser using

   "q": "Connected Vehicle",
   "defType": "dismax",
   "indent": "true",
   "qf": "title^2000 content",
   "pf": "pf=title^4000 content^2",
   "sort": "score desc",
   "wt": "json",

but that was not better. if I remove content from pf/qf then documents seem to 
rank correctly.
Example query and results (content omitted): http://pastebin.com/5EhrRJP8 
with managed-schema http://pastebin.com/mdraWQWE





Trouble boosting a field

2017-01-13 Thread Tom Chiverton

I have a few hundred documents with title and content fields.

I want a match in title to trump matches in content. If I search for 
"connected vehicle" then a news article that has that in the content 
shouldn't be ranked higher than the page with that in the title is 
essentially what I want.


I have tried dismax with qf=title^2 as well as several other variants 
with the standard query parser (like q="title:"foo"^2 OR content:"foo") 
but documents without the search term in the title still come out before 
those with the term in the title when ordered by score.


Is there something I am missing ?

From the docs, something like q=title:"connected vehicle"^2 OR 
content:"connected vehicle" should have worked ? Even using ^100 didn't 
help.


I tried with the dismax parser using

|"q": "Connected Vehicle", "defType": "dismax", "indent": "true", "qf": 
"title^2000 content", "pf": "pf=title^4000 content^2", "sort": "score 
desc", "wt": "json", but that was not better. if I remove content from 
pf/qf then documents seem to rank correctly. |


Example query and results (content omitted) : 
http://pastebin.com/5EhrRJP8 with managed-schema 
http://pastebin.com/mdraWQWE






Re: Has anyone used linode.com to run Solr | ??Best way to deliver PHP/Apache clients with Solr question

2016-12-15 Thread Tom Evans
On Thu, Dec 15, 2016 at 12:37 PM, GW <thegeofo...@gmail.com> wrote:
> While my client is all PHP it does not use a solr client. I wanted to stay
> with he latest Solt Cloud and the PHP clients all seemed to have some kind
> of issue being unaware of newer Solr Cloud versions. The client makes pure
> REST calls with Curl. It is stateful through local storage. There is no
> persistent connection. There are no cookies and PHP work is not sticky so
> it is designed for round robin on both the internal network.
>
> I'm thinking we have a different idea of persistent. To me something like
> MySQL can be persistent, ie a fifo queue for requests. The stack can be
> always on/connected on something like a heap storage.
>
> I never thought about the impact of a solr node crashing with PHP on top.
> Many thanks!
>
> Was thinking of running a conga line (Ricci & Luci projects) and shutting
> down and replacing failed nodes. Never done this with Solr. I don't see any
> reasons why it would not work.
>
> ** When you say an array of connections per host. It would still require an
> internal DNS because hosts files don't round robin. perhaps this is handled
> in the Python client??


The best Solr clients will take the URIs of the ZooKeeper servers;
they do not make queries via ZooKeeper, but will read the current
cluster status from ZooKeeper in order to determine which Solr node to
actually connect to, taking into account which nodes are alive and
the state of particular shards.

SolrJ (Java) will do this, as will pysolr (python), I'm not aware of a
PHP client that is ZK aware.
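
For example, a minimal pysolr sketch (this assumes a pysolr build with
the ZooKeeper support added in [1] and its kazoo dependency installed;
the hosts and collection name are placeholders):

    from pysolr import SolrCloud, ZooKeeper

    # read live cluster state from ZK, then route requests to a healthy node
    zk = ZooKeeper("zk1:2181,zk2:2181,zk3:2181")
    solr = SolrCloud(zk, "items")
    results = solr.search("*:*")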

If you don't have a ZK aware client, there are several options:

1) Make your favourite client ZK aware, like in [1]
2) Use round robin DNS to distribute requests amongst the cluster.
3) Use a hardware or software load balancer in front of the cluster.
4) Use shared state to store the names of active nodes*

All apart from 1) have significant downsides:

2) Has no concept of a node being down. Down nodes should not cause
query failures, the requests should go elsewhere in the cluster.
Requires updating DNS to add or remove nodes.
3) Can detect "down" nodes. Has no idea about the state of the
cluster/shards (usually).
4) Basically duplicates what ZooKeeper does, but less effectively -
doesn't know cluster state, down nodes, nodes that are up but with
unhealthy replicas...

>
> You have given me some good clarification. I think lol. I know I can spin
> out WWW servers based on load. I'm not sure how shit will fly spinning up
> additional solr nodes. I'm not sure what happens if you spin up an empty
> solr node and what will happen with replication, shards and load cost of
> spinning an instance. I'm facing some experimentation me thinks. This will
> be a manual process at first, for sure
>
> I guess I could put the solr connect requests in my clients into a try
> loop, looking for successful connections by name before any action.

In SolrCloud mode, you can spin up/shut down nodes as you like.
Depending on how you have configured your collections, new replicas
may be automatically created on the new node, or the node will simply
become part of the cluster but empty, ready for you to assign new
replicas to it using the Collections API.

You can also use what are called "snitches" to define rules for how
you want replicas/shards allocated amongst the nodes, eg to avoid
placing all the replicas for a shard in the same rack.

Cheers

Tom

[1] 
https://github.com/django-haystack/pysolr/commit/366f14d75d2de33884334ff7d00f6b19e04e8bbf


Re: Using DIH FileListEntityProcessor with SolrCloud

2016-12-06 Thread Tom Evans
On Fri, Dec 2, 2016 at 4:36 PM, Chris Rogers
<chris.rog...@bodleian.ox.ac.uk> wrote:
> Hi all,
>
> A question regarding using the DIH FileListEntityProcessor with SolrCloud 
> (solr 6.3.0, zookeeper 3.4.8).
>
> I get that the config in SolrCloud lives on the Zookeeper node (a different 
> server from the solr nodes in my setup).
>
> With this in mind, where is the baseDir attribute in the 
> FileListEntityProcessor config relative to? I’m seeing the config in the Solr 
> GUI, and I’ve tried setting it as an absolute path on my Zookeeper server, 
> but this doesn’t seem to work… any ideas how this should be setup?
>
> My DIH config is below:
>
> 
>   
>   
> 
>  fileName=".*xml"
> newerThan="'NOW-5YEARS'"
> recursive="true"
> rootEntity="false"
> dataSource="null"
> baseDir="/home/bodl-zoo-svc/files/">
>
>   
>
>  forEach="/TEI" url="${f.fileAbsolutePath}" 
> transformer="RegexTransformer" >
>  xpath="/TEI/teiHeader/fileDesc/titleStmt/title"/>
>  xpath="/TEI/teiHeader/fileDesc/publicationStmt/publisher"/>
>  xpath="/TEI/teiHeader/fileDesc/sourceDesc/msDesc/msIdentifier/altIdentifier/idno"/>
>   
>
> 
>
>   
> 
>
>
> This same script worked as expected on a single solr node (i.e. not in 
> SolrCloud mode).
>
> Thanks,
> Chris
>

Hey Chris

We hit the same problem moving from non-cloud to cloud; we had a
collection that loaded its DIH config from various XML files listing
the DB queries to run. We wrote a simple DataSource plugin to load the
config from ZooKeeper instead of local disk, to avoid having to
distribute those config files around the cluster.

https://issues.apache.org/jira/browse/SOLR-8557

Cheers

Tom


Re: insert lat/lon from jpeg into solr

2016-12-01 Thread Tom Evans
On Wed, Nov 30, 2016 at 1:36 PM, win harrington
<win_harring...@yahoo.com.invalid> wrote:
> I have jpeg files with latitude and longitude in separate fields. When I run 
> the post tool, it stores the lat/lon in separate fields.
> For geospatial search, Solr wants them combined into one field with the 
> format 'latitude,longitude'.
> How can I combine lat+lon into one field?
>

Build the field up using the UpdateRequestProcessorChain, something like this:

  
  

  latitude
  latlon


  longitude
  latlon


  latlon
  ,



  

  

  composite-latlon

  
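
The XML tags above have been stripped by the mail archive; a sketch of
the kind of chain it describes, assuming the stock clone/concat update
processors (the chain name and handler wiring are illustrative, the
field names are the ones from this thread):

    <updateRequestProcessorChain name="composite-latlon">
      <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">latitude</str>
        <str name="dest">latlon</str>
      </processor>
      <processor class="solr.CloneFieldUpdateProcessorFactory">
        <str name="source">longitude</str>
        <str name="dest">latlon</str>
      </processor>
      <processor class="solr.ConcatFieldUpdateProcessorFactory">
        <str name="fieldName">latlon</str>
        <str name="delimiter">,</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>

    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <str name="update.chain">composite-latlon</str>
      </lst>
    </requestHandler>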

Cheers

Tom


Re: Import from S3

2016-11-25 Thread Tom Evans
On Fri, Nov 25, 2016 at 7:23 AM, Aniket Khare <aniketish...@gmail.com> wrote:
> You can use Solr DIH for indexing csv data into solr.
> https://wiki.apache.org/solr/DataImportHandler
>

Seems overkill when you can simply post CSV data to the UpdateHandler,
using either the post tool:

https://cwiki.apache.org/confluence/display/solr/Post+Tool

Or by doing it manually however you wish:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-CSVFormattedIndexUpdates
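
For example, either of these (collection name and file are placeholders):

    bin/post -c mycollection data.csv

    curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
         -H 'Content-type: application/csv' --data-binary @data.csv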

Cheers

Tom


Re: Query formulation help

2016-10-26 Thread Tom Evans
On Wed, Oct 26, 2016 at 4:00 PM, Prasanna S. Dhakephalkar
<prasann...@merajob.in> wrote:
> Hi,
>
> Thanks for reply, I did
>
> "q": "cost:[2 TO (2+5000)]"
>
> Got
>
>   "error": {
> "msg": "org.apache.solr.search.SyntaxError: Cannot parse 'cost:[2 to 
> (2+5000)]': Encountered \"  \"(2+5000) \"\" at line 1, 
> column 18.\nWas expecting one of:\n\"]\" ...\n\"}\" ...\n",
>   }
>
> I want solr to do the addition.
> I tried
> "q": "cost:[2 TO (2+5000)]"
> "q": "cost:[2 TO sum(2,5000)]"
>
> It has not worked. I am missing something. I do not know what. Maybe how to 
> invoke functions.
>
> Regards,
>
> Prasanna.

Sorry, I was unclear - do the maths before constructing the query!

You might be able to do this with function queries, but why bother? If
the number is fixed, then fix it in the query; if it varies, then there
must be some code executing on your client that can be used to do a
simple addition.
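
For example, if given_number is 200, the client computes 200+500 and
sends cost:[200 TO 700] (or cost:{200 TO 700} with exclusive bounds,
if the endpoints must not match).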

Cheers

Tom


Re: Query formulation help

2016-10-26 Thread Tom Evans
On Wed, Oct 26, 2016 at 8:03 AM, Prasanna S. Dhakephalkar
 wrote:
> Hi,
>
>
>
> May be very rudimentary question
>
>
>
> There is an integer field in a core: "cost"
>
> Need to build a query that will return documents where
> 0 < "cost" - given_number < 500
>

cost:[given_number TO (500+given_number)]


Re: OOM Error

2016-10-26 Thread Tom Evans
On Wed, Oct 26, 2016 at 4:53 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 10/25/2016 8:03 PM, Susheel Kumar wrote:
>> Agree, Pushkar.  I had docValues for sorting / faceting fields from
>> begining (since I setup Solr 6.0).  So good on that side. I am going to
>> analyze the queries to find any potential issue. Two questions which I am
>> puzzling with
>>
>> a) Should the below JVM parameter be included for Prod to get heap dump
>>
>> "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/the/dump"
>
> A heap dump can take a very long time to complete, and there may not be
> enough memory in the machine to start another instance of Solr until the
> first one has finished the heap dump.  Also, I do not know whether Java
> would release the listening port before the heap dump finishes.  If not,
> then a new instance would not be able to start immediately.
>
> If a different heap dump file is created each time, that might lead to
> problems with disk space after repeated dumps.  I don't know how the
> option works.
>
>> b) Currently OOM script just kills the Solr instance. Shouldn't it be
>> enhanced to wait and restart Solr instance
>
> As long as there is a problem causing OOMs, it seems rather pointless to
> start Solr right back up, as another OOM is likely.  The safest thing to
> do is kill Solr (since its operation would be unpredictable after OOM)
> and let the admin sort the problem out.
>

Occasionally our cloud nodes can OOM, when particularly complex
faceting is performed. The current OOM management can be exceedingly
annoying; a user will make a too complex analysis request, bringing
down one server, taking it out of the balancer. The user gets fed up
at no response, so reloads the page, re-submitting the analysis and
bringing down the next server in the cluster.

Lather, rinse, repeat - and then you get to have a meeting to discuss
why we invest so much in HA infrastructure that can be made non-HA by
one user with a complex query. In those meetings it is much harder to
justify not restarting.

Cheers

Tom


Re: indexing - offline

2016-10-20 Thread Tom Evans
On Thu, Oct 20, 2016 at 5:38 PM, Rallavagu <rallav...@gmail.com> wrote:
> Solr 5.4.1 cloud with embedded jetty
>
> Looking for some ideas around offline indexing where an independent node
> will be indexed offline (not in the cloud) and added to the cloud to become
> leader so other cloud nodes will get replicated. Wonder if this is possible
> without interrupting the live service. Thanks.

How we do this, to reindex collection "foo":

1) First, collection "foo" should be an alias to the real collection,
eg "foo_1" aliased to "foo"
2) Have a node "node_i" in the cluster that is used for indexing. It
doesn't hold any shards of any collections
3) Use collections API to create collection "foo_2", with however many
shards required, but all placed on "node_i"
4) Index "foo_2" with new data with DIH or direct indexing to "node_1".
5) Use collections API to expand "foo_2" to all the nodes/replicas
that it should be on
6) Remove "foo_2" from "node_i"
7) Verify contents of "foo_2" are correct
8) Use collections API to change alias for "foo" to "foo_2"
9) Remove "foo_1" collection once happy

This avoids indexing overwhelming the performance of the cluster (or
any nodes in the cluster that receive queries), and can be performed
with zero downtime or config changes on the clients.
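
As a rough sketch of steps 3, 5 and 6 using the collections API (the
names, shard/replica identifiers and hosts below are placeholders):

    # 3) create the new collection, confined to the indexing node
    /admin/collections?action=CREATE&name=foo_2&numShards=4&replicationFactor=1&createNodeSet=node_i:8983_solr

    # 5) add replicas on the query nodes
    /admin/collections?action=ADDREPLICA&collection=foo_2&shard=shard1&node=node_a:8983_solr

    # 6) drop the indexing node's replica once the others are active
    /admin/collections?action=DELETEREPLICA&collection=foo_2&shard=shard1&replica=core_node1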

Cheers

Tom


Re: How to update from Solr Cloud 5.4.1 to 5.5.1

2016-08-29 Thread Tom Devel
Shawn,

Do you (or anybody else here) know of the upgrade steps from 6.1 to 6.2 in
this case? The release notes of 6.2 do not mention anything about
upgrading, but 6.2 has some good bugfixes.

If 6.2 made changes to the index format, is a drop-in replacement from 6.1
to 6.2 still possible?

Thanks,
Tom

On Sat, Aug 27, 2016 at 12:23 PM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 8/26/2016 10:22 AM, D'agostino Victor wrote:
> > Do you know in which version index format changes and if I should
> > update to a higher version ?
>
> In version 6.0, and again in the just-released 6.2, one aspect of the
> index format has been updated.  Version 6.1 didn't have any format
> changes from 6.0.  You won't see the new version reflected in any of the
> filenames in the index directory.
>
> Whether or not to upgrade depends on what features you need, and whether
> you need fixes included in the new version.  Not all of the fixed bugs
> in 6.x are applicable to 5.x -- some are fixes for problems introduced
> during 6.x development.
>
> > And about ZooKeeper ; the 3.4.8 is fine or should I update it too ?
>
> That's the newest stable version of zookeeper.  There are alpha releases
> of version 3.5.
>
> Solr includes zookeeper 3.4.6.  A 3.4.8 server will work, but no
> guarantees can be made about the 3.5 alpha versions.
>
> Thanks,
> Shawn
>
>


min()/max() on date fields using JSON facets

2016-07-25 Thread Tom Evans
Hi all

I'm trying to replace a use of the stats module with JSON facets in
order to calculate the min/max date range of documents in a query. For
the same search, "stats.field=date_published" returns this:

{u'date_published': {u'count': 86760,
 u'max': u'2016-07-13T00:00:00Z',
 u'mean': u'2013-12-11T07:09:17.676Z',
 u'min': u'2011-01-04T00:00:00Z',
 u'missing': 0,
 u'stddev': 50006856043.410477,
 u'sum': u'3814570-11-06T00:00:00Z',
 u'sumOfSquares': 1.670619719649826e+29}}

For the equivalent JSON facet - "{'date.max': 'max(date_published)',
'date.min': 'min(date_published)'}" - I'm returned this:

{u'count': 86760, u'date.max': 146836800.0, u'date.min': 129409920.0}

What do these numbers represent - I'm guessing it is milliseconds
since epoch? In UTC?
Is there any way to control the output format or TZ?
Is there any benefit in using JSON facets to determine this, or should
I just continue using stats?

Cheers

Tom


RE: Reference to SolrCore from SearchComponent

2016-07-21 Thread Ellis, Tom (Financial Markets IT)
Thanks! 

-Original Message-
From: Joel Bernstein [mailto:joels...@gmail.com] 
Sent: 21 July 2016 19:51
To: solr-user@lucene.apache.org
Subject: Re: Reference to SolrCore from SearchComponent

-- This email has reached the Bank via an external source --
 

There is a SolrCoreAware interface you can implement, which will provide access 
to the SolrCore. From there you can add a closeHook to the core.

Joel Bernstein
http://joelsolr.blogspot.com/

On Thu, Jul 21, 2016 at 2:34 PM, Ellis, Tom (Financial Markets IT) < 
tom.el...@lloydsbanking.com.invalid> wrote:

> Hi There,
>
> I'm in the process of creating a custom SearchComponent. This 
> component will have a long running thread performing an action to keep 
> a list updated. As SearchComponents do not seem to have a 
> destroy/close hook, I was wondering if there is a way of getting a 
> reference to the SolrCore the SearchComponent is instantiated in and 
> adding a CloseHook or similar? Is this possible?
>
> Cheers,
>
> Tom
>
>




Reference to SolrCore from SearchComponent

2016-07-21 Thread Ellis, Tom (Financial Markets IT)
Hi There,

I'm in the process of creating a custom SearchComponent. This component will 
have a long running thread performing an action to keep a list updated. As 
SearchComponents do not seem to have a destroy/close hook, I was wondering if 
there is a way of getting a reference to the SolrCore the SearchComponent is 
instantiated in and adding a CloseHook or similar? Is this possible?

Cheers,

Tom




Re: Node not recovering, leader elections not occuring

2016-07-19 Thread Tom Evans
On the nodes that have the replica in a recovering state we now see:

19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to
recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms , collection:
lookups slice: shard1
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

19-07-2016 16:18:28 INFO  RecoveryStrategy:444 - Replay not started,
or was not successful... still buffering updates.
19-07-2016 16:18:28 ERROR RecoveryStrategy:481 - Recovery failed -
trying again... (164)
19-07-2016 16:18:28 INFO  RecoveryStrategy:503 - Wait [12.0] seconds
before trying to recover again (attempt=165)


This is with the "leader that is not the leader" shut down.

Issuing a FORCELEADER via collections API doesn't in fact force a
leader election to occur.

Is there any other way to prompt Solr to have an election?

Cheers

Tom

On Tue, Jul 19, 2016 at 5:10 PM, Tom Evans <tevans...@googlemail.com> wrote:
> There are 11 collections, each only has one shard, and each node has
> 10 replicas (9 collections are on every node, 2 are just on one node).
> We're not seeing any OOM errors on restart.
>
> I think we're being patient waiting for the leader election to occur.
> We stopped the troublesome "leader that is not the leader" server
> about 15-20 minutes ago, but we still have not had a leader election.
>
> Cheers
>
> Tom
>
> On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> 
> wrote:
>> How many replicas per Solr JVM? And do you
>> see any OOM errors when you bounce a server?
>> And how patient are you being, because it can
>> take 3 minutes for a leaderless shard to decide
>> it needs to elect a leader.
>>
>> See SOLR-7280 and SOLR-7191 for the case
>> where lots of replicas are in the same JVM,
>> the tell-tale symptom is errors in the log as you
>> bring Solr up saying something like
>> "OutOfMemory error unable to create native thread"
>>
>> SOLR-7280 has patches for 6x and 7x, with a 5x one
>> being added momentarily.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>>> of the collections on it marked as "Recovering" or "Recovery Failed".
>>> It attempts to recover from the leader, but the leader responds with:
>>>
>>> Error while trying to recover.
>>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>> at 
>>> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>>> at 
>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at 
>>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: 
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>>

Re: Node not recovering, leader elections not occuring

2016-07-19 Thread Tom Evans
There are 11 collections, each only has one shard, and each node has
10 replicas (9 collections are on every node, 2 are just on one node).
We're not seeing any OOM errors on restart.

I think we're being patient waiting for the leader election to occur.
We stopped the troublesome "leader that is not the leader" server
about 15-20 minutes ago, but we still have not had a leader election.

Cheers

Tom

On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> How many replicas per Solr JVM? And do you
> see any OOM errors when you bounce a server?
> And how patient are you being, because it can
> take 3 minutes for a leaderless shard to decide
> it needs to elect a leader.
>
> See SOLR-7280 and SOLR-7191 for the case
> where lots of replicas are in the same JVM,
> the tell-tale symptom is errors in the log as you
> bring Solr up saying something like
> "OutOfMemory error unable to create native thread"
>
> SOLR-7280 has patches for 6x and 7x, with a 5x one
> being added momentarily.
>
> Best,
> Erick
>
> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans <tevans...@googlemail.com> wrote:
>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>> of the collections on it marked as "Recovering" or "Recovery Failed".
>> It attempts to recover from the leader, but the leader responds with:
>>
>> Error while trying to recover.
>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:3/solr: We are not the
>> leader
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> at 
>> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>> at 
>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at 
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:3/solr: We are not the
>> leader
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
>> ... 5 more
>>
>> and recovery never occurs.
>>
>> Each collection in this state has plenty (10+) of active replicas, but
>> stopping the server that is marked as the leader doesn't trigger a
>> leader election amongst these replicas.
>>
>> REBALANCELEADERS did nothing.
>> FORCELEADER complains that there is already a leader.
>> FORCELEADER with the purported leader stopped took 45 seconds,
>> reported status of "0" (and no other message) and kept the down node
>> as the leader (!)
>> Deleting the failed collection from the failed node and re-adding it
>> has the same "Leader said I'm not the leader" error message.
>>
>> Any other ideas?
>>
>> Cheers
>>
>> Tom


Node not recovering, leader elections not occuring

2016-07-19 Thread Tom Evans
Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
of the collections on it marked as "Recovering" or "Recovery Failed".
It attempts to recover from the leader, but the leader responds with:

Error while trying to recover.
core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:3/solr: We are not the
leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:3/solr: We are not the
leader
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
... 5 more

and recovery never occurs.

Each collection in this state has plenty (10+) of active replicas, but
stopping the server that is marked as the leader doesn't trigger a
leader election amongst these replicas.

REBALANCELEADERS did nothing.
FORCELEADER complains that there is already a leader.
FORCELEADER with the purported leader stopped took 45 seconds,
reported status of "0" (and no other message) and kept the down node
as the leader (!)
Deleting the failed collection from the failed node and re-adding it
has the same "Leader said I'm not the leader" error message.

Any other ideas?

Cheers

Tom


Matching all terms in a multiValued field

2016-07-01 Thread Ellis, Tom (Financial Markets IT)
Hi There,

I'm trying to create a search component for some document-level security. A user 
will have a number of tags assigned to them, and these will be passed to the 
search component, which will add a filter to whatever the user's original query 
was. Documents will be written with some or all of the user's tags, and the 
query must only return documents whose set of tags is included in the user's 
tags.

E.g. Alice is authorised to see 'confidential' and 'paid_source'

Bob is only authorised to see 'confidential'

Document 1 has tags confidential and paid_source - Alice should be able to see 
this document, but Bob should not.

So if I am creating a query for Bob, how can I write it so that he can't see 
Document 1? I.e. how do I create a query that checks the multiValued field for 
'confidential' but excludes documents that have anything else?

Cheers,

Tom Ellis
Consultant Developer - Excelian
Data Lake | Financial Markets IT
LLOYDS BANK COMMERCIAL BANKING




Strange highlighting on search

2016-06-16 Thread Tom Evans
Hi all

I'm investigating a bug whereby every term in the highlighted field
gets marked for highlighting, instead of just the words that match the
fulltext portion of the query. This is on Solr 5.5.0, but I didn't see
any bug fixes related to highlighting in the 5.5.1 or 6.0 release notes.

The query that affects it is where we have a not clause on a specific
field (not the fulltext field) and also only include documents where
that field has a value:

q: cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *]
AND -ingredient_tag_id:(35223)

This returns the correct results, but the highlighting has matched
every word in the results (see below for debugQuery output). If I
change the query to put the exclusion in to an fq, the highlighting is
correct again (and the results are correct):

q: cosmetics_packaging_fulltext:(Mist)
fq: {!cache=false} ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)

Is there any way I can make the query and highlighting work as
expected as part of q?

Is there any downside to putting the exclusion part in the fq in terms
of performance? We don't use score at all for our results, we always
order by other parameters.

Cheers

Tom

Query with strange highlighting:

{
  "responseHeader":{
"status":0,
"QTime":314,
"params":{
  "q":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
  "hl":"true",
  "hl.simple.post":"",
  "indent":"true",
  "fl":"id,product",
  "hl.fragsize":"0",
  "hl.fl":"product",
  "rows":"5",
  "wt":"json",
  "debugQuery":"true",
  "hl.simple.pre":""}},
  "response":{"numFound":10132,"start":0,"docs":[
  {
"id":"2403841-1498608",
"product":"Mist"},
  {
"id":"2410603-1502577",
"product":"Mist"},
  {
"id":"5988531-3882415",
"product":"Ao + Mist"},
  {
"id":"6020805-3904203",
"product":"UV Mist Cushion SPF 50+ PA+++"},
  {
"id":"2617977-1629335",
"product":"Ultra Radiance Facial Re-Hydrating Mist"}]
  },
  "highlighting":{
"2403841-1498608":{
  "product":["Mist"]},
"2410603-1502577":{
  "product":["Mist"]},
"5988531-3882415":{
  "product":["Ao + Mist"]},
"6020805-3904203":{
  "product":["UV Mist Cushion
SPF 50+ PA+++"]},
"2617977-1629335":{
  "product":["Ultra Radiance Facial
Re-Hydrating Mist"]}},
  "debug":{
"rawquerystring":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
"querystring":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
"parsedquery":"+cosmetics_packaging_fulltext:mist
+ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
"parsedquery_toString":"+cosmetics_packaging_fulltext:mist
+ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
"explain":{
  "2403841-1498608":"\n40.082462 = sum of:\n  39.92971 =
weight(cosmetics_packaging_fulltext:mist in 13983)
[ClassicSimilarity], result of:\n39.92971 =
score(doc=13983,freq=39.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  40.40386 =
fieldWeight in 13983, product of:\n6.244998 = tf(freq=39.0),
with freq of:\n  39.0 = termFreq=39.0\n6.469795 =
idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=13983)\n  0.15275055 = ingredient_tag_id:[0 TO *],
product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
  "2410603-1502577":"\n40.082462 = sum of:\n  39.92971 =
weight(cosmetics_packaging_fulltext:mist in 14023)
[ClassicSimilarity], result of:\n39.92971 =
score(doc=14023,freq=39.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  40.40386 =
fieldWeight in 14023, product of:\n6.244998 = tf(freq=39.0),
with freq of:\n  39.0 = termFreq=39.0\n6.469795 =
idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=14023)\n  0.15275055 = ingredient_tag_id:[0 TO *],
p

Re: result grouping in sharded index

2016-06-15 Thread Tom Evans
Do you have to group, or can you collapse instead?

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
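
As a rough sketch (the field name is just an example - use whatever id
field you group on), collapsing is a filter query plus an optional
expand of the collapsed groups:

curl 'http://localhost:8983/solr/mycoll/select' -d 'q=*:*&fq={!collapse field=group_id}&expand=true&expand.rows=5'

One caveat in a sharded index: documents sharing the same collapse key
need to be routed to the same shard for the results to be correct.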

Cheers

Tom

On Tue, Jun 14, 2016 at 4:57 PM, Jay Potharaju <jspothar...@gmail.com> wrote:
> Any suggestions on how to handle result grouping in sharded index?
>
>
> On Mon, Jun 13, 2016 at 1:15 PM, Jay Potharaju <jspothar...@gmail.com>
> wrote:
>
>> Hi,
>> I am working on a functionality that would require me to group documents
>> by a id field. I read that the ngroups feature would not work in a sharded
>> index.
>> Can someone recommend how to handle this in a sharded index?
>>
>>
>> Solr Version: 5.5
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
>>
>> --
>> Thanks
>> Jay
>>
>>
>
>
>
> --
> Thanks
> Jay Potharaju


Re: Import html data in mysql and map schemas using onlySolrCELL+TIKA+DIH [scottchu]

2016-05-24 Thread Tom Evans
On Tue, May 24, 2016 at 3:06 PM, Scott Chu <scott@udngroup.com> wrote:
> p.s. There're really many many extensive, worthy stuffs in Solr. If the
> project team can provide some "dictionary" of them, It would be a "Santa 
> Claus"
> for we solr users. Ha! Just a X'mas wish! Sigh! I know it's quite not 
> possbile.
> I really like to study them one after another, to learn about all of them.
> However, Internet IT goes too fast to have time to congest all of the great
>  stuffs in Solr.

The reference guide is both extensive and broadly informative.
Start from the top page and browse away!

https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

It's also handy to keep the glossary open for any terms that you don't recognise:

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary

Cheers

Tom


Re: SolrCloud increase replication factor

2016-05-23 Thread Tom Evans
On Mon, May 23, 2016 at 10:37 AM, Hendrik Haddorp
<hendrik.hadd...@gmx.net> wrote:
> Hi,
>
> I have a SolrCloud 6.0 setup and created my collection with a
> replication factor of 1. Now I want to increase the replication factor
> but would like the replicas for the same shard to be on different nodes,
> so that my collection does not fail when one node fails. I tried two
> approaches so far:
>
> 1) When I use the collections API with the MODIFYCOLLECTION action [1] I
> can set the replication factor but that did not result in the creation
> of additional replicas. The Solr Admin UI showed that my replication
> factor changed but otherwise nothing happened. A reload of the
> collection did also result in no change.
>
> 2) Using the ADDREPLICA action [2] from the collections API I have to
> add the replicas to the shard individually, which is a bit more
> complicated but otherwise worked. During testing this did however at
> least once result in the replica being created on the same node. My
> collection was split in 4 shards and for 2 of them all replicas ended up
> on the same node.
>
> So is the only option to create the replicas manually and also pick the
> nodes manually or is the perceived behavior wrong?
>
> regards,
> Hendrik
>
> [1]
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-modifycoll
> [2]
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica


With ADDREPLICA, you can specify the node to create the replica on. If
you are using a script to increase/remove replicas, you can simply
incorporate the logic you desire in to your script - you can also use
CLUSTERSTATUS to get a list of nodes/collections/shards etc in order
to inform the logic in the script. This is the approach we took, we
have a fabric script to add/remove extra nodes to/from the cluster, it
works well.
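
For illustration (host, collection and shard names made up), the two
calls look roughly like:

curl 'http://solr01:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=solr02:8983_solr'

curl 'http://solr01:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'

The first creates the replica on the node you pick, the second returns
the JSON cluster state your script can use to pick it.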

The alternative is to put the logic in to Solr itself, using what Solr
calls a "snitch" to define the rules on where replicas are created.
The snitch is specified at collection creation time, or you can use
MODIFYCOLLECTION to set it after the fact. See this wiki patch for
details:

https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement
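
As a sketch (collection and config names are illustrative), a rule that
keeps fewer than 2 replicas of any shard on a single node would be:

curl 'http://solr01:8983/solr/admin/collections' -d 'action=CREATE&name=mycoll&numShards=4&replicationFactor=2&collection.configName=myconf&rule=shard:*,replica:<2,node:*'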

Cheers

Tom


Re: Creating a collection with 1 shard gives a weird range

2016-05-17 Thread Tom Evans
On Tue, May 17, 2016 at 9:40 AM, John Smith <solr-u...@remailme.net> wrote:
> I'm trying to create a collection starting with only one shard
> (numShards=1) using a compositeID router. The purpose is to start small
> and begin splitting shards when the index grows larger. The shard
> created gets a weird range value: 8000-7fff, which doesn't look
> effective. Indeed, if a try to import some documents using a DIH, none
> gets added.
>
> If I create the same collection with 2 shards, the ranges seem more
> logical (0-7fff & 8000-). In this case documents are
> indexed correctly.
>
> Is this behavior by design, i.e. is a minimum of 2 shards required? If
> not, how can I create a working collection with a single shard?
>
> This is Solr-6.0.0 in cloud mode with zookeeper-3.4.8.
>

I believe this is as designed, see this email from Shawn:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c570d0a03.5010...@elyograg.org%3E

Cheers

Tom


Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
That helps. I ended up updating the solr.in.sh file in /etc/default and that
was then getting picked up. Thanks

> On May 11, 2016, at 2:05 PM, Tom Gullo <tomgu...@gmail.com> wrote:
> 
> My Solr installation is running on Tomcat on port 8080 with a  web context 
> name that is different than /solr.   We want to move to a basic jetty setup 
> with all the defaults.  I haven’t found a clean way to do this.  A lot of the 
> values like baseurl and /leader/elect/shard1 have values that need to be 
> updated.  If I try shutting down the servers, change the zookeeper settings 
> and then restart Solr in Jetty I get issues - like Solr thinks they are 
> replicas.   So I’m looking to see if anyone knows what is the cleanest way to 
> move from a Tomcat/8080 install to a Jetty/8983 one.
> 
> Thanks
> 
>> On May 11, 2016, at 1:59 PM, John Bickerstaff <j...@johnbickerstaff.com> 
>> wrote:
>> 
>> I may be answering the wrong question - but SolrCloud goes in by default on
>> 8983, yes?  Is yours currently on 8080?
>> 
>> I don't recall where, but I think I saw a config file setting for the port
>> number (In Solr I mean)
>> 
>> Am I on the right track or are you asking something other than how to get
>> Solr on host:8983/solr ?
>> 
>> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo <tomgu...@gmail.com> wrote:
>> 
>>> I need to change the web context and the port for a SolrCloud installation.
>>> 
>>> Example, change:
>>> 
>>> host:8080/some-api-here/
>>> 
>>> to this:
>>> 
>>> host:8983/solr/
>>> 
>>> Does anyone know how to do this with SolrCloud?  There are values stored
>>> in clusterstate.json and /leader/elect and I could change them
>>> but that seems a little messy.
>>> 
>>> Thanks
> 



Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
My Solr installation is running on Tomcat on port 8080 with a  web context name 
that is different than /solr.   We want to move to a basic jetty setup with all 
the defaults.  I haven’t found a clean way to do this.  A lot of the values 
like baseurl and /leader/elect/shard1 have values that need to be updated.  If 
I try shutting down the servers, change the zookeeper settings and then restart 
Solr in Jetty I get issues - like Solr thinks they are replicas.   So I’m 
looking to see if anyone knows what is the cleanest way to move from a 
Tomcat/8080 install to a Jetty/8983 one.

Thanks

> On May 11, 2016, at 1:59 PM, John Bickerstaff <j...@johnbickerstaff.com> 
> wrote:
> 
> I may be answering the wrong question - but SolrCloud goes in by default on
> 8983, yes?  Is yours currently on 8080?
> 
> I don't recall where, but I think I saw a config file setting for the port
> number (In Solr I mean)
> 
> Am I on the right track or are you asking something other than how to get
> Solr on host:8983/solr ?
> 
> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo <tomgu...@gmail.com> wrote:
> 
>> I need to change the web context and the port for a SolrCloud installation.
>> 
>> Example, change:
>> 
>> host:8080/some-api-here/
>> 
>> to this:
>> 
>> host:8983/solr/
>> 
>> Does anyone know how to do this with SolrCloud?  There are values stored
>> in clusterstate.json and /leader/elect and I could change them
>> but that seems a little messy.
>> 
>> Thanks



changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
I need to change the web context and the port for a SolrCloud installation.

Example, change:

host:8080/some-api-here/

to this:

host:8983/solr/

Does anyone know how to do this with SolrCloud?  There are values stored in 
clusterstate.json and /leader/elect and I could change them but 
that seems a little messy.

Thanks

Re: Indexing 700 docs per second

2016-04-19 Thread Tom Evans
On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson <mark123lea...@gmail.com> wrote:
> Hi,
>
> I have a requirement to index (mainly updation) 700 docs per second.
> Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260
> byes (6 fields out of which only 2 will undergo updation at the above
> rate). This collection has around 122Million docs and that count is pretty
> much a constant.
>
> 1. Can I manage this updation rate with a non-sharded ie single Solr
> instance set up?
> 2. Also is atomic update or a full update (the whole doc) of the changed
> records the better approach in this case.
>
> Could some one please share their views/ experience?

Try it and see - everyone's data/schemas are different and can affect
indexing speed. It certainly sounds achievable enough - presumably you
can at least produce the documents at that rate?
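
If you do go the atomic route, a minimal sketch of an update (field
names invented) looks like the below - note it needs all fields stored
(so the document can be rebuilt) and an updateLog configured:

curl 'http://localhost:8983/solr/mycoll/update?commit=true' -H 'Content-Type: application/json' -d '[{"id": "12345", "price": {"set": 9.99}, "view_count": {"inc": 1}}]'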

Cheers

Tom


Re: Solr Support for BM25F

2016-04-18 Thread Tom Burton-West
Hi David,

It may not matter for your use case, but just in case you really are
interested in the "real" BM25F, there is a difference between configuring k1
and b for different fields in Solr and a "real" BM25F implementation. This
has to do with Solr's model of fields being mini-documents (i.e. each field
has its own length, idf and tf). See the discussion in
https://issues.apache.org/jira/browse/LUCENE-2959, particularly these
comments by Robert Muir:

"Actually as far as BM25f, this one presents a few challenges (some already
discussed on LUCENE-2091 <https://issues.apache.org/jira/browse/LUCENE-2091>
).

To summarize:

   - for any field, Lucene has a per-field terms dictionary that contains
   that term's docFreq. To compute BM25f's IDF method would be challenging,
   because it wants a docFreq "across all the fields". (its not clear to me at
   a glance either from the original paper, if this should be across only the
   fields in the query, across all the fields in the document, and if a
   "static" schema is implied in this scoring system (in lucene document 1 can
   have 3 fields and document 2 can have 40 different ones, even with
   different properties).
   - the same issue applies to length normalization, lucene has a "field
   length" but really no concept of document length."

Tom

On Thu, Apr 14, 2016 at 12:41 PM, David Cawley <david.cawl...@mail.dcu.ie>
wrote:

> Hello,
> I am developing an enterprise search engine for a project and I was hoping
> to implement BM25F ranking algorithm to configure the tuning parameters on
> a per field basis. I understand BM25 similarity is now supported in Solr
> but I was hoping to be able to configure k1 and b for different fields such
> as title, description, anchor etc, as they are structured documents.
> I am fairly new to Solr so any help would be appreciated. If this is
> possible or any steps as to how I can go about implementing this it would
> be greatly appreciated.
>
> Regards,
>
> David
>
> Current Solr Version 5.4.1
>


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Tom Evans
On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
<j...@johnbickerstaff.com> wrote:
> Thanks all - very helpful.
>
> @Shawn - your reply implies that even if I'm hitting the URL for a single
> endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
> (I understand the caveat about that single endpoint being a potential point
> of failure).  I just want to verify that I'm interpreting your response
> correctly...
>
> (I have been asked to provide IT with a comprehensive list of options prior
> to a design discussion - which is why I'm trying to get clear about the
> various options)
>
> In a nutshell, I think I understand the following:
>
> a. Even if hitting a single URL, the Solr Cloud will "balance" across all
> available nodes for searching
>   Caveat: That single URL represents a potential single point of
> failure and this should be taken into account
>
> b. SolrJ's CloudSolrClient API provides the ability to distribute load --
> based on Zookeeper's "knowledge" of all available Solr instances.
>   Note: This is more robust than "a" due to the fact that it
> eliminates the "single point of failure"
>
> c.  Use of a load balancer hitting all known Solr instances will be fine -
> although the search requests may not run on the Solr instance the load
> balancer targeted - due to "a" above.
>
> Corrections or refinements welcomed...

With option a), although queries will be distributed across the
cluster, all queries will be going through that single node. Not only
is that a single point of failure, but you risk saturating the
inter-node network traffic, possibly resulting in lower QPS and higher
latency on your queries.

With option b), as well as SolrJ, recent versions of pysolr have a
ZK-aware SolrCloud client that behaves in a similar way.

With option c), you can use the preferLocalShards so that shards that
are local to the queried node are used in preference to distributed
shards. Depending on your shard/cluster topology, this can increase
performance if you are returning large amounts of data - many or large
fields or many documents.
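
A sketch of what a balancer-targeted request might look like with that
parameter (host and collection names are illustrative):

curl 'http://solr-node1:8983/solr/items/select?q=*:*&preferLocalShards=true&wt=json'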

Cheers

Tom


Re: Anticipated Solr 5.5.1 release date

2016-04-15 Thread Tom Evans
Awesome, thanks :)

On Fri, Apr 15, 2016 at 4:19 PM, Anshum Gupta <ans...@anshumgupta.net> wrote:
> Hi Tom,
>
> I plan on getting a release candidate out for vote by Monday. If all goes
> well, it'd be about a week from then for the official release.
>
> On Fri, Apr 15, 2016 at 6:52 AM, Tom Evans <tevans...@googlemail.com> wrote:
>
>> Hi all
>>
>> We're currently using Solr 5.5.0 and converting our regular old style
>> facets into JSON facets, and are running in to SOLR-8155 and
>> SOLR-8835. I can see these have already been back-ported to 5.5.x
>> branch, does anyone know when 5.5.1 may be released?
>>
>> We don't particularly want to move to Solr 6, as we have only just
>> finished validating 5.5.0 with our original queries!
>>
>> Cheers
>>
>> Tom
>>
>
>
>
> --
> Anshum Gupta


Anticipated Solr 5.5.1 release date

2016-04-15 Thread Tom Evans
Hi all

We're currently using Solr 5.5.0 and converting our regular old style
facets into JSON facets, and are running in to SOLR-8155 and
SOLR-8835. I can see these have already been back-ported to 5.5.x
branch, does anyone know when 5.5.1 may be released?

We don't particularly want to move to Solr 6, as we have only just
finished validating 5.5.0 with our original queries!

Cheers

Tom


SolrCloud no leader for collection

2016-04-05 Thread Tom Evans
Hi all, I have an 8 node SolrCloud 5.5 cluster with 11 collections,
most of them in a 1 shard x 8 replicas configuration. We have 5 ZK
nodes.

During the night, we attempted to reindex one of the larger
collections. We reindex by pushing json docs to the update handler
from a number of processes. It seemed this overwhelmed the servers,
and caused all of the collections to fail and end up in either a down
or a recovering state, often with no leader.

Restarting and rebooting the servers brought a lot of the collections
back online, but we are left with a few collections for which all the
nodes hosting those replicas are up, but the replica reports as either
"active" or "down", and with no leader.

Trying to force a leader election has no effect, it keeps choosing a
leader that is in "down" state. Removing all the nodes that are in
"down" state and forcing a leader election also has no effect.


Any ideas? The only viable option I see is to create a new collection,
index it and then remove the old collection and alias it in.

Cheers

Tom


Re: Creating new cluster with existing config in zookeeper

2016-03-23 Thread Tom Evans
On Wed, Mar 23, 2016 at 3:43 PM, Robert Brown <r...@intelcompute.com> wrote:
> So I setup a new solr server to point to my existing ZK configs.
>
> When going to the admin UI on this new server I can see the shards/replica's
> of the existing collection, and can even query it, even tho this new server
> has no cores on it itself.
>
> Is this all expected behaviour?
>
> Is there any performance gain with what I have at this precise stage?  The
> extra server certainly makes it appear i could balance more load/requests,
> but I guess the queries are just being forwarded on to the servers with the
> actual data?
>
> Am I correct in thinking I can now create a new collection on this host, and
> begin to build up a new cluster?  and they won't interfere with each other
> at all?
>
> Also, that I'll be able to see both collections when using the admin UI
> Cloud page on any of the servers in either collection?
>

I'm confused slightly:

SolrCloud is a (singular) cluster of servers, storing all of its state
and configuration underneath a single zookeeper path. The cluster
contains collections. Collections are tied to a particular config set
within the cluster. Collections are made up of 1 or more shards. Each
shard is a core, and there are 1 or more replicas of each core.

You can add more servers to the cluster, and then create a new
collection with the same config as an existing collection, but it is
still part of the same cluster. Of course, you could think of a set of
servers within a cluster as a "logical" cluster if it just serves
particular collection, but "cluster" to me would be all of the servers
within the same zookeeper tree, because that is where cluster state is
maintained.

Cheers

Tom


Re: Re: Paging and cursorMark

2016-03-23 Thread Tom Evans
On Wed, Mar 23, 2016 at 12:21 PM, Vanlerberghe, Luc
<luc.vanlerber...@bvdinfo.com> wrote:
> I worked on something similar a couple of years ago, but didn’t continue work 
> on it in the end.
>
> I've included the text of my original mail.
> If you're interested, I could try to find the sources I was working on at the 
> time
>
> Luc
>

Thanks both Luc and Steve. I'm not sure if we will have time to deploy
patched versions of things to production - time is always the enemy :(
- and we're not a Java shop, so there is a non-trivial time investment
in just building replacement jars, let alone getting that integrated
into our RPMs - but I'll definitely try it out on my dev server.

The change seems excessively complex imo, but maybe I'm not seeing the
use cases for skip.

To my mind, calculating a nextCursorMark is cheap and only relies on
having a strict sort ordering, which is also cheap to check. If that
condition is met, you should get a nextCursorMark in your response
regardless of whether you specified a cursorMark in the request, to
allow you to efficiently get the next page.
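
For reference, something like this (field names are just examples - the
sort has to include the uniqueKey as a tie-breaker to be strict):

curl 'http://localhost:8983/solr/mycoll/select' -d 'q=*:*&sort=published desc,id asc&rows=10&cursorMark=*&wt=json'

and the response would then carry a nextCursorMark to pass in on the
following request.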

This would still leave slightly pathological performance if you skip
to page N and then iterate back towards page 0, which Luc's idea of a
previousCursorMark could solve. cursorMark is easy to implement - you
can simply ignore docs which sort lower than that mark. Can you do the
same with previousCursorMark? Would it not require keeping a buffer of
"rows" documents, and stopping when a document which sorts higher than
the supplied mark appears? That seems more complex, but maybe I'm not
understanding the internals correctly.

Fortunately for us, 90% of our users prefer infinite scroll, and 97%
of them never go beyond page 3.

Cheers

Tom


Paging and cursorMark

2016-03-22 Thread Tom Evans
Hi all

With Solr 5.5.0, we're trying to improve our paging performance. When
we are delivering results using infinite scrolling, cursorMark is
perfectly fine - one page is followed by the next. However, we also
offer traditional paging of results, and this is where it gets a
little tricky.

Say we have 10 results per page, and a user wants to jump from page 1
to page 20, and then wants to view page 21, there doesn't seem to be a
simple way to get the nextCursorMark. We can make an inefficient
request for page 20 (start=190, rows=10), but we cannot give that
request a cursorMark=* as it contains start=190.

Consequently, if the user clicks to page 21, we have to continue along
using start=200, as we have no cursorMark. The only way I can see to
get a cursorMark at that point is to omit the start=200, and instead
say rows=210, and ignore the first 200 results on the client side.
Obviously, this gets more and more inefficient the deeper we page - I
know that internally to Solr, using start=200&rows=10 has to do the
same work as rows=210, but less data is sent over the wire to the
client.

As I understand it, the cursorMark is a hash of the sort values of the
last document returned, so I don't really see why it is forbidden to
specify start=190&rows=10&cursorMark=* - why is it not possible to
calculate the nextCursorMark from the last document returned?

I was also thinking a possible temporary workaround would be to
request start=190&rows=10, note the last document returned, and then
make a subsequent query for q=id:"<id of last document>"&rows=1&cursorMark=*.
This seems to work, but means an extra Solr query for no real reason.
Is there any other problem to doing this?
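
In curl terms the workaround is roughly this (names invented, and the
sort must be identical in both requests):

curl 'http://localhost:8983/solr/mycoll/select' -d 'q=*:*&sort=published desc,id asc&start=190&rows=10'
curl 'http://localhost:8983/solr/mycoll/select' -d 'q=id:"LAST_ID"&sort=published desc,id asc&rows=1&cursorMark=*'

where LAST_ID is the id of the final document in the first response,
and the second response's nextCursorMark is then used for page 21.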

Is there some other simple trick I am missing that we can use to get
both the page of results we want and a nextCursorMark for the
subsequent page?

Cheers

Tom


Re: Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
On Wed, Mar 16, 2016 at 4:10 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 3/16/2016 8:14 AM, Tom Evans wrote:
>> The problem occurs when we attempt to query a node to see if products
>> or items is active on that node. The balancer (haproxy) requests the
>> ping handler for the appropriate collection, however all the nodes
>> return OK for all the collections(!)
>>
>> Eg, on node01, it has replicas for products and skus, but the ping
>> handler for /solr/items/admin/ping returns 200!
>
> This returns OK because as long as one replica for every shard in
> "items" is available somewhere in the cloud, you can make a request for
> "items" on that node and it will work.  Or at least it *should* work,
> and if it's not working, that's a bug.  I remember that one of the older
> 4.x versions *did* have a bug where queries for a collection would only
> work if the node actually contained shards for that collection.

Sorry, this is Solr 5.5, I should have said.

Yes, we can absolutely make a request of "items", and it will work
correctly. However, we are making requests of "skus" that join to
"products", and the query is routed to a node which has only "skus"
and "items", and the request fails because joins can only work over
local replicas.

To fix this, we now have two additional balancers:

solr: has all the nodes, all nodes are valid backends
solr-items: has all the nodes in the cluster, but nodes are only valid
backends if it has "items" and "skus" replicas.
solr-products: has all the nodes in the cluster, but nodes are only
valid backends if it has "products" and "skus" replicas

(I'm simplifying things a bit, there are another 6 collections that
are on all nodes, hence the main balancer.)

The new balancers need a cheap way of checking what nodes are valid,
and ideally I'd like that check to not involve a query with a join
clause!

Cheers

Tom


Re: Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
On Wed, Mar 16, 2016 at 2:14 PM, Tom Evans <tevans...@googlemail.com> wrote:
> Hi all
>
> [ .. ]
>
> The option I'm trying now is to make two ping handler for skus that
> join to one of items/products, which should fail on the servers which
> do not support it, but I am concerned that this is a little
> heavyweight for a status check to see whether we can direct requests
> at this server or not.

This worked, but I would still be interested in a lighter-weight approach
that doesn't involve joins to see if a given collection has a shard on
this server. I suspect that might require a custom ping handler plugin
however.
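
For anyone following along, a rough sketch of the kind of handler I
mean (field names invented - you could equally define it directly in
solrconfig.xml instead of via the config API):

curl 'http://localhost:8983/solr/skus/config' -H 'Content-Type: application/json' -d '{
  "add-requesthandler": {
    "name": "/admin/ping-products",
    "class": "solr.PingRequestHandler",
    "invariants": {
      "q": "{!join from=id to=product_id fromIndex=products}*:*",
      "rows": "0"
    }
  }
}'

The handler fails on nodes that don't have a local "products" replica,
which is exactly what the balancer check needs, but it does run a join
on every health check.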

Cheers

Tom


Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
Hi all

I have a cloud setup with 8 nodes and 3 collections, products, items
and skus. All collections have just one shard, products has 6
replicas, items has 2 replicas, skus has 8 replicas. No node has both
products and items, all nodes have skus

Some of our queries join from sku to either products or items. If the
query is directed at a node without the appropriate shard on them, we
obviously get an error, so we have separate balancers for products and
items.

The problem occurs when we attempt to query a node to see if products
or items is active on that node. The balancer (haproxy) requests the
ping handler for the appropriate collection, however all the nodes
return OK for all the collections(!)

Eg, on node01, it has replicas for products and skus, but the ping
handler for /solr/items/admin/ping returns 200!

This means that as far as the balancer is concerned, node01 is a valid
destination for item queries, and inevitably it blows up as soon as
such a query is made to it.

As I understand it, this is because the URL we are checking is for the
collection ("items") rather than a specific core
("items_shard1_replica1")

Is there a way to make the ping handler only check local shards? I
have tried with distrib=false=false, but it still
returns a 200.

The option I'm trying now is to make two ping handler for skus that
join to one of items/products, which should fail on the servers which
do not support it, but I am concerned that this is a little
heavyweight for a status check to see whether we can direct requests
at this server or not.

Cheers

Tom


mergeFactor/maxMergeDocs is deprecated

2016-03-03 Thread Tom Evans
Hi all

Updating to Solr 5.5.0, and getting these messages in our error log:

Beginning with Solr 5.5, <mergeFactor> is deprecated, configure it on
the relevant <mergePolicyFactory> instead.

Beginning with Solr 5.5, <maxMergeDocs> is deprecated, configure it on
the relevant <mergePolicyFactory> instead.

However, mergeFactor is only mentioned in a commented-out section of
our solrconfig.xml files, and maxMergeDocs is not mentioned at all.

> $ ack -B 1 -A 1 '<mergeFactor'
lookups/conf/solrconfig.xml
210-

> $ ack --all maxMergeDocs
> $

Any ideas?

Cheers

Tom


Re: Separating cores from Solr home

2016-03-03 Thread Tom Evans
Hmm, I've worked around this by setting the directory where the
indexes should live to be the actual solr home, and symlinking the files
from the current release into that directory, but it feels icky.

Any better ideas?

Cheers

Tom

On Thu, Mar 3, 2016 at 11:12 AM, Tom Evans <tevans...@googlemail.com> wrote:
> Hi all
>
> I'm struggling to configure solr cloud to put the index files and
> core.properties in the correct places in SolrCloud 5.5. Let me explain
> what I am trying to achieve:
>
> * solr is installed in /opt/solr
> * the user who runs solr only has read only access to that tree
> * the solr home files - custom libraries, log4j.properties, solr.in.sh
> and solr.xml - live in /data/project/solr/releases/, which
> is then the target of a symlink /data/project/solr/releases/current
> * releasing a new version of the solr home (eg adding/changing
> libraries, changing logging options) is done by checking out a fresh
> copy of the solr home, switching the symlink and restarting solr
> * the solr core.properties and any data live in /data/project/indexes,
> so they are preserved when new solr home is released
>
> Setting core specific dataDir with absolute paths in solrconfig.xml
> only gets me part of the way, as the core.properties for each shard is
> created inside the solr home.
>
> This is obviously no good, as when releasing a new version of the solr
> home, they will no longer be in the current solr home.
>
> Cheers
>
> Tom


Separating cores from Solr home

2016-03-03 Thread Tom Evans
Hi all

I'm struggling to configure solr cloud to put the index files and
core.properties in the correct places in SolrCloud 5.5. Let me explain
what I am trying to achieve:

* solr is installed in /opt/solr
* the user who runs solr only has read only access to that tree
* the solr home files - custom libraries, log4j.properties, solr.in.sh
and solr.xml - live in /data/project/solr/releases/, which
is then the target of a symlink /data/project/solr/releases/current
* releasing a new version of the solr home (eg adding/changing
libraries, changing logging options) is done by checking out a fresh
copy of the solr home, switching the symlink and restarting solr
* the solr core.properties and any data live in /data/project/indexes,
so they are preserved when new solr home is released

Setting core specific dataDir with absolute paths in solrconfig.xml
only gets me part of the way, as the core.properties for each shard is
created inside the solr home.

This is obviously no good, as when releasing a new version of the solr
home, they will no longer be in the current solr home.

Cheers

Tom


Re: docValues error

2016-02-29 Thread Tom Evans
On Mon, Feb 29, 2016 at 11:43 AM, David Santamauro
<david.santama...@gmail.com> wrote:
> You will have noticed below, the field definition does not contain
> multiValues=true

What version of the schema are you using? In pre 1.1 schemas,
multiValued="true" is the default if it is omitted.

Cheers

Tom


Re: Display entire string containing query string

2016-02-18 Thread Tom Running
Hello
Thank you for your reply.
I am wondering if you can clarify a bit more for me. Is
field_where_string_may_be_present something that I have to specify? I am
searching an HTML page.
For example, if I search for the word "name", I am trying to display the
entire sentence containing "name = T" or maybe "name: T". Ultimately by
searching for the string "name" I am trying to find the value of name.

Thanks for your time. I appreciate your help
-T
On Feb 18, 2016 1:18 AM, "Binoy Dalal" <binoydala...@gmail.com> wrote:

> Append =
>
> On Thu, 18 Feb 2016, 11:35 Tom Running <runningt...@gmail.com> wrote:
>
> > Hello,
> >
> > I am working on a project using Solr to search data from retrieved from
> > Nutch.
> >
> > I have successfully integrated Nutch with Solr, and Solr is able to
> search
> > Nutch's data.
> >
> > However I am having a bit of a problem. If I query Solr, it will bring
> back
> > the numfound and which document the query string was found in, but it
> will
> > not display the string that contains the query string.
> >
> > Can anyone help on how to display the entire string that contains the
> > query.
> >
> >
> > I appreciate your time and guidance. Thank you so much!
> >
> > -T
> >
> --
> Regards,
> Binoy Dalal
>


Display entire string containing query string

2016-02-17 Thread Tom Running
Hello,

I am working on a project using Solr to search data retrieved from
Nutch.

I have successfully integrated Nutch with Solr, and Solr is able to search
Nutch's data.

However I am having a bit of a problem. If I query Solr, it will bring back
the numfound and which document the query string was found in, but it will
not display the string that contains the query string.

Can anyone help on how to display the entire string that contains the query.


I appreciate your time and guidance. Thank you so much!

-T


Solr and Nutch integration

2016-02-16 Thread Tom Running
I am having problems configuring Solr to read Nutch data or integrate with
Nutch.
Was anyone able to get Solr 5.4.x to work with Nutch?

I went through a lot of Google articles and am still not able to get Solr 5.4.1
to search Nutch contents.

Any howto or working configuration sample that you can share would be
greatly appreciated.

Thanks,
Toom


Re: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Tom Evans
On Wed, Feb 10, 2016 at 12:13 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> Hi Tom - thanks. But judging from the article and SOLR-6348 faceting stats 
> over ranges is not yet supported. More specifically, SOLR-6352 is what we 
> would need.
>
> [1]: https://issues.apache.org/jira/browse/SOLR-6348
> [2]: https://issues.apache.org/jira/browse/SOLR-6352
>
> Thanks anyway, at least we found the tickets :)
>

No problem - as I was reading this I was thinking "But wait, I *know*
we do this ourselves for average price vs month published". In fact, I
was forgetting that we index the ranges that we will want to facet
over as part of the document - so a document with a date_published of
"2010-03-29T00:00:00Z" also has a date_published.month of "201003"
(and a bunch of other ranges that we want to facet by). The frontend
then converts those fields in to the appropriate values for display.

This might be an acceptable solution for you guys too, depending on
how many ranges that you require, and how much larger it would make
your index.
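
As a sketch (the price field and collection name are illustrative,
date_published.month is the pre-computed field), the facet side then
looks like:

curl 'http://localhost:8983/solr/mycoll/select' -d 'q=*:*&rows=0&json.facet={
  avg_price_by_month: {
    type: terms,
    field: "date_published.month",
    limit: -1,
    sort: "index asc",
    facet: { avg_price: "avg(price)" }
  }
}'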

Cheers

Tom


Re: Json faceting, aggregate numeric field by day?

2016-02-10 Thread Tom Evans
On Wed, Feb 10, 2016 at 10:21 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> Hi - if we assume the following simple documents:
>
> <doc>
>   <date>2015-01-01T00:00:00Z</date>
>   <value>2</value>
> </doc>
> <doc>
>   <date>2015-01-01T00:00:00Z</date>
>   <value>4</value>
> </doc>
> <doc>
>   <date>2015-01-02T00:00:00Z</date>
>   <value>3</value>
> </doc>
> <doc>
>   <date>2015-01-02T00:00:00Z</date>
>   <value>7</value>
> </doc>
>
> Can i get a daily average for the field 'value' by day? e.g.
>
> <result>
>   <avg name="2015-01-01">3.0</avg>
>   <avg name="2015-01-02">5.0</avg>
> </result>
>
> Reading the documentation, i don't think i can, or i am missing it 
> completely. But i just want to be sure.

Yes, you can facet by day, and use the stats component to calculate
the mean average. This blog post explains it:

https://lucidworks.com/blog/2015/01/29/you-got-stats-in-my-facets/
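
With the field names from your example it looks something like this (it
only buckets per day here because your dates are already truncated to
midnight):

curl 'http://localhost:8983/solr/mycoll/select' -d 'q=*:*&rows=0&wt=json&stats=true&stats.field={!tag=avg mean=true}value&facet=true&facet.pivot={!stats=avg}date'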

Cheers

Tom

