Re: Indexing performance 7.3 vs 8.7

2020-12-23 Thread Bram Van Dam
On 23/12/2020 16:00, Ron Buchanan wrote:
>   - both run Java 1.8, but 7.3 is running HotSpot and 8.7 is running
>   OpenJDK (and a bit newer)

If you're using G1GC, you probably want to give Java 11 a go. It's an
easy thing to test, and it's had a positive impact for us. Your mileage
may vary.

 - Bram


Re: Reindexing major upgrades

2020-10-06 Thread Bram Van Dam
On 05/10/2020 16:02, Rafael Sousa wrote:
> Having things reindexed from scratch is not
> an option, so, is there a way of creating a 8.6.2 index from a pre-existing
> 6.5 index or something like that?

Sadly there is no such way. If all your fields are stored you might be
able to whip up something which can read all the data from old Solr and
write it to new Solr, without having to re-read all your documents from
the original source. But that's still pretty painful.
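
For what it's worth, a rough sketch of that read-from-old/write-to-new
approach with SolrJ, in case it helps. This is a sketch only: the host
names, collection names and the "id" uniqueKey are assumptions, and you'd
also want to skip internal fields like _version_ and any stored copyField
targets:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CopyStoredFields {
  public static void main(String[] args) throws Exception {
    try (SolrClient oldSolr = new HttpSolrClient.Builder("http://oldhost:8983/solr/oldCollection").build();
         SolrClient newSolr = new HttpSolrClient.Builder("http://newhost:8983/solr/newCollection").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(1000);
      q.addSort("id", SolrQuery.ORDER.asc);            // cursorMark requires a sort on the uniqueKey
      String cursor = CursorMarkParams.CURSOR_MARK_START;
      while (true) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
        QueryResponse rsp = oldSolr.query(q);
        for (SolrDocument doc : rsp.getResults()) {
          SolrInputDocument in = new SolrInputDocument();
          for (String field : doc.getFieldNames()) {
            if (!"_version_".equals(field)) {           // skip internal fields
              in.addField(field, doc.getFieldValue(field));
            }
          }
          newSolr.add(in);
        }
        String next = rsp.getNextCursorMark();
        if (next.equals(cursor)) {
          break;                                        // cursor stopped moving: we've seen everything
        }
        cursor = next;
      }
      newSolr.commit();
    }
  }
}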

 - Bram


Re: ApacheCon at Home 2020 starts tomorrow!

2020-09-30 Thread Bram Van Dam
On 30/09/2020 05:14, Rahul Goswami wrote:
> Thanks for sharing this Anshum. Day 1 had some really interesting sessions.
> Missed out on a couple that I would have liked to listen to. Are the
> recordings of these sessions available anywhere?

The ASF will be uploading the recordings of all sessions "soon", which
probably means about a week or two.

 - Bram


Re: Many small instances, or few large instances?

2020-09-22 Thread Bram Van Dam
Thanks, Erick. I should probably keep a tally of how many beers I owe
you ;-)


On 21/09/2020 14:50, Erick Erickson wrote:
> In a word, yes. G1GC still has spikes, and the larger the heap the more 
> likely you’ll be to encounter them. So having multiple JVMS rather than one 
> large JVM with a ginormous heap is still recommended.
> 
> I’ve seen some cases that used the Zing zero-pause product with very large 
> heaps, but they were forced into that by the project requirements.
> 
> That said, when Java has a ZGC option, I think we’re in uncharted territory. 
> I frankly don’t know what using very large heaps without having to worry 
> about GC pauses will mean for Solr. I suspect we’ll have to do something to 
> take advantage of that. For instance, could we support a topology where all 
> shards had at least one replica in the same JVM that didn’t make any HTTP 
> requests? Would that topology be common enough to support? Maybe extend “rack 
> aware” to be “JVM aware”? Etc.
> 
> One thing that does worry me is that it’ll be easier and easier to “just 
> throw more memory at it” rather than examine whether you’re choosing options 
> that minimize heap requirements. And Lucene has done a lot to move memory to 
> the OS rather than heap (e.g. docValues, MMapDirectory etc.).
> 
> Anyway, carry on as before for the nonce.
> 
> Best,
> Erick
> 
>> On Sep 21, 2020, at 6:06 AM, Bram Van Dam  wrote:
>>
>> Hey folks,
>>
>> I've always heard that it's preferred to have a SolrCloud setup with
>> many smaller instances under the CompressedOops limit in terms of
>> memory, instead of having larger instances with, say, 256GB worth of
>> heap space.
>>
>> Does this recommendation still hold true with newer garbage collectors?
>> G1 is pretty fast on large heaps. ZGC and Shenandoah promise even more
>> improvements.
>>
>> Thx,
>>
>> - Bram
> 



Many small instances, or few large instances?

2020-09-21 Thread Bram Van Dam
Hey folks,

I've always heard that it's preferred to have a SolrCloud setup with
many smaller instances under the CompressedOops limit in terms of
memory, instead of having larger instances with, say, 256GB worth of
heap space.

Does this recommendation still hold true with newer garbage collectors?
G1 is pretty fast on large heaps. ZGC and Shenandoah promise even more
improvements.
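
As an aside, if you want to check whether a particular heap size still
gets compressed oops on a given JVM, here is a quick sketch using the
HotSpot diagnostic MXBean (assumes a HotSpot/OpenJDK JVM; run it with the
same -Xmx you give Solr):

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class CheckCompressedOops {
  public static void main(String[] args) throws Exception {
    HotSpotDiagnosticMXBean hotspot =
        ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    // Prints something like: VMOption [UseCompressedOops, value=true, ...]
    System.out.println(hotspot.getVMOption("UseCompressedOops"));
    System.out.println("Max heap: " + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
  }
}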

Thx,

 - Bram


Re: "timeAllowed" param with "numFound" having a count value but doc list is empty

2020-09-16 Thread Bram Van Dam
There are a couple of open issues related to the timeAllowed parameter.
For instance it currently doesn't work in conjunction with the
cursorMark parameter [1]. And on Solr 7 it doesn't work at all [2].

But other than that, when users have a lot of query flexibility, it's a
pretty good idea to limit them somehow. You don't want your users to
blow up your servers.

[1] https://issues.apache.org/jira/browse/SOLR-14413

[2] https://issues.apache.org/jira/browse/SOLR-9882
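
If it helps, setting it from SolrJ is a one-liner; a minimal sketch (the
collection name and the 10 second value are just examples):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeAllowedExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/myCollection").build()) {
      SolrQuery q = new SolrQuery("some_field:foo");
      q.setTimeAllowed(10_000);   // in milliseconds: stop collecting results after 10 seconds
      QueryResponse rsp = solr.query(q);
      // When the limit kicks in, partialResults=true shows up in the response header
      System.out.println("partial: " + rsp.getHeader().get("partialResults"));
      System.out.println("numFound: " + rsp.getResults().getNumFound());
    }
  }
}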

 - Bram

On 16/09/2020 03:04, Mark Robinson wrote:
> Thanks Dominique!
> So is this parameter generally recommended or not? I wanted to try with a
> value of 10s. We are not using it now.
> My goal is to prevent a query from running for more than 10s on the Solr server
> and choking it.
> 
> What is the general recommendation?
> 
> Thanks!
> Mark
> 
> On Tue, Sep 15, 2020 at 5:38 PM Dominique Bejean 
> wrote:
> 
>> Hi,
>>
>> 1. Yes, your analysis is correct
>>
>> 2. Yes, it can occur with a very slow query too.
>>
>> Regards
>>
>> Dominique
>>
>> On Tue, 15 Sept 2020 at 15:14, Mark Robinson 
>> wrote:
>>
>>> Hi,
>>>
>>> When in a sample query I used "timeAllowed" as low as 10ms, I got a value
>>> for "numFound" of say 2000, but no docs were returned. But when I increased
>>> the value for timeAllowed to be in seconds, I never got this scenario.
>>>
>>> I have 2 questions:
>>>
>>> 1. Why does numFound have a value like say 2000 or even 6000 but no
>>> documents actually returned? During document collection, is the calculation
>>> of numFound done first and doc collection later? Is the doc list empty
>>> because, by the time doc collection started, the timeAllowed cut-off took
>>> effect?
>>>
>>> 2. If I give timeAllowed a value of say 10s or above, do you think the above
>>> scenario of a valid count displayed in numFound but an empty doc list can
>>> still occur, as there is more time before the cut-off to retrieve at least
>>> one doc?
>>>
>>> Thanks!
>>>
>>> Mark
>>>
>>
> 



Re: Backups in SolrCloud using snapshots of individual cores?

2020-08-11 Thread Bram Van Dam
On 11/08/2020 13:15, Erick Erickson wrote:
> CDCR is being deprecated. so I wouldn’t suggest it for the long term.

Ah yes, thanks for pointing that out. That makes Dominique's alternative
less attractive. I guess I'll stick to my original proposal!

Thanks Erick :-)

 - Bram


Backups in SolrCloud using snapshots of individual cores?

2020-08-06 Thread Bram Van Dam
Hey folks,

Been reading up about the various ways of creating backups. The whole
"shared filesystem for Solrcloud backups"-thing is kind of a no-go in
our environment, so I've been looking for ways around that, and here's
what I've come up with so far:

1. Stop applications from writing to solr

2. Commit everything

3. Identify a single core for each shard in each collection

4. Snapshot that core using CREATESNAPSHOT in the Collections API

5. Once complete, re-enable application write access to Solr

6. Create a backup from these snapshots using the replication handler's
backup function (replication?command=backup&commitName=mySnapshot); see
the SolrJ sketch below

7. Put the backups somewhere safe

8. Clean up snapshots
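
Roughly what steps 4 and 6 look like from SolrJ, as a sketch only: the
collection, core, snapshot and backup location names are made up, the
snapshot is taken through the Collections API, and the backup is taken
per core through the replication handler:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.GenericSolrRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class SnapshotAndBackup {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Step 4: snapshot the collection (Collections API CREATESNAPSHOT)
      ModifiableSolrParams snap = new ModifiableSolrParams();
      snap.set("action", "CREATESNAPSHOT");
      snap.set("collection", "myCollection");
      snap.set("commitName", "mySnapshot");
      solr.request(new GenericSolrRequest(SolrRequest.METHOD.GET, "/admin/collections", snap));

      // Step 6: back up one core per shard from that named snapshot (replication handler)
      ModifiableSolrParams backup = new ModifiableSolrParams();
      backup.set("command", "backup");
      backup.set("commitName", "mySnapshot");
      backup.set("location", "/backups/myCollection");
      solr.request(new GenericSolrRequest(SolrRequest.METHOD.GET,
          "/myCollection_shard1_replica_n1/replication", backup));
    }
  }
}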


This seems ... too good to be true? I've seen so many threads about how
hard it is to create backups in SolrCloud on this mailing list over the
years, but this seems pretty straightforward? Am I missing some
glaringly obvious reason why this will fail catastrophically?

Using Solr 7.7 in this case.

Feedback much appreciated!

Thanks,

 - Bram


Re: Solr Float/Double multivalues fields

2020-07-03 Thread Bram Van Dam
On 03/07/2020 09:50, Thomas Corthals wrote:
> I think this should go in the ref guide. If your product depends on this
> behaviour, you want reassurance that it isn't going to change in the next
> release. Not everyone will go looking through the javadoc to see if this is
> implied.

This is in the ref guide. Section DocValues. Here's the quote:

DocValues are only available for specific field types. The types chosen
determine the underlying Lucene docValue type that will be used. The
available Solr field types are:

• StrField, and UUIDField:
  ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
    will use the SORTED type.
  ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
    Entries are kept in sorted order and duplicates are removed.
• BoolField:
  ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
    will use the SORTED type.
  ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
    Entries are kept in sorted order and duplicates are removed.
• Any *PointField Numeric or Date fields, EnumFieldType, and
  CurrencyFieldType:
  ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
    will use the NUMERIC type.
  ◦ If the field is multi-valued, Lucene will use the SORTED_NUMERIC type.
    Entries are kept in sorted order and duplicates are kept.
• Any of the deprecated Trie* Numeric or Date fields, EnumField and
  CurrencyField:
  ◦ If the field is single-valued (i.e., multi-valued is false), Lucene
    will use the NUMERIC type.
  ◦ If the field is multi-valued, Lucene will use the SORTED_SET type.
    Entries are kept in sorted order and duplicates are removed.

These Lucene types are related to how the values are sorted and stored.





Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-29 Thread Bram Van Dam
On 28/06/2020 14:42, Erick Erickson wrote:
> We need to draw a sharp distinction between standalone “going away”
> in terms of our internal code and going away in terms of the user
> experience.

It'll be hard to make it completely transparent in terms of user
experience. For instance, there is currently no way to unload a core in
SolrCloud (without deleting it). I'm sure there are many other similar
gotchas.

 - Bram


Re: Autocommit in SolrCloud with many shards

2020-06-18 Thread Bram Van Dam
Thanks, just created SOLR-14581 as an entry point.

And sure, beers sound good! ;-)

On 17/06/2020 23:13, Erick Erickson wrote:
> Please raise a JIRA and attach your patch to that….
> 
> Best,
> Erick
> 
> P.S. Buy me some beers sometime if we’re even in the same place...
> 
>> On Jun 17, 2020, at 5:00 PM, Bram Van Dam  wrote:
>>
>> Thanks for pointing that out. I'm attaching a patch for the ref-guide
>> which summarizes what you said. Maybe other people will find this useful
>> as well?
>>
>> Oh and Erick, thanks for your ever thoughtful replies. Given all the
>> hours of your time I've soaked up over the years, you should probably
>> start invoicing me :-)
>>
>> - Bram
>>
>> On 17/06/2020 13:55, Erick Erickson wrote:
>>> Each node has its own timer that starts when it receives an update.
>>> So in your situation, 60 seconds after any given replica gets its first
>>> update, all documents that have been received in the interval will
>>> be committed.
>>>
>>> But note several things:
>>>
>>> 1> commits will tend to cluster for a given shard. By that I mean
>>>they’ll tend to happen within a few milliseconds of each other
>>>   ‘cause it doesn’t take that long for an update to get from the
>>>   leader to all the followers.
>>>
>>> 2> this is per replica. So if you host replicas from multiple collections
>>>   on some node, their commits have no relation to each other. And
>>>   say for some reason you transmit exactly one document that lands
>>>   on shard1. Further, say nodeA contains replicas for shard1 and shard2.
>>>   Only the replica for shard1 would commit.
>>>
>>> 3> Solr promises eventual consistency. In this case, due to all the
>>>   timing variables it is not guaranteed that every replica of a single
>>>   shard has the same document available for search at any given time.
>>>   Say doc1 hits the leader at time T and a follower at time T+10ms.
>>>   Say doc2 hits the leader and gets indexed 5ms before the 
>>>   commit is triggered, but for some reason it takes 15ms for it to get
>>>   to the follower. The leader will be able to search doc2, but the
>>>  follower won’t until 60 seconds later.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Jun 17, 2020, at 5:36 AM, Bram Van Dam  wrote:
>>>>
>>>> 'morning :-)
>>>>
>>>> I'm wondering how autocommits work in Solr.
>>>>
>>>> Say I have a cluster with many nodes and many collections with many
>>>> shards. If each collection's config has a hard autocommit configured
>>>> every minute, does that mean that SolrCloud (presumably the leader?)
>>>> will dish out commit requests to each node on that schedule? Or does
>>>> each node have its own timed trigger?
>>>>
>>>> If it's the former, doesn't that mean the load will spike dramatically
>>>> across the whole cluster every minute?
>>>>
>>>> I tried reading the code, but I don't quite understand the way
>>>> CommitTracker and the UpdateHandlers interact with SolrCloud.
>>>>
>>>> Thanks,
>>>>
>>>> - Bram
>>>
>>
>> 
> 



Re: Autocommit in SolrCloud with many shards

2020-06-17 Thread Bram Van Dam
Thanks for pointing that out. I'm attaching a patch for the ref-guide
which summarizes what you said. Maybe other people will find this useful
as well?

Oh and Erick, thanks for your ever thoughtful replies. Given all the
hours of your time I've soaked up over the years, you should probably
start invoicing me :-)

 - Bram

On 17/06/2020 13:55, Erick Erickson wrote:
> Each node has its own timer that starts when it receives an update.
> So in your situation, 60 seconds after any given replica gets its first
> update, all documents that have been received in the interval will
> be committed.
> 
> But note several things:
> 
> 1> commits will tend to cluster for a given shard. By that I mean
> they’ll tend to happen within a few milliseconds of each other
>‘cause it doesn’t take that long for an update to get from the
>leader to all the followers.
> 
> 2> this is per replica. So if you host replicas from multiple collections
>on some node, their commits have no relation to each other. And
>say for some reason you transmit exactly one document that lands
>on shard1. Further, say nodeA contains replicas for shard1 and shard2.
>Only the replica for shard1 would commit.
> 
> 3> Solr promises eventual consistency. In this case, due to all the
>timing variables it is not guaranteed that every replica of a single
>shard has the same document available for search at any given time.
>Say doc1 hits the leader at time T and a follower at time T+10ms.
>Say doc2 hits the leader and gets indexed 5ms before the 
>commit is triggered, but for some reason it takes 15ms for it to get
>to the follower. The leader will be able to search doc2, but the
>   follower won’t until 60 seconds later.
> 
> Best,
> Erick
> 
>> On Jun 17, 2020, at 5:36 AM, Bram Van Dam  wrote:
>>
>> 'morning :-)
>>
>> I'm wondering how autocommits work in Solr.
>>
>> Say I have a cluster with many nodes and many collections with many
>> shards. If each collection's config has a hard autocommit configured
>> every minute, does that mean that SolrCloud (presumably the leader?)
>> will dish out commit requests to each node on that schedule? Or does
>> each node have its own timed trigger?
>>
>> If it's the former, doesn't that mean the load will spike dramatically
>> across the whole cluster every minute?
>>
>> I tried reading the code, but I don't quite understand the way
>> CommitTracker and the UpdateHandlers interact with SolrCloud.
>>
>> Thanks,
>>
>> - Bram
> 

>From 858406e5c322a96c82934a6477518f65c5c605cc Mon Sep 17 00:00:00 2001
From: Bram 
Date: Wed, 17 Jun 2020 22:54:46 +0200
Subject: [PATCH] Add a blurb about commit timings to the SolrCloud
 documentation

---
 .../src/shards-and-indexing-data-in-solrcloud.adoc  | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/solr/solr-ref-guide/src/shards-and-indexing-data-in-solrcloud.adoc b/solr/solr-ref-guide/src/shards-and-indexing-data-in-solrcloud.adoc
index 3aa07cbdae7..43828048383 100644
--- a/solr/solr-ref-guide/src/shards-and-indexing-data-in-solrcloud.adoc
+++ b/solr/solr-ref-guide/src/shards-and-indexing-data-in-solrcloud.adoc
@@ -122,6 +122,8 @@ More details on how to use shard splitting is in the section on the Collection A
 
 In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with `openSearcher=false` and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster.
 
+TIP: Each node has its own auto commit timer which starts upon receipt of an update. While Solr promises eventual consistency, leaders will generally receive updates *before* replicas; it is therefore possible for replicas to lag behind somewhat.
+
 To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the `IgnoreCommitOptimizeUpdateProcessorFactory`, which allows you to ignore explicit commits and/or optimize requests from client applications without having refactor your client application code.
 
 To activate this request processor you'll need to add the following to your `solrconfig.xml`:
-- 
2.20.1



Autocommit in SolrCloud with many shards

2020-06-17 Thread Bram Van Dam
'morning :-)

I'm wondering how autocommits work in Solr.

Say I have a cluster with many nodes and many collections with many
shards. If each collection's config has a hard autocommit configured
every minute, does that mean that SolrCloud (presumably the leader?)
will dish out commit requests to each node on that schedule? Or does
each node have its own timed trigger?

If it's the former, doesn't that mean the load will spike dramatically
across the whole cluster every minute?

I tried reading the code, but I don't quite understand the way
CommitTracker and the UpdateHandlers interact with SolrCloud.

Thanks,

 - Bram


Re: Solr Deletes

2020-05-26 Thread Bram Van Dam
On 26/05/2020 14:07, Erick Erickson wrote:
> So best practice is to go ahead and use delete-by-id. 


I've noticed that this can cause issues when using implicit routing, at
least on 7.x. Though I can't quite remember whether the issue was a
performance issue, or whether documents would sometimes not get deleted.

In either case, I worked around it by doing something like this:

UpdateRequest req = new UpdateRequest();
req.deleteById(id);
req.setCommitWithin(-1);                    // don't force a commit for this delete
req.setParam(ShardParams._ROUTE_, shard);   // explicitly route the delete to the target shard

Maybe that'll help if you run into either of those issues.

 - Bram


Re: +(-...) vs +(*:* -...) vs -(+...)

2020-05-22 Thread Bram Van Dam
Additional reading: https://lucidworks.com/post/why-not-and-or-and-not/

Assuming implicit AND, we perform the following rewrite on strictly
negative queries:

-f:a -> -f:a *:*

Isn't search fun? :-)

 - Bram


On 21/05/2020 20:51, Houston Putman wrote:
> Jochen,
> 
> For the standard query parser, pure negative queries (no positive query in
> front of it, such as "*:*") are only allowed as a top level clause, so not
> nested within parenthesis.
> 
> Check the second bullet point of the this section of the Ref Guide page for
> the Standard Query Parser.
> 
> 
> For the edismax query parser, pure negative queries are allowed to be
> nested within parenthesis. Docs can be found in the Ref Guide page for the
> eDismax Query Parser.
> 
> 
> - Houston
> 
> On Thu, May 21, 2020 at 2:25 PM Jochen Barth 
> wrote:
> 
>> Dear reader,
>>
>> why does +(-x_ss:y) finds 0 docs,
>>
>> while -(+x_ss:y) finds many docs?
>>
>> Ok... +(*:* -x_ss:y) works, too, but I'm a bit surprised.
>>
>> Kind regards, J. Barth
>>
>>
> 



Re: ZooKeeper 3.4 end of life

2020-04-15 Thread Bram Van Dam
On 09/04/2020 16:03, Bram Van Dam wrote:
> Thanks, Erick. I'll give it a go this weekend and see how it behaves.
> I'll report back so there's a record of my attempts in case anyone else
> ends up asking the same question.

Here's a quick update after non-exhaustive testing: Running SolrCloud
7.7.2 against ZK 3.5.7 seems to work. This is using the same ensemble
configuration as in 3.4, but with the 4-letter-word commands (ruok, mntr,
conf) now explicitly enabled via ZooKeeper's 4lw.commands.whitelist setting.

ZK 3.5 allegedly makes it easier to use TLS throughout the ensemble, but
I haven't tried that in conjunction with Solr yet. I'll give it a go if
I can find the time.

 - Bram


Re: ZooKeeper 3.4 end of life

2020-04-09 Thread Bram Van Dam
Thanks, Erick. I'll give it a go this weekend and see how it behaves.
I'll report back so there's a record of my attempts in case anyone else
ends up asking the same question.

Some of our customers get a bit nervous when software goes out of
support, even if it works fine, so I try to be prepared ;-)

On 09/04/2020 13:50, Erick Erickson wrote:
> All it means is that there won’t be upgrades/improvements to
> ZK, 3.4 will still run. So there’s no need to move to 3.5 independent 
> of upgrading Solr just because that version of ZK is unsupported
> going forward.
> 
> I haven’t personally tried to run 3.5 against an earlier version, but
> one thing you’ll probably want to do if you try it is whitelist
> “ruok”, “mntr”, and “conf” in your Zookeeper config, see the 
> Zookeeper documentation. NOTE: that is only necessary if you 
> want to see the Zookeeper status in the admin UI, those
> commands aren’t used by anything else in Solr.
> 
> The JIRA (SOLR-8346) contains a _ton_ of changes, but the mostly
> fall in two categories: Using a file rather than strings in tests and
> trying to handle the whitelist programmatically, neither of which
> are relevant to just dropping a newer version of ZK in.


ZooKeeper 3.4 end of life

2020-04-09 Thread Bram Van Dam
Hey folks,

The ZK team just announced that they're dropping 3.4 support as of the
1st of June, 2020.

What does this mean for those of us still on Solr < 8.2? From what I can
tell, ZooKeeper 3.5+ does not work with older Solr versions.

Has anyone managed to get a 3.5+ to work with Solr 7 at all?

Thanks,

 - Bram


Re: Modify ZK ensemble string in a running SolrCloud?

2020-03-23 Thread Bram Van Dam
On 23/03/2020 14:17, Erick Erickson wrote:
> As of Solr 8.2, Solr is distributed with ZooKeeper 3.5.5 (will be 3.5.7 in 
> Solr 8.6), which allows “dynamic reconfiguration”. If you’re running an 
> earlier version of Zookeeper, then no you’ll have to restart to change ZK 
> nodes.

Thanks Erick, much appreciated. I guess I'll have to settle for a
restart for the time being.

 - Bram


Modify ZK ensemble string in a running SolrCloud?

2020-03-23 Thread Bram Van Dam
Is it possible to change the ZK ensemble without restarting the entire
SolrCloud? Specifically adding or removing a ZK instance from the
ensemble. I'm assuming the answer is no, as far as I can tell the only
place where this is configured is the zkHost parameter, which is passed
to Solr as a JVM argument.

But I figured I'd ask anyway. Thanks for any insights!

 - Bram


Devoxx Antwerp

2019-11-05 Thread Bram Van Dam
I don't suppose any Solr users/devs will be attending Devoxx in Antwerp
this week? If any of you are, it might be nice to have a chat to
exchange some experiences. If not, I'll take that as a sign not to leave
it quite so late next year .. ahem.

 - Bram


Re: Query number of Lucene documents using Solr?

2019-08-27 Thread Bram Van Dam
On 26/08/2019 23:12, Shawn Heisey wrote:
> The numbers shown in Solr's LukeRequestHandler come directly from
> Lucene.  This is the URL endpoint it will normally be at, for core XXX:
> 
> http://host:port/solr/XXX/admin/luke

Thanks Shawn, that's a great entry point!

> The specific error you encountered is why old hands will recommended
> staying below a billion documents in a core.  That leaves room for
> deleted documents as well.

Indeed, that's what we usually try. But every once in a while Stuff
Happens(TM), and so it'd be nice if we could monitor the actual count.
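
For anyone wanting to poll that from code, here's roughly what the check
looks like with SolrJ (the core name is an example; this just pulls
maxDoc out of the Luke response):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;
import org.apache.solr.common.util.NamedList;

public class MaxDocCheck {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/myCore").build()) {
      LukeResponse luke = new LukeRequest().process(solr);
      // The "index" section mirrors what the admin UI shows: numDocs, maxDoc, deletedDocs, ...
      NamedList<?> index = (NamedList<?>) luke.getResponse().get("index");
      long maxDoc = ((Number) index.get("maxDoc")).longValue();
      System.out.println("maxDoc = " + maxDoc + " (hard limit is 2147483647 per index)");
    }
  }
}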

 - Bram


Query number of Lucene documents using Solr?

2019-08-26 Thread Bram Van Dam
Possibly somewhat unusual question: I'm looking for a way to query the
number of *lucene documents* from within Solr. This can be different
from the number of Solr documents (because of unmerged deletes/updates/
etc).

As a bit of background; we recently found this lovely little error
message in a Solr log, and we'd like to get a bit of an early warning
system going :-)

> Too many documents, composite IndexReaders cannot exceed 2147483647

If no way currently exists, I'm not adverse to hacking one in, but I
could use a few pointers in the general direction.

As an alternative strategy, I guess I could use Lucene to walk through
each index segment and add the segment info maxDoc values. But I'm not
sure if that would be a good idea.
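
Something along these lines is what I had in mind, in case anyone can
confirm whether it's sane. A rough sketch only: the index path is an
example, and this reads the latest commit point without ever opening a
composite reader:

import java.nio.file.Paths;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SegmentMaxDocSum {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/path/to/core/data/index"))) {
      SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
      long total = 0;
      for (SegmentCommitInfo segment : infos) {
        total += segment.info.maxDoc();   // per-segment maxDoc, deleted docs included
      }
      System.out.println("Sum of segment maxDoc values: " + total);
    }
  }
}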

Thanks a bunch,

 - Bram


Re: Incorrect shard placement during Collection creation in 7.6

2019-02-14 Thread Bram Van Dam
Thanks Erick, I just created SOLR-13247 and linked it to SOLR-12944.

 - Bram

On 13/02/2019 18:31, Erick Erickson wrote:
> I haven't verified, but this looks like a JIRA to me. Looks like some
> of the create logic may have issues, see: SOLR-12944 and maybe link to
> that JIRA?


Re: Incorrect shard placement during Collection creation in 7.6

2019-02-13 Thread Bram Van Dam
> TL;DR; createNodeSet & shards combination is not being respected.

Update: Upgraded to 7.7, no improvement sadly.


Incorrect shard placement during Collection creation in 7.6

2019-02-13 Thread Bram Van Dam
Hey folks,

TL;DR; createNodeSet & shards combination is not being respected.

I'm attempting to create a collection with multiple shards, but
apparently the value of createNodeSet is not being respected and shards
are being assigned to nodes seemingly at random.

createNodeSet.shuffle is set to false, so that's not the cause.
Furthermore, sometimes not all nodes in the request are used.

Here's my request, cleaned up for legibility. Note that the node names
are IP addresses but I've removed the first 3 octets for legibility.


admin/collections
?action=CREATE
&name=collectionName
&router.name=implicit
&shards=collectionName1,collectionName2,collectionName3,collectionName4,collectionName5,collectionName6
=1024
&collection.configName=some_config
&createNodeSet=171:8180_solr,172:8180_solr,173:8180_solr,177:8180_solr,179:8180_solr,179:8180_solr
&createNodeSet.shuffle=false
=true

Note that I'm creating a collection with 6 shards across 5 nodes.

Requested:
collectionName1: 171:8180_solr
collectionName2: 172:8180_solr
collectionName3: 173:8180_solr
collectionName4: 177:8180_solr
collectionName5: 179:8180_solr
collectionName6: 179:8180_solr

Actual:
collectionName1: 177:8180_solr
collectionName2: 172:8180_solr
collectionName3: 179:8180_solr
collectionName4: 173:8180_solr
collectionName5: 171:8180_solr
collectionName6: 171:8180_solr

Not a single shard ends up on the requested node.

Additionally, when the response comes back, it only contained
information about 5 of the 6 created cores (even though 6 were created).
Possibly because there are only 5 nodes?

Am I misunderstanding the way this is supposed to work? Or did I stumble
upon a bug? Should I attempt to create a collection without shards and
add them one at a time for better control?

Sidenote: having control over which shard lives where is a business
requirement, so leaving Solr to its own devices is, sadly, not an option
in this case :-(

Thanks a bunch,

 - Bram


CDCR "all" collections

2019-01-24 Thread Bram Van Dam
Hey folks,

Is there any way to set up CDCR for *all* collections, including any
newly created ones? Having to modify the solrconfig in ZK every time a
collection is added is a bit of a pain, especially because I'm assuming
it requires a restart to activate the config?

Basically if I have DC Src and DC Tgt, I want every collection from Src
to be replicated to Tgt. Even when I create a new collection on Src.

Thanks,

 - Bram


Re: "no servers hosting shard" when querying during shard creation

2019-01-15 Thread Bram Van Dam
On 13/01/2019 19:43, Erick Erickson wrote:
> Yeah, that seems wrong, I'd say open a JIRA.

I've created a bug in Jira: SOLR-13136. Should I assign this to anyone?
Unsure what the procedure is there.

Incidentally, while doing so I noticed that 7.6 is still "unreleased"
according to Jira.

Thanks,

 - Bram


Re: "no servers hosting shard" when querying during shard creation

2019-01-13 Thread Bram Van Dam
On 13/01/2019 14:28, Bram Van Dam wrote:
> If a query is launched during the shard creation, I get a
> SolrServerException from SolrJ: Error from server at foo: no servers
> hosting shard: bar

I should probably add that I'm running 7.6.0.


"no servers hosting shard" when querying during shard creation

2019-01-13 Thread Bram Van Dam
Hey folks,

I'm getting SolrServerExceptions and I'm not sure whether this is by
design or whether this is a concurrency bug of some sort.

Basically I've got a pretty active collection which is being queried all
the time. Periodically, new shards are created (using the Collection
Admin API's CREATESHARD call). Creating the shard takes a certain amount
of time.

If a query is launched during the shard creation, I get a
SolrServerException from SolrJ: Error from server at foo: no servers
hosting shard: bar

This strikes me as odd. It's a new shard that's either still being
created or was just created, it's empty, so it shouldn't affect the
query in any way.

I had a quick nose around the code and found HttpShardHandler to be the
one throwing the exception. But it's unclear to me how or where it's
decided which shards are included in the query execution. Is this a bug?

Thanks,

 - Bram



Re: “solr.data.dir” can only config a single directory

2018-08-29 Thread Bram Van Dam
On 28/08/18 08:03, zhenyuan wei wrote:
> But this is not a common way to do so, I mean, nobody want to ADDREPLICA
> after collection was created.

I wouldn't say "nobody"..


Odd GC values in solr.in.sh on 7.2.1

2018-02-21 Thread Bram Van Dam
Hey folks,

solr.in.sh appears to contain broken GC suggestions:

# These GC settings have shown to work well for a number of common Solr
workloads
#GC_TUNE="-XX:NewRatio=3 -XX:SurvivorRatio=4etc.

The "etc." part is copied verbatim from the file. It looks likes the
original GC_TUNE settings have partially disappeared.

Here they are copied from 5.X:

# These GC settings have shown to work well for a number of common Solr
workloads
GC_TUNE="-XX:NewRatio=3 \
-XX:SurvivorRatio=4 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=8 \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:ConcGCThreads=4 -XX:ParallelGCThreads=4 \
-XX:+CMSScavengeBeforeRemark \
-XX:PretenureSizeThreshold=64m \
-XX:+UseCMSInitiatingOccupancyOnly \
-XX:CMSInitiatingOccupancyFraction=50 \
-XX:CMSMaxAbortablePrecleanTime=6000 \
-XX:+CMSParallelRemarkEnabled \
-XX:+ParallelRefProcEnabled"

Are these settings no longer recommended? Or is there a new mechanism
other than solr.in.sh where these might be configured?

Thanks,

 - Bram


7.0 upgrade: Trie* -> Point* migration

2017-09-26 Thread Bram Van Dam
Hey folks,

We're preparing for an upgrade to 7.0, but I'm a bit worried about the
deprecation of Trie* fields. Is there any way to upgrade an existing
index to use Point* fields without having to reindex all documents? Does
the IndexUpgrader take care of this?

Thanks,

 - Bram


Re: Strange boolean query behaviour on 5.5.4

2017-07-05 Thread Bram Van Dam
On 04/07/17 18:10, Erick Erickson wrote:
> I think you'll get what you expect by something like:
> (*:* -someField:Foo) AND (otherField: (Bar OR Baz))

Yeah that's what I figured. It's not a big deal since we generate Solr
syntax using a parser/generator on top of our own query syntax. Still a
little strange!

Thanks for the heads up,

 - Bram


Strange boolean query behaviour on 5.5.4

2017-07-04 Thread Bram Van Dam
Hey folks,

I'm experiencing some strange query behaviour, and it isn't immediately
clear to me why this would happen. The definition of the query syntax
on the wiki is a bit fuzzy, so my interpretation of the syntax might be off.

This query does not work (no results, when results are expected).

(-someField:Foo) AND (otherField: (Bar OR Baz))

With debug enabled, Solr interprets the query as

+(-someField:Foo) +(otherField:Bar otherField:Baz)

This query DOES work, results are returned.

-someField:Foo +(otherField:Bar otherField:Baz)

With debug enabled:

-someField:Foo +(otherField:Bar otherField:Baz)


The only difference between these queries is the presence of parentheses
around the field with a single NOT condition. From a boolean point of
view, they are equivalent.

To make matters stranger, if I add a *:* clause to the NOT field,
everything works again.

(-someField:Foo AND *:*) AND (otherField: (Bar OR Baz))
and
-someField:Foo AND *:* AND (otherField: (Bar OR Baz))
both work.

Is this a query parser bug? Or are parenthesized groups with a single
negated expression not supported? :-/

I've only tested this on 5.5.4 using the default query parser, I don't
have access to any other versions at the moment.

Thanks for any insights,

 - Bram


Re: solr 6 at scale

2017-05-25 Thread Bram Van Dam
>>> It is relatively easy to downgrade to an earlier release within the
>>> same major version. We have not switched to 6.5.1 simply because we
>>> have no pressing need for it - Solr 6.3 works well for us.
> 
>> That strikes me as a little bit dangerous, unless your indexes are very
>> static.  The Lucene index format does occasionally change in minor
>> versions.
> 
> Err.. Okay? Thank you for that. I was under the impression that the index 
> format was fixed (modulo critical bugs) for major versions. This will change 
> our approach to updating.

*Upgrading* (say 6.3 to 6.5.1) should be fine, because -- as I
understand it -- newer Lucene/Solr versions support reading older index
formats (up to the previous major version). Older versions reading a newer
index format would be ... difficult.

 - Bram


Re: Solr - example for using percentiles

2017-02-22 Thread Bram Van Dam
On 17/02/17 13:39, John Blythe wrote:
> Using the stats component makes short work of things.
> 
> stats=true&stats.field=foo

The stats component has been rendered obsolete by the newer and shinier
json.facet stuff.

 - Bram





Re: Solr - example for using percentiles

2017-02-17 Thread Bram Van Dam
On 15/01/17 15:26, Vidal, Gilad wrote:
> Hi,
> Can you direct me for Java example using Solr percentiles?
> The following 3 examples are not seems to be working.


Not sure if this is still relevant, but I use the json.facet parameter
with SolrJ:

query.add("json.facet", "{\"ninety\":\"percentile(value,90)\"}");

 - Bram


Re: Atomic updates to increase single field bulk updates?

2017-02-17 Thread Bram Van Dam
> I am aware of the requirements to use atomic updates, but as I understood, 
> those would not have a big impact on performance and only a slight increase 
> in index size?

AFAIK there won't be a difference in index size between atomic updates
and full updates, as the end result is the same.

But you will probably see a performance increase because you'll only
have to send 4 boolean flags instead of 4 full documents.

Using atomic updates sounds like a good idea to me.
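
For reference, an atomic "set" update from SolrJ looks something like
this (field names are made up; each flag only needs the map-style value,
plus the uniqueKey of the document being patched):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicFlagUpdate {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/myCollection").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-42");                                    // uniqueKey of the existing document
      doc.addField("flag_a", Collections.singletonMap("set", true));   // atomic "set" on just this field
      doc.addField("flag_b", Collections.singletonMap("set", false));
      solr.add(doc);
      solr.commit();
    }
  }
}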

 - Bram



Re: Solr 6 Performance Suggestions

2016-11-23 Thread Bram Van Dam
On 22/11/16 15:34, Prateek Jain J wrote:
> I am not sure but I heard this in one of discussions, that you cant migrate 
> directly from solr 4 to solr 6. It has to be incremental like solr 4 to solr 
> 5 and then to solr 6. I might be wrong but is worth trying. 

Ideally the index needs to be upgraded using the IndexUpgrader.

Something like this should do the trick:

java -cp lucene-core-6.0.0.jar:lucene-backward-codecs-6.0.0.jar
org.apache.lucene.index.IndexUpgrader /path/to/index

 - Bram


Re: 5.5.3: fieldValueCache auto-warming error

2016-11-15 Thread Bram Van Dam
On 11/11/16 18:08, Bram Van Dam wrote:
> On 10/11/16 17:10, Erick Erickson wrote:
>> Just facet on the text field yourself ;)

Quick update: you were right. One of the users managed to find a bug in
our application which enabled them to facet on the text field. It would
still be nice if Solr wouldn't try to keep caching a broken query (or
an impossible facet field), but we can work around the issue by fixing
our own bug.

Thanks!

 - Bram



Re: 5.5.3: fieldValueCache auto-warming error

2016-11-11 Thread Bram Van Dam
On 10/11/16 17:10, Erick Erickson wrote:
> Just facet on the text field yourself ;)

Wish I could, this is on premise over at a client, access is difficult
and their response time is pretty bad on public holidays and weekends.
So I'm basically twiddling my thumbs while waiting to get more log files
:-) I haven't been able to reproduce the problem locally, but there
could be any number of contributing factors that I'm missing.

> Kidding aside, this should be in the clear from the logs, my guess is
> that the first time you see an OOM error in the logs the query will be
> in the file also.

We generally prefer "fail hard fast", so I think we are running with the
OOM killer script in most environments. I don't think they've gone OOM
in this case, though something else could have gone wrong undetected.

I hope I'll know more after the weekend.


Re: 5.5.3: fieldValueCache auto-warming error

2016-11-10 Thread Bram Van Dam
On 09/11/16 16:59, Erick Erickson wrote:
> But my bet is that you _are_ doing something that uninverts the text
> field (obviously inadvertently). If you restart Solr and monitor the
> log until the first time you see this exception, what do the queries
> show? My guess is that once you get some query in your
> queryResultCache or filterCache it gets recycled and produces this on
> autowarm rather than the fieldValueCache, but that's a total guess.

Good point. I'll try to log the queries and see if anything comes up.
Though if it was a user-initiated error it might be impossible to get
them to reproduce it.

I'll keep you updated.

Thanks!

 - Bram



5.5.3: fieldValueCache auto-warming error

2016-11-09 Thread Bram Van Dam
Hey folks,

I'm frequently getting the following error, which has me a little puzzled:

Error during auto-warming of
key:text:org.apache.solr.common.SolrException:
java.lang.IllegalStateException: Too many values for UnInvertedField
faceting on field text

This is strange because the field in question is a full text field,
which will never ever be used in faceting. I don't understand why Solr
is trying to build an uninverted index for faceting on this field during
auto-warming.

Should I turn off the auto-warming of fieldValueCache? But then that
will probably affect overall performance of other fields, which is
something I'd like to avoid.

Is there any way to exclude this field from being cached in the manner?

Thanks a bunch. Stack trace and field definitions below.

 - Bram

Field:


 
  
  
 




Cache definition:



Full stack trace:

2016-11-08
03:35:57.196/UTC|ERROR|searcherExecutor-27-thread-1-processing-x:myIndexx:myIndex|org.apache.solr.search.FastLRUCache|Error
during auto-warming of key:text:org.apache.solr.common.SolrException:
java.lang.IllegalStateException: Too many values for UnInvertedField
faceting on field text
at
org.apache.solr.search.facet.UnInvertedField.<init>(UnInvertedField.java:194)
at
org.apache.solr.search.facet.UnInvertedField.getUnInvertedField(UnInvertedField.java:595)
at
org.apache.solr.search.SolrIndexSearcher$1.regenerateItem(SolrIndexSearcher.java:523)
at org.apache.solr.search.FastLRUCache.warm(FastLRUCache.java:163)
at
org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:2320)
at org.apache.solr.core.SolrCore$4.call(SolrCore.java:1851)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Too many values for
UnInvertedField faceting on field text
at 
org.apache.lucene.uninverting.DocTermOrds.uninvert(DocTermOrds.java:489)
at
org.apache.solr.search.facet.UnInvertedField.<init>(UnInvertedField.java:192)
... 10 more




Re: SolrJ & Ridiculously Large Queries

2016-10-18 Thread Bram Van Dam
On 14/10/16 16:13, Shawn Heisey wrote:
>  name="solr.jetty.request.header.size" default="8192" />

A belated thanks, Shawn! 32k should be sufficient, I hope.
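
And for the record, question 2 turns out to be a one-liner in SolrJ; a
minimal sketch (the collection name and query are placeholders):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostQueryExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/myCollection").build()) {
      String hugeQuery = "field:(a OR b OR c)";   // imagine ~16k of this
      SolrQuery query = new SolrQuery(hugeQuery);
      // Send the query as a POST body instead of cramming it into the request URI
      QueryResponse rsp = solr.query(query, SolrRequest.METHOD.POST);
      System.out.println(rsp.getResults().getNumFound());
    }
  }
}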

 - Bram




SolrJ & Ridiculously Large Queries

2016-10-14 Thread Bram Van Dam
Hey folks,

I just noticed that Jetty barfs with HTTP 414 when request URIs are very
large, which makes sense. I think the default limit is ~8k.
Unfortunately I've got users who insist on executing queries that are
16k (!1!?!?) in size.

Two questions:

1) is it possible to POST these oversized monstrosities instead?

2) can I get SolrJ to POST them?

Suggestions are welcome!

Quick disclaimer: I don't write the queries, and only the default query
parser is available, so trying to reduce the query size is not an option :-(

Thanks

 - Bram


Re: json.facet without a facet ...

2016-09-27 Thread Bram Van Dam
On 26/09/16 17:06, Yonik Seeley wrote:
> Statistics are now fully integrated into faceting. Since we start off
> with a single facet bucket with a domain defined by the main query and
> filters, we can even ask for statistics for this top level bucket,
> before breaking up into further buckets via faceting. Example:
> 
> json.facet={
>   x : "avg(price)",   // the average of the price field will

Aaaah! Thanks, that explains it. I was trying to put the statistics
under "facet":{}, instead of on the top level. Much appreciated :-)

 - Bram



json.facet without a facet ...

2016-09-26 Thread Bram Van Dam
Howdy,

I realize that this might be a strange question, so please bear with me
here.

I've been replacing my usage of the old Stats Component (stats=true,
stats.field=foo, [stats.facet=bar]) with the new json.facet sugar. This
has been a great improvement on all fronts.

However, with the stats component I could calculate stats on a field
*without* having to facet. The new json.facet API doesn't seem to
support that in any way that I can see. Which, admittedly, makes sense,
given the name.

Faceting on a random field and setting allBuckets:true kind of
approximates the behaviour I'm after, but that's pretty ugly and
difficult (because I don't know which field to facet on and it would
have to be present in all documents etc).

Is there any way to do this that I'm not seeing?

TL;DR; Trying to calculate statistics using json.facet without faceting.

Thanks,

 - Bram


Re: JSON Facet API

2016-09-21 Thread Bram Van Dam
On 21/09/16 05:40, Sandeep Khanzode wrote:
> How can I specify JSON Facets in SolrJ? The below facet query for example ... 

SolrQuery query = new SolrQuery();
query.add("json.facet", jsonStringGoesHere);

 - Bram




Re: Miserable Experience Using Solr. Again.

2016-09-17 Thread Bram Van Dam
> I would like to see a future where the admin UI is more than just an
> addon ... but even then, I think the HTTP API will *still* be the most
> important piece of the system.

In 4 years of heavily using (many instances and many versions of) Solr,
the only times when I've used the admin UI has been as a
debugging/diagnostics tool. For instance to quickly check memory usage
or to verify data has been loaded.

My (and by extension my employer's and our customers') Solr usage
probably isn't typical, but I can't imagine anyone relying on the admin
UI for day-to-day Solr operations.

>> Good software doesn’t force users to learn how it works. It hides the
>> inner workings under the interface, so that people never even have to
>> worry about it at all.

Administering a system you know nothing about is a recipe for disaster.
This is just as true for WordPress, MySQL or Oracle as it is for Solr.
I'm not saying things can't/shouldn't be as easy and clear as possible.
But some effort to understand the system should be expected by the
sysadmin/operator.

 - Bram



Re: Miserable Experience Using Solr. Again.

2016-09-13 Thread Bram Van Dam
I'm sorry you're having a "miserable" experience "again". That's
certainly not my experience with Solr. That being said:

> First I was getting errors about "Unsupported major.minor version 52.0", so I 
> needed to install the Linux x64 JRE 1.8.0, which I managed on CentOS 6 with...
> yum install openjdk-1.8.0

This is not a Solr problem. Solr requires Java 8. Java 7 has been
officially end-of-lifed since april 2015. This means no more patches, no
more performance improvements and no more security updates (unless
you're paying Oracle). This is clearly stated in the (very decent) Solr
documentation. To use your own words: Java 7 is an antiquated nightmare
and the rest of the world has moved on to Java 8.

> So far so good. But I didn’t have JAVA_HOME set properly apparently, so I 
> needed to do the not-exactly-intuitive…
> export 
> JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.101-3.b13.el6_8.x86_64/jre/

You don't need to set JAVA_HOME to run Solr. But if you do have a
JAVA_HOME environment variable, and it points to a wrong Java version,
you're going to have a bad time.

> Then after stopping the old process (with kill -9, since there seems to be no 
> graceful way to shut down Solr)

There's a stop command, which is documented. It's a non-surprising
location and has a non-surprising name. And even if there wasn't, "kill"
would have sufficed.

> There was some kind of problem with StopFilterFactory and the text_general 
> field type. Thanks to Stack Overflow I was able to determine that the 
> apparent problem was that there was a parameter, previously fine, which was 
> no longer fine. So I removed all instances of 
> enablePositionIncrements="true". That helped, but then I ran into a broader 
> error: "Plugin Initializing failure for [schema.xml] fieldType". It didn’t 
> say which field type. Buried in the logs I found a reference in the Java 
> stack trace—which *disappears* (and distorts the viewing window horribly) 
> after a few seconds when you try to view it in the web log UI—to the string 
> "units="degrees"". Sure enough, this string appeared in my schema.xml for a 
> class called "solr.SpatialRecursivePrefixTreeFieldType" that I’m pretty sure 
> I never use. I removed that parameter, and moved on to the next set of errors.

Releases come with release notes and -- when required -- upgrade
instructions and admonitions. It's certainly possible that there's been
an oversight here or there and you're more than welcome to point those out.

> The user interface is still as buggy as an early alpha of most
> products, the errors are difficult to understand when they don’t
> actually specify what’s wrong (and they almost never do), and there
> should have been an automatic process to highlight and fix problems in
> old (pre-6) configuration files.

What user interface? Are you talking about the Admin UI? That's a
convenience feature which helps you manage Solr. It makes life a lot
easier, even if it's not perfect. The logs are generally quite good at
explaining what's wrong.

> Never mind the fact that the XML-based configuration process is an
> antiquated nightmare when the rest of the world has long since moved
> onto databases.

An antiquated nightmare? The rest of the world? How would this work?
What benefit would it possibly have?

You're more than welcome to report any bugs you find
(https://issues.apache.org/jira/browse/SOLR). But I feel like general
ranting on the mailing list isn't very productive. Well, I suppose
venting feels good, so there's that.

Things that would be more productive:

1. Reading the documentation.
2. Taking a basic system administration class or two.
3. Pointing out -- or contributing to -- parts of the documentation that
aren't up to par. Either on the mailing list, or on Jira. Preferably in
a constructive way instead of a "miserable experience"-way.

I feel like you're missing the part where most open source development,
documentation, release management etc is done by volunteers. Volunteers
who tend to scratch their own itch first, and are then kind enough to
donate the fruit of their labour to the rest of the world. You can
certainly make requests, and you can certainly hope for things to improve.

If you're having a "miserable" time "again", then you can always hire a
Solr consultant to do the work for you. You can't demand free stuff to
scratch your every itch. You can either invest your time and figure out
how to do things yourself, or your money and have things done for you.
But there's no such thing as a free lunch.

 - Bram



Re: Monitoring Apache Solr

2016-09-12 Thread Bram Van Dam
> I am trying to monitor Apache Solr, because Solr often runs out of heap and
> the collection status becomes "down". How can I monitor Apache Solr?
> Are there any tools for monitoring Solr, or how should I do it?

The easiest way is to use the Solr ping feature:
https://cwiki.apache.org/confluence/display/solr/Ping

It will quickly and reliably tell you if Solr is still alive.

There is also a status call: /solr/admin/info/system?wt=json which can
tell you how much free memory you have left.
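
A trivial SolrJ version of the ping check, as a sketch (the core URL is
an example, and this assumes the ping request handler is available on
the core):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.SolrPingResponse;

public class PingCheck {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/myCore").build()) {
      SolrPingResponse ping = solr.ping();
      // Status 0 means OK; a non-zero status or an exception means trouble
      System.out.println("ping status: " + ping.getStatus() + " in " + ping.getElapsedTime() + " ms");
    }
  }
}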

 - Bram



SolrJ & json.facet?

2016-05-25 Thread Bram Van Dam
Hey folks,

Is there any way to use the "new" json.facet features (or some Java
equivalent) using SolrJ? I've had a quick look at the source code, but
nothing really jumps out at me.

Thanks,

 - Bram


Re: Solr 5.2.1 on Java 8 GC

2016-05-01 Thread Bram Van Dam
On 30/04/16 17:34, Davis, Daniel (NIH/NLM) [C] wrote:
> Bram, on the subject of brute force - if your script is "clever" and uses 
> binary first search, I'd love to adapt it to my environment.  I am trying to 
> build a truly multi-tenant Solr because each of our indexes is tiny, but all 
> together they will eventually be big, and so I'll have to repeat this 
> experiment, many, many times.

Sorry to disappoint, the script is very dumb, and it doesn't just
start/stop Solr, it installs our application suite, picks a GC profile
at random, indexes a boatload of data and then runs a bunch of query tests.

Three pointers I can give you:

1) beware of JVM versions, especially when using the G1 collector, it
behaves horribly on older JVMs but rather nicely on newer versions.

2) At the very least you'll want to test the G1 and CMS collectors.

3) One large index vs many small indexes: the behaviour is very
different. Depending on how many indexes you have, it might be worth to
run each one in a different JVM. Of course that's not practical if you
have thousands of indexes.

 - Bram



Re: Tuning solr for large index with rapid writes

2016-04-30 Thread Bram Van Dam
> If I'm reading this right, you have 420M docs on a single shard?
> Yep, you were reading it right. 

As Erick mentioned, it's hard to give concrete sizing advice, but we've
found 120M to be the magic number. When a shard contains more than 120M
documents, performance goes down rapidly & GC pauses grow a lot longer.
Up until 250M things remain acceptable. But then performance starts to
drop very quickly after that.

 - Bram



Re: Tuning solr for large index with rapid writes

2016-04-30 Thread Bram Van Dam
On 29/04/16 16:33, Erick Erickson wrote:
> You have one huge advantage when doing prototyping, you can
> mine your current logs for real user queries. It's actually
> surprisingly difficult to generate, say, 10,000 "realistic" queries. And
> IMO you need something approaching that number to ensure that
> your queries don't hit the caches etc

Our approach is to log queries for a while, boil them down to their
different use cases (full text search, simple facet, complex 2D ranged
with stats, etc) and then generate realistic parameter values for each
search field used in those queries. It's not perfect, but it gives you
large amounts of reasonably realistic queries.

Also, you can bypass the query cache by adding {!cache=false} to your query.

 - Bram




Re: Solr 5.2.1 on Java 8 GC

2016-04-30 Thread Bram Van Dam
On 29/04/16 16:40, Nick Vasilyev wrote:
> Not sure if it helps anyone, but I am seeing decent results with the
> following.
> 
> It was mostly a result of trial and error, 

I'm ashamed to admit that I've used a similar approach: wrote a simple
test script to try out various GC settings with various values. Repeat
ad nauseum. Ended with a configuration that works reasonably well on the
environment in question, but will probably fail horribly anywhere else.

When in doubt, use brute force.

 - Bram


Re: Storing different collection on different hard disk

2016-04-21 Thread Bram Van Dam
On 21/04/16 03:56, Zheng Lin Edwin Yeo wrote:
> This is the working one:
> dataDir=D:/collection1/data

Ah yes. Backslashes are escape characters in properties files.
C:\\collection1\\data would probably work as well.

 - bram


Re: Indexing 700 docs per second

2016-04-20 Thread Bram Van Dam
> I have a requirement to index (mainly updation) 700 docs per second.
> Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260
> bytes (6 fields out of which only 2 will undergo updation at the above
> rate). This collection has around 122Million docs and that count is pretty
> much a constant.

We've found that average index size per document is a good predictor of
performance. For instance, I've got a 150GB index lying around,
containing 400M documents. That's roughly 400 bytes per document in
index size. This was indexed @ 4500 documents/second.

If the average index size per document doubles, the throughput will go
down by about a third. Your mileage may vary.

But yeah, I would say that 700 docs per second on your machine won't be much of a
problem. Especially considering your index will likely fit in memory.

 - Bram




Re: Storing different collection on different hard disk

2016-04-20 Thread Bram Van Dam
Have you considered simply mounting different disks under different
paths? It looks like you're using Windows, so I'm not sure if that's
possible, but it seems like a relatively basic task, so who knows.

You could mount Disk 1 as /path/to/collection1 and Disk 2 as
/path/to/collection2. That way you won't need to change your Solr
configuration at all.

 - Bram

On 20/04/16 06:04, Zheng Lin Edwin Yeo wrote:
> Thanks for your info.
> 
> I tried to set, but Solr is not able to find the indexes, and I get the
> following error:
> 
>- *collection1:*
> org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
>java.io.IOException: The filename, directory name, or volume label syntax
>is incorrect
> 
> 
> Is this the correct way to set in core.properties file?
> dataDir="D:\collection1"
> 
> Also, do we need to set the dataDir in solrconfig.xml as well?
> 
> Regards,
> Edwin
> 
> 
> On 19 April 2016 at 19:36, Alexandre Rafalovitch  wrote:
> 
>> Have you tried setting dataDir parameter in the core.properties file?
>> https://cwiki.apache.org/confluence/display/solr/Defining+core.properties
>>
>> Regards,
>>Alex.
>> 
>> Newsletter and resources for Solr beginners and intermediates:
>> http://www.solr-start.com/
>>
>>
>> On 19 April 2016 at 20:43, Zheng Lin Edwin Yeo 
>> wrote:
>>> Hi,
>>>
>>> I would like to find out is it possible to store the indexes file of
>>> different collections in different hard disk?
>>> Like for example, I want to store the indexes of collection1 in Hard Disk
>>> 1, and the indexes of collection2 in Hard Disk 2.
>>>
>>> I am using Solr 5.4.0
>>>
>>> Regards,
>>> Edwin
>>
> 



Re: [Possible Bug] 5.5.0 Startup script ignoring host parameter?

2016-03-31 Thread Bram Van Dam
On 30/03/16 16:45, Shawn Heisey wrote:
> The host parameter does not control binding to network interfaces.  It
> controls what hostname is published to zookeeper when running in cloud mode.

Oh I see. That wasn't clear from the documentation. Might be worth
adding such a parameter to the startup script, in that case.

But for now I'll just edit the config file, thanks for the tip!

 - Bram



[Possible Bug] 5.5.0 Startup script ignoring host parameter?

2016-03-30 Thread Bram Van Dam
Hi folks,

It looks like the "-h" parameter isn't being processed correctly. I want
Solr to listen on 127.0.0.1, but instead it binds to all interfaces. Am
I doing something wrong? Or am I misinterpreting what the -h parameter
is for?

Linux:

# bin/solr start -h 127.0.0.1 -p 8180
# netstat -tlnp | grep 8180
tcp6   0  0 :::8180 :::*
LISTEN  14215/java

Windows:

> solr.cmd start -h 127.0.0.1 -p 8180
> netstat -a
TCP    0.0.0.0:8180    MyBox:0    LISTENING


The Solr JVM args are likely the cause. From the Solr Admin GUI:
-DSTOP.KEY=solrrocks
-Dhost=127.0.0.1
-Djetty.port=8180

Presumably that ought to be -Djetty.host=127.0.0.1 instead of -Dhost?

This has potential security implications for us :-(

Thanks,

 - Bram


Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Bram Van Dam
On 23/03/16 15:50, Yonik Seeley wrote:
> Kind of a unique situation for a dot-oh release, but from the Solr
> perspective, 6.0 should have *fewer* bugs than 5.5 (for those features
> in 5.5 at least)... we've been squashing a bunch of docValue related
> issues.

I've been led to understand that 6.X (at least the Lucene part?) won't
be backwards compatible with 4.X data. 5.5 at least works fine with data
files from 4.7, for instance. With that in mind, at least from my
selfish perspective, applying fixes to 5.X would be much appreciated ;-)

 - Bram




Re: Solr 5.5.0: JVM args warning in console logfile.

2016-03-24 Thread Bram Van Dam
> When I made the change outlined in the patch on SOLR-8145 to my bin/solr
> script, the warning disappeared.  That was not the intended effect of
> the patch, but I'm glad to have the mystery solved.
> 
> Thank you for mentioning the problem so we could track it down.

You're welcome. And thanks for fixing it ;-). We're rather particular
about what appears in our logs.

 - Bram



Re: Solr 5.5.0: JVM args warning in console logfile.

2016-03-22 Thread Bram Van Dam
On 22/03/16 15:16, Shawn Heisey wrote:
> This message is not coming from Solr.  It's coming from Jetty.  Solr
> uses Jetty, but uses it completely unchanged.

Ah you're right. Here's the offending code:

https://github.com/eclipse/jetty.project/blob/ac24196b0d341534793308d585161381d5bca4ac/jetty-start/src/main/java/org/eclipse/jetty/start/Main.java#L446

Doesn't look like there's an immediate workaround. Darn.

 - Bram



Solr 5.5.0: JVM args warning in console logfile.

2016-03-22 Thread Bram Van Dam
Hey folks,

When I start 5.5.0 (on RHEL), the following entry is added to
server/logs/solr-8983-console.log:

WARNING: System properties and/or JVM args set.  Consider using
--dry-run or --exec

I can't quite figure out what's causing this. Any clues on how to get
rid of it?

Thanks,

 - Bram


SolrCloud: Frequent "No registered leader was found" errors

2015-12-22 Thread Bram Van Dam
Hi folks,

Been doing some SolrCloud testing and I've been experiencing some
problems. I'll try to be relatively brief, but feel free to ask for
additional information.

I've added about 200 million documents to a SolrCloud. The cloud
contains 3 collections, and all documents were added to all three
collections.

While indexing these documents, we noticed 486k (!!) "No registered
leader was found" errors, 482k (!!) of which referred to the same shard.
The other shards are more or less evenly distributed in the log.

This indexing job has been running for about 5 days now, and is pretty
much IO-bound. CPU usage is ~50%. The load average, on the other hand,
has been 128 for 5 days straight. Which is high, but fine: the machine
is responsive.

Memory usage is fine. Most of it is going towards file system caches and
the like. Each Solr instance has 8GB Xmx, and is currently using about
7GB. I haven't noticed any OutOfMemoryErrors in the log files.

Monitoring shows that both Solr instances have been up throughout these
procedings.

Now, I'm willing to accept that these Solr instances don't have enough
memory, or anything else, but I'm not seeing any of this reflected in
the log files, which I'm finding troubling.

What I do notice in the log file, is the very vague "SolrException:
Service Unavailable". See below.

Could anyone shed some light on what could be causing these errors?

Thanks a bunch,

 - Bram


SolrCloud Setup:


- Version: 5.4.0
- 3 Collections
-- firstCollection : 18 shards
-- secondCollection: 36 shards
-- thirdCollection : 79 shards
- Routing: implicit
- 2 Solr Instances
-- 8GB Xmx.

Machine:

- Hexacore Xeon E5-1650
- 64GB RAM
- 50TB Disk (RAID6, 10 disks)

Leader Stack Trace:
---

Caused by:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: No
registered leader was found after waiting for 4000ms , collection:
biweekly slice: thirdCollectionShard39
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:495)
~[solr-solrj-4.7.1.jar:4.7.1 1582953 - sarowe - 2014-03-29 00:43:32]
at
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:199)
~[solr-solrj-4.7.1.jar:4.7.1 1582953 - sarowe - 2014-03-29 00:43:32]
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:118)
~[solr-solrj-4.7.1.jar:4.7.1 1582953 - sarowe - 2014-03-29 00:43:32]
at
org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
~[solr-solrj-4.7.1.jar:4.7.1 1582953 - sarowe - 2014-03-29 00:43:32]
at
org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
~[solr-solrj-4.7.1.jar:4.7.1 1582953 - sarowe - 2014-03-29 00:43:32]


Service Unavailable Log:



527280878 ERROR (qtp59559151-194160) [c:collectionTwo
s:collectionTwoShard12 r:core_node12
x:collectionTwo_collectionTwoShard12_replica1]
o.a.s.u.SolrCmdDistributor forwarding update to
http://[CENSORED]:8983/solr/collectionTwo_collectionTwoShard1_replica1/
failed - retrying ... retries: 15 add{,id=000195641101}
params:update.distrib=TOLEADER=http://[CENSORED]:/solr/collectionTwo_collectionTwoShard12_replica1/
rsp:503:org.apache.solr.common.SolrException: Service Unavailable





Re: Deduplication

2015-05-20 Thread Bram Van Dam
 Write a custom update processor and include it in your update chain.
 You will then have the ability to do anything you want with the entire
 input document before it hits the code to actually do the indexing.

This sounded like the perfect option ... until I read Jack's comment:


 My understanding was that the distributed update processor is near the end
 of the chain, so that running of user update processors occurs before the
 distribution step, but is that distribution to the leader, or distribution
 from leader to replicas for a shard?

That would pose some potential problems.

Would a custom update processor make the solution cloud-safe?
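
For what it's worth, here's a minimal sketch of the kind of processor I
have in mind -- the unique field is assumed to be the schema's id, the
class name is made up, error handling is omitted, and it deliberately
says nothing about the distributed case:

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.solr.common.SolrException;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class RejectDuplicateProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                String id = cmd.getSolrInputDocument().getFieldValue("id").toString();
                // Reject the add if a document with this id is already visible in
                // the local index. Note: this only sees committed documents on
                // this core.
                if (req.getSearcher().getFirstMatch(new Term("id", id)) != -1) {
                    throw new SolrException(SolrException.ErrorCode.CONFLICT,
                            "Duplicate value for unique field id: " + id);
                }
                super.processAdd(cmd);
            }
        };
    }
}

It would still have to be registered in an updateRequestProcessorChain in
solrconfig.xml; where it sits relative to the distributed update processor
is exactly the cloud-safety question.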

Thx,

 - Bram



Re: Deduplication

2015-05-20 Thread Bram Van Dam
On 19/05/15 14:47, Alessandro Benedetti wrote:
 Hi Bram,
 what do you mean with :
   I
 would like it to provide the unique value myself, without having the
 deduplicator create a hash of field values  .
 
 This is not deduplication, but simple document filtering based on a
 constraint.
 In the case you want de-duplication ( which seemed from your very first
 part of the mail) here you can find a lot of info :

Not sure whether de-duplication is the right word for what I'm after, I
essentially want a unique constraint on an arbitrary field. Without
overwrite semantics, because I want Solr to tell me if a duplicate is
sent to Solr.

I was thinking that the de-duplication feature could accomplish this
somehow.


 - Bram


Re: Solr 5.0, Jetty and WAR

2015-05-19 Thread Bram Van Dam
 My organization has issues with Jetty (some customers don't want Jetty on
 their boxes, but are OK with WebSphere or Tomcat) so I'm trying to figure
 out: how to get Solr on WebSphere / Tomcat without using WAR knowing that
 the WAR will go away.

I understand that some customers are irrational. Doesn't mean you (or
Solr) should cater to them. I've heard the objections, and they're all
nonsense. Jetty is slow? WebSphere is easier to manage? Tomcat doesn't
support X/Y/Z. It's all nonsense.

We're currently in the process of migrating our applications away from
WARs for many of the same reason as Solr. Whether or not we use Jetty
internally to handle HTTP requests isn't anyone's concern.

The best way to explain it to your irrational customers is that you're
running Solr, instead of confusing them with useless details. It doesn't
matter that Solr uses Jetty internally.

As for running Solr on WebSphere/Tomcat without a WAR...that's not going
to happen. Unless you want to fork Solr and keep the WAR...

 - Bram


Deduplication

2015-05-19 Thread Bram Van Dam
Hi folks,

I'm looking for a way to have Solr reject documents if a certain field
value is duplicated (reject, not overwrite). There doesn't seem to be
any kind of unique option in schema fields.

The de-duplication feature seems to make this (somewhat) possible, but I
would like it to provide the unique value myself, without having the
deduplicator create a hash of field values.

Am I missing an obvious (or less obvious) way of accomplishing this?

Thanks,

 - Bram


Date Time datatypes?

2015-03-30 Thread Bram Van Dam

Howdy folks,

Is there any way to index only the date and time portions of a datetime field?

A Date is really a period of 24hrs, starting at 00:00 in said date's 
time zone. It would be useful if there was a way to search for documents 
of a certain date with these semantics.


As for times, I'd like to be able to do queries like time:[17:00 TO 
18:00]. I suppose I could accomplish that by resetting the date portion 
to some bogus value, but then my facet/range values will contain that 
bogus date.
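
One alternative to the bogus-date trick would be to index the time
portion as a plain int (seconds since midnight) next to the full
datetime, so a time range becomes a numeric range. A rough sketch,
assuming a hypothetical time_of_day int field in the schema:

import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

import org.apache.solr.common.SolrInputDocument;

public class TimeOfDayExample {

    // Seconds since midnight in the given time zone, e.g. 17:30:00 -> 63000.
    static int secondsSinceMidnight(Date date, TimeZone tz) {
        Calendar cal = Calendar.getInstance(tz);
        cal.setTime(date);
        return cal.get(Calendar.HOUR_OF_DAY) * 3600
             + cal.get(Calendar.MINUTE) * 60
             + cal.get(Calendar.SECOND);
    }

    static SolrInputDocument buildDoc(String id, Date timestamp) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("timestamp", timestamp);   // the regular datetime field
        doc.addField("time_of_day",
                secondsSinceMidnight(timestamp, TimeZone.getTimeZone("Europe/Brussels")));
        return doc;
    }
}

17:00 to 18:00 then becomes time_of_day:[61200 TO 64800], and range
facets over it stay free of any bogus date.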


I suppose the alternative is to create my own data types. Extending 
PrimitiveFieldType doesn't seem too hairy but maybe I'm overlooking some 
of the complexity...


Thanks a bunch,

 - Bram


Re: How large is your solr index?

2015-01-11 Thread Bram Van Dam

Do note that one strategy is to create more shards than you need at
the beginning. Say you determine that 10 shards will work fine, but
you expect to grow your corpus by 2x. _Start_  with 20 shards
(multiple shards can be hosted in the same JVM, no problem, see
maxShardsPerNode in the collections API CREATE action. Then
as your corpus grows you can move the shards to their own
boxes.


I guess planning ahead is something we can do. We usually have a pretty 
good idea of how large our indexes are going to be (number of documents 
is one of the things we base our license pricing on). I still feel like 
shard management could be made easier. I'll see if I can have a look at 
JIRA and try to pitch in.


Thanks a lot for the input, especially Shawn & Erick!


Re: How large is your solr index?

2015-01-08 Thread Bram Van Dam

On 01/07/2015 05:42 PM, Erick Erickson wrote:

True, and you can do this if you take explicit control of the document
routing, but...
that's quite tricky. You forever after have to send any _updates_ to the same
shard you did the first time, whereas SPLITSHARD will do the right thing.


Hmm. That is a good point. I wonder if there's some kind of middle 
ground here? Something that lets me send an update (or new document) to 
an arbitrary node/shard but which is still routed according to my 
specific requirements? Maybe this can already be achieved by messing 
with the routing?



<snip> there are some components that don't do the right thing in
distributed mode, joins for instance. The list is actually quite small and
is getting smaller all the time.



That's fine. We have a lot of query (pre-)processing outside of Solr. 
It's no problem for us to send a couple of queries to a couple of shards 
and aggregate the result ourselves. It would, of course, be nice if 
everything worked in distributed mode, but at least for us it's not an 
issue. This is a side effect of our complex reporting requirements -- we 
do aggregation, filtering and other magic on data that is partially in 
Solr and partially elsewhere.



Not true if the other shards have had any indexing activity. The commit is
usually forwarded to all shards. If the individual index on a
particular shard is
unchanged then it should be a no-op though.


I think a no-op commit no longer clears the caches either, so that's great.


But the usage pattern here is its own bit of a trap. If all your
indexing is going
to a single shard, then also the entire indexing _load_ is happening on that
shard. So the CPU utilization will be higher on that shard than the older ones.
Since distributed requests need to get a response from every shard before
returning to the client, the response time will be bounded by the response from
the slowest shard and this may actually be slower. Probably only noticeable
when the CPU is maxed anyway though.


This is a very good point. But I don't think SPLITSHARD is the magical
answer here. If you have N shards on N boxes, and they are all getting
nearly full and you decide to split one and move half to a new box,
you'll end up with N-1 nearly full boxes and 2 half-full boxes. What
happens if the disks fill up further? Do I have to split each shard?
That sounds pretty nightmarish!


 - Bram


Re: How large is your solr index?

2015-01-07 Thread Bram Van Dam

On 01/06/2015 07:54 PM, Erick Erickson wrote:

Have you considered pre-supposing SolrCloud and using the SPLITSHARD
API command?


I think that's the direction we'll probably be going. Index size (at 
least for us) can be unpredictable in some cases. Some clients start out 
small and then grow exponentially, while others start big and then don't 
grow much at all. Starting with SolrCloud would at least give us that 
flexibility.


That being said, SPLITSHARD doesn't seem ideal. If a shard reaches a 
certain size, it would be better for us to simply add an extra shard, 
without splitting.




On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge peter.stu...@gmail.com wrote:

++1 for the automagic shard creator. We've been looking into doing this
sort of thing internally - i.e. when a shard reaches a certain size/num
docs, it creates 'sub-shards' to which new commits are sent and queries to
the 'parent' shard are included. The concept works, as long as you don't
try any non-dist stuff - it's one reason why all our fields are always
single valued.


Is there a problem with multi-valued fields and distributed queries?


A cool side-effect of sub-sharding (for lack of a snappy term) is that the
parent shard then stops suffering from auto-warming latency due to commits
(we do a fair amount of committing). In theory, you could carry on
sub-sharding until your hardware starts gasping for air.


Sounds like you're doing something similar to us. In some cases we have 
a hard commit every minute. Keeping the caches hot seems like a very 
good reason to send data to a specific shard. At least I'm assuming that
when you add documents to a single shard and commit, the other shards
won't be impacted...


 - Bram



Re: Solr support for multi-tenant applications

2015-01-07 Thread Bram Van Dam

One possibility is to have separate core for each tenant domain.


You could do that, and it's probably the way to go if you have a lot of 
data.


However, if you don't have much data, you can achieve multi-tenancy by 
adding a filter to all your queries, for instance:


query = userQuery
filterQuery = tenant:currentTenant

 - Bram


Re: How large is your solr index?

2015-01-04 Thread Bram Van Dam

On 01/04/2015 02:22 AM, Jack Krupansky wrote:

The reality doesn't seem to
be there today. 50 to 100 million documents, yes, but beyond that takes
some kind of heroic effort, whether a much beefier box, very careful and
limited data modeling or limiting of query capabilities or tolerance of
higher latency, expert tuning, etc.


I disagree. On the scale, at least. Up until 500M Solr performs well 
(read: well enough considering the scale) in a single shard on a single 
box of commodity hardware. Without any tuning or heroic efforts. Sure, 
some queries aren't as snappy as you'd like, and sure, indexing and 
querying at the same time will be somewhat unpleasant, but it will work, 
and it will work well enough.


Will it work for thousands of concurrent users? Of course not. Anyone 
who is after that sort of thing won't find themselves in this scenario 
-- they will throw hardware at the problem.


There is something to be said for making sharding less painful. It would 
be nice if, for instance, Solr would automagically create a new shard 
once some magic number was reached (2B at the latest, I guess). But then 
that'll break some query features ... :-(


The reason we're using single large instances (sometimes on beefy 
hardware) is that SolrCloud is a pain. Not just from an administrative 
point of view (though that seems to be getting better, kudos for that!), 
but mostly because some queries cannot be executed with 
distributed=true. Our users, at least, prefer a slow query over an 
impossible query.


Actually, this 2B limit is a good thing. It'll help me convince 
$management to donate some of our time to Solr :-)


 - Bram


FOSDEM Open source search devroom

2015-01-02 Thread Bram Van Dam

Hi folks,

There will be an Open source search devroom[1] at this year's FOSDEM in 
Brussels, 31st of January & 1st of February.


I don't know if there will be a Lucene/Solr presence (there's no 
schedule for the dev room yet), but this seems like a good place meet up 
and talk shop.


I'll be there, and I hope some of you will as well.

 - Bram

[1] https://fosdem.org/2015/schedule/track/open_source_search/




Re: How large is your solr index?

2014-12-31 Thread Bram Van Dam

On 12/30/2014 05:03 PM, Erick Erickson wrote:

I think that it would be _extremely_ helpful to have a bunch of war
stories to reference. In my experience, people dealing with large
numbers of documents really are most concerned with whether what
they're doing is _possible_, and are mostly looking to see if someone
else has been there and done that. Of course they'd like all the
specificity possible, but there's a lot of comfort in knowing
something similar has been done before.


That's right. We deal with some pretty interesting use cases for banks. 
Some of them don't mind throwing hardware at a problem (some do).


One use case I can talk about is an archiving application. A customer 
calls in, asks about something, someone has to physically walk down to 
an archive, get a tape/cd/folder, plonk it in some ancient piece of 
hardware, and then rely on awful tools like windows file search to find 
whatever it is they were looking for.


No matter *how bad* Solr performance might get at the
billions-of-documents-on-cheap-and-crappy-hardware scale, it's *always*
going to be better than the manual steps I just described. Even if it
takes an hour to run, the value added by being able to search and
report using structured & full-text search is immense.





Re: How large is your solr index?

2014-12-30 Thread Bram Van Dam

On 12/29/2014 08:08 PM, ralph tice wrote:

Like all things it really depends on your use case.  We have 160B
documents in our largest SolrCloud and doing a *:* to get that count takes
~13-14 seconds.  Doing a text:happy query only takes ~3.5-3.6 seconds cold,
subsequent queries for the same terms take 500ms.


That seems perfectly reasonable.


Facets over high cardinality fields are going to be painful.  We currently
programmatically limit the range to around 1/12th or 1/13th of the data set
for facet queries, but plan on evaluating Heliosearch (initial results
didn't look promising) and Toke's sparse faceting patch (SOLR-5894) to help
out there.


We had a look at Heliosearch a while ago and found it unsuitable. Seems 
like they're trying to make use of some native x86_64 code and HotSpot 
JVM specific features which we can't use. Some of our clients use IBM's 
JVM so we're pretty much limited to strictly Java.



There could be more support / ease of use enhancements for moving shards
across SolrClouds, moving shards across physically nodes within a
SolrCloud, and snapshot/restore of a SolrCloud, but there has also been a
lot of recent work in these areas that are starting to provide the
underlying infrastructure for more advanced shard management.


That's reassuring to hear. If we run in to these issues we can probably 
donate some time to work on them, so I'm not too worried about that.



I think there are more people getting into the space of 100B documents but
I only ran into or discovered a handful during my time at Lucene/Solr
Revolution this November.  The majority of large scale SolrCloud users seem
to have many collections (collections per logical user) rather than many
documents in one/few collections.


That's my understanding as well. Lucene Revolution is on the wrong side 
of the Atlantic for me. But there's an Open Source Search devroom at 
FOSDEM this year, which seems like a sensible place to discuss these 
things. I'll make a post on the relevant mailing lists about this after 
the holidays if anyone is interested.


Thanks for your detailed response!

 - Bram


Re: How large is your solr index?

2014-12-30 Thread Bram Van Dam

On 12/29/2014 09:53 PM, Jack Krupansky wrote:

And that Lucene index document limit includes deleted and updated
documents, so even if your actual document count stays under 2^31-1,
deleting and updating documents can push the apparent document count over
the limit unless you very aggressively merge segments to expunge deleted
documents.
On Mon, Dec 29, 2014 at 12:54 PM, Erick Erickson erickerick...@gmail.com
wrote:

When you say 2B docs on a single Solr instance, are you talking only one
shard?
Because if you are, you're very close to the absolute upper limit of a
shard, internally
the doc id is an int or 2^31. 2^31 + 1 will cause all sorts of problems.


Thankfully we're not doing any updates on that particular instance. But 
yes, we are getting close to the limits there. Is there any way to query 
the internal document ID? :-/


 - Bram


Re: How large is your solr index?

2014-12-30 Thread Bram Van Dam

On 12/29/2014 10:30 PM, Toke Eskildsen wrote:

That being said, I acknowledge that it helps with stories to get a feel of what 
can be done.


That's pretty much what I'm after, mostly to reassure myself that it can 
be done. Even if it does require a lot of hardware (which is fine).




At Lucene/Solr Revolution 2014, Grant Ingersoll also asked for user stories and 
pointed to https://wiki.apache.org/solr/SolrUseCases - sadly it has not caught 
on. The only entry is for our (State and University Library, Denmark) setup 
with 21TB / 7 billion documents on a single machine. To follow my own advice, I 
can elaborate that we have 1-3 concurrent users and a design goal of median 
response times below 2 seconds for faceted search. I guess that is at the 
larger end at the spectrum for pure size, but at the very low end for usage.


Thanks. I'll try to add some of our use cases!

 - Bram



Re: SolrCloud Paging on large indexes

2014-12-29 Thread Bram Van Dam

On 12/23/2014 04:07 PM, Toke Eskildsen wrote:

The beauty of the cursor is that it is has little to no overhead, relative to a 
standard top-X sorted search. A standard search uses a sliding window over the 
full result set, as does a cursor-search. Same amount of work. It is just a 
question of limits for the window.


That is very good to hear. Thanks.
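
For reference, using the cursor from SolrJ boils down to something like
this minimal sketch (field names are made up, and the sort has to be
deterministic and end on the uniqueKey field):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPagingExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery query = new SolrQuery("text:happy");
        query.setRows(1000);
        // Deterministic sort, ending on the uniqueKey field.
        query.setSort("timestamp", SolrQuery.ORDER.asc);
        query.addSort("id", SolrQuery.ORDER.asc);

        String cursor = CursorMarkParams.CURSOR_MARK_START;   // "*"
        while (true) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = server.query(query);
            // ... process rsp.getResults() ...
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) {
                break;   // no more results
            }
            cursor = next;
        }
    }
}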


Nobody will hit next 499 times, but a lot of our users skip to the last
page quite often. Maybe I should make *that* as hard as possible. Hmm.


Issue a search with sort in reverse order, then reverse the returned list of 
documents?


Sneaky. I like it. But in the end we're simply getting rid of the
last-button. Solves a lot of issues. If you have a billion search
results, you might as well refine your criteria!


 - Bram



How large is your solr index?

2014-12-29 Thread Bram Van Dam

Hi folks,

I'm trying to get a feel of how large Solr can grow without slowing down 
too much. We're looking into a use-case with up to 100 billion documents 
(SolrCloud), and we're a little afraid that we'll end up requiring 100 
servers to pull it off.


The largest index we currently have is ~2billion documents in a single 
Solr instance. Documents are smallish (5k each) and we have ~50 fields 
in the schema, with an index size of about 2TB. Performance is mostly 
OK. Cold searchers take a while, but most queries are alright after 
warming up. I wish I could provide more statistics, but I only have very 
limited access to the data (...banks...).


I'd be very grateful to anyone sharing statistics, especially on the
larger end of the spectrum -- with or without SolrCloud.


Thanks,

 - Bram


Re: SolrCloud Paging on large indexes

2014-12-23 Thread Bram Van Dam

On 12/22/2014 04:27 PM, Erick Erickson wrote:

Have you read Hossman's blog here?
https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/#referrer=solr.pl


Oh thanks, that's a pretty interesting read. The scale we're 
investigating is several orders of magnitude larger than what was tested 
there, so I'm still a bit worried.



Because if you're trying this and _still_ getting bad performance we
need to know.


I'll definitely keep you posted when our test results on larger indexes 
(~50 billion documents) come in, but this sadly won't be any time soon 
(infrastructure sucks). The largest index I currently have access to is 
about a billion documents in size. Paging there is a nightmare, but the 
Solr version is too old to support cursors so I'm afraid I can't offer 
any useful data.


Does anyone have any performance data on multi-billion-document indexes? 
With or without SolrCloud?



Bram:
One minor pedantic clarification.. The first round-trip only returns
the id and sort criteria (score by default), not the whole document,
although the effect is the same, as you page N into the corpus, the
default implementation returns N * (pageNum + 1) entries. Even worse,
each node itself has to _sort_ that many entries Then a second
call is made to get the page-worth of docs...


I was trying to keep it short and sweet, but yes, that's the way I think 
it works ;-)



That said, though, its pretty easy to argue that the 500th page is
pretty useless, nobody will ever hit the next page button 499 times.


Nobody will hit next 499 times, but a lot of our users skip to the last 
page quite often. Maybe I should make *that* as hard as possible. Hmm.


Thanks for the tips!

 - Bram


SolrCloud Paging on large indexes

2014-12-22 Thread Bram Van Dam

Hi folks,

If I understand things correctly, you can use paging & sorting in a
SolrCloud environment. However, if I request the first 10 documents, a
distributed query will be launched to all shards requesting the top 10,
and then (Shards * 10) documents will be sorted so that only the
top 10 is returned.


This is fine.

But I'm a little worried when going beyond the first page ... This 
becomes (Page * shards * 10). I'm worried that in a 50 billion document 
setup paging will just explode.


Does anyone have any experience with paging on large cloud setups? 
Positive or negative? Or can anyone offer some reassurances or words of 
caution with this approach?


Or should I tell my users that they can never go beyond Page X (which is 
fine if the alternative is hell fire and brimstone).


Thanks,

 - Bram


Re: SolrCloud Paging on large indexes

2014-12-22 Thread Bram Van Dam

On 12/22/2014 12:47 PM, heaven wrote:

I have a very bad experience with pagination on collections larger than a few
millions of documents. Pagination becomes very and very slow. Just tried to
switch to page 76662 and it took almost 30 seconds.


Yeah that's pretty much my experience, and I think SolrCloud would only 
exacerbate the problem (due to increased complexity of sorting). If 
there's no silver bullet to be found, I guess I'll just have to disable 
paging on large data sets -- which is fine, really, who the hell browses 
through 50 billion documents anyway? That's what search is for, right?


Thx,

 - Bram



Filter Query or Query

2014-11-10 Thread Bram Van Dam

Hi folks,

I have an index with hundreds of millions of documents, which users can 
query in various ways.


Two index fields are used by the system to hide certain documents from 
certain users (for instance: Department A can only view documents 
belonging to Department A, but not Department B).


We're currently doing something like this:

query = userQuery AND department:userDepartment

I'm wondering if perhaps a filter query might be a better fit?

query = userQuery
filterQuery = department:userDepartment

This feels a lot cleaner, but I'm worried about the performance 
implications. Some users have access to all documents, which might be a 
bit painful for the filter cache? Or am I missing something?
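
In SolrJ terms the difference is just this (deptA etc. are placeholders);
for the all-documents users the filter could presumably opt out of the
filter cache with the cache=false local param:

import org.apache.solr.client.solrj.SolrQuery;

public class DepartmentFilterExample {

    static SolrQuery buildQuery(String userQuery, String department) {
        SolrQuery query = new SolrQuery(userQuery);
        // The restriction goes into an fq, so it is cached once per department
        // and reused across all user queries for that department.
        query.addFilterQuery("department:" + department);

        // For a broad, rarely repeated restriction you can opt out of the cache:
        // query.addFilterQuery("{!cache=false}department:(deptA OR deptB OR deptC)");
        return query;
    }
}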


Thanks,

 - Bram


Re: [ANN] Heliosearch 0.06 released, native code faceting

2014-06-23 Thread Bram Van Dam

On 06/20/2014 06:48 PM, Yonik Seeley wrote:

Heliosearch is a Solr fork that will hopefully find it's way back to
the ASF in the future.


There are about 50 instances of sun.misc.unsafe in heliosearch's code at 
this point. Has this been tested on non-oracle VMs? Particularly IBM?


Also: please set up an actual mailing list? Google groups is pretty 
awful, and it feels a bit silly to discuss this on the solr mailing list :/


 - Bram


Paging while indexes

2014-06-23 Thread Bram Van Dam
Is there any way to take the current index version (or commit number or 
something) into account in paged queries? When navigating through a 
large result set in an NRT environment, I want the navigation to remain 
*fixed* on the initial results.


I'm trying to avoid a scenario where a user has a page of results, 
clicks next, and then has some of those results from the first page show 
up again because the index changed in the mean while.


Thanks,

 - Bram


Re: How to handle multiple sub second updates to same SOLR Document

2014-01-28 Thread Bram Van Dam

On 01/25/2014 07:21 PM, christopher palm wrote:

The problem I am trying to solve is that the order of these updates isn’t
guaranteed once the multi threaded SOLRJ client starts sending them to
SOLR, and older updates are overlaying the newer updates on the same
document.


Don't do that. There is no way to guarantee what your updates will look 
like. We deal with this by keeping a list of document IDs that are 
currently being updated. If the ID is already in the list, do nothing. 
If it isn't, go ahead. Some synchronization required :-)
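
In case it helps, the bookkeeping is roughly the following (class name
made up): claim the id before sending the update, release it afterwards,
and skip or re-queue updates for an id that is already in flight.

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class InFlightUpdates {

    // Ids of documents that currently have an update on the way to Solr.
    private final Set<String> inFlight =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    /** Returns true if the caller may send an update for this id right now. */
    public boolean tryClaim(String docId) {
        return inFlight.add(docId);   // add() is atomic: false means it was already claimed
    }

    /** Must be called once the update has been acknowledged (or has failed). */
    public void release(String docId) {
        inFlight.remove(docId);
    }
}

Whether a skipped update gets dropped or re-queued is application
specific; dropping is only safe when the in-flight update already
carries the latest state of the document.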


If you can think of a better way, I'd love to hear it, but I haven't 
found one.




Per-field/facet TimeZone in query?

2013-11-28 Thread Bram Van Dam

Howdy,

Is there any way to specify a time zone per field/facet? There is a 
global TZ query parameter, but I would like to be able to use a 
different TZ for different fields or facets in a query.


Thx,

 - Bram


Re: Core admin: create new core

2013-11-05 Thread Bram Van Dam

On 11/04/2013 04:06 PM, Bill Bell wrote:

You could pre create a bunch of directories and base configs. Create as needed. 
Then use schema less API to set it up ... Or make changes in a script and 
reload the core..


I ended up creating a little API that takes schema/config as input, 
creates the files and then uses the core admin CREATE request. Works 
like a charm. If I get bored I might just turn it into a solr plugin or 
something.
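
The "little API" is not much more than: write solrconfig.xml and
schema.xml into a fresh instance dir, then fire a CREATE through SolrJ.
A rough sketch (paths and names are placeholders, and it obviously only
works when this code runs on the same machine -- or a shared filesystem
-- as Solr):

import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class CreateCoreExample {

    /** Writes the given config/schema into a new instance dir and creates the core. */
    static void createCore(String solrUrl, String coreName, File solrHome,
                           String solrConfigXml, String schemaXml) throws Exception {
        File instanceDir = new File(solrHome, coreName);
        File confDir = new File(instanceDir, "conf");
        confDir.mkdirs();

        writeFile(new File(confDir, "solrconfig.xml"), solrConfigXml);
        writeFile(new File(confDir, "schema.xml"), schemaXml);

        // Point SolrJ at the container itself (not at a core) for admin requests.
        SolrServer admin = new HttpSolrServer(solrUrl);
        CoreAdminRequest.createCore(coreName, instanceDir.getAbsolutePath(), admin);
    }

    private static void writeFile(File file, String content) throws Exception {
        Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
        try {
            out.write(content);
        } finally {
            out.close();
        }
    }
}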


Ta for the suggestions!

 - Bram


Core admin: create new core

2013-11-04 Thread Bram Van Dam
The core admin CREATE function requires that the new instance dir and 
schema/config exist already. Is there a particular reason for this? It 
would be incredibly convenient if I could create a core with a new 
schema and new config simply by calling CREATE (maybe providing the 
contents of config.xml and schema.xml as base64 encoded strings in HTTP 
POST or something?).


I'm guessing this isn't currently possible?

Ta,

 - bram


Re: pivot range faceting

2013-10-20 Thread Bram Van Dam

On 10/21/2013 03:46 AM, Toby Lazar wrote:

Thanks for confirming my fears.  I saw some presentations where I thought
this feature was used, but perhaps it was done performing multiple range
queries.


Probably. I had a look at implementing the feature (because it's 
something we rely on quite a bit), but decided against it. The solr 
implementation of faceting is hard to get my head around -- and 
launching multiple queries seems to outperform pivot queries anyway.


You can use a range query to determine the ranges (and their total 
counts), and then launch an extra query per range.
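
A sketch of that two-step approach in SolrJ, with made-up field names
(price as an int range field, category as the field being faceted within
each range):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.RangeFacet;

public class RangeThenFacetExample {

    static void facetPerRange(SolrServer server, String userQuery) throws Exception {
        // Step 1: one range facet query to get the buckets and their totals.
        SolrQuery rangeQuery = new SolrQuery(userQuery);
        rangeQuery.setRows(0);
        rangeQuery.addNumericRangeFacet("price", 0, 1000, 100);
        QueryResponse rangeRsp = server.query(rangeQuery);

        // Step 2: one extra query per bucket, faceting on the second field.
        RangeFacet<?, ?> buckets = rangeRsp.getFacetRanges().get(0);
        for (RangeFacet.Count bucket : buckets.getCounts()) {
            int start = Integer.parseInt(bucket.getValue());
            int end = start + 100;

            SolrQuery perRange = new SolrQuery(userQuery);
            perRange.setRows(0);
            perRange.addFilterQuery("price:[" + start + " TO " + end + "}");
            perRange.addFacetField("category");
            QueryResponse rsp = server.query(perRange);
            // ... rsp.getFacetField("category") now holds the counts for this bucket ...
        }
    }
}

The per-range queries are independent, so they can also be fired in
parallel.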




[SolrJ] HttpSolrServer - maxRetries

2013-10-07 Thread Bram Van Dam

Hi folks,

Long story short: I'm occasionally getting exceptions under heavy load 
(SocketException: Connection reset). I would expect HttpSolrServer to 
try again maxRetries-times, but it doesn't.


For reasons I don't entirely understand, the call to 
httpClient.execute(method) is not inside the retry block (and thus will 
never be retried).


Is this a bug in HttpSolrServer? Or is this intended behaviour? I'd 
rather not wrap my code in a retry mechanism if HttpSolrServer provides one.
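
If it comes to that, the wrapper itself is trivial -- a bare-bones
sketch with arbitrary retry count and back-off:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class RetryingAdd {

    /** Retries a transient SolrServerException a few times before giving up. */
    static void addWithRetry(SolrServer server, SolrInputDocument doc) throws Exception {
        int maxAttempts = 3;      // arbitrary
        long backoffMs = 500;     // arbitrary

        for (int attempt = 1; ; attempt++) {
            try {
                server.add(doc);
                return;
            } catch (SolrServerException e) {
                if (attempt >= maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoffMs * attempt);
            }
        }
    }
}

Blindly retrying an add is only safe because our documents have a
uniqueKey: if the original request did make it through before the
connection dropped, the retry simply overwrites the same document.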


Thx,

 - Bram


Re: [SolrJ] HttpSolrServer - maxRetries

2013-10-07 Thread Bram Van Dam

On 10/07/2013 11:51 AM, Furkan KAMACI wrote:

Could you send you error logs?


Whoops, forgot to paste:


Caused by: org.apache.solr.client.solrj.SolrServerException: IOException 
occured when talking to server at: http://localhost:8080/solr/fooIndex
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:416) 
~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:26:57]
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) 
~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:26:57]
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117) 
~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:26:57]
at 
org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116) 
~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:26:57]
at 
org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102) 
~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:26:57]
at 
org.violet.search.service.IndexingService.addDocument(IndexingService.java:79) 
~[Violet-Search-1.06.003.jar:na]

... 8 common frames omitted
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:185) 
~[na:1.6.0_24]
at 
org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166) 
~[httpcore-4.2.2.jar:4.2.2]
at 
org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90) 
~[httpcore-4.2.2.jar:4.2.2]
at 
org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281) 
~[httpcore-4.2.2.jar:4.2.2]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254) 
~[httpcore-4.2.2.jar:4.2.2]
at 
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289) 
~[httpcore-4.2.2.jar:4.2.2]
at 
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300) 
~[httpcore-4.2.2.jar:4.2.2]
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127) 
~[httpcore-4.2.2.jar:4.2.2]
at 
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784) 
~[httpclient-4.2.3.jar:4.2.3]
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:353) 
~[solr-solrj-4.2.1.jar:4.2.1 1461071 - mark - 2013-03-26 08:26:57]

... 13 common frames omitted



Re: [SolrJ] HttpSolrServer - maxRetries

2013-10-07 Thread Bram Van Dam

On 10/07/2013 12:55 PM, Furkan KAMACI wrote:

One more thing, could you say that which version of Solr you are using?


The stacktrace comes from 4.2.1, but I suspect that this could occur on 
4.4 as well. I've not been able to reproduce this consistently: it has 
happened twice (!) after indexing around 100 million documents.


Re: {soft}Commit and cache flusing

2013-10-02 Thread Bram Van Dam

if there are no modifications to an index and a softCommit or hardCommit
issued, then solr flushes the cache.


Indeed. The easiest way to work around this is by disabling auto
commits and only committing when you have to.
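
On the client side that boils down to something like this minimal SolrJ
sketch; batch size and the choice between soft and hard commits are up
to you:

import java.util.Collection;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ExplicitCommits {

    /** Index a batch and make it visible with a single commit at the end. */
    static void indexBatch(SolrServer server, Collection<SolrInputDocument> batch) throws Exception {
        server.add(batch);
        // waitFlush=true, waitSearcher=true, softCommit=true: opens a new searcher
        // without fsync'ing segments; issue a plain (hard) commit at the end of the run.
        server.commit(true, true, true);
    }
}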


Re: OpenJDK or OracleJDK

2013-09-30 Thread Bram Van Dam

On 09/30/2013 01:11 PM, Raheel Hasan wrote:

Could someone tell me if OpenJDK or OracleJDK will be best for Apache Solr
over CentOS?


If you're using Java 7 (or 8) then it doesn't matter. If you're using 
Java 6, stick with the Oracle version.



