Re: Index very large number of documents from large number of clients

2015-08-15 Thread Toke Eskildsen
Troy Edwards  wrote:
> 1) There are about 6000 clients
> 2) The number of documents from each client are about 50 (average
> document size is about 400 bytes)

So roughly 3 billion documents / ~1TB index size. That means at least 2 shards, due to 
the 2-billion-document limit per Lucene index. If you want more advice than that, you will 
have to describe how the setup is to be used:

- How many requests per second?
- What is a typical query?
- How low does the response time need to be?

> 3) I have to wipe the index/collection every night and create a new one

Let's say you have 4 hours to do that. That's about 200K documents/second you 
need to index. That is a high number and with such tiny documents, I suspect 
that logistics might take up the largest part of that. This might call for 
multiple independent setups.
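
For what it's worth, the arithmetic behind that 200K/second figure can be sketched like this (using the half-a-million-docs-per-client number Shawn cites elsewhere in the thread, and an assumed 4-hour window):

```python
# Nightly rebuild back-of-the-envelope, using the half-a-million docs per
# client figure Shawn quotes elsewhere in the thread.
clients = 6000
docs_per_client = 500_000
window_seconds = 4 * 3600            # the assumed 4-hour window

total_docs = clients * docs_per_client   # 3,000,000,000
rate = total_docs / window_seconds       # sustained docs/second needed
print(f"{rate:,.0f} docs/sec")           # roughly 208,000/sec
```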

> 1) How to index such a large number of documents, i.e. do I use an HTTP client
> to send documents, or is the data import handler right, or should I try uploading
> CSV files?

As the overhead of constructing and parsing XML documents is not trivial, CSV 
seems reasonable. Probably also DIH.
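
If you go the CSV route, the batching could look something like this sketch (the collection name, field names, and localhost URL are assumptions, and the actual POST is left commented out since it needs a running Solr):

```python
import csv
import io
import urllib.request

def docs_to_csv(docs, fields):
    """Serialize a batch of docs into a CSV payload for Solr's /update/csv handler."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(docs)
    return buf.getvalue()

def post_csv(payload, url="http://localhost:8983/solr/clientdocs/update/csv"):
    """POST one CSV batch; commit once at the end of the whole run, not per batch."""
    req = urllib.request.Request(url, data=payload.encode("utf-8"),
                                 headers={"Content-Type": "text/csv"})
    return urllib.request.urlopen(req)

batch = [{"id": "c0042-000001", "client_id": "c0042", "body": "example text"}]
print(docs_to_csv(batch, ["id", "client_id", "body"]))
# post_csv(docs_to_csv(batch, ["id", "client_id", "body"]))  # needs a running Solr
```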

> 2) How many collections should I use?

Not 6000 in a single SolrCloud.

> 3) How many shards / replicas per collection should I use?
> 4) Do I need multiple Solr servers?

Not enough data about index usage to say. Between 1 and 50, not kidding.


- Toke Eskildsen


Re: Index very large number of documents from large number of clients

2015-08-15 Thread Erick Erickson
Piling on here. At the scale you're talking about, I suspect you'll not only have
a bunch of servers, you'll really have a bunch of completely separate
"Solr Clouds", complete with their own ZooKeepers etc. Partly for
administration's sake, partly for stability, etc.

Not sure that'll be true, mind you, but a "divide and conquer" approach
seems in order.

And to be clear, the multiple clusters are NOT because of 3 billion docs;
I've certainly seen that number of docs fit on 10 shards when the records
are as small as yours. OTOH, I've seen it take 30 or 60 shards, but
that's usually for complex documents. As Shawn says, prototyping is the
only way to be sure.

It's because if you choose to have 6,000 _collections_, you'll need some
kind of divisions.

Now, if you can create a smaller number of collections and store, say,
a collection ID with each doc, you can simply add an fq=collectionID clause
to each query and that'll show you only the docs belonging to that collection.
This could be significantly simpler than maintaining 6,000 collections.
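
A sketch of what the fq approach might look like from a client (the client_id field name and URLs are assumptions, not a definitive implementation):

```python
from urllib.parse import urlencode

def client_query_url(base, client_id, user_query, rows=10):
    """Build a select URL that restricts a shared collection to one client's docs.
    Keeping the per-client restriction in fq (not q) lets Solr cache that
    filter separately from the main query."""
    params = {"q": user_query, "fq": "client_id:%s" % client_id, "rows": rows}
    return "%s/select?%s" % (base, urlencode(params))

print(client_query_url("http://localhost:8983/solr/shared", "c0042", "invoice"))
```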

Best,
Erick

On Sat, Aug 15, 2015 at 8:40 PM, Shawn Heisey  wrote:
> On 8/15/2015 2:03 PM, Troy Edwards wrote:
>> I am using SolrCloud
>>
>> My initial requirements are:
>>
>> 1) There are about 6000 clients
>> 2) The number of documents from each client are about 50 (average
>> document size is about 400 bytes)
>> 3) I have to wipe the index/collection every night and create a new one
>>
>> Any thoughts/ideas/suggestions on:
>>
>> 1) How to index such large number of documents i.e. do I use an http client
>> to send documents or is data import handler right or should I try uploading
>> CSV files?
>
> This is general info only.
>
> 6000 clients, each with half a million docs?  That's 3 billion docs.
> There are some users who have more, but this is squarely in the realm of
> a HUGE install.
>
>> 2) How many collections should I use?
>>
>> 3) How many shards / replicas per collection should I use?
>
> Any answer we came up with for those two questions would involve quite a
> few assumptions, any one of which could be wrong.  The only way to
> really find out what you need is to set up a prototype system and test
> it with real data, real indexing requests, and real queries.  Record the
> results of the tests, change the configuration, rebuild the index(es),
> and run the tests again.
>
> The number one rule when it comes to Solr performance: Install enough
> memory so that all the index data on the server will fit in the
> available OS disk cache RAM.  You're going to have a lot of index data.
>
> https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> When the number of collections reaches the low hundreds, SolrCloud
> stability begins to suffer because of how much interaction with
> Zookeeper is required for very small cluster changes.  When there are
> thousands of collections, any little problem turns into a nightmare.
> Adding more machines doesn't help this particular problem.  Some ideas
> are being discussed to make this better, but users won't see the results
> of that effort until version 5.4 or 5.5, possibly later.
>
>> 4) Do I need multiple Solr servers?
>
> You would need multiple servers for any hope of redundancy, but the
> answer to the question I think you're trying to ask here is yes.
> Definitely.  Possibly a LOT of them.
>
> Thanks,
> Shawn
>


Re: Admin Login

2015-08-15 Thread Erick Erickson
Scott:

You'd better not even let them access Solr directly.

http://server:port/solr/admin/collections?action=DELETE&name=collection

Try it sometime on a collection that's not important ;)

But as Walter said, that'd be similar to allowing end users
unrestricted access to
a SQL database; that Solr URL is akin to "drop database".

Or, if you've locked down the admin stuff,

http://solr:port/solr/collection/update?commit=true&stream.body=<delete><query>*:*</query></delete>

Best
Erick

On Sat, Aug 15, 2015 at 6:57 PM, Scott Derrick  wrote:
> Walter,
>
> actually that explains it perfectly!  I will move it behind my Apache server...
>
> thanks,
>
> Scott
>
>
> On 8/15/2015 6:15 PM, Walter Underwood wrote:
>>
>> No one runs a public-facing Solr server. Just like no one runs a
>> public-facing MySQL server.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> On Aug 15, 2015, at 4:15 PM, Scott Derrick  wrote:
>>
>>> I'm somewhat puzzled there is no built-in security.  I can't imagine
>>> anybody is running a public-facing Solr server with the admin page wide
>>> open?
>>>
>>> I've searched and haven't found any solutions that work out of the box.
>>>
>>> I've tried the solutions here to no avail.
>>> https://wiki.apache.org/solr/SolrSecurity
>>>
>>> and here.  http://wiki.eclipse.org/Jetty/Tutorial/Realms
>>>
>>> The Solr security docs say to use the application server, and if I could
>>> run it on my Tomcat server I would already be done.  But I'm told I can't do
>>> that?
>>>
>>> What solutions are people using?
>>>
>>> Scott
>>>
>>> --
>>> Leave no stone unturned.
>>> Euripides
>>
>>
>
>
>


Re: Index very large number of documents from large number of clients

2015-08-15 Thread Shawn Heisey
On 8/15/2015 2:03 PM, Troy Edwards wrote:
> I am using SolrCloud
> 
> My initial requirements are:
> 
> 1) There are about 6000 clients
> 2) The number of documents from each client are about 50 (average
> document size is about 400 bytes)
> 3) I have to wipe the index/collection every night and create a new one
> 
> Any thoughts/ideas/suggestions on:
> 
> 1) How to index such large number of documents i.e. do I use an http client
> to send documents or is data import handler right or should I try uploading
> CSV files?

This is general info only.

6000 clients, each with half a million docs?  That's 3 billion docs.
There are some users who have more, but this is squarely in the realm of
a HUGE install.

> 2) How many collections should I use?
> 
> 3) How many shards / replicas per collection should I use?

Any answer we came up with for those two questions would involve quite a
few assumptions, any one of which could be wrong.  The only way to
really find out what you need is to set up a prototype system and test
it with real data, real indexing requests, and real queries.  Record the
results of the tests, change the configuration, rebuild the index(es),
and run the tests again.

The number one rule when it comes to Solr performance: Install enough
memory so that all the index data on the server will fit in the
available OS disk cache RAM.  You're going to have a lot of index data.

https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

https://wiki.apache.org/solr/SolrPerformanceProblems

When the number of collections reaches the low hundreds, SolrCloud
stability begins to suffer because of how much interaction with
Zookeeper is required for very small cluster changes.  When there are
thousands of collections, any little problem turns into a nightmare.
Adding more machines doesn't help this particular problem.  Some ideas
are being discussed to make this better, but users won't see the results
of that effort until version 5.4 or 5.5, possibly later.

> 4) Do I need multiple Solr servers?

You would need multiple servers for any hope of redundancy, but the
answer to the question I think you're trying to ask here is yes.
Definitely.  Possibly a LOT of them.

Thanks,
Shawn



Re: Admin Login

2015-08-15 Thread Scott Derrick

Walter,

actually that explains it perfectly!  I will move it behind my Apache server...

thanks,

Scott

On 8/15/2015 6:15 PM, Walter Underwood wrote:

No one runs a public-facing Solr server. Just like no one runs a public-facing 
MySQL server.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Aug 15, 2015, at 4:15 PM, Scott Derrick  wrote:


I'm somewhat puzzled there is no built-in security.  I can't imagine anybody is 
running a public-facing Solr server with the admin page wide open?

I've searched and haven't found any solutions that work out of the box.

I've tried the solutions here to no avail. 
https://wiki.apache.org/solr/SolrSecurity

and here.  http://wiki.eclipse.org/Jetty/Tutorial/Realms

The Solr security docs say to use the application server, and if I could run it 
on my Tomcat server I would already be done.  But I'm told I can't do that?

What solutions are people using?

Scott

--
Leave no stone unturned.
Euripides








Re: Admin Login

2015-08-15 Thread Walter Underwood
No one runs a public-facing Solr server. Just like no one runs a public-facing 
MySQL server.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Aug 15, 2015, at 4:15 PM, Scott Derrick  wrote:

> I'm somewhat puzzled there is no built-in security.  I can't imagine anybody is 
> running a public-facing Solr server with the admin page wide open?
> 
> I've searched and haven't found any solutions that work out of the box.
> 
> I've tried the solutions here to no avail. 
> https://wiki.apache.org/solr/SolrSecurity
> 
> and here.  http://wiki.eclipse.org/Jetty/Tutorial/Realms
> 
> The Solr security docs say to use the application server, and if I could run 
> it on my Tomcat server I would already be done.  But I'm told I can't do that?
> 
> What solutions are people using?
> 
> Scott
> 
> -- 
> Leave no stone unturned.
> Euripides



Admin Login

2015-08-15 Thread Scott Derrick
I'm somewhat puzzled there is no built-in security.  I can't imagine 
anybody is running a public-facing Solr server with the admin page wide 
open?


I've searched and haven't found any solutions that work out of the box.

I've tried the solutions here to no avail. 
https://wiki.apache.org/solr/SolrSecurity


and here.  http://wiki.eclipse.org/Jetty/Tutorial/Realms

The Solr security docs say to use the application server, and if I could 
run it on my Tomcat server I would already be done.  But I'm told I 
can't do that?


What solutions are people using?

Scott

--
Leave no stone unturned.
Euripides


Re: Cache for percentiles facets

2015-08-15 Thread Erick Erickson
You have to provide a lot more info about your problem, including
what you've tried, what your data looks like, etc.

You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

On Sat, Aug 15, 2015 at 10:27 AM, Håvard Wahl Kongsgård
 wrote:
> Hi, I have tried various options to speed up percentile calculation for
> facets. But the internal Solr cache only speeds up my queries from 22 to 19
> seconds.
>
> I'm using the new JSON facets: http://yonik.com/json-facet-api/
>
> Any tips for caching stats?
>
>
> -Håvard


Re: Index very large number of documents from large number of clients

2015-08-15 Thread Alexandre Rafalovitch
This is beyond my direct area of expertise, but one way to look at
this would be:
1) Create new collections offline. Down to each of the 6000 clients
having its own private collection (embedded SolrJ/server). Or some
sort of mini-hubs, e.g. a server per N clients.
2) Bring those collections into central server
3) Update the alias that used to point to the previous collection set so it
points to the new one:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CreateormodifyanAliasforaCollection
4) Delete old collection set, as nothing points at it anymore
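
Steps 2 through 4 map onto the Collections API roughly like this sketch (the host, alias, and collection names are hypothetical; CREATEALIAS and DELETE are the documented actions):

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # hypothetical host

def collections_api(action, **params):
    """Build a Collections API URL for the nightly create/alias/delete cycle."""
    query = {"action": action}
    query.update(params)
    return "%s/admin/collections?%s" % (SOLR, urlencode(query))

# Step 3: repoint the read alias at tonight's collection set.
print(collections_api("CREATEALIAS", name="docs", collections="docs_20150816"))
# Step 4: once nothing points at it, drop yesterday's set.
print(collections_api("DELETE", name="docs_20150815"))
```

Queries keep hitting the alias name throughout, so the swap is atomic from the searchers' point of view.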

Now, I don't know how that would play with shards/replicas.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 15 August 2015 at 16:03, Troy Edwards  wrote:
> I am using SolrCloud
>
> My initial requirements are:
>
> 1) There are about 6000 clients
> 2) The number of documents from each client are about 50 (average
> document size is about 400 bytes)
> 3) I have to wipe the index/collection every night and create a new one
>
> Any thoughts/ideas/suggestions on:
>
> 1) How to index such large number of documents i.e. do I use an http client
> to send documents or is data import handler right or should I try uploading
> CSV files?
>
> 2) How many collections should I use?
>
> 3) How many shards / replicas per collection should I use?
>
> 4) Do I need multiple Solr servers?
>
> Thanks


Index very large number of documents from large number of clients

2015-08-15 Thread Troy Edwards
I am using SolrCloud

My initial requirements are:

1) There are about 6000 clients
2) The number of documents from each client are about 50 (average
document size is about 400 bytes)
3) I have to wipe the index/collection every night and create a new one

Any thoughts/ideas/suggestions on:

1) How to index such a large number of documents, i.e. do I use an HTTP client
to send documents, or is the data import handler right, or should I try uploading
CSV files?

2) How many collections should I use?

3) How many shards / replicas per collection should I use?

4) Do I need multiple Solr servers?

Thanks


Cache for percentiles facets

2015-08-15 Thread Håvard Wahl Kongsgård
Hi, I have tried various options to speed up percentile calculation for
facets. But the internal Solr cache only speeds up my queries from 22 to 19
seconds.

I'm using the new JSON facets: http://yonik.com/json-facet-api/

Any tips for caching stats?


-Håvard


Re: phonetic filter factory question

2015-08-15 Thread Alexandre Rafalovitch
From the "teaching to fish" category of advice (since I don't know the
actual answer).

Did you try the "Analysis" screen in the Admin UI? If you check the "Verbose
output" box, you will see all the offsets and can easily confirm the
detailed behavior for yourself.

Regards,
  Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 15 August 2015 at 12:22, Jamie Johnson  wrote:
> The JavaDoc says that the PhoneticFilterFactory will "inject" tokens with
> an offset of 0 into the stream.  I'm assuming this means an offset of 0
> from the token that it is analyzing, is that right?  I am trying to
> collapse some of my schema, I currently have a text field that I use for
> general purpose text and another field with the PhoneticFilterFactory
> applied for finding things that are similar phonetically, but if this does
> inject at the current position then I could likely collapse these into a
> single field.  As always thanks in advance!
>
> -Jamie


phonetic filter factory question

2015-08-15 Thread Jamie Johnson
The JavaDoc says that the PhoneticFilterFactory will "inject" tokens with
an offset of 0 into the stream.  I'm assuming this means an offset of 0
from the token that it is analyzing, is that right?  I am trying to
collapse some of my schema, I currently have a text field that I use for
general purpose text and another field with the PhoneticFilterFactory
applied for finding things that are similar phonetically, but if this does
inject at the current position then I could likely collapse these into a
single field.  As always thanks in advance!

-Jamie


Re: Big SolrCloud cluster with a lot of collections

2015-08-15 Thread Toke Eskildsen
yura last  wrote:
> Hi All, I am testing a SolrCloud with many collections. The version is 5.2.1
> and I installed 3 machines – each one with 4 cores and 8 GB RAM. Then I
> created collections with 3 shards and a replication factor of 2. It gives me 2
> cores per collection on each machine. I reached almost 900 collections
> and then the cluster got stuck and I couldn't revive it.

That mirrors what others are reporting.

> As I understand it, Solr has issues with many collections (thousands). If I
> use many more machines, will that give me the ability to create tens of
> thousands of collections, or is the limit a couple of thousand?

(Caveat: I have no real world experience with high collection count in Solr)

Adding more machines will not really help you as the problem with thousands of 
collections is not hardware power per se, but rather the coordination of them. 
You mention 180K collections below and with the current Solr architecture, I do 
not see that happening.

> I want to build a cluster that will handle 10 billion documents per day
> (currently I have 1 billion) and keep the data for 90 days.

Are those real requirements or something somebody hopes will come true some 
years down the road? Technology has a habit of catching up, and while a 
900-billion-document setup is a challenge today, it will probably be a lot 
easier in 5 years.

While we are discussing this, it would help if you could also approximate the 
index size in bytes: how large do you expect the sum of shards for 1 billion of 
your documents to be? Likewise, which kinds of queries do you expect? Grouping? 
Faceting? All these things multiply.

Anyway, your requirements are in a league where there is not much collective 
experience. You will definitely have to build a serious prototype or three to 
get a proper idea of how much power you need: The standard advices for scaling 
Solr does not make economical sense beyond a point. But you seem to have 
started that process already with your current tests.

> I want to support 2000 customers, so I would like to split them into collections
> and also split by day. (180,000 collections)

As 180,000 collections currently seems infeasible for a single SolrCloud, you 
should consider alternatives:

1) If your collections are independent, then build fully independent clusters 
of machines.

2) Don't use collections for dividing data between your customers. Use a field 
with a customer-ID or something like that.

> If I create big collections I will have performance issues with queries,
> and most of the queries are for a specific customer.

Why would many smaller collections have better performance than fewer larger 
collections?

> (I also have cross customers queries)

If you make independent setups, that could be solved by querying them 
independently and doing the merging yourself.
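
A client-side merge of that sort might be sketched like this (assuming each independent setup returns hits already sorted by descending score; the dict shape is illustrative):

```python
import heapq
import itertools

def merge_results(result_lists, rows=10):
    """Merge per-cluster result lists, each already sorted by descending score,
    into one top-N list -- the kind of merge a thin aggregator in front of
    independent Solr setups might do."""
    # heapq.merge streams the lists without concatenating them first.
    merged = heapq.merge(*result_lists, key=lambda doc: -doc["score"])
    return list(itertools.islice(merged, rows))
```

Note that raw scores from independent indexes are not strictly comparable, so this is only reasonable when the setups share a schema and similar term statistics.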

- Toke Eskildsen


Re: Big SolrCloud cluster with a lot of collections

2015-08-15 Thread Jack Krupansky
1. Keep the number of collections down to the low hundreds max. Preferably
no more than a few dozen or a hundred.
2. 8GB is too small to be useful. 16 GB min.
3. If you need large numbers of machines, organize them as separate
clusters.
4. Figure 100 to 200 million documents per Solr server. E.g., 1 billion
documents would be 5 to 10 servers. Depending on document size and query
latency requirements, the practical limit could be higher or lower.


-- Jack Krupansky

On Sat, Aug 15, 2015 at 6:22 AM, yura last 
wrote:

> Hi All, I am testing a SolrCloud with many collections. The version is
> 5.2.1 and I installed 3 machines – each one with 4 cores and 8 GB RAM. Then
> I created collections with 3 shards and a replication factor of 2. It gives
> me 2 cores per collection on each machine. I reached almost 900 collections
> and then the cluster got stuck and I couldn't revive it. As I understand,
> Solr has issues with many collections (thousands). If I use many more
> machines, will that give me the ability to create tens of thousands of
> collections, or is the limit a couple of thousand? I want to build a
> cluster that will handle 10 billion documents per day (currently I have
> 1 billion) and keep the data for 90 days. I want to support 2000 customers,
> so I would like to split them into collections and also split by day
> (180,000 collections). If I create big collections I will have performance
> issues with queries, and most of the queries are for a specific customer.
> (I also have cross-customer queries.) How can I build an appropriate
> design? Thanks a lot, Yuri


Re: Solr relevant results

2015-08-15 Thread Alexandre Rafalovitch
If I understood your question correctly, that's what I am suggesting you try.

Notice that, as I mentioned earlier, this ignores all the complexity
of similarity, ranking, etc. that Solr offers. But it does not seem you
need it in your particular case, as you are just searching for the
presence/absence of terms.
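
Concretely, the indexing-time mapping might look like this sketch, using the ordering Brian lists further down the thread (field and value names are illustrative):

```python
# Sort rank derived from the ordering rules in this thread:
# Code C, then B, S, N, then Prefer=true with no code, Prefer=false with
# no code, and finally neither. Lower rank sorts first.
CODE_RANK = {"C": 0, "B": 1, "S": 2, "N": 3}

def custom_sort_rank(code, prefer):
    """Computed once per document at indexing time and stored in a sortable
    field; queries then just add sort=custom_sort_rank asc."""
    if code in CODE_RANK:
        return CODE_RANK[code]
    if prefer is True:
        return 4
    if prefer is False:
        return 5
    return 6
```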

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 15 August 2015 at 00:08, Brian Narsi  wrote:
> I see, so basically I add another field to the schema "CustomScore" and
> assign score to it based on values in other fields. And then just order by
> it.
>
> Is that right?
>
> On Fri, Aug 14, 2015 at 10:58 PM, Alexandre Rafalovitch 
> wrote:
>
>> Clarification: In the client that is doing the _indexing_/sending data
>> to Solr. Not the one doing the querying.
>>
>> And custom URP if you can't change the client and need to inject that
>> extra code on the Solr side.
>>
>> Sorry, for extra emails.
>>
>> Regards,
>>Alex.
>> 
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 14 August 2015 at 23:57, Alexandre Rafalovitch 
>> wrote:
>> > My suggestion was to do the mapping in the client, before you hit
>> > Solr. Or in a custom UpdateRequestProcessor. Because only your client
>> > app knows the order you want those things in. It certainly was not any
>> > kind of alphabetical.
>> >
>> > Then, you just sort by that field and Solr would not care about the
>> > complicated rules. Faster that way too, as the mapping only happens
>> > once when the document is indexed.
>> >
>> > Regards,
>> >Alex.
>> > 
>> > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> > http://www.solr-start.com/
>> >
>> >
>> > On 14 August 2015 at 23:52, Brian Narsi  wrote:
>> >> Search term is searched in Description.
>> >>
>> >> The search string is relevant in the context that the Description of
>> >> returned records must contain the search string. But when several
>> records
>> >> Description contains the search string then they must be ordered
>> according
>> >> to the values in Code and Prefer.
>> >>
>> >> I understand what you are saying about mapping Code to numbers. But can
>> you
>> >> help with some examples of actual solr queries on how to do this?
>> >>
>> >> Thanks
>> >>
>> >> On Fri, Aug 14, 2015 at 2:46 PM, Alexandre Rafalovitch <
>> arafa...@gmail.com>
>> >> wrote:
>> >>
>> >>> What's the search string? Or is the search string irrelevant and
>> >>> that's just your compulsory ordering.
>> >>> Assuming anything that searches has to be returned and has to fit into
>> >>> that order, I would frankly just map your special codes all together
>> >>> to some sort of 'sort order' number.
>> >>>
>> >>> So, Code=>C = 4000, Code =>B=3000. Prefer=true=>100, Prefer=false=>0.
>> >>> Then, sum it up. Or some such.
>> >>>
>> >>> Remember that fuzzy search will match even things with low probability
>> >>> so a fixed sort will bring low-probability matches on top. So, either
>> >>> hard non-fuzzy searches or you need to look at different solutions,
>> >>> such as buckets and top-n items within those.
>> >>>
>> >>> Regards,
>> >>> Alex.
>> >>> 
>> >>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> >>> http://www.solr-start.com/
>> >>>
>> >>>
>> >>> On 14 August 2015 at 15:10, Brian Narsi  wrote:
>> >>> > In my documents there are several fields, but for example say there
>> are
>> >>> > three fields:
>> >>> >
>> >>> > Description - text  - this variable text
>> >>> > Code - string - always a single character
>> >>> > Prefer - boolean
>> >>> >
>> >>> > User searches on Description.
>> >>> >
>> >>> > When returning results I have to order results as following:
>> >>> >
>> >>> > Code = C
>> >>> > Code = B
>> >>> > Code = S
>> >>> > Code = N
>> >>> > Prefer = true and Code is NULL
>> >>> > Prefer = false and Code is NULL
>> >>> > Prefer is NULL and Code is NULL
>> >>> >
>> >>> > How can this be achieved?
>> >>> >
>> >>> > Thanks in advance!
>> >>>
>>


Big SolrCloud cluster with a lot of collections

2015-08-15 Thread yura last
Hi All, I am testing a SolrCloud with many collections. The version is 5.2.1
and I installed 3 machines – each one with 4 cores and 8 GB RAM. Then I created
collections with 3 shards and a replication factor of 2. That gives me 2 cores
per collection on each machine. I reached almost 900 collections and then the
cluster got stuck and I couldn't revive it. As I understand, Solr has issues
with many collections (thousands). If I use many more machines, will that give
me the ability to create tens of thousands of collections, or is the limit a
couple of thousand?

I want to build a cluster that will handle 10 billion documents per day
(currently I have 1 billion) and keep the data for 90 days. I want to support
2000 customers, so I would like to split them into collections and also split
by day (180,000 collections). If I create big collections I will have
performance issues with queries, and most of the queries are for a specific
customer. (I also have cross-customer queries.)

How can I build an appropriate design?

Thanks a lot,
Yuri