That 1024 limit of the DataStax Enterprise packaging of Solr is going to be
relaxed in a coming release - you will be able to have more dynamic fields,
but... "going wild" has memory and performance implications anyway. That
limit is the number of populated fields in a single document - different
documents could use different fields, so the total number of fields across
all documents could be much higher than that soon-to-be-relaxed 1024 limit.
That limit was in place simply to try to protect users from running into
problematic scenarios in terms of memory and performance.
Standard Solr itself does NOT have that 1024 limit. Still... be careful when
playing with fire.
You are basically pursuing a "multi-tenant" design. Yes, schemaless mode for
Solr should work reasonably well, although it is a new feature in 4.4.
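For reference, the 4.4 schemaless example wires this up in solrconfig.xml
with a mutable managed schema plus an update processor chain that adds
unknown fields as they arrive. A simplified sketch (the exact processors
and type mappings in your release may differ, so treat this as an
illustration rather than a drop-in config):

<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">
  <!-- guess a type for each unknown field and add it to the managed schema -->
  <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
    <str name="defaultFieldType">text_general</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>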
Solr and Lucene can handle missing field values efficiently, so having lots
of unused fields in a document is actually okay.
So, if the idea is that each document will be modest in size (number of
fields), but with a potentially large number of fields across all documents,
that should be fine as well.
All of that said, multi-tenant/schemaless mode is uncharted territory, and
there is no slam-dunk solution that is guaranteed to work really well for
all apps in all environments - be prepared to do multiple Proof of Concept
implementations.
-- Jack Krupansky
-----Original Message-----
From: Marcelo Elias Del Valle
Sent: Monday, July 08, 2013 3:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr limitations
Jack,
Thanks a lot for your answers. I had heard at Cassandra Summit that Solr
can't support more than 1024 dynamic fields, and I might exceed that in my
case - that's why I asked this question. However, your answer was very
complete and made me think about a lot of things.
The most pressing point is the schemaless design vs. dynamic fields.
AFAIK, there is no difference in how a dynamic field or a fixed field is
stored, right? If this premise is wrong, I would like to know, as this
might affect me.
The reason why I use dynamic fields is: I don't know what I will
index. My application is a platform that runs on a cloud and I may have
thousands of customers, each one storing different kinds of fields (or
equal ones, in some cases), and my main worry right now is making sure my
architecture is flexible enough to adapt to my customers' needs. It's
imperative to me, however, that I am able to query across data from
different customers, so I should keep a single index.
The way my architecture is now, I store all the fields in Solr as
dynamic fields, but if for some reason I detect some fields need any
special feature, I can change my schema on the fly and add some specific
configuration for that field...
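Just to make this concrete, my schema today is essentially a handful of
catch-all suffix patterns, something along these lines (names are just
examples):

<dynamicField name="*_s"   type="string"       indexed="true" stored="true"/>
<dynamicField name="*_i"   type="int"          indexed="true" stored="true"/>
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>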
I wonder what your thoughts are about it... Do you think using a
schemaless configuration would be better in my case? Indeed, the only
reason my fields are dynamic is that I cannot predict very well what I
am going to index. When you say "a weak data model", I am not sure it
fits my case, as there is no way of having a well-defined model if I don't
know what kind of data I am going to index... Right?
Best regards,
Marcelo.
2013/7/8 Jack Krupansky <j...@basetechnology.com>
Other than the per-node/per-collection limit of 2 billion documents per
Lucene index, most of the limits of Solr are performance-based limits -
Solr can handle it, but the performance may not be acceptable. Dynamic
fields are a great example. Nothing prevents you from creating a document
with, say, 50,000 dynamic fields, but you are likely to find the
performance less than acceptable. Or facets. Sure, Solr will let you have
5,000 faceted fields, but the performance is likely to be... you get the
picture.
What is acceptable performance? That's for you to decide.
What will the performance of 5,000 dynamic fields or 500 faceted fields or
500 million documents on a node be? It all depends on your data, especially
the cardinality (unique values) of each individual field.
How can you determine the performance? Only one way: Proof of concept. You
need to do your own proof of concept implementation, with your own
representative data, with your own representative data model, with your own
representative hardware, with your own representative client software, with
your own representative user query load. That testing will give you all the
answers you need.
There are no magic answers. Don't believe any magic spreadsheet or
magic wizard. Flip a coin as to whether they will work for your situation.
Some simple, common sense limits:
1. No more than 50 to 100 million documents per node.
2. No more than 250 fields per document.
3. No more than 250K characters per document.
4. No more than 25 faceted fields.
5. No more than 32 nodes in your SolrCloud cluster.
6. Don't return more than 250 results on a query.
None of those is a hard limit, but don't go beyond them unless your Proof
of Concept testing proves that performance is acceptable for your
situation.
Start with a simple 4-node, 2-shard, 2-replica cluster for preliminary
tests and then scale as needed.
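For example, once the 4 nodes are up, a single Collections API call along
these lines (adjust host and names for your environment) gives you that
starting layout:

http://localhost:8983/solr/admin/collections?action=CREATE&name=poc&numShards=2&replicationFactor=2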
Dynamic and multivalued fields? Try to stay away from them - except for
the simplest cases, they are usually an indicator of a weak data model.
Sure, it's fine to store a relatively small number of values in a
multivalued field (say, dozens of values), but be aware that you can't
directly access individual values, you can't tell which was matched on a
query, and you can't coordinate values between multiple multivalued
fields.
Except for very simple cases, multivalued fields should be flattened into
multiple documents with a parent ID.
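For example, rather than one book document with a multivalued review field,
index one document per review pointing back to the book - a sketch in
Solr's XML update format (field names made up for illustration):

<add>
  <doc>
    <field name="id">book42-review-1</field>
    <field name="parent_id">book42</field>
    <field name="review_text">Great read.</field>
  </doc>
  <doc>
    <field name="id">book42-review-2</field>
    <field name="parent_id">book42</field>
    <field name="review_text">Too long.</field>
  </doc>
</add>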
Since you brought up the topic of dynamic fields, I am curious how you got
the impression that they were a good technique to use as a starting point.
They're fine for prototyping and hacking, and fine when used in moderation,
but not when used to excess. The whole point of Solr is searching and
searching is optimized within fields, not across fields, so having lots of
dynamic fields is counter to the primary strengths of Lucene and Solr.
And... schemas with lots of dynamic fields tend to be difficult to
maintain. For example, if you wanted to ask a support question here, one of
the first things we would want to know is what your schema looks like, but
with lots of dynamic fields it is not possible to have a simple discussion
of what your schema looks like.
Sure, there is something called "schemaless design" (and Solr supports
that in 4.4), but that's very different from heavy reliance on dynamic
fields in the traditional sense. Schemaless design is A-OK, but using
dynamic fields for "arrays" of data in a single document is a poor match
for the search features of Solr (e.g., Edismax searching across multiple
fields.)
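To illustrate the contrast: Edismax is designed to spread one query across
a small, known set of fields, e.g. (hypothetical field names):

q=solar+panels&defType=edismax&qf=title^2+description+comments

That idiom falls apart when the candidate fields are thousands of
unpredictable dynamic names.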
One other tidbit: Although Solr does not enforce naming conventions for
field names, and you can put special characters in them, there are plenty
of features in Solr, such as the common "fl" parameter, where field names
are expected to adhere to Java naming rules. When people start "going wild"
with dynamic fields, it is common that they start "going wild" with their
names as well, using spaces, colons, slashes, etc. that cannot be parsed in
the "fl" and "qf" parameters, for example. Please don't go there!
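To illustrate (hypothetical names): the first request below parses cleanly,
while the second is asking for trouble, since the comma-separated "fl" list
has no way to cope with embedded spaces, and a colon already means
rename-this-field in "fl":

fl=id,title,price_f
fl=id,customer 1,price:usd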
In short, put up a small cluster and start doing a Proof of Concept
implementation. Stay within my suggested guidelines and you should do okay.
-- Jack Krupansky
-----Original Message----- From: Marcelo Elias Del Valle
Sent: Monday, July 08, 2013 9:46 AM
To: solr-user@lucene.apache.org
Subject: Solr limitations
Hello everyone,
I am trying to search for information about possible Solr limitations I
should consider in my architecture - things like max number of dynamic
fields, max number of documents in SolrCloud, etc.
Does anyone know where I can find this info?
Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr