Re: amount of values in a multi value field - is denormalization always the best option?

Jack Krupansky Thu, 11 Jul 2013 10:35:12 -0700

Again, generally, if the number of values is relatively modest and you don't
need to discriminate (tell which one matches on a search) and you don't edit
the list, a multivalued field makes perfect sense, but if any of those
requirements is not true, then you need to represent the items as discrete
Solr documents.


But, it does all depend on your particular data and particular requirements.
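
As a rough illustration of the "discrete Solr documents" option, here is a minimal Python sketch that flattens the Url1 example quoted below into one small document per distinct (field, user) pair. The output field names (id, url, role, userid, count) and the composite key format are assumptions made up for this sketch, not anyone's actual schema, and the language attribute is left out for brevity.

```python
from collections import Counter

def flatten(url, attributes):
    """Turn (field, value) pairs for one URL into one small document per distinct pair."""
    counts = Counter(attributes)  # repeated pairs collapse into a count
    docs = []
    for (role, userid), count in sorted(counts.items()):
        docs.append({
            "id": f"{url}|{role}|{userid}",  # hypothetical composite unique key
            "url": url,
            "role": role,      # "content" or "author" in the quoted example
            "userid": userid,
            "count": count,    # how many times the pair occurred for this URL
        })
    return docs

# The Url1 attributes from the quoted message (userid1 appears three times).
attributes = [
    ("content", "userid1"), ("content", "userid1"), ("content", "userid1"),
    ("content", "userid2"), ("content", "userid3"), ("author", "userid4"),
]
docs = flatten("Url1", attributes)
```

With this shape, a search on userid returns one hit per matching URL, each hit says which field matched, and any single entry can be updated or deleted without touching the rest.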

-- Jack Krupansky

-----Original Message----- From: Flavio Pompermaier
Sent: Thursday, July 11, 2013 7:50 AM
To: solr-user@lucene.apache.org
Subject: Re: amount of values in a multi value field - is denormalization
always the best option?

I also have a similar scenario, where fundamentally I have to retrieve all
urls where a userid has been found.
So, in my schema, I use the url as the (string) key, with a (possibly huge)
list of attributes automatically mapped to strings.
For example:

Url1 (key):
- language: en
- content:userid1
- content:userid1
- content:userid1 (i.e. 3 times actually for user 1)
- content:userid2
- content:userid3
- author:userid4

and so on and so forth.
So, if I understood correctly, you're saying that this is a bad design? How
should I fix my schema in your opinion in that case?

Best,
Flavio

On Wed, Jul 10, 2013 at 11:53 PM, Jack Krupansky
<j...@basetechnology.com> wrote:

Simple answer: avoid "large number of values in a single document". There
should only be a modest to moderate number of fields in a single document.

Is the data relatively static, or subject to frequent updates? To update
any field of a single document, even with atomic update, requires Solr to
read and rewrite every field of the document. So, lots of smaller documents
are best for a frequent update scenario.

Multivalued fields are great for storing a relatively small list of
values. You can add to the list easily, but under the hood, Solr must read
and rewrite the full list as well as the full document. And, there is no
way to address or synchronize individual elements of multivalued fields.
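
For illustration, a minimal Python sketch of the request body for an atomic update that appends one value to a multivalued field, using the "add" modifier of Solr's JSON update format. The field name "webpages" and the document id are hypothetical, and atomic updates assume the document's fields are stored. The sketch only builds the JSON; it does not send it anywhere.

```python
import json

def atomic_add(doc_id, field, value):
    """Build a Solr atomic-update command that appends one value to a multivalued field."""
    return [{"id": doc_id, field: {"add": value}}]

# "webpages" is a hypothetical multivalued field name used only for this sketch.
payload = json.dumps(atomic_add("1", "webpages", "http://stackoverflow.com"))
```

The payload names a single value, but as described above Solr still has to read the stored document and rewrite all of it, including the full list.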

Joins are great... if used in moderation. Heavy use of joins is not a
great idea.
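
For reference, a query-time join over the User/Webpage example quoted below could be built roughly like this in Python. This is a sketch only: it assumes users and webpages live in the same index with non-colliding unique ids (the quoted example reuses id 1 for both), and the helper name join_query is made up for this sketch.

```python
from urllib.parse import urlencode

def join_query(from_field, to_field, inner_query):
    """Build the q parameter for Solr's query-time join parser."""
    return f"{{!join from={from_field} to={to_field}}}{inner_query}"

# Users who have at least one webpage in the Technical category.
q = join_query("user_id", "id", "category:Technical")
params = urlencode({"q": q, "wt": "json"})  # query string for a /select request
```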

-- Jack Krupansky

-----Original Message----- From: Marcelo Elias Del Valle
Sent: Wednesday, July 10, 2013 5:37 PM
To: solr-user@lucene.apache.org
Subject: amount of values in a multi value field - is denormalization
always the best option?


Hello,

   I recently asked a question about solr limitations and some others about
joins. It turns out that this question is about both at the same time.
   I am trying to figure out how to denormalize my data so I will need just 1
document in my index instead of performing a join. I figure one way of
doing this is storing an entity as a multivalued field, instead of storing
different fields.
   Let me give an example. Consider the entities:

User:
   id: 1
   name: Joan of Arc
   age: 27

Webpage:
   id: 1
   url: http://wiki.apache.org/solr/Join
   category: Technical
   user_id: 1

   id: 2
   url: http://stackoverflow.com
   category: Technical
   user_id: 1

   Instead of creating 1 document for user, 1 for webpage 1 and 1 for
webpage 2 (1 parent and 2 children) I could store webpages in a user
multivalued field, as follows:

User:
   id: 1
   name: Joan of Arc
   age: 27
   webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join",
"category: Technical"]
   webpage2: ["id:2", "url: http://stackoverflow.com",
"category: Technical"]
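
The mapping above can be sketched in a few lines of Python. This only illustrates the document shape; the function name denormalize and the "key:value" string encoding are assumptions for the sketch, not an existing Solr feature.

```python
def denormalize(user, webpages):
    """Fold a user's webpages into the user document, one multivalued field per page."""
    doc = dict(user)  # copy so the input user dict is not modified
    for n, page in enumerate(webpages, start=1):
        # Encode each page attribute as a "key:value" string; drop the foreign key.
        doc[f"webpage{n}"] = [f"{key}:{value}" for key, value in page.items()
                              if key != "user_id"]
    return doc

user = {"id": 1, "name": "Joan of Arc", "age": 27}
webpages = [
    {"id": 1, "url": "http://wiki.apache.org/solr/Join", "category": "Technical", "user_id": 1},
    {"id": 2, "url": "http://stackoverflow.com", "category": "Technical", "user_id": 1},
]
doc = denormalize(user, webpages)
```

Note that the number of fields in the resulting document grows with the number of webpages per user.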

   It would probably perform better than the join, right? However, it made
me think about solr limitations again. What if I have 200 million webpages
(200 million fields) per user? Or imagine a case where I could have 200
million values in a field, as in the case where I need to index every HTML
DOM element (div, a, etc.) for each web page the user visited.
   I mean, if I need to do the query and this is a business requirement no
matter what, although denormalizing could be better than using query time
joins, I wonder if distributing the data present in this single document
across the cluster wouldn't give me better performance. And this is
something I won't get with block joins or multivalued fields...
   I guess there is probably no right answer for this question (at least
not a known one), and I know I should create a POC to check how each
performs... But do you think such a large number of values in a single
document could make denormalization impossible in an extreme case like
this? Would you agree if I said denormalization is not always
the right option?

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
