Join is a query operation - it has nothing to do with the number of values (fields and multivalued fields) in a Solr/Lucene document.
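
(For reference, a query-time join lives entirely in the query string. A
minimal sketch, assuming the user/webpage fields from your example, both
document types in one index, and a default core named "collection1":

http://localhost:8983/solr/collection1/select?q={!join from=user_id to=id}category:Technical

That runs category:Technical against the webpage documents, collects their
user_id values, and returns the user documents whose id matches.)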

Block "insert" isn't available yet anyway, so we don't have any clear assessments of its performance.

Generally, any kind of large block of data is not a great idea.

1. Break things down.
2. Keep things simple.
3. Join is not simple.
4. Only use non-simple features in careful moderation.

There is no reasonable shortcut to a robust data model. Shortcuts may seem enticing in the short run, but they will eat you alive in the long run.

-- Jack Krupansky

-----Original Message----- From: Marcelo Elias Del Valle
Sent: Wednesday, July 10, 2013 6:52 PM
To: solr-user@lucene.apache.org
Subject: Re: amount of values in a multi value field - is denormalization always the best option?

Jack,

    When you say "large number of values in a single document", do you
also mean a block in a block join? Is it exactly the same thing?
    In my case, I have just one insert and no updates. Even in this case,
do you think a large document or block would be a really bad idea? I am
more worried about search time.

Best regards,
Marcelo.


2013/7/10 Jack Krupansky <j...@basetechnology.com>

Simple answer: avoid a "large number of values in a single document". There
should be only a modest to moderate number of fields in a single document.

Is the data relatively static, or subject to frequent updates? Updating any
field of a document, even with an atomic update, requires Solr to read and
rewrite every field of that document. So, lots of smaller documents are best
for a frequent-update scenario.
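
For example, even this single-field atomic update (a sketch, assuming a
Solr 4.x core named "collection1", an integer "age" field, and all fields
stored, which atomic updates require):

curl "http://localhost:8983/solr/collection1/update?commit=true" \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1","age":{"set":28}}]'

causes Solr to fetch every stored field of document 1 and re-index the
whole document, not just the "age" field.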

Multivalued fields are great for storing a relatively small list of values.
You can add to the list easily, but under the hood Solr must read and
rewrite the full list, as well as the full document. And there is no way to
address or synchronize individual elements of a multivalued field.
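
Same caveat with a multivalued field: an atomic "add" appends one value,
but Solr still rewrites the whole list and document underneath. A sketch,
under the same assumptions as above, with a hypothetical multivalued
"webpage_urls" field:

curl "http://localhost:8983/solr/collection1/update?commit=true" \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1","webpage_urls":{"add":"http://stackoverflow.com"}}]'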

Joins are great... if used in moderation. Heavy use of joins is not a
great idea.

-- Jack Krupansky

-----Original Message----- From: Marcelo Elias Del Valle
Sent: Wednesday, July 10, 2013 5:37 PM
To: solr-user@lucene.apache.org
Subject: amount of values in a multi value field - is denormalization
always the best option?


Hello,

   I recently asked a question about Solr limitations and another about
joins. It turns out this question is about both at the same time.
   I am trying to figure out how to denormalize my data so that I need just
one document in my index instead of performing a join. I figure one way of
doing this is storing an entity as a multivalued field, instead of storing
different fields.
   Let me give an example. Consider the entities:

User:
   id: 1
   name: Joan of Arc
   age: 27

Webpage:
   id: 1
   url: http://wiki.apache.org/solr/Join
   category: Technical
   user_id: 1

   id: 2
   url: http://stackoverflow.com
   category: Technical
   user_id: 1

   Instead of creating one document for the user, one for webpage 1, and
one for webpage 2 (one parent and two children), I could store the webpages
in multivalued fields on the user document, as follows:

User:
   id: 1
   name: Joan of Arc
   age: 27
webpage1: ["id:1", "url: http://wiki.apache.org/solr/**Join<http://wiki.apache.org/solr/Join>",
"category:
Technical"]
   webpage2: ["id:2", "url: http://stackoverflow.com";, "category:
Technical"]

   It would probably perform better than the join, right? However, it made
me think about Solr limitations again. What if I have 200 million webpages
(200 million fields) per user? Or imagine a case where I could have 200
million values in a single field, as when I need to index every HTML DOM
element (div, a, etc.) of each web page a user visited.
   I mean, if I need to do this query as a business requirement no matter
what, then although denormalizing could be better than using query-time
joins, I wonder whether distributing the data in this single document
across the cluster wouldn't give me better performance. And that is
something I won't get with block joins or multivalued fields...
   I guess there is probably no right answer to this question (at least
not a known one), and I know I should create a POC to check how each
performs... But do you think such a large number of values in a single
document could make denormalization impossible in an extreme case like
this? Would you agree if I said denormalization is not always the right
option?

Best regards,
--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr




--
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
