I would be reluctant to make generalizations.
On Sun, Dec 29, 2013 at 05:45:28PM -0500, Arshak Navruzyan wrote:
> Josh, I am still a little stuck on the idea of how this would work in a
> transactional app (aka a mixed workload of reads and writes).
>
> I definitely see the power of using a serialized structure in order to
> minimize the number of records, but what happens when rows get deleted
> out of the main table (or mutated)? In the bloated model I could see
> some referential integrity code zapping the index entries as well. In
> the serialized structure design it seems pretty complex to go and
> update every array that referenced that row.
>
> Is it fair to say that the D4M approach is a little better suited for
> transactional apps and the wikisearch approach is better for
> read-optimized index apps?
>
> On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <[email protected]> wrote:
> > Some context here in regards to the wikisearch:
> >
> > The point of the protocol buffers here (or any serialized structure
> > in the Value) is to reduce the ingest pressure and increase query
> > performance on the inverted index (or transpose table, if I follow
> > the D4M phrasing).
> >
> > This works well because most languages (especially English) follow a
> > Zipfian distribution: some terms appear very frequently while others
> > occur very infrequently. For common terms, we don't want to bloat our
> > index, nor spend time creating those index records (e.g. "the"). For
> > uncommon terms, we still want direct access to those infrequent words
> > (e.g. "supercalifragilisticexpialidocious").
> >
> > The ingest effect is also rather interesting when dealing with
> > Accumulo, as you're not just writing more data, but typically writing
> > data to most (if not all) tservers. Even the tokenization of a single
> > document is likely to create inserts to a majority of the tablets for
> > your inverted index. When dealing with high ingest rates (live *or*
> > bulk -- either way you still have to send data to these servers),
> > minimizing the number of records becomes important, as it may be a
> > bottleneck in your pipeline.
> >
> > The query implications are pretty straightforward: common terms don't
> > bloat the index in size nor affect uncommon-term lookups, and those
> > uncommon-term lookups remain specific to documents rather than a
> > range (shard) of documents.
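To make that common/uncommon split concrete, here is a minimal Java
sketch of a cardinality cutoff. This is not the wikisearch
implementation (which serializes the document lists with protocol
buffers inside Accumulo Values); the CUTOFF constant and the
comma-joined value format are illustrative stand-ins.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of a cardinality cutoff for an inverted-index entry.
    public class IndexEntrySketch {

        static final int CUTOFF = 20; // hypothetical threshold

        // Value to store under the index key (term, shardId).
        static String indexValue(String shardId, Set<String> docIds) {
            if (docIds.size() <= CUTOFF) {
                // Uncommon term: track the set of documents directly,
                // so a query can jump straight to the matching docs.
                return String.join(",", docIds);
            }
            // Common term (e.g. "the"): record only that the term
            // occurs in this shard; queries fall back to scanning the
            // shard, and the index stays small no matter how often
            // the term appears.
            return "SHARD:" + shardId;
        }

        public static void main(String[] args) {
            Set<String> rare = new HashSet<>();
            rare.add("doc7");
            rare.add("doc42");
            System.out.println(indexValue("s01", rare));   // doc7,doc42

            Set<String> common = new HashSet<>();
            for (int i = 0; i < 1000; i++) common.add("doc" + i);
            System.out.println(indexValue("s01", common)); // SHARD:s01
        }
    }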
> > On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
> > > Sorry, I mixed things up. It was in the wikisearch example:
> > > http://accumulo.apache.org/example/wikisearch.html
> > >
> > > "If the cardinality is small enough, it will track the set of
> > > documents by term directly."
> > >
> > > On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL
> > > <[email protected]> wrote:
> > > > Hi Arshak,
> > > >   See interspersed below.
> > > > Regards. -Jeremy
> > > >
> > > > On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan
> > > > <[email protected]> wrote:
> > > > > Jeremy,
> > > > >
> > > > > Thanks for the detailed explanation. Just a couple of final
> > > > > questions:
> > > > >
> > > > > 1. What's your advice on the transpose table as far as whether
> > > > > to repeat the indexed term (one per matching row id) or try to
> > > > > store all matching row ids from Tedge in a single row in
> > > > > TedgeTranspose (using protobuf, for example)? What's the
> > > > > performance implication of each approach? In the paper you
> > > > > mentioned that if there are only a few values they should just
> > > > > be stored together. Was there a cut-off point in your testing?
> > > >
> > > > Can you clarify? I am not sure what you're asking.
> > > >
> > > > > 2. You mentioned that the degrees should be calculated
> > > > > beforehand for high ingest rates. Doesn't this change Accumulo
> > > > > from being a true database to being more of an index? If
> > > > > changes to the data cause the degree table to get out of sync,
> > > > > it sounds like changes have to be applied elsewhere first and
> > > > > Accumulo has to be reloaded periodically. Or perhaps letting
> > > > > the degree table get out of sync is ok since it's just an
> > > > > assist...
> > > >
> > > > My point was a very narrow comment on optimization in very high
> > > > performance situations. I probably shouldn't have mentioned it.
> > > > If you ever have performance issues with your degree tables,
> > > > that would be the time to discuss it. You may never encounter
> > > > this issue.
> > > >
> > > > > Thanks,
> > > > >
> > > > > Arshak
> > > > >
> > > > > On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 -
> > > > > MITLL <[email protected]> wrote:
> > > > > > Hi Arshak,
> > > > > >   Here is how you might do it. We implement everything with
> > > > > > batch writers and batch scanners. Note: if you are doing
> > > > > > high ingest rates, the degree table can be tricky and
> > > > > > usually requires pre-summing prior to ingestion to reduce
> > > > > > the pressure on the accumulator inside of Accumulo. Feel
> > > > > > free to ask further questions, as I would imagine there are
> > > > > > details that still wouldn't be clear. In particular, why we
> > > > > > do it this way.
> > > > > >
> > > > > > Regards. -Jeremy
> > > > > >
> > > > > > Original data:
> > > > > >
> > > > > > Machine,Pool,Load,ReadingTimestamp
> > > > > > neptune,west,5,1388191975000
> > > > > > neptune,west,9,1388191975010
> > > > > > pluto,east,13,1388191975090
> > > > > >
> > > > > > Tedge table:
> > > > > > rowKey,columnQualifier,value
> > > > > >
> > > > > > 0005791918831-neptune,Machine|neptune,1
> > > > > > 0005791918831-neptune,Pool|west,1
> > > > > > 0005791918831-neptune,Load|5,1
> > > > > > 0005791918831-neptune,ReadingTimestamp|1388191975000,1
> > > > > > 0105791918831-neptune,Machine|neptune,1
> > > > > > 0105791918831-neptune,Pool|west,1
> > > > > > 0105791918831-neptune,Load|9,1
> > > > > > 0105791918831-neptune,ReadingTimestamp|1388191975010,1
> > > > > > 0905791918831-pluto,Machine|pluto,1
> > > > > > 0905791918831-pluto,Pool|east,1
> > > > > > 0905791918831-pluto,Load|13,1
> > > > > > 0905791918831-pluto,ReadingTimestamp|1388191975090,1
> > > > > >
> > > > > > TedgeTranspose table:
> > > > > > rowKey,columnQualifier,value
> > > > > >
> > > > > > Machine|neptune,0005791918831-neptune,1
> > > > > > Pool|west,0005791918831-neptune,1
> > > > > > Load|5,0005791918831-neptune,1
> > > > > > ReadingTimestamp|1388191975000,0005791918831-neptune,1
> > > > > > Machine|neptune,0105791918831-neptune,1
> > > > > > Pool|west,0105791918831-neptune,1
> > > > > > Load|9,0105791918831-neptune,1
> > > > > > ReadingTimestamp|1388191975010,0105791918831-neptune,1
> > > > > > Machine|pluto,0905791918831-pluto,1
> > > > > > Pool|east,0905791918831-pluto,1
> > > > > > Load|13,0905791918831-pluto,1
> > > > > > ReadingTimestamp|1388191975090,0905791918831-pluto,1
> > > > > >
> > > > > > TedgeDegree table:
> > > > > > rowKey,columnQualifier,value
> > > > > >
> > > > > > Machine|neptune,Degree,2
> > > > > > Pool|west,Degree,2
> > > > > > Load|5,Degree,1
> > > > > > ReadingTimestamp|1388191975000,Degree,1
> > > > > > Load|9,Degree,1
> > > > > > ReadingTimestamp|1388191975010,Degree,1
> > > > > > Machine|pluto,Degree,1
> > > > > > Pool|east,Degree,1
> > > > > > Load|13,Degree,1
> > > > > > ReadingTimestamp|1388191975090,Degree,1
> > > > > >
> > > > > > TedgeText table:
> > > > > > rowKey,columnQualifier,value
> > > > > >
> > > > > > 0005791918831-neptune,Text,<... raw text of original log ...>
> > > > > > 0105791918831-neptune,Text,<... raw text of original log ...>
> > > > > > 0905791918831-pluto,Text,<... raw text of original log ...>
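Since the original request was for Java rather than Matlab, here is a
rough sketch of the ingest side of the schema above, using the standard
Accumulo client API. The instance, ZooKeeper, user, and password values
are placeholders; pre-summing degrees into a local map per batch is one
way to implement Jeremy's note about reducing pressure on the
accumulator, and the SummingCombiner turns the written increments into
running counts server-side.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;
    import org.apache.hadoop.io.Text;

    // Writes one record (neptune,west,5,1388191975000) to Tedge,
    // TedgeTranspose, and TedgeDegree, mirroring the rows above.
    public class D4mIngestSketch {

        static final Text EMPTY_CF = new Text("");
        static final Value ONE = new Value("1".getBytes());

        public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
                    .getConnector("ingestUser", new PasswordToken("secret"));

            // One-time setup: a SummingCombiner on TedgeDegree turns
            // the increments written below into running counts.
            IteratorSetting sum =
                    new IteratorSetting(10, "degSum", SummingCombiner.class);
            LongCombiner.setEncodingType(sum, LongCombiner.Type.STRING);
            SummingCombiner.setCombineAllColumns(sum, true);
            conn.tableOperations().attachIterator("TedgeDegree", sum);

            BatchWriterConfig cfg = new BatchWriterConfig();
            BatchWriter edge = conn.createBatchWriter("Tedge", cfg);
            BatchWriter transpose = conn.createBatchWriter("TedgeTranspose", cfg);
            BatchWriter degree = conn.createBatchWriter("TedgeDegree", cfg);

            // Row key from the example above: digit-reversed timestamp
            // plus machine name.
            String rowKey = "0005791918831-neptune";
            String[] cols = { "Machine|neptune", "Pool|west", "Load|5",
                              "ReadingTimestamp|1388191975000" };

            Map<String, Long> degrees = new HashMap<String, Long>();
            Mutation edgeMut = new Mutation(new Text(rowKey));
            for (String col : cols) {
                edgeMut.put(EMPTY_CF, new Text(col), ONE);   // Tedge
                Mutation t = new Mutation(new Text(col));    // transpose
                t.put(EMPTY_CF, new Text(rowKey), ONE);
                transpose.addMutation(t);
                Long d = degrees.get(col);                   // pre-sum locally
                degrees.put(col, d == null ? 1L : d + 1L);
            }
            edge.addMutation(edgeMut);

            // Flush one pre-summed increment per column for the batch.
            for (Map.Entry<String, Long> e : degrees.entrySet()) {
                Mutation d = new Mutation(new Text(e.getKey()));
                d.put(EMPTY_CF, new Text("Degree"),
                      new Value(String.valueOf(e.getValue()).getBytes()));
                degree.addMutation(d);
            }

            edge.close();
            transpose.close();
            degree.close();
        }
    }

Writing all three tables from the same ingest client, rather than
deriving the transpose server-side, matches Jeremy's "we implement
everything with batch writers and batch scanners" description.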
> > > > > > On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan
> > > > > > <[email protected]> wrote:
> > > > > > > Jeremy,
> > > > > > >
> > > > > > > Wow, didn't expect to get help from the author :)
> > > > > > >
> > > > > > > How about something simple like this:
> > > > > > >
> > > > > > > Machine   Pool   Load   ReadingTimestamp
> > > > > > > neptune   west   5      1388191975000
> > > > > > > neptune   west   9      1388191975010
> > > > > > > pluto     east   13     1388191975090
> > > > > > >
> > > > > > > These are the areas I am unclear on:
> > > > > > >
> > > > > > > 1. Should the transpose table be built as part of the
> > > > > > > ingest code or as an Accumulo combiner?
> > > > > > > 2. What does the degree table do in this example? The
> > > > > > > paper mentions it's useful for query optimization. How?
> > > > > > > 3. Does D4M accommodate "repurposing" the row_id to a
> > > > > > > partition key? The wikisearch shows how the partition id
> > > > > > > is important for parallel scans of the index. But since
> > > > > > > Accumulo is a row store, how can you do fast lookups by
> > > > > > > row if you've used the row_id as a partition key?
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > Arshak
> > > > > > >
> > > > > > > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner
> > > > > > > <[email protected]> wrote:
> > > > > > > > Hi Arshak,
> > > > > > > >   Maybe you can send a few (~3) records of data that you
> > > > > > > > are familiar with and we can walk you through how the
> > > > > > > > D4M schema would be applied to those records.
> > > > > > > >
> > > > > > > > Regards. -Jeremy
> > > > > > > >
> > > > > > > > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak
> > > > > > > > Navruzyan wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am trying to get my head around Accumulo schema
> > > > > > > > > designs. I went through a lot of trouble to get the
> > > > > > > > > wikisearch example running, but since the data is in
> > > > > > > > > protobuf lists, it's not that illustrative (for a
> > > > > > > > > newbie). Would love to find another example that is a
> > > > > > > > > little simpler to understand. In particular I am
> > > > > > > > > interested in java/scala code that mimics the D4M
> > > > > > > > > schema design (not a Matlab guy).
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Arshak
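On Arshak's question 2 (how the degree table helps query optimization):
the degree lookup is what lets a query start from the most selective
term. A minimal sketch follows, assuming the tables above and the
standard Accumulo Scanner API; connection setup is elided, and
everything other than the table names is illustrative.

    import java.util.Map;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    // For an AND of two terms, look both up in TedgeDegree and drive
    // the query from the rarer one, so the TedgeTranspose scan
    // returns few candidate rows.
    public class DegreeQuerySketch {

        static long degree(Connector conn, String term) throws Exception {
            Scanner s = conn.createScanner("TedgeDegree", Authorizations.EMPTY);
            s.setRange(Range.exact(new Text(term)));
            long d = 0; // treat a missing entry as degree 0
            for (Map.Entry<Key, Value> e : s)
                d = Long.parseLong(e.getValue().toString());
            return d;
        }

        static void andQuery(Connector conn, String a, String b) throws Exception {
            // E.g. for Pool|west (degree 2) AND Load|9 (degree 1) in
            // the tables above, start from Load|9 and check Pool|west
            // per candidate row.
            String rare = degree(conn, a) <= degree(conn, b) ? a : b;
            String other = rare.equals(a) ? b : a;

            Scanner s = conn.createScanner("TedgeTranspose", Authorizations.EMPTY);
            s.setRange(Range.exact(new Text(rare)));
            for (Map.Entry<Key, Value> e : s) {
                // Each column qualifier in the transpose row is a
                // Tedge row key.
                String docRow = e.getKey().getColumnQualifier().toString();
                Scanner check = conn.createScanner("Tedge", Authorizations.EMPTY);
                check.setRange(Range.exact(new Text(docRow)));
                check.fetchColumn(new Text(""), new Text(other));
                if (check.iterator().hasNext())
                    System.out.println("match: " + docRow);
            }
        }
    }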
