Dylan, thanks for the detailed explanation. Very helpful! Josh, appreciate the hint about combiners.
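To make sure I follow the double-insert idea, here is roughly what I had in mind for the ingest side in plain Java (untested sketch; the table names, the "degree" column, the empty column family, and the zero-padded Load values are just my guesses from your printout). I also put a second sketch inline below, right after your query-optimization explanation.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;
import org.apache.hadoop.io.Text;

public class D4mIngestSketch {
  static final Value ONE = new Value("1".getBytes());

  public static void main(String[] args) throws Exception {
    // Assumes the tables "readings", "readingsT" and "readingsDeg" already exist.
    Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
        .getConnector("user", new PasswordToken("secret"));

    // One-time setup: a SummingCombiner on the degree table adds up the
    // repeated "1" values written under the same row and column.
    IteratorSetting deg = new IteratorSetting(7, "sum", SummingCombiner.class);
    SummingCombiner.setEncodingType(deg, LongCombiner.Type.STRING);
    SummingCombiner.setCombineAllColumns(deg, true);
    conn.tableOperations().attachIterator("readingsDeg", deg);

    BatchWriter bwEdge  = conn.createBatchWriter("readings",    new BatchWriterConfig());
    BatchWriter bwEdgeT = conn.createBatchWriter("readingsT",   new BatchWriterConfig());
    BatchWriter bwDeg   = conn.createBatchWriter("readingsDeg", new BatchWriterConfig());

    // One reading: Machine=neptune, Pool=west, Load=5, ReadingTimestamp=1388191975000
    String row = "1388191975000";
    String[] cols = { "Machine|neptune", "Pool|west", "Load|005" };

    for (String col : cols) {
      Mutation m = new Mutation(new Text(row));        // exploded table: row = timestamp
      m.put(new Text(""), new Text(col), ONE);         // column = "field|value", value = "1"
      bwEdge.addMutation(m);

      Mutation mt = new Mutation(new Text(col));       // transpose table: row and column swapped
      mt.put(new Text(""), new Text(row), ONE);
      bwEdgeT.addMutation(mt);

      Mutation md = new Mutation(new Text(col));       // degree table: one "1" per occurrence,
      md.put(new Text(""), new Text("degree"), ONE);   // summed into a count by the combiner
      bwDeg.addMutation(md);
    }

    bwEdge.close();
    bwEdgeT.close();
    bwDeg.close();
  }
}

Does that look like what you meant, or is a combiner-based approach for the transpose worth the extra complexity?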
On Fri, Dec 27, 2013 at 9:53 PM, Dylan Hutchison <[email protected]> wrote:

> Hi Arshak,
>
> Perhaps this might help, though I will defer to Jeremy on the finer
> points. Keep in mind that writing methods that interface with the D4M
> schema will duplicate some of the Matlab D4M code. Jeremy, is the Java
> source for the D4M JARs available?
>
> First let's weigh potential choices for row IDs:
>
>    1. row ID = ReadingTimestamp. The advantage is that you can easily
>    query for continuous periods of time via the row, but the disadvantage is
>    that all of your writes will go to the end of the table, i.e. to the last
>    tablet server, assuming that your data is streaming in live. If your data
>    is not streaming in live but available offline, you can create table
>    splits to distribute the timestamps evenly across your tablet servers.
>    Scanning may or may not bottleneck on one tablet server depending on your
>    use case.
>    2. row ID = ReadingTimestamp *reversed*. For example, use
>    0005791918831 instead of 1388191975000. Now writes will naturally
>    distribute across all your tablet servers. However, you lose the ability
>    to easily query continuous periods of time.
>    3. row ID = a new unique ID assigned to each row in your table,
>    leaving the ReadingTimestamp as another column. This is a good all-around
>    solution for the regular table, but what about the transpose table?
>    Because it switches the rows and columns, we still have the disadvantage
>    of option #1 in the transpose table.
>
> Since we don't have a clear winner, I wrote a quick implementation of
> option #1. Here's the original table, followed by the exploded table, its
> transpose and the degree table.
>
> Original
>                Load, Machine, Pool,
> 1388191975000, 005,  neptune, west,
> 1388191975010, 009,  neptune, west,
> 1388191975090, 013,  pluto,   east,
>
> Exploded
>                Load|005, Load|009, Load|013, Machine|neptune, Machine|pluto, Pool|east, Pool|west,
> 1388191975000, 1,                            1,                                         1,
> 1388191975010,           1,                  1,                                         1,
> 1388191975090,                     1,                         1,             1,
>
> Transpose
>                  1388191975000, 1388191975010, 1388191975090,
> Load|005,        1,
> Load|009,                       1,
> Load|013,                                      1,
> Machine|neptune, 1,             1,
> Machine|pluto,                                 1,
> Pool|east,                                     1,
> Pool|west,       1,             1,
>
> Degree
>                  degree,
> Load|005,        1,
> Load|009,        1,
> Load|013,        1,
> Machine|neptune, 2,
> Machine|pluto,   1,
> Pool|east,       1,
> Pool|west,       2,
>
> *1. Should the transpose table be built as part of ingest code or as an
> accumulo combiner?*
> I recommend ingest code for much greater simplicity, though it may be
> possible to build a combiner that automatically ingests to a second table.
> When inserting (row,col,val) triples, do another insert to the transpose
> table with (col,row,val). Use summing combiners to create the degree
> tables.
>
> *2. What does the degree table do in this example? The paper mentions
> it's useful for query optimization. How?*
> Suppose you're interested in querying the row timestamps that log a
> *load > 5* (query A) and come from the *east pool* (query B). (You want
> A & B.) *I assume that the number of returned rows matching A & B is far
> less than the total number of rows.* (If they are roughly equal, you must
> query all the rows anyway and a database won't help you.) There are a few
> options:
>
>    1. Scan through all the rows and return only those that match A & B.
>    Sure, this will do the job, but what if you have 100M rows? We can do
>    better with our transpose table.
>    2. Scan through the transpose table for the timestamps that match
>    query A. In our example, this returns 2 rows with loads > 5. Then retain
>    only the rows that also match query B and come from the east pool.
>    3. Scan through the transpose table for the timestamps that match
>    query B. In our example, this returns 1 row that comes from the east
>    pool. Then retain only the rows that also match query A and have
>    loads > 5.
>
> How can we tell whether query strategy #2 or #3 is better? Choose the one
> that returns the fewest rows first! We determine that by consulting the
> degree table twice (two quick lookups for counts) and verifying that there
> are 2 rows for query A and 1 row for query B, thus making strategy #3 the
> better choice. Of course, the difference is 1 row in our small example,
> but it could be much larger on a real database.
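A quick sketch inline here to check that I'm reading this right: the two degree lookups and the "fewest rows first" choice would be something like the following (again untested; the table names match my ingest sketch above, and I'm simplifying query A to a single Load| column, whereas a real "load > 5" would be a range of Load| columns).

import java.util.Map.Entry;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class DegreeLookupSketch {

  // One quick lookup: the count stored for a "field|value" row in the degree table.
  static long degree(Connector conn, String col) throws Exception {
    Scanner s = conn.createScanner("readingsDeg", new Authorizations());
    s.setRange(new Range(col));
    long d = 0;
    for (Entry<Key, Value> e : s)
      d = Long.parseLong(e.getValue().toString());  // the SummingCombiner already added up the 1s
    return d;
  }

  // Pick the side of "A & B" with the smaller degree and scan it first in the transpose table.
  static void intersect(Connector conn, String colA, String colB) throws Exception {
    String first = degree(conn, colA) <= degree(conn, colB) ? colA : colB;

    Scanner s = conn.createScanner("readingsT", new Authorizations());
    s.setRange(new Range(first));                   // e.g. new Range("Pool|east")
    for (Entry<Key, Value> e : s) {
      String timestamp = e.getKey().getColumnQualifier().toString();
      // ... look up each candidate timestamp in the main table and check the other condition ...
      System.out.println(timestamp);
    }
  }
}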
> *3. Does D4M accommodate "repurposing" the row_id to a partition key?
> The wikisearch example shows how the partition id is important for
> parallel scans of the index. But since Accumulo is a row store, how can
> you do fast lookups by row if you've used the row_id as a partition key?*
> Table design is tough when optimizing for the general case. Hopefully the
> above tradeoff analysis helps lay out the different options.
>
> I attached the CSV file and a few lines of Matlab code to recreate the
> tables. If you don't mind reading a little Matlab, take a look at the
> tutorial here:
>
> https://github.com/denine99/d4mBB
>
> I work out a very similar query optimization on baseball data in the
> section "%% Find how many players weigh < 200 lb. and bat with left hand
> or both hands".
>
> Cheers,
> Dylan Hutchison
>
>
> On Fri, Dec 27, 2013 at 8:01 PM, Arshak Navruzyan <[email protected]> wrote:
>
>> Jeremy,
>>
>> Wow, didn't expect to get help from the author :)
>>
>> How about something simple like this:
>>
>> Machine   Pool   Load   ReadingTimestamp
>> neptune   west   5      1388191975000
>> neptune   west   9      1388191975010
>> pluto     east   13     1388191975090
>>
>> These are the areas I am unclear on:
>>
>> 1. Should the transpose table be built as part of ingest code or as an
>> accumulo combiner?
>> 2. What does the degree table do in this example? The paper mentions
>> it's useful for query optimization. How?
>> 3. Does D4M accommodate "repurposing" the row_id to a partition key?
>> The wikisearch example shows how the partition id is important for
>> parallel scans of the index. But since Accumulo is a row store, how can
>> you do fast lookups by row if you've used the row_id as a partition key?
>>
>> Thank you,
>>
>> Arshak
>>
>>
>> On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <[email protected]> wrote:
>>
>>> Hi Arshak,
>>>   Maybe you can send a few (~3) records of data that you are familiar with
>>> and we can walk you through how the D4M schema would be applied to those
>>> records.
>>>
>>> Regards. -Jeremy
>>>
>>> On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>>> > Hello,
>>> > I am trying to get my head around Accumulo schema designs. I went
>>> > through a lot of trouble to get the wikisearch example running, but
>>> > since the data is in protobuf lists, it's not that illustrative (for a
>>> > newbie).
>>> > Would love to find another example that is a little simpler to
>>> > understand. In particular I am interested in java/scala code that
>>> > mimics the D4M schema design (not a Matlab guy).
>>> > Thanks,
>>> > Arshak
>>
>
>
> --
> www.cs.stevens.edu/~dhutchis
