Jeremy,

Thanks for the detailed explanation. Just a couple of final questions:

1. What's your advice on the transpose table: should the indexed term be
repeated (one entry per matching row id), or should all matching row ids
from Tedge be packed into a single TedgeTranspose row (using protobuf, for
example)? What's the performance implication of each approach? In the
paper you mentioned that if there are only a few values they should just
be stored together. Was there a cut-off point in your testing? (See the
P.S. below for a sketch of the two layouts I mean.)

2. You mentioned that the degrees should be calculated beforehand for high
ingest rates. Doesn't this change Accumulo from being a true database to
being more of an index? If changes to the data cause the degree table to
get out of sync, it sounds like changes have to be applied elsewhere first
and Accumulo has to be reloaded periodically. Or perhaps letting the
degree table get out of sync is OK, since it's just an assist...

Thanks,

Arshak
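P.S. To make question 1 concrete, here is roughly how I picture the two
layouts in Java (illustrative only and untested; the empty column family
follows your schema, and the comma-joined value in option B is just a
stand-in for whatever packed encoding would actually be used):

    import org.apache.accumulo.core.data.Mutation;

    // Option A: repeat the indexed term, one entry per matching row id
    Mutation a = new Mutation("Machine|neptune");
    a.put("", "0005791918831-neptune", "1");
    a.put("", "0105791918831-neptune", "1");

    // Option B: a single entry whose value packs all matching row ids
    // (e.g. a protobuf-encoded list, as in the wikisearch example)
    Mutation b = new Mutation("Machine|neptune");
    b.put("", "rowids", "0005791918831-neptune,0105791918831-neptune");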
On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <
[email protected]> wrote:

> Hi Arshak,
>   Here is how you might do it. We implement everything with batch writers
> and batch scanners. Note: if you are doing high ingest rates, the degree
> table can be tricky and usually requires pre-summing prior to ingestion
> to reduce the pressure on the accumulator inside of Accumulo. Feel free
> to ask further questions, as I would imagine there are details that still
> wouldn't be clear. In particular, why we do it this way.
>
> Regards. -Jeremy
>
> Original data:
>
> Machine,Pool,Load,ReadingTimestamp
> neptune,west,5,1388191975000
> neptune,west,9,1388191975010
> pluto,east,13,1388191975090
>
>
> Tedge table:
> rowKey,columnQualifier,value
>
> 0005791918831-neptune,Machine|neptune,1
> 0005791918831-neptune,Pool|west,1
> 0005791918831-neptune,Load|5,1
> 0005791918831-neptune,ReadingTimestamp|1388191975000,1
> 0105791918831-neptune,Machine|neptune,1
> 0105791918831-neptune,Pool|west,1
> 0105791918831-neptune,Load|9,1
> 0105791918831-neptune,ReadingTimestamp|1388191975010,1
> 0905791918831-pluto,Machine|pluto,1
> 0905791918831-pluto,Pool|east,1
> 0905791918831-pluto,Load|13,1
> 0905791918831-pluto,ReadingTimestamp|1388191975090,1
>
>
> TedgeTranspose table:
> rowKey,columnQualifier,value
>
> Machine|neptune,0005791918831-neptune,1
> Pool|west,0005791918831-neptune,1
> Load|5,0005791918831-neptune,1
> ReadingTimestamp|1388191975000,0005791918831-neptune,1
> Machine|neptune,0105791918831-neptune,1
> Pool|west,0105791918831-neptune,1
> Load|9,0105791918831-neptune,1
> ReadingTimestamp|1388191975010,0105791918831-neptune,1
> Machine|pluto,0905791918831-pluto,1
> Pool|east,0905791918831-pluto,1
> Load|13,0905791918831-pluto,1
> ReadingTimestamp|1388191975090,0905791918831-pluto,1
>
>
> TedgeDegree table:
> rowKey,columnQualifier,value
>
> Machine|neptune,Degree,2
> Pool|west,Degree,2
> Load|5,Degree,1
> ReadingTimestamp|1388191975000,Degree,1
> Load|9,Degree,1
> ReadingTimestamp|1388191975010,Degree,1
> Machine|pluto,Degree,1
> Pool|east,Degree,1
> Load|13,Degree,1
> ReadingTimestamp|1388191975090,Degree,1
>
>
> TedgeText table:
> rowKey,columnQualifier,value
>
> 0005791918831-neptune,Text,< ... raw text of original log ...>
> 0105791918831-neptune,Text,< ... raw text of original log ...>
> 0905791918831-pluto,Text,< ... raw text of original log ...>
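>
> To make the mechanics concrete, here is a rough, untested Java sketch of
> ingesting one reading into the four tables with batch writers; the method
> and variable names are just illustrative. Note the row key is the reading
> timestamp with its digits reversed, plus the machine name (e.g.
> "0005791918831-neptune" from 1388191975000), so sequential timestamps
> spread across tablets instead of hot-spotting one server.
>
>     import org.apache.accumulo.core.client.BatchWriter;
>     import org.apache.accumulo.core.client.BatchWriterConfig;
>     import org.apache.accumulo.core.client.Connector;
>     import org.apache.accumulo.core.data.Mutation;
>
>     void ingestReading(Connector conn, String rowKey, String[] cols,
>                        String[] vals, String rawText) throws Exception {
>         BatchWriterConfig cfg = new BatchWriterConfig();
>         BatchWriter tedge  = conn.createBatchWriter("Tedge", cfg);
>         BatchWriter tTrans = conn.createBatchWriter("TedgeTranspose", cfg);
>         BatchWriter tDeg   = conn.createBatchWriter("TedgeDegree", cfg);
>         BatchWriter tText  = conn.createBatchWriter("TedgeText", cfg);
>
>         Mutation edge = new Mutation(rowKey);
>         for (int i = 0; i < cols.length; i++) {
>             String term = cols[i] + "|" + vals[i];  // e.g. "Machine|neptune"
>
>             edge.put("", term, "1");                // Tedge: row -> term
>
>             Mutation trans = new Mutation(term);    // TedgeTranspose: term -> row
>             trans.put("", rowKey, "1");
>             tTrans.addMutation(trans);
>
>             Mutation deg = new Mutation(term);      // TedgeDegree: increments
>             deg.put("", "Degree", "1");             // summed server-side
>             tDeg.addMutation(deg);
>         }
>         tedge.addMutation(edge);
>
>         Mutation text = new Mutation(rowKey);       // TedgeText: raw record
>         text.put("", "Text", rawText);
>         tText.addMutation(text);
>
>         tedge.close(); tTrans.close(); tDeg.close(); tText.close();
>     }
>
> (In practice you would create the writers once and reuse them across many
> readings rather than opening and closing them per record.)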
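>
> The "accumulator" on TedgeDegree could be, for example, the stock
> SummingCombiner, with the client pre-summing each batch so a term that
> appears many times produces one increment instead of many. A sketch of
> both pieces (again untested; conn and tDeg are from the sketch above, and
> termsInBatch is hypothetical):
>
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.accumulo.core.client.IteratorSetting;
>     import org.apache.accumulo.core.iterators.LongCombiner;
>     import org.apache.accumulo.core.iterators.user.SummingCombiner;
>
>     // One-time setup: attach a SummingCombiner to TedgeDegree so the
>     // "1" increments are summed server-side. STRING encoding matches
>     // the human-readable counts shown in the table above.
>     IteratorSetting is = new IteratorSetting(10, "degSum", SummingCombiner.class);
>     SummingCombiner.setEncodingType(is, LongCombiner.Type.STRING);
>     SummingCombiner.setCombineAllColumns(is, true);
>     conn.tableOperations().attachIterator("TedgeDegree", is);
>
>     // Per batch: pre-sum counts client-side, then write one mutation
>     // per distinct term, reducing pressure on the combiner.
>     Map<String, Long> counts = new HashMap<>();
>     for (String term : termsInBatch)
>         counts.merge(term, 1L, Long::sum);
>     for (Map.Entry<String, Long> e : counts.entrySet()) {
>         Mutation m = new Mutation(e.getKey());
>         m.put("", "Degree", Long.toString(e.getValue()));
>         tDeg.addMutation(m);
>     }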
>
> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <[email protected]> wrote:
>
> > Jeremy,
> >
> > Wow, didn't expect to get help from the author :)
> >
> > How about something simple like this:
> >
> > Machine   Pool   Load   ReadingTimestamp
> > neptune   west   5      1388191975000
> > neptune   west   9      1388191975010
> > pluto     east   13     1388191975090
> >
> > These are the areas I am unclear on:
> >
> > 1. Should the transpose table be built as part of the ingest code or as
> > an Accumulo combiner?
> > 2. What does the degree table do in this example? The paper mentions
> > it's useful for query optimization. How?
> > 3. Does D4M accommodate "repurposing" the row_id as a partition key?
> > The wikisearch example shows how the partition id is important for
> > parallel scans of the index. But since Accumulo is a row store, how can
> > you do fast lookups by row if you've used the row_id as a partition key?
> >
> > Thank you,
> >
> > Arshak
> >
> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <[email protected]> wrote:
> > Hi Arshak,
> >   Maybe you can send a few (~3) records of data that you are familiar
> > with and we can walk you through how the D4M schema would be applied to
> > those records.
> >
> > Regards. -Jeremy
> >
> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
> > > Hello,
> > >   I am trying to get my head around Accumulo schema designs. I went
> > > through a lot of trouble to get the wikisearch example running, but
> > > since the data is in protobuf lists, it's not that illustrative (for
> > > a newbie).
> > >   Would love to find another example that is a little simpler to
> > > understand. In particular I am interested in java/scala code that
> > > mimics the D4M schema design (not a Matlab guy).
> > > Thanks,
> > > Arshak
