Got it, thanks again Jeremy!
On Sun, Dec 29, 2013 at 9:12 AM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:

> FYI, we just insert all the triples into both Tedge and TedgeTranspose
> using separate batchwriters and let Accumulo figure out which ones belong
> in the same row. This has worked well for us.
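To make the two-BatchWriter ingest described above concrete, here is a minimal Java sketch. It assumes the Accumulo 1.5-style client API (BatchWriterConfig); the instance name, user, and credentials are placeholders, and the row key and qualifier are taken from the example later in the thread.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class DualWriterIngest {
    public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("instance", "zoo1:2181")
                .getConnector("user", new PasswordToken("secret"));

        // One BatchWriter per table; the tablet servers group entries that
        // share a row key, so the client just streams triples to both tables.
        BatchWriter edge = conn.createBatchWriter("Tedge", new BatchWriterConfig());
        BatchWriter transpose = conn.createBatchWriter("TedgeTranspose", new BatchWriterConfig());

        String row = "0005791918831-neptune";   // example row key from the thread
        String colQual = "Machine|neptune";     // example column|value qualifier
        Value one = new Value("1".getBytes());

        // Forward entry: row -> column
        Mutation m1 = new Mutation(new Text(row));
        m1.put(new Text(""), new Text(colQual), one);   // empty column family
        edge.addMutation(m1);

        // Transposed entry: column -> row (row and qualifier swapped)
        Mutation m2 = new Mutation(new Text(colQual));
        m2.put(new Text(""), new Text(row), one);
        transpose.addMutation(m2);

        edge.close();
        transpose.close();
    }
}

Because the transposed entry is just the same triple with row and qualifier swapped, the client never has to group anything itself; Accumulo merges whatever lands in the same row.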
> On Dec 29, 2013, at 11:57 AM, Arshak Navruzyan <[email protected]> wrote:
>
> Sorry, I mixed things up. It was in the wikisearch example:
>
> http://accumulo.apache.org/example/wikisearch.html
>
> "If the cardinality is small enough, it will track the set of documents by
> term directly."
>
> On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:
>
>> Hi Arshak,
>> See interspersed below.
>> Regards. -Jeremy
>>
>> On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <[email protected]> wrote:
>>
>> Jeremy,
>>
>> Thanks for the detailed explanation. Just a couple of final questions:
>>
>> 1. What's your advice on the transpose table: is it better to repeat the
>> indexed term (one entry per matching row id), or to store all matching row
>> ids from Tedge in a single row in TedgeTranspose (using protobuf, for
>> example)? What's the performance implication of each approach? In the
>> paper you mentioned that if there are only a few values they should just
>> be stored together. Was there a cut-off point in your testing?
>>
>> Can you clarify? I am not sure what you're asking.
>>
>> 2. You mentioned that the degrees should be calculated beforehand for
>> high ingest rates. Doesn't this change Accumulo from being a true database
>> to being more of an index? If changes to the data cause the degree table
>> to get out of sync, it sounds like changes have to be applied elsewhere
>> first and Accumulo has to be reloaded periodically. Or perhaps letting the
>> degree table get out of sync is OK since it's just an assist...
>>
>> My point was a very narrow comment on optimization in very high
>> performance situations. I probably shouldn't have mentioned it. If you
>> ever have performance issues with your degree tables, that would be the
>> time to discuss it. You may never encounter this issue.
>>
>> Thanks,
>>
>> Arshak
>>
>> On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:
>>
>>> Hi Arshak,
>>> Here is how you might do it. We implement everything with batch
>>> writers and batch scanners. Note: if you are doing high ingest rates, the
>>> degree table can be tricky and usually requires pre-summing prior to
>>> ingestion to reduce the pressure on the accumulator inside of Accumulo.
>>> Feel free to ask further questions, as I would imagine there are details
>>> that still wouldn't be clear. In particular, why we do it this way.
>>>
>>> Regards. -Jeremy
>>>
>>> Original data:
>>>
>>> Machine,Pool,Load,ReadingTimestamp
>>> neptune,west,5,1388191975000
>>> neptune,west,9,1388191975010
>>> pluto,east,13,1388191975090
>>>
>>> Tedge table:
>>> rowKey,columnQualifier,value
>>>
>>> 0005791918831-neptune,Machine|neptune,1
>>> 0005791918831-neptune,Pool|west,1
>>> 0005791918831-neptune,Load|5,1
>>> 0005791918831-neptune,ReadingTimestamp|1388191975000,1
>>> 0105791918831-neptune,Machine|neptune,1
>>> 0105791918831-neptune,Pool|west,1
>>> 0105791918831-neptune,Load|9,1
>>> 0105791918831-neptune,ReadingTimestamp|1388191975010,1
>>> 0905791918831-pluto,Machine|pluto,1
>>> 0905791918831-pluto,Pool|east,1
>>> 0905791918831-pluto,Load|13,1
>>> 0905791918831-pluto,ReadingTimestamp|1388191975090,1
>>>
>>> TedgeTranspose table:
>>> rowKey,columnQualifier,value
>>>
>>> Machine|neptune,0005791918831-neptune,1
>>> Pool|west,0005791918831-neptune,1
>>> Load|5,0005791918831-neptune,1
>>> ReadingTimestamp|1388191975000,0005791918831-neptune,1
>>> Machine|neptune,0105791918831-neptune,1
>>> Pool|west,0105791918831-neptune,1
>>> Load|9,0105791918831-neptune,1
>>> ReadingTimestamp|1388191975010,0105791918831-neptune,1
>>> Machine|pluto,0905791918831-pluto,1
>>> Pool|east,0905791918831-pluto,1
>>> Load|13,0905791918831-pluto,1
>>> ReadingTimestamp|1388191975090,0905791918831-pluto,1
>>>
>>> TedgeDegree table:
>>> rowKey,columnQualifier,value
>>>
>>> Machine|neptune,Degree,2
>>> Pool|west,Degree,2
>>> Load|5,Degree,1
>>> ReadingTimestamp|1388191975000,Degree,1
>>> Load|9,Degree,1
>>> ReadingTimestamp|1388191975010,Degree,1
>>> Machine|pluto,Degree,1
>>> Pool|east,Degree,1
>>> Load|13,Degree,1
>>> ReadingTimestamp|1388191975090,Degree,1
>>>
>>> TedgeText table:
>>> rowKey,columnQualifier,value
>>>
>>> 0005791918831-neptune,Text,< ... raw text of original log ...>
>>> 0105791918831-neptune,Text,< ... raw text of original log ...>
>>> 0905791918831-pluto,Text,< ... raw text of original log ...>
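Since the original request was for java/scala code that mimics this schema, here is a rough Java sketch of how one input record could be turned into the Tedge, TedgeTranspose, TedgeDegree, and TedgeText entries shown above. The row-key construction (the timestamp with its digits reversed, then "-machine") is inferred from the example keys, the empty column family matches the three-column layout above, and the writer setup is illustrative rather than tuned.

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class D4mIngestSketch {

    static final Text EMPTY_CF = new Text("");   // the example tables use no column family
    static final Value ONE = new Value("1".getBytes());

    // Writes one CSV record, e.g. "neptune,west,5,1388191975000",
    // into the four tables shown above.
    static void ingestRecord(Connector conn, String csvLine) throws Exception {
        String[] f = csvLine.split(",");
        String machine = f[0], pool = f[1], load = f[2], ts = f[3];

        // Row key as in the example: timestamp digits reversed, then "-machine"
        // (1388191975000 -> 0005791918831-neptune).
        String rowKey = new StringBuilder(ts).reverse() + "-" + machine;

        String[] colQuals = {
            "Machine|" + machine, "Pool|" + pool,
            "Load|" + load, "ReadingTimestamp|" + ts };

        BatchWriter edge = conn.createBatchWriter("Tedge", new BatchWriterConfig());
        BatchWriter transpose = conn.createBatchWriter("TedgeTranspose", new BatchWriterConfig());
        BatchWriter degree = conn.createBatchWriter("TedgeDegree", new BatchWriterConfig());
        BatchWriter text = conn.createBatchWriter("TedgeText", new BatchWriterConfig());

        Mutation edgeRow = new Mutation(new Text(rowKey));
        for (String cq : colQuals) {
            edgeRow.put(EMPTY_CF, new Text(cq), ONE);

            Mutation t = new Mutation(new Text(cq));    // transpose: column becomes the row
            t.put(EMPTY_CF, new Text(rowKey), ONE);
            transpose.addMutation(t);

            Mutation d = new Mutation(new Text(cq));    // degree: one count per column value
            d.put(EMPTY_CF, new Text("Degree"), ONE);   // summed server-side or pre-summed
            degree.addMutation(d);
        }
        edge.addMutation(edgeRow);

        Mutation raw = new Mutation(new Text(rowKey));  // raw record for reconstruction
        raw.put(EMPTY_CF, new Text("Text"), new Value(csvLine.getBytes()));
        text.addMutation(raw);

        edge.close(); transpose.close(); degree.close(); text.close();
    }
}

In practice the BatchWriters would be created once and reused across many records, and the per-entry Degree counts only add up correctly if they are summed, either server-side by a combiner on TedgeDegree or client-side by pre-summing, as discussed earlier in the thread.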
>>> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <[email protected]> wrote:
>>>
>>> > Jeremy,
>>> >
>>> > Wow, didn't expect to get help from the author :)
>>> >
>>> > How about something simple like this:
>>> >
>>> > Machine Pool Load ReadingTimestamp
>>> > neptune west 5 1388191975000
>>> > neptune west 9 1388191975010
>>> > pluto east 13 1388191975090
>>> >
>>> > These are the areas I am unclear on:
>>> >
>>> > 1. Should the transpose table be built as part of the ingest code or as
>>> > an Accumulo combiner?
>>> > 2. What does the degree table do in this example? The paper mentions
>>> > it's useful for query optimization. How?
>>> > 3. Does D4M accommodate "repurposing" the row_id as a partition key?
>>> > The wikisearch example shows how the partition id is important for
>>> > parallel scans of the index. But since Accumulo is a row store, how can
>>> > you do fast lookups by row if you've used the row_id as a partition key?
>>> >
>>> > Thank you,
>>> >
>>> > Arshak
>>> >
>>> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <[email protected]> wrote:
>>> > Hi Arshak,
>>> > Maybe you can send a few (~3) records of data that you are familiar with
>>> > and we can walk you through how the D4M schema would be applied to
>>> > those records.
>>> >
>>> > Regards. -Jeremy
>>> >
>>> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
>>> > > Hello,
>>> > > I am trying to get my head around Accumulo schema designs. I went
>>> > > through a lot of trouble to get the wikisearch example running, but
>>> > > since the data is in protobuf lists, it's not that illustrative (for
>>> > > a newbie).
>>> > > Would love to find another example that is a little simpler to
>>> > > understand. In particular I am interested in java/scala code that
>>> > > mimics the D4M schema design (not a Matlab guy).
>>> > > Thanks,
>>> > > Arshak
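On the degree-table note earlier in the thread (pre-summing prior to ingestion to reduce pressure on the accumulator inside of Accumulo), here is a hedged Java sketch of both halves, assuming the accumulator in question is a SummingCombiner configured on TedgeDegree: one-time attachment of the combiner, and client-side pre-summing so each batch writes one partial count per term instead of one +1 per occurrence. The table name follows the thread; the iterator name and priority are illustrative.

import java.util.HashMap;
import java.util.Map;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.LongCombiner;
import org.apache.accumulo.core.iterators.user.SummingCombiner;
import org.apache.hadoop.io.Text;

public class DegreeTableSketch {

    // One-time setup: let the tablet servers sum the per-entry counts.
    static void attachSummingCombiner(Connector conn) throws Exception {
        IteratorSetting setting = new IteratorSetting(10, "degreeSum", SummingCombiner.class);
        LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING);  // values stored as "1", "2", ...
        SummingCombiner.setCombineAllColumns(setting, true);
        conn.tableOperations().attachIterator("TedgeDegree", setting);
    }

    // Ingest-side pre-summing: accumulate counts for a whole batch in memory,
    // then write one entry per term instead of one +1 per occurrence.
    static void writePreSummedDegrees(Connector conn, Iterable<String> columnQualifiers) throws Exception {
        Map<String, Long> counts = new HashMap<String, Long>();
        for (String cq : columnQualifiers) {
            Long c = counts.get(cq);
            counts.put(cq, c == null ? 1L : c + 1L);
        }

        BatchWriter degree = conn.createBatchWriter("TedgeDegree", new BatchWriterConfig());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            Mutation m = new Mutation(new Text(e.getKey()));
            m.put(new Text(""), new Text("Degree"),
                  new Value(e.getValue().toString().getBytes()));
            degree.addMutation(m);
        }
        degree.close();
    }
}

The combiner still merges the partial counts from different batches, but it sees far fewer entries per term, which is where the pressure reduction would come from; whether letting the degree table drift slightly out of sync is acceptable, as asked earlier in the thread, remains an application-level choice.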
