Sorry, I mixed things up. It was in the wikisearch example: http://accumulo.apache.org/example/wikisearch.html
"If the cardinality is small enough, it will track the set of documents by term directly." On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL < [email protected]> wrote: > Hi Arshak, > See interspersed below. > Regards. -Jeremy > > On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <[email protected]> wrote: > > Jeremy, > > Thanks for the detailed explanation. Just a couple of final questions: > > 1. What's your advise on the transpose table as far as whether to repeat > the indexed term (one per matching row id) or try to store all matching row > ids from tedge in a single row in tedgetranspose (using protobuf for > example). What's the performance implication of each approach? In the > paper you mentioned that if it's a few values they should just be stored > together. Was there a cut-off point in your testing? > > > Can you clarify? I am not sure what your asking. > > > 2. You mentioned that the degrees should be calculated beforehand for > high ingest rates. Doesn't this change Accumulo from being a true database > to being more of an index? If changes to the data cause the degree table > to get out of sync, sounds like changes have to be applied elsewhere first > and Accumulo has to be reloaded periodically. Or perhaps letting the > degree table get out of sync is ok since it's just an assist... > > > My point was a very narrow comment on optimization in very high > performance situations. I probably shouldn't have mentioned it. If you > have ever have performance issues with your degree tables, that would be > the time to discuss. . You may never encounter this issue. > > Thanks, > > Arshak > > > On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL < > [email protected]> wrote: > >> Hi Arshak, >> Here is how you might do it. We implement everything with batch >> writers and batch scanners. Note: if you are doing high ingest rates, the >> degree table can be tricky and usually requires pre-summing prior to >> ingestion to reduce the pressure on the accumulator inside of Accumulo. >> Feel free to ask further questions as I would imagine that there a details >> that still wouldn't be clear. In particular, why we do it this way. >> >> Regards. 
-Jeremy >> >> Original data: >> >> Machine,Pool,Load,ReadingTimestamp >> neptune,west,5,1388191975000 >> neptune,west,9,1388191975010 >> pluto,east,13,1388191975090 >> >> >> Tedge table: >> rowKey,columnQualifier,value >> >> 0005791918831-neptune,Machine|neptune,1 >> 0005791918831-neptune,Pool|west,1 >> 0005791918831-neptune,Load|5,1 >> 0005791918831-neptune,ReadingTimestamp|1388191975000,1 >> 0105791918831-neptune,Machine|neptune,1 >> 0105791918831-neptune,Pool|west,1 >> 0105791918831-neptune,Load|9,1 >> 0105791918831-neptune,ReadingTimestamp|1388191975010,1 >> 0905791918831-pluto,Machine|pluto,1 >> 0905791918831-pluto,Pool|east,1 >> 0905791918831-pluto,Load|13,1 >> 0905791918831-pluto,ReadingTimestamp|1388191975090,1 >> >> >> TedgeTranspose table: >> rowKey,columnQualifier,value >> >> Machine|neptune,0005791918831-neptune,1 >> Pool|west,0005791918831-neptune,1 >> Load|5,0005791918831-neptune,1 >> ReadingTimestamp|1388191975000,0005791918831-neptune,1 >> Machine|neptune,0105791918831-neptune,1 >> Pool|west,0105791918831-neptune,1 >> Load|9,0105791918831-neptune,1 >> ReadingTimestamp|1388191975010,0105791918831-neptune,1 >> Machine|pluto,0905791918831-pluto,1 >> Pool|east,0905791918831-pluto,1 >> Load|13,0905791918831-pluto,1 >> ReadingTimestamp|1388191975090,0905791918831-pluto,1 >> >> >> TedgeDegree table: >> rowKey,columnQualifier,value >> >> Machine|neptune,Degree,2 >> Pool|west,Degree,2 >> Load|5,Degree,1 >> ReadingTimestamp|1388191975000,Degree,1 >> Load|9,Degree,1 >> ReadingTimestamp|1388191975010,Degree,1 >> Machine|pluto,Degree,1 >> Pool|east,Degree,1 >> Load|13,Degree,1 >> ReadingTimestamp|1388191975090,Degree,1 >> >> >> TedgeText table: >> rowKey,columnQualifier,value >> >> 0005791918831-neptune,Text,< ... raw text of original log ...> >> 0105791918831-neptune,Text,< ... raw text of original log ...> >> 0905791918831-pluto,Text,< ... raw text of original log ...> >> >> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <[email protected]> wrote: >> >> > Jeremy, >> > >> > Wow, didn't expect to get help from the author :) >> > >> > How about something simple like this: >> > >> > Machine Pool Load ReadingTimestamp >> > neptune west 5 1388191975000 >> > neptune west 9 1388191975010 >> > pluto east 13 1388191975090 >> > >> > These are the areas I am unclear on: >> > >> > 1. Should the transpose table be built as part of ingest code or as an >> accumulo combiner? >> > 2. What does the degree table do in this example ? The paper mentions >> it's useful for query optimization. How? >> > 3. Does D4M accommodate "repurposing" the row_id to a partition key? >> The wikisearch shows how the partition id is important for parallel scans >> of the index. But since Accumulo is a row store how can you do fast >> lookups by row if you've used the row_id as a partition key. >> > >> > Thank you, >> > >> > Arshak >> > >> > >> > >> > >> > >> > >> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <[email protected]> >> wrote: >> > Hi Arshak, >> > Maybe you can send a few (~3) records of data that you are familiar >> with >> > and we can walk you through how the D4M schema would be applied to >> those records. >> > >> > Regards. -Jeremy >> > >> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote: >> > > Hello, >> > > I am trying to get my head around Accumulo schema designs. I went >> through >> > > a lot of trouble to get the wikisearch example running but since >> the data >> > > in protobuf lists, it's not that illustrative (for a newbie). 
>> > > Would love to find another example that is a little simpler to >> understand. >> > > In particular I am interested in java/scala code that mimics the >> D4M >> > > schema design (not a Matlab guy). >> > > Thanks, >> > > Arshak >> > >> >> > >
