I would be reluctant to make generalizations.
On Sun, Dec 29, 2013 at 05:45:28PM -0500, Arshak Navruzyan wrote:
> Josh, I am still a little stuck on the idea of how this would work in a
> transactional app (aka a mixed workload of reads and writes).
>
> I definitely see the power of using a serialized structure in order to
> minimize the number of records, but what happens when rows get deleted
> out of the main table (or mutated)? In the bloated model I could see
> some referential integrity code zapping the index entries as well. In
> the serialized structure design it seems pretty complex to go and
> update every array that referenced that row.
>
> Is it fair to say that the D4M approach is a little better suited for
> transactional apps and the wikisearch approach is better for
> read-optimized index apps?
>
> On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <[email protected]> wrote:
> > Some context here in regards to the wikisearch:
> >
> > The point of the protocol buffers here (or any serialized structure
> > in the Value) is to reduce the ingest pressure and increase query
> > performance on the inverted index (or transpose table, if I follow
> > the D4M phrasing).
> >
> > This works well because most languages (especially English) follow a
> > Zipfian distribution: some terms appear very frequently while others
> > occur very infrequently. For common terms, we don't want to bloat our
> > index, nor spend time creating those index records (e.g. "the"). For
> > uncommon terms, we still want direct access to those infrequent words
> > (e.g. "supercalifragilisticexpialidocious").
> >
> > The ingest effect is also rather interesting when dealing with
> > Accumulo, as you're not just writing more data, but typically writing
> > data to most (if not all) tservers. Even the tokenization of a single
> > document is likely to create inserts to a majority of the tablets for
> > your inverted index. When dealing with high ingest rates (live *or*
> > bulk -- either way you still have to send data to these servers),
> > minimizing the number of records becomes important, as it may be a
> > bottleneck in your pipeline.
> >
> > The query implications are pretty straightforward: common terms don't
> > bloat the index in size nor affect uncommon-term lookups, and those
> > uncommon-term lookups remain specific to documents rather than a
> > range (shard) of documents.
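To make that common/uncommon split concrete, here is a minimal Java
sketch of a cardinality cutoff. This is not the wikisearch
implementation (which serializes the document lists with protocol
buffers inside Accumulo Values); the CUTOFF constant and the
comma-joined value format are illustrative stand-ins.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of a cardinality cutoff for an inverted-index entry.
    public class IndexEntrySketch {

        static final int CUTOFF = 20; // hypothetical threshold

        // Value to store under the index key (term, shardId).
        static String indexValue(String shardId, Set<String> docIds) {
            if (docIds.size() <= CUTOFF) {
                // Uncommon term: track the set of documents directly,
                // so a query can jump straight to the matching docs.
                return String.join(",", docIds);
            }
            // Common term (e.g. "the"): record only that the term
            // occurs in this shard; queries fall back to scanning the
            // shard, and the index stays small no matter how often
            // the term appears.
            return "SHARD:" + shardId;
        }

        public static void main(String[] args) {
            Set<String> rare = new HashSet<>();
            rare.add("doc7");
            rare.add("doc42");
            System.out.println(indexValue("s01", rare));   // doc7,doc42

            Set<String> common = new HashSet<>();
            for (int i = 0; i < 1000; i++) common.add("doc" + i);
            System.out.println(indexValue("s01", common)); // SHARD:s01
        }
    }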
> > On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
> > > Sorry, I mixed things up. It was in the wikisearch example:
> > > http://accumulo.apache.org/example/wikisearch.html
> > >
> > > "If the cardinality is small enough, it will track the set of
> > > documents by term directly."
> > >
> > > On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL
> > > <[email protected]> wrote:
> > > > Hi Arshak,
> > > >   See interspersed below.
> > > > Regards. -Jeremy
> > > >
> > > > On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan
> > > > <[email protected]> wrote:
> > > > > Jeremy,
> > > > >
> > > > > Thanks for the detailed explanation. Just a couple of final
> > > > > questions:
> > > > >
> > > > > 1. What's your advice on the transpose table as far as whether
> > > > > to repeat the indexed term (one per matching row id) or try to
> > > > > store all matching row ids from Tedge in a single row in
> > > > > TedgeTranspose (using protobuf, for example)? What's the
> > > > > performance implication of each approach? In the paper you
> > > > > mentioned that if there are only a few values they should just
> > > > > be stored together. Was there a cut-off point in your testing?
> > > >
> > > > Can you clarify? I am not sure what you're asking.
> > > >
> > > > > 2. You mentioned that the degrees should be calculated
> > > > > beforehand for high ingest rates. Doesn't this change Accumulo
> > > > > from being a true database to being more of an index? If
> > > > > changes to the data cause the degree table to get out of sync,
> > > > > it sounds like changes have to be applied elsewhere first and
> > > > > Accumulo has to be reloaded periodically. Or perhaps letting
> > > > > the degree table get out of sync is ok since it's just an
> > > > > assist...
> > > >
> > > > My point was a very narrow comment on optimization in very high
> > > > performance situations. I probably shouldn't have mentioned it.
> > > > If you ever have performance issues with your degree tables,
> > > > that would be the time to discuss it. You may never encounter
> > > > this issue.
> > > >
> > > > > Thanks,
> > > > >
> > > > > Arshak
> > > > >
> > > > > On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 -
> > > > > MITLL <[email protected]> wrote:
> > > > > > Hi Arshak,
> > > > > >   Here is how you might do it. We implement everything with
> > > > > > batch writers and batch scanners. Note: if you are doing
> > > > > > high ingest rates, the degree table can be tricky and
> > > > > > usually requires pre-summing prior to ingestion to reduce
> > > > > > the pressure on the accumulator inside of Accumulo. Feel
> > > > > > free to ask further questions, as I would imagine there are
> > > > > > details that still wouldn't be clear. In particular, why we
> > > > > > do it this way.
> > > > > >
> > > > > > Regards. -Jeremy
> > > > > >
> > > > > > Original data:
> > > > > >
> > > > > > Machine,Pool,Load,ReadingTimestamp
> > > > > > neptune,west,5,1388191975000
> > > > > > neptune,west,9,1388191975010
> > > > > > pluto,east,13,1388191975090
> > > > > >
> > > > > > Tedge table:
> > > > > > rowKey,columnQualifier,value
> > > > > >
> > > > > > 0005791918831-neptune,Machine|neptune,1
> > > > > > 0005791918831-neptune,Pool|west,1
> > > > > > 0005791918831-neptune,Load|5,1
> > > > > > 0005791918831-neptune,ReadingTimestamp|1388191975000,1
> > > > > > 0105791918831-neptune,Machine|neptune,1
> > > > > > 0105791918831-neptune,Pool|west,1
> > > > > > 0105791918831-neptune,Load|9,1
> > > > > > 0105791918831-neptune,ReadingTimestamp|1388191975010,1
> > > > > > 0905791918831-pluto,Machine|pluto,1
> > > > > > 0905791918831-pluto,Pool|east,1
> > > > > > 0905791918831-pluto,Load|13,1
> > > > > > 0905791918831-pluto,ReadingTimestamp|1388191975090,1
> > > > > >
> > > > > > TedgeTranspose table:
> > > > > > rowKey,columnQualifier,value
> > > > > >
> > > > > > Machine|neptune,0005791918831-neptune,1
> > > > > > Pool|west,0005791918831-neptune,1
> > > > > > Load|5,0005791918831-neptune,1
> > > > > > ReadingTimestamp|1388191975000,0005791918831-neptune,1
> > > > > > Machine|neptune,0105791918831-neptune,1
> > > > > > Pool|west,0105791918831-neptune,1
> > > > > > Load|9,0105791918831-neptune,1
> > > > > > ReadingTimestamp|1388191975010,0105791918831-neptune,1
> > > > > > Machine|pluto,0905791918831-pluto,1
> > > > > > Pool|east,0905791918831-pluto,1
> > > > > > Load|13,0905791918831-pluto,1
> > > > > > ReadingTimestamp|1388191975090,0905791918831-pluto,1
> > > > > >
> > > > > > TedgeDegree table:
> > > > > > rowKey,columnQualifier,value
> > > > > >
> > > > > > Machine|neptune,Degree,2
> > > > > > Pool|west,Degree,2
> > > > > > Load|5,Degree,1
> > > > > > ReadingTimestamp|1388191975000,Degree,1
> > > > > > Load|9,Degree,1
> > > > > > ReadingTimestamp|1388191975010,Degree,1
> > > > > > Machine|pluto,Degree,1
> > > > > > Pool|east,Degree,1
> > > > > > Load|13,Degree,1
> > > > > > ReadingTimestamp|1388191975090,Degree,1
> > > > > >
> > > > > > TedgeText table:
> > > > > > rowKey,columnQualifier,value
> > > > > >
> > > > > > 0005791918831-neptune,Text,<... raw text of original log ...>
> > > > > > 0105791918831-neptune,Text,<... raw text of original log ...>
> > > > > > 0905791918831-pluto,Text,<... raw text of original log ...>
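Since the original request was for Java rather than Matlab, here is a
rough sketch of the ingest side of the schema above, using the standard
Accumulo client API. The instance, ZooKeeper, user, and password values
are placeholders; pre-summing degrees into a local map per batch is one
way to implement Jeremy's note about reducing pressure on the
accumulator, and the SummingCombiner turns the written increments into
running counts server-side.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;
    import org.apache.hadoop.io.Text;

    // Writes one record (neptune,west,5,1388191975000) to Tedge,
    // TedgeTranspose, and TedgeDegree, mirroring the rows above.
    public class D4mIngestSketch {

        static final Text EMPTY_CF = new Text("");
        static final Value ONE = new Value("1".getBytes());

        public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
                    .getConnector("ingestUser", new PasswordToken("secret"));

            // One-time setup: a SummingCombiner on TedgeDegree turns
            // the increments written below into running counts.
            IteratorSetting sum =
                    new IteratorSetting(10, "degSum", SummingCombiner.class);
            LongCombiner.setEncodingType(sum, LongCombiner.Type.STRING);
            SummingCombiner.setCombineAllColumns(sum, true);
            conn.tableOperations().attachIterator("TedgeDegree", sum);

            BatchWriterConfig cfg = new BatchWriterConfig();
            BatchWriter edge = conn.createBatchWriter("Tedge", cfg);
            BatchWriter transpose = conn.createBatchWriter("TedgeTranspose", cfg);
            BatchWriter degree = conn.createBatchWriter("TedgeDegree", cfg);

            // Row key from the example above: digit-reversed timestamp
            // plus machine name.
            String rowKey = "0005791918831-neptune";
            String[] cols = { "Machine|neptune", "Pool|west", "Load|5",
                              "ReadingTimestamp|1388191975000" };

            Map<String, Long> degrees = new HashMap<String, Long>();
            Mutation edgeMut = new Mutation(new Text(rowKey));
            for (String col : cols) {
                edgeMut.put(EMPTY_CF, new Text(col), ONE);   // Tedge
                Mutation t = new Mutation(new Text(col));    // transpose
                t.put(EMPTY_CF, new Text(rowKey), ONE);
                transpose.addMutation(t);
                Long d = degrees.get(col);                   // pre-sum locally
                degrees.put(col, d == null ? 1L : d + 1L);
            }
            edge.addMutation(edgeMut);

            // Flush one pre-summed increment per column for the batch.
            for (Map.Entry<String, Long> e : degrees.entrySet()) {
                Mutation d = new Mutation(new Text(e.getKey()));
                d.put(EMPTY_CF, new Text("Degree"),
                      new Value(String.valueOf(e.getValue()).getBytes()));
                degree.addMutation(d);
            }

            edge.close();
            transpose.close();
            degree.close();
        }
    }

Writing all three tables from the same ingest client, rather than
deriving the transpose server-side, matches Jeremy's "we implement
everything with batch writers and batch scanners" description.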
> > > > > > On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan
> > > > > > <[email protected]> wrote:
> > > > > > > Jeremy,
> > > > > > >
> > > > > > > Wow, didn't expect to get help from the author :)
> > > > > > >
> > > > > > > How about something simple like this:
> > > > > > >
> > > > > > > Machine   Pool   Load   ReadingTimestamp
> > > > > > > neptune   west   5      1388191975000
> > > > > > > neptune   west   9      1388191975010
> > > > > > > pluto     east   13     1388191975090
> > > > > > >
> > > > > > > These are the areas I am unclear on:
> > > > > > >
> > > > > > > 1. Should the transpose table be built as part of the
> > > > > > > ingest code or as an Accumulo combiner?
> > > > > > > 2. What does the degree table do in this example? The
> > > > > > > paper mentions it's useful for query optimization. How?
> > > > > > > 3. Does D4M accommodate "repurposing" the row_id to a
> > > > > > > partition key? The wikisearch shows how the partition id
> > > > > > > is important for parallel scans of the index. But since
> > > > > > > Accumulo is a row store, how can you do fast lookups by
> > > > > > > row if you've used the row_id as a partition key?
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > Arshak
> > > > > > >
> > > > > > > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner
> > > > > > > <[email protected]> wrote:
> > > > > > > > Hi Arshak,
> > > > > > > >   Maybe you can send a few (~3) records of data that you
> > > > > > > > are familiar with and we can walk you through how the
> > > > > > > > D4M schema would be applied to those records.
> > > > > > > >
> > > > > > > > Regards. -Jeremy
> > > > > > > >
> > > > > > > > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak
> > > > > > > > Navruzyan wrote:
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am trying to get my head around Accumulo schema
> > > > > > > > > designs. I went through a lot of trouble to get the
> > > > > > > > > wikisearch example running, but since the data is in
> > > > > > > > > protobuf lists, it's not that illustrative (for a
> > > > > > > > > newbie). Would love to find another example that is a
> > > > > > > > > little simpler to understand. In particular I am
> > > > > > > > > interested in java/scala code that mimics the D4M
> > > > > > > > > schema design (not a Matlab guy).
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Arshak
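On Arshak's question 2 (how the degree table helps query optimization):
the degree lookup is what lets a query start from the most selective
term. A minimal sketch follows, assuming the tables above and the
standard Accumulo Scanner API; connection setup is elided, and
everything other than the table names is illustrative.

    import java.util.Map;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    // For an AND of two terms, look both up in TedgeDegree and drive
    // the query from the rarer one, so the TedgeTranspose scan
    // returns few candidate rows.
    public class DegreeQuerySketch {

        static long degree(Connector conn, String term) throws Exception {
            Scanner s = conn.createScanner("TedgeDegree", Authorizations.EMPTY);
            s.setRange(Range.exact(new Text(term)));
            long d = 0; // treat a missing entry as degree 0
            for (Map.Entry<Key, Value> e : s)
                d = Long.parseLong(e.getValue().toString());
            return d;
        }

        static void andQuery(Connector conn, String a, String b) throws Exception {
            // E.g. for Pool|west (degree 2) AND Load|9 (degree 1) in
            // the tables above, start from Load|9 and check Pool|west
            // per candidate row.
            String rare = degree(conn, a) <= degree(conn, b) ? a : b;
            String other = rare.equals(a) ? b : a;

            Scanner s = conn.createScanner("TedgeTranspose", Authorizations.EMPTY);
            s.setRange(Range.exact(new Text(rare)));
            for (Map.Entry<Key, Value> e : s) {
                // Each column qualifier in the transpose row is a
                // Tedge row key.
                String docRow = e.getKey().getColumnQualifier().toString();
                Scanner check = conn.createScanner("Tedge", Authorizations.EMPTY);
                check.setRange(Range.exact(new Text(docRow)));
                check.fetchColumn(new Text(""), new Text(other));
                if (check.iterator().hasNext())
                    System.out.println("match: " + docRow);
            }
        }
    }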
