Josh, I am still a little stuck on how this would work in a transactional app
(i.e., a mixed workload of reads and writes).
I definitely see the power of using a serialized structure to minimize the
number of records, but what happens when rows get deleted from the main table
(or mutated)? In the bloated model I could see some referential-integrity code
zapping the index entries as well. In the serialized-structure design it seems
pretty complex to go and update every array that referenced that row.
Is it fair to say that the D4M approach is a little better suited for
transactional apps and the wikisearch approach is better for read-optimized
index apps?
On Sun, Dec 29, 2013 at 12:27 PM, Josh Elser <[email protected]> wrote:
Some context here regarding the wikisearch:
The point of the protocol buffers here (or any serialized structure
in the Value) is to reduce the ingest pressure and increase query
performance on the inverted index (or transpose table, if I follow
the D4M phrasing).
This works well because most languages (especially English) follow a
Zipfian distribution: some terms appear very frequently while others
occur very infrequently. For common terms, we don't want to bloat
our index, nor spend time creating those index records (e.g. "the").
For uncommon terms, we still want direct access to these infrequent
words (e.g. "supercalifragilisticexpialidocious").
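As a rough illustration (not the actual wikisearch code), the two
index-entry shapes might look like this in Java; the column family and
the delimited-string packing are stand-ins for whatever serialized
structure (e.g. protocol buffers) you choose:

    import java.util.List;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    // Sketch only: the two shapes of inverted-index entries described above.
    // Names and the column layout are illustrative, not the wikisearch schema.
    public class IndexEntryShapes {

      // Uncommon term: one index record per document id.
      static Mutation perDocumentEntry(String term, String docId) {
        Mutation m = new Mutation(term);
        m.put("index", docId, new Value("1".getBytes()));
        return m;
      }

      // Common term: one record whose Value packs the whole document list
      // (the wikisearch uses protocol buffers; a delimited string stands in here).
      static Mutation packedEntry(String term, List<String> docIds) {
        Mutation m = new Mutation(term);
        m.put("index", "", new Value(String.join(",", docIds).getBytes()));
        return m;
      }
    }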
The ingest effect is also rather interesting when dealing with
Accumulo, as you're not just writing more data, but typically writing
data to most (if not all) tservers. Even the tokenization of a
single document is likely to create inserts to a majority of the
tablets for your inverted index. When dealing with high ingest rates
(live *or* bulk -- you still have to send data to these servers),
minimizing the number of records becomes important, as creating them
may be a bottleneck in your pipeline.
The query implications are pretty straightforward: common terms
don't bloat the index in size or slow down uncommon-term lookups, and
those uncommon-term lookups point to specific documents rather than
a range (shard) of documents.
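A lookup against such an index can then be a single-row scan. Here is a
minimal sketch using the Connector-era client API; the table name and
thread count are assumptions:

    import java.util.Collections;
    import java.util.Map;
    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    // Sketch of an uncommon-term lookup: scan one row of the inverted index,
    // where each column qualifier is a document id.
    public class TermLookup {
      static void lookup(Connector conn, String term) throws Exception {
        BatchScanner bs = conn.createBatchScanner("invertedIndex",
            Authorizations.EMPTY, 4);
        bs.setRanges(Collections.singleton(Range.exact(term)));
        for (Map.Entry<Key, Value> e : bs) {
          System.out.println(e.getKey().getColumnQualifier() + " -> " + e.getValue());
        }
        bs.close();
      }
    }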
On 12/29/2013 11:57 AM, Arshak Navruzyan wrote:
Sorry I mixed things up. It was in the wikisearch example:
http://accumulo.apache.org/example/wikisearch.html
"If the cardinality is small enough, it will track the set of
documents
by term directly."
On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:
Hi Arshak,
See interspersed below.
Regards. -Jeremy
On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <[email protected]> wrote:
Jeremy,
Thanks for the detailed explanation. Just a couple of final questions:
1. What's your advice on the transpose table: should I repeat the
indexed term (one entry per matching row id), or try to store all
matching row ids from tedge in a single row in tedgetranspose (using
protobuf, for example)? What's the performance implication of each
approach? In the paper you mentioned that if it's a few values they
should just be stored together. Was there a cut-off point in your
testing?
Can you clarify? I am not sure what you're asking.
2. You mentioned that the degrees should be calculated beforehand for
high ingest rates. Doesn't this change Accumulo from being a true
database to being more of an index? If changes to the data cause the
degree table to get out of sync, it sounds like changes have to be
applied elsewhere first and Accumulo has to be reloaded periodically.
Or perhaps letting the degree table get out of sync is OK since it's
just an assist...
My point was a very narrow comment on optimization in very high
performance situations. I probably shouldn't have mentioned it. If
you ever have performance issues with your degree tables, that would
be the time to discuss it. You may never encounter this issue.
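For reference, the pre-summing idea can be sketched roughly like this:
aggregate degree increments in memory and write one partial sum per term
per batch, so the server-side combiner has far fewer entries to fold
together. The table and column names follow the example later in the
thread; everything else is an assumption, not D4M's own code.

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    // Rough sketch of client-side pre-summing for a degree table.
    public class DegreePreSummer {
      private final Map<String, Long> counts = new HashMap<>();

      // Called once per term occurrence during ingest.
      void increment(String term) {
        counts.merge(term, 1L, Long::sum);
      }

      // Flush one pre-summed mutation per term; a SummingCombiner configured
      // on the table then adds these partial sums across batches and tservers.
      void flush(Connector conn) throws Exception {
        BatchWriter bw = conn.createBatchWriter("TedgeDegree", new BatchWriterConfig());
        for (Map.Entry<String, Long> e : counts.entrySet()) {
          Mutation m = new Mutation(e.getKey());
          m.put("", "Degree", new Value(e.getValue().toString().getBytes()));
          bw.addMutation(m);
        }
        bw.close();
        counts.clear();
      }
    }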
Thanks,
Arshak
On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL <[email protected]> wrote:
Hi Arshak,
Here is how you might do it. We implement everything with batch
writers and batch scanners. Note: if you are doing high ingest rates,
the degree table can be tricky and usually requires pre-summing prior
to ingestion to reduce the pressure on the accumulator inside of
Accumulo. Feel free to ask further questions, as I would imagine
there are details that still wouldn't be clear -- in particular, why
we do it this way.
Regards. -Jeremy
Original data:
Machine,Pool,Load,ReadingTimestamp
neptune,west,5,1388191975000
neptune,west,9,1388191975010
pluto,east,13,1388191975090
Tedge table:
rowKey,columnQualifier,value
0005791918831-neptune,Machine|neptune,1
0005791918831-neptune,Pool|west,1
0005791918831-neptune,Load|5,1
0005791918831-neptune,ReadingTimestamp|1388191975000,1
0105791918831-neptune,Machine|neptune,1
0105791918831-neptune,Pool|west,1
0105791918831-neptune,Load|9,1
0105791918831-neptune,ReadingTimestamp|1388191975010,1
0905791918831-pluto,Machine|pluto,1
0905791918831-pluto,Pool|east,1
0905791918831-pluto,Load|13,1
0905791918831-pluto,ReadingTimestamp|1388191975090,1
TedgeTranspose table:
rowKey,columnQualifier,value
Machine|neptune,0005791918831-neptune,1
Pool|west,0005791918831-neptune,1
Load|5,0005791918831-neptune,1
ReadingTimestamp|1388191975000,0005791918831-neptune,1
Machine|neptune,0105791918831-neptune,1
Pool|west,0105791918831-neptune,1
Load|9,0105791918831-neptune,1
ReadingTimestamp|1388191975010,0105791918831-neptune,1
Machine|pluto,0905791918831-pluto,1
Pool|east,0905791918831-pluto,1
Load|13,0905791918831-pluto,1
ReadingTimestamp|1388191975090,0905791918831-pluto,1
TedgeDegree table:
rowKey,columnQualifier,value
Machine|neptune,Degree,2
Pool|west,Degree,2
Load|5,Degree,1
ReadingTimestamp|1388191975000,Degree,1
Load|9,Degree,1
ReadingTimestamp|1388191975010,Degree,1
Machine|pluto,Degree,1
Pool|east,Degree,1
Load|13,Degree,1
ReadingTimestamp|1388191975090,Degree,1
TedgeText table:
rowKey,columnQualifier,value
0005791918831-neptune,Text,<... raw text of original log ...>
0105791918831-neptune,Text,<... raw text of original log ...>
0905791918831-pluto,Text,<... raw text of original log ...>
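To make the tables above concrete, here is a minimal Java ingest sketch
for one record, written with BatchWriters as described. The empty column
family and the pre-built flipped-timestamp row key are assumptions chosen
to match the sample rows, not the canonical D4M code; degree writes are
omitted (see the pre-summing sketch earlier in the thread).

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    // Sketch: ingest one record into Tedge, TedgeTranspose, and TedgeText.
    // Example call:
    //   ingest(conn, "0005791918831-neptune",
    //          new String[]{"Machine","Pool","Load","ReadingTimestamp"},
    //          new String[]{"neptune","west","5","1388191975000"},
    //          "neptune,west,5,1388191975000");
    public class D4mIngestSketch {
      static final Value ONE = new Value("1".getBytes());

      static void ingest(Connector conn, String rowKey, String[] cols,
                         String[] vals, String rawText) throws Exception {
        BatchWriterConfig cfg = new BatchWriterConfig();
        BatchWriter edge = conn.createBatchWriter("Tedge", cfg);
        BatchWriter transpose = conn.createBatchWriter("TedgeTranspose", cfg);
        BatchWriter text = conn.createBatchWriter("TedgeText", cfg);

        Mutation m = new Mutation(rowKey);
        for (int i = 0; i < cols.length; i++) {
          String colKey = cols[i] + "|" + vals[i];   // e.g. "Machine|neptune"
          m.put("", colKey, ONE);                    // Tedge: row -> column
          Mutation t = new Mutation(colKey);         // TedgeTranspose: column -> row
          t.put("", rowKey, ONE);
          transpose.addMutation(t);
        }
        edge.addMutation(m);

        Mutation txt = new Mutation(rowKey);         // TedgeText: raw record
        txt.put("", "Text", new Value(rawText.getBytes()));
        text.addMutation(txt);

        edge.close();
        transpose.close();
        text.close();
      }
    }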
On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <[email protected]> wrote:
> Jeremy,
>
> Wow, didn't expect to get help from the author :)
>
> How about something simple like this:
>
> Machine Pool Load ReadingTimestamp
> neptune west 5 1388191975000
> neptune west 9 1388191975010
> pluto east 13 1388191975090
>
> These are the areas I am unclear on:
>
> 1. Should the transpose table be built as part of ingest code or as
> an Accumulo combiner?
> 2. What does the degree table do in this example? The paper mentions
> it's useful for query optimization. How?
> 3. Does D4M accommodate "repurposing" the row_id to a partition key?
> The wikisearch shows how the partition id is important for parallel
> scans of the index. But since Accumulo is a row store, how can you do
> fast lookups by row if you've used the row_id as a partition key?
>
> Thank you,
>
> Arshak
>
>
>
>
>
>
> On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <[email protected]> wrote:
> Hi Arshak,
> Maybe you can send a few (~3) records of data that you are familiar
> with and we can walk you through how the D4M schema would be applied
> to those records.
>
> Regards. -Jeremy
>
> On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote:
> > Hello,
> > I am trying to get my head around Accumulo schema designs. I went
> > through a lot of trouble to get the wikisearch example running, but
> > since the data is in protobuf lists, it's not that illustrative (for
> > a newbie).
> > Would love to find another example that is a little simpler to
> > understand. In particular I am interested in java/scala code that
> > mimics the D4M schema design (not a Matlab guy).
> > Thanks,
> > Arshak
>