Sorry, I mixed things up. It was in the wikisearch example: http://accumulo.apache.org/example/wikisearch.html
"If the cardinality is small enough, it will track the set of documents by term directly." On Sun, Dec 29, 2013 at 8:42 AM, Kepner, Jeremy - 0553 - MITLL < [email protected]> wrote: > Hi Arshak, > See interspersed below. > Regards. -Jeremy > > On Dec 29, 2013, at 11:34 AM, Arshak Navruzyan <[email protected]> wrote: > > Jeremy, > > Thanks for the detailed explanation. Just a couple of final questions: > > 1. What's your advise on the transpose table as far as whether to repeat > the indexed term (one per matching row id) or try to store all matching row > ids from tedge in a single row in tedgetranspose (using protobuf for > example). What's the performance implication of each approach? In the > paper you mentioned that if it's a few values they should just be stored > together. Was there a cut-off point in your testing? > > > Can you clarify? I am not sure what your asking. > > > 2. You mentioned that the degrees should be calculated beforehand for > high ingest rates. Doesn't this change Accumulo from being a true database > to being more of an index? If changes to the data cause the degree table > to get out of sync, sounds like changes have to be applied elsewhere first > and Accumulo has to be reloaded periodically. Or perhaps letting the > degree table get out of sync is ok since it's just an assist... > > > My point was a very narrow comment on optimization in very high > performance situations. I probably shouldn't have mentioned it. If you > have ever have performance issues with your degree tables, that would be > the time to discuss. . You may never encounter this issue. > > Thanks, > > Arshak > > > On Sat, Dec 28, 2013 at 10:36 AM, Kepner, Jeremy - 0553 - MITLL < > [email protected]> wrote: > >> Hi Arshak, >> Here is how you might do it. We implement everything with batch >> writers and batch scanners. Note: if you are doing high ingest rates, the >> degree table can be tricky and usually requires pre-summing prior to >> ingestion to reduce the pressure on the accumulator inside of Accumulo. >> Feel free to ask further questions as I would imagine that there a details >> that still wouldn't be clear. In particular, why we do it this way. >> >> Regards. 
-Jeremy >> >> Original data: >> >> Machine,Pool,Load,ReadingTimestamp >> neptune,west,5,1388191975000 >> neptune,west,9,1388191975010 >> pluto,east,13,1388191975090 >> >> >> Tedge table: >> rowKey,columnQualifier,value >> >> 0005791918831-neptune,Machine|neptune,1 >> 0005791918831-neptune,Pool|west,1 >> 0005791918831-neptune,Load|5,1 >> 0005791918831-neptune,ReadingTimestamp|1388191975000,1 >> 0105791918831-neptune,Machine|neptune,1 >> 0105791918831-neptune,Pool|west,1 >> 0105791918831-neptune,Load|9,1 >> 0105791918831-neptune,ReadingTimestamp|1388191975010,1 >> 0905791918831-pluto,Machine|pluto,1 >> 0905791918831-pluto,Pool|east,1 >> 0905791918831-pluto,Load|13,1 >> 0905791918831-pluto,ReadingTimestamp|1388191975090,1 >> >> >> TedgeTranspose table: >> rowKey,columnQualifier,value >> >> Machine|neptune,0005791918831-neptune,1 >> Pool|west,0005791918831-neptune,1 >> Load|5,0005791918831-neptune,1 >> ReadingTimestamp|1388191975000,0005791918831-neptune,1 >> Machine|neptune,0105791918831-neptune,1 >> Pool|west,0105791918831-neptune,1 >> Load|9,0105791918831-neptune,1 >> ReadingTimestamp|1388191975010,0105791918831-neptune,1 >> Machine|pluto,0905791918831-pluto,1 >> Pool|east,0905791918831-pluto,1 >> Load|13,0905791918831-pluto,1 >> ReadingTimestamp|1388191975090,0905791918831-pluto,1 >> >> >> TedgeDegree table: >> rowKey,columnQualifier,value >> >> Machine|neptune,Degree,2 >> Pool|west,Degree,2 >> Load|5,Degree,1 >> ReadingTimestamp|1388191975000,Degree,1 >> Load|9,Degree,1 >> ReadingTimestamp|1388191975010,Degree,1 >> Machine|pluto,Degree,1 >> Pool|east,Degree,1 >> Load|13,Degree,1 >> ReadingTimestamp|1388191975090,Degree,1 >> >> >> TedgeText table: >> rowKey,columnQualifier,value >> >> 0005791918831-neptune,Text,< ... raw text of original log ...> >> 0105791918831-neptune,Text,< ... raw text of original log ...> >> 0905791918831-pluto,Text,< ... raw text of original log ...> >> >> On Dec 27, 2013, at 8:01 PM, Arshak Navruzyan <[email protected]> wrote: >> >> > Jeremy, >> > >> > Wow, didn't expect to get help from the author :) >> > >> > How about something simple like this: >> > >> > Machine Pool Load ReadingTimestamp >> > neptune west 5 1388191975000 >> > neptune west 9 1388191975010 >> > pluto east 13 1388191975090 >> > >> > These are the areas I am unclear on: >> > >> > 1. Should the transpose table be built as part of ingest code or as an >> accumulo combiner? >> > 2. What does the degree table do in this example ? The paper mentions >> it's useful for query optimization. How? >> > 3. Does D4M accommodate "repurposing" the row_id to a partition key? >> The wikisearch shows how the partition id is important for parallel scans >> of the index. But since Accumulo is a row store how can you do fast >> lookups by row if you've used the row_id as a partition key. >> > >> > Thank you, >> > >> > Arshak >> > >> > >> > >> > >> > >> > >> > On Thu, Dec 26, 2013 at 5:31 PM, Jeremy Kepner <[email protected]> >> wrote: >> > Hi Arshak, >> > Maybe you can send a few (~3) records of data that you are familiar >> with >> > and we can walk you through how the D4M schema would be applied to >> those records. >> > >> > Regards. -Jeremy >> > >> > On Thu, Dec 26, 2013 at 03:10:59PM -0500, Arshak Navruzyan wrote: >> > > Hello, >> > > I am trying to get my head around Accumulo schema designs. I went >> through >> > > a lot of trouble to get the wikisearch example running but since >> the data >> > > in protobuf lists, it's not that illustrative (for a newbie). 
>> > > Would love to find another example that is a little simpler to >> understand. >> > > In particular I am interested in java/scala code that mimics the >> D4M >> > > schema design (not a Matlab guy). >> > > Thanks, >> > > Arshak >> > >> >> > >
