I think re-using Iterators in the client-write path makes sense architecturally and is a logical progression for the reasons pointed out by Roman and Russ.

The big concern, which Keith pointed out, is that it's hard to directly apply iterators on the client-write side because we're not dealing with sorted key-values at that point. I think there could be ways to work around this.
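
One possible workaround, sketched here under my own assumptions rather than as anything that exists in Accumulo today: buffer the pending key-values on the client, sort the buffer, and only then run combiner-style logic over the sorted run, which is the ordering an iterator expects. The names `combine_sorted` and `summing` are hypothetical, keys are simplified to (row, qualifier) tuples, and Python is used only to keep the sketch short.

```python
from itertools import groupby
from operator import itemgetter


def combine_sorted(buffered, reducer):
    """Sort buffered (key, value) pairs, then reduce each run of equal keys.

    `buffered` is an iterable of (key, value) pairs in arbitrary order, as a
    client would see them before writing. Sorting first gives the reducer the
    ordered view that an iterator stack normally relies on.
    """
    ordered = sorted(buffered, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, reducer(value for _, value in group)


def summing(values):
    """Hypothetical stand-in for a summing combiner."""
    return sum(values)


if __name__ == "__main__":
    # Unsorted updates, as they might arrive at the client.
    pending = [
        (("row2", "count"), 1),
        (("row1", "count"), 4),
        (("row2", "count"), 2),
        (("row1", "count"), 1),
    ]
    for key, total in combine_sorted(pending, summing):
        print(key, total)
```

The trade-off is that the buffer has to fit in client memory between flushes, so in practice the sort would probably be applied per batch rather than over the whole stream.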

I'd say if we have people who are interested in pursuing this, let's start a new discussion on dev@ where we can start laying some groundwork for the scope and implementation of what this solution would look like.

[email protected] wrote:
My view is that the introduction of ingest-time iterators would be quite a
useful feature. Anyway. :)

Also, could anyone explain exactly why a composite mutation performs pretty
much the same as a set of individual mutations?

One large composite mutation with 19 qualifiers inside is just 10-30%
faster than 19 individual mutations.

*From:*Russ Weeks [mailto:[email protected]]
*Sent:* 09 June 2015 20:54
*To:* accumulo-user
*Subject:* Re: micro compaction

For consistency and ease of implementation. Say I've written a stack of
combiners that do statistical aggregation, sampling etc. on my table.
Rather than port that logic to a Storm topology or to the DStream API
I'd just like to turn that stack on in my BatchWriter.

On Tue, Jun 9, 2015 at 12:47 PM David Medinets <[email protected]> wrote:

    Consider using Storm, Pig, Spark, or your own framework to handle
    the in-memory aggregation before giving the data to the BatchWriter.
    Why would any part of Accumulo code be responsible for this kind of
    application-specific data handling?

    On Tue, Jun 9, 2015 at 3:17 PM, [email protected] <[email protected]> wrote:

    Just to clarify the origin of my question.

    I had to do some performance tests to compare different storage
    types of “raw” data against each other.

    Hopefully, the picture below is visible on the mailing list. If not, I
    will put it somewhere else.

    6 million “original” records, 1.3GB data, 233 bytes per record

    Each original record has 40 tab-delimited fields, of which 19 on
    average are non-null

    BatchWriter, single Java program

    The first three bars represent a single “heavy” mutation that inserts
    the whole tabular line / serialized object.

    Bars 4, 5, 6, 7: composite mutation (all qualifiers for the same rowid
    in one mutation)

    Bars 8, 9, 10, 11: individual mutations (all qualifiers for the same
    rowid in separate mutations), ~19 mutations per original record

    On average, single “heavy” mutations are 7-10 times faster than
    anything else; composite mutations are 10%-35% faster than individual
    ones.

    I am not an expert on how Accumulo is implemented internally; however,
    it looks like a composite mutation is treated more or less the same
    way as a set of individual mutations. Probably the largest overhead is
    added by the WAL.

    Data utilization before and after manual compaction of test table
    and all system tables:

    It’s not clear why “accumulo du” reports half as much data used
    as “hdfs du”.

    All these tests made us think that we can improve performance by
    doing some calculations in memory (our use case fits this very well)
    and reducing the number of mutations. Now I am trying to understand
    whether there is a relatively easy way to do this with Accumulo or
    whether it’s time to look closer into something like Spark.

    Thanks

    Roman

    *From:*Adam Fuchs [mailto:[email protected]]
    *Sent:* 09 June 2015 19:08
    *To:* [email protected]
    *Subject:* Re: micro compaction

    I think this might be the same concept as in-mapper combining, but
    applied to data being sent to a BatchWriter rather than an
    OutputCollector. See [1], section 3.1.1. A similar performance
    analysis and probably a lot of the same code should apply here.

    Cheers,

    Adam

    [1]
    http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
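
To make the in-mapper combining idea concrete, here is a minimal sketch of the pattern applied in front of a writer: updates are summed per key in a bounded in-memory map and only the combined values are forwarded. Everything named here is an assumption for illustration. `CombiningWriter` and its `sink` callback are hypothetical and are not Accumulo APIs, and Python is used only for brevity.

```python
from collections import defaultdict


class CombiningWriter:
    """Buffers additive updates per key and forwards one combined update per key.

    `sink` is a hypothetical callable standing in for whatever actually sends
    a mutation (it is not an Accumulo API). `max_keys` bounds memory use.
    """

    def __init__(self, sink, max_keys=10000):
        self._sink = sink
        self._max_keys = max_keys
        self._totals = defaultdict(int)
        self.received = 0  # updates handed to add()
        self.sent = 0      # combined updates forwarded to the sink

    def add(self, row, qualifier, delta):
        self.received += 1
        self._totals[(row, qualifier)] += delta
        # Flush once the number of distinct buffered keys hits the bound.
        if len(self._totals) >= self._max_keys:
            self.flush()

    def flush(self):
        for (row, qualifier), total in self._totals.items():
            self._sink(row, qualifier, total)
            self.sent += 1
        self._totals.clear()


if __name__ == "__main__":
    out = []
    writer = CombiningWriter(lambda r, q, v: out.append((r, q, v)), max_keys=100)
    for i in range(1000):
        writer.add("row%d" % (i % 10), "count", 1)
    writer.flush()
    print(writer.received, writer.sent)
```

This only works as-is for combiners that are associative and commutative, such as sums: a key can be flushed more than once, so the partial totals still have to be merged again on the server side.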

    On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <[email protected]> wrote:

    Having a combiner stack (more generally an iterator stack) run on
    the client side seems to be the second most popular request on this
    list, the most popular being, "How do I write to Accumulo from
    inside an iterator?"

    Such a thing would be very useful for me, too. I have some cycles to
    help out, if somebody can give me an idea of where to get started
    and where the potential land-mines are.

    -Russ

    On Tue, Jun 9, 2015 at 9:08 AM [email protected] <[email protected]> wrote:

        The aggregated output is tiny, so if I do the same calculations in
        memory (instead of sending mutations to Accumulo), I can reduce the
        overall number of mutations by 1000x or so.



        -----Original Message-----
        From: Josh Elser [mailto:[email protected]]
        Sent: 09 June 2015 16:54
        To: [email protected]
        Subject: Re: micro compaction

        Well, you win the prize for new terminology. I haven't ever
        heard the term "micro compaction" before.

        Can you clarify, though: you say hundreds of millions of
        mutations result in megabytes of data. Is that an increase
        or a decrease in size?
        Comparing apples to oranges :)

        [email protected] wrote:
         > Hi guys,
         >
         > While doing pre-analytics we generate hundreds of millions of
         > mutations that result in 1-100 megabytes of useful data after major
         > compaction. We ingest into Accumulo using MR from a Mapper job. We
         > identified that performance really degrades as the number of
         > mutations increases.
         >
         > The obvious improvement is to do some calculations in-memory before
         > sending mutations to Accumulo.
         >
         > Of course, at the same time we are looking for a solution to minimize
         > development effort.
         >
         > I guess I am asking about micro compaction/ingest-time iterators on
         > the client side (before data is sent to Accumulo).
         >
         > To my understanding, Accumulo does not support them, is that correct?
         > And if so, are there any plans to support this functionality in the
         > future?
         >
         > Thanks
         >
         > Roman
         >
         > Please consider the environment before printing this email. This
         > message should be regarded as confidential. If you have received this
         > email in error please notify the sender and destroy it immediately.
         > Statements of intent shall only become binding when confirmed in hard
         > copy by an authorised signatory. The contents of this email may relate
         > to dealings with other companies under the control of BAE Systems
         > Applied Intelligence Limited, details of which can be found at
         > http://www.baesystems.com/Businesses/index.htm.


