Re: Scans during Compaction

Josh Elser Tue, 24 Feb 2015 07:20:59 -0800

Looking forward to it!!

Dylan Hutchison wrote:

Good suggestion; I will follow up with a design document in the next few
days.


Creating idempotentency via indicator entries (in the column family,
timestamp or something else) is one option to work in an iterator that
should run once over a table's entries.  I think we may have the
opportunity to solve a more general problem---scans merging data from
multiple table sources with user-defined merge & compute functions---in
addition to my use case by re-approaching the problem.  Think
selective-scan->transform->join->transform->write-out.  Analytics on the
server.

My specific use case is table-table multiplication, treating table rows,
column qualifiers and values as the components of a matrix.  We view a
table as a /sparse/ matrix by treating non-present (row, colQ, value)
entries as zeros in the matrix.  Accumulo offers an advantage when we
only want to run /on selected ranges/ from the input tables, as opposed
to running on whole tables where Yarn/Mapreduce may work better.  The
table-table multiplication should run on tables in Accumulo, the result
/persisting/ to Accumulo, so that we never need return values to the
client.  We value /low latency/ over high throughput, so that we can
perform multiplications interactively.  The user should have a way to
/monitor/ multiplication progress, perhaps by a live count of the number
of entries processed or (harder) a live sample of result entries from
the multiplication.  The user should be able to /stop/ an operation
midway once he decides enough entries processed.  In addition to
interactivity, we may want to perform multiplications /in series/ and
parallel.  They form base /building blocks/ for higher-level algorithms.

I promise I will write these details up more formally, including how I
made them work so far and putting them in more general context.  Will
post in a separate thread.

Regards,
Dylan Hutchison

On Mon, Feb 23, 2015 at 2:16 PM, Adam Fuchs <[email protected]
<mailto:[email protected]>> wrote:

    Dylan,

    I think the way this is generally solved is by using an idempotent
    iterator that can be applied at both full major compaction and query
    scopes to give a consistent view. Aggregation, age-off filtering,
    and all the other "standard" iterators have the property that you
    can leave them in place and get a consistent answer even if they are
    applied multiple times. Major compaction and query-time iterators
    are even simpler than the general case, since you don't really need
    to worry about partial views of the underlying data. In your case I
    think you are trying to use an iterator that needs to be applied
    exactly once to a complete stream of data (either at query time or
    major compaction time). What we should probably do is look at
    options for more generally supporting that type of iterator. You
    could help us a ton by describing exactly what you want your
    iterator to do, and we can all propose a few ideas for how this
    might be implemented. Here are a couple off the top of my head:

    1. If you can reform your iterator so that it is idempotent then you
    can apply it liberally. This might be possible using some sort of
    flag that the major compactor puts in the data and the query-time
    iterator looks for to determine if the compaction has already
    happened. We often use version numbers in column families to this
    effect. Special row keys at the beginning of the tablet might also
    be an option. This would be doable without changes to Accumulo.

    2. We could build a mechanism into core accumulo that applies an
    iterator with exactly once semantics, such that the user submits a
    transformation as an iterator and it gets applied similarly to how
    you described. The query-time reading of results of the major
    compaction might be overkill, but that would be a possible
    optimization that we could think about engineering in a second pass.

    Adam



    On Mon, Feb 23, 2015 at 1:42 PM, Dylan Hutchison
    <[email protected] <mailto:[email protected]>> wrote:

        Thanks Adam and Keith.

        I see the following as a potential solution that achieves (1)
        low latency for clients that want to see entries after an
        iterator and (2) the entries from that iterator persisting in
        the Accumulo table.

         1. Start a major compaction in thread T1 of a client with the
            iterator set, blocking until the compaction completes.
         2. Start scanning in thread T2 of the client with the same
            iterator now set at scan-time scope. Use an isolated scanner
            to make sure we do not read the results of the major
            compaction committing, though this is not full-proof due to
            timing and that the isolated scanner is row-wise.
         3. Eventually, T1 unblocks and signals that the compaction
            completes.  T1 interrupts T2.
         4. Thread T2 stops scanning, removes the scan-time iterator,
            and starts scanning again at the point it last left off, now
            seeing the results of the major compaction which already
            passed through the iterator.

        The whole scheme is only necessary if the client wants results
        faster than the major compaction completes.  A disadvantage is
        duplicated work -- the iterator runs at scan-time and at
        compaction-time until the compaction finishes.  This may strain
        server resources.

        Will think about other schemes.  If only we could attach an
        apply-once scan-time iterator, that also persists its results to
        an Accumulo table in a streaming fashion.  Or on the flip side,
        a one-time compaction iterator that streams results, such that
        we could scan from them right away instead of needing to wait
        for the entire compaction to complete.

        Regards,
        Dylan Hutchison

        On Mon, Feb 23, 2015 at 12:48 PM, Adam Fuchs <[email protected]
        <mailto:[email protected]>> wrote:

            Dylan,

            The effect of a major compaction is never seen in queries
            before the major compaction completes. At the end of the
            major compaction there is a multi-phase commit which
            eventually replaces all of the old files with the new file.
            At that point the major compaction will have completely
            processed the given tablet's data (although other tablets
            may not be synchronized). For long-running non-isolated
            queries (more than a second or so) the iterator tree is
            occasionally rebuilt and re-seeked. When it is rebuilt it
            will use whatever is the latest file set, which will include
            the results of a completed major compaction.

            In your case #1 that's a tricky guarantee to make across a
            whole tablet, but it can be made one row at a time by using
            an isolated iterator.

            To make your case #2 work, you probably will have to
            implement some higher-level logic to only start your query
            after the major compaction has completed, using an external
            mechanism to track the completion of your transformation.

            Adam


            On Mon, Feb 23, 2015 at 12:35 PM, Dylan Hutchison
            <[email protected] <mailto:[email protected]>> wrote:

                Hello all,

                When I initiate a full major compaction (with flushing
                turned on) manually via the Accumulo API
                
<https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#compact(java.lang.String,%20org.apache.hadoop.io.Text,%20org.apache.hadoop.io.Text,%20java.util.List,%20boolean,%20boolean)>,
                how does the table appear to

                 1. clients that started scanning the table before the
                    major compaction began;
                 2. clients that start scanning during the major compaction?

                I'm interested in the case where there is an iterator
                attached to the full major compaction that modifies
                entries (respecting sorted order of entries).

                The best possible answer for my use case, with case #2
                more important than case #1 and *low latency* more
                important than high throughput, is that

                 1. clients that started scanning before the compaction
                    began would not see entries altered by
                    the compaction-time iterator;
                 2. clients that start scanning during the major
                    compaction stream back entries as they finish
                    processing from the major compaction, such that the
                    clients /only/ see entries that have passed through
                    the compaction-time iterator.

                How accurate are these descriptions?  If #2 really were
                as I would like it to be, then a scan on the range
                (-inf,+inf) started after compaction would "monitor
                compaction progress," such that the first entry batch
                transmits to the scanner as soon as it is available from
                the major compaction, and the scanner finishes (receives
                all entries) exactly when the compaction finishes.  If
                this is not possible, I may make something to that
                effect by calling the blocking version of compact().

                Bonus: how does cancelCompaction()
                
<https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#cancelCompaction(java.lang.String)>
                affect clients scanning in case #1 and case #2?

                Regards,
                Dylan Hutchison





        --
        www.cs.stevens.edu/~dhutchis <http://www.cs.stevens.edu/~dhutchis>





--
www.cs.stevens.edu/~dhutchis <http://www.cs.stevens.edu/~dhutchis>

Re: Scans during Compaction

Reply via email to