RE: micro compaction

roman.drap...@baesystems.com Tue, 09 Jun 2015 14:11:42 -0700

I am using:

Mutation m = new Mutation(rowId);
m.put(f1, q1, v1);
m.put(f2, q2, v2);
m.put(f3, q3, v3);

I guess it’s a native one? If not, what should I use?

Thanks
Roman

From: Keith Turner [mailto:ke...@deenlo.com]
Sent: 09 June 2015 22:04
To: user@accumulo.apache.org
Subject: Re: micro compaction

On Tue, Jun 9, 2015 at 4:06 PM, roman.drap...@baesystems.com wrote:
My view is that introduction of ingest-time iterators would be quite a useful 
feature. Anyway. ☺

Also, could anyone explain exactly why a composite mutation performs pretty 
much the same way as a set of individual mutations?

One large composite mutation with 19 qualifiers inside is only 10-30% faster 
than 19 individual mutations.

One difference is that the row has to be sent over RPC 19 times vs once, so the 
size of the row will impact this.
Are you using native maps?  The structure of the native map is 
Map<row, Map<col, val>>.  For a mutation with 19 cols, the row is looked up 
once to find the column map.  For the non-native map the structure is 
Map<Key, Value>.  Conceptually, for this you keep looking up the row (or do 
multiple compares of the row for each column in the mutation).
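
A small stand-alone Java sketch can make the difference between the two layouts concrete. It is not Accumulo's implementation: plain `TreeMap`s and strings stand in for the native map and for `Key`/`Value`, and every class and method name here is made up for illustration. It counts how often row values are compared while 19 columns are inserted for one new row.

```java
import java.util.Comparator;
import java.util.TreeMap;

/** Toy model of the two in-memory map layouts; not Accumulo code. */
final class RowLookupSketch {
    /** Number of row-to-row comparisons performed since the last reset. */
    static long rowCompares = 0;

    /** Compares two rows and counts the call. */
    static int compareRows(String a, String b) {
        rowCompares++;
        return a.compareTo(b);
    }

    /** Flat layout: one sorted map keyed by {row, column}. */
    static long insertFlat(int existingRows, String row, int columns) {
        Comparator<String[]> byRowThenColumn = (x, y) -> {
            int c = compareRows(x[0], y[0]);
            return c != 0 ? c : x[1].compareTo(y[1]);
        };
        TreeMap<String[], String> map = new TreeMap<>(byRowThenColumn);
        for (int i = 0; i < existingRows; i++) {
            map.put(new String[] {"row" + i, "q0"}, "v");
        }
        rowCompares = 0; // count only the work done for the new mutation
        for (int c = 0; c < columns; c++) {
            // Every column insert walks the tree again, comparing rows each step.
            map.put(new String[] {row, "q" + c}, "v" + c);
        }
        return rowCompares;
    }

    /** Nested layout: row -> (column -> value); the row is located once. */
    static long insertNested(int existingRows, String row, int columns) {
        Comparator<String> byRow = RowLookupSketch::compareRows;
        TreeMap<String, TreeMap<String, String>> map = new TreeMap<>(byRow);
        for (int i = 0; i < existingRows; i++) {
            TreeMap<String, String> cols = new TreeMap<>();
            cols.put("q0", "v");
            map.put("row" + i, cols);
        }
        rowCompares = 0; // count only the work done for the new mutation
        TreeMap<String, String> cols = map.get(row);
        if (cols == null) {
            cols = new TreeMap<>();
            map.put(row, cols);
        }
        for (int c = 0; c < columns; c++) {
            cols.put("q" + c, "v" + c); // compares column names only
        }
        return rowCompares;
    }

    public static void main(String[] args) {
        System.out.println("flat row compares:   " + insertFlat(1000, "newRow", 19));
        System.out.println("nested row compares: " + insertNested(1000, "newRow", 19));
    }
}
```

The exact counts depend on the tree shape, but the nested layout pays for the row lookup once per mutation, while the flat layout pays for it on every column.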

From: Russ Weeks [mailto:rwe...@newbrightidea.com]
Sent: 09 June 2015 20:54
To: accumulo-user
Subject: Re: micro compaction

For consistency and ease of implementation. Say I've written a stack of 
combiners that do statistical aggregation, sampling etc. on my table. Rather 
than port that logic to a Storm topology or to the DStream API I'd just like to 
turn that stack on in my BatchWriter.

On Tue, Jun 9, 2015 at 12:47 PM David Medinets <david.medin...@gmail.com> wrote:
Consider using Storm, Pig, Spark, or your own framework to handle the in-memory 
aggregation before giving the data to the BatchWriter. Why would any part of 
Accumulo code be responsible for this kind of application-specific data 
handling?

On Tue, Jun 9, 2015 at 3:17 PM, roman.drap...@baesystems.com wrote:
Just to clarify the origin of my question.

I had to do some performance tests to compare different storage types of “raw” 
data against each other.

Hopefully the picture below is visible on the mailing list. If not, I will put 
it somewhere else.

6 million “original” records, 1.3GB data, 233 bytes per record
Each original record is 40 tab-delimited fields, of which 19 on average are non-null
BatchWriter, single Java program

The first three bars represent a single “heavy” mutation that inserts the whole 
tabular line / serialized object.
Bars 4, 5, 6, 7: composite mutation (all qualifiers for the same rowid in one 
mutation)
Bars 8, 9, 10, 11: individual mutations (all qualifiers for the same rowid in 
separate mutations), ~19 mutations per original record

On average, single “heavy” mutations are 7-10 times faster than anything else, 
composite are 10%-35% faster than individual.
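
For scale, the mutation counts implied by the figures above can be worked out directly. The sketch below is only back-of-the-envelope arithmetic on the approximate numbers quoted in this message. Reading "1.3GB" as 1.3 GiB is an assumption; it is the reading that reproduces the quoted 233 bytes per record. The class and method names are illustrative.

```java
/** Back-of-the-envelope numbers from the test description; inputs are approximate. */
final class IngestArithmetic {
    static final long RECORDS = 6_000_000L;            // "6 million original records"
    static final int NON_NULL_FIELDS = 19;             // average non-null fields per record
    static final double DATA_BYTES = 1.3 * (1L << 30); // "1.3GB", read as GiB (assumption)

    /** One mutation per record: the "heavy" and the composite cases. */
    static long mutationsPerRecordStyle() {
        return RECORDS;
    }

    /** One mutation per non-null field: the individual case. */
    static long individualMutations() {
        return RECORDS * NON_NULL_FIELDS;
    }

    /** Average record size in bytes, rounded to the nearest byte. */
    static long bytesPerRecord() {
        return Math.round(DATA_BYTES / RECORDS);
    }

    public static void main(String[] args) {
        System.out.println("heavy or composite mutations: " + mutationsPerRecordStyle());
        System.out.println("individual mutations:         " + individualMutations());
        System.out.println("bytes per record:             " + bytesPerRecord());
    }
}
```

So the individual case sends roughly 114 million mutations against 6 million for the composite case, yet the measured gap is only 10%-35%.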

I am not an expert in how Accumulo is implemented internally; however, it looks 
like a composite mutation is treated more or less the same way as a set of 
individual mutations. Probably the largest overhead is added by the WAL.

[cid:image001.png@01D0A301.1722B630]

Data utilization before and after manual compaction of test table and all 
system tables:

[cid:image002.png@01D0A301.1722B630]

It’s not clear why “accumulo du” reports half as much data used as “hdfs du”.

All these tests made us think that we can improve performance by doing some 
calculations in-memory (our use-case fits this very well) and reducing the 
number of mutations. Now I am trying to understand whether there is a 
relatively easy way to do this with Accumulo or whether it’s time to look 
closer into something like Spark.

Thanks
Roman

From: Adam Fuchs [mailto:afu...@apache.org]
Sent: 09 June 2015 19:08
To: user@accumulo.apache.org
Subject: Re: micro compaction

I think this might be the same concept as in-mapper combining, but applied to 
data being sent to a BatchWriter rather than an OutputCollector. See [1], 
section 3.1.1. A similar performance analysis and probably a lot of the same 
code should apply here.
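
As a rough illustration of that idea, the sketch below buffers per-key sums in memory and emits one combined value per key on flush, the way an in-mapper combiner would. It is a minimal sketch under stated assumptions, not an Accumulo API: `CombiningBuffer`, its `sink` callback, and the string keys are all hypothetical. In a real client the sink would build a `Mutation` and hand it to the `BatchWriter`, and summing stands in for whatever combiner the table actually uses. Flushing partial sums is only safe when that combiner is associative and commutative and is still configured on the table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Client-side pre-aggregation buffer; a sketch, not part of Accumulo. */
final class CombiningBuffer {
    private final Map<String, Long> sums = new HashMap<>();
    private final int maxKeys;
    private final BiConsumer<String, Long> sink; // hypothetical stand-in for the writer

    CombiningBuffer(int maxKeys, BiConsumer<String, Long> sink) {
        this.maxKeys = maxKeys;
        this.sink = sink;
    }

    /** Adds delta to the running sum for key, flushing first if the buffer is full. */
    void add(String key, long delta) {
        if (!sums.containsKey(key) && sums.size() >= maxKeys) {
            flush(); // bound memory use; partial sums are combined again server-side
        }
        sums.merge(key, delta, Long::sum);
    }

    /** Emits one combined value per buffered key and empties the buffer. */
    void flush() {
        sums.forEach(sink);
        sums.clear();
    }
}

final class CombiningBufferDemo {
    /** Feeds `updates` increments spread over `keys` distinct keys; returns the emitted count. */
    static int emittedFor(int updates, int keys) {
        int[] emitted = {0};
        CombiningBuffer buffer = new CombiningBuffer(10_000, (key, sum) -> emitted[0]++);
        for (int i = 0; i < updates; i++) {
            buffer.add("row" + (i % keys) + ":count", 1L);
        }
        buffer.flush();
        return emitted[0];
    }

    public static void main(String[] args) {
        System.out.println("updates in: 1000000, values out: " + emittedFor(1_000_000, 1_000));
    }
}
```

The reduction depends entirely on how many updates share a key before a flush, so the buffer size is the main tuning knob.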

Cheers,
Adam

[1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <rwe...@newbrightidea.com> wrote:
Having a combiner stack (more generally an iterator stack) run on the 
client-side seems to be the second most popular request on this list. The most 
popular being, "How do I write to Accumulo from inside an iterator?"

Such a thing would be very useful for me, too. I have some cycles to help out, 
if somebody can give me an idea of where to get started and where the potential 
land-mines are.

-Russ

On Tue, Jun 9, 2015 at 9:08 AM roman.drap...@baesystems.com wrote:
The aggregated output is tiny, so if I do the same calculations in memory 
(instead of sending mutations to Accumulo), I can reduce the overall number of 
mutations by 1000x or so.

-----Original Message-----
From: Josh Elser [mailto:josh.el...@gmail.com]
Sent: 09 June 2015 16:54
To: user@accumulo.apache.org
Subject: Re: micro compaction

Well, you win the prize for new terminology. I haven't ever heard the term 
"micro compaction" before.

Can you clarify, though: you say hundreds of millions of mutations that result 
in megabytes of data. Is that an increase or decrease in size?
Comparing apples to oranges :)

roman.drap...@baesystems.com wrote:
> Hi guys,
>
> While doing pre-analytics we generate hundreds of millions of
> mutations that result in 1-100 megabytes of useful data after major
> compaction. We ingest into Accumulo using MR from a Mapper job. We
> identified that performance really degrades as the number of
> mutations increases.
>
> The obvious improvement is to do some calculations in-memory before
> sending mutations to Accumulo.
>
> Of course, at the same time we are looking for a solution to minimize
> development effort.
>
> I guess I am asking about micro compaction/ingest-time iterators on
> the client side (before data is sent to Accumulo).
>
> To my understanding, Accumulo does not support them; is that correct?
> And if so, are there any plans to support this functionality in the future?
>
> Thanks
>
> Roman
>
> Please consider the environment before printing this email. This
> message should be regarded as confidential. If you have received this
> email in error please notify the sender and destroy it immediately.
> Statements of intent shall only become binding when confirmed in hard
> copy by an authorised signatory. The contents of this email may relate
> to dealings with other companies under the control of BAE Systems
> Applied Intelligence Limited, details of which can be found at
> http://www.baesystems.com/Businesses/index.htm.