Visibility labels are stored as written; they are not replaced with any
other kind of identifier. Considering nothing else, that means a
20-character visibility label takes up more space than a 2-character
one. This is a conscious decision to make sure it is completely
obvious what the label on some data is without an external lookup table.
Accumulo uses two strategies to reduce the size of data on disk:
run-length encoding and a compression algorithm. The run-length encoding
is used to prevent prefixes shared by sequential Keys from being stored
multiple times. For example, given the following Keys:
row1 cf:cq []
row2 cf:cq []
the RLE would prevent "row" from being stored a second time. Families
and qualifiers would only be replaced with a back-reference if there is
a common Key-prefix that extends into the family or qualifier.
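To make the row example concrete, here is a minimal Python sketch of the
prefix-sharing idea described above. It is only an illustration and not
Accumulo's actual on-disk format: the function names are hypothetical, and
it looks at the row portion of the Key only.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return how many leading bytes a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(rows):
    """Store each row as (bytes shared with the previous row, remaining suffix)."""
    encoded = []
    prev = b""
    for row in rows:
        shared = common_prefix_len(prev, row)
        encoded.append((shared, row[shared:]))
        prev = row
    return encoded


def prefix_decode(encoded):
    """Rebuild the original rows from the (shared, suffix) pairs."""
    rows = []
    prev = b""
    for shared, suffix in encoded:
        row = prev[:shared] + suffix
        rows.append(row)
        prev = row
    return rows


rows = [b"row1", b"row2"]
encoded = prefix_encode(rows)
decoded = prefix_decode(encoded)
print(encoded)
```

In this sketch the second entry keeps only a count of 3 shared bytes plus
the suffix "2", so "row" is written once.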
A compression algorithm, GZ by default, is then applied to the result of
the encoding. Snappy is another common compression algorithm used by
Accumulo instances.
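If you want a rough feel for how much a general-purpose compressor
absorbs a long, repeated label, the sketch below may help. It uses
Python's zlib (DEFLATE, the same algorithm family as GZ) on made-up
text lines, so it is only an approximation under those assumptions: the
line layout and the make_entries helper are hypothetical, and real
Accumulo files are compressed block by block after the encoding step.

```python
import zlib


def make_entries(label: str, count: int = 1000) -> bytes:
    """Build `count` text lines that differ only in the row number."""
    lines = [f"row{i:06d} cf:cq [{label}]\n" for i in range(count)]
    return "".join(lines).encode("utf-8")


long_raw = make_entries("VeryLargeLargeLarge&AlsoLargeLargeLarge")
short_raw = make_entries("A&B")

# Compress each data set as a single block with zlib's default settings.
long_packed = zlib.compress(long_raw)
short_packed = zlib.compress(short_raw)

# Size difference between the two labelings, before and after compression.
raw_gap = len(long_raw) - len(short_raw)
packed_gap = len(long_packed) - len(short_packed)

print("raw bytes:       ", len(long_raw), "vs", len(short_raw))
print("compressed bytes:", len(long_packed), "vs", len(short_packed))
print("gap raw / gap compressed:", raw_gap, "/", packed_gap)
```

On data this repetitive the gap after compression should be far smaller
than the raw gap, but measuring with your own data and table settings is
the only reliable way to estimate the real overhead.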
- Josh
[email protected] wrote:
Hi there,
My question is how Accumulo compression works with regard to visibility
labels.
Is there any difference between the expressions "VeryLargeLargeLarge &
AlsoLargeLargeLarge" and "A&B"? Will they be internally
compiled to a low data consuming structure?
Same question applies to column and qualifier names. Is there any
difference?
The reason for this question is simple – we are trying to find out what
would be the data utilization overhead for different approaches.
Regards
Roman