Re: Noob questions

2020-04-14 Thread Christopher
The `du` command should report sizes in bytes. Keep in mind that Accumulo
compresses data in its files. If the number doesn't match what you see
for the *.rf files in Hadoop, there may be a bug. Please let us know
if you find this to be the case.

On Tue, Apr 14, 2020 at 10:30 PM Niclas Hedhman wrote:
>
> Yes, a bit of experimentation and I figured that out.
>
> As for the "putIfAbsent": I can actually work that out from the data being
> written. In this case it is effectively an event store, and all rows start
> with a "created" event.
>
> One more small question:
> there is a "du" command. Does it really report the storage space needed in
> bytes, or is it kB? The number seems too small for bytes, and if it is in kB
> then it exceeds the HDFS physical disk usage...
>
> Cheers
> Niclas
>

Re: Noob questions

2020-04-14 Thread Niclas Hedhman
Yes, a bit of experimentation and I figured that out.

As for the "putIfAbsent": I can actually work that out from the data being
written. In this case it is effectively an event store, and all rows start
with a "created" event.

One more small question:
there is a "du" command. Does it really report the storage space needed in
bytes, or is it kB? The number seems too small for bytes, and if it is in kB
then it exceeds the HDFS physical disk usage...

Cheers
Niclas

On Tue, Apr 14, 2020 at 9:49 PM Adam J. Shook wrote:

> limitVersion = false would *not* set the default VersioningIterator, so
> Accumulo would effectively keep every entry you write. That sounds like it
> meets your requirement of "versions never to be removed", though keep in mind
> that repeated writes of your static "metadata" qualifier would also never be
> pruned or deleted.
>

Re: Noob questions

2020-04-14 Thread Emilio Lahr-Vivaz
You should be able to use a conditional writer to support 'put if 
absent': 
https://accumulo.apache.org/docs/2.x/getting-started/clients#conditionalwriter


Generally you would not want to repeatedly write the same key/value, as 
you will have to scan every single versioned entry when you want to read 
it back, which can make it much slower than you might expect to read a 
single row.


Thanks,

Emilio
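
For readers skimming the archive, here is a minimal sketch of the "put if absent" semantics discussed above. It is a toy model built on plain Java collections, not the Accumulo client API, and every name in it (PutIfAbsentSketch, writeIfAbsent, read) is hypothetical. In Accumulo itself the equivalent check is expressed through the ConditionalWriter linked above, so treat this only as an illustration of the accept/reject behaviour.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Toy model only: plain Java, NOT the Accumulo client API.
// All names here (PutIfAbsentSketch, writeIfAbsent, read) are hypothetical.
final class PutIfAbsentSketch {

    // Mirrors the idea of an accepted vs. rejected conditional write.
    enum Status { ACCEPTED, REJECTED }

    // Column (row, family, qualifier) -> value.
    private final ConcurrentMap<List<String>, String> cells = new ConcurrentHashMap<>();

    // Writes the value only if the column has no value yet. The check and the
    // write happen as one atomic step, so the caller does not read first.
    Status writeIfAbsent(String row, String family, String qualifier, String value) {
        List<String> column = Arrays.asList(row, family, qualifier);
        return cells.putIfAbsent(column, value) == null ? Status.ACCEPTED : Status.REJECTED;
    }

    // Returns the stored value for a column, or null if nothing was written.
    String read(String row, String family, String qualifier) {
        return cells.get(Arrays.asList(row, family, qualifier));
    }

    public static void main(String[] args) {
        PutIfAbsentSketch store = new PutIfAbsentSketch();
        System.out.println(store.writeIfAbsent("entity-1", "meta", "metadata", "v1")); // ACCEPTED
        System.out.println(store.writeIfAbsent("entity-1", "meta", "metadata", "v2")); // REJECTED
        System.out.println(store.read("entity-1", "meta", "metadata"));                // v1
    }
}
```

The second write is rejected, so the static "metadata" value is stored once instead of accumulating a new version on every write.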

On 4/14/20 9:48 AM, Adam J. Shook wrote:
> limitVersion = false would *not* set the default VersioningIterator, so
> Accumulo would effectively keep every entry you write. That sounds like it
> meets your requirement of "versions never to be removed", though keep in mind
> that repeated writes of your static "metadata" qualifier would also never be
> pruned or deleted.


Re: Noob questions

2020-04-14 Thread Adam J. Shook
limitVersion = false would *not* set the default VersioningIterator, so
Accumulo would effectively keep every entry you write. That sounds like it
meets your requirement of "versions never to be removed", though keep in mind
that repeated writes of your static "metadata" qualifier would also never be
pruned or deleted.
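
To make the difference concrete, here is a rough sketch that models a scan with and without a version limit. It assumes the key ordering described later in this thread (row, family, qualifier ascending, then timestamp newest first) and leaves out visibility for brevity. It uses plain Java collections, not the Accumulo API, and the names VersioningSketch, Entry and scan are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: plain Java collections, NOT the Accumulo client API.
final class VersioningSketch {

    // One written entry: a column (row + family + qualifier), a timestamp, a value.
    static final class Entry {
        final String row, family, qualifier, value;
        final long timestamp;

        Entry(String row, String family, String qualifier, long timestamp, String value) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }

        // Identifies the column, ignoring the timestamp.
        List<String> column() {
            return Arrays.asList(row, family, qualifier);
        }
    }

    // Row, family, qualifier ascending; timestamp descending (newest first).
    static final Comparator<Entry> KEY_ORDER = Comparator
            .comparing((Entry e) -> e.row)
            .thenComparing(e -> e.family)
            .thenComparing(e -> e.qualifier)
            .thenComparing(Comparator.comparingLong((Entry e) -> e.timestamp).reversed());

    // Returns the entries a scan would show, keeping at most maxVersions per
    // column. Integer.MAX_VALUE models a table with no version limit at all.
    static List<Entry> scan(List<Entry> written, int maxVersions) {
        List<Entry> sorted = new ArrayList<>(written);
        sorted.sort(KEY_ORDER);
        List<Entry> visible = new ArrayList<>();
        List<String> currentColumn = null;
        int seen = 0;
        for (Entry e : sorted) {
            if (!e.column().equals(currentColumn)) {
                currentColumn = e.column();
                seen = 0;
            }
            if (seen < maxVersions) {
                visible.add(e);
            }
            seen++;
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Entry> written = new ArrayList<>();
        written.add(new Entry("entity-1", "event", "created", 1L, "{}"));
        // The same static metadata written three times.
        written.add(new Entry("entity-1", "meta", "metadata", 1L, "static"));
        written.add(new Entry("entity-1", "meta", "metadata", 2L, "static"));
        written.add(new Entry("entity-1", "meta", "metadata", 3L, "static"));

        System.out.println(scan(written, 1).size());                 // 2
        System.out.println(scan(written, Integer.MAX_VALUE).size()); // 4
    }
}
```

With a limit of 1 the three metadata writes collapse to the newest one. With no limit, every copy stays visible and has to be read back.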

On Mon, Apr 13, 2020 at 8:47 PM Niclas Hedhman wrote:

> Ah! I had some misunderstandings implanted in me, and it is good to be
> corrected.
>
> For
>
> connector.tableOperations().create(String tableName, boolean limitVersion);
>
> Will limitVersion=false disable versioning completely, so that I always
> have only one version, or will it apply a "no limit" and "no removal" policy
> to versions?
>
> Well, to be clear, I am looking for "versions never to be removed", a
> requirement that made me smile and remember "Accumulo can do that
> automatically", rather than implement that at a higher level.
>
> Thanks
>
> On Tue, Apr 14, 2020 at 12:55 AM Adam J. Shook wrote:
>
>> Hi Niclas,
>>
>> 1. By default, Accumulo configures a VersioningIterator on all tables, which
>> ensures that you see the latest version of a particular entry, defined as
>> the entry that has the highest value for the timestamp.  Older versions of
>> the same key (row ID + family + qualifier + visibility) are compacted away
>> by Accumulo and will eventually be deleted.  You can set the number of
>> versions you want to keep to something other than the default of 1 (see
>> https://accumulo.apache.org/1.9/accumulo_user_manual.html#_versioning_iterators_and_timestamps).
>>
>> 2. Related to #1, Accumulo will update the value to the latest version of
>> the entry.  I believe if you keep writing the same entry with the same data
>> over and over again, you'll see every copy if you are keeping more than one
>> version of the same entry.  AFAIK there is no "put if absent" behavior
>> without reading for every write.  You can, of course, configure an existing
>> iterator or write your own to achieve whatever logic you want as far as
>> what versions to keep of what columns of your data model.
>>
>> 3. The "Scanner" will return entries in order.  Related to #1, it will
>> only return the latest version of an entry (by default).  If you are
>> keeping more versions of the same entry, then you would see the newest
>> entry first.  The "BatchScanner" is multi-threaded and communicates with
>> several tablets at once, returning entries out of order.  One common
>> pattern is to use the WholeRowIterator when scanning.  This iterator
>> serializes all entries with the same row into one entry on the server side,
>> then you can deserialize the row on the client side to view the entire
>> contents of a row at once.  The order of the rows themselves is still
>> undefined when using a BatchScanner due to the multi-threaded nature of the
>> scanner.
>>
>> Hope this helps!
>> --Adam
>>
>> On Mon, Apr 13, 2020 at 12:57 AM Niclas Hedhman wrote:
>>
>>> Hi,
>>> I am brand new to Accumulo, but have been tasked with putting it into what
>>> used to be Apache Polygene (now in the Attic) as an entity store, one that
>>> keeps history.
>>>
>>> I have a couple of questions;
>>> 1. Assuming that I can guarantee that no one executes any explicit
>>> deletes, can I rely on the mutation sequences not disappearing over time?
>>>
>>> 2. As part of storing a row, I have a "metadata" qualifier that contains
>>> static information. Since I don't know whether the row exists without
>>> reading it first, IIUC I will fill the "metadata" with the same
>>> information over and over again. Or does Accumulo realize that this is
>>> the same byte[] as before and not update the value, or alternatively
>>> create a new Key that points to the same Value?  I effectively want a
>>> "putIfAbsent()".
>>>
>>> 3. The Scanner can fetch multiple rows, constrained by CF and
>>> qualifier. I think that is quite clear. But what does the iterator()
>>> actually return? I presume that it is many key/value pairs, of ALL
>>> timestamped values. But what are the ordering guarantees here? I get the
>>> impression that within a row->cf->qualifier, the returned values are in
>>> timestamp order, newest first. And I think that within a row, I am
>>> guaranteed that the order is maintained, i.e. row -> cf -> qualifier (all
>>> ascending). But am I also guaranteed that the iterator is "done" with a row
>>> when the row has changed? Or can rows be interleaved in the iterator?
>>>
>>> Thanks in advance
>>> Niclas
>>>
>>