Accumulo: "BigTable" vs. "Document Model"

2015-09-04 Thread Michael Moss
Hello, everyone.

I'd love to hear folks' input on using the "natural" data model of Accumulo
("BigTable" style) vs more of a Document Model. I'll try to succinctly
describe with a contrived example.

Let's say I have one domain object I'd like to model, "SensorReadings". A
single entry might look something like the following with 4 distinct CF, CQ
pairs.

RowKey: DeviceID-YYYYMMDD-ReadingID (e.g., 1-20150101-1234)
CF: "Meta", CQ: "Timestamp", Value: 
CF: "Sensor", CQ: "Temperature", Value: 80.4
CF: "Sensor", CQ: "Humidity", Value: 40.2
CF: "Sensor", CQ: "Barometer", Value: 29.1

I might do queries like "get me all SensorReadings for 2015 for DeviceID =
1". If I wanted to operate on each SensorReading as a single unit (and
not as the 4 key/value entries returned for each one), I'd either have to
aggregate the 4 CF, CQ pairs for each RowKey client side, or use something
like the WholeRowIterator.
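
As a rough illustration of the client-side option, the sketch below groups
a sorted stream of entries back into one record per row. It is plain Python
with no Accumulo client involved; the tuple layout, the sample values, and
the name group_by_row are assumptions made up for this example.

```python
from itertools import groupby

# Hypothetical scan results as (row, column family, column qualifier, value).
# They are sorted by row, then CF, then CQ, which is the order a scan
# over a sorted key/value store would return them in.
entries = [
    ("1-20150101-1234", "Meta", "Timestamp", "123"),
    ("1-20150101-1234", "Sensor", "Barometer", "29.1"),
    ("1-20150101-1234", "Sensor", "Humidity", "40.2"),
    ("1-20150101-1234", "Sensor", "Temperature", "80.4"),
    ("1-20150102-1235", "Meta", "Timestamp", "124"),
    ("1-20150102-1235", "Sensor", "Barometer", "31.4"),
    ("1-20150102-1235", "Sensor", "Humidity", "35.5"),
    ("1-20150102-1235", "Sensor", "Temperature", "95.0"),
]


def group_by_row(entries):
    """Yield (row, {(cf, cq): value}) for each run of entries sharing a row."""
    # groupby only merges adjacent items, so the input must be sorted by row.
    for row, group in groupby(entries, key=lambda e: e[0]):
        yield row, {(cf, cq): value for _, cf, cq, value in group}


readings = dict(group_by_row(entries))
```

Each value in readings is then one complete SensorReading, at the cost of
shipping every entry to the client before grouping.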

In addition, if I wanted to write a query like, "for DeviceID = 1 in 2015,
return me SensorReadings where Temperature > 90, Humidity < 40, Barometer >
31", I'd again have to either use the WholeRowIterator to 'see' each entire
SensorReading in memory on the server for the compound query, or I could
take the intersection of the results of 3 parallel, independent queries on
the client side.
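
To make the two options concrete, here is a small sketch that evaluates the
compound predicate once against whole rows and once by intersecting three
independent single-column results. The in-memory dict stands in for the
table, and scan_column is a hypothetical helper, not an Accumulo call.

```python
# Hypothetical per-row view of three readings, keyed by row ID.
rows = {
    "1-20150101-1234": {"Temperature": 80.4, "Humidity": 40.2, "Barometer": 29.1},
    "1-20150102-1235": {"Temperature": 95.0, "Humidity": 35.5, "Barometer": 31.4},
    "1-20150103-1236": {"Temperature": 97.2, "Humidity": 55.0, "Barometer": 31.8},
}


def matches(reading):
    """Compound predicate evaluated with the whole reading in view."""
    return (
        reading["Temperature"] > 90
        and reading["Humidity"] < 40
        and reading["Barometer"] > 31
    )


# Option 1: see each entire row at once and test it.
whole_row_hits = {row for row, reading in rows.items() if matches(reading)}


def scan_column(name, predicate):
    """Stand-in for one independent query over a single column."""
    return {row for row, reading in rows.items() if predicate(reading[name])}


# Option 2: three independent queries, intersected on the client.
hot = scan_column("Temperature", lambda v: v > 90)
dry = scan_column("Humidity", lambda v: v < 40)
high = scan_column("Barometer", lambda v: v > 31)
intersected_hits = hot & dry & high
```

Both approaches select the same rows; they differ in where the work happens
and in how many row IDs travel to the client before the intersection.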

Where I am going with this is, what are the thoughts around creating a
Java, Protobuf, Avro (etc) object with these 4 CF, CQ pairs as fields and
storing each SensorReading as a single 'Document'?

RowKey: DeviceID-YYYYMMDD
CF: ReadingID, Value: Protobuf(Timestamp=123, Temperature=80.4,
Humidity=40.2, Barometer=29.1)

This way you avoid having to use the WholeRowIterator, and unless you often
have queries that only look at a tiny subset of your fields (let's say just
"Temperature"), the serialization costs seem similar since Value is just
bytes anyway.
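
A minimal sketch of the document layout, assuming a fixed binary record in
place of Protobuf so that it runs with the Python standard library alone.
The READING format and the helper names are invented for this example; a
real schema would come from a .proto or Avro definition.

```python
import struct

# Stand-in for a Protobuf message: one 64-bit timestamp followed by three
# doubles, big-endian, with no padding (8 + 3 * 8 = 32 bytes).
READING = struct.Struct(">qddd")


def encode_reading(timestamp, temperature, humidity, barometer):
    """Pack one SensorReading into a single opaque value."""
    return READING.pack(timestamp, temperature, humidity, barometer)


def decode_reading(value):
    """Unpack a value back into (timestamp, temperature, humidity, barometer)."""
    return READING.unpack(value)


# One entry in the document layout: row = DeviceID-YYYYMMDD, CF = ReadingID.
row, cf = "1-20150101", "1234"
value = encode_reading(123, 80.4, 40.2, 29.1)
decoded = decode_reading(value)
```

The store only ever sees value as bytes, so reading any single field means
decoding the whole record first.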

Appreciate folks' experience and wisdom here. Hope this makes sense, happy
to clarify.

Best.

-Mike


Re: Accumulo: "BigTable" vs. "Document Model"

2015-09-04 Thread dlmarion

Both will work, but I think the answer depends on the amount of data that you
will be querying over and your query latency requirements. I would add the
wikisearch[1] storage scheme to your list as well (k/v table + indices).
Then, personally, I would rate them in the following order as database size
increases and the required query latency decreases:

1. Document Model 
2. K/V Model 
3. K/V Model with indices (wikisearch) 

[1] https://accumulo.apache.org/example/wikisearch.html 


Re: Accumulo: "BigTable" vs. "Document Model"

2015-09-04 Thread Eric Newton
You could use a server-side iterator that does the filtering on the server,
and returns a protobuf value for matching rows.
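
The idea can be sketched as a generator that decodes each stored document,
applies the predicate, and passes matching entries through unchanged. This
is plain Python that only mimics what a server-side iterator would do; it
does not use the Accumulo iterator API, and the struct-based READING format
is an assumed stand-in for Protobuf.

```python
import struct

# Assumed document encoding: timestamp plus three doubles (stand-in for Protobuf).
READING = struct.Struct(">qddd")


def filter_readings(entries, predicate):
    """Yield only the (key, value) pairs whose decoded reading satisfies predicate."""
    for key, value in entries:
        # Every value is parsed here, which is the per-entry cost of this approach.
        _timestamp, temperature, humidity, barometer = READING.unpack(value)
        if predicate(temperature, humidity, barometer):
            yield key, value


# Hypothetical table contents keyed by (row, column family).
table = [
    (("1-20150101", "1234"), READING.pack(123, 80.4, 40.2, 29.1)),
    (("1-20150101", "1235"), READING.pack(124, 95.0, 35.5, 31.4)),
    (("1-20150102", "1236"), READING.pack(125, 97.2, 55.0, 31.8)),
]

hits = list(filter_readings(table, lambda t, h, b: t > 90 and h < 40 and b > 31))
```

Only the matching documents would leave the server, still in their encoded
form, so the client decodes far fewer values.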

-Eric


Re: Accumulo: "BigTable" vs. "Document Model"

2015-09-04 Thread Adam Fuchs
Sqrrl uses a hybrid approach. For records that are relatively static we use
a compacted form, but for maintaining aggregates and for making updates to
compacted-form documents we use a more explicit form. This is done
mostly through iterators and a fairly complex type system. The big
trade-off for us was storage footprint. We gain something like 30% more
compression by using the compacted form, and that also translates into
better ingest and query performance. I can tell you it takes a significant
engineering investment to make this work without overspecializing, so make
sure your use case warrants it.

Cheers,
Adam


Re: Accumulo: "BigTable" vs. "Document Model"

2015-09-04 Thread Josh Elser
These days, I tend to lean towards breaking out each attribute in a 
record into discrete columns.


When you roll up multiple columns into a single value, you lose the 
ability to use the native column filtering (cf or cf+cq) that's built 
into Accumulo. Same goes for column visibilities (at least in the 
traditional sense). Deletes and updates are more difficult to reason 
about and require some extra coordination to work.
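
A small sketch of the column-filtering point, in plain Python rather than
the Accumulo client API. The helper fetch_column and the JSON encoding of
the rolled-up value are assumptions chosen for illustration.

```python
import json

# Discrete layout: one entry per attribute, as (row, cf, cq, value).
discrete = [
    ("1-20150101-1234", "Meta", "Timestamp", "123"),
    ("1-20150101-1234", "Sensor", "Barometer", "29.1"),
    ("1-20150101-1234", "Sensor", "Humidity", "40.2"),
    ("1-20150101-1234", "Sensor", "Temperature", "80.4"),
]


def fetch_column(entries, cf, cq):
    """Select one column by key alone; no value is ever parsed."""
    return [(row, value) for row, f, q, value in entries if (f, q) == (cf, cq)]


temps_from_columns = fetch_column(discrete, "Sensor", "Temperature")

# Rolled-up layout: one entry per reading, as (row, cf, encoded document).
documents = [
    (
        "1-20150101",
        "1234",
        json.dumps(
            {"Timestamp": 123, "Temperature": 80.4, "Humidity": 40.2, "Barometer": 29.1}
        ),
    ),
]

# The same question now requires decoding every document in full.
temps_from_documents = [
    (f"{row}-{cf}", json.loads(value)["Temperature"]) for row, cf, value in documents
]
```

With discrete columns the selection happens on keys; with a rolled-up value
the store cannot tell one field from another without help from custom code.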


You can always aggregate many rows on the server dynamically if that
makes processing things as one "entry" simpler.


Re: Accumulo: "BigTable" vs. "Document Model"

2015-09-04 Thread David Medinets
+1 for Eric's suggestion. I used this technique, and it seemed to work nicely.
When storing Protobuf, JSON, or any other 'document', remember to factor in
the parsing needed during iteration. This affects both CPU and memory
requirements on the tservers.

On Fri, Sep 4, 2015 at 11:53 AM, Eric Newton wrote:

> You could use a server-side iterator that does the filtering on the
> server, and returns a protobuf value for matching rows.
>
> -Eric