Both will work, but I think the answer depends on the amount of data that you 
will be querying over and your query latency requirements. I would include the 
wikisearch[1] storage scheme in your list as well (k/v table + indices).
Then, personally, I would rate them in the following order as database size
increases and the required query latency decreases:

1. Document Model 
2. K/V Model 
3. K/V Model with indices (wikisearch) 

[1] https://accumulo.apache.org/example/wikisearch.html 

----- Original Message -----

From: "Michael Moss" <[email protected]> 
To: [email protected] 
Sent: Friday, September 4, 2015 11:42:20 AM 
Subject: Accumulo: "BigTable" vs. "Document Model" 

Hello, everyone. 

I'd love to hear folks' input on using the "natural" data model of Accumulo 
("BigTable" style) vs more of a Document Model. I'll try to succinctly describe 
with a contrived example. 

Let's say I have one domain object I'd like to model, "SensorReadings". A 
single entry might look something like the following with 4 distinct CF, CQ 
pairs. 

RowKey: DeviceID-YYYYMMDD-ReadingID (e.g., 1-20150101-1234)
CF: "Meta", CQ: "Timestamp", Value: <Some timestamp> 
CF: "Sensor", CQ: "Temperature", Value: 80.4 
CF: "Sensor", CQ: "Humidity", Value: 40.2 
CF: "Sensor", CQ: "Barometer", Value: 29.1 

I might do queries like "get me all SensorReadings for 2015 for DeviceID = 1",
and if I wanted to operate on each SensorReading as a single unit (and not as
the 4 key/value entries a scan returns for each one), I'd either have to
aggregate the 4 CF, CQ pairs for each RowKey client side, or use something
like the WholeRowIterator.
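
To make the client-side option concrete, here is a minimal Python sketch that
groups a sorted stream of key/value entries back into one record per RowKey.
It uses plain tuples in place of a real Accumulo scanner; the `entries` data
and the `group_by_row` helper are hypothetical, and the grouping relies only on
entries arriving sorted by row, as a scan over a sorted table would return them.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical scan output: (row, cf, cq, value) tuples, sorted by row,
# then column family, then column qualifier.
entries = [
    ("1-20150101-1234", "Meta", "Timestamp", "1420070400"),
    ("1-20150101-1234", "Sensor", "Barometer", "29.1"),
    ("1-20150101-1234", "Sensor", "Humidity", "40.2"),
    ("1-20150101-1234", "Sensor", "Temperature", "80.4"),
    ("1-20150101-1235", "Meta", "Timestamp", "1420070460"),
    ("1-20150101-1235", "Sensor", "Barometer", "29.3"),
    ("1-20150101-1235", "Sensor", "Humidity", "41.0"),
    ("1-20150101-1235", "Sensor", "Temperature", "81.0"),
]


def group_by_row(sorted_entries):
    """Yield (row, {(cf, cq): value}) for each run of entries sharing a row."""
    for row, cells in groupby(sorted_entries, key=itemgetter(0)):
        yield row, {(cf, cq): value for _, cf, cq, value in cells}


# One dict per SensorReading instead of four separate entries.
readings = dict(group_by_row(entries))
```

Because the input is already sorted, the grouping is a single streaming pass
and only one reading needs to be held in memory at a time.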

In addition, if I wanted to write a query like, "for DeviceID = 1 in 2015, 
return me SensorReadings where Temperature > 90, Humidity < 40, Barometer > 
31", I'd again have to either use the WholeRowIterator to 'see' each entire 
SensorReading in memory on the server for the compound query, or I could take 
the intersection of the results of 3 parallel, independent queries on the 
client side. 

Where I am going with this is, what are the thoughts around creating a Java,
Protobuf, Avro (etc.) object with these 4 CF, CQ pairs as fields and storing
each SensorReading as a single 'Document'?

RowKey: DeviceID-YYYYMMDD
CF: ReadingID, Value: Protobuf(Timestamp=123, Temperature=80.4, Humidity=40.2,
Barometer=29.1)

This way you avoid having to use the WholeRowIterator, and unless you often
have queries that only look at a tiny subset of your fields (let's say just
"Temperature"), the serialization costs seem similar since Value is just bytes
anyway.
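
A small Python sketch of this document layout, under a few assumptions: JSON
stands in for Protobuf or Avro purely to keep the example dependency-free, a
dict keyed by (row, cf) stands in for the table, and `encode_reading` and
`decode_reading` are hypothetical helpers.

```python
import json


def encode_reading(timestamp, temperature, humidity, barometer):
    """Serialize one reading into a single opaque byte value."""
    doc = {
        "Timestamp": timestamp,
        "Temperature": temperature,
        "Humidity": humidity,
        "Barometer": barometer,
    }
    return json.dumps(doc, sort_keys=True).encode("utf-8")


def decode_reading(value):
    """Inverse of encode_reading: bytes back to a dict of fields."""
    return json.loads(value.decode("utf-8"))


# Hypothetical table: row = DeviceID-YYYYMMDD, cf = ReadingID, one value each.
table = {
    ("1-20150101", "1234"): encode_reading(123, 80.4, 40.2, 29.1),
    ("1-20150101", "1235"): encode_reading(124, 91.3, 38.7, 31.2),
}

# Each reading comes back as one entry, so no per-row regrouping is needed.
day = {
    cf: decode_reading(value)
    for (row, cf), value in sorted(table.items())
    if row == "1-20150101"
}

# Reading a single field still means decoding the whole document.
temperatures = [reading["Temperature"] for reading in day.values()]
```

The last line shows the trade-off mentioned above: a query that only needs
"Temperature" still pays to deserialize every field of every reading.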

Appreciate folks' experience and wisdom here. Hope this makes sense, happy to 
clarify. 

Best. 

-Mike 




