Accumulo: "BigTable" vs. "Document Model"
Hello, everyone.

I'd love to hear folks' input on using the "natural" data model of Accumulo ("BigTable" style) vs. more of a Document Model. I'll try to describe it succinctly with a contrived example.

Let's say I have one domain object I'd like to model, "SensorReadings". A single entry might look something like the following, with 4 distinct CF, CQ pairs.

RowKey: DeviceID-YYYYMMDD-ReadingID (e.g. 1-20150101-1234)
CF: "Meta", CQ: "Timestamp", Value:
CF: "Sensor", CQ: "Temperature", Value: 80.4
CF: "Sensor", CQ: "Humidity", Value: 40.2
CF: "Sensor", CQ: "Barometer", Value: 29.1

I might do queries like "get me all SensorReadings for 2015 for DeviceID = 1", and if I wanted to operate on each SensorReading as a single unit (and not as the 4 'rows' it returns for each one), I'd either have to aggregate the 4 CF, CQ pairs for each RowKey client side, or use something like the WholeRowIterator.

In addition, if I wanted to write a query like "for DeviceID = 1 in 2015, return me SensorReadings where Temperature > 90, Humidity < 40, Barometer > 31", I'd again have to either use the WholeRowIterator to 'see' each entire SensorReading in memory on the server for the compound query, or take the intersection of the results of 3 parallel, independent queries on the client side.

Where I am going with this is: what are the thoughts around creating a Java, Protobuf, Avro (etc.) object with these 4 CF, CQ pairs as fields and storing each SensorReading as a single 'Document'?

RowKey: DeviceID-YYYYMMDD
CF: ReadingID Value: Protobuf(Timestamp=123, Temperature=80.4, Humidity=40.2, Barometer=29.1)

This way you avoid having to use the WholeRowIterator, and unless you often have queries that only look at a tiny subset of your fields (let's say just "Temperature"), the serialization costs seem similar since Value is just bytes anyway.

Appreciate folks' experience and wisdom here. Hope this makes sense, happy to clarify.

Best.

-Mike
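To make the two layouts concrete, here is a minimal Python sketch that models both as plain in-memory data instead of going through the Accumulo client API. Several things in it are assumptions for illustration only: the second reading (ID 1235) and its values are made-up sample data, JSON stands in for Protobuf/Avro, and group_rows and matches are hypothetical helper names. It runs the compound query (Temperature > 90, Humidity < 40, Barometer > 31) against each layout.

```python
import json
from itertools import groupby

# Flat ("BigTable"-style) layout: one (row, cf, cq, value) entry per attribute.
# Sorting mimics the key order in which a scan would return entries.
# Reading 1235 is invented sample data.
flat_entries = sorted([
    ("1-20150101-1234", "Meta", "Timestamp", "123"),
    ("1-20150101-1234", "Sensor", "Temperature", "80.4"),
    ("1-20150101-1234", "Sensor", "Humidity", "40.2"),
    ("1-20150101-1234", "Sensor", "Barometer", "29.1"),
    ("1-20150101-1235", "Meta", "Timestamp", "124"),
    ("1-20150101-1235", "Sensor", "Temperature", "95.0"),
    ("1-20150101-1235", "Sensor", "Humidity", "35.5"),
    ("1-20150101-1235", "Sensor", "Barometer", "31.2"),
])


def group_rows(entries):
    """Client-side aggregation: fold consecutive entries that share a row ID
    into one dict keyed by column qualifier."""
    for row_id, cells in groupby(entries, key=lambda e: e[0]):
        yield row_id, {cq: value for _, _, cq, value in cells}


def matches(reading):
    """Compound predicate from the example query."""
    return (float(reading["Temperature"]) > 90
            and float(reading["Humidity"]) < 40
            and float(reading["Barometer"]) > 31)


flat_hits = [row_id for row_id, reading in group_rows(flat_entries)
             if matches(reading)]

# Document layout: (row, cf=ReadingID, serialized reading).
# JSON is only a stand-in for Protobuf/Avro here.
doc_entries = [
    ("1-20150101", "1234", json.dumps(
        {"Timestamp": 123, "Temperature": 80.4, "Humidity": 40.2, "Barometer": 29.1})),
    ("1-20150101", "1235", json.dumps(
        {"Timestamp": 124, "Temperature": 95.0, "Humidity": 35.5, "Barometer": 31.2})),
]

# Each entry already holds a whole reading, so no grouping step is needed,
# but every value has to be deserialized before the predicate can run.
doc_hits = [f"{row}-{reading_id}" for row, reading_id, raw in doc_entries
            if matches(json.loads(raw))]

print(flat_hits, doc_hits)
```

In the flat layout the grouping step is the work that client-side aggregation or the WholeRowIterator would do; in the document layout that step disappears and is replaced by a decode of each value.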
Re: Accumulo: "BigTable" vs. "Document Model"
Both will work, but I think the answer depends on the amount of data that you will be querying over and your query latency requirements. I would add the wikisearch[1] storage scheme (k/v table + indices) to your list as well. Then, personally, I would rate them in the following order as database size increases and query latency requirements decrease:

1. Document Model
2. K/V Model
3. K/V Model with indices (wikisearch)

[1] https://accumulo.apache.org/example/wikisearch.html

- Original Message -
From: "Michael Moss"
To: user@accumulo.apache.org
Sent: Friday, September 4, 2015 11:42:20 AM
Subject: Accumulo: "BigTable" vs. "Document Model"
Re: Accumulo: "BigTable" vs. "Document Model"
You could use a server-side iterator that does the filtering on the server, and returns a protobuf value for matching rows.

-Eric
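As a rough illustration of that idea, the following Python sketch simulates the behavior of such an iterator over an in-memory, key-sorted list of entries. It is not Accumulo's iterator API: filtering_iterator and hot_dry_high_pressure are hypothetical names, JSON stands in for the protobuf value, and reading 1235 is made-up sample data. The point is only the shape of the logic: buffer one row at a time, test it, and emit a single encoded value per matching row.

```python
import json


def filtering_iterator(sorted_entries, predicate):
    """Yield (row_id, encoded_document) for each row whose cells satisfy
    the predicate. Entries must arrive sorted by row, as in a scan."""
    current_row, cells = None, {}
    for row_id, _cf, cq, value in sorted_entries:
        if row_id != current_row:
            # Row boundary: decide whether to emit the row just finished.
            if current_row is not None and predicate(cells):
                yield current_row, json.dumps(cells, sort_keys=True)
            current_row, cells = row_id, {}
        cells[cq] = float(value)
    # Emit the final buffered row, if it matches.
    if current_row is not None and predicate(cells):
        yield current_row, json.dumps(cells, sort_keys=True)


def hot_dry_high_pressure(cells):
    """Compound predicate from the original question."""
    return (cells["Temperature"] > 90
            and cells["Humidity"] < 40
            and cells["Barometer"] > 31)


# Reading 1235 is invented sample data.
entries = sorted([
    ("1-20150101-1234", "Meta", "Timestamp", "123"),
    ("1-20150101-1234", "Sensor", "Temperature", "80.4"),
    ("1-20150101-1234", "Sensor", "Humidity", "40.2"),
    ("1-20150101-1234", "Sensor", "Barometer", "29.1"),
    ("1-20150101-1235", "Meta", "Timestamp", "124"),
    ("1-20150101-1235", "Sensor", "Temperature", "95.0"),
    ("1-20150101-1235", "Sensor", "Humidity", "35.5"),
    ("1-20150101-1235", "Sensor", "Barometer", "31.2"),
])

results = list(filtering_iterator(entries, hot_dry_high_pressure))
for row_id, document in results:
    print(row_id, document)
```

Only one row is buffered at a time, so the memory held per scan is bounded by the largest single reading, and the client receives one value per match instead of four.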
Re: Accumulo: "BigTable" vs. "Document Model"
Sqrrl uses a hybrid approach. For records that are relatively static we use a compacted form, but for maintaining aggregates and for making updates to the compacted form documents we use a more explicit form. This is done mostly through iterators and a fairly complex type system.

The big trade-off for us was storage footprint. We gain something like 30% more compression by using the compacted form, and that also translates into better ingest and query performance. I can tell you it takes a significant engineering investment to make this work without overspecializing, so make sure your use case warrants it.

Cheers,
Adam
Re: Accumulo: "BigTable" vs. "Document Model"
These days, I tend to lean towards breaking out each attribute in a record into discrete columns. When you roll up multiple columns into a single value, you lose the ability to use the native column filtering (cf or cf+cq) that's built into Accumulo. Same goes for column visibilities (at least in the traditional sense). Deletes and updates are more difficult to reason about and require some extra coordination to work.

You can always aggregate many rows on the server dynamically if that makes processing things as one "entry" simpler.
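A small Python sketch of the column filtering and update points, using plain dicts as stand-ins for the two table layouts. Nothing here is Accumulo API: fetch_column and fetch_field are hypothetical helpers, JSON stands in for the serialized document, and the new Humidity value of 41.0 is made up.

```python
import json

# Flat layout: key = (row, cf, cq), one cell per attribute.
flat = {
    ("1-20150101-1234", "Meta", "Timestamp"): "123",
    ("1-20150101-1234", "Sensor", "Temperature"): "80.4",
    ("1-20150101-1234", "Sensor", "Humidity"): "40.2",
    ("1-20150101-1234", "Sensor", "Barometer"): "29.1",
}

# Document layout: key = (row, cf=ReadingID), one serialized value per reading.
docs = {
    ("1-20150101", "1234"): json.dumps(
        {"Timestamp": 123, "Temperature": 80.4, "Humidity": 40.2, "Barometer": 29.1}),
}


def fetch_column(table, cf, cq):
    """Mimics cf+cq column filtering: cells are selected by key alone,
    so no value is ever decoded."""
    return {row: value for (row, f, q), value in table.items()
            if (f, q) == (cf, cq)}


def fetch_field(table, field):
    """Document layout: the field lives inside the value, so every value
    must be deserialized to reach it."""
    return {f"{row}-{reading_id}": json.loads(raw)[field]
            for (row, reading_id), raw in table.items()}


temps_flat = fetch_column(flat, "Sensor", "Temperature")
temps_docs = fetch_field(docs, "Temperature")

# Updating one attribute in the flat layout is a single independent write.
flat[("1-20150101-1234", "Sensor", "Humidity")] = "41.0"

# In the document layout it is a read-modify-write of the whole value,
# which is where the extra coordination comes from if writers can race.
key = ("1-20150101", "1234")
reading = json.loads(docs[key])
reading["Humidity"] = 41.0
docs[key] = json.dumps(reading)

print(temps_flat, temps_docs)
```

The same asymmetry applies to per-column visibilities: in the flat layout each cell can carry its own label, while a rolled-up value has only one key to attach a label to.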
Re: Accumulo: "BigTable" vs. "Document Model"
+1 for Eric's suggestion. I used this technique. It seemed to work nicely.

When storing ProtoBuf, JSON, or any other 'document', remember to factor in the parsing needed during iteration. This affects both CPU and memory requirements on the tservers.

On Fri, Sep 4, 2015 at 11:53 AM, Eric Newton wrote:
> You could use a server-side iterator that does the filtering on the
> server, and returns a protobuf value for matching rows.
>
> -Eric