Re: Long vs String for qualifier

N Kapshoo Mon, 21 Jun 2010 12:04:37 -0700

I thought through the counter option, but I think storing the actual
value per doc makes sense so that I don't have to worry about reading
and updating the counter on the client side every time.


The Bytes class is quite powerful!
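
For illustration only, here is a rough sketch of the 9-byte composite
qualifier (docId followed by a type byte) discussed in the quoted replies
below. It uses plain java.nio.ByteBuffer so it runs without HBase on the
classpath. The class name QualifierSketch, the helper names and the two type
constants are made up for this example and are not part of HBase. With HBase
available, Bytes.putLong(arr, 0, docId) as in Jonathan's snippet fills the
same 8 big-endian bytes.

```java
import java.nio.ByteBuffer;

public class QualifierSketch {
    // Hypothetical type markers: 0 = per-doc info column, 1 = status column.
    static final byte TYPE_INFO = 0;
    static final byte TYPE_STATUS = 1;

    // Builds <8-byte big-endian docId><1 type byte> with a single allocation.
    static byte[] compositeQualifier(long docId, byte type) {
        return ByteBuffer.allocate(9).putLong(docId).put(type).array();
    }

    // Reads the docId back out of the first 8 bytes of the qualifier.
    static long docIdOf(byte[] qualifier) {
        return ByteBuffer.wrap(qualifier).getLong();
    }

    public static void main(String[] args) {
        byte[] q = compositeQualifier(42L, TYPE_STATUS);
        System.out.println(q.length + " bytes, docId=" + docIdOf(q) + ", type=" + q[8]);
    }
}
```

Because the type byte comes last, the info and status qualifiers for the same
docId differ only in their final byte and so sort next to each other.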

Which now makes me think: should I separate out the entire JSON object
into its own qualifier values? In other words, should I use composite
keys instead of composite values for the other family DocInfo?

I see 1 advantage: not having to worry about JSON versioning. As far as
querying support goes, since the docIds are incremental, I don't see
composite keys giving any additional real-time querying help.
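
On the sorting side of the original question, a small sketch of why the
long encoding keeps incremental docIds in order while a decimal string does
not. Qualifiers are compared as unsigned bytes, so the comparison is written
out by hand here. SortOrderSketch and its helpers are hypothetical names for
this example, and the claim only holds for non-negative ids (a negative long
has its top bit set and would sort after the positive ones).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SortOrderSketch {
    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Fixed-width 8-byte big-endian encoding of the id.
    static byte[] asLong(long id) {
        return ByteBuffer.allocate(8).putLong(id).array();
    }

    // Variable-width decimal string encoding of the id.
    static byte[] asString(long id) {
        return Long.toString(id).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Long encoding: 9 sorts before 10, matching numeric order.
        System.out.println(compareBytes(asLong(9), asLong(10)) < 0);
        // String encoding: "10" sorts before "9", breaking numeric order.
        System.out.println(compareBytes(asString(9), asString(10)) < 0);
    }
}
```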

So I guess if I am accessing the data together all the time, then
having it in place as a JSON object makes more sense.

If someone has done something otherwise, please let me know what I am
missing. In my case, I don't see any benefit of using composite keys
for the DocInfo family.

Jonathan, thanks so much for all the input so far.

On Mon, Jun 21, 2010 at 1:09 PM, Jonathan Gray <[email protected]> wrote:
> To implement the fast count, you may want to either keep counters in a 
> separate family or store it in another family.  You would pay a storage cost 
> but that's usually a trade-off we make in HBase to ensure optimal efficiency 
> at read time.
>
> As for how to append, you are almost right.  You would get an 
> ArrayIndexOutOfBoundsException with the code you wrote (Bytes.toBytes(long) 
> returns a byte array of length 8, so arr[8] is out of bounds).
>
> One option is to use the Bytes.add(byte[],byte[]) method.  For example:
>
> byte [] type = new byte [] { 1 };
> byte [] arr = Bytes.add(Bytes.toBytes(docId), type);
>
> That would append a '1' byte to the end of the docId.
>
> You can be more efficient (only allocating a byte[] once) by doing something 
> more like:
>
> byte [] arr = new byte[9];
> Bytes.putLong(arr, 0, docId);
> arr[8] = 1;
>
> Most things can be done with the Bytes class so poke around.
>
> JG
>
>> -----Original Message-----
>> From: N Kapshoo [mailto:[email protected]]
>> Sent: Monday, June 21, 2010 10:45 AM
>> To: [email protected]
>> Subject: Re: Long vs String for qualifier
>>
>> Now here is my conundrum:
>> I would be doing both queries very often. The UI shows count(status=Y)
>> on the very first page and then depending on whether user does a
>> listing, would show all other info and status info per doc.
>>
>> Is it a bad idea to have both a new ColumnFamily and store it in a
>> qualifier as well? Same data in 2 places, but it would help the read
>> performance in both queries right?
>>
>> When you say append a byte, I assume something like this, am I right?
>>
>> byte[] arr = Bytes.toBytes(docId);
>> arr[arr.length] = '0';
>>
>> Thanks so much for your help.
>>
>> On Mon, Jun 21, 2010 at 12:33 PM, Jonathan Gray <[email protected]>
>> wrote:
>> > Got it.
>> >
>> > Well, you could do what you're describing below, appending something
>> at the end of the docId to notate that it's the status column.  You
>> wouldn't need to use a "_status" string, could be as simple as
>> appending an additional byte of type information.
>> >
>> > Another option is to break status into a separate column family.
>> >
>> > What are the most common queries and which query is most critical
>> performance-wise?
>> >
>> > Are you most interested in "give me all docs and their statuses for
>> user X" or more like "give me the info for doc Y" or "give me status
>> for doc Z"?
>> >
>> > If the first one, then seems like adding a type byte after the docId
>> would make the most sense and be most optimal.
>> >
>> > JG
>> >
>> >> -----Original Message-----
>> >> From: N Kapshoo [mailto:[email protected]]
>> >> Sent: Monday, June 21, 2010 10:26 AM
>> >> To: [email protected]
>> >> Subject: Re: Long vs String for qualifier
>> >>
>> >> Thanks for the quick reply.
>> >>
>> >> I have a schema design based on ids because I actually have the ids
>> as
>> >> rowids in another table. This is to avoid data redundancy since we
>> >> might have a big doc referenced by millions of users, but we don't
>> want
>> >> to store a copy for every user. So,
>> >>
>> >> Table: Docs
>> >> Row: docId (long generated by incrementColumnValue)
>> >> ColFamily: Data
>> >>
>> >> Table: Users
>> >> Row: UserId
>> >> ColFamily: DocInfo
>> >> Qualifier: docId
>> >> Value: More information per user (JSON)
>> >>
>> >> Now in addition:
>> >> ColFamily: DocInfo
>> >> Qualifier: docId_status
>> >> Value: Status
>> >>
>> >> Now I want a status on each doc for each user. This status might
>> >> change several times.
>> >> The first column, docInfo, is static; its value doesn't change once
>> >> inserted. However the status can be toggled back and forth (between
>> Y
>> >> and N).
>> >>
>> >> The docs per user should always be sorted by docId.
>> >>
>> >> How would you design it? I am not sure how I can get the values into
>> >> the qualifiers when it should be sorted by docId always. Thank you.
>> >>
>> >> On Mon, Jun 21, 2010 at 12:12 PM, Jonathan Gray <[email protected]>
>> >> wrote:
>> >> > Can you describe your schema a bit more?  Could you use versioning
>> >> instead of incrementing IDs on the qualifiers?
>> >> >
>> >> > Also, you could consider having a composite value, so id1_asLong
>> >> would have a value that contained both val1 and val5 in your
>> example.
>> >>  You could use any number of serialization strategies (comma-
>> separated,
>> >> JSON, Thrift/protobuf, Writable, etc).
>> >> >
>> >> > If you want them as two columns, I would recommend that things you
>> >> want to retrieve together be neighboring.  For example, you might
>> make
>> >> the qualifiers a composite type of <id_as_long><qf_type>, so
>> >> <id1_asLong><0byte> for the existing stuff and <id1_asLong><1byte>
>> for
>> >> status?  That way they are stored sequentially so optimally
>> efficient
>> >> at read time.
>> >> >
>> >> > JG
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: N Kapshoo [mailto:[email protected]]
>> >> >> Sent: Monday, June 21, 2010 9:59 AM
>> >> >> To: [email protected]
>> >> >> Subject: Long vs String for qualifier
>> >> >>
>> >> >> I have a 'long' number that I get by using
>> >> >> HTable.'incrementColumnValue'. This long is used as the qualifier
>> id
>> >> >> on a columnFamily.
>> >> >>
>> >> >> Now I need to add a prefix 'status' so that I can store another
>> >> value
>> >> >> in the same family.
>> >> >>
>> >> >> How should I consider String vs long sorting?
>> >> >>
>> >> >> So right now:
>> >> >>
>> >> >> colFamily: id1_asLong = val1
>> >> >> colFamily: id2_asLong = val2
>> >> >> colFamily: id3_asLong = val3
>> >> >> colFamily: id4_asLong = val4
>> >> >>
>> >> >> and in addition
>> >> >>
>> >> >> colFamily: status_id1_asString = val5
>> >> >> colFamily: status_id2_asString = val6
>> >> >> colFamily: status_id3_asString = val7
>> >> >> colFamily: status_id4_asString = val8
>> >> >>
>> >> >> To make sure that 'id' values are sorted and accessed
>> sequentially,
>> >> >> should I change my design so that the id1_asLong is stored as
>> >> >> id1_asString?
>> >> >> When I do my Get, I always get id1_asLong and status_id1_asString
>> >> >> together.
>> >> >>
>> >> >> Thanks.
>> >> >
>> >
>
