RE: Long vs String for qualifier

Jonathan Gray Mon, 21 Jun 2010 11:09:57 -0700

To implement the fast count, you may want to keep the counters in a separate 
column family.  You would pay a storage cost, but that's usually a trade-off we 
make in HBase to ensure optimal efficiency at read time.


As for how to append, you are almost right.  You would get an 
ArrayIndexOutOfBoundsException with the code you wrote (Bytes.toBytes(long) 
returns a byte array of length 8, so arr[8] is out of bounds).

One option is to use the Bytes.add(byte[],byte[]) method.  For example:

byte [] type = new byte [] { 1 };
byte [] arr = Bytes.add(Bytes.toBytes(docId), type);

That would append a single byte with value 1 to the end of the docId.

You can be more efficient (only allocating a byte[] once) by doing something 
more like:

byte [] arr = new byte[9];
Bytes.putLong(arr, 0, docId);
arr[8] = 1;
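
If you want to try the 9-byte layout without HBase on the classpath, here is a 
rough sketch using the JDK's ByteBuffer in place of Bytes.  ByteBuffer is 
big-endian by default, which matches what Bytes.toBytes(long) produces.  The 
class name, the TYPE_INFO/TYPE_STATUS constants and the helper methods are made 
up for illustration; they are not part of HBase.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class CompositeQualifier {
    // Hypothetical type tags for the <docId><type byte> layout.
    static final byte TYPE_INFO = 0;
    static final byte TYPE_STATUS = 1;

    // Builds an 8-byte big-endian docId followed by one type byte,
    // allocating the 9-byte array exactly once.
    static byte[] qualifier(long docId, byte type) {
        return ByteBuffer.allocate(9).putLong(docId).put(type).array();
    }

    // Reads the docId back from the first 8 bytes.
    static long docId(byte[] qualifier) {
        return ByteBuffer.wrap(qualifier).getLong(0);
    }

    // Reads the type tag from the last byte.
    static byte type(byte[] qualifier) {
        return qualifier[8];
    }

    public static void main(String[] args) {
        byte[] q = qualifier(42L, TYPE_STATUS);
        System.out.println(Arrays.toString(q)); // [0, 0, 0, 0, 0, 0, 0, 42, 1]
        System.out.println(docId(q) + " / " + type(q)); // 42 / 1
    }
}
```

With this layout the info and status columns for one docId differ only in the 
last byte, so they end up next to each other in the row.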

Most things can be done with the Bytes class so poke around.
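
On the original long vs String question: HBase orders qualifiers 
lexicographically by their raw bytes, so the encoding decides the sort order.  
A quick sketch of the difference, again with plain JDK calls standing in for 
the Bytes class (the class and method names here are only for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class QualifierOrder {
    // 8-byte big-endian encoding, the same layout Bytes.toBytes(long) uses.
    static byte[] asLong(long id) {
        return ByteBuffer.allocate(8).putLong(id).array();
    }

    // Decimal string encoding of the same id.
    static byte[] asString(long id) {
        return Long.toString(id).getBytes(StandardCharsets.UTF_8);
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // As longs, 9 sorts before 10.
        System.out.println(compare(asLong(9), asLong(10)) < 0);     // true
        // As strings, "9" sorts after "10" because '9' > '1'.
        System.out.println(compare(asString(9), asString(10)) > 0); // true
    }
}
```

So fixed-width longs keep numeric order as long as the ids are non-negative 
(negative values have the high bit set and would sort after the positive ones), 
while unpadded decimal strings do not.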

JG

> -----Original Message-----
> From: N Kapshoo [mailto:[email protected]]
> Sent: Monday, June 21, 2010 10:45 AM
> To: [email protected]
> Subject: Re: Long vs String for qualifier
> 
> Now here is my conundrum:
> I would be doing both queries very often. The UI shows count(status=Y)
> on the very first page and then depending on whether user does a
> listing, would show all other info and status info per doc.
> 
> Is it a bad idea to have both a new ColumnFamily and store it in a
> qualifier as well? Same data in 2 places, but it would help the read
> performance in both queries right?
> 
> When you say append a byte, I assume something like this, am I right?
> 
> byte[] arr = Bytes.toBytes(docId);
> arr[arr.length] = '0';
> 
> Thanks so much for your help.
> 
> On Mon, Jun 21, 2010 at 12:33 PM, Jonathan Gray <[email protected]>
> wrote:
> > Got it.
> >
> > Well, you could do what you're describing below, appending something
> > at the end of the docId to notate that it's the status column.  You
> > wouldn't need to use a "_status" string, could be as simple as
> > appending an additional byte of type information.
> >
> > Another option is to break status into a separate column family.
> >
> > What are the most common queries and which query is most critical
> > performance-wise?
> >
> > Are you most interested in "give me all docs and their statuses for
> > user X" or more like "give me the info for doc Y" or "give me status
> > for doc Z"?
> >
> > If the first one, then seems like adding a type byte after the docId
> > would make the most sense and be most optimal.
> >
> > JG
> >
> >> -----Original Message-----
> >> From: N Kapshoo [mailto:[email protected]]
> >> Sent: Monday, June 21, 2010 10:26 AM
> >> To: [email protected]
> >> Subject: Re: Long vs String for qualifier
> >>
> >> Thanks for the quick reply.
> >>
> >> I have a schema design based on ids because I actually have the ids as
> >> rowids in another table. This is to avoid data redundancy since we
> >> might have a big doc referenced by millions of users, but we don't want
> >> to store a copy for every user. So,
> >>
> >> Table: Docs
> >> Row: docId (long generated by incrementColumnValue)
> >> ColFamily: Data
> >>
> >> Table: Users
> >> Row: UserId
> >> ColFamily: DocInfo
> >> Qualifier: docId
> >> Value: More information per user (JSON)
> >>
> >> Now in addition:
> >> ColFamily: DocInfo
> >> Qualifier: docId_status
> >> Value: Status
> >>
> >> Now I want a status on each doc for each user. This status might
> >> change several times.
> >> The first column, docInfo, is static; its value doesn't change once
> >> inserted. However, the status can be toggled back and forth (between
> >> Y and N).
> >>
> >> The docs per user should always be sorted by docId.
> >>
> >> How would you design it? I am not sure how I can get the values into
> >> the qualifiers when it should be sorted by docId always. Thank you.
> >>
> >> On Mon, Jun 21, 2010 at 12:12 PM, Jonathan Gray <[email protected]>
> >> wrote:
> >> > Can you describe your schema a bit more?  Could you use versioning
> >> > instead of incrementing IDs on the qualifiers?
> >> >
> >> > Also, you could consider having a composite value, so id1_asLong
> >> > would have a value that contained both val1 and val5 in your
> >> > example.  You could use any number of serialization strategies
> >> > (comma-separated, JSON, Thrift/protobuf, Writable, etc).
> >> >
> >> > If you want them as two columns, I would recommend that things you
> >> > want to retrieve together be neighboring.  For example, you might
> >> > make the qualifiers a composite type of <id_as_long><qf_type>, so
> >> > <id1_asLong><0byte> for the existing stuff and <id1_asLong><1byte>
> >> > for status?  That way they are stored sequentially so optimally
> >> > efficient at read time.
> >> >
> >> > JG
> >> >
> >> >> -----Original Message-----
> >> >> From: N Kapshoo [mailto:[email protected]]
> >> >> Sent: Monday, June 21, 2010 9:59 AM
> >> >> To: [email protected]
> >> >> Subject: Long vs String for qualifier
> >> >>
> >> >> I have a 'long' number that I get by using
> >> >> HTable.'incrementColumnValue'. This long is used as the qualifier
> >> >> id on a columnFamily.
> >> >>
> >> >> Now I need to add a prefix 'status' so that I can store another
> >> >> value in the same family.
> >> >>
> >> >> How should I consider String vs long sorting?
> >> >>
> >> >> So right now:
> >> >>
> >> >> colFamily: id1_asLong = val1
> >> >> colFamily: id2_asLong = val2
> >> >> colFamily: id3_asLong = val3
> >> >> colFamily: id4_asLong = val4
> >> >>
> >> >> and in addition
> >> >>
> >> >> colFamily: status_id1_asString = val5
> >> >> colFamily: status_id2_asString = val6
> >> >> colFamily: status_id3_asString = val7
> >> >> colFamily: status_id4_asString = val8
> >> >>
> >> >> To make sure that 'id' values are sorted and accessed sequentially,
> >> >> should I change my design so that the id1_asLong is stored as
> >> >> id1_asString?
> >> >> When I do my Get, I always get id1_asLong and status_id1_asString
> >> >> together.
> >> >>
> >> >> Thanks.
> >> >
> >
