I thought through the counter option, but I think storing the actual value per doc makes sense, so that I don't have to worry about reading and updating the counter on the client side every time.
The Bytes class is quite powerful! Which now makes me think: should I separate out the entire JSON object into its own qualifier values? In other words, should I use composite keys instead of composite values for the other family, DocInfo? I see one advantage: not having to worry about JSON versioning. As far as querying support goes, since the docIds are incremental, I don't see any additional real-time querying help. So I guess if I am accessing the data together all the time, then having it in place as a JSON object makes more sense. In my case, I don't see any benefit of using composite keys for the DocInfo family. If someone has done something otherwise, please let me know what I am missing.

Jonathan, thanks so much for all the input so far.

On Mon, Jun 21, 2010 at 1:09 PM, Jonathan Gray <[email protected]> wrote:
> To implement the fast count, you may want to either keep counters in a
> separate family or store it in another family. You would pay a storage cost
> but that's usually a trade-off we make in HBase to ensure optimal efficiency
> at read time.
>
> As for how to append, you are almost right. You would get an
> ArrayIndexOutOfBoundsException with the code you write (Bytes.toBytes(long)
> returns a byte array of length 8 so arr[8] is out of bounds).
>
> One option is to use the Bytes.add(byte[],byte[]) method. For example:
>
> byte [] type = new byte [] { 1 };
> byte [] arr = Bytes.add(Bytes.toBytes(docId), type);
>
> That would append a '1' byte to the end of the docId.
>
> You can be more efficient (only allocating a byte[] once) by doing something
> more like:
>
> byte [] arr = new byte[9];
> Bytes.putLong(arr, 0, docId);
> arr[8] = 1;
>
> Most things can be done with the Bytes class so poke around.
>
> JG
>
>> -----Original Message-----
>> From: N Kapshoo [mailto:[email protected]]
>> Sent: Monday, June 21, 2010 10:45 AM
>> To: [email protected]
>> Subject: Re: Long vs String for qualifier
>>
>> Now here is my conundrum:
>> I would be doing both queries very often. The UI shows count(status=Y)
>> on the very first page and then depending on whether user does a
>> listing, would show all other info and status info per doc.
>>
>> Is it a bad idea to have both a new ColumnFamily and store it in a
>> qualifier as well? Same data in 2 places, but it would help the read
>> performance in both queries right?
>>
>> When you say append a byte, I assume something like this, am I right?
>>
>> byte[] arr = Bytes.toBytes(docId);
>> arr[arr.length] = '0';
>>
>> Thanks so much for your help.
>>
>> On Mon, Jun 21, 2010 at 12:33 PM, Jonathan Gray <[email protected]>
>> wrote:
>> > Got it.
>> >
>> > Well, you could do what you're describing below, appending something
>> > at the end of the docId to notate that it's the status column. You
>> > wouldn't need to use a "_status" string, could be as simple as
>> > appending an additional byte of type information.
>> >
>> > Another option is to break status into a separate column family.
>> >
>> > What are the most common queries and which query is most critical
>> > performance-wise?
>> >
>> > Are you most interested in "give me all docs and their statuses for
>> > user X" or more like "give me the info for doc Y" or "give me status
>> > for doc Z"?
>> >
>> > If the first one, then seems like adding a type byte after the docId
>> > would make the most sense and be most optimal.
>> >
>> > JG
>> >
>> >> -----Original Message-----
>> >> From: N Kapshoo [mailto:[email protected]]
>> >> Sent: Monday, June 21, 2010 10:26 AM
>> >> To: [email protected]
>> >> Subject: Re: Long vs String for qualifier
>> >>
>> >> Thanks for the quick reply.
>> >>
>> >> I have a schema design based on ids because I actually have the ids as
>> >> rowids in another table. This is to avoid data redundancy since we
>> >> might have a big doc referenced by millions of users, but we don't want
>> >> to store a copy for every user. So,
>> >>
>> >> Table: Docs
>> >> Row: docId (long generated by incrementColumnValue)
>> >> ColFamily: Data
>> >>
>> >> Table: Users
>> >> Row: UserId
>> >> ColFamily: DocInfo
>> >> Qualifier: docId
>> >> Value: More information per user (JSON)
>> >>
>> >> Now in addition:
>> >> ColFamily: DocInfo
>> >> Qualifier: docId_status
>> >> Value: Status
>> >>
>> >> Now I want a status on each doc for each user. This status might
>> >> change several times.
>> >> The first column, docInfo, is static; its value doesn't change once
>> >> inserted. However the status can be toggled back and forth (between Y
>> >> and N).
>> >>
>> >> The docs per user should always be sorted by docId.
>> >>
>> >> How would you design it? I am not sure how I can get the values into
>> >> the qualifiers when it should be sorted by docId always. Thank you.
>> >>
>> >> On Mon, Jun 21, 2010 at 12:12 PM, Jonathan Gray <[email protected]>
>> >> wrote:
>> >> > Can you describe your schema a bit more? Could you use versioning
>> >> > instead of incrementing IDs on the qualifiers?
>> >> >
>> >> > Also, you could consider having a composite value, so id1_asLong
>> >> > would have a value that contained both val1 and val5 in your example.
>> >> > You could use any number of serialization strategies (comma-separated,
>> >> > JSON, Thrift/protobuf, Writable, etc).
>> >> >
>> >> > If you want them as two columns, I would recommend that things you
>> >> > want to retrieve together be neighboring. For example, you might make
>> >> > the qualifiers a composite type of <id_as_long><qf_type>, so
>> >> > <id1_asLong><0byte> for the existing stuff and <id1_asLong><1byte> for
>> >> > status?
>> >> > That way they are stored sequentially so optimally efficient
>> >> > at read time.
>> >> >
>> >> > JG
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: N Kapshoo [mailto:[email protected]]
>> >> >> Sent: Monday, June 21, 2010 9:59 AM
>> >> >> To: [email protected]
>> >> >> Subject: Long vs String for qualifier
>> >> >>
>> >> >> I have a 'long' number that I get by using
>> >> >> HTable.'incrementColumnValue'. This long is used as the qualifier id
>> >> >> on a columnFamily.
>> >> >>
>> >> >> Now I need to add a prefix 'status' so that I can store another value
>> >> >> in the same family.
>> >> >>
>> >> >> How should I consider String vs long sorting?
>> >> >>
>> >> >> So right now:
>> >> >>
>> >> >> colFamily: id1_asLong = val1
>> >> >> colFamily: id2_asLong = val2
>> >> >> colFamily: id3_asLong = val3
>> >> >> colFamily: id4_asLong = val4
>> >> >>
>> >> >> and in addition
>> >> >>
>> >> >> colFamily: status_id1_asString = val5
>> >> >> colFamily: status_id2_asString = val6
>> >> >> colFamily: status_id3_asString = val7
>> >> >> colFamily: status_id4_asString = val8
>> >> >>
>> >> >> To make sure that 'id' values are sorted and accessed sequentially,
>> >> >> should I change my design so that the id1_asLong is stored as
>> >> >> id1_asString?
>> >> >> When I do my Get, I always get id1_asLong and status_id1_asString
>> >> >> together.
>> >> >>
>> >> >> Thanks.
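
For reference, here is a minimal, self-contained sketch of the qualifier layout discussed in this thread (an 8-byte docId followed by one type byte) and of why it keeps a doc's info and status columns adjacent and in docId order, whereas decimal strings would not sort numerically. It is only an illustration under stated assumptions: it uses java.nio.ByteBuffer in place of HBase's Bytes class so that it runs without HBase on the classpath, the names QualifierSketch, TYPE_INFO, TYPE_STATUS, qualifier and compare are hypothetical, and it assumes docIds are non-negative (a negative long would sort after positive ones in an unsigned byte comparison).

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class QualifierSketch {
    // Hypothetical type markers: 0 for the per-user JSON info, 1 for status.
    static final byte TYPE_INFO = 0;
    static final byte TYPE_STATUS = 1;

    // Builds a 9-byte qualifier: 8-byte big-endian docId, then one type byte.
    // ByteBuffer is big-endian by default.
    static byte[] qualifier(long docId, byte type) {
        return ByteBuffer.allocate(9).putLong(docId).put(type).array();
    }

    // Compares byte arrays lexicographically, treating each byte as unsigned.
    // This stands in for the byte-wise ordering applied to qualifiers.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] info9 = qualifier(9L, TYPE_INFO);
        byte[] status9 = qualifier(9L, TYPE_STATUS);
        byte[] info10 = qualifier(10L, TYPE_INFO);

        // Info and status for doc 9 are adjacent, and both precede doc 10.
        System.out.println(compare(info9, status9) < 0);
        System.out.println(compare(status9, info10) < 0);

        // Decimal strings do not sort numerically: "10" comes before "9".
        byte[] s9 = "9".getBytes(StandardCharsets.UTF_8);
        byte[] s10 = "10".getBytes(StandardCharsets.UTF_8);
        System.out.println(compare(s10, s9) < 0);
    }
}
```

All three comparisons print true. If string qualifiers were required, zero-padding the ids to a fixed width would restore numeric order, at the cost of longer qualifiers than the 9-byte form above.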
