A quorum with 2 members is worse than 1, so don't put a ZK on PC2. The
exception you are seeing is because ZK is trying to form a quorum with
only 1 machine, which doesn't make sense, so instead it should revert
to a standalone server and still work.
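For reference, a standalone ZooKeeper config simply omits the server.N quorum lines; listing exactly one server.N entry is what triggers that QuorumPeerConfig error. A minimal sketch (the dataDir path and host names below are hypothetical):

```ini
# zoo.cfg for a standalone ZooKeeper -- no server.N lines at all
tickTime=2000
dataDir=/var/zookeeper   # hypothetical path
clientPort=2181

# For a real quorum you would instead list an odd number of servers,
# typically at least three:
# server.1=pc1:2888:3888
# server.2=pc3:2888:3888
# server.3=pc4:2888:3888
```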

J-D

On Fri, Jun 22, 2012 at 7:20 PM, Jean-Marc Spaggiari
<[email protected]> wrote:
> Hum... Seems that it's not working that way:
>
> ERROR [main:QuorumPeerConfig@283] - Invalid configuration, only one
> server specified (ignoring)
>
> So most probably the secondary should look exactly like the master,
> but I'm not 100% sure...
>
> 2012/6/22, Jean-Marc Spaggiari <[email protected]>:
>> Ok. So if I understand correctly, I need:
>> PC1 => HMaster (HBase), JobTracker (Hadoop), Name Node (Hadoop), and
>> ZooKeeper (ZK)
>> PC2 => Secondary Name Node (Hadoop)
>> PC3 to x => Data Node (Hadoop), Task Tracker (Hadoop), Region Server
>> (HBase)
>>
>> For PC2, should I run ZooKeeper, JobTracker and master too? Can I have
>> 2 masters? Or should I just run the secondary name node?
>>
>> 2012/6/21, Michael Segel <[email protected]>:
>>> If you have a really small cluster...
>>> You can put your HMaster, JobTracker, Name Node, and ZooKeeper all on a
>>> single node. (Secondary too)
>>> Then you have Data Nodes that run DN, TT, and RS.
>>>
>>> That would solve any ZK RS problems.
>>>
>>> On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:
>>>
>>>> Hi Mike, Hi Rob,
>>>>
>>>> Thanks for your replies and advice. It seems that now I'm due for some
>>>> implementation. I'm reading Lars' book first, and when I'm done I
>>>> will start with the coding.
>>>>
>>>> I already have my Zookeeper/Hadoop/HBase running and based on the
>>>> first pages I read, I already know it's not well done since I have put
>>>> a DataNode and a Zookeeper server on ALL the servers ;) So. More
>>>> reading for me for the next few days, and then I will start.
>>>>
>>>> Thanks again!
>>>>
>>>> JM
>>>>
>>>> 2012/6/16, Rob Verkuylen <[email protected]>:
>>>>> Just to add from my experiences:
>>>>>
>>>>> Yes hotspotting is bad, but so are devops headaches. A reasonable
>>>>> machine can handle 3-4000 puts a second with ease, and a simple
>>>>> timerange scan can give you the records you need. I have my doubts
>>>>> you will be hitting these amounts anytime soon. A simple setup will
>>>>> get you your PoC, and then you can scale when you need to scale.
>>>>>
>>>>> Rob
>>>>>
>>>>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel
>>>>> <[email protected]>wrote:
>>>>>
>>>>>> Jean-Marc,
>>>>>>
>>>>>> You indicated that you didn't want to do full table scans when you
>>>>>> want
>>>>>> to
>>>>>> find out which files hadn't been touched since X time has past.
>>>>>> (X could be months, weeks, days, hours, etc ...)
>>>>>>
>>>>>> So here's the thing.
>>>>>> First,  I am not convinced that you will have hot spotting.
>>>>>> Second, you end up having to now do 26 scans instead of one. Then you
>>>>>> need
>>>>>> to join the result set.
>>>>>>
>>>>>> Not really a good solution if you think about it.
>>>>>>
>>>>>> Oh and I don't believe that you will be hitting a single region,
>>>>>> although
>>>>>> you may hit  a region hard.
>>>>>> (Your second table's key is on the timestamp of the last update to the
>>>>>> file.  If the file hadn't been touched in a week, there's the
>>>>>> probability
>>>>>> that at scale, it won't be in the same region as a file that had
>>>>>> recently
>>>>>> been touched. )
>>>>>>
>>>>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and can
>>>>>> only be applied to a subset of problems.
>>>>>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>>>>
>>>>>>> Let's imagine the timestamp is "123456789".
>>>>>>>
>>>>>>> If I salt it with a letter from 'a' to 'z' then it will always be
>>>>>>> split between a few RegionServers. I will have keys like
>>>>>>> "t123456789". The issue is that I will have to do 26 queries to be
>>>>>>> able to find all the entries. I will need to query from A000000000
>>>>>>> to Axxxxxxxxx, then the same for B, and so on.
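To make the cost concrete, here is a small Python sketch of that salted layout and the 26 range scans it forces (pure illustration, not HBase client code; the function names are mine):

```python
import hashlib

def salted_key(timestamp: str) -> str:
    """Prefix the timestamp with a letter 'a'-'z' derived from a hash,
    so consecutive timestamps spread across 26 key ranges."""
    bucket = int(hashlib.md5(timestamp.encode()).hexdigest(), 16) % 26
    return chr(ord('a') + bucket) + timestamp

def scan_ranges(cutoff: str):
    """To find every entry below `cutoff`, one (start, stop) range scan
    is needed per salt bucket: 26 scans instead of 1."""
    return [(c, c + cutoff) for c in (chr(ord('a') + i) for i in range(26))]

key = salted_key("123456789")      # e.g. one of "a123456789" .. "z123456789"
ranges = scan_ranges("123456789")
print(len(ranges))                 # 26 separate scans, results merged client-side
```

And as Mike notes below, the result sets of those 26 scans still have to be joined by the client.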
>>>>>>>
>>>>>>> So what's worse? Am I better off dealing with the hotspotting?
>>>>>>> Salting the key myself? Or what if I use something like HBaseWD?
>>>>>>>
>>>>>>> JM
>>>>>>>
>>>>>>> 2012/6/16, Michel Segel <[email protected]>:
>>>>>>>> You can't salt the key in the second table.
>>>>>>>> By salting the key, you lose the ability to do range scans, which is
>>>>>> what
>>>>>>>> you want to do.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>
>>>>>>>> Mike Segel
>>>>>>>>
>>>>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <
>>>>>> [email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks all for your comments and suggestions. Regarding the
>>>>>>>>> hotspotting I will try to salt the key in the 2nd table and see the
>>>>>>>>> results.
>>>>>>>>>
>>>>>>>>> Yesterday I finished installing my 4-server cluster with old
>>>>>>>>> machines. It's slow, but it's working. So I will do some testing.
>>>>>>>>>
>>>>>>>>> You are recommending to modify the timestamp to be to the second
>>>>>>>>> or minute and have more entries per row. Is that because it's
>>>>>>>>> better to have more columns than rows? Or is it more because that
>>>>>>>>> will allow a more "squared" pattern (lots of rows, lots of
>>>>>>>>> columns) which is more efficient?
>>>>>>>>>
>>>>>>>>> JM
>>>>>>>>>
>>>>>>>>> 2012/6/15, Michael Segel <[email protected]>:
>>>>>>>>>> Thought about this a little bit more...
>>>>>>>>>>
>>>>>>>>>> You will want two tables for a solution.
>>>>>>>>>>
>>>>>>>>>> Table 1:  Key: Unique ID
>>>>>>>>>>           Column: FilePath            Value: Full path to file
>>>>>>>>>>           Column: Last Update time    Value: timestamp
>>>>>>>>>>
>>>>>>>>>> Table 2:  Key: Last Update time (the timestamp)
>>>>>>>>>>           Column 1-N: Unique ID       Value: Full path to the file
>>>>>>>>>>
>>>>>>>>>> Now if you want to get fancy, in Table 1 you could use the
>>>>>>>>>> timestamp on the FilePath column to hold the last update time.
>>>>>>>>>> But it's probably easier for you to start by keeping the data as
>>>>>>>>>> a separate column and ignore the timestamps on the columns for now.
>>>>>>>>>>
>>>>>>>>>> Note the following:
>>>>>>>>>>
>>>>>>>>>> 1) I used the notation Column 1-N to reflect that for a given
>>>>>> timestamp
>>>>>>>>>> you
>>>>>>>>>> may or may not have multiple files that were updated. (You weren't
>>>>>>>>>> specific
>>>>>>>>>> as to the scale)
>>>>>>>>>> This is a good example of HBase's column oriented approach where
>>>>>>>>>> you
>>>>>> may
>>>>>>>>>> or
>>>>>>>>>> may not have a column. It doesn't matter. :-) You could also
>>>>>>>>>> modify
>>>>>> the
>>>>>>>>>> timestamp to be to the second or minute and have more entries per
>>>>>>>>>> row.
>>>>>>>>>> It
>>>>>>>>>> doesn't matter. You insert based on timestamp:columnName, value,
>>>>>>>>>> so
>>>>>> you
>>>>>>>>>> will
>>>>>>>>>> add a column to this table.
>>>>>>>>>>
>>>>>>>>>> 2) First prove that the logic works. You insert/update table 1 to
>>>>>>>>>> capture
>>>>>>>>>> the ID of the file and its last update time.  You then delete the
>>>>>>>>>> old
>>>>>>>>>> timestamp entry in table 2, then insert new entry in table 2.
>>>>>>>>>>
>>>>>>>>>> 3) You store Table 2 in ascending order. Then when you want to
>>>>>>>>>> find
>>>>>> your
>>>>>>>>>> last 500 entries, you do a start scan at 0x000 and then limit the
>>>>>>>>>> scan
>>>>>>>>>> to
>>>>>>>>>> 500 rows. Note that you may or may not have multiple entries so as
>>>>>>>>>> you
>>>>>>>>>> walk
>>>>>>>>>> through the result set, you count the number of columns and stop
>>>>>>>>>> when
>>>>>>>>>> you
>>>>>>>>>> have 500 columns, regardless of the number of rows you've
>>>>>>>>>> processed.
>>>>>>>>>>
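Note 3's counting logic can be sketched in a few lines of Python, using a plain dict to stand in for table 2 (illustrative only, not HBase client code; all names here are mine):

```python
# table2 stands in for the index table: rowkey = last-update timestamp,
# columns = {unique_id: file_path}.
table2 = {
    "0000000100": {"id1": "/a", "id2": "/b"},
    "0000000200": {"id3": "/c"},
    "0000000300": {"id4": "/d", "id5": "/e"},
}

def oldest_n(table, n):
    """Scan rows in ascending key order, collecting columns until n
    entries are gathered, regardless of how many rows that takes."""
    out = []
    for rowkey in sorted(table):
        for uid, path in table[rowkey].items():
            out.append((rowkey, uid, path))
            if len(out) == n:
                return out
    return out

def touch(table, uid, path, old_ts, new_ts):
    """Note 2's maintenance step: delete the old timestamp entry,
    then insert the new one."""
    table.get(old_ts, {}).pop(uid, None)
    table.setdefault(new_ts, {})[uid] = path

print(oldest_n(table2, 3))   # the 3 least recently updated files
```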
>>>>>>>>>> This should solve your problem and be pretty efficient.
>>>>>>>>>> You can then work out the Coprocessors and add it to the solution
>>>>>>>>>> to
>>>>>> be
>>>>>>>>>> even
>>>>>>>>>> more efficient.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> With respect to 'hot-spotting' , can't be helped. You could hash
>>>>>>>>>> your
>>>>>>>>>> unique
>>>>>>>>>> ID in table 1, this will reduce the potential of a hotspot as the
>>>>>> table
>>>>>>>>>> splits.
>>>>>>>>>> On table 2, because you have temporal data and you want to
>>>>>>>>>> efficiently scan a small portion of the table based on size, you
>>>>>>>>>> will always scan the first block. However, as data rolls off and
>>>>>>>>>> compaction occurs, you will probably have to do some cleanup. I'm
>>>>>>>>>> not sure how HBase handles splits that no longer contain data.
>>>>>>>>>> When you compact an empty split, does it go away?
>>>>>>>>>>
>>>>>>>>>> By switching to coprocessors, you now limit the update accesses
>>>>>>>>>> to the second table, so you should still have pretty good
>>>>>>>>>> performance.
>>>>>>>>>>
>>>>>>>>>> You may also want to look at Asynchronous HBase, however I don't
>>>>>>>>>> know
>>>>>>>>>> how
>>>>>>>>>> well it will work with Coprocessors or if you want to perform
>>>>>>>>>> async
>>>>>>>>>> operations in this specific use case.
>>>>>>>>>>
>>>>>>>>>> Good luck, HTH...
>>>>>>>>>>
>>>>>>>>>> -Mike
>>>>>>>>>>
>>>>>>>>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>
>>>>>>>>>>> For now this is more a proof of concept than a production
>>>>>>>>>>> application. And if it's working, it should be growing a lot,
>>>>>>>>>>> and the database at the end will easily be over 1B rows. Each
>>>>>>>>>>> individual server will have to send its own information to one
>>>>>>>>>>> centralized server which will insert it into a database. That's
>>>>>>>>>>> why it needs to be very quick and that's why I'm looking in
>>>>>>>>>>> HBase's direction. I tried with some relational databases with
>>>>>>>>>>> 4M rows in the table but the insert time is too slow when I have
>>>>>>>>>>> to introduce entries in bulk. Also, the ability for HBase to
>>>>>>>>>>> keep only the cells with values will allow me to save a lot on
>>>>>>>>>>> disk space (future projects).
>>>>>>>>>>>
>>>>>>>>>>> I'm not yet used to HBase and there are still many things I need
>>>>>>>>>>> to understand, but until I'm able to create a solution and test
>>>>>>>>>>> it, I will continue to read, learn and try that way. Then at the
>>>>>>>>>>> end I will be able to compare the 2 options I have (HBase or
>>>>>>>>>>> relational) and decide based on the results.
>>>>>>>>>>>
>>>>>>>>>>> So yes, your reply helped because it's giving me a way to achieve
>>>>>>>>>>> this goal (using coprocessors). I don't know yet how this part
>>>>>>>>>>> works, so I will dig into the documentation for it.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> JM
>>>>>>>>>>>
>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>> Jean-Marc,
>>>>>>>>>>>>
>>>>>>>>>>>> You do realize that this really isn't a good use case for HBase,
>>>>>>>>>>>> assuming that what you are describing is a standalone system.
>>>>>>>>>>>> It would be easier and better if you just used a simple
>>>>>>>>>>>> relational
>>>>>>>>>>>> database.
>>>>>>>>>>>>
>>>>>>>>>>>> Then you would have your table with an ID, and a secondary
>>>>>>>>>>>> index on the timestamp.
>>>>>>>>>>>> Retrieve the data in ascending order by timestamp and take the
>>>>>>>>>>>> top 500 off the list.
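For comparison, the relational variant described above is a one-liner once the index exists. A minimal sqlite3 sketch (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id TEXT PRIMARY KEY, path TEXT, last_update INTEGER)")
conn.execute("CREATE INDEX idx_lu ON files (last_update)")  # secondary index on timestamp

# 1000 sample rows; smaller last_update means "not touched for longer".
conn.executemany("INSERT INTO files VALUES (?, ?, ?)",
                 [(f"id{i}", f"/path/{i}", 1000 - i) for i in range(1000)])

# Oldest 500 files: ascending by last_update, top 500 off the list.
rows = conn.execute(
    "SELECT id, path FROM files ORDER BY last_update ASC LIMIT 500").fetchall()
print(len(rows))  # 500
```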
>>>>>>>>>>>>
>>>>>>>>>>>> If you insist on using HBase, yes you will have to have a
>>>>>>>>>>>> secondary
>>>>>>>>>>>> table.
>>>>>>>>>>>> Then using co-processors...
>>>>>>>>>>>> When you update the row in your base table, you
>>>>>>>>>>>> then get() the row in your index by timestamp, removing the
>>>>>>>>>>>> column
>>>>>> for
>>>>>>>>>>>> that
>>>>>>>>>>>> rowid.
>>>>>>>>>>>> Add the new column to the timestamp row.
>>>>>>>>>>>>
>>>>>>>>>>>> As you put it.
>>>>>>>>>>>>
>>>>>>>>>>>> Now you can just do a partial scan on your index. Because your
>>>>>>>>>>>> index
>>>>>>>>>>>> table
>>>>>>>>>>>> is so small... you shouldn't worry about hotspots.
>>>>>>>>>>>> You may just want to rebuild your index every so often...
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> -Mike
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your feedback. Here are more details to describe
>>>>>>>>>>>>> what
>>>>>> I'm
>>>>>>>>>>>>> trying to achieve.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My goal is to store information about files into the database.
>>>>>>>>>>>>> I
>>>>>> need
>>>>>>>>>>>>> to check the oldest files in the database to refresh the
>>>>>> information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The key is an 8-byte ID of the server in the network hosting
>>>>>>>>>>>>> the file + the MD5 of the file path. The total is a 24-byte key.
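That key layout can be sketched in a few lines of Python (stdlib only; the layout is as described above: an 8-byte server ID followed by the 16-byte MD5 of the path gives a fixed 24-byte row key):

```python
import hashlib
import struct

def row_key(server_id: int, file_path: str) -> bytes:
    """8-byte big-endian server ID followed by the 16-byte MD5
    of the file path: a fixed 24-byte row key."""
    return struct.pack(">Q", server_id) + hashlib.md5(file_path.encode()).digest()

key = row_key(42, "/var/data/report.txt")  # hypothetical server ID and path
print(len(key))  # 24
```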
>>>>>>>>>>>>>
>>>>>>>>>>>>> So each time I look at a file and gather the information, I
>>>>>>>>>>>>> update
>>>>>>>>>>>>> its
>>>>>>>>>>>>> row in the database based on the key including a "last_update"
>>>>>> field.
>>>>>>>>>>>>> I can calculate this key for any file in the drives.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In order to know which file I need to check in the network, I
>>>>>>>>>>>>> need
>>>>>> to
>>>>>>>>>>>>> scan the table by "last_update" field. So the idea is to build
>>>>>>>>>>>>> another
>>>>>>>>>>>>> table which contain the last_update as a key and the files IDs
>>>>>>>>>>>>> in
>>>>>>>>>>>>> columns. (Here is the hotspotting)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Each time I work on a file, I will have to update the main
>>>>>>>>>>>>> table
>>>>>>>>>>>>> by
>>>>>>>>>>>>> ID
>>>>>>>>>>>>> and remove the cell from the second table (the index) and put
>>>>>>>>>>>>> it
>>>>>> back
>>>>>>>>>>>>> with the new "last_update" key.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm mainly doing 3 operations in the database.
>>>>>>>>>>>>> 1) I retrieve a list of 500 files which need to be updated
>>>>>>>>>>>>> 2) I update the information for those 500 files (bulk update)
>>>>>>>>>>>>> 3) I load new file references to be checked.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For 2 and 3, I use the main table with the file ID as the key.
>>>>>>>>>>>>> The distribution is almost perfect because I'm using a hash.
>>>>>>>>>>>>> The prefix is the server ID, but it's not always going to the
>>>>>>>>>>>>> same server since it's done by last_update. But this allows
>>>>>>>>>>>>> quick access to the list of files from one server.
>>>>>>>>>>>>> For 1, I have expected to build this second table with the
>>>>>>>>>>>>> "last_update" as the key.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding the frequency, it really depends on the activities on
>>>>>>>>>>>>> the
>>>>>>>>>>>>> network, but it should be "often".  The faster the database
>>>>>>>>>>>>> update
>>>>>>>>>>>>> will be, the more up to date I will be able to keep it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> JM
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>>>> Actually I think you should revisit your key design....
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Look at your access path to the data for each of the types of
>>>>>>>>>>>>>> queries
>>>>>>>>>>>>>> you
>>>>>>>>>>>>>> are going to run.
>>>>>>>>>>>>>> From your post:
>>>>>>>>>>>>>> "I have a table with a uniq key, a file path and a "last
>>>>>>>>>>>>>> update"
>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>> I can easily find back the file with the ID and find when
>>>>>>>>>>>>>>>>> it
>>>>>> has
>>>>>>>>>>>>>>>>> been
>>>>>>>>>>>>>>>>> updated.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But what I need too is to find the files not updated for
>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> certain period of time.
>>>>>>>>>>>>>> "
>>>>>>>>>>>>>> So your primary query is going to be against the key.
>>>>>>>>>>>>>> Not sure if you meant to say that your key was a composite key
>>>>>>>>>>>>>> or
>>>>>>>>>>>>>> not...
>>>>>>>>>>>>>> sounds like your key is just the unique key and the rest are
>>>>>> columns
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The secondary query or path to the data is to find data where
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> files
>>>>>>>>>>>>>> were
>>>>>>>>>>>>>> not updated for more than a period of time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you make your key temporal, that is adding time as a
>>>>>>>>>>>>>> component
>>>>>> of
>>>>>>>>>>>>>> your
>>>>>>>>>>>>>> key, you will end up creating new rows of data while the old
>>>>>>>>>>>>>> row
>>>>>>>>>>>>>> still
>>>>>>>>>>>>>> exists.
>>>>>>>>>>>>>> Not a good side effect.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The other nasty side effect of using time as your key is that
>>>>>>>>>>>>>> you
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>> have the potential for hot spotting, but that you also have
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> nasty
>>>>>>>>>>>>>> side
>>>>>>>>>>>>>> effect of creating splits that will never grow.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> How often are you going to ask to see the files that were not
>>>>>>>>>>>>>> updated in the last couple of days/minutes? If it's
>>>>>>>>>>>>>> infrequent, then you really shouldn't care if you have to do a
>>>>>>>>>>>>>> complete table scan.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wow! This is exactly what I was looking for. So I will read
>>>>>>>>>>>>>>> all
>>>>>> of
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Need to read here at the bottom:
>>>>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>>>>> and here:
>>>>>>>>>>>>>>>
>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this
>>>>>> comes
>>>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>>> often.... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Otis
>>>>>>>>>>>>>>>> ----
>>>>>>>>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>>>>>>>>>>>>>>>> http://sematext.com/spm
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I watched Lars George's video about HBase and read the
>>>>>>>>>>>>>>>>> documentation
>>>>>>>>>>>>>>>>> and it's saying that it's not a good idea to have the
>>>>>>>>>>>>>>>>> timestamp
>>>>>>>>>>>>>>>>> as
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> key because that will always load the same region until the
>>>>>>>>>>>>>>>>> timestamp
>>>>>>>>>>>>>>>>> reach a certain value and move to the next region
>>>>>> (hotspotting).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have a table with a uniq key, a file path and a "last
>>>>>>>>>>>>>>>>> update"
>>>>>>>>>>>>>>>>> field.
>>>>>>>>>>>>>>>>> I can easily find back the file with the ID and find when
>>>>>>>>>>>>>>>>> it
>>>>>> has
>>>>>>>>>>>>>>>>> been
>>>>>>>>>>>>>>>>> updated.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But what I need too is to find the files not updated for
>>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>>> certain period of time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If I want to retrieve that from this single table, I will
>>>>>>>>>>>>>>>>> have to do a full scan of the table, which might take a
>>>>>>>>>>>>>>>>> while.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So I thought of building a table to reference that (a kind
>>>>>>>>>>>>>>>>> of secondary index). The key is the "last update", one CF,
>>>>>>>>>>>>>>>>> and each column will have the ID of the file with a dummy
>>>>>>>>>>>>>>>>> content.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When a file is updated, I remove its cell from this table,
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> introduce a new cell with the new timestamp as the key.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And so on.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With this schema, I can find the files by ID very quickly
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>>> find the files which need to be updated pretty quickly too.
>>>>>>>>>>>>>>>>> But
>>>>>>>>>>>>>>>>> it's
>>>>>>>>>>>>>>>>> hotspotting one region.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> From the video (0:45:10) I can see 4 situations.
>>>>>>>>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>>>>>>>>> 2) Salting.
>>>>>>>>>>>>>>>>> 3) Key field swap/promotion
>>>>>>>>>>>>>>>>> 4) Randomization.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other
>>>>>>>>>>>>>>>>> options.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can do salting. Like prefixing the timestamp with a
>>>>>>>>>>>>>>>>> number between 0 and 9. So that will distribute the load
>>>>>>>>>>>>>>>>> over 10 servers. To find all the files with a timestamp
>>>>>>>>>>>>>>>>> below a specific value, I will need to run 10 requests
>>>>>>>>>>>>>>>>> instead of one. But when the load becomes too big for 10
>>>>>>>>>>>>>>>>> servers, I will have to prefix with a number between 0 and
>>>>>>>>>>>>>>>>> 99? Which means 100 requests? And the more regions I have,
>>>>>>>>>>>>>>>>> the more requests I will have to do. Is that really a good
>>>>>>>>>>>>>>>>> approach?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Key field swap is close to salting. I can add the first few
>>>>>> bytes
>>>>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>> the path before the timestamp, but the issue will remain
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>> same.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I looked at randomization, and I can't do that. Else I
>>>>>>>>>>>>>>>>> will have no way to retrieve the information I'm looking
>>>>>>>>>>>>>>>>> for.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So the question is: is there a good way to store the data
>>>>>>>>>>>>>>>>> so it can be retrieved based on the date?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>
