Hi Mike, Hi Rob, Thanks for your replies and advice. It seems I'm now due for some implementation. I'm reading Lars' book first, and when I'm done I will start coding.
I already have my Zookeeper/Hadoop/HBase running and, based on the first pages I read, I already know it's not well done since I have put a DataNode and a Zookeeper server on ALL the servers ;) So: more reading for me for the next few days, and then I will start. Thanks again! JM 2012/6/16, Rob Verkuylen <[email protected]>:
> Just to add from my experience:
>
> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine
> can handle 3,000-4,000 puts a second with ease, and a simple timerange scan
> can give you the records you need. I doubt you will be hitting these
> volumes anytime soon. A simple setup will get you your PoC, and then you
> can scale when you need to scale.
>
> Rob
>
> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel
> <[email protected]> wrote:
>
>> Jean-Marc,
>>
>> You indicated that you didn't want to do full table scans when you want
>> to find out which files hadn't been touched since X time has passed.
>> (X could be months, weeks, days, hours, etc.)
>>
>> So here's the thing.
>> First, I am not convinced that you will have hotspotting.
>> Second, you end up having to do 26 scans instead of one. Then you need
>> to join the result sets.
>>
>> Not really a good solution if you think about it.
>>
>> Oh, and I don't believe that you will be hitting a single region,
>> although you may hit a region hard.
>> (Your second table's key is the timestamp of the last update to the
>> file. If the file hasn't been touched in a week, at scale it probably
>> won't be in the same region as a file that was recently touched.)
>>
>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only
>> be applied to a subset of problems.
>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>
>> HTH
>>
>> -Mike
>>
>>
>>
>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>
>> > Let's imagine the timestamp is "123456789".
>> >
>> > If I salt it with a letter from 'a' to 'z' then it will always be split
>> > between a few RegionServers. I will have something like "t123456789".
>> > The issue is that I will have to do 26 queries to be able to find all
>> > the entries. I will need to query from A000000000 to Axxxxxxxxx, then
>> > the same for B, and so on.
>> >
>> > So what's worse? Am I better off dealing with the hotspotting? Salting
>> > the key myself? Or what if I use something like HBaseWD?
>> >
>> > JM
>> >
>> > 2012/6/16, Michel Segel <[email protected]>:
>> >> You can't salt the key in the second table.
>> >> By salting the key, you lose the ability to do range scans, which is
>> >> what you want to do.
>> >>
>> >>
>> >>
>> >> Sent from a remote device. Please excuse any typos...
>> >>
>> >> Mike Segel
>> >>
>> >> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari
>> >> <[email protected]> wrote:
>> >>
>> >>> Thanks all for your comments and suggestions. Regarding the
>> >>> hotspotting, I will try to salt the key in the 2nd table and see the
>> >>> results.
>> >>>
>> >>> Yesterday I finished installing my 4-server cluster with old
>> >>> machines. It's slow, but it's working. So I will do some testing.
>> >>>
>> >>> You are recommending to round the timestamp to the second or minute
>> >>> and have more entries per row. Is that because it's better to have
>> >>> more columns than rows? Or is it more because that will give a more
>> >>> "squared" pattern (lots of rows, lots of columns), which is more
>> >>> efficient?
>> >>>
>> >>> JM
>> >>>
>> >>> 2012/6/15, Michael Segel <[email protected]>:
>> >>>> Thought about this a little bit more...
>> >>>>
>> >>>> You will want two tables for a solution.
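The salting scheme discussed above can be sketched in a few lines. This is an illustrative model only (plain strings standing in for HBase byte keys; the helper names are hypothetical): a one-letter salt derived from a hash of the file ID is prepended to the timestamp, and a range query then becomes 26 separate scans, one per salt.

```python
import hashlib

SALTS = "abcdefghijklmnopqrstuvwxyz"

def salted_key(timestamp: str, file_id: str) -> str:
    """Prefix the timestamp with a salt chosen by hashing the file ID,
    so writes spread across up to 26 regions instead of one."""
    bucket = int(hashlib.md5(file_id.encode()).hexdigest(), 16) % len(SALTS)
    return SALTS[bucket] + timestamp

def scan_ranges(start_ts: str, stop_ts: str):
    """One (start, stop) scan range per salt: 26 scans instead of one."""
    return [(s + start_ts, s + stop_ts) for s in SALTS]

key = salted_key("123456789", "server1:/var/log/messages")
ranges = scan_ranges("000000000", "123456789")
```

This makes the trade-off concrete: the write load is spread, but every "files older than X" query fans out into `len(SALTS)` range scans whose results must be joined client-side.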
>> >>>>
>> >>>> Table 1 Key: Unique ID
>> >>>>   Column: FilePath          Value: Full path to the file
>> >>>>   Column: Last Update Time  Value: timestamp
>> >>>>
>> >>>> Table 2 Key: Last Update Time (the timestamp)
>> >>>>   Column 1-N: Unique ID     Value: Full path to the file
>> >>>>
>> >>>> Now if you want to get fancy, in Table 1 you could use the timestamp
>> >>>> on the FilePath column to hold the last update time.
>> >>>> But it's probably easier for you to start by keeping the data as a
>> >>>> separate column and ignore the timestamps on the columns for now.
>> >>>>
>> >>>> Note the following:
>> >>>>
>> >>>> 1) I used the notation Column 1-N to reflect that for a given
>> >>>> timestamp you may or may not have multiple files that were updated.
>> >>>> (You weren't specific as to the scale.)
>> >>>> This is a good example of HBase's column-oriented approach, where you
>> >>>> may or may not have a column. It doesn't matter. :-) You could also
>> >>>> round the timestamp to the second or minute and have more entries per
>> >>>> row. It doesn't matter. You insert based on timestamp:columnName,
>> >>>> value, so you will add a column to this table.
>> >>>>
>> >>>> 2) First prove that the logic works. You insert/update Table 1 to
>> >>>> capture the ID of the file and its last update time. You then delete
>> >>>> the old timestamp entry in Table 2, then insert the new entry in
>> >>>> Table 2.
>> >>>>
>> >>>> 3) You store Table 2 in ascending order. Then when you want to find
>> >>>> your last 500 entries, you start a scan at 0x000 and limit the scan
>> >>>> to 500 rows. Note that you may or may not have multiple entries, so
>> >>>> as you walk through the result set, you count the number of columns
>> >>>> and stop when you have 500 columns, regardless of the number of rows
>> >>>> you've processed.
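Step 3 above, walking the ascending scan and counting columns rather than rows, can be sketched as follows. A plain in-memory list stands in for the HBase scan result here; the function name is hypothetical.

```python
def oldest_file_ids(rows, limit=500):
    """rows: iterable of (timestamp_key, {file_id: path}) pairs in
    ascending timestamp order, one pair per row of the index table.
    Collect file IDs column by column and stop at `limit` columns,
    however many rows that takes."""
    out = []
    for _ts, columns in rows:
        for file_id in columns:
            out.append(file_id)
            if len(out) == limit:
                return out
    return out

# Simulated scan: 300 timestamp rows, 3 file columns each (900 columns).
rows = [("t%09d" % ts, {"id%d-%d" % (ts, i): "/f" for i in range(3)})
        for ts in range(300)]
ids = oldest_file_ids(rows, limit=500)
```

Because one timestamp row may hold several file columns, the stop condition must be on the running column count, not on the row count, exactly as described above.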
>> >>>>
>> >>>> This should solve your problem and be pretty efficient.
>> >>>> You can then work out the coprocessors and add them to the solution
>> >>>> to be even more efficient.
>> >>>>
>> >>>>
>> >>>> With respect to 'hot-spotting', it can't be helped. You could hash
>> >>>> your unique ID in Table 1; this will reduce the potential for a
>> >>>> hotspot as the table splits.
>> >>>> On Table 2, because you have temporal data and you want to
>> >>>> efficiently scan a small portion of the table based on size, you will
>> >>>> always scan the first block; however, as data rolls off and
>> >>>> compaction occurs, you will probably have to do some cleanup. I'm not
>> >>>> sure how HBase handles regions that no longer contain data. When you
>> >>>> compact an empty region, does it go away?
>> >>>>
>> >>>> By switching to coprocessors, you limit the update accesses to the
>> >>>> second table, so you should still have pretty good performance.
>> >>>>
>> >>>> You may also want to look at Asynchronous HBase; however, I don't
>> >>>> know how well it will work with coprocessors or whether you want to
>> >>>> perform async operations in this specific use case.
>> >>>>
>> >>>> Good luck, HTH...
>> >>>>
>> >>>> -Mike
>> >>>>
>> >>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>> >>>>
>> >>>>> Hi Michael,
>> >>>>>
>> >>>>> For now this is more a proof of concept than a production
>> >>>>> application. And if it works, it should grow a lot, and the database
>> >>>>> at the end will easily be over 1B rows. Each individual server will
>> >>>>> have to send its own information to one centralized server which
>> >>>>> will insert it into a database. That's why it needs to be very quick
>> >>>>> and that's why I'm looking in HBase's direction.
I tried some relational
>> >>>>> databases with 4M rows in the table, but the insert time is too slow
>> >>>>> when I have to introduce entries in bulk. Also, the ability of HBase
>> >>>>> to keep only the cells with values will allow me to save a lot of
>> >>>>> disk space (future projects).
>> >>>>>
>> >>>>> I'm not yet used to HBase and there are still many things I need to
>> >>>>> understand, but until I'm able to create a solution and test it, I
>> >>>>> will continue to read, learn and try that way. Then at the end I
>> >>>>> will be able to compare the 2 options I have (HBase or relational)
>> >>>>> and decide based on the results.
>> >>>>>
>> >>>>> So yes, your reply helped because it's giving me a way to achieve
>> >>>>> this goal (using coprocessors). I don't know yet how this part
>> >>>>> works, so I will dig into the documentation for it.
>> >>>>>
>> >>>>> Thanks,
>> >>>>>
>> >>>>> JM
>> >>>>>
>> >>>>> 2012/6/14, Michael Segel <[email protected]>:
>> >>>>>> Jean-Marc,
>> >>>>>>
>> >>>>>> You do realize that this really isn't a good use case for HBase,
>> >>>>>> assuming that what you are describing is a stand-alone system.
>> >>>>>> It would be easier and better if you just used a simple relational
>> >>>>>> database.
>> >>>>>>
>> >>>>>> Then you would have your table with an ID, and a secondary index
>> >>>>>> on the timestamp.
>> >>>>>> Retrieve the data in ascending order by timestamp and take the top
>> >>>>>> 500 off the list.
>> >>>>>>
>> >>>>>> If you insist on using HBase, yes, you will have to have a
>> >>>>>> secondary table.
>> >>>>>> Then, using coprocessors...
>> >>>>>> When you update the row in your base table, you then get() the row
>> >>>>>> in your index by timestamp, removing the column for that rowid.
>> >>>>>> Add the new column to the timestamp row.
>> >>>>>>
>> >>>>>> As you put it.
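The index-maintenance sequence Mike describes (remove the file's column from the old timestamp row, then add it under the new timestamp) can be sketched with plain dicts standing in for the two tables. This is only a model of the logic; a real implementation would issue Get/Delete/Put calls against HBase, e.g. from a coprocessor hook, and all names here are hypothetical.

```python
main_table = {}   # file_id -> {"path": ..., "last_update": ts}
index_table = {}  # ts -> {file_id: path}  (the secondary index)

def touch_file(file_id, path, new_ts):
    """Record a fresh scan of `file_id`: update the main table and move
    the file's column in the index from its old timestamp row to new_ts."""
    old = main_table.get(file_id)
    if old is not None:
        # Remove the column for this file from the old timestamp row.
        row = index_table.get(old["last_update"], {})
        row.pop(file_id, None)
        if not row:  # drop the row entirely once its last column is gone
            index_table.pop(old["last_update"], None)
    main_table[file_id] = {"path": path, "last_update": new_ts}
    index_table.setdefault(new_ts, {})[file_id] = path

touch_file("id1", "/etc/hosts", 100)
touch_file("id1", "/etc/hosts", 200)  # re-scan: index entry moves 100 -> 200
```

The important property is that each update leaves exactly one index entry per file, so a scan of `index_table` in key order always reflects the current last-update times.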
>> >>>>>>
>> >>>>>> Now you can just do a partial scan on your index. Because your
>> >>>>>> index table is so small... you shouldn't worry about hotspots.
>> >>>>>> You may just want to rebuild your index every so often...
>> >>>>>>
>> >>>>>> HTH
>> >>>>>>
>> >>>>>> -Mike
>> >>>>>>
>> >>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>> >>>>>>
>> >>>>>>> Hi Michael,
>> >>>>>>>
>> >>>>>>> Thanks for your feedback. Here are more details to describe what
>> >>>>>>> I'm trying to achieve.
>> >>>>>>>
>> >>>>>>> My goal is to store information about files in the database. I
>> >>>>>>> need to check the oldest files in the database to refresh the
>> >>>>>>> information.
>> >>>>>>>
>> >>>>>>> The key is an 8-byte ID of the server in the network hosting the
>> >>>>>>> file + the MD5 of the file path. The total is a 24-byte key.
>> >>>>>>>
>> >>>>>>> So each time I look at a file and gather the information, I
>> >>>>>>> update its row in the database based on the key, including a
>> >>>>>>> "last_update" field.
>> >>>>>>> I can calculate this key for any file on the drives.
>> >>>>>>>
>> >>>>>>> In order to know which file I need to check in the network, I
>> >>>>>>> need to scan the table by the "last_update" field. So the idea is
>> >>>>>>> to build another table which contains the last_update as a key
>> >>>>>>> and the file IDs in columns. (Here is the hotspotting.)
>> >>>>>>>
>> >>>>>>> Each time I work on a file, I will have to update the main table
>> >>>>>>> by ID and remove the cell from the second table (the index) and
>> >>>>>>> put it back with the new "last_update" key.
>> >>>>>>>
>> >>>>>>> I'm mainly doing 3 operations in the database.
>> >>>>>>> 1) I retrieve a list of 500 files which need to be updated
>> >>>>>>> 2) I update the information for those 500 files (bulk update)
>> >>>>>>> 3) I load new file references to be checked.
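The 24-byte row key JM describes (8-byte server ID + 16-byte MD5 of the file path) could be built as below. The thread doesn't say how the 8-byte server ID is derived, so taking the first 8 bytes of an MD5 of the server name is purely an assumption for illustration.

```python
import hashlib

def row_key(server_name: str, file_path: str) -> bytes:
    """24-byte main-table key: 8-byte server ID + 16-byte MD5(path).
    The server-ID derivation is a hypothetical stand-in."""
    server_id = hashlib.md5(server_name.encode()).digest()[:8]  # 8 bytes
    path_hash = hashlib.md5(file_path.encode()).digest()        # 16 bytes
    return server_id + path_hash                                # 24 bytes

k = row_key("fileserver-01", "/home/jm/data/report.pdf")
```

Because the path hash dominates the key, rows for one server spread evenly under its 8-byte prefix, which matches JM's observation below that the main table's distribution is "almost perfect" while still allowing a prefix scan of one server's files.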
>> >>>>>>>
>> >>>>>>> For 2 and 3, I use the main table with the file ID as the key.
>> >>>>>>> The distribution is almost perfect because I'm using a hash. The
>> >>>>>>> prefix is the server ID, but access is not always going to the
>> >>>>>>> same server since it's driven by last_update. But this allows
>> >>>>>>> quick access to the list of files from one server.
>> >>>>>>> For 1, I expected to build this second table with the
>> >>>>>>> "last_update" as the key.
>> >>>>>>>
>> >>>>>>> Regarding the frequency, it really depends on the activity on the
>> >>>>>>> network, but it should be "often". The faster the database update
>> >>>>>>> is, the more up to date I will be able to keep it.
>> >>>>>>>
>> >>>>>>> JM
>> >>>>>>>
>> >>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>> >>>>>>>> Actually, I think you should revisit your key design....
>> >>>>>>>>
>> >>>>>>>> Look at your access path to the data for each of the types of
>> >>>>>>>> queries you are going to run.
>> >>>>>>>> From your post:
>> >>>>>>>> "I have a table with a unique key, a file path and a "last
>> >>>>>>>> update" field.
>> >>>>>>>>>>> I can easily find the file back with the ID and find when it
>> >>>>>>>>>>> has been updated.
>> >>>>>>>>>>>
>> >>>>>>>>>>> But what I need too is to find the files not updated for more
>> >>>>>>>>>>> than a certain period of time.
>> >>>>>>>> "
>> >>>>>>>> So your primary query is going to be against the key.
>> >>>>>>>> Not sure if you meant to say that your key was a composite key
>> >>>>>>>> or not... it sounds like your key is just the unique key and the
>> >>>>>>>> rest are columns in the table.
>> >>>>>>>>
>> >>>>>>>> The secondary query or path to the data is to find data where
>> >>>>>>>> the files were not updated for more than a period of time.
>> >>>>>>>>
>> >>>>>>>> If you make your key temporal, that is, adding time as a
>> >>>>>>>> component of your key, you will end up creating new rows of data
>> >>>>>>>> while the old row still exists.
>> >>>>>>>> Not a good side effect.
>> >>>>>>>>
>> >>>>>>>> The other nasty side effect of using time as your key is that
>> >>>>>>>> you not only have the potential for hotspotting, but you also
>> >>>>>>>> have the nasty side effect of creating splits that will never
>> >>>>>>>> grow.
>> >>>>>>>>
>> >>>>>>>> How often are you going to ask to see the files that were not
>> >>>>>>>> updated in the last couple of days/minutes? If it's infrequent,
>> >>>>>>>> then you really shouldn't care if you have to do a complete
>> >>>>>>>> table scan.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>> >>>>>>>>
>> >>>>>>>>> Wow! This is exactly what I was looking for. So I will read all
>> >>>>>>>>> of that now.
>> >>>>>>>>>
>> >>>>>>>>> Need to read here at the bottom:
>> >>>>>>>>> https://github.com/sematext/HBaseWD
>> >>>>>>>>> and here:
>> >>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>>
>> >>>>>>>>> JM
>> >>>>>>>>>
>> >>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>> >>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this
>> >>>>>>>>>> comes up often.... Doug, maybe you could add it to the Ref
>> >>>>>>>>>> Guide?)
>> >>>>>>>>>>
>> >>>>>>>>>> Otis
>> >>>>>>>>>> ----
>> >>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>> >>>>>>>>>> http://sematext.com/spm
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> ________________________________
>> >>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>> >>>>>>>>>>> To: [email protected]
>> >>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>> >>>>>>>>>>> Subject: Timestamp as a key good practice?
>> >>>>>>>>>>>
>> >>>>>>>>>>> I watched Lars George's video about HBase and read the
>> >>>>>>>>>>> documentation, and it says that it's not a good idea to have
>> >>>>>>>>>>> the timestamp as a key because that will always load the same
>> >>>>>>>>>>> region until the timestamp reaches a certain value and moves
>> >>>>>>>>>>> to the next region (hotspotting).
>> >>>>>>>>>>>
>> >>>>>>>>>>> I have a table with a unique key, a file path and a "last
>> >>>>>>>>>>> update" field.
>> >>>>>>>>>>> I can easily find the file back with the ID and find when it
>> >>>>>>>>>>> has been updated.
>> >>>>>>>>>>>
>> >>>>>>>>>>> But what I need too is to find the files not updated for more
>> >>>>>>>>>>> than a certain period of time.
>> >>>>>>>>>>>
>> >>>>>>>>>>> If I want to retrieve that from this single table, I will
>> >>>>>>>>>>> have to do a full parsing of the table. Which might take a
>> >>>>>>>>>>> while.
>> >>>>>>>>>>>
>> >>>>>>>>>>> So I thought of building a table to reference that (a kind of
>> >>>>>>>>>>> secondary index). The key is the "last update", one CF, and
>> >>>>>>>>>>> each column will have the ID of the file with a dummy
>> >>>>>>>>>>> content.
>> >>>>>>>>>>>
>> >>>>>>>>>>> When a file is updated, I remove its cell from this table,
>> >>>>>>>>>>> and introduce a new cell with the new timestamp as the key.
>> >>>>>>>>>>>
>> >>>>>>>>>>> And so on.
>> >>>>>>>>>>>
>> >>>>>>>>>>> With this schema, I can find the files by ID very quickly and
>> >>>>>>>>>>> I can find the files which need to be updated pretty quickly
>> >>>>>>>>>>> too. But it's hotspotting one region.
>> >>>>>>>>>>>
>> >>>>>>>>>>> From the video (0:45:10) I can see 4 options.
>> >>>>>>>>>>> 1) Hotspotting.
>> >>>>>>>>>>> 2) Salting.
>> >>>>>>>>>>> 3) Key field swap/promotion.
>> >>>>>>>>>>> 4) Randomization.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other
>> >>>>>>>>>>> options.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I can do salting. Like prefixing the timestamp with a number
>> >>>>>>>>>>> between 0 and 9. That will distribute the load over 10
>> >>>>>>>>>>> servers. To find all the files with a timestamp below a
>> >>>>>>>>>>> specific value, I will need to run 10 requests instead of
>> >>>>>>>>>>> one. But when the load becomes too big for 10 servers, will I
>> >>>>>>>>>>> have to prefix with a number between 0 and 99? Which means
>> >>>>>>>>>>> 100 requests? And the more regions I have, the more requests
>> >>>>>>>>>>> I will have to do. Is that really a good approach?
>> >>>>>>>>>>>
>> >>>>>>>>>>> Key field swap is close to salting. I can add the first few
>> >>>>>>>>>>> bytes from the path before the timestamp, but the issue will
>> >>>>>>>>>>> remain the same.
>> >>>>>>>>>>>
>> >>>>>>>>>>> I looked at randomization, and I can't do that. Otherwise I
>> >>>>>>>>>>> will have no way to retrieve the information I'm looking for.
>> >>>>>>>>>>>
>> >>>>>>>>>>> So the question is: is there a good way to store the data so
>> >>>>>>>>>>> I can retrieve it based on the date?
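The client-side join that salting forces (one scan per bucket, results merged back into timestamp order) can be sketched like this. The in-memory lists stand in for per-bucket scan results; this is only a model of the fan-out/merge pattern JM is asking about, not an HBase client call.

```python
import heapq

def merged_scan(bucket_results):
    """bucket_results: one list of (timestamp, file_id) pairs per salt
    bucket, each already ascending (as a range scan would return them).
    Merge them lazily into a single ascending-timestamp stream."""
    return list(heapq.merge(*bucket_results, key=lambda kv: kv[0]))

# Three buckets' worth of scan results for "files older than X".
buckets = [
    [(100, "a1"), (300, "a2")],
    [(50, "b1"), (200, "b2")],
    [(150, "c1")],
]
rows = merged_scan(buckets)
```

The merge itself is cheap (each input is already sorted), but the cost JM worries about is real: the number of scans grows with the number of buckets, so raising the salt range from 10 to 100 multiplies the request count tenfold.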
>> >>>>>>>>>>> >> >>>>>>>>>>> Thanks, >> >>>>>>>>>>> >> >>>>>>>>>>> JM >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>>> >> >>> >> >> >> > >> >> >
