Just to add from my experiences:

Yes, hotspotting is bad, but so are devops headaches. A reasonable machine
can handle 3,000-4,000 puts a second with ease, and a simple time-range scan
can give you the records you need. I doubt you will be hitting those
volumes anytime soon. A simple setup will get you your PoC, and you can
scale when you need to scale.

Rob

On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel <[email protected]> wrote:

> Jean-Marc,
>
> You indicated that you didn't want to do full table scans when you want to
> find out which files hadn't been touched since X time has passed.
> (X could be months, weeks, days, hours, etc ...)
>
> So here's the thing.
> First, I am not convinced that you will have hot spotting.
> Second, with salting you end up having to do 26 scans instead of one. Then
> you need to join the result sets.
>
> Not really a good solution if you think about it.
>
> Oh, and I don't believe that you will be hitting a single region, although
> you may hit a region hard.
> (Your second table's key is the timestamp of the last update to the
> file. If the file hadn't been touched in a week, it's likely
> that at scale it won't be in the same region as a file that had recently
> been touched.)
>
> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only be
> applied to a subset of problems.
> (Think round-robin partitioning in a RDBMS. DB2 was big on this.)
>
> HTH
>
> -Mike
>
>
>
> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>
> > Let's imagine the timestamp is "123456789".
> >
> > If I salt it with a letter from 'a' to 'z', then it will always be split
> > between a few RegionServers. I will have something like "t123456789". The
> > issue is that I will have to do 26 queries to be able to find all the
> > entries. I will need to query from A000000000 to Axxxxxxxxx, then the same
> > for B, and so on.
> >
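A minimal Python sketch of the trade-off described above, using a sorted in-memory list as a stand-in for the index table rather than a real HBase client. The names `make_key`, `salted_key`, `scan` and `older_than`, the 10-digit zero padding, and the way the salt is derived from the file ID are all assumptions made for illustration. It shows why a one-letter salt turns one range scan into 26, whose results then have to be merged back into time order.

```python
import bisect
import heapq
import string

SALTS = string.ascii_lowercase  # 26 one-letter salts, 'a' to 'z'
WIDTH = 10                      # assumed zero-padded timestamp width


def make_key(salt: str, timestamp: int) -> str:
    """Build a fixed-width row key: one salt letter plus padded timestamp."""
    return f"{salt}{timestamp:0{WIDTH}d}"


def salted_key(timestamp: int, file_id: str) -> str:
    """Pick a salt from the file ID (an assumption) and prefix the timestamp."""
    salt = SALTS[sum(file_id.encode()) % len(SALTS)]
    return make_key(salt, timestamp)


def scan(sorted_keys, start, stop):
    """Stand-in for one range scan: keys in [start, stop) of a sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def older_than(sorted_keys, cutoff: int):
    """Run one scan per salt bucket, then merge back into timestamp order."""
    buckets = [
        [int(k[1:]) for k in scan(sorted_keys, make_key(s, 0), make_key(s, cutoff))]
        for s in SALTS
    ]
    return list(heapq.merge(*buckets))
```

Without the salt, the same question is a single `scan` from the smallest key up to the cutoff; with it, every query pays for all 26 buckets even when most of them are empty.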
> > So which is worse? Am I better off dealing with the hotspotting? Salting
> > the key myself? Or what if I use something like HBaseWD?
> >
> > JM
> >
> > 2012/6/16, Michel Segel <[email protected]>:
> >> You can't salt the key in the second table.
> >> By salting the key, you lose the ability to do range scans, which is what
> >> you want to do.
> >>
> >>
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]> wrote:
> >>
> >>> Thanks all for your comments and suggestions. Regarding the
> >>> hotspotting I will try to salt the key in the 2nd table and see the
> >>> results.
> >>>
> >>> Yesterday I finished installing my 4-server cluster on old machines.
> >>> It's slow, but it's working. So I will do some testing.
> >>>
> >>> You are recommending modifying the timestamp to be to the second or
> >>> minute and having more entries per row. Is that because it's better to
> >>> have more columns than rows? Or is it more because that will give
> >>> a more "square" pattern (lots of rows, lots of columns), which is
> >>> more efficient?
> >>>
> >>> JM
> >>>
> >>> 2012/6/15, Michael Segel <[email protected]>:
> >>>> Thought about this a little bit more...
> >>>>
> >>>> You will want two tables for a solution.
> >>>>
> >>>> Table 1:  Key: Unique ID
> >>>>           Column: FilePath            Value: Full path to the file
> >>>>           Column: Last Update time    Value: timestamp
> >>>>
> >>>> Table 2:  Key: Last Update time (the timestamp)
> >>>>           Column 1-N: Unique ID       Value: Full path to the file
> >>>>
> >>>> Now if you want to get fancy, in Table 1 you could use the timestamp on
> >>>> the column FilePath to hold the last update time.
> >>>> But it's probably easier for you to start by keeping the data as a
> >>>> separate column and ignore the timestamps on the columns for now.
> >>>>
> >>>> Note the following:
> >>>>
> >>>> 1) I used the notation Column 1-N to reflect that for a given timestamp
> >>>> you may or may not have multiple files that were updated. (You weren't
> >>>> specific as to the scale.)
> >>>> This is a good example of HBase's column-oriented approach, where you may
> >>>> or may not have a column. It doesn't matter. :-) You could also modify the
> >>>> timestamp to be to the second or minute and have more entries per row.
> >>>> It doesn't matter. You insert based on timestamp:columnName, value, so you
> >>>> will add a column to this table.
> >>>>
> >>>> 2) First prove that the logic works. You insert/update table 1 to capture
> >>>> the ID of the file and its last update time. You then delete the old
> >>>> timestamp entry in table 2, then insert the new entry in table 2.
> >>>>
> >>>> 3) You store Table 2 in ascending order. Then when you want to find your
> >>>> 500 oldest entries, you start a scan at 0x000 and then limit the scan to
> >>>> 500 rows. Note that a row may hold multiple entries, so as you walk
> >>>> through the result set, you count the number of columns and stop when you
> >>>> have 500 columns, regardless of the number of rows you've processed.
> >>>>
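The three steps above can be sketched in plain Python, with two dictionaries standing in for the two tables. This is not an HBase client and not necessarily how the final solution was built: the class name `FileIndex` and the methods `touch` and `oldest` are hypothetical, and sorting the keys on every call only emulates the ascending row order that the table would keep by itself.

```python
class FileIndex:
    """In-memory stand-in for the two tables (plain dicts, no HBase calls)."""

    def __init__(self):
        self.by_id = {}    # table 1: file ID -> (path, last update time)
        self.by_time = {}  # table 2: last update time -> {file ID: path}

    def touch(self, file_id, path, timestamp):
        """Record an update: drop the old index entry, then write both tables."""
        old = self.by_id.get(file_id)
        if old is not None:
            old_row = self.by_time[old[1]]
            del old_row[file_id]          # delete the old timestamp entry
            if not old_row:
                del self.by_time[old[1]]  # drop rows left with no columns
        self.by_id[file_id] = (path, timestamp)
        self.by_time.setdefault(timestamp, {})[file_id] = path

    def oldest(self, limit):
        """Walk table 2 in ascending key order, counting columns, not rows."""
        found = []
        for timestamp in sorted(self.by_time):  # emulates sorted row keys
            for file_id in sorted(self.by_time[timestamp]):
                found.append((timestamp, file_id))
                if len(found) == limit:
                    return found
        return found
```

Calling `oldest(500)` stops as soon as 500 columns have been collected, however many rows that took, which is the counting rule from step 3.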
> >>>> This should solve your problem and be pretty efficient.
> >>>> You can then work out the coprocessors and add them to the solution to be
> >>>> even more efficient.
> >>>>
> >>>>
> >>>> With respect to 'hot-spotting', it can't be helped. You could hash your
> >>>> unique ID in table 1; this will reduce the potential for a hotspot as the
> >>>> table splits.
> >>>> On table 2, because you have temporal data and you want to efficiently
> >>>> scan a small portion of the table based on size, you will always scan the
> >>>> first block. However, as data rolls off and compaction occurs, you will
> >>>> probably have to do some cleanup. I'm not sure how HBase handles splits
> >>>> that no longer contain data. When you compact an empty split, does it go
> >>>> away?
> >>>>
> >>>> By switching to coprocessors, you now limit the update accessors to the
> >>>> second table, so you should still have pretty good performance.
> >>>>
> >>>> You may also want to look at Asynchronous HBase; however, I don't know how
> >>>> well it will work with coprocessors or if you want to perform async
> >>>> operations in this specific use case.
> >>>>
> >>>> Good luck, HTH...
> >>>>
> >>>> -Mike
> >>>>
> >>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
> >>>>
> >>>>> Hi Michael,
> >>>>>
> >>>>> For now this is more a proof of concept than a production application.
> >>>>> If it works, it should grow a lot, and the database at the
> >>>>> end will easily be over 1B rows. Each individual server will have to
> >>>>> send its own information to one centralized server, which will insert
> >>>>> it into a database. That's why it needs to be very quick, and that's
> >>>>> why I'm looking in HBase's direction. I tried some relational
> >>>>> databases with 4M rows in the table, but the insert time is too slow
> >>>>> when I have to insert entries in bulk. Also, the ability of HBase
> >>>>> to keep only the cells with values will save a lot of
> >>>>> disk space (future projects).
> >>>>>
> >>>>> I'm not yet used to HBase and there are still many things I need to
> >>>>> understand, but until I'm able to create a solution and test it, I will
> >>>>> continue to read, learn and try that way. Then at the end I will be
> >>>>> able to compare the 2 options I have (HBase or relational) and decide
> >>>>> based on the results.
> >>>>>
> >>>>> So yes, your reply helped because it's giving me a way to achieve this
> >>>>> goal (using co-processors). I don't know yet how this part works,
> >>>>> so I will dig into the documentation for it.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> JM
> >>>>>
> >>>>> 2012/6/14, Michael Segel <[email protected]>:
> >>>>>> Jean-Marc,
> >>>>>>
> >>>>>> You do realize that this really isn't a good use case for HBase,
> >>>>>> assuming that what you are describing is a stand-alone system.
> >>>>>> It would be easier and better if you just used a simple relational
> >>>>>> database.
> >>>>>>
> >>>>>> Then you would have your table with an ID, and a secondary index on the
> >>>>>> timestamp.
> >>>>>> Retrieve the data in ascending order by timestamp and take the top 500
> >>>>>> off the list.
> >>>>>>
> >>>>>> If you insist on using HBase, yes, you will have to have a secondary
> >>>>>> table.
> >>>>>> Then using co-processors...
> >>>>>> When you update the row in your base table, you
> >>>>>> then get() the row in your index by timestamp, removing the column for
> >>>>>> that rowid.
> >>>>>> Add the new column to the timestamp row.
> >>>>>>
> >>>>>> As you put it.
> >>>>>>
> >>>>>> Now you can just do a partial scan on your index. Because your index
> >>>>>> table is so small... you shouldn't worry about hotspots.
> >>>>>> You may just want to rebuild your index every so often...
> >>>>>>
> >>>>>> HTH
> >>>>>>
> >>>>>> -Mike
> >>>>>>
> >>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
> >>>>>>
> >>>>>>> Hi Michael,
> >>>>>>>
> >>>>>>> Thanks for your feedback. Here are more details to describe what I'm
> >>>>>>> trying to achieve.
> >>>>>>>
> >>>>>>> My goal is to store information about files in the database. I need
> >>>>>>> to check the oldest files in the database to refresh the information.
> >>>>>>>
> >>>>>>> The key is an 8-byte ID of the server name in the network hosting the
> >>>>>>> file + the MD5 of the file path. The total is a 24-byte key.
> >>>>>>>
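A small sketch of how such a 24-byte key could be assembled, assuming the server ID is already available as exactly 8 bytes (how it is derived from the server name is not stated in the thread). The function name `row_key` is hypothetical. An MD5 digest is always 16 bytes, which is where the 8 + 16 = 24 total comes from.

```python
import hashlib


def row_key(server_id: bytes, file_path: str) -> bytes:
    """Return the 8-byte server ID followed by the 16-byte MD5 of the path."""
    if len(server_id) != 8:
        raise ValueError("server ID must be exactly 8 bytes")
    return server_id + hashlib.md5(file_path.encode("utf-8")).digest()
```

Because the server ID comes first, all files from one server share a key prefix, while the hashed path spreads them evenly within that prefix.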
> >>>>>>> So each time I look at a file and gather the information, I update its
> >>>>>>> row in the database based on the key, including a "last_update" field.
> >>>>>>> I can calculate this key for any file on the drives.
> >>>>>>>
> >>>>>>> In order to know which files I need to check in the network, I need to
> >>>>>>> scan the table by the "last_update" field. So the idea is to build
> >>>>>>> another table which contains the last_update as the key and the file IDs
> >>>>>>> in columns. (Here is the hotspotting.)
> >>>>>>>
> >>>>>>> Each time I work on a file, I will have to update the main table by ID
> >>>>>>> and remove the cell from the second table (the index) and put it back
> >>>>>>> with the new "last_update" key.
> >>>>>>>
> >>>>>>> I'm mainly doing 3 operations in the database.
> >>>>>>> 1) I retrieve a list of 500 files which need to be updated.
> >>>>>>> 2) I update the information for those 500 files (bulk update).
> >>>>>>> 3) I load new file references to be checked.
> >>>>>>>
> >>>>>>> For 2 and 3, I use the main table with the file ID as the key. The
> >>>>>>> distribution is almost perfect because I'm using a hash. The prefix is
> >>>>>>> the server ID, but it's not always going to the same server since it's
> >>>>>>> done by last_update. But this allows quick access to the list of
> >>>>>>> files from one server.
> >>>>>>> For 1, I planned to build this second table with
> >>>>>>> "last_update" as the key.
> >>>>>>>
> >>>>>>> Regarding the frequency, it really depends on the activity on the
> >>>>>>> network, but it should be "often". The faster the database update
> >>>>>>> is, the more up to date I will be able to keep it.
> >>>>>>>
> >>>>>>> JM
> >>>>>>>
> >>>>>>> 2012/6/14, Michael Segel <[email protected]>:
> >>>>>>>> Actually I think you should revisit your key design....
> >>>>>>>>
> >>>>>>>> Look at your access path to the data for each of the types of queries
> >>>>>>>> you are going to run.
> >>>>>>>> From your post:
> >>>>>>>> "I have a table with a unique key, a file path and a "last update"
> >>>>>>>> field.
> >>>>>>>>>>> I can easily find the file with the ID and find when it has been
> >>>>>>>>>>> updated.
> >>>>>>>>>>>
> >>>>>>>>>>> But what I also need is to find the files not updated for more than
> >>>>>>>>>>> a certain period of time.
> >>>>>>>> "
> >>>>>>>> So your primary query is going to be against the key.
> >>>>>>>> Not sure if you meant to say that your key was a composite key or
> >>>>>>>> not... It sounds like your key is just the unique key and the rest are
> >>>>>>>> columns in the table.
> >>>>>>>>
> >>>>>>>> The secondary query or path to the data is to find data where the files
> >>>>>>>> were not updated for more than a period of time.
> >>>>>>>>
> >>>>>>>> If you make your key temporal, that is, adding time as a component of
> >>>>>>>> your key, you will end up creating new rows of data while the old row
> >>>>>>>> still exists.
> >>>>>>>> Not a good side effect.
> >>>>>>>>
> >>>>>>>> The other nasty side effect of using time as your key is that you not
> >>>>>>>> only have the potential for hot spotting, but you also
> >>>>>>>> create splits that will never grow.
> >>>>>>>>
> >>>>>>>> How often are you going to ask to see the files that were not updated
> >>>>>>>> in the last couple of days/minutes? If it's infrequent, then you really
> >>>>>>>> shouldn't care if you have to do a complete table scan.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
> >>>>>>>>
> >>>>>>>>> Wow! This is exactly what I was looking for. So I will read all of
> >>>>>>>>> that now.
> >>>>>>>>>
> >>>>>>>>> Need to read here at the bottom:
> >>>>>>>>> https://github.com/sematext/HBaseWD
> >>>>>>>>> and here:
> >>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> JM
> >>>>>>>>>
> >>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
> >>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes
> >>>>>>>>>> up often.... Doug, maybe you could add it to the Ref Guide?)
> >>>>>>>>>>
> >>>>>>>>>> Otis
> >>>>>>>>>> ----
> >>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
> >>>>>>>>>> http://sematext.com/spm
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> ________________________________
> >>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
> >>>>>>>>>>> To: [email protected]
> >>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
> >>>>>>>>>>> Subject: Timestamp as a key good practice?
> >>>>>>>>>>>
> >>>>>>>>>>> I watched Lars George's video about HBase and read the
> >>>>>>>>>>> documentation, and it says that it's not a good idea to have the
> >>>>>>>>>>> timestamp as a key because that will always load the same region
> >>>>>>>>>>> until the timestamp reaches a certain value and moves to the next
> >>>>>>>>>>> region (hotspotting).
> >>>>>>>>>>>
> >>>>>>>>>>> I have a table with a unique key, a file path and a "last update"
> >>>>>>>>>>> field.
> >>>>>>>>>>> I can easily find the file with the ID and find when it has been
> >>>>>>>>>>> updated.
> >>>>>>>>>>>
> >>>>>>>>>>> But what I also need is to find the files not updated for more than
> >>>>>>>>>>> a certain period of time.
> >>>>>>>>>>>
> >>>>>>>>>>> If I want to retrieve that from this single table, I will have to do
> >>>>>>>>>>> a full scan of the table, which might take a while.
> >>>>>>>>>>>
> >>>>>>>>>>> So I thought of building a table to reference that (a kind of
> >>>>>>>>>>> secondary index). The key is the "last update", with one column
> >>>>>>>>>>> family, and each column will have the ID of the file with dummy
> >>>>>>>>>>> content.
> >>>>>>>>>>>
> >>>>>>>>>>> When a file is updated, I remove its cell from this table, and
> >>>>>>>>>>> introduce a new cell with the new timestamp as the key.
> >>>>>>>>>>>
> >>>>>>>>>>> And so on.
> >>>>>>>>>>>
> >>>>>>>>>>> With this schema, I can find the files by ID very quickly and I can
> >>>>>>>>>>> find the files which need to be updated pretty quickly too. But it's
> >>>>>>>>>>> hotspotting one region.
> >>>>>>>>>>>
> >>>>>>>>>>> From the video (0:45:10) I can see 4 situations.
> >>>>>>>>>>> 1) Hotspotting.
> >>>>>>>>>>> 2) Salting.
> >>>>>>>>>>> 3) Key field swap/promotion
> >>>>>>>>>>> 4) Randomization.
> >>>>>>>>>>>
> >>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
> >>>>>>>>>>>
> >>>>>>>>>>> I can do salting, like prefixing the timestamp with a number between
> >>>>>>>>>>> 0 and 9. That will distribute the load over 10 servers. To find all
> >>>>>>>>>>> the files with a timestamp below a specific value, I will need to
> >>>>>>>>>>> run 10 requests instead of one. But when the load becomes too big
> >>>>>>>>>>> for 10 servers, I will have to prefix with a number between 0 and
> >>>>>>>>>>> 99, which means 100 requests? And the more regions I have, the more
> >>>>>>>>>>> requests I will have to do. Is that really a good approach?
> >>>>>>>>>>>
> >>>>>>>>>>> Key field swap is close to salting. I can add the first few bytes
> >>>>>>>>>>> from the path before the timestamp, but the issue will remain the
> >>>>>>>>>>> same.
> >>>>>>>>>>>
> >>>>>>>>>>> I looked at randomization, and I can't do that. Otherwise I will
> >>>>>>>>>>> have no way to retrieve the information I'm looking for.
> >>>>>>>>>>>
> >>>>>>>>>>> So the question is: is there a good way to store the data so I can
> >>>>>>>>>>> retrieve it based on the date?
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> JM
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
>
