Am I better off running it on 1, or on 3? I just want to do some testing for now, but I have performance issues: it's taking 20 seconds to do 1000 gets with the current configuration... I'm tracking the issues down. I think the network is one of them, so I will address it this week. But for ZK, can I keep it on only one server for now, or will it be more efficient if I configure it on 3?

Thanks,

JM
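For reference, a minimal sketch of the two zoo.cfg shapes in question (hostnames and paths are placeholders). A standalone ZooKeeper omits the server.N lines entirely; a config that lists exactly one server.N entry is what produces the "only one server specified (ignoring)" error quoted below:

    # standalone: no server.N entries at all
    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181

    # three-member ensemble: the same file, plus all three peers, on every ZK node
    tickTime=2000
    dataDir=/var/zookeeper
    clientPort=2181
    initLimit=10
    syncLimit=5
    server.1=pc1.example.com:2888:3888
    server.2=pc3.example.com:2888:3888
    server.3=pc4.example.com:2888:3888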
2012/6/26, Jean-Daniel Cryans <[email protected]>:
> A quorum with 2 members is worse than 1, so don't put a ZK on PC2. The exception you are seeing is ZK trying to get a quorum with 1 machine, but that doesn't make sense, so instead it should revert to a standalone server and still work.
>
> J-D
>
> On Fri, Jun 22, 2012 at 7:20 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>> Hum... Seems that it's not working that way:
>>
>> ERROR [main:QuorumPeerConfig@283] - Invalid configuration, only one server specified (ignoring)
>>
>> So most probably the secondary should look exactly like the master, but I'm not 100% sure...
>>
>> 2012/6/22, Jean-Marc Spaggiari <[email protected]>:
>>> Ok. So if I understand correctly, I need:
>>> PC1 => HMaster (HBase), JobTracker (Hadoop), NameNode (Hadoop), and ZooKeeper (ZK)
>>> PC2 => Secondary NameNode (Hadoop)
>>> PC3 to x => DataNode (Hadoop), TaskTracker (Hadoop), RegionServer (HBase)
>>>
>>> For PC2, should I run ZooKeeper, JobTracker and a master too? Can I have 2 masters? Or do I run just the secondary name node?
>>>
>>> 2012/6/21, Michael Segel <[email protected]>:
>>>> If you have a really small cluster...
>>>> You can put your HMaster, JobTracker, NameNode, and ZooKeeper all on a single node. (Secondary too.)
>>>> Then you have data nodes that run DN, TT, and RS.
>>>>
>>>> That would solve any ZK/RS problems.
>>>>
>>>> On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:
>>>>
>>>>> Hi Mike, Hi Rob,
>>>>>
>>>>> Thanks for your replies and advice. Seems that now I'm due for some implementation. I'm reading Lars' book first, and when I'm done I will start with the coding.
>>>>>
>>>>> I already have my ZooKeeper/Hadoop/HBase running, and based on the first pages I read I already know it's not well done, since I have put a DataNode and a ZooKeeper server on ALL the servers ;) So, more reading for me for the next few days, and then I will start.
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> JM
>>>>>
>>>>> 2012/6/16, Rob Verkuylen <[email protected]>:
>>>>>> Just to add from my experience:
>>>>>>
>>>>>> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine can handle 3,000-4,000 puts a second with ease, and a simple timerange scan can give you the records you need. I have my doubts you will be hitting those volumes anytime soon. A simple setup will get you your PoC, and then you can scale when you need to scale.
>>>>>>
>>>>>> Rob
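A minimal sketch of the timerange scan Rob mentions above, written against a hypothetical "files" table with the 0.94-era HBase Java client. Note that server-side this still walks the table; the timerange only restricts which cells come back:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class StaleFileScan {
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "files");          // hypothetical table name
        try {
          long oneWeekAgo = System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000;
          Scan scan = new Scan();
          scan.setTimeRange(0, oneWeekAgo);                // only cells older than a week
          scan.setCaching(100);                            // batch rows per RPC
          ResultScanner scanner = table.getScanner(scan);
          try {
            for (Result r : scanner) {
              System.out.println(r);                       // process each "stale" row
            }
          } finally {
            scanner.close();
          }
        } finally {
          table.close();
        }
      }
    }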
>>>>>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel <[email protected]> wrote:
>>>>>>
>>>>>>> Jean-Marc,
>>>>>>>
>>>>>>> You indicated that you didn't want to do full table scans when you want to find out which files haven't been touched since X time has passed. (X could be months, weeks, days, hours, etc.)
>>>>>>>
>>>>>>> So here's the thing.
>>>>>>> First, I am not convinced that you will have hotspotting.
>>>>>>> Second, you end up having to do 26 scans instead of one, and then you need to join the result sets.
>>>>>>>
>>>>>>> Not really a good solution if you think about it.
>>>>>>>
>>>>>>> Oh, and I don't believe that you will be hitting a single region, although you may hit a region hard. (Your second table's key is the timestamp of the last update to the file. If the file hasn't been touched in a week, the probability is that at scale it won't be in the same region as a file that was recently touched.)
>>>>>>>
>>>>>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only be applied to a subset of problems. (Think round-robin partitioning in an RDBMS; DB2 was big on this.)
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>>>>>
>>>>>>>> Let's imagine the timestamp is "123456789".
>>>>>>>>
>>>>>>>> If I salt it with a letter from 'a' to 'z' then it will always be split between a few RegionServers; I will have keys like "t123456789". The issue is that I will have to do 26 queries to be able to find all the entries: I will need to query from a000000000 to axxxxxxxxx, then the same for 'b', and so on.
>>>>>>>>
>>>>>>>> So what's worse? Am I better off dealing with the hotspotting? Salting the key myself? Or what if I use something like HBaseWD?
>>>>>>>>
>>>>>>>> JM
>>>>>>>>
>>>>>>>> 2012/6/16, Michel Segel <[email protected]>:
>>>>>>>>> You can't salt the key in the second table. By salting the key, you lose the ability to do range scans, which is what you want to do.
>>>>>>>>>
>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>
>>>>>>>>> Mike Segel
>>>>>>>>>
>>>>>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks all for your comments and suggestions. Regarding the hotspotting, I will try to salt the key in the 2nd table and see the results.
>>>>>>>>>>
>>>>>>>>>> Yesterday I finished installing my 4-server cluster with old machines. It's slow, but it's working, so I will do some testing.
>>>>>>>>>>
>>>>>>>>>> You are recommending modifying the timestamp to be to the second or the minute and having more entries per row. Is that because it's better to have more columns than rows? Or is it more because that allows a more "squared" pattern (lots of rows, lots of columns), which is more efficient?
>>>>>>>>>>
>>>>>>>>>> JM
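For illustration, a rough sketch of the 'a' to 'z' salted read JM describes earlier in the thread: one scan per salt bucket, merged client-side. The index table and key layout are assumptions, and the loop makes Mike's objection concrete: it is exactly 26 scans whose results still need joining (and re-sorting) afterwards:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedIndexScan {
      // Fetch every index row whose timestamp part is below `cutoff`,
      // across all 26 salt buckets. Keys look like "t123456789".
      public static List<Result> scanAllBuckets(HTable index, String cutoff)
          throws IOException {
        List<Result> merged = new ArrayList<Result>();
        for (char salt = 'a'; salt <= 'z'; salt++) {
          Scan scan = new Scan(Bytes.toBytes(String.valueOf(salt)),  // bucket start
                               Bytes.toBytes(salt + cutoff));        // bucket end (exclusive)
          ResultScanner scanner = index.getScanner(scan);
          try {
            for (Result r : scanner) {
              merged.add(r);           // the client-side "join" of 26 result sets
            }
          } finally {
            scanner.close();
          }
        }
        return merged;                 // note: no longer globally sorted by time
      }
    }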
>>>>>>>>>> 2012/6/15, Michael Segel <[email protected]>:
>>>>>>>>>>> Thought about this a little bit more...
>>>>>>>>>>>
>>>>>>>>>>> You will want two tables for a solution.
>>>>>>>>>>>
>>>>>>>>>>> Table 1: Key: unique ID
>>>>>>>>>>>          Column: File Path         Value: full path to the file
>>>>>>>>>>>          Column: Last Update Time  Value: timestamp
>>>>>>>>>>>
>>>>>>>>>>> Table 2: Key: Last Update Time (the timestamp)
>>>>>>>>>>>          Columns 1-N: unique ID    Value: full path to the file
>>>>>>>>>>>
>>>>>>>>>>> Now if you want to get fancy, in Table 1 you could use the timestamp on the File Path column to hold the last update time. But it's probably easier for you to start by keeping the data as a separate column and ignore the timestamps on the columns for now.
>>>>>>>>>>>
>>>>>>>>>>> Note the following:
>>>>>>>>>>>
>>>>>>>>>>> 1) I used the notation "Columns 1-N" to reflect that for a given timestamp you may or may not have multiple files that were updated. (You weren't specific as to the scale.) This is a good example of HBase's column-oriented approach, where you may or may not have a column; it doesn't matter. :-) You could also modify the timestamp to be to the second or the minute and have more entries per row. It doesn't matter: you insert based on timestamp:columnName, value, so you will add a column to this table.
>>>>>>>>>>>
>>>>>>>>>>> 2) First prove that the logic works. You insert/update Table 1 to capture the ID of the file and its last update time. You then delete the old timestamp entry in Table 2, then insert the new entry in Table 2.
>>>>>>>>>>>
>>>>>>>>>>> 3) You store Table 2 in ascending order. Then, when you want to find your last 500 entries, you start a scan at 0x000 and limit the scan to 500 rows. Note that you may or may not have multiple entries per row, so as you walk through the result set you count the number of columns and stop when you have 500 columns, regardless of the number of rows you've processed.
>>>>>>>>>>>
>>>>>>>>>>> This should solve your problem and be pretty efficient. You can then work out the coprocessors and add them to the solution to be even more efficient.
>>>>>>>>>>>
>>>>>>>>>>> With respect to hotspotting, it can't be helped. You could hash your unique ID in Table 1; this will reduce the potential for a hotspot as the table splits. On Table 2, because you have temporal data and you want to efficiently scan a small portion of the table, you will always scan the first block. However, as data rolls off and compaction occurs, you will probably have to do some cleanup. I'm not sure how HBase handles regions that no longer contain data: when you compact an empty region, does it go away?
>>>>>>>>>>>
>>>>>>>>>>> By switching to coprocessors, you limit the update accesses to the second table, so you should still have pretty good performance.
>>>>>>>>>>>
>>>>>>>>>>> You may also want to look at asynchronous HBase; however, I don't know how well it will work with coprocessors, or whether you want to perform async operations in this specific use case.
>>>>>>>>>>>
>>>>>>>>>>> Good luck, HTH...
>>>>>>>>>>>
>>>>>>>>>>> -Mike
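A minimal sketch of the Table 2 read from point 3 above, assuming the 0.94-era Java client; the family name "f" and the 500 limit are the example's assumptions. Because the row key is the timestamp, a plain scan walks the index oldest-first, and we count columns rather than rows:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.NavigableMap;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class OldestFiles {
      private static final byte[] FAMILY = Bytes.toBytes("f");   // assumed family

      // Return up to `limit` file IDs, oldest first. A single timestamp row
      // may hold many file-ID columns, so the limit is applied per column.
      public static List<byte[]> oldest(HTable index, int limit) throws IOException {
        List<byte[]> fileIds = new ArrayList<byte[]>(limit);
        Scan scan = new Scan();                 // no start row: begin at the oldest key
        scan.setCaching(100);                   // fetch rows in batches per RPC
        ResultScanner scanner = index.getScanner(scan);
        try {
          for (Result row : scanner) {
            NavigableMap<byte[], byte[]> cols = row.getFamilyMap(FAMILY);
            if (cols == null) {
              continue;                         // no cells in this family
            }
            for (Map.Entry<byte[], byte[]> col : cols.entrySet()) {
              fileIds.add(col.getKey());        // qualifier = the file's unique ID
              if (fileIds.size() >= limit) {
                return fileIds;                 // stop at `limit` columns, mid-row if needed
              }
            }
          }
        } finally {
          scanner.close();
        }
        return fileIds;
      }
    }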
>>>>>>>>>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>
>>>>>>>>>>>> For now this is more a proof of concept than a production application. And if it works, it should grow a lot, and the database at the end will easily be over 1B rows. Each individual server will have to send its own information to one centralized server, which will insert it into a database. That's why it needs to be very quick, and that's why I'm looking in HBase's direction. I tried some relational databases with 4M rows in the table, but the insert time is too slow when I have to introduce entries in bulk. Also, HBase's ability to keep only the cells with values will save a lot of disk space (future projects).
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not yet used to HBase, and there are still many things I need to understand, but until I'm able to create a solution and test it, I will continue to read, learn and try that way. Then at the end I will be able to compare the 2 options I have (HBase or relational) and decide based on the results.
>>>>>>>>>>>>
>>>>>>>>>>>> So yes, your reply helped, because it gives me a way to achieve this goal (using coprocessors). I don't know yet how this part works, so I will dig into the documentation for it.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> JM
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>>> Jean-Marc,
>>>>>>>>>>>>>
>>>>>>>>>>>>> You do realize that this really isn't a good use case for HBase, assuming that what you are describing is a standalone system. It would be easier and better if you just used a simple relational database.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then you would have your table with an ID, and a secondary index on the timestamp. Retrieve the data in ascending order by timestamp and take the top 500 off the list.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you insist on using HBase, yes, you will have to have a secondary table. Then, using coprocessors: when you update the row in your base table, you then get() the row in your index by timestamp, removing the column for that rowid, and add the new column to the timestamp row.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you put it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now you can just do a partial scan on your index. Because your index table is so small... you shouldn't worry about hotspots. You may just want to rebuild your index every so often...
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Mike
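A rough sketch of the index move described above: delete the file's cell from the old timestamp row, then put it under the new timestamp row. The table layout and names are assumptions (row key = big-endian 8-byte timestamp, qualifier = file ID), and a client-side version like this ignores the races that a coprocessor-based or periodically rebuilt index would have to consider:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexMaintenance {
      private static final byte[] FAMILY = Bytes.toBytes("f");   // assumed family

      // Move fileId's index entry from the oldTs row to the newTs row.
      // Big-endian long keys mean an ascending scan reads oldest-first.
      public static void touch(HTable index, byte[] fileId, long oldTs, long newTs)
          throws IOException {
        Delete del = new Delete(Bytes.toBytes(oldTs));
        del.deleteColumns(FAMILY, fileId);      // drop the stale cell (all versions)
        index.delete(del);

        Put put = new Put(Bytes.toBytes(newTs));
        put.add(FAMILY, fileId, new byte[0]);   // dummy value; the path would work too
        index.put(put);
      }
    }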
>>>>>>>>>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for your feedback. Here are more details describing what I'm trying to achieve.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My goal is to store information about files in the database. I need to check the oldest files in the database to refresh their information.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The key is an 8-byte ID of the server on the network hosting the file + the MD5 of the file path. The total is a 24-byte key.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So each time I look at a file and gather its information, I update its row in the database based on the key, including a "last_update" field. I can calculate this key for any file on the drives.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In order to know which files I need to check on the network, I need to scan the table by the "last_update" field. So the idea is to build another table which contains the last_update as the key and the file IDs in columns. (Here is the hotspotting.)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Each time I work on a file, I will have to update the main table by ID, remove the cell from the second table (the index), and put it back with the new "last_update" key.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm mainly doing 3 operations in the database:
>>>>>>>>>>>>>> 1) I retrieve a list of 500 files which need to be updated.
>>>>>>>>>>>>>> 2) I update the information for those 500 files (bulk update).
>>>>>>>>>>>>>> 3) I load new file references to be checked.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For 2 and 3, I use the main table with the file ID as the key. The distribution is almost perfect because I'm using a hash. The prefix is the server ID, but the load is not always going to the same server, since the work is driven by last_update. And this allows quick access to the list of files from one server. For 1, I expected to build this second table with the "last_update" as the key.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regarding the frequency, it really depends on the activity on the network, but it should be "often". The faster the database updates are, the more up to date I will be able to keep it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JM
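For illustration, the 24-byte composite key JM describes, sketched in Java (the helper name is invented): 8 bytes of server ID followed by the 16-byte MD5 of the file path. Hashing the path is what gives the main table its near-uniform distribution:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FileKey {
      // 8-byte server ID + 16-byte MD5 of the file path = 24-byte row key.
      public static byte[] rowKey(long serverId, String filePath) {
        try {
          MessageDigest md5 = MessageDigest.getInstance("MD5");
          byte[] pathHash = md5.digest(Bytes.toBytes(filePath)); // 16 bytes
          return Bytes.add(Bytes.toBytes(serverId), pathHash);   // 8 + 16 bytes
        } catch (NoSuchAlgorithmException e) {
          throw new RuntimeException("MD5 not available", e);    // effectively never
        }
      }
    }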
>>>>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>>>>> Actually, I think you should revisit your key design....
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Look at your access path to the data for each of the types of queries you are going to run. From your post:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> "I have a table with a unique key, a file path and a 'last update' field. I can easily find back the file with the ID and find when it has been updated. But what I need too is to find the files not updated for more than a certain period of time."
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So your primary query is going to be against the key. Not sure if you meant to say that your key was a composite key or not... It sounds like your key is just the unique key and the rest are columns in the table.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The secondary query, or path to the data, is to find the files that were not updated for more than a period of time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you make your key temporal, that is, add time as a component of your key, you will end up creating new rows of data while the old row still exists. Not a good side effect.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The other nasty side effect of using time as your key is that you not only have the potential for hotspotting, but you also end up creating regions that will never grow.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> How often are you going to ask to see the files that were not updated in the last couple of days/minutes? If it's infrequent, then you shouldn't really care if you have to do a complete table scan.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wow! This is exactly what I was looking for. So I will read all of that now.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Need to read here, at the bottom:
>>>>>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>>>>>> and here:
>>>>>>>>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes up often... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Otis
>>>>>>>>>>>>>>>>> ----
>>>>>>>>>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I watched Lars George's video about HBase and read the documentation, and they say that it's not a good idea to have the timestamp as a key, because that will always load the same region until the timestamp reaches a certain value and moves to the next region (hotspotting).
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have a table with a unique key, a file path and a "last update" field.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can easily find back the file with the ID and find when it has been updated.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But what I need too is to find the files not updated for more than a certain period of time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If I want to retrieve that from this single table, I will have to do a full parse of the table, which might take a while.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So I thought of building a table to reference that (a kind of secondary index). The key is the "last update", there is one column family, and each column will have the ID of the file with a dummy content.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When a file is updated, I remove its cell from this table and introduce a new cell with the new timestamp as the key.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> And so on.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With this schema, I can find the files by ID very quickly, and I can find the files which need to be updated pretty quickly too. But it's hotspotting one region.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> From the video (0:45:10) I can see 4 options:
>>>>>>>>>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>>>>>>>>>> 2) Salting.
>>>>>>>>>>>>>>>>>> 3) Key field swap/promotion.
>>>>>>>>>>>>>>>>>> 4) Randomization.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I can do salting, like prefixing the timestamp with a number between 0 and 9, so that the load is distributed over 10 servers. To find all the files with a timestamp below a specific value, I will need to run 10 requests instead of one. But when the load becomes too big for 10 servers, will I have to prefix with a number between 0 and 99, which means 100 requests? The more regions I have, the more requests I will have to do. Is that really a good approach?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Key field swap is close to salting. I can add the first few bytes of the path before the timestamp, but the issue will remain the same.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I looked at randomization, and I can't do that; otherwise I will have no way to retrieve the information I'm looking for.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> So the question is: is there a good way to store the data so that I can retrieve it based on the date?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> JM
