Hmm... It seems it doesn't work that way: ERROR [main:QuorumPeerConfig@283] - Invalid configuration, only one server specified (ignoring)
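For reference, that error comes from zoo.cfg containing only a single server.N entry: ZooKeeper then ignores the replication settings and runs standalone. A minimal sketch of a replicated (quorum) configuration — hostnames and paths are placeholders — would look like:

```
# conf/zoo.cfg -- sketch of a replicated setup; a quorum needs an odd
# number (usually 3+) of server.N lines, identical on every ZK node.
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=zk-host1:2888:3888
server.2=zk-host2:2888:3888
server.3=zk-host3:2888:3888
```

Each node also needs a `myid` file under dataDir holding its own N.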
So most probably the secondary should look exactly like the master, but I'm not 100% sure...

2012/6/22, Jean-Marc Spaggiari <[email protected]>:
> Ok. So if I understand correctly, I need:
> PC1 => HMaster (HBase), JobTracker (Hadoop), Name Node (Hadoop), and ZooKeeper (ZK)
> PC2 => Secondary Name Node (Hadoop)
> PC3 to x => Data Node (Hadoop), Task Tracker (Hadoop), Region Server (HBase)
>
> For PC2, should I run ZooKeeper, JobTracker and master too? Can I have 2 masters? Or do I just run the secondary name node?
>
> 2012/6/21, Michael Segel <[email protected]>:
>> If you have a really small cluster...
>> You can put your HMaster, JobTracker, Name Node, and ZooKeeper all on a single node. (Secondary too)
>> Then you have Data Nodes that run DN, TT, and RS.
>>
>> That would solve any ZK RS problems.
>>
>> On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:
>>
>>> Hi Mike, Hi Rob,
>>>
>>> Thanks for your replies and advice. It seems that now I'm due for some implementation. I'm reading Lars' book first, and when I'm done I will start with the coding.
>>>
>>> I already have my ZooKeeper/Hadoop/HBase running, and based on the first pages I read, I already know it's not well done, since I have put a DataNode and a ZooKeeper server on ALL the servers ;) So: more reading for me for the next few days, and then I will start.
>>>
>>> Thanks again!
>>>
>>> JM
>>>
>>> 2012/6/16, Rob Verkuylen <[email protected]>:
>>>> Just to add from my experience:
>>>>
>>>> Yes, hotspotting is bad, but so are devops headaches. A reasonable machine can handle 3-4000 puts a second with ease, and a simple timerange scan can give you the records you need. I have my doubts you will be hitting these amounts anytime soon. A simple setup will get your PoC going, and then you can scale when you need to scale.
>>>>
>>>> Rob
>>>>
>>>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel <[email protected]> wrote:
>>>>
>>>>> Jean-Marc,
>>>>>
>>>>> You indicated that you didn't want to do full table scans when you want to find out which files hadn't been touched since X time has passed.
>>>>> (X could be months, weeks, days, hours, etc...)
>>>>>
>>>>> So here's the thing.
>>>>> First, I am not convinced that you will have hot spotting.
>>>>> Second, you end up having to do 26 scans instead of one. Then you need to join the result sets.
>>>>>
>>>>> Not really a good solution if you think about it.
>>>>>
>>>>> Oh, and I don't believe that you will be hitting a single region, although you may hit a region hard.
>>>>> (Your second table's key is the timestamp of the last update to the file. If the file hadn't been touched in a week, there's a good probability that at scale it won't be in the same region as a file that had recently been touched.)
>>>>>
>>>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only be applied to a subset of problems.
>>>>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>>>
>>>>>> Let's imagine the timestamp is "123456789".
>>>>>>
>>>>>> If I salt it with a letter from 'a' to 'z' then it will always be split between a few RegionServers. I will have something like "t123456789". The issue is that I will have to do 26 queries to be able to find all the entries. I will need to query from A000000000 to Axxxxxxxxx, then the same for B, and so on.
>>>>>>
>>>>>> So what's worse? Am I better off dealing with the hotspotting? Salting the key myself? Or what if I use something like HBaseWD?
>>>>>>
>>>>>> JM
>>>>>>
>>>>>> 2012/6/16, Michel Segel <[email protected]>:
>>>>>>> You can't salt the key in the second table.
>>>>>>> By salting the key, you lose the ability to do range scans, which is what you want to do.
>>>>>>>
>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>
>>>>>>> Mike Segel
>>>>>>>
>>>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks all for your comments and suggestions. Regarding the hotspotting, I will try to salt the key in the 2nd table and see the results.
>>>>>>>>
>>>>>>>> Yesterday I finished installing my 4-server cluster with old machines. It's slow, but it's working. So I will do some testing.
>>>>>>>>
>>>>>>>> You are recommending to modify the timestamp to be to the second or minute and have more entries per row. Is that because it's better to have more columns than rows? Or is it more because that will allow a more "squared" pattern (lots of rows, lots of columns), which is more efficient?
>>>>>>>>
>>>>>>>> JM
>>>>>>>>
>>>>>>>> 2012/6/15, Michael Segel <[email protected]>:
>>>>>>>>> Thought about this a little bit more...
>>>>>>>>>
>>>>>>>>> You will want two tables for a solution.
>>>>>>>>>
>>>>>>>>> Table 1: Key: Unique ID
>>>>>>>>>          Column: FilePath           Value: full path to the file
>>>>>>>>>          Column: Last Update Time   Value: timestamp
>>>>>>>>>
>>>>>>>>> Table 2: Key: Last Update Time (the timestamp)
>>>>>>>>>          Column 1-N: Unique ID      Value: full path to the file
>>>>>>>>>
>>>>>>>>> Now if you want to get fancy, in Table 1 you could use the timestamp on the column FilePath to hold the last update time.
>>>>>>>>> But it's probably easier for you to start by keeping the data as a separate column and ignore the timestamps on the columns for now.
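The two-table bookkeeping described above can be sketched in a few lines, using plain dicts to stand in for the two HBase tables (a real implementation would use the HBase client's Put/Delete; the names here are illustrative only):

```python
# Sketch of the two-table design:
#   table1 maps unique ID -> {"path": ..., "last_update": ...}
#   table2 maps last_update timestamp -> {unique ID: path}
table1 = {}
table2 = {}

def touch(file_id, path, new_ts):
    """Record that file_id was updated at new_ts, keeping table2 in sync."""
    old = table1.get(file_id)
    if old is not None:
        # Delete the old timestamp entry in table 2 first...
        cols = table2.get(old["last_update"])
        if cols is not None:
            cols.pop(file_id, None)
            if not cols:
                del table2[old["last_update"]]
    # ...then insert/update table 1 and insert the new entry in table 2.
    table1[file_id] = {"path": path, "last_update": new_ts}
    table2.setdefault(new_ts, {})[file_id] = path
```

The essential invariant is that every file appears under exactly one timestamp row in table 2, which is what the later coprocessor discussion automates.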
>>>>>>>>>
>>>>>>>>> Note the following:
>>>>>>>>>
>>>>>>>>> 1) I used the notation Column 1-N to reflect that for a given timestamp you may or may not have multiple files that were updated. (You weren't specific as to the scale.)
>>>>>>>>> This is a good example of HBase's column-oriented approach, where you may or may not have a column. It doesn't matter. :-) You could also modify the timestamp to be to the second or minute and have more entries per row. It doesn't matter. You insert based on timestamp:columnName, value, so you will add a column to this table.
>>>>>>>>>
>>>>>>>>> 2) First prove that the logic works. You insert/update table 1 to capture the ID of the file and its last update time. You then delete the old timestamp entry in table 2, then insert the new entry in table 2.
>>>>>>>>>
>>>>>>>>> 3) You store Table 2 in ascending order. Then when you want to find your last 500 entries, you start a scan at 0x000 and limit the scan to 500 rows. Note that you may or may not have multiple entries, so as you walk through the result set, you count the number of columns and stop when you have 500 columns, regardless of the number of rows you've processed.
>>>>>>>>>
>>>>>>>>> This should solve your problem and be pretty efficient.
>>>>>>>>> You can then work out the coprocessors and add them to the solution to be even more efficient.
>>>>>>>>>
>>>>>>>>> With respect to 'hot-spotting', it can't be helped. You could hash your unique ID in table 1; this will reduce the potential of a hotspot as the table splits.
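Point 3's "count columns, not rows" logic can be sketched like this, with a pre-sorted list of (timestamp, columns) pairs standing in for the ascending scan over table 2 (a real version would iterate an HBase Scan's results):

```python
def first_n_entries(index_rows, n=500):
    """Walk rows in ascending-timestamp order, collecting (ts, file_id)
    pairs column by column, and stop once n columns have been seen,
    regardless of how many rows that took.

    index_rows: iterable of (timestamp, {file_id: path}) pairs,
    pre-sorted ascending, standing in for a scan over table 2.
    """
    out = []
    for ts, columns in index_rows:
        for file_id in columns:
            out.append((ts, file_id))
            if len(out) == n:
                return out
    return out
```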
>>>>>>>>> On table 2, because you have temporal data and you want to efficiently scan a small portion of the table based on size, you will always scan the first block. However, as data rolls off and compaction occurs, you will probably have to do some cleanup. I'm not sure how HBase handles splits that no longer contain data. When you compact an empty split, does it go away?
>>>>>>>>>
>>>>>>>>> By switching to coprocessors, you now limit the update accesses to the second table, so you should still have pretty good performance.
>>>>>>>>>
>>>>>>>>> You may also want to look at Asynchronous HBase; however, I don't know how well it will work with coprocessors, or whether you want to perform async operations in this specific use case.
>>>>>>>>>
>>>>>>>>> Good luck, HTH...
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>>>>>>>
>>>>>>>>>> Hi Michael,
>>>>>>>>>>
>>>>>>>>>> For now this is more a proof of concept than a production application. And if it works, it should grow a lot, and the database in the end will easily be over 1B rows. Each individual server will have to send its own information to one centralized server, which will insert it into a database. That's why it needs to be very quick, and that's why I'm looking in HBase's direction. I tried with some relational databases with 4M rows in the table, but the insert time is too slow when I have to introduce entries in bulk. Also, the ability for HBase to keep only the cells with values will allow saving a lot of disk space (future projects).
>>>>>>>>>>
>>>>>>>>>> I'm not yet used to HBase and there are still many things I need to understand, but until I'm able to create a solution and test it, I will continue to read, learn and try that way. Then at the end I will be able to compare the 2 options I have (HBase or relational) and decide based on the results.
>>>>>>>>>>
>>>>>>>>>> So yes, your reply helped, because it gives me a way to achieve this goal (using co-processors). I don't know yet how this part works, so I will dig into the documentation for it.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> JM
>>>>>>>>>>
>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>> Jean-Marc,
>>>>>>>>>>>
>>>>>>>>>>> You do realize that this really isn't a good use case for HBase, assuming that what you are describing is a stand-alone system.
>>>>>>>>>>> It would be easier and better if you just used a simple relational database.
>>>>>>>>>>>
>>>>>>>>>>> Then you would have your table with an ID, and a secondary index on the timestamp.
>>>>>>>>>>> Retrieve the data in ascending order by timestamp and take the top 500 off the list.
>>>>>>>>>>>
>>>>>>>>>>> If you insist on using HBase, yes, you will have to have a secondary table.
>>>>>>>>>>> Then using co-processors...
>>>>>>>>>>> When you update the row in your base table, you then get() the row in your index by timestamp, removing the column for that rowid.
>>>>>>>>>>> Add the new column to the timestamp row.
>>>>>>>>>>>
>>>>>>>>>>> As you put it.
>>>>>>>>>>>
>>>>>>>>>>> Now you can just do a partial scan on your index. Because your index table is so small... you shouldn't worry about hotspots.
>>>>>>>>>>> You may just want to rebuild your index every so often...
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> -Mike
>>>>>>>>>>>
>>>>>>>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your feedback. Here are more details to describe what I'm trying to achieve.
>>>>>>>>>>>>
>>>>>>>>>>>> My goal is to store information about files in the database. I need to check the oldest files in the database to refresh the information.
>>>>>>>>>>>>
>>>>>>>>>>>> The key is an 8-byte ID of the server in the network hosting the file + the MD5 of the file path. Total is a 24-byte key.
>>>>>>>>>>>>
>>>>>>>>>>>> So each time I look at a file and gather the information, I update its row in the database based on the key, including a "last_update" field. I can calculate this key for any file in the drives.
>>>>>>>>>>>>
>>>>>>>>>>>> In order to know which file I need to check in the network, I need to scan the table by the "last_update" field. So the idea is to build another table which contains the last_update as a key and the file IDs in columns. (Here is the hotspotting.)
>>>>>>>>>>>>
>>>>>>>>>>>> Each time I work on a file, I will have to update the main table by ID and remove the cell from the second table (the index) and put it back with the new "last_update" key.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm mainly doing 3 operations in the database:
>>>>>>>>>>>> 1) I retrieve a list of 500 files which need to be updated
>>>>>>>>>>>> 2) I update the information for those 500 files (bulk update)
>>>>>>>>>>>> 3) I load new file references to be checked.
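The 24-byte composite key described above (8-byte server ID + 16-byte MD5 of the file path) could be built as follows; the big-endian byte order is an assumption, since the thread doesn't specify one:

```python
import hashlib
import struct

def make_row_key(server_id, file_path):
    """Build the 24-byte row key: 8-byte server ID + MD5(file path).

    MD5 digests are always 16 bytes, so the key length is fixed at 24.
    The ">Q" (big-endian unsigned 64-bit) packing is an assumed convention.
    """
    return struct.pack(">Q", server_id) + \
        hashlib.md5(file_path.encode("utf-8")).digest()
```

Because the key is derived purely from (server, path), it can be recomputed for any file without a lookup, which is what makes the direct get/update on the main table work.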
>>>>>>>>>>>>
>>>>>>>>>>>> For 2 and 3, I use the main table with the file ID as the key. The distribution is almost perfect because I'm using a hash. The prefix is the server ID, but it's not always going to the same server since it's done by last_update. But this allows quick access to the list of files from one server.
>>>>>>>>>>>> For 1, I expect to build this second table with the "last_update" as the key.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding the frequency, it really depends on the activity on the network, but it should be "often". The faster the database updates are, the more up to date I will be able to keep it.
>>>>>>>>>>>>
>>>>>>>>>>>> JM
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>>> Actually I think you should revisit your key design....
>>>>>>>>>>>>>
>>>>>>>>>>>>> Look at your access path to the data for each of the types of queries you are going to run.
>>>>>>>>>>>>> From your post:
>>>>>>>>>>>>> "I have a table with a unique key, a file path and a "last update" field. I can easily find back the file with the ID and find when it has been updated.
>>>>>>>>>>>>> But what I need too is to find the files not updated for more than a certain period of time."
>>>>>>>>>>>>> So your primary query is going to be against the key.
>>>>>>>>>>>>> Not sure if you meant to say that your key was a composite key or not... sounds like your key is just the unique key and the rest are columns in the table.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The secondary query, or path to the data, is to find data where the files were not updated for more than a period of time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you make your key temporal, that is, adding time as a component of your key, you will end up creating new rows of data while the old row still exists.
>>>>>>>>>>>>> Not a good side effect.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The other nasty side effect of using time as your key is that you not only have the potential for hot spotting, but you also have the nasty side effect of creating splits that will never grow.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How often are you going to ask to see the files that were not updated in the last couple of days/minutes? If it's infrequent, then you really shouldn't care if you have to do a complete table scan.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wow! This is exactly what I was looking for. So I will read all of that now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Need to read here at the bottom:
>>>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>>>> and here:
>>>>>>>>>>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes up often... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Otis
>>>>>>>>>>>>>>> ----
>>>>>>>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I watched Lars George's video about HBase and read the documentation, and it says that it's not a good idea to have the timestamp as a key, because that will always load the same region until the timestamp reaches a certain value and moves to the next region (hotspotting).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a table with a unique key, a file path and a "last update" field. I can easily find back the file with the ID and find when it has been updated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But what I need too is to find the files not updated for more than a certain period of time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I want to retrieve that from this single table, I will have to do a full scan of the table. Which might take a while.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So I thought of building a table to reference that (a kind of secondary index). The key is the "last update", one CF, and each column will have the ID of the file with dummy content.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When a file is updated, I remove its cell from this table, and introduce a new cell with the new timestamp as the key.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And so on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With this schema, I can find the files by ID very quickly, and I can find the files which need to be updated pretty quickly too. But it's hotspotting one region.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the video (0:45:10) I can see 4 situations:
>>>>>>>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>>>>>>>> 2) Salting.
>>>>>>>>>>>>>>>> 3) Key field swap/promotion.
>>>>>>>>>>>>>>>> 4) Randomization.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can do salting, like prefixing the timestamp with a number between 0 and 9. That will distribute the load over 10 servers.
>>>>>>>>>>>>>>>> To find all the files with a timestamp below a specific value, I will need to run 10 requests instead of one. But when the load becomes too big for 10 servers, will I have to prefix with a number between 0 and 99? Which means 100 requests? And the more regions I have, the more requests I will have to do. Is that really a good approach?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Key field swap is close to salting. I can add the first few bytes from the path before the timestamp, but the issue will remain the same.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I looked at randomization, and I can't do that; otherwise I will have no way to retrieve the information I'm looking for.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So the question is: is there a good way to store the data to retrieve it based on the date?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> JM
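The 0-9 salting scheme debated in this thread can be sketched as follows: writes spread keys across salt buckets, and a "files older than X" read fans out into one range scan per bucket and merges the results. This is plain Python with a dict standing in for the index table, and the choice of salt function is a hypothetical one (a real version would run 10 HBase Scans, one per prefix):

```python
SALT_BUCKETS = 10

def salted_key(timestamp):
    """Prefix the timestamp with one of 10 salts, derived from the value
    itself so the same timestamp always lands in the same bucket.
    (Deriving the salt via hash() is an illustrative choice.)"""
    salt = hash(timestamp) % SALT_BUCKETS
    return "%d-%010d" % (salt, timestamp)

def scan_older_than(store, cutoff):
    """Fan out: one range scan per salt bucket, then merge the results.
    `store` maps salted_key -> file_id, standing in for the index table."""
    results = []
    for salt in range(SALT_BUCKETS):
        lo, hi = "%d-" % salt, "%d-%010d" % (salt, cutoff)
        results.extend(fid for k, fid in store.items() if lo <= k < hi)
    return results
```

This makes the trade-off concrete: every read multiplies into SALT_BUCKETS scans, which is exactly the 10-vs-100-requests concern raised above.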
