A quorum with 2 members is worse than 1, so don't put a ZK on PC2. The exception you are seeing means that ZK is trying to form a quorum with only 1 machine; since that doesn't make sense, it should instead revert to a standalone server and still work.
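As a minimal sketch of what J-D is describing (hostnames and paths are placeholders, not from this thread): a standalone ZooKeeper `zoo.cfg` simply omits the `server.N` lines, while a real ensemble needs at least three of them — listing exactly one is what triggers the "only one server specified" warning quoted below.

```
# zoo.cfg for a standalone ZooKeeper: no server.N lines at all
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181

# zoo.cfg for a proper ensemble: use 3 or more servers, never 2
# server.1=pc1.example.com:2888:3888
# server.2=pc3.example.com:2888:3888
# server.3=pc4.example.com:2888:3888
```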
J-D

On Fri, Jun 22, 2012 at 7:20 PM, Jean-Marc Spaggiari <[email protected]> wrote:
> Hum... Seems that it's not working that way:
>
> ERROR [main:QuorumPeerConfig@283] - Invalid configuration, only one server specified (ignoring)
>
> So most probably the secondary should look exactly like the master, but I'm not 100% sure...
>
> 2012/6/22, Jean-Marc Spaggiari <[email protected]>:
>> Ok. So if I understand correctly, I need:
>> PC1 => HMaster (HBase), JobTracker (Hadoop), Name Node (Hadoop), and ZooKeeper (ZK)
>> PC2 => Secondary Name Node (Hadoop)
>> PC3 to x => Data Node (Hadoop), Task Tracker (Hadoop), Region Server (HBase)
>>
>> For PC2, should I run ZooKeeper, JobTracker and master too? Can I have 2 masters? Or do I just run the secondary name node?
>>
>> 2012/6/21, Michael Segel <[email protected]>:
>>> If you have a really small cluster...
>>> You can put your HMaster, JobTracker, Name Node, and ZooKeeper all on a single node. (Secondary too)
>>> Then you have Data Nodes that run DN, TT, and RS.
>>>
>>> That would solve any ZK RS problems.
>>>
>>> On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:
>>>
>>>> Hi Mike, Hi Rob,
>>>>
>>>> Thanks for your replies and advice. Seems that now I'm due for some implementation. I'm reading Lars' book first and when I'm done I will start with the coding.
>>>>
>>>> I already have my ZooKeeper/Hadoop/HBase running and based on the first pages I read, I already know it's not well done since I have put a DataNode and a ZooKeeper server on ALL the servers ;) So. More reading for me for the next few days, and then I will start.
>>>>
>>>> Thanks again!
>>>>
>>>> JM
>>>>
>>>> 2012/6/16, Rob Verkuylen <[email protected]>:
>>>>> Just to add from my experiences:
>>>>>
>>>>> Yes hotspotting is bad, but so are devops headaches. A reasonable machine can handle 3-4000 puts a second with ease, and a simple timerange scan can give you the records you need.
>>>>> I have my doubts you will be hitting these amounts anytime soon. A simple setup will get your PoC done, and then you can scale when you need to scale.
>>>>>
>>>>> Rob
>>>>>
>>>>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel <[email protected]> wrote:
>>>>>
>>>>>> Jean-Marc,
>>>>>>
>>>>>> You indicated that you didn't want to do full table scans when you want to find out which files hadn't been touched since X time has passed. (X could be months, weeks, days, hours, etc ...)
>>>>>>
>>>>>> So here's the thing.
>>>>>> First, I am not convinced that you will have hot spotting.
>>>>>> Second, you end up having to now do 26 scans instead of one. Then you need to join the result sets.
>>>>>>
>>>>>> Not really a good solution if you think about it.
>>>>>>
>>>>>> Oh and I don't believe that you will be hitting a single region, although you may hit a region hard.
>>>>>> (Your second table's key is on the timestamp of the last update to the file. If the file hadn't been touched in a week, there's the probability that at scale, it won't be in the same region as a file that had recently been touched.)
>>>>>>
>>>>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can only be applied to a subset of problems.
>>>>>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>>>>
>>>>>>> Let's imagine the timestamp is "123456789".
>>>>>>>
>>>>>>> If I salt it with a letter from 'a' to 'z' then it will always be split among a few RegionServers. I will have keys like "t123456789". The issue is that I will have to do 26 queries to be able to find all the entries. I will need to query from A000000000 to Axxxxxxxxx, then the same for B, and so on.
>>>>>>>
>>>>>>> So what's worse? Am I better off dealing with the hotspotting? Salting the key myself? Or what if I use something like HBaseWD?
>>>>>>>
>>>>>>> JM
>>>>>>>
>>>>>>> 2012/6/16, Michel Segel <[email protected]>:
>>>>>>>> You can't salt the key in the second table.
>>>>>>>> By salting the key, you lose the ability to do range scans, which is what you want to do.
>>>>>>>>
>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>
>>>>>>>> Mike Segel
>>>>>>>>
>>>>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks all for your comments and suggestions. Regarding the hotspotting I will try to salt the key in the 2nd table and see the results.
>>>>>>>>>
>>>>>>>>> Yesterday I finished installing my 4-server cluster with old machines. It's slow, but it's working. So I will do some testing.
>>>>>>>>>
>>>>>>>>> You are recommending modifying the timestamp to be to the second or minute and having more entries per row. Is that because it's better to have more columns than rows? Or is it more because that will allow a more "squared" pattern (lots of rows, lots of columns) which is more efficient?
>>>>>>>>>
>>>>>>>>> JM
>>>>>>>>>
>>>>>>>>> 2012/6/15, Michael Segel <[email protected]>:
>>>>>>>>>> Thought about this a little bit more...
>>>>>>>>>>
>>>>>>>>>> You will want two tables for a solution.
>>>>>>>>>>
>>>>>>>>>> Table 1: Key: Unique ID
>>>>>>>>>>          Column: File Path          Value: full path to the file
>>>>>>>>>>          Column: Last Update Time   Value: timestamp
>>>>>>>>>>
>>>>>>>>>> Table 2: Key: Last Update Time (the timestamp)
>>>>>>>>>>          Column 1-N: Unique ID      Value: full path to the file
>>>>>>>>>>
>>>>>>>>>> Now if you want to get fancy, in Table 1, you could use the timestamp on the column File Path to hold the last update time. But it's probably easier for you to start by keeping the data as a separate column and ignore the timestamps on the columns for now.
>>>>>>>>>>
>>>>>>>>>> Note the following:
>>>>>>>>>>
>>>>>>>>>> 1) I used the notation Column 1-N to reflect that for a given timestamp you may or may not have multiple files that were updated. (You weren't specific as to the scale.) This is a good example of HBase's column-oriented approach where you may or may not have a column. It doesn't matter. :-) You could also modify the timestamp to be to the second or minute and have more entries per row. It doesn't matter. You insert based on timestamp:columnName, value, so you will add a column to this table.
>>>>>>>>>>
>>>>>>>>>> 2) First prove that the logic works. You insert/update table 1 to capture the ID of the file and its last update time. You then delete the old timestamp entry in table 2, then insert the new entry in table 2.
>>>>>>>>>>
>>>>>>>>>> 3) You store Table 2 in ascending order. Then when you want to find your last 500 entries, you do a start scan at 0x000 and then limit the scan to 500 rows. Note that any row may or may not have multiple entries, so as you walk through the result set, you count the number of columns and stop when you have 500 columns, regardless of the number of rows you've processed.
>>>>>>>>>>
>>>>>>>>>> This should solve your problem and be pretty efficient. You can then work out the coprocessors and add them to the solution to be even more efficient.
>>>>>>>>>>
>>>>>>>>>> With respect to 'hot-spotting', it can't be helped. You could hash your unique ID in table 1; this will reduce the potential of a hotspot as the table splits. On table 2, because you have temporal data and you want to efficiently scan a small portion of the table based on size, you will always scan the first block; however, as data rolls off and compaction occurs, you will probably have to do some cleanup. I'm not sure how HBase handles regions that no longer contain data. When you compact an empty region, does it go away?
>>>>>>>>>>
>>>>>>>>>> By switching to coprocessors, you now limit the update accesses to the second table, so you should still have pretty good performance.
>>>>>>>>>>
>>>>>>>>>> You may also want to look at Asynchronous HBase (asynchbase); however, I don't know how well it will work with coprocessors, or if you even want to perform async operations in this specific use case.
>>>>>>>>>>
>>>>>>>>>> Good luck, HTH...
>>>>>>>>>>
>>>>>>>>>> -Mike
>>>>>>>>>>
>>>>>>>>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>
>>>>>>>>>>> For now this is more a proof of concept than a production application.
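The two-table design Mike describes can be sketched in pure Python. This only models the bookkeeping (table 1 keyed by file ID, table 2 keyed by last-update timestamp, delete-old/insert-new on each touch, scan ascending and count columns until 500); it is not HBase client code, and all names are made up for illustration.

```python
import bisect

class FileIndex:
    """Toy model of the two-table design: table 1 keyed by file ID,
    table 2 keyed by last-update timestamp with one column per file."""

    def __init__(self):
        self.by_id = {}    # table 1: file_id -> (path, last_update)
        self.by_ts = {}    # table 2: timestamp -> {file_id: path}
        self.ts_keys = []  # sorted timestamps (HBase stores row keys sorted)

    def touch(self, file_id, path, ts):
        # Update table 1, then move the file's column in table 2.
        old = self.by_id.get(file_id)
        if old is not None:
            _, old_ts = old
            cols = self.by_ts[old_ts]
            del cols[file_id]            # delete the old index entry
            if not cols:                 # row is now empty: drop it
                del self.by_ts[old_ts]
                self.ts_keys.remove(old_ts)
        self.by_id[file_id] = (path, ts)
        if ts not in self.by_ts:
            self.by_ts[ts] = {}
            bisect.insort(self.ts_keys, ts)
        self.by_ts[ts][file_id] = path   # insert the new index entry

    def oldest(self, n):
        # Scan table 2 in ascending key order, counting COLUMNS (not rows)
        # and stopping once n files have been collected.
        out = []
        for ts in self.ts_keys:
            for file_id, path in self.by_ts[ts].items():
                out.append((ts, file_id, path))
                if len(out) == n:
                    return out
        return out
```

In a real deployment `touch` would be a put on table 1 plus a delete and a put on table 2 (the step a coprocessor could take over), and `oldest` would be a scan with a row limit plus client-side column counting.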
>>>>>>>>>>> And if it's working, it should grow a lot, and the database at the end will easily be over 1B rows. Each individual server will have to send its own information to one centralized server which will insert it into a database. That's why it needs to be very quick and that's why I'm looking in HBase's direction. I tried some relational databases with 4M rows in the table, but the insert time is too slow when I have to introduce entries in bulk. Also, the ability for HBase to keep only the cells with values will allow me to save a lot on disk space (future projects).
>>>>>>>>>>>
>>>>>>>>>>> I'm not yet used to HBase and there are still many things I need to understand, but until I'm able to create a solution and test it, I will continue to read, learn and try that way. Then at the end I will be able to compare the 2 options I have (HBase or relational) and decide based on the results.
>>>>>>>>>>>
>>>>>>>>>>> So yes, your reply helped because it's giving me a way to achieve this goal (using co-processors). I don't know yet how this part works, so I will dig into the documentation for it.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> JM
>>>>>>>>>>>
>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>> Jean-Marc,
>>>>>>>>>>>>
>>>>>>>>>>>> You do realize that this really isn't a good use case for HBase, assuming that what you are describing is a stand-alone system. It would be easier and better if you just used a simple relational database.
>>>>>>>>>>>>
>>>>>>>>>>>> Then you would have your table with an ID, and a secondary index on the timestamp. Retrieve the data in ascending order by timestamp and take the top 500 off the list.
>>>>>>>>>>>>
>>>>>>>>>>>> If you insist on using HBase, yes, you will have to have a secondary table. Then using co-processors...
>>>>>>>>>>>> When you update the row in your base table, you then get() the row in your index by timestamp, removing the column for that rowid. Add the new column to the timestamp row.
>>>>>>>>>>>>
>>>>>>>>>>>> As you put it.
>>>>>>>>>>>>
>>>>>>>>>>>> Now you can just do a partial scan on your index. Because your index table is so small... you shouldn't worry about hotspots. You may just want to rebuild your index every so often...
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> -Mike
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your feedback. Here are more details to describe what I'm trying to achieve.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My goal is to store information about files in the database. I need to check the oldest files in the database to refresh their information.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The key is an 8-byte ID of the server name in the network hosting the file + the MD5 of the file path. Total is a 24-byte key.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So each time I look at a file and gather the information, I update its row in the database based on the key, including a "last_update" field. I can calculate this key for any file on the drives.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In order to know which files I need to check in the network, I need to scan the table by the "last_update" field. So the idea is to build another table which contains the last_update as a key and the file IDs in columns. (Here is the hotspotting.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Each time I work on a file, I will have to update the main table by ID, remove the cell from the second table (the index), and put it back with the new "last_update" key.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm mainly doing 3 operations in the database:
>>>>>>>>>>>>> 1) I retrieve a list of 500 files which need to be updated.
>>>>>>>>>>>>> 2) I update the information for those 500 files (bulk update).
>>>>>>>>>>>>> 3) I load new file references to be checked.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For 2 and 3, I use the main table with the file ID as the key. The distribution is almost perfect because I'm using a hash. The prefix is the server ID, but it's not always going to the same server since the work is ordered by last_update. But this allows quick access to the list of files from one server.
>>>>>>>>>>>>> For 1, I expected to build this second table with the "last_update" as the key.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regarding the frequency, it really depends on the activity on the network, but it should be "often". The faster the database update is, the more up to date I will be able to keep it.
>>>>>>>>>>>>> JM
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>>>> Actually I think you should revisit your key design....
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Look at your access path to the data for each of the types of queries you are going to run. From your post:
>>>>>>>>>>>>>> "I have a table with a unique key, a file path and a "last update" field. I can easily find back the file with the ID and find when it has been updated. But what I need too is to find the files not updated for more than a certain period of time."
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So your primary query is going to be against the key. Not sure if you meant to say that your key was a composite key or not... sounds like your key is just the unique key and the rest are columns in the table.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The secondary query or path to the data is to find data where the files were not updated for more than a period of time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you make your key temporal, that is, adding time as a component of your key, you will end up creating new rows of data while the old row still exists. Not a good side effect.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The other nasty side effect of using time as your key is that you not only have the potential for hot spotting, but you also have the nasty side effect of creating splits that will never grow.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> How often are you going to ask to see the files that were not updated in the last couple of days/minutes? If it's infrequent, then you really shouldn't care if you have to do a complete table scan.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Wow! This is exactly what I was looking for. So I will read all of that now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Need to read here at the bottom: https://github.com/sematext/HBaseWD
>>>>>>>>>>>>>>> and here: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes up often.... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Otis
>>>>>>>>>>>>>>>> ----
>>>>>>>>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I watched Lars George's video about HBase and read the documentation, and it says that it's not a good idea to have the timestamp as a key, because that will always load the same region until the timestamp reaches a certain value and moves to the next region (hotspotting).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have a table with a unique key, a file path and a "last update" field. I can easily find back the file with the ID and find when it has been updated.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> But what I need too is to find the files not updated for more than a certain period of time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If I want to retrieve that from this single table, I will have to do a full parse of the table, which might take a while.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So I thought of building a table to reference that (a kind of secondary index).
>>>>>>>>>>>>>>>>> The key is the "last update", one CF, and each column will have the ID of the file with dummy content.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When a file is updated, I remove its cell from this table, and introduce a new cell with the new timestamp as the key.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And so on.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With this schema, I can find the files by ID very quickly and I can find the files which need to be updated pretty quickly too. But it's hotspotting one region.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> From the video (0:45:10) I can see 4 options:
>>>>>>>>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>>>>>>>>> 2) Salting.
>>>>>>>>>>>>>>>>> 3) Key field swap/promotion.
>>>>>>>>>>>>>>>>> 4) Randomization.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I can do salting. Like prefix the timestamp with a number between 0 and 9, so that will distribute the load over 10 servers. To find all the files with a timestamp below a specific value, I will need to run 10 requests instead of one. But when the load becomes too big for 10 servers, will I have to prefix with a number between 0 and 99? Which means 100 requests? And the more regions I have, the more requests I will have to do. Is that really a good approach?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Key field swap is close to salting. I can add the first few bytes of the path before the timestamp, but the issue will remain the same.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I looked at randomization, and I can't do that; otherwise I will have no way to retrieve the information I'm looking for.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So the question is: is there a good way to store the data to retrieve it based on the date?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> JM
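The salting trade-off discussed throughout the thread (one hot region vs. N parallel scans that must be merged) can be sketched as follows. This is a pure-Python illustration, not HBase client code; the bucket count and key layout are made up for the example.

```python
def salted_key(ts, file_id, buckets=10):
    """Prefix the timestamp key with a deterministic salt bucket so
    consecutive timestamps spread across `buckets` key ranges."""
    salt = hash(file_id) % buckets            # deterministic per file
    return "%d-%010d-%s" % (salt, ts, file_id)

def scan_oldest(rows, limit, buckets=10):
    """Reading back requires one range scan per bucket; the partial
    results are merged and re-sorted by the embedded timestamp."""
    merged = []
    for b in range(buckets):
        prefix = "%d-" % b
        bucket_rows = [k for k in rows if k.startswith(prefix)]
        # each "scan" only needs the first `limit` rows of its bucket
        merged.extend(sorted(bucket_rows)[:limit])
    # strip the salt and sort by the zero-padded timestamp to recover
    # the true oldest-first order across all buckets
    merged.sort(key=lambda k: k.split("-", 1)[1])
    return merged[:limit]
```

This makes JM's cost concern concrete: writes spread over `buckets` key ranges, but every "oldest N" query fans out into `buckets` scans whose results must be merged client-side, and growing the bucket count grows the fan-out.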
