Hum... Seems that it's not working that way:

ERROR [main:QuorumPeerConfig@283] - Invalid configuration, only one
server specified (ignoring)

So most probably the secondary should look exactly like the master,
but I'm not 100% sure...
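For reference, that QuorumPeerConfig error appears when zoo.cfg lists a single server.N entry in replicated form; ZooKeeper then ignores it and falls back to standalone mode. A replicated ensemble needs several (ideally an odd number of) servers, each sharing the same zoo.cfg. A sketch with placeholder hostnames:

```
# zoo.cfg - identical on every ZooKeeper node
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
# quorum members: server.N=host:peerPort:leaderElectionPort
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each node additionally needs a `myid` file in dataDir containing its own N.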

2012/6/22, Jean-Marc Spaggiari <[email protected]>:
> Ok. So if I understand correctly, I need:
> PC1 => HMaster (HBase), JobTracker (Hadoop), Name Node (Hadoop), and
> ZooKeeper (ZK)
> PC2 => Secondary Name Node (Hadoop)
> PC3 to x => Data Node (Hadoop), Task Tracker (Hadoop), Region Server
> (HBase)
>
> For PC2, should I run ZooKeeper, JobTracker and a master too? Can I
> have 2 masters? Or should I just run the secondary name node?
>
> 2012/6/21, Michael Segel <[email protected]>:
>> If you have a really small cluster...
>> You can put your HMaster, JobTracker, Name Node, and ZooKeeper all on a
>> single node. (Secondary too)
>> Then you have Data Nodes that run DN, TT, and RS.
>>
>> That would solve any ZK RS problems.
>>
>> On Jun 21, 2012, at 6:43 AM, Jean-Marc Spaggiari wrote:
>>
>>> Hi Mike, Hi Rob,
>>>
>>> Thanks for your replies and advice. Seems that now I'm due for some
>>> implementation. I'm reading Lars' book first, and when I'm done I
>>> will start with the coding.
>>>
>>> I already have my Zookeeper/Hadoop/HBase running and based on the
>>> first pages I read, I already know it's not well done since I have put
>>> a DataNode and a Zookeeper server on ALL the servers ;) So. More
>>> reading for me for the next few days, and then I will start.
>>>
>>> Thanks again!
>>>
>>> JM
>>>
>>> 2012/6/16, Rob Verkuylen <[email protected]>:
>>>> Just to add from my experiences:
>>>>
>>>> Yes, hotspotting is bad, but so are devops headaches. A reasonable
>>>> machine can handle 3-4,000 puts a second with ease, and a simple
>>>> timerange scan can give you the records you need. I have my doubts
>>>> you will be hitting these amounts anytime soon. A simple setup will
>>>> get your PoC done, and then you can scale when you need to scale.
>>>>
>>>> Rob
>>>>
>>>> On Sat, Jun 16, 2012 at 6:33 PM, Michael Segel
>>>> <[email protected]>wrote:
>>>>
>>>>> Jean-Marc,
>>>>>
>>>>> You indicated that you didn't want to do full table scans when you
>>>>> want to find out which files hadn't been touched since X time has
>>>>> passed. (X could be months, weeks, days, hours, etc.)
>>>>>
>>>>> So here's the thing.
>>>>> First,  I am not convinced that you will have hot spotting.
>>>>> Second, you now end up having to do 26 scans instead of one. Then
>>>>> you need to join the result sets.
>>>>>
>>>>> Not really a good solution if you think about it.
>>>>>
>>>>> Oh, and I don't believe that you will be hitting a single region,
>>>>> although you may hit a region hard.
>>>>> (Your second table's key is on the timestamp of the last update to
>>>>> the file. If a file hasn't been touched in a week, it's probable
>>>>> that, at scale, it won't be in the same region as a file that was
>>>>> recently touched.)
>>>>>
>>>>> I wouldn't recommend HBaseWD. It's cute, it's not novel, and it can
>>>>> only be applied to a subset of problems.
>>>>> (Think round-robin partitioning in an RDBMS. DB2 was big on this.)
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>>
>>>>>
>>>>> On Jun 16, 2012, at 9:42 AM, Jean-Marc Spaggiari wrote:
>>>>>
>>>>>> Let's imagine the timestamp is "123456789".
>>>>>>
>>>>>> If I salt it with a letter from 'a' to 'z' then it will always be
>>>>>> split between a few RegionServers. I will have keys like
>>>>>> "t123456789". The issue is that I will have to do 26 queries to be
>>>>>> able to find all the entries. I will need to query from A000000000
>>>>>> to Axxxxxxxxx, then the same for B, and so on.
>>>>>>
>>>>>> So what's worst? Am I better to deal with the hotspotting? Salt the
>>>>>> key myself? Or what if I use something like HBaseWD?
>>>>>>
>>>>>> JM
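The fan-out cost being described can be sketched like this (a minimal Python sketch, assuming a one-letter salt prefix and string keys; the helper names are hypothetical, not an HBase API):

```python
import string

def salt(key: str, salt_char: str) -> str:
    """Prefix a timestamp key with one salt character, e.g. "t123456789"."""
    return salt_char + key

def scan_ranges(upper_ts: str):
    """One (start, stop) range per salt bucket: 26 separate scans whose
    results must be merged client-side to get all entries below upper_ts."""
    return [(c, c + upper_ts) for c in string.ascii_lowercase]

ranges = scan_ranges("123456789")   # 26 ranges, one per salt letter
```

Each extra salt character multiplies the number of scans, which is exactly the trade-off raised above.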
>>>>>>
>>>>>> 2012/6/16, Michel Segel <[email protected]>:
>>>>>>> You can't salt the key in the second table.
>>>>>>> By salting the key, you lose the ability to do range scans, which
>>>>>>> is what you want to do.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>
>>>>>>> Mike Segel
>>>>>>>
>>>>>>> On Jun 16, 2012, at 6:22 AM, Jean-Marc Spaggiari <
>>>>> [email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks all for your comments and suggestions. Regarding the
>>>>>>>> hotspotting I will try to salt the key in the 2nd table and see the
>>>>>>>> results.
>>>>>>>>
>>>>>>>> Yesterday I finished installing my 4-server cluster with old
>>>>>>>> machines. It's slow, but it's working. So I will do some testing.
>>>>>>>>
>>>>>>>> You are recommending to modify the timestamp to be to the second
>>>>>>>> or minute and have more entries per row. Is that because it's
>>>>>>>> better to have more columns than rows? Or is it more because that
>>>>>>>> will allow a more "squared" pattern (lots of rows, lots of
>>>>>>>> columns) which is more efficient?
>>>>>>>>
>>>>>>>> JM
>>>>>>>>
>>>>>>>> 2012/6/15, Michael Segel <[email protected]>:
>>>>>>>>> Thought about this a little bit more...
>>>>>>>>>
>>>>>>>>> You will want two tables for a solution.
>>>>>>>>>
>>>>>>>>> Table 1: Key: Unique ID
>>>>>>>>>          Column: FilePath            Value: full path to the file
>>>>>>>>>          Column: Last Update Time    Value: timestamp
>>>>>>>>>
>>>>>>>>> Table 2: Key: Last Update Time (the timestamp)
>>>>>>>>>          Column 1-N: Unique ID       Value: full path to the file
>>>>>>>>>
>>>>>>>>> Now if you want to get fancy, in Table 1 you could use the
>>>>>>>>> timestamp on the FilePath column to hold the last update time.
>>>>>>>>> But it's probably easier for you to start by keeping the data as
>>>>>>>>> a separate column and ignore the timestamps on the columns for
>>>>>>>>> now.
>>>>>>>>>
>>>>>>>>> Note the following:
>>>>>>>>>
>>>>>>>>> 1) I used the notation Column 1-N to reflect that for a given
>>>>>>>>> timestamp you may or may not have multiple files that were
>>>>>>>>> updated. (You weren't specific as to the scale.)
>>>>>>>>> This is a good example of HBase's column-oriented approach, where
>>>>>>>>> you may or may not have a column. It doesn't matter. :-) You
>>>>>>>>> could also modify the timestamp to be to the second or minute and
>>>>>>>>> have more entries per row. It doesn't matter. You insert based on
>>>>>>>>> timestamp:columnName, value, so you will add a column to this
>>>>>>>>> table.
>>>>>>>>>
>>>>>>>>> 2) First prove that the logic works. You insert/update table 1
>>>>>>>>> to capture the ID of the file and its last update time. You then
>>>>>>>>> delete the old timestamp entry in table 2, then insert the new
>>>>>>>>> entry in table 2.
>>>>>>>>>
>>>>>>>>> 3) You store Table 2 in ascending order. Then when you want to
>>>>>>>>> find your last 500 entries, you start a scan at 0x000 and limit
>>>>>>>>> the scan to 500 rows. Note that you may or may not have multiple
>>>>>>>>> entries, so as you walk through the result set, you count the
>>>>>>>>> number of columns and stop when you have 500 columns, regardless
>>>>>>>>> of the number of rows you've processed.
>>>>>>>>>
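The counting walk in point 3 can be sketched in plain Python (an in-memory dict stands in for an HBase scan over table 2; this models the logic only, it is not the HBase client API):

```python
def first_n_entries(index_rows, n=500):
    """Walk the index table (timestamp -> list of file IDs) in ascending
    key order, collecting (timestamp, file_id) pairs until n columns have
    been seen, regardless of how many rows that takes."""
    out = []
    for ts in sorted(index_rows):          # table 2 is keyed by timestamp
        for file_id in index_rows[ts]:     # columns 1..N hold the file IDs
            out.append((ts, file_id))
            if len(out) == n:
                return out
    return out

# toy index: two files share timestamp 1, one row is enough for n=2
oldest = first_n_entries({3: ["f3"], 1: ["f1", "f2"]}, n=2)
```

Against a real table you would cap the scan at 500 rows as described above and stop early once 500 columns have been counted.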
>>>>>>>>> This should solve your problem and be pretty efficient.
>>>>>>>>> You can then work out the Coprocessors and add them to the
>>>>>>>>> solution to be even more efficient.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> With respect to 'hot-spotting', it can't be helped. You could
>>>>>>>>> hash your unique ID in table 1; this will reduce the potential
>>>>>>>>> for a hotspot as the table splits.
>>>>>>>>> On table 2, because you have temporal data and you want to
>>>>>>>>> efficiently scan a small portion of the table based on size, you
>>>>>>>>> will always scan the first block. However, as data rolls off and
>>>>>>>>> compaction occurs, you will probably have to do some cleanup. I'm
>>>>>>>>> not sure how HBase handles splits that no longer contain data.
>>>>>>>>> When you compact an empty split, does it go away?
>>>>>>>>>
>>>>>>>>> By switching to coprocessors, you now limit the update accesses
>>>>>>>>> to the second table, so you should still have pretty good
>>>>>>>>> performance.
>>>>>>>>>
>>>>>>>>> You may also want to look at Asynchronous HBase; however, I don't
>>>>>>>>> know how well it will work with Coprocessors, or if you want to
>>>>>>>>> perform async operations in this specific use case.
>>>>>>>>>
>>>>>>>>> Good luck, HTH...
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:
>>>>>>>>>
>>>>>>>>>> Hi Michael,
>>>>>>>>>>
>>>>>>>>>> For now this is more a proof of concept than a production
>>>>>>>>>> application. And if it's working, it should grow a lot, and the
>>>>>>>>>> database at the end will easily be over 1B rows. Each individual
>>>>>>>>>> server will have to send its own information to one centralized
>>>>>>>>>> server which will insert it into a database. That's why it needs
>>>>>>>>>> to be very quick, and that's why I'm looking in HBase's
>>>>>>>>>> direction. I tried with some relational databases with 4M rows
>>>>>>>>>> in the table, but the insert time is too slow when I have to
>>>>>>>>>> introduce entries in bulk. Also, the ability for HBase to keep
>>>>>>>>>> only the cells with values will allow saving a lot of disk space
>>>>>>>>>> (future projects).
>>>>>>>>>>
>>>>>>>>>> I'm not yet used to HBase and there are still many things I
>>>>>>>>>> need to understand, but until I'm able to create a solution and
>>>>>>>>>> test it, I will continue to read, learn and try that way. Then
>>>>>>>>>> at the end I will be able to compare the 2 options I have (HBase
>>>>>>>>>> or relational) and decide based on the results.
>>>>>>>>>>
>>>>>>>>>> So yes, your reply helped, because it's giving me a way to
>>>>>>>>>> achieve this goal (using co-processors). I don't know yet how
>>>>>>>>>> this part works, so I will dig into the documentation for it.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> JM
>>>>>>>>>>
>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>> Jean-Marc,
>>>>>>>>>>>
>>>>>>>>>>> You do realize that this really isn't a good use case for
>>>>>>>>>>> HBase, assuming that what you are describing is a stand-alone
>>>>>>>>>>> system.
>>>>>>>>>>> It would be easier and better if you just used a simple
>>>>>>>>>>> relational database.
>>>>>>>>>>>
>>>>>>>>>>> Then you would have your table with an ID, and a secondary
>>>>>>>>>>> index on the timestamp.
>>>>>>>>>>> Retrieve the data in ascending order by timestamp and take the
>>>>>>>>>>> top 500 off the list.
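For comparison, the relational version is a few lines with an in-memory SQLite database (the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE files (id TEXT PRIMARY KEY, path TEXT, last_update INTEGER)")
# secondary index on the timestamp column
con.execute("CREATE INDEX idx_last_update ON files (last_update)")

con.executemany("INSERT INTO files VALUES (?, ?, ?)",
                [("f1", "/a", 30), ("f2", "/b", 10), ("f3", "/c", 20)])

# the 500 least-recently-updated files, oldest first
stale = con.execute(
    "SELECT id FROM files ORDER BY last_update ASC LIMIT 500").fetchall()
```

The index makes the ORDER BY ... LIMIT cheap, which is the whole point of the relational suggestion.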
>>>>>>>>>>>
>>>>>>>>>>> If you insist on using HBase, yes, you will have to have a
>>>>>>>>>>> secondary table. Then, using co-processors:
>>>>>>>>>>> When you update the row in your base table, you then get() the
>>>>>>>>>>> row in your index by timestamp, removing the column for that
>>>>>>>>>>> rowid.
>>>>>>>>>>> Add the new column to the timestamp row.
>>>>>>>>>>>
>>>>>>>>>>> As you put it.
>>>>>>>>>>>
>>>>>>>>>>> Now you can just do a partial scan on your index. Because your
>>>>>>>>>>> index table is so small, you shouldn't worry about hotspots.
>>>>>>>>>>> You may just want to rebuild your index every so often...
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> -Mike
>>>>>>>>>>>
>>>>>>>>>>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for your feedback. Here are more details to describe
>>>>>>>>>>>> what
>>>>> I'm
>>>>>>>>>>>> trying to achieve.
>>>>>>>>>>>>
>>>>>>>>>>>> My goal is to store information about files in the database.
>>>>>>>>>>>> I need to check the oldest files in the database to refresh
>>>>>>>>>>>> their information.
>>>>>>>>>>>>
>>>>>>>>>>>> The key is an 8-byte ID of the server in the network hosting
>>>>>>>>>>>> the file + the MD5 of the file path. The total is a 24-byte
>>>>>>>>>>>> key.
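A sketch of that key layout in Python (the big-endian integer encoding of the server ID is an assumption; only the 8-byte ID + 16-byte MD5 layout comes from the description above):

```python
import hashlib
import struct

def file_key(server_id: int, file_path: str) -> bytes:
    """8-byte server ID followed by the 16-byte MD5 of the file path:
    a fixed-width 24-byte row key, computable for any file on any drive."""
    # assumed encoding: unsigned 64-bit big-endian server ID
    return (struct.pack(">Q", server_id)
            + hashlib.md5(file_path.encode("utf-8")).digest())

key = file_key(42, "/var/data/report.txt")
```

Because the key is derivable from (server, path) alone, a row can be updated in place without ever scanning for it.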
>>>>>>>>>>>>
>>>>>>>>>>>> So each time I look at a file and gather the information, I
>>>>>>>>>>>> update its row in the database based on the key, including a
>>>>>>>>>>>> "last_update" field. I can calculate this key for any file on
>>>>>>>>>>>> the drives.
>>>>>>>>>>>>
>>>>>>>>>>>> In order to know which files I need to check in the network,
>>>>>>>>>>>> I need to scan the table by the "last_update" field. So the
>>>>>>>>>>>> idea is to build another table which contains the last_update
>>>>>>>>>>>> as the key and the file IDs in columns. (Here is the
>>>>>>>>>>>> hotspotting.)
>>>>>>>>>>>>
>>>>>>>>>>>> Each time I work on a file, I will have to update the main
>>>>>>>>>>>> table by ID and remove the cell from the second table (the
>>>>>>>>>>>> index) and put it back with the new "last_update" key.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm mainly doing 3 operations in the database:
>>>>>>>>>>>> 1) I retrieve a list of 500 files which need to be updated.
>>>>>>>>>>>> 2) I update the information for those 500 files (bulk update).
>>>>>>>>>>>> 3) I load new file references to be checked.
>>>>>>>>>>>>
>>>>>>>>>>>> For 2 and 3, I use the main table with the file ID as the
>>>>>>>>>>>> key. The distribution is almost perfect because I'm using a
>>>>>>>>>>>> hash. The prefix is the server ID, but it's not always going
>>>>>>>>>>>> to the same server since it's done by last_update. But this
>>>>>>>>>>>> allows quick access to the list of files from one server.
>>>>>>>>>>>> For 1, I expected to build this second table with the
>>>>>>>>>>>> "last_update" as the key.
>>>>>>>>>>>>
>>>>>>>>>>>> Regarding the frequency, it really depends on the activity on
>>>>>>>>>>>> the network, but it should be "often". The faster the database
>>>>>>>>>>>> updates are, the more up to date I will be able to keep it.
>>>>>>>>>>>>
>>>>>>>>>>>> JM
>>>>>>>>>>>>
>>>>>>>>>>>> 2012/6/14, Michael Segel <[email protected]>:
>>>>>>>>>>>>> Actually I think you should revisit your key design....
>>>>>>>>>>>>>
>>>>>>>>>>>>> Look at your access path to the data for each of the types
>>>>>>>>>>>>> of queries you are going to run.
>>>>>>>>>>>>> From your post:
>>>>>>>>>>>>> "I have a table with a unique key, a file path and a "last
>>>>>>>>>>>>> update" field. I can easily find back the file with the ID
>>>>>>>>>>>>> and find when it has been updated.
>>>>>>>>>>>>> But what I need too is to find the files not updated for more
>>>>>>>>>>>>> than a certain period of time."
>>>>>>>>>>>>> So your primary query is going to be against the key.
>>>>>>>>>>>>> Not sure if you meant to say that your key was a composite
>>>>>>>>>>>>> key or not... it sounds like your key is just the unique key
>>>>>>>>>>>>> and the rest are columns in the table.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The secondary query, or path to the data, is to find files
>>>>>>>>>>>>> that were not updated for more than a period of time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you make your key temporal, that is, adding time as a
>>>>>>>>>>>>> component of your key, you will end up creating new rows of
>>>>>>>>>>>>> data while the old row still exists.
>>>>>>>>>>>>> Not a good side effect.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The other nasty side effect of using time as your key is
>>>>>>>>>>>>> that you not only have the potential for hotspotting, but
>>>>>>>>>>>>> you also have the nasty side effect of creating splits that
>>>>>>>>>>>>> will never grow.
>>>>>>>>>>>>>
>>>>>>>>>>>>> How often are you going to ask to see the files that were
>>>>>>>>>>>>> not updated in the last couple of days/minutes? If it's
>>>>>>>>>>>>> infrequent, then you really shouldn't care if you have to do
>>>>>>>>>>>>> a complete table scan.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wow! This is exactly what I was looking for. So I will read
>>>>>>>>>>>>>> all of that now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Need to read here at the bottom:
>>>>>>>>>>>>>> https://github.com/sematext/HBaseWD
>>>>>>>>>>>>>> and here:
>>>>>>>>>>>>>>
>>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2012/6/14, Otis Gospodnetic <[email protected]>:
>>>>>>>>>>>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this
>>>>> comes
>>>>>>>>>>>>>>> up
>>>>>>>>>>>>>>> often.... Doug, maybe you could add it to the Ref Guide?)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Otis
>>>>>>>>>>>>>>> ----
>>>>>>>>>>>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>>>>>>>>>>>>>>> http://sematext.com/spm
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>>>> From: Jean-Marc Spaggiari <[email protected]>
>>>>>>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>>>>>>>>>>>> Subject: Timestamp as a key good practice?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I watched Lars George's video about HBase and read the
>>>>>>>>>>>>>>>> documentation, and it's saying that it's not a good idea
>>>>>>>>>>>>>>>> to have the timestamp as a key because that will always
>>>>>>>>>>>>>>>> load the same region until the timestamp reaches a certain
>>>>>>>>>>>>>>>> value and moves to the next region (hotspotting).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a table with a unique key, a file path and a "last
>>>>>>>>>>>>>>>> update" field. I can easily find back the file with the
>>>>>>>>>>>>>>>> ID and find when it has been updated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> But what I need too is to find the files not updated for
>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> certain period of time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I want to retrieve that from this single table, I will
>>>>>>>>>>>>>>>> have to do a full scan of the table, which might take a
>>>>>>>>>>>>>>>> while.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So I thought of building a table to reference that (a
>>>>>>>>>>>>>>>> kind of secondary index). The key is the "last update",
>>>>>>>>>>>>>>>> one CF, and each column will have the ID of the file with
>>>>>>>>>>>>>>>> dummy content.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When a file is updated, I remove its cell from this table
>>>>>>>>>>>>>>>> and introduce a new cell with the new timestamp as the
>>>>>>>>>>>>>>>> key.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And so on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> With this schema, I can find the files by ID very
>>>>>>>>>>>>>>>> quickly, and I can find the files which need to be
>>>>>>>>>>>>>>>> updated pretty quickly too. But it's hotspotting one
>>>>>>>>>>>>>>>> region.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> From the video (0:45:10) I can see 4 situations.
>>>>>>>>>>>>>>>> 1) Hotspotting.
>>>>>>>>>>>>>>>> 2) Salting.
>>>>>>>>>>>>>>>> 3) Key field swap/promotion
>>>>>>>>>>>>>>>> 4) Randomization.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I need to avoid hotspotting, so I looked at the 3 other
>>>>>>>>>>>>>>>> options.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I can do salting, like prefixing the timestamp with a
>>>>>>>>>>>>>>>> number between 0 and 9. That will distribute the load
>>>>>>>>>>>>>>>> over 10 servers. To find all the files with a timestamp
>>>>>>>>>>>>>>>> below a specific value, I will need to run 10 requests
>>>>>>>>>>>>>>>> instead of one. But when the load becomes too big for 10
>>>>>>>>>>>>>>>> servers, will I have to prefix with a number between 0
>>>>>>>>>>>>>>>> and 99? Which means 100 requests? And the more regions I
>>>>>>>>>>>>>>>> have, the more requests I will have to do. Is that really
>>>>>>>>>>>>>>>> a good approach?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Key field swap is close to salting. I can add the first
>>>>>>>>>>>>>>>> few bytes from the path before the timestamp, but the
>>>>>>>>>>>>>>>> issue will remain the same.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I looked at randomization, and I can't do that; otherwise
>>>>>>>>>>>>>>>> I will have no way to retrieve the information I'm
>>>>>>>>>>>>>>>> looking for.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So the question is: is there a good way to store the data
>>>>>>>>>>>>>>>> so I can retrieve it based on the date?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> JM
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
