Okay I did my research - these need to be set to false. I agree.

On Sat, Feb 9, 2013 at 12:05 AM, Varun Sharma <[email protected]> wrote:

> I have ipc.client.tcpnodelay, ipc.server.tcpnodelay set to false and the
> hbase one - [hbase].ipc.client.tcpnodelay set to true. Do these induce
> network latency ?
>
> On Fri, Feb 8, 2013 at 11:57 PM, lars hofhansl <[email protected]> wrote:
>
>> Sorry.. I meant set these two config parameters to true (not false as I
>> state below).
>>
>> ----- Original Message -----
>> From: lars hofhansl <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Cc:
>> Sent: Friday, February 8, 2013 11:41 PM
>> Subject: Re: Get on a row with multiple columns
>>
>> Only somewhat related. Seeing the magic 40ms random read time there. Did
>> you disable Nagle's?
>> (set hbase.ipc.client.tcpnodelay and ipc.server.tcpnodelay to false in
>> hbase-site.xml).
>>
>> ________________________________
>> From: Varun Sharma <[email protected]>
>> To: [email protected]; lars hofhansl <[email protected]>
>> Sent: Friday, February 8, 2013 10:45 PM
>> Subject: Re: Get on a row with multiple columns
>>
>> The use case is like your twitter feed. Tweets from people you follow. When
>> someone unfollows, you need to delete a bunch of his tweets from the
>> following feed. So, it's frequent, and we are essentially running into some
>> extreme corner cases like the one above. We need high write throughput for
>> this, since when someone tweets, we need to fanout the tweet to all the
>> followers. We need the ability to do fast deletes (unfollow) and fast adds
>> (follow) and also be able to do fast random gets - when a real user loads
>> the feed. I doubt we will be able to play much with the schema here since we
>> need to support a bunch of use cases.
>>
>> @lars: It does not take 30 seconds to place 300 delete markers. It takes
>> 30 seconds to first find which of those 300 pins are in the set of columns
>> present - this invokes 300 gets and then place the appropriate delete
>> markers.
>> Note that we can have tens of thousands of columns in a single row
>> so a single get is not cheap.
>>
>> If we were to just place delete markers, that is very fast. But when we
>> started doing that, our random read performance suffered because of too
>> many delete markers. The 90th percentile on random reads shot up from 40
>> milliseconds to 150 milliseconds, which is not acceptable for our usecase.
>>
>> Thanks
>> Varun
>>
>> On Fri, Feb 8, 2013 at 10:33 PM, lars hofhansl <[email protected]> wrote:
>>
>> > Can you organize your columns and then delete by column family?
>> >
>> > deleteColumn without specifying a TS is expensive, since HBase first has
>> > to figure out what the latest TS is.
>> >
>> > Should be better in 0.94.1 or later since deletes are batched like Puts
>> > (still need to retrieve the latest version, though).
>> >
>> > In 0.94.3 or later you can also use the BulkDeleteEndPoint, which basically
>> > lets you specify a scan condition and then place specific delete markers
>> > for all KVs encountered.
>> >
>> > If you wanted to get really fancy, you could hook up a coprocessor to the
>> > compaction process and simply filter all KVs you no longer want (without
>> > ever placing any delete markers).
>> >
>> > Are you saying it takes 15 seconds to place 300 version delete markers?!
>> >
>> > -- Lars
>> >
>> > ________________________________
>> > From: Varun Sharma <[email protected]>
>> > To: [email protected]
>> > Sent: Friday, February 8, 2013 10:05 PM
>> > Subject: Re: Get on a row with multiple columns
>> >
>> > We are given a set of 300 columns to delete. I tested two cases:
>> >
>> > 1) deleteColumns() - with the 's'
>> >
>> > This function simply adds delete markers for 300 columns, in our case,
>> > typically only a fraction of these columns are actually present - 10.
>> > After starting to use deleteColumns, we started seeing a drop in cluster wide
>> > random read performance - 90th percentile latency worsened, so did 99th
>> > probably because of having to traverse delete markers. I attribute this to
>> > profusion of delete markers in the cluster. Major compactions slowed down
>> > by almost 50 percent probably because of having to clean out significantly
>> > more delete markers.
>> >
>> > 2) deleteColumn()
>> >
>> > Ended up with intolerable 15 second calls, which clogged all the handlers.
>> > Making the cluster pretty much unresponsive.
>> >
>> > On Fri, Feb 8, 2013 at 9:55 PM, Ted Yu <[email protected]> wrote:
>> >
>> > > For the 300 column deletes, can you show us how the Delete(s) are
>> > > constructed ?
>> > >
>> > > Do you use this method ?
>> > >
>> > > public Delete deleteColumns(byte [] family, byte [] qualifier) {
>> > > Thanks
>> > >
>> > > On Fri, Feb 8, 2013 at 9:44 PM, Varun Sharma <[email protected]> wrote:
>> > >
>> > > > So a Get call with multiple columns on a single row should be much faster
>> > > > than independent Get(s) on each of those columns for that row. I am
>> > > > basically seeing severely poor performance (~ 15 seconds) for certain
>> > > > deleteColumn() calls and I am seeing that there is a
>> > > > prepareDeleteTimestamps() function in HRegion.java which first tries to
>> > > > locate the column by doing individual gets on each column you want to
>> > > > delete (I am doing 300 column deletes). Now, I think this should ideally be
>> > > > 1 get call with the batch of 300 columns so that one scan can retrieve the
>> > > > columns and the columns that are found, are indeed deleted.
>> > > >
>> > > > Before I try this fix, I wanted to get an opinion if it will make a
>> > > > difference to batch the get() and it seems from your answer, it should.
>> > > >
>> > > > On Fri, Feb 8, 2013 at 9:34 PM, lars hofhansl <[email protected]> wrote:
>> > > >
>> > > > > Everything is stored as a KeyValue in HBase.
>> > > > > The Key part of a KeyValue contains the row key, column family, column
>> > > > > name, and timestamp in that order.
>> > > > > Each column family has its own store and store files.
>> > > > >
>> > > > > So in a nutshell a get is executed by starting a scan at the row key
>> > > > > (which is a prefix of the key) in each store (CF) and then scanning forward
>> > > > > in each store until the next row key is reached. (in reality it is a bit
>> > > > > more complicated due to multiple versions, skipping columns, etc)
>> > > > >
>> > > > > -- Lars
>> > > > > ________________________________
>> > > > > From: Varun Sharma <[email protected]>
>> > > > > To: [email protected]
>> > > > > Sent: Friday, February 8, 2013 9:22 PM
>> > > > > Subject: Re: Get on a row with multiple columns
>> > > > >
>> > > > > Sorry, I was a little unclear with my question.
>> > > > >
>> > > > > Lets say you have
>> > > > >
>> > > > > Get get = new Get(row)
>> > > > > get.addColumn("1");
>> > > > > get.addColumn("2");
>> > > > > .
>> > > > > .
>> > > > > .
>> > > > >
>> > > > > When internally hbase executes the batch get, it will seek to column "1",
>> > > > > now since data is lexicographically sorted, it does not need to seek from
>> > > > > the beginning to get to "2", it can continue seeking, henceforth since
>> > > > > column "2" will always be after column "1". I want to know whether this is
>> > > > > how a multicolumn get on a row works or not.
>> > > > >
>> > > > > Thanks
>> > > > > Varun
>> > > > >
>> > > > > On Fri, Feb 8, 2013 at 9:08 PM, Marcos Ortiz <[email protected]> wrote:
>> > > > >
>> > > > > > Like Ishan said, a get gives an instance of the Result class.
>> > > > > > All utility methods that you can use are:
>> > > > > > byte[] getValue(byte[] family, byte[] qualifier)
>> > > > > > byte[] value()
>> > > > > > byte[] getRow()
>> > > > > > int size()
>> > > > > > boolean isEmpty()
>> > > > > > KeyValue[] raw() # Like Ishan said, all data here is sorted
>> > > > > > List<KeyValue> list()
>> > > > > >
>> > > > > > On 02/08/2013 11:29 PM, Ishan Chhabra wrote:
>> > > > > >
>> > > > > >> Based on what I read in Lars' book, a get will return a Result,
>> > > > > >> which is internally a KeyValue[]. This KeyValue[] is sorted by the key and
>> > > > > >> you access this array using raw or list methods on the Result object.
>> > > > > >>
>> > > > > >> On Fri, Feb 8, 2013 at 5:40 PM, Varun Sharma <[email protected]> wrote:
>> > > > > >>
>> > > > > >>> +user
>> > > > > >>>
>> > > > > >>> On Fri, Feb 8, 2013 at 5:38 PM, Varun Sharma <[email protected]> wrote:
>> > > > > >>>
>> > > > > >>>> Hi,
>> > > > > >>>>
>> > > > > >>>> When I do a Get on a row with multiple column qualifiers. Do we sort the
>> > > > > >>>> column qualifiers and make use of the sorted order when we get the
>> > > > > >>>> results ?
>> > > > > >>>>
>> > > > > >>>> Thanks
>> > > > > >>>> Varun
>> > > > > >
>> > > > > > --
>> > > > > > Marcos Ortiz Valmaseda,
>> > > > > > Product Manager && Data Scientist at UCI
>> > > > > > Blog: http://marcosluis2186.posterous.com
>> > > > > > Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
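
A minimal sketch of the batching idea discussed in the thread, assuming (as Lars describes) that the columns of a row are kept in sorted order. This is a toy model in plain Java, not HBase code: the class SortedColumnScan and the method findPresent are hypothetical names, and plain strings stand in for column qualifiers. Since both the stored columns and the requested qualifiers are sorted, a single forward pass can report which requested columns exist, rather than one lookup from the start of the row per column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Toy model only: strings stand in for column qualifiers of a single row.
public class SortedColumnScan {

    /** Returns the requested qualifiers that exist in the row, using one forward pass. */
    public static List<String> findPresent(List<String> storedSorted,
                                           Collection<String> requested) {
        // Sort and de-duplicate the request so both sides share the same order.
        List<String> wanted = new ArrayList<>(new TreeSet<>(requested));
        List<String> found = new ArrayList<>();
        int s = 0; // cursor into the stored columns; it only ever moves forward
        for (String q : wanted) {
            // Skip stored columns that sort before the qualifier we are looking for.
            while (s < storedSorted.size() && storedSorted.get(s).compareTo(q) < 0) {
                s++;
            }
            if (s < storedSorted.size() && storedSorted.get(s).equals(q)) {
                found.add(q);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("a", "c", "d", "k", "z"); // already sorted
        List<String> toDelete = Arrays.asList("k", "b", "c", "q");
        System.out.println(findPresent(row, toDelete)); // prints [c, k]
    }
}
```

In this model the cursor never moves backwards, so the cost is roughly one pass over the row plus the request, which is the effect Varun hopes to get from a single batched Get instead of 300 individual ones.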

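
For the tcpnodelay question, a small sketch of what the flag means at the socket level, using only the standard java.net.Socket API. Setting TCP_NODELAY to true is what disables Nagle's algorithm, which matches Lars's correction that the parameters should be true. The class name NoDelayDemo is hypothetical, and how HBase itself applies these settings internally is not shown here.

```java
import java.io.IOException;
import java.net.Socket;

// Sketch only: shows the socket option that the tcpnodelay settings refer to.
public class NoDelayDemo {

    /** Creates an unconnected socket, disables Nagle's algorithm, and reads the flag back. */
    public static boolean nagleDisabled() throws IOException {
        try (Socket socket = new Socket()) {
            // TCP_NODELAY = true means small writes are sent immediately
            // instead of being held back and coalesced by Nagle's algorithm.
            socket.setTcpNoDelay(true);
            return socket.getTcpNoDelay();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("TCP_NODELAY enabled: " + nagleDisabled());
    }
}
```

With the flag left at false, small request/response messages can wait on the peer's delayed ACK, which is a commonly cited cause of latencies around 40 ms like the one mentioned in the thread.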