Michael, in fact I don't care about the latency between the doc write and the index write. Today I did some tests: it turns out turning off the WAL does speed up the writes by about a factor of 2. Interestingly, enabling the Bloom filter did little to improve the checkAndPut.
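The dedup guarantee that checkAndPut provides here — increment the word counts only if the document row did not already exist — can be sketched off-cluster in plain Java. This is a minimal simulation using Map.putIfAbsent, not the HBase API; the class and field names are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical off-cluster simulation of the checkAndPut-guarded indexing:
// word counts are incremented only when the document id was absent, so
// re-submitting a duplicate document does not double-count.
public class DedupIndexSketch {
    final Map<String, String> doc = new ConcurrentHashMap<>();  // stands in for the "doc" table
    final Map<String, Long> docIdx = new ConcurrentHashMap<>(); // stands in for the "doc_idx" table

    boolean insert(String docId, String content) {
        // putIfAbsent mimics checkAndPut(row, cf, qual, null, put):
        // it succeeds only if no value exists yet for this row key.
        boolean fresh = doc.putIfAbsent(docId, content) == null;
        if (fresh) {
            for (String word : content.split("\\s+")) {
                docIdx.merge(word, 1L, Long::sum);              // stands in for Increment
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        DedupIndexSketch s = new DedupIndexSketch();
        s.insert("doc1", "I am working He is not working");
        s.insert("doc1", "I am working He is not working");     // duplicate: ignored
        System.out.println(s.docIdx.get("working"));            // prints 2, not 4
    }
}
```

The same logic is what the postCheckAndPut path has to preserve: the index write must be conditional on the result flag of the preceding check-and-put.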
Earlier you mentioned:

>>>> The OP doesn't really get into the use case, so we don't know why the
>>>> Check and Put in the M/R job.
>>>> He should just be using put() and then a postPut().

The main reason I use checkAndPut is to make sure the word count index doesn't get duplicate increments when duplicate documents come in. Additionally, I also need to dump duplicate-free docs to HDFS for a legacy system that we have in place. Is there some way to avoid checkAndPut?

Sincerely,
Prakash

On Feb 20, 2013, at 10:00 PM, Michel Segel <[email protected]> wrote:

> I was suggesting removing the write to WAL on your write to the index table
> only.
>
> The thing you have to realize is that true low latency systems use databases
> as a sink. It's the end of the line, so to speak.
>
> So if you're worried about a small latency between the write to your doc
> table and the write to your index... you are designing the wrong system.
>
> Consider that it takes some time t to write the base record and then to
> write the indexes.
> For that period, you have a Schrödinger's cat problem as to whether the row
> exists or not. Since HBase lacks transactions and ACID, trying to write a
> solution where you require low latency... you are using the wrong tool.
>
> Remember that HBase was designed as a distributed system for managing very
> large data sets. Your speed from using secondary indexes like an inverted
> table is in the read, not the write.
>
> If you had append working, you could create an index if you could create a
> fixed-size key buffer. Or something down that path... Sorry, just thinking
> out loud...
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Feb 19, 2013, at 1:53 PM, Asaf Mesika <[email protected]> wrote:
>
>> 1. Try batching your Increment calls into a List<Row> and use batch() to
>>    execute it. Should reduce RPC calls by two orders of magnitude.
>> 2. Combine batching with scanning more words, aggregating your count for
>>    a given word, so you issue fewer Increment commands.
>> 3. Enable Bloom filters. Should speed up Increment by a factor of 2 at
>>    least.
>> 4. Don't use keyValue.getValue(). It does a System.arraycopy behind the
>>    scenes. Use getBuffer(), getValueOffset() and getValueLength() and
>>    iterate over the existing array. Write your own split without using
>>    String functions, which go through encoding (expensive). Just find
>>    your delimiter by byte comparison.
>> 5. Enable Bloom filters on the doc table. It should speed up the
>>    checkAndPut.
>> 6. I wouldn't give up the WAL. It ain't your bottleneck IMO.
>>
>> On Monday, February 18, 2013, prakash kadel wrote:
>>
>>> Thank you guys for your replies,
>>> Michael,
>>> I think I didn't make it clear. Here is my use case:
>>>
>>> I have text documents to insert into HBase (with possible duplicates).
>>> Suppose I have a document such as: "I am working. He is not working"
>>>
>>> I want to insert this document into a table, say table "doc":
>>>
>>> =doc table=
>>> -----------
>>> rowKey : doc_id
>>> cf     : doc_content
>>> value  : "I am working. He is not working"
>>>
>>> Now, I want to create another table that stores the word counts, say
>>> "doc_idx":
>>>
>>> =doc_idx table=
>>> ---------------
>>> rowKey : I,       cf: count, value: 1
>>> rowKey : am,      cf: count, value: 1
>>> rowKey : working, cf: count, value: 2
>>> rowKey : He,      cf: count, value: 1
>>> rowKey : is,      cf: count, value: 1
>>> rowKey : not,     cf: count, value: 1
>>>
>>> My MR job code:
>>> ===============
>>>
>>> if (doc.checkAndPut(rowKey, doc_content, "", null, putDoc)) {
>>>     for (String word : doc_content.split("\\s+")) {
>>>         Increment inc = new Increment(Bytes.toBytes(word));
>>>         inc.addColumn("count", "", 1);
>>>     }
>>> }
>>>
>>> Now, I wanted to do some experiments with coprocessors, so I modified
>>> the code as follows.
>>> My MR job code:
>>> ===============
>>>
>>> doc.checkAndPut(rowKey, doc_content, "", null, putDoc);
>>>
>>> Coprocessor code:
>>> =================
>>>
>>> public void start(CoprocessorEnvironment env) {
>>>     pool = new HTablePool(conf, 100);
>>> }
>>>
>>> public boolean postCheckAndPut(c, row, family, byte[] qualifier,
>>>         compareOp, comparator, put, result) {
>>>
>>>     if (!result) return true; // only index if the put succeeded
>>>
>>>     HTableInterface table_idx = pool.getTable("doc_idx");
>>>
>>>     try {
>>>         for (KeyValue contentKV : put.get("doc_content", "")) {
>>>             for (String word : Bytes.toString(contentKV.getValue()).split("\\s+")) {
>>>                 Increment inc = new Increment(Bytes.toBytes(word));
>>>                 inc.addColumn("count", "", 1);
>>>                 table_idx.increment(inc);
>>>             }
>>>         }
>>>     } finally {
>>>         table_idx.close();
>>>     }
>>>     return true;
>>> }
>>>
>>> public void stop(env) {
>>>     pool.close();
>>> }
>>>
>>> I am a newbie to HBase. I am not sure this is the right way to do it.
>>> Given that, why is the coprocessor-enabled version much slower than
>>> the one without?
>>>
>>> Sincerely,
>>> Prakash Kadel
>>>
>>> On Mon, Feb 18, 2013 at 9:11 PM, Michael Segel
>>> <[email protected]> wrote:
>>>>
>>>> The issue I was talking about was the use of a check and put.
>>>> The OP wrote:
>>>>>>>> each map inserts to doc table. (checkAndPut)
>>>>>>>> regionobserver coprocessor does a postCheckAndPut and inserts some
>>>>>>>> rows to a index table.
>>>>
>>>> My question is why does the OP use a checkAndPut, and the
>>>> RegionObserver's postCheckAndPut?
>>>>
>>>> Here's a good example...
>>>> http://stackoverflow.com/questions/13404447/is-hbase-checkandput-latency-higher-than-simple-put
>>>>
>>>> The OP doesn't really get into the use case, so we don't know why the
>>>> Check and Put in the M/R job.
>>>> He should just be using put() and then a postPut().
>>>>
>>>> Another issue... since he's writing to a different HTable... how? Does
>>>> he create an HTable instance in the start() method of his RO object and
>>>> then reference it later? Or does he create the instance of the HTable
>>>> on the fly in each postCheckAndPut()?
>>>> Without seeing his code, we don't know.
>>>>
>>>> Note that this is a synchronous set of writes. Your overall return from
>>>> the M/R call to put will wait until the second row is inserted.
>>>>
>>>> Interestingly enough, you may want to consider disabling the WAL on the
>>>> write to the index. You can always run an M/R job that rebuilds the
>>>> index should something occur to the system where you might lose the
>>>> data. Indexes *ARE* expendable. ;-)
>>>>
>>>> Does that explain it?
>>>>
>>>> -Mike
>>>>
>>>> On Feb 18, 2013, at 4:57 AM, yonghu <[email protected]> wrote:
>>>>
>>>>> Hi, Michael
>>>>>
>>>>> I don't quite understand what you mean by "round trip back to the
>>>>> client". In my understanding, as the RegionServer and TaskTracker can
>>>>> be the same node, MR doesn't have to pull data into the client and
>>>>> then process it. You also mention "unnecessary overhead"; can you
>>>>> explain a little what operations or data processing can be seen as
>>>>> "unnecessary overhead"?
>>>>>
>>>>> Thanks
>>>>>
>>>>> yong
>>>>>
>>>>> On Mon, Feb 18, 2013 at 10:35 AM, Michael Segel
>>>>> <[email protected]> wrote:
>>>>>> Why?
>>>>>>
>>>>>> This seems like unnecessary overhead.
>>>>>>
>>>>>> You are writing code within the coprocessor on the server.
>>>>>> Pessimistic code really isn't recommended if you are worried about
>>>>>> performance.
>>>>>>
>>>>>> I have to ask... by the time you have executed the code in your
>>>>>> co-processor, what would cause the initial write to fail?
>>>>>>
>>>>>> On Feb 18, 2013, at 3:01 AM, Prakash Kadel <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> It's a local read. I just check the last param of postCheckAndPut,
>>>>>>> indicating whether the Put succeeded. In case the put succeeded, I
>>>>>>> insert a row in another table.
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> Prakash Kadel
>>>>>>>
>>>>>>> On Feb 18, 2013, at 2:52 PM, Wei Tan <[email protected]> wrote:
>>>>>>>
>>>>>>>> Is your checkAndPut involving a local or remote READ? Due to the
>>>>>>>> nature of LSM, a read is much slower compared to a write...
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Wei
>>>>>>>>
>>>>>>>> From: Prakash Kadel <[email protected]>
>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>> Date: 02/17/2013 07:49 PM
>>>>>>>> Subject: coprocessor enabled put very slow, help please~~~
>>>>>>>>
>>>>>>>> hi,
>>>>>>>> i am trying to insert few million documents
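The thread is truncated here. As a footnote, Asaf's suggestions 2 and 4 above — aggregate counts per word before issuing Increments, and split the value on raw bytes instead of going through String.split — can be sketched off-cluster in plain Java. This is an illustrative sketch, not HBase API; all names below are hypothetical, and a real coprocessor would keep the words as byte[] rather than decoding them:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ByteSplitSketch {

    // Split a value buffer on ASCII whitespace by byte comparison,
    // without materializing the whole value as a String or calling
    // String.split() (which walks a regex and re-encodes).
    static List<String> splitWords(byte[] buf, int off, int len) {
        List<String> words = new ArrayList<>();
        int start = -1;
        for (int i = off; i < off + len; i++) {
            boolean ws = buf[i] == ' ' || buf[i] == '\t' || buf[i] == '\n' || buf[i] == '\r';
            if (!ws && start < 0) start = i;             // word begins
            if (ws && start >= 0) {                       // word ends
                words.add(new String(buf, start, i - start, StandardCharsets.UTF_8));
                start = -1;
            }
        }
        if (start >= 0) words.add(new String(buf, start, off + len - start, StandardCharsets.UTF_8));
        return words;
    }

    // Aggregate per-word counts first, so one Increment per distinct word
    // (which could then go into a single batch()) replaces one Increment
    // per occurrence.
    static Map<String, Long> countWords(byte[] value) {
        Map<String, Long> counts = new HashMap<>();
        for (String w : splitWords(value, 0, value.length)) counts.merge(w, 1L, Long::sum);
        return counts;
    }

    public static void main(String[] args) {
        byte[] doc = "I am working He is not working".getBytes(StandardCharsets.UTF_8);
        System.out.println(countWords(doc));
    }
}
```

In the coprocessor above, the inner loop would then issue one increment per map entry instead of one per word occurrence.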
