Re: HBaseAdmin needs a close method

2012-04-19 Thread Eason Lee
I don't think this issue can resolve the problem. The ZKWatcher is removed, but the configuration and HConnectionImplementation objects are still in HConnectionManager; this may still cause a memory leak. But calling HConnectionManager.deleteConnection may resolve the HBASE-5073 problem. I can see

Re: hbase coprocessor unit testing

2012-04-19 Thread Marcin Cylke
On 17/04/12 18:45, Alex Baranau wrote: I don't think that your error is related to CPs stuff. What lib versions do you use? Can you compare with those of the HBaseHUT pom? Ok, I've managed to track down the source of my error. If I do normal Put modifications in my prePut/postPut method

Re: HBaseAdmin needs a close method

2012-04-19 Thread N Keywal
Hi, fwiw, the close method was added in HBaseAdmin for HBase 0.90.5. N. On Thu, Apr 19, 2012 at 8:09 AM, Eason Lee softse@gmail.com wrote: I don't think this issue can resolve the problem. The ZKWatcher is removed, but the configuration and HConnectionImplementation objects are still in

Re: HBaseAdmin needs a close method

2012-04-19 Thread Eason Lee
I see, thanks to all~~ Hi, fwiw, the close method was added in HBaseAdmin for HBase 0.90.5. N. On Thu, Apr 19, 2012 at 8:09 AM, Eason Lee softse@gmail.com wrote: I don't think this issue can resolve the problem. The ZKWatcher is removed, but the configuration and HConnectionImplementation

RE: Performance issues of prepending a table

2012-04-19 Thread de Souza Medeiros Andre
Hi Ian, Thank you very much, that pretty much answers it. Best regards, Andre Medeiros From: Ian Varley [ivar...@salesforce.com] Sent: Wednesday, April 18, 2012 17:11 To: user@hbase.apache.org Subject: Re: Performance issues of prepending a table I would

HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop (4*32 GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution for maintaining our cluster. I have a single tweets table in which we store the tweets, one tweet per row (it has millions of rows currently). Now I

Re: HBase parallel scanner performance

2012-04-19 Thread Michel Segel
So in your step 2 you have the following: FOREACH row IN TABLE alpha: SELECT something FROM TABLE alpha WHERE alpha.url = row.url Right? And you are wondering why you are getting timeouts? ... ... And how long does it take to do a full table scan? ;-) (there's more, but that's the
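The pseudocode above is a full table scan nested inside another full table scan. A toy Python model (no HBase involved; the function name is made up for illustration) makes the quadratic cost visible:

```python
# Toy model of the nested-scan pattern: for every row produced by the
# outer scan, the inner query re-scans the entire table.
def nested_scan_reads(num_rows):
    reads = 0
    for _outer in range(num_rows):      # outer scan: one pass over the table
        for _inner in range(num_rows):  # inner scan: a full pass per outer row
            reads += 1
    return reads

print(nested_scan_reads(1000))  # a 1,000-row table already costs 10^6 row reads
```

Doubling the table size quadruples the number of row reads, which is why this pattern starts timing out long before the data gets large.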

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Hi Michel Yes, that is exactly what I do in step 2. I am aware of the reason for the scanner timeout exceptions. It is the time between two consecutive invocations of the next call on a specific scanner object. I increased the scanner timeout to 10 min on the region server and still I keep seeing
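For reference, the 0.90-era region server setting that governs this timeout is the scanner lease period; a sketch of the change described above, in hbase-site.xml (value in milliseconds; property name assumed from that release line):

```xml
<property>
  <name>hbase.regionserver.lease.period</name>
  <!-- default is 60000 (1 minute); 600000 = 10 minutes -->
  <value>600000</value>
</property>
```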

Re: HBase parallel scanner performance

2012-04-19 Thread Michel Segel
Narendra, Are you trying to solve a real problem, or is this a class project? Your solution doesn't scale. It's a non-starter. 130 seconds per iteration times 1 million iterations is how long? 130 million seconds, which is ~36,000 hours, or over 4 years to complete. (The numbers are rough, but

RE: HBase parallel scanner performance

2012-04-19 Thread Bijieshan
Hi Narendra, I have a few doubts: 1. Which version are you using? 2. What's the size of each KeyValue? 3. Did you change the GC parameters on the client side or the server side? After changing the GC parameters, did you keep an eye on the GC logs? Thank you. Regards, Jieshan -Original

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Michael, Thanks for the response. This is a real problem and not a class project. The boxes themselves cost 9k ;) I think there is some difference in understanding of the problem. The table has 2m rows, but I am looking at only the latest 10k rows in the outer for loop. Only in the inner for loop am I

Re: hbase coprocessor unit testing

2012-04-19 Thread Alex Baranau
Are you sure you need to do table.close() after each put? Looks incorrect. Alex Baranau -- Sematext :: http://blog.sematext.com/ :: Solr - Lucene - Hadoop - HBase On Thu, Apr 19, 2012 at 2:48 AM, Marcin Cylke m...@touk.pl wrote: On 17/04/12 18:45, Alex Baranau wrote: I don't think that

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Hi Jieshan, HBase version: 0.90.4-cdh3u3. The size of a KeyValue pair should not be more than 2 KB. I changed the GC parameters on the server side. I have not looked into the GC logs yet, but I have noticed that it pauses the batch process every now and then. How do I look at the server GC logs?

Re: HBase parallel scanner performance

2012-04-19 Thread Michael Segel
Narendra, I think you are still missing the point. 130 seconds to scan the table per iteration. Even if you have only 10K rows, that's 130 * 10^4, or 1.3*10^6, seconds: ~361 hours. Compare that to 10K rows where you then select a single row in your sub-select that has a list of all of the associated rows.
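The back-of-envelope figures in this exchange can be checked directly, assuming the ~130 s per full-table scan quoted in the thread:

```python
# Rough check of the thread's estimates; 130 s per full scan is the
# figure reported by the original poster.
scan_seconds = 130

# 10K outer rows, one full scan each (the 1.3*10^6-second figure):
hours_10k = scan_seconds * 10_000 / 3600
print(round(hours_10k))        # 361

# 1M outer rows (the earlier "over 4 years" estimate):
years_1m = scan_seconds * 1_000_000 / 3600 / 24 / 365
print(round(years_1m, 1))      # 4.1
```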

RE: HBase parallel scanner performance

2012-04-19 Thread Bijieshan
Narendra, Since I didn't see the client logs, full GC is one probable reason I suspect, whether it happens on the client side or the server side. So I suggest checking the GC logs (enable GC logging on both the server and the client side) to see whether full GC happens with high frequency, and check
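For the GC-log question raised here: one common way to enable GC logging on an HBase daemon is through hbase-env.sh. The flags below are standard HotSpot options of that era; the log path is a placeholder:

```shell
# Hypothetical addition to conf/hbase-env.sh; restart the daemon afterwards.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-hbase.log"
```

Then grep the log for "Full GC" entries and their pause times.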

Re: More tables, or add a prefix to each row key?

2012-04-19 Thread Ian Varley
Tom, The overall tradeoff with table vs prefix is that the former adds some (small) amount of cluster management overhead for each new table, whereas the latter adds runtime overhead (memory, cpu, disk, etc) on every operation. In your case, since you're just talking about ~3 tables vs 1, my
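The prefix option can be sketched concretely: with a short type prefix on every row key, a scan bounded by [prefix, byte-after-prefix) touches only that logical table. A Python illustration (key and prefix names are made up):

```python
# One physical table; logical tables distinguished by a row-key prefix.
def prefixed_key(prefix, key):
    return prefix.encode() + b":" + key

k = prefixed_key("users", b"alice")
print(k)  # b'users:alice'

# Scan bounds that cover exactly the "users" rows: stop at the byte
# that sorts immediately after ':' (which is ';').
start, stop = b"users:", b"users;"
assert start <= k < stop
```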

Re: Applying filters to ResultScanner

2012-04-19 Thread Kevin M
Thanks for the reply. I see. Would HBase cache the results of the first scan so it wouldn't take as long to collect the results? Say there were 5 facets selected one after another. A new scan would take place with more strict filtering each time on the whole table rather than to use the results

Re: HBase parallel scanner performance

2012-04-19 Thread Narendra yadala
Michael, I will do the redesign and build the index. Thanks a lot for the insights. Narendra On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel michael_se...@hotmail.comwrote: Narendra, I think you are still missing the point. 130 seconds to scan the table per iteration. Even if you have 10K
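The redesign being agreed on here amounts to replacing the per-row scan with a precomputed inverted index. A minimal sketch in Python (the data and names are made up for illustration):

```python
# Build an inverted index (url -> row keys) in one pass, then answer
# "which rows share this url" with a lookup instead of a fresh table
# scan per row.
from collections import defaultdict

rows = {
    "row1": "http://a.example",
    "row2": "http://b.example",
    "row3": "http://a.example",
}

index = defaultdict(list)
for row_key, url in rows.items():   # single pass over the table
    index[url].append(row_key)

# Per-row work is now one dictionary lookup, not a full scan.
print(index["http://a.example"])   # ['row1', 'row3']
```

In HBase terms, the index would typically be a second table keyed by URL, populated in the same pass that reads the tweets.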

Re: HBase parallel scanner performance

2012-04-19 Thread Michael Segel
No problem. One of the hardest things to do is to try to be open to other design ideas and not become wedded to one. I think once you get that working you can start to look at your cluster. On Apr 19, 2012, at 1:26 PM, Narendra yadala wrote: Michael, I will do the redesign and build the

Re: Need help on using hbase on EC2

2012-04-19 Thread Jean-Daniel Cryans
Would it be possible for you to pastebin a much bigger portion of the hbase log? Thx, J-D On Tue, Apr 17, 2012 at 10:35 AM, Xin Liu codeoe...@gmail.com wrote: Hi there, I setup hadoop and hbase on top of EC2 in Pseudo-distributed mode. I can use hbase shell to connect. However, when I use

Re: Duplicate an HBase cluster

2012-04-19 Thread lars hofhansl
A good way of doing that is to start replicating to the new cluster using HBase replication. Then *after* replication has been set up and enabled, you would issue a CopyTable M/R job for each table. After the CopyTable jobs are finished, you have a backup cluster that is behind by only a few seconds (however
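The CopyTable step described above would look roughly like this when run from the source cluster (a sketch assuming the 0.90-era driver name; the peer ZooKeeper quorum and table name are placeholders):

```shell
# Hypothetical CopyTable invocation; --peer.adr is the
# zk-quorum:zk-port:zk-parent of the *destination* cluster.
hadoop jar $HBASE_HOME/hbase-0.90.4.jar copytable \
  --peer.adr=backup-zk1,backup-zk2,backup-zk3:2181:/hbase \
  mytable
```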

Re: Applying filters to ResultScanner

2012-04-19 Thread Alex Baranau
Regarding caching during scans, there are two types of caches: * caching (buffering) the records before returning them to the client, enabled via scan.setCaching(numRows) * the block cache on a region server, enabled via setCacheBlocks(true) The latter one (the block cache) is what you are looking for.
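The first of the two, scan.setCaching(n), can be modeled as simple client-side batching: the client fetches n rows per round trip and serves next() from a local buffer, so a larger caching value means fewer RPCs. A toy sketch, purely illustrative:

```python
# Model of scanner caching: rows fetched per RPC = the caching value,
# so the number of round trips is total_rows / caching, rounded up.
def rpcs_needed(total_rows, caching):
    full, rem = divmod(total_rows, caching)
    return full + (1 if rem else 0)

print(rpcs_needed(10_000, 1))    # 10000 round trips with caching of 1 row
print(rpcs_needed(10_000, 500))  # 20 round trips with setCaching(500)
```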

Re: Applying filters to ResultScanner

2012-04-19 Thread Kevin M
Thanks for pointing me towards setCacheBlocks() and explaining the difference between those two types of caching in HBase. According to the API documentation, setCacheBlocks defaults to true, so it looks like HBase will take care of what I am looking for automatically. Thanks so much for your

Re: Applying filters to ResultScanner

2012-04-19 Thread Alok Kumar
Thanks for pointing out setCacheBlocks(); its HBase default value will provide better performance for the following filters as well as for Kevin's multiple-facet search. -Alok On Fri, Apr 20, 2012 at 7:02 AM, Kevin M kevin.macksa...@gmail.com wrote: Thanks for pointing me towards