Re: HBase and Datawarehouse

2013-04-30 Thread Viral Bajaria
On Mon, Apr 29, 2013 at 10:54 PM, Asaf Mesika asaf.mes...@gmail.com wrote: I think for Phoenix to truly succeed, it needs HBase to break the JVM heap barrier of 12G, as I saw mentioned in a couple of posts, since lots of analytics queries utilize memory, and since its memory is shared with

Re: max regionserver handler count

2013-04-30 Thread Viral Bajaria
Thanks for getting back, Ted. I totally understand the other priorities and will wait for some feedback. I am adding some more info to this post to help diagnose the performance better. I hit my region servers with a lot of GET requests (~20K per second per regionserver) using asynchbase in my test

Re: HBase and Datawarehouse

2013-04-30 Thread James Taylor
Phoenix will succeed if HBase succeeds. Phoenix just makes it easier to drive HBase to its maximum capability. IMHO, if HBase is to make further gains in the OLAP space, scans need to be faster and new, more compressed columnar-store type block formats need to be developed. Running inside

Re: max regionserver handler count

2013-04-30 Thread Anoop John
Are you making use of batch Gets? get(List<Get>) -Anoop- On Tue, Apr 30, 2013 at 11:40 AM, Viral Bajaria viral.baja...@gmail.com wrote: Thanks for getting back, Ted. I totally understand the other priorities and will wait for some feedback. I am adding some more info to this post to allow better

Re: max regionserver handler count

2013-04-30 Thread Viral Bajaria
I am using asynchbase, which does not have the notion of batch gets. It only allows you to batch at the rowkey level within a single get request. -Viral On Mon, Apr 29, 2013 at 11:29 PM, Anoop John anoop.hb...@gmail.com wrote: Are you making use of batch Gets? get(List<Get>) -Anoop-

Re: Scala and Hbase, hbase-default.xml file seems to be for and old version of HBase (null)

2013-04-30 Thread Håvard Wahl Kongsgård
Nope... the system is clean, only CDH4 on it. And I can't find hbase-default.xml on the system. However, I solved this issue by downloading http://hbase_master:60010/conf, renaming it to hbase-default.xml, and adding that to the classpath. So maybe a bug in CDH4. On Mon, Apr 29, 2013 at 11:36 PM,

Re: max regionserver handler count

2013-04-30 Thread Anoop John
If you can make use of the batch API, i.e. get(List<Get>), you can reduce the handlers (and the number of RPC calls too). One batch will use one handler. Regarding "I am using asynchbase which does not have the notion of batch gets": I have not checked with asynchbase. Just mentioning it as a pointer. -Anoop- On Tue,
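For reference, a minimal sketch of the batch-get API Anoop mentions, using the 0.94-era HTable client; the table, family, and qualifier names here are made up for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchGetExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "my_table");  // hypothetical table name
            List<Get> gets = new ArrayList<Get>();
            for (String row : new String[] {"row1", "row2", "row3"}) {
                Get get = new Get(Bytes.toBytes(row));
                get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));  // hypothetical column
                gets.add(get);
            }
            // The Gets travel in batched RPCs (one per regionserver), so one
            // handler serves a whole batch instead of one handler per Get.
            Result[] results = table.get(gets);
            for (Result r : results) {
                System.out.println(Bytes.toString(r.getRow()));
            }
            table.close();
        }
    }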

Re: HBase and Datawarehouse

2013-04-30 Thread Andrew Purtell
I don't wish to be rude, but you are making odd claims as fact ("as mentioned in a couple of posts"). It will be difficult to have a serious conversation. I encourage you to test your hypotheses and let us know if in fact there is a JVM heap barrier (and where it may be). On Monday, April 29, 2013,

Re: max regionserver handler count

2013-04-30 Thread Viral Bajaria
I looked closely into the asynchbase API and there is no way to batch GETs to reduce the number of RPC calls and thus the handlers. I will play around with the handlers again tomorrow and see if I can find anything interesting. On Tue, Apr 30, 2013 at 12:03 AM, Anoop John anoop.hb...@gmail.com wrote: If you can
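A hedged sketch of the per-row fan-out that asynchbase (circa 1.x) implies here: each row key needs its own GetRequest, though the calls are non-blocking and can be issued in parallel and then joined. The ZooKeeper quorum and table name are placeholders:

    import java.util.ArrayList;
    import com.stumbleupon.async.Deferred;
    import org.hbase.async.GetRequest;
    import org.hbase.async.HBaseClient;
    import org.hbase.async.KeyValue;

    public class AsyncFanOut {
        public static void main(String[] args) throws Exception {
            HBaseClient client = new HBaseClient("zk-host");  // hypothetical ZK quorum
            ArrayList<Deferred<ArrayList<KeyValue>>> pending =
                new ArrayList<Deferred<ArrayList<KeyValue>>>();
            for (String key : new String[] {"row1", "row2", "row3"}) {
                // One RPC per row key; there is no get(List<Get>) equivalent here.
                pending.add(client.get(new GetRequest("my_table", key)));
            }
            // Block until all of the parallel GETs complete.
            ArrayList<ArrayList<KeyValue>> results = Deferred.group(pending).join();
            System.out.println("fetched " + results.size() + " rows");
            client.shutdown().join();
        }
    }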

Re: Corrupt files

2013-04-30 Thread Jean-Marc Spaggiari
Hi Loïc, How many datanodes do you have in your cluster? Your replication factor is set to 3, so I think you should have at least 3 datanodes? Is one of those nodes down? There are some blocks missing; maybe they are on a system which is down now? Bringing it back up might restore those blocks.

Re: HBase is not running.

2013-04-30 Thread Jean-Marc Spaggiari
Hi Yves, Your hosts file looks good. Don't even try the shell until you get the UI displayed correctly and the server logs saying that initialization is done. So what do you have in the logs when you are trying with this new hosts file? JM 2013/4/28 Asaf Mesika asaf.mes...@gmail.com

Re: HBase and Datawarehouse

2013-04-30 Thread Kevin O'dell
Asaf, The heap barrier is something of a legend :) You can ask 10 different HBase committers what they think the max heap is and get 10 different answers. This is my take on heap sizes from the many clusters I have dealt with: 8GB - Standard heap size, and tends to run fine without any

Re: Corrupt files

2013-04-30 Thread Loic Talon
Hi Jean-Marc, Thanks. I have one datanode in my cluster. The node isn't down. How can I restore those blocks? Loïc TALON mail.lta...@teads.tv http://teads.tv/ Video Ads Solutions 2013/4/30 Jean-Marc Spaggiari jean-m...@spaggiari.org Hi Loïc, How many datanodes do you have in your

Re: Corrupt files

2013-04-30 Thread Jean-Marc Spaggiari
Bonjour Loïc, I don't think you can restore those blocks. If you have only one datanode and it doesn't have the missing blocks, there is nowhere for Hadoop to get those blocks back, so unfortunately I don't think you can restore them. Also, this is more Hadoop- than HBase-related. You might want

Re: Read access pattern

2013-04-30 Thread Michael Segel
Geez, that's a bad article. Never salt. And yes, there's a difference between using a salt and using the first 2-4 bytes from your MD5 hash. (Hint: salts are random. Your hash isn't.) Sorry to be-itch, but it's a bad idea and it shouldn't be propagated. On Apr 29, 2013, at 10:17 AM, Shahab

Re: discp versus export

2013-04-30 Thread Asaf Mesika
The replication.html page appears to contain a reference to a bug (2611) which was solved two years ago :) On Wed, Mar 6, 2013 at 12:15 AM, Damien Hardy dha...@viadeoteam.com wrote: IMO the easier option would be hbase export for long-term offline backup (for disaster recovery). It can even be

Re: Read access pattern

2013-04-30 Thread Shahab Yunus
Well, those are *some* words :) Anyway, can you explain in a bit more detail why you feel so strongly about this design/approach? The salting here is not the only option mentioned, and static hashing can be used as well. Plus, even in the case of salting, wouldn't the distributed scan take care of it?

Re: Re: While starting 3-nodes cluster hbase: WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null

2013-04-30 Thread John Foxinhead
Now I post my configurations: I use a 3-node cluster with all the nodes running Hadoop, ZooKeeper and HBase. The HBase master, a ZooKeeper daemon and the Hadoop namenode run on the same host. An HBase regionserver, a ZooKeeper daemon and a Hadoop datanode run on each of the other 2 nodes. I called one of the

RE: Read access pattern

2013-04-30 Thread ricla
1. Change the schema If I understand correctly, in this scenario, I lose the ordering (changeDate desc). Moreover, in my case, I could have 100k rows per objectId, meaning I would have to iterate over a long list, but I understand the logic. If I only look for 24 hours before the original column

RE: Read access pattern

2013-04-30 Thread ricla
Yes, I see, but this is quite expensive as the table is huge. -Original Message- From: Jean-Marc Spaggiari [mailto:jean-m...@spaggiari.org] Sent: Monday, April 29, 2013 20:04 To: user@hbase.apache.org; ri...@laposte.net Subject: Re: Read access pattern HBASE-4811 is what you should be

Re: Re: While starting 3-nodes cluster hbase: WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null

2013-04-30 Thread John Foxinhead
I solved the last problem: I modified the file /etc/hostname and replaced the default hostname, debian01, with namenode, jobtracker, or datanode, the hostnames I used in the HBase conf files. Now I start HBase from the master with bin/start-hbase.sh and the regionservers, instead of trying to connect with

Re: Re: While starting 3-nodes cluster hbase: WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null

2013-04-30 Thread John Foxinhead
I solved my problem with ZooKeeper. I don't know how, maybe it was a spell xD I did it this way: on a slave I removed the HBase directory and copied over the directory of the pseudo-distributed HBase (which works). Then I copied all the configurations from the virtual machine which ran as master in

Re: Re: While starting 3-nodes cluster hbase: WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null

2013-04-30 Thread Jean-Marc Spaggiari
Hi John, Thanks for sharing that. It might help other people who are facing the same issues. JM 2013/4/30 John Foxinhead john.foxinh...@gmail.com Now I post my configurations: I use a 3-node cluster with all the nodes running Hadoop, ZooKeeper and

Re: Read access pattern

2013-04-30 Thread James Taylor
bq. The downside that I see is the bucket_number that we have to maintain both at the time of reading/writing and update in case of cluster restructuring. I agree that this maintenance can be painful. However, Phoenix (https://github.com/forcedotcom/phoenix) now supports salting, automating

Re: Read access pattern

2013-04-30 Thread Michael Segel
Sure. By definition, the salt number is a random seed that is not associated with the underlying record. A simple example is a round-robin counter (mod the counter by 10, yielding [0..9]). So you get a record, prepend your salt, and write it out to HBase. The salt will push the data out to
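A minimal sketch of that round-robin salt, assuming 10 buckets; the class and method names are made up:

    import java.util.concurrent.atomic.AtomicLong;

    public class RoundRobinSalter {
        private static final int BUCKETS = 10;
        private final AtomicLong counter = new AtomicLong();

        // Prepend a salt byte in [0..9] that has no relation to the row's
        // content; unlike a hash prefix, it cannot be recomputed from the key.
        public byte[] salt(byte[] rowKey) {
            byte saltByte = (byte) (counter.getAndIncrement() % BUCKETS);
            byte[] salted = new byte[rowKey.length + 1];
            salted[0] = saltByte;
            System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
            return salted;
        }
    }

Note the read-side consequence: because the salt cannot be derived from the key, a later lookup has to check all 10 buckets, which is the fan-out lars describes further down the thread.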

Re: HBase and Datawarehouse

2013-04-30 Thread Andrew Purtell
Rules of thumb for starting off safely and for easing support issues are really good to have, but there are no hard barriers or singular approaches: use Java 7 + G1GC, disable the HBase block cache in favor of the OS block cache, run multiple regionservers per host. It is going to depend on how the cluster

Very poor read performance with composite keys in hbase

2013-04-30 Thread Rupinder Singh
Hi, I have an HBase cluster where I have a table with a composite key. I map this table to a Hive external table, using which I insert/select data into/from this table: CREATE EXTERNAL TABLE event(key struct<name:string,dateCreated:string,uid:string>, {more columns here}) ROW FORMAT DELIMITED

RE: Very poor read performance with composite keys in hbase

2013-04-30 Thread Rupinder Singh
Here it is: select * from event where key.name='Signup' and key.dateCreated='2013-03-06 16:39:55.353' and key.uid='7af4c330-5988-4255-9250-924ce5864e3bf'; From: kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com] Sent: Tuesday, April 30, 2013 11:25 PM To: u...@hive.apache.org Cc:

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread kulkarni.swar...@gmail.com
Can you show your query that is taking 700 seconds? On Tue, Apr 30, 2013 at 12:48 PM, Rupinder Singh rsi...@care.com wrote: Hi, I have an HBase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data

Re: HBase and Datawarehouse

2013-04-30 Thread Amandeep Khurana
Multiple RSs per host get you around the WAL bottleneck as well, but it's operationally less than ideal. Do you usually recommend this approach, Andy? I've shied away from it mostly. On Apr 30, 2013, at 10:38 AM, Andrew Purtell apurt...@apache.org wrote: Rules of thumb for starting off safely

Re: HBase and Datawarehouse

2013-04-30 Thread Andrew Purtell
You wouldn't do that if colocating MR. It is one way to soak up extra RAM on a large RAM box, although I'm not sure I would recommend it (I have no personal experience trying it, yet). For more on this where people are actively considering it, see https://issues.apache.org/jira/browse/BIGTOP-732

Re: HBase and Datawarehouse

2013-04-30 Thread Andrew Purtell
Running more than one RS on a host is an option for soaking up extra RAM, since that is what we are discussing, but I can't recommend it because I have no experience with that approach. I think I do want to experiment with it, but not on a box with less than something like 16 or 24 cores. On

Re: HBase and Datawarehouse

2013-04-30 Thread Michael Segel
Hmmm, I don't recommend HBase in situations where you are not running an M/R framework. Sorry, as much as I love HBase, IMHO there are probably better solutions for a standalone NoSQL database. (YMMV depending on your use case.) The strength of HBase is that it's part of the Hadoop ecosystem.

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread kulkarni.swar...@gmail.com
Rupinder, Hive supports filter pushdown[1], which means that the predicates in the where clause are pushed down to the storage handler level, where they either get handled by the storage handler or delegated back to Hive if the handler cannot handle them. As of now, the HBaseStorageHandler only supports

Re: checkAnd...

2013-04-30 Thread Lior Schachter
Hi, We have a simple HBase schema: row key = subscriber id. Column family A = counters, all kinds of aggregations. Event records have a UUID; in some scenarios we might get duplicate events. We should not count the duplicates. A possible solution was to keep event ids as qualifiers in another
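Going by the thread title, a hedged sketch of a checkAndPut-based dedup (0.94-era client); whether this matches the approach being discussed is a guess, and the family/qualifier names are made up:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DedupCounter {
        public static void countOnce(HTable table, byte[] subscriberId, byte[] eventUuid)
                throws java.io.IOException {
            byte[] seen = Bytes.toBytes("seen");   // hypothetical event-id family
            byte[] counters = Bytes.toBytes("A");  // counter family from the post
            Put put = new Put(subscriberId);
            put.add(seen, eventUuid, Bytes.toBytes(1L));
            // Expected value null means "apply the Put only if the cell is absent",
            // so a duplicate event id fails the check and is not counted again.
            boolean firstTime = table.checkAndPut(subscriberId, seen, eventUuid, null, put);
            if (firstTime) {
                // Not atomic with the checkAndPut above: a crash between the two
                // RPCs could lose one increment.
                table.incrementColumnValue(subscriberId, counters, Bytes.toBytes("total"), 1L);
            }
        }
    }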

Re: HBase is not running.

2013-04-30 Thread Yves S. Garret
Hi Jean-Marc. Thanks for the tip. However, for the moment at least, I'm going to be abandoning my forays into HBase, I received direction to focus on Hive instead. Again, thank you. Should I need help in the near future, I'll be sure to send the mailing list an enquiry. On Tue, Apr 30, 2013

RE: Very poor read performance with composite keys in hbase

2013-04-30 Thread Rupinder Singh
Swarnim, Thanks. So this means custom map-reduce is the viable option when working with HBase tables having composite keys, since it allows you to set the start and stop keys. The Hive+HBase combination is out. Regards, Rupinder From: kulkarni.swar...@gmail.com [mailto:kulkarni.swar...@gmail.com]

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread kulkarni.swar...@gmail.com
That depends on how dynamic your data is. If it is pretty static, you can also consider using something like Create Table As Select (CTAS) to create a snapshot of your data in HDFS and then run queries on top of that data. So your query might become something like: create table my_table as

Re: Very poor read performance with composite keys in hbase

2013-04-30 Thread James Taylor
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? It'll use all of the parts of your row key and, depending on how much data you're returning to the client, will query over 10 million rows in seconds. James @JamesPlusPlus http://phoenix-hbase.blogspot.com On Apr 30,

Poor HBase map-reduce scan performance

2013-04-30 Thread Bryan Keller
I have been attempting to speed up my HBase map-reduce scans for a while now. I have tried just about everything without much luck. I'm running out of ideas and was hoping for some suggestions. This is HBase 0.94.2 and Hadoop 2.0.0 (CDH4.2.1). The table I'm scanning: 20 million rows, hundreds of

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Ted Yu
From http://hbase.apache.org/book.html#mapreduce.example : scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs scan.setCacheBlocks(false); // don't set to true for MR jobs I guess you have used the above settings. 0.94.x releases are compatible. Have
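For context, a minimal sketch of the job setup surrounding those two lines in the book's MapReduce example; the table name and the identity mapper are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class ScanJobSetup {
        static class IdentityMapper extends TableMapper<ImmutableBytesWritable, Result> {
            @Override
            protected void map(ImmutableBytesWritable key, Result value, Context context)
                    throws IOException, InterruptedException {
                context.write(key, value);  // placeholder: real jobs do work here
            }
        }

        public static Job createJob(Configuration conf) throws IOException {
            Job job = new Job(conf, "scan-my-table");
            job.setJarByClass(ScanJobSetup.class);
            Scan scan = new Scan();
            scan.setCaching(500);        // the default of 1 means one RPC per row
            scan.setCacheBlocks(false);  // a full scan would churn the block cache
            TableMapReduceUtil.initTableMapperJob("my_table", scan,
                IdentityMapper.class, ImmutableBytesWritable.class, Result.class, job);
            return job;
        }
    }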

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Bryan Keller
Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false). On Apr 30, 2013, at 9:17 PM, Ted Yu yuzhih...@gmail.com wrote: From http://hbase.apache.org/book.html#mapreduce.example : scan.setCaching(500); // 1 is the default in Scan, which will be bad for

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Ted Yu
Have you tried enabling short-circuit read? Thanks On Apr 30, 2013, at 9:31 PM, Bryan Keller brya...@gmail.com wrote: Yes, I have tried various settings for setCaching() and I have setCacheBlocks(false) On Apr 30, 2013, at 9:17 PM, Ted Yu yuzhih...@gmail.com wrote: From

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread Bryan Keller
Yes, I have it enabled (forgot to mention that). On Apr 30, 2013, at 9:56 PM, Ted Yu yuzhih...@gmail.com wrote: Have you tried enabling short-circuit read? Thanks On Apr 30, 2013, at 9:31 PM, Bryan Keller brya...@gmail.com wrote: Yes, I have tried various settings for setCaching() and

Re: Poor HBase map-reduce scan performance

2013-04-30 Thread lars hofhansl
Your average row is 35k, so scanner caching would not make a huge difference, although I would have expected some improvement by setting it to 10 or 50 since you have a wide 10GbE pipe. I assume your table is split sufficiently to touch all RegionServers... Do you see the same load/IO on all

Re: Read access pattern

2013-04-30 Thread lars hofhansl
I do not want to be rude or anything... but how often do we need to have this discussion? When you salt your rowkeys with, say, 10 salt values, then for each read you need to fork off 10 read requests, and each of them touches only 1/10th of the table (which works nicely with HBase's prefix scans).
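A minimal sketch of that fan-out, assuming a 1-byte salt and 10 buckets; names are illustrative:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class SaltedRead {
        private static final int BUCKETS = 10;

        // One logical read becomes BUCKETS prefix scans, each confined to the
        // 1/10th of the table that carries its salt byte.
        public static List<Result> readAllBuckets(HTable table, byte[] keyPrefix)
                throws java.io.IOException {
            List<Result> results = new ArrayList<Result>();
            for (byte saltByte = 0; saltByte < BUCKETS; saltByte++) {
                byte[] start = new byte[keyPrefix.length + 1];
                start[0] = saltByte;
                System.arraycopy(keyPrefix, 0, start, 1, keyPrefix.length);
                byte[] stop = start.clone();
                stop[stop.length - 1]++;  // sketch only; ignores 0xFF overflow
                Scan scan = new Scan(start, stop);
                ResultScanner scanner = table.getScanner(scan);
                for (Result r : scanner) {
                    results.add(r);
                }
                scanner.close();
            }
            return results;
        }
    }

In practice the 10 scans would be issued in parallel (e.g., from an executor) rather than in this sequential loop, which is exactly the forked read cost being debated.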

Re: Schema Design Question

2013-04-30 Thread lars hofhansl
Same here. HBase is generally good at homing in on a small (maybe 10-100M rows) contiguous subset of an essentially unlimited dataset. If all you ever do is scan _everything_ and then throw it away, a straight scan (using Impala, for example) or direct M/R on file(s) in HDFS is far