Hi Sleiman
Take a look at this for some ideas:
http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf
-Amandeep
On Wed, Jul 1, 2015 at 11:53 AM, Sleiman Jneidi jneidi.slei...@gmail.com
wrote:
Hello everyone, I am working on a schema design
I've seen other threads like this from Michael in the past. While I ignore
them when they show up, it is certainly off-putting to the community
members and discourages open discussions and sharing of ideas. Some people
might not understand the problems as well as others or might have
completely
problem, i.e., get the latest
posts from the 2k people I follow. That, for me, is the challenge. I would
really appreciate any ideas.
Thank you.
On Wed, 1 Jul 2015 at 7:57 pm Amandeep Khurana ama...@gmail.com wrote:
Hi Sleiman
Take a look at this for some ideas:
http://0b4af6cdc2f0c5998459
with more details.
-- Lars
From: Amandeep Khurana ama...@gmail.com
To: user@hbase.apache.org user@hbase.apache.org
Sent: Tuesday, July 15, 2014 10:48 PM
Subject: Cluster sizing guidelines
Hi
How do users usually go about sizing HBase clusters? What are the factors
you take into account? What are typical hardware profiles you run with? Any
data points you can share would help.
Thanks
Amandeep
On Sun, May 12, 2013 at 11:40 PM, Praveen Bysani praveen.ii...@gmail.com wrote:
Hi,
I have the dfs.block.size value set to 1 GB in my cluster configuration.
Just out of curiosity - why do you have it set at 1GB?
I
have around 250 GB of data stored in HBase over this cluster. But when I
The start script is a shell script and it forks a new shell when the
script is executed. That'll source the bashrc file.
On May 11, 2013, at 12:39 PM, Mohammad Tariq donta...@gmail.com wrote:
Hello list,
Does HBase read the environment variables set in the
*~/.bashrc* file every time I
not able to understand
this. Pardon my ignorance.
Warm Regards,
Tariq
cloudfront.blogspot.com
On Sun, May 12, 2013 at 1:14 AM, Amandeep Khurana ama...@gmail.com
wrote:
The start script is a shell script and it forks a new shell when the
script is executed. That'll source the bashrc file.
Thanks and Regards
Raju.
On Fri, May 10, 2013 at 11:07 AM, Amandeep Khurana ama...@gmail.com
wrote:
Can you tell what the logs are saying? Did you try to restart your
application?
On Thu, May 9, 2013 at 10:36 PM, naga raju rajumudd...@gmail.com
wrote:
Thanks for the Reply
Are you running HBase on your local development box or on some server
outside and trying to connect to it from your dev box?
On Thu, May 9, 2013 at 6:36 AM, raju rajumudd...@gmail.com wrote:
Hi all,
I am new to HBase (Hadoop), and I am working on HBase in standalone
mode. My application
What do the HBase logs say? Can you access the web UI?
On Thu, May 9, 2013 at 9:57 PM, naga raju rajumudd...@gmail.com wrote:
I am running HBase on the same (single) system that I am using for development.
On Fri, May 10, 2013 at 6:19 AM, Amandeep Khurana ama...@gmail.com
wrote:
Are you
Can you tell what the logs are saying? Did you try to restart your
application?
On Thu, May 9, 2013 at 10:36 PM, naga raju rajumudd...@gmail.com wrote:
Thanks for the Reply Amandeep,
I can access the web UI.
On Fri, May 10, 2013 at 10:52 AM, Amandeep Khurana ama...@gmail.com
wrote:
To add to what Andy said - the key to getting HBase running well in AWS is:
1. Choose the right instance types. I usually recommend the HPC
instances or, now, the high-storage-density instances. Those will give
you the best performance.
2. Use the latest Amazon Linux AMIs and the latest HBase and
I've not come across anyone spanning clusters cross AZ. You pay for cross
AZ traffic and the link is slower than within a single AZ.
Amandeep
On Mon, May 6, 2013 at 10:37 AM, Otis Gospodnetic
otis.gospodne...@gmail.com wrote:
Hi,
Do people spread HBase clusters over multiple EC2
if there are any HBase
(or HDFS)-specific reasons why one should not attempt to do this?
Thanks,
Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
HBASE Performance Monitoring - http://sematext.com/spm/index.html
On Mon, May 6, 2013 at 1:41 PM, Amandeep Khurana
Multiple RS' per host gets you around the WAL bottleneck as well. But
it's operationally less than ideal. Do you usually recommend this
approach, Andy? I've shied away from it mostly.
On Apr 30, 2013, at 10:38 AM, Andrew Purtell apurt...@apache.org wrote:
Rules of thumb for starting off safely
+user@
moving dev@ to bcc
Kiran
Can you elaborate on what you mean by semantic integration?
Also, this question is more relevant for the user mailing list than
the dev list.
Amandeep
On Apr 27, 2013, at 3:13 PM, Kiran kiranvk2...@gmail.com wrote:
We can use HBase as a storage back end for
to user:location.
On Sun, Apr 28, 2013 at 3:52 AM, Amandeep Khurana [via Apache HBase]
ml-node+s679495n4043154...@n3.nabble.com wrote:
+user@
moving dev@ to bcc
Kiran
Can you elaborate on what you mean by semantic integration?
Also, this question is more relevant for the user mailing
To add to Andy's point - storing images in HBase is fine as long as
the size of each image isn't huge. A couple of MBs per row in HBase
does just fine. But once you start getting into 10s of MBs, there are
more optimal solutions you can explore and HBase might not be the best
bet.
Amandeep
On Jan
Mehmet
What's the problem you are getting while running the Sqoop job? Can
you give details?
-Amandeep
On Thu, Dec 13, 2012 at 8:44 AM, Manoj Babu manoj...@gmail.com wrote:
Mehmet,
You can try to write a MapReduce using DBInputFormat and insert into HBase.
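For illustration, a minimal sketch of such a job — everything here (the record class, JDBC URL, table, and column names) is a made-up example, not Mehmet's actual setup:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbToHBase {

  // One row from the (hypothetical) source RDBMS table.
  public static class MyRecord implements Writable, DBWritable {
    long id;
    String name;
    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id"); name = rs.getString("name");
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id); ps.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong(); name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id); out.writeUTF(name);
    }
  }

  // Turn each database row into a Put keyed by the primary key.
  static class ImportMapper
      extends Mapper<LongWritable, MyRecord, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, MyRecord rec, Context ctx)
        throws IOException, InterruptedException {
      Put put = new Put(Bytes.toBytes(rec.id));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("name"), Bytes.toBytes(rec.name));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "pass");
    Job job = Job.getInstance(conf, "db-to-hbase");
    job.setJarByClass(DbToHBase.class);
    job.setInputFormatClass(DBInputFormat.class);
    DBInputFormat.setInput(job, MyRecord.class, "source_table",
        null /* conditions */, "id" /* order by */, "id", "name");
    job.setMapperClass(ImportMapper.class);
    // Map-only job; TableOutputFormat batches the Puts to the target table.
    TableMapReduceUtil.initTableReducerJob("target_table", null, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

For very large imports, writing HFiles with HFileOutputFormat and bulk-loading them is usually faster than going through Puts.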
What failure condition are you trying to safeguard against? A full
data center failure? That's when you would lose your entire cluster
and need the DR to kick in. Otherwise, you could deploy such that an
entire rack failure or even a row failure won't take you down. Just
span across multiple racks
Answers inline
On Fri, Oct 19, 2012 at 4:31 PM, Dave Latham lat...@davelink.net wrote:
I need to scale an internal service / datastore that is currently hosted on
an HBase cluster and wanted to ask for advice from anyone out there who may
have some to share. The service does simple key value
Mohit
Getting the maximum performance out of HBase isn't just about tuning
the cluster. There are several other factors to take into account. The
two most important are:
1. The schema design
2. How you are using the APIs
Starting with the default configs is okay.
Also, you might have read that an initial loading of data can be better
distributed across the cluster if the table is pre-split rather than
starting with a single region and splitting (possibly aggressively,
depending on the throughput) as the data loads in. Once you are in a stable
state with
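For reference, a minimal sketch of pre-splitting at table-creation time with the Java admin API — the table and family names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
      desc.addFamily(new HColumnDescriptor("d"));
      // Create 16 regions up front, spread evenly across the key space, so the
      // initial load spreads over the cluster instead of hammering one region
      // that has to split repeatedly.
      admin.createTable(desc, Bytes.toBytes("00"), Bytes.toBytes("ff"), 16);
    }
  }
}

The split points should match the distribution of your keys; evenly spaced points only help if the keys themselves are evenly distributed.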
Can you give an example of what you are trying to do and how you would
use both the writes coming in at the same instant for the same cell
and why do you say that the nanosecond approach is tricky?
On Aug 28, 2012, at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
How does it deal with
How do you want to use two tables? Can you explain your algo a bit?
On Fri, Aug 10, 2012 at 6:40 PM, Weishung Chung weish...@gmail.com wrote:
Hi HBase users,
I need to pull data from 2 HBase tables in a mapreduce job. For 1 table
input, I use TableMapReduceUtil.initTableMapperJob. Is there
want to denormalize and not need joins when working with HBase
(or for that matter most NoSQL stores).
On Fri, Aug 10, 2012 at 6:52 PM, Weishung Chung weish...@gmail.com wrote:
Basically a join of two data sets on the same row key.
On Fri, Aug 10, 2012 at 6:12 AM, Amandeep Khurana ama
Correct. You are limited to the throughput of a single region server while
interacting with a particular region. This throughput limitation is
typically handled by designing your keys such that your data is distributed
well across the cluster.
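One common way to get that distribution is to salt the keys. A small hypothetical helper, not something from this thread:

import java.util.Arrays;

public class SaltedKeys {
  static final int BUCKETS = 16; // match the number of pre-split regions

  // Build a row key of the form <salt byte><original key>.
  static byte[] salted(byte[] key) {
    byte salt = (byte) Math.floorMod(Arrays.hashCode(key), BUCKETS);
    byte[] out = new byte[key.length + 1];
    out[0] = salt;
    System.arraycopy(key, 0, out, 1, key.length);
    return out;
  }
}

The trade-off is that reads of a contiguous key range now have to fan out across all the salt buckets.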
Having multiple region servers serve a single region in active-passive
mode, where at any one time only one server is active? Correct?
regards,
Lin
On Thu, Aug 9, 2012 at 2:04 PM, Amandeep Khurana ama...@gmail.com wrote:
Correct. You are limited to the throughput of a single region server while
interacting with a particular region. This throughput
Firstly, I recommend you read the GFS and Bigtable papers. That'll give you
a good understanding of the architecture. Ad hoc questions on the mailing
list won't.
I'll try to answer some of your questions briefly. Think of HBase as a
database layer over an underlying filesystem (the same way MySQL
of the
CAP is sacrificed?
regards,
Lin
On Thu, Aug 9, 2012 at 1:34 PM, Amandeep Khurana ama...@gmail.com wrote:
Firstly, I recommend you read the GFS and Bigtable papers. That'll give
you
a good understanding of the architecture. Ad hoc questions on the mailing
list won't.
I'll try to answer
This is most likely because of a mismatch in the ZK library version between
your web service and the HBase install. Can you confirm you got the same
version in both places?
On Monday, July 23, 2012 at 8:31 AM, Rajendra Manjunath wrote:
I have HBase configured in pseudo-distributed mode and
On Mon, Jul 23, 2012 at 9:58 AM, Jonathan Bishop jbishop@gmail.com wrote:
Hi,
Thanks everyone for the informative discussion on this topic.
I think that for the project I am involved in, I must remove the risk, however
small, of a row key collision, and append the original id (in my case a
You shouldn't have empty regions. Using timestamp will give you
regions that are always half filled except the last one, to which you
are writing the current time range. The moment it fills up, it'll split,
and you'll again be writing to the last region. How did you end up
with empty regions? Did you
I have come across clusters with 100s of tables but that typically is
due to a suboptimal table design.
The question here is - why do you need to distribute your data over
lots of tables? What's your access pattern and what kind of data are
you putting in? Or is this just a theoretical question?
Inline.
On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote:
Quick question about data node hardware. I've read a few articles, which
cover the basics, including Cloudera's recommendations here:
better.
-Amandeep
On Thursday, July 12, 2012 at 1:20 PM, Bartosz M. Frak wrote:
Amandeep Khurana wrote:
Inline.
On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote:
Quick question about data node hardware. I've read a few articles, which
cover the basics, including
Gen,
HBase has HA across the entire stack. Have you read the original Google
Bigtable paper to understand the architecture of the system? That is a great
place to start.
-Amandeep
On Tuesday, July 10, 2012 at 9:40 PM, Gen Liu wrote:
Hi, I'm new here. I'm doing an evaluation of HBase before
Inline.
On Monday, July 9, 2012 at 12:17 PM, registrat...@circle-cross-jn.com wrote:
Now that I have a stable cluster, I would like to use YCSB to test
its performance; however, I am a bit confused after reading several
different website postings about YCSB.
1) By default will YCSB
I _think_ you should be able to do it and be just fine but you'll need to shut
down the region servers before you remove and start them back up after you are
done. Someone else closer to the internals can confirm/deny this.
On Monday, July 9, 2012 at 12:36 PM, Alex Baranau wrote:
Hello,
On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote:
Finally my HMaster has stabilized and been running for 7 hours. I
believe my networking issues are behind me now. Thank you everyone for
the help.
Awesome.
Looks like the same issue is biting you with the RS too. The RS isn't
. I reduced the number
of nodes to 40 and had all of them placed on the same switch with a
single vlan. I even had the network techs use a completely different
switch just to be safe.
Is there some heartbeat timer I can tweak?
---
Jay Wilson
On 7/5/2012 8:34 PM, Amandeep Khurana wrote:
it attempt to reconnect with ZK on devrackA-03,
get the reject and then attempt ZK on devrackA-04.
---
Jay Wilson
On 7/5/2012 9:08 PM, Amandeep Khurana wrote:
The timeout can be configured using the session timeout configuration. The
default for that is 180s, but that means
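For reference, the knob in question — the config key is the real HBase one, the value here is just an example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ZkTimeoutExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Same knob as zookeeper.session.timeout in hbase-site.xml; the value is
    // milliseconds, so 60000 = 60s instead of the 180s default.
    conf.setInt("zookeeper.session.timeout", 60000);
    System.out.println(conf.get("zookeeper.session.timeout"));
  }
}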
On Tuesday, July 3, 2012 at 10:08 AM, Jay Wilson wrote:
2012-07-03 09:05:00,530 ERROR
org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Couldn't close
log at
hdfs://devrackA-00:8020/var/hbase-hadoop/hbase/-ROOT-/70236052/recovered.edits/046.temp
://pastebin.com/download.php?i=cS1Gm19x
RS (devrackA-06) -- http://pastebin.com/download.php?i=XayB2HeX
RS (devrackB-07) -- http://pastebin.com/download.php?i=RQZ45a8j
RS (devrackB-08) -- http://pastebin.com/download.php?i=ZDZD0z7B
---
Jay Wilson
On 7/3/2012 5:23 PM, Amandeep Khurana wrote:
As someone who has been developing/running/using the software for a longer
period of time than the person who is asking the question, you can best serve
the poster by making them aware of the trade-offs and why it's a good/bad idea
to do things a certain way. At the end of the day, it's their
Jean-Marc,
These are great questions! Find my answers (and some questions for you) inline.
-ak
On Monday, July 2, 2012 at 12:04 PM, Jean-Marc Spaggiari wrote:
Hi,
I have a question regarding the best way to design a table.
Let's imagine I want to store all the people in the world on a
natural to store the number of columns, then
parse them by name, etc. but I think I need to think about it a little
bit more before making any decision...
JM
2012/7/2, Amandeep Khurana ama...@gmail.com (mailto:ama...@gmail.com):
Jean-Marc,
These are great questions! Find my answers
with option 1,
why is it better to have a 2nd table instead of a 2nd column family?
JM
2012/7/2, Amandeep Khurana ama...@gmail.com (mailto:ama...@gmail.com):
Responses inline
On Monday, July 2, 2012 at 12:53 PM, Jean-Marc Spaggiari wrote:
Hi Amandeep,
Thanks for your prompt
To run HBase (or for that matter any distributed system) you need your
networking setup to function properly. "No route to host" is caused by issues
with the underlying network. I have seen ToR switches losing packets, causing
these exceptions. There could be several other issues that could cause
Currently, you have to compile a jar, put it on all servers and restart the
RS process. I don't believe there is an easier way to do it as of right now.
And I agree, it's not entirely desirable to have to restart the cluster to
install a custom filter.
You can combine multiple filters into a
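Presumably that refers to a FilterList. A minimal sketch of combining two built-in filters that way — the family, qualifier, and values are placeholders:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CombinedFilterExample {
  public static Scan buildScan() {
    // MUST_PASS_ALL = logical AND of the filters; use MUST_PASS_ONE for OR.
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
    filters.addFilter(new PrefixFilter(Bytes.toBytes("user42|")));
    filters.addFilter(new SingleColumnValueFilter(
        Bytes.toBytes("d"), Bytes.toBytes("status"),
        CompareOp.EQUAL, Bytes.toBytes("active")));
    Scan scan = new Scan();
    scan.setFilter(filters);
    return scan;
  }
}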
Mohit,
What would be your read patterns later on? Are you going to read per
session, or for a time period, or for a set of users, or process through
the entire dataset every time? That would play an important role in
defining your keys and columns.
-Amandeep
On Tue, Jun 26, 2012 at 1:34 PM,
Cyril,
Did you notice the space on the hbase directory in HDFS change at all? It
takes time to complete the major compactions (and it depends on the size of
the tables). Deleting column families will just delete those HFiles. That
should definitely free up space.
-Amandeep
On Wed, Jun 27, 2012
Mohammad,
Can you describe what you are trying to do a little more? Is this an
endpoint coprocessor you are trying to build? What is the functionality
it'll provide?
-Amandeep
On Tue, Jun 26, 2012 at 12:44 PM, Mohammad Tariq donta...@gmail.com wrote:
Hello Lars,
Thank you so much for
and query parameters
How should I go about designing schema?
Thanks
Sent from my iPad
On Jun 27, 2012, at 2:01 PM, Amandeep Khurana ama...@gmail.com
(mailto:ama...@gmail.com) wrote:
Mohit,
What would be your read patterns later on? Are you going to read per
session
Is this on a standalone instance or do you have fully distributed setup
deployed? Do you have any kind of monitoring in place?
From the numbers you are giving, it looks like the data is of the order of a
few tens of MBs, assuming this is a single-threaded read. Did you write more data
between the
is not there.
Please help me
Thanks and Regards
Prakrati
-Original Message-
From: Amandeep Khurana [mailto:ama...@gmail.com]
Sent: Monday, June 25, 2012 11:51 AM
To: user@hbase.apache.org
Subject: Re: Enabling caching increasing the time of retrieval
Is this on a standalone instance or do you
As the thread JD pointed out suggests - the best approach if you
want to avoid aggregations later on is to aggregate in an MR job,
output to a file with ad id and the number of impressions found for
that ad. Run a separate client application, likely single threaded if
the number of ads is not
Atif,
These are general recommendations and definitely change based on the access
patterns and the way you will be using HBase and MapReduce. In general, if you
are building a latency sensitive application on top of HBase, running a
MapReduce job at the same time will impact performance due to
Tom,
Old cells will get deleted as a part of the next major compaction, which is
typically recommended to be done once a day, when the load on the system is at
its lowest.
FWIW… To have a TTL of 3600 seconds take effect, you'll have to do a major
compaction every hour, which is an expensive
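For reference, a TTL is set per column family. A small sketch — the family name is a placeholder:

import org.apache.hadoop.hbase.HColumnDescriptor;

public class TtlExample {
  public static HColumnDescriptor ttlFamily() {
    HColumnDescriptor family = new HColumnDescriptor("d");
    // TTL is in seconds; expired cells stop being returned on reads, but the
    // space is only reclaimed when a major compaction rewrites the HFiles.
    family.setTimeToLive(3600);
    return family;
  }
}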
Assuming that your data fits into the pseudo-distributed cluster and both clusters can
talk to each other, the CopyTable job that comes bundled with HBase should work.
-ak
On Tuesday, May 29, 2012 at 11:42 AM, arun sirimalla wrote:
Hi,
I want to copy a table from Distributed Hbase cluster to
Marcos,
You could do a distcp from S3 to HDFS and then do a bulk import into HBase.
Are you running HBase on EC2 or on your own hardware?
-Amandeep
On Thursday, May 24, 2012 at 11:52 AM, Marcos Ortiz wrote:
Regards to all the list.
We are using Amazon S3 to store millions of files with
:
Thanks a lot for your answer, Amandeep.
On 05/24/2012 02:55 PM, Amandeep Khurana wrote:
Marcos,
You could do a distcp from S3 to HDFS and then do a bulk import into HBase.
The number of files is very large, so we want to combine some files,
and then construct
the HFile to upload
HBase
here?
-ak
On Thursday, May 24, 2012 at 12:52 PM, Marcos Ortiz wrote:
On 05/24/2012 03:21 PM, Amandeep Khurana wrote:
Marcos
Can you elaborate on your use case a little bit? What is the nature of
data in S3 and why you want to use HBase? Why do you want to combine
HDFS is designed to not lose data if a few nodes fail. It holds multiple
replicas of each block. Having said that - it also depends on the definition of
a few. Many companies are using HDFS as their central data store and it's
proven at scale in production. It does not lose data arbitrarily,
Ahmed,
I'll second what Ian and Andrew have highlighted. HBase is very capable of
being used as a primary store as long as you run it following the best
practices. It's a useful exercise to clearly define the failure scenarios you
want to safeguard against and what kind of SLAs you have in
From what I understand, the online schema update feature (0.92.x onwards)
would allow you to do this without disabling tables. It's experimental in
0.92.
On Thu, May 10, 2012 at 9:02 AM, Jeff Whiting je...@qualtrics.com wrote:
We really need to be able to do this type of thing online. Taking
+user
(bcc: dev)
Mikael,
Such questions are better suited for the user mailing list. You'll
find more people talking about issues that they ran into and possibly
get answers to your questions faster.
Hadoop internally uses a form of the Linux 'hostname' command from
within Java. When servers
Mohit,
Adding to what Andy and Vaibhav have listed - you'll need to ensure that
the Hadoop versions running in EMR and your HBase cluster are compatible if
you want to run MapReduce from EMR onto an external HBase cluster.
If you choose to run HBase on your EMR cluster and don't want it to tear
Tim
Going directly to HFiles has the following pitfalls:
1. You'll miss out on data that's in the memstore and has not been
flushed to an HFile yet.
2. If you have deletes, you'll probably see the data from some HFiles
where the data resides since a compaction hasn't taken place to throw
it
From the limitations you mention, 1) and 2) we can live with, but 3)
could be why my quick tests are already giving incorrect record
counts. That sounds like a show stopper straight away right?
One option for us would be HBase for the primary store for random
access, and periodic (e.g. 12
Deletes and updates in HBase are like new writes. The way to update a cell
is to actually do a Put. And when you delete, it internally flags the cell
to be deleted and removes the data from the underlying file on the next
compaction. If you want to learn the technical details further, you could
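A minimal sketch of those semantics with the Java client — the table, family, and qualifier names are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UpdateDeleteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("mytable"))) {
      byte[] row = Bytes.toBytes("row1");
      byte[] fam = Bytes.toBytes("d");
      byte[] qual = Bytes.toBytes("col");

      // "Updating" a cell is just another Put; the newer version shadows the old.
      Put put = new Put(row);
      put.addColumn(fam, qual, Bytes.toBytes("new-value"));
      table.put(put);

      // Deleting writes a tombstone; the old bytes are physically removed only
      // when a later compaction rewrites the underlying HFiles.
      Delete del = new Delete(row);
      del.addColumn(fam, qual);
      table.delete(del);
    }
  }
}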
Otis,
You could co-locate RS' with TT and DN for the most part as long as you are
not really serving real time requests. Just tweak your task configs and
give HBase enough RAM. You get the benefit of data locality and that could
improve performance. But you should definitely try out your approach
/11 9:04 AM, Srikanth P. Shreenivas
srikanth_shreeni...@mindtree.com wrote:
Will major compactions take care of merging older regions or adding
more key/values to them as the number of regions grows?
Regards,
Srikanth
-Original Message-
From: Amandeep Khurana [mailto:ama...@gmail.com
Mark,
This is an interesting discussion and like Michel said - the answer to your
question depends on what you are trying to achieve. However, here are the
points that I would think about:
What are the access patterns of the various buckets of data that you want to
put in HBase? For instance,
Mark,
Yes, your understanding is correct. If your keys are sequential (timestamps
etc), you will always be writing to the end of the table and older
regions will not get any writes. This is one of the arguments against using
sequential keys.
-ak
On Sun, Nov 20, 2011 at 11:33 AM, Mark
Rohit,
It'll depend on what processing you want to do on all documents for a
given author. You could either write author -> {list of documents} to
an HDFS file and scan through that file using a MR job to do the
processing. Or you could simply output author, document as the
output of the map stage
Vivek,
Can you elaborate on 4? Storing data in HDFS directly does not give you the
option of updating it. However, that's not a good enough reason to use
HBase. Do you need random reads/writes outside of just the selective
increments? Can you store the increments in a separate file and then do a
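For context, the increments being discussed are HBase's atomic counters. A small sketch — the table and column names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("counters"))) {
      // Atomically bump a counter cell server-side; no read-modify-write race.
      long views = table.incrementColumnValue(
          Bytes.toBytes("page42"), Bytes.toBytes("d"), Bytes.toBytes("views"), 1L);
      System.out.println("views = " + views);
    }
  }
}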
Hello HBasers,
Hadoop World 2011 (Nov 8th-9th) is coming up soon and a bunch of Hadoop
and HBase users would be attending it. We are having a meetup the evening
before Hadoop World (Nov 7th) to talk about HBase and Hadoop topics that are
not going to be covered in the Hadoop World sessions.
The
Geoff,
In general it's not advisable to simply auto-restart regionservers without
digging into the root cause. We had a case where we were doing that and the
real problem got masked. Debugging it later took a lot more effort than it
would have taken had we looked at it earlier on.
-Amandeep
You can scan through one table and see if the other one has those rowids or
not.
On Thu, Mar 10, 2011 at 8:08 PM, Vishal Kapoor
vishal.kapoor...@gmail.com wrote:
Friends,
how do I best achieve an intersection of sets of row ids?
Suppose I have two tables with similar row ids,
how can I get the row
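A minimal sketch of the scan-and-probe approach suggested above — the table names are placeholders:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

public class RowIdIntersection {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Scan scan = new Scan();
    scan.setFilter(new FirstKeyOnlyFilter()); // only row keys needed from table A
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table a = conn.getTable(TableName.valueOf("table_a"));
         Table b = conn.getTable(TableName.valueOf("table_b"));
         ResultScanner scanner = a.getScanner(scan)) {
      List<byte[]> common = new ArrayList<>();
      for (Result r : scanner) {
        // exists() checks the row server-side without shipping the data back.
        if (b.exists(new Get(r.getRow()))) {
          common.add(r.getRow());
        }
      }
      System.out.println(common.size() + " rows present in both tables");
    }
  }
}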
The command I'm running on the shell:
create 'table', {NAME => 'fam', COMPRESSION => 'GZ'}
or
create 'table', {NAME => 'fam', COMPRESSION => 'LZO'}
Here's the error:
ERROR: cannot convert instance of class org.jruby.RubyString to class
org.apache.hadoop.hbase.io.hfile.Compression$Algorithm
Any idea
Seems like it. Let me try the patch.
-AK
On Dec 6, 2010, at 12:36 PM, Lars George lars.geo...@gmail.com wrote:
Hi AK,
This issue? https://issues.apache.org/jira/browse/HBASE-3310
Lars
On Mon, Dec 6, 2010 at 9:17 AM, Amandeep Khurana ama...@gmail.com wrote:
The command I'm running
Article published in this month's Usenix Login magazine:
http://www.usenix.org/publications/login/2010-08/openpdfs/maltzahn.pdf
-ak
I doubt they'll talk about their internal systems openly... Any special
reasons why you are asking this?
On Wed, Jul 14, 2010 at 11:04 AM, S Ahmed sahmed1...@gmail.com wrote:
In what context/feature is facebook using hbase?
Another link: http://bit.ly/aGJi1e
On Wed, Jul 7, 2010 at 10:48 PM, Akash Deep Shakya akasha...@gmail.comwrote:
I studied Cassandra in detail and am currently working with HBase. There are
lots
of differences, yet similarities, between Cassandra and HBase. For the HBase
data
model/arch, there are two
a) all the Puts are collected in Reduce or Map (if there is no reduce) and
a batch write is done
b) writing out each K,V pair using context.write(k, v)
If a) is considered instead of b) then wouldn't there be a violation of
semantics w.r.t KEYOUT, VALUEOUT (because K, V is not being
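For concreteness, a sketch of option b) — a mapper that emits one Put per record via context.write(); the input format and names are illustrative:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PutEmittingMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    // Assumes tab-separated "key<TAB>value" input lines.
    String[] parts = line.toString().split("\t", 2);
    if (parts.length < 2) return; // skip malformed lines
    Put put = new Put(Bytes.toBytes(parts[0]));
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
    ctx.write(new ImmutableBytesWritable(put.getRow()), put);
  }
}

With this shape, KEYOUT/VALUEOUT are declared as (ImmutableBytesWritable, Put) and that is exactly what gets written, so the declared types match the actual output.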
Yes. If the HBase master fails, it loses its handle in ZK. The backup master
will register itself as the new master and everything works thereafter.
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
2010/5/25 y_823...@tsmc.com
Thanks for your reply.
You mean we