Re: Time series scheme design

2015-07-01 Thread Amandeep Khurana
Hi Sleiman Take a look at this for some ideas: http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf -Amandeep On Wed, Jul 1, 2015 at 11:53 AM, Sleiman Jneidi jneidi.slei...@gmail.com wrote: Hello everyone, I am working on a scheme design

Re: [DISCUSS] correcting abusive behavior on mailing lists was (Re: [DISCUSS] Multi-Cluster HBase Client)

2015-07-01 Thread Amandeep Khurana
I've seen other threads like this from Michael in the past. While I ignore them when they show up, it is certainly off-putting to community members and discourages open discussion and the sharing of ideas. Some people might not understand the problems as well as others or might have completely

Re: Time series scheme design

2015-07-01 Thread Amandeep Khurana
problem, i.e.: get the latest posts from the 2k people I follow. That for me is the challenge. I would really appreciate any ideas. Thank you. On Wed, 1 Jul 2015 at 7:57 pm Amandeep Khurana ama...@gmail.com wrote: Hi Sleiman Take a look at this for some ideas: http://0b4af6cdc2f0c5998459

Re: Cluster sizing guidelines

2014-07-16 Thread Amandeep Khurana
with more details. -- Lars From: Amandeep Khurana ama...@gmail.com To: user@hbase.apache.org user@hbase.apache.org Sent: Tuesday, July 15, 2014 10:48 PM Subject: Cluster sizing guidelines Hi How do users usually go about sizing HBase clusters? What

Cluster sizing guidelines

2014-07-15 Thread Amandeep Khurana
Hi How do users usually go about sizing HBase clusters? What are the factors you take into account? What are typical hardware profiles you run with? Any data points you can share would help. Thanks Amandeep

Re: Block size of HBase files

2013-05-13 Thread Amandeep Khurana
On Sun, May 12, 2013 at 11:40 PM, Praveen Bysani praveen.ii...@gmail.comwrote: Hi, I have the dfs.block.size value set to 1 GB in my cluster configuration. Just out of curiosity - why do you have it set at 1GB? I have around 250 GB of data stored in hbase over this cluster. But when i

Re: Does Hbase read .bashrc file??

2013-05-11 Thread Amandeep Khurana
The start script is a shell script and it forks a new shell when the script is executed. That'll source the bashrc file. On May 11, 2013, at 12:39 PM, Mohammad Tariq donta...@gmail.com wrote: Hello list, Does Hbase read the environment variables set in *~/.bashrc*file everytime I

Re: Does Hbase read .bashrc file??

2013-05-11 Thread Amandeep Khurana
not able to understand this. Pardon my ignorance. Warm Regards, Tariq cloudfront.blogspot.com On Sun, May 12, 2013 at 1:14 AM, Amandeep Khurana ama...@gmail.com wrote: The start script is a shell script and it forks a new shell when the script is executed. That'll source the bashrc file

Re: HBase Support

2013-05-10 Thread Amandeep Khurana
. Thank Regards Raju. On Fri, May 10, 2013 at 11:07 AM, Amandeep Khurana ama...@gmail.com wrote: Can you tell what the logs are saying? Did you try to restart your application? On Thu, May 9, 2013 at 10:36 PM, naga raju rajumudd...@gmail.com wrote: Thanks for the Reply

Re: HBase Support

2013-05-09 Thread Amandeep Khurana
Are you running HBase on your local development box or on some server outside and trying to connect to it from your dev box? On Thu, May 9, 2013 at 6:36 AM, raju rajumudd...@gmail.com wrote: Hi all, I am New to Hbase(Hadoop), and i am working on hbase with standalone mode.My application

Re: HBase Support

2013-05-09 Thread Amandeep Khurana
What do the HBase logs say? Can you access the web UI? On Thu, May 9, 2013 at 9:57 PM, naga raju rajumudd...@gmail.com wrote: I am running hbase on same(single) system which i am using to development. On Fri, May 10, 2013 at 6:19 AM, Amandeep Khurana ama...@gmail.com wrote: Are you

Re: HBase Support

2013-05-09 Thread Amandeep Khurana
Can you tell what the logs are saying? Did you try to restart your application? On Thu, May 9, 2013 at 10:36 PM, naga raju rajumudd...@gmail.com wrote: Thanks for the Reply Amandeep, I can access the web UI. On Fri, May 10, 2013 at 10:52 AM, Amandeep Khurana ama...@gmail.com wrote

Re: EC2 Elastic MapReduce HBase install recommendations

2013-05-08 Thread Amandeep Khurana
To add to what Andy said - the key to getting HBase running well in AWS is: 1. Choose the right instance types. I usually recommend the HPC instances or now the high storage density instances. Those will give you the best performance. 2. Use the latest Amzn Linux AMIs and the latest HBase and

Re: HBase cluster over multiple EC2 Availability Zones?

2013-05-06 Thread Amandeep Khurana
I've not come across anyone spanning clusters cross AZ. You pay for cross AZ traffic and the link is slower than within a single AZ. Amandeep On Mon, May 6, 2013 at 10:37 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Do people spread HBase clusters over multiple EC2

Re: HBase cluster over multiple EC2 Availability Zones?

2013-05-06 Thread Amandeep Khurana
, Amandeep Khurana ama...@gmail.com wrote: I've not come across anyone spanning clusters cross AZ. You pay for cross AZ traffic and the link is slower than within a single AZ. Amandeep On Mon, May 6, 2013 at 10:37 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Do people

Re: HBase cluster over multiple EC2 Availability Zones?

2013-05-06 Thread Amandeep Khurana
if there are any HBase (or HDFS)-specific reasons why one should not attempt to do this? Thanks, Otis -- Search Analytics - http://sematext.com/search-analytics/index.html HBASE Performance Monitoring - http://sematext.com/spm/index.html On Mon, May 6, 2013 at 1:41 PM, Amandeep Khurana

Re: HBase and Datawarehouse

2013-04-30 Thread Amandeep Khurana
Multiple RS' per host gets you around the WAL bottleneck as well. But it's operationally less than ideal. Do you usually recommend this approach, Andy? I've shied away from it mostly. On Apr 30, 2013, at 10:38 AM, Andrew Purtell apurt...@apache.org wrote: Rules of thumb for starting off safely

Re: HBase and Semantic Integration

2013-04-27 Thread Amandeep Khurana
+user@ moving dev@ to bcc Kiran Can you elaborate on what you mean by semantic integration? Also, this question is more relevant for the user mailing list than the dev list. Amandeep On Apr 27, 2013, at 3:13 PM, Kiran kiranvk2...@gmail.com wrote: We can use HBase as a storage back end for

Re: HBase and Semantic Integration

2013-04-27 Thread Amandeep Khurana
to user:location. On Sun, Apr 28, 2013 at 3:52 AM, Amandeep Khurana [via Apache HBase] ml-node+s679495n4043154...@n3.nabble.com wrote: +user@ moving dev@ to bcc Kiran Can you elaborate on what you mean by semantic integration? Also, this question is more relevant for the user mailing

Re: Storing images in Hbase

2013-01-06 Thread Amandeep Khurana
To add to Andy's point - storing images in HBase is fine as long as the size of each image isn't huge. A couple of MBs per row in HBase does just fine. But once you start getting into 10s of MBs, there are more optimal solutions you can explore and HBase might not be the best bet. Amandeep On Jan

Re: Bulk Loading from Oracle to Hbase

2012-12-13 Thread Amandeep Khurana
Mehmet What's the problem you are getting while running the Sqoop job? Can you give details? -Amandeep On Thu, Dec 13, 2012 at 8:44 AM, Manoj Babu manoj...@gmail.com wrote: Mehmet, You can try to write a MapReduce using DBInputFormat and insert into HBase.

Re: PROD/DR - Replication

2012-12-07 Thread Amandeep Khurana
What failure condition are you trying to safeguard against? A full data center failure? That's when you would lose your entire cluster and need the DR to kick in. Otherwise, you could deploy such that an entire rack failure or even a row failure won't take you down. Just span across multiple racks

Re: scaling a low latency service with HBase

2012-10-19 Thread Amandeep Khurana
Answers inline On Fri, Oct 19, 2012 at 4:31 PM, Dave Latham lat...@davelink.net wrote: I need to scale an internal service / datastore that is currently hosted on an HBase cluster and wanted to ask for advice from anyone out there who may have some to share. The service does simple key value

Re: HBase tunning

2012-10-05 Thread Amandeep Khurana
Mohit Getting the maximum performance out of HBase isn't just about tuning the cluster. There are several other factors to take into account, the two most important being: 1. The schema design 2. How you are using the APIs Starting with the default configs is okay.

Re: md5 hash key and splits

2012-08-30 Thread Amandeep Khurana
Also, you might have read that an initial loading of data can be better distributed across the cluster if the table is pre-split rather than starting with a single region and splitting (possibly aggressively, depending on the throughput) as the data loads in. Once you are in a stable state with
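The pre-splitting advice above can be illustrated with a small sketch. This is plain Python, not HBase API code: it just computes evenly spaced boundary keys over a hex (e.g. MD5-prefixed) keyspace; those byte strings would then be supplied as split keys at table-creation time. The 8-character prefix length is an assumption for illustration.

```python
# Sketch: compute evenly spaced split points for a table whose row keys
# start with a hex-encoded hash prefix (e.g. an MD5 prefix). These
# boundaries would be passed as split keys when pre-creating regions.

def hex_split_points(num_regions, prefix_len=8):
    """Return num_regions - 1 boundary keys dividing the
    16**prefix_len hex keyspace into equal ranges."""
    space = 16 ** prefix_len
    step = space // num_regions
    return [format(i * step, f"0{prefix_len}x") for i in range(1, num_regions)]

print(hex_split_points(4))  # boundaries for 4 regions
```

Because MD5-hashed keys are uniformly distributed, equal-width hex ranges receive roughly equal write load from the start.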

Re: Timeseries data

2012-08-28 Thread Amandeep Khurana
Can you give an example of what you are trying to do and how you would use both the writes coming in at the same instant for the same cell and why do you say that the nanosecond approach is tricky? On Aug 28, 2012, at 5:54 PM, Mohit Anchlia mohitanch...@gmail.com wrote: How does it deal with

Re: multitable query

2012-08-10 Thread Amandeep Khurana
How do you want to use two tables? Can you explain your algo a bit? On Fri, Aug 10, 2012 at 6:40 PM, Weishung Chung weish...@gmail.com wrote: Hi HBase users, I need to pull data from 2 HBase tables in a mapreduce job. For 1 table input, I use TableMapReduceUtil.initTableMapperJob. Is there

Re: multitable query

2012-08-10 Thread Amandeep Khurana
want to denormalize and not need joins when working with HBase (or for that matter most NoSQL stores). On Fri, Aug 10, 2012 at 6:52 PM, Weishung Chung weish...@gmail.com wrote: Basically a join of two data sets on the same row key. On Fri, Aug 10, 2012 at 6:12 AM, Amandeep Khurana ama

Re: consistency, availability and partition pattern of HBase

2012-08-09 Thread Amandeep Khurana
Correct. You are limited to the throughput of a single region server while interacting with a particular region. This throughput limitation is typically handled by designing your keys such that your data is distributed well across the cluster. Having multiple region servers serve a single region
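One common technique for the key design mentioned above is salting: prefixing each row key with a small hash-derived bucket so that writes spread across regions. This is a minimal plain-Python sketch of the idea, not HBase API code; the bucket count and `|` separator are assumptions for illustration.

```python
import hashlib

NUM_BUCKETS = 16  # assumed; typically chosen relative to region count

def salted_key(user_id: str) -> str:
    """Prefix the key with a deterministic bucket so that keys which
    would otherwise cluster (e.g. monotonically increasing ids) are
    spread across NUM_BUCKETS distinct key ranges."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}|{user_id}"
```

The trade-off is that a range scan over the original key order now requires one scan per bucket, merged client-side.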

Re: consistency, availability and partition pattern of HBase

2012-08-09 Thread Amandeep Khurana
in active-passive mode, when at one time only one active server is active? Correct? regards, Lin On Thu, Aug 9, 2012 at 2:04 PM, Amandeep Khurana ama...@gmail.com wrote: Correct. You are limited to the throughput of a single region server while interacting with a particular region. This throughput

Re: consistency, availability and partition pattern of HBase

2012-08-08 Thread Amandeep Khurana
Firstly, I recommend you read the GFS and Bigtable papers. That'll give you a good understanding of the architecture. Ad hoc questions on the mailing list won't. I'll try to answer some of your questions briefly. Think of HBase as a database layer over an underlying filesystem (the same way MySQL

Re: consistency, availability and partition pattern of HBase

2012-08-08 Thread Amandeep Khurana
of the CAP is sacrificed? regards, Lin On Thu, Aug 9, 2012 at 1:34 PM, Amandeep Khurana ama...@gmail.com wrote: Firstly, I recommend you read the GFS and Bigtable papers. That'll give you a good understanding of the architecture. Adhoc question on the mailing list won't. I'll try to answer

Re: host:port problem

2012-07-23 Thread Amandeep Khurana
This is most likely because of a mismatch in the ZK library version between your web service and the HBase install. Can you confirm you got the same version in both places? On Monday, July 23, 2012 at 8:31 AM, Rajendra Manjunath wrote: i have hbase configured in pseudo distributed mode and

Re: Use of MD5 as row keys - is this safe?

2012-07-23 Thread Amandeep Khurana
On Mon, Jul 23, 2012 at 9:58 AM, Jonathan Bishop jbishop@gmail.comwrote: Hi, Thanks everyone for the informative discussion on this topic. I think that for project I am involved in I must remove the risk, however small, of a row key collision, and append the original id (in my case a

Re: How to merge regions in HBase?

2012-07-17 Thread Amandeep Khurana
You shouldn't have empty regions. Using timestamps as keys will give you regions that are always half filled except the last one, to which you are writing the current time range. The moment that fills up, it'll split and you'll again be writing to the last region. How did you end up with empty regions? Did you

Re: Maximum number of tables ?

2012-07-13 Thread Amandeep Khurana
I have come across clusters with 100s of tables but that typically is due to a sub optimal table design. The question here is - why do you need to distribute your data over lots of tables? What's your access pattern and what kind of data are you putting in? Or is this just a theoretical question?

Re: DataNode Hardware

2012-07-12 Thread Amandeep Khurana
Inline. On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote: Quick question about data node hardware. I've read a few articles, which cover the basics, including Cloudera's recommendations here:

Re: DataNode Hardware

2012-07-12 Thread Amandeep Khurana
better. -Amandeep On Thursday, July 12, 2012 at 1:20 PM, Bartosz M. Frak wrote: Amandeep Khurana wrote: Inline. On Thursday, July 12, 2012 at 12:56 PM, Bartosz M. Frak wrote: Quick question about data node hardware. I've read a few articles, which cover the basics, including

Re: Auto failover of HBase

2012-07-10 Thread Amandeep Khurana
Gen, HBase has HA across the entire stack. Have you read the original Google Bigtable paper to understand the architecture of the system? That is a great place to start. -Amandeep On Tuesday, July 10, 2012 at 9:40 PM, Gen Liu wrote: Hi, I'm new here. I'm doing evaluation on Hbase before

Re: HBASE -- YCSB ?

2012-07-09 Thread Amandeep Khurana
Inline. On Monday, July 9, 2012 at 12:17 PM, registrat...@circle-cross-jn.com wrote: Now that I have a stable cluster, I would like to use YCSB to test its performance; however, I am a bit confused after reading several different website postings about YCSB. 1) By default will YCSB

Re: Can manually remove HFiles (similar to bulk import, but bulk remove)?

2012-07-09 Thread Amandeep Khurana
I _think_ you should be able to do it and be just fine but you'll need to shut down the region servers before you remove and start them back up after you are done. Someone else closer to the internals can confirm/deny this. On Monday, July 9, 2012 at 12:36 PM, Alex Baranau wrote: Hello,

Re: HBASE -- RS expire?

2012-07-05 Thread Amandeep Khurana
On Thursday, July 5, 2012 at 8:25 PM, Jay Wilson wrote: Finally my HMaster has stabilized and been running for 7 hours. I believe my networking issues are behind me now. Thank you everyone for the help. Awesome. Looks like the same issue is biting you with the RS too. The RS isn't

Re: HBASE -- RS expire?

2012-07-05 Thread Amandeep Khurana
. I reduced the number of nodes to 40 and had all of them placed on the same switch with a single vlan. I even had the network techs use a completely different switch just to be safe. Is there some heatbeat timer I can tweak? --- Jay Wilson On 7/5/2012 8:34 PM, Amandeep Khurana wrote

Re: HBASE -- RS expire?

2012-07-05 Thread Amandeep Khurana
it attempt to reconnect with ZK on devrackA-03, get the reject and then attempt ZK on devrackA-04. --- Jay Wilson On 7/5/2012 9:08 PM, Amandeep Khurana wrote: The timeout can be configured using the session timeout configuration. The default for that is 180s, but that means

Re: HMASTER -- odd messages ?

2012-07-03 Thread Amandeep Khurana
On Tuesday, July 3, 2012 at 10:08 AM, Jay Wilson wrote: 2012-07-03 09:05:00,530 ERROR org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Couldn't close log at hdfs://devrackA-00:8020/var/hbase-hadoop/hbase/-ROOT-/70236052/recovered.edits/046.temp

Re: HBASE -- Session Expire ?

2012-07-03 Thread Amandeep Khurana
://pastebin.com/download.php?i=cS1Gm19x RS (devrackA-06) -- http://pastebin.com/download.php?i=XayB2HeX RS (devrackB-07) -- http://pastebin.com/download.php?i=RQZ45a8j RS (devrackB-08) -- http://pastebin.com/download.php?i=ZDZD0z7B --- Jay Wilson On 7/3/2012 5:23 PM, Amandeep Khurana wrote

Re: HBASE -- Regionserver and QuorumPeer ?

2012-07-02 Thread Amandeep Khurana
As someone who has been developing/running/using the software for a longer period of time than the person who is asking the question, you can best serve the poser by making them aware of the trade offs and why it's a good/bad idea to do things a certain way. At the end of the day, it's their

Re: Advices for HTable schema

2012-07-02 Thread Amandeep Khurana
Jean-Marc, These are great questions! Find my answers (and some questions for you) inline. -ak On Monday, July 2, 2012 at 12:04 PM, Jean-Marc Spaggiari wrote: Hi, I have a question regarding the best way to design a table. Let's imagine I want to store all the people in the world on a

Re: Advices for HTable schema

2012-07-02 Thread Amandeep Khurana
natural to store the number of columns, then parse them by name, etc. but I think I need to think about it a little bit more before taking any decision... JM 2012/7/2, Amandeep Khurana ama...@gmail.com (mailto:ama...@gmail.com): Jean-Marc, These are great questions! Find my answers

Re: Advices for HTable schema

2012-07-02 Thread Amandeep Khurana
with option 1, why is it better to have a 2nd table instead of a 2nd column familly? JM 2012/7/2, Amandeep Khurana ama...@gmail.com (mailto:ama...@gmail.com): Responses inline On Monday, July 2, 2012 at 12:53 PM, Jean-Marc Spaggiari wrote: Hi Amandeep, Thanks for your prompt

Re: HBase dies shortly after starting.

2012-06-30 Thread Amandeep Khurana
To run HBase (or for that matter any distributed system) you need your networking setup to function properly. No route to host is caused due to issues with the underlying network. I have seen TORs losing packets, causing these exceptions. There could be several other issues that could cause

Re: Best practices for custom filter class distribution?

2012-06-27 Thread Amandeep Khurana
Currently, you have to compile a jar, put them on all servers and restart the RS process. I don't believe there is an easier way to do it as of right now. And I agree, it's not entirely desirable to have to restart the cluster to install a custom filter. You can combine multiple filters into a

Re: HBase Schema Design for clickstream data

2012-06-27 Thread Amandeep Khurana
Mohit, What would be your read patterns later on? Are you going to read per session, or for a time period, or for a set of users, or process through the entire dataset every time? That would play an important role in defining your keys and columns. -Amandeep On Tue, Jun 26, 2012 at 1:34 PM,

Re: How to free space

2012-06-27 Thread Amandeep Khurana
Cyril, Did you notice the space on the hbase directory in HDFS change at all? It takes time to complete the major compactions (and it depends on the size of the tables). Deleting column families will just delete those HFiles. That should definitely free up space. -Amandeep On Wed, Jun 27, 2012

Re: Coprocessors on specific servers

2012-06-27 Thread Amandeep Khurana
Mohammad, Can you describe what you are trying to do a little more? Is this an endpoint coprocessor you are trying to build? What is the functionality it'll provide? -Amandeep On Tue, Jun 26, 2012 at 12:44 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Lars, Thank you so much for

Re: HBase Schema Design for clickstream data

2012-06-27 Thread Amandeep Khurana
and query parameters How should I go about designing schema? Thanks Sent from my iPad On Jun 27, 2012, at 2:01 PM, Amandeep Khurana ama...@gmail.com (mailto:ama...@gmail.com) wrote: Mohit, What would be your read patterns later on? Are you going to read per session

Re: Enabling caching increasing the time of retrieval

2012-06-25 Thread Amandeep Khurana
Is this on a standalone instance or do you have fully distributed setup deployed? Do you have any kind of monitoring in place? From the numbers you are giving, it looks like the data is of the order of a few 10 MBs, assuming this is a single threaded read. Did you write more data between the

Re: Enabling caching increasing the time of retrieval

2012-06-25 Thread Amandeep Khurana
is not there. Please help me Thanks and Regards Prakrati -Original Message- From: Amandeep Khurana [mailto:ama...@gmail.com] Sent: Monday, June 25, 2012 11:51 AM To: user@hbase.apache.org Subject: Re: Enabling caching increasing the time of retrieval Is this on a standalone instance or do you

Re: Increment Counters in HBase during MapReduce

2012-06-19 Thread Amandeep Khurana
As the thread JD pointed out suggests - the best approach if you want to avoid aggregations later on is to aggregate in an MR job, output to a file with ad id and the number of impressions found for that ad. Run a separate client application, likely single threaded if the number of ads is not
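The aggregate-then-load approach described above can be sketched as a reduce-side summation: sum impressions per ad id in the MR job, then have the separate client write the totals to HBase. This is a plain-Python stand-in for the reduce logic, not the actual job code.

```python
from collections import Counter

def count_impressions(events):
    """events: iterable of (ad_id, 1) pairs, as a map phase would emit.
    Returns {ad_id: total}, the per-ad counts the client would later
    load into HBase counters in a single pass."""
    totals = Counter()
    for ad_id, n in events:
        totals[ad_id] += n
    return dict(totals)
```

Doing the summation once in the job avoids issuing one increment RPC per raw impression.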

Re: Shared Cluster between HBase and MapReduce

2012-06-05 Thread Amandeep Khurana
Atif, These are general recommendations and definitely change based on the access patterns and the way you will be using HBase and MapReduce. In general, if you are building a latency sensitive application on top of HBase, running a MapReduce job at the same time will impact performance due to

Re: When does compaction actually occur?

2012-06-01 Thread Amandeep Khurana
Tom, Old cells will get deleted as a part of the next major compaction, which is typically recommended to be done once a day, when the load on the system is at its lowest. FWIW… To have a TTL of 3600 take effect, you'll have to do a major compaction every hour, which is an expensive
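The TTL behavior described above comes down to a timestamp comparison: a cell stops being returned once its age exceeds the TTL, but the bytes are only reclaimed when a major compaction rewrites the file. A minimal sketch of the expiry check (plain Python, not HBase internals):

```python
def is_expired(cell_ts_ms: int, now_ms: int, ttl_s: int) -> bool:
    """True once the cell's age exceeds the TTL. An expired cell is
    filtered from reads immediately but physically removed only when
    a major compaction rewrites its store file."""
    return (now_ms - cell_ts_ms) / 1000.0 > ttl_s
```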

Re: Data import from Distributed Hbase cluster to Pseudo Distributed Hbase cluster

2012-05-29 Thread Amandeep Khurana
Assuming that your data fits into the pseudo dist cluster and both clusters can talk to each other, the CopyTable job that comes bundled with HBase should work. -ak On Tuesday, May 29, 2012 at 11:42 AM, arun sirimalla wrote: Hi, I want to copy a table from Distributed Hbase cluster to

Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Amandeep Khurana
Marcos, You could do a distcp from S3 to HDFS and then do a bulk import into HBase. Are you running HBase on EC2 or on your own hardware? -Amandeep On Thursday, May 24, 2012 at 11:52 AM, Marcos Ortiz wrote: Regards to all the list. We are using Amazon S3 to store millions of files with

Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Amandeep Khurana
: Thanks a lot for your answer, Amandeep. On 05/24/2012 02:55 PM, Amandeep Khurana wrote: Marcos, You could do a distcp from S3 to HDFS and then do a bulk import into HBase. The quantity of files is very large, so, we want to combine some files, and then construct the HFile to upload

Re: Efficient way to read a large number of files in S3 and upload their content to HBase

2012-05-24 Thread Amandeep Khurana
HBase here? -ak On Thursday, May 24, 2012 at 12:52 PM, Marcos Ortiz wrote: On 05/24/2012 03:21 PM, Amandeep Khurana wrote: Marcos Can you elaborate on your use case a little bit? What is the nature of data in S3 and why you want to use HBase? Why do you want to combine

Re: hbase as a primary store, or is it more for 2nd class data?

2012-05-14 Thread Amandeep Khurana
HDFS is designed to not lose data if a few nodes fail. It holds multiple replicas of each block. Having said that - it also depends on the definition of a few. Many companies are using HDFS as their central data store and it's proven at scale in production. It does not lose data arbitrarily,

Re: hbase as a primary store, or is it more for 2nd class data?

2012-05-14 Thread Amandeep Khurana
Ahmed, I'll second what Ian and Andrew have highlighted. HBase is very capable of being used as a primary store as long as you run it following the best practices. It's a useful exercise to clearly define the failure scenarios you want to safeguard against and what kind of SLAs you have in

Re: Switching existing table to Snappy possible?

2012-05-10 Thread Amandeep Khurana
From what I understand, the online schema update feature (0.92.x onwards) would allow you to do this without disabling tables. It's experimental in 0.92. On Thu, May 10, 2012 at 9:02 AM, Jeff Whiting je...@qualtrics.com wrote: We really need to be able to do this type of thing online. Taking

Re: HMaster shutdown when a DNS address cannot be solved

2012-04-08 Thread Amandeep Khurana
+user (bcc: dev) Mikael, Such questions are better suited for the user mailing list. You'll find more people talking about issues that they ran into and possibly get answers to your questions faster. Hadoop internally uses a form of the Linux 'hostname' command from within Java. When servers

Re: HBase with EMR

2012-03-03 Thread Amandeep Khurana
Mohit, Adding to what Andy and Vaibhav have listed - you'll need to ensure that the Hadoop versions running in EMR and your HBase cluster are compatible if you want to run MapReduce from EMR onto an external HBase cluster. If you choose to run HBase on your EMR cluster and don't want it to tear

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Amandeep Khurana
Tim Going directly to HFiles has the following pitfalls: 1. You'll miss out on data that's in the memstore and has not been flushed to an HFile yet. 2. If you have deletes, you'll probably see the data from some HFiles where the data resides since a compaction hasn't taken place to throw it

Re: HFileInputFormat for MapReduce

2012-02-09 Thread Amandeep Khurana
From the limitations you mention, 1) and 2) we can live with, but 3) could be why my quick tests are already giving incorrect record counts. That sounds like a show stopper straight away right? One option for us would be HBase for the primary store for random access, and periodic (e.g. 12

Re: Question about HBase for OLTP

2012-01-09 Thread Amandeep Khurana
Deletes and updates in HBase are like new writes. The way to update a cell is to actually do a Put. And when you delete, it internally flags the cell to be deleted and removes the data from the underlying file on the next compaction. If you want to learn the technical details further, you could
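The delete-as-tombstone behavior described above can be modeled with a toy store: a delete writes a marker that masks older versions on read, and only a "major compaction" physically drops the masked cells. This is a deliberately simplified sketch, not HBase's actual internals.

```python
class ToyStore:
    """Toy model of versioned cells with delete tombstones."""
    def __init__(self):
        self.cells = {}       # key -> list of (ts, value) versions
        self.tombstones = {}  # key -> delete timestamp

    def put(self, key, ts, value):
        self.cells.setdefault(key, []).append((ts, value))

    def delete(self, key, ts):
        self.tombstones[key] = ts  # marker only; data stays on disk

    def get(self, key):
        dead = self.tombstones.get(key, -1)
        live = [(t, v) for t, v in self.cells.get(key, []) if t > dead]
        return max(live)[1] if live else None

    def major_compact(self):
        # Physically drop masked versions and clear the markers.
        for key, dead in self.tombstones.items():
            self.cells[key] = [(t, v) for t, v in self.cells.get(key, []) if t > dead]
        self.tombstones.clear()
```

A `put` with a newer timestamp after the delete is visible again, mirroring how updates are just new writes.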

Re: Thoughts on a hybrid HBase-Hadoop cluster

2011-12-14 Thread Amandeep Khurana
Otis, You could co-locate RS' with TT and DN for the most part as long as you are not really serving real time requests. Just tweak your task configs and give HBase enough RAM. You get the benefit of data locality and that could improve performance. But you should definitely try out your approach

Re: Region Splits

2011-11-22 Thread Amandeep Khurana
/11 9:04 AM, Srikanth P. Shreenivas srikanth_shreeni...@mindtree.com wrote: Will major compactions take care of merging older regions or adding more key/values to them as number of regions grow? Regard, Srikanth -Original Message- From: Amandeep Khurana [mailto:ama...@gmail.com

Re: Multiple tables vs big fat table

2011-11-20 Thread Amandeep Khurana
Mark, This is an interesting discussion and like Michel said - the answer to your question depends on what you are trying to achieve. However, here are the points that I would think about: What are the access patterns of the various buckets of data that you want to put in HBase? For instance,

Re: Region Splits

2011-11-20 Thread Amandeep Khurana
Mark, Yes, your understanding is correct. If your keys are sequential (timestamps etc), you will always be writing to the end of the table and older regions will not get any writes. This is one of the arguments against using sequential keys. -ak On Sun, Nov 20, 2011 at 11:33 AM, Mark
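The hotspot described above follows directly from byte-wise key ordering: a zero-padded timestamp key always sorts after every existing key, so only the last region takes writes. A two-line demonstration (plain Python; the 13-digit padding is an illustrative choice for millisecond epochs):

```python
# Existing zero-padded millisecond-timestamp row keys, in sorted order.
keys = [f"{ts:013d}" for ts in (1321700000000, 1321700000500, 1321700001000)]

# Any newer write produces a key greater than all existing ones, so it
# always lands in the table's last region.
new_key = f"{1321700002000:013d}"
assert all(new_key > k for k in keys)
```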

Re: mapreduce on two tables

2011-11-07 Thread Amandeep Khurana
Rohit, It'll depend on what processing you want to do on all documents for a given author. You could either write author - {list of documents} to an HDFS file and scan through that file using a MR job to do the processing. Or you could simply output author, document as the output of the map stage

Re: HBase, Hive, Hive over HBase or Pig over HBase

2011-10-26 Thread Amandeep Khurana
Vivek, Can you elaborate on 4? Storing data in HDFS directly does not give you the option of updating it. However, that's not a good enough reason to use HBase. Do you need random reads/writes outside of just the selective increments? Can you store the increments in a separate file and then do a

HBase NYC meetup - day before Hadoop World 2011

2011-10-01 Thread Amandeep Khurana
Hello HBasers, Hadoop World 2011 (Nov 8th and 9th) is coming up soon and a bunch of Hadoop and HBase users will be attending it. We are having a meetup the evening before Hadoop World (Nov 7th) to talk about HBase and Hadoop topics that are not going to be covered in the Hadoop World sessions. The

Re: auto-restart regionservers

2011-03-23 Thread Amandeep Khurana
Geoff, In general it's not advisable to simply auto-restart regionservers without digging into the root cause. We had a case where we were doing that and the real problem got masked. Debugging it later took a lot more effort than it would have taken had we looked at it earlier on. -Amandeep

Re: intersection of row ids

2011-03-11 Thread Amandeep Khurana
You can scan through one table and see if the other one has those rowids or not. On Thu, Mar 10, 2011 at 8:08 PM, Vishal Kapoor vishal.kapoor...@gmail.comwrote: Friends, how do I best achieve intersection of sets of row ids suppose I have two tables with similar row ids how can I get the row
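The scan-and-check approach above amounts to iterating the row ids of one table and probing the other for existence. A plain-Python stand-in (the `other_exists` callback is a hypothetical placeholder for an existence check, e.g. a Get, against the second table):

```python
def intersect_rowids(small_table_rows, other_exists):
    """Scan the (preferably smaller) table's row ids and keep those
    that also exist in the other table."""
    return [r for r in small_table_rows if other_exists(r)]

# Usage with an in-memory stand-in for the second table:
table2_ids = {"r1", "r3", "r5"}
common = intersect_rowids(["r1", "r2", "r3"], lambda r: r in table2_ids)
```

Scanning the smaller table keeps the number of existence probes to a minimum.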

Error while creating a table with compression enabled

2010-12-06 Thread Amandeep Khurana
The command I'm running on the shell: create 'table', {NAME => 'fam', COMPRESSION => 'GZ'} or create 'table', {NAME => 'fam', COMPRESSION => 'LZO'} Here's the error: ERROR: cannot convert instance of class org.jruby.RubyString to class org.apache.hadoop.hbase.io.hfile.Compression$Algorithm Any idea

Re: Error while creating a table with compression enabled

2010-12-06 Thread Amandeep Khurana
Seems like it. Let me try the patch. -AK On Dec 6, 2010, at 12:36 PM, Lars George lars.geo...@gmail.com wrote: Hi AK, This issue? https://issues.apache.org/jira/browse/HBASE-3310 Lars On Mon, Dec 6, 2010 at 9:17 AM, Amandeep Khurana ama...@gmail.com wrote: The command I'm running

Ceph as an alternative to HDFS

2010-08-06 Thread Amandeep Khurana
Article published in this months Usenix Login magazine: http://www.usenix.org/publications/login/2010-08/openpdfs/maltzahn.pdf -ak

Re: how is facebook using hbase?

2010-07-14 Thread Amandeep Khurana
I doubt they'll talk about their internal systems openly... Any special reasons why you are asking this? On Wed, Jul 14, 2010 at 11:04 AM, S Ahmed sahmed1...@gmail.com wrote: In what context/feature is facebook using hbase?

Re: major differences with Cassandra

2010-07-08 Thread Amandeep Khurana
Another link: http://bit.ly/aGJi1e On Wed, Jul 7, 2010 at 10:48 PM, Akash Deep Shakya akasha...@gmail.comwrote: I studied Cassandra in detail and currently working in hbase. There are lots of differences, yet similarities between Cassandra and HBase, for HBase data model/arch, there are two

Re: performance consideration when writing to HBase from MR job

2010-06-05 Thread Amandeep Khurana
a) all the Puts are collected in Reduce or Map (if there is no reduce) and a batch write is done b) writing out each K,V pair using context.write(k, v) If a) is considered instead of b) then wouldn't there be a violation of semantics w.r.t KEYOUT, VALUEOUT (because K, V is not being

Re: Hbase production deployment

2010-05-25 Thread Amandeep Khurana
Yes.. If HBase master fails, it loses its handle in ZK. The backup master will register itself as the new master and all works thereafter. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz 2010/5/25 y_823...@tsmc.com Thank for your reply. You mean we