Re: delete rows without writing HLog may appear in the future?

2012-11-21 Thread Michael Segel
for your deletes? On Wed, Nov 21, 2012 at 10:17 AM, Bing Jiang jiangbinglo...@gmail.com wrote: yes, HBase has made a compaction between batch-put and deletes. any ideas? On Nov 21, 2012 11:10 PM, Michael Segel michael_se...@hotmail.com wrote: Some time later? Time of course is relative, so

Re: Region hot spotting

2012-11-21 Thread Michael Segel
Salting is not a good idea and I don't know why people suggest it. Case in point: you want to fetch a single row/record back. Because the salt is arbitrary, you will need to send N get()s, one for each salt value. Doing a simple one-way hash of the data, even appending the data,
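The trade-off described in that post can be sketched in a few lines. This is a hedged illustration, not HBase client code: the bucket count, key format, and use of `md5` as the one-way hash are all assumptions for the example.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical number of salt/hash buckets

def salted_keys(row):
    # Arbitrary salt: a reader cannot know which bucket a row landed in,
    # so a point lookup must probe every possible prefix (N gets).
    return [f"{b:02d}-{row}" for b in range(NUM_BUCKETS)]

def hashed_key(row):
    # Deterministic one-way hash prefix: recomputable at read time,
    # so a point lookup needs exactly one get.
    bucket = int(hashlib.md5(row.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}-{row}"

print(len(salted_keys("user123")))  # 16 candidate keys to probe
print(hashed_key("user123"))        # one deterministic key
```

Both schemes spread sequential writes across regions; only the hash prefix preserves single-get reads.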

Re: Why only check1-and-putMany and check1-and-deleteMany?

2012-11-19 Thread Michael Segel
there is not a more comprehensive composite atomic operation available. If there is a good reason for the API to include appends, then that reason applies here. If there is no such reason, then you may ignore the appends in my question. Thanks, Mike From: Michael Segel michael_se

Re: Why only check1-and-putMany and check1-and-deleteMany?

2012-11-19 Thread Michael Segel
/HTableInterface.html#append%28org.apache.hadoop.hbase.client.Append%29 However, the point of my question is not specific to appends. I am asking why HBase does not have checkMany-and-mutateMany. Thanks, Mike From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org Date

Re: Why only check1-and-putMany and check1-and-deleteMany?

2012-11-18 Thread Michael Segel
Ok, maybe this is a silly question on my part... Could you define what you mean by check and append? With respect to HBase, how would that be different from check and put? On Nov 18, 2012, at 12:28 AM, Mike Spreitzer mspre...@us.ibm.com wrote: I am not looking at the trunk. I am just a

Re: scan is slower after bulk load

2012-11-12 Thread Michael Segel
Just a guess... have you done any compactions on the table post bulk load? On Nov 12, 2012, at 8:44 AM, Marcos Ortiz mlor...@uci.cu wrote: Regards, Amit. Did you tune the RegionServer where you have that data range hosted? Why do you say that scans are slower after a bulk load? Did you test

Re: Nosqls schema design

2012-11-08 Thread Michael Segel
Ok... First, if you're estimating that the raw data would be 10TB, you will find out that you will need a bit more to handle the data in terms of indexing and denormalized structures. The short answer to your question is yes, you can do it. Longer answer... You can bake a solution in

Re: HBase scan performance decreases over time.

2012-11-05 Thread Michael Segel
There's an HDFS bandwidth setting which is set to 10MB/s. Way too low for even 1GBe. Have you modified this setting yet? -Mike On Nov 3, 2012, at 2:50 PM, David Koch ogd...@googlemail.com wrote: Hello Ted, We never initiate major compaction manually. I have not looked at I/O balance

Re: HBase scan performance decreases over time.

2012-11-05 Thread Michael Segel
2012, at 15:05, Michael Segel michael_se...@hotmail.com wrote: There's an HDFS bandwidth setting which is set to 10MB/s. Way too low for even 1GBe. Have you modified this setting yet? -Mike On Nov 3, 2012, at 2:50 PM, David Koch ogd...@googlemail.com wrote: Hello Ted, We never

Re: Table in Inconsistent State; Perpetually pending region server transitions while loading lot of data into Hbase via MR

2012-11-01 Thread Michael Segel
Just out of curiosity... What's the impact on having regions of 10GB or larger? What does that do to your footprint in memory and the time it takes to split or compact a region? -Mike On Nov 1, 2012, at 8:35 AM, Kevin O'dell kevin.od...@cloudera.com wrote: Couple thoughts(it is still

Re: Adding column family to hbase table

2012-10-30 Thread Michael Segel
No, sorry, you have to disable the table in order to modify the table. On Oct 30, 2012, at 9:33 AM, Mike mike20...@gmail.com wrote: Hi All, I use hbase 0.92 and I am trying to add a column family to hbase table and I get the below error. ERROR:

Re: adding column family to hbase table

2012-10-30 Thread Michael Segel
When I hear experimental and production in the same conversation, I get shivers up my spine. Which release(s) contain this flag? On Oct 30, 2012, at 9:35 AM, Kevin O'dell kevin.od...@cloudera.com wrote: Mike, I have not messed around with the online schema changes too much. It is still

Re: How to config hbase0.94.2 to retain deleted data

2012-10-23 Thread Michael Segel
. -- Lars From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org; lars hofhansl lhofha...@yahoo.com Sent: Monday, October 22, 2012 9:18 PM Subject: Re: How to config hbase0.94.2 to retain deleted data Curious, why do you think this is better than using the keep-deleted-cells

Re: How to config hbase0.94.2 to retain deleted data

2012-10-23 Thread Michael Segel
nobody should use TTL/VERSIONS, which is nonsense. From: Michael Segel michael_se...@hotmail.com To: lars hofhansl lhofha...@yahoo.com Cc: user@hbase.apache.org user@hbase.apache.org Sent: Tuesday, October 23, 2012 4:41 AM Subject: Re: How to config

Re: How to config hbase0.94.2 to retain deleted data

2012-10-22 Thread Michael Segel
, Delete, Increment, Append, RowMutations, etc) Curious, why do you think this is better than using the keep-deleted-cells feature? (It might well be, just curious) -- Lars - Original Message - From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org

Re: How to config hbase0.94.2 to retain deleted data

2012-10-21 Thread Michael Segel
I would suggest that you use your coprocessor to copy the data to a 'backup' table when you mark them for delete. Then as major compaction hits, the rows are deleted from the main table, but still reside undeleted in your delete table. Call it a history table. On Oct 21, 2012, at 3:53 PM,

Re: How to config hbase0.94.2 to retain deleted data

2012-10-21 Thread Michael Segel
the keep-deleted-cells feature? (It might well be, just curious) -- Lars - Original Message - From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org Cc: Sent: Sunday, October 21, 2012 4:34 PM Subject: Re: How to config hbase0.94.2 to retain deleted data I

Re: Hbase sequential row merging in MapReduce job

2012-10-19 Thread Michael Segel
Ouch... That could get very nasty. You may end up with a lot of uneven splits. Suppose your 'metric1' spans 3 regions, 'metric2' 1 but it's still in the same split as 'metric1', and then 'metric3' is in two regions, 'metric4' is in two regions where it's split between the end of 'metric3' and

Re: crafting your key - scan vs. get

2012-10-18 Thread Michael Segel
someone could describe the performance trade-off between Scan vs. Get. Thanks again for anyone who read this far. Neil Yalowitz neilyalow...@gmail.com On Wed, Oct 17, 2012 at 10:45 AM, Michael Segel michael_se...@hotmail.com wrote: Neil, Since you asked Actually your

Re: hbase deployment using VMs for data nodes and SAN for data storage

2012-10-18 Thread Michael Segel
Lars, I think we need to clarify what we think of as a SAN. It's possible to have a SAN where the disks appear as attached storage, while the traditional view is that the disks are detached. There are some design considerations like cluster density where one would want to use a SAN like

Re: crafting your key - scan vs. get

2012-10-17 Thread Michael Segel
Neil, Since you asked... Actually your question is kind of a boring question. ;-) [Note I will probably get flamed for saying it, even if it is the truth!] Having said that... Boring as it is, it's an important topic that many still seem to trivialize in terms of its impact on performance.

Re: Coprocessor end point vs MapReduce?

2012-10-17 Thread Michael Segel
Hi, I'm a firm believer in KISS (Keep It Simple, Stupid). The Map/Reduce (map job only) is the simplest and least prone to failure. Not sure why you would want to do this using coprocessors. How often are you running this job? It sounds like it's going to be sporadic. -Mike On Oct 17,

Re: Coprocessor end point vs MapReduce?

2012-10-17 Thread Michael Segel
? Or should I go with the initial idea of doing the Put with the M/R job and the delete with HBASE-6942? Thanks, JM 2012/10/17, Michael Segel michael_se...@hotmail.com: Hi, I'm a firm believer in KISS (Keep It Simple, Stupid) The Map/Reduce (map job only) is the simplest

Re: setAutoFlush in HTableDescriptor

2012-10-13 Thread Michael Segel
Not really a good idea. JDC hit the nail on the head. You want to handle the setting on the HTable instance and not on the pool. Just saying... On Oct 10, 2012, at 3:09 AM, Jeroen Hoek jer...@lable.org wrote: If you want to disable auto-flush in 0.92.1, one approach is to override the

Re: Indexing w/ HBase

2012-10-12 Thread Michael Segel
Silly question(s). 1) What sort of indexes do you want to build? 2) Why would you want to store your indexes outside of HBase? (Ok they are not so silly. But I don't want people to think that I'm against the idea, just that it's more of an issue of design.) -Mike On Oct 12, 2012, at 7:03

Re: How well does HBase run on low/medium memory/cpu clusters?

2012-10-10 Thread Michael Segel
Well you don't want to do joins in HBase. There are a couple of ways to do this, however, I think based on what you have said... the larger issue for either solution (HBase or MySQL would be your schema design.) Basically you said you have Table A w 50 Million rows and Table B of 7 Million

Re: ways to make orders when it puts

2012-10-05 Thread Michael Segel
:06, Michael Segel michael_se...@hotmail.com wrote: I took it that the OP wants to store the rows A1-A3 in the order in which they came in. So it could be A3,A1,A2 as an example. So to do this you end up prefixing the rowkey with a timestamp or something. This is not a good idea, and I

Re: Lucene instead of HFiles?

2012-10-05 Thread Michael Segel
Actually I think you'd want to do the reverse. Store your Lucene index in HBase. Which is what we did a while back. This could be extended to SOLR, but we never had time to do it. On Oct 5, 2012, at 4:11 AM, Lars George lars.geo...@gmail.com wrote: Hi Otis, My initial reaction was,

Re: HBase tunning

2012-10-05 Thread Michael Segel
Depends. What sort of system are you tuning? Sorry, but we have to start somewhere and if we don't know what you have in terms of hardware, we don't have a good starting point. On Oct 5, 2012, at 7:47 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Do most people start out with default

Re: ways to make orders when it puts

2012-10-04 Thread Michael Segel
Silly question. Why do you care how your data is being stored? Does it matter if the data is stored in rows where A1, A2, A3 are the order of the keys, or if it's A3, A1, A2? If you say that you want to store the rows in order based on entry time, you're going to also have to deal with a little

Re: ways to make orders when it puts

2012-10-04 Thread Michael Segel
, 10 is before 9. So if your row key includes 1 ... 10, it is necessary to format the single digit by adding a leading 0. Best Wishes Dan Han On Thu, Oct 4, 2012 at 9:19 AM, Michael Segel michael_se...@hotmail.com wrote: Silly question. Why do you care how your data is being stored? Does
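The byte-wise ordering problem Dan Han describes can be demonstrated directly; a minimal sketch, using plain Python string sorting as a stand-in for HBase's lexicographic row-key comparison:

```python
# Byte-wise (lexicographic) comparison, as HBase orders row keys:
keys = ["9", "10", "1"]
print(sorted(keys))  # ['1', '10', '9'] -- "10" sorts before "9"

# Zero-padding numeric components to a fixed width restores numeric order:
padded = sorted(f"{int(k):02d}" for k in keys)
print(padded)        # ['01', '09', '10']
```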

Re: long garbage collecting pause

2012-10-02 Thread Michael Segel
You really don't want to go to 20GB. Without knowing the number of regions... going beyond 1-2 GB may cause more headaches than it's worth. Sorry, but I tend to be very cautious when it comes to tuning. -Mike On Oct 2, 2012, at 9:20 AM, Damien Hardy dha...@viadeoteam.com wrote: Hello

Re: HAcid: multi-row transactions in HBase

2012-10-01 Thread Michael Segel
Interesting. So how do you manage the transaction? On the client or on the cluster? On Oct 1, 2012, at 6:12 PM, de Souza Medeiros Andre andre.medei...@aalto.fi wrote: Hello all at this mailing list, I'm glad to finally announce an HBase addon that I have been working on. HAcid is a

Re: Regarding rowkey

2012-09-12 Thread Michael Segel
I wouldn't 'prefix' the hash to the key, but actually replace the key with a hash and store the unhashed key in a column. But that's a different discussion. In a nutshell, the problem is that there are a lot of potential use cases where you want to store data in a sequence dependent fashion.
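The replace-key-with-hash pattern from that post can be sketched as follows. This is an illustrative model only: the column names `meta:natural_key` and `d:payload` are hypothetical, and `md5` stands in for whatever hash function you pick.

```python
import hashlib

def to_row(natural_key, payload):
    # Replace the natural key with its hash to spread writes evenly,
    # and keep the unhashed key in a column so it remains recoverable.
    rowkey = hashlib.md5(natural_key.encode()).hexdigest()
    return rowkey, {"meta:natural_key": natural_key, "d:payload": payload}

rowkey, cols = to_row("2012-09-12T10:00:00/sensor42", "reading=7")
print(len(rowkey))               # 32 hex chars, uniformly distributed
print(cols["meta:natural_key"])  # original key preserved in a column
```

The cost of this design is that range scans over the natural key order are lost, which is exactly the sequence-dependent trade-off the post goes on to discuss.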

Re: Optimizing table scans

2012-09-12 Thread Michael Segel
How much memory do you have? What's the size of the underlying row? What does your network look like? 1GBe or 10GBe? There's more to it, and I think that you'll find that YMMV on what is an optimum scan size... HTH -Mike On Sep 12, 2012, at 7:57 AM, Amit Sela am...@infolinks.com wrote: Hi

Re: Regarding rowkey

2012-09-12 Thread Michael Segel
can use a fast and simple hashing algorithm, because you do not need the hash to be unique. Depends again on various aspects. - Original Message - From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org; lars hofhansl lhofha...@yahoo.com Cc: Sent: Wednesday

Re: Tracking down coprocessor pauses

2012-09-10 Thread Michael Segel
On Sep 10, 2012, at 12:32 PM, Tom Brown tombrow...@gmail.com wrote: We have our system setup such that all interaction is done through co-processors. We update the database via a co-processor (it has the appropriate logic for dealing with concurrent access to rows), and we also

Re: Doubt in performance tuning

2012-09-10 Thread Michael Segel
Well, Let's actually skip a few rounds of questions... and start from the beginning. What does your physical cluster look like? On Sep 10, 2012, at 12:40 PM, Ramasubramanian ramasubramanian.naraya...@gmail.com wrote: Hi, It will be helpful if you say specific things to look into. Pls help

Re: Reading in parallel from table's regions in MapReduce

2012-09-04 Thread Michael Segel
I think the issue is that you are misinterpreting what you are seeing and what Doug was trying to tell you... The short simple answer is that you're getting one split per region. Each split is assigned to a specific mapper task and that task will sequentially walk through the table finding the

Re: Key formats and very low cardinality leading fields

2012-09-04 Thread Michael Segel
I think you have to understand what happens as a table splits. If you have a composite key where the first field has the value between 0-9 and you pre-split your table, you will have all of your 1's going to the single region until it splits. But both splits will start on the same node until
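A small model of the hot-spot behavior described above. The split points and composite-key format are invented for illustration; `bisect` mimics how a key is routed to the region whose start key precedes it.

```python
import bisect

# Hypothetical pre-split on the leading digit of a composite key "D|rest":
# regions cover [..'1'), ['1'..'2'), ... ['9'..).
split_points = [str(d) for d in range(1, 10)]

def region_for(key, splits):
    # A row lands in the region whose start key is the greatest split <= key.
    return bisect.bisect_right(splits, key)

rows = [f"1|{i:04d}" for i in range(100)]  # every key leads with '1'
regions = {region_for(r, split_points) for r in rows}
print(regions)  # {1}: all 100 rows hit the single region starting at '1'
```

Despite the pre-split, every key with the same low-cardinality leading field funnels into one region until that region itself splits.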

Re: Key formats and very low cardinality leading fields

2012-09-04 Thread Michael Segel
a little slow until the regions split and distribute effectively. That make sense? On Tue, Sep 4, 2012 at 1:34 PM, Michael Segel michael_se...@hotmail.com wrote: I think you have to understand what happens as a table splits. If you have a composite key where the first field has

Re: Can I specify the range inside of fuzzy rule in FuzzyRowFilter?

2012-08-18 Thread Michael Segel
What row keys are you skipping? Using your example... You have a start row of 200, and an end key of xFFxFFxFFxFFxFFxFF00350. Note that you could also write that end key as xFF(1..6) 01 since it looks like you're trying to match the 00 in positions 7 and 8 of your numeric string.

Re: issues copying data from one table to another

2012-08-18 Thread Michael Segel
Can you disable the table? How much free disk space do you have? Is this a production cluster? Can you upgrade to CDH3u5? Are you running a capacity scheduler or fair scheduler? Just out of curiosity, what would happen if you could disable the table, alter the table's max file size and then

Re: Secondary indexes suggestions

2012-08-14 Thread Michael Segel
Ah... schema design... Yes you have both options identified... but just to add a twist... in the column name, prepend the (epoch - timestamp) to the message id. This will put the messages in reverse order. The only drawback to this is that it's theoretically possible to create a row which
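The reverse-order qualifier trick above works like this; a minimal sketch, where the timestamp ceiling and qualifier format are assumptions for the example:

```python
MAX_MILLIS = 10**13  # hypothetical ceiling above any real epoch-millis value

def qualifier(epoch_millis, message_id):
    # (ceiling - timestamp), zero-padded, prepended to the message id:
    # byte-wise ordering of qualifiers then yields newest-first.
    return f"{MAX_MILLIS - epoch_millis:013d}-{message_id}"

older = qualifier(1_344_900_000_000, "m1")
newer = qualifier(1_344_990_000_000, "m2")
print(newer < older)  # True: the newer message sorts first
```

Zero-padding to a fixed width matters here for the same reason as in row keys: without it, the subtracted values would not compare correctly byte-by-byte.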

Re: Secondary indexes suggestions

2012-08-14 Thread Michael Segel
I think you need to think outside of the box... I've thought about it a little more and while there's validity to indexing at the RS, there's a bit more of a headache. But I think you've been too dismissive of looking at the index at the table level and not at the region level. On Aug 14,

Re: Secondary indexes suggestions

2012-08-14 Thread Michael Segel
wrote: On Tue, Aug 14, 2012 at 7:38 PM, Michael Segel michael_se...@hotmail.com wrote: I think you need to think outside of the box... But I think you've been too dismissive of looking at the index at the table level and not at the region level. I'd be interested if you can point out exactly

Re: Bulk loading job failed when one region server went down in the cluster

2012-08-13 Thread Michael Segel
:58 AM, Michael Segel michael_se...@hotmail.com wrote: Yes, it can. You can see RS failure causing a cascading RS failure. Of course YMMV and it depends on which version you are running. OP is on CDH3u2 which still had some issues. CDH3u4 is the latest and he should upgrade. (Or go

Re: Bulk loading job failed when one region server went down in the cluster

2012-08-13 Thread Michael Segel
will be relying heavily on Fault Tolerance. If HBase Bulk Loader is fault tolerant to failure of RS in a viable environment then I don't have any issue. I hope this clears up my purpose of posting on this topic. Thanks, Anil On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel michael_se

Re: Secondary indexes suggestions

2012-08-13 Thread Michael Segel
Not really a good idea or anything new. Essentially a full table scan where you're doing a closer inspection on the key to see if it matches your search regex, before actually fetching the entire row and returning it. Secondary indexes are pretty straightforward. You have your primary key
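The straightforward secondary-index pattern the post alludes to can be modeled with two sorted maps standing in for the two HBase tables; the index rowkey format (`value|primary_key`) is an assumption for the sketch, not HBase API.

```python
data = {}    # primary table: rowkey -> row dict
index = {}   # index table: "value|rowkey" -> primary rowkey

def put(rowkey, row, indexed_col):
    data[rowkey] = row
    # Index rowkey = indexed value + primary key, keeping entries unique
    # when many rows share the same indexed value.
    index[f"{row[indexed_col]}|{rowkey}"] = rowkey

def lookup(value):
    # One prefix scan of the index table, then one get per matching key.
    prefix = f"{value}|"
    return [data[pk] for k, pk in sorted(index.items()) if k.startswith(prefix)]

put("u1", {"email": "a@x.org", "name": "Ann"}, "email")
put("u2", {"email": "b@x.org", "name": "Bob"}, "email")
print(lookup("a@x.org"))  # [{'email': 'a@x.org', 'name': 'Ann'}]
```

The lookup cost is a short prefix scan plus point gets, versus the full-table-scan-with-regex approach criticized at the start of the post.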

Re: Bulk loading job failed when one region server went down in the cluster

2012-08-13 Thread Michael Segel
, Michael Segel michael_se...@hotmail.com wrote: Anil, I don't know if you can call it a bug if you don't have enough memory available. I mean if you don't use HBase, then you may have more leeway in terms of swap. You can also do more tuning of HBase to handle the additional latency

Re: Encryption in HBase

2012-08-08 Thread Michael Segel
You don't want to do that. I mean you really don't want to do that. ;-) You would be better off doing a strong encryption at the cell level. You can use co-processors to do that if you'd like. YMMV On Aug 8, 2012, at 9:17 AM, Mohammad Tariq donta...@gmail.com wrote: Hello Stack, Would

Re: Encryption in HBase

2012-08-08 Thread Michael Segel
, Mohammad Tariq On Wed, Aug 8, 2012 at 8:01 PM, Michael Segel michael_se...@hotmail.com wrote: You don't want to do that. I mean you really don't want to do that. ;-) You would be better off doing a strong encryption at the cell level. You can use co-processors to do that if you'd

Re: CheckAndAppend Feature

2012-08-07 Thread Michael Segel
While this may be a trivial fix, have you considered possible downsides to the implementation? I'm not sure it's a bad idea, but one that could have some potential issues when put into practice. -Mike On Aug 7, 2012, at 7:30 PM, lars hofhansl lhofha...@yahoo.com wrote: I filed HBASE-6522.

Re: CheckAndAppend Feature

2012-08-07 Thread Michael Segel
. (the canonical example is that nothing stops a RegionObserver implementation from calling System.exit(), taking the RegionServer with it). -- Lars - Original Message - From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org; lars hofhansl lhofha...@yahoo.com

Re: How to query by rowKey-infix

2012-08-03 Thread Michael Segel
or substring (using special delimiters for date) Second, the extracted value must be parsed to Long and set to a RowFilter Comparator like this: - Original Message - From: Michael Segel michael_se...@hotmail.com To: user@hbase.apache.org CC: Sent: 13:52 Wednesday, 1

Re: How to query by rowKey-infix

2012-08-01 Thread Michael Segel
Actually with coprocessors you can create a secondary index in short order. Then your cost is going to be 2 fetches. Trying to do a partial table scan will be more expensive. On Jul 31, 2012, at 12:41 PM, Matt Corgan mcor...@hotpads.com wrote: When deciding between a table scan vs secondary

Re: Null row key

2012-07-31 Thread Michael Segel
Which release? On Jul 31, 2012, at 5:13 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I am seeing null row key and I am wondering how I got the nulls in there. Is it possible when using HBaseClient that a null row might have got inserted?

Re: Hbase bkup options

2012-07-23 Thread Michael Segel
Amian, Like always the answer to your question is... it depends. First, how much data are we talking about? What's the value of the underlying data? One possible scenario... You run a M/R job to copy data from the table to an HDFS file, that is then copied to attached storage on an edge

Re: Hbase bkup options

2012-07-23 Thread Michael Segel
Message- From: Michael Segel [mailto:michael_se...@hotmail.com] Sent: Monday, July 23, 2012 8:19 PM To: user@hbase.apache.org Subject: Re: Hbase bkup options Amian, Like always the answer to your question is... it depends. First, how much data are we talking about? What's

Re: Index building process design

2012-07-23 Thread Michael Segel
Ok, I'll take a stab at the shorter one. :-) You can create a base data table which contains your raw data. Depending on your index... like an inverted table, you can run a map/reduce job that builds up a second table. And a third, a fourth... depending on how many inverted indexes you want.
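The map/reduce index-building flow described above can be reduced to its essentials; a sketch with a dict standing in for the raw-data table and for each inverted-index table the job would write:

```python
from collections import defaultdict

base = {"doc1": "hbase stores sorted keys",
        "doc2": "sorted keys enable scans"}

# Map phase: emit (term, rowkey) for every term in the raw-data table.
pairs = [(term, rk) for rk, text in base.items() for term in text.split()]

# Reduce phase: each term becomes one row of the inverted-index table,
# with the matching base-table rowkeys as its columns.
inverted = defaultdict(set)
for term, rk in pairs:
    inverted[term].add(rk)

print(sorted(inverted["sorted"]))  # ['doc1', 'doc2']
```

Each additional index (a third, a fourth...) is just another map/reduce pass over the same base table emitting a different key.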

Re: How to merge regions in HBase?

2012-07-17 Thread Michael Segel
Find a different row key? The problem with merging regions is that once you merge the regions, any net new regions will still have the same problem. So you'll have to merge again, and again and again. You're always filling to the left of the last key. In order to merge, you have to take the

Re: Embedded table data model

2012-07-13 Thread Michael Segel
First, A caveat... Schema design in HBase is one of the hardest things to teach/learn because it's so open. There is more than one correct answer when it comes to creating a good design... Ian's presentation kind of tries to relate HBase schema design to relational modeling. From past

Re: Maximum number of tables ?

2012-07-13 Thread Michael Segel
Currently there is a hardcoded limit on the number of regions that a region server can manage. It's 1500. Note that if the number of regions gets to around 1000 regions per region server, you end up with a performance hit. (YMMV) So if you have 1 region per table, there's a real limit of 1500

Re: Maximum number of tables ?

2012-07-13 Thread Michael Segel
I'm going from memory. There was a hardcoded number. I'd have to go back and try to find it. From a practical standpoint, going over 1000 regions per RS will put you on thin ice. Too many regions can kill your system. On Jul 13, 2012, at 12:36 PM, Kevin O'dell wrote: Mike, I just saw

Re: DataNode Hardware

2012-07-12 Thread Michael Segel
Uhm... I'd take a step back... Thanks for the reply. I didn't realize that all the non-MR tasks were this CPU bound; plus my naive assumption was that four spindles will have a hard time supplying data to MR fast enough for it to become bogged down. Your gut feel is correct. If you go w

Re: Mixing Puts and Deletes in a single RPC

2012-07-10 Thread Michael Segel
Regardless, it's still a bad design. On Jul 9, 2012, at 10:02 PM, Jonathan Hsieh wrote: Keith, The HBASE-3584 feature is a 0.94 and we are strongly considering an 0.94 version for for a future CDH4 update. There is very little chance this will get into a CDH3 release. Jon. On Thu,

Re: Improvement: Provide better feedback on Put to unknown CF

2012-07-10 Thread Michael Segel
help the defect tracking a lot when someone will face the issue and see the stack trace. JM 2012/7/9, Michael Segel michael_se...@hotmail.com: Jean-Marc, I think you misunderstood. At run time, you can query HBase to find out the table schema and its column families. While I agree

Re: Improvement: Provide better feedback on Put to unknown CF

2012-07-09 Thread Michael Segel
This may beg the question ... Why do you not know the CF? Your table schemas only consist of tables and CFs. So you should know them at the start of your job or m/r Mapper.setup(); On Jul 9, 2012, at 7:25 AM, Jean-Marc Spaggiari wrote: Hi, When we try to add a value to a CF which does

Re: Improvement: Provide better feedback on Put to unknown CF

2012-07-09 Thread Michael Segel
really. 2012/7/9, Michael Segel michael_se...@hotmail.com: This may beg the question ... Why do you not know the CF? Your table schemas only consist of tables and CFs. So you should know them at the start of your job or m/r Mapper.setup(); On Jul 9, 2012, at 7:25 AM, Jean-Marc Spaggiari

Re: Mixing Puts and Deletes in a single RPC

2012-07-06 Thread Michael Segel
I was going to post this yesterday, but real work got in the way... I have to ask... why are you deleting anything from your columns? The reason I ask is that you're sync'ing an object from an RDBMS to HBase. While HBase allows fields that contain NULL not to exist, your RDBMS doesn't. Your

Re: Mixing Puts and Deletes in a single RPC

2012-07-06 Thread Michael Segel
, Michael Segel michael_se...@hotmail.com wrote: I was going to post this yesterday, but real work got in the way... I have to ask... why are you deleting anything from your columns? The reason I ask is that you're sync'ing an object from an RDBMS to HBase. While HBase allows fields that contain

Re: Presplit regions when creating a table

2012-07-05 Thread Michael Segel
No, you need to know your key ranges for each split. If you don't and you guess wrong, you may end up not seeing any benefits because your data may still end up going to a single region... (It's data dependent.) I am personally not a fan of pre-splitting a table. The way I look at it, you

Re: HBase table disk usage

2012-07-03 Thread Michael Segel
Timestamps on the cells themselves? # Versions? On Jul 3, 2012, at 4:54 AM, Sever Fundatureanu wrote: Hello, I have a simple table with 1.5 billion rows and one column family 'F'. Each row key is 33 bytes and the cell values are void. By doing the math I would expect this table to take
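The points raised in the reply (per-cell timestamps and versions) are exactly where the naive math breaks down. A back-of-envelope sketch of the per-cell storage for such a table, using the approximate KeyValue field sizes and ignoring block indexes, extra versions, and compression (all of which change the real number):

```python
# Approximate per-cell size for a KeyValue with a 33-byte row key,
# a 1-byte family name, empty qualifier and empty value.
key_len, value_len = 4, 4   # length fields
row_len, row = 2, 33        # row-length field + row key bytes
family = 1 + 1              # family-length field + 1-byte family name
qualifier = 0               # empty qualifier
timestamp, key_type = 8, 1  # timestamp + type byte
value = 0                   # void cell value

per_cell = (key_len + value_len + row_len + row + family +
            qualifier + timestamp + key_type + value)
rows = 1_500_000_000
print(per_cell)                 # 54 bytes per cell in this model
print(per_cell * rows / 10**9)  # ~81 GB raw for 1.5 billion single-cell rows
```

So even with empty values, the per-cell framing (not the 33-byte keys alone) dominates the footprint.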

Re: Advices for HTable schema

2012-07-03 Thread Michael Segel
Hi, You're over thinking this. Take a step back and remember that you can store anything you want as a byte stream in a column. Literally. So you have a record that could be a text blob. Store it in one column. Use JSON to define its structure and fields. The only thing that makes it

Re: Connection error while usinh HBase client API for pseudodistrbuted mode

2012-07-03 Thread Michael Segel
What's the status of Hadoop and IPV6 vs IPV4? On Jul 3, 2012, at 7:07 AM, AnandaVelMurugan Chandra Mohan wrote: Hi, These are text from the files /etc/hosts 127.0.0.1 localhost # The following lines are desirable for IPv6 capable hosts ::1 localhost ip6-localhost

Re: Advices for HTable schema

2012-07-03 Thread Michael Segel
! JM 2012/7/3, Michael Segel michael_se...@hotmail.com: Hi, You're over thinking this. Take a step back and remember that you can store anything you want as a byte stream in a column. Literally. So you have a record that could be a text blob. Store it in one column. Use JSON to define

Re: HBASE -- Regionserver and QuorumPeer ?

2012-07-02 Thread Michael Segel
Lloyd Bridges talks about how today was a bad day for giving up insert your favorite drug ... Sorry to side track but I thought I'd give a more detailed explanation ... On Jul 2, 2012, at 2:51 AM, Stack wrote: On Mon, Jul 2, 2012 at 7:11 AM, Michael Segel michael_se...@hotmail.com wrote

Re: HBASE -- Regionserver and QuorumPeer ?

2012-07-02 Thread Michael Segel
of explanation is this??? Regards, Mohammad Tariq On Mon, Jul 2, 2012 at 5:10 PM, Michael Segel michael_se...@hotmail.com wrote: Sorry St. Ack, Which is why I said that I was losing it... The entire quote was... On Sun, Jul 1, 2012 at 2:05 PM, Jay Wilson registrat

Re: HBASE -- Regionserver and QuorumPeer ?

2012-07-01 Thread Michael Segel
I'm sorry, I'm losing it. Running RS on a machine where DN isn't running? So then the RS can't store its regions locally. Not sure if that would ever be a good idea or recommended. Though the initial question is running ZK on the same node as a RS, which isn't a good idea and a recipe for

Re: HBase master not starting up

2012-06-28 Thread Michael Segel
Sounds like your .Meta. table is corrupted. Thought that was fixed in 90.4... On Jun 28, 2012, at 1:26 PM, Kasturi wrote: Hi, I have HBase master running on 3 nodes and region server on 4 other nodes on a Mapr hadoop cluster. We have been using it for a while, and it was working fine.

Re: [ hbase ] performance of Get from MR Job

2012-06-27 Thread Michael Segel
. Schema design is a bit tricky to master because it's going to be data dependent along with your use case. On Jun 25, 2012, at 2:32 AM, Marcin Cylke wrote: On 21/06/12 14:33, Michael Segel wrote: I think the version issue is the killer factor here. Usually performing a simple get() where you

Re: Best practices for custom filter class distribution?

2012-06-27 Thread Michael Segel
One way... Create an NFS mountable directory for your cluster and mount on all of the DNs. You can either place a symbolic link in /usr/lib/hadoop/lib or add the jar to the classpath in /etc/hadoop/conf/hadoop-env.sh (Assuming Cloudera) On Jun 27, 2012, at 12:47 PM, Evan Pollan wrote:

Re: Timestamp as a key good practice?

2012-06-26 Thread Michael Segel
Network is always good to check, it's all fun and games until an interface negotiates 100Mb. 50ms per get sounds a bit extreme. mini-rant Funny you should mention hardware. I did submit a talk on cluster design to Strata (NY and London) Seems it didn't make the cut on NY, but who knows

Re: Increment Counters in HBase during MapReduce

2012-06-24 Thread Michael Segel
There are a couple of issues and I'm sure others will point them out. If you turn off speculative execution on the job, you don't get duplicate tasks running in parallel. You could create a table to store your aggregations on a per job basis where your row-id could incorporate your job-id.

Re: performance of Get from MR Job

2012-06-21 Thread Michael Segel
I think the version issue is the killer factor here. Usually performing a simple get() where you are getting the latest version of the data on the row/cell occurs in some constant time k. This is constant regardless of the size of the cluster and should scale in a near linear curve. As JD C

Re: Data locality in HBase

2012-06-21 Thread Michael Segel
While data locality is nice, you may see it becoming less of a bonus or issue. With Co-processors available, indexing becomes viable. So you may see things where within the M/R you process a row from table A, maybe hit an index to find a value in table B and then do some processing.

Re: Timestamp as a key good practice?

2012-06-21 Thread Michael Segel
16, 2012 at 6:33 PM, Michael Segel michael_se...@hotmail.com wrote: Jean-Marc, You indicated that you didn't want to do full table scans when you want to find out which files hadn't been touched since X time has passed. (X could be months, weeks, days, hours, etc ...) So here's the thing

Re: When node is down

2012-06-21 Thread Michael Segel
Assuming that you have an Apache release (Apache, HW, Cloudera) ... (If MapR, replace the drive and you should be able to repair the cluster from the console. Node doesn't go down. ) Node goes down. 10 min later, cluster sees node down. Should then be able to replicate the missing blocks.

Re: delete rows from hbase

2012-06-20 Thread Michael Segel
Hi, The simple way to do this as a map/reduce is the following: Use the HTable Input and scan the records you want to delete. Inside Mapper.setup() create a connection to the HTable where you want to delete the records. Inside Mapper.map() for each iteration you will get a row which
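The steps above can be sketched structurally. This is a hedged model, not a runnable Hadoop job: a dict stands in for the HTable, and the setup/map/cleanup methods mirror the mapper lifecycle described in the post.

```python
# In-memory stand-in for the table; a real job would scan via TableInputFormat.
table = {f"row{i:03d}": {"d:flag": "old" if i < 5 else "new"}
         for i in range(10)}

class DeleteMapper:
    def setup(self, target):
        # Mapper.setup(): open a connection to the table we delete from.
        self.target = target
        self.pending = []

    def map(self, rowkey, row):
        # Mapper.map(): each scanned row that matches is queued for deletion.
        if row["d:flag"] == "old":
            self.pending.append(rowkey)

    def cleanup(self):
        # Flush the batched deletes at the end of the task.
        for rk in self.pending:
            del self.target[rk]

m = DeleteMapper()
m.setup(table)
for rk, row in list(table.items()):
    m.map(rk, row)
m.cleanup()
print(len(table))  # 5 rows remain
```

Batching the deletes rather than issuing one per map() call is the usual choice, since it cuts round trips to the region servers.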

Re: delete rows from hbase

2012-06-20 Thread Michael Segel
with job!); } } catch (Exception e) { LOG.error(e.getMessage(), e); } } } On Wed, Jun 20, 2012 at 7:41 AM, Michael Segel michael_se...@hotmail.com wrote: Hi, The simple way to do this as a map/reduce is the following Use the HTable

Re: Increment Counters in HBase during MapReduce

2012-06-19 Thread Michael Segel
Sure, why not? You can always open a connection to the counter table in your Mapper.setup() method and then increment the counters within the Mapper.map() method. Your update of the counter is an artifact and not the output of the Mapper.map() method. On Jun 18, 2012, at 7:49 PM, Sid Kumar
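The pattern in that reply (counter updated as a side effect inside map(), connection opened once in setup()) can be sketched as follows; the counter-table layout and job id are hypothetical.

```python
from collections import defaultdict

counters = defaultdict(int)  # stand-in for the counter table

class CountingMapper:
    def setup(self, job_id):
        # Open the counter-table connection once per task; the row-id
        # incorporates the job id so each run keeps its own counters.
        self.job_id = job_id

    def map(self, record):
        # The increment is a side effect, not the mapper's emitted output.
        counters[f"{self.job_id}:rows_seen"] += 1
        return record.upper()  # hypothetical real output of the mapper

m = CountingMapper()
m.setup("job_2012_0624")
out = [m.map(r) for r in ["a", "b", "c"]]
print(counters["job_2012_0624:rows_seen"])  # 3
```

In real HBase the increment would be HTable.incrementColumnValue(), which is atomic on the server side, so concurrent mappers don't clobber each other.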

Re: Isolation level

2012-06-18 Thread Michael Segel
Since you don't have OLTP, the terms need to be better defined. What is meant by an uncommitted write in HBase? RLL in an RDBMS is different than RLL in HBase. You don't have the concept of a transaction in HBase. -Mike On Jun 17, 2012, at 10:32 PM, Anoop Sam John wrote: Hi You

Re: Timestamp as a key good practice?

2012-06-16 Thread Michael Segel
if more efficient? JM 2012/6/15, Michael Segel michael_se...@hotmail.com: Thought about this a little bit more... You will want two tables for a solution. Table 1 is Key: Unique ID, Column: FilePath, Value: Full Path to file, Column: Last
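The two-table design being discussed can be modeled briefly. A sketch under stated assumptions: dicts stand in for the two tables, the second table's rowkey is a zero-padded last-update timestamp plus the file ID, and the "not touched since X" query becomes a range scan instead of a full scan.

```python
files = {}      # table 1: unique ID -> {FilePath, LastUpdate}
by_update = {}  # table 2: zero-padded LastUpdate + ID -> ID

def touch(file_id, path, epoch):
    old = files.get(file_id)
    if old:
        # Drop the stale index row before writing the new one.
        del by_update[f"{old['LastUpdate']:013d}-{file_id}"]
    files[file_id] = {"FilePath": path, "LastUpdate": epoch}
    by_update[f"{epoch:013d}-{file_id}"] = file_id

def not_touched_since(cutoff):
    # Range scan over table 2 up to the cutoff key.
    return [fid for k, fid in sorted(by_update.items())
            if k < f"{cutoff:013d}"]

touch("f1", "/data/a.txt", 1_000)
touch("f2", "/data/b.txt", 5_000)
touch("f1", "/data/a.txt", 9_000)  # f1 updated again
print(not_touched_since(6_000))    # ['f2']
```

The maintenance burden is the delete-then-put on table 2 each time a file is touched, which is the coprocessor work the thread goes on to discuss.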

Re: Timestamp as a key good practice?

2012-06-15 Thread Michael Segel
to achieve this goal (using co-processors). I don't know yet how this part is working, so I will dig into the documentation for it. Thanks, JM 2012/6/14, Michael Segel michael_se...@hotmail.com: Jean-Marc, You do realize that this really isn't a good use case for HBase, assuming that what

Re: Timestamp as a key good practice?

2012-06-14 Thread Michael Segel
Actually I think you should revisit your key design... Look at your access path to the data for each of the types of queries you are going to run. From your post: I have a table with a unique key, a file path and a last update field. I can easily find back the file with the ID and find when it

Re: Timestamp as a key good practice?

2012-06-14 Thread Michael Segel
will be, the more up to date I will be able to keep it. JM 2012/6/14, Michael Segel michael_se...@hotmail.com: Actually I think you should revisit your key design Look at your access path to the data for each of the types of queries you are going to run. From your post: I have a table with a uniq

Re: Timestamp as a key good practice?

2012-06-14 Thread Michael Segel
the documentation for it. Thanks, JM 2012/6/14, Michael Segel michael_se...@hotmail.com: Jean-Marc, You do realize that this really isn't a good use case for HBase, assuming that what you are describing is a stand alone system. It would be easier and better if you just used a simple

Re: Pre-split table using shell

2012-06-12 Thread Michael Segel
UUIDs are unique but not necessarily random and even in random samplings, you may not see an even distribution except over time. Sent from my iPhone On Jun 12, 2012, at 3:18 AM, Simon Kelly simongdke...@gmail.com wrote: Hi I'm getting some unexpected results with a pre-split table where

Re: Pre-split table using shell

2012-06-12 Thread Michael Segel
been able to find any docs on what format the split keys should be in so I've used what's produced by Bytes.toStringBinary. Is that correct? Simon On 12 June 2012 10:23, Michael Segel michael_se...@hotmail.com wrote: UUIDs are unique but not necessarily random and even in random

Re: Pre-split table using shell

2012-06-12 Thread Michael Segel
servers so I need to try and get as much as I can from the get go. Simon On 12 June 2012 13:37, Michael Segel michael_se...@hotmail.com wrote: Ok, Now that I'm awake, and am drinking my first cup of joe... If you just generate UUIDs you are not going to have an even distribution. Nor
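The small-sample unevenness mentioned in this thread is easy to observe; a sketch that buckets random type-4 UUIDs by their first hex digit, as if the table were pre-split into 16 regions (split points '1'..'f' are an assumption for the example):

```python
import uuid
from collections import Counter

# Bucket a modest sample of type-4 UUID row keys by their first hex digit.
counts = Counter(str(uuid.uuid4())[0] for _ in range(1600))

print(sum(counts.values()))  # 1600 keys in total
# Spread between the busiest and quietest bucket; with only ~100 keys
# per bucket it is rarely zero -- evenness emerges over time, as noted.
print(max(counts.values()) - min(counts.values()))
```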
