Deploy filter on a per-table basis

2014-09-09 Thread Jianshi Huang
Hi,

According to the HBase Definitive Guide, I need to change hbase-env.sh to
put my jars on HBase's classpath, and then restart the HBase daemons to
make my custom filters effective.

The Coprocessor loading section also mentions that coprocessors can be set
up and loaded on a per-table basis.

Is the same possible for filters? The main problem is that I don't have
HBase admin permissions to make the change.


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/


Re: Deploy filter on a per-table basis

2014-09-09 Thread Ted Yu
Please take a look at HBASE-1936

Cheers

On Mon, Sep 8, 2014 at 11:26 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi,

 According to the HBase Definitive Guide, I need to change hbase-env.sh to
 put my jars on HBase's classpath, and then restart the HBase daemons to
 make my custom filters effective.

 The Coprocessor loading section also mentions that coprocessors can be set
 up and loaded on a per-table basis.

 Is the same possible for filters? The main problem is that I don't have
 HBase admin permissions to make the change.


 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/
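
For reference, HBASE-1936 (mentioned above) added a dynamic class loader that lets
region servers pick up jars containing custom filters from a shared directory
without a restart. A minimal hbase-site.xml sketch, assuming the
hbase.dynamic.jars.dir property associated with that change (its default is
${hbase.rootdir}/lib; exact behavior varies by release, so check hbase-default.xml
for yours):

<!-- Directory scanned by the dynamic class loader; the path below is a
     hypothetical example. Jars dropped here can be picked up without
     restarting the daemons or editing hbase-env.sh. -->
<property>
  <name>hbase.dynamic.jars.dir</name>
  <value>hdfs:///hbase/lib</value>
</property>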



Updating an HBase KeyValue using bulk upload

2014-09-09 Thread Prakhar Srivastava
Hi,

I have a MapReduce job which creates a StoreFile that I can load using
LoadIncrementalHFiles in HBase. I am also using the timestamp component of
the KeyValue in my mapper to maintain versions in a custom manner. But when
I try to overwrite the same version using the bulk import, it does not
work. When I try to perform a Get, it returns the old version.

Also, if I try to update a KeyValue by overwriting the timestamp in the
hbase shell, I can see that the value is getting updated.

e.g. put 't1', 'r1', 'c1', 'value', ts1

Can someone help me understand why the updates are not reflected when using
bulk import?
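
For reference, a minimal sketch of the kind of mapper described above: it emits
KeyValues with an application-controlled timestamp for HFileOutputFormat and
LoadIncrementalHFiles. This is not the poster's actual job; the input format and
the family ("d") and qualifier ("v") names are made up for illustration.

import java.io.IOException;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class VersionedKvMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {

  private static final byte[] CF = Bytes.toBytes("d");  // hypothetical family
  private static final byte[] Q  = Bytes.toBytes("v");  // hypothetical qualifier

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    // Assume input lines of the form: rowkey,customVersion,value
    String[] parts = line.toString().split(",", 3);
    byte[] row = Bytes.toBytes(parts[0]);
    long customTs = Long.parseLong(parts[1]);  // timestamp used as the version
    KeyValue kv = new KeyValue(row, CF, Q, customTs, Bytes.toBytes(parts[2]));
    ctx.write(new ImmutableBytesWritable(row), kv);
  }
}

The job would then be wired up as usual with
HFileOutputFormat.configureIncrementalLoad(...) before handing the output
directory to LoadIncrementalHFiles.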


Re: Deploy filter on a per-table basis

2014-09-09 Thread Jianshi Huang
Thanks Ted!

Jianshi

On Tue, Sep 9, 2014 at 10:39 PM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at HBASE-1936

 Cheers

 On Mon, Sep 8, 2014 at 11:26 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

  Hi,
 
  According to the HBase Definitive Guide, I need to change hbase-env.sh to
  put my jars on HBase's classpath, and then restart the HBase daemons to
  make my custom filters effective.
 
  The Coprocessor loading section also mentions that coprocessors can be set
  up and loaded on a per-table basis.
 
  Is the same possible for filters? The main problem is that I don't have
  HBase admin permissions to make the change.
 
 
  --
  Jianshi Huang
 
  LinkedIn: jianshi
  Twitter: @jshuang
  Github  Blog: http://huangjs.github.com/
 




-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/


Re: Deploy filter on a per-table basis

2014-09-09 Thread Ted Yu
Kudos go to Jimmy, not me.

Cheers

On Tue, Sep 9, 2014 at 8:17 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Thanks Ted!

 Jianshi

 On Tue, Sep 9, 2014 at 10:39 PM, Ted Yu yuzhih...@gmail.com wrote:

  Please take a look at HBASE-1936
 
  Cheers
 
  On Mon, Sep 8, 2014 at 11:26 PM, Jianshi Huang jianshi.hu...@gmail.com
  wrote:
 
   Hi,
  
   According to the HBase Definitive Guide, I need to change hbase-env.sh to
   put my jars on HBase's classpath, and then restart the HBase daemons to
   make my custom filters effective.
  
   The Coprocessor loading section also mentions that coprocessors can be set
   up and loaded on a per-table basis.
  
   Is the same possible for filters? The main problem is that I don't have
   HBase admin permissions to make the change.
  
  
   --
   Jianshi Huang
  
   LinkedIn: jianshi
   Twitter: @jshuang
   Github  Blog: http://huangjs.github.com/
  
 



 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



HBase custom filter protocol buffers

2014-09-09 Thread Kevin
Hi,

I'm making the switch from 0.92.1 to 0.98.1, and I'm in the process of
updating all my custom filters to conform to the new HBase Filter API. I
have quite a few custom filters, so my question is: must I create a custom
protocol buffer for each of my filters, or can I reuse the custom logic that
I had in writeFields() and readFields() in toByteArray() and
parseFrom(byte[]), respectively?

I did post this same question on Cloudera's CDH User Google group, but I
figured it was better suited to be asked on the official HBase mailing
list. (Sorry for posting in multiple locations.)

Thanks,
Kevin


Re: HBase custom filter protocol buffers

2014-09-09 Thread Ted Yu
For each of your filters that carries custom information (limit, range,
etc.), you need to create a corresponding protobuf entity.

See hbase-protocol/src/main/protobuf/Filter.proto for examples.

Cheers

On Tue, Sep 9, 2014 at 12:55 PM, Kevin kevin.macksa...@gmail.com wrote:

 Hi,

 I'm making the switch from 0.92.1 to 0.98.1, and I'm in the process of
 updating all my custom filters to conform to the new HBase Filter API. I
 have quite a few custom filters, so my question is: must I create a custom
 protocol buffer for each of my filters, or can I reuse the custom logic that
 I had in writeFields() and readFields() in toByteArray() and
 parseFrom(byte[]), respectively?

 I did post this same question on Cloudera's CDH User Google group, but I
 figured it was better suited to be asked on the official HBase mailing
 list. (Sorry for posting in multiple locations.)

 Thanks,
 Kevin
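
For reference, a minimal sketch of the serialization pair Ted describes, assuming
a protobuf message generated from your own .proto file (MyFilterProtos.MyFilter is
a hypothetical name); the logic that used to live in writeFields()/readFields()
moves into toByteArray()/parseFrom():

import com.google.protobuf.ByteString;
import com.google.protobuf.InvalidProtocolBufferException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.exceptions.DeserializationException;
import org.apache.hadoop.hbase.filter.FilterBase;

public class MyPrefixFilter extends FilterBase {
  private final byte[] prefix;

  public MyPrefixFilter(byte[] prefix) {
    this.prefix = prefix;
  }

  // The actual filtering logic would go here; kept trivial for the sketch.
  @Override
  public ReturnCode filterKeyValue(Cell cell) {
    return ReturnCode.INCLUDE;
  }

  // 0.98-style replacement for writeFields(): serialize state via protobuf.
  @Override
  public byte[] toByteArray() {
    return MyFilterProtos.MyFilter.newBuilder()
        .setPrefix(ByteString.copyFrom(prefix))
        .build().toByteArray();
  }

  // 0.98-style replacement for readFields(): static factory used when the
  // filter is deserialized on the server side.
  public static MyPrefixFilter parseFrom(byte[] pbBytes) throws DeserializationException {
    try {
      MyFilterProtos.MyFilter proto = MyFilterProtos.MyFilter.parseFrom(pbBytes);
      return new MyPrefixFilter(proto.getPrefix().toByteArray());
    } catch (InvalidProtocolBufferException e) {
      throw new DeserializationException(e);
    }
  }
}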



Re: Nested data structures examples for HBase

2014-09-09 Thread Michael Segel
You do realize that everything you store in HBase is a byte array, right? That
is, each cell is a blob.

So you have the ability to create nested structures like… JSON records? ;-) 

So, to your point: you can have a column A which represents a set of values.

This is one reason why you shouldn’t think of HBase in terms of being
relational. In fact, for Hadoop, you really don’t want to think in terms of
relational structures.
Think more in terms of hierarchical structures.

So yes, you can do what you want to do… 

HTH

-Mike

On Sep 8, 2014, at 10:06 PM, Stephen Boesch java...@gmail.com wrote:

 While I am aware that HBase does not have native support for nested
 structures, surely there are some of you that have thought through this use
 case carefully.
 
 Our particular use case is likely having single digit nested layers with
 tens to hundreds of items in the lists at each level.
 
 An example would be a
 
 top Level  300 items
 middle level :  1 to 100 items  (1 value  may indicate a single value as
 opposed to a list)
 third level:  1 to 50 items
 fourth level  1 to 20 items
 
 The column names are likely known ahead of time, which may or may not
 matter for HBase. We could model the above structure in a Parquet file or
 in Hive (with nested structs), but we would like to consider whether
 HBase might also be an option.
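
For reference, a minimal sketch of the point Michael makes above: a nested
structure serialized (here as a JSON string) into a single cell. The table,
family, and qualifier names are hypothetical, and the 0.94/0.98-era client API
is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NestedBlobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "orders");  // hypothetical table name
    // A whole nested document stored as the value of one cell.
    String json = "{\"order\":{\"id\":42,\"lines\":[{\"sku\":\"a\",\"qty\":2}]}}";
    Put put = new Put(Bytes.toBytes("row-42"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("doc"), Bytes.toBytes(json));
    table.put(put);
    table.close();
  }
}

The trade-off, raised later in the thread, is that querying an individual element
means reading back and parsing the whole document for every row.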



Re: HBase - Performance issue

2014-09-09 Thread Michael Segel

So you have large RS and you have large regions. Your regions are huge relative 
to your RS memory heap. 
(Not ideal.) 

You have slow drives (5400 rpm) and a 1GbE network.
You didn’t say how many drives per server.

Under load, you will saturate your network with just 4 drives. (Give or take;
I've never tried 5400 RPM drives.)
So you hit one bandwidth bottleneck there. 
The other is the ratio of spindles to CPU.  So if you have 4 drives and 8 
cores… again under load, you’ll start to see 
an I/O bottleneck … 
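
As a rough back-of-the-envelope check (assuming ~100 MB/s of sequential
throughput per SATA drive): 4 drives x ~100 MB/s gives roughly 400 MB/s of disk
bandwidth, while 1GbE tops out around 125 MB/s, so the network saturates well
before four drives do.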

On average, how many regions do you have per table per server? 

I’d consider shrinking your regions.

Sometimes you need to dial back from 11 to a more reasonable listening level…
;-)

HTH

-Mike



On Sep 8, 2014, at 8:23 AM, kiran kiran.sarvabho...@gmail.com wrote:

 Hi Lars,
 
 Ours is a problem of I/O wait and network bandwidth increasing around the
 same time.
 
 Lars,
 
 Sorry to say this... ours is a production cluster and we ideally never
 want downtime... Also Lars, we had a very miserable experience while
 upgrading from 0.92 to 0.94... There was never a mention of the change in
 split policy in the release notes... and the policy was not ideal for our
 cluster, and it took us at least a week to figure that out.
 
 Our cluster runs on commodity hardware with big regions (5-10 GB)... Region
 server memory is 10 GB...
 2 TB SATA hard disks (5400-7200 rpm)... Internal network bandwidth is 1 Gb.
 
 So please suggest any workaround for 0.94.1.
 
 
 On Sun, Sep 7, 2014 at 8:42 AM, lars hofhansl la...@apache.org wrote:
 
 Thinking about it again, if you ran into HBASE-7336 you'd see high CPU
 load, but *not* IOWAIT.
 0.94 is at 0.94.23, you should upgrade. A lot of fixes, improvements, and
 performance enhancements went in since 0.94.4.
 You can do a rolling upgrade straight to 0.94.23.
 
 With that out of the way, can you post a jstack of the processes that
 experience high wait times?
 
 -- Lars
 
  --
 *From:* kiran kiran.sarvabho...@gmail.com
 *To:* user@hbase.apache.org; lars hofhansl la...@apache.org
 *Sent:* Saturday, September 6, 2014 11:30 AM
 *Subject:* Re: HBase - Performance issue
 
 Lars,
 
 We are facing a similar situation on a similar cluster configuration...
 We are seeing high I/O wait percentages on some machines in our cluster...
 We have short-circuit reads enabled but are still facing the same
 problem... the CPU wait goes up to 50% in some cases while issuing scan
 commands with multiple threads... Is there a workaround other than applying
 the patch for 0.94.4?
 
 Thanks
 Kiran
 
 
 On Thu, Apr 25, 2013 at 12:12 AM, lars hofhansl la...@apache.org wrote:
 
 You may have run into https://issues.apache.org/jira/browse/HBASE-7336
 (which is in 0.94.4)
 (Although I had not observed this effect as much when short circuit reads
 are enabled)
 
 
 
 - Original Message -
 From: kzurek kzu...@proximetry.pl
 To: user@hbase.apache.org
 Cc:
 Sent: Wednesday, April 24, 2013 3:12 AM
 Subject: HBase - Performance issue
 
 The problem is that when I'm putting my data (multithreaded client, ~30 MB/s
 outgoing traffic) into the cluster, the load is spread equally over all
 RegionServers with 3.5% average CPU wait time (average CPU user: 51%). When
 I add a similar multithreaded client that scans for, let's say, the last 100
 samples of a randomly generated key from a chosen time range, I get high
 CPU wait time (20% and up) on two (or more, if there is a higher number of
 threads; default 10) random RegionServers. Therefore, the machines that host
 those RSs get very hot - one of the consequences is that the number of
 store files is constantly increasing, up to the maximum limit. The rest of the
 RSs have 10-12% CPU wait time and everything seems to be OK (the number of
 store files varies, so they are being compacted and not increasing over
 time). Any ideas? Maybe I could prioritize writes over reads somehow? Is that
 possible? If so, what would be the best way to do that, and where should it
 be placed - on the client or cluster side?
 
 Cluster specification:
 HBase Version: 0.94.2-cdh4.2.0
 Hadoop Version: 2.0.0-cdh4.2.0
 There are 6x DataNodes (5x HDD for storing data), 1x MasterNode
 Other settings:
 - Bloom filters (ROWCOL) set
 - Short-circuit reads turned on
 - HDFS Block Size: 128 MB
 - Java Heap Size of NameNode/Secondary NameNode: 8 GiB
 - Java Heap Size of HBase RegionServer: 12 GiB
 - Java Heap Size of HBase Master: 4 GiB
 - Java Heap Size of DataNode: 1 GiB (default)
 Number of regions per RegionServer: 19 (total 114 regions on 6 RSs)
 Key design: UUID + TIMESTAMP - UUID: 1-10M, TIMESTAMP: 1-N
 Table design: 1 column family with 20 columns of 8 bytes
 
 Get client:
 Multiple threads.
 Each thread has its own table instance with its own Scanner.
 Each thread has its own range of UUIDs and randomly draws the beginning of a
 time range to build the rowkey properly (see above).
 Each time Scan requests same 

Re: One-table w/ multi-CF or multi-table w/ one-CF?

2014-09-09 Thread Michael Segel
Locality? 

Then the data should be in the same column family.  That’s as local as you can 
get. 

I would suggest that you think of the following:

What’s the predominant use case? 
How are you querying the data?
If you’re always hitting multiple CFs to get the data… then you should have it 
in the same table. 

I think more people would benefit if they took more time thinking about their
design and how the data is being used and stored… it would help.
Also know that there really isn’t a single ‘right’ answer, just a lot of
wrong ones. ;-)


Most people still try to think of HBase in terms of relational modeling rather
than in terms of records and more of a hierarchical system.
Things like CFs and Versioning are often misused because people see them as 
shortcuts. 

Also people tend not to think of their data in HBase in terms of 3D but in 
terms of 2D. 
(CF’s would be 2+D) 

The one question which really hasn’t been answered is how fat is fat in terms 
of a row’s width and when is it too fat? 
This may seem like a simple thing, but it can impact a couple of things in your 
design. (I never got a good answer, and it’s one of those questions that if your
wife were to ask whether the pants she’s wearing make her fat, it’s time to run for
the hills because you can’t win no matter how you answer!) 
Seriously though, the optimal width of the column is not that easy to answer 
and sometimes you have to just guess as to which would be a better design. 

One of the problems with CFs is that if there’s an imbalance in terms of the 
size of data being stored in each CF, you can run into issues.
CFs are stored in separate files and split when the base CF splits. (Assuming 
you have a base CF and then multiple CFs that are related but store smaller 
records per row.) 
And then there’s the issue that each CF is stored separately. (If memory
serves, it’s a separate file per CF, but right now my last living brain cell
decided to call it quits and went on strike for more beer.) 
[Damn you last brain cell!!!] :-) 

Again the idea is to follow KISS. 

HTH

-Mike

On Sep 8, 2014, at 7:17 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:

 Locality is important; that's why I chose CFs to put related data into one
 group. I can surely put the CF part at the head of the rowkey to achieve a
 similar result, but since the number of types is fixed, I don't see any
 benefit in doing that.

 With setLoadColumnFamiliesOnDemand, which I learned about from Ted, it looks
 like the performance should be similar.
 
 Am I missing something? Please enlighten me.
 
 Jianshi
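
For reference, a minimal sketch of the Scan option mentioned above
(setLoadColumnFamiliesOnDemand has been available since roughly 0.94.5). Family
and column names are hypothetical; with an essential-family-aware filter such as
SingleColumnValueFilter, only the filtered family is read eagerly, and the other
family is loaded lazily for rows that actually match:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class OnDemandScanExample {
  static Scan buildScan() {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("events"));   // hypothetical "essential" CF
    scan.addFamily(Bytes.toBytes("profile"));  // hypothetical large, rarely-needed CF
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("events"), Bytes.toBytes("type"),
        CompareOp.EQUAL, Bytes.toBytes("payment")));
    // Non-essential families are fetched only for rows the filter accepts.
    scan.setLoadColumnFamiliesOnDemand(true);
    return scan;
  }
}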
 
 On Mon, Sep 8, 2014 at 3:41 AM, Michael Segel michael_se...@hotmail.com
 wrote:
 
 I would suggest rethinking column families and look at your potential for
 a slightly different row key.
 
 Going with column families doesn’t really make sense.
 
 Also how wide are the rows? (worst case?)
 
 one idea is to make type part of the RK…
 
 HTH
 
 -Mike
 
 On Sep 7, 2014, at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote:
 
 Hi Michael,
 
 Thanks for the questions.
 
  I'm modeling dynamic graphs in HBase; all elements (vertices, edges) have a
  timestamp, and I can query things like events between A and B for the last 7
  days.

  CFs are used for grouping different types of data for the same account.
  However, I have a lot of skew in the data; to avoid having too much in the
  same row, I had to move what was in CQs into RKs. So a CF now acts more like
  a table.
 
  There's one CF containing a sequence of events ordered by timestamp, and
  this CF is quite different as the use case is mostly MapReduce jobs.
 
 Jianshi
 
 
 
 
 On Sun, Sep 7, 2014 at 4:52 AM, Michael Segel michael_se...@hotmail.com
 
 wrote:
 
 Again, a silly question.
 
 Why are you using column families?
 
  Just to play devil’s advocate in terms of design, why are you not treating
  your row as a record? Think hierarchical, not relational.
 
  This really gets into some design theory.
 
  Think of a column family as a way to group data that has the same row key
  and references the same thing, yet where the data in each column family is
  used separately.
 The example I always turn to when teaching, is to think of an order
 entry
 system at a retailer.
 
 You generate data which is segmented by business process. (order entry,
 pick slips, shipping, invoicing) All reflect a single order, yet the
 data
 in each process tends to be accessed separately.
 (You don’t need the order entry when using the pick slip to pull orders
 from the warehouse.)  So here, the data access pattern is that each
 column
 family is used separately, except in generating the data (the order
 entry
 is used to generate the pick slip(s) and set up things like backorders
 and
 then the pick process generates the shipping slip(s) etc …  And since
 they
 are all focused on the same order, they have the same row key.
 
  So it's reasonable to ask how you are accessing the data and how you are
  designing your HBase model.
 
 Many times,  developers create a model using 

Re: Nested data structures examples for HBase

2014-09-09 Thread Stephen Boesch
Thanks Michael, yes, cells are byte[]; therefore, storing JSON or other
document structures is always possible. Our use cases include querying
individual elements in the structure, so that would require reconstituting
the documents and then parsing them for every row. We are probably not
headed in the direction of HBase for those use cases, but we are trying to
make that determination after having carefully considered the extent of the
mismatch.

2014-09-09 13:37 GMT-07:00 Michael Segel michael_se...@hotmail.com:

 You do realize that everything you store in HBase is a byte array, right?
 That is, each cell is a blob.

 So you have the ability to create nested structures like… JSON records? ;-)

 So, to your point: you can have a column A which represents a set of values.

 This is one reason why you shouldn’t think of HBase in terms of being
 relational. In fact, for Hadoop, you really don’t want to think in terms of
 relational structures.
 Think more in terms of hierarchical structures.

 So yes, you can do what you want to do…

 HTH

 -Mike

 On Sep 8, 2014, at 10:06 PM, Stephen Boesch java...@gmail.com wrote:

  While I am aware that HBase does not have native support for nested
  structures, surely there are some of you that have thought through this
 use
  case carefully.
 
  Our particular use case is likely having single digit nested layers with
  tens to hundreds of items in the lists at each level.
 
  An example would be a
 
  top Level  300 items
  middle level :  1 to 100 items  (1 value  may indicate a single value
 as
  opposed to a list)
  third level:  1 to 50 items
  fourth level  1 to 20 items
 
  The column names are likely known ahead of time, which may or may not
  matter for HBase. We could model the above structure in a Parquet file or
  in Hive (with nested structs), but we would like to consider whether
  HBase might also be an option.




SKIP_FLUSH

2014-09-09 Thread Guangle Fan
Hi, does anybody know why I can't skip the flush when taking a snapshot?


snapshot 'aaa', 'aaa_snapshot', {SKIP_FLUSH => true}

NameError: uninitialized constant SKIP_FLUSH


Without {SKIP_FLUSH => true}, the command works fine.


Regards,


Guangle


Re: SKIP_FLUSH

2014-09-09 Thread Matteo Bertozzi
which version are you using?

Matteo


On Tue, Sep 9, 2014 at 5:34 PM, Guangle Fan fanguan...@gmail.com wrote:

 Hi, does anybody know why I can't skip the flush when taking a snapshot?


 snapshot 'aaa', 'aaa_snapshot', {SKIP_FLUSH => true}

 NameError: uninitialized constant SKIP_FLUSH


 Without {SKIP_FLUSH => true}, the command works fine.


 Regards,


 Guangle



Re: SKIP_FLUSH

2014-09-09 Thread Ted Yu
Matteo is so fast :-)

HBASE-10935 went into 0.98.4

FYI

On Tue, Sep 9, 2014 at 5:35 PM, Matteo Bertozzi theo.berto...@gmail.com
wrote:

 which version are you using?

 Matteo


 On Tue, Sep 9, 2014 at 5:34 PM, Guangle Fan fanguan...@gmail.com wrote:

  Hi, does anybody know why I can't skip the flush when taking a snapshot?


  snapshot 'aaa', 'aaa_snapshot', {SKIP_FLUSH => true}

  NameError: uninitialized constant SKIP_FLUSH


  Without {SKIP_FLUSH => true}, the command works fine.
 
 
  Regards,
 
 
  Guangle
 



Re: SKIP_FLUSH

2014-09-09 Thread Guangle Fan
That explains it. I'm on 0.96.

On Tue, Sep 9, 2014 at 5:37 PM, Ted Yu yuzhih...@gmail.com wrote:

 Matteo is so fast :-)

 HBASE-10935 went into 0.98.4

 FYI

 On Tue, Sep 9, 2014 at 5:35 PM, Matteo Bertozzi theo.berto...@gmail.com
 wrote:

  which version are you using?
 
  Matteo
 
 
  On Tue, Sep 9, 2014 at 5:34 PM, Guangle Fan fanguan...@gmail.com
 wrote:
 
   Hi, does anybody know why I can't skip the flush when taking a snapshot?


   snapshot 'aaa', 'aaa_snapshot', {SKIP_FLUSH => true}

   NameError: uninitialized constant SKIP_FLUSH


   Without {SKIP_FLUSH => true}, the command works fine.
  
  
   Regards,
  
  
   Guangle
  
 



Re: need help understand log output

2014-09-09 Thread Qiang Tian
Out of curiosity, did you see the messages below in the RS log?

  LOG.warn("Snapshot called again without clearing previous. " +
      "Doing nothing. Another ongoing flush or did we fail last attempt?");

thanks.

On Tue, Sep 9, 2014 at 2:15 AM, Brian Jeltema 
brian.jelt...@digitalenvoy.net wrote:

 I’ve resolved these problems by restarting the region server that owned
 the region in question.
 I don’t know what the underlying issue was, but at this point it’s not
 worth pursuing.

 Thanks for responding.

 Brian

 On Sep 8, 2014, at 11:06 AM, Brian Jeltema brian.jelt...@digitalenvoy.net
 wrote:

  I realized today that the region server logs for the region being
 updated (startKey=\x00DDD@) contain the following:
 
  2014-09-08 06:25:50,223 INFO  [regionserver60020.periodicFlusher]
 regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
 flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65.
 after a delay of 11302
  2014-09-08 06:26:00,222 INFO  [regionserver60020.periodicFlusher]
 regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
 flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65.
 after a delay of 21682
  2014-09-08 06:26:10,223 INFO  [regionserver60020.periodicFlusher]
 regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
 flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65.
 after a delay of 5724
  2014-09-08 06:26:20,223 INFO  [regionserver60020.periodicFlusher]
 regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
 flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65.
 after a delay of 11962
  2014-09-08 06:26:30,223 INFO  [regionserver60020.periodicFlusher]
 regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
 flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65.
 after a delay of 7693
  2014-09-08 06:26:40,224 INFO  [regionserver60020.periodicFlusher]
 regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
 flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65.
 after a delay of 5578
  2014-09-08 06:26:50,223 INFO  [regionserver60020.periodicFlusher]
 regionserver.HRegionServer: regionserver60020.periodicFlusher requesting
 flush for region Host,\x00DDD@,1400624237999.5bb6bd41597ddd8dd7ca03e78f3a3e65.
 after a delay of 12420
 
  A log entry has been generated every 10 seconds, starting about 4 days ago.
 I presume these problems are related.
 
  On Sep 8, 2014, at 7:10 AM, Brian Jeltema 
 brian.jelt...@digitalenvoy.net wrote:
 
 
  When the number of attempts is greater than the value of
  hbase.client.start.log.errors.counter (default 9), AsyncProcess produces
  the log entries cited below.
  The interval following 'retrying after' is the backoff time.
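
For context, the 35 in attempt=N/35 and the backoff intervals in these entries
come from the client retry settings. A hedged hbase-site.xml sketch, assuming
0.98-era property names (the values shown are believed to be the defaults, not
a recommendation):

<property>
  <name>hbase.client.retries.number</name>   <!-- upper bound shown as /35 in the log -->
  <value>35</value>
</property>
<property>
  <name>hbase.client.pause</name>            <!-- base pause in ms, scaled up per attempt -->
  <value>100</value>
</property>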
 
  Which release of HBase are you using ?
 
 
  HBase Version 0.98.0.2.1.1.0-385-hadoop2
 
  The MR job is reading from  an HBase snapshot, if that’s relevant.
 
  Cheers
 
 
  On Sun, Sep 7, 2014 at 8:50 AM, Brian Jeltema 
  brian.jelt...@digitalenvoy.net wrote:
 
  I have a MapReduce job that is consistently failing with timeouts. The
  failing mapper log files contain a series of records similar to those
  below. When I look at the HBase and HDFS logs (on foo.net in this case)
  I don’t see anything obvious at these timestamps. The mapper task times
  out at/near attempt=25/35. Can anyone shed light on what these log
  entries mean?
 
  Thanks - Brian
 
 
  2014-09-07 09:36:51,421 INFO [htable-pool1-t1]
  org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary,
  attempt=10/35 failed 1062 ops, last exception: null on foo.net
 ,60020,1406043467187,
  tracking started null, retrying after 10029 ms, replay 1062 ops
  2014-09-07 09:37:01,642 INFO [htable-pool1-t1]
  org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary,
  attempt=11/35 failed 1062 ops, last exception: null on foo.net
 ,60020,1406043467187,
  tracking started null, retrying after 10023 ms, replay 1062 ops
  2014-09-07 09:37:12,064 INFO [htable-pool1-t1]
  org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary,
  attempt=12/35 failed 1062 ops, last exception: null on foo.net
 ,60020,1406043467187,
  tracking started null, retrying after 20182 ms, replay 1062 ops
  2014-09-07 09:37:32,708 INFO [htable-pool1-t1]
  org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary,
  attempt=13/35 failed 1062 ops, last exception: null on foo.net
 ,60020,1406043467187,
  tracking started null, retrying after 20140 ms, replay 1062 ops
  2014-09-07 09:37:52,940 INFO [htable-pool1-t1]
  org.apache.hadoop.hbase.client.AsyncProcess: #3, table=Host, primary,
  attempt=14/35 failed 1062 ops, last exception: null on foo.net
 ,60020,1406043467187,
  tracking started null, retrying after 20041 ms, replay 1062 ops
  2014-09-07 09:38:13,324 INFO [htable-pool1-t1]
  

Re: Nested data structures examples for HBase

2014-09-09 Thread Michael Segel

Are you just kicking the tires, or do you want to roll up your sleeves and do
some work?

You have options.
Secondary indexes.

I don’t mean an inverted table, but things like Solr, Lucene, Elasticsearch…

The only downside is that, depending on what you index, you can see an explosion
in the data being stored in HBase.

But that may be beyond you. It’s a non-trivial task and, to be honest… a bit of
‘rocket science’.

It’s still doable…


On Sep 9, 2014, at 10:20 PM, Stephen Boesch java...@gmail.com wrote:

 Thanks Michael, yes, cells are byte[]; therefore, storing JSON or other
 document structures is always possible. Our use cases include querying
 individual elements in the structure, so that would require reconstituting
 the documents and then parsing them for every row. We are probably not
 headed in the direction of HBase for those use cases, but we are trying to
 make that determination after having carefully considered the extent of the
 mismatch.
 
 2014-09-09 13:37 GMT-07:00 Michael Segel michael_se...@hotmail.com:
 
 You do realize that everything you store in HBase is a byte array, right?
 That is, each cell is a blob.

 So you have the ability to create nested structures like… JSON records? ;-)

 So, to your point: you can have a column A which represents a set of values.

 This is one reason why you shouldn’t think of HBase in terms of being
 relational. In fact, for Hadoop, you really don’t want to think in terms of
 relational structures.
 Think more in terms of hierarchical structures.
 
 So yes, you can do what you want to do…
 
 HTH
 
 -Mike
 
 On Sep 8, 2014, at 10:06 PM, Stephen Boesch java...@gmail.com wrote:
 
 While I am aware that HBase does not have native support for nested
 structures, surely there are some of you that have thought through this
 use
 case carefully.
 
 Our particular use case is likely having single digit nested layers with
 tens to hundreds of items in the lists at each level.
 
 An example would be a
 
 top Level  300 items
 middle level :  1 to 100 items  (1 value  may indicate a single value
 as
 opposed to a list)
 third level:  1 to 50 items
 fourth level  1 to 20 items
 
  The column names are likely known ahead of time, which may or may not
  matter for HBase. We could model the above structure in a Parquet file or
  in Hive (with nested structs), but we would like to consider whether
  HBase might also be an option.