How to scan only Memstore from end point co-processor
Hi all,

Here is our use case: we have a very write-heavy cluster, and every 10 minutes we run periodic endpoint-coprocessor-based jobs that operate on the data written in the last 10-15 minutes. Is there a way to query only the MemStore from the endpoint coprocessor?

The periodic job scans for data using a time range. We would like to implement a simple logic:
a. If the query time range is within the MemStore's TimeRangeTracker, query only the MemStore.
b. If the end time of the query range is within the MemStore's TimeRangeTracker but the query start time is outside it (a MemStore flush happened), query both the MemStore and the files.
c. If both the start time and the end time of the query are outside the MemStore's TimeRangeTracker, query only the files.

The incoming data is time series, and we do not allow old data (out of sync with the clock) to come into the system (HBase).

Cloudera has a scanner, org.apache.hadoop.hbase.regionserver.InternalScan, that has methods like checkOnlyMemStore() and checkOnlyStoreFiles(). Is this available in trunk? Also, how do I access the MemStore for a column family in the endpoint coprocessor from the CoprocessorEnvironment?
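A sketch of that a/b/c routing in Java — trackerMin and trackerMax are hypothetical stand-ins for the MemStore TimeRangeTracker bounds (how to obtain them from a coprocessor is exactly the open question in this post):

    // Hypothetical sketch of the a/b/c routing described above.
    enum ScanTarget { MEMSTORE_ONLY, MEMSTORE_AND_FILES, FILES_ONLY }

    static ScanTarget route(long queryStart, long queryEnd,
                            long trackerMin, long trackerMax) {
      if (queryStart >= trackerMin && queryEnd <= trackerMax) {
        return ScanTarget.MEMSTORE_ONLY;       // case a: range fully inside the MemStore
      }
      if (queryEnd >= trackerMin && queryStart < trackerMin) {
        return ScanTarget.MEMSTORE_AND_FILES;  // case b: a flush happened mid-range
      }
      return ScanTarget.FILES_ONLY;            // case c: range entirely flushed to files
    }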
Re: How to scan only Memstore from end point co-processor
We have a postScannerOpen hook in the CP, but that may not give you direct access to know which of the internal scanners are on the MemStore and which are on the store files. It is possible, but we may need to add some new hooks at the place where we explicitly add the internal scanners required for a scan.

Still, a general question: are you sure that your data will be only in the MemStore, and that the latest data would not have been flushed from your MemStore to the HFiles by that time? I see that your scenario is write-centric, so how can you guarantee your data will be in the MemStore only? Though your time range may say it is the latest data (maybe 10 to 15 min), you should be able to configure your MemStore flushing in such a way that there are no flushes happening for the latest data in that 10-15 min window. Just sharing my thoughts here.

On Mon, Jun 1, 2015 at 11:46 AM, Gautam Borah gbo...@appdynamics.com wrote: ...
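One hedged illustration of the flush-tuning suggestion above, using standard hbase-site.xml keys; the values are placeholders only and must be sized against real write volume and heap:

    <!-- Illustrative values: keep the last 10-15 minutes of writes in the MemStore. -->
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>268435456</value> <!-- 256 MB per-region flush threshold -->
    </property>
    <property>
      <name>hbase.regionserver.optionalcacheflushinterval</name>
      <value>3600000</value> <!-- periodic flush of idle memstores: 1 hour -->
    </property>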
HBase client: refreshing the connection
Hi All,

We are using 0.94.15 in our OpenDaylight/TSDR project currently. We observed a put operation hang for 20 mins (with all default timeouts) and then throw an IOException. Even when we re-attempt the same put operation, it hangs for 20 mins again. We observed a zxid mismatch in the HBase server logs. We would like clarification on the following items.

1) Reducing this hang time from 20 mins to 5 mins: there seem to be many timeout configurations (hbase-client, zookeeper, client.pause, etc.) and it is slightly confusing how they are all combined with the backoff series. If I add the configuration hbase.client.retries.number=3 in hbase-site.xml, will that bring it down to 5 mins?

2) When we receive this exception, we deleteAllConnections, and the subsequent put operation succeeds. We wish to continue this approach. The following is our code where we create the HTable:

    HTableInterface htableResult = null;
    htableResult = htableMap.get(tableName);
    ..
    if (htableResult == null) {
        if (htablePool == null || htablePool.getTable(tableName) == null) {
            htablePool = new HTablePool(getConfiguration(), poolSize);
        }
        if (htablePool != null) {
            htableResult = htablePool.getTable(tableName);
            ..
        }
    }
    htableMap.put(tableName, htableResult);

We create 5 tables in our application. Will there be 5 HConnections in total, one HConnection per table? If yes, how do I delete the connection for a given table, as most of the delete(All)Connections methods in HConnectionManager are deprecated in 0.94.15, with no alternatives given in the javadoc? Even if we use deleteConnection, it asks for a conf, which doesn't bind to any table, correct?

deleteConnection
@Deprecated
public static void deleteConnection(org.apache.hadoop.conf.Configuration conf)
Deprecated. Delete connection information for the instance specified by configuration. If there are no more references to it, this will then close the connection to the zookeeper ensemble and let go of all resources.
Parameters: conf - configuration whose identity is used to find the HConnection instance.

deleteAllConnections
@Deprecated
public static void deleteAllConnections(boolean stopProxy)
Deprecated. Use deleteAllConnections() instead. Delete information for all connections.
Parameters: stopProxy - No longer used. This parameter is ignored.

deleteAllConnections
@Deprecated
public static void deleteAllConnections()
Deprecated. Delete information for all connections.
Throws: IOException

Thanks,
Hari
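A minimal sketch of the recovery path described in (2), assuming the poster's htableMap/htablePool/poolSize fields and the 0.94 client API quoted above; resetHBaseClient is a hypothetical helper. Note that per the quoted javadoc, connections are keyed by the Configuration, not by table, so one reset covers all five tables:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HConnectionManager;
    import org.apache.hadoop.hbase.client.HTablePool;

    // Hypothetical recovery helper for the 0.94 client, per the approach above.
    private void resetHBaseClient(Configuration conf) throws IOException {
        htableMap.clear();              // drop cached HTableInterface references
        if (htablePool != null) {
            htablePool.close();         // release pooled tables
        }
        // Per the quoted javadoc, the connection is found by the Configuration's
        // identity (not by table), so pass the same conf used to create it.
        HConnectionManager.deleteConnection(conf);
        htablePool = new HTablePool(conf, poolSize);
    }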
Monitor off heap Bucket Cache
Hi,

What's the best way to monitor / know how the bucket cache is being used, how much is cached there, etc.?

Our RegionServer can use 32G of heap, so we exported HBASE_OFFHEAPSIZE to 24G in hbase-env.sh, set hfile.block.cache.size to 0.05, and set a couple of block sizes that we know we are using, knowing our usage patterns. And this is where the strange part starts - in the web UI we see now, with turning this off, with those values, that the total BlockCache available is 1G - before it was 10G. What we basically tried to achieve was to double it to 20G.

The documentation we were referring to was http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_blockcache_configure.html#concept_cp3_fhy_dr_unique_1__section_m3r_2cz_dr_unique_1 as the HBase book does not go into much detail on how to properly configure this and get what you want.

Btw, if we put hfile.block.cache.size to 0.2 we see in the web UI that the total available cache is 24G, but then after some time we had the region server crashing. The host server has enough RAM, considering only the data node and region server running there (128G of RAM in total), so we thought we could increase caching by turning on this functionality.

Do you maybe see what exactly we are doing wrong? How exactly do we increase offheap caching - and is it possible to monitor it anyhow, as in the metrics we don't see anything associated with it?

Thanks,
Dejan
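For reference, a hedged hbase-site.xml sketch of the 0.98-era offheap BucketCache setup being attempted here; values are illustrative, and in this era hbase.bucketcache.size is read as megabytes when greater than 1.0, otherwise as a fraction:

    <!-- Illustrative; HBASE_OFFHEAPSIZE in hbase-env.sh must be set somewhat
         larger than the cache itself to leave direct-memory headroom. -->
    <property>
      <name>hbase.bucketcache.ioengine</name>
      <value>offheap</value>
    </property>
    <property>
      <name>hbase.bucketcache.size</name>
      <value>20480</value> <!-- ~20 G offheap cache -->
    </property>
    <property>
      <name>hfile.block.cache.size</name>
      <value>0.05</value> <!-- small onheap L1 for index/bloom blocks -->
    </property>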
Java Hbase Client or Rest approach
Hi,

We have a Java-based web application. There is a requirement to fetch data from HBase and build some dashboards. What is the best way to go about fetching the data from HBase?

1. Using the Java HBase client API, or
2. Using the HBase REST API.

I would appreciate it if anyone could provide the pros and cons of both of the above approaches.

Regards,
Shobha
Re: How to scan only Memstore from end point co-processor
InternalScan has a ctor from a Scan object; see https://issues.apache.org/jira/browse/HBASE-12720. You can instantiate InternalScan from a Scan, set checkOnlyMemStore, then open a RegionScanner - but the best approach is to cache data on write and run a regular RegionScanner against the memstore and block cache.

best,
-Vlad

On Sun, May 31, 2015 at 11:45 PM, Anoop John anoop.hb...@gmail.com wrote: If your scan has a time range specified, HBase will internally check it against the time range of the files etc. and will avoid those which are clearly outside your time range of interest. You don't have to do anything for this. Make sure you set the TimeRange for your read. -Anoop-

On Mon, Jun 1, 2015 at 12:09 PM, ramkrishna vasudevan ramkrishna.s.vasude...@gmail.com wrote: ...

On Mon, Jun 1, 2015 at 11:46 AM, Gautam Borah gbo...@appdynamics.com wrote: ...
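Putting Vlad's and Anoop's suggestions together, a minimal sketch of a memstore-only read inside an endpoint coprocessor - assuming a build that includes HBASE-12720's InternalScan(Scan) constructor and a 0.98/1.0-era environment where getRegion() returns HRegion; the aggregation (a plain cell count) is illustrative:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.HRegion;
    import org.apache.hadoop.hbase.regionserver.InternalScan;
    import org.apache.hadoop.hbase.regionserver.RegionScanner;

    // Illustrative: count cells in [startTs, endTs) reading only the MemStore.
    long countMemstoreCells(RegionCoprocessorEnvironment env,
                            long startTs, long endTs) throws IOException {
      Scan scan = new Scan();
      scan.setTimeRange(startTs, endTs);  // also lets HBase skip files outside the range

      InternalScan iscan = new InternalScan(scan);
      iscan.checkOnlyMemStore();          // restrict the read to in-memory data

      HRegion region = env.getRegion();
      RegionScanner scanner = region.getScanner(iscan);
      long count = 0;
      List<Cell> cells = new ArrayList<Cell>();
      boolean more;
      do {
        cells.clear();
        more = scanner.next(cells);
        count += cells.size();
      } while (more);
      scanner.close();
      return count;
    }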
Re: hfile.bucket.BucketAllocatorException: Allocation too big size
Oh, cool, something that will push us to upgrade sooner rather than later :) Just for my information - what limit was then used in 2.1 as the maximum cache block size (or whatever its name was)? The size of the block, or something else?

On Mon, Jun 1, 2015 at 5:00 PM Ted Yu yuzhih...@gmail.com wrote: ...
Re: hfile.bucket.BucketAllocatorException: Allocation too big size
Which hbase release are you using? I seem to recall that hbase.bucketcache.bucket.sizes was the key.

Cheers

On Mon, Jun 1, 2015 at 7:04 AM, Dejan Menges dejan.men...@gmail.com wrote: ...
Re: hfile.bucket.BucketAllocatorException: Allocation too big size
Dejan:

hbase.bucketcache.bucket.sizes was introduced by HBASE-10641 (Configurable Bucket Sizes in bucketCache), which was integrated into 0.98.4. HDP 2.2 has the fix while HDP 2.1 didn't.

FYI

On Mon, Jun 1, 2015 at 7:23 AM, Dejan Menges dejan.men...@gmail.com wrote: ...
Re: hfile.bucket.BucketAllocatorException: Allocation too big size
Hi Ted,

It's 0.98.0 with a bunch of patches (from Hortonworks). Let me try with that key, on my way :)

On Mon, Jun 1, 2015 at 4:19 PM Ted Yu yuzhih...@gmail.com wrote: ...
hfile.bucket.BucketAllocatorException: Allocation too big size
Hi,

I'm getting messages like:

2015-06-01 14:02:29,529 WARN org.apache.hadoop.hbase.io.hfile.bucket.BucketCache: Failed allocating for block ce18012f4dfa424db88e92de29e76a9b_25809098330
org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocatorException: Allocation too big size=750465
    at org.apache.hadoop.hbase.io.hfile.bucket.BucketAllocator.allocateBlock(BucketAllocator.java:400)
    at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$RAMQueueEntry.writeToCache(BucketCache.java:1153)
    at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$WriterThread.doDrain(BucketCache.java:703)
    at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$WriterThread.run(BucketCache.java:675)
    at java.lang.Thread.run(Thread.java:745)

However, I'm not sure why this is. If I understood it correctly (and probably I didn't :/), this should fit in one of those:

    <property>
      <name>hbase.bucketcache.sizes</name>
      <value>65536,131072,196608,262144,327680,393216,655360,1310720</value>
    </property>

At the same time, hbase.bucketcache.size is 24G. Not sure what I did wrong (again)?
Re: hfile.bucket.BucketAllocatorException: Allocation too big size
Yes, Ted is right. hbase.bucketcache.bucket.sizes is the correct config name... I think the wrong name was added to hbase-default.xml. Was there a bug already raised for this? Something related to bucket cache was already there... I'm not sure. We need a fix in the xml.

-Anoop-

On Mon, Jun 1, 2015 at 7:49 PM, Ted Yu yuzhih...@gmail.com wrote: ...
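For reference, the corrected spelling from this thread, for builds that include HBASE-10641 (0.98.4+). With the misspelled key the default bucket sizes stay in effect, and those appear to top out near 512 KB, which would explain the 750465-byte allocation failing above:

    <property>
      <name>hbase.bucketcache.bucket.sizes</name>
      <value>65536,131072,196608,262144,327680,393216,655360,1310720</value>
    </property>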
Re: Monitor off heap Bucket Cache
Also note that the configuration changed slightly between 0.98 and 1.0; see HBASE-11520. From the release note: Remove hbase.bucketcache.percentage.in.combinedcache. Simplifies config of block cache. If you are using this config, after this patch goes in it will be ignored. The L1 LruBlockCache will be whatever hfile.block.cache.size is set to, and the L2 BucketCache will be whatever hbase.bucketcache.size is set to.

On Mon, Jun 1, 2015 at 8:10 AM, Stack st...@duboce.net wrote:

On Mon, Jun 1, 2015 at 2:24 AM, Dejan Menges dejan.men...@gmail.com wrote: Hi, What's the best way to monitor / know how the bucket cache is being used, how much is cached there, etc.?

See the UI on a regionserver. Look down the page to the 'Block Cache' section. It has detail on both onheap and LRU offheap. See also the documentation on the offheap cache: http://hbase.apache.org/book.html#offheap.blockcache Look also at metrics, where we report block cache stats as well as offheap used by the JVM.

Our RegionServer can use 32G of heap size, so we exported HBASE_OFFHEAPSIZE to 24G in hbase-env.sh, set hfile.block.cache.size to 0.05, and set a couple of block sizes that we know we are using, knowing our usage patterns. And this is where the strange part starts - in the web UI we see now, with turning this off,

Turning off what?

with those values, that the total BlockCache available is 1G - before it was 10G. What we basically tried to achieve was to double it to 20G.

Are you sure this is not the onheap portion of the BlockCache?

The documentation we were referring to was http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/admin_hbase_blockcache_configure.html#concept_cp3_fhy_dr_unique_1__section_m3r_2cz_dr_unique_1 as the HBase book does not go into much detail on how to properly configure this and get what you want.

Please let us know what is missing from here: http://hbase.apache.org/book.html#offheap.blockcache We'd like to fix it.

Btw, if we put hfile.block.cache.size to 0.2 we see in the web UI that the total available cache is 24G, but then after some time we had the region server crashing. The host server has enough RAM, considering only the data node and region server running there (128G of RAM in total), so we thought we could increase caching by turning on this functionality.

Make sure your hbase has HBASE-11678.

Do you maybe see what exactly we are doing wrong? How exactly do we increase offheap caching - and is it possible to monitor it anyhow, as in the metrics we don't see anything associated with it?

Please provide more detail. Your configurations and what you've changed in hbase-env.sh.

St.Ack

Thanks, Dejan
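Per the HBASE-11520 release note above, a minimal 1.0-style sketch (values illustrative): L1 comes from hfile.block.cache.size, L2 from hbase.bucketcache.size, and the old percentage.in.combinedcache key is simply ignored:

    <property>
      <name>hfile.block.cache.size</name>
      <value>0.2</value> <!-- onheap L1 -->
    </property>
    <property>
      <name>hbase.bucketcache.size</name>
      <value>20480</value> <!-- offheap L2; interpreted as MB here -->
    </property>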
PhoenixIOException resolved only after compaction, is there a way to avoid it?
Hi Everyone,

We load data into HBase tables through bulk imports. If the data set is small, we can query the imported data from Phoenix with no issues. If the data size is huge (relative to our cluster - we have a very small cluster), I'm encountering the following error (org.apache.phoenix.exception.PhoenixIOException):

0: jdbc:phoenix:172.31.45.176:2181:/hbase> select count(*)
. . . . . . . . . . . . . . . . . . . . .> from ldll_compression ldll join ds_compression ds on (ds.statusid = ldll.statusid)
. . . . . . . . . . . . . . . . . . . . .> where ldll.logdate >= '2015-02-04'
. . . . . . . . . . . . . . . . . . . . .> and ldll.logdate <= '2015-02-06'
. . . . . . . . . . . . . . . . . . . . .> and ldll.dbname = 'lmguaranteedrate';
+--+
| COUNT(1) |
+--+
java.lang.RuntimeException: org.apache.phoenix.exception.PhoenixIOException: org.apache.phoenix.exception.PhoenixIOException: Failed after attempts=36, exceptions:
Mon Jun 01 13:50:57 EDT 2015, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=62358: row '' on table 'ldll_compression' at region=ldll_compression,,1432851434288.1a8b511def7d0c9e69a5491c6330d715., hostname=ip-172-31-32-181.us-west-2.compute.internal,60020,1432768597149, seqNum=16566
    at sqlline.SqlLine$IncrementalRows.hasNext(SqlLine.java:2440)
    at sqlline.SqlLine$TableOutputFormat.print(SqlLine.java:2074)
    at sqlline.SqlLine.print(SqlLine.java:1735)
    at sqlline.SqlLine$Commands.execute(SqlLine.java:3683)
    at sqlline.SqlLine$Commands.sql(SqlLine.java:3584)
    at sqlline.SqlLine.dispatch(SqlLine.java:821)
    at sqlline.SqlLine.begin(SqlLine.java:699)
    at sqlline.SqlLine.mainWithInputRedirection(SqlLine.java:441)
    at sqlline.SqlLine.main(SqlLine.java:424)

I did a major compaction of ldll_compression through the HBase shell (major_compact 'ldll_compression'). The same query ran successfully after the compaction:

0: jdbc:phoenix:172.31.45.176:2181:/hbase> select count(*)
. . . . . . . . . . . . . . . . . . . . .> from ldll_compression ldll join ds_compression ds on (ds.statusid = ldll.statusid)
. . . . . . . . . . . . . . . . . . . . .> where ldll.logdate >= '2015-02-04'
. . . . . . . . . . . . . . . . . . . . .> and ldll.logdate <= '2015-02-06'
. . . . . . . . . . . . . . . . . . . . .> and ldll.dbname = 'lmguaranteedrate';
+--+
| COUNT(1) |
+--+
| 13480    |
+--+
1 row selected (72.36 seconds)

Did anyone face a similar issue? Is the IO exception because Phoenix was not able to read from multiple regions, since the error was resolved after the compaction? Or any other thoughts?

Thanks,
Siva.
Re: Hbase vs Cassandra
Another point to add is the new HBase read high-availability feature using timeline-consistent region replicas, available from HBase 1.0 onward, which brings HBase closer to Cassandra in terms of read availability during node failures. You have a choice for read availability now. https://issues.apache.org/jira/browse/HBASE-10070

On Sun, May 31, 2015 at 12:32 PM, Vladimir Rodionov vladrodio...@gmail.com wrote: Couple more + for HBase: * Coprocessor framework (custom code inside the Region Servers and Master), which Cassandra is missing, afaik. Coprocessors have been widely used by HBase users (Phoenix SQL, for example) since inception (in 0.92). * The HBase security model is more mature and aligns well with Hadoop/HDFS security. Cassandra provides just basic authentication/authorization/SSL encryption - no Kerberos, no end-to-end data encryption, no cell-level security. -Vlad

On Sun, May 31, 2015 at 12:05 PM, lars hofhansl la...@apache.org wrote: You really have to try out both if you want to be sure. The fundamental differences that come to mind are: * HBase is always consistent. Machine outages lead to an inability to read or write data on that machine. With Cassandra you can always write. * Cassandra defaults to a random partitioner, so range scans are not possible (by default). * HBase has a range partitioner (if you don't want that, the client has to prefix the rowkey with a prefix of a hash of the rowkey). The main feature that sets HBase apart is range scans. * HBase is much more tightly integrated with Hadoop/MapReduce/HDFS, etc. You can map-reduce directly into HFiles and map those into HBase instantly. * Cassandra has a dedicated company supporting (and promoting) it. * Getting started is easier with Cassandra. For HBase you need to run HDFS and Zookeeper, etc. * I've heard lots of anecdotes about Cassandra working nicely with small clusters (< 50 nodes) and quickly degenerating above that. * HBase does not have a query language (but you can use Phoenix for full SQL support). * HBase does not have secondary indexes (having an eventually consistent index, similar to what Cassandra has, is easy in HBase, but making it as consistent as the rest of HBase is hard). * Everything you'll hear here is biased :) From personal experience... At Salesforce we spent a few months prototyping various stores (including Cassandra) and arrived at HBase. Your mileage may vary. -- Lars

- Original Message - From: Ajay ajay.ga...@gmail.com To: user@hbase.apache.org Sent: Friday, May 29, 2015 12:12 PM Subject: Hbase vs Cassandra

Hi, I need some info on HBase vs Cassandra as a data store (in general, plus specific to time series data). A comparison on the following would help: 1: features 2: deployment and monitoring 3: performance 4: anything else. Thanks, Ajay
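A hedged sketch of the client side of that HBASE-10070 feature (HBase 1.0+): a single read opts into timeline consistency and can then check whether a possibly stale secondary replica served it. Table handling and row keys are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Consistency;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;

    // Illustrative: a timeline-consistent read that tolerates primary outage.
    Result timelineGet(Table table, byte[] row) throws IOException {
      Get get = new Get(row);
      get.setConsistency(Consistency.TIMELINE);  // allow secondary replicas to answer
      Result result = table.get(get);
      if (result.isStale()) {
        // Served by a secondary replica; data may lag the primary slightly.
      }
      return result;
    }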
Re: zookeeper closing socket connection exception
How many ZooKeeper servers do you have?

Cheers

On Mon, Jun 1, 2015 at 12:15 PM, jeevi tesh jeevitesh...@gmail.com wrote: ...
Re: zookeeper closing socket connection exception
Hi Jeevi,

Have you looked into why the ZooKeeper server is no longer accepting connections? What is the number of clients you have running per host, and what is the configured value of maxClientCnxns on the ZooKeeper servers? Also, is the issue impacting clients only, or is it also impacting the RegionServers?

cheers,
esteban.

--
Cloudera, Inc.

On Mon, Jun 1, 2015 at 12:15 PM, jeevi tesh jeevitesh...@gmail.com wrote: ...
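For context, maxClientCnxns lives in the ZooKeeper server's zoo.cfg; a hedged example (60 is the usual default, and 0 disables the per-host cap):

    # zoo.cfg - cap on concurrent connections from a single client host
    maxClientCnxns=60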
zookeeper closing socket connection exception
Hi,

I have run into this issue several times but am still not able to resolve it; kindly help me in this regard. I have written a crawler that keeps running for several days. After 4 days of continuous interaction between the database and my application system, the database fails to respond. I'm not able to figure out where things can all of a sudden go wrong after 4 days of proper running.

My configuration: HBase 0.96.2, single server, JDK 1.7.

The issue is the following error:

WARN [http-bio-8080-exec-4-SendThread(hadoop2:2181)] zookeeper.ClientCnxn (ClientCnxn.java:run(1089)) - Session 0x14da00e69e001ad for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused

If this exception happens, the only solution I have is to restart HBase, which is not a viable solution because it will corrupt my system data.
Re: Hbase vs Cassandra
Well, since you brought up coprocessors… let's talk about the lack of security and stability that's been introduced by coprocessors. ;-)

I'm not saying that you don't want server-side extensibility, but you need to recognize the risks introduced by coprocessors.

On May 31, 2015, at 3:32 PM, Vladimir Rodionov vladrodio...@gmail.com wrote: ...
Re: Hbase vs Cassandra
You are both making correct points, but FWIW HBase does not require the use of Hadoop YARN or MapReduce. We do require HDFS, of course. Some of the tools we ship are MapReduce applications, but these are not core functions. We know of several large production use cases where HBase(+HDFS) clusters are used as a data store backing online applications without colocated computation.

On Jun 2, 2015, at 7:29 AM, Vladimir Rodionov vladrodio...@gmail.com wrote: ...
Re: Hbase vs Cassandra
Saying Ambari rules is like saying that you like to drink MD 20/20 and calling it a fine wine. Sorry to all the Hortonworks guys, but Ambari has a long way to go…. very immature. What that has to do with Cassandra vs HBase? I haven't a clue.

The key issue is that unless you need or want to use Hadoop, you shouldn't be using HBase. It's not a standalone product or system.

On May 30, 2015, at 7:40 AM, Serega Sheypak serega.shey...@gmail.com wrote:

1. No killer features compared to HBase.
2. Terrible!!! Ambari/Cloudera Manager rulezzz. Netflix has its own tool for Cassandra but it doesn't support vnodes.
3. Rumors say it's fast, when it works ;) The reason: it can silently drop data you try to write.
4. Time series is a nightmare. The easiest approach is to just replicate data to HDFS, partition it by hour/day, and run Spark/Scalding/Pig/Hive/Impala.

On Friday, May 29, 2015, Ajay wrote: ...
Re: Hbase vs Cassandra
The key issue is that unless you need or want to use Hadoop, you shouldn't be using HBase. It's not a standalone product or system.

Hello, what is the use case for a big data application without Hadoop?

-Vlad

On Mon, Jun 1, 2015 at 2:26 PM, Michael Segel michael_se...@hotmail.com wrote: ...
Re: Hbase vs Cassandra
Hi Ajay,

You won't be able to get an unbiased opinion here easily. You'll need to try and see how each works for your use case. We use HBase for the SPM backend and it has worked well for us - it's stable, handles billions and billions of rows (I lost track of the actual number many moons ago), and is fast, if you get your key design right.

I'll answer your question about monitoring: I'd say both are equally monitorable. SPM (http://sematext.com/spm) can monitor both HBase and Cassandra equally well. Because Cassandra is a bit simpler (vs. HBase having multiple processes one needs to run), it's a bit simpler to add monitoring to Cassandra, but the difference is small. SPM is at http://sematext.com/spm if you want to have a look. We expose our own HBase clusters in the live demo, so you can see what metrics HBase exposes. We don't run Cassandra, so we can't show its graphs, but you can see some charts, metrics, and filters for Cassandra at http://blog.sematext.com/2014/06/02/announcement-cassandra-performance-monitoring-in-spm/

I hope this helps.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On Fri, May 29, 2015 at 3:12 PM, Ajay ajay.ga...@gmail.com wrote: ...
Re: Hbase vs Cassandra
The point is that HBase is part of the Hadoop ecosystem, not a standalone database like Cassandra. This is one thing that gets lost when people want to compare NoSQL databases / data stores.

As to big data without Hadoop? Well, there's Spark on Mesos … :-P And there are other big data systems out there that are not as well known. Lexus/Nexus had their proprietary system that they've been trying to sell …

On Jun 1, 2015, at 5:29 PM, Vladimir Rodionov vladrodio...@gmail.com wrote: ...
Re: How to scan only Memstore from end point co-processor
Thanks, Vladimir. We will try this out soon.

Regards,
Gautam

On Mon, Jun 1, 2015 at 12:22 AM, Vladimir Rodionov vladrodio...@gmail.com wrote: ...
Re: Hbase vs Cassandra
HBase can do range scans, and one can attack many problems with range scans. Cassandra can't do range scans. HBase has a master; Cassandra does not. Those are the two main differences.

On Monday, June 1, 2015, Andrew Purtell andrew.purt...@gmail.com wrote: ...

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
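To make the range-scan point concrete, a hedged client-side sketch (the table and row keys are hypothetical): rows come back in key order between the start row, inclusive, and the stop row, exclusive:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    // Illustrative: scan a contiguous slice of the key space in sorted order.
    void scanRange(Table table) throws IOException {
      Scan scan = new Scan(Bytes.toBytes("row-0100"),   // start row (inclusive)
                           Bytes.toBytes("row-0200"));  // stop row (exclusive)
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          // process each row in key order
        }
      } finally {
        scanner.close();
      }
    }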
Re: Hbase vs Cassandra
HBase can very well be a standalone database, but we are debating semantics, not technology, I suspect. HBase uses some Hadoop ecosystem technologies but is absolutely a first-class data store. I need look no further than my employer for an example of a rather large production deploy of HBase* as an (internal) service, a high-scale data storage platform.

* - Strictly speaking, HBase accessed with Apache Phoenix's JDBC driver.

On Jun 2, 2015, at 10:32 AM, Michael Segel michael_se...@hotmail.com wrote: ...
[OFFTOPIC] Big Data Application Meetup
Hi everyone,

I wanted to drop a note about a newly organized developer meetup in the Bay Area, the Big Data Application Meetup (http://meetup.com/bigdataapps), and a call for speakers.

The plan is for meetup topics to be focused on application use cases: how developers can build end-to-end solutions with open-source big data technologies. HBase is extremely popular among developers building on the Hadoop stack, and we would love to see talks about using it in big data solutions. If you want to share your experience, please email me back. If you have any questions, I will be happy to answer them.

We plan for the first event to be hosted by Cask at its HQ in Palo Alto at the end of June.

Thank you,
Alex Baranau