It seems to be disabled across the cluster.

Abe
On Tue, Oct 21, 2014 at 9:04 AM, Puneet Kumar Ojha <[email protected]> wrote:

> Please check IPv6; if enabled, disable it, then synchronize ntpd and restart. It might help.
>
> From: Abe Weinograd [mailto:[email protected]]
> Sent: Tuesday, October 21, 2014 6:19 PM
> To: user; lars hofhansl
> Subject: Re: count on large table
>
> Hi Lars,
>
> We have 10 Region Servers with two 1TB disks on each. The table is not salted, but we pre-split regions when we bulk load so that we force equal distribution of our data. The data is relatively evenly distributed across our region servers, with no one Region Server being the "long tail."
>
> I don't have any metrics from Ganglia. We are running CDH on EC2, for what it is worth. The CPUs spike to 100% and IO jumps pretty equally on the Region Servers. Attached is the RS log from one of them when all I am doing on the entire cluster is a COUNT in Phoenix.
>
> Thanks again for your help,
> Abe
>
> On Tue, Oct 14, 2014 at 3:10 AM, lars hofhansl <[email protected]> wrote:
>
> Back-of-the-envelope math - assuming disks that can sustain 120MB/s - suggests you'd need about 17 disks 100% busy in total to pull 120GB off the disks in 60s (i.e. at least 6 servers completely utilizing all of their disks). How many servers do you have? HBase/HDFS will likely not quite max out all disks, so your 10 machines are cutting that close.
>
> Not concerned about the 250 regions - at least not for this.
>
> Are all machines/disks/CPUs equally busy? Is the table salted?
> Note that HBase's block cache stores data uncompressed, and hence your dataset likely does not fit into the aggregate block cache. Your query might run slightly better with the /*+ NO_CACHE */ hint.
>
> Now from your 187541ms number, things look worse, though.
> Do you have OpenTSDB or Ganglia to record metrics of that cluster? If so, can you share some graphs of IO/CPU during the query time?
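The /*+ NO_CACHE */ hint Lars mentions goes directly into the count query itself; a minimal sketch, with MY_TABLE standing in as a placeholder table name:

```sql
-- Ask Phoenix/HBase not to populate the block cache with blocks read
-- by this one-off full scan, so already-hot cached data is not evicted.
-- MY_TABLE is a placeholder for the actual table name.
SELECT /*+ NO_CACHE */ COUNT(1) FROM MY_TABLE;
```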
> Any chance to attach a profiler to one of the busy region servers, or at least get us a stack trace?
>
> Thanks.
>
> -- Lars
>
> ------------------------------
> From: Abe Weinograd <[email protected]>
> To: user <[email protected]>; lars hofhansl <[email protected]>
> Sent: Monday, October 13, 2014 9:30 AM
> Subject: Re: count on large table
>
> Hi Lars,
>
> Thanks for following up.
>
> Table size - 120GB, doing a du on HDFS. We are using Snappy compression on the table.
> Column family - We have 1 column family for all columns and are using the Phoenix default one.
> Regions - Right now we have a ton of regions (250) because we pre-split to help out bulk loads. I haven't collapsed them yet, but in a DEV environment that is configured the same way, we have ~50 regions and experience the same performance issues. I am planning on squaring this away and trying again.
> Resource utilization - Really high CPU usage on the region servers, and noticing a spike in IO too.
>
> Based on your questions and what I know, the # of regions needs to be compacted first, though I am not sure this is going to solve my issue. The data nodes in HDFS have 3 1TB disks, so I am not convinced that my IO is the bottleneck here.
>
> Thanks,
> Abe
>
> On Thu, Oct 9, 2014 at 8:36 PM, lars hofhansl <[email protected]> wrote:
>
> Hi Abe,
>
> this is interesting.
>
> How big are your rows (i.e. how much data is in the table; you can tell with du in HDFS)? And how many columns do you have? Any column families?
> How many regions are in this table? (You can tell that through the HBase HMaster UI page.)
> When you execute the query, are all HBase region servers busy? Do you see IO, or just high CPU?
>
> Client batching won't help with an aggregate (such as count) where not much data is transferred back to the client.
>
> Thanks.
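Salting, which Lars asks about, is declared when the table is created; a minimal Phoenix DDL sketch, with placeholder table and column names (10 buckets is only an assumption to match a 10-node cluster):

```sql
-- SALT_BUCKETS prepends a one-byte hash to each row key so writes and
-- parallel scans spread evenly across region servers instead of
-- hotspotting. Table/column names and the bucket count are placeholders.
CREATE TABLE MY_TABLE (
    ID  BIGINT NOT NULL PRIMARY KEY,
    VAL VARCHAR
) SALT_BUCKETS = 10;
```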
> -- Lars
>
> ------------------------------
> From: Abe Weinograd <[email protected]>
> To: user <[email protected]>
> Sent: Wednesday, October 8, 2014 9:15 AM
> Subject: Re: count on large table
>
> Good point. I have to figure out how to do that in a SQL tool like SQuirreL or Workbench.
>
> Is there any obvious thing I can do to help tune this? I know that's a loaded question. My client scanner batches are 1000 (also tried 10000 with no luck).
>
> Thanks,
> Abe
>
> On Tue, Oct 7, 2014 at 9:09 PM, [email protected] <[email protected]> wrote:
>
> Hi, Abe
>
> Maybe setting the following property would help...
>
> <property>
>   <name>phoenix.query.timeoutMs</name>
>   <value>3600000</value>
> </property>
>
> Thanks,
> Sun
>
> ------------------------------
> From: Abe Weinograd <[email protected]>
> Date: 2014-10-08 04:34
> To: user <[email protected]>
> Subject: count on large table
>
> I have a table with 1B rows. I know this is very specific to my environment, but just doing a SELECT COUNT(1) on the table never finishes.
>
> We have a 10-node cluster with the RS heap size at 26GiB, skewed towards the block cache. In the RS logs, I see a lot of these:
>
> 2014-10-07 16:27:04,942 WARN org.apache.hadoop.ipc.RpcServer: (responseTooSlow): {"processingtimems":22770,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"10.10.0.10:44791","starttimems":1412713602172,"queuetimems":0,"class":"HRegionServer","responsesize":8,"method":"Scan"}
>
> They stop eventually, but the query times out and the query tool reports: org.apache.phoenix.exception.PhoenixIOException: 187541ms passed since the last invocation, timeout is currently set to 60000
>
> Any ideas of where I can start in order to figure this out?
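Raising phoenix.query.timeoutMs as Sun suggests typically has to be paired with the matching HBase client timeouts, or the scan can still fail underneath Phoenix with the 60000ms default seen in the exception above; a client-side hbase-site.xml sketch (the 3600000ms values are illustrative, not a recommendation):

```xml
<!-- Client-side hbase-site.xml; all values are illustrative. -->
<!-- Phoenix-level query timeout, in milliseconds. -->
<property>
  <name>phoenix.query.timeoutMs</name>
  <value>3600000</value>
</property>
<!-- HBase RPC timeout, raised to match the Phoenix query timeout. -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>3600000</value>
</property>
<!-- Scanner lease/timeout period for long-running scans. -->
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>3600000</value>
</property>
```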
> Using Phoenix 4.1 on CDH 5.1 (HBase 0.98.1).
>
> Thanks,
> Abe
