Hi guys,

I was able to get this to work after using bigger VMs for the data nodes. However, the bigger problem I am facing now is that after my MR job completes successfully, I am not seeing any rows loaded in my table (the count shows 0 both via Phoenix and HBase).
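For reference, the counts I mention were along these lines (MY_TABLE stands in for my actual table name, which isn't shown in this thread):

    -- from the Phoenix sqlline client (sqlline.py)
    SELECT COUNT(*) FROM MY_TABLE;

    # from the HBase shell (hbase shell)
    count 'MY_TABLE'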
Am I missing something simple?

Thanks,
Gaurav

On 12 September 2015 at 11:16, Gabriel Reid <gabriel.r...@gmail.com> wrote:
> Around 1400 mappers sounds about normal to me -- I assume your block
> size on HDFS is 128 MB, which works out to 1500 mappers for 200 GB of
> input.
>
> To add to what Krishna asked, can you be a bit more specific on what
> you're seeing (in log files or elsewhere) which leads you to believe
> the data nodes are running out of capacity? Are map tasks failing?
>
> If this is indeed a capacity issue, one thing you should ensure is
> that map output compression is enabled. This doc from Cloudera
> explains this (and the same information applies whether you're using
> CDH or not):
> http://www.cloudera.com/content/cloudera/en/documentation/cdh4/latest/CDH4-Installation-Guide/cdh4ig_topic_23_3.html
>
> In any case, apart from that there isn't any basic thing that you're
> probably missing, so any additional information that you can supply
> about what you're running into would be useful.
>
> - Gabriel
>
>
> On Sat, Sep 12, 2015 at 2:17 AM, Krishna <research...@gmail.com> wrote:
> > 1400 mappers on 9 nodes is about 155 mappers per datanode, which
> > sounds high to me. There are very few specifics in your mail. Are
> > you using YARN? Can you provide details like table structure, # of
> > rows & columns, etc.? Do you have an error stack?
> >
> >
> > On Friday, September 11, 2015, Gaurav Kanade <gaurav.kan...@gmail.com>
> > wrote:
> >>
> >> Hi All
> >>
> >> I am new to Apache Phoenix (and relatively new to MR in general) but
> >> I am trying a bulk insert of a 200GB tab-separated file into an HBase
> >> table. This seems to start off fine and kicks off about 1400 mappers
> >> and 9 reducers (I have 9 data nodes in my setup).
> >>
> >> At some point I seem to be running into problems with this process
> >> as it seems the data nodes run out of capacity (from what I can see
> >> my data nodes have 400GB local space). It does seem that certain
> >> reducers eat up most of the capacity on these - thus slowing down the
> >> process to a crawl and ultimately leading to Node Managers
> >> complaining that Node Health is bad (log-dirs and local-dirs are
> >> bad).
> >>
> >> Is there some inherent setting I am missing that I need to set up
> >> for the particular job?
> >>
> >> Any pointers would be appreciated.
> >>
> >> Thanks
> >>
> >> --
> >> Gaurav Kanade,
> >> Software Engineer
> >> Big Data
> >> Cloud and Enterprise Division
> >> Microsoft

--
Gaurav Kanade,
Software Engineer
Big Data
Cloud and Enterprise Division
Microsoft
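(A note on the map output compression suggestion in the quoted reply above: assuming the load is driven by Phoenix's CsvBulkLoadTool -- the exact command used here isn't shown in the thread -- the standard MRv2 compression properties can be passed per job roughly as sketched below. The jar name, table name, input path, and ZooKeeper quorum are placeholders, and SnappyCodec is just one possible codec.)

    # $'\t' is bash syntax for a literal tab delimiter;
    # delimiter handling may differ between Phoenix versions.
    hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        -Dmapreduce.map.output.compress=true \
        -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
        --table MY_TABLE \
        --input /path/to/input.tsv \
        --delimiter $'\t' \
        --zookeeper zk1,zk2,zk3:2181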