thanks.

re #1.  we need to find that HiveServer2 process. For all I know, the one
you reported is HiveServer1 (which works). Chances are they use the same
-Xmx value, but we really shouldn't make any assumptions.

try the wide format of the ps command (e.g. ps -efw | grep -i hiveserver2)
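
Something along these lines should pin down the value (a rough sketch; the
extra w's just keep ps from truncating the command line, and the second
grep is only a convenience to pull out the flag):

   $ ps -efww | grep -i '[h]iveserver2' | grep -o -- '-Xmx[^ ]*'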

re #2.  okay. So that tells us it's not the number of columns alone blowing
the heap but rather the combination of rows + columns.  It shouldn't be
storing the full result set on the heap even under normal circumstances, so
my guess is there's an internal number of rows it buffers per fetch, sort of
like how unix buffers stdout.  How and where that's set is out of my league.
However, maybe you can get around it by upping your heap size again, if you
have the memory available of course.
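
If your install reads HADOOP_HEAPSIZE from hive-env.sh (sounds like yours
does, given the -Xmx8192 you found there), a rough sketch, with the number
being only a guess at what your box can spare:

   # in hive-env.sh (wherever your distribution keeps it):
   export HADOOP_HEAPSIZE=12288        # MB; whatever you can actually spare

   # then restart the server so it picks up the new value:
   $ hive --service hiveserver2 &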


On Tue, Feb 18, 2014 at 8:39 AM, David Gayou <david.ga...@kxen.com> wrote:

>
> 1. I have no process with hiveserver2 ...
>
> "ps -ef | grep -i hive"  return some pretty long command with a -Xmx8192
> and that's the value set in hive-env.sh
>
>
> 2. The "select * from table limit 1" or even 100 is working correctly.
>
>
> David.
>
>
> On Tue, Feb 18, 2014 at 4:16 PM, Stephen Sprague <sprag...@gmail.com>wrote:
>
>> He lives on after all! And thanks for the continued feedback.
>>
>> We need the answers to these questions using HS2:
>>
>>
>>
>>    1. what is the output of "ps -ef | grep -i hiveserver2" on your
>> system? In particular, what is the value of -Xmx?
>>
>>    2. does "select * from table limit 1" work?
>>
>> Thanks,
>> Stephen.
>>
>>
>>
>> On Tue, Feb 18, 2014 at 6:32 AM, David Gayou <david.ga...@kxen.com>wrote:
>>
>>> I'm so sorry, I wrote an answer and forgot to send it...
>>> And I haven't been able to work on this for a few days.
>>>
>>>
>>> So far:
>>>
>>> I have a table with 15k columns and 50k rows.
>>>
>>> I do not see any change if I change the storage format.
>>>
>>>
>>> *Hive 0.12.0*
>>>
>>> My test query is "select * from bigtable"
>>>
>>>
>>> If I use the Hive CLI, it works fine.
>>>
>>> If I use HiveServer1 + ODBC, it works fine.
>>>
>>> If I use HiveServer2 + ODBC or HiveServer2 + beeline, I get this Java
>>> exception:
>>>
>>> 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
>>> (ProcessFunction.java:process(41)) - Internal error processing FetchResults
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at java.util.Arrays.copyOf(Arrays.java:2734)
>>>         at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>>>         at java.util.ArrayList.add(ArrayList.java:351)
>>>          at
>>> org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
>>>         at
>>> org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
>>>         at
>>> org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
>>>         at
>>> org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
>>>         at
>>> org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
>>>         at
>>> org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
>>>         at
>>> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
>>>         at
>>> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
>>>         at
>>> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
>>>         at
>>> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
>>>         at
>>> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
>>>
>>>
>>>
>>>
>>> *From the SVN trunk* (for HIVE-3746):
>>>
>>> With the maven change, most of the documentation and the wiki are out of
>>> date.
>>> Compiling from trunk was not that easy, and I may have missed some steps,
>>> but:
>>>
>>> It has the same behavior: it works in the CLI and with HiveServer1.
>>> It fails with HiveServer2.
>>>
>>>
>>> Regards
>>>
>>> David Gayou
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 <navis....@nexr.com> wrote:
>>>
>>>> With HIVE-3746, which will be included in hive-0.13, HiveServer2 takes
>>>> less memory than before.
>>>>
>>>> Could you try it with the version in trunk?
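>>>>
>>>> (A rough sketch of the build since the maven switch; the hadoop-1
>>>> profile is a guess for your 1.x cluster, and the exact steps may have
>>>> changed on trunk:)
>>>>
>>>>    $ svn co http://svn.apache.org/repos/asf/hive/trunk hive-trunk
>>>>    $ cd hive-trunk
>>>>    $ mvn clean install -DskipTests -Phadoop-1
>>>>    $ mvn package -Pdist -DskipTests -Phadoop-1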
>>>>
>>>>
>>>> 2014-02-13 10:49 GMT+09:00 Stephen Sprague <sprag...@gmail.com>:
>>>>
>>>>> question to the original poster.  closure appreciated!
>>>>>
>>>>>
>>>>> On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague 
>>>>> <sprag...@gmail.com>wrote:
>>>>>
>>>>>> Thanks, Ed. And on a separate tack, let's look at HiveServer2.
>>>>>>
>>>>>>
>>>>>> @OP>
>>>>>>
>>>>>> *I've tried to look around on how i can change the thrift heap size
>>>>>> but haven't found anything.*
>>>>>>
>>>>>>
>>>>>> Looking at my HiveServer2, I find this:
>>>>>>
>>>>>>    $ ps -ef | grep -i hiveserver2
>>>>>>    dwr       9824 20479  0 12:11 pts/1    00:00:00 grep -i hiveserver2
>>>>>>    dwr      28410     1  0 00:05 ?        00:01:04
>>>>>> /usr/lib/jvm/java-6-sun/jre/bin/java 
>>>>>> *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs
>>>>>> -Dhadoop.log.file=hadoop.log
>>>>>> -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
>>>>>> -Dhadoop.root.logger=INFO,console
>>>>>> -Djava.library.path=/usr/lib/hadoop/lib/native
>>>>>> -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
>>>>>> -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
>>>>>> /usr/lib/hive/lib/hive-service-0.12.0.jar
>>>>>> org.apache.hive.service.server.HiveServer2
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> questions:
>>>>>>
>>>>>>    1. what is the output of "ps -ef | grep -i hiveserver2" on your
>>>>>> system? In particular, what is the value of -Xmx?
>>>>>>
>>>>>>    2. can you restart your hiveserver2 with -Xmx1g, or some value that
>>>>>> makes sense for your system?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Lots of questions now. We await your answers! :)
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 31, 2014 at 11:51 AM, Edward Capriolo <
>>>>>> edlinuxg...@gmail.com> wrote:
>>>>>>
>>>>>>> Final table compression should not affect the deserialized size of
>>>>>>> the data over the wire.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 31, 2014 at 2:49 PM, Stephen Sprague <sprag...@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Excellent progress, David.  So the most important thing we learned
>>>>>>>> here is that it works (!) when running Hive in local mode, and that
>>>>>>>> this error is a limitation in HiveServer2.  That's important.
>>>>>>>>
>>>>>>>> So: textfile storage, and having issues converting it to ORC.
>>>>>>>> Hmmm.
>>>>>>>>
>>>>>>>> follow-ups.
>>>>>>>>
>>>>>>>> 1. what is your query that fails?
>>>>>>>>
>>>>>>>> 2. can you add a "limit 1" to the end of your query and tell us if
>>>>>>>> that works? This'll tell us whether it's column- or row-bound.
>>>>>>>>
>>>>>>>> 3. bonus points. run these in local mode:
>>>>>>>>       > set hive.exec.compress.output=true;
>>>>>>>>       > set mapred.output.compression.type=BLOCK;
>>>>>>>>       > set
>>>>>>>> mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>>>>>>       > create table blah stored as ORC as select * from <your
>>>>>>>> table>;   #i'm curious if this'll work.
>>>>>>>>       > show create table blah;  #send output back if previous step
>>>>>>>> worked.
>>>>>>>>
>>>>>>>> 4. extra bonus.  change ORC to SEQUENCEFILE in #3 and see if that
>>>>>>>> works any differently (variant sketched just below).
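>>>>>>>>
>>>>>>>> e.g., same session, made-up table name, only the storage clause changes:
>>>>>>>>       > create table blah_seq stored as SEQUENCEFILE as select * from
>>>>>>>> <your table>;
>>>>>>>>       > show create table blah_seq;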
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm wondering if compression would have any effect on the size of
>>>>>>>> the internal ArrayList the thrift server uses.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jan 31, 2014 at 9:21 AM, David Gayou 
>>>>>>>> <david.ga...@kxen.com>wrote:
>>>>>>>>
>>>>>>>>> OK, so here is some news:
>>>>>>>>>
>>>>>>>>> I tried to boost HADOOP_HEAPSIZE to 8192,
>>>>>>>>> and I also set mapred.child.java.opts to 512M.
>>>>>>>>>
>>>>>>>>> It doesn't seem to have any effect.
>>>>>>>>>  ------
>>>>>>>>>
>>>>>>>>> I tried it using an ODBC driver => fails after a few minutes.
>>>>>>>>> Using a local JDBC client (beeline) => runs forever without any error.
>>>>>>>>>
>>>>>>>>> Both through HiveServer2.
>>>>>>>>>
>>>>>>>>> If I use local mode: it works!   (But that's not really what I
>>>>>>>>> need, as I don't really know how to access it from my software.)
>>>>>>>>>
>>>>>>>>> ------
>>>>>>>>> I use a text file as storage.
>>>>>>>>> I tried to use ORC, but I can't populate it with LOAD DATA (it
>>>>>>>>> returns a file format error).
>>>>>>>>>
>>>>>>>>> Using an "ALTER TABLE orange_large_train_3 SET FILEFORMAT ORC"
>>>>>>>>> after populating the table, I get a file format error on select.
>>>>>>>>>
>>>>>>>>> ------
>>>>>>>>>
>>>>>>>>> @Edward:
>>>>>>>>>
>>>>>>>>> I've tried to look around for how I can change the Thrift heap size
>>>>>>>>> but haven't found anything.
>>>>>>>>> Same thing for my client (I haven't found how to change its heap
>>>>>>>>> size).
>>>>>>>>>
>>>>>>>>> My use case really requires as many columns as possible.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks a lot for your help
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jan 31, 2014 at 1:12 AM, Edward Capriolo <
>>>>>>>>> edlinuxg...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> OK, here are the problem(s): Thrift has frame size limits, and
>>>>>>>>>> Thrift has to buffer rows into memory.
>>>>>>>>>>
>>>>>>>>>> Hive's Thrift server has a heap size; it needs to be big in this case.
>>>>>>>>>>
>>>>>>>>>> Your client needs a big heap size as well.
>>>>>>>>>>
>>>>>>>>>> The way to do this query, if it is possible at all, may be turning
>>>>>>>>>> the row lateral, potentially by treating it as a list; that will make
>>>>>>>>>> queries on it awkward.
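>>>>>>>>>>
>>>>>>>>>> Rough idea only, with made-up names (getting the 15k columns into
>>>>>>>>>> the map is the hard part and is not shown):
>>>>>>>>>>       > create table bigtable_lateral (id string, features map<string,string>);
>>>>>>>>>>       > select features['col_00123'] from bigtable_lateral limit 10;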
>>>>>>>>>>
>>>>>>>>>> Good luck
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thursday, January 30, 2014, Stephen Sprague <
>>>>>>>>>> sprag...@gmail.com> wrote:
>>>>>>>>>> > Oh, thinking some more about this, I forgot to ask some other
>>>>>>>>>> basic questions.
>>>>>>>>>> >
>>>>>>>>>> > a) what storage format are you using for the table (text,
>>>>>>>>>> sequence, rcfile, orc or custom)?   "show create table <table>" 
>>>>>>>>>> would yield
>>>>>>>>>> that.
>>>>>>>>>> >
>>>>>>>>>> > b) what command is causing the stack trace?
>>>>>>>>>> >
>>>>>>>>>> > My thinking here is that rcfile and orc are column-based (I think),
>>>>>>>>>> and if you don't select all the columns, that could very well limit
>>>>>>>>>> the size of the "row" being returned and hence the size of the
>>>>>>>>>> internal ArrayList.  OTOH, if you're using "select *", um, you have
>>>>>>>>>> my sympathies. :)
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Thu, Jan 30, 2014 at 11:33 AM, Stephen Sprague <
>>>>>>>>>> sprag...@gmail.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Thanks for the information. Up-to-date Hive. Cluster on the
>>>>>>>>>> smallish side. And, well, it sure looks like a memory issue :) rather
>>>>>>>>>> than an inherent Hive limitation, that is.
>>>>>>>>>> >
>>>>>>>>>> > So.  I can only speak as a user (i.e. not a Hive developer), but
>>>>>>>>>> what I'd be interested in knowing next is: this is via running Hive
>>>>>>>>>> in local mode, correct? (e.g. not through hiveserver1/2).  And it
>>>>>>>>>> looks like it boinks on array processing, which I assume to be
>>>>>>>>>> internal code arrays and not Hive data arrays; your 15K columns are
>>>>>>>>>> all scalar/simple types, correct?  It's clearly fetching results and
>>>>>>>>>> looks to be trying to store them in a Java array, and not just one
>>>>>>>>>> row but a *set* of rows (ArrayList).
>>>>>>>>>> >
>>>>>>>>>> > three things to try.
>>>>>>>>>> >
>>>>>>>>>> > 1. boost the heap size. try 8192. And I don't know if
>>>>>>>>>> HADOOP_HEAPSIZE is the controller of that; I would have hoped it was
>>>>>>>>>> called something like "HIVE_HEAPSIZE". :)  Anyway, it can't hurt to try.
>>>>>>>>>> >
>>>>>>>>>> > 2. trim down the number of columns and see where the breaking
>>>>>>>>>> point is.  Is it 10K? Is it 5K?   The idea is to confirm it's _the
>>>>>>>>>> number of columns_ that is causing the memory to blow and not some
>>>>>>>>>> other artifact unbeknownst to us.
>>>>>>>>>> >
>>>>>>>>>> > 3. Google around the Hive namespace for something that might
>>>>>>>>>> limit or otherwise control the number of rows stored at once in
>>>>>>>>>> Hive's internal buffer. I'll snoop around too.
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > That's all I've got for now, and maybe we'll get lucky and someone
>>>>>>>>>> on this list will know something or other about this. :)
>>>>>>>>>> >
>>>>>>>>>> > cheers,
>>>>>>>>>> > Stephen.
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > On Thu, Jan 30, 2014 at 2:32 AM, David Gayou <
>>>>>>>>>> david.ga...@kxen.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > We are using Hive 0.12.0, but it doesn't work any better on
>>>>>>>>>> Hive 0.11.0 or Hive 0.10.0.
>>>>>>>>>> > Our Hadoop version is 1.1.2.
>>>>>>>>>> > Our cluster is 1 master + 4 slaves, each with 1 dual-core Xeon CPU
>>>>>>>>>> (with hyperthreading, so 4 cores per machine) and 16 GB RAM.
>>>>>>>>>> >
>>>>>>>>>> > The error message i get is :
>>>>>>>>>> >
>>>>>>>>>> > 2014-01-29 12:41:09,086 ERROR thrift.ProcessFunction
>>>>>>>>>> (ProcessFunction.java:process(41)) - Internal error processing 
>>>>>>>>>> FetchResults
>>>>>>>>>> > java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>> >         at java.util.Arrays.copyOf(Arrays.java:2734)
>>>>>>>>>> >         at
>>>>>>>>>> java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>>>>>>>>>> >         at java.util.ArrayList.add(ArrayList.java:351)
>>>>>>>>>> >         at org.apache.hive.service.cli.Row.<init>(Row.java:47)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.cli.RowSet.addRow(RowSet.java:61)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:235)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:170)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:417)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:306)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:386)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1358)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
>>>>>>>>>> >         at
>>>>>>>>>> org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
>>>>>>>>>> >         at java.security.AccessCont
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sorry this was sent from mobile. Will do less grammar and spell
>>>>>>>>>> check than usual.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
