Sorry, I reported it badly. It's 8192M. Thanks,
David.

On Feb 18, 2014, at 18:37, "Stephen Sprague" <sprag...@gmail.com> wrote:

> oh. I just noticed the -Xmx value you reported.
>
> there's no M or G after that number?? I'd like to see -Xmx8192M or
> -Xmx8G. That *is* very important.
>
> thanks,
> Stephen.
>
> On Tue, Feb 18, 2014 at 9:22 AM, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> thanks.
>>
>> re #1. we need to find that HiveServer2 process. For all I know, the one
>> you reported is HiveServer1 (which works). Chances are they use the same
>> -Xmx value, but we really shouldn't make any assumptions.
>>
>> try wide format on the ps command (e.g. ps -efw | grep -i hiveserver2)
>>
>> re #2. okay. So that tells us it's not the number of columns blowing the
>> heap but rather the combination of rows + columns. There's no way it
>> stores the full result set on the heap even under normal circumstances,
>> so my guess is there's an internal number of rows it buffers, sort of
>> like how unix buffers stdout. How and where that's set is out of my
>> league. However, maybe you can get around it by upping your heap size
>> again, if you have the available memory of course.
>>
>> On Tue, Feb 18, 2014 at 8:39 AM, David Gayou <david.ga...@kxen.com> wrote:
>>
>>> 1. I have no process with hiveserver2 ...
>>>
>>> "ps -ef | grep -i hive" returns a pretty long command with a -Xmx8192,
>>> and that's the value set in hive-env.sh
>>>
>>> 2. The "select * from table limit 1", or even limit 100, works correctly.
>>>
>>> David.
>>>
>>> On Tue, Feb 18, 2014 at 4:16 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>
>>>> He lives on after all! and thanks for the continued feedback.
>>>>
>>>> We need the answers to these questions using HS2:
>>>>
>>>> 1. what is the output of "ps -ef | grep -i hiveserver2" on your
>>>> system? in particular, what is the value of -Xmx?
>>>>
>>>> 2. does "select * from table limit 1" work?
>>>>
>>>> Thanks,
>>>> Stephen.
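Since the heap value lives in hive-env.sh, one way to pin it down is a fragment along these lines. This is a sketch, not a confirmed recipe from this thread: HADOOP_HEAPSIZE (in MB) is what the stock hive launcher passes to the JVM, and the $SERVICE test mirrors the pattern in hive-env.sh.template so the CLI keeps a smaller heap — check your own distribution's template before copying it.

```shell
# hive-env.sh fragment (sketch). Assumes the launcher honors HADOOP_HEAPSIZE
# and sets $SERVICE per service, as in hive-env.sh.template -- verify locally.
if [ "$SERVICE" = "hiveserver2" ]; then
  export HADOOP_HEAPSIZE=8192   # 8 GB; restart hiveserver2 after changing this
fi
```

After a restart, `ps -efw | grep -i hiveserver2` should show the matching `-Xmx8192m` on the command line.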
>>>>
>>>> On Tue, Feb 18, 2014 at 6:32 AM, David Gayou <david.ga...@kxen.com> wrote:
>>>>
>>>>> I'm so sorry, I wrote an answer and forgot to send it....
>>>>> And I haven't been able to work on this for a few days.
>>>>>
>>>>> So far:
>>>>>
>>>>> I have a table with 15k columns and 50k rows.
>>>>>
>>>>> I do not see any changes if I change the storage.
>>>>>
>>>>> *Hive 0.12.0*
>>>>>
>>>>> My test query is "select * from bigtable"
>>>>>
>>>>> If I use the hive CLI, it works fine.
>>>>>
>>>>> If I use hiveserver1 + ODBC: it works fine.
>>>>>
>>>>> If I use hiveserver2 + ODBC or hiveserver2 + beeline, I get this
>>>>> java exception:
>>>>>
>>>>> 2014-02-18 13:22:22,571 ERROR thrift.ProcessFunction
>>>>> (ProcessFunction.java:process(41)) - Internal error processing FetchResults
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>         at java.util.Arrays.copyOf(Arrays.java:2734)
>>>>>         at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>>>>>         at java.util.ArrayList.add(ArrayList.java:351)
>>>>>         at org.apache.hive.service.cli.thrift.TRow.addToColVals(TRow.java:160)
>>>>>         at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
>>>>>         at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
>>>>>         at org.apache.hive.service.cli.operation.SQLOperation.prepareFromRow(SQLOperation.java:270)
>>>>>         at org.apache.hive.service.cli.operation.SQLOperation.decode(SQLOperation.java:262)
>>>>>         at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:246)
>>>>>         at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:171)
>>>>>         at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:438)
>>>>>         at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:346)
>>>>>         at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:407)
>>>>>         at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
>>>>>
>>>>> *From the SVN trunk*: (for HIVE-3746)
>>>>>
>>>>> With the maven change, most of the documentation and wiki are out of date.
>>>>> Compiling from trunk was not that easy, and I may have missed some
>>>>> steps, but:
>>>>>
>>>>> It has the same behavior. It works in the CLI and hiveserver1.
>>>>> It fails with hiveserver2.
>>>>>
>>>>> Regards
>>>>>
>>>>> David Gayou
>>>>>
>>>>> On Thu, Feb 13, 2014 at 3:11 AM, Navis류승우 <navis....@nexr.com> wrote:
>>>>>
>>>>>> With HIVE-3746, which will be included in hive-0.13, HiveServer2
>>>>>> takes less memory than before.
>>>>>>
>>>>>> Could you try it with the version in trunk?
>>>>>>
>>>>>> 2014-02-13 10:49 GMT+09:00 Stephen Sprague <sprag...@gmail.com>:
>>>>>>
>>>>>>> question to the original poster. closure appreciated!
>>>>>>>
>>>>>>> On Fri, Jan 31, 2014 at 12:22 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>>>
>>>>>>>> thanks Ed. And on a separate tack, let's look at HiveServer2.
>>>>>>>>
>>>>>>>> @OP>
>>>>>>>>
>>>>>>>> *I've tried to look around on how I can change the thrift heap size
>>>>>>>> but haven't found anything.*
>>>>>>>>
>>>>>>>> looking at my hiveserver2 I find this:
>>>>>>>>
>>>>>>>> $ ps -ef | grep -i hiveserver2
>>>>>>>> dwr   9824 20479  0 12:11 pts/1   00:00:00 grep -i hiveserver2
>>>>>>>> dwr  28410     1  0 00:05 ?       00:01:04 /usr/lib/jvm/java-6-sun/jre/bin/java
>>>>>>>> *-Xmx256m* -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log
>>>>>>>> -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=
>>>>>>>> -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native
>>>>>>>> -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true
>>>>>>>> -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
>>>>>>>> /usr/lib/hive/lib/hive-service-0.12.0.jar
>>>>>>>> org.apache.hive.service.server.HiveServer2
>>>>>>>>
>>>>>>>> questions:
>>>>>>>>
>>>>>>>> 1. what is the output of "ps -ef | grep -i hiveserver2" on your
>>>>>>>> system? in particular, what is the value of -Xmx?
>>>>>>>>
>>>>>>>> 2. can you restart your hiveserver with -Xmx1g? or some value
>>>>>>>> that makes sense for your system?
>>>>>>>>
>>>>>>>> Lots of questions now. we await your answers! :)
>>>>>>>>
>>>>>>>> On Fri, Jan 31, 2014 at 11:51 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Final table compression should not affect the deserialized size
>>>>>>>>> of the data over the wire.
>>>>>>>>>
>>>>>>>>> On Fri, Jan 31, 2014 at 2:49 PM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Excellent progress David. So. The most important thing we learned
>>>>>>>>>> here is that it works (!) when running hive in local mode, and
>>>>>>>>>> that this error is a limitation in HiveServer2. That's important.
>>>>>>>>>>
>>>>>>>>>> so: textfile storage handler, and having issues converting it to
>>>>>>>>>> ORC. hmmm.
>>>>>>>>>>
>>>>>>>>>> follow-ups.
>>>>>>>>>>
>>>>>>>>>> 1. what is your query that fails?
>>>>>>>>>>
>>>>>>>>>> 2. can you add a "limit 1" to the end of your query and tell us
>>>>>>>>>> if that works? this'll tell us if it's column- or row-bound.
>>>>>>>>>>
>>>>>>>>>> 3.
>>>>>>>>>> bonus points. run these in local mode:
>>>>>>>>>> > set hive.exec.compress.output=true;
>>>>>>>>>> > set mapred.output.compression.type=BLOCK;
>>>>>>>>>> > set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>>>>>>>> > create table blah stored as ORC as select * from <your table>;   # I'm curious if this'll work.
>>>>>>>>>> > show create table blah;   # send output back if the previous step worked.
>>>>>>>>>>
>>>>>>>>>> 4. extra bonus. change ORC to SEQUENCEFILE in #3 and see if that
>>>>>>>>>> works any differently.
>>>>>>>>>>
>>>>>>>>>> I'm wondering if compression would have any effect on the size of
>>>>>>>>>> the internal ArrayList the thrift server uses.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 31, 2014 at 9:21 AM, David Gayou <david.ga...@kxen.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, so here is some news:
>>>>>>>>>>>
>>>>>>>>>>> I tried to boost HADOOP_HEAPSIZE to 8192,
>>>>>>>>>>> and I also set mapred.child.java.opts to 512M.
>>>>>>>>>>>
>>>>>>>>>>> It doesn't seem to have any effect.
>>>>>>>>>>> ------
>>>>>>>>>>>
>>>>>>>>>>> I tried it using an ODBC driver => fails after a few minutes.
>>>>>>>>>>> Using a local JDBC client (beeline) => runs forever without any error.
>>>>>>>>>>>
>>>>>>>>>>> Both through hiveserver2.
>>>>>>>>>>>
>>>>>>>>>>> If I use local mode: it works! (but that's not really what I
>>>>>>>>>>> need, as I don't really know how to access it with my software)
>>>>>>>>>>>
>>>>>>>>>>> ------
>>>>>>>>>>> I use a text file as storage.
>>>>>>>>>>> I tried to use ORC, but I can't populate it with a LOAD DATA
>>>>>>>>>>> (it returns a file format error).
>>>>>>>>>>>
>>>>>>>>>>> Using an "ALTER TABLE orange_large_train_3 SET FILEFORMAT ORC"
>>>>>>>>>>> after populating the table, I get a file format error on select.
>>>>>>>>>>>
>>>>>>>>>>> ------
>>>>>>>>>>>
>>>>>>>>>>> @Edward:
>>>>>>>>>>>
>>>>>>>>>>> I've tried to look around on how I can change the thrift heap
>>>>>>>>>>> size but haven't found anything.
>>>>>>>>>>> Same thing for my client (haven't found how to change the heap size).
>>>>>>>>>>>
>>>>>>>>>>> My use case really requires as many columns as possible.
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot for your help
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> David
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 31, 2014 at 1:12 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok, here are the problem(s). Thrift has frame size limits;
>>>>>>>>>>>> thrift has to buffer rows into memory.
>>>>>>>>>>>>
>>>>>>>>>>>> Hive's thrift server has a heap size; it needs to be big in this case.
>>>>>>>>>>>>
>>>>>>>>>>>> Your client needs a big heap size as well.
>>>>>>>>>>>>
>>>>>>>>>>>> The way to do this query, if it is possible at all, may be turning
>>>>>>>>>>>> the row lateral, potentially by treating it as a list; that will
>>>>>>>>>>>> make queries on it awkward.
>>>>>>>>>>>>
>>>>>>>>>>>> Good luck
>>>>>>>>>>>>
>>>>>>>>>>>> On Thursday, January 30, 2014, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>>>>>>>> > oh. thinking some more about this, I forgot to ask some other
>>>>>>>>>>>> basic questions.
>>>>>>>>>>>> >
>>>>>>>>>>>> > a) what storage format are you using for the table (text,
>>>>>>>>>>>> sequence, rcfile, orc or custom)? "show create table <table>"
>>>>>>>>>>>> would yield that.
>>>>>>>>>>>> >
>>>>>>>>>>>> > b) what command is causing the stack trace?
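The LOAD DATA and ALTER TABLE ... SET FILEFORMAT failures above fit together: neither rewrites the underlying files, so the text data stays text while the metadata says ORC. A CREATE TABLE ... AS SELECT does rewrite the data. A conversion script along these lines may work better (the table name is taken from the thread; the `_orc` suffix is made up for illustration):

```shell
# Sketch: write a small HiveQL script that converts the text-backed table
# to ORC by rewriting the data with CTAS, rather than flipping metadata.
cat > convert_to_orc.hql <<'EOF'
CREATE TABLE orange_large_train_3_orc
  STORED AS ORC
  AS SELECT * FROM orange_large_train_3;
EOF
# then run it through the CLI (local mode), e.g.:  hive -f convert_to_orc.hql
```

Running it in local mode sidesteps the HiveServer2 fetch path that is blowing the heap.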
>>>>>>>>>>>> >
>>>>>>>>>>>> > my thinking here is rcfile and orc are column based (I think),
>>>>>>>>>>>> and if you don't select all the columns, that could very well limit
>>>>>>>>>>>> the size of the "row" being returned and hence the size of the
>>>>>>>>>>>> internal ArrayList. OTOH, if you're using "select *", um, you have
>>>>>>>>>>>> my sympathies. :)
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Thu, Jan 30, 2014 at 11:33 AM, Stephen Sprague <sprag...@gmail.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > thanks for the information. Up-to-date hive. Cluster on the
>>>>>>>>>>>> smallish side. And, well, it sure looks like a memory issue :)
>>>>>>>>>>>> rather than an inherent hive limitation, that is.
>>>>>>>>>>>> >
>>>>>>>>>>>> > So. I can only speak as a user (i.e. not a hive developer), but
>>>>>>>>>>>> what I'd be interested in knowing next is: this is via running hive
>>>>>>>>>>>> in local mode, correct? (i.e. not through hiveserver1/2). And it
>>>>>>>>>>>> looks like it boinks on array processing, which I assume to be
>>>>>>>>>>>> internal code arrays and not hive data arrays - your 15K columns
>>>>>>>>>>>> are all scalar/simple types, correct? It's clearly fetching results
>>>>>>>>>>>> and looks to be trying to store them in a java array - and not just
>>>>>>>>>>>> one row but a *set* of rows (ArrayList).
>>>>>>>>>>>> >
>>>>>>>>>>>> > two things to try.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 1. boost the heap-size. try 8192. And I don't know if
>>>>>>>>>>>> HADOOP_HEAPSIZE is the controller of that. I would have hoped it
>>>>>>>>>>>> was called something like "HIVE_HEAPSIZE". :) Anyway, it can't hurt
>>>>>>>>>>>> to try.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2. trim down the number of columns and see where the breaking
>>>>>>>>>>>> point is. is it 10K? is it 5K? The idea is to confirm it's _the
>>>>>>>>>>>> number of columns_ that is causing the memory to blow and not some
>>>>>>>>>>>> other artifact unbeknownst to us.
>>>>>>>>>>>> >
>>>>>>>>>>>> > 3. Google around the Hive namespace for something that might
>>>>>>>>>>>> limit or otherwise control the number of rows stored at once in
>>>>>>>>>>>> Hive's internal buffer. I'll snoop around too.
>>>>>>>>>>>> >
>>>>>>>>>>>> > That's all I've got for now, and maybe we'll get lucky and
>>>>>>>>>>>> someone on this list will know something or other about this. :)
>>>>>>>>>>>> >
>>>>>>>>>>>> > cheers,
>>>>>>>>>>>> > Stephen.
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Thu, Jan 30, 2014 at 2:32 AM, David Gayou <david.ga...@kxen.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > We are using Hive 0.12.0, but it doesn't work better on
>>>>>>>>>>>> hive 0.11.0 or hive 0.10.0.
>>>>>>>>>>>> > Our hadoop version is 1.1.2.
>>>>>>>>>>>> > Our cluster is 1 master + 4 slaves, each with 1 dual-core Xeon CPU
>>>>>>>>>>>> (with hyperthreading, so 4 cores per machine) + 16GB RAM.
>>>>>>>>>>>> >
>>>>>>>>>>>> > The error message I get is:
>>>>>>>>>>>> >
>>>>>>>>>>>> > 2014-01-29 12:41:09,086 ERROR thrift.ProcessFunction
>>>>>>>>>>>> (ProcessFunction.java:process(41)) - Internal error processing FetchResults
>>>>>>>>>>>> > java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>>>> >         at java.util.Arrays.copyOf(Arrays.java:2734)
>>>>>>>>>>>> >         at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
>>>>>>>>>>>> >         at java.util.ArrayList.add(ArrayList.java:351)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.Row.<init>(Row.java:47)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.RowSet.addRow(RowSet.java:61)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:235)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:170)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:417)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:306)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:386)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1373)
>>>>>>>>>>>> >         at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1358)
>>>>>>>>>>>> >         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>>>>>>>>>>>> >         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>>>>>>>>>>>> >         at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58)
>>>>>>>>>>>> >         at org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55)
>>>>>>>>>>>> >         at java.security.AccessCont
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Sorry, this was sent from mobile. Will do less grammar and spell
>>>>>>>>>>>> check than usual.
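Edward's point about thrift buffering rows in memory is easy to sanity-check with back-of-envelope arithmetic. Both figures below are assumptions for illustration only: the per-cell overhead is a guessed JVM object cost, and the rows-per-fetch count is hypothetical (the thread never establishes the real buffer size).

```shell
# Rough size of one buffered fetch batch in HiveServer2.
cols=15000            # column count from the thread
rows_buffered=1000    # hypothetical rows buffered per thrift fetch
bytes_per_cell=16     # assumed JVM overhead per column value object
echo "$(( cols * rows_buffered * bytes_per_cell / 1024 / 1024 )) MB"   # → 228 MB
```

Even with these conservative guesses, a single batch approaches the `-Xmx256m` default seen in Stephen's ps output, which is consistent with the OOM appearing in `RowSet.addRow` during FetchResults.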