Prasanth Jayachandran created HIVE-19206:
--------------------------------------------

             Summary: Automatic memory management for open streaming writers
                 Key: HIVE-19206
                 URL: https://issues.apache.org/jira/browse/HIVE-19206
             Project: Hive
          Issue Type: Sub-task
          Components: Streaming
    Affects Versions: 3.0.0, 3.1.0
         Environment: Problem:
When hundreds of record updaters are open, the memory required by 
ORC writers keeps growing because of ORC's internal buffers. This can lead to 
high GC pressure or OOM errors during streaming ingest.

Solution:
The high-level idea is for the streaming connection to keep track of all open 
record updaters and flush them periodically (at some interval). 
The number of records written to each record updater can be used as a metric to 
choose the candidate record updaters for flushing. 
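
A minimal sketch of the idea above, assuming a simplified writer abstraction. 
FlushableWriter, OpenWriterRegistry and CountingWriter are hypothetical names 
used only for illustration; they are not Hive's RecordUpdater or streaming 
API. The registry tracks open writers and flushes the ones holding the most 
buffered records:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an open record updater (NOT Hive's RecordUpdater).
interface FlushableWriter {
    long bufferedRecords();   // records written since the last flush
    void flush();             // release the buffered data
}

// Hypothetical registry a streaming connection could keep for its open writers.
final class OpenWriterRegistry {
    private final List<FlushableWriter> writers = new ArrayList<>();

    void register(FlushableWriter w) {
        writers.add(w);
    }

    // Flush up to maxToFlush writers, largest record count first.
    // Writers with nothing buffered are skipped. Returns the number flushed.
    int flushLargest(int maxToFlush) {
        List<FlushableWriter> sorted = new ArrayList<>(writers);
        // Sort in descending order of buffered records.
        sorted.sort((a, b) -> Long.compare(b.bufferedRecords(), a.bufferedRecords()));
        int flushed = 0;
        for (FlushableWriter w : sorted) {
            if (flushed >= maxToFlush || w.bufferedRecords() == 0) {
                break;
            }
            w.flush();
            flushed++;
        }
        return flushed;
    }
}

// Minimal in-memory writer used only to exercise the registry.
final class CountingWriter implements FlushableWriter {
    private long buffered;

    void write(long records) {
        buffered += records;
    }

    @Override
    public long bufferedRecords() {
        return buffered;
    }

    @Override
    public void flush() {
        buffered = 0;
    }
}
```

The periodic part could be driven by a timer, for example a 
ScheduledExecutorService calling flushLargest at a fixed interval; the 
interval and the number of writers flushed per pass are tuning choices this 
sketch does not prescribe.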
If the stripe size of the ORC file is 64MB, the default memory management check 
happens only after every 5000 rows, which may be too late when there are too 
many concurrent writers in a process. For example, with 100 writers open and 
each of them holding an almost full stripe of 64MB buffered data, this would take 
100*64MB ~=6GB of memory. When all of the record writers flush, the memory 
usage drops to 100*~2MB, which is just ~200MB. 
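
The arithmetic above can be checked with a small helper. This is only a 
worked example; WriterMemoryEstimate is a hypothetical name, and the 64MB and 
~2MB per-writer figures are the ones quoted in this description, not measured 
values:

```java
// Hypothetical helper that reproduces the estimate from the description.
final class WriterMemoryEstimate {
    static final long MB = 1024L * 1024L;

    // Total buffered bytes when every writer holds perWriterBytes.
    static long totalBytes(int writers, long perWriterBytes) {
        return (long) writers * perWriterBytes;
    }

    // Same total expressed in whole megabytes.
    static long totalMegabytes(int writers, long perWriterBytes) {
        return totalBytes(writers, perWriterBytes) / MB;
    }
}
```

With 100 writers, totalMegabytes(100, 64 * MB) gives 6400 (roughly 6GB) before 
flushing, and totalMegabytes(100, 2 * MB) gives 200 after flushing.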
            Reporter: Prasanth Jayachandran
            Assignee: Prasanth Jayachandran






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
