[ https://issues.apache.org/jira/browse/HIVE-16904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148234#comment-16148234 ]

anishek commented on HIVE-16904:
--------------------------------

On internal runs we saw that for 10000 partitions with one file each, the dump 
created a metadata file of about 16 MB. Extrapolating that to roughly 20 MB per 
10000 partitions (to allow for additional properties, files, etc.), 1 million 
partitions would give a file of about 2 GB.

Adding about another 50% for Java object overhead, we should still need only about 
3 GB of RAM to process this file, which does not seem too large.
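
A quick back-of-envelope version of that math (the class name and constants below 
are just the rough numbers from above, nothing measured):

{code}
public class MetadataSizeEstimate {
  public static void main(String[] args) {
    // ~16 MB observed for 10000 partitions, padded to ~20 MB for extra properties/files
    long bytesPer10kPartitions = 20L * 1024 * 1024;
    long partitions = 1_000_000;
    double fileGb = (bytesPer10kPartitions / 10_000.0) * partitions / (1024.0 * 1024 * 1024);
    double heapGb = fileGb * 1.5; // ~50% extra for java object overhead
    System.out.printf("metadata file ~%.1f GB, heap needed ~%.1f GB%n", fileGb, heapGb);
  }
}
{code}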

So parking this for now; will come back to it later if there is still an issue.

Sample code showing the streaming approach:

{code}

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.thrift.TDeserializer;
import org.apache.thrift.TException;
import org.apache.thrift.protocol.TJSONProtocol;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.MappingJsonFactory;
import org.codehaus.jackson.map.ObjectMapper;
import org.json.JSONObject;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

import static org.junit.Assert.fail;

public class StreamingJsonTests {

  @Test
  public void testStreaming() throws IOException, TException {
    TDeserializer deserializer = new TDeserializer(new TJSONProtocol.Factory());
    ObjectMapper mapper = new ObjectMapper();
    JsonFactory factory = new MappingJsonFactory();
    printMemory("before reading file to parser");
    JsonParser parser =
        factory.createJsonParser(new File("_metadata"));
    if (parser.nextToken() != JsonToken.START_OBJECT) {
      fail("can't parse the file");
    }
    // advance to the "partitions" field without materializing the rest of the document
    for (JsonToken jsonToken = parser.nextToken();
         jsonToken != JsonToken.END_OBJECT; jsonToken = parser.nextToken()) {
      if (parser.getCurrentName().equalsIgnoreCase("partitions")) {
        break;
      }
    }
    int count = 0;
    printMemory("after finding out the partitions object location");
    if (parser.nextToken() == JsonToken.START_ARRAY) {
      while (parser.nextToken() != JsonToken.END_ARRAY) {
        // read one array element at a time; each element is a string holding the
        // TJSONProtocol-serialized Partition
        JsonNode jsonNode = mapper.readTree(parser);
        Partition partition = new Partition();
        deserializer.deserialize(partition, jsonNode.asText(), "UTF-8");
        count++;
      }
      System.out.println("number of partitions :" + count);
    } else {
      fail("no partitions array token");
    }
    parser.close();
  }

  @Test
  public void testRegular() throws IOException {
    printMemory("before starting");
    // baseline: read and parse the entire _metadata file into memory in one shot
    JSONObject jsonObject = new JSONObject(
        FileUtils.readFileToString(new File("_metadata")));
    printMemory("after reading the file");
    jsonObject.toString();
  }

  private void printMemory(String msg) {
    Runtime runtime = Runtime.getRuntime();
    runtime.gc();
    long usedMemory = runtime.totalMemory() - runtime.freeMemory();
    System.out.println(msg + " KB used : " + usedMemory / 1024);
  }

}

{code}

An additional problem to look at is the overhead that bootstrap creates on the 
namenode: every partition gets its own directory hierarchy (one directory level per 
partition column of the table) to store the {{_files}}.
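
For illustration, for a table partitioned on two columns the bootstrap dump would 
lay out roughly one directory chain and one {{_files}} per partition (hypothetical 
paths):

{code}
<dumpRoot>/<dbName>/<tableName>/_metadata
<dumpRoot>/<dbName>/<tableName>/year=2017/month=06/_files
<dumpRoot>/<dbName>/<tableName>/year=2017/month=07/_files
...
{code}

With 1 million partitions that means millions of extra directories and small files 
for the namenode to track.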

> during repl load for large number of partitions the metadata file can be huge 
> and can lead to out of memory 
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-16904
>                 URL: https://issues.apache.org/jira/browse/HIVE-16904
>             Project: Hive
>          Issue Type: Sub-task
>          Components: HiveServer2
>    Affects Versions: 3.0.0
>            Reporter: anishek
>            Assignee: anishek
>             Fix For: 3.0.0
>
>
> The metadata pertaining to a table and its partitions is stored in a single 
> file. During repl load, all of this data is loaded into memory in one shot and 
> then the individual partitions are processed. This can lead to a huge memory 
> overhead, as the entire file is read into memory. Try to deserialize the 
> partition objects with some sort of streaming JSON deserializer. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
