[ https://issues.apache.org/jira/browse/NIFI-11178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lehel Boér reassigned NIFI-11178:
---------------------------------

    Assignee: Lehel Boér

> Improve memory efficiency of ListHDFS
> -------------------------------------
>
>                 Key: NIFI-11178
>                 URL: https://issues.apache.org/jira/browse/NIFI-11178
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Mark Payne
>            Assignee: Lehel Boér
>            Priority: Major
>
> ListHDFS is used extremely commonly. Typically, a listing consists of several 
> hundred files or fewer. However, when the Processor is configured to recurse 
> into subdirectories (especially when performing the first listing), it can 
> produce a listing containing millions of files.
> Currently, performing a listing of millions of files can occupy several GB of 
> heap space. Analyzing a recent heap dump, it was found that a listing of 6.7 
> million files in HDFS occupied approximately 12 GB of heap space in NiFi. The 
> heap usage can be traced back to the fact that we hold a HashSet<FileStatus> 
> in a local variable, and each of these FileStatus objects occupies 
> approximately 1-2 KB of heap.
> There are several improvements that can be made here; some small changes yield 
> significant memory savings. There are also larger changes that could reduce 
> heap utilization dramatically, to nearly nothing, but these would require 
> complex rewrites of the Processor that would be much more difficult to 
> maintain.
> As a simple analysis, I built a small test to see how many FileStatus objects 
> could be kept in memory:
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.permission.FsPermission;
>
> final Set<FileStatus> statuses = new HashSet<>();
> for (int i = 0; i < 10_000_000; i++) {
>     if (i % 10_000 == 0) {
>         System.out.println(i); // progress output so we can see how far we get before OOME
>     }
>     final FsPermission fsPermission = new FsPermission(777);
>     final FileStatus status = new FileStatus(2, true, 1, 4_000_000, System.currentTimeMillis(), 0L,
>         fsPermission, "owner-" + i, "group-" + i, null, new Path("/path/to/my/file-" + i + ".txt"),
>         true, false, false);
>     statuses.add(status);
> }
> {code}
> This gives us a way to see how many FileStatus objects can be added to our set 
> before we encounter an OutOfMemoryError. With a 512 MB heap I reached 
> approximately 1.13 million FileStatus objects. Note that the Paths here are 
> very short, so they occupy less memory than would normally be the case.
> Making one small change, replacing the {{Set<FileStatus>}} with an 
> {{ArrayList<FileStatus>}}, yielded 1.21 million objects instead of 1.13 
> million. This is reasonable, since we don't expect any duplicates anyway.
> Another small change was to introduce a new class that keeps only the fields 
> we care about from the FileStatus:
> {code:java}
> private static class HdfsEntity {
>     private final String path;
>     private final long length;
>     private final boolean isdir;
>     private final long timestamp;
>     private final boolean canRead;
>     private final boolean canWrite;
>     private final boolean canExecute;
>     private final boolean encrypted;
>     public HdfsEntity(final FileStatus status) {
>         this.path = status.getPath().getName();
>         this.length = status.getLen();
>         this.isdir = status.isDirectory();
>         this.timestamp = status.getModificationTime();
>         this.canRead = 
> status.getPermission().getGroupAction().implies(FsAction.READ);
>         this.canWrite = 
> status.getPermission().getGroupAction().implies(FsAction.WRITE);
>         this.canExecute = 
> status.getPermission().getGroupAction().implies(FsAction.EXECUTE);
>         this.encrypted = status.isEncrypted();
>     }
> }
>  {code}
> This introduced significant savings, allowing a {{List<HdfsEntity>}} to store 
> 5.2 million objects instead of 1.13 million.
> It is worth noting that HdfsEntity doesn't store the 'group' and 'owner' that 
> are part of the FileStatus. Interestingly, these values are extremely 
> repetitive across a listing, yet each FileStatus holds its own unique String 
> instances on the heap. So a full solution would also track the owner and 
> group, but do so using a Map<String, String> or something similar in order to 
> reuse identical Strings on the heap, in much the same way that String.intern() 
> does, but without actually interning the Strings, since they are only reusable 
> within a small context. A rough sketch of this idea is shown below.
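> Not part of the patch itself, but as an illustration, a minimal sketch of such 
> a listing-scoped 'interning' map (the class and method names here are 
> hypothetical):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical helper: holds one canonical instance per distinct owner/group value,
> // scoped to a single listing rather than the JVM-wide String.intern() pool.
> private static class StringCache {
>     private final Map<String, String> pool = new HashMap<>();
>
>     public String deduplicate(final String value) {
>         if (value == null) {
>             return null;
>         }
>         // Return the first instance seen for this value so that duplicates share one String.
>         final String existing = pool.putIfAbsent(value, value);
>         return existing == null ? value : existing;
>     }
> }
> {code}
> HdfsEntity could then store, for example, {{stringCache.deduplicate(status.getOwner())}} 
> instead of the distinct String instance held by each FileStatus.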
> These changes will yield significant memory improvements.
> However, more invasive changes could yield even larger improvements:
>  * Rather than keeping a Set or a List of entities at all, we could instead 
> iterate over each of the FileStatus objects returned from HDFS and determine 
> whether or not the file should be included in the listing. If so, create the 
> FlowFile or write the Record to the RecordWriter, and eliminate the collection 
> altogether. This would likely result in code similar to ListS3, which has an 
> internal {{S3ObjectWriter}} class that provides a single interface regardless 
> of whether or not a RecordWriter is in use (a rough sketch of this approach 
> follows this list). This would provide very significant memory improvements. 
> In the case of using a Record Writer, it may even reduce heap usage to nearly 
> zero.
>  * However, when not using a Record Writer, we will still have the FlowFiles 
> kept in memory until the session is committed. One way to deal with this would 
> be to change the algorithm so that, instead of performing a single listing of 
> the directory and all sub-directories, we sort the sub-directories by name and 
> then commit the session and update state for each sub-directory individually. 
> This would introduce significant risk, though, in ensuring that we don't 
> duplicate data upon restart and that we don't lose data, so it's perhaps not 
> the best option.
>  * Alternatively, we could document the concern when not using a Record Writer 
> and provide an additionalDetails.html that describes the preferred way to use 
> the processor when expecting many files: ListHDFS would be connected to a 
> SplitRecord processor that splits the listing into chunks of, say, 10,000 
> Records. This would then go to a PartitionRecord processor that creates 
> attributes for the fields. This would give us the same result as outputting 
> without the Record Writer, but in a way that is much more memory efficient.
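> To illustrate the first option above, here is a rough sketch of what a 
> streaming writer abstraction might look like. The interface and method names 
> ({{HdfsObjectWriter}}, {{beginListing}}, {{addToListing}}, {{finishListing}}) 
> are hypothetical and only loosely modeled on ListS3's internal 
> {{S3ObjectWriter}}:
> {code:java}
> import java.io.IOException;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.LocatedFileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.RemoteIterator;
>
> // Hypothetical writer abstraction: one implementation would create a FlowFile per entry,
> // another would append a Record to a single FlowFile via the configured RecordSetWriter.
> interface HdfsObjectWriter {
>     void beginListing() throws IOException;
>     void addToListing(FileStatus status) throws IOException;
>     void finishListing() throws IOException;
> }
>
> // Listing loop sketch: each FileStatus is handed to the writer and then discarded,
> // so heap usage no longer grows with the number of files in the listing.
> void performListing(final FileSystem hdfs, final Path directory, final HdfsObjectWriter writer) throws IOException {
>     writer.beginListing();
>     final RemoteIterator<LocatedFileStatus> iterator = hdfs.listFiles(directory, true); // recursive iteration
>     while (iterator.hasNext()) {
>         final FileStatus status = iterator.next();
>         // Apply file filters, minimum age, and state checks here, then:
>         writer.addToListing(status);
>     }
>     writer.finishListing();
> }
> {code}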



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
