[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729479#comment-16729479 ]

ASF subversion and git services commented on PDFBOX-4182:

Commit 1849792 from Tilman Hausherr in branch 'pdfbox/trunk' [ https://svn.apache.org/r1849792 ]
PDFBOX-4182, PDFBOX-4188: remove unused parameter

> Improve memory usage of PDFMergerUtility
>
> Key: PDFBOX-4182
> URL: https://issues.apache.org/jira/browse/PDFBOX-4182
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9
> Reporter: Pas Filip
> Priority: Major
> Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java, Suppliers.java,
> failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png,
> merge-pdf-stats.xlsx, merge-utility.patch,
> oom-2gb-heap-after-refactoring-leak-suspect-1.png,
> oom-2gb-heap-after-refactoring-leak-suspect-2.png,
> successful - refactored-merge-utility-4gb-heap-2618-files-merged.png,
> successful -merge-utility-6gb-heap-2618-files-merged.png,
> successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png,
> successful-merge-utility-8gb-heap-2618-files-merged.png,
> successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png
>
> I have been running some tests trying to merge a large number (2618) of small PDF documents, between 100 KB and 130 KB each, into a single large PDF (288.433 KB).
> Memory consumption seems to be the main limitation.
> ScratchFileBuffer seems to account for the majority of the memory usage (see the MAT screenshot in the attachments).
> (I would attach the hprof so you can analyze it yourselves, but it's rather large.)
> Note that it seems impossible to generate a large PDF using a small memory footprint.
> I had thought that using MemoryUsageSetting with a temporary file only would allow me to generate arbitrarily large PDF files, but it doesn't seem to help.
> I've run mergeDocuments with these memory settings:
> * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L)
> * MemoryUsageSetting.setupTempFileOnly()
> The refactored version completes with a *4GB* heap: with temp file only, it merges 2618 documents in 1.760 min
> *VS*
> *8GB* heap: with temp file only, it merges 2618 documents in 2.0 min
> Heaps of 6GB or less result in OOM. (I didn't try sizes between 6GB and 8GB.)
> It looks like the loop in mergeDocuments accumulates PDDocument objects in a list, and these are only closed after the merge is completed.
> Refactoring the code to close each document as soon as it has been merged, instead of accumulating them all and closing them at the end, improves memory usage considerably (although, based on MAT analysis, the problem doesn't seem to be completely eliminated).
> Another change I've implemented is to create each InputStream only when the file actually needs to be read, and to close it alongside its PDDocument.
> (Some InputStreams contain buffers; depending on the buffer sizes and/or the stream type, accumulating all the streams is a potential memory hog.)
> These changes have a beneficial effect in the sense that I can process the same number of PDFs with about half the memory.
> I'd appreciate it if you could roll these changes into the main codebase. (I've respected Java 6 compatibility.)
> I've attached the Java files of the new implementation:
> * Suppliers
> * Supplier
> * PDFMergerUtilityUsingSupplier
> PDFMergerUtilityUsingSupplier can replace the previous version. There are no signature changes, only internal code changes. (Just rename the class to PDFMergerUtility if you decide to implement the changes.)
> In the attachments you can also find some VisualVM screenshots showing the memory usage of the original and refactored versions, as well as some info produced by MAT after analyzing the heap.
> If you know of any other way to merge large sets of PDF files into a single large PDF without running into memory issues, I'd love to hear about it!
> I'd also suggest further improvements to memory usage in general, as PDFBox seems to consume a lot of memory overall.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
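The close-as-you-go refactoring described in the report can be sketched independently of PDFBox. The `Document`, `load`, and `append` names below are stand-ins for illustration (in PDFBox the real classes are `PDDocument` and `PDFMergerUtility`); the point is the resource lifecycle: open, merge, and close inside the loop, so at most one source is ever held open, instead of accumulating every open document in a list and closing them all at the end.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class IncrementalMergeSketch {

    // Stand-in for PDDocument: a closeable resource.
    static final class Document implements Closeable {
        final String name;
        Document(String name) { this.name = name; }
        @Override public void close() { open--; }
    }

    static int open = 0;      // currently open sources
    static int peakOpen = 0;  // high-water mark of open sources

    static Document load(String name) {
        open++;
        peakOpen = Math.max(peakOpen, open);
        return new Document(name);
    }

    // Stand-in for appending one source document to the destination.
    static void append(StringBuilder dest, Document src) {
        dest.append(src.name).append(';');
    }

    // Close each source right after it is merged, instead of
    // collecting all sources in a list and closing them afterwards.
    static String merge(List<String> names) throws IOException {
        StringBuilder dest = new StringBuilder();
        for (String name : names) {
            Document doc = load(name);
            try {
                append(dest, doc);
            } finally {
                doc.close();   // released before the next source is opened
            }
        }
        return dest.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(merge(Arrays.asList("a.pdf", "b.pdf", "c.pdf")));
        // Never more than one source open at a time:
        System.out.println(peakOpen);  // 1
    }
}
```

With the accumulate-then-close pattern the high-water mark would equal the number of inputs (2618 in the report); here it stays at 1, which is why the refactored merge fits in a smaller heap.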
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729481#comment-16729481 ]

ASF subversion and git services commented on PDFBOX-4182:

Commit 1849793 from Tilman Hausherr in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1849793 ]
PDFBOX-4182, PDFBOX-4188: remove unused parameter
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438441#comment-16438441 ]

ASF subversion and git services commented on PDFBOX-4182:

Commit 1829159 from [~msahyoun] in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1829159 ]
PDFBOX-4182, PDFBOX-4188: correct javadoc
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438439#comment-16438439 ]

ASF subversion and git services commented on PDFBOX-4182:

Commit 1829158 from [~msahyoun] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1829158 ]
PDFBOX-4182, PDFBOX-4188: correct javadoc
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438436#comment-16438436 ]

ASF subversion and git services commented on PDFBOX-4182:

Commit 1829156 from [~msahyoun] in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1829156 ]
PDFBOX-4182, PDFBOX-4188: add new merge mode which closes the source PDDocument after the individual merge; early implementation
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438428#comment-16438428 ]

ASF subversion and git services commented on PDFBOX-4182:

Commit 1829154 from [~msahyoun] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1829154 ]
PDFBOX-4182, PDFBOX-4188: add new merge mode which closes the source PDDocument after the individual merge; early implementation
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436011#comment-16436011 ]

Maruan Sahyoun commented on PDFBOX-4182:

One benefit of a Supplier would be that we could store whether a {{File}} was provided instead of an {{InputStream}}. For a {{File}}, the caller would expect us to close the {{FileInputStream}} after the merge, whereas for an {{InputStream}} one could expect the caller to close the stream it provided.
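The Supplier idea under discussion can be sketched in Java 6 compatible form (no `java.util.function`, matching the compatibility constraint stated in the issue). The names below are illustrative, not the contents of the attached Supplier.java: the stream is created only when `get()` is called, so nothing is opened or buffered up front, and whoever calls `get()` owns closing the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class LazyStreamSketch {

    // Java 6 compatible stand-in for java.util.function.Supplier.
    interface Supplier<T> {
        T get() throws IOException;
    }

    static int streamsCreated = 0;

    // Defers stream creation until the merge actually reads this source.
    // (ByteArrayInputStream stands in for a FileInputStream over a PDF.)
    static Supplier<InputStream> lazySource(final byte[] data) {
        return new Supplier<InputStream>() {
            public InputStream get() {
                streamsCreated++;
                return new ByteArrayInputStream(data);
            }
        };
    }

    public static void main(String[] args) throws IOException {
        Supplier<InputStream> source = lazySource(new byte[] { 1, 2, 3 });
        // Creating the supplier opens nothing:
        System.out.println(streamsCreated);   // 0

        InputStream in = source.get();        // opened only when needed...
        try {
            System.out.println(in.read());    // 1
        } finally {
            in.close();                       // ...and closed by the same caller
        }
        System.out.println(streamsCreated);   // 1
    }
}
```

This is also where the File-vs-InputStream distinction from the comment above would live: a supplier built from a `File` can close the stream it created after the merge, while a supplier wrapping a caller-provided `InputStream` can leave closing to the caller.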
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435924#comment-16435924 ] Tilman Hausherr commented on PDFBOX-4182: - I prefer Maruan's approach that he presented in PDFBOX-4188 to yours with the Supplier. That Supplier is a smart thing but I feel like it would be more difficult to understand. > Improve memory usage of PDFMergerUtility > > > Key: PDFBOX-4182 > URL: https://issues.apache.org/jira/browse/PDFBOX-4182 > Project: PDFBox > Issue Type: Improvement >Affects Versions: 2.0.9 >Reporter: Pas Filip >Priority: Major > Attachments: PDFMergerUtilityUsingSupplier.java, Supplier.java, > Suppliers.java, > failed-merge-utility-4gb-heap-out-of-memory-after-1800-pdfs.png, > merge-pdf-stats.xlsx, merge-utility.patch, > oom-2gb-heap-after-refactoring-leak-suspect-1.png, > oom-2gb-heap-after-refactoring-leak-suspect-2.png, successful - > refactored-merge-utility-4gb-heap-2618-files-merged.png, successful > -merge-utility-6gb-heap-2618-files-merged.png, > successful-merge-utility-6gb-heap-2618-files-merged-setupTempFileOnly.png, > successful-merge-utility-8gb-heap-2618-files-merged.png, > successful-refactored-merge-utility-4gb-heap-2618-files-merged-setupTempFileOnly.png > > > I have been running some tests trying to merge large amounts (2618) of small > pdf documents, between 100kb and 130kb, into a single large pdf (288.433kb) > Memory consumption seems to be the main limitation. > ScratchFileBuffer seems to consume the majority of the memory usage. > (see screenshot from mat in attachment) > (I would include the hprof in attachment so you can analyze yourselves but > it's rather large) > Note that it seems impossible to generate a large pdf using a small memory > footprint. > I personally thought that using MemorySettings with temporary file only would > allow me to generate arbitrarily large pdf files but it doesn't seem to help. 
> I've run mergeDocuments with memory settings: > * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L * 1024L) > * MemoryUsageSetting.setupTempFileOnly() > The refactored version completes with a *4GB* heap: > with temp file only it completes 2618 documents in 1.760 min > *VS* > an *8GB* heap: > with temp file only it completes 2618 documents in 2.0 min > Heaps of 6gb or less result in OOM. (Didn't try sizes between 6GB and 8GB.) > It looks like the loop in mergeDocuments accumulates PDDocument objects in a list, which are closed after the merge is completed. > Refactoring the code to close these as they are used, instead of accumulating them and closing them all at the end, improves memory usage considerably (although the problem doesn't seem to be completely eliminated, based on MAT analysis). > Another change I've implemented is to only create the InputStream when the file needs to be read, and to close it alongside the PDDocument. > (Some InputStreams contain buffers, and depending on the size of the buffers and/or the stream type, accumulating all the streams is a potential memory hog.) > These changes seem to be a beneficial improvement in the sense that I can process the same number of pdfs with about half the memory. > I'd appreciate it if you could roll these changes into the main codebase. > (I've respected Java 6 compatibility.) > I've included in the attachments the java files of the new implementation: > * Suppliers > * Supplier > * PDFMergerUtilityUsingSupplier > PDFMergerUtilityUsingSupplier can replace the previous version. No signature changes, only internal code changes. (Just rename the class to PDFMergerUtility if you decide to implement the changes.) > In the attachments you can also find some screenshots from VisualVM showing the memory usage of the original version and the refactored version, as well as some info produced by MAT after analysing the heap. 
> If you know of any other means, without running into memory issues, to merge large sets of pdf files into a single large pdf, I'd love to hear about it! > I'd also suggest further improvements to memory usage generally, as pdfbox seems to consume a lot of memory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
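The temp-file-only setting the reporter mentions is configured through MemoryUsageSetting. A minimal sketch of such a merge against the PDFBox 2.0.x API (the output file name is a placeholder):

```java
import java.io.File;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class TempFileOnlyMerge
{
    public static void main(String[] args) throws Exception
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        for (String path : args)
        {
            merger.addSource(new File(path));
        }
        merger.setDestinationFileName("merged.pdf"); // placeholder output name
        // Buffer intermediate data in a temporary file instead of the heap;
        // setupMixed(maxMainMemoryBytes, maxStorageBytes) combines both.
        merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }
}
```

Note that, as the report shows, this setting bounds the page-content buffering but not the PDDocument objects the merge loop accumulates, which is why the close-early refactoring still matters.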
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432983#comment-16432983 ] Maruan Sahyoun commented on PDFBOX-4182: I like the idea of the patch to use a strategy which allows selecting when documents are closed. But what I would really like to do is come up with a new merge behind the scenes which initially doesn't support merging all of the elements currently supported, but reuses or rewrites how we handle the different elements, allowing us to gradually resolve the open issues and generally close a document after it has been merged. So instead of naming the merge strategies after how we close documents, I'd rather go for names which do not reflect the inner workings. As you've written above, implementing the patch improves the situation for documents where we know they can be handled by closing the document directly after the merge, but it doesn't resolve the issues for the ones where that doesn't work. My proposal would be to have basically two mergeDocuments methods (although they might be called differently) - one doing a legacy merge, i.e. the current mode of operation, and one with a new implementation to which we add capabilities over time. 
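The close-after-merge pattern under discussion can be sketched as follows (a hedged sketch, Java 6 compatible; appendDocument is the real PDFMergerUtility method, but the loop structure is illustrative, not the shipped implementation):

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;

public class CloseEarlySketch
{
    static void mergeClosingEarly(PDFMergerUtility merger, List<File> sources,
                                  PDDocument destination) throws IOException
    {
        for (File source : sources)
        {
            // Open each source document only when it is needed...
            PDDocument doc = PDDocument.load(source);
            try
            {
                merger.appendDocument(destination, doc);
            }
            finally
            {
                // ...and close it immediately, instead of accumulating every
                // open document in a list until the whole merge finishes.
                doc.close();
            }
        }
    }
}
```

This is exactly the case the comment flags as risky: some merged elements may still reference the source document after it is closed, which is why a mode or strategy selection is proposed rather than changing the default.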
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432916#comment-16432916 ] Pas Filip commented on PDFBOX-4182: --- I've attached a patch with the option to choose which strategy to use. It could be interesting to choose the strategy depending on the pdf elements in the pdfs. It's not really a solution, but there is some improvement. 
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430777#comment-16430777 ] Tilman Hausherr commented on PDFBOX-4182: - You should open an issue in his project... but I don't know if he can help without the files involved. 
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430725#comment-16430725 ] Pas Filip commented on PDFBOX-4182: --- On a completely different note, I've been running some tests based on the sambox console command-line feature to merge multiple pdfs. It seems to run faster for small loads but fails to complete with 10.000+ docs (no memory issue; there are still 3gb free out of 8gb). I seem to run into a deadlock here: org.sejda.io.FileChannelSeekableSource.position(FileChannelSeekableSource.java:59) 
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430605#comment-16430605 ] Maruan Sahyoun commented on PDFBOX-4182: Closing the PDDocument early will also improve the ScratchFile usage. 
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430600#comment-16430600 ] Pas Filip commented on PDFBOX-4182: --- [~tilman] I think introducing the parameter can be useful to improve memory usage in the short term. Ideally, re-working the scratchfile may lead to the biggest gains in memory consumption, but that's not as easy... 
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430431#comment-16430431 ] Tilman Hausherr commented on PDFBOX-4182: - We could add a parameter to {{mergeDocuments}} like {{earlyClosing}} that is false in the call without that parameter, or {{lateClosing}} that is true in the call without the parameter. The javadoc should contain text explaining that closing early can be risky in some cases, e.g. the one in PDFBOX-4004. 
> I personally thought that using MemorySettings with temporary file only would > allow me to generate arbitrarily large pdf files but it doesn't seem to help. > I've run the mergeDocuments with memory settings: > * MemoryUsageSetting.setupMixed(1024L * 1024L, 1024L * 1024L * 1024L * 1024L > * 1024L) > * MemoryUsageSetting.setupTempFileOnly() > Refactored version completes with *4GB* heap: > with temp file only completes 2618 documents in 1.760 min > *VS* > *8GB* heap: > with temp file only completes 2618 documents in 2.0 min > Heaps of 6gb or less result in OOM. (Didn't try different sizes between 6GB > and 8GB) > It looks like the loop in the mergeDocuments accumulates PDDocument objects > in a list which are closed after the merge is completed. > Refactoring the code to close these as they are used, instead of accumulating > them and closing all at the end, improves memory usage considerably.(although > doesn't seem to be eliminated completed based on mat analysis.) > Another change I've implemented is to only create the inputstream when the > file needs to be read and to close it alongside the PDDocument. > (Some inputstreams contain buffers and depending on the size of the buffers > and or the stream type accumulating all the streams is a potential > memory-hog.) > These changes seems to have a beneficial improvement in the sense that I can > process the same amount of pdfs with about half the memory. > I'd appreciate it if you could roll these changes into the main codebase. > (I've respected java 6 compatibility.) > I've included in attachment the java files of the new implementation: > * Suppliers > * Supplier > * PDFMergerUtilityUsingSupplier > PDFMergerUtilityUsingSupplier can replace the previous version. No signature > changes only internal code changes. (just rename the class to > PDFMergerUtility if you decide to implemented the changes.) 
> In attachment you can also find some screenshots from visualvm showing the > memory usage of the original version and the refactored version as well as > some info produced by mat after analysing the heap. > If you know of any other means, without running into memory issues, to merge > large sets of pdf files into a large single pdf I'd love to hear about it! > I'd also suggest that there should be further improvements made in memory > usage in general as pdfbox seems to consumer a lot of memory in general. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org
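The early-closing change discussed in this thread (the proposed {{lateClosing}}-style parameter plus the reporter's refactoring) could look roughly like the following against the PDFBox 2.0.x API. This is only a sketch, not the actual patch: the class name `EarlyClosingMerge` and the `megabytes` helper are invented for illustration, and closing each source immediately carries exactly the shared-resource risk (e.g. PDFBOX-4004) mentioned above.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;

// Hypothetical sketch: open each source lazily, append it, and close it
// immediately, instead of accumulating all PDDocuments until the end.
public class EarlyClosingMerge
{
    public static void merge(List<File> sources, File target) throws IOException
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        // buffer the destination on disk, not on the heap
        try (PDDocument destination = new PDDocument(MemoryUsageSetting.setupTempFileOnly()))
        {
            for (File source : sources)
            {
                // open the source only when needed; close it as soon as it is
                // appended - risky when resources end up shared with the
                // destination (the PDFBOX-4004 case)
                try (PDDocument sourceDoc =
                        PDDocument.load(source, MemoryUsageSetting.setupTempFileOnly()))
                {
                    merger.appendDocument(destination, sourceDoc);
                }
            }
            destination.save(target);
        }
    }

    // Pure helper for readable MemoryUsageSetting thresholds,
    // e.g. setupMixed(megabytes(1), ...) as in the description above.
    public static long megabytes(long mb)
    {
        return mb * 1024L * 1024L;
    }
}
```

Note that try-with-resources requires Java 7; the attached PDFMergerUtilityUsingSupplier reportedly keeps Java 6 compatibility, so it cannot use this form.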
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430355#comment-16430355 ] Maruan Sahyoun commented on PDFBOX-4182: [~pasfilip] [~tilman] what about this approach: within PDFMergerUtility we develop a new merge method whose goal is to close each PDDocument after it has been merged. A flag would let one select between the new and the old merge. One needs to select the old merge to get all the current capabilities, but over time we add to the new merge method. After doing this we will need to look into further optimizations, such as a different/new/improved 'cache'/ScratchFile, to reduce memory consumption further if still needed. This way we keep the ability to select the current implementation for the special cases in which (currently) the PDDocument needs to stay available, but also have a 'slim' method if one only wants to merge basic documents. WDYT? [~pasfilip] I understand that you can't share the documents. Would it be possible to provide a sample set, created from scratch, that reflects the document set you have? Pointing to publicly available documents is also fine. Please keep the elements needed to the bare minimum.
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16430333#comment-16430333 ] Pas Filip commented on PDFBOX-4182: --- [~msahyoun] I'm afraid I can't share the PDFs as they contain confidential information. Basically they are documents asking a customer for payment: each contains an image of the EU transfer form, some text, and a company logo. In other words, the PDFs I tested with are very simple, although I will be receiving PDFs with hidden fields and layout instructions in production. Most files I tested with were between 100kb and 140kb. Sharing the COSStream does seem problematic indeed. Memory-mapped files sound like a good idea, but I suspect they would imply a significant rewrite of some portions of the code. I'm not familiar enough with the code to estimate whether that is feasible...
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429669#comment-16429669 ] Maruan Sahyoun commented on PDFBOX-4182: Thanks - I did a special merge implementation which works without leaving the files open, but it is for a very specific set of PDFs (merging over 1 docs in one go) - so maybe we can find a way to also deal with the issues which currently prevent us from doing this generally. OTOH if the resulting file is large it will still need lots of memory. We could take a look at memory-mapped files for caching. [~pasfilip] would it be possible to share a small set of your documents to get an idea which PDF elements they use?
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429472#comment-16429472 ] Tilman Hausherr commented on PDFBOX-4182: - I found it: https://stackoverflow.com/questions/47140209/files-flattened-and-merged-with-pdfbox-are-sharing-common-cosstream which leads to PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004.
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429434#comment-16429434 ] Maruan Sahyoun commented on PDFBOX-4182: [~tilman] I couldn't find the issue you are mentioning - would you mind taking a look to see if you can find it?
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428637#comment-16428637 ] Tilman Hausherr commented on PDFBOX-4182: - The problem isn't the input streams; the problem is the documents, which point to many resources. Re pdfsam, the "m" stands for merge :) - see also https://pdfsam.org/ for the tool with GUI.
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428625#comment-16428625 ] Pas Filip commented on PDFBOX-4182: --- I had a quick look at how PDDocument is implemented, and it looks like the source is only used for the incremental save and close operations. Since a merge is being performed, the input PDDocuments aren't being modified, so it shouldn't be an issue to close them in the meantime. Regarding the workaround mentioned on Stack Overflow: I don't see any real workaround for the memory consumption issue besides trying to merge the files in batches. I don't really see how this would reduce memory consumption though, as in the end you'll still have to merge, for example, just 2 large files, which I suspect will consume a lot of memory. I haven't verified whether merging 2 large files uses less memory than merging the same content from smaller files. I'll check it out though and see if I can reduce memory consumption by using batches. Optionally we could create a merge utility that takes this approach and merges in batches. The other workaround that is mentioned is to use freetext. It looks like it still uses a lot of memory though, and in any case I'm reluctant to use freetext as the license is AGPL. I've had a look at sambox, but there doesn't seem to be a merge utility. I haven't investigated whether I could build a merge utility using the sambox API though.
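The batching idea floated above could be sketched roughly like this against the PDFBox 2.0.x API. This is a hypothetical illustration (the names `BatchMerge`, `mergeInBatches`, `numBatches` and the batch size are invented), and whether a second pass over a few large intermediates really uses less peak memory than one pass over many small files is, as the comment says, unverified.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

// Hypothetical batch merge: merge fixed-size batches to intermediate
// files first, then merge the (much fewer) intermediates.
public class BatchMerge
{
    public static void mergeInBatches(List<File> sources, File target, int batchSize)
            throws IOException
    {
        List<File> intermediates = new ArrayList<File>();
        for (int i = 0; i < sources.size(); i += batchSize)
        {
            // first pass: merge one batch into a temporary file
            File part = File.createTempFile("merge-batch-", ".pdf");
            PDFMergerUtility merger = new PDFMergerUtility();
            for (File source : sources.subList(i, Math.min(i + batchSize, sources.size())))
            {
                merger.addSource(source);
            }
            merger.setDestinationFileName(part.getAbsolutePath());
            merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
            intermediates.add(part);
        }
        // second pass: merge the intermediate files into the final target
        PDFMergerUtility finalMerger = new PDFMergerUtility();
        for (File part : intermediates)
        {
            finalMerger.addSource(part);
        }
        finalMerger.setDestinationFileName(target.getAbsolutePath());
        finalMerger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }

    // Pure helper: number of first-pass batches, e.g. 2618 files
    // in batches of 50 need 53 intermediate merges.
    public static int numBatches(int total, int batchSize)
    {
        return (total + batchSize - 1) / batchSize;
    }
}
```

A caveat on this design: each intermediate file roughly doubles the disk I/O, so it trades write bandwidth for (possibly) lower peak heap.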
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428538#comment-16428538 ] Tilman Hausherr commented on PDFBOX-4182:
---
There was a recent SO issue with this problem: [https://stackoverflow.com/questions/48643074/how-to-make-streamed-pdf-merging-without-memory-consumption]
Closing the files earlier can't be done because in some rare cases resources are not properly cloned, so they are still used for the destination. Sadly I can't find the issue... I think it was related to the structure tree. Opening the files later wouldn't have any effect, considering that we can't close earlier.
An alternative would be sambox (https://github.com/torakiki/sambox), a PDFBox clone specialized in split and merge.
--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
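The refactoring the reporter describes (closing each source as soon as it has been appended, rather than collecting all PDDocuments in a list) can be approximated from outside the utility with the public `appendDocument` method. A sketch only, with placeholder paths, using try-with-resources (Java 7+) for brevity rather than the Java 6 style the patch keeps; note Tilman's caveat above that closing a source early is not safe in general, because in rare cases (e.g. involving the structure tree) resources are not fully cloned into the destination:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.pdmodel.PDDocument;

public class IncrementalMerge
{
    public static void main(String[] args) throws IOException
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        // Back the destination's scratch buffers with a temp file.
        try (PDDocument destination = new PDDocument(MemoryUsageSetting.setupTempFileOnly()))
        {
            for (File f : new File("input-pdfs").listFiles())
            {
                // Open each source only when it is needed and close it right
                // after appending, so at most two documents are open at once.
                try (PDDocument source = PDDocument.load(f, MemoryUsageSetting.setupTempFileOnly()))
                {
                    merger.appendDocument(destination, source);
                }
            }
            destination.save("merged.pdf");
        }
    }
}
```

This mirrors the intent of the attached PDFMergerUtilityUsingSupplier (lazy open, eager close) without changing PDFBox itself, but it inherits the same correctness risk for documents whose resources are shared rather than cloned.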
[jira] [Commented] (PDFBOX-4182) Improve memory usage of PDFMergerUtility
[ https://issues.apache.org/jira/browse/PDFBOX-4182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428515#comment-16428515 ] Pas Filip commented on PDFBOX-4182:
---
I've added an Excel sheet that gives an idea of memory consumption and execution time.