[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125285#comment-17125285 ]

Shan commented on PDFBOX-4188:
------------------------------

Any idea when this will be released? This seems more like a bug, given the API documentation of MemoryUsageSetting.setupMixed(long maxMainMemoryBytes, long maxStorageBytes).

> "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
> --------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-4188
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4188
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.9, 3.0.0 PDFBox
>            Reporter: Gary Potagal
>            Priority: Major
>         Attachments: PDFBOX-4188-MemoryManagerPatch.zip, PDFBOX-4188-breakingTest.zip, PDFBOX-4188_memory_diagram.png, PDFMergerUtility.java-20180412.patch
>
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>
> We wanted to address one more merge issue in org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
>
> We need to merge a large number of small files. We use mixed mode, memory and disk for cache. Initially, we would often get "Maximum allowed scratch file memory exceeded.", unless we turned off the check by passing "-1" to org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this is what the users who opened PDFBOX-3721 were running into.
>
> Our research indicates that the core issue with the memory model is that instead of sharing a single cache, it breaks it up into equal-sized fixed partitions based on the number of input + output files being merged. This means that each partition must be big enough to hold the final output file. When 400 1-page files are merged, this creates 401 partitions, each of which needs to be big enough to hold the final 400 pages. Even worse, the merge algorithm needs to keep all files open until the end.
>
> Given this, near the end of the merge, we're actually caching 400 x 1-page input files and 1 x 400-page output file, or 800 pages.
>
> However, with the partitioned cache, we need to declare room for 401 x 400 pages, or 160,400 pages in total, when specifying "maxStorageBytes". This would be a very high number, usually in gigabytes.
>
> Given the current limitation that we need to keep all the input files open until the output file is written (HUGE), we came up with 2 options. (See PDFBOX-4182)
>
> 1. Good: Split the cache in ½, give ½ to the output file, and segment the other ½ across the input files. (Still keeping them open until the end.)
> 2. Better: Dynamically allocate in 16-page (64K) chunks from memory or disk on demand, and release cache as documents are closed after merge. This is our current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are addressed.
>
> We would like to submit our current implementation as a patch to 2.0.10 and 3.0.0, unless this is already addressed.
>
> Thank you

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
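The sizing argument in the issue description can be checked with a few lines of Java. This is only a sketch of the arithmetic: `CacheSizing` and its two methods are hypothetical names made up for this illustration and are not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox: compares the number of pages that
// must be declared under fixed equal partitions with what a shared cache needs.
class CacheSizing {

    // Fixed partitions: (inputs + 1 output) partitions, each large enough
    // to hold the complete merged output.
    static long partitionedPages(long inputFiles, long pagesPerInput) {
        long outputPages = inputFiles * pagesPerInput;
        return (inputFiles + 1) * outputPages;
    }

    // Shared cache: every input stays open while the output grows to full size,
    // so the peak is all input pages plus all output pages.
    static long sharedPages(long inputFiles, long pagesPerInput) {
        long inputPages = inputFiles * pagesPerInput;
        long outputPages = inputFiles * pagesPerInput;
        return inputPages + outputPages;
    }

    public static void main(String[] args) {
        System.out.println("partitioned: " + partitionedPages(400, 1) + " pages");
        System.out.println("shared:      " + sharedPages(400, 1) + " pages");
    }
}
```

For 400 single-page inputs this gives 401 x 400 = 160,400 pages under partitioning, against 400 + 400 = 800 pages actually held at the end of the merge.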
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732206#comment-16732206 ]

Tilman Hausherr commented on PDFBOX-4188:
-----------------------------------------

Re 1) yes
Re 2) don't know, see my comment from April 17th
Re 3) No, all issues have been fixed. Of course new ones might be coming in the future.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16732099#comment-16732099 ]

Gary Potagal commented on PDFBOX-4188:
--------------------------------------

I saw some activity on this ticket so reviewed and have a couple of questions:

1. Am I correct in that without changing code, PDFBOX_LEGACY_MODE is going to be used?
2. With default PDFBOX_LEGACY_MODE, updated memory management presented in the ticket would still be hugely beneficial in merging large number of small files. Any plans to review it or change the default?
3. Does "Structure Tree" limitation still exist?

Thank you!
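For readers who have not opened the attachments, the on-demand scheme listed as option 2 in the issue description (16-page chunks drawn from one shared budget and returned when a document is closed) can be modelled roughly as follows. This is an illustrative sketch, not the attached patch: `SharedChunkPool`, its methods, the string document IDs and the unchecked exception type are all hypothetical, and the 4096-byte page size is an assumption inferred from the "16 page (64K)" figure.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only; not the attached patch and not a PDFBox class.
class SharedChunkPool {
    static final int PAGES_PER_CHUNK = 16;   // chunk size named in the description
    static final long PAGE_BYTES = 4096L;    // assumption: 16 pages = 64K
    static final long CHUNK_BYTES = PAGES_PER_CHUNK * PAGE_BYTES;

    private final long maxBytes;             // one budget shared by all documents
    private long usedBytes;
    private final Map<String, Long> chunksByDocument = new HashMap<>();

    SharedChunkPool(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Hands one more chunk to the document, or fails if the shared budget is spent.
    void allocateChunk(String documentId) {
        if (usedBytes + CHUNK_BYTES > maxBytes) {
            throw new IllegalStateException("Maximum allowed scratch file memory exceeded.");
        }
        usedBytes += CHUNK_BYTES;
        chunksByDocument.merge(documentId, 1L, Long::sum);
    }

    // Returns every chunk held by a document to the pool when it is closed.
    void release(String documentId) {
        Long chunks = chunksByDocument.remove(documentId);
        if (chunks != null) {
            usedBytes -= chunks * CHUNK_BYTES;
        }
    }

    long usedBytes() {
        return usedBytes;
    }

    public static void main(String[] args) {
        // Toy run: one chunk per input plus one chunk for the output.
        SharedChunkPool pool = new SharedChunkPool(401 * CHUNK_BYTES);
        for (int i = 0; i < 400; i++) {
            pool.allocateChunk("input-" + i);
        }
        pool.allocateChunk("output");
        System.out.println("in use: " + pool.usedBytes() + " bytes");
        for (int i = 0; i < 400; i++) {
            pool.release("input-" + i);
        }
        System.out.println("after closing inputs: " + pool.usedBytes() + " bytes");
    }
}
```

The point of the model is that the limit applies to the sum of what all documents hold, so a small input never reserves space it does not use.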
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729482#comment-16729482 ]

ASF subversion and git services commented on PDFBOX-4188:
---------------------------------------------------------

Commit 1849793 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1849793 ]

PDFBOX-4182, PDFBOX-4188: remove unused parameter
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16729480#comment-16729480 ]

ASF subversion and git services commented on PDFBOX-4188:
---------------------------------------------------------

Commit 1849792 from Tilman Hausherr in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1849792 ]

PDFBOX-4182, PDFBOX-4188: remove unused parameter
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16442245#comment-16442245 ]

Maruan Sahyoun commented on PDFBOX-4188:
----------------------------------------

I'll revisit that over the weekend.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441300#comment-16441300 ]

Tilman Hausherr commented on PDFBOX-4188:
-----------------------------------------

I can't comment because I haven't had the time to understand the patch, and the memory management is an "undiscovered area" to me.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440811#comment-16440811 ]

Gary Potagal commented on PDFBOX-4188:
--------------------------------------

Hello [~msahyoun] and [~tilman] - should we continue to work on this patch for 2.0.10 or do you want to come back to this for 3.0?

Thank you
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439923#comment-16439923 ]

Gary Potagal commented on PDFBOX-4188:
--------------------------------------

[~msahyoun] - Sorry, I just reviewed the code better. What I'm seeing is:

- org.apache.pdfbox.io.MemoryUsageSetting#getPartitionedCopy was only used in the org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting) method
- getPartitionedCopy creates a new instance of MemoryUsageSetting with limits determined by parallelUseCount. It is basically a copy constructor. As a utility method it will still function just as before
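To make the behaviour described in this comment concrete, here is a simplified stand-in for a settings object whose partitioned copy gives each document an equal share of the limits. `Limits` is a hypothetical class written for this illustration and is not the PDFBox `MemoryUsageSetting` source; the rule that non-positive limits (such as -1 for "unrestricted") are copied unchanged is an assumption.

```java
// Simplified stand-in for illustration; not the PDFBox MemoryUsageSetting source.
final class Limits {
    final long maxMainMemoryBytes;
    final long maxStorageBytes;

    Limits(long maxMainMemoryBytes, long maxStorageBytes) {
        this.maxMainMemoryBytes = maxMainMemoryBytes;
        this.maxStorageBytes = maxStorageBytes;
    }

    // Returns a new instance whose limits are an equal share of this one's.
    // Assumption: non-positive values mean "no limit" and are copied unchanged.
    Limits getPartitionedCopy(int parallelUseCount) {
        long main = maxMainMemoryBytes <= 0
                ? maxMainMemoryBytes
                : maxMainMemoryBytes / parallelUseCount;
        long storage = maxStorageBytes <= 0
                ? maxStorageBytes
                : maxStorageBytes / parallelUseCount;
        return new Limits(main, storage);
    }

    public static void main(String[] args) {
        // 64 MiB of main memory and 1 GiB of storage, split for 400 inputs + 1 output.
        Limits total = new Limits(64L * 1024 * 1024, 1024L * 1024 * 1024);
        Limits perDocument = total.getPartitionedCopy(401);
        System.out.println("main memory per document: " + perDocument.maxMainMemoryBytes);
        System.out.println("storage per document:     " + perDocument.maxStorageBytes);
    }
}
```

With these example numbers each of the 401 documents, including the output, receives under 3 MB of the 1 GiB storage limit, which is the ceiling the growing output document has to fit under.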
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439861#comment-16439861 ]

Maruan Sahyoun commented on PDFBOX-4188:
----------------------------------------

That's what I got from your comments. We need to make sure that {{MemoryUsageSetting.getPartitionedCopy}} is still working - otherwise we can't include the patch in the 2.0 stream. And for 3.0 there are no release plans yet - so this is far out.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439858#comment-16439858 ] Gary Potagal commented on PDFBOX-4188: [~msahyoun] - Probably nothing good. In our code, we took that method out.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439817#comment-16439817 ] Maruan Sahyoun commented on PDFBOX-4188: Thanks for the explanation. What happens if one calls {{MemoryUsageSetting.getPartitionedCopy}} after the changes?
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439685#comment-16439685 ] Gary Potagal commented on PDFBOX-4188:
[~msahyoun]
- I've attached [^PDFBOX-4188_memory_diagram.png], which demonstrates the problem. It's harder to diagram, but the real scope of the problem becomes a lot worse the more files you add to the merge. We hope you see that problem in the test that was submitted.
- The problem starts in PDFMergerUtility when memory is partitioned (Line 288). We're eliminating memory partitioning, so the patch can't be split into two parts. There's one very important point: MemoryUsageSetting is a *single* object that's shared between all ScratchFiles. All ScratchFiles must reserve pages with MemoryUsageSetting, thus
-- Pages (in main memory and on disk) are allocated only when they are needed
-- Total limits are tracked in a single place, so whatever settings are passed into the PDFMergerUtility will be the maximum memory limits used during the merge.
- I'll open another ticket for openAction
- MappedByteBuffer is used when there is a need to read the content of a file multiple times. Is that done during the merge?
- If the patch is acceptable, we'll clean it up to meet coding conventions.
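The "single shared object" idea in the comment above can be pictured with a small sketch. This is not the attached patch: the class name, its methods, and the use of an unchecked exception are all assumptions made for illustration. It only shows a budget that several scratch files could share, where pages are reserved on demand and handed back when a document is closed.

```java
public class SharedPageBudget {

    private final long maxPages;   // negative means no limit (assumption of this sketch)
    private long reservedPages;    // pages currently held by all users of the budget

    public SharedPageBudget(long maxPages) {
        this.maxPages = maxPages;
    }

    /** Reserves pages on demand; fails once the single shared limit would be exceeded. */
    public synchronized void reserve(long pages) {
        if (maxPages >= 0 && reservedPages + pages > maxPages) {
            // Hypothetical: a real scratch file would more likely throw an IOException.
            throw new IllegalStateException("Maximum allowed scratch file memory exceeded.");
        }
        reservedPages += pages;
    }

    /** Returns pages to the budget, for example when a merged source is closed. */
    public synchronized void release(long pages) {
        reservedPages = Math.max(0, reservedPages - pages);
    }

    public synchronized long reserved() {
        return reservedPages;
    }

    public static void main(String[] args) {
        SharedPageBudget budget = new SharedPageBudget(800);
        for (int i = 0; i < 400; i++) {
            budget.reserve(1);       // 400 one-page inputs
        }
        budget.reserve(400);         // the 400-page output
        System.out.println(budget.reserved());
    }
}
```

Because the limit is tracked in one place, the 400 inputs and the output fit into a budget of 800 pages, whereas equal partitions would each have to be sized for the whole output.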
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438699#comment-16438699 ] Maruan Sahyoun commented on PDFBOX-4188:
[~gary.potagal] I've taken a quick look at the patch and would like to discuss some topics
- PDFMergerUtility was using {{MemoryUsageSetting.getPartitionedCopy}} where now the setting is passed on for each PDDocument and is no longer partitioned. So although the value used for {{MemoryUsageSetting}} is much lower now, isn't that in the end the same result?
- I haven't understood the main benefit of the changes done to {{MemoryUsageSetting}} and {{ScratchFile}}. What is the reason for these?
- I think the patch should be divided into two parts - the changes to {{MemoryUsageSetting}} / {{ScratchFile}} and the changes to PDFMerger, with test cases to show the improvements for each.
- Do you see a benefit in using {{MappedByteBuffer}}?
- the handling of openAction doesn't belong in this patch. It should be part of a new issue.
- the code doesn't follow the coding conventions https://pdfbox.apache.org/codingconventions.html so there is some effort to bring it in line with these. (I think that this section might be difficult to find on our website - any suggestions to make the information easier to find are highly appreciated)
Many of the questions are because this part of PDFBox is something I rarely touch - so I hope you're a little patient with me.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438440#comment-16438440 ] ASF subversion and git services commented on PDFBOX-4188: Commit 1829158 from [~msahyoun] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1829158 ] PDFBOX-4182, PDFBOX-4188: correct javadoc
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438442#comment-16438442 ] ASF subversion and git services commented on PDFBOX-4188: Commit 1829159 from [~msahyoun] in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1829159 ] PDFBOX-4182, PDFBOX-4188: correct javadoc
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438437#comment-16438437 ] ASF subversion and git services commented on PDFBOX-4188: Commit 1829156 from [~msahyoun] in branch 'pdfbox/branches/2.0' [ https://svn.apache.org/r1829156 ] PDFBOX-4182, PDFBOX-4188: add new merge mode which closes the source PDDocument after the individual merge; early implementation
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438429#comment-16438429 ] ASF subversion and git services commented on PDFBOX-4188: Commit 1829154 from [~msahyoun] in branch 'pdfbox/trunk' [ https://svn.apache.org/r1829154 ] PDFBOX-4182, PDFBOX-4188: add new merge mode which closes the source PDDocument after the individual merge; early implementation
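To see why a merge mode that closes each source right after it is merged helps, here is a simplified page-count model. It is an assumption-laden sketch, not the committed code: it only counts cached pages, ignores shared resources and any per-document overhead, and the class and method names are hypothetical.

```java
public class MergePeakModel {

    /** Peak cached pages when every source stays open until the merge ends. */
    public static long peakKeepingSourcesOpen(int[] sourcePages) {
        long total = 0;
        for (int pages : sourcePages) {
            total += pages;
        }
        return total + total; // all sources plus the complete destination
    }

    /** Peak cached pages when each source is closed right after it is merged. */
    public static long peakClosingEachSource(int[] sourcePages) {
        long destination = 0;
        long peak = 0;
        for (int pages : sourcePages) {
            destination += pages;                          // pages appended so far
            peak = Math.max(peak, destination + pages);    // destination plus the one open source
        }
        return peak;
    }

    public static void main(String[] args) {
        int[] sources = new int[400];
        java.util.Arrays.fill(sources, 1);                 // 400 one-page inputs
        System.out.println(peakKeepingSourcesOpen(sources));
        System.out.println(peakClosingEachSource(sources));
    }
}
```

Under this model the 400 one-page example peaks at 800 pages when all sources stay open and at 401 pages when each source is closed after its merge. In released 2.0.x versions the mode is, as far as I know, selected through `PDFMergerUtility.setDocumentMergeMode`; check the Javadoc of the version in use.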
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437703#comment-16437703 ] Gary Potagal commented on PDFBOX-4188:
Added [^PDFBOX-4188-MemoryManagerPatch]. It assumes that [^PDFBOX-4188-breakingTest.zip] is already applied and the PDF used in the test exists.
- This should optimize both modes, but especially the LEGACY mode.
- The Javadoc explains what was changed (hopefully)
- Tests are passing with
    long defaultMemory = 1 * MEG;
    runMergeTest("pdf_sample_1-100pages", defaultMemory, 10 * MEG);
    runMergeTest("pdf_sample_1-200pages", defaultMemory, 15 * MEG);
    runMergeTest("pdf_sample_1-300pages", defaultMemory, 25 * MEG);
    runMergeTest("pdf_sample_1-400pages", defaultMemory, 30 * MEG);
- It would be great if openAction behavior was configurable. When documents are merged, we would like them to open on the first page.
Please let us know what you think and if you have any questions. Thank you.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16436055#comment-16436055 ]

Tilman Hausherr commented on PDFBOX-4188:
-----------------------------------------

Yeah, could be; your comment in the other issue sounded to me like there would be some fine-tuning. A compromise would be to do something limited for 2.0, like your patch, and more fine-tuning in 3.0, so we'd have more time and could change the API.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435960#comment-16435960 ]

Maruan Sahyoun commented on PDFBOX-4188:
----------------------------------------

So you fear that, in addition to AcroFormMergeMode and DocumentMergeMode, there will be others? What about using an EnumSet with a common enum for all options?
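As a rough illustration of the EnumSet idea raised in this comment, here is a minimal sketch. The `MergeOption` enum and its constants are hypothetical and are not part of the PDFBox API; the point is only that a single enum combined through `java.util.EnumSet` lets new options be added without new setter methods.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical option flags; these names are NOT part of the PDFBox API. */
enum MergeOption {
    JOIN_ACROFORM_FIELDS,
    OPTIMIZE_RESOURCES,
    CLOSE_SOURCES_EARLY
}

final class MergeOptionsDemo {

    /** Renders the selected options; EnumSet iterates in declaration order. */
    static String describe(Set<MergeOption> options) {
        StringBuilder sb = new StringBuilder();
        for (MergeOption option : options) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(option.name());
        }
        return sb.length() == 0 ? "(defaults)" : sb.toString();
    }

    public static void main(String[] args) {
        Set<MergeOption> options =
                EnumSet.of(MergeOption.CLOSE_SOURCES_EARLY, MergeOption.JOIN_ACROFORM_FIELDS);
        System.out.println(describe(options));
        System.out.println(describe(EnumSet.noneOf(MergeOption.class)));
    }
}
```

One trade-off of this design is that mutually exclusive modes are no longer prevented by the type system and would need a runtime check.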
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435917#comment-16435917 ]

Tilman Hausherr commented on PDFBOX-4188:
-----------------------------------------

I like the new method, but I wonder whether the enum is future-proof if more options are coming.

I also like the test in the patch, but it should use the target directory, not the src directory. We can't use the PDF file; we'll need another, maybe from the existing test files. Maybe choose one from pdfbox\src\test\resources\input, e.g. PDFBOX-3110-poems-beads.pdf. Or create one using something from "Hamlet".
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435734#comment-16435734 ]

Maruan Sahyoun commented on PDFBOX-4188:
----------------------------------------

With the patch, setting {{defaultMemory}} to {{4 * MEG}} or above means a ScratchFile is no longer generated.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435312#comment-16435312 ]

Maruan Sahyoun commented on PDFBOX-4188:
----------------------------------------

With [^PDFMergerUtility.java-20180412.patch] these are the results:

{noformat}
INFORMATION: Test Name: pdf_sample_1-100pages; Files: 100; Pages: 100; Time(s): 4,112; Pages/Second: 24,319; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1; Total Sources Size(K): 775; Merged File Size(K): 518; Ratio MaxStorageBytes/Merged File Size: 1
INFORMATION: Test Name: pdf_sample_1-200pages; Files: 200; Pages: 200; Time(s): 3,481; Pages/Second: 57,455; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1; Total Sources Size(K): 1.551; Merged File Size(K): 1.038; Ratio MaxStorageBytes/Merged File Size: 0
INFORMATION: Test Name: pdf_sample_1-300pages; Files: 300; Pages: 300; Time(s): 3,746; Pages/Second: 80,085; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 2; Total Sources Size(K): 2.327; Merged File Size(K): 1.558; Ratio MaxStorageBytes/Merged File Size: 1
INFORMATION: Test Name: pdf_sample_1-400pages; Files: 400; Pages: 400; Time(s): 4,959; Pages/Second: 80,661; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 4; Total Sources Size(K): 3.103; Merged File Size(K): 2.078; Ratio MaxStorageBytes/Merged File Size: 1
INFORMATION: Summary: Pages: 1000, Time(s): 16,298, Pages/Second: 61,357
{noformat}

which I was able to run with the following settings:

{noformat}
runMergeTest("pdf_sample_1-100pages", defaultMemory, 1 * MEG);
runMergeTest("pdf_sample_1-200pages", defaultMemory, 1 * MEG);
runMergeTest("pdf_sample_1-300pages", defaultMemory, 2 * MEG);
runMergeTest("pdf_sample_1-400pages", defaultMemory, 4 * MEG);
{noformat}

Of course, this is a quick-and-dirty implementation/test to verify that closing alone will bring the requirements down.
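Why closing sources early lowers the requirement can be sketched with a toy model. This is not the patch itself: it assumes each source contributes its pages to the output and that closing a source releases everything cached for it, which glosses over whatever sharing between source and destination documents makes early closing hard in practice. All names are hypothetical.

```java
/** Toy model of peak cache use during a merge, counted in pages (hypothetical names). */
final class PeakCacheModel {

    /** Peak when every source stays cached until the output has been written. */
    static long peakKeepingSourcesOpen(int files, int pagesPerFile) {
        long openSourcePages = 0;
        long outputPages = 0;
        long peak = 0;
        for (int i = 0; i < files; i++) {
            openSourcePages += pagesPerFile; // source is loaded and kept open
            outputPages += pagesPerFile;     // its pages are appended to the output
            peak = Math.max(peak, openSourcePages + outputPages);
        }
        return peak;
    }

    /** Peak when each source is closed, and its cache released, right after it is appended. */
    static long peakClosingSourcesEarly(int files, int pagesPerFile) {
        long outputPages = 0;
        long peak = 0;
        for (int i = 0; i < files; i++) {
            long openSourcePages = pagesPerFile; // only the current source is cached
            outputPages += pagesPerFile;
            peak = Math.max(peak, openSourcePages + outputPages);
            // the source would be closed here, releasing its pages
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("keep open:   " + peakKeepingSourcesOpen(400, 1));
        System.out.println("close early: " + peakClosingSourcesEarly(400, 1));
    }
}
```

Under these assumptions the peak for 400 one-page files drops from 800 pages to 401.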
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435230#comment-16435230 ]

Maruan Sahyoun commented on PDFBOX-4188:
----------------------------------------

On my machine the tests fail with the following settings:

{quote}
runMergeTest("pdf_sample_1-100pages", defaultMemory, 70 * MEG);
runMergeTest("pdf_sample_1-200pages", defaultMemory, 310 * MEG);
runMergeTest("pdf_sample_1-300pages", defaultMemory, 700 * MEG);
runMergeTest("pdf_sample_1-400pages", defaultMemory, 1200 * MEG);
{quote}
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434982#comment-16434982 ]

Maruan Sahyoun commented on PDFBOX-4188:
----------------------------------------

Good timing, as I just wanted to start working on PDFBOX-4182; this will allow us to test whether there is some improvement. What's the idea of the patch you are working on? Should I wait for that?
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434644#comment-16434644 ]

Gary Potagal commented on PDFBOX-4188:
--------------------------------------

I'm working on merging the patch that we did for 2.0.4 into the current trunk. I'll try to have it available shortly for your review.
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434416#comment-16434416 ]

Gary Potagal commented on PDFBOX-4188:
--------------------------------------

[~tilman] - we created a breaking test and it's attached: [^PDFBOX-4188-breakingTest.zip]. The patch is binary, so you would need to apply it in the checked-out trunk directory using the command:

trunk> patch -p0 --binary -i PDFBOX-4188-breakingTest.diff
patching file pdfbox/src/test/java/org/apache/pdfbox/multipdf/PdfMergeUtilityPagesTest.java
patching file pdfbox/src/test/resources/input/merge/pages/pdf_sample_1.pdf

The test does the following:
# Creates four folders containing copies of the simple one-page pdf_sample_1.pdf file. Each folder contains an increasing number of copies, starting with 100, so it's 100, 200, 300, 400. Each file is about 8K.
# Merges all files in each folder.

The numbers in the test for maxStorageBytes are just enough to let the test pass. If you decrease them slightly, the exception will be thrown.

Output looks like this:

Apr 11, 2018 2:25:32 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest runMergeTest
INFO: Test Name: pdf_sample_1-100pages; Files: 100; Pages: 100; Time(s): 0.781; Pages/Second: 128.041; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 74; Total Sources Size(K): 775; Merged File Size(K): 522; Ratio MaxStorageBytes/Merged File Size: 145
Apr 11, 2018 2:25:34 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest runMergeTest
INFO: Test Name: pdf_sample_1-200pages; Files: 200; Pages: 200; Time(s): 1.486; Pages/Second: 134.590; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 315; Total Sources Size(K): 1,551; Merged File Size(K): 1,042; Ratio MaxStorageBytes/Merged File Size: 309
Apr 11, 2018 2:25:37 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest runMergeTest
INFO: Test Name: pdf_sample_1-300pages; Files: 300; Pages: 300; Time(s): 3.532; Pages/Second: 84.938; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 710; Total Sources Size(K): 2,327; Merged File Size(K): 1,562; Ratio MaxStorageBytes/Merged File Size: 465
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest runMergeTest
INFO: Test Name: pdf_sample_1-400pages; Files: 400; Pages: 400; Time(s): 4.677; Pages/Second: 85.525; MaxMainMemoryBytes(MB): 10; MaxStorageBytes(MB): 1,240; Total Sources Size(K): 3,103; Merged File Size(K): 2,082; Ratio MaxStorageBytes/Merged File Size: 609
Apr 11, 2018 2:25:42 PM org.apache.pdfbox.multipdf.PdfMergeUtilityPagesTest testPerformanceMerge
INFO: Summary: Pages: 1000, Time(s): 10.476, Pages/Second: 95.456

As you can see, to merge 400 one-page 8K files, we need to set maxStorageBytes to ~1.2 GB. The resulting file is ~2,000 K.
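The "Ratio MaxStorageBytes/Merged File Size" column in the breaking-test output can be reproduced from the other columns. The thread does not show the test's actual formula, so the integer division below is an assumption that happens to match all four reported values; the class name is hypothetical.

```java
/** Recomputes the storage-to-output ratio from the breaking-test log (hypothetical helper). */
final class StorageRatio {

    /**
     * Ratio of the storage limit (given in MB) to the merged file size (given in KB),
     * truncated to a whole number. Assumed formula; it matches the logged values.
     */
    static long ratio(long maxStorageMb, long mergedFileKb) {
        return maxStorageMb * 1024 / mergedFileKb;
    }

    public static void main(String[] args) {
        long[] files = {100, 200, 300, 400};
        long[] storageMb = {74, 315, 710, 1240};
        long[] mergedKb = {522, 1042, 1562, 2082};
        for (int i = 0; i < files.length; i++) {
            // A roughly constant MB-per-files-squared value indicates quadratic growth.
            double mbPerFileSquared = (double) storageMb[i] / (files[i] * files[i]);
            System.out.printf("%d files: ratio=%d, MB/files^2=%.4f%n",
                    files[i], ratio(storageMb[i], mergedKb[i]), mbPerFileSquared);
        }
    }
}
```

The required storage grows about 16.8-fold when the file count grows 4-fold (74 MB to 1,240 MB), which is close to quadratic and consistent with one output-sized partition per file.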
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434191#comment-16434191 ]

Gary Potagal commented on PDFBOX-4188:
--

We don't know what PDFs we're going to get and are trying to make this generic.

> "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
> --
>
> Key: PDFBOX-4188
> URL: https://issues.apache.org/jira/browse/PDFBOX-4188
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.9, 3.0.0 PDFBox
> Reporter: Gary Potagal
> Priority: Major
>
> On 06.04.2018 at 23:10, Gary Potagal wrote:
>
> We wanted to address one more merge issue in
> org.apache.pdfbox.multipdf.PDFMergerUtility#mergeDocuments(org.apache.pdfbox.io.MemoryUsageSetting).
> We need to merge a large number of small files. We use mixed mode, memory
> and disk for cache. Initially, we would often get "Maximum allowed scratch
> file memory exceeded.", unless we turned off the check by passing "-1" to
> org.apache.pdfbox.io.MemoryUsageSetting#MemoryUsageSetting. I believe this
> is what the users who opened
> https://issues.apache.org/jira/browse/PDFBOX-3721
> were running into.
>
> Our research indicates that the core issue with the memory model is that
> instead of sharing a single cache, it breaks it up into equal-sized fixed
> partitions based on the number of input + output files being merged. This
> means that each partition must be big enough to hold the final output file.
> When 400 1-page files are merged, this creates 401 partitions, each of
> which needs to be big enough to hold the final 400 pages. Even worse, the
> merge algorithm needs to keep all files open until the end.
>
> Given this, near the end of the merge, we're actually caching 400 x 1-page
> input files and 1 x 400-page output file, or 800 pages.
>
> However, with the partitioned cache, we need to declare room for 401 x
> 400 pages, or 160,400 pages in total, when specifying "maxStorageBytes". This
> would be a very high number, usually in gigabytes.
>
> Given the current limitation that we need to keep all the input files open
> until the output file is written (HUGE), we came up with 2 options. See
> (https://issues.apache.org/jira/browse/PDFBOX-4182)
>
> 1. Good: Split the cache in ½, give ½ to the output file, and segment the
> other ½ across the input files. (Still keeping them open until the end.)
> 2. Better: Dynamically allocate in 16-page (64K) chunks from memory or disk
> on demand, and release cache as documents are closed after merge. This is our
> current implementation until PDFBOX-3999, PDFBOX-4003 and PDFBOX-4004 are
> addressed.
>
> We would like to submit our current implementation as a patch for 2.0.10 and
> 3.0.0, unless this is already addressed.
>
> Thank you

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
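To make the arithmetic in the quoted description easier to follow, here is a rough sketch in plain Java. Nothing in it is PDFBox code: the class and method names are hypothetical, "pages" is used the same loose way as in the description, and the half-split formula is only one possible reading of option 1. Note that counting 400 one-page inputs plus one 400-page output directly gives 800 pages.

```java
/** Hypothetical illustration of the cache sizing arithmetic from the issue description. */
final class MergeCacheMath {

    /** Pages really held near the end of the merge: every input plus the merged output. */
    static long pagesActuallyCached(int inputFiles, int pagesPerInput) {
        long inputPages = (long) inputFiles * pagesPerInput;
        long outputPages = inputPages; // the output contains every input page
        return inputPages + outputPages;
    }

    /** Equal fixed partitions: (inputs + 1 output) partitions, each sized for the full output. */
    static long pagesDeclaredPartitioned(int inputFiles, int pagesPerInput) {
        long outputPages = (long) inputFiles * pagesPerInput;
        return (inputFiles + 1L) * outputPages;
    }

    /**
     * Assumed reading of option 1: half of the cache for the output, the other half
     * split evenly across the inputs, so each half must fit its largest demand.
     */
    static long pagesDeclaredHalfSplit(int inputFiles, int pagesPerInput) {
        long outputHalf = (long) inputFiles * pagesPerInput; // room for the output
        long inputHalf = (long) inputFiles * pagesPerInput;  // one segment per input
        return 2 * Math.max(outputHalf, inputHalf);
    }

    public static void main(String[] args) {
        int files = 400;
        int pagesPerFile = 1;
        System.out.println("actually cached:      " + pagesActuallyCached(files, pagesPerFile));
        System.out.println("declared partitioned: " + pagesDeclaredPartitioned(files, pagesPerFile));
        System.out.println("declared half split:  " + pagesDeclaredHalfSplit(files, pagesPerFile));
    }
}
```

For 400 one-page files, the partitioned model asks for 160,400 pages of declared capacity against 800 pages actually cached, roughly a 200-fold overestimate, and the gap grows with the square of the file count.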
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434188#comment-16434188 ]

Maruan Sahyoun commented on PDFBOX-4188:
--

Do the documents you are merging use the elements described in PDFBOX-3999 and PDFBOX-4003?
[jira] [Commented] (PDFBOX-4188) "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs
[ https://issues.apache.org/jira/browse/PDFBOX-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16434177#comment-16434177 ]

Gary Potagal commented on PDFBOX-4188:
--

From: Tilman Hausherr [mailto:thaush...@t-online.de]
Sent: Saturday, April 07, 2018 1:48 AM
To: dev@pdfbox.apache.org
Subject: Re: "Maximum allowed scratch file memory exceeded." Exception when merging large number of small PDFs

Hi,

Please also have a look at the comments in
https://issues.apache.org/jira/browse/PDFBOX-4182

Please submit your patch proposal there or in a new issue. It should be against the trunk. Note that this doesn't mean your patch will be accepted; it just means I'd like to see it, because I haven't fully understood your post and many attachment types don't get through here.

A breaking test would be interesting: is it possible to use (or better, create) 400 identical small PDFs, merge them, and see whether it breaks?

Tilman