[jira] [Comment Edited] (ARROW-9035) [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default
[ https://issues.apache.org/jira/browse/ARROW-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126099#comment-17126099 ] Anthony Abate edited comment on ARROW-9035 at 6/4/20, 5:38 PM: --- Yes - I didn't realize it was configurable - it probably works (but I'll know soon if it doesn't). I thought the docs sections were in conflict, but now I realize that 8-byte alignment is the 'requirement', not 64 (64 is still a multiple of 8). was (Author: abbot): yes - I didn't realize it was configurable - it probably works but i'll know soon if it doesnt) I thought the docs sections were in conflict - but now I realize that 8 byte alignment is the 'requirement' not 64.. (64 iis still a multiple of 8) > [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default > --- > > Key: ARROW-9035 > URL: https://issues.apache.org/jira/browse/ARROW-9035 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation >Affects Versions: 0.17.0 >Reporter: Anthony Abate >Priority: Minor > > I used the C++ library to create a very small arrow file (1 field of 5 int32) > and was surprised that the buffers are not aligned to 64 bytes as per the > documentation section "Buffer Alignment and Padding" with examples. Based on > the examples there, the 20 bytes of int32 should be padded to 64 bytes, but > it is only 24 (see below). > Extract of the message metadata - see bodyLength = 24:
> {code:java}
> {
>   version: "V4",
>   header_type: "RecordBatch",
>   header: {
>     nodes: [
>       { length: 5, null_count: 0 }
>     ],
>     buffers: [
>       { offset: 0, length: 0 },
>       { offset: 0, length: 20 }
>     ]
>   },
>   bodyLength: 24
> }
> {code}
> Reading further down, the documentation section "Encapsulated message format" > says serialization should use 8-byte alignment. > These two sections seem at odds with each other and some clarification is needed. > Is the documentation wrong? > Or > Should 8-byte alignment be used for the File format and 64-byte for IPC? -- This message was sent by Atlassian Jira (v8.3.4#803005)
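For reference, the relationship between the two alignments is just rounding a buffer length up to a multiple of 8 or 64. A minimal C# sketch of the arithmetic (the helper name and constants are illustrative, not from any Arrow library):

{code:java}
using System;

class AlignmentDemo
{
    // Round a length up to the next multiple of 'alignment'
    // (alignment must be a power of two, e.g. 8 or 64).
    static long PadTo(long length, long alignment) =>
        (length + alignment - 1) & ~(alignment - 1);

    static void Main()
    {
        long values = 5 * sizeof(int);         // 20 bytes of int32 data
        Console.WriteLine(PadTo(values, 8));   // 24 -> matches the bodyLength above
        Console.WriteLine(PadTo(values, 64));  // 64 -> what the padding section implies
    }
}
{code}

With the writer's default 8-byte alignment the 20-byte values buffer pads to 24, which is exactly the observed bodyLength; a 64-byte alignment setting would pad it to 64.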
[jira] [Commented] (ARROW-9035) [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default
[ https://issues.apache.org/jira/browse/ARROW-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126099#comment-17126099 ] Anthony Abate commented on ARROW-9035: -- Yes - I didn't realize it was configurable - it probably works (but I'll know soon if it doesn't). I thought the docs sections were in conflict, but now I realize that 8-byte alignment is the 'requirement', not 64 (64 is still a multiple of 8). > [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default > --- > > Key: ARROW-9035 > URL: https://issues.apache.org/jira/browse/ARROW-9035 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation >Affects Versions: 0.17.0 >Reporter: Anthony Abate >Priority: Minor > > I used the C++ library to create a very small arrow file (1 field of 5 int32) > and was surprised that the buffers are not aligned to 64 bytes as per the > documentation section "Buffer Alignment and Padding" with examples. Based on > the examples there, the 20 bytes of int32 should be padded to 64 bytes, but > it is only 24 (see below). > Extract of the message metadata - see bodyLength = 24:
> {code:java}
> {
>   version: "V4",
>   header_type: "RecordBatch",
>   header: {
>     nodes: [
>       { length: 5, null_count: 0 }
>     ],
>     buffers: [
>       { offset: 0, length: 0 },
>       { offset: 0, length: 20 }
>     ]
>   },
>   bodyLength: 24
> }
> {code}
> Reading further down, the documentation section "Encapsulated message format" > says serialization should use 8-byte alignment. > These two sections seem at odds with each other and some clarification is needed. > Is the documentation wrong? > Or > Should 8-byte alignment be used for the File format and 64-byte for IPC? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-9035) [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default
[ https://issues.apache.org/jira/browse/ARROW-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126099#comment-17126099 ] Anthony Abate edited comment on ARROW-9035 at 6/4/20, 5:37 PM: --- Yes - I didn't realize it was configurable - it probably works (but I'll know soon if it doesn't). I thought the docs sections were in conflict, but now I realize that 8-byte alignment is the 'requirement', not 64 (64 is still a multiple of 8). was (Author: abbot): yes - I didn't realize it was configurable - it probably works but i'll know soon if it doesnt) I thought the docs sections were in conflict - but now I realize that 8 byte alignment is the 'requirement' not 64.. (after 64 iis still a multiple of 8) > [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default > --- > > Key: ARROW-9035 > URL: https://issues.apache.org/jira/browse/ARROW-9035 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation >Affects Versions: 0.17.0 >Reporter: Anthony Abate >Priority: Minor > > I used the C++ library to create a very small arrow file (1 field of 5 int32) > and was surprised that the buffers are not aligned to 64 bytes as per the > documentation section "Buffer Alignment and Padding" with examples. Based on > the examples there, the 20 bytes of int32 should be padded to 64 bytes, but > it is only 24 (see below). > Extract of the message metadata - see bodyLength = 24:
> {code:java}
> {
>   version: "V4",
>   header_type: "RecordBatch",
>   header: {
>     nodes: [
>       { length: 5, null_count: 0 }
>     ],
>     buffers: [
>       { offset: 0, length: 0 },
>       { offset: 0, length: 20 }
>     ]
>   },
>   bodyLength: 24
> }
> {code}
> Reading further down, the documentation section "Encapsulated message format" > says serialization should use 8-byte alignment. > These two sections seem at odds with each other and some clarification is needed. > Is the documentation wrong? > Or > Should 8-byte alignment be used for the File format and 64-byte for IPC? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-9035) 8 vs 64 byte alignment
[ https://issues.apache.org/jira/browse/ARROW-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126052#comment-17126052 ] Anthony Abate commented on ARROW-9035: -- Perhaps in RFC 2119 terms ([https://tools.ietf.org/html/rfc2119]) the doc should say: all buffers (metadata (flatbuffers) and data buffers) MUST be 8-byte aligned but SHOULD be 64-byte aligned - this would apply to both sections. With most of the docs stressing 64-byte alignment, I didn't realize the default alignment in the C++ library is 8 bytes; I assumed it would be 64 bytes. > 8 vs 64 byte alignment > -- > > Key: ARROW-9035 > URL: https://issues.apache.org/jira/browse/ARROW-9035 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Documentation >Affects Versions: 0.17.0 >Reporter: Anthony Abate >Priority: Minor > > I used the C++ library to create a very small arrow file (1 field of 5 int32) > and was surprised that the buffers are not aligned to 64 bytes as per the > documentation section "Buffer Alignment and Padding" with examples. Based on > the examples there, the 20 bytes of int32 should be padded to 64 bytes, but > it is only 24 (see below). > Extract of the message metadata - see bodyLength = 24:
> {code:java}
> {
>   version: "V4",
>   header_type: "RecordBatch",
>   header: {
>     nodes: [
>       { length: 5, null_count: 0 }
>     ],
>     buffers: [
>       { offset: 0, length: 0 },
>       { offset: 0, length: 20 }
>     ]
>   },
>   bodyLength: 24
> }
> {code}
> Reading further down, the documentation section "Encapsulated message format" > says serialization should use 8-byte alignment. > These two sections seem at odds with each other and some clarification is needed. > Is the documentation wrong? > Or > Should 8-byte alignment be used for the File format and 64-byte for IPC? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-9035) 8 vs 64 byte alignment
Anthony Abate created ARROW-9035: Summary: 8 vs 64 byte alignment Key: ARROW-9035 URL: https://issues.apache.org/jira/browse/ARROW-9035 Project: Apache Arrow Issue Type: Bug Components: C++, Documentation Affects Versions: 0.17.0 Reporter: Anthony Abate I used the C++ library to create a very small arrow file (1 field of 5 int32) and was surprised that the buffers are not aligned to 64 bytes as per the documentation section "Buffer Alignment and Padding" with examples. Based on the examples there, the 20 bytes of int32 should be padded to 64 bytes, but it is only 24 (see below). Extract of the message metadata - see bodyLength = 24:
{code:java}
{
  version: "V4",
  header_type: "RecordBatch",
  header: {
    nodes: [
      { length: 5, null_count: 0 }
    ],
    buffers: [
      { offset: 0, length: 0 },
      { offset: 0, length: 20 }
    ]
  },
  bodyLength: 24
}
{code}
Reading further down, the documentation section "Encapsulated message format" says serialization should use 8-byte alignment. These two sections seem at odds with each other and some clarification is needed. Is the documentation wrong? Or should 8-byte alignment be used for the File format and 64-byte for IPC? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs
[ https://issues.apache.org/jira/browse/ARROW-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010293#comment-17010293 ] Anthony Abate commented on ARROW-7511: -- Now I remember why I thought Memory<T> and Span<T> can't support more than 2 gigs: the *.Slice()* function only takes Int32 arguments - https://docs.microsoft.com/en-us/dotnet/api/system.memory-1.slice?view=netcore-3.1#System_Memory_1_Slice_System_Int32_System_Int32_ > [C#] - Batch / Data Size Can't Exceed 2 gigs > > > Key: ARROW-7511 > URL: https://issues.apache.org/jira/browse/ARROW-7511 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Priority: Major > > While the Arrow spec does not forbid batches larger than 2 gigs, the C# > library cannot support this in its current form due to limits on managed > memory, as it tries to put the whole batch into a single > Span<byte>/Memory<byte>. > It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the > entire batch, and instead moving the memory mapping to the ArrowBuffers. This > only moves the problem 'lower', as it would then still limit the column data > in a single batch to 2 gigs. > This seems like plenty of memory... but if you think of string columns, the > data is just one giant string appended together with offsets, and it can > get very large quickly. > I think the unfortunate problem is that memory management in the C# managed > world is always going to hit the 2 gig limit somewhere (please correct me if > I am wrong on this statement, but I thought I read somewhere that Memory<T> > / Span<T> are limited to int and changing to long would require major > framework rewrites - but I may be conflating that with arrays). > That ultimately means the C# library either has to reject files with certain > characteristics (i.e. validation checks on opening), or the spec needs to put > upper limits on certain internal arrow constructs (i.e. the arrow buffer) to > eliminate the need for more than 2 gigs of contiguous memory for the > smallest arrow object. > However, if the spec was indeed designed to allow the smallest buffer object to be > larger than 2 gigs, or for the entire memory buffer of arrow to be > contiguous, one has to wonder if at some point it might just make sense for > the C# library to use the C++ library as its memory manager, as replicating > very large blocks of memory is more work than it's worth. > In any case, this issue is more about 'deferring' the 2 gig size problem by > moving it down to the buffer objects... This might require some rewrite of > the batch data structures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
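Since *Memory<T>.Slice* only accepts Int32 offsets and lengths, one way to defer the limit is to never hold the data as a single Memory<byte> and instead address it as a list of sub-2-gig segments. A rough sketch of that idea (entirely hypothetical - this is not the Apache.Arrow API):

{code:java}
using System;
using System.Collections.Generic;

// Hypothetical: a long-addressable buffer assembled from int-sized segments,
// so no single Memory<byte> ever approaches the 2 gig limit.
class ChunkedBuffer
{
    private const int SegmentSize = int.MaxValue / 2; // stay well under 2 gigs
    private readonly List<Memory<byte>> _segments = new List<Memory<byte>>();

    public ChunkedBuffer(long totalLength)
    {
        for (long remaining = totalLength; remaining > 0; remaining -= SegmentSize)
            _segments.Add(new byte[(int)Math.Min(remaining, SegmentSize)]);
    }

    // Map a 64-bit logical offset onto (segment index, Int32 local offset).
    public byte ReadByte(long offset)
    {
        int segment = (int)(offset / SegmentSize);
        int local = (int)(offset % SegmentSize);
        return _segments[segment].Span[local];
    }
}
{code}

This moves the 2 gig ceiling from the whole batch down to an individual segment, which is the same 'defer the problem lower' trade-off the issue describes.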
[jira] [Updated] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs
[ https://issues.apache.org/jira/browse/ARROW-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7511: - Description: While the Arrow spec does not forbid batches larger than 2 gigs, the C# library cannot support this in its current form due to limits on managed memory, as it tries to put the whole batch into a single Span<byte>/Memory<byte>. It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the entire batch, and instead moving the memory mapping to the ArrowBuffers. This only moves the problem 'lower', as it would then still limit the column data in a single batch to 2 gigs. This seems like plenty of memory... but if you think of string columns, the data is just one giant string appended together with offsets, and it can get very large quickly. I think the unfortunate problem is that memory management in the C# managed world is always going to hit the 2 gig limit somewhere (please correct me if I am wrong on this statement, but I thought I read somewhere that Memory<T> / Span<T> are limited to int and changing to long would require major framework rewrites - but I may be conflating that with arrays). That ultimately means the C# library either has to reject files with certain characteristics (i.e. validation checks on opening), or the spec needs to put upper limits on certain internal arrow constructs (i.e. the arrow buffer) to eliminate the need for more than 2 gigs of contiguous memory for the smallest arrow object. However, if the spec was indeed designed to allow the smallest buffer object to be larger than 2 gigs, or for the entire memory buffer of arrow to be contiguous, one has to wonder if at some point it might just make sense for the C# library to use the C++ library as its memory manager, as replicating very large blocks of memory is more work than it's worth. In any case, this issue is more about 'deferring' the 2 gig size problem by moving it down to the buffer objects... This might require some rewrite of the batch data structures. was: While the Arrow spec does not forbid batches larger than 2 gigs, the C# library can not support this in its current form due to limits on managed memory as it tries to put the whole batch into a single Span/Memory It is possible to fix this by not trying to use Memory/Span/byte[] for the entire Batch.. and instead move the memory mapping to the ArrowBuffers. This only move the problem 'lower' as it would then still set the limit of a Column Data in a single batch to be 2 Gigs. This seems like plenty of memory... but if you think of strings columns, the data is just one giant string appended to together with offsets and it can get very large quickly. I think the unfortunate problem is that memory management in the C# managed world is always going to hit the 2 gig limit somewhere. (please correct me if I am wrong on this statement) That ultimately means the C# library either has to reject files of certain characteristics (ie validation checks on opening) , or the spec needs put upper limits on certain internal arrow constructs (ie arrow buffer) to eliminate the need for more than a 2 gigs of contiguous memory for the smallest arrow object. However, If the spec was indeed designed for the smallest buffer object to be larger than 2 gigs, or for the entire memory buffer of arrow to be contiguous, one has to wonder if at some point, it might just make sense for the C# library to use the C++ library as its memory manager as replicating a very large blocks of memory more work than its wroth. In any case, this issue is more about 'deferring' the 2 gig size problem by moving it down to the buffer objects... This might require some re-write of the batch data structures > [C#] - Batch / Data Size Can't Exceed 2 gigs > > > Key: ARROW-7511 > URL: https://issues.apache.org/jira/browse/ARROW-7511 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Priority: Major > > While the Arrow spec does not forbid batches larger than 2 gigs, the C# > library cannot support this in its current form due to limits on managed > memory, as it tries to put the whole batch into a single > Span<byte>/Memory<byte>. > It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the > entire batch, and instead moving the memory mapping to the ArrowBuffers. This > only moves the problem 'lower', as it would then still limit the column data > in a single batch to 2 gigs. > This seems like plenty of memory... but if you think of string columns, the > data is just one giant string appended together with offsets, and it can > get very large quickly. > I think the unfortunate problem is that memory management in the C#
[jira] [Created] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs
Anthony Abate created ARROW-7511: Summary: [C#] - Batch / Data Size Can't Exceed 2 gigs Key: ARROW-7511 URL: https://issues.apache.org/jira/browse/ARROW-7511 Project: Apache Arrow Issue Type: Bug Components: C# Affects Versions: 0.15.1 Reporter: Anthony Abate While the Arrow spec does not forbid batches larger than 2 gigs, the C# library cannot support this in its current form due to limits on managed memory, as it tries to put the whole batch into a single Span<byte>/Memory<byte>. It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the entire batch, and instead moving the memory mapping to the ArrowBuffers. This only moves the problem 'lower', as it would then still limit the column data in a single batch to 2 gigs. This seems like plenty of memory... but if you think of string columns, the data is just one giant string appended together with offsets, and it can get very large quickly. I think the unfortunate problem is that memory management in the C# managed world is always going to hit the 2 gig limit somewhere (please correct me if I am wrong on this statement). That ultimately means the C# library either has to reject files with certain characteristics (i.e. validation checks on opening), or the spec needs to put upper limits on certain internal arrow constructs (i.e. the arrow buffer) to eliminate the need for more than 2 gigs of contiguous memory for the smallest arrow object. However, if the spec was indeed designed to allow the smallest buffer object to be larger than 2 gigs, or for the entire memory buffer of arrow to be contiguous, one has to wonder if at some point it might just make sense for the C# library to use the C++ library as its memory manager, as replicating very large blocks of memory is more work than it's worth. In any case, this issue is more about 'deferring' the 2 gig size problem by moving it down to the buffer objects... This might require some rewrite of the batch data structures. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7508) [C#] DateTime32 Reading is Broken
[ https://issues.apache.org/jira/browse/ARROW-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7508: - Summary: [C#] DateTime32 Reading is Broken (was: [C#] DateTime Reading is Broken) > [C#] DateTime32 Reading is Broken > - > > Key: ARROW-7508 > URL: https://issues.apache.org/jira/browse/ARROW-7508 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Assignee: Anthony Abate >Priority: Critical > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > DateTime support for writing works - but reading is broken. > This is another arithmetic overflow bug (reported a few already) which is > causing dates to be misinterpreted. > I extracted the current logic out to LINQPad to show the bug and fix:
> {code:java}
> var dto = DateTimeOffset.Parse("2024-09-25");
> (dto.ToUnixTimeMilliseconds() / 86400000).Dump();
> // YIELDS: 19991
>
> unchecked // current code
> {
>     DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
>     // 1/8/1970 WRONG
> }
> checked
> {
>     DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
>     // 9/25/2024 CORRECT
> }
> {code}
> The fix is trivial - a cast to long is missing wherever *FromUnixTimeMilliseconds* is used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
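For context on the fix: the raw Date32 value is a day count, and the corrected conversion just needs the multiplication to happen in 64-bit arithmetic. A minimal sketch of the corrected read path (the class and method names are illustrative, not the library's actual API):

{code:java}
using System;

static class Date32Reader
{
    private const long MillisecondsPerDay = 86400000;

    // 'days' is the raw Date32 value: days since the Unix epoch.
    // Because MillisecondsPerDay is a long, the multiplication runs in
    // 64-bit arithmetic, avoiding the Int32 overflow that turned
    // 9/25/2024 into 1/8/1970.
    public static DateTimeOffset FromDate32(int days) =>
        DateTimeOffset.FromUnixTimeMilliseconds(days * MillisecondsPerDay);
}
{code}

Date32Reader.FromDate32(19991) yields 2024-09-25, matching the checked example above.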
[jira] [Updated] (ARROW-7509) [C#] Turn on Checked mode for debug builds
[ https://issues.apache.org/jira/browse/ARROW-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7509: - Summary: [C#] Turn on Checked mode for debug builds (was: Turn on Checked mode for debug builds) > [C#] Turn on Checked mode for debug builds > -- > > Key: ARROW-7509 > URL: https://issues.apache.org/jira/browse/ARROW-7509 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Priority: Minor > > Anyone object to turning on checked mode for debug builds? > There have been many arithmetic overflow bugs. These could have been caught > earlier simply by running the code with checked turned on. > Then the unit tests could be run in debug mode and any obvious overflow bugs > might be caught -- This message was sent by Atlassian Jira (v8.3.4#803005)
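For anyone looking at the mechanics: in MSBuild this is the CheckForOverflowUnderflow property, which can be scoped to Debug builds only. A sketch of the csproj change (placement and condition are illustrative; the real project files may organize configurations differently):

{code:xml}
<PropertyGroup Condition="'$(Configuration)' == 'Debug'">
  <!-- Compile with /checked: arithmetic overflow throws OverflowException
       instead of silently wrapping -->
  <CheckForOverflowUnderflow>true</CheckForOverflowUnderflow>
</PropertyGroup>
{code}

With that in place, running the existing unit tests under Debug would surface overflow bugs like the ones above as thrown exceptions.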
[jira] [Created] (ARROW-7509) Turn on Checked mode for debug builds
Anthony Abate created ARROW-7509: Summary: Turn on Checked mode for debug builds Key: ARROW-7509 URL: https://issues.apache.org/jira/browse/ARROW-7509 Project: Apache Arrow Issue Type: Improvement Components: C# Affects Versions: 0.15.1 Reporter: Anthony Abate Anyone object to turning on checked mode for debug builds? There have been many arithmetic overflow bugs. These could have been caught earlier simply by running the code with checked turned on. Then the unit tests could be run in debug mode and any obvious overflow bugs might be caught -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7508) DateTime Reading is Broken
[ https://issues.apache.org/jira/browse/ARROW-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7508: - Description: DateTime support for writing works - but reading is broken. This is another arithmetic overflow bug (reported a few already) which is causing dates to be misinterpreted. I extracted the current logic out to LINQPad to show the bug and fix:
{code:java}
var dto = DateTimeOffset.Parse("2024-09-25");
(dto.ToUnixTimeMilliseconds() / 86400000).Dump();
// YIELDS: 19991

unchecked // current code
{
    DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
    // 1/8/1970 WRONG
}
checked
{
    DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
    // 9/25/2024 CORRECT
}
{code}
The fix is trivial - a cast to long is missing wherever *FromUnixTimeMilliseconds* is used. was: DateTime support for writing works - but reading is broken. This another arithmetic overflow bug (reported a few already) which is causing date to be misinterpreted I extracted the current logic out to linqpad and to show the bug and fix: {code:java} var dto = DateTimeOffset.Parse("2024-09-25"); (dto.ToUnixTimeMilliseconds() / 86400000).Dump(); // YIELDS: 19991 unchecked (current code) { DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump(); // 1/8/1970 WRONG } checked { DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump(); // 9/25/2024 CORRECT } {code} this fix is trivial - a cast to long is missing whereever FromUnixTimeMilliseconds is used > DateTime Reading is Broken > -- > > Key: ARROW-7508 > URL: https://issues.apache.org/jira/browse/ARROW-7508 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Assignee: Anthony Abate >Priority: Critical > > DateTime support for writing works - but reading is broken. > This is another arithmetic overflow bug (reported a few already) which is > causing dates to be misinterpreted. > I extracted the current logic out to LINQPad to show the bug and fix:
> {code:java}
> var dto = DateTimeOffset.Parse("2024-09-25");
> (dto.ToUnixTimeMilliseconds() / 86400000).Dump();
> // YIELDS: 19991
>
> unchecked // current code
> {
>     DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
>     // 1/8/1970 WRONG
> }
> checked
> {
>     DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
>     // 9/25/2024 CORRECT
> }
> {code}
> The fix is trivial - a cast to long is missing wherever *FromUnixTimeMilliseconds* is used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7508) DateTime Reading is Broken
Anthony Abate created ARROW-7508: Summary: DateTime Reading is Broken Key: ARROW-7508 URL: https://issues.apache.org/jira/browse/ARROW-7508 Project: Apache Arrow Issue Type: Bug Components: C# Affects Versions: 0.15.1 Reporter: Anthony Abate Assignee: Anthony Abate DateTime support for writing works - but reading is broken. This is another arithmetic overflow bug (reported a few already) which is causing dates to be misinterpreted. I extracted the current logic out to LINQPad to show the bug and fix:
{code:java}
var dto = DateTimeOffset.Parse("2024-09-25");
(dto.ToUnixTimeMilliseconds() / 86400000).Dump();
// YIELDS: 19991

unchecked // current code
{
    DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
    // 1/8/1970 WRONG
}
checked
{
    DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
    // 9/25/2024 CORRECT
}
{code}
The fix is trivial - a cast to long is missing wherever FromUnixTimeMilliseconds is used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls
[ https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007195#comment-17007195 ] Anthony Abate edited comment on ARROW-6603 at 1/3/20 2:41 PM: -- The reason I did this is because, as much as I tried to use the existing API, I think it can't support nullable values correctly, as many assumptions are baked into the generic always being non-nullable. If you see how I implement nullability, I am filling in dummy values in the value buffer for nulls, but correctly setting the validity bitmap... this results in a reader of the arrow file correctly interpreting the NULL. It still might be possible to add a builder method called AppendNullable() to the existing builder code... but I was able to get the code in the PR to work fairly quickly once I understood the flatbuffer spec. was (Author: abbot): The reason I did this is because as much as I tried to use the existing API - but I think it can't support null able correctly as many assumptions are baked into the generic always being non-nullable.. If you see how I implement nullablitity, I am filling in dummy values in the Value Buffer for nulls, but correctly setting the value bitmap... this results in a reader of the arrow file correctly interpreting the NULL. It still might be possible a builder method called AppendNullable() into the existing builder code... but I was able to get the code in the PR to work fairly quickly once I understood the flatbuffer spec > [C#] ArrayBuilder API to support writing nulls > -- > > Key: ARROW-6603 > URL: https://issues.apache.org/jira/browse/ARROW-6603 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Reporter: Eric Erhardt >Assignee: Anthony Abate >Priority: Major > Labels: pull-request-available > Original Estimate: 72h > Time Spent: 10m > Remaining Estimate: 71h 50m > > There is currently no API in the PrimitiveArrayBuilder class to support > writing nulls. See this TODO - > [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.] > > Also see [https://github.com/apache/arrow/issues/5381]. > > We should add some APIs to support writing nulls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
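As a standalone illustration of the approach described in the comment (dummy slot in the value buffer, real information in the validity bitmap), a minimal sketch - the class here is hypothetical, not the PR's actual builder:

{code:java}
using System.Collections.Generic;

// Illustrative only: appending a null writes a placeholder into the value
// buffer and leaves the corresponding validity bit cleared, so readers
// interpret the slot as NULL.
class NullableInt32Builder
{
    private readonly List<int> _values = new List<int>();
    private readonly List<byte> _validity = new List<byte>();

    public void Append(int value) => AppendInternal(value, isValid: true);
    public void AppendNull() => AppendInternal(default, isValid: false);

    private void AppendInternal(int value, bool isValid)
    {
        int index = _values.Count;
        _values.Add(value);                    // dummy value when isValid == false
        if (index % 8 == 0) _validity.Add(0);  // grow the bitmap: one byte per 8 slots
        if (isValid)
            _validity[index / 8] |= (byte)(1 << (index % 8));
    }
}
{code}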
[jira] [Commented] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007222#comment-17007222 ] Anthony Abate commented on ARROW-7040: -- Created a PR for this: https://github.com/apache/arrow/pull/6122 > [C#] System.Memory Span.CopyTo - Crashes on Net Framework > --- > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Assignee: Anthony Abate >Priority: Blocker > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }
> {code}
> It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on.
> Update - it seems that the problem is in the underlying *ArrowBuffer.Builder<byte>*:
> {code:java}
> public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>                 {
>                     ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
>                 }
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }
> {code}
> Update 2: This is due to a confirmed bug in System.Memory - the implication is that Span.CopyTo needs to be removed/replaced. This method is used internally by ArrowBuffer, so I can't work around it easily.
> Solutions:
> # Change the code
> ## Remove it outright (including disabling Span in FlatBuffers)
> ## Create a multi-target nuget where the offending code has compile blocks #if (NETFRAMEWORK) - and disable Span in FlatBuffers only for the .NET Framework build
> # Wait for a System.Memory fix?
> I suspect option 2 won't happen anytime soon. -- This message was sent by Atlassian Jira (v8.3.4#803005)
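Option 1.2 from the list above (compile blocks per target framework) would look roughly like this - a sketch, not the actual PR:

{code:java}
using System;

static class SafeCopy
{
    public static void Copy(ReadOnlySpan<byte> source, Span<byte> destination)
    {
#if NETFRAMEWORK
        // Work around the System.Memory Span.CopyTo crash on .NET Framework
        // with a manual element-wise copy.
        for (int i = 0; i < source.Length; i++)
            destination[i] = source[i];
#else
        source.CopyTo(destination);
#endif
    }
}
{code}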
[jira] [Assigned] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate reassigned ARROW-7040: Assignee: Anthony Abate > [C#] System.Memory Span.CopyTo - Crashes on Net Framework > --- > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Assignee: Anthony Abate >Priority: Blocker > > The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }
> {code}
> It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on.
> Update - it seems that the problem is in the underlying *ArrowBuffer.Builder<byte>*:
> {code:java}
> public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>                 {
>                     ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
>                 }
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }
> {code}
> Update 2: This is due to a confirmed bug in System.Memory - the implication is that Span.CopyTo needs to be removed/replaced. This method is used internally by ArrowBuffer, so I can't work around it easily.
> Solutions:
> # Change the code
> ## Remove it outright (including disabling Span in FlatBuffers)
> ## Create a multi-target nuget where the offending code has compile blocks #if (NETFRAMEWORK) - and disable Span in FlatBuffers only for the .NET Framework build
> # Wait for a System.Memory fix?
> I suspect option 2 won't happen anytime soon. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls
[ https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007195#comment-17007195 ] Anthony Abate commented on ARROW-6603: -- The reason I did this is because, as much as I tried to use the existing API, I think it can't support nullable values correctly, as many assumptions are baked into the generic always being non-nullable. If you see how I implement nullability, I am filling in dummy values in the value buffer for nulls, but correctly setting the validity bitmap... this results in a reader of the arrow file correctly interpreting the NULL. It still might be possible to add a builder method called AppendNullable() to the existing builder code... but I was able to get the code in the PR to work fairly quickly once I understood the flatbuffer spec. > [C#] ArrayBuilder API to support writing nulls > -- > > Key: ARROW-6603 > URL: https://issues.apache.org/jira/browse/ARROW-6603 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Reporter: Eric Erhardt >Assignee: Anthony Abate >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > There is currently no API in the PrimitiveArrayBuilder class to support > writing nulls. See this TODO - > [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.] > > Also see [https://github.com/apache/arrow/issues/5381]. > > We should add some APIs to support writing nulls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls
[ https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007190#comment-17007190 ] Anthony Abate commented on ARROW-6603: -- I added a PR: [https://github.com/apache/arrow/pull/6121]. Note that this does not change the existing API, but can be used in lieu of it when creating record batches. > [C#] ArrayBuilder API to support writing nulls > -- > > Key: ARROW-6603 > URL: https://issues.apache.org/jira/browse/ARROW-6603 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Reporter: Eric Erhardt >Assignee: Anthony Abate >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > There is currently no API in the PrimitiveArrayBuilder class to support > writing nulls. See this TODO - > [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.] > > Also see [https://github.com/apache/arrow/issues/5381]. > > We should add some APIs to support writing nulls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls
[ https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate reassigned ARROW-6603: Assignee: Anthony Abate > [C#] ArrayBuilder API to support writing nulls > -- > > Key: ARROW-6603 > URL: https://issues.apache.org/jira/browse/ARROW-6603 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Reporter: Eric Erhardt >Assignee: Anthony Abate >Priority: Major > Original Estimate: 72h > Remaining Estimate: 72h > > There is currently no API in the PrimitiveArrayBuilder class to support > writing nulls. See this TODO - > [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.] > > Also see [https://github.com/apache/arrow/issues/5381]. > > We should add some APIs to support writing nulls. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7156) [C#] Large record batch is written with negative buffer length
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979727#comment-16979727 ] Anthony Abate commented on ARROW-7156: -- [~wesm] - I should also point out that the C# library on .NET Framework is not even stable in its current state, due to random crashes related to ARROW-7040 (I have a local build that fixes this - so I can make a PR for that one). Regarding the integration tests - since C# is not using the C++ libs, how do the integration tests work? (I can volunteer some of my time on this, but I may have a lot of questions.) > [C#] Large record batch is written with negative buffer length > -- > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > Attachments: SingleBatch_String_7_Rows.ok.rar, > SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png > > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail - all other batches load fine. In 0.14.1 the > individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> > rbn <- data_rbfr$get_batch(x)
> Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) :
>   Invalid: negative malloc size
> {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio!
> {code}
> Update: I put the data in the batch into a separate file. The file size is over 2 gigs. Using 0.15.1, when I try to load this entire file via read_arrow it also fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow")
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>   Invalid: negative malloc size
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973723#comment-16973723 ] Anthony Abate commented on ARROW-7156: -- I uploaded some test files. They're deceptively small compressed, but 2 gigs uncompressed. I have a workaround for now - just make sure my batches are less than 2 gigs. > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > Attachments: SingleBatch_String_7_Rows.ok.rar, > SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png > > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail - all other batches load fine. In 0.14.1 the > individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> > rbn <- data_rbfr$get_batch(x)
> Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) :
>   Invalid: negative malloc size
> {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio!
> {code}
> Update: I put the data in the batch into a separate file. The file size is over 2 gigs. Using 0.15.1, when I try to load this entire file via read_arrow it also fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow")
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>   Invalid: negative malloc size
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
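Since the workaround is on the writing side, the sizing logic is simple arithmetic: cap the number of rows per record batch so the estimated batch size stays under 2 gigs. A hedged sketch (names and the per-row estimate are illustrative; estimating row width for variable-length columns is the hard part):

{code:java}
using System;

static class BatchSizer
{
    private const long MaxBatchBytes = int.MaxValue; // keep each batch under 2 gigs

    // Given an estimated average row width in bytes, return how many rows
    // can safely go into a single record batch.
    public static int RowsPerBatch(long estimatedBytesPerRow)
    {
        if (estimatedBytesPerRow < 1) estimatedBytesPerRow = 1;
        return (int)Math.Min(int.MaxValue, MaxBatchBytes / estimatedBytesPerRow);
    }
}
{code}

For example, at roughly 25 KB per row this caps a batch at about 85,000 rows - close to where the attached 85,000-row repro starts failing.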
[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Attachment: SingleBatch_String_7_Rows.ok.rar > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > Attachments: SingleBatch_String_7_Rows.ok.rar, > SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png > > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1* > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > > Update > I put the data in the batch into a separate file. The file size is over 2 > gigs. > Using 15.1.1, when I try to load this entire file via read_arrow it also > fails. > {code:java} > ar <- arrow::read_arrow("e:\\temp\\file.arrow") > Error in Table__from_RecordBatchFileReader(batch_reader) : > Invalid: negative malloc size{code} > {color:#c5060b} {color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Attachment: SingleBatch_String_85000_Rows.crash.rar > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > Attachments: SingleBatch_String_7_Rows.ok.rar, > SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png > > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1* > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > > Update > I put the data in the batch into a separate file. The file size is over 2 > gigs. > Using 15.1.1, when I try to load this entire file via read_arrow it also > fails. > {code:java} > ar <- arrow::read_arrow("e:\\temp\\file.arrow") > Error in Table__from_RecordBatchFileReader(batch_reader) : > Invalid: negative malloc size{code} > {color:#c5060b} {color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973717#comment-16973717 ] Anthony Abate commented on ARROW-7156: -- From Event Viewer:
Faulting application name: rsession.exe, version: 1.2.1335.0, time stamp: 0x5c9d0154
Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 0x5dc40022
Exception code: 0xc0000005
Fault offset: 0x003e4c05
Faulting process id: 0x8ec
Faulting application start time: 0x01d59a59ff052a76
Faulting application path: C:\software\RStudio\bin\rsession.exe
Faulting module path: C:\Users\aabate\Documents\R\win-library\3.6\arrow\libs\x64\arrow.dll
Report Id: db7e29f8-54ba-40fc-a104-75d3b6f75d0e
Faulting package full name:
Faulting package-relative application ID:
> [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > Attachments: image-2019-11-13-16-27-30-641.png > > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail - all other batches load fine. In 0.14.1 the > individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> > rbn <- data_rbfr$get_batch(x)
> Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) :
>   Invalid: negative malloc size
> {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio!
> {code}
> Update: I put the data in the batch into a separate file. The file size is over 2 gigs. Using 0.15.1, when I try to load this entire file via read_arrow it also fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow")
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>   Invalid: negative malloc size
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973714#comment-16973714 ] Anthony Abate commented on ARROW-7156: -- [~npr]- crashes RStudio means just that - instead of an error message !image-2019-11-13-16-27-30-641.png! > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > Attachments: image-2019-11-13-16-27-30-641.png > > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1* > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > > Update > I put the data in the batch into a separate file. The file size is over 2 > gigs. > Using 15.1.1, when I try to load this entire file via read_arrow it also > fails. > {code:java} > ar <- arrow::read_arrow("e:\\temp\\file.arrow") > Error in Table__from_RecordBatchFileReader(batch_reader) : > Invalid: negative malloc size{code} > {color:#c5060b} {color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Attachment: image-2019-11-13-16-27-30-641.png > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > Attachments: image-2019-11-13-16-27-30-641.png > > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1* > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > > Update > I put the data in the batch into a separate file. The file size is over 2 > gigs. > Using 15.1.1, when I try to load this entire file via read_arrow it also > fails. > {code:java} > ar <- arrow::read_arrow("e:\\temp\\file.arrow") > Error in Table__from_RecordBatchFileReader(batch_reader) : > Invalid: negative malloc size{code} > {color:#c5060b} {color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973713#comment-16973713 ] Anthony Abate commented on ARROW-7156: -- [~npr]- do you know if an individual RecordBatch can exceed 2 gigs (int32 max) ? This might not be an Arrow C++ issue, but another bug in the C# library that I used to generate the file. > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1* > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > > Update > I put the data in the batch into a separate file. The file size is over 2 > gigs. > Using 15.1.1, when I try to load this entire file via read_arrow it also > fails. > {code:java} > ar <- arrow::read_arrow("e:\\temp\\file.arrow") > Error in Table__from_RecordBatchFileReader(batch_reader) : > Invalid: negative malloc size{code} > {color:#c5060b} {color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Description: I have a 30 gig arrow file with 100 batches. the largest batch in the file causes get batch to fail - All other batches load fine. in 14.11 the individual batch errors.. in 15.1.1 the batch crashes R studio when it is used *14.1.1* {code:java} > rbn <- data_rbfr$get_batch(x) Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : Invalid: negative malloc size {code} *15.1.1* {code:java} rbn <- data_rbfr$get_batch(x) works! df <- as.data.frame(rbn) - Crashes R Studio! {code} Update I put the data in the batch into a separate file. The file size is over 2 gigs. Using 15.1.1, when I try to load this entire file via read_arrow it also fails. {code:java} ar <- arrow::read_arrow("e:\\temp\\file.arrow") Error in Table__from_RecordBatchFileReader(batch_reader) : Invalid: negative malloc size{code} {color:#c5060b} {color} was: I have a 30 gig arrow file with 100 batches. the largest batch in the file causes get batch to fail - All other batches load fine. in 14.11 the individual batch errors.. in 15.1.1 the batch crashes R studio when it is used *14.1.1* {code:java} > rbn <- data_rbfr$get_batch(x) Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : Invalid: negative malloc size {code} *15.1.1* {code:java} rbn <- data_rbfr$get_batch(x) works! df <- as.data.frame(rbn) - Crashes R Studio! {code} > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1* > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > > Update > I put the data in the batch into a separate file. The file size is over 2 > gigs. > Using 15.1.1, when I try to load this entire file via read_arrow it also > fails. > {code:java} > ar <- arrow::read_arrow("e:\\temp\\file.arrow") > Error in Table__from_RecordBatchFileReader(batch_reader) : > Invalid: negative malloc size{code} > {color:#c5060b} {color} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Summary: [R] [C++] Large Batches Cause Error / Crashes (was: [R] [C++] get_batch - fails for large batches) > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1*** > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Description: I have a 30 gig arrow file with 100 batches. the largest batch in the file causes get batch to fail - All other batches load fine. in 14.11 the individual batch errors.. in 15.1.1 the batch crashes R studio when it is used *14.1.1* {code:java} > rbn <- data_rbfr$get_batch(x) Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : Invalid: negative malloc size {code} *15.1.1* {code:java} rbn <- data_rbfr$get_batch(x) works! df <- as.data.frame(rbn) - Crashes R Studio! {code} was: I have a 30 gig arrow file with 100 batches. the largest batch in the file causes get batch to fail - All other batches load fine. in 14.11 the individual batch errors.. in 15.1.1 the batch crashes R studio when it is used *14.1.1* {code:java} > rbn <- data_rbfr$get_batch(x) Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : Invalid: negative malloc size {code} *15.1.1*** {code:java} rbn <- data_rbfr$get_batch(x) works! df <- as.data.frame(rbn) - Crashes R Studio! {code} > [R] [C++] Large Batches Cause Error / Crashes > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1* > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] get_batch - fails for large batches
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Description: I have a 30 gig arrow file with 100 batches. the largest batch in the file causes get batch to fail - All other batches load fine. in 14.11 the individual batch errors.. in 15.1.1 the batch crashes R studio when it is used *14.1.1* {code:java} > rbn <- data_rbfr$get_batch(x) Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : Invalid: negative malloc size {code} *15.1.1*** {code:java} rbn <- data_rbfr$get_batch(x) works! df <- as.data.frame(rbn) - Crashes R Studio! {code} was: I have a 30 gig arrow file with 100 batches. the largest batch in the file causes get batch to fail - All other batches load fine. I dont know if this is fixed in 15.x because 15.x fails to load the file (another bug) {color:#FF}> {color}{color:#FF} rbn <- data_rbfr$get_batch(4){color}{color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : Invalid: negative malloc size{color} > [R] [C++] get_batch - fails for large batches > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. the largest batch in the file > causes get batch to fail - All other batches load fine. in 14.11 the > individual batch errors.. in 15.1.1 the batch crashes R studio when it is used > *14.1.1* > {code:java} > > rbn <- data_rbfr$get_batch(x) > Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > *15.1.1*** > {code:java} > rbn <- data_rbfr$get_batch(x) works! > df <- as.data.frame(rbn) - Crashes R Studio! {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7156) [R] [C++] get_batch - fails for large batches
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973478#comment-16973478 ] Anthony Abate commented on ARROW-7156: -- this is still a problem in 15.1.1 but the failure is slightly different: rbn <- data_rbfr$get_batch(x) works! df <- as.data.frame(rbn) - Crashes RStudio! > [R] [C++] get_batch - fails for large batches > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail; all other batches load fine. > I don't know if this is fixed in 15.x because 15.x fails to load the file > (another bug) > > {code:java} > > rbn <- data_rbfr$get_batch(4) > Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-7156) [R] [C++] get_batch - fails for large batches
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973478#comment-16973478 ] Anthony Abate edited comment on ARROW-7156 at 11/13/19 4:09 PM: this is still a problem in 15.1.1 but the failure is different {code:java} rbn <- data_rbfr$get_batch(x) works! df <- as.data.frame(rbn) - Crashes RStudio! {code} was (Author: abbot): this is still a problem in 15.1.1 but the failure is slightly different rbn <- data_rbfr$get_batch(x) works! df <- as.data.frame(rbn) - Crashes RStudio! > [R] [C++] get_batch - fails for large batches > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail; all other batches load fine. > I don't know if this is fixed in 15.x because 15.x fails to load the file > (another bug) > > {code:java} > > rbn <- data_rbfr$get_batch(4) > Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] get_batch - fails for large batches
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Affects Version/s: 0.15.1 > [R] [C++] get_batch - fails for large batches > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1, 0.15.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail; all other batches load fine. > I don't know if this is fixed in 15.x because 15.x fails to load the file > (another bug) > > {code:java} > > rbn <- data_rbfr$get_batch(4) > Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] get_batch - fails for large batches
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Summary: [R] [C++] get_batch - fails for large batches (was: [R] [C++] get_batch - failes for large batches) > [R] [C++] get_batch - fails for large batches > - > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail; all other batches load fine. > I don't know if this is fixed in 15.x because 15.x fails to load the file > (another bug) > > {code:java} > > rbn <- data_rbfr$get_batch(4) > Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-7157) [R] RecordBatchFileReader - Crashes RStudio
[ https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate closed ARROW-7157. Resolution: Not A Bug ok - seems like it's not an issue - the API changed from 14.1 to 15.1 and I picked the wrong function. > [R] RecordBatchFileReader - Crashes RStudio > --- > > Key: ARROW-7157 > URL: https://issues.apache.org/jira/browse/ARROW-7157 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Priority: Blocker > > I have a 30 gig arrow file - using record batch reader crashes RStudio > arrow::RecordBatchFileReader$new("file.arrow") -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7157) [R] RecordBatchFileReader - Crashes RStudio
[ https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973447#comment-16973447 ] Anthony Abate commented on ARROW-7157: -- hmm.. do you mean.. that forwards to 'placement new' ? should that even be accessible from R? > [R] RecordBatchFileReader - Crashes RStudio > --- > > Key: ARROW-7157 > URL: https://issues.apache.org/jira/browse/ARROW-7157 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Priority: Blocker > > I have a 30 gig arrow file - using record batch reader crashes RStudio > arrow::RecordBatchFileReader$new("file.arrow") -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7156) [R] [C++] get_batch - failes for large batches
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973440#comment-16973440 ] Anthony Abate commented on ARROW-7156: -- ok updated - I don't know the exact size of the batch - but it can't be coincidence that the largest batch in the file fails to load - I suspect there is some size limitation that was hit > [R] [C++] get_batch - failes for large batches > -- > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail; all other batches load fine. > I don't know if this is fixed in 15.x because 15.x fails to load the file > (another bug) > > {code:java} > > rbn <- data_rbfr$get_batch(4) > Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7156) [R] [C++] get_batch - failes for large batches
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7156: - Description: I have a 30 gig arrow file with 100 batches. The largest batch in the file causes get_batch to fail; all other batches load fine. I don't know if this is fixed in 15.x because 15.x fails to load the file (another bug) {code:java} > rbn <- data_rbfr$get_batch(4) Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : Invalid: negative malloc size {code} was: I have a 30 gig arrow file with 100 batches. The largest batch in the file causes get_batch to fail; all other batches load fine. I don't know if this is fixed in 15.x because 15.x fails to load the file (another bug) > [R] [C++] get_batch - failes for large batches > -- > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail; all other batches load fine. > I don't know if this is fixed in 15.x because 15.x fails to load the file > (another bug) > > {code:java} > > rbn <- data_rbfr$get_batch(4) > Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : > Invalid: negative malloc size > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7156) [R] [C++] get_batch - failes for large batches
[ https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973435#comment-16973435 ] Anthony Abate commented on ARROW-7156: -- I'm working on it... I have to 'downgrade' arrow since 15.x seems even more broken... > [R] [C++] get_batch - failes for large batches > -- > > Key: ARROW-7156 > URL: https://issues.apache.org/jira/browse/ARROW-7156 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.14.1 >Reporter: Anthony Abate >Priority: Major > > I have a 30 gig arrow file with 100 batches. The largest batch in the file > causes get_batch to fail; all other batches load fine. > I don't know if this is fixed in 15.x because 15.x fails to load the file > (another bug) > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7157) [R] RecordBatchFileReader - Crashes RStudio
[ https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7157: - Summary: [R] RecordBatchFileReader - Crashes RStudio (was: RecordBatchFileReader - Crashes RStudio) > [R] RecordBatchFileReader - Crashes RStudio > --- > > Key: ARROW-7157 > URL: https://issues.apache.org/jira/browse/ARROW-7157 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 0.15.1 >Reporter: Anthony Abate >Priority: Blocker > > I have a 30 gig arrow file - using record batch reader crashes RStudio > arrow::RecordBatchFileReader$new("file.arrow") -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7157) RecordBatchFileReader - Crashes RStudio
Anthony Abate created ARROW-7157: Summary: RecordBatchFileReader - Crashes RStudio Key: ARROW-7157 URL: https://issues.apache.org/jira/browse/ARROW-7157 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 0.15.1 Reporter: Anthony Abate I have a 30 gig arrow file - using record batch reader crashes RStudio arrow::RecordBatchFileReader$new("file.arrow") -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7156) [R] [C++] get_batch - failes for large batches
Anthony Abate created ARROW-7156: Summary: [R] [C++] get_batch - failes for large batches Key: ARROW-7156 URL: https://issues.apache.org/jira/browse/ARROW-7156 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 0.14.1 Reporter: Anthony Abate I have a 30 gig arrow file with 100 batches. The largest batch in the file causes get_batch to fail; all other batches load fine. I don't know if this is fixed in 15.x because 15.x fails to load the file (another bug) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Description: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>(); foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } }); wait.Add(t); } await Task.WhenAll(wait); }{code} Update 2: This is due to a confirmed bug in System.Memory - the implications are that Span.CopyTo needs to be removed / replaced. This method is used internally by ArrowBuffer, so I can't work around it easily. Solutions # Change the code ## Remove it outright (including disabling span in FlatBuffer) ## create a multi-target NuGet package where the offending code sits behind compile blocks (#if NETFRAMEWORK) - and disable span in FlatBuffers only for the .NET Framework build # wait for a System.Memory fix? I suspect option 2 won't happen anytime soon. was: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>(); foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } }); wait.Add(t); } await Task.WhenAll(wait); }{code} Update 2: This is due to a confirmed bug in System.Memory - the implications are that Span.CopyTo needs to be removed / replaced. This method is used internally by ArrowBuffer, so I can't work around it easily. Solutions # Change the code ## Remove it outright (including within FlatBuffer) ## create a multi-target NuGet package where the offending code sits behind compile blocks (#if NETFRAMEWORK) - and disable span in FlatBuffers # wait for a System.Memory fix? I suspect 3 won't happen anytime soon.
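A minimal sketch of what the compile-block idea in option 1.2 could look like, assuming an SDK-style project that multi-targets .NET Framework and .NET Standard (so the NETFRAMEWORK symbol is defined only on the net4x build); the SpanCopy helper and its name are hypothetical, not part of the Arrow C# codebase:
{code:java}
// Hypothetical helper for option 1.2: fall back to a manual copy on
// .NET Framework (where the System.Memory Span.CopyTo crash was reported)
// and keep the fast Span.CopyTo path on all other targets.
using System;

internal static class SpanCopy
{
    public static void Copy(ReadOnlySpan<byte> source, Span<byte> destination)
    {
#if NETFRAMEWORK
        // Assumption: an element-by-element copy avoids the System.Memory
        // fast path that crashes under heavy multi-threaded use.
        for (int i = 0; i < source.Length; i++)
        {
            destination[i] = source[i];
        }
#else
        source.CopyTo(destination);
#endif
    }
}
{code}
The same #if guard could also disable the span-based paths in the vendored FlatBuffers sources, so only the .NET Framework build pays for the slower copy.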
[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Description: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>(); foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } }); wait.Add(t); } await Task.WhenAll(wait); }{code} Update 2: This is due to a confirmed bug in System.Memory - the implications are that Span.CopyTo needs to be removed / replaced. This method is used internally by ArrowBuffer, so I can't work around it easily. Solutions # Change the code ## Remove it outright (including within FlatBuffer) ## create a multi-target NuGet package where the offending code sits behind compile blocks (#if NETFRAMEWORK) - and disable span in FlatBuffers # wait for a System.Memory fix? I suspect 3 won't happen anytime soon. was: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>(); foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } }); wait.Add(t); } await Task.WhenAll(wait); }{code} Update 2: This is due to a confirmed bug in System.Memory - the implications are that Span.CopyTo needs to be removed > [C#] System.Memory Span.CopyTo - Crashes on Net Framework > --- > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >
[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Description: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>(); foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } }); wait.Add(t); } await Task.WhenAll(wait); }{code} Update 2: This is due to a confirmed bug in System.Memory - the implications are that Span.CopyTo needs to be removed was: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>(); foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } }); wait.Add(t); } await Task.WhenAll(wait); }{code} > [C#] System.Memory Span.CopyTo - Crashes on Net Framework > --- > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >
[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Priority: Blocker (was: Critical) > [C#] System.Memory Span.CopyTo - Crashes on Net Framework > --- > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Blocker > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > Update - it seems that the problem is in the underlying > *ArrowBuffer.Builder* > {code:java} > public async Task ValueBuffer_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > ArrowBuffer.Builder<byte> ValueBuffer = new > ArrowBuffer.Builder<byte>(); > foreach (var d in data) > { > ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); > } > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > }{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Summary: [C#] System.Memory Span.CopyTo - Crashes on Net Framework (was: [C#] ArrowBuffer.Append - Crashes ) > [C#] System.Memory Span.CopyTo - Crashes on Net Framework > --- > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > Update - it seems that the problem is in the underlying > *ArrowBuffer.Builder* > {code:java} > public async Task ValueBuffer_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > ArrowBuffer.Builder<byte> ValueBuffer = new > ArrowBuffer.Builder<byte>(); > foreach (var d in data) > { > ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); > } > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > }{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964387#comment-16964387 ] Anthony Abate commented on ARROW-7040: -- this might not be an arrow issue - it might be an issue in the System.Memory code - I reported a bug here: [https://github.com/dotnet/corefx/issues/42276] It is still an open issue for us, though, because the current array builder implementation in Arrow crashes when using strings from many threads. I am considering creating a separate builder for strings that internally uses byte[] instead of Spans to see if that makes the problem go away > [C#] ArrowBuffer.Append - Crashes > - > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > Update - it seems that the problem is in the underlying > *ArrowBuffer.Builder* > {code:java} > public async Task ValueBuffer_StressTest() > { > var wait = new List<Task>();for (int i = 0; i < 30; > ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > ArrowBuffer.Builder<byte> ValueBuffer = new > ArrowBuffer.Builder<byte>(); > foreach (var d in data) > { > ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); > } > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > }{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
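A rough sketch of that byte[]-based idea: a builder whose appends go through Buffer.BlockCopy on plain arrays, so Span<T>.CopyTo is never called. The ByteArrayBuilder type below is hypothetical (it is not part of the Arrow C# API) and only illustrates the approach:
{code:java}
// Hypothetical byte[]-backed value-buffer builder: growth and appends use
// Array.Resize and Buffer.BlockCopy instead of Span<byte>.CopyTo, so the
// suspect System.Memory code path is never exercised.
using System;

public sealed class ByteArrayBuilder
{
    private byte[] _buffer = new byte[64];
    private int _length;

    public int Length => _length;

    public void Append(byte[] bytes)
    {
        EnsureCapacity(_length + bytes.Length);
        Buffer.BlockCopy(bytes, 0, _buffer, _length, bytes.Length);
        _length += bytes.Length;
    }

    // Copy out only the written prefix; spare capacity is not exposed.
    public byte[] ToArray()
    {
        var result = new byte[_length];
        Buffer.BlockCopy(_buffer, 0, result, 0, _length);
        return result;
    }

    private void EnsureCapacity(int required)
    {
        if (required <= _buffer.Length) return;
        int size = _buffer.Length;
        while (size < required) size *= 2; // double until the append fits
        Array.Resize(ref _buffer, size);
    }
}
{code}
Swapping something like this into the string values path would confirm whether Span.CopyTo really is the culprit before committing to either fix above.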
[jira] [Updated] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Description: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>(); foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } }); wait.Add(t); } await Task.WhenAll(wait); }{code} was: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>();for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>(); foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } }); wait.Add(t); } await Task.WhenAll(wait); }{code} > [C#] ArrowBuffer.Append - Crashes > - > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await
[jira] [Updated] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Description: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying *ArrowBuffer.Builder* {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>();for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } });wait.Add(t); }await Task.WhenAll(wait); }{code} was: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying ArrowBuffer.Builder {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>();for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } });wait.Add(t); }await Task.WhenAll(wait); }{code} > [C#] ArrowBuffer.Append - Crashes > - > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await
[jira] [Issue Comment Deleted] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Comment: was deleted (was: interesting - BinaryArrayBuilder does not crash if using *AppendRange(IEnumerable<string> values)* StringArrayBuilder uses *BinaryArrayBuilder.Append(ReadOnlySpan<byte> span)*... I tried forwarding StringArrayBuilder to *BinaryArrayBuilder.AppendRange(IEnumerable<string>)* but the problem also occurs... ) > [C#] ArrowBuffer.Append - Crashes > - > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > Update - it seems that the problem is in the underlying > *ArrowBuffer.Builder* > {code:java} > public async Task ValueBuffer_StressTest() > { > var wait = new List<Task>();for (int i = 0; i < 30; > ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > ArrowBuffer.Builder<byte> ValueBuffer = new > ArrowBuffer.Builder<byte>(); > foreach (var d in data) > { > ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); > } > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > }{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Description: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Update - it seems that the problem is in the underlying ArrowBuffer.Builder {code:java} public async Task ValueBuffer_StressTest() { var wait = new List<Task>();for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();foreach (var d in data) { ValueBuffer.Append(Encoding.UTF8.GetBytes(d)); } } });wait.Add(t); }await Task.WhenAll(wait); }{code} was: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. Summary: [C#] ArrowBuffer.Append - Crashes (was: [C#] StringArrayBuilder.AppendRange - Crashes ) > [C#] ArrowBuffer.Append - Crashes > - > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > Update - it seems that the problem is in the underlying ArrowBuffer.Builder > {code:java} > public async Task ValueBuffer_StressTest() > { > var wait = new List<Task>();for (int i = 0; i < 30; > ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray();var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > ArrowBuffer.Builder<byte> ValueBuffer = new > ArrowBuffer.Builder<byte>();foreach (var d in data) >
[jira] [Comment Edited] (ARROW-7040) [C#] StringArrayBuilder.AppendRange - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964199#comment-16964199 ] Anthony Abate edited comment on ARROW-7040 at 10/31/19 5:28 PM: interesting - BinaryArrayBuilder does not crash if using *AppendRange(IEnumerable<string> values)* StringArrayBuilder uses *BinaryArrayBuilder.Append(ReadOnlySpan<byte> span)*... I tried forwarding StringArrayBuilder to *BinaryArrayBuilder.AppendRange(IEnumerable<string>)* but the problem also occurs... was (Author: abbot): interesting - BinaryArrayBuilder does not crash if using *AppendRange(IEnumerable<string> values)* StringArrayBuilder uses *BinaryArrayBuilder.Append(ReadOnlySpan<byte> span)*... I think I found the problem - if it works - I will submit a pull request > [C#] StringArrayBuilder.AppendRange - Crashes > -- > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-7040) [C#] StringArrayBuilder.AppendRange - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964199#comment-16964199 ] Anthony Abate commented on ARROW-7040: -- interesting - BinaryArrayBuilder does not crash if using *AppendRange(IEnumerable<string> values)* StringArrayBuilder uses *BinaryArrayBuilder.Append(ReadOnlySpan<byte> span)*... I think I found the problem - if it works - I will submit a pull request > [C#] StringArrayBuilder.AppendRange - Crashes > -- > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7040) StringArrayBuilder.AppendRange - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Description: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } }); wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. was: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } });wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. > StringArrayBuilder.AppendRange - Crashes > - > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > }); > wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7040) StringArrayBuilder.AppendRange - Crashes
[ https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-7040: - Description: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>(); for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray(); var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } });wait.Add(t); } await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays (i.e. IntArrayBuilder). I suspect it is due to the offset array and all the copying/resizing going on. was: The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>();for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } });wait.Add(t); }await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays. I suspect it is due to the offset array and all the copying/resizing going on. > StringArrayBuilder.AppendRange - Crashes > - > > Key: ARROW-7040 > URL: https://issues.apache.org/jira/browse/ARROW-7040 > Project: Apache Arrow > Issue Type: Bug > Components: C# >Affects Versions: 0.14.1, 0.15.0 >Reporter: Anthony Abate >Priority: Critical > > The following code crashes on 8 cores. > {code:java} > public async Task StringArrayBuilder_StressTest() > { > var wait = new List<Task>(); > for (int i = 0; i < 30; ++i) > { > var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + > 1}").ToArray(); > var t = Task.Run(() => > { > for (int j = 0; j < 1000; ++j) > { > var builder = new StringArray.Builder(); > builder.AppendRange(data); > } > });wait.Add(t); > } > await Task.WhenAll(wait); > } {code} > > It does not happen with the primitive arrays (i.e. IntArrayBuilder). > I suspect it is due to the offset array and all the copying/resizing going on. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-7040) StringArrayBuilder.AppendRange - Crashes
Anthony Abate created ARROW-7040: Summary: StringArrayBuilder.AppendRange - Crashes Key: ARROW-7040 URL: https://issues.apache.org/jira/browse/ARROW-7040 Project: Apache Arrow Issue Type: Bug Components: C# Affects Versions: 0.15.0, 0.14.1 Reporter: Anthony Abate The following code crashes on 8 cores. {code:java} public async Task StringArrayBuilder_StressTest() { var wait = new List<Task>();for (int i = 0; i < 30; ++i) { var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();var t = Task.Run(() => { for (int j = 0; j < 1000; ++j) { var builder = new StringArray.Builder(); builder.AppendRange(data); } });wait.Add(t); }await Task.WhenAll(wait); } {code} It does not happen with the primitive arrays. I suspect it is due to the offset array and all the copying/resizing going on. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948748#comment-16948748 ] Anthony Abate edited comment on ARROW-6830 at 10/10/19 4:44 PM: Yes - my original question is about slicing the arrow file to reduce columns - whether it be via read_arrow, read_table, or RecordBatchFileReader so far the answer seems to be the following: || method || status || |read_arrow | unsupported| |read_table | supported, but uses lots of memory| |RecordBatchFileReader | manually possible via the code I provided, but slow| was (Author: abbot): Yes - my original question is about slicing the arrow file to reduce columns - whether it be via read_arrow, read_table, or RecordBatchFileReader so far the answer seems to be the following: read_arrow - unsupported read_table - supported, but uses lots of memory RecordBatchFileReader - manually possible via the code I provided, but slow > [R] Select Subset of Columns in read_arrow > -- > > Key: ARROW-6830 > URL: https://issues.apache.org/jira/browse/ARROW-6830 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Anthony Abate >Priority: Minor > > *Note:* Not sure if this is a limitation of the R library or the underlying > C++ code: > I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record > batches of varying row sizes > 1. Is it possible to use *read_arrow* to filter out columns? (similar to > how *read_feather* has a col_select = ... argument) > 2. Or is it possible using *RecordBatchFileReader* to filter columns? > > The only thing I seem to be able to do (please confirm if this is my only > option) is loop over all record batches, select a single column at a time, > and construct the data I need to pull out manually, i.e. like the following: > {code:java} > for(i in 0:(data_rbfr$num_record_batches - 1)) { > rbn <- data_rbfr$get_batch(i) > > if (i == 0) > { > merged <- as.data.frame(rbn$column(5)$as_vector()) > } > else > { > dfn <- as.data.frame(rbn$column(5)$as_vector()) > merged <- rbind(merged,dfn) > } > > print(paste(i, nrow(merged))) > } {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948751#comment-16948751 ] Anthony Abate commented on ARROW-6830: -- {quote}You can filter each record batch separately (using {{[}} methods or lower level if you prefer) and collect them all into a data.frame. {quote} this is what I am doing - is there a better way so I can do multiple columns in a single pass? {code:java} rbn <- data_rbfr$get_batch(i) df <- data.frame( rbn$column(5)$as_vector(),rbn$column(6)$as_vector(),rbn$column(100)$as_vector(),rbn$column(687)$as_vector(), rbn$column(444)$as_vector(),rbn$column(36)$as_vector(),rbn$column(500)$as_vector(),rbn$column(897)$as_vector(), rbn$column(24)$as_vector(),rbn$column(446)$as_vector(),rbn$column(777)$as_vector(),rbn$column(333)$as_vector(), rbn$column(96)$as_vector(),rbn$column(555)$as_vector(),rbn$column(888)$as_vector(),rbn$column(222)$as_vector() ) {code} > [R] Select Subset of Columns in read_arrow > -- > > Key: ARROW-6830 > URL: https://issues.apache.org/jira/browse/ARROW-6830 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Anthony Abate >Priority: Minor > > *Note:* Not sure if this is a limitation of the R library or the underlying > C++ code: > I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record > batches of varying row sizes > 1. Is it possible to use *read_arrow* to filter out columns? (similar to > how *read_feather* has a col_select = ... argument) > 2. Or is it possible using *RecordBatchFileReader* to filter columns? > > The only thing I seem to be able to do (please confirm if this is my only > option) is loop over all record batches, select a single column at a time, > and construct the data I need to pull out manually, i.e. like the following: > {code:java} > for(i in 0:(data_rbfr$num_record_batches - 1)) { > rbn <- data_rbfr$get_batch(i) > > if (i == 0) > { > merged <- as.data.frame(rbn$column(5)$as_vector()) > } > else > { > dfn <- as.data.frame(rbn$column(5)$as_vector()) > merged <- rbind(merged,dfn) > } > > print(paste(i, nrow(merged))) > } {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948748#comment-16948748 ] Anthony Abate edited comment on ARROW-6830 at 10/10/19 4:31 PM: Yes - my original question is about slicing the arrow file to reduce columns - whether it be via read_arrow, read_table, or RecordBatchFileReader so far the answer seems to be the following: read_arrow - unsupported read_table - supported, but uses lots of memory RecordBatchFileReader - manually possible via the code I provided, but slow was (Author: abbot): Yes - my original question is about slicing the arrow file to reduce columns - whether it be via read_arrow, read_table, or RecordBatchFileReader so far the answer seems to be the following: read_arrow - unsupported read_table - supported, but uses lots of memory RecordBatchFileReader - supported, but slow > [R] Select Subset of Columns in read_arrow > -- > > Key: ARROW-6830 > URL: https://issues.apache.org/jira/browse/ARROW-6830 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Anthony Abate >Priority: Minor > > *Note:* Not sure if this is a limitation of the R library or the underlying > C++ code: > I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record > batches of varying row sizes > 1. Is it possible to use *read_arrow* to filter out columns? (similar to > how *read_feather* has a col_select = ... argument) > 2. Or is it possible using *RecordBatchFileReader* to filter columns? > > The only thing I seem to be able to do (please confirm if this is my only > option) is loop over all record batches, select a single column at a time, > and construct the data I need to pull out manually, i.e. like the following: > {code:java} > for(i in 0:(data_rbfr$num_record_batches - 1)) { > rbn <- data_rbfr$get_batch(i) > > if (i == 0) > { > merged <- as.data.frame(rbn$column(5)$as_vector()) > } > else > { > dfn <- as.data.frame(rbn$column(5)$as_vector()) > merged <- rbind(merged,dfn) > } > > print(paste(i, nrow(merged))) > } {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948748#comment-16948748 ] Anthony Abate edited comment on ARROW-6830 at 10/10/19 4:30 PM: Yes - my original question is about slicing the arrow file to reduce columns - whether it be via read_arrow, read_table, or RecordBatchFileReader so far the answer seems to be the following: read_arrow - unsupported read_table - supported, but uses lots of memory RecordBatchFileReader - supported, but slow was (Author: abbot): Yes - my original question is about slicing the arrow file to reduce columns - whether it be via read_arrow, read_table, or RecordBatchFileReader so far my answer is as follow: read_arrow - unsupported read_table - supported, but uses lots of memory RecordBatchFileReader - > [R] Select Subset of Columns in read_arrow > -- > > Key: ARROW-6830 > URL: https://issues.apache.org/jira/browse/ARROW-6830 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Anthony Abate >Priority: Minor > > *Note:* Not sure if this is a limitation of the R library or the underlying > C++ code: > I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record > batches of varying row sizes > 1. Is it possible to use *read_arrow* to filter out columns? (similar to > how *read_feather* has a col_select = ... argument) > 2. Or is it possible using *RecordBatchFileReader* to filter columns? > > The only thing I seem to be able to do (please confirm if this is my only > option) is loop over all record batches, select a single column at a time, > and construct the data I need to pull out manually, i.e. like the following: > {code:java} > for(i in 0:(data_rbfr$num_record_batches - 1)) { > rbn <- data_rbfr$get_batch(i) > > if (i == 0) > { > merged <- as.data.frame(rbn$column(5)$as_vector()) > } > else > { > dfn <- as.data.frame(rbn$column(5)$as_vector()) > merged <- rbind(merged,dfn) > } > > print(paste(i, nrow(merged))) > } {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948748#comment-16948748 ] Anthony Abate commented on ARROW-6830: -- Yes - my original question is about slicing the arrow file to reduce columns - whether it be via read_arrow, read_table, or RecordBatchFileReader so far my answer is as follows: read_arrow - unsupported read_table - supported, but uses lots of memory RecordBatchFileReader - > [R] Select Subset of Columns in read_arrow > -- > > Key: ARROW-6830 > URL: https://issues.apache.org/jira/browse/ARROW-6830 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Anthony Abate >Priority: Minor > > *Note:* Not sure if this is a limitation of the R library or the underlying > C++ code: > I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record > batches of varying row sizes > 1. Is it possible to use *read_arrow* to filter out columns? (similar to > how *read_feather* has a col_select = ... argument) > 2. Or is it possible using *RecordBatchFileReader* to filter columns? > > The only thing I seem to be able to do (please confirm if this is my only > option) is loop over all record batches, select a single column at a time, > and construct the data I need to pull out manually, i.e. like the following: > {code:java} > for(i in 0:(data_rbfr$num_record_batches - 1)) { > rbn <- data_rbfr$get_batch(i) > > if (i == 0) > { > merged <- as.data.frame(rbn$column(5)$as_vector()) > } > else > { > dfn <- as.data.frame(rbn$column(5)$as_vector()) > merged <- rbind(merged,dfn) > } > > print(paste(i, nrow(merged))) > } {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948744#comment-16948744 ] Anthony Abate commented on ARROW-6830: --

from my initial testing of read_table, it seems to be no better than read_arrow when it comes to memory usage and appears to load the entire file...

{code:java}
tab <- read_table("bigfile.arrow")
nrow(tab) # uses 30 gigs!
{code}
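If it helps to quantify that, a sketch of measuring the allocation in the same session; memory.size() is the Windows-only call already used in the benchmark script elsewhere in this thread, and the file name is illustrative.

{code:java}
library(arrow)

before <- memory.size()              # MB in use, Windows-only
tab <- read_table("bigfile.arrow")   # illustrative file name
after <- memory.size()
cat("approx MB allocated by read_table:", after - before, "\n")
{code}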
[jira] [Comment Edited] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948713#comment-16948713 ] Anthony Abate edited comment on ARROW-6830 at 10/10/19 3:48 PM:

I was using *RecordBatchFileReader* since it seemed to be the only way to limit memory usage (I thought *read_arrow* was my only alternative).

We are indexing our data by record batch, so we could be more efficient in filtering by passing the batch ids into the RecordBatchFileReader to avoid a 'full table scan'.

FYI - it was not clear to me from the name that *read_table* has anything to do with arrow files. Is read_table aware of the underlying record batches, so rows can be filtered out more efficiently?
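A sketch of that batch-id idea with the existing API: if an external index already maps row ranges to record batches, only those batches need to be materialized. The batch_ids values and file name are hypothetical.

{code:java}
library(arrow)

reader <- RecordBatchFileReader("bigfile.arrow")
batch_ids <- c(17L, 203L, 1042L)   # hypothetical: known from an external index
chunks <- lapply(batch_ids, function(i) {
  reader$get_batch(i)$column(5)$as_vector()
})
col5 <- unlist(chunks)
{code}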
[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948713#comment-16948713 ] Anthony Abate commented on ARROW-6830: --

I was using *RecordBatchFileReader* since it seemed to be the only way to limit memory usage (I thought *read_arrow* was my only alternative).

We are effectively indexing our data by record batch, so we could be more efficient in filtering and would want to pass that down to avoid a 'full table scan'.

FYI - it was not clear to me from the name that *read_table* has anything to do with arrow files. Is read_table aware of the underlying record batches, so rows can be filtered out more efficiently?
[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6830: - Description:

*Note:* Not sure if this is a limitation of the R library or the underlying C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns? (similar to how *read_feather* has col_select = ...)
2. Or is it possible using *RecordBatchFileReader* to filter columns?

The only thing I seem to be able to do (please confirm if this is my only option) is loop over all record batches, select a single column at a time, and construct the data I need to pull out manually, i.e. like the following:

{code:java}
for(i in 0:data_rbfr$num_record_batches) {
  rbn <- data_rbfr$get_batch(i)

  if (i == 0)
  {
    merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else
  {
    dfn <- as.data.frame(rbn$column(5)$as_vector())
    merged <- rbind(merged,dfn)
  }

  print(paste(i, nrow(merged)))
}
{code}
[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6830: - Description:

*Note:* Not sure if this is a limitation of the R library or the underlying C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns? (similar to how *read_feather* has col_select = ...)
2. Or is it possible using *RecordBatchFileReader* to filter columns?

The only thing I seem to be able to do (please confirm if this is my only option) is loop over all record batches, select a single column at a time, and construct the data I need to pull out manually, i.e. like the following:

{code:java}
for(i in 0:data_rbfr$num_record_batches) {
  rbn <- data_rbfr$get_batch(i)
  if (i == 0)
  {
    merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else
  {
    dfn <- as.data.frame(rbn$column(5)$as_vector())
    merged <- rbind(merged,dfn)
  }
  print(paste(i, nrow(merged)))
}
{code}
[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6830: - Description:

*Note:* Not sure if this is a limitation of the R library or the underlying C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns? (similar to how *read_feather* has col_select = ...)
2. Or is it possible using *RecordBatchFileReader* to filter columns?

The only thing I seem to be able to do (please confirm if this is my only option) is loop over all record batches, select a single column at a time, and construct the data I need to pull out manually, i.e. like the following:

{code:java}
data_rbfr <- arrow::RecordBatchFileReader("arrowfile")
for(i in 0:data_rbfr$num_record_batches) {
  rbn <- data_rbfr$get_batch(i)
  if (i == 0)
  {
    merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else
  {
    dfn <- as.data.frame(rbn$column(5)$as_vector())
    merged <- rbind(merged,dfn)
  }
}
{code}
[jira] [Created] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow
Anthony Abate created ARROW-6830:

Summary: Question / Feature Request- Select Subset of Columns in read_arrow
Key: ARROW-6830
URL: https://issues.apache.org/jira/browse/ARROW-6830
Project: Apache Arrow
Issue Type: New Feature
Components: C++, R
Reporter: Anthony Abate

*Note:* Not sure if this is a limitation of the R library or the underlying C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns? (similar to how *read_feather* has col_select = ...)
2. Or is it possible using *RecordBatchFileReader* to filter columns?

The only thing I seem to be able to do (please confirm if this is my only option) is loop over all record batches, select a single column at a time, and construct the data I need to pull out manually, i.e. like the following:

{code:java}
data_rbfr <- arrow::RecordBatchFileReader("arrowfile")
for (i in 0:data_rbfr$num_record_batches) {  # for each batch
  batch <- data_rbfr$get_batch(i)
  col4 <- batch$column(4)
  col5 <- batch$column(7)
}
{code}
[jira] [Commented] (ARROW-6682) [C#] Arrow R/C++ hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941893#comment-16941893 ] Anthony Abate commented on ARROW-6682: --

[~wesm] - I understand that the file generation is fixed on the C# side, but isn't a malformed file taking down the library another problem?

[~eerhardt] - Is there a pre-release NuGet package that I can test out?
[jira] [Commented] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls
[ https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938975#comment-16938975 ] Anthony Abate commented on ARROW-6603: --

I have a few extension methods that do this - one thing I noticed: the spec seems to refer to the terms NullBitmap and ValidityBitmap. I think ValidityBitmap might be the more correct term, since 1 = valid, whereas NullBitmap sounds like 1 = null. My first attempt at creating the null bitmap inverted all the values.
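A tiny base-R illustration of the naming point (a sketch, not any Arrow API): packBits() puts the first element in the least significant bit, which matches Arrow's LSB bitmap ordering, so a validity bitmap is exactly the bitwise inverse of an is-null mask.

{code:java}
is_valid <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
validity_byte  <- packBits(is_valid,  type = "raw")   # 0xeb: one bit per *valid* slot
null_mask_byte <- packBits(!is_valid, type = "raw")   # 0x14: the inverse
{code}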
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938699#comment-16938699 ] Anthony Abate commented on ARROW-6682: --

[~npr] - setting that option may be a workaround for now. I am not sure what the threads do, since there seems to be no performance difference - at least in the read_arrow function.
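The option is not spelled out in this thread; a minimal sketch assuming it is the arrow.use_threads option read by the arrow R package (the file name is illustrative):

{code:java}
options(arrow.use_threads = FALSE)   # assumption: the option referred to above
library(arrow)
df <- read_arrow("bigfile.arrow")    # illustrative file name
{code}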
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938149#comment-16938149 ] Anthony Abate commented on ARROW-6682: --

It sounds like there might be more than one issue here:
* the malformed file
* the hanging in R

It would be troubling if a malformed file could take down / crash the library (i.e. a DoS exploit).

When trying to use an out-of-date C# Feather library in R I did get some indication that the file was invalid: ([https://github.com/kevin-montrose/FeatherDotNet/issues/7])

Is there a way to validate the integrity of the arrow file on open? (i.e. check offsets, padding, etc.) It might be slower, but when opening a file from an unknown source it could be safer.

Regarding the hanging - there do seem to be some thread pool options for the C++ code, but I don't know how to access them from R.
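No validate-on-open API is confirmed in this thread. As a stopgap, a generic R wrapper can at least turn a clean read failure into a recoverable condition - though it cannot help when the reader hangs or crashes the process, which is the case reported here:

{code:java}
safe_read <- function(path) {
  tryCatch(
    arrow::read_arrow(path),
    error = function(e) {
      message("failed to read ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}
{code}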
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938048#comment-16938048 ] Anthony Abate commented on ARROW-6682: --

I also uploaded the exact script files / script loop runner
[jira] [Updated] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6682: - Attachment: arrow.benchmark.r
[jira] [Updated] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6682: - Attachment: script.runner.ps1
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938047#comment-16938047 ] Anthony Abate commented on ARROW-6682: --

[~npr] - I can't reproduce the issue on a single core, but I can on two cores - can you try a VM with two cores?
[jira] [Comment Edited] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938043#comment-16938043 ] Anthony Abate edited comment on ARROW-6682 at 9/25/19 8:24 PM: ---

The other thing I should point out: I am using a new rscript.exe process each time. That way I know for certain the DLL is unloaded and reinitialized without any static-init related code.

This is the script runner code I am using (PowerShell script):

{code}
$rpath = "C:\Program Files\r\R-3.6.1\bin\Rscript.exe"
$rscript = "arrow.benchmark.r"

For ($i=0; $i -le 1; $i++)
{
  Write-Output "run: $i"
  $stopwatch = [system.diagnostics.stopwatch]::StartNew()
  & $rpath --no-save --no-restore --verbose $rscript > c:\temp\outputFile.Rout 2>&1
  $stopwatch.Elapsed.TotalSeconds
}
{code}
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938043#comment-16938043 ] Anthony Abate commented on ARROW-6682: --

The other thing I should point out: I am using a new rscript.exe process each time. That way I know for certain the DLL is unloaded and reinitialized without any static-init related code.

This is the script runner code I am using (PowerShell script):

{code}
$rpath = "C:\Program Files\r\R-3.6.1\bin\Rscript.exe"
$rscript = "arrow.benchmark.r"

For ($i=0; $i -le 1; $i++)
{
  Write-Output "run: $i"
  $stopwatch = [system.diagnostics.stopwatch]::StartNew()
  & $rpath --no-save --no-restore --verbose $rscript > c:\temp\outputFile.Rout 2>&1
  $stopwatch.Elapsed.TotalSeconds
}
{code}
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938038#comment-16938038 ] Anthony Abate commented on ARROW-6682: --

[~npr] how many cores was your test VM using?
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938034#comment-16938034 ] Anthony Abate commented on ARROW-6682: --

I can repro the problem fairly consistently - I can get more info if needed:

{code}
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_0.14.1.20190925

loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 bit_1.1-14 compiler_3.6.1 magrittr_1.5 assertthat_0.2.1 R6_2.4.0 tools_3.6.1
[8] glue_1.3.1 Rcpp_1.0.2 bit64_0.9-7 rlang_0.4.0 purrr_0.3.2
{code}
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938002#comment-16938002 ] Anthony Abate commented on ARROW-6682: --

If the file is bad, I'd expect the R library to fail the same as Python, if they both use the same underlying C++ code. However, I don't know the R / Python bindings / code.

I would point out that I was able to validate a 30 million row x 37 column data set produced by C# in R, including the null support I added. The only indication of any issue was a very rare hang on first use of the library in RStudio - if it didn't hang the first time, I was able to do many file loads of 10gb without issue.

I was attempting to narrow down that rare hang when it seemed to be a column width issue.
[jira] [Comment Edited] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937994#comment-16937994 ] Anthony Abate edited comment on ARROW-6682 at 9/25/19 6:53 PM: ---

[~eerhardt] - I used the current NuGet version + some code I wrote to build null support.

[~wesm] [~npr] - this looks/behaves like a threading issue - I don't get any hanging if I reduce the VM to 1 core (not ideal). (I can't explain the core dumps though.)
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937994#comment-16937994 ] Anthony Abate commented on ARROW-6682: --

[~eerhardt] - I used the current NuGet version + some code I wrote to build null support.

[~wesm] [~npr] - this looks like a threading issue - I don't get any hanging if I reduce the VM to 1 core (not ideal).
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937921#comment-16937921 ] Anthony Abate commented on ARROW-6682: --

[~wesm] - I made a zip version of the file
[jira] [Updated] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6682: - Attachment: Generated_4000Batch_50Columns_100Rows_PerBatch.zip
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937873#comment-16937873 ] Anthony Abate commented on ARROW-6682: --

[~npr] - seems like the same problem. Other than install.packages("arrow", repos = "https://dl.bintray.com/ursalabs/arrow-r"), do I need to do anything else to use the dev package? Is there a version number I can print out at runtime to make sure I'm using the new code?
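Base R can answer the version question without any arrow-specific API; a dated dev build shows up with a date-stamped version string (the sessionInfo() output elsewhere in this thread shows arrow_0.14.1.20190925):

{code:java}
packageVersion("arrow")   # e.g. '0.14.1.20190925' for a dated dev build
{code}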
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937860#comment-16937860 ] Anthony Abate commented on ARROW-6682: --

[~npr] - ok, sounds like you have no problems with the file - let me try that latest package and I'll let you know
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937859#comment-16937859 ] Anthony Abate commented on ARROW-6682: --

if the file is 'bad' shouldn't that easily be determined by examining the attached file?
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937855#comment-16937855 ] Anthony Abate commented on ARROW-6682: --

[~npr] - how many times did you try to load it? I get it to fail 4 out of every 5 times
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937853#comment-16937853 ] Anthony Abate commented on ARROW-6682: --

[~wesm] - Also, I can generate many other files from the C# libraries that have no problems being loaded
[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937839#comment-16937839 ] Anthony Abate commented on ARROW-6682: --

It loads 'sometimes' - so it sounds like threading issues?
[jira] [Commented] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937833#comment-16937833 ] Anthony Abate commented on ARROW-6682: --

Code above (it's trivial). System: 8 cores (virtual), 64 gigs RAM, Windows 10.
[jira] [Commented] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937831#comment-16937831 ] Anthony Abate commented on ARROW-6682: --

{code:java}
start_time <- Sys.time()
start_memory <- memory.size()

library(arrow)
dfcs <- read_arrow("e:\\Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")

end_memory <- memory.size()
end_time <- Sys.time()

print(end_memory)
end_time - start_time
end_memory - start_memory
{code}
[jira] [Comment Edited] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937831#comment-16937831 ] Anthony Abate edited comment on ARROW-6682 at 9/25/19 3:20 PM: ---

{code:java}
start_time <- Sys.time()
start_memory <- memory.size()

library(arrow)
dfcs <- read_arrow("e:\\Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")

end_memory <- memory.size()
end_time <- Sys.time()

print(end_memory)
end_time - start_time
end_memory - start_memory
{code}
[jira] [Comment Edited] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937820#comment-16937820 ] Anthony Abate edited comment on ARROW-6682 at 9/25/19 3:13 PM: ---

I have a 150 meg file that I generated (with the C# library) with random data; it has 50 columns and it hangs on (almost) every load!
[jira] [Updated] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6682: - Description:

I get random hangs on arrow_read in R (windows) when using a very large file (10-12gb). (the file has 37 columns)

I have memory dumps - All threads seem to be in wait handles.

Are there debug symbols somewhere?

Is there a way to get the C++ code to produce diagnostic logging from R?

*UPDATE:* it seems that the hangs are not related to file size, row counts, or # of record batches, but rather the number of *columns*
[jira] [Updated] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Abate updated ARROW-6682: - Description:

I get random hangs on arrow_read in R (windows) when using a very large file (10-12gb). (the file has 37 columns)

I have memory dumps - All threads seem to be in wait handles.

Are there debug symbols somewhere?

Is there a way to get the C++ code to produce diagnostic logging from R?

it seems that the hangs are not related to file size, row counts, or # of record batches, but rather the number of *columns*
[jira] [Commented] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)
[ https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937822#comment-16937822 ] Anthony Abate commented on ARROW-6682: --

See the attached file