[jira] [Comment Edited] (ARROW-9035) [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default

2020-06-04 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126099#comment-17126099
 ] 

Anthony Abate edited comment on ARROW-9035 at 6/4/20, 5:38 PM:
---

yes - I didn't realize it was configurable - it probably works (but I'll know 
soon if it doesn't).

I thought the docs sections were in conflict - but now I realize that 8-byte 
alignment is the 'requirement', not 64. (64 is still a multiple of 8.)
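
As a quick illustration of why that reconciles the two sections - a sketch in C#, not library code - padding rounds a buffer length up to the next multiple of the alignment, so the 20 bytes of int32 data become a 24-byte body under 8-byte alignment and would become 64 under 64-byte alignment:
{code:java}
// Round a length up to the next multiple of a power-of-two alignment.
static long AlignTo(long length, int alignment) =>
    (length + alignment - 1) & ~(long)(alignment - 1);

// AlignTo(20, 8)  == 24  -> the bodyLength: 24 seen in the metadata dump
// AlignTo(20, 64) == 64  -> the "Buffer Alignment and Padding" examples
{code}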

 


was (Author: abbot):
yes - I didn't realize it was configurable - it probably works but I'll know 
soon if it doesn't.

I thought the docs sections were in conflict - but now I realize that 8-byte 
alignment is the 'requirement', not 64. (64 is still a multiple of 8.)

 

> [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default
> ---
>
> Key: ARROW-9035
> URL: https://issues.apache.org/jira/browse/ARROW-9035
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Affects Versions: 0.17.0
>Reporter: Anthony Abate
>Priority: Minor
>
> I used the C++ library to create a very small Arrow file (1 field of 5 int32 
> values) and was surprised that the buffers are not aligned to 64 bytes, as the 
> documentation section "Buffer Alignment and Padding" describes with examples. 
> Based on those examples, the 20 bytes of int32 data should be padded to 64 
> bytes, but the padding only goes to 24 (see below).
> Extracted message metadata - see bodyLength = 24:
> {code:java}
> {
>   version: "V4",
>   header_type: "RecordBatch",
>   header: {
> nodes: [
>   {
> length: 5,
> null_count: 0
>   }
> ],
> buffers: [
>   {
> offset: 0,
> length: 0
>   },
>   {
> offset: 0,
> length: 20
>   }
> ]
>   },
>   bodyLength: 24
> } {code}
> Reading further down, the documentation section "Encapsulated message format" 
> says serialization should use 8-byte alignment. 
> These two sections seem at odds with each other, and some clarification is needed.
> Is the documentation wrong? 
> Or
> Should 8-byte alignment be used for the File format and 64-byte for IPC?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9035) [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default

2020-06-04 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126099#comment-17126099
 ] 

Anthony Abate commented on ARROW-9035:
--

yes - I didn't realize it was configurable - it probably works but I'll know 
soon if it doesn't.

I thought the docs sections were in conflict - but now I realize that 8-byte 
alignment is the 'requirement', not 64. (After all, 64 is still a multiple of 8.)

 

> [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default
> ---
>
> Key: ARROW-9035
> URL: https://issues.apache.org/jira/browse/ARROW-9035
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Affects Versions: 0.17.0
>Reporter: Anthony Abate
>Priority: Minor
>
> I used the C++ library to create a very small Arrow file (1 field of 5 int32 
> values) and was surprised that the buffers are not aligned to 64 bytes, as the 
> documentation section "Buffer Alignment and Padding" describes with examples. 
> Based on those examples, the 20 bytes of int32 data should be padded to 64 
> bytes, but the padding only goes to 24 (see below).
> Extracted message metadata - see bodyLength = 24:
> {code:java}
> {
>   version: "V4",
>   header_type: "RecordBatch",
>   header: {
> nodes: [
>   {
> length: 5,
> null_count: 0
>   }
> ],
> buffers: [
>   {
> offset: 0,
> length: 0
>   },
>   {
> offset: 0,
> length: 20
>   }
> ]
>   },
>   bodyLength: 24
> } {code}
> Reading further down, the documentation section "Encapsulated message format" 
> says serialization should use 8-byte alignment. 
> These two sections seem at odds with each other, and some clarification is needed.
> Is the documentation wrong? 
> Or
> Should 8-byte alignment be used for the File format and 64-byte for IPC?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9035) [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default

2020-06-04 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126099#comment-17126099
 ] 

Anthony Abate edited comment on ARROW-9035 at 6/4/20, 5:37 PM:
---

yes - I didn't realize it was configurable - it probably works but I'll know 
soon if it doesn't.

I thought the docs sections were in conflict - but now I realize that 8-byte 
alignment is the 'requirement', not 64. (64 is still a multiple of 8.)

 


was (Author: abbot):
yes - I didn't realize it was configurable - it probably works but I'll know 
soon if it doesn't.

I thought the docs sections were in conflict - but now I realize that 8-byte 
alignment is the 'requirement', not 64. (After all, 64 is still a multiple of 8.)

 

> [C++] Writing IPC messages with 64-byte buffer alignment vs. 8-byte default
> ---
>
> Key: ARROW-9035
> URL: https://issues.apache.org/jira/browse/ARROW-9035
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Affects Versions: 0.17.0
>Reporter: Anthony Abate
>Priority: Minor
>
> I used the C++ library to create a very small Arrow file (1 field of 5 int32 
> values) and was surprised that the buffers are not aligned to 64 bytes, as the 
> documentation section "Buffer Alignment and Padding" describes with examples. 
> Based on those examples, the 20 bytes of int32 data should be padded to 64 
> bytes, but the padding only goes to 24 (see below).
> Extracted message metadata - see bodyLength = 24:
> {code:java}
> {
>   version: "V4",
>   header_type: "RecordBatch",
>   header: {
> nodes: [
>   {
> length: 5,
> null_count: 0
>   }
> ],
> buffers: [
>   {
> offset: 0,
> length: 0
>   },
>   {
> offset: 0,
> length: 20
>   }
> ]
>   },
>   bodyLength: 24
> } {code}
> Reading further down, the documentation section "Encapsulated message format" 
> says serialization should use 8-byte alignment. 
> These two sections seem at odds with each other, and some clarification is needed.
> Is the documentation wrong? 
> Or
> Should 8-byte alignment be used for the File format and 64-byte for IPC?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9035) 8 vs 64 byte alignment

2020-06-04 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17126052#comment-17126052
 ] 

Anthony Abate commented on ARROW-9035:
--

Perhaps in RFC terms ([https://tools.ietf.org/html/rfc2119]) the doc should say:

All buffers (metadata (flatbuffers) and data buffers) MUST be 8-byte aligned 
but SHOULD be 64-byte aligned - this would apply to both sections.

With most of the docs stressing 64-byte alignment, I didn't realize the 
'default' alignment in the C++ library is 8 bytes... I assumed it would be 64.
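
A minimal sketch of the checks that wording implies (illustrative helpers, not library API):
{code:java}
static bool IsRequiredAligned(long value) => value % 8 == 0;     // MUST
static bool IsRecommendedAligned(long value) => value % 64 == 0; // SHOULD

// Every 64-byte-aligned value also passes the MUST check: 64 is a multiple of 8.
{code}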

 

 

> 8 vs 64 byte alignment
> --
>
> Key: ARROW-9035
> URL: https://issues.apache.org/jira/browse/ARROW-9035
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Documentation
>Affects Versions: 0.17.0
>Reporter: Anthony Abate
>Priority: Minor
>
> I used the C++ library to create a very small Arrow file (1 field of 5 int32 
> values) and was surprised that the buffers are not aligned to 64 bytes, as the 
> documentation section "Buffer Alignment and Padding" describes with examples. 
> Based on those examples, the 20 bytes of int32 data should be padded to 64 
> bytes, but the padding only goes to 24 (see below).
> Extracted message metadata - see bodyLength = 24:
> {code:java}
> {
>   version: "V4",
>   header_type: "RecordBatch",
>   header: {
> nodes: [
>   {
> length: 5,
> null_count: 0
>   }
> ],
> buffers: [
>   {
> offset: 0,
> length: 0
>   },
>   {
> offset: 0,
> length: 20
>   }
> ]
>   },
>   bodyLength: 24
> } {code}
> Reading further down, the documentation section "Encapsulated message format" 
> says serialization should use 8-byte alignment. 
> These two sections seem at odds with each other, and some clarification is needed.
> Is the documentation wrong? 
> Or
> Should 8-byte alignment be used for the File format and 64-byte for IPC?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-9035) 8 vs 64 byte alignment

2020-06-04 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-9035:


 Summary: 8 vs 64 byte alignment
 Key: ARROW-9035
 URL: https://issues.apache.org/jira/browse/ARROW-9035
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Documentation
Affects Versions: 0.17.0
Reporter: Anthony Abate


I used the C++ library to create a very small Arrow file (1 field of 5 int32 
values) and was surprised that the buffers are not aligned to 64 bytes, as the 
documentation section "Buffer Alignment and Padding" describes with examples. 
Based on those examples, the 20 bytes of int32 data should be padded to 64 
bytes, but the padding only goes to 24 (see below).

Extracted message metadata - see bodyLength = 24:
{code:java}
{
  version: "V4",
  header_type: "RecordBatch",
  header: {
nodes: [
  {
length: 5,
null_count: 0
  }
],
buffers: [
  {
offset: 0,
length: 0
  },
  {
offset: 0,
length: 20
  }
]
  },
  bodyLength: 24
} {code}
Reading further down, the documentation section "Encapsulated message format" 
says serialization should use 8-byte alignment. 

These two sections seem at odds with each other, and some clarification is needed.

Is the documentation wrong? 

Or

Should 8-byte alignment be used for the File format and 64-byte for IPC?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs

2020-01-07 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17010293#comment-17010293
 ] 

Anthony Abate commented on ARROW-7511:
--

Now I remember why I thought Memory<T> and Span<T> can't support more than 2 gigs:

the *.Slice()* function only takes int32:

https://docs.microsoft.com/en-us/dotnet/api/system.memory-1.slice?view=netcore-3.1#System_Memory_1_Slice_System_Int32_System_Int32_
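
A sketch of the consequence (illustrative names, not Apache.Arrow API): since Slice offsets are int, any region larger than int.MaxValue bytes has to be split into chunks and addressed in two steps:
{code:java}
// Memory<T>.Slice(int, int) caps a single view at int.MaxValue bytes,
// so a larger logical region must span multiple Memory<byte> chunks.
static Memory<byte> SliceAt(Memory<byte>[] chunks, long chunkSize, long absoluteOffset)
{
    int chunkIndex = (int)(absoluteOffset / chunkSize);
    int offsetInChunk = (int)(absoluteOffset % chunkSize); // now fits in an int
    return chunks[chunkIndex].Slice(offsetInChunk);
}
{code}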

 

> [C#] - Batch / Data Size Can't Exceed 2 gigs
> 
>
> Key: ARROW-7511
> URL: https://issues.apache.org/jira/browse/ARROW-7511
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> While the Arrow spec does not forbid batches larger than 2 gigs, the C# 
> library cannot support this in its current form due to limits on managed 
> memory, as it tries to put the whole batch into a single 
> Span<byte>/Memory<byte>.
> It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the 
> entire batch and instead moving the memory mapping to the ArrowBuffers.  This 
> only moves the problem 'lower', as it would then still limit the column 
> data in a single batch to 2 gigs.  
> This seems like plenty of memory... but if you think of string columns, the 
> data is just one giant string appended together with offsets, and it can 
> get very large quickly.
> I think the unfortunate problem is that memory management in the C# managed 
> world is always going to hit the 2 gig limit somewhere. (Please correct me if 
> I am wrong on this statement, but I thought I read somewhere that Memory<T> 
> / Span<T> are limited to int and changing to long would require major 
> framework rewrites - but I may be conflating that with arrays.)
> That ultimately means the C# library either has to reject files of certain 
> characteristics (i.e. validation checks on opening), or the spec needs to put 
> upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to 
> eliminate the need for more than 2 gigs of contiguous memory for the 
> smallest Arrow object.
> However, if the spec was indeed designed for the smallest buffer object to be 
> larger than 2 gigs, or for the entire memory buffer of Arrow to be 
> contiguous, one has to wonder if at some point it might just make sense for 
> the C# library to use the C++ library as its memory manager, as replicating 
> very large blocks of memory is more work than it's worth.
> In any case, this issue is more about 'deferring' the 2 gig size problem by 
> moving it down to the buffer objects... This might require some rewrite of 
> the batch data structures.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs

2020-01-07 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7511:
-
Description: 
While the Arrow spec does not forbid batches larger than 2 gigs, the C# library 
cannot support this in its current form due to limits on managed memory, as it 
tries to put the whole batch into a single Span<byte>/Memory<byte>.

It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the 
entire batch and instead moving the memory mapping to the ArrowBuffers.  This 
only moves the problem 'lower', as it would then still limit the column data in 
a single batch to 2 gigs.  

This seems like plenty of memory... but if you think of string columns, the 
data is just one giant string appended together with offsets, and it can get 
very large quickly.
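
To make the string-column point concrete, a minimal sketch (not Apache.Arrow code) of Arrow's variable-width layout - one shared UTF-8 value buffer plus int32 offsets, so a single column chunk is capped near 2 gigs by the offset type alone:
{code:java}
using System.Collections.Generic;
using System.Text;

var values = new[] { "foo", "bar", "baz" };
var offsets = new List<int> { 0 };   // int32 offsets, as in the Arrow spec
var data = new List<byte>();         // one giant UTF-8 buffer for all values
foreach (var v in values)
{
    data.AddRange(Encoding.UTF8.GetBytes(v));
    offsets.Add(data.Count);         // overflows once the buffer passes int.MaxValue
}
{code}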

I think the unfortunate problem is that memory management in the C# managed 
world is always going to hit the 2 gig limit somewhere. (Please correct me if I 
am wrong on this statement, but I thought I read somewhere that Memory<T> / 
Span<T> are limited to int and changing to long would require major framework 
rewrites - but I may be conflating that with arrays.)

That ultimately means the C# library either has to reject files of certain 
characteristics (i.e. validation checks on opening), or the spec needs to put 
upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to 
eliminate the need for more than 2 gigs of contiguous memory for the smallest 
Arrow object.

However, if the spec was indeed designed for the smallest buffer object to be 
larger than 2 gigs, or for the entire memory buffer of Arrow to be contiguous, 
one has to wonder if at some point it might just make sense for the C# library 
to use the C++ library as its memory manager, as replicating very large blocks 
of memory is more work than it's worth.

In any case, this issue is more about 'deferring' the 2 gig size problem by 
moving it down to the buffer objects... This might require some rewrite of the 
batch data structures.

 

 

  was:
While the Arrow spec does not forbid batches larger than 2 gigs, the C# library 
cannot support this in its current form due to limits on managed memory, as it 
tries to put the whole batch into a single Span<byte>/Memory<byte>.

It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the 
entire batch and instead moving the memory mapping to the ArrowBuffers.  This 
only moves the problem 'lower', as it would then still limit the column data in 
a single batch to 2 gigs.  

This seems like plenty of memory... but if you think of string columns, the 
data is just one giant string appended together with offsets, and it can get 
very large quickly.

I think the unfortunate problem is that memory management in the C# managed 
world is always going to hit the 2 gig limit somewhere. (Please correct me if I 
am wrong on this statement.)

That ultimately means the C# library either has to reject files of certain 
characteristics (i.e. validation checks on opening), or the spec needs to put 
upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to 
eliminate the need for more than 2 gigs of contiguous memory for the smallest 
Arrow object.

However, if the spec was indeed designed for the smallest buffer object to be 
larger than 2 gigs, or for the entire memory buffer of Arrow to be contiguous, 
one has to wonder if at some point it might just make sense for the C# library 
to use the C++ library as its memory manager, as replicating very large blocks 
of memory is more work than it's worth.

In any case, this issue is more about 'deferring' the 2 gig size problem by 
moving it down to the buffer objects... This might require some rewrite of the 
batch data structures.

 

 


> [C#] - Batch / Data Size Can't Exceed 2 gigs
> 
>
> Key: ARROW-7511
> URL: https://issues.apache.org/jira/browse/ARROW-7511
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> While the Arrow spec does not forbid batches larger than 2 gigs, the C# 
> library cannot support this in its current form due to limits on managed 
> memory, as it tries to put the whole batch into a single 
> Span<byte>/Memory<byte>.
> It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the 
> entire batch and instead moving the memory mapping to the ArrowBuffers.  This 
> only moves the problem 'lower', as it would then still limit the column 
> data in a single batch to 2 gigs.  
> This seems like plenty of memory... but if you think of string columns, the 
> data is just one giant string appended together with offsets, and it can 
> get very large quickly.
> I think the unfortunate problem is that memory management in the C# 

[jira] [Created] (ARROW-7511) [C#] - Batch / Data Size Can't Exceed 2 gigs

2020-01-07 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-7511:


 Summary: [C#] - Batch / Data Size Can't Exceed 2 gigs
 Key: ARROW-7511
 URL: https://issues.apache.org/jira/browse/ARROW-7511
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Affects Versions: 0.15.1
Reporter: Anthony Abate


While the Arrow spec does not forbid batches larger than 2 gigs, the C# library 
cannot support this in its current form due to limits on managed memory, as it 
tries to put the whole batch into a single Span<byte>/Memory<byte>.

It is possible to fix this by not trying to use Memory<T>/Span<T>/byte[] for the 
entire batch and instead moving the memory mapping to the ArrowBuffers.  This 
only moves the problem 'lower', as it would then still limit the column data in 
a single batch to 2 gigs.  

This seems like plenty of memory... but if you think of string columns, the 
data is just one giant string appended together with offsets, and it can get 
very large quickly.

I think the unfortunate problem is that memory management in the C# managed 
world is always going to hit the 2 gig limit somewhere. (Please correct me if I 
am wrong on this statement.)

That ultimately means the C# library either has to reject files of certain 
characteristics (i.e. validation checks on opening), or the spec needs to put 
upper limits on certain internal Arrow constructs (i.e. the Arrow buffer) to 
eliminate the need for more than 2 gigs of contiguous memory for the smallest 
Arrow object.

However, if the spec was indeed designed for the smallest buffer object to be 
larger than 2 gigs, or for the entire memory buffer of Arrow to be contiguous, 
one has to wonder if at some point it might just make sense for the C# library 
to use the C++ library as its memory manager, as replicating very large blocks 
of memory is more work than it's worth.

In any case, this issue is more about 'deferring' the 2 gig size problem by 
moving it down to the buffer objects... This might require some rewrite of the 
batch data structures.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7508) [C#] DateTime32 Reading is Broken

2020-01-07 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7508:
-
Summary: [C#] DateTime32 Reading is Broken  (was: [C#] DateTime Reading is 
Broken)

> [C#] DateTime32 Reading is Broken
> -
>
> Key: ARROW-7508
> URL: https://issues.apache.org/jira/browse/ARROW-7508
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Assignee: Anthony Abate
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> DateTime support for writing works - but reading is broken.
> This is another arithmetic overflow bug (reported a few already) which causes 
> dates to be misinterpreted.
> I extracted the current logic out to LINQPad to show the bug and the fix:
>  
> {code:java}
> var dto = DateTimeOffset.Parse("2024-09-25");
> (dto.ToUnixTimeMilliseconds() / 86400000).Dump();
> // YIELDS: 19991
> 
> unchecked // current code
> {
>     DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
>     // 1/8/1970 WRONG
> }
> 
> checked
> {
>     DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
>     // 9/25/2024 CORRECT
> } {code}
>  
>  
> The fix is trivial - a cast to long is missing wherever 
> *FromUnixTimeMilliseconds* is used
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7509) [C#] Turn on Checked mode for debug builds

2020-01-07 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7509:
-
Summary: [C#] Turn on Checked mode for debug builds  (was: Turn on Checked 
mode for debug builds)

> [C#] Turn on Checked mode for debug builds
> --
>
> Key: ARROW-7509
> URL: https://issues.apache.org/jira/browse/ARROW-7509
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Priority: Minor
>
> Anyone object to turning on checked mode for debug builds? 
> There have been many arithmetic overflow bugs. These could have been caught 
> earlier simply by running the code with checked turned on.
> Then the unit tests could be run in debug mode, and any obvious overflow bugs 
> might be caught.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7509) Turn on Checked mode for debug builds

2020-01-07 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-7509:


 Summary: Turn on Checked mode for debug builds
 Key: ARROW-7509
 URL: https://issues.apache.org/jira/browse/ARROW-7509
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C#
Affects Versions: 0.15.1
Reporter: Anthony Abate


Anyone object to turning on checked mode for debug builds? 

There have been many arithmetic overflow bugs. These could have been caught 
earlier simply by running the code with checked turned on.

Then the unit tests could be run in debug mode, and any obvious overflow bugs 
might be caught.
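
For illustration - a sketch, not this project's actual build change - checked arithmetic turns a silent wrap into an exception, and the csproj switch is the CheckForOverflowUnderflow property for the Debug configuration:
{code:java}
// In the .csproj, Debug configuration (MSBuild property):
//   <CheckForOverflowUnderflow>true</CheckForOverflowUnderflow>

int days = 19991;
long wrong = unchecked(days * 86400000); // wraps silently to a bogus value
long right = days * 86400000L;           // promoted to long: no overflow
// Under checked mode, the 'wrong' line throws OverflowException instead
// of quietly producing a wrong date.
{code}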



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7508) DateTime Reading is Broken

2020-01-07 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7508:
-
Description: 
DateTime support for writing works - but reading is broken.

This is another arithmetic overflow bug (reported a few already) which causes 
dates to be misinterpreted.

I extracted the current logic out to LINQPad to show the bug and the fix:

 
{code:java}
var dto = DateTimeOffset.Parse("2024-09-25");
(dto.ToUnixTimeMilliseconds() / 86400000).Dump();
// YIELDS: 19991

unchecked // current code
{
    DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
    // 1/8/1970 WRONG
}

checked
{
    DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
    // 9/25/2024 CORRECT
} {code}
 

 

The fix is trivial - a cast to long is missing wherever 
*FromUnixTimeMilliseconds* is used.
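
A minimal sketch of the corrected conversion (hypothetical helper name; the actual change is just the long promotion):
{code:java}
const long MillisecondsPerDay = 86400000L;

// Hypothetical helper: convert an Arrow date32 day count without overflow.
static DateTimeOffset FromDate32(int days) =>
    DateTimeOffset.FromUnixTimeMilliseconds(days * MillisecondsPerDay);
{code}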

 

 

  was:
DateTime support for writing works - but reading is broken.

This is another arithmetic overflow bug (reported a few already) which causes 
dates to be misinterpreted.

I extracted the current logic out to LINQPad to show the bug and the fix:

 
{code:java}
var dto = DateTimeOffset.Parse("2024-09-25");
(dto.ToUnixTimeMilliseconds() / 86400000).Dump();
// YIELDS: 19991

unchecked // current code
{
    DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
    // 1/8/1970 WRONG
}

checked
{
    DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
    // 9/25/2024 CORRECT
} {code}
 

 

The fix is trivial - a cast to long is missing wherever 
FromUnixTimeMilliseconds is used.

 

 


> DateTime Reading is Broken
> --
>
> Key: ARROW-7508
> URL: https://issues.apache.org/jira/browse/ARROW-7508
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Assignee: Anthony Abate
>Priority: Critical
>
> DateTime support for writing works - but reading is broken.
> This is another arithmetic overflow bug (reported a few already) which causes 
> dates to be misinterpreted.
> I extracted the current logic out to LINQPad to show the bug and the fix:
>  
> {code:java}
> var dto = DateTimeOffset.Parse("2024-09-25");
> (dto.ToUnixTimeMilliseconds() / 86400000).Dump();
> // YIELDS: 19991
> 
> unchecked // current code
> {
>     DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
>     // 1/8/1970 WRONG
> }
> 
> checked
> {
>     DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
>     // 9/25/2024 CORRECT
> } {code}
>  
>  
> The fix is trivial - a cast to long is missing wherever 
> *FromUnixTimeMilliseconds* is used
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7508) DateTime Reading is Broken

2020-01-07 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-7508:


 Summary: DateTime Reading is Broken
 Key: ARROW-7508
 URL: https://issues.apache.org/jira/browse/ARROW-7508
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Affects Versions: 0.15.1
Reporter: Anthony Abate
Assignee: Anthony Abate


DateTime support for writing works - but reading is broken.

This is another arithmetic overflow bug (reported a few already) which causes 
dates to be misinterpreted.

I extracted the current logic out to LINQPad to show the bug and the fix:

 
{code:java}
var dto = DateTimeOffset.Parse("2024-09-25");
(dto.ToUnixTimeMilliseconds() / 86400000).Dump();
// YIELDS: 19991

unchecked // current code
{
    DateTimeOffset.FromUnixTimeMilliseconds(19991 * 86400000).Dump();
    // 1/8/1970 WRONG
}

checked
{
    DateTimeOffset.FromUnixTimeMilliseconds((long)19991 * 86400000).Dump();
    // 9/25/2024 CORRECT
} {code}
 

 

The fix is trivial - a cast to long is missing wherever 
FromUnixTimeMilliseconds is used.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls

2020-01-03 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007195#comment-17007195
 ] 

Anthony Abate edited comment on ARROW-6603 at 1/3/20 2:41 PM:
--

The reason I did this is that, as much as I tried to use the existing API, I 
think it can't support nullable correctly, as many assumptions are baked into 
the generic always being non-nullable.

If you see how I implement nullable, I am filling in dummy values in the Value 
Buffer for nulls but correctly setting the validity bitmap... this results in a 
reader of the arrow file correctly interpreting the NULL.

It still might be possible to add a builder method called AppendNullable() to 
the existing builder code... but I was able to get the code in the PR to work 
fairly quickly once I understood the flatbuffer spec
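
A sketch of the approach described (illustrative buffers, not the PR's actual types): a null slot is a placeholder in the value buffer plus a cleared validity bit.
{code:java}
using System.Collections.Generic;

var valueBuffer = new List<int>();   // stand-in for the Arrow value buffer
var validity = new List<bool>();     // stand-in for the validity bitmap
int nullCount = 0;

void Append(int v) { valueBuffer.Add(v); validity.Add(true); }
void AppendNull()  { valueBuffer.Add(default); validity.Add(false); nullCount++; }
{code}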


was (Author: abbot):
The reason I did this is that, as much as I tried to use the existing API, I 
think it can't support nullable correctly, as many assumptions are baked into 
the generic always being non-nullable.

If you see how I implement nullability, I am filling in dummy values in the 
Value Buffer for nulls but correctly setting the validity bitmap... this 
results in a reader of the arrow file correctly interpreting the NULL.

It still might be possible to add a builder method called AppendNullable() to 
the existing builder code... but I was able to get the code in the PR to work 
fairly quickly once I understood the flatbuffer spec

> [C#] ArrayBuilder API to support writing nulls
> --
>
> Key: ARROW-6603
> URL: https://issues.apache.org/jira/browse/ARROW-6603
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Assignee: Anthony Abate
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 72h
>  Time Spent: 10m
>  Remaining Estimate: 71h 50m
>
> There is currently no API in the PrimitiveArrayBuilder class to support 
> writing nulls.  See this TODO - 
> [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.]
>  
> Also see [https://github.com/apache/arrow/issues/5381].
>  
> We should add some APIs to support writing nulls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework

2020-01-02 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007222#comment-17007222
 ] 

Anthony Abate commented on ARROW-7040:
--

created a PR for this:

https://github.com/apache/arrow/pull/6122

> [C#] System.Memory  Span.CopyTo - Crashes on Net Framework 
> ---
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Assignee: Anthony Abate
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000)
>                              .Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays (i.e. IntArrayBuilder).
> I suspect it is due to the offset array and all the copying / resizing going on.
>  
> Update - it seems that the problem is in the underlying 
> *ArrowBuffer.Builder<byte>*
> {code:java}
> public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000)
>                              .Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var valueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>                 {
>                     valueBuffer.Append(Encoding.UTF8.GetBytes(d));
>                 }
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }{code}
>  
> Update 2:
> This is due to a confirmed bug in System.Memory - the implications are that 
> Span.CopyTo needs to be removed / replaced. 
> This method is used internally by ArrowBuffer, so I can't work around it 
> easily. 
> Solutions
>  # Change the code
>  ## Remove it outright (including disabling Span in FlatBuffers)
>  ## Create a multi-target nuget where the offending code has compile blocks 
> #if NETFRAMEWORK - and disable Span in FlatBuffers only for the .NET 
> Framework build 
>  # Wait for a System.Memory fix?
>  
> I suspect option 2 won't happen anytime soon.
>  
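
For reference, a sketch of the multi-target idea in option 1.2 (illustrative, not the actual patch): guard the Span-based copy with a conditional-compilation symbol so the .NET Framework build takes a safe path.
{code:java}
static void SafeCopy(byte[] source, byte[] destination, int length)
{
#if NETFRAMEWORK
    // .NET Framework build: sidestep the buggy Span.CopyTo path.
    Buffer.BlockCopy(source, 0, destination, 0, length);
#else
    source.AsSpan(0, length).CopyTo(destination);
#endif
}
{code}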



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework

2020-01-02 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate reassigned ARROW-7040:


Assignee: Anthony Abate

> [C#] System.Memory  Span.CopyTo - Crashes on Net Framework 
> ---
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Assignee: Anthony Abate
>Priority: Blocker
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000)
>                              .Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays (i.e. IntArrayBuilder).
> I suspect it is due to the offset array and all the copying / resizing going on.
>  
> Update - it seems that the problem is in the underlying 
> *ArrowBuffer.Builder<byte>*
> {code:java}
> public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000)
>                              .Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var valueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>                 {
>                     valueBuffer.Append(Encoding.UTF8.GetBytes(d));
>                 }
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }{code}
>  
> Update 2:
> This is due to a confirmed bug in System.Memory - the implications are that 
> Span.CopyTo needs to be removed / replaced. 
> This method is used internally by ArrowBuffer, so I can't work around it 
> easily. 
> Solutions
>  # Change the code
>  ## Remove it outright (including disabling Span in FlatBuffers)
>  ## Create a multi-target nuget where the offending code has compile blocks 
> #if NETFRAMEWORK - and disable Span in FlatBuffers only for the .NET 
> Framework build 
>  # Wait for a System.Memory fix?
>  
> I suspect option 2 won't happen anytime soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls

2020-01-02 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007195#comment-17007195
 ] 

Anthony Abate commented on ARROW-6603:
--

The reason I did this is that, as much as I tried to use the existing API, I 
think it can't support nullable correctly, as many assumptions are baked into 
the generic always being non-nullable.

If you see how I implement nullability, I am filling in dummy values in the 
Value Buffer for nulls but correctly setting the validity bitmap... this 
results in a reader of the arrow file correctly interpreting the NULL.

It still might be possible to add a builder method called AppendNullable() to 
the existing builder code... but I was able to get the code in the PR to work 
fairly quickly once I understood the flatbuffer spec

> [C#] ArrayBuilder API to support writing nulls
> --
>
> Key: ARROW-6603
> URL: https://issues.apache.org/jira/browse/ARROW-6603
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Assignee: Anthony Abate
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> There is currently no API in the PrimitiveArrayBuilder class to support 
> writing nulls.  See this TODO - 
> [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.]
>  
> Also see [https://github.com/apache/arrow/issues/5381].
>  
> We should add some APIs to support writing nulls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls

2020-01-02 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007190#comment-17007190
 ] 

Anthony Abate commented on ARROW-6603:
--

I added a PR

[https://github.com/apache/arrow/pull/6121]

 

Note that this does not change the existing API, but can be used in lieu of it 
when creating record batches.

> [C#] ArrayBuilder API to support writing nulls
> --
>
> Key: ARROW-6603
> URL: https://issues.apache.org/jira/browse/ARROW-6603
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Assignee: Anthony Abate
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> There is currently no API in the PrimitiveArrayBuilder class to support 
> writing nulls.  See this TODO - 
> [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.]
>  
> Also see [https://github.com/apache/arrow/issues/5381].
>  
> We should add some APIs to support writing nulls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls

2020-01-02 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate reassigned ARROW-6603:


Assignee: Anthony Abate

> [C#] ArrayBuilder API to support writing nulls
> --
>
> Key: ARROW-6603
> URL: https://issues.apache.org/jira/browse/ARROW-6603
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Assignee: Anthony Abate
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> There is currently no API in the PrimitiveArrayBuilder class to support 
> writing nulls.  See this TODO - 
> [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.]
>  
> Also see [https://github.com/apache/arrow/issues/5381].
>  
> We should add some APIs to support writing nulls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [C#] Large record batch is written with negative buffer length

2019-11-21 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16979727#comment-16979727
 ] 

Anthony Abate commented on ARROW-7156:
--

[~wesm] - I should also point out the C# library on .NET Framework is not even 
stable in its current state due to random crashes related to this: ARROW-7040 
(I have a local build that fixes this - so I can make a PR for that one).

Regarding the integration tests - since C# is not using the C++ libs, how do 
the integration tests work? (I can volunteer some of my time on this, but I may 
have a lot of questions.)

> [C#] Large record batch is written with negative buffer length
> --
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973723#comment-16973723
 ] 

Anthony Abate commented on ARROW-7156:
--

I uploaded some test files. They are deceptively small compressed... but 2 
gigs uncompressed.

I have a workaround for now - just make sure my batches are less than 2 gigs. 

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Attachment: SingleBatch_String_7_Rows.ok.rar

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Attachment: SingleBatch_String_85000_Rows.crash.rar

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: SingleBatch_String_7_Rows.ok.rar, 
> SingleBatch_String_85000_Rows.crash.rar, image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973717#comment-16973717
 ] 

Anthony Abate commented on ARROW-7156:
--

From Event Viewer:

 

Faulting application name: rsession.exe, version: 1.2.1335.0, time stamp: 
0x5c9d0154
Faulting module name: arrow.dll, version: 0.0.0.0, time stamp: 0x5dc40022
Exception code: 0xc0000005
Fault offset: 0x003e4c05
Faulting process id: 0x8ec
Faulting application start time: 0x01d59a59ff052a76
Faulting application path: C:\software\RStudio\bin\rsession.exe
Faulting module path: 
C:\Users\aabate\Documents\R\win-library\3.6\arrow\libs\x64\arrow.dll
Report Id: db7e29f8-54ba-40fc-a104-75d3b6f75d0e
Faulting package full name: 
Faulting package-relative application ID:

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973714#comment-16973714
 ] 

Anthony Abate commented on ARROW-7156:
--

[~npr] - crashes RStudio means just that - a crash instead of an error message 

 

!image-2019-11-13-16-27-30-641.png!

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Attachment: image-2019-11-13-16-27-30-641.png

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: image-2019-11-13-16-27-30-641.png
>
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973713#comment-16973713
 ] 

Anthony Abate commented on ARROW-7156:
--

[~npr] - do you know if an individual RecordBatch can exceed 2 gigs (int32 
max)? 

This might not be an Arrow C++ issue, but another bug in the C# library that I 
used to generate the file.
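
For what it's worth, a quick sketch (assumed mechanics, not the actual C++ code path) of how an over-2-gig length produces a "negative malloc size": a 64-bit length truncated to int32 goes negative.
{code:java}
long bodyLength = 2_500_000_000;            // a batch just over 2 gigs
int truncated = unchecked((int)bodyLength); // int32 wrap-around
Console.WriteLine(truncated);               // -1794967296 -> "negative malloc size"
{code}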

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Description: 
I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail - all other batches load fine. In 0.14.1 the individual 
batch errors; in 0.15.1 the batch crashes RStudio when it is used.

*0.14.1*
{code:java}
>  rbn <- data_rbfr$get_batch(x)
Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
Invalid: negative malloc size
  {code}
*0.15.1*
{code:java}
rbn <- data_rbfr$get_batch(x)  # works!
df <- as.data.frame(rbn)       # crashes RStudio! {code}
 

Update

I put the data in the batch into a separate file.  The file size is over 2 
gigs. 

Using 0.15.1, when I try to load this entire file via read_arrow it also fails.
{code:java}
ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
Error in Table__from_RecordBatchFileReader(batch_reader) :
 Invalid: negative malloc size{code}

  was:
I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail - all other batches load fine. In 0.14.1 the individual 
batch errors; in 0.15.1 the batch crashes RStudio when it is used.

*0.14.1*
{code:java}
>  rbn <- data_rbfr$get_batch(x)
Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
Invalid: negative malloc size
  {code}
*0.15.1*
{code:java}
rbn <- data_rbfr$get_batch(x)  # works!
df <- as.data.frame(rbn)       # crashes RStudio! {code}
 


> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  
> Update
> I put the data in the batch into a separate file.  The file size is over 2 
> gigs. 
> Using 0.15.1, when I try to load this entire file via read_arrow it also 
> fails.
> {code:java}
> ar <- arrow::read_arrow("e:\\temp\\file.arrow") 
> Error in Table__from_RecordBatchFileReader(batch_reader) :
>  Invalid: negative malloc size{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Summary: [R] [C++] Large Batches Cause Error / Crashes  (was: [R] [C++] 
get_batch - fails for large batches)

> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 gig Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail - all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)  # works!
> df <- as.data.frame(rbn)       # crashes RStudio! {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] Large Batches Cause Error / Crashes

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Description: 
I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail; all other batches load fine. In 0.14.1 the individual 
batch errors; in 0.15.1 the batch crashes RStudio when it is used.

*0.14.1*
{code:java}
>  rbn <- data_rbfr$get_batch(x)
Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
Invalid: negative malloc size
  {code}
*0.15.1*
{code:java}
rbn <- data_rbfr$get_batch(x)     # works!
df <- as.data.frame(rbn)          # crashes RStudio! {code}
 

  was:
I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail; all other batches load fine. In 0.14.1 the individual 
batch errors; in 0.15.1 the batch crashes RStudio when it is used.

*0.14.1*
{code:java}
>  rbn <- data_rbfr$get_batch(x)
Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
Invalid: negative malloc size
  {code}
*0.15.1*
{code:java}
rbn <- data_rbfr$get_batch(x)     # works!
df <- as.data.frame(rbn)          # crashes RStudio! {code}
 


> [R] [C++] Large Batches Cause Error / Crashes
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)     # works!
> df <- as.data.frame(rbn)          # crashes RStudio! {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] get_batch - fails for large batches

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Description: 
I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail; all other batches load fine. In 0.14.1 the individual 
batch errors; in 0.15.1 the batch crashes RStudio when it is used.

*0.14.1*
{code:java}
>  rbn <- data_rbfr$get_batch(x)
Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
Invalid: negative malloc size
  {code}
*0.15.1*
{code:java}
rbn <- data_rbfr$get_batch(x)     # works!
df <- as.data.frame(rbn)          # crashes RStudio! {code}
 

  was:
I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail; all other batches load fine.

I don't know if this is fixed in 0.15.x because 0.15.x fails to load the file 
(another bug)

 
{color:#FF0000}>  rbn <- data_rbfr$get_batch(4){color}
{color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : 
  Invalid: negative malloc size{color}
 


> [R] [C++] get_batch - fails for large batches
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine. In 0.14.1 the 
> individual batch errors; in 0.15.1 the batch crashes RStudio when it is used.
> *0.14.1*
> {code:java}
> >  rbn <- data_rbfr$get_batch(x)
> Error in ipc__RecordBatchFileReader_ReadRecordBatch(self, i) : 
> Invalid: negative malloc size
>   {code}
> *0.15.1*
> {code:java}
> rbn <- data_rbfr$get_batch(x)     # works!
> df <- as.data.frame(rbn)          # crashes RStudio! {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] get_batch - fails for large batches

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973478#comment-16973478
 ] 

Anthony Abate commented on ARROW-7156:
--

this is still a problem in 0.15.1 but the failure is slightly different

rbn <- data_rbfr$get_batch(x)  works! 

df <- as.data.frame(rbn) - crashes RStudio!

 

 

> [R] [C++] get_batch - fails for large batches
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine.
> I don't know if this is fixed in 0.15.x because 0.15.x fails to load the 
> file (another bug)
>  
> {color:#FF0000}>  rbn <- data_rbfr$get_batch(4){color}
> {color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : 
>   Invalid: negative malloc size{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-7156) [R] [C++] get_batch - fails for large batches

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973478#comment-16973478
 ] 

Anthony Abate edited comment on ARROW-7156 at 11/13/19 4:09 PM:


this is still a problem in 0.15.1 but the failure is different
{code:java}
rbn <- data_rbfr$get_batch(x)     # works!
df <- as.data.frame(rbn)          # crashes RStudio!
  {code}
 

 


was (Author: abbot):
this is still a problem in 0.15.1 but the failure is slightly different

rbn <- data_rbfr$get_batch(x)  works! 

df <- as.data.frame(rbn) - crashes RStudio!

 

 

> [R] [C++] get_batch - fails for large batches
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine.
> I don't know if this is fixed in 0.15.x because 0.15.x fails to load the 
> file (another bug)
>  
> {color:#FF0000}>  rbn <- data_rbfr$get_batch(4){color}
> {color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : 
>   Invalid: negative malloc size{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] get_batch - fails for large batches

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Affects Version/s: 0.15.1

> [R] [C++] get_batch - fails for large batches
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1, 0.15.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine.
> I don't know if this is fixed in 0.15.x because 0.15.x fails to load the 
> file (another bug)
>  
> {color:#FF0000}>  rbn <- data_rbfr$get_batch(4){color}
> {color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : 
>   Invalid: negative malloc size{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] get_batch - fails for large batches

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Summary: [R] [C++] get_batch - fails for large batches  (was: [R] [C++] 
get_batch - failes for large batches)

> [R] [C++] get_batch - fails for large batches
> -
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine.
> I don't know if this is fixed in 0.15.x because 0.15.x fails to load the 
> file (another bug)
>  
> {color:#FF0000}>  rbn <- data_rbfr$get_batch(4){color}
> {color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : 
>   Invalid: negative malloc size{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-7157) [R] RecordBatchFileReader - Crashes RStudio

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate closed ARROW-7157.

Resolution: Not A Bug

ok - seems like it's not an issue - the API changed from 0.14.1 to 0.15.1 and I 
picked the wrong function.

 

> [R] RecordBatchFileReader - Crashes RStudio
> ---
>
> Key: ARROW-7157
> URL: https://issues.apache.org/jira/browse/ARROW-7157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Priority: Blocker
>
> I have a 30 GB Arrow file - using RecordBatchFileReader crashes RStudio
> arrow::RecordBatchFileReader$new("file.arrow") 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7157) [R] RecordBatchFileReader - Crashes RStudio

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973447#comment-16973447
 ] 

Anthony Abate commented on ARROW-7157:
--

hmm.. do you mean that it forwards to 'placement new'? Should that even be 
accessible from R? 

 

> [R] RecordBatchFileReader - Crashes RStudio
> ---
>
> Key: ARROW-7157
> URL: https://issues.apache.org/jira/browse/ARROW-7157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Priority: Blocker
>
> I have a 30 GB Arrow file - using RecordBatchFileReader crashes RStudio
> arrow::RecordBatchFileReader$new("file.arrow") 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] get_batch - failes for large batches

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973440#comment-16973440
 ] 

Anthony Abate commented on ARROW-7156:
--

ok, updated - I don't know the exact size of the batch, but it can't be a 
coincidence that the largest batch in the file fails to load - I suspect there 
is some size limitation that was hit

> [R] [C++] get_batch - failes for large batches
> --
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine.
> I don't know if this is fixed in 0.15.x because 0.15.x fails to load the 
> file (another bug)
>  
> {color:#FF0000}>  rbn <- data_rbfr$get_batch(4){color}
> {color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : 
>   Invalid: negative malloc size{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7156) [R] [C++] get_batch - failes for large batches

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7156:
-
Description: 
I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail; all other batches load fine.

I don't know if this is fixed in 0.15.x because 0.15.x fails to load the file 
(another bug)

 
{color:#FF0000}>  rbn <- data_rbfr$get_batch(4){color}
{color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : 
  Invalid: negative malloc size{color}
 

  was:
I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail; all other batches load fine.

I don't know if this is fixed in 0.15.x because 0.15.x fails to load the file 
(another bug)

 


> [R] [C++] get_batch - failes for large batches
> --
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine.
> I don't know if this is fixed in 0.15.x because 0.15.x fails to load the 
> file (another bug)
>  
> {color:#FF0000}>  rbn <- data_rbfr$get_batch(4){color}
> {color:#c5060b}Error in ipc___RecordBatchFileReader__ReadRecordBatch(self, i) : 
>   Invalid: negative malloc size{color}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7156) [R] [C++] get_batch - failes for large batches

2019-11-13 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973435#comment-16973435
 ] 

Anthony Abate commented on ARROW-7156:
--

I'm working on it... I have to 'downgrade' Arrow since 0.15.x seems even more 
broken...

> [R] [C++] get_batch - failes for large batches
> --
>
> Key: ARROW-7156
> URL: https://issues.apache.org/jira/browse/ARROW-7156
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
>
> I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
> causes get_batch to fail; all other batches load fine.
> I don't know if this is fixed in 0.15.x because 0.15.x fails to load the 
> file (another bug)
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7157) [R] RecordBatchFileReader - Crashes RStudio

2019-11-13 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7157:
-
Summary: [R] RecordBatchFileReader - Crashes RStudio  (was: 
RecordBatchFileReader - Crashes RStudio)

> [R] RecordBatchFileReader - Crashes RStudio
> ---
>
> Key: ARROW-7157
> URL: https://issues.apache.org/jira/browse/ARROW-7157
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.15.1
>Reporter: Anthony Abate
>Priority: Blocker
>
> I have a 30 GB Arrow file - using RecordBatchFileReader crashes RStudio
> arrow::RecordBatchFileReader$new("file.arrow") 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7157) RecordBatchFileReader - Crashes RStudio

2019-11-13 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-7157:


 Summary: RecordBatchFileReader - Crashes RStudio
 Key: ARROW-7157
 URL: https://issues.apache.org/jira/browse/ARROW-7157
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Affects Versions: 0.15.1
Reporter: Anthony Abate


I have a 30 GB Arrow file - using RecordBatchFileReader crashes RStudio

arrow::RecordBatchFileReader$new("file.arrow") 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7156) [R] [C++] get_batch - failes for large batches

2019-11-13 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-7156:


 Summary: [R] [C++] get_batch - failes for large batches
 Key: ARROW-7156
 URL: https://issues.apache.org/jira/browse/ARROW-7156
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Affects Versions: 0.14.1
Reporter: Anthony Abate


I have a 30 GB Arrow file with 100 batches. The largest batch in the file 
causes get_batch to fail; all other batches load fine.

I don't know if this is fixed in 0.15.x because 0.15.x fails to load the file 
(another bug)

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework

2019-11-01 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Description: 
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();

    // spin up 30 concurrent tasks, each building 1000 string arrays
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();

    // same stress pattern, but against the raw byte-buffer builder
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

Update 2:

This is due to a confirmed bug in System.Memory - The implications are that 
Span.CopyTo needs to be removed / replaced. 

This method is used internally by ArrowBuffer, so I can't work around this 
easily. 

Solutions
 # Change the code
 ## Remove it outright (including disabling Span in FlatBuffers)
 ## Create a multi-target NuGet where the offending code is behind #if 
(NETFRAMEWORK) compile blocks - and disable Span in FlatBuffers only for the 
.NET Framework build 
 # Wait for a System.Memory fix?

 

I suspect option 2 won't happen anytime soon.  

 

  was:
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

Update 2:

This is due to a confirmed bug in System.Memory - The implications are that 
Span.CopyTo needs to be removed / replaced. 

This method is used internally by ArrowBuffer, so I can't work around this 
easily.  

Solutions
 # Change the code
 ## Remove it outright (including within FlatBuffers)
 ## Create a multi-target NuGet where the offending code is behind #if 
(NETFRAMEWORK) compile blocks - and disable Span in FlatBuffers
 # Wait for a System.Memory fix?

 

I suspect 3 won't happen anytime soon.

[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework

2019-11-01 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Description: 
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

Update 2:

This is due to a confirmed bug in System.Memory - The implications are that 
Span.CopyTo needs to be removed / replaced. 

This method is used internally by ArrowBuffer, so I can't work around this 
easily.  

Solutions
 # Change the code
 ## Remove it outright (including within FlatBuffers)
 ## Create a multi-target NuGet where the offending code is behind #if 
(NETFRAMEWORK) compile blocks - and disable Span in FlatBuffers
 # Wait for a System.Memory fix?

 

I suspect 3 won't happen anytime soon.   

 

  was:
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

Update 2: 

This is due to a confirmed bug in System.Memory - The implications are that 
Span.CopyTo needs to be removed

 


> [C#] System.Memory  Span.CopyTo - Crashes on Net Framework 
> ---
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>

[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework

2019-11-01 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Description: 
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

Update 2: 

This is due to a confirmed bug in System.Memory - The implications are that 
Span.CopyTo needs to be removed

 

  was:
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

 


> [C#] System.Memory  Span.CopyTo - Crashes on Net Framework 
> ---
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Blocker
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>   

[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework

2019-11-01 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Priority: Blocker  (was: Critical)

> [C#] System.Memory  Span.CopyTo - Crashes on Net Framework 
> ---
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Blocker
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
> Update - it seems that the problem is in the underlying 
> *ArrowBuffer.Builder*
> {code:java}
>  public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>                 {
>                     ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
>                 }
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7040) [C#] System.Memory Span.CopyTo - Crashes on Net Framework

2019-11-01 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Summary: [C#] System.Memory  Span.CopyTo - Crashes on Net Framework   (was: 
[C#] ArrowBuffer.Append - Crashes )

> [C#] System.Memory  Span.CopyTo - Crashes on Net Framework 
> ---
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
> Update - it seems that the problem is in the underlying 
> *ArrowBuffer.Builder*
> {code:java}
>  public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>                 {
>                     ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
>                 }
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes

2019-10-31 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964387#comment-16964387
 ] 

Anthony Abate commented on ARROW-7040:
--

this might not be an arrow issue - it might be an issue in the System.Memory 
code - I reported a bug here:  [https://github.com/dotnet/corefx/issues/42276]


It is still an open issue for us, though, because the current array builder 
implementation in Arrow crashes when using strings and many threads. 
I am considering creating a separate builder for strings that internally uses 
byte[] instead of Spans to see if that makes the problem go away.

 

> [C#] ArrowBuffer.Append - Crashes 
> -
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
> Update - it seems that the problem is in the underlying 
> *ArrowBuffer.Builder*
> {code:java}
>  public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>                 {
>                     ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
>                 }
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes

2019-10-31 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Description: 
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

 

  was:
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

 


> [C#] ArrowBuffer.Append - Crashes 
> -
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>  

[jira] [Updated] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes

2019-10-31 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Description: 
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying *ArrowBuffer.Builder*
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

 

  was:
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying ArrowBuffer.Builder
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

 


> [C#] ArrowBuffer.Append - Crashes 
> -
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await 

[jira] [Issue Comment Deleted] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes

2019-10-31 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Comment: was deleted

(was: interesting - BinaryArrayBuilder does not crash if using 
*AppendRange(IEnumerable<string> values)*

StringArrayBuilder uses *BinaryArrayBuilder.Append(ReadOnlySpan<byte> span)*... 



I tried forwarding StringArrayBuilder to 
*BinaryArrayBuilder.AppendRange(IEnumerable<string>)* but the problem also 
occurs...

 

 )

> [C#] ArrowBuffer.Append - Crashes 
> -
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
> Update - it seems that the problem is in the underlying 
> *ArrowBuffer.Builder*
> {code:java}
>  public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>                 {
>                     ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
>                 }
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> }{code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7040) [C#] ArrowBuffer.Append - Crashes

2019-10-31 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Description: 
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

Update - it seems that the problem is in the underlying ArrowBuffer.Builder
{code:java}
public async Task ValueBuffer_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
                foreach (var d in data)
                {
                    ValueBuffer.Append(Encoding.UTF8.GetBytes(d));
                }
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
}{code}
 

 

  was:
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

 

Summary: [C#] ArrowBuffer.Append - Crashes   (was: [C#] 
StringArrayBuilder.AppendRange - Crashes )

> [C#] ArrowBuffer.Append - Crashes 
> -
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
> Update - it seems that the problem is in the underlying ArrowBuffer.Builder
> {code:java}
>  public async Task ValueBuffer_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 ArrowBuffer.Builder<byte> ValueBuffer = new ArrowBuffer.Builder<byte>();
>                 foreach (var d in data)
>

[jira] [Comment Edited] (ARROW-7040) [C#] StringArrayBuilder.AppendRange - Crashes

2019-10-31 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964199#comment-16964199
 ] 

Anthony Abate edited comment on ARROW-7040 at 10/31/19 5:28 PM:


interesting - BinaryArrayBuilder does not crash if using 
*AppendRange(IEnumerable<string> values)*

StringArrayBuilder uses *BinaryArrayBuilder.Append(ReadOnlySpan<byte> span)*... 



I tried forwarding StringArrayBuilder to 
*BinaryArrayBuilder.AppendRange(IEnumerable<string>)* but the problem also 
occurs...

 

 


was (Author: abbot):
interesting - BinaryArrayBuilder does not crash if using 
*AppendRange(IEnumerable<string> values)*

StringArrayBuilder uses *BinaryArrayBuilder.Append(ReadOnlySpan<byte> span)*... 
I think I found the problem - if it works, I will submit a pull request

 

 

> [C#] StringArrayBuilder.AppendRange - Crashes 
> --
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7040) [C#] StringArrayBuilder.AppendRange - Crashes

2019-10-31 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964199#comment-16964199
 ] 

Anthony Abate commented on ARROW-7040:
--

interesting - BinaryArrayBuilder does not crash if using 
*AppendRange(IEnumerable<string> values)*

StringArrayBuilder uses *BinaryArrayBuilder.Append(ReadOnlySpan<byte> span)*... 
I think I found the problem - if it works, I will submit a pull request

 

 

> [C#] StringArrayBuilder.AppendRange - Crashes 
> --
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7040) StringArrayBuilder.AppendRange - Crashes

2019-10-31 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Description: 
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

 

  was:
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

 


> StringArrayBuilder.AppendRange - Crashes 
> -
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7040) StringArrayBuilder.AppendRange - Crashes

2019-10-31 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-7040:
-
Description: 
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }

    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays.  (ie IntArrayBuilder)

I suspect it is due to the offset array / and all the copy / resizing going on

 

 

  was:
The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays. 

 

I suspect it is due to the offset array / and all the copy / resizing going on

 


> StringArrayBuilder.AppendRange - Crashes 
> -
>
> Key: ARROW-7040
> URL: https://issues.apache.org/jira/browse/ARROW-7040
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1, 0.15.0
>Reporter: Anthony Abate
>Priority: Critical
>
> The following code crashes on 8 cores.
> {code:java}
> public async Task StringArrayBuilder_StressTest()
> {
>     var wait = new List<Task>();
>     for (int i = 0; i < 30; ++i)
>     {
>         var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
>         var t = Task.Run(() =>
>         {
>             for (int j = 0; j < 1000; ++j)
>             {
>                 var builder = new StringArray.Builder();
>                 builder.AppendRange(data);
>             }
>         });
>         wait.Add(t);
>     }
>     await Task.WhenAll(wait);
> } {code}
>  
> It does not happen with the primitive arrays.  (ie IntArrayBuilder)
> I suspect it is due to the offset array / and all the copy / resizing going on
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-7040) StringArrayBuilder.AppendRange - Crashes

2019-10-31 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-7040:


 Summary: StringArrayBuilder.AppendRange - Crashes 
 Key: ARROW-7040
 URL: https://issues.apache.org/jira/browse/ARROW-7040
 Project: Apache Arrow
  Issue Type: Bug
  Components: C#
Affects Versions: 0.15.0, 0.14.1
Reporter: Anthony Abate


The following code crashes on 8 cores.
{code:java}
public async Task StringArrayBuilder_StressTest()
{
    var wait = new List<Task>();
    for (int i = 0; i < 30; ++i)
    {
        var data = Enumerable.Range(0, 1000).Select(x => $"Item {x + 1}").ToArray();
        var t = Task.Run(() =>
        {
            for (int j = 0; j < 1000; ++j)
            {
                var builder = new StringArray.Builder();
                builder.AppendRange(data);
            }
        });
        wait.Add(t);
    }
    await Task.WhenAll(wait);
} {code}
 

It does not happen with the primitive arrays. 

 

I suspect it is due to the offset array / and all the copy / resizing going on

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948748#comment-16948748
 ] 

Anthony Abate edited comment on ARROW-6830 at 10/10/19 4:44 PM:


Yes - my original question is about slicing the arrow file to reduce columns - 
whether it be via read_arrow, read_table, or RecordBatchFileReader 

 

so far the answer seems to be the following:

|| method || status ||
|read_arrow | unsupported|
|read_table | supported, but uses lots of memory|
|RecordBatchFileReader  | manually possible via the code I provided, but slow|

 


was (Author: abbot):
Yes - my original question is about slicing the arrow file to reduce columns - 
whether it be via read_arrow, read_table, or RecordBatchFileReader 

 

so far the answer seems to be the following:

read_arrow - unsupported

read_table - supported, but uses lots of memory

RecordBatchFileReader  - manually possible via the code I provided, but slow

 

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948751#comment-16948751
 ] 

Anthony Abate commented on ARROW-6830:
--

{quote}You can filter each record batch separately (using {{[}} methods or 
lower level if you prefer) and collect them all into a data.frame.
{quote}
 

This is what I am doing - is there a better way, so I can do multiple columns in 
a single pass?
{code:java}
rbn <- data_rbfr$get_batch(i)
df <- data.frame(
  rbn$column(5)$as_vector(),   rbn$column(6)$as_vector(),   rbn$column(100)$as_vector(), rbn$column(687)$as_vector(),
  rbn$column(444)$as_vector(), rbn$column(36)$as_vector(),  rbn$column(500)$as_vector(), rbn$column(897)$as_vector(),
  rbn$column(24)$as_vector(),  rbn$column(446)$as_vector(), rbn$column(777)$as_vector(), rbn$column(333)$as_vector(),
  rbn$column(96)$as_vector(),  rbn$column(555)$as_vector(), rbn$column(888)$as_vector(), rbn$column(222)$as_vector()
) {code}
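
A possibly tidier variant of the same single-pass extraction, offered only as a sketch: it assumes the 0.14-era R bindings used above, and the {{cols}} vector is an illustrative stand-in for whatever column indices are actually needed.
{code:java}
# Sketch: build one data.frame per batch from a vector of column indices,
# then bind the per-batch frames together in a single pass over the file.
cols <- c(5, 6, 100, 687)  # illustrative indices

batch_to_df <- function(rbn) {
  as.data.frame(setNames(
    lapply(cols, function(j) rbn$column(j)$as_vector()),
    paste0("col", cols)
  ))
}

# get_batch() indexing follows the zero-based convention of the snippets above.
parts <- lapply(seq_len(data_rbfr$num_record_batches) - 1,
                function(i) batch_to_df(data_rbfr$get_batch(i)))
merged <- do.call(rbind, parts)
{code}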
 

 

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948748#comment-16948748
 ] 

Anthony Abate edited comment on ARROW-6830 at 10/10/19 4:31 PM:


Yes - my original question is about slicing the arrow file to reduce columns - 
whether it be via read_arrow, read_table, or RecordBatchFileReader 

 

so far the answer seems to be the following:

read_arrow - unsupported

read_table - supported, but uses lots of memory

RecordBatchFileReader  - manually possible via the code I provided, but slow

 


was (Author: abbot):
Yes - my original question is about slicing the arrow file to reduce columns - 
whether it be via read_arrow, read_table, or RecordBatchFileReader 

 

so far the answer seems to be the following:

read_arrow - unsupported

read_table - supported, but uses lots of memory

RecordBatchFileReader  - supported, but slow

 

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948748#comment-16948748
 ] 

Anthony Abate edited comment on ARROW-6830 at 10/10/19 4:30 PM:


Yes - my original question is about slicing the arrow file to reduce columns - 
whether it be via read_arrow, read_table, or RecordBatchFileReader 

 

so far the answer seems to be the following:

read_arrow - unsupported

read_table - supported, but uses lots of memory

RecordBatchFileReader  - supported, but slow

 


was (Author: abbot):
Yes - my original question is about slicing the arrow file to reduce columns - 
whether it be via read_arrow, read_table, or RecordBatchFileReader 

 

so far my answer is as follow:

read_arrow - unsupported

read_table - supported, but uses lots of memory

RecordBatchFileReader  - 

 

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948748#comment-16948748
 ] 

Anthony Abate commented on ARROW-6830:
--

Yes - my original question is about slicing the arrow file to reduce columns - 
whether it be via read_arrow, read_table, or RecordBatchFileReader 

 

so far my answer is as follows:

read_arrow - unsupported

read_table - supported, but uses lots of memory

RecordBatchFileReader  - 

 

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948744#comment-16948744
 ] 

Anthony Abate commented on ARROW-6830:
--

From my initial testing, read_table seems to be no better than read_arrow when 
it comes to memory usage, and appears to load the entire file:
{code:java}
tab <- read_table("bigfile.arrow")
nrow(tab)  # uses 30 gigs!

{code}

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948713#comment-16948713
 ] 

Anthony Abate edited comment on ARROW-6830 at 10/10/19 3:48 PM:


I was using *RecordBatchFileReader* since it seemed to be the only way to limit 
memory usage (I thought *read_arrow* was my only alternative).  We are 
indexing our data by record batch, so we could be more efficient in filtering by 
passing the batch ids into the RecordBatchFileReader to avoid a 'full table 
scan'.

FYI - it was not clear to me from the name that *read_table* has anything to do 
with arrow files.

Is read_table aware of underlying record batches so rows can be filtered out 
more efficiently?
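
To make the batch-level indexing concrete, a minimal sketch (assuming the 0.14-era R bindings from the snippets above; {{batch_ids}} is a hypothetical, externally maintained index of the batches that contain the rows of interest):
{code:java}
# Sketch: read only a known subset of record batches instead of
# scanning the whole file; indexing follows the earlier snippets.
data_rbfr <- arrow::RecordBatchFileReader("bigfile.arrow")
batch_ids <- c(12, 750, 9001)  # hypothetical pre-computed index

parts <- lapply(batch_ids, function(i) {
  rbn <- data_rbfr$get_batch(i)
  data.frame(col5 = rbn$column(5)$as_vector())
})
merged <- do.call(rbind, parts)
{code}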

 


was (Author: abbot):
I was using *RecordBatchFileReader* since it seemed to be the only to limit 
memory usage (I thought *read_arrow* was my only alternative)    We are 
effectively indexing our data by record batch so we could be more efficient in 
filtering and would want to pass down to avoid a 'full table scan' 

FYI - It was not clear to me from the name that *read_table* has anything to do 
with arrow files. 

Is read_table aware of underlying record batches so rows can be filtered out 
more effeciently?

 

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-10 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948713#comment-16948713
 ] 

Anthony Abate commented on ARROW-6830:
--

I was using *RecordBatchFileReader* since it seemed to be the only way to limit 
memory usage (I thought *read_arrow* was my only alternative).  We are 
effectively indexing our data by record batch, so we could be more efficient in 
filtering and would want to pass that down to avoid a 'full table scan'.

FYI - it was not clear to me from the name that *read_table* has anything to do 
with arrow files.

Is read_table aware of underlying record batches so rows can be filtered out 
more efficiently?

 

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6830:
-
Description: 
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a (col_select =... )

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:
{code:java}
for(i in 0:data_rbfr$num_record_batches) {
rbn <- data_rbfr$get_batch(i)
  
  if (i == 0) 
  {
merged <- as.data.frame(rbn$column(5)$as_vector())
  }
  else 
  {
dfn <- as.data.frame(rbn$column(5)$as_vector())
merged <- rbind(merged,dfn)
  }

  print(paste(i, nrow(merged)))
} {code}
 

 

  was:
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a (col_select =... )

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:


{{for(i in 0:data_rbfr$num_record_batches) {}}
{{ rbn <- data_rbfr$get_batch(i)}}
 
{{ if (i == 0) }}
{{ {}}
{{ merged <- as.data.frame(rbn$column(5)$as_vector())}}
{{ }}}
{{ else }}
{{ {}}
{{ dfn <- as.data.frame(rbn$column(5)$as_vector())}}
{{ merged <- rbind(merged,dfn)}}
{{ }}}
 
{{ print(paste(i, nrow(merged)))}}
{{}}}

 

 


> Question / Feature Request- Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
> rbn <- data_rbfr$get_batch(i)
>   
>   if (i == 0) 
>   {
> merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else 
>   {
> dfn <- as.data.frame(rbn$column(5)$as_vector())
> merged <- rbind(merged,dfn)
>   }
> 
>   print(paste(i, nrow(merged)))
> } {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6830:
-
Description: 
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a (col_select =... )

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:


{{for(i in 0:data_rbfr$num_record_batches) {}}
{{ rbn <- data_rbfr$get_batch(i)}}
 
{{ if (i == 0) }}
{{ {}}
{{ merged <- as.data.frame(rbn$column(5)$as_vector())}}
{{ }}}
{{ else }}
{{ {}}
{{ dfn <- as.data.frame(rbn$column(5)$as_vector())}}
{{ merged <- rbind(merged,dfn)}}
{{ }}}
 
{{ print(paste(i, nrow(merged)))}}
{{}}}

 

 

  was:
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a (col_select =... )

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:

{{data_rbfr <- arrow::RecordBatchFileReader("arrowfile")}}

{{for(i in 0:data_rbfr$num_record_batches) {}}
{{  rbn <- data_rbfr$get_batch(i)}}
{{  if (i == 0) }}
{{ {}}
{{   merged <- as.data.frame(rbn$column(5)$as_vector())}}
{{ }}}
{{ else }}
{{ {}}
{{   dfn <- as.data.frame(rbn$column(5)$as_vector())}}
{{   merged <- rbind(merged,dfn)}}
{{ }}}
{{ }}}

 


> Question / Feature Request- Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {{for(i in 0:data_rbfr$num_record_batches) {}}
> {{ rbn <- data_rbfr$get_batch(i)}}
>  
> {{ if (i == 0) }}
> {{ {}}
> {{ merged <- as.data.frame(rbn$column(5)$as_vector())}}
> {{ }}}
> {{ else }}
> {{ {}}
> {{ dfn <- as.data.frame(rbn$column(5)$as_vector())}}
> {{ merged <- rbind(merged,dfn)}}
> {{ }}}
>  
> {{ print(paste(i, nrow(merged)))}}
> {{}}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6830:
-
Description: 
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a (col_select =... )

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:

{{data_rbfr <- arrow::RecordBatchFileReader("arrowfile")}}

{{for(i in 0:data_rbfr$num_record_batches) {}}
{{  rbn <- data_rbfr$get_batch(i)}}
{{  if (i == 0) }}
{{ {}}
{{   merged <- as.data.frame(rbn$column(5)$as_vector())}}
{{ }}}
{{ else }}
{{ {}}
{{   dfn <- as.data.frame(rbn$column(5)$as_vector())}}
{{   merged <- rbind(merged,dfn)}}
{{ }}}
{{ }}}

 

  was:
*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
how *read_feather* has a (col_select =... )

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually.  ie like the following:

data_rbfr <- arrow::RecordBatchFileReader("arrowfile")

FOREACH BATCH:
 batch <- data_rbfr$get_batch(i) 
col4 <- batch$column(4)
 col5 <- batch$column(7)

 


> Question / Feature Request- Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible at to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has a (col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually.  ie like the following:
> {{data_rbfr <- arrow::RecordBatchFileReader("arrowfile")}}
> {{for(i in 0:data_rbfr$num_record_batches) {}}
> {{  rbn <- data_rbfr$get_batch(i)}}
> {{  if (i == 0) }}
> {{ {}}
> {{   merged <- as.data.frame(rbn$column(5)$as_vector())}}
> {{ }}}
> {{ else }}
> {{ {}}
> {{   dfn <- as.data.frame(rbn$column(5)$as_vector())}}
> {{   merged <- rbind(merged,dfn)}}
> {{ }}}
> {{ }}}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6830) Question / Feature Request- Select Subset of Columns in read_arrow

2019-10-09 Thread Anthony Abate (Jira)
Anthony Abate created ARROW-6830:


 Summary: Question / Feature Request- Select Subset of Columns in 
read_arrow
 Key: ARROW-6830
 URL: https://issues.apache.org/jira/browse/ARROW-6830
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, R
Reporter: Anthony Abate


*Note:*  Not sure if this is a limitation of the R library or the underlying 
C++ code:

I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
batches of varying row sizes

1. Is it possible to use *read_arrow* to filter out columns (similar to 
how *read_feather* has col_select = ...)?

2. Or is it possible using *RecordBatchFileReader* to filter columns?

 

The only thing I seem to be able to do (please confirm if this is my only 
option) is loop over all record batches, select a single column at a time, and 
construct the data I need to pull out manually, i.e. like the following:

data_rbfr <- arrow::RecordBatchFileReader("arrowfile")

for (i in 0:data_rbfr$num_record_batches) {   # for each batch
  batch <- data_rbfr$get_batch(i)
  col4 <- batch$column(4)
  col5 <- batch$column(7)
}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C#] Arrow R/C++ hangs reading binary file generated by C#

2019-10-01 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941893#comment-16941893
 ] 

Anthony Abate commented on ARROW-6682:
--

[~wesm] - I understand that the file generation is fixed on the C# side, but 
isn't a malformed file taking down the library another problem?

[~eerhardt] - Is there a pre-release NuGet package that I can test out?

> [C#] Arrow R/C++ hangs reading binary file generated by C#
> --
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C#
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, arrow.benchmark.r, 
> script.runner.ps1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6603) [C#] ArrayBuilder API to support writing nulls

2019-09-26 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938975#comment-16938975
 ] 

Anthony Abate commented on ARROW-6603:
--

I have a few extension methods that do this - one thing I noticed: the spec 
seems to refer to the terms NullBitmap and ValidityBitmap. I think ValidityBitmap 
might be the more correct term, since 1 = valid, whereas NullBitmap sounds like 
1 = null.  My first attempt at creating the null bitmap inverted all the values.

> [C#] ArrayBuilder API to support writing nulls
> --
>
> Key: ARROW-6603
> URL: https://issues.apache.org/jira/browse/ARROW-6603
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C#
>Reporter: Eric Erhardt
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> There is currently no API in the PrimitiveArrayBuilder class to support 
> writing nulls.  See this TODO - 
> [https://github.com/apache/arrow/blob/1515fe10c039fb6685df2e282e2e888b773caa86/csharp/src/Apache.Arrow/Arrays/PrimitiveArrayBuilder.cs#L101.]
>  
> Also see [https://github.com/apache/arrow/issues/5381].
>  
> We should add some APIs to support writing nulls.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-26 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938699#comment-16938699
 ] 

Anthony Abate commented on ARROW-6682:
--

[~npr] - setting that option may be a workaround for now.
I am not sure what the threads are doing, since there seems to be no performance 
difference - at least in the read_arrow function.
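
A hedged sketch of what that workaround might look like. Note that set_cpu_count() comes from later arrow R releases, not necessarily the 0.14 build discussed here; it is shown only to illustrate pinning the reader to a single thread:
{code:java}
# Hypothetical workaround: cap the C++ thread pool at one thread
# before reading (API from later arrow R releases).
arrow::set_cpu_count(1)
tab <- arrow::read_arrow("bigfile.arrow")
{code}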

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, arrow.benchmark.r, 
> script.runner.ps1
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938149#comment-16938149
 ] 

Anthony Abate commented on ARROW-6682:
--

It sounds like there might be more than one issue here:
 * the malformed file
 * the hanging in R

It would be troubling if a malformed file could take down / crash the library 
(i.e. a DoS exploit).

When trying to use an out-of-date C# Feather library in R, I did get some 
indication that the file was invalid: 
([https://github.com/kevin-montrose/FeatherDotNet/issues/7])

Is there a way to validate the integrity of the arrow file on open (i.e. check 
offsets, padding, etc.)?  It might be slower, but when opening a file from an 
unknown source it could be safer.

Regarding the hanging - there do seem to be some threadpool options for the 
C++ code, but I don't know how to access them from R.

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, arrow.benchmark.r, 
> script.runner.ps1
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938048#comment-16938048
 ] 

Anthony Abate commented on ARROW-6682:
--

I also uploaded the exact script file and script-loop runner.

 

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, arrow.benchmark.r, 
> script.runner.ps1
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6682:
-
Attachment: arrow.benchmark.r

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, arrow.benchmark.r, 
> script.runner.ps1
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6682:
-
Attachment: script.runner.ps1

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip, script.runner.ps1
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938047#comment-16938047
 ] 

Anthony Abate commented on ARROW-6682:
--

[~npr] - I can't reproduce the issue on a single core, but I can on two cores - 
can you try a VM with two cores?

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938043#comment-16938043
 ] 

Anthony Abate edited comment on ARROW-6682 at 9/25/19 8:24 PM:
---

The other thing I should point out: I am using a new rscript.exe process each 
time. That way I know for certain the DLL is unloaded and reinitialized 
without any static-init-related state.

this is the script runner code I am using:

 

{{(powershell script)}}

{{$rpath = "C:\Program Files\r\R-3.6.1\bin\Rscript.exe"}}
{{$rscript = "arrow.benchmark.r"}}

{{For ($i=0; $i -le 1; $i++) {}}
{{  Write-Output "run: $i"}}
{{  $stopwatch = [system.diagnostics.stopwatch]::StartNew()}}

{{  & $rpath --no-save --no-restore --verbose $rscript > c:\temp\outputFile.Rout 2>&1}}

{{  $stopwatch.Elapsed.TotalSeconds}}
{{}}}


was (Author: abbot):
the other thing I should point out:  I am using a new rscript.exe process each 
time:  That way I know for certain the dll is unloaded and reinitialized 
without any static init related code. 

this is the script runner code I am using:

 

{{(powershell script)}}

{{$rpath = "C:\Program Files\r\R-3.6.1\bin\Rscript.exe"}}
{{$rscript = "arrow.benchmark.r"}}

{{For ($i=0; $i -le 1; $i++) }}
{{{}}
{{  Write-Output "run: $i"}}
{{  $stopwatch = [system.diagnostics.stopwatch]::StartNew()}}

{{  & $rpath --no-save --no-restore --verbose $rscript > 
c:\temp\outputFile.Rout 2>&1}}

{{  $stopwatch.Elapsed.TotalSeconds}}
}

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938043#comment-16938043
 ] 

Anthony Abate commented on ARROW-6682:
--

The other thing I should point out: I am using a new rscript.exe process each 
time. That way I know for certain the DLL is unloaded and reinitialized 
without any static-init-related state.

This is the script runner code I am using:

 

{{(powershell script)}}

{{$rpath = "C:\Program Files\r\R-3.6.1\bin\Rscript.exe"}}
{{$rscript = "arrow.benchmark.r"}}

{{For ($i=0; $i -le 1; $i++) {}}
{{  Write-Output "run: $i"}}
{{  $stopwatch = [system.diagnostics.stopwatch]::StartNew()}}

{{  & $rpath --no-save --no-restore --verbose $rscript > c:\temp\outputFile.Rout 2>&1}}

{{  $stopwatch.Elapsed.TotalSeconds}}
{{}}}

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938038#comment-16938038
 ] 

Anthony Abate commented on ARROW-6682:
--

[~npr] - how many cores was your test VM using?

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Assignee: Eric Erhardt
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938034#comment-16938034
 ] 

Anthony Abate commented on ARROW-6682:
--

I can repro the problem fairly consistently - I can get more info if needed:


> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  
  LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C   LC_TIME=English_United States.1252   
 

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] arrow_0.14.1.20190925

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 bit_1.1-14   compiler_3.6.1   magrittr_1.5 
assertthat_0.2.1 R6_2.4.0 tools_3.6.1 
 [8] glue_1.3.1   Rcpp_1.0.2   bit64_0.9-7  rlang_0.4.0  
purrr_0.3.2

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938002#comment-16938002
 ] 

Anthony Abate commented on ARROW-6682:
--

If the file is bad, I'd expect the R library to fail the same as Python if they 
both use the same underlying C++ code.  However, I don't know the R / Python 
bindings / code.

I would point out that I was able to validate a 30 million row x 37 column data 
set produced by C# in R, including the null support I added.

The only indication of any issue was a very rare hang on first use of the 
library in RStudio - if it didn't hang the first time, I was able to do many 
file loads of 10 GB without issue.

I was attempting to narrow down that rare hang when it started to look like a 
column-count issue.

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937994#comment-16937994
 ] 

Anthony Abate edited comment on ARROW-6682 at 9/25/19 6:53 PM:
---

[~eerhardt] - I used the current NuGet version plus some code I wrote to add null 
support.

 

[~wesm] [~npr] - this looks/behaves like a threading issue - I don't get any 
hanging if I reduce the VM to 1 core (not ideal).

(I can't explain the core dumps though.)


was (Author: abbot):
[~eerhardt]- i used the current nuget version + some code i wrote to build null 
support

 

[~wesm] [~npr]  - this looks like a threading issue - I don't get any hanging 
if i reduce the VM to 1 core (not ideal)

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937994#comment-16937994
 ] 

Anthony Abate commented on ARROW-6682:
--

[~eerhardt] - I used the current NuGet version plus some code I wrote to add null 
support.

 

[~wesm] [~npr] - this looks like a threading issue - I don't get any hanging 
if I reduce the VM to 1 core (not ideal).

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937921#comment-16937921
 ] 

Anthony Abate commented on ARROW-6682:
--

[~wesm] - I made a zip version of the file

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6682:
-
Attachment: Generated_4000Batch_50Columns_100Rows_PerBatch.zip

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar, 
> Generated_4000Batch_50Columns_100Rows_PerBatch.zip
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937873#comment-16937873
 ] 

Anthony Abate commented on ARROW-6682:
--

[~npr] - seems like the same problem.

Other than 
install.packages("arrow", repos="https://dl.bintray.com/ursalabs/arrow-r")
do I need to do anything else to use the dev package?

Is there a version number I can print out at runtime to make sure I'm using the 
new code?
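
For the runtime check, a minimal sketch using standard R tooling (nothing arrow-specific assumed):
{code:java}
# Print the installed arrow package version to confirm which build is loaded.
packageVersion("arrow")

# Or dump the full session metadata, including attached package versions.
sessionInfo()
{code}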

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937860#comment-16937860
 ] 

Anthony Abate commented on ARROW-6682:
--

[~npr] - OK, sounds like you have no problems with the file - let me try the 
latest package and I'll let you know.

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937859#comment-16937859
 ] 

Anthony Abate commented on ARROW-6682:
--

If the file is 'bad', shouldn't that be easy to determine by examining the
attached file?
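
For example, a batch-by-batch read along these lines could pinpoint a bad
batch instead of hanging on the whole file (a sketch assuming a later arrow R
release that exposes RecordBatchFileReader):

{code:r}
library(arrow)

# Hypothetical validation sketch: open the attached file and read each
# record batch individually so a corrupt batch can be identified
reader <- RecordBatchFileReader$create(
  "Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")
for (i in seq_len(reader$num_record_batches)) {
  batch <- reader$get_batch(i - 1)  # get_batch() is zero-indexed
  cat("batch", i, "rows:", batch$num_rows, "\n")
}
{code}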

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937855#comment-16937855
 ] 

Anthony Abate commented on ARROW-6682:
--

[~npr] - how many times did you try to load it? I get it to fail 4 out of
every 5 times.
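
For reference, this is roughly how I reproduce the intermittent hang - a
sketch that just re-reads the same file in a loop (the path is from my
machine):

{code:r}
library(arrow)

# Attempt the same read several times; most attempts hang here
for (i in 1:5) {
  cat("attempt", i, "...\n")
  df <- read_arrow("e:\\Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")
  cat("attempt", i, "finished with", nrow(df), "rows\n")
}
{code}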

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937853#comment-16937853
 ] 

Anthony Abate commented on ARROW-6682:
--

[~wesm] - Also, I can generate many other files from the C# libraries that
have no problems being loaded.

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) [C++][R] Arrow Hangs reading binary file generated by C#

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937839#comment-16937839
 ] 

Anthony Abate commented on ARROW-6682:
--

It loads 'sometimes' - so it sounds like a threading issue?
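
If it is threading, one quick check (assuming a later arrow release where
set_cpu_count() is exposed in R) would be to force a single-threaded read and
see whether the hang disappears:

{code:r}
library(arrow)

# Hypothetical check: shrink arrow's CPU thread pool to a single thread
# (set_cpu_count() exists in later arrow R releases, not necessarily 0.14.1)
set_cpu_count(1)
df <- read_arrow("e:\\Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")
{code}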

> [C++][R] Arrow Hangs reading binary file generated by C#
> 
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937833#comment-16937833
 ] 

Anthony Abate commented on ARROW-6682:
--

Code above (it's trivial).

System: 8 virtual cores, 64 GB RAM, Windows 10

> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937831#comment-16937831
 ] 

Anthony Abate commented on ARROW-6682:
--


# Record wall-clock time and memory before the read
# (memory.size() is Windows-only)
start_time <- Sys.time()
start_memory <- memory.size()

library(arrow)

dfcs <- read_arrow("e:\\Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")

# Measure again after the read and report the deltas
end_memory <- memory.size()
end_time <- Sys.time()

print(end_memory)
end_time - start_time
end_memory - start_memory

> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937831#comment-16937831
 ] 

Anthony Abate edited comment on ARROW-6682 at 9/25/19 3:20 PM:
---

start_time <- Sys.time()
start_memory <- memory.size()

library(arrow)

dfcs <- read_arrow("e:\\Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")

end_memory <- memory.size()
end_time <- Sys.time()

print(end_memory)
end_time - start_time
end_memory - start_memory


was (Author: abbot):

start_time <- Sys.time()
start_memory <- memory.size()

library(arrow)

dfcs <- read_arrow("e:\\Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")


end_memory <- memory.size()
end_time <- Sys.time()

print(end_memory)
end_time - start_time
end_memory - start_memory

> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Major
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937820#comment-16937820
 ] 

Anthony Abate edited comment on ARROW-6682 at 9/25/19 3:13 PM:
---

I have a 150 MB file of random data that I generated (with the C# library);
it has 50 columns and it hangs on (almost) every load!


was (Author: abbot):
I have a 150 meg file that i generated (with the C# library) random data and it 
has 50 columns and it hangs on every load!

> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Blocker
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6682:
-
Description: 
I get random hangs on arrow_read in R (windows) when using a very large file 
(10-12gb). (the file has 37 columns)

I have memory dumps - All threads seem to be in wait handles.

Are there debug symbols somewhere? 

Is there a way to get the C++ code to produce diagnostic logging from R? 

 

*UPDATE:* it seems that the hangs are not related to file size, row counts, or 
# of record batches, but rather the number of *columns*

  was:
I get random hangs on arrow_read in R (windows) when using a very large file 
(10-12gb). (the file has 37 columns)

I have memory dumps - All threads seem to be in wait handles.

Are there debug symbols somewhere? 

Is there a way to get the C++ code to produce diagnostic logging from R? 

 

it seems that the hangs are not related to file size, row counts, or # of 
record batches, but rather the number of *columns*


> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Blocker
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> *UPDATE:* it seems that the hangs are not related to file size, row counts, 
> or # of record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Anthony Abate (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anthony Abate updated ARROW-6682:
-
Description: 
I get random hangs on arrow_read in R (windows) when using a very large file 
(10-12gb). (the file has 37 columns)

I have memory dumps - All threads seem to be in wait handles.

Are there debug symbols somewhere? 

Is there a way to get the C++ code to produce diagnostic logging from R? 

 

it seems that the hangs are not related to file size, row counts, or # of 
record batches, but rather the number of *columns*

  was:
I get random hangs on arrow_read in R (windows) when using a very large file 
(10-12gb). 

I have memory dumps - All threads seem to be in wait handles.

Are there debug symbols somewhere? 

Is there a way to get the C++ code to produce diagnostic logging from R? 

 

it seems that the hangs are not related to file size, row counts, or # of 
record batches, but rather the number of *columns*


> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Blocker
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). (the file has 37 columns)
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> it seems that the hangs are not related to file size, row counts, or # of 
> record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6682) Arrow Hangs on Large # of Columns (30+)

2019-09-25 Thread Anthony Abate (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937822#comment-16937822
 ] 

Anthony Abate commented on ARROW-6682:
--

See the attached file

> Arrow Hangs on Large # of Columns (30+)
> ---
>
> Key: ARROW-6682
> URL: https://issues.apache.org/jira/browse/ARROW-6682
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 0.14.1
>Reporter: Anthony Abate
>Priority: Blocker
> Attachments: Generated_4000Batch_50Columns_100Rows_PerBatch.rar
>
>
> I get random hangs on arrow_read in R (windows) when using a very large file 
> (10-12gb). 
> I have memory dumps - All threads seem to be in wait handles.
> Are there debug symbols somewhere? 
> Is there a way to get the C++ code to produce diagnostic logging from R? 
>  
> it seems that the hangs are not related to file size, row counts, or # of 
> record batches, but rather the number of *columns*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

