Rares Vernica created ARROW-1676:
------------------------------------

             Summary: [C++] Feather inserts 0 in the beginning and trims one 
value at the end
                 Key: ARROW-1676
                 URL: https://issues.apache.org/jira/browse/ARROW-1676
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
    Affects Versions: 0.7.1
         Environment: libarrow-dev
Architecture: amd64
Version: 0.7.1-1

Python 2.7.13
>>> pyarrow.__version__
'0.7.1'
>>> feather.__version__
'0.4.0'
>>> pandas.__version__
u'0.20.3'
            Reporter: Rares Vernica


An extra {{0}} is inserted at the beginning when an array with more than 
{{128}} values and at least one {{NULL}} value is serialized and deserialized 
using {{Feather}}. The inserted {{0}} shifts the remaining values by one 
position, so one value is trimmed at the end.

Here is the C++ code to write such an array:

{code:java}
#include <iostream>
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/feather.h>
#include <arrow/pretty_print.h>

int main() {
  // 1. Build Array
  arrow::DoubleBuilder builder;
  for (int i = 0; i < 129; i++)
      if (i == 0)
          builder.AppendNull();
      else
          builder.Append(i);

  std::shared_ptr<arrow::Array> array;
  builder.Finish(&array);

  arrow::PrettyPrint(*array, 0, &std::cout);
  std::cout << std::endl;

  // 2. Write to Feather file
  std::shared_ptr<arrow::io::FileOutputStream> stream;
  arrow::io::FileOutputStream::Open("out.f", false, &stream);

  std::unique_ptr<arrow::ipc::feather::TableWriter> writer;
  arrow::ipc::feather::TableWriter::Open(stream, &writer);

  writer->SetNumRows(129);
  writer->Append("id", *array);

  writer->Finalize();
  stream->Close();

  return 0;
}
{code}

The output of running this code is:

{code:java}
# g++-4.9 -std=c++11 example.cpp -larrow && ./a.out
[null, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 
81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 
100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 
116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]
{code}

When the file is deserialized in Python, the array looks like this:
 
{code:java}
>>> pandas.read_feather('out.f')
        id
0      NaN
1      0.0
2      1.0
3      2.0
4      3.0
5      4.0
6      5.0
7      6.0
8      7.0
9      8.0
10     9.0
11    10.0
12    11.0
13    12.0
14    13.0
15    14.0
16    15.0
17    16.0
18    17.0
19    18.0
20    19.0
21    20.0
22    21.0
23    22.0
24    23.0
25    24.0
26    25.0
27    26.0
28    27.0
29    28.0
..     ...
99    98.0
100   99.0
101  100.0
102  101.0
103  102.0
104  103.0
105  104.0
106  105.0
107  106.0
108  107.0
109  108.0
110  109.0
111  110.0
112  111.0
113  112.0
114  113.0
115  114.0
116  115.0
117  116.0
118  117.0
119  118.0
120  119.0
121  120.0
122  121.0
123  122.0
124  123.0
125  124.0
126  125.0
127  126.0
128  127.0

[129 rows x 1 columns]
{code}

Notice the {{0.0}} value at index {{1}}; it should be {{1.0}}. Every 
subsequent value is shifted by one position in the same way, so the last value 
is {{127.0}} instead of {{128.0}}.
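
The shift can be made explicit with a small comparison. The following is only an illustrative sketch in plain Python, not part of the original reproducer: it rebuilds the written and read-back columns from the two listings above instead of reading {{out.f}}, and it assumes that the rows elided in the pandas output (30 to 98) follow the same pattern as the visible ones. The helper name {{mismatches}} is made up for this example.

```python
import math

N = 129  # number of rows written by the C++ reproducer

# Values printed by the C++ program before writing: null, then 1..128.
expected = [None] + [float(i) for i in range(1, N)]

# Values shown by pandas.read_feather: NaN, then 0.0..127.0.
# ASSUMPTION: rows 30-98, elided in the listing, continue the visible pattern.
observed = [math.nan] + [float(i - 1) for i in range(1, N)]


def mismatches(expected, observed):
    """Return the indexes where the read-back value differs from the written one.

    A written null (None) is treated as matching a read-back NaN.
    """
    bad = []
    for i, (e, o) in enumerate(zip(expected, observed)):
        e_null = e is None
        o_null = isinstance(o, float) and math.isnan(o)
        if e_null != o_null or (not e_null and e != o):
            bad.append(i)
    return bad


bad = mismatches(expected, observed)
print(len(bad), bad[0], bad[-1])  # prints: 128 1 128

# The read-back values equal the written non-null values shifted right by
# one slot: 0.0 fills the gap at index 1 and the final 128.0 is dropped.
shifted = [0.0] + expected[1:-1]
print(observed[1:] == shifted)  # prints: True
```

Under that assumption, only the leading null round-trips correctly; all 128 non-null values differ from what was written.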




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
