Rares Vernica created ARROW-1676:
------------------------------------

Summary: [C++] Feather inserts 0 in the beginning and trims one value at the end
Key: ARROW-1676
URL: https://issues.apache.org/jira/browse/ARROW-1676
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 0.7.1
Environment: libarrow-dev Architecture: amd64 Version: 0.7.1-1
Python 2.7.13
>>> pyarrow.__version__
'0.7.1'
>>> feather.__version__
'0.4.0'
>>> pandas.__version__
u'0.20.3'
Reporter: Rares Vernica

An extra {{0}} appears at the beginning when serializing and deserializing an array with more than {{128}} values and at least one {{NULL}} value using {{Feather}}. With the extra {{0}} inserted, one value is trimmed at the end.

Here is the C++ code to write such an array:

{code:java}
#include <iostream>

#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/feather.h>
#include <arrow/pretty_print.h>

int main() {
  // 1. Build Array
  arrow::DoubleBuilder builder;
  for (int i = 0; i < 129; i++)
    if (i == 0)
      builder.AppendNull();
    else
      builder.Append(i);
  std::shared_ptr<arrow::Array> array;
  builder.Finish(&array);
  arrow::PrettyPrint(*array, 0, &std::cout);
  std::cout << std::endl;

  // 2. Write to Feather file
  std::shared_ptr<arrow::io::FileOutputStream> stream;
  arrow::io::FileOutputStream::Open("out.f", false, &stream);
  std::unique_ptr<arrow::ipc::feather::TableWriter> writer;
  arrow::ipc::feather::TableWriter::Open(stream, &writer);
  writer->SetNumRows(129);
  writer->Append("id", *array);
  writer->Finalize();
  stream->Close();

  return 0;
}
{code}

The output of running this code is:

{code:java}
# g++-4.9 -std=c++11 example.cpp -larrow && ./a.out
[null, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128]
{code}

The array is deserialized in Python and looks like this:

{code:java}
>>> pandas.read_feather('out.f')
        id
0      NaN
1      0.0
2      1.0
3      2.0
4      3.0
5      4.0
6      5.0
7      6.0
8      7.0
9      8.0
10     9.0
11    10.0
12    11.0
13    12.0
14    13.0
15    14.0
16    15.0
17    16.0
18    17.0
19    18.0
20    19.0
21    20.0
22    21.0
23    22.0
24    23.0
25    24.0
26    25.0
27    26.0
28    27.0
29    28.0
..     ...
99    98.0
100   99.0
101  100.0
102  101.0
103  102.0
104  103.0
105  104.0
106  105.0
107  106.0
108  107.0
109  108.0
110  109.0
111  110.0
112  111.0
113  112.0
114  113.0
115  114.0
116  115.0
117  116.0
118  117.0
119  118.0
120  119.0
121  120.0
122  121.0
123  122.0
124  123.0
125  124.0
126  125.0
127  126.0
128  127.0

[129 rows x 1 columns]
{code}

Notice the {{0.0}} value at index {{1}}. The value should have been {{1.0}}. Also, the last value is {{127.0}} instead of {{128.0}}.
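To make the size of the discrepancy concrete, here is a minimal Python sketch that compares the values written by the C++ program with the values shown by {{pandas.read_feather}}. It does not read {{out.f}}; both lists are rebuilt from the outputs quoted in this report. The names {{expected}}, {{observed}} and {{same}} are illustrative only, and rows 30 to 98, which pandas elides in its printout, are assumed to follow the same pattern as the visible rows.

```python
import math

N = 129  # number of rows written by the C++ example

# Values printed by arrow::PrettyPrint before writing: [null, 1, 2, ..., 128]
expected = [math.nan] + [float(i) for i in range(1, N)]

# Values shown by pandas.read_feather: NaN, 0.0, 1.0, ..., 127.0
# ASSUMPTION: the elided rows 30-98 continue the visible pattern.
observed = [math.nan] + [float(i) for i in range(0, N - 1)]


def same(a, b):
    """Compare two floats, treating NaN as equal to NaN."""
    return (math.isnan(a) and math.isnan(b)) or a == b


# Indices where the value read back differs from the value written.
mismatches = [i for i in range(N) if not same(expected[i], observed[i])]

# Check the description in this report: a 0.0 is inserted right after the
# null slot and the last written value (128.0) is dropped.
shifted = [expected[0], 0.0] + expected[1:-1]
is_shift = all(same(s, o) for s, o in zip(shifted, observed))

print(len(mismatches), mismatches[0], mismatches[-1], is_shift)
```

Under that assumption, every row after the null is off by one (indices 1 through 128), which matches a one-slot shift of the values relative to the validity bitmap rather than a single corrupted value.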