pitrou commented on a change in pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#discussion_r443707259
########## File path: python/pyarrow/tensor.pxi ##########

@@ -339,6 +350,15 @@ shape: {0.shape}""".format(self)

     def non_zero_length(self):
         return self.stp.non_zero_length()
+
+    @property
+    def has_canonical_format(self):
+        cdef:
+            _CSparseCOOIndexPtr csi
+
+        csi = dynamic_cast[_CSparseCOOIndexPtr](self.stp.sparse_index().get())

Review comment:
Why the `dynamic_cast`? Since this is a sparse COO tensor, the cast should always succeed. Just use a normal Cython cast.

########## File path: format/SparseTensor.fbs ##########

@@ -63,6 +63,9 @@ table SparseTensorIndexCOO {

   /// The location and size of the indices matrix's data
   indicesBuffer: Buffer (required);
+
+  /// The canonicality flag

Review comment:
What is this? Can you explain a bit more?

########## File path: python/pyarrow/tensor.pxi ##########

@@ -270,8 +279,10 @@ shape: {0.shape}""".format(self)

                                      &out_data, &out_coords))
         data = PyObject_to_object(out_data)
         coords = PyObject_to_object(out_coords)
-        result = coo_matrix((data[:, 0], (coords[:, 0], coords[:, 1])),
-                            shape=self.shape)
+        row, col = coords[:, 0], coords[:, 1]
+        result = coo_matrix((data[:, 0], (row, col)), shape=self.shape)
+        if self.has_canonical_format:
+            result.sum_duplicates()

Review comment:
Why?

########## File path: cpp/src/arrow/sparse_tensor_test.cc ##########

@@ -49,7 +49,10 @@ static inline void AssertCOOIndex(const std::shared_ptr<Tensor>& sidx, const int

   }
 }

-TEST(TestSparseCOOIndex, Make) {
+//-----------------------------------------------------------------------------

Review comment:
Perhaps there should also be some tests with empty/all-zero tensors?

########## File path: python/pyarrow/tensor.pxi ##########

@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)

         for x in dim_names:
             c_dim_names.push_back(tobytes(x))

-        coords = np.vstack([obj.row, obj.col]).T
+        row = obj.row
+        col = obj.col
+        if obj.has_canonical_format:
+            order = np.lexsort((col, row))  # row-major order

Review comment:
I don't understand this. If Scipy says the indices are sorted, why do you need to sort them again?

########## File path: cpp/src/arrow/sparse_tensor.cc ##########

@@ -298,16 +298,137 @@ inline Status CheckSparseCOOIndexValidity(const std::shared_ptr<DataType>& type,

   return Status::OK();
 }

+void get_coo_index_tensor_row(const std::shared_ptr<Tensor>& coords, const int64_t row,
+                              std::vector<int64_t>* out_index) {
+  const auto& fw_index_value_type =
+      internal::checked_cast<const FixedWidthType&>(*coords->type());
+  const size_t indices_elsize = fw_index_value_type.bit_width() / CHAR_BIT;
+
+  const auto& shape = coords->shape();
+  const int64_t non_zero_length = shape[0];
+  DCHECK(0 <= row && row < non_zero_length);
+
+  const int64_t ndim = shape[1];
+  std::vector<int64_t> index;
+  index.reserve(ndim);

Review comment:
You can write directly into `out_index`:

```c++
out_index->resize(ndim);
// ...
for (int64_t i = 0; i < ndim; ++i) {
  (*out_index)[i] = static_cast<int64_t>(coords->Value<UInt8Type>({row, i}));
}
```

########## File path: python/pyarrow/tensor.pxi ##########

@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)

         for x in dim_names:
             c_dim_names.push_back(tobytes(x))

-        coords = np.vstack([obj.row, obj.col]).T
+        row = obj.row
+        col = obj.col
+        if obj.has_canonical_format:

Review comment:
Does this attribute exist? I don't see it in https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix

In https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix, I see an attribute named `has_sorted_indices`.

########## File path: cpp/src/arrow/sparse_tensor.cc ##########

@@ -298,16 +298,137 @@ inline Status CheckSparseCOOIndexValidity(const std::shared_ptr<DataType>& type,

+void get_coo_index_tensor_row(const std::shared_ptr<Tensor>& coords, const int64_t row,

Review comment:
Can you follow the style guide? Use CamelCase for function names.
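For context on the `has_canonical_format` question: scipy's `coo_matrix` instances do carry a `has_canonical_format` attribute even though the `coo_matrix` reference page doesn't list it; `sum_duplicates()` sorts the indices row-major, merges duplicate coordinates by summing their values, and then sets the flag. A minimal sketch of that behavior (independent of the pyarrow code under review):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Two entries share the coordinate (0, 1), so the matrix starts out
# in non-canonical form (duplicates present, order unspecified).
data = np.array([1.0, 2.0, 3.0])
row = np.array([0, 1, 0])
col = np.array([1, 0, 1])
m = coo_matrix((data, (row, col)), shape=(2, 2))
print(m.has_canonical_format)  # False: duplicates not yet merged

# sum_duplicates() merges the two (0, 1) entries (1.0 + 3.0 = 4.0),
# sorts the remaining indices row-major, and flips the flag.
m.sum_duplicates()
print(m.has_canonical_format)  # True
print(m.toarray())             # [[0. 4.], [2. 0.]]
```

This also illustrates why the `if self.has_canonical_format: result.sum_duplicates()` line drew a "Why?": calling `sum_duplicates()` is how a matrix *becomes* canonical, so guarding it on the flag already being set looks inverted.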
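On the `np.lexsort((col, row))` line: `np.lexsort` treats its *last* key as the primary sort key, so `(col, row)` yields a permutation ordering the coordinate pairs by row first, then by column, i.e. row-major order. This sketch only illustrates NumPy's lexsort semantics; whether the re-sort is needed at all when scipy already reports sorted indices is exactly the reviewer's question.

```python
import numpy as np

# Coordinate pairs in arbitrary order: (1, 0), (0, 1), (1, 1), (0, 0).
row = np.array([1, 0, 1, 0])
col = np.array([0, 1, 1, 0])

# np.lexsort sorts by the LAST key first, so (col, row) means
# "primary key: row, secondary key: col" -- row-major order.
order = np.lexsort((col, row))
print(row[order])  # [0 0 1 1]
print(col[order])  # [0 1 0 1]
```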
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org