pitrou commented on a change in pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#discussion_r443707259



##########
File path: python/pyarrow/tensor.pxi
##########
@@ -339,6 +350,15 @@ shape: {0.shape}""".format(self)
     def non_zero_length(self):
         return self.stp.non_zero_length()
 
+    @property
+    def has_canonical_format(self):
+        cdef:
+            _CSparseCOOIndexPtr csi
+
+        csi = dynamic_cast[_CSparseCOOIndexPtr](self.stp.sparse_index().get())

Review comment:
       Why the `dynamic_cast`? Since this is a sparse COO tensor, the cast 
should always succeed. Just use a normal Cython cast.

##########
File path: format/SparseTensor.fbs
##########
@@ -63,6 +63,9 @@ table SparseTensorIndexCOO {
 
   /// The location and size of the indices matrix's data
   indicesBuffer: Buffer (required);
+
+  /// The canonicality flag

Review comment:
       What is this? Can you explain a bit more?

##########
File path: python/pyarrow/tensor.pxi
##########
@@ -270,8 +279,10 @@ shape: {0.shape}""".format(self)
                                               &out_data, &out_coords))
         data = PyObject_to_object(out_data)
         coords = PyObject_to_object(out_coords)
-        result = coo_matrix((data[:, 0], (coords[:, 0], coords[:, 1])),
-                            shape=self.shape)
+        row, col = coords[:, 0], coords[:, 1]
+        result = coo_matrix((data[:, 0], (row, col)), shape=self.shape)
+        if self.has_canonical_format:
+            result.sum_duplicates()

Review comment:
       Why?

##########
File path: cpp/src/arrow/sparse_tensor_test.cc
##########
@@ -49,7 +49,10 @@ static inline void AssertCOOIndex(const 
std::shared_ptr<Tensor>& sidx, const int
   }
 }
 
-TEST(TestSparseCOOIndex, Make) {
+//-----------------------------------------------------------------------------

Review comment:
       Perhaps there should also be some tests with empty/all-zero tensors?

##########
File path: python/pyarrow/tensor.pxi
##########
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
             for x in dim_names:
                 c_dim_names.push_back(tobytes(x))
 
-        coords = np.vstack([obj.row, obj.col]).T
+        row = obj.row
+        col = obj.col
+        if obj.has_canonical_format:
+            order = np.lexsort((col, row))  # row-major order

Review comment:
       I don't understand this. If Scipy says the indices are sorted, why do 
you need to sort them again?

##########
File path: cpp/src/arrow/sparse_tensor.cc
##########
@@ -298,16 +298,137 @@ inline Status CheckSparseCOOIndexValidity(const 
std::shared_ptr<DataType>& type,
   return Status::OK();
 }
 
+void get_coo_index_tensor_row(const std::shared_ptr<Tensor>& coords, const 
int64_t row,
+                              std::vector<int64_t>* out_index) {
+  const auto& fw_index_value_type =
+      internal::checked_cast<const FixedWidthType&>(*coords->type());
+  const size_t indices_elsize = fw_index_value_type.bit_width() / CHAR_BIT;
+
+  const auto& shape = coords->shape();
+  const int64_t non_zero_length = shape[0];
+  DCHECK(0 <= row && row < non_zero_length);
+
+  const int64_t ndim = shape[1];
+  std::vector<int64_t> index;
+  index.reserve(ndim);

Review comment:
       You can write directly into `out_index`:
   ```c++
   out_index->resize(ndim);
   // ...
         for (int64_t i = 0; i < ndim; ++i) {
           (*out_index)[i] = 
static_cast<int64_t>(coords->Value<UInt8Type>({row, i}));
         }
   ```

##########
File path: python/pyarrow/tensor.pxi
##########
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
             for x in dim_names:
                 c_dim_names.push_back(tobytes(x))
 
-        coords = np.vstack([obj.row, obj.col]).T
+        row = obj.row
+        col = obj.col
+        if obj.has_canonical_format:

Review comment:
       Does this attribute exist? I don't see it in 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix
   In 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix,
 I see an attribute named `has_sorted_indices`

##########
File path: cpp/src/arrow/sparse_tensor.cc
##########
@@ -298,16 +298,137 @@ inline Status CheckSparseCOOIndexValidity(const 
std::shared_ptr<DataType>& type,
   return Status::OK();
 }
 
+void get_coo_index_tensor_row(const std::shared_ptr<Tensor>& coords, const 
int64_t row,

Review comment:
       Can you follow the style guide? Use CamelCase for function names.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to