GitHub user RayZ0rr created a discussion: Handling numpy ndarray or tensor objects with atleast 1 dimension having variable size
I want to use objects like numpy ndarrays or pytorch tensors in which at least one dimension varies in size. For example, consider a list of 2D pointclouds. Each pointcloud example has shape (N, 2), where `N` can differ from one pointcloud to the next.

[`pyarrow.FixedShapeTensorType`](https://arrow.apache.org/docs/python/generated/pyarrow.FixedShapeTensorType.html) doesn't work for this use case. The `VariableShapeTensor` implementations [1](https://github.com/apache/arrow/pull/40354) and [2](https://github.com/apache/arrow/issues/38007) have not been merged yet.

While waiting for these to be merged, I have implemented this in the following way, aiming for zero-copy retrieval of the original list of variable-shape tensors from the pyarrow table.

- For each variable-shape tensor, keep two columns: one of type `pyarrow.ListType` whose child type matches the `dtype` of the tensor, and another of type `pyarrow.ListType` whose child type is int32.
- Take, for example, the first column as `"points_val"` and the second as `"points_shape"`. Each element of `"points_val"` is the flattened list of values of a single tensor (`view(-1)` or `reshape(-1)`). Each element of `"points_shape"` holds the shape of that tensor.
- Using the following function, we can get the list of original variable-shape tensors back. There is a more efficient way to do this if the full tensor fits in memory.

```
import pyarrow as pa


def getTensors(table: pa.Table):
    vals = table["points_val"]
    shapes = table["points_shape"]
    out = []
    for i in range(len(vals)):
        # Flattened values of tensor i as a 1-D NumPy array
        data_np = vals[i].values.to_numpy()
        # Shape of tensor i as a list of Python ints
        dims = shapes[i].as_py()
        out.append(data_np.reshape(dims))
    return out
```

Does anyone know of a better way, or think this is not zero-copy?

GitHub link: https://github.com/apache/arrow/discussions/48099
