Thanks...
On 7/13/2022 9:45 AM, David Li wrote:
If `l` is a plain list there, I don't think it's possible. The __arrow_array__ protocol relies on you to have a type that you can define the method on. I also don't think there are other customization hooks for pa.array() but maybe someone else knows better.
On Tue, Jul 12, 2022, at 17:18, dl via user wrote:
Hi David,
Are there any good examples for the first section of your reference [1]: Controlling conversion to pyarrow.Array with the __arrow_array__ protocol?
I find examples of creating an extension array using an extension type with explicit code in test_extension_type.py, e.g. in test_ext_array_basics. I'm thinking it might be possible to have the array type inferred by pyarrow.array() or pyarrow.Table.from_arrays() using an extension array type as suggested there. Am I right about this? If so, is there a good example? I haven't been able to get this to work.
For the record, here is what I can do.
    class SparseStructType(pa.PyExtensionType):
        storage_type = pa.struct([('shape', pa.list_(pa.int32())),
                                  ('keys', pa.list_(pa.float64())),
                                  ('indexes', pa.list_(pa.int64()))])

        def __init__(self):
            pa.PyExtensionType.__init__(self, self.storage_type)

        def __reduce__(self):
            return SparseStructType, ()

    l = list()
    for i in range(4):
        s = csr_matrix(random_dense())
        struct = [('shape', s.shape), ('keys', s.data), ('indexes', s.indices)]
        l.append(struct)

    struct_type = pa.struct([('shape', pa.list_(pa.int32())),
                             ('keys', pa.list_(pa.float64())),
                             ('indexes', pa.list_(pa.int64()))])
    arrow_array = pa.array(l, struct_type)
    extension_array = pa.ExtensionArray.from_storage(SparseStructType(), arrow_array)
I would like to be able to do something like
extension_array = pa.array(l,SparseStructType())
having the extension type of the array inferred by pa.array. Is that possible?
Thanks,
David
On 7/6/2022 4:26 PM, David Li wrote:
If I'm not mistaken, what you want is basically an extension type [1] for tensors, so you can have a column where each row contains a tensor/matrix. This has been discussed for quite some time [2].
Incidentally, you can keep the three-field representation but pack it into a single toplevel field with the Struct type.
On Wed, Jul 6, 2022, at 19:01, dl via user wrote:
I have tabular data with one record field of type scipy.sparse.csr_matrix. I want to convert this tabular data to a pyarrow table. I had been converting the csr_matrix first to a custom representation using three fields (shape, keys, indices) and building the pyarrow table using a schema with the types of these fields and table data with a separate list for each field (each list having one entry per input record). I was hoping I could use a single pyarrow.SparseCSRMatrix field instead of the custom three-field representation. Is that possible? Incidentally, the shape of the csr_matrix is typically (1,N) where N may vary for different records. But I don't think "typically (1,N)" matters. It would work with variable shape (M,N). The shape field has type pyarrow.List with value_type = pyarrow.int32().
On 7/6/2022 2:53 PM, Rok Mihevc wrote:
Hey David,
I don't think Table is designed in a way that you could "populate" it with a 2D tensor. It should rather be populated with a collection of equal length arrays.
Sparse CSR tensor on the other hand is composed of three arrays (indices, indptr, values) and you need a bit more involved logic to manipulate those than regular arrays. See [1] for memory layout definition.
What are you looking to accomplish? What access patterns are you expecting?
Rok
On Wed, Jul 6, 2022 at 10:48 PM dl <dydx...@yahoo.com> wrote:
Hi Rok,
What data type would I use for a pyarrow SparseCSRMatrix in a schema? I need to build a table with rows which include a field of this type. I don't see a related example in the test module. I'm doing something like:
schema = pyarrow.schema(fields, metadata=metadata)
table = pyarrow.Table.from_arrays(table_data, schema=schema)
where fields is a list of tuples of the form (field_name, pyarrow_type), e.g. ('field1', pyarrow.string()). What should pyarrow_type be for a SparseCSRMatrix field? Or will this not work?
Thanks,
David
On 7/1/2022 9:18 AM, Rok Mihevc wrote:
We lack pyarrow sparse tensor documentation (PRs welcome), so the tests are perhaps the most extensive description of what is doable: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_sparse_tensor.py
Rok
On Fri, Jul 1, 2022 at 5:38 PM dl via user <user@arrow.apache.org> wrote:
So, I guess this is supported in 8.0.0. I can do this:
    import numpy as np
    import pyarrow as pa
    from scipy.sparse import csr_matrix

    a = np.random.rand(100)
    a[a < .9] = 0.0
    s = csr_matrix(a)
    arrow_sparse_csr_matrix = pa.SparseCSRMatrix.from_scipy(s)

Now, how do I use that to build a pyarrow table? Stay tuned...
On 7/1/2022 8:19 AM, dl wrote:
I find pyarrow.SparseCSRMatrix mentioned here. But how do I use that? Is there documentation for that class?
On 7/1/2022 7:47 AM, dl wrote:
Hi,
I'm trying to understand support for sparse tensors in Arrow. It looks like there is "experimental" support using the C++ API. When was this introduced? I see in the code base here Cython sparse array classes. Can these be accessed using the Python API? Are they included in the 8.0.0 release? Is there any other support for sparse arrays/tensors in the Python API? Are there good examples for any of this, in particular for using the 8.0.0 Python API to create sparse tensors?
Thanks,
David