[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-25 Thread Neal Becker
>
>
> the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy
> 1.19.5) for each file is listed below:
>
> 0.179s  eye1e4.npy (mmap_mode=None)
> 0.001s  eye1e4.npy (mmap_mode=r)
> 0.718s  eye1e4_bjd_raw_ndsyntax.jdb
> 1.474s  eye1e4_bjd_zlib.jdb
> 0.635s  eye1e4_bjd_lzma.jdb
>
>
> Clearly, mmapped loading is the fastest option, not surprisingly; it is
> true that the raw bjdata file is about 5x slower to load than the npy file,
> but given that the main chunk of the data is stored identically (as a
> contiguous buffer), I suppose that with some optimization of the decoder the
> gap between the two can be substantially shortened. The longer loading times
> of zlib/lzma (and similarly the saving times) reflect a trade-off between
> smaller file sizes and the time spent on compression/decompression/disk-IO.
>
> I think the load time for mmap may be deceptive; it isn't actually loading
> anything, just mapping the file into memory. Maybe a better benchmark is to
> actually process the data, e.g., compute the mean, which would require
> reading the values.
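
For illustration, a minimal sketch of such a benchmark (assuming the eye1e4.npy
file from the timings above; computing the mean forces every page to actually
be read):

import time
import numpy as np

t0 = time.perf_counter()
a = np.load('eye1e4.npy', mmap_mode='r')   # mapping only; no data is read yet
m = a.mean()                               # touches every page, forcing the file to be read
print(f'{time.perf_counter() - t0:.3f}s  mean={m}')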


[Numpy-discussion] Re: Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-25 Thread Bill Ross
>> For my case, I'd be curious about the time to add one 1T-entries file to 
>> another. 

> as I mentioned in the previous reply, bjdata is appendable [3], so you can 
> simply append another array (or a slice) to the end of the file. 

I'm thinking of numerical ops here, e.g. adding an array to itself would
double the values but not the size.

--

Phobrain.com 

On 2022-08-25 14:41, Qianqian Fang wrote:

> To avoid derailing the other thread [1] on extending .npy files, I am going 
> to start a new thread on alternative array storage file formats using binary 
> JSON - in case there is such a need and interest among numpy users
> 
> Specifically, I want to first follow up on Bill's question below regarding
> loading time.
> 
> On 8/25/22 11:02, Bill Ross wrote: 
> 
>> Can you give load times for these?
> 
> As I mentioned in the earlier reply to Robert, the most memory-efficient
> (i.e. fast-loading, disk-mmap-able) but not necessarily disk-efficient (i.e.
> it may result in the largest data file sizes) BJData construct is to store an
> ND array using BJData's ND-array container.
> 
> I have to admit that both the jdata and bjdata modules have not been
> extensively optimized for speed. With the current implementation, here are
> the loading times for a larger diagonal matrix (eye(10000)):
> 
> A BJData file storing a single eye(10000) array using the ND-array container
> can be downloaded from here [2] (file size: 1MB zipped; decompressed, it is
> ~800MB, the same as the npy file) - this file was generated from a MATLAB
> encoder, but can be loaded using Python (see below, re: Robert).
> 
> 800000128 eye1e4.npy
> 800000014 eye1e4_bjd_raw_ndsyntax.jdb
> 813721 eye1e4_bjd_zlib.jdb
> 113067 eye1e4_bjd_lzma.jdb
> 
> the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy 
> 1.19.5) for each file is listed below:
> 
> 0.179s  eye1e4.npy (mmap_mode=None)
> 0.001s  eye1e4.npy (mmap_mode=r)
> 0.718s  eye1e4_bjd_raw_ndsyntax.jdb
> 1.474s  eye1e4_bjd_zlib.jdb
> 0.635s  eye1e4_bjd_lzma.jdb
> 
> Clearly, mmapped loading is the fastest option, not surprisingly; it is true
> that the raw bjdata file is about 5x slower to load than the npy file, but
> given that the main chunk of the data is stored identically (as a contiguous
> buffer), I suppose that with some optimization of the decoder, the gap between
> the two can be substantially shortened. The longer loading times of zlib/lzma
> (and similarly the saving times) reflect a trade-off between smaller file
> sizes and the time spent on compression/decompression/disk-IO.
> 
> By no means am I saying that the binary JSON format is readily able to
> deliver better speed with its current non-optimized implementation. I just
> want to bring attention to this class of formats, and highlight that their
> flexibility gives abundant mechanisms to create fast, disk-mapped IO, while
> allowing additional benefits such as compression, unlimited metadata for
> future extensions, etc.
> 
>>> 8000128  eye5chunk.npy
>>> 5004297  eye5chunk_bjd_raw.jdb
>>> 10338  eye5chunk_bjd_zlib.jdb
>>> 2206  eye5chunk_bjd_lzma.jdb
>> 
>> For my case, I'd be curious about the time to add one 1T-entries file to 
>> another.
> 
> as I mentioned in the previous reply, bjdata is appendable [3], so you can 
> simply append another array (or a slice) to the end of the file. 
> 
>> Thanks, 
>> Bill
> 
> Also related, re: @Robert's question below:
> 
>> Are any of them supported by a Python BJData implementation? I didn't see 
>> any option to get that done in the `bjdata` package you recommended, for 
>> example. 
>> https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200
> 
> The bjdata module currently only supports the ND-array construct in the
> decoder [4] (i.e. it maps such a buffer to a numpy.ndarray) - it should be
> relatively trivial to add it to the encoder though.
> 
> On the other hand, the annotated format is currently supported: one can call
> the jdata module (responsible for annotation-level encoding/decoding) as shown
> in my sample code, which then calls bjdata internally for data serialization.
> 
>> Okay. Given your wording, it looked like you were claiming that the binary 
>> JSON was supported by the whole ecosystem. Rather, it seems like you can 
>> either get binary encoding OR the ecosystem support, but not both at the 
>> same time.
> 
> All in relative terms, of course - JSON has ~100 parsers listed on its
> website [5], MessagePack - another flavor of binary JSON - lists [6] ~50-60
> parsers, and UBJSON lists [7] ~20 parsers. I am not familiar with npy
> parsers, but googling returns only a few.
> 
> Also, most binary JSON implementations provide tools to convert to JSON and
> back, so, in that sense, whatever JSON has in its ecosystem can "potentially"
> be used for binary JSON files if one wants to. There are also recent
> publications comparing the differences between various binary JSON formats,
> in case anyone is interested.
> 
> 

[Numpy-discussion] Exporting numpy arrays to binary JSON (BJData) for better portability

2022-08-25 Thread Qianqian Fang
To avoid derailing the other thread on extending .npy files, I am going to
start a new thread on alternative array storage file formats using binary
JSON - in case there is such a need and interest among numpy users.


Specifically, I want to first follow up on Bill's question below regarding
loading time.



On 8/25/22 11:02, Bill Ross wrote:


> Can you give load times for these?



As I mentioned in the earlier reply to Robert, the most memory-efficient
(i.e. fast-loading, disk-mmap-able) but not necessarily disk-efficient
(i.e. it may result in the largest data file sizes) BJData construct is to
store an ND array using BJData's ND-array container.


I have to admit that both the jdata and bjdata modules have not been
extensively optimized for speed. With the current implementation, here
are the loading times for a larger diagonal matrix (eye(10000)):


A BJData file storing a single eye(10000) array using the ND-array
container can be downloaded from here (file size: 1MB zipped; decompressed,
it is ~800MB, the same as the npy file) - this file was generated from a
MATLAB encoder, but can be loaded using Python (see below, re: Robert).


800000128 eye1e4.npy
800000014 eye1e4_bjd_raw_ndsyntax.jdb
   813721 eye1e4_bjd_zlib.jdb
   113067 eye1e4_bjd_lzma.jdb

the loading time (from an nvme drive, Ubuntu 18.04, python 3.6.9, numpy 
1.19.5) for each file is listed below:


0.179s  eye1e4.npy (mmap_mode=None)
0.001s  eye1e4.npy (mmap_mode=r)
0.718s  eye1e4_bjd_raw_ndsyntax.jdb
1.474s  eye1e4_bjd_zlib.jdb
0.635s  eye1e4_bjd_lzma.jdb


Clearly, mmapped loading is the fastest option, not surprisingly; it is
true that the raw bjdata file is about 5x slower to load than the npy file,
but given that the main chunk of the data is stored identically (as a
contiguous buffer), I suppose that with some optimization of the decoder,
the gap between the two can be substantially shortened. The longer loading
times of zlib/lzma (and similarly the saving times) reflect a trade-off
between smaller file sizes and the time spent on
compression/decompression/disk-IO.


By no means am I saying that the binary JSON format is readily able to
deliver better speed with its current non-optimized implementation. I
just want to bring attention to this class of formats, and highlight
that their flexibility gives abundant mechanisms to create fast,
disk-mapped IO, while allowing additional benefits such as compression,
unlimited metadata for future extensions, etc.




> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb
>   10338  eye5chunk_bjd_zlib.jdb
>    2206  eye5chunk_bjd_lzma.jdb

> For my case, I'd be curious about the time to add one 1T-entries file
> to another.



As I mentioned in the previous reply, bjdata is appendable, so you can simply
append another array (or a slice) to the end of the file.
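
For illustration, a rough sketch of root-level appending, assuming the
`bjdata` package keeps the `dump()`/`load()` file-object interface of the
py-ubjson package it derives from, and converting the arrays to lists since
the encoder does not yet emit the ND-array construct (as noted further down):

import bjdata
import numpy as np

x = np.eye(1000)

with open('stack.jdb', 'wb') as fp:          # write the first chunk
    bjdata.dump(x[:200].tolist(), fp)

with open('stack.jdb', 'ab') as fp:          # later, append another chunk at the root level
    bjdata.dump(x[200:400].tolist(), fp)

chunks = []
with open('stack.jdb', 'rb') as fp:          # read the concatenated records back one by one
    while True:
        try:
            chunks.append(np.asarray(bjdata.load(fp)))
        except Exception:                    # stop at end of file (exception type is package-specific)
            break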




> Thanks,
> Bill




Also related, re: @Robert's question below:

> Are any of them supported by a Python BJData implementation? I didn't
> see any option to get that done in the `bjdata` package you
> recommended, for example.
>
> https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200


The bjdata module currently only supports the ND-array construct in the
decoder (i.e. it maps such a buffer to a numpy.ndarray) - it should be
relatively trivial to add it to the encoder though.


On the other hand, the annotated format is currently supported: one can
call the jdata module (responsible for annotation-level encoding/decoding)
as shown in my sample code, which then calls bjdata internally for data
serialization.



> Okay. Given your wording, it looked like you were claiming that the
> binary JSON was supported by the whole ecosystem. Rather, it seems
> like you can either get binary encoding OR the ecosystem support, but
> not both at the same time.


All in relative terms, of course - JSON has ~100 parsers listed on its
website, MessagePack - another flavor of binary JSON - lists ~50-60
parsers, and UBJSON lists ~20 parsers. I am not familiar with npy parsers,
but googling returns only a few.


Also, most binary JSON implementations provide tools to convert to JSON
and back, so, in that sense, whatever JSON has in its ecosystem can
"potentially" be used for binary JSON files if one wants to. There are also
recent publications comparing the differences between various binary JSON
formats, in case anyone is interested:


https://github.com/ubjson/universal-binary-json/issues/115
[Numpy-discussion] Re: An extension of the .npy file format

2022-08-25 Thread Robert Kern
On Thu, Aug 25, 2022 at 3:47 PM Qianqian Fang  wrote:

> On 8/25/22 12:25, Robert Kern wrote:
>
> I don't quite know what this means. My installed version of `jq`, for
> example, doesn't seem to know what to do with these files.
>
> ❯ jq --version
> jq-1.6
>
> ❯ jq . eye5chunk_bjd_raw.jdb
> parse error: Invalid numeric literal at line 1, column 38
>
>
> the .jdb files are binary JSON files (specifically BJData) that jq does
> not currently support; to save as text-based JSON, you change the suffix to
> .json or .jdt - it results in ~33% increase compared to the binary due to
> base64
>
Okay. Given your wording, it looked like you were claiming that the binary
JSON was supported by the whole ecosystem. Rather, it seems like you can
either get binary encoding OR the ecosystem support, but not both at the
same time.

> I think a fundamental problem here is that it looks like each element in
> the array is delimited. I.e. a `float64` value starts with b'D' then the 8
> IEEE-754 bytes representing the number. When we're talking about
> memory-mappability, we are talking about having the on-disk representation
> being exactly what it looks like in-memory, all of the IEEE-754 floats
> contiguous with each other, so we can use the `np.memmap` `ndarray`
> subclass to represent the on-disk data as a first-class array object. This
> spec lets us mmap the binary JSON file and manipulate its contents in-place
> efficiently, but that's not what is being asked for here.
>
> there are several BJData-compliant forms to store the same binary array
> losslessly. The most memory efficient and disk-mmapable (but not
> necessarily disk-efficient) form is to use the ND-array container syntax
> 
> that BJData spec extended over UBJSON.
>
Are any of them supported by a Python BJData implementation? I didn't see
any option to get that done in the `bjdata` package you recommended, for
example.

https://github.com/NeuroJSON/pybj/blob/a46355a0b0df0bec1817b04368a5a573358645ef/bjdata/encoder.py#L200

-- 
Robert Kern


[Numpy-discussion] Re: An extension of the .npy file format

2022-08-25 Thread Qianqian Fang

On 8/25/22 12:25, Robert Kern wrote:

> No one is really proposing another format, just a minor tweak to the
> existing NPY format.


Agreed. I was just following up on the previous comment about alternative
formats (such as HDF5) and the pros/cons of npy.



> I don't quite know what this means. My installed version of `jq`, for
> example, doesn't seem to know what to do with these files.
>
> ❯ jq --version
> jq-1.6
>
> ❯ jq . eye5chunk_bjd_raw.jdb
> parse error: Invalid numeric literal at line 1, column 38



The .jdb files are binary JSON files (specifically BJData) that jq does
not currently support; to save as text-based JSON, you change the suffix
to .json or .jdt - this results in a ~33% size increase compared to the
binary form due to base64.


jd.save(y, 'eye5chunk_bjd_zlib.jdt',  {'compression':'zlib'});

13694 Aug 25 12:54 eye5chunk_bjd_zlib.jdt
10338 Aug 25 15:41 eye5chunk_bjd_zlib.jdb

jq . eye5chunk_bjd_zlib.jdt

[
  {
    "_ArrayType_": "double",
    "_ArraySize_": [
      200,
      1000
    ],
    "_ArrayZipType_": "zlib",
    "_ArrayZipSize_": [
      1,
      200000
    ],
    "_ArrayZipData_": "..."
  },
  ...
]


> I think a fundamental problem here is that it looks like each element
> in the array is delimited. I.e. a `float64` value starts with b'D'
> then the 8 IEEE-754 bytes representing the number. When we're talking
> about memory-mappability, we are talking about having the on-disk
> representation being exactly what it looks like in-memory, all of the
> IEEE-754 floats contiguous with each other, so we can use the
> `np.memmap` `ndarray` subclass to represent the on-disk data as a
> first-class array object. This spec lets us mmap the binary JSON file
> and manipulate its contents in-place efficiently, but that's not what
> is being asked for here.



There are several BJData-compliant forms to store the same binary array
losslessly. The most memory-efficient and disk-mmapable (but not
necessarily disk-efficient) form is to use the ND-array container syntax
that the BJData spec extended over UBJSON.


For example, a 100x200x300 3D float64 ($D) array can be stored as below
(the numbers are stored in binary form; the white spaces should be removed):


[$D #[$u#U3 100 200 300 value0 value1 ...

where the "value_i"s are contiguous (row-major) binary stream of the 
float64 buffer without the delimited marker ('D') because it is absorbed 
to the optimized header 
 of 
the array "[" following the type "$" marker. The data chunk is 
mmap-able, although if you desired a pre-determined initial offset, you 
can force the dimension vector (#[$u #U 3 100 200 300) to be an integer 
type ($u) large enough, for example uint32 (m), then the starting offset 
of the binary stream will be entirely predictable.
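
For illustration, a hand-rolled sketch of writing this construct and then
mapping the payload with np.memmap; the marker layout follows the example
above, while the little-endian byte order and exact header size are
assumptions to be checked against the BJData spec:

import struct
import numpy as np

shape = (100, 200, 300)
data = np.zeros(shape, dtype='<f8')            # row-major float64 payload

with open('demo_ndarray.bjd', 'wb') as fp:
    fp.write(b'[$D#[$u#U')                     # array of float64 ('D'), dims given as a uint16 ('u') vector
    fp.write(struct.pack('<B', len(shape)))    # 'U' (uint8) length of the dimension vector: 3
    fp.write(struct.pack('<3H', *shape))       # the three uint16 dimensions, assumed little-endian
    fp.write(data.tobytes())                   # contiguous values, no per-element 'D' markers

offset = 9 + 1 + 2 * len(shape)                # marker bytes + count byte + packed dims = 16 here
arr = np.memmap('demo_ndarray.bjd', dtype='<f8', mode='r', offset=offset, shape=shape)
print(arr.shape, float(arr[0, 0, 0]))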


Multiple ND arrays can be directly appended at the root level; for example,

[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...
[$D #[$u#U3 100 200 300 value0 value1 ...

can store a 400x200x300 array as four 100x200x300 chunks.

Alternatively, one can also use an annotated format (in JSON form:
{"_ArrayType_":"double","_ArraySize_":[100,200,300],"_ArrayData_":[value1,value2,...]})
to store everything as a 1D contiguous buffer:


{U11 _ArrayType_ S U6 double U11 _ArraySize_ [$u#U3 100 200 300 U11
_ArrayData_ [$D #m 6000000 value1 value2 ...}


The contiguous buffer in the _ArrayData_ section is also disk-mmap-able; you
can also place additional requirements on the array metadata to ensure a
predictable initial offset, if desired.


Similarly, these annotated chunks can be appended in either JSON or
binary JSON form, and the parsers can automatically handle both forms
and convert them into the desired binary ND array with the expected type
and dimensions.
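
For illustration, a hand-rolled sketch (not the jdata API itself) of decoding
one such annotated record - like the zlib/base64 jq output shown earlier -
back into an ndarray using only the standard library and numpy:

import base64
import zlib
import numpy as np

def decode_annotated(record):
    # map a few common _ArrayType_ names to numpy dtypes (assumed subset)
    dtype = {'double': '<f8', 'single': '<f4', 'int32': '<i4'}[record['_ArrayType_']]
    raw = zlib.decompress(base64.b64decode(record['_ArrayZipData_']))
    return np.frombuffer(raw, dtype=dtype).reshape(record['_ArraySize_'])

# e.g. for the first chunk of the text-form file shown above:
# import json; chunks = json.load(open('eye5chunk_bjd_zlib.jdt'))
# arr = decode_annotated(chunks[0])   # -> shape (200, 1000), dtype float64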




Here are the output file sizes in bytes:

8000128  eye5chunk.npy
5004297  eye5chunk_bjd_raw.jdb

Just a note: This difference is solely due to a special representation 
of `0` in 5 bytes rather than 8 (essentially, your encoder recognizes 
0.0 as a special value and uses the `float32` encoding of it). If you 
had any other value making up the bulk of the file, this would be 
larger than the NPY due to the additional delimiter b'D'.



The two BJData forms that I mentioned above (ND-array syntax or
annotated array) will preserve the original precision/shape in
round-trips. BJData follows the recommendations of the UBJSON spec and
automatically reduces data size only when there is no precision loss
(such as for integers or zeros), and this behavior is optional.




  10338  eye5chunk_bjd_zlib.jdb
   2206  eye5chunk_bjd_lzma.jdb

Qianqian

--
Robert Kern


[Numpy-discussion] Re: An extension of the .npy file format

2022-08-25 Thread Robert Kern
On Thu, Aug 25, 2022 at 10:45 AM Qianqian Fang  wrote:

> I am curious what you and other developers think about adopting
> JSON/binary JSON as a similarly simple, reverse-engineering-able but
> universally parsable array exchange format instead of designing another
> numpy-specific binary format.
>
No one is really proposing another format, just a minor tweak to the
existing NPY format.

If you are proposing that numpy adopt BJData into numpy to underlay
`np.save()`, we are not very likely to for a number of reasons. However, if
you are addressing the wider community to advertise your work, by all means!

> I am interested in this topic (as well as thoughts among numpy developers)
> because I am currently working on a project - NeuroJSON (
> https://neurojson.org) - funded by the US National Institute of Health.
> The goal of the NeuroJSON project is to create easy-to-adopt,
> easy-to-extend, and preferably human-readable data formats to help
> disseminate and exchange neuroimaging data (and scientific data in
> general).
>
> Needless to say, numpy is a key toolkit that is widely used among
> neuroimaging data analysis pipelines. I've seen discussions of potentially
> adopting npy as a standardized way to share volumetric data (as ndarrays),
> such as in this thread
>
> https://github.com/bids-standard/bids-specification/issues/197
>
> however, several limitations were also discussed, for example
>
> 1. npy only supports a single numpy array and does not support other metadata
> or other more complex data records (multiple arrays are only achieved via
> multiple files)
> 2. no internal (i.e. data-level) compression, only file-level compression
> 3. although the file is simple, it still requires a parser to read/write,
> and such a parser is not widely available in other environments, making it
> mostly limited to exchanging data among python programs
> 4. I am not entirely sure, but I suppose it does not support sparse
> matrices or special matrices (such as diagonal/band/symmetric etc.) - I can
> be wrong though
>
> In the NeuroJSON project, we primarily use JSON and binary JSON
> (specifically, the UBJSON-derived BJData format) as the
> underlying data exchange files. Through standardized data annotations,
> we are able to address most of the above limitations - the generated files
> are universally parsable in nearly all programming environments with
> existing parsers, support complex hierarchical data, compression, and can
> readily benefit from the large ecosystems of JSON (JSON-schema, JSONPath,
> JSON-LD, jq, numerous parsers, web-ready, NoSQL db ...).
>
I don't quite know what this means. My installed version of `jq`, for
example, doesn't seem to know what to do with these files.

❯ jq --version
jq-1.6

❯ jq . eye5chunk_bjd_raw.jdb
parse error: Invalid numeric literal at line 1, column 38
>
> I understand that simplicity is a key design spec here. I want to
> highlight UBJSON/BJData as a competitive alternative format. It is also
> designed with simplicity as a primary goal, yet it allows storing
> hierarchical, strongly-typed complex binary data and is easily extensible.
>
> A UBJSON/BJData parser is not necessarily longer than an npy parser; for
> example, the Python reader for the full spec takes only about 500 lines of
> code (including comments), and similarly for a JS parser:
>
> https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
> https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js
>
> We actually did a benchmark a few months back - the test workloads were two
> large 2D numerical arrays (node and face, storing surface mesh data), and we
> compared the parsing speed of various formats in Python, MATLAB, and JS. The
> uncompressed BJData (BMSHraw) reported a loading speed nearly as fast as
> reading a raw binary dump, and the internally compressed BJData (BMSHz) gives
> the best balance between small file sizes and loading speed; see our results
> here:
>
> https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png=large
>
> I want to add two quick points to echo the features you desired in npy:
>
> 1. it is not common to use mmap when reading JSON/binary JSON files, but it
> is certainly possible. I recently wrote a JSON-mmap spec and a MATLAB
> reference implementation.
>
I think a fundamental problem here is that it looks like each element in
the array is delimited. I.e. a `float64` value starts with b'D' then the 8
IEEE-754 bytes representing the number. When we're talking about
memory-mappability, we are talking about having the on-disk representation
being exactly what it looks like in-memory, all of the IEEE-754 floats
contiguous with each other, so we can use the `np.memmap` `ndarray`
subclass to represent the on-disk data as a first-class array object. This
spec lets us mmap the binary JSON file and manipulate its contents in-place
efficiently, but that's not what is being asked for here.

[Numpy-discussion] Re: An extension of the .npy file format

2022-08-25 Thread Bill Ross
Can you give load times for these? 

> 8000128  eye5chunk.npy
> 5004297  eye5chunk_bjd_raw.jdb
>   10338  eye5chunk_bjd_zlib.jdb
>2206  eye5chunk_bjd_lzma.jdb

For my case, I'd be curious about the time to add one 1T-entries file to
another. 

Thanks, 
Bill 

--

Phobrain.com 

On 2022-08-24 20:02, Qianqian Fang wrote:

> I am curious what you and other developers think about adopting JSON/binary 
> JSON as a similarly simple, reverse-engineering-able but universally parsable 
> array exchange format instead of designing another numpy-specific binary 
> format. 
> 
> I am interested in this topic (as well as thoughts among numpy developers) 
> because I am currently working on a project - NeuroJSON 
> (https://neurojson.org) - funded by the US National Institute of Health. The 
> goal of the NeuroJSON project is to create easy-to-adopt, easy-to-extend, and 
> preferably human-readable data formats to help disseminate and exchange 
> neuroimaging data (and scientific data in general). 
> 
> Needless to say, numpy is a key toolkit that is widely used among 
> neuroimaging data analysis pipelines. I've seen discussions of potentially 
> adopting npy as a standardized way to share volumetric data (as ndarrays), 
> such as in this thread 
> 
> https://github.com/bids-standard/bids-specification/issues/197 
> 
> however, several limitations were also discussed, for example 
> 
> 1. npy only supports a single numpy array and does not support other metadata
> or other more complex data records (multiple arrays are only achieved via
> multiple files)
> 2. no internal (i.e. data-level) compression, only file-level compression
> 3. although the file is simple, it still requires a parser to read/write, and
> such a parser is not widely available in other environments, making it mostly
> limited to exchanging data among python programs
> 4. I am not entirely sure, but I suppose it does not support sparse matrices
> or special matrices (such as diagonal/band/symmetric etc.) - I can be wrong
> though
> 
> In the NeuroJSON project, we primarily use JSON and binary JSON 
> (specifically, UBJSON [1] derived BJData [2] format) as the underlying data 
> exchange files. Through standardized data annotations [3], we are able to 
> address most of the above limitations - the generated files are universally 
> parsable in nearly all programming environments with existing parsers, 
> support complex hierarchical data, compression, and can readily benefit from 
> the large ecosystems of JSON (JSON-schema, JSONPath, JSON-LD, jq, numerous 
> parsers, web-ready, NoSQL db ...). 
> 
> I understand that simplicity is a key design spec here. I want to highlight
> UBJSON/BJData as a competitive alternative format. It is also designed with
> simplicity as a primary goal [4], yet it allows storing hierarchical,
> strongly-typed complex binary data and is easily extensible.
>
> A UBJSON/BJData parser is not necessarily longer than an npy parser; for
> example, the Python reader for the full spec takes only about 500 lines of
> code (including comments), and similarly for a JS parser:
> 
> https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
> https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js 
> 
> We actually did a benchmark [5] a few months back - the test workloads were
> two large 2D numerical arrays (node and face, storing surface mesh data), and
> we compared the parsing speed of various formats in Python, MATLAB, and JS.
> The uncompressed BJData (BMSHraw) reported a loading speed nearly as fast as
> reading a raw binary dump, and the internally compressed BJData (BMSHz) gives
> the best balance between small file sizes and loading speed; see our results
> here:
> 
> https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png=large 
> 
> I want to add two quick points to echo the features you desired in npy: 
> 
> 1. it is not common to use mmap in reading JSON/binary JSON files, but it is 
> certainly possible. I recently wrote a JSON-mmap spec [6] and a MATLAB 
> reference implementation [7] 
> 
> 2. UBJSON/BJData natively support append-able root-level records; JSON has 
> been extensively used in data streaming with appendable nd-json or 
> concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming) 
> 
> just a quick comparison of output file sizes with a 1000x1000 unitary 
> diagonal matrix 
> 
> # python3 -m pip install jdata bjdata
> import numpy as np
> import jdata as jd
> x = np.eye(1000);   # create a large array
> y = np.vsplit(x, 5);# split into smaller chunks
> np.save('eye5chunk.npy',y); # save npy
> jd.save(y, 'eye5chunk_bjd_raw.jdb');# save as uncompressed bjd
> jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression':'zlib'});  # zlib-compressed bjd
> jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression':'lzma'});  # lzma-compressed bjd
> newy=jd.load('eye5chunk_bjd_zlib.jdb'); # loading/decoding
> newx = np.concatenate(newy);# regroup chunks
> 

[Numpy-discussion] Re: writing a known-size 1D ndarray serially as it's calced

2022-08-25 Thread Robert Kern
On Thu, Aug 25, 2022 at 4:27 AM Bill Ross  wrote:

> Thanks, np.lib.format.open_memmap() works great! With prediction procs
> using minimal sys memory, I can get twice as many on GPU, with fewer
> optimization warnings.
>
> Why even have the number of records in the header? Shouldn't record size
> plus system-reported/growable file size be enough?
>
Only in the happy case where there is no corruption. Implicitness is not a
virtue in the use cases that the format was designed for. There is an
additional use case where the length is unknown a priori where implicitness
would help, but the format was not designed for that case (and I'm not sure
I want to add that use case).

> I'd love to have a shared-mem analog for smaller-scale data; now I load
> data and fork to emulate that effect.
>
There are a number of ways to do that, including using memmap on files on a
memory-backed filesystem like /dev/shm/ on Linux. See this article for
several more options:


https://luis-sena.medium.com/sharing-big-numpy-arrays-across-python-processes-abf0dc2a0ab2
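
For illustration, one of the stdlib routes along these lines is
multiprocessing.shared_memory, which backs an ndarray with a named
shared-memory block that other processes can attach to without copying
(a sketch with arbitrary shape/dtype):

import numpy as np
from multiprocessing import shared_memory

shape, dtype = (1000, 1000), np.dtype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)) * dtype.itemsize)
a = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
a[:] = 1.0                                           # producer fills the shared block

peer = shared_memory.SharedMemory(name=shm.name)     # another process attaches by name
b = np.ndarray(shape, dtype=dtype, buffer=peer.buf)  # sees the same data, no copy

peer.close()
shm.close()
shm.unlink()                                         # creator releases the segment when done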

> My file sizes will exceed memory, so I'm hoping to get the most out of
> memmap. Will this in-loop assignment to predsum work to avoid loading all
> to memory?
>
> predsum = np.lib.format.open_memmap(outfile, mode='w+',
> shape=(ids_sq,), dtype=np.float32)
>
> for i in range(len(IN_FILES)):
>
> pred = numpy.lib.format.open_memmap(IN_FILES[i])
>
> predsum = np.add(predsum, pred) # <-
>
This will replace the `predsum` array with a new in-memory array the first
time through this loop. Use `out=predsum` to make sure that the output goes
into the memory-mapped array

  np.add(predsum, pred, out=predsum)

Or the usual augmented assignment:

  predsum += pred
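
Putting the pieces together, the corrected loop might look like this (a
sketch; the file names and the `ids_sq` length are placeholders standing in
for the originals):

import numpy as np

IN_FILES = ['pred0.npy', 'pred1.npy']      # hypothetical input files
ids_sq = 10_000                            # placeholder length

predsum = np.lib.format.open_memmap('predsum.npy', mode='w+',
                                    shape=(ids_sq,), dtype=np.float32)
for fname in IN_FILES:
    pred = np.lib.format.open_memmap(fname, mode='r')
    np.add(predsum, pred, out=predsum)     # accumulate in place, into the mapped file
    del pred

predsum.flush()
del predsum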

> del pred
> del predsum
>

The precise memory behavior will depend on your OS's virtual memory
configuration. But in general, `np.add()` will go through the arrays in
order, causing the virtual memory system to page in memory pages as they
are accessed for reading or writing, and page out the old ones to make room
for the new pages. Linux, in my experience, isn't always the best at
managing that backlog of old pages, especially if you have multiple
processes doing similar kinds of things (in the past, I have seen *each* of
those processes trying to use *all* of the main memory for their backlog of
old pages), but there are configuration tweaks that you can make.

-- 
Robert Kern


[Numpy-discussion] Re: An extension of the .npy file format

2022-08-25 Thread Qianqian Fang
I am curious what you and other developers think about adopting 
JSON/binary JSON as a similarly simple, reverse-engineering-able but 
universally parsable array exchange format instead of designing another 
numpy-specific binary format.


I am interested in this topic (as well as thoughts among numpy 
developers) because I am currently working on a project - NeuroJSON 
(https://neurojson.org) - funded by the US National Institute of Health. 
The goal of the NeuroJSON project is to create easy-to-adopt, 
easy-to-extend, and preferably human-readable data formats to help 
disseminate and exchange neuroimaging data (and scientific data in 
general).


Needless to say, numpy is a key toolkit that is widely used among 
neuroimaging data analysis pipelines. I've seen discussions of 
potentially adopting npy as a standardized way to share volumetric data 
(as ndarrays), such as in this thread


https://github.com/bids-standard/bids-specification/issues/197

however, several limitations were also discussed, for example

1. npy only supports a single numpy array and does not support other
metadata or other more complex data records (multiple arrays are only
achieved via multiple files)

2. no internal (i.e. data-level) compression, only file-level compression
3. although the file is simple, it still requires a parser to
read/write, and such a parser is not widely available in other
environments, making it mostly limited to exchanging data among python
programs
4. I am not entirely sure, but I suppose it does not support sparse
matrices or special matrices (such as diagonal/band/symmetric etc.) - I
can be wrong though


In the NeuroJSON project, we primarily use JSON and binary JSON
(specifically, the UBJSON-derived BJData format) as
the underlying data exchange files. Through standardized data
annotations, we are able to address most of the above limitations - the
generated files are universally parsable in nearly all programming
environments with existing parsers, support complex hierarchical data,
compression, and can readily benefit from the large ecosystems of JSON
(JSON-schema, JSONPath, JSON-LD, jq, numerous parsers, web-ready, NoSQL
db ...).


I understand that simplicity is a key design spec here. I want to
highlight UBJSON/BJData as a competitive alternative format. It is also
designed with simplicity as a primary goal, yet it allows storing
hierarchical, strongly-typed complex binary data and is easily extensible.


A UBJSON/BJData parser is not necessarily longer than an npy parser; for
example, the Python reader for the full spec takes only about 500 lines
of code (including comments), and similarly for a JS parser:


https://github.com/NeuroJSON/pybj/blob/master/bjdata/decoder.py
https://github.com/NeuroJSON/js-bjdata/blob/master/bjdata.js

We actually did a benchmark a few months back - the
test workloads were two large 2D numerical arrays (node and face, storing
surface mesh data), and we compared the parsing speed of various formats
in Python, MATLAB, and JS. The uncompressed BJData (BMSHraw) reported a
loading speed nearly as fast as reading a raw binary dump, and the
internally compressed BJData (BMSHz) gives the best balance between
small file sizes and loading speed; see our results here:


https://pbs.twimg.com/media/FRPEdLGWYAEJe80?format=png=large

I want to add two quick points to echo the features you desired in npy:

1. it is not common to use mmap when reading JSON/binary JSON files, but
it is certainly possible. I recently wrote a JSON-mmap spec
and a MATLAB reference implementation.


2. UBJSON/BJData natively supports appendable root-level records; JSON
has been extensively used in data streaming with appendable nd-json or
concatenated JSON (https://en.wikipedia.org/wiki/JSON_streaming).
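
For illustration, a tiny concatenated-JSON/NDJSON sketch of that
append-by-concatenation idea (plain-text JSON here; appending binary BJData
records at the root level works analogously):

import json
import numpy as np

with open('chunks.ndjson', 'a') as fp:                  # append one record per line
    fp.write(json.dumps({'chunk': 0, 'data': np.arange(4).tolist()}) + '\n')
    fp.write(json.dumps({'chunk': 1, 'data': np.arange(4, 8).tolist()}) + '\n')

records = [json.loads(line) for line in open('chunks.ndjson')]
full = np.concatenate([r['data'] for r in records])     # regroup the appended chunks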



Just a quick comparison of output file sizes for a 1000x1000 unitary
diagonal matrix:


# python3 -m pip install jdata bjdata
import numpy as np
import jdata as jd
x = np.eye(1000);                # create a large array
y = np.vsplit(x, 5);             # split into smaller chunks
np.save('eye5chunk.npy', y);     # save npy
jd.save(y, 'eye5chunk_bjd_raw.jdb');                            # save as uncompressed bjd
jd.save(y, 'eye5chunk_bjd_zlib.jdb', {'compression':'zlib'});   # zlib-compressed bjd
jd.save(y, 'eye5chunk_bjd_lzma.jdb', {'compression':'lzma'});   # lzma-compressed bjd

newy = jd.load('eye5chunk_bjd_zlib.jdb');   # loading/decoding
newx = np.concatenate(newy);                # regroup chunks
newx.dtype


Here are the output file sizes in bytes:

8000128  eye5chunk.npy
5004297  eye5chunk_bjd_raw.jdb
  10338  eye5chunk_bjd_zlib.jdb
   2206  eye5chunk_bjd_lzma.jdb

[Numpy-discussion] Re: writing a known-size 1D ndarray serially as it's calced

2022-08-25 Thread Bill Ross
Thanks, np.lib.format.open_memmap() works great! With prediction procs
using minimal sys memory, I can get twice as many on GPU, with fewer
optimization warnings. 

Why even have the number of records in the header? Shouldn't record size
plus system-reported/growable file size be enough?  

I'd love to have a shared-mem analog for smaller-scale data; now I load
data and fork to emulate that effect.  

My file sizes will exceed memory, so I'm hoping to get the most out of
memmap. Will this in-loop assignment to predsum work to avoid loading
all to memory? 

predsum = np.lib.format.open_memmap(outfile, mode='w+',
shape=(ids_sq,), dtype=np.float32) 

for i in range(len(IN_FILES)): 

pred = numpy.lib.format.open_memmap(IN_FILES[i]) 

predsum = np.add(predsum, pred) # <- 

del pred

del predsum 

--

Phobrain.com 

On 2022-08-23 18:02, Robert Kern wrote:

> On Tue, Aug 23, 2022 at 8:47 PM  wrote: 
> 
>> I want to calc multiple ndarrays at once and lack memory, so want to write 
>> in chunks (here sized to GPU batch capacity). It seems there should be an 
>> interface to write the header, then write a number of elements cyclically, 
>> then add any closing rubric and close the file. 
>> 
>> Is it as simple as lib.format.write_array_header_2_0(fp, d) 
>> then writing multiple shape(N,) arrays of float by fp.write(item.tobytes())?
> 
> `item.tofile(fp)` is more efficient, but yes, that's the basic scheme. There 
> is no footer after the data. 
> 
> The alternative is to use `np.lib.format.open_memmap(filename, mode='w+', 
> dtype=dtype, shape=shape)`, then assign slices sequentially to the returned 
> memory-mapped array. A memory-mapped array is usually going to be friendlier 
> to whatever memory limits you are running into than a nominally "in-memory" 
> array.
> -- 
> Robert Kern 
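
For reference, a minimal sketch of the header-then-chunks scheme confirmed in
the quoted reply above (the sizes, dtype, and output name are arbitrary
placeholders):

import numpy as np

n_total, chunk = 1_000_000, 100_000
header = {'descr': np.lib.format.dtype_to_descr(np.dtype(np.float32)),
          'fortran_order': False,
          'shape': (n_total,)}

with open('out.npy', 'wb') as fp:
    np.lib.format.write_array_header_2_0(fp, header)      # header carries the final shape up front
    for start in range(0, n_total, chunk):
        batch = np.arange(start, start + chunk, dtype=np.float32)  # stand-in for a computed batch
        batch.tofile(fp)                                   # raw bytes follow the header; no footer

check = np.load('out.npy', mmap_mode='r')                  # verify shape/dtype round-trip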