Hi Joel, thanks for your response.
On 06/06/16 14:29, Joel Nothman wrote:
> - concatenation of these arrays into a single CSR array appears to be
> non-trivial given the memory constraints (e.g. scipy.sparse.vstack
> transforms all arrays to COO sparse representation internally).
>
> There is a fast path for stacking a series of CSR matrices.

Could you elaborate a bit more? Do you mean the case when the final
array is larger than the available memory? Something along the lines of,

1. Load all arrays of the series as memory maps, and calculate the
   expected final array shape.
2. Allocate the `data`, `indices` and `indptr` arrays on disk, using
   either a numpy memory map or HDF5.
3. Recalculate `indptr` for each array in the series and fill the 3
   resulting arrays.
4. Make sure that we can open these files as a scipy CSR array, with
   the ability to load only a subset of rows into memory?

I'm just wondering if there is a more standard storage solution in the
scikit-learn environment that could be used efficiently with a
stateless feature extractor (HashingVectorizer).

Cheers,
--
Roman
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
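P.S. For what it's worth, here is a minimal in-memory sketch of the
indptr recalculation from step 3 (`stack_csr` is a hypothetical helper,
not scikit-learn or scipy API; for the out-of-core case each
`np.concatenate` target would instead be a pre-allocated `np.memmap` or
HDF5 dataset filled block by block):

```python
import numpy as np
import scipy.sparse as sp

def stack_csr(parts):
    """Vertically stack CSR matrices that share the same column count.

    Concatenates the data/indices arrays and shifts each block's indptr
    by the number of non-zeros accumulated in the preceding blocks.
    """
    n_cols = parts[0].shape[1]
    data = np.concatenate([p.data for p in parts])
    indices = np.concatenate([p.indices for p in parts])
    # Offsets: cumulative nnz of all preceding blocks.  Drop the leading
    # 0 of every indptr after the first to avoid duplicating row
    # boundaries.
    offsets = np.cumsum([0] + [p.nnz for p in parts[:-1]])
    indptr = np.concatenate(
        [parts[0].indptr]
        + [p.indptr[1:] + off for p, off in zip(parts[1:], offsets[1:])]
    )
    n_rows = sum(p.shape[0] for p in parts)
    return sp.csr_matrix((data, indices, indptr), shape=(n_rows, n_cols))
```

This avoids the COO round-trip that scipy.sparse.vstack would otherwise
do, since the three output arrays are written exactly once.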