[Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Ryan R. Rosario
Hi,

I have a very large dictionary that must be shared across processes and does 
not fit in RAM. Access to this object needs to be fast. The key is an integer 
ID and the value is a list of two elements, both numpy arrays (one holds 
ints, the other floats). The keys are sequential integers starting at 0 with 
no gaps, so the “outer” layer of this data structure could really just be a 
list, with the key serving as the index. The lengths of the paired arrays may 
differ from key to key.

For a visual:

{
    0: [
        numpy.array([1, 8, 15, …, 16000]),
        numpy.array([0.1, 0.1, 0.1, …, 0.1])
    ],
    1: [
        numpy.array([5, 6]),
        numpy.array([0.5, 0.5])
    ],
    …
}
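
To make the access pattern concrete, here is a rough sketch of one flat, 
memory-mapped layout that might fit (file names and the get() helper are 
placeholders; it also assumes the two arrays in a pair have equal length, 
otherwise a separate offsets table per file would be needed):

import numpy as np

# Sketch: concatenate the per-key arrays into two flat memory-mapped files,
# plus an offsets index so key i spans offsets[i]:offsets[i+1] in each file.

# One-time build step (toy stand-ins for the real data):
int_arrays = [np.array([1, 8, 15]), np.array([5, 6])]
float_arrays = [np.array([0.1, 0.1, 0.1]), np.array([0.5, 0.5])]

offsets = np.zeros(len(int_arrays) + 1, dtype=np.int64)
offsets[1:] = np.cumsum([len(a) for a in int_arrays])
np.save('offsets.npy', offsets)

total = int(offsets[-1])
ints = np.memmap('ints.dat', dtype=np.int64, mode='w+', shape=(total,))
flts = np.memmap('floats.dat', dtype=np.float64, mode='w+', shape=(total,))
for i, (a, b) in enumerate(zip(int_arrays, float_arrays)):
    ints[offsets[i]:offsets[i + 1]] = a
    flts[offsets[i]:offsets[i + 1]] = b
ints.flush()
flts.flush()

# Read side: each process opens the files read-only; the OS page cache is
# shared across processes for free.
offsets = np.load('offsets.npy')
ints = np.memmap('ints.dat', dtype=np.int64, mode='r')
flts = np.memmap('floats.dat', dtype=np.float64, mode='r')

def get(key):
    lo, hi = offsets[key], offsets[key + 1]
    return ints[lo:hi], flts[lo:hi]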

I’ve tried:
-   multiprocessing manager proxy objects, but the object was so big that 
low-level code threw an exception related to the serialization format, and 
monkey-patching around it wasn’t successful.
-   Redis, which was far too slow because of connection setup, data 
conversion, etc.
-   Numpy record arrays + memory mapping, but there is a restriction that the 
numpy arrays in each “column” must all have the same fixed size.
-   PyTables, which may be a solution (see the sketch after this list), but it 
seems to have a very steep learning curve.
-   I haven’t tried SQLite3, but I am worried about the time it takes to 
query the DB for a sequential ID and then convert the returned bytes back 
into numpy arrays.
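
For reference, the level of PyTables usage I have in mind is something like 
the following variable-length array (VLArray) sketch (file and node names are 
made up); VLArray rows may have different lengths, which matches the ragged 
pairs above:

import numpy as np
import tables  # PyTables

# Write side: one variable-length array per "column"; row i of each
# VLArray holds the arrays for key i.
with tables.open_file('data.h5', mode='w') as f:
    ints = f.create_vlarray(f.root, 'ints', tables.Int64Atom())
    flts = f.create_vlarray(f.root, 'floats', tables.Float64Atom())
    ints.append(np.array([1, 8, 15]))       # key 0
    flts.append(np.array([0.1, 0.1, 0.1]))
    ints.append(np.array([5, 6]))           # key 1
    flts.append(np.array([0.5, 0.5]))

# Read side: each process can open the file independently, read-only.
with tables.open_file('data.h5', mode='r') as f:
    pair = (f.root.ints[1], f.root.floats[1])
    # -> (array([5, 6]), array([ 0.5,  0.5]))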

Any ideas? I greatly appreciate any guidance you can provide.

Thanks,
Ryan


[Numpy-discussion] numpy.power -> numpy.random.choice Probabilities don't sum to 1

2015-12-18 Thread Ryan R. Rosario
Hi,

I have a matrix whose entries I must raise to a certain power and then 
normalize by row. After I do that, when I pass some rows to 
numpy.random.choice, I get a ValueError: probabilities do not sum to 1.

I understand that floating point is not perfect, and my matrix is so large that 
I cannot use np.longdouble because I will run out of RAM.

As an example on a smaller matrix:

np.power(mymatrix, 10, out=mymatrix)  # element-wise power, in place (float32)
row_normalized = np.apply_along_axis(lambda x: x / np.sum(x), 1, mymatrix)
sums = row_normalized.sum(axis=1)
sums[np.where(sums != 1)]  # rows whose sum is not exactly 1.0

array([ 0.9994,  0.9994,  1.0012, ...,  0.9994,  0.9994,  0.9994], dtype=float32)

np.random.choice(range(row_normalized.shape[0]), 1, p=row_normalized[0, :])
…
ValueError: probabilities do not sum to 1


I also tried the normalize function in sklearn.preprocessing and had the same 
problem.

Is there a way to avoid this problem without making manual adjustments to 
force the row sums to exactly 1?
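
For concreteness, the kind of manual adjustment I’d like to avoid looks 
roughly like this (a sketch: upcast the row to float64 and renormalize just 
before sampling, which should leave the sum within np.random.choice’s 
tolerance):

import numpy as np

# Sketch: redo the normalization in double precision right before sampling,
# so the row sum is within double-precision rounding of 1.
p = row_normalized[0, :].astype(np.float64)
p /= p.sum()
draw = np.random.choice(row_normalized.shape[0], 1, p=p)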

— Ryan