Re: Numpy and Terabyte data

2018-01-03 Thread Albert-Jan Roskam

On Jan 2, 2018 18:27, Rustom Mody wrote:
>
> Someone who works in Hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> etc.) analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (and pandas,
> etc.) not to work if the data does not fit in memory.
>
> Sure, *python* can handle (streams of) terabyte data, but I guess *numpy*
> cannot.
>
> Is there a more sophisticated answer?
>
> ["Terabyte" is just a figure of speech for "too large for main memory"]

Have a look at PySpark and pyspark.ml. PySpark has its own kind of DataFrame.
Very, very cool stuff.
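
A minimal sketch of the idea (untested; the file name and the use of CSV are
just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-sketch").getOrCreate()

# Spark reads the file lazily and splits it into partitions, so the whole
# dataset never has to fit in RAM on any one machine.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# count / mean / stddev / min / max per column, computed in a distributed way
df.describe().show()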

Dask DataFrames have been mentioned already.

numpy has memmapped arrays: 
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.memmap.html
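
For example (rough sketch; the file name, dtype and shape are invented, and
the file has to exist already):

import numpy as np

# Map a large on-disk file as if it were an in-memory array; the OS pages
# pieces in and out as they are touched.
arr = np.memmap("huge.dat", dtype=np.float64, mode="r", shape=(10**9,))

# Reduce in slices so only one chunk is resident in RAM at a time.
chunk = 10**7
total = 0.0
for start in range(0, arr.shape[0], chunk):
    total += arr[start:start + chunk].sum()
mean = total / arr.shape[0]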


Re: Numpy and Terabyte data

2018-01-02 Thread Rustom Mody
On Wednesday, January 3, 2018 at 1:43:40 AM UTC+5:30, Paul Moore wrote:
> On 2 January 2018 at 17:24, Rustom Mody wrote:
> > Someone who works in Hadoop asked me:
> >
> > If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> > etc.) analysis on it?
> >
> > I said: No (I don't think so, at least!), i.e. I expect numpy (and
> > pandas, etc.) not to work if the data does not fit in memory.
> >
> > Sure, *python* can handle (streams of) terabyte data, but I guess
> > *numpy* cannot.
> >
> > Is there a more sophisticated answer?
> >
> > ["Terabyte" is just a figure of speech for "too large for main memory"]
> 
> You might want to look at Dask (https://pypi.python.org/pypi/dask,
> docs at http://dask.pydata.org/en/latest/).

Thanks. That looks like what I was asking about.


Re: Numpy and Terabyte data

2018-01-02 Thread Paul Moore
On 2 January 2018 at 17:24, Rustom Mody wrote:
> Someone who works in Hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> etc.) analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (and pandas,
> etc.) not to work if the data does not fit in memory.
>
> Sure, *python* can handle (streams of) terabyte data, but I guess *numpy*
> cannot.
>
> Is there a more sophisticated answer?
>
> ["Terabyte" is just a figure of speech for "too large for main memory"]

You might want to look at Dask (https://pypi.python.org/pypi/dask,
docs at http://dask.pydata.org/en/latest/).

I've not used it myself, but I believe it's designed for very much the
sort of use case you describe.
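
From the docs, the usage would be something along these lines (again
untested; the file pattern and column name are made up):

import dask.dataframe as dd

# One logical DataFrame over many files; the data is split into partitions
# and nothing is read until compute() is called.
df = dd.read_csv("data-*.csv")

result = df["value"].mean()   # builds a task graph, still lazy
print(result.compute())       # runs the out-of-core computation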
Paul


Re: Numpy and Terabyte data

2018-01-02 Thread Irving Duran
I've never done that kind of analysis on a dataset that large solely in
Python, so I don't know where the cap is in terms of how much data Python
can handle given the memory available.  Now, if I understand what you are
trying to do, you can achieve it by leveraging Apache Spark and invoking
"pyspark", which lets you keep data in memory and/or spill it to disk.
Also, if you are working with Hadoop, you can use Spark to move/transfer
data back and forth.
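
A rough sketch of what that looks like (the HDFS paths and the column name
are invented, and your cluster setup may differ):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-transfer").getOrCreate()

# Pull data out of Hadoop, aggregate it in a distributed way, and write the
# (much smaller) result back to HDFS.
df = spark.read.parquet("hdfs:///warehouse/events/")
summary = df.groupBy("key").count()
summary.write.mode("overwrite").parquet("hdfs:///warehouse/event_counts/")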


Thank You,

Irving Duran

On Tue, Jan 2, 2018 at 12:06 PM,  wrote:

> I'm not sure if I'll be laughed at, but statistics computed on a randomized
> sample should resemble those of the whole.
>
> If you need the min/max, then take min(min(each node)) and max(max(each node)).
> If you need the average, then you need sum(sum(each node)) / sum(count(each
> node)).*
>
> * You'll likely need to use log here, as you'll probably overflow.
>
> It doesn't really matter how much numpy can handle; you just need to collate
> the data properly and defer the actual calculation until the per-node
> calculations are complete.
>
> Also, numpy should store values more densely than plain Python objects.
>
>


Re: Numpy and Terabyte data

2018-01-02 Thread jason
I'm not sure if I'll be laughed at, but statistics computed on a randomized
sample should resemble those of the whole.

If you need the min/max, then take min(min(each node)) and max(max(each node)).
If you need the average, then you need sum(sum(each node)) / sum(count(each
node)).*

* You'll likely need to use log here, as you'll probably overflow.

It doesn't really matter how much numpy can handle; you just need to collate
the data properly and defer the actual calculation until the per-node
calculations are complete.
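
A toy illustration of what I mean, with made-up per-node numbers:

partials = [  # one small dict of partial results per node (numbers invented)
    {"min": 0.1, "max": 9.8, "sum": 5.2e6, "count": 1_000_000},
    {"min": 0.3, "max": 12.1, "sum": 4.9e6, "count": 950_000},
]

# Only these tiny summaries ever cross the network; no single machine
# needs to see all of the data.
global_min = min(p["min"] for p in partials)
global_max = max(p["max"] for p in partials)
global_mean = sum(p["sum"] for p in partials) / sum(p["count"] for p in partials)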

Also, numpy should store values more densely than plain Python objects.




Numpy and Terabyte data

2018-01-02 Thread Rustom Mody
Someone who works in Hadoop asked me:

If our data is in terabytes, can we do statistical (i.e. numpy, pandas, etc.)
analysis on it?

I said: No (I don't think so, at least!), i.e. I expect numpy (and pandas,
etc.) not to work if the data does not fit in memory.

Sure, *python* can handle (streams of) terabyte data, but I guess *numpy*
cannot.

Is there a more sophisticated answer?

["Terabyte" is just a figure of speech for "too large for main memory"]
