Re: Numpy and Terabyte data
On Jan 2, 2018 18:27, Rustom Mody wrote:
> Someone who works in Hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> etc.) analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (pandas,
> etc.) not to work if the data does not fit in memory.
>
> Well, sure, *python* can handle (streams of) terabyte data; I guess
> *numpy* cannot.
>
> Is there a more sophisticated answer?
>
> ["Terabyte" is just a figure of speech for "too large for main memory"]

Have a look at Pyspark and pyspark.ml. Pyspark has its own kind of
DataFrame. Very, very cool stuff.

Dask DataFrames have been mentioned already.

numpy has memmapped arrays:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.memmap.html
--
https://mail.python.org/mailman/listinfo/python-list
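To illustrate the memmap suggestion: a numpy.memmap array is backed by a file on disk and pages are loaded on demand, so the array can be far larger than RAM as long as you reduce it in slices. A minimal sketch (file path and sizes are illustrative, kept small here; a real out-of-core array would be terabyte-scale):

```python
import os
import tempfile
import numpy as np

# Disk-backed array; pages are mapped lazily rather than loaded whole.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
n = 1_000_000
arr = np.memmap(path, dtype=np.float64, mode="w+", shape=(n,))

arr[:] = 1.0   # writes go through to the backing file
arr.flush()    # make sure changes hit disk

# Reduce in chunks so only one slice needs to be resident at a time.
chunk = 100_000
total = 0.0
for start in range(0, n, chunk):
    total += float(arr[start:start + chunk].sum())

print(total)  # 1000000.0
```

The key point is the chunked loop: each `arr[start:start+chunk]` slice touches only a bounded window of the file, so peak memory stays constant no matter how large the array is.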
Re: Numpy and Terabyte data
On Wednesday, January 3, 2018 at 1:43:40 AM UTC+5:30, Paul Moore wrote:
> On 2 January 2018 at 17:24, Rustom Mody wrote:
> > Someone who works in Hadoop asked me:
> >
> > If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> > etc.) analysis on it?
> >
> > I said: No (I don't think so, at least!), i.e. I expect numpy (pandas,
> > etc.) not to work if the data does not fit in memory.
> >
> > Well, sure, *python* can handle (streams of) terabyte data; I guess
> > *numpy* cannot.
> >
> > Is there a more sophisticated answer?
> >
> > ["Terabyte" is just a figure of speech for "too large for main memory"]
>
> You might want to look at Dask (https://pypi.python.org/pypi/dask,
> docs at http://dask.pydata.org/en/latest/).

Thanks! That looks like what I was asking about.
Re: Numpy and Terabyte data
On 2 January 2018 at 17:24, Rustom Mody wrote:
> Someone who works in Hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
> etc.) analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (pandas,
> etc.) not to work if the data does not fit in memory.
>
> Well, sure, *python* can handle (streams of) terabyte data; I guess
> *numpy* cannot.
>
> Is there a more sophisticated answer?
>
> ["Terabyte" is just a figure of speech for "too large for main memory"]

You might want to look at Dask (https://pypi.python.org/dask,
docs at http://dask.pydata.org/en/latest/). I've not used it myself,
but I believe it's designed for very much the sort of use case you
describe.

Paul
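For a feel of how Dask addresses the question: dask.array exposes a numpy-like API over an array that is split into chunks, and the scheduler evaluates the chunked reductions so the whole array never has to be in memory at once. A tiny sketch, assuming Dask is installed (`pip install dask`); the sizes are illustrative:

```python
import dask.array as da

# A 100M-element array split into 1M-element chunks. Nothing is
# materialized yet; this only builds a task graph.
x = da.ones(100_000_000, chunks=1_000_000)

# The reduction runs chunk by chunk when .compute() is called.
result = (x * 2).mean().compute()
print(result)  # 2.0
```

The same chunked model extends to Dask DataFrames, which mirror a subset of the pandas API for datasets larger than memory.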
Re: Numpy and Terabyte data
I've never heard of or done that type of testing for a large dataset
solely in Python, so I don't know what the cap is, from a memory
standpoint, on what Python can handle based on memory availability.

Now, if I understand what you are trying to do, you can achieve that by
leveraging Apache Spark and invoking "pyspark", where you can store data
in memory and/or on hard disk. Also, if you are working with Hadoop, you
can use Spark to move/transfer data back and forth.

Thank You,

Irving Duran

On Tue, Jan 2, 2018 at 12:06 PM, wrote:
> I'm not sure if I'll be laughed at, but a statistical sampling of a
> randomized sample should resemble the whole.
>
> If you need min/max, then min(min(each node)).
> If you need the average, then you need
> sum(sum(each node)) / sum(count(each node)).*
>
> *You'll likely need to use log here, as you'll probably overflow.
>
> It doesn't really matter what numpy can handle; you just need to
> collate the data properly and defer the actual calculation until the
> node calculations are complete.
>
> Also, numpy should store values more densely than Python itself.
Re: Numpy and Terabyte data
I'm not sure if I'll be laughed at, but a statistical sampling of a
randomized sample should resemble the whole.

If you need min/max, then min(min(each node)).
If you need the average, then you need
sum(sum(each node)) / sum(count(each node)).*

*You'll likely need to use log here, as you'll probably overflow.

It doesn't really matter what numpy can handle; you just need to
collate the data properly and defer the actual calculation until the
node calculations are complete.

Also, numpy should store values more densely than Python itself.
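The combine step described above can be sketched in plain Python. Each node reports only its partial aggregates (sum, count, min, max) over its own shard, and the final answer is computed from those; the per-node numbers below are made up for illustration:

```python
# Per-node partial results: (sum, count, min, max) over each shard.
node_stats = [
    (4500.0, 1000, -3.2, 99.9),   # node A
    (1200.0,  400, -7.1, 88.0),   # node B
    (7800.0, 1600,  0.5, 97.3),   # node C
]

# Defer the actual calculation until all node results are in:
total_sum   = sum(s for s, c, lo, hi in node_stats)
total_count = sum(c for s, c, lo, hi in node_stats)

global_mean = total_sum / total_count              # sum of sums / sum of counts
global_min  = min(lo for s, c, lo, hi in node_stats)  # min of mins
global_max  = max(hi for s, c, lo, hi in node_stats)  # max of maxes

print(global_mean, global_min, global_max)  # 4.5 -7.1 99.9
```

This is exactly the map-reduce shape: the data never leaves the nodes, only a few floats per node do, so the reduce side stays trivially small regardless of the total dataset size.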
Numpy and Terabyte data
Someone who works in Hadoop asked me:

If our data is in terabytes, can we do statistical (i.e. numpy, pandas,
etc.) analysis on it?

I said: No (I don't think so, at least!), i.e. I expect numpy (pandas,
etc.) not to work if the data does not fit in memory.

Well, sure, *python* can handle (streams of) terabyte data; I guess
*numpy* cannot.

Is there a more sophisticated answer?

["Terabyte" is just a figure of speech for "too large for main memory"]
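The claim that plain Python can stream arbitrarily large data can be made concrete: a single pass over an iterator keeps running aggregates in O(1) memory, so the input size is irrelevant. A minimal sketch (the `huge.csv` filename in the comment is hypothetical):

```python
import math

def running_stats(values):
    """One pass over an iterable of floats; constant memory
    regardless of how many values stream through."""
    count, total = 0, 0.0
    lo, hi = math.inf, -math.inf
    for x in values:
        count += 1
        total += x
        lo = min(lo, x)
        hi = max(hi, x)
    return total / count, lo, hi

# Works on any iterable, e.g. a generator reading a huge file lazily:
#   mean, lo, hi = running_stats(float(line) for line in open("huge.csv"))
mean, lo, hi = running_stats(iter([3.0, 1.0, 2.0]))
print(mean, lo, hi)  # 2.0 1.0 3.0
```

This is the sense in which *python* handles terabyte streams: nothing here ever holds more than one value at a time. What numpy adds on top is vectorized arithmetic, which is where the fits-in-memory constraint (absent memmap, Dask, or Spark) comes from.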