This is my first email to this mailing list, so I apologize if I made any 
errors.



My team is going to be building an application, and I'm investigating some options for distributed compute systems. We want to perform computations on large matrices.



The requirements are as follows:



1.     The matrices can be expected to be up to 50,000 columns x 3 million 
rows. The values are all integers (except for the row/column headers).

2.     The application needs to select a specific row and calculate the correlation coefficient ( https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html ) against every other row; there is a small sketch of this after the list. This means up to 3 million different calculations.

3.     A sorted list of the correlation coefficients and their corresponding row keys needs to be returned in under 5 seconds.

4.     Users will eventually request random row/column subsets to run 
calculations on, so precomputing our coefficients is not an option. This needs 
to be done on request.
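
To make the computation concrete, here is a minimal single-machine sketch of what one request looks like. It uses plain NumPy/pandas with a toy matrix and placeholder row keys, so the shapes and names below are illustrative rather than the real data:

import numpy as np
import pandas as pd

# Toy stand-in for the real matrix: integer values, rows keyed by a row id.
# The real data would be up to 3 million rows x 50,000 columns.
df = pd.DataFrame(
    np.random.randint(0, 100, size=(1000, 50)),
    index=[f"row_{i}" for i in range(1000)],
)

def correlate_against_row(df: pd.DataFrame, row_key: str) -> pd.Series:
    """Pearson correlation (the pandas .corr() default) of one row against
    every row, returned sorted in descending order."""
    target = df.loc[row_key].to_numpy(dtype=np.float64)
    data = df.to_numpy(dtype=np.float64)

    # Vectorised Pearson correlation: centre each row, then take the
    # normalised dot product with the centred target row.
    data_c = data - data.mean(axis=1, keepdims=True)
    target_c = target - target.mean()
    numer = data_c @ target_c
    denom = np.linalg.norm(data_c, axis=1) * np.linalg.norm(target_c)
    corr = numer / denom
    return pd.Series(corr, index=df.index).sort_values(ascending=False)

print(correlate_against_row(df, "row_5").head())

At the stated scale this works out to roughly 3 million dot products of length 50,000 per request (on the order of 1.5e11 multiply-adds), which is why I'm looking at distributed options.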



I've been looking at many compute solutions, but I'd consider Spark first due to its widespread use and community. I currently have my data loaded into Apache HBase for a different scenario (random access of rows/columns). I've naively tried loading a DataFrame from the CSV using a Spark instance hosted on AWS EMR, but getting the results for even a single correlation takes over 20 seconds.
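
I'm not reproducing the exact job here, but the naive test was roughly the following shape (the path and column names are placeholders, and it just times a single Pearson correlation between two columns via DataFrame.stat.corr rather than the row-wise workload described above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("corr-latency-test").getOrCreate()

# Load the matrix straight from CSV; no particular partitioning or caching.
# Placeholder path and column names.
df = spark.read.csv("s3://<bucket>/matrix.csv", header=True, inferSchema=True)

# Single Pearson correlation between two columns, as a latency baseline.
print(df.stat.corr("col_00001", "col_00002"))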



Thank you!


--gautham
