Do you really need the results of all 3 million computations, or only the top- and bottom-most correlation coefficients? Could the correlations be computed on a sample of rows and used to estimate the distribution of coefficients? Would it make sense to precompute the results offline and instead focus on fast key-value retrieval, with something like Elasticsearch or ScyllaDB?
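(A rough sketch of the sampling idea, assuming the matrix can be loaded as a NumPy array; the dimensions, sample size, and target row below are toy placeholders, not values from the thread:)

    import numpy as np

    rng = np.random.default_rng(0)

    # toy stand-in for the real 3M x 50k integer matrix
    matrix = rng.integers(0, 100, size=(50_000, 200))
    target_row = 42

    # correlate the target row against a random sample of rows
    # instead of against all of them
    sample_idx = rng.choice(matrix.shape[0], size=5_000, replace=False)
    target = matrix[target_row].astype(np.float64)
    sample = matrix[sample_idx].astype(np.float64)

    # vectorized one-vs-many Pearson correlation
    t = target - target.mean()
    s = sample - sample.mean(axis=1, keepdims=True)
    corr = (s @ t) / (np.linalg.norm(s, axis=1) * np.linalg.norm(t))

    # the empirical distribution of the sampled coefficients approximates
    # where the full set of 3 million coefficients would fall
    print(np.percentile(corr, [1, 50, 99]))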
Spark is a compute framework rather than a serving backend; I don't think it's designed with retrieval SLAs in mind, and you may find those SLAs difficult to maintain.

On Wed, Jul 17, 2019 at 3:14 PM Gautham Acharya <gauth...@alleninstitute.org> wrote:

Thanks for the reply, Bobby.

I've received notice that we can probably tolerate response times of up to 30 seconds. Would this be more manageable? 5 seconds was an initial ask, but 20-30 seconds is also a reasonable response time for our use case.

With the new SLA, do you think that we can easily perform this computation in Spark?

--gautham

From: Bobby Evans [mailto:reva...@gmail.com]
Sent: Wednesday, July 17, 2019 7:06 AM
To: Steven Stetzler <steven.stetz...@gmail.com>
Cc: Gautham Acharya <gauth...@alleninstitute.org>; user@spark.apache.org
Subject: Re: [Beginner] Run compute on large matrices and return the result in seconds?

Let's do a few quick rules of thumb to get an idea of what kind of processing power you will need in general to do what you want.

You have 3,000,000 rows by 50,000 columns of ints. Each int is 4 bytes, so that ends up being about 560 GB that you need to fully process in 5 seconds.

If you are reading this from spinning disks (which average about 80 MB/s), you would need at least 1,450 disks just to read the data in 5 seconds (that number can vary a lot depending on the storage format and your compression ratio).

If you are reading the data over a network (let's say 10GigE, even though in practice you cannot get that in the cloud easily), you would need about 90 NICs just to read the data in 5 seconds; again, depending on the compression ratio this may be lower.

If you assume you have a cluster where it all fits in main memory and you have cached all of the data in memory (which in practice I have seen read at somewhere between 12 and 16 GB/s on most modern systems), you would need between 7 and 10 machines just to read through the data once in 5 seconds. Spark also stores cached data compressed, so you might need less as well.

All the numbers fit with things that Spark should be able to handle, but a 5-second SLA is very tight for this amount of data.

Can you make this work with Spark? Probably. Does Spark have something built in that will make this fast and simple for you? I doubt it; you have some very tight requirements and will likely have to write something custom to make it work the way you want.

On Thu, Jul 11, 2019 at 4:12 PM Steven Stetzler <steven.stetz...@gmail.com> wrote:

Hi Gautham,

I am a beginner Spark user too and I may not have a complete understanding of your question, but I thought I would start a discussion anyway. Have you looked into using Spark's built-in Correlation function? (https://spark.apache.org/docs/latest/ml-statistics.html)
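(For reference, a minimal PySpark example of that API, adapted from the linked ml-statistics docs; the toy vectors and column name are placeholders. Note that Correlation.corr returns the pairwise correlation matrix between the components of the vector column, so mapping the 3-million-row one-vs-all case onto it is a separate question:)

    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.stat import Correlation

    spark = SparkSession.builder.appName("corr-example").getOrCreate()

    # each record holds one vector; Correlation.corr computes the Pearson
    # correlation matrix between the vector's components
    data = [(Vectors.dense([1.0, 0.0, 3.0]),),
            (Vectors.dense([4.0, 5.0, 0.0]),),
            (Vectors.dense([6.0, 7.0, 8.0]),),
            (Vectors.dense([9.0, 0.0, 1.0]),)]
    df = spark.createDataFrame(data, ["features"])

    corr_matrix = Correlation.corr(df, "features", "pearson").head()[0]
    print(corr_matrix)  # 3 x 3 DenseMatrix of pairwise correlations

    spark.stop()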
This might let you get what you want (per-row correlation against the same matrix) without having to deal with parallelizing the computation yourself. Also, I think the question of how quickly you can get your results is largely a data-access question rather than a how-fast-is-Spark question. As long as you can exploit data parallelism (i.e., you can partition up your data), Spark will give you a speedup. You can imagine that if you had a large machine with many cores and ~100 GB of RAM (e.g. an m5.12xlarge EC2 instance), you could fit your problem in main memory and perform your computation with thread-based parallelism. This might get your result relatively quickly. For a dedicated application with well-constrained memory and compute requirements, it might not be a bad option to do everything on one machine as well. Accessing an external database and distributing work over a large number of computers can add overhead that might be out of your control.

Thanks,
Steven

On Thu, Jul 11, 2019 at 9:24 AM Gautham Acharya <gauth...@alleninstitute.org> wrote:

Ping? I would really appreciate advice on this! Thank you!

From: Gautham Acharya
Sent: Tuesday, July 9, 2019 4:22 PM
To: user@spark.apache.org
Subject: [Beginner] Run compute on large matrices and return the result in seconds?

This is my first email to this mailing list, so I apologize if I made any errors.

My team is going to be building an application, and I'm investigating some options for distributed compute systems. We want to perform computations on large matrices.

The requirements are as follows:

1. The matrices can be expected to be up to 50,000 columns x 3 million rows. The values are all integers (except for the row/column headers).

2. The application needs to select a specific row and calculate the correlation coefficient (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) against every other row. This means up to 3 million different calculations.

3. A sorted list of the correlation coefficients and their corresponding row keys needs to be returned in under 5 seconds.

4. Users will eventually request random row/column subsets to run calculations on, so precomputing our coefficients is not an option. This needs to be done on request.

I've been looking at many compute solutions, but I'd consider Spark first due to its widespread use and community. I currently have my data loaded into Apache HBase for a different scenario (random access of rows/columns). I've naively tried loading a dataframe from the CSV using a Spark instance hosted on AWS EMR, but getting the results for even a single correlation takes over 20 seconds.

Thank you!

--gautham
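(For reference, the per-row requirement in point 2 above maps more directly onto pandas' DataFrame.corrwith than DataFrame.corr; a minimal single-machine sketch, with toy dimensions standing in for the real 3M x 50k matrix. Whether something like this meets the 5-second SLA at full scale is exactly the open question in the thread:)

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)

    # toy stand-in for the real 3M x 50k integer matrix, keyed by row name
    df = pd.DataFrame(rng.integers(0, 100, size=(10_000, 200)),
                      index=[f"row_{i}" for i in range(10_000)])
    target_key = "row_42"

    # Pearson correlation of the selected row against every row,
    # sorted so the strongest correlations come first
    corrs = df.corrwith(df.loc[target_key], axis=1).sort_values(ascending=False)
    print(corrs.head())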
--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016