Thanks for the suggestions, Frank and Orhan - I'll give chunking the matrix a try.
Best,

Anthony

On Thu, Jan 4, 2018 at 8:14 PM, Frank McQuillan <fmcquil...@pivotal.io> wrote:

> I like Orhan's suggestion; it is less work.
>
> Slight correction to my comment above:
>
> "For each of the n chunks, if there is no non-zero value in the 100th
> column, you will get an error that looks like this..."
>
> I meant:
>
> "For each of the n chunks, if there is no value of any kind (0 or
> otherwise) in the 100th column, you will get an error that looks like
> this..."
>
> Frank
>
> On Thu, Jan 4, 2018 at 5:26 PM, Orhan Kislal <okis...@pivotal.io> wrote:
>
>> Hello Anthony,
>>
>> I agree with Frank's suggestion; operating on chunks of the matrix should
>> work. An alternate workaround for the 100th-column issue you might
>> encounter is this: check if there exists a value for the first (or last,
>> or any other) row in the last column. If there is one, you can use the
>> chunk as is. If not, put 0 as the value of that particular row/column.
>> This will ensure the matrix size is calculated correctly, will not affect
>> the output, and will not require any additional operation for the
>> assembly of the final vector.
>>
>> Please let us know if you have any questions.
>>
>> Thanks,
>>
>> Orhan Kislal
>>
>> On Thu, Jan 4, 2018 at 12:12 PM, Frank McQuillan <fmcquil...@pivotal.io> wrote:
>>
>>> Anthony,
>>>
>>> In that case, I think you are hitting the 1 GB PostgreSQL limit.
>>>
>>> Operations on the sparse matrix format require loading into memory 2
>>> INTEGERs for row/col plus the value (INTEGER, DOUBLE PRECISION, whatever
>>> size it is). For your matrix that means the 2 INTEGERs alone (~1.25e8
>>> nonzero entries x 8 bytes) are ~1e9 bytes, which is already at the limit
>>> without even considering the values.
>>>
>>> So I would suggest you do the computation in blocks. One approach to
>>> this:
>>>
>>> * chunk your long matrix into n smaller VIEWs, say n=10 (MADlib matrix
>>> operations do work on VIEWs)
>>> * call matrix_vec_mult for each chunk
>>> * reassemble the n result vectors into the final vector
>>>
>>> You could do this in a PL/pgSQL or PL/Python function.
>>>
>>> There is one subtlety to be aware of, though, because you are working
>>> with sparse matrices. For each of the n chunks, if there is no non-zero
>>> value in the 100th column, you will get an error that looks like this:
>>>
>>> madlib=# SELECT madlib.matrix_vec_mult('mat_a_view',
>>>                                        NULL,
>>>                                        array[1,2,3,4,5,6,7,8,9,10]
>>>                                        );
>>> ERROR: plpy.Error: Matrix error: Dimension mismatch between matrix (1 x
>>> 9) and vector (10 x 1)
>>> CONTEXT: Traceback (most recent call last):
>>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>>     matrix_in, in_args, vector)
>>>   PL/Python function "matrix_vec_mult", line 2031, in matrix_vec_mult
>>>   PL/Python function "matrix_vec_mult", line 77, in _assert
>>> PL/Python function "matrix_vec_mult"
>>>
>>> See the explanation at the top of
>>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>>> regarding dimensionality of sparse matrices.
>>>
>>> One way around this is to add a (fake) row to the bottom of your VIEW
>>> with a 0 in the 100th column. But if you do this, be sure to drop the
>>> last (fake) entry of each of the n intermediate vectors before you
>>> assemble them into the final vector.
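A minimal PL/pgSQL sketch of this chunked approach, combining the view-per-chunk idea with Orhan's zero-padding trick (which avoids having to strip fake entries afterwards). It assumes the sparse matrix is stored in a table mat_sparse(row_id, col_id, value) with rows numbered 1..n_rows, as in the format linked above, and that matrix_vec_mult returns the product as a DOUBLE PRECISION[] as in the example call; the function and table names here are illustrative, not part of MADlib:

CREATE OR REPLACE FUNCTION chunked_matrix_vec_mult(
    vec      DOUBLE PRECISION[],   -- dense vector, length = n_cols
    n_rows   BIGINT,               -- e.g. 125000000
    n_cols   INTEGER,              -- e.g. 100
    n_chunks INTEGER               -- e.g. 10
) RETURNS DOUBLE PRECISION[] AS $$
DECLARE
    chunk_rows   BIGINT := ceil(n_rows::numeric / n_chunks)::bigint;
    lo           BIGINT;
    hi           BIGINT;
    has_corner   BOOLEAN;
    chunk_result DOUBLE PRECISION[];
    result       DOUBLE PRECISION[] := '{}';
BEGIN
    FOR i IN 0 .. n_chunks - 1 LOOP
        lo := i * chunk_rows + 1;
        hi := least((i + 1) * chunk_rows, n_rows);

        -- One horizontal slice of the matrix, with row_ids rebased to 1.
        EXECUTE 'CREATE OR REPLACE VIEW mat_chunk_view AS '
             || 'SELECT row_id - ' || lo || ' + 1 AS row_id, col_id, value '
             || 'FROM mat_sparse WHERE row_id BETWEEN ' || lo || ' AND ' || hi;

        -- Orhan's trick: make sure the chunk's bottom-right corner holds an
        -- explicit entry, so its dimensions are inferred as chunk_rows x
        -- n_cols and each partial result lines up on reassembly. An explicit
        -- 0 does not change the product, so nothing needs to be dropped
        -- afterwards (unlike the fake-row variant). Note this permanently
        -- adds a harmless explicit zero to the source table.
        SELECT EXISTS (SELECT 1 FROM mat_sparse
                       WHERE row_id = hi AND col_id = n_cols) INTO has_corner;
        IF NOT has_corner THEN
            INSERT INTO mat_sparse VALUES (hi, n_cols, 0);
        END IF;

        -- Multiply this slice and append its piece of the output.
        SELECT madlib.matrix_vec_mult('mat_chunk_view', NULL, vec)
          INTO chunk_result;
        result := result || chunk_result;
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql;

SELECT chunked_matrix_vec_mult(v, 125000000, 100, 10), with v a 100-element DOUBLE PRECISION[], would then return the full result in one array. One caveat: at 1.25e8 rows the assembled DOUBLE PRECISION[] is itself ~1e9 bytes, just under the 1 GB allocation limit, so for anything much larger have each iteration write its chunk's entries into a results table instead of concatenating arrays.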
>>>
>>> Frank
>>>
>>> On Wed, Jan 3, 2018 at 8:15 PM, Anthony Thomas <ahtho...@eng.ucsd.edu> wrote:
>>>
>>>> Thanks Frank - the answer to both your questions is "yes".
>>>>
>>>> Best,
>>>>
>>>> Anthony
>>>>
>>>> On Wed, Jan 3, 2018 at 3:13 PM, Frank McQuillan <fmcquil...@pivotal.io> wrote:
>>>>
>>>>> Anthony,
>>>>>
>>>>> Correct, the install-check error you are seeing is not related.
>>>>>
>>>>> A couple of questions:
>>>>>
>>>>> (1) Are you using:
>>>>>
>>>>> -- Multiply matrix with vector
>>>>> matrix_vec_mult( matrix_in, in_args, vector)
>>>>>
>>>>> (2) Is matrix_in encoded in sparse format like at the top of
>>>>> http://madlib.apache.org/docs/latest/group__grp__matrix.html
>>>>>
>>>>> e.g., like this?
>>>>>
>>>>>  row_id | col_id | value
>>>>> --------+--------+-------
>>>>>       1 |      1 |     9
>>>>>       1 |      5 |     6
>>>>>       1 |      6 |     6
>>>>>       2 |      1 |     8
>>>>>       3 |      1 |     3
>>>>>       3 |      2 |     9
>>>>>       4 |      7 |     0
>>>>>
>>>>> Frank
>>>>>
>>>>> On Wed, Jan 3, 2018 at 2:52 PM, Anthony Thomas <ahtho...@eng.ucsd.edu> wrote:
>>>>>
>>>>>> Okay - thanks Ivan, and good to know about support for Ubuntu from
>>>>>> Greenplum!
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Anthony
>>>>>>
>>>>>> On Wed, Jan 3, 2018 at 2:38 PM, Ivan Novick <inov...@pivotal.io> wrote:
>>>>>>
>>>>>>> Hi Anthony, this does NOT look like an Ubuntu problem; in fact,
>>>>>>> there is OSS Greenplum officially available on Ubuntu, as you can
>>>>>>> see here:
>>>>>>> http://greenplum.org/install-greenplum-oss-on-ubuntu/
>>>>>>>
>>>>>>> Greenplum and PostgreSQL do limit each field (row/col combination)
>>>>>>> to 1 GB, but there are techniques for managing data sets within
>>>>>>> these constraints. I will let someone with more experience than me
>>>>>>> working with matrices answer what the best way to do so is in a
>>>>>>> case like the one you have provided.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Ivan
>>>>>>>
>>>>>>> On Wed, Jan 3, 2018 at 2:22 PM, Anthony Thomas <ahtho...@eng.ucsd.edu> wrote:
>>>>>>>
>>>>>>>> Hi MADlib folks,
>>>>>>>>
>>>>>>>> I have a large, tall-and-skinny sparse matrix which I'm trying to
>>>>>>>> multiply by a dense vector. The matrix is 1.25e8 by 100 with
>>>>>>>> approximately 1% nonzero values. This operation always triggers an
>>>>>>>> error from Greenplum:
>>>>>>>>
>>>>>>>> plpy.SPIError: invalid memory alloc request size 1073741824
>>>>>>>> (context 'accumArrayResult') (mcxt.c:1254) (plpython.c:4957)
>>>>>>>> CONTEXT: Traceback (most recent call last):
>>>>>>>>   PL/Python function "matrix_vec_mult", line 24, in <module>
>>>>>>>>     matrix_in, in_args, vector)
>>>>>>>>   PL/Python function "matrix_vec_mult", line 2044, in matrix_vec_mult
>>>>>>>>   PL/Python function "matrix_vec_mult", line 2001, in _matrix_vec_mult_dense
>>>>>>>> PL/Python function "matrix_vec_mult"
>>>>>>>>
>>>>>>>> Some Googling suggests this error is caused by a hard limit in
>>>>>>>> Postgres which restricts the maximum size of an array to 1 GB. If
>>>>>>>> this is indeed the cause of the error I'm seeing, does anyone have
>>>>>>>> suggestions for how to circumvent the issue? It comes up in other
>>>>>>>> cases as well, like transposing a tall-and-skinny matrix. MVM with
>>>>>>>> smaller matrices works fine.
>>>>>>>>
>>>>>>>> Here is the relevant version information:
>>>>>>>>
>>>>>>>> SELECT VERSION();
>>>>>>>> PostgreSQL 8.3.23 (Greenplum Database 5.1.0 build dev) on
>>>>>>>> x86_64-pc-linux-gnu, compiled by GCC gcc
>>>>>>>> (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609, compiled on
>>>>>>>> Dec 21 2017 09:09:46
>>>>>>>>
>>>>>>>> SELECT madlib.version();
>>>>>>>> MADlib version: 1.12, git revision: unknown, cmake configuration
>>>>>>>> time: Thu Dec 21 18:04:47 UTC 2017, build type: RelWithDebInfo,
>>>>>>>> build system: Linux-4.4.0-103-generic, C compiler: gcc 4.9.3,
>>>>>>>> C++ compiler: g++ 4.9.3
>>>>>>>>
>>>>>>>> MADlib install-check reported one error in the "convex" module
>>>>>>>> related to "loss too high", which seems unrelated to the issue
>>>>>>>> described above. I know Ubuntu isn't officially supported by
>>>>>>>> Greenplum, so I'd like to be confident this issue isn't just the
>>>>>>>> result of using an unsupported OS. Please let me know if any other
>>>>>>>> information would be helpful.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Anthony
>>>>>>>
>>>>>>> --
>>>>>>> Ivan Novick, Product Manager, Pivotal Greenplum
>>>>>>> inov...@pivotal.io -- (Mobile) 408-230-6491
>>>>>>> https://www.youtube.com/GreenplumDatabase