Re: Passing a CoordinateMatrix to SystemML

2018-01-10 Thread Matthias Boehm
Great - I'm glad to hear that. Thanks again for catching these issues, Anthony. Regards, Matthias

On Wed, Jan 10, 2018 at 11:09 AM, Anthony Thomas wrote:
> Hey Matthias,
>
> Just wanted to confirm that the patch above works for me - I'm now able to pass
> a dataframe of sparse vectors to a DML script […]

Re: Passing a CoordinateMatrix to SystemML

2018-01-10 Thread Anthony Thomas
Hey Matthias, Just wanted to confirm that the patch above works for me - I'm now able to pass a dataframe of sparse vectors to a DML script without issue. Sorry for the slow confirmation on this - I've been out of the office for the last couple of weeks. Thanks for your help debugging this! Best, Anthony

Re: Passing a CoordinateMatrix to SystemML

2017-12-25 Thread Matthias Boehm
OK, that was very helpful - I just pushed two additional fixes which should resolve these issues. The underlying cause was an incorrect sparse row preallocation (to reduce GC overhead), which resulted in resizing issues for initial sizes of zero. These two patches fix the underlying issues, make […]

Re: Passing a CoordinateMatrix to SystemML

2017-12-24 Thread Anthony Thomas
Thanks Matthias - unfortunately I'm still running into an ArrayIndexOutOfBoundsException both when reading the file as IJV and when calling dataFrameToBinaryBlock. Just to confirm: I downloaded and compiled the latest version using:

git clone https://github.com/apache/systemml
cd systemml
mvn clean […]

Re: Passing a CoordinateMatrix to SystemML

2017-12-24 Thread Matthias Boehm
Thanks again for catching this issue, Anthony - this IJV reblock issue with large ultra-sparse matrices is now fixed in master. It likely did not show up on the 1% sample because the data was small enough to read directly into the driver. However, the dataFrameToBinaryBlock might be another […]

Re: Passing a CoordinateMatrix to SystemML

2017-12-24 Thread Matthias Boehm
Hi Anthony, thanks for helping to debug this issue. There are no limits other than the dimensions and number of non-zeros being of type long. It sounds more like an issue of converting special cases of ultra-sparse matrices. I'll try to reproduce this issue and give an update as soon as I know […]

Re: Passing a CoordinateMatrix to SystemML

2017-12-23 Thread Anthony Thomas
Okay, thanks for the suggestions - I upgraded to 1.0 and tried providing dimensions and block sizes to dataFrameToBinaryBlock, both without success. I additionally wrote out the matrix to HDFS in IJV format and am still getting the same error when calling "read()" directly in the DML. However, I created […]
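
For reference, a minimal sketch of reading an IJV matrix directly in a DML script through MLContext, as attempted above. SystemML's "text" format is the IJV coordinate format; the HDFS path and the row/column counts below are placeholders, not the actual values from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.sysml.api.mlcontext.{MLContext, ScriptFactory}

// Sketch: read an IJV matrix written to HDFS directly inside a DML script.
// The path and the rows/cols values are placeholders.
val spark = SparkSession.builder().appName("ReadIJV").getOrCreate()
val ml = new MLContext(spark)

val dml =
  """
    |X = read("hdfs:///tmp/X.ijv", rows=1000000, cols=100, format="text");
    |print(sum(X));
  """.stripMargin

ml.execute(ScriptFactory.dml(dml))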

Re: Passing a CoordinateMatrix to SystemML

2017-12-23 Thread Matthias Boehm
Given the line numbers from the stacktrace, it seems that you are using a rather old version of SystemML. Hence, I would recommend upgrading to SystemML 1.0 or at least 0.15 first. If the error persists or you're not able to upgrade, please try to call dataFrameToBinaryBlock with provided matrix characteristics […]
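
The suggestion above refers to dataFrameToBinaryBlock; a related option at the MLContext level (not necessarily what was meant here) is to attach a MatrixMetadata with explicit dimensions and non-zeros when binding the DataFrame. A hedged sketch, with all sizes as placeholder values and `ml`/`X` assumed from the earlier snippets in this thread:

import org.apache.sysml.api.mlcontext.{MatrixMetadata, ScriptFactory}

// Sketch: bind the DataFrame with explicit metadata so SystemML does not have to
// infer dimensions/non-zeros during conversion. All sizes are placeholders;
// `ml` (MLContext) and `X` (the features DataFrame) are assumed from earlier snippets.
val numRows = 1000000L
val numCols = 100L
val numNonZeros = 5000000L
val xMeta = new MatrixMetadata(numRows, numCols, numNonZeros)

val script = ScriptFactory.dml("print(sum(X));").in("X", X, xMeta)
ml.execute(script)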

Re: Passing a CoordinateMatrix to SystemML

2017-12-22 Thread Anthony Thomas
Hi Matthias, Thanks for the help! In response to your questions: 1. Sorry - this was a typo: the correct schema is [y: int, features: vector] - the column "features" was created using Spark's VectorAssembler and the underlying type is an org.apache.spark.ml.linalg.SparseVector. Calling […]
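
For context, a sketch of how a [y: int, features: vector] DataFrame is typically produced with Spark's VectorAssembler; the raw column names and input path here are hypothetical, not taken from the thread:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Sketch: assemble numeric columns into a single vector column named "features".
// The columns f1/f2/f3 and the input path are hypothetical.
val spark = SparkSession.builder().appName("AssembleFeatures").getOrCreate()
val raw = spark.read.parquet("/path/to/raw.parquet")

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

// Mostly-zero rows come out as org.apache.spark.ml.linalg.SparseVector values.
val input_data = assembler.transform(raw).select("y", "features")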

Re: Passing a CoordinateMatrix to SystemML

2017-12-22 Thread Matthias Boehm
Well, let's do the following to figure this out: 1) If the schema is indeed [label: Integer, features: SparseVector], please change the third line to val y = input_data.select("label"). 2) For debugging, I would recommend using a simple script like "print(sum(X));" and try converting X and y […]
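
A sketch of that debugging step, assuming the intent of the cut-off suggestion is to convert X and y separately, and reusing `ml` and `input_data` from Anthony's original snippet at the end of this thread:

import org.apache.sysml.api.mlcontext.ScriptFactory

// Sketch: run a trivial script against each input on its own to see which
// DataFrame conversion triggers the error. `ml` and `input_data` are assumed
// from the original snippet.
val X = input_data.select("features")
val y = input_data.select("label")

ml.execute(ScriptFactory.dml("print(sum(X));").in("X", X))  // exercises only X's conversion
ml.execute(ScriptFactory.dml("print(sum(y));").in("y", y))  // exercises only y's conversion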

Passing a CoordinateMatrix to SystemML

2017-12-21 Thread Anthony Thomas
Hi SystemML folks, I'm trying to pass some data from Spark to a DML script via the MLContext API. The data is derived from a Parquet file containing a dataframe with the schema [label: Integer, features: SparseVector]. I am doing the following: val input_data = spark.read.parquet(inputPa […]
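
Based on the truncated snippet, the setup presumably looks roughly like the sketch below; the path, the DML body, and any variable names beyond those shown are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.sysml.api.mlcontext.{MLContext, ScriptFactory}

// Sketch of the setup described above: pass a DataFrame with a SparseVector
// column to a DML script via MLContext. The path and DML body are placeholders.
val spark = SparkSession.builder().appName("CoordinateMatrixToSystemML").getOrCreate()
val ml = new MLContext(spark)

val inputPath = "/path/to/data.parquet"         // placeholder
val input_data = spark.read.parquet(inputPath)  // schema: [label: Integer, features: SparseVector]
val X = input_data.select("features")
val y = input_data.select("label")

val script = ScriptFactory.dml("print(sum(X)); print(sum(y));")
  .in("X", X)
  .in("y", y)
ml.execute(script)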