Hi James,

Thank you for the great questions! I think some of the issues you are experiencing stem from a failure on our part to convey this information clearly. The good news is that a tremendous amount of effort and focus is currently being directed toward improving our website and documentation. We also have very significant releases coming up (we are just finishing the 0.11.0 vote).
1) Here is some background to help. The main jar ("systemml-0.10.0.incubating.jar") is typically used to perform scalable machine learning across a Spark or Hadoop cluster. Spark and Hadoop each ship with a large number of jars (from a Maven viewpoint, these are treated as provided dependencies). SystemML also needs a few libraries that Spark and Hadoop do not provide (wink, some antlr, etc.), so SystemML treats these as compile-scope dependencies and bundles them into the main jar. As a result, if you would like to run SystemML on Spark or Hadoop, you only need the single SystemML jar, as in these examples:

  $SPARK_HOME/bin/spark-submit systemml-0.10.0.incubating.jar -s "print('hello world');" -exec hybrid_spark

  hadoop jar systemml-0.10.0.incubating.jar -s "print('hello world');"

So I think the compile-scope dependencies haven't been shaded because the main jar typically runs on Spark or Hadoop rather than being used as a library. Shading to relocate the namespaces and avoid collisions is a great idea for the case where the SystemML jar is used as a library.

2) One of the ideas behind SystemML is the ability to easily customize scalable machine learning algorithms. We have .tar.gz and .zip artifacts that can be unpacked and that offer the scripts as text files that can easily be modified. However, we also package the scripts into the jar files for anyone who wants to run them without modifying them. The Connection class is part of the JMLC API (see http://apache.github.io/incubator-systemml/jmlc.html), one of several APIs that can be used to run SystemML. This API is fairly specialized, and I believe that to access a script inside the jar with this API, you need to call getResourceAsStream and read the script from the resulting InputStream.
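For example, here is a minimal sketch of that pattern. The helper class and method names are my own, and the resource path assumes the /scripts layout of the jar as you observed:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadScriptFromJar {

    // Read a classpath resource (for example, a DML script packaged in the
    // SystemML jar under /scripts) into a String via getResourceAsStream.
    public static String readResourceAsString(String path) throws IOException {
        try (InputStream is = ReadScriptFromJar.class.getResourceAsStream(path)) {
            if (is == null) {
                throw new IOException("Resource not found on classpath: " + path);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = is.read(buf)) != -1;) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

With the SystemML jar on the classpath, readResourceAsString("/scripts/algorithms/Univar-Stats.dml") should return the script text, which can then be handed to the JMLC Connection (for instance via its prepareScript method).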
However, if you would like a programmatic API to SystemML, I would recommend the new SystemML MLContext API (0.10.0 contains an old MLContext API, and the very-soon-to-be-released 0.11.0 contains the completely redesigned MLContext API). The new MLContext API features many conveniences, such as ScriptFactory.dmlFromResource(), which lets you easily read a DML file from the SystemML jar. For more information about this API, see http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html

3) As a Java developer with a lot of Maven experience, my first inclination when working with SystemML was to try to use the main jar as a library, and I believe you are having the same experience I did. Because of the way the project is structured, using SystemML as a library perhaps isn't as easy as it should be. Here are the steps that I just tried out to use the latest SystemML project as a library (using the new MLContext API):

A) Check out the latest project and install the snapshot artifacts in the local Maven repo:

  mvn clean install -P distribution -DskipTests

B) Create a basic Java Maven example project with the SystemML snapshot dependency. Since SystemML treats most dependencies as provided scope, I'll re-specify the Spark dependencies with the default (compile) scope in my example project's pom.xml:

  <dependency>
    <groupId>org.apache.systemml</groupId>
    <artifactId>systemml</artifactId>
    <version>0.12.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>

C) Create a Java class to run an algorithm on SystemML using the new MLContext API.
This example reads the Univar-Stats.dml script from the jar file and runs the Haberman dataset through the algorithm. It outputs the results to the console for viewing.

  package org.apache.systemml.example;

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.sysml.api.mlcontext.MLContext;
  import org.apache.sysml.api.mlcontext.Script;
  import org.apache.sysml.api.mlcontext.ScriptFactory;

  public class MLContextExample {

      public static void main(String[] args) throws Exception {
          SparkConf conf = new SparkConf().setAppName("MLContextExample").setMaster("local");
          JavaSparkContext sc = new JavaSparkContext(conf);
          MLContext ml = new MLContext(sc);

          // Read the DML script out of the SystemML jar.
          Script uni = ScriptFactory.dmlFromResource("/scripts/algorithms/Univar-Stats.dml");

          // Input matrix: the Haberman dataset, read from a URL.
          String habermanUrl = "http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data";
          uni.in("A", new java.net.URL(habermanUrl));

          // Input K: the feature types for the four columns.
          List<String> list = new ArrayList<String>();
          list.add("1.0,1.0,1.0,2.0");
          JavaRDD<String> typesRDD = sc.parallelize(list);
          uni.in("K", typesRDD);

          uni.in("$CONSOLE_OUTPUT", true);
          ml.execute(uni);
      }
  }

I believe the JMLC API was originally designed to be a lightweight API. However, it currently requires at least the Hadoop dependencies. Since a primary focus of SystemML is distributing machine learning across Spark and Hadoop clusters, it typically requires a significant number of transitive dependencies to accomplish this.

I hope that helps.

Deron

On Fri, Oct 21, 2016 at 9:24 AM, Dyer, James <james.d...@ingramcontent.com> wrote:
> Taking a look at "systemml-0.10.0.incubating.jar" from maven-central...
>
> 1. Looks like we have code embedded here in other projects' namespaces:
> org.apache.wink , org.antlr, org.abego, com.google.common . Shouldn't we
> be using shade to re-namespace these so users do not have potential clashes?
>
> 2.
> I see the .dml files are included in the .jar under "scripts". But I
> am not sure how to load and use these with an oasaj.Connection. Is there
> something I am missing, or is this a to-do?
>
> 3. Including "org.apache.systemml:systemml:0.10.0-incubating" in my
> project's POM did not seem to pull in any transitive dependencies. But
> just to instantiate an oasaj.Connection, it needed hadoop-common and
> hadoop-mapreduce-client-common. Is this an oversight or am I using the
> jar in the wrong way? Also, is there any plan to remove these
> dependencies? Ideally using the Java connector wouldn't need to pull in a
> significant portion of Hadoop.
>
> James Dyer
> Ingram Content Group