Hi James,

Thank you for the great questions! I think some of the issues you are experiencing stem from a failure on our part to convey this information clearly. The good news is that a tremendous amount of effort and focus is currently being directed toward improving our website and documentation. We also have very significant releases coming up (we are just finishing the 0.11.0 vote).
1) Here is some background to help. The main jar ("systemml-0.10.0.incubating.jar") is typically used to perform scalable machine learning across a Spark or Hadoop cluster. Spark and Hadoop each ship with a large number of jars (from a Maven viewpoint, these are treated as provided dependencies). SystemML also needs a few libraries that Spark and Hadoop do not provide (wink, some antlr, etc.), so SystemML treats these as compile-scope dependencies and bundles them into the main jar. As a result, if you would like to run SystemML on Spark or Hadoop, you only need the single SystemML jar, as in these examples:

  $SPARK_HOME/bin/spark-submit systemml-0.10.0.incubating.jar -s "print('hello world');" -exec hybrid_spark

  hadoop jar systemml-0.10.0.incubating.jar -s "print('hello world');"

So I think the compile-scope dependencies haven't been shaded because the main jar typically runs on Spark or Hadoop rather than being used as a library. Shading to relocate the namespaces and avoid collisions is a great idea for the case where the SystemML jar is used as a library.

2) One of the ideas behind SystemML is the ability to easily customize scalable machine learning algorithms. We have .tar.gz and .zip artifacts that can be unpacked and that offer the scripts as text files that can easily be modified. However, we also package the scripts into the jar files for anyone who wants to run them without modifying them. The Connection class is part of the JMLC API (see http://apache.github.io/incubator-systemml/jmlc.html), one of several APIs that can be used to run SystemML. This API is fairly specialized, and I believe that to access a script inside the jar with this API, you need to call getResourceAsStream and read the script from the resulting InputStream.
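For example, here is a minimal sketch of that pattern. The helper class and method names are my own, and the resource path assumes the /scripts layout of the jar as you observed:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ReadScriptFromJar {

    // Read a classpath resource (for example, a DML script packaged in the
    // SystemML jar under /scripts) into a String via getResourceAsStream.
    public static String readResourceAsString(String path) throws IOException {
        try (InputStream is = ReadScriptFromJar.class.getResourceAsStream(path)) {
            if (is == null) {
                throw new IOException("Resource not found on classpath: " + path);
            }
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = is.read(buf)) != -1;) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

With the SystemML jar on the classpath, readResourceAsString("/scripts/algorithms/Univar-Stats.dml") should return the script text, which can then be handed to the JMLC Connection (for instance via its prepareScript method).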
However, if you would like a programmatic API to SystemML, I would recommend the new SystemML MLContext API (0.10.0 contains an old MLContext API, and the very-soon-to-be-released 0.11.0 contains the completely redesigned MLContext API). The new MLContext API features many conveniences, such as ScriptFactory.dmlFromResource(), which lets you easily read a DML file from the SystemML jar. For more information about this API, see http://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide.html

3) As a Java developer with a lot of Maven experience, my first inclination when working with SystemML was to try to use the main jar as a library, and I believe you are having the same experience I did. Because of the way the project is structured, using SystemML as a library perhaps isn't as easy as it should be. Here are the steps that I just tried out to use the latest SystemML project as a library (using the new MLContext API):

A) Check out the latest project and install the snapshot artifacts in the local Maven repo:

  mvn clean install -P distribution -DskipTests

B) Create a basic Java Maven example project with the SystemML snapshot dependency. Since SystemML treats most dependencies as provided scope, I'll re-specify the Spark dependencies with the default (compile) scope in my example project's pom.xml:

  <dependency>
    <groupId>org.apache.systemml</groupId>
    <artifactId>systemml</artifactId>
    <version>0.12.0-SNAPSHOT</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>1.4.1</version>
  </dependency>

C) Create a Java class to run an algorithm on SystemML using the new MLContext API.
This example reads the Univar-Stats.dml script from the jar file and runs the Haberman dataset through the algorithm. It outputs the results to the console for viewing.

  package org.apache.systemml.example;

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;
  import org.apache.sysml.api.mlcontext.MLContext;
  import org.apache.sysml.api.mlcontext.Script;
  import org.apache.sysml.api.mlcontext.ScriptFactory;

  public class MLContextExample {

      public static void main(String[] args) throws Exception {
          SparkConf conf = new SparkConf().setAppName("MLContextExample").setMaster("local");
          JavaSparkContext sc = new JavaSparkContext(conf);
          MLContext ml = new MLContext(sc);

          // Read the DML script out of the SystemML jar.
          Script uni = ScriptFactory.dmlFromResource("/scripts/algorithms/Univar-Stats.dml");

          // Input matrix: the Haberman dataset, read from a URL.
          String habermanUrl = "http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data";
          uni.in("A", new java.net.URL(habermanUrl));

          // Input K: the feature types for the four columns.
          List<String> list = new ArrayList<String>();
          list.add("1.0,1.0,1.0,2.0");
          JavaRDD<String> typesRDD = sc.parallelize(list);
          uni.in("K", typesRDD);

          uni.in("$CONSOLE_OUTPUT", true);
          ml.execute(uni);
      }
  }

I believe the JMLC API was originally designed to be a lightweight API. However, it currently requires at least the Hadoop dependencies. Since a primary focus of SystemML is distributing machine learning across Spark and Hadoop clusters, it typically requires a significant number of transitive dependencies to accomplish this.

I hope that helps.

Deron

On Fri, Oct 21, 2016 at 9:24 AM, Dyer, James <james.d...@ingramcontent.com> wrote:
> Taking a look at "systemml-0.10.0.incubating.jar" from maven-central...
>
> 1. Looks like we have code embedded here in other projects' namespaces:
> org.apache.wink , org.antlr, org.abego, com.google.common . Shouldn't we
> be using shade to re-namespace these so users do not have potential clashes?
>
> 2.
> I see the .dml files are included in the .jar under "scripts". But I
> am not sure how to load and use these with an oasaj.Connection. Is there
> something I am missing, or is this a to-do?
>
> 3. Including "org.apache.systemml:systemml:0.10.0-incubating" in my
> project's POM did not seem to pull in any transitive dependencies. But
> just to instantiate an oasaj.Connection, it needed hadoop-common and
> hadoop-mapreduce-client-common. Is this an oversight or am I using the
> jar in the wrong way? Also, is there any plan to remove these
> dependencies? Ideally using the Java connector wouldn't need to pull in a
> significant portion of Hadoop.
>
> James Dyer
> Ingram Content Group