GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
/tag/release-0.5.0 *Docs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Thanks to all contributors and to the community for feedback! Joseph -- Joseph Bradley Software Engineer

GraphFrames 0.4.0 release, with Apache Spark 2.1 support

2017-03-28 Thread Joseph Bradley
*Docs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc.

Re: LDA in Spark

2017-03-23 Thread Joseph Bradley
t ? > > Is there any way to get around the API and do that ? > > Thanks in advance for your insight. > > Mathieu > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc.

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Joseph Bradley
= sqlContext.read.parquet(out) // remove previous checkpoint if (iteration > checkpointInterval) { FileSystem.get(sc.hadoopConfiguration).delete(n

Re: GraphFrames 0.2.0 released

2016-08-26 Thread Joseph Bradley
This should do it: https://github.com/graphframes/graphframes/releases/tag/release-0.2.0 Thanks for the reminder! Joseph On Wed, Aug 24, 2016 at 10:11 AM, Maciej BryƄski wrote: > Hi, > Do you plan to add tag for this release on github ? >

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1 By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591 On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia wrote: > This sounds good to me as well. The one thing we should pay attention to > is how we update the docs so

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
Can you try reducing maxBins? That reduces communication (at the cost of coarser discretization of continuous features). On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley <jos...@databricks.com> wrote: > In my experience, 20K is a lot but often doable; 2K is easy; 200 is > small. C
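
A minimal sketch of that suggestion against the RDD-based tree API (the data path and all parameter values below are illustrative, not from the thread):

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    // Assumes an existing SparkContext `sc`; the input path is hypothetical.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // Lower maxBins => coarser discretization of continuous features,
    // but less communication during training.
    val model = RandomForest.trainClassifier(data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 10,
      maxBins = 16,  // try 16 or 8 in place of the default 32
      seed = 42)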

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
? > What number of features can be considered as normal? > > -- > Be well! > Jean Morozov > > On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley <jos...@databricks.com> > wrote: > >> First thought: 70K features is *a lot* for the MLlib implementation (and >>

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Joseph Bradley
First thought: 70K features is *a lot* for the MLlib implementation (and any PLANET-like implementation). Using fewer partitions is a good idea. Which Spark version was this on? On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov wrote: > The questions I have in mind: >

Re: Handling Missing Values in MLLIB Decision Tree

2016-03-22 Thread Joseph Bradley
It does not currently handle surrogate splits. You will need to preprocess your data to remove or fill in missing values. I'd recommend using the DataFrame API for that since it comes with a number of na methods. Joseph On Thu, Mar 17, 2016 at 9:51 PM, Abir Chakraborty
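
For reference, a sketch of the DataFrame na methods mentioned above (column names are made up for illustration):

    // Assumes `df` is an existing DataFrame that may contain nulls.
    val dropped = df.na.drop()                          // remove rows with any null
    val filled  = df.na.fill(0.0, Seq("age", "income")) // fill numeric nulls with 0.0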

Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
e that in most cases I simply won't hit it, but the depth > of the tree would be much more, than 30. > > > -- > Be well! > Jean Morozov > > On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com> > wrote: > >> Hi Eugene, >

Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
Spark devs & users, I want to bring attention to a proposal to merge the MLlib (spark.ml) concepts of Estimator and Model in Spark 2.0. Please comment & discuss on SPARK-14033 (not in this email thread). *TL;DR:* *Proposal*: Merge Estimator

Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
Hi, This is more a question for the user list, not the dev list, so I'll CC user. If you're using mllib.clustering.LDAModel (RDD API), then can you make sure you're using a LocalLDAModel (or convert to it from DistributedLDAModel)? You can then call topicDistributions() on the new data. If
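
A rough sketch of that conversion with the RDD-based API (assuming `trainedModel` came from LDA.run() and `newDocs` is an RDD[(Long, Vector)] of document IDs and term-count vectors; topicDistributions on LocalLDAModel requires Spark 1.5+):

    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LocalLDAModel}

    val localModel: LocalLDAModel = trainedModel match {
      case dist: DistributedLDAModel => dist.toLocal  // EM optimizer returns this type
      case local: LocalLDAModel => local
    }
    // Topic mixture for each new, unseen document.
    val topicsForNewDocs = localModel.topicDistributions(newDocs)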

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a problem with the Parquet dependency. What version of Parquet are you building Spark 1.5 off of? (I'm not that familiar with Parquet issues myself, but hopefully a SQL person can chime in.) On Tue, Dec 15, 2015 at 3:23 PM,

Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
Hi Eugene, The maxDepth parameter exists because the implementation uses Integer node IDs which correspond to positions in the binary tree. This simplified the implementation. I'd like to eventually modify it to avoid depending on tree node IDs, but that is not yet on the roadmap. There is not
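
For intuition on why depth is capped (a sketch assuming the heap-style numbering the message describes; the helper names are made up):

    // Heap-style numbering: root = 1; children of node i are 2*i and 2*i + 1.
    // A tree of depth d uses IDs up to 2^(d+1) - 1, so a 32-bit Int
    // (max 2^31 - 1) supports depth at most 30.
    def leftChild(i: Int): Int = 2 * i
    def rightChild(i: Int): Int = 2 * i + 1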

Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
pache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1 >>> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote: >>> >>>> Hi Joseph, >>>> >>>> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'

Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+. On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar wrote: > > Hi folks, > > Does anyone know whether the Grid Search capability is enabled since the > issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol > column doesn't exist" when
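
A sketch of grid search over a random forest with the Pipelines API of that era (assuming `training` is a DataFrame with "label" and "features" columns; grid values are illustrative):

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val rf = new RandomForestClassifier()
    val grid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(20, 50))
      .addGrid(rf.maxDepth, Array(5, 10))
      .build()
    val cv = new CrossValidator()
      .setEstimator(rf)
      .setEvaluator(new BinaryClassificationEvaluator())  // reads rawPrediction
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
    val cvModel = cv.fit(training)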

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi, Could you please submit this via JIRA as a bug report? It will be very helpful if you include the Spark version, system details, and other info too. Thanks! Joseph On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > *Issue:* > > I have a random

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree based

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu, The spark.ml classes are part of the higher-level "Pipelines" API, which works with DataFrames. When creating this API, we decided to separate it from the old API to avoid confusion. You can read more about it here: http://spark.apache.org/docs/latest/ml-guide.html For (3): We

Re: Serializing MLlib MatrixFactorizationModel

2015-08-17 Thread Joseph Bradley
I'd recommend using the built-in save and load, which will be better for cross-version compatibility. You should be able to call myModel.save(sc, path), and load it back with MatrixFactorizationModel.load(sc, path). On Mon, Aug 17, 2015 at 6:31 AM, Madawa Soysa madawa...@cse.mrt.ac.lk wrote: Hi All,
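
A minimal sketch of that save/load round trip (the storage path is hypothetical; both calls need a live SparkContext `sc`):

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Assumes `model` is a trained MatrixFactorizationModel.
    model.save(sc, "hdfs:///models/als")
    val sameModel = MatrixFactorizationModel.load(sc, "hdfs:///models/als")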

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
On Sat, Jul 25, 2015 at 8:07 AM, Joseph Bradley jos...@databricks.com wrote: I'd recommend starting with a few of the code examples to get a sense of Spark usage (in the examples/ folder when you check out the code). Then, you can work through the Spark methods they call, tracing as deep as needed

Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
I'd recommend starting with a few of the code examples to get a sense of Spark usage (in the examples/ folder when you check out the code). Then, you can work through the Spark methods they call, tracing as deep as needed to understand the component you are interested in. You can also find an

Re: ALS Rating Object

2015-06-03 Thread Joseph Bradley
Hi Yasemin, If you can convert your user IDs to Integers in pre-processing (unless you have more than a couple billion users), that would work. Otherwise... In Spark 1.3: You may need to modify ALS to use Long instead of Int. In Spark 1.4: spark.ml.recommendation.ALS (in the Pipeline API) exposes ALS.train
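
One hedged sketch of that pre-processing idea: build an Int index per distinct ID, assuming the distinct-ID counts fit comfortably in Int range and in driver memory (for very large ID sets, a join or broadcast would be preferable):

    import org.apache.spark.mllib.recommendation.Rating

    // Assumes `raw` is an RDD[(String, String, Double)] of (user, product, rating).
    val userIndex = raw.map(_._1).distinct().zipWithIndex()
      .mapValues(_.toInt).collectAsMap()
    val productIndex = raw.map(_._2).distinct().zipWithIndex()
      .mapValues(_.toInt).collectAsMap()
    val ratings = raw.map { case (u, p, r) =>
      Rating(userIndex(u), productIndex(p), r)
    }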

Re: Restricting the number of iterations in Mllib Kmeans

2015-06-01 Thread Joseph Bradley
wrote: Hi Joseph, I was unable to find any function in Kmeans.scala where the initial centroids could be specified by the user. Kindly help. Thanks Regards, Meethu M On Tuesday, 19 May 2015 6:54 AM, Joseph Bradley jos...@databricks.com wrote: Hi Suman, For maxIterations, are you

Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-30 Thread Joseph Bradley
This is really getting into an understanding of how optimization and GLMs work. I'd recommend reading some intro ML or stats literature on how Generalized Linear Models are estimated, as well as how convex optimization is used in ML. There are some free online texts as well as MOOCs which have

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-30 Thread Joseph Bradley
significant variables and deletes the others with a zero in the coefficients? What is a high lambda for you? Is the lambda a parameter available in Spark 1.4 only or can I see it in Spark 1.3? 2015-05-23 0:04 GMT+02:00 Joseph Bradley jos...@databricks.com: If you want to select specific

Re: Multilabel classification using logistic regression

2015-05-27 Thread Joseph Bradley
It looks like you are training each model i (for label i) by only using data with label i. You need to use all of your data to train each model so the models can compare each label i with the other labels (roughly speaking). However, what you're doing is multiclass (not multilabel)

Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread Joseph Bradley
The model is learned using an iterative convex optimization algorithm. numIterations, stepSize and miniBatchFraction are for those; you can see details here: http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
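
A sketch of tuning those three knobs on the SGD-based API (the values shown are starting points, not recommendations from the thread):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    // Assumes `training` is an RDD[LabeledPoint].
    val lr = new LogisticRegressionWithSGD()
    lr.optimizer
      .setNumIterations(200)      // more iterations => closer to convergence
      .setStepSize(1.0)           // too large can diverge; too small converges slowly
      .setMiniBatchFraction(1.0)  // fraction of data sampled per gradient step
    val model = lr.run(training)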

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread Joseph Bradley
If you want to select specific variable combinations by hand, then you will need to modify the dataset before passing it to the ML algorithm. The DataFrame API should make that easy to do. If you want to have an ML algorithm select variables automatically, then I would recommend using L1
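
A sketch of the L1 approach with the Spark 1.4-era spark.ml API (the regParam value is illustrative; the weights field was renamed coefficients in later versions):

    import org.apache.spark.ml.classification.LogisticRegression

    // Assumes `training` is a DataFrame with "label" and "features" columns.
    val lr = new LogisticRegression()
      .setElasticNetParam(1.0)  // 1.0 = pure L1 (lasso) penalty
      .setRegParam(0.1)         // larger values push more coefficients to exactly 0
    val model = lr.fit(training)
    println(model.weights)      // zero entries = variables effectively dropped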

Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Joseph Bradley
One more comment: That's a lot of categories for a feature. If it makes sense for your data, it will run faster if you can group the categories or split the 1895 categories into a few features which have fewer categories. On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz brk...@gmail.com wrote:

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Joseph Bradley
Hi Xin, 2 suggestions: 1) Feature scaling: spark.mllib's LogisticRegressionWithLBFGS uses feature scaling, which scales feature values to have unit standard deviation. That improves optimization behavior, and it often improves statistical estimation (though maybe not for your dataset).

Re: Getting the best parameter set back from CrossValidatorModel

2015-05-19 Thread Joseph Bradley
Hi Justin & Ram, To clarify, PipelineModel.stages is not private[ml]; only the PipelineModel constructor is private[ml]. So it's safe to use pipelineModel.stages as a Spark user. Ram's example looks good. Btw, in Spark 1.4 (and the current master build), we've made a number of improvements to
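
A sketch of pulling the winning stage back out, assuming the CrossValidator's estimator was a Pipeline ending in a LogisticRegression stage (Spark 1.4-era API):

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.LogisticRegressionModel

    // Assumes `cvModel` is a fitted CrossValidatorModel.
    val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
    val bestLR = bestPipeline.stages.last.asInstanceOf[LogisticRegressionModel]
    println(bestLR.extractParamMap())  // the parameter settings that won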

Re: Implementing custom metrics under MLPipeline's BinaryClassificationEvaluator

2015-05-18 Thread Joseph Bradley
Hi Justin, It sounds like you're on the right track. The best way to write a custom Evaluator will probably be to modify an existing Evaluator as you described. It's best if you don't remove the other code, which handles parameter set/get and schema validation. Joseph On Sun, May 17, 2015 at

Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread Joseph Bradley
Hi Suman, For maxIterations, are you using the DenseKMeans.scala example code? (I'm guessing yes since you mention the command line.) If so, then you should be able to specify maxIterations via an extra parameter like --numIterations 50 (note the example uses numIterations in the current master

Re: Predict.scala using model for clustering In reference

2015-05-07 Thread Joseph Bradley
A KMeansModel was trained in the previous step, and it was saved to modelFile as a Java object file. This step is loading the model back and reconstructing the KMeansModel, which can then be used to classify new tweets into different clusters. Joseph On Thu, May 7, 2015 at 12:40 PM, anshu shukla

Re: Multilabel Classification in spark

2015-05-05 Thread Joseph Bradley
If you mean multilabel (predicting multiple label values), then MLlib does not yet support that. You would need to predict each label separately. If you mean multiclass (1 label taking more than 2 categorical values), then MLlib supports it via LogisticRegression (as DB said), as well as DecisionTree and

Re: MLLib SVM probability

2015-05-04 Thread Joseph Bradley
Currently, SVMs don't have built-in multiclass support. Logistic Regression supports multiclass, as do trees and random forests. It would be great to add multiclass support for SVMs as well. There is some ongoing work on generic multiclass-to-binary reductions:

Re: [Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-26 Thread Joseph Bradley
Hi Peter, As far as setting the parallelism, I would recommend setting it as early as possible. Ideally, that would mean specifying the number of partitions when loading the initial data (rather than repartitioning later on). In general, working with Vector columns should be better since the

Re: Multiclass classification using Ml logisticRegression

2015-04-26 Thread Joseph Bradley
Unfortunately, the Pipelines API doesn't have multiclass logistic regression yet, only binary. It's really a matter of modifying the current implementation; I just added a JIRA for it: https://issues.apache.org/jira/browse/SPARK-7159 You'll need to use the old LogisticRegression API to do

Re: How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-25 Thread Joseph Bradley
It looks like your code is making 1 Row per item, which means that columnSimilarities will compute similarities between users. If you transpose the matrix (or construct it as the transpose), then columnSimilarities should do what you want, and it will return meaningful indices. Joseph On Fri,
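
To illustrate the orientation point: each row should be a user vector so that items are the columns (the data layout below is assumed, not from the thread):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Assumes `userItemRatings` is an RDD[(Long, Array[Double])]: one ratings
    // array per user, indexed by item.
    val mat = new RowMatrix(userItemRatings.map { case (_, ratings) =>
      Vectors.dense(ratings)
    })
    // Entry (i, j) is the cosine similarity between items i and j.
    val itemSims = mat.columnSimilarities()
    itemSims.entries.take(5).foreach(println)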

Re: KMeans takeSample jobs and RDD cached

2015-04-25 Thread Joseph Bradley
Yes, the count() should be the first task, and the sampling + collecting should be the second task. The first one is probably slow because the RDD being sampled is not yet cached/materialized. K-Means creates some RDDs internally while learning, and since they aren't needed after learning, they

Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Joseph Bradley
Hi Ayan, If you want to use DataFrame, then you should use the Pipelines API (org.apache.spark.ml.*) which will take DataFrames: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS In the examples/ directory for ml/, you can find a MovieLensALS
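
A minimal sketch of the DataFrame-based ALS mentioned above (column names and parameter values are illustrative):

    import org.apache.spark.ml.recommendation.ALS

    // Assumes `ratings` is a DataFrame with userId, movieId, and rating columns.
    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")
      .setRank(10)
      .setMaxIter(10)
    val model = als.fit(ratings)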

Re: From DataFrame to LabeledPoint

2015-04-06 Thread Joseph Bradley
failure: Lost task 0.3 in stage 6.0 (TID 243, 10.101.5.194): java.lang.NullPointerException Thanks for all. Sergio J. 2015-04-03 20:14 GMT+02:00 Joseph Bradley jos...@databricks.com: I'd recommend going through each step, taking 1 RDD element (myDataFrame.take(1)), and examining it to see where

Re: Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Joseph Bradley
If you can examine your data matrix and know that about 1/6 or so of the values are non-zero (so 5/6 are zeros), then it's probably worth using sparse vectors. (1/6 is a rough estimate.) There is support for L1 and L2 regularization. You can look at the guide here:

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Joseph Bradley
Peter's suggestion sounds good, but watch out for the match case since I believe you'll have to match on: case (Row(feature1, feature2, ...), Row(label)) => On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi try next code: val labeledPoints: RDD[LabeledPoint] =
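
A sketch of that Row-match pattern applied to a two-column DataFrame (assuming a Double label column and a Vector features column):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.Row

    val labeledPoints = df.select("label", "features").map {
      case Row(label: Double, features: Vector) => LabeledPoint(label, features)
    }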

Re: Mllib kmeans #iteration

2015-04-02 Thread Joseph Bradley
Check out the Spark docs for that parameter: *maxIterations* http://spark.apache.org/docs/latest/mllib-clustering.html#k-means On Thu, Apr 2, 2015 at 4:42 AM, podioss grega...@hotmail.com wrote: Hello, i am running the Kmeans algorithm in cluster mode from Mllib and i was wondering if i could
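
For reference, a sketch of capping the iterations (k and the cap are illustrative):

    import org.apache.spark.mllib.clustering.KMeans

    // Assumes `data` is an RDD[Vector].
    val model = new KMeans()
      .setK(10)
      .setMaxIterations(20)  // stop after at most 20 iterations
      .run(data)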

Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Joseph Bradley
Can you try specifying the number of partitions when you load the data to equal the number of executors? If your ETL changes the number of partitions, you can also repartition before calling KMeans. On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote: Hi, I have a large
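
A sketch of that suggestion (the partition count should roughly match total executor cores; 16 here is an assumption):

    import org.apache.spark.mllib.clustering.KMeans

    // Assumes `data` is an RDD[Vector].
    val partitioned = data.repartition(16).cache()
    val model = KMeans.train(partitioned, 10, 20)  // k = 10, maxIterations = 20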

Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Joseph Bradley
Hi Martin, In the short term: Would you be able to work with a different type other than Vector? If so, then you can override the *Predictor* class's *protected def featuresDataType: DataType* with a DataFrame type which fits your purpose. If you need Vector, then you might have to do a hack

Re: Using TF-IDF from MLlib

2015-03-16 Thread Joseph Bradley
This was brought up again in https://issues.apache.org/jira/browse/SPARK-6340 so I'll answer one item which was asked about the reliability of zipping RDDs. Basically, it should be reliable, and if it is not, then it should be reported as a bug. This general approach should work (with explicit
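
A sketch of the zip-based approach (assuming `docs` is an RDD[(docId, Seq[String])] of tokenized documents; zip is safe here because every step is a narrow map over the same partitions):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val tf = new HashingTF().transform(docs.map(_._2))
    tf.cache()  // IDF.fit and transform both traverse tf
    val tfidf = new IDF().fit(tf).transform(tf)
    // Pair each document ID back up with its TF-IDF vector.
    val docVectors = docs.map(_._1).zip(tfidf)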

Re: sparse vector operations in Python

2015-03-10 Thread Joseph Bradley
There isn't a great way currently. The best option is probably to convert to scipy.sparse column vectors and add using scipy. Joseph On Mon, Mar 9, 2015 at 4:21 PM, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote: Hi, Sorry to ask this, but how do I compute the sum of 2 (or more) mllib

Re: LBGFS optimizer performace

2015-03-03 Thread Joseph Bradley
Is that error actually occurring in LBFGS? It looks like it might be happening before the data even gets to LBFGS. (Perhaps the outer join you're trying to do is making the dataset size explode a bit.) Are you able to call count() (or any RDD action) on the data before you pass it to LBFGS? On

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-03 Thread Joseph Bradley
The minimization problem you're describing in the email title also looks like it could be solved using the RidgeRegression solver in MLlib, once you transform your DistributedMatrix into an RDD[LabeledPoint]. On Tue, Mar 3, 2015 at 11:02 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Joseph Bradley
, is it possible to create a UDF that can operate on a partition in order to minimize the creation of a CaffeModel and to take advantage of the Caffe network batch processing ? On Tue, Mar 3, 2015 at 7:26 AM, Joseph Bradley jos...@databricks.com wrote: I see, thanks for clarifying! I'd recommend

Re: LBGFS optimizer performace

2015-03-03 Thread Joseph Bradley
the dataset. Thanks Gustavo On Tue, Mar 3, 2015 at 6:08 PM, Joseph Bradley jos...@databricks.com wrote: Is that error actually occurring in LBFGS? It looks like it might be happening before the data even gets to LBFGS. (Perhaps the outer join you're trying to do is making the dataset

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-02 Thread Joseph Bradley
and I need to add the result of my pretrained model as a new column in the DataFrame. Precisely, I want to implement the following transformer : class DeepCNNFeature extends Transformer ... { } On Sun, Mar 1, 2015 at 1:32 AM, Joseph Bradley jos...@databricks.com wrote: Hi Jao, You can

Re: Reg. Difference in Performance

2015-02-28 Thread Joseph Bradley
Hi Deep, Compute times may not be very meaningful for small examples like those. If you increase the sizes of the examples, then you may start to observe more meaningful trends and speedups. Joseph On Sat, Feb 28, 2015 at 7:26 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-28 Thread Joseph Bradley
Hi Shahab, There are actually a few distributed Matrix types which support sparse representations: RowMatrix, IndexedRowMatrix, and CoordinateMatrix. The documentation has a bit more info about the various uses: http://spark.apache.org/docs/latest/mllib-data-types.html#distributed-matrix The

Re: Some questions after playing a little with the new ml.Pipeline.

2015-02-28 Thread Joseph Bradley
Hi Jao, You can use external tools and libraries if they can be called from your Spark program or script (with appropriate conversion of data types, etc.). The best way to apply a pre-trained model to a dataset would be to call the model from within a closure, e.g.: myRDD.map { myDatum =>
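
A sketch of that closure pattern, with a broadcast so each executor deserializes the model once (assumes `model` is serializable and exposes a single-datum predict method):

    val bcModel = sc.broadcast(model)  // one copy per executor, not per task
    val predictions = myRDD.map { myDatum =>
      bcModel.value.predict(myDatum)
    }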

Re: ML Transformer

2015-02-18 Thread Joseph Bradley
Hi Cesar, Thanks for trying out Pipelines and bringing up this issue! It's been an experimental API, but feedback like this will help us prepare it for becoming non-Experimental. I've made a JIRA, and will vote for this being protected (instead of private[ml]) for Spark 1.3:

Re: Storing DecisionTreeModel

2015-01-27 Thread Joseph Bradley
Hi Andres, Currently, serializing the object is probably the best way to do it. However, there are efforts to support actual model import/export: https://issues.apache.org/jira/browse/SPARK-4587 https://issues.apache.org/jira/browse/SPARK-1406 I'm hoping to have the PR for the first JIRA ready

Re: SVD in pyspark ?

2015-01-26 Thread Joseph Bradley
Hi Andreas, There unfortunately is not a Python API yet for distributed matrices or their operations. Here's the JIRA to follow to stay up-to-date on it: https://issues.apache.org/jira/browse/SPARK-3956 There are internal wrappers (used to create the Python API), but they are not really public

Re: [mllib] Decision Tree - prediction probabilites of label classes

2015-01-24 Thread Joseph Bradley
There is a JIRA...but not a PR yet. Here's the JIRA: https://issues.apache.org/jira/browse/SPARK-3727 I'm not aware of current work on it, but I agree it would be nice to have! Joseph On Thu, Jan 22, 2015 at 2:50 AM, Sean Owen so...@cloudera.com wrote: You are right that this isn't

Re: Need some help to create user defined type for ML pipeline

2015-01-24 Thread Joseph Bradley
Hi Jao, You're right that defining serialize and deserialize is the main task in implementing a UDT. They are basically translating between your native representation (ByteImage) and SQL DataTypes. The sqlType you defined looks correct, and you're correct to use a row of length 4. Other than

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-24 Thread Joseph Bradley
Hi Lokesh, Glad the update fixed the bug. maxBins is a parameter you can tune based on your data. Essentially, larger maxBins is potentially more accurate, but will run more slowly and use more memory. maxBins must be <= training set size; I would say try some small values (4, 8, 16). If there

Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-21 Thread Joseph Bradley
Hi, this sounds like a bug which has been fixed in the current master. What version of Spark are you using? Would it be possible to update to the current master? If not, it would be helpful to know some more of the problem dimensions (num examples, num features, feature types, label type).

Re: mlib model viewing and saving

2014-10-13 Thread Joseph Bradley
Currently, printing (toString) gives a human-readable version of the tree, but it is not a format which is easy to save and load. That sort of serialization is in the works, but not available for trees right now. (Note that the current master actually has toString (for a short summary of the

Re: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Joseph Bradley
I have run into that issue too, but only when the data were not pre-processed correctly. E.g., if a categorical feature is binary with values in {-1, +1} instead of {0,1}. Will be very interested to learn if it can occur elsewhere! On Thu, Aug 14, 2014 at 10:16 AM, Sameer Tilak

Re: Decision tree classifier in MLlib

2014-07-18 Thread Joseph Bradley
Hi Sudha, Have you checked if the labels are being loaded correctly? It sounds like the DT algorithm can't find any useful splits to make, so maybe it thinks they are all the same? Some data loading functions threshold labels to make them binary. Hope it helps, Joseph On Fri, Jul 11, 2014 at