GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
Hi Spark community,

I'd like to announce a new release of GraphFrames, a Spark Package for
DataFrame-based graphs!

*We strongly encourage all users to use this latest release for the bug fix
described below.*

*Critical bug fix*
This release fixes a bug in vertex indexing.  The bug may have affected your
results if:
* your graph uses non-Integer IDs, and
* you use ConnectedComponents or other algorithms which are wrappers
around GraphX.
The bug occurs when the input DataFrame is non-deterministic. E.g., running
an algorithm on a DataFrame just loaded from disk should have been fine in
previous releases, but running it on a DataFrame produced via shuffles,
unions, or other operators could cause incorrect results. This issue is
fixed in this release.
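The failure mode can be sketched without Spark at all: if vertex indices are assigned by position in the data, then any re-evaluation that reorders the rows assigns different indices to the same IDs, and downstream joins on those indices silently mismatch. A toy plain-Python analog (a hypothetical helper, not the GraphFrames implementation):

```python
# Toy illustration of why a non-deterministic input breaks
# order-based vertex indexing (zipWithIndex-style assignment).

def index_vertices(rows):
    """Mimic zipWithIndex: map each string ID to its position."""
    return {vid: i for i, vid in enumerate(rows)}

vertices = ["a", "b", "c"]

# Deterministic input: both evaluations agree, so indices line up.
first = index_vertices(vertices)
second = index_vertices(vertices)
assert first == second

# Non-deterministic input (e.g., row order changed by a shuffle between
# two evaluations): the same vertex gets two different indices.
shuffled = ["c", "a", "b"]
inconsistent = index_vertices(shuffled)
assert first["a"] == 0 and inconsistent["a"] == 1  # "a" moved
```
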

*New features*
* Python API for aggregateMessages for building custom graph algorithms
* Scala API for parallel personalized PageRank, wrapping the GraphX
implementation. This is only available when using GraphFrames with Spark
2.1+.
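aggregateMessages follows the standard Pregel-style pattern: each edge sends a message to one or both of its endpoints, and each vertex aggregates the messages it receives. A toy plain-Python analog of a single round (an illustration of the pattern, not the GraphFrames API):

```python
# One round of message passing on a tiny directed graph: every edge
# sends the source vertex's value to its destination, and each vertex
# sums the messages it receives.

values = {"a": 1, "b": 2, "c": 3}
edges = [("a", "b"), ("a", "c"), ("b", "c")]

incoming = {v: 0 for v in values}
for src, dst in edges:
    incoming[dst] += values[src]  # "sendToDst": message carries src's value

# "c" hears from "a" (1) and "b" (2); "b" hears from "a" (1); "a" hears nothing.
assert incoming == {"a": 0, "b": 1, "c": 3}
```
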

Support for Spark 1.6, 2.0, and 2.1

*Special thanks to Felix Cheung for his work as a new committer for
GraphFrames!*

*Full release notes*:
https://github.com/graphframes/graphframes/releases/tag/release-0.5.0
*Docs*: http://graphframes.github.io/
*Spark Package*: https://spark-packages.org/package/graphframes/graphframes
*Source*: https://github.com/graphframes/graphframes

Thanks to all contributors and to the community for feedback!
Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



GraphFrames 0.4.0 release, with Apache Spark 2.1 support

2017-03-28 Thread Joseph Bradley
Hi Spark dev & users,

For those who use GraphFrames <http://graphframes.github.io/> (DataFrame-based
graphs), we have published a new release 0.4.0.  It adds support for Apache
Spark 2.1, with versions published for Spark 2.1 and 2.0 and for Scala 2.10
and 2.11.

*Docs*: http://graphframes.github.io/
*Spark Package*: https://spark-packages.org/package/graphframes/graphframes
*Source*: https://github.com/graphframes/graphframes

Joseph

-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: LDA in Spark

2017-03-23 Thread Joseph Bradley
Hi Mathieu,

I'm CCing the Spark user list since this will be of general interest to the
forum.  Unfortunately, there is not a way to begin LDA training with an
existing model currently.  Some MLlib models have been augmented to support
specifying an "initialModel" argument, but LDA does not have this yet.
Please feel free to make a feature request JIRA for it!

Thanks,
Joseph

On Thu, Mar 23, 2017 at 4:54 PM, Mathieu DESPRIEE <mdespr...@bluedme.com>
wrote:

> Hello Joseph,
>
> I saw your contribution to online LDA in Spark (SPARK-5563). Please allow
> me a very quick question :
>
> I'm very much interested in training an LDA model incrementally with new
> batches of documents. This online algorithm seems to fit, but from what I
> understand of the current ml API, this is not possible to update a trained
> model with new documents.
> Is it ?
>
> Is there any way to get around the API and do that ?
>
> Thanks in advance for your insight.
>
> Mathieu
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Joseph Bradley
 java.lang.reflect.Method.invoke(Method.java:606)
>>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy
>>>> $SparkSubmit$$runMain(SparkSubmit.scala:731)
>>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit
>>>> .scala:181)
>>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>
>>>> And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10
>>>>
>>>> Thanks
>>>> Ankur
>>>>
>>>> On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava <
>>>> ankur.srivast...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am rerunning the pipeline to generate the exact trace, I have below
>>>>> part of trace from last run:
>>>>>
>>>>> Exception in thread "main" java.lang.IllegalArgumentException: Wrong
>>>>> FS: s3n://, expected: file:///
>>>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>>>>> ileSystem.java:69)
>>>>> at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLoc
>>>>> alFileSystem.java:516)
>>>>> at 
>>>>> org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)
>>>>>
>>>>>
>>>>> Also I think the error is happening in this part of the code
>>>>> "ConnectedComponents.scala:339" I am referring the code @
>>>>> https://github.com/graphframes/graphframes/blob/master/src/
>>>>> main/scala/org/graphframes/lib/ConnectedComponents.scala
>>>>>
>>>>>   if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
>>>>> // TODO: remove this after DataFrame.checkpoint is implemented
>>>>> val out = s"${checkpointDir.get}/$iteration"
>>>>> ee.write.parquet(out)
>>>>> // may hit S3 eventually consistent issue
>>>>> ee = sqlContext.read.parquet(out)
>>>>>
>>>>> // remove previous checkpoint
>>>>> if (iteration > checkpointInterval) {
>>>>>   *FileSystem.get(sc.hadoopConfiguration)*
>>>>> *    .delete(new Path(s"${checkpointDir.get}/${iteration -
>>>>> checkpointInterval}"), true)*
>>>>> }
>>>>>
>>>>> System.gc() // hint Spark to clean shuffle directories
>>>>>   }
>>>>>
>>>>>
>>>>> Thanks
>>>>> Ankur
>>>>>
>>>>> On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung <
>>>>> felixcheun...@hotmail.com> wrote:
>>>>>
>>>>>> Do you have more of the exception stack?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *From:* Ankur Srivastava <ankur.srivast...@gmail.com>
>>>>>> *Sent:* Wednesday, January 4, 2017 4:40:02 PM
>>>>>> *To:* user@spark.apache.org
>>>>>> *Subject:* Spark GraphFrame ConnectedComponents
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to use the ConnectedComponent algorithm of GraphFrames
>>>>>> but by default it needs a checkpoint directory. As I am running my spark
>>>>>> cluster with S3 as the DFS and do not have access to HDFS file system I
>>>>>> tried using a s3 directory as checkpoint directory but I run into below
>>>>>> exception:
>>>>>>
>>>>>> Exception in thread "main"java.lang.IllegalArgumentException: Wrong
>>>>>> FS: s3n://, expected: file:///
>>>>>>
>>>>>> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
>>>>>>
>>>>>> at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalF
>>>>>> ileSystem.java:69)
>>>>>>
>>>>>> If I set checkpoint interval to -1 to avoid checkpointing the driver
>>>>>> just hangs after 3 or 4 iterations.
>>>>>>
>>>>>> Is there some way I can set the default FileSystem to S3 for Spark or
>>>>>> any other option?
>>>>>>
>>>>>> Thanks
>>>>>> Ankur
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
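For context on the "Wrong FS" error quoted above: Hadoop's FileSystem.get(conf) returns the cluster's default filesystem (file:/// here), which then rejects the s3n:// checkpoint path in checkPath; resolving the filesystem from the path itself, e.g. new Path(out).getFileSystem(conf), avoids the mismatch. The scheme check can be sketched in plain Python (a hypothetical helper, not the Hadoop code):

```python
from urllib.parse import urlparse

def check_path(fs_scheme, path):
    """Mimic Hadoop's FileSystem.checkPath: reject paths whose scheme
    does not match the filesystem asked to handle them."""
    scheme = urlparse(path).scheme or "file"
    if scheme != fs_scheme:
        raise ValueError(f"Wrong FS: {scheme}://, expected: {fs_scheme}:///")

# Asking the default (local) filesystem to handle an S3 path fails,
# just as in the stack trace above.
try:
    check_path("file", "s3n://bucket/checkpoints/3")
except ValueError as e:
    print(e)

# Resolving the filesystem from the path's own scheme avoids the mismatch.
check_path(urlparse("s3n://bucket/checkpoints/3").scheme,
           "s3n://bucket/checkpoints/3")
```
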


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: GraphFrames 0.2.0 released

2016-08-26 Thread Joseph Bradley
This should do it:
https://github.com/graphframes/graphframes/releases/tag/release-0.2.0
Thanks for the reminder!
Joseph

On Wed, Aug 24, 2016 at 10:11 AM, Maciej Bryński  wrote:

> Hi,
> Do you plan to add tag for this release on github ?
> https://github.com/graphframes/graphframes/releases
>
> Regards,
> Maciek
>
> 2016-08-17 3:18 GMT+02:00 Jacek Laskowski :
>
>> Hi Tim,
>>
>> AWESOME. Thanks a lot for releasing it. That makes me even more eager
>> to see it in Spark's codebase (and replacing the current RDD-based
>> API)!
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Tue, Aug 16, 2016 at 9:32 AM, Tim Hunter 
>> wrote:
>> > Hello all,
>> > I have released version 0.2.0 of the GraphFrames package. Apart from a
>> few
>> > bug fixes, it is the first release published for Spark 2.0 and both
>> scala
>> > 2.10 and 2.11. Please let us know if you have any comment or questions.
>> >
>> > It is available as a Spark package:
>> > https://spark-packages.org/package/graphframes/graphframes
>> >
>> > The source code is available as always at
>> > https://github.com/graphframes/graphframes
>> >
>> >
>> > What is GraphFrames?
>> >
>> > GraphFrames is a DataFrame-based graph engine for Spark. In addition to the
>> > algorithms available in GraphX, users can write highly expressive
>> queries by
>> > leveraging the DataFrame API, combined with a new API for motif
>> finding. The
>> > user also benefits from DataFrame performance optimizations within the
>> Spark
>> > SQL engine.
>> >
>> > Cheers
>> >
>> > Tim
>> >
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Maciek Bryński
>


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1  By the way, the JIRA for tracking (Scala) API parity is:
https://issues.apache.org/jira/browse/SPARK-4591

On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
wrote:

> This sounds good to me as well. The one thing we should pay attention to
> is how we update the docs so that people know to start with the spark.ml
> classes. Right now the docs list spark.mllib first and also seem more
> comprehensive in that area than in spark.ml, so maybe people naturally
> move towards that.
>
> Matei
>
> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>
> Yes, DB (cc'ed) is working on porting the local linear algebra library
> over (SPARK-13944). There are also frequent pattern mining algorithms we
> need to port over in order to reach feature parity. -Xiangrui
>
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> Overall this sounds good to me. One question I have is that in
>> addition to the ML algorithms we have a number of linear algebra
>> (various distributed matrices) and statistical methods in the
>> spark.mllib package. Is the plan to port or move these to the spark.ml
>> namespace in the 2.x series ?
>>
>> Thanks
>> Shivaram
>>
>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>> > certainly better than two.
>> >
>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  wrote:
>> >> Hi all,
>> >>
>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>> built
>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>> API has
>> >> been developed under the spark.ml package, while the old RDD-based
>> API has
>> >> been developed in parallel under the spark.mllib package. While it was
>> >> easier to implement and experiment with new APIs under a new package,
>> it
>> >> became harder and harder to maintain as both packages grew bigger and
>> >> bigger. And new users are often confused by having two sets of APIs
>> with
>> >> overlapped functions.
>> >>
>> >> We started to recommend the DataFrame-based API over the RDD-based API
>> in
>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>> development
>> >> and the usage gradually shifting to the DataFrame-based API. Just
>> counting
>> >> the lines of Scala code, from 1.5 to the current master we added ~1
>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>> to
>> >> gather more resources on the development of the DataFrame-based API
>> and to
>> >> help users migrate over sooner, I want to propose switching RDD-based
>> MLlib
>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>> >>
>> >> * We do not accept new features in the RDD-based spark.mllib package,
>> unless
>> >> they block implementing new features in the DataFrame-based spark.ml
>> >> package.
>> >> * We still accept bug fixes in the RDD-based API.
>> >> * We will add more features to the DataFrame-based API in the 2.x
>> series to
>> >> reach feature parity with the RDD-based API.
>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>> deprecate
>> >> the RDD-based API.
>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>> 3.0.
>> >>
>> >> Though the RDD-based API is already in de facto maintenance mode, this
>> >> announcement will make it clear and hence important to both MLlib
>> developers
>> >> and users. So we’d greatly appreciate your feedback!
>> >>
>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>> >> DataFrame-based API or even the entire MLlib component. This also
>> causes
>> >> confusion. To be clear, “Spark ML” is not an official name and there
>> are no
>> >> plans to rename MLlib to “Spark ML” at this time.)
>> >>
>> >> Best,
>> >> Xiangrui
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>
>


Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
Can you try reducing maxBins?  That reduces communication (at the cost of
coarser discretization of continuous features).
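maxBins controls how finely continuous features are discretized before split search, and the per-node statistics communicated during training grow with the bin count, so fewer bins directly means less data shuffled per aggregation. A rough plain-Python sketch of equal-frequency binning (an illustration, not the MLlib implementation):

```python
def quantile_bin_edges(values, max_bins):
    """Approximate equal-frequency bin edges for one continuous feature."""
    ordered = sorted(values)
    step = len(ordered) / max_bins
    return [ordered[int(i * step)] for i in range(1, max_bins)]

feature = [0.1 * i for i in range(100)]

coarse = quantile_bin_edges(feature, 4)   # 3 thresholds per feature per node
fine = quantile_bin_edges(feature, 32)    # 31 thresholds: ~10x the stats traffic

assert len(coarse) == 3 and len(fine) == 31
assert coarse == sorted(coarse)  # edges are non-decreasing
```
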

On Fri, Apr 1, 2016 at 11:32 AM, Joseph Bradley <jos...@databricks.com>
wrote:

> In my experience, 20K is a lot but often doable; 2K is easy; 200 is
> small.  Communication scales linearly in the number of features.
>
> On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> Joseph,
>>
>> Correction, there 20k features. Is it still a lot?
>> What number of features can be considered as normal?
>>
>> --
>> Be well!
>> Jean Morozov
>>
>> On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> First thought: 70K features is *a lot* for the MLlib implementation (and
>>> any PLANET-like implementation)
>>>
>>> Using fewer partitions is a good idea.
>>>
>>> Which Spark version was this on?
>>>
>>> On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <
>>> evgeny.a.moro...@gmail.com> wrote:
>>>
>>>> The questions I have in mind:
>>>>
>>>> Is it smth that the one might expect? From the stack trace itself it's
>>>> not clear where does it come from.
>>>> Is it an already known bug? Although I haven't found anything like that.
>>>> Is it possible to configure something to workaround / avoid this?
>>>>
>>>> I'm not sure it's the right thing to do, but I've
>>>> increased thread stack size 10 times (to 80MB)
>>>> reduced default parallelism 10 times (only 20 cores are available)
>>>>
>>>> Thank you in advance.
>>>>
>>>> --
>>>> Be well!
>>>> Jean Morozov
>>>>
>>>> On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <
>>>> evgeny.a.moro...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a web service that provides rest api to train random forest
>>>>> algo.
>>>>> I train random forest on a 5 nodes spark cluster with enough memory -
>>>>> everything is cached (~22 GB).
>>>>> On a small datasets up to 100k samples everything is fine, but with
>>>>> the biggest one (400k samples and ~70k features) I'm stuck with
>>>>> StackOverflowError.
>>>>>
>>>>> Additional options for my web service
>>>>> spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
>>>>> spark.default.parallelism = 200.
>>>>>
>>>>> On a 400k samples dataset
>>>>> - (with default thread stack size) it took 4 hours of training to get
>>>>> the error.
>>>>> - with increased stack size it took 60 hours to hit it.
>>>>> I can increase it, but it's hard to say what amount of memory it needs
>>>>> and it's applied to all of the treads and might waste a lot of memory.
>>>>>
>>>>> I'm looking at different stages at event timeline now and see that
>>>>> task deserialization time gradually increases. And at the end task
>>>>> deserialization time is roughly same as executor computing time.
>>>>>
>>>>> Code I use to train model:
>>>>>
>>>>> int MAX_BINS = 16;
>>>>> int NUM_CLASSES = 0;
>>>>> double MIN_INFO_GAIN = 0.0;
>>>>> int MAX_MEMORY_IN_MB = 256;
>>>>> double SUBSAMPLING_RATE = 1.0;
>>>>> boolean USE_NODEID_CACHE = true;
>>>>> int CHECKPOINT_INTERVAL = 10;
>>>>> int RANDOM_SEED = 12345;
>>>>>
>>>>> int NODE_SIZE = 5;
>>>>> int maxDepth = 30;
>>>>> int numTrees = 50;
>>>>> Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(), 
>>>>> maxDepth, NUM_CLASSES, MAX_BINS,
>>>>> QuantileStrategy.Sort(), new 
>>>>> scala.collection.immutable.HashMap<>(), NODE_SIZE, MIN_INFO_GAIN,
>>>>> MAX_MEMORY_IN_MB, SUBSAMPLING_RATE, USE_NODEID_CACHE, 
>>>>> CHECKPOINT_INTERVAL);
>>>>> RandomForestModel model = 
>>>>> RandomForest.trainRegressor(labeledPoints.rdd(), strategy, numTrees, 
>>>>> "auto", RANDOM_SEED);
>>>>>
>>>>>
>>>>> Any advice would be highly appreciated.
>>>>>
>>>>> The exception 

Re: SparkML RandomForest java.lang.StackOverflowError

2016-04-01 Thread Joseph Bradley
In my experience, 20K is a lot but often doable; 2K is easy; 200 is small.
Communication scales linearly in the number of features.

On Thu, Mar 31, 2016 at 6:12 AM, Eugene Morozov <evgeny.a.moro...@gmail.com>
wrote:

> Joseph,
>
> Correction, there 20k features. Is it still a lot?
> What number of features can be considered as normal?
>
> --
> Be well!
> Jean Morozov
>
> On Tue, Mar 29, 2016 at 10:09 PM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> First thought: 70K features is *a lot* for the MLlib implementation (and
>> any PLANET-like implementation)
>>
>> Using fewer partitions is a good idea.
>>
>> Which Spark version was this on?
>>
>> On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <
>> evgeny.a.moro...@gmail.com> wrote:
>>
>>> The questions I have in mind:
>>>
>>> Is it smth that the one might expect? From the stack trace itself it's
>>> not clear where does it come from.
>>> Is it an already known bug? Although I haven't found anything like that.
>>> Is it possible to configure something to workaround / avoid this?
>>>
>>> I'm not sure it's the right thing to do, but I've
>>> increased thread stack size 10 times (to 80MB)
>>> reduced default parallelism 10 times (only 20 cores are available)
>>>
>>> Thank you in advance.
>>>
>>> --
>>> Be well!
>>> Jean Morozov
>>>
>>> On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <
>>> evgeny.a.moro...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a web service that provides rest api to train random forest
>>>> algo.
>>>> I train random forest on a 5 nodes spark cluster with enough memory -
>>>> everything is cached (~22 GB).
>>>> On a small datasets up to 100k samples everything is fine, but with the
>>>> biggest one (400k samples and ~70k features) I'm stuck with
>>>> StackOverflowError.
>>>>
>>>> Additional options for my web service
>>>> spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
>>>> spark.default.parallelism = 200.
>>>>
>>>> On a 400k samples dataset
>>>> - (with default thread stack size) it took 4 hours of training to get
>>>> the error.
>>>> - with increased stack size it took 60 hours to hit it.
>>>> I can increase it, but it's hard to say what amount of memory it needs
>>>> and it's applied to all of the treads and might waste a lot of memory.
>>>>
>>>> I'm looking at different stages at event timeline now and see that task
>>>> deserialization time gradually increases. And at the end task
>>>> deserialization time is roughly same as executor computing time.
>>>>
>>>> Code I use to train model:
>>>>
>>>> int MAX_BINS = 16;
>>>> int NUM_CLASSES = 0;
>>>> double MIN_INFO_GAIN = 0.0;
>>>> int MAX_MEMORY_IN_MB = 256;
>>>> double SUBSAMPLING_RATE = 1.0;
>>>> boolean USE_NODEID_CACHE = true;
>>>> int CHECKPOINT_INTERVAL = 10;
>>>> int RANDOM_SEED = 12345;
>>>>
>>>> int NODE_SIZE = 5;
>>>> int maxDepth = 30;
>>>> int numTrees = 50;
>>>> Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(), 
>>>> maxDepth, NUM_CLASSES, MAX_BINS,
>>>> QuantileStrategy.Sort(), new 
>>>> scala.collection.immutable.HashMap<>(), NODE_SIZE, MIN_INFO_GAIN,
>>>> MAX_MEMORY_IN_MB, SUBSAMPLING_RATE, USE_NODEID_CACHE, 
>>>> CHECKPOINT_INTERVAL);
>>>> RandomForestModel model = RandomForest.trainRegressor(labeledPoints.rdd(), 
>>>> strategy, numTrees, "auto", RANDOM_SEED);
>>>>
>>>>
>>>> Any advice would be highly appreciated.
>>>>
>>>> The exception (~3000 lines long):
>>>>  java.lang.StackOverflowError
>>>> at
>>>> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
>>>> at
>>>> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
>>>> at
>>>> java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
>>>> at
>>>> java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
>>>> at
>>>> java.io.ObjectInputStre

Re: SparkML RandomForest java.lang.StackOverflowError

2016-03-29 Thread Joseph Bradley
First thought: 70K features is *a lot* for the MLlib implementation (and
any PLANET-like implementation)

Using fewer partitions is a good idea.

Which Spark version was this on?

On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov 
wrote:

> The questions I have in mind:
>
> Is it smth that the one might expect? From the stack trace itself it's not
> clear where does it come from.
> Is it an already known bug? Although I haven't found anything like that.
> Is it possible to configure something to workaround / avoid this?
>
> I'm not sure it's the right thing to do, but I've
> increased thread stack size 10 times (to 80MB)
> reduced default parallelism 10 times (only 20 cores are available)
>
> Thank you in advance.
>
> --
> Be well!
> Jean Morozov
>
> On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a web service that provides rest api to train random forest algo.
>> I train random forest on a 5 nodes spark cluster with enough memory -
>> everything is cached (~22 GB).
>> On a small datasets up to 100k samples everything is fine, but with the
>> biggest one (400k samples and ~70k features) I'm stuck with
>> StackOverflowError.
>>
>> Additional options for my web service
>> spark.executor.extraJavaOptions="-XX:ThreadStackSize=8192"
>> spark.default.parallelism = 200.
>>
>> On a 400k samples dataset
>> - (with default thread stack size) it took 4 hours of training to get the
>> error.
>> - with increased stack size it took 60 hours to hit it.
>> I can increase it, but it's hard to say what amount of memory it needs
>> and it's applied to all of the treads and might waste a lot of memory.
>>
>> I'm looking at different stages at event timeline now and see that task
>> deserialization time gradually increases. And at the end task
>> deserialization time is roughly same as executor computing time.
>>
>> Code I use to train model:
>>
>> int MAX_BINS = 16;
>> int NUM_CLASSES = 0;
>> double MIN_INFO_GAIN = 0.0;
>> int MAX_MEMORY_IN_MB = 256;
>> double SUBSAMPLING_RATE = 1.0;
>> boolean USE_NODEID_CACHE = true;
>> int CHECKPOINT_INTERVAL = 10;
>> int RANDOM_SEED = 12345;
>>
>> int NODE_SIZE = 5;
>> int maxDepth = 30;
>> int numTrees = 50;
>> Strategy strategy = new Strategy(Algo.Regression(), Variance.instance(), 
>> maxDepth, NUM_CLASSES, MAX_BINS,
>> QuantileStrategy.Sort(), new scala.collection.immutable.HashMap<>(), 
>> NODE_SIZE, MIN_INFO_GAIN,
>> MAX_MEMORY_IN_MB, SUBSAMPLING_RATE, USE_NODEID_CACHE, 
>> CHECKPOINT_INTERVAL);
>> RandomForestModel model = RandomForest.trainRegressor(labeledPoints.rdd(), 
>> strategy, numTrees, "auto", RANDOM_SEED);
>>
>>
>> Any advice would be highly appreciated.
>>
>> The exception (~3000 lines long):
>>  java.lang.StackOverflowError
>> at
>> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2320)
>> at
>> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2333)
>> at
>> java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:2828)
>> at
>> java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1453)
>> at
>> java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1512)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>> at
>> java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>> at
>> scala.collection.immutable.$colon$colon.readObject(List.scala:366)
>> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at
>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>> at
>> 

Re: Handling Missing Values in MLLIB Decision Tree

2016-03-22 Thread Joseph Bradley
It does not currently handle surrogate splits.  You will need to preprocess
your data to remove or fill in missing values.  I'd recommend using the
DataFrame API for that since it comes with a number of na methods.
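For instance, df.na.drop() removes rows containing nulls and df.na.fill(...) imputes them with fixed values. The common fill-with-mean preprocessing step, sketched in plain Python rather than Spark:

```python
def fill_missing_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]

# Mean of the observed values 25, 35, 30 is 30, which fills both gaps.
ages = [25.0, None, 35.0, None, 30.0]
assert fill_missing_with_mean(ages) == [25.0, 30.0, 35.0, 30.0, 30.0]
```
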
Joseph

On Thu, Mar 17, 2016 at 9:51 PM, Abir Chakraborty 
wrote:

> Hello,
>
>
>
> Can MLLIB Decision Tree (DT) handle missing values by having surrogate
> split (as it is currently being done in “rpart” library in R)?
>
>
>
> Thanks,
>
> Abir
> --
>
> *Principal Data Scientist, Data Science Group, Innovation Labs*
>
> *[24]**7 **Inc. - *The Intuitive Consumer Experience Company™ *|* *We
> make life simple for consumers to connect with companies to get things done*
>
> Mobile: +91-9880755850 *|* e-mail: abi...@247-inc.com
>   Prestige Tech Platina, Kadubeesanahalli, Marathahalli Outer Ring Road
> *|* Bangalore 560087 *|* India *|* www.247-inc.com
>
>
>


Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
The indexing I mentioned is more restrictive than that: each index
corresponds to a unique position in a binary tree.  (I.e., the first index
of row 0 is 1, the first of row 1 is 2, the first of row 2 is 4, etc., IIRC)

You're correct that this restriction could be removed; with some careful
thought, we could probably avoid using indices altogether.  I just created
https://issues.apache.org/jira/browse/SPARK-14043  to track this.
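Concretely, this is the classic heap layout: node i has children 2i and 2i+1, so the first index at depth d is 2^d regardless of how sparse the tree actually is, and depth 31 would already exceed a signed 32-bit Int. A small plain-Python check of that arithmetic:

```python
def first_index_of_row(depth):
    """Heap-style layout: the first node at a given depth has index 2**depth."""
    return 2 ** depth

def children(i):
    """Positional children of node i in the implicit binary tree."""
    return 2 * i, 2 * i + 1

assert (first_index_of_row(0), first_index_of_row(1), first_index_of_row(2)) == (1, 2, 4)
assert children(1) == (2, 3)

# Depth 30 is the last row whose indices fit in a signed 32-bit Int:
assert first_index_of_row(31) - 1 == 2**31 - 1  # largest index at depth 30
assert first_index_of_row(31) > 2**31 - 1       # depth 31 overflows Int
```
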

On Mon, Mar 21, 2016 at 11:22 AM, Eugene Morozov <evgeny.a.moro...@gmail.com
> wrote:

> Hi, Joseph,
>
> I thought I understood, why it has a limit of 30 levels for decision tree,
> but now I'm not that sure. I thought that's because the decision tree
> stored in the array, which has length of type int, which cannot be more,
> than 2^31-1.
> But here are my new discoveries. I've trained two different random forest
> models of 50 trees and different maxDepth (20 and 30) and specified node
> size = 5. Here are couple of those trees
>
> Model with maxDepth = 20:
> depth=20, numNodes=471
> depth=19, numNodes=497
>
> Model with maxDepth = 30:
> depth=30, numNodes=11347
> depth=30, numNodes=10963
>
> It looks like the tree is not pretty balanced and I understand why that
> happens, but I'm surprised that actual number of nodes way less, than 2^31
> - 1. And now I'm not sure of why the limitation actually exists. With tree
> that consist of 2^31 nodes it'd required to have 8G of memory just to store
> those indexes, so I'd say that depth isn't the biggest issue in such a
> case.
>
> Is it possible to workaround or simply miss maxDepth limitation (without
> codebase modification) to train the tree until I hit the max number of
> nodes? I'd assume that in most cases I simply won't hit it, but the depth
> of the tree would be much more, than 30.
>
>
> --
> Be well!
> Jean Morozov
>
> On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Hi Eugene,
>>
>> The maxDepth parameter exists because the implementation uses Integer
>> node IDs which correspond to positions in the binary tree.  This simplified
>> the implementation.  I'd like to eventually modify it to avoid depending on
>> tree node IDs, but that is not yet on the roadmap.
>>
>> There is not an analogous limit for the GLMs you listed, but I'm not very
>> familiar with the perceptron implementation.
>>
>> Joseph
>>
>> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <
>> evgeny.a.moro...@gmail.com> wrote:
>>
>>> Hello!
>>>
>>> I'm currently working on POC and try to use Random Forest
>>> (classification and regression). I also have to check SVM and Multiclass
>>> perceptron (other algos are less important at the moment). So far I've
>>> discovered that Random Forest has a limitation of maxDepth for trees and
>>> just out of curiosity I wonder why such a limitation has been introduced?
>>>
>>> An actual question is that I'm going to use Spark ML in production next
>>> year and would like to know if there are other limitations like maxDepth in
>>> RF for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>>>
>>> Thanks in advance for your time.
>>> --
>>> Be well!
>>> Jean Morozov
>>>
>>
>>
>


Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
Spark devs & users,

I want to bring attention to a proposal to merge the MLlib (spark.ml)
concepts of Estimator and Model in Spark 2.0.  Please comment & discuss on
SPARK-14033  (not in
this email thread).

*TL;DR:*
*Proposal*: Merge Estimator and Model under a single abstraction
(Estimator).
*Goals*: Simplify API by combining the tightly coupled concepts of
Estimator & Model.  Match other ML libraries like scikit-learn.  Simplify
mutability semantics.

*Details*: See https://issues.apache.org/jira/browse/SPARK-14033 for a
design document (Google doc & PDF).
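For comparison, the scikit-learn convention the proposal points to keeps a single object: fit() stores the learned state on the estimator itself and returns it, so fitting and transforming share one abstraction. A minimal plain-Python sketch of that shape (a hypothetical toy class, not the proposed Spark API):

```python
class MeanCenterer:
    """A toy estimator-and-model in one object, scikit-learn style."""

    def fit(self, xs):
        self.mean_ = sum(xs) / len(xs)  # learned state lives on the estimator
        return self                      # fit returns the (now-fitted) object

    def transform(self, xs):
        return [x - self.mean_ for x in xs]

# One object covers both roles; calls chain because fit returns self.
centered = MeanCenterer().fit([1.0, 2.0, 3.0]).transform([4.0])
assert centered == [2.0]
```
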

Thanks in advance for feedback!
Joseph


Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
Hi,

This is more a question for the user list, not the dev list, so I'll CC
user.

If you're using mllib.clustering.LDAModel (RDD API), then can you make sure
you're using a LocalLDAModel (or convert to it from DistributedLDAModel)?
You can then call topicDistributions() on the new data.

If you're using ml.clustering.LDAModel (DataFrame API), then you can call
transform() on new data.

Does that work?

Joseph

On Tue, Jan 19, 2016 at 6:21 AM, doruchiulan  wrote:

> Hi,
>
> Just so you know, I am new to Spark, and also very new to ML (this is my
> first contact with ML).
>
> Ok, I am trying to write a DSL where you can run some commands.
>
> I did a command that trains the Spark LDA and it produces the topics I want
> and I saved it using the save method provided by the LDAModel.
>
> Now I want to load this LDAModel and use it to predict on a new set of
> data.
> I call the load method, obtain the LDAModel instance but here I am stuck.
>
> Isnt this possible ? Am I wrong in the way I understood LDA and we cannot
> reuse trained LDA to analyse new data ?
>
> If its possible can you point me to some documentation, or give me a hint
> on
> how should I do that.
>
> Thx
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-LDA-model-reuse-with-new-set-of-data-tp16047.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a
problem with the Parquet dependency.  What version of Parquet are you
building Spark 1.5 off of?  (I'm not that familiar with Parquet issues
myself, but hopefully a SQL person can chime in.)

On Tue, Dec 15, 2015 at 3:23 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> I have recently upgraded my Spark version, but when I try to save a random
> forest model using the model save command I get a NoSuchMethodError.  My
> code works fine with the 1.3.x version.
>
>
>
> model.save(sc.sc(), "modelsavedir");
>
>
>
>
>
> ERROR:
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation -
> Aborting job.
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 22.0 (TID 230, localhost): java.lang.NoSuchMethodError:
> parquet.schema.Types$GroupBuilder.addField(Lparquet/schema/Type;)Lparquet/schema/Types$BaseGroupBuilder;
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:517)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convertField$1.apply(CatalystSchemaConverter.scala:516)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>
> at
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:516)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:312)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$convert$1.apply(CatalystSchemaConverter.scala:305)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.Iterator$class.foreach(Iterator.scala:727)
>
> at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> at
> scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>
> at
> org.apache.spark.sql.types.StructType.foreach(StructType.scala:92)
>
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at
> org.apache.spark.sql.types.StructType.map(StructType.scala:92)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:305)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromAttributes(ParquetTypesConverter.scala:58)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:55)
>
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:287)
>
> at
> parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:261)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
>
> at
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
>
> at
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:233)
>
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>
> at
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>


Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
Hi Eugene,

The maxDepth parameter exists because the implementation uses Integer node
IDs which correspond to positions in the binary tree.  This simplified the
implementation.  I'd like to eventually modify it to avoid depending on
tree node IDs, but that is not yet on the roadmap.
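
To see where the limit comes from: with heap-style node IDs (root = 1, children of node i at 2i and 2i + 1, a common layout; Spark's exact indexing may differ in detail), the deepest node at depth d has ID 2^(d+1) - 1, which must fit in a 32-bit Int:

```python
def max_node_id(depth):
    # deepest node ID in a complete binary tree at this depth
    return 2 ** (depth + 1) - 1

INT_MAX = 2 ** 31 - 1  # Scala/Java Int
deepest = max(d for d in range(100) if max_node_id(d) <= INT_MAX)
print(deepest)  # 30
```

This matches the maxDepth <= 30 restriction documented for MLlib decision trees.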

There is not an analogous limit for the GLMs you listed, but I'm not very
familiar with the perceptron implementation.

Joseph

On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov  wrote:

> Hello!
>
> I'm currently working on POC and try to use Random Forest (classification
> and regression). I also have to check SVM and Multiclass perceptron (other
> algos are less important at the moment). So far I've discovered that Random
> Forest has a limitation of maxDepth for trees and just out of curiosity I
> wonder why such a limitation has been introduced?
>
> An actual question is that I'm going to use Spark ML in production next
> year and would like to know if there are other limitations like maxDepth in
> RF for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>
> Thanks in advance for your time.
> --
> Be well!
> Jean Morozov
>


Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
You can do grid search if you set the evaluator to a
MulticlassClassificationEvaluator, which expects a prediction column, not a
rawPrediction column.  There's a JIRA for making
BinaryClassificationEvaluator accept prediction instead of rawPrediction.
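
Grid search itself is conceptually simple: evaluate every parameter combination with an evaluator that needs only a hard prediction column, and keep the best. A sketch (the "model" and evaluator here are toy stand-ins, not Spark's CrossValidator):

```python
from itertools import product

def accuracy(pairs):
    # a "prediction"-based evaluator: fraction of (prediction, label) matches
    return sum(p == y for p, y in pairs) / len(pairs)

def train_and_predict(depth, rate, data):
    # stand-in "model": predicts 1 when x exceeds a param-derived cutoff
    cut = depth * rate
    return [(1 if x > cut else 0, y) for x, y in data]

data = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
grid = product([1, 2, 3], [0.1, 0.25])          # depths x learning rates
best = max(grid, key=lambda p: accuracy(train_and_predict(*p, data)))
print(best)  # (2, 0.25)
```
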
Joseph

On Tue, Dec 1, 2015 at 5:10 AM, Benjamin Fradet <benjamin.fra...@gmail.com>
wrote:

> Someone correct me if I'm wrong but no there isn't one that I am aware of.
>
> Unless someone is willing to explain how to obtain the raw prediction
> column with the GBTClassifier. In this case I'd be happy to work on a PR.
> On 1 Dec 2015 8:43 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote:
>
>> Hi Benjamin,
>>
>> Thanks, the documentation you sent is clear.
>> Is there any other way to perform a Grid Search with GBT?
>>
>>
>> Ndjido
>> On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet <benjamin.fra...@gmail.com>
>> wrote:
>>
>>> Hi Ndjido,
>>>
>>> This is because GBTClassifier doesn't yet have a rawPredictionCol like
>>> the RandomForestClassifier has.
>>> Cf:
>>> http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1
>>> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote:
>>>
>>>> Hi Joseph,
>>>>
>>>> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'm getting
>>>> a "rawPredictionCol field does not exist exception" on Spark 1.5.2 for
>>>> Gradient Boosting Trees classifier.
>>>>
>>>>
>>>> Ardo
>>>> On Tue, 1 Dec 2015 at 01:34, Joseph Bradley <jos...@databricks.com>
>>>> wrote:
>>>>
>>>>> It should work with 1.5+.
>>>>>
>>>>> On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar <ndj...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> Does anyone know whether the Grid Search capability is enabled since
>>>>>> the issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
>>>>>> column doesn't exist" when trying to perform a grid search with Spark 
>>>>>> 1.4.0.
>>>>>>
>>>>>> Cheers,
>>>>>> Ardo
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>


Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+.

On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar  wrote:

>
> Hi folks,
>
> Does anyone know whether the Grid Search capability is enabled since the
> issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol
> column doesn't exist" when trying to perform a grid search with Spark 1.4.0.
>
> Cheers,
> Ardo
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi,
Could you please submit this via JIRA as a bug report?  It will be very
helpful if you include the Spark version, system details, and other info
too.
Thanks!
Joseph

On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> *Issue:*
>
> I have a random forest model that am trying to load during streaming using
> following code.  The code is working fine when I am running the code from
> Eclipse but getting NPE when running the code using spark-submit.
>
>
>
> JavaStreamingContext jssc = new JavaStreamingContext(*jsc*, Durations.
> *seconds*(duration));
>
> System.*out*.println("& trying to get the context
> &&& " );
>
> final RandomForestModel model = 
> RandomForestModel.*load*(jssc.sparkContext().sc(),
> *MODEL_DIRECTORY*);//line 116 causing the issue.
>
> System.*out*.println("& model debug
> &&& " + model.toDebugString());
>
>
>
>
>
> *Exception Details:*
>
> INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 2.0,
> whose tasks have all completed, from pool
>
> Exception in thread "main" java.lang.NullPointerException
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$SplitData.toSplit(DecisionTreeModel.scala:144)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$16.apply(DecisionTreeModel.scala:291)
>
> at scala.Option.map(Option.scala:145)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:291)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:287)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructNode(DecisionTreeModel.scala:286)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTree(DecisionTreeModel.scala:268)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:251)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$$anonfun$12.apply(DecisionTreeModel.scala:250)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at
> org.apache.spark.mllib.tree.model.DecisionTreeModel$SaveLoadV1_0$.constructTrees(DecisionTreeModel.scala:250)
>
> at
> org.apache.spark.mllib.tree.model.TreeEnsembleModel$SaveLoadV1_0$.loadTrees(treeEnsembleModels.scala:340)
>
> at
> org.apache.spark.mllib.tree.model.RandomForestModel$.load(treeEnsembleModels.scala:72)
>
> at
> org.apache.spark.mllib.tree.model.RandomForestModel.load(treeEnsembleModels.scala)
>
> at
> com.markmonitor.antifraud.ce.KafkaURLStreaming.main(KafkaURLStreaming.java:116)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>
> at
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>
> at
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> Nov 19, 2015 1:10:56 PM WARNING: parquet.hadoop.ParquetRecordReader: Can
> not initialize counter due 

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about
"""
1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree based
methods. How about I open a jira for this?
"""

--> MLlib trees do this currently, so you could check out that code as an
example.
I'm not sure how this would work as a generic transformer, though; it seems
more like an internal part of space-partitioning algorithms.
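
The sorting trick under discussion (for regression and binary classification) can be sketched outside Spark: order a categorical feature's values by their mean label, and then only consider prefix splits of that order, reducing the 2^(q-1) - 1 subset partitions to q - 1 candidates without losing the optimal split:

```python
from collections import defaultdict

data = [("red", 1.0), ("blue", 0.0), ("red", 1.0),
        ("green", 0.0), ("blue", 1.0), ("green", 0.0)]

# accumulate (sum of labels, count) per category
sums = defaultdict(lambda: [0.0, 0])
for cat, label in data:
    sums[cat][0] += label
    sums[cat][1] += 1

# order categories by mean label; candidate splits are prefixes of this order
order = sorted(sums, key=lambda c: sums[c][0] / sums[c][1])
splits = [set(order[:i]) for i in range(1, len(order))]
print(order)  # ['green', 'blue', 'red']
```

Whether this belongs in a generic Transformer or stays internal to the tree code is exactly the open question above: the useful ordering depends on the label, so it is less a stateless feature transform than a label-aware encoding.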



On Tue, Oct 27, 2015 at 5:04 PM, Meihua Wu 
wrote:

> Hi DB Tsai,
>
> Thank you again for your insightful comments!
>
> 1) I agree the sorting method you suggested is a very efficient way to
> handle the unordered categorical variables in binary classification
> and regression. I propose we have a Spark ML Transformer to do the
> sorting and encoding, bringing the benefits to many tree based
> methods. How about I open a jira for this?
>
> 2) For L2/L1 regularization vs. learning rate (I use this name instead of
> shrinkage to avoid confusion), I have the following observations:
>
> Suppose G and H are the sum (over the data assigned to a leaf node) of
> the 1st and 2nd derivative of the loss evaluated at f_m, respectively.
> Then for this leaf node,
>
> * With a learning rate eta, f_{m+1} = f_m - G/H*eta
>
> * With an L2 regularization coefficient lambda, f_{m+1} = f_m - G/(H+lambda)
>
> If H>0 (convex loss), both approaches lead to "shrinkage":
>
> * For the learning rate approach, the percentage of shrinkage is
> uniform for any leaf node.
>
> * For L2 regularization, the percentage of shrinkage would adapt to
> the number of instances assigned to a leaf node: more instances =>
> larger G and H => less shrinkage. This behavior is intuitive to me. If
> the value estimated from this node is based on a large amount of data,
> the value should be reliable and less shrinkage is needed.
>
> I suppose we could have something similar for L1.
>
> I am not aware of theoretical results to conclude which method is
> better. Likely to be dependent on the data at hand. Implementing
> learning rate is on my radar for version 0.2. I should be able to add
> it in a week or so. I will send you a note once it is done.
>
> Thanks,
>
> Meihua
>
> On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai  wrote:
> > Hi Meihua,
> >
> > For categorical features, the ordinal issue can be solved by trying
> > all kind of different partitions 2^(q-1) -1 for q values into two
> > groups. However, it's computational expensive. In Hastie's book, in
> > 9.2.4, the trees can be trained by sorting the residuals and being
> > learnt as if they are ordered. It can be proven that it will give the
> > optimal solution. I have a proof that this works for learning
> > regression trees through variance reduction.
> >
> > I'm also interested in understanding how the L1 and L2 regularization
> > within the boosting works (and if it helps with overfitting more than
> > shrinkage).
> >
> > Thanks.
> >
> > Sincerely,
> >
> > DB Tsai
> > --
> > Web: https://www.dbtsai.com
> > PGP Key ID: 0xAF08DF8D
> >
> >
> > On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu 
> wrote:
> >> Hi DB Tsai,
> >>
> >> Thank you very much for your interest and comment.
> >>
> >> 1) feature sub-sample is per-node, like random forest.
> >>
> >> 2) The current code heavily exploits the tree structure to speed up
> >> the learning (such as processing multiple learning node in one pass of
> >> the training data). So a generic GBM is likely to be a different
> >> codebase. Do you have any nice reference of efficient GBM? I am more
> >> than happy to look into that.
> >>
> >> 3) The algorithm accept training data as a DataFrame with the
> >> featureCol indexed by VectorIndexer. You can specify which variable is
> >> categorical in the VectorIndexer. Please note that currently all
> >> categorical variables are treated as ordered. If you want some
> >> categorical variables as unordered, you can pass the data through
> >> OneHotEncoder before the VectorIndexer. I do have a plan to handle
> >> unordered categorical variable using the approach in RF in Spark ML
> >> (Please see roadmap in the README.md)
> >>
> >> Thanks,
> >>
> >> Meihua
> >>
> >>
> >>
> >> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> >>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> >>> you think you can implement generic GBM and have it merged as part of
> >>> Spark codebase?
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> --
> >>> Web: https://www.dbtsai.com
> >>> PGP Key ID: 0xAF08DF8D
> >>>
> >>>
> >>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
> >>>  wrote:
>  Hi Spark User/Dev,
> 
>  Inspired by the success 

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu,

The spark.ml classes are part of the higher-level "Pipelines" API, which
works with DataFrames.  When creating this API, we decided to separate it
from the old API to avoid confusion.  You can read more about it here:
http://spark.apache.org/docs/latest/ml-guide.html

For (3): We use Breeze, but we have to modify it in order to do distributed
optimization based on Spark.

Joseph

On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu  wrote:

> Hi everyone,
>
> I'm curious about the difference between
> ml.classification.LogisticRegression and
> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
> optimized using LBFGS, the only difference I see is LogisticRegression
> takes DataFrame while LogisticRegressionWithLBFGS takes RDD.
>
> So I wonder,
> 1. Why not simply add a DataFrame training interface to
> LogisticRegressionWithLBFGS?
> 2. Whats the difference between ml.classification and
> mllib.classification package?
> 3. Why doesn't ml.classification.LogisticRegression call
> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
> it uses breeze.optimize.LBFGS and re-implements most of the procedures
> in mllib.optimization.{LBFGS,OWLQN}.
>
> Thank you.
>
> Best,
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Serializing MLlib MatrixFactorizationModel

2015-08-17 Thread Joseph Bradley
I'd recommend using the built-in save and load, which will be better for
cross-version compatibility.  You should be able to call
myModel.save(path), and load it back with
MatrixFactorizationModel.load(path).

On Mon, Aug 17, 2015 at 6:31 AM, Madawa Soysa madawa...@cse.mrt.ac.lk
wrote:

 Hi All,

 I have an issue when i try to serialize a MatrixFactorizationModel object
 as a java object in a Java application. When I deserialize the object, I
 get the following exception.

 Caused by: java.lang.ClassNotFoundException:
 org.apache.spark.OneToOneDependency cannot be found by
 org.scala-lang.scala-library_2.10.4.v20140209-180020-VFINAL-b66a39653b

 Any solution for this?

 --

 *_**Madawa Soysa*

 Undergraduate,

 Department of Computer Science and Engineering,

 University of Moratuwa.


 Mobile: +94 71 461 6050 | Email: madawa...@cse.mrt.ac.lk
 LinkedIn http://lk.linkedin.com/in/madawasoysa | Twitter
 https://twitter.com/madawa_rc | Tumblr http://madawas.tumblr.com/



Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
Please checkout the Spark source from Github, and look here:
https://github.com/apache/spark/tree/master/examples/src/main

On Fri, Jul 24, 2015 at 8:43 PM, Chintan Bhatt 
chintanbhatt...@charusat.ac.in wrote:

 Hi.
 Can I know how to get such folder/code for spark implementation?

 On Sat, Jul 25, 2015 at 8:07 AM, Joseph Bradley jos...@databricks.com
 wrote:

 I'd recommend starting with a few of the code examples to get a sense of
 Spark usage (in the examples/ folder when you check out the code).  Then,
 you can work through the Spark methods they call, tracing as deep as needed
 to understand the component you are interested in.

 You can also find an interesting (small) JIRA, examine the piece of code
 it mentions, and explore out from that initial entry point.  That's how I
 mostly did it.  Good luck!

 Joseph

 On Fri, Jul 24, 2015 at 10:48 AM, shashank kapoor 
 shashank.prof...@gmail.com wrote:



 Hi guys,
 I am new to apache spark, I wanted to start contributing to this
 project. But before that I need to understand the basic coding flow here. I
 read "How to contribute to Apache Spark" but I couldn't find a way to
 start reading the code and understanding the code flow. Can anyone tell
 me the entry point for the code.

 --
 Regards
 Shashank Kapoor





 --
 CHINTAN BHATT http://in.linkedin.com/pub/chintan-bhatt/22/b31/336/
 Assistant Professor,
 U  P U Patel Department of Computer Engineering,
 Chandubhai S. Patel Institute of Technology,
 Charotar University of Science And Technology (CHARUSAT),
 Changa-388421, Gujarat, INDIA.
 http://www.charusat.ac.in
 *Personal Website*: https://sites.google.com/a/ecchanga.ac.in/chintan/



Re: want to contribute to apache spark

2015-07-24 Thread Joseph Bradley
I'd recommend starting with a few of the code examples to get a sense of
Spark usage (in the examples/ folder when you check out the code).  Then,
you can work through the Spark methods they call, tracing as deep as needed
to understand the component you are interested in.

You can also find an interesting (small) JIRA, examine the piece of code it
mentions, and explore out from that initial entry point.  That's how I
mostly did it.  Good luck!

Joseph

On Fri, Jul 24, 2015 at 10:48 AM, shashank kapoor 
shashank.prof...@gmail.com wrote:



 Hi guys,
 I am new to apache spark, I wanted to start contributing to this project.
 But before that I need to understand the basic coding flow here. I read
 How to contribute to apache spark but I couldn't find any way to start
 reading the code and start understanding Code Flow. Can anyone tell me
 entry point for the code.

 --
 Regards
 Shashank Kapoor



Re: ALS Rating Object

2015-06-03 Thread Joseph Bradley
Hi Yasemin,

If you can convert your user IDs to Integers in pre-processing (if you have
fewer than a couple billion users), that would work.  Otherwise...
In Spark 1.3: You may need to modify ALS to use Long instead of Int.
In Spark 1.4: spark.ml.recommendation.ALS (in the Pipeline API) exposes
ALS.train as a DeveloperApi to allow users to use Long instead of Int.
We're also thinking about better ways to permit Long IDs.
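
The pre-processing suggested above amounts to building a dense index over the raw IDs before training, and keeping the reverse mapping to translate recommendations back. A sketch (plain Python; in Spark this would be a pair of joins or a broadcast lookup):

```python
# map arbitrary user IDs (here, big-number strings) to dense Int indexes
user_ids = ["30011397223227125563254", "912", "30011397223227125563254", "77"]

id_to_int, int_to_id = {}, []
for uid in user_ids:
    if uid not in id_to_int:
        id_to_int[uid] = len(int_to_id)   # next unused dense index
        int_to_id.append(uid)             # reverse mapping

encoded = [id_to_int[u] for u in user_ids]
print(encoded)       # [0, 1, 0, 2]
print(int_to_id[2])  # '77'
```
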

Joseph

On Wed, Jun 3, 2015 at 5:04 AM, Yasemin Kaya godo...@gmail.com wrote:

 Hi,

 I want to use Spark's ALS in my project. I have user IDs
 like 30011397223227125563254, but the Rating object of ALS
 wants an Integer as the user ID, so the ID field does not fit into a 32-bit
 Integer. How can I solve that? Thanks.

 Best,
 yasemin
 --
 hiç ender hiç



Re: Restricting the number of iterations in Mllib Kmeans

2015-06-01 Thread Joseph Bradley
Hi Suman  Meethu,
Apologies---I was wrong about KMeans supporting an initial set of
centroids!  JIRA created: https://issues.apache.org/jira/browse/SPARK-8018
If you're interested in submitting a PR, please do!
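
For readers following along, the two requests in this thread (a hard maxIterations cap, and user-supplied initial centroids) amount to the two knobs below, sketched as a toy 1-D k-means. Names are illustrative; this is not Spark's KMeans API:

```python
def kmeans(points, centroids, max_iterations):
    centroids = list(centroids)              # user-supplied initial centroids
    for _ in range(max_iterations):          # hard cap, even if not converged
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        new = [sum(c) / len(c) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                 # converged early
            break
        centroids = new
    return centroids

print(kmeans([1.0, 2.0, 10.0, 11.0], centroids=[0.0, 5.0], max_iterations=20))
# [1.5, 10.5]
```
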
Thanks,
Joseph

On Mon, Jun 1, 2015 at 2:25 AM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:

 Hi Joseph,
 I was unable to find any function in Kmeans.scala where the initial
 centroids could be specified by the user. Kindly help.

 Thanks  Regards,
 Meethu M



   On Tuesday, 19 May 2015 6:54 AM, Joseph Bradley jos...@databricks.com
 wrote:


 Hi Suman,

 For maxIterations, are you using the DenseKMeans.scala example code?  (I'm
 guessing yes since you mention the command line.)  If so, then you should
 be able to specify maxIterations via an extra parameter like
 --numIterations 50 (note the example uses numIterations in the current
 master instead of maxIterations, which is sort of a bug in the example).
 If that does not cap the max iterations, then please report it as a bug.

 To specify the initial centroids, you will need to modify the DenseKMeans
 example code.  Please see the KMeans API docs for details.

 Good luck,
 Joseph

 On Mon, May 18, 2015 at 3:22 AM, MEETHU MATHEW meethu2...@yahoo.co.in
 wrote:

 Hi,
 I think you cant supply an initial set of centroids to kmeans

 Thanks  Regards,
 Meethu M



   On Friday, 15 May 2015 12:37 AM, Suman Somasundar 
 suman.somasun...@oracle.com wrote:


 Hi,,

 I want to run a definite number of iterations in Kmeans.  There is a
 command line argument to set maxIterations, but even if I set it to a
 number, Kmeans runs until the centroids converge.
 Is there a specific way to specify it in command line?

 Also, I wanted to know if we can supply the initial set of centroids to
 the program instead of it choosing the centroids in random?

 Thanks,
 Suman.








Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-30 Thread Joseph Bradley
This is really getting into an understanding of how optimization and GLMs
work.  I'd recommend reading some intro ML or stats literature on how
Generalized Linear Models are estimated, as well as how convex optimization
is used in ML.  There are some free online texts as well as MOOCs which
have good intros.  (There is also the upcoming ML with Spark MOOC!)

On Fri, May 29, 2015 at 3:11 AM, SparknewUser melanie.galloi...@gmail.com
wrote:

 I've tried several different couple of parameters for my
 LogisticRegressionWithSGD and here are my results.
 My numIterations varies from 100 to 500 by 50 and my stepSize varies from
 0.1 to 1 by 0.1.
 My last line represents the maximum of each column and my last column the
 maximum of each line; the values grow and then shrink. What is the logic?

 My maximum is for the couple (numIter,StepSize)=(0.4,200)

 numIter \ stepSize  0,1   0,2   0,3   0,4   0,5   0,6   0,7   0,8   0,9   1     line max
 100                 0,67  0,69  0,50  0,48  0,50  0,69  0,70  0,50  0,66  0,55  0,70
 150                 0,50  0,51  0,50  0,50  0,50  0,50  0,53  0,50  0,53  0,68  0,68
 200                 0,67  0,71  0,64  0,74  0,50  0,70  0,71  0,71  0,50  0,50  0,74
 250                 0,50  0,50  0,55  0,50  0,50  0,50  0,73  0,55  0,50  0,50  0,73
 300                 0,67  0,50  0,50  0,67  0,50  0,67  0,72  0,48  0,66  0,67  0,72
 350                 0,71  0,60  0,66  0,50  0,51  0,50  0,66  0,62  0,66  0,71  0,71
 400                 0,51  0,54  0,71  0,67  0,62  0,50  0,50  0,50  0,51  0,50  0,71
 450                 0,51  0,50  0,50  0,51  0,50  0,50  0,66  0,51  0,50  0,50  0,66
 500                 0,51  0,64  0,50  0,50  0,51  0,49  0,66  0,67  0,54  0,51  0,67

 column max          0,71  0,71  0,71  0,74  0,62  0,70  0,73  0,71  0,66  0,71



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-best-performance-with-LogisticRegressionWithSGD-tp23053p23082.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-30 Thread Joseph Bradley
Spark 1.4 should be available next month, but I'm not sure about the exact
date.
Your interpretation of high lambda is reasonable.  What counts as a "high"
lambda is really data-dependent.
lambda is the same as the "regParam" parameter in Spark, available in all
recent Spark versions.
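
The "high lambda zeroes out coefficients" behaviour comes from L1's soft-thresholding of the weights. A toy illustration (not Spark's actual optimizer, which solves the full regularized problem iteratively):

```python
def soft_threshold(w, lam):
    # L1's proximal operator: shrink toward zero, and clamp small weights to 0
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [0.9, -0.05, 0.3, -0.6, 0.02]
shrunk = [soft_threshold(w, 0.25) for w in weights]
print(shrunk)  # the small weights are driven exactly to 0.0
```

The larger lambda is, the more coefficients land exactly at zero, which is why a high regParam acts as a crude variable-selection mechanism.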

On Fri, May 29, 2015 at 5:35 AM, mélanie gallois 
melanie.galloi...@gmail.com wrote:

 When will Spark 1.4 be available exactly?
 To answer to Model selection can be achieved through high
 lambda resulting lots of zero in the coefficients : Do you mean that
 putting a high lambda as a parameter of the logistic regression keeps only
 a few significant variables and deletes the others with a zero in the
 coefficients? What is a high lambda for you?
 Is the lambda a parameter available in Spark 1.4 only or can I see it in
 Spark 1.3?

 2015-05-23 0:04 GMT+02:00 Joseph Bradley jos...@databricks.com:

 If you want to select specific variable combinations by hand, then you
 will need to modify the dataset before passing it to the ML algorithm.  The
 DataFrame API should make that easy to do.

 If you want to have an ML algorithm select variables automatically, then
 I would recommend using L1 regularization for now and possibly elastic net
 after 1.4 is release, per DB's suggestion.

 If you want detailed model statistics similar to what R provides, I've
 created a JIRA for discussing how we should add that functionality to
 MLlib.  Those types of stats will be added incrementally, but feedback
 would be great for prioritization:
 https://issues.apache.org/jira/browse/SPARK-7674

 To answer your question: How are the weights calculated: is there a
 correlation calculation with the variable of interest?
 -- Weights are calculated as with all logistic regression algorithms, by
 using convex optimization to minimize a regularized log loss.

 Good luck!
 Joseph

 On Fri, May 22, 2015 at 1:07 PM, DB Tsai dbt...@dbtsai.com wrote:

 In Spark 1.4, Logistic Regression with elasticNet is implemented in ML
 pipeline framework. Model selection can be achieved through high
 lambda resulting lots of zero in the coefficients.

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Fri, May 22, 2015 at 1:19 AM, SparknewUser
 melanie.galloi...@gmail.com wrote:
  I am new in MLlib and in Spark.(I use Scala)
 
  I'm trying to understand how LogisticRegressionWithLBFGS and
  LogisticRegressionWithSGD work.
  I usually use R to do logistic regressions but now I do it on Spark
  to be able to analyze Big Data.
 
  The model only returns weights and intercept. My problem is that I
 have no
  information about which variable is significant and which variable I
 had
  better
  to delete to improve my model. I only have the confusion matrix and
 the AUC
  to evaluate the performance.
 
  Is there any way to have information about the variables I put in my
 model?
  How can I try different variable combinations, do I have to modify the
  dataset
  of origin (e.g. delete one or several columns?)
  How are the weights calculated: is there a correlation calculation
 with the
  variable
  of interest?
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-how-to-get-the-best-model-with-only-the-most-significant-explanatory-variables-in-LogisticRegr-tp22993.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 --
 *Mélanie*



Re: Multilabel classification using logistic regression

2015-05-27 Thread Joseph Bradley
It looks like you are training each model i (for label i) by only using
data with label i.  You need to use all of your data to train each model so
the models can compare each label i with the other labels (roughly
speaking).

However, what you're doing is multiclass (not multilabel) classification,
which LogisticRegressionWithLBFGS already supports.  Can you not just use
LogisticRegressionWithLBFGS directly?
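To make the relabeling concrete, here is a rough pure-Python sketch of one-vs-rest (every name here is illustrative; a toy nearest-mean scorer stands in for LogisticRegressionWithLBFGS, and none of this is MLlib API):

```python
# Rough one-vs-rest sketch. Every model c is trained on ALL points,
# with labels recoded as "is class c" (1.0) vs "rest" (0.0).
# toy_train is a stand-in for a real binary learner; it scores a point
# by its closeness to the positive-class mean.

def toy_train(points, binary_labels):
    pos = [p for p, b in zip(points, binary_labels) if b == 1.0]
    mean = [sum(col) / len(pos) for col in zip(*pos)]
    # higher score = closer to the positive-class mean
    return lambda x: -sum((xi - mi) ** 2 for xi, mi in zip(x, mean))

def train_one_vs_rest(points, labels, train_binary):
    models = {}
    for c in sorted(set(labels)):
        binary = [1.0 if l == c else 0.0 for l in labels]  # relabel, keep all data
        models[c] = train_binary(points, binary)
    return models

def predict(models, x):
    # compare class scores against each other, not against a fixed threshold
    return max(models, key=lambda c: models[c](x))

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 1, 1]
models = train_one_vs_rest(points, labels, toy_train)
```

Here predict(models, [0.0, 0.5]) picks class 0 and predict(models, [5.0, 5.5]) picks class 1. Filtering the training set per model, as in the quoted code, would lose the negative examples each binary model needs.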

On Wed, May 27, 2015 at 8:53 AM, peterg pe...@garbers.me wrote:

 Hi all

 I believe I have created a multi-label classifier using LogisticRegression
 but there is one snag. No matter what features I use to get the prediction,
 it will always return the label. I feel like I need to set a threshold but
 can't seem to figure out how to do that. I attached the code below. It's
 super simple. Hopefully someone can point me in the correct direction:

 val labels = labeledPoints.map(l => l.label).take(1000).distinct // stupid hack
 val groupedRDDs = labels.map { l =>
   labeledPoints.filter(m => m.label == l)
 }.map(l => l.cache()) // should use groupBy
 val models = groupedRDDs.map(rdd =>
   new LogisticRegressionWithLBFGS().setNumClasses(101).run(rdd))
 val results = models.map(m => m.predict(Vectors.dense(query.features)))

 Thanks

 Peter



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-classification-using-logistic-regression-tp23054.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread Joseph Bradley
The model is learned using an iterative convex optimization algorithm.
numIterations, stepSize, and miniBatchFraction are parameters of that
optimizer; you can see details here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer
http://spark.apache.org/docs/latest/mllib-optimization.html

I would set miniBatchFraction at 1.0 and not mess with it.
For LogisticRegressionWithSGD, to know whether you have the other 2
parameters set correctly, you should try running with more iterations.
If running with more iterations changes your result significantly, then:
 - If the result is blowing up (really big model weights), then you need to
decrease stepSize.
 - If the result is not blowing up but keeps changing, then you need to
increase numIterations.
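The same blow-up vs. not-yet-converged behavior is easy to see on a toy problem. This is a hypothetical pure-Python sketch of full-batch gradient descent on f(w) = (w - 3)^2, not MLlib code:

```python
# Gradient descent on f(w) = (w - 3)**2, whose minimum is at w = 3.
# This mimics the stepSize / numIterations trade-off described above.

def gd(step_size, num_iterations, w0=0.0):
    w = w0
    for _ in range(num_iterations):
        grad = 2.0 * (w - 3.0)  # f'(w)
        w -= step_size * grad
    return w

converged = gd(step_size=0.1, num_iterations=100)  # lands near 3.0
diverged = gd(step_size=1.5, num_iterations=20)    # |w| blows up
```

With step_size=0.1, running more iterations stops changing the answer (converged); with step_size=1.5, the weight grows without bound, which is the "blowing up" case where stepSize must be decreased.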

You should not need to set initialWeights, but it can help if you have some
estimate already calculated.

If you have access to a build of the current Spark master (or can wait for
1.4), then the org.apache.spark.ml.classification.LogisticRegression
implementation has been compared with R and should get very similar results.

Good luck!
Joseph

On Wed, May 27, 2015 at 8:22 AM, SparknewUser melanie.galloi...@gmail.com
wrote:

 I'm new to Spark and I'm getting bad performance with classification
 methods
 on Spark MLlib (worse than R in terms of AUC).
 I am trying to put my own parameters rather than the default parameters.
 Here is the method I want to use :
  train(RDD<LabeledPoint> input,
        int numIterations,
        double stepSize,
        double miniBatchFraction,
        Vector initialWeights)
 How to choose numIterations and stepSize?
 What does miniBatchFraction mean?
 Is initialWeights necessary to have a good model? Then, how to choose them?




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-best-performance-with-LogisticRegressionWithSGD-tp23053.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread Joseph Bradley
If you want to select specific variable combinations by hand, then you will
need to modify the dataset before passing it to the ML algorithm.  The
DataFrame API should make that easy to do.

If you want to have an ML algorithm select variables automatically, then I
would recommend using L1 regularization for now and possibly elastic net
after 1.4 is released, per DB's suggestion.

If you want detailed model statistics similar to what R provides, I've
created a JIRA for discussing how we should add that functionality to
MLlib.  Those types of stats will be added incrementally, but feedback
would be great for prioritization:
https://issues.apache.org/jira/browse/SPARK-7674

To answer your question ("How are the weights calculated: is there a
correlation calculation with the variable of interest?"):
-- Weights are calculated as with all logistic regression algorithms, by
using convex optimization to minimize a regularized log loss.
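As a concrete (hypothetical, pure-Python) illustration of both points above, here is proximal gradient descent on L1-regularized log loss, where a high regularization parameter drives the weight of an uninformative feature to exactly zero. This is a sketch of the idea, not the MLlib implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(z, t):
    # Proximal operator of the L1 penalty; this is what produces exact zeros.
    return math.copysign(max(abs(z) - t, 0.0), z)

def l1_logistic(xs, ys, lam, step=0.5, iters=500):
    """Minimize mean log loss + lam * ||w||_1 by proximal gradient (ISTA)."""
    d, n = len(xs[0]), len(xs)
    w = [0.0] * d
    for _ in range(iters):
        grads = [0.0] * d
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for j in range(d):
                grads[j] += (p - y) * x[j] / n
        w = [soft_threshold(wj - step * gj, step * lam)
             for wj, gj in zip(w, grads)]
    return w

# Feature 0 determines the label; feature 1 is small noise.
xs = [(1.0, 0.1), (2.0, -0.1), (-1.0, 0.05), (-2.0, -0.05)]
ys = [1.0, 1.0, 0.0, 0.0]
w = l1_logistic(xs, ys, lam=0.1)
```

w[0] stays well away from zero while w[1] is driven to exactly 0.0: the "high lambda gives lots of zeros in the coefficients" effect DB mentioned, which is what makes L1 useful for variable selection.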

Good luck!
Joseph

On Fri, May 22, 2015 at 1:07 PM, DB Tsai dbt...@dbtsai.com wrote:

 In Spark 1.4, Logistic Regression with elasticNet is implemented in ML
  pipeline framework. Model selection can be achieved through a high
  lambda, resulting in lots of zeros in the coefficients.

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Fri, May 22, 2015 at 1:19 AM, SparknewUser
 melanie.galloi...@gmail.com wrote:
   I am new to MLlib and to Spark. (I use Scala.)
 
  I'm trying to understand how LogisticRegressionWithLBFGS and
  LogisticRegressionWithSGD work.
  I usually use R to do logistic regressions but now I do it on Spark
  to be able to analyze Big Data.
 
   The model only returns weights and intercept. My problem is that I
   have no information about which variable is significant and which
   variable I had better delete to improve my model. I only have the
   confusion matrix and the AUC to evaluate the performance.
  
   Is there any way to have information about the variables I put in my
   model? How can I try different variable combinations? Do I have to
   modify the dataset of origin (e.g. delete one or several columns)?
   How are the weights calculated: is there a correlation calculation
   with the variable of interest?
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-how-to-get-the-best-model-with-only-the-most-significant-explanatory-variables-in-LogisticRegr-tp22993.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 





Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Joseph Bradley
One more comment: That's a lot of categories for a feature.  If it makes
sense for your data, it will run faster if you can group the categories or
split the 1895 categories into a few features which have fewer categories.
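One simple (hypothetical) way to do that grouping in pure Python before handing the column to Spark: keep the most frequent categories and fold the long tail into a single catch-all value:

```python
from collections import Counter

def cap_categories(values, max_categories):
    """Keep the (max_categories - 1) most frequent values; map the rest to "OTHER"."""
    keep = {c for c, _ in Counter(values).most_common(max_categories - 1)}
    return [v if v in keep else "OTHER" for v in values]

vals = ["a", "a", "a", "b", "b", "c", "d", "e"]
capped = cap_categories(vals, max_categories=3)
```

The column now has at most 3 distinct values ("a", "b", "OTHER"), so a small maxBins suffices. Whether merging rare categories is acceptable depends on the data.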

On Wed, May 20, 2015 at 3:17 PM, Burak Yavuz brk...@gmail.com wrote:

 Could you please open a JIRA for it? The maxBins input is missing from the
 Python API.

 Is it possible for you to use the current master? In the current master,
 you should be able to use trees with the Pipeline API and DataFrames.

 Best,
 Burak

 On Wed, May 20, 2015 at 2:44 PM, Don Drake dondr...@gmail.com wrote:

 I'm running Spark v1.3.1 and when I run the following against my dataset:

 model = GradientBoostedTrees.trainRegressor(trainingData,
     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)

 The job will fail with the following message:
 Traceback (most recent call last):
   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
     model = GradientBoostedTrees.trainRegressor(trainingData,
       categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor
     loss, numIterations, learningRate, maxDepth)
   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train
     loss, numIterations, learningRate, maxDepth)
   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc
     return callJavaFunc(sc, api, *args)
   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc
     return _java2py(sc, func(*args))
   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
   File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
 py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel.
 : java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895)
   at scala.Predef$.require(Predef.scala:233)
   at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
   at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
   at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
   at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
   at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
   at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
   at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)

 So, it's complaining about maxBins. If I provide maxBins=1900 and
 re-run it:

 model = GradientBoostedTrees.trainRegressor(trainingData,
     categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)

 Traceback (most recent call last):
   File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
     model = GradientBoostedTrees.trainRegressor(trainingData,
       categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
 TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'

 It now says it knows nothing of maxBins.

 If I run the same command against DecisionTree or RandomForest (with
 maxBins=1900) it works just fine.

 Seems like a bug in GradientBoostedTrees.

 Suggestions?

 -Don

 --
 Donald Drake
 Drake Consulting
 http://www.drakeconsulting.com/
 800-733-2143





Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Joseph Bradley
Hi Xin,

2 suggestions:

1) Feature scaling: spark.mllib's LogisticRegressionWithLBFGS uses feature
scaling, which scales feature values to have unit standard deviation.  That
improves optimization behavior, and it often improves statistical
estimation (though maybe not for your dataset).  However, it effectively
changes the model being learned, so you should expect different results
from other libraries like R.  You could instead use LogisticRegressionWithSGD,
which does not do feature scaling.  With SGD, you may need to play around
with the stepSize more to get it to converge, but it should be able to
learn exactly the same model as R.
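To see why the reported coefficients differ even when the underlying model is equivalent, note that dividing a feature by its standard deviation multiplies the matching coefficient by that standard deviation (exactly so without regularization; regularization breaks the exact equivalence). A small pure-Python illustration, where the weight value is made up for the example:

```python
import math

def std(col):
    # population standard deviation
    m = sum(col) / len(col)
    return math.sqrt(sum((v - m) ** 2 for v in col) / len(col))

gpa = [3.61, 3.67, 4.00, 3.19, 2.93]
s = std(gpa)
scaled = [v / s for v in gpa]   # unit-standard-deviation version of the feature

w_unscaled = 0.722              # hypothetical coefficient on raw GPA
w_scaled = w_unscaled * s       # equivalent coefficient on scaled GPA

# The linear score w * x is identical under the two parameterizations:
for v, vs in zip(gpa, scaled):
    assert abs(w_unscaled * v - w_scaled * vs) < 1e-9
```

So to compare scaled-space weights against R's, divide each weight by the corresponding feature's standard deviation (again, only exact when regularization is off).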

2) Convergence: I'd do a sanity check and make sure the algorithm is
converging.  (Compare with running for more iterations or using a lower
convergenceTol.)

Note: If you can use the Spark master branch (or wait for Spark 1.4), then
the spark.ml Pipelines API will be a good option.  It now has
LogisticRegression which does not do feature scaling, and it uses LBFGS or
OWLQN (depending on the regularization type) for optimization.  It's also
been compared with R in unit tests.

Good luck!
Joseph

On Wed, May 20, 2015 at 3:42 PM, Xin Liu liuxin...@gmail.com wrote:

 Hi,

 I have tried a few models in Mllib to train a LogisticRegression model.
 However, I consistently get much better results using other libraries such
 as statsmodel (which gives similar results as R) in terms of AUC. For
 illustration purpose, I used a small data (I have tried much bigger data)
  http://www.ats.ucla.edu/stat/data/binary.csv in
 http://www.ats.ucla.edu/stat/r/dae/logit.htm

 Here is the snippet of my usage of LogisticRegressionWithLBFGS.

  val algorithm = new LogisticRegressionWithLBFGS
  algorithm.setIntercept(true)
  algorithm.optimizer
    .setNumIterations(100)
    .setRegParam(0.01)
    .setConvergenceTol(1e-5)
  val model = algorithm.run(training)
  model.clearThreshold()
  val scoreAndLabels = test.map { point =>
    val score = model.predict(point.features)
    (score, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  val auROC = metrics.areaUnderROC()

 I did a (0.6, 0.4) split for training/test. The response is admit and
 features are GRE score, GPA, and college Rank.

 Spark:
 Weights (GRE, GPA, Rank):
 [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
 Intercept: -0.6488972641282202
 Area under ROC: 0.6294070512820512

 StatsModel:
 Weights [0.0018, 0.7220, -0.3148]
 Intercept: -3.5913
 Area under ROC: 0.69

 The weights from statsmodel seem more reasonable if you consider that, for a
 one-unit increase in GPA, the log odds of being admitted to graduate school
 increase by 0.72 in statsmodel versus 0.04 in Spark.

 I have seen much bigger difference with other data. So my question is has
 anyone compared the results with other libraries and is anything wrong with
 my code to invoke LogisticRegressionWithLBFGS?

 As the real data I am processing is pretty big, I really want to use
 Spark to get this to work. Please let me know if you have similar
 experience and how you resolved it.

 Thanks,
 Xin



Re: Getting the best parameter set back from CrossValidatorModel

2015-05-19 Thread Joseph Bradley
Hi Justin & Ram,

To clarify, PipelineModel.stages is not private[ml]; only the PipelineModel
constructor is private[ml].  So it's safe to use pipelineModel.stages as a
Spark user.

Ram's example looks good.  Btw, in Spark 1.4 (and the current master
build), we've made a number of improvements to Params and Pipelines, so
this should become easier to use!

Joseph

On Sun, May 17, 2015 at 10:17 PM, Justin Yip yipjus...@prediction.io
wrote:


 Thanks Ram.

 Your sample look is very helpful. (there is a minor bug that
 PipelineModel.stages is hidden under private[ml], just need a wrapper
 around it. :)

 Justin

 On Sat, May 16, 2015 at 10:44 AM, Ram Sriharsha sriharsha@gmail.com
 wrote:

 Hi Justin

 The CrossValidatorExample here
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/CrossValidatorExample.scala
 is a good example of how to set up an ML Pipeline for extracting a model
 with the best parameter set.

 You set up the pipeline as in here:

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/CrossValidatorExample.scala#L73

 This pipeline is treated as an estimator and wrapped into a Cross
 Validator to do grid search and return the model with the best parameters .
 Once you have trained the best model as in here

 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/CrossValidatorExample.scala#L93

 The result is a CrossValidatorModel which contains the best estimator
 (i.e. the best pipeline above) and you can extract the best pipeline and
 inquire its parameters as follows:

 // what are the best parameters?
 val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
 val stages = bestPipelineModel.stages

 val hashingStage = stages(1).asInstanceOf[HashingTF]
 println(hashingStage.getNumFeatures)
 val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
 println(lrStage.getRegParam)



 Ram

 On Sat, May 16, 2015 at 3:17 AM, Justin Yip yipjus...@prediction.io
 wrote:

 Hello,

 I am using MLPipeline. I would like to extract the best parameter found
 by CrossValidator. But I cannot find much document about how to do it. Can
 anyone give me some pointers?

 Thanks.

 Justin

 --
 View this message in context: Getting the best parameter set back from
 CrossValidatorModel
 http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-best-parameter-set-back-from-CrossValidatorModel-tp22915.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.






Re: Implementing custom metrics under MLPipeline's BinaryClassificationEvaluator

2015-05-18 Thread Joseph Bradley
Hi Justin,

It sounds like you're on the right track.  The best way to write a custom
Evaluator will probably be to modify an existing Evaluator as you
described.  It's best if you don't remove the other code, which handles
parameter set/get and schema validation.

Joseph

On Sun, May 17, 2015 at 10:35 PM, Justin Yip yipjus...@prediction.io
wrote:

 Hello,

  I would like to use other metrics in BinaryClassificationEvaluator; I am
  thinking about simple ones (e.g. PrecisionByThreshold). From the API docs,
  I can't tell much about how to implement it.

 From the code, it seems like I will have to override this function, using
 most of the existing code for checking column schema, then replace the line
  which computes the actual score
 https://github.com/apache/spark/blob/1b8625f4258d6d1a049d0ba60e39e9757f5a568b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala#L72
 .

  Is my understanding correct? Or is there a more convenient way of
  implementing a metric so that it can be used by an ML pipeline?

 Thanks.

 Justin

 --
 View this message in context: Implementing custom metrics under
 MLPipeline's BinaryClassificationEvaluator
 http://apache-spark-user-list.1001560.n3.nabble.com/Implementing-custom-metrics-under-MLPipeline-s-BinaryClassificationEvaluator-tp22930.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.



Re: Restricting the number of iterations in Mllib Kmeans

2015-05-18 Thread Joseph Bradley
Hi Suman,

For maxIterations, are you using the DenseKMeans.scala example code?  (I'm
guessing yes since you mention the command line.)  If so, then you should
be able to specify maxIterations via an extra parameter like
--numIterations 50 (note the example uses numIterations in the current
master instead of maxIterations, which is sort of a bug in the example).
If that does not cap the max iterations, then please report it as a bug.

To specify the initial centroids, you will need to modify the DenseKMeans
example code.  Please see the KMeans API docs for details.
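For intuition, here is a hypothetical pure-Python sketch of Lloyd's algorithm with both knobs exposed, caller-supplied initial centroids and a hard cap on iterations (the DenseKMeans example would need the analogous changes through the KMeans API):

```python
def kmeans(points, centroids, max_iterations):
    """Lloyd's algorithm with user-chosen initial centroids and an iteration cap."""
    for _ in range(max_iterations):
        # assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # recompute centroids as cluster means (keep old centroid if empty)
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged before hitting the cap
            break
        centroids = new_centroids
    return centroids

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans(pts, centroids=[(0.0, 0.0), (10.0, 10.0)], max_iterations=20)
```

Here the centroids converge well inside max_iterations; with a smaller cap the loop simply stops early and returns the current centroids.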

Good luck,
Joseph

On Mon, May 18, 2015 at 3:22 AM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:

 Hi,
 I think you can't supply an initial set of centroids to kmeans.

 Thanks & Regards,
 Meethu M



   On Friday, 15 May 2015 12:37 AM, Suman Somasundar 
 suman.somasun...@oracle.com wrote:


 Hi,

 I want to run a definite number of iterations in Kmeans.  There is a
 command line argument to set maxIterations, but even if I set it to a
 number, Kmeans runs until the centroids converge.
 Is there a specific way to specify it on the command line?

 Also, I wanted to know if we can supply the initial set of centroids to
 the program, instead of it choosing the centroids at random?

 Thanks,
 Suman.





Re: Predict.scala using model for clustering In reference

2015-05-07 Thread Joseph Bradley
A KMeansModel was trained in the previous step, and it was saved to
modelFile as a Java object file.  This step is loading the model back and
reconstructing the KMeansModel, which can then be used to classify new
tweets into different clusters.
Joseph

On Thu, May 7, 2015 at 12:40 PM, anshu shukla anshushuk...@gmail.com
wrote:

 Can anyone please explain -

 println("Initializing the KMeans model...")
 val model = new KMeansModel(
   ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())

 where modelFile is the *directory to persist the model while training*


   REF-


 https://github.com/databricks/reference-apps/blob/master/twitter_classifier/predict.md


 --
 Thanks & Regards,
 Anshu Shukla



Re: Multilabel Classification in spark

2015-05-05 Thread Joseph Bradley
If you mean multilabel (predicting multiple label values), then MLlib
does not yet support that.  You would need to predict each label separately.

If you mean multiclass (1 label taking >2 categorical values), then MLlib
supports it via LogisticRegression (as DB said), as well as DecisionTree
and RandomForest.

Joseph

On Tue, May 5, 2015 at 1:27 PM, DB Tsai dbt...@dbtsai.com wrote:

 LogisticRegression in the MLlib package supports multilabel classification.

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com


 On Tue, May 5, 2015 at 1:13 PM, peterg pe...@garbers.me wrote:
  Hi all,
 
  I'm looking to implement a Multilabel classification algorithm but I am
  surprised to find that there are not any in the spark-mllib core
 library. Am
  I missing something? Would someone point me in the right direction?
 
  Thanks!
 
  Peter
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 





Re: MLLib SVM probability

2015-05-04 Thread Joseph Bradley
Currently, SVMs don't have built-in multiclass support.  Logistic
Regression supports multiclass, as do trees and random forests.  It would
be great to add multiclass support for SVMs as well.

There is some ongoing work on generic multiclass-to-binary reductions:
https://issues.apache.org/jira/browse/SPARK-7015

I agree that naive one-vs-all reductions might not work that well, but that
the raw scores could be calibrated using the scaling you mentioned, or
other methods.
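For intuition, here is a rough pure-Python sketch of Platt scaling fitted to the margins from Robert's output (rescaled by 1/100 so a plain gradient-descent fitter stays stable; this is an illustration of the idea, not an MLlib API):

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_platt(margins, labels, lr=0.01, iters=10000):
    """Fit P(y=1 | f) = sigmoid(-(a*f + b)) by gradient descent on log loss."""
    a = b = 0.0
    for _ in range(iters):
        ga = gb = 0.0
        for f, y in zip(margins, labels):
            p = sigmoid(-(a * f + b))
            ga += (p - y) * (-f)    # d(log loss)/da
            gb += (p - y) * (-1.0)  # d(log loss)/db
        a -= lr * ga
        b -= lr * gb
    return a, b

# SVM margins from the email, divided by 100, with their labels.
margins = [-0.188, 1.683, 4.207, -9.742, 0.717, 2.331, -10.006]
labels = [0, 1, 1, 0, 1, 1, 0]
a, b = fit_platt(margins, labels)

def prob(f):
    return sigmoid(-(a * f + b))
```

a comes out negative, so prob is increasing in the margin: large positive margins map near 1, large negative margins near 0, and small margins get intermediate probabilities. On separable data like this, the unregularized fit keeps sharpening with more iterations, which is one reason Platt's original method also regularizes the targets.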

Joseph

On Mon, May 4, 2015 at 6:29 AM, Driesprong, Fokko fo...@driesprong.frl
wrote:

 Hi Robert,

  I would say that taking the sign of the numbers represents the class of the
  input vector. What kind of data are you using, and what kind of training set
  do you use? Fundamentally an SVM is able to separate only two classes; you
  can do one vs. the rest, as you mentioned.

 I don't see how LVQ can benefit the SVM classifier. I would say that this
  is more an SVM problem than a Spark one.

 2015-05-04 15:22 GMT+02:00 Robert Musters robert.must...@openindex.io:

  Hi all,

 I am trying to understand the output of the SVM classifier.

 Right now, my output looks like this:

 -18.841544889249917 0.0

 168.32916035523283 1.0

 420.67763915879794 1.0

 -974.1942589201286 0.0

 71.73602841256813 1.0

 233.13636224524993 1.0

 -1000.5902168199027 0.0


  The documentation is unclear about what these numbers mean
 https://spark.apache.org/docs/0.9.2/api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint
 .

 I think it is the signed distance to the hyperplane.


  My main question is: How can I convert distances from hyperplanes to
 probabilities in a multi-class one-vs-all approach?

 SVMLib http://www.csie.ntu.edu.tw/~cjlin/libsvm/ has this
 functionality and refers the process to get the probabilities as “Platt
 scaling”
 http://www.researchgate.net/profile/John_Platt/publication/2594015_Probabilistic_Outputs_for_Support_Vector_Machines_and_Comparisons_to_Regularized_Likelihood_Methods/links/004635154cff5262d600.pdf.


 I think this functionality should be in MLLib, but I can't find it?
 Do you think Platt scaling makes sense?


  Making clusters using Learning Vector Quantization, determining the
 spread function of a cluster with a Gaussian function and then retrieving
 the probability makes a lot more sense i.m.o. Using the distances from the
 hyperplanes from several SVM classifiers and then trying to determine some
 probability on these distance measures, does not make any sense, because
 the distribution property of the data-points belonging to a cluster is not
 taken into account.
 Does anyone see a fallacy in my reasoning?


  With kind regards,

 Robert





Re: [Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-26 Thread Joseph Bradley
Hi Peter,

As far as setting the parallelism, I would recommend setting it as early as
possible.  Ideally, that would mean specifying the number of partitions
when loading the initial data (rather than repartitioning later on).

In general, working with Vector columns should be better since the Vector
can be stored as a native array, rather than a bunch of objects.

I suspect the OOM is from Parquet's very large default buffer sizes.  This
is a problem with ML model import/export as well.  I have a JIRA for that:
https://issues.apache.org/jira/browse/SPARK-7148
I'm not yet sure if there's a good way to set the buffer size
automatically, though.

Joseph

On Fri, Apr 24, 2015 at 8:20 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Hi, I have the following problem. I have a dataset with 30 columns (15 numeric,
 15 categorical) and I am using ml transformers/estimators to transform each
 column (StringIndexer for categorical & MeanImputor for numeric). This
 creates 30 more columns in the dataframe. Then I use VectorAssembler to
 combine the 30 transformed columns into 1 vector.
 When I then do df.select("vector", "Label").saveAsParquetFile it fails with an
 OOM error.

 15/04/24 16:33:05 ERROR Executor: Exception in task 2.0 in stage 52.0 (TID 
 2238)
 15/04/24 16:33:05 DEBUG LocalActor: [actor] received message 
 StatusUpdate(2238,FAILED,java.nio.HeapByteBuffer[pos=0 lim=4167 cap=4167]) 
 from Actor[akka://sparkDriver/deadLetters]
 15/04/24 16:33:05 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
 thread Thread[Executor task launch worker-1,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3468)
 at 
 java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3275)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1792)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
 at 
 scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:142)
 at 
 scala.collection.mutable.HashMap$$anonfun$readObject$1.apply(HashMap.scala:142)
 at scala.collection.mutable.HashTable$class.init(HashTable.scala:105)
 at scala.collection.mutable.HashMap.init(HashMap.scala:39)
 at scala.collection.mutable.HashMap.readObject(HashMap.scala:142)
 at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1896)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
 at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
 15/04/24 16:33:05 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_52, 
 runningTasks: 3
 15/04/24 16:33:05 DEBUG Utils: Shutdown hook called
 15/04/24 16:33:05 DEBUG DiskBlockManager: Shutdown hook called
 15/04/24 16:33:05 DEBUG TaskSetManager: Moving to NODE_LOCAL after waiting 
 for 3000ms
 15/04/24 16:33:05 DEBUG TaskSetManager: Moving to ANY after waiting for 0ms
 15/04/24 16:33:05 INFO TaskSetManager: Starting task 4.0 in stage 52.0 (TID 
 2240, localhost, PROCESS_LOCAL, 1979 bytes)
 15/04/24 16:33:05 DEBUG LocalActor: [actor] handled message (12.488047 ms) 
 StatusUpdate(2238,FAILED,java.nio.HeapByteBuffer[pos=4167 lim=4167 cap=4167]) 
 from Actor[akka://sparkDriver/deadLetters]
 15/04/24 16:33:05 INFO Executor: Running task 4.0 in stage 52.0 (TID 2240)
 15/04/24 16:33:05 DEBUG LocalActor: [actor] received message 
 

Re: Multiclass classification using Ml logisticRegression

2015-04-26 Thread Joseph Bradley
Unfortunately, the Pipelines API doesn't have multiclass logistic
regression yet, only binary.  It's really a matter of modifying the current
implementation; I just added a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-7159

You'll need to use the old LogisticRegression API to do multiclass for now,
until that JIRA gets completed.  (If you're interested in doing it, let me
know via the JIRA!)

Joseph

On Fri, Apr 24, 2015 at 3:26 AM, Selim Namsi selim.na...@gmail.com wrote:

 Hi,

  I just started using the Spark ML pipeline to implement a multiclass classifier
  using LogisticRegressionWithLBFGS (which accepts the number of classes as a
  parameter). I followed the Pipeline example in the ML guide and used the
  LogisticRegression class, which calls the LogisticRegressionWithLBFGS class:

 val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

  the problem is that LogisticRegression doesn't take numClasses as a
  parameter.

 Any idea how to solve this problem?

 Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Multiclass-classification-using-Ml-logisticRegression-tp22644.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-25 Thread Joseph Bradley
It looks like your code is making 1 Row per item, which means that
columnSimilarities will compute similarities between users.  If you
transpose the matrix (or construct it as the transpose), then
columnSimilarities should do what you want, and it will return meaningful
indices.
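A tiny pure-Python illustration of why the orientation matters (illustrative data, not the Movielens file): with one row per user and one column per item, cosine similarity between columns is exactly item-item similarity:

```python
import math

# rows = users, columns = items; similarity between COLUMNS is then
# item-item similarity, which is why the matrix must be transposed
# relative to the original code (which built one row per item).
ratings = [
    [5.0, 4.0, 0.0],   # user 0
    [4.0, 5.0, 1.0],   # user 1
    [0.0, 1.0, 5.0],   # user 2
]

def column_similarity(m, i, j):
    ci = [row[i] for row in m]
    cj = [row[j] for row in m]
    dot = sum(a * b for a, b in zip(ci, cj))
    ni = math.sqrt(sum(a * a for a in ci))
    nj = math.sqrt(sum(b * b for b in cj))
    return dot / (ni * nj)

sim_0_1 = column_similarity(ratings, 0, 1)   # items 0 and 1: rated alike
sim_0_2 = column_similarity(ratings, 0, 2)   # items 0 and 2: rated differently
```

Items 0 and 1 are rated alike (similarity near 1), while items 0 and 2 are not (near 0). In the original code each row was an item, so columnSimilarities was effectively comparing users instead.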
Joseph

On Fri, Apr 24, 2015 at 11:20 PM, amghost zhengweita...@outlook.com wrote:

 I have encountered the all-pairs similarity problem in my recommendation
  system. Thanks to this Databricks blog, it seems RowMatrix may help.

  However, RowMatrix is a matrix type without meaningful row indices, so
  I don't know how to retrieve the similarity result after invoking
  columnSimilarities(threshold) for specific items i and j.

 Below is some details about what I am doing:

 1) My data file comes from Movielens with format like this:

 user::item::rating
 2) I build up a RowMatrix in which each sparse vector i represents the
 ratings of all users to this item i

  val dataPath = ...
  val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match {
    case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
  })
  val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating)))
    .groupByKey()
    .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq))

 val mat = new RowMatrix(rows)

 val similarities = mat.columnSimilarities(0.5)
 Now I get a CoordinateMatrix, similarities. How can I get the similarity of
 specific items i and j? Although it can be used to retrieve an
 RDD[MatrixEntry], I am not sure whether row i and column j correspond to
 items i and j.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-retrieve-item-pair-after-calculating-similarity-using-RowMatrix-tp22654.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: KMeans takeSample jobs and RDD cached

2015-04-25 Thread Joseph Bradley
Yes, the count() should be the first task, and the sampling + collecting
should be the second task.  The first one is probably slow because the RDD
being sampled is not yet cached/materialized.

K-Means creates some RDDs internally while learning, and since they aren't
needed after learning, they are unpersisted (uncached) at the end.

Joseph
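The two-job structure can be sketched in plain Python (a loose analogue of takeSample's logic, not Spark's actual implementation): the first pass is the count, the second does the per-element sampling.

```python
import random

def take_sample(data, num, seed=42):
    # Pass 1: a count over the dataset -- this corresponds to the first,
    # slower job when the RDD is not yet cached/materialized.
    n = len(data)
    # Pass 2: per-element sampling with a fraction padded above num/n,
    # loosely mimicking the PartitionwiseSampledRDD pass, then truncate.
    rng = random.Random(seed)
    frac = min(1.0, 2.0 * num / n)
    sampled = [x for x in data if rng.random() < frac]
    return sampled[:num]
```

The point is that the second pass does strictly less work per element than most learning iterations, so it reads as the quicker job once the data is materialized.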

On Sat, Apr 25, 2015 at 6:36 AM, podioss grega...@hotmail.com wrote:

 Hi,
 I am running the k-means algorithm with initialization mode set to random,
 with various dataset sizes and numbers of clusters, and I have a question
 regarding the takeSample job of the algorithm.
 More specifically, I notice that in every application there are two sampling
 jobs. The first one consumes the most time compared to all others, while
 the second one is much quicker, and that sparked my interest to investigate
 what is actually happening.
 To explain it, I checked the source code of the takeSample operation and I
 saw that there is a count action involved, followed by the computation of a
 PartitionwiseSampledRDD with a PoissonSampler.
 So my question is whether that count action corresponds to the first
 takeSample job, and whether the second takeSample job is the one doing the
 actual sampling.

 I also have a question about the RDDs that are created for k-means. In the
 middle of the execution, under the storage tab of the web UI, I can see 3
 RDDs with their partitions cached in memory across all nodes, which is very
 helpful for monitoring reasons. The problem is that after completion I can
 only see one of them and the portion of the cache memory it used, and I
 would like to ask why the web UI doesn't display all the RDDs involved in
 the computation.

 Thank you



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-takeSample-jobs-and-RDD-cached-tp22656.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark 1.3.1 Dataframe breaking ALS.train?

2015-04-21 Thread Joseph Bradley
Hi Ayan,

If you want to use DataFrame, then you should use the Pipelines API
(org.apache.spark.ml.*) which will take DataFrames:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.recommendation.ALS

In the examples/ directory for ml/, you can find a MovieLensALS example.

Good luck!
Joseph

On Tue, Apr 21, 2015 at 4:58 AM, ayan guha guha.a...@gmail.com wrote:

 Hi

 I am getting an error

 Also, I am getting an error in mlib.ALS.train function when passing
 dataframe (do I need to convert the DF to RDD?)

 Code:
 training = ssc.sql("select userId,movieId,rating from ratings where
 partitionKey < 6").cache()
 print type(training)
 model = ALS.train(training,rank,numIter,lmbda)

 Error:
 class 'pyspark.sql.dataframe.DataFrame'

 Traceback (most recent call last):
   File D:\Project\Spark\code\movie_sql.py, line 109, in module
 bestConf = getBestModel(sc,ssc,training,validation,validationNoRating)
   File D:\Project\Spark\code\movie_sql.py, line 54, in getBestModel
 model = ALS.train(trainingRDD,rank,numIter,lmbda)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py,
 line 139, in train
 model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank,
 iterations,
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py,
 line 127, in _prepare
 assert isinstance(ratings, RDD), "ratings should be RDD"
 AssertionError: ratings should be RDD

 It was working fine in 1.2.0 (till last night :))

 Any solution? I am thinking of mapping the training dataframe back to an
 RDD, but I will lose the schema information.

 Best
 Ayan

 On Mon, Apr 20, 2015 at 10:23 PM, ayan guha guha.a...@gmail.com wrote:

 Hi
 Just upgraded to Spark 1.3.1.

 I am getting an warning

 Warning (from warnings module):
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\sql\context.py,
 line 191
 warnings.warn("inferSchema is deprecated, please use createDataFrame
 instead")
 UserWarning: inferSchema is deprecated, please use createDataFrame instead

 However, the documentation still says to use inferSchema.
 Here: http://spark.apache.org/docs/latest/sql-programming-guide.html in
 section

 Also, I am getting an error in mlib.ALS.train function when passing
 dataframe (do I need to convert the DF to RDD?)

 Code:
 training = ssc.sql("select userId,movieId,rating from ratings where
 partitionKey < 6").cache()
 print type(training)
 model = ALS.train(training,rank,numIter,lmbda)

 Error:
 class 'pyspark.sql.dataframe.DataFrame'
 Rank:8 Lmbda:1.0 iteration:10

 Traceback (most recent call last):
   File D:\Project\Spark\code\movie_sql.py, line 109, in module
 bestConf = getBestModel(sc,ssc,training,validation,validationNoRating)
   File D:\Project\Spark\code\movie_sql.py, line 54, in getBestModel
 model = ALS.train(trainingRDD,rank,numIter,lmbda)
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py,
 line 139, in train
 model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank,
 iterations,
   File
 D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\mllib\recommendation.py,
 line 127, in _prepare
 assert isinstance(ratings, RDD), "ratings should be RDD"
 AssertionError: ratings should be RDD

 --
 Best Regards,
 Ayan Guha




 --
 Best Regards,
 Ayan Guha



Re: From DataFrame to LabeledPoint

2015-04-06 Thread Joseph Bradley
I'd make sure you're selecting the correct columns.  If not that, then your
input data might be corrupt.

CCing user to keep it on the user list.

On Mon, Apr 6, 2015 at 6:53 AM, Sergio Jiménez Barrio drarse.a...@gmail.com
 wrote:

 Hi!,

 I tried your solution, and I saw that the first row is null. Is this
 important? Can I work with null rows? Some rows have some columns with null
 values.

 This is the first row of Dataframe:
 scala> dataDF.take(1)
 res11: Array[org.apache.spark.sql.Row] =
 Array([null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null])



 This is the RDD[LabeledPoint] created:
 scala> data.take(1)
 15/04/06 15:46:31 ERROR TaskSetManager: Task 0 in stage 6.0 failed 4
 times; aborting job
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage
 6.0 (TID 243, 10.101.5.194): java.lang.NullPointerException

 Thanks for all.

 Sergio J.

 2015-04-03 20:14 GMT+02:00 Joseph Bradley jos...@databricks.com:

 I'd recommend going through each step, taking 1 RDD element
 (myDataFrame.take(1)), and examining it to see where this issue is
 happening.

 On Fri, Apr 3, 2015 at 9:44 AM, Sergio Jiménez Barrio 
 drarse.a...@gmail.com wrote:

 This solution is really good. But I was working with
 feature.toString.toDouble because the feature is of type Any. Now, when I
 try to work with the LabeledPoint created, I have a NullPointerException =/
 El 02/04/2015 21:23, Joseph Bradley jos...@databricks.com escribió:

 Peter's suggestion sounds good, but watch out for the match case since
 I believe you'll have to match on:

 case (Row(feature1, feature2, ...), Row(label)) =>

 On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko petro.rude...@gmail.com
 wrote:

  Hi try next code:

 val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map{
 case Row(feature1, feature2, ..., label) => LabeledPoint(label,
 Vectors.dense(feature1, feature2, ...))
 }

 Thanks,
 Peter Rudenko

 On 2015-04-02 17:17, drarse wrote:

   Hello!,

 I have had a question for days. I am working with DataFrame and with
 Spark SQL I imported a jsonFile:

 val df = sqlContext.jsonFile("file.json")

 In this json I have the label and the features. I selected them:

 val features = df.select("feature1", "feature2", "feature3", ...)

 val labels = df.select("cassification")

 But now I don't know how to create a LabeledPoint for RandomForest. I tried
 some solutions without success. Can you help me?

 Thanks for all!



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/From-DataFrame-to-LabeledPoint-tp22354.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

   ​







Re: Regarding MLLIB sparse and dense matrix

2015-04-03 Thread Joseph Bradley
If you can examine your data matrix and know that about < 1/6 or so of the
values are non-zero (so > 5/6 are zeros), then it's probably worth using
sparse vectors.  (1/6 is a rough estimate.)

There is support for L1 and L2 regularization.  You can look at the guide
here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
and the API docs linked from the menu.
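The density heuristic above can be made concrete with a small plain-Python sketch (the 1/6 threshold is the rough estimate from this reply, not a Spark constant):

```python
def density(vec):
    # Fraction of non-zero entries in a dense vector.
    nnz = sum(1 for v in vec if v != 0.0)
    return nnz / len(vec)

def prefer_sparse(vec, threshold=1.0 / 6.0):
    # Rough rule of thumb: the sparse representation pays off when
    # well under ~1/6 of the entries are non-zero.
    return density(vec) < threshold
```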

On Fri, Apr 3, 2015 at 1:24 PM, Jeetendra Gangele gangele...@gmail.com
wrote:

 Hi All
 I am building a logistic regression for matching person data. Say two
 person objects are given with their attributes and we need to find a score;
 that means on one side you have 10 million records and on the other side we
 have 1 record, and we need to tell which one matches with the highest score
 among the 1 million.

 I am storing the scores of the similarity algos in a dense matrix and
 considering these as features. I will apply many similarity algos on one
 attribute.

 Should I use sparse or dense? What happens with dense when a score is null
 or when some of the attributes are missing?

 Is there any support for regularized logistic regression? Currently I am
 using LogisticRegressionWithSGD.

 Regards
 jeetendra



Re: From DataFrame to LabeledPoint

2015-04-02 Thread Joseph Bradley
Peter's suggestion sounds good, but watch out for the match case since I
believe you'll have to match on:

case (Row(feature1, feature2, ...), Row(label)) =>
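The reason for matching on a pair of Rows is that zipping the two datasets produces (features-row, label-row) tuples. A plain-Python analogue (list-based, no Spark) of that destructuring:

```python
# Rows from df.select("f1", "f2") and df.select("label"), as plain tuples.
feature_rows = [(1.0, 2.0), (3.0, 4.0)]
label_rows = [(0.0,), (1.0,)]

# zip yields (feature_row, label_row) pairs, so both rows are destructured,
# mirroring `case (Row(f1, f2), Row(label)) => ...` rather than a single Row.
labeled_points = [
    (label, [f1, f2])
    for (f1, f2), (label,) in zip(feature_rows, label_rows)
]
```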

On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko petro.rude...@gmail.com
wrote:

  Hi try next code:

 val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map{
 case Row(feature1, feature2, ..., label) => LabeledPoint(label,
 Vectors.dense(feature1, feature2, ...))
 }

 Thanks,
 Peter Rudenko

 On 2015-04-02 17:17, drarse wrote:

   Hello!,

 I have had a question for days. I am working with DataFrame and with
 Spark SQL I imported a jsonFile:

 val df = sqlContext.jsonFile("file.json")

 In this json I have the label and the features. I selected them:

 val features = df.select("feature1", "feature2", "feature3", ...)

 val labels = df.select("cassification")

 But now I don't know how to create a LabeledPoint for RandomForest. I tried
 some solutions without success. Can you help me?

 Thanks for all!



 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/From-DataFrame-to-LabeledPoint-tp22354.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org

   ​



Re: Mllib kmeans #iteration

2015-04-02 Thread Joseph Bradley
Check out the Spark docs for that parameter: *maxIterations*
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
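The effect of a fixed iteration cap can be illustrated with a minimal 1-D k-means loop in plain Python (a toy analogue of what maxIterations controls, not MLlib's implementation):

```python
def kmeans_1d(points, k, max_iterations):
    # Naive seeding: the first k points become the initial centers.
    centers = points[:k]
    for _ in range(max_iterations):  # hard cap on Lloyd iterations
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```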

On Thu, Apr 2, 2015 at 4:42 AM, podioss grega...@hotmail.com wrote:

 Hello,
 I am running the k-means algorithm in cluster mode from MLlib and I was
 wondering if I could run the algorithm with a fixed number of iterations in
 some way.

 Thanks




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: k-means can only run on one executor with one thread?

2015-03-27 Thread Joseph Bradley
Can you try specifying the number of partitions when you load the data to
equal the number of executors?  If your ETL changes the number of
partitions, you can also repartition before calling KMeans.
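What repartitioning buys can be sketched with a round-robin redistribution in plain Python (standing in for RDD.repartition, not its actual shuffle mechanics):

```python
def repartition(data, num_partitions):
    # Redistribute elements round-robin so each partition -- and hence
    # each executor core that processes one -- gets a roughly equal share.
    parts = [[] for _ in range(num_partitions)]
    for i, x in enumerate(data):
        parts[i % num_partitions].append(x)
    return parts
```

With only one partition, only one task (and so one core) does the work; matching the partition count to the total core count lets all executors participate.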


On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen davidshe...@gmail.com wrote:

 Hi,

 I have a large data set, and I expect to get 5000 clusters.

 I load the raw data, convert them into DenseVector; then I did repartition
 and cache; finally I give the RDD[Vector] to KMeans.train().

 Now the job is running, and data are loaded. But according to the Spark
 UI, all data are loaded onto one executor. I checked that executor, and its
 CPU workload is very low. I think it is using only 1 of the 8 cores, and
 all the other 3 executors are idle.

 Did I miss something? Is it possible to distribute the workload to all 4
 executors?


 Thanks,
 David




Re: Spark ML Pipeline inaccessible types

2015-03-27 Thread Joseph Bradley
Hi Martin,

In the short term: Would you be able to work with a different type other
than Vector?  If so, then you can override the *Predictor* class's *protected
def featuresDataType: DataType* with a DataFrame type which fits your
purpose.  If you need Vector, then you might have to do a hack like Peter
suggested.

In the long term: VectorUDT should indeed be made public, but that will
have to wait until the next release.

Thanks for the feedback,
Joseph

On Fri, Mar 27, 2015 at 11:12 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Martin,

 Could you attach the code snippet and the stack trace? The default
 implementation of some methods uses reflection, which may be the
 cause.

 Best,
 Xiangrui

 On Wed, Mar 25, 2015 at 3:18 PM,  zapletal-mar...@email.cz wrote:
  Thanks Peter,
 
  I ended up doing something similar. I however consider both the
 approaches
  you mentioned bad practices which is why I was looking for a solution
  directly supported by the current code.
 
  I can work with that now, but it does not seem to be the proper solution.
 
  Regards,
  Martin
 
  -- Původní zpráva --
  Od: Peter Rudenko petro.rude...@gmail.com
  Komu: zapletal-mar...@email.cz, Sean Owen so...@cloudera.com
  Datum: 25. 3. 2015 13:28:38
 
 
  Předmět: Re: Spark ML Pipeline inaccessible types
 
 
  Hi Martin, here’s 2 possibilities to overcome this:
 
  1) Put your logic into org.apache.spark package in your project - then
  everything would be accessible.
  2) Dirty trick:
 
   object SparkVector extends HashingTF {
val VectorUDT: DataType = outputDataType
  }
 
  then you can do like this:
 
   StructType(vectorTypeColumn, SparkVector.VectorUDT, false))
 
  Thanks,
  Peter Rudenko
 
  On 2015-03-25 13:14, zapletal-mar...@email.cz wrote:
 
  Sean,
 
  thanks for your response. I am familiar with NoSuchMethodException in
  general, but I think it is not the case this time. The code actually
  attempts to get parameter by name using val m =
  this.getClass.getMethodName(paramName).
 
  This may be a bug, but it is only a side effect caused by the real
 problem I
  am facing. My issue is that VectorUDT is not accessible by user code and
  therefore it is not possible to use custom ML pipeline with the existing
  Predictors (see the last two paragraphs in my first email).
 
  Best Regards,
  Martin
 
  -- Původní zpráva --
  Od: Sean Owen so...@cloudera.com
  Komu: zapletal-mar...@email.cz
  Datum: 25. 3. 2015 11:05:54
  Předmět: Re: Spark ML Pipeline inaccessible types
 
 
  NoSuchMethodError in general means that your runtime and compile-time
  environments are different. I think you need to first make sure you
  don't have mismatching versions of Spark.
 
  On Wed, Mar 25, 2015 at 11:00 AM, zapletal-mar...@email.cz wrote:
  Hi,
 
  I have started implementing a machine learning pipeline using Spark
 1.3.0
  and the new pipelining API and DataFrames. I got to a point where I have
  my
  training data set prepared using a sequence of Transformers, but I am
  struggling to actually train a model and use it for predictions.
 
  I am getting a java.lang.NoSuchMethodException:
  org.apache.spark.ml.regression.LinearRegression.myFeaturesColumnName()
  exception thrown at checkInputColumn method in Params trait when using a
  Predictor (LinearRegression in my case, but that should not matter).
 This
  looks like a bug - the exception is thrown when executing
  getParam(colName)
  when the require(actualDataType.equals(datatype), ...) requirement is
 not
  met so the expected requirement failed exception is not thrown and is
  hidden
  by the unexpected NoSuchMethodException instead. I can raise a bug if
 this
  really is an issue and I am not using something incorrectly.
 
  The problem I am facing however is that the Predictor expects features
 to
  have VectorUDT type as defined in Predictor class (protected def
  featuresDataType: DataType = new VectorUDT). But since this type is
  private[spark] my Transformer can not prepare features with this type
  which
  then correctly results in the exception above when I use a different
 type.
 
  Is there a way to define a custom Pipeline that would be able to use the
  existing Predictors without having to bypass the access modifiers or
  reimplement something or is the pipelining API not yet expected to be
 used
  in this way?
 
  Thanks,
  Martin
 
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Using TF-IDF from MLlib

2015-03-16 Thread Joseph Bradley
This was brought up again in
https://issues.apache.org/jira/browse/SPARK-6340  so I'll answer one item
that was asked: the reliability of zipping RDDs.  Basically, zipping
should be reliable, and if it is not, then it should be reported as a bug.
This general approach should work (with explicit types to make it clear):

val data: RDD[LabeledPoint] = ...
val labels: RDD[Double] = data.map(_.label)
val features1: RDD[Vector] = data.map(_.features)
val features2: RDD[Vector] = new
HashingTF(numFeatures = 100).transform(features1)
val features3: RDD[Vector] = idfModel.transform(features2)
val finalData: RDD[LabeledPoint] = labels.zip(features3).map { case (label,
features) => LabeledPoint(label, features) }

If you run into problems with zipping like this, please report them!

Thanks,
Joseph
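For intuition, the HashingTF step used in the snippet above can be mimicked in plain Python (a toy sketch of the hashing trick, not MLlib's implementation):

```python
def hashing_tf(terms, num_features=100):
    # Map each term to a bucket by hashing; collisions are accepted
    # in exchange for a fixed-size vector with no vocabulary to maintain.
    vec = [0.0] * num_features
    for term in terms:
        vec[hash(term) % num_features] += 1.0
    return vec
```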

On Mon, Dec 29, 2014 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote:

 Hopefully the new pipeline API addresses this problem. We have a code
 example here:


 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala

 -Xiangrui

 On Mon, Dec 29, 2014 at 5:22 AM, andy petrella andy.petre...@gmail.com
 wrote:
  Here is what I did for this case :
 https://github.com/andypetrella/tf-idf
 
 
  Le lun 29 déc. 2014 11:31, Sean Owen so...@cloudera.com a écrit :
 
  Given (label, terms) you can just transform the values to a TF vector,
  then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
  make a LabeledPoint from (label, vector) pairs. Is that what you're
  looking for?
 
  On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote:
   I found the TF-IDF feature extraction and all the MLlib code that works
   with pure Vector RDDs very difficult to use, due to the lack of a way to
   associate a vector back to the original data. Why can't Spark MLlib
   support LabeledPoint?
  
  
  
   --
   View this message in context:
  
 http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
   Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
  
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org
  
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: sparse vector operations in Python

2015-03-10 Thread Joseph Bradley
There isn't a great way currently.  The best option is probably to convert
to scipy.sparse column vectors and add using scipy.
Joseph
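A dict-based sketch of the idea (indices mapping to values stand in for a SparseVector's index/value arrays; this is plain Python, not the MLlib or scipy API):

```python
def sparse_add(a, b):
    # a, b: {index: value} mappings with zero entries omitted.
    out = dict(a)
    for i, v in b.items():
        out[i] = out.get(i, 0.0) + v
    return out
```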

On Mon, Mar 9, 2015 at 4:21 PM, Daniel, Ronald (ELS-SDG) 
r.dan...@elsevier.com wrote:

 Hi,

 Sorry to ask this, but how do I compute the sum of 2 (or more) mllib
 SparseVectors in Python?

 Thanks,
 Ron




 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: LBGFS optimizer performace

2015-03-03 Thread Joseph Bradley
Is that error actually occurring in LBFGS?  It looks like it might be
happening before the data even gets to LBFGS.  (Perhaps the outer join
you're trying to do is making the dataset size explode a bit.)  Are you
able to call count() (or any RDD action) on the data before you pass it to
LBFGS?

On Tue, Mar 3, 2015 at 8:55 AM, Gustavo Enrique Salazar Torres 
gsala...@ime.usp.br wrote:

 Just did, with the same error.
 I think the problem is the data.count() call in LBFGS, because for huge
 datasets that's naive to do.
 I was thinking of writing my own version of LBFGS, but instead of doing
 data.count() I would pass that parameter, which I would calculate from a
 Spark SQL query.

 I will let you know.

 Thanks


 On Tue, Mar 3, 2015 at 3:25 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try increasing your driver memory, reducing the executors and
 increasing the executor memory?

 Thanks
 Best Regards

 On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres 
 gsala...@ime.usp.br wrote:

 Hi there:

 I'm using the LBFGS optimizer to train a logistic regression model. The
 code I implemented follows the pattern shown in
 https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but the
 training data is obtained from a Spark SQL RDD.
 The problem I'm having is that LBFGS tries to count the elements in my
 RDD and that results in an OOM exception since my dataset is huge.
 I'm running on an AWS EMR cluster with 16 c3.2xlarge instances on Hadoop
 YARN. My dataset is about 150 GB but I sample it (I take only 1% of the
 data) in order to scale logistic regression.
 The exception I'm getting is this:

 15/03/03 04:21:44 WARN scheduler.TaskSetManager: Lost task 108.0 in
 stage 2.0 (TID 7600, ip-10-155-20-71.ec2.internal):
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOfRange(Arrays.java:2694)
 at java.lang.String.init(String.java:203)
 at com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
 at
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
 at
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
 at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
 at
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:144)
 at
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
 at
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
 at
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at org.apache.spark.sql.execution.joins.HashOuterJoin.org
 $apache$spark$sql$execution$joins$HashOuterJoin$$buildHashTable(HashOuterJoin.scala:179)
 at
 org.apache.spark.sql.execution.joins.HashOuterJoin$$anonfun$execute$1.apply(HashOuterJoin.scala:199)
 at
 org.apache.spark.sql.execution.joins.HashOuterJoin$$anonfun$execute$1.apply(HashOuterJoin.scala:196)
 at
 org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)

 I'm using this parameters at runtime:
 --num-executors 128 --executor-memory 1G --driver-memory 4G
 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
 --conf spark.storage.memoryFraction=0.2

 I also persist my dataset using MEMORY_AND_DISK_SER but get 

Re: Solve least square problem of the form min norm(A x - b)^2^ + lambda * n * norm(x)^2 ?

2015-03-03 Thread Joseph Bradley
The minimization problem you're describing in the email title also looks
like it could be solved using the RidgeRegression solver in MLlib, once you
transform your DistributedMatrix into an RDD[LabeledPoint].
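For the scalar case the objective has a closed form, which shows how the lambda * n term enters the normal equations (a toy 1-D sketch, not the RidgeRegression solver):

```python
def ridge_1d(a, b, lam):
    # Minimize  sum_i (a_i * x - b_i)^2 + lam * n * x^2  over scalar x.
    # Setting the derivative to zero gives
    #   x = sum(a_i * b_i) / (sum(a_i^2) + lam * n).
    n = len(a)
    num = sum(ai * bi for ai, bi in zip(a, b))
    den = sum(ai * ai for ai in a) + lam * n
    return num / den
```

With lam = 0 this reduces to ordinary least squares; a positive lam shrinks the solution toward zero.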

On Tue, Mar 3, 2015 at 11:02 AM, Shivaram Venkataraman 
shiva...@eecs.berkeley.edu wrote:

 There are couple of solvers that I've written that is part of the AMPLab
 ml-matrix repo [1,2]. These aren't part of MLLib yet though and if you are
 interested in porting them I'd be happy to review it

 Thanks
 Shivaram


 [1]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala
 [2]
 https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/NormalEquations.scala

 On Tue, Mar 3, 2015 at 9:01 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,

 Is there a least square solver based on DistributedMatrix that we can use
 out of the box in the current (or the master) version of spark ?
 It seems that the only least square solver available in spark is private
 to recommender package.


 Cheers,

 Jao





Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-03 Thread Joseph Bradley
I see.  I think your best bet is to create the cnnModel on the master and
then serialize it to send to the workers.  If it's big (1M or so), then you
can broadcast it and use the broadcast variable in the UDF.  There is not a
great way to do something equivalent to mapPartitions with UDFs right now.

On Tue, Mar 3, 2015 at 4:36 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 Here is my current implementation with current master version of spark




 class DeepCNNFeature extends Transformer with HasInputCol with
 HasOutputCol ... {

   override def transformSchema(...) { ... }

   override def transform(dataSet: DataFrame, paramMap: ParamMap):
   DataFrame = {

     transformSchema(dataSet.schema, paramMap, logging = true)

     val map = this.paramMap ++ paramMap

     val deepCNNFeature = udf((v: Vector) => {
       val cnnModel = new CaffeModel
       cnnModel.transform(v)
     } : Vector)

     dataSet.withColumn(map(outputCol), deepCNNFeature(col(map(inputCol))))
   }
 }

 where CaffeModel is a Java API to the Caffe C++ model.

 The problem here is that for every row it will create a new instance of
 CaffeModel, which is inefficient since creating a new model means loading a
 large model file. And it will transform only a single row at a time,
 whereas a Caffe network can process a batch of rows efficiently. In other
 words, is it possible to create a UDF that operates on a partition, in
 order to minimize the creation of CaffeModel instances and to take
 advantage of the Caffe network's batch processing?



 On Tue, Mar 3, 2015 at 7:26 AM, Joseph Bradley jos...@databricks.com
 wrote:

 I see, thanks for clarifying!

 I'd recommend following existing implementations in spark.ml
 transformers.  You'll need to define a UDF which operates on a single Row
 to compute the value for the new column.  You can then use the DataFrame
 DSL to create the new column; the DSL provides a nice syntax for what would
 otherwise be a SQL statement like select ... from   I'm recommending
 looking at the existing implementation (rather than stating it here)
 because it changes between Spark 1.2 and 1.3.  In 1.3, the DSL is much
 improved and makes it easier to create a new column.

 Joseph

 On Sun, Mar 1, 2015 at 1:26 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 class DeepCNNFeature extends Transformer ... {

 override def transform(data: DataFrame, paramMap: ParamMap):
 DataFrame = {


  // How can I do a map partition on the underlying RDD
 and then add the column ?

  }
 }

 On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Hi Joseph,

 Thank your for the tips. I understand what should I do when my data are
 represented as a RDD. The thing that I can't figure out is how to do the
 same thing when the data is view as a DataFrame and I need to add the
 result of my pretrained model as a new column in the DataFrame. Preciselly,
 I want to implement the following transformer :

 class DeepCNNFeature extends Transformer ... {

 }

 On Sun, Mar 1, 2015 at 1:32 AM, Joseph Bradley jos...@databricks.com
 wrote:

 Hi Jao,

 You can use external tools and libraries if they can be called from
 your Spark program or script (with appropriate conversion of data types,
 etc.).  The best way to apply a pre-trained model to a dataset would be to
 call the model from within a closure, e.g.:

 myRDD.map { myDatum => preTrainedModel.predict(myDatum) }

 If your data is distributed in an RDD (myRDD), then the above call
 will distribute the computation of prediction using the pre-trained model.
 It will require that all of your Spark workers be able to run the
 preTrainedModel; that may mean installing Caffe and dependencies on all
 nodes in the compute cluster.

 For the second question, I would modify the above call as follows:

 myRDD.mapPartitions { myDataOnPartition =>
   val myModel = // instantiate neural network on this partition
   myDataOnPartition.map { myDatum => myModel.predict(myDatum) }
 }

 I hope this helps!
 Joseph
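A plain-Python analogue of that mapPartitions pattern, showing that the expensive model load happens once per partition rather than once per record (the model class here is a hypothetical stand-in for something like a Caffe network):

```python
class PreTrainedModel:
    # Hypothetical stand-in for an expensive-to-load pre-trained model.
    loads = 0

    def __init__(self):
        PreTrainedModel.loads += 1  # count how often the model is built

    def predict(self, x):
        return 2 * x

def map_partitions(partitions, f):
    # Analogue of RDD.mapPartitions: f sees a whole partition at once.
    for part in partitions:
        yield from f(part)

def predict_partition(part):
    model = PreTrainedModel()  # instantiated once per partition
    return [model.predict(x) for x in part]

results = list(map_partitions([[1, 2], [3, 4]], predict_partition))
```

With plain map, the load counter would equal the number of records; with the per-partition version it equals the number of partitions.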

 On Fri, Feb 27, 2015 at 10:27 PM, Jaonary Rabarisoa jaon...@gmail.com
  wrote:

 Dear all,


 We mainly do large scale computer vision task (image classification,
 retrieval, ...). The pipeline is really great stuff for that. We're 
 trying
 to reproduce the tutorial given on that topic during the latest spark
 summit (
 http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
  )
 using the master version of spark pipeline and dataframe. The tutorial
 shows different examples of feature extraction stages before running
 machine learning algorithms. Even though the tutorial is straightforward to
 reproduce with this new API, we still have some questions:

- Can one use external tools (e.g via pipe) as a pipeline stage ?
An example of use case is to extract feature learned with 
 convolutional

Re: LBFGS optimizer performance

2015-03-03 Thread Joseph Bradley
I would recommend caching; if you can't persist, iterative algorithms will
not work well.

I don't think calling count on the dataset is problematic; every iteration
in LBFGS iterates over the whole dataset and does a lot more computation
than count().

It would be helpful to see some error occurring within LBFGS.  With the
given stack trace, I'm not sure what part of LBFGS it's happening in.
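
The caching advice above can be sketched as follows. This is a hedged illustration, not code from the thread: `sqlResultRDD`, the gradient/updater objects, and all hyperparameter values are illustrative stand-ins; the `runLBFGS` call follows the MLlib optimization API of that era.

```scala
// Sketch: persist the training RDD once so that LBFGS's initial count()
// and every subsequent iteration reuse materialized data instead of
// recomputing the upstream Spark SQL join.
import org.apache.spark.mllib.optimization.LBFGS
import org.apache.spark.storage.StorageLevel

val training = sqlResultRDD                  // RDD[(Double, Vector)] from the SQL query
  .persist(StorageLevel.MEMORY_AND_DISK)     // spill to disk rather than recompute

training.count()                             // force materialization up front

val (weights, lossHistory) = LBFGS.runLBFGS(
  training, gradient, updater,
  numCorrections = 10, convergenceTol = 1e-4,
  maxNumIterations = 100, regParam = 0.1, initialWeights)
```

MEMORY_AND_DISK is the key choice here: with a 150 GB (sampled) dataset on 240 GB of cluster RAM, partitions that don't fit in memory spill to local disk instead of being recomputed from the join on every iteration.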

On Tue, Mar 3, 2015 at 2:27 PM, Gustavo Enrique Salazar Torres 
gsala...@ime.usp.br wrote:

 Yeah, I can call count before that and it works. Also, I was over-caching
 tables, but I removed those. Now there is no caching, but it gets really slow
 since it recalculates my table RDD many times.
 I also hacked the LBFGS code to pass in the number of examples, which I
 calculated outside in a Spark SQL query, but that just moved the location of
 the problem.

 The query I'm running looks like this:

 s"SELECT $mappedFields, tableB.id as b_id FROM tableA LEFT JOIN tableB ON
 tableA.id=tableB.table_a_id WHERE tableA.status='' OR tableB.status=''"

 mappedFields contains a list of fields which I'm interested in. The result
 of that query goes through (including sampling) some transformations before
 being input to LBFGS.

 My dataset has 180GB just for feature selection, I'm planning to use 450GB
 to train the final model and I'm using 16 c3.2xlarge EC2 instances, that
 means I have 240GB of RAM available.

 Any suggestion? I'm starting to check the algorithm because I don't
 understand why it needs to count the dataset.

 Thanks

 Gustavo

 On Tue, Mar 3, 2015 at 6:08 PM, Joseph Bradley jos...@databricks.com
 wrote:

 Is that error actually occurring in LBFGS?  It looks like it might be
 happening before the data even gets to LBFGS.  (Perhaps the outer join
 you're trying to do is making the dataset size explode a bit.)  Are you
 able to call count() (or any RDD action) on the data before you pass it to
 LBFGS?

 On Tue, Mar 3, 2015 at 8:55 AM, Gustavo Enrique Salazar Torres 
 gsala...@ime.usp.br wrote:

 Just did with the same error.
 I think the problem is the data.count() call in LBFGS, because for huge
 datasets that's a naive thing to do.
 I was thinking to write my version of LBFGS but instead of doing
 data.count() I will pass that parameter which I will calculate from a Spark
 SQL query.

 I will let you know.

 Thanks


 On Tue, Mar 3, 2015 at 3:25 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Can you try increasing your driver memory, reducing the executors and
 increasing the executor memory?

 Thanks
 Best Regards

 On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres 
 gsala...@ime.usp.br wrote:

 Hi there:

 I'm using LBFGS optimizer to train a logistic regression model. The
 code I implemented follows the pattern showed in
 https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but
 training data is obtained from a Spark SQL RDD.
 The problem I'm having is that LBFGS tries to count the elements in my
 RDD, and that results in an OOM exception since my dataset is huge.
 I'm running on an AWS EMR cluster with 16 c3.2xlarge instances on
 Hadoop YARN. My dataset is about 150 GB, but I sample it (taking only 1% of
 the data) in order to scale logistic regression.
 The exception I'm getting is this:

 15/03/03 04:21:44 WARN scheduler.TaskSetManager: Lost task 108.0 in
 stage 2.0 (TID 7600, ip-10-155-20-71.ec2.internal):
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOfRange(Arrays.java:2694)
 at java.lang.String.init(String.java:203)
 at
 com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
 at
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
 at
 com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
 at
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
 at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:651)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
 at
 com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
 at
 com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
 at
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:144)
 at
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133

Re: Some questions after playing a little with the new ml.Pipeline.

2015-03-02 Thread Joseph Bradley
I see, thanks for clarifying!

I'd recommend following existing implementations in spark.ml transformers.
You'll need to define a UDF which operates on a single Row to compute the
value for the new column.  You can then use the DataFrame DSL to create the
new column; the DSL provides a nice syntax for what would otherwise be a
SQL statement like SELECT ... FROM ....  I'm recommending looking at the
existing implementation (rather than stating it here) because it changes
between Spark 1.2 and 1.3.  In 1.3, the DSL is much improved and makes it
easier to create a new column.

Joseph
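
A minimal sketch of that recommendation in the Spark 1.3 DataFrame DSL follows. It is hedged: `df`, the "features" column, and `preTrainedModel` are illustrative names from this thread's discussion, not a real implementation, and the UDF assumes the feature column deserializes to a Seq[Double].

```scala
// Sketch: wrap the per-row model call in a UDF, then append a column
// via the DataFrame DSL (Spark 1.3+).
import org.apache.spark.sql.functions.udf

val predictUDF = udf { (features: Seq[Double]) =>
  preTrainedModel.predict(features)  // runs on executors; the model must be serializable
}

// withColumn replaces the "SELECT *, predict(features) AS prediction" SQL.
val withPrediction = df.withColumn("prediction", predictUDF(df("features")))
```

Note the closure requirement from earlier in the thread still applies: the model is serialized into the UDF and shipped to every worker.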

On Sun, Mar 1, 2015 at 1:26 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:

 class DeepCNNFeature extends Transformer ... {

 override def transform(data: DataFrame, paramMap: ParamMap): DataFrame
 = {


  // How can I do a map partition on the underlying RDD and
 then add the column ?

  }
 }

 On Sun, Mar 1, 2015 at 10:23 AM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Hi Joseph,

 Thank you for the tips. I understand what I should do when my data are
 represented as an RDD. The thing that I can't figure out is how to do the
 same thing when the data is viewed as a DataFrame and I need to add the
 result of my pretrained model as a new column in the DataFrame. Precisely,
 I want to implement the following transformer:

 class DeepCNNFeature extends Transformer ... {

 }

 On Sun, Mar 1, 2015 at 1:32 AM, Joseph Bradley jos...@databricks.com
 wrote:

 Hi Jao,

 You can use external tools and libraries if they can be called from your
 Spark program or script (with appropriate conversion of data types, etc.).
 The best way to apply a pre-trained model to a dataset would be to call the
 model from within a closure, e.g.:

 myRDD.map { myDatum => preTrainedModel.predict(myDatum) }

 If your data is distributed in an RDD (myRDD), then the above call will
 distribute the computation of prediction using the pre-trained model.  It
 will require that all of your Spark workers be able to run the
 preTrainedModel; that may mean installing Caffe and dependencies on all
 nodes in the compute cluster.

 For the second question, I would modify the above call as follows:

 myRDD.mapPartitions { myDataOnPartition =>
   val myModel = // instantiate neural network on this partition
   myDataOnPartition.map { myDatum => myModel.predict(myDatum) }
 }

 I hope this helps!
 Joseph

 On Fri, Feb 27, 2015 at 10:27 PM, Jaonary Rabarisoa jaon...@gmail.com
 wrote:

 Dear all,


 We mainly do large-scale computer vision tasks (image classification,
 retrieval, ...). The pipeline is really great stuff for that. We're trying
 to reproduce the tutorial given on that topic during the latest spark
 summit (
 http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
  )
 using the master version of spark pipeline and dataframe. The tutorial
 shows different examples of feature extraction stages before running
 machine learning algorithms. Even though the tutorial is straightforward to
 reproduce with this new API, we still have some questions:

- Can one use external tools (e.g. via pipe) as a pipeline stage?
An example use case is to extract features learned with a convolutional
neural network. In our case, this corresponds to a neural network
pre-trained with the Caffe library (http://caffe.berkeleyvision.org/).


- The second question is about the performance of the pipeline.
Libraries such as Caffe process the data in batches, and instantiating one
Caffe network can be time-consuming when the network is very deep. So we can
gain performance if we minimize the number of Caffe network creations and
give data to the network in batches. In the pipeline, this corresponds to
running transformers that work on a partition basis and give the whole
partition to a single Caffe network. How can we create such a transformer?



 Best,

 Jao
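
The batching idea behind the second question can be sketched by combining the mapPartitions pattern above with Iterator.grouped. This is purely illustrative: `imagesRDD`, `createCaffeNet`, `predictBatch`, and `batchSize` are hypothetical names, since the thread does not show a concrete Caffe binding.

```scala
// Sketch: one expensive network per partition, fed in fixed-size batches
// so the deep network is instantiated once and amortized over many rows.
val features = imagesRDD.mapPartitions { iter =>
  val net = createCaffeNet()              // instantiated once per partition
  iter.grouped(batchSize)                 // Iterator[Seq[Image]] of batches
      .flatMap(batch => net.predictBatch(batch))  // one forward pass per batch
}
```

Because grouped returns a lazy iterator of batches, the whole partition never needs to be held in memory at once; only one batch is in flight at a time.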







Re: Reg. Difference in Performance

2015-02-28 Thread Joseph Bradley
Hi Deep,

Compute times may not be very meaningful for small examples like those.  If
you increase the sizes of the examples, then you may start to observe more
meaningful trends and speedups.

Joseph

On Sat, Feb 28, 2015 at 7:26 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I am running Spark applications in GCE. I set up clusters with different
 numbers of nodes, varying from 1 to 7. The machines are single-core machines.
 I set spark.default.parallelism to the number of nodes in the cluster
 for each cluster. I ran the four applications available in Spark Examples
 (SparkTC, SparkALS, SparkLR, SparkPi) for each of the configurations.
 What I notice is the following:
 In the case of SparkTC and SparkALS, the time to complete the job increases
 with the number of nodes in the cluster, whereas in SparkLR and
 SparkPi, the time to complete the job remains the same across all
 configurations.
 Could anyone explain this to me?

 Thank You
 Regards,
 Deep



Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-28 Thread Joseph Bradley
Hi Shahab,

There are actually a few distributed Matrix types which support sparse
representations: RowMatrix, IndexedRowMatrix, and CoordinateMatrix.
The documentation has a bit more info about the various uses:
http://spark.apache.org/docs/latest/mllib-data-types.html#distributed-matrix

The Spark 1.3 RC includes a new one: BlockMatrix.

But since these are distributed, they are represented using RDDs, so they
of course will not be as fast as computations on smaller, locally stored
matrices.

Joseph
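
As a small illustration of one of the distributed types listed above, here is a hedged sketch of building a CoordinateMatrix from (row, column, value) entries; the entry data is made up for the example.

```scala
// Sketch: a sparse distributed matrix stored as an RDD of nonzero entries.
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries = sc.parallelize(Seq(
  MatrixEntry(0, 1, 3.0),   // row 0, col 1, value 3.0
  MatrixEntry(2, 0, 1.5)))  // row 2, col 0, value 1.5

val mat = new CoordinateMatrix(entries)
// Dimensions are inferred from the largest indices seen (here 3 x 2).
println(s"${mat.numRows()} x ${mat.numCols()}")
```

CoordinateMatrix is usually the easiest type to construct from raw sparse data; it can then be converted to an IndexedRowMatrix or RowMatrix for the operations those types support.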

On Fri, Feb 27, 2015 at 4:39 AM, Ritesh Kumar Singh 
riteshoneinamill...@gmail.com wrote:

 try using breeze (scala linear algebra library)

 On Fri, Feb 27, 2015 at 5:56 PM, shahab shahab.mok...@gmail.com wrote:

 Thanks a lot Vijay, let me see how it performs.

 Best
 Shahab


 On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote:

 Available in GML --

 http://x10-lang.org/x10-community/applications/global-
 matrix-library.html

 We are exploring how to make it available within Spark. Any ideas would
 be much appreciated.

 On 2/27/15 7:01 AM, shahab wrote:

 Hi,

 I just wonder if there is any Sparse Matrix implementation available
 in Spark, so it can be used in spark application?

 best,
 /Shahab



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Some questions after playing a little with the new ml.Pipeline.

2015-02-28 Thread Joseph Bradley
Hi Jao,

You can use external tools and libraries if they can be called from your
Spark program or script (with appropriate conversion of data types, etc.).
The best way to apply a pre-trained model to a dataset would be to call the
model from within a closure, e.g.:

myRDD.map { myDatum => preTrainedModel.predict(myDatum) }

If your data is distributed in an RDD (myRDD), then the above call will
distribute the computation of prediction using the pre-trained model.  It
will require that all of your Spark workers be able to run the
preTrainedModel; that may mean installing Caffe and dependencies on all
nodes in the compute cluster.

For the second question, I would modify the above call as follows:

myRDD.mapPartitions { myDataOnPartition =>
  val myModel = // instantiate neural network on this partition
  myDataOnPartition.map { myDatum => myModel.predict(myDatum) }
}

I hope this helps!
Joseph

On Fri, Feb 27, 2015 at 10:27 PM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Dear all,


 We mainly do large-scale computer vision tasks (image classification,
 retrieval, ...). The pipeline is really great stuff for that. We're trying
 to reproduce the tutorial given on that topic during the latest spark
 summit (
 http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html
  )
 using the master version of spark pipeline and dataframe. The tutorial
 shows different examples of feature extraction stages before running
 machine learning algorithms. Even though the tutorial is straightforward to
 reproduce with this new API, we still have some questions:

- Can one use external tools (e.g. via pipe) as a pipeline stage? An
example use case is to extract features learned with a convolutional neural
network. In our case, this corresponds to a neural network pre-trained with
the Caffe library (http://caffe.berkeleyvision.org/).


- The second question is about the performance of the pipeline.
Libraries such as Caffe process the data in batches, and instantiating one
Caffe network can be time-consuming when the network is very deep. So we can
gain performance if we minimize the number of Caffe network creations and
give data to the network in batches. In the pipeline, this corresponds to
running transformers that work on a partition basis and give the whole
partition to a single Caffe network. How can we create such a transformer?



 Best,

 Jao



Re: ML Transformer

2015-02-18 Thread Joseph Bradley
Hi Cesar,

Thanks for trying out Pipelines and bringing up this issue!  It's been an
experimental API, but feedback like this will help us prepare it for
becoming non-Experimental.  I've made a JIRA, and will vote for this being
protected (instead of private[ml]) for Spark 1.3:
https://issues.apache.org/jira/browse/SPARK-5902

Thanks again,
Joseph

On Wed, Feb 18, 2015 at 12:17 PM, Cesar Flores ces...@gmail.com wrote:


 I am working right now with the ML pipeline, which I really like.
 However, in order to make real use of it, I would like to create my own
 transformers that implement org.apache.spark.ml.Transformer. In order to
 do that, a method from PipelineStage needs to be implemented. But this
 method is private to the ml package:

 private[ml] def transformSchema(schema: StructType, paramMap: ParamMap):
 StructType

 Can users create their own transformers? If not, will this
 functionality be added in the future?

 Thanks
 --
 Cesar Flores



Re: Storing DecisionTreeModel

2015-01-27 Thread Joseph Bradley
Hi Andres,

Currently, serializing the object is probably the best way to do it.
However, there are efforts to support actual model import/export:
https://issues.apache.org/jira/browse/SPARK-4587
https://issues.apache.org/jira/browse/SPARK-1406

I'm hoping to have the PR for the first JIRA ready soon.

Joseph
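
The interim approach Joseph describes (plain Java serialization, pending the SPARK-4587 import/export work) could look roughly like this. The local file path is illustrative; for HDFS you would write through the Hadoop FileSystem API instead.

```scala
// Sketch: persist a trained model via Java serialization.
// DecisionTreeModel is Serializable, so ObjectOutputStream suffices.
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.tree.model.DecisionTreeModel

def saveModel(model: DecisionTreeModel, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

def loadModel(path: String): DecisionTreeModel = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[DecisionTreeModel] finally in.close()
}
```

The usual caveat with Java serialization applies: the saved bytes are tied to the Spark version's class definitions, which is exactly why a stable model export format was being tracked in the JIRAs above.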

On Tue, Jan 27, 2015 at 7:45 AM, andresbm84 andresb...@gmail.com wrote:

 Hi everyone,

 Is there a way to save the model on disk to reuse it later? I could
 serialize the object and save the bytes, but I'm guessing there might be a
 better way to do so.

 Has anyone tried that?


 Andres.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Storing-DecisionTreeModel-tp21393.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: SVD in pyspark ?

2015-01-26 Thread Joseph Bradley
Hi Andreas,

There unfortunately is not a Python API yet for distributed matrices or
their operations.  Here's the JIRA to follow to stay up-to-date on it:
https://issues.apache.org/jira/browse/SPARK-3956

There are internal wrappers (used to create the Python API), but they are
not really public APIs.  The bigger challenge is creating/storing the
distributed matrix in Python.

Joseph

On Sun, Jan 25, 2015 at 11:32 AM, Chip Senkbeil chip.senkb...@gmail.com
wrote:

 Hi Andreas,

 With regard to the notebook interface,  you can use the Spark Kernel (
 https://github.com/ibm-et/spark-kernel) as the backend for an IPython 3.0
 notebook. The kernel is designed to be the foundation for interactive
 applications connecting to Apache Spark and uses the IPython 5.0 message
 protocol - used by IPython 3.0 - to communicate.

 See the getting started section here:
 https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel

 It discusses getting IPython connected to a Spark Kernel. If you have any
 more questions, feel free to ask!

 Signed,
 Chip Senkbeil
 IBM Emerging Technologies Software Engineer

 On Sun Jan 25 2015 at 1:12:32 PM Andreas Rhode m.a.rh...@gmail.com
 wrote:

 Is the distributed SVD functionality exposed to Python yet?

 Seems it's only available to scala or java, unless I am missing something,
 looking for a pyspark equivalent to
 org.apache.spark.mllib.linalg.SingularValueDecomposition

 In case it's not there yet, is there a way to make a wrapper to call from
 python into the corresponding Java/Scala code? The reason for using Python
 instead of just using Scala directly is that I'd like to take advantage of the
 notebook interface for visualization.

 As an aside, is there a notebook-like interface for the Scala-based REPL?

 Thanks

 Andreas



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/SVD-in-pyspark-tp21356.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: [mllib] Decision Tree - prediction probabilites of label classes

2015-01-24 Thread Joseph Bradley
There is a JIRA...but not a PR yet.  Here's the JIRA:
https://issues.apache.org/jira/browse/SPARK-3727

I'm not aware of current work on it, but I agree it would be nice to have!
Joseph

On Thu, Jan 22, 2015 at 2:50 AM, Sean Owen so...@cloudera.com wrote:

 You are right that this isn't implemented. I presume you could propose
 a PR for this. The impurity calculator implementations already receive
 category counts. The only drawback I see is having to store N
 probabilities at each leaf, not 1.

 On Wed, Jan 21, 2015 at 3:36 PM, Zsolt Tóth toth.zsolt@gmail.com
 wrote:
  Hi,
 
  I use DecisionTree for multi class classification.
  I can get the probability of the predicted label for every node in the
  decision tree from node.predict().prob(). Is it possible to retrieve or
  count the probability of every possible label class in the node?
  To be more clear:
  Say in Node A there are 4 of label 0.0, 2 of label 1.0 and 3 of label
 2.0.
  If I'm correct predict.prob() is 4/9 in this case. I need the values 2/9
 and
  3/9 for the 2 other labels.
 
  It would be great to retrieve the exact count of label classes ([4,2,3] in
  the example), but I don't think that's possible now. Is something like this
  planned for a future release?
 
  Thanks!

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Need some help to create user defined type for ML pipeline

2015-01-24 Thread Joseph Bradley
Hi Jao,

You're right that defining serialize and deserialize is the main task in
implementing a UDT.  They are basically translating between your native
representation (ByteImage) and SQL DataTypes.  The sqlType you defined
looks correct, and you're correct to use a row of length 4.  Other than
that, it should just require copying data to and from SQL Rows.  There are
quite a few examples of that in the codebase; I'd recommend searching based
on the particular DataTypes you're using.

Are there particular issues you're running into?

Joseph
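
The "copying data to and from SQL Rows" step could look roughly as follows for the ByteImage type in this thread. This is a hedged sketch only: it continues the user's code, targets the Spark 1.2/1.3-era internal API (GenericMutableRow lives in org.apache.spark.sql.catalyst.expressions and is not a stable public class), and the method bodies are meant to be dropped into the ByteImageUDT class from the question.

```scala
// Sketch: serialize copies each ByteImage field into the Row position
// matching the sqlType schema; deserialize reads them back out.
override def serialize(obj: Any): Row = {
  val row = new GenericMutableRow(4)
  val img = obj.asInstanceOf[ByteImage]
  row.setInt(0, img.channels)   // field 0: "channels" (IntegerType)
  row.setInt(1, img.width)      // field 1: "width"
  row.setInt(2, img.height)     // field 2: "height"
  row.update(3, img.data)       // field 3: "data" (BinaryType holds Array[Byte])
  row
}

override def deserialize(datum: Any): ByteImage = datum match {
  case row: Row =>
    new ByteImage(row.getInt(0), row.getInt(1), row.getInt(2),
                  row.getAs[Array[Byte]](3))
}
```

Note that deserialize should return ByteImage (and userClass should be classOf[ByteImage]); the Vector types left over in the question are residue from copying VectorUDT.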

On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:

 Hi all,

 I'm trying to implement a pipeline for computer vision based on the latest
 ML package in Spark. The first step of my pipeline is to decode images (JPEG,
 for instance) stored in a Parquet file.
 For this, I began by creating a UserDefinedType that represents a decoded
 image stored in an array of bytes. Here is my first attempt:


 @SQLUserDefinedType(udt = classOf[ByteImageUDT])
 class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])

 private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {

   override def sqlType: StructType = {
     // type: 0 = sparse, 1 = dense
     // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
     // vectors. The "values" field is nullable because we might want to add binary vectors later,
     // which uses "size" and "indices", but not "values".
     StructType(Seq(
       StructField("channels", IntegerType, nullable = false),
       StructField("width", IntegerType, nullable = false),
       StructField("height", IntegerType, nullable = false),
       StructField("data", BinaryType, nullable = false)))
   }

   override def serialize(obj: Any): Row = {
     val row = new GenericMutableRow(4)
     val img = obj.asInstanceOf[ByteImage]
     ...
   }

   override def deserialize(datum: Any): Vector = {
     ...
   }

   override def pyUDT: String = "pyspark.mllib.linalg.VectorUDT"

   override def userClass: Class[Vector] = classOf[Vector]
 }


 I took the VectorUDT as a starting point, but there are a lot of things that I
 don't really understand. So any help on defining the serialize and deserialize
 methods would be appreciated.

 Best Regards,

 Jao




Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-24 Thread Joseph Bradley
Hi Lokesh,
Glad the update fixed the bug.  maxBins is a parameter you can tune based
on your data.  Essentially, larger maxBins is potentially more accurate,
but will run more slowly and use more memory.  maxBins must be <= training
set size; I would say try some small values (4, 8, 16).  If there is a
difference in performance between those, then you can tune it more;
otherwise, just pick one.
Good luck!
Joseph
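
The small grid search suggested above can be sketched as follows; `trainData`, `testData`, and the `evaluate` helper are illustrative, and the train call uses the old mllib DecisionTree API from this era.

```scala
// Sketch: try a few maxBins values and compare held-out accuracy.
import org.apache.spark.mllib.tree.DecisionTree

for (maxBins <- Seq(4, 8, 16)) {
  val model = DecisionTree.trainClassifier(
    trainData,
    numClasses = 2,
    categoricalFeaturesInfo = Map[Int, Int](),  // all features continuous
    impurity = "gini",
    maxDepth = 5,
    maxBins = maxBins)
  println(s"maxBins=$maxBins accuracy=${evaluate(model, testData)}")
}
```

If the accuracies are close, pick the smallest maxBins (fastest, least memory); only widen the search if the values differ noticeably.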

On Fri, Oct 24, 2014 at 12:54 AM, lokeshkumar lok...@dataken.net wrote:

 Hi Joseph,

 Thanks for the help.

 I have tried this DecisionTree example with the latest spark code and it is
 working fine now. But how do we choose the maxBins for this model?

 Thanks
 Lokesh



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLLIB-Decision-Tree-ArrayIndexOutOfBounds-Exception-tp16907p17195.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark MLLIB Decision Tree - ArrayIndexOutOfBounds Exception

2014-10-21 Thread Joseph Bradley
Hi, this sounds like a bug which has been fixed in the current master.
What version of Spark are you using?  Would it be possible to update to the
current master?
If not, it would be helpful to know some more of the problem dimensions
(num examples, num features, feature types, label type).
Thanks,
Joseph


On Tue, Oct 21, 2014 at 2:42 AM, lokeshkumar lok...@dataken.net wrote:

 Hi All,

 I am trying to run the Spark example JavaDecisionTree code using some
 external data sets.
 It works for certain datasets only with specific maxBins and maxDepth
 settings. Even for a working dataset, if I add a new data item I get an
 ArrayIndexOutOfBoundsException; I get the same exception for the first
 case as well (changing maxBins and maxDepth). I am not sure what is wrong
 here; can anyone please explain this?

 Exception stacktrace:

 14/10/21 13:47:15 ERROR executor.Executor: Exception in task 1.0 in stage
 7.0 (TID 13)
 java.lang.ArrayIndexOutOfBoundsException: 6301
 at

 org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)
 at

 org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)
 at

 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)
 at

 org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
 at

 org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)
 at

 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 at

 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at

 org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
 at
 scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
 at

 org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)
 at
 scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
 at

 org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)
 at

 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
 at

 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
 at

 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 at

 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 14/10/21 13:47:15 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 7.0
 (TID 13, localhost): java.lang.ArrayIndexOutOfBoundsException: 6301


 org.apache.spark.mllib.tree.DecisionTree$.updateBinForOrderedFeature$1(DecisionTree.scala:648)


 org.apache.spark.mllib.tree.DecisionTree$.binaryOrNotCategoricalBinSeqOp$1(DecisionTree.scala:706)


 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:798)


 org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)


 org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:830)


 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)


 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)


 org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

 scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)


 org.apache.spark.InterruptibleIterator.foldLeft(InterruptibleIterator.scala:28)

 scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)


 org.apache.spark.InterruptibleIterator.aggregate(InterruptibleIterator.scala:28)


 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)


 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)


 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)


 

Re: MLlib model viewing and saving

2014-10-13 Thread Joseph Bradley
Currently, printing (toString) gives a human-readable version of the tree,
but it is not a format which is easy to save and load.  That sort of
serialization is in the works, but not available for trees right now.
 (Note that the current master actually has toString (for a short summary
of the tree) and toDebugString (for a full printout of the model).)
Joseph
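
Putting those two methods together with the HDFS question from the original message, a hedged sketch (the HDFS path is illustrative, and toDebugString assumes a build with that method):

```scala
// Sketch: inspect a trained tree, then save the human-readable form.
println(model.toString)        // short summary: depth and number of nodes
println(model.toDebugString)   // full if/else printout of every split

// The printout can be written to HDFS as text, but note it is only
// human-readable output; it cannot be loaded back as a model.
sc.parallelize(Seq(model.toDebugString), 1)
  .saveAsTextFile("hdfs:///models/decision-tree-debug")
```

For an actual reloadable model, Java serialization (or the import/export work tracked in SPARK-4587) is still needed, as discussed in the "Storing DecisionTreeModel" thread.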

On Fri, Aug 15, 2014 at 4:28 PM, Sameer Tilak ssti...@live.com wrote:

 Hi All,

 I have a mlib model:

 val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth)


 I see the model has the following methods:
 algo   asInstanceOf   isInstanceOf   predict   toString
 topNode

 model.topNode outputs:
 org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf = false, predict =
 0.5, split = Some(Feature = 87, threshold = 0.7931471805599453, featureType
 =  Continuous, categories = List()), stats = Some(gain = 0.89, impurity
 = 0.35, left impurity = 0.12, right impurity = 0.00, predict =
 0.50)

 I was wondering what the best way is to look at the model. We want to see
 what the decision tree looks like: which features are selected, the
 details of splitting, what the depth is, etc. Is there an easy way to see
 that? I can traverse it recursively using topNode.leftNode and
 topNode.rightNode.
 However, I was wondering if there is any way to look at the model and also
 to save it on HDFS for later use.




Re: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Joseph Bradley
I have run into that issue too, but only when the data were not
pre-processed correctly.  E.g., if a categorical feature is binary with
values in {-1, +1} instead of {0,1}.  Will be very interested to learn if
it can occur elsewhere!


On Thu, Aug 14, 2014 at 10:16 AM, Sameer Tilak ssti...@live.com wrote:


 Hi Yanbo,
 I think it was happening because some of the rows did not have all the
 columns. We are cleaning up the data and will let you know once we confirm
 this.

 --
 Date: Thu, 14 Aug 2014 22:50:58 +0800
 Subject: Re: java.lang.UnknownError: no bin was found for continuous
 variable.
 From: yanboha...@gmail.com
 To: ssti...@live.com

 Can you supply the detailed code and data you used?
 From the log, it looks like it cannot find the bin for a specific feature.
 The bin for a continuous feature is a unit that covers a specific range of
 the feature's values.


 2014-08-14 7:43 GMT+08:00 Sameer Tilak ssti...@live.com:

 Hi All,

 I am using the decision tree algorithm and I get the following error. Any
 help would be great!


 java.lang.UnknownError: no bin was found for continuous variable.
  at org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492)
  at org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findBinsForLevel$1(DecisionTree.scala:529)
  at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
  at org.apache.spark.mllib.tree.DecisionTree$$anonfun$3.apply(DecisionTree.scala:653)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
  at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
  at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
  at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
  at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
  at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
  at org.apache.spark.scheduler.Task.run(Task.scala:51)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:744)
 14/08/13 16:36:06 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
 java.lang.UnknownError: no bin was found for continuous variable.





Re: Decision tree classifier in MLlib

2014-07-18 Thread Joseph Bradley
Hi Sudha,
Have you checked whether the labels are being loaded correctly?  It sounds
like the DT algorithm can't find any useful splits to make, so maybe it
thinks the labels are all the same.  Some data loading functions threshold
labels to make them binary.
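A quick, framework-agnostic sanity check (hypothetical helper, plain Python)
is to count the distinct label values after loading; if only one class
survives loading, no split can reduce impurity and the tree will predict a
single label:

```python
from collections import Counter

# Hypothetical sanity check: inspect the loaded labels before blaming the
# tree. If a loader thresholded the labels, the training set may collapse
# to a single class, and a tree can do no better than predict that class.

def label_summary(labeled_points):
    """Count distinct label values in an iterable of (label, features) pairs."""
    return Counter(label for label, _ in labeled_points)

data = [(1.0, [0.2, 0.4]), (1.0, [0.9, 0.1]), (0.0, [0.5, 0.5])]
print(label_summary(data))  # Counter({1.0: 2, 0.0: 1})
```

In Spark, the analogous check would be counting labels on the loaded RDD
before training.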
Hope it helps,
Joseph


On Fri, Jul 11, 2014 at 2:25 PM, SK skrishna...@gmail.com wrote:

 Hi,

 I have a small dataset (120 training points, 30 test points) that I am
 trying to classify into binary classes (1 or 0). The dataset has 4
 numerical
 features and 1 binary label (1 or 0).

 I used LogisticRegression and SVM in MLlib and got 100% accuracy in both
 cases. But when I used DecisionTree, I got only 33% accuracy (basically
 all the predicted test labels are 1, whereas actually only 10 out of the
 30 should be 1). I tried modifying the different parameters (maxDepth,
 bins, impurity, etc.) and am still only able to get 33% accuracy.

 I used the same dataset with R's decision tree (rpart) and got 100%
 accuracy. I would like to understand why the performance of MLlib's
 decision tree model is poor and whether there is some way I can improve it.

 thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Decision-tree-classifier-in-MLlib-tp9457.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.