Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-19 Thread DB Tsai
If it's standalone mode, it's even easier. You should be able to
connect to Hadoop 2.6 HDFS using the 3.2 client. In your K8s cluster, just
don't put Hadoop 2.6 on your classpath.
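For the standalone/K8s case, a minimal sketch of what that could look like
(the host names, ports, and path below are placeholders I'm assuming, not
details from this thread):

// Sketch only: placeholder hosts and paths.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("spark://spark-master.example.com:7077")   // standalone master (placeholder)
  .appName("spark3-reads-hadoop26-hdfs")
  // Let the embedded Hadoop 3.2 client talk to the remote 2.6 namenode.
  .config("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")
  .getOrCreate()

// Reads go through the 3.2 client against the 2.6 HDFS.
spark.read.textFile("/data/sample.txt").show(5)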

On Sun, Jul 19, 2020 at 10:25 PM Ashika Umanga Umagiliya
 wrote:
>
> Hello
>
> "spark.yarn.populateHadoopClasspath" is used in YARN mode correct?
> However our Spark cluster is standalone cluster not using YARN.
> We only connect to HDFS/Hive to access data.Computation is done on our spark 
> cluster running on K8s (not Yarn)
>
>
> On Mon, Jul 20, 2020 at 2:04 PM DB Tsai  wrote:
>>
>> In Spark 3.0, if you use the `with-hadoop` Spark distribution that has
>> Hadoop 3.2 embedded, you can set
>> `spark.yarn.populateHadoopClasspath=false` so that the cluster's Hadoop
>> classpath is not populated. In this scenario, Spark will use the Hadoop
>> 3.2 client to connect to Hadoop 2.6, which should work fine. In fact,
>> we have had production deployments running this way for a while.
>>
>> On Sun, Jul 19, 2020 at 8:10 PM Ashika Umanga  
>> wrote:
>> >
>> > Greetings,
>> >
>> > Hadoop 2.6 has been removed according to this ticket 
>> > https://issues.apache.org/jira/browse/SPARK-25016
>> >
>> > We run our Spark cluster on K8s in standalone mode.
>> > We access HDFS/Hive running on a Hadoop 2.6 cluster.
>> > We've been using Spark 2.4.5 and are planning to upgrade to Spark 3.0.0.
>> > However, we don't have any control over the Hadoop cluster and it will
>> > remain on 2.6.
>> >
>> > Is Spark 3.0 still compatible with HDFS/Hive running on Hadoop 2.6 ?
>> >
>> > Best Regards,
>>
>>
>>
>> --
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 42E5B25A8F7A82C1
>
>
>
> --
> Umanga
> http://jp.linkedin.com/in/umanga
> http://umanga.ifreepages.com



-- 
Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 3.0 with Hadoop 2.6 HDFS/Hive

2020-07-20 Thread DB Tsai
In Spark 3.0, if you use the `with-hadoop` Spark distribution that has
Hadoop 3.2 embedded, you can set
`spark.yarn.populateHadoopClasspath=false` so that the cluster's Hadoop
classpath is not populated. In this scenario, Spark will use the Hadoop
3.2 client to connect to Hadoop 2.6, which should work fine. In fact,
we have had production deployments running this way for a while.
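Roughly, that setup looks like the sketch below (the app name and path are
placeholders; the same flag can equally go into spark-defaults.conf or be
passed with --conf at submit time):

// Sketch only: a Spark 3.x `with-hadoop` build submitted to a Hadoop 2.6 YARN cluster.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark3-on-hadoop26-yarn")
  // Keep the cluster's Hadoop 2.6 jars off the classpath; use the embedded 3.2 client.
  .config("spark.yarn.populateHadoopClasspath", "false")
  .getOrCreate()

spark.read.parquet("hdfs:///warehouse/some_table").count()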

On Sun, Jul 19, 2020 at 8:10 PM Ashika Umanga  wrote:
>
> Greetings,
>
> Hadoop 2.6 has been removed according to this ticket 
> https://issues.apache.org/jira/browse/SPARK-25016
>
> We run our Spark cluster on K8s in standalone mode.
> We access HDFS/Hive running on a Hadoop 2.6 cluster.
> We've been using Spark 2.4.5 and are planning to upgrade to Spark 3.0.0.
> However, we don't have any control over the Hadoop cluster and it will remain
> on 2.6.
>
> Is Spark 3.0 still compatible with HDFS/Hive running on Hadoop 2.6 ?
>
> Best Regards,



-- 
Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 3.0.1 not connecting with Hive 2.1.1

2021-01-09 Thread DB Tsai
Hi Pradyumn,

I think it's because of an HMS client backward-compatibility issue described
here: https://issues.apache.org/jira/browse/HIVE-24608
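Until that is fixed, one thing worth trying (a sketch only; the exact version
string and jar source are assumptions about your setup, not something
confirmed in this thread) is to point Spark at a Hive metastore client that
matches your 2.1.1 HMS:

// Sketch: use a metastore client matching the 2.1.x HMS instead of the built-in one.
// "maven" downloads the Hive client jars at runtime; a path to your CDH Hive jars
// can be used instead.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark3-hive21-metastore")
  .config("spark.sql.hive.metastore.version", "2.1.1")
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()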

Thanks,

DB Tsai | ACI Spark Core |  Apple, Inc

> On Jan 9, 2021, at 9:53 AM, Pradyumn Agrawal  wrote:
> 
> Hi Michael,
> Thanks for the references. I had a hard time translating the 3rd one, as Google Translate
> <https://translate.google.com/translate?sl=auto&tl=en&u=https://blog.csdn.net/Young2018/article/details/108871542>
> of the CSDN blog didn't work correctly, and I had already gone through the 1st
> and 2nd earlier.
> But I can see the CDH distribution is different in my case; it is CDH 6.3.1.
> 
> 
> 
> As you can see in the screenshot, it is reporting: Invalid method name:
> get_table_req
> I am guessing that the CDH distribution has some changes to the Hive Metastore client
> which conflict with Spark's shim implementations. I couldn't debug much, though;
> it's mostly guesswork.
> Would certainly like to know your and other views on this?
> 
> Thanks and Regards
> Pradyumn Agrawal
> Media.net (India)
> 
> On Sat, Jan 9, 2021 at 8:01 PM michael.yang wrote:
> Hi Pradyumn,
> 
> We integrated Spark 3.0.1 with Hive 2.1.1-cdh6.1.0 and it works fine to use
> spark-sql to query Hive tables.
> 
> Make sure you configure spark-defaults.conf and spark-env.sh well, and copy
> the Hive/Hadoop related config files to the Spark conf folder.
> 
> You can refer to the references below for details.
> 
> https://spark.apache.org/docs/latest/building-spark.html
> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> https://blog.csdn.net/Young2018/article/details/108871542
> 
> Thanks
> Michael Yang
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Re: [ANNOUNCE] Apache Spark 3.2.1 released

2022-01-28 Thread DB Tsai
Thank you, Huaxin for the 3.2.1 release!

Sent from my iPhone

> On Jan 28, 2022, at 5:45 PM, Chao Sun  wrote:
> 
> 
> Thanks Huaxin for driving the release!
> 
>> On Fri, Jan 28, 2022 at 5:37 PM Ruifeng Zheng  wrote:
>> It's Great!
>> Congrats and thanks, huaxin!
>> 
>> 
>> -- Original Message --
>> From: "huaxin gao" ;
>> Sent: Saturday, January 29, 2022, 9:07 AM
>> To: "dev"; "user";
>> Subject: [ANNOUNCE] Apache Spark 3.2.1 released
>> 
>> We are happy to announce the availability of Spark 3.2.1!
>> 
>> Spark 3.2.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.2 maintenance branch of Spark. We strongly
>> recommend all 3.2 users to upgrade to this stable release.
>> 
>> To download Spark 3.2.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>> 
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-2-1.html
>> 
>> We would like to acknowledge all community members for contributing to this
>> release. This release would not have been possible without you.
>> 
>> Huaxin Gao


Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread DB Tsai
Hi Xin,

If you take a look at the model you trained, the intercept from Spark
is significantly smaller than the one from StatsModels. The intercept
represents a prior on the categories in LOR, and this is what causes the
low accuracy of the Spark implementation: in LogisticRegressionWithLBFGS,
the intercept is regularized due to the implementation of Updater, and the
intercept should not be regularized.

In the new pipeline APIs, a LOR with elasticNet is implemented, and
the intercept is properly handled.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

As you can see in the tests,
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
the result is now exactly the same as R.

BTW, in both versions, feature scaling is done before training,
and we train the model in the scaled space but transform the model weights
back to the original space. The only difference is that the mllib version,
LogisticRegressionWithLBFGS, regularizes the intercept, while in the ml
version the intercept is excluded from regularization. As a result,
if lambda is zero, the models should be the same.
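For reference, a minimal sketch of the new API (the `training` DataFrame and
the parameter values are placeholders, not Xin's actual setup):

// Sketch: the intercept is fit but excluded from regularization in this API.
// `training` is an assumed DataFrame with "label" and "features" columns.
import org.apache.spark.ml.classification.LogisticRegression

val lor = new LogisticRegression()
  .setFitIntercept(true)
  .setRegParam(0.01)        // lambda (placeholder)
  .setElasticNetParam(0.0)  // 0.0 = pure L2, 1.0 = pure L1
  .setMaxIter(100)

val lorModel = lor.fit(training)
println(s"Intercept: ${lorModel.intercept}")  // not penalized by the regularizer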



On Wed, May 20, 2015 at 3:42 PM, Xin Liu  wrote:
> Hi,
>
> I have tried a few models in Mllib to train a LogisticRegression model.
> However, I consistently get much better results using other libraries such
> as statsmodel (which gives similar results as R) in terms of AUC. For
> illustration purpose, I used a small data (I have tried much bigger data)
>  http://www.ats.ucla.edu/stat/data/binary.csv in
> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>
> Here is the snippet of my usage of LogisticRegressionWithLBFGS.
>
> val algorithm = new LogisticRegressionWithLBFGS
>  algorithm.setIntercept(true)
>  algorithm.optimizer
>.setNumIterations(100)
>.setRegParam(0.01)
>.setConvergenceTol(1e-5)
>  val model = algorithm.run(training)
>  model.clearThreshold()
>  val scoreAndLabels = test.map { point =>
>val score = model.predict(point.features)
>(score, point.label)
>  }
>  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
>  val auROC = metrics.areaUnderROC()
>
> I did a (0.6, 0.4) split for training/test. The response is "admit" and
> features are "GRE score", "GPA", and "college Rank".
>
> Spark:
> Weights (GRE, GPA, Rank):
> [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
> Intercept: -0.6488972641282202
> Area under ROC: 0.6294070512820512
>
> StatsModel:
> Weights [0.0018, 0.7220, -0.3148]
> Intercept: -3.5913
> Area under ROC: 0.69
>
> The weights from statsmodel seems more reasonable if you consider for a one
> unit increase in gpa, the log odds of being admitted to graduate school
> increases by 0.72 in statsmodel than 0.04 in Spark.
>
> I have seen much bigger difference with other data. So my question is has
> anyone compared the results with other libraries and is anything wrong with
> my code to invoke LogisticRegressionWithLBFGS?
>
> As the real data I am processing is pretty big and really want to use Spark
> to get this to work. Please let me know if you have similar experience and
> how you resolve it.
>
> Thanks,
> Xin

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-22 Thread DB Tsai
Great to see results comparable to R from the new ML implementation.
Since the majority of users will still use the old mllib API, we plan to
call the ML implementation from MLlib so that the intercept is handled
correctly with regularization.

JIRA is created.
https://issues.apache.org/jira/browse/SPARK-7780

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Fri, May 22, 2015 at 10:45 AM, Xin Liu  wrote:
> Thank you guys for the prompt help.
>
> I ended up building spark master and verified what DB has suggested.
>
> val lr = (new MlLogisticRegression)
>.setFitIntercept(true)
>.setMaxIter(35)
>
>  val model = lr.fit(sqlContext.createDataFrame(training))
>  val scoreAndLabels = model.transform(sqlContext.createDataFrame(test))
>.select("probability", "label")
>.map { case Row(probability: Vector, label: Double) =>
>  (probability(1), label)
>}
>
> Without doing much tuning, above generates
>
> Weights: [0.0013971323020715888,0.8559779783186241,-0.5052275562089914]
> Intercept: -3.3076806966913006
> Area under ROC: 0.7033511043412033
>
> I also tried it on a much bigger dataset I have and its results are close to
> what I get from statsmodel.
>
> Now early waiting for the 1.4 release.
>
> Thanks,
> Xin
>
>
>
> On Wed, May 20, 2015 at 9:37 PM, Chris Gore  wrote:
>>
>> I tried running this data set as described with my own implementation of
>> L2 regularized logistic regression using LBFGS to compare:
>> https://github.com/cdgore/fitbox
>>
>> Intercept: -0.886745823033
>> Weights (['gre', 'gpa', 'rank']):[ 0.28862268  0.19402388 -0.36637964]
>> Area under ROC: 0.724056603774
>>
>> The difference could be from the feature preprocessing as mentioned.  I
>> normalized the features around 0:
>>
>> binary_train_normalized = (binary_train - binary_train.mean()) /
>> binary_train.std()
>> binary_test_normalized = (binary_test - binary_train.mean()) /
>> binary_train.std()
>>
>> On a data set this small, the difference in models could also be the
>> result of how the training/test sets were split.
>>
>> Have you tried running k-folds cross validation on a larger data set?
>>
>> Chris
>>
>> On May 20, 2015, at 6:15 PM, DB Tsai  wrote:
>>
>> Hi Xin,
>>
>> If you take a look at the model you trained, the intercept from Spark
>> is significantly smaller than the one from StatsModels. The intercept
>> represents a prior on the categories in LOR, and this is what causes the
>> low accuracy of the Spark implementation: in LogisticRegressionWithLBFGS,
>> the intercept is regularized due to the implementation of Updater, and the
>> intercept should not be regularized.
>>
>> In the new pipeline APIs, a LOR with elasticNet is implemented, and
>> the intercept is properly handled.
>>
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
>>
>> As you can see in the tests,
>>
>> https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
>> the result is now exactly the same as R.
>>
>> BTW, in both versions, feature scaling is done before training,
>> and we train the model in the scaled space but transform the model weights
>> back to the original space. The only difference is that the mllib version,
>> LogisticRegressionWithLBFGS, regularizes the intercept, while in the ml
>> version the intercept is excluded from regularization. As a result,
>> if lambda is zero, the models should be the same.
>>
>>
>>
>> On Wed, May 20, 2015 at 3:42 PM, Xin Liu  wrote:
>>
>> Hi,
>>
>> I have tried a few models in Mllib to train a LogisticRegression model.
>> However, I consistently get much better results using other libraries such
>> as statsmodel (which gives similar results as R) in terms of AUC. For
>> illustration purpose, I used a small data (I have tried much bigger data)
>> http://www.ats.ucla.edu/stat/data/binary.csv in
>> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>>
>> Here is the snippet of my usage of LogisticRegressionWithLBFGS.
>>
>> val algorithm = new LogisticRegressionWithLBFGS
>> algorithm.setIntercept(true)
>> algorithm.optimizer
>>   .setNumIterations(100)
>>   .setRegParam(0.01)
>>   .setConvergenceTol(1e-5)
>> val model = algorithm.run(training)
>> model.clearThreshold()
>> val s

Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-22 Thread DB Tsai
In Spark 1.4, Logistic Regression with elasticNet is implemented in the ML
pipeline framework. Model selection can be achieved through a high
lambda, resulting in lots of zeros in the coefficients.
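As a rough sketch of what that looks like (the `training` DataFrame and the
regParam value are placeholders to be tuned for your data):

// Sketch: elasticNetParam = 1.0 gives pure L1 (lasso); a large regParam drives
// many coefficients to exactly zero, which acts as variable selection.
import org.apache.spark.ml.classification.LogisticRegression

val lasso = new LogisticRegression()
  .setElasticNetParam(1.0)
  .setRegParam(0.1)          // the "high lambda" -- placeholder value
  .setFitIntercept(true)

val lassoModel = lasso.fit(training)  // `training` is an assumed DataFrame(label, features)
// Features whose fitted coefficient is zero are effectively dropped from the model.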

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Fri, May 22, 2015 at 1:19 AM, SparknewUser
 wrote:
> I am new in MLlib and in Spark.(I use Scala)
>
> I'm trying to understand how LogisticRegressionWithLBFGS and
> LogisticRegressionWithSGD work.
> I usually use R to do logistic regressions but now I do it on Spark
> to be able to analyze Big Data.
>
> The model only returns weights and intercept. My problem is that I have no
> information about which variable is significant and which variable I had
> better
> to delete to improve my model. I only have the confusion matrix and the AUC
> to evaluate the performance.
>
> Is there any way to have information about the variables I put in my model?
> How can I try different variable combinations, do I have to modify the
> dataset
> of origin (e.g. delete one or several columns?)
> How are the weights calculated: is there a correlation calculation with the
> variable
> of interest?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-how-to-get-the-best-model-with-only-the-most-significant-explanatory-variables-in-LogisticRegr-tp22993.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
With Mesos, how do we control the number of executors? In our cluster,
each node only has one executor with a very big JVM. Sometimes, if the
executor dies, all of its concurrently running tasks are gone. We would like
to have multiple executors on one node but cannot figure out a way to do
it in Yarn.

On Wednesday, May 27, 2015, Saisai Shao  wrote:

> The driver has a heuristic mechanism to decide the number of executors at
> run-time according to the pending tasks. You can enable it with
> configuration; refer to the Spark documentation for the details.
>
> 2015-05-27 15:00 GMT+08:00 canan chen  >:
>
>> How does the dynamic allocation works ? I mean does it related
>> with parallelism of my RDD and how does driver know how many executor it
>> needs ?
>>
>> On Wed, May 27, 2015 at 2:49 PM, Saisai Shao > > wrote:
>>
>>> It depends on how you use Spark, if you use Spark with Yarn and enable
>>> dynamic allocation, the number of executor is not fixed, will change
>>> dynamically according to the load.
>>>
>>> Thanks
>>> Jerry
>>>
>>> 2015-05-27 14:44 GMT+08:00 canan chen >> >:
>>>
 It seems the executor number is fixed for the standalone mode, not sure
 other modes.

>>>
>>>
>>
>

-- 
Sent from my iPhone


Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread DB Tsai
Typo: we cannot figure out a way to increase the number of executors on one
node in Mesos.

On Wednesday, May 27, 2015, DB Tsai  wrote:

> With Mesos, how do we control the number of executors? In our cluster,
> each node only has one executor with a very big JVM. Sometimes, if the
> executor dies, all of its concurrently running tasks are gone. We would like
> to have multiple executors on one node but cannot figure out a way to do
> it in Yarn.
>
> On Wednesday, May 27, 2015, Saisai Shao  > wrote:
>
>> The driver has a heuristic mechanism to decide the number of executors at
>> run-time according to the pending tasks. You can enable it with
>> configuration; refer to the Spark documentation for the details.
>>
>> 2015-05-27 15:00 GMT+08:00 canan chen :
>>
>>> How does the dynamic allocation works ? I mean does it related
>>> with parallelism of my RDD and how does driver know how many executor it
>>> needs ?
>>>
>>> On Wed, May 27, 2015 at 2:49 PM, Saisai Shao 
>>> wrote:
>>>
>>>> It depends on how you use Spark, if you use Spark with Yarn and enable
>>>> dynamic allocation, the number of executor is not fixed, will change
>>>> dynamically according to the load.
>>>>
>>>> Thanks
>>>> Jerry
>>>>
>>>> 2015-05-27 14:44 GMT+08:00 canan chen :
>>>>
>>>>> It seems the executor number is fixed for the standalone mode, not
>>>>> sure other modes.
>>>>>
>>>>
>>>>
>>>
>>
>
> --
> Sent from my iPhone
>


-- 
Sent from my iPhone


Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread DB Tsai
LinearRegressionWithSGD requires tuning the step size and number of
iterations very carefully. Please try the Linear Regression with elastic
net implementation in Spark 1.4's ML framework, which uses a quasi-Newton
method, so the step size is determined automatically. That
implementation also matches the results from R.
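A minimal sketch of that API (column names and parameter values here are
placeholders):

// Sketch: no step size to tune; the quasi-Newton solver handles that internally.
// `training` is an assumed DataFrame with "label" and "features" columns.
import org.apache.spark.ml.regression.LinearRegression

val lir = new LinearRegression()
  .setElasticNetParam(0.5)  // mix of L1 and L2 (placeholder)
  .setRegParam(0.01)        // placeholder
  .setMaxIter(100)

val lirModel = lir.fit(training)
println(s"Intercept: ${lirModel.intercept}")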

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Wed, May 27, 2015 at 9:08 PM, Maheshakya Wijewardena
 wrote:
>
> Hi,
>
> I'm trying to use Sparks' LinearRegressionWithSGD in PySpark with the
> attached dataset. The code is attached. When I check the model weights
> vector after training, it contains `nan` values.
>
> [nan,nan,nan,nan,nan,nan,nan,nan]
>
> But for some data sets, this problem does not occur. What might be the
> reason for this?
> Is this an issue with the data I'm using or a bug?
>
> Best regards.
>
> --
> Pruthuvi Maheshakya Wijewardena
> Software Engineer
> WSO2 Lanka (Pvt) Ltd
> Email: mahesha...@wso2.com
> Mobile: +94711228855
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-30 Thread DB Tsai
Alternatively, I will give a talk about the LOR and LIR elastic-net
implementations, and the interpretation of those models, at Spark Summit.

https://spark-summit.org/2015/events/large-scale-lasso-and-elastic-net-regularized-generalized-linear-models/

You may attend or watch online.


Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com

On Fri, May 29, 2015 at 5:35 AM, mélanie gallois <
melanie.galloi...@gmail.com> wrote:

> When will Spark 1.4 be available exactly?
> To answer to "Model selection can be achieved through high
> lambda resulting lots of zero in the coefficients" : Do you mean that
> putting a high lambda as a parameter of the logistic regression keeps only
> a few significant variables and "deletes" the others with a zero in the
> coefficients? What is a high lambda for you?
> Is the lambda a parameter available in Spark 1.4 only or can I see it in
> Spark 1.3?
>
> 2015-05-23 0:04 GMT+02:00 Joseph Bradley :
>
>> If you want to select specific variable combinations by hand, then you
>> will need to modify the dataset before passing it to the ML algorithm.  The
>> DataFrame API should make that easy to do.
>>
>> If you want to have an ML algorithm select variables automatically, then
>> I would recommend using L1 regularization for now and possibly elastic net
>> after 1.4 is release, per DB's suggestion.
>>
>> If you want detailed model statistics similar to what R provides, I've
>> created a JIRA for discussing how we should add that functionality to
>> MLlib.  Those types of stats will be added incrementally, but feedback
>> would be great for prioritization:
>> https://issues.apache.org/jira/browse/SPARK-7674
>>
>> To answer your question: "How are the weights calculated: is there a
>> correlation calculation with the variable of interest?"
>> --> Weights are calculated as with all logistic regression algorithms, by
>> using convex optimization to minimize a regularized log loss.
>>
>> Good luck!
>> Joseph
>>
>> On Fri, May 22, 2015 at 1:07 PM, DB Tsai  wrote:
>>
>>> In Spark 1.4, Logistic Regression with elasticNet is implemented in the ML
>>> pipeline framework. Model selection can be achieved through a high
>>> lambda, resulting in lots of zeros in the coefficients.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ---
>>> Blog: https://www.dbtsai.com
>>>
>>>
>>> On Fri, May 22, 2015 at 1:19 AM, SparknewUser
>>>  wrote:
>>> > I am new in MLlib and in Spark.(I use Scala)
>>> >
>>> > I'm trying to understand how LogisticRegressionWithLBFGS and
>>> > LogisticRegressionWithSGD work.
>>> > I usually use R to do logistic regressions but now I do it on Spark
>>> > to be able to analyze Big Data.
>>> >
>>> > The model only returns weights and intercept. My problem is that I
>>> have no
>>> > information about which variable is significant and which variable I
>>> had
>>> > better
>>> > to delete to improve my model. I only have the confusion matrix and
>>> the AUC
>>> > to evaluate the performance.
>>> >
>>> > Is there any way to have information about the variables I put in my
>>> model?
>>> > How can I try different variable combinations, do I have to modify the
>>> > dataset
>>> > of origin (e.g. delete one or several columns?)
>>> > How are the weights calculated: is there a correlation calculation
>>> with the
>>> > variable
>>> > of interest?
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-how-to-get-the-best-model-with-only-the-most-significant-explanatory-variables-in-LogisticRegr-tp22993.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> *Mélanie*
>


Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Which part of StandardScaler is slow, fit or transform? Fit has a shuffle, but a
very small one, and transform doesn't shuffle at all. I guess you don't have enough
partitions, so please repartition your input dataset to a number at least
larger than the number of executors you have.
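A quick sketch of that suggestion, reusing your `parsedData` RDD (the
partition count is just an illustrative number; pick something comfortably
above your 80 executors):

// Sketch: spread the rows over more partitions before fitting the scaler.
import org.apache.spark.mllib.feature.StandardScaler

val repartitioned = parsedData.repartition(320).cache()  // 320 is a placeholder
repartitioned.count()  // materialize once so the cost doesn't land inside fit()

val scaler = new StandardScaler(withMean = false, withStd = true)
  .fit(repartitioned.map(_.features))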

In Spark 1.4's new ML pipeline API, we have Linear Regression with elastic
net, and in that version we use a quasi-Newton method for optimization, so it will
be way faster than the SGD implementation. Also, in that
implementation, StandardScaler is not required, since we implicitly do this for you
when computing the loss function.

https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef

Please try this out and give us feedback. Thanks.

On Wednesday, June 3, 2015, Piero Cinquegrana 
wrote:

>  Hello User group,
>
>
>
> I have a RDD of LabeledPoint composed of sparse vectors like showing
> below. In the next step, I am standardizing the columns with the Standard
> Scaler. The data has 2450 columns and ~110M rows. It took 1.5hrs to
> complete the standardization with 10 nodes and 80 executors. The
> spark.executor.memory was set to 2g and the driver memory to 5g.
>
>
>
> scala> val parsedData = stack_sorted.mapPartitions( partition =>
>
> partition.map{row
> => LabeledPoint(row._2._1.getDouble(4), sparseVectorCat(row._2,
> CategoriesIdx, InteractionIds, tupleMap, vecLength))
>
>  },
> preservesPartitioning=true).cache()
>
>
> CategoriesIdx: Array[Int] = Array(3, 8, 12)
>
> InteractionIds: Array[(Int, Int)] = Array((13,12))
>
> vecLength: Int = 2450
>
> parsedData:
> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] =
> MapPartitionsRDD[93] at mapPartitions at :111
>
> (1.0,(2450,[1322,1361,2430],[1.0,1.0,1.0]))
>
> (0.0,(2450,[1322,1781,2430],[1.0,1.0,1.0]))
>
> (2.0,(2450,[1322,2193,2430],[1.0,1.0,1.0]))
>
> (1.0,(2450,[297,1322,2430],[1.0,1.0,1.0]))
>
> (0.0,(2450,[898,1322,2430],[1.0,1.0,1.0]))
>
>
>
>
>
> My suspicious is that because the data is partitioned using a custom
> partitioner the Standard Scaler is causing a major shuffle operation. Any
> suggestion on how to improve the performance this step and a
> LinearRegressionWithSGD() which is also taking a very long time?
>
>
>
> scala> parsedData.partitioner
>
> res72: Option[org.apache.spark.Partitioner] = Some(
> org.apache.spark.HashPartitioner@d2)
>
>
>
> scala> val scaler = new StandardScaler(withMean = false, withStd =
> true).fit(parsedData.map( row =>  row.features))
>
> scala> val scaledData = parsedData.mapPartitions(partition =>
> partition.map{row => LabeledPoint(row.label,
> scaler.transform(row.features))}).cache()
>
>
>
> scala> val numIterations = 100
>
> scala> val stepSize = 0.1
>
> scala> val miniBatchFraction = 0.1
>
> scala> val algorithm = new LinearRegressionWithSGD()
>
>
>
> scala> algorithm.setIntercept(false)
>
> scala> algorithm.optimizer.setNumIterations(numIterations)
>
> scala> algorithm.optimizer.setStepSize(stepSize)
>
> scala> algorithm.optimizer.setMiniBatchFraction(miniBatchFraction)
>
>
>
> scala> val model = algorithm.run(scaledData)
>
>
>
> Best,
>
>
>
> Piero Cinquegrana
>
> Marketing Scientist | MarketShare
> 11150 Santa Monica Blvd, 5th Floor, Los Angeles, CA 90025
> P: 310.914.5677 x242 M: 323.377.9197
> www.marketshare.com 
> twitter.com/marketsharep
>
>
>


Re: Standard Scaler taking 1.5hrs

2015-06-03 Thread DB Tsai
Can you do count() before fit to force materialization of the RDD? I think
something before the fit is slow.

On Wednesday, June 3, 2015, Piero Cinquegrana 
wrote:

>  The fit part is very slow, transform not at all.
>
>  The number of partitions was 210 vs number of executors 80.
>
>  Spark 1.4 sounds great but as my company is using Qubole we are
> dependent upon them to upgrade from version 1.3.1. Until that happens, can
> you think of any other reasons as to why it could be slow. Sparse vectors?
> Excessive number of columns?
>
> Sent from my mobile device. Please excuse any typos.
>
> On Jun 3, 2015, at 9:53 PM, DB Tsai  > wrote:
>
> Which part of StandardScaler is slow, fit or transform? Fit has a shuffle, but a
> very small one, and transform doesn't shuffle at all. I guess you don't have enough
> partitions, so please repartition your input dataset to a number at least
> larger than the number of executors you have.
>
> In Spark 1.4's new ML pipeline API, we have Linear Regression with elastic
> net, and in that version we use a quasi-Newton method for optimization, so it will
> be way faster than the SGD implementation. Also, in that
> implementation, StandardScaler is not required, since we implicitly do this for you
> when computing the loss function.
>
>
> https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef
>
>  Please try this out and give us feedback. Thanks.
>
> On Wednesday, June 3, 2015, Piero Cinquegrana <
> pcinquegr...@marketshare.com
> > wrote:
>
>>  Hello User group,
>>
>>
>>
>> I have a RDD of LabeledPoint composed of sparse vectors like showing
>> below. In the next step, I am standardizing the columns with the Standard
>> Scaler. The data has 2450 columns and ~110M rows. It took 1.5hrs to
>> complete the standardization with 10 nodes and 80 executors. The
>> spark.executor.memory was set to 2g and the driver memory to 5g.
>>
>>
>>
>> scala> val parsedData = stack_sorted.mapPartitions( partition =>
>>
>> partition.map{row
>> => LabeledPoint(row._2._1.getDouble(4), sparseVectorCat(row._2,
>> CategoriesIdx, InteractionIds, tupleMap, vecLength))
>>
>>  },
>> preservesPartitioning=true).cache()
>>
>>
>> CategoriesIdx: Array[Int] = Array(3, 8, 12)
>>
>> InteractionIds: Array[(Int, Int)] = Array((13,12))
>>
>> vecLength: Int = 2450
>>
>> parsedData:
>> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] =
>> MapPartitionsRDD[93] at mapPartitions at :111
>>
>> (1.0,(2450,[1322,1361,2430],[1.0,1.0,1.0]))
>>
>> (0.0,(2450,[1322,1781,2430],[1.0,1.0,1.0]))
>>
>> (2.0,(2450,[1322,2193,2430],[1.0,1.0,1.0]))
>>
>> (1.0,(2450,[297,1322,2430],[1.0,1.0,1.0]))
>>
>> (0.0,(2450,[898,1322,2430],[1.0,1.0,1.0]))
>>
>>
>>
>>
>>
>> My suspicious is that because the data is partitioned using a custom
>> partitioner the Standard Scaler is causing a major shuffle operation. Any
>> suggestion on how to improve the performance this step and a
>> LinearRegressionWithSGD() which is also taking a very long time?
>>
>>
>>
>> scala> parsedData.partitioner
>>
>> res72: Option[org.apache.spark.Partitioner] = Some(
>> org.apache.spark.HashPartitioner@d2)
>>
>>
>>
>> scala> val scaler = new StandardScaler(withMean = false, withStd =
>> true).fit(parsedData.map( row =>  row.features))
>>
>> scala> val scaledData = parsedData.mapPartitions(partition =>
>> partition.map{row => LabeledPoint(row.label,
>> scaler.transform(row.features))}).cache()
>>
>>
>>
>> scala> val numIterations = 100
>>
>> scala> val stepSize = 0.1
>>
>> scala> val miniBatchFraction = 0.1
>>
>> scala> val algorithm = new LinearRegressionWithSGD()
>>
>>
>>
>> scala> algorithm.setIntercept(false)
>>
>> scala> algorithm.optimizer.setNumIterations(numIterations)
>>
>> scala> algorithm.optimizer.setStepSize(stepSize)
>>
>> scala> algorithm.optimizer.setMiniBatchFraction(miniBatchFraction)
>>
>>
>>
>> scala> val model = algorithm.run(scaledData)
>>
>>
>>
>> Best,
>>
>>
>>
>> Piero Cinquegrana
>>
>> Marketing Scientist | MarketShare
>> 11150 Santa Monica Blvd, 5th Floor, Los Angeles, CA 90025
>> P: 310.914.5677 x242 M: 323.377.9197
>> www.marketshare.com <http://www.marketsharepartners.com/>
>> twitter.com/marketsharep
>>
>>
>>
>

-- 
- DB

Sent from my iPhone


Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
By default, the depth of the tree is 2. Each partition will be one node.
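For example (a quick sketch, assuming a SparkContext `sc` as in the shell):

// Sketch: treeReduce with an explicit depth; depth = 2 is the default.
val nums = sc.parallelize(1 to 1000, numSlices = 32)
val total = nums.treeReduce((a, b) => a + b, depth = 2)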

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Thu, Jun 4, 2015 at 10:46 AM, Raghav Shankar  wrote:
> Hey Reza,
>
> Thanks for your response!
>
> Your response clarifies some of my initial thoughts. However, what I don't
> understand is how the depth of the tree is used to identify how many
> intermediate reducers there will be, and how many partitions are sent to the
> intermediate reducers. Could you provide some insight into this?
>
> Thanks,
> Raghav
>
> On Thursday, June 4, 2015, Reza Zadeh  wrote:
>>
>> In a regular reduce, all partitions have to send their reduced value to a
>> single machine, and that machine can become a bottleneck.
>>
>> In a treeReduce, the partitions talk to each other in a logarithmic number
>> of rounds. Imagine a binary tree that has all the partitions at its leaves
>> and the root will contain the final reduced value. This way there is no
>> single bottleneck machine.
>>
>> It remains to decide the number of children each node should have and how
>> deep the tree should be, which is some of the logic in the method you
>> pasted.
>>
>> On Wed, Jun 3, 2015 at 7:10 PM, raggy  wrote:
>>>
>>> I am trying to understand what the treeReduce function for an RDD does,
>>> and
>>> how it is different from the normal reduce function. My current
>>> understanding is that treeReduce tries to split up the reduce into
>>> multiple
>>> steps. We do a partial reduce on different nodes, and then a final reduce
>>> is
>>> done to get the final result. Is this correct? If so, I guess what I am
>>> curious about is, how does spark decide how many nodes will be on each
>>> level, and how many partitions will be sent to a given node?
>>>
>>> The bulk of the implementation is within this function:
>>>
>>> partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
>>>   .getOrElse(throw new UnsupportedOperationException("empty
>>> collection"))
>>>
>>> The above function is expanded to
>>>
>>> val cleanSeqOp = context.clean(seqOp)
>>>   val cleanCombOp = context.clean(combOp)
>>>   val aggregatePartition =
>>> (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp,
>>> cleanCombOp)
>>>   var partiallyAggregated = mapPartitions(it =>
>>> Iterator(aggregatePartition(it)))
>>>   var numPartitions = partiallyAggregated.partitions.length
>>>   val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 /
>>> depth)).toInt, 2)
>>>   // If creating an extra level doesn't help reduce
>>>   // the wall-clock time, we stop tree aggregation.
>>>   while (numPartitions > scale + numPartitions / scale) {
>>> numPartitions /= scale
>>> val curNumPartitions = numPartitions
>>> partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex
>>> {
>>>   (i, iter) => iter.map((i % curNumPartitions, _))
>>> }.reduceByKey(new HashPartitioner(curNumPartitions),
>>> cleanCombOp).values
>>>   }
>>>   partiallyAggregated.reduce(cleanCombOp)
>>>
>>> I am completely lost about what is happening in this function. I would
>>> greatly appreciate some sort of explanation.
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/TreeReduce-Functionality-in-Spark-tp23147.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: TreeReduce Functionality in Spark

2015-06-04 Thread DB Tsai
For the first round, you will have 16 reducers working, since you have
32 partitions. Each pair of the 32 partitions knows which reducer it will
go to by sharing the same key in reduceByKey.

After this step is done, you will have 16 partitions, so the next
round will use 8 reducers.
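To make the shrinking concrete, here is the partition-count arithmetic pulled
out of the quoted treeAggregate code, as a sketch with scale fixed at 2 (a
pairwise combine, which the 16 and 8 above assume); in the real code, scale is
derived from the partition count and the depth:

// Sketch: how many reducers each round uses when partitions are combined pairwise.
var numPartitions = 32
val scale = 2
while (numPartitions > scale + numPartitions / scale) {
  numPartitions /= scale
  println(s"round with $numPartitions reducers")  // prints 16, 8, 4
}
// Whatever remains is handled by the final reduce() on the driver.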

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Thu, Jun 4, 2015 at 12:06 PM, Raghav Shankar  wrote:
> Hey DB,
>
> Thanks for the reply!
>
> I still don't think this answers my question. For example, if I have a top()
> action being executed and I have 32 workers(32 partitions), and I choose a
> depth of 4, what does the overlay of intermediate reducers look like? How
> many reducers are there excluding the master and the worker? How many
> partitions get sent to each of these intermediate reducers? Does this number
> vary at each level?
>
> Thanks!
>
>
> On Thursday, June 4, 2015, DB Tsai  wrote:
>>
>> By default, the depth of the tree is 2. Each partition will be one node.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> Blog: https://www.dbtsai.com
>>
>>
>> On Thu, Jun 4, 2015 at 10:46 AM, Raghav Shankar 
>> wrote:
>> > Hey Reza,
>> >
>> > Thanks for your response!
>> >
>> > Your response clarifies some of my initial thoughts. However, what I
>> > don't
>> > understand is how the depth of the tree is used to identify how many
>> > intermediate reducers there will be, and how many partitions are sent to
>> > the
>> > intermediate reducers. Could you provide some insight into this?
>> >
>> > Thanks,
>> > Raghav
>> >
>> > On Thursday, June 4, 2015, Reza Zadeh  wrote:
>> >>
>> >> In a regular reduce, all partitions have to send their reduced value to
>> >> a
>> >> single machine, and that machine can become a bottleneck.
>> >>
>> >> In a treeReduce, the partitions talk to each other in a logarithmic
>> >> number
>> >> of rounds. Imagine a binary tree that has all the partitions at its
>> >> leaves
>> >> and the root will contain the final reduced value. This way there is no
>> >> single bottleneck machine.
>> >>
>> >> It remains to decide the number of children each node should have and
>> >> how
>> >> deep the tree should be, which is some of the logic in the method you
>> >> pasted.
>> >>
>> >> On Wed, Jun 3, 2015 at 7:10 PM, raggy  wrote:
>> >>>
>> >>> I am trying to understand what the treeReduce function for an RDD
>> >>> does,
>> >>> and
>> >>> how it is different from the normal reduce function. My current
>> >>> understanding is that treeReduce tries to split up the reduce into
>> >>> multiple
>> >>> steps. We do a partial reduce on different nodes, and then a final
>> >>> reduce
>> >>> is
>> >>> done to get the final result. Is this correct? If so, I guess what I
>> >>> am
>> >>> curious about is, how does spark decide how many nodes will be on each
>> >>> level, and how many partitions will be sent to a given node?
>> >>>
>> >>> The bulk of the implementation is within this function:
>> >>>
>> >>> partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
>> >>>   .getOrElse(throw new UnsupportedOperationException("empty
>> >>> collection"))
>> >>>
>> >>> The above function is expanded to
>> >>>
>> >>> val cleanSeqOp = context.clean(seqOp)
>> >>>   val cleanCombOp = context.clean(combOp)
>> >>>   val aggregatePartition =
>> >>> (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp,
>> >>> cleanCombOp)
>> >>>   var partiallyAggregated = mapPartitions(it =>
>> >>> Iterator(aggregatePartition(it)))
>> >>>   var numPartitions = partiallyAggregated.partitions.length
>> >>>   val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 /
>> >>> depth)).toInt, 2)
>> >>>   // If creating an extra level doesn't help reduce
>> >>>   // the wall-clock time, we stop tree aggregation.
>> >>>   while (numPartitions > scale + numPartitions / scale) {

Re: Linear Regression with SGD

2015-06-09 Thread DB Tsai
As Robin suggested, you may try the following new implementation.

https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef

Thanks.

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Tue, Jun 9, 2015 at 3:22 PM, Robin East  wrote:

> Hi Stephen
>
> How many is a very large number of iterations? SGD is notorious for
> requiring 100s or 1000s of iterations, also you may need to spend some time
> tweaking the step-size. In 1.4 there is an implementation of ElasticNet
> Linear Regression which is supposed to compare favourably with an
> equivalent R implementation.
> > On 9 Jun 2015, at 22:05, Stephen Carman  wrote:
> >
> > Hi User group,
> >
> > We are using spark Linear Regression with SGD as the optimization
> technique and we are achieving very sub-optimal results.
> >
> > Can anyone shed some light on why this implementation seems to produce
> such poor results vs our own implementation?
> >
> > We are using a very small dataset, but we have to use a very large
> number of iterations to achieve similar results to our implementation,
> we’ve tried normalizing the data
> > not normalizing the data and tuning every param. Our implementation is a
> closed form solution so we should be guaranteed convergence but the spark
> one is not, which is
> > understandable, but why is it so far off?
> >
> > Has anyone experienced this?
> >
> > Steve Carman, M.S.
> > Artificial Intelligence Engineer
> > Coldlight-PTC
> > scar...@coldlight.com
> > This e-mail is intended solely for the above-mentioned recipient and it
> may contain confidential or privileged information. If you have received it
> in error, please notify us immediately and delete the e-mail. You must not
> copy, distribute, disclose or take any action in reliance on it. In
> addition, the contents of an attachment to this e-mail may contain software
> viruses which could damage your own computer system. While ColdLight
> Solutions, LLC has taken every reasonable precaution to minimize this risk,
> we cannot accept liability for any damage which you sustain as a result of
> software viruses. You should perform your own virus checks before opening
> the attachment.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Implementing top() using treeReduce()

2015-06-09 Thread DB Tsai
Having the following code in RDD.scala works for me. PS, in the following
code, I merge the smaller queue into the larger one; I wonder if this will help
performance. Let me know when you do the benchmark.

def treeTakeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
Array.empty
  } else {
val mapRDDs = mapPartitions { items =>
  // Priority keeps the largest elements, so let's reverse the ordering.
  val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
  queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
  Iterator.single(queue)
}
if (mapRDDs.partitions.length == 0) {
  Array.empty
} else {
  mapRDDs.treeReduce { (queue1, queue2) =>
if (queue1.size > queue2.size) {
  queue1 ++= queue2
  queue1
} else {
  queue2 ++= queue1
  queue2
}
  }.toArray.sorted(ord)
}
  }
}

def treeTop(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  treeTakeOrdered(num)(ord.reverse)
}
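For the benchmark, once the methods above are compiled into RDD.scala, usage
would look something like this (a sketch; the RDD contents are placeholders):

// Sketch: compare both implementations on the same RDD.
val nums = sc.parallelize(1 to 10000000, numSlices = 64)
val viaReduce = nums.top(10)      // existing top(), single-point combine
val viaTree   = nums.treeTop(10)  // the treeReduce-based variant above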



Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Tue, Jun 9, 2015 at 10:09 AM, raggy  wrote:

> I am trying to implement top-k in scala within apache spark. I am aware
> that
> spark has a top action. But, top() uses reduce(). Instead, I would like to
> use treeReduce(). I am trying to compare the performance of reduce() and
> treeReduce().
>
> The main issue I have is that I cannot use these 2 lines of code which are
> used in the top() action within my Spark application.
>
> val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
> queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
>
> How can I go about implementing top() using treeReduce()?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Implementing-top-using-treeReduce-tp23227.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
+cc user@spark.apache.org

Reply inline.

On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
 wrote:
> Hi DB,
>
> Thank you for the reply. That explains a lot.
>
> I however had a few points regarding this:-
>
> 1. Just to help with the debate of not regularizing the b parameter. A 
> standard implementation argues against regularizing the b parameter. See Pg 
> 64 para 1 :  http://statweb.stanford.edu/~tibs/ElemStatLearn/
>

Agreed. We were just worried that it would change behavior, but we actually
have a PR to change the behavior to the standard one:
https://github.com/apache/spark/pull/6386

> 2. Further, is the regularization of b also applicable for the SGD 
> implementation. Currently the SGD vs. BFGS implementations give different 
> results (and both the implementations don't match the IRLS algorithm). Are 
> the SGD/BFGS implemented for different loss functions? Can you please share 
> your thoughts on this.
>

In the SGD implementation, we don't "standardize" the dataset before
training. As a result, columns with a low standard deviation will
be penalized more, and those with a high standard deviation will be
penalized less. Also, standardization helps the rate of convergence.
As a result, most packages "standardize" the data
implicitly, obtain the weights in the "standardized" space, and
transform them back to the original space, so it's transparent to users.

1) LORWithSGD: no standardization, and penalizes the intercept.
2) LORWithLBFGS: with standardization, but penalizes the intercept.
3) New LOR implementation: with standardization, without penalizing the
intercept.

As a result, only the new implementation in Spark ML handles
everything correctly. We have tests to verify that the results match
R.
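If you need to stay on the old mllib API for now, one workaround is to
standardize the features yourself before training. A rough sketch (the `data`
RDD of LabeledPoint is assumed, and note withMean = true densifies sparse
vectors):

// Sketch: standardize features yourself, then train with the old mllib API.
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

val scalerModel = new StandardScaler(withMean = true, withStd = true)
  .fit(data.map(_.features))
val scaled = data.map(p => LabeledPoint(p.label, scalerModel.transform(p.features))).cache()

val model = new LogisticRegressionWithLBFGS().setIntercept(true).run(scaled)
// The returned weights live in the scaled feature space.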

>
> @Naveen: Please feel free to add/comment on the above points as you see 
> necessary.
>
> Thanks,
> Sauptik.
>
> -Original Message-
> From: DB Tsai
> Sent: Tuesday, June 16, 2015 2:08 PM
> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
> Cc: Dhar Sauptik (CR/RTC1.3-NA)
> Subject: Re: FW: MLLIB (Spark) Question.
>
> Hey,
>
> In the LORWithLBFGS api you use, the intercept is regularized while
> other implementations don't regularize the intercept. That's why you
> see the difference.
>
> The intercept should not be regularized, so we fix this in new Spark
> ML api in spark 1.4. Since this will change the behavior in the old
> api if we decide to not regularize the intercept in old version, we
> are still debating about this.
>
> See the following code for full running example in spark 1.4
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
>
> And also check out my talk at spark summit.
> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>
>
> Sincerely,
>
> DB Tsai
> --
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
>  wrote:
>> Hi DB,
>> Hope you are doing well! One of my colleagues, Sauptik, is working with
>> MLLib and the logistic regression based on LBFGS and is having trouble
>> reproducing the same results when compared to Matlab. Please see below for
>> details. I did take a look into this but seems like there’s also discrepancy
>> between the logistic regression with SGD and LBFGS implementations in MLLib.
>> We have attached all the codes for your analysis – it’s in PySpark though.
>> Let us know if you have any questions or concerns. We would very much
>> appreciate your help whenever you get a chance.
>>
>> Best,
>> Naveen.
>>
>> _
>> From: Dhar Sauptik (CR/RTC1.3-NA)
>> Sent: Thursday, June 11, 2015 6:03 PM
>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Subject: MLLIB (Spark) Question.
>>
>>
>> Hi Naveen,
>>
>> I am writing this owing to some MLLIB issues I found while using Logistic
>> Regression. Basically, I am trying to test the stability of the L1/L2 –
>> Logistic Regression using SGD and BFGS. Unfortunately I am unable to confirm
>> the correctness of the algorithms. For comparison I implemented the
>> L2-Logistic regression algorithm (using IRLS algorithm Pg. 121) From the
>> book http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
>> . Unfortunately the solutions don’t match:-
>>
>> For example:-
>>
>> Using the Publicly available data (diabetes.csv) for L2 regularized Logistic
>> 

Re: FW: MLLIB (Spark) Question.

2015-06-16 Thread DB Tsai
Hi Dhar,

For "standardization", we can disable it effectively by using
different regularization on each component. Thus, we're solving the
same problem but having better rate of convergence. This is one of the
features I will implement.

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA)
 wrote:
> Hi DB,
>
> Thank you for the reply. The answers makes sense. I do have just one more 
> point to add.
>
> Note that it may be better to not implicitly standardize the data. Agreed 
> that a number of algorithms benefit from such standardization, but for many 
> applications with contextual information such standardization "may" not be 
> desirable.
> Users can always perform the standardization themselves.
>
> However, that's just a suggestion. Again, thank you for the clarification.
>
> Thanks,
> Sauptik.
>
>
> -Original Message-
> From: DB Tsai [mailto:dbt...@dbtsai.com]
> Sent: Tuesday, June 16, 2015 2:49 PM
> To: Dhar Sauptik (CR/RTC1.3-NA); Ramakrishnan Naveen (CR/RTC1.3-NA)
> Cc: user@spark.apache.org
> Subject: Re: FW: MLLIB (Spark) Question.
>
> +cc user@spark.apache.org
>
> Reply inline.
>
> On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
>  wrote:
>> Hi DB,
>>
>> Thank you for the reply. That explains a lot.
>>
>> I however had a few points regarding this:-
>>
>> 1. Just to help with the debate of not regularizing the b parameter. A 
>> standard implementation argues against regularizing the b parameter. See Pg 
>> 64 para 1 :  http://statweb.stanford.edu/~tibs/ElemStatLearn/
>>
>
> Agreed. We were just worried that it would change behavior, but we actually
> have a PR to change the behavior to the standard one:
> https://github.com/apache/spark/pull/6386
>
>> 2. Further, is the regularization of b also applicable for the SGD 
>> implementation. Currently the SGD vs. BFGS implementations give different 
>> results (and both the implementations don't match the IRLS algorithm). Are 
>> the SGD/BFGS implemented for different loss functions? Can you please share 
>> your thoughts on this.
>>
>
> In the SGD implementation, we don't "standardize" the dataset before
> training. As a result, columns with a low standard deviation will
> be penalized more, and those with a high standard deviation will be
> penalized less. Also, standardization helps the rate of convergence.
> As a result, most packages "standardize" the data
> implicitly, obtain the weights in the "standardized" space, and
> transform them back to the original space, so it's transparent to users.
>
> 1) LORWithSGD: no standardization, and penalizes the intercept.
> 2) LORWithLBFGS: with standardization, but penalizes the intercept.
> 3) New LOR implementation: with standardization, without penalizing the
> intercept.
>
> As a result, only the new implementation in Spark ML handles
> everything correctly. We have tests to verify that the results match
> R.
>
>>
>> @Naveen: Please feel free to add/comment on the above points as you see 
>> necessary.
>>
>> Thanks,
>> Sauptik.
>>
>> -Original Message-
>> From: DB Tsai
>> Sent: Tuesday, June 16, 2015 2:08 PM
>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Cc: Dhar Sauptik (CR/RTC1.3-NA)
>> Subject: Re: FW: MLLIB (Spark) Question.
>>
>> Hey,
>>
>> In the LORWithLBFGS api you use, the intercept is regularized while
>> other implementations don't regularize the intercept. That's why you
>> see the difference.
>>
>> The intercept should not be regularized, so we fix this in new Spark
>> ML api in spark 1.4. Since this will change the behavior in the old
>> api if we decide to not regularize the intercept in old version, we
>> are still debating about this.
>>
>> See the following code for full running example in spark 1.4
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
>>
>> And also check out my talk at spark summit.
>> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
>>

Re: Implementing top() using treeReduce()

2015-06-17 Thread DB Tsai
You need to build the Spark assembly with your modification and deploy
it to the cluster.

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Wed, Jun 17, 2015 at 5:11 PM, Raghav Shankar  wrote:
> I’ve implemented this in the suggested manner. When I build Spark and attach
> the new spark-core jar to my eclipse project, I am able to use the new
> method. In order to conduct the experiments I need to launch my app on a
> cluster. I am using EC2. When I setup my master and slaves using the EC2
> setup scripts, it sets up spark, but I think my custom built spark-core jar
> is not being used. How do I set it up on EC2 so that my custom version of
> spark-core is used?
>
> Thanks,
> Raghav
>
> On Jun 9, 2015, at 7:41 PM, DB Tsai  wrote:
>
> Having the following code in RDD.scala works for me. PS, in the following
> code, I merge the smaller queue into larger one. I wonder if this will help
> performance. Let me know when you do the benchmark.
>
> def treeTakeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] =
> withScope {
>   if (num == 0) {
> Array.empty
>   } else {
> val mapRDDs = mapPartitions { items =>
>   // Priority keeps the largest elements, so let's reverse the ordering.
>   val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>   queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
>   Iterator.single(queue)
> }
> if (mapRDDs.partitions.length == 0) {
>   Array.empty
> } else {
>   mapRDDs.treeReduce { (queue1, queue2) =>
> if (queue1.size > queue2.size) {
>   queue1 ++= queue2
>   queue1
> } else {
>   queue2 ++= queue1
>   queue2
> }
>   }.toArray.sorted(ord)
> }
>   }
> }
>
> def treeTop(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
>   treeTakeOrdered(num)(ord.reverse)
> }
>
>
>
> Sincerely,
>
> DB Tsai
> --
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
> On Tue, Jun 9, 2015 at 10:09 AM, raggy  wrote:
>>
>> I am trying to implement top-k in scala within apache spark. I am aware
>> that
>> spark has a top action. But, top() uses reduce(). Instead, I would like to
>> use treeReduce(). I am trying to compare the performance of reduce() and
>> treeReduce().
>>
>> The main issue I have is that I cannot use these 2 lines of code which are
>> used in the top() action within my Spark application.
>>
>> val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>> queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
>>
>> How can I go about implementing top() using treeReduce()?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Implementing-top-using-treeReduce-tp23227.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Implementing top() using treeReduce()

2015-06-17 Thread DB Tsai
all of them.

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Wed, Jun 17, 2015 at 5:15 PM, Raghav Shankar  wrote:
> So, would I add the assembly jar to just the master, or would I have to
> add it to all the slaves/workers too?
>
> Thanks,
> Raghav
>
>> On Jun 17, 2015, at 5:13 PM, DB Tsai  wrote:
>>
>> You need to build the Spark assembly with your modification and deploy
>> it to the cluster.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Wed, Jun 17, 2015 at 5:11 PM, Raghav Shankar  
>> wrote:
>>> I’ve implemented this in the suggested manner. When I build Spark and attach
>>> the new spark-core jar to my eclipse project, I am able to use the new
>>> method. In order to conduct the experiments I need to launch my app on a
>>> cluster. I am using EC2. When I setup my master and slaves using the EC2
>>> setup scripts, it sets up spark, but I think my custom built spark-core jar
>>> is not being used. How do it up on EC2 so that my custom version of
>>> Spark-core is used?
>>>
>>> Thanks,
>>> Raghav
>>>
>>> On Jun 9, 2015, at 7:41 PM, DB Tsai  wrote:
>>>
>>> Having the following code in RDD.scala works for me. PS, in the following
>>> code, I merge the smaller queue into larger one. I wonder if this will help
>>> performance. Let me know when you do the benchmark.
>>>
>>> def treeTakeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] =
>>> withScope {
>>>  if (num == 0) {
>>>Array.empty
>>>  } else {
>>>val mapRDDs = mapPartitions { items =>
>>>  // Priority keeps the largest elements, so let's reverse the ordering.
>>>  val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>>>  queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
>>>  Iterator.single(queue)
>>>}
>>>if (mapRDDs.partitions.length == 0) {
>>>  Array.empty
>>>} else {
>>>  mapRDDs.treeReduce { (queue1, queue2) =>
>>>if (queue1.size > queue2.size) {
>>>  queue1 ++= queue2
>>>  queue1
>>>} else {
>>>  queue2 ++= queue1
>>>  queue2
>>>}
>>>  }.toArray.sorted(ord)
>>>}
>>>  }
>>> }
>>>
>>> def treeTop(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
>>>  treeTakeOrdered(num)(ord.reverse)
>>> }
>>>
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Blog: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>> On Tue, Jun 9, 2015 at 10:09 AM, raggy  wrote:
>>>>
>>>> I am trying to implement top-k in scala within apache spark. I am aware
>>>> that
>>>> spark has a top action. But, top() uses reduce(). Instead, I would like to
>>>> use treeReduce(). I am trying to compare the performance of reduce() and
>>>> treeReduce().
>>>>
>>>> The main issue I have is that I cannot use these 2 lines of code which are
>>>> used in the top() action within my Spark application.
>>>>
>>>> val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
>>>> queue ++= util.collection.Utils.takeOrdered(items, num)(ord)
>>>>
>>>> How can I go about implementing top() using treeReduce()?
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Implementing-top-using-treeReduce-tp23227.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>
>>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Difference between Lasso regression in MLlib package and ML package

2015-06-19 Thread DB Tsai
Hi Wei,

I don't think ML is meant for single node computation, and the
algorithms in ML are designed for the pipeline framework.

In short, the lasso regression in ML is a new algorithm implemented from
scratch; it's faster and converges to the same solution as R's glmnet,
but with scalability. Here is the talk I gave at Spark Summit about the
new elastic-net feature in ML. I encourage you to try the one in ML.

http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
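
For concreteness, here is a minimal sketch of fitting the ML elastic-net
linear regression. This is my own illustration, not code from the talk;
the `training` DataFrame (with "label" and "features" columns) and the
parameter values are assumptions.

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setMaxIter(100)
  .setRegParam(0.01)        // overall regularization strength (lambda)
  .setElasticNetParam(0.5)  // 0.0 = pure L2 (ridge), 1.0 = pure L1 (lasso)

val lrModel = lr.fit(training)
println(s"Weights: ${lrModel.weights}, intercept: ${lrModel.intercept}")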

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Fri, Jun 19, 2015 at 11:38 AM, Wei Zhou  wrote:
> Hi Spark experts,
>
> I see lasso regression/ elastic net implementation under both MLLib and ML,
> does anyone know what is the difference between the two implementation?
>
> In spark summit, one of the keynote speakers mentioned that ML is meant for
> single node computation, could anyone elaborate this?
>
> Thanks.
>
> Wei

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Missing values support in Mllib yet?

2015-06-19 Thread DB Tsai
Not really yet. But at work, we do GBDT missing values imputation, so
I've the interest to port them to mllib if I have enough time.

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Fri, Jun 19, 2015 at 1:23 PM, Arun Luthra  wrote:
> Hi,
>
> Is there any support for handling missing values in mllib yet, especially
> for decision trees where this is a natural feature?
>
> Arun

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
The regularization is handled after the objective function of the data is
computed. See 
https://github.com/apache/spark/blob/6a827d5d1ec520f129e42c3818fe7d0d870dcbef/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
 line 348 for L2.

For L1, it's handled by OWLQN, so you don't see it explicitly, but the
code is in line 128.
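
To spell it out (my own summary of the parametrization, not a quote from
the code), the full objective being minimized is the data term plus the
elastic-net penalty, applied internally on the standardized features:

  1/(2n) * ||X w - y||^2 + regParam * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2)

where alpha is elasticNetParam. The comment in LinearRegression.scala
only shows the first term because the L2 part is added to the loss and
gradient after the data pass (line 348 above), and the L1 part never
appears explicitly since OWLQN handles it inside the solver.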

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Jun 23, 2015 at 3:14 PM, Wei Zhou  wrote:
> Hi DB Tsai,
>
> Thanks for your reply. I went through the source code of
> LinearRegression.scala. The algorithm minimizes square error L = 1/2n ||A
> weights - y||^2^. I cannot match this with the elasticNet loss function
> found here http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html, which
> is the sum of square error plus L1 and L2 penalty.
>
> I am able to follow the rest of the mathematical deviation in the code
> comment. I am hoping if you could point me to any references that can fill
> this knowledge gap.
>
> Best,
> Wei
>
>
>
> 2015-06-19 12:35 GMT-07:00 DB Tsai :
>>
>> Hi Wei,
>>
>> I don't think ML is meant for single node computation, and the
>> algorithms in ML are designed for pipeline framework.
>>
>> In short, the lasso regression in ML is new algorithm implemented from
>> scratch, and it's faster, and converged to the same solution as R's
>> glmnet but with scalability. Here is the talk I gave in Spark summit
>> about the new elastic-net feature in ML. I will encourage you to try
>> the one ML.
>>
>>
>> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Fri, Jun 19, 2015 at 11:38 AM, Wei Zhou  wrote:
>> > Hi Spark experts,
>> >
>> > I see lasso regression/ elastic net implementation under both MLLib and
>> > ML,
>> > does anyone know what is the difference between the two implementation?
>> >
>> > In spark summit, one of the keynote speakers mentioned that ML is meant
>> > for
>> > single node computation, could anyone elaborate this?
>> >
>> > Thanks.
>> >
>> > Wei
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Difference between Lasso regression in MLlib package and ML package

2015-06-23 Thread DB Tsai
Please see the current version of the code for better documentation.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Jun 23, 2015 at 3:58 PM, DB Tsai  wrote:
> The regularization is handled after the objective function of data is
> computed. See 
> https://github.com/apache/spark/blob/6a827d5d1ec520f129e42c3818fe7d0d870dcbef/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
>  line 348 for L2.
>
> For L1, it's handled by OWLQN, so you don't see it explicitly, but the
> code is in line 128.
>
> Sincerely,
>
> DB Tsai
> --
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Tue, Jun 23, 2015 at 3:14 PM, Wei Zhou  wrote:
>> Hi DB Tsai,
>>
>> Thanks for your reply. I went through the source code of
>> LinearRegression.scala. The algorithm minimizes square error L = 1/2n ||A
>> weights - y||^2^. I cannot match this with the elasticNet loss function
>> found here http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html, which
>> is the sum of square error plus L1 and L2 penalty.
>>
>> I am able to follow the rest of the mathematical deviation in the code
>> comment. I am hoping if you could point me to any references that can fill
>> this knowledge gap.
>>
>> Best,
>> Wei
>>
>>
>>
>> 2015-06-19 12:35 GMT-07:00 DB Tsai :
>>>
>>> Hi Wei,
>>>
>>> I don't think ML is meant for single node computation, and the
>>> algorithms in ML are designed for pipeline framework.
>>>
>>> In short, the lasso regression in ML is new algorithm implemented from
>>> scratch, and it's faster, and converged to the same solution as R's
>>> glmnet but with scalability. Here is the talk I gave in Spark summit
>>> about the new elastic-net feature in ML. I will encourage you to try
>>> the one ML.
>>>
>>>
>>> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Blog: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Fri, Jun 19, 2015 at 11:38 AM, Wei Zhou  wrote:
>>> > Hi Spark experts,
>>> >
>>> > I see lasso regression/ elastic net implementation under both MLLib and
>>> > ML,
>>> > does anyone know what is the difference between the two implementation?
>>> >
>>> > In spark summit, one of the keynote speakers mentioned that ML is meant
>>> > for
>>> > single node computation, could anyone elaborate this?
>>> >
>>> > Thanks.
>>> >
>>> > Wei
>>
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: FW: MLLIB (Spark) Question.

2015-07-08 Thread DB Tsai
Hi Dhar,

Support for disabling the `standardization` feature has just been merged into master.

https://github.com/apache/spark/commit/57221934e0376e5bb8421dc35d4bf91db4deeca1

Let us know your feedback. Thanks.
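
A minimal usage sketch (my own illustration; it assumes a Spark build
that includes the commit above and a `training` DataFrame with "label"
and "features" columns):

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setStandardization(false)  // fit in the original feature space
  .setMaxIter(100)
  .setRegParam(0.1)
  .setElasticNetParam(0.0)

val model = lr.fit(training)

With standardization disabled, the regularization is applied directly to
the coefficients in the original feature space.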

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Jun 16, 2015 at 9:11 PM, Dhar Sauptik (CR/RTC1.3-NA)
 wrote:
> Hi DB,
>
> That will work too. I was just suggesting that as standardization is a simple 
> operation and could have been performed explicitly.
>
> Thank you for the replies.
>
> -Sauptik.
>
> -Original Message-
> From: DB Tsai [mailto:dbt...@dbtsai.com]
> Sent: Tuesday, June 16, 2015 9:04 PM
> To: Dhar Sauptik (CR/RTC1.3-NA)
> Cc: Ramakrishnan Naveen (CR/RTC1.3-NA); user@spark.apache.org
> Subject: Re: FW: MLLIB (Spark) Question.
>
> Hi Dhar,
>
> For "standardization", we can disable it effectively by using
> different regularization on each component. Thus, we're solving the
> same problem but having better rate of convergence. This is one of the
> features I will implement.
>
> Sincerely,
>
> DB Tsai
> --
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA)
>  wrote:
>> Hi DB,
>>
>> Thank you for the reply. The answers makes sense. I do have just one more 
>> point to add.
>>
>> Note that it may be better to not implicitly standardize the data. Agreed 
>> that a number of algorithms benefit from such standardization, but for many 
>> applications with contextual information such standardization "may" not be 
>> desirable.
>> Users can always perform the standardization themselves.
>>
>> However, that's just a suggestion. Again, thank you for the clarification.
>>
>> Thanks,
>> Sauptik.
>>
>>
>> -Original Message-
>> From: DB Tsai [mailto:dbt...@dbtsai.com]
>> Sent: Tuesday, June 16, 2015 2:49 PM
>> To: Dhar Sauptik (CR/RTC1.3-NA); Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Cc: user@spark.apache.org
>> Subject: Re: FW: MLLIB (Spark) Question.
>>
>> +cc user@spark.apache.org
>>
>> Reply inline.
>>
>> On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
>>  wrote:
>>> Hi DB,
>>>
>>> Thank you for the reply. That explains a lot.
>>>
>>> I however had a few points regarding this:-
>>>
>>> 1. Just to help with the debate of not regularizing the b parameter. A 
>>> standard implementation argues against regularizing the b parameter. See Pg 
>>> 64 para 1 :  http://statweb.stanford.edu/~tibs/ElemStatLearn/
>>>
>>
>> Agreed. We just worry about it will change behavior, but we actually
>> have a PR to change the behavior to standard one,
>> https://github.com/apache/spark/pull/6386
>>
>>> 2. Further, is the regularization of b also applicable for the SGD 
>>> implementation. Currently the SGD vs. BFGS implementations give different 
>>> results (and both the implementations don't match the IRLS algorithm). Are 
>>> the SGD/BFGS implemented for different loss functions? Can you please share 
>>> your thoughts on this.
>>>
>>
>> In SGD implementation, we don't "standardize" the dataset before
>> training. As a result, those columns with low standard deviation will
>> be penalized more, and those with high standard deviation will be
>> penalized less. Also, "standardize" will help the rate of convergence.
>> As a result, in most of package, they "standardize" the data
>> implicitly, and get the weights in the "standardized" space, and
>> transform back to original space so it's transparent for users.
>>
>> 1) LORWithSGD: No standardization, and penalize the intercept.
>> 2) LORWithLBFGS: With standardization but penalize the intercept.
>> 3) New LOR implementation: With standardization without penalizing the
>> intercept.
>>
>> As a result, only the new implementation in Spark ML handles
>> everything correctly. We have tests to verify that the results match
>> R.
>>
>>>
>>> @Naveen: Please feel free to add/comment on the above points as you see 
>>> necessary.
>>>
>>> Thanks,
>>> Sauptik.
>>>
>>> -Original Message-
>>> From: DB Tsai
>>> Sent: Tuesday, June 16, 2015 2:08 PM
>>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>>> Cc: Dhar Sauptik 

Re: Incomplete data when reading from S3

2016-03-19 Thread DB Tsai
You need to use wholeTextFiles to read the whole file at once. Otherwise,
it can be split.
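
A minimal sketch of that approach (my own illustration; the S3 path is a
placeholder and the JSON parsing is left to whatever parser you already
use):

// Each element of wholeTextFiles is (fileName, entire file content), so a
// record can never be cut at a split boundary. The trade-off is that every
// file must fit in the memory of a single task.
val lines = sc.wholeTextFiles("s3n://my-bucket/path/*.json")
  .flatMap { case (_, content) => content.split("\n") }
  .filter(_.trim.nonEmpty)
// ... then parse each line with your JSON library of choice.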

DB Tsai - Sent From My Phone
On Mar 17, 2016 12:45 AM, "Blaž Šnuderl"  wrote:

> Hi.
>
> We have json data stored in S3 (json record per line). When reading the
> data from s3 using the following code we started noticing json decode
> errors.
>
> sc.textFile(paths).map(json.loads)
>
>
> After a bit more investigation we noticed an incomplete line, basically
> the line was
>
>> {"key": "value", "key2":  <- notice the line abruptly ends with no json
>> close tag etc
>
>
> It is not an issue with our data and it doesn't happen very often, but it
> makes us very scared since it means spark could be dropping data.
>
> We are using spark 1.5.1. Any ideas why this happens and possible fixes?
>
> Regards,
> Blaž Šnuderl
>


Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-06 Thread DB Tsai
+1 for renaming the jar file.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Apr 5, 2016 at 8:02 PM, Chris Fregly  wrote:
> perhaps renaming to Spark ML would actually clear up code and documentation
> confusion?
>
> +1 for rename
>
> On Apr 5, 2016, at 7:00 PM, Reynold Xin  wrote:
>
> +1
>
> This is a no brainer IMO.
>
>
> On Tue, Apr 5, 2016 at 7:32 PM, Joseph Bradley 
> wrote:
>>
>> +1  By the way, the JIRA for tracking (Scala) API parity is:
>> https://issues.apache.org/jira/browse/SPARK-4591
>>
>> On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia 
>> wrote:
>>>
>>> This sounds good to me as well. The one thing we should pay attention to
>>> is how we update the docs so that people know to start with the spark.ml
>>> classes. Right now the docs list spark.mllib first and also seem more
>>> comprehensive in that area than in spark.ml, so maybe people naturally move
>>> towards that.
>>>
>>> Matei
>>>
>>> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
>>>
>>> Yes, DB (cc'ed) is working on porting the local linear algebra library
>>> over (SPARK-13944). There are also frequent pattern mining algorithms we
>>> need to port over in order to reach feature parity. -Xiangrui
>>>
>>> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman
>>>  wrote:
>>>>
>>>> Overall this sounds good to me. One question I have is that in
>>>> addition to the ML algorithms we have a number of linear algebra
>>>> (various distributed matrices) and statistical methods in the
>>>> spark.mllib package. Is the plan to port or move these to the spark.ml
>>>> namespace in the 2.x series ?
>>>>
>>>> Thanks
>>>> Shivaram
>>>>
>>>> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  wrote:
>>>> > FWIW, all of that sounds like a good plan to me. Developing one API is
>>>> > certainly better than two.
>>>> >
>>>> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng 
>>>> > wrote:
>>>> >> Hi all,
>>>> >>
>>>> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API
>>>> >> built
>>>> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based
>>>> >> API has
>>>> >> been developed under the spark.ml package, while the old RDD-based
>>>> >> API has
>>>> >> been developed in parallel under the spark.mllib package. While it
>>>> >> was
>>>> >> easier to implement and experiment with new APIs under a new package,
>>>> >> it
>>>> >> became harder and harder to maintain as both packages grew bigger and
>>>> >> bigger. And new users are often confused by having two sets of APIs
>>>> >> with
>>>> >> overlapped functions.
>>>> >>
>>>> >> We started to recommend the DataFrame-based API over the RDD-based
>>>> >> API in
>>>> >> Spark 1.5 for its versatility and flexibility, and we saw the
>>>> >> development
>>>> >> and the usage gradually shifting to the DataFrame-based API. Just
>>>> >> counting
>>>> >> the lines of Scala code, from 1.5 to the current master we added
>>>> >> ~1
>>>> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So,
>>>> >> to
>>>> >> gather more resources on the development of the DataFrame-based API
>>>> >> and to
>>>> >> help users migrate over sooner, I want to propose switching RDD-based
>>>> >> MLlib
>>>> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
>>>> >>
>>>> >> * We do not accept new features in the RDD-based spark.mllib package,
>>>> >> unless
>>>> >> they block implementing new features in the DataFrame-based spark.ml
>>>> >> package.
>>>> >> * We still accept bug fixes in the RDD-based API.
>>>> >> * We will add more features to the DataFrame-based API in the 2.x
>>>> >> series to
>>>> >> reach feature parity with the RDD-based API.
>>>> >> * Once we reach feature parity (possibly in Spark 2.2), we will
>>>> >> deprecate
>>>> >> the RDD-based API.
>>>> >> * We will remove the RDD-based API from the main Spark repo in Spark
>>>> >> 3.0.
>>>> >>
>>>> >> Though the RDD-based API is already in de facto maintenance mode,
>>>> >> this
>>>> >> announcement will make it clear and hence important to both MLlib
>>>> >> developers
>>>> >> and users. So we’d greatly appreciate your feedback!
>>>> >>
>>>> >> (As a side note, people sometimes use “Spark ML” to refer to the
>>>> >> DataFrame-based API or even the entire MLlib component. This also
>>>> >> causes
>>>> >> confusion. To be clear, “Spark ML” is not an official name and there
>>>> >> are no
>>>> >> plans to rename MLlib to “Spark ML” at this time.)
>>>> >>
>>>> >> Best,
>>>> >> Xiangrui
>>>> >
>>>> > -
>>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>>> >
>>>
>>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread DB Tsai
Try running the following to see the actual ulimit on the executors. We
found that Mesos overrides the ulimit, which causes the issue.

import sys.process._
// Run `ulimit -n` in a shell inside each of the 100 tasks and collect the
// values, so we see the limit that is actually in effect on the executors.
val p = 1 to 100
val rdd = sc.parallelize(p, 100)
val a = rdd.map(x => Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect




Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Thu, Oct 8, 2015 at 3:22 PM, Tian Zhang  wrote:

> I hit this issue with spark 1.3.0 stateful application (with
> updateStateByKey) function on mesos.  It will
> fail after running fine for about 24 hours.
> The error stack trace as below, I checked ulimit -n and we have very large
> numbers set on the machines.
> What else can be wrong?
> 15/09/27 18:45:11 WARN scheduler.TaskSetManager: Lost task 2.0 in stage
> 113727.0 (TID 833758, ip-10-112-10-221.ec2.internal):
> java.io.FileNotFoundException:
>
> /media/ephemeral0/oncue/mesos-slave/slaves/20150512-215537-2165010442-5050-1730-S5/frameworks/20150825-175705-2165010442-5050-13705-0338/executors/0/runs/19342849-d076-483c-88da-747896e19b93/./spark-6efa2dcd-aea7-478e-9fa9-6e0973578eb4/blockmgr-33b1e093-6dd6-4462-938c-2597516272a9/27/shuffle_535_2_0.index
> (Too many open files)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:221)
> at java.io.FileOutputStream.(FileOutputStream.java:171)
> at
>
> org.apache.spark.shuffle.IndexShuffleBlockManager.writeIndexFile(IndexShuffleBlockManager.scala:85)
> at
>
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:69)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-12 Thread DB Tsai
Hi Liu,

In ML, even after extracting the data into an RDD, the implementations in
MLlib and ML are quite different. Due to the legacy design, MLlib uses
Updater for handling regularization, and this layer of abstraction also
does adaptive step sizes, which only apply to SGD. In order to get it
working with LBFGS, some hacks were done here and there, and in Updater
all the components, including the intercept, are regularized, which is
not desirable in many cases. Also, in the legacy design it's hard for us
to do in-place standardization to improve the convergence rate. As a
result, at some point we decided to ditch those abstractions and
customize the optimization for each algorithm. (Even LiR and LoR use
different tricks to get better performance in the numerical optimization,
so it was hard to share code at the time. But I can see the point that we
have working code now, so it's time to try to refactor that code to share
more.)


Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Mon, Oct 12, 2015 at 1:24 AM, YiZhi Liu  wrote:

> Hi Joseph,
>
> Thank you for clarifying the motivation that you setup a different API
> for ml pipelines, it sounds great. But I still think we could extract
> some common parts of the training & inference procedures for ml and
> mllib. In ml.classification.LogisticRegression, you simply transform
> the DataFrame into RDD and follow the same procedures in
> mllib.optimization.{LBFGS,OWLQN}, right?
>
> My suggestion is, if I may, ml package should focus on the public API,
> and leave the underlying implementations, e.g. numerical optimization,
> to mllib package.
>
> Please let me know if my understanding has any problem. Thank you!
>
> 2015-10-08 1:15 GMT+08:00 Joseph Bradley :
> > Hi YiZhi Liu,
> >
> > The spark.ml classes are part of the higher-level "Pipelines" API, which
> > works with DataFrames.  When creating this API, we decided to separate it
> > from the old API to avoid confusion.  You can read more about it here:
> > http://spark.apache.org/docs/latest/ml-guide.html
> >
> > For (3): We use Breeze, but we have to modify it in order to do
> distributed
> > optimization based on Spark.
> >
> > Joseph
> >
> > On Tue, Oct 6, 2015 at 11:47 PM, YiZhi Liu  wrote:
> >>
> >> Hi everyone,
> >>
> >> I'm curious about the difference between
> >> ml.classification.LogisticRegression and
> >> mllib.classification.LogisticRegressionWithLBFGS. Both of them are
> >> optimized using LBFGS, the only difference I see is LogisticRegression
> >> takes DataFrame while LogisticRegressionWithLBFGS takes RDD.
> >>
> >> So I wonder,
> >> 1. Why not simply add a DataFrame training interface to
> >> LogisticRegressionWithLBFGS?
> >> 2. Whats the difference between ml.classification and
> >> mllib.classification package?
> >> 3. Why doesn't ml.classification.LogisticRegression call
> >> mllib.optimization.LBFGS / mllib.optimization.OWLQN directly? Instead,
> >> it uses breeze.optimize.LBFGS and re-implements most of the procedures
> >> in mllib.optimization.{LBFGS,OWLQN}.
> >>
> >> Thank you.
> >>
> >> Best,
> >>
> >> --
> >> Yizhi Liu
> >> Senior Software Engineer / Data Mining
> >> www.mvad.com, Shanghai, China
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
>
>
> --
> Yizhi Liu
> Senior Software Engineer / Data Mining
> www.mvad.com, Shanghai, China
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
LinearRegressionWithSGD is not stable. Please use the linear regression
in the ML package instead.
http://spark.apache.org/docs/latest/ml-linear-methods.html

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
 wrote:
> Dear All,
>
> I have some program as below which makes me very much confused and
> inscrutable, it is about multiple dimension linear regression mode, the
> weight / coefficient is always perfect while the dimension is smaller than
> 4, otherwise it is wrong all the time.
> Or, whether the LinearRegressionWithSGD would be selected for another one?
>
> public class JavaLinearRegression {
>   public static void main(String[] args) {
> SparkConf conf = new SparkConf().setAppName("Linear Regression
> Example");
> JavaSparkContext sc = new JavaSparkContext(conf);
> SQLContext jsql = new SQLContext(sc);
>
> //Ax = b, x = [1, 2, 3, 4] would be the only one output about weight
> //x1 + 2 * x2 + 3 * x3 + 4 * x4 = y would be the multiple linear mode
> List localTraining = Lists.newArrayList(
> new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
> new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
> new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
> new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
>
> JavaRDD training = sc.parallelize(localTraining).cache();
>
> // Building the model
> int numIterations = 1000; //the number could be reset large
> final LinearRegressionModel model =
> LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
>
> //the coefficient weights are perfect while dimension of LabeledPoint is
> SMALLER than 4.
> //otherwise the output is always wrong and inscrutable.
> //for instance, one output is
> //Final w:
> [2.537341836047772E25,-7.744333206289736E24,6.697875883454909E23,-2.6704705246777624E22]
> System.out.print("Final w: " + model.weights() + "\n\n");
>   }
> }
>
>   I would appreciate your kind help or guidance very much~~
>
> Thank you!
> Zhiliang
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [SPARK MLLIB] could not understand the wrong and inscrutable result of Linear Regression codes

2015-10-25 Thread DB Tsai
Column 4 is always constant, so it has no predictive power, which results in a zero weight.

On Sunday, October 25, 2015, Zhiliang Zhu  wrote:

> Hi DB Tsai,
>
> Thanks very much for your kind reply help.
>
> As for your comment, I just modified and tested the key part of the codes:
>
>  LinearRegression lr = new LinearRegression()
>.setMaxIter(1)
>.setRegParam(0)
>.setElasticNetParam(0);  //the number could be reset
>
>  final LinearRegressionModel model = lr.fit(training);
>
> Now the output is much reasonable, however, x4 is always 0 while
> repeatedly reset those parameters in lr , would you help some about it how
> to properly set the parameters ...
>
> Final w: [1.00127825909,1.99979185054,2.3307136,0.0]
>
> Thank you,
> Zhiliang
>
>
>
>
> On Monday, October 26, 2015 5:14 AM, DB Tsai  > wrote:
>
>
> LinearRegressionWithSGD is not stable. Please use linear regression in
> ML package instead.
> http://spark.apache.org/docs/latest/ml-linear-methods.html
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Sun, Oct 25, 2015 at 10:14 AM, Zhiliang Zhu
>  > wrote:
> > Dear All,
> >
> > I have some program as below which makes me very much confused and
> > inscrutable, it is about multiple dimension linear regression mode, the
> > weight / coefficient is always perfect while the dimension is smaller
> than
> > 4, otherwise it is wrong all the time.
> > Or, whether the LinearRegressionWithSGD would be selected for another
> one?
> >
> > public class JavaLinearRegression {
> >  public static void main(String[] args) {
> >SparkConf conf = new SparkConf().setAppName("Linear Regression
> > Example");
> >JavaSparkContext sc = new JavaSparkContext(conf);
> >SQLContext jsql = new SQLContext(sc);
> >
> >//Ax = b, x = [1, 2, 3, 4] would be the only one output about weight
> >//x1 + 2 * x2 + 3 * x3 + 4 * x4 = y would be the multiple linear mode
> >List localTraining = Lists.newArrayList(
> >new LabeledPoint(30.0, Vectors.dense(1.0, 2.0, 3.0, 4.0)),
> >new LabeledPoint(29.0, Vectors.dense(0.0, 2.0, 3.0, 4.0)),
> >new LabeledPoint(25.0, Vectors.dense(0.0, 0.0, 3.0, 4.0)),
> >new LabeledPoint(16.0, Vectors.dense(0.0, 0.0, 0.0, 4.0)));
> >
> >JavaRDD training =
> sc.parallelize(localTraining).cache();
> >
> >// Building the model
> >int numIterations = 1000; //the number could be reset large
> >final LinearRegressionModel model =
> > LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numIterations);
> >
> >//the coefficient weights are perfect while dimension of LabeledPoint
> is
> > SMALLER than 4.
> >//otherwise the output is always wrong and inscrutable.
> >//for instance, one output is
> >//Final w:
> >
> [2.537341836047772E25,-7.744333206289736E24,6.697875883454909E23,-2.6704705246777624E22]
> >System.out.print("Final w: " + model.weights() + "\n\n");
> >  }
> > }
> >
> >  I would appreciate your kind help or guidance very much~~
> >
> > Thank you!
> > Zhiliang
> >
> >
>
>
>

-- 
- DB

Sent from my iPhone


Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Interesting. For feature sub-sampling, is it per-node or per-tree? Do
you think you can implement a generic GBM and have it merged as part of
the Spark codebase?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
 wrote:
> Hi Spark User/Dev,
>
> Inspired by the success of XGBoost, I have created a Spark package for
> gradient boosting tree with 2nd order approximation of arbitrary
> user-defined loss functions.
>
> https://github.com/rotationsymmetry/SparkXGBoost
>
> Currently linear (normal) regression, binary classification, Poisson
> regression are supported. You can extend with other loss function as
> well.
>
> L1, L2, bagging, feature sub-sampling are also employed to avoid overfitting.
>
> Thank you for testing. I am looking forward to your comments and
> suggestions. Bugs or improvements can be reported through GitHub.
>
> Many thanks!
>
> Meihua
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Implementation of XGBoost

2015-10-26 Thread DB Tsai
Also, does it support categorical features?

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> you think you can implement generic GBM and have it merged as part of
> Spark codebase?
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>  wrote:
>> Hi Spark User/Dev,
>>
>> Inspired by the success of XGBoost, I have created a Spark package for
>> gradient boosting tree with 2nd order approximation of arbitrary
>> user-defined loss functions.
>>
>> https://github.com/rotationsymmetry/SparkXGBoost
>>
>> Currently linear (normal) regression, binary classification, Poisson
>> regression are supported. You can extend with other loss function as
>> well.
>>
>> L1, L2, bagging, feature sub-sampling are also employed to avoid overfitting.
>>
>> Thank you for testing. I am looking forward to your comments and
>> suggestions. Bugs or improvements can be reported through GitHub.
>>
>> Many thanks!
>>
>> Meihua
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
Hi Meihua,

For categorical features, the ordinal issue can be solved by trying all
2^(q-1) - 1 possible partitions of the q values into two groups. However,
that is computationally expensive. In Hastie's book, section 9.2.4, the
trees can be trained by sorting the categories by their residuals and
treating them as if they were ordered. It can be proven that this gives
the optimal split. I have a proof that this works for learning
regression trees through variance reduction.
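
A small single-machine sketch of that ordering trick (a hypothetical
helper written for illustration, not Spark code): sort the q categories
by their mean residual, after which only the q - 1 "prefix" splits of
that ordering need to be evaluated instead of all 2^(q-1) - 1 subsets.

// points: (categoryId, residual) pairs observed at the node being split.
def orderedCategoricalSplits(points: Seq[(Int, Double)]): Seq[Set[Int]] = {
  val orderedCats = points
    .groupBy(_._1)                                               // group by category
    .map { case (cat, xs) => (cat, xs.map(_._2).sum / xs.size) } // mean residual per category
    .toSeq
    .sortBy(_._2)                                                // order categories by the mean
    .map(_._1)
  // Each prefix of the ordered categories is one candidate "left" group.
  (1 until orderedCats.size).map(i => orderedCats.take(i).toSet)
}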

I'm also interested in understanding how the L1 and L2 regularization
within the boosting work (and whether they help with overfitting more
than shrinkage does).

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu  wrote:
> Hi DB Tsai,
>
> Thank you very much for your interest and comment.
>
> 1) feature sub-sample is per-node, like random forest.
>
> 2) The current code heavily exploits the tree structure to speed up
> the learning (such as processing multiple learning node in one pass of
> the training data). So a generic GBM is likely to be a different
> codebase. Do you have any nice reference of efficient GBM? I am more
> than happy to look into that.
>
> 3) The algorithm accept training data as a DataFrame with the
> featureCol indexed by VectorIndexer. You can specify which variable is
> categorical in the VectorIndexer. Please note that currently all
> categorical variables are treated as ordered. If you want some
> categorical variables as unordered, you can pass the data through
> OneHotEncoder before the VectorIndexer. I do have a plan to handle
> unordered categorical variable using the approach in RF in Spark ML
> (Please see roadmap in the README.md)
>
> Thanks,
>
> Meihua
>
>
>
> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
>> you think you can implement generic GBM and have it merged as part of
>> Spark codebase?
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>>
>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
>>  wrote:
>>> Hi Spark User/Dev,
>>>
>>> Inspired by the success of XGBoost, I have created a Spark package for
>>> gradient boosting tree with 2nd order approximation of arbitrary
>>> user-defined loss functions.
>>>
>>> https://github.com/rotationsymmetry/SparkXGBoost
>>>
>>> Currently linear (normal) regression, binary classification, Poisson
>>> regression are supported. You can extend with other loss function as
>>> well.
>>>
>>> L1, L2, bagging, feature sub-sampling are also employed to avoid 
>>> overfitting.
>>>
>>> Thank you for testing. I am looking forward to your comments and
>>> suggestions. Bugs or improvements can be reported through GitHub.
>>>
>>> Many thanks!
>>>
>>> Meihua
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Spark MLlib] about linear regression issue

2015-11-01 Thread DB Tsai
For constraints like all weights >= 0, people use LBFGS-B, which is
supported in our optimization library, Breeze.
https://github.com/scalanlp/breeze/issues/323

However, Spark's LiR implementation doesn't support constraints. I do
see this is useful given we're experimenting with SLIM: Sparse Linear
Methods for recommendation,
http://www-users.cs.umn.edu/~xning/papers/Ning2011c.pdf, which requires
all the weights to be positive (Eq. 3) to represent positive relations
between items.

In summary, it's possible and not difficult to add this constraint to
our current linear regression, but currently there is no open source
implementation in Spark.
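
For illustration only, here is a single-machine sketch of one simple way
to impose w_i >= 0: projected gradient descent for least squares, where
each gradient step is followed by clipping the weights back onto the
non-negative orthant. This is not Spark or Breeze code, and LBFGS-B
converges much faster; it just shows that the constraint itself is cheap
to enforce.

def nonNegativeLeastSquares(
    x: Array[Array[Double]],   // n rows, d features
    y: Array[Double],
    steps: Int = 1000,
    stepSize: Double = 1e-3): Array[Double] = {
  val n = x.length
  val d = x.head.length
  val w = Array.fill(d)(0.0)
  for (_ <- 0 until steps) {
    val grad = Array.fill(d)(0.0)
    for (i <- 0 until n) {
      val err = x(i).zip(w).map { case (a, b) => a * b }.sum - y(i)
      for (j <- 0 until d) grad(j) += err * x(i)(j) / n  // gradient of 1/(2n) * ||Xw - y||^2
    }
    for (j <- 0 until d) w(j) = math.max(0.0, w(j) - stepSize * grad(j))  // step + projection
  }
  w
}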

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Nov 1, 2015 at 9:22 AM, Zhiliang Zhu  wrote:
> Dear All,
>
> As for N dimension linear regression, while the labeled training points
> number (or the rank of the labeled point space) is less than N,
> then from perspective of math, the weight of the trained linear model may be
> not unique.
>
> However, the output of model.weight() by spark may be with some wi < 0. My
> issue is, is there some proper way only to get
> some specific output weight with all wi >= 0 ...
>
> Yes, the above goes same with the issue about solving linear system of
> equations, Aw = b, and r(A, b) = r(A) < columnNo(A), then w is
> with infinite solutions, but here only needs one solution with all wi >= 0.
> When there is only unique solution, both LR and SVD will work perfect.
>
> I will appreciate your all kind help very much~~
> Best Regards,
> Zhiliang
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: thought experiment: use spark ML to real time prediction

2015-11-11 Thread DB Tsai
Do you think it will be useful to separate those models and model
loader/writer code into another spark-ml-common jar without any spark
platform dependencies so users can load the models trained by Spark ML in
their application and run the prediction?


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando  wrote:

> As of now, we are basically serializing the ML model and then deserialize
> it for prediction at real time.
>
> On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase  wrote:
>
>> I don’t think this answers your question but here’s how you would
>> evaluate the model in realtime in a streaming app
>>
>> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>>
>> Maybe you can find a way to extract portions of MLLib and run them
>> outside of spark – loading the precomputed model and calling .predict on it…
>>
>> -adrian
>>
>> From: Andy Davidson
>> Date: Tuesday, November 10, 2015 at 11:31 PM
>> To: "user @spark"
>> Subject: thought experiment: use spark ML to real time prediction
>>
>> Lets say I have use spark ML to train a linear model. I know I can save
>> and load the model to disk. I am not sure how I can use the model in a real
>> time environment. For example I do not think I can return a “prediction” to
>> the client using spark streaming easily. Also for some applications the
>> extra latency created by the batch process might not be acceptable.
>>
>> If I was not using spark I would re-implement the model I trained in my
>> batch environment in a lang like Java  and implement a rest service that
>> uses the model to create a prediction and return the prediction to the
>> client. Many models make predictions using linear algebra. Implementing
>> predictions is relatively easy if you have a good vectorized LA package. Is
>> there a way to use a model I trained using spark ML outside of spark?
>>
>> As a motivating example, even if its possible to return data to the
>> client using spark streaming. I think the mini batch latency would not be
>> acceptable for a high frequency stock trading system.
>>
>> Kind regards
>>
>> Andy
>>
>> P.s. The examples I have seen so far use spark streaming to “preprocess”
>> predictions. For example a recommender system might use what current users
>> are watching to calculate “trending recommendations”. These are stored on
>> disk and served up to users when the use the “movie guide”. If a
>> recommendation was a couple of min. old it would not effect the end users
>> experience.
>>
>>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
I think the use-case can be quite different from PMML.

Having a Spark-platform-independent ML jar can empower users to do the
following:

1) PMML doesn't contain all the models we have in mllib. Also, for an ML
pipeline trained by Spark, most of the time PMML is not expressive enough
to do all the transformations we have in Spark ML. As a result, if we are
able to serialize the entire Spark ML pipeline after training, and then
load it back in an app without any Spark platform for production scoring,
this will be very useful for production deployment of Spark ML models.
The only issue will be if a transformer involves a shuffle; we need to
figure out a way to handle it. When I chatted with Xiangrui about this,
he suggested that we may tag whether a transformer is shuffle ready.
Currently, at Netflix, we are not able to use ML pipelines because of
those issues, and we have to write our own scorers in our production,
which is a lot of duplicated work.

2) If users can use Spark's linear algebra (vector and matrix) code in
their applications, this will be very useful. It can help to share code
between the Spark training pipeline and the production deployment. Also,
lots of good stuff in Spark's mllib doesn't depend on the Spark platform,
and people could use it in their applications without pulling in lots of
dependencies. In fact, in my project, I have to copy & paste code from
mllib to use those goodies in apps.

3) Currently, mllib depends on graphx, which means there is no way to
use mllib's vector or matrix inside graphx. At Netflix, we implemented
parallel personalized PageRank, which requires using a sparse vector as
part of the public API. We have to use Breeze here since there is no
access to mllib's basic types in graphx. Before we contribute it back to
the open source community, we need to address this.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Nov 12, 2015 at 3:42 AM, Sean Owen  wrote:

> This is all starting to sound a lot like what's already implemented in
> Java-based PMML parsing/scoring libraries like JPMML and OpenScoring. I'm
> not clear it helps a lot to reimplement this in Spark.
>
> On Thu, Nov 12, 2015 at 8:05 AM, Felix Cheung 
> wrote:
>
>> +1 on that. It would be useful to use the model outside of Spark.
>>
>>
>> _
>> From: DB Tsai 
>> Sent: Wednesday, November 11, 2015 11:57 PM
>> Subject: Re: thought experiment: use spark ML to real time prediction
>> To: Nirmal Fernando 
>> Cc: Andy Davidson , Adrian Tanase <
>> atan...@adobe.com>, user @spark 
>>
>>
>>
>> Do you think it will be useful to separate those models and model
>> loader/writer code into another spark-ml-common jar without any spark
>> platform dependencies so users can load the models trained by Spark ML in
>> their application and run the prediction?
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Web: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> On Wed, Nov 11, 2015 at 3:14 AM, Nirmal Fernando 
>> wrote:
>>
>>> As of now, we are basically serializing the ML model and then
>>> deserialize it for prediction at real time.
>>>
>>> On Wed, Nov 11, 2015 at 4:39 PM, Adrian Tanase 
>>> wrote:
>>>
>>>> I don’t think this answers your question but here’s how you would
>>>> evaluate the model in realtime in a streaming app
>>>>
>>>> https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html
>>>>
>>>> Maybe you can find a way to extract portions of MLLib and run them
>>>> outside of spark – loading the precomputed model and calling .predict on
>>>> it…
>>>>
>>>> -adrian
>>>>
>>>> From: Andy Davidson
>>>> Date: Tuesday, November 10, 2015 at 11:31 PM
>>>> To: "user @spark"
>>>> Subject: thought experiment: use spark ML to real time prediction
>>>>
>>>> Lets say I have use spark ML to train a linear model. I know I can save
>>>> and load the model to disk. I am not sure how I can use the model in a real
>>>> time environment. For example I do not think I can return a “prediction” to
>>>> the client using spark streaming easily. Also for some applications the
>>>> extra latency created by the batch process might not be acceptable.
>>>>
>>>> If I was not using spark I would re-implement the model I trained in my
>>>> batch environme

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
This will bring in the whole set of Spark dependencies, which may break the web app.


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando  wrote:

>
>
> On Fri, Nov 13, 2015 at 2:04 AM, darren  wrote:
>
>> I agree 100%. Making the model requires large data and many cpus.
>>
>> Using it does not.
>>
>> This is a very useful side effect of ML models.
>>
>> If mlib can't use models outside spark that's a real shame.
>>
>
> Well you can as mentioned earlier. You don't need Spark runtime for
> predictions, save the serialized model and deserialize to use. (you need
> the Spark Jars in the classpath though)
>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>> ---- Original message 
>> From: "Kothuvatiparambil, Viju" 
>>
>> Date: 11/12/2015 3:09 PM (GMT-05:00)
>> To: DB Tsai , Sean Owen 
>> Cc: Felix Cheung , Nirmal Fernando <
>> nir...@wso2.com>, Andy Davidson , Adrian
>> Tanase , "user @spark" ,
>> Xiangrui Meng , hol...@pigscanfly.ca
>> Subject: RE: thought experiment: use spark ML to real time prediction
>>
>> I am glad to see DB’s comments, make me feel I am not the only one facing
>> these issues. If we are able to use MLLib to load the model in web
>> applications (outside the spark cluster), that would have solved the
>> issue.  I understand Spark is manly for processing big data in a
>> distributed mode. But, there is no purpose in training a model using MLLib,
>> if we are not able to use it in applications where needs to access the
>> model.
>>
>>
>>
>> Thanks
>>
>> Viju
>>
>>
>>
>> *From:* DB Tsai [mailto:dbt...@dbtsai.com]
>> *Sent:* Thursday, November 12, 2015 11:04 AM
>> *To:* Sean Owen
>> *Cc:* Felix Cheung; Nirmal Fernando; Andy Davidson; Adrian Tanase; user
>> @spark; Xiangrui Meng; hol...@pigscanfly.ca
>> *Subject:* Re: thought experiment: use spark ML to real time prediction
>>
>>
>>
>> I think the use-case can be quick different from PMML.
>>
>>
>>
>> By having a Spark platform independent ML jar, this can empower users to
>> do the following,
>>
>>
>>
>> 1) PMML doesn't contain all the models we have in mllib. Also, for a ML
>> pipeline trained by Spark, most of time, PMML is not expressive enough to
>> do all the transformation we have in Spark ML. As a result, if we are able
>> to serialize the entire Spark ML pipeline after training, and then load
>> them back in app without any Spark platform for production scorning, this
>> will be very useful for production deployment of Spark ML models. The only
>> issue will be if the transformer involves with shuffle, we need to figure
>> out a way to handle it. When I chatted with Xiangrui about this, he
>> suggested that we may tag if a transformer is shuffle ready. Currently, at
>> Netflix, we are not able to use ML pipeline because of those issues, and we
>> have to write our own scorers in our production which is quite a duplicated
>> work.
>>
>>
>>
>> 2) If users can use Spark's linear algebra like vector or matrix code in
>> their application, this will be very useful. This can help to share code in
>> Spark training pipeline and production deployment. Also, lots of good stuff
>> at Spark's mllib doesn't depend on Spark platform, and people can use them
>> in their application without pulling lots of dependencies. In fact, in my
>> project, I have to copy & paste code from mllib into my project to use
>> those goodies in apps.
>>
>>
>>
>> 3) Currently, mllib depends on graphx which means in graphx, there is no
>> way to use mllib's vector or matrix. And
>>
>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


Re: thought experiment: use spark ML to real time prediction

2015-11-17 Thread DB Tsai
I was thinking about working on a better version of PMML, JMML in JSON,
but as you said, this requires a dedicated team to define the standard,
which would be a huge amount of work. However, options b) and c) still
don't address the distributed-models issue. In fact, most of the models
in production have to be small enough to return the result to users
within reasonable latency, so I doubt the usefulness of distributed
models in real production use-cases. For R and Python, we can build a
wrapper on top of the lightweight "spark-ml-common" project.


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Tue, Nov 17, 2015 at 2:29 AM, Nick Pentreath 
wrote:

> I think the issue with pulling in all of spark-core is often with
> dependencies (and versions) conflicting with the web framework (or Akka in
> many cases). Plus it really is quite heavy if you just want a fairly
> lightweight model-serving app. For example we've built a fairly simple but
> scalable ALS factor model server on Scalatra, Akka and Breeze. So all you
> really need is the web framework and Breeze (or an alternative linear
> algebra lib).
>
> I definitely hear the pain-point that PMML might not be able to handle
> some types of transformations or models that exist in Spark. However,
> here's an example from scikit-learn -> PMML that may be instructive (
> https://github.com/scikit-learn/scikit-learn/issues/1596 and
> https://github.com/jpmml/jpmml-sklearn), where a fairly impressive list
> of estimators and transformers are supported (including e.g. scaling and
> encoding, and PCA).
>
> I definitely think the current model I/O and "export" or "deploy to
> production" situation needs to be improved substantially. However, you are
> left with the following options:
>
> (a) build out a lightweight "spark-ml-common" project that brings in the
> dependencies needed for production scoring / transformation in independent
> apps. However, here you only support Scala/Java - what about R and Python?
> Also, what about the distributed models? Perhaps "local" wrappers can be
> created, though this may not work for very large factor or LDA models. See
> also H20 example http://docs.h2o.ai/h2oclassic/userguide/scorePOJO.html
>
> (b) build out Spark's PMML support, and add missing stuff to PMML where
> possible. The benefit here is an existing standard with various tools for
> scoring (via REST server, Java app, Pig, Hive, various language support).
>
> (c) build out a more comprehensive I/O, serialization and scoring
> framework. Here you face the issue of supporting various predictors and
> transformers generically, across platforms and versioning. i.e. you're
> re-creating a new standard like PMML
>
> Option (a) is do-able, but I'm a bit concerned that it may be too "Spark
> specific", or even too "Scala / Java" specific. But it is still potentially
> very useful to Spark users to build this out and have a somewhat standard
> production serving framework and/or library (there are obviously existing
> options like PredictionIO etc).
>
> Option (b) is really building out the existing PMML support within Spark,
> so a lot of the initial work has already been done. I know some folks had
> (or have) licensing issues with some components of JPMML (e.g. the
> evaluator and REST server). But perhaps the solution here is to build an
> Apache2-licensed evaluator framework.
>
> Option (c) is obviously interesting - "let's build a better PMML (that
> uses JSON or whatever instead of XML!)". But it also seems like a huge
> amount of reinventing the wheel, and like any new standard would take time
> to garner wide support (if at all).
>
> It would be really useful to start to understand what the main missing
> pieces are in PMML - perhaps the lowest-hanging fruit is simply to
> contribute improvements or additions to PMML.
>
>
>
> On Fri, Nov 13, 2015 at 11:46 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> That may not be an issue if the app using the models runs by itself (not
>> bundled into an existing app), which may actually be the right way to
>> design it considering separation of concerns.
>>
>> Regards
>> Sab
>>
>> On Fri, Nov 13, 2015 at 9:59 AM, DB Tsai  wrote:
>>
>>> This will bring the whole dependencies of spark will may break the web
>>> app.
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> --
>>> Web: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>&

Re: Spark LogisticRegression returns scaled coefficients

2015-11-17 Thread DB Tsai
How do you compute the probability given the weights? Also, given a
probability, you need to sample the positive and negative labels based
on that probability; how do you do this? I'm pretty sure that LoR will
give you the correct weights. Please see
generateMultinomialLogisticInput in
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
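
In other words, the label for each point should be drawn from the
logistic probability rather than thresholded. A minimal sketch (my own
illustration, not the linked test code):

val rng = new scala.util.Random(42)
val trueWeights = Array(2.0, 3.0, 4.0)
val points = (1 to 1000).map { _ =>
  val x = Array.fill(trueWeights.length)(rng.nextGaussian())
  val margin = trueWeights.zip(x).map { case (w, v) => w * v }.sum
  val p = 1.0 / (1.0 + math.exp(-margin))             // P(label = 1 | x)
  val label = if (rng.nextDouble() < p) 1.0 else 0.0  // sample, don't threshold
  (label, x)
}

If the labels are instead generated deterministically (label = 1 whenever
the margin is positive), the data are separable: the unregularized
maximum-likelihood weights are unbounded, and with regularization only
the direction of the weight vector is really pinned down, so a fit that
looks like a uniformly scaled version of the true weights is exactly what
one would expect.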

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Nov 17, 2015 at 4:11 PM, njoshi  wrote:
> I am testing the LogisticRegression performance on a synthetically generated
> data. The weights I have as input are
>
>w = [2, 3, 4]
>
> with no intercept and three features. After training on 1000 synthetically
> generated datapoint assuming random normal distribution for each, the Spark
> LogisticRegression model I obtain has weights as
>
>  [6.005520656096823,9.35980263762698,12.203400879214152]
>
> I can see that each weight is scaled by a factor close to '3' w.r.t. the
> original values. I am unable to guess the reason behind this. The code is
> simple enough as
>
>
> /*
>  * Logistic Regression model
>  */
> val lr = new LogisticRegression()
>   .setMaxIter(50)
>   .setRegParam(0.001)
>   .setElasticNetParam(0.95)
>   .setFitIntercept(false)
>
> val lrModel = lr.fit(trainingData)
>
>
> println(s"${lrModel.weights}")
>
>
>
> I would greatly appreciate if someone could shed some light on what's fishy
> here.
>
> with kind regards, Nikhil
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-LogisticRegression-returns-scaled-coefficients-tp25405.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: the way to compare any two adjacent elements in one rdd

2015-12-04 Thread DB Tsai
This is tricky. You need to shuffle the ending and beginning elements of
each partition using mapPartitionsWithIndex.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu  wrote:
> Hi All,
>
> I would like to compare any two adjacent elements in one given rdd, just as
> the single machine code part:
>
> int a[N] = {...};
> for (int i=0; i < N - 1; ++i) {
>compareFun(a[i], a[i+1]);
> }
> ...
>
> mapPartitions may work for some situations, however, it could not compare
> elements in different  partitions.
> foreach also seems not work.
>
> Thanks,
> Zhiliang
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: the way to compare any two adjacent elements in one rdd

2015-12-06 Thread DB Tsai
Only the beginning and ending elements of each partition need to be
shuffled. The rest can be compared within the partition without a shuffle.
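Here is a minimal sketch of the idea, assuming an RDD[Int] named rdd with
no empty partitions, and a placeholder compareFun standing in for your own
comparison logic. Only the first element of each partition is collected to
the driver; every other adjacent pair is compared locally, so parallelism
is preserved.

def compareFun(a: Int, b: Int): Int = a - b  // placeholder for your comparison

// First element of every partition, keyed by partition index.
val heads = rdd.mapPartitionsWithIndex { (idx, it) =>
  if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
}.collect().toMap
val bcHeads = rdd.sparkContext.broadcast(heads)

val results = rdd.mapPartitionsWithIndex { (idx, it) =>
  val elems = it.toArray  // buffers one partition; fine if partitions are modest
  // Adjacent pairs fully inside this partition.
  val local = elems.iterator.sliding(2).withPartial(false).map(p => compareFun(p(0), p(1)))
  // Boundary pair: last element here vs. first element of the next partition.
  val boundary = bcHeads.value.get(idx + 1) match {
    case Some(nextHead) if elems.nonEmpty => Iterator(compareFun(elems.last, nextHead))
    case _ => Iterator.empty
  }
  local ++ boundary
}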

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Dec 6, 2015 at 6:27 PM, Zhiliang Zhu  wrote:
>
>
>
>
> On Saturday, December 5, 2015 3:00 PM, DB Tsai  wrote:
>
>
> This is tricky. You need to shuffle the ending and beginning elements
> using mapPartitionWithIndex.
>
>
> Does this mean that I need to shuffle the all elements in different
> partitions into one partition, then compare them by way of any two adjacent
> elements?
> It seems good, if it is like that.
>
> One more issue, will it loss parallelism since there become only one
> partition ...
>
> Thanks very much in advance!
>
>
>
>
>
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu  wrote:
>> Hi All,
>>
>> I would like to compare any two adjacent elements in one given rdd, just
>> as
>> the single machine code part:
>>
>> int a[N] = {...};
>> for (int i=0; i < N - 1; ++i) {
>>compareFun(a[i], a[i+1]);
>> }
>> ...
>>
>> mapPartitions may work for some situations, however, it could not compare
>> elements in different  partitions.
>> foreach also seems not work.
>>
>> Thanks,
>> Zhiliang
>
>>
>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Could you paste some of your code for diagnosis?


Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev 
wrote:

> We are running Apache Spark 1.5.0 (latest code from 1.5 branch)
>
> We are running 2-3 LogisticRegression models in parallel (we'd love to run
> 10-20 actually), they are not really big at all, maybe 1-2 million rows in
> each model.
>
> Cluster itself, and all executors look good. Enough free memory and no
> exceptions or errors.
>
> However I see very strange behavior inside Spark driver. Allocated heap
> constantly growing. It grows up to 30 gigs in 1.5 hours and then everything
> becomes super sloow.
>
> We don't do any collect, and I really don't understand who is consuming
> all this memory. Looks like it's something inside LogisticRegression
> itself, however I only see treeAggregate which should not require so much
> memory to run.
>
> Any ideas?
>
> Plus I don't see any GC pause, looks like memory is still used by someone
> inside driver.
>
> [image: Inline image 2]
> [image: Inline image 1]
>


Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
Hi,

Recently, we ran into this notorious exception while doing a large
shuffle in Mesos at Netflix. We ensured that `ulimit -n` was set to a very
large number, but still had the issue.

It turns out that mesos overrides the `ulimit -n` to a small number
causing the problem. It's very non-trivial to debug (as logging in on
the slave gives the right ulimit - it's only in the mesos context that
it gets overridden).

Here is the code you can run in Spark shell to get the actual allowed
# of open files for Spark.

import sys.process._
val p = 1 to 100
val rdd = sc.parallelize(p, 100)
val openFiles = rdd.map(x => Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect

Hope this can help someone in the same situation.

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
Your code looks correct to me. How many features do you have in this
training set? How many tasks are running in the job?


Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Wed, Sep 23, 2015 at 4:38 PM, Eugene Zhulenev 
wrote:

> It's really simple: https://gist.github.com/ezhulenev/886517723ca4a353
>
> The same strange heap behavior we've seen even for single model, it takes
> ~20 gigs heap on a driver to build single model with less than 1 million
> rows in input data frame.
>
> On Wed, Sep 23, 2015 at 6:31 PM, DB Tsai  wrote:
>
>> Could you paste some of your code for diagnosis?
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>
>> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev <
>> eugene.zhule...@gmail.com> wrote:
>>
>>> We are running Apache Spark 1.5.0 (latest code from 1.5 branch)
>>>
>>> We are running 2-3 LogisticRegression models in parallel (we'd love to
>>> run 10-20 actually), they are not really big at all, maybe 1-2 million rows
>>> in each model.
>>>
>>> Cluster itself, and all executors look good. Enough free memory and no
>>> exceptions or errors.
>>>
>>> However I see very strange behavior inside Spark driver. Allocated heap
>>> constantly growing. It grows up to 30 gigs in 1.5 hours and then everything
>>> becomes super sloow.
>>>
>>> We don't do any collect, and I really don't understand who is consuming
>>> all this memory. Looks like it's something inside LogisticRegression
>>> itself, however I only see treeAggregate which should not require so much
>>> memory to run.
>>>
>>> Any ideas?
>>>
>>> Plus I don't see any GC pause, looks like memory is still used by
>>> someone inside driver.
>>>
>>> [image: Inline image 2]
>>> [image: Inline image 1]
>>>
>>
>>
>


Re: Debugging too many files open exception issue in Spark shuffle

2015-09-23 Thread DB Tsai
in  ./apps/mesos-0.22.1/sbin/mesos-daemon.sh

#!/usr/bin/env bash

prefix=/apps/mesos-0.22.1
exec_prefix=/apps/mesos-0.22.1

deploy_dir=${prefix}/etc/mesos

# Increase the default number of open file descriptors.
ulimit -n 8192


Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Wed, Sep 23, 2015 at 5:14 PM, java8964  wrote:
> That is interesting.
>
> I don't have any Mesos experience, but just want to know the reason why it
> does so.
>
> Yong
>
>> Date: Wed, 23 Sep 2015 15:53:54 -0700
>> Subject: Debugging too many files open exception issue in Spark shuffle
>> From: dbt...@dbtsai.com
>> To: user@spark.apache.org
>
>>
>> Hi,
>>
>> Recently, we ran into this notorious exception while doing large
>> shuffle in mesos at Netflix. We ensure that `ulimit -n` is a very
>> large number, but still have the issue.
>>
>> It turns out that mesos overrides the `ulimit -n` to a small number
>> causing the problem. It's very non-trivial to debug (as logging in on
>> the slave gives the right ulimit - it's only in the mesos context that
>> it gets overridden).
>>
>> Here is the code you can run in Spark shell to get the actual allowed
>> # of open files for Spark.
>>
>> import sys.process._
>> val p = 1 to 100
>> val rdd = sc.parallelize(p, 100)
>> val openFiles = rdd.map(x => Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect
>>
>> Hope this can help someone in the same situation.
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LogisticRegression models consumes all driver memory

2015-09-23 Thread DB Tsai
You want to reduce the number of partitions to around the number of
executors * cores. With so many tasks/partitions, there is a lot of
pressure on treeReduce in LoR. Let me know if this helps.
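A minimal sketch, using the cluster size mentioned in this thread (100
executors x 8 cores, both assumptions) and the lr/trainingData names from
your gist:

val targetPartitions = 100 * 8
val compacted = trainingData.coalesce(targetPartitions)  // coalesce avoids a full shuffle
compacted.cache()
val lrModel = lr.fit(compacted)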


Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Wed, Sep 23, 2015 at 5:39 PM, Eugene Zhulenev 
wrote:

> ~3000 features, pretty sparse, I think about 200-300 non zero features in
> each row. We have 100 executors x 8 cores. Number of tasks is pretty big,
> 30k-70k, can't remember exact number. Training set is a result of pretty
> big join from multiple data frames, but it's cached. However as I
> understand Spark still keeps DAG history of RDD to be able to recover it in
> case of failure of one of the nodes.
>
> I'll try tomorrow to save train set as parquet, load it back as DataFrame
> and run modeling this way.
>
> On Wed, Sep 23, 2015 at 7:56 PM, DB Tsai  wrote:
>
>> Your code looks correct for me. How many # of features do you have in
>> this training? How many tasks are running in the job?
>>
>>
>> Sincerely,
>>
>> DB Tsai
>> --
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D
>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>
>> On Wed, Sep 23, 2015 at 4:38 PM, Eugene Zhulenev <
>> eugene.zhule...@gmail.com> wrote:
>>
>>> It's really simple:
>>> https://gist.github.com/ezhulenev/886517723ca4a353
>>>
>>> The same strange heap behavior we've seen even for single model, it
>>> takes ~20 gigs heap on a driver to build single model with less than 1
>>> million rows in input data frame.
>>>
>>> On Wed, Sep 23, 2015 at 6:31 PM, DB Tsai  wrote:
>>>
>>>> Could you paste some of your code for diagnosis?
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> --
>>>> Blog: https://www.dbtsai.com
>>>> PGP Key ID: 0xAF08DF8D
>>>> <https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>
>>>>
>>>> On Wed, Sep 23, 2015 at 3:19 PM, Eugene Zhulenev <
>>>> eugene.zhule...@gmail.com> wrote:
>>>>
>>>>> We are running Apache Spark 1.5.0 (latest code from 1.5 branch)
>>>>>
>>>>> We are running 2-3 LogisticRegression models in parallel (we'd love to
>>>>> run 10-20 actually), they are not really big at all, maybe 1-2 million 
>>>>> rows
>>>>> in each model.
>>>>>
>>>>> Cluster itself, and all executors look good. Enough free memory and no
>>>>> exceptions or errors.
>>>>>
>>>>> However I see very strange behavior inside Spark driver. Allocated
>>>>> heap constantly growing. It grows up to 30 gigs in 1.5 hours and then
>>>>> everything becomes super sloow.
>>>>>
>>>>> We don't do any collect, and I really don't understand who is
>>>>> consuming all this memory. Looks like it's something inside
>>>>> LogisticRegression itself, however I only see treeAggregate which should
>>>>> not require so much memory to run.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Plus I don't see any GC pause, looks like memory is still used by
>>>>> someone inside driver.
>>>>>
>>>>> [image: Inline image 2]
>>>>> [image: Inline image 1]
>>>>>
>>>>
>>>>
>>>
>>
>


Re: a question about LBFGS in Spark

2016-08-24 Thread DB Tsai
Hi Lingling,

I think you haven't properly subscribed to the mailing list yet, so I'm
+cc'ing the list.

The mllib package is deprecated, and we no longer maintain it. The reason
it is designed this way is backward compatibility. In the original design,
the updater also contains the step-size logic, which LBFGS does not use.
In the code, we document the math and why this works.

/**
* It will return the gradient part of regularization using updater.
*
* Given the input parameters, the updater basically does the following,
*
* w' = w - thisIterStepSize * (gradient + regGradient(w))
* Note that regGradient is function of w
*
* If we set gradient = 0, thisIterStepSize = 1, then
*
* regGradient(w) = w - w'
*
* TODO: We need to clean it up by separating the logic of regularization out
* from updater to regularizer.
*/
// The following gradientTotal is actually the regularization part of gradient.
// Will add the gradientSum computed from the data with weights in the
next step.


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

>> On Wed, Aug 24, 2016 at 7:16 AM Lingling Li  wrote:
>>>
>>> Hi!
>>>
>>> Sorry for getting in touch. This is Ling Ling and I am now reading the
>>> LBFGS code in Spark.
>>>
>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala
>>>
>>> I find that you are one of the contributors, so may be you can help me
>>> out here? I appreciate it!
>>>
>>> In the CostFun:
>>> val regVal = updater.compute(w, Vectors.zeros(n), 0, 1, regParam)._2
>>> axpy(-1.0, updater.compute(w, Vectors.zeros(n), 1, 1, regParam)._1,
>>> gradientTotal)
>>>
>>> Why is the gradient in the updater being set as 0?And why the stepsize is
>>> 0 and 1 respectively?
>>>
>>> Thank you very much for your help!
>>>
>>> All the best,
>>> Ling Ling
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARK ML- Feature Selection Techniques

2016-09-05 Thread DB Tsai
You can try LOR with an L1 penalty; the features whose coefficients stay non-zero are effectively the selected ones.
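A minimal sketch with spark.ml (Spark 1.6+); the regParam value and the
trainingData DataFrame with "label"/"features" columns are placeholders:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setElasticNetParam(1.0)  // pure L1 penalty
  .setRegParam(0.01)        // larger values zero out more coefficients
val model = lr.fit(trainingData)

// Indices of features with non-zero coefficients, i.e. the "selected" features.
val selected = model.coefficients.toArray.zipWithIndex.filter(_._1 != 0.0).map(_._2)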

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Mon, Sep 5, 2016 at 5:31 AM, Bahubali Jain  wrote:
> Hi,
> Do we have any feature selection techniques implementation(wrapper
> methods,embedded methods) available in SPARK ML ?
>
> Thanks,
> Baahu
> --
> Twitter:http://twitter.com/Baahu
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is RankingMetrics' NDCG implementation correct?

2016-09-18 Thread DB Tsai
Hi Jong,

I think the definition from Kaggle is correct. I'm working on
implementing ranking metrics in Spark ML now, but the timeline is
unknown. Feel free to submit a PR for this in MLlib.
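For reference, a minimal, self-contained sketch of NDCG with graded
relevance (gain = 2^rel - 1, log2 position discount), matching the
definition referenced above; this is illustrative only, not the current
RankingMetrics code:

def dcg(relevances: Seq[Double]): Double =
  relevances.zipWithIndex.map { case (rel, i) =>
    (math.pow(2.0, rel) - 1.0) / (math.log(i + 2.0) / math.log(2.0))
  }.sum

def ndcgAt(k: Int, rankedRelevances: Seq[Double]): Double = {
  val ideal = dcg(rankedRelevances.sorted(Ordering[Double].reverse).take(k))
  if (ideal == 0.0) 0.0 else dcg(rankedRelevances.take(k)) / ideal
}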

Thanks.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Sep 18, 2016 at 8:42 PM, Jong Wook Kim  wrote:
> Hi,
>
> I'm trying to evaluate a recommendation model, and found that Spark and
> Rival give different results, and it seems that Rival's one is what Kaggle
> defines: https://gist.github.com/jongwook/5d4e78290eaef22cb69abbf68b52e597
>
> Am I using RankingMetrics in a wrong way, or is Spark's implementation
> incorrect?
>
> To my knowledge, NDCG should be dependent on the relevance (or preference)
> values, but Spark's implementation seems not; it uses 1.0 where it should be
> 2^(relevance) - 1, probably assuming that relevance is all 1.0? I also tried
> tweaking, but its method to obtain the ideal DCG also seems wrong.
>
> Any feedback from MLlib developers would be appreciated. I made a
> modified/extended version of RankingMetrics that produces the identical
> numbers to Kaggle and Rival's results, and I'm wondering if it is something
> appropriate to be added back to MLlib.
>
> Jong Wook

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: can mllib Logistic Regression package handle 10 million sparse features?

2016-10-05 Thread DB Tsai
With the latest code in the current master, we're successfully
training LOR using Spark ML's implementation with 14M sparse features.
You need to tune the depth of aggregation to make it efficient.
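A minimal sketch, assuming a Spark version whose ml.LogisticRegression
exposes setAggregationDepth; the depth value and the trainingData
DataFrame are placeholders:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setMaxIter(100)
  .setAggregationDepth(4)  // a deeper treeAggregate eases aggregation pressure for very wide models
val model = lr.fit(trainingData)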

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x9DCC1DBD7FC7BBB2


On Wed, Oct 5, 2016 at 12:00 PM, Yang  wrote:
> anybody had actual experience applying it to real problems of this scale?
>
> thanks
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why dataframe can be more efficient than dataset?

2017-04-13 Thread DB Tsai
There is a JIRA and a prototype that analyze the JVM bytecode of the
closure (the black box) and convert the closure into Catalyst expressions.

https://issues.apache.org/jira/browse/SPARK-14083

This could potentially address the issue discussed here.


Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0

On Sun, Apr 9, 2017 at 11:17 AM, Koert Kuipers  wrote:

> in this case there is no difference in performance. both will do the
> operation directly on the internal representation of the data (so the
> InternalRow).
>
> also it is worth pointing out that switching back and forth between
> Dataset[X] and DataFrame is free.
>
> On Sun, Apr 9, 2017 at 1:28 PM, Shiyuan  wrote:
>
>> Thank you for the detailed explanation!  You point out two reasons why
>> Dataset is not as efficeint as dataframe:
>> 1). Spark cannot look into lambda and therefore cannot optimize.
>> 2). The  type conversion  occurs under the hood, eg. from X to internal
>> row.
>>
>> Just to check my understanding,  some method of Dataset can also take sql
>> expression string  instead of lambda function, in this case, Is it  the
>> type conversion still happens under the hood and therefore Dataset is still
>> not as efficient as DataFrame.  Here is the code,
>>
>> //define a dataset and a dataframe, same content, but one is stored as
>> Dataset, the other is Dataset
>> scala> case class Person(name: String, age: Long)
>> scala> val ds = Seq(Person("A",32), Person("B", 18)).toDS
>> ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
>> scala> val df = Seq(Person("A",32), Person("B", 18)).toDF
>> df: org.apache.spark.sql.DataFrame = [name: string, age: bigint]
>>
>> //Which filtering is more efficient? both use sql expression string.
>> scala> df.filter("age < 20")
>> res7: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name:
>> string, age: bigint]
>>
>> scala> ds.filter("age < 20")
>> res8: org.apache.spark.sql.Dataset[Person] = [name: string, age: bigint]
>>
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Apr 8, 2017 at 7:22 PM, Koert Kuipers  wrote:
>>
>>> how would you use only relational transformations on dataset?
>>>
>>> On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan  wrote:
>>>
>>>> Hi Spark-users,
>>>> I came across a few sources which mentioned DataFrame can be more
>>>> efficient than Dataset.  I can understand this is true because Dataset
>>>> allows functional transformation which Catalyst cannot look into and hence
>>>> cannot optimize well. But can DataFrame be more efficient than Dataset even
>>>> if we only use the relational transformation on dataset? If so, can anyone
>>>> give some explanation why  it is so? Any benchmark comparing dataset vs.
>>>> dataframe?   Thank you!
>>>>
>>>> Shiyuan
>>>>
>>>
>>>
>>
>


Re: imbalance classe inside RANDOMFOREST CLASSIFIER

2017-05-05 Thread DB Tsai
We have instance weighting implemented in the linear models, but
unfortunately it's not implemented in the tree models yet. It's an
important feature, and PRs are welcome! Thanks.
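For the linear-model path that does exist today, here is a minimal sketch;
the "weight" column name and the 10.0 up-weighting factor are placeholders
you would tune for your class imbalance:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.functions.{col, when}

val weighted = trainingData.withColumn(
  "weight", when(col("label") === 1.0, 10.0).otherwise(1.0))  // up-weight the minority class

val lr = new LogisticRegression().setWeightCol("weight")
val model = lr.fit(weighted)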

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0x5CED8B896A6BDFA0


On Fri, May 5, 2017 at 12:58 AM, issues solution
 wrote:
> Hi ,
> in sicki-learn we have sample_weights option that allow us to create array
> to balacne class category
>
>
> By calling like that
>
> rf.fit(X,Y,sample_weights=[10 10 10 ...1 1 10 ])
>
> i 'am wondering if equivelent exist inside ml or mlib class ???
> if yes can i ask refrence or example
> thx for advance .
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ANNOUNCE] Announcing Apache Spark 2.4.1

2019-04-04 Thread DB Tsai
+user list

We are happy to announce the availability of Spark 2.4.1!

Apache Spark 2.4.1 is a maintenance release, based on the branch-2.4
maintenance branch of Spark. We strongly recommend all 2.4.0 users to
upgrade to this stable release.

In Apache Spark 2.4.1, Scala 2.12 support is GA, and it's no longer
experimental.
We will drop Scala 2.11 support in Spark 3.0, so please provide us feedback.

To download Spark 2.4.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-2-4-1.html

One more thing: to add a little color to this release, it's the
largest RC ever (RC9)!
We tried to incorporate many critical fixes at the last minute, and
hope you all enjoy it.

We would like to acknowledge all community members for contributing to
this release. This release would not have been possible without you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Release Apache Spark 2.4.4

2019-08-13 Thread DB Tsai
+1

On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Spark 2.4.3 was released three months ago (8th May).
> As of today (13th August), there are 112 commits (75 JIRAs) in `branch-24` 
> since 2.4.3.
>
> It would be great if we can have Spark 2.4.4.
> Shall we start `2.4.4 RC1` next Monday (19th August)?
>
> Last time, there was a request for K8s issue and now I'm waiting for 
> SPARK-27900.
> Please let me know if there is another issue.
>
> Thanks,
> Dongjoon.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: JDK11 Support in Apache Spark

2019-08-24 Thread DB Tsai
Congratulations on the great work!

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 42E5B25A8F7A82C1

On Sat, Aug 24, 2019 at 8:11 AM Dongjoon Hyun  wrote:
>
> Hi, All.
>
> Thanks to your many many contributions,
> Apache Spark master branch starts to pass on JDK11 as of today.
> (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)
>
> 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
> (JDK11 is used for building and testing.)
>
> We already verified all UTs (including PySpark/SparkR) before.
>
> Please feel free to use JDK11 in order to build/test/run `master` branch and
> share your experience including any issues. It will help Apache Spark 3.0.0 
> release.
>
> For the follow-ups, please follow 
> https://issues.apache.org/jira/browse/SPARK-24417 .
> The next step is `how to support JDK8/JDK11 together in a single artifact`.
>
> Bests,
> Dongjoon.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: foreachActive functionality

2015-01-25 Thread DB Tsai
PS, we were using Breeze's activeIterator originally, as you can see in
the old code, but we found there was some overhead, so we wrote our own
implementation, which turned out to be about 4x faster. See
https://github.com/apache/spark/pull/3288 for details.
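A minimal sketch of the unified code path: the same foreachActive call
works for both dense and sparse vectors, and for the sparse case only the
stored (index, value) pairs are visited:

import org.apache.spark.mllib.linalg.Vectors

val dense  = Vectors.dense(1.0, 0.0, 3.0)
val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

Seq(dense, sparse).foreach { v =>
  v.foreachActive { (index, value) =>
    if (value != 0.0) println(s"index=$index value=$value")
  }
}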

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai



On Sun, Jan 25, 2015 at 12:25 PM, Reza Zadeh  wrote:
> The idea is to unify the code path for dense and sparse vector operations,
> which makes the codebase easier to maintain. By handling (index, value)
> tuples, you can let the foreachActive method take care of checking if the
> vector is sparse or dense, and running a foreach over the values.
>
> On Sun, Jan 25, 2015 at 8:18 AM, kundan kumar  wrote:
>>
>> Can someone help me to understand the usage of "foreachActive"  function
>> introduced for the Vectors.
>>
>> I am trying to understand its usage in MultivariateOnlineSummarizer class
>> for summary statistics.
>>
>>
>> sample.foreachActive { (index, value) =>
>>   if (value != 0.0) {
>> if (currMax(index) < value) {
>>   currMax(index) = value
>> }
>> if (currMin(index) > value) {
>>   currMin(index) = value
>> }
>>
>> val prevMean = currMean(index)
>> val diff = value - prevMean
>> currMean(index) = prevMean + diff / (nnz(index) + 1.0)
>> currM2n(index) += (value - currMean(index)) * diff
>> currM2(index) += value * value
>> currL1(index) += math.abs(value)
>>
>> nnz(index) += 1.0
>>   }
>> }
>>
>> Regards,
>> Kundan
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LBGFS optimizer performace

2015-03-05 Thread DB Tsai
PS, I recommend compressing the data when you cache the RDD. There will
be some overhead in compression/decompression and
serialization/deserialization, but it helps a lot for iterative
algorithms by letting you cache more data.
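A minimal sketch (the Kryo choice and the input path are placeholders;
spark.rdd.compress only applies to serialized storage levels):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("lbfgs-training")
  .set("spark.rdd.compress", "true")  // compress serialized cached partitions
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val trainingData = sc.textFile("hdfs:///path/to/training")  // placeholder path
trainingData.persist(StorageLevel.MEMORY_ONLY_SER)          // serialized (and thus compressible) cache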

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, Mar 3, 2015 at 2:27 PM, Gustavo Enrique Salazar Torres
 wrote:
> Yeah, I can call count before that and it works. Also I was over caching
> tables but I removed those. Now there is no caching but it gets really slow
> since it calculates my table RDD many times.
> Also hacked the LBFGS code to pass the number of examples which I calculated
> outside in a Spark SQL query but just moved the location of the problem.
>
> The query I'm running looks like this:
>
> s"SELECT $mappedFields, tableB.id as b_id FROM tableA LEFT JOIN tableB  ON
> tableA.id=tableB.table_a_id WHERE tableA.status='' OR tableB.status='' "
>
> mappedFields contains a list of fields which I'm interested in. The result
> of that query goes through (including sampling) some transformations before
> being input to LBFGS.
>
> My dataset has 180GB just for feature selection, I'm planning to use 450GB
> to train the final model and I'm using 16 c3.2xlarge EC2 instances, that
> means I have 240GB of RAM available.
>
> Any suggestion? I'm starting to check the algorithm because I don't
> understand why it needs to count the dataset.
>
> Thanks
>
> Gustavo
>
> On Tue, Mar 3, 2015 at 6:08 PM, Joseph Bradley 
> wrote:
>>
>> Is that error actually occurring in LBFGS?  It looks like it might be
>> happening before the data even gets to LBFGS.  (Perhaps the outer join
>> you're trying to do is making the dataset size explode a bit.)  Are you able
>> to call count() (or any RDD action) on the data before you pass it to LBFGS?
>>
>> On Tue, Mar 3, 2015 at 8:55 AM, Gustavo Enrique Salazar Torres
>>  wrote:
>>>
>>> Just did with the same error.
>>> I think the problem is the "data.count()" call in LBFGS because for huge
>>> datasets that's naive to do.
>>> I was thinking to write my version of LBFGS but instead of doing
>>> data.count() I will pass that parameter which I will calculate from a Spark
>>> SQL query.
>>>
>>> I will let you know.
>>>
>>> Thanks
>>>
>>>
>>> On Tue, Mar 3, 2015 at 3:25 AM, Akhil Das 
>>> wrote:
>>>>
>>>> Can you try increasing your driver memory, reducing the executors and
>>>> increasing the executor memory?
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, Mar 3, 2015 at 10:09 AM, Gustavo Enrique Salazar Torres
>>>>  wrote:
>>>>>
>>>>> Hi there:
>>>>>
>>>>> I'm using LBFGS optimizer to train a logistic regression model. The
>>>>> code I implemented follows the pattern showed in
>>>>> https://spark.apache.org/docs/1.2.0/mllib-linear-methods.html but training
>>>>> data is obtained from a Spark SQL RDD.
>>>>> The problem I'm having is that LBFGS tries to count the elements in my
>>>>> RDD and that results in a OOM exception since my dataset is huge.
>>>>> I'm running on a AWS EMR cluster with 16 c3.2xlarge instances on Hadoop
>>>>> YARN. My dataset is about 150 GB but I sample (I take only 1% of the data)
>>>>> it in order to scale logistic regression.
>>>>> The exception I'm getting is this:
>>>>>
>>>>> 15/03/03 04:21:44 WARN scheduler.TaskSetManager: Lost task 108.0 in
>>>>> stage 2.0 (TID 7600, ip-10-155-20-71.ec2.internal):
>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>> at java.util.Arrays.copyOfRange(Arrays.java:2694)
>>>>> at java.lang.String.(String.java:203)
>>>>> at
>>>>> com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
>>>>> at
>>>>> com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
>>>>> at
>>>>> com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
>>>>> at
>>>>> com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
>>>>> at
>>>>> com.esotericsoftware.kryo.seriali

Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-15 Thread DB Tsai
In the LBFGS version of logistic regression, the data is properly
standardized, so this should not happen. Can you provide a copy of your
dataset so we can test it? If the dataset cannot be made public, can you
just send me a copy so I can dig into this? I'm the author of
LORWithLBFGS. Thanks.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Fri, Mar 13, 2015 at 2:41 PM, cjwang  wrote:
> I am running LogisticRegressionWithLBFGS.  I got these lines on my console:
>
> 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.5
> 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.25
> 2015-03-12 17:38:04,036 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.125
> 2015-03-12 17:38:04,105 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.0625
> 2015-03-12 17:38:04,176 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.03125
> 2015-03-12 17:38:04,247 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.015625
> 2015-03-12 17:38:04,317 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.0078125
> 2015-03-12 17:38:04,461 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.005859375
> 2015-03-12 17:38:04,605 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,672 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,747 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,818 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,890 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,962 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,038 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,107 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,186 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,256 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,257 ERROR breeze.optimize.LBFGS | Failure! Resetting
> history: breeze.optimize.FirstOrderException: Line search zoom failed
>
>
> What causes them and how do I fix them?
>
> I checked my data and there seemed nothing out of the ordinary.  The
> resulting prediction model seemed acceptable to me.  So, are these ERRORs
> actually WARNINGs?  Could we or should we tune the level of these messages
> down one notch?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/LogisticRegressionWithLBFGS-shows-ERRORs-tp22042.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to deploy binary dependencies to workers?

2015-03-24 Thread DB Tsai
I would recommend uploading those jars to HDFS and using the --jars
option of spark-submit with HDFS URIs instead of local filesystem URIs.
This avoids fetching the jars from the driver, which can be a bottleneck.
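A minimal sketch with placeholder paths and class names; jars referenced
by an HDFS URI are fetched by the executors directly from HDFS rather
than being served from the driver:

hadoop fs -put dep1.jar dep2.jar hdfs:///libs/
spark-submit \
  --class com.example.MyApp \
  --master spark://master:7077 \
  --jars hdfs:///libs/dep1.jar,hdfs:///libs/dep2.jar \
  myapp.jar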

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen  wrote:
> Hi,
>
> I am doing ML using Spark mllib. However, I do not have full control to the
> cluster. I am using Microsoft Azure HDInsight
>
> I want to deploy the BLAS or whatever required dependencies to accelerate
> the computation. But I don't know how to deploy those DLLs when I submit my
> JAR to the cluster.
>
> I know how to pack those DLLs into a jar. The real challenge is how to let
> the system find them...
>
>
> Thanks,
> David
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Can LBFGS be used on streaming data?

2015-03-25 Thread DB Tsai
Hi Arunkumar,

I think L-BFGS will not work, since the L-BFGS algorithm assumes that the
objective function stays the same (i.e., the data is the same) for the
entire optimization process in order to construct the approximated
Hessian matrix. In the streaming case, the data keeps changing, which
will cause problems for the algorithm.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Mon, Mar 16, 2015 at 3:19 PM, EcoMotto Inc.  wrote:
> Hello,
>
> I am new to spark streaming API.
>
> I wanted to ask if I can apply LBFGS (with LeastSquaresGradient) on
> streaming data? Currently I am using forecahRDD for parsing through DStream
> and I am generating a model based on each RDD. Am I doing anything logically
> wrong here?
> Thank you.
>
> Sample Code:
>
> val algorithm = new LBFGS(new LeastSquaresGradient(), new SimpleUpdater())
> var initialWeights =
> Vectors.dense(Array.fill(numFeatures)(scala.util.Random.nextDouble()))
> var isFirst = true
> var model = new LinearRegressionModel(null,1.0)
>
> parsedData.foreachRDD{rdd =>
>   if(isFirst) {
> val weights = algorithm.optimize(rdd, initialWeights)
> val w = weights.toArray
> val intercept = w.head
> model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
> isFirst = false
>   }else{
> var ab = ArrayBuffer[Double]()
> ab.insert(0, model.intercept)
> ab.appendAll( model.weights.toArray)
> print("Intercept = "+model.intercept+" :: modelWeights =
> "+model.weights)
> initialWeights = Vectors.dense(ab.toArray)
> print("Initial Weights: "+ initialWeights)
> val weights = algorithm.optimize(rdd, initialWeights)
> val w = weights.toArray
> val intercept = w.head
> model = new LinearRegressionModel(Vectors.dense(w.drop(1)), intercept)
>   }
>
>
>
> Best Regards,
> Arunkumar

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to deploy binary dependencies to workers?

2015-03-25 Thread DB Tsai
Are you deploying the Windows DLLs to a Linux machine?

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Wed, Mar 25, 2015 at 3:57 AM, Xi Shen  wrote:
> I think you meant to use the "--files" to deploy the DLLs. I gave a try, but
> it did not work.
>
> From the Spark UI, Environment tab, I can see
>
> spark.yarn.dist.files
>
> file:/c:/openblas/libgcc_s_seh-1.dll,file:/c:/openblas/libblas3.dll,file:/c:/openblas/libgfortran-3.dll,file:/c:/openblas/liblapack3.dll,file:/c:/openblas/libquadmath-0.dll
>
> I think my DLLs are all deployed. But I still got the warn message that
> native BLAS library cannot be load.
>
> And idea?
>
>
> Thanks,
> David
>
>
> On Wed, Mar 25, 2015 at 5:40 AM DB Tsai  wrote:
>>
>> I would recommend to upload those jars to HDFS, and use add jars
>> option in spark-submit with URI from HDFS instead of URI from local
>> filesystem. Thus, it can avoid the problem of fetching jars from
>> driver which can be a bottleneck.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> Blog: https://www.dbtsai.com
>>
>>
>> On Tue, Mar 24, 2015 at 4:13 AM, Xi Shen  wrote:
>> > Hi,
>> >
>> > I am doing ML using Spark mllib. However, I do not have full control to
>> > the
>> > cluster. I am using Microsoft Azure HDInsight
>> >
>> > I want to deploy the BLAS or whatever required dependencies to
>> > accelerate
>> > the computation. But I don't know how to deploy those DLLs when I submit
>> > my
>> > JAR to the cluster.
>> >
>> > I know how to pack those DLLs into a jar. The real challenge is how to
>> > let
>> > the system find them...
>> >
>> >
>> > Thanks,
>> > David
>> >

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: LogisticRegressionWithLBFGS shows ERRORs

2015-03-25 Thread DB Tsai
We fixed a couple of issues in the Breeze LBFGS implementation. Can you
try Spark 1.3 and see if they still exist? Thanks.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Mon, Mar 16, 2015 at 12:48 PM, Chang-Jia Wang  wrote:
> I just used random numbers.
>
> (My ML lib was spark-mllib_2.10-1.2.1)
>
> Please see the attached log.  In the middle of the log, I dumped the data
> set before feeding into LogisticRegressionWithLBFGS.  The first column
> false/true was the label (attribute “a”), and columns 2-5 (attributes “x”,
> “y”, “z”, and “i”) were the features.  The 6th column was just row ID and
> was not used.
>
> The relationship was arbitrarily: a = (0.3 * x + 0.5 * y - 0.2 *z > 0.4)
>
> After that you can find LBFGS was doing its job and then pumped out the
> error messages.
>
> The model showed coefficients:
>
> 396.57624765427323, x
> 662.7969020937115, y
> -259.0975519038385, z
> 12.352037503257826, i
> -538.8516249699426, @a
>
> The last one was the intercept.  As you can see, the model seemed close
> enough.
>
> After that I fed the same data back to the model to see how the predictions
> worked.   (here attribute “a” was the prediction and “aa” was the original
> label)  I only displayed 20 rows.
>
> The error rate showed 2 errors out of 1000.
>
> count(INTEGER), errorRate(DOUBLE), countDiff(INTEGER)
> key=[], rows=1
> 1000, 0.002000949949026, 2
>
> So, the algorithm worked, just spitting out the errors was kind of annoying.
> If this is not result affecting, maybe it should be warning or info.
>
> C.J.
>
>
>
>
>
>
>
> On Mar 15, 2015, at 12:42 AM, DB Tsai  wrote:
>
> In LBFGS version of logistic regression, the data is properly
> standardized, so this should not happen. Can you provide a copy of
> your dataset to us so we can test it? If the dataset can not be
> public, can you have just send me a copy so I can dig into this? I'm
> the author of LORWithLBFGS. Thanks.
>
> Sincerely,
>
> DB Tsai
> ---
> Blog: https://www.dbtsai.com
>
>
> On Fri, Mar 13, 2015 at 2:41 PM, cjwang  wrote:
>
> I am running LogisticRegressionWithLBFGS.  I got these lines on my console:
>
> 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.5
> 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.25
> 2015-03-12 17:38:04,036 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to 0.125
> 2015-03-12 17:38:04,105 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.0625
> 2015-03-12 17:38:04,176 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.03125
> 2015-03-12 17:38:04,247 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.015625
> 2015-03-12 17:38:04,317 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.0078125
> 2015-03-12 17:38:04,461 ERROR breeze.optimize.StrongWolfeLineSearch |
> Encountered bad values in function evaluation. Decreasing step size to
> 0.005859375
> 2015-03-12 17:38:04,605 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,672 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,747 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,818 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,890 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:04,962 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,038 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,107 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,186 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03-12 17:38:05,256 INFO breeze.optimize.StrongWolfeLineSearch | Line
> search t: NaN fval: NaN rhs: NaN cdd: NaN
> 2015-03

Re: Features scaling

2015-04-21 Thread DB Tsai
Hi Denys,

I don't see any issue in your Python code, so maybe there is a bug in the
Python wrapper. If you write it in Scala, I think it should work. BTW,
LogisticRegressionWithLBFGS does the standardization internally, so you
don't need to do it yourself. It's worth giving it a try!

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, Apr 21, 2015 at 1:00 AM, Denys Kozyr  wrote:
> Hi!
>
> I want to normalize features before train logistic regression. I setup scaler:
>
> scaler2 = StandardScaler(withMean=True, withStd=True).fit(features)
>
> and apply it to a dataset:
>
> scaledData = dataset.map(lambda x: LabeledPoint(x.label,
> scaler2.transform(Vectors.dense(x.features.toArray() 
>
> but I can't work with scaledData (can't output it or train regression
> on it), got an error:
>
> Exception: It appears that you are attempting to reference SparkContext from 
> a b
> roadcast variable, action, or transforamtion. SparkContext can only be used 
> on t
> he driver, not in code that it run on workers. For more information, see 
> SPARK-5
> 063.
>
> Does it correct code to make normalization? Why it doesn't work?
> Any advices are welcome.
> Thanks.
>
> Full code:
> https://gist.github.com/dkozyr/d31551a3ebed0ee17772
>
> Console output:
> https://gist.github.com/dkozyr/199f0d4f44cf522f9453
>
> Denys
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Multiclass classification using Ml logisticRegression

2015-04-29 Thread DB Tsai
Wrapping the old LogisticRegressionWithLBFGS could be a quick solution
for 1.4, and it's not too hard, so it could potentially get into 1.4. In
the long term, I would like to implement a new version like
https://github.com/apache/spark/commit/6a827d5d1ec520f129e42c3818fe7d0d870dcbef
which handles the scaling and intercepts implicitly in the objective
function, so there is no overhead of creating a new transformed dataset.
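In the meantime, a minimal sketch of the existing mllib workaround,
assuming an RDD[LabeledPoint] named trainingRDD with labels in
{0, ..., numClasses - 1}:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val numClasses = 3  // placeholder
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(numClasses)
  .run(trainingRDD)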

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Wed, Apr 29, 2015 at 1:21 AM, selim namsi  wrote:
> Thank you for your Answer!
> Yes I would like to work on it.
>
> Selim
>
> On Mon, Apr 27, 2015 at 5:23 AM Joseph Bradley 
> wrote:
>>
>> Unfortunately, the Pipelines API doesn't have multiclass logistic
>> regression yet, only binary.  It's really a matter of modifying the current
>> implementation; I just added a JIRA for it:
>> https://issues.apache.org/jira/browse/SPARK-7159
>>
>> You'll need to use the old LogisticRegression API to do multiclass for
>> now, until that JIRA gets completed.  (If you're interested in doing it, let
>> me know via the JIRA!)
>>
>> Joseph
>>
>> On Fri, Apr 24, 2015 at 3:26 AM, Selim Namsi 
>> wrote:
>>>
>>> Hi,
>>>
>>> I just started using spark ML pipeline to implement a multiclass
>>> classifier
>>> using LogisticRegressionWithLBFGS (which accepts as a parameters number
>>> of
>>> classes), I followed the Pipeline example in ML- guide and I used
>>> LogisticRegression class which calls LogisticRegressionWithLBFGS class :
>>>
>>> val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
>>>
>>> the problem is that LogisticRegression doesn't take numClasses as
>>> parameters
>>>
>>> Any idea how to solve this problem?
>>>
>>> Thanks
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Multiclass-classification-using-Ml-logisticRegression-tp22644.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Multilabel Classification in spark

2015-05-05 Thread DB Tsai
LogisticRegression in the MLlib package supports multilabel classification.

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com


On Tue, May 5, 2015 at 1:13 PM, peterg  wrote:
> Hi all,
>
> I'm looking to implement a Multilabel classification algorithm but I am
> surprised to find that there are not any in the spark-mllib core library. Am
> I missing something? Would someone point me in the right direction?
>
> Thanks!
>
> Peter
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Logistic Regression MLLib Slow

2014-06-04 Thread DB Tsai
Hi Krishna,

Also, the default optimizer with SGD converges really slowly. If you are
willing to write Scala code, there is a full working example for training
logistic regression with L-BFGS (a quasi-Newton method) in Scala. It
converges way faster than SGD.

See
http://spark.apache.org/docs/latest/mllib-optimization.html
for detail.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng  wrote:
> Hi Krishna,
>
> Specifying executor memory in local mode has no effect, because all of
> the threads run inside the same JVM. You can either try
> --driver-memory 60g or start a standalone server.
>
> Best,
> Xiangrui
>
> On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng  wrote:
>> 80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
>> take that long, even on a single executor. Besides what Matei
>> suggested, could you also verify the executor memory in
>> http://localhost:4040 in the Executors tab. It is very likely the
>> executors do not have enough memory. In that case, caching may be
>> slower than reading directly from disk. -Xiangrui
>>
>> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia  
>> wrote:
>>> Ah, is the file gzipped by any chance? We can’t decompress gzipped files in
>>> parallel so they get processed by a single task.
>>>
>>> It may also be worth looking at the application UI (http://localhost:4040)
>>> to see 1) whether all the data fits in memory in the Storage tab (maybe it
>>> somehow becomes larger, though it seems unlikely that it would exceed 20 GB)
>>> and 2) how many parallel tasks run in each iteration.
>>>
>>> Matei
>>>
>>> On Jun 4, 2014, at 6:56 PM, Srikrishna S  wrote:
>>>
>>> I am using the MLLib one (LogisticRegressionWithSGD)  with PySpark. I am
>>> running to only 10 iterations.
>>>
>>> The MLLib version of logistic regression doesn't seem to use all the cores
>>> on my machine.
>>>
>>> Regards,
>>> Krishna
>>>
>>>
>>>
>>> On Wed, Jun 4, 2014 at 6:47 PM, Matei Zaharia 
>>> wrote:
>>>>
>>>> Are you using the logistic_regression.py in examples/src/main/python or
>>>> examples/src/main/python/mllib? The first one is an example of writing
>>>> logistic regression by hand and won’t be as efficient as the MLlib one. I
>>>> suggest trying the MLlib one.
>>>>
>>>> You may also want to check how many iterations it runs — by default I
>>>> think it runs 100, which may be more than you need.
>>>>
>>>> Matei
>>>>
>>>> On Jun 4, 2014, at 5:47 PM, Srikrishna S  wrote:
>>>>
>>>> > Hi All.,
>>>> >
>>>> > I am new to Spark and I am trying to run LogisticRegression (with SGD)
>>>> > using MLLib on a beefy single machine with about 128GB RAM. The dataset 
>>>> > has
>>>> > about 80M rows with only 4 features so it barely occupies 2Gb on disk.
>>>> >
>>>> > I am running the code using all 8 cores with 20G memory using
>>>> > spark-submit --executor-memory 20G --master local[8]
>>>> > logistic_regression.py
>>>> >
>>>> > It seems to take about 3.5 hours without caching and over 5 hours with
>>>> > caching.
>>>> >
>>>> > What is the recommended use for Spark on a beefy single machine?
>>>> >
>>>> > Any suggestions will help!
>>>> >
>>>> > Regards,
>>>> > Krishna
>>>> >
>>>> >
>>>> > Code sample:
>>>> >
>>>> > -
>>>> > # Dataset
>>>> > d = sys.argv[1]
>>>> > data = sc.textFile(d)
>>>> >
>>>> > # Load and parse the data
>>>> > #
>>>> > --
>>>> > def parsePoint(line):
>>>> > values = [float(x) for x in line.split(',')]
>>>> > return LabeledPoint(values[0], values[1:])
>>>> > _parsedData = data.map(parsePoint)
>>>> > parsedData = _parsedData.cache()
>>>> > results = {}
>>>> >
>>>> > # Spark
>>>> > #
>>>> > --
>>>> > start_time = time.time()
>>>> > # Build the gl_model
>>>> > niters = 10
>>>> > spark_model = LogisticRegressionWithSGD.train(parsedData,
>>>> > iterations=niters)
>>>> >
>>>> > # Evaluate the gl_model on training data
>>>> > labelsAndPreds = parsedData.map(lambda p: (p.label,
>>>> > spark_model.predict(p.features)))
>>>> > trainErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
>>>> > float(parsedData.count())
>>>> >
>>>>
>>>
>>>


Re: Logistic Regression MLLib Slow

2014-06-04 Thread DB Tsai
Hi Krishna,

It should work, and we use it in production with great success. However,
the constructor of LogisticRegressionModel is private[mllib], so you have
to write your own code with a package name under org.apache.spark.mllib,
rather than using the Scala console.
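A minimal sketch of the package trick, compiled into your own jar (the
object name is arbitrary, it assumes the intercept is the last entry of
the weight vector as in your snippet, and the exact constructor
visibility is version specific):

package org.apache.spark.mllib.classification

import org.apache.spark.mllib.linalg.{Vector, Vectors}

object LogisticRegressionModelBuilder {
  // Visible to private[mllib] members because this lives under org.apache.spark.mllib.
  def fromWeights(weightsWithIntercept: Vector): LogisticRegressionModel = {
    val arr = weightsWithIntercept.toArray
    new LogisticRegressionModel(Vectors.dense(arr.dropRight(1)), arr.last)
  }
}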

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Jun 4, 2014 at 11:47 PM, Srikrishna S  wrote:
> Does L-BFSG work with spark 1.0? (see code sample below).
>
> Eventually, I would like to have L-BFGS working but I was facing an issue
> where 10 passes over the data was taking forever. I ran spark in standalone
> mode and the performance is much better!
>
> Regards,
> Krishna
>
> 
>
> I am using http://spark.apache.org/docs/latest/mllib-optimization.html
>
> scala> val model = new LogisticRegressionModel(
>
>   Vectors.dense(weightsWithIntercept.toArray.slice(0,
> weightsWithIntercept.size - 1)),
>
>   weightsWithIntercept(weightsWithIntercept.size - 1))
>
>
> val model = new LogisticRegressionModel(
>
>  |   Vectors.dense(weightsWithIntercept.toArray.slice(0,
> weightsWithIntercept.size - 1)),
>
>  |   weightsWithIntercept(weightsWithIntercept.size - 1))
>
> :20: error: constructor LogisticRegressionModel in class
> LogisticRegressionModel cannot be accessed in class $iwC
>
>val model = new LogisticRegressionModel(
>
> Based on the documentation, it would seem like LogisticRegressionModel
> doesn't have a constructor:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel
>
> LogisticRegression *does* have a constructor:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>
>
>
> On Wed, Jun 4, 2014 at 11:33 PM, DB Tsai  wrote:
>>
>> Hi Krishna,
>>
>> Also, the default optimizer with SGD converges really slow. If you are
>> willing to write scala code, there is a full working example for
>> training Logistic Regression with L-BFGS (a quasi-Newton method) in
>> scala. It converges a way faster than SGD.
>>
>> See
>> http://spark.apache.org/docs/latest/mllib-optimization.html
>> for detail.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng  wrote:
>> > Hi Krishna,
>> >
>> > Specifying executor memory in local mode has no effect, because all of
>> > the threads run inside the same JVM. You can either try
>> > --driver-memory 60g or start a standalone server.
>> >
>> > Best,
>> > Xiangrui
>> >
>> > On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng  wrote:
>> >> 80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
>> >> take that long, even on a single executor. Besides what Matei
>> >> suggested, could you also verify the executor memory in
>> >> http://localhost:4040 in the Executors tab. It is very likely the
>> >> executors do not have enough memory. In that case, caching may be
>> >> slower than reading directly from disk. -Xiangrui
>> >>
>> >> On Wed, Jun 4, 2014 at 7:06 PM, Matei Zaharia 
>> >> wrote:
>> >>> Ah, is the file gzipped by any chance? We can’t decompress gzipped
>> >>> files in
>> >>> parallel so they get processed by a single task.
>> >>>
>> >>> It may also be worth looking at the application UI
>> >>> (http://localhost:4040)
>> >>> to see 1) whether all the data fits in memory in the Storage tab
>> >>> (maybe it
>> >>> somehow becomes larger, though it seems unlikely that it would exceed
>> >>> 20 GB)
>> >>> and 2) how many parallel tasks run in each iteration.
>> >>>
>> >>> Matei
>> >>>
>> >>> On Jun 4, 2014, at 6:56 PM, Srikrishna S 
>> >>> wrote:
>> >>>
>> >>> I am using the MLLib one (LogisticRegressionWithSGD)  with PySpark. I
>> >>> am
>> >>> running to only 10 iterations.
>> >>>
>> >>> The MLLib version 

Re: Gradient Descent with MLBase

2014-06-07 Thread DB Tsai
Hi Aslan,

You can check out the unit test code for GradientDescent.runMiniBatchSGD:

https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/optimization/GradientDescentSuite.scala
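As a rough guide, here is a minimal sketch of calling runMiniBatchSGD
directly from the Spark shell (parameter order as in the Spark 1.x MLlib
API; please check the exact signature in your version). The tiny inline
dataset, step size, iteration count, regParam and miniBatchFraction are
all placeholder choices:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{GradientDescent, LogisticGradient, SquaredL2Updater}
import org.apache.spark.mllib.regression.LabeledPoint

val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.2, 0.5)),
  LabeledPoint(0.0, Vectors.dense(-0.7, 1.1))))

// runMiniBatchSGD expects (label, features) pairs.
val data = trainingData.map(lp => (lp.label, lp.features)).cache()

val (weights, lossHistory) = GradientDescent.runMiniBatchSGD(
  data,
  new LogisticGradient(),   // gradient of the loss being minimized
  new SquaredL2Updater(),   // update rule / L2 regularization
  1.0,                      // stepSize
  100,                      // numIterations
  0.1,                      // regParam
  1.0,                      // miniBatchFraction (1.0 = full batch each iteration)
  Vectors.dense(0.0, 0.0))  // initialWeights, one entry per feature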


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sat, Jun 7, 2014 at 6:24 AM, Aslan Bekirov 
wrote:

> Hi All,
>
> I have to create a model using SGD in mlbase. I examined a bit mlbase and
> run some samples of classification , collaborative filtering etc.. But I
> could not run Gradient descent. I have to run
>
>  "val model =  GradientDescent.runMiniBatchSGD(params)"
>
> of course before params must be computed. I tried but could not managed to
> give parameters correctly.
>
> Can anyone explain parameters a bit and give an example of code?
>
> BR,
> Aslan
>
>


Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
Hi guys,

We would like to use the Spark Hadoop API to get the first couple hundred
lines at design time, to quickly show users the file structure/metadata
and the values in those lines without launching a full Spark job on the
cluster.

Since we're a web-based application, there will be multiple users using
the Spark Hadoop API, for example, sc.textFile(filePath). I wonder if
those APIs are thread-safe in local mode (each user will have its own
SparkContext object).

Secondly, it seems that even in local mode, the Jetty UI tracker will
be launched. For this kind of cheap operation, having a Jetty UI tracker
for each operation will be very expensive. Is there a way to disable
this behavior?

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


Re: Is spark context in local mode thread-safe?

2014-06-09 Thread DB Tsai
What if there are multiple threads using the same Spark context? Will
each thread have its own UI? In that case, it will quickly run out of
ports.

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jun 9, 2014 at 4:34 PM, Matei Zaharia  wrote:
> You currently can’t have multiple SparkContext objects in the same JVM, but 
> within a SparkContext, all of the APIs are thread-safe so you can share that 
> context between multiple threads. The other issue you’ll run into is that in 
> each thread where you want to use Spark, you need to use SparkEnv.set(env) 
> where “env” was obtained by SparkEnv.get in the thread that created the 
> context. This requirement will hopefully go away soon.
>
> Unfortunately there’s no way yet to disable the UI — feel free to open a JIRA 
> for it, it shouldn’t be hard to do.
>
> Matei
>
> On Jun 9, 2014, at 3:50 PM, DB Tsai  wrote:
>
>> Hi guys,
>>
>> We would like to use spark hadoop api to get the first couple hundred
>> lines in design time to quickly show users the file-structure/meta
>> data, and the values in those lines without launching the full spark
>> job in cluster.
>>
>> Since we're web-based application, there will be multiple users using
>> the spark hadoop api, for exmaple, sc.textFile(filePath). I wonder if
>> those APIs are thread-safe in local mode (each user will have its own
>> SparkContext object).
>>
>> Secondly, it seems that even in local mode, the jetty UI tracker will
>> be lunched. For this kind of cheap operation, having jetty UI tracker
>> for each operation will be very expensive. Is there a way to disable
>> this behavior?
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>
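For reference, a minimal sketch of the pattern Matei describes: one
SparkContext shared across a thread pool, with SparkEnv.set(env) called in
each worker thread (the pool size and file paths are made up, and the
SparkEnv.set/get dance is only needed on Spark versions of that era):

  import java.util.concurrent.Executors
  import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

  val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("preview"))
  val env = SparkEnv.get               // captured in the thread that created the context
  val pool = Executors.newFixedThreadPool(4)

  (1 to 4).foreach { i =>
    pool.submit(new Runnable {
      override def run(): Unit = {
        SparkEnv.set(env)              // required once per thread that uses the context
        val firstLines = sc.textFile(s"/path/to/file-$i").take(100)
        println(s"file-$i: previewed ${firstLines.length} lines")
      }
    })
  }
  pool.shutdown()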


Re: Optimizing reduce for 'huge' aggregated outputs.

2014-06-09 Thread DB Tsai
Hi Nick,

How does reduce work? I thought that after reducing within each executor,
it would reduce in parallel across the executors instead of pulling
everything to the driver and reducing there.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jun 9, 2014 at 11:07 PM, Nick Pentreath
 wrote:
> Can you key your RDD by some key and use reduceByKey? In fact if you are
> merging bunch of maps you can create a set of (k, v) in your mapPartitions
> and then reduceByKey using some merge function. The reduce will happen in
> parallel on multiple nodes in this case. You'll end up with just a single
> set of k, v per partition which you can reduce or collect and merge on the
> driver.
>
>
> —
> Sent from Mailbox
>
>
> On Tue, Jun 10, 2014 at 1:05 AM, Sung Hwan Chung 
> wrote:
>>
>> I suppose what I want is the memory efficiency of toLocalIterator and the
>> speed of collect. Is there any such thing?
>>
>>
>> On Mon, Jun 9, 2014 at 3:19 PM, Sung Hwan Chung 
>> wrote:
>>>
>>> Hello,
>>>
>>> I noticed that the final reduce function happens in the driver node with
>>> a code that looks like the following.
>>>
>>> val outputMap = mapPartition(domsomething).reduce(a: Map, b: Map) {
>>>  a.merge(b)
>>> }
>>>
>>> although individual outputs from mappers are small. Over time the
>>> aggregated result outputMap could be huuuge (say with hundreds of millions
>>> of keys and values, reaching giga bytes).
>>>
>>> I noticed that, even if we have a lot of memory in the driver node, this
>>> process becomes realy slow eventually (say we have 100+ partitions. the
>>> first reduce is fast, but progressively, it becomes veeery slow as more and
>>> more partition outputs get aggregated). Is this because the intermediate
>>> reduce output gets serialized and then deserialized every time?
>>>
>>> What I'd like ideally is, since reduce is taking place in the same
>>> machine any way, there's no need for any serialization and deserialization,
>>> and just aggregate the incoming results into the final aggregation. Is this
>>> possible?
>>
>>
>
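A minimal sketch of the reduceByKey pattern Nick suggests, where counting
string records stands in for the real map-merging logic (the first import
is needed on Spark 1.x to get the pair-RDD functions):

  import org.apache.spark.SparkContext._   // pair-RDD functions such as reduceByKey
  import org.apache.spark.rdd.RDD
  import scala.collection.mutable

  // Pre-aggregate inside each partition, then merge across executors with
  // reduceByKey; only the final (k, v) pairs ever reach the driver.
  def mergeCounts(records: RDD[String]): RDD[(String, Long)] = {
    records
      .mapPartitions { iter =>
        val local = mutable.Map.empty[String, Long]
        iter.foreach { rec => local(rec) = local.getOrElse(rec, 0L) + 1L }
        local.iterator
      }
      .reduceByKey(_ + _)
  }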


Re: Normalizations in MLBase

2014-06-11 Thread DB Tsai
Hi Aslan,

Currently, we don't have a utility function to do so. However, you can
easily implement this with another map transformation; see the sketch
below. I'm working on this feature now, and there will be a couple of
different normalization options users can choose from.
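For the 0-500 to [0, 1] case you mention, the map can be as simple as
this (the bounds are passed in, since they are assumed to be known up
front):

  import org.apache.spark.rdd.RDD

  // Min-max scaling: values known to lie in [min, max] are mapped into [0, 1].
  def minMaxScale(values: RDD[Double], min: Double = 0.0, max: Double = 500.0): RDD[Double] =
    values.map(v => (v - min) / (max - min))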

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Jun 11, 2014 at 6:25 AM, Aslan Bekirov  wrote:
> Hi All,
>
> I have to normalize a set of values in the range 0-500 to the [0-1] range.
>
> Is there any util method in MLBase to normalize large set of data?
>
> BR,
> Aslan


Re: Using Spark to crack passwords

2014-06-11 Thread DB Tsai
I think materializing the samples in the search space as an RDD will be
too expensive, and the amount of data will probably be larger than any
cluster can hold.

However, you could create an RDD of search ranges, where each range is
searched by one map operation. In this design, the # of rows in the RDD
is the same as the # of executors, and we can use mapPartitions to loop
through all the samples in a range without actually storing them in the
RDD. A sketch is below.
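Here is a rough sketch of that layout, where SHA-1 over stringified
integers stands in for whatever candidate generation the real demo uses
(function and parameter names are made up):

  import java.security.MessageDigest
  import org.apache.spark.SparkContext

  // One row per search range; mapPartitions enumerates candidates lazily,
  // so the candidates themselves are never materialized in the RDD.
  def crack(sc: SparkContext, targetHex: String, numRanges: Int, maxCandidate: Long): Array[Long] = {
    val step = maxCandidate / numRanges
    val ranges = sc.parallelize(0 until numRanges, numRanges)
      .map(i => (i * step, math.min((i + 1) * step, maxCandidate)))

    ranges.mapPartitions { iter =>
      val md = MessageDigest.getInstance("SHA-1")
      iter.flatMap { case (start, end) =>
        (start until end).iterator.filter { candidate =>
          val hex = md.digest(candidate.toString.getBytes("UTF-8"))
            .map("%02x".format(_)).mkString
          hex == targetHex
        }
      }
    }.collect()
  }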

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Jun 11, 2014 at 5:24 PM, Nick Chammas
 wrote:
> Spark is obviously well-suited to crunching massive amounts of data. How
> about to crunch massive amounts of numbers?
>
> A few years ago I put together a little demo for some co-workers to
> demonstrate the dangers of using SHA1 to hash and store passwords. Part of
> the demo included a live brute-forcing of hashes to show how SHA1's speed
> made it unsuitable for hashing passwords.
>
> I think it would be cool to redo the demo, but utilize the power of a
> cluster managed by Spark to crunch through hashes even faster.
>
> But how would you do that with Spark (if at all)?
>
> I'm guessing you would create an RDD that somehow defined the search space
> you're going to go through, and then partition it to divide the work up
> equally amongst the cluster's cores. Does that sound right?
>
> I wonder if others have already used Spark for computationally-intensive
> workloads like this, as opposed to just data-intensive ones.
>
> Nick
>
>
> 
> View this message in context: Using Spark to crack passwords
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Normalizations in MLBase

2014-06-12 Thread DB Tsai
Hi Aslan,

I'm not sure if the mlbase code is maintained against the current spark
master. The following is the code we use for standardization at my
company. I intend to clean it up and submit a PR; you could use it for
now.

  import breeze.linalg.{DenseVector => BDV}
  import org.apache.spark.mllib.linalg.{Vector, Vectors}
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.rdd.RDD

  // Note: x.toBreeze and Vectors.fromBreeze are private[mllib], so this has
  // to live under the org.apache.spark.mllib package.
  def standardize(data: RDD[Vector]): RDD[Vector] = {
    val summarizer = new RowMatrix(data).computeColumnSummaryStatistics()
    val mean = summarizer.mean
    val variance = summarizer.variance

    // The standardization will always densify the output, so the output
    // will be stored in a dense vector.
    data.map { x =>
      val n = x.toBreeze.length
      val output = BDV.zeros[Double](n)
      var i = 0
      while (i < n) {
        if (variance(i) == 0) {
          output(i) = Double.NaN
        } else {
          output(i) = (x(i) - mean(i)) / math.sqrt(variance(i))
        }
        i += 1
      }
      Vectors.fromBreeze(output)
    }
  }

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Jun 12, 2014 at 1:49 AM, Aslan Bekirov  wrote:
> Hi DB,
>
> I found a piece of code that uses znorm to normalize data.
>
>
> /**
>  * build training data set from sample and summary data
>  */
>  val train_data = sample_data.map( v =>
>Array.tabulate[Double](field_cnt)(
>  i => zscore(v._2(i),sample_mean(i),sample_stddev(i))
>)
>  ).cache
>
> Please make your comments if you find something wrong.
>
> BR,
> Aslan
>
>
>
> On Thu, Jun 12, 2014 at 11:13 AM, Aslan Bekirov 
> wrote:
>>
>> Thanks a lot DB.
>>
>> I will try to do Znorm normalization using map transformation.
>>
>>
>> BR,
>> Aslan
>>
>>
>> On Thu, Jun 12, 2014 at 12:16 AM, DB Tsai  wrote:
>>>
>>> Hi Aslan,
>>>
>>> Currently, we don't have the utility function to do so. However, you
>>> can easily implement this by another map transformation. I'm working
>>> on this feature now, and there will be couple different available
>>> normalization option users can chose.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ---
>>> My Blog: https://www.dbtsai.com
>>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>
>>>
>>> On Wed, Jun 11, 2014 at 6:25 AM, Aslan Bekirov 
>>> wrote:
>>> > Hi All,
>>> >
>>> > I have to normalize a set of values in the range 0-500 to the [0-1]
>>> > range.
>>> >
>>> > Is there any util method in MLBase to normalize large set of data?
>>> >
>>> > BR,
>>> > Aslan
>>
>>
>


Re: MLlib-a problem of example code for L-BFGS

2014-06-13 Thread DB Tsai
Hi Congrui,

Since it's private to the mllib package, one workaround is to write your
code in a Scala file under the mllib package in order to use the
constructor of LogisticRegressionModel.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Jun 13, 2014 at 11:50 AM, Congrui Yi
 wrote:
> Hi All,
>
> I'm new to Spark. Just tried out the example code on Spark website for
> L-BFGS. But the code "val model = new LogisticRegressionModel(..." gave me
> an error:
>
> :19: error: constructor LogisticRegressionModel in class
> LogisticRegres
> sionModel cannot be accessed in class $iwC
>  val model = new LogisticRegressionModel(
>   ^
>
> Then I checked the source code on github about the definition of the class
> LogisticRegressionModel. It says:
>
>
> It appears the reason is it has "private[mllib]" in the definition so access
> is restricted and it does not have a constructor either.
>
> So that's a contradiction.
>
> Thanks,
>
> BR,
>
> Congrui
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: pyspark regression results way off

2014-06-16 Thread DB Tsai
Is your data normalized? Sometimes GD doesn't work well if the data has
a wide range. If you are willing to write Scala code, you can try the
LBFGS optimizer, which converges better than GD.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jun 16, 2014 at 8:14 AM, jamborta  wrote:
> forgot to mention that I'm running spark 1.0
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7673.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: MLlib-Missing Regularization Parameter and Intercept for Logistic Regression

2014-06-16 Thread DB Tsai
Hi Congrui,

We're working on weighted regularization, so you can just set the
regularization weight of the intercept to 0. It's also useful when the
data is normalized but you want to apply the regularization in terms of
the original data.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jun 16, 2014 at 11:18 AM, Xiangrui Meng  wrote:
> Someone is working on weighted regularization. Stay tuned. -Xiangrui
>
> On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
>  wrote:
>> Hi Xiangrui,
>>
>> Thank you for the reply! I have tried customizing 
>> LogisticRegressionSGD.optimizer as in the example you mentioned, but the 
>> source code reveals that the intercept is also penalized if one is included, 
>> which is usually inappropriate. The developer should fix this problem.
>>
>> Best,
>>
>> Congrui
>>
>> -Original Message-
>> From: Xiangrui Meng [mailto:men...@gmail.com]
>> Sent: Friday, June 13, 2014 11:50 PM
>> To: user@spark.apache.org
>> Cc: user
>> Subject: Re: MLlib-Missing Regularization Parameter and Intercept for 
>> Logistic Regression
>>
>> 1. 
>> "examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala"
>> contains example code that shows how to set regParam.
>>
>> 2. A static method with more than 3 parameters becomes hard to
>> remember and hard to maintain. Please use LogistricRegressionWithSGD's
>> default constructor and setters.
>>
>> -Xiangrui


Re: MLlib-a problem of example code for L-BFGS

2014-06-16 Thread DB Tsai
Hi Congrui,

I mean: create your own TrainMLOR.scala with all the code provided in
the example, and put it under "package org.apache.spark.mllib", e.g.
something like the sketch below.
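A sketch, assuming the constructor takes a weights vector and an
intercept as in the class definition the thread refers to (file and
object names are illustrative):

  // TrainMLOR.scala, placed under the mllib package so the private[mllib]
  // constructor is visible from here.
  package org.apache.spark.mllib

  import org.apache.spark.mllib.classification.LogisticRegressionModel
  import org.apache.spark.mllib.linalg.Vectors

  object TrainMLOR {
    // weightsWithIntercept: the vector returned by LBFGS.runLBFGS, with the
    // intercept appended as the last element (matching MLUtils.appendBias).
    def buildModel(weightsWithIntercept: Array[Double]): LogisticRegressionModel = {
      val weights = weightsWithIntercept.dropRight(1)
      val intercept = weightsWithIntercept.last
      new LogisticRegressionModel(Vectors.dense(weights), intercept)
    }
  }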

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Jun 13, 2014 at 1:50 PM, Congrui Yi
 wrote:
> Hi DB,
>
> Thank you for the help! I'm new to this, so could you give a bit more
> details how this could be done?
>
> Sincerely,
>
> Congrui Yi
>
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-a-problem-of-example-code-for-L-BFGS-tp7589p7596.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui,

What's the difference between treeAggregate and aggregate? Why does
treeAggregate scale better? If we just use mapPartitions, will it be as
fast as treeAggregate?

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, Jun 17, 2014 at 12:58 PM, Xiangrui Meng  wrote:
> Hi Makoto,
>
> How many partitions did you set? If there are too many partitions,
> please do a coalesce before calling ML algorithms.
>
> Btw, could you try the tree branch in my repo?
> https://github.com/mengxr/spark/tree/tree
>
> I used tree aggregate in this branch. It should help with the scalability.
>
> Best,
> Xiangrui
>
> On Tue, Jun 17, 2014 at 12:22 PM, Makoto Yui  wrote:
>> Here is follow-up to the previous evaluation.
>>
>> "aggregate at GradientDescent.scala:178" never finishes at
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala#L178
>>
>> We confirmed, by -verbose:gc, that GC is not happening during the aggregate
>> and the cumulative CPU time for the task is increasing little by little.
>>
>> LBFGS also does not work for large # of features (news20.random.1000)
>> though it works fine for small # of features (news20.binary.1000).
>>
>> "aggregate at LBFGS.scala:201" also never finishes at
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L201
>>
>> ---
>> [Evaluated code for LBFGS]
>>
>> import org.apache.spark.SparkContext
>> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>> import org.apache.spark.mllib.linalg.Vectors
>> import org.apache.spark.mllib.util.MLUtils
>> import org.apache.spark.mllib.classification.LogisticRegressionModel
>> import org.apache.spark.mllib.optimization._
>>
>> val data = MLUtils.loadLibSVMFile(sc,
>> "hdfs://dm01:8020/dataset/news20-binary/news20.random.1000",
>> multiclass=false)
>> val numFeatures = data.take(1)(0).features.size
>>
>> val training = data.map(x => (x.label, 
>> MLUtils.appendBias(x.features))).cache()
>>
>> // Run training algorithm to build the model
>> val numCorrections = 10
>> val convergenceTol = 1e-4
>> val maxNumIterations = 20
>> val regParam = 0.1
>> val initialWeightsWithIntercept = Vectors.dense(new
>> Array[Double](numFeatures + 1))
>>
>> val (weightsWithIntercept, loss) = LBFGS.runLBFGS(
>>   training,
>>   new LogisticGradient(),
>>   new SquaredL2Updater(),
>>   numCorrections,
>>   convergenceTol,
>>   maxNumIterations,
>>   regParam,
>>   initialWeightsWithIntercept)
>> ---
>>
>>
>> Thanks,
>> Makoto
>>
>> 2014-06-17 21:32 GMT+09:00 Makoto Yui :
>>> Hello,
>>>
>>> I have been evaluating LogisticRegressionWithSGD of Spark 1.0 MLlib on
>>> Hadoop 0.20.2-cdh3u6 but it does not work for a sparse dataset though
>>> the number of training examples used in the evaluation is just 1,000.
>>>
>>> It works fine for the dataset *news20.binary.1000* that has 178,560
>>> features. However, it does not work for *news20.random.1000* where # of
>>> features is large  (1,354,731 features) though we used a sparse vector
>>> through MLUtils.loadLibSVMFile().
>>>
>>> The execution seems not progressing while no error is reported in the
>>> spark-shell as well as in the stdout/stderr of executors.
>>>
>>> We used 32 executors with each allocating 7GB (2GB is for RDD) for
>>> working memory.
>>>
>>> Any suggesions? Your help is really appreciated.
>>>
>>> ==
>>> Executed code
>>> ==
>>> import org.apache.spark.mllib.util.MLUtils
>>> import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
>>>
>>> //val training = MLUtils.loadLibSVMFile(sc,
>>> "hdfs://host:8020/dataset/news20-binary/news20.binary.1000",
>>> multiclass=false)
>>> val training = MLUtils.loadLibSVMFile(sc,
>>> "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>>> multiclass=false)
>>>
>>> val numFeatures = training .take(1)(0).features.size
>>> //numFeatures: Int = 178560 for news20.binary.1000
>>> //numFeatures: Int = 1354731 for news20.random.1000
>>> val model = LogisticRegressionWithSGD.train(training, numIterations=1)
>>>
>>> ==
>>> The dataset used in the evaluation
>>> ==
>>>
>>> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#news20.binary
>>>
>>> $ head -1000 news20.binary | sed 's/+1/1/g' | sed 's/-1/0/g' >
>>> news20.binary.1000
>>> $ sort -R news20.binary > news20.random
>>> $ head -1000 news20.random | sed 's/+1/1/g' | sed 's/-1/0/g' >
>>> news20.random.1000
>>>
>>> You can find the dataset in
>>> https://dl.dropboxusercontent.com/u/13123103/news20.random.1000
>>> https://dl.dropboxusercontent.com/u/13123103/news20.binary.1000
>>>
>>>
>>> Thanks,
>>> Makoto


Re: news20-binary classification with LogisticRegressionWithSGD

2014-06-17 Thread DB Tsai
Hi Xiangrui,

Does that mean mapPartitions followed by reduce shares the same behavior
as the aggregate operation, which is O(n)?

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, Jun 17, 2014 at 2:00 PM, Xiangrui Meng  wrote:
> Hi DB,
>
> treeReduce (treeAggregate) is a feature I'm testing now. It is a
> compromise between current reduce and butterfly allReduce. The former
> runs in linear time on the number of partitions, the latter introduces
> too many dependencies. treeAggregate with depth = 2 should run in
> O(sqrt(n)) time, where n is the number of partitions. It would be
> great if someone can help test its scalability.
>
> Best,
> Xiangrui
>
> On Tue, Jun 17, 2014 at 1:32 PM, Makoto Yui  wrote:
>> Hi Xiangrui,
>>
>>
>> (2014/06/18 4:58), Xiangrui Meng wrote:
>>>
>>> How many partitions did you set? If there are too many partitions,
>>> please do a coalesce before calling ML algorithms.
>>
>>
>> The training data "news20.random.1000" is small and thus only 2 partitions
>> are used by the default.
>>
>> val training = MLUtils.loadLibSVMFile(sc,
>> "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>> multiclass=false).
>>
>> We also tried 32 partitions as follows but the aggregate never finishes.
>>
>> val training = MLUtils.loadLibSVMFile(sc,
>> "hdfs://host:8020/dataset/news20-binary/news20.random.1000",
>> multiclass=false, numFeatures = 1354731 , minPartitions = 32)
>>
>>
>>> Btw, could you try the tree branch in my repo?
>>> https://github.com/mengxr/spark/tree/tree
>>>
>>> I used tree aggregate in this branch. It should help with the scalability.
>>
>>
>> Is treeAggregate itself available on Spark 1.0?
>>
>> I wonder.. Could I test your modification just by running the following code
>> on REPL?
>>
>> ---
>> val (gradientSum, lossSum) = data.sample(false, miniBatchFraction, 42 + i)
>> .treeAggregate((BDV.zeros[Double](weights.size), 0.0))(
>>   seqOp = (c, v) => (c, v) match { case ((grad, loss), (label,
>> features)) =>
>> val l = gradient.compute(features, label, weights,
>> Vectors.fromBreeze(grad))
>> (grad, loss + l)
>>   },
>>   combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1),
>> (grad2, loss2)) =>
>> (grad1 += grad2, loss1 + loss2)
>>   }, 2)
>> -
>>
>> Rebuilding Spark is quite something to do evaluation.
>>
>> Thanks,
>> Makoto


Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
We are submitting the spark job from our tomcat application using
yarn-cluster mode with great success. As Kevin said, yarn-client mode
runs the driver in your local JVM, and it will have really bad network
overhead when you do a reduce action, which pulls all the results from
the executors into your local JVM. Also, since you can only have one
spark context object per JVM, it will be tricky to run multiple spark
jobs concurrently in yarn-client mode.

This is how we submit a spark job in yarn-cluster mode. Please use the
current master code; otherwise, after the job is finished, spark will
kill the JVM and exit your app.

We set up the configuration of spark in a scala map.

  def getArgsFromConf(conf: Map[String, String]): Array[String] = {
Array[String](
  "--jar", conf.get("app.jar").getOrElse(""),
  "--addJars", conf.get("spark.addJars").getOrElse(""),
  "--class", conf.get("spark.mainClass").getOrElse(""),
  "--num-executors", conf.get("spark.numWorkers").getOrElse("1"),
  "--driver-memory", conf.get("spark.masterMemory").getOrElse("1g"),
  "--executor-memory", conf.get("spark.workerMemory").getOrElse("1g"),
  "--executor-cores", conf.get("spark.workerCores").getOrElse("1"))
  }

  System.setProperty("SPARK_YARN_MODE", "true")
  val sparkConf = new SparkConf
  val args = getArgsFromConf(conf)
  new Client(new ClientArguments(args, sparkConf), hadoopConfig,
sparkConf).run

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Jun 19, 2014 at 11:22 AM, Kevin Markey  wrote:
> Yarn client is much like Spark client mode, except that the executors are
> running in Yarn containers managed by the Yarn resource manager on the
> cluster instead of as Spark workers managed by the Spark master.  The driver
> executes as a local client in your local JVM.  It communicates with the
> workers on the cluster.  Transformations are scheduled on the cluster by the
> driver's logic.  Actions involve communication between local driver and
> remote cluster executors.  So, there is some additional network overhead,
> especially if the driver is not co-located on the cluster.  In yarn-cluster
> mode -- in contrast, the driver is executed as a thread in a Yarn
> application master on the cluster.
>
> In either case, the assembly JAR must be available to the application on the
> cluster.  Best to copy it to HDFS and specify its location by exporting its
> location as SPARK_JAR.
>
> Kevin Markey
>
>
> On 06/19/2014 11:22 AM, Koert Kuipers wrote:
>
> i am trying to understand how yarn-client mode works. i am not using
> spark-submit, but instead launching a spark job from within my own
> application.
>
> i can see my application contacting yarn successfully, but then in yarn i
> get an immediate error:
>
> Application application_1403117970283_0014 failed 2 times due to AM
> Container for appattempt_1403117970283_0014_02 exited with exitCode:
> -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does not
> exist
> .Failing this attempt.. Failing the application.
>
> why is yarn trying to fetch my jar, and why as a local file? i would expect
> the jar to be send to yarn over the wire upon job submission?
>
>


Re: trying to understand yarn-client mode

2014-06-19 Thread DB Tsai
Currently, we save the result in HDFS and read it back in our
application. Since Client.run is a blocking call, it's easy to do it
this way.

We are now working on using akka to bring the result back to the app
without going through HDFS, and we can use akka to track the logs,
stack traces, etc.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Thu, Jun 19, 2014 at 12:08 PM, Koert Kuipers  wrote:
> db tsai,
> if in yarn-cluster mode the driver runs inside yarn, how can you do a
> rdd.collect and bring the results back to your application?
>
>
> On Thu, Jun 19, 2014 at 2:33 PM, DB Tsai  wrote:
>>
>> We are submitting the spark job in our tomcat application using
>> yarn-cluster mode with great success. As Kevin said, yarn-client mode
>> runs driver in your local JVM, and it will have really bad network
>> overhead when one do reduce action which will pull all the result from
>> executor to your local JVM. Also, since you can only have one spark
>> context object in one JVM, it will be tricky to run multiple spark
>> jobs concurrently with yarn-clinet mode.
>>
>> This is how we submit spark job with yarn-cluster mode. Please use the
>> current master code, otherwise, after the job is finished, spark will
>> kill the JVM and exit your app.
>>
>> We setup the configuration of spark in a scala map.
>>
>>   def getArgsFromConf(conf: Map[String, String]): Array[String] = {
>> Array[String](
>>   "--jar", conf.get("app.jar").getOrElse(""),
>>   "--addJars", conf.get("spark.addJars").getOrElse(""),
>>   "--class", conf.get("spark.mainClass").getOrElse(""),
>>   "--num-executors", conf.get("spark.numWorkers").getOrElse("1"),
>>   "--driver-memory", conf.get("spark.masterMemory").getOrElse("1g"),
>>   "--executor-memory", conf.get("spark.workerMemory").getOrElse("1g"),
>>   "--executor-cores", conf.get("spark.workerCores").getOrElse("1"))
>>   }
>>
>>   System.setProperty("SPARK_YARN_MODE", "true")
>>   val sparkConf = new SparkConf
>>   val args = getArgsFromConf(conf)
>>   new Client(new ClientArguments(args, sparkConf), hadoopConfig,
>> sparkConf).run
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Thu, Jun 19, 2014 at 11:22 AM, Kevin Markey 
>> wrote:
>> > Yarn client is much like Spark client mode, except that the executors
>> > are
>> > running in Yarn containers managed by the Yarn resource manager on the
>> > cluster instead of as Spark workers managed by the Spark master.  The
>> > driver
>> > executes as a local client in your local JVM.  It communicates with the
>> > workers on the cluster.  Transformations are scheduled on the cluster by
>> > the
>> > driver's logic.  Actions involve communication between local driver and
>> > remote cluster executors.  So, there is some additional network
>> > overhead,
>> > especially if the driver is not co-located on the cluster.  In
>> > yarn-cluster
>> > mode -- in contrast, the driver is executed as a thread in a Yarn
>> > application master on the cluster.
>> >
>> > In either case, the assembly JAR must be available to the application on
>> > the
>> > cluster.  Best to copy it to HDFS and specify its location by exporting
>> > its
>> > location as SPARK_JAR.
>> >
>> > Kevin Markey
>> >
>> >
>> > On 06/19/2014 11:22 AM, Koert Kuipers wrote:
>> >
>> > i am trying to understand how yarn-client mode works. i am not using
>> > spark-submit, but instead launching a spark job from within my own
>> > application.
>> >
>> > i can see my application contacting yarn successfully, but then in yarn
>> > i
>> > get an immediate error:
>> >
>> > Application application_1403117970283_0014 failed 2 times due to AM
>> > Container for appattempt_1403117970283_0014_02 exited with exitCode:
>> > -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does
>> > not
>> > exist
>> > .Failing this attempt.. Failing the application.
>> >
>> > why is yarn trying to fetch my jar, and why as a local file? i would
>> > expect
>> > the jar to be send to yarn over the wire upon job submission?
>> >
>> >
>
>


Re: parallel Reduce within a key

2014-06-20 Thread DB Tsai
Currently, the reduce operation combines the results from the mappers
sequentially, so it's O(n).

Xiangrui is working on treeReduce, which is O(log(n)). Based on the
benchmark, it dramatically increases the performance. You can test the
code in his branch.
https://github.com/apache/spark/pull/1110

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Jun 20, 2014 at 6:57 AM, ansriniv  wrote:
> Hi,
>
> I am on Spark 0.9.0
>
> I have a 2 node cluster (2 worker nodes) with 16 cores on each node (so, 32
> cores in the cluster).
> I have an input rdd with 64 partitions.
>
> I am running  "sc.mapPartitions(...).reduce(...)"
>
> I can see that I get full parallelism on the mapper (all my 32 cores are
> busy simultaneously). However, when it comes to reduce(), the outputs of the
> mappers are all reduced SERIALLY. Further, all the reduce processing happens
> only on 1 of the workers.
>
> I was expecting that the outputs of the 16 mappers on node 1 would be
> reduced in parallel in node 1 while the outputs of the 16 mappers on node 2
> would be reduced in parallel on node 2 and there would be 1 final inter-node
> reduce (of node 1 reduced result and node 2 reduced result).
>
> Isn't parallel reduce supported WITHIN a key (in this case I have no key) ?
> (I know that there is parallelism in reduce across keys)
>
> Best Regards
> Anand
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/parallel-Reduce-within-a-key-tp7998.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: pyspark regression results way off

2014-06-25 Thread DB Tsai
There is no python binding for LBFGS. Feel free to submit a PR.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Jun 25, 2014 at 1:41 PM, Mohit Jaggi  wrote:
> Is a python binding for LBFGS in the works? My co-worker has written one and
> can contribute back if it helps.
>
>
> On Mon, Jun 16, 2014 at 11:00 AM, DB Tsai  wrote:
>>
>> Is your data normalized? Sometimes, GD doesn't work well if the data
>> has wide range. If you are willing to write scala code, you can try
>> LBFGS optimizer which converges better than GD.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Mon, Jun 16, 2014 at 8:14 AM, jamborta  wrote:
>> > forgot to mention that I'm running spark 1.0
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p7673.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-05 Thread DB Tsai
You may try LBFGS for more stable convergence. In spark 1.1, we will be
able to use LBFGS instead of GD in the training process.
On Jul 4, 2014 1:23 PM, "Thomas Robert"  wrote:

> Hi all,
>
> I too am having some issues with *RegressionWithSGD algorithms.
>
> Concerning your issue Eustache, this could be due to the fact that these
> regression algorithms uses a fixed step (that is divided by
> sqrt(iteration)). During my tests, quite often, the algorithm diverged an
> infinity cost, I guessed because the step was too big. I reduce it and
> managed to get good results on a very simple generated dataset.
>
> But I was wondering if anyone here had some advises concerning the use of
> these regression algorithms, for example how to choose a good step and
> number of iterations? I wonder if I'm using those right...
>
> Thanks,
>
> --
>
> *Thomas ROBERT*
> www.creativedata.fr
>
>
> 2014-07-03 16:16 GMT+02:00 Eustache DIEMERT :
>
>> Printing the model show the intercept is always 0 :(
>>
>> Should I open a bug for that ?
>>
>>
>> 2014-07-02 16:11 GMT+02:00 Eustache DIEMERT :
>>
>>> Hi list,
>>>
>>> I'm benchmarking MLlib for a regression task [1] and get strange
>>> results.
>>>
>>> Namely, using RidgeRegressionWithSGD it seems the predicted points miss
>>> the intercept:
>>>
>>> {code}
>>> val trainedModel = RidgeRegressionWithSGD.train(trainingData, 1000)
>>> ...
>>> valuesAndPreds.take(10).map(t => println(t))
>>> {code}
>>>
>>> output:
>>>
>>> (2007.0,-3.784588726958493E75)
>>> (2003.0,-1.9562390324037716E75)
>>> (2005.0,-4.147413202985629E75)
>>> (2003.0,-1.524938024096847E75)
>>> ...
>>>
>>> If I change the parameters (step size, regularization and iterations) I
>>> get NaNs more often than not:
>>> (2007.0,NaN)
>>> (2003.0,NaN)
>>> (2005.0,NaN)
>>> ...
>>>
>>> On the other hand DecisionTree model give sensible results.
>>>
>>> I see there is a `setIntercept()` method in abstract class
>>> GeneralizedLinearAlgorithm that seems to trigger the use of the intercept
>>> but I'm unable to use it from the public interface :(
>>>
>>> Any help appreciated :)
>>>
>>> Eustache
>>>
>>> [1] https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD
>>>
>>
>


Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread DB Tsai
Actually, the mode that requires installing the jar on each individual
node is standalone mode, which works with both MR1 and MR2. Cloudera and
Hortonworks currently support spark in this way, as far as I know.

For both yarn-cluster and yarn-client modes, Spark will distribute the
jars through the distributed cache, and each executor can find the jars
there.

On Jul 7, 2014 6:23 AM, "Chester @work"  wrote:
>
> In Yarn cluster mode, you can either have spark on all the cluster nodes or 
> supply the spark jar yourself. In the 2nd case, you don't need install spark 
> on cluster at all. As you supply the spark assembly as we as your app jar 
> together.
>
> I hope this make it clear
>
> Chester
>
> Sent from my iPhone
>
> On Jul 7, 2014, at 5:05 AM, Konstantin Kudryavtsev 
>  wrote:
>
> thank you Krishna!
>
> Could you please explain why do I need install spark on each node if Spark 
> official site said: If you have a Hadoop 2 cluster, you can run Spark without 
> any installation needed
>
> I have HDP 2 (YARN) and that's why I hope I don't need to install spark on 
> each node
>
> Thank you,
> Konstantin Kudryavtsev
>
>
> On Mon, Jul 7, 2014 at 1:57 PM, Krishna Sankar  wrote:
>>
>> Konstantin,
>>
>> You need to install the hadoop rpms on all nodes. If it is Hadoop 2, the 
>> nodes would have hdfs & YARN.
>> Then you need to install Spark on all nodes. I haven't had experience with 
>> HDP, but the tech preview might have installed Spark as well.
>> In the end, one should have hdfs,yarn & spark installed on all the nodes.
>> After installations, check the web console to make sure hdfs, yarn & spark 
>> are running.
>> Then you are ready to start experimenting/developing spark applications.
>>
>> HTH.
>> Cheers
>> 
>>
>>
>> On Mon, Jul 7, 2014 at 2:34 AM, Konstantin Kudryavtsev 
>>  wrote:
>>>
>>> guys, I'm not talking about running spark on VM, I don have problem with it.
>>>
>>> I confused in the next:
>>> 1) Hortonworks describe installation process as RPMs on each node
>>> 2) spark home page said that everything I need is YARN
>>>
>>> And I'm in stucj with understanding what I need to do to run spark on yarn 
>>> (do I need RPMs installations or only build spark on edge node?)
>>>
>>>
>>> Thank you,
>>> Konstantin Kudryavtsev
>>>
>>>
>>> On Mon, Jul 7, 2014 at 4:34 AM, Robert James  wrote:

 I can say from my experience that getting Spark to work with Hadoop 2
 is not for the beginner; after solving one problem after another
 (dependencies, scripts, etc.), I went back to Hadoop 1.

 Spark's Maven, ec2 scripts, and others all use Hadoop 1 - not sure
 why, but, given so, Hadoop 2 has too many bumps

 On 7/6/14, Marco Shaw  wrote:
 > That is confusing based on the context you provided.
 >
 > This might take more time than I can spare to try to understand.
 >
 > For sure, you need to add Spark to run it in/on the HDP 2.1 express VM.
 >
 > Cloudera's CDH 5 express VM includes Spark, but the service isn't 
 > running by
 > default.
 >
 > I can't remember for MapR...
 >
 > Marco
 >
 >> On Jul 6, 2014, at 6:33 PM, Konstantin Kudryavtsev
 >>  wrote:
 >>
 >> Marco,
 >>
 >> Hortonworks provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you
 >> can try
 >> from
 >> http://hortonworks.com/wp-content/uploads/2014/05/SparkTechnicalPreview.pdf
 >>  HDP 2.1 means YARN, at the same time they propose ti install rpm
 >>
 >> On other hand, http://spark.apache.org/ said "
 >> Integrated with Hadoop
 >> Spark can run on Hadoop 2's YARN cluster manager, and can read any
 >> existing Hadoop data.
 >>
 >> If you have a Hadoop 2 cluster, you can run Spark without any 
 >> installation
 >> needed. "
 >>
 >> And this is confusing for me... do I need rpm installation on not?...
 >>
 >>
 >> Thank you,
 >> Konstantin Kudryavtsev
 >>
 >>
 >>> On Sun, Jul 6, 2014 at 10:56 PM, Marco Shaw 
 >>> wrote:
 >>> Can you provide links to the sections that are confusing?
 >>>
 >>> My understanding, the HDP1 binaries do not need YARN, while the HDP2
 >>> binaries do.
 >>>
 >>> Now, you can also install Hortonworks Spark RPM...
 >>>
 >>> For production, in my opinion, RPMs are better for manageability.
 >>>
  On Jul 6, 2014, at 5:39 PM, Konstantin Kudryavtsev
   wrote:
 
  Hello, thanks for your message... I'm confused, Hortonworhs suggest
  install spark rpm on each node, but on Spark main page said that yarn
  enough and I don't need to install it... What the difference?
 
  sent from my HTC
 
 > On Jul 6, 2014 8:34 PM, "vs"  wrote:
 > Konstantin,
 >
 > HWRK provides a Tech Preview of Spark 0.9.1 with HDP 2.1 that you can
 > try
 > from
 > http://hortonworks.com/wp-content/uploads/2014/05/SparkTech

Re: usage question for saprk run on YARN

2014-07-07 Thread DB Tsai
spark-client mode runs the driver in your application's JVM, while
spark-cluster mode runs the driver in the yarn cluster.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Jul 7, 2014 at 5:44 PM, Cheng Ju Chuang
 wrote:
> Hi,
>
>
>
> I am running some simple samples for my project. Right now the spark sample
> is running on Hadoop 2.2 with YARN. My question is what is the main
> different when we run as spark-client and spark-cluster except different way
> to submit our job. And what is the specific way to configure the job e.g.
> running on particular nodes which has more resource.
>
>
>
> Thank you very much
>
>
>
> Sincerely,
>
> Cheng-Ju Chuang


Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread DB Tsai
We're doing a similar thing to launch spark jobs from tomcat, and I
opened a JIRA for this. There are a couple of technical discussions
there.

https://issues.apache.org/jira/browse/SPARK-2100

In the end, we realized that spark uses jetty not only for the Spark
WebUI, but also for distributing the jars and tasks, so it's really hard
to remove the web dependency in Spark.

So we launch our spark job in yarn-cluster mode, and at runtime the only
dependency in our web application is spark-yarn, which doesn't contain
any spark web stuff.

PS, upgrading jetty from 8.x to 9.x in spark may not be as
straightforward as just changing the version in the spark build script.
Jetty 9.x requires Java 7 since the servlet api (servlet 3.1) requires
Java 7.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Tue, Jul 8, 2014 at 8:43 AM, Koert Kuipers  wrote:
> do you control your cluster and spark deployment? if so, you can try to
> rebuild with jetty 9.x
>
>
> On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter
>  wrote:
>>
>> Digging a bit more I see that there is yet another jetty instance that
>> is causing the problem, namely the BroadcastManager has one. I guess
>> this one isn't very wise to disable... It might very well be that the
>> WebUI is a problem as well, but I guess the code doesn't get far
>> enough. Any ideas on how to solve this? Spark seems to use jetty
>> 8.1.14, while dropwizard uses jetty 9.0.7, so that might be the source
>> of the problem. Any ideas?
>>
>> On Tue, Jul 8, 2014 at 2:58 PM, Martin Gammelsæter
>>  wrote:
>> > Hi!
>> >
>> > I am building a web frontend for a Spark app, allowing users to input
>> > sql/hql and get results back. When starting a SparkContext from within
>> > my server code (using jetty/dropwizard) I get the error
>> >
>> > java.lang.NoSuchMethodError:
>> > org.eclipse.jetty.server.AbstractConnector: method ()V not found
>> >
>> > when Spark tries to fire up its own jetty server. This does not happen
>> > when running the same code without my web server. This is probably
>> > fixable somehow(?) but I'd like to disable the webUI as I don't need
>> > it, and ideally I would like to access that information
>> > programatically instead, allowing me to embed it in my own web
>> > application.
>> >
>> > Is this possible?
>> >
>> > --
>> > Best regards,
>> > Martin Gammelsæter
>>
>>
>>
>> --
>> Mvh.
>> Martin Gammelsæter
>> 92209139
>
>


Re: Terminal freeze during SVM

2014-07-09 Thread DB Tsai
It means pulling the code from the latest development branch (master) of
the git repository.
On Jul 9, 2014 9:45 AM, "AlexanderRiggers" 
wrote:

> By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by
> master? Because I am using v 1.0.0 - Alex
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
Are you using 1.0 or current master? A bug related to this is fixed in
master.
On Jul 12, 2014 8:50 AM, "Srikrishna S"  wrote:

> I am run logistic regression with SGD on a problem with about 19M
> parameters (the kdda dataset from the libsvm library)
>
> I consistently see that the nodes on my computer get disconnected and
> soon the whole job goes to a grinding halt.
>
> 14/07/12 03:05:16 ERROR cluster.YarnClientClusterScheduler: Lost
> executor 2 on pachy4 remote Akka client disassociated
>
> Does this have anything to do with the akka.frame_size? I have tried
> upto 1024 MB and I still get the same thing.
>
> I don't have any more information in the logs about why the clients
> are getting disconnected. Any thoughts?
>
> Regards,
> Krishna
>


Re: Akka Client disconnected

2014-07-12 Thread DB Tsai
https://issues.apache.org/jira/browse/SPARK-2156

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sat, Jul 12, 2014 at 5:23 PM, Srikrishna S  wrote:
> I am using the master that I compiled 2 days ago. Can you point me to the 
> JIRA?
>
> On Sat, Jul 12, 2014 at 9:13 AM, DB Tsai  wrote:
>> Are you using 1.0 or current master? A bug related to this is fixed in
>> master.
>>
>> On Jul 12, 2014 8:50 AM, "Srikrishna S"  wrote:
>>>
>>> I am run logistic regression with SGD on a problem with about 19M
>>> parameters (the kdda dataset from the libsvm library)
>>>
>>> I consistently see that the nodes on my computer get disconnected and
>>> soon the whole job goes to a grinding halt.
>>>
>>> 14/07/12 03:05:16 ERROR cluster.YarnClientClusterScheduler: Lost
>>> executor 2 on pachy4 remote Akka client disassociated
>>>
>>> Does this have anything to do with the akka.frame_size? I have tried
>>> upto 1024 MB and I still get the same thing.
>>>
>>> I don't have any more information in the logs about why the clients
>>> are getting disconnected. Any thoughts?
>>>
>>> Regards,
>>> Krishna


Spark MLlib vs BIDMach Benchmark

2014-07-26 Thread DB Tsai
BIDMach is a CPU- and GPU-accelerated machine learning library, also from
Berkeley.

https://github.com/BIDData/BIDMach/wiki/Benchmarks

They benchmarked against Spark 0.9, and they claim that it's
significantly faster than Spark MLlib. In Spark 1.0, a lot of performance
optimization has been done, and sparse data is supported, so it will be
interesting to see a new benchmark result.

Is anyone familiar with BIDMach? Is it as fast as they claim?

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


Re: MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-27 Thread DB Tsai
Could you help provide a test case to verify this issue and open a JIRA
to track it? Also, are you interested in submitting a PR to fix it? Thanks.

Sent from my Google Nexus 5
On Jul 27, 2014 11:07 AM, "Aureliano Buendia"  wrote:

> Hi,
>
> The recently added NNLS implementation in MLlib returns wrong solutions.
> This is not data specific, just try any data in R's nnls, and then the same
> data in MLlib's NNLS. The results are very different.
>
> Also, the elected algorithm Polyak(1969) is not the best one around. The
> most popular one is Lawson-Hanson (1974):
>
> http://en.wikipedia.org/wiki/Non-negative_least_squares#Algorithms
>
>
>


Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-02 Thread DB Tsai
I ran into this issue as well. The workaround suggested by Shivaram of
copying the jar and ivy files manually works for me.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Aug 1, 2014 at 3:31 PM, Patrick Wendell  wrote:
> I've had intermiddent access to the artifacts themselves, but for me the
> directory listing always 404's.
>
> I think if sbt hits a 404 on the directory, it sends a somewhat confusing
> error message that it can't download the artifact.
>
> - Patrick
>
>
> On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman
>  wrote:
>>
>> This fails for me too. I have no idea why it happens as I can wget the pom
>> from maven central. To work around this I just copied the ivy xmls and jars
>> from this github repo
>>
>> https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library
>> and put it in /root/.ivy2/cache/org.scala-lang/scala-library
>>
>> Thanks
>> Shivaram
>>
>>
>>
>>
>> On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau  wrote:
>>>
>>> Currently scala 2.10.2 can't be pulled in from maven central it seems,
>>> however if you have it in your ivy cache it should work.
>>>
>>>
>>> On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau 
>>> wrote:
>>>>
>>>> Me 3
>>>>
>>>>
>>>> On Fri, Aug 1, 2014 at 11:15 AM, nit  wrote:
>>>>>
>>>>> I also ran into same issue. What is the solution?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Compiling-Spark-master-284771ef-with-sbt-sbt-assembly-fails-on-EC2-tp11155p11189.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271
>>>
>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-04 Thread DB Tsai
You can try to define a wrapper class for your parser and create the
parser instance in the wrapper's companion object as a singleton. Thus,
even if you create a new wrapper object in mapPartitions every time,
each JVM will have only a single instance of your parser object.
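A rough sketch of the pattern (ExpensiveParser is a made-up stand-in for
the non-serializable third-party parser):

  // Stand-in for the slow-to-construct, non-serializable parser.
  class ExpensiveParser(settings: String) {
    def parse(line: String): String = line   // placeholder logic
  }

  // The wrapper is serializable and cheap to create; the real parser lives
  // in the companion object, so it is constructed once per JVM (per executor).
  class ParserWrapper extends Serializable {
    def parse(line: String): String = ParserWrapper.parser.parse(line)
  }

  object ParserWrapper {
    private lazy val parser = new ExpensiveParser("default-settings")
  }

Then lines.map(line => new ParserWrapper().parse(line)) only ever touches
the single parser instance held by the companion object.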

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Aug 4, 2014 at 2:01 AM, Fengyun RAO  wrote:
> Thanks, Sean!
>
> It works, but as the link in 2 - Why Is My Spark Job so Slow and Only Using
> a Single Thread? says " parser instance is now a singleton created in the
> scope of our driver program" which I thought was in the scope of executor.
> Am I wrong, or why?
>
> I didn't want the equivalent of "setup()" method, since I want to share the
> "parser" among tasks in the same worker node. It takes tens of seconds to
> initialize a "parser". What's more, I want to know if the "parser" could
> have a field such as ConcurrentHashMap which all tasks in the node may get()
> of put() items.
>
>
>
>
> 2014-08-04 16:35 GMT+08:00 Sean Owen :
>
>> The parser does not need to be serializable. In the line:
>>
>> lines.map(line => JSONParser.parse(line))
>>
>> ... the parser is called but there is no parser object that with state
>> that can be serialized. Are you sure it does not work?
>>
>> The error message alluded to originally refers to an object not shown
>> in the code, so I'm not 100% sure this was the original issue.
>>
>> If you want, the equivalent of "setup()" is really "writing some code
>> at the start of a call to mapPartitions()"
>>
>> On Mon, Aug 4, 2014 at 8:40 AM, Fengyun RAO  wrote:
>> > Thanks, Ron.
>> >
>> > The problem is that the "parser" is written in another package which is
>> > not
>> > serializable.
>> >
>> > In mapreduce, I could create the "parser" in the map setup() method.
>> >
>> > Now in spark, I want to create it for each worker, and share it among
>> > all
>> > the tasks on the same work node.
>> >
>> > I know different workers run on different machine, but it doesn't have
>> > to
>> > communicate between workers.
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to share a NonSerializable variable among tasks in the same worker node?

2014-08-10 Thread DB Tsai
Spark caches the RDD in the JVM, so presumably, yes, the singleton trick
should work.

Sent from my Google Nexus 5
On Aug 9, 2014 11:00 AM, "Kevin James Matzen" 
wrote:

> I have a related question.  With Hadoop, I would do the same thing for
> non-serializable objects and setup().  I also had a use case where it
> was so expensive to initialize the non-serializable object that I
> would make it a static member of the mapper, turn on JVM reuse across
> tasks, and then prevent the reinitialization for every task on the
> same node.  Is that easy to do with Spark?  Assuming Spark reuses the
> JVM across tasks by default, then taking raofengyun's factory method
> and it return a singleton should work, right?  Does Spark reuse JVMs
> across tasks?
>
> On Sat, Aug 9, 2014 at 7:48 AM, Fengyun RAO  wrote:
> > Although nobody answers the Two questions, in my practice, it seems both
> are
> > yes.
> >
> >
> > 2014-08-04 19:50 GMT+08:00 Fengyun RAO :
> >>
> >> object LogParserWrapper {
> >> private val logParser = {
> >> val settings = new ...
> >> val builders = new 
> >> new LogParser(builders, settings)
> >> }
> >> def getParser = logParser
> >> }
> >>
> >> object MySparkJob {
> >>def main(args: Array[String]) {
> >> val sc = new SparkContext()
> >> val lines = sc.textFile(arg(0))
> >>
> >> val parsed = lines.map(line =>
> >> LogParserWrapper.getParser.parse(line))
> >> ...
> >> }
> >>
> >> Q1: Is this the right way to share LogParser instance among all tasks on
> >> the same worker, if LogParser is not serializable?
> >>
> >> Q2: LogParser is read-only, but can LogParser hold a cache field such
> as a
> >> ConcurrentHashMap where all tasks on the same worker try to get() and
> put()
> >> items?
> >>
> >>
> >> 2014-08-04 19:29 GMT+08:00 Sean Owen :
> >>
> >>> The issue is that it's not clear what "parser" is. It's not shown in
> >>> your code. The snippet you show does not appear to contain a parser
> >>> object.
> >>>
> >>> On Mon, Aug 4, 2014 at 10:01 AM, Fengyun RAO 
> >>> wrote:
> >>> > Thanks, Sean!
> >>> >
> >>> > It works, but as the link in 2 - Why Is My Spark Job so Slow and Only
> >>> > Using
> >>> > a Single Thread? says " parser instance is now a singleton created in
> >>> > the
> >>> > scope of our driver program" which I thought was in the scope of
> >>> > executor.
> >>> > Am I wrong, or why?
> >>> >
> >>> > I didn't want the equivalent of "setup()" method, since I want to
> share
> >>> > the
> >>> > "parser" among tasks in the same worker node. It takes tens of
> seconds
> >>> > to
> >>> > initialize a "parser". What's more, I want to know if the "parser"
> >>> > could
> >>> > have a field such as ConcurrentHashMap which all tasks in the node
> may
> >>> > get()
> >>> > of put() items.
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > 2014-08-04 16:35 GMT+08:00 Sean Owen :
> >>> >
> >>> >> The parser does not need to be serializable. In the line:
> >>> >>
> >>> >> lines.map(line => JSONParser.parse(line))
> >>> >>
> >>> >> ... the parser is called but there is no parser object that with
> state
> >>> >> that can be serialized. Are you sure it does not work?
> >>> >>
> >>> >> The error message alluded to originally refers to an object not
> shown
> >>> >> in the code, so I'm not 100% sure this was the original issue.
> >>> >>
> >>> >> If you want, the equivalent of "setup()" is really "writing some
> code
> >>> >> at the start of a call to mapPartitions()"
> >>> >>
> >>> >> On Mon, Aug 4, 2014 at 8:40 AM, Fengyun RAO 
> >>> >> wrote:
> >>> >> > Thanks, Ron.
> >>> >> >
> >>> >> > The problem is that the "parser" is written in another package
> which
> >>> >> > is
> >>> >> > not
> >>> >> > serializable.
> >>> >> >
> >>> >> > In mapreduce, I could create the "parser" in the map setup()
> method.
> >>> >> >
> >>> >> > Now in spark, I want to create it for each worker, and share it
> >>> >> > among
> >>> >> > all
> >>> >> > the tasks on the same work node.
> >>> >> >
> >>> >> > I know different workers run on different machine, but it doesn't
> >>> >> > have
> >>> >> > to
> >>> >> > communicate between workers.
> >>> >
> >>> >
> >>
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Random Forest implementation in MLib

2014-08-11 Thread DB Tsai
We have open-sourced a Random Forest at Alpine Data Labs under the Apache
license, and we're also trying to get it merged into Spark MLlib now.

https://github.com/AlpineNow/alpineml

It's been tested a lot, and the accuracy and training-time benchmarks are
great. There could be some bugs here and there, so we're looking forward
to your feedback; please let us know what you think.

We'll continue to improve it, and we'll be adding Gradient Boosting in
the near future as well.

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Aug 11, 2014 at 10:52 AM, Sameer Tilak  wrote:

> Hi All,
> I read on the mailing list that random forest implementation was on the
> roadmap. I wanted to check about its status? We are currently using Weka
> and would like to move over to MLib for performance.
>


Re: [MLLib]:choosing the Loss function

2014-08-11 Thread DB Tsai
Hi SK,

I'm working on a PR that adds a logistic regression interface with LBFGS.
It will be merged in the Spark 1.1 release, I hope.
https://github.com/apache/spark/pull/1862

Before it's merged, you can just copy the code into your application to
use it. I'm also working on another PR which automatically rescales the
training set to improve the condition number of the optimization problem.
After training, the scaling factors are folded back into the weights, so
the whole thing is transparent to users. Libsvm and glmnet do this to
deal with datasets that have huge variance in some columns.
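For the scaling-only case (no centering), the fold-back step is roughly
the following sketch, where sigma holds the per-column standard
deviations used for rescaling (names are illustrative, not the PR's
actual code):

  // w'^T (x / sigma) = (w' / sigma)^T x, so dividing the learned weights by
  // sigma yields weights that apply directly to the original features.
  def unscaleWeights(scaledWeights: Array[Double], sigma: Array[Double]): Array[Double] =
    scaledWeights.zip(sigma).map { case (w, s) => if (s == 0.0) 0.0 else w / s }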


Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Mon, Aug 11, 2014 at 2:21 PM, Burak Yavuz  wrote:

> Hi,
>
> // Initialize the optimizer using logistic regression as the loss function
> with L2 regularization
> val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
>
> // Set the hyperparameters
>
> lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorrections(numCor)
>
> // Retrieve the weights
> val weightsWithIntercept = lbfgs.optimize(data, initialWeights)
>
> //Slice weights with intercept into weight and intercept
>
> //Initialize Logistic Regression Model
> val model = new LogisticRegressionModel(weights, intercept)
>
> model.predict(test) //Make your predictions
>
> The example code doesn't generate the Logistic Regression Model that you
> can make predictions with.
>
> `LBFGS.runMiniBatchLBFGS` outputs a tuple of (weights, lossHistory). The
> example code was for a benchmark, so it was interested more
> in the loss history than the model itself.
>
> You can also run
> `val (weightsWithIntercept, localLoss) = LBFGS.runMiniBatchLBFGS ...`
>
> slice `weightsWithIntercept` into the intercept and the rest of the
> weights and instantiate the model again as:
> val model = new LogisticRegressionModel(weights, intercept)
>
>
> Burak
>
>
>
> - Original Message -
> From: "SK" 
> To: u...@spark.incubator.apache.org
> Sent: Monday, August 11, 2014 11:52:04 AM
> Subject: Re: [MLLib]:choosing the Loss function
>
> Hi,
>
> Thanks for the reference to the LBFGS optimizer.
> I tried to use the LBFGS optimizer, but I am not able to pass it  as an
> input to the LogisticRegression model for binary classification. After
> studying the code in mllib/classification/LogisticRegression.scala, it
> appears that the  only implementation of LogisticRegression uses
> GradientDescent as a fixed optimizer. In other words, I dont see a
> setOptimizer() function that I can use to change the optimizer to LBFGS.
>
> I tried to follow the code in
>
> https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala
> that makes use of LBFGS, but it is not clear to me where  the
> LogisticRegression  model with LBFGS is being returned that I can use for
> the classification of the test dataset.
>
> If some one has sample code that uses LogisticRegression with LBFGS instead
> of gradientDescent as the optimization algorithm, it would be helpful if
> you
> can post it.
>
> thanks
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-choosing-the-Loss-function-tp11738p11913.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

