Fwd: Oryx + Spark mllib

2014-10-18 Thread Debasish Das
Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either Streaming data / Batch data in HDFS and cross validated with mllib APIs but the model serving layer will give API endpoints like Oryx and read the models may be from hdfs/impa

Oryx + Spark mllib

2014-10-18 Thread Debasish Das
Hi, Is someone working on a project on integrating Oryx model serving layer with Spark ? Models will be built using either Streaming data / Batch data in HDFS and cross validated with mllib APIs but the model serving layer will give API endpoints like Oryx and read the models may be from hdfs/impa

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-10-17 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14175167#comment-14175167 ] Debasish Das commented on SPARK-2426: - 1. [~mengxr] Our legal was clear that Stan

NNLS bug

2014-10-16 Thread Debasish Das
Hi, I am validating the proximal algorithm for positive and bound constrained ALS and I came across the bug detailed in the JIRA while running ALS with NNLS: https://issues.apache.org/jira/browse/SPARK-3987 ADMM based proximal algorithm came up with correct result... Thanks. Deb

[jira] [Created] (SPARK-3987) NNLS generates incorrect result

2014-10-16 Thread Debasish Das (JIRA)
Debasish Das created SPARK-3987: --- Summary: NNLS generates incorrect result Key: SPARK-3987 URL: https://issues.apache.org/jira/browse/SPARK-3987 Project: Spark Issue Type: Bug

Re: Play framework

2014-10-16 Thread Debasish Das
It will really help if the spark users can point to github examples that integrated spark and play...specifically SparkSQL and Play... On Thu, Oct 16, 2014 at 9:23 AM, Mohammed Guller wrote: > Daniel, > > Thanks for sharing this. It is very helpful. > > > > The reason I want to use Spark subm

Re: Issues with ALS positive definite

2014-10-16 Thread Debasish Das
Just checked, QR is exposed by netlib: import org.netlib.lapack.Dgeqrf For the equality and bound version, I will use QR...it will be faster than the LU that I am using through jblas.solveSymmetric... On Thu, Oct 16, 2014 at 8:34 AM, Debasish Das wrote: > @xiangrui should we add this epsi

Re: Issues with ALS positive definite

2014-10-16 Thread Debasish Das
benchmarked that but I opted for QR in a different implementation and it > has worked fine. > > Now I have to go hunt for how the QR decomposition is exposed in BLAS... > Looks like its GEQRF which JBLAS helpfully exposes. Debasish you could try > it for fun at least. > On Oct 15,

Re: Issues with ALS positive definite

2014-10-15 Thread Debasish Das
ct 15, 2014 at 5:01 PM, Liquan Pei wrote: > Hi Debasish, > > I think ||r - wi'hj||^{2} is semi-positive definite. > > Thanks, > Liquan > > On Wed, Oct 15, 2014 at 4:57 PM, Debasish Das > wrote: > >> Hi, >> >> If I take the Movielens data and run the

Issues with ALS positive definite

2014-10-15 Thread Debasish Das
Hi, If I take the Movielens data and run the default ALS with regularization as 0.0, I am hitting exception from LAPACK that the gram matrix is not positive definite. This is on the master branch. This is how I run it : ./bin/spark-submit --total-executor-cores 1 --master spark:// tusca09lmlvt00

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Debasish Das
Awesome news Matei ! Congratulations to the databricks team and all the community members... On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia wrote: > Hi folks, > > I interrupt your regularly scheduled user / dev list to bring you some > pretty cool news for the project, which is that we've been

Re: protobuf error running spark on hadoop 2.4

2014-10-08 Thread Debasish Das
I have faced this in the past and I had to add the profile -Phadoop-2.3... mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DskipTests install On Wed, Oct 8, 2014 at 1:40 PM, Chuang Liu wrote: > Hi: > > I tried to build Spark (1.1.0) with hadoop 2.4.0, and ran a simple > wordcount example

Re: Local tests logging to log4j

2014-10-07 Thread Debasish Das
N > log4j.logger.kafka=WARN > log4j.logger.akka=WARN > log4j.logger.org.apache.spark=WARN > log4j.logger.org.apache.spark.storage.BlockManager=ERROR > log4j.logger.org.apache.zookeeper=WARN > log4j.logger.org.eclipse.jetty=WARN > log4j.logger.org.I0Itec.zkclient=WARN > > On Tue, Oct 7, 2014 at 7:42 PM, Deb

Local tests logging to log4j

2014-10-07 Thread Debasish Das
Hi, I have added some changes to ALS tests and I am re-running tests as: mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn -DwildcardSuites=org.apache.spark.mllib.recommendation.ALSSuite test I have some INFO logs in the code which I want to see on my console. They work fine if I add print

Re: lazy evaluation of RDD transformation

2014-10-06 Thread Debasish Das
Another rule of thumb is to definitely cache the RDD over which you need to do iterative analysis... For the rest, cache only if you have a lot of free memory ! On Mon, Oct 6, 2014 at 2:39 PM, Sean Owen wrote: > I think you mean that data2 is a function of data1 in the first > example. I ima

Impala comparisons

2014-10-04 Thread Debasish Das
Hi, We write the output of models and other information as parquet files and later we let data APIs run SQL queries on the columnar data... SparkSQL is used to dump the data in parquet format and now we are considering whether to use SparkSQL or Impala to read it back... I came across this benchm

Re: MLLib: Missing value imputation

2014-10-01 Thread Debasish Das
If the missing values are 0, then you can also look into implicit formulation... On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng wrote: > We don't handle missing value imputation in the current version of > MLlib. In future releases, we can store feature information in the > dataset metadata, wh

Re: Spark AccumulatorParam generic

2014-10-01 Thread Debasish Das
Can't you extend a class in place of object which can be generic ? class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] { } On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg wrote: > Just realized that, of course, objects can't be generic, but how do I > create a generic AccumulatorParam?
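
The suggestion above (a class with a type parameter instead of an object) can be sketched as follows. This is only an illustration, not code from the thread: the trait below is a local stand-in that mirrors the two methods of Spark 1.x's org.apache.spark.AccumulatorParam so the sketch compiles without Spark, and the name SeqAccumulatorParam is hypothetical.

```scala
// Local stand-in for Spark 1.x's org.apache.spark.AccumulatorParam,
// declared here only so the sketch compiles without Spark on the classpath.
trait AccumulatorParam[T] {
  def zero(initialValue: T): T
  def addInPlace(r1: T, r2: T): T
}

// Hypothetical name. A class, unlike an object, can take a type parameter.
class SeqAccumulatorParam[B] extends AccumulatorParam[Seq[B]] {
  // The identity element for concatenation is the empty sequence.
  def zero(initialValue: Seq[B]): Seq[B] = Seq.empty[B]

  // Merging two partial results concatenates them.
  def addInPlace(r1: Seq[B], r2: Seq[B]): Seq[B] = r1 ++ r2
}
```

With real Spark 1.x, one would presumably drop the stand-in trait, import the Spark one, and pass an instance such as new SeqAccumulatorParam[String] when creating the accumulator.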

Re: memory vs data_size

2014-09-30 Thread Debasish Das
Only fit the data in memory where you want to run the iterative algorithm... For map-reduce operations, it's better not to cache if you have a memory crunch... Also schedule the persist and unpersist such that you utilize the RAM well... On Tue, Sep 30, 2014 at 4:34 PM, Liquan Pei wrote: > Hi

Re: Handling tree reduction algorithm with Spark in parallel

2014-09-30 Thread Debasish Das
If the tree is too big, build it on graphx...but it will need thorough analysis so that the partitions are well balanced... On Tue, Sep 30, 2014 at 2:45 PM, Andy Twigg wrote: > Hi Boromir, > > Assuming the tree fits in memory, and what you want to do is parallelize > the computation, the 'obviou

Re: Cluster tests failing

2014-09-30 Thread Debasish Das
I have done mvn clean several times... Consistently, all the mllib tests that use LocalClusterSparkContext.scala fail !

Cluster tests failing

2014-09-30 Thread Debasish Das
Hi, Inside mllib I am running tests using: mvn -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn install The local tests run fine but cluster tests are failing.. LBFGSClusterSuite: - task size should be small *** FAILED *** org.apache.spark.SparkException: Job aborted due to stage failure

Re: Hyper Parameter Optimization Algorithms

2014-09-29 Thread Debasish Das
You should look into Evan Sparks' talk from Spark Summit 2014 http://spark-summit.org/2014/talk/model-search-at-scale I am not sure if some of it is already open sourced through MLBase... On Mon, Sep 29, 2014 at 7:45 PM, Lochana Menikarachchi wrote: > Hi, > > Is there anyone who works on hyper

Re: task getting stuck

2014-09-24 Thread Debasish Das
ep 24, 2014 at 9:41 AM, Debasish Das wrote: > spark SQL reads parquet file fine...did you follow one of these to > read/write parquet from spark ? > > http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ > > On Wed, Sep 24, 2014 at 9:29 AM, Ted Yu wrote: > >> Addi

Re: task getting stuck

2014-09-24 Thread Debasish Das
>>> I was thinking along the same line. >>> >>> Jianshi: >>> See >>> http://hbase.apache.org/book.html#d0e6369 >>> >>> On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das >>> wrote: >>> >>>> HBase regionserver nee

Re:

2014-09-24 Thread Debasish Das
HBase regionserver needs to be balanced...you might have some skewness in row keys and one regionserver is under pressure...try finding that key and replicate it using random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang wrote: > Hi Ted, > > It converts RDD[Edge] to HBase rowkey and colu
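
The "random salt" idea in the reply above can be sketched in plain Scala. This is a hedged illustration only: the object name KeySalting, the '|' separator, and the bucket count are assumptions made for the example and do not come from the thread or from HBase.

```scala
import scala.util.Random

// Hypothetical helper: spreads one hot row key across `buckets` key prefixes.
object KeySalting {
  // Prefix the key with a random bucket number in [0, buckets), so writes for
  // the same logical key land on different regions of the key space.
  def salt(key: String, buckets: Int, rng: Random): String =
    s"${rng.nextInt(buckets)}|$key"

  // Recover the logical key by dropping everything up to the first '|'.
  // The numeric prefix never contains '|', so this is safe even when the
  // original key does.
  def unsalt(salted: String): String =
    salted.substring(salted.indexOf('|') + 1)
}
```

A reader of the salted table then has to look up all bucket prefixes for a given logical key, which is the usual cost of this approach.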

Re: Distributed dictionary building

2014-09-21 Thread Debasish Das
ay not happen to have > exhibited in your test. > > On Sun, Sep 21, 2014 at 2:13 AM, Debasish Das > wrote: > > Some more debug revealed that as Sean said I have to keep the > dictionaries > > persisted till I am done with the RDD manipulation. > > > > Thank

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
the DAG to speculate such things...similar to branch prediction ideas from comp arch... On Sat, Sep 20, 2014 at 1:56 PM, Debasish Das wrote: > I changed zipWithIndex to zipWithUniqueId and that seems to be working... > > What's the difference between zipWithIndex vs zipWithUniq

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
's not very clear from docs... On Sat, Sep 20, 2014 at 1:48 PM, Debasish Das wrote: > I did not persist / cache it as I assumed zipWithIndex will preserve > order... > > There is also zipWithUniqueId...I am trying that...If that also shows the > same issue, we should make it c

Re: Distributed dictionary building

2014-09-20 Thread Debasish Das
is being used to assign IDs. From a > recent JIRA discussion I understand this is not deterministic within a > partition so the index can be different when the RDD is reevaluated. If you > need it fixed, persist the zipped RDD on disk or in memory. > On Sep 20, 2014 8:10 PM, "

Distributed dictionary building

2014-09-20 Thread Debasish Das
Hi, I am building a dictionary of RDD[(String, Long)] and after the dictionary is built and cached, I find key "almonds" at value 5187 using: rdd.filter{case(product, index) => product == "almonds"}.collect Output: Debug product almonds index 5187 Now I take the same dictionary and write it out

Re: I want to contribute MLlib two quality measures(ARHR and HR) for top N recommendation system. Is this meaningful?

2014-09-19 Thread Debasish Das
Thanks Christoph. Are these numbers for mllib als implicit and explicit feedback on movielens/netflix datasets documented on JIRA ? On Sep 19, 2014 1:16 PM, "Christoph Sawade" < christoph.saw...@googlemail.com> wrote: > Hey Deb, > > NDCG is the "Normalized Discounted Cumulative Gain" [1]. Anothe

Re: I want to contribute MLlib two quality measures(ARHR and HR) for top N recommendation system. Is this meaningful?

2014-09-19 Thread Debasish Das
Hi Xiangrui, Could you please point to some reference for calculating prec@k and ndcg@k ? prec is precision I suppose but ndcg I have no idea about... Thanks. Deb On Mon, Aug 25, 2014 at 12:28 PM, Xiangrui Meng wrote: > The evaluation metrics are definitely useful. How do they differ from >

Re: Huge matrix

2014-09-18 Thread Debasish Das
The PR will be updated > today. > Best, > Reza > > On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das > wrote: > >> Hi Reza, >> >> Have you tested if different runs of the algorithm produce different >> similarities (basically if the algorithm is deterministic) ?

Re: Huge matrix

2014-09-18 Thread Debasish Das
. We can add jaccard and other similarity measures in > later PRs. > > In the meantime, you can un-normalize the cosine similarities to get the > dot product, and then compute the other similarity measures from the dot > product. > > Best, > Reza > > > On Wed, S
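
The un-normalization step mentioned in the quoted reply can be written out as a small worked example. This is a sketch under stated assumptions: the object and method names are hypothetical, and the Jaccard formula holds only for 0/1 (set-indicator) vectors, where the dot product is the intersection size and the squared norm is the set size.

```scala
// Hypothetical helper, not part of MLlib.
object SimilarityFromCosine {
  // cos(a, b) = dot(a, b) / (||a|| * ||b||), so
  // dot(a, b) = cos(a, b) * ||a|| * ||b||.
  def dotFromCosine(cosine: Double, normA: Double, normB: Double): Double =
    cosine * normA * normB

  // Assumes 0/1 vectors: intersection size = dot, set size = ||v||^2, hence
  // jaccard = dot / (||a||^2 + ||b||^2 - dot).
  def jaccardFromCosine(cosine: Double, normA: Double, normB: Double): Double = {
    val dot = dotFromCosine(cosine, normA, normB)
    dot / (normA * normA + normB * normB - dot)
  }
}
```

For a = (1, 1, 0) and b = (1, 0, 1), both norms are sqrt(2) and the cosine is 0.5, so the recovered dot product is 1 and the Jaccard similarity is 1 / (2 + 2 - 1) = 1/3.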

Re: MLLib regression model weights

2014-09-18 Thread Debasish Das
sc.parallelize(model.weights.toArray, blocks).top(k) will get that right ? For logistic you might want both positive and negative feature...so just pass it through a filter on abs and then pick top(k) On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak wrote: > Hi All, > > I am able to run LinearReg
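
For the logistic case described above (keeping both strongly positive and strongly negative features), a local, non-distributed sketch of "top k by absolute value" might look like this. The object and method names are hypothetical, and the input is a plain array standing in for model.weights.toArray.

```scala
// Hypothetical helper, not part of MLlib.
object TopWeights {
  // Returns the k (featureIndex, weight) pairs with the largest |weight|,
  // keeping the sign so positive and negative features stay distinguishable.
  def topKByMagnitude(weights: Array[Double], k: Int): Array[(Int, Double)] =
    weights.zipWithIndex
      .map { case (w, i) => (i, w) }
      .sortBy { case (_, w) => -math.abs(w) }
      .take(k)
}
```

For example, TopWeights.topKByMagnitude(Array(0.1, -2.0, 0.5, 1.5), 2) yields the pairs (1, -2.0) and (3, 1.5). Weight vectors usually fit comfortably on the driver, so sorting locally may be simpler than parallelizing them first.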

Re: Huge matrix

2014-09-18 Thread Debasish Das
n the meantime, you can un-normalize the cosine similarities to get the > dot product, and then compute the other similarity measures from the dot > product. > > Best, > Reza > > > On Wed, Sep 17, 2014 at 6:52 PM, Debasish Das > wrote: > >> Hi Reza, >> >

Joining multiple rowMatrix

2014-09-18 Thread Debasish Das
Hi, I have some RowMatrices all with the same key (MatrixEntry.i, MatrixEntry.j) and I would like to join multiple matrices to come up with a sqlTable for each key... What's the best way to do it ? Right now I am doing N joins if I want to combine data from N matrices which does not look quite r

Re: MLLib: LIBSVM issue

2014-09-17 Thread Debasish Das
We dump fairly big libsvm to compare against liblinear/libsvm...the following code dumps out libsvm format from SparseVector... def toLibSvm(features: SparseVector): String = { val indices = features.indices.map(_ + 1) val values = features.values indices.zip(values).mkString(" ").r
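
Since the snippet above is cut off, here is a separate, self-contained sketch of writing one line in LIBSVM text format (label followed by 1-based index:value pairs). It is not the author's function: it takes plain arrays instead of an MLlib SparseVector, adds a label argument, and the names are hypothetical.

```scala
// Hypothetical helper, not the code from the message above.
object LibSvmFormat {
  // `indices` are 0-based, as in MLlib's SparseVector; LIBSVM expects 1-based
  // feature indices, so each index is shifted by one on output.
  def toLibSvmLine(label: Double, indices: Array[Int], values: Array[Double]): String = {
    require(indices.length == values.length, "indices and values must align")
    val features = indices.zip(values).map { case (i, v) => s"${i + 1}:$v" }
    (label.toString +: features).mkString(" ")
  }
}
```

For example, LibSvmFormat.toLibSvmLine(1.0, Array(0, 3), Array(0.5, 2.0)) produces the line "1.0 1:0.5 4:2.0".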

Re: Huge matrix

2014-09-17 Thread Debasish Das
RowMatrix and CoordinateMatrix to be templated on the value... Are you considering this in your design ? Thanks. Deb On Tue, Sep 9, 2014 at 9:45 AM, Reza Zadeh wrote: > Better to do it in a PR of your own, it's not sufficiently related to > dimsum > > On Tue, Sep 9, 2014 at 7:03

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Debasish Das
Congratulations on the 1.1 release ! On Thu, Sep 11, 2014 at 9:08 PM, Matei Zaharia wrote: > Thanks to everyone who contributed to implementing and testing this > release! > > Matei > > On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote: > > Thanks for all the good work. V

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
ALS is using a bunch of off-heap memory?). You mentioned > earlier in this thread that the property wasn't showing up in the > Environment tab. Are you sure it's making it in? > > -Sandy > > On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das > wrote: > >> Hmm...I d

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
. > > -Sandy > > On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das > wrote: > >> Hi Sandy, >> >> Any resolution for YARN failures ? It's a blocker for running spark on >> top of YARN. >> >> Thanks. >> Deb >> >> On Tue, Aug 19,

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
21 . We know that the > container got killed by YARN because it used much more memory than it > requested. But we haven't figured out the root cause yet. > > +Sandy > > Best, > Xiangrui > > On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das > wrote: > > Hi, >

Re: Huge matrix

2014-09-09 Thread Debasish Das
her one. For dense matrices with say, 1m > columns this won't be computationally feasible and you'll want to start > sampling with dimsum. > > It would be helpful to have a loadRowMatrix function, I would use it. > > Best, > Reza > > On Tue, Sep 9, 2014 at 12:05

Re: Huge matrix

2014-09-09 Thread Debasish Das
y in a future PR, probably > still for 1.2 > > > On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das > wrote: > >> Awesome...Let me try it out... >> >> Any plans of putting other similarity measures in future (jaccard is >> something that will be useful) ? I gue

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
how to do linear programming in a distributed way. > -Xiangrui > > On Mon, Sep 8, 2014 at 7:12 AM, Debasish Das > wrote: > > Xiangrui, > > > > Should I open up a JIRA for this ? > > > > Distributed lp/socp solver through ecos/ldl/amd ? > > > > I c

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
e jni version of ldl and amd which are lgpl... Let me know. Thanks. Deb On Sep 8, 2014 7:04 AM, "Debasish Das" wrote: > Durin, > > I have integrated ecos with spark which uses suitesparse under the hood > for linear equation solves...I have exposed only the qp solver

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Debasish Das
Durin, I have integrated ecos with spark which uses suitesparse under the hood for linear equation solves...I have exposed only the qp solver api in spark since I was comparing ip with proximal algorithms but we can expose suitesparse api as well...jni is used to load up ldl amd and ecos librarie

Re: Huge matrix

2014-09-05 Thread Debasish Das
sum with gamma as PositiveInfinity turns it > into the usual brute force algorithm for cosine similarity, there is no > sampling. This is by design. > > > On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das > wrote: > >> I looked at the code: similarColumns(Double.posIn

Re: Huge matrix

2014-09-05 Thread Debasish Das
ring (perhaps after dimensionality > reduction) if your goal is to find batches of similar points instead of all > pairs above a threshold. > > > > > On Fri, Sep 5, 2014 at 8:02 PM, Debasish Das > wrote: > >> Also for tall and wide (rows ~60M, columns 10M), I am conside

Re: Huge matrix

2014-09-05 Thread Debasish Das
Also for tall and wide (rows ~60M, columns 10M), I am considering running a matrix factorization to reduce the dimension to say ~60M x 50 and then run all pair similarity... Did you also try similar ideas and saw positive results ? On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das wrote: >

Re: Huge matrix

2014-09-05 Thread Debasish Das
you don't have to redo your code. Your call if you need it before a week. > Reza > > > On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das > wrote: > >> Ohh cool...all-pairs brute force is also part of this PR ? Let me pull >> it in and test on our dataset... >> &g

Re: Huge matrix

2014-09-05 Thread Debasish Das
e/spark/pull/1778 > > Your question wasn't entirely clear - does this answer it? > > Best, > Reza > > > On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das > wrote: > >> Hi Reza, >> >> Have you compared with the brute force algorithm for sim

Re: Huge matrix

2014-09-05 Thread Debasish Das
Hi Reza, Have you compared with the brute force algorithm for similarity computation with something like the following in Spark ? https://github.com/echen/scaldingale I am adding cosine similarity computation but I do want to compute an all pair similarities... Note that the data is sparse for

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Debasish Das
Breeze author David also has a github project on cuda binding in scala...do you prefer using java or scala ? On Aug 27, 2014 2:05 PM, "Frank van Lankvelt" wrote: > you could try looking at ScalaCL[1], it's targeting OpenCL rather than > CUDA, but that might be close enough? > > cheers, Frank >

Re: LDA example?

2014-08-22 Thread Debasish Das
Hi Burak, This LDA implementation is friendly to the equality and positivity als code that I added in the following JIRA to formulate robust plsa https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-2426 Should I build upon the PR that you pointed ? I want to run some experiment

Re: Lost executor on YARN ALS iterations

2014-08-21 Thread Debasish Das
odeManager > configuration, yarn.nodemanager.vmem-check-enabled is set to false. > > -Sandy > > > On Wed, Aug 20, 2014 at 12:27 AM, Debasish Das > wrote: > >> I could reproduce the issue in both 1.0 and 1.1 using YARN...so this is >> definitely a YARN related problem... &

Re: Akka usage in Spark

2014-08-20 Thread Debasish Das
sai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Wed, Aug 20, 2014 at 3:19 PM, Debasish Das > wrote: > > Hi Patrick, > > > > Last few days I came across some bugs which got exposed due to ALS runs > on > > large scale data...although it

Re: Akka usage in Spark

2014-08-20 Thread Debasish Das
rk's actor system > directly - it is an internal communication component in Spark and could > e.g. be re-factored later to not use akka at all. Could you elaborate a bit > more on your use case? > > - Patrick > > > On Wed, Aug 20, 2014 at 9:02 AM, Debasish Das > wrot

Akka usage in Spark

2014-08-20 Thread Debasish Das
Hi, There have been some recent changes in the way akka is used in spark and I feel they are major changes... Is there a design document / JIRA / experiment on large datasets that highlight the impact of changes (1.0 vs 1.1) ? Basically it will be great to understand where akka is used in the cod

Re: Lost executor on YARN ALS iterations

2014-08-20 Thread Debasish Das
issue as described in > https://issues.apache.org/jira/browse/SPARK-2121 . We know that the > container got killed by YARN because it used much more memory than it > requested. But we haven't figured out the root cause yet. > > +Sandy > > Best, > Xiangrui > > O

Lost executor on YARN ALS iterations

2014-08-19 Thread Debasish Das
Hi, During the 4th ALS iteration, I am noticing that one of the executors gets disconnected: 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 disconnected, so removing it 14

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-19 Thread Debasish Das
ed on YARN ? @dbtsai did your assembly on YARN run fine or are you still noticing these exceptions ? Thanks. Deb On Thu, Aug 14, 2014 at 5:48 PM, Reynold Xin wrote: > Here: https://github.com/apache/spark/pull/1948 > > > > On Thu, Aug 14, 2014 at 5:45 PM, Debasish Das > wro

Re: [GitHub] spark pull request: [SPARK-3045] [SPARK-3046] Make Serializer inte...

2014-08-18 Thread Debasish Das
With the fixes, I could run it fine on top of branch-1.0 On master when running on YARN I am getting another KryoException: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 247 in stage 52.0 failed 4 times, most recent failure: Lost task 247.3 in

Spark on YARN webui

2014-08-18 Thread Debasish Das
Hi, We are running the snapshots (new spark features) on YARN and I was wondering if the webui is available on YARN mode... The deployment document does not mention webui on YARN mode... Is it available ? Thanks. Deb

Re: MLLib: implementing ALS with distributed matrix

2014-08-17 Thread Debasish Das
Hi Wei, Sparkler code was not available for benchmarking and so I picked up Jellyfish which uses SGD and if you look at the paper, the ideas are very similar to sparkler paper but Jellyfish is on shared memory and uses C code while sparkler was built on top of spark...Jellyfish used some interesti

Re: Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-16 Thread Debasish Das
Hi Brandon, Looks very cool...will try it out for ad-hoc analysis of our datasets and provide more feedback... Could you please give bit more details about the differences of Spindle architecture compared to Hue + Spark integration (python stack) and Ooyala Jobserver ? Does Spindle allow sharing

ALS checkpoint performance

2014-08-15 Thread Debasish Das
Hi, Are there any experiments detailing the performance hit due to HDFS checkpoint in ALS ? As we scale to large ranks with more ratings, I believe we have to cut the RDD lineage to safeguard against the lineage issue... Thanks. Deb

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread Debasish Das
DB, Did you compare softmax regression with one-vs-all and found that softmax is better ? one-vs-all can be implemented as a wrapper over binary classifier that we have in mllib...I am curious if softmax multinomial is better on most cases or is it worthwhile to add a one vs all version of mlor a

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-15 Thread Debasish Das
5:48 PM, "Reynold Xin" wrote: > Here: https://github.com/apache/spark/pull/1948 > > > > On Thu, Aug 14, 2014 at 5:45 PM, Debasish Das > wrote: > >> Is there a fix that I can test ? I have the flows setup for both >> standalone and YARN runs... >

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-14 Thread Debasish Das
't have the whole context and obviously I haven't spent nearly >>>>> as much time on this as you have, but I'm wondering what if we always pass >>>>> the executor's ClassLoader to the Kryo serializer? Will that solve this >>>>> proble

Performance hit for using sc.setCheckPointDir

2014-08-14 Thread Debasish Das
Hi, For our large ALS runs, we are considering using sc.setCheckPointDir so that the intermediate factors are written to HDFS and the lineage is broken... Is there a comparison which shows the performance degradation due to these options ? If not I will be happy to add experiments with it... Tha

Re: SPARK_LOCAL_DIRS

2014-08-14 Thread Debasish Das
Actually I faced it yesterday... I had to put it in spark-env.sh and take it out from spark-defaults.conf on 1.0.1...Note that this setting should be visible on all workers.. After that I validated that SPARK_LOCAL_DIRS was indeed getting used for shuffling... On Thu, Aug 14, 2014 at 10:27 AM,

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-14 Thread Debasish Das
the default). >>> Theoretically Spark supports custom serialisers, but due to a related >>> issue, custom serialisers currently can't live in application jars and must >>> be available to all executors at launch. My PR fixes this issue as well, >>> allowin

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-13 Thread Debasish Das
Sorry I just saw Graham's email after sending my previous email about this bug... I have been seeing this same issue on our ALS runs last week but I thought it was due to my hacky way to run mllib 1.1 snapshot on core 1.0... What's the status of this PR ? Will this fix be back-ported to 1.0.1 as we

Kryo serialization issues

2014-08-13 Thread Debasish Das
Hi, Is there a JIRA for this bug ? I have seen it multiple times during our ALS runs now...some runs don't show while some runs fail due to the error msg https://github.com/GrahamDennis/spark-kryo-serialisation/blob/master/README.md One way to circumvent this is to not use kryo but then I am no

SPARK_LOCAL_DIRS option

2014-08-13 Thread Debasish Das
Hi, I have set up the SPARK_LOCAL_DIRS option in spark-env.sh so that Spark can use more shuffle space... Does Spark clean all the shuffle files once the runs are done ? Seems to me that the shuffle files are not cleaned... Do I need to set this variable ? spark.cleaner.ttl Right now we are pl

Re: Contribution to Spark MLLib

2014-08-13 Thread Debasish Das
Dennis, If it is PLSA with least square loss then the QuadraticMinimizer that we open sourced should be able to solve it for modest topics (till 1000 I believe)...if we integrate a cg solver for equality (Nocedal's KNITRO paper is the reference) the topic size can be increased much larger than ALS

[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-08-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095232#comment-14095232 ] Debasish Das edited comment on SPARK-2426 at 8/13/14 3:3

[jira] [Updated] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-08-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Debasish Das updated SPARK-2426: Description: Current ALS supports least squares and nonnegative least squares. I presented ADMM

[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-08-13 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095232#comment-14095232 ] Debasish Das commented on SPARK-2426: - Hi Xiangrui, The branch is ready fo

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-12 Thread Debasish Das
I figured out the issue...the driver memory was at 512 MB and for our datasets, the following code needed more memory... // Materialize usersOut and productsOut. usersOut.count() productsOut.count() Thanks. Deb On Sat, Aug 9, 2014 at 6:12 PM, Debasish Das wrote: > Actually nope it

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
LS locally...Most likely it is a bug Thanks. Deb On Sat, Aug 9, 2014 at 11:12 AM, Debasish Das wrote: > Including mllib inside assembly worked fine...If I deploy only the core > and send mllib as --jars then this problem shows up... > > Xiangrui could you please comment if it is a bu

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
wrote: > I was having this same problem early this week and had to include my > changes in the assembly. > > > On Sat, Aug 9, 2014 at 9:59 AM, Debasish Das > wrote: > >> I validated that I can reproduce this problem with master as well (without >> adding an

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) I will try now with mllib inside the assembly...If that works then something is weird here ! On Sat, Aug 9, 2014 at 12:46 AM, Debasish Das wrote: > Hi Xiangrui, > > Based on your suggestion I moved core and

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Debasish Das
piled with Java 1.7_55 but the cluster JRE is at 1.7_45. Thanks. Deb On Wed, Aug 6, 2014 at 12:01 PM, Debasish Das wrote: > I did not play with Hadoop settings...everything is compiled with > 2.3.0CDH5.0.2 for me... > > I did try to bump the version number of HBase from 0.94 to 0.

Re: [SNAPSHOT] Snapshot1 of Spark 1.1.0 has been posted

2014-08-08 Thread Debasish Das
Hi Patrick, I am testing the 1.1 branch but I see lot of protobuf warnings while building the jars: [warn] Class com.google.protobuf.Parser not found - continuing with a stub. [warn] Class com.google.protobuf.Parser not found - continuing with a stub. [warn] Class com.google.protobuf.Parser not

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
> One related question, is mllib jar independent from hadoop version (doesnt > use hadoop api directly)? Can I use mllib jar compile for one version of > hadoop and use it in another version of hadoop? > > Sent from my Google Nexus 5 > On Aug 6, 2014 8:29 AM, "Debasish D

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
I'm really interested in how > they differ in the final recommendation? It would be great if you can > test prec@k or ndcg@k metrics. > > Best, > Xiangrui > > On Wed, Aug 6, 2014 at 8:28 AM, Debasish Das > wrote: > > Hi Xiangrui, > > > > Maintaining another

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) On Tue, Aug 5, 2014 at 5:59 PM, Debasish Das wrote: > Hi Xiangrui, > > I used your idea and kept a cherry picked version of ALS.sc

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-05 Thread Debasish Das
: > If you cannot change the Spark jar deployed on the cluster, an easy > solution would be renaming ALS in your jar. If userClassPathFirst > doesn't work, could you create a JIRA and attach the log? Thanks! > -Xiangrui > > On Tue, Aug 5, 2014 at 9:10 AM, Debasish Das >

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-05 Thread Debasish Das
rst is behaving, there might be bugs in it... Any suggestions will be appreciated Thanks. Deb On Sat, Aug 2, 2014 at 11:12 AM, Xiangrui Meng wrote: > Yes, that should work. spark-mllib-1.1.0 should be compatible with > spark-core-1.0.1. > > On Sat, Aug 2, 2014 at 10:54 AM, Debasi

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-02 Thread Debasish Das
'm not > sure whether it could solve your problem. -Xiangrui > > On Sat, Aug 2, 2014 at 10:13 AM, Debasish Das > wrote: > > Hi, > > > > I have deployed spark stable 1.0.1 on the cluster but I have new code > that > > I added in mllib-1.1.0-SNAPSHOT. > &g

Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-02 Thread Debasish Das
Hi, I have deployed spark stable 1.0.1 on the cluster but I have new code that I added in mllib-1.1.0-SNAPSHOT. I am trying to access the new code using spark-submit as follows: spark-job --class com.verizon.bda.mllib.recommendation.ALSDriver --executor-memory 16g --total-executor-cores 16 --jar

Re: MLlib NNLS implementation is buggy, returning wrong solutions

2014-07-28 Thread Debasish Das
Hi Aureliano, Will it be possible for you to give the test-case ? You can add it to JIRA as well as an attachment I guess... I am preparing the PR for ADMM based QuadraticMinimizer...In my matlab experiments with scaling the rank to 1000 and beyond (which is too high for ALS but gives a good idea

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
I found the issue... If you use spark git and generate the assembly jar then org.apache.hadoop.io.Writable.class is packaged with it If you use the assembly jar that ships with CDH in /opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.2-hadoop2.3.0-cdh5.0.2.jar,

Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
Hi, We have been using standalone spark for last 6 months and I used to run application jars fine on spark cluster with the following command. java -cp ":/app/data/spark_deploy/conf:/app/data/spark_deploy/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar:./app.jar" -Xms2g -Xmx2g -Ds

[jira] [Commented] (SPARK-2602) sbt/sbt test steals window focus on OS X

2014-07-20 Thread Debasish Das (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068094#comment-14068094 ] Debasish Das commented on SPARK-2602: - CDH5 does not even support java6 any
