Re: Apache Spark - MLLib challenges

2017-09-23 Thread Aseem Bansal
This is something I wrote specifically for the challenges that we faced when taking spark ml models to production http://www.tothenew.com/blog/when-you-take-your-machine-learning-models-to-production-for-real-time-predictions/ On Sat, Sep 23, 2017 at 1:33 PM, Jörn Franke

NullPointer when collecting a dataset grouped by a column

2017-07-24 Thread Aseem Bansal
I was doing this: dataset.groupBy("column").collectAsList() It worked for a small dataset, but for a bigger dataset I got a NullPointerException whose stacktrace went down into Spark's code. Is this known behaviour? Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org

Re: Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
>* Alias for avg. >* >* @group agg_funcs >* @since 1.4.0 >*/ > def mean(e: Column): Column = avg(e) > > > That's the same when the argument is the column name. > > So no difference between mean and avg functions. > > > --

Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
If I want to aggregate mean and subtract from my column I can do either of the following in Spark 2.1.0 Java API. Is there any difference between these? Couldn't find anything from reading the docs. dataset.select(mean("mycol")) dataset.agg(mean("mycol")) dataset.select(avg("mycol"))
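The quoted reply above confirms that `mean` is simply an alias for `avg`, so the three forms differ only in whether the aggregation appears via `select` or `agg`. The underlying "aggregate the mean, then subtract it" operation itself can be illustrated outside Spark; this is a plain-Java sketch of the arithmetic being asked about, not Spark API code:

```java
import java.util.Arrays;

public class Demean {
    // Subtract the column mean from every value — the two-step
    // aggregate-then-transform the question describes.
    static double[] demean(double[] col) {
        double sum = 0.0;
        for (double v : col) sum += v;       // aggregate step: mean("mycol")
        double mean = sum / col.length;
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) // transform step: mycol - mean
            out[i] = col[i] - mean;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demean(new double[]{1.0, 2.0, 3.0}))); // prints [-1.0, 0.0, 1.0]
    }
}
```

In Spark this would typically be done by collecting the aggregated mean first (it is a separate action) and then using it in a column expression, since an aggregate and a row-level transform cannot run in one pass.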

Re: Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Aseem Bansal
/SPARK-13025 > > The most recent attempt is to start with simple linear regression, as > here: https://issues.apache.org/jira/browse/SPARK-21386 > > > On Thu, 20 Jul 2017 at 08:36 Aseem Bansal <asmbans...@gmail.com> wrote: > >> We were able to set initial weights on http

Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Aseem Bansal
We were able to set initial weights on https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS How can we set the initial weights on

Regarding Logistic Regression changes in Spark 2.2.0

2017-07-19 Thread Aseem Bansal
Hi I was reading the API of Spark 2.2.0 and noticed a change compared to 2.1.0 Compared to https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression the 2.2.0 docs at

Spark 2.1 - Inferring schema of dataframe after reading json files, not during

2017-06-02 Thread Aseem Bansal
When we read files in spark it infers the schema. We have the option to not infer the schema. Is there a way to ask spark to infer the schema again, just like when reading json? The reason we want this done is that we have a problem in our data files. We have a json file containing this

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
> What commands are you using to submit your job via spark-submit? > > On Fri, 7 Apr 2017 at 13:12 Aseem Bansal <asmbans...@gmail.com> wrote: > >> When using spark ml's LogisticRegression, RandomForest, CrossValidator >> etc. do we need to give any consideration while coding

Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
When using spark ml's LogisticRegression, RandomForest, CrossValidator etc. do we need to give any consideration while coding in making it scale with more CPUs or does it scale automatically? I am reading some data from S3, using a pipeline to train a model. I am running the job on a spark

Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Aseem Bansal
I was reading http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest and found that this needs to be done in sklearn. Is that also required in spark?

Re: spark keeps on creating executors and each one fails with "TransportClient has not yet been set."

2017-03-02 Thread Aseem Bansal
Does anyone have any idea what I could enable to find out what it is trying to connect to? On Thu, Mar 2, 2017 at 5:34 PM, Aseem Bansal <asmbans...@gmail.com> wrote: > Is there a way to find out what it is trying to connect to? I am running > my spark client from within a docker co

spark keeps on creating executors and each one fails with "TransportClient has not yet been set."

2017-03-02 Thread Aseem Bansal
Is there a way to find out what it is trying to connect to? I am running my spark client from within a docker container, so I opened up various ports as per http://stackoverflow.com/questions/27729010/how-to-configure-apache-spark-random-worker-ports-for-tight-firewalls after adding all the

Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Aseem Bansal
n sql " select a from json1 and b from josn2"then run > explain to give you a hint to how to do it in code > > Regards > Sam > On Tue, 14 Feb 2017 at 14:30, Aseem Bansal <asmbans...@gmail.com> wrote: > >> Say I have two files containing single rows >> &g

Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Aseem Bansal
Say I have two files, each containing a single row json1.json {"a": 1} json2.json {"b": 2} I read these json files into a dataframe using spark's API, one at a time. So I have Dataset json1DF and Dataset json2DF If I run "select a, b from __THIS__" in a SQLTransformer then I will get an exception

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-06 Thread Aseem Bansal
s from http or kafka/msg > queues...for such cases raw access to ML model is essential similar to > mllib model access... > > Thanks. > Deb > On Feb 4, 2017 9:58 PM, "Aseem Bansal" <asmbans...@gmail.com> wrote: > >> @Debasish >> >> I see

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
en to expose > the model out of PipelineModel so that predict can be called on itthere > is no dependency of spark context in ml model... > On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote: > >> >>- In Spark 2.0 there is a class ca

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
deserialized as ml model from the > store of choice within ms, it can be used on incoming features to score > through spark.ml.Model predict API...I am not clear on 2200x speedup...why > r we using dataframe and not the ML model directly from API ? > On Feb 4, 2017 7:52 AM, "Aseem

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
e operated on roughly 12 input features, and by > the time all the processing was done, we had somewhere around 1000 features > or so going into the linear regression after one hot encoding and > everything else. > > Hope this helps, > Hollin > > On Fri, Feb 3, 2017 at 4:05 AM

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Aseem Bansal
Does this support Java 7? On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com> wrote: > Is computational time for predictions on the order of few milliseconds (< > 10 ms) like the old mllib library? > > On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Aseem Bansal
Is computational time for predictions on the order of a few milliseconds (< 10 ms) like the old mllib library? On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins wrote: > Hey everyone, > > > Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits about > MLeap and how

Re: tylerchap...@yahoo-inc.com is no longer with Yahoo! (was: Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0)

2017-02-01 Thread Aseem Bansal
Can an admin of the mailing list please remove this email address? I get this email every time I send an email to the mailing list. On Wed, Feb 1, 2017 at 5:12 PM, Yahoo! No Reply wrote: > > This is an automatically generated message. > > tylerchap...@yahoo-inc.com is no longer

Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0

2017-02-01 Thread Aseem Bansal
*What I want to do* I have trained a ml.classification.LogisticRegressionModel using the spark ml package. It has 3 features and 3 classes. So the generated model has coefficients in a (3, 3) matrix and intercepts in a Vector of length 3, as expected. Now, I want to take these coefficients and
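Applying the extracted coefficients by hand comes down to computing one margin per class and taking the argmax, which is what the multinomial model does internally before any probability calculation. A hedged plain-Java sketch (the numbers are made up for illustration; the real coefficients would come from the trained model):

```java
public class ManualMultinomialScore {
    // margin_k = coefficients[k] . features + intercepts[k];
    // predicted class = argmax over k of margin_k.
    static int predict(double[][] coefficients, double[] intercepts, double[] features) {
        int best = 0;
        double bestMargin = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < coefficients.length; k++) {
            double margin = intercepts[k];
            for (int j = 0; j < features.length; j++)
                margin += coefficients[k][j] * features[j];
            if (margin > bestMargin) {
                bestMargin = margin;
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // hypothetical (3, 3) coefficient matrix and length-3 intercept vector
        double[][] coeff = {{1.0, 0.0, 0.0}, {0.0, 1.0, 0.0}, {0.0, 0.0, 1.0}};
        double[] intercepts = {0.0, 0.0, 0.0};
        System.out.println(predict(coeff, intercepts, new double[]{0.1, 2.0, 0.3})); // prints 1
    }
}
```

If class probabilities are needed rather than just the label, a softmax over the margins would be applied on top of this.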

Re: ML version of Kmeans

2017-01-31 Thread Aseem Bansal
If you want to predict using dataset then transform is the way to go. If you want to predict on vectors then you will have to wait on this issue to be completed https://issues.apache.org/jira/browse/SPARK-10413 On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau wrote: > You

Is there any scheduled release date for Spark 2.1.0?

2016-12-23 Thread Aseem Bansal

Re: Is Spark launcher's listener API considered production ready?

2016-11-04 Thread Aseem Bansal
Does anyone have any idea about this? On Thu, Nov 3, 2016 at 12:52 PM, Aseem Bansal <asmbans...@gmail.com> wrote: > While using Spark launcher's listener we came across a few cases where the > failures were not being reported correctly. > > >- https://issues.apache.org/ji

Is Spark launcher's listener API considered production ready?

2016-11-03 Thread Aseem Bansal
While using Spark launcher's listener we came across a few cases where the failures were not being reported correctly. - https://issues.apache.org/jira/browse/SPARK-17742 - https://issues.apache.org/jira/browse/SPARK-18241 So I just wanted to confirm whether this API is considered production

Re: [SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Aseem Bansal
PM, Aseem Bansal <asmbans...@gmail.com> wrote: > Hi > > We are trying to use some of our artifacts as dependencies while > submitting spark jobs. To specify the remote artifactory URL we are using > the following syntax > > https://USERNAME:passw...@artifactory.com

[SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Aseem Bansal
Hi We are trying to use some of our artifacts as dependencies while submitting spark jobs. To specify the remote artifactory URL we are using the following syntax https://USERNAME:passw...@artifactory.companyname.com/artifactory/COMPANYNAME-libs But the resolution fails. Although the URL which

Fwd: Need help with SVM

2016-10-26 Thread Aseem Bansal
He replied to me. Forwarding to the mailing list. -- Forwarded message -- From: Aditya Vyas <adityavya...@gmail.com> Date: Tue, Oct 25, 2016 at 8:16 PM Subject: Re: Need help with SVM To: Aseem Bansal <asmbans...@gmail.com> Hello, Here is the public gist:https://gis

What syntax can be used to specify the latest version of JAR found while using spark submit

2016-10-26 Thread Aseem Bansal
Hi Can someone please share their thoughts on http://stackoverflow.com/questions/40259022/what-syntax-can-be-used-to-specify-the-latest-version-of-jar-found-while-using-s

Can application JAR name contain + for dependency resolution to latest version?

2016-10-26 Thread Aseem Bansal
Hi While using spark-submit to submit spark jobs, is the exact name of the JAR file necessary? Or is there a way to use something like `1.0.+` to denote the latest version found?
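To my knowledge spark-submit itself expects a concrete JAR path and does not expand patterns like `1.0.+`. One workaround (a sketch, with hypothetical file names) is to resolve the highest-versioned JAR yourself before invoking spark-submit, using a numeric dot-segment comparison so that `1.0.10` correctly sorts above `1.0.9`:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class LatestJar {
    // Pick the JAR whose embedded version compares highest.
    static String latest(List<String> jars) {
        return Collections.max(jars,
                Comparator.comparing(LatestJar::versionOf, LatestJar::compareVersions));
    }

    // e.g. "myapp-1.0.3.jar" -> "1.0.3"
    static String versionOf(String jar) {
        return jar.replaceAll("^.*-(\\d+(?:\\.\\d+)*)\\.jar$", "$1");
    }

    // Numeric, segment-by-segment version compare; missing segments count as 0.
    static int compareVersions(String a, String b) {
        String[] xs = a.split("\\."), ys = b.split("\\.");
        for (int i = 0; i < Math.max(xs.length, ys.length); i++) {
            int x = i < xs.length ? Integer.parseInt(xs[i]) : 0;
            int y = i < ys.length ? Integer.parseInt(ys[i]) : 0;
            if (x != y) return Integer.compare(x, y);
        }
        return 0;
    }

    public static void main(String[] args) {
        List<String> jars = Arrays.asList("myapp-1.0.2.jar", "myapp-1.0.10.jar", "myapp-1.0.9.jar");
        System.out.println(latest(jars)); // prints myapp-1.0.10.jar
    }
}
```

The resolved path can then be passed to spark-submit as usual; in practice the directory listing would come from `File.listFiles()` on the build output directory.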

Re: Need help with SVM

2016-10-25 Thread Aseem Bansal
Is there any labeled point with label 0 in your dataset? On Tue, Oct 25, 2016 at 2:13 AM, aditya1702 wrote: > Hello, > I am using linear SVM to train my model and generate a line through my > data. > However my model always predicts 1 for all the feature examples. Here

Re: mllib model in production web API

2016-10-18 Thread Aseem Bansal
kow...@gmail.com> wrote: > Hi > Did you try applying the model with akka instead of spark ? > https://spark-summit.org/eu-2015/events/real-time-anomaly- > detection-with-spark-ml-and-akka/ > > Le 18 oct. 2016 5:58 AM, "Aseem Bansal" <asmbans...@gmail.com>

Re: mllib model in production web API

2016-10-17 Thread Aseem Bansal
t more? I'm not sure I understand > it. At the moment we load our models from S3 ( > RandomForestClassificationModel.load(..) ) and then store that in an > object property so that it persists across requests - this is in Scala. Is > this essentially what you mean? > > > > > > &g

Re: mllib model in production web API

2016-10-12 Thread Aseem Bansal
Hi We faced a similar issue. Our solution was to load the model, convert it to an mllib model, cache that, and then use it instead of the ml model. On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen wrote: > I don't believe it will ever scale to spin up a whole distributed job

Reading from and writing to different S3 buckets in spark

2016-10-12 Thread Aseem Bansal
Hi I want to read CSV from one bucket, do some processing and write to a different bucket. I know the way to set S3 credentials using jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY) jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY) But the
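One commonly suggested approach for reading and writing with different credentials is per-bucket configuration. This assumes the S3A connector from Hadoop 2.8+, not the s3n connector shown above — a sketch with hypothetical bucket names, and the property names should be checked against the Hadoop version actually on the classpath:

```
# spark-defaults.conf fragment: per-bucket S3A credentials (Hadoop 2.8+ assumption)
spark.hadoop.fs.s3a.bucket.input-bucket.access.key   READ_ACCESS_KEY
spark.hadoop.fs.s3a.bucket.input-bucket.secret.key   READ_SECRET_KEY
spark.hadoop.fs.s3a.bucket.output-bucket.access.key  WRITE_ACCESS_KEY
spark.hadoop.fs.s3a.bucket.output-bucket.secret.key  WRITE_SECRET_KEY
```

With s3n there is no per-bucket mechanism, which is presumably why the question arises in the first place.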

Re: spark listener do not get fail status

2016-09-30 Thread Aseem Bansal
Hi In case my previous email was lacking in detail, here are some more details. - using Spark 2.0.0 - launching the job using org.apache.spark.launcher.SparkLauncher.startApplication(myListener) - checking state in the listener's stateChanged method On Thu, Sep 29, 2016 at 5:24 PM, Aseem

spark listener do not get fail status

2016-09-29 Thread Aseem Bansal
Hi I am submitting a job via the Spark API, but I never get a failed status even when the job throws an exception or exits via System.exit(-1). How do I indicate via the SparkListener API that my job failed?

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-02 Thread Aseem Bansal
Hi Thanks for all the details. I was able to convert from ml.NaiveBayesModel to mllib.NaiveBayesModel and get it done. It is fast for our use case. Just one question: before mllib is removed, can the ml package be expected to reach feature parity with mllib? On Thu, Sep 1, 2016 at 7:12 PM, Sean Owen

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
n't related to .ml vs .mllib APIs. > > On Thu, Sep 1, 2016 at 2:01 PM, Aseem Bansal <asmbans...@gmail.com> wrote: > > I understand your point. > > > > Is there something like a bridge? Is it possible to convert the model > > trained using Dataset (i.e. the distribut

Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
ed should be fast. > > On Thu, Sep 1, 2016 at 1:37 PM, Aseem Bansal <asmbans...@gmail.com> wrote: > > Hi > > > > Currently trying to use NaiveBayes to make predictions. But facing issues > > that doing the predictions takes order of few seconds. I tried with other &

Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
Hi Currently trying to use NaiveBayes to make predictions. But facing issues in that doing the predictions takes on the order of a few seconds. I tried with other model examples shipped with Spark but they also took a minimum of 500 ms when I used the Scala API. Has anyone used spark ML to do predictions

Re: Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Aseem Bansal
nerally trivial. >> >> On Thu, Sep 1, 2016 at 10:06 AM, Aseem Bansal <asmbans...@gmail.com> >> wrote: >> > Hi >> > >> > Would there be any significant performance difference when using Java >> vs. >> > Scala API? >> >&g

Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Aseem Bansal
Hi Would there be any significant performance difference when using Java vs. Scala API?

spark 2.0.0 - code generation inputadapter_value is not rvalue

2016-09-01 Thread Aseem Bansal
Hi Does spark do some code generation? I am trying to use map on a Java RDD and getting a huge generated file with 17406 lines in my terminal, and then a stacktrace 16/09/01 13:57:36 INFO FileOutputCommitter: File Output Committer Algorithm version is 1 16/09/01 13:57:36 INFO

Spark 2.0.0 - What all access is needed to save model to S3?

2016-08-29 Thread Aseem Bansal
Hi What all access is needed to save a model to S3? Initially I thought it should be only write. Then I found it also needs delete, to delete temporary files. Now that they have given me DELETE access, I am getting the error Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:

spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-24 Thread Aseem Bansal
Hi When Spark saves anything to S3 it creates temporary files. Why? Asking because this requires the access credentials to be given delete permission along with write permission.

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
Thanks everyone for clarifying. On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal <asmbans...@gmail.com> wrote: > I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ > and it mentioned that spark streaming actually mini-batch not actual > streaming. > > I ha

Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ and it mentioned that spark streaming is actually mini-batch, not actual streaming. I have not used streaming and I am not sure what the difference between the two terms is, hence I could not make a judgement myself.
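The distinction in a nutshell: a record-at-a-time engine processes each event as it arrives, while Spark Streaming collects events for a fixed interval and processes them together as one small batch. A toy plain-Java sketch of the batching side (a count of 3 events stands in for a time window here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MicroBatch {
    // Chop an event stream into fixed-size groups; each group is then
    // processed as a unit, the way a micro-batch engine treats an interval.
    static List<List<String>> toBatches(List<String> events, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < events.size(); i += batchSize)
            batches.add(new ArrayList<>(events.subList(i, Math.min(i + batchSize, events.size()))));
        return batches;
    }

    public static void main(String[] args) {
        List<List<String>> b = toBatches(Arrays.asList("e1", "e2", "e3", "e4", "e5"), 3);
        System.out.println(b.size()); // prints 2
    }
}
```

The practical consequence is that per-record latency in a micro-batch system is bounded below by the batch interval, whereas a true streaming engine can react to a single event immediately.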

Spark 2.0.0 - Java API - Modify a column in a dataframe

2016-08-11 Thread Aseem Bansal
Hi I have a Dataset. I will change a String to a String, so there will be no schema changes. Is there a way I can run a map on it? I have seen the function at

Re: na.fill doesn't work

2016-08-11 Thread Aseem Bansal
Check the schema of the data frame. It may be that your columns are String, and you are trying to give a default for numerical data. On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey wrote: > Hi everybody, > > I have a data frame after many transformation, my final task is fill na's >

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-10 Thread Aseem Bansal
, I don’t use the Java API sorry > > > > The simplest way to work around it would be to read the csv as a text file > using sparkContext textFile, split each row based on a comma, then convert > it to a dataset afterwards. > > > > *From:* Aseem Bansal [mailto:asmbans...@gm

Re: Multiple Sources Found for Parquet

2016-08-08 Thread Aseem Bansal
Seems that this is a common issue with Spark 2.0.0. I faced something similar with CSV, and saw someone facing this with JSON. https://issues.apache.org/jira/browse/SPARK-16893 On Mon, Aug 8, 2016 at 4:08 PM, Ted Yu wrote: > Can you examine classpath to see where *DefaultSource comes

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Aseem Bansal
s/latest/api/scala/index. > html#org.apache.spark.api.java.JavaSparkContext the classtag doesn't need > to be specified (instead it uses a "fake" class tag automatically for you). > Where are you seeing the different API? > > On Sun, Aug 7, 2016 at 11:32 PM, Aseem Ba

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-08 Thread Aseem Bansal
ys.asList("abc", "abc", "xyz"); Dataset ds > = context.createDataset(data, Encoders.STRING()); > > I think you should be calling > > .as((Encoders.STRING(), Encoders.STRING())) > > or similar > > Ewan > > On 8 Aug 2016 06:10, Aseem Ban

Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Aseem Bansal
Earlier, for broadcasting we just needed to use sparkcontext.broadcast(objectToBroadcast). But now it is sparkcontext.broadcast(objectToBroadcast, classTag). What is classTag here?
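The reason a ClassTag exists at all: JVM generics are erased at runtime, so generic code cannot do `new T[n]` on its own — it needs an explicit runtime class token, and Scala's ClassTag is that token. A pure-Java analogue of the same idea, using `java.lang.Class` as the token:

```java
import java.lang.reflect.Array;

public class ClassTokenDemo {
    // Without the Class<T> token this method could not be written:
    // after erasure, T is just Object at runtime.
    @SuppressWarnings("unchecked")
    static <T> T[] newArray(Class<T> elementClass, int length) {
        return (T[]) Array.newInstance(elementClass, length);
    }

    public static void main(String[] args) {
        String[] xs = newArray(String.class, 3);
        System.out.println(xs.length); // prints 3
    }
}
```

As the quoted reply notes, the JavaSparkContext overloads supply a "fake" tag automatically; when calling the Scala SparkContext from Java, a tag can typically be obtained via `scala.reflect.ClassTag$.MODULE$.apply(MyClass.class)` (worth verifying against the Scala version on your classpath).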

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Aseem Bansal
Hi All Has anyone done this with Java API? On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal <asmbans...@gmail.com> wrote: > I need to use few columns out of a csv. But as there is no option to read > few columns out of csv so > 1. I am reading the whole CSV using SparkSes

Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Aseem Bansal
g-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On Fri, Aug 5, 2016 at 2:06 PM, Aseem Bansal <asmbans...@gmail.com> wrote: > > I need to use few columns out of a csv. But as there is no option to read > > few columns out of csv so > >

Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Aseem Bansal
I need to use a few columns out of a csv. But as there is no option to read only a few columns out of a csv: 1. I am reading the whole CSV using SparkSession.csv() 2. selecting a few of the columns using DataFrame.select() 3. applying the schema using the .as() function of Dataset. I tried to extend

What is "Developer API " in spark documentation?

2016-08-05 Thread Aseem Bansal
Hi Many places in the spark documentation say "Developer API". What does that mean?

Spark 2.0 - Case sensitive column names while reading csv

2016-08-03 Thread Aseem Bansal
While reading csv via DataFrameReader, how can I make column names case-sensitive? http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html None of the options specified mention case sensitivity
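As far as I know, column-name case sensitivity in Spark 2.x is not a DataFrameReader option at all but a global SQL analyzer setting, which is presumably why it does not appear in the reader docs. A sketch of the setting (default is false, i.e. case-insensitive resolution; behaviour should be verified on your Spark version):

```
# spark-defaults.conf fragment: make the SQL analyzer resolve
# column names case-sensitively
spark.sql.caseSensitive  true
```

The same flag can be set at runtime via the session configuration instead of spark-defaults.conf.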