Re: Apache Spark - MLLib challenges

2017-09-23 Thread Aseem Bansal
This is something I wrote specifically about the challenges that we faced
when taking Spark ML models to production:
http://www.tothenew.com/blog/when-you-take-your-machine-learning-models-to-production-for-real-time-predictions/

On Sat, Sep 23, 2017 at 1:33 PM, Jörn Franke  wrote:

> As far as I know there is currently no encryption in-memory in Spark.
> There are some research projects to create secure enclaves in memory based
> on Intel SGX, but there is still a lot to do in terms of performance and
> security objectives.
> The more interesting question is why you would need this for your
> organization. There are very few scenarios where it could be needed, and if
> you have attackers in the cluster you have other problems anyway.
>
> On 23. Sep 2017, at 09:41, Irfan Kabli  wrote:
>
> Dear All,
>
> We are looking to position MLlib in our organisation for machine learning
> tasks and are keen to understand if there are any challenges that you might
> have seen with MLlib in production. We will be going with the pure
> open-source approach here, rather than using one of the Hadoop
> distributions out there in the market.
>
> Furthermore, with a multi-tenant Hadoop cluster and data in memory, would
> Spark support encrypting the data in memory with DataFrames?
>
> --
> Best Regards,
> Irfan Kabli
>
>


NullPointerException when collecting a dataset grouped by a column

2017-07-24 Thread Aseem Bansal
I was doing this

dataset.groupBy("column").collectAsList()


It worked for a small dataset, but for a bigger dataset I got a
NullPointerException whose stack trace went down into Spark's code. Is this
known behaviour?
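
For context, a minimal sketch of the groupBy-then-collect pattern in question
(hypothetical column and aggregation; the real dataset was built from Java
beans, as the beansToRows frames in the stack trace below suggest):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("groupby-collect").getOrCreate()

// groupBy on its own returns a RelationalGroupedDataset, so an aggregation is
// applied before collecting the grouped result back to the driver.
val dataset = spark.range(100).withColumnRenamed("id", "column")
val grouped = dataset.groupBy("column").count()
val rows = grouped.collectAsList()   // java.util.List[Row] on the driver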

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1505)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1493)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1492)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1492)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:803)
at scala.Option.foreach(Option.scala:257)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:803)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1720)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1675)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1664)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:629)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
at
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)
at
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1$$anonfun$apply$11.apply(Dataset.scala:2364)
at
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1$$anonfun$apply$11.apply(Dataset.scala:2363)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
at
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2363)
at
org.apache.spark.sql.Dataset$$anonfun$collectAsList$1.apply(Dataset.scala:2362)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2778)
at org.apache.spark.sql.Dataset.collectAsList(Dataset.scala:2362)
at com.companyname.Main(Main.java:151)
... 7 more
Caused by: java.lang.NullPointerException
at sun.reflect.GeneratedMethodAccessor58.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1113)
at
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1$$anonfun$apply$1.apply(SQLContext.scala:1113)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:1113)
at
org.apache.spark.sql.SQLContext$$anonfun$beansToRows$1.apply(SQLContext.scala:)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:107)
at
org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:97)
at
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)

Re: Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
Any difference between using agg or select to do the aggregations?

On Mon, Jul 24, 2017 at 5:08 PM, yohann jardin <yohannjar...@hotmail.com>
wrote:

> Seen directly in the code:
>
>
>   /**
>* Aggregate function: returns the average of the values in a group.
>* Alias for avg.
>*
>* @group agg_funcs
>* @since 1.4.0
>*/
>   def mean(e: Column): Column = avg(e)
>
>
> That's the same when the argument is the column name.
>
> So no difference between mean and avg functions.
>
>
> --
> *De :* Aseem Bansal <asmbans...@gmail.com>
> *Envoyé :* lundi 24 juillet 2017 13:34
> *À :* user
> *Objet :* Is there a difference between these aggregations
>
> If I want to aggregate mean and subtract from my column I can do either of
> the following in Spark 2.1.0 Java API. Is there any difference between
> these? Couldn't find anything from reading the docs.
>
> dataset.select(mean("mycol"))
> dataset.agg(mean("mycol"))
>
> dataset.select(avg("mycol"))
> dataset.agg(avg("mycol"))
>
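
For reference, a hedged sketch of the two call styles being compared
(hypothetical column name): with no grouping, both agg and select compute a
single global aggregate, and explain() can be used to compare the plans.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, mean}

val spark = SparkSession.builder().appName("agg-vs-select").getOrCreate()
val dataset = spark.range(10).withColumnRenamed("id", "mycol")

// agg without a preceding groupBy aggregates over the whole dataset, so both
// forms return a single-row result here.
dataset.select(mean("mycol")).explain()
dataset.agg(mean("mycol")).explain()

dataset.select(avg("mycol")).show()
dataset.agg(avg("mycol")).show()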


Is there a difference between these aggregations

2017-07-24 Thread Aseem Bansal
If I want to aggregate the mean and subtract it from my column, I can do either
of the following in the Spark 2.1.0 Java API. Is there any difference between
these? I couldn't find anything from reading the docs.

dataset.select(mean("mycol"))
dataset.agg(mean("mycol"))

dataset.select(avg("mycol"))
dataset.agg(avg("mycol"))


Re: Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Aseem Bansal
Hi

I had asked about this somewhere else too and was told that the weightCol
method does that.

On Thu, Jul 20, 2017 at 12:50 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Currently it's not supported, but is on the roadmap: see
> https://issues.apache.org/jira/browse/SPARK-13025
>
> The most recent attempt is to start with simple linear regression, as
> here: https://issues.apache.org/jira/browse/SPARK-21386
>
>
> On Thu, 20 Jul 2017 at 08:36 Aseem Bansal <asmbans...@gmail.com> wrote:
>
>> We were able to set initial weights on https://spark.apache.org/
>> docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.classification.
>> LogisticRegressionWithLBFGS
>>
>> How can we set the initial weights on https://spark.apache.org/
>> docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.
>> LogisticRegression similar to above
>>
>> Trying to migrate some code from mllib version to the ml version
>>
>


Setting initial weights of ml.classification.LogisticRegression similar to mllib.classification.LogisticRegressionWithLBFGS

2017-07-20 Thread Aseem Bansal
We were able to set initial weights on
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

How can we set the initial weights on
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression
similar to the above?

We are trying to migrate some code from the mllib version to the ml version.
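
For reference, a hedged sketch of the mllib-style initial weights being
referred to above (toy data; the run(input, initialWeights) overload is the
mllib mechanism, and the thread's point is that ml.classification.LogisticRegression
exposes no equivalent setter in 2.1/2.2):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("initial-weights").getOrCreate()
val sc = spark.sparkContext

val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(1.0, Vectors.dense(3.0, 4.0))
))

// Starting point handed to the LBFGS optimizer (length = number of features
// for the binary case).
val initialWeights = Vectors.dense(0.0, 0.0)
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training, initialWeights)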


Regarding Logistic Regression changes in Spark 2.2.0

2017-07-19 Thread Aseem Bansal
Hi

I was reading the API of Spark 2.2.0 and noticed a change compared to 2.1.0

Compared to
https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression
the 2.2.0 docs at
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression
mention that "This class supports fitting traditional logistic regression
model by LBFGS/OWLQN and bound (box) constrained logistic regression model
by LBFGSB."

I went through the release notes and found that the bound (box) constrained
optimization was added in 2.2. I wanted to know whether LBFGS was the default
in Spark 2.1.0. If not, can we use LBFGS in Spark 2.1.0, or do we have to
upgrade to 2.2.0?


Spark 2.1 - Inferring schema of dataframe after reading json files, not during

2017-06-02 Thread Aseem Bansal
When we read files in Spark it infers the schema, and we have the option to not
infer the schema. Is there a way to ask Spark to infer the schema again later,
just like it does when reading JSON?

The reason we want to do this is that we have a problem in our
data files. We have a json file containing this:

{"a": NESTED_JSON_VALUE}
{"a":"null"}

It should have been an empty JSON object, but due to a bug it became the string
"null" instead. Now, when we read the file, the column "a" is inferred as a
String. What we want to do instead is ask spark to read the file treating "a" as
a String, filter out the "null" values (or replace them with an empty JSON
object), and then ask spark to infer the schema of "a" after the fix so we can
access the nested json properly.
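
For reference, a hedged sketch of that re-inference idea (hypothetical file and
column names; the mixed rows are assumed to fall back to a String column, as
described above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reinfer-json").getOrCreate()
import spark.implicits._

// First pass: because of the bad rows, "a" comes back as a plain String column.
val raw = spark.read.json("data.json")

// Keep only the rows whose "a" holds real JSON text, dropping the bogus "null" strings.
val cleaned = raw.select($"a").as[String].filter(_ != "null")

// Second pass: feed the cleaned JSON strings back through the JSON reader so
// Spark infers the nested schema again (Spark 2.1 accepts an RDD[String] here).
val reinferred = spark.read.json(cleaned.rdd)
reinferred.printSchema()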


Re: Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
   - Limited the data to 100,000 records.
   - 6 categorical features which go through imputation, string indexing, and
   one-hot encoding. The maximum number of classes for a feature is 100. As the
   data is imputed it becomes dense.
   - 1 numerical feature.
   - Training Logistic Regression through CrossValidation with grid to
   optimize its regularization parameter over the values 0.0001, 0.001, 0.005,
   0.01, 0.05, 0.1
   - Using spark's launcher api to launch it on a yarn cluster in Amazon
   AWS.

I was thinking that as CrossValidator is finding the best parameters it
should be able to evaluate them independently. That sounds like something which
could be run in parallel.
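
For reference, a hedged sketch of the cross-validation setup described in the
list above (hypothetical column names; the evaluator and fold count are
assumptions, as they are not stated in the thread):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")

// Grid over the regularization values listed above; CrossValidator fits one
// model per (fold, parameter) combination.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0001, 0.001, 0.005, 0.01, 0.05, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)   // assumption: fold count not given in the thread

// val cvModel = cv.fit(trainingData)   // trainingData: DataFrame with features/label columns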


On Fri, Apr 7, 2017 at 5:20 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> What is the size of training data (number examples, number features)?
> Dense or sparse features? How many classes?
>
> What commands are you using to submit your job via spark-submit?
>
> On Fri, 7 Apr 2017 at 13:12 Aseem Bansal <asmbans...@gmail.com> wrote:
>
>> When using spark ml's LogisticRegression, RandomForest, CrossValidator
>> etc. do we need to give any consideration while coding in making it scale
>> with more CPUs or does it scale automatically?
>>
>> I am reading some data from S3, using a pipeline to train a model. I am
>> running the job on a spark cluster with 36 cores and 60GB RAM and I cannot
>> see much usage. It is running but I was expecting spark to use all RAM
>> available and make it faster. So that's why I was thinking whether we need
>> to take something particular in consideration or wrong expectations?
>>
>


Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
When using spark ml's LogisticRegression, RandomForest, CrossValidator etc.,
do we need to take anything into consideration while coding to make it scale
with more CPUs, or does it scale automatically?

I am reading some data from S3 and using a pipeline to train a model. I am
running the job on a spark cluster with 36 cores and 60GB RAM and I cannot
see much usage. It is running, but I was expecting spark to use all of the
available RAM and finish faster. Is there something particular we need to take
into consideration, or are my expectations wrong?


Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Aseem Bansal
I was reading
http://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest
and found that this needs to be done in sklearn. Is it required in Spark?


Re: spark keeps on creating executors and each one fails with "TransportClient has not yet been set."

2017-03-02 Thread Aseem Bansal
Does anyone have any idea what I could enable so as to find out what it is
trying to connect to?

On Thu, Mar 2, 2017 at 5:34 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> Is there a way to find out what is it trying to connect to? I am running
> my spark client from within a docker container so I opened up various ports
> as per http://stackoverflow.com/questions/27729010/how-to-
> configure-apache-spark-random-worker-ports-for-tight-firewalls after
> adding all the properties in conf/spark-defaults.conf on my spark cluster's
> installation.
>
> In stdout of the executor I can see the following debug
>
> 17/03/02 12:01:17 DEBUG UserGroupInformation: PrivilegedActionException
> as:root (auth:SIMPLE) cause:org.apache.spark.rpc.RpcTimeoutException:
> Cannot receive any reply in 120 seconds. This timeout is controlled by
> spark.rpc.askTimeout
>
> What is it trying to connect to?
>


spark keeps on creating executors and each one fails with "TransportClient has not yet been set."

2017-03-02 Thread Aseem Bansal
Is there a way to find out what it is trying to connect to? I am running my
spark client from within a docker container so I opened up various ports as
per
http://stackoverflow.com/questions/27729010/how-to-configure-apache-spark-random-worker-ports-for-tight-firewalls
after adding all the properties in conf/spark-defaults.conf on my spark
cluster's installation.

In stdout of the executor I can see the following debug

17/03/02 12:01:17 DEBUG UserGroupInformation: PrivilegedActionException
as:root (auth:SIMPLE) cause:org.apache.spark.rpc.RpcTimeoutException:
Cannot receive any reply in 120 seconds. This timeout is controlled by
spark.rpc.askTimeout

What is it trying to connect to?
17/03/02 11:46:47 INFO SecurityManager: Changing modify acls groups to:
17/03/02 11:46:47 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(ec2-user, root); 
groups with view permissions: Set(); users  with modify permissions: 
Set(ec2-user, root); groups with modify permissions: Set()
java.lang.IllegalArgumentException: requirement failed: TransportClient has not 
yet been set.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at 
org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at 
org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply 
in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
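
For reference, the kind of port pinning described at the top of this message
(properties added to conf/spark-defaults.conf, per the Stack Overflow link)
might look like the following sketch; the specific port numbers are
assumptions, and spark.port.maxRetries only bounds how far Spark walks up from
a configured port:

# hedged example values only: fixed ports so a firewall / Docker port mapping can allow them
spark.driver.port         40000
spark.blockManager.port   40010
spark.port.maxRetries     16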


Re: Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Aseem Bansal
Sorry if I trivialized the example. It is the same kind of file and
sometimes it could have "a", sometimes "b", sometimes both. I just don't
know. That is what I meant by missing columns.

It would be good if I could read any of the JSON files, run Spark SQL on them,
and get the following:

for json1.json

a | b
1 | null

for json2.json

a | b
null | 2
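
One hedged way to get exactly that output (an assumption on my part, not what
the thread settled on) is to supply an explicit schema when reading, so a
column absent from a given file simply comes back as null:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().appName("missing-columns").getOrCreate()

// Declaring both columns up front means a file that lacks one of them still
// yields that column, just filled with nulls.
val schema = StructType(Seq(
  StructField("a", LongType, nullable = true),
  StructField("b", LongType, nullable = true)
))

val json1DF = spark.read.schema(schema).json("json1.json") // a = 1, b = null
val json2DF = spark.read.schema(schema).json("json2.json") // a = null, b = 2

json1DF.createOrReplaceTempView("json1")
spark.sql("select a, b from json1").show()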


On Tue, Feb 14, 2017 at 8:13 PM, Sam Elamin <hussam.ela...@gmail.com> wrote:

> I may be missing something super obvious here but can't you combine them
> into a single dataframe. Left join perhaps?
>
> Try writing it in sql: "select a from json1 and b from json2", then run
> explain to give you a hint on how to do it in code.
>
> Regards
> Sam
> On Tue, 14 Feb 2017 at 14:30, Aseem Bansal <asmbans...@gmail.com> wrote:
>
>> Say I have two files containing single rows
>>
>> json1.json
>>
>> {"a": 1}
>>
>> json2.json
>>
>> {"b": 2}
>>
>> I read in this json file using spark's API into a dataframe one at a
>> time. So I have
>>
>> Dataset json1DF
>> and
>> Dataset json2DF
>>
>> If I run "select a, b from __THIS__" in a SQLTransformer then I will get
>> an exception as for json1DF does not have "b" and json2DF does not have "a"
>>
>> How could I handle this situation with missing columns in JSON?
>>
>


Dealing with missing columns in SPARK SQL in JSON

2017-02-14 Thread Aseem Bansal
Say I have two files containing single rows

json1.json

{"a": 1}

json2.json

{"b": 2}

I read each of these JSON files into a dataframe using Spark's API, one at a
time. So I have

Dataset json1DF
and
Dataset json2DF

If I run "select a, b from __THIS__" in a SQLTransformer then I will get an
exception as for json1DF does not have "b" and json2DF does not have "a"

How could I handle this situation with missing columns in JSON?


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-06 Thread Aseem Bansal
I agree with you that this is needed. There is a JIRA
https://issues.apache.org/jira/browse/SPARK-10413

On Sun, Feb 5, 2017 at 11:21 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> Hi Aseem,
>
> Due to production deploy, we did not upgrade to 2.0 but that's critical
> item on our list.
>
> For exposing models out of PipelineModel, let me look into the ML
> tasks...we should add it since dataframe should not be must for model
> scoring...many times model are scored on api or streaming paths which don't
> have micro batching involved...data directly lands from http or kafka/msg
> queues...for such cases raw access to ML model is essential similar to
> mllib model access...
>
> Thanks.
> Deb
> On Feb 4, 2017 9:58 PM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>
>> @Debasish
>>
>> I see that the spark version being used in the project that you mentioned
>> is 1.6.0. I would suggest that you take a look at some blogs related to
>> Spark 2.0 Pipelines, Models in new ml package. The new ml package's API as
>> of latest Spark 2.1.0 release has no way to call predict on single vector.
>> There is no API exposed. It is WIP but not yet released.
>>
>> On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> If we expose an API to access the raw models out of PipelineModel can't
>>> we call predict directly on it from an API ? Is there a task open to expose
>>> the model out of PipelineModel so that predict can be called on itthere
>>> is no dependency of spark context in ml model...
>>> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>
>>>>
>>>>- In Spark 2.0 there is a class called PipelineModel. I know that
>>>>the title says pipeline but it is actually talking about PipelineModel
>>>>trained via using a Pipeline.
>>>>- Why PipelineModel instead of pipeline? Because usually there is a
>>>>series of stuff that needs to be done when doing ML which warrants an
>>>>ordered sequence of operations. Read the new spark ml docs or one of the
>>>>databricks blogs related to spark pipelines. If you have used python's
>>>>sklearn library the concept is inspired from there.
>>>>- "once model is deserialized as ml model from the store of choice
>>>>within ms" - The timing of loading the model was not what I was
>>>>referring to when I was talking about timing.
>>>>- "it can be used on incoming features to score through
>>>>spark.ml.Model predict API". The predict API is in the old mllib package
>>>>not the new ml package.
>>>>- "why r we using dataframe and not the ML model directly from API"
>>>>- Because as of now the new ml package does not have the direct API.
>>>>
>>>>
>>>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com
>>>> > wrote:
>>>>
>>>>> I am not sure why I will use pipeline to do scoring...idea is to build
>>>>> a model, use model ser/deser feature to put it in the row or column store
>>>>> of choice and provide a api access to the model...we support these
>>>>> primitives in github.com/Verizon/trapezium...the api has access to
>>>>> spark context in local or distributed mode...once model is deserialized as
>>>>> ml model from the store of choice within ms, it can be used on incoming
>>>>> features to score through spark.ml.Model predict API...I am not clear on
>>>>> 2200x speedup...why r we using dataframe and not the ML model directly 
>>>>> from
>>>>> API ?
>>>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>>>
>>>>>> Does this support Java 7?
>>>>>> What is your timezone in case someone wanted to talk?
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Aseem,
>>>>>>>
>>>>>>> We have built pipelines that execute several string indexers, one
>>>>>>> hot encoders, scaling, and a random forest or linear regression at the 
>>>>>>> end.
>>>>>>> Execution time for the linear regression was on the order of 11
>>>>>>>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
@Debasish

I see that the Spark version being used in the project that you mentioned
is 1.6.0. I would suggest that you take a look at some blogs related to
Spark 2.0 Pipelines and Models in the new ml package. The new ml package's API,
as of the latest Spark 2.1.0 release, has no way to call predict on a single
vector. There is no such API exposed. It is WIP but not yet released.

On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> If we expose an API to access the raw models out of PipelineModel can't we
> call predict directly on it from an API ? Is there a task open to expose
> the model out of PipelineModel so that predict can be called on itthere
> is no dependency of spark context in ml model...
> On Feb 4, 2017 9:11 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>
>>
>>- In Spark 2.0 there is a class called PipelineModel. I know that the
>>title says pipeline but it is actually talking about PipelineModel trained
>>via using a Pipeline.
>>- Why PipelineModel instead of pipeline? Because usually there is a
>>series of stuff that needs to be done when doing ML which warrants an
>>ordered sequence of operations. Read the new spark ml docs or one of the
>>databricks blogs related to spark pipelines. If you have used python's
>>sklearn library the concept is inspired from there.
>>- "once model is deserialized as ml model from the store of choice
>>within ms" - The timing of loading the model was not what I was
>>referring to when I was talking about timing.
>>- "it can be used on incoming features to score through
>>spark.ml.Model predict API". The predict API is in the old mllib package
>>not the new ml package.
>>- "why r we using dataframe and not the ML model directly from API" -
>>Because as of now the new ml package does not have the direct API.
>>
>>
>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> I am not sure why I will use pipeline to do scoring...idea is to build a
>>> model, use model ser/deser feature to put it in the row or column store of
>>> choice and provide a api access to the model...we support these primitives
>>> in github.com/Verizon/trapezium...the api has access to spark context
>>> in local or distributed mode...once model is deserialized as ml model from
>>> the store of choice within ms, it can be used on incoming features to score
>>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>>> r we using dataframe and not the ML model directly from API ?
>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>>>
>>>> Does this support Java 7?
>>>> What is your timezone in case someone wanted to talk?
>>>>
>>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>>>> wrote:
>>>>
>>>>> Hey Aseem,
>>>>>
>>>>> We have built pipelines that execute several string indexers, one hot
>>>>> encoders, scaling, and a random forest or linear regression at the end.
>>>>> Execution time for the linear regression was on the order of 11
>>>>> microseconds, a bit longer for random forest. This can be further 
>>>>> optimized
>>>>> by using row-based transformations if your pipeline is simple to around 
>>>>> 2-3
>>>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>>>> the time all the processing was done, we had somewhere around 1000 
>>>>> features
>>>>> or so going into the linear regression after one hot encoding and
>>>>> everything else.
>>>>>
>>>>> Hope this helps,
>>>>> Hollin
>>>>>
>>>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbans...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Does this support Java 7?
>>>>>>
>>>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is computational time for predictions on the order of few
>>>>>>> milliseconds (< 10 ms) like the old mllib library?
>>>>>>>
>>>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <hol...@combust.ml>
>>>>>>> wrote:
>>>>>>>
>>>

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
   - In Spark 2.0 there is a class called PipelineModel. I know that the
   title says pipeline but it is actually talking about a PipelineModel trained
   using a Pipeline.
   - Why PipelineModel instead of Pipeline? Because usually there is a
   series of steps that needs to be done when doing ML, which warrants an
   ordered sequence of operations. Read the new spark ml docs or one of the
   databricks blogs related to spark pipelines. If you have used python's
   sklearn library, the concept is inspired by it.
   - "once model is deserialized as ml model from the store of choice
   within ms" - The timing of loading the model was not what I was
   referring to when I was talking about timing.
   - "it can be used on incoming features to score through spark.ml.Model
   predict API". The predict API is in the old mllib package not the new ml
   package.
   - "why r we using dataframe and not the ML model directly from API" -
   Because as of now the new ml package does not have the direct API.

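For reference, a hedged sketch of the Pipeline vs. PipelineModel distinction
described in the list above (hypothetical column names): the Pipeline is the
ordered recipe, and the PipelineModel is what fit() returns and what later does
the DataFrame-based scoring.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryIndex", "amount"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// The Pipeline is just the ordered sequence of stages.
val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))

// fit() returns a PipelineModel; scoring goes through transform() on a DataFrame.
// val pipelineModel = pipeline.fit(trainingDF)
// val scored = pipelineModel.transform(incomingDF)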

On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das <debasish.da...@gmail.com>
wrote:

> I am not sure why I will use pipeline to do scoring...idea is to build a
> model, use model ser/deser feature to put it in the row or column store of
> choice and provide a api access to the model...we support these primitives
> in github.com/Verizon/trapezium...the api has access to spark context in
> local or distributed mode...once model is deserialized as ml model from the
> store of choice within ms, it can be used on incoming features to score
> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
> r we using dataframe and not the ML model directly from API ?
> On Feb 4, 2017 7:52 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>
>> Does this support Java 7?
>> What is your timezone in case someone wanted to talk?
>>
>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml>
>> wrote:
>>
>>> Hey Aseem,
>>>
>>> We have built pipelines that execute several string indexers, one hot
>>> encoders, scaling, and a random forest or linear regression at the end.
>>> Execution time for the linear regression was on the order of 11
>>> microseconds, a bit longer for random forest. This can be further optimized
>>> by using row-based transformations if your pipeline is simple to around 2-3
>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>> the time all the processing was done, we had somewhere around 1000 features
>>> or so going into the linear regression after one hot encoding and
>>> everything else.
>>>
>>> Hope this helps,
>>> Hollin
>>>
>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbans...@gmail.com>
>>> wrote:
>>>
>>>> Does this support Java 7?
>>>>
>>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com>
>>>> wrote:
>>>>
>>>>> Is computational time for predictions on the order of few milliseconds
>>>>> (< 10 ms) like the old mllib library?
>>>>>
>>>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <hol...@combust.ml>
>>>>> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>>
>>>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>>>>> about MLeap and how you can use it to build production services from your
>>>>>> Spark-trained ML pipelines. MLeap is an open-source technology that 
>>>>>> allows
>>>>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>>>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>>>>> dependencies on a Spark context and the serialization format is entirely
>>>>>> based on Protobuf 3 and JSON.
>>>>>>
>>>>>>
>>>>>> The recent 0.5.0 release provides serialization and inference support
>>>>>> for close to 100% of Spark transformers (we don’t yet support ALS and 
>>>>>> LDA).
>>>>>>
>>>>>>
>>>>>> MLeap is open-source, take a look at our Github page:
>>>>>>
>>>>>> https://github.com/combust/mleap
>>>>>>
>>>>>>
>>>>>> Or join the conversation on Gitter:
>>>>>>
>>>>>> https://gitter.im/combust/mleap
>>>>>>
>>>>>>
>>>>>> We have a set of documentation to help get you started here:
>>>>>>
>>>>>> http://mleap-docs.combust.ml/
>>>>>>
>>>>>>
>>>>>> We even have a set of demos, for training ML Pipelines and linear,
>>>>>> logistic and random forest models:
>>>>>>
>>>>>> https://github.com/combust/mleap-demo
>>>>>>
>>>>>>
>>>>>> Check out our latest MLeap-serving Docker image, which allows you to
>>>>>> expose a REST interface to your Spark ML pipeline models:
>>>>>>
>>>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>>>
>>>>>>
>>>>>> Several companies are using MLeap in production and even more are
>>>>>> currently evaluating it. Take a look and tell us what you think! We hope 
>>>>>> to
>>>>>> talk with you soon and welcome feedback/suggestions!
>>>>>>
>>>>>>
>>>>>> Sincerely,
>>>>>>
>>>>>> Hollin and Mikhail
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
Does this support Java 7?
What is your timezone in case someone wanted to talk?

On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins <hol...@combust.ml> wrote:

> Hey Aseem,
>
> We have built pipelines that execute several string indexers, one hot
> encoders, scaling, and a random forest or linear regression at the end.
> Execution time for the linear regression was on the order of 11
> microseconds, a bit longer for random forest. This can be further optimized
> by using row-based transformations if your pipeline is simple to around 2-3
> microseconds. The pipeline operated on roughly 12 input features, and by
> the time all the processing was done, we had somewhere around 1000 features
> or so going into the linear regression after one hot encoding and
> everything else.
>
> Hope this helps,
> Hollin
>
> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal <asmbans...@gmail.com> wrote:
>
>> Does this support Java 7?
>>
>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com>
>> wrote:
>>
>>> Is computational time for predictions on the order of few milliseconds
>>> (< 10 ms) like the old mllib library?
>>>
>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <hol...@combust.ml>
>>> wrote:
>>>
>>>> Hey everyone,
>>>>
>>>>
>>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>>> about MLeap and how you can use it to build production services from your
>>>> Spark-trained ML pipelines. MLeap is an open-source technology that allows
>>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>>> dependencies on a Spark context and the serialization format is entirely
>>>> based on Protobuf 3 and JSON.
>>>>
>>>>
>>>> The recent 0.5.0 release provides serialization and inference support
>>>> for close to 100% of Spark transformers (we don’t yet support ALS and LDA).
>>>>
>>>>
>>>> MLeap is open-source, take a look at our Github page:
>>>>
>>>> https://github.com/combust/mleap
>>>>
>>>>
>>>> Or join the conversation on Gitter:
>>>>
>>>> https://gitter.im/combust/mleap
>>>>
>>>>
>>>> We have a set of documentation to help get you started here:
>>>>
>>>> http://mleap-docs.combust.ml/
>>>>
>>>>
>>>> We even have a set of demos, for training ML Pipelines and linear,
>>>> logistic and random forest models:
>>>>
>>>> https://github.com/combust/mleap-demo
>>>>
>>>>
>>>> Check out our latest MLeap-serving Docker image, which allows you to
>>>> expose a REST interface to your Spark ML pipeline models:
>>>>
>>>> http://mleap-docs.combust.ml/mleap-serving/
>>>>
>>>>
>>>> Several companies are using MLeap in production and even more are
>>>> currently evaluating it. Take a look and tell us what you think! We hope to
>>>> talk with you soon and welcome feedback/suggestions!
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> Hollin and Mikhail
>>>>
>>>
>>>
>>
>


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Aseem Bansal
Does this support Java 7?

On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> Is computational time for predictions on the order of few milliseconds (<
> 10 ms) like the old mllib library?
>
> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins <hol...@combust.ml> wrote:
>
>> Hey everyone,
>>
>>
>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>> about MLeap and how you can use it to build production services from your
>> Spark-trained ML pipelines. MLeap is an open-source technology that allows
>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>> Models to a scoring engine instantly. The MLeap execution engine has no
>> dependencies on a Spark context and the serialization format is entirely
>> based on Protobuf 3 and JSON.
>>
>>
>> The recent 0.5.0 release provides serialization and inference support for
>> close to 100% of Spark transformers (we don’t yet support ALS and LDA).
>>
>>
>> MLeap is open-source, take a look at our Github page:
>>
>> https://github.com/combust/mleap
>>
>>
>> Or join the conversation on Gitter:
>>
>> https://gitter.im/combust/mleap
>>
>>
>> We have a set of documentation to help get you started here:
>>
>> http://mleap-docs.combust.ml/
>>
>>
>> We even have a set of demos, for training ML Pipelines and linear,
>> logistic and random forest models:
>>
>> https://github.com/combust/mleap-demo
>>
>>
>> Check out our latest MLeap-serving Docker image, which allows you to
>> expose a REST interface to your Spark ML pipeline models:
>>
>> http://mleap-docs.combust.ml/mleap-serving/
>>
>>
>> Several companies are using MLeap in production and even more are
>> currently evaluating it. Take a look and tell us what you think! We hope to
>> talk with you soon and welcome feedback/suggestions!
>>
>>
>> Sincerely,
>>
>> Hollin and Mikhail
>>
>
>


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-03 Thread Aseem Bansal
Is computational time for predictions on the order of few milliseconds (<
10 ms) like the old mllib library?

On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins  wrote:

> Hey everyone,
>
>
> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits about
> MLeap and how you can use it to build production services from your
> Spark-trained ML pipelines. MLeap is an open-source technology that allows
> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
> Models to a scoring engine instantly. The MLeap execution engine has no
> dependencies on a Spark context and the serialization format is entirely
> based on Protobuf 3 and JSON.
>
>
> The recent 0.5.0 release provides serialization and inference support for
> close to 100% of Spark transformers (we don’t yet support ALS and LDA).
>
>
> MLeap is open-source, take a look at our Github page:
>
> https://github.com/combust/mleap
>
>
> Or join the conversation on Gitter:
>
> https://gitter.im/combust/mleap
>
>
> We have a set of documentation to help get you started here:
>
> http://mleap-docs.combust.ml/
>
>
> We even have a set of demos, for training ML Pipelines and linear,
> logistic and random forest models:
>
> https://github.com/combust/mleap-demo
>
>
> Check out our latest MLeap-serving Docker image, which allows you to
> expose a REST interface to your Spark ML pipeline models:
>
> http://mleap-docs.combust.ml/mleap-serving/
>
>
> Several companies are using MLeap in production and even more are
> currently evaluating it. Take a look and tell us what you think! We hope to
> talk with you soon and welcome feedback/suggestions!
>
>
> Sincerely,
>
> Hollin and Mikhail
>


Re: tylerchap...@yahoo-inc.com is no longer with Yahoo! (was: Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0)

2017-02-01 Thread Aseem Bansal
Can an admin of the mailing list please remove this address? I get this
auto-reply every time I send an email to the mailing list.

On Wed, Feb 1, 2017 at 5:12 PM, Yahoo! No Reply 
wrote:

>
> This is an automatically generated message.
>
> tylerchap...@yahoo-inc.com is no longer with Yahoo! Inc.
>
> Your message will not be forwarded.
>
> If you have a sales inquiry, please email yahoosa...@yahoo-inc.com and
> someone will follow up with you shortly.
>
> If you require assistance with a legal matter, please send a message to
> legal-noti...@yahoo-inc.com
>
> Thank you!
>


Question about Multinomial LogisticRegression in spark mllib in spark 2.1.0

2017-02-01 Thread Aseem Bansal
*What I want to do*
I have trained a ml.classification.LogisticRegressionModel using the spark ml
package.

It has 3 features and 3 classes, so the generated model has coefficients in a
(3, 3) matrix and intercepts in a Vector of length 3, as expected.

Now, I want to take these coefficients and convert this
ml.classification.LogisticRegressionModel model to an instance of
mllib.classification.LogisticRegressionModel model.

*Why I want to do this*
Computational speed, as SPARK-10413 is still in progress and scheduled for
Spark 2.2, which is not yet released.

*Why I think this is possible*
I checked
https://spark.apache.org/docs/latest/mllib-linear-methods.html#logistic-regression
and in that example a multinomial Logistic Regression is trained. So as per
this the class mllib.classification.LogisticRegressionModel can encapsulate
these parameters.

*Problem faced*
The only constructor in mllib.classification.LogisticRegressionModel takes
a single vector of coefficients and a single double as the intercept, but I
have a matrix of coefficients and a vector of intercepts respectively.

I tried converting the matrix to a vector by just taking the values (guesswork)
but got

requirement failed: LogisticRegressionModel.load with numClasses = 3 and
numFeatures = 3 expected weights of length 6 (without intercept) or 8 (with
intercept), but was given weights of length 9

So any ideas?
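
Working through the numbers in that error message (a hedged reading: the exact
mllib weight layout is an assumption here, but the expected lengths 6 and 8
point to a pivot-class encoding rather than a full 3 x 3 flattening):

// Lengths implied by the error message above.
val numClasses = 3
val numFeatures = 3
val withoutIntercept = (numClasses - 1) * numFeatures        // 6, as the error expects
val withIntercept    = (numClasses - 1) * (numFeatures + 1)  // 8, as the error expects
val naiveFlattening  = numClasses * numFeatures              // 9, what was supplied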


Re: ML version of Kmeans

2017-01-31 Thread Aseem Bansal
If you want to predict using a dataset then transform is the way to go. If
you want to predict on single vectors then you will have to wait for this issue
to be completed: https://issues.apache.org/jira/browse/SPARK-10413
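
For reference, a hedged sketch of the transform-based prediction mentioned
above (hypothetical column names; rawDF and newDataDF stand in for real data):
the ml KMeansModel scores a whole DataFrame rather than a single vector.

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble raw numeric columns into the vector column that KMeans expects.
val assembler = new VectorAssembler().setInputCols(Array("x", "y")).setOutputCol("features")
// val featureDF = assembler.transform(rawDF)

val kmeans = new KMeans().setK(3).setFeaturesCol("features").setPredictionCol("cluster")
// val model = kmeans.fit(featureDF)
// model.transform(assembler.transform(newDataDF)).select("features", "cluster").show()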

On Tue, Jan 31, 2017 at 3:01 PM, Holden Karau  wrote:

> You most likely want the transform function on KMeansModel (although that
> works on a dataset input rather than a single element at a time).
>
> On Tue, Jan 31, 2017 at 1:24 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I am not able to find predict method on "ML" version of Kmeans.
>>
>> Mllib version has a predict method.  KMeansModel.predict(point: Vector)
>> .
>> How to predict the cluster point for new vectors in ML version of kmeans ?
>>
>> Regards,
>> Rajesh
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Is there any scheduled release date for Spark 2.1.0?

2016-12-23 Thread Aseem Bansal



Re: Is Spark launcher's listener API considered production ready?

2016-11-04 Thread Aseem Bansal
Does anyone have any idea about this?

On Thu, Nov 3, 2016 at 12:52 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> While using Spark launcher's listener we came across few cases where the
> failures were not being reported correctly.
>
>
>- https://issues.apache.org/jira/browse/SPARK-17742
>- https://issues.apache.org/jira/browse/SPARK-18241
>
> So just wanted to ensure whether this API considered production ready and
> has anyone used it is production successfully? Has anyone used it? It seems
> that we need to handle the failures ourselves. How did you handle the
> failure cases?
>


Is Spark launcher's listener API considered production ready?

2016-11-03 Thread Aseem Bansal
While using Spark launcher's listener we came across few cases where the
failures were not being reported correctly.


   - https://issues.apache.org/jira/browse/SPARK-17742
   - https://issues.apache.org/jira/browse/SPARK-18241

So I just wanted to check whether this API is considered production ready, and
whether anyone has used it in production successfully. It seems
that we need to handle the failures ourselves. How did you handle the
failure cases?
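
For reference, a hedged sketch of the launcher listener API being discussed
(the jar path, main class, and master are placeholder assumptions): the
listener only receives state transitions, so failure states still have to be
interpreted and handled by the caller, which is the concern raised above.

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle = new SparkLauncher()
  .setAppResource("/path/to/app.jar")        // placeholder
  .setMainClass("com.example.Main")          // placeholder
  .setMaster("yarn")
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"state: ${h.getState}")       // e.g. SUBMITTED, RUNNING, FAILED, FINISHED
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })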


Re: [SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Aseem Bansal
To add to the above: I have already checked the documentation and API, and even
looked at the source code at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L882
but I could not find anything, hence I am asking here.

On Fri, Oct 28, 2016 at 4:26 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> Hi
>
> We are trying to use some of our artifacts as dependencies while
> submitting spark jobs. To specify the remote artifactory URL we are using
> the following syntax
>
> https://USERNAME:passw...@artifactory.companyname.com/
> artifactory/COMPANYNAME-libs
>
> But the resolution fails. Although the URL which is in the logs for the
> artifact is accessible via a browser due to the username password being
> present
>
> So to use an enterprise artifactory with spark is there a special way to
> specify the username and password when passing the repositories String?
>


[SPARK 2.0.0] Specifying remote repository when submitting jobs

2016-10-28 Thread Aseem Bansal
Hi

We are trying to use some of our artifacts as dependencies while submitting
spark jobs. To specify the remote artifactory URL we are using the
following syntax

https://USERNAME:passw...@artifactory.companyname.com/artifactory/COMPANYNAME-libs

But the resolution fails, although the URL printed in the logs for the
artifact is accessible via a browser because the username and password are
present.

So, to use an enterprise Artifactory with Spark, is there a special way to
specify the username and password when passing the repositories string?
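
A hedged illustration of the spark-submit flags being discussed (hypothetical
artifact coordinates; whether credentials embedded in the --repositories URL
resolve correctly is exactly the open question in this thread):

spark-submit \
  --repositories https://USERNAME:PASSWORD@artifactory.companyname.com/artifactory/COMPANYNAME-libs \
  --packages com.companyname:some-artifact:1.0.0 \
  --class com.companyname.Main \
  app.jar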


Fwd: Need help with SVM

2016-10-26 Thread Aseem Bansal
He replied to me. Forwarding to the mailing list.

-- Forwarded message --
From: Aditya Vyas <adityavya...@gmail.com>
Date: Tue, Oct 25, 2016 at 8:16 PM
Subject: Re: Need help with SVM
To: Aseem Bansal <asmbans...@gmail.com>


Hello,
Here is the public gist: https://gist.github.com/aditya1702/760cd5c95a6adf2447347e0b087bc318

Do tell if you need more information

Regards,
Aditya

On Tue, Oct 25, 2016 at 8:11 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> Is there any labeled point with label 0 in your dataset?
>
> On Tue, Oct 25, 2016 at 2:13 AM, aditya1702 <adityavya...@gmail.com>
> wrote:
>
>> Hello,
>> I am using linear SVM to train my model and generate a line through my
>> data.
>> However my model always predicts 1 for all the feature examples. Here is
>> my
>> code:
>>
>> print data_rdd.take(5)
>> [LabeledPoint(1.0, [1.9643,4.5957]), LabeledPoint(1.0, [2.2753,3.8589]),
>> LabeledPoint(1.0, [2.9781,4.5651]), LabeledPoint(1.0, [2.932,3.5519]),
>> LabeledPoint(1.0, [3.5772,2.856])]
>>
>> 
>> 
>> from pyspark.mllib.classification import SVMWithSGD
>> from pyspark.mllib.linalg import Vectors
>> from sklearn.svm import SVC
>> data_rdd=x_df.map(lambda x:LabeledPoint(x[1],x[0]))
>>
>> model = SVMWithSGD.train(data_rdd, iterations=1000,regParam=1)
>>
>> X=x_df.map(lambda x:x[0]).collect()
>> Y=x_df.map(lambda x:x[1]).collect()
>>
>> 
>> 
>> pred=[]
>> for i in X:
>>   pred.append(model.predict(i))
>> print pred
>>
>> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
>> 1,
>> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
>> 1]
>>
>>
>> My dataset is as follows:
>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27955/Screen_Shot_2016-10-25_at_2.png>
>>
>>
>> Can someone please help?
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-with-SVM-tp27955.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


What syntax can be used to specify the latest version of JAR found while using spark submit

2016-10-26 Thread Aseem Bansal
Hi

Can someone please share their thoughts on
http://stackoverflow.com/questions/40259022/what-syntax-can-be-used-to-specify-the-latest-version-of-jar-found-while-using-s


Can application JAR name contain + for dependency resolution to latest version?

2016-10-26 Thread Aseem Bansal
Hi

While using spark-submit to submit spark jobs, is the exact name of the JAR
file necessary? Or is there a way to use something like `1.0.+` to denote the
latest version found?


Re: Need help with SVM

2016-10-25 Thread Aseem Bansal
Is there any labeled point with label 0 in your dataset?

On Tue, Oct 25, 2016 at 2:13 AM, aditya1702  wrote:

> Hello,
> I am using linear SVM to train my model and generate a line through my
> data.
> However my model always predicts 1 for all the feature examples. Here is my
> code:
>
> print data_rdd.take(5)
> [LabeledPoint(1.0, [1.9643,4.5957]), LabeledPoint(1.0, [2.2753,3.8589]),
> LabeledPoint(1.0, [2.9781,4.5651]), LabeledPoint(1.0, [2.932,3.5519]),
> LabeledPoint(1.0, [3.5772,2.856])]
>
> 
> 
> from pyspark.mllib.classification import SVMWithSGD
> from pyspark.mllib.linalg import Vectors
> from sklearn.svm import SVC
> data_rdd=x_df.map(lambda x:LabeledPoint(x[1],x[0]))
>
> model = SVMWithSGD.train(data_rdd, iterations=1000,regParam=1)
>
> X=x_df.map(lambda x:x[0]).collect()
> Y=x_df.map(lambda x:x[1]).collect()
>
> 
> 
> pred=[]
> for i in X:
>   pred.append(model.predict(i))
> print pred
>
> [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1]
>
>
> My dataset is as follows:
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27955/Screen_Shot_2016-10-25_at_2.png>
>
>
> Can someone please help?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-with-SVM-tp27955.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: mllib model in production web API

2016-10-18 Thread Aseem Bansal
Hi Vincent

I am not sure whether you are asking me or Nicolas. If me, then no, we
didn't. We never used Akka and weren't even aware that it has such
capabilities. We are using the Java API, so we don't have Akka as a dependency
right now.

On Tue, Oct 18, 2016 at 12:47 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Hi
> Did you try applying the model with akka instead of spark ?
> https://spark-summit.org/eu-2015/events/real-time-anomaly-
> detection-with-spark-ml-and-akka/
>
> Le 18 oct. 2016 5:58 AM, "Aseem Bansal" <asmbans...@gmail.com> a écrit :
>
>> @Nicolas
>>
>> No, ours is different. We required predictions within 10ms time frame so
>> we needed much less latency than that.
>>
>> Every algorithm has some parameters, correct? We took the parameters from
>> the ml model and used them to create the mllib package's model. The mllib
>> model's prediction time was much faster compared to the ml package's transformation.
>> So essentially use spark's distributed machine learning library to train
>> the model, save to S3, load from S3 in a different system and then convert
>> it into the vector based API model for actual predictions.
>>
>> There were obviously some transformations involved but we didn't use
>> Pipeline for those transformations. Instead, we re-wrote them for the
>> Vector based API. I know it's not perfect but if we had used the
>> transformations within the pipeline that would make us dependent on spark's
>> distributed API and we didn't see how we will really reach our latency
>> requirements. Would have been much simpler and more DRY if the
>> PipelineModel had a predict method based on vectors and was not distributed.
>>
>> As you can guess it is very much model-specific and more work. If we
>> decide to use another type of Model we will have to add conversion
>> code/transformation code for that also. Only if spark exposed a prediction
>> method which is as fast as the old machine learning package.
>>
>> On Sat, Oct 15, 2016 at 8:42 PM, Nicolas Long <nicolasl...@gmail.com>
>> wrote:
>>
>>> Hi Sean and Aseem,
>>>
>>> thanks both. A simple thing which sped things up greatly was simply to
>>> load our sql (for one record effectively) directly and then convert to a
>>> dataframe, rather than using Spark to load it. Sounds stupid, but this took
>>> us from > 5 seconds to ~1 second on a very small instance.
>>>
>>> Aseem: can you explain your solution a bit more? I'm not sure I
>>> understand it. At the moment we load our models from S3
>>> (RandomForestClassificationModel.load(..) ) and then store that in an
>>> object property so that it persists across requests - this is in Scala. Is
>>> this essentially what you mean?
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 12 October 2016 at 10:52, Aseem Bansal <asmbans...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> Faced a similar issue. Our solution was to load the model, cache it
>>>> after converting it to a model from mllib and then use that instead of ml
>>>> model.
>>>>
>>>> On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> I don't believe it will ever scale to spin up a whole distributed job
>>>>> to serve one request. You can look possibly at the bits in mllib-local. 
>>>>> You
>>>>> might do well to export as something like PMML either with Spark's export
>>>>> or JPMML and then load it into a web container and score it, without Spark
>>>>> (possibly also with JPMML, OpenScoring)
>>>>>
>>>>>
>>>>> On Tue, Oct 11, 2016, 17:53 Nicolas Long <nicolasl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> so I have a model which has been stored in S3. And I have a Scala
>>>>>> webapp which for certain requests loads the model and transforms 
>>>>>> submitted
>>>>>> data against it.
>>>>>>
>>>>>> I'm not sure how to run this quickly on a single instance though. At
>>>>>> the moment Spark is being bundled up with the web app in an uberjar (sbt
>>>>>> assembly).
>>>>>>
>>>>>> But the process is quite slow. I'm aiming for responses < 1 sec so
>>>>>> that the webapp can respond quickly to requests. When I look the Spark 

Re: mllib model in production web API

2016-10-17 Thread Aseem Bansal
@Nicolas

No, ours is different. We required predictions within a 10 ms time frame, so
we needed much lower latency than that.

Every algorithm has some parameters, correct? We took the parameters from
the trained ml model and used them to create the mllib package's model. The
mllib model's prediction time was much faster compared to the ml package's
transformation. So essentially we use Spark's distributed machine learning
library to train the model, save it to S3, load it from S3 in a different
system, and then convert it into the vector-based API model for actual
predictions.

There were obviously some transformations involved but we didn't use
Pipeline for those transformations. Instead, we re-wrote them for the
Vector based API. I know it's not perfect but if we had used the
transformations within the pipeline that would make us dependent on spark's
distributed API and we didn't see how we will really reach our latency
requirements. Would have been much simpler and more DRY if the
PipelineModel had a predict method based on vectors and was not distributed.

As you can guess, it is very much model-specific and more work. If we decide
to use another type of Model we will have to add conversion/transformation
code for that as well. If only Spark exposed a prediction method which is as
fast as the one in the old machine learning package.
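
For reference, a hedged sketch of that conversion for a binary ml
LogisticRegressionModel (the S3 path is a placeholder, an active SparkSession
is assumed for load(), and other model types need their own conversion code, as
noted above):

import org.apache.spark.ml.classification.{LogisticRegressionModel => MlLRModel}
import org.apache.spark.mllib.classification.{LogisticRegressionModel => MllibLRModel}
import org.apache.spark.mllib.linalg.Vectors

// Load the DataFrame-trained model, then rebuild the Vector-based mllib model
// from its learned parameters so predict(Vector) can be called without a Spark job.
val mlModel = MlLRModel.load("s3://bucket/path/to/model")   // placeholder path

val mllibModel = new MllibLRModel(
  Vectors.dense(mlModel.coefficients.toArray),  // learned weights
  mlModel.intercept                             // learned intercept
)

// Fast, single-row scoring:
// mllibModel.predict(Vectors.dense(featureArray))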

On Sat, Oct 15, 2016 at 8:42 PM, Nicolas Long <nicolasl...@gmail.com> wrote:

> Hi Sean and Aseem,
>
> thanks both. A simple thing which sped things up greatly was simply to
> load our sql (for one record effectively) directly and then convert to a
> dataframe, rather than using Spark to load it. Sounds stupid, but this took
> us from > 5 seconds to ~1 second on a very small instance.
>
> Aseem: can you explain your solution a bit more? I'm not sure I understand
> it. At the moment we load our models from S3 (
> RandomForestClassificationModel.load(..) ) and then store that in an
> object property so that it persists across requests - this is in Scala. Is
> this essentially what you mean?
>
>
>
>
>
>
> On 12 October 2016 at 10:52, Aseem Bansal <asmbans...@gmail.com> wrote:
>
>> Hi
>>
>> Faced a similar issue. Our solution was to load the model, cache it after
>> converting it to a model from mllib and then use that instead of ml model.
>>
>> On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> I don't believe it will ever scale to spin up a whole distributed job to
>>> serve one request. You can look possibly at the bits in mllib-local. You
>>> might do well to export as something like PMML either with Spark's export
>>> or JPMML and then load it into a web container and score it, without Spark
>>> (possibly also with JPMML, OpenScoring)
>>>
>>>
>>> On Tue, Oct 11, 2016, 17:53 Nicolas Long <nicolasl...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> so I have a model which has been stored in S3. And I have a Scala
>>>> webapp which for certain requests loads the model and transforms submitted
>>>> data against it.
>>>>
>>>> I'm not sure how to run this quickly on a single instance though. At
>>>> the moment Spark is being bundled up with the web app in an uberjar (sbt
>>>> assembly).
>>>>
>>>> But the process is quite slow. I'm aiming for responses < 1 sec so that
>>>> the webapp can respond quickly to requests. When I look the Spark UI I see:
>>>>
>>>> Summary Metrics for 1 Completed Tasks
>>>>
>>>> Metric                     | Min   | 25th percentile | Median | 75th percentile | Max
>>>> Duration                   | 94 ms | 94 ms           | 94 ms  | 94 ms           | 94 ms
>>>> Scheduler Delay            | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
>>>> Task Deserialization Time  | 3 s   | 3 s             | 3 s    | 3 s             | 3 s
>>>> GC Time                    | 2 s   | 2 s             | 2 s    | 2 s             | 2 s
>>>> Result Serialization Time  | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
>>>> Getting Result Time        | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
>>>> Peak Execution Memory      | 0.0 B | 0.0 B           | 0.0 B  | 0.0 B           | 0.0 B
>>>>
>>>> I don't really understand why deserialization and GC should take so
>>>> long when the models are already loaded. Is this evidence I am doing
>>>> something wrong? And where can I get a better understanding on how Spark
>>>> works under the hood here, and how best to do a standalone/bundled jar
>>>> deployment?
>>>>
>>>> Thanks!
>>>>
>>>> Nic
>>>>
>>>
>>
>


Re: mllib model in production web API

2016-10-12 Thread Aseem Bansal
Hi

Faced a similar issue. Our solution was to load the model, convert it to the
corresponding mllib model, cache that, and then use it instead of the ml model.

On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen  wrote:

> I don't believe it will ever scale to spin up a whole distributed job to
> serve one request. You can look possibly at the bits in mllib-local. You
> might do well to export as something like PMML either with Spark's export
> or JPMML and then load it into a web container and score it, without Spark
> (possibly also with JPMML, OpenScoring)
>
>
> On Tue, Oct 11, 2016, 17:53 Nicolas Long  wrote:
>
>> Hi all,
>>
>> so I have a model which has been stored in S3. And I have a Scala webapp
>> which for certain requests loads the model and transforms submitted data
>> against it.
>>
>> I'm not sure how to run this quickly on a single instance though. At the
>> moment Spark is being bundled up with the web app in an uberjar (sbt
>> assembly).
>>
>> But the process is quite slow. I'm aiming for responses < 1 sec so that
>> the webapp can respond quickly to requests. When I look the Spark UI I see:
>>
>> Summary Metrics for 1 Completed Tasks
>>
>> Metric                     | Min   | 25th percentile | Median | 75th percentile | Max
>> Duration                   | 94 ms | 94 ms           | 94 ms  | 94 ms           | 94 ms
>> Scheduler Delay            | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
>> Task Deserialization Time  | 3 s   | 3 s             | 3 s    | 3 s             | 3 s
>> GC Time                    | 2 s   | 2 s             | 2 s    | 2 s             | 2 s
>> Result Serialization Time  | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
>> Getting Result Time        | 0 ms  | 0 ms            | 0 ms   | 0 ms            | 0 ms
>> Peak Execution Memory      | 0.0 B | 0.0 B           | 0.0 B  | 0.0 B           | 0.0 B
>>
>> I don't really understand why deserialization and GC should take so long
>> when the models are already loaded. Is this evidence I am doing something
>> wrong? And where can I get a better understanding on how Spark works under
>> the hood here, and how best to do a standalone/bundled jar deployment?
>>
>> Thanks!
>>
>> Nic
>>
>


Reading from and writing to different S3 buckets in spark

2016-10-12 Thread Aseem Bansal
Hi

I want to read CSV from one bucket, do some processing and write to a
different bucket. I know the way to set S3 credentials using

jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)

But the problem is that Spark is lazy. So if I do the following:

   - set credentials 1
   - read input csv
   - do some processing
   - set credentials 2
   - write result csv

Then there is a chance that, due to laziness, while reading the input csv the
program may try to use credentials 2.

A solution is to cache the result, but in case there is not enough storage it
is possible that the csv will be re-read. So how do I handle this situation?
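
One workaround I am considering (take this as a sketch: it assumes the s3n://
filesystem and that neither secret key contains a '/' character, and the keys,
buckets and paths below are placeholders) is to embed each bucket's credentials
in its URI. Then the read and the write each carry their own credentials, and
it no longer matters when laziness actually triggers the read:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TwoBucketJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("two-bucket-csv").getOrCreate();

        // Credentials for the input bucket live in the read URI...
        Dataset<Row> input = spark.read()
            .option("header", "true")
            .csv("s3n://READ_KEY:READ_SECRET@input-bucket/path/in.csv");

        Dataset<Row> result = input.filter(input.col("someColumn").isNotNull()); // placeholder processing

        // ...and credentials for the output bucket live in the write URI.
        result.write()
            .option("header", "true")
            .csv("s3n://WRITE_KEY:WRITE_SECRET@output-bucket/path/out");
    }
}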


Re: spark listener do not get fail status

2016-09-30 Thread Aseem Bansal
Hi

In case my previous email was lacking in detail, here are some specifics,
followed by a rough sketch of how the listener is wired up:

- using Spark 2.0.0
- launching the job
using org.apache.spark.launcher.SparkLauncher.startApplication(myListener)
- checking state in the listener's stateChanged method
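
The wiring looks roughly like this (the jar path, main class and master are
placeholders); the handle does reach a final state, but that state is never
FAILED even when the driver throws an exception or calls System.exit(-1):

import org.apache.spark.launcher.SparkAppHandle;
import org.apache.spark.launcher.SparkLauncher;

public class LaunchAndWatch {
    public static void main(String[] args) throws Exception {
        SparkAppHandle handle = new SparkLauncher()
            .setAppResource("/path/to/my-job.jar")    // placeholder
            .setMainClass("com.example.MyJob")        // placeholder
            .setMaster("local[*]")
            .startApplication(new SparkAppHandle.Listener() {
                @Override
                public void stateChanged(SparkAppHandle h) {
                    SparkAppHandle.State state = h.getState();
                    System.out.println("State changed: " + state);
                    if (state.isFinal() && state != SparkAppHandle.State.FINISHED) {
                        System.out.println("Job did not finish cleanly: " + state);
                    }
                }

                @Override
                public void infoChanged(SparkAppHandle h) {
                    // application id updates arrive here
                }
            });

        // Block until the application reaches a final state.
        while (!handle.getState().isFinal()) {
            Thread.sleep(1000);
        }
    }
}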


On Thu, Sep 29, 2016 at 5:24 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> Hi
>
> Submitting job via spark api but I never get fail status even when the job
> throws an exception or exit via System.exit(-1)
>
> How do I indicate via SparkListener API that my job failed?
>


spark listener do not get fail status

2016-09-29 Thread Aseem Bansal
Hi

I am submitting a job via the Spark launcher API but I never get a fail
status, even when the job throws an exception or exits via System.exit(-1).

How do I indicate via SparkListener API that my job failed?


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-02 Thread Aseem Bansal
Hi

Thanks for all the details. I was able to convert from ml.NaiveBayesModel
to mllib.NaiveBayesModel and get it done. It is fast for our use case.

Just one question. Before mllib is removed can ml package be expected to
reach feature parity with mllib?

On Thu, Sep 1, 2016 at 7:12 PM, Sean Owen  wrote:

> Yeah there's a method to predict one Vector in the .mllib API but not
> the newer one. You could possibly hack your way into calling it
> anyway, or just clone the logic.
>
> On Thu, Sep 1, 2016 at 2:37 PM, Nick Pentreath 
> wrote:
> > Right now you are correct that Spark ML APIs do not support predicting
> on a
> > single instance (whether Vector for the models or a Row for a pipeline).
> >
> > See https://issues.apache.org/jira/browse/SPARK-10413 and
> > https://issues.apache.org/jira/browse/SPARK-16431 (duplicate) for some
> > discussion.
> >
> > There may be movement in the short term to support the single Vector
> case.
> > But anything for pipelines is not immediately on the horizon I'd say.
> >
> > N
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
I understand from a theoretical perspective that the model itself is not
distributed. Thus it can be used for making predictions for a vector or an
RDD. But speaking in terms of the APIs provided by Spark 2.0.0, when I
create a model from large data the recommended way is to use the ml
library's fit. I have the option of getting a
http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/NaiveBayesModel.html
 or wrapping it as
http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/PipelineModel.html

Neither of these has any method which supports Vectors. How do I
bridge this gap in the API from my side? Is there anything in Spark's API
which I have missed? Or do I need to extract the parameters and use another
library for the predictions for a single row?

On Thu, Sep 1, 2016 at 6:38 PM, Sean Owen <so...@cloudera.com> wrote:

> How the model is built isn't that related to how it scores things.
> Here we're just talking about scoring. NaiveBayesModel can score
> Vector which is not a distributed entity. That's what you want to use.
> You do not want to use a whole distributed operation to score one
> record. This isn't related to .ml vs .mllib APIs.
>
> On Thu, Sep 1, 2016 at 2:01 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
> > I understand your point.
> >
> > Is there something like a bridge? Is it possible to convert the model
> > trained using Dataset (i.e. the distributed one) to the one which
> uses
> > vectors? In Spark 1.6 the mllib packages had everything as per vectors
> and
> > that should be faster as per my understanding. But in many spark blogs we
> > saw that spark is moving towards the ml package and mllib package will be
> > phased out. So how can someone train using huge data and then use it on a
> > row by row basis?
> >
> > Thanks for your inputs.
> >
> > On Thu, Sep 1, 2016 at 6:15 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> If you're trying to score a single example by way of an RDD or
> >> Dataset, then no it will never be that fast. It's a whole distributed
> >> operation, and while you might manage low latency for one job at a
> >> time, consider what will happen when hundreds of them are running at
> >> once. It's just huge overkill for scoring a single example (but,
> >> pretty fine for high-er latency, high throughput batch operations)
> >>
> >> However if you're scoring a Vector locally I can't imagine it's that
> >> slow. It does some linear algebra but it's not that complicated. Even
> >> something unoptimized should be fast.
> >>
> >> On Thu, Sep 1, 2016 at 1:37 PM, Aseem Bansal <asmbans...@gmail.com>
> wrote:
> >> > Hi
> >> >
> >> > Currently trying to use NaiveBayes to make predictions. But facing
> >> > issues
> >> > that doing the predictions takes order of few seconds. I tried with
> >> > other
> >> > model examples shipped with Spark but they also ran in minimum of 500
> ms
> >> > when I used Scala API. With
> >> >
> >> > Has anyone used spark ML to do predictions for a single row under 20
> ms?
> >> >
> >> > I am not doing premature optimization. The use case is that we are
> doing
> >> > real time predictions and we need results 20ms. Maximum 30ms. This is
> a
> >> > hard
> >> > limit for our use case.
> >
> >
>


Re: Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
I understand your point.

Is there something like a bridge? Is it possible to convert the model
trained using a Dataset (i.e. the distributed one) to the one which uses
Vectors? In Spark 1.6 the mllib package had everything in terms of Vectors,
and that should be faster as per my understanding. But in many Spark blogs we
saw that Spark is moving towards the ml package and the mllib package will be
phased out. So how can someone train using huge data and then use the model
on a row-by-row basis?

Thanks for your inputs.

On Thu, Sep 1, 2016 at 6:15 PM, Sean Owen <so...@cloudera.com> wrote:

> If you're trying to score a single example by way of an RDD or
> Dataset, then no it will never be that fast. It's a whole distributed
> operation, and while you might manage low latency for one job at a
> time, consider what will happen when hundreds of them are running at
> once. It's just huge overkill for scoring a single example (but,
> pretty fine for high-er latency, high throughput batch operations)
>
> However if you're scoring a Vector locally I can't imagine it's that
> slow. It does some linear algebra but it's not that complicated. Even
> something unoptimized should be fast.
>
> On Thu, Sep 1, 2016 at 1:37 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
> > Hi
> >
> > Currently trying to use NaiveBayes to make predictions. But facing issues
> > that doing the predictions takes order of few seconds. I tried with other
> > model examples shipped with Spark but they also ran in minimum of 500 ms
> > when I used Scala API. With
> >
> > Has anyone used spark ML to do predictions for a single row under 20 ms?
> >
> > I am not doing premature optimization. The use case is that we are doing
> > real time predictions and we need results 20ms. Maximum 30ms. This is a
> hard
> > limit for our use case.
>


Spark 2.0.0 - has anyone used spark ML to do predictions under 20ms?

2016-09-01 Thread Aseem Bansal
Hi

I am currently trying to use NaiveBayes to make predictions, but I am facing
the issue that a single prediction takes on the order of a few seconds. I
tried other model examples shipped with Spark but they also took a minimum of
500 ms when I used the Scala API.

Has anyone used spark ML to do predictions for a single row under 20 ms?

I am not doing premature optimization. The use case is that we are doing real
time predictions and we need results within 20ms, maximum 30ms. This is a
hard limit for our use case.


Re: Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Aseem Bansal
there is already a mail thread for scala vs python. check the archives

On Thu, Sep 1, 2016 at 5:18 PM, ayan guha <guha.a...@gmail.com> wrote:

> How about Scala vs Python?
>
> On Thu, Sep 1, 2016 at 7:27 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> I can't think of a situation where it would be materially different.
>> Both are using the JVM-based APIs directly. Here and there there's a
>> tiny bit of overhead in using the Java APIs because something is
>> translated from a Java-style object to a Scala-style object, but this
>> is generally trivial.
>>
>> On Thu, Sep 1, 2016 at 10:06 AM, Aseem Bansal <asmbans...@gmail.com>
>> wrote:
>> > Hi
>> >
>> > Would there be any significant performance difference when using Java
>> vs.
>> > Scala API?
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Spark 2.0.0 - Java vs Scala performance difference

2016-09-01 Thread Aseem Bansal
Hi

Would there be any significant performance difference when using Java vs.
Scala API?


spark 2.0.0 - code generation inputadapter_value is not rvalue

2016-09-01 Thread Aseem Bansal
Hi

Does Spark do some code generation? I am trying to use map on a Java RDD and
I am getting a huge generated file with 17406 lines in my terminal, and then
a stacktrace:

16/09/01 13:57:36 INFO FileOutputCommitter: File Output Committer Algorithm
version is 1
16/09/01 13:57:36 INFO DefaultWriterContainer: Using output committer class
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

16/09/01 13:57:38 ERROR CodeGenerator: failed to compile:
org.codehaus.commons.compiler.CompileException: File 'generated.java',
Line 108, Column 36: Expression "inputadapter_value" is not an rvalue

But I have not added anything with that name


Spark 2.0.0 - What all access is needed to save model to S3?

2016-08-29 Thread Aseem Bansal
Hi

What all access is needed to save a model to S3? Initially I thought it
should be only write. Then I found it also needs delete to delete temporary
files. Now that they have given me DELETE access, I am getting the error:

Exception in thread "main" org.apache.hadoop.fs.s3.S3Exception:
org.jets3t.service.S3ServiceException: S3 PUT failed for
'/dev-qa_%24folder%24' XML Error Message

So what access is needed in total? Asking this as I need to ask someone to give
me appropriate access and I cannot just ask them to give me all access to
the bucket.

I have raised https://issues.apache.org/jira/browse/SPARK-17307 to ask for
better documentation but till that is done I will need to know what access
is needed. Can someone help?


spark 2.0.0 - when saving a model to S3 spark creates temporary files. Why?

2016-08-24 Thread Aseem Bansal
Hi

When Spark saves anything to S3 it creates temporary files. Why? Asking
this as it requires the access credentials to be given
delete permissions along with write permissions.


Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
Thanks everyone for clarifying.

On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
> and it mentioned that spark streaming actually mini-batch not actual
> streaming.
>
> I have not used streaming and I am not sure what is the difference in the
> 2 terms. Hence could not make a judgement myself.
>


Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
and it mentioned that Spark Streaming is actually mini-batch rather than true
streaming.

I have not used streaming and I am not sure what the difference between the
two terms is, hence I could not make a judgement myself.


Spark 2.0.0 - Java API - Modify a column in a dataframe

2016-08-11 Thread Aseem Bansal
Hi

I have a Dataset<Row>.

I will only change a String into another String, so there will be no schema changes.

Is there a way I can run a map on it? I have seen the function at
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html#map(org.apache.spark.api.java.function.MapFunction,%20org.apache.spark.sql.Encoder)

But the problem is the second argument: what should I give? The rows are not
in any specific bean format, so I cannot create an encoder for a bean. I want
the schema to remain the same.
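
The closest I have gotten is to reuse the Dataset's own schema as the Encoder
(a sketch; note that RowEncoder lives in the catalyst package and is not part
of the stable public API, and the column name below is a placeholder):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;

// df is the existing Dataset<Row>
Dataset<Row> mapped = df.map(
    (MapFunction<Row, Row>) row -> {
        Object[] values = new Object[row.length()];
        for (int i = 0; i < row.length(); i++) {
            values[i] = row.get(i);                       // copy every column as-is
        }
        int idx = row.fieldIndex("someStringColumn");     // placeholder column name
        values[idx] = row.getString(idx).trim();          // String in, String out
        return RowFactory.create(values);
    },
    RowEncoder.apply(df.schema()));                       // same schema as the input

Since the mapping is String-to-String, the schema really is unchanged, which
is why reusing df.schema() should be safe here.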


Re: na.fill doesn't work

2016-08-11 Thread Aseem Bansal
Check the schema of the data frame. It may be that your columns are Strings
while you are trying to give a default for numerical data.
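
A quick way to check and fix it, assuming a column that came in as a string
(the column name here is a placeholder): cast it to a numeric type first and
then fill.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

df.printSchema();   // confirm whether the columns are StringType

Dataset<Row> filled = df
    .withColumn("amount", col("amount").cast("double"))   // placeholder column name
    .na().fill(0.0);                                       // now the fill applies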

On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey  wrote:

> Hi everybody,
>
> I have a data frame after many transformation, my final task is fill na's
> with zeros, but I run this command : df_fil1 = df_fil.na.fill(0), but this
> command doesn't work nulls doesn't disappear.
>
> I did a toy test it works correctly.
>
> I don't understand what happend.
>
> Thanks in advance.
>
> Samir
>


Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-10 Thread Aseem Bansal
To those interested: I converted the data frame to an RDD and then created a
new data frame from it, which allows supplying a schema.

But someone should probably improve how the .as() function is used.
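
A rough Java sketch of that workaround (the path, column names and types are
placeholders, and the string-to-number parsing is deliberately naive):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder().appName("csv-subset").getOrCreate();

// 1. Read the whole CSV; every column comes back as a string.
Dataset<Row> csv = spark.read().option("header", "true").csv("/path/to/input.csv");

// 2. Keep only the columns that are needed.
Dataset<Row> selected = csv.select("name", "age", "score");

// 3. Go through an RDD, coerce the types, and rebuild a DataFrame with an explicit schema.
StructType schema = new StructType()
    .add("name", DataTypes.StringType)
    .add("age", DataTypes.IntegerType)
    .add("score", DataTypes.DoubleType);

JavaRDD<Row> typedRows = selected.toJavaRDD().map(r -> RowFactory.create(
    r.getString(0),
    r.isNullAt(1) ? null : Integer.valueOf(r.getString(1).trim()),
    r.isNullAt(2) ? null : Double.valueOf(r.getString(2).trim())));

Dataset<Row> typed = spark.createDataFrame(typedRows, schema);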

On Mon, Aug 8, 2016 at 1:05 PM, Ewan Leith <ewan.le...@realitymine.com>
wrote:

> Hmm I’m not sure, I don’t use the Java API sorry
>
>
>
> The simplest way to work around it would be to read the csv as a text file
> using sparkContext textFile, split each row based on a comma, then convert
> it to a dataset afterwards.
>
>
>
> *From:* Aseem Bansal [mailto:asmbans...@gmail.com]
> *Sent:* 08 August 2016 07:37
> *To:* Ewan Leith <ewan.le...@realitymine.com>
> *Cc:* user <user@spark.apache.org>
> *Subject:* Re: Spark 2.0.0 - Apply schema on few columns of dataset
>
>
>
> Hi Ewan
>
>
>
> The .as function take a single encoder or a single string or a single
> Symbol. I have like more than 10 columns so I cannot use the tuple
> functions. Passing using bracket does not work.
>
>
>
> On Mon, Aug 8, 2016 at 11:26 AM, Ewan Leith <ewan.le...@realitymine.com>
> wrote:
>
> Looking at the encoders api documentation at
>
> http://spark.apache.org/docs/latest/api/java/
>
> == Java == Encoders are specified by calling static methods on Encoders
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Encoders.html>
> .
>
> List<String> data = Arrays.asList("abc", "abc", "xyz");
> Dataset<String> ds = context.createDataset(data, Encoders.STRING());
>
> I think you should be calling
>
> .as((Encoders.STRING(), Encoders.STRING()))
>
> or similar
>
> Ewan
>
>
>
> On 8 Aug 2016 06:10, Aseem Bansal <asmbans...@gmail.com> wrote:
>
> Hi All
>
>
>
> Has anyone done this with Java API?
>
>
>
> On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
>
> I need to use few columns out of a csv. But as there is no option to read
> few columns out of csv so
>
>  1. I am reading the whole CSV using SparkSession.csv()
>
>  2.  selecting few of the columns using DataFrame.select()
>
>  3. applying schema using the .as() function of Dataset.  I tried to
> extent org.apache.spark.sql.Encoder as the input for as function
>
>
>
> But I am getting the following exception
>
>
>
> Exception in thread "main" java.lang.RuntimeException: Only expression
> encoders are supported today
>
>
>
> So my questions are -
>
> 1. Is it possible to read few columns instead of whole CSV? I cannot
> change the CSV as that is upstream data
>
> 2. How do I apply schema to few columns if I cannot write my encoder?
>
>
>
>
>
>
>


Re: Multiple Sources Found for Parquet

2016-08-08 Thread Aseem Bansal
It seems that this is a common issue with Spark 2.0.0.

I faced a similar issue with CSV, and saw someone facing it with JSON:
https://issues.apache.org/jira/browse/SPARK-16893
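
To act on Ted's suggestion below, a few lines like these (class names taken
from the error message) will print which jar each source class is actually
being loaded from, which usually exposes the conflicting dependency:

String[] sources = {
    "org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat",
    "org.apache.spark.sql.execution.datasources.parquet.DefaultSource"
};
for (String name : sources) {
    try {
        Class<?> c = Class.forName(name);
        // getCodeSource() points at the jar the class was loaded from
        System.out.println(name + " -> " + c.getProtectionDomain().getCodeSource().getLocation());
    } catch (ClassNotFoundException e) {
        System.out.println(name + " -> not on the classpath");
    }
}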

On Mon, Aug 8, 2016 at 4:08 PM, Ted Yu  wrote:

> Can you examine classpath to see where *DefaultSource comes from ?*
>
> *Thanks*
>
> On Mon, Aug 8, 2016 at 2:34 AM, 金国栋  wrote:
>
>> I'm using Spark2.0.0 to do sql analysis over parquet files, when using
>> `read().parquet("path")`, or `write().parquet("path")` in Java(I followed
>> the example java file in source code exactly), I always encountered
>>
>> *Exception in thread "main" java.lang.RuntimeException: Multiple sources
>> found for parquet
>> (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat,
>> org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please
>> specify the fully qualified class name.*
>>
>> Any idea why?
>>
>> Thanks.
>>
>> Best,
>> Jelly
>>
>
>


Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Aseem Bansal
I am using the following to broadcast, and it explicitly requires a ClassTag:

sparkSession.sparkContext().broadcast
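
For anyone else hitting this, two ways around it from Java, as a sketch (the
HashMap is just an example value to broadcast): either build an (erased)
ClassTag by hand, or wrap the SparkContext in a JavaSparkContext, which
supplies a ClassTag for you.

import java.util.HashMap;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

HashMap<String, Double> lookup = new HashMap<>();   // example value to broadcast

// Option 1: pass an explicit ClassTag to the Scala API.
ClassTag<HashMap<String, Double>> tag = ClassTag$.MODULE$.apply(HashMap.class);
Broadcast<HashMap<String, Double>> b1 =
    sparkSession.sparkContext().broadcast(lookup, tag);

// Option 2: go through JavaSparkContext, which fills in a ClassTag internally.
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
Broadcast<HashMap<String, Double>> b2 = jsc.broadcast(lookup);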

On Mon, Aug 8, 2016 at 12:09 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Classtag is Scala concept (see http://docs.scala-lang.
> org/overviews/reflection/typetags-manifests.html) - although this should
> not be explicitly required - looking at http://spark.apache.org/
> docs/latest/api/scala/index.html#org.apache.spark.SparkContext we can see
> that in Scala the classtag tag is implicit and if your calling from Java
> http://spark.apache.org/docs/latest/api/scala/index.
> html#org.apache.spark.api.java.JavaSparkContext the classtag doesn't need
> to be specified (instead it uses a "fake" class tag automatically for you).
> Where are you seeing the different API?
>
> On Sun, Aug 7, 2016 at 11:32 PM, Aseem Bansal <asmbans...@gmail.com>
> wrote:
>
>> Earlier for broadcasting we just needed to use
>>
>> sparkcontext.broadcast(objectToBroadcast)
>>
>> But now it is
>>
>> sparkcontext.broadcast(objectToBroadcast, classTag)
>>
>> What is classTag here?
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-08 Thread Aseem Bansal
Hi Ewan

The .as function takes a single encoder, a single string, or a single
Symbol. I have more than 10 columns, so I cannot use the tuple functions, and
passing them using brackets does not work.

On Mon, Aug 8, 2016 at 11:26 AM, Ewan Leith <ewan.le...@realitymine.com>
wrote:

> Looking at the encoders api documentation at
>
> http://spark.apache.org/docs/latest/api/java/
>
> == Java == Encoders are specified by calling static methods on Encoders
> <http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Encoders.html>
> .
>
> List<String> data = Arrays.asList("abc", "abc", "xyz");
> Dataset<String> ds = context.createDataset(data, Encoders.STRING());
>
> I think you should be calling
>
> .as((Encoders.STRING(), Encoders.STRING()))
>
> or similar
>
> Ewan
>
> On 8 Aug 2016 06:10, Aseem Bansal <asmbans...@gmail.com> wrote:
>
> Hi All
>
> Has anyone done this with Java API?
>
> On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
>
> I need to use few columns out of a csv. But as there is no option to read
> few columns out of csv so
>  1. I am reading the whole CSV using SparkSession.csv()
>  2.  selecting few of the columns using DataFrame.select()
>  3. applying schema using the .as() function of Dataset.  I tried to
> extent org.apache.spark.sql.Encoder as the input for as function
>
> But I am getting the following exception
>
> Exception in thread "main" java.lang.RuntimeException: Only expression
> encoders are supported today
>
> So my questions are -
> 1. Is it possible to read few columns instead of whole CSV? I cannot
> change the CSV as that is upstream data
> 2. How do I apply schema to few columns if I cannot write my encoder?
>
>
>
>


Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Aseem Bansal
Earlier for broadcasting we just needed to use

sparkcontext.broadcast(objectToBroadcast)

But now it is

sparkcontext.broadcast(objectToBroadcast, classTag)

What is classTag here?


Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-07 Thread Aseem Bansal
Hi All

Has anyone done this with Java API?

On Fri, Aug 5, 2016 at 5:36 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> I need to use few columns out of a csv. But as there is no option to read
> few columns out of csv so
>  1. I am reading the whole CSV using SparkSession.csv()
>  2.  selecting few of the columns using DataFrame.select()
>  3. applying schema using the .as() function of Dataset.  I tried to
> extent org.apache.spark.sql.Encoder as the input for as function
>
> But I am getting the following exception
>
> Exception in thread "main" java.lang.RuntimeException: Only expression
> encoders are supported today
>
> So my questions are -
> 1. Is it possible to read few columns instead of whole CSV? I cannot
> change the CSV as that is upstream data
> 2. How do I apply schema to few columns if I cannot write my encoder?
>


Re: Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Aseem Bansal
Yes. This is what I am after. But I have to use the Java API. And using the
Java API I was not able to get the .as() function working

On Fri, Aug 5, 2016 at 7:09 PM, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> I don't understand where the issue is...
>
> ➜  spark git:(master) ✗ cat csv-logs/people-1.csv
> name,city,country,age,alive
> Jacek,Warszawa,Polska,42,true
>
> val df = spark.read.option("header", true).csv("csv-logs/people-1.csv")
> val nameCityPairs = df.select('name, 'city).as[(String, String)]
>
> scala> nameCityPairs.printSchema
> root
>  |-- name: string (nullable = true)
>  |-- city: string (nullable = true)
>
> Is this what you're after?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Aug 5, 2016 at 2:06 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
> > I need to use few columns out of a csv. But as there is no option to read
> > few columns out of csv so
> >  1. I am reading the whole CSV using SparkSession.csv()
> >  2.  selecting few of the columns using DataFrame.select()
> >  3. applying schema using the .as() function of Dataset.  I tried to
> > extent org.apache.spark.sql.Encoder as the input for as function
> >
> > But I am getting the following exception
> >
> > Exception in thread "main" java.lang.RuntimeException: Only expression
> > encoders are supported today
> >
> > So my questions are -
> > 1. Is it possible to read few columns instead of whole CSV? I cannot
> change
> > the CSV as that is upstream data
> > 2. How do I apply schema to few columns if I cannot write my encoder?
>


Spark 2.0.0 - Apply schema on few columns of dataset

2016-08-05 Thread Aseem Bansal
I need to use a few columns out of a csv. But as there is no option to read
only a few columns out of a csv:
 1. I am reading the whole CSV using SparkSession.csv()
 2. selecting a few of the columns using DataFrame.select()
 3. applying a schema using the .as() function of Dataset. I tried to
extend org.apache.spark.sql.Encoder as the input for the as function

But I am getting the following exception

Exception in thread "main" java.lang.RuntimeException: Only expression
encoders are supported today

So my questions are:
1. Is it possible to read only a few columns instead of the whole CSV? I
cannot change the CSV as that is upstream data.
2. How do I apply a schema to a few columns if I cannot write my own encoder?


What is "Developer API " in spark documentation?

2016-08-05 Thread Aseem Bansal
Hi

Many places in the Spark documentation say "Developer API". What does that mean?


Spark 2.0 - Case sensitive column names while reading csv

2016-08-03 Thread Aseem Bansal
While reading a csv via DataFrameReader, how can I make the column names case
sensitive?

http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html

None of the options specified mention case sensitivity

http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq)
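
The closest I have found so far is the spark.sql.caseSensitive setting (a
sketch: it is a SQL configuration rather than a DataFrameReader option, it
mainly affects how column names are resolved during analysis, and the path
below is a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("csv-case-sensitive")
    .config("spark.sql.caseSensitive", "true")   // case-sensitive column resolution
    .getOrCreate();

Dataset<Row> df = spark.read().option("header", "true").csv("/path/to/data.csv");
df.select("UserId");   // no longer silently matches a header spelled "userid"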