Hi Vincent

I am not sure whether you are asking me or Nicolas. If me, then no, we
didn't. I have never used Akka and wasn't even aware that it has such
capabilities. We are using the Java API, so we don't have Akka as a
dependency right now.
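
A minimal sketch of the approach described in the quoted message below
(train with Spark, then score locally on plain vectors using only the
learned parameters). It is not the code from this thread and not Spark's
API: it assumes a binary logistic regression purely for illustration, and
the class name LocalLogisticModel, its methods, and the example weights are
all hypothetical.

```java
/**
 * Hypothetical stand-in for a trained binary logistic regression model.
 * Only the learned parameters are kept, so scoring needs no Spark classes
 * and no distributed job.
 */
final class LocalLogisticModel {
    private final double[] weights;
    private final double intercept;

    LocalLogisticModel(double[] weights, double intercept) {
        this.weights = weights.clone(); // defensive copy of the parameters
        this.intercept = intercept;
    }

    /** Probability of the positive class for a single feature vector. */
    double predictProbability(double[] features) {
        if (features.length != weights.length) {
            throw new IllegalArgumentException(
                "expected " + weights.length + " features, got " + features.length);
        }
        double margin = intercept;
        for (int i = 0; i < weights.length; i++) {
            margin += weights[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-margin)); // logistic (sigmoid) function
    }

    /** Class label using a 0.5 threshold (an assumption of this sketch). */
    int predict(double[] features) {
        return predictProbability(features) >= 0.5 ? 1 : 0;
    }
}

final class LocalScoringSketch {
    public static void main(String[] args) {
        // Made-up parameters; in practice they would be read from the trained model.
        LocalLogisticModel model =
            new LocalLogisticModel(new double[] {0.5, -0.25}, 0.1);
        double[] features = {2.0, 4.0};
        System.out.println("probability = " + model.predictProbability(features));
        System.out.println("label = " + model.predict(features));
    }
}
```

Each call is a plain in-process method call on an array, which is the
property that matters for a latency budget of a few milliseconds. Any
feature transformations would have to be re-implemented in the same style.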

On Tue, Oct 18, 2016 at 12:47 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Hi
> Did you try applying the model with Akka instead of Spark?
> https://spark-summit.org/eu-2015/events/real-time-anomaly-detection-with-spark-ml-and-akka/
>
> On 18 Oct 2016 5:58 AM, "Aseem Bansal" <asmbans...@gmail.com> wrote:
>
>> @Nicolas
>>
>> No, ours is different. We required predictions within a 10 ms time frame,
>> so we needed much lower latency than that.
>>
>> Every algorithm has some parameters, correct? We took the parameters from
>> the ml package's model and used them to create the mllib package's model.
>> The mllib model's prediction time was much faster than the ml package's
>> transformation. So essentially we use Spark's distributed machine learning
>> library to train the model, save it to S3, load it from S3 in a different
>> system, and then convert it into the vector-based API model for the actual
>> predictions.
>>
>> There were obviously some transformations involved, but we didn't use a
>> Pipeline for them. Instead, we rewrote them for the vector-based API. I
>> know it's not perfect, but using the transformations within the pipeline
>> would have made us dependent on Spark's distributed API, and we didn't see
>> how we would reach our latency requirements that way. It would have been
>> much simpler and more DRY if PipelineModel had a non-distributed predict
>> method based on vectors.
>>
>> As you can guess, it is very much model-specific and more work. If we
>> decide to use another type of model, we will have to add conversion and
>> transformation code for that as well. If only Spark exposed a prediction
>> method as fast as the one in the old machine learning package.
>>
>> On Sat, Oct 15, 2016 at 8:42 PM, Nicolas Long <nicolasl...@gmail.com>
>> wrote:
>>
>>> Hi Sean and Aseem,
>>>
>>> Thanks, both. One simple thing that sped things up greatly was to load
>>> our SQL data (effectively a single record) directly and then convert it
>>> to a DataFrame, rather than using Spark to load it. It sounds stupid, but
>>> this took us from > 5 seconds to ~1 second on a very small instance.
>>>
>>> Aseem: can you explain your solution a bit more? I'm not sure I
>>> understand it. At the moment we load our models from S3
>>> (RandomForestClassificationModel.load(..)) and then store the result in
>>> an object property so that it persists across requests (this is in
>>> Scala). Is this essentially what you mean?
>>>
>>> On 12 October 2016 at 10:52, Aseem Bansal <asmbans...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> We faced a similar issue. Our solution was to load the model, convert
>>>> it to an mllib model, cache that, and then use it instead of the ml
>>>> model.
>>>>
>>>> On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> I don't believe it will ever scale to spin up a whole distributed job
>>>>> to serve one request. You could possibly look at the bits in
>>>>> mllib-local. You might do well to export the model as something like
>>>>> PMML, either with Spark's export or JPMML, and then load it into a web
>>>>> container and score it without Spark (possibly also with JPMML or
>>>>> OpenScoring).
>>>>>
>>>>>
>>>>> On Tue, Oct 11, 2016, 17:53 Nicolas Long <nicolasl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> so I have a model which has been stored in S3, and I have a Scala
>>>>>> webapp which, for certain requests, loads the model and transforms
>>>>>> submitted data against it.
>>>>>>
>>>>>> I'm not sure how to run this quickly on a single instance though. At
>>>>>> the moment Spark is being bundled up with the web app in an uberjar (sbt
>>>>>> assembly).
>>>>>>
>>>>>> But the process is quite slow. I'm aiming for responses < 1 sec so
>>>>>> that the webapp can respond quickly to requests. When I look at the
>>>>>> Spark UI I see:
>>>>>>
>>>>>> Summary Metrics for 1 Completed Tasks
>>>>>>
>>>>>> Metric                     Min    25th percentile  Median  75th percentile  Max
>>>>>> Duration                   94 ms  94 ms            94 ms   94 ms            94 ms
>>>>>> Scheduler Delay            0 ms   0 ms             0 ms    0 ms             0 ms
>>>>>> Task Deserialization Time  3 s    3 s              3 s     3 s              3 s
>>>>>> GC Time                    2 s    2 s              2 s     2 s              2 s
>>>>>> Result Serialization Time  0 ms   0 ms             0 ms    0 ms             0 ms
>>>>>> Getting Result Time        0 ms   0 ms             0 ms    0 ms             0 ms
>>>>>> Peak Execution Memory      0.0 B  0.0 B            0.0 B   0.0 B            0.0 B
>>>>>>
>>>>>> I don't really understand why deserialization and GC should take so
>>>>>> long when the models are already loaded. Is this evidence that I am
>>>>>> doing something wrong? And where can I get a better understanding of
>>>>>> how Spark works under the hood here, and of how best to do a
>>>>>> standalone/bundled jar deployment?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Nic
>>>>>>
>>>>>
>>>>
>>>
>>
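
A minimal sketch, in Java, of the load-once idea discussed in this thread
(load the model a single time and keep it in memory so that it persists
across requests). The class name CachedModel and the loader passed to it are
hypothetical. In a real application the loader would wrap the actual
model-loading call, which is not shown here.

```java
import java.util.Objects;
import java.util.function.Supplier;

/**
 * Hypothetical holder that runs an expensive loader at most once and then
 * serves the same instance to every caller, including concurrent ones.
 */
final class CachedModel<M> {
    private final Supplier<M> loader;
    private volatile M model; // null until the first successful load

    CachedModel(Supplier<M> loader) {
        this.loader = Objects.requireNonNull(loader, "loader");
    }

    /** Returns the cached model, loading it on first use. */
    M get() {
        M local = model;
        if (local == null) {
            synchronized (this) {
                local = model; // re-check under the lock
                if (local == null) {
                    local = Objects.requireNonNull(loader.get(), "loader returned null");
                    model = local;
                }
            }
        }
        return local;
    }
}

final class CachedModelSketch {
    public static void main(String[] args) {
        // The string stands in for a real model object in this sketch.
        CachedModel<String> cached = new CachedModel<>(() -> {
            System.out.println("loading model (expensive, happens once)");
            return "model";
        });
        System.out.println(cached.get());
        System.out.println(cached.get()); // served from memory, no reload
    }
}
```

Request handlers would share one CachedModel instance, so only the first
request pays the loading cost.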
