Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread
If the number of user-item pairs to predict isn't too large (say, millions),
you could transform the target DataFrame and save the result to a Hive
table, then build a cache based on that table for online services.
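
A minimal sketch of that batch approach, assuming Spark 2.x, an active
SparkSession named spark, and placeholder model path, table, and column names:

    import org.apache.spark.ml.recommendation.ALSModel

    // Load the saved ml model. An ml ALSModel must be loaded with
    // ALSModel.load; the RDD-based MatrixFactorizationModel uses a
    // different on-disk format, which is why loading it that way fails.
    val model = ALSModel.load("/path/to/als_model")

    // Candidate (user, item) pairs, using the same column names as at
    // training time ("userId" / "itemId" here are assumptions).
    val pairs = spark.table("candidate_user_item_pairs")

    // transform() appends a "prediction" column; persist it to back the
    // online cache.
    model.transform(pairs)
      .write
      .mode("overwrite")
      .saveAsTable("als_predictions")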

If that's not the case (e.g. billions of user-item pairs to predict), you
have to start a service with the model loaded and send each user to it:
first match several hundred candidate items from all available items (which
could itself be another service or cache), then transform this user with
those items using the model to get predictions, and return the items
ordered by prediction.
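
An illustrative sketch of the per-request scoring step, with the same
assumptions as above (candidate matching itself is left to that other
service or cache):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit}
    import spark.implicits._

    // Score one user against a few hundred pre-matched candidate items
    // and rank them by predicted score. Column names are assumptions.
    def scoreUser(userId: Int, candidateItemIds: Seq[Int]): DataFrame = {
      val pairs = candidateItemIds.toDF("itemId")
        .withColumn("userId", lit(userId))
      model.transform(pairs)
        .orderBy(col("prediction").desc)
        .select("itemId", "prediction")
    }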

On Thu, Mar 16, 2017 at 9:32 AM, lk_spark  wrote:

> hi, all:
>    Under Spark 2.0, I would like to know: after training a
> ml.recommendation.ALSModel, how can I perform the recommend action?
>
>    I tried to save the model and load it with MatrixFactorizationModel,
> but got an error.
>
> 2017-03-16
> --
> lk_spark
>


Re: is dataframe thread safe?

2017-02-13 Thread
To my understanding, all transformations are thread-safe because a
DataFrame is just an immutable description of the computation, so the case
above is fine. Just be careful with the actions.
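
A small sketch of the pattern being asked about, assuming an active
SparkSession named spark and an illustrative table and schema:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Built once in the main thread; the plan it describes is immutable.
    val df = spark.table("events")

    // Each thread derives its own transformations and runs its own
    // action; the actions just submit two independent jobs to the
    // (thread-safe) SparkContext, which can run them concurrently.
    val f1 = Future { df.filter("status = 'ok'").count() }
    val f2 = Future { df.groupBy("status").count().collect() }

    val okCount  = Await.result(f1, Duration.Inf)
    val byStatus = Await.result(f2, Duration.Inf)

Mutating calls like unpersist or checkpoint are the ones to coordinate
across threads, as the question already notes.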

On Sun, Feb 12, 2017 at 4:06 PM, Mendelson, Assaf 
wrote:

> Hi,
>
> I was wondering if dataframe is considered thread safe. I know the spark
> session and spark context are thread safe (and actually have tools to
> manage jobs from different threads), but the question is: can I use the same
> dataframe in both threads?
>
> The idea would be to create a dataframe in the main thread and then in two
> sub threads do different transformations and actions on it.
>
> I understand that some things might not be thread safe (e.g. if I
> unpersist in one thread it would affect the other; checkpointing would
> cause similar issues); however, I can’t find any documentation as to what
> operations (if any) are thread safe.
>
>
>
> Thanks,
>
> Assaf.
>


Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

2016-12-29 Thread
Why not sync the MySQL binlog (hopefully the data is immutable and the
table is append-only), send the log through Kafka, and then consume it with
Spark Streaming?
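
A minimal sketch of the consuming side with Structured Streaming, assuming
the binlog events are already being published (e.g. by a CDC tool) to a
placeholder topic "mysql-binlog", and that the spark-sql-kafka-0-10 package
is on the classpath:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("binlog-consumer").getOrCreate()

    // Read the change-log events as a streaming DataFrame; the broker
    // address and topic name are placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "mysql-binlog")
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Console sink for demonstration; a real job would parse the payload
    // and write to its target store.
    events.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()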

On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust 
wrote:

> We don't support this yet, but I've opened this JIRA as it sounds
> generally useful: https://issues.apache.org/jira/browse/SPARK-19031
>
> In the meantime you could try implementing your own Source, but that is
> pretty low level and is not yet a stable API.
>
> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" 
> wrote:
>
>> Hi all,
>>
>> Thanks a lot for your contributions to bring us new technologies.
>>
>> I don't want to waste your time, so before I write to you, I googled,
>> checked stackoverflow and mailing list archive with keywords "streaming"
>> and "jdbc". But I was not able to get any solution to my use case. I hope I
>> can get some clarification from you.
>>
>> The use case is quite straightforward: I need to harvest a relational
>> database via JDBC, do something with the data, and store the result in
>> Kafka. I am stuck at the first step, and the difficulty is as follows:
>>
>> 1. The database is too large to ingest with one thread.
>> 2. The database is dynamic and time series data comes in constantly.
>>
>> Then an ideal workflow is that multiple workers process partitions of
>> data incrementally according to a time window. For example, the processing
>> starts from the earliest data with each batch containing data for one hour.
>> If data ingestion speed is faster than data production speed, then
>> eventually the entire database will be harvested and those workers will
>> start to "tail" the database for new data streams and the processing
>> becomes real time.
>>
>> With Spark SQL I can ingest data from a JDBC source with partitions
>> divided by time windows, but how can I dynamically increment the time
>> windows during execution? Assume that there are two workers ingesting data
>> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
>> for 2017-01-03. But I am not able to find out how to increment those values
>> during execution.
>>
>> Then I looked into Structured Streaming. It looks much more promising
>> because window operations based on event time are considered during
>> streaming, which could be the solution to my use case. However, from
>> documentation and code example I did not find anything related to streaming
>> data from a growing database. Is there anything I can read to achieve my
>> goal?
>>
>> Any suggestion is highly appreciated. Thank you very much and have a nice
>> day.
>>
>> Best regards,
>> Yang
>>
>>
>
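
For reference, the partitioned JDBC read Yang describes looks roughly like
this in Spark 2.x (a sketch; the URL, table, column, and bounds are
placeholders, and the partition column must be numeric in these versions):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-ingest").getOrCreate()

    // Spark turns this into numPartitions parallel queries over ranges
    // of the partition column between the two bounds.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db")
      .option("dbtable", "events")
      .option("partitionColumn", "event_time_epoch")
      .option("lowerBound", "1483228800")   // 2017-01-01 00:00:00 UTC
      .option("upperBound", "1483401600")   // 2017-01-03 00:00:00 UTC
      .option("numPartitions", "48")
      .option("user", "user")
      .option("password", "password")
      .load()

The bounds are fixed when the job is submitted, which is exactly the
limitation described above: they cannot grow during execution, so each
incremental window needs a new batch job (or one of the streaming
approaches discussed in the replies).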