Error when using FlinkML iterations with KeyedCoProcessFunction

2024-03-27 Thread Komal M
Hi,

As the DataStream API's iterativeStream method has been deprecated for future 
Flink releases, the documentation recommends using Flink ML's iteration as an 
alternative. I am trying to build my understanding of the new iterations API, as 
it will be a requirement for our future projects.

As an exercise, I'm trying to implement a KeyedCoProcessFunction inside the 
iteration body that takes the feedback stream and the non-feedback stream as 
inputs, but I get the following error. Do you know what could be causing it? For 
reference, I do not get any error when applying a .keyBy().flatMap() on each of 
the streams individually inside the iteration body.

Exception in thread "main" org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144)
    …
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:139)
    …
    at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
    ... 5 more
Caused by: java.lang.ClassCastException: class org.apache.flink.iteration.IterationRecord cannot be cast to class org.apache.flink.api.java.tuple.Tuple (org.apache.flink.iteration.IterationRecord and org.apache.flink.api.java.tuple.Tuple are in unnamed module of loader 'app')
    at org.apache.flink.api.java.typeutils.runtime.TupleComparator.extractKeys(TupleComparator.java:148)
    at org.apache.flink.streaming.util.keys.KeySelectorUtil$ComparableKeySelector.getKey(KeySelectorUtil.java:195)
    at org.apache.flink.streaming.util.keys.KeySelectorUtil$ComparableKeySelector.getKey(KeySelectorUtil.java:168)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.setKeyContextElement(AbstractStreamOperator.java:502)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.setKeyContextElement1(AbstractStreamOperator.java:478)
    at org.apache.flink.iteration.operator.allround.AbstractAllRoundWrapperOperator.setKeyContextElement1(AbstractAllRoundWrapperOperator.java:203)
    at org.apache.flink.streaming.runtime.io.RecordProcessorUtils.lambda$getRecordProcessor1$1(RecordProcessorUtils.java:87)
    at org.apache.flink.streaming.runtime.io.StreamTwoInputProcessorFactory$StreamTaskNetworkOutput.emitRecord(StreamTwoInputProcessorFactory.java:254)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:146)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
    at org.apache.flink.streaming.runtime.io.StreamMultipleInputProcessor.processInput(StreamMultipleInputProcessor.java:85)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:550)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
    at java.base/java.lang.Thread.run(Thread.java:829)


I am attaching the full test code below for reference. All it does is subtract 
1 from the feedback stream until each tuple reaches 0.0. For each subtraction 
it outputs a relevant message to the finalOutput stream. These messages are 
stored in the keyed state of the KeyedCoProcessFunction and are preloaded into 
the parallel instances by a DataStream called initialStates. Each key has 
different messages associated with it, hence the need for MapState.



import java.util.*;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.iteration.DataStreamList;
import org.apache.flink.iteration.IterationBodyResult;
import 
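The attached code is truncated in the archive. For orientation, here is a hedged sketch of the general shape of a Flink ML unbounded iteration whose body consumes the feedback stream and a non-feedback stream, based only on the imports above and the Flink ML 2.x iteration API; all stream names and element types are illustrative, not the poster's actual code:

```java
// Hedged sketch only: the wiring of a Flink ML unbounded iteration.
// `initialFeedback` and `initialStates` stand in for the poster's streams.
DataStreamList result =
        Iterations.iterateUnboundedStreams(
                DataStreamList.of(initialFeedback),   // variable (feedback) stream
                DataStreamList.of(initialStates),     // non-feedback ("constant") stream
                new IterationBody() {
                    @Override
                    public IterationBodyResult process(
                            DataStreamList variableStreams, DataStreamList dataStreams) {
                        DataStream<Tuple2<String, Double>> feedback = variableStreams.get(0);
                        DataStream<Tuple3<String, Double, String>> states = dataStreams.get(0);

                        // ... connect / keyBy / process the two streams here ...
                        DataStream<Tuple2<String, Double>> newFeedback = feedback; // placeholder
                        DataStream<String> finalOutput = states.map(t -> t.f2);    // placeholder

                        return new IterationBodyResult(
                                DataStreamList.of(newFeedback),  // fed back into the next round
                                DataStreamList.of(finalOutput)); // leaves the iteration
                    }
                });
```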

Re: FlinkML 'DenseVector' object has no attribute 'get_fields_by_names'

2023-09-19 Thread Evgeniy Lyutikov
Thanks for the answer, I'll try.
Are there examples or tutorials somewhere on how to use FlinkML in real-life 
scenarios, such as streaming Kafka through a model?



From: Xin Jiang 
Sent: 19 September 2023, 8:07:11
To: Evgeniy Lyutikov
Cc: user@flink.apache.org
Subject: Re: FlinkML 'DenseVector' object has no attribute 'get_fields_by_names'

Hi Evgeniy,

Yes, the reason for the exception is that you are returning an incorrect data 
type. Flink ML doesn't have a data type for `DenseVector`, but it provides a 
function called `pyflink.ml.functions.array_to_vector` which returns an 
`Expression`. So you could modify your UDF to combine multiple columns into one 
column of `DataTypes.ARRAY()`, and then call 
`pyflink.ml.functions.array_to_vector` on that column.


Best,
Xin


“This message contains confidential information/commercial secret. If you are 
not the intended addressee of this message you may not copy, save, print or 
forward it to any third party and you are kindly requested to destroy this 
message and notify the sender thereof by email.
Данное сообщение содержит конфиденциальную информацию/информацию, являющуюся 
коммерческой тайной. Если Вы не являетесь надлежащим адресатом данного 
сообщения, Вы не вправе копировать, сохранять, печатать или пересылать его 
каким либо иным лицам. Просьба уничтожить данное сообщение и уведомить об этом 
отправителя электронным письмом.”


Re: FlinkML 'DenseVector' object has no attribute 'get_fields_by_names'

2023-09-18 Thread Xin Jiang
Hi Evgeniy,

Yes, the reason for the exception is that you are returning an incorrect data 
type. Flink ML doesn't have a data type for `DenseVector`, but it provides a 
function called `pyflink.ml.functions.array_to_vector` which returns an 
`Expression`. So you could modify your UDF to combine multiple columns into one 
column of `DataTypes.ARRAY()`, and then call 
`pyflink.ml.functions.array_to_vector` on that column.


Best,
Xin
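A hedged sketch of what Xin describes, reusing the table and column names from the question below (`input_table` and `feature1`…`feature5` are assumptions carried over from that code); the essential points are the `DataTypes.ARRAY(DataTypes.DOUBLE())` result type on the UDF and the `array_to_vector` call:

```python
from pyflink.ml.functions import array_to_vector
from pyflink.table import DataTypes
from pyflink.table.expressions import col
from pyflink.table.udf import udf

# Return a plain ARRAY<DOUBLE> from the UDF (not a DenseVector)...
to_array = udf(lambda *args: list(args),
               result_type=DataTypes.ARRAY(DataTypes.DOUBLE()))

# ...then convert that array column into a vector expression for Flink ML.
train_table = input_table.select(
    col("id"),
    array_to_vector(
        to_array(col("feature1"), col("feature2"), col("feature3"),
                 col("feature4"), col("feature5"))).alias("features"),
    col("label").cast(DataTypes.DOUBLE()).alias("label"))
```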

FlinkML 'DenseVector' object has no attribute 'get_fields_by_names'

2023-09-18 Thread Evgeniy Lyutikov
Hello community!

I'm trying to use FlinkML to train a model on data from a PostgreSQL table, and 
I get an error when I try to view the output table after applying the model:


AttributeError: 'DenseVector' object has no attribute 'get_fields_by_names'


My code:

# Create train source table
t_env.execute_sql(
    """
    CREATE TABLE train_source (
        id TINYINT,
        feature1 FLOAT,
        feature2 FLOAT,
        feature3 FLOAT,
        feature4 FLOAT,
        feature5 FLOAT,
        label TINYINT
    ) WITH (
        'url' = 'jdbc:postgresql://localhost:5432/train',
        'table-name' = 'train',
        'username' = 'postgres',
        'password' = 'postgres',
        'connector' = 'jdbc'
    )
    """
)
input_table = t_env.from_path("train_source")

class ListDenseVector(ScalarFunction):
    def eval(self, *args):
        return Vectors.dense(list(args))

# Map DenseVector to features
t_env.create_temporary_function(
    "dense_map", udf(ListDenseVector(), result_type=DataTypes.ROW()))
train_table = t_env.sql_query(
    "SELECT id, dense_map(feature1, feature2, feature3, feature4, feature5) "
    "AS features, CAST(label AS DOUBLE) AS label FROM train_source")

# Train model
logistic_regression = LogisticRegression()
model = logistic_regression.fit(train_table)
output = model.transform(train_table)[0]

# Print result
output.execute().print()

I also tried to change the output type in the UDF, but I get an error:

TypeError: Invalid returnType: returnType should be DataType or str but is 
DenseVectorTypeInfo




FlinkMl

2023-05-14 Thread Danyal Awan
hello,

For my master thesis I am comparing ML frameworks on data streams.

What is the current status of FlinkML? Is distributed learning possible on
multiple nodes? If yes, how?

I played around with FlinkML a bit and modeled a simple pipeline for
sentiment analysis on tweets. For this I used the Sentiment140 dataset,
which contains 1.6 million tweets.
Unfortunately I can only use a small amount of data (about 3 samples)
for training; otherwise the TaskManager gets lost or crashes, even though I
have allocated plenty of memory to the TaskManager (JVM heap size is set to
50 GB). But training should also work with more data, right?

greetings


Re: info about flinkml

2020-09-14 Thread Yun Tang
Hi

FlinkML was dropped as of Flink 1.9 [1], and a new machine learning library 
has been developed under the umbrella of FLIP-39 [2][3].
As far as I know, the new Flink ML library is not yet complete, so you could 
try Alink [4], a machine learning algorithm platform based on Flink, which is 
also developed by active contributors to FLIP-39.


[1] 
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-ml-and-DISCUSS-Delete-flink-ml-td29057.html
[2] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs
[3] https://issues.apache.org/jira/browse/FLINK-12470
[4] https://github.com/alibaba/Alink

Best
Yun Tang


From: Cristian Lorenzetto 
Sent: Monday, September 14, 2020 18:59
To: user@flink.apache.org 
Subject: info about flinkml

Hi, I am evaluating adopting Flink instead of Spark for a data mining processor. 
I knew FlinkML for this purpose, but in the latest release I can't find it. Why?
Can you suggest the best way forward?


--

Cristian Lorenzetto
Direzione ICT e Agenda Digitale
U.O. Demand, Progettazione e Sviluppo Software
Tel: 041 2792619

Ai sensi del vigente D.Lgs. 196/2003 in materia di privacy e del Regolamento 
(UE) 2016/679 del Parlamento europeo e del Consiglio si precisa che le 
informazioni contenute nel messaggio e negli eventuali allegati sono riservate 
esclusivamente al/ai destinatario/i indicato/i. Si invita ad astenersi 
dall'effettuare: inoltri, copie, distribuzioni e divulgazioni non autorizzate 
del presente messaggio e degli eventuali allegati. Nel caso di erroneo 
recapito, si chiede cortesemente a chi legge di dare immediata comunicazione al 
mittente e di cancellare il presente messaggio e gli eventuali allegati. 
Informazioni aggiuntive nella sezione **Privacy** del sito internet: 
www.regione.veneto.it
--
According to the Italian law D.Lgs. 196/2003 and the Regulation (EU) 2016/679 
of the European Parliament and of the Council, the information contained in this 
message and any attachment thereto is addressed exclusively to the 
intended recipient. Please refrain from copying or forwarding the message 
and its attachments or disclosing their content without authorisation.
If this message was delivered to you in error, please immediately inform 
the sender and delete the message and its attachments. Additional information 
is available in the **Privacy** section of the website: 
www.regione.veneto.it


info about flinkml

2020-09-14 Thread Cristian Lorenzetto
Hi, I am evaluating adopting Flink instead of Spark for a data mining processor.
I knew FlinkML for this purpose, but in the latest release I can't find it. Why?
Can you suggest the best way forward?


-- 

Cristian Lorenzetto
Direzione ICT e Agenda Digitale
U.O. Demand, Progettazione e Sviluppo Software
Tel: 041 2792619




Re: FlinkML status

2020-08-03 Thread Till Rohrmann
Hi Mohamed,

the development of FlinkML has been stopped in favour of a new machine
learning library which you can find here [1]. Be aware that this library is
still under development.

[1] https://github.com/apache/flink/tree/master/flink-ml-parent

Cheers,
Till

On Sat, Aug 1, 2020 at 10:35 AM Mohamed Haseeb  wrote:

> Hi,
>
> What's the current status of FlinkML? Is it still part of Flink? The last
> Flink release that has documentation about it is 1.8.
>
> Thanks,
> M. Haseeb
>


FlinkML status

2020-08-01 Thread Mohamed Haseeb
Hi,

What's the current status of FlinkML? Is it still part of Flink? The last
Flink release that has documentation about it is 1.8.

Thanks,
M. Haseeb


Re: FlinkML

2018-04-18 Thread Christophe Salperwyck
Hi,

You could try to plug MOA/Weka library too. I did some preliminary work
with that:
https://moa.cms.waikato.ac.nz/moa-with-apache-flink/

but then they are no longer FlinkML algorithms.

Best regards,
Christophe


2018-04-18 21:13 GMT+02:00 shashank734 <shashank...@gmail.com>:

> There are no active discussions or guide on that. But I found this example
> in the repo:
>
> https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/ml/IncrementalLearningSkeleton.java
>
> Which is trying to do the same thing. Although I haven't checked this yet.
>
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.
> nabble.com/
>


Re: FlinkML

2018-04-18 Thread shashank734
There are no active discussions or guide on that. But I found this example in
the repo :

https://github.com/apache/flink/blob/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/ml/IncrementalLearningSkeleton.java

  

Which is trying to do the same thing. Although I haven't checked this yet.



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: FlinkML

2018-04-18 Thread Christophe Jolif
Szymon,

The short answer is no. See:
http://mail-archives.apache.org/mod_mbox/flink-user/201802.mbox/%3ccaadrtt39ciiec1uzwthzgnbkjxs-_h5yfzowhzph_zbidux...@mail.gmail.com%3E


On Mon, Apr 16, 2018 at 11:25 PM, Szymon Szczypiński <simo...@poczta.fm>
wrote:

> Hi,
>
> I wonder if it is possible to build a FlinkML streaming job, not a
> batch job. The examples at
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/libs/ml/
> are only batch examples.
>
> Is there any possibility?
>
>
> Best regards.
>
>


-- 
Christophe


FlinkML

2018-04-16 Thread Szymon Szczypiński

Hi,

I wonder if it is possible to build a FlinkML streaming job, not a 
batch job. The examples at 
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/libs/ml/ 
are only batch examples.


Is there any possibility?


Best regards.



Running FlinkML ALS with more than two features

2018-03-19 Thread Banias H
Hello Flink experts,

I am new to FlinkML and currently playing around with using ALS in a
recommender system. In our dataset, we have more than 2 features. When I
tried running the example towards the bottom of this page:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/libs/ml/als.html,
I got a *method not implemented* error in fit(). Here is how I set up
inputDS:

val inputDS: DataSet[(Int, Int, Int, Int, Double)] =
  env.readCsvFile[(Int, Int, Int, Int, Double)](pathToTrainingFile)
...
als.fit(inputDS, parameters)

However, when I used only 2 features (i.e. passing DataSet[(Int, Int, Double)] to
fit()), it ran successfully. Is this a limitation of ALS in general, or is it
a configuration issue?

I would appreciate any info on this. Thanks.

Regards,
BH


Re: FlinkML ALS is taking too long to run

2017-07-12 Thread Sebastian Schelter
I don't think you need to employ a distributed system for working with this
dataset. An SGD implementation on a single machine should easily handle the
job.

Best,
Sebastian

2017-07-12 9:26 GMT+02:00 Andrea Spina <andrea.sp...@radicalbit.io>:

> Dear Ziyad,
>
> Yep, I encountered the same very long runtimes with ALS at the time, and I
> saw improvements by increasing the number of blocks / decreasing the number
> of task slots per TaskManager (#TSs/TM), as you pointed out.
>
> Cheers,
>
> Andrea
>
>
>
>
>
>
> --
> View this message in context: http://apache-flink-user-
> mailing-list-archive.2336050.n4.nabble.com/FlinkML-ALS-is-
> taking-too-long-to-run-tp14154p14192.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: FlinkML ALS is taking too long to run

2017-07-12 Thread Andrea Spina
Dear Ziyad, 

Yep, I encountered the same very long runtimes with ALS at the time, and I
saw improvements by increasing the number of blocks / decreasing the number
of task slots per TaskManager (#TSs/TM), as you pointed out.

Cheers,

Andrea






--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/FlinkML-ALS-is-taking-too-long-to-run-tp14154p14192.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: FlinkML ALS is taking too long to run

2017-07-11 Thread Andrea Spina
Dear Ziyad,
could you kindly share some additional info about your environment
(local/cluster, nodes, machines' configuration)?
What exactly do you mean by "indefinitely"? How long does the job hang?

Hope this helps.

Cheers,

Andrea



--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/FlinkML-ALS-is-taking-too-long-to-run-tp14154p14186.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


FlinkML ALS is taking too long to run

2017-07-07 Thread Ziyad Muhammed
Dear all

I'm trying to run Flink ALS against the Yahoo R2 dataset [1] on HDFS. The
program runs without showing any errors, but it does not finish. The
operators running indefinitely are:

CoGroup (CoGroup at
org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:606))(11/240)

Join(Join at
org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:576))(15/240)


I was using the below parameters to run:

val als = ALS().setIterations(10).setNumFactors(10).setBlocks(100)

I didn't set the HDFS temporary path. Can someone tell me which
parameters to set to run ALS on such large datasets? Why are these
operators running indefinitely?

[1] https://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Best
Ziyad


Re: Using FlinkML from Java?

2017-04-21 Thread Till Rohrmann
Hi Steve,

unfortunately, FlinkML's pipeline mechanism depends on Scala's implicit
value feature. Therefore, FlinkML can only be used with Scala if you don't
want to construct the pipelines manually (which I wouldn't recommend).

Cheers,
Till

On Thu, Apr 20, 2017 at 6:56 PM, Steve Jerman <st...@kloudspot.com> wrote:

> Hi Folks,
>
> I’m trying to use FlinkML 1.2 from Java … getting this:
>
> SVM svm = new SVM()
>   .setBlocks(env.getParallelism())
>   .setIterations(100)
>   .setRegularization(0.001)
>   .setStepsize(0.1)
>   .setSeed(42);
>
>
> svm.fit(labelledTraining);
>
> The type org.apache.flink.api.scala.DataSet cannot be resolved. It is
> indirectly referenced from required .class files.
>
> Are there any tricks required to get it running? Or is Java not supported?
>
> Steve
>


Flink Scheduling and FlinkML

2017-03-31 Thread Fábio Dias
Hi to all,

I'm building a recommendation system to my application.
I have a set of logs (that contains the user info, the hour, the button
that was clicked ect...) that arrive to my Flink by kafka, then I save
every log in a HDFS (HADOOP), but know I have a problem, I want to apply ML
to (all) my data.

I think in 2 scenarios:
First : Transform my DataStream in a DataSet and perform the ML task. It is
possible?
Second : Preform a task in flink that get the data from Hadoop and perform
the ML task.

What is the best way to do it?

I already check the IncrementalLearningSkeleton but I didn't understand how
to apply that to an actual real case. Is there some complex example that I
could look?
(
https://github.com/apache/flink/tree/master/flink-examples/flink-examples-streaming/src/main/java/org/apache/flink/streaming/examples/ml
)

Another thing that I would like to ask is how to perform the second
scenario, where I need to perform this task every hour, what it is the best
way to do it?

Thanks,
Fábio Dias.


Re: FlinkML and DataStream API

2016-12-21 Thread Márton Balassi
Thanks for mentioning it, Theo.

Here it is: https://github.com/streamline-eu/ML-Pipelines/tree/stream-ml

Look at these examples:
https://github.com/streamline-eu/ML-Pipelines/commit/314e3d940f1f1ac7b762ba96067e13d806476f57

On Wed, Dec 21, 2016 at 9:38 PM, <dromitl...@gmail.com> wrote:

> I'm interested in that code you mentioned too, I hope you can find it.
>
> Regards,
> Matt
>
> On Dec 21, 2016, at 17:12, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com> wrote:
>
> Hello Mäki,
>
> I think what you would like to do is train a model using batch, and use
> the Flink streaming API as a way to serve your model and make predictions.
>
> While we don't have an integrated way to do that in FlinkML currently, I
> definitely think that's possible. I know Marton Balassi has been working on
> something like this for the ALS algorithm, but I can't find the code right
> now on mobile.
> The general idea is to keep your model as state and use it to make
> predictions on a stream of incoming data.
>
> Model serving is definitely something we'll be working on in the future,
> I'll have a master student working on exactly that next semester.
>
> --
> Sent from a mobile device. May contain autocorrect errors.
>
> On Dec 21, 2016 5:24 PM, "Mäki Hanna" <hanna.m...@comptel.com> wrote:
>
>> Hi,
>>
>>
>>
>> I’m wondering if there is a way to use FlinkML and make predictions
>> continuously for test data coming from a DataStream.
>>
>>
>>
>> I know FlinkML only supports the DataSet API (batch) at the moment, but
>> is there a way to convert a DataStream into DataSets? I’m thinking of
>> something like
>>
>>
>>
>> (0. fit model in batch mode)
>>
>> 1. window the DataStream
>>
>> 2. convert the windowed stream to DataSets
>>
>> 3. use the FlinkML methods to make predictions
>>
>>
>>
>> BR,
>>
>> Hanna
>>
>>
>> Disclaimer: This message and any attachments thereto are intended solely
>> for the addressed recipient(s) and may contain confidential information. If
>> you are not the intended recipient, please notify the sender by reply
>> e-mail and delete the e-mail (including any attachments thereto) without
>> producing, distributing or retaining any copies thereof. Any review,
>> dissemination or other use of, or taking of any action in reliance upon,
>> this information by persons or entities other than the intended
>> recipient(s) is prohibited. Thank you.
>>
>


Re: FlinkML and DataStream API

2016-12-21 Thread dromitlabs
I'm interested in that code you mentioned too, I hope you can find it.

Regards,
Matt

> On Dec 21, 2016, at 17:12, Theodore Vasiloudis 
> <theodoros.vasilou...@gmail.com> wrote:
> 
> Hello Mäki, 
> 
> I think what you would like to do is train a model using batch, and use the 
> Flink streaming API as a way to serve your model and make predictions. 
> 
> While we don't have an integrated way to do that in FlinkML currently, I 
> definitely think that's possible. I know Marton Balassi has been working on 
> something like this for the ALS algorithm, but I can't find the code right 
> now on mobile.  
> The general idea is to keep your model as state and use it to make 
> predictions on a stream of incoming data. 
> 
> Model serving is definitely something we'll be working on in the future, I'll 
> have a master student working on exactly that next semester. 
> 
> -- 
> Sent from a mobile device. May contain autocorrect errors.
> 
>> On Dec 21, 2016 5:24 PM, "Mäki Hanna" <hanna.m...@comptel.com> wrote:
>> Hi,
>> 
>>  
>> 
>> I’m wondering if there is a way to use FlinkML and make predictions 
>> continuously for test data coming from a DataStream.
>> 
>>  
>> 
>> I know FlinkML only supports the DataSet API (batch) at the moment, but is 
>> there a way to convert a DataStream into DataSets? I’m thinking of something 
>> like
>> 
>>  
>> 
>> (0. fit model in batch mode)
>> 
>> 1. window the DataStream
>> 
>> 2. convert the windowed stream to DataSets
>> 
>> 3. use the FlinkML methods to make predictions
>> 
>>  
>> 
>> BR,
>> 
>> Hanna
>> 
>>  
>> 


Re: FlinkML and DataStream API

2016-12-21 Thread Theodore Vasiloudis
Hello Mäki,

I think what you would like to do is train a model using batch, and use the
Flink streaming API as a way to serve your model and make predictions.

While we don't have an integrated way to do that in FlinkML currently, I
definitely think that's possible. I know Marton Balassi has been working on
something like this for the ALS algorithm, but I can't find the code right
now on mobile.
The general idea is to keep your model as state and use it to make
predictions on a stream of incoming data.

Model serving is definitely something we'll be working on in the future,
I'll have a master student working on exactly that next semester.

-- 
Sent from a mobile device. May contain autocorrect errors.

On Dec 21, 2016 5:24 PM, "Mäki Hanna" <hanna.m...@comptel.com> wrote:

> Hi,
>
>
>
> I’m wondering if there is a way to use FlinkML and make predictions
> continuously for test data coming from a DataStream.
>
>
>
> I know FlinkML only supports the DataSet API (batch) at the moment, but is
> there a way to convert a DataStream into DataSets? I’m thinking of
> something like
>
>
>
> (0. fit model in batch mode)
>
> 1. window the DataStream
>
> 2. convert the windowed stream to DataSets
>
> 3. use the FlinkML methods to make predictions
>
>
>
> BR,
>
> Hanna
>
>

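The "model as state" idea Theodore sketches above can be written down roughly like this (a hedged illustration, not code from the thread; the simple linear model and the Array[Double] element types are my own assumptions):

```scala
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

// Input 1 carries model updates (weights), input 2 carries feature vectors
// to score; the latest model is kept as operator state.
class ServeModel extends RichCoFlatMapFunction[Array[Double], Array[Double], Double] {
  @transient private var weights: Array[Double] = _

  override def flatMap1(newWeights: Array[Double], out: Collector[Double]): Unit =
    weights = newWeights  // swap in the freshly trained model

  override def flatMap2(features: Array[Double], out: Collector[Double]): Unit =
    if (weights != null)  // score only once a model has arrived
      out.collect(weights.zip(features).map { case (w, x) => w * x }.sum)
}

// usage sketch: modelStream.connect(featureStream).flatMap(new ServeModel)
```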

FlinkML and DataStream API

2016-12-21 Thread Mäki Hanna
Hi,

I'm wondering if there is a way to use FlinkML and make predictions 
continuously for test data coming from a DataStream.

I know FlinkML only supports the DataSet API (batch) at the moment, but is 
there a way to convert a DataStream into DataSets? I'm thinking of something 
like

(0. fit model in batch mode)
1. window the DataStream
2. convert the windowed stream to DataSets
3. use the FlinkML methods to make predictions

BR,
Hanna



Re: FlinkML - Fail to execute QuickStart example

2016-10-17 Thread Thomas FOURNIER
Hi,

No problem I'm going to create a JIRA.

Regards
Thomas

2016-10-17 21:34 GMT+02:00 Theodore Vasiloudis <
theodoros.vasilou...@gmail.com>:

> That is my bad, I must have been testing against a private branch when
> writing the guide; the SVM as it stands only has a predict operation for
> Vector, not LabeledVector.
>
> IMHO I would like to have a predict operation for LabeledVector for all
> predictors (which would just call the existing Vector prediction
> internally), but IIRC we decided to go with an evaluate operator instead, as
> written in the evaluation PR.
>
> I'll make a PR to fix the guide; any chance you can create a JIRA for this?
>
> Regards,
> Theodore
>
> On Mon, Oct 17, 2016 at 6:22 PM, Thomas FOURNIER <
> thomasfournier...@gmail.com> wrote:
>
>> Hi,
>>
>> Executing the following code (see QuickStart):
>>
>> val env = ExecutionEnvironment.getExecutionEnvironment
>> val survival = env.readCsvFile[(String, String, String, 
>> String)]("src/main/resources/haberman.data", ",")
>>
>>
>> val survivalLV = survival
>>   .map { tuple =>
>> val list = tuple.productIterator.toList
>> val numList = list.map(_.asInstanceOf[String].toDouble)
>> LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
>>   }
>>
>>
>>
>> val astroTrain = MLUtils.readLibSVM(env, "src/main/resources/svmguide1")
>> val astroTest = MLUtils.readLibSVM(env, "src/main/resources/svmguide1.t")
>>
>>
>> val svm = SVM()
>>   .setBlocks(env.getParallelism)
>>   .setIterations(100)
>>   .setRegularization(0.001)
>>   .setStepsize(0.1)
>>   .setSeed(42)
>>
>> svm.fit(astroTrain)
>> svm.predict(astroTest)
>>
>>
>> I encounter the following error:
>>
>> Exception in thread "main" java.lang.RuntimeException: There is no 
>> PredictOperation defined for org.apache.flink.ml.classification.SVM which 
>> takes a DataSet[org.apache.flink.ml.common.LabeledVector] as input.
>>
> >> Any idea?
>>
>> Thanks
>>
>> Thomas
>>
>>
>>
>>
>>
>


Re: FlinkML - Fail to execute QuickStart example

2016-10-17 Thread Theodore Vasiloudis
That is my bad; I must have been testing against a private branch when
writing the guide. The SVM as it stands only has a predict operation for
Vector, not LabeledVector.

IMHO I would like to have a predict operator for LabeledVector for all
predictors (that would just call the existing Vector prediction
internally), but IIRC we decided to go with an Evaluate operator instead, as
written in the evaluation PR.

I'll make a PR to fix the guide, any chance you can create a JIRA for this?

Regards,
Theodore
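
[Editor's note: a hedged sketch of the workaround described above, i.e. stripping the labels so that the existing Vector predict operation applies. The case classes below are minimal local stand-ins for FlinkML's `DenseVector` and `LabeledVector` (assumed to mirror `org.apache.flink.ml.math` / `org.apache.flink.ml.common`), and the commented `svm.predict(...)` line is the assumed equivalent call against the real API, not a tested invocation.]

```scala
// Minimal stand-ins for the FlinkML types (assumption: the real ones live in
// org.apache.flink.ml.math and org.apache.flink.ml.common).
case class DenseVector(data: Array[Double])
case class LabeledVector(label: Double, vector: DenseVector)

// A tiny labeled test set, standing in for MLUtils.readLibSVM(...) output.
val astroTest = Seq(
  LabeledVector(1.0, DenseVector(Array(0.5, 1.2, 3.3))),
  LabeledVector(-1.0, DenseVector(Array(2.0, 0.1, 0.7)))
)

// Strip the labels before predicting, since only Vector has a PredictOperation.
// With a real DataSet[LabeledVector] this would presumably read:
//   svm.predict(astroTest.map(_.vector))
val unlabeled = astroTest.map(_.vector)
println(unlabeled.length) // prints 2
```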

On Mon, Oct 17, 2016 at 6:22 PM, Thomas FOURNIER <
thomasfournier...@gmail.com> wrote:

> Hi,
>
> Executing the following code (see QuickStart):
>
> val env = ExecutionEnvironment.getExecutionEnvironment
> val survival = env.readCsvFile[(String, String, String, 
> String)]("src/main/resources/haberman.data", ",")
>
>
> val survivalLV = survival
>   .map { tuple =>
> val list = tuple.productIterator.toList
> val numList = list.map(_.asInstanceOf[String].toDouble)
> LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
>   }
>
>
>
> val astroTrain = MLUtils.readLibSVM(env, "src/main/resources/svmguide1")
> val astroTest = MLUtils.readLibSVM(env, "src/main/resources/svmguide1.t")
>
>
> val svm = SVM()
>   .setBlocks(env.getParallelism)
>   .setIterations(100)
>   .setRegularization(0.001)
>   .setStepsize(0.1)
>   .setSeed(42)
>
> svm.fit(astroTrain)
> svm.predict(astroTest)
>
>
> I encounter the following error:
>
> Exception in thread "main" java.lang.RuntimeException: There is no 
> PredictOperation defined for org.apache.flink.ml.classification.SVM which 
> takes a DataSet[org.apache.flink.ml.common.LabeledVector] as input.
>
> Any idea?
>
> Thanks
>
> Thomas
>
>
>
>
>


FlinkML - Fail to execute QuickStart example

2016-10-17 Thread Thomas FOURNIER
Hi,

Executing the following code (see QuickStart):

val env = ExecutionEnvironment.getExecutionEnvironment
val survival = env.readCsvFile[(String, String, String,
String)]("src/main/resources/haberman.data", ",")


val survivalLV = survival
  .map { tuple =>
val list = tuple.productIterator.toList
val numList = list.map(_.asInstanceOf[String].toDouble)
LabeledVector(numList(3), DenseVector(numList.take(3).toArray))
  }



val astroTrain = MLUtils.readLibSVM(env, "src/main/resources/svmguide1")
val astroTest = MLUtils.readLibSVM(env, "src/main/resources/svmguide1.t")


val svm = SVM()
  .setBlocks(env.getParallelism)
  .setIterations(100)
  .setRegularization(0.001)
  .setStepsize(0.1)
  .setSeed(42)

svm.fit(astroTrain)
svm.predict(astroTest)


I encounter the following error:

Exception in thread "main" java.lang.RuntimeException: There is no
PredictOperation defined for org.apache.flink.ml.classification.SVM
which takes a DataSet[org.apache.flink.ml.common.LabeledVector] as
input.

Any idea?

Thanks

Thomas


Re: FlinkML ALS matrix factorization: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: null

2016-09-02 Thread ANDREA SPINA
Hi Stefan,
Thank you so much for the answer. Ok, I'll do it asap.
For the sake of argument, could the issue be related to the low number of
blocks? I noticed that the Flink implementation, by default, sets the number
of blocks to the input count (which is actually a lot). So with a low
cardinality and big blocks, maybe they don't fit somewhere...
Thank you again.

Andrea

2016-09-02 10:51 GMT+02:00 Stefan Richter <s.rich...@data-artisans.com>:

> Hi,
>
> unfortunately, the log does not contain the required information for this
> case. It seems like a sender to the SortMerger failed. The best way to find
> this problem is to take a look at the exceptions that are reported in the
> web front-end for the failing job. Could you check if you find any reported
> exceptions there and provide them to us?
>
> Best,
> Stefan
>
> Am 01.09.2016 um 11:56 schrieb ANDREA SPINA <74...@studenti.unimore.it>:
>
> Sure. Here <https://drive.google.com/open?id=0B6TTuPO7UoeFRXY3RW1KQnNrd3c>
> you can find the complete logs file.
> Still can not run through the issue. Thank you for your help.
>
> 2016-08-31 18:15 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> I don't know whether my usual error is related to this one but is very
>> similar and it happens randomly...I still have to figure out the root cause
>> of the error:
>>
>> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce
>> (GroupReduce at createResult(IndexMappingExecutor.java:43)) -> Map (Map
>> at main(Jsonizer.java:90))' , caused an error: Error obtaining the sorted
>> input: Thread 'SortMerger spilling thread' terminated due to an exception:
>> -2
>> at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:456)
>> at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTas
>> k.java:345)
>> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.RuntimeException: Error obtaining the sorted input:
>> Thread 'SortMerger spilling thread' terminated due to an exception: -2
>> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger
>> .getIterator(UnilateralSortMerger.java:619)
>> at org.apache.flink.runtime.operators.BatchTask.getInput(BatchT
>> ask.java:1079)
>> at org.apache.flink.runtime.operators.GroupReduceDriver.prepare
>> (GroupReduceDriver.java:94)
>> at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
>> ... 3 more
>> Caused by: java.io.IOException: Thread 'SortMerger spilling thread'
>> terminated due to an exception: -2
>> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger
>> $ThreadBase.run(UnilateralSortMerger.java:800)
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: -2
>> at java.util.ArrayList.elementData(ArrayList.java:418)
>> at java.util.ArrayList.get(ArrayList.java:431)
>> at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadO
>> bject(MapReferenceResolver.java:42)
>> at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:759)
>> at com.esotericsoftware.kryo.serializers.MapSerializer.read(
>> MapSerializer.java:135)
>> at com.esotericsoftware.kryo.serializers.MapSerializer.read(
>> MapSerializer.java:21)
>> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
>> at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSeriali
>> zer.deserialize(KryoSerializer.java:219)
>> at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSeriali
>> zer.deserialize(KryoSerializer.java:245)
>> at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSeriali
>> zer.copy(KryoSerializer.java:255)
>> at org.apache.flink.api.java.typeutils.runtime.PojoSerializer.
>> copy(PojoSerializer.java:556)
>> at org.apache.flink.api.java.typeutils.runtime.TupleSerializerB
>> ase.copy(TupleSerializerBase.java:75)
>> at org.apache.flink.runtime.operators.sort.NormalizedKeySorter.
>> writeToOutput(NormalizedKeySorter.java:499)
>> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger
>> $SpillingThread.go(UnilateralSortMerger.java:1344)
>> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger
>> $ThreadBase.run(UnilateralSortMerger.java:796)
>>
>>
>> On Wed, Aug 31, 2016 at 5:57 PM, Stefan Richter <
>> s.rich...@data-artisans.com> wrote:
>>
>>> Hi,
>>>
>>> could you provide the log outputs for your job (ideally with debug
>>> logging enabled)?
>>>
>>> Best,
>>> Stefan
>>>
&

Re: FlinkML ALS matrix factorization: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: null

2016-09-02 Thread Stefan Richter
Hi,

unfortunately, the log does not contain the required information for this case. 
It seems like a sender to the SortMerger failed. The best way to find this 
problem is to take a look at the exceptions that are reported in the web 
front-end for the failing job. Could you check if you find any reported 
exceptions there and provide them to us?

Best,
Stefan

> Am 01.09.2016 um 11:56 schrieb ANDREA SPINA <74...@studenti.unimore.it>:
> 
> Sure. Here <https://drive.google.com/open?id=0B6TTuPO7UoeFRXY3RW1KQnNrd3c> 
> you can find the complete logs file.
> Still can not run through the issue. Thank you for your help.
> 
> 2016-08-31 18:15 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it 
> <mailto:pomperma...@okkam.it>>:
> I don't know whether my usual error is related to this one but is very 
> similar and it happens randomly...I still have to figure out the root cause 
> of the error:
> 
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce 
> (GroupReduce at createResult(IndexMappingExecutor.java:43)) -> Map (Map at 
> main(Jsonizer.java:90))' , caused an error: Error obtaining the sorted input: 
> Thread 'SortMerger spilling thread' terminated due to an exception: -2
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:456)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger spilling thread' terminated due to an exception: -2
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1079)
>   at 
> org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
>   ... 3 more
> Caused by: java.io.IOException: Thread 'SortMerger spilling thread' 
> terminated due to an exception: -2
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -2
>   at java.util.ArrayList.elementData(ArrayList.java:418)
>   at java.util.ArrayList.get(ArrayList.java:431)
>   at 
> com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
>   at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:759)
>   at 
> com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
>   at 
> com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:21)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:219)
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:245)
>   at 
> org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:255)
>   at 
> org.apache.flink.api.java.typeutils.runtime.PojoSerializer.copy(PojoSerializer.java:556)
>   at 
> org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.copy(TupleSerializerBase.java:75)
>   at 
> org.apache.flink.runtime.operators.sort.NormalizedKeySorter.writeToOutput(NormalizedKeySorter.java:499)
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.go(UnilateralSortMerger.java:1344)
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
> 
> 
> On Wed, Aug 31, 2016 at 5:57 PM, Stefan Richter <s.rich...@data-artisans.com 
> <mailto:s.rich...@data-artisans.com>> wrote:
> Hi,
> 
> could you provide the log outputs for your job (ideally with debug logging 
> enabled)?
> 
> Best,
> Stefan
> 
>> Am 31.08.2016 um 14:40 schrieb ANDREA SPINA <74...@studenti.unimore.it 
>> <mailto:74...@studenti.unimore.it>>:
>> 
>> Hi everyone.
>> I'm running the FlinkML ALS matrix factorization and I bumped into the 
>> following exception:
>> 
>> org.apache.flink.client.program.ProgramInvocationException: The program 
>> execution failed: Job execution failed.
>>  at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
>>  at org.apac

Re: FlinkML ALS matrix factorization: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: null

2016-09-01 Thread ANDREA SPINA
Sure. Here <https://drive.google.com/open?id=0B6TTuPO7UoeFRXY3RW1KQnNrd3c>
you can find the complete logs file.
I still cannot get past the issue. Thank you for your help.

2016-08-31 18:15 GMT+02:00 Flavio Pompermaier <pomperma...@okkam.it>:

> I don't know whether my usual error is related to this one but is very
> similar and it happens randomly...I still have to figure out the root cause
> of the error:
>
> java.lang.Exception: The data preparation for task 'CHAIN GroupReduce
> (GroupReduce at createResult(IndexMappingExecutor.java:43)) -> Map (Map
> at main(Jsonizer.java:90))' , caused an error: Error obtaining the sorted
> input: Thread 'SortMerger spilling thread' terminated due to an exception:
> -2
> at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:456)
> at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input:
> Thread 'SortMerger spilling thread' terminated due to an exception: -2
> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.
> getIterator(UnilateralSortMerger.java:619)
> at org.apache.flink.runtime.operators.BatchTask.getInput(
> BatchTask.java:1079)
> at org.apache.flink.runtime.operators.GroupReduceDriver.
> prepare(GroupReduceDriver.java:94)
> at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
> ... 3 more
> Caused by: java.io.IOException: Thread 'SortMerger spilling thread'
> terminated due to an exception: -2
> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$
> ThreadBase.run(UnilateralSortMerger.java:800)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -2
> at java.util.ArrayList.elementData(ArrayList.java:418)
> at java.util.ArrayList.get(ArrayList.java:431)
> at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(
> MapReferenceResolver.java:42)
> at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:759)
> at com.esotericsoftware.kryo.serializers.MapSerializer.
> read(MapSerializer.java:135)
> at com.esotericsoftware.kryo.serializers.MapSerializer.
> read(MapSerializer.java:21)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
> at org.apache.flink.api.java.typeutils.runtime.kryo.
> KryoSerializer.deserialize(KryoSerializer.java:219)
> at org.apache.flink.api.java.typeutils.runtime.kryo.
> KryoSerializer.deserialize(KryoSerializer.java:245)
> at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(
> KryoSerializer.java:255)
> at org.apache.flink.api.java.typeutils.runtime.PojoSerializer.copy(
> PojoSerializer.java:556)
> at org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.copy(
> TupleSerializerBase.java:75)
> at org.apache.flink.runtime.operators.sort.NormalizedKeySorter.
> writeToOutput(NormalizedKeySorter.java:499)
> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$
> SpillingThread.go(UnilateralSortMerger.java:1344)
> at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$
> ThreadBase.run(UnilateralSortMerger.java:796)
>
>
> On Wed, Aug 31, 2016 at 5:57 PM, Stefan Richter <
> s.rich...@data-artisans.com> wrote:
>
>> Hi,
>>
>> could you provide the log outputs for your job (ideally with debug
>> logging enabled)?
>>
>> Best,
>> Stefan
>>
>> Am 31.08.2016 um 14:40 schrieb ANDREA SPINA <74...@studenti.unimore.it>:
>>
>> Hi everyone.
>> I'm running the FlinkML ALS matrix factorization and I bumped into the
>> following exception:
>>
>> org.apache.flink.client.program.ProgramInvocationException: The program
>> execution failed: Job execution failed.
>> at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
>> at org.apache.flink.client.program.Client.runBlocking(Client.java:355)
>> at org.apache.flink.client.program.Client.runBlocking(Client.java:315)
>> at org.apache.flink.client.program.ContextEnvironment.execute(C
>> ontextEnvironment.java:60)
>> at org.apache.flink.api.scala.ExecutionEnvironment.execute(Exec
>> utionEnvironment.scala:652)
>> at org.apache.flink.ml.common.FlinkMLTools$.persist(FlinkMLTool
>> s.scala:94)
>> at org.apache.flink.ml.recommendation.ALS$$anon$115.fit(ALS.scala:507)
>> at org.apache.flink.ml.recommendation.ALS$$anon$115.fit(ALS.scala:433)
>> at org.apache.flink.ml.pipeline.Estimator$class.fit(Estimator.scala:55)
>> at org.apache.flink.ml.recommendation.ALS.fit(ALS.scala:122)
>> at dima.tu.berlin.bench

Re: FlinkML ALS matrix factorization: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: null

2016-08-31 Thread Flavio Pompermaier
I don't know whether my usual error is related to this one but is very
similar and it happens randomly... I still have to figure out the root cause
of the error:

java.lang.Exception: The data preparation for task 'CHAIN GroupReduce
(GroupReduce at createResult(IndexMappingExecutor.java:43)) -> Map (Map at
main(Jsonizer.java:90))' , caused an error: Error obtaining the sorted
input: Thread 'SortMerger spilling thread' terminated due to an exception:
-2
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:456)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:345)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input:
Thread 'SortMerger spilling thread' terminated due to an exception: -2
at
org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
at
org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1079)
at
org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:450)
... 3 more
Caused by: java.io.IOException: Thread 'SortMerger spilling thread'
terminated due to an exception: -2
at
org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -2
at java.util.ArrayList.elementData(ArrayList.java:418)
at java.util.ArrayList.get(ArrayList.java:431)
at
com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:759)
at
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
at
com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:21)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:219)
at
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:245)
at
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:255)
at
org.apache.flink.api.java.typeutils.runtime.PojoSerializer.copy(PojoSerializer.java:556)
at
org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.copy(TupleSerializerBase.java:75)
at
org.apache.flink.runtime.operators.sort.NormalizedKeySorter.writeToOutput(NormalizedKeySorter.java:499)
at
org.apache.flink.runtime.operators.sort.UnilateralSortMerger$SpillingThread.go(UnilateralSortMerger.java:1344)
at
org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)


On Wed, Aug 31, 2016 at 5:57 PM, Stefan Richter <s.rich...@data-artisans.com
> wrote:

> Hi,
>
> could you provide the log outputs for your job (ideally with debug logging
> enabled)?
>
> Best,
> Stefan
>
> Am 31.08.2016 um 14:40 schrieb ANDREA SPINA <74...@studenti.unimore.it>:
>
> Hi everyone.
> I'm running the FlinkML ALS matrix factorization and I bumped into the
> following exception:
>
> org.apache.flink.client.program.ProgramInvocationException: The program
> execution failed: Job execution failed.
> at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
> at org.apache.flink.client.program.Client.runBlocking(Client.java:355)
> at org.apache.flink.client.program.Client.runBlocking(Client.java:315)
> at org.apache.flink.client.program.ContextEnvironment.execute(
> ContextEnvironment.java:60)
> at org.apache.flink.api.scala.ExecutionEnvironment.execute(Exec
> utionEnvironment.scala:652)
> at org.apache.flink.ml.common.FlinkMLTools$.persist(FlinkMLTools.scala:94)
> at org.apache.flink.ml.recommendation.ALS$$anon$115.fit(ALS.scala:507)
> at org.apache.flink.ml.recommendation.ALS$$anon$115.fit(ALS.scala:433)
> at org.apache.flink.ml.pipeline.Estimator$class.fit(Estimator.scala:55)
> at org.apache.flink.ml.recommendation.ALS.fit(ALS.scala:122)
> at dima.tu.berlin.benchmark.flink.als.RUN$.main(RUN.scala:78)
> at dima.tu.berlin.benchmark.flink.als.RUN.main(RUN.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
> ssorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
> thodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.flink.client.program.PackagedProgram.callMainMeth
> od(PackagedProgram.java:505)
> at org.apache.flink.client.program.PackagedProgram.invokeIntera
> ctiveModeForExecution(PackagedProgram.java:403)
> at org.apache.flink.

Re: FlinkML ALS matrix factorization: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: null

2016-08-31 Thread Stefan Richter
Hi,

could you provide the log outputs for your job (ideally with debug logging 
enabled)?

Best,
Stefan

> Am 31.08.2016 um 14:40 schrieb ANDREA SPINA <74...@studenti.unimore.it>:
> 
> Hi everyone.
> I'm running the FlinkML ALS matrix factorization and I bumped into the 
> following exception:
> 
> org.apache.flink.client.program.ProgramInvocationException: The program 
> execution failed: Job execution failed.
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:355)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:315)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:60)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:652)
>   at 
> org.apache.flink.ml.common.FlinkMLTools$.persist(FlinkMLTools.scala:94)
>   at org.apache.flink.ml.recommendation.ALS$$anon$115.fit(ALS.scala:507)
>   at org.apache.flink.ml.recommendation.ALS$$anon$115.fit(ALS.scala:433)
>   at org.apache.flink.ml.pipeline.Estimator$class.fit(Estimator.scala:55)
>   at org.apache.flink.ml.recommendation.ALS.fit(ALS.scala:122)
>   at dima.tu.berlin.benchmark.flink.als.RUN$.main(RUN.scala:78)
>   at dima.tu.berlin.benchmark.flink.als.RUN.main(RUN.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:505)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
>   at org.apache.flink.client.program.Client.runBlocking(Client.java:248)
>   at 
> org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:866)
>   at org.apache.flink.client.CliFrontend.run(CliFrontend.java:333)
>   at 
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1192)
>   at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1243)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:717)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
>   at 
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.RuntimeException: Initializing the input processing 
> failed: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' 
> terminated due to an exception: null
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:325)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: null
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1079)
>   at 
> org.apache.flink.runtime.operators.BatchTask.initLocalStrategies(BatchTask.java:819)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:321)
>   ... 2 more
> Caused by: java.io.IOException: Thread 'SortMerger Reading

FlinkML ALS matrix factorization: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: null

2016-08-31 Thread ANDREA SPINA
Hi everyone.
I'm running the FlinkML ALS matrix factorization and I bumped into the
following exception:

org.apache.flink.client.program.ProgramInvocationException: The program
execution failed: Job execution failed.
at org.apache.flink.client.program.Client.runBlocking(Client.java:381)
at org.apache.flink.client.program.Client.runBlocking(Client.java:355)
at org.apache.flink.client.program.Client.runBlocking(Client.java:315)
at
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:60)
at
org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:652)
at org.apache.flink.ml.common.FlinkMLTools$.persist(FlinkMLTools.scala:94)
at org.apache.flink.ml.recommendation.ALS$$anon$115.fit(ALS.scala:507)
at org.apache.flink.ml.recommendation.ALS$$anon$115.fit(ALS.scala:433)
at org.apache.flink.ml.pipeline.Estimator$class.fit(Estimator.scala:55)
at org.apache.flink.ml.recommendation.ALS.fit(ALS.scala:122)
at dima.tu.berlin.benchmark.flink.als.RUN$.main(RUN.scala:78)
at dima.tu.berlin.benchmark.flink.als.RUN.main(RUN.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:505)
at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:403)
at org.apache.flink.client.program.Client.runBlocking(Client.java:248)
at
org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:866)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:333)
at
org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1192)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1243)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job
execution failed.
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:717)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:663)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.RuntimeException: Initializing the input processing
failed: Error obtaining the sorted input: Thread 'SortMerger Reading
Thread' terminated due to an exception: null
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:325)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input:
Thread 'SortMerger Reading Thread' terminated due to an exception: null
at
org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
at
org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1079)
at
org.apache.flink.runtime.operators.BatchTask.initLocalStrategies(BatchTask.java:819)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:321)
... 2 more
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread'
terminated due to an exception: null
at
org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by:
org.apache.flink.runtime.io.network.partition.ProducerFailedException
at
org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextLookAhead(LocalInputChannel.java:270)
at
org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.onNotification(LocalInputChannel.java:238)
at
org.apache.flink.runtime.io.network.partition.PipelinedSubpartition.release(PipelinedSubpartition.java:158)
at
org.apache.flink.runtime.io.network.partition.ResultPartition.release(ResultPartition.java:320)
at
org.apache.flink.runtime.io.network.partition.ResultPartitionManager.releasePartitionsProducedBy(ResultPartitionManager.java:95)
at
org.apache.flink.runtime.io.network.NetworkEnvironment.unregisterTask(NetworkEnvironment.java:370

Re: Using FlinkML algorithms in Streaming

2016-05-11 Thread Simone Robutti
Actually, model portability and persistence is a serious limitation to
practical use of FlinkML in streaming. If you know what you're doing, you
can write a blunt serializer for your model, write it to a file, and rebuild
the model stream-side from the deserialized information.

I tried it for an SVM model and there were no obstacles. It's ugly but it
works.
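Simone's "blunt serializer" approach can be sketched without any Flink dependency: dump the fitted weight vector to a file on the batch side and read it back on the stream side. A minimal illustration follows — `ModelFile` is a hypothetical helper using plain `java.io`, and a real model would typically carry more state than a single weight vector:

```java
import java.io.*;

// Hypothetical "model": just a weight vector, as produced by e.g. an SVM
// or linear-regression fit. Serialize it bluntly to a file batch-side,
// then deserialize it stream-side to rebuild the predictor.
public class ModelFile {
    static void save(double[] weights, File f) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(weights.length);          // vector length first
            for (double w : weights) out.writeDouble(w);
        }
    }

    static double[] load(File f) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            double[] weights = new double[in.readInt()];
            for (int i = 0; i < weights.length; i++) weights[i] = in.readDouble();
            return weights;
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("model", ".bin");
        double[] w = {0.5, -1.25, 3.0};
        save(w, f);
        double[] restored = load(f);
        System.out.println(java.util.Arrays.equals(w, restored)); // prints "true"
        f.delete();
    }
}
```

Stream-side, the `load` step would run once (e.g. in an operator's open method) and the restored weights would back the per-element prediction.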

2016-05-11 16:18 GMT+02:00 Márton Balassi <balassi.mar...@gmail.com>:

> Currently I am not aware of streaming learners support, you would need to
> implement that yourself at this point.
>
> As for streaming predictors for batch learners I have some preview code
> that you might like. [1]
>
> [1]
> https://github.com/streamline-eu/ML-Pipelines/blob/314e3d940f1f1ac7b762ba96067e13d806476f57/flink-libraries/flink-stream-ml/src/main/scala/org/apache/flink/stream/ml/examples/MLRExample.scala
>
>
>
> On Wed, May 11, 2016 at 3:52 PM, Piyush Shrivastava <piyush...@yahoo.co.in
> > wrote:
>
>> Hi Márton,
>>
>> I want to train and get the residuals.
>>
>> Thanks and Regards,
>> Piyush Shrivastava <piy...@webograffiti.com>
>> [image: WeboGraffiti]
>> http://webograffiti.com
>>
>>
>> On Wednesday, 11 May 2016 7:19 PM, Márton Balassi <
>> balassi.mar...@gmail.com> wrote:
>>
>>
>> Hey Piyush,
>>
>> Would you like to train or predict on the streaming data?
>>
>> Best,
>>
>> Marton
>>
>> On Wed, May 11, 2016 at 3:44 PM, Piyush Shrivastava <
>> piyush...@yahoo.co.in> wrote:
>>
>> Hello all,
>>
>> I want to perform linear regression using FlinkML's
>> MultipleLinearRegression() function on streaming data.
>>
>> This function takes a DataSet as an input and I cannot create a DataSet
>> inside the MapFunction of a DataStream. How can I use this function on my
>> DataStream?
>>
>> Thanks and Regards,
>> Piyush Shrivastava <piy...@webograffiti.com>
>> [image: WeboGraffiti]
>> http://webograffiti.com
>>
>>
>>
>>
>>
>


Re: Using FlinkML algorithms in Streaming

2016-05-11 Thread Márton Balassi
Currently I am not aware of streaming learners support, you would need to
implement that yourself at this point.

As for streaming predictors for batch learners I have some preview code
that you might like. [1]

[1]
https://github.com/streamline-eu/ML-Pipelines/blob/314e3d940f1f1ac7b762ba96067e13d806476f57/flink-libraries/flink-stream-ml/src/main/scala/org/apache/flink/stream/ml/examples/MLRExample.scala



On Wed, May 11, 2016 at 3:52 PM, Piyush Shrivastava 
wrote:

> Hi Márton,
>
> I want to train and get the residuals.
>
> Thanks and Regards,
> Piyush Shrivastava 
> [image: WeboGraffiti]
> http://webograffiti.com
>
>
> On Wednesday, 11 May 2016 7:19 PM, Márton Balassi <
> balassi.mar...@gmail.com> wrote:
>
>
> Hey Piyush,
>
> Would you like to train or predict on the streaming data?
>
> Best,
>
> Marton
>
> On Wed, May 11, 2016 at 3:44 PM, Piyush Shrivastava  > wrote:
>
> Hello all,
>
> I want to perform linear regression using FlinkML's
> MultipleLinearRegression() function on streaming data.
>
> This function takes a DataSet as an input and I cannot create a DataSet
> inside the MapFunction of a DataStream. How can I use this function on my
> DataStream?
>
> Thanks and Regards,
> Piyush Shrivastava 
> [image: WeboGraffiti]
> http://webograffiti.com
>
>
>
>
>
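Márton's preview code aside, the usual workaround for this batch/stream gap is to fit the model on a DataSet, extract the weight vector, and apply it in a plain map over the stream. A Flink-free sketch of the stream-side step — the weights and intercept here are hypothetical placeholders for the output of a batch MultipleLinearRegression fit, and the residual is simply label minus prediction:

```java
import java.util.List;

public class StreamSidePredict {
    // Hypothetical weights/intercept, as extracted from a batch-trained model.
    static final double[] WEIGHTS = {2.0, -1.0};
    static final double INTERCEPT = 0.5;

    // The function you would run inside a stream-side map(): w . x + b
    static double predict(double[] features) {
        double y = INTERCEPT;
        for (int i = 0; i < features.length; i++) {
            y += WEIGHTS[i] * features[i];
        }
        return y;
    }

    public static void main(String[] args) {
        // Stand-in for incoming (features, label) stream elements.
        List<double[]> features = List.of(new double[]{1.0, 0.0}, new double[]{0.0, 2.0});
        double[] labels = {2.5, -1.0};
        for (int i = 0; i < labels.length; i++) {
            double p = predict(features.get(i));
            System.out.printf("prediction=%.2f residual=%.2f%n", p, labels[i] - p);
        }
    }
}
```

This only gives a static model; keeping the weights fresh requires periodically re-training in batch and reloading them, since streaming learners were not supported at this point.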


Re: Using FlinkML algorithms in Streaming

2016-05-11 Thread Piyush Shrivastava
Hi Márton,

I want to train and get the residuals.

Thanks and Regards,
Piyush Shrivastava
http://webograffiti.com


On Wednesday, 11 May 2016 7:19 PM, Márton Balassi wrote:


Hey Piyush,

Would you like to train or predict on the streaming data?

Best,

Marton

On Wed, May 11, 2016 at 3:44 PM, Piyush Shrivastava wrote:

Hello all,

I want to perform linear regression using FlinkML's
MultipleLinearRegression() function on streaming data.

This function takes a DataSet as an input and I cannot create a DataSet
inside the MapFunction of a DataStream. How can I use this function on my
DataStream?

Thanks and Regards,
Piyush Shrivastava
http://webograffiti.com

Re: Using FlinkML algorithms in Streaming

2016-05-11 Thread Márton Balassi
Hey Piyush,

Would you like to train or predict on the streaming data?

Best,

Marton

On Wed, May 11, 2016 at 3:44 PM, Piyush Shrivastava 
wrote:

> Hello all,
>
> I want to perform linear regression using FlinkML's
> MultipleLinearRegression() function on streaming data.
>
> This function takes a DataSet as an input and I cannot create a DataSet
> inside the MapFunction of a DataStream. How can I use this function on my
> DataStream?
>
> Thanks and Regards,
> Piyush Shrivastava 
> [image: WeboGraffiti]
> http://webograffiti.com
>


Re: FlinkML 0.10.1 - Using SparseVectors with MLR does not work

2016-02-04 Thread Till Rohrmann
Hi Sourigna,

it turned out to be a bug in the GradientDescent implementation which
cannot handle sparse gradients. That is not so problematic by itself,
because the sum of gradient vectors is usually dense even if the individual
gradient vectors are sparse. We simply forgot to initialize the initial
vector of the reduce operation to be dense. I’ve created a PR [1] which
should fix the problem. After reviewing it, it should be merged in the next
days.

[1] https://github.com/apache/flink/pull/1587

Cheers,
Till
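The fix Till describes can be illustrated in miniature: if the reduce that sums per-sample gradients is seeded with a sparse zero vector, the accumulated result may stay sparse and trip code that expects dense output; seeding it with a dense zero vector makes the summed gradient dense regardless of the inputs. A toy, Flink-free sketch — the map-based sparse representation here is hypothetical, not FlinkML's actual BLAS code:

```java
import java.util.Map;

public class GradientSum {
    // A sparse gradient: index -> value. Accumulating into a dense array
    // guarantees the sum is dense even though every input is sparse.
    static double[] addInto(double[] dense, Map<Integer, Double> sparse) {
        for (Map.Entry<Integer, Double> e : sparse.entrySet()) {
            dense[e.getKey()] += e.getValue();
        }
        return dense;
    }

    public static void main(String[] args) {
        // Seed the reduction with a DENSE zero vector, mirroring the fix:
        double[] acc = new double[5];
        addInto(acc, Map.of(0, 1.0, 3, 2.0));   // gradient of sample 1
        addInto(acc, Map.of(3, 0.5, 4, -1.0));  // gradient of sample 2
        System.out.println(java.util.Arrays.toString(acc));
        // prints "[1.0, 0.0, 0.0, 2.5, -1.0]"
    }
}
```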

On Thu, Feb 4, 2016 at 5:09 AM, Chiwan Park <chiwanp...@apache.org> wrote:

> Hi Gna,
>
> Thanks for reporting the problem. Because the level 1 operations in the
> FlinkML BLAS library don't support SparseVector, SparseVector is not
> supported currently. I've filed this in JIRA [1].
>
> Maybe I can send a patch to solve this in a few days.
>
> [1]: https://issues.apache.org/jira/browse/FLINK-3330
>
> Regards,
> Chiwan Park
>
> > On Feb 4, 2016, at 5:39 AM, Sourigna Phetsarath <
> gna.phetsar...@teamaol.com> wrote:
> >
> > All:
> >
> > I'm trying to use SparseVectors with FlinkML 0.10.1.  It does not seem
> to be working.  Here is a UnitTest that I created to recreate the problem:
> >
> >
> > package com.aol.ds.arc.ml.poc.flink
> >
> > import org.junit.After
> > import org.junit.Before
> > import org.slf4j.LoggerFactory
> > import org.apache.flink.test.util.ForkableFlinkMiniCluster
> > import scala.concurrent.duration.FiniteDuration
> > import java.util.concurrent.TimeUnit
> > import org.apache.flink.test.util.TestBaseUtils
> > import org.apache.flink.runtime.StreamingMode
> > import org.apache.flink.test.util.TestEnvironment
> > import org.junit.Test
> > import org.apache.flink.ml.common.LabeledVector
> > import org.apache.flink.ml.math.SparseVector
> > import org.apache.flink.api.scala._
> > import org.apache.flink.ml.regression.MultipleLinearRegression
> > import org.apache.flink.ml.math.DenseVector
> > class FlinkMLRTest {
> >   var Logger = LoggerFactory.getLogger(getClass.getName)
> >   var cluster: Option[ForkableFlinkMiniCluster] = None
> >   val parallelism = 4
> >   val DEFAULT_AKKA_ASK_TIMEOUT = 1000
> >   val DEFAULT_TIMEOUT = new FiniteDuration(DEFAULT_AKKA_ASK_TIMEOUT,
> TimeUnit.SECONDS)
> >   @Before
> >   def doBefore(): Unit = {
> > val cl = TestBaseUtils.startCluster(
> >   1,
> >   parallelism,
> >   StreamingMode.BATCH_ONLY,
> >   false,
> >   false,
> >   true)
> > val clusterEnvironment = new TestEnvironment(cl, parallelism)
> > clusterEnvironment.setAsContext()
> > cluster = Some(cl)
> >   }
> >   @After
> >   def doAfter(): Unit = {
> > cluster.map(c => TestBaseUtils.stopCluster(c, DEFAULT_TIMEOUT))
> >   }
> >   @Test
> >   def testMLR() {
> > val env = ExecutionEnvironment.getExecutionEnvironment
> > val training = Seq(
> >   new LabeledVector(1.0, new SparseVector(10, Array(0, 2, 3),
> Array(1.0, 1.0, 1.0))),
> >   new LabeledVector(1.0, new SparseVector(10, Array(0, 1, 5, 9),
> Array(1.0, 1.0, 1.0, 1.0))),
> >   new LabeledVector(0.0, new SparseVector(10, Array(0, 2),
> Array(0.0, 1.0))),
> >   new LabeledVector(0.0, new SparseVector(10, Array(0), Array(0.0))),
> >   new LabeledVector(0.0, new SparseVector(10, Array(0, 2),
> Array(0.0, 1.0))),
> >   new LabeledVector(0.0, new SparseVector(10, Array(0), Array(0.0
> > val testing = Seq(
> >   new LabeledVector(1.0, new SparseVector(10, Array(0, 3),
> Array(1.0, 1.0))),
> >   new LabeledVector(0.0, new SparseVector(10, Array(0, 2, 3),
> Array(0.0, 1.0, 1.0))),
> >   new LabeledVector(0.0, new SparseVector(10, Array(0), Array(0.0
> > val trainingDS = env.fromCollection(training)
> > val testingDS = env.fromCollection(testing)
> > trainingDS.print()
> > val mlr = MultipleLinearRegression()
> >   .setIterations(100)
> >   .setStepsize(2)
> >   .setConvergenceThreshold(0.001)
> > mlr.fit(trainingDS)
> > val weights = mlr.weightsOption match {
> >   case Some(weights) => { weights.collect() }
> >   case None => throw new Exception("Could not calculate the
> weights.")
> > }
> > if (Logger.isInfoEnabled())
> >   Logger.info(s"*** WEIGHTS: ${weights.mkString(";")}")
> > testingDS.print()
> > val predictions = mlr.evaluate(testingDS.map(x => (x.vector,
> x.label)))
> > if (Logger.isInfoEna

FlinkML 0.10.1 - Using SparseVectors with MLR does not work

2016-02-03 Thread Sourigna Phetsarath
All:

I'm trying to use SparseVectors with FlinkML 0.10.1.  It does not seem to
be working.  Here is a UnitTest that I created to recreate the problem:


package com.aol.ds.arc.ml.poc.flink


> import org.junit.After
> import org.junit.Before
> import org.slf4j.LoggerFactory
> import org.apache.flink.test.util.ForkableFlinkMiniCluster
> import scala.concurrent.duration.FiniteDuration
> import java.util.concurrent.TimeUnit
> import org.apache.flink.test.util.TestBaseUtils
> import org.apache.flink.runtime.StreamingMode
> import org.apache.flink.test.util.TestEnvironment
> import org.junit.Test
> import org.apache.flink.ml.common.LabeledVector
> import org.apache.flink.ml.math.SparseVector
> import org.apache.flink.api.scala._
> import org.apache.flink.ml.regression.MultipleLinearRegression
> import org.apache.flink.ml.math.DenseVector
> class FlinkMLRTest {
>   var Logger = LoggerFactory.getLogger(getClass.getName)
>   var cluster: Option[ForkableFlinkMiniCluster] = None
>   val parallelism = 4
>   val DEFAULT_AKKA_ASK_TIMEOUT = 1000
>   val DEFAULT_TIMEOUT = new FiniteDuration(DEFAULT_AKKA_ASK_TIMEOUT,
> TimeUnit.SECONDS)
>   @Before
>   def doBefore(): Unit = {
> val cl = TestBaseUtils.startCluster(
>   1,
>   parallelism,
>   StreamingMode.BATCH_ONLY,
>   false,
>   false,
>   true)
> val clusterEnvironment = new TestEnvironment(cl, parallelism)
> clusterEnvironment.setAsContext()
> cluster = Some(cl)
>   }
>   @After
>   def doAfter(): Unit = {
> cluster.map(c => TestBaseUtils.stopCluster(c, DEFAULT_TIMEOUT))
>   }
>   @Test
>   def testMLR() {
> val env = ExecutionEnvironment.getExecutionEnvironment
> val training = Seq(
>   new LabeledVector(1.0, new SparseVector(10, Array(0, 2, 3),
> Array(1.0, 1.0, 1.0))),
>   new LabeledVector(1.0, new SparseVector(10, Array(0, 1, 5, 9),
> Array(1.0, 1.0, 1.0, 1.0))),
>   new LabeledVector(0.0, new SparseVector(10, Array(0, 2), Array(
> 0.0, 1.0))),
>   new LabeledVector(0.0, new SparseVector(10, Array(0), Array(0.0
> ))),
>   new LabeledVector(0.0, new SparseVector(10, Array(0, 2), Array(
> 0.0, 1.0))),
>   new LabeledVector(0.0, new SparseVector(10, Array(0), Array(0.0
> 
> val testing = Seq(
>   new LabeledVector(1.0, new SparseVector(10, Array(0, 3), Array(
> 1.0, 1.0))),
>   new LabeledVector(0.0, new SparseVector(10, Array(0, 2, 3),
> Array(0.0, 1.0, 1.0))),
>   new LabeledVector(0.0, new SparseVector(10, Array(0), Array(0.0
> 
> val trainingDS = env.fromCollection(training)
> val testingDS = env.fromCollection(testing)
> trainingDS.print()
> val mlr = MultipleLinearRegression()
>   .setIterations(100)
>   .setStepsize(2)
>   .setConvergenceThreshold(0.001)
> mlr.fit(trainingDS)
> val weights = mlr.weightsOption match {
>   case Some(weights) => { weights.collect() }
>   case None => throw new Exception("Could not calculate the
> weights.")
> }
> if (Logger.isInfoEnabled())
>   Logger.info(s"*** WEIGHTS: ${weights.mkString(";")}")
> testingDS.print()
> val predictions = mlr.evaluate(testingDS.map(x => (x.vector, x.label
> )))
> if (Logger.isInfoEnabled()) {
>   Logger.info(predictions.collect().mkString(","))
> }
>   }
>   @Test
>   def testMLR_DenseVector() {
> val env = ExecutionEnvironment.getExecutionEnvironment
> val training = Seq(
>   new LabeledVector(1.0, DenseVector(1.0, 0.0, 0.0, 1.0, 1.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0)),
>   new LabeledVector(1.0, DenseVector(1.0, 0.0, 1.0, 0.0, 0.0, 0.0,
> 1.0, 0.0, 0.0, 0.0, 1.0)),
>   new LabeledVector(0.0, DenseVector(0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0)),
>   new LabeledVector(0.0, DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0)),
>   new LabeledVector(0.0, DenseVector(0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0)),
>   new LabeledVector(0.0, DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0)))
> val testing = Seq(
>   new LabeledVector(1.0, DenseVector(1.0, 0.0, 0.0, 0.0, 1.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0)),
>   new LabeledVector(0.0, DenseVector(0.0, 0.0, 0.0, 1.0, 1.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0)),
>   new LabeledVector(0.0, DenseVector(0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0)))
> val trainingDS = env.fromCollection(training)
> val testingDS = env.from