Spark CEP

2018-08-14 Thread Esa Heikkinen
Hi

I would like to know more about Spark CEP (Complex Event Processing). Do there 
exist some simple (and also more complex) examples with input data (log files)?

Is Spark CEP based on Siddhi? If yes, would it be better to use Siddhi directly?

I know CEP engines are intended for streaming data, but perhaps it is not a 
problem to use them on batch data?

BR Esa



Execution model in Spark

2018-05-28 Thread Esa Heikkinen
Hi

I don't know whether this question is suitable for this forum, but I take the 
risk and ask :)

In my understanding the execution model in Spark is very data-flow (stream) 
oriented and stream-specific. Is it difficult to build control-flow logic (like 
a state machine) outside of the stream-specific processing?

Is the only way to combine all the different types of event streams into one big 
stream and then process it with some custom stateful "logic"?
And how would such logic be built? (A sketch of one possible approach is shown below.)
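
A minimal PySpark sketch of one possible approach, assuming the streams can be 
unioned into (key, event) pairs and that per-key state is enough; the socket 
sources, ports and event names below are purely hypothetical:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stateful-logic-sketch")
ssc = StreamingContext(sc, batchDuration=5)
ssc.checkpoint("/tmp/state-checkpoint")   # required for stateful operations

# Hypothetical sources: each DStream is mapped to (key, event) pairs.
stream_a = ssc.socketTextStream("localhost", 9001).map(lambda line: ("A", line))
stream_b = ssc.socketTextStream("localhost", 9002).map(lambda line: ("B", line))

# Combine the different event streams into one.
combined = stream_a.union(stream_b)

def update_state(new_events, state):
    # A tiny state machine per key, advanced by the events of each micro-batch.
    state = state or "START"
    for event in new_events:
        if state == "START" and event.startswith("login"):
            state = "LOGGED_IN"
        elif state == "LOGGED_IN" and event.startswith("logout"):
            state = "START"
    return state

# Keep per-key state across micro-batches.
states = combined.updateStateByKey(update_state)
states.pprint()

ssc.start()
ssc.awaitTermination()

The same idea could be expressed with mapGroupsWithState in Structured Streaming 
(Scala/Java), but the DStream version above is the simplest to sketch in Python.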

Best Regards, Esa



RE: How Spark can solve this example

2018-05-18 Thread Esa Heikkinen
Hello

That is good to hear, but do there exist some good practical (Python or Scala) 
examples? They would help a lot.

I tried to do that with Apache Flink (and its CEP) and it was not such a piece of cake.

Best, Esa

From: Matteo Cossu <elco...@gmail.com>
Sent: Friday, May 18, 2018 10:51 AM
To: Esa Heikkinen <esa.heikki...@student.tut.fi>
Cc: user@spark.apache.org
Subject: Re: How Spark can solve this example

Hello Esa,
all the steps that you described can be performed with Spark. I don't know 
about CEP, but Spark Streaming should be enough.

Best,

Matteo

On 18 May 2018 at 09:20, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
Hi

I have attached a fictive example (PDF file) about processing event traces 
from data streams (or batch data). I hope the picture in the attachment is 
clear and understandable.

I would be very interested in how best to solve it with Spark. Is it 
possible or not? If it is possible, can it be solved for example with CEP?

A little explanation: the data processing reads three different, parallel 
streams (or batch data sets): A, B and C. Each of them has events (records) 
with different keys and values (like K1-K4).

I want to find all event traces that have certain dependences or patterns 
between the streams (or batches). To find a pattern there are three steps:

1)  Search for an event that has value "X" in K1 in stream A; if it is 
found, store it in global data for later use and continue to the next step

2)  Search for an event that has value A(K1) in K2 in stream B; if it is 
found, store it in global data for later use and continue to the next step

3)  Search for an event that has value A(K1) in K1 and value B(K3) in K2 in 
stream C; if it is found, continue to the next step (back to step 1)

If that is not possible with Spark, do you have any idea of tools that could 
solve this? (A rough sketch of one possible batch-mode approach is shown below.)
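
A rough PySpark DataFrame sketch of the three steps in batch mode, assuming the 
streams A, B and C are available as CSV files and using hypothetical file paths 
and column names K1-K3:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-trace-sketch").getOrCreate()

a = spark.read.option("header", True).csv("logs/A.csv")   # columns: ts, K1, K2, ...
b = spark.read.option("header", True).csv("logs/B.csv")   # columns: ts, K1, K2, K3, ...
c = spark.read.option("header", True).csv("logs/C.csv")   # columns: ts, K1, K2, ...

# Step 1: events in A whose K1 equals the literal "X".
step1 = a.filter(F.col("K1") == "X").select(F.col("K1").alias("a_k1"))

# Step 2: events in B whose K2 equals the A(K1) value found in step 1.
step2 = (step1.join(b, step1["a_k1"] == b["K2"])
              .select("a_k1", F.col("K3").alias("b_k3")))

# Step 3: events in C whose K1 equals A(K1) and whose K2 equals B(K3).
traces = step2.join(c, (step2["a_k1"] == c["K1"]) & (step2["b_k3"] == c["K2"]))
traces.show()

In a real streaming setting the same dependencies would have to be expressed as 
stream-stream joins or with explicit state, which is where a CEP library would 
normally help.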

Best, Esa






Spark CEP with files and no streams ?

2018-02-07 Thread Esa Heikkinen
Hello

I am trying to use Spark CEP for log files (as a batch job), not for streams 
(in real time).
Is that possible? If yes, do you know of example Scala code for that?

Or should I convert the log files (with timestamps) into streams?
And how should timestamps be handled in Spark?

If I cannot use Spark at all for this purpose, do you have any recommendations 
for other tools?

I would like CEP-type analysis for log files. (A small sketch of reading a 
timestamped log file as a batch DataFrame is shown below.)
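
The question asks for Scala, but as a small illustration here is a PySpark 
sketch of reading a timestamped log file as a batch DataFrame; the file path, 
column names and timestamp format are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-batch-sketch").getOrCreate()

# Hypothetical log layout: timestamp,bus_number,event
log = spark.read.option("header", True).csv("logs/events.csv")

# Parse the timestamp column so it can be used for ordering and windowing.
log = log.withColumn("ts", F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss"))

# Time-based grouping works on batch data too, e.g. counts per 30-second window.
counts = log.groupBy(F.window("ts", "30 seconds"), "bus_number").count()
counts.orderBy("window").show(truncate=False)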



Best active groups, forums or contacts for Spark ?

2018-01-26 Thread Esa Heikkinen
Hi

It is very often difficult to get answers to questions about Spark in many 
forums. Maybe they are inactive or my questions are too bad. I don't know, but 
does anyone know of good active groups, forums or contacts other than this one?

Esa Heikkinen



Spark and CEP type examples

2018-01-22 Thread Esa Heikkinen
Hi

I am looking for simple examples of using CEP (Complex Event Processing) with 
Scala and Python. Does anyone know of good ones?

I do not need preprocessing (as in Kafka), only the analysis phase of CEP 
inside Spark.

I am also interested in other possibilities for searching for sequential event 
patterns in (time series) log data using Spark. These event patterns can be 
described by DAGs (directed acyclic graphs), so they are not simple chains of 
events. And sometimes it can be better to search for events in backward or 
reverse order (not always in forward order as in the streaming case). But is 
this impossible with Spark?

Esa Heikkinen




Pyspark and searching items from data structures

2017-12-28 Thread Esa Heikkinen
Hi

I want to build a PySpark application that searches for sequential items or 
events of a time series in CSV files.

What are the best data structures for this purpose? A PySpark or pandas 
DataFrame, an RDD, SQL, or something else? (A minimal DataFrame sketch is shown below.)
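
A minimal sketch using a PySpark DataFrame and a window function to look at 
consecutive events; the CSV path, column names and event values are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("sequential-events-sketch").getOrCreate()

# Hypothetical CSV with columns: ts, source, event
df = (spark.read.option("header", True).csv("data/events.csv")
      .withColumn("ts", F.to_timestamp("ts")))

# Pair each event with the previous event from the same source.
w = Window.partitionBy("source").orderBy("ts")
pairs = df.withColumn("prev_event", F.lag("event").over(w))

# Find "login" events that directly follow a "start" event.
matches = pairs.filter((F.col("prev_event") == "start") & (F.col("event") == "login"))
matches.show()

A pandas DataFrame supports the same kind of logic for data that fits in one 
machine's memory; an RDD is the lower-level option.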

---
Esa


Spark and neural networks

2017-12-27 Thread Esa Heikkinen
Hi


What would be the best way to use Spark and neural networks (especially RNN/LSTM)?


I think it would be possible with a combination of tools:

PySpark + Anaconda + pandas + NumPy + Keras + TensorFlow + scikit-learn


But what about scalability and usability with Spark (PySpark)?


How compatible are the data structures (for example DataFrames) between Spark 
and the other tools?

And is it possible to convert them between the different tools? (A small conversion sketch is shown below.)
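
A small sketch of the usual bridge between Spark and the pandas/NumPy world that 
Keras and TensorFlow consume; note that toPandas() collects the data to the 
driver, so it only works when the data fits in driver memory:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("conversion-sketch").getOrCreate()

# Spark DataFrame -> pandas DataFrame (collected to the driver).
sdf = spark.range(0, 1000).withColumnRenamed("id", "x")
pdf = sdf.toPandas()

# pandas DataFrame -> Spark DataFrame.
sdf2 = spark.createDataFrame(pd.DataFrame({"x": [1, 2, 3], "y": [0.1, 0.2, 0.3]}))
sdf2.printSchema()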


---

Esa


Windows10 + pyspark + ipython + csv file loading with timestamps

2017-12-16 Thread Esa Heikkinen

Hi

Does anyone have any hints or example code for how to get the combination 
Windows 10 + PySpark + IPython notebook + CSV file loading with timestamps 
(time series data) into a DataFrame or RDD to work?


I have already installed Windows 10 + PySpark + IPython notebook and they 
seem to work, but my Python code in the notebook does not, possibly because 
the Spark context is not working?


What commands should be put at the beginning of the notebook? sc = 
SparkContext.getOrCreate()? spark = SparkSession(sc)?


I have installed spark-2.2.1-bin-hadoop2.7 and IPython 6.1.0 on Windows 10. (A minimal notebook preamble sketch is shown below.)
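
A minimal notebook preamble sketch, assuming Spark 2.2 is visible to the Python 
process (for example via SPARK_HOME or the findspark package); the CSV path, 
column name and timestamp format are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .master("local[*]")
         .appName("notebook-sketch")
         .getOrCreate())
sc = spark.sparkContext   # the underlying SparkContext, if RDDs are needed

# Hypothetical CSV with a "ts" column such as "2017-12-16 10:00:00".
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("C:/data/timeseries.csv")
      .withColumn("ts", F.to_timestamp("ts", "yyyy-MM-dd HH:mm:ss")))
df.printSchema()
df.orderBy("ts").show(5)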



Esa



RE: RE: Using Spark as a simulator

2017-07-07 Thread Esa Heikkinen

I only want to simulate a very huge "network" with even millions of parallel, 
time-synchronized actors (state machines). There is also communication between 
the actors via some (key-value pair) database. I also want the simulation to 
work in real time.


I don't know what would be the best framework or tool for that kind of 
simulation. I think Akka would be the best and easiest to deploy?

Or do you know better frameworks or tools ?


Nowadays Spark is using Netty instead of Akka?


From: Jörn Franke <jornfra...@gmail.com>
Sent: 7 July 2017 10:04
To: Esa Heikkinen
Cc: Mahesh Sawaiker; user@spark.apache.org
Subject: Re: RE: Using Spark as a simulator

Spark dropped Akka some time ago...
I think the main issue he will face is a library for simulating the state 
machines (randomly), storing a huge amount of files (HDFS is probably the way 
to go if you want it fast) and distributing the work (here you can select 
different options).
Are you trying to have some mathematical guarantees on the state machines, such 
as Markov chains/steady state?

On 7. Jul 2017, at 08:46, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:



Would it be better to use Akka as a simulator rather than Spark?

http://akka.io/



Spark was originally built on it (Akka).


Esa


From: Mahesh Sawaiker <mahesh_sawai...@persistent.com>
Sent: 21 June 2017 14:45
To: Esa Heikkinen; Jörn Franke
Cc: user@spark.apache.org
Subject: RE: Using Spark as a simulator


Spark can help you to create one large file if needed, but HDFS itself will 
provide an abstraction over such things, so it's a trivial problem if anything.

If you have Spark installed, then you can use spark-shell to try a few samples 
and build from there. If you can collect all the files in a folder, then Spark 
can read all the files from there. The programming guide below has enough 
information to get started.



https://spark.apache.org/docs/latest/programming-guide.html



All of Spark’s file-based input methods, including textFile, support running on 
directories, compressed files, and wildcards as well. For example, you can use 
textFile("/my/directory"), textFile("/my/directory/*.txt"), and 
textFile("/my/directory/*.gz").



After reading the file you can transform it using the map function, which will 
split each individual line and possibly create a Scala object. This way you 
will get an RDD of Scala objects, which you can then process with functional/set operators.



You would want to read about PairRDDs.
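
A small PySpark sketch of the read-map-pair flow described above, with a purely 
hypothetical line format:

from pyspark import SparkContext

sc = SparkContext(appName="log-read-sketch")

# Read all matching files in a directory; hypothetical line format: "timestamp,actor_id,event".
lines = sc.textFile("/my/directory/*.txt")

def parse(line):
    ts, actor, event = line.split(",", 2)
    return (actor, (ts, event))           # a (key, value) pair -> pair RDD

events = lines.map(parse)

# Pair-RDD operations, e.g. collect each actor's events and sort them by timestamp.
per_actor = events.groupByKey().mapValues(lambda evs: sorted(evs))
print(per_actor.take(1))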



From: Esa Heikkinen [mailto:esa.heikki...@student.tut.fi]
Sent: Wednesday, June 21, 2017 1:12 PM
To: Jörn Franke
Cc: user@spark.apache.org
Subject: RE: Using Spark as a simulator





Hi



Thanks for the answer.



I think my simulator includes a lot of parallel state machines, and each of them 
generates a log file (with timestamps). Finally, all events (rows) of all the 
log files should be combined in time order into (one) very huge log file. In 
practice the combined huge log file can also be split into smaller ones.



What transformation or action functions can I use in Spark for that purpose?

Or do there exist some code samples (Python or Scala) for that? (A minimal sketch of merging timestamped logs in time order is shown below.)
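
A minimal PySpark sketch of merging timestamped per-machine logs into one 
time-ordered output, under the assumption that all logs share a common CSV 
schema; the paths and column names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("merge-logs-sketch").getOrCreate()

# Hypothetical per-state-machine logs with a common schema: ts, machine_id, event
logs = (spark.read
        .option("header", True)
        .csv("sim/logs/*.csv")
        .withColumn("ts", F.to_timestamp("ts")))

# Sort all events globally by timestamp and write the result back out.
merged = logs.orderBy("ts")

# coalesce(1) produces a single huge file; without it Spark writes one part file per partition.
merged.coalesce(1).write.option("header", True).csv("sim/merged")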

Regards

Esa Heikkinen





From: Jörn Franke <jornfra...@gmail.com>
Sent: 20 June 2017 17:12
To: Esa Heikkinen
Cc: user@spark.apache.org
Subject: Re: Using Spark as a simulator



It is fine, but you have to design it so that the generated rows are written in 
large blocks for optimal performance.

The most tricky part with data generation is the conceptual part, such as 
probabilistic distribution etc

You have to check as well that you use a good random generator; for some cases 
the Java built-in one might not be that good.

On 20. Jun 2017, at 16:04, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:

Hi



Spark is a data analyzer, but would it be possible to use Spark as a data 
generator or simulator ?



My simulation can be very huge and I think a parallelized simulation using Spark (cloud) could work.

RE: Using Spark as a simulator

2017-07-07 Thread Esa Heikkinen

Would it be better to use Akka as a simulator rather than Spark?

http://akka.io/



Spark was originally built on it (Akka).


Esa


From: Mahesh Sawaiker <mahesh_sawai...@persistent.com>
Sent: 21 June 2017 14:45
To: Esa Heikkinen; Jörn Franke
Cc: user@spark.apache.org
Subject: RE: Using Spark as a simulator


Spark can help you to create one large file if needed, but HDFS itself will 
provide an abstraction over such things, so it's a trivial problem if anything.

If you have Spark installed, then you can use spark-shell to try a few samples 
and build from there. If you can collect all the files in a folder, then Spark 
can read all the files from there. The programming guide below has enough 
information to get started.



https://spark.apache.org/docs/latest/programming-guide.html



All of Spark’s file-based input methods, including textFile, support running on 
directories, compressed files, and wildcards as well. For example, you can use 
textFile("/my/directory"), textFile("/my/directory/*.txt"), and 
textFile("/my/directory/*.gz").



After reading the file you can transform it using the map function, which will 
split each individual line and possibly create a Scala object. This way you 
will get an RDD of Scala objects, which you can then process with functional/set operators.



You would want to read about PairRDDs.



From: Esa Heikkinen [mailto:esa.heikki...@student.tut.fi]
Sent: Wednesday, June 21, 2017 1:12 PM
To: Jörn Franke
Cc: user@spark.apache.org
Subject: RE: Using Spark as a simulator





Hi



Thanks for the answer.



I think my simulator includes a lot of parallel state machines, and each of them 
generates a log file (with timestamps). Finally, all events (rows) of all the 
log files should be combined in time order into (one) very huge log file. In 
practice the combined huge log file can also be split into smaller ones.



What transformation or action functions can I use in Spark for that purpose?

Or do there exist some code samples (Python or Scala) for that?

Regards

Esa Heikkinen





From: Jörn Franke <jornfra...@gmail.com>
Sent: 20 June 2017 17:12
To: Esa Heikkinen
Cc: user@spark.apache.org
Subject: Re: Using Spark as a simulator



It is fine, but you have to design it so that the generated rows are written in 
large blocks for optimal performance.

The most tricky part with data generation is the conceptual part, such as 
probabilistic distribution etc

You have to check as well that you use a good random generator; for some cases 
the Java built-in one might not be that good.

On 20. Jun 2017, at 16:04, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:

Hi



Spark is a data analyzer, but would it be possible to use Spark as a data 
generator or simulator ?



My simulation can be very huge and I think a parallelized simulation using 
Spark (cloud) could work.

Is that a good or bad idea?



Regards

Esa Heikkinen





RE: Using Spark as a simulator

2017-06-21 Thread Esa Heikkinen

Hi


Thanks for the answer.


I think my simulator includes a lot of parallel state machines, and each of them 
generates a log file (with timestamps). Finally, all events (rows) of all the 
log files should be combined in time order into (one) very huge log file. In 
practice the combined huge log file can also be split into smaller ones.


What transformation or action functions can I use in Spark for that purpose?

Or do there exist some code samples (Python or Scala) for that?

Regards

Esa Heikkinen


From: Jörn Franke <jornfra...@gmail.com>
Sent: 20 June 2017 17:12
To: Esa Heikkinen
Cc: user@spark.apache.org
Subject: Re: Using Spark as a simulator

It is fine, but you have to design it so that the generated rows are written in 
large blocks for optimal performance.
The most tricky part with data generation is the conceptual part, such as 
probabilistic distribution etc
You have to check as well that you use a good random generator; for some cases 
the Java built-in one might not be that good.

On 20. Jun 2017, at 16:04, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:


Hi


Spark is a data analyzer, but would it be possible to use Spark as a data 
generator or simulator ?


My simulation can be very huge and I think a parallelized simulation using 
Spark (cloud) could work.

Is that a good or bad idea?


Regards

Esa Heikkinen



Using Spark as a simulator

2017-06-20 Thread Esa Heikkinen
Hi


Spark is a data analyzer, but would it be possible to use Spark as a data 
generator or simulator?


My simulation can be very huge and I think a parallelized simulation using 
Spark (cloud) could work.

Is that a good or bad idea? (A minimal data-generation sketch is shown below.)
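
A minimal sketch of using Spark purely as a data generator, assuming a synthetic 
workload; the actor count, event schema and output path are hypothetical, and 
each task uses its own seeded random generator as discussed elsewhere in the thread:

import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-generator-sketch").getOrCreate()
sc = spark.sparkContext

N_ACTORS = 1000000      # hypothetical number of simulated state machines
EVENTS_EACH = 100

def simulate(actor_id):
    # Each actor generates its own timestamped events independently.
    rng = random.Random(actor_id)        # per-actor seed for reproducibility
    t = 0.0
    rows = []
    for _ in range(EVENTS_EACH):
        t += rng.expovariate(1.0)        # random inter-event time
        state = rng.choice(["IDLE", "RUN", "STOP"])
        rows.append((actor_id, round(t, 3), state))
    return rows

rows = sc.parallelize(range(N_ACTORS), 1000).flatMap(simulate)
df = spark.createDataFrame(rows, ["actor_id", "ts", "state"])

# Write in large blocks (a modest number of big partitions), as advised in the replies.
df.repartition(100).write.mode("overwrite").parquet("sim/output")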


Regards

Esa Heikkinen



Book recommendations

2017-01-11 Thread Esa Heikkinen

Hi

Does anyone know of good books about event processing in distributed 
event systems (like IoT systems)?


I have already read the book "The Power of Events" (Luckham 2002), but do 
there exist newer ones?


Best Regards
Esa Heikkinen





Simple "state machine" functionality using Scala or Python

2016-11-15 Thread Esa Heikkinen

Hello

Can anyone provide a simple example of how to implement "state machine" 
functionality using Scala or Python in Spark?


The sequence of the state machine would be like this (a rough sketch follows below):
1) Search for the first event of the log and its data
2) Based on the data of the first event, search for the second event of the log 
and its data

3) And so on...
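
A rough sketch of that sequence using PySpark DataFrames, where the result of 
the first search is pulled to the driver and used to parameterize the second 
search; the log path, column names and event names are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("state-machine-sketch").getOrCreate()

# Hypothetical log with columns: ts, event, key, value
log = spark.read.option("header", True).csv("logs/app.csv")

# State 1: find the first matching event and read its data on the driver.
first = (log.filter(F.col("event") == "REQUEST")
            .orderBy("ts")
            .first())                     # a Row, or None if nothing matched

if first is not None:
    # State 2: use data from the first event to search for the second one.
    second = (log.filter((F.col("event") == "RESPONSE") &
                         (F.col("key") == first["key"]) &
                         (F.col("ts") > first["ts"]))
                 .orderBy("ts")
                 .first())
    print(first, second)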

Regards
Esa Heikkinen




Re: Spark support for Complex Event Processing (CEP)

2016-05-09 Thread Esa Heikkinen


Sorry for the delay in answering. Yes, this is not pure "CEP", but it is quite 
close to it, with many similar "functionalities".


My case is not so easy, because I don't want to compare against the original 
time schedule of the route.


I want to compare how closely the (ITS) system has estimated the arrival time 
to the bus stop.


That means I have to read more logs (LOG C) (and do a little calculation) 
to determine the estimated arrival time.


And then it is checked how large the difference (error) is between the bus's 
real arrival time and the system's estimated arrival time.


In practice the situation can be a little more complex.

---
Esa Heikkinen

29.4.2016, 19:38, Michael Segel wrote:
If you’re getting the logs, then it really isn’t CEP unless you 
consider the event to be the log from the bus.

This doesn’t sound like there is a time constraint.

Your bus schedule is fairly fixed and changes infrequently.
Your bus stops are relatively fixed points. (Within a couple of meters)

So then you’re taking bus A who is scheduled to drive route 123 and 
you want to compare their nearest location to the bus stop at time T 
and see how close it is to the scheduled route.



Or am I missing something?

-Mike

On Apr 29, 2016, at 3:54 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:



Hi

I try to explain my case ..

The situation in my logs and solution is not so simple. There are also many 
types of logs and they come from many sources.

They are in CSV format and the header line includes the names of the columns.

This is a simplified description of the input logs.

LOG A's: bus coordinate logs (every bus has own log):
- timestamp
- bus number
- coordinates

LOG B: bus login/logout (to/from line) message log:
- timestamp
- bus number
- line number

LOG C:  log from central computers:
- timestamp
- bus number
- bus stop number
- estimated arrival time to bus stop

LOG A is updated every 30 seconds (I also have another system with a 1-second 
interval). LOG B is updated when a bus starts from the terminal bus stop and 
when it arrives at the final bus stop of a line. LOG C is updated when the 
central computer sends a new arrival time estimation to a bus stop.


I also need metadata for the logs (and the analyzer), for example the 
coordinates of the bus stop areas.


The main purpose of the analysis is to check the accuracy (error) of the 
estimated arrival times to bus stops.


Because there are many buses and lines, it is too time-consuming to check all 
of them, so I check only specific lines with specific bus stops. There are many 
buses (logged into lines) coming to one bus stop and I am interested in only a 
certain bus.


To do that, I have to read the logs partly out of time order (upstream), in 
this sequence (a rough sketch follows the list):

1. From LOG C, search for the bus number
2. From LOG A, search for when the bus has left the terminal bus stop
3. From LOG B, search for when the bus has sent a login to the line
4. From LOG A, search for when the bus has entered the bus stop
5. From LOG C, search for the last estimated arrival time to the bus stop and 
calculate the error between the real and estimated values
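
A rough PySpark sketch of steps 4-5 for a single bus and stop, assuming the logs 
are loaded as DataFrames and that pre-processing has already added an 
inside_stop_area flag from the coordinates and the stop-area metadata; all 
column names and values are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("arrival-error-sketch").getOrCreate()

log_a = spark.read.option("header", True).csv("logs/A/*.csv")   # ts, bus, coords, inside_stop_area
log_c = spark.read.option("header", True).csv("logs/C.csv")     # ts, bus, stop, est_arrival

bus, stop = "1234", "17"     # the bus and stop of interest (steps 1-3 not shown)

# Step 4 (simplified): the real arrival = first LOG A sample inside the stop area.
arrival = (log_a.filter((F.col("bus") == bus) & (F.col("inside_stop_area") == "1"))
                .agg(F.min("ts").alias("real_ts")))

# Step 5: the arrival estimate from the latest LOG C row for that bus and stop.
estimate = (log_c.filter((F.col("bus") == bus) & (F.col("stop") == stop))
                 .orderBy(F.col("ts").desc())
                 .limit(1)
                 .select(F.col("est_arrival").alias("est_ts")))

# Error in seconds = real arrival time minus the last estimated arrival time.
error = (arrival.crossJoin(estimate)
                .select((F.col("real_ts").cast("timestamp").cast("long")
                         - F.col("est_ts").cast("timestamp").cast("long")).alias("error_s")))
error.show()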


In my understanding, (almost) all log file analyzers read all the data (lines) 
from log files in time order. My need is only for a specific part of the log 
(lines). To achieve that, my solution is to read the logs in an arbitrary order 
(within a given time window).


I know this solution is not suitable for all cases (for example very fast 
analysis or very big data). It is suitable for very complex (targeted) analysis. 
It can be too slow and memory-consuming, but well-done pre-processing of the 
log data can help a lot.


---
Esa Heikkinen

28.4.2016, 14:44, Michael Segel wrote:

I don’t.

I believe that there have been a  couple of hack-a-thons like one 
done in Chicago a few years back using public transportation data.


The first question is what sort of data do you get from the city?

I mean it could be as simple as time_stamp, bus_id, route and GPS 
(x,y).   Or they could provide more information. Like last stop, 
distance to next stop, avg current velocity…


Then there is the frequency of the updates. Every second? Every 3 
seconds? 5 or 6 seconds…


This will determine how much work you have to do.

Maybe they provide the routes of the busses via a different API call 
since its relatively static.


This will drive your solution more than the underlying technology.

Oh and while I focused on buses, there are also rail and other modes of 
public transportation like light rail, trains, etc …


HTH

-Mike


On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:



Do you know any good examples of how to use Spark Streaming for tracking 
public transportation systems?


Or a Storm or some other tool example?

Regards
Esa Heikkinen

28.4.2016, 3:16, Michael Segel wrote:

Uhm…
I think you need to clarify a couple of things…

First there is this thing called analog signal processing…. Is 
that continuous enough for you?

Re: Spark support for Complex Event Processing (CEP)

2016-04-29 Thread Esa Heikkinen


Hi

I try to explain my case ..

The situation in my logs and solution is not so simple. There are also many 
types of logs and they come from many sources.

They are in CSV format and the header line includes the names of the columns.

This is a simplified description of the input logs.

LOG A's: bus coordinate logs (every bus has own log):
- timestamp
- bus number
- coordinates

LOG B: bus login/logout (to/from line) message log:
- timestamp
- bus number
- line number

LOG C:  log from central computers:
- timestamp
- bus number
- bus stop number
- estimated arrival time to bus stop

LOG A is updated every 30 seconds (I also have another system with a 1-second 
interval). LOG B is updated when a bus starts from the terminal bus stop and 
when it arrives at the final bus stop of a line. LOG C is updated when the 
central computer sends a new arrival time estimation to a bus stop.


I also need metadata for the logs (and the analyzer), for example the 
coordinates of the bus stop areas.


The main purpose of the analysis is to check the accuracy (error) of the 
estimated arrival times to bus stops.


Because there are many buses and lines, it is too time-consuming to check all 
of them, so I check only specific lines with specific bus stops. There are many 
buses (logged into lines) coming to one bus stop and I am interested in only a 
certain bus.


To do that, I have to read the logs partly out of time order (upstream), in 
this sequence:

1. From LOG C, search for the bus number
2. From LOG A, search for when the bus has left the terminal bus stop
3. From LOG B, search for when the bus has sent a login to the line
4. From LOG A, search for when the bus has entered the bus stop
5. From LOG C, search for the last estimated arrival time to the bus stop and 
calculate the error between the real and estimated values


In my understanding, (almost) all log file analyzers read all the data (lines) 
from log files in time order. My need is only for a specific part of the log 
(lines). To achieve that, my solution is to read the logs in an arbitrary order 
(within a given time window).


I know this solution is not suitable for all cases (for example very fast 
analysis or very big data). It is suitable for very complex (targeted) analysis. 
It can be too slow and memory-consuming, but well-done pre-processing of the 
log data can help a lot.


---
Esa Heikkinen

28.4.2016, 14:44, Michael Segel wrote:

I don’t.

I believe that there have been a  couple of hack-a-thons like one done 
in Chicago a few years back using public transportation data.


The first question is what sort of data do you get from the city?

I mean it could be as simple as time_stamp, bus_id, route and GPS 
(x,y).   Or they could provide more information. Like last stop, 
distance to next stop, avg current velocity…


Then there is the frequency of the updates. Every second? Every 3 
seconds? 5 or 6 seconds…


This will determine how much work you have to do.

Maybe they provide the routes of the busses via a different API call 
since its relatively static.


This will drive your solution more than the underlying technology.

Oh and while I focused on buses, there are also rail and other modes of 
public transportation like light rail, trains, etc …


HTH

-Mike


On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:



Do you know any good examples of how to use Spark Streaming for tracking 
public transportation systems?


Or a Storm or some other tool example?

Regards
Esa Heikkinen

28.4.2016, 3:16, Michael Segel wrote:

Uhm…
I think you need to clarify a couple of things…

First there is this thing called analog signal processing…. Is that 
continuous enough for you?


But more to the point, Spark Streaming does micro batching so if 
you’re processing a continuous stream of tick data, you will have 
more than 50K of tics per second while there are markets open and 
trading.  Even at 50K a second, that would mean 1 every .02 ms or 50 
ticks a ms.


And you don’t want to wait until you have a batch to start 
processing, but you want to process when the data hits the queue and 
pull it from the queue as quickly as possible.


Spark streaming will be able to pull batches in as little as 500ms. 
So if you pull a batch at t0 and immediately have a tick in your 
queue, you won’t process that data until t0+500ms. And said batch 
would contain 25,000 entries.


Depending on what you are doing… that 500ms delay can be enough to 
be fatal to your trading process.


If you don’t like stock data, there are other examples mainly when 
pulling data from real time embedded systems.



If you go back and read what I said, if your data flow is >> (much 
slower) than 500ms, and / or the time to process is >> 500ms ( much 
longer )  you could use spark streaming.  If not… and there are 
applications which require that type of speed…  then you shouldn’t 
use spark streaming.


If you do have that constraint, then you can look at systems like 
storm/flink/samza / whatever where you have a continuous queue and listener and no micro batch delays.

Classification or grouping of analyzing tools

2016-04-28 Thread Esa Heikkinen


Hi

I am a newbie in this analysis field. It seems there exist many 
tools, frameworks, ecosystems, software packages, languages and so on.


1) Do there exist some classifications or groupings for them?
2) What kinds of tools exist?
3) What are the main purposes of the tools?

Regards
Esa Heikkinen




Re: Spark support for Complex Event Processing (CEP)

2016-04-28 Thread Esa Heikkinen


Do you know any good examples of how to use Spark Streaming for tracking 
public transportation systems?


Or a Storm or some other tool example?

Regards
Esa Heikkinen

28.4.2016, 3:16, Michael Segel wrote:

Uhm…
I think you need to clarify a couple of things…

First there is this thing called analog signal processing…. Is that 
continuous enough for you?


But more to the point, Spark Streaming does micro batching so if 
you’re processing a continuous stream of tick data, you will have more 
than 50K of tics per second while there are markets open and trading. 
 Even at 50K a second, that would mean 1 every .02 ms or 50 ticks a ms.


And you don’t want to wait until you have a batch to start processing, 
but you want to process when the data hits the queue and pull it from 
the queue as quickly as possible.


Spark streaming will be able to pull batches in as little as 500ms. So 
if you pull a batch at t0 and immediately have a tick in your queue, 
you won’t process that data until t0+500ms. And said batch would 
contain 25,000 entries.


Depending on what you are doing… that 500ms delay can be enough to be 
fatal to your trading process.


If you don’t like stock data, there are other examples mainly when 
pulling data from real time embedded systems.



If you go back and read what I said, if your data flow is >> (much 
slower) than 500ms, and / or the time to process is >> 500ms ( much 
longer )  you could use spark streaming.  If not… and there are 
applications which require that type of speed…  then you shouldn’t use 
spark streaming.


If you do have that constraint, then you can look at systems like 
storm/flink/samza / whatever where you have a continuous queue and 
listener and no micro batch delays.
Then for each bolt (storm) you can have a spark context for processing 
the data. (Depending on what sort of processing you want to do.)


To put this in perspective… if you’re using spark streaming / akka / 
storm /etc to handle real time requests from the web, 500ms added 
delay can be a long time.


Choose the right tool.

For the OP’s problem. Sure Tracking public transportation could be 
done using spark streaming. It could also be done using half a dozen 
other tools because the rate of data generation is much slower than 
500ms.


HTH


On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:


couple of things.

There is no such thing as Continuous Data Streaming as there is no 
such thing as Continuous Availability.


There is such a thing as Discrete Data Streaming and High 
Availability, but they only reduce the finite unavailability to a minimum. 
In terms of business needs, 5 SIGMA is good enough and acceptable. 
Even candles set to a predefined time interval, say 2, 4 or 15 
seconds, overlap. No FX-savvy trader makes a sell or buy decision on 
the basis of a 2-second candlestick.


The calculation itself in measurements is subject to finite error as 
defined by their Confidence Level (CL) using Standard Deviation 
function.


OK, so far I have never noticed a tool that requires that level of 
granularity. That stuff from Flink etc. is, in practical terms, of 
little value and does not make commercial sense.


Now with regard to your needs, Spark micro batching is perfectly 
adequate.


HTH

Dr Mich Talebzadeh

LinkedIn 
/https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/


http://talebzadehmich.wordpress.com 
<http://talebzadehmich.wordpress.com/>



On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:



Hi

Thanks for the answer.

I have developed a log file analyzer for the RTPIS (Real Time
Passenger Information System) system, where buses drive lines and
the system tries to estimate the arrival times to the bus stops.
There are many different log files (and events) and the analysis
situation can be very complex. Spatial data can also be included
in the log data.

The analyzer also has a query (or analyzing) language, which
describes an expected behavior. This can be a requirement of the
system. The analyzer can also be thought of as a test oracle.

I have published a paper (SPLST'15 conference) about my analyzer
and its language. The paper is maybe too technical, but it can be found at:
http://ceur-ws.org/Vol-1525/paper-19.pdf

I do not know yet where it belongs. I think it can be some kind of
"CEP with delays". Or do you know better?
My analyzer can also do somewhat more complex and time-consuming
analyses; there is no need for real time.

And is it possible to do "CEP with delays" reasonably with some
existing analyzer (for example Spark)?

Regards
PhD student at Tampere University of Technology, Finland,
www.tut.fi <http://www.tut.fi/>
Esa Heikkinen

27.4.2016, 15:51, Michael Segel wrote:


Re: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Esa Heikkinen


Hi

Thanks for the answer.

I have developed a log file analyzer for the RTPIS (Real Time Passenger 
Information System) system, where buses drive lines and the system tries 
to estimate the arrival times to the bus stops. There are many different 
log files (and events) and the analysis situation can be very complex. 
Spatial data can also be included in the log data.


The analyzer also has a query (or analyzing) language, which describes an 
expected behavior. This can be a requirement of the system. The analyzer 
can also be thought of as a test oracle.


I have published a paper (SPLST'15 conference) about my analyzer and its 
language. The paper is maybe too technical, but it can be found at:

http://ceur-ws.org/Vol-1525/paper-19.pdf

I do not know yet where it belongs. I think it can be some kind of "CEP with 
delays". Or do you know better?
My analyzer can also do somewhat more complex and time-consuming 
analyses; there is no need for real time.


And is it possible to do "CEP with delays" reasonably with some existing 
analyzer (for example Spark)?


Regards
PhD student at Tampere University of Technology, Finland, www.tut.fi 
<http://www.tut.fi/>

Esa Heikkinen

27.4.2016, 15:51, Michael Segel wrote:

Spark and CEP? It depends…

Ok, I know that’s not the answer you want to hear, but it’s a bit more 
complicated…


If you consider Spark Streaming, you have some issues.
Spark Streaming isn’t a Real Time solution because it is a micro batch 
solution. The smallest Window is 500ms.  This means that if your 
compute time is >> 500ms and/or  your event flow is >> 500ms this 
could work.
(e.g. 'real time' image processing on a system that is capturing 60FPS 
because the processing time is >> 500ms. )


So Spark Streaming wouldn’t be the best solution….

However, you can combine spark with other technologies like Storm, 
Akka, etc .. where you have continuous streaming.

So you could instantiate a spark context per worker in storm…

I think if there are no class collisions between Akka and Spark, you 
could use Akka, which may have a better potential for communication 
between workers.

So here you can handle CEP events.

HTH

On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:


please see my other reply

Dr Mich Talebzadeh

LinkedIn 
/https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw/


http://talebzadehmich.wordpress.com 
<http://talebzadehmich.wordpress.com/>



On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:


Hi

I have followed with interest the discussion about CEP and Spark.
It is quite close to my research, which is complex analysis of
log files and "history" data (not actually real-time streams).

I have a few questions:

1) Is CEP only for (real-time) stream data and not for "history" data?

2) Is it possible to search "backward" (upstream) with CEP within a
given time window, if the start time of the time window is earlier
than the current stream time?

3) Do you know any good tools or software for "CEP" on log data?

4) Do you know any good (scientific) papers I should read about CEP?


    Regards
    PhD student at Tampere University of Technology, Finland,
www.tut.fi <http://www.tut.fi/>
Esa Heikkinen





The opinions expressed here are mine, while they may reflect a 
cognitive thought, that is purely accidental.

Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com









Re: Spark support for Complex Event Processing (CEP)

2016-04-27 Thread Esa Heikkinen

Hi

I have followed with interest the discussion about CEP and Spark. It is 
quite close to my research, which is complex analysis of log files 
and "history" data (not actually real-time streams).


I have a few questions:

1) Is CEP only for (real-time) stream data and not for "history" data?

2) Is it possible to search "backward" (upstream) with CEP within a given time 
window, if the start time of the time window is earlier than the current 
stream time?


3) Do you know any good tools or software for "CEP" on log data?

4) Do you know any good (scientific) papers I should read about CEP?


Regards
PhD student at Tampere University of Technology, Finland, www.tut.fi
Esa Heikkinen




How to implement state machine functionality in Apache Spark with Python

2015-12-21 Thread Esa Heikkinen


Hi

I am a newbie with Apache Spark and I would like to find good example Python 
code for how to implement (finite) state machine functionality in Spark. I am 
trying to read many different log files to find certain events in a specific 
order. Is this possible or even impossible? Or is that only a "problem" with 
Python? (A rough sketch of one possible approach is shown below.)
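
A rough sketch of one way to do this in PySpark: read all the log files, group 
the events by a correlating key, and run a small finite state machine written 
in plain Python over each group's time-sorted events. The line format, 
separator and event names are hypothetical:

from pyspark import SparkContext

sc = SparkContext(appName="fsm-sketch")

# Hypothetical line format in every log file: "timestamp;session_id;event"
lines = sc.textFile("logs/*.log")

def parse(line):
    ts, session, event = line.split(";", 2)
    return (session, (ts, event))

def run_fsm(events):
    # A tiny finite state machine looking for the order START -> WORK -> DONE.
    state = "INIT"
    for ts, event in sorted(events):          # sort each session's events by timestamp
        if state == "INIT" and event == "START":
            state = "STARTED"
        elif state == "STARTED" and event == "WORK":
            state = "WORKING"
        elif state == "WORKING" and event == "DONE":
            return True                        # the full sequence was seen
    return False

matched = (lines.map(parse)
                .groupByKey()                  # all events of one session together
                .mapValues(run_fsm)
                .filter(lambda kv: kv[1]))
print(matched.keys().take(10))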


Regards
Esa

