Spark CEP
Hi, I would like to know more about Spark CEP (Complex Event Processing). Do there exist some simple (but also complex) examples with input data (log files)? Is Spark CEP based on Siddhi? If yes, is it better to use Siddhi directly? I know CEP engines are intended for streaming data, but is it a problem to use them on batch data? BR Esa
Execution model in Spark
Hi, I don't know whether this question is suitable for this forum, but I will take the risk and ask :) In my understanding the execution model in Spark is very data (flow) stream oriented and specific. Is it difficult to build control-flow logic (like a state machine) outside of the stream-specific processing? Is the only way to combine all the different types of event streams into one big stream and then process it with some stateful "logic" of one's own? And how would one build this logic? Best Regards, Esa
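One possible direction, sketched below with the DStream API: Spark can keep per-key state between micro-batches with updateStateByKey, so a simple state machine can live inside the streaming job itself, and different event streams can first be combined with union. This is only a minimal sketch under assumptions: the socket source, the "key,event" line format and the transition table are all made up for illustration.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "state-machine-demo")
ssc = StreamingContext(sc, 5)
ssc.checkpoint("/tmp/state-machine-checkpoint")  # required for stateful operations

# Hypothetical source: events arrive as "key,event" lines on a socket.
lines = ssc.socketTextStream("localhost", 9999)
events = lines.map(lambda line: tuple(line.split(",", 1)))  # (key, event)

# A tiny per-key state machine: the state advances when the expected event arrives.
TRANSITIONS = {("START", "login"): "LOGGED_IN", ("LOGGED_IN", "purchase"): "DONE"}

def update_state(new_events, state):
    state = state or "START"
    for ev in new_events:
        state = TRANSITIONS.get((state, ev), state)
    return state

states = events.updateStateByKey(update_state)
states.pprint()

ssc.start()
ssc.awaitTermination()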
RE: How Spark can solve this example
Hello That is good to hear, but do there exist some good practical (Python or Scala) examples? This would help a lot. I tried to do this with Apache Flink (and its CEP library) and it was not such a piece of cake. Best, Esa From: Matteo Cossu <elco...@gmail.com> Sent: Friday, May 18, 2018 10:51 AM To: Esa Heikkinen <esa.heikki...@student.tut.fi> Cc: user@spark.apache.org Subject: Re: How Spark can solve this example Hello Esa, all the steps that you described can be performed with Spark. I don't know about CEP, but Spark Streaming should be enough. Best, Matteo On 18 May 2018 at 09:20, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote: Hi I have attached a fictive example (PDF file) about processing event traces from data streams (or batch data). I hope the picture in the attachment is clear and understandable. I would be very interested in how best to solve it with Spark, or whether it is possible at all. If it is possible, can it be solved for example by CEP? A little explanation: the data processing reads three different parallel streams (or batches): A, B and C. Each of them has events (records) which have different keys with values (like K1-K4). I want to find all event traces which have certain dependences or patterns between the streams (or batches). To find the pattern there are three steps: 1) Search stream A for an event whose K1 has the value "X"; if it is found, store it in global data for later use and continue to the next step. 2) Search stream B for an event whose K2 has the value A(K1); if it is found, store it in global data for later use and continue to the next step. 3) Search stream C for an event whose K1 has the value A(K1) and whose K2 has the value B(K3); if it is found, continue (back to step 1). If that is not possible with Spark, do you have any idea of tools which can solve this? Best, Esa
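A minimal PySpark sketch of those three steps as a batch job follows. It is only an illustration under assumptions: the three streams are assumed to be available as DataFrames A, B and C, the inline rows and the column names K1-K3 are made up, and a real trace search would also constrain the joins by timestamps (step 2 after step 1, and so on).

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("event-trace-search").getOrCreate()

# Hypothetical batch versions of the three streams A, B and C.
A = spark.createDataFrame([("X", "a1"), ("Y", "a2")], ["K1", "K2"])
B = spark.createDataFrame([("b1", "X", "Z"), ("b2", "Q", "W")], ["K1", "K2", "K3"])
C = spark.createDataFrame([("X", "Z"), ("X", "W")], ["K1", "K2"])

# Step 1: events in A whose K1 has the value "X".
a = A.filter(F.col("K1") == "X").alias("a")

# Step 2: events in B whose K2 equals A(K1).
ab = a.join(B.alias("b"), F.col("b.K2") == F.col("a.K1"))

# Step 3: events in C whose K1 equals A(K1) and whose K2 equals B(K3).
traces = ab.join(
    C.alias("c"),
    (F.col("c.K1") == F.col("a.K1")) & (F.col("c.K2") == F.col("b.K3")))

traces.select(F.col("a.K1").alias("a_K1"),
              F.col("b.K3").alias("b_K3"),
              F.col("c.K2").alias("c_K2")).show()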
Spark CEP with files and no streams ?
Hello I am trying to use CEP in Spark on log files (as a batch job), not on streams (in real time). Is that possible? If yes, do you know of example Scala code for that? Or should I convert the log files (with timestamps) into streams? But how should timestamps be handled in Spark? If I cannot use Spark at all for this purpose, do you have any recommendations for other tools? I would like CEP-type analysis for log files.
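As a rough illustration of the batch direction, here is a minimal PySpark sketch; the file path, the column name "ts" and the timestamp format are assumptions, not anything confirmed in the thread. With the timestamps parsed into a proper timestamp column, the log can be filtered to a time window and sorted, which is usually the starting point for batch-style analysis of event sequences.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("batch-log-analysis").getOrCreate()

# Assumed layout: CSV log files with a header line and a "ts" column.
logs = (spark.read
        .option("header", "true")
        .csv("/path/to/logs/*.csv")
        .withColumn("ts", F.to_timestamp("ts", "yyyy-MM-dd HH:mm:ss")))

# Restrict to a time window and order by time.
window = (logs
          .filter((F.col("ts") >= "2018-01-01 00:00:00") &
                  (F.col("ts") < "2018-01-02 00:00:00"))
          .orderBy("ts"))
window.show()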
Best active groups, forums or contacts for Spark ?
Hi It is very often difficult to get answers to questions about Spark in the various forums. Maybe they are inactive or my questions are too bad, I don't know. Does anyone know of good, active groups, forums or contacts other than this one? Esa Heikkinen
Spark and CEP type examples
Hi I am looking for simple examples using CEP (Complex Event Processing) with Scala and Python. Does anyone know good ones? I do not need the preprocessing part (like in Kafka), only the analyzing phase of CEP inside Spark. I am also interested in other ways to search for sequential event patterns in (time-series) log data using Spark. These event patterns can be described by DAGs (directed acyclic graphs), so they are not a simple chain of events. Sometimes it can also be better to search for events in backward or reverse order (not always in forward order as in the streaming case). But perhaps this is impossible with Spark? Esa Heikkinen
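One way to look for simple sequential patterns in batch log data is with window functions; below is a minimal PySpark sketch on made-up data, where the key, the "depart"/"arrive" events and the two-step pattern are only illustrative assumptions. Reverse-order search is essentially the same query with lag() instead of lead(), or with the window ordered descending.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("sequence-search").getOrCreate()

# Hypothetical event log: timestamp, key and event name.
events = spark.createDataFrame(
    [("2018-05-01 08:00:00", "bus1", "depart"),
     ("2018-05-01 08:03:10", "bus1", "arrive"),
     ("2018-05-01 08:00:20", "bus2", "depart")],
    ["ts", "key", "event"],
).withColumn("ts", F.to_timestamp("ts"))

# For every event, look at the next event of the same key in time order.
w = Window.partitionBy("key").orderBy("ts")
pairs = (events
         .withColumn("next_event", F.lead("event").over(w))
         .withColumn("next_ts", F.lead("ts").over(w)))

# A two-step pattern: "depart" directly followed by "arrive" for the same key.
pairs.filter((F.col("event") == "depart") &
             (F.col("next_event") == "arrive")).show()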
Pyspark and searching items from data structures
Hi I would like to build a PySpark application which searches for sequential items or events of a time series in CSV files. What are the best data structures for this purpose: a PySpark or pandas DataFrame, an RDD, SQL, or something else? --- Esa
Spark and neural networks
Hi What would be the best way to use Spark with neural networks (especially RNN/LSTM)? I think it would be possible with the "tool" combination: PySpark + Anaconda + pandas + NumPy + Keras + TensorFlow + scikit-learn. But what about scalability and usability with Spark (PySpark)? How compatible are the data structures (for example DataFrames) between Spark and the other "tools", and is it possible to convert them between the different "tools"? --- Esa
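On the conversion question, the usual hand-off from Spark to the Python tools goes through toPandas() and NumPy. A minimal sketch on a made-up feature table (the column names are assumptions), keeping in mind that toPandas() collects everything to the driver and therefore only fits when the (sampled or aggregated) data fits in driver memory:

from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.appName("spark-to-numpy").getOrCreate()

# Hypothetical feature table prepared with Spark.
df = spark.createDataFrame([(0.1, 0.2, 0), (0.3, 0.4, 1)], ["f1", "f2", "label"])

# Spark DataFrame -> pandas DataFrame -> NumPy arrays (typical Keras/TensorFlow input).
pdf = df.toPandas()
X = pdf[["f1", "f2"]].values.astype(np.float32)
y = pdf["label"].values

print(X.shape, y.shape)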
Windows10 + pyspark + ipython + csv file loading with timestamps
Hi Does anyone have hints or example code for getting the combination Windows 10 + PySpark + IPython notebook + CSV file loading with timestamps (time-series data) into a DataFrame or RDD to work? I have already installed Windows 10 + PySpark + the IPython notebook and they seem to work, but my Python code in the notebook does not, perhaps because the "Spark context" is not set up? What commands should be put at the beginning of the notebook? sc = SparkContext.getOrCreate()? spark = SparkSession(sc)? I have installed spark-2.2.1-bin-hadoop2.7 and IPython 6.1.0 on Windows 10. Esa
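A minimal notebook setup for local PySpark on Windows is sketched below. The findspark package (pip install findspark) is one common way to make a manually unpacked Spark distribution visible to a plain Jupyter/IPython kernel; it is an assumption here, as are the install path, the CSV file location and the column names.

# Make the local Spark installation importable from the notebook kernel.
import findspark
findspark.init()  # or findspark.init("C:/spark/spark-2.2.1-bin-hadoop2.7")

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType

spark = (SparkSession.builder
         .master("local[*]")
         .appName("csv-timeseries")
         .getOrCreate())

# Assumed CSV layout: a header line, then "ts,value" rows.
schema = StructType([
    StructField("ts", TimestampType()),
    StructField("value", DoubleType()),
])

df = (spark.read
      .option("header", "true")
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
      .schema(schema)
      .csv("C:/data/timeseries.csv"))

df.printSchema()
df.orderBy("ts").show(5)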
RE: RE: Using Spark as a simulator
I only want to simulate a very huge "network" with up to millions of parallel, time-synchronized actors (state machines). There is also communication between the actors via some (key-value pair) database. I also want the simulation to work in real time. I don't know what the best framework or tool for that kind of simulation would be. I think Akka would be the best and easiest to deploy? Or do you know better frameworks or tools? Nowadays Spark uses Netty instead of Akka? From: Jörn Franke <jornfra...@gmail.com> Sent: 7 July 2017 10:04 To: Esa Heikkinen Cc: Mahesh Sawaiker; user@spark.apache.org Subject: Re: Using Spark as a simulator Spark dropped Akka some time ago... I think the main issue he will face is a library for simulating the state machines (randomly), storing a huge amount of files (HDFS is probably the way to go if you want it fast) and distributing the work (here you can select different options). Are you trying to have some mathematical guarantees on the state machines, such as Markov chains/steady state?
RE: Using Spark as a simulator
Would it be better to use Akka as a simulator rather than Spark? http://akka.io/ (Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient applications.) Spark was originally built on it (Akka). Esa From: Mahesh Sawaiker <mahesh_sawai...@persistent.com> Sent: 21 June 2017 14:45 To: Esa Heikkinen; Jörn Franke Cc: user@spark.apache.org Subject: RE: Using Spark as a simulator Spark can help you to create one large file if needed, but HDFS itself provides an abstraction over such things, so it is a trivial problem if anything. If you have Spark installed, then you can use spark-shell to try a few samples and build from there. If you can collect all the files in a folder, then Spark can read all the files from there. The programming guide below has enough information to get started. https://spark.apache.org/docs/latest/programming-guide.html All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz"). After reading a file you can map it using the map function, which will split the individual lines and possibly create Scala objects. This way you get an RDD of Scala objects, which you can then process with functional/set operators. You would want to read about PairRDDs.
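To make the quoted advice concrete, a small RDD sketch in PySpark (the directory, the wildcard and the "timestamp,actor,event" line format are assumptions):

from pyspark import SparkContext

sc = SparkContext("local[*]", "read-log-dir")

# textFile accepts directories and wildcards, so all log files in a folder
# can be read at once.
lines = sc.textFile("/path/to/logs/*.log")
records = lines.map(lambda line: tuple(line.split(",")))

# (key, value) pairs enable the PairRDD operations mentioned above.
by_actor = records.map(lambda r: (r[1], r))
print(by_actor.take(3))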
RE: Using Spark as a simulator
Hi Thanks for the answer. I think my simulator would include a lot of parallel state machines, each of which generates a log file (with timestamps). Finally, all events (rows) of all the log files should be combined in time order into one very huge log file. In practice the combined huge log file can also be split into smaller ones. What transformation or action functions can I use in Spark for that purpose? Or do there exist some code samples (Python or Scala) about that? Regards Esa Heikkinen From: Jörn Franke <jornfra...@gmail.com> Sent: 20 June 2017 17:12 To: Esa Heikkinen Cc: user@spark.apache.org Subject: Re: Using Spark as a simulator It is fine, but you have to design it so that the generated rows are written in large blocks for optimal performance. The most tricky part with data generation is the conceptual part, such as the probabilistic distributions etc. You also have to check that you use a good random generator; for some cases the Java internal one might not be that good.
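For the combining step, a minimal PySpark sketch (the file paths, the "ts" column and the CSV layout are assumptions). Sorting inside a single partition gives one globally time-ordered output file; dropping the coalesce keeps the result as several sorted parts instead, at the cost of not having one single file.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("merge-logs").getOrCreate()

# Assumption: every state machine wrote CSV files with a header and a "ts" column.
logs = (spark.read
        .option("header", "true")
        .csv("/path/to/actor-logs/*.csv")
        .withColumn("ts", F.to_timestamp("ts")))

# One globally time-ordered output file (no parallelism in the final stage).
(logs.coalesce(1)
     .sortWithinPartitions("ts")
     .write.option("header", "true")
     .csv("/path/to/combined-log"))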
Using Spark as a simulator
Hi Spark is a data analyzer, but would it be possible to use Spark as a data generator or simulator? My simulation can be very huge, and I think a parallelized simulation using Spark (in the cloud) could work. Is that a good or bad idea? Regards Esa Heikkinen
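Generating the rows themselves in parallel is straightforward; below is a minimal sketch where the row count, the columns and the output path are made up, and anything more elaborate (for example per-actor state machines) would go into a UDF or mapPartitions.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("data-generator").getOrCreate()

# Ten million synthetic rows: an id, an "actor" derived from it, and random values.
n = 10000000
rows = (spark.range(n)
        .withColumn("actor", F.col("id") % 1000)
        .withColumn("value", F.rand(seed=42))
        .withColumn("noise", F.randn(seed=7)))

rows.write.parquet("/path/to/generated")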
Book recommendations
Hi Does anyone know of good books about event processing in distributed event systems (like IoT systems)? I have already read the book "The Power of Events" (Luckham 2002), but do there exist newer ones? Best Regards Esa Heikkinen
Simple "state machine" functionality using Scala or Python
Hello Can anyone provide a simple example of how to implement "state machine" functionality using Scala or Python in Spark? The sequence of the state machine would be like this: 1) Search for the first event in the log and read its data 2) Based on the data of the first event, search for the second event in the log and read its data 3) And so on. Regards Esa Heikkinen
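A very small PySpark sketch of that sequence follows; the event names, the key column and the inline data are made-up assumptions. Step 1 pulls the first matching event to the driver, and step 2 uses its data to narrow the next search.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("two-step-search").getOrCreate()

# Hypothetical log: timestamp, event type and a key carried between steps.
log = spark.createDataFrame(
    [("2018-05-01 08:00:00", "login", "user42"),
     ("2018-05-01 08:00:07", "click", "user42"),
     ("2018-05-01 08:00:09", "login", "user7")],
    ["ts", "event", "key"],
).withColumn("ts", F.to_timestamp("ts"))

# Step 1: the first event and its data.
first = log.filter(F.col("event") == "login").orderBy("ts").first()

# Step 2: based on the data of the first event, the next event of the sequence.
if first is not None:
    second = (log.filter((F.col("event") == "click") &
                         (F.col("key") == first["key"]) &
                         (F.col("ts") > first["ts"]))
                 .orderBy("ts")
                 .first())
    print(first, second)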
Re: Spark support for Complex Event Processing (CEP)
Sorry for the delay in answering. Yes, this is not pure "CEP", but it is quite close to it, or it has many similar "functionalities". My case is not so easy, because I don't want to compare against the original time schedule of the route. I want to compare how closely the (ITS) system has estimated the arrival time to the bus stop. That means I have to read more logs (LOG C) (and do a little calculation) to determine the estimated arrival time. Then it is checked how big the difference (error) is between the bus's real arrival time and the system's estimated arrival time. In practice the situation can be a little more complex. --- Esa Heikkinen On 29 Apr 2016 at 19:38, Michael Segel wrote: If you're getting the logs, then it really isn't CEP unless you consider the event to be the log from the bus. This doesn't sound like there is a time constraint. Your bus schedule is fairly fixed and changes infrequently. Your bus stops are relatively fixed points (within a couple of meters). So then you're taking bus A, which is scheduled to drive route 123, and you want to compare its nearest location to the bus stop at time T and see how close it is to the scheduled route. Or am I missing something? -Mike
Re: Spark support for Complex Event Processing (CEP)
Hi I will try to explain my case. The situation in my logs and solution is not so simple. There are also many types of logs and they come from many sources. They are in CSV format and the header line includes the names of the columns. This is a simplified description of the input logs. LOG A's: bus coordinate logs (every bus has its own log): - timestamp - bus number - coordinates. LOG B: bus login/logout (to/from line) message log: - timestamp - bus number - line number. LOG C: log from the central computers: - timestamp - bus number - bus stop number - estimated arrival time to the bus stop. LOG A is updated every 30 seconds (I also have another system with a 1-second interval). LOG B is updated when a bus starts from the terminal bus stop and arrives at the final bus stop of a line. LOG C is updated when the central computer sends a new arrival time estimation to a bus stop. I also need metadata for the logs (and the analyzer), for example the coordinates of the bus stop areas. The main purpose of the analysis is to check the accuracy (error) of the estimated arrival times to the bus stops. Because there are many buses and lines, it is too time-consuming to check all of them, so I check only specific lines with specific bus stops. There are many buses (logged in to lines) coming to one bus stop and I am interested in only a certain bus. To do that, I have to read the logs partly out of time order (upstream) in this sequence: 1. From LOG C, search for the bus number 2. From LOG A, search for when the bus has left the terminal bus stop 3. From LOG B, search for when the bus has sent a login to the line 4. From LOG A, search for when the bus has arrived at the bus stop 5. From LOG C, search for the last estimated arrival time to the bus stop and calculate the error between the real and estimated values. In my understanding (almost) all log file analyzers read all data (lines) in time order from the log files. My need is only for a specific part of the log (lines). To achieve that, my solution is to read the logs in an arbitrary order (with a given time window). I know this solution is not suitable for all cases (for example very fast analysis of very big data). This solution is suitable for very complex (targeted) analysis. It can be too slow and memory-consuming, but well-done pre-processing of the log data can help a lot. --- Esa Heikkinen On 28 Apr 2016 at 14:44, Michael Segel wrote: I don't. I believe that there have been a couple of hack-a-thons, like one done in Chicago a few years back using public transportation data. The first question is what sort of data you get from the city. I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y), or they could provide more information, like last stop, distance to next stop, average current velocity… Then there is the frequency of the updates. Every second? Every 3 seconds? 5 or 6 seconds… This will determine how much work you have to do. Maybe they provide the routes of the buses via a different API call, since that is relatively static. This will drive your solution more than the underlying technology. Oh, and while I focused on buses, there are also rail and other modes of public transportation like light rail, trains, etc. HTH -Mike
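For step 5 of that sequence, the error calculation itself can be sketched as a batch job. The sketch below is only an illustration under heavy assumptions: the real arrival times are assumed to be already derived from LOG A (detecting when the coordinates enter the stop area is not shown), the inline rows are made up, and the column names are invented.

from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("arrival-error").getOrCreate()

# Assumed: real arrival times per (bus, stop), e.g. pre-computed from LOG A.
actual = spark.createDataFrame(
    [("bus1", "stop7", "2016-04-27 10:05:30")],
    ["bus", "stop", "actual_ts"],
).withColumn("actual_ts", F.to_timestamp("actual_ts"))

# Assumed: LOG C rows with the time the estimate was sent and the estimate itself.
estimates = spark.createDataFrame(
    [("bus1", "stop7", "2016-04-27 10:00:00", "2016-04-27 10:04:00"),
     ("bus1", "stop7", "2016-04-27 10:03:00", "2016-04-27 10:06:00")],
    ["bus", "stop", "sent_ts", "est_ts"],
).withColumn("sent_ts", F.to_timestamp("sent_ts")) \
 .withColumn("est_ts", F.to_timestamp("est_ts"))

# Keep, per (bus, stop), the last estimate sent before the real arrival.
joined = (estimates.join(actual, ["bus", "stop"])
          .filter(F.col("sent_ts") <= F.col("actual_ts")))
w = Window.partitionBy("bus", "stop").orderBy(F.col("sent_ts").desc())
last_est = joined.withColumn("rn", F.row_number().over(w)).filter("rn = 1")

# Error in seconds between the estimated and the real arrival time.
(last_est
 .withColumn("error_s", F.unix_timestamp("est_ts") - F.unix_timestamp("actual_ts"))
 .select("bus", "stop", "est_ts", "actual_ts", "error_s")
 .show())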
Classification or grouping of analyzing tools
Hi I am a newbie in this analysis field. It seems there exist many tools, frameworks, ecosystems, software packages, languages and so on. 1) Do there exist some classifications or groupings for them? 2) What kinds of tools exist? 3) What are the main purposes of the tools? Regards Esa Heikkinen
Re: Spark support for Complex Event Processing (CEP)
Do you know any good examples of how to use Spark Streaming for tracking public transportation systems? Or a Storm or some other tool example? Regards Esa Heikkinen On 28 Apr 2016 at 3:16, Michael Segel wrote: Uhm… I think you need to clarify a couple of things… First there is this thing called analog signal processing…. Is that continuous enough for you? But more to the point, Spark Streaming does micro-batching, so if you're processing a continuous stream of tick data, you will have more than 50K ticks per second while the markets are open and trading. Even at 50K a second, that would mean 1 every 0.02 ms, or 50 ticks a ms. And you don't want to wait until you have a batch to start processing; you want to process when the data hits the queue and pull it from the queue as quickly as possible. Spark Streaming will be able to pull batches in as little as 500 ms. So if you pull a batch at t0 and immediately have a tick in your queue, you won't process that data until t0+500 ms, and said batch would contain 25,000 entries. Depending on what you are doing… that 500 ms delay can be enough to be fatal to your trading process. If you don't like stock data, there are other examples, mainly when pulling data from real-time embedded systems. If you go back and read what I said: if your data flow is >> (much slower) than 500 ms, and/or the time to process is >> 500 ms (much longer), you could use Spark Streaming. If not… and there are applications which require that type of speed… then you shouldn't use Spark Streaming. If you do have that constraint, then you can look at systems like Storm/Flink/Samza/whatever, where you have a continuous queue and listener and no micro-batch delays. Then for each bolt (Storm) you can have a Spark context for processing the data (depending on what sort of processing you want to do). To put this in perspective… if you're using Spark Streaming / Akka / Storm etc. to handle real-time requests from the web, 500 ms of added delay can be a long time. Choose the right tool. For the OP's problem: sure, tracking public transportation could be done using Spark Streaming. It could also be done using half a dozen other tools, because the rate of data generation is much slower than 500 ms. HTH On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: A couple of things. There is no such thing as Continuous Data Streaming, just as there is no such thing as Continuous Availability. There is such a thing as Discrete Data Streaming and High Availability, but they reduce the finite unavailability to a minimum. In terms of business needs, 5 SIGMA is good enough and acceptable. Even candles set to a predefined time interval, say 2, 4 or 15 seconds, overlap. No FX-savvy trader makes a sell or buy decision on the basis of a 2-second candlestick. The calculation itself in measurements is subject to finite error as defined by the Confidence Level (CL) using the standard deviation function. So far I have never noticed a tool that requires that level of granularity. That stuff from Flink etc. is in practical terms of little value and does not make commercial sense. Now with regard to your needs, Spark micro-batching is perfectly adequate.
Re: Spark support for Complex Event Processing (CEP)
Hi Thanks for the answer. I have developed a log file analyzer for an RTPIS (Real Time Passenger Information System) system, where buses drive lines and the system tries to estimate the arrival times at the bus stops. There are many different log files (and events) and the analysis situation can be very complex. Spatial data can also be included in the log data. The analyzer also has a query (or analysis) language which describes the expected behavior. This can be a requirement of the system, and the analyzer can also be thought of as a test oracle. I have published a paper (at the SPLST'15 conference) about my analyzer and its language. The paper is maybe too technical, but it can be found at: http://ceur-ws.org/Vol-1525/paper-19.pdf I do not know yet where it belongs. I think it can be some kind of "CEP with delays", or do you know a better term? My analyzer can also do somewhat more complex and time-consuming analyses; there is no need for real time. And is it possible to do "CEP with delays" reasonably with some existing analyzer (for example Spark)? Regards PhD student at Tampere University of Technology, Finland, www.tut.fi Esa Heikkinen On 27 Apr 2016 at 15:51, Michael Segel wrote: Spark and CEP? It depends… Ok, I know that's not the answer you want to hear, but it's a bit more complicated… If you consider Spark Streaming, you have some issues. Spark Streaming isn't a real-time solution because it is a micro-batch solution; the smallest window is 500 ms. This means that this could work if your compute time is >> 500 ms and/or your event flow is >> 500 ms (e.g. "real-time" image processing on a system that is capturing 60 FPS, because the processing time is >> 500 ms). If not, Spark Streaming wouldn't be the best solution…. However, you can combine Spark with other technologies like Storm, Akka, etc., where you have continuous streaming. So you could instantiate a Spark context per worker in Storm… I think that if there are no class collisions between Akka and Spark, you could use Akka, which may have a better potential for communication between workers. So here you can handle CEP events. HTH
Re: Spark support for Complex Event Processing (CEP)
Hi I have followed the discussion about CEP and Spark with interest. It is quite close to my research, which is complex analysis of log files and "history" data (not actually of real-time streams). I have a few questions: 1) Is CEP only for (real-time) stream data and not for "history" data? 2) Is it possible to search "backward" (upstream) with CEP within a given time window, if the start time of the time window is earlier than the current stream time? 3) Do you know any good tools or software for "CEP"-style use on log data? 4) Do you know any good (scientific) papers I should read about CEP? Regards PhD student at Tampere University of Technology, Finland, www.tut.fi Esa Heikkinen
How to implement state machine functionality in Apache Spark with Python
Hi I am a newbie with Apache Spark and I would like to know or find good example Python code for how to implement (finite) state machine functionality in Spark. I am trying to read many different log files to find certain events in a specific order. Is this possible, or even impossible? Or is that only a "problem" of Python? Regards Esa
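One pragmatic sketch for this: let Spark read and time-order the events, then run an ordinary Python finite state machine on the driver via toLocalIterator(). The inline events, their names and the two-state machine below are made-up assumptions, and this approach only fits when iterating the (filtered) events on the driver is acceptable.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("log-fsm").getOrCreate()

# Hypothetical events; in practice they would be read from the log files.
events = spark.createDataFrame(
    [("2017-01-01 10:00:00", "A", "start"),
     ("2017-01-01 10:00:05", "B", "noise"),
     ("2017-01-01 10:00:09", "A", "ack"),
     ("2017-01-01 10:00:12", "A", "done")],
    ["ts", "source", "event"],
).withColumn("ts", F.to_timestamp("ts"))

# Bring the events to the driver in time order and run a plain Python FSM.
state = "WAIT_START"
trace = []
for row in events.orderBy("ts").toLocalIterator():
    if state == "WAIT_START" and row.event == "start":
        state = "WAIT_ACK"
        trace.append(row)
    elif state == "WAIT_ACK" and row.event == "ack":
        state = "DONE"
        trace.append(row)

print(state, [r.event for r in trace])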